CN112948825A

CN112948825A - Prediction method and device for viral propagation of network information in social network

Info

Publication number: CN112948825A
Application number: CN202110405406.8A
Authority: CN
Inventors: 高立群; 周斌; 刘宇嘉; 贾焰; 李爱平; 江荣; 涂宏魁; 王晔; 喻承; 汪海洋; 庄洪武; 席闻; 蒋沂桔; 宋鑫
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2021-06-11
Anticipated expiration: 2041-04-15
Also published as: CN112948825B

Abstract

The invention provides a method for predicting viral propagation of information in a social network, which determines whether a topic of interest will develop into viral propagation. The method comprises: collecting the forwarding times and evaluation content of the topic of interest and constructing propagation events; slicing each event in time order and extracting features; splicing the resulting propagation magnitude features and emotion ratio features and inputting them into a first CNN filter model for learning to obtain a first fusion feature; splicing the semantic evolution features of each time slice and inputting them into a Bi-LSTM network model to obtain a second fusion feature; splicing the first and second fusion features and inputting them into a second CNN filter model for learning to obtain a third fusion feature; and constructing a time-series prediction model, inputting the third fusion feature into the trained model, outputting the predicted probability that the topic of interest becomes viral, and determining that the topic can develop into viral propagation when the predicted probability is greater than a prediction threshold.

Description

Prediction method and device for viral propagation of network information in social network

Technical Field

The invention relates to the technical field of data mining and social network public opinion analysis, in particular to a prediction method, a prediction device and a storage medium for information viral propagation in a social network.

Background

With the rapid development of the Internet, events in social networks spread with information as their carrier, and the speed at which an event spreads and grows has become an indicator of public opinion. When information spreads too fast, like a viral infection, it becomes a public opinion event. Any message source in the network can be the origin of such an event and be forwarded in large volume within a short time, like a virus. Information that easily provokes social controversy, such as online emergencies, judicial incidents, economic events and safety incidents, spreads more easily and widely, and can even become a "black swan" event with harmful social impact.

Viral propagation of network information refers to information spreading on the Internet in the manner of a virus, reaching a wide audience in a short time. Such information often has a marked impact on the network and even on society. Predicting the virality of network information makes it possible to detect, as early as possible, information that will become an online public opinion event, so that intervention can take place early and change how the event spreads.

At present, research on the virality of information propagation mainly analyzes features of the propagation network, including its topology, the features of the propagating nodes and the temporal features of propagation. These features are then used by a machine learning model to classify the scale of propagation and so judge whether the information will spread virally.

In practice, however, mainstream social media sites restrict crawling of user relationship data, so global network structure data is hard to obtain and hidden user relationships (such as the user relationship network) are unavailable. As a result, existing algorithms perform poorly in real applications and give inaccurate predictions.

If, without access to the network structure of information propagation, a small number of effective propagation features could be used to predict virality with performance close to that achieved when the user relationship network is known, this would be of great significance for the early detection of public opinion events.

Disclosure of Invention

The invention aims to provide a method for predicting viral propagation of information in a social network that solves the above problems.

The technical scheme is as follows: a method for predicting viral propagation of information in a social network comprises the following steps:

step 1: collecting, according to a topic of interest, the propagation data corresponding to that topic in social network media information, the collected propagation data comprising the forwarding times and the evaluation content of the topic of interest, and constructing propagation events from the propagation data;

step 2: slicing each propagation event in time order to obtain a plurality of time slices, each containing a sub-event of the propagation event, and extracting from each sub-event the time-series viral propagation features and representing them as vectors, the time-series viral propagation features comprising a propagation magnitude feature, an emotion ratio feature and a semantic evolution feature;

step 3: splicing the propagation magnitude feature and the emotion ratio feature to obtain a spliced feature, and inputting the spliced feature into a first CNN filter model for learning to obtain a first fusion feature, the first fusion feature being expressed in vector form;

step 4: splicing the semantic evolution features of the time slices, with the time slices as partitions, to obtain a spliced semantic evolution feature, and inputting it into a Bi-LSTM network model to obtain a second fusion feature, the second fusion feature being expressed in vector form;

step 5: splicing the first fusion feature and the second fusion feature to obtain an information propagation mixed feature, the information propagation mixed feature being expressed in vector form; inputting the information propagation mixed feature into a second CNN filter model for learning to obtain a third fusion feature, the third fusion feature being expressed in vector form;

step 6: constructing a time-series prediction model comprising a fully connected layer and a logistic classification layer arranged in sequence, training the model on a training set until convergence to obtain a trained time-series prediction model, inputting the third fusion feature into the trained model, and outputting the predicted probability that the topic of interest becomes viral;

step 7: setting a prediction threshold, comparing the output predicted probability with the prediction threshold, and determining that the topic of interest can develop into viral propagation when the predicted probability is greater than the threshold.

Further, in step 1, propagation data of the topics of interest is collected from social network media according to topics of interest with different supervision content, and the time $t$ at which each item was propagated and its evaluation content $c$ are obtained. The propagation data $I_p = [(t, c)]$ denotes the pair of time and evaluation content of the $p$-th item of propagation data. A social network media information propagation event $E_i$ is constructed and expressed as $E_i = [I_1, I_2, \ldots, I_p]$, $p \in N$, where $i$, $N$ and $p$ are non-zero natural numbers.

Further, for the propagation data $I_p$, $t_0$ and $c_0$ denote the original propagation time and the original evaluation content, respectively. If a user only forwards the post without adding new evaluation content, then $c = c_0$.
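
The data structure of step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Propagation` and `build_event`, the use of seconds as the time unit, and the choice to store the original post as the first record of the event are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Propagation:
    """One propagation record I_p = (t, c)."""
    t: float  # forwarding time (assumed: seconds since the original post)
    c: str    # evaluation (comment) content


def build_event(original: Propagation,
                forwards: List[Tuple[float, Optional[str]]]) -> List[Propagation]:
    """Build a propagation event E_i = [I_1, ..., I_p], sorted by time.

    A forward with no new comment reuses the original content (c = c_0).
    """
    event = [original]
    for t, c in forwards:
        event.append(Propagation(t, c if c else original.c))
    return sorted(event, key=lambda record: record.t)


event = build_event(Propagation(0.0, "original post"),
                    [(120.0, "I disagree"), (45.0, None)])
```

Only times and texts are stored; no user identifier appears in the record, which matches the document's point that user relationship data is not needed.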

Further, the step 2 specifically comprises the following steps:

step 201: within a set observation duration, slicing each propagation event in time order to obtain a plurality of time slices, each time slice being taken as a sub-event of the propagation event, the propagation event being expressed as:

$E_i = [e_0, e_1, \ldots, e_m]$

where $E_i$ denotes a propagation event, $e_m$ is the sub-event of the propagation event on a time slice, and $m$ denotes the end of the observation time of the propagation event. Each sub-event $e_m$ is expressed as a set of features of time slice $m$, where $j$ denotes the feature type and each feature is a real matrix of dimension $(n, z)$;

step 202: construction of the propagation magnitude feature, which is expressed by extracting the number of propagations within each time slice:

$C^d = [c_1, c_2, \ldots, c_m]$

where $c_m$ is the total number of forwards within time slice $m$ and represents the propagation magnitude feature of time slice $m$;
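
Steps 201 and 202 can be sketched together as a bucketing routine. This is a minimal sketch under stated assumptions: equal-width slices, times given as offsets in seconds from the original post, and the hypothetical function name `slice_counts`. The 2-hour window and 40 slices in the usage line are the values the embodiment gives.

```python
from typing import List


def slice_counts(times: List[float], duration: float, num_slices: int) -> List[int]:
    """Return C^d = [c_1, ..., c_m]: the number of forwards per time slice.

    Records outside the observation window [0, duration) are ignored.
    """
    width = duration / num_slices
    counts = [0] * num_slices
    for t in times:
        if 0 <= t < duration:
            # Clamp to guard against floating-point edge cases near the end.
            index = min(int(t // width), num_slices - 1)
            counts[index] += 1
    return counts


# 2 hours = 7200 s, 40 slices of 180 s each.
counts = slice_counts([10, 170, 185, 400, 7300], duration=7200, num_slices=40)
```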

step 203: construction of the emotion ratio feature: a pre-trained sentiment analysis model predicts the emotion polarity of the evaluation content of the propagation event in each time slice, giving the emotion ratio feature, expressed as:

$C^e = [r_1, r_2, \ldots, r_m]$

where $r_m$ denotes the emotion ratio feature within time slice $m$; $p_s$, with fixed value 1, marks the $s$-th item of information in time slice $t_i$ as carrying positive emotion; $n_s$, with fixed value $-1$, marks it as carrying negative emotion; and count is a counter that counts the number of occurrences;
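
A minimal sketch of the emotion ratio feature follows. It assumes that $r_m$ is the pair (number of positive items, number of negative items), which is how the embodiment's description of the emotion ratio as a two-element group reads. `toy_polarity` is a hypothetical stand-in for the pre-trained sentiment model and is not a real sentiment classifier.

```python
from typing import Callable, List, Tuple


def emotion_ratio(slices: List[List[str]],
                  polarity: Callable[[str], int]) -> List[Tuple[int, int]]:
    """Return C^e = [r_1, ..., r_m] with r_m = (positive count, negative count).

    `polarity` maps a comment to +1 (positive) or -1 (negative).
    """
    ratios = []
    for comments in slices:
        labels = [polarity(comment) for comment in comments]
        ratios.append((labels.count(1), labels.count(-1)))
    return ratios


def toy_polarity(text: str) -> int:
    """Hypothetical placeholder for a pre-trained sentiment model."""
    return -1 if "bad" in text else 1


ratios = emotion_ratio([["good", "bad idea"], [], ["bad", "bad again", "fine"]],
                       toy_polarity)
```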

step 204: construction of the semantic evolution feature: the $top_n$ keywords with the highest word frequency are extracted from the evaluation content of the propagation event in each time slice to represent the key semantic features of that slice, and are embedded with the word2vec semantic embedding model, expressed as:

$C^s = [X^1, X^2, \ldots, X^m]$, $X \in R^{d \times n}$, $0 \le t \le m$

where $C^s$ is the semantic evolution feature and $X^m$ is the embedded representation of the keywords in time slice $m$; in the embedding, $v$ denotes the $n$-th keyword in time slice $m$, $W$ denotes the corpus, and $T$ denotes the feature words in the corpus.
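
The keyword extraction and embedding of step 204 can be sketched as below. The assumptions are loud ones: whitespace tokenization stands in for a proper word segmenter, `toy_vectors` is a hypothetical lookup table standing in for a trained word2vec model, and unknown words are mapped to a zero vector. The zero padding of short slices follows the embodiment's description.

```python
from collections import Counter
from typing import Dict, List


def top_keywords(comments: List[str], top_n: int) -> List[str]:
    """Most frequent whitespace-separated tokens in one time slice."""
    counts = Counter(token for comment in comments for token in comment.split())
    return [word for word, _ in counts.most_common(top_n)]


def embed_slice(comments: List[str], top_n: int,
                vectors: Dict[str, List[float]], dim: int) -> List[List[float]]:
    """Return X^m as top_n rows of length dim, zero-padded when short."""
    rows = [list(vectors.get(word, [0.0] * dim))
            for word in top_keywords(comments, top_n)]
    rows += [[0.0] * dim for _ in range(top_n - len(rows))]
    return rows


toy_vectors = {"flood": [0.9, 0.1], "rescue": [0.2, 0.8]}  # hypothetical embeddings
x_m = embed_slice(["flood rescue", "flood"], top_n=3, vectors=toy_vectors, dim=2)
```

Padding every slice to the same `top_n` rows keeps $X^m$ the same shape across slices, so the slices can later be spliced into one sequence.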

Further, step 3 specifically comprises the following steps:

step 301: splicing the propagation magnitude feature $C^d$ with the emotion ratio feature $C^e$, expressed as:

$h_c = [c_m \| r_m]$

where $\|$ is the splicing operation and $h_c$ is the spliced feature obtained by splicing $C^d$ with $C^e$; the splicing joins the different feature vectors within the same time window, $c_m$ is the propagation magnitude feature of time slice $m$, and $r_m$ is the emotion ratio feature of time slice $m$;

step 302: inputting the spliced feature into the first CNN filter model, which comprises $n$ CNN filters, the output of each filter being expressed as:

$f_i(h_c) = \sigma(W^i h_c + b^i)$

where $f_i(h_c)$ is the output of the $i$-th filter in the first CNN filter model, $W^i$ and $b^i$ are the trainable parameters of the $i$-th filter, $\sigma$ is a nonlinear activation function, $i \in n$, and the $n$ CNN filters have different step sizes;

step 303: obtaining the first fusion feature by summing and averaging the outputs of the CNN filters of the first CNN filter model, expressed as:

$f_1(h_c) = \frac{1}{n} \sum_{i=1}^{n} f_i(h_c)$

where $n$ is the number of filters and $f_1(h_c)$ is the vector representation of the first fusion feature;
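
The splicing and sum-and-average fusion of step 3 can be illustrated with a small sketch. It applies the filter formula literally, treating each filter as an affine map followed by a nonlinearity, and does not model convolution windows or step sizes. The function names, the use of tanh as $\sigma$ and all parameter values are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]


def splice(c_m: float, r_m: Sequence[float]) -> Vector:
    """h_c = [c_m || r_m]: concatenate the two per-slice features."""
    return [float(c_m), *map(float, r_m)]


def filter_output(W: Matrix, b: Vector, h: Vector) -> Vector:
    """f_i(h) = sigma(W^i h + b^i); tanh stands in for sigma (assumption)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + b_k)
            for row, b_k in zip(W, b)]


def fuse(filters: List[Tuple[Matrix, Vector]], h: Vector) -> Vector:
    """Sum the n filter outputs element-wise and divide by n."""
    outputs = [filter_output(W, b, h) for W, b in filters]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


h_c = splice(3, (2, 1))                  # 3 forwards, 2 positive, 1 negative
filters = [([[0.1, 0.0, 0.0]], [0.0]),   # hypothetical trained parameters
           ([[0.0, 0.0, -0.3]], [0.0])]
fused = fuse(filters, h_c)
```

Averaging requires every filter to produce an output of the same length; in a real convolutional model with different step sizes this would need padding or pooling, which the sketch leaves out.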

Further, step 4 specifically comprises the following steps:

step 401: with the time slices as partitions, extracting a fixed number of keywords from each time slice and splicing them, the keyword splicing being expressed as:

$h_s = [X^1 \| X^2 \| \cdots \| X^m]$

where $X^m$ is the representation of the $top_n$ feature words of time slice $m$ and $\|$ is the splicing operation;

step 402: inputting the spliced semantic evolution feature $h_s$ over the different time slices into the Bi-LSTM network model to obtain the second fusion feature $f_2(h_s)$, where $f_2(h_s)$ is the embedded representation of the second fusion feature, slice is the number of time slices, $W$ denotes the learned parameters of the Bi-LSTM network model, the operator $\|$ denotes concatenation, and the forward and backward LSTMs are the LSTM passes in the two directions.
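
For orientation, the textbook form of a bidirectional LSTM readout is given below. This is the standard construction and is not necessarily the exact expression used in the patent; the symbols $x_t$, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ (the input and the forward and backward hidden states at time slice $t$) are introduced here as assumptions.

```latex
\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad
h_t = \left[\overrightarrow{h}_t \,\|\, \overleftarrow{h}_t\right], \quad t = 1, \ldots, \mathit{slice}
```

The forward pass reads the time slices from first to last and the backward pass reads them in reverse, so each $h_t$ carries context from both earlier and later slices.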

Further, step 5 specifically comprises the following steps:

step 501: splicing the first fusion feature and the second fusion feature to obtain the information propagation mixed feature, expressed as:

$h_d = [f_2(h_s) \| f_1(h_c)]$

where $h_d$ is the information propagation mixed feature in vector form and $\|$ is the splicing operation;

step 502: inputting the information propagation mixed feature into the second CNN filter model to obtain the third fusion feature, the second CNN filter model comprising $n$ CNN filters, each expressed as:

$f_i(h_d) = \sigma(W^i h_d + b^i)$

As in step 302, $f_i(h_d)$ is the output of the $i$-th filter in the second CNN filter model, $W^i$ and $b^i$ are the trainable parameters of the $i$-th filter, $\sigma$ is a nonlinear activation function, $i \in n$, and the $n$ CNN filters have different step sizes. In this embodiment the two CNN filters have step sizes of 2 and 3, so as to learn feature dependencies of different lengths in the time series.

The output of the second CNN filter model is expressed as:

$f_3(h_d) = \frac{1}{n} \sum_{i=1}^{n} f_i(h_d)$

where $f_i(h_d)$ is the output of the $i$-th filter in the second CNN filter model, $i \in n$, the $n$ CNN filters have different step sizes, and $f_3(h_d)$ is the third fusion feature, obtained by summing and averaging the outputs of the $n$ CNN filters of the second CNN filter model.

Further, step 6 specifically comprises the following steps:

step 601: constructing a time-series prediction model comprising a fully connected layer and a logistic classification layer arranged in sequence, and outputting the predicted probability that the topic of interest becomes viral, expressed as:

$h(c) = \mathrm{softmax}(W f_3(h_d) + b)$

where $h(c)$, in vector form, is the final propagation representation extracted from the propagation data and gives the predicted probability that the topic of interest becomes viral; $W f_3(h_d) + b$ is the fully connected layer, $W$ and $b$ are trainable parameters of the time-series prediction model, and softmax is the logistic classification algorithm;
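
The prediction head of step 601 can be sketched as a fully connected layer followed by softmax. This is a minimal sketch: the two-class output with index 1 meaning "viral", the function names and the parameter values are assumptions, not values from the source.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W: List[List[float]], b: List[float], f3: List[float]) -> List[float]:
    """h(c) = softmax(W f_3(h_d) + b)."""
    logits = [sum(w * x for w, x in zip(row, f3)) + b_k for row, b_k in zip(W, b)]
    return softmax(logits)


# Hypothetical parameters; class 0 = not viral, class 1 = viral (assumption).
probs = predict([[0.0, 0.0], [1.0, 1.0]], [0.0, 0.0], [0.5, 0.5])
```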

step 602: setting an index $\tau$ of information propagation virality, $\tau$ being the threshold of the virality index: information whose total propagation volume within the observation time $m$ exceeds the threshold is marked as a positive sample (label = 1), and otherwise as a negative sample (label = 0);

constructing a training set by the method of step 1 and adding a label to each propagation event to obtain a labeled training set, expressed as:

$I = [y_i = label]$, $i \in N$, $label \in \{0, 1\}$

where $I$ is the training set, $y_i$ is the true result indicating whether the $i$-th item of information spread virally, and $N$ is the sample set of information propagation; label = 0 means the sample did not spread virally and label = 1 means it did;
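
The labeling rule of step 602 amounts to a threshold comparison. In the sketch below the function name `label_events` and the value of `tau` are hypothetical; the source does not state a concrete threshold.

```python
from typing import List


def label_events(totals: List[int], tau: int) -> List[int]:
    """Label each event 1 if its total forwards exceed tau, else 0."""
    return [1 if total > tau else 0 for total in totals]


# tau = 1000 is an illustrative value only.
labels = label_events([50, 1000, 1001], tau=1000)
```

An event whose total equals the threshold is labeled 0 here, because the text marks only totals that exceed the threshold as positive.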

step 603: training the constructed time-series prediction model on the training set, comparing the final propagation representation with the true value and optimizing a log-likelihood loss function by gradient descent, where loss is the loss function, $h(c_i)$ is the predicted probability output by the time-series prediction model and $y_i$ is the actual result; the model parameters are trained with the back-propagation algorithm, iterating over the training set until the model converges, to obtain the trained time-series prediction model;
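
One common form of a log-likelihood loss for binary labels is the binary cross-entropy, sketched below. This is the standard textbook loss and may differ from the patent's exact expression; averaging over samples and clipping probabilities with `eps` are assumptions.

```python
import math
from typing import List


def log_likelihood_loss(probs: List[float], labels: List[int],
                        eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of binary labels y_i under h(c_i)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


loss = log_likelihood_loss([0.5, 0.5], [1, 0])
```

The loss shrinks as the predicted probability of the true class grows, which is what gradient descent exploits during training.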

step 604: inputting the third fusion feature corresponding to the topic of interest to be predicted into the trained time-series prediction model, and outputting the predicted probability that the topic of interest becomes viral.

A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method for predicting viral propagation of information in a social network as described above.

A computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for predicting viral propagation of information in a social network as described above.

Compared with the prior art, the invention has the following advantages:

1. The method targets the case where the network structure of information propagation cannot be obtained: the underlying relationships of the propagation network are not considered, and only the forwarding times and evaluation content are collected. No relational graph data has to be processed, the propagation data is easy to obtain, and data processing is more efficient. Because no user information needs to be collected, user privacy is effectively protected.

2. The method models and predicts the virality of propagation events in social media with time-series deep learning. Features in different time windows are fused by deep learning, so latent connections among different features across time intervals can be learned, giving stronger feature representation. The method predicts whether information is likely to become viral and yields a more reliable prediction.

3. The invention exploits semantic evolution during information propagation and predicts more reliably for social events whose semantics are more concentrated. Separate models can therefore be trained for different kinds of social issues, so that issues within one semantic category are handled better. The method can be used to monitor events with concentrated semantics, for example predicting and supervising economic, judicial and livelihood issues, and can also be used by enterprises to monitor network information and predict whether information of concern will spread widely.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a method for predicting viral propagation of information in a social network according to the present invention;

FIG. 2 is a flowchart illustrating steps 1-6 of a method for predicting viral propagation of information in a social network according to the present invention;

FIG. 3 is a diagram illustrating an internal structure of a computing device according to an embodiment.

Detailed Description

The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.

Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions and, although a logical order is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than here.

Referring to fig. 1, the method for predicting viral propagation of information in a social network of the present invention at least includes the following steps:

Referring to fig. 2, in particular, one embodiment of the present invention comprises the following steps:

step 1: propagation data of the topics of interest is collected from social network media according to topics of interest with different supervision content, and the time $t$ at which each item was propagated and its evaluation content $c$ are obtained. The propagation data $I_p = [(t, c)]$ denotes the pair of time and evaluation content of the $p$-th item of propagation data. For the propagation data $I_p$, $t_0$ and $c_0$ denote the original propagation time and the original evaluation content, respectively; if a user only forwards the post without adding new evaluation content, then $c = c_0$. After large-scale data collection, each information propagation sequence is treated as one propagation event, and a social network media information propagation event $E_i$ is constructed and expressed as $E_i = [I_1, I_2, \ldots, I_p]$, $p \in N$, where $i$, $N$ and $p$ are non-zero natural numbers.

Step 2, specifically comprising the following steps:

step 201: for each propagation event an observation duration is set, defined as 2 hours in this embodiment. Within the observation duration each propagation event is sliced in time order; the number of time slices is defined as 40 in this embodiment, giving 40 time slices, each taken as a sub-event of the propagation event. The propagation event is expressed as:

$E_i = [e_0, e_1, \ldots, e_m]$

where $E_i$ denotes a propagation event, $e_m$ is the sub-event of the propagation event on a time slice, and $m$ denotes the end of the observation time of the propagation event. Each sub-event $e_m$ of $E_i$ is expressed as a set of features of time slice $m$, where $j$ denotes the feature type, covering the vectorized representations of the propagation magnitude feature, the semantic evolution feature and the emotion ratio feature, whose construction is defined in the following steps, and each feature is a real matrix of dimension $(n, z)$;

step 202: construction of the propagation magnitude feature, which is expressed by extracting the number of propagations within each time slice:

$C^d = [c_1, c_2, \ldots, c_m]$

where $c_m$ is the total number of forwards within time slice $m$ and represents the propagation magnitude feature of time slice $m$;

step 203: construction of the emotion ratio feature: a pre-trained sentiment analysis model is used. In this embodiment, one of the better-performing current pre-trained models, such as the Senta model proposed by Baidu, a BERT pre-trained model or a GPT-2 sentiment analysis model, predicts the emotion polarity of the evaluation content of the propagation event in each time slice, giving the emotion ratio feature, expressed as:

$C^e = [r_1, r_2, \ldots, r_m]$

where $r_m$ denotes the emotion ratio feature within time slice $m$; negative emotion is marked by $n_s$, with fixed value $-1$, and count is a counter that counts the number of occurrences. The emotion ratio is a pair and represents the degree of emotional opposition within the time window;

step 204: the semantic evolution feature is constructed as follows:

in viral propagation, a change in topic semantics can change how the information propagates. The $top_n$ keywords with the highest word frequency are extracted from the evaluation content of the propagation event in each time slice to represent the key semantic features of that slice, and are embedded with the word2vec semantic embedding model, expressed as:

$C^s = [X^1, X^2, \ldots, X^m]$, $X \in R^{d \times n}$, $0 \le t \le m$

where $v$ denotes the $n$-th keyword in time slice $m$, $W$ denotes the corpus, which may be a large-scale public Chinese corpus such as the Chinese Wikipedia corpus, and $T$ denotes the feature words in the corpus. In addition, if a time slice contains fewer than $top_n$ keywords, the missing part is zero-padded so that the feature vectors have a consistent length.

Step 3 specifically comprises the following steps:

$h_c = [c_m \| r_m]$

step 302: inputting the spliced feature into the first CNN filter model, which comprises $n$ CNN filters; in this embodiment the number of CNN filters is set to 2, and the output of each filter is expressed as:

$f_i(h_c) = \sigma(W^i h_c + b^i)$

where $f_i(h_c)$ is the output of the $i$-th filter in the first CNN filter model, $W^i$ and $b^i$ are the trainable parameters of the $i$-th filter, $\sigma$ is a nonlinear activation function, and $i \in n$. In this embodiment the two CNN filters have step sizes of 2 and 3, so as to learn feature dependencies of different lengths in the time series. In other embodiments of the invention, the first CNN filter model may comprise more CNN filters, and other step sizes may be chosen. A CNN filter here means a trained convolution kernel of the CNN, which plays a key role in the operation of the network.

step 303: the different filters of step 302 sample the dependencies between subsequences, and the first fusion feature is then obtained by summing and averaging the outputs of the CNN filters of the first CNN filter model, expressed as:

$f_1(h_c) = \frac{1}{n} \sum_{i=1}^{n} f_i(h_c)$

where $n$ is the number of filters and $f_1(h_c)$ is the vector representation of the first fusion feature.

Step 4 extracts the features of semantic evolution during information propagation, and specifically comprises the following steps:

step 401: with the time slices as partitions, extracting a fixed number of keywords from each time slice and splicing them, so that the semantic information of all time slices is spliced together.

For step 402, $f_2(h_s)$ is the embedded representation of the second fusion feature, slice is the number of time slices, $W$ denotes the learned parameters of the Bi-LSTM network model, the operator $\|$ denotes concatenation, and the forward and backward LSTMs are the LSTM passes in the two directions. Step 402 uses a bidirectional long short-term memory network (Bi-LSTM) to learn the latent relationships of semantic evolution; the Bi-LSTM network model aggregates ordered semantic information and captures the implicit associations of the semantic evolution process across time periods.

Step 5 fuses the features obtained in step 3 and step 4 into the information propagation mixed feature and extracts the latent relationships among the three features, and specifically comprises the following steps:

$h_d = [f_2(h_s) \| f_1(h_c)]$

$f_i(h_d) = \sigma(W^i h_d + b^i)$

As in step 302, $f_i(h_d)$ is the output of the $i$-th filter in the second CNN filter model, $W^i$ and $b^i$ are the trainable parameters of the $i$-th filter, $\sigma$ is a nonlinear activation function, $i \in n$, and the $n$ CNN filters have different step sizes. In this embodiment the two CNN filters have step sizes of 2 and 3, so as to learn feature dependencies of different lengths in the time series. Other sizes may also be chosen. A CNN filter here means a trained convolution kernel of the CNN, which plays a key role in the operation of the network.

Likewise, the output of the second CNN filter model is the third fusion feature, obtained by summing and averaging the filter outputs:

$f_3(h_d) = \frac{1}{n} \sum_{i=1}^{n} f_i(h_d)$

Step 6 specifically comprises the following steps:

step 601: constructing a time-series prediction model comprising a fully connected layer and a logistic classification layer arranged in sequence, applying the fully connected layer and the logistic classification layer (softmax) to the result of step 5, and outputting the predicted probability that the topic of interest becomes viral, expressed as:

$h(c) = \mathrm{softmax}(W f_3(h_d) + b)$

where $h(c)$, in vector form, is the final propagation representation extracted from the propagation data and gives the predicted probability that the topic of interest becomes viral; $W f_3(h_d) + b$ is the fully connected layer, $W$ and $b$ are trainable parameters of the time-series prediction model, and softmax is the logistic classification algorithm;

$I = [y_i = label]$, $i \in N$, $label \in \{0, 1\}$

step 604: inputting the third fusion feature corresponding to the topic of interest to be predicted into the trained time-series prediction model, and outputting the predicted probability that the topic of interest becomes viral; the resulting outputs 1 and 0 indicate, respectively, that the topic is predicted to develop into viral propagation and that it is predicted not to.

In this embodiment, the prediction threshold is set to 0.5: if the output predicted probability is greater than 0.5, the sample is judged positive, meaning the information may spread virally; if it is less than 0.5, the information is not expected to develop into viral propagation.
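
The final decision can be sketched as a single comparison. The function name `is_viral` is hypothetical, and treating a probability of exactly 0.5 as "not viral" is an assumption based on the requirement that the probability be greater than the threshold.

```python
def is_viral(probability: float, threshold: float = 0.5) -> bool:
    """Return True when the predicted probability exceeds the threshold."""
    return probability > threshold


decisions = [is_viral(p) for p in (0.91, 0.50, 0.12)]
```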

Traditional information propagation analysis is built on analysis of user relationships in the network: underlying relationships among users such as following, forwarding, commenting, mentioning and liking must be obtained, and the future propagation magnitude of an item of information is predicted with a time-series model. By contrast, this embodiment is built on information propagation over an unknown network structure, that is, the underlying relationships of the propagation network need not be considered. It introduces three time-series features of the propagation process: the propagation magnitude feature, the positive/negative emotion ratio feature and the semantic evolution feature, whose dynamic changes are the key factors influencing viral propagation. During data sampling the embodiment ignores user information and builds the time-series features from macroscopic data, with no need for the network information of underlying users. This structure has two advantages:

(1) more efficient data processing capabilities. Because the construction of the network relationship generally needs to construct the complete relationship between graph data for a massive user population, for example, the relationship between users that concern each other, and the data of the graph structure is taken as non-euclidean spatial data, there is a problem of low efficiency in the operational level, it generally needs to use the process of graph node embedding to perform feature representation on each node in the graph, however, the task represented by the node needs to perform a large amount of neighbor aggregation work, but the embodiment only considers the structural (euclidean spatial) feature in the propagation process, and the operational efficiency is greatly improved, and experiments show that compared with the latest method of a known network, the method has a better classification performance index and is improved by 3.3%; compared with the latest Cas2vec method, our method has a performance improvement of 48.7% in training time efficiency.

(2) Effective protection of user privacy. The unknown-network approach does not need to collect users' private information or otherwise concern itself with user information. None of the three features required by the embodiment depends on user information, so the data is de-privatized: privacy is protected and the data is easier to obtain.

In this embodiment, non-network-relationship data on information propagation in social media is collected and an unknown-network-structure method is used, that is, the relationships between the underlying nodes of the propagation network need not be considered, which means that the propagation data is easy to obtain and free of private information. An innovative deep learning method is then used to learn the latent relationships among the fused key features of information propagation under an unknown network structure. Specifically, Bi-LSTM models the semantic evolution process of information propagation, two CNNs with multiple convolution kernels fuse the different time series features, and finally a fully connected layer and softmax predict whether the information will become viral. Experiments on real data show that, compared with the latest Cas2vec, the method fuses more key features, significantly improves classification performance, and obtains more reliable prediction results.

The embodiment models and predicts the virality of propagation events in social media through time series deep learning. The method constructs a time-series-based propagation magnitude feature, emotion ratio feature, and semantic evolution feature, and establishes a vector representation of these features in each time interval. The propagation magnitude feature and the emotion ratio feature are then fused through a CNN with multiple convolution kernels; the advantage of the multi-kernel approach is that latent relationships between different features across different time intervals can be learned. Next, the strength of bidirectional LSTM (Bi-LSTM) in representing sequence relationships is used to learn the latent semantic changes in the propagation process, and these are fused with the preceding features by a multi-kernel CNN (convolutional neural network) model. Finally, a fully connected layer and softmax predict whether the information will become viral. Ablation tests on real data show a performance difference of +3.2% to +7.7% between the model that fuses all three features and the virality prediction model without the emotion ratio feature and the semantic evolution feature, which demonstrates that the embodiment obtains stronger feature representation capability after feature fusion.
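As an illustration of the per-interval features described above, the following is a minimal sketch of how a propagation magnitude sequence and an emotion ratio sequence might be computed. It rests on several assumptions that are not stated in this passage: time slices of equal width, magnitude taken as the number of propagation items in a slice, a positive label of 1 (the label -1 for negative emotion follows the claims), and a ratio defined as positives divided by all labeled items. The function names are hypothetical.

```python
from typing import List, Tuple

POSITIVE, NEGATIVE = 1, -1  # NEGATIVE = -1 follows the claims; POSITIVE = 1 is assumed


def slice_index(t: float, t0: float, width: float, m: int) -> int:
    """Map a timestamp to a slice index in [0, m - 1] (equal-width slices assumed)."""
    return max(0, min(int((t - t0) // width), m - 1))


def magnitude_and_emotion(items: List[Tuple[float, int]],
                          t0: float, width: float, m: int):
    """Return (c_d, c_e): per-slice item counts and per-slice emotion ratios.

    items holds (timestamp, sentiment_label) pairs with labels in {1, -1}.
    The ratio positives / (positives + negatives) is an assumed stand-in,
    not the formula of the embodiment.
    """
    pos = [0] * m
    neg = [0] * m
    for t, label in items:
        k = slice_index(t, t0, width, m)
        if label == POSITIVE:
            pos[k] += 1
        elif label == NEGATIVE:
            neg[k] += 1
    c_d = [p + n for p, n in zip(pos, neg)]
    c_e = [p / (p + n) if (p + n) else 0.0 for p, n in zip(pos, neg)]
    return c_d, c_e
```

For example, five labeled items spread over three slices of width 10 yield one count and one ratio per slice, which can then be spliced slice by slice before fusion.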

The embodiment exploits semantic evolution in information propagation and predicts more reliably for social events whose semantics are more concentrated. Separate models can therefore be trained for different social issues, so that issues within a given semantic category, such as economic, judicial, or civil issues, are handled better.

The method provided by the embodiment can be applied to online public opinion event analysis and data mining, in particular to the prediction and supervision of events with concentrated semantics, such as economic, judicial, and civil issues. It can also be used by enterprises for network information supervision, to predict whether information of concern to the enterprise will spread widely.

In an embodiment of the present invention, there is also provided a computer apparatus, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the method for predicting viral propagation of information in a social network as described above.

The computer apparatus may be a terminal, whose internal structure may be as shown in Fig. 3. The computer apparatus comprises a processor, a memory, a network interface, a display screen, and an input device connected through a bus. The processor provides computing and control capabilities. The memory comprises a non-volatile storage medium and an internal memory: the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment in which the operating system and the computer program in the non-volatile storage medium run. The network interface is used to connect to and communicate with an external terminal through a network. The computer program, when executed by the processor, implements the method for predicting viral propagation of information in a social network. The display screen may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, trackball, or touchpad provided on the housing of the computer apparatus, or an external keyboard, touchpad, mouse, or the like.

The memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory is used for storing programs, and the processor executes a program after receiving an execution instruction.

The processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, or it may be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor.

Those skilled in the art will appreciate that the configuration shown in Fig. 3 is a block diagram of only the part of the configuration relevant to the present application and does not limit the computing device to which the present application may be applied; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In an embodiment of the present invention, there is also provided a computer-readable storage medium, on which a program is stored, which when executed by a processor, implements the method for predicting viral propagation of information in a social network as described above.

As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, computer apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, computer apparatus, or computer program products according to embodiments of the invention. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart.

The method for predicting viral propagation of information in a social network, the computer apparatus, and the computer-readable storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method and its core ideas. A person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for predicting viral propagation of information in a social network is characterized by comprising the following steps:

2. The method of claim 1, wherein in step 1, according to the subjects of interest of different supervision content, propagation data of the subject of interest on social network media is collected, and the time t at which the propagation data was propagated and the evaluation content c are obtained; I_p＝[(t,c)] denotes the two-tuple of evaluation content and time of the p-th propagation data, and the social network media information propagation event E_i is constructed and expressed as E_i＝[I₁,I₂,…,I_p], p∈N, wherein i, N, and p are non-zero natural numbers.

3. The method of claim 2, wherein, for the propagation data I_p, t₀ and c₀ respectively denote the original propagation time and the original evaluation content, and if a user only forwards the post without adding new evaluation content, then c＝c₀.
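The event structure of claims 2 and 3 can be sketched as follows. This is an illustrative data structure only: the class and function names are hypothetical, timestamps are assumed to be numeric, and including the original post as the first element of the event is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PropagationItem:
    """One propagation record I_p = (t, c): forwarding time and evaluation content."""
    t: float  # time at which the item was propagated (e.g. a Unix timestamp)
    c: str    # evaluation content attached to the forward


def build_event(original_time: float,
                original_content: str,
                forwards: List[Tuple[float, Optional[str]]]) -> List[PropagationItem]:
    """Build E_i = [I_1, ..., I_p] from raw (time, comment-or-None) pairs.

    A forward without a new comment inherits the original content c_0,
    which is the rule c = c_0 of claim 3.
    """
    event = [PropagationItem(original_time, original_content)]
    for t, comment in forwards:
        content = comment if comment else original_content
        event.append(PropagationItem(t, content))
    # Keep the event in chronological order for later time slicing.
    return sorted(event, key=lambda item: item.t)
```

Sorting by time at construction keeps the later slicing step a simple pass over the list.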

4. The method of claim 1, wherein step 2 specifically comprises:

E_i＝[e₀,e₁...e_m]

wherein

is the sub-event e_m of time slice m, j denotes the feature type,

denotes a real matrix of dimension (n, z);

C^d＝[c₁,c₂,...c_m]

C^e＝[r₁,r₂...,r_m]

wherein r_m denotes the emotion ratio feature within time slice m,

denotes negative emotion, n_s takes the fixed value -1, and count is a counter for the number of occurrences;

Step 204: constructing the semantic evolution feature: extracting the top_n keywords with the highest word frequency from the evaluation content of the propagation event in each time slice to represent the key semantic features of that time slice, and performing word embedding through the semantic embedding model word2vec, expressed as:

C^s＝[X¹,X²,...,X^m], X∈R^(d×n), 0≤t≤m
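The keyword extraction of step 204 can be sketched as below. The sketch covers only the word-frequency step; the word2vec embedding that turns each keyword list into a matrix X is omitted. Whitespace tokenization and lowercasing are assumptions (Chinese text would first need a word segmenter), and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def top_n_keywords(comments: List[str], n: int) -> List[str]:
    """Return the n most frequent tokens across the comments of one time slice."""
    counts = Counter()
    for text in comments:
        counts.update(text.lower().split())  # whitespace tokenization is an assumption
    return [word for word, _ in counts.most_common(n)]


def semantic_keywords(slices: List[List[str]], n: int) -> List[List[str]]:
    """Build the per-slice keyword lists that would be embedded to form C^s."""
    return [top_n_keywords(comments, n) for comments in slices]
```

Each inner list corresponds to one time slice; embedding its n keywords into d-dimensional vectors would give one d×n matrix per slice.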

5. The method of claim 4, wherein step 3 specifically comprises:

h_c＝[c_m||r_m]

f_i(h_c)＝σ(Wⁱh_c+bⁱ)

wherein f_i(h_c) denotes the output of the i-th filter in the first CNN filter model, Wⁱ and bⁱ are the trainable parameters of the i-th filter, σ is a nonlinear activation function, i∈n, and the n CNN filters have different step lengths;
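One possible reading of the first CNN filter model is sketched below: the per-slice magnitude and emotion ratio are spliced into h_c, and each filter slides over the time axis with its own stride. Reading "step length" as convolution stride, the kernel shape, the logistic activation chosen for σ, and all function names are assumptions of this sketch.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def splice(c_d: List[float], c_e: List[float]) -> List[List[float]]:
    """h_c: one [c_m, r_m] row per time slice (the || splicing in the claim)."""
    return [[c, r] for c, r in zip(c_d, c_e)]


def conv1d(h_c: List[List[float]], kernel: List[List[float]],
           bias: float, stride: int) -> List[float]:
    """Slide a (k x 2) kernel over the time axis with the given stride."""
    k = len(kernel)
    out = []
    for start in range(0, len(h_c) - k + 1, stride):
        window = h_c[start:start + k]
        z = sum(w * x
                for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row))
        out.append(sigmoid(z + bias))  # sigma(W h + b) on this window
    return out
```

With five time slices and a kernel spanning two slices, a stride of 1 yields four outputs and a stride of 2 yields two, so filters with different strides view the sequence at different temporal resolutions.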

6. The method of claim 5, wherein step 4 specifically comprises:

wherein X^m denotes the representation of the top_n feature words in time slice m, and || denotes the splicing operation;

and

are, respectively, the LSTM methods in the two different directions.
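The bidirectional processing of claim 6 can be illustrated with a minimal pure-Python sketch built on the standard LSTM cell. The formulas of the claim are not reproduced in this text, so the gate layout, the zero initial state, flattening each X into a single input vector, and concatenating the two directions per time slice are all assumptions; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x: Vector, h: Vector, c: Vector,
              W: List[Vector], b: Vector) -> Tuple[Vector, Vector]:
    """One standard LSTM step. W has 4*H rows applied to the input [x || h]."""
    H = len(h)
    xh = x + h
    z = [sum(w * v for w, v in zip(row, xh)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(v) for v in z[0:H]]             # input gate
    f = [sigmoid(v) for v in z[H:2 * H]]         # forget gate
    g = [math.tanh(v) for v in z[2 * H:3 * H]]   # candidate cell state
    o = [sigmoid(v) for v in z[3 * H:4 * H]]     # output gate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_lstm(xs: List[Vector], W: List[Vector], b: Vector, H: int) -> List[Vector]:
    """Run the cell over a sequence from a zero state; return all hidden states."""
    h, c = [0.0] * H, [0.0] * H
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return outputs


def bilstm(xs: List[Vector], fwd, bwd, H: int) -> List[Vector]:
    """Concatenate forward and backward hidden states for every time slice."""
    forward = run_lstm(xs, fwd[0], fwd[1], H)
    backward = run_lstm(xs[::-1], bwd[0], bwd[1], H)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Here `fwd` and `bwd` are (W, b) pairs for the two directions. The backward pass reads the slices in reverse order and is re-reversed so that position t of the output carries context from both earlier and later slices.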

7. The method of claim 6, wherein step 5 specifically comprises:

h_d＝[f₂(h_s)||f₁(h_c)]

wherein h_d is the mixed information propagation feature expressed in vector form, and || denotes the splicing operation;

f_i(h_d)＝σ(Wⁱh_d+bⁱ)

f_i(h_d) denotes the output of the i-th filter in the second CNN filter model, Wⁱ and bⁱ are the trainable parameters of the i-th filter, σ is a nonlinear activation function, i∈n, and the n CNN filters have different step lengths;

the output of the second CNN filter model is represented as:

wherein f_i(h_d) denotes the output of the i-th filter in the second CNN filter model, i∈n, the n CNN filters have different step lengths, and f₃(h_d) denotes the third fused feature, obtained by summing and averaging the outputs of the n CNN filters of the second CNN filter model.
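The splicing and averaging described in claim 7 can be sketched as follows, treating each filter as the affine map followed by σ that is written in the claim. Taking σ to be the logistic function, giving every filter the same output size so that the outputs can be averaged element by element, and the function names are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def filter_output(W: Matrix, b: Vector, h: Vector) -> Vector:
    """f_i(h) = sigma(W^i h + b^i), with sigma assumed to be the logistic function."""
    return [sigmoid(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)]


def fuse(f2_hs: Vector, f1_hc: Vector,
         filters: List[Tuple[Matrix, Vector]]) -> Vector:
    """Splice f_2(h_s) and f_1(h_c) into h_d, then average the n filter outputs."""
    h_d = f2_hs + f1_hc                      # h_d = [f_2(h_s) || f_1(h_c)]
    outputs = [filter_output(W, b, h_d) for W, b in filters]
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]  # element-wise mean = f_3(h_d)
```

The returned vector plays the role of the third fused feature f₃(h_d) that is passed on to the prediction step.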

8. The method of claim 7, wherein step 6 specifically comprises:

h(c)＝softmax(Wf₃(h_d)+b)

h(c) is the final propagation representation feature, extracted from the propagation data in vector form, and represents the prediction probability that the subject of interest becomes viral; Wf₃(h_d) denotes the fully connected layer, W and b are trainable parameters of the time series prediction model, and softmax denotes the logistic classification algorithm;

I＝[y_i＝label], i∈N, label∈[0,1]
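The prediction step h(c)＝softmax(Wf₃(h_d)+b), followed by the comparison against a prediction threshold, can be sketched as below. Using two output classes with index 1 meaning "viral", and the default threshold of 0.5, are assumptions of this sketch; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_viral(f3: List[float], W: List[List[float]], b: List[float],
                  threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (probability, decision) from h(c) = softmax(W f_3 + b).

    Class index 1 is taken to mean "viral"; the class order and the default
    threshold are assumptions, not values from the claims.
    """
    logits = [sum(w * x for w, x in zip(row, f3)) + bias
              for row, bias in zip(W, b)]
    p_viral = softmax(logits)[1]
    return p_viral, p_viral > threshold
```

The subject of interest is judged likely to develop into viral propagation only when the returned probability is strictly greater than the threshold.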

9. A computer apparatus comprising a memory and a processor, the memory storing a computer program, characterized in that: the processor, when executing the computer program, implements the method for predicting viral propagation of information in a social network as set forth in claim 1.

10. A computer-readable storage medium on which a program is stored, characterized in that: the program, when executed by a processor, implements a method for predicting viral propagation of information in a social network as set forth in claim 1.