CN113779382A

CN113779382A - Network public opinion prediction method based on microblog data

Info

Publication number: CN113779382A
Application number: CN202110954872.1A
Authority: CN
Inventors: 刘定一; 应毅; 李晓明; 顾问
Original assignee: Sanjiang University
Current assignee: Sanjiang University
Priority date: 2021-08-19
Filing date: 2021-08-19
Publication date: 2021-12-10

Abstract

The invention relates to the field of data analysis and prediction, and in particular to a network public opinion prediction method based on microblog data, comprising: constructing a prediction model, wherein the prediction model is a long short-term memory (LSTM) neural network model comprising two hidden layers, the first hidden layer being a unidirectional LSTM unit and the second hidden layer being a bidirectional LSTM unit; training the prediction model, wherein, for each moment in the time series, the current input and the output of the previous moment are fed into the prediction model to obtain the current output as the predicted value, an error is calculated from the predicted value and the actual value, and the model parameters are updated by backpropagation through an optimizer until convergence; and inputting the Baidu Index, the total microblog popularity score, the time offset, and the output of the first hidden layer at the previous moment into the trained prediction model as model inputs to perform network public opinion prediction. The invention achieves high prediction accuracy.

Description

A network public opinion prediction method based on microblog data

Technical Field

The invention relates to the field of data analysis and prediction, and in particular to a network public opinion prediction method based on microblog data.

Background

The Internet has become an important platform for the public to obtain information and express opinions. It both reflects and guides public sentiment, but it also raises the risk that public opinion incidents will occur. An effective public opinion prediction method is therefore of practical value for estimating how network public opinion will develop, defusing potential public opinion crises, and maintaining a healthy online environment. Predicting the trend of network public opinion in advance makes it possible to judge the development of hot events accurately and gives relevant government departments a reference for responding to public opinion crises.

Because it is affected by many external factors, the development of network public opinion is markedly fuzzy and uncertain. Artificial neural networks have strong nonlinear fitting ability and are well suited to analyzing complex, nonlinear time series data. With the rapid development of the Internet, the rise of new forms such as self-media and mobile social platforms has greatly changed the way and the scale in which people produce and obtain information, and real-time Internet data (Weibo, Tieba, Micro Index) has become a useful supplement for improving prediction accuracy.

Summary of the Invention

The purpose of the present invention is to provide a network public opinion prediction method based on microblog data that has high prediction accuracy.

To solve the above technical problem, the technical solution of the present invention is a network public opinion prediction method based on microblog data, comprising:

Step 1: Construct the prediction model. Define the network structure of the prediction model. The prediction model is a long short-term memory (LSTM) neural network model comprising two hidden layers: the first hidden layer is a unidirectional LSTM unit and the second hidden layer is a bidirectional LSTM unit. The input of the prediction model is the input of the first hidden layer, and the output of the prediction model is the output of the second hidden layer. The inputs of the first hidden layer are the Baidu Index, the total microblog popularity score, the time offset, and the output of the first hidden layer at the previous moment. The inputs of the second hidden layer are the output of the preceding hidden layer at the same moment and the output of the same hidden layer at the previous moment.

Step 2: Train the prediction model.

Step 2.1: For one moment in the time series, pass the current input and the output of the previous moment into the prediction model to obtain the current output as the predicted value.

Step 2.2: Calculate the error between the predicted value and the true value, solve by backpropagation through the optimizer, and update the model parameters.

Step 2.3: Repeat Steps 2.1 and 2.2 until convergence.

Step 3: Calculate the total microblog popularity score from the microblog data, and input the Baidu Index, the total microblog popularity score, the time offset, and the output of the first hidden layer at the previous moment as model inputs into the trained prediction model to perform network public opinion prediction.

According to the above solution, the total microblog popularity score is calculated as follows:

The microblog data of the network public opinion event is analyzed. Microblog hot topics are collected according to the keywords of the event, hot-topic analysis is performed on the microblogs matching p keywords, and a microblog popularity score is calculated for each keyword as the weighted sum of its numbers of reposts, comments, and likes:

$$\mathrm{HotScore}_i = \alpha \cdot \mathrm{reposts}_i + \beta \cdot \mathrm{comments}_i + \gamma \cdot \mathrm{likes}_i$$

where $\mathrm{HotScore}_i$ is the microblog popularity score of the i-th keyword, and $\alpha$, $\beta$, and $\gamma$ are the weights of the numbers of reposts, comments, and likes of the i-th keyword, respectively.

The popularity scores of the p keywords are sorted, and the top q are summed to give the total microblog popularity score HotScore,

where q < p.
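The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the example counts, and the weight values 0.5, 0.3, and 0.2 are all hypothetical (the text does not give values for α, β, γ), and "top q" is assumed to mean the q highest scores.

```python
from typing import Dict, Tuple


def hot_score(reposts: int, comments: int, likes: int,
              alpha: float, beta: float, gamma: float) -> float:
    """Weighted sum of engagement counts for one keyword (HotScore_i)."""
    return alpha * reposts + beta * comments + gamma * likes


def total_hot_score(counts: Dict[str, Tuple[int, int, int]], q: int,
                    alpha: float, beta: float, gamma: float) -> float:
    """Sum the q largest per-keyword scores; requires q < p = len(counts)."""
    if not 0 < q < len(counts):
        raise ValueError("q must satisfy 0 < q < p")
    scores = sorted(
        (hot_score(rp, cm, lk, alpha, beta, gamma) for rp, cm, lk in counts.values()),
        reverse=True,  # assumption: "top q" means the highest scores
    )
    return sum(scores[:q])


# Hypothetical (reposts, comments, likes) counts for p = 3 keywords.
example = {"kw_a": (10, 5, 20), "kw_b": (1, 1, 1), "kw_c": (4, 2, 6)}
total = total_hot_score(example, q=2, alpha=0.5, beta=0.3, gamma=0.2)
```

Here the per-keyword scores are 10.5, 1.0, and 3.8, so with q = 2 the total is 10.5 + 3.8 = 14.3.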

According to the above solution, the first hidden layer is calculated as follows:

[Formula not reproduced in the source text.]

where the left-hand side of the formula denotes the output of the first hidden layer at time t; $W_1$ denotes the weight vector of the first hidden layer; $\mathrm{BaiduIndex}_t$ denotes the Baidu Index at time t, obtained from the Baidu website; $\mathrm{HotScore}_t$ denotes the total microblog popularity score at time t; $\Delta T$ denotes the time offset, that is, the time interval between the predicted day and the first day of the public opinion event; and $\sigma$ denotes the activation function, which is the Sigmoid function.
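The formula itself is not present in this text. One form that is consistent with the symbol definitions and with the listed inputs of the first hidden layer is given below. It is an assumption offered for orientation, not the patent's own equation; the symbol $h^{(1)}_t$ for the first-hidden-layer output is a notation introduced here.

```latex
h^{(1)}_t = \sigma\!\left( W_1 \cdot \left[\, \mathrm{BaiduIndex}_t,\ \mathrm{HotScore}_t,\ \Delta T,\ h^{(1)}_{t-1} \,\right] \right)
```

Under this reading, the four quantities are concatenated into one input vector, weighted by $W_1$, and passed through the Sigmoid function.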

According to the above solution, the second hidden layer is calculated as follows:

[Formula not reproduced in the source text.]

where the left-hand side of the formula denotes the output of the second hidden layer at time t, $W_2$ denotes the weight matrix of the second hidden layer, and the two remaining symbols denote, respectively, the output of the second hidden layer at time t-1 and the input vector passed from the first hidden layer to the second hidden layer at time t.

According to the above solution, during training the error metric of the prediction model is a loss function.

The loss function is the sum of two terms: the sum of squared prediction errors and the sum of squares of the model weight parameters. The specific formula is as follows:

[Formula not reproduced in the source text.]

where n is the number of samples, $h(x_i)$ is the predicted output of the model for input sample $x_i$, $y_i$ is the true value of sample $x_i$, m is the number of model weights, the squared term denotes the square of the j-th weight, and $\alpha$ denotes the learning rate, with $\alpha = 0.1$.
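Since the formula is not present in this text, the sketch below shows one form that matches the verbal description: a sum of squared errors plus a sum of squared weights. It assumes that the coefficient α (which the text calls the learning rate, 0.1) multiplies the weight term and that neither sum is normalized; both points are assumptions, and the function name is hypothetical.

```python
from typing import Sequence


def regularized_loss(predictions: Sequence[float], targets: Sequence[float],
                     weights: Sequence[float], alpha: float = 0.1) -> float:
    """Sum of squared errors plus alpha times the sum of squared weights."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_error = sum((h - y) ** 2 for h, y in zip(predictions, targets))
    weight_penalty = sum(w ** 2 for w in weights)
    # Assumption: alpha scales the weight penalty; no 1/n or 1/2 factors.
    return squared_error + alpha * weight_penalty


# Two samples and two weights, chosen only for illustration.
loss = regularized_loss([1.0, 2.0], [1.5, 1.0], [0.5, -1.0])
```

In this example the squared errors sum to 1.25 and the squared weights sum to 1.25, so the loss is 1.25 + 0.1 × 1.25 = 1.375. The weight term penalizes large parameters, which fits the stated goal of limiting overfitting on small training sets.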

According to the above solution, in Step 1, when the network structure of the prediction model is defined, the dropout rate of the nodes in each network layer is set to 0.2, and the optimizer is set to adaptive moment estimation (Adam).
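For readers unfamiliar with Adam, the following sketch applies the standard Adam update rule to a single scalar parameter. The objective $(\theta - 3)^2$, the step size 0.1, and the step count are toy choices made here for illustration and do not come from the patent; the decay rates 0.9 and 0.999 are the commonly used defaults.

```python
import math
from typing import Callable


def adam_minimize(grad: Callable[[float], float], theta: float,
                  lr: float = 0.1, beta1: float = 0.9, beta2: float = 0.999,
                  eps: float = 1e-8, steps: int = 500) -> float:
    """Run a fixed number of Adam updates on one scalar parameter."""
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

In a real training run the same update would be applied to every model weight, with the gradient supplied by backpropagation of the loss.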

The present invention has the following beneficial effects:

Because public opinion data is scarce, the prediction model is designed with two hidden layers, a unidirectional LSTM unit and a bidirectional LSTM unit, which retains the characteristics of the LSTM neural network while reducing the risk of overfitting caused by a small number of training samples. In addition, social media information, namely microblog data, is used as one of the inputs to the model. The method thus improves on both the prediction model and data augmentation: it combines real-time microblog data with the authoritative Baidu Index to predict the development trend of network public opinion, which effectively improves prediction accuracy.

Description of Drawings

FIG. 1 is a schematic diagram of the network structure of the prediction model of the present invention;

FIG. 2 is a schematic diagram of the unit structure of the long short-term memory neural network in an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.

Referring to FIG. 1 and FIG. 2, the present invention provides a network public opinion prediction method based on microblog data, comprising:

Step 1: Construct the prediction model. Define the network structure of the prediction model, set the dropout rate of the nodes in each network layer to 0.2, and set the optimizer to adaptive moment estimation (Adam).

On the basis of the conventional long short-term memory (LSTM) neural network, an LSTM neural network model comprising two hidden layers is constructed as the prediction model. The first hidden layer is a unidirectional LSTM unit and the second hidden layer is a bidirectional LSTM unit. The input of the prediction model is the input of the first hidden layer, and the output of the prediction model is the output of the second hidden layer. The inputs of the first hidden layer are the Baidu Index, the total microblog popularity score, the time offset, and the output of the first hidden layer at the previous moment. The inputs of the second hidden layer are the output of the preceding hidden layer at the same moment and the output of the same hidden layer at the previous moment.

The total microblog popularity score is calculated as follows:

First, the microblog data of the network public opinion event is analyzed. Microblog hot topics are collected according to the keywords of the event, hot-topic analysis is performed on the microblogs matching p keywords, and a microblog popularity score is calculated for each keyword as the weighted sum of its numbers of reposts, comments, and likes:

$$\mathrm{HotScore}_i = \alpha \cdot \mathrm{reposts}_i + \beta \cdot \mathrm{comments}_i + \gamma \cdot \mathrm{likes}_i \qquad (1)$$

The popularity scores of the p keywords are then sorted and the top q are summed to give the total microblog popularity score, where q < p. In this embodiment, p is 50 and q is 10.

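The first hidden layer is calculated as follows:

[Formula not reproduced in the source text.]

The second hidden layer is calculated as follows:

[Formula not reproduced in the source text.]

where the symbols in the second formula include the output of the second hidden layer at time t-1 and the input vector passed from the first hidden layer to the second hidden layer at time t.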

Referring to FIG. 1, the first hidden layer and the second hidden layer contain LSTM memory cells. Referring to FIG. 2, an LSTM cell works as follows:

An LSTM memory cell consists mainly of an input gate, an output gate, and a forget gate.

The LSTM cell is calculated as follows:

$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t]\right)$$

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t]\right)$$

$$z_t = \tanh\left(W_z\,[h_{t-1}, x_t]\right)$$

$$c_t = f_t \cdot c_{t-1} + i_t \cdot z_t$$

$$o_t = \sigma\left(W_o\,[h_{t-1}, x_t]\right)$$

$$h_t = o_t \cdot \tanh(c_t)$$

where $i_t$ is the input gate; $f_t$ is the forget gate; $o_t$ is the output gate; $\sigma$ is the activation function, usually Sigmoid; $W_i$, $W_f$, $W_z$, and $W_o$ are the weight matrices of the respective neural network layers; $x_t$ is the input value at the current moment; $h_{t-1}$ is the output value of the previous moment, received at the current moment t; $c_{t-1}$ is the state value at time t-1; $z_t$ is the candidate state value at the current moment; $c_t$ is the state value at the current moment; and $h_t$ is the output value at the current moment.
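The six equations map directly onto code. The sketch below implements one time step in plain Python, following the equations as written (no bias terms, element-wise products). The function name, the list-based matrix representation, and the example sizes of 2 hidden units and 3 inputs are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def _sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def _matvec(w: Matrix, v: Sequence[float]) -> List[float]:
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def lstm_step(x_t: Sequence[float], h_prev: Sequence[float], c_prev: Sequence[float],
              w_i: Matrix, w_f: Matrix, w_z: Matrix, w_o: Matrix
              ) -> Tuple[List[float], List[float]]:
    """One LSTM step; each weight matrix has shape (hidden, hidden + inputs)."""
    concat = list(h_prev) + list(x_t)                        # [h_{t-1}, x_t]
    i_t = [_sigmoid(v) for v in _matvec(w_i, concat)]        # input gate
    f_t = [_sigmoid(v) for v in _matvec(w_f, concat)]        # forget gate
    z_t = [math.tanh(v) for v in _matvec(w_z, concat)]       # candidate state
    o_t = [_sigmoid(v) for v in _matvec(w_o, concat)]        # output gate
    c_t = [f * c + i * z for f, c, i, z in zip(f_t, c_prev, i_t, z_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical sizes: 2 hidden units, 3 inputs. All-zero weights set every gate to 0.5.
zeros = [[0.0] * 5 for _ in range(2)]
h1, c1 = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], zeros, zeros, zeros, zeros)
```

With all-zero weights every gate equals 0.5 and the candidate state is 0, so the new cell state is simply half the previous one, which makes the gating behavior easy to check by hand.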

Step 2: Train the prediction model.

During training, the error metric of the prediction model is the loss function, that is, the loss function is used to calculate the error between the predicted value and the true value (label value).

A specific embodiment is given below.

Three hot events that occurred in 2019 were used as training samples: the "Chongqing Porsche female driver assault incident" (July 30 to August 14), the "996 working hours incident" (April 11 to April 26), and the "first black hole photo incident" (April 8 to April 23). The training samples were used to train the model and optimize its parameters.

The "Shandong University study buddy incident" (July 12 to July 27, 2019) was used as the test sample to verify the validity and accuracy of the model.

The experimental data mainly comprise the Baidu Index, the microblog popularity score, and the time offset.

During model training, the Baidu Index, microblog popularity score, and time offset (that is, 0) of the first day are input, the Baidu Index of the next day is calculated, and the calculated result is compared with the true Baidu Index to adjust the model parameters. This process is repeated for the subsequent days.

During model testing, only the Baidu Index of the first day and the microblog popularity score of each day are given. The time offset is incremented from 0, and the Baidu Index for days 2 through 16 is predicted.
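The test procedure amounts to a rollout in which each predicted Baidu Index is fed back as the next day's input. The sketch below illustrates that loop under stated assumptions: `rollout` and `predict_next` are hypothetical names, `predict_next` stands in for the trained model (which would also carry its own hidden state), and the time offset passed in is that of the input day, starting at 0.

```python
from typing import Callable, List, Sequence


def rollout(first_baidu_index: float, hot_scores: Sequence[float],
            predict_next: Callable[[float, float, int], float]) -> List[float]:
    """Predict days 2..len(hot_scores) from day 1's index and the daily scores."""
    predictions: List[float] = []
    current = first_baidu_index
    # The last day's score has no following day to predict, so it is skipped.
    for offset, hot in enumerate(hot_scores[:-1]):
        current = predict_next(current, hot, offset)  # feed the prediction back in
        predictions.append(current)
    return predictions


def stub_model(baidu_index: float, hot_score: float, time_offset: int) -> float:
    """Placeholder predictor for illustration only; not a trained model."""
    return baidu_index + hot_score


# 16 days of (made-up) popularity scores yield 15 predictions, for days 2 to 16.
predicted = rollout(100.0, [1.0] * 16, stub_model)
```

Because only the first true Baidu Index is supplied, any prediction error carries forward into later days, which is why this setup is a stricter check than one-step-ahead prediction.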

The results calculated by the model are shown in the table below.

Building on the conventional long short-term memory (LSTM) neural network, the present invention constructs an LSTM neural network model comprising two hidden layers, in which the first hidden layer is a unidirectional LSTM unit and the second hidden layer is a bidirectional LSTM unit, and uses microblog data as model input to predict the Baidu Index of public opinion events quantitatively, with high prediction accuracy.

Parts not addressed by the present invention are the same as the prior art or can be implemented using the prior art.

The above is a further detailed description of the present invention in conjunction with specific embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. For a person of ordinary skill in the technical field of the present invention, simple deductions or substitutions made without departing from the concept of the present invention shall all be regarded as falling within the protection scope of the present invention.

Claims

1. A network public opinion prediction method based on microblog data, characterized by comprising:

Step 1: constructing a prediction model: defining a prediction model network structure, wherein the prediction model is a long short-term memory neural network model comprising two hidden layers, the first hidden layer is a unidirectional long short-term memory neural network unit, and the second hidden layer is a bidirectional long short-term memory neural network unit; the input of the prediction model is the input of the first hidden layer, and the output of the prediction model is the output of the second hidden layer; the inputs of the first hidden layer are the Baidu Index, the total microblog popularity score, the time offset, and the output of the first hidden layer at the previous moment; the inputs of the second hidden layer are the output of the preceding hidden layer at the same moment and the output of the same hidden layer at the previous moment;

Step 2: training the prediction model:

Step 2.1: calculating one moment in the time series according to the prediction model, and passing the current input and the output of the previous moment into the prediction model to obtain the current output as a predicted value;

Step 2.2: calculating an error from the predicted value and the true value, solving by backpropagation through an optimizer, and updating the model parameters;

Step 2.3: repeating Step 2.1 and Step 2.2 until convergence;

Step 3: calculating the total microblog popularity score based on microblog data, and inputting the Baidu Index, the total microblog popularity score, the time offset, and the output of the first hidden layer at the previous moment as model inputs into the trained prediction model to perform network public opinion prediction.

2. The network public opinion prediction method based on microblog data according to claim 1, characterized in that the total microblog popularity score is calculated as follows:

analyzing the microblog data of the network public opinion event, collecting microblog hot topics according to the keywords of the network public opinion event, performing hot-topic analysis on the microblogs matching p keywords, and calculating microblog popularity scores, wherein each microblog popularity score is the weighted sum of the numbers of reposts, comments, and likes:

$$\mathrm{HotScore}_i = \alpha \cdot \mathrm{reposts}_i + \beta \cdot \mathrm{comments}_i + \gamma \cdot \mathrm{likes}_i$$

wherein $\mathrm{HotScore}_i$ denotes the microblog popularity score of the i-th keyword, $\alpha$ denotes the weight of the number of reposts of the i-th keyword, $\beta$ denotes the weight of the number of comments of the i-th keyword, and $\gamma$ denotes the weight of the number of likes of the i-th keyword;

sorting the popularity scores of the p keywords, and summing the top q to obtain the total microblog popularity score HotScore;

wherein q < p.

3. The network public opinion prediction method based on microblog data according to claim 1, characterized in that the first hidden layer is calculated as follows:

[Formula not reproduced in the source text.]

wherein the left-hand side of the formula denotes the output of the first hidden layer at time t, $W_1$ denotes the weight vector of the first hidden layer, $\mathrm{BaiduIndex}_t$ denotes the Baidu Index at time t, obtained from the Baidu website, $\mathrm{HotScore}_t$ denotes the total microblog popularity score at time t, $\Delta T$ denotes the time offset, the time offset being the time interval between the predicted day and the first day of the public opinion event; and $\sigma$ denotes the activation function, which is the Sigmoid function.

4. The network public opinion prediction method based on microblog data according to claim 1, characterized in that the second hidden layer is calculated as follows:

[Formula not reproduced in the source text.]

wherein the left-hand side of the formula denotes the output of the second hidden layer at time t, $W_2$ denotes the weight matrix of the second hidden layer, and the two remaining symbols denote, respectively, the output of the second hidden layer at time t-1 and the input vector passed from the first hidden layer to the second hidden layer at time t.

5. The network public opinion prediction method based on microblog data according to claim 1, characterized in that, during training, the error metric of the prediction model is a loss function:

the loss function is the sum of the sum of squared prediction errors and the sum of squares of the model weight parameters, and the specific formula is as follows:

[Formula not reproduced in the source text.]

wherein n is the number of samples, $h(x_i)$ denotes the predicted output of the model for input sample $x_i$, $y_i$ is the true value of sample $x_i$, and m is the number of model weights.

6. The network public opinion prediction method based on microblog data according to claim 1, characterized in that, in Step 1, when the prediction model network structure is defined, the dropout rate of the nodes in each network layer is set to 0.2, and the optimizer is set to adaptive moment estimation (Adam).