CN110083699B - News popularity prediction model training method based on deep neural network

Info

Publication number: CN110083699B
Application number: CN201910202638.6A
Authority: CN
Inventors: 刘春阳; 王乾宇; 张旭; 何赛克; 张翔宇; 郑晓龙; 曾大军; 彭鑫
Original assignee: Institute of Automation of Chinese Academy of Science; National Computer Network and Information Security Management Center
Current assignee: Institute of Automation of Chinese Academy of Science; National Computer Network and Information Security Management Center
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2021-01-12
Anticipated expiration: 2039-03-18
Also published as: CN110083699A

Abstract

The invention provides a news popularity prediction model training method based on a deep neural network, which comprises the following steps: acquiring news article data of a specific theme in a set time period, cleaning the data by using Pandas, and then sequentially grouping according to a set time length to acquire a news popularity sequence arranged according to a time sequence; according to the news popularity sequence, sequentially taking a continuous sequence with the sampling length of w as an input sample from the first popularity, and sampling data of the next period as an output sample to construct a training sample set; randomly selecting training samples from the training sample set to train the LSTM network-based news popularity prediction model, performing relevance analysis by adopting Pearson correlation coefficients to delete bad training samples, and circulating the training process until the training is finished. The invention can obtain a news popularity prediction model for predicting trendless, seasonality-free and nonlinear news popularity with higher accuracy.

Description

News popularity prediction model training method based on deep neural network

Technical Field

The invention belongs to the field of deep learning, and particularly relates to a news popularity prediction model training method based on a deep neural network.

Background

With the growing influence of the internet on people's lives and the widespread adoption of mobile terminal devices, the data generated by human production and daily life has grown rapidly in recent years. The amount of data newly generated each year is almost the sum of thousands of years of history. With the improvement and development of deep learning theory, the intrinsic value contained in big data can be mined continuously. This value has attracted close attention from governments, businesses and the scientific and technological communities of many countries.

For the media industry, besides traditional paper media, various new media platforms such as microblogs, blogs, forums and Twitter have developed. These emerging media are gradually changing how people obtain information, and the amount of data they generate each day is also very large. In the current period of social transformation and rapid economic development in China, social events such as accidents and disasters, public health events and social security events occur frequently. New media websites are becoming the main channel through which people learn of news events. Therefore, analyzing and studying news on the basis of new media, and comprehensively predicting and analyzing its development trend and direction, can improve the pertinence and foresight of event handling.

The conventional media industry usually chooses one of three methods for time series prediction of news popularity: Holt linear exponential smoothing (Holt double exponential smoothing), Holt-Winters seasonal exponential smoothing (Holt triple exponential smoothing), and ARIMA (Autoregressive Integrated Moving Average model). Holt linear exponential smoothing is only suitable for series with a trend; if the news popularity does not follow a conventional trend, the prediction accuracy is low. Holt-Winters seasonal exponential smoothing is better suited to seasonal series; if the time series of a news event is unrelated to season, accurate prediction is difficult. ARIMA is currently a very popular time series prediction algorithm with a wide range of applications, but it has difficulty capturing the regularity of nonlinear, non-stationary data.

Disclosure of Invention

In order to solve the above problems in the prior art, that is, to solve the problem that the prediction accuracy of the current prediction model for trendless, seasonality-free and non-linear popularity of news is low, the invention provides a method for training a popularity prediction model based on a deep neural network, the method comprising the following steps:

step S10, obtaining news article data of a specific theme in a set time period as a first news article data set of the theme;

step S20, using Pandas to perform data cleaning on the first news article data set to obtain a second news article data set;

step S30, the second news article data set is grouped in sequence according to the set time length, the news popularity corresponding to each group is calculated, and the news popularity sequences are obtained by arranging according to the time sequence;

step S40, starting from the first popularity value in the news popularity sequence, successively taking a continuous sequence of sampling length w as the input sequence X of a time step and taking the data of the following period as Y, and constructing a training sample set with X as the input sample and Y as the output sample;

step S50, randomly selecting N training samples from the training sample set to train the news popularity prediction model based on the LSTM network; if the training end condition is reached, executing step S70, otherwise executing step S60;

step S60, calculating the Pearson correlation coefficient r between the prediction results and the corresponding output samples, and, when r is smaller than a first set threshold value, removing the N training samples selected in this round of training from the training sample set; then executing step S50;

and step S70, obtaining a trained news popularity prediction model.

In some preferred embodiments, step S10 "obtaining news article data for a specific topic for a set time period" includes:

step S101, collecting news article data in a set time period;

step S102, regarding the collected news article data, taking a specific theme as an object, and performing relevance clustering on similar articles by adopting a SimHash algorithm to obtain the news article data of the specific theme in a set time period.

In some preferred embodiments, in step S101 "collecting news article data in a set time period", the collection sources include one or more of news, forums, blogs, and microblogs.

In some preferred embodiments, during the training of the news popularity prediction model, 10% is randomly extracted from the training sample set as the validation set, and 10-fold cross validation is performed based on the training sample set and the validation set.

In some preferred embodiments, the first set threshold is 0.6.

In some preferred embodiments, the sampling length w ∈ [10, 20].

In some preferred embodiments, the news popularity prediction model is trained using the RMSprop algorithm to iteratively update the weight parameters.

In some preferred embodiments, the initialization parameters of the news popularity prediction model before training are set as follows: the number of hidden layer neurons is n, n ∈ [40, 60]; the number of recursions within a time step is k, k ∈ [10, 20]; the number of training rounds is q, q ∈ [1500, 2500]; the training batch size is j, j ∈ [40, 60].

In another aspect of the present invention, a method for predicting popularity of news based on a deep neural network is provided, the method comprising the following steps:

step A10, obtaining news article data of a selected subject in a set time period;

step A20, acquiring a news popularity sequence of the selected theme by adopting the methods of step S20 and step S30 in the deep neural network-based news popularity prediction model training method;

step A30, selecting w news popularity with the latest time sequence from the news popularity sequence as input data;

and step A40, predicting the popularity of the news in the later period by inputting data by using the trained news popularity prediction model based on the LSTM network obtained by the deep neural network-based news popularity prediction model training method.

In some preferred embodiments, the method further comprises, after step A40:

step A50, if the prediction times are less than the set prediction period, adding the predicted news popularity into the news popularity sequence, and executing step A30.

The invention has the beneficial effects that:

according to the method for training the news popularity prediction model based on the deep neural network, the news popularity prediction model for effectively predicting and analyzing trendless, seasonality-free and nonlinear news popularity can be obtained, news popularity prediction is carried out through the model, algorithm efficiency is improved, time complexity is reduced, and accuracy of prediction results is improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a schematic flow chart of a News popularity prediction model training method based on a deep neural network according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for predicting popularity of news based on a deep neural network according to an embodiment of the present invention;

FIG. 3 is a comparison of the predicted and actual news popularity of a certain trade war, obtained with the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention builds a database from news data collected from various social media, aggregates the data, and identifies the earliest published item in each channel. Articles on the same news topic are clustered with the SimHash algorithm, and the popularity of the news in each time period is calculated in units of one hour. The data is then cleaned using Pandas. Next, a deep LSTM neural network is used to analyze and predict the news popularity, and during prediction part of the data is processed a second time using the Pearson correlation coefficient. Finally, cross validation is performed on the results to obtain the news popularity over a future period, which helps new media companies analyze and assess news and public opinion topics. Compared with the traditional methods, this process improves the efficiency of the algorithm, reduces the time complexity, improves the accuracy of the prediction results, and predicts trendless, seasonality-free and nonlinear news well.

The invention discloses a news popularity prediction model training method based on a deep neural network, which comprises the following steps of:

and step S70, obtaining a trained news popularity prediction model.

In order to more clearly describe the method for training the news popularity prediction model based on the deep neural network, the following describes the steps of the method in detail with reference to an embodiment.

The invention discloses a news popularity prediction model training method based on a deep neural network, which comprises the following steps:

in step S10, news article data of a specific topic in a set time period is obtained as a first news article data set of the topic.

In this embodiment, news article data may be obtained by:

step S101, collecting news article data in a set time period.

News content is collected from multiple channels. After a news item is published, it generally spreads through several channels, typically news sites, forums, blogs and microblogs. In news collection, a web crawler crawls and gathers the data from each of the four channels separately, a database is built, and the data from each channel is stored in its corresponding database. At the same time, all data acquired from the four channels is stored in a summary database used for source tracing.

Step S102, news on the same topic is aggregated. In this process, the SimHash algorithm is used to perform relevance clustering on similar articles. SimHash can effectively classify and match articles of various types, merges similar news items, and works well for organizing news.
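As a rough illustration of the SimHash clustering described above, the following Python sketch computes a 64-bit SimHash fingerprint and treats two articles as belonging to the same topic when their fingerprints differ in only a few bits. The word tokenizer, the MD5 token hash and the `max_distance` threshold are assumptions made for this example; the patent does not specify them, and Chinese text would first need a proper word segmenter.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (tokens are lowercase words)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to a stable 64-bit integer (assumed hash: MD5).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def same_topic(a: str, b: str, max_distance: int = 3) -> bool:
    # max_distance is an assumed threshold, not a value from the patent.
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Articles can then be grouped by comparing each new fingerprint with one representative fingerprint per existing topic.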

Step S20, performing data cleaning on the first news article data set by using Pandas to obtain a second news article data set.

The data of the first news article data set is preprocessed with Pandas. Problems such as missing column headers, multiple parameters in one column, inconsistent units within a column, missing values, empty rows, duplicate data, non-ASCII characters, and column headers that are data values rather than column names are handled thoroughly, so that the data reaches a clean state.
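A minimal sketch of what such a cleaning pass could look like with Pandas is given below. The column names `title`, `published_at` and `views` are hypothetical, and only a few of the listed problems (empty rows, duplicates, missing values, inconsistent types) are handled; the patent does not disclose the actual cleaning rules.

```python
import pandas as pd


def clean_articles(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a raw article table (assumed column names)."""
    # Drop rows that are entirely empty, then exact duplicate rows.
    df = raw.dropna(how="all").drop_duplicates().copy()
    # Parse timestamps; values that cannot be parsed become NaT.
    df["published_at"] = pd.to_datetime(df["published_at"], errors="coerce")
    # Discard rows that lack a usable timestamp or title.
    df = df.dropna(subset=["published_at", "title"]).copy()
    # Coerce view counts to integers, treating missing counts as zero.
    df["views"] = pd.to_numeric(df["views"], errors="coerce").fillna(0).astype(int)
    # Strip surrounding whitespace from titles.
    df["title"] = df["title"].str.strip()
    return df.reset_index(drop=True)
```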

Step S30, sequentially grouping the second news article data set according to a set time length, calculating the news popularity corresponding to each group, and arranging the news popularity values in time order to obtain a news popularity sequence.

The set time length in this step may be one hour, and the news popularity values in the news popularity sequence are ordered by time. In this embodiment, news popularity is calculated in units of hours (for example, the news popularity of a certain trade war shown in FIG. 3), and the news popularity of a specific topic in a period A is calculated as follows:

acquiring a specific subject news total browsing volume A2 in an A time period in the second news article data set, acquiring all news total browsing volumes A1 in the A time period in the first news article data set, and calculating the proportion B of A2 in A1;

obtaining a news search index of the specific topic in the period A (which can be obtained from any search website such as Google or Baidu, or by a weighted calculation over the indexes obtained from several search websites), and normalizing the index to obtain C;

and B and C are weighted and summed to obtain the popularity of the news of the specific subject in the period A.

The above method for calculating the news popularity of a specific topic in the period A is only an example; other existing methods for calculating news popularity can also be adopted and are not described in detail here.
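The calculation above can be sketched in Python as follows. The equal weights of 0.5 and the use of min-max scaling for the normalization are assumptions for illustration only; the patent leaves both the weights and the normalization method open.

```python
def topic_share(topic_views: float, all_views: float) -> float:
    """B: share of the topic's views (A2) in all views (A1) for one period."""
    return topic_views / all_views if all_views else 0.0


def normalize(values: list[float]) -> list[float]:
    """Min-max scale search-index values to [0, 1] to obtain C per period."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def popularity(a2, a1, search_index, weight_b=0.5, weight_c=0.5):
    """Weighted sum of B and C for each period; the weights are assumed."""
    c = normalize(search_index)
    return [weight_b * topic_share(x, y) + weight_c * z
            for x, y, z in zip(a2, a1, c)]
```

Each argument is a list with one entry per period (for example, per hour), so the result is the news popularity sequence in time order.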

Step S40, starting from the first popularity value in the news popularity sequence, successively taking a continuous sequence of sampling length w as the input sequence X of a time step and taking the data of the following period as Y, and constructing a training sample set with X as the input sample and Y as the output sample.

In order to convert the original univariate time series data into a form acceptable to the LSTM (an input sample X and an output sample Y, the true value), the news popularity sequence is sampled successively from the first value: a continuous sequence of length w (w ∈ [10, 20]; 12 in this embodiment) is taken as the input sequence X of a time step, the data of the following period is taken as Y, and the resulting pairs of X and corresponding Y form the training sample set.
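The sliding-window sampling can be illustrated with a short Python sketch. The function name `make_samples` is hypothetical; the default `w=12` follows the value used in this embodiment.

```python
def make_samples(series, w=12):
    """Split a popularity sequence into (X, Y) pairs with window length w."""
    samples = []
    for start in range(len(series) - w):
        x = series[start:start + w]   # w consecutive popularity values
        y = series[start + w]         # value of the period right after X
        samples.append((x, y))
    return samples
```

A sequence of length L therefore yields L - w training samples, with consecutive windows shifted by one period.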

Step S50, randomly selecting N training samples from the training sample set to train the news popularity prediction model based on the LSTM network; if the training end condition is reached, step S70 is executed, otherwise step S60 is executed.

Before training a news popularity prediction model, a network structure needs to be determined, parameters need to be initialized, and then model training is carried out.

(1) Network architecture

The embodiment adopts a method for predicting the value of time series data by constructing a news popularity prediction model based on a deep LSTM (Long Short-Term Memory) network.

A conventional RNN model, when implementing long-term memory, needs to couple the calculation of the current hidden state with the previous n calculations, as shown in formula (1):

$$S_t = f\left(U X_t + W_1 S_{t-1} + W_2 S_{t-2} + \dots + W_n S_{t-n}\right) \tag{1}$$

where $X_t$ is the input of the input layer at the t-th calculation, $S_t$ is the hidden layer state, and $W_1, \dots, W_n$ and $U$ are weights. For example, when n equals 1, the hidden layer state is $S_t = f(U X_t + W S_{t-1})$. In this way the amount of calculation grows exponentially, which greatly increases the model training time; therefore, a deep LSTM network model is chosen for the long-term memory calculation. Because the deep LSTM network model has a cell that judges whether information is useful, information that does not conform to the rules can be filtered out by its three gates. The algorithm therefore gives better prediction results on long time series data.
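To make the gating idea concrete, here is a minimal scalar LSTM step in Python. It follows the standard LSTM formulation with forget, input and output gates, which the patent does not spell out, so the gate names, the parameter layout `p` and the scalar simplification are all assumptions of this sketch.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` maps gate name -> (w_x, w_h, b)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to admit
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate cell content
    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state
    return h, c
```

Unlike formula (1), each step depends only on the previous hidden state and cell state, while the cell state carries longer-term information forward.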

The Sigmoid activation function is chosen for the fully connected artificial neural network layer that receives the LSTM output; the dropout rate of the nodes in each network layer is set to 20%; the mean square error range is set to 20%; the RMSprop algorithm is used for the iterative update of the weight parameters; and the epoch and batch size of model training are determined, with the epoch set to 10. The more layers the LSTM module has, the stronger its ability to learn high-level temporal representations; the number of layers is set to 3 in this embodiment. In addition, an ordinary neural network layer is added to reduce the dimension of the output.
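For reference, the RMSprop update named above can be sketched for a single scalar weight as follows. The learning rate, decay factor and epsilon are commonly used defaults chosen for this example, not values taken from the patent.

```python
import math


def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """Return the updated weight and the running average of squared gradients."""
    # Exponential moving average of the squared gradient.
    cache = decay * cache + (1.0 - decay) * grad * grad
    # Scale the step by the root of that average.
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache
```

In practice the same rule is applied element-wise to every weight of the network, usually through a deep learning framework's built-in optimizer.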

(2) Parameter initialization

The number of hidden layers is set to m, where m is generally 1; the number of hidden layer neurons is set to n, n ∈ [40, 60], with 50 used in this embodiment; the number of recursions within a time step is set to k, k ∈ [10, 20], with 15 used in this embodiment; the number of training rounds is set to i (fewer rounds tend to give worse results), i ∈ [1500, 2500], with 2000 used in this embodiment; the training batch size is set to j (meaning that j sets of sequence samples are drawn from the training set in each round), j ∈ [40, 60], with 50 used in this embodiment. Setting these parameters completes the initialization.
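One way to keep these settings together and check them against the stated ranges is a small configuration object, sketched below. The class and field names are hypothetical; the defaults are the values used in this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LstmConfig:
    hidden_layers: int = 1        # m
    hidden_units: int = 50        # n, in [40, 60]
    time_steps: int = 15          # k, in [10, 20]
    training_rounds: int = 2000   # i, in [1500, 2500]
    batch_size: int = 50          # j, in [40, 60]

    def __post_init__(self):
        # Ranges as stated in the description.
        allowed = {
            "hidden_units": (40, 60),
            "time_steps": (10, 20),
            "training_rounds": (1500, 2500),
            "batch_size": (40, 60),
        }
        for name, (low, high) in allowed.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```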

(3) Model training

In each round of model training, N training samples are randomly selected from the training sample set to train the news popularity prediction model; whether to end the training is then judged according to a preset training end condition. If the condition is met, step S70 is executed; otherwise step S60 is executed. The training end condition may be a set number of iterations, or convergence of the loss function value.

Step S60, calculating a correlation coefficient r between the prediction result and the corresponding output sample by using the Pearson correlation coefficient, removing N training samples selected in the training of the current round from the training sample set when r is smaller than a first set threshold, and executing step S50.

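The Pearson correlation coefficient is used to perform correlation analysis on the output prediction results, as given in formula (2), written here in the standard form of the coefficient:

$$r = \frac{N\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{\sqrt{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}\,\sqrt{N\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2}} \tag{2}$$

where r is the correlation coefficient, N is the total number of input samples during training, and $x_i$ and $y_i$ are, respectively, the predicted value and the true value of the later period for the i-th input sample.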

The Pearson correlation coefficient is used to analyze the correlation between the predicted values and the true values. The closer r is to 1, the higher the correlation between the predicted values and the true values; the closer r is to -1, the poorer the agreement between them, since the two are then negatively correlated. When r is lower than the set threshold (0.6 in this embodiment), the N training samples selected in the current round of training are removed from the training sample set, and step S50 is then executed again for the next round of training.
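A small Python sketch of the coefficient, using the standard sum form of the Pearson correlation, is shown below. Returning 0.0 when either series is constant is a convention chosen for this example, since r is undefined in that case.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between predictions x and true values y."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    # max(0.0, ...) guards against tiny negative values from rounding.
    var_x = max(0.0, n * sxx - sx * sx)
    var_y = max(0.0, n * syy - sy * sy)
    denom = math.sqrt(var_x) * math.sqrt(var_y)
    return (n * sxy - sx * sy) / denom if denom else 0.0
```

With the threshold of this embodiment, a batch would be discarded when `pearson_r(predictions, targets) < 0.6`.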

And in the training process, the Pearson correlation coefficient is introduced to delete bad sample data again, so that the retained data is more favorable for training the news popularity prediction model, the training speed is improved, and the model parameters can be further optimized.
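The interaction between steps S50 and S60 can be sketched as the loop below. The callables `train_step`, `predict` and `correlation` are hypothetical placeholders for the model update, the model's prediction and the Pearson coefficient, and a fixed round budget is assumed as the training end condition.

```python
import random


def train_with_pruning(samples, train_step, predict, correlation,
                       n=50, threshold=0.6, max_rounds=2000, seed=0):
    """Run steps S50/S60 repeatedly; return the surviving training samples."""
    rng = random.Random(seed)
    pool = list(samples)
    for _ in range(max_rounds):            # end condition: round budget
        if len(pool) < n:                  # not enough samples left to draw
            break
        batch = rng.sample(pool, n)        # S50: pick N samples at random
        train_step(batch)
        predictions = [predict(x) for x, _ in batch]
        targets = [y for _, y in batch]
        if correlation(predictions, targets) < threshold:
            # S60: drop this round's samples from the training set.
            chosen = {id(s) for s in batch}
            pool = [s for s in pool if id(s) not in chosen]
    return pool
```

Batches whose predictions correlate well with the true values stay in the pool and can be drawn again in later rounds.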

Step S70, obtaining a trained news popularity prediction model.

In addition, during the training of the news popularity prediction model, 10% of the training sample set is randomly extracted as a validation set (that is, after random splitting the ratio of the training sample set to the validation set is 9:1), and 10-fold cross validation is performed on the basis of the training sample set and the validation set to prevent overfitting.
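A plain-Python sketch of the 9:1 random split and of a 10-fold partition is given below. How the patent combines the hold-out set with the folds is not detailed, so the two helpers are shown separately; their names are hypothetical.

```python
import random


def holdout_split(samples, val_fraction=0.1, seed=0):
    """Randomly split samples into (training, validation) at roughly 9:1."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def k_fold(samples, k=10):
    """Yield (train, validation) pairs; each sample validates exactly once."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]
```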

In an embodiment of the invention, as shown in fig. 2, a news popularity prediction method based on a deep neural network includes the following steps:

In practical use, when the trained news popularity prediction model is used to predict the next, not yet observed period, the last w values are input directly to obtain the predicted value for one future step. If predicted values for more periods are needed, they can be obtained step by step, that is, each predicted value is treated as an actually observed value for the next prediction, with the number of steps chosen according to the periods to be predicted.

Therefore, in order to obtain predicted values of more periods, step a50 may be added after step a40 of the method, and if the prediction times are less than the set prediction periods, the predicted popularity of the news is added to the popularity sequence of the news, and step a30 is performed.
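The loop over steps A30 to A50 can be sketched as follows. Here `model` is a hypothetical callable that maps a window of w popularity values to the predicted value for the next period; it stands in for the trained LSTM model.

```python
def forecast(series, model, w=12, periods=3):
    """Predict `periods` future values by feeding predictions back in."""
    history = list(series)
    predictions = []
    for _ in range(periods):      # A50: repeat until enough periods
        window = history[-w:]     # A30: latest w popularity values
        y = model(window)         # A40: predict the next period
        predictions.append(y)
        history.append(y)         # treat the prediction as observed
    return predictions
```

Because each prediction is fed back as input, errors can accumulate as the prediction horizon grows.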

FIG. 3 shows the prediction for a certain trade war obtained with the trained news popularity prediction model in an embodiment of the present invention. In the figure, the abscissa is time and the ordinate is news popularity (for ease of display, the news popularity is normalized to a specified interval, which in this example is [0, 100]); the gray curve is the true value and the black curve is the predicted value. The news popularity of the trade war over the most recent 3 months (July 24 to October 24) is counted. It can be seen that the news popularity reached its first peak on September 18 and its second peak on September 25, after which it dropped slightly. The predicted news popularity (black curve) is basically consistent with the actual news popularity (gray curve) in both trend and value, which indicates that the method predicts well.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the deep neural network-based news popularity prediction method described above may refer to the corresponding process in the embodiment of the deep neural network-based news popularity prediction model training method, and details are not repeated herein.

Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. A news popularity prediction model training method based on a deep neural network is characterized by comprising the following steps:

step S70, obtaining a trained news popularity prediction model;

the news popularity is calculated by the following method:

acquiring the total news browsing volume A2 of a specific subject in the A time period in the second news article data set, acquiring all the total news browsing volumes A1 in the A time period in the first news article data set, and calculating the proportion B of A2 in A1;

obtaining a news search index of a specific subject in the period A, and carrying out normalization processing on the index to obtain C;

and B and C are weighted and summed to be used as the news popularity corresponding to the A period.

2. The method for training a news popularity prediction model based on a deep neural network as claimed in claim 1, wherein the step S10 "obtaining news article data of a specific topic in a set time period" comprises the steps of:

step S101, collecting news article data in a set time period;

3. The deep neural network-based news popularity prediction model training method according to claim 2, wherein in step S101, "news article data in a set time period is collected", and the collected sources include one or more of news, forums, blogs, and microblogs.

4. The training method of the news popularity prediction model based on the deep neural network as claimed in claim 1, wherein 10% of training samples are randomly extracted as a validation set during the training process of the news popularity prediction model, and 10-fold cross validation is performed based on the training samples and the validation set.

5. The deep neural network-based news popularity prediction model training method of claim 1, wherein the first set threshold is 0.6.

6. The deep neural network-based news popularity prediction model training method of claim 1, wherein the sampling length w ∈ [10, 20].

7. The deep neural network-based news popularity prediction model training method of claim 1, wherein the news popularity prediction model is trained by using an RMSprop algorithm to perform iterative update of weight parameters.

8. The deep neural network-based news popularity prediction model training method according to any one of claims 1 to 7, wherein the news popularity prediction model is initialized with the parameters before training set as: the number of hidden layer neurons is n, n ∈ [40, 60]; the number of recursions within a time step is k, k ∈ [10, 20]; the number of training rounds is q, q ∈ [1500, 2500]; the training batch size is j, j ∈ [40, 60].

9. A news popularity prediction method based on a deep neural network is characterized by comprising the following steps:

step A20, acquiring the news popularity sequence of the selected subject by adopting the method of step S20 and step S30 in the deep neural network-based news popularity prediction model training method of any one of claims 1 to 8;

step A40, predicting the popularity of the next stage of news by inputting data by using the well-trained LSTM network-based news popularity prediction model obtained by the deep neural network-based news popularity prediction model training method of any one of claims 1 to 8.

10. The method for predicting popularity of news based on a deep neural network as claimed in claim 9, wherein the method further comprises, after step A40: