CN110334275A - Information popularity prediction method, device and storage medium - Google Patents

Information popularity prediction method, device and storage medium Download PDF

Info

Publication number
CN110334275A
CN110334275A
Authority
CN
China
Prior art keywords
popularity
information
prediction
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910471730.2A
Other languages
Chinese (zh)
Other versions
CN110334275B (en)
Inventor
郭建彬
孔庆超
罗引
郝艳妮
赵菲菲
皇秋曼
王磊
曹家
张西娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Song Polytron Technologies Inc
Original Assignee
Beijing Zhongke Song Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Song Polytron Technologies Inc filed Critical Beijing Zhongke Song Polytron Technologies Inc
Priority to CN201910471730.2A priority Critical patent/CN110334275B/en
Publication of CN110334275A publication Critical patent/CN110334275A/en
Application granted granted Critical
Publication of CN110334275B publication Critical patent/CN110334275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention discloses an information popularity prediction method, device, and storage medium. The method comprises: obtaining information to be predicted; extracting popularity-influencing features of the information according to predetermined popularity-influencing feature categories; and inputting the popularity-influencing features into a pre-trained multi-model predictor that integrates multiple prediction models, to obtain the popularity of the information output by the multi-model predictor. The present invention performs information popularity prediction using a multi-model predictor that integrates multiple prediction models. Using a multi-model predictor to predict information popularity not only improves the stability of the prediction results but also significantly improves predictive performance, so that popularity prediction is more stable and the prediction results are more accurate.

Description

Information popularity prediction method, device, and storage medium
Technical field
The present invention relates to the field of computer technology, and in particular to an information popularity prediction method, device, and storage medium.
Background art
With the continuous development of the Internet, networks are flooded with all kinds of information. In order to find information of reference value within this mass of information, information popularity prediction has attracted considerable attention from academia and industry and has produced a large body of research results.
Information popularity prediction estimates how much attention a piece of information will receive in the future. For example, hot-news prediction on Internet news items aims to discover future hot news by predicting how much attention a news item will receive, so that potential future hot news can be identified among the massive amount of published information. Hot-news prediction is of great significance, with broad application prospects, in fields such as pre-release news assessment, website bandwidth planning, hot-news push, and the discovery of and early warning about public-opinion events.
Traditional information popularity prediction uses a single prediction model. Although a single model has advantages in computation time and computational complexity, the complexity of information popularity prediction means that the performance of a single model is often insufficient: it tends to perform inconsistently on data sets with different distributions, which ultimately leads to poor stability and accuracy of the prediction results, makes the prediction model difficult to improve, and results in poor generalization.
Summary of the invention
The main purpose of the present invention is to provide an information popularity prediction method, device, and storage medium, to solve the problem of poor stability and accuracy of the prediction results when a single model is used to predict information popularity.
The present invention provides an information popularity prediction method, comprising: obtaining information to be predicted; extracting popularity-influencing features of the information according to predetermined popularity-influencing feature categories; and inputting the popularity-influencing features into a pre-trained multi-model predictor that integrates multiple prediction models, to obtain the popularity of the information output by the multi-model predictor.
Wherein, before the popularity-influencing features are input into the pre-trained multi-model predictor that integrates multiple prediction models, the method further comprises: integrating the multiple prediction models based on a Stacking ensemble strategy; and training the integrated prediction models on a preset training data set to obtain the multi-model predictor for computing information popularity.
Wherein, integrating the multiple prediction models based on the Stacking ensemble strategy comprises: selecting one of the multiple prediction models as the meta-learner and using the remaining prediction models as base learners; placing the base learners in parallel; and connecting the output of each base learner to the input of the meta-learner.
Wherein, training the integrated prediction models on the preset training data set comprises: dividing the training data set into multiple data parts, each containing multiple sample data items of known popularity, and executing the following training steps for each base learner. Step 12: among the multiple data parts, select one data part in turn as the validation part and use the remaining data parts as training parts. Step 14: train the base learner on the sample data in the training parts, then input each sample data item in the validation part into the trained base learner and store the predicted popularity of each sample data item in a meta-training data set. Step 16: input all or part of the sample data in the training data set into the base learner and store the predicted popularity of each sample data item in a meta-prediction data set; once every data part has served as the validation part, the training of that base learner is complete. After all base learners have been trained, execute the following training steps for the meta-learner. Step 22: train the meta-learner using the predicted popularity of the sample data in the meta-training data set and the known popularity of the corresponding sample data in the training data set. Step 24: verify the accuracy of the trained meta-learner using the predicted popularity of the sample data in the meta-prediction data set and the known popularity of the corresponding sample data in the training data set; if the accuracy of the popularity predicted by the meta-learner exceeds a preset training threshold, end the training and obtain the multi-model predictor; otherwise, jump to step 12 and retrain the base learners and the meta-learner.
Wherein, the multiple prediction models comprise: a first extreme gradient boosting (XGBoost) algorithm model, an adaptive boosting (AdaBoost) algorithm model, a random forest (RandomForest) algorithm model, an extremely randomized trees (Extremely Randomized Trees) algorithm model, and a second XGBoost algorithm model. Integrating the multiple prediction models based on the Stacking ensemble strategy comprises: placing the first XGBoost, AdaBoost, RandomForest, and Extremely Randomized Trees algorithm models in parallel, and connecting the outputs of the first XGBoost, AdaBoost, RandomForest, and Extremely Randomized Trees algorithm models to the input of the second XGBoost algorithm model.
Wherein, the popularity-influencing features comprise attribute features and environmental features. The attribute features comprise content features of the information to be predicted; the environmental features comprise a competition-intensity feature and a continuity feature of the information to be predicted in the network.
Wherein, the attribute features comprise: the text length, title length, picture count, posting time, publication date, topic distribution, and repost sequence of the information to be predicted. The competition-intensity feature comprises: the amount of information published in the network within a first preset time period. The continuity feature comprises: the topic similarity between the information to be predicted and hot information, where the information published in the network within a second time period is sorted by popularity in descending order and the top N items are taken as hot information, N >= 1.
Wherein, before the popularity-influencing features of the information to be predicted are extracted, the method further comprises: obtaining candidate features of sample information whose popularity is known; computing, with a preset correlation analysis algorithm, the correlation coefficient between each candidate feature and the popularity of the sample information; and determining the candidate features whose correlation coefficients exceed a preset correlation threshold as the popularity-influencing feature categories to be extracted.
The beneficial effects of the present invention are as follows:
The present invention performs information popularity prediction on the information to be predicted using a multi-model predictor that integrates multiple prediction models. Using a multi-model predictor to predict information popularity not only improves the stability of the prediction results but also significantly improves the predictive performance of the prediction model, so that popularity prediction is more stable and the prediction results are more accurate.
Brief description of the drawings
The drawings described herein are provided for further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a flowchart of the information popularity prediction method according to the first embodiment of the present invention;
Fig. 2 is a flowchart of the steps for obtaining the multi-model predictor according to the second embodiment of the present invention;
Fig. 3 is a flowchart of the base-learner training steps according to the second embodiment of the present invention;
Fig. 4 is a flowchart of the meta-learner training steps according to the second embodiment of the present invention;
Fig. 5 is a training schematic diagram according to the second embodiment of the present invention;
Fig. 6 is a structural diagram of the information popularity prediction device according to the third embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Embodiment one
According to an embodiment of the present invention, an information popularity prediction method is provided. Fig. 1 shows a flowchart of the information popularity prediction method according to the first embodiment of the present invention.
Step S110: obtain the information to be predicted.
In this embodiment, the information to be predicted is information published on a network platform.
In this embodiment, the information to be predicted may be long-text information.
Long-text information refers to text information with standardized content (a prescribed format), for example, a news item containing standardized content such as a title, body text, pictures, posting time, and publication date.
Step S120: extract the popularity-influencing features of the information to be predicted according to the predetermined popularity-influencing feature categories.
This embodiment considers not only factors of the information itself but also external factors affecting the information.
In this embodiment, the popularity-influencing features include, but are not limited to, attribute features and environmental features.
The attribute features comprise content features of the information to be predicted, i.e., features of its standardized content. The attribute features include, but are not limited to: the text length, title length, picture count, posting time, publication date, topic distribution, and repost sequence of the information to be predicted. The repost sequence is the sequence of repost counts of the information to be predicted in different time periods.
For example: perform statistical analysis on a candidate news item to extract its text length (message length), title length, picture count, posting time (0-23 h), and publication date (day of the week); call the Gensim natural-language-processing toolkit and use LDA (Latent Dirichlet Allocation, a document-topic generative model) to extract the topics of the candidate news item and generate its topic distribution; and collect the repost sequence of the candidate news item within a preset time range (the sequence of repost counts by the media outlets that reposted the candidate news item). Specifically, divide the preset time range into multiple sub-periods (the number can be chosen as needed), count the reposts in each sub-period, and form for the candidate news item a vector whose length equals the number of sub-periods, thereby characterizing the repost trend of the candidate news item. For example, if the 24 hours after the candidate news item is published are divided into 5 sub-periods, the length of the vector is 5.
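The repost-sequence vector described above can be sketched as follows; the repost timestamps (hours since publication) are assumed toy data, not taken from the patent:

```python
# Bucket repost timestamps into 5 equal sub-periods over the first 24 hours,
# yielding one count per sub-period (the repost-trend vector).
def repost_sequence(repost_hours, horizon_hours=24, n_bins=5):
    """Count reposts per sub-period within the horizon."""
    bin_width = horizon_hours / n_bins
    counts = [0] * n_bins
    for h in repost_hours:
        if 0 <= h < horizon_hours:
            counts[int(h // bin_width)] += 1
    return counts

# Hypothetical repost times (hours after publication) for one news item.
hours = [0.5, 1.2, 3.9, 5.0, 7.7, 11.0, 23.9]
print(repost_sequence(hours))  # [3, 2, 1, 0, 1]
```

Reposts falling outside the horizon are simply ignored, matching the idea of a fixed preset time range.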
Of course, the attribute features may also include traceable information about the author, for example, the average and standard deviation of the view counts of the information the author has published. Considering the difficulty of tracing data and the timeliness of data, the trace window can be set to the 6 months preceding the current time.
The environmental features comprise the competition-intensity feature and the continuity feature of the information to be predicted in the network.
The competition-intensity feature comprises the amount of information published in the network within a first preset time period. For example: count the number of news items published in the 30 minutes before the candidate news item is posted, and use that count as the competition-intensity feature of the candidate news item.
The continuity feature comprises the topic similarity between the information to be predicted and hot information, where the information published in the network within a second time period is sorted by popularity in descending order and the top N items are taken as hot information, N >= 1.
For example: take the hot news published in the 24 hours before the candidate news item is posted (the list of the ten most popular news items), and compute the topic similarity between the candidate news item and each hot news item as the continuity feature of the candidate news item.
The topic similarity can be computed with the JS (Jensen-Shannon) distance, which is computed as:

JS(P, Q) = (1/2) KL(P || (P+Q)/2) + (1/2) KL(Q || (P+Q)/2)

where P denotes the topic distribution of the candidate news item, Q denotes the topic distribution of a previous hot news item, and KL denotes the relative entropy between the topic distributions of the candidate news item and the hot news item, defined as:

KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )
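A minimal sketch of the JS-distance computation between two topic distributions; the two 4-topic distributions are assumed toy data:

```python
import math

def kl(p, q):
    """Relative entropy KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between distributions p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # the mixture (P+Q)/2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.1, 0.1, 0.1]   # candidate news topic distribution
q = [0.1, 0.1, 0.1, 0.7]   # previous hot news topic distribution
print(round(js(p, q), 4))   # 0 for identical distributions, at most ln 2
```

Because the mixture m is never zero where p or q is nonzero, js is always finite, which is the usual reason to prefer it over raw KL for comparing topic distributions.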
In this embodiment, if multiple popularity-influencing features are extracted, they are encoded to form a popularity-influencing feature sequence.
Step S130: input the popularity-influencing features into the pre-trained multi-model predictor that integrates multiple prediction models, and obtain the popularity of the information to be predicted output by the multi-model predictor.
In this embodiment, if the extracted popularity-influencing features have been encoded into a popularity-influencing feature sequence, the feature sequence is input into the multi-model predictor to obtain the popularity of the information to be predicted output by the multi-model predictor.
In this embodiment, the popularity of the information to be predicted output by the multi-model predictor may be the predicted final view count and/or repost count of the information. If both the view count and the repost count are output, their weighted sum can be computed and used as the popularity of the information to be predicted.
For example:

P(i) = ω₁ V_i + ω₂ R_i

where P(i) denotes the popularity of the information i to be predicted, ω₁ denotes the view-count weight, V_i denotes the view count of information i, ω₂ denotes the repost-count weight, and R_i denotes the repost count of information i. The view-count weight and the repost-count weight can be set empirically or configured according to experimental results.
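The weighted-sum scoring above is trivially coded; the 0.6/0.4 weights here are illustrative only, since the patent leaves them to empirical tuning:

```python
# Weighted sum of predicted view count and repost count as the final
# popularity score P(i) = w1 * views + w2 * reposts.
def popularity_score(views, reposts, w_views=0.6, w_reposts=0.4):
    return w_views * views + w_reposts * reposts

print(popularity_score(1000, 50))  # 0.6*1000 + 0.4*50 = 620.0
```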
In this embodiment, multiple prediction models can be integrated based on the Stacking ensemble strategy and trained on a preset training data set to obtain the multi-model predictor for computing information popularity. This part is described in detail in embodiment two.
In this embodiment, before the popularity-influencing features of the information to be predicted are extracted: obtain candidate features of sample information whose popularity is known; compute, with a preset correlation analysis algorithm, the correlation coefficient between each candidate feature and the popularity of the sample information; and determine the candidate features whose correlation coefficients exceed a preset correlation threshold as the popularity-influencing feature categories to be extracted from the information to be predicted.
Correlation analysis measures the degree to which a feature is associated with the popularity of the information. The larger the correlation coefficient obtained by the analysis, the stronger the association between the feature and the popularity of the information; the smaller the coefficient, the weaker the association. If the correlation coefficient is negative, its absolute value is taken.
Specifically: choose the continuous-feature (attribute-feature) and discrete-feature (environmental-feature) categories to be extracted; extract the continuous and discrete features of the sample information; perform Pearson correlation analysis between each continuous feature and the popularity of the sample information to obtain their correlation coefficients; perform Spearman correlation analysis between each discrete feature and the popularity of the sample information to obtain their correlation coefficients; compare each of these correlation coefficients with a preset correlation threshold; and take the continuous and discrete features whose correlation coefficients exceed the threshold as the popularity-influencing features, i.e., as the popularity-influencing feature categories subsequently to be extracted from the information to be predicted. The correlation threshold can be an empirical value or a value obtained by experiment, for example 0.4.
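The feature-selection step above can be sketched with scipy: Pearson for continuous features, Spearman for discrete ones, keeping features with |r| above the example threshold of 0.4. The toy feature/popularity data is assumed:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
popularity = rng.normal(size=200)           # known popularity of samples
continuous = {
    "text_length": popularity * 2 + rng.normal(scale=0.5, size=200),  # related
    "picture_count": rng.normal(size=200),                            # noise
}
discrete = {
    "competition": np.round(-popularity + rng.normal(scale=0.5, size=200)),
}

threshold = 0.4
selected = []
for name, values in continuous.items():
    r, _ = pearsonr(values, popularity)     # Pearson for continuous features
    if abs(r) > threshold:                  # negative r: absolute value
        selected.append(name)
for name, values in discrete.items():
    r, _ = spearmanr(values, popularity)    # Spearman for discrete features
    if abs(r) > threshold:
        selected.append(name)

print(selected)  # the pure-noise feature should be dropped
```

Note the absolute value: a strongly negative correlation (like competition intensity vs. popularity) is just as informative as a positive one.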
For the continuous features, this embodiment selects the standardized content features of the information.
For the discrete features, this embodiment selects the competition-intensity feature and the continuity feature.
Competition-intensity feature: according to social-information propagation theory, the user traffic of a given information/media platform can be regarded as fixed over a short time, so the information published on the platform competes for user attention. The competition-intensity feature characterizes the intensity of this competition for user attention among the items published on the platform: the more information the platform publishes in a period, the greater the competition among items; the less information it publishes in a period, the smaller the competition. Therefore, when the competition intensity is lower, an item is more likely to attract a large number of users, i.e., its popularity is likely to be higher.
Continuity feature: the decay of hotness is a gradual process, i.e., it takes time for the hotness of a previous hot item to decay to a normal level. Accordingly, if a hot item from the previous period still receives high user attention in the current period, then current-period information that is highly related to that previous hot item is likely to receive high attention as well. Therefore, the topic relevance between an item and previous hot items is a key factor influencing its popularity.
In this embodiment, a multi-model predictor that integrates multiple prediction models is used to predict the popularity of the information to be predicted. Using a multi-model predictor to predict information popularity not only improves the stability of the prediction results but also significantly improves the predictive performance of the prediction model, so that popularity prediction is more stable and the prediction results are more accurate.
Embodiment two
Before the popularity-influencing features are input into the pre-trained multi-model predictor that integrates multiple prediction models, the multi-model predictor must first be obtained through training.
This embodiment builds the multi-model predictor based on the Stacking ensemble strategy.
The idea of Stacking is a layered model fusion. When fusing base learners trained on different data, the base learners can serve as first-layer models, and an additional meta-learner is trained to combine their outputs: the outputs of the first-layer models become the inputs of the meta-learner, which assigns weights to the first-layer models' outputs so as to select the most credible parts of the base learners' outputs as the final output.
This embodiment describes the steps for obtaining the multi-model predictor in detail. Fig. 2 shows a flowchart of the steps for obtaining the multi-model predictor according to the second embodiment of the present invention.
Step S210: integrate multiple prediction models based on the Stacking ensemble strategy.
In this embodiment, one of the multiple prediction models is selected as the meta-learner and the remaining prediction models serve as base learners; the base learners are placed in parallel, and the output of each base learner is connected to the input of the meta-learner.
In this embodiment, the multiple prediction models include, but are not limited to: a first XGBoost (extreme gradient boosting) algorithm model, an AdaBoost (adaptive boosting) algorithm model, a RandomForest (random forest) algorithm model, an Extremely Randomized Trees algorithm model, and a second XGBoost algorithm model.
Further, the first XGBoost algorithm model serves as a base learner and the second XGBoost algorithm model serves as the meta-learner. The first XGBoost, AdaBoost, RandomForest, and Extremely Randomized Trees algorithm models are placed in parallel, and their outputs are connected to the input of the second XGBoost algorithm model.
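The wiring just described can be sketched with scikit-learn's stacking support. This is a minimal sketch under stated assumptions, not the patent's implementation: GradientBoostingRegressor stands in for XGBoost (the xgboost package is not assumed available), and the synthetic regression data stands in for the popularity features:

```python
# Four parallel base learners feeding one gradient-boosting meta-learner,
# mirroring the layout: first XGBoost + AdaBoost + RandomForest +
# Extremely Randomized Trees -> second XGBoost (meta-learner).
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)

base_learners = [
    ("xgb1", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("ada", AdaBoostRegressor(n_estimators=50, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("ert", ExtraTreesRegressor(n_estimators=50, random_state=0)),
]
meta_learner = GradientBoostingRegressor(n_estimators=50, random_state=0)

# cv=5 reproduces the 5-fold out-of-fold scheme used later to build the
# meta-training set.
predictor = StackingRegressor(estimators=base_learners,
                              final_estimator=meta_learner, cv=5)

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
predictor.fit(X, y)
popularity = predictor.predict(X[:3])
print(popularity.shape)  # one popularity value per input item
```

StackingRegressor trains the meta-learner on out-of-fold base-learner predictions internally, which corresponds to steps S310-S420 below.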
Step S220: train the integrated prediction models on a preset training data set to obtain the multi-model predictor for computing information popularity.
Divide the training data set into multiple data parts, each containing multiple sample data items of known popularity, and use these data parts to train the base learners and the meta-learner respectively.
Fig. 3 is a flowchart of the base-learner training steps. The steps shown in Fig. 3 are executed for each base learner.
Step S310: among the multiple data parts, select one data part in turn as the validation part and use the remaining data parts as training parts.
Step S320: train the base learner on the sample data in the training parts, then input each sample data item in the validation part into the trained base learner and store the predicted popularity of each sample data item in the meta-training data set.
Step S330: input all or part of the sample data in the training data set into the base learner and store the predicted popularity of each sample data item in the meta-prediction data set.
Step S340: judge whether every data part has served as the validation part. If so, end the training of this base learner; if not, jump to step S310 and select the next data part as the validation part, with the remaining data parts as training parts, until every data part has served as the validation part and the training of this base learner is complete.
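The per-base-learner loop of steps S310-S340 is a standard out-of-fold prediction scheme. The sketch below uses a RandomForestRegressor as a stand-in for one base learner on synthetic data (both assumptions):

```python
# 5-fold out-of-fold predictions: each sample is predicted exactly once by a
# model that never saw it in training (meta-training set), and the held-out
# test set is predicted once per fold (meta-prediction set).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=6, noise=3.0, random_state=1)
X_test, _ = make_regression(n_samples=25, n_features=6, random_state=2)

meta_train = np.zeros(len(X))            # one OOF prediction per sample
meta_pred_folds = []                     # one test prediction array per fold

for train_idx, valid_idx in KFold(n_splits=5).split(X):
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(X[train_idx], y[train_idx])                # train on 4 parts
    meta_train[valid_idx] = model.predict(X[valid_idx])  # step S320
    meta_pred_folds.append(model.predict(X_test))        # step S330

# Every data part has served as validation once (step S340 satisfied).
print(meta_train.shape, len(meta_pred_folds))
```

The point of the rotation is that the meta-learner is later trained only on predictions the base learner made for unseen samples, which avoids leaking training labels into the second layer.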
Fig. 4 is a flowchart of the meta-learner training steps. After all base learners have been trained, the training steps of Fig. 4 are executed for the meta-learner.
Step S410: train the meta-learner using the predicted popularity of the sample data in the meta-training data set and the known popularity of the corresponding sample data in the training data set.
Step S420: verify the accuracy of the trained meta-learner using the predicted popularity of the sample data in the meta-prediction data set and the known popularity of the corresponding sample data in the training data set, and judge whether the accuracy of the popularity predicted by the meta-learner exceeds a preset training threshold. If so, end the training and obtain the multi-model predictor; otherwise, jump to step S310 and retrain the base learners and the meta-learner.
After the meta-learner training is complete, a multi-model predictor comprising multiple base learners and one meta-learner is obtained; this multi-model predictor can be used to predict the popularity of the information to be predicted.
To make this embodiment easier to understand, the training process is further described below for three base learners and one meta-learner.
Fig. 5 shows a training schematic diagram according to the second embodiment of the present invention.
In the training data set, a test set and a training set are divided, and the training set is divided into five data parts. For example: the training data set contains 12,500 rows of sample data; the training set contains 10,000 rows and the test set Test-Predict contains 2,500 rows. The base learners in the upper layer perform 5-fold cross-validation, so the training set is divided into five data parts: in each round, 8,000 rows of the training set serve as the training part Data-Learn and the remaining 2,000 rows serve as the validation part Data-Predict. Of course, a different 2,000 rows of sample data are used in each cross-validation round.
The training steps for base learner Model-k (k = 1, 2, 3) are described below. Fig. 5 only schematically shows the connection between base learner Model-1 and the meta-learner; base learners Model-2 and Model-3 are connected to the meta-learner in the same way as Model-1.
Step S1, m=m+1 are cross-checked into the m times.
In the present embodiment, the initial value of m is 0.
Step S2, using 8000 sample datas training base learner Model-k of Data-Learn, in training completion Afterwards, 2000 sample datas of Data-Predict are inputted into base learner Model-k, using base learner Model-k to this 2000 sample datas carry out Popularity prediction respectively, export 2000 prediction results, and be stored in first training dataset.
Training base learner Model-k is a process of continually adjusting the parameters of Model-k so that it outputs increasingly accurate prediction results.
Since the popularity of every sample in the training data set is known, the known popularity of a sample can be used to measure whether the predicted popularity (the prediction result) output by the base learner is accurate. For example: if the difference between the known popularity and the predicted popularity of a sample is less than a preset first accuracy threshold, the training of base learner Model-k is completed; if the difference is greater than or equal to the first accuracy threshold, the parameters of Model-k are adjusted.
Further, when the training data set is constructed, a unique code can be assigned to each sample; after the predicted popularity of a sample is obtained, its known popularity can be looked up by this code.
Step S3: input the 2,500 sample rows of Test-Predict into the trained base learner Model-k, predict the popularity of these 2,500 rows with Model-k, output the 2,500 prediction results, and store them in the meta-prediction data set.
Step S4: judge whether m equals 5. If so, end the training of base learner Model-k; if not, jump back to step S1.
Through the above five loop iterations (five cross-validation rounds), 10,000 (5 × 2,000) prediction results are stored in the meta-training data set, and 12,500 (5 × 2,500) prediction results are stored in the meta-prediction data set.
For base learner Model-1, the 5 × 2,000 prediction results stored in the meta-training data set are stacked into a 10,000-row matrix, labeled A1. For the 5 × 2,500 prediction results stored in the meta-prediction data set, the corresponding rows of the five groups of 2,500 predictions are averaged with weights (each base learner Model-k corresponds to a first weight): the five first-row predictions are weight-averaged, the five second-row predictions are weight-averaged, and so on, yielding a 2,500-row matrix labeled B1. Likewise, A2 and B2 are obtained for base learner Model-2, and A3 and B3 for base learner Model-3, so that six matrices A1, A2, A3, B1, B2, B3 are obtained for the three base learners.
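The per-base-learner cross-validation loop and the construction of the Ak and Bk matrices can be sketched as below. A decision tree stands in for the base learner, the data are random placeholders, and equal first weights are assumed for the weighted average, since the patent does not fix their values:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(10000, 8)), rng.normal(size=10000)
X_test = rng.normal(size=(2500, 8))

A_k = np.zeros(10000)   # out-of-fold predictions -> one column of the meta-training set
test_preds = []         # five groups of 2,500 predictions on Test-Predict

for learn_idx, predict_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X_train):
    model_k = DecisionTreeRegressor(max_depth=4, random_state=0)
    model_k.fit(X_train[learn_idx], y_train[learn_idx])        # step S2: train on Data-Learn
    A_k[predict_idx] = model_k.predict(X_train[predict_idx])   # step S2: predict Data-Predict
    test_preds.append(model_k.predict(X_test))                 # step S3: predict Test-Predict

# B_k: row-wise weighted average of the five groups of 2,500 test predictions.
first_weights = np.full(5, 1 / 5)   # equal first weights assumed for illustration
B_k = np.average(np.vstack(test_preds), axis=0, weights=first_weights)
```

Running this once per base learner yields the columns A1..A3 (10,000 rows each) and B1..B3 (2,500 rows each).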
Step S5: input the cross-validation prediction results of each base learner Model-k into the meta-learner as features and train the meta-learner. Specifically, A1, A2 and A3 are input into the meta-learner, which learns to assign weights w to the prediction results in A1, A2 and A3 (the prediction results of each base learner Model-k correspond to a second weight); by adjusting the sizes of the second weights and the parameters of the meta-learner, the final prediction output by the meta-learner for each sample becomes as accurate as possible.
The larger a second weight is, the higher the confidence of the corresponding prediction result.
Since the popularity of every sample in the training data set is known, the known popularity of a sample can be used to measure whether the predicted popularity (the prediction result) output by the meta-learner is accurate. For example: if the difference between the known popularity and the predicted popularity of a sample is less than a preset second accuracy threshold, the training of the meta-learner is completed; if the difference is greater than or equal to the second accuracy threshold, the parameters of the meta-learner are adjusted and the second weights assigned to the prediction results of the different base learners Model-k are readjusted.
Step S6: input the prediction results in the meta-prediction data set into the meta-learner to verify its prediction ability. In other words, B1, B2 and B3 are input into the meta-learner to obtain the predicted popularity of each sample, and the prediction accuracy of the meta-learner is determined from the known popularity of each sample. If the prediction accuracy is less than a preset training threshold, jump back to step S1 and retrain the base learners Model-k and the meta-learner; if the prediction accuracy is greater than or equal to the training threshold, end the training and obtain the multi-model predictor.
For example: if the difference between the known popularity and the predicted popularity of a sample is less than a third accuracy threshold, the prediction is judged correct; if the difference is greater than or equal to the third accuracy threshold, the prediction is judged incorrect. The numbers of correct and incorrect predictions are counted to obtain the prediction accuracy (prediction accuracy = number of correct predictions ÷ (number of correct predictions + number of incorrect predictions)).
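Steps S5 and S6 together with the accuracy rule above can be sketched as follows. A linear regressor stands in for the meta-learner, the A and B matrices are simulated as noisy copies of the known popularity values, and the threshold values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for the meta-learner

rng = np.random.default_rng(2)
y_train, y_test = rng.normal(size=10000), rng.normal(size=2500)
# A1..A3 and B1..B3 simulated as noisy copies of the known popularity values
A = np.column_stack([y_train + rng.normal(scale=0.1, size=10000) for _ in range(3)])
B = np.column_stack([y_test + rng.normal(scale=0.1, size=2500) for _ in range(3)])

# Step S5: the meta-learner learns the "second weights" over the base learners' outputs.
meta = LinearRegression().fit(A, y_train)

# Step S6: verify prediction ability on the meta-prediction set B1..B3.
pred = meta.predict(B)
third_accuracy_threshold = 0.5                 # illustrative value
correct = np.abs(pred - y_test) < third_accuracy_threshold
accuracy = correct.sum() / correct.size        # correct ÷ (correct + incorrect)
training_threshold = 0.9                       # illustrative value
done = accuracy >= training_threshold          # otherwise jump back to step S1
```

With three base outputs whose noise is much smaller than the third accuracy threshold, the verification passes and training ends.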
Embodiment Three
The present embodiment provides an information popularity prediction device. Fig. 6 is a structural diagram of the information popularity prediction device according to the third embodiment of the invention.
In the present embodiment, the information popularity prediction device includes, but is not limited to: a processor 610 and a memory 620.
The processor 610 is configured to execute an information popularity prediction program stored in the memory 620, so as to implement the information popularity prediction method described in Embodiments One and Two.
Specifically, the processor 610 is configured to execute the information popularity prediction program stored in the memory 620 to perform the following steps: obtaining information to be predicted; extracting popularity-influencing features of the information to be predicted according to predetermined popularity-influencing feature categories; and inputting the popularity-influencing features into a pre-trained multi-model predictor that integrates multiple prediction models, and obtaining the popularity of the information to be predicted output by the multi-model predictor.
Wherein, before the popularity-influencing features are input into the pre-trained multi-model predictor integrating multiple prediction models, the method further includes: integrating multiple prediction models based on a Stacking ensemble strategy; and training the integrated multiple prediction models with a preset training data set to obtain the multi-model predictor for calculating information popularity.
Wherein, integrating multiple prediction models based on the Stacking ensemble strategy includes: among the multiple prediction models, selecting one prediction model as the meta-learner and the remaining prediction models as base learners; connecting the multiple base learners in parallel; and connecting the output of each base learner to the input of the meta-learner.
Wherein, training the integrated multiple prediction models with the preset training data set includes: dividing the training data set into multiple data portions, each data portion containing multiple samples of known popularity; and executing the following training steps for each base learner: step 12, among the multiple data portions, sequentially selecting one data portion as the verification part and the remaining data portions as training parts; step 14, training the base learner with the samples in the training parts, inputting each sample of the verification part into the trained base learner, and storing the predicted popularity of each output sample into a meta-training data set; step 16, inputting all or part of the samples of the training data set into the base learner and storing the predicted popularity of each output sample into a meta-prediction data set, until every data portion has served as the verification part, completing the training of the base learner. After the training of the multiple base learners is completed, the following training steps are executed for the meta-learner: step 22, training the meta-learner with the predicted popularity of the samples in the meta-training data set and the known popularity of the corresponding samples in the training data set; step 24, verifying the accuracy of the trained meta-learner with the predicted popularity of the samples in the meta-prediction data set and the known popularity of the corresponding samples in the training data set; if the accuracy of the predicted popularity output by the meta-learner is greater than a preset training threshold, ending the training and obtaining the multi-model predictor; otherwise, jumping back to step 12 and retraining the multiple base learners and the meta-learner.
Wherein, the multiple prediction models include: a first extreme gradient boosting (XGBoost) algorithm model, an adaptive boosting (AdaBoost) algorithm model, a random forest (RandomForest) algorithm model, an extremely randomized trees (Extremely Randomized Trees) algorithm model, and a second XGBoost algorithm model. Integrating the multiple prediction models based on the Stacking ensemble strategy includes: connecting the first XGBoost algorithm model, the AdaBoost algorithm model, the RandomForest algorithm model and the Extremely Randomized Trees algorithm model in parallel; and connecting the outputs of the first XGBoost, AdaBoost, RandomForest and Extremely Randomized Trees algorithm models to the input of the second XGBoost algorithm model.
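A minimal sketch of this stacking architecture using scikit-learn's `StackingRegressor` is shown below. `GradientBoostingRegressor` stands in for the two XGBoost models so the sketch needs only scikit-learn; with the xgboost package installed, `xgboost.XGBRegressor` could be substituted. The data are random placeholders:

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)

# Four parallel base learners; GradientBoostingRegressor stands in for the first XGBoost.
base_learners = [
    ("xgboost_1", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("adaboost", AdaBoostRegressor(n_estimators=50, random_state=0)),
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("extra_trees", ExtraTreesRegressor(n_estimators=50, random_state=0)),
]
# Their outputs feed the meta-learner (the "second XGBoost" of the patent).
meta_learner = GradientBoostingRegressor(n_estimators=50, random_state=0)

# cv=5 reproduces the five-fold cross-check used to build the meta-training set.
stack = StackingRegressor(estimators=base_learners, final_estimator=meta_learner, cv=5)

rng = np.random.default_rng(3)
X, y = rng.normal(size=(300, 6)), rng.normal(size=300)   # placeholder feature matrix
stack.fit(X, y)
popularity = stack.predict(X[:5])                        # predicted popularity values
```

`StackingRegressor` internally performs the same out-of-fold construction as steps S1 to S4: each base learner is refit per fold and its out-of-fold predictions become the meta-learner's training features.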
Wherein, the popularity-influencing features include: attribute features and environment features. The attribute features include: content features of the information to be predicted. The environment features include: competition-intensity features and continuity features of the information to be predicted in the network.
Wherein, the attribute features include: the text length, title length, number of pictures, publication time, publication date, topic distribution and reprint sequence of the information to be predicted. The competition-intensity feature includes: the amount of information published in the network within a first preset time period. The continuity feature includes: the topic similarity between the information to be predicted and hot information; the popularity of the information published in the network within a second time period is sorted in descending order, and the top N items are taken as the hot information, N ≥ 1.
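The hot-information selection and the topic-similarity continuity feature can be sketched as below. Cosine similarity between topic distributions is one assumed similarity measure, since the patent does not specify the formula, and the popularity values are placeholders:

```python
import numpy as np

def top_n_hot(popularities, n):
    """Indices of the N most popular items published in the second time period."""
    order = np.argsort(popularities)[::-1]   # sort popularity in descending order
    return order[:n]

def topic_similarity(p, q):
    """Cosine similarity between two topic distributions (assumed measure)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

popularities = [120, 4500, 310, 2200, 90]
hot = top_n_hot(popularities, 2)   # items 1 and 3 are the hot information (N = 2)
sim = topic_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```

The continuity feature of a candidate item would then be, for example, its maximum topic similarity over the N hot items.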
Wherein, before the popularity-influencing features of the information to be predicted are extracted, the method further includes: obtaining features to be determined of sample information, the popularity of which is known; calculating, with a preset correlation analysis algorithm, the correlation coefficient between each feature to be determined and the popularity of the sample information; and determining the features to be determined whose correlation coefficient is greater than a preset correlation threshold as the popularity-influencing feature categories to be extracted.
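The correlation-based feature screening can be sketched as follows. Using the Pearson coefficient and taking its absolute value are assumptions for illustration, and the 0.3 threshold is a placeholder; the patent only requires a preset correlation analysis algorithm and a preset correlation threshold:

```python
import numpy as np

def select_features(candidates, popularity, threshold=0.3):
    """Keep candidate features whose |Pearson correlation| with the known
    popularity exceeds the preset correlation threshold."""
    kept = []
    for name, values in candidates.items():
        r = np.corrcoef(values, popularity)[0, 1]
        if abs(r) > threshold:
            kept.append(name)
    return kept

popularity = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
candidates = {
    "text_length": np.array([1.0, 2.1, 2.9, 4.2, 5.0]),  # strongly correlated
    "noise": np.array([1.0, -1.0, 1.0, -1.0, 1.0]),      # uncorrelated
}
kept = select_features(candidates, popularity)
```

Only features that clear the threshold are retained as popularity-influencing feature categories, which reduces computational complexity and the over-fitting risk noted in the advantages below.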
Embodiment Four
An embodiment of the invention further provides a storage medium. The storage medium here stores one or more programs. The storage medium may include a volatile memory such as a random access memory; it may also include a non-volatile memory such as a read-only memory, a flash memory, a hard disk or a solid-state disk; and it may also include a combination of the above kinds of memory.
The one or more programs in the storage medium can be executed by one or more processors to implement the above information popularity prediction method.
Specifically, the processor is configured to execute an information popularity prediction program stored in a memory to implement the following steps: obtaining information to be predicted; extracting popularity-influencing features of the information to be predicted according to predetermined popularity-influencing feature categories; and inputting the popularity-influencing features into a pre-trained multi-model predictor that integrates multiple prediction models, and obtaining the popularity of the information to be predicted output by the multi-model predictor.
Wherein, before the popularity-influencing features are input into the pre-trained multi-model predictor integrating multiple prediction models, the method further includes: integrating multiple prediction models based on a Stacking ensemble strategy; and training the integrated multiple prediction models with a preset training data set to obtain the multi-model predictor for calculating information popularity.
Wherein, integrating multiple prediction models based on the Stacking ensemble strategy includes: among the multiple prediction models, selecting one prediction model as the meta-learner and the remaining prediction models as base learners; connecting the multiple base learners in parallel; and connecting the output of each base learner to the input of the meta-learner.
Wherein, training the integrated multiple prediction models with the preset training data set includes: dividing the training data set into multiple data portions, each data portion containing multiple samples of known popularity; and executing the following training steps for each base learner: step 12, among the multiple data portions, sequentially selecting one data portion as the verification part and the remaining data portions as training parts; step 14, training the base learner with the samples in the training parts, inputting each sample of the verification part into the trained base learner, and storing the predicted popularity of each output sample into a meta-training data set; step 16, inputting all or part of the samples of the training data set into the base learner and storing the predicted popularity of each output sample into a meta-prediction data set, until every data portion has served as the verification part, completing the training of the base learner. After the training of the multiple base learners is completed, the following training steps are executed for the meta-learner: step 22, training the meta-learner with the predicted popularity of the samples in the meta-training data set and the known popularity of the corresponding samples in the training data set; step 24, verifying the accuracy of the trained meta-learner with the predicted popularity of the samples in the meta-prediction data set and the known popularity of the corresponding samples in the training data set; if the accuracy of the predicted popularity output by the meta-learner is greater than a preset training threshold, ending the training and obtaining the multi-model predictor; otherwise, jumping back to step 12 and retraining the multiple base learners and the meta-learner.
Wherein, the multiple prediction models include: a first extreme gradient boosting (XGBoost) algorithm model, an adaptive boosting (AdaBoost) algorithm model, a random forest (RandomForest) algorithm model, an extremely randomized trees (Extremely Randomized Trees) algorithm model, and a second XGBoost algorithm model. Integrating the multiple prediction models based on the Stacking ensemble strategy includes: connecting the first XGBoost algorithm model, the AdaBoost algorithm model, the RandomForest algorithm model and the Extremely Randomized Trees algorithm model in parallel; and connecting the outputs of the first XGBoost, AdaBoost, RandomForest and Extremely Randomized Trees algorithm models to the input of the second XGBoost algorithm model.
Wherein, the popularity-influencing features include: attribute features and environment features. The attribute features include: content features of the information to be predicted. The environment features include: competition-intensity features and continuity features of the information to be predicted in the network.
Wherein, the attribute features include: the text length, title length, number of pictures, publication time, publication date, topic distribution and reprint sequence of the information to be predicted. The competition-intensity feature includes: the amount of information published in the network within a first preset time period. The continuity feature includes: the topic similarity between the information to be predicted and hot information; the popularity of the information published in the network within a second time period is sorted in descending order, and the top N items are taken as the hot information, N ≥ 1.
Wherein, before the popularity-influencing features of the information to be predicted are extracted, the method further includes: obtaining features to be determined of sample information, the popularity of which is known; calculating, with a preset correlation analysis algorithm, the correlation coefficient between each feature to be determined and the popularity of the sample information; and determining the features to be determined whose correlation coefficient is greater than a preset correlation threshold as the popularity-influencing feature categories to be extracted.
The main advantages of the present invention are as follows:
First: the invention performs a correlation-based refinement of traditional features and filters out the features irrelevant to the prediction task, which improves the prediction precision of the prediction model while reducing computational complexity and prevents over-fitting during prediction.
Second: the invention considers not only the attribute features of the information itself but also extracts the external environment attributes of the information. Two features, the popularity competition-intensity feature and the continuity feature, are extracted, which fully characterize the continuity of information and the competitive relations among pieces of information.
Third: the invention proposes a multi-model prediction scheme based on the Stacking ensemble strategy, which combines the advantages of the individual base learners, improves prediction accuracy, overcomes the poor stability of single-learner predictions, and predicts the popularity of information more accurately and stably.
In summary, compared with traditional popularity prediction schemes, the present invention can effectively improve the accuracy and robustness of a hot-news prediction system.
The above description is only an embodiment of the present invention and is not intended to limit the invention; for those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of the claims of the present invention.

Claims (10)

1. An information popularity prediction method, characterized by comprising:
obtaining information to be predicted;
extracting popularity-influencing features of the information to be predicted according to predetermined popularity-influencing feature categories;
inputting the popularity-influencing features into a pre-trained multi-model predictor integrating multiple prediction models, and obtaining the popularity of the information to be predicted output by the multi-model predictor.
2. The method according to claim 1, characterized in that, before the popularity-influencing features are input into the pre-trained multi-model predictor integrating multiple prediction models, the method further comprises:
integrating multiple prediction models based on a Stacking ensemble strategy;
training the integrated multiple prediction models with a preset training data set to obtain the multi-model predictor for calculating information popularity.
3. The method according to claim 2, characterized in that the integrating multiple prediction models based on the Stacking ensemble strategy comprises:
among the multiple prediction models, selecting one prediction model as a meta-learner and the remaining prediction models as base learners;
connecting the multiple base learners in parallel;
connecting the output of each base learner to the input of the meta-learner.
4. The method according to claim 3, characterized in that the training the integrated multiple prediction models with the preset training data set comprises:
dividing the training data set into multiple data portions, each data portion comprising multiple samples of known popularity;
executing the following training steps for each base learner:
step 12: among the multiple data portions, sequentially selecting one data portion as a verification part and the remaining data portions as training parts;
step 14: training the base learner with the samples in the multiple training parts, inputting each sample of the verification part into the trained base learner, and storing the predicted popularity of each output sample into a meta-training data set;
step 16: inputting all or part of the samples of the training data set into the base learner, and storing the predicted popularity of each output sample into a meta-prediction data set; when every data portion has served as the verification part, the training of the base learner is completed;
after the training of the multiple base learners is completed, executing the following training steps for the meta-learner:
step 22: training the meta-learner with the predicted popularity of the samples in the meta-training data set and the known popularity of the corresponding samples in the training data set;
step 24: verifying the accuracy of the trained meta-learner with the predicted popularity of the samples in the meta-prediction data set and the known popularity of the corresponding samples in the training data set; if the accuracy of the predicted popularity output by the meta-learner is greater than a preset training threshold, ending the training to obtain the multi-model predictor; otherwise, jumping back to step 12 and retraining the multiple base learners and the meta-learner.
5. The method according to claim 3, characterized in that
the multiple prediction models comprise: a first extreme gradient boosting (XGBoost) algorithm model, an adaptive boosting (AdaBoost) algorithm model, a random forest (RandomForest) algorithm model, an extremely randomized trees (Extremely Randomized Trees) algorithm model, and a second XGBoost algorithm model;
the integrating multiple prediction models based on the Stacking ensemble strategy comprises:
connecting the first XGBoost algorithm model, the AdaBoost algorithm model, the RandomForest algorithm model and the Extremely Randomized Trees algorithm model in parallel;
connecting the outputs of the first XGBoost algorithm model, the AdaBoost algorithm model, the RandomForest algorithm model and the Extremely Randomized Trees algorithm model to the input of the second XGBoost algorithm model.
6. The method according to claim 1, characterized in that
the popularity-influencing features comprise: attribute features and environment features;
the attribute features comprise: content features of the information to be predicted;
the environment features comprise: competition-intensity features and continuity features of the information to be predicted in a network.
7. The method according to claim 6, characterized in that
the attribute features comprise: a text length, a title length, a number of pictures, a publication time, a publication date, a topic distribution and a reprint sequence of the information to be predicted;
the competition-intensity feature comprises: an amount of information published in the network within a first preset time period;
the continuity feature comprises: a topic similarity between the information to be predicted and hot information, wherein the popularity of the information published in the network within a second time period is sorted in descending order, and the top N items are taken as the hot information, N ≥ 1.
8. The method according to any one of claims 1 to 7, characterized in that, before the popularity-influencing features of the information to be predicted are extracted, the method further comprises:
obtaining features to be determined of sample information, wherein the popularity of the sample information is known;
calculating, with a preset correlation analysis algorithm, a correlation coefficient between each feature to be determined and the popularity of the sample information;
determining the features to be determined whose correlation coefficient is greater than a preset correlation threshold as the popularity-influencing feature categories to be extracted.
9. An information popularity prediction device, characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the computer program is executed by the processor, the steps of the method according to any one of claims 1 to 8 are implemented.
10. A storage medium, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the steps of the method according to any one of claims 1 to 8.
CN201910471730.2A 2019-05-31 2019-05-31 News popularity prediction method, equipment and storage medium Active CN110334275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910471730.2A CN110334275B (en) 2019-05-31 2019-05-31 News popularity prediction method, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110334275A true CN110334275A (en) 2019-10-15
CN110334275B CN110334275B (en) 2020-12-04

Family

ID=68140687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910471730.2A Active CN110334275B (en) 2019-05-31 2019-05-31 News popularity prediction method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110334275B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241392A (en) * 2020-01-07 2020-06-05 腾讯科技(深圳)有限公司 Method, device, equipment and readable storage medium for determining popularity of article
CN111476281A (en) * 2020-03-27 2020-07-31 北京微播易科技股份有限公司 Information popularity prediction method and device
CN112257868A (en) * 2020-09-25 2021-01-22 建信金融科技有限责任公司 Method and device for constructing and training integrated prediction model for predicting passenger flow
CN112288686A (en) * 2020-07-29 2021-01-29 深圳市智影医疗科技有限公司 Model training method and device, electronic equipment and storage medium
CN116434893A (en) * 2023-06-12 2023-07-14 中才邦业(杭州)智能技术有限公司 Concrete compressive strength prediction model, construction method, medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933622A (en) * 2015-03-12 2015-09-23 中国科学院计算技术研究所 Microblog popularity degree prediction method based on user and microblog theme and microblog popularity degree prediction system based on user and microblog theme
US9471884B2 (en) * 2014-05-30 2016-10-18 International Business Machines Corporation Multi-model blending
CN107423339A (en) * 2017-04-29 2017-12-01 天津大学 Popular microblogging Forecasting Methodology based on extreme Gradient Propulsion and random forest
CN108596398A (en) * 2018-05-03 2018-09-28 哈尔滨工业大学 Time Series Forecasting Methods and device based on condition random field Yu Stacking algorithms
CN109086932A (en) * 2018-08-02 2018-12-25 广东工业大学 A kind of prediction technique, system and the device of media information prevalence degree


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241392A (en) * 2020-01-07 2020-06-05 腾讯科技(深圳)有限公司 Method, device, equipment and readable storage medium for determining popularity of article
CN111241392B (en) * 2020-01-07 2024-01-26 腾讯科技(深圳)有限公司 Method, apparatus, device and readable storage medium for determining popularity of article
CN111476281A (en) * 2020-03-27 2020-07-31 北京微播易科技股份有限公司 Information popularity prediction method and device
CN111476281B (en) * 2020-03-27 2020-12-22 北京微播易科技股份有限公司 Information popularity prediction method and device
CN112288686A (en) * 2020-07-29 2021-01-29 深圳市智影医疗科技有限公司 Model training method and device, electronic equipment and storage medium
CN112288686B (en) * 2020-07-29 2023-12-19 深圳市智影医疗科技有限公司 Model training method and device, electronic equipment and storage medium
CN112257868A (en) * 2020-09-25 2021-01-22 建信金融科技有限责任公司 Method and device for constructing and training integrated prediction model for predicting passenger flow
CN116434893A (en) * 2023-06-12 2023-07-14 中才邦业(杭州)智能技术有限公司 Concrete compressive strength prediction model, construction method, medium and electronic equipment
CN116434893B (en) * 2023-06-12 2023-08-29 中才邦业(杭州)智能技术有限公司 Concrete compressive strength prediction model, construction method, medium and electronic equipment

Also Published As

Publication number Publication date
CN110334275B (en) 2020-12-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant