WO2017161646A1 - Method for dynamically selecting optimal model by three-layer association for large data volume prediction - Google Patents

Method for dynamically selecting optimal model by three-layer association for large data volume prediction

Info

Publication number
WO2017161646A1
WO2017161646A1 PCT/CN2016/081481 CN2016081481W WO2017161646A1 WO 2017161646 A1 WO2017161646 A1 WO 2017161646A1 CN 2016081481 W CN2016081481 W CN 2016081481W WO 2017161646 A1 WO2017161646 A1 WO 2017161646A1
Authority
WO
WIPO (PCT)
Prior art keywords
algorithm
model
prediction
weight
data
Prior art date
Application number
PCT/CN2016/081481
Other languages
English (en)
French (fr)
Inventor
吴冬华
胡曼恬
宇特·亚历克西
闫兴秀
Original Assignee
南京华苏科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京华苏科技有限公司
Priority to US 16/085,315 (published as US20190087741A1)
Publication of WO2017161646A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Definitions

  • The invention relates to a method for predicting large data volumes by dynamically selecting an optimal model through a three-layer joint algorithm.
  • Roughly 250 trillion bytes of data are now generated every day, and more than 90% of all existing data was generated in the past two years.
  • A large amount of data is stored on computers in structured form. Once structured, the data is convenient to store but loses its logical connections. For example, two neighboring cells in a communication network influence each other, are mutually causal, and carry that pattern into the future.
  • What is stored in the computer is only two columns of data, with no explicit association or pattern. In practice there may be countless such columns, which hides the associations and patterns deeper and makes them more complex. To discover associations and capture patterns in such a large, complex data set and predict the future, a stable and accurate model is needed, which places higher demands on existing algorithms.
  • The first problem is that the wrong model may be chosen. Suppose a simulated column of data follows an oscillation whose period gradually shrinks (say, a sinusoid with a decreasing period) and the period is so large that, over a certain range, the local distribution looks linear; only from a sufficiently distant perspective does its true shape appear. Over a limited time span its pattern is very likely to be captured incorrectly, and in practice, if there is not enough data, or the data has not accumulated sufficiently, problems can arise when the model is selected.
  • The second problem is that when a large amount of different data must be predicted, a model has to be selected for each data column, so a great deal of time is spent on model selection. Even then, the problem above (selecting the wrong model) cannot be avoided; what is desired is that the selection procedure for each model be simple and scientific, and that the model's predictions be stable and relatively accurate.
  • The third problem is that fast, dynamic prediction cannot be achieved.
  • The modeling workflow has to be restarted: analysis, modeling, evaluation. Clearly this does not meet the need for fast, dynamic prediction.
  • What is desired is that a new column of data, like the data that has already been modeled, can intelligently select a ready-made model for prediction and related processing while guaranteeing the accuracy of the results.
  • The present invention analyzes the three problems in detail and finds what they have in common.
  • With large data volumes, the predicted values often deviate considerably from the observed values, and the error grows as the prediction horizon increases.
  • The present invention provides a method for predicting large data volumes by dynamically selecting an optimal model through a three-layer joint algorithm.
  • The most suitable model can be selected dynamically, and models with poor prediction performance are discarded.
  • This guarantees the stability of the results on the one hand and keeps the error within a reasonable range on the other.
  • A method for predicting large data volumes by dynamically selecting an optimal model through three-layer association comprises a prediction model algorithm library, a weighting algorithm library, and an optimal-weighting-algorithm selection algorithm; the prediction model algorithm library is at the bottom layer, the weighting algorithm library sits above the prediction algorithm model library, and the optimal-weighting-algorithm selection algorithm sits above the weighting algorithm library;
  • Prediction model algorithm library: contains several prediction model algorithms, which are abstracted behind a common interface, placed at the lowest layer of the joint algorithm, and provide the prediction functionality that supports the higher layers;
  • Weighting algorithm library: masks the diversity of the bottom-layer algorithms of the prediction algorithm library; based on the prediction results of the underlying algorithms, it selects and combines them according to several criteria to form several weighting algorithms;
  • Optimal-weighting-algorithm selection algorithm: selects the optimal weighting algorithm for prediction according to how well each weighting algorithm performs on the validation set.
  • Models are fitted to the data using two or more different algorithms to obtain each candidate model.
  • Preprocessing the training data specifically includes:
  • Time format processing: mapping the time column to consecutive integers;
  • Data imputation: interpolating missing data and erroneous data.
  • The weighting algorithms are as follows:
  • Algorithm 1: give all prediction models the same weight;
  • Algorithm 2: exclude the 20% of models whose prediction results are relatively poor and give the remaining models the same weight;
  • Algorithm 3: compute the root mean square error of each model, design an inverse-trend function of that error, and assign a weight to each model according to the function;
  • Algorithm 4: compute the least absolute error of each model, design an inverse-trend function of that error, and assign a weight to each model according to the function;
  • Algorithm 5: compute the least-squares error of each model, design an inverse-trend function of that error, and assign a weight to each model according to the function;
  • Algorithm 6: compute the Akaike information criterion of each model, design an inverse-trend function of the criterion, and assign a weight to each model according to the function.
  • Each prediction model is given a corresponding weight, data prediction is performed, and predicted data is stored.
  • The optimal weighting algorithm is selected according to how well each weighting algorithm predicts on the test set; the specific steps of the optimal-weighting-algorithm selection algorithm are as follows:
  • The data sets predicted by the weighting library are compared with the validation set to obtain the errors, and the minimum error identifies the optimal weighting algorithm;
  • The data predicted by the optimal weighting method is stored to obtain the prediction result.
  • The beneficial effect of the invention is that its three-layer structure has four characteristics: high scalability, prediction stability, dynamic adjustment of the models, and indifference of the prediction data to the model.
  • This application uses a joint algorithm that avoids some shortcomings of commonly used algorithms; by assigning weights to multiple models, it organically combines multiple algorithms, giving the best-adapted algorithms high weights and the relatively poor algorithms low weights, which guarantees both the accuracy of the data prediction and the stability of the prediction as the data length grows.
  • FIG. 1 is a schematic diagram of the method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the composite KPI error rate of the ARIMA algorithm in the embodiment.
  • FIG. 3 is a schematic diagram of the error rate of the Holt-Winters algorithm on the KPIs in the embodiment.
  • FIG. 4 is a schematic diagram of the error rate of the ARIMA algorithm on the KPIs in the embodiment.
  • The embodiment uses a joint algorithm that circumvents some of the shortcomings of commonly used algorithms; by assigning weights to multiple models, it organically combines multiple algorithms, giving the best-adapted algorithms high weights.
  • The relatively poor algorithms are given low weights, which ensures both the accuracy of the data prediction and the stability of the prediction as the data length grows. The joint algorithm was subsequently applied in experiments and achieved the expected effect, performing well in both stability and accuracy.
  • The method consists of three layers: the prediction model algorithm library, the weighting algorithm library, and the optimal-weighting-algorithm selection algorithm.
  • The prediction model algorithm library contains various classical algorithms, improved classical algorithms, and some patented algorithms. These algorithms are abstracted behind a common interface, placed at the bottom of the joint algorithm, and provide the prediction functionality that supports the higher layers.
  • Above the prediction algorithm model library is the weighting algorithm layer.
  • The weighting layer encapsulates the prediction algorithm library and masks the diversity of the lowest-level algorithms, so the user does not need to consider the parameters, periods, convergence, errors, and so on of the underlying algorithms.
  • Based on the prediction results of the underlying algorithms, the weighting layer selects and combines them according to several criteria (for example, averaging all underlying results, discarding some of the worst results, assigning weights by RMSE, assigning weights by OLS, assigning weights by AIC, assigning weights by LAD) to form several weighting algorithms.
  • The weighting algorithms differ only in their mathematical characteristics, not in any physical sense; the differences stem both from the characteristics of the prediction data itself and from the chosen weighting formula, so different weighting algorithms suit different data. Before deciding which weighting algorithm to use, a judgment must be made on the results of the validation set, and this judgment should be made automatically by an algorithm: the third layer of the joint algorithm, the optimal-weighting-algorithm selection algorithm. The third-layer algorithm wraps the weighting algorithms and selects the optimal weighting algorithm for prediction according to how well each weighting algorithm performs on the validation set.
  • The three-layer structure of the method has four characteristics: high scalability, prediction stability, dynamic adjustment of the models, and indifference of the prediction data to the model.
  • The algorithm also has a shortcoming, inefficiency; weighed against the rapid growth of computer hardware and software performance and the rapid maturation of distributed technology, this shortcoming is insignificant compared with the other four characteristics.
  • The bottom-layer prediction model algorithm library includes various classical algorithms, improved classical algorithms, and some patented algorithms, including ar, mr, arma, holtwinters, var, svar, svec, garch, svm, and fourier. Each has its own applicable scenarios.
  • Stationary series can use arma, arima, var, svar, and svec; non-stationary series must be made stationary before the stationary algorithms can be applied, while the remaining algorithms can be used on non-stationary series directly. For high-dimensional data, svm can be considered.
  • Multivariate time series can use the var algorithm, and the garch model has certain advantages for long-horizon prediction.
  • Each algorithm has multiple parameters; for example, the arima parameters p, d, q can be set in many combinations.
  • svar and svec are variations of var, and the garch algorithm extends the applicable scope of the arch algorithm.
  • For some algorithms, the predictions available on the training set and on the test set differ, and this cannot be ignored; for example, in the HOLT-WINTERS training set, the boundary values of the first period cannot be predicted, whereas ARIMA can predict them.
  • Some models, such as VAR, require multiple periods and need special handling.
  • Because a uniform, difference-free interface is provided to the layer above, all of the differences described above need to be masked.
  • The specific approach is: if a model has multiple parameters, set up an independent model for each parameter setting, and set up each variant as an independent model as well.
  • The arima parameters p, d, q have 32 combinations, so 32 models are set up; for example, arima(1,1,0) and arima(2,1,0) count as two different models.
  • Variants are likewise set up as separate models; for example, var and svec are different variants of the same family and are set up as different models. For models whose boundary values cannot be predicted, the boundary values are not taken into account when computing the error.
  • The predictions for the first period of the HOLT-WINTERS training set do not exist.
  • This part of the error is not counted in the overall error.
  • The effect of excluding this part on the actual prediction is small.
  • Multi-period models are handled separately, and their predictions are then assembled into arrays ordered by time.
  • For the VAR model, the values predicted over multiple times form a matrix; the matrix values are read in order and stored as an array, so the values in the array are exactly the values sorted by time, making the format consistent with the other prediction results and easy to compare.
  • Above the prediction algorithm library is the weighting algorithm library.
  • The principle of the weighting algorithm library is to select the best; even so, the criterion of "best" is not unique. In other words, the "best" here is hard to pin down, and what is "best" on the validation set may not carry over into the more distant future: an overfitted model, for instance, performs well on the validation set but not on the prediction set. The weighting algorithm library therefore uses the six weighting methods outlined in the overview.
  • The weighting algorithms select and combine the results in the prediction model library according to their respective principles, forming six algorithms with different emphases. The purpose is to capture as many characteristics of the data as possible in the hope that they carry over well to the prediction set; even when they do not, the parameters can be adjusted dynamically, reducing the influence of the "bad" models and increasing the accuracy of the prediction.
  • In the formulas of the six weighting algorithms:
  • e_i denotes the error of the i-th model;
  • x_i denotes the predicted value of the i-th variable;
  • y_i denotes the observed value of the i-th variable;
  • g is the inverse-trend function defined by the formula.
  • For each integer i from 1 to the number of weighting algorithms, weighting algorithm i is called to compute the weights.
  • Each prediction model is given a corresponding weight, data prediction is performed, and predicted data is stored.
  • The top layer is the optimal-weighting-algorithm selection algorithm; on the basis of the six weighting algorithms, it selects the best weighting algorithm.
  • The selection criterion is how well the six weighting algorithms predict on the test set.
  • The optimal-weighting-algorithm selection algorithm is invoked to obtain the predicted data and store it.
  • First come data collection and data processing; an algorithm model is built according to the three-layer structure, data prediction is performed with the joint algorithm and with the general algorithms, and the prediction results are obtained.
  • The experiment has two parts.
  • The first part feeds the training data into the commonly used models for training and prediction to obtain error data.
  • The training data is then fed into the joint algorithm model for training and prediction to obtain error data.
  • The second part compares the errors of the joint algorithm model and the general models on the training set and on the test set in order to evaluate the effect of the joint algorithm.
  • First come the collection and processing of the data.
  • The data is generated every half hour.
  • A total of 121 days were collected: 5808 data points for each of 6 uplink KPIs and 6 downlink KPIs of 1500 cells, i.e., data from July 29, 2014 to November 26, 2014.
  • The training data is fed into the general models for training and prediction, and the prediction data and error data obtained by each model are saved; the training data is then fed into the joint algorithm model for training, and the prediction data and error data are saved.
  • The prediction effects of the joint algorithm and the general models are compared: for each, the prediction error on the training set, the prediction error on the prediction set, and the difference between the training-set and prediction-set prediction errors are computed.
  • These three values are given weights of 0.3, 0.3, and 0.4 respectively.
  • The composite error value is obtained.
  • FIG. 2 is a schematic diagram of the composite KPI error rate of the ARIMA algorithm in the embodiment.
  • FIG. 3 is a schematic diagram of the error rate of the Holt-Winters algorithm on the KPIs in the embodiment.
  • FIG. 4 is a schematic diagram of the error rate of the ARIMA algorithm on the KPIs in the embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for predicting large data volumes by dynamically selecting an optimal model through three-layer association, comprising three layers: a prediction model algorithm library, a weighting algorithm library, and an optimal-weighting-algorithm selection algorithm. The prediction model algorithm library is placed at the bottom layer, the weighting algorithm library sits above the prediction algorithm model library, and the optimal-weighting-algorithm selection algorithm sits above the weighting algorithm library. The three-layer structure of this method has four characteristics: high scalability, prediction stability, dynamic adjustment of the models, and indifference of the prediction data to the model. The method uses a joint algorithm that avoids some shortcomings of commonly used algorithms; by assigning weights to multiple models, it organically combines multiple algorithms, giving the best-adapted algorithms high weights and the relatively poor algorithms low weights, which guarantees both the accuracy of the data prediction and the stability of the prediction as the data length grows.

Description

Method for dynamically selecting optimal model by three-layer association for large data volume prediction
Technical Field
The present invention relates to a method for predicting large data volumes by dynamically selecting an optimal model through a three-layer joint algorithm.
Background Art
Roughly 250 trillion bytes of data are now generated every day, and more than 90% of all existing data was generated in the past two years. A large amount of data is stored on computers in structured form. Once structured, the data is convenient to store but loses its logical connections. For example, two neighboring cells in a communication network influence each other, are mutually causal, and carry that pattern into the future, yet what is stored in the computer is only two columns of data with no explicit association or pattern. In practice there may be countless such columns, which hides the associations and patterns deeper and makes them more complex. To discover associations and capture patterns in such a large and complex data set in order to predict the future, a stable and accurate model is required, which places higher demands on existing algorithms.
To obtain such an ideal model, it is necessary to analyze the conventional modeling process. When predicting on the basis of a large amount of data, the first step is to analyze the characteristics of the data with statistical and visualization methods, for example whether it is linear or nonlinear, what its period is, what its lag is, and what distribution it follows. If no clear characteristics emerge at this step, the data must be mathematically transformed and the above steps repeated on the transformed data until clear mathematical characteristics are obtained; modeling is then based on those characteristics. This modeling workflow is sound and achieves its goal in the vast majority of cases. Sometimes, however, it runs into problems.
The first problem is that the wrong model may be chosen. Suppose a column of data is simulated to follow an oscillation whose period gradually shrinks (say, a sinusoid with a decreasing period), and the period is made so large that, over a certain range, the local distribution looks linear; only from a sufficiently distant perspective does its true shape become visible. Over a limited time span, its pattern is very likely to be captured incorrectly. In practice, if there is not enough data, or the data has not accumulated to a sufficient extent, problems are likely to arise when the model is selected. Moreover, once a model has been chosen, there may be no opportunity to choose another, or even to correct the model itself, because the model was evaluated favorably at the beginning, moved into formal development, and deployed in production; when the data grows or conditions change, re-selecting the model is no longer considered. As data accumulates, or when long-term data must be predicted, the problem surfaces and the prediction quality deteriorates badly.
The second problem is that when a large amount of different data must be predicted, a model has to be selected for each data column, which requires a great deal of time for model selection. Even then, the problem described above, selecting the wrong model, cannot be avoided. What is desired is that the selection procedure for each model be simple and scientific, and that the model's predictions be stable and relatively accurate.
The third problem is that fast, dynamic prediction cannot be achieved. When a new column of data needs to be modeled and predicted, the modeling workflow has to be restarted: analysis, modeling, evaluation. Clearly this does not meet the need for fast, dynamic prediction. What is desired is that this new column of data, like the data that has already been modeled, can intelligently select a ready-made model for prediction and related processing while guaranteeing the accuracy of the results.
Summary of the Invention
To solve the above problems, the present invention analyzes the three problems in detail and finds what they have in common: with large data volumes, the predicted values often deviate considerably from the observed values, and the error grows as the prediction horizon increases. To avoid excessive errors, the present invention provides a method for predicting large data volumes by dynamically selecting an optimal model through a three-layer joint algorithm. During prediction, the most suitable model can be selected dynamically and models with poor prediction performance are discarded, which on the one hand guarantees the stability of the results and on the other hand keeps the error within a reasonable range.
The technical solution of the present invention is as follows:
A method for predicting large data volumes by dynamically selecting an optimal model through three-layer association, comprising three layers: a prediction model algorithm library, a weighting algorithm library, and an optimal-weighting-algorithm selection algorithm. The prediction model algorithm library is placed at the bottom layer, the weighting algorithm library sits above the prediction algorithm model library, and the optimal-weighting-algorithm selection algorithm sits above the weighting algorithm library;
Prediction model algorithm library: contains several prediction model algorithms, which are abstracted behind a common interface, placed at the lowest layer of the joint algorithm, and provide the prediction functionality that supports the higher layers;
Weighting algorithm library: masks the diversity of the bottom-layer algorithms of the prediction algorithm library; based on the prediction results of the underlying algorithms, it selects and combines them according to several criteria to form several weighting algorithms;
Optimal-weighting-algorithm selection algorithm: selects the optimal weighting algorithm for prediction according to how well each weighting algorithm performs on the validation set.
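A minimal sketch of the common-interface idea behind the bottom layer is given below. It is an illustration only, with toy stand-in models in place of the ARIMA, Holt-Winters, VAR, SVM and other algorithms named later in the description; the class and variable names (ForecastModel, MODEL_LIBRARY, and so on) are assumptions, not taken from the patent.

```python
# Sketch of the common forecasting interface that masks the underlying algorithms.
from abc import ABC, abstractmethod
from typing import List


class ForecastModel(ABC):
    """Common interface: every underlying algorithm exposes only fit() and forecast()."""

    @abstractmethod
    def fit(self, series: List[float]) -> "ForecastModel":
        ...

    @abstractmethod
    def forecast(self, steps: int) -> List[float]:
        ...


class MeanModel(ForecastModel):
    """Illustrative stand-in: predicts the historical mean."""

    def fit(self, series: List[float]) -> "MeanModel":
        self.level = sum(series) / len(series)
        return self

    def forecast(self, steps: int) -> List[float]:
        return [self.level] * steps


class DriftModel(ForecastModel):
    """Illustrative stand-in: extrapolates the average first difference."""

    def fit(self, series: List[float]) -> "DriftModel":
        self.last = series[-1]
        self.drift = (series[-1] - series[0]) / max(1, len(series) - 1)
        return self

    def forecast(self, steps: int) -> List[float]:
        return [self.last + self.drift * (h + 1) for h in range(steps)]


# The "prediction model algorithm library" is then just a collection of such objects.
MODEL_LIBRARY: List[ForecastModel] = [MeanModel(), DriftModel()]
```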
Further, the specific implementation steps of the prediction model algorithm library are as follows:
Input the training data; preprocess the training data to obtain the working data;
Fit models to the working data with two or more different algorithms to obtain the candidate models.
Further, preprocessing the training data specifically includes:
Data screening: removing data columns that are too sparse;
Time format processing: mapping the time column to consecutive integers;
Data imputation: interpolating missing data and erroneous data.
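A sketch of these preprocessing steps follows, assuming a simple timestamp-to-value mapping as input. The function name and the neighbour-based fill used for imputation are assumptions; the patent does not fix a particular interpolation method.

```python
# Sketch of time-format processing and data imputation (illustrative only).
import math
from typing import Dict, List, Optional


def preprocess(raw: Dict[str, Optional[float]]) -> List[float]:
    """raw maps a timestamp string to a value; None/NaN marks missing or bad points."""
    # Time format processing: sort timestamps and use their position 0..n-1 as the
    # consecutive-integer time axis (the list index below).
    timestamps = sorted(raw)
    values = [raw[t] for t in timestamps]

    def is_bad(v: Optional[float]) -> bool:
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Data imputation: fill missing or erroneous points from neighbouring values.
    for i, v in enumerate(values):
        if is_bad(v):
            left = next((values[j] for j in range(i - 1, -1, -1) if not is_bad(values[j])), None)
            right = next((values[j] for j in range(i + 1, len(values)) if not is_bad(values[j])), None)
            if left is not None and right is not None:
                values[i] = (left + right) / 2.0
            else:
                values[i] = left if left is not None else right
    return values
```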
Further, the weighting algorithms are as follows:
Algorithm 1: give all prediction models the same weight;
Algorithm 2: exclude the 20% of models whose prediction results are relatively poor and give the remaining models the same weight;
Algorithm 3: compute the root mean square error of each model, design an inverse-trend function of the root mean square error, and assign a weight to each model according to that function;
Algorithm 4: compute the least absolute error of each model, design an inverse-trend function of that error, and assign a weight to each model according to that function;
Algorithm 5: compute the least-squares error of each model, design an inverse-trend function of that error, and assign a weight to each model according to that function;
Algorithm 6: compute the Akaike information criterion of each model, design an inverse-trend function of the criterion, and assign a weight to each model according to that function.
Further, the specific implementation steps of the prediction model algorithm library are as follows:
Call the prediction model library to obtain the prediction data sets of the prediction models;
Call each weighting algorithm in turn and compute the weights;
Assign each prediction model its corresponding weight, perform the data prediction, and store the predicted data.
Further, the optimal weighting algorithm is selected according to how well each weighting algorithm predicts on the test set; the specific steps of the optimal-weighting-algorithm selection algorithm are as follows:
Call the algorithms of the weighting algorithm library to obtain the set of weighted predictions;
Compare the data sets predicted by the weighting library with the validation set to obtain the errors;
From the minimum error, obtain the optimal weighting algorithm;
Store the data predicted by the optimal weighting method to obtain the prediction result.
The beneficial effects of the present invention are as follows. The three-layer structure of this method for predicting large data volumes by dynamically selecting an optimal model has four characteristics: high scalability, prediction stability, dynamic adjustment of the models, and indifference of the prediction data to the model. The present application uses a joint algorithm that avoids some shortcomings of commonly used algorithms; by assigning weights to multiple models, it organically combines multiple algorithms, giving the best-adapted algorithms high weights and the relatively poor algorithms low weights, which guarantees both the accuracy of the data prediction and the stability of the prediction as the data length grows.
Description of the Drawings
FIG. 1 is a schematic diagram of the method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the composite KPI error rate of the ARIMA algorithm in the embodiment.
FIG. 3 is a schematic diagram of the error rate of the Holt-Winters algorithm on the KPIs in the embodiment.
FIG. 4 is a schematic diagram of the error rate of the ARIMA algorithm on the KPIs in the embodiment.
Detailed Description of the Embodiments
Preferred embodiments of the present invention are described in detail below with reference to the drawings.
When predicting cell KPIs, the predicted data needs to be accurate and stable, but in practice it often is not, because a given algorithm has its own limitations and applicability, so some data is predicted poorly. In this situation, the embodiment uses a joint algorithm that avoids some shortcomings of commonly used algorithms; by assigning weights to multiple models, it organically combines multiple algorithms, giving the best-adapted algorithms high weights and the relatively poor algorithms low weights, which guarantees both the accuracy of the data prediction and the stability of the prediction as the data length grows. The joint algorithm was subsequently applied in experiments and achieved the expected effect, performing well in both stability and accuracy.
Embodiment
As shown in FIG. 1, the method for predicting large data volumes by dynamically selecting an optimal model through three-layer association consists of three layers: the prediction model algorithm library, the weighting algorithm library, and the optimal-weighting-algorithm selection algorithm.
The prediction model algorithm library contains various classical algorithms, improved classical algorithms, and some patented algorithms. These algorithms are abstracted behind a common interface, placed at the bottom of the joint algorithm, and provide the prediction functionality that supports the higher layers.
Above the prediction algorithm model library is the weighting algorithm layer. It wraps the prediction algorithm library and masks the diversity of the bottom-layer algorithms, so the user does not need to consider the parameters, periods, convergence, errors, and so on of the underlying algorithms. Based on the prediction results of the underlying algorithms, the weighting layer selects and combines them according to several criteria (for example, averaging all underlying results, discarding some of the worst results, assigning weights by RMSE, assigning weights by OLS, assigning weights by AIC, assigning weights by LAD) to form several weighting algorithms.
These weighting algorithms differ only in their mathematical characteristics, not in any physical sense. The differences stem both from the characteristics of the prediction data itself and from the chosen weighting formula, so different weighting algorithms suit different data. Before deciding which weighting algorithm to use, a judgment must be made based on the results on the validation set, and this judgment should be made automatically by an algorithm. That is the third layer of the joint algorithm, the optimal-weighting-algorithm selection algorithm, which wraps the weighting algorithms and selects the optimal weighting algorithm for prediction according to how well each weighting algorithm performs on the validation set.
The three-layer structure of the method has four characteristics: high scalability, prediction stability, dynamic adjustment of the models, and indifference of the prediction data to the model. The algorithm also has a shortcoming: inefficiency. Weighed against the rapid growth of computer hardware and software performance and the rapid maturation of distributed technology, this shortcoming is insignificant compared with the other four characteristics.
The bottom-layer prediction model algorithm library contains various classical algorithms, improved classical algorithms, and some patented algorithms, including ar, mr, arma, holtwinters, var, svar, svec, garch, svm, and fourier. Each has its own applicable scenarios: stationary series can use arma, arima, var, svar, and svec; non-stationary series must be made stationary before the stationary algorithms can be applied, while the remaining algorithms can be used on non-stationary series directly. For high-dimensional data, svm can be considered. Multivariate time series can use the var algorithm, and the garch model has certain advantages for long-horizon prediction. In addition, each algorithm has multiple parameters; for example, the ARIMA parameters p, d, q can be set in many combinations. Each algorithm also has several variants; for example, svar and svec are variations of var, and the garch algorithm extends the applicable scope of the arch algorithm. Different algorithms also differ in the format of their input data, and for some algorithms the predictions available on the training set and on the test set are not the same; for example, in the HOLT-WINTERS training set, the boundary values of the first period cannot be predicted, whereas ARIMA can predict them. Furthermore, some models, such as VAR, require multiple periods and need special handling.
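The applicability notes above can be read as a simple dispatch rule; the sketch below encodes one such reading, where the thresholds and helper names are assumptions rather than anything specified in the patent.

```python
# Illustrative dispatch of candidate algorithms by data characteristics (assumed thresholds).
from typing import List


def candidate_algorithms(is_stationary: bool, n_dimensions: int, n_series: int) -> List[str]:
    candidates: List[str] = []
    if is_stationary:
        candidates += ["arma", "arima", "var", "svar", "svec"]
    else:
        # Non-stationary series: either difference/transform first, or use the
        # algorithms that do not require stationarity.
        candidates += ["holtwinters", "garch", "svm", "fourier"]
    if n_dimensions > 10:   # "high-dimensional" cutoff is an assumption
        candidates.append("svm")
    if n_series > 1:        # multivariate time series
        candidates.append("var")
    return sorted(set(candidates))
```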
Because a uniform, difference-free interface must be provided to the layer above, all of the differences described above need to be masked. The specific approach is as follows: if a model has multiple parameters, an independent model is set up for each parameter setting, and each variant is also set up as an independent model. For example, the ARIMA parameters p, d, q have 32 combinations, so 32 models are set up; arima(1,1,0) and arima(2,1,0) count as two different models. Variants are likewise set up separately; for example, var and svec are different variants of the same family and are set up as different models. For models whose boundary values cannot be predicted, the boundary values are not taken into account when computing the error; for example, the predictions for the first period of the HOLT-WINTERS training set do not exist, so this part of the error is not counted in the overall error, and evaluation shows that excluding it has little effect on the actual prediction. Multi-period models are handled separately, and their predictions are then assembled into arrays ordered by time. For the VAR model, for example, the values predicted over multiple times form a matrix; the matrix is read row by row and stored as an array, so that the values in the array are exactly the values sorted by time, making the format consistent with the prediction results of the other models and easy to compare.
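The masking step can be illustrated as follows. The (p, d, q) grid shown is an assumed one that happens to yield the 32 combinations mentioned above; the function names and the use of None for unpredictable boundary points are likewise assumptions.

```python
# Sketch: enumerate parameter settings as independent models, flatten a VAR-style
# forecast matrix into a time-ordered array, and skip boundary values in the error.
from itertools import product
from typing import List, Optional


def arima_model_ids(p_values=(0, 1, 2, 3), d_values=(0, 1), q_values=(0, 1, 2, 3)) -> List[str]:
    # Each (p, d, q) combination is registered as its own independent model
    # (4 * 2 * 4 = 32 with the assumed grid above).
    return [f"arima({p},{d},{q})" for p, d, q in product(p_values, d_values, q_values)]


def flatten_forecast_matrix(matrix: List[List[float]]) -> List[float]:
    # Read the matrix row by row so the result is a single time-ordered array,
    # consistent with the output format of the other models.
    return [value for row in matrix for value in row]


def error_excluding_boundary(pred: List[Optional[float]], obs: List[float]) -> float:
    # Boundary points a model cannot predict are stored as None and skipped.
    pairs = [(p, o) for p, o in zip(pred, obs) if p is not None]
    return sum(abs(p - o) for p, o in pairs) / max(1, len(pairs))
```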
Above the prediction algorithm library is the weighting algorithm library. Its principle is to select the best, yet even so, the criterion of "best" is not unique; in other words, the "best" here is hard to pin down, and what is "best" on the validation set may not carry over into the more distant future. An overfitted model, for instance, performs well on the validation set but not on the prediction set. The weighting algorithm library therefore uses six weighting methods, as mentioned in the overview.
These six weighting algorithms select and combine the results in the prediction model library according to their respective principles, forming six algorithms with different emphases. The purpose is to capture as many characteristics of the data as possible, in the hope that these characteristics carry over well to the prediction set; even when they do not, the parameters can be adjusted dynamically, reducing the influence of the "bad" models and increasing the accuracy of the prediction.
The six weighting algorithms are:
1) Give all prediction models the same weight, w = 1/n, where n is the number of models;
[Corrected under Rule 91, 24.08.2016]
2) Sort the errors of all models (e1, e2, ..., en), keep the 80% of models with the smaller errors, and give the remaining models the same weight Wnew, Wnew = 1/m, where m is the number of models after screening;
3) Compute the root mean square error (RMSE) of each model, design an inverse-trend function of the RMSE, and assign a weight to each model according to that function: w = g(f(e1, e2, ..., en)), where ei = error_value;
f ~ f(1/rmse(x1, x2, ..., xn; y1, y2, ..., yn)), where xi = forecast_value and yi = observation_value;
[Formula image PCTCN2016081481-appb-000001; the RMSE in the formula is the standard $\mathrm{rmse}(x_1,\dots,x_n;y_1,\dots,y_n)=\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-y_i)^2}$, and g is a decreasing (inverse-trend) function of this quantity.]
In the formula above, ei denotes the error of the i-th model, xi the predicted value of the i-th variable, yi the observed value of the i-th variable, and g is the inverse-trend function defined by the formula.
4) Same principle as 3), but based on the least absolute deviation;
5) Same principle as 3), but based on the ordinary-least-squares error;
6) Same principle as 3): first compute the Akaike information criterion (AIC), design the inverse-trend function from it, and then compute the weights.
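The six schemes can be sketched as below. The inverse-trend function g is taken here as normalized reciprocals, which is one natural reading of the formula rather than the patented definition, and the AIC is approximated from the residual sum of squares under a Gaussian assumption; all function names are illustrative.

```python
# Sketch of the six weighting schemes (an illustration, not the patented code).
import math
from typing import Dict, List


def rmse(pred: List[float], obs: List[float]) -> float:
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def lad(pred: List[float], obs: List[float]) -> float:
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def sse(pred: List[float], obs: List[float]) -> float:
    return sum((p - o) ** 2 for p, o in zip(pred, obs))


def inverse_trend_weights(scores: Dict[str, float]) -> Dict[str, float]:
    # Smaller score -> larger weight; weights sum to 1.
    inv = {m: 1.0 / (s + 1e-12) for m, s in scores.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}


def weights(scheme: int, preds: Dict[str, List[float]], obs: List[float],
            n_params: Dict[str, int]) -> Dict[str, float]:
    models = list(preds)
    if scheme == 1:   # equal weights
        return {m: 1.0 / len(models) for m in models}
    if scheme == 2:   # drop the worst 20%, equal weights for the rest
        ranked = sorted(models, key=lambda m: rmse(preds[m], obs))
        kept = ranked[: max(1, int(len(ranked) * 0.8))]
        return {m: (1.0 / len(kept) if m in kept else 0.0) for m in models}
    if scheme == 3:   # inverse-trend function of the RMSE
        return inverse_trend_weights({m: rmse(preds[m], obs) for m in models})
    if scheme == 4:   # inverse-trend function of the least absolute deviation
        return inverse_trend_weights({m: lad(preds[m], obs) for m in models})
    if scheme == 5:   # inverse-trend function of the least-squares error
        return inverse_trend_weights({m: sse(preds[m], obs) for m in models})
    if scheme == 6:   # inverse-trend function of an AIC-style score (Gaussian approximation)
        n = len(obs)
        aic = {m: n * math.log(sse(preds[m], obs) / n + 1e-12) + 2 * n_params[m] for m in models}
        shift = min(aic.values())   # AIC can be negative; shift before inverting
        return inverse_trend_weights({m: a - shift + 1.0 for m, a in aic.items()})
    raise ValueError("scheme must be 1..6")
```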
The specific implementation steps of the prediction model algorithm library are as follows.
Input: training data;
Output: data predicted by the weighting model library;
Call the prediction model library to obtain the prediction data set data_fcst of the prediction models;
For each integer i in (1 to the number of weighting algorithms), call weighting algorithm i and compute the weights;
Assign each prediction model its corresponding weight, perform the data prediction, and store the predicted data.
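The combination step itself is a weighted, point-by-point sum of the model forecasts; a minimal sketch follows (names are assumptions).

```python
# Sketch: combine the model forecasts with their weights.
from typing import Dict, List


def combine_forecasts(preds: Dict[str, List[float]], w: Dict[str, float]) -> List[float]:
    horizon = len(next(iter(preds.values())))
    return [sum(w[m] * preds[m][t] for m in preds) for t in range(horizon)]


# Example: two models, weights 0.7 / 0.3.
preds = {"arima(1,1,0)": [10.0, 11.0], "holtwinters": [12.0, 12.0]}
print(combine_forecasts(preds, {"arima(1,1,0)": 0.7, "holtwinters": 0.3}))  # approx. [10.6, 11.3]
```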
The top layer is the optimal-weighting-algorithm selection algorithm. On the basis of the six weighting algorithms, it selects the best one; the selection criterion is how well the six weighting algorithms predict on the test set.
The specific implementation steps of the optimal-weighting-algorithm selection algorithm are as follows.
Input: training data
Output: predicted data
1) Call the algorithms of the weighting algorithm library to obtain the set of weighted predictions.
2) Compare the data sets predicted by the weighting library with the validation set to obtain the errors.
3) From the minimum error, obtain the optimal weighting algorithm.
4) Store the data predicted by the optimal weighting method to obtain the prediction result.
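A sketch of this top-layer selection, reusing the weighting and combination sketches above; the use of RMSE as the validation-set error is an assumption, since the text only speaks of "the error".

```python
# Sketch: pick the weighting scheme with the smallest validation error.
from typing import Dict, List, Tuple


def select_optimal_scheme(preds_val: Dict[str, List[float]], obs_val: List[float],
                          n_params: Dict[str, int]) -> Tuple[int, Dict[str, float]]:
    best_scheme, best_weights, best_err = -1, {}, float("inf")
    for scheme in range(1, 7):
        w = weights(scheme, preds_val, obs_val, n_params)   # weighting library (layer 2)
        combined = combine_forecasts(preds_val, w)          # weighted prediction
        err = rmse(combined, obs_val)                       # compare with the validation set
        if err < best_err:
            best_scheme, best_weights, best_err = scheme, w, err
    return best_scheme, best_weights
```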
When predicting multiple data columns (cells, CELL) under multiple indicators (KPI), the steps are as follows:
Input: training data
Output: prediction data
For each data column of each indicator, call the optimal-weighting-algorithm selection algorithm, obtain the predicted data, and store it.
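The outermost loop can be sketched as follows; forecast_with_optimal_scheme stands for the whole per-series pipeline (fit the model library, compute weights, select the optimal scheme), and the nested dictionary layout of the data is an assumption.

```python
# Sketch: loop over indicators (KPIs) and data columns (cells).
from typing import Callable, Dict, List

Series = List[float]


def predict_all(data: Dict[str, Dict[str, Series]],
                forecast_with_optimal_scheme: Callable[[Series, int], Series],
                horizon: int) -> Dict[str, Dict[str, Series]]:
    results: Dict[str, Dict[str, Series]] = {}
    for kpi, cells in data.items():            # 12 KPIs in the experiment
        results[kpi] = {}
        for cell, series in cells.items():     # 1500 cells in the experiment
            results[kpi][cell] = forecast_with_optimal_scheme(series, horizon)
    return results
```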
Experimental Verification
To evaluate the effect of the joint algorithm, 12 KPIs of 1500 cells were selected for experiments, so as to compare the accuracy and stability of the joint algorithm with those of the general algorithms.
The experimental steps are as follows:
First, data collection and data processing; an algorithm model is then built according to the three-layer structure, and data prediction is performed with the joint algorithm and with the general algorithms respectively to obtain the prediction results.
The results of the two kinds of algorithms are then collated, and the accuracy and stability of the joint algorithm model are compared with those of the general models, on which basis the overall effect of the joint algorithm model is evaluated.
The experiment has two parts. In the first part, the training data is fed into the commonly used models for training and prediction to obtain error data, and then into the joint algorithm model for training and prediction to obtain error data. In the second part, the errors of the joint algorithm model and the general models on the training set and on the test set are compared in order to evaluate the effect of the joint algorithm.
Experimental Data
First, the data was collected and processed. The data is generated every half hour; 121 days were collected in total, 5808 data points for each of 6 uplink KPIs and 6 downlink KPIs of 1500 cells, i.e., data from July 29, 2014 to November 26, 2014.
To ensure the integrity of the data, missing values and erroneous values need to be handled. Missing values and NaN values are interpolated accordingly; if there are too many NaN or missing values, the data of that cell is removed.
Experimental Method
First, the training data is fed into the general models for training and prediction, and the prediction data and error data obtained by each model are saved; then the training data is fed into the joint algorithm model for training and prediction, and the resulting prediction data and error data are saved. Finally, the prediction effects of the joint algorithm and the general models are compared: for the general models and for the joint algorithm, the prediction error on the training set, the prediction error on the prediction set, and the difference between the training-set and prediction-set errors are computed. These three values are given weights of 0.3, 0.3, and 0.4 respectively, and the composite error value is finally obtained.
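Written out, the composite error described above is as follows, with E_train and E_pred denoting the training-set and prediction-set errors; whether the difference term is meant in absolute value is not stated in the text.

$$E_{\text{composite}} \;=\; 0.3\,E_{\text{train}} \;+\; 0.3\,E_{\text{pred}} \;+\; 0.4\,\bigl(E_{\text{train}} - E_{\text{pred}}\bigr)$$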
Experimental Results
After comparing the joint algorithm with the general algorithms, the errors of the 12 KPIs of the 1500 cells on the training set and on the test set were obtained, as shown in FIG. 2, FIG. 3 and FIG. 4. FIG. 2 is a schematic diagram of the composite KPI error rate of the ARIMA algorithm in the embodiment. FIG. 3 is a schematic diagram of the error rate of the Holt-Winters algorithm on the KPIs in the embodiment. FIG. 4 is a schematic diagram of the error rate of the ARIMA algorithm on the KPIs in the embodiment.
From the data in FIG. 2, FIG. 3 and FIG. 4, the error of the joint algorithm improves on that of the general algorithms by about 9% on the training set and about 13% on the prediction set, and the composite error value improves by about 12%.

Claims (6)

  1. A method for predicting large data volumes by dynamically selecting an optimal model through three-layer association, characterized by comprising three layers: a prediction model algorithm library, a weighting algorithm library, and an optimal-weighting-algorithm selection algorithm, the prediction model algorithm library being placed at the bottom layer, the weighting algorithm library being above the prediction algorithm model library, and the optimal-weighting-algorithm selection algorithm being above the weighting algorithm library;
    the prediction model algorithm library containing several prediction model algorithms, which are abstracted behind a common interface, placed at the lowest layer of the joint algorithm, and provide the prediction functionality that supports the higher layers;
    the weighting algorithm library masking the diversity of the bottom-layer algorithms of the prediction algorithm library and, based on the prediction results of the underlying algorithms, selecting and combining them according to several criteria to form several weighting algorithms;
    the optimal-weighting-algorithm selection algorithm selecting the optimal weighting algorithm for prediction according to how well each weighting algorithm performs on the validation set.
  2. The method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to claim 1, characterized in that the specific implementation steps of the prediction model algorithm library are as follows:
    inputting training data; preprocessing the training data to obtain working data;
    fitting models to the working data with two or more different algorithms to obtain the candidate models.
  3. The method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to claim 2, characterized in that preprocessing the training data specifically includes:
    data screening: removing data columns that are too sparse;
    time format processing: mapping the time column to consecutive integers;
    data imputation: interpolating missing data and erroneous data.
  4. The method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to any one of claims 1 to 3, characterized in that the weighting algorithms are as follows:
    algorithm 1: giving all prediction models the same weight;
    algorithm 2: excluding the 20% of models whose prediction results are relatively poor and giving the remaining models the same weight;
    algorithm 3: computing the root mean square error of each model, designing an inverse-trend function of the root mean square error, and assigning a weight to each model according to that function;
    algorithm 4: computing the least absolute error of each model, designing an inverse-trend function of that error, and assigning a weight to each model according to that function;
    algorithm 5: computing the least-squares error of each model, designing an inverse-trend function of that error, and assigning a weight to each model according to that function;
    algorithm 6: computing the Akaike information criterion of each model, designing an inverse-trend function of the Akaike information criterion, and assigning a weight to each model according to that function.
  5. The method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to any one of claims 1 to 3, characterized in that the specific implementation steps of the prediction model algorithm library are as follows:
    calling the prediction model library to obtain the prediction data sets of the prediction models;
    calling each weighting algorithm in turn and computing the weights;
    assigning each prediction model its corresponding weight, performing the data prediction, and storing the predicted data.
  6. The method for predicting large data volumes by dynamically selecting an optimal model through three-layer association according to any one of claims 1 to 3, characterized in that the optimal weighting algorithm is selected according to how well each weighting algorithm predicts on the test set; the specific steps of the optimal-weighting-algorithm selection algorithm are as follows:
    calling the algorithms of the weighting algorithm library to obtain the set of weighted predictions;
    comparing the data sets predicted by the weighting library with the validation set to obtain the errors;
    obtaining the optimal weighting algorithm from the minimum error;
    storing the data predicted by the optimal weighting method to obtain the prediction result.
PCT/CN2016/081481 2016-03-23 2016-05-10 Method for dynamically selecting optimal model by three-layer association for large data volume prediction WO2017161646A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/085,315 US20190087741A1 (en) 2016-03-23 2016-05-10 Method for dynamically selecting optimal model by three-layer association for large data volume prediction

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201610168473 2016-03-23
CN201610168473.1 2016-03-23
CN201610192864.7 2016-03-30
CN201610192864 2016-03-30

Publications (1)

Publication Number Publication Date
WO2017161646A1 true WO2017161646A1 (zh) 2017-09-28

Family

ID=59899162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081481 WO2017161646A1 (zh) 2016-03-23 2016-05-10 Method for dynamically selecting optimal model by three-layer association for large data volume prediction

Country Status (2)

Country Link
US (1) US20190087741A1 (zh)
WO (1) WO2017161646A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6276732B2 (ja) * 2015-07-03 2018-02-07 横河電機株式会社 Equipment maintenance management system and equipment maintenance management method
US20190236497A1 (en) * 2018-01-31 2019-08-01 Koninklijke Philips N.V. System and method for automated model selection for key performance indicator forecasting
US11755937B2 (en) 2018-08-24 2023-09-12 General Electric Company Multi-source modeling with legacy data
US11842252B2 (en) * 2019-06-27 2023-12-12 The Toronto-Dominion Bank System and method for examining data from a source used in downstream processes
CN110321960A (zh) * 2019-07-09 2019-10-11 上海新增鼎网络技术有限公司 Method and system for predicting factory production factors
CN111144617B (zh) * 2019-12-02 2023-10-31 秒针信息技术有限公司 Method and device for determining a model
CN112105048B (zh) * 2020-07-27 2021-10-12 北京邮电大学 Combined forecasting method based on a dual-period Holt-Winters model and a SARIMA model
US20220343225A1 (en) * 2021-03-16 2022-10-27 Tata Consultancy Services Limited Method and system for creating events
CN113838522A (zh) * 2021-09-14 2021-12-24 浙江赛微思生物科技有限公司 Evaluation method for the likelihood that a gene mutation site affects splicing
CN113987055B (zh) * 2021-10-27 2024-07-09 西安科技大学 Dynamic prediction and visualization method for coal mine safety production based on big data analysis
CN115564223A (zh) * 2022-09-29 2023-01-03 合肥宝资智能科技有限公司 Method for planning the shortest production time of a workshop assembly line

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739614A (zh) * 2009-12-08 2010-06-16 江苏省邮电规划设计院有限责任公司 Method for hierarchical combination forecasting of communication traffic
CN102306336A (zh) * 2011-06-10 2012-01-04 浙江大学 Service selection framework based on collaborative filtering and QoS awareness
CN102663513A (zh) * 2012-03-13 2012-09-12 华北电力大学 Combination forecasting modeling method for wind farm power using grey relational analysis
CN102682207A (zh) * 2012-04-28 2012-09-19 中国科学院电工研究所 Ultra-short-term combination forecasting method for wind farm wind speed

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8122015B2 (en) * 2007-09-21 2012-02-21 Microsoft Corporation Multi-ranker for search
US11127020B2 (en) * 2009-11-20 2021-09-21 Palo Alto Research Center Incorporated Generating an activity inference model from contextual data
US20110218978A1 (en) * 2010-02-22 2011-09-08 Vertica Systems, Inc. Operating on time sequences of data
US8826277B2 (en) * 2011-11-29 2014-09-02 International Business Machines Corporation Cloud provisioning accelerator
US9053439B2 (en) * 2012-09-28 2015-06-09 Hewlett-Packard Development Company, L.P. Predicting near-future photovoltaic generation
US9697730B2 (en) * 2015-01-30 2017-07-04 Nissan North America, Inc. Spatial clustering of vehicle probe data
US10685262B2 (en) * 2015-03-20 2020-06-16 Intel Corporation Object recognition based on boosting binary convolutional neural network features
US10713594B2 (en) * 2015-03-20 2020-07-14 Salesforce.Com, Inc. Systems, methods, and apparatuses for implementing machine learning model training and deployment with a rollback mechanism
JP6275923B2 (ja) * 2015-10-28 2018-02-07 株式会社日立製作所 Measure evaluation system and measure evaluation method
US10454989B2 (en) * 2016-02-19 2019-10-22 Verizon Patent And Licensing Inc. Application quality of experience evaluator for enhancing subjective quality of experience
US10885461B2 (en) * 2016-02-29 2021-01-05 Oracle International Corporation Unsupervised method for classifying seasonal patterns
US10810491B1 (en) * 2016-03-18 2020-10-20 Amazon Technologies, Inc. Real-time visualization of machine learning models


Also Published As

Publication number Publication date
US20190087741A1 (en) 2019-03-21

Similar Documents

Publication Publication Date Title
WO2017161646A1 (zh) Method for dynamically selecting optimal model by three-layer association for large data volume prediction
KR102611938B1 (ko) Integrated circuit floorplan generation using neural networks
CN111507032B (zh) Component layout optimization design method based on deep learning for predicting temperature distribution
JP2023522567A (ja) Generating integrated circuit placements using neural networks
Moreno et al. An approach for characterizing workloads in google cloud to derive realistic resource utilization models
CN109472403B (zh) Medium- and long-term runoff forecasting method combining ensemble empirical mode decomposition and teleconnection
CN110502323B (zh) Real-time scheduling method for cloud computing tasks
Chen et al. $ d $ d-Simplexed: Adaptive Delaunay Triangulation for Performance Modeling and Prediction on Big Data Analytics
CN102857560A (zh) Cloud storage data distribution method for multi-service applications
CN106250933A (zh) FPGA-based data clustering method, system and FPGA processor
Hasan et al. Application of game theoretic approaches for identification of critical parameters affecting power system small-disturbance stability
CN106257506B (zh) Method for dynamically selecting optimal model by three-layer association for large data volume prediction
CN118355366A (zh) Database simulation modeling framework
CN110191015A (zh) Intelligent prediction method and device for cloud service performance based on CPI metrics
US20140365186A1 (en) System and method for load balancing for parallel computations on structured multi-block meshes in cfd
US10803218B1 (en) Processor-implemented systems using neural networks for simulating high quantile behaviors in physical systems
Chen et al. Silhouette: Efficient cloud configuration exploration for large-scale analytics
CN113158435B (zh) Ensemble-learning-based method and device for predicting the simulation run time of complex systems
CN110007371A (zh) Wind speed prediction method and device
Liao et al. Multicore parallel genetic algorithm with Tabu strategy for rainfall-runoff model calibration
CN112257977B (zh) Method and system for optimizing the duration of resource-constrained logistics projects under fuzzy working hours
CN111523685B (zh) Active-learning-based method for reducing performance modeling overhead
CN111522644A (zh) Method for predicting parallel program run time based on historical run data
CN111340276A (zh) Method and system for generating prediction data
Zhou et al. File heat-based Self-adaptive Replica Consistency Strategy for Cloud Storage.

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16894999

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16894999

Country of ref document: EP

Kind code of ref document: A1
