CN111563028A - Data center task scale prediction method based on time series data analysis

Data center task scale prediction method based on time series data analysis

Info

Publication number
CN111563028A
CN111563028A (application CN202010412587.2A)
Authority
CN
China
Prior art keywords
model
data
task scale
time series
data center
Prior art date
Legal status
Withdrawn
Application number
CN202010412587.2A
Other languages
Chinese (zh)
Inventor
周毅
肖俊
周波
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202010412587.2A
Publication of CN111563028A
Current legal status: Withdrawn

Classifications

    • G06F11/3442: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment, for planning or managing the needed capacity
    • G06F11/3447: Performance evaluation by modeling
    • G06F11/3452: Performance evaluation by statistical analysis
    • G06F11/3476: Data logging (performance evaluation by tracing or monitoring, G06F11/3466)
    • G06F9/5061: Allocation of resources, e.g. of the central processing unit [CPU]; Partitioning or combining of resources
    • G06N3/044: Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; Combinations of networks
    • G06F2209/508: Indexing scheme relating to G06F9/50; Monitor


Abstract

The invention discloses a data center task scale prediction method based on time series data analysis. The method can be applied to the reasonable allocation of data center resources, avoiding the data center jitter caused by frequent resource scheduling that would otherwise degrade the overall service quality of the data center; it can also be applied to the rapid detection of, and early warning for, abnormal input of the data center task scale when the deviation between the actual input volume and the predicted value is too large.

Description

Data center task scale prediction method based on time series data analysis
Technical Field
The invention relates to the technical field of time series statistics and data center task scale prediction, and in particular to a data center task scale prediction method based on time series data analysis.
Background
With the continuous development of big data technology, the amount of information grows rapidly and the data input volume of data centers becomes larger and larger. The data volume generally determines the cost of data center storage and computing resources. By analyzing and predicting the scale of the data input to the data center, storage and computing resources can be allocated and expanded intelligently, avoiding both the energy waste caused by over-provisioning resources and the data center jitter caused by repeated task re-allocation when resources are under-provisioned, which reduces the service quality of the data center. At the same time, reasonable prediction of the data input scale can be applied to intelligent operation and maintenance of the data center: when the deviation between the actual input volume and the predicted value is too large, abnormal input can be rapidly detected and an early warning issued.
In practical applications, the input data volume of a data center exhibits different time series characteristics in different scenarios and generally shows a growing trend. For a data center whose input data volume is large during the day and small at night, the series exhibits periodic, seasonal characteristics influenced by specific time points. Therefore, an appropriate prediction model must be adopted according to the specific input time series data of each scenario.
At present, there is still a lack of methods for constructing task scale prediction models tailored to the characteristics of the input data of different data centers, and a method for constructing such prediction models is urgently needed so that the task scale of a data center can be reasonably predicted.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a data center task scale prediction method based on time series data analysis, which constructs a task scale prediction model according to the characteristics of the input data of different data centers and reasonably predicts the data center task scale, thereby providing a prediction basis for scenarios such as the reasonable scheduling of data center resources and the rapid detection of, and early warning for, abnormal task scale input.
The purpose of the invention can be achieved by adopting the following technical scheme:
a data center task scale prediction method based on time series data analysis comprises the following steps:
T1, collecting the data center input data volume: collecting and recording the receiving information of the data center input node, wherein the receiving information comprises the input time point, whether the input succeeded and the input data amount, and writing the receiving information into a historical task scale database;
T2, establishing a task scale prediction model: reading the time series data of the most recent m months (m = 2, 3, 4, 5 or 6) from the historical task scale database, accumulating all successfully input data volumes over a fixed time granularity in the range of 20-40 minutes, taking the accumulated data volume as the data center task scale of that period, dividing the resulting series into a sample set and a verification set at a ratio of 8:2, removing abnormal values from the time series data, and finally establishing a single exponential smoothing prediction model, an autoregressive integrated moving average (ARIMA) model or a long short-term memory (LSTM) network model according to the characteristics of the processed time series data;
T3, verifying and evaluating the task scale prediction model: fitting the time series data of the next d days (d = 1, 2, 3 or 4) with the task scale prediction model established in step T2, calculating the mean absolute percentage error (MAPE) between the predicted and true values, and evaluating the effect of the currently adopted task scale prediction model; if the MAPE between the predicted and true values exceeds 10%, the effect is considered poor and the task scale prediction model needs to be replaced, and if the expected effect is achieved, proceeding to the next step;
T4, applying the task scale prediction model: predicting, with the task scale prediction model that achieved the expected effect in step T3, the change of the data center task scale over the next d days (d = 1, 2, 3 or 4) from the data center task scale of the current period, and providing, according to the prediction result, a basis for the reasonable allocation of data center resources and the detection of abnormal data center task scale.
Further, the data center input node is an HTTP or FTP loading service node.
Furthermore, the historical task scale database is selected from ElasticSearch or MySQL.
Further, the process of establishing the task scale prediction model in the step T2 is as follows:
T2.1, reading the data of the most recent m months (m = 2, 3, 4, 5 or 6) from the historical task scale database, accumulating all successfully input data volumes over a fixed time granularity in the range of 20-40 minutes as the data center task scale of the corresponding period, and dividing the series into a sample set and a verification set at a ratio of 8:2, wherein the sample set is used for fitting and training the task scale prediction model and the verification set is used for evaluating the prediction effect of the task scale prediction model;
T2.2, detecting abnormal values in the time series data by the Grubbs criterion or the 3-sigma criterion, and replacing the abnormal values of the time series data with the window average;
T2.3, determining the components of the time series data by plotting the time series, wherein the components of the time series data comprise trend, seasonal, cyclical and/or irregular fluctuations, and selecting the task scale prediction model according to the following rules: for time series data without trend and seasonal components, selecting the single exponential smoothing prediction model as the task scale prediction model; for time series data containing both trend and seasonal components, selecting the autoregressive integrated moving average (ARIMA) model as the task scale prediction model; and for time series data whose components cannot be determined, selecting the long short-term memory (LSTM) network model.
Further, the single exponential smoothing prediction model uses only one smoothing coefficient, which is the weight given to the actual value, and the predicted value at time t+1 in the single exponential smoothing prediction model is equal to the weighted average of the actual value and the predicted value at time t.
Further, the autoregressive integrated moving average (ARIMA) model is established as follows:
T2.3.01, preprocessing the time series data: stabilizing the time series data used by the ARIMA model with a data stabilization method including aggregation, smoothing, polynomial filtering and STL decomposition, and further specifying the differencing order of the model as first-order or second-order differencing so as to eliminate the trend of the time series data;
T2.3.02, verifying the usability of the time series data: verifying with the Dickey-Fuller test whether the preprocessed time series data is stationary at the given confidence level; if not, changing the data stabilization method and repeating step T2.3.01 until the preprocessed time series data passes the Dickey-Fuller test; meanwhile, verifying with the Ljung-Box test whether the preprocessed time series data is a purely random sequence; if it is, ending the prediction, and if it is not, proceeding to the next step;
T2.3.03, determining the relevant parameters and establishing the model: plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) diagrams to obtain the model lag values, and establishing the ARIMA model according to the lag values p and q;
T2.3.04, verifying the prediction effect of the model: inputting the test set into the established ARIMA model, outputting the time series data predicted by the ARIMA model, comparing the predicted values with the test set while monitoring the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2, and adjusting the ARIMA model parameters according to these index results.
Further, the long short-term memory (LSTM) network model is established as follows:
T2.3.11, normalizing the time series data: converting the receiving-information values collected at each data center input node into the [0,1] interval by a range (min-max) transformation;
T2.3.12, selecting the neuron: selecting the standard LSTM cell, a first LSTM variant or a second LSTM variant as the neuron according to the prediction effect, wherein the standard LSTM cell processes the time series data input at time t through a forget gate, an input gate and an output gate connected in sequence, the first LSTM variant couples the forget gate and the input gate so that the output of the forget gate is used as part of the input gate, and the second LSTM variant is constructed directly with the connection scheme of the gated recurrent unit (GRU);
T2.3.13, training the LSTM network model: constructing the input layer, hidden layer and output layer of the LSTM network model, inputting the training-set time series data, training the model by stochastic gradient descent, and outputting the time series data predicted by the LSTM network model, terminating the training when the difference between two consecutive outputs of the MSE cost function is less than 5%;
T2.3.14, verifying the prediction effect of the LSTM network model: inputting the test set into the trained LSTM network model, outputting the time series data predicted by the LSTM network model, comparing the predicted values with the test set while monitoring indices such as the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2, and adjusting the LSTM network model parameters and the choice of neuron according to these index results.
Further, the process of verifying and evaluating the task scale prediction model in step T3 is as follows:
T3.1, verifying the reasonableness of the task scale prediction model: using the Ljung-Box test to verify whether the residuals of the task scale prediction model satisfy pure randomness at the chosen significance level;
T3.2, evaluating the prediction effect of the task scale prediction model: predicting the time series data of the next d days (d = 1, 2, 3 or 4) with the task scale prediction model trained on the sample set, comparing the actual values of those days with the predicted values once the actual values have been collected, evaluating the effect of the currently adopted task scale prediction model by monitoring the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2, and replacing the task scale prediction model if a monitored index value exceeds the actual requirement.
Compared with the prior art, the invention has the following advantages and effects:
(1) The method selects different prediction models for different types of data center input time series data, so that the task scale prediction model is more targeted and the prediction effect is better.
(2) The invention applies, for the first time, the long short-term memory (LSTM) network model of the recurrent neural network (RNN) family and its variants to predicting the data center task scale.
(3) The method can be applied to on-demand resource scheduling of a data center based on task scale prediction, and can also be applied to rapid detection and early warning of abnormal input.
Drawings
FIG. 1 is a flow chart of a data center task size prediction method based on time series data analysis as disclosed in an embodiment of the present invention;
FIG. 2 is a flow chart of predictive model selection in an embodiment of the invention;
FIG. 3 is a flow chart of ARIMA model building in the embodiment of the present invention;
FIG. 4 is a neuron computation graph of the standard long short-term memory (LSTM) network model;
FIG. 5 is a neuron computation graph of one variant of the long short-term memory (LSTM) network model in an embodiment of the present invention;
FIG. 6 is a neuron computation graph of another variant of the long short-term memory (LSTM) network model in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
The embodiment provides a data center task scale prediction method based on time series data analysis, which constructs a task scale prediction model according to the characteristics of the input data of different data centers and reasonably predicts the data center task scale, thereby providing a prediction basis for scenarios such as the reasonable scheduling of data center resources and the rapid detection of, and early warning for, abnormal input.
Figures 1 to 6 illustrate the data center task scale prediction method based on time series data analysis, which follows the flowchart shown in fig. 1; the implementation specifically comprises the following steps.
T1, collecting the data center input data volume, comprising:
collecting and recording the receiving information of the data center input node, wherein the receiving information comprises the input time point, whether the input succeeded and the input data amount, and writing the receiving information into a historical task scale database.
T2, establishing a task scale prediction model, comprising:
Reading the time series data of the most recent m months (m = 2, 3, 4, 5 or 6) from the historical task scale database, accumulating all successfully input data volumes over a fixed time granularity in the range of 20-40 minutes, taking the accumulated data volume as the data center task scale of that period, dividing the resulting series into a sample set and a verification set at a ratio of 8:2, and removing abnormal values from the time series data. Finally, a suitable prediction model is selected according to the characteristics of the processed time series data (the specific selection procedure is shown in fig. 2), the parameters of the corresponding model are estimated, and the task scale prediction model is established.
T3, verifying and evaluating a task scale prediction model, comprising:
Fitting the time series data of the next d days (d = 1, 2, 3 or 4) with the task scale prediction model established in step T2, calculating the mean absolute percentage error (MAPE) between the predicted and true values, and evaluating the effect of the currently adopted task scale prediction model; if the MAPE between the predicted and true values exceeds 10%, the effect is considered poor and the task scale prediction model needs to be replaced, and if the expected effect is achieved, the next step is taken.
T4, applying the task scale prediction model, comprising:
predicting, with the task scale prediction model that achieved the expected effect in step T3, the change of the data center task scale over the next d days (d = 1, 2, 3 or 4) from the data center task scale of the current period, and providing, according to the prediction result, a basis for the reasonable allocation of data center resources and the detection of abnormal data center task scale.
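As an illustration of the anomaly-warning use of step T4, the following minimal Python sketch compares the actual task scale of each period with the model's prediction and flags periods whose relative deviation is too large. The 30% threshold is an assumption for illustration; the disclosure only requires that the deviation be "too large".

```python
# Hypothetical anomaly check for step T4: the threshold value is an assumption.
def check_task_scale_anomalies(actual, predicted, rel_threshold=0.30):
    """Return (period index, actual, predicted) tuples whose relative deviation exceeds the threshold."""
    alerts = []
    for i, (a, p) in enumerate(zip(actual, predicted)):
        if p > 0 and abs(a - p) / p > rel_threshold:
            alerts.append((i, a, p))
    return alerts
```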
In this embodiment, in the step T1 of collecting the data center input data volume, the data center input node is an HTTP or FTP loading service node, but the technical solution of the present invention is not limited to this example.
In this embodiment, d equals 1, that is, the change of the data center task scale over the next 1 day is predicted from the data center task scale of the current period, but the technical solution of the present invention is not limited to this example.
The historical task scale database is selected from ElasticSearch or MySQL, but is not limited to the above examples.
The specific process of the step T1 of collecting the data volume input by the data center is as follows:
T1.1, collecting and recording the receiving information at the data center input nodes (including but not limited to HTTP and FTP loading service nodes), recording the time point of each data input, whether the input succeeded and the input data volume.
T1.2, storing all records into the historical task scale database at regular intervals; the database may be ElasticSearch or MySQL.
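For illustration, a minimal Python sketch of steps T1.1-T1.2 is given below. The field names and the file-backed buffer are assumptions made for the sketch; in the disclosure the historical task scale database is ElasticSearch or MySQL.

```python
# Illustrative sketch of step T1: buffer receiving-information records at an input
# node and flush them to a store periodically. A JSON-lines file stands in for the
# ElasticSearch/MySQL database named in the disclosure; field names are assumptions.
import json
import time
from pathlib import Path

BUFFER = []

def record_input(num_bytes: int, success: bool) -> None:
    """Append one receiving-information record for a single input event."""
    BUFFER.append({
        "input_time": time.time(),   # input time point (epoch seconds)
        "success": success,          # whether the input succeeded
        "data_amount": num_bytes,    # input data amount in bytes
    })

def flush_to_history_db(path: str = "task_scale_history.jsonl") -> None:
    """Write all buffered records to the historical task scale store."""
    with Path(path).open("a", encoding="utf-8") as fh:
        for rec in BUFFER:
            fh.write(json.dumps(rec) + "\n")
    BUFFER.clear()

# Example usage: one successful 2 MB input followed by a periodic flush.
record_input(2 * 1024 * 1024, success=True)
flush_to_history_db()
```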
In this embodiment, the process of establishing the task scale prediction model in step T2 is as follows:
T2.1, generating the time series data sample set and verification set: reading the time series data of the most recent m months (m = 2, 3, 4, 5 or 6) from the historical task scale database, accumulating all successfully input data volumes over a fixed time granularity in the range of 20-40 minutes, taking the accumulated data volume as the data center task scale of that period, and dividing the series into a sample set and a verification set at a ratio of 8:2, wherein the sample set is used for fitting and training the task scale prediction model and the verification set is used for evaluating the prediction effect of the task scale prediction model.
In this embodiment, the time series data of the most recent 3 months are read from the historical task scale database and all successfully input data volumes are accumulated at a time granularity of 30 minutes, but the technical scheme of the present invention is not limited to this example.
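A minimal sketch of step T2.1 under the 30-minute setting of this embodiment follows; pandas is assumed tooling (the disclosure does not name a library), and the chronological 8:2 split keeps the later 20% of periods for verification.

```python
# Illustrative sketch of step T2.1: aggregate successful inputs per 30-minute
# period and split the series 8:2 in time order. Column names are assumptions.
import pandas as pd

def build_task_scale_series(records: pd.DataFrame) -> pd.Series:
    """records needs columns: input_time (datetime64), success (bool), data_amount."""
    ok = records[records["success"]]
    # Task scale of each 30-minute period = sum of successfully input data amounts.
    return ok.set_index("input_time")["data_amount"].resample("30min").sum()

def split_sample_verification(series: pd.Series):
    cut = int(len(series) * 0.8)            # 8:2 chronological split
    return series.iloc[:cut], series.iloc[cut:]
```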
T2.2, removing abnormal values from the time series data: abnormal values with large deviations may arise from lost input data or from read jitter in the data-input collection module; they can be removed by the Grubbs criterion or the 3-sigma criterion and replaced with the window average.
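The following sketch applies the 3-sigma criterion (the Grubbs criterion is the alternative named above): points that deviate from a rolling mean by more than three rolling standard deviations are replaced with the window average. The window length is an assumption.

```python
# Illustrative 3-sigma outlier replacement for step T2.2; window=48 assumes
# 30-minute granularity, i.e. a one-day window.
import pandas as pd

def remove_outliers_3sigma(series: pd.Series, window: int = 48) -> pd.Series:
    roll_mean = series.rolling(window, min_periods=1, center=True).mean()
    roll_std = series.rolling(window, min_periods=1, center=True).std().fillna(0.0)
    outlier = (series - roll_mean).abs() > 3 * roll_std
    cleaned = series.copy()
    cleaned[outlier] = roll_mean[outlier]   # replace abnormal values with the window average
    return cleaned
```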
T2.3, determining the time series data type and selecting the task scale prediction model. Time series data generally comprise four components: trend, seasonal, cyclical and irregular fluctuations; the specific components contained in the data can be determined visually by plotting the time series. For time series data without trend and seasonal components, the single exponential smoothing prediction model can be selected; for time series data containing both trend and seasonal components, the autoregressive integrated moving average (ARIMA) model can be adopted; and for time series data whose components cannot be determined, the long short-term memory (LSTM) network model is selected.
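The disclosure makes this determination visually from a plot; as an aid to that inspection, the sketch below uses statsmodels' seasonal_decompose (an assumed tool, not named in the disclosure) to separate trend, seasonal and residual parts before one of the three candidate models is chosen by hand.

```python
# Illustrative helper for step T2.3: visual inspection of time series components.
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

def inspect_components(series, period: int = 48):
    """period=48 assumes 30-minute granularity, i.e. one day per seasonal cycle."""
    decomposition = seasonal_decompose(series, model="additive", period=period)
    decomposition.plot()   # trend, seasonal and residual panels for visual judgment
    plt.show()
    return decomposition
```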
The single exponential smoothing prediction model is specifically as follows:
the first exponential smoothing prediction model only has one smoothing coefficient, and the essence of the smoothing coefficient is the weight of an actual value. For example, the predicted value at time t +1 is equal to a weighted average of the actual value and the predicted value at time t. Therefore, the longer the actual value is from the prediction time, the smaller the weight value is, the smaller the influence is, and conversely, the closer the actual value is from the prediction time, the larger the influence is on the prediction result.
The accuracy of the single exponential smoothing prediction model depends on the choice of the smoothing coefficient, which can be selected according to its characteristics: different smoothing coefficients affect the prediction result differently; when the smoothing coefficient is 0, the predicted value is simply the prediction of the previous period, and when the smoothing coefficient is 1, the predicted value is the actual value of the previous period. The closer the smoothing coefficient is to 1, the more promptly the single exponential smoothing prediction model reacts to changes in the time series data; conversely, the closer it is to 0, the more slowly the model reacts to such changes.
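A minimal sketch of the recursion described above, F(t+1) = α·y(t) + (1 - α)·F(t), is given below; the seed value and the default α are assumptions to be tuned.

```python
# Illustrative single exponential smoothing for the model described above.
def single_exponential_smoothing(values, alpha=0.3):
    forecasts = [values[0]]                 # seed the first forecast with the first actual value
    for y in values[:-1]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts                        # forecasts[i] is the prediction for period i

def forecast_next(values, alpha=0.3):
    """One-step-ahead prediction for the next period."""
    f = single_exponential_smoothing(values, alpha)
    return alpha * values[-1] + (1 - alpha) * f[-1]
```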
The process of establishing the autoregressive integrated moving average (ARIMA) model is specifically as follows:
T2.3.01, preprocessing the time series data. The ARIMA model requires the analyzed time series data to be stationary, so the time series data need to be stabilized by data stabilization methods such as aggregation, smoothing, polynomial filtering and STL decomposition; the differencing order i of the model, either first-order or second-order differencing, is further specified to eliminate the trend of the time series data.
T2.3.02, verifying the usability of the time series data. The preprocessed data are verified to be stationary at the given confidence level using the Dickey-Fuller test; if they are not, the stabilization method is modified and step T2.3.01 is repeated until the Dickey-Fuller test is passed. Meanwhile, the Ljung-Box test is used to verify whether the sequence is purely random; if it is, the prediction ends, and if it is not, the next step is taken.
T2.3.03, determining the relevant parameters and building the model. The autocorrelation function (ACF) and partial autocorrelation function (PACF) diagrams are plotted to obtain the model lag values, and the ARIMA model is established according to the obtained lag values p and q.
T2.3.04, verifying the prediction effect of the model. The test set is input into the established ARIMA model, the time series predicted by the ARIMA model is output, and the predicted values are compared with the test set while the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2 are monitored; the ARIMA model parameters are adjusted according to these index results.
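The sketch below walks through steps T2.3.01-T2.3.04 with statsmodels and scikit-learn as assumed tooling (the disclosure names no library). The differencing order, lag values, 5% significance level and the default order (2, 1, 2) are illustrative choices, not prescriptions of the disclosure.

```python
# Illustrative ARIMA workflow for steps T2.3.01-T2.3.04 (assumed tooling).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def build_and_check_arima(train, test, order=(2, 1, 2)):
    train, test = np.asarray(train, float), np.asarray(test, float)

    # T2.3.01/T2.3.02: difference the series, then Dickey-Fuller and Ljung-Box checks.
    diffed = np.diff(train, n=order[1])
    if adfuller(diffed)[1] > 0.05:
        raise ValueError("still non-stationary: change the stabilization method")
    if float(acorr_ljungbox(diffed, lags=[10])["lb_pvalue"].iloc[0]) > 0.05:
        raise ValueError("series looks purely random: prediction ends")

    # T2.3.03: inspect ACF/PACF plots to choose the lag values p and q by hand.
    plot_acf(diffed)
    plot_pacf(diffed)
    plt.show()

    # Fit ARIMA(p, d, q); for T2.3.04 monitor MAE/MSE/RMSE/MAPE/R2 on the test set.
    model = ARIMA(train, order=order).fit()
    pred = model.forecast(steps=len(test))
    mse = mean_squared_error(test, pred)
    return model, {
        "MAE": mean_absolute_error(test, pred),
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAPE": float(np.mean(np.abs((test - pred) / (test + 1e-9)))),
        "R2": r2_score(test, pred),
    }
```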
The process of establishing the long short-term memory (LSTM) network model is specifically as follows:
T2.3.11, time series data normalization. The receiving information collected and recorded at the data center input nodes (including but not limited to HTTP and FTP loading service nodes) varies greatly in value; values that deviate too far from 0 reduce the convergence quality and speed of the LSTM network model, so the receiving-information values collected at each data center input node need to be converted into the [0,1] interval by a range (min-max) transformation.
T2.3.12, selecting the neurons. The standard LSTM cell, the first LSTM variant or the second LSTM variant is selected as the neuron according to the prediction effect. The standard LSTM cell processes the time series data input at time t through a forget gate, an input gate and an output gate connected in sequence; the first LSTM variant couples the forget gate and the input gate, using the output of the forget gate as part of the input gate; and the second LSTM variant is constructed directly with the connection scheme of the gated recurrent unit (GRU).
T2.3.13, training the LSTM network model. The input layer, hidden layer and output layer of the LSTM network model are constructed, the training-set time series data are input, the model is trained by stochastic gradient descent, and the time series data predicted by the LSTM network model are output; the training terminates when the difference between two consecutive outputs of the MSE cost function is less than 5%.
T2.3.14, verifying the prediction effect of the LSTM network model. The test set is input into the trained LSTM network model, the time series predicted by the LSTM network model is output, and the predicted values are compared with the test set while the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2 are monitored; the LSTM network model parameters and the choice of neurons are adjusted according to these index results.
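A minimal sketch of steps T2.3.11-T2.3.13 follows, with TensorFlow/Keras as assumed tooling. The window length, layer size, batch size and learning rate are illustrative; the stopping rule approximates the disclosure's criterion of a less-than-5% change in the MSE cost between two consecutive outputs.

```python
# Illustrative LSTM training for steps T2.3.11-T2.3.13 (assumed tooling and sizes).
import numpy as np
import tensorflow as tf

def make_windows(series, lookback=48):
    """Turn a 1-D series into (samples, lookback, 1) inputs and next-step targets."""
    x, y = [], []
    for i in range(len(series) - lookback):
        x.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.asarray(x)[..., None], np.asarray(y)

def train_lstm(train_series, lookback=48, max_epochs=100):
    # T2.3.11: range (min-max) transformation into [0, 1].
    data = np.asarray(train_series, dtype="float32")
    lo, hi = float(data.min()), float(data.max())
    scaled = (data - lo) / (hi - lo + 1e-9)
    x, y = make_windows(scaled, lookback)

    # T2.3.12/T2.3.13: input layer, one LSTM hidden layer and a dense output layer,
    # trained by stochastic gradient descent on an MSE cost.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(lookback, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

    # Stop when the MSE cost changes by less than 5% between two consecutive epochs.
    prev = {"loss": None}
    def stop_on_small_change(epoch, logs):
        if prev["loss"] is not None and abs(prev["loss"] - logs["loss"]) < 0.05 * prev["loss"]:
            model.stop_training = True
        prev["loss"] = logs["loss"]

    model.fit(x, y, epochs=max_epochs, batch_size=32, verbose=0,
              callbacks=[tf.keras.callbacks.LambdaCallback(on_epoch_end=stop_on_small_change)])
    return model, (lo, hi)   # return the scaling bounds so predictions can be inverted
```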
In this embodiment, the process of the step T3, verifying and evaluating the task scale prediction model is as follows:
and T3.1, verifying the rationality of the task scale prediction model. Because the differential integration moving average autoregressive ARIMA model requires that the residual sequence of the ARIMA model conforms to pure randomness, namely, non-sequence correlation, Ljung-Box inspection is needed to verify the rationality of the task scale prediction model and inspect the significance level of the task scale prediction model.
And T3.2, evaluating the prediction effect of the task scale prediction model, predicting the time sequence data of 1 day in the future according to the task scale prediction model trained by the sample set, comparing the actual value of the time sequence data of 1 day in the future with the predicted value, concerning the index average absolute error MAE, the mean square error MSE, the average absolute percentage error MAPE, the root mean square error MSE and the coefficient R2, evaluating the effect of the currently adopted task scale prediction model, and if the concerned index value is larger than the actual requirement, replacing the task scale prediction model.
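The monitored indices of step T3.2 can be computed as in the sketch below (scikit-learn and numpy are assumed tooling); the 10% MAPE threshold comes from step T3 of the disclosure.

```python
# Illustrative evaluation of the step T3.2 indices.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_forecast(actual, predicted, mape_threshold=0.10):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    mse = mean_squared_error(actual, predicted)
    metrics = {
        "MAE": mean_absolute_error(actual, predicted),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAPE": float(np.mean(np.abs((actual - predicted) / (actual + 1e-9)))),
        "R2": r2_score(actual, predicted),
    }
    metrics["replace_model"] = metrics["MAPE"] > mape_threshold   # effect considered poor
    return metrics
```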
In time series analysis, the purpose of the autoregressive integrated moving average (ARIMA) model is to understand the data better and to let the model fit the time series data better, so as to predict the dependent variable at a future point more accurately. It consists of an autoregressive model and a moving average model: the autoregressive part regresses the fitted variable on previous data, and the moving average part models the error term, treating it as a linear combination of the current error term and error terms at various past times; differencing is integrated into the process, that is, each data point is replaced by the difference between it and the preceding one, so as to eliminate the trend effect. The specific construction steps are shown in fig. 3.
The LSTM model is a typical implementation of the recurrent neural network (RNN); it can link previous information to the current task and use previous information to predict future information, a characteristic very well suited to a data center task scale prediction model. The standard LSTM network model has a chain structure in which every node on the chain has the same neuron structure. Each neuron contains four neural network layers; its computation graph is shown in fig. 4 and includes three typical gate structures: a forget gate, an input gate and an output gate. Each gate is in fact a fully connected layer whose input is a time series data vector and whose output is a vector of real numbers between 0 and 1. The forget gate determines how much of the previous cell state C_{t-1} is kept in the current cell state C_t; the input gate determines how much of the current network input X_t is saved into the cell state C_t; and the output gate controls how much of the cell state C_t is emitted as the current LSTM output value h_t. The standard LSTM network model processes the time series data input at time t through the forget gate, input gate and output gate connected in sequence.
This embodiment presents two variants of the LSTM model, which change the neural-network-layer details of each neuron and may yield better predictions. The first variant uses coupled forget and input gates; its computation graph is shown in fig. 5. Instead of deciding separately what to forget and what new information to add, these decisions are made together: the output of the forget gate is used as part of the input gate's input, thus combining the forget-gate and input-gate information. The second variant uses the gated recurrent unit (GRU); its computation graph is shown in fig. 6. The GRU merges the forget gate and the input gate into a single "update gate", combines the cell state and the hidden state, and makes some other modifications, producing a model simpler than the standard LSTM model.
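As a brief illustration of the second variant, the sketch below swaps the standard LSTM neuron for a GRU while keeping the rest of the network unchanged. Keras is assumed tooling and does not itself provide the first (coupled-gate) variant, which would require a custom recurrent cell; layer sizes are illustrative.

```python
# Illustrative GRU-based network corresponding to the second LSTM variant.
import tensorflow as tf

def build_gru_model(lookback=48, units=32):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(lookback, 1)),
        tf.keras.layers.GRU(units),    # GRU neuron replaces the standard LSTM cell
        tf.keras.layers.Dense(1),
    ])
```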
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A data center task scale prediction method based on time series data analysis is characterized by comprising the following steps:
T1, collecting the data center input data volume: collecting and recording the receiving information of the data center input node, wherein the receiving information comprises the input time point, whether the input succeeded and the input data amount, and writing the receiving information into a historical task scale database;
T2, establishing a task scale prediction model: reading the time series data of the most recent m months (m = 2, 3, 4, 5 or 6) from the historical task scale database, accumulating all successfully input data volumes over a fixed time granularity in the range of 20-40 minutes, taking the accumulated data volume as the data center task scale of that period, dividing the resulting series into a sample set and a verification set at a ratio of 8:2, removing abnormal values from the time series data, and finally establishing a single exponential smoothing prediction model, an autoregressive integrated moving average (ARIMA) model or a long short-term memory (LSTM) network model according to the characteristics of the processed time series data;
T3, verifying and evaluating the task scale prediction model: fitting the time series data of the next d days (d = 1, 2, 3 or 4) with the task scale prediction model established in step T2, calculating the mean absolute percentage error (MAPE) between the predicted and true values, and evaluating the effect of the currently adopted task scale prediction model; if the MAPE between the predicted and true values exceeds 10%, the effect is considered poor and the task scale prediction model needs to be replaced, and if the expected effect is achieved, proceeding to the next step;
T4, applying the task scale prediction model: predicting, with the task scale prediction model that achieved the expected effect in step T3, the change of the data center task scale over the next d days (d = 1, 2, 3 or 4) from the data center task scale of the current period, and providing, according to the prediction result, a basis for the reasonable allocation of data center resources and the detection of abnormal data center task scale.
2. The data center task scale prediction method based on time series data analysis according to claim 1, wherein the data center input node is an HTTP or FTP loading service node.
3. The data center task scale prediction method based on time series data analysis according to claim 1, wherein the historical task scale database is selected from ElasticSearch or MySQL.
4. The data center task scale prediction method based on time series data analysis according to claim 1, wherein the step T2 of establishing a task scale prediction model comprises the following steps:
T2.1, reading the data of the most recent m months (m = 2, 3, 4, 5 or 6) from the historical task scale database, accumulating all successfully input data volumes over a fixed time granularity in the range of 20-40 minutes as the data center task scale of the corresponding period, and dividing the series into a sample set and a verification set at a ratio of 8:2, wherein the sample set is used for fitting and training the task scale prediction model and the verification set is used for evaluating the prediction effect of the task scale prediction model;
T2.2, detecting abnormal values in the time series data by the Grubbs criterion or the 3-sigma criterion, and replacing the abnormal values of the time series data with the window average;
T2.3, determining the components of the time series data by plotting the time series, wherein the components of the time series data comprise trend, seasonal, cyclical and/or irregular fluctuations, and selecting the task scale prediction model according to the following rules: for time series data without trend and seasonal components, selecting the single exponential smoothing prediction model as the task scale prediction model; for time series data containing both trend and seasonal components, selecting the autoregressive integrated moving average (ARIMA) model as the task scale prediction model; and for time series data whose components cannot be determined, selecting the long short-term memory (LSTM) network model.
5. The data center task scale prediction method based on time series data analysis according to claim 4, wherein the single exponential smoothing prediction model uses only one smoothing coefficient, which is the weight given to the actual value, and the predicted value at time t+1 in the single exponential smoothing prediction model is equal to the weighted average of the actual value and the predicted value at time t.
6. The data center task scale prediction method based on time series data analysis according to claim 4, wherein the autoregressive integrated moving average (ARIMA) model is established as follows:
T2.3.01, preprocessing the time series data: stabilizing the time series data used by the ARIMA model with a data stabilization method including aggregation, smoothing, polynomial filtering and STL decomposition, and further specifying the differencing order of the model as first-order or second-order differencing so as to eliminate the trend of the time series data;
T2.3.02, verifying the usability of the time series data: verifying with the Dickey-Fuller test whether the preprocessed time series data is stationary at the given confidence level; if not, changing the data stabilization method and repeating step T2.3.01 until the preprocessed time series data passes the Dickey-Fuller test; meanwhile, verifying with the Ljung-Box test whether the preprocessed time series data is a purely random sequence; if it is, ending the prediction, and if it is not, proceeding to the next step;
T2.3.03, determining the relevant parameters and establishing the model: plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) diagrams to obtain the model lag values, and establishing the ARIMA model according to the lag values p and q;
T2.3.04, verifying the prediction effect of the model: inputting the test set into the established ARIMA model, outputting the time series data predicted by the ARIMA model, comparing the predicted values with the test set while monitoring the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2, and adjusting the ARIMA model parameters according to these index results.
7. The data center task scale prediction method based on time series data analysis according to claim 4, wherein the long short-term memory (LSTM) network model is established as follows:
T2.3.11, normalizing the time series data: converting the receiving-information values collected at each data center input node into the [0,1] interval by a range (min-max) transformation;
T2.3.12, selecting the neuron: selecting the standard LSTM cell, a first LSTM variant or a second LSTM variant as the neuron according to the prediction effect, wherein the standard LSTM cell processes the time series data input at time t through a forget gate, an input gate and an output gate connected in sequence, the first LSTM variant couples the forget gate and the input gate so that the output of the forget gate is used as part of the input gate, and the second LSTM variant is constructed directly with the connection scheme of the gated recurrent unit (GRU);
T2.3.13, training the LSTM network model: constructing the input layer, hidden layer and output layer of the LSTM network model, inputting the training-set time series data, training the model by stochastic gradient descent, and outputting the time series data predicted by the LSTM network model, terminating the training when the difference between two consecutive outputs of the MSE cost function is less than 5%;
T2.3.14, verifying the prediction effect of the LSTM network model: inputting the test set into the trained LSTM network model, outputting the time series data predicted by the LSTM network model, comparing the predicted values with the test set while monitoring indices such as the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2, and adjusting the LSTM network model parameters and the choice of neuron according to these index results.
8. The data center task scale prediction method based on time series data analysis according to claim 1, wherein the process of verifying and evaluating the task scale prediction model in step T3 is as follows:
T3.1, verifying the reasonableness of the task scale prediction model: using the Ljung-Box test to verify whether the residuals of the task scale prediction model satisfy pure randomness at the chosen significance level;
T3.2, evaluating the prediction effect of the task scale prediction model: predicting the time series data of the next d days (d = 1, 2, 3 or 4) with the task scale prediction model trained on the sample set, comparing the actual values of those days with the predicted values once the actual values have been collected, evaluating the effect of the currently adopted task scale prediction model by monitoring the mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination R2, and replacing the task scale prediction model if a monitored index value exceeds the actual requirement.
CN202010412587.2A (filed 2020-05-15, priority 2020-05-15): Data center task scale prediction method based on time series data analysis. Publication: CN111563028A (en). Status: Withdrawn.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010412587.2A CN111563028A (en) 2020-05-15 2020-05-15 Data center task scale prediction method based on time series data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010412587.2A CN111563028A (en) 2020-05-15 2020-05-15 Data center task scale prediction method based on time series data analysis

Publications (1)

Publication Number Publication Date
CN111563028A true CN111563028A (en) 2020-08-21

Family

ID=72074698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010412587.2A Withdrawn CN111563028A (en) 2020-05-15 2020-05-15 Data center task scale prediction method based on time series data analysis

Country Status (1)

Country Link
CN (1) CN111563028A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114428803A (en) * 2020-10-29 2022-05-03 上海浦昊节能环保科技有限公司 Operation optimization method and system for air compression station, storage medium and terminal
CN114428803B (en) * 2020-10-29 2023-05-26 上海浦昊节能环保科技有限公司 Air compression station operation optimization method, system, storage medium and terminal
CN112418509A (en) * 2020-11-18 2021-02-26 青岛海尔科技有限公司 Task data prediction method and device, storage medium and electronic device
CN112488496A (en) * 2020-11-27 2021-03-12 山东浪潮通软信息科技有限公司 Financial index prediction method and device
CN113191003A (en) * 2021-05-08 2021-07-30 上海核工程研究设计院有限公司 Nuclear power real-time data trend fitting algorithm
CN114528934A (en) * 2022-02-18 2022-05-24 中国平安人寿保险股份有限公司 Time series data abnormity detection method, device, equipment and medium
CN115526426A (en) * 2022-10-27 2022-12-27 广东广信通信服务有限公司 Method for predicting RPA robot task execution amount based on time series data
CN117634933A (en) * 2024-01-26 2024-03-01 中国电力科学研究院有限公司 Carbon emission data prediction method and device
CN117634933B (en) * 2024-01-26 2024-05-07 中国电力科学研究院有限公司 Carbon emission data prediction method and device


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 20200821)