CN110222840B - Cluster resource prediction method and device based on attention mechanism - Google Patents


Info

Publication number
CN110222840B
Authority
CN
China
Prior art keywords
attention
time
weight
unit
hidden layer
Prior art date
Legal status
Active
Application number
CN201910413227.1A
Other languages
Chinese (zh)
Other versions
CN110222840A (en
Inventor
窦耀勇
唐家伟
吴维刚
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201910413227.1A priority Critical patent/CN110222840B/en
Publication of CN110222840A publication Critical patent/CN110222840A/en
Application granted granted Critical
Publication of CN110222840B publication Critical patent/CN110222840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a cluster resource prediction method and device based on an attention mechanism. An improved attention mechanism is integrated into an LSTM so that the correlations among multiple time series can be mined, providing a scheme for predicting the resource demand in a cluster from multiple time series. The prediction accuracy is effectively improved, resource planning can be effectively assisted, the resource utilization of the cluster is raised, and the operation and maintenance cost of the data center is effectively reduced.

Description

Cluster resource prediction method and device based on attention mechanism
Technical Field
The invention relates to the technical field of cluster resource management, in particular to a cluster resource prediction method and device based on an attention mechanism.
Background
Data centers keep growing in scale. Effective resource management of the clusters in a data center improves the utilization of hardware resources, reduces operation and maintenance costs, and increases operating profit. One effective way to improve resource utilization is to predict the future resource demand of the cluster, so that resource planning can be carried out in advance and waste of resources reduced.
Currently, cluster resource demand prediction mainly uses the time series data of cluster resources. Common time series prediction models include ARIMA (autoregressive integrated moving average), VAR (vector autoregression), GBRT (gradient boosting regression tree), LSTM (long short-term memory network), and the like, which can be used directly to predict the resource demand in a cluster.
However, current cluster resource prediction methods have two main problems. First, most methods use a single time series as the prediction feature (for example ARIMA), and few use multiple time series for resource demand prediction; prediction accuracy then depends on whether the historical values of that single series contain a clear pattern. Second, although general multi-time-series prediction models exist (for example VAR), these models do not consider the characteristics of clusters in a data center, and in particular do not consider the correlation and mutual interference between application loads in the cluster. Both problems can lead to inaccurate cluster resource prediction results.
Therefore, how to accurately predict cluster resources is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for predicting cluster resources based on an attention mechanism. Multiple resource time series are used to predict future resource demand, and an improved deep-learning attention mechanism is adopted to mine the correlations between the resource demand time series according to the characteristics of how application loads in the cluster use resources, so that the accuracy of resource prediction is effectively improved.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a cluster resource prediction method based on an attention mechanism comprises the following steps:
s1: taking the first hidden layer state at the last moment, all time series data belonging to one deployment unit with the target instance, all time series data belonging to one host unit with the target instance and the target time sequence at the historical moment as input of an input attention layer to obtain a first input vector;
s2: inputting the first input vector to an LSTM coder to obtain a current first hidden layer state;
s3: inputting the current first hidden layer state and the second hidden layer state at the previous moment into a time correlation attention layer to obtain a context vector;
S4: inputting the context vector, the second hidden layer state at the last moment and the target time sequence at the historical moment to an LSTM decoder to obtain the current second hidden layer state;
s5: and linearly transforming the current second hidden layer state and the context vector to obtain a predicted value.
Preferably, step S1 specifically includes:
s11: taking the state of the first hidden layer at the last moment and all time series data belonging to the same deployment unit with the target instance as the input of the attention layer of the deployment unit to obtain the output vector of the attention layer of the deployment unit;
s12: taking the state of the first hidden layer at the last moment and all time series data belonging to the same host unit as the target instance as the input of the attention layer of the host unit to obtain an attention output vector of the host unit;
s13: taking the state of the first hidden layer at the last moment and the target time sequence at the historical moment as the input of the autocorrelation attention layer to obtain an autocorrelation attention layer output vector;
s14: the deployment unit attention layer output vector, the host unit attention output vector, and the autocorrelation attention layer output vector are combined as a first input vector.
Preferably, step S11 specifically includes:
Calculating a first attention weight based on the first hidden layer state at the previous time and all time series data belonging to the same deployment unit as the target instance;
calculating a normalized deployment unit attention weight using a softmax function based on the first attention weight;
calculating a deployment unit attention layer output vector based on the first hidden layer state at the previous time, all time series data belonging to one deployment unit with the target instance and the normalized deployment unit attention weight;
the step S12 specifically includes:
calculating a first-order time sequence correlation coefficient of each time sequence data belonging to one host unit with the target instance relative to the historical target time sequence, and obtaining static time correlation weights of all time sequences and the historical target time sequence in the corresponding host unit;
calculating a second attention weight based on the first hidden layer state and all time-series data of the target instance belonging to one host unit at the previous time;
obtaining the attention weight of the host unit based on the static time correlation weight and the second attention weight, and normalizing to obtain the normalized attention weight of the host unit;
calculating a host unit attention output vector based on the first hidden layer state at the previous time, all time series data belonging to one host unit with the target instance, the target time sequence at the historical time and the normalized host unit attention weight;
The step S13 specifically includes:
calculating correlation coefficients between historical moment target time sequences in different time windows, and obtaining corresponding autocorrelation weights;
calculating a third attention weight based on the first hidden layer state at the previous time and the target time sequence at the different historical time;
obtaining the attention weight of the autocorrelation unit based on the autocorrelation weight and the third attention weight, and normalizing to obtain the attention weight of the normalized autocorrelation unit;
an autocorrelation attention layer output vector is calculated based on the first hidden layer state at the last time, the target timing at the historical time, and the normalized autocorrelation unit attention weights.
Preferably, step S3 specifically includes:
calculating the time attention layer weight based on the state of the second hidden layer at the previous moment, and normalizing to obtain the normalized time attention layer weight;
a context vector is calculated based on the current first hidden layer state and the normalized temporal attention layer weight.
A cluster resource prediction apparatus based on an attention mechanism, comprising:
the first input vector calculation module is used for taking the first hidden layer state at the last moment, all time series data belonging to one deployment unit with the target instance, all time series data belonging to one host unit with the target instance and the target time sequence at the historical moment as input of the input attention layer to obtain a first input vector;
The first hidden layer state calculation module is used for inputting the first input vector to an LSTM coder to obtain a current first hidden layer state;
the context vector calculation module is used for inputting the current first hidden layer state and the second hidden layer state at the previous moment into the time correlation attention layer to obtain a context vector;
the second hidden layer state calculation module is used for inputting the context vector, the second hidden layer state at the last moment and the target time sequence at the historical moment to the LSTM decoder to obtain the current second hidden layer state;
and the linear transformation module is used for carrying out linear transformation on the current second hidden layer state and the context vector to obtain a predicted value.
Preferably, the first input vector calculation module specifically includes:
the first computing unit is used for taking the state of the first hidden layer at the previous moment and all time series data which belong to the same deployment unit as the target instance as the input of the attention layer of the deployment unit to obtain the output vector of the attention layer of the deployment unit;
the second calculating unit is used for taking the state of the first hidden layer at the last moment and all time series data which belong to the same host unit as the target instance as the input of the attention layer of the host unit to obtain an attention output vector of the host unit;
The third calculation unit is used for taking the first hidden layer state at the last moment and the target time sequence at the historical moment as the input of the autocorrelation attention layer to obtain an autocorrelation attention layer output vector;
and the merging unit is used for merging the deployment unit attention layer output vector, the host unit attention output vector and the autocorrelation attention layer output vector as a first input vector.
Preferably, the first computing unit specifically includes:
a first attention weight calculation subunit for calculating a first attention weight based on the first hidden layer state at the previous time and all time-series data belonging to the same deployment unit as the target instance;
a first normalized weight calculation subunit for calculating a normalized deployment unit attention weight using a softmax function based on the first attention weight;
the first attention layer output vector calculation unit is used for calculating an attention layer output vector of the deployment unit based on the first hidden layer state at the last moment, all time series data belonging to the same deployment unit with the target instance and the normalized attention weight of the deployment unit;
the second computing unit specifically includes:
The static time correlation weight calculation subunit is used for calculating a first-order time sequence correlation coefficient of each time sequence data which belongs to one host unit together with the target instance relative to the historical target time sequence, and obtaining static time correlation weights of all time sequences and the historical target time sequence in the corresponding host unit;
a second attention weight calculation subunit for calculating a second attention weight based on the first hidden layer state and all time-series data of the target instance belonging to one host unit at the previous time;
the second normalization weight subunit is used for obtaining the attention weight of the host unit based on the static time correlation weight and the second attention weight, and normalizing the attention weight to obtain the attention weight of the normalized host unit;
the second attention layer output vector calculation unit is used for calculating the attention layer output vector of the host unit based on the state of the first hidden layer at the last moment, all time sequence data belonging to one host unit with the target instance, the target time sequence of the historical moment and the attention weight of the normalized host unit;
the third computing unit specifically includes:
the self-correlation weight calculation subunit is used for calculating correlation coefficients among the historical moment target time sequences in different time windows and obtaining corresponding self-correlation weights;
A third attention weight calculation subunit, configured to calculate a third attention weight based on the first hidden layer state at the previous time and the target time sequence at the different historical time;
the third normalization weight subunit is configured to obtain an attention weight of the autocorrelation unit based on the autocorrelation weight and the third attention weight, and normalize the attention weight of the autocorrelation unit to obtain a normalized attention weight of the autocorrelation unit;
the third attention layer output vector subunit calculates an autocorrelation attention layer output vector based on the first hidden layer state at the last time, the target timing at the historical time, and the normalized autocorrelation unit attention weight.
Preferably, the context vector calculation module specifically includes:
the fourth normalization weight subunit is used for calculating the time attention layer weight based on the state of the second hidden layer at the previous moment, and normalizing the time attention layer weight to obtain the normalized time attention layer weight;
a context vector calculation subunit for calculating a context vector based on the current first hidden layer state and the normalized temporal attention layer weight.
Compared with the prior art, the invention discloses a cluster resource prediction method and device based on an attention mechanism. An improved attention mechanism is integrated into an LSTM so that the correlations among multiple time series can be mined, providing a scheme for predicting the resource demand in a cluster from multiple time series. The prediction accuracy is effectively improved, resource planning can be effectively assisted, the resource utilization of the cluster is raised, and the operation and maintenance cost of the data center is effectively reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a cluster resource prediction method based on an attention mechanism provided by the invention;
FIG. 2 is a flowchart showing the calculation of a first input vector according to the present invention;
FIG. 3 is a schematic diagram of a cluster resource prediction device based on an attention mechanism according to the present invention;
FIG. 4 is a schematic diagram illustrating a first input vector calculation module according to the present invention;
FIG. 5 is a schematic diagram of a time-series acquisition architecture according to the present invention;
fig. 6 is a schematic diagram of a prediction model based on an attention mechanism according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the embodiment of the invention discloses a cluster resource prediction method based on an attention mechanism, which comprises the following steps:
s1: taking the first hidden layer state at the last moment, all time series data belonging to one deployment unit with the target instance, all time series data belonging to one host unit with the target instance and the target time sequence at the historical moment as input of an input attention layer to obtain a first input vector;
s2: inputting the first input vector to an LSTM coder to obtain a current first hidden layer state;
s3: inputting the current first hidden layer state and the second hidden layer state at the previous moment into a time correlation attention layer to obtain a context vector;
s4: inputting the context vector, the second hidden layer state at the last moment and the target time sequence at the historical moment to an LSTM decoder to obtain the current second hidden layer state;
s5: and linearly transforming the current second hidden layer state and the context vector to obtain a predicted value.
In the invention, resource prediction uses multiple resource time series rather than a single time series. In addition, an improved deep-learning attention mechanism is adopted to mine the correlations among the multiple resource time series according to the characteristics of how application loads in the cluster use resources, which ultimately improves the accuracy of resource prediction.
Referring to fig. 2, in order to further optimize the above technical solution, the embodiment of the present invention further discloses that step S1 specifically includes:
s11: taking the state of the first hidden layer at the last moment and all time series data belonging to the same deployment unit with the target instance as the input of the attention layer of the deployment unit to obtain the output vector of the attention layer of the deployment unit;
s12: taking the state of the first hidden layer at the last moment and all time series data belonging to the same host unit as the target instance as the input of the attention layer of the host unit to obtain an attention output vector of the host unit;
s13: taking the state of the first hidden layer at the last moment and the target time sequence at the historical moment as the input of the autocorrelation attention layer to obtain an autocorrelation attention layer output vector;
s14: the deployment unit attention layer output vector, the host unit attention output vector, and the autocorrelation attention layer output vector are combined as a first input vector.
In order to further optimize the above technical solution, the embodiment of the present invention further discloses that step S11 specifically includes:
calculating a first attention weight based on the first hidden layer state at the previous time and all time series data belonging to the same deployment unit as the target instance;
Calculating a normalized deployment unit attention weight using a softmax function based on the first attention weight;
calculating a deployment unit attention layer output vector based on the first hidden layer state at the previous time, all time series data belonging to one deployment unit with the target instance and the normalized deployment unit attention weight;
the step S12 specifically includes:
calculating a first-order time sequence correlation coefficient of each time sequence data belonging to one host unit with the target instance relative to the historical target time sequence, and obtaining static time correlation weights of all time sequences and the historical target time sequence in the corresponding host unit;
calculating a second attention weight based on the first hidden layer state and all time-series data of the target instance belonging to one host unit at the previous time;
obtaining the attention weight of the host unit based on the static time correlation weight and the second attention weight, and normalizing to obtain the normalized attention weight of the host unit;
calculating a host unit attention output vector based on the first hidden layer state at the previous time, all time series data belonging to one host unit with the target instance, the target time sequence at the historical time and the normalized host unit attention weight;
The step S13 specifically includes:
calculating correlation coefficients between historical moment target time sequences in different time windows, and obtaining corresponding autocorrelation weights;
calculating a third attention weight based on the first hidden layer state at the previous time and the target time sequence at the different historical time;
obtaining the attention weight of the autocorrelation unit based on the autocorrelation weight and the third attention weight, and normalizing to obtain the attention weight of the normalized autocorrelation unit;
an autocorrelation attention layer output vector is calculated based on the first hidden layer state at the last time, the target timing at the historical time, and the normalized autocorrelation unit attention weights.
In order to further optimize the above technical solution, the embodiment of the present invention further discloses that step S3 specifically includes:
calculating the time attention layer weight based on the state of the second hidden layer at the previous moment, and normalizing to obtain the normalized time attention layer weight;
a context vector is calculated based on the current first hidden layer state and the normalized temporal attention layer weight.
Referring to fig. 3, the embodiment of the invention also discloses a cluster resource prediction device based on an attention mechanism, which comprises:
The first input vector calculation module is used for taking the first hidden layer state at the last moment, all time series data belonging to one deployment unit with the target instance, all time series data belonging to one host unit with the target instance and the target time sequence at the historical moment as input of the input attention layer to obtain a first input vector;
the first hidden layer state calculation module is used for inputting a first input vector to the LSTM coder to obtain a current first hidden layer state;
the context vector calculation module is used for inputting the current first hidden layer state and the second hidden layer state at the previous moment into the time correlation attention layer to obtain a context vector;
the second hidden layer state calculation module is used for inputting the context vector, the second hidden layer state at the last moment and the target time sequence at the historical moment to the LSTM decoder to obtain the current second hidden layer state;
and the linear transformation module is used for carrying out linear transformation on the current second hidden layer state and the context vector to obtain a predicted value.
Referring to fig. 4, in order to further optimize the above technical solution, an embodiment of the present invention further discloses that the first input vector calculation module specifically includes:
The first computing unit is used for taking the state of the first hidden layer at the previous moment and all time series data which belong to the same deployment unit as the target instance as the input of the attention layer of the deployment unit to obtain the output vector of the attention layer of the deployment unit;
the second calculating unit is used for taking the state of the first hidden layer at the last moment and all time series data which belong to the same host unit as the target instance as the input of the attention layer of the host unit to obtain an attention output vector of the host unit;
the third calculation unit is used for taking the first hidden layer state at the last moment and the target time sequence at the historical moment as the input of the autocorrelation attention layer to obtain an autocorrelation attention layer output vector;
and the merging unit is used for merging the deployment unit attention layer output vector, the host unit attention output vector and the autocorrelation attention layer output vector as a first input vector.
In order to further optimize the above technical solution, the embodiment of the present invention further discloses that the first computing unit specifically includes:
a first attention weight calculation subunit for calculating a first attention weight based on the first hidden layer state at the previous time and all time-series data belonging to the same deployment unit as the target instance;
A first normalized weight calculation subunit for calculating a normalized deployment unit attention weight using a softmax function based on the first attention weight;
the first attention layer output vector calculation unit is used for calculating an attention layer output vector of the deployment unit based on the first hidden layer state at the last moment, all time series data belonging to the same deployment unit with the target instance and the normalized attention weight of the deployment unit;
the second calculation unit specifically includes:
the static time correlation weight calculation subunit is used for calculating a first-order time sequence correlation coefficient of each time sequence data which belongs to one host unit together with the target instance relative to the historical target time sequence, and obtaining static time correlation weights of all time sequences and the historical target time sequence in the corresponding host unit;
a second attention weight calculation subunit for calculating a second attention weight based on the first hidden layer state and all time-series data of the target instance belonging to one host unit at the previous time;
the second normalization weight subunit is used for obtaining the attention weight of the host unit based on the static time correlation weight and the second attention weight, and normalizing the attention weight to obtain the attention weight of the normalized host unit;
The second attention layer output vector calculation unit is used for calculating the attention layer output vector of the host unit based on the state of the first hidden layer at the last moment, all time sequence data belonging to one host unit with the target instance, the target time sequence of the historical moment and the attention weight of the normalized host unit;
the third calculation unit specifically includes:
the self-correlation weight calculation subunit is used for calculating correlation coefficients among the historical moment target time sequences in different time windows and obtaining corresponding self-correlation weights;
a third attention weight calculation subunit, configured to calculate a third attention weight based on the first hidden layer state at the previous time and the target time sequence at the different historical time;
the third normalization weight subunit is configured to obtain an attention weight of the autocorrelation unit based on the autocorrelation weight and the third attention weight, and normalize the attention weight of the autocorrelation unit to obtain a normalized attention weight of the autocorrelation unit;
the third attention layer output vector subunit calculates an autocorrelation attention layer output vector based on the first hidden layer state at the last time, the target timing at the historical time, and the normalized autocorrelation unit attention weight.
In order to further optimize the above technical solution, the embodiment of the present invention further discloses a context vector calculation module specifically including:
the fourth normalization weight subunit is used for calculating the time attention layer weight based on the state of the second hidden layer at the previous moment, and normalizing the time attention layer weight to obtain the normalized time attention layer weight;
a context vector calculation subunit for calculating a context vector based on the current first hidden layer state and the normalized temporal attention layer weight.
The prediction method adopted by the invention uses an attention mechanism to mine the correlations among multiple correlated time series, expresses the relevance of each series to the target series as a weight, and uses these correlations to predict the target series, thereby effectively improving prediction accuracy. Applying the proposed model to cluster resource prediction makes it possible to predict the future resource demand of the cluster more accurately and to assist resource planning more effectively, improving the resource utilization of the cluster and reducing the operation and maintenance cost of the data center.
The technical scheme provided by the invention is further described in detail below in combination with a specific implementation method.
In modern clusters, an application is typically made up of multiple application instances, which together form a deployment unit. These instances are usually distributed across different physical hosts, so each physical host may carry several different application instances; the application instances residing on one physical host form a host unit. The application instances within one deployment unit, and those within one host unit, are very likely to be correlated. Therefore, when collecting the time series data of a target instance to be predicted, the time series data of the other application instances in the deployment unit and in the host unit where the target instance is located can be collected at the same time, and all of these time series can be used for predicting the target instance.
Before describing the method of the invention in detail, the mathematical symbols used to describe the inputs and outputs of the model are listed as follows:
TABLE 1
(The symbol table is reproduced as an image in the original document and is not recoverable here.)
First, as shown in fig. 5, a time series data acquisition architecture is designed: a local time series database is deployed on each host, and the local time series databases of all hosts upload their data to a global time series database. For a target instance whose resources are to be predicted, its own time series data (the target sequence) and all time series data X_o of the instances in the same host unit as the target instance can be obtained locally; all time series data X_i of the instances belonging to the same deployment unit as the target instance are then queried and obtained from the global time series database.
According to these data, the invention designs a prediction model based on the attention mechanism, named MLA-LSTM (Multi-Level Attention LSTM, a long short-term memory network with multi-level attention).
The model uses a time window of size T; each time series provides T values within the window, and the value of the target instance at the next time point, i.e. at time point T+1, is predicted. This process can be abstracted as
ŷ_{T+1} = F(Y_T, X_i, X_o),
where F is the model to be trained.
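As an illustration of this windowing scheme, the following sketch (an illustrative assumption, not part of the patent) shows how fixed-size windows of the target series and the related series could be paired with the next-point label that F has to predict:

```python
# Minimal sketch of building length-T training windows; the function name and
# array layout are illustrative assumptions, not taken from the patent.
import numpy as np

def build_windows(target, related, T):
    """target: (N,) target-instance series; related: (N, d) other series.
    Returns length-T windows and the next target value (the label at T+1)."""
    X_windows, Y_windows, labels = [], [], []
    for end in range(T, len(target)):
        X_windows.append(related[end - T:end])   # T x d values of the related series
        Y_windows.append(target[end - T:end])    # T historical target values Y_T
        labels.append(target[end])               # value to predict at time T+1
    return np.array(X_windows), np.array(Y_windows), np.array(labels)

# toy usage: one target series plus 5 related series, window size T = 25
rng = np.random.default_rng(0)
Xw, Yw, y_next = build_windows(rng.random(200), rng.random((200, 5)), T=25)
print(Xw.shape, Yw.shape, y_next.shape)   # (175, 25, 5) (175, 25) (175,)
```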
The model contains two LSTMs: the first LSTM serves as the encoder, processing the multiple input time series and outputting the hidden state h_t; the second LSTM serves as the decoder, processing the hidden state h_t output by the first LSTM and finally outputting the predicted value. A schematic diagram of this model is shown in fig. 6.
1. LSTM encoder
The LSTM encoder updates its hidden state h_t from the previous hidden state h_{t-1} and the current input (the full LSTM update equations are given as images in the original document). Here h_t is the hidden state vector of the LSTM at time point t, and its length is set to m. The encoder input at each step is obtained from three attention layers, whose calculation is described in detail below. The LSTM encoder unrolled along the time dimension is shown in fig. 6.
For the LSTM encoder, three attention layers are combined into one input attention layer to mine the correlations between time series. The three attention layers are:
(1) A layer that mines the multiple time series X_i within the deployment unit using a common attention mechanism, referred to as the deployment unit attention layer.
(2) A layer that mines the multiple time series X_o within the host unit using an improved attention mechanism, referred to as the host unit attention layer.
(3) A layer that mines the autocorrelation of the target instance's own time series using an improved attention mechanism, referred to as the autocorrelation attention layer.
1. Deployment unit attention layer. An attention weight is scored for each time series in the deployment unit from the previous encoder hidden state, normalized with a softmax function, and used to weight the series values; the detailed formulas, the descriptions of their parameters, and the summary of the layer's inputs and outputs are given as equation images in the original document and are not reproduced here.
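Since the layer's formulas are not reproduced above, the following NumPy sketch illustrates a common "input attention" formulation that matches the description (score each deployment-unit series against the previous encoder hidden and cell states, normalize with softmax, re-weight the series values at time t). The parameter names W_d, U_d, v_d and the exact scoring form are assumptions, not the patent's own equations.

```python
# Hedged sketch of a deployment-unit attention step; parameter shapes are assumed:
# W_d: (p, 2m), U_d: (p, T), v_d: (p,) for encoder hidden size m and window size T.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deployment_unit_attention(h_prev, s_prev, X_i, t, W_d, U_d, v_d):
    """h_prev, s_prev: previous encoder hidden/cell states (m,);
    X_i: (T, n_i) series of the deployment unit; t: current time step."""
    T, n_i = X_i.shape
    scores = np.array([
        v_d @ np.tanh(W_d @ np.concatenate([h_prev, s_prev]) + U_d @ X_i[:, k])
        for k in range(n_i)])                    # one attention score per series
    alpha = softmax(scores)                      # normalized deployment-unit weights
    return alpha * X_i[t], alpha                 # weighted series values at time t

# toy usage with random parameters (shapes only)
rng = np.random.default_rng(0)
T, n_i, m, p = 25, 33, 64, 32
out, alpha = deployment_unit_attention(
    rng.random(m), rng.random(m), rng.random((T, n_i)), t=T - 1,
    W_d=rng.random((p, 2 * m)), U_d=rng.random((p, T)), v_d=rng.random(p))
```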
2. Host unit attention layer. Its calculation proceeds as follows:
(1) First, the first-order temporal correlation coefficient CORT (the first order temporal correlation coefficient) of each time series with respect to the target sequence is calculated.
Take the l-th time series x_{o,l} in the host unit as an example. To calculate its first-order temporal correlation coefficient with the target time series Y_T, the two sequences must first be clipped: the last value of x_{o,l} is removed, and the first value of the lagged target sequence Y_T at time T is removed. The absolute value of the CORT between the two clipped sequences, C_{o,l}, is then calculated and taken as the static temporal correlation weight of the series x_{o,l} with respect to the target sequence. The CORT of two time sequences S_1 and S_2 of length q, where S_{1,t} and S_{2,t} are their values at time t, is calculated as
CORT(S_1, S_2) = Σ_{t=1}^{q-1} (S_{1,t+1} - S_{1,t})(S_{2,t+1} - S_{2,t}) / ( sqrt(Σ_{t=1}^{q-1} (S_{1,t+1} - S_{1,t})²) · sqrt(Σ_{t=1}^{q-1} (S_{2,t+1} - S_{2,t})²) ).
Finally, the static time correlation weights of all the time series in the host unit with respect to the target sequence are obtained and combined into a vector C_out.
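A small sketch of the CORT computation and of the static host-unit weights under the clipping described above (last value of each host-unit series dropped, first value of the lagged target dropped); this follows the standard CORT definition and is offered as an illustration, not as the patent's exact formulation.

```python
# CORT as the cosine similarity of first differences; static weights are its
# absolute value per host-unit series, matching the clipping described above.
import numpy as np

def cort(s1, s2):
    """First-order temporal correlation coefficient of two equal-length series."""
    d1, d2 = np.diff(s1), np.diff(s2)
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2 / denom) if denom > 0 else 0.0

def static_host_weights(X_o, y_hist):
    """X_o: (T, n_o) host-unit series; y_hist: (T,) target history Y_T.
    Returns the vector C_out of absolute CORT values against the lagged target."""
    y_lagged = y_hist[1:]                                    # drop first value of Y_T
    return np.array([abs(cort(X_o[:-1, k], y_lagged))        # drop last value of x_{o,l}
                     for k in range(X_o.shape[1])])

# toy usage
rng = np.random.default_rng(0)
C_out = static_host_weights(rng.random((25, 23)), rng.random(25))   # (23,) static weights
```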
(2) The attention weight is calculated using a common attention mechanism.
Again taking the l-th time series x_{o,l} as an example, its attention weight at time t is calculated from the previous encoder hidden state (the scoring formulas are given as equation images in the original document). The attention weights of all time series in the host unit at time t form an attention weight vector g_t.
(3) The static time correlation weight vector C_out obtained in step (1) and the attention weight vector g_t obtained in step (2) are combined. The combination is accomplished by a linear transformation, which yields a new weight vector θ_t. To normalize all elements of this vector, a softmax function is applied, giving the normalized weight of the l-th time series of the host unit at time t.
(4) The weighted value vector of the host unit time series at time t is obtained: each normalized weight from the previous step is multiplied by the value of the corresponding host unit time series at time t, and the weighted values of all host unit series at time t form a vector.
In summary, the input of the host unit attention layer is the previous encoder hidden state, all time series of the host unit and the historical target sequence, and its output is the weighted value vector (the detailed formulas are given as equation images in the original document).
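The following sketch puts steps (2) to (4) together: dynamic attention scores are merged with the static CORT weight vector C_out by a learned linear map, normalized with softmax, and used to re-weight the host-unit values at time t. The parameters W_h, U_h, v_h, W_c, b_c and the exact merge form are illustrative assumptions, not the patent's equations.

```python
# Hedged sketch of the host-unit attention layer; C_out comes from the static
# CORT weights sketched above, and all learned parameters have assumed shapes.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def host_unit_attention(h_prev, s_prev, X_o, C_out, t, W_h, U_h, v_h, W_c, b_c):
    """X_o: (T, n_o) host-unit series; C_out: (n_o,) static CORT weights;
    h_prev, s_prev: previous encoder hidden/cell states (m,); t: current step."""
    T, n_o = X_o.shape
    # (2) dynamic attention weights g_t scored against the previous encoder state
    g_t = np.array([
        v_h @ np.tanh(W_h @ np.concatenate([h_prev, s_prev]) + U_h @ X_o[:, k])
        for k in range(n_o)])
    # (3) merge static and dynamic weights with a linear transformation, then softmax
    theta_t = W_c @ np.concatenate([C_out, g_t]) + b_c
    beta = softmax(theta_t)
    # (4) weighted host-unit values at time t
    return beta * X_o[t], beta

# toy usage with random parameters (shapes only)
rng = np.random.default_rng(0)
T, n_o, m, p = 25, 23, 64, 32
out, beta = host_unit_attention(
    rng.random(m), rng.random(m), rng.random((T, n_o)), rng.random(n_o), t=T - 1,
    W_h=rng.random((p, 2 * m)), U_h=rng.random((p, T)), v_h=rng.random(p),
    W_c=rng.random((n_o, 2 * n_o)), b_c=rng.random(n_o))
```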
3. Autocorrelation attention layer. Its calculation proceeds as follows:
(1) Similar to the host unit attention layer, correlation coefficients between the target time series over different time windows are calculated first. The CORT coefficient C_{a,r} between the target time series Y_r ending at time r and the target time series Y_T ending at time T is calculated as C_{a,r} = |CORT(Y_T, Y_r)|. The CORT coefficients of the target time series ending at each instant within the time window T then form an autocorrelation vector C_auto of length T.
(2) The attention weight is calculated using a common attention mechanism, giving an attention weight vector μ_t (the scoring formulas are given as equation images in the original document).
(3) The two weight vectors obtained in the previous steps, C_auto and μ_t, are converted into a single weight vector φ_t by a linear transformation and then normalized with a softmax function.
(4) The weighted target time series vector is obtained: each normalized weight from the previous step describes, within the time window T, the degree of influence of the target sequence value at time r on the value at time T, i.e. the correlation of the target sequence with itself at different times. The values of the target sequence at the times r within the window are weighted with these weights to obtain the output vector at time t.
Finally, the output vectors of the three attention layers are merged as the input vector of the encoder LSTM at time t (the merge formula is given as an equation image in the original document).
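A short PyTorch sketch (an assumed implementation detail, not the patent's code) of how the three attention outputs could be concatenated into the encoder input and fed to an LSTM cell to update the first hidden state; the output sizes of the three layers are example values.

```python
import torch
import torch.nn as nn

m = 64                                    # encoder hidden size (one of {32, 64, 128})
n_i, n_o, T = 33, 23, 25                  # example sizes of the three attention outputs
encoder_cell = nn.LSTMCell(n_i + n_o + T, m)

def encoder_step(x_deploy, x_host, x_auto, h_prev, s_prev):
    """x_deploy: (1, n_i), x_host: (1, n_o), x_auto: (1, T) attention-layer outputs."""
    x_merged = torch.cat([x_deploy, x_host, x_auto], dim=-1)   # merged encoder input
    return encoder_cell(x_merged, (h_prev, s_prev))            # new hidden/cell state

# toy usage
h, s = torch.zeros(1, m), torch.zeros(1, m)
h, s = encoder_step(torch.randn(1, n_i), torch.randn(1, n_o), torch.randn(1, T), h, s)
```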
2. Decoder LSTM
The decoder LSTM is defined analogously to the encoder (its update equation is given as an image in the original document). Let h'_t be the hidden state vector of the decoder LSTM at time t, with n elements; note that it is distinct from the hidden state vector h_t of the encoder LSTM. The decoder LSTM unrolled along the time dimension is shown in fig. 6.
A time-dependent attention layer is integrated into this LSTM. Its weights are calculated from the previous decoder state against each encoder hidden state and normalized (the scoring formulas are given as equation images in the original document). The normalized weights obtained at time t are then used in a weighted summation over the encoder hidden states h_p to obtain the context vector c_t.
The context vector c_t at time t and the target time series value y_t are merged, and the decoder input at time t is obtained by a linear transformation. This input is fed into the decoder LSTM together with the hidden state h'_t of the current time and is used to update the decoder hidden state h'_{t+1} at the next time.
The above update is cycled until time T, yielding the hidden state vector h'_T and the corresponding cell state vector at time T. Finally, the predicted value at time T+1 output by the decoder LSTM is calculated by a linear transformation of h'_T and the context vector.
In summary, the input of the time-dependent attention layer is the encoder hidden states and the previous decoder state, and its output is the context vector (the detailed formulas are given as equation images in the original document).
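The decoder side can be sketched as follows (PyTorch, illustrative assumptions throughout): a temporal attention layer scores every encoder hidden state against the previous decoder state, the softmax-weighted sum gives the context vector, the context is merged with the current target value to form the decoder input, and the final prediction is a linear map of the last decoder hidden state and the context vector.

```python
# Hedged sketch of the decoder with a time-dependent attention layer; module and
# parameter choices are assumptions, not the patent's exact equations.
import torch
import torch.nn as nn

class TemporalAttentionDecoder(nn.Module):
    def __init__(self, m, n):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * n + m, m), nn.Tanh(), nn.Linear(m, 1))
        self.input_proj = nn.Linear(m + 1, 1)      # merge context vector and y_t
        self.cell = nn.LSTMCell(1, n)              # decoder LSTM cell
        self.out = nn.Linear(n + m, 1)             # final linear transformation

    def forward(self, enc_h, y_hist):
        """enc_h: (T, m) encoder hidden states; y_hist: (T,) historical target values."""
        T, m = enc_h.shape
        h = torch.zeros(1, self.cell.hidden_size)
        s = torch.zeros(1, self.cell.hidden_size)
        for t in range(T):
            query = torch.cat([h, s], dim=-1).repeat(T, 1)         # previous decoder state
            scores = self.attn(torch.cat([query, enc_h], dim=-1))  # (T, 1) attention scores
            beta = torch.softmax(scores, dim=0)                    # temporal attention weights
            c_t = (beta * enc_h).sum(dim=0, keepdim=True)          # context vector (1, m)
            y_tilde = self.input_proj(torch.cat([c_t, y_hist[t].view(1, 1)], dim=-1))
            h, s = self.cell(y_tilde, (h, s))                      # update decoder state
        return self.out(torch.cat([h, c_t], dim=-1))               # prediction for time T+1

# toy usage with window size T = 25 and hidden sizes m = n = 64
decoder = TemporalAttentionDecoder(m=64, n=64)
prediction = decoder(torch.randn(25, 64), torch.randn(25))
```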
Finally, MSE (mean squared error) is used as the training criterion for the model:
MSE = (1/N) Σ_{j=1}^{N} (ŷ_{T+1}^{(j)} - y_{T+1}^{(j)})²,
where N is the number of training samples.
a gradient descent algorithm is used to train this model to determine the specific values of the weight coefficient matrix/vector/bias of the neural network.
The technical scheme of the invention is further described below with reference to specific examples.
The example uses the cluster-trace dataset published by Alibaba in 2018. One container (id c_66550) is randomly selected as the target instance, and the CPU utilization time series of this container is taken as the resource time series of the target instance. The instances belonging to the same deployment unit and the same host unit as this target instance are found, their time series data are extracted, and all series are finally processed into time series with an interval of 300 seconds.
In the end, 33 time series from the same deployment unit and 23 time series from the same host are obtained, plus the 1 time series of the target instance. These time series are time-aligned and divided into three data sets: a training set with 10141 time points, a validation set with 563 time points, and a test set with 564 time points. Each data set contains the same number of time series.
The model has several hyper-parameters: the window size T ∈ {25, 35, 45, 60}, and the hidden state and cell state vector sizes of the encoder and decoder LSTM m = n ∈ {32, 64, 128}. MSE and MAE (mean absolute error) are used as error criteria, and a mini-batch stochastic gradient descent algorithm is used to optimize the model during training, with a learning rate of 5e-4.
Finally, the models are trained using a grid search, and the hyper-parameters that achieve the best result on the validation set are taken as the optimal parameters of each model. Prediction is then performed on the test set, and the errors on the test set are accumulated using MSE.
In experiments, to distinguish LSTM predicted using single sequences from multiple sequences, the designation LSTM-Un was used for single sequences and LSTM-Mul was used for multiple sequences.
The experimental results comparing the proposed model with the baseline models are summarized in a table in the original document (reproduced there as images and not recoverable here).
The experimental results show that the error of the model proposed by the invention is much smaller than that of the three single-sequence models and the two multi-sequence models: under MSE it is 98.26% better than the best VAR model, and under MAE it is 74.40% better than the best VAR model. The model therefore has very high prediction accuracy, which demonstrates the effectiveness of the attention-based multi-time-series prediction model.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, the embodiments may refer to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for predicting cluster resources based on an attention mechanism, comprising:
s1: taking the first hidden layer state at the last moment, all time series data belonging to one deployment unit with the target instance, all time series data belonging to one host unit with the target instance and the target time sequence at the historical moment as input of an input attention layer to obtain a first input vector;
the step S1 specifically comprises the following steps:
s11: taking the state of the first hidden layer at the last moment and all time series data belonging to the same deployment unit with the target instance as the input of the attention layer of the deployment unit to obtain the output vector of the attention layer of the deployment unit;
s12: taking the state of the first hidden layer at the last moment and all time series data belonging to the same host unit as the target instance as the input of the attention layer of the host unit to obtain an attention output vector of the host unit;
s13: taking the state of the first hidden layer at the last moment and the target time sequence at the historical moment as the input of the autocorrelation attention layer to obtain an autocorrelation attention layer output vector;
s14: merging the deployment unit attention layer output vector, the host unit attention output vector, and the autocorrelation attention layer output vector as a first input vector;
S2: inputting the first input vector to an LSTM coder to obtain a current first hidden layer state;
s3: inputting the current first hidden layer state and the second hidden layer state at the previous moment into a time correlation attention layer to obtain a context vector;
s4: inputting the context vector, the second hidden layer state at the last moment and the target time sequence at the historical moment to an LSTM decoder to obtain the current second hidden layer state;
s5: and linearly transforming the current second hidden layer state and the context vector to obtain a predicted value.
2. The method for predicting cluster resources based on an attention mechanism according to claim 1, wherein step S11 specifically includes:
calculating a first attention weight based on the first hidden layer state at the previous time and all time series data belonging to the same deployment unit as the target instance;
calculating a normalized deployment unit attention weight using a softmax function based on the first attention weight;
calculating a deployment unit attention layer output vector based on the first hidden layer state at the previous time, all time series data belonging to one deployment unit with the target instance and the normalized deployment unit attention weight;
The step S12 specifically includes:
calculating a first-order time sequence correlation coefficient of each time sequence data belonging to one host unit with the target instance relative to the historical target time sequence, and obtaining static time correlation weights of all time sequences and the historical target time sequence in the corresponding host unit;
calculating a second attention weight based on the first hidden layer state and all time-series data of the target instance belonging to one host unit at the previous time;
obtaining the attention weight of the host unit based on the static time correlation weight and the second attention weight, and normalizing to obtain the normalized attention weight of the host unit;
calculating a host unit attention output vector based on the first hidden layer state at the previous time, all time series data belonging to one host unit with the target instance, the target time sequence at the historical time and the normalized host unit attention weight;
the step S13 specifically includes:
calculating correlation coefficients between historical moment target time sequences in different time windows, and obtaining corresponding autocorrelation weights;
calculating a third attention weight based on the first hidden layer state at the previous time and the target time sequence at the different historical time;
Obtaining the attention weight of the autocorrelation unit based on the autocorrelation weight and the third attention weight, and normalizing to obtain the attention weight of the normalized autocorrelation unit;
an autocorrelation attention layer output vector is calculated based on the first hidden layer state at the last time, the target timing at the historical time, and the normalized autocorrelation unit attention weights.
3. The method for predicting cluster resources based on an attention mechanism according to any one of claims 1 to 2, wherein step S3 specifically includes:
calculating the time attention layer weight based on the state of the second hidden layer at the previous moment, and normalizing to obtain the normalized time attention layer weight;
a context vector is calculated based on the current first hidden layer state and the normalized temporal attention layer weight.
4. A cluster resource prediction apparatus based on an attention mechanism, comprising:
the first input vector calculation module is used for taking the first hidden layer state at the last moment, all time series data belonging to one deployment unit with the target instance, all time series data belonging to one host unit with the target instance and the target time sequence at the historical moment as input of the input attention layer to obtain a first input vector;
The first input vector calculation module specifically includes:
the first computing unit is used for taking the state of the first hidden layer at the previous moment and all time series data which belong to the same deployment unit as the target instance as the input of the attention layer of the deployment unit to obtain the output vector of the attention layer of the deployment unit;
the second calculating unit is used for taking the state of the first hidden layer at the last moment and all time series data which belong to the same host unit as the target instance as the input of the attention layer of the host unit to obtain an attention output vector of the host unit;
the third calculation unit is used for taking the first hidden layer state at the last moment and the target time sequence at the historical moment as the input of the autocorrelation attention layer to obtain an autocorrelation attention layer output vector;
a merging unit configured to merge the deployment unit attention layer output vector, the host unit attention output vector, and the autocorrelation attention layer output vector as a first input vector;
the first hidden layer state calculation module is used for inputting the first input vector to an LSTM coder to obtain a current first hidden layer state;
the context vector calculation module is used for inputting the current first hidden layer state and the second hidden layer state at the previous moment into the time correlation attention layer to obtain a context vector;
The second hidden layer state calculation module is used for inputting the context vector, the second hidden layer state at the last moment and the target time sequence at the historical moment to the LSTM decoder to obtain the current second hidden layer state;
and the linear transformation module is used for carrying out linear transformation on the current second hidden layer state and the context vector to obtain a predicted value.
5. The attention-mechanism-based cluster resource prediction device of claim 4, wherein the first calculation unit specifically comprises:
a first attention weight calculation subunit, configured to calculate a first attention weight based on the first hidden layer state at the previous time and all time series data belonging to the same deployment unit as the target instance;
a first normalized weight calculation subunit, configured to calculate a normalized deployment unit attention weight from the first attention weight using a softmax function;
and a first attention layer output vector calculation subunit, configured to calculate the deployment unit attention layer output vector based on the first hidden layer state at the previous time, all time series data belonging to the same deployment unit as the target instance, and the normalized deployment unit attention weight;
the second calculation unit specifically comprises:
a static time correlation weight calculation subunit, configured to calculate a first-order time series correlation coefficient between each time series belonging to the same host unit as the target instance and the historical target time series, so as to obtain static time correlation weights between all time series in the corresponding host unit and the historical target time series;
a second attention weight calculation subunit, configured to calculate a second attention weight based on the first hidden layer state at the previous time and all time series data belonging to the same host unit as the target instance;
a second normalized weight calculation subunit, configured to obtain a host unit attention weight based on the static time correlation weights and the second attention weight, and to normalize the host unit attention weight to obtain a normalized host unit attention weight;
and a second attention layer output vector calculation subunit, configured to calculate the host unit attention layer output vector based on the first hidden layer state at the previous time, all time series data belonging to the same host unit as the target instance, the historical target time series, and the normalized host unit attention weight;
the third calculation unit specifically comprises:
an autocorrelation weight calculation subunit, configured to calculate correlation coefficients among the historical target time series over different time windows, so as to obtain corresponding autocorrelation weights;
a third attention weight calculation subunit, configured to calculate a third attention weight based on the first hidden layer state at the previous time and the target time series at different historical times;
a third normalized weight calculation subunit, configured to obtain an autocorrelation unit attention weight based on the autocorrelation weights and the third attention weight, and to normalize the autocorrelation unit attention weight to obtain a normalized autocorrelation unit attention weight;
and a third attention layer output vector calculation subunit, configured to calculate the autocorrelation attention layer output vector based on the first hidden layer state at the previous time, the historical target time series, and the normalized autocorrelation unit attention weight.
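The three calculation units of claim 5 differ mainly in how the raw attention scores are reweighted before normalization. The sketch below shows one plausible reading in the same Python/NumPy style: an additive (Bahdanau-style) score function, which the claim does not fix, and the Pearson coefficient standing in for the "first-order time series correlation coefficient"; the function names, the window length `win`, and the parameters `v` and `W` are all hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_scores(h_prev, candidates, v, W):
    """Additive score of each candidate series (or window) against the encoder state."""
    return np.array([v @ np.tanh(W @ np.concatenate([h_prev, c])) for c in candidates])

def deployment_unit_weights(h_prev, deploy_series, v, W):
    """First calculation unit: plain softmax over the attention scores."""
    return softmax(attn_scores(h_prev, deploy_series, v, W))

def host_unit_weights(h_prev, host_series, target_hist, v, W):
    """Second calculation unit: scores modulated by a static correlation between
    each co-hosted series and the historical target series, then normalized."""
    static_w = np.array([np.corrcoef(s, target_hist)[0, 1] for s in host_series])
    return softmax(attn_scores(h_prev, host_series, v, W) * static_w)

def autocorrelation_weights(h_prev, target_hist, win, v, W):
    """Third calculation unit: correlations between windows of the target's own
    history reweight the attention scores computed over those windows."""
    windows = [target_hist[i:i + win] for i in range(len(target_hist) - win + 1)]
    auto_w = np.array([np.corrcoef(w, windows[-1])[0, 1] for w in windows])
    return softmax(attn_scores(h_prev, windows, v, W) * auto_w)
```

Each weight vector would then form the corresponding attention layer output vector as a weighted sum over the series (or history windows) it was computed from, e.g. `w @ np.stack(host_series)` for the host unit.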
6. The attention-mechanism-based cluster resource prediction device according to any one of claims 4 to 5, wherein the context vector calculation module specifically comprises:
a fourth normalized weight calculation subunit, configured to calculate a temporal attention layer weight based on the second hidden layer state at the previous time, and to normalize the temporal attention layer weight to obtain a normalized temporal attention layer weight;
and a context vector calculation subunit, configured to calculate the context vector based on the current first hidden layer state and the normalized temporal attention layer weight.
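Claim 6 amounts to a standard temporal-attention read-out over encoder hidden states. A minimal sketch under the same assumptions as above (additive score function; the name `temporal_context` and the parameters `v` and `W` are illustrative, not from the patent):

```python
import numpy as np

def temporal_context(s_prev, encoder_states, v, W):
    """Score each encoder hidden state against the previous decoder state,
    normalize with softmax, and return the weighted sum as the context vector."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([s_prev, h]))
                       for h in encoder_states])
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                      # normalized temporal attention weights
    return weights @ np.stack(encoder_states)  # context vector
```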
CN201910413227.1A 2019-05-17 2019-05-17 Cluster resource prediction method and device based on attention mechanism Active CN110222840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910413227.1A CN110222840B (en) 2019-05-17 2019-05-17 Cluster resource prediction method and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910413227.1A CN110222840B (en) 2019-05-17 2019-05-17 Cluster resource prediction method and device based on attention mechanism

Publications (2)

Publication Number Publication Date
CN110222840A CN110222840A (en) 2019-09-10
CN110222840B true CN110222840B (en) 2023-05-05

Family

ID=67821396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910413227.1A Active CN110222840B (en) 2019-05-17 2019-05-17 Cluster resource prediction method and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110222840B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909046B (en) * 2019-12-02 2023-08-11 上海舵敏智能科技有限公司 Time-series abnormality detection method and device, electronic equipment and storage medium
CN111638958B (en) * 2020-06-02 2024-04-05 中国联合网络通信集团有限公司 Cloud host load processing method and device, control equipment and storage medium
CN112863695B (en) * 2021-02-22 2024-08-02 西京学院 Quantum attention mechanism-based two-way long-short-term memory prediction model and extraction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017097693A (en) * 2015-11-26 2017-06-01 Kddi株式会社 Data prediction device, information terminal, program, and method performing learning with data of different periodic layer
CN107730087A (en) * 2017-09-20 2018-02-23 平安科技(深圳)有限公司 Forecast model training method, data monitoring method, device, equipment and medium
CN108182260A (en) * 2018-01-03 2018-06-19 华南理工大学 A kind of Multivariate Time Series sorting technique based on semantic selection
CN109685252A (en) * 2018-11-30 2019-04-26 西安工程大学 Building energy consumption prediction technique based on Recognition with Recurrent Neural Network and multi-task learning model
CN109740419A (en) * 2018-11-22 2019-05-10 东南大学 A kind of video behavior recognition methods based on Attention-LSTM network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning
CN108304846B (en) * 2017-09-11 2021-10-22 腾讯科技(深圳)有限公司 Image recognition method, device and storage medium
CN108154435A (en) * 2017-12-26 2018-06-12 浙江工业大学 A kind of stock index price expectation method based on Recognition with Recurrent Neural Network
CN108090558B (en) * 2018-01-03 2021-06-08 华南理工大学 Automatic filling method for missing value of time sequence based on long-term and short-term memory network
CN108491680A (en) * 2018-03-07 2018-09-04 安庆师范大学 Drug relationship abstracting method based on residual error network and attention mechanism
CN108804495B (en) * 2018-04-02 2021-10-22 华南理工大学 Automatic text summarization method based on enhanced semantics
CN109697304A (en) * 2018-11-19 2019-04-30 天津大学 A kind of construction method of refrigeration unit multi-sensor data prediction model

Also Published As

Publication number Publication date
CN110222840A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN109587713B (en) Network index prediction method and device based on ARIMA model and storage medium
CN110222840B (en) Cluster resource prediction method and device based on attention mechanism
CN110941928B (en) Rolling bearing residual life prediction method based on dropout-SAE and Bi-LSTM
CN110309603B (en) Short-term wind speed prediction method and system based on wind speed characteristics
CN101620045B (en) Method for evaluating reliability of stepping stress quickened degradation experiment based on time sequence
CN109886464B (en) Low-information-loss short-term wind speed prediction method based on optimized singular value decomposition generated feature set
CN112434848B (en) Nonlinear weighted combination wind power prediction method based on deep belief network
CN105071983A (en) Abnormal load detection method for cloud calculation on-line business
US10161269B2 (en) Output efficiency optimization in production systems
CN112417028A (en) Wind speed time sequence characteristic mining method and short-term wind power prediction method
CN104199870A (en) Method for building LS-SVM prediction model based on chaotic search
CN112766603A (en) Traffic flow prediction method, system, computer device and storage medium
CN116169670A (en) Short-term non-resident load prediction method and system based on improved neural network
CN113837434A (en) Solar photovoltaic power generation prediction method and device, electronic equipment and storage medium
CN115829157A (en) Chemical water quality index prediction method based on variational modal decomposition and auto former model
CN114415488A (en) Atomic clock error data anomaly detection and correction method and system
CN112612781A (en) Data correction method, device, equipment and medium
CN110766215A (en) Wind power climbing event prediction method based on feature adaptive selection and WDNN
CN110222386A (en) A kind of planetary gear degenerate state recognition methods
Zhu et al. Wind Speed Short-Term Prediction Based on Empirical Wavelet Transform, Recurrent Neural Network and Error Correction
CN116192665B (en) Data processing method, device, computer equipment and storage medium
CN116992757A (en) Wellhead pressure prediction method and device based on deep learning and rolling optimization
CN116011655A (en) Load ultra-short-term prediction method and system based on two-stage intelligent feature engineering
CN114971062A (en) Photovoltaic power prediction method and device
CN115048856A (en) Method for predicting residual life of rolling bearing based on MS-ALSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant