CN115659411A

CN115659411A - Method and device for data analysis

Info

Publication number: CN115659411A
Application number: CN202211175296.1A
Authority: CN
Inventors: 张静; 张宪波
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2022-09-26
Filing date: 2022-09-26
Publication date: 2023-01-31

Abstract

The invention discloses a method and a device for data analysis, and relates to the technical field of computers. One embodiment of the method comprises: acquiring a periodic label of an index to be analyzed, and, when the index to be analyzed is determined to be periodic according to the periodic label, determining a first historical time period within a historical period that corresponds to a prediction time period; determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index in the first historical time period; determining a second prediction time sequence of the index in the prediction time period based on the time sequence data of the index in the historical period; and determining the prediction time sequence of the index in the prediction time period according to the first and second prediction time sequences. This embodiment enables comprehensive prediction of various indexes and improves prediction accuracy, so that system performance can be assessed in advance, potential safety hazards can be found early, and system stability is ensured.

Description

Method and device for data analysis

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for data analysis.

Background

With the development of computer and internet technologies, internet services have brought great convenience. A service system usually involves many service indexes, and the data of such indexes often has a safe upper bound; keeping the index data below that bound is crucial to the stability of the service system. As services grow and data volume increases, such indexes may rise gradually and even exceed the safe upper bound. For index data that has reached or is approaching the bound, an administrator usually needs to perform operations such as capacity expansion and database cleanup to keep the system safe and stable.

The prior art relies on the experience of operation and maintenance experts and generally uses alarm thresholds: the need for capacity expansion or cleanup is discovered only when the index data reaches the threshold or approaches the safe upper bound. This approach cannot control how far in advance the warning is obtained and may not leave the administrator enough time to act.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for data analysis in which a prediction scheme is selected according to the period label of the index to be analyzed, and a weighting method increases the weight of the data in the corresponding time period that is closer to the prediction time period. This enables comprehensive prediction of various indexes and improves prediction accuracy, so that system performance can be assessed in advance, potential safety hazards can be found early, and system stability is ensured.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a data analysis method including:

acquiring a periodic label of an index to be analyzed, and determining a first historical time period corresponding to a prediction time period in a historical period under the condition that the index to be analyzed is determined to have periodicity according to the periodic label; the history cycle refers to a cycle before the prediction period;

determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first historical time period;

determining a second predicted time series of the index to be analyzed in the prediction period based on the time series data of the index to be analyzed in the historical period;

and determining the prediction time sequence of the index to be analyzed in the prediction time period according to the first prediction time sequence and the second prediction time sequence.

Optionally, acquiring the periodic label of the index to be analyzed includes: analyzing the historical time sequence of the index to be analyzed by using an aggregation channel features (ACF) detection algorithm to obtain a candidate period set of the index to be analyzed; and, in the case that the elements in the candidate period set are irregular, setting for the index to be analyzed a first period label indicating no period.

Optionally, the method further comprises:

under the condition that the elements in the candidate period set are regular, cutting the historical time sequence according to the value of each element in the candidate period set, and removing the time sequence of less than one period in the historical time sequence;

identifying the number of wave crests of each cut time sequence to obtain a wave crest number sequence, identifying the number of wave troughs of each cut time sequence to obtain a wave trough number sequence, and respectively carrying out stability test on the wave crest number sequence and the wave trough number sequence;

under the condition that the wave crest number sequence and the wave trough number sequence both pass stability inspection, setting a second period label for indicating a strong period for the index to be analyzed;

and under the condition that the wave crest number sequence or the wave trough number sequence does not pass stability test, setting a third period label for indicating a weak period for the index to be analyzed.

Optionally, the method further comprises: determining any one or more of the following characteristics of the index to be analyzed, and inputting the determined characteristics into a pre-trained classification model to determine the shape label of the index to be analyzed:

the SBD (shape-based distance) cross-correlation distance feature; the proportion of sliding windows in which the absolute difference between the median and the mean of the historical time sequence is larger than a preset value; the least-squares linear regression feature of the historical time sequence; the autocorrelation value at the lag of an autoregressive model; spectral statistics of the absolute Fourier transform; and sample entropy.

Optionally, the method further comprises: acquiring a shape label of the index to be analyzed;

determining a first history period within the history cycle corresponding to the prediction period, comprising:

under the condition that multiple types of time sequence curves exist in the historical period of the index to be analyzed according to the shape label, taking the time period corresponding to one or more time sequence curves of the type corresponding to the prediction time period in the historical period as the first historical time period;

and, in the case that it is determined from the shape label that the index to be analyzed has only one type of time sequence curve in the historical period, taking the one or more statistical periods in the historical period that are closest in time to the prediction time period as the first historical time period.

Optionally, the method further comprises:

under the condition that the to-be-analyzed index is determined to have strong periodicity according to the period label, determining a first prediction time sequence and a second prediction time sequence of the to-be-analyzed index in the prediction time period by using a Holt-Winters model;

and under the condition that the index to be analyzed is determined to have a certain periodicity according to the period label, determining a first prediction time sequence and a second prediction time sequence of the index to be analyzed in the prediction time period by using a DeepAR model.

Optionally, the method further comprises: under the condition that the index to be analyzed is determined not to have periodicity according to the period label, determining the prediction time sequence of the index to be analyzed in the prediction time period by using an SVR (support vector regression) model.

Optionally, the method further comprises: obtaining a dimension label of the index to be analyzed; the dimension label is determined according to the value range of the index to be analyzed; the model inputs for the model include: the shape label and the dimension label of the index to be analyzed and the time sequence data of the index to be analyzed in the historical period.
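The label-dependent model choice described above can be sketched as a small dispatch function. This is a minimal sketch rather than the claimed implementation: the `PeriodLabel` enum and both function names are hypothetical, and mapping "a certain periodicity" to the weak-period label is an assumption. Only the model family names come from the text.

```python
from enum import Enum


class PeriodLabel(Enum):
    """Hypothetical encoding of the three period labels named in the text."""
    NONE = "no period"
    STRONG = "strong period"
    WEAK = "weak period"


def choose_model(label: PeriodLabel) -> str:
    """Return the model family named in the text for a given period label."""
    if label is PeriodLabel.STRONG:
        return "Holt-Winters"
    if label is PeriodLabel.WEAK:
        # Assumption: "a certain periodicity" is read as the weak-period label.
        return "DeepAR"
    return "SVR"


def needs_two_forecasts(label: PeriodLabel) -> bool:
    """Periodic indexes get a first and a second forecast; aperiodic ones get one."""
    return label is not PeriodLabel.NONE
```

A caller would look up the model family first and only then decide whether a first historical time period has to be determined at all.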

According to a second aspect of embodiments of the present invention, there is provided an apparatus for data analysis, including:

the label acquisition module is used for acquiring a periodic label of an index to be analyzed and determining a first historical time period corresponding to a prediction time period in a historical period under the condition that the index to be analyzed is determined to have periodicity according to the periodic label; the history cycle refers to a cycle before the prediction period;

the sequence prediction module is used for determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first historical time period; determining a second prediction time series of the index to be analyzed in the prediction time period based on the time series data of the index to be analyzed in the historical period;

and the result fitting module is used for determining the prediction time sequence of the index to be analyzed in the prediction time period according to the first prediction time sequence and the second prediction time sequence.

Optionally, the acquiring, by the label acquisition module, of the periodic label of the index to be analyzed includes: analyzing the historical time sequence of the index to be analyzed by using an aggregation channel features (ACF) detection algorithm to obtain a candidate period set of the index to be analyzed; and, in the case that the elements in the candidate period set are irregular, setting for the index to be analyzed a first period label indicating no period.

Optionally, the tag obtaining module is further configured to:

determining any one or more of the following characteristics of the index to be analyzed, and inputting the determined characteristics into a pre-trained classification model to determine the shape label of the index to be analyzed:

Optionally, the tag obtaining module is further configured to: acquiring a shape label of the index to be analyzed;

Optionally, the sequence prediction module is further configured to:

Optionally, the sequence prediction module is further configured to: under the condition that the index to be analyzed is determined not to have periodicity according to the period label, determine the prediction time sequence of the index to be analyzed in the prediction time period by using an SVR (support vector regression) model.

Optionally, the tag obtaining module further includes: obtaining a dimension label of the index to be analyzed; the dimension label is determined according to the value range of the index to be analyzed; the model inputs for the model include: the shape label and the dimension label of the index to be analyzed and the time sequence data of the index to be analyzed in the historical period.

According to a third aspect of embodiments of the present invention, there is provided an electronic device for data analysis, comprising:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.

According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable medium, on which a computer program is stored, which when executed by a processor, performs the method provided by the first aspect of embodiments of the present invention.

One embodiment of the above invention has the following advantages or benefits: a prediction scheme is selected according to the period label of the index to be analyzed, and a weighting method increases the weight of the data in the corresponding time period that is closer to the prediction time period. This enables comprehensive prediction of various indexes and improves prediction accuracy, so that system performance can be assessed in advance, potential safety hazards can be found early, and system stability is ensured.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of a main flow of a method of data analysis according to an embodiment of the invention;

FIG. 2 is a schematic flow chart of a method of data analysis in accordance with some embodiments of the invention;

FIG. 3 is a schematic illustration of a method of data analysis in an alternative embodiment of the invention;

FIG. 4 is a schematic diagram comparing SVM and SVR models;

FIG. 5 is a schematic diagram of the main blocks of an apparatus for data analysis according to an embodiment of the present invention;

FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

According to an aspect of an embodiment of the present invention, a method of data analysis is provided.

Fig. 1 is a schematic diagram of a main flow of a method of data analysis according to an embodiment of the present invention, and as shown in fig. 1, the method of data analysis includes step S101, step S102, step S103, and step S104.

Step S101, acquiring a cycle label of an index to be analyzed, and determining a first historical time period corresponding to a prediction time period in a historical cycle under the condition that the index to be analyzed is determined to have periodicity according to the cycle label; the history cycle refers to a cycle before the prediction period.

Disk space is a resource used by platform applications. A disk is a storage device that stores data using magnetic recording technology. Disks are the main storage medium of a computer: they can hold a large amount of binary data and retain it after power is lost. Early computers used floppy disks, whereas hard disks are the disks commonly used today. A tablespace is a storage structure in Oracle: it is a logical space for storing database objects (such as data files) and is the largest logical unit in which Oracle stores information, containing logical structures such as segments, extents, and data blocks. A tablespace is space set aside in the database for storing database objects, and a database can consist of multiple tablespaces. Oracle can be tuned through tablespaces. Existing disk space alarm prediction techniques perform only moderately: most rely on a single prediction index, do not combine algorithms, and predict with poor accuracy. The index to be analyzed in the embodiment of the invention is determined by the actual application scenario; it may be disk space utilization, tablespace utilization, and the like, or the transaction volume, disk capacity, file system utilization, memory utilization, CPU utilization, and the like of a service system.

The period label indicates whether the index to be analyzed is periodic, or how strong its periodicity is. The types of period label can be defined according to the actual situation: for example, two types (no period, period) according to whether periodicity exists, or three types (no period, strong period, weak period) according to the strength of the periodicity.

The prediction period may be a past period. For example, when the index data for the last 2 months of last year is predicted from the index data for the first 10 months of last year, so that prediction accuracy can be judged by comparing the prediction result with the actual index data, the last 2 months of last year serve as the prediction period. The prediction period may also be a future period. For example, when data for the next day is predicted from data of the past 7 days, that next day serves as the prediction period. The history cycle refers to one or more cycles before the prediction period; for example, the history cycle is the one cycle that precedes the prediction period and is closest to it in time.

The first historical period corresponding to the prediction period in the historical cycle refers to a period in which the shape type of the time curve of the index to be analyzed is the same as its shape type in the prediction period. In some alternative embodiments, the first history period may be the period within the history cycle that has the same phase as the prediction period (the in-phase period for short). For example, when the cycle of the index to be analyzed is 7 days and the data of day 8 is predicted from the data of the preceding 7 days, day 8 is the prediction period and day 1 is taken as the first history period corresponding to it. In other alternative embodiments, the first history period may be a period within the history cycle that includes the in-phase period; in the foregoing example, days 1 and 2 may be taken as the first history period. In still other alternative embodiments, the first history period may be the period within the history cycle that is closest in time to the prediction period; in the foregoing example, day 7 is taken as the first history period.

For example, to determine the first history period corresponding to the prediction period in the history cycle accurately, the method of the embodiment of the present invention further includes acquiring the shape label of the index to be analyzed. The shape label represents the shape type of the time curve of the index to be analyzed; different shape types have different curve shapes. Determining the first historical period within the historical cycle corresponding to the prediction period comprises: when it is determined from the shape label that the index to be analyzed has multiple types of time sequence curves in the historical cycle, taking the period corresponding to one or more time sequence curves of the same type as the prediction period as the first historical period; and when it is determined from the shape label that the index has only one type of time sequence curve in the historical cycle, taking the one or more statistical periods in the historical cycle that are closest in time to the prediction period as the first historical period. The statistical period can be set according to actual conditions, for example one hour, one day, or one week.

For example, when the period of the index to be analyzed is 7 days and, within the 7 days, the curve shape of each day is irregular or the curve shapes of all days are the same, the shape label of the index may be marked as "shape 1 × 7"; when the data of day 8 is predicted from the data of the preceding 7 days, day 8 is the prediction period, and day 7, or days 6 to 7, is taken as the first history period corresponding to the prediction period. When the period of the index to be analyzed is 7 days and the curve shape of the first 2 days differs from that of the last 5 days (see FIG. 3), the shape label of the index may be marked as "shape 2+5"; when the data of day 8 is predicted from the data of the preceding 7 days, day 8 is the prediction period, and day 1, day 2, or days 1 to 2 is taken as the first history period corresponding to the prediction period.

In some embodiments of the present invention, the period label is determined in real time while the method is executed, so that the period label is up to date. In other embodiments of the present invention, the period label is predetermined; for example, it is read from a database in step S101 and may be updated periodically during actual use, which reduces the amount of calculation. The period label in the embodiment of the invention can be obtained as follows:
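The two examples above can be traced with a short sketch. It assumes a hypothetical representation in which a shape label is a list of segment lengths in days ("2+5" becomes `[2, 5]`, and "1 × 7" becomes the single segment `[7]` because all days share one curve type), and in which the history cycle is numbered from day 1. The function name and the `recent` parameter are illustrative and not taken from the source.

```python
def first_history_days(segments, predicted_day, recent=1):
    """Pick the days of the history cycle used as the first history period.

    segments      -- segment lengths of the shape label, e.g. [2, 5] or [7]
    predicted_day -- day number of the prediction period (day 8 follows a 7-day cycle)
    recent        -- how many of the latest days to use when only one curve type exists
    """
    cycle_len = sum(segments)
    if len(segments) == 1:
        # One curve type only: take the days closest in time to the prediction period.
        return list(range(cycle_len - recent + 1, cycle_len + 1))
    # Several curve types: find the segment that contains the in-phase day.
    phase_day = (predicted_day - 1) % cycle_len + 1
    start = 1
    for length in segments:
        end = start + length - 1
        if start <= phase_day <= end:
            return list(range(start, end + 1))
        start = end + 1
    raise ValueError("segments do not cover the cycle")
```

With `[7]` and day 8 the sketch yields day 7 (or days 6 to 7 when `recent=2`); with `[2, 5]` and day 8 it yields days 1 to 2, matching the two cases in the text.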

analyzing the historical time sequence of the index to be analyzed by adopting an Aggregation Channel Features (ACF) algorithm to obtain a candidate period set of the index to be analyzed; and under the condition that elements in the candidate period set are irregular, setting a first period label for indicating no period for the index to be analyzed.

The ACF algorithm exhaustively enumerates all possible periods to form a candidate period set. For the candidate period set computed by the ACF algorithm, peak testing is combined with SBD (shape-based distance) cross-correlation distance calculation, and the L-BFGS (limited-memory BFGS) optimization method is used to search for the optimal period number, that is, the one that minimizes the objective function RMSE (root mean square error). The period number consists of multiple values, which gives a multi-period identification technique. The key to multi-period identification is to introduce a correction term that gives more weight to the most valuable memory period in the historical data. The search result has multiple values. For example, the period numbers of two indexes are {1,7}, {1,2,5,7}, where {1,7} indicates that the index has 2 period values, 1 day and 7 days; {2,5,7} indicates that the index has 3 period values, 2 days, 5 days, and 7 days. A corresponding baseline model is adapted according to the period number search results of the different indexes. Taking {2,5,7} as an example, the period of the input sequence can be identified as 7=5+2, as shown in FIG. 3, where 7 days is the long period and 5 days and 2 days are two short sub-periods. An index whose candidate period set computed by the ACF algorithm contains irregular values (for example, values of unrelated sizes such as 5 min, 12 min, 30 min, and 1 day) is determined to have no period.
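The text does not spell out how the candidate period set is enumerated. The sketch below is one possible reading, under the loudly stated assumption that the scan is autocorrelation-based: every lag up to `max_lag` is tried, and lags whose autocorrelation is a local maximum above a threshold are kept. The function names and the `threshold` parameter are hypothetical, and the peak testing, SBD distance, and L-BFGS refinement steps are not reproduced.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (0.0 for a constant series)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum((series[i] - mean) * (series[i + lag] - mean)
                     for i in range(n - lag))
    return covariance / variance


def candidate_periods(series, max_lag, threshold=0.5):
    """Enumerate every lag in 2..max_lag and keep those that look like periods."""
    if max_lag + 1 >= len(series):
        raise ValueError("series too short for the requested max_lag")
    acf = [autocorrelation(series, lag) for lag in range(max_lag + 2)]
    return [lag for lag in range(2, max_lag + 1)
            if acf[lag] >= threshold
            and acf[lag] >= acf[lag - 1]
            and acf[lag] >= acf[lag + 1]]
```

For a series that repeats every 4 samples this returns the base period and its multiples within `max_lag`; for a constant series it returns an empty set.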

In the case of regularity of the elements in the candidate period set, the periodicity strength of the index to be analyzed can be further determined through peak value detection. Specifically, the historical time sequence is cut according to each element value in the candidate period set, and the time sequence of less than one period in the historical time sequence is removed; identifying the number of wave crests of each cut time sequence to obtain a wave crest number sequence, identifying the number of wave troughs of each cut time sequence to obtain a wave trough number sequence, and respectively carrying out stability test on the wave crest number sequence and the wave trough number sequence; under the condition that the wave crest number sequence and the wave trough number sequence both pass stability inspection, setting a second period label for indicating a strong period for the index to be analyzed; and under the condition that the wave crest number sequence or the wave trough number sequence does not pass stability test, setting a third period label for indicating a weak period for the index to be analyzed. The stability checking condition can be selectively set according to actual conditions.
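The cutting and peak/trough counting steps above can be sketched as follows. The source leaves the stability test open, so this sketch assumes a simple hypothetical criterion: the counts are "stable" when their spread (maximum minus minimum) does not exceed `max_spread`. Dropping the incomplete period from the tail of the series is also an assumption, and all function names are illustrative.

```python
def split_full_periods(series, period):
    """Cut the series into consecutive full periods, dropping the incomplete tail."""
    usable = len(series) - len(series) % period
    return [series[i:i + period] for i in range(0, usable, period)]


def count_peaks(segment):
    """Count strict local maxima inside one cut segment."""
    return sum(1 for i in range(1, len(segment) - 1)
               if segment[i - 1] < segment[i] > segment[i + 1])


def count_troughs(segment):
    """Count strict local minima inside one cut segment."""
    return sum(1 for i in range(1, len(segment) - 1)
               if segment[i - 1] > segment[i] < segment[i + 1])


def is_stable(counts, max_spread=0):
    """Hypothetical stability test: the counts may vary by at most max_spread."""
    return max(counts) - min(counts) <= max_spread


def period_strength(series, period, max_spread=0):
    """Return 'strong period' if peak and trough counts are both stable, else 'weak period'."""
    segments = split_full_periods(series, period)
    if not segments:
        raise ValueError("series is shorter than one period")
    peaks = [count_peaks(s) for s in segments]
    troughs = [count_troughs(s) for s in segments]
    if is_stable(peaks, max_spread) and is_stable(troughs, max_spread):
        return "strong period"
    return "weak period"
```

A statistical stationarity test could replace `is_stable` without changing the rest of the flow.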

In the embodiment of the present invention, the shape label of the index to be analyzed may also be acquired, so that a first historical period whose curve type matches that of the prediction period can be determined from the time curve of the index, and the time sequence of the index in the prediction period can then be determined with a weighting method. Specifically, any one or more of the following features of the index to be analyzed are determined and input into a pre-trained classification model to determine the shape label: the SBD (shape-based distance) cross-correlation distance feature; the proportion of sliding windows in which the absolute difference between the median and the mean of the historical time series is larger than a preset value; the least-squares linear regression feature of the historical time series; the autocorrelation value at the lag of an autoregressive model (AR(lag)); spectral statistics of the absolute Fourier transform; and sample entropy. The classification model may be an SVM classification model.

Illustratively, multiple features are input into the SVM classification model, and the shape categories are divided into two classes, 1 × 7 and 5+2. Suppose the shape labels found in 7 days of data fall into 4 categories in total: 5+2, 1 × 7, 3+4, and 1+6. If the data of the index to be analyzed differs greatly between working days and rest days, the shape categories can be reduced to the two classes 1 × 7 and 5+2: 1 × 7 represents a minimum period of 1, and the minimum period of 3+4 and 1+6 is also 1, so predicting according to the minimum period achieves the prediction effect for them. For an index whose working-day and rest-day data differ greatly, the learning accuracy of a model is otherwise low; classifying the shape with the method of the embodiment of the invention before prediction improves the learning accuracy of the prediction model.
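Two of the listed features are simple enough to sketch directly. The code below is illustrative only: the function names, the `window` and `threshold` parameters, and the choice to use the sample index as the regression variable are assumptions, and the remaining features (SBD distance, AR lag autocorrelation, spectral statistics, sample entropy) are not shown.

```python
from statistics import mean, median


def median_mean_gap_ratio(series, window, threshold):
    """Share of sliding windows whose |median - mean| exceeds the preset threshold."""
    windows = [series[i:i + window] for i in range(len(series) - window + 1)]
    hits = sum(1 for w in windows if abs(median(w) - mean(w)) > threshold)
    return hits / len(windows)


def least_squares_slope(series):
    """Slope of the least-squares line fitted to (index, value) pairs."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = mean(series)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    return sxy / sxx
```

The resulting numbers would form part of the feature vector passed to the classification model.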

In consideration of the fact that the value ranges of the same index in different systems may be different, in order to improve the adaptability of the method of the embodiment of the invention, the dimension label of the index to be analyzed can be further obtained. For example, if the index range of the index to be analyzed in one system is 0 to 100, and the index range in another system is 0 to 1, different dimension labels may be set for the index to be analyzed in two different index ranges, respectively.
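A dimension label derived from the value range could look like the sketch below. The boundaries `(1.0, 100.0)` mirror the 0 to 1 and 0 to 100 ranges in the example, but the function name, the integer label values, and the catch-all third label are hypothetical.

```python
def dimension_label(series, boundaries=(1.0, 100.0)):
    """Map a series to a dimension label based on its largest absolute value.

    Label 1 covers values up to boundaries[0], label 2 up to boundaries[1], and
    so on; anything larger gets one extra label (an assumption, not from the source).
    """
    peak = max(abs(v) for v in series)
    for label, upper in enumerate(boundaries, start=1):
        if peak <= upper:
            return label
    return len(boundaries) + 1
```

The same index observed as a fraction in one system and as a percentage in another therefore receives two different labels.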

Illustratively, 10000 time sequences of the index to be analyzed are obtained. The computed statistical features and index labels of 8000 of the time sequences are passed to the SVM for training to obtain a classification model; the remaining 2000 time sequences serve as the test set, and their computed statistical features are passed to the SVM classification model to predict their classification labels. The statistical features may include any one or more of the following, each computed on the time series: 10 summary statistics (maximum, minimum, mean, variance, standard deviation, skewness, kurtosis, median, sum of squares); the sum of the absolute values of the first-order differences; whether the variance is greater than the standard deviation; the count of values greater than the mean; the count of values less than the mean; the position of the first maximum (where the position is index/length); the position of the first minimum (where the position is (index + 1)/length); whether any value appears repeatedly; whether the maximum appears multiple times; whether the minimum appears multiple times; the length of the longest continuous subsequence greater than the mean; the length of the longest continuous subsequence less than the mean; the mean of the absolute values of the first-order differences; the mean of the first-order differences; the percentage of unique values that appear more than once; a factor that is 1 if all values appear only once; the sum of all data points that appear more than once; the sum of all data values that appear more than once; the sum and the range; the quantiles; the coefficient of variation; the complexity (a measure of peaks and valleys); the ratio at which the absolute value of the difference between the mean and the median is greater than r times the standard deviation; the least-squares linear regression features; the ADF (augmented Dickey-Fuller) test values; the autocorrelation values of AR(lag); the spectral statistics of the absolute Fourier transform; the ratio of the standard deviation to the range; the count of peaks that are larger than their n neighbors on the left and right; the number of peaks with a high signal-to-noise ratio (SNR) found after smoothing with Ricker wavelets of widths from 1 to n; the count of crossings of the mean; the sample entropy; and the one-dimensional discrete Fourier coefficients based on the fast Fourier transform.
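A handful of the listed statistics can be computed with the standard library alone, as sketched below. The function name and the dictionary keys are hypothetical; treating values equal to the mean as "not above" is an assumption; and SVM training on the resulting features is not shown.

```python
def summarize(series):
    """Compute a few of the listed statistical features for one time series."""
    n = len(series)
    avg = sum(series) / n
    diffs = [b - a for a, b in zip(series, series[1:])]
    above = [v > avg for v in series]  # values equal to the mean count as not above

    # Longest run of consecutive values above the mean.
    longest = run = 0
    for flag in above:
        run = run + 1 if flag else 0
        longest = max(longest, run)

    return {
        "count_above_mean": sum(above),
        "longest_run_above_mean": longest,
        "mean_abs_change": sum(abs(d) for d in diffs) / len(diffs),
        "first_location_of_maximum": series.index(max(series)) / n,
        "mean_crossings": sum(1 for a, b in zip(above, above[1:]) if a != b),
    }
```

Each time series yields one such dictionary; its values, together with the index label, would make up one training row.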

In an alternative embodiment of the present invention, the index to be analyzed carries 3 labels (period, shape and dimension), and indexes can be divided into 12 classes according to the three labels, as shown in the following table:

Period	Shape	Dimension
Strong period	1 x 7	1
Strong period	1 x 7	2
Strong period	5 (working days) + 2 (rest days)	3
Strong period	5 (working days) + 2 (rest days)	4
Weak period	1 x 7	5
Weak period	1 x 7	6
Weak period	5 (working days) + 2 (rest days)	7
Weak period	5 (working days) + 2 (rest days)	8
No period	1 x 7	9
No period	1 x 7	10
No period	5 (working days) + 2 (rest days)	11
No period	5 (working days) + 2 (rest days)	12

Step S102, determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first history time period;

Step S103, determining a second prediction time series of the index to be analyzed in the prediction time period based on the time series data of the index to be analyzed in the history cycle.

In step S103, the time series of the index to be analyzed in the prediction period is determined from the time series data of the index to be analyzed in the history cycle; in practice, the time series data of several history cycles can be selected for prediction. The time curve of the index to be analyzed in the first history period has the same shape type as its time curve in the prediction period. Step S102 determines the time series in the prediction period from the time series data in the first history period, which increases the weight of the first-history-period data in the prediction process; compared with an approach that adds no such weight, this can improve the accuracy of the prediction.

In step S102 and step S103, a pre-trained model is used to determine the first prediction time series and the second prediction time series of the index to be analyzed within the prediction period. Optionally, when the index to be analyzed is determined to have strong periodicity according to the period label, the first prediction time series and the second prediction time series within the prediction period are determined by using a Holt-Winters model; when the index to be analyzed is determined to have a certain degree of periodicity (that is, a weak period) according to the period label, the first prediction time series and the second prediction time series within the prediction period are determined by using a DeepAR model. The Holt-Winters model in the embodiment of the invention adaptively selects the best fitting function (Bayesian ridge regression or Huber regression) according to the prediction effect on the incoming index, which can further improve the prediction effect of the model. In the embodiment of the invention, the dimension label of the index to be analyzed can also be obtained; when the pre-trained model is used to determine the first prediction time series and the second prediction time series within the prediction period, the model input comprises the shape label and the dimension label of the index to be analyzed and the time series data of the index to be analyzed in the history cycle. Using model input of multiple dimensions can further improve the accuracy of the prediction result.

The Holt-Winters method is a time series analysis and prediction method. It is suitable for non-stationary sequences with a linear trend and periodic fluctuation: it uses exponential smoothing (EMA) so that the model parameters continuously adapt to changes in the non-stationary sequence, and it makes short-term predictions of the future trend. The Holt-Winters method introduces a Winters period term (also called a seasonal term) on the basis of the Holt model. It can handle fixed-period fluctuations in time series such as monthly, quarterly and weekly data, and, by introducing several Winters terms, it can also handle the case in which multiple periods coexist. The method is divided into an additive model and a multiplicative model.
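The additive form of the Holt-Winters recursion (level, trend and one seasonal term) can be sketched as follows. This is a minimal textbook-style sketch, not the model of the embodiment: the initialisation scheme, the fixed smoothing parameters `alpha`, `beta` and `gamma`, and the function name are assumptions for illustration, and the adaptive choice of fitting function described above is not reproduced.

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full periods of history")

    # Simple initialisation (an assumption): level from the first period,
    # trend from the average change between the first two periods,
    # seasonal terms as deviations of the first period from its mean.
    first = series[:period]
    second = series[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / period ** 2
    seasonal = [value - level for value in first]

    # Exponential smoothing updates for level, trend and seasonal term.
    for t in range(period, len(series)):
        s = seasonal[t % period]
        previous_level = level
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (series[t] - level) + (1 - gamma) * s

    # Extrapolate the trend and reuse the latest seasonal terms.
    n = len(series)
    return [
        level + h * trend + seasonal[(n + h - 1) % period]
        for h in range(1, horizon + 1)
    ]


history = [10, 20, 30] * 4  # a purely periodic series with period 3
forecast = holt_winters_additive(history, period=3, alpha=0.5, beta=0.5,
                                 gamma=0.5, horizon=3)
```

A multiplicative variant would multiply by the seasonal term instead of adding it; which variant fits better depends on whether the seasonal amplitude grows with the level.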

DeepAR is an upgraded version of the autoregressive model. Given data from a past period, it predicts future data and outputs a probability distribution of the future data. Predictions for a future period are generated recursively by sampling; because each prediction is sampled from the probability distribution, a single run yields only one possible "trajectory", and an expected value has to be obtained by sampling repeatedly with the Monte Carlo method and averaging. Having the model output a probability distribution is particularly suitable for time series data with large uncertainty. Such data is usually noisy, so predicting future values directly with an algorithm can be inaccurate. For the DeepAR model, which predicts a probability distribution by maximizing the likelihood function of the future sequence, the inherent randomness of the data is reflected better: the model can predict not only a value but also future fluctuation, which is very helpful for time series operation and maintenance data where risk must be considered. Compared with predicting future data with a recurrent neural network such as LSTM, DeepAR does not simply output a single determined value but outputs the probability distribution of the predicted value, which brings two advantages: (1) many processes are inherently random, so compared with a model that outputs only one value, a probability distribution is closer to the nature of the process and gives higher prediction accuracy; (2) the uncertainty of the prediction and the associated risk can be assessed.
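The recursive sampling and Monte Carlo averaging described above can be illustrated with a toy stand-in. The sketch below does not implement DeepAR: a fixed Gaussian AR(1) step replaces the trained network, and the parameters `phi`, `c` and `sigma`, as well as the function names, are assumptions chosen only to show how sampled trajectories are averaged and how quantiles are read off them.

```python
import random


def sample_path(last_value, horizon, phi, c, sigma, rng):
    """Draw one possible future trajectory by recursive sampling."""
    path = []
    previous = last_value
    for _ in range(horizon):
        # Each step samples from the predicted distribution and feeds the
        # sample back in as the input for the next step.
        previous = rng.gauss(phi * previous + c, sigma)
        path.append(previous)
    return path


def quantile(values, q):
    """Empirical quantile by sorting (nearest-rank style)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def monte_carlo_forecast(last_value, horizon, n_samples=200,
                         phi=0.9, c=1.0, sigma=0.5, seed=0):
    """Average many sampled trajectories and report a 10%-90% band."""
    rng = random.Random(seed)
    paths = [sample_path(last_value, horizon, phi, c, sigma, rng)
             for _ in range(n_samples)]
    mean_path = [sum(p[h] for p in paths) / n_samples for h in range(horizon)]
    low = [quantile([p[h] for p in paths], 0.1) for h in range(horizon)]
    high = [quantile([p[h] for p in paths], 0.9) for h in range(horizon)]
    return mean_path, low, high


mean_path, low, high = monte_carlo_forecast(last_value=10.0, horizon=7)
```

The band between `low` and `high` is what allows the uncertainty of a prediction to be assessed, in addition to the point estimate given by `mean_path`.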

Using the DeepAR model to determine the first prediction time series and the second prediction time series of the index to be analyzed in the prediction period has the following advantages:

(1) Cold start prediction: a cold start occurs when predictions are needed for a time series with little or no historical data. Traditional methods such as ARIMA (Autoregressive Integrated Moving Average Model) or ES (exponential smoothing) rely entirely on the historical data of a single time series and are therefore generally less accurate in cold start situations. Taking the prediction of clothing goods (such as sports shoes) as an example, the neural-network-based DeepAR algorithm can learn the typical behavior of a newly released sports shoe from the sales patterns of other types of sports shoes at their first release; by learning the relationships among multiple related time series in the training data, DeepAR can provide more accurate predictions than existing algorithms.

(2) Probability prediction: DeepAR can generate both point predictions and probability predictions. Probability predictions are particularly well suited to applications such as operation and maintenance data, where a specific prediction quantile is more important than the most likely result.

(3) Multidimensional independent variables: additional independent variables may be supplied to the model.

(4) When multiple time series are predicted, the paper mentions that category encoding can be performed on each time series, and embedding learning is performed during training.

Step S104, determining the prediction time series of the index to be analyzed in the prediction period according to the first prediction time series and the second prediction time series. In practice, the prediction time series of the index to be analyzed in the prediction period can be determined by weighted summation. Exemplarily, assuming that the period of the index to be analyzed is 7 days and that the data of the 8th day is predicted from the data of the previous 7 days, then:

First prediction time series: $y'_1 = f(\mathrm{input} = [\mathrm{DATE}_1])$

Second prediction time series: $y'_2 = f(\mathrm{input} = [\mathrm{DATE}_2])$

Prediction time series of the index to be analyzed within the prediction period: $y'_{\mathrm{final}} = \left(y'_1 \times a_i + y'_2 \times b_i\right)_{i \in [1, T]}$

Wherein: $\mathrm{DATE}_1$ represents the time series of the index to be analyzed in the first history period, and $\mathrm{DATE}_2$ represents the time series of the index to be analyzed in the history cycle; $T$ represents the number of statistical periods included in the history cycle, i.e., the period of the index to be analyzed (when the index to be analyzed has multiple periods, the maximum period is meant here); $a_i$ represents the first weight of the first prediction time series in the final prediction time series, and $b_i$ represents the second weight of the second prediction time series in the final prediction time series. The value of $i$ depends on the actual situation: $i = 1$ when the first history period is the first statistical period in the history cycle, $i = 2$ when the first history period is the second statistical period in the history cycle, and so on up to $i = T$.
It should be noted that, when the first history period corresponds to a plurality of statistical periods in the history cycle, the first weights corresponding to those statistical periods may be set to the same value and the second weights corresponding to those statistical periods may likewise be set to the same value; alternatively, the plurality of statistical periods can be divided into a plurality of new first history periods, each corresponding to one statistical period, which simplifies the calculation.
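The weighted summation of step S104 can be sketched as follows. The function name, the 1-based index `i` and the example weight values are assumptions for illustration; the text does not state how the weights are chosen or whether they sum to 1.

```python
def combine_forecasts(first_forecast, second_forecast,
                      first_weights, second_weights, i):
    """Return y'_1 * a_i + y'_2 * b_i, point by point.

    `i` is the 1-based position of the first history period among the
    T statistical periods of the history cycle.
    """
    if len(first_forecast) != len(second_forecast):
        raise ValueError("both forecasts must cover the same prediction period")
    a_i = first_weights[i - 1]
    b_i = second_weights[i - 1]
    return [a_i * y1 + b_i * y2
            for y1, y2 in zip(first_forecast, second_forecast)]


# Hypothetical weights for a history cycle with T = 2 statistical periods.
first_weights = [0.75, 0.5]
second_weights = [0.25, 0.5]
final = combine_forecasts([10.0, 20.0], [20.0, 40.0],
                          first_weights, second_weights, i=1)
```

With these example weights the forecast based on the first history period dominates when `i = 1`, which mirrors the idea of giving the matching history period a larger weight.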

Optionally, the method further comprises: when the index to be analyzed is determined to have no periodicity according to the period label, determining the prediction time series of the index to be analyzed in the prediction period by using an SVR (Support Vector Regression) model. SVR is oriented to trend prediction of a single index. As shown in fig. 4, the goal of an SVM (Support Vector Machine) is to find a separating hyperplane by maximizing the interval, so that most of the sample points lie outside the two decision boundaries; SVR likewise maximizes the interval, but considers the points within the decision boundaries, so that as many sample points as possible fall within the interval.
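The idea that SVR tries to keep sample points inside the interval is commonly expressed through the epsilon-insensitive loss of standard SVR formulations, which the text does not spell out. The sketch below is only an illustration of that loss; the function name and the value of `epsilon` are assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-point loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Points whose prediction error is at most epsilon contribute no loss.
losses = epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.25, 2.0, 4.0],
                                  epsilon=0.5)
```

Only the third point, whose error of 1.0 exceeds the tube half-width of 0.5, is penalised; this is why the fitted function is driven by the points that fall outside the interval.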

In the prior art, when a model for predicting a time series is trained, abnormal values in the training data affect the training effect and, in turn, the accuracy of subsequent predictions. In the embodiment of the invention, a time series feature library suited to the indexes can be constructed independently for the features passed to the SVR model, and abnormal values in the incoming data are handled as follows: the non-anomalous value of the current point is estimated jointly from the true value and the predicted value at the same period position as the abnormal point. The predicted value is obtained by exponential smoothing, and the true value is obtained from a normally distributed random number constructed from the remaining points in the same period. This embodiment can reduce the influence of abnormal data and improve the accuracy of the prediction result.
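One possible reading of this abnormal value handling is sketched below. Several details are assumptions rather than statements of the text: the points at the same period position in other cycles are taken as the reference points, the two estimates are combined with equal weights, and the function names and the smoothing factor `alpha` are hypothetical.

```python
import random
from statistics import mean, pstdev


def exponential_smoothing(values, alpha=0.5):
    """Simple exponential smoothing; the last level is the prediction."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def repair_outlier(series, index, period, alpha=0.5, rng=None):
    """Estimate a non-anomalous value for series[index]."""
    rng = rng or random.Random(0)
    # Points at the same position within the period, excluding the outlier.
    peers = [series[i] for i in range(index % period, len(series), period)
             if i != index]
    if len(peers) < 2:
        raise ValueError("need at least two same-position points")
    predicted = exponential_smoothing(peers, alpha)
    # Stand-in for the "true value": a draw from a normal distribution
    # fitted to the same-position points.
    sampled = rng.gauss(mean(peers), pstdev(peers))
    # Equal weighting of the two estimates is an assumption.
    return 0.5 * (predicted + sampled)


series = [1, 2, 3, 1, 2, 3, 1, 100, 3, 1, 2, 3]
repaired = repair_outlier(series, index=7, period=3)
```

In this example the outlier 100 sits at the second position of its cycle, so it is replaced using the other second-position values of the series.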

Conventional approaches have several shortcomings. When the data is stable and periodic as a whole but changes abruptly in places (for example, values that suddenly jump), a traditional single prediction algorithm cannot handle this characteristic. When abnormal values exist in the training data, model learning and subsequent prediction are affected. When the data contains promotion days, a traditional algorithm either cannot guarantee accuracy or cannot handle the promotion days at all. If the index data shows an obvious pattern change (such as concept drift) within a short period, then under a high accuracy requirement a traditional algorithm consumes excessive resources and takes too long; existing algorithms also verify poorly, so the ideas of several algorithms need to be combined. The embodiment of the invention can decompose the data into overall offset, periodicity, trend and error, and adopts different prediction models for sparse points and dense points, thereby balancing prediction accuracy against training and prediction performance and achieving a stable and reliable time series prediction function.

Taking abnormality diagnosis and trend prediction for a disk space or a table data space as an example, the embodiment of the invention classifies indexes on the basis of ten thousand pieces of collected disk space data and table space data (disk space data refers to disk space utilization, and table space data refers to table space utilization; the disk space data of each database instance is one time series, and so is its table space data). The collected disk space data is passed to a random forest classification algorithm to obtain a shape classification label; dimension-size and fluctuation features are then computed and passed to an SVM classification algorithm to obtain a dimension classification label. The table space index is processed in the same way. The indexes are divided into three major categories (no period, weak period and strong period) and twelve classes in total. After classification, each label is mapped to a corresponding trend prediction model, of which there are three types: SVR single-index prediction, DeepAR multi-index prediction, and Holt-Winters seasonal and periodic prediction. The model corresponding to no period is the SVR model, which performs single-index trend prediction; the model corresponding to the weak period is the DeepAR model, which performs multi-index prediction; and the model corresponding to the strong period is the Holt-Winters model, which can predict seasonal and periodic data.
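The mapping from period label to trend prediction model described above can be sketched as a simple lookup. The label strings and the function name are assumptions for illustration; only the three label-to-model pairs come from the text.

```python
# Period label -> trend prediction model, as described in the embodiment.
MODEL_BY_PERIOD_LABEL = {
    "no period": "SVR",               # single-index trend prediction
    "weak period": "DeepAR",          # multi-index prediction
    "strong period": "Holt-Winters",  # seasonal and periodic prediction
}


def select_model(period_label):
    """Return the name of the model family mapped to a period label."""
    try:
        return MODEL_BY_PERIOD_LABEL[period_label]
    except KeyError:
        raise ValueError(f"unknown period label: {period_label!r}") from None


chosen = select_model("weak period")
```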

The data analysis method provided by the embodiment of the invention is a complete and general-purpose trend prediction algorithm that can be deployed in a lightweight manner. When classifying indexes, it considers not only the periodicity of the fluctuation of the database disk/table space indexes but also the shape of the time series and the range of the dimension fluctuation, so the index classification is more accurate. This helps database engineers classify and manage the disk/table space indexes of different instances and assists in allocating different tasks to the corresponding instances. The index classification model also allows each index to be matched with the most suitable trend prediction algorithm, which improves the accuracy of trend prediction. The overall technical scheme of the embodiment of the invention makes it possible to anticipate database performance from monitoring data, discover potential safety hazards in advance and ensure the stability of the system. It gives operation and maintenance engineers and managers enough time to find a faulty disk or an abnormal table data space and take the corresponding action, and allows losses to be stopped quickly if a fault does occur later.

According to a second aspect of the embodiments of the present invention, an apparatus for implementing the above method is provided. Fig. 5 is a schematic diagram of main blocks of an apparatus for data analysis according to an embodiment of the present invention, and as shown in fig. 5, the apparatus 500 for data analysis includes:

the tag obtaining module 501 is configured to obtain a periodic tag of an index to be analyzed, and determine a first history period corresponding to a prediction period in a history cycle when the periodicity of the index to be analyzed is determined according to the periodic tag; the history cycle refers to a cycle before the prediction period;

a sequence prediction module 502, which determines a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first history time period; determining a second predicted time series of the index to be analyzed in the prediction period based on the time series data of the index to be analyzed in the historical period;

and a result fitting module 503, determining a prediction time sequence of the index to be analyzed in the prediction time period according to the first prediction time sequence and the second prediction time sequence.

Optionally, the tag obtaining module obtains a periodic tag of the index to be analyzed, including: analyzing the historical time sequence of the index to be analyzed by adopting a polymerization channel characteristic detection algorithm to obtain a candidate period set of the index to be analyzed; and under the condition that elements in the candidate period set are irregular, setting a first period label for indicating no period for the index to be analyzed.

Optionally, the tag obtaining module is further configured to:

setting a second period label for indicating a strong period for the index to be analyzed under the condition that the wave crest number sequence and the wave trough number sequence both pass through stability inspection;

and under the condition that the wave crest number sequence or the wave trough number sequence does not pass the stability test, setting a third period label for indicating a weak period for the index to be analyzed.
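The period labeling rules stated for the tag obtaining module can be summarised in a small decision function. This is only a sketch: the boolean inputs stand for the outcomes of the regularity check on the candidate period set and of the stability tests on the peak-count and trough-count sequences, whose actual implementations are not given here, and the function name and label strings are assumptions.

```python
def assign_period_label(candidates_regular, peaks_stable, troughs_stable):
    """Map the outcome of the checks to one of the three period labels."""
    if not candidates_regular:
        # Irregular candidate period set: first period label.
        return "no period"
    if peaks_stable and troughs_stable:
        # Both sequences pass the stability test: second period label.
        return "strong period"
    # At least one sequence fails the stability test: third period label.
    return "weak period"


label = assign_period_label(candidates_regular=True,
                            peaks_stable=True,
                            troughs_stable=False)
```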

Optionally, the tag obtaining module is further configured to: determine any one or more of the following features of the index to be analyzed, and input the determined features into a pre-trained classification model to determine the shape label of the index to be analyzed: an SBD cross-correlation distance feature; the proportion, determined on the basis of a sliding window, of cases in which the difference between the median and the mean of the historical time series is greater than a preset value; a least-squares linear regression feature of the historical time series; the autocorrelation value of the autoregressive model lag operator; spectral statistics of the absolute Fourier transform; and sample entropy.

under the condition that multiple types of timing curves exist in the historical period according to the shape label, taking the time period corresponding to one or more timing curves of the class corresponding to the prediction time period in the historical period as the first historical time period;

Optionally, the sequence prediction module is further configured to:

Optionally, the sequence prediction module is further configured to: when the index to be analyzed is determined to have no periodicity according to the period label, determine the prediction time series of the index to be analyzed in the prediction period by using an SVR (Support Vector Regression) model.

Optionally, the tag obtaining module further includes: obtaining a dimension label of the index to be analyzed; the dimension label is determined according to the value range of the index to be analyzed; the model inputs for the model include: the shape label and the dimension label of the index to be analyzed, and the time sequence data of the index to be analyzed in the historical period.

one or more processors;

a storage device to store one or more programs,

According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.

Fig. 6 illustrates an exemplary system architecture 600 of a data analysis method or apparatus to which embodiments of the invention may be applied.

As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.

A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have various messaging client applications installed thereon, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information; by way of example only) to the terminal device.

It should be noted that the method for data analysis provided in the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus for data analysis is generally disposed in the server 605.

It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing embodiments of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present invention, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a tag obtaining module, a sequence prediction module and a result fitting module. In some cases the names of these modules do not limit the modules themselves; for example, the result fitting module may also be described as a "module that determines a predicted time series of the index to be analyzed in the prediction period from the first predicted time series and the second predicted time series".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following:

acquiring a periodic label of an index to be analyzed, and determining a first historical time period corresponding to a prediction time period in a historical period under the condition that the index to be analyzed is determined to have periodicity according to the periodic label; the history cycle refers to a cycle before the prediction period; determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first historical time period; determining a second prediction time series of the index to be analyzed in the prediction time period based on the time series data of the index to be analyzed in the historical period; and determining the prediction time sequence of the index to be analyzed in the prediction time period according to the first prediction time sequence and the second prediction time sequence.

According to the technical scheme of the embodiment of the invention, the corresponding prediction scheme is selected according to the period label of the index to be analyzed, and the weight of data in the corresponding time interval which is closer to the prediction time interval is increased by adding a weighting method, so that the overall prediction of various indexes can be realized, the accuracy of the prediction effect is improved, the performance of the system can be pre-judged in advance, the potential safety hazard can be found in advance, and the stability of the system is ensured.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of data analysis, comprising:

determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first history time period;

determining a second prediction time series of the index to be analyzed in the prediction time period based on the time series data of the index to be analyzed in the historical period;

2. The method of claim 1, wherein acquiring the periodic label of the index to be analyzed comprises:

analyzing the historical time series of the index to be analyzed by adopting an aggregated channel feature detection algorithm to obtain a candidate period set of the index to be analyzed; and under the condition that elements in the candidate period set are irregular, setting, for the index to be analyzed, a first period label indicating that the index has no period.

3. The method of claim 2, wherein the method further comprises:

under the condition that elements in the candidate period set are regular, cutting the historical time series according to the value of each element in the candidate period set, and discarding any segment of the historical time series that is shorter than one period;

4. The method of claim 2 or 3, further comprising:

5. The method of any of claims 1-4, wherein the method further comprises: acquiring a shape label of the index to be analyzed;

determining a first historical time period corresponding to the prediction time period in the historical period, comprising:

6. The method of claim 1, wherein the method further comprises:

7. The method of claim 1, wherein the method further comprises:

and under the condition that the index to be analyzed is determined not to have periodicity according to the periodic label, determining the prediction time series of the index to be analyzed in the prediction time period by using an SVR (support vector regression) model.

8. The method of claim 6 or 7, wherein the method further comprises: obtaining a dimension label of the index to be analyzed; the dimension label is determined according to the value range of the index to be analyzed;

the inputs of the model include: the shape label and the dimension label of the index to be analyzed, and the time series data of the index to be analyzed in the historical period.

9. An apparatus for data analysis, comprising:

the sequence prediction module is used for determining a first prediction time sequence of the index to be analyzed in the prediction time period based on the time sequence data of the index to be analyzed in the first historical time period; determining a second predicted time series of the index to be analyzed in the prediction period based on the time series data of the index to be analyzed in the historical period;

10. An electronic device for data analysis, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.

11. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-8.