CN109857593B - Data center log missing data recovery method - Google Patents


Info

Publication number
CN109857593B
Authority
CN
China
Prior art keywords
data
attribute
log
tensor
discretized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910056129.7A
Other languages
Chinese (zh)
Other versions
CN109857593A (en
Inventor
梁毅
毕临风
苏醒
苏超
陈金栋
丁治明
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910056129.7A priority Critical patent/CN109857593B/en
Publication of CN109857593A publication Critical patent/CN109857593A/en
Application granted granted Critical
Publication of CN109857593B publication Critical patent/CN109857593B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for recovering missing data in data center logs. First, a correlation analysis method is used to discover the correlations among different data attributes in the log and to select an optimal data attribute subset, and a two-stage discretization step-length optimization algorithm is used to optimize the discretization of the data. Then, the selected optimal data attribute subset is used as the attributes of a tensor to construct a sparse tensor. Finally, a tensor completion method based on tensor decomposition completes the sparse tensor into a dense tensor, and the dense tensor is combined with the original incomplete log data to obtain a complete log data set.

Description

Data center log missing data recovery method
Technical Field
The invention belongs to the field of data center log analysis, and particularly relates to a method for recovering missing data in a data center log.
Background
Large-scale data centers are the information infrastructure of the internet and related industries, providing software and hardware resources such as computation, storage, and networking for internet services. Virtualization, containerization, and server consolidation technologies are widely applied in modern data centers. In this context, multiple computing frameworks and heterogeneous workloads often coexist in a data center. During operation, a data center generates massive log data containing the runtime information of its servers and workloads.
Data center log analysis is one of the important means of optimizing data center performance. Through log analysis, a data center administrator can obtain important information such as load characteristics and resource usage patterns, which in turn guides the optimization of task scheduling, resource management, and programming models. However, as data centers grow, their logs face an increasingly serious data loss problem: part of the data in the log is null or invalid and cannot be used directly as input to log analysis. There are two main causes. (1) In the log acquisition stage, bugs in the monitoring system can cause data loss; moreover, because the monitoring process is usually run at a low priority, it is deprived of resources when the cluster load is high, again causing data loss. (2) In the log processing stage, some data is anonymized or normalized for confidentiality and similar reasons, which directly causes data loss, and bugs in this stage cause additional unexpected loss. At present, the log analysis field mainly handles missing data either by simply removing missing items or by statistical completion based on the mean or regression. These existing methods have the following problems:
(1) The problem of large-scale data loss cannot be solved. As data centers grow in size, the missing rate of log data tends to increase. Faced with a large proportion of missing data, the simple removal method greatly reduces the overall information content of the log, while statistical completion based on the mean or regression has low recovery accuracy. Neither method can cope with large-scale data loss, which in turn affects the accuracy of log analysis.
(2) Complex correlations between different data attributes in data center logs cannot be handled. A data center log typically has tens of data attributes, among which various linear or nonlinear correlations exist; analyzing these correlations can improve the accuracy of data recovery. Existing methods do not consider the correlations among data attributes when recovering missing log data, so their recovery accuracy is low. Moreover, the input data attributes of a recovery algorithm must be specified manually, which is difficult for non-experts to do correctly without a correlation analysis of the log data.
Disclosure of Invention
To address these problems, the invention provides a tensor-based method for recovering missing data in data center logs. First, a correlation analysis method is used to discover the correlations among different data attributes in the log and to select an optimal data attribute subset, and a two-stage discretization step-length optimization algorithm is used to optimize the discretization of the data. Then, the selected optimal data attribute subset is used as the attributes of a tensor to construct a sparse tensor. Finally, a tensor completion method based on tensor decomposition completes the sparse tensor into a dense tensor, and the dense tensor is combined with the original incomplete log data to obtain a complete log data set.
In the present invention, the CANDECOMP/PARAFAC (CP) decomposition method is used to complete the sparse tensor. CP decomposition is a widely applied tensor completion method: it decomposes a sparse tensor into a number of rank-one tensors and mines the regularities in the tensor data to fill in the missing entries. Owing to the characteristics of data center log data, the constructed sparse tensor has low rank, so it is suitable to perform tensor completion with CP decomposition.
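To make the rank-one idea concrete, the sketch below (illustrative code, not taken from the patent) rebuilds a dense tensor from CP factor matrices with numpy; the function name and the toy factors are chosen purely for the example:

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild a dense tensor from CP factor matrices: the sum of R
    rank-one tensors, one outer product per shared column index r."""
    R = factors[0].shape[1]
    shape = tuple(F.shape[0] for F in factors)
    x = np.zeros(shape)
    for r in range(R):
        # outer product of the r-th column of every factor matrix
        comp = factors[0][:, r]
        for F in factors[1:]:
            comp = np.multiply.outer(comp, F[:, r])
        x += comp
    return x

# a toy rank-2 CP model of a 3 x 2 x 2 tensor
F1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
F2 = np.array([[2.0, 1.0], [0.0, 3.0]])
F3 = np.array([[1.0, 1.0], [2.0, 0.0]])
dense = cp_reconstruct([F1, F2, F3])
print(dense.shape)  # (3, 2, 2)
```

Completion then amounts to fitting such factor matrices to only the observed cells of the sparse tensor and reading the missing cells off the reconstruction.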
The method for recovering missing data in data center logs mainly comprises the following five steps: initialization, data attribute selection, data attribute discretization optimization, tensor construction and completion, and log missing data recovery. The method has the following basic parameters: the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, the attribute selection discretization step length S_1, the discretization optimization step length S_2, the CP decomposition rank-one tensor number R, the gradient descent learning rate η, the gradient descent objective function weights λ_1 and λ_2, and the gradient descent objective function convergence threshold θ. N_L is generally between 50 and 150, N_H between 400 and 500, S_1 between 100 and 200, S_2 between 25 and 50, and R between 5 and 30; η is typically 0.00001, λ_1 and λ_2 are generally between 0 and 1, and θ is generally 0.01.
The method is realized according to the following steps:
(1) Initialization. Let the log have n data attributes and m records. The set of data attributes can be represented as A = {a_1, a_2, …, a_n}. The set of data records in the log can be represented as E = {e_1, e_2, …, e_m}. The data in the log can be represented as a matrix V ∈ R^(n×m), where v_ij denotes the value of the i-th data attribute in the j-th data record. The data attribute with missing data is denoted a_T.
(2) Data attribute selection.
2.1) Manually select all data attributes that may be correlated with the target missing data attribute as the candidate data attribute set A' = {a_1, a_2, …, a_n'}.
2.2) Construct the discretization rule set Rule = {r_1, r_2, …} for the data selection stage, where each rule r_i = {r_i1, r_i2, …, r_in'} and r_ij denotes the number of discretization bins that the i-th rule assigns to candidate attribute a_j, with r_ij ∈ {N_L, N_L + S_1, N_L + 2S_1, …, N_H}. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the attribute selection discretization step length S_1.
2.3) Discretize the data with each rule r_i ∈ Rule; the log data discretized by rule r_i can be represented as V_i. Then select data attributes one by one.
2.3.1) Compute the adjusted mutual information (AMI) between every candidate data attribute a_i ∈ A' and the target data attribute a_T, denoted AMI(a_i; a_T), using equations (1) and (2). Then initialize the priorities of the candidate data attributes, P = {p_1, p_2, …, p_n'}, where p_i = AMI(a_i; a_T).

MI(a_i; a_T) = Σ_{x∈a_i} Σ_{y∈a_T} p(x, y) log( p(x, y) / (p(x) p(y)) )   (1)

AMI(a_i; a_T) = ( MI(a_i; a_T) − E{MI(a_i; a_T)} ) / ( max(H(a_i), H(a_T)) − E{MI(a_i; a_T)} )   (2)

where p(·) denotes the empirical probability of a discretized value, H(·) denotes entropy, and E{·} denotes the expected value.
2.3.2) Select the data attribute with the highest priority (denoted a_k), add it to the selected data attribute set, and remove it from the candidate data attribute set A'. The priority of each remaining candidate data attribute a_l ∈ A' is updated to p_l × (1 − AMI(a_l; a_k)).
2.3.3) Repeat step 2.3.2) until the number of selected data attributes equals the target number.
2.3.4) Record the selection result as result_i and add it to the selection result set Result.
2.4) Count all the selection results in the selection result set Result and take the data attribute set that occurs most frequently as the final data attribute selection result A_S = {a_1, a_2, …, a_q}.
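The greedy loop of steps 2.3.1)–2.3.4) can be sketched as follows. This is an assumption-laden illustration: it uses a plain entropy-normalized mutual information (`norm_mi`) as a stand-in for the patent's AMI, and all function and variable names are invented for the example:

```python
import math
from collections import Counter

def entropy(col):
    """Empirical entropy (nats) of a discretized column."""
    n = len(col)
    return -sum(c / n * math.log(c / n) for c in Counter(col).values())

def mutual_info(x, y):
    """Empirical mutual information (nats) between two discretized columns."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log((c / n) / ((cx[a] / n) * (cy[b] / n)))
               for (a, b), c in cxy.items())

def norm_mi(x, y):
    """Entropy-normalized MI in [0, 1]; a stand-in for the patent's AMI."""
    h = max(entropy(x), entropy(y))
    return mutual_info(x, y) / h if h > 0 else 0.0

def greedy_select(candidates, target, k):
    """Steps 2.3.1)-2.3.3): initialize p_i = AMI(a_i; a_T), then repeatedly
    take the top-priority attribute a_k and rescale every remaining
    priority by (1 - AMI(a_l; a_k)) to penalize redundancy."""
    prio = {name: norm_mi(col, target) for name, col in candidates.items()}
    selected = []
    while len(selected) < k and prio:
        best = max(prio, key=prio.get)
        selected.append(best)
        del prio[best]
        for name in prio:
            prio[name] *= 1 - norm_mi(candidates[name], candidates[best])
    return selected
```

For example, given one candidate identical to the target and one independent of it, the copy is selected first because its normalized MI with the target is 1 while the independent column's is 0.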
(3) Discretization granularity optimization.
3.1) Construct the discretization rule set Rule' = {r'_1, r'_2, …} for the discretization granularity optimization stage, where each rule r'_i = {r'_i1, r'_i2, …, r'_iq} and r'_ij denotes the number of discretization bins that the i-th rule assigns to selected attribute a_j, with r'_ij ∈ {N_L, N_L + S_2, N_L + 2S_2, …, N_H}. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the discretization optimization step length S_2.
3.2) Based on the selected data attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized by rule r'_i can be represented as V'_i.
Compute the weighted coefficient of variation (WCV) of the discretized log data. WCV is calculated as follows. First, the records in the log data are grouped by their values on the data attribute subset A_S, denoted G = {g_1, g_2, …, g_p}; each group g_i contains records whose values are equal on every data attribute a_k ∈ A_S. Using equation (3), compute the coefficient of variation c_i of the target data attribute values within each group. Then compute WCV_i for each group using equation (4), and the WCV of the entire log using equation (5):

c_i = σ(g_i^T) / μ(g_i^T)   (3)

WCV_i = ( size(g_i) / m ) × c_i   (4)

WCV = Σ_{i=1}^{p} WCV_i   (5)

where g_i^T denotes the target attribute values of the records in group g_i, σ(X) represents the standard deviation of X, μ(X) represents the mean of X, and size(X) represents the number of data entries in X.
3.3) Select the data discretization result with the minimum WCV value as the final data discretization result, and denote the discretized log data V_F.
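A minimal sketch of the WCV computation of step 3.2), under the grouping just described (records that agree on every selected attribute form one group, and each group's coefficient of variation is weighted by its share of all records); function and field names are illustrative:

```python
import math
from collections import defaultdict

def weighted_cv(records, selected, target):
    """WCV of a discretized log: records that agree on every selected
    attribute form one group; each group's coefficient of variation of
    the target attribute is weighted by the group's share of records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in selected)].append(rec[target])
    m = len(records)
    wcv = 0.0
    for vals in groups.values():
        mu = sum(vals) / len(vals)
        if mu == 0:
            continue  # coefficient of variation undefined for zero mean
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        wcv += (len(vals) / m) * (math.sqrt(var) / mu)
    return wcv
```

A discretization under which records with identical selected-attribute values also have similar target values yields a small WCV, which is why the rule with the minimum WCV is kept in step 3.3).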
4) Tensor construction and tensor completion.
4.1) Construct a tensor using the discretized log data V_F and the target data attribute a_T. Let each data attribute a_i ∈ A_S have S_i discrete values; construct a q-dimensional tensor χ ∈ R^(S_1×S_2×…×S_q).
4.1.1) Arrange the discrete values of each data attribute a_i ∈ A_S in ascending order, and construct a mapping M from each value v to its rank d.
4.1.2) Fill the values of the target data attribute a_T into the tensor as tensor values. Suppose data record e_i takes the values {v^F_i1, v^F_i2, …, v^F_iq} on the selected data attributes A_S = {a_1, a_2, …, a_q} and the value v^T_i on the target data attribute a_T. Obtain through the mapping M the ranks {d_i1, d_i2, …, d_iq} corresponding to {v^F_i1, v^F_i2, …, v^F_iq}; then the tensor value is χ_{d_i1, d_i2, …, d_iq} = v^T_i. When u records share the same tensor subscript, the mean of their target attribute values, (1/u) Σ v^T, is used as the tensor value.
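Steps 4.1.1)–4.1.2) can be sketched as follows; the helper returns the tensor, an observed-cell mask (usable as the weight tensor W of step 4.2)), and the value-to-rank mappings M. Names are illustrative:

```python
import numpy as np

def build_tensor(records, selected, target):
    """Steps 4.1.1)-4.1.2): map each selected attribute's sorted discrete
    values to ranks, then fill a q-way tensor with the target value,
    averaging records that land on the same cell."""
    maps = []
    for a in selected:
        vals = sorted({rec[a] for rec in records})
        maps.append({v: i for i, v in enumerate(vals)})  # mapping M: value -> rank
    shape = tuple(len(m) for m in maps)
    total = np.zeros(shape)
    count = np.zeros(shape)
    for rec in records:
        idx = tuple(m[rec[a]] for m, a in zip(maps, selected))
        total[idx] += rec[target]
        count[idx] += 1
    observed = count > 0
    tensor = np.zeros(shape)
    tensor[observed] = total[observed] / count[observed]  # mean over u duplicates
    return tensor, observed, maps
```

Cells never hit by any record stay zero and are marked unobserved; those are exactly the entries the CP completion of step 4.2) must fill in.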
4.2) Complete the tensor using the CP decomposition method, solving the decomposition by gradient descent.
4.2.1) Initialize q factor matrices with random values on the interval [0, 1]; factor matrix F_i ∈ R^(S_i×R) corresponds to data attribute a_i, where S_i is the number of discrete values of attribute a_i and R is a hyper-parameter of the algorithm. Initialize the weight tensor W according to formula (6):

W_{d_1 d_2 … d_q} = 1 if χ_{d_1 d_2 … d_q} is observed, 0 otherwise   (6)

4.2.2) Update the factor matrices according to formula (7), where ℰ = χ − [[F_1, F_2, …, F_q]], χ is the constructed sparse tensor, [[F_1, …, F_q]] denotes the tensor reconstructed from the factor matrices, ⊙ is the Khatri-Rao product, χ_(N) denotes the N-mode matricization of the tensor χ, * denotes the element-wise product, and λ_1 and λ_2 are algorithm hyper-parameters:

F_n ← F_n + η ( λ_1 (W * ℰ)_(n) (F_q ⊙ … ⊙ F_{n+1} ⊙ F_{n−1} ⊙ … ⊙ F_1) − λ_2 F_n )   (7)

4.2.3) Compute the objective function value according to formula (8):

E = (λ_1/2) ‖W * ℰ‖² + (λ_2/2) Σ_{i=1}^{q} ‖F_i‖²   (8)

4.2.4) Repeat steps 4.2.2) and 4.2.3) until the change between two successive objective function values is less than the threshold θ.
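For the 3-way case used in the embodiment, steps 4.2.1)–4.2.4) can be sketched as a weighted CP completion by gradient descent. This is an illustrative sketch, not the patent's code: it assumes an objective of the form E = (λ_1/2)‖W*(χ − [[F_1,F_2,F_3]])‖² + (λ_2/2)Σ‖F_i‖² (the patent's exact formulas are not fully recoverable from the source), and all hyper-parameter defaults are arbitrary:

```python
import numpy as np

def cp_complete3(chi, W, R=5, lr=0.01, lam1=1.0, lam2=1e-4, theta=1e-12,
                 iters=30000, seed=0):
    """Weighted 3-way CP completion by gradient descent.
    chi: tensor with zeros at missing cells; W: 0/1 mask of observed cells.
    Assumed objective: E = lam1/2 * ||W*(chi - [[F1,F2,F3]])||^2
                         + lam2/2 * (||F1||^2 + ||F2||^2 + ||F3||^2)."""
    rng = np.random.default_rng(seed)
    F = [rng.random((s, R)) for s in chi.shape]           # 4.2.1): U[0,1] init
    prev = np.inf
    for _ in range(iters):
        X = np.einsum('ir,jr,kr->ijk', F[0], F[1], F[2])  # [[F1,F2,F3]]
        D = W * (chi - X)                                 # residual on observed cells
        # 4.2.2): simultaneous gradient step on each factor matrix
        G0 = lam1 * np.einsum('ijk,jr,kr->ir', D, F[1], F[2]) - lam2 * F[0]
        G1 = lam1 * np.einsum('ijk,ir,kr->jr', D, F[0], F[2]) - lam2 * F[1]
        G2 = lam1 * np.einsum('ijk,ir,jr->kr', D, F[0], F[1]) - lam2 * F[2]
        F[0] += lr * G0
        F[1] += lr * G1
        F[2] += lr * G2
        # 4.2.3)-4.2.4): objective value and convergence test
        E = 0.5 * lam1 * (D ** 2).sum() + 0.5 * lam2 * sum((M ** 2).sum() for M in F)
        if abs(prev - E) < theta:
            break
        prev = E
    return np.einsum('ir,jr,kr->ijk', F[0], F[1], F[2])   # dense completed tensor
```

On a tensor whose observed entries follow a low-rank pattern, the returned dense reconstruction interpolates the unobserved cells from that pattern, which is the behavior the method relies on for log recovery.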
5) Log data recovery. For each record e_i with missing data that takes the values {v^F_i1, v^F_i2, …, v^F_iq} on the data attributes A_S = {a_1, a_2, …, a_q}, obtain through the mapping M the ranks {d_i1, d_i2, …, d_iq} corresponding to {v^F_i1, v^F_i2, …, v^F_iq}, and recover the missing data using the completed tensor value χ_{d_i1, d_i2, …, d_iq}.
Drawings
FIG. 1 is a deployment diagram of the method of the present invention.
Fig. 2 is a general flow diagram of the present invention.
FIG. 3 is a flow chart of log data attribute selection.
FIG. 4 is a flow chart of log data discretization optimization.
FIG. 5 is a flow chart of tensor construction and completion.
Detailed Description
The invention is described below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a deployment diagram of the method of the present invention. The method runs on a plurality of computer servers connected through a network. Platform nodes are divided into two categories: storage nodes and computing nodes. The method comprises two types of core software modules: a log storage module and a log processing module. The log storage module is responsible for storing log data and is deployed on the storage nodes; the log processing module is responsible for processing log data and is deployed on the computing nodes.
A specific implementation of the method is described below with reference to the general flow chart in FIG. 2. In the present embodiment, the basic parameters are set as follows: the discretization bin number lower bound N_L = 100, the discretization bin number upper bound N_H = 500, the attribute selection discretization step length S_1 = 100, the discretization optimization step length S_2 = 25, the CP decomposition rank-one tensor number R = 25, the gradient descent learning rate η = 0.00001, the gradient descent objective function weight λ_1 = 0.5, and the gradient descent objective function convergence threshold θ = 0.01.
The specific implementation method can be divided into the following steps:
(1) Initialization. The data center log has 49 data attributes and 10364956 records in total. The set of data attributes can be represented as A = {a_1, a_2, …, a_49}; the set of data records as E = {e_1, e_2, …, e_10364956}; the log data as V ∈ R^(49×10364956). The data attribute with missing data is real_mem_avg (average memory usage), denoted a_T.
(2) Data attribute selection, a flow chart of which is shown in fig. 3.
2.1) Manually select all data attributes that may be correlated with the target missing data attribute as the candidate data attribute set A' = {plan_cpu, plan_mem, instance_num, duration, real_cpu_avg, end_time} (requested CPU resources, requested memory resources, number of instances, duration, average actual CPU usage, end time).
2.2) Construct the discretization rule set of the data selection stage, Rule = {r_1, r_2, …, r_15625}, where each rule r_i = {r_i1, r_i2, …, r_i6} and r_ij denotes the number of discretization bins that the i-th rule assigns to candidate attribute a_j. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound 100, the upper bound 500, and the attribute selection discretization step length 100.
2.3) Discretize the data with each rule r_i ∈ Rule; the log data discretized by rule r_i can be represented as V_i. Then select data attributes one by one.
2.3.1) Compute the AMI between every candidate data attribute a_i ∈ A' and the target data attribute a_T, AMI(a_i; a_T), according to the method in step 2.3.1) of the summary. Then initialize the priorities of the candidate data attributes, P = {0.02, 0.11, 0.018, 0.09, 0.009, 0.14}.
2.3.2) Select the data attribute with the highest priority, end_time, add it to the selected data attribute set, and remove it from the candidate data attribute set A'. According to the method in step 2.3.2) of the summary, the priorities of the remaining candidate data attributes a_l ∈ A' are updated to {0.018, 0.09, 0.015, 0.07, 0.0087}.
2.3.3) Repeat step 2.3.2) until the number of selected data attributes equals the target number.
2.3.4) Record the selection result as result_i and add it to the selection result set Result.
2.4) Count all the selection results in the selection result set Result and take the data attribute set that occurs most frequently as the final data attribute selection result A_S = {end_time, plan_mem, duration}.
(3) Discretized granularity optimization, a flow chart of this step is shown in fig. 4.
3.1) Construct the discretization rule set of the discretization granularity optimization stage, Rule' = {r'_1, r'_2, …, r'_4096}, where each rule r'_i = {r'_i1, r'_i2, r'_i3} and r'_ij denotes the number of discretization bins that the i-th rule assigns to the j-th attribute in {end_time, plan_mem, duration}. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound 100, the upper bound 500, and the discretization optimization step length 25.
3.2) Based on the selected data attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized by rule r'_i can be represented as V'_i. The WCV of the discretized log data is calculated according to the method in step 3.2) of the summary, giving WCV = 0.35647.
3.3) Select the data discretization result with the minimum WCV value as the final data discretization result, and denote the discretized log data V_F.
4) Tensor construction and tensor completion; a flowchart of this step is shown in FIG. 5.
4.1) Construct a tensor using the discretized log data V_F and the target data attribute a_T. The data attributes a_i ∈ A_S have 276, 87, and 61 discrete values respectively, so a 3-dimensional tensor χ ∈ R^(276×87×61) is constructed.
4.1.1) Arrange the discrete values of each data attribute a_i ∈ A_S in ascending order, and construct a mapping M from each value v to its rank d.
4.1.2) Fill the values of the target data attribute a_T into the tensor as tensor values. Suppose data record e_i takes the values {35519, 0.016, 34} on the selected data attributes {end_time, plan_mem, duration} and the value 0.023814 on the target data attribute a_T. Through the mapping M, the ranks corresponding to {35519, 0.016, 34} are {1, 13, 24}, so the tensor value is χ_{1,13,24} = 0.023814.
4.2) Complete the tensor using the CP decomposition method, solving the decomposition by gradient descent.
4.2.1) Initialize three factor matrices with random values on the interval [0, 1]: F_1 ∈ R^(276×25), F_2 ∈ R^(87×25), F_3 ∈ R^(61×25). Initialize the weight tensor W according to the method in step 4.2.1) of the summary.
4.2.2) Update the three factor matrices according to the method in step 4.2.2) of the summary.
4.2.3) Compute the objective function value E = 7983.348 according to the method in step 4.2.3) of the summary.
4.2.4) Repeat steps 4.2.2) and 4.2.3) until the change between two successive objective function values is less than the threshold 0.01.
5) Log data recovery. For a record e_1 with missing data whose values on the data attributes A_S = {end_time, plan_mem, duration} are {45682, 0.008, 89}, obtain from the mapping M the ranks {34, 5, 41} corresponding to {45682, 0.008, 89}, and recover the missing data using the completed tensor value χ_{34,5,41}.
The inventors carried out performance tests on the data center log missing data recovery method provided by the invention. The test results show that the method can accurately recover missing data in data center logs.
The performance tests used the Alibaba data center log as the test data set. The method of the invention is compared with the missing data recovery methods used in existing log analysis work (mean recovery and linear regression recovery) and with advanced data recovery methods widely applied in other fields (KNN recovery, multilayer perceptron recovery, and support vector machine recovery) to demonstrate its accuracy advantage in recovering missing data from data center logs. The performance tests were executed on one computer with the following hardware configuration: AMD Ryzen 7 1700X @ 3.80 GHz CPU, 32 GB DDR4 RAM, 512 GB NVMe SSD.
The performance tests evaluate the data recovery error using two metrics: the mean relative error (MRE) and the root mean square error (RMSE), calculated as shown in equations (9) and (10), where x_i is the true value, x̂_i is the recovered value, and n is the number of recovered entries:

MRE = (1/n) Σ_{i=1}^{n} |x̂_i − x_i| / x_i   (9)

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (x̂_i − x_i)² )   (10)
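A minimal sketch of the two error metrics, assuming the standard definitions (MRE averages the per-entry relative error; RMSE is the root of the mean squared error):

```python
def mre(truth, pred):
    """Mean relative error: average of |x_hat - x| / x (assumes nonzero truth)."""
    return sum(abs(p - t) / abs(t) for t, p in zip(truth, pred)) / len(truth)

def rmse(truth, pred):
    """Root mean square error of the recovered values."""
    return (sum((p - t) ** 2 for t, p in zip(truth, pred)) / len(truth)) ** 0.5
```

MRE is scale-free (an error of 1 on a true value of 2 counts the same as 100 on 200), while RMSE penalizes large absolute deviations; reporting both, as the tests below do, covers both views of recovery quality.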
The performance tests are divided into 4 groups according to the missing ratio and missing mode of the log data: a 30% missing rate following the Alibaba log missing pattern (TM30), an 85% missing rate following the Alibaba log missing pattern (TM85), a 30% missing rate with data missing completely at random (RM30), and an 85% missing rate with data missing completely at random (RM85). The results of the performance tests are shown in Tables 1 and 2.
TABLE 1 Performance test results (MRE)
TABLE 2 Performance test Results (RMSE)
From the data in Tables 1 and 2, it can be seen that across the four experiments the method of the invention reduces both error metrics relative to the five comparison methods, with an average RMSE reduction of 56.6%, a maximum MRE reduction of 85.9%, and a maximum RMSE reduction of 92%. The errors of the two machine learning recovery methods with the lowest average errors, multilayer perceptron recovery and support vector machine recovery, increase significantly as the missing proportion grows, while the error of the proposed method remains stable: at the 30% and 85% missing rates, the maximum MRE improvement over these methods is 32.7% and 50%, respectively. The performance test results show that, compared with the five comparison methods, the proposed method recovers missing data with lower and more stable error, and achieves higher accuracy under different data missing rates.
Finally, it should be noted that the above examples are only intended to illustrate the present invention, not to limit the technology described herein; all technical solutions and modifications thereof that do not depart from the spirit and scope of the present invention shall be covered by the claims of the present invention.

Claims (5)

1. A data center log missing data recovery method, characterized by comprising the following steps:
(1) initialization: the log has n data attributes and m records; the set of data attributes can be represented as A = {a_1, a_2, …, a_n}; the set of data records in the log can be represented as E = {e_1, e_2, …, e_m}; the data in the log can be represented as a matrix V ∈ R^(n×m), where v_ij denotes the value of the i-th data attribute in the j-th data record; the data attribute with missing data is denoted a_T;
(2) data attribute selection:
2.1) select all data attributes that may be correlated with the target missing data attribute as the candidate data attribute set A' = {a_1, a_2, …, a_n'};
2.2) construct the discretization rule set Rule = {r_1, r_2, …} of the data selection stage, where each rule r_i = {r_i1, r_i2, …, r_in'} and r_ij denotes the number of discretization bins that the i-th rule assigns to candidate attribute a_j, with r_ij ∈ {N_L, N_L + S_1, N_L + 2S_1, …, N_H}; that is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the attribute selection discretization step length S_1;
2.3) discretize the data with each rule r_i ∈ Rule; the log data discretized by rule r_i can be represented as V_i; then select data attributes one by one;
2.4) count all the selection results in the selection result set Result and take the data attribute set that occurs most frequently as the final data attribute selection result A_S = {a_1, a_2, …, a_q};
(3) discretization granularity optimization:
3.1) construct the discretization rule set Rule' = {r'_1, r'_2, …} of the discretization granularity optimization stage, where each rule r'_i = {r'_i1, r'_i2, …, r'_iq} and r'_ij denotes the number of discretization bins that the i-th rule assigns to attribute a_j, with r'_ij ∈ {N_L, N_L + S_2, N_L + 2S_2, …, N_H}; that is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the discretization optimization step length S_2;
3.2) based on the selected data attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized by rule r'_i can be represented as V'_i; calculate the weighted coefficient of variation (WCV) of the discretized log data;
3.3) select the data discretization result with the minimum WCV value as the final data discretization result, and denote the discretized log data V_F;
(4) tensor construction and tensor completion:
4.1) construct a tensor using the discretized log data V_F and the target data attribute a_T; let each data attribute a_i ∈ A_S have S_i discrete values, and construct a q-dimensional tensor χ ∈ R^(S_1×S_2×…×S_q);
4.2) complete the tensor using the CP decomposition method, solving the decomposition by gradient descent;
(5) log data recovery:
for each record e_i with missing data that takes the values {v^F_i1, v^F_i2, …, v^F_iq} on the data attributes A_S = {a_1, a_2, …, a_q}, obtain through the mapping M the ranks {d_i1, d_i2, …, d_iq} corresponding to {v^F_i1, v^F_i2, …, v^F_iq}, and recover the missing data using the completed tensor value χ_{d_i1, d_i2, …, d_iq}.
2. The data center log missing data recovery method of claim 1, wherein step 2.3) comprises:
2.3.1) compute the adjusted mutual information (AMI) between every candidate data attribute a_i ∈ A' and the target data attribute a_T, denoted AMI(a_i; a_T), using equations (1) and (2); then initialize the priorities of the candidate data attributes, P = {p_1, p_2, …, p_n'}, where p_i = AMI(a_i; a_T),

MI(a_i; a_T) = Σ_{x∈a_i} Σ_{y∈a_T} p(x, y) log( p(x, y) / (p(x) p(y)) )   (1)

AMI(a_i; a_T) = ( MI(a_i; a_T) − E{MI(a_i; a_T)} ) / ( max(H(a_i), H(a_T)) − E{MI(a_i; a_T)} )   (2)

2.3.2) select the data attribute with the highest priority (denoted a_k), add it to the selected data attribute set, and remove it from the candidate data attribute set A'; the priority of each remaining candidate data attribute a_l ∈ A' is updated to p_l × (1 − AMI(a_l; a_k)),
2.3.3) repeat step 2.3.2) until the number of selected data attributes equals the target number,
2.3.4) record the selection result as result_i and add it to the selection result set Result.
3. The data center log missing data recovery method of claim 1, wherein step 4.1) comprises:
4.1.1) arranging the discrete values of each data attribute a_i ∈ A_S in ascending order, and constructing a mapping M from each value v to its permutation number d;
4.1.2) filling the tensor with the values of the target data attribute a_T: for a data record e_i whose values on the selected data attributes A_S = {a_1, a_2, …, a_q} are {v_{i1}, v_{i2}, …, v_{iq}} and whose value on the target data attribute a_T is v_i^T, obtaining through the mapping M the corresponding permutation numbers {d_{i1}, d_{i2}, …, d_{iq}} and setting the tensor value 𝒳(d_{i1}, d_{i2}, …, d_{iq}) = v_i^T; when u records have the same tensor subscript, using the mean of their target attribute values, (1/u) Σ_{j=1}^{u} v_j^T, as the value of the tensor.
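The construction in 4.1.2), including the mean rule for colliding records, can be sketched as follows. `build_sparse_tensor` and its inputs are hypothetical names; the index tuples are assumed to have already been produced by the mapping M of step 4.1.1).

```python
import numpy as np
from collections import defaultdict

def build_sparse_tensor(indices, targets, shape):
    """indices: one tensor-subscript tuple per record (ranks from M);
    targets: the record's target-attribute value.
    Returns the sparse tensor and the observed-cell indicator."""
    cells = defaultdict(list)
    for idx, t in zip(indices, targets):
        cells[idx].append(t)
    tensor = np.zeros(shape)
    mask = np.zeros(shape)                 # 1 where the cell is observed
    for idx, vals in cells.items():
        tensor[idx] = np.mean(vals)        # mean over the u colliding records
        mask[idx] = 1.0
    return tensor, mask

# Two records share subscript (0, 1): that cell gets (3.0 + 5.0) / 2 = 4.0.
T, W = build_sparse_tensor([(0, 1), (0, 1), (2, 0)], [3.0, 5.0, 7.0], (3, 2))
```

The returned indicator `W` is exactly the weight matrix initialized by formula (6) in claim 4.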
4. The data center log missing data recovery method of claim 1, wherein step 4.2) comprises:
4.2.1) initializing q factor matrices with random values drawn from the interval [0, 1], the factor matrix U^(i) ∈ R^{S_i × R} corresponding to data attribute a_i, where S_i is the number of discrete values of attribute a_i and R is a hyper-parameter of the algorithm, and initializing the weight matrix W according to formula (6):

W(d_1, d_2, …, d_q) = 1 if 𝒳(d_1, d_2, …, d_q) is observed, and 0 otherwise        (6)

4.2.2) updating the factor matrices according to formula (7), where 𝒳 is the constructed sparse tensor, "⊙" is the Khatri-Rao operator, 𝒳_(n) denotes the mode-n matricization of the tensor 𝒳, "∗" denotes the element-wise product, [[·]] denotes the CP reconstruction from the factor matrices, and λ1 and λ2 are algorithm hyper-parameters:

U^(n) ← 𝒴_(n) B_n (B_n^T B_n + λ1 I)^{−1},  𝒴 = W ∗ 𝒳 + (1 − W) ∗ [[U^(1), …, U^(q)]],  B_n = U^(q) ⊙ … ⊙ U^(n+1) ⊙ U^(n−1) ⊙ … ⊙ U^(1)        (7)

4.2.3) calculating the objective function value according to formula (8):

f = ‖W ∗ (𝒳 − [[U^(1), …, U^(q)]])‖_F^2 + λ1 Σ_{i=1}^{q} ‖U^(i)‖_F^2 + λ2 Σ_{i=1}^{q} ‖U^(i)‖_1        (8)

4.2.4) repeating steps 4.2.2) and 4.2.3) until the change between two successive objective function values is less than the threshold θ.
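The completion loop of 4.2) can be sketched for the two-way case, where the factor matrices reduce to an ordinary matrix factorization (the q-way version replaces the plain matrix products below with Khatri-Rao products of the remaining factors). This is an illustrative stand-in, not the patent's formula (7): missing cells (W = 0) are repeatedly imputed with the current model, an EM-style surrogate for optimizing the weighted objective directly, and `cp_complete_2d` with its parameters is a hypothetical name.

```python
import numpy as np

def cp_complete_2d(X, W, R=2, lam=1e-2, iters=200, seed=0):
    """Weighted low-rank completion of a 2-way tensor (matrix).
    W: 1 for observed cells, 0 for missing (cf. formula (6))."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.uniform(size=(m, R))           # 4.2.1) init in [0, 1]
    V = rng.uniform(size=(n, R))
    for _ in range(iters):
        Xf = W * X + (1 - W) * (U @ V.T)   # impute missing entries
        # regularized alternating least-squares updates (lam ~ lambda_1)
        U = Xf @ V @ np.linalg.inv(V.T @ V + lam * np.eye(R))
        V = Xf.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(R))
    return U @ V.T                          # dense reconstruction

# Rank-1 toy tensor with one hidden cell; completion recovers it closely.
X = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])  # [[1,2],[2,4],[3,6]]
W = np.ones_like(X); W[2, 1] = 0           # hide X[2, 1] = 6.0
Xhat = cp_complete_2d(X, W, R=1)
```

In the full q-way method a convergence check on the objective (step 4.2.4) would replace the fixed iteration count used here for brevity.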
5. The data center log missing data recovery method of claim 1, wherein the calculation of WCV in step 3.2) is as follows: first, the records in the log data are grouped according to their values on the data attributes of the selected subset A_S, the groups being denoted G = {g_1, g_2, …, g_p}; within each group g_i = {e′_{i1}, e′_{i2}, …}, every record has equal values v′_{jk} on all data attributes a_k ∈ A_S. The coefficient of variation c_i of the target data attribute values V_i within each group is calculated using formula (3), the WCV_i of each group is calculated using formula (4), and the WCV of the whole log is then calculated using formula (5):

c_i = σ(V_i) / μ(V_i)        (3)

WCV_i = (size(g_i) / size(E)) × c_i        (4)

WCV = Σ_{i=1}^{p} WCV_i        (5)

where σ(X) represents the standard deviation of X, μ(X) represents the mean of X, size(X) represents the number of data entries in X, V_i denotes the set of target data attribute values in group g_i, and E denotes the set of all records.
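Formulas (3)–(5) compose into a short computation, sketched below with hypothetical names (`wcv`, its arguments): group by the selected-attribute values, take each group's coefficient of variation of the target values, and sum the size-weighted results.

```python
import numpy as np
from collections import defaultdict

def wcv(selected_values, target_values):
    """selected_values: one tuple of A_S values per record;
    target_values: the target attribute a_T value per record."""
    groups = defaultdict(list)
    for key, t in zip(selected_values, target_values):
        groups[tuple(key)].append(t)       # records equal on all of A_S
    n = len(target_values)
    total = 0.0
    for vals in groups.values():
        x = np.asarray(vals, dtype=float)
        c = x.std() / x.mean()             # (3): c_i = sigma / mu
        total += (len(x) / n) * c          # (4) + (5): size-weighted sum
    return total

# Two groups: (0,) -> [2, 2] (CV 0) and (1,) -> [1, 3] (CV 0.5).
score = wcv([(0,), (0,), (1,), (1,)], [2.0, 2.0, 1.0, 3.0])
```

Note that `numpy.std` defaults to the population standard deviation (ddof=0), matching the plain σ(X) of formula (3).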
CN201910056129.7A 2019-01-21 2019-01-21 Data center log missing data recovery method Active CN109857593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910056129.7A CN109857593B (en) 2019-01-21 2019-01-21 Data center log missing data recovery method


Publications (2)

Publication Number Publication Date
CN109857593A CN109857593A (en) 2019-06-07
CN109857593B true CN109857593B (en) 2020-08-28

Family

ID=66895519


Country Status (1)

Country Link
CN (1) CN109857593B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183644B (en) * 2020-09-29 2024-05-03 中国平安人寿保险股份有限公司 Index stability monitoring method and device, computer equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156720A (en) * 2011-03-28 2011-08-17 中国人民解放军国防科学技术大学 Method, device and system for restoring data
CN102289524A (en) * 2011-09-26 2011-12-21 深圳市万兴软件有限公司 Data recovery method and system
CN103631676A (en) * 2013-11-06 2014-03-12 华为技术有限公司 Snapshot data generating method and device for read-only snapshot
CN103838642A (en) * 2012-11-26 2014-06-04 腾讯科技(深圳)有限公司 Data recovery method, device and system
CN103942252A (en) * 2014-03-17 2014-07-23 华为技术有限公司 Method and system for recovering data
CN105955845A (en) * 2016-04-26 2016-09-21 浪潮电子信息产业股份有限公司 Data recovery method and device
CN107220142A (en) * 2016-03-22 2017-09-29 阿里巴巴集团控股有限公司 Perform the method and device of data recovery operation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058371B2 (en) * 2011-11-07 2015-06-16 Sap Se Distributed database log recovery




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant