CN109857593B - Data center log missing data recovery method - Google Patents


Info

Publication number
CN109857593B
Authority
CN
China
Prior art keywords
data
attribute
log
tensor
discretized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910056129.7A
Other languages
Chinese (zh)
Other versions
CN109857593A (en
Inventor
梁毅
毕临风
苏醒
苏超
陈金栋
丁治明
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910056129.7A priority Critical patent/CN109857593B/en
Publication of CN109857593A publication Critical patent/CN109857593A/en
Application granted granted Critical
Publication of CN109857593B publication Critical patent/CN109857593B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for recovering missing data in data center logs. First, a correlation analysis method is used to discover the correlations among different data attributes in the log and to select an optimal data attribute subset, and a two-stage discretization step-length optimization algorithm is used to optimize the discretization of the data. Then, the selected optimal data attribute subset is used as the attributes of a tensor to construct a sparse tensor. Finally, a tensor completion method based on tensor decomposition completes the sparse tensor into a dense tensor, and the dense tensor is combined with the original incomplete log data to obtain a complete log data set.

Description

Data center log missing data recovery method
Technical Field
The invention belongs to the field of data center log analysis, and particularly relates to a method for recovering missing data in a data center log.
Background
Large-scale data centers are the information infrastructure of the internet and related industries, providing software and hardware resources such as computation, storage, and networking for internet services. Virtualization, containerization, and server consolidation technologies are widely applied in modern data centers. In this context, multiple computing frameworks and heterogeneous workloads often coexist in a data center. During operation, a data center generates massive log data containing the runtime information of its servers and workloads.
Data center log analysis is one of the important means of optimizing data center performance. Through log analysis, a data center administrator can obtain important information such as load characteristics and resource usage patterns, which in turn guides the optimization of task scheduling, resource management, and programming models. However, as data centers grow, their logs face an increasingly serious data loss problem: part of the data in the log is null or invalid and cannot be used directly as input to log analysis. There are two main causes. (1) In the log acquisition stage, bugs in the monitoring system can cause data loss; moreover, because the monitoring process is usually run at a low priority, it is deprived of resources when the cluster load is high, again causing data loss. (2) In the log processing stage, some data is anonymized or normalized for confidentiality and similar reasons, which directly causes data loss, and bugs in this stage cause additional unexpected loss. At present, the log analysis field mainly handles missing data either by simply removing missing items or by statistical completion based on the mean or regression. These existing methods have the following problems:
(1) The problem of large-scale data loss cannot be solved. As data centers grow in size, the missing rate of log data tends to increase. Faced with a large proportion of missing data, the simple removal method greatly reduces the overall information content of the log, while statistical completion based on the mean or regression has low recovery accuracy. Neither method can cope with large-scale data loss, which in turn affects the accuracy of log analysis.
(2) Complex correlations between different data attributes in data center logs cannot be handled. A data center log typically has tens of data attributes, among which various linear or nonlinear correlations exist; analyzing these correlations can improve the accuracy of data recovery. Existing methods do not consider the correlations among data attributes when recovering missing log data, so their recovery accuracy is low. Moreover, the input data attributes of a recovery algorithm must be specified manually, which is difficult for non-experts to do correctly without a correlation analysis of the log data.
Disclosure of Invention
To address these problems, the invention provides a tensor-based method for recovering missing data in data center logs. First, a correlation analysis method is used to discover the correlations among different data attributes in the log and to select an optimal data attribute subset, and a two-stage discretization step-length optimization algorithm is used to optimize the discretization of the data. Then, the selected optimal data attribute subset is used as the attributes of a tensor to construct a sparse tensor. Finally, a tensor completion method based on tensor decomposition completes the sparse tensor into a dense tensor, and the dense tensor is combined with the original incomplete log data to obtain a complete log data set.
In the present invention, the CANDECOMP/PARAFAC (CP) decomposition method is used to complete the sparse tensor. CP decomposition is a widely applied tensor completion method: it decomposes a sparse tensor into a number of rank-one tensors and mines the regularities in the tensor data to fill in the missing entries. Owing to the characteristics of data center log data, the constructed sparse tensor has low rank, so it is suitable to perform tensor completion with CP decomposition.
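To make the rank-one idea concrete, the sketch below (illustrative code, not taken from the patent) rebuilds a dense tensor from CP factor matrices with numpy; the function name and the toy factors are chosen purely for the example:

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild a dense tensor from CP factor matrices: the sum of R
    rank-one tensors, one outer product per shared column index r."""
    R = factors[0].shape[1]
    shape = tuple(F.shape[0] for F in factors)
    x = np.zeros(shape)
    for r in range(R):
        # outer product of the r-th column of every factor matrix
        comp = factors[0][:, r]
        for F in factors[1:]:
            comp = np.multiply.outer(comp, F[:, r])
        x += comp
    return x

# a toy rank-2 CP model of a 3 x 2 x 2 tensor
F1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
F2 = np.array([[2.0, 1.0], [0.0, 3.0]])
F3 = np.array([[1.0, 1.0], [2.0, 0.0]])
dense = cp_reconstruct([F1, F2, F3])
print(dense.shape)  # (3, 2, 2)
```

Completion then amounts to fitting such factor matrices to only the observed cells of the sparse tensor and reading the missing cells off the reconstruction.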
The method for recovering missing data in data center logs mainly comprises the following five steps: initialization, data attribute selection, data attribute discretization optimization, tensor construction and completion, and log missing data recovery. The method has the following basic parameters: the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, the attribute selection discretization step length S_1, the discretization optimization step length S_2, the CP decomposition rank-one tensor number R, the gradient descent learning rate η, the gradient descent objective function weights λ_1 and λ_2, and the gradient descent objective function convergence threshold θ. N_L is generally between 50 and 150, N_H between 400 and 500, S_1 between 100 and 200, S_2 between 25 and 50, and R between 5 and 30; η is typically 0.00001, λ_1 and λ_2 are generally between 0 and 1, and θ is generally 0.01.
The method is realized according to the following steps:
(1) Initialization. Let the log have n data attributes and m records. The set of data attributes can be represented as A = {a_1, a_2, …, a_n}. The set of data records in the log can be represented as E = {e_1, e_2, …, e_m}. The data in the log can be represented as a matrix V ∈ R^(n×m), where v_ij denotes the value of the i-th data attribute in the j-th data record. The data attribute with missing data is denoted a_T.
(2) Data attribute selection.
2.1) Manually select all data attributes that may be correlated with the target missing data attribute as the candidate data attribute set A' = {a_1, a_2, …, a_n'}.
2.2) Construct the discretization rule set Rule = {r_1, r_2, …} for the data selection stage, where each rule r_i = {r_i1, r_i2, …, r_in'} and r_ij denotes the number of discretization bins that the i-th rule assigns to candidate attribute a_j, with r_ij ∈ {N_L, N_L + S_1, N_L + 2S_1, …, N_H}. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the attribute selection discretization step length S_1.
2.3) Discretize the data with each rule r_i ∈ Rule; the log data discretized by rule r_i can be represented as V_i. Then select data attributes one by one.
2.3.1) Compute the adjusted mutual information (AMI) between every candidate data attribute a_i ∈ A' and the target data attribute a_T, denoted AMI(a_i; a_T), using equations (1) and (2). Then initialize the priorities of the candidate data attributes, P = {p_1, p_2, …, p_n'}, where p_i = AMI(a_i; a_T).

MI(a_i; a_T) = Σ_{x∈a_i} Σ_{y∈a_T} p(x, y) log( p(x, y) / (p(x) p(y)) )   (1)

AMI(a_i; a_T) = ( MI(a_i; a_T) − E{MI(a_i; a_T)} ) / ( max(H(a_i), H(a_T)) − E{MI(a_i; a_T)} )   (2)

where p(·) denotes the empirical probability of a discretized value, H(·) denotes entropy, and E{·} denotes the expected value.
2.3.2) Select the data attribute with the highest priority (denoted a_k), add it to the selected data attribute set, and remove it from the candidate data attribute set A'. The priority of each remaining candidate data attribute a_l ∈ A' is updated to p_l × (1 − AMI(a_l; a_k)).
2.3.3) Repeat step 2.3.2) until the number of selected data attributes equals the target number.
2.3.4) Record the selection result as result_i and add it to the selection result set Result.
2.4) Count all the selection results in the selection result set Result and take the data attribute set that occurs most frequently as the final data attribute selection result A_S = {a_1, a_2, …, a_q}.
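The greedy loop of steps 2.3.1)–2.3.4) can be sketched as follows. This is an assumption-laden illustration: it uses a plain entropy-normalized mutual information (`norm_mi`) as a stand-in for the patent's AMI, and all function and variable names are invented for the example:

```python
import math
from collections import Counter

def entropy(col):
    """Empirical entropy (nats) of a discretized column."""
    n = len(col)
    return -sum(c / n * math.log(c / n) for c in Counter(col).values())

def mutual_info(x, y):
    """Empirical mutual information (nats) between two discretized columns."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log((c / n) / ((cx[a] / n) * (cy[b] / n)))
               for (a, b), c in cxy.items())

def norm_mi(x, y):
    """Entropy-normalized MI in [0, 1]; a stand-in for the patent's AMI."""
    h = max(entropy(x), entropy(y))
    return mutual_info(x, y) / h if h > 0 else 0.0

def greedy_select(candidates, target, k):
    """Steps 2.3.1)-2.3.3): initialize p_i = AMI(a_i; a_T), then repeatedly
    take the top-priority attribute a_k and rescale every remaining
    priority by (1 - AMI(a_l; a_k)) to penalize redundancy."""
    prio = {name: norm_mi(col, target) for name, col in candidates.items()}
    selected = []
    while len(selected) < k and prio:
        best = max(prio, key=prio.get)
        selected.append(best)
        del prio[best]
        for name in prio:
            prio[name] *= 1 - norm_mi(candidates[name], candidates[best])
    return selected
```

For example, given one candidate identical to the target and one independent of it, the copy is selected first because its normalized MI with the target is 1 while the independent column's is 0.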
(3) Discretization granularity optimization.
3.1) Construct the discretization rule set Rule' = {r'_1, r'_2, …} for the discretization granularity optimization stage, where each rule r'_i = {r'_i1, r'_i2, …, r'_iq} and r'_ij denotes the number of discretization bins that the i-th rule assigns to selected attribute a_j, with r'_ij ∈ {N_L, N_L + S_2, N_L + 2S_2, …, N_H}. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the discretization optimization step length S_2.
3.2) Based on the selected data attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized by rule r'_i can be represented as V'_i.
Compute the weighted coefficient of variation (WCV) of the discretized log data. WCV is calculated as follows. First, the records in the log data are grouped by their values on the data attribute subset A_S, denoted G = {g_1, g_2, …, g_p}; each group g_i contains records whose values are equal on every data attribute a_k ∈ A_S. Using equation (3), compute the coefficient of variation c_i of the target data attribute values within each group. Then compute WCV_i for each group using equation (4), and the WCV of the entire log using equation (5):

c_i = σ(g_i^T) / μ(g_i^T)   (3)

WCV_i = ( size(g_i) / m ) × c_i   (4)

WCV = Σ_{i=1}^{p} WCV_i   (5)

where g_i^T denotes the target attribute values of the records in group g_i, σ(X) represents the standard deviation of X, μ(X) represents the mean of X, and size(X) represents the number of data entries in X.
3.3) Select the data discretization result with the minimum WCV value as the final data discretization result, and denote the discretized log data V_F.
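A minimal sketch of the WCV computation of step 3.2), under the grouping just described (records that agree on every selected attribute form one group, and each group's coefficient of variation is weighted by its share of all records); function and field names are illustrative:

```python
import math
from collections import defaultdict

def weighted_cv(records, selected, target):
    """WCV of a discretized log: records that agree on every selected
    attribute form one group; each group's coefficient of variation of
    the target attribute is weighted by the group's share of records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in selected)].append(rec[target])
    m = len(records)
    wcv = 0.0
    for vals in groups.values():
        mu = sum(vals) / len(vals)
        if mu == 0:
            continue  # coefficient of variation undefined for zero mean
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        wcv += (len(vals) / m) * (math.sqrt(var) / mu)
    return wcv
```

A discretization under which records with identical selected-attribute values also have similar target values yields a small WCV, which is why the rule with the minimum WCV is kept in step 3.3).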
4) Tensor construction and tensor completion.
4.1) Construct a tensor using the discretized log data V_F and the target data attribute a_T. Let each data attribute a_i ∈ A_S have S_i discrete values; construct a q-dimensional tensor χ ∈ R^(S_1×S_2×…×S_q).
4.1.1) Arrange the discrete values of each data attribute a_i ∈ A_S in ascending order, and construct a mapping M from each value v to its rank d.
4.1.2) Fill the values of the target data attribute a_T into the tensor as tensor values. Suppose data record e_i takes the values {v^F_i1, v^F_i2, …, v^F_iq} on the selected data attributes A_S = {a_1, a_2, …, a_q} and the value v^T_i on the target data attribute a_T. Obtain through the mapping M the ranks {d_i1, d_i2, …, d_iq} corresponding to {v^F_i1, v^F_i2, …, v^F_iq}; then the tensor value is χ_{d_i1, d_i2, …, d_iq} = v^T_i. When u records share the same tensor subscript, the mean of their target attribute values, (1/u) Σ v^T, is used as the tensor value.
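Steps 4.1.1)–4.1.2) can be sketched as follows; the helper returns the tensor, an observed-cell mask (usable as the weight tensor W of step 4.2)), and the value-to-rank mappings M. Names are illustrative:

```python
import numpy as np

def build_tensor(records, selected, target):
    """Steps 4.1.1)-4.1.2): map each selected attribute's sorted discrete
    values to ranks, then fill a q-way tensor with the target value,
    averaging records that land on the same cell."""
    maps = []
    for a in selected:
        vals = sorted({rec[a] for rec in records})
        maps.append({v: i for i, v in enumerate(vals)})  # mapping M: value -> rank
    shape = tuple(len(m) for m in maps)
    total = np.zeros(shape)
    count = np.zeros(shape)
    for rec in records:
        idx = tuple(m[rec[a]] for m, a in zip(maps, selected))
        total[idx] += rec[target]
        count[idx] += 1
    observed = count > 0
    tensor = np.zeros(shape)
    tensor[observed] = total[observed] / count[observed]  # mean over u duplicates
    return tensor, observed, maps
```

Cells never hit by any record stay zero and are marked unobserved; those are exactly the entries the CP completion of step 4.2) must fill in.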
4.2) Complete the tensor using the CP decomposition method, solving the decomposition by gradient descent.
4.2.1) Initialize q factor matrices with random values on the interval [0, 1]; factor matrix F_i ∈ R^(S_i×R) corresponds to data attribute a_i, where S_i is the number of discrete values of attribute a_i and R is a hyper-parameter of the algorithm. Initialize the weight tensor W according to formula (6):

W_{d_1 d_2 … d_q} = 1 if χ_{d_1 d_2 … d_q} is observed, 0 otherwise   (6)

4.2.2) Update the factor matrices according to formula (7), where ℰ = χ − [[F_1, F_2, …, F_q]], χ is the constructed sparse tensor, [[F_1, …, F_q]] denotes the tensor reconstructed from the factor matrices, ⊙ is the Khatri-Rao product, χ_(N) denotes the N-mode matricization of the tensor χ, * denotes the element-wise product, and λ_1 and λ_2 are algorithm hyper-parameters:

F_n ← F_n + η ( λ_1 (W * ℰ)_(n) (F_q ⊙ … ⊙ F_{n+1} ⊙ F_{n−1} ⊙ … ⊙ F_1) − λ_2 F_n )   (7)

4.2.3) Compute the objective function value according to formula (8):

E = (λ_1/2) ‖W * ℰ‖² + (λ_2/2) Σ_{i=1}^{q} ‖F_i‖²   (8)

4.2.4) Repeat steps 4.2.2) and 4.2.3) until the change between two successive objective function values is less than the threshold θ.
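For the 3-way case used in the embodiment, steps 4.2.1)–4.2.4) can be sketched as a weighted CP completion by gradient descent. This is an illustrative sketch, not the patent's code: it assumes an objective of the form E = (λ_1/2)‖W*(χ − [[F_1,F_2,F_3]])‖² + (λ_2/2)Σ‖F_i‖² (the patent's exact formulas are not fully recoverable from the source), and all hyper-parameter defaults are arbitrary:

```python
import numpy as np

def cp_complete3(chi, W, R=5, lr=0.01, lam1=1.0, lam2=1e-4, theta=1e-12,
                 iters=30000, seed=0):
    """Weighted 3-way CP completion by gradient descent.
    chi: tensor with zeros at missing cells; W: 0/1 mask of observed cells.
    Assumed objective: E = lam1/2 * ||W*(chi - [[F1,F2,F3]])||^2
                         + lam2/2 * (||F1||^2 + ||F2||^2 + ||F3||^2)."""
    rng = np.random.default_rng(seed)
    F = [rng.random((s, R)) for s in chi.shape]           # 4.2.1): U[0,1] init
    prev = np.inf
    for _ in range(iters):
        X = np.einsum('ir,jr,kr->ijk', F[0], F[1], F[2])  # [[F1,F2,F3]]
        D = W * (chi - X)                                 # residual on observed cells
        # 4.2.2): simultaneous gradient step on each factor matrix
        G0 = lam1 * np.einsum('ijk,jr,kr->ir', D, F[1], F[2]) - lam2 * F[0]
        G1 = lam1 * np.einsum('ijk,ir,kr->jr', D, F[0], F[2]) - lam2 * F[1]
        G2 = lam1 * np.einsum('ijk,ir,jr->kr', D, F[0], F[1]) - lam2 * F[2]
        F[0] += lr * G0
        F[1] += lr * G1
        F[2] += lr * G2
        # 4.2.3)-4.2.4): objective value and convergence test
        E = 0.5 * lam1 * (D ** 2).sum() + 0.5 * lam2 * sum((M ** 2).sum() for M in F)
        if abs(prev - E) < theta:
            break
        prev = E
    return np.einsum('ir,jr,kr->ijk', F[0], F[1], F[2])   # dense completed tensor
```

On a tensor whose observed entries follow a low-rank pattern, the returned dense reconstruction interpolates the unobserved cells from that pattern, which is the behavior the method relies on for log recovery.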
5) Log data recovery. For each record e_i with missing data that takes the values {v^F_i1, v^F_i2, …, v^F_iq} on the data attributes A_S = {a_1, a_2, …, a_q}, obtain through the mapping M the ranks {d_i1, d_i2, …, d_iq} corresponding to {v^F_i1, v^F_i2, …, v^F_iq}, and recover the missing data using the completed tensor value χ_{d_i1, d_i2, …, d_iq}.
Drawings
FIG. 1 is a deployment diagram of the method of the present invention.
Fig. 2 is a general flow diagram of the present invention.
FIG. 3 is a flow chart of log data attribute selection.
FIG. 4 is a flow chart of log data discretization optimization.
FIG. 5 is a flow chart of tensor construction and completion.
Detailed Description
The invention is described below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a deployment diagram of the method of the present invention. The method runs on a plurality of computer servers connected through a network. Platform nodes are divided into two categories: storage nodes and computing nodes. The method comprises two types of core software modules: a log storage module and a log processing module. The log storage module is responsible for storing log data and is deployed on the storage nodes; the log processing module is responsible for processing log data and is deployed on the computing nodes.
A specific implementation of the method is described below with reference to the general flow chart in FIG. 2. In the present embodiment, the basic parameters are set as follows: the discretization bin number lower bound N_L = 100, the discretization bin number upper bound N_H = 500, the attribute selection discretization step length S_1 = 100, the discretization optimization step length S_2 = 25, the CP decomposition rank-one tensor number R = 25, the gradient descent learning rate η = 0.00001, the gradient descent objective function weight λ_1 = 0.5, and the gradient descent objective function convergence threshold θ = 0.01.
The specific implementation method can be divided into the following steps:
(1) Initialization. The data center log has 49 data attributes and 10364956 records in total. The set of data attributes can be represented as A = {a_1, a_2, …, a_49}; the set of data records as E = {e_1, e_2, …, e_10364956}; the log data as V ∈ R^(49×10364956). The data attribute with missing data is real_mem_avg (average memory usage), denoted a_T.
(2) Data attribute selection, a flow chart of which is shown in fig. 3.
2.1) Manually select all data attributes that may be correlated with the target missing data attribute as the candidate data attribute set A' = {plan_cpu, plan_mem, instance_num, duration, real_cpu_avg, end_time} (requested CPU resources, requested memory resources, number of instances, duration, average actual CPU usage, end time).
2.2) Construct the discretization rule set of the data selection stage, Rule = {r_1, r_2, …, r_15625}, where each rule r_i = {r_i1, r_i2, …, r_i6} and r_ij denotes the number of discretization bins that the i-th rule assigns to candidate attribute a_j. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound 100, the upper bound 500, and the attribute selection discretization step length 100.
2.3) Discretize the data with each rule r_i ∈ Rule; the log data discretized by rule r_i can be represented as V_i. Then select data attributes one by one.
2.3.1) Compute the AMI between every candidate data attribute a_i ∈ A' and the target data attribute a_T, AMI(a_i; a_T), according to the method in step 2.3.1) of the summary. Then initialize the priorities of the candidate data attributes, P = {0.02, 0.11, 0.018, 0.09, 0.009, 0.14}.
2.3.2) Select the data attribute with the highest priority, end_time, add it to the selected data attribute set, and remove it from the candidate data attribute set A'. According to the method in step 2.3.2) of the summary, the priorities of the remaining candidate data attributes a_l ∈ A' are updated to {0.018, 0.09, 0.015, 0.07, 0.0087}.
2.3.3) Repeat step 2.3.2) until the number of selected data attributes equals the target number.
2.3.4) Record the selection result as result_i and add it to the selection result set Result.
2.4) Count all the selection results in the selection result set Result and take the data attribute set that occurs most frequently as the final data attribute selection result A_S = {end_time, plan_mem, duration}.
(3) Discretized granularity optimization, a flow chart of this step is shown in fig. 4.
3.1) Construct the discretization rule set of the discretization granularity optimization stage, Rule' = {r'_1, r'_2, …, r'_4096}, where each rule r'_i = {r'_i1, r'_i2, r'_i3} and r'_ij denotes the number of discretization bins that the i-th rule assigns to the j-th attribute in {end_time, plan_mem, duration}. That is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound 100, the upper bound 500, and the discretization optimization step length 25.
3.2) Based on the selected data attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized by rule r'_i can be represented as V'_i. The WCV of the discretized log data is calculated according to the method in step 3.2) of the summary, giving WCV = 0.35647.
3.3) Select the data discretization result with the minimum WCV value as the final data discretization result, and denote the discretized log data V_F.
4) Tensor construction and tensor completion; a flowchart of this step is shown in FIG. 5.
4.1) Construct a tensor using the discretized log data V_F and the target data attribute a_T. The data attributes a_i ∈ A_S have 276, 87, and 61 discrete values respectively, so a 3-dimensional tensor χ ∈ R^(276×87×61) is constructed.
4.1.1) Arrange the discrete values of each data attribute a_i ∈ A_S in ascending order, and construct a mapping M from each value v to its rank d.
4.1.2) Fill the values of the target data attribute a_T into the tensor as tensor values. Suppose data record e_i takes the values {35519, 0.016, 34} on the selected data attributes {end_time, plan_mem, duration} and the value 0.023814 on the target data attribute a_T. Through the mapping M, the ranks corresponding to {35519, 0.016, 34} are {1, 13, 24}, so the tensor value is χ_{1,13,24} = 0.023814.
4.2) Complete the tensor using the CP decomposition method, solving the decomposition by gradient descent.
4.2.1) Initialize three factor matrices with random values on the interval [0, 1]: F_1 ∈ R^(276×25), F_2 ∈ R^(87×25), F_3 ∈ R^(61×25). Initialize the weight tensor W according to the method in step 4.2.1) of the summary.
4.2.2) Update the three factor matrices according to the method in step 4.2.2) of the summary.
4.2.3) Compute the objective function value E = 7983.348 according to the method in step 4.2.3) of the summary.
4.2.4) Repeat steps 4.2.2) and 4.2.3) until the change between two successive objective function values is less than the threshold 0.01.
5) Log data recovery. For a record e_1 with missing data whose values on the data attributes A_S = {end_time, plan_mem, duration} are {45682, 0.008, 89}, obtain from the mapping M the ranks {34, 5, 41} corresponding to {45682, 0.008, 89}, and recover the missing data using the completed tensor value χ_{34,5,41}.
The inventors carried out performance tests on the data center log missing data recovery method provided by the invention. The test results show that the method can accurately recover missing data in data center logs.
The performance tests used the Alibaba data center log as the test data set. The method of the invention is compared with the missing data recovery methods used in existing log analysis work (mean recovery and linear regression recovery) and with advanced data recovery methods widely applied in other fields (KNN recovery, multilayer perceptron recovery, and support vector machine recovery) to demonstrate its accuracy advantage in recovering missing data from data center logs. The performance tests were executed on one computer with the following hardware configuration: AMD Ryzen 7 1700X @ 3.80 GHz CPU, 32 GB DDR4 RAM, 512 GB NVMe SSD.
The performance tests evaluate the data recovery error using two metrics: the mean relative error (MRE) and the root mean square error (RMSE), calculated as shown in equations (9) and (10), where x_i is the true value, x̂_i is the recovered value, and n is the number of recovered entries:

MRE = (1/n) Σ_{i=1}^{n} |x̂_i − x_i| / x_i   (9)

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (x̂_i − x_i)² )   (10)
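A minimal sketch of the two error metrics, assuming the standard definitions (MRE averages the per-entry relative error; RMSE is the root of the mean squared error):

```python
def mre(truth, pred):
    """Mean relative error: average of |x_hat - x| / x (assumes nonzero truth)."""
    return sum(abs(p - t) / abs(t) for t, p in zip(truth, pred)) / len(truth)

def rmse(truth, pred):
    """Root mean square error of the recovered values."""
    return (sum((p - t) ** 2 for t, p in zip(truth, pred)) / len(truth)) ** 0.5
```

MRE is scale-free (an error of 1 on a true value of 2 counts the same as 100 on 200), while RMSE penalizes large absolute deviations; reporting both, as the tests below do, covers both views of recovery quality.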
The performance tests are divided into 4 groups according to the missing ratio and missing mode of the log data: a 30% missing rate following the Alibaba log missing pattern (TM30), an 85% missing rate following the Alibaba log missing pattern (TM85), a 30% missing rate with data missing completely at random (RM30), and an 85% missing rate with data missing completely at random (RM85). The results of the performance tests are shown in Tables 1 and 2.
TABLE 1 Performance test results (MRE)
TABLE 2 Performance test Results (RMSE)
From the data in Tables 1 and 2, it can be seen that across the four experiments the method of the invention reduces both error metrics relative to the five comparison methods, with an average RMSE reduction of 56.6%, a maximum MRE reduction of 85.9%, and a maximum RMSE reduction of 92%. The errors of the two machine learning recovery methods with the lowest average errors, multilayer perceptron recovery and support vector machine recovery, increase significantly as the missing proportion grows, while the error of the proposed method remains stable: at the 30% and 85% missing rates, the maximum MRE improvement over these methods is 32.7% and 50%, respectively. The performance test results show that, compared with the five comparison methods, the proposed method recovers missing data with lower and more stable error, and achieves higher accuracy under different data missing rates.
Finally, it should be noted that the above examples are only intended to illustrate the present invention, not to limit the technology described herein; all technical solutions and modifications thereof that do not depart from the spirit and scope of the present invention shall be covered by the claims of the present invention.

Claims (5)

1. A data center log missing data recovery method, characterized by comprising the following steps:
(1) initialization: the log has n data attributes and m records; the set of data attributes can be represented as A = {a_1, a_2, …, a_n}; the set of data records in the log can be represented as E = {e_1, e_2, …, e_m}; the data in the log can be represented as a matrix V ∈ R^(n×m), where v_ij denotes the value of the i-th data attribute in the j-th data record; the data attribute with missing data is denoted a_T;
(2) data attribute selection:
2.1) select all data attributes that may be correlated with the target missing data attribute as the candidate data attribute set A' = {a_1, a_2, …, a_n'};
2.2) construct the discretization rule set Rule = {r_1, r_2, …} of the data selection stage, where each rule r_i = {r_i1, r_i2, …, r_in'} and r_ij denotes the number of discretization bins that the i-th rule assigns to candidate attribute a_j, with r_ij ∈ {N_L, N_L + S_1, N_L + 2S_1, …, N_H}; that is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the attribute selection discretization step length S_1;
2.3) discretize the data with each rule r_i ∈ Rule; the log data discretized by rule r_i can be represented as V_i; then select data attributes one by one;
2.4) count all the selection results in the selection result set Result and take the data attribute set that occurs most frequently as the final data attribute selection result A_S = {a_1, a_2, …, a_q};
(3) discretization granularity optimization:
3.1) construct the discretization rule set Rule' = {r'_1, r'_2, …} of the discretization granularity optimization stage, where each rule r'_i = {r'_i1, r'_i2, …, r'_iq} and r'_ij denotes the number of discretization bins that the i-th rule assigns to attribute a_j, with r'_ij ∈ {N_L, N_L + S_2, N_L + 2S_2, …, N_H}; that is, all combinations of data attributes and discretization bin numbers are traversed in the search space determined by the discretization bin number lower bound N_L, the discretization bin number upper bound N_H, and the discretization optimization step length S_2;
3.2) based on the selected data attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized by rule r'_i can be represented as V'_i; calculate the weighted coefficient of variation (WCV) of the discretized log data;
3.3) select the data discretization result with the minimum WCV value as the final data discretization result, and denote the discretized log data V_F;
(4) tensor construction and tensor completion:
4.1) construct a tensor using the discretized log data V_F and the target data attribute a_T; let each data attribute a_i ∈ A_S have S_i discrete values, and construct a q-dimensional tensor χ ∈ R^(S_1×S_2×…×S_q);
4.2) complete the tensor using the CP decomposition method, solving the decomposition by gradient descent;
(5) log data recovery:
for each record e_i with missing data that takes the values {v^F_i1, v^F_i2, …, v^F_iq} on the data attributes A_S = {a_1, a_2, …, a_q}, obtain through the mapping M the ranks {d_i1, d_i2, …, d_iq} corresponding to {v^F_i1, v^F_i2, …, v^F_iq}, and recover the missing data using the completed tensor value χ_{d_i1, d_i2, …, d_iq}.
2. The data center log missing data recovery method of claim 1, wherein step 2.3) comprises:
2.3.1) compute the adjusted mutual information (AMI) between every candidate data attribute a_i ∈ A' and the target data attribute a_T, denoted AMI(a_i; a_T), using equations (1) and (2); then initialize the priorities of the candidate data attributes, P = {p_1, p_2, …, p_n'}, where p_i = AMI(a_i; a_T),

MI(a_i; a_T) = Σ_{x∈a_i} Σ_{y∈a_T} p(x, y) log( p(x, y) / (p(x) p(y)) )   (1)

AMI(a_i; a_T) = ( MI(a_i; a_T) − E{MI(a_i; a_T)} ) / ( max(H(a_i), H(a_T)) − E{MI(a_i; a_T)} )   (2)

2.3.2) select the data attribute with the highest priority (denoted a_k), add it to the selected data attribute set, and remove it from the candidate data attribute set A'; the priority of each remaining candidate data attribute a_l ∈ A' is updated to p_l × (1 − AMI(a_l; a_k)),
2.3.3) repeat step 2.3.2) until the number of selected data attributes equals the target number,
2.3.4) record the selection result as result_i and add it to the selection result set Result.
3. The data center log missing data recovery method of claim 1, wherein step 4.1) comprises:
4.1.1) arranging the discrete values of each data attribute a_i ∈ A_S in ascending order, and constructing a mapping M from each value v to its permutation number d;
4.1.2) filling the tensor with the values of the target data attribute a_T: for a data record e_i whose values on the selected data attributes A_S = {a_1, a_2, …, a_q} are {v_{i1}, v_{i2}, …, v_{iq}} and whose value on the target data attribute a_T is v_i^T, obtaining through the mapping M the corresponding permutation numbers {d_{i1}, d_{i2}, …, d_{iq}} and setting the tensor value 𝒳(d_{i1}, d_{i2}, …, d_{iq}) = v_i^T; when u records have the same tensor subscript, using the mean of their target attribute values, (1/u) Σ_{j=1}^{u} v_j^T, as the value of the tensor.
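The construction in 4.1.2), including the mean rule for colliding records, can be sketched as follows. `build_sparse_tensor` and its inputs are hypothetical names; the index tuples are assumed to have already been produced by the mapping M of step 4.1.1).

```python
import numpy as np
from collections import defaultdict

def build_sparse_tensor(indices, targets, shape):
    """indices: one tensor-subscript tuple per record (ranks from M);
    targets: the record's target-attribute value.
    Returns the sparse tensor and the observed-cell indicator."""
    cells = defaultdict(list)
    for idx, t in zip(indices, targets):
        cells[idx].append(t)
    tensor = np.zeros(shape)
    mask = np.zeros(shape)                 # 1 where the cell is observed
    for idx, vals in cells.items():
        tensor[idx] = np.mean(vals)        # mean over the u colliding records
        mask[idx] = 1.0
    return tensor, mask

# Two records share subscript (0, 1): that cell gets (3.0 + 5.0) / 2 = 4.0.
T, W = build_sparse_tensor([(0, 1), (0, 1), (2, 0)], [3.0, 5.0, 7.0], (3, 2))
```

The returned indicator `W` is exactly the weight matrix initialized by formula (6) in claim 4.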
4. The data center log missing data recovery method of claim 1, wherein step 4.2) comprises:
4.2.1) initializing q factor matrices with random values drawn from the interval [0, 1], the factor matrix U^(i) ∈ R^{S_i × R} corresponding to data attribute a_i, where S_i is the number of discrete values of attribute a_i and R is a hyper-parameter of the algorithm, and initializing the weight matrix W according to formula (6):

W(d_1, d_2, …, d_q) = 1 if 𝒳(d_1, d_2, …, d_q) is observed, and 0 otherwise        (6)

4.2.2) updating the factor matrices according to formula (7), where 𝒳 is the constructed sparse tensor, "⊙" is the Khatri-Rao operator, 𝒳_(n) denotes the mode-n matricization of the tensor 𝒳, "∗" denotes the element-wise product, [[·]] denotes the CP reconstruction from the factor matrices, and λ1 and λ2 are algorithm hyper-parameters:

U^(n) ← 𝒴_(n) B_n (B_n^T B_n + λ1 I)^{−1},  𝒴 = W ∗ 𝒳 + (1 − W) ∗ [[U^(1), …, U^(q)]],  B_n = U^(q) ⊙ … ⊙ U^(n+1) ⊙ U^(n−1) ⊙ … ⊙ U^(1)        (7)

4.2.3) calculating the objective function value according to formula (8):

f = ‖W ∗ (𝒳 − [[U^(1), …, U^(q)]])‖_F^2 + λ1 Σ_{i=1}^{q} ‖U^(i)‖_F^2 + λ2 Σ_{i=1}^{q} ‖U^(i)‖_1        (8)

4.2.4) repeating steps 4.2.2) and 4.2.3) until the change between two successive objective function values is less than the threshold θ.
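The completion loop of 4.2) can be sketched for the two-way case, where the factor matrices reduce to an ordinary matrix factorization (the q-way version replaces the plain matrix products below with Khatri-Rao products of the remaining factors). This is an illustrative stand-in, not the patent's formula (7): missing cells (W = 0) are repeatedly imputed with the current model, an EM-style surrogate for optimizing the weighted objective directly, and `cp_complete_2d` with its parameters is a hypothetical name.

```python
import numpy as np

def cp_complete_2d(X, W, R=2, lam=1e-2, iters=200, seed=0):
    """Weighted low-rank completion of a 2-way tensor (matrix).
    W: 1 for observed cells, 0 for missing (cf. formula (6))."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.uniform(size=(m, R))           # 4.2.1) init in [0, 1]
    V = rng.uniform(size=(n, R))
    for _ in range(iters):
        Xf = W * X + (1 - W) * (U @ V.T)   # impute missing entries
        # regularized alternating least-squares updates (lam ~ lambda_1)
        U = Xf @ V @ np.linalg.inv(V.T @ V + lam * np.eye(R))
        V = Xf.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(R))
    return U @ V.T                          # dense reconstruction

# Rank-1 toy tensor with one hidden cell; completion recovers it closely.
X = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])  # [[1,2],[2,4],[3,6]]
W = np.ones_like(X); W[2, 1] = 0           # hide X[2, 1] = 6.0
Xhat = cp_complete_2d(X, W, R=1)
```

In the full q-way method a convergence check on the objective (step 4.2.4) would replace the fixed iteration count used here for brevity.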
5. The data center log missing data recovery method of claim 1, wherein the calculation of WCV in step 3.2) is as follows: first, the records in the log data are grouped according to their values on the data attributes of the selected subset A_S, the groups being denoted G = {g_1, g_2, …, g_p}; within each group g_i = {e′_{i1}, e′_{i2}, …}, every record has equal values v′_{jk} on all data attributes a_k ∈ A_S. The coefficient of variation c_i of the target data attribute values V_i within each group is calculated using formula (3), the WCV_i of each group is calculated using formula (4), and the WCV of the whole log is then calculated using formula (5):

c_i = σ(V_i) / μ(V_i)        (3)

WCV_i = (size(g_i) / size(E)) × c_i        (4)

WCV = Σ_{i=1}^{p} WCV_i        (5)

where σ(X) represents the standard deviation of X, μ(X) represents the mean of X, size(X) represents the number of data entries in X, V_i denotes the set of target data attribute values in group g_i, and E denotes the set of all records.
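Formulas (3)–(5) compose into a short computation, sketched below with hypothetical names (`wcv`, its arguments): group by the selected-attribute values, take each group's coefficient of variation of the target values, and sum the size-weighted results.

```python
import numpy as np
from collections import defaultdict

def wcv(selected_values, target_values):
    """selected_values: one tuple of A_S values per record;
    target_values: the target attribute a_T value per record."""
    groups = defaultdict(list)
    for key, t in zip(selected_values, target_values):
        groups[tuple(key)].append(t)       # records equal on all of A_S
    n = len(target_values)
    total = 0.0
    for vals in groups.values():
        x = np.asarray(vals, dtype=float)
        c = x.std() / x.mean()             # (3): c_i = sigma / mu
        total += (len(x) / n) * c          # (4) + (5): size-weighted sum
    return total

# Two groups: (0,) -> [2, 2] (CV 0) and (1,) -> [1, 3] (CV 0.5).
score = wcv([(0,), (0,), (1,), (1,)], [2.0, 2.0, 1.0, 3.0])
```

Note that `numpy.std` defaults to the population standard deviation (ddof=0), matching the plain σ(X) of formula (3).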
CN201910056129.7A 2019-01-21 2019-01-21 Data center log missing data recovery method Active CN109857593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910056129.7A CN109857593B (en) 2019-01-21 2019-01-21 Data center log missing data recovery method


Publications (2)

Publication Number Publication Date
CN109857593A CN109857593A (en) 2019-06-07
CN109857593B true CN109857593B (en) 2020-08-28

Family

ID=66895519


Country Status (1)

Country Link
CN (1) CN109857593B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183644B (en) * 2020-09-29 2024-05-03 中国平安人寿保险股份有限公司 Index stability monitoring method and device, computer equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156720A (en) * 2011-03-28 2011-08-17 中国人民解放军国防科学技术大学 Method, device and system for restoring data
CN102289524A (en) * 2011-09-26 2011-12-21 深圳市万兴软件有限公司 Data recovery method and system
CN103631676A (en) * 2013-11-06 2014-03-12 华为技术有限公司 Snapshot data generating method and device for read-only snapshot
CN103838642A (en) * 2012-11-26 2014-06-04 腾讯科技(深圳)有限公司 Data recovery method, device and system
CN103942252A (en) * 2014-03-17 2014-07-23 华为技术有限公司 Method and system for recovering data
CN105955845A (en) * 2016-04-26 2016-09-21 浪潮电子信息产业股份有限公司 Data recovery method and device
CN107220142A (en) * 2016-03-22 2017-09-29 阿里巴巴集团控股有限公司 Perform the method and device of data recovery operation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058371B2 (en) * 2011-11-07 2015-06-16 Sap Se Distributed database log recovery




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant