CN109857593A - Method for recovering missing data in data center logs - Google Patents

Method for recovering missing data in data center logs

Info

Publication number
CN109857593A
CN109857593A (application CN201910056129.7A)
Authority
CN
China
Prior art keywords
data
discretization
attribute
tensor
data attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910056129.7A
Other languages
Chinese (zh)
Other versions
CN109857593B (en)
Inventor
梁毅
毕临风
苏醒
苏超
陈金栋
丁治明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201910056129.7A
Publication of CN109857593A
Application granted
Publication of CN109857593B
Legal status: Active
Anticipated expiration

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method for recovering missing data in data center logs. Correlation analysis is first used to mine the correlations among the different data attributes of the log and to select an optimal set of data attributes, and the data are discretized with a two-stage discretization step-size optimization algorithm; the selected optimal data attribute set is then used as the modes of a tensor to construct a sparse tensor; finally, a tensor completion method based on tensor decomposition is applied to the sparse tensor to obtain a dense tensor. Combining the dense tensor with the original incomplete log data yields a complete log data set.

Description

Method for recovering missing data in data center logs
Technical field
The invention belongs to the field of data center log analysis, and relates in particular to a method for recovering missing data in data center logs.
Background technique
Large-scale data centers are the basic IT infrastructure of the internet and related industries, providing the software and hardware resources, such as computation, storage, and networking, on which internet services run. Modern data centers commonly use virtualization, containerization, and server consolidation technologies. In this context, multiple computing frameworks and a variety of heterogeneous workloads typically coexist within a data center. During operation, a data center generates massive amounts of log data containing runtime information about its servers and workloads.
Data center log analysis is one of the important means of data center performance optimization. Through log analysis, data center administrators can obtain important information such as load characteristics and resource usage patterns, which in turn guides task scheduling, resource management, and programming-model optimization. However, as data centers continue to grow in scale, their logs face an increasingly serious missing-data problem: part of the log data is empty or invalid and cannot be used directly as input to log analysis. There are two main causes: (1) in the collection stage of the log data, bugs in the monitoring system may cause data loss; in addition, monitoring processes are usually assigned a low priority and may be deprived of resources when the cluster load is high, which also leads to missing data; (2) in the processing stage of the log data, some data are anonymized or normalized for reasons such as confidentiality, which can directly cause data loss, and bugs in this process can cause additional, unexpected loss. At present, the log analysis field handles missing data mainly by simply removing incomplete records or by recovering missing values with statistical imputation methods based on means or regression. Existing methods have the following problems:
(1) They cannot cope with large-scale data loss. As data centers grow, the proportion of missing entries in log data tends to rise. When a large fraction of the data is missing, simple removal greatly reduces the total amount of information in the log, while statistical imputation based on means or regression has low recovery accuracy. Neither approach can cope with large-scale data loss, which in turn degrades the accuracy of log analysis.
(2) They cannot cope with the complex correlations among the different data attributes of a data center log. A data center log usually has ten to dozens of data attributes, among which there are different linear or nonlinear correlations; analyzing these correlations can improve the accuracy of data recovery. Existing methods do not consider the correlations among attributes when recovering missing log data, which results in low recovery accuracy. Moreover, the input data attributes of the recovery algorithm must be specified manually, which non-expert users find difficult to do correctly without performing a correlation analysis of the log data.
Summary of the invention
In view of the above problems, the invention proposes a tensor-based method for recovering missing data in data center logs. First, correlation analysis is used to mine the correlations among the different data attributes of the log and to select an optimal set of data attributes, and the data are discretized with a two-stage discretization step-size optimization algorithm. The selected optimal data attribute set is then used as the modes of a tensor to construct a sparse tensor. Finally, a tensor completion method based on tensor decomposition is applied to the sparse tensor to obtain a dense tensor. Combining the dense tensor with the original incomplete log data yields a complete log data set.
In the present invention, the CANDECOMP/PARAFAC (CP) decomposition is used to complete the sparse tensor. CP decomposition is a widely used tensor completion method: it decomposes a sparse tensor into several rank-one tensors, thereby mining the underlying patterns of the tensor data and using them to fill in the sparse tensor. Owing to the characteristics of data center log data, the constructed sparse tensor has low rank, so CP decomposition is well suited to its completion.
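For clarity, the rank-R CP model referred to here can be written in standard notation (a general textbook formulation rather than an equation reproduced from the patent):

```latex
\mathcal{X} \;\approx\; [[F_1, F_2, \dots, F_q]]
\;=\; \sum_{r=1}^{R} f^{(1)}_{r} \circ f^{(2)}_{r} \circ \cdots \circ f^{(q)}_{r},
```

where F_i is the factor matrix of the i-th tensor mode (of size S_i × R), f^(i)_r is its r-th column, and ∘ denotes the vector outer product.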
The data center log missing-data recovery method of the present invention is broadly divided into five steps: initialization, data attribute selection, data attribute discretization optimization, tensor construction and completion, and log missing-data completion. The method has the following basic parameters: the discretization bin-count lower bound N_L, the discretization bin-count upper bound N_H, the attribute-selection discretization step size S_1, the discretization-optimization step size S_2, the number R of rank-one tensors in the CP decomposition, the gradient-descent learning rate, the gradient-descent objective-function weights λ_1 and λ_2, and the gradient-descent objective-function convergence threshold θ. Typical values are: N_L between 50 and 150, N_H between 400 and 500, S_1 between 100 and 200, S_2 between 25 and 50, R between 5 and 30, a learning rate of 0.00001, λ_1 and λ_2 between 0 and 1, and θ = 0.01.
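For reference, these parameters can be gathered into a single configuration; the sketch below simply records the values used in the embodiment described later, and the key names are introduced here for illustration.

```python
# Hyperparameters of the recovery method (values from the embodiment; ranges from the text).
PARAMS = {
    "N_L": 100,       # discretization bin-count lower bound (typical range 50-150)
    "N_H": 500,       # discretization bin-count upper bound (typical range 400-500)
    "S_1": 100,       # attribute-selection discretization step size (typical range 100-200)
    "S_2": 25,        # discretization-optimization step size (typical range 25-50)
    "R": 25,          # number of rank-one tensors in the CP decomposition (typical range 5-30)
    "lr": 1e-5,       # gradient-descent learning rate (generally 0.00001)
    "lambda_1": 0.5,  # gradient-descent objective-function weight (typical range 0-1)
    "lambda_2": 0.5,  # gradient-descent objective-function weight (typical range 0-1)
    "theta": 0.01,    # gradient-descent objective-function convergence threshold
}
```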
The above method is realized according to the following steps:
(1) Initialization. Suppose the log has n data attributes and m records. The set of data attributes is denoted A = {a_1, a_2, …, a_n}, the set of data records is denoted E = {e_1, e_2, …, e_m}, and the log data are denoted V = {v_ij}, where v_ij is the value of the i-th data attribute in the j-th record. The data attribute containing missing data is denoted a_T.
(2) Data attribute selection.
2.1) Manually select all data attributes that may be correlated with the target missing-data attribute as the candidate data attribute set A' = {a_1, a_2, …, a_n'}.
2.2) Construct the discretization rule set Rule for the attribute-selection stage. Each rule is r_i = {r_i1, r_i2, …, r_in'}, where r_ij is the number of discretization bins that the i-th rule assigns to candidate attribute a_j. In other words, traverse all combinations of data attributes and bin counts within the search space determined by the bin-count lower bound N_L, the bin-count upper bound N_H, and the attribute-selection discretization step size S_1.
2.3) For each discretization rule r_i ∈ Rule, discretize the data; the log data discretized with rule r_i are denoted V_i. Then select data attributes one by one.
2.3.1) Using formula (1) and formula (2), compute the AMI between each candidate data attribute a_i ∈ A' and the target data attribute a_T, denoted AMI(a_i; a_T). Then initialize the priorities of the candidate attributes as P = {p_1, p_2, …, p_n'}, where p_i = AMI(a_i; a_T).
2.3.2) Select the data attribute with the highest priority (denoted a_k), add it to the selected attribute set, and remove it from the candidate set A'. Update the priority of each remaining candidate attribute a_l ∈ A' to p_l × (1 − AMI(a_l, a_k)).
2.3.3) Repeat step 2.3.2) until the number of selected attributes equals the target selection count.
2.3.4) Denote the selection result as result_i and add it to the selection result set Result.
2.4) Tally all selection results in the set Result and take the most frequently occurring attribute set as the final attribute selection result A_S = {a_1, a_2, …, a_q}.
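As an illustration of step (2), the following is a minimal sketch of the greedy, AMI-weighted selection of steps 2.3.1)-2.3.4). The function and variable names are introduced here for illustration only; sklearn's adjusted_mutual_info_score is assumed as a stand-in for the AMI of formulas (1) and (2), which are not reproduced in the text, and the input columns are assumed to be already discretized under one rule r_i.

```python
# Sketch of the greedy, AMI-weighted attribute selection of steps 2.3.1)-2.3.4).
# Assumptions: "AMI" is read as adjusted mutual information, and sklearn's
# adjusted_mutual_info_score stands in for formulas (1)-(2), which are not
# reproduced in the text; df holds the log already discretized under one rule r_i.
import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score


def select_attributes(df: pd.DataFrame, target: str, candidates: list[str],
                      n_select: int) -> list[str]:
    """Greedily pick n_select attributes, penalizing redundancy with earlier picks."""
    # Step 2.3.1: initial priority = AMI between each candidate and the target attribute.
    priority = {a: adjusted_mutual_info_score(df[a], df[target]) for a in candidates}
    selected: list[str] = []
    remaining = set(candidates)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=priority.get)           # step 2.3.2: highest priority
        selected.append(best)
        remaining.remove(best)
        for a in remaining:                               # down-weight attributes redundant with `best`
            priority[a] *= 1.0 - adjusted_mutual_info_score(df[a], df[best])
    return selected                                       # one "result_i"; step 2.4 votes across rules
```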
(3) Discretization granularity optimization.
3.1) Construct the discretization rule set Rule' for the granularity-optimization stage. Each rule is r'_i = {r'_i1, r'_i2, …, r'_iq}, where r'_ij is the number of discretization bins that the i-th rule assigns to selected attribute a_j. In other words, traverse all combinations of data attributes and bin counts within the search space determined by the bin-count lower bound N_L, the bin-count upper bound N_H, and the discretization-optimization step size S_2.
3.2) Based on the selected attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized with rule r'_i are denoted V'_i. Compute the weighted coefficient of variation (WCV) of the discretized log data as follows. First, group the records of the log by their values on the attributes in A_S; the groups are denoted G = {g_1, g_2, …, g_p}, where all records in a group have equal values on every attribute a_k ∈ A_S. Using formula (3), compute the coefficient of variation of the target attribute a_T within each group, denoted c_i. Using formula (4), compute WCV_i for each group, and then compute the WCV of the entire log using formula (5).
Here σ(X) denotes the standard deviation of X, μ(X) denotes the mean of X, and size(X) denotes the number of entries in X.
3.3) Choose the discretization result with the smallest WCV as the final discretization result; the discretized log data are denoted V_F.
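A minimal sketch of the WCV computation of step 3.2) follows. Formulas (3)-(5) are not reproduced in the text, so a standard size-weighted coefficient of variation is assumed here; the function name and signature are illustrative.

```python
# Sketch of the weighted coefficient of variation (WCV) of step 3.2).
# Assumption: formulas (3)-(5) are not reproduced in the text, so a size-weighted
# coefficient of variation is used: CV(g) = std/mean of the target attribute within
# group g, and WCV = sum over groups of (|g| / N) * CV(g).
import pandas as pd


def weighted_cv(df: pd.DataFrame, selected: list[str], target: str) -> float:
    total = len(df)
    wcv = 0.0
    for _, group in df.groupby(selected):        # records sharing the same discretized values
        x = group[target].to_numpy(dtype=float)
        mean = x.mean()
        if len(x) < 2 or mean == 0:
            continue                              # skip degenerate groups
        cv = x.std(ddof=0) / mean                 # coefficient of variation within the group
        wcv += (len(x) / total) * cv              # weight by group size
    return wcv
```

Step 3.3) then amounts to picking the discretization rule whose resulting WCV is smallest.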
(4) Tensor construction and tensor completion.
4.1) Construct a tensor from the discretized log data V_F and the target data attribute a_T. If the number of distinct discrete values of each attribute a_i ∈ A_S is S_i, construct a q-dimensional tensor χ.
4.1.1) Sort the discrete values of each attribute a_i ∈ A_S in ascending order and build a mapping M from value v to rank index d.
4.1.2) Fill the tensor with the values of the target data attribute a_T. If the values of data record e_i on the selected attributes A_S = {a_1, a_2, …, a_q} are {v^F_i1, v^F_i2, …, v^F_iq}, and its value on the target attribute a_T is v^F_iT, then the mapping M converts {v^F_i1, v^F_i2, …, v^F_iq} into the corresponding rank indices {d_i1, d_i2, …, d_iq}, and the tensor entry χ_{d_i1, d_i2, …, d_iq} is set to v^F_iT. When u records share the same tensor index, the mean of their target-attribute values is used as the tensor entry.
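The following is a minimal sketch of the sparse-tensor construction of step 4.1), under the assumption that the input DataFrame already contains the discretized selected attributes and the raw target attribute; the names build_sparse_tensor, maps, and mask are introduced here for illustration.

```python
# Sketch of the sparse-tensor construction of step 4.1).
# The DataFrame is assumed to hold the discretized selected attributes and the raw
# target attribute; the names build_sparse_tensor, maps and mask are illustrative.
import numpy as np
import pandas as pd


def build_sparse_tensor(df: pd.DataFrame, selected: list[str], target: str):
    # Step 4.1.1: map each attribute's discrete values, in ascending order, to rank indices.
    maps = {a: {v: d for d, v in enumerate(sorted(df[a].unique()))} for a in selected}
    shape = tuple(len(maps[a]) for a in selected)

    sums = np.zeros(shape)
    counts = np.zeros(shape)
    for _, row in df.iterrows():
        idx = tuple(maps[a][row[a]] for a in selected)     # tensor subscript of this record
        sums[idx] += row[target]
        counts[idx] += 1

    # Step 4.1.2: cell value = mean target value of the records mapped to that cell.
    tensor = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    mask = (counts > 0).astype(float)                      # 1 = observed entry, 0 = missing
    return tensor, mask, maps
```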
4.2) Complete the tensor using the CP decomposition; the decomposition is solved by gradient descent.
4.2.1) Initialize q factor matrices with random numbers in the interval [0, 1]. The factor matrix F_i corresponds to data attribute a_i, where S_i is the number of discrete values of a_i and R is a hyperparameter of the algorithm. Initialize the weight matrix W according to formula (6).
4.2.2) Update the factor matrices according to formula (7), where ε = χ − [[F_1, F_2, …, F_q]], χ is the constructed sparse tensor, [[F_1, F_2, …, F_q]] denotes the tensor reconstructed from the factor matrices via Khatri-Rao products, (χ)_(N) denotes the mode-N matricization of the tensor χ, and λ_1 and λ_2 are hyperparameters of the algorithm.
4.2.3) Compute the objective function value according to formula (8).
4.2.4) Repeat steps 4.2.2) and 4.2.3) until the change in the objective function value between two successive iterations is less than the threshold θ.
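A minimal sketch of the gradient-descent CP completion of step 4.2) is given below for the three-mode case used in the embodiment. Formulas (6)-(8) are not reproduced in the text, so a generic masked least-squares objective with an L2 penalty (lam1) on the factor matrices is assumed, the weight matrix W is replaced by the observation mask, and the role of λ_2 is omitted; all names are illustrative.

```python
# Sketch of the gradient-descent CP completion of step 4.2) for a 3-way tensor, as in
# the embodiment (q = 3). Formulas (6)-(8) are not reproduced in the text, so a generic
# masked least-squares objective with an L2 penalty (lam1) on the factors is assumed,
# the weight matrix W is replaced by the observation mask, and lam2 is omitted.
import numpy as np


def cp_complete(tensor, mask, rank=25, lr=1e-5, lam1=0.5, theta=0.01, max_iter=10_000):
    rng = np.random.default_rng(0)
    A, B, C = [rng.uniform(0.0, 1.0, (n, rank)) for n in tensor.shape]   # step 4.2.1

    def reconstruct():
        return np.einsum('ir,jr,kr->ijk', A, B, C)                        # [[A, B, C]]

    prev_obj = np.inf
    for _ in range(max_iter):
        err = mask * (tensor - reconstruct())                             # residual on observed cells
        obj = 0.5 * np.sum(err ** 2) + 0.5 * lam1 * sum(np.sum(F ** 2) for F in (A, B, C))
        if abs(prev_obj - obj) < theta:                                   # step 4.2.4: convergence
            break
        prev_obj = obj
        gA = -np.einsum('ijk,jr,kr->ir', err, B, C) + lam1 * A            # gradients of the objective
        gB = -np.einsum('ijk,ir,kr->jr', err, A, C) + lam1 * B
        gC = -np.einsum('ijk,ir,jr->kr', err, A, B) + lam1 * C
        A, B, C = A - lr * gA, B - lr * gB, C - lr * gC                   # step 4.2.2: update
    return reconstruct()                                                   # dense (completed) tensor
```

With the settings of the embodiment below, this would be called roughly as cp_complete(tensor, mask, rank=25, lr=1e-5, lam1=0.5, theta=0.01).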
(5) Log data recovery. For each record e_i with missing data, let its values on the attributes A_S = {a_1, a_2, …, a_q} be {v^F_i1, v^F_i2, …, v^F_iq}. The mapping M converts them into the rank indices {d_i1, d_i2, …, d_iq}, and the completed tensor entry χ_{d_i1, d_i2, …, d_iq} is used to recover the missing value.
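Continuing the sketches above, the recovery lookup of step (5) reduces to indexing the completed tensor with the indices produced by the mapping M; the helper below is illustrative and reuses the maps dictionary and completed tensor from the earlier sketches.

```python
# Sketch of the recovery lookup of step (5): map a record's discretized attribute values
# to tensor indices via M and read the completed entry (names reuse the sketches above).
def recover_value(record, selected, maps, completed_tensor):
    idx = tuple(maps[a][record[a]] for a in selected)    # rank indices {d_i1, ..., d_iq}
    return completed_tensor[idx]                          # imputed value of the target attribute
```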
Description of the drawings
Fig. 1 is the deployment diagram of the method of the present invention.
Fig. 2 is the overall flowchart of the invention.
Fig. 3 is the flowchart of log data attribute selection.
Fig. 4 is the flowchart of log data discretization optimization.
Fig. 5 is the flowchart of tensor construction and completion.
Specific embodiment
The present invention is described below with reference to the accompanying drawings and a specific embodiment.
Fig. 1 is the deployment diagram of the method of the present invention. The system consists of multiple computer servers connected by a network. The platform nodes are divided into two classes: storage nodes and compute nodes. The method comprises two classes of core software modules: a log storage module, which is responsible for storing log data and is deployed on the storage nodes, and a log processing module, which is responsible for processing log data and is deployed on the compute nodes.
The specific implementation of the method is described below following the overall flow of Fig. 2. In this embodiment the basic parameters are set as follows: discretization bin-count lower bound N_L = 100, discretization bin-count upper bound N_H = 500, attribute-selection discretization step size S_1 = 100, discretization-optimization step size S_2 = 25, number of rank-one tensors in the CP decomposition R = 25, the gradient-descent learning rate, gradient-descent objective-function weights λ_1 = 0.5 and λ_2 = 0.5, and gradient-descent objective-function convergence threshold θ = 0.01.
The implementation is divided into the following steps:
(1) it initializes.It enables and shares 49 data attributes in data center's log, 10364956 records.Then data attribute Set can be expressed as A, A={ a1,a2,…a49}.Data record set in log can be expressed as E, E={ e1,e2,… e10364956}.Daily record data can be expressed as V,Data attribute with missing data is Real_mem_avg (average memory usage amount), is denoted as aT
(2) data attribute is chosen, and the flow chart of steps is as shown in Figure 3.
2.1) all data attributes that there may be correlativity with target missing data attribute are manually chosen as candidate number According to attribute set A ', A '=plan_cpu, plan_mem, instance_num, duration,
, real_cpu_avg, end_time } and (application cpu resource, application memory source, example quantity, duration, reality Border cpu resource usage amount average value, end time).
2.2) discretization rule set Rule, the Rule={ r in data decimation stage are constructed1,r2,…r15625, wherein each Rule is ri={ ri1,ri2,…ri6, rijIndicate the i-th rule to candidate attribute ajDiscretization case number.I.e. by discrete Change branch mailbox number lower bound 100, the discretization branch mailbox number upper bound 500, attribute is chosen in the search space that discretization step-length 100 determines, time Go through the combination of all data attributes Yu discretization branch mailbox number.
2.3) each discretization rule r is usedi∈ Rule uses discretization rule r to Data DiscretizationiAfter discretization Daily record data can be expressed as Vi,Then data attribute selection is carried out one by one.
2.3.1) according to the method in summary of the invention 2.3.1), all candidate data attribute a are calculatedi∈ A ' and target data Attribute aTAMI, be denoted as AMI (ai;aT).Then the priority P of initialization candidate data attribute, P=0.02,0.11, 0.018,0.09,0.009,0.14}。
2.3.2 the data attribute end_time of a highest priority) is selected to be added to selection data attribute set, and will It is from the middle removal of candidate data attribute set A '.According to the method in summary of the invention 2.3.2) by remaining candidate data attribute al∈ The priority update of A ' is { 0.018,0.09,0.015,0.07,0.0087 }.
2.3.3 step 2.3.2) is repeated) it is equal to Object selection quantity until choosing the quantity of data attribute.
2.3.4 result) will be chosen and be denoted as resultiAnd it is added to and chooses in results set Result.
2.4) all selection results chosen in results set Result are counted, by the highest data of the frequency of occurrences Attribute chooses set as final data attribute and chooses result AS,AS={ end_time, plan_mem, duration }.
(3) Discretization granularity optimization; the flowchart of this step is shown in Fig. 4.
3.1) Construct the discretization rule set Rule' = {r'_1, r'_2, …, r'_4096} for the granularity-optimization stage, where each rule is r'_i = {r'_i1, r'_i2, r'_i3} and r'_ij is the number of discretization bins that the i-th rule assigns to the j-th attribute in {end_time, plan_mem, duration}. In other words, traverse all combinations of data attributes and bin counts within the search space determined by the bin-count lower bound 100, the bin-count upper bound 500, and the discretization-optimization step size 25.
3.2) Based on the selected attribute subset A_S, discretize the data with each rule r'_i ∈ Rule'; the log data discretized with rule r'_i are denoted V'_i. Following step 3.2) of the summary, compute the weighted coefficient of variation of the discretized log data, obtaining WCV = 0.35647.
3.3) Choose the discretization result with the smallest WCV as the final discretization result; the discretized log data are denoted V_F.
(4) Tensor construction and tensor completion; the flowchart of this step is shown in Fig. 5.
4.1) Construct a tensor from the discretized log data V_F and the target data attribute a_T. The numbers of distinct discrete values of the attributes a_i ∈ A_S are 276, 87, and 61, respectively, so a 3-dimensional tensor χ is constructed.
4.1.1) Sort the discrete values of each attribute a_i ∈ A_S in ascending order and build a mapping M from value v to rank index d.
4.1.2) Fill the tensor with the values of the target data attribute a_T. For example, data record e_i has values {35519, 0.016, 34} on the selected attributes A_S = {end_time, plan_mem, duration} and value 0.023814 on the target attribute a_T; the mapping M converts {35519, 0.016, 34} into the rank indices {1, 13, 24}, so the tensor entry χ_{1,13,24} = 0.023814.
4.2) Complete the tensor using the CP decomposition; the decomposition is solved by gradient descent.
4.2.1) Initialize three factor matrices (of sizes 276 × 25, 87 × 25, and 61 × 25, corresponding to the three selected attributes) with random numbers in the interval [0, 1], and initialize the weight matrix W following step 4.2.1) of the summary.
4.2.2) Update the three factor matrices following step 4.2.2) of the summary.
4.2.3) Compute the objective function value following step 4.2.3) of the summary, obtaining E = 7983.348.
4.2.4) Repeat steps 4.2.2) and 4.2.3) until the change in the objective function value between two successive iterations is less than the threshold 0.01.
(5) Log data recovery. Record e_1 with missing data has values {45682, 0.008, 89} on the attributes A_S = {end_time, plan_mem, duration}; the mapping M converts {45682, 0.008, 89} into the rank indices {34, 5, 41}, so the completed tensor entry χ_{34,5,41} is used to recover the missing value.
Based on the data center log missing-data recovery method proposed by the present invention, the inventors carried out performance tests. The test results show that the method can accurately recover missing data in data center logs.
The performance tests use an Alibaba data center log as the test data set and compare the proposed method with five baselines: the missing-data recovery methods used in existing log analysis work, namely mean imputation and linear regression imputation, and advanced data recovery methods widely used in other fields, namely KNN recovery, multilayer perceptron recovery, and support vector machine recovery, in order to demonstrate the accuracy advantage of the proposed method in recovering missing data from data center logs. The performance tests run on one computer whose hardware configuration includes an AMD Ryzen 7 1700X @ 3.80 GHz CPU, 32 GB DDR4 RAM, and a 512 GB NVMe SSD.
The performance tests use two metrics to evaluate data recovery error: mean relative error (MRE) and root-mean-square error (RMSE), whose calculation formulas are given in formulas (9) and (10).
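Formulas (9) and (10) are not reproduced in the text; the sketch below assumes the standard definitions of the two metrics, and the function names are illustrative.

```python
# Standard definitions assumed for the two error metrics; formulas (9)-(10) are not
# reproduced in the text, and the function names are illustrative.
import numpy as np


def mre(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean relative error."""
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```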
The performance tests are divided into four groups according to the missing ratio and missing pattern of the log data: a 30% missing rate following the Alibaba log missing pattern (TM30), an 85% missing rate following the Alibaba log missing pattern (TM85), a 30% completely random missing rate (RM30), and an 85% completely random missing rate (RM85). The test results are shown in Table 1 and Table 2.
Table 1. Performance test results (MRE)
Table 2. Performance test results (RMSE)
From the data in Tables 1 and 2 it can be concluded that, across the four groups of experiments and relative to the five baseline methods, the method of the present invention reduces MRE by 47.7% and RMSE by 56.6% on average, with maximum reductions of 85.9% in MRE and 92% in RMSE. The errors of the two machine-learning recovery methods with the lowest mean error, multilayer perceptron recovery and support vector machine recovery, increase significantly as the missing-data ratio rises, whereas the error of the method of the present invention remains stable; under the 30% and 85% missing rates its maximum MRE improvement is 32.7% and 50%, respectively. The performance test results show that, compared with the five baseline methods, the missing-data recovery error of the method of the present invention is lower and more stable, and higher accuracy is obtained under different missing-data rates.
Finally, it should be noted that the above example is intended only to illustrate, not to limit, the technology described in the invention, and all technical solutions and improvements that do not depart from the spirit and scope of the invention shall be covered by the claims of the invention.

Claims (5)

1. A data center log missing-data recovery method, characterized by comprising the following steps:
(1) Initialization: suppose the log has n data attributes and m records; the set of data attributes is denoted A = {a_1, a_2, …, a_n}, the set of data records is denoted E = {e_1, e_2, …, e_m}, and the log data are denoted V = {v_ij}, where v_ij is the value of the i-th data attribute in the j-th record; the data attribute containing missing data is denoted a_T;
(2) Data attribute selection:
2.1) select all data attributes that are correlated with the target missing-data attribute as the candidate data attribute set A' = {a_1, a_2, …, a_n'};
2.2) construct the discretization rule set Rule for the attribute-selection stage, where each rule is r_i = {r_i1, r_i2, …, r_in'} and r_ij is the number of discretization bins that the i-th rule assigns to candidate attribute a_j; that is, traverse all combinations of data attributes and bin counts within the search space determined by the bin-count lower bound N_L, the bin-count upper bound N_H, and the attribute-selection discretization step size S_1;
2.3) for each discretization rule r_i ∈ Rule, discretize the data, the log data discretized with rule r_i being denoted V_i; then select data attributes one by one;
2.4) tally all selection results in the selection result set Result and take the most frequently occurring attribute set as the final attribute selection result A_S = {a_1, a_2, …, a_q};
(3) Discretization granularity optimization:
3.1) construct the discretization rule set Rule' for the granularity-optimization stage, where each rule is r'_i = {r'_i1, r'_i2, …, r'_iq} and r'_ij is the number of discretization bins that the i-th rule assigns to selected attribute a_j; that is, traverse all combinations of data attributes and bin counts within the search space determined by the bin-count lower bound N_L, the bin-count upper bound N_H, and the discretization-optimization step size S_2;
3.2) based on the selected attribute subset A_S, discretize the data with each rule r'_i ∈ Rule', the log data discretized with rule r'_i being denoted V'_i; compute the weighted coefficient of variation (WCV) of the discretized log data;
3.3) choose the discretization result with the smallest WCV as the final discretization result, the discretized log data being denoted V_F;
(4) Tensor construction and tensor completion:
4.1) construct a tensor from the discretized log data V_F and the target data attribute a_T; if the number of distinct discrete values of each attribute a_i ∈ A_S is S_i, construct a q-dimensional tensor χ;
4.2) complete the tensor using the CP decomposition, the decomposition being solved by gradient descent;
(5) Log data recovery:
for each record e_i with missing data, let its values on the attributes A_S = {a_1, a_2, …, a_q} be {v^F_i1, v^F_i2, …, v^F_iq}; the mapping M converts them into the rank indices {d_i1, d_i2, …, d_iq}, and the completed tensor entry χ_{d_i1, d_i2, …, d_iq} is used to recover the missing value.
2. The data center log missing-data recovery method according to claim 1, characterized in that step 2.3) comprises:
2.3.1) using formula (1) and formula (2), compute the AMI between each candidate data attribute a_i ∈ A' and the target data attribute a_T, denoted AMI(a_i; a_T); then initialize the priorities of the candidate attributes as P = {p_1, p_2, …, p_n'}, where p_i = AMI(a_i; a_T);
2.3.2) select the data attribute with the highest priority (denoted a_k), add it to the selected attribute set, and remove it from the candidate set A'; update the priority of each remaining candidate attribute a_l ∈ A' to p_l × (1 − AMI(a_l, a_k));
2.3.3) repeat step 2.3.2) until the number of selected attributes equals the target selection count;
2.3.4) denote the selection result as result_i and add it to the selection result set Result.
3. The data center log missing-data recovery method according to claim 1, characterized in that step 4.1) comprises:
4.1.1) sort the discrete values of each attribute a_i ∈ A_S in ascending order and build a mapping M from value v to rank index d;
4.1.2) fill the tensor with the values of the target data attribute a_T: if the values of data record e_i on the selected attributes A_S = {a_1, a_2, …, a_q} are {v^F_i1, v^F_i2, …, v^F_iq}, and its value on the target attribute a_T is v^F_iT, the mapping M converts {v^F_i1, v^F_i2, …, v^F_iq} into the corresponding rank indices {d_i1, d_i2, …, d_iq}, and the tensor entry χ_{d_i1, d_i2, …, d_iq} is set to v^F_iT; when u records share the same tensor index, the mean of their target-attribute values is used as the tensor entry.
4. The data center log missing-data recovery method according to claim 1, characterized in that step 4.2) comprises:
4.2.1) initialize q factor matrices with random numbers in the interval [0, 1], the factor matrix F_i corresponding to data attribute a_i, where S_i is the number of discrete values of a_i and R is a hyperparameter of the algorithm; initialize the weight matrix W according to formula (6);
4.2.2) update the factor matrices according to formula (7), where ε = χ − [[F_1, F_2, …, F_q]], χ is the constructed sparse tensor, [[F_1, F_2, …, F_q]] denotes the tensor reconstructed from the factor matrices via Khatri-Rao products, (χ)_(N) denotes the mode-N matricization of the tensor χ, and λ_1 and λ_2 are hyperparameters of the algorithm;
4.2.3) compute the objective function value according to formula (8);
4.2.4) repeat steps 4.2.2) and 4.2.3) until the change in the objective function value between two successive iterations is less than the threshold θ.
5. The data center log missing-data recovery method according to claim 1, characterized in that the WCV in step 3.2) is calculated as follows: first, group the records of the log by their values on the attributes in the selected subset A_S, the groups being denoted G = {g_1, g_2, …, g_p}, where all records in a group have equal values v'_jk on every attribute a_k ∈ A_S; using formula (3), compute the coefficient of variation of the target attribute a_T within each group, denoted c_i; using formula (4), compute WCV_i for each group; then compute the WCV of the entire log using formula (5);
here σ(X) denotes the standard deviation of X, μ(X) denotes the mean of X, and size(X) denotes the number of entries in X.
CN201910056129.7A 2019-01-21 2019-01-21 Data center log missing data recovery method Active CN109857593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910056129.7A CN109857593B (en) 2019-01-21 2019-01-21 Data center log missing data recovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910056129.7A CN109857593B (en) 2019-01-21 2019-01-21 Data center log missing data recovery method

Publications (2)

Publication Number Publication Date
CN109857593A true CN109857593A (en) 2019-06-07
CN109857593B CN109857593B (en) 2020-08-28

Family

ID=66895519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910056129.7A Active CN109857593B (en) 2019-01-21 2019-01-21 Data center log missing data recovery method

Country Status (1)

Country Link
CN (1) CN109857593B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156720A (en) * 2011-03-28 2011-08-17 中国人民解放军国防科学技术大学 Method, device and system for restoring data
CN102289524A (en) * 2011-09-26 2011-12-21 深圳市万兴软件有限公司 Data recovery method and system
US20130117237A1 (en) * 2011-11-07 2013-05-09 Sap Ag Distributed Database Log Recovery
CN103838642A (en) * 2012-11-26 2014-06-04 腾讯科技(深圳)有限公司 Data recovery method, device and system
CN103631676A (en) * 2013-11-06 2014-03-12 华为技术有限公司 Snapshot data generating method and device for read-only snapshot
CN103942252A (en) * 2014-03-17 2014-07-23 华为技术有限公司 Method and system for recovering data
CN107220142A (en) * 2016-03-22 2017-09-29 阿里巴巴集团控股有限公司 Perform the method and device of data recovery operation
CN105955845A (en) * 2016-04-26 2016-09-21 浪潮电子信息产业股份有限公司 Data recovery method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183644A (en) * 2020-09-29 2021-01-05 中国平安人寿保险股份有限公司 Index stability monitoring method and device, computer equipment and medium
CN112183644B (en) * 2020-09-29 2024-05-03 中国平安人寿保险股份有限公司 Index stability monitoring method and device, computer equipment and medium

Also Published As

Publication number Publication date
CN109857593B (en) 2020-08-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant