CN103995828B

CN103995828B - Cloud storage log data analysis method

Info

Publication number: CN103995828B
Application number: CN201410145688.2A
Authority: CN
Inventors: 樊凯; 李晖; 郝延静
Original assignee: XIDIAN-NINGBO INFORMATION TECHNOLOGY INSTITUTE
Current assignee: XIDIAN-NINGBO INFORMATION TECHNOLOGY INSTITUTE
Priority date: 2014-04-11
Filing date: 2014-04-11
Publication date: 2017-06-13
Anticipated expiration: 2034-04-11
Also published as: CN103995828A

Abstract

The present invention relates to a cloud storage log data analysis method comprising: step 1, pre-analyzing the cloud storage log data; step 2, performing calculations on the pre-analyzed cloud storage log data to obtain the frequent itemsets needed to generate association rules; step 3, generating the association rules of the cloud storage log from the frequent itemsets obtained in step 2; and step 4, outputting the association rules obtained in step 3. By simplifying the frequent itemset matrix, the invention reduces the size of the generated candidate itemset matrix and thereby effectively reduces the number of candidate itemsets generated in subsequent iterations. In addition, in a further improved technical solution, the invention computes the candidate itemset matrix with a custom matrix operation; the whole calculation is comparatively simple, which reduces the amount of computation in the data analysis and shortens the mining time.

Description

Cloud storage log data analysis method

Technical field

The invention belongs to the technical field of data analysis, and in particular relates to a cloud storage log data analysis method that can be used for the data analysis of cloud storage system logs.

Background art

A cloud storage system produces a large number of log files during operation. These log files record raw information such as the system administrators' operations on the system, users' access to the system, the requests received and analyzed by the system servers, and errors that occur while the system is running. Analyzing the administrator operation logs helps standardize administrator operations; analyzing the user access logs reveals users' behavioral habits, which makes it easier to query and analyze each user's operations and to improve user satisfaction; analyzing the cloud storage server logs makes it possible to monitor system state, troubleshoot network faults, and implement intrusion detection, and also to find design defects, performance bottlenecks, and modules of the cloud storage system that need configuration tuning.

Because the volume of log data produced by a cloud storage system is very large, the primary problem in cloud storage log data analysis is how to extract valuable information from this data quickly and efficiently and to discover the correlations within it. There is currently little research on the data analysis of cloud storage system logs. Association rules are an important part of the data analysis process: they can reveal the internal relations and valuable connections within massive data. By analyzing cloud storage logs and generating association rules, the cloud storage log files can be put to effective use.

At present, as shown in Figure 1, the existing data analysis flow for cloud storage logs mainly comprises preprocessing the cloud storage log files, generating rules, and analyzing and outputting the generated rules. The rule generation step mainly involves two aspects: finding frequent itemsets and generating association rules. The main methods for finding frequent itemsets are the Apriori algorithm and the matrix-based Apriori algorithm. Apriori is a classical algorithm for finding frequent itemsets; it uses an iterative level-wise search in which the result of each level is used to compute the next. Apriori has been widely studied and improved, and one such improvement is the matrix-based Apriori algorithm, which applies matrices to the algorithm by representing the analyzed database as a matrix. This reduces the number of database scans to two, which shortens the data analysis time and improves the performance of the algorithm.
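For orientation, the classical level-wise search described above can be sketched as follows. This is a minimal illustration of textbook Apriori, not the method of the invention; the function name `apriori`, the sample log events, and the use of an absolute support count are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets (classical Apriori)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count: one pass over the database per level.
        frequent = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                frequent[c] = n
        result.update(frequent)
        k += 1
    return result

# Hypothetical log records, each reduced to the set of events it contains.
logs = [{"login", "upload"}, {"login", "upload", "share"},
        {"login", "download"}, {"upload", "share"}]
found = apriori(logs, min_count=2)
```

Note that the counting step rescans every transaction at each level, which is the cost the matrix-based variants aim to avoid.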

However, the existing matrix-based Apriori algorithm has the following problems. First, its computational cost is high: when the analyzed database contains many data items, the running time grows exponentially, so analyzing massive data takes a long time. Second, the algorithm produces too many candidate itemsets during iteration; storing them occupies memory and increases the amount of computation in subsequent iterations. These shortcomings hinder the rapid extraction of association rules from cloud storage logs: the whole analysis takes a long time and is inefficient, cannot reflect the running state of the cloud storage system in a timely manner, and hampers system optimization and performance improvement.

Summary of the invention

The technical problem to be solved by the invention is to provide, in view of the above prior art, a cloud storage log data analysis method that reduces both the amount of computation and the number of candidate itemsets generated during iteration, and thereby greatly improves the effectiveness of data analysis.

The technical solution adopted by the invention to solve the above technical problem is a cloud storage log data analysis method comprising the following steps:

Step 1: pre-analyze the cloud storage log data, that is, delete duplicate data from the log data and fill in missing data;

Step 2: perform calculations on the pre-analyzed cloud storage log data to obtain the frequent itemsets needed to generate association rules;

Step 3: generate the association rules of the cloud storage log from the frequent itemsets obtained in step 2;

Step 4: output the association rules obtained in step 3;

characterized in that: a constant k denotes the iteration count, with initial value 2, and in step 2 the frequent itemsets needed to generate association rules are obtained through the following steps:

Step 2a: generate the candidate 1-itemset matrix C₁ from the pre-analyzed cloud storage log data:

The candidate 1-itemset matrix is $C_1 = (c_{ij})_{M \times N}$, a matrix with M rows and N columns, where c_ij is the element in row i and column j, and i and j are the position indices of C₁, with 1≤i≤M, 1≤j≤N, and

$$c_{ij} = \begin{cases} 1, & I_j \in T_i \\ 0, & I_j \notin T_i \end{cases}$$

I_j is the j-th event recorded in the cloud storage log database, I_j ∈ {I₁, I₂, …, I_N}, where 1, 2, …, N are the labels of the events contained in the cloud storage log database and N is the total number of events; T_i is the i-th log in the cloud storage log database, T_i ∈ {T₁, T₂, …, T_M}, where 1, 2, …, M are the labels of the logs recorded in the database and M is the total number of logs. c_ij is a Boolean value that can only be 0 or 1: if the i-th log T_i recorded in the cloud storage log file contains the j-th event I_j, then c_ij is 1; otherwise c_ij is 0;

Step 2b: using the given minimum support S_c and the candidate 1-itemset matrix C₁, compute the frequent 1-itemset matrix L₁ and simplify it to obtain the simplified frequent 1-itemset matrix L₁', where the minimum support S_c equals a constant x multiplied by N, and x ranges from 0 to 1. This is done through the following steps:

Step 2b-1: compute the column sum of each column of the candidate 1-itemset matrix C₁ and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, obtaining the first intermediate matrix;

Step 2b-2: compute the row sum of each row of the first intermediate matrix; if a row sum is less than 2, delete that row, otherwise keep it, obtaining the second intermediate matrix;

Step 2b-3: compute the column sum of each column of the second intermediate matrix and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, generating the new frequent 1-itemset matrix L₁';

Step 2c: with k the iteration count (initial value 2), compute k-1 from k, determine the frequent (k-1)-itemset matrix L_{k-1}' needed to compute the candidate k-itemset matrix C_k, and obtain C_k from L_{k-1}';

Step 2d: using the minimum support S_c and the candidate k-itemset matrix C_k, compute the frequent k-itemset matrix L_k and simplify it to obtain the simplified frequent k-itemset matrix L_k', through the following steps:

Step 2d-1: compute the column sum of each column of the candidate 2-itemset matrix C₂ and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, obtaining the third intermediate matrix;

Step 2d-2: compute k+1 from the iteration count k, and compute the row sum of each row of the third intermediate matrix; if a row sum is less than k+1, delete that row, otherwise keep it, obtaining the fourth intermediate matrix;

Step 2d-3: compute the column sum of each column of the fourth intermediate matrix and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, generating the simplified frequent k-itemset matrix L_k';

Step 2e: determine whether the simplified frequent k-itemset matrix L_k' is an empty matrix; if it is, end the calculation, otherwise increment k by 1 and repeat steps 2c to 2d.
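The matrix construction of step 2a and the three-pass simplification of steps 2b-1 to 2b-3 and 2d-1 to 2d-3 can be sketched as follows. This is an illustrative sketch only: the function names, the plain list-of-lists matrix representation, and the sample events are assumptions, and the minimum support is passed in directly as a count rather than derived from the constant x. An entry of 1 is taken to mean that the log contains the event, so that column sums are support counts.

```python
def build_candidate_1_matrix(logs, events):
    """Step 2a: M x N Boolean matrix; entry is 1 when log i contains event j."""
    return [[1 if e in log else 0 for e in events] for log in logs]

def drop_columns(matrix, labels, min_support):
    """Keep only the columns whose column sum reaches min_support."""
    if not matrix:
        return [], []
    sums = [sum(col) for col in zip(*matrix)]
    keep = [j for j, s in enumerate(sums) if s >= min_support]
    return [[row[j] for j in keep] for row in matrix], [labels[j] for j in keep]

def drop_rows(matrix, min_row_sum):
    """Keep only the rows whose row sum reaches min_row_sum."""
    return [row for row in matrix if sum(row) >= min_row_sum]

def simplify(matrix, labels, min_support, k):
    """Column pass, row pass (threshold k + 1), then a second column pass."""
    matrix, labels = drop_columns(matrix, labels, min_support)
    matrix = drop_rows(matrix, k + 1)
    matrix, labels = drop_columns(matrix, labels, min_support)
    return matrix, labels

# Hypothetical events and logs, for illustration only.
events = ["login", "upload", "share", "download"]
logs = [{"login", "upload"}, {"login", "upload", "share"},
        {"login", "download"}, {"upload", "share"}]
c1 = build_candidate_1_matrix(logs, events)
l1, l1_events = simplify(c1, events, min_support=2, k=1)
```

In this example the "download" column is removed in the first column pass and the log containing only "login" is removed in the row pass, so the matrix passed to the next iteration is smaller than C₁ in both dimensions.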

As an improvement, the candidate k-itemset matrix C_k in step 2c is computed as follows:

Let u and v denote position indices of the frequent (k-1)-itemset matrix L_{k-1}'; the candidate k-itemset matrix C_k is then obtained from L_{k-1}':

where "∧" denotes the AND operator.
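One common way to build a candidate matrix with the AND operator is to combine pairs of columns of the previous frequent-itemset matrix element by element. The sketch below assumes that join rule; it is not taken from the formula of the invention, and the function name `generate_candidates` and the use of `frozenset` labels for columns are hypothetical.

```python
from itertools import combinations

def generate_candidates(prev_matrix, prev_itemsets, k):
    """Build a candidate k-itemset matrix by AND-ing pairs of columns (assumed join rule)."""
    columns = list(zip(*prev_matrix)) if prev_matrix else []
    cand = {}
    for u, v in combinations(range(len(prev_itemsets)), 2):
        union = prev_itemsets[u] | prev_itemsets[v]
        # Only unions with exactly k items are valid k-itemset candidates.
        if len(union) == k and union not in cand:
            # A log contains the union exactly when it contains both parts.
            cand[union] = [a & b for a, b in zip(columns[u], columns[v])]
    itemsets = list(cand)
    matrix = [list(row) for row in zip(*cand.values())] if cand else []
    return matrix, itemsets

# Simplified frequent 1-itemset matrix from a hypothetical example:
# columns are login, upload, share; rows are the remaining logs.
l1 = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
l1_itemsets = [frozenset({"login"}), frozenset({"upload"}), frozenset({"share"})]
c2, c2_itemsets = generate_candidates(l1, l1_itemsets, k=2)
```

Because each candidate column is derived from columns already in memory, the support count of a candidate is simply its column sum and no further scan of the log database is needed.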

As a further improvement, step 3 obtains the association rules as follows:

Step 3a: specify a minimum confidence S_z, where S_z ranges from 0 to 1;

Step 3b: for the simplified frequent k-itemset matrix L_k', let each column of L_k' form a single-column matrix l, l ∈ L_k', and generate all non-empty subsets r of the single-column matrix l;

Step 3c: for each non-empty subset r, if $S_l / S_r \geq S_z$, where S_r is the count of the non-empty subset r and S_l is the column sum of the single-column matrix l, then the association rule $r \Rightarrow (l - r)$ is obtained.

Compared with the prior art, the advantages of the invention are as follows. By simplifying the frequent itemset matrix, the invention reduces the size of the generated candidate itemset matrix and thereby effectively reduces the number of candidate itemsets generated in subsequent iterations. In addition, in the further improved technical solution, the invention computes the candidate itemset matrix with a custom matrix operation; the whole calculation is comparatively simple, which reduces the amount of computation in the data analysis and shortens the mining time.

Brief description of the drawings

Fig. 1 is a flow chart of the prior-art data analysis method for cloud storage logs;

Fig. 2 is a flow chart of how step 2 obtains the frequent itemsets in an embodiment of the invention;

Fig. 3 compares the performance of the frequent itemset generation method of the embodiment with that of the existing method.

Detailed description of the embodiments

The invention is described in further detail below with reference to the embodiment shown in the drawings.

The cloud storage log data analysis method shown in Figure 2 comprises the following steps:

Step 2: perform calculations on the pre-analyzed cloud storage log data to obtain the frequent itemsets needed to generate association rules. In this step, a constant k denotes the iteration count, with initial value 2, and the frequent itemsets are obtained through the following steps:

Step 2b: using the given minimum support S_c and the candidate 1-itemset matrix C₁, compute the frequent 1-itemset matrix L₁ and simplify it to obtain the simplified frequent 1-itemset matrix L₁', where the minimum support S_c equals a constant x multiplied by N, and x ranges from 0 to 1. This is done through the following steps:

Step 2b-1: compute the column sum of each column of the candidate 1-itemset matrix C₁ and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, obtaining the first intermediate matrix;

Step 2c: with k the iteration count (initial value 2), compute k-1 from k, determine the frequent (k-1)-itemset matrix L_{k-1}' needed to compute the candidate k-itemset matrix C_k, and obtain C_k from L_{k-1}', where C_k is computed as follows:

where "∧" denotes the AND operator;

Step 2d: using the minimum support S_c and the candidate k-itemset matrix C_k, compute the frequent k-itemset matrix L_k and simplify it to obtain the simplified frequent k-itemset matrix L_k';

Step 2e: determine whether the simplified frequent k-itemset matrix L_k' is an empty matrix; if it is, end the calculation, otherwise increment k by 1 and repeat steps 2c to 2d;

Step 3: generate the association rules of the cloud storage log from the simplified frequent itemsets obtained in step 2, specifically:

Step 3a: specify a minimum confidence S_z, where S_z ranges from 0 to 1;

Step 4: output the association rules obtained in step 3.
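The pre-analysis of step 1 is described only as deleting duplicate data and filling in missing data. A minimal sketch of such a pass is given below, assuming each log record is a dictionary of string fields and that a missing or empty field is filled with a fixed placeholder; the field names, the placeholder value, and the function name `pre_analyze` are all hypothetical.

```python
def pre_analyze(records, fields, fill_value="unknown"):
    """Fill missing or empty fields, then drop records that duplicate an earlier one."""
    seen = set()
    cleaned = []
    for rec in records:
        # Fill in any field that is absent, None, or an empty string.
        filled = {f: (rec.get(f) if rec.get(f) not in (None, "") else fill_value)
                  for f in fields}
        key = tuple(filled[f] for f in fields)
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(filled)
    return cleaned

# Hypothetical raw log records.
raw = [
    {"user": "alice", "op": "upload", "status": "ok"},
    {"user": "alice", "op": "upload", "status": "ok"},
    {"user": "bob", "op": "download"},
]
cleaned = pre_analyze(raw, ["user", "op", "status"])
```

Other fill strategies, such as using the most frequent value of a field, would fit the same structure; the text does not say which one is intended.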

The effect of the invention is further illustrated by the following simulation:

1. Simulation conditions

The simulation was carried out on a 2.5 GHz Intel(R) Core(TM) i5 CPU under Windows 7 with MATLAB R2009b.

2. Simulation content and analysis of results

The frequent itemset generation method of the invention is compared with that of the existing method; the results are shown in Figure 3.

As Figure 3 shows, the running time of both methods decreases as the minimum support increases, but the running time of the invention is significantly lower than that of the existing method; that is, the invention generates frequent itemsets markedly more efficiently. This is because the invention needs less computation, and therefore less time, to generate candidate itemsets, and because simplifying the frequent itemsets significantly reduces the number of candidate itemsets generated.

The simulation results show that the invention reduces the computation needed to generate candidate itemsets by using a custom matrix operation, reduces the number of candidate itemsets generated by simplifying the frequent itemsets, and thereby improves the efficiency of cloud storage log data analysis.

Claims

1. A cloud storage log data analysis method, comprising the following steps:

Step 1: pre-analyze the cloud storage log data, that is, delete duplicate data from the log data and fill in missing data;

Step 2: perform calculations on the pre-analyzed cloud storage log data to obtain the frequent itemsets needed to generate association rules;

Step 3: generate the association rules of the cloud storage log from the frequent itemsets obtained in step 2;

Step 4: output the association rules obtained in step 3;

characterized in that: a constant k denotes the iteration count, with initial value 2, and in step 2 the frequent itemsets needed to generate association rules are obtained through the following steps:

Step 2a: generate the candidate 1-itemset matrix C₁ from the pre-analyzed cloud storage log data:

The candidate 1-itemset matrix is $C_1 = (c_{ij})_{M \times N}$, a matrix with M rows and N columns, where c_ij is the element in row i and column j, and i and j are the position indices of C₁, with 1≤i≤M, 1≤j≤N, and

$$c_{ij} = \begin{cases} 1, & I_j \in T_i \\ 0, & I_j \notin T_i \end{cases}$$

I_j is the j-th event recorded in the cloud storage log database, I_j ∈ {I₁, I₂, …, I_N}, where 1, 2, …, N are the labels of the events contained in the cloud storage log database and N is the total number of events; T_i is the i-th log in the cloud storage log database, T_i ∈ {T₁, T₂, …, T_M}, where 1, 2, …, M are the labels of the logs recorded in the database and M is the total number of logs. c_ij is a Boolean value that can only be 0 or 1: if the i-th log T_i recorded in the cloud storage log file contains the j-th event I_j, then c_ij is 1; otherwise c_ij is 0;

Step 2b: using the given minimum support S_c and the candidate 1-itemset matrix C₁, compute the frequent 1-itemset matrix L₁ and simplify it to obtain the simplified frequent 1-itemset matrix L₁', where the minimum support S_c equals a constant x multiplied by N, and x ranges from 0 to 1. This is done through the following steps:

Step 2b-1: compute the column sum of each column of the candidate 1-itemset matrix C₁ and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, obtaining the first intermediate matrix;

Step 2b-2: compute the row sum of each row of the first intermediate matrix; if a row sum is less than 2, delete that row, otherwise keep it, obtaining the second intermediate matrix;

Step 2b-3: compute the column sum of each column of the second intermediate matrix and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, generating the new frequent 1-itemset matrix L₁';

Step 2c: compute k-1 from k, determine the frequent (k-1)-itemset matrix L_{k-1}' needed to compute the candidate k-itemset matrix C_k, and obtain C_k from L_{k-1}';

Step 2d: using the minimum support S_c and the candidate k-itemset matrix C_k, compute the frequent k-itemset matrix L_k and simplify it to obtain the simplified frequent k-itemset matrix L_k', through the following steps:

Step 2d-1: compute the column sum of each column of the candidate 2-itemset matrix C₂ and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, obtaining the third intermediate matrix;

Step 2d-2: compute k+1 from the iteration count k, and compute the row sum of each row of the third intermediate matrix; if a row sum is less than k+1, delete that row, otherwise keep it, obtaining the fourth intermediate matrix;

Step 2d-3: compute the column sum of each column of the fourth intermediate matrix and compare each column sum in turn with the minimum support S_c; if a column sum is less than S_c, delete that column, otherwise keep it, generating the simplified frequent k-itemset matrix L_k';

Step 2e: determine whether the simplified frequent k-itemset matrix L_k' is an empty matrix; if it is, end the calculation, otherwise increment k by 1 and repeat steps 2c to 2d.

2. The cloud storage log data analysis method according to claim 1, characterized in that the candidate k-itemset matrix C_k in step 2c is computed as follows:

Let u and v denote position indices of the frequent (k-1)-itemset matrix L_{k-1}'; the candidate k-itemset matrix C_k is then obtained from L_{k-1}':

where "∧" denotes the AND operator.

3. The cloud storage log data analysis method according to claim 1, characterized in that step 3 obtains the association rules through the following steps:

Step 3a: specify a minimum confidence S_z, where S_z ranges from 0 to 1;

Step 3b: for the simplified frequent k-itemset matrix L_k', let each column of L_k' form a single-column matrix l, l ∈ L_k', and generate all non-empty subsets r of the single-column matrix l;

Step 3c: for each non-empty subset r, if $S_l / S_r \geq S_z$, where S_r is the count of the non-empty subset r and S_l is the column sum of the single-column matrix l, then the association rule $r \Rightarrow (l - r)$ is obtained.