CN102737126B

CN102737126B - Classification rule mining method under cloud computing environment

Info

Publication number: CN102737126B
Application number: CN201210203816.5A
Authority: CN
Inventors: 杨善林; 丁静; 罗贺; 丁帅; 徐达宇; 范雯娟
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2012-06-19
Filing date: 2012-06-19
Publication date: 2014-03-12
Anticipated expiration: 2032-06-19
Also published as: CN102737126A

Abstract

The invention discloses a classification rule mining method under a cloud computing environment. The method adopts a master-slave organizational structure consisting of a control center and a plurality of slave servers, and comprises the following steps: the control center divides a data set to be classified into a training sample and a test sample, divides the training sample evenly into data blocks of the same size, and distributes each data block to one processing unit; the processing units train on the data blocks using a genetic algorithm to obtain atomic rules for classification; and finally, classifiers reduce the atomic rules, and the reduction results that meet the classification accuracy requirement are selected as the final result of classification rule mining. The method is suitable for classifying data held in distributed storage under a cloud computing environment, allows data classification tasks to be processed in a distributed, parallel manner, and contributes positively to the problem of classifying massive data under cloud computing environments.

Description

Classification rule mining method under a cloud computing environment

Technical field

The invention belongs to the technical field of data analysis under cloud computing environments, and specifically relates to a classification rule mining method under a cloud computing environment.

Background technology

Classification is an important research area in data analysis and management under cloud computing environments. On the one hand, classification is an important type of data mining task. Data under cloud computing environments are massive, distributed and dynamic, and these characteristics pose challenges for data management. Analyzing such data through classification rule mining helps improve the efficiency of analyzing and managing massive data under cloud computing environments. On the other hand, the cloud environment offers ultra-large-scale storage and computing capacity, its resources and structure scale dynamically, and it provides services on demand through virtualization technology and a huge resource pool, which makes efficient data classification possible. Classification rule mining is a vital task in data analysis and management: it helps in better understanding the massive data under cloud computing environments and supports operational decisions there. At the same time, the high-performance computing and storage capacity of cloud computing ensures that classification rule mining can run efficiently. Classification rule mining is therefore an important technology for data analysis and processing under cloud computing environments, and research on its theory and application is of great significance.

In research on classification rule mining, scholars in China and abroad have proposed many solutions, including statistical methods represented by the Bayesian method, learning methods represented by decision trees and rule induction, and neural network methods. When applied to classification rule mining on small-scale static data sets, these methods achieve fairly high classification accuracy. However, they still have bottlenecks: they need to scan and sort the data set repeatedly, which makes the algorithms inefficient; they are sensitive to noise and to particular data, and are prone to over-fitting; and they do not scale well to large training sets. Under cloud computing environments in particular, the large scale and dynamic nature of distributed massive data sets cause the search space and dimensionality of the classification process to surge, which increases the computational complexity of classification and reduces the efficiency of traditional classification methods. As a result, existing classification rule mining methods cannot be applied directly in cloud computing environments.

Summary of the invention

To address the above problems, the present invention overcomes the limitations of the prior art and provides a classification rule mining method under a cloud computing environment. The invention is suitable for classifying data held in distributed storage under a cloud computing environment, allows data classification tasks to be processed in a distributed, parallel manner, and contributes positively to the problem of classifying massive data under cloud computing environments. It uses the scale effect of the large number of computing nodes in a cloud computing environment to effectively improve the efficiency of classification rule mining on massive data; and, through a master-slave organizational structure and a rule training process based on a genetic algorithm, it provides a distributed implementation of classification rule mining under cloud computing environments.

To solve the technical problem, the present invention adopts the following technical solution:

The classification rule mining method under a cloud computing environment of the present invention is characterized as follows:

The cloud computing environment consists of a plurality of distributed servers. When classification rule mining is carried out in the cloud computing environment, a master-slave organizational structure is adopted, in which one server is set as the control center and the other servers are slave servers. The control center arranges the execution, scheduling and management of the whole mining task and coordinates the operation of each slave server; each slave server is a concrete execution unit of the task. The classification rule mining method proceeds as follows:

a. The control center divides the data set to be classified into a training sample and a test sample, and divides the training sample evenly into data blocks of the same size. For each data block, it designates one slave server that will execute the classification mining task as a processing unit, and assigns each data block to its corresponding processing unit;

b. Each processing unit uses a genetic algorithm to train atomic rules for classification on the data block assigned to it, and stores the trained atomic rules in a buffer;

c. The control center partitions the atomic rules in the buffer and selects idle slave servers as classifiers. The classifiers perform redundancy reduction on the atomic rules and test the classification accuracy of the reduction results; the reduction results that meet the classification accuracy requirement are selected as the final result of classification rule mining.

The classification rule mining method under a cloud computing environment of the present invention is further characterized as follows:

Step a is implemented as follows:

After receiving a request for a classification mining task, the control center divides the data set to be classified into a training sample and a test sample. According to the requirements the user places on the classification mining task and the characteristics of the training sample, it searches for suitable slave servers to act as processing units, and divides the training sample evenly into data blocks of the same size. If the number of processing units that satisfy the conditions is N and the size of the training sample is M, the size of each data block is M/N;

A suitable slave server must satisfy the following conditions: its storage space is not less than M/N, and its response time is not greater than the latest time required by the user;

The control center accesses the training sample, copies each data block obtained from the division to its corresponding processing unit, and sends the processing unit the operation instruction to train atomic rules using the genetic algorithm.
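The following Python sketch illustrates one way step a could be realized. It is not taken from the patent: the `Server` type, its fields, and both function names are hypothetical, storage is assumed to be measured in records, and the training sample size is assumed to be divisible by N. Because the storage condition depends on N while N depends on which servers qualify, the sketch resolves this by dropping the smallest-storage candidate until every remaining server can hold a block of size M/N; the source does not say how this is handled.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    free_storage: int     # assumed unit: number of records
    response_time: float  # assumed unit: seconds


def select_processing_units(servers, m, deadline):
    """Return servers with storage >= M/N and response time <= deadline."""
    candidates = [s for s in servers if s.response_time <= deadline]
    candidates.sort(key=lambda s: s.free_storage, reverse=True)
    # Removing a server raises M/N, so re-check the smallest one each time.
    while candidates and candidates[-1].free_storage < m / len(candidates):
        candidates.pop()
    return candidates


def split_evenly(records, n):
    """Split the training sample into n data blocks of equal size M/N."""
    if n <= 0 or len(records) % n != 0:
        raise ValueError("this sketch assumes n > 0 and that n divides M")
    size = len(records) // n
    return [records[i * size:(i + 1) * size] for i in range(n)]


servers = [
    Server("s1", 50, 1.0),
    Server("s2", 30, 2.0),
    Server("s3", 100, 9.0),  # too slow for the deadline used below
    Server("s4", 10, 1.0),   # too small once M/N is known
]
training_sample = list(range(60))
units = select_processing_units(servers, len(training_sample), deadline=5.0)
blocks = split_evenly(training_sample, len(units))
assignment = {unit.name: block for unit, block in zip(units, blocks)}
```

Here `assignment` maps each selected processing unit to the data block it would receive.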

The training process of the atomic rules in step b is:

The processing unit genetically encodes each data record in the data block and generates atomic rules through loop iteration of genetic operations. The atomic rules are stored in the buffer as <key, value> key-value pairs, where the key is a class label and the value is an atomic rule under that class label;

The control center periodically reads the <key, value> key-value pairs in the buffer, generates a list of <key, value list> key-value pairs and stores it in the buffer, where the key is a class label and the value list is the list of atomic rules under that class label;
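The grouping of <key, value> pairs into <key, value list> pairs can be sketched in a few lines of Python. This is only an illustration: the function name `group_by_label` and the representation of an atomic rule as a tuple of condition values are assumptions, not part of the source.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Turn (class_label, rule) pairs into {class_label: [rule, ...]}."""
    grouped = defaultdict(list)
    for label, rule in pairs:
        grouped[label].append(rule)
    return dict(grouped)


# Pairs as they might be written to the buffer by several processing units.
buffer_pairs = [
    ("yes", ("sunny", "calm")),
    ("no", ("rainy", "windy")),
    ("yes", ("sunny", "calm")),   # duplicate, removed later in step c
    ("yes", ("cloudy", "calm")),
]
rule_lists = group_by_label(buffer_pairs)
```

Duplicates are deliberately kept at this stage, since the source assigns redundancy reduction to the classifiers in step c.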

After a processing unit has finished operating on all data records in its data block, it sends the control center a message that the processing unit's operation is complete.

Step c is carried out as follows:

The control center searches for idle slave servers to act as classifiers. The number of classifiers equals the number of key values in the list of <key, value list> key-value pairs, and each classifier corresponds to one key value. The control center sends the atomic rule list from each <key, value list> key-value pair, together with the test sample records that have the same class label, to the corresponding classifier, and sends the classifier the operation instructions for redundancy reduction and classification accuracy testing;

The classifier records multiple identical atomic rules under the same class label only once, deleting the redundant atomic rules, to obtain the reduced atomic rules;

The classifier uses the reduced atomic rules to classify the test sample and checks whether the classification result is consistent with the class label of the test sample. Suppose that, among the test sample records classified by atomic rule a, Y records have a class label consistent with the classification result and N records have a class label inconsistent with it; the classification accuracy of atomic rule a is then Y/(Y+N). Suppose the classification accuracy requirement in the user's classification mining task is α; all atomic rules whose classification accuracy is not less than α are sent to the control center as the final result of classification rule mining;

The control center gathers the final results generated by all classifiers and returns the aggregated result to the requester of the classification rule mining task.
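The classifier's work in step c can be sketched as follows. The sketch makes several assumptions that the source leaves open: a rule is a tuple of condition values, a test record is a tuple whose last element is the class label, and a record counts as "classified by rule a" when its condition attributes match the rule exactly. All function names are hypothetical.

```python
def reduce_rules(rules):
    """Keep one copy of each identical rule, preserving first-seen order."""
    return list(dict.fromkeys(rules))


def rule_accuracy(conditions, label, test_records):
    """Y / (Y + N) over the test records that the rule classifies."""
    y = n = 0
    for record in test_records:
        if tuple(record[:-1]) == tuple(conditions):
            if record[-1] == label:
                y += 1   # class label consistent with the rule's result
            else:
                n += 1   # class label inconsistent with the rule's result
    return y / (y + n) if (y + n) else 0.0


def final_rules(label, rules, test_records, alpha):
    """Reduced rules whose accuracy is not less than alpha."""
    return [
        rule for rule in reduce_rules(rules)
        if rule_accuracy(rule, label, test_records) >= alpha
    ]


rules = [("a", "x"), ("a", "x"), ("b", "y")]
test_sample = [
    ("a", "x", "pos"),
    ("a", "x", "pos"),
    ("a", "x", "neg"),
    ("b", "y", "neg"),
]
kept = final_rules("pos", rules, test_sample, alpha=0.6)
```

In this example the rule `("a", "x")` classifies three records, two of them correctly, so its accuracy is 2/3 and it is kept; `("b", "y")` has accuracy 0 and is discarded. A rule that matches no test record is given accuracy 0 here, which is a choice of this sketch.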

Compared with existing data classification methods, the beneficial effects of the present invention are:

1. The invention divides the classification rule mining task for massive data into multiple subtasks and assigns them to the large-scale server cluster in the cloud computing environment for processing. This reduces the computational complexity of each individual task and uses the scale effect of the cloud computing server cluster to significantly improve the efficiency of the whole classification rule mining task;

2. The master-slave organizational structure of the invention implements task distribution, scheduling and management under the cloud computing environment, providing a distributed implementation mechanism for classification rule mining. Meanwhile, the genetic algorithm used for rule training is itself well suited to parallel execution, which addresses the poor parallelism of general classification techniques in distributed environments.

Brief description of the drawings

Fig. 1 is a schematic diagram of the classification rule mining method under a cloud computing environment of the present invention.

Fig. 2 is a flow chart of atomic rule generation based on loop iteration of genetic operations in the present invention.

Embodiment

In the classification rule mining method under a cloud computing environment of this embodiment:

The cloud computing environment consists of a plurality of distributed servers. When classification rule mining is carried out in the cloud computing environment, a master-slave organizational structure is adopted, in which one server is set as the control center and the other servers are slave servers. The control center arranges the execution, scheduling and management of the whole mining task and coordinates the operation of each slave server; each slave server is a concrete execution unit of the task. As shown in Fig. 1, the classification rule mining method proceeds as follows:

1. The control center divides the data set to be classified into a training sample and a test sample, and divides the training sample evenly into data blocks of the same size. For each data block, it designates one slave server that will execute the classification mining task as a processing unit, and assigns each data block to its corresponding processing unit;

2. Each processing unit uses a genetic algorithm to train atomic rules for classification on the data block assigned to it, and stores the trained atomic rules in a buffer;

3. The control center partitions the atomic rules in the buffer and selects idle slave servers as classifiers. The classifiers perform redundancy reduction on the atomic rules and test the classification accuracy of the reduction results; the reduction results that meet the classification accuracy requirement are selected as the final result of classification rule mining.

Suppose the data set is described by the relation schema $R(a_1, a_2, \ldots, a_k)$, where $a_i$ ($i = 1, 2, \ldots, k$) are attributes. The attribute set $\langle a_1, a_2, \ldots, a_k \rangle$ is divided into two parts, comprising $k-1$ condition attributes and one class label: attributes $a_1, a_2, \ldots, a_{k-1}$ are the condition attributes and attribute $a_k$ is the class label. Each data record in the data set is a $k$-dimensional vector $[c_1, c_2, \ldots, c_k]$, where $c_i$ is the value of attribute $a_i$ in that record.

The atomic rule takes the form: If $(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1})$, Then $a_k = c_k$.
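As a small illustration of this rule form, the Python sketch below renders a data record as an If/Then rule, treating the last attribute as the class label. The function name and the example attribute names are hypothetical and not from the source.

```python
def format_atomic_rule(attributes, record):
    """Render a k-dimensional record as 'If (a1 = c1) AND ..., Then ak = ck'."""
    *cond_attrs, label_attr = attributes
    *cond_vals, label_val = record
    antecedent = " AND ".join(
        f"({a} = {c})" for a, c in zip(cond_attrs, cond_vals)
    )
    return f"If {antecedent}, Then {label_attr} = {label_val}"


rule_text = format_atomic_rule(
    ["outlook", "windy", "play"],
    ["sunny", "no", "yes"],
)
```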

Step 1 is implemented as follows:

A suitable slave server must satisfy the following conditions: its storage space is not less than M/N, where M is the size of the training sample and N is the number of qualifying processing units, and its response time is not greater than the latest time required by the user.

The control center accesses the training sample, copies each data block obtained from the division to its corresponding processing unit, and sends the processing unit the operation instruction to train atomic rules using the genetic algorithm.

The training process of the atomic rules in step 2 is:

The processing unit genetically encodes each data record in the data block and generates atomic rules through loop iteration of genetic operations. The atomic rules are stored in the buffer as <key, value> key-value pairs, where the key is a class label and the value is an atomic rule under that class label;

The genetic encoding of a data record is as follows: each data record is represented as one chromosome, and the chromosome's gene values consist of the binary codes of the $k$ attribute values. If an attribute value is discrete, it can be binary-encoded directly; if it is continuous, the continuous attribute is first discretized and then binary-encoded. The binary encoding is: suppose attribute $a_i$ has $s$ discrete attribute values $v_1, v_2, \ldots, v_s$; the value of $a_i$ is then represented by a binary number with $s$ code bits. If the value of $a_i$ is $v_j$, the $j$-th bit of the binary code is 1 and all other bits are 0. For example, the attribute "sex" has two values, "male" and "female": if the value is "male", the binary code of the attribute is "01"; if the value is "female", the code is "10". In the genetic algorithm, each chromosome is one genetic individual, and all genetic individuals together form the population. The population size is expressed as the number $n$ of genetic individuals, the population obtained from the genetic encoding process is the initial population, and each binary code bit corresponds to one gene position on the chromosome.
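The encoding can be sketched in Python as below. To reproduce the "01"/"10" example, the sketch assumes that "male" is $v_1$ and that the $j$-th bit is counted from the right; the source does not state the bit order explicitly. The function names and the second attribute in the usage example are hypothetical.

```python
def encode_value(value, domain):
    """s-bit code with the j-th bit (counted from the right) set for v_j."""
    j = domain.index(value) + 1        # 1-based position of the value
    bits = ["0"] * len(domain)
    bits[len(domain) - j] = "1"
    return "".join(bits)


def encode_record(record, domains):
    """Concatenate the codes of all k attribute values into one chromosome."""
    return "".join(encode_value(v, d) for v, d in zip(record, domains))


sex_domain = ["male", "female"]
answer_domain = ["no", "yes"]
chromosome = encode_record(("male", "yes"), [sex_domain, answer_domain])
```

Continuous attributes would need to be discretized into a finite domain before being passed to `encode_value`.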

The loop iteration of genetic operations is shown in Fig. 2: in the $t$-th iteration, the fitness of the $t$-th generation population $P(t)$ is evaluated first, and it is then decided whether to stop iterating. If the stopping condition is met, iteration stops and $P(t)$ is output as the atomic rules. Otherwise, the selection, crossover and mutation operations are performed to generate the $(t+1)$-th generation population $P(t+1)$, $t$ is set to $t+1$, and the next iteration begins.

Iteration stops when either of the following two conditions is met:

1. the iteration count t > 100;

2. the fitness value fitness > 0.75.
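The iteration with its two stopping conditions can be sketched as a generic Python loop. The sketch assumes that t starts at 1 and that the fitness condition refers to the best individual in the current population; the source states only "fitness > 0.75". The callables `fitness_fn` and `next_generation_fn` are hypothetical placeholders for the fitness evaluation and for the combined selection, crossover and mutation step.

```python
def evolve(population, fitness_fn, next_generation_fn,
           max_iterations=100, fitness_threshold=0.75):
    """Iterate until t > max_iterations or the best fitness exceeds the threshold."""
    t = 1
    while True:
        # Evaluate the fitness of the t-th generation population P(t).
        scores = [fitness_fn(ind, population) for ind in population]
        if t > max_iterations or max(scores) > fitness_threshold:
            return population, t          # P(t) is output as the atomic rules
        # Selection, crossover and mutation produce P(t+1).
        population = next_generation_fn(population, scores)
        t += 1


# Trivial stand-ins, used only to exercise the control flow.
def keep_population(population, scores):
    return population


_, stopped_by_count = evolve(["0110"], lambda ind, pop: 0.1, keep_population)
_, stopped_by_fitness = evolve(["0110"], lambda ind, pop: 0.9, keep_population)
```

With a constant low fitness the loop ends at t = 101 through the iteration limit; with a fitness above the threshold it ends in the first iteration.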

Fitness is evaluated as follows: suppose the data record corresponding to a genetic individual is $[c_1, c_2, \ldots, c_k]$. The fitness of this individual is $\mathrm{fitness} = TT/n + TT/(TT + TF)$, where $TT$ is the number of data records in the data block that satisfy $(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_k = c_k)$; $TF$ is the number of data records in the data block that satisfy $(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1}) \wedge (a_k \neq c_k)$; $FT$ is the number of data records in the data block that satisfy $\neg[(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1})] \wedge (a_k = c_k)$; and $FF$ is the number of data records in the data block that satisfy $\neg[(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1})] \wedge (a_k \neq c_k)$.
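A minimal Python sketch of this fitness computation follows. For readability it works on decoded records (tuples whose last element is the class label) rather than on binary chromosomes, which is an assumption of the sketch. It also returns 0 when no record matches the rule's conditions, a case the source does not address. The function name is hypothetical.

```python
def fitness(individual, block, n):
    """fitness = TT/n + TT/(TT + TF) for one genetic individual."""
    conditions, label = tuple(individual[:-1]), individual[-1]
    # TT: conditions and class label both match.
    tt = sum(1 for r in block if tuple(r[:-1]) == conditions and r[-1] == label)
    # TF: conditions match but the class label differs.
    tf = sum(1 for r in block if tuple(r[:-1]) == conditions and r[-1] != label)
    if tt + tf == 0:
        return 0.0                      # no matching record (sketch's choice)
    return tt / n + tt / (tt + tf)


block = [
    ("a", "x", "pos"),
    ("a", "x", "pos"),
    ("a", "x", "neg"),
    ("b", "y", "neg"),
]
score = fitness(("a", "x", "pos"), block, n=len(block))
```

For the individual `("a", "x", "pos")`, TT = 2 and TF = 1, so with n = 4 the fitness is 2/4 + 2/3.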

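The selection operation is: the selection probability $p_i$ of genetic individual $X_i$ is computed from its fitness value $\mathrm{fitness}(X_i)$ [formula not reproduced in this text]; $p_i \cdot n$ copies of $X_i$ are then made and used as individuals in the population for the next round of genetic operations.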
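Since the formula for the selection probability does not survive in this text, the sketch below assumes the common fitness-proportional form, $p_i = \mathrm{fitness}(X_i) / \sum_j \mathrm{fitness}(X_j)$. This is an assumption, not the patent's stated formula. Rounding $p_i \cdot n$ to the nearest integer is also a choice of the sketch, so the resulting population may not contain exactly n individuals.

```python
def select(population, scores):
    """Copy each individual about p_i * n times (p_i assumed fitness-proportional)."""
    total = sum(scores)
    n = len(population)
    if total == 0:
        return list(population)        # no fitness signal: keep everyone
    next_population = []
    for individual, score in zip(population, scores):
        p_i = score / total            # assumed selection probability
        next_population.extend([individual] * round(p_i * n))
    return next_population


selected = select(["A", "B", "C", "D"], [2.0, 1.0, 1.0, 0.0])
```

In this example "A" holds half of the total fitness and receives two of the four slots, while "D" with zero fitness is dropped.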

The crossover operation is: according to the crossover probability $p_c$, two genetic individuals $X_i$ and $X_j$ are selected at random and a gene position $w$ on the chromosome is chosen at random; the gene segments of $X_i$ and $X_j$ after gene position $w$ are exchanged to form two new individuals, which become individuals in the population for the next round of genetic operations. The crossover probability $p_c$ takes a value in [0.4, 0.9]; an adaptive crossover probability may also be used.
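Single-point crossover on binary-string chromosomes can be sketched as follows. Interpreting $p_c$ as the probability that a given pair is crossed is an assumption of the sketch, as are the function names and the default value 0.7 (chosen from within the stated range).

```python
import random


def crossover(x_i, x_j, w):
    """Exchange the gene segments of two chromosomes after gene position w."""
    return x_i[:w] + x_j[w:], x_j[:w] + x_i[w:]


def maybe_crossover(x_i, x_j, p_c=0.7, rng=random):
    """Cross the pair with probability p_c at a random gene position."""
    if rng.random() < p_c:
        w = rng.randrange(1, len(x_i))   # keep at least one gene on each side
        return crossover(x_i, x_j, w)
    return x_i, x_j


child_1, child_2 = crossover("0110", "1001", w=2)
```

With w = 2, the parents "0110" and "1001" produce the children "0101" and "1010".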

The mutation operation is: according to the mutation probability $p_m$, one genetic individual $X_i$ is selected at random, a gene position on $X_i$ is chosen at random, and the binary code at that gene position is inverted. The mutation probability $p_m$ takes a value in [0.01, 0.1]; an adaptive mutation probability may also be used.
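Bit-flip mutation on a binary-string chromosome can be sketched as below. Treating $p_m$ as the probability that a given individual is mutated is an assumption, as are the function names and the default value 0.05 (chosen from within the stated range).

```python
import random


def flip_bit(chromosome, position):
    """Invert the binary code at one gene position."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


def maybe_mutate(chromosome, p_m=0.05, rng=random):
    """Mutate one randomly chosen gene position with probability p_m."""
    if rng.random() < p_m:
        return flip_bit(chromosome, rng.randrange(len(chromosome)))
    return chromosome


mutant = flip_bit("0110", 0)
```

Flipping a single bit can leave an attribute's code with zero or two set bits, which no longer corresponds to one attribute value; the source does not say how such codes are handled.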

The control center periodically reads the <key, value> key-value pairs in the buffer, generates a list of <key, value list> key-value pairs and stores it in the buffer, where the key is a class label and the value list is the list of atomic rules under that class label;

Step 3 is carried out as follows:

The classifier uses the reduced atomic rules to classify the test sample and checks whether the classification result is consistent with the class label of the test sample. Suppose that, among the test sample records classified by atomic rule a, Y records have a class label consistent with the classification result and N records have a class label inconsistent with it; the classification accuracy of atomic rule a is then Y/(Y+N). If the classification accuracy requirement in the user's classification mining task is α, all atomic rules whose classification accuracy is not less than α are sent to the control center as the final result of classification rule mining;

Claims

1. A classification rule mining method under a cloud computing environment, characterized in that:

b. The processing unit uses a genetic algorithm to train atomic rules for classification on the data block assigned to it, and stores the trained atomic rules in a buffer; the training process of the atomic rules is:

After a processing unit has finished operating on all data records in its data block, it sends the control center a message that the processing unit's operation is complete;

2. The classification rule mining method under a cloud computing environment according to claim 1, characterized in that step a is implemented as follows:

3. The classification rule mining method under a cloud computing environment according to claim 1, characterized in that step c is carried out as follows:

The control center searches for idle slave servers to act as classifiers. The number of classifiers equals the number of key values in the list of <key, value list> key-value pairs, and each classifier corresponds to one key value. The control center sends the atomic rule list from each <key, value list> key-value pair, together with the test sample records that have the same class label, to the corresponding classifier, and sends the classifier the operation instructions for redundancy reduction and classification accuracy testing;