CN102737126A

CN102737126A - Classification rule mining method under cloud computing environment

Info

Publication number: CN102737126A
Application number: CN2012102038165A
Authority: CN
Inventors: 杨善林; 丁静; 罗贺; 丁帅; 徐达宇; 范雯娟
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2012-06-19
Filing date: 2012-06-19
Publication date: 2012-10-17
Anticipated expiration: 2032-06-19
Also published as: CN102737126B

Abstract

The invention discloses a classification rule mining method under a cloud computing environment. The classification rule mining method is characterized in that a master-slave type organization structure consisting of a control center and a plurality of slave servers is adopted, and comprises the following steps of: dividing, by the control center, a data set to be classified into a training sample and a test sample, and distributing each of data blocks of the same size which are formed by uniformly dividing the training sample to one processing unit; training, by the processing units, the data blocks, by using a genetic algorithm to obtain an atomic rule for classification; and finally, reducing the atomic rule by a classifier, and selecting a reduction result which meets classification accuracy requirement as a final result of the classification rule mining. The classification rule mining method is suitable for data classification on distributed data storage under the cloud computing environment; distributed parallel processing of data classification tasks under the cloud computing environment can be performed; and the classification rule mining method has a positive effect on the classification processing problem of mass data under the cloud computing environment.

Description

Classification rule mining method under a cloud computing environment

Technical field

The invention belongs to the field of data analysis technology under a cloud computing environment, and specifically relates to a classification rule mining method under a cloud computing environment.

Background technology

Research on classification techniques is an important area of data analysis and management under the cloud computing environment. On the one hand, classification is an important type of data mining task. Data under the cloud computing environment are massive, distributed, and dynamic, and these characteristics challenge data management there; analyzing such data through classification rule mining helps improve the efficiency of analyzing and managing massive data under the cloud computing environment. On the other hand, the cloud environment has ultra-large-scale storage and computing capacity, its resources and structure scale dynamically, and it provides services on demand through virtualization technology and a huge resource pool, which makes efficient data classification possible. Classification rule mining is a vital task in data analysis and management: it helps to better understand the massive data under the cloud computing environment and supports operational decisions there. At the same time, the high-performance computing and storage capacity of cloud computing guarantees that classification rule mining can run efficiently. Classification rule mining is therefore an important technology for data analysis and processing under the cloud computing environment, and research on its theory and application is significant.

In research on classification rule mining, scholars in China and abroad have proposed many solutions, including statistical methods represented by the Bayesian method, machine learning methods represented by the decision tree method and the rule induction method, and neural network methods. When applied to classification rule mining on small-scale static data sets, these methods achieve high classification accuracy. However, they still have bottlenecks: the data set must be scanned and sorted repeatedly, which makes the algorithms inefficient; they are sensitive to noisy data and prone to overfitting; and they do not scale well to large training sets. Under the cloud computing environment in particular, the large scale and dynamic nature of distributed massive data sets cause the search space and dimensionality of the classification process to surge, which increases the computational complexity of classification and reduces the efficiency of traditional classification methods. Existing classification rule mining methods therefore cannot be applied directly in the cloud computing environment.

Summary of the invention

To solve the above problems, the present invention overcomes the limitations of the prior art and provides a classification rule mining method under a cloud computing environment. The invention is suitable for classifying data held in distributed storage under the cloud computing environment, supports distributed parallel processing of data classification tasks, and helps address the problem of classifying massive data under the cloud computing environment. It exploits the economies of scale of the large number of computing nodes under the cloud computing environment to improve the efficiency of classification rule mining on massive data, and it realizes classification rule mining in a distributed manner through a master-slave organizational structure and a rule training process based on a genetic algorithm.

To solve the technical problem, the present invention adopts the following technical scheme:

The classification rule mining method under a cloud computing environment of the present invention is characterized as follows:

The cloud computing environment consists of a plurality of distributed servers. When classification rule mining is carried out under the cloud computing environment, a master-slave organizational structure is adopted: one server is set as the control center and the other servers are slave servers. The control center arranges the execution of the whole mining task, manages and schedules it, and coordinates the operation of each slave server; each slave server is a concrete execution unit of the task. The classification rule mining method proceeds as follows:

a. The control center divides the data set to be classified into a training sample and a test sample. The training sample is divided evenly into data blocks of identical size; for each data block, a slave server that will carry out the classification mining task is designated as a processing unit, and each data block is distributed to its corresponding processing unit;

b. Each processing unit applies a genetic algorithm to the data block distributed to it in order to train atomic rules for classification, and stores the atomic rules obtained by training in a buffer;

c. The control center partitions the atomic rules in the buffer and selects idle slave servers as classifiers; the classifiers perform redundancy reduction on the atomic rules and test the classification accuracy of the reduction results; the reduction results that satisfy the classification accuracy requirement are selected as the final result of classification rule mining.

The classification rule mining method under a cloud computing environment of the present invention is further characterized as follows:

Step a is implemented as follows:

After receiving a request for a classification mining task, the control center divides the data set to be classified into a training sample and a test sample. Based on the requirements the user places on the classification mining task and on the characteristics of the training sample, it searches for suitable slave servers to act as processing units and divides the training sample evenly into data blocks of identical size. If the number of processing units that satisfy the conditions is N and the size of the training sample is M, the size of each data block is M/N;

A suitable slave server satisfies the following conditions: its storage space is not less than M/N, and its response time is not greater than the latest time required by the user;

The control center accesses the training sample, copies the data blocks obtained by the division to the corresponding processing units, and sends the processing units the operation instruction to train atomic rules with the genetic algorithm.
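The selection of processing units and the even division of the training sample can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Server` class, its fields, the unit of storage (records), and the strategy of trying the largest feasible N first are all assumptions introduced here, and the sketch further assumes that M is a multiple of N so that all blocks have identical size.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical description of a slave server (not from the source)."""
    name: str
    storage: int          # free capacity, measured in records (assumed unit)
    response_time: float  # seconds


def choose_processing_units(servers, m, deadline):
    """Pick N servers whose storage >= M/N and response time <= deadline.

    Assumption: the largest feasible N is preferred.
    """
    timely = [s for s in servers if s.response_time <= deadline]
    for n in range(len(timely), 0, -1):
        eligible = [s for s in timely if s.storage >= m / n]
        if len(eligible) >= n:
            return eligible[:n]
    return []


def split_into_blocks(records, n):
    """Divide the training sample into n blocks of size M/N."""
    if len(records) % n != 0:
        raise ValueError("sketch assumes M is a multiple of N")
    size = len(records) // n
    return [records[i * size:(i + 1) * size] for i in range(n)]


servers = [
    Server("s1", 50, 0.2),
    Server("s2", 30, 0.1),
    Server("s3", 100, 5.0),  # too slow for the deadline below
    Server("s4", 40, 0.3),
]
training_sample = list(range(90))  # M = 90 records
units = choose_processing_units(servers, len(training_sample), deadline=1.0)
blocks = split_into_blocks(training_sample, len(units))
```

Here three servers meet the deadline and each can hold M/N = 30 records, so the sample is split into three blocks of 30 records.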

The training process of the atomic rules in step b is:

The processing unit genetically encodes each data record in the data block, generates atomic rules through loop iteration of genetic operations, and stores the atomic rules in the buffer as <key, value> key-value pairs, where key is the class label and value is an atomic rule under that class label;

The control center periodically reads the <key, value> key-value pairs in the buffer, generates a list of <key, value list> key-value pairs, and stores it in the buffer; in the <key, value list> list, key is the class label and value list is the list of atomic rules under that class label;

After the processing unit has finished operating on all data records in the data block, it sends the control center a message that the processing unit has finished.
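The grouping of <key, value> pairs into <key, value list> pairs can be illustrated with a short sketch. The function name `group_by_class` and the representation of an atomic rule as a tuple of condition values are assumptions made for illustration only.

```python
from collections import defaultdict


def group_by_class(pairs):
    """Collect the atomic rules emitted under each class label."""
    grouped = defaultdict(list)
    for label, rule in pairs:
        grouped[label].append(rule)
    return dict(grouped)


# <key, value> pairs as the processing units might write them to the buffer
buffer_pairs = [
    ("yes", ("sunny", "high")),
    ("no", ("rainy", "low")),
    ("yes", ("sunny", "high")),
    ("yes", ("overcast", "low")),
]
rule_lists = group_by_class(buffer_pairs)
```

Duplicates are kept at this stage; removing them is the job of the classifiers in step c.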

Step c is carried out as follows:

The control center searches for idle slave servers to act as classifiers. The number of classifiers equals the number of key values in the <key, value list> list, and each classifier corresponds to one key value. The control center sends each classifier the list of atomic rules from the <key, value list> list together with the records in the test sample that have the same class label, and sends the classifier the operation instructions for redundancy reduction and classification accuracy testing;

The classifier records multiple identical atomic rules under the same class label only once and deletes the redundant atomic rules, obtaining the reduced atomic rules;
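A minimal sketch of this redundancy reduction, assuming each atomic rule is a hashable tuple and that the first occurrence of a rule is the one retained (the source does not say which copy is kept):

```python
def reduce_rules(rules):
    """Keep each distinct atomic rule once, preserving first-seen order."""
    return list(dict.fromkeys(rules))


rules_for_label = [
    ("sunny", "high"),
    ("overcast", "low"),
    ("sunny", "high"),
    ("sunny", "high"),
]
reduced = reduce_rules(rules_for_label)
```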

The classifier uses the reduced atomic rules to classify the test sample and checks whether each classification result is consistent with the class label of the test sample. Suppose that, among the test records classified by atomic rule a, Y records have a class label consistent with the classification result and N records have a class label inconsistent with it; the classification accuracy of atomic rule a is then Y/(Y+N). Suppose the classification accuracy requirement in the classification mining task requirements proposed by the user is α; all atomic rules whose classification accuracy is not less than α are sent to the control center as the final result of classification rule mining;

The control center gathers the final results generated by all classifiers and returns the aggregated result to the requester of the classification rule mining task.

Compared with existing data classification methods, the beneficial effects of the present invention are:

1. The present invention divides the classification rule mining task for massive data into a plurality of subtasks and assigns them to a large-scale server cluster in the cloud computing environment for processing. This reduces the computational complexity of each individual task, exploits the economies of scale of the cloud computing server cluster, and significantly improves the efficiency of the whole classification rule mining task;

2. The master-slave organizational structure of the present invention realizes the distribution, scheduling, and management of tasks under the cloud computing environment and provides a distributed implementation mechanism for classification rule mining. In addition, the genetic algorithm used for rule training is inherently well suited to parallel execution, which addresses the poor parallelism of conventional classification techniques in distributed environments.

Description of drawings

Fig. 1 is a schematic diagram of the classification rule mining method under a cloud computing environment of the present invention.

Fig. 2 is a flow chart of atomic rule generation based on loop iteration of genetic operations in the present invention.

Embodiment

In the classification rule mining method under a cloud computing environment of this embodiment:

The cloud computing environment consists of a plurality of distributed servers. When classification rule mining is carried out under the cloud computing environment, a master-slave organizational structure is adopted: one server is set as the control center and the other servers are slave servers. The control center arranges the execution of the whole mining task, manages and schedules it, and coordinates the operation of each slave server; each slave server is a concrete execution unit of the task. The classification rule mining method is shown in Fig. 1 and proceeds as follows:

1. The control center divides the data set to be classified into a training sample and a test sample. The training sample is divided evenly into data blocks of identical size; for each data block, a slave server that will carry out the classification mining task is designated as a processing unit, and each data block is distributed to its corresponding processing unit;

2. Each processing unit applies a genetic algorithm to the data block distributed to it in order to train atomic rules for classification, and stores the atomic rules obtained by training in a buffer;

3. The control center partitions the atomic rules in the buffer and selects idle slave servers as classifiers; the classifiers perform redundancy reduction on the atomic rules and test the classification accuracy of the reduction results; the reduction results that satisfy the classification accuracy requirement are selected as the final result of classification rule mining.

Assume the data set is represented by the relational schema $R(a_1, a_2, \ldots, a_k)$, where $a_i$ ($i = 1, 2, \ldots, k$) are attributes. The attribute set $\langle a_1, a_2, \ldots, a_k \rangle$ is divided into two parts, comprising $k-1$ condition attributes and 1 class label: attributes $a_1, a_2, \ldots, a_{k-1}$ are the condition attributes and attribute $a_k$ is the class label. Each data record in the data set is a $k$-dimensional vector $[c_1, c_2, \ldots, c_k]$, where $c_i$ is the value of attribute $a_i$ in that record.

An atomic rule has the form: If $(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1})$, Then $a_k = c_k$.

Step 1 is implemented as follows:

A suitable slave server must satisfy the following conditions: its storage space is not less than M/N, and its response time is not greater than the latest time required by the user.

The control center accesses the training sample, copies the data blocks obtained by the division to the corresponding processing units, and sends the processing units the operation instruction to train atomic rules with the genetic algorithm.

The training process of the atomic rules in step 2 is:

The processing unit genetically encodes each data record in the data block, generates atomic rules through loop iteration of genetic operations, and stores the atomic rules in the buffer as <key, value> key-value pairs, where key is the class label and value is an atomic rule under that class label;

The genetic encoding of a data record is as follows: each data record is represented as one chromosome, whose gene values consist of the binary codes of the k attribute values. If an attribute value is discrete, it can be binary-coded directly; if it is continuous, the continuous attribute value is discretized first and then binary-coded. The binary coding works as follows: suppose attribute $a_i$ has s discrete values $v_1, v_2, \ldots, v_s$; the value of $a_i$ is then represented by a binary number with s code bits. If the value of $a_i$ is $v_j$, the j-th bit of the binary code is 1 and the remaining bits are 0. For example, the sex attribute has two values, "male" and "female": if the value is "male", the binary code of the attribute is "01"; if the value is "female", the code is "10". In the genetic algorithm, each chromosome is one genetic individual, and all genetic individuals form a population whose size is denoted by the number of genetic individuals n. The population obtained by the genetic encoding process is the initial population, and each binary code bit corresponds to one gene position on the chromosome.
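The binary coding of discrete attribute values can be sketched as below. The function names and the sample attribute domains are hypothetical. To reproduce the "male" → "01", "female" → "10" example, the sketch assumes that the j-th bit is counted from the right-hand end of the code; the source does not state the bit order explicitly. Discretization of continuous attributes is not shown.

```python
def encode_value(value, domain):
    """Encode one discrete attribute value as an s-bit code with a single 1.

    Assumption: bit j is counted from the right, so the first value in
    `domain` sets the rightmost bit.
    """
    j = domain.index(value)        # zero-based index of v_{j+1}
    s = len(domain)
    bits = ["0"] * s
    bits[s - 1 - j] = "1"
    return "".join(bits)


def encode_record(record, domains):
    """Concatenate the codes of all k attribute values into one chromosome."""
    return "".join(encode_value(v, d) for v, d in zip(record, domains))


# Hypothetical attribute domains: two condition attributes and a class label
domains = [["male", "female"], ["low", "mid", "high"], ["no", "yes"]]
chromosome = encode_record(["male", "high", "yes"], domains)
```

With these domains each chromosome has 2 + 3 + 2 = 7 gene positions.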

The loop iteration of genetic operations is shown in Fig. 2. In the t-th iteration, the fitness of the t-th generation population P(t) is evaluated first, and then it is decided whether to stop iterating. If the stopping condition is satisfied, iteration stops and P(t) is output as the atomic rules; otherwise, the selection, crossover, and mutation operations are carried out to generate the (t+1)-th generation population P(t+1), t is set to t+1, and the next iteration is carried out.

The iteration stops when either of the following two conditions holds:

1. the iteration count t > 100;

2. the fitness > 0.75.
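The iteration loop with its two stopping conditions might be organized as in the following sketch. Several points are assumptions: the counter starts at t = 1, the fitness condition is applied to the best individual of the population (the source does not say whether the best, average, or every individual must exceed 0.75), and `fitness_fn` and `next_generation` are hypothetical callables standing in for fitness evaluation and for the combined selection, crossover, and mutation operations.

```python
def evolve(population, fitness_fn, next_generation,
           max_iterations=100, fitness_threshold=0.75):
    """Iterate until t > max_iterations or the best fitness > threshold.

    Returns the final population P(t) and the iteration count t.
    """
    t = 1
    while True:
        scores = [fitness_fn(individual) for individual in population]
        if t > max_iterations or max(scores) > fitness_threshold:
            return population, t
        population = next_generation(population, scores)
        t += 1


# Toy run: individuals are integers, fitness is value / 100, and each
# generation simply adds 25 to every individual.
final_population, generations = evolve(
    [10, 20],
    fitness_fn=lambda x: x / 100,
    next_generation=lambda pop, scores: [x + 25 for x in pop],
)
```

In the toy run the best fitness passes 0.75 in the fourth generation; a population that never improves would instead stop once t exceeds 100.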

The fitness is evaluated as follows. Let the data record corresponding to a genetic individual be $[c_1, c_2, \ldots, c_k]$. The fitness of this individual is $\mathit{fitness} = TT/n + TT/(TT + TF)$, where TT is the number of data records in the data block that satisfy "$(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_k = c_k)$"; TF is the number of data records in the data block that satisfy "$(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1}) \wedge (a_k \neq c_k)$"; FT is the number of data records in the data block that satisfy "$\neg[(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1})] \wedge (a_k = c_k)$"; and FF is the number of data records in the data block that satisfy "$\neg[(a_1 = c_1) \wedge (a_2 = c_2) \wedge \cdots \wedge (a_{k-1} = c_{k-1})] \wedge (a_k \neq c_k)$".
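A small sketch of this fitness computation follows. It works on decoded records (tuples whose last element is the class label) rather than on bit strings, and it takes n to be the number of records in the data block, on the assumption that the initial population contains one individual per record. The sample data are invented for illustration.

```python
def fitness(individual, block):
    """fitness = TT/n + TT/(TT + TF) for one individual against its block."""
    *conditions, label = individual
    tt = tf = 0
    for *rec_conditions, rec_label in block:
        if rec_conditions == conditions:
            if rec_label == label:
                tt += 1   # conditions and class label both match
            else:
                tf += 1   # conditions match, class label differs
    n = len(block)        # assumed population size
    if tt + tf == 0:
        return 0.0        # individual matches no record in the block
    return tt / n + tt / (tt + tf)


block = [
    ("sunny", "high", "no"),
    ("sunny", "high", "no"),
    ("sunny", "high", "yes"),
    ("rainy", "low", "yes"),
]
f_sunny_no = fitness(("sunny", "high", "no"), block)   # TT = 2, TF = 1
f_rainy_yes = fitness(("rainy", "low", "yes"), block)  # TT = 1, TF = 0
```

The first term rewards rules that cover many records, and the second rewards rules whose conditions reliably imply the class label.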

The selection operation is: genetic individual $X_i$ is assigned a selection probability $p_i$, determined from $\mathit{fitness}(X_i)$, the fitness value of $X_i$; $p_i \times n$ copies of $X_i$ are duplicated as individuals of the population for the next genetic operation.
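The selection step can be illustrated as follows. The formula for the selection probability is not reproduced in this text, so the sketch assumes the common fitness-proportional form, $p_i = \mathit{fitness}(X_i) / \sum_j \mathit{fitness}(X_j)$; this is an assumption, not a statement of the patented formula. Rounding $p_i \times n$ to the nearest integer is likewise an illustrative choice, and it means the next population may not contain exactly n individuals.

```python
def select(population, fitness_values):
    """Copy each individual round(p_i * n) times into the next population.

    Assumption: p_i is the individual's share of the total fitness.
    """
    n = len(population)
    total = sum(fitness_values)
    if total == 0:
        return list(population)  # nothing to prefer; keep the population
    next_population = []
    for individual, f in zip(population, fitness_values):
        p_i = f / total
        next_population.extend([individual] * round(p_i * n))
    return next_population


population = ["A", "B", "C", "D"]
fitness_values = [0.5, 1.0, 0.5, 0.0]
selected = select(population, fitness_values)
```

With these values the probabilities are 0.25, 0.5, 0.25, and 0, so "B" is copied twice and "D" disappears.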

The crossover operation is: according to the crossover probability $p_c$, two genetic individuals $X_i$ and $X_j$ are selected at random, a gene position $w$ on the chromosome is selected at random, and the gene segments of $X_i$ and $X_j$ after gene position $w$ are exchanged to form two new individuals, which serve as individuals of the population for the next genetic operation. The crossover probability $p_c$ is a value in [0.4, 0.9]; an adaptive crossover probability may also be used.
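Single-point crossover on bit-string chromosomes can be sketched like this. The function names are hypothetical, chromosomes are assumed to be equal-length strings, and w is treated as the number of leading gene positions each parent keeps.

```python
import random


def crossover(x_i, x_j, w):
    """Exchange the gene segments after gene position w."""
    return x_i[:w] + x_j[w:], x_j[:w] + x_i[w:]


def maybe_crossover(x_i, x_j, p_c, rng):
    """Apply crossover with probability p_c at a random gene position."""
    if rng.random() < p_c:
        w = rng.randint(1, len(x_i) - 1)  # keep at least one gene per side
        return crossover(x_i, x_j, w)
    return x_i, x_j


child_1, child_2 = crossover("0110010", "1001001", 3)
rng = random.Random(0)  # seeded only to make the sketch repeatable
random_children = maybe_crossover("0110010", "1001001", 0.9, rng)
```

In the fixed example the parents keep their first three genes and swap the remaining four.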

The mutation operation is: according to the mutation probability $p_m$, a genetic individual $X_i$ is selected at random, a gene position on $X_i$ is selected at random, and the binary code at that gene position is inverted. The mutation probability $p_m$ is a value in [0.01, 0.1]; an adaptive mutation probability may also be used.
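A minimal sketch of the mutation step on a bit-string chromosome, with hypothetical function names and zero-based gene positions:

```python
import random


def flip_bit(chromosome, position):
    """Invert the binary code at one gene position."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


def mutate(chromosome, p_m, rng):
    """With probability p_m, invert one randomly chosen gene position."""
    if rng.random() < p_m:
        return flip_bit(chromosome, rng.randrange(len(chromosome)))
    return chromosome


mutated_first = flip_bit("0110010", 0)
rng = random.Random(0)  # seeded only to make the sketch repeatable
maybe_mutated = mutate("0110010", 0.05, rng)
```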

The control center periodically reads the <key, value> key-value pairs in the buffer, generates a list of <key, value list> key-value pairs, and stores it in the buffer; in the <key, value list> list, key is the class label and value list is the list of atomic rules under that class label;

Step 3 is carried out as follows:

The classifier uses the reduced atomic rules to classify the test sample and checks whether each classification result is consistent with the class label of the test sample. Suppose that, among the test records classified by atomic rule a, Y records have a class label consistent with the classification result and N records have a class label inconsistent with it; the classification accuracy of atomic rule a is then Y/(Y+N). If the classification accuracy requirement in the classification mining task requirements proposed by the user is α, all atomic rules whose classification accuracy is not less than α are sent to the control center as the final result of classification rule mining;
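The accuracy test and the filtering by α can be sketched as follows. The sketch assumes that a rule "classifies" exactly those test records whose condition attributes match the rule, and that a rule matching no test record is discarded because Y/(Y+N) is undefined for it; neither point is spelled out in the source. Rules and records are tuples whose last element is the class label, and the sample data are invented.

```python
def rule_accuracy(rule, test_records):
    """Return Y / (Y + N) for one atomic rule, or None if it matches nothing."""
    *conditions, predicted = rule
    y = n = 0
    for *rec_conditions, actual in test_records:
        if rec_conditions == conditions:
            if actual == predicted:
                y += 1   # class label consistent with the classification
            else:
                n += 1   # class label inconsistent with the classification
    return y / (y + n) if (y + n) else None


def select_rules(rules, test_records, alpha):
    """Keep the rules whose classification accuracy is not less than alpha."""
    kept = []
    for rule in rules:
        accuracy = rule_accuracy(rule, test_records)
        if accuracy is not None and accuracy >= alpha:
            kept.append(rule)
    return kept


test_sample = [
    ("sunny", "high", "no"),
    ("sunny", "high", "no"),
    ("sunny", "high", "yes"),
    ("rainy", "low", "yes"),
]
reduced_rules = [
    ("sunny", "high", "no"),     # accuracy 2/3
    ("sunny", "high", "yes"),    # accuracy 1/3
    ("rainy", "low", "yes"),     # accuracy 1
    ("overcast", "low", "yes"),  # matches no test record
]
final_rules = select_rules(reduced_rules, test_sample, alpha=0.6)
```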

Claims

1. A classification rule mining method under a cloud computing environment, characterized in that:

2. The classification rule mining method under a cloud computing environment according to claim 1, characterized in that step a is implemented as follows:

3. The classification rule mining method under a cloud computing environment according to claim 1, characterized in that the training process of the atomic rules in step b is:

4. The classification rule mining method under a cloud computing environment according to claim 1, characterized in that step c is carried out as follows: