CN104063230A - Rough set parallel reduction method, device and system based on MapReduce - Google Patents


Publication number
CN104063230A
Authority
CN
China
Prior art keywords
decision table
reduction
attribute
mapreduce
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410325508.9A
Other languages
Chinese (zh)
Other versions
CN104063230B (en)
Inventor
席大超
王国胤
张学睿
张帆
封雷
李广砥
邓伟辉
郭义帅
谢亮
董建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Green and Intelligent Technology of CAS
Original Assignee
Chongqing Institute of Green and Intelligent Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Green and Intelligent Technology of CAS filed Critical Chongqing Institute of Green and Intelligent Technology of CAS
Priority to CN201410325508.9A priority Critical patent/CN104063230B/en
Publication of CN104063230A publication Critical patent/CN104063230A/en
Application granted granted Critical
Publication of CN104063230B publication Critical patent/CN104063230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a MapReduce-based rough set parallel reduction method, device and system. In the method, after the decision table to be reduced is read, it is first simplified; attribute significance is then computed in parallel over the simplified decision table, and a parallel attribute-significance reduction is performed to obtain the final reduction result. With this method, the significances of all attributes can be obtained in a single MapReduce job, and after each reduction result is obtained the redundant information of the simplified decision table is deleted again, so the table keeps shrinking and the computation speed is further improved. Like the method, the rough set parallel reduction device and system solve the problems that existing knowledge reduction methods carry restrictive preconditions and cannot perform parallel reduction efficiently, and they additionally optimize the storage space.

Description

MapReduce-based rough set parallel reduction method, apparatus and system
Technical field
The present invention relates to the field of knowledge reduction, and in particular to a MapReduce-based rough set parallel reduction method, apparatus and system.
Background technology
With the arrival of the big data era, classical reduction methods, which must load all data into memory at once, can no longer meet the requirements of big data. How to mine such data quickly and accurately has therefore become a main goal for those skilled in the art.
With Google's proposals of the distributed file system GFS (Google File System), the parallel programming model MapReduce and the distributed data storage system BigTable, a foundation was laid for big data processing, and many classical data mining methods of the prior art can now be applied to it. The classical approaches to data mining mainly involve the following.
Rough set theory, as a classical tool for handling fuzziness and uncertainty, is widely used in machine learning and data mining. Within rough set theory, knowledge reduction is one of the important research topics and a key step of knowledge acquisition. Here "knowledge" is regarded as a classification ability: human behaviour rests on the ability to discriminate objects, real or abstract. In ancient times, for instance, people had to tell what was edible and what was not in order to survive; a doctor diagnosing a patient must identify which disease the patient has. Any such ability to classify things by their distinguishing characteristics can be regarded as "knowledge". Knowledge reduction, in turn, is the deletion of superfluous knowledge under the condition that the classification ability of the knowledge base remains unchanged. By deleting redundant knowledge, the clarity of the potential knowledge in an information system can be greatly improved.
MapReduce is the programming model (i.e. software framework) of the Hadoop distributed file system; applications written on it can run on large clusters composed of thousands of commodity machines and process terabyte-scale data sets in a reliable, fault-tolerant, parallel fashion. A MapReduce job usually cuts the input data set into independent blocks that are processed by map tasks in a fully parallel manner. The framework sorts the map outputs and feeds the result to the reduce tasks. Normally both the input and the output of a job are stored in the file system. The framework is responsible for scheduling and monitoring the tasks, and for re-executing failed ones.
Usually the MapReduce framework runs on the same set of nodes as the Hadoop distributed file system; that is, compute nodes and storage nodes are co-located. This configuration allows the framework to schedule tasks efficiently on the nodes that hold the data, so that the network bandwidth of the whole cluster is used very efficiently. The map function and the reduce function are implemented by the user, and these two functions define the task itself.
Regarding the existing theory, reference is made to the following documents:
1) [Zhang J, Li T, Ruan D, et al. A parallel method for computing rough set approximations [J]. Information Sciences, 2012, 194: 209-223];
2) [Zhang J, Wong J-S, Li T, Pan Y. A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems [J]. International Journal of Approximate Reasoning, 2013].
The above documents propose a parallel rough set approximation model and a rough set knowledge acquisition parallel model based on it. The model is well proven theoretically, demonstrating the feasibility of parallel rough set models, but it only parallelizes the most basic rough set operations and does not touch on rough set reduction.
In addition, the documents:
3) [Qian J, Miao D Q, Zhang Z H. Knowledge reduction algorithms in cloud computing environment [J]. Chinese Journal of Computers, 2011, 34(12): 2332-2343];
4) [Qian J, Miao D Q, Zhang Z H. Research on discernibility matrix knowledge reduction algorithms in cloud computing environment [J]. Computer Science, 2011, 38(8)]
propose a parallelized rough set reduction model, but the method is heavily restricted: only consistent decision tables can be reduced under big data, so its practical use is very limited.
In short, the existing knowledge reduction methods mainly suffer from the following defects:
First, some can perform parallel rough set computation but cannot perform reduction.
Second, those that can perform parallelized rough set reduction are restricted to consistent decision tables, which limits them severely in practical application.
Finally, the existing parallel reduction models are not efficient enough in operation and need improvement.
Summary of the invention
In view of the above deficiencies and shortcomings of the prior art, the object of the present invention is to provide a MapReduce-based rough set parallel reduction method, apparatus and system, so as to solve the problems that the knowledge reduction methods of the prior art carry restrictive preconditions and cannot perform parallelized reduction efficiently.
For achieving the above object and other relevant objects, the invention provides following technical scheme:
A MapReduce-based rough set parallel reduction method comprises:
reading the decision table to be reduced;
initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process that table in parallel and obtain a flagged simplified decision table;
if the simplified decision table is empty, outputting it as the final reduction result of the decision table to be reduced;
if the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the flagged simplified decision table, so as to compute in parallel the significance of each attribute in the flagged simplified decision table and write the results into the Hadoop distributed file system;
reading from the Hadoop distributed file system the decision table with the highest attribute significance and deleting its redundant information to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
In addition, the present invention provides a MapReduce-based rough set parallel reduction apparatus, comprising:
a job configuration module for reading the decision table to be reduced;
a task-parallel simplification module for initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process that table in parallel and obtain a flagged simplified decision table, and, if the simplified decision table is empty, outputting it as the final reduction result of the decision table to be reduced;
an attribute significance parallel computation module for, if the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the flagged simplified decision table, so as to compute in parallel the significance of each attribute therein and write the results into the Hadoop distributed file system;
an attribute significance parallel reduction module for reading from the Hadoop distributed file system the decision table with the highest attribute significance and deleting its redundant information to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
In addition, the present invention also provides a MapReduce-based rough set parallel reduction system, comprising:
a job configuration unit for reading the decision table to be reduced;
a task-parallel simplification unit for initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process that table in parallel and obtain a flagged simplified decision table, and, if the simplified decision table is empty, outputting it as the final reduction result of the decision table to be reduced;
an attribute significance parallel computation unit for, if the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the flagged simplified decision table, so as to compute in parallel the significance of each attribute therein and write the results into the Hadoop distributed file system;
an attribute significance parallel reduction unit for reading from the Hadoop distributed file system the decision table with the highest attribute significance and deleting its redundant information to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
In summary, compared with the prior art, the present invention has the following advantages:
First, the present invention simplifies the decision table before computing attribute significance in parallel, then picks the simplified table with the highest attribute significance for further reduction, so the reduction result obtained from the significance computation is more accurate.
Second, the present invention imposes no restriction on the decision table to be reduced; compared with prior-art methods that can only reduce consistent decision tables, it has a much wider range of application.
Third, when computing attribute significance in parallel, existing methods obtain the significance of only one attribute per MapReduce job, whereas the present method, with a single MapReduce job followed by one simple text read, obtains the significances of all conditional attributes and completes one reduction step, which raises the efficiency of the method.
Fourth, after the decision table is reduced, the present invention effectively optimizes the storage space, and at the same time improves the efficiency of computations that use the reduction result.
Brief description of the drawings
Fig. 1 shows the working-principle flowchart of MapReduce.
Fig. 2 shows the workflow of the MapReduce-based rough set parallel reduction method of the present invention.
Fig. 3 shows a simplified schematic of the MapReduce-based rough set parallel reduction method of the present invention.
Description of reference numerals
S10-S50: method steps
Detailed description of the embodiments
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention may also be implemented or applied through other different embodiments, and the details in this specification may be modified or changed from different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the absence of conflict, the features of the following embodiments may be combined with one another.
The present invention is realized on the basis of MapReduce and rough sets; so that those skilled in the art may perceive and understand the technical solution better, MapReduce and rough sets are first explained and illustrated accordingly.
Overview of MapReduce
MapReduce is a programming model for parallel operations on large-scale data sets. The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming, together with features borrowed from vector programming languages. It makes it easy for programmers who cannot write distributed parallel programs to run their own programs on a distributed system. A typical software realization specifies a Map function, which maps a group of key-value pairs to a new group of key-value pairs, and a concurrent Reduce function, which ensures that all mapped key-value pairs sharing the same key are processed together.
See Fig. 1; the working principle of MapReduce is briefly explained below in conjunction with Fig. 1.
Map side
First, each input split is processed by one Map task; by default a split is the size of one HDFS block (for example 64 MB), though the block size can of course be configured. The Map output is temporarily placed in a circular memory buffer; when the buffer is about to overflow, a spill file is created in the local file system and the buffered data are written to it.
Second, before writing to disk, a thread divides the data into partitions equal in number to the Reduce tasks, i.e. one partition per Reduce task. This avoids the awkward situation where some Reduce tasks are assigned a large amount of data while others get little or even none. Partitioning is in effect a hashing of the data; the data in each partition are then sorted, so that as little data as possible is written to disk.
Third, when the Map task writes its last record there may be many spill files, which then need to be merged. The purposes are twofold: to minimize the amount of data written to disk each time, and to minimize the amount of data transmitted over the network in the following copy stage; the files are finally merged into one partitioned, sorted file. To reduce network traffic the data can also be compressed.
Fourth, the data in each partition are copied to the corresponding Reduce task.
Reduce side
First, a Reduce task receives data from different Map tasks, and the data from each Map task are ordered. If the amount of data received is fairly small, it is kept directly in memory; once it exceeds a certain proportion of the buffer size, the data are merged and spilled to disk.
Second, as the spill files accumulate, a background thread merges them into a larger, ordered file so as to save time in later merging. In fact, whether on the Map side or the Reduce side, MapReduce performs sorting and merging repeatedly.
Third, many intermediate files are produced (and written to disk) during merging, but MapReduce keeps the data written to disk as small as possible, and the result of the last merge is not written to disk but fed directly to the Reduce function.
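The Map/shuffle/Reduce flow described above can be illustrated with a minimal in-memory simulation (plain Python, not Hadoop; all names here are illustrative, and the buffering, spilling and partitioning machinery is omitted):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny in-memory simulation of the flow above: map every record to
    key-value pairs, group (shuffle/sort) by key, then reduce per key."""
    groups = defaultdict(list)
    for record in records:                          # Map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):                      # shuffle/sort phase
        results[key] = reduce_fn(key, groups[key])  # Reduce phase
    return results

# Word count, the canonical illustration of the model.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

lines = ["map reduce map", "reduce reduce"]
print(run_mapreduce(lines, wc_map, wc_reduce))  # {'map': 2, 'reduce': 3}
```

In a real cluster the grouping step is distributed across nodes, but the contract between the user-supplied map and reduce functions is the same.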
Overview of rough sets
First, the relevant basic concepts of rough sets used below, together with some remarks bearing on the MapReduce model, are introduced.
Definition 1: A decision table is an information-table knowledge representation system S = (U, R, V, f) with attribute set R = C ∪ D, where the subsets C and D are called the condition attribute set and the decision attribute set respectively; V = ∪_{r∈R} V_r is the set of attribute values, with V_r the domain (value range) of attribute r ∈ R; and f: U × R → V is an information function that assigns an attribute value to each object x in U. For each attribute subset B ⊆ R we define an indiscernibility relation IND(B):
IND(B) = { (x, y) ∈ U × U | ∀b ∈ B, b(x) = b(y) }
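As a concrete illustration of IND(B), the following sketch (hypothetical Python, with objects represented as attribute-to-value dictionaries) partitions a toy universe U into B-equivalence classes:

```python
def equivalence_classes(U, B):
    """Partition U by the indiscernibility relation IND(B): two objects
    fall in the same class iff they agree on every attribute in B."""
    classes = {}
    for x in U:
        key = tuple(x[b] for b in B)
        classes.setdefault(key, []).append(x)
    return list(classes.values())

# Hypothetical toy table: 'a' and 'b' are condition attributes, 'd' a decision.
U = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "no"},
]
print(len(equivalence_classes(U, ["a", "b"])))  # 2 classes: {x1, x2} and {x3}
```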
Definition 2: Given a knowledge representation system S = (U, R, V, f), for each subset X ⊆ U and indiscernibility relation B, the B-lower and B-upper approximation sets of X are defined as:
B_(X) = { x ∈ U | [x]_B ⊆ X },  B¯(X) = { x ∈ U | [x]_B ∩ X ≠ ∅ }
Definition 3: The set BN_B(X) = B¯(X) − B_(X) is called the B-boundary of X; POS_B(X) = B_(X) is called the B-positive region of X; NEG_B(X) = U − B¯(X) is called the B-negative region of X.
Definition 4: In a decision table S = (U, C ∪ D, V, f), let P and Q be two families of equivalence relations defined on U. If POS_P(Q) = POS_{P−{r}}(Q), then r is said to be dispensable (unnecessary) in P with respect to Q, or Q-dispensable in P for short; otherwise r is indispensable (necessary) in P with respect to Q.
Definition 5: In a decision table S = (U, C ∪ D, V, f), let P and Q be two families of equivalence relations defined on U. If a Q-independent subset S ⊆ P satisfies POS_S(Q) = POS_P(Q), then S is called a Q-reduct of P.
Definition 6: In a decision table S = (U, C ∪ D, V, f), let U/C = { [u′_1]_C, [u′_2]_C, …, [u′_m]_C } be the partition of the universe U by the condition attribute set C, and U′ = { u′_1, u′_2, …, u′_m } the set of class representatives. Let U′_pos = { u′_{i_1}, u′_{i_2}, …, u′_{i_t} } with |[u′_{i_s}]_C / D| = 1 (s = 1, 2, …, t), i.e. the representatives whose C-equivalence class takes a single value on the decision attributes, and U′_neg = U′ − U′_pos, so that U′ = U′_pos ∪ U′_neg. Then S′ = (U′, C ∪ D, V, f) is called the simplified decision table.
Definition 7: In a decision table S = (U, C ∪ D, V, f) with simplified decision table S′ = (U′, C ∪ D, V, f), the significance of an attribute a with respect to an attribute subset P is defined as
sig_P(a) = |U′_{P∪{a}} − U′_P|
where U′_P denotes the simplified universe remaining under the attribute set P.
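Definition 6 can be illustrated with a small sketch (an assumed Python representation; `simplify` and the toy table are illustrative names and data) that builds U′_pos and U′_neg from a decision table:

```python
def simplify(U, C, d):
    """Build the simplified table of Definition 6: keep one representative
    per C-equivalence class, split into U'_pos (the class takes a single
    decision value, i.e. is consistent) and U'_neg (inconsistent)."""
    classes = {}
    for x in U:
        classes.setdefault(tuple(x[c] for c in C), []).append(x)
    pos, neg = [], []
    for cls in classes.values():
        rep = cls[0]                        # class representative u'_i
        if len({x[d] for x in cls}) == 1:   # |[u'_i]_C / D| = 1
            pos.append(rep)
        else:
            neg.append(rep)
    return pos, neg

U = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},   # same C-class as above, different decision
    {"a": 0, "b": 1, "d": "no"},
]
pos, neg = simplify(U, ["a", "b"], "d")
print(len(pos), len(neg))  # 1 1
```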
On the basis of the above overview of MapReduce and rough sets, the detailed implementation of the MapReduce-based rough set parallel reduction method of the present invention is explained below with reference to embodiments.
In addition, in the present invention a "decision table" refers to data carrying decision attributes; that is, the objects reduced by the present invention are data with decision attributes.
See Fig. 2, which shows the flow of the MapReduce-based rough set parallel reduction method of the present invention; the method comprises:
S10, reading the decision table to be reduced: before the parallel reduction, the decision table to be reduced is first read, either directly from local storage (for example, the Hadoop distributed file system) or directly from a network node to local storage.
S30, obtaining the simplified decision table: a first MapReduce model is initialized and made to respond to the decision table to be reduced, so as to process it in parallel and obtain a flagged simplified decision table;
S31, if the simplified decision table is empty, it is output as the final reduction result of the decision table to be reduced;
S32, computing attribute significance in parallel: if the simplified decision table is non-empty, a second MapReduce model is initialized and made to respond to the flagged simplified decision table, so as to compute in parallel the significance of each attribute therein and write the results into the Hadoop distributed file system;
S50, attribute significance parallel reduction: the decision table with the highest attribute significance is read from the Hadoop distributed file system and its redundant information is deleted to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
First, compared with existing reduction methods, the above MapReduce-based rough set parallel reduction method first derives a simplified decision table, and performing reduction on the simplified table greatly lowers the amount of computation and thus raises efficiency. Moreover, when computing attribute significance in parallel, existing methods obtain the significance of only one attribute per MapReduce job, whereas the present method, after a single MapReduce job plus one simple text read, obtains the significances of all conditional attributes and completes one reduction step, which raises the efficiency of the reduction.
Second, since the MapReduce-based rough set parallel reduction method of the present invention is mainly an improvement over the prior art, it is necessary to briefly introduce the traditional attribute reduction method first.
According to the exposition of rough sets given above, a fast classical attribute reduction method is presented here. The method uses attribute significance as the reduction index: in each round the attribute with the highest significance is taken into the reduction result, and when the set U′ becomes empty the method stops, having found a best reduction result, which it outputs.
Concretely, the specific procedure of the traditional rough set reduction method is:
Method 1
Input: decision table S = (U, C ∪ D, V, f)
Output: attribute reduct R
Step 1: compute U/C and obtain U′, U′_pos, U′_neg;
Step 2: initialize R = ∅;
Step 3: for every a ∈ C − R:
compute the significance sig_R(a) of each attribute in the set, together with B_R(a), NB_R(a) and U′/(R ∪ {a}) (B_R(a) denotes the objects whose equivalence classes lie entirely in U′_pos and take the same value on the decision attributes; NB_R(a) denotes the objects whose equivalence classes lie entirely in U′_neg);
Step 4: let sig_R(a′) = max_a sig_R(a); if more than one attribute attains the maximum, pick any one;
Step 5: R = R ∪ {a′}; U′ = U′ − B_R(a′) − NB_R(a′);
Step 6: if U′ = ∅, output R; otherwise go to the next step;
Step 7: U′_pos = U′_pos − B_R(a′), U′_neg = U′_neg − NB_R(a′);
Step 8: compute U′/(R ∪ {a′}) and go to Step 3.
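The greedy loop of Method 1 can be sketched sequentially as follows. This is an illustrative in-memory Python sketch, not the patented parallel method; for simplicity, significance is measured here by the growth of the positive region, a commonly used equivalent criterion, and the names `positive_region_size` and `greedy_reduct` are assumptions:

```python
def positive_region_size(U, B, d):
    """|POS_B(D)|: number of objects lying in consistent B-equivalence
    classes (classes with a single decision value)."""
    classes = {}
    for x in U:
        classes.setdefault(tuple(x[b] for b in B), []).append(x)
    return sum(len(cls) for cls in classes.values()
               if len({x[d] for x in cls}) == 1)

def greedy_reduct(U, C, d):
    """Greedy significance-based reduction in the spirit of Method 1:
    repeatedly add the attribute that enlarges the positive region most,
    stopping once the full positive region of C is covered."""
    target = positive_region_size(U, C, d)
    R = []
    while positive_region_size(U, R, d) < target:
        best = max((a for a in C if a not in R),
                   key=lambda a: positive_region_size(U, R + [a], d))
        R.append(best)
    return R

U = [
    {"a": 0, "b": 0, "d": 0},
    {"a": 0, "b": 1, "d": 0},
    {"a": 1, "b": 0, "d": 1},
]
print(greedy_reduct(U, ["a", "b"], "d"))  # ['a']: attribute b is redundant here
```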
On the above basis, how the MapReduce-based rough set parallel reduction method of the present invention is realized is elaborated below.
Specifically, step S30 computes the simplified decision table in parallel as follows:
Definition 8: Given a decision table S = (U, C ∪ D, V, f), let S_i = (U_i, C ∪ D, V, f) be sub-decision tables of S satisfying (1) U = ∪_{i=1}^{m} U_i, with the U_i mutually disjoint; this means a decision table can be split into many mutually unrelated sub-decision tables.
Theorem 1: Given a decision table S = (U, C ∪ D, V, f), let S_i = (U_i, C ∪ D, V, f) be the sub-decision tables of S. For any given subset of condition attributes B ⊆ C with equivalence relation U/B = { E_1, E_2, …, E_l }, the following conclusion holds for the sub-decision tables S_i: to obtain the equivalence classes of the whole decision table, one may first compute the equivalence classes of every sub-decision table and then merge the classes of the sub-decision tables that agree on the same attributes.
According to Theorem 1, MapReduce can meet the requirement of computing equivalence classes, and U′_pos and U′_neg can be obtained at the same time as the classes; hence the simplified decision table U′ can likewise be obtained via MapReduce. The parallel method PACSDT (Parallel Algorithm for Computation of a Simplified Decision Table) for computing the simplified decision table S′ is given below; PACSDT consists of two parts, PACSDT-Map and PACSDT-Reduce, described as follows:
Method 2: PACSDT-Map(key, value)
Input: sub-decision table S_i = (U_i, C ∪ D, V, f)
Output: <x_C, x_D>, where x_C is the condition attribute values of object x and x_D the decision attribute value of object x.
For example, the PACSDT-Map(key, value) input format provided by MapReduce is as follows:
After the computation of Method 2 finishes, the Map outputs are sorted by key, and the sorted keys and values are passed on to the reduce stage for further computation; hence each <key, value> passed to Reduce carries one key with a list of values. Each key is in fact one equivalence class of the decision table, and the values are the set of decision attribute values taken on that class.
Method 3: PACSDT-Reduce(key, value)
Input: <x_C, x_D>, where x_C is the condition attribute values of object x and x_D the decision attribute value of object x;
Output: <x_C, x_D + POS_C(D)_flag + x_No>, where x_C is the condition attribute values of object x, and x_D + POS_C(D)_flag + x_No is the combination of the decision attribute value of object x, the POS_C(D) flag and the object number.
For example, the PACSDT-Reduce(key, value) input format provided by MapReduce is as follows:
Through Methods 2 and 3 a new simplified decision table is obtained; besides the features an ordinary decision table has, it carries an extra POS_C(D) flag, which plays an important role in the next step, the computation of attribute significance. If the simplified decision table thus obtained is empty, it is output as the final reduction result of the decision table to be reduced.
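Methods 2 and 3 can be sketched in-memory as follows (illustrative Python; the shuffle is simulated by grouping and sorting keys, and the object-number field x_No is omitted for brevity):

```python
from collections import defaultdict

def pacsdt(U, C, d):
    """In-memory sketch of PACSDT: the Map step emits <x_C, x_D>; after the
    shuffle groups identical keys, the Reduce step keeps one row per
    C-equivalence class, flagged POS if the class is consistent, else NEG."""
    groups = defaultdict(list)
    for x in U:                                    # PACSDT-Map: <x_C, x_D>
        groups[tuple(x[c] for c in C)].append(x[d])
    table = []
    for key, decisions in sorted(groups.items()):  # shuffle + PACSDT-Reduce
        flag = "POS" if len(set(decisions)) == 1 else "NEG"
        table.append((key, decisions[0], flag))
    return table

U = [
    {"a": 1, "b": 0, "d": 1},
    {"a": 1, "b": 0, "d": 0},   # conflicts with the row above
    {"a": 0, "b": 1, "d": 0},
]
print(pacsdt(U, ["a", "b"], "d"))
# [((0, 1), 0, 'POS'), ((1, 0), 1, 'NEG')]
```

On a cluster each Map task would process one sub-decision table S_i; Theorem 1 guarantees that merging the per-split classes by key, exactly what the shuffle does, yields the classes of the whole table.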
Specifically, step S32 computes attribute significance in parallel as follows:
Reduction methods based on attribute significance are widely used in traditional rough set theory and have achieved good results. Because the significance of each attribute can be computed in parallel, attribute significance can serve as the basis for parallelizing attribute reduction. Existing approaches, however, obtain the significance of only one attribute per MapReduce job, so their efficiency is low. The present invention improves the significance computation of Method 1 so that a single MapReduce job computes the significance of all attributes, thereby improving efficiency.
The parallel attribute significance computation method PACAS (Parallel Algorithm for Computation of Attribute Significance) is given below; it consists of three parts, PACAS-Map, PACAS-Reduce and PACAS, described as follows:
Method 4, PACAS-Map (key, value)
Input: simplify decision table S ' i=(U ' i, C ∪ D, V, f)
Output: <c+x_c ∪ R, x_D+POS c(D) _ flag+x_No>c+x_c is each attribute c ∈ C in decision table and the combination of the value of object x on property set c ∪ R, x_D+POS c ∪ R(D) _ flag+x_No is decision attribute and the POS that object x is corresponding c ∪ R(D) sign of _ flag and the combination of object number.
For example, PACAS-Map (key, the value) input format that MapReduce provides is as follows:
By method 4, can obtain the corresponding decision value of getting of each classification of each attribute in each decision table.And after Map finishes, by all <key, value> is to carrying out a sequence, each classification of each attribute will be exported together, and as the input of Reduce.
Method 5: PACAS-Reduce(key, value)
Input: <c+x_{c∪R}, x_D+POS_C(D)_flag+x_No>, where c+x_{c∪R} combines each attribute c ∈ C in the decision table with the value of object x on attribute c∪R, and x_D+POS_C(D)_flag+x_No combines the decision value of object x, its POS_C(D) flag, and its object number.
Output: <c, sig(c)+B_R(c)+NB_R(c)>, where c ∈ C is an attribute of the decision table and sig(c)+B_R(c)+NB_R(c) combines the significance of the attribute with the sets B_R(c) and NB_R(c) computed while evaluating the significance.
For example, the PACAS-Reduce(key, value) input format provided to MapReduce is as follows:
Method 5 yields B_R(c) and NB_R(c), together with |B_R(c)| and |NB_R(c)|, for each equivalence class of each attribute, and the results are saved in a text file on HDFS. From the file contents the significance of each attribute can be computed, and the attribute with the highest significance is selected as one reduction result. The complete attribute-significance computation is described by Method 6.
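The patent does not spell out the semantics of B_R(c) and NB_R(c) at this point, so the sketch below assumes one plausible reading: B_R(c) collects objects whose equivalence class over c∪R is consistent (a single decision value, POS flag 1), NB_R(c) collects objects of classes made up entirely of flag-0 (inconsistent) objects, and sig_R(c) = |B_R(c)| + |NB_R(c)|. Function and variable names are illustrative.

```python
from collections import defaultdict

def pacas_reduce(pairs):
    """Sketch of PACAS-Reduce over the sorted Map output.

    pairs -- iterable of (key, "D_flag_No") where key is c + x_{c∪R}.
    Returns {attribute: (B_R, NB_R, sig_R)} for every attribute at once.
    """
    classes = defaultdict(list)           # one bucket per equivalence class
    for key, val in pairs:
        attr = key.split("_", 1)[0]
        d, flag, no = val.split("_")
        classes[(attr, key)].append((d, flag, no))

    b, nb = defaultdict(set), defaultdict(set)
    for (attr, _), members in classes.items():
        decisions = {d for d, _, _ in members}
        flags = {f for _, f, _ in members}
        if len(decisions) == 1 and flags == {"1"}:
            b[attr].update(no for _, _, no in members)    # consistent class
        elif flags == {"0"}:
            nb[attr].update(no for _, _, no in members)   # wholly inconsistent
    return {a: (b[a], nb[a], len(b[a]) + len(nb[a])) for a in set(b) | set(nb)}
```

Because the shuffle phase has already sorted the keys, a real Reduce implementation would see each equivalence class as a contiguous group rather than building the dictionary explicitly.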
Method 6: PACAS
Input: simplified decision table S'_i = (U'_i, C∪D, V, f)
Output: one reduction result, reduction
For example, the PACAS input format provided to MapReduce is as follows:
begin
let reduction ← ∅;
Initialize one MapReduce job, and compute sig(c) for each equivalence class of each attribute by Methods 4 and 5.
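The driver's selection step after the MapReduce job finishes can be sketched as follows; `sig` stands in for the per-attribute significances parsed back from the HDFS text file, and breaking ties by attribute name is an added assumption for reproducibility (the patent allows picking any one of the tied attributes):

```python
def select_best_attribute(sig):
    """Pick the attribute with the highest significance from the parsed
    result file; ties fall to the alphabetically first attribute (an
    assumption made here for determinism)."""
    return max(sorted(sig), key=lambda c: sig[c])
```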
Specifically, step S50 implements the parallel attribute-significance reduction as follows:
Method 6 yields the reduction result of one round; the result is joined to the reduction set, and then the next attribute-significance computation is performed. Before that computation, however, the simplified decision table must be adjusted again to remove redundant information. This step can also be parallelized, as PACDT (Parallel Algorithm for Computation of Decision Table), which consists of a single PACDT-Map, described as follows:
Method 7: PACDT-Map(key, value)
Input: simplified decision table S'_i = (U'_i, C∪D, V, f)
Output: new simplified decision table S'_i = (U'_i, C∪D, V, f)
For example, the PACDT-Map(key, value) input format provided to MapReduce is as follows:
Method 7 yields a new simplified decision table, which serves as the input decision table for the next round of attribute-significance computation. The complete attribute-significance-based parallel reduction method PACARBAS (Parallel Algorithm for Computation of Attribute Reduction Based on Attribute Significance) is given below:
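A minimal sketch of the PACDT-Map filtering, assuming the redundant objects are exactly those already covered by the attribute just chosen, i.e. the object numbers in B_R(c*) ∪ NB_R(c*) (as in the worked example later in this description, where the object with No = 1 is dropped after attribute a is selected); names and layout are illustrative:

```python
def pacdt_map(record, covered):
    """Sketch of one PACDT-Map call: keep the object unless it is
    already covered by the attribute just added to the reduct.

    record  -- (values, decision, pos_flag, obj_no)
    covered -- set of object numbers in B_R(c*) ∪ NB_R(c*)
    """
    obj_no = record[3]
    if obj_no in covered:
        return []        # redundant information: drop the object
    return [record]      # otherwise pass it through to the new table
```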
Method 8: PACARBAS
Input: decision table S_i = (U_i, C∪D, V, f)
Output: reduction set Reductions
For example, the PACARBAS input format provided to MapReduce is as follows:
begin
let Reductions ← ∅;
compute the simplified decision table S' by Methods 2 and 3;
while S' is not empty do
compute reduction by Method 6;
let Reductions ← Reductions ∪ {reduction};
recompute the simplified decision table by Method 7;
end
return Reductions
end
Method 8 gives the complete reduction procedure: through repeated iterations it adjusts the simplified decision table and finally obtains the reduction result.
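The iteration of Method 8 can be sketched as a driver loop. The two callables stand in for the MapReduce jobs — `compute_sig` for Methods 4-6 and `rebuild_table` for Method 7 — and are assumptions for illustration, not cluster code:

```python
def pacarbas(simplified_table, compute_sig, rebuild_table):
    """Control-flow sketch of Method 8 (PACARBAS).

    compute_sig(table)            -> {attr: (B_R, NB_R, sig_R)}  (Methods 4-6)
    rebuild_table(table, covered) -> next simplified table       (Method 7)
    """
    reductions = []
    table = simplified_table
    while table:                                  # while S' is not empty
        sig = compute_sig(table)
        best = max(sorted(sig), key=lambda c: sig[c][2])
        reductions.append(best)                   # Reductions ← Reductions ∪ {c*}
        covered = sig[best][0] | sig[best][1]     # B_R(c*) ∪ NB_R(c*)
        table = rebuild_table(table, covered)     # drop covered objects
    return reductions
```

Plugging in toy stand-ins for the two jobs reproduces the two-round behavior of the worked example: one attribute is chosen per round until the simplified table is empty.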
Based on the above descriptions of Methods 1 through 8, the implementation of the present invention can in fact be summarized by the execution flow shown in Figure 3.
A concrete decision table is now used to illustrate how reduction is carried out by the above methods, so that those skilled in the art can better understand the technical scheme of the present invention.
Embodiment
First, a decision table S = (U, C∪D, V, f) is given. The table is split into two sub-decision tables, S_1 = (U_1, C∪D, V, f) and S_2 = (U_2, C∪D, V, f), as shown in Tables 1 and 2:
Table 1: sub-decision table S_1
Table 2: sub-decision table S_2
Second, the simplified decision table and the attribute significances are computed in parallel:
Table 3: simplified decision table U'
Map phase: separate the condition attributes from the decision attribute, <x_C, x_D>.
For example:
Key = {1, 1, 1, 2}
Value = {1}
Map phase: append POS_C(D) and the line number, <x_C, x_D+POS_C(D)_flag+x_No>:
For example:
Key = {1, 1, 1, 2}
Value = {1_1_1}
Parallel computation of attribute significance:
Map input: simplified decision table S'_i = (U'_i, C∪D, V, f)
Map output: <c+x_{c∪R}, x_D+POS_C(D)_flag+x_No>
For example, for object No. 1 the output <key, value> pairs are:
{a_1, 1_1_1}
{b_1, 1_1_1}
{c_1, 1_1_1}
{d_1, 1_1_1}
Reduce input: <c+x_{c∪R}, x_D+POS_C(D)_flag+x_No>
Reduce output: <c, sig(c)+B_R(c)+NB_R(c)>
The significance of each attribute is then computed. When the Map outputs are aggregated into Reduce they pass through a sort; the sort is by attribute, so after sorting the keys of the same attribute are grouped together. One Reduce pass therefore computes the significance of all attributes at once, whereas existing methods can compute only one attribute's significance per MapReduce job.
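The grouping effect of the sort can be seen with a tiny simulation (the keys are illustrative stand-ins for the Map output):

```python
# Map output in arbitrary arrival order
pairs = [("b_1", "1_1_1"), ("a_2", "2_1_2"), ("a_1", "1_1_1"), ("b_2", "2_1_2")]
pairs.sort()  # the shuffle phase sorts by key, grouping each attribute together
attrs = [k.split("_")[0] for k, _ in pairs]
# after sorting, every attribute forms exactly one contiguous run
runs = [a for i, a in enumerate(attrs) if i == 0 or attrs[i - 1] != a]
```

Here `runs` comes out as `["a", "b"]`: each attribute occupies one contiguous block, so a single Reduce pass can finish one attribute before starting the next.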
For example, after the computation:
sig_R(a) = 1
sig_R(b) = 0
sig_R(c) = 0
sig_R(d) = 0
Method 6: PACAS
The results are read from HDFS, the significances are computed, and the attribute with the highest significance is selected as the reduction; in the first round, attribute a is selected as output.
Finally, the attribute-significance-based parallel reduction proceeds as follows:
According to B_R(a) and NB_R(a) of attribute a, the simplified decision table is recomputed and the redundant information is deleted; the record with No = 1 is therefore removed. The simplified decision table becomes:
Then the attribute significances are recomputed:
B_R(b) = {X3, X4, X5}, NB_R(b) = {X2, X9}, sig_R(b) = 5
B_R(c) = {X3, X5}, NB_R(c) = {X2}, sig_R(c) = 3
B_R(d) = {X3, X4, X5}, NB_R(d) = {X2, X9}, sig_R(d) = 5
When several attributes have the same significance, one of them is selected at once as the reduction attribute; here b is selected as output. The simplified decision table is then recomputed; the result is empty, so the reduction terminates with the result Reductions = {a, b}.
In addition, the present invention provides a MapReduce-based rough set parallel reduction device, comprising:
a job configuration module for reading a decision table to be reduced;
a task-parallel simplification module for initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process the table in parallel and obtain a marked simplified decision table; if the simplified decision table is empty, it is output as the final reduction result of the decision table to be reduced;
an attribute-significance parallel computation module for, when the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the marked simplified decision table, so as to compute in parallel the significance of each attribute in the marked simplified decision table and write the results to the Hadoop distributed file system;
an attribute-significance parallel reduction module for reading the decision table with the highest attribute significance from the Hadoop distributed file system and deleting the redundant information in it to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
Specifically, the task-parallel simplification module is configured to perform job configuration on the decision table to be reduced so as to obtain a plurality of sub-decision tables; to make the Map function of the first MapReduce model compute the condition attributes and the decision attribute of the decision table to be reduced from the sub-decision tables in parallel and output them; and to make the Reduce function of the first MapReduce model process those condition attributes and the decision attribute to obtain the marked simplified decision table.
Specifically, the attribute-significance parallel computation module is configured to initialize the second MapReduce model; to make the Map function of the second MapReduce model respond to the marked simplified decision table and compute in parallel the decision value taken by each equivalence class of each attribute in the marked simplified decision table; and to make the Reduce function of the second MapReduce model respond to those decision values, obtain the attribute significance of each equivalence class of each attribute, and write the results to the Hadoop distributed file system.
Specifically, the attribute-significance parallel reduction module is further configured, when a plurality of decision tables with the highest attribute significance are read from the Hadoop distributed file system, to select one of them at random and delete the redundant information in it to obtain the new decision table to be reduced.
Further, the present invention also provides a MapReduce-based rough set parallel reduction system, comprising:
a job configuration unit for reading a decision table to be reduced;
a task-parallel simplification unit for initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process the table in parallel and obtain a marked simplified decision table; if the simplified decision table is empty, it is output as the final reduction result of the decision table to be reduced;
an attribute-significance parallel computation unit for, when the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the marked simplified decision table, so as to compute in parallel the significance of each attribute in the marked simplified decision table and write the results to the Hadoop distributed file system;
an attribute-significance parallel reduction unit for reading the decision table with the highest attribute significance from the Hadoop distributed file system and deleting the redundant information in it to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
Specifically, the task-parallel simplification unit is configured to perform job configuration on the decision table to be reduced so as to obtain a plurality of sub-decision tables; to make the Map function of the first MapReduce model compute the condition attributes and the decision attribute of the decision table to be reduced from the sub-decision tables in parallel and output them; and to make the Reduce function of the first MapReduce model process those condition attributes and the decision attribute to obtain the marked simplified decision table.
Specifically, the attribute-significance parallel computation unit is configured to initialize the second MapReduce model; to make the Map function of the second MapReduce model respond to the marked simplified decision table and compute in parallel the decision value taken by each equivalence class of each attribute in the marked simplified decision table; and to make the Reduce function of the second MapReduce model respond to those decision values, obtain the attribute significance of each equivalence class of each attribute, and write the results to the Hadoop distributed file system.
Specifically, the attribute-significance parallel reduction unit is further configured, when a plurality of decision tables with the highest attribute significance are read from the Hadoop distributed file system, to select one of them at random and delete the redundant information in it to obtain the new decision table to be reduced.
In summary, compared with the prior art, the present invention has the following advantages:
First, existing parallel reduction methods cannot obtain an accurate parallel reduction result, because they reduce each split sub-decision table directly in the Map phase and then merge the partial reduction results, while a correct reduction requires complete equivalence classes. Such methods therefore derive the reduction from partial data, and the result may be incorrect or inexact.
Second, existing parallel reduction methods are limited: they require the decision table to be consistent. A consistent decision table is defined as follows: a decision table S is consistent if all of its objects belong to POS_C(D); if some object belongs to U − POS_C(D), the table is inconsistent. The POS_C(D) here is the positive region, i.e. the mark given in Method 3; a consistent decision table is one in which every POS_C(D) mark is 1. The method proposed here has no such restriction and is applicable to all decision tables.
Third, the present invention is efficient. First, existing methods mainly perform reduction on the original decision table, whereas this method first derives a simplified decision table; reducing the simplified table greatly decreases the amount of computation and thus improves efficiency. Second, when computing attribute significance, existing methods obtain the significance of only one condition attribute per parallel MapReduce job, whereas this method needs only a single MapReduce job plus one simple read of the result text to obtain the significances of all condition attributes and complete one reduction step, which further raises the method's efficiency.
The above embodiments merely illustrate the principle and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (10)

1. A MapReduce-based rough set parallel reduction method, characterized by comprising:
reading a decision table to be reduced;
initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process the decision table to be reduced in parallel and obtain a marked simplified decision table:
if the simplified decision table is empty, outputting it as the final reduction result of the decision table to be reduced;
if the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the marked simplified decision table, so as to compute in parallel the significance of each attribute in the marked simplified decision table and write the results to the Hadoop distributed file system;
reading the decision table with the highest attribute significance from the Hadoop distributed file system and deleting the redundant information in it to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
2. The MapReduce-based rough set parallel reduction method according to claim 1, characterized in that the specific method of using the first MapReduce model to process the decision table to be reduced in parallel to obtain the simplified decision table comprises:
performing job configuration on the decision table to be reduced to obtain a plurality of sub-decision tables;
making the Map function of the first MapReduce model compute the condition attributes and the decision attribute of the decision table to be reduced from the plurality of sub-decision tables in parallel and output them;
making the Reduce function of the first MapReduce model process the condition attributes and the decision attribute to obtain the marked simplified decision table.
3. The MapReduce-based rough set parallel reduction method according to claim 1 or 2, characterized in that the specific method of using the second MapReduce model to compute in parallel the significance of each attribute in the simplified decision table comprises:
initializing the second MapReduce model;
making the Map function of the second MapReduce model respond to the marked simplified decision table and compute in parallel the decision value taken by each equivalence class of each attribute in the marked simplified decision table;
making the Reduce function of the second MapReduce model respond to the decision values, obtain the attribute significance of each equivalence class of each attribute, and write the results to the Hadoop distributed file system.
4. The MapReduce-based rough set parallel reduction method according to claim 1 or 3, characterized in that, when a plurality of decision tables with the highest attribute significance are read from the Hadoop distributed file system, one of them is selected at random and the redundant information in it is deleted to obtain the new decision table to be reduced.
5. A MapReduce-based rough set parallel reduction device, characterized by comprising:
a job configuration module for reading a decision table to be reduced;
a task-parallel simplification module for initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process the table in parallel and obtain a marked simplified decision table; if the simplified decision table is empty, it is output as the final reduction result of the decision table to be reduced;
an attribute-significance parallel computation module for, when the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the marked simplified decision table, so as to compute in parallel the significance of each attribute in the marked simplified decision table and write the results to the Hadoop distributed file system;
an attribute-significance parallel reduction module for reading the decision table with the highest attribute significance from the Hadoop distributed file system and deleting the redundant information in it to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
6. The MapReduce-based rough set parallel reduction device according to claim 1, characterized in that:
the task-parallel simplification module is specifically configured to perform job configuration on the decision table to be reduced so as to obtain a plurality of sub-decision tables; to make the Map function of the first MapReduce model compute the condition attributes and the decision attribute of the decision table to be reduced from the plurality of sub-decision tables in parallel and output them; and to make the Reduce function of the first MapReduce model process the condition attributes and the decision attribute to obtain the marked simplified decision table.
7. The MapReduce-based rough set parallel reduction device according to claim 1, characterized in that:
the attribute-significance parallel computation module is specifically configured to initialize the second MapReduce model; to make the Map function of the second MapReduce model respond to the marked simplified decision table and compute in parallel the decision value taken by each equivalence class of each attribute in the marked simplified decision table; and to make the Reduce function of the second MapReduce model respond to the decision values, obtain the attribute significance of each equivalence class of each attribute, and write the results to the Hadoop distributed file system.
8. The MapReduce-based rough set parallel reduction device according to claim 1, characterized in that the attribute-significance parallel reduction module is further configured, when a plurality of decision tables with the highest attribute significance are read from the Hadoop distributed file system, to select one of them at random and delete the redundant information in it to obtain the new decision table to be reduced.
9. A MapReduce-based rough set parallel reduction system, characterized by comprising:
a job configuration unit for reading a decision table to be reduced;
a task-parallel simplification unit for initializing a first MapReduce model and making it respond to the decision table to be reduced, so as to process the table in parallel and obtain a marked simplified decision table; if the simplified decision table is empty, it is output as the final reduction result of the decision table to be reduced;
an attribute-significance parallel computation unit for, when the simplified decision table is non-empty, initializing a second MapReduce model and making it respond to the marked simplified decision table, so as to compute in parallel the significance of each attribute in the marked simplified decision table and write the results to the Hadoop distributed file system;
an attribute-significance parallel reduction unit for reading the decision table with the highest attribute significance from the Hadoop distributed file system and deleting the redundant information in it to obtain a new decision table to be reduced, which is fed back as the input of the first MapReduce model for further reduction.
10. The MapReduce-based rough set parallel reduction system according to claim 9, characterized in that:
the task-parallel simplification unit is specifically configured to perform job configuration on the decision table to be reduced so as to obtain a plurality of sub-decision tables; to make the Map function of the first MapReduce model compute the condition attributes and the decision attribute of the decision table to be reduced from the plurality of sub-decision tables in parallel and output them; and to make the Reduce function of the first MapReduce model process the condition attributes and the decision attribute to obtain the marked simplified decision table;
the attribute-significance parallel computation unit is specifically configured to initialize the second MapReduce model; to make the Map function of the second MapReduce model respond to the marked simplified decision table and compute in parallel the decision value taken by each equivalence class of each attribute in the marked simplified decision table; and to make the Reduce function of the second MapReduce model respond to the decision values, obtain the attribute significance of each equivalence class of each attribute, and write the results to the Hadoop distributed file system;
the attribute-significance parallel reduction unit is further configured, when a plurality of decision tables with the highest attribute significance are read from the Hadoop distributed file system, to select one of them at random and delete the redundant information in it to obtain the new decision table to be reduced.
CN201410325508.9A 2014-07-09 2014-07-09 Rough set parallel reduction method, device and system based on MapReduce Active CN104063230B (en)

Publications (2)

Publication Number Publication Date
CN104063230A (application publication) 2014-09-24
CN104063230B (granted publication) 2017-03-01
