CN104063230A - Rough set parallel reduction method, device and system based on MapReduce - Google Patents


Info

Publication number
CN104063230A
Authority
CN
China
Prior art keywords
decision table
decision
attribute
mapreduce
reduced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410325508.9A
Other languages
Chinese (zh)
Other versions
CN104063230B (en)
Inventor
席大超
王国胤
张学睿
张帆
封雷
李广砥
邓伟辉
郭义帅
谢亮
董建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Green and Intelligent Technology of CAS
Original Assignee
Chongqing Institute of Green and Intelligent Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Green and Intelligent Technology of CAS filed Critical Chongqing Institute of Green and Intelligent Technology of CAS
Priority to CN201410325508.9A priority Critical patent/CN104063230B/en
Publication of CN104063230A publication Critical patent/CN104063230A/en
Application granted granted Critical
Publication of CN104063230B publication Critical patent/CN104063230B/en
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a rough set parallel reduction method, device and system based on MapReduce. In the method, after the decision table to be reduced is read, it is first simplified; attribute importance is then computed in parallel on the simplified decision table; and finally parallel reduction by attribute importance is performed to obtain the final reduction result. With the method, the importance of all attributes can be worked out with a single MapReduce pass, and after each reduction result is obtained the redundant information of the simplified decision table is deleted again, making the table ever more compact, so the computation speed can be further improved. The rough set parallel reduction device and system likewise solve the problems that existing knowledge reduction methods carry certain limiting conditions and cannot perform parallel reduction efficiently, and they further optimize the storage space.

Description

MapReduce-based rough set parallel reduction method, device and system
Technical Field
The invention relates to the field of knowledge reduction, in particular to a rough set parallel reduction method, device and system based on MapReduce.
Background
With the advent of the big data era, classical reduction methods cannot load the data into memory at one time and therefore cannot meet the requirements of big data. Accordingly, a main objective of those skilled in the art is how to perform data mining accurately and rapidly on big data.
Google's distributed file system GFS (Google File System), its parallel programming model MapReduce, and its distributed data storage system BigTable provide a foundation for processing big data. In general, the classical approaches to data mining in this setting primarily involve the following.
The rough set, a classical tool for dealing with vagueness and uncertainty, is widely used in the fields of machine learning and data mining. Knowledge reduction is one of the important research topics in rough set theory and a key step of knowledge acquisition. In rough set theory, "knowledge" is regarded as an ability to classify: human behavior rests on the ability to distinguish real or abstract objects. For example, in ancient times people had to distinguish what could be eaten from what could not in order to survive, and a doctor diagnosing a patient must distinguish which disease the patient is suffering from. These abilities to classify things according to their characteristic differences can all be regarded as some kind of "knowledge". Knowledge reduction deletes unnecessary knowledge from the knowledge base while maintaining its classification ability; by deleting redundant knowledge, the clarity of the latent knowledge of the information system can be greatly improved.
MapReduce is a programming model (i.e., software framework) in the Hadoop distributed file system; applications written on it can run on large clusters of thousands of commodity machines and process terabyte-scale data sets in parallel in a reliable, fault-tolerant manner. A MapReduce job typically splits the input data set into several independent data blocks, which are processed in a completely parallel manner by the map tasks. The framework first sorts the outputs of the maps and then feeds the results to the reduce tasks. Typically both the input and the output of a job are stored in a file system. The framework is responsible for scheduling and monitoring the tasks and for re-executing tasks that fail.
Typically, the MapReduce framework and the Hadoop distributed file system run on the same set of nodes, i.e., the compute nodes and the storage nodes are usually the same. This configuration allows the framework to schedule tasks efficiently on the nodes where the data already resides, so the network bandwidth of the whole cluster can be utilized very efficiently. The map function and the reduce function are left to the user to implement, and these two functions define the task itself.
In the existing theory, see the literature for details:
1) Zhang J, Li T, Ruan D, et al. A parallel method for computing rough set approximations [J]. Information Sciences, 2012, 194: 209-223;
2) Zhang J, Wong J-S, Li T, Pan Y. A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems [J]. International Journal of Approximate Reasoning, 2013.
In the above documents, a rough set parallel approximation model and a rough set knowledge acquisition parallel model based on it are proposed. The model gives a good theoretical demonstration and proves the feasibility of the rough set parallel model, but it only parallelizes the most basic rough set operations and does not involve the rough set reduction method.
In addition, in the literature:
3) A knowledge reduction algorithm in a cloud computing environment [J]. 2011, 34(12): 2332–;
4) Research on a discernibility-matrix knowledge reduction algorithm in a cloud computing environment [J]. Computer Science, 2011, 38(8).
These works propose a rough set parallel reduction model, but the method has many limitations: a consistent (compatible) decision table is required to carry out reduction under big data, which greatly limits its practical application.
In brief, the above prior knowledge reduction methods mainly have the following drawbacks:
First, although rough set computations can be processed in parallel, reduction itself cannot be performed.
Second, although there is a method capable of parallelizing rough set reduction, it is limited to consistent decision tables and is therefore very restricted in practical application.
Finally, the operation efficiency of the existing parallel reduction model is not high and needs to be improved.
Disclosure of Invention
In view of the above disadvantages or shortcomings of the prior art, an object of the present invention is to provide a rough set parallel reduction method, apparatus and system based on MapReduce, which are used to solve the problems that the knowledge reduction method in the prior art has certain limitations and cannot efficiently perform parallelization reduction.
In order to achieve the above objects and other related objects, the present invention provides the following technical solutions:
a rough set parallel reduction method based on MapReduce comprises the following steps:
reading a decision table to be reduced;
initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel computing processing on the decision table to be reduced to obtain a simplified decision table with a mark:
if the simplified decision table is empty, it is taken as the final reduction result of the decision table to be reduced and output;
if the simplified decision table is not empty, initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel calculation and write the result into a Hadoop distributed file system;
and reading a decision table with the highest attribute importance in the Hadoop distributed file system, deleting redundant information in the decision table to obtain a new decision table to be reduced, and enabling the new decision table to be reduced to be used as an input value of the first MapReduce model to be reduced again.
In addition, the invention also provides a rough set parallel reduction device based on MapReduce, which comprises:
the operation configuration module is used for reading a decision table to be reduced;
the task parallel simplification module is used for initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel calculation processing on the decision table to be reduced to obtain a simplified decision table with marks, and if the simplified decision table is empty, enabling the simplified decision table to be used as a final reduction result of the decision table to be reduced and outputting the final reduction result;
the attribute importance parallel computing module is used for initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks if the simplified decision table is non-empty, so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel computing and write the result into a Hadoop distributed file system;
and the attribute importance degree parallel reduction module is used for reading the decision table with the highest attribute importance degree in the Hadoop distributed file system and deleting redundant information in the decision table to obtain a new decision table to be reduced, and the new decision table to be reduced is used as an input value of the first MapReduce model to be reduced again.
In addition, the invention also provides a rough set parallel reduction system based on MapReduce, which comprises:
the operation configuration unit is used for reading a decision table to be reduced;
the task parallel simplification unit is used for initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel calculation processing on the decision table to be reduced to obtain a simplified decision table with marks, and if the simplified decision table is empty, enabling the simplified decision table to be used as a final reduction result of the decision table to be reduced and outputting the final reduction result;
the attribute importance parallel computing unit is used for initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks if the simplified decision table is non-empty, so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel computing and write the result into a Hadoop distributed file system;
and the attribute importance degree parallel reduction unit is used for reading the decision table with the highest attribute importance degree in the Hadoop distributed file system and deleting redundant information in the decision table to obtain a new decision table to be reduced, and the new decision table to be reduced is used as an input value of the first MapReduce model to be reduced again.
In summary, compared with the prior art, the invention has the following advantages:
Firstly, the invention simplifies the decision table before computing attribute importance in parallel, and then selects the attribute with the highest importance from the computed results to carry out the reduction, so that the obtained reduction result is more accurate.
Second, the present invention places no restriction on the tables to be reduced, and has a wider application range than the prior art, which can only reduce consistent (compatible) decision tables.
Thirdly, other existing methods can obtain the importance of only one condition attribute per MapReduce round when computing attribute importance, whereas the present method obtains the importance of all condition attributes with one MapReduce round plus a simple text-reading computation, completing one reduction step per round and thereby improving efficiency.
Fourthly, after the decision table is reduced, the storage space can be effectively optimized, and at the same time computations that use the reduced result become more efficient.
Drawings
FIG. 1 is a flow chart of the working principle of MapReduce.
FIG. 2 is a flowchart illustrating the operation of the rough set parallel reduction method based on MapReduce according to the present invention.
FIG. 3 is a simplified schematic diagram of the rough set parallel reduction method based on MapReduce according to the present invention.
Description of the reference numerals
S10-S50 steps
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
The present invention is implemented based on MapReduce and rough sets; in order to make the technical solution clear and understandable to those skilled in the art, MapReduce and rough sets are explained and illustrated below.
MapReduce related overview
MapReduce is a programming model for parallel operations on large-scale data sets. The concepts "Map" and "Reduce" and their main ideas are borrowed from functional programming languages, along with features borrowed from vector programming languages. The model greatly eases running programs on distributed systems for programmers without experience in distributed parallel programming. Current software implementations specify a Map function that maps a set of key-value pairs into a new set of key-value pairs, and a concurrent Reduce function that merges all mapped values sharing the same key.
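For illustration, the following minimal Python sketch simulates the model locally (a hypothetical word-count example, not an actual Hadoop job; the function names and sample data are assumptions made purely for illustration): the Map function emits one <key, value> pair per word, the shuffle is simulated by sorting on the key, and the Reduce function merges all values that share the same key.

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit one <word, 1> pair per word of the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all counts that share the same key (word).
    return (key, sum(values))

def run_job(lines):
    # The shuffle phase is simulated by sorting all intermediate pairs on the
    # key and grouping them, so each key reaches the reducer exactly once.
    intermediate = sorted(pair for line in lines for pair in map_fn(line))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(run_job(["big data rough set", "rough set reduction"]))
# [('big', 1), ('data', 1), ('reduction', 1), ('rough', 2), ('set', 2)]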
Referring to fig. 1, the working principle of MapReduce is briefly described.
Map side
First, each input split is processed by a Map task; by default the size of one HDFS block (e.g., 64 MB) is one split, although the block size can be configured. The results output by Map are temporarily placed in a ring memory buffer; when the buffer is about to overflow, a spill file is created in the local file system and the data in the buffer is written into it.
Secondly, before writing to disk, the thread first divides the data into partitions equal in number to the Reduce tasks, i.e., one Reduce task corresponds to the data of one partition. This avoids the situation where some Reduce tasks are assigned large amounts of data while others are assigned little or none. In fact, partitioning is a process of hashing the data; the data within each partition is then sorted, so that as little data as possible is written to disk.
Third, when the Map task outputs its last record, there may be many spill files that need to be merged. The purpose is twofold: to minimize the amount of data written to disk each time, and to minimize the amount of data transmitted over the network in the following copy phase; finally the files are merged into one partitioned and sorted file. The data may also be compressed in order to reduce the amount transmitted over the network.
Fourth, the data in the partition is copied to the corresponding Reduce task.
Reduce side
First, Reduce receives data from different Map tasks, and the data from each Map is sorted. If the amount of data received by the Reduce side is small enough, it is kept directly in memory; once it exceeds a certain proportion of the buffer size, the data is merged and spilled to disk.
Second, as the number of spill files increases, a background thread merges them into a larger sorted file in order to save time in later merges. In fact, MapReduce repeatedly executes sort and merge operations, on both the Map side and the Reduce side.
Third, many intermediate files are written to disk during the merging process, but MapReduce writes as little data to disk as possible, and the result of the last merge is not written to disk but is fed directly to the Reduce function.
Rough set related overview
First, the basic concepts related to rough sets and some explanations of the MapReduce model will be described.
Definition 1: a decision table is an information table knowledge expression system S ═<U,R,V,f>R ═ C ═ D attribute set, subsets C and D become condition attribute set and result attribute set, respectively, V ═ u @r∈RVrIs a collection of attribute values, VrThe attribute range representing the attribute R ∈ R, i.e., the value range of the attribute R, f: U × R → V is an information function that specifies the attribute value of each object x in U. For each attribute subsetWe define an unresolvable binary relationship IND (B), i.e.
<math> <mrow> <mi>IND</mi> <mrow> <mo>(</mo> <mi>B</mi> <mo>)</mo> </mrow> <mo>=</mo> <mo>{</mo> <mrow> <mo>(</mo> <mi>x</mi> <mo>,</mo> <mi>y</mi> <mo>)</mo> </mrow> <mo>|</mo> <mrow> <mo>(</mo> <mi>x</mi> <mo>,</mo> <mi>y</mi> <mo>)</mo> </mrow> <mo>&Element;</mo> <msup> <mi>U</mi> <mn>2</mn> </msup> <mo>,</mo> <mo>&ForAll;</mo> <mi>b</mi> <mo>&Element;</mo> <mi>B</mi> <mrow> <mo>(</mo> <mi>b</mi> <mrow> <mo>(</mo> <mi>x</mi> <mo>)</mo> </mrow> <mo>=</mo> <mi>b</mi> <mrow> <mo>(</mo> <mi>y</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>}</mo> </mrow> </math>
Definition 2: given a knowledge expression system S ═<U,R,V,f>For each subsetAnd the ambiguous relationship B, the upper approximation set of X and the lower approximation set, respectively, can be defined by the basis of B as follows:
definition 3: set BNB(X)=B-(X) \ B _ (X) is referred to as the B boundary of X. POS (Point of sale)B(X) ═ B _ (X) the B positive domain referred to as X; NEGBThe (X) ═ U \ B _ (X) is referred to as the negative field of X.
Definition 4: in the decision table S ═ (U, C ═ D, V, f), P and Q are two equivalence clusters defined on U, if POSP(Q)=POS(P\{r})(Q), then r is (unnecessary) to Q can be omitted in P, Q can be omitted in P for short; otherwise, let r be non-omissible (essential) in P relative to Q.
Definition 5: in the decision table S ═ (U, C ═ D, V, f), P and Q are two equivalence clusters defined on U, if P is a Q independent subset of PWith POSs(Q)=POSP(Q), S is called Q reduction of P.
Definition 6: in the decision table S { (U, C { [ U { ]) } { [ U'1]c,[u′2]c,…[u′m]cIs a division of the corpus U into attribute sets C, U '═ U'1,u′2,…,u′m},Wherein <math> <mrow> <mo>&ForAll;</mo> <msubsup> <mi>u</mi> <msub> <mi>i</mi> <mi>s</mi> </msub> <mo>&prime;</mo> </msubsup> <mo>&Element;</mo> <msup> <mi>U</mi> <mo>&prime;</mo> </msup> </mrow> </math> And is <math> <mrow> <mo>|</mo> <msub> <mrow> <mo>[</mo> <msub> <msup> <mi>u</mi> <mo>&prime;</mo> </msup> <msub> <mi>i</mi> <mi>s</mi> </msub> </msub> <mo>]</mo> </mrow> <mi>c</mi> </msub> <mo>\</mo> <mi>D</mi> <mo>|</mo> <mo>=</mo> <mn>1</mn> <mo>,</mo> <mrow> <mo>(</mo> <mi>s</mi> <mo>=</mo> <mn>1,2</mn> <mo>,</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>,</mo> <mi>t</mi> <mo>)</mo> </mrow> <mo>;</mo> </mrow> </math> Note the book <math> <mrow> <msubsup> <mi>U</mi> <mi>pos</mi> <mo>&prime;</mo> </msubsup> <mo>=</mo> <mo>{</mo> <mo>[</mo> <msub> <msup> <mi>u</mi> <mo>&prime;</mo> </msup> <msub> <mi>i</mi> <mn>1</mn> </msub> </msub> <mo>,</mo> <msub> <msup> <mi>u</mi> <mo>&prime;</mo> </msup> <msub> <mi>i</mi> <mn>2</mn> </msub> </msub> <mo>,</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>,</mo> <msub> <msup> <mi>u</mi> <mo>&prime;</mo> </msup> <msub> <mi>i</mi> <mi>t</mi> </msub> </msub> <mo>}</mo> <mo>,</mo> </mrow> </math> U′neg=U′-U′pos,U′=U′pos∪U′neg. The simplified decision table is called S ═ (U', C ═ D, V, f).
Definition 7: in the decision table S ═ (U, C ═ D, V, f), S ═ U', C ═ D, V, f) is a simplified decision table, is defined as
sigp(a)=|U′P∪{a}-U′p|
Wherein,
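For illustration, the following minimal Python sketch makes Definitions 1-6 concrete (the data layout — each object a tuple of condition attribute values with the decision value last — and all function names are assumptions made purely for illustration):

from collections import defaultdict

def partition(U, B):
    # U/IND(B): group objects by their values on attribute subset B (Definition 1).
    classes = defaultdict(list)
    for x in U:
        classes[tuple(x[b] for b in B)].append(x)
    return list(classes.values())

def approximations(U, B, X):
    # Definition 2: lower and upper approximations of X with respect to B;
    # Definition 3 then gives POS_B(X) = lower and NEG_B(X) = U minus upper.
    X = set(X)
    lower = [x for cls in partition(U, B) if set(cls) <= X for x in cls]
    upper = [x for cls in partition(U, B) if set(cls) & X for x in cls]
    return lower, upper

def simplify(U, C):
    # Definition 6: keep one representative per class of U/C; a representative
    # goes to U'_pos when its whole class agrees on the decision value.
    U_pos, U_neg = [], []
    for cls in partition(U, C):
        (U_pos if len({x[-1] for x in cls}) == 1 else U_neg).append(cls[0])
    return U_pos, U_neg

# A toy table: two condition attributes plus one decision attribute.
U = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
print(simplify(U, [0, 1]))  # ([(0, 1, 1), (0, 0, 1)], [(1, 0, 1)])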
Based on the above summary of MapReduce and rough sets, the detailed implementation of the MapReduce-based rough set parallel reduction method is described below in conjunction with embodiments.
In the present invention, a "decision table" refers to data that carries a decision attribute; that is, data with a decision attribute is the object of reduction in the present invention.
Fig. 2 shows a schematic flow diagram of the rough set parallel reduction method based on MapReduce of the present invention, where the rough set parallel reduction method based on MapReduce includes:
S10, reading the decision table to be reduced: before parallel reduction is performed, the decision table to be reduced is read first; it may be read directly from local storage (for example, the Hadoop distributed file system), or read from a network node to local storage.
S30, obtaining a simplified decision table: initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel computing processing on the decision table to be reduced to obtain a simplified decision table with marks.
S31: if the simplified decision table is empty, taking it as the final reduction result of the decision table to be reduced and outputting the final reduction result;
S32, parallel computing of attribute importance: if the simplified decision table is not empty, initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel calculation and write the result into a Hadoop distributed file system;
S50, parallel reduction by attribute importance: reading the decision table with the highest attribute importance in the Hadoop distributed file system, deleting redundant information in the decision table to obtain a new decision table to be reduced, and enabling the new decision table to be reduced to be used as an input value of the first MapReduce model to be reduced again.
Firstly, compared with existing reduction methods, the rough set parallel reduction method based on MapReduce first obtains a simplified decision table, and reduction on the simplified decision table greatly reduces the amount of computation, thereby improving efficiency. In addition, existing reduction methods can obtain the importance of only one condition attribute per MapReduce round when computing attribute importance, whereas the present method obtains the importance of all condition attributes with one MapReduce round plus a simple text-reading computation and finishes one reduction step per round, thereby improving reduction efficiency.
Secondly, since the rough set parallel reduction method based on MapReduce is mainly an improvement on the prior art, it is necessary to briefly introduce the traditional attribute reduction method first.
Based on the explanation of rough sets given above, a traditional method for rapid attribute reduction is as follows: it takes the importance of attributes as the reduction index and, in each round, takes the attribute with the highest importance as part of the reduction result; when the set U′ becomes empty the method stops, i.e., an optimal reduction result has been found, and the result is output.
Specifically, the implementation of the traditional rough set reduction method is given below:
Method 1
Input: decision table S = (U, C ∪ D, V, f)
Output: attribute reduct R
First step: compute U/C to obtain U′, U′_pos and U′_neg;
Second step: let R = ∅;
Third step: for every a ∈ C − R, do the following:
compute the significance sig_R(a) of each attribute a in the set, together with B_R(a), NB_R(a) and U′/(R ∪ {a}), where B_R(a) denotes the union of the equivalence classes whose elements all lie in U′_pos and take the same value on the decision attribute, and NB_R(a) denotes the union of the equivalence classes whose elements all lie in U′_neg;
Fourth step: let sig_R(a′) = max_a sig_R(a); if more than one attribute attains the maximum, take any one of them;
Fifth step: R = R ∪ {a′}; U′ = U′ − B_R(a′) − NB_R(a′);
Sixth step: if U′ = ∅, output R; otherwise, go to the next step;
Seventh step: U′_pos = U′_pos − B_R(a′); U′_neg = U′_neg − NB_R(a′);
Eighth step: compute U′/(R ∪ {a′}) and go to the third step.
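For illustration, the following minimal sequential Python sketch follows the steps of method 1 (the tuple-based data layout and all names are assumptions made purely for illustration; U′ is given as the output of a simplification step such as Definition 6):

def partition(objs, attrs):
    # Group objects by their values on the given attribute indices.
    classes = {}
    for x in objs:
        classes.setdefault(tuple(x[a] for a in attrs), []).append(x)
    return classes.values()

def quick_reduction(U_prime, U_pos, C):
    # Method 1: greedily add the attribute of maximal significance and shrink
    # the simplified universe U' until it is exhausted.
    R, U_prime, U_pos = [], list(U_prime), set(U_pos)
    while U_prime:
        best = None
        for a in C:
            if a in R:
                continue
            B, NB = [], []
            for cls in partition(U_prime, R + [a]):
                if all(x in U_pos for x in cls) and len({x[-1] for x in cls}) == 1:
                    B += cls   # B_R(a): consistent classes wholly inside U'_pos
                elif all(x not in U_pos for x in cls):
                    NB += cls  # NB_R(a): classes wholly inside U'_neg
            sig = len(B) + len(NB)  # sig_R(a), in the sense of Definition 7
            if best is None or sig > best[0]:
                best = (sig, a, B, NB)
        if best is None:
            break  # no candidate attribute left
        _, a_best, B, NB = best
        R.append(a_best)            # fifth step: R = R ∪ {a'}
        removed = set(B) | set(NB)  # fifth and seventh steps: shrink U', U'_pos
        U_prime = [x for x in U_prime if x not in removed]
        U_pos -= removed
    return R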
On the basis of the above, how to implement the rough set parallel reduction method based on MapReduce of the present invention will be described in detail below.
Specifically, how to implement the parallel computation of the simplified decision table in step S30 is as follows:
definition 8: giving a decision table S ═ (U, C ^ D, V, f), and makingSi=(UiC ^ D, V, f) is a sub-decision table of S, which satisfies the following condition: <math> <mrow> <mrow> <mo>(</mo> <mn>1</mn> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mi>U</mi> <mo>=</mo> <msubsup> <mo>&cup;</mo> <mrow> <mi>i</mi> <mo>-</mo> <mn>1</mn> </mrow> <mi>m</mi> </msubsup> <msub> <mi>U</mi> <mi>i</mi> </msub> <mo>;</mo> </mrow> </math> this means that we can split a decision table into many sub-decision tables that are not related to each other.
Theorem 1: giving a decision table S ═ (U, C ^ D, V, f), and makingSi=(UiAnd C ^ D, V, f) is a sub-decision table of S. Given an arbitrary subset of conditional attributesHaving equivalence relation U/B ═ E1,E2,…EiFor sub-decision table SiThe following conclusions can be drawn: the equivalence class of the decision table is required, and the equivalence class of each sub-decision table can be solved firstly. And then merging the same equivalence classes with the same attribute in the sub-decision tables to obtain the equivalent equivalence class.
According to the theorem 1, MapReduce can meet the requirement of obtaining equivalence class and obtain U 'of the equivalence class'pos,U′negAnd can be obtained simultaneously, so that the simplified decision table U' can be obtained by MapReduce. The parallel method PACSDT (parallel Algorithm for computing of a Simplified Decision Table) for computing the Simplified Decision table S' is given below, and the method PACSDT consists of two parts, PACSDT-Map and PACSDT-Reduce. The description is as follows:
method 2, PACSDT-Map (key, value)
Inputting: decision table Si=(Ui,C∪D,V,f),
And (3) outputting: < x _ C, x _ D > x _ C: condition attribute corresponding to object x, x _ D: the decision attribute corresponding to object x.
For example, the PACSDT-Map (key, value) input format provided by MapReduce is as follows:
after the calculation is finished by the method 2, sorting is carried out according to the key values output by the Map, and the sorted keys and values are transmitted to Reduce for further calculation, so that one key of < key, value > transmitted to Reduce comprises a plurality of values. Thus, each key is actually an equivalent class of the decision table. value is the set of decision attributes taken on the equivalence class.
Method 3: PACSDT-Reduce(key, value)
Input: <x_C, x_D>, where x_C is the vector of condition attribute values of object x and x_D is the decision attribute value of object x;
Output: <x_C, x_D + POS_C(D)_flag + x_No>, where the value carries the decision attribute of object x, a POS_C(D) flag, and the object number.
Through methods 2 and 3, a new simplified decision table is obtained; besides the usual features of an ordinary decision table, it carries one extra POS_C(D) flag, which plays an important role in the subsequent computation of attribute importance. If the simplified decision table is empty, it is taken as the final reduction result of the decision table to be reduced and output.
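For illustration, the following minimal Python sketch simulates the PACSDT pair locally (the record layout — a list of condition values followed by the decision value — and all names are assumptions made purely for illustration; in a real job the two functions would run as Hadoop Map and Reduce tasks and the shuffle phase would perform the sort):

from itertools import groupby
from operator import itemgetter

def pacsdt_map(record, n_cond):
    # PACSDT-Map: emit <x_C, x_D>; the key is the tuple of condition attribute
    # values, the value is the decision attribute value.
    return (tuple(record[:n_cond]), record[n_cond])

def pacsdt_reduce(key, decisions, no):
    # PACSDT-Reduce: one call per equivalence class; emit one representative
    # with POS_C(D)_flag = 1 if the class is decision-consistent, else 0.
    flag = 1 if len(set(decisions)) == 1 else 0
    return (key, (decisions[0], flag, no))

def pacsdt(records, n_cond):
    # Shuffle: sort the map output on the key so that each equivalence class
    # is grouped together before it reaches the reducer.
    pairs = sorted((pacsdt_map(r, n_cond) for r in records), key=itemgetter(0))
    return [pacsdt_reduce(key, [v for _, v in grp], no)
            for no, (key, grp) in enumerate(groupby(pairs, key=itemgetter(0)), 1)]

print(pacsdt([[1, 1, 1, 2, 1], [1, 1, 1, 2, 0], [0, 1, 2, 1, 1]], n_cond=4))
# [((0, 1, 2, 1), (1, 1, 1)), ((1, 1, 1, 2), (1, 0, 2))]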
Specifically, how to implement the parallel computation of attribute importance in step S32 is as follows:
The attribute-importance-based reduction method is widely applied in traditional rough sets and works well. Because the importance of each attribute can be calculated in parallel, attribute importance can serve as the parallel mode of attribute reduction. However, one MapReduce pass can normally obtain the importance of only one attribute, which is inefficient. The present invention improves the attribute importance computation of method 1 so that the importance of all attributes can be calculated with a single MapReduce pass, thereby improving efficiency.
A parallel attribute importance calculation method PACAS (Parallel Algorithm for Computing Attribute Significance) is given below; it consists of three parts, namely PACAS-Map, PACAS-Reduce and PACAS, described as follows:
method 4, PACAS-Map (key, value)
Inputting: simplified decision list S'i=(U′i,C∪D,V,f)
And (3) outputting:<c+x_c∪R,x_D+POSC(D)_flag+x_No>c + x _ C is the combination of each attribute C ∈ C and the value of the object x on the attribute set C ≧ R in the decision table, x _ D + POSc∪R(D) The _flag + x _ No is the decision attribute corresponding to the object x and POSc∪R(D) Flag, and object number.
For example, the PACAS-Map (key, value) input format provided by MapReduce is as follows:
the decision value corresponding to each category of each attribute in each decision table can be obtained by the method 4. And after Map is finished, all the < key, value > pairs are sorted, and each classification of each attribute is output together and is used as the input of Reduce.
Method 5: PACAS-Reduce(key, value)
Input: <c + x_{c∪R}, x_D + POS_{c∪R}(D)_flag + x_No>, where the key combines each attribute c ∈ C with the values of object x on {c} ∪ R, and the value carries the decision attribute of object x, the POS_{c∪R}(D) flag, and the object number;
Output: <c, sig(c) + B_R(c) + NB_R(c)>, where c ∈ C is an attribute of the decision table and the value carries the importance sig(c) of the attribute together with the sets B_R(c) and NB_R(c) obtained while computing it.
Method 5 yields the B_R(c) and NB_R(c) taken by each equivalence class of each attribute, together with |B_R(c)| and |NB_R(c)|, and saves the result in a text file in HDFS. The importance of each attribute can then be computed from this text, and the most important attribute is selected as the reduction result. The complete attribute importance computation is described by method 6.
Method 6: PACAS
Input: simplified decision table S′_i = (U′_i, C ∪ D, V, f)
Output: reduction result reduction
begin
let reduction ← ∅;
initialize a MapReduce job and compute sig(c) over each equivalence class of each attribute by methods 4 and 5;
read the results from HDFS and let reduction be the attribute with the maximum sig(c);
end
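For illustration, the following minimal Python sketch simulates methods 4-6 locally (it consumes the flagged simplified table produced by the pacsdt sketch above; the grouping dictionary stands in for the MapReduce shuffle, and all names are assumptions made purely for illustration):

from collections import defaultdict

def pacas(simplified, C, R):
    # PACAS-Map: for each candidate attribute c, key every object by c plus
    # its values on {c} ∪ R, carrying <x_D, POS_flag, x_No> as the value.
    groups = defaultdict(list)
    for cond, (d, flag, no) in simplified:
        for c in C:
            if c not in R:
                key = (c, tuple(cond[a] for a in sorted(R + [c])))
                groups[key].append((d, flag, no))
    # PACAS-Reduce: per equivalence class, accumulate B_R(c) (all flags 1 and
    # a single decision value) and NB_R(c) (all flags 0); sig(c) = |B| + |NB|.
    sig = defaultdict(int)
    B, NB = defaultdict(list), defaultdict(list)
    for (c, _), vals in groups.items():
        nos = [no for _, _, no in vals]
        if all(f == 1 for _, f, _ in vals) and len({d for d, _, _ in vals}) == 1:
            B[c] += nos
        elif all(f == 0 for _, f, _ in vals):
            NB[c] += nos
        sig[c] = len(B[c]) + len(NB[c])
    # Method 6: select the attribute of maximal significance as the reduction.
    best = max(sig, key=sig.get)
    return best, sig, B[best], NB[best]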
Specifically, how to implement the parallel reduction of the attribute importance in step S50 is as follows:
Through method 6, the reduction result of one round is obtained and added to the reduction set; before the attribute importance is computed for the next round, the simplified decision table must be adjusted again to remove redundant information. This step can also use a parallelized method PACDT (Parallel Algorithm for Computing the Decision Table), which consists of one part only, PACDT-Map, described as follows:
Method 7: PACDT-Map(key, value)
Input: simplified decision table S′_i = (U′_i, C ∪ D, V, f);
Output: new simplified decision table S′_i = (U′_i, C ∪ D, V, f).
Through method 7, a new simplified decision table is obtained, and this table serves as the input decision table for the next round of attribute importance computation. The complete attribute-importance-based parallel reduction method PACARBAS (Parallel Algorithm for Computing Attribute Reduction Based on Attribute Significance) is given below and described as follows:
Method 8: PACARBAS
Input: decision table S_i = (U_i, C ∪ D, V, f)
Output: reduction set Reductions
begin
let Reductions ← ∅;
obtain the simplified decision table S′ by methods 2 and 3;
while S′ is not empty do
compute reduction by method 6;
let Reductions ← Reductions ∪ {reduction};
recompute the simplified decision table by method 7;
end while
output Reductions;
end
Method 8 gives the complete reduction procedure: the simplified decision table is adjusted through multiple iterations, and the final reduction result is obtained.
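For illustration, the following minimal Python driver mirrors method 8 (it reuses the pacsdt and pacas sketches above, and the pacdt helper follows method 7; all names remain assumptions made purely for illustration):

def pacdt(simplified, B_best, NB_best):
    # PACDT-Map (method 7): delete the redundant rows covered by B_R(a') and
    # NB_R(a'), leaving a new simplified decision table.
    removed = set(B_best) | set(NB_best)
    return [(cond, (d, flag, no)) for cond, (d, flag, no) in simplified
            if no not in removed]

def pacarbas(records, n_cond):
    # PACARBAS (method 8): obtain the simplified table (methods 2 and 3), then
    # iterate method 6 and method 7 until the simplified table is empty.
    reductions = []
    simplified = pacsdt(records, n_cond)
    C = list(range(n_cond))
    while simplified:
        best, sig, B, NB = pacas(simplified, C, reductions)
        reductions.append(best)
        simplified = pacdt(simplified, B, NB)
    return reductions

print(pacarbas([[1, 1, 1, 2, 1], [1, 1, 1, 2, 0], [0, 1, 2, 1, 1]], n_cond=4))
# [0]  (attribute 0 alone already separates every equivalence class here)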
By introducing the above description of the methods 1 to 8, the implementation of the present invention can be summarized as the execution flow shown in fig. 3.
Specifically, the following example shows how the reduction is realized by the above methods on a concrete decision table, so that those skilled in the art can understand the technical solution more clearly.
Examples
First, a decision table S = (U, C ∪ D, V, f) is given; this table can be divided into two sub-decision tables, S_1 = (U_1, C ∪ D, V, f) and S_2 = (U_2, C ∪ D, V, f), as shown in Tables 1 and 2:
TABLE 1 sub-decision Table S1
TABLE 2 sub-decision Table S2
Second, how to compute the simplified decision table and the importance of the attributes in parallel
Table 3 simplified decision table U'
Map stage: separating condition attributes and decision attributes, <x_C, x_D>
Examples are:
Key={1,1,1,2}
Value={1}
Reduce stage: adding the POS_C(D) flag and line number, <x_C, x_D + POS_C(D)_flag + x_No>:
Examples are:
Key={1,1,1,2}
Value={1_1_1}
parallel computing attribute importance
Map: input: simplified decision table S′_i = (U′_i, C ∪ D, V, f)
Output: <c + x_{c∪R}, x_D + POS_C(D)_flag + x_No>
examples are:
For object No. 1:
Output <key, value> = {a_1, 1_1_1}
{b_1, 1_1_1}
{c_1, 1_1_1}
{d_1, 1_1_1}
Reduce: input: <c + x_c, x_D + POS_{c∪R}(D)_flag + x_No>
Output: <c, sig(c) + B_R(c) + NB_R(c)>
The importance of each attribute is then calculated. When the Map output is collected at Reduce, it is sorted by attribute, so the keys of the same attribute are grouped together. Reduce can therefore calculate the importance of all attributes at once, whereas existing methods need one MapReduce round to calculate the importance of a single attribute.
Examples are:
after calculation:
sig_R(a) = 1
sig_R(b) = 0
sig_R(c) = 0
sig_R(d) = 0
method 6PACAS
Reading the result from HDFS, calculating the most important one as a reduction, and selecting attribute A as output
Finally, parallel reduction based on attribute importance
According to B_R(a) and NB_R(a) of attribute a, the simplified decision table is recalculated and the redundant information is deleted; thus the record with No. 1 is deleted.
The reduced decision table becomes:
then recalculate attribute importance:
B_R(b) = {X3, X4, X5}, NB_R(b) = {X2, X9}, sig_R(b) = 5
B_R(c) = {X3, X5}, NB_R(c) = {X2}, sig_R(c) = 3
B_R(d) = {X3, X4, X5}, NB_R(d) = {X2, X9}, sig_R(d) = 5
Attributes with the same importance occur, and one of them is selected as the reduction; here b is selected as the output. The simplified decision table is then recalculated; the result is empty, so the reduction finishes and the result Reductions = {a, b} is obtained.
In addition, the invention also provides a rough set parallel reduction device based on MapReduce, which comprises:
the operation configuration module is used for reading a decision table to be reduced;
the task parallel simplification module is used for initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel calculation processing on the decision table to be reduced to obtain a simplified decision table with marks, and if the simplified decision table is empty, enabling the simplified decision table to be used as a final reduction result of the decision table to be reduced and outputting the final reduction result;
the attribute importance parallel computing module is used for initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks if the simplified decision table is non-empty, so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel computing and write the result into a Hadoop distributed file system;
and the attribute importance degree parallel reduction module is used for reading the decision table with the highest attribute importance degree in the Hadoop distributed file system and deleting redundant information in the decision table to obtain a new decision table to be reduced, and the new decision table to be reduced is used as an input value of the first MapReduce model to be reduced again.
Specifically, the task parallel simplification module is specifically configured to perform job configuration on the decision table to be simplified to obtain a plurality of sub-decision tables; enabling a Map function of the first MapReduce model to perform parallel calculation on the plurality of sub-decision tables to obtain condition attributes and decision attributes in the decision table to be reduced, and outputting the condition attributes and the decision attributes; and calculating the condition attribute and the decision attribute by using a Reduce function of the first MapReduce model to obtain a simplified decision table with marks.
Specifically, the attribute importance parallel computing module is specifically configured to initialize a second MapReduce model; enabling a Map function of the second MapReduce model to respond to the simplified decision table with the marks, and obtaining a decision value corresponding to each classification of each attribute in the simplified decision table with the marks through parallel calculation; and enabling a Reduce function of the second MapReduce model to respond to the decision value to obtain the attribute importance degree obtained by each equivalent class of each attribute, and writing the result into a Hadoop distributed file system.
Specifically, the attribute importance parallel reduction module is further configured to, when there are a plurality of decision tables with the highest attribute importance read in the Hadoop distributed file system, randomly select one of the decision tables with the highest attribute importance and delete redundant information therein to obtain a new decision table to be reduced.
Further, the invention also provides a rough set parallel reduction system based on MapReduce, which comprises:
the operation configuration unit is used for reading a decision table to be reduced;
the task parallel simplification unit is used for initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel calculation processing on the decision table to be reduced to obtain a simplified decision table with marks, and if the simplified decision table is empty, enabling the simplified decision table to be used as a final reduction result of the decision table to be reduced and outputting the final reduction result;
the attribute importance parallel computing unit is used for initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks if the simplified decision table is non-empty, so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel computing and write the result into a Hadoop distributed file system;
and the attribute importance degree parallel reduction unit is used for reading the decision table with the highest attribute importance degree in the Hadoop distributed file system and deleting redundant information in the decision table to obtain a new decision table to be reduced, and the new decision table to be reduced is used as an input value of the first MapReduce model to be reduced again.
Specifically, the task parallel simplification unit is specifically configured to perform job configuration on the decision table to be simplified to obtain a plurality of sub-decision tables; enabling a Map function of the first MapReduce model to perform parallel calculation on the plurality of sub-decision tables to obtain condition attributes and decision attributes in the decision table to be reduced, and outputting the condition attributes and the decision attributes; calculating the condition attribute and the decision attribute by a Reduce function of the first MapReduce model to obtain a simplified decision table with marks;
specifically, the attribute importance parallel computing unit is specifically configured to initialize a second MapReduce model; enabling a Map function of the second MapReduce model to respond to the simplified decision table with the marks, and obtaining a decision value corresponding to each classification of each attribute in the simplified decision table with the marks through parallel calculation; enabling a Reduce function of the second MapReduce model to respond to the decision value to obtain the attribute importance degree obtained by each equivalent class of each attribute, and writing the result into a Hadoop distributed file system;
specifically, the attribute importance parallel reduction unit is further configured to, when there are a plurality of decision tables with the highest attribute importance read in the Hadoop distributed file system, randomly select one of the decision tables with the highest attribute importance and delete redundant information therein to obtain a new decision table to be reduced.
In summary, compared with the prior art, the invention has the following advantages:
First, the existing parallel reduction methods cannot obtain the parallel reduction result accurately, because they directly reduce the sub-decision tables cut in the Map phase and then merge the reduction results, whereas reduction requires complete equivalence classes. Such methods therefore actually obtain the reduction result from partial data, and the result is inaccurate and unreliable.
Secondly, the existing parallel reduction methods have limitations: the currently proposed parallel reduction methods require the decision table to be a consistent decision table. A definition of the consistent decision table is as follows: for a decision table S, if all objects lie in POS_C(D), the decision table is consistent; if some object lies in U − POS_C(D), it is an inconsistent decision table. POS_C(D) here is exactly the POS_C(D) flag given in method 3: a consistent decision table is one in which every POS_C(D) flag equals 1. The method of the present invention has no such limitation and fits all decision tables.
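For illustration, under the record layout assumed in the earlier sketches, this consistency test reduces to checking the flags produced by method 3 (a hypothetical helper, not part of the patent):

def is_consistent(simplified):
    # A decision table is consistent exactly when every object lies in
    # POS_C(D), i.e. every POS_C(D)_flag emitted by method 3 equals 1.
    return all(flag == 1 for _, (_, flag, _) in simplified)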
Thirdly, the present invention is highly efficient. Firstly, the existing methods mainly reduce on the original decision table, whereas the present method first obtains a simplified decision table, and reduction on the simplified decision table greatly reduces the amount of computation, thereby improving efficiency. Secondly, other methods can obtain the importance of only one condition attribute per MapReduce round when computing attribute importance, whereas the present method obtains the importance of all condition attributes with one MapReduce round plus a simple text-reading computation, completing one reduction step per round and improving efficiency.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A rough set parallel reduction method based on MapReduce is characterized by comprising the following steps:
reading a decision table to be reduced;
initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel computing processing on the decision table to be reduced to obtain a simplified decision table with a mark:
if the simplified decision table is empty, it is taken as a final reduction result of the decision table to be reduced and output;
if the simplified decision table is not empty, initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel calculation and write the result into a Hadoop distributed file system;
and reading a decision table with the highest attribute importance in the Hadoop distributed file system, deleting redundant information in the decision table to obtain a new decision table to be reduced, and enabling the new decision table to be reduced to be used as an input value of the first MapReduce model to be reduced again.
2. The MapReduce-based rough set parallel reduction method as recited in claim 1, wherein the specific method for performing parallel computation processing on the decision table to be reduced by using a first MapReduce model to obtain the simplified decision table comprises the following steps:
performing operation configuration on the decision table to be reduced to obtain a plurality of sub-decision tables;
enabling a Map function of the first MapReduce model to perform parallel calculation on the plurality of sub-decision tables to obtain condition attributes and decision attributes in the decision table to be reduced, and outputting the condition attributes and the decision attributes;
and calculating the condition attribute and the decision attribute by using a Reduce function of the first MapReduce model to obtain a simplified decision table with marks.
3. The MapReduce-based rough set parallel reduction method as set forth in claim 1 or 2, wherein the specific method for calculating the importance of each attribute in the simplified decision table in parallel by using a second MapReduce model comprises:
initializing a second MapReduce model;
enabling a Map function of the second MapReduce model to respond to the simplified decision table with the marks, and obtaining a decision value corresponding to each classification of each attribute in the simplified decision table with the marks through parallel calculation;
and enabling a Reduce function of the second MapReduce model to respond to the decision value to obtain the attribute importance degree obtained by each equivalent class of each attribute, and writing the result into a Hadoop distributed file system.
4. The MapReduce-based rough set parallel reduction method as claimed in claim 1 or 3, wherein if there are a plurality of decision tables with the highest attribute importance read in the Hadoop distributed file system, one decision table with the highest attribute importance is randomly selected and redundant information in the decision table is deleted to obtain a new decision table to be reduced.
5. A rough set parallel reduction device based on MapReduce is characterized by comprising:
the operation configuration module is used for reading a decision table to be reduced;
the task parallel simplification module is used for initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel calculation processing on the decision table to be reduced to obtain a simplified decision table with marks, and if the simplified decision table is empty, enabling the simplified decision table to be used as a final reduction result of the decision table to be reduced and outputting the final reduction result;
the attribute importance parallel computing module is used for initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks if the simplified decision table is non-empty, so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel computing and write the result into a Hadoop distributed file system;
and the attribute importance degree parallel reduction module is used for reading the decision table with the highest attribute importance degree in the Hadoop distributed file system and deleting redundant information in the decision table to obtain a new decision table to be reduced, and the new decision table to be reduced is used as an input value of the first MapReduce model to be reduced again.
6. The MapReduce-based rough set parallel reduction device of claim 5, wherein:
the task parallel simplification module is specifically used for performing operation configuration on the decision table to be reduced to obtain a plurality of sub-decision tables; enabling a Map function of the first MapReduce model to perform parallel calculation on the plurality of sub-decision tables to obtain condition attributes and decision attributes in the decision table to be reduced, and outputting the condition attributes and the decision attributes; and calculating the condition attribute and the decision attribute by using a Reduce function of the first MapReduce model to obtain a simplified decision table with marks.
7. The MapReduce-based rough set parallel reduction device of claim 5, wherein:
the attribute importance parallel computing module is specifically used for initializing a second MapReduce model; enabling a Map function of the second MapReduce model to respond to the simplified decision table with the marks, and obtaining a decision value corresponding to each classification of each attribute in the simplified decision table with the marks through parallel calculation; and enabling a Reduce function of the second MapReduce model to respond to the decision value to obtain the attribute importance degree obtained by each equivalent class of each attribute, and writing the result into a Hadoop distributed file system.
8. The MapReduce-based rough set parallel reduction device of claim 5, wherein: the attribute importance degree parallel reduction module is further used for randomly selecting one decision table with the highest attribute importance degree and deleting redundant information in the decision table with the highest attribute importance degree to obtain a new decision table to be reduced when a plurality of decision tables with the highest attribute importance degree are read in the Hadoop distributed file system.
9. A rough set parallel reduction system based on MapReduce is characterized by comprising:
the operation configuration unit is used for reading a decision table to be reduced;
the task parallel simplification unit is used for initializing a first MapReduce model and enabling the first MapReduce model to respond to the decision table to be reduced so as to perform parallel calculation processing on the decision table to be reduced to obtain a simplified decision table with marks, and if the simplified decision table is empty, enabling the simplified decision table to be used as a final reduction result of the decision table to be reduced and outputting the final reduction result;
the attribute importance parallel computing unit is used for initializing a second MapReduce model and enabling the second MapReduce model to respond to the simplified decision table with the marks if the simplified decision table is non-empty, so as to obtain the importance of each attribute in the simplified decision table with the marks through parallel computing and write the result into a Hadoop distributed file system;
and the attribute importance degree parallel reduction unit is used for reading the decision table with the highest attribute importance degree in the Hadoop distributed file system and deleting redundant information in the decision table to obtain a new decision table to be reduced, and the new decision table to be reduced is used as an input value of the first MapReduce model to be reduced again.
10. The MapReduce-based rough set parallel reduction system of claim 9, wherein:
the task parallel simplification unit is specifically configured to split the decision table to be reduced, through job configuration, into a plurality of sub-decision tables; to have the Map function of the first MapReduce model process the sub-decision tables in parallel, extracting and outputting the condition attributes and decision attributes of the decision table to be reduced; and to have the Reduce function of the first MapReduce model compute the marked simplified decision table from those condition and decision attributes (sketched below, after this claim);
the attribute importance parallel computing unit is specifically configured to initialize a second MapReduce model; to have its Map function take the marked simplified decision table as input and compute in parallel the decision value corresponding to each classification of each attribute in that table; and to have its Reduce function derive each attribute's importance from its equivalence classes and write the result to the Hadoop distributed file system; and
the attribute importance parallel reduction unit is further configured, when several decision tables read from the Hadoop distributed file system tie for the highest attribute importance, to select one of them at random and delete its redundant information, yielding a new decision table to be reduced.
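
A matching sketch of the first MapReduce model recited by the task parallel simplification unit above: each input split stands in for one sub-decision table, the Map side keys every object by its condition-attribute part, and the Reduce side collapses duplicate objects into a single marked record. The consistent/inconsistent marks and the record layout are, again, assumptions rather than details from the patent.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Assumed input layout as before: "a1,...,an,d".
    public class DecisionTableSimplify {

        // Map: condition-attribute part of the object as the key, decision
        // value as the value; each input split acts as one sub-decision table.
        public static class SimplifyMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String row = line.toString();
                int cut = row.lastIndexOf(',');
                if (cut < 0) return; // skip malformed lines
                ctx.write(new Text(row.substring(0, cut)),   // condition attributes
                          new Text(row.substring(cut + 1))); // decision attribute
            }
        }

        // Reduce: one call per distinct condition vector; duplicate objects
        // collapse to a single record, marked consistent when all of them
        // agree on the decision value and inconsistent otherwise.
        public static class SimplifyReducer
                extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text cond, Iterable<Text> decisions, Context ctx)
                    throws IOException, InterruptedException {
                Set<String> distinct = new HashSet<>();
                for (Text d : decisions) distinct.add(d.toString());
                String mark = (distinct.size() == 1) ? "consistent" : "inconsistent";
                ctx.write(cond, new Text(String.join("|", distinct) + "#" + mark));
            }
        }
    }
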
CN201410325508.9A 2014-07-09 2014-07-09 MapReduce-based rough set parallel reduction method, device and system Expired - Fee Related CN104063230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410325508.9A CN104063230B (en) 2014-07-09 2014-07-09 MapReduce-based rough set parallel reduction method, device and system

Publications (2)

Publication Number Publication Date
CN104063230A 2014-09-24
CN104063230B CN104063230B (en) 2017-03-01

Family

ID=51550954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410325508.9A Expired - Fee Related CN104063230B (en) MapReduce-based rough set parallel reduction method, device and system

Country Status (1)

Country Link
CN (1) CN104063230B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336791A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast rough set attribute reduction method
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU Qiong et al., "Improved rough set algorithm for spatial data mining under the Map/Reduce framework", Science of Surveying and Mapping *
XIAO Dawei et al., "A fast parallel attribute reduction algorithm based on rough set theory", Computer Science *
CHEN Xinying et al., "Parallel reduction algorithm based on rough set theory", Journal of Computer Applications *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598567A (en) * 2015-01-12 2015-05-06 北京中交兴路车联网科技有限公司 Data statistics and de-duplication method based on the Hadoop MapReduce programming framework
CN104598567B (en) * 2015-01-12 2018-01-09 北京中交兴路车联网科技有限公司 Data statistics and de-duplication method based on the Hadoop MapReduce programming framework
CN106202278A (en) * 2016-07-01 2016-12-07 武汉泰迪智慧科技有限公司 Public opinion monitoring system based on data mining technology
CN106202278B (en) * 2016-07-01 2019-08-13 武汉泰迪智慧科技有限公司 Public opinion monitoring system based on data mining technology
CN109992587A (en) * 2019-04-09 2019-07-09 中南大学 Blast furnace molten iron silicon content prediction key attribute judgment method based on big data
CN109992587B (en) * 2019-04-09 2021-04-13 中南大学 Blast furnace molten iron silicon content prediction key attribute judgment method based on big data
CN115392582A (en) * 2022-09-01 2022-11-25 广东工业大学 Crop yield prediction method based on incremental fuzzy rough set attribute reduction
CN115392582B (en) * 2022-09-01 2023-11-14 广东工业大学 Crop yield prediction method based on incremental fuzzy rough set attribute reduction

Also Published As

Publication number Publication date
CN104063230B (en) 2017-03-01

Similar Documents

Publication Publication Date Title
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
JP2021517295A (en) High-efficiency convolutional network for recommender systems
CN104881466B Data fragment processing and garbage file deletion method and device
Wen et al. Exploiting GPUs for efficient gradient boosting decision tree training
JP2018522343A (en) Method, computer device and storage device for building a decision model
CN104063230B (en) MapReduce-based rough set parallel reduction method, device and system
CN104820708B (en) Big data clustering method and device based on a cloud computing platform
CN104809244B (en) Data mining method and device in a big data environment
JP6598996B2 (en) Signature-based cache optimization for data preparation
Prasad et al. GPU-based Parallel R-tree Construction and Querying
CN106598743A (en) Attribute reduction method for information system based on MPI parallel solving
CN103440246A (en) Intermediate result data sorting method and system for MapReduce
JP2011170774A (en) Device and method for generation of decision tree, and program
CN103996216A (en) Power efficient attribute handling for tessellation and geometry shaders
CN105808582A (en) Parallel decision tree generation method and device based on a layered strategy
Daoudi et al. Parallel differential evolution clustering algorithm based on MapReduce
Lin et al. A parallel Cop-Kmeans clustering algorithm based on MapReduce framework
Jensen et al. Feature grouping-based fuzzy-rough feature selection
JP2013242675A (en) Dispersion information control device, dispersion information search method, data dispersion arrangement method and program
Raj et al. PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique
CN110325984B (en) System and method for hierarchical community detection in graphics
CN107291541A (en) Coarse-grained process-level parallel optimization method and system for compaction in Key-Value systems
Lee et al. Primitives for dynamic big model parallelism
Keswani et al. Enhanced approach to attain competent Big Data pre-processing
Tzacheva et al. MR-Apriori count distribution algorithm for parallel Action Rules discovery

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20170301