CN103336790A - Hadoop-based fast neighborhood rough set attribute reduction method

Hadoop-based fast neighborhood rough set attribute reduction method

Info

Publication number: CN103336790A (application CN201310224008.1A; granted as CN103336790B)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: attribute, data, feature, neighborhood, reducer
Legal status: Granted; Active
Inventors: 蒋云良, 杨建党, 刘勇, 范婧, 张雄涛
Current Assignee: Huzhou University
Original Assignee: Huzhou University
Application filed by Huzhou University

Abstract

The invention discloses a Hadoop-based fast neighborhood rough set attribute reduction method. The method comprises the following steps: a) establishing a Hadoop-based distributed platform; b) defining the neighborhood rough set; c) generating a candidate set; d) calculating the significance of each attribute; e) selecting the attribute with the largest significance and adding it to the candidate set; f) judging whether a stop condition is met; g) saving the state of feature selection. The method analyzes the parallelization of the data mining algorithm on the Hadoop distributed platform and thereby realizes the parallelization of the neighborhood rough set attribute reduction algorithm. The parallelized attribute reduction greatly lowers the time complexity, substantially reduces the intermediate results output during execution, and improves the efficiency of analyzing large-scale data, so that massive and heterogeneous data are converted into usable data carrying information and business value, thereby completing data mining and analysis optimization.

Description

Hadoop-based fast neighborhood rough set attribute reduction method
[technical field]
The present invention relates to data attribute reduction methods, in particular to a distributed attribute reduction method for big data.
[background technology]
With the rapid development of the high-tech information industry, we have entered an era of data explosion and information expansion: massive data are produced, transmitted and used every minute of every day. The era of "big data" has arrived. In one minute, more than 100,000 new posts appear on microblogs; the New York Stock Exchange produces 1 TB of transaction data every day; and the world generates 2.5 exabytes (1 exabyte equals 10^18 bytes) of data every day. IDC's recent Digital Universe study predicts that by 2020 the total amount of data stored worldwide will reach 35 ZB (1 ZB equals 10^21 bytes). Facing this rapid growth of massive data, how to analyze more effectively the data that have been accumulated and are still growing, to mine market and business value from them, and to support business decisions and development is a severe challenge currently faced by enterprises that hold large-scale data.
Data mining extracts or mines knowledge from massive data. Using data mining tools for data analysis can reveal important data patterns and make great contributions to business strategy, knowledge bases, and scientific and medical research. Feature selection and attribute reduction are regarded as very important data preprocessing steps for pattern recognition, machine learning and data mining. At the same time, feature selection and attribute reduction are themselves an important class of machine learning tasks, whose purpose is to delete irrelevant, weakly relevant or redundant attributes or dimensions and to reveal the correlations between features and between features and the classification decision, which directly helps the user understand the essence of the data. In marketing analysis, for example, the relation between product features and product sales can help formulate correct marketing strategies and guide designers in improving products.
With the explosive increase of data, the kinds of data become richer and richer; not only does the data scale keep expanding, but the dimensionality of the data is also very high. This places new requirements on attribute reduction methods. A good attribute reduction method should not only reduce the attribute dimensionality of the data effectively, but also have good time efficiency when processing large-scale data. Most existing attribute reduction theories do not consider the big-data case; their reduction process is very time-consuming and cannot be applied in practice when facing massive data. Although some attribute reduction algorithms aimed at large data volumes now exist, they are not suitable for data stored in a distributed manner.
With the rise of Web 2.0, social networks have developed rapidly. The traffic of various social networking sites far exceeds that of traditional portal sites; the number of users is huge and the time spent online has surged, so the amount of data produced on the network has increased sharply. The difficult problem faced by website operators is how to provide stable and efficient services for such a huge user base. Google is at the forefront of big-data processing: the Google File System and the MapReduce programming model it released satisfy the storage and computation requirements of ultra-large-scale data.
Since Google developed the Google File System and the MapReduce programming model, their particular strengths in processing massive data have attracted extraordinary attention in both academia and industry. Academia keeps producing research results on massive data processing based on MapReduce, while in industry a large number of systems similar to the Google File System and adopting MapReduce-like programming models have been widely deployed. The concept of cloud computing was subsequently proposed, showing an effective way to solve the problem of massive data processing in the information explosion era. Amazon and Google are pioneers of cloud computing; Google App Engine and Amazon Web Services [43] are among the earliest cloud computing services. Well-known IT vendors at home and abroad, such as IBM, Microsoft, China Mobile, China Unicom and China Telecom, have also released their own cloud computing plans according to their respective advantages.
Today, as applications with massive data demands, such as the Internet, scientific data processing and business intelligence analysis, become more and more common, mastering technologies such as the Google File System and the MapReduce programming model has become a trend both for scientific research and for application development. Against this background, Hadoop, the open-source system that implements the Google File System and the MapReduce programming model, has become the most widely used distributed architecture. Hadoop has become a core basic platform of many Internet companies, such as Yahoo, Facebook, LinkedIn and Twitter. Many traditional industries, such as the media and telecommunications industries, have also begun to adopt Hadoop. Hadoop has become the most widely used cloud computing platform.
The MapReduce model is simple, easy to understand and easy to use. Many massive data processing problems, including many machine learning and data mining algorithms, can be implemented with MapReduce. Machine learning and data mining are closely related, and implementing various machine learning algorithms on cloud computing platforms has also become a research focus. Alina et al. described how to use MapReduce for fast clustering [50]; Bahman Bahmani et al. described how to compute PageRank quickly with MapReduce [29]. Apache Mahout [45] is a new open-source project of the Apache Software Foundation whose main goal is to create scalable machine learning algorithms; the project has implemented many common machine learning algorithms for clustering and classification. In data mining, information retrieval and other fields, many algorithms are implemented as iterative processes, such as PageRank and SSSP (Single Source Shortest Path), but running iterative jobs on the traditional MapReduce model performs poorly. Twister [48] and HaLoop [45,46] are improvements of the MapReduce model proposed for iterative algorithms, but their models are not abstract enough and the computations they support are limited.
Rough set theory (Rough Set) was proposed by Pawlak in 1982 as a theory for data analysis and processing. It is a mathematical tool for describing uncertain and incomplete data; it can effectively analyze imprecise, inconsistent, incomplete and other imperfect information, and can also analyze and reason over data to discover latent knowledge and rules. Rough set theory is, after probability theory, fuzzy sets and evidence theory, another mathematical tool for handling uncertainty. Its simplicity, practicality and validity in handling uncertain and incomplete data are remarkable; after gaining the recognition of academia worldwide and being popularized in real applications, it has become an international research focus. As a new soft computing method [3], rough sets have received more and more attention in recent years, and their validity has been confirmed by successful applications in many fields of science and engineering. Applied research on rough sets mainly covers attribute reduction, knowledge acquisition, data mining and uncertainty measures of knowledge, and is one of the current research focuses in data mining, machine learning, artificial intelligence theory and their applications.
Feature selection and attribute reduction play a very important role in machine learning, pattern recognition and data mining and have always received extensive attention from academia. With the development of information acquisition and storage technology, in some real-world applications the dimensionality of the data acquired and stored in databases may be tens, hundreds or even thousands. With a limited number of training samples, too many feature attributes seriously affect the learning process; many features are unimportant or redundant, and these redundant features not only increase the computational complexity but also reduce the predictive ability of the classifier. In order to simplify the classifier, improve the classification accuracy and reduce the dimensionality of the processed data, we need to select a suitable feature subset.
Over the years, different theories and methods have been applied to the attribute reduction problem. The Pawlak rough set theory, as a mathematical tool for handling uncertain classification problems, has achieved successful development in feature subset selection, reduction and classifier design. The Pawlak rough set is built on equivalence relations and partitions and is suitable for classification problems described by nominal attributes. To handle the computation of numerical attributes, the numerical attributes must be discretized into symbolic attributes, but the discretization process inevitably brings information loss, so that the learned model cannot reflect the knowledge structure of the raw data.
In order to handle numerical data, scholars proposed the concept of the neighborhood rough set. The neighborhood rough set no longer requires the objects in an elementary information granule to be identical, but only requires their distance from the center of the neighborhood to be less than a certain threshold. Q. H. Hu et al. gave a neighborhood rough set model based on neighborhood rough set theory and designed reduction algorithms for nominal, numerical and mixed data.
In 2003 Dash and Liu proposed the consistency index to evaluate the classification ability of a discrete feature space. They argued that consistent samples can obviously be classified correctly, while inconsistent samples are not necessarily misclassified: only the samples belonging to the minority class among the inconsistent samples are misclassified, whereas the samples of the majority class can still be classified correctly. The consistency index solves the problem that the Pawlak rough set cannot tolerate inconsistent information inside an equivalence class and cannot describe in detail the class distribution in the boundary region. Introducing the idea of the consistency index into the neighborhood rough set solves the problem of finely characterizing the classification ability of mixed variables.
The above theories and models are all based on small data volumes or a single node. Faced with big, distributed data, the time efficiency of these attribute reduction methods becomes a bottleneck. The neighborhood rough set model introduces the concept of neighborhood, so that the attribute reduction methods of some classical rough sets are not applicable to it; moreover, the neighborhood rough set model has to spend a great deal of time computing distances when computing sample neighborhoods, so attribute reduction under the neighborhood rough set model is not efficient. For big data, reducing the sample size, shrinking the search space and computing sample neighborhoods quickly become the breakthrough points for designing efficient attribute reduction algorithms for the neighborhood rough set model. Parallelizing the attribute reduction algorithm with the MapReduce programming model can also break through the bottleneck of big-data attribute reduction on a single node. However, there is currently very little research at home and abroad on distributed attribute reduction methods.
[summary of the invention]
The purpose of the present invention is to solve the problems in the prior art by proposing a Hadoop-based fast neighborhood rough set attribute reduction method, which can improve the efficiency of analyzing large-scale data, so that massive and heterogeneous data are converted into usable data carrying information and business value, thereby completing data mining and analysis optimization.
To achieve the above purpose, the present invention proposes a Hadoop-based fast neighborhood rough set attribute reduction method, comprising the following steps:
a) Establish a Hadoop-based distributed platform: set up the HDFS distributed file system and the MapReduce parallel programming model. The HDFS distributed file system adopts a master-worker architecture, consisting of one master and several workers; the master manages the namespace of the file system and maintains the file system tree and all files and directories in the tree; the workers are the working nodes of the file system, storing and retrieving data blocks as required and regularly sending "heartbeat" reports to the master; if the master does not receive a worker's "heartbeat" report within the specified period, the master starts the fault-tolerance mechanism to handle it. The MapReduce parallel programming model divides a task into a number of small tasks for execution, and each small task processes the data blocks stored locally on a cluster node.
b) Define the neighborhood rough set: in a database with mixed attributes, a neighborhood information system is expressed as NIS = <U, A, V, f>, where U is the set of samples, A is the set of attributes, V is the value domain of the attributes, and f is the information function f: U × A → V. If B is a subset of numerical features, then the neighborhood of x with respect to B is δ_B(x) = {x_i | x_i ∈ δ_a(x), ∀a ∈ B}.
c) Generate the candidate set: adopt a search strategy to generate a group of feature subsets to be evaluated as the candidate set; the initial candidate set is the empty set, the full feature set, or a randomly generated group of feature subsets.
d) Calculate the significance of each attribute: set up a Mapper class and a Reducer class; the Mapper class reads in the sample data and, according to the set of already selected attributes, assigns to each attribute to be evaluated its corresponding sample set as the input of the Reducer class; each reducer accepts only the sample subsets of one attribute and groups them by the composite key inside the reducer; the number of Reducer tasks equals the number of attributes to be evaluated, the corresponding sample sets are input into the Reducer tasks corresponding to the different attribute numbers, and the multiple Reducer tasks execute in parallel. Given a neighborhood decision system NDT = <U, A∪D, V, f> and an attribute a ∈ B ⊆ A, the significance of attribute a is calculated as SIG(a, B, D) = γ_B(D) − γ_{B−a}(D); SIG(a, B, D) reflects the importance of attribute a for the decision attribute D and is used to evaluate the significance of each attribute.
e) Select the attribute with the largest significance and add it to the candidate set: take the output of step d) as the input of this step and compare it with the previous maximum significance value; if the significance value of the current attribute is higher, the current attribute is added to the candidate set as the best feature subset.
f) Judge whether the stop condition is met: the feature generation process and the evaluation process are both used as stop conditions. There are two kinds of stop conditions for the feature generation process: one is to judge whether the predefined number of features has been selected, and the other is to judge whether the predefined number of iterations has been reached. There are two kinds of stop conditions for the evaluation process: one is to judge whether adding or removing a feature produces a better feature subset, and the other is to judge whether the optimal feature subset has been obtained.
g) Save the state of feature selection: the set of selected features and the set of unselected features are saved separately; step d) computes the significance over the set of unselected features; after step e) adds the best attribute, the selected and unselected feature sets are updated; finally the selected feature set and the unselected feature set are output as the result.
Preferably, the search strategy in step c) adopts the exhaustive method: starting from the empty set and adopting a breadth-first search strategy until a smallest subset that can completely predict the classification is found.
Preferably, the search strategy in step c) adopts a complete search algorithm: starting from the full feature set, one feature is removed at a time.
Preferably, the search strategy in step c) adopts a heuristic search algorithm: starting from the empty set, each time the feature that most increases the coverage of the selected feature subset is added, until the "coverage" reaches a set value, or the algorithm stops when all features have been exhausted.
Preferably, the search strategy in step c) adopts a random probabilistic search algorithm: feature subsets are selected at random with equal probability, and the best subset found so far that satisfies a certain evaluation criterion is kept and continually compared with newly selected ones, until a subset satisfying the preset condition is found, or the algorithm stops when the predefined number of attempts is reached.
Preferably, the evaluation method in step d) adopts an attribute evaluation method based on the dependency degree or an attribute evaluation method based on consistency.
Beneficial effects of the present invention: the present invention analyzes the parallelization of the data mining algorithm on the Hadoop distributed platform and realizes the parallelization of the neighborhood rough set attribute reduction algorithm; the parallelized attribute reduction greatly lowers the time complexity, substantially reduces the output of intermediate results during execution, and improves the efficiency of analyzing large-scale data, so that massive and heterogeneous data are converted into usable data carrying information and business value, thereby completing data mining and analysis optimization.
The features and advantages of the present invention are described in detail through the embodiments.
[embodiment]
To achieve the above purpose, the present invention proposes a Hadoop-based fast neighborhood rough set attribute reduction method, comprising the following steps:
a) Establish a Hadoop-based distributed platform: set up the HDFS distributed file system and the MapReduce parallel programming model. The HDFS distributed file system adopts a master-worker architecture, consisting of one master and several workers; the master manages the namespace of the file system and maintains the file system tree and all files and directories in the tree; the workers are the working nodes of the file system, storing and retrieving data blocks as required and regularly sending "heartbeat" reports to the master; if the master does not receive a worker's "heartbeat" report within the specified period, the master starts the fault-tolerance mechanism to handle it. The MapReduce parallel programming model divides a task into a number of small tasks for execution, and each small task processes the data blocks stored locally on a cluster node.
b) Define the neighborhood rough set: in a database with mixed attributes, a neighborhood information system is expressed as NIS = <U, A, V, f>, where U is the set of samples, A is the set of attributes, V is the value domain of the attributes, and f is the information function f: U × A → V. If B is a subset of numerical features, then the neighborhood of x with respect to B is δ_B(x) = {x_i | x_i ∈ δ_a(x), ∀a ∈ B}.
c) Generate the candidate set: adopt a search strategy to generate a group of feature subsets to be evaluated as the candidate set; the initial candidate set is the empty set, the full feature set, or a randomly generated group of feature subsets.
d) Calculate the significance of each attribute: set up a Mapper class and a Reducer class; the Mapper class reads in the sample data and, according to the set of already selected attributes, assigns to each attribute to be evaluated its corresponding sample set as the input of the Reducer class; each reducer accepts only the sample subsets of one attribute and groups them by the composite key inside the reducer; the number of Reducer tasks equals the number of attributes to be evaluated, the corresponding sample sets are input into the Reducer tasks corresponding to the different attribute numbers, and the multiple Reducer tasks execute in parallel. Given a neighborhood decision system NDT = <U, A∪D, V, f> and an attribute a ∈ B ⊆ A, the significance of attribute a is calculated as SIG(a, B, D) = γ_B(D) − γ_{B−a}(D); SIG(a, B, D) reflects the importance of attribute a for the decision attribute D and is used to evaluate the significance of each attribute.
e) Select the attribute with the largest significance and add it to the candidate set: take the output of step d) as the input of this step and compare it with the previous maximum significance value; if the significance value of the current attribute is higher, the current attribute is added to the candidate set as the best feature subset.
f) Judge whether the stop condition is met: the feature generation process and the evaluation process are both used as stop conditions. There are two kinds of stop conditions for the feature generation process: one is to judge whether the predefined number of features has been selected, and the other is to judge whether the predefined number of iterations has been reached. There are two kinds of stop conditions for the evaluation process: one is to judge whether adding or removing a feature produces a better feature subset, and the other is to judge whether the optimal feature subset has been obtained.
g) Save the state of feature selection: the set of selected features and the set of unselected features are saved separately; step d) computes the significance over the set of unselected features; after step e) adds the best attribute, the selected and unselected feature sets are updated; finally the selected feature set and the unselected feature set are output as the result.
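To make the flow of steps c) to g) concrete, the following is a minimal sequential sketch of the iterative selection loop, written in Java. It assumes a helper significance(a, selected) that returns SIG(a, B, D) for a candidate attribute a given the selected set B (in the invention this value is produced by the parallel MapReduce jobs described later); the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

public class SequentialReductionSketch {

    // Iterative loop over steps c) to g): evaluate unselected attributes, pick the one
    // with the largest significance, stop when no attribute improves the subset or a
    // predefined feature/iteration limit is hit.
    public static Set<Integer> reduce(int numAttributes,
                                      BiFunction<Integer, Set<Integer>, Double> significance,
                                      int maxFeatures, int maxIterations) {
        Set<Integer> selected = new HashSet<>();        // selected feature set B
        List<Integer> unselected = new ArrayList<>();   // candidate attributes
        for (int a = 0; a < numAttributes; a++) unselected.add(a);

        for (int iter = 0; iter < maxIterations && selected.size() < maxFeatures; iter++) {
            int bestAttr = -1;
            double bestSig = 0.0;
            for (int a : unselected) {                  // step d): significance of each attribute
                double sig = significance.apply(a, selected);
                if (sig > bestSig) { bestSig = sig; bestAttr = a; }
            }
            if (bestAttr < 0) break;                    // step f): no attribute improves the subset
            selected.add(bestAttr);                     // step e): add the best attribute
            unselected.remove(Integer.valueOf(bestAttr)); // step g): update selected/unselected sets
        }
        return selected;
    }
}
```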
1. Rough sets and attribute reduction methods.
1.1 Basic concepts of rough set theory.
Definition 1: A knowledge representation system is an ordered quadruple I = (U, A, V, f), where U is the universe, U = {x_1, x_2, ..., x_n}, a non-empty finite set of objects; A is the attribute set, A = {a_1, a_2, ..., a_m}; V is the set of attribute values, with V = ∪_{a∈A} V_a, where V_a is the value domain of attribute a; f: U × A → V is the information function, which assigns an information value to each attribute of each object, i.e. f(x, a) ∈ V_a for every x ∈ U and a ∈ A. The system I = (U, A, V, f) is usually abbreviated as I = (U, A).
Definition 2: For any P ⊆ A, let IND(P) = {(x, y) ∈ U × U | ∀a ∈ P, a(x) = a(y)}. IND(P) is called the indiscernibility relation of P on A, also written R_P. If (x, y) ∈ IND(P), then x and y are indiscernible with respect to P.
Definition 3: For P ⊆ A with indiscernibility relation IND(P), U/IND(P) = {[x]_P : x ∈ U} is the partition induced on U, where [x]_P = {y : (x, y) ∈ IND(P)} is the equivalence class of x with respect to P.
An example information system is shown in Table 1-1.
Table 1-1 Information system (the original table, listing samples x_1 to x_5 and their values on attributes a_1 to a_4, is given as a figure in the original document).
If we take P = {a_1, a_2, a_3, a_4}, the equivalence classes obtained as above are {x_1, x_2}, {x_3}, {x_4}, {x_5}; likewise, if we take P = {a_1, a_3}, they are {x_1, x_2}, {x_3, x_5}, {x_4}.
For X ⊆ U, the attribute set P usually cannot represent X exactly, because X may contain part of the objects of an indiscernibility class of P while the remaining part is not contained. For example, with P = {a_1, a_2, a_3, a_4}, a set X that contains x_1 but not x_2 cannot be represented exactly by P, because x_1 and x_2 are indiscernible, so there is no way to express a set that includes x_1 but excludes x_2. In order to represent X by the attribute set P, we introduce the lower approximation and the upper approximation of X.
Definition 4: Let X ⊆ U. The lower approximation and the upper approximation of X are defined as
\underline{P}X = {x | [x]_P ⊆ X}
\overline{P}X = {x | [x]_P ∩ X ≠ ∅}
From the definition, the lower approximation is the union of all equivalence classes that are entirely contained in X, and the upper approximation is the union of all equivalence classes whose intersection with X is not empty. The lower approximation \underline{P}X is also called the positive region, denoted POS_P(X), and \overline{P}X − \underline{P}X is the boundary region; an object belonging to the boundary region may or may not belong to X.
Definition 5: Let X ⊆ U, let \underline{P}X be the lower approximation and \overline{P}X the upper approximation of X. The pair (\underline{P}X, \overline{P}X) composed of the lower and upper approximations is called a rough set.
A rough set is therefore represented by two parts, where \underline{P}X gives the lower bound of X and \overline{P}X the upper bound. To measure how well \underline{P}X and \overline{P}X represent X, the precision of the rough set is defined as
α_P(X) = |\underline{P}X| / |\overline{P}X|
where 0 ≤ α_P(X) ≤ 1. α_P(X) describes intuitively how closely the approximations fit X. The best case is obviously α_P(X) = 1, when \underline{P}X and \overline{P}X are equal; in the opposite extreme case, α_P(X) = 0, the lower approximation is empty.
1.2 Decision information systems in rough set theory.
Definition 6: Given an information system I = (U, A) with A = C ∪ D and C ∩ D = ∅, where C is the set of conditional attributes and D is the set of decision attributes, an information system with conditional attributes and decision attributes is called a decision information system.
In fact, information systems are divided into different kinds; common ones include multi-valued (single-valued) information systems, fuzzy information systems, consistent (inconsistent) information systems and incomplete information systems.
1. Multi-valued (single-valued) information systems
Given an information system I = (U, A), for x ∈ U and a ∈ A, let a(x) denote the value of object x on attribute a. If |a(x)| > 1, the information system is called multi-valued; if |a(x)| = 1, it is single-valued.
2. Fuzzy information systems
Given an information system I = (U, A ∪ {d}) with A = {a_1, a_2, ..., a_m} and decision attribute d, where the a_i, i = 1, 2, ..., m, are crisp variables and d is a fuzzy variable whose values are the fuzzy sets D_1, D_2, ..., D_M with membership degrees D_k(x), k = 1, 2, ..., M, the information system I = (U, A ∪ {d}) is called a fuzzy information system.
3. Consistent (inconsistent) information systems
Given an information system I = (U, A, F, D, G), where U is the set of samples, U = {x_1, x_2, ..., x_n}; A is the set of conditional attributes, A = {a_1, a_2, ..., a_m}; D is the set of decision attributes, D = {d_1, d_2, ..., d_q}; F is the set of functional relations between U and A, F = {f_k: U → V_k, k ≤ m}, where V_k is the value domain of attribute a_k; and G is the set of functional relations between U and D, G = {g_k: U → V'_k, k ≤ q}, where V'_k is the value domain of decision attribute d_k.
For any B ⊆ A the indiscernibility relations are
R_B = {(x, y) : f_k(x) = f_k(y), ∀a_k ∈ B}
R_D = {(x, y) : g_k(x) = g_k(y), ∀d_k ∈ D}
They produce the partitions of U
U/R_B = {[x]_B : x ∈ U}
U/R_D = {[x]_D : x ∈ U}
where
[x]_B = {y : (x, y) ∈ R_B}
[x]_D = {y : (x, y) ∈ R_D}
are the equivalence classes of x with respect to B and D, respectively.
Definition 7: Given an information decision system I = (U, A, F, D, G), if R_A ⊆ R_D, i.e. every equivalence class of the conditional attributes is contained in a single decision class, the information decision system is said to be consistent; otherwise it is said to be inconsistent.
4. Incomplete information systems
Given an information system I = (U, A) with A = C ∪ D and C ∩ D = ∅, for B ⊆ A the tolerance relation T is defined as
T = {(x, y) ∈ U × U | ∀a ∈ B, a(x) = a(y) or a(x) = * or a(y) = *}
where * denotes a missing value. The tolerance class of an object x ∈ U is
T(x) = {y ∈ U | (x, y) ∈ T}
The upper and lower approximations of a concept X ⊆ U are defined respectively as
\overline{T}X = {x ∈ U | T(x) ∩ X ≠ ∅}
\underline{T}X = {x ∈ U | T(x) ⊆ X}
An information system I = (U, A) containing missing values is called an incomplete information system.
1.3 Neighborhood rough sets.
In brief, the neighborhood of a sample x_i is the set of objects within a certain distance of x_i in a given space. To compute this distance we usually define a distance function Δ as the metric.
Definition 8: A metric Δ is a function R^N × R^N → R satisfying the following properties:
1) Δ(x_1, x_2) ≥ 0, ∀x_1, x_2 ∈ R^N; Δ(x_1, x_2) = 0 if and only if x_1 = x_2;
2) Δ(x_1, x_2) = Δ(x_2, x_1), ∀x_1, x_2 ∈ R^N;
3) Δ(x_1, x_3) ≤ Δ(x_1, x_2) + Δ(x_2, x_3), ∀x_1, x_2, x_3 ∈ R^N.
In general, the Euclidean distance is used. For nominal attributes, we can define a special metric:
Δ_C(x, y) = 1 if x ≠ y, and Δ_C(x, y) = 0 if x = y.
It is easy to prove that Δ_C satisfies the generalized metric conditions.
Definition 9: Given a finite non-empty set of objects U = {x_1, x_2, ..., x_n} and a numerical attribute a, the δ-neighborhood of an arbitrary object x_i ∈ U is defined as
δ_a(x_i) = {x_j | Δ(x_i, x_j) ≤ δ, x_j ∈ U}
We also call δ_a(x_i) the neighborhood information granule derived from attribute a and object x_i. The family of neighborhood information granules {δ_a(x) | x ∈ U} forms the elementary concepts covering the whole space.
In a database with mixed attributes, a neighborhood information system is expressed as
NIS = <U, A, V, f>
where U is the set of samples, A is the set of attributes, V is the value domain of the attributes, and f is the information function f: U × A → V.
More specifically, if the system contains both conditional and decision attributes, a neighborhood information system is also called a neighborhood decision table, which can be expressed as
NDT = <U, A∪D, V, f>
Definition 10: Given NIS = <U, A, V, f> and a numerical feature subset B, the neighborhood of x with respect to B is
δ_B(x) = {x_i | x_i ∈ δ_a(x), ∀a ∈ B}
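As an illustration of Definitions 9 and 10, the following Java sketch computes the δ-neighborhood of a sample over a numerical feature subset B, taking the neighborhood as the set of samples whose distance to x is at most δ on every attribute in B (i.e. the intersection of the single-attribute neighborhoods). The absolute difference is used as the single-attribute metric here; this is an illustrative choice, not the only one permitted by the definition.

```java
import java.util.ArrayList;
import java.util.List;

public class NeighborhoodSketch {

    // data[i][p] = value of sample i on numerical attribute p.
    // Returns the indices of the samples in the δ-neighborhood of sample x
    // with respect to the attribute subset B (Definition 10).
    public static List<Integer> deltaNeighborhood(double[][] data, int x, int[] B, double delta) {
        List<Integer> neighborhood = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            boolean inAllSingleAttributeNeighborhoods = true;
            for (int p : B) {
                // x_i ∈ δ_a(x) for attribute a = p requires Δ(x, x_i) ≤ δ on p.
                if (Math.abs(data[x][p] - data[i][p]) > delta) {
                    inAllSingleAttributeNeighborhoods = false;
                    break;
                }
            }
            if (inAllSingleAttributeNeighborhoods) neighborhood.add(i);
        }
        return neighborhood;
    }
}
```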
Definition 11: Given NIS = <U, A, V, f>, let B = B_n ∪ B_c, where B_n and B_c are the numerical features and the nominal features respectively. B_n generates a neighborhood relation and B_c generates an equivalence relation. The neighborhood of x with respect to B is defined as
δ_B(x) = {x_i | x_i ∈ δ_{B_n}(x) ∩ δ_{B_c}(x)}
i.e. x_i must fall both in the neighborhood induced by the numerical features and in the equivalence class induced by the nominal features.
Having defined the δ-neighborhood granulation of a data set with mixed features, we now investigate the relationship between the granulated features and the decision.
Definition 12: Given a neighborhood decision table NDT = <U, A∪D, V, f>, let X_1, X_2, ..., X_N be the subsets of samples with decisions 1 to N, and let δ_B(x_i) be the information granule containing x_i generated by the feature subset B ⊆ A. The lower and upper approximations of the decision D with respect to the feature subset B are defined as
\underline{δ_B}D = {\underline{δ_B}X_1, \underline{δ_B}X_2, ..., \underline{δ_B}X_N}
\overline{δ_B}D = {\overline{δ_B}X_1, \overline{δ_B}X_2, ..., \overline{δ_B}X_N}
where
\underline{δ_B}X = {x_i | δ_B(x_i) ⊆ X, x_i ∈ U}
\overline{δ_B}X = {x_i | δ_B(x_i) ∩ X ≠ ∅, x_i ∈ U}
The lower approximation is the set of objects that, in the space spanned by the attributes B, are completely contained in some decision class D; in other words, an object belonging to the lower approximation can be classified completely and correctly using the feature subset B. The lower approximation is also called the positive region of the decision, written POS_B(D). The upper approximation is the set of objects that, in the space spanned by the attributes B, at least partially belong to the decision D; that is, an object belonging to the upper approximation might be classified into decision D when measured by the feature subset B.
By granulating the feature space into information granules, the nominal and the numerical feature attributes can be described within the same system, which lays the foundation for unifying them under one evaluation function and for mining the relations between feature attributes.
To characterize the decision accuracy of a feature subset B, we define a boundary function:
BN(D) = \overline{δ_B}D − \underline{δ_B}D
The decision boundary is the set of information granules of the objects that belong to more than one decision class. Objects in the decision boundary have no definite class, and the size of the decision boundary measures the classification ambiguity of the feature subset B.
Definition 13: The dependency of the decision D on the feature subset B is defined as the proportion of objects that are classified consistently under B:
γ_B(D) = |POS_B(D)| / |U|
The dependency reflects the ability of the feature subset B to describe the decision D and can also be regarded as an index of the significance of the feature subset B for approximating the decision D. The larger |POS_B(D)| is, the stronger the describing ability of the feature subset B for the decision D. When γ_B(D) = 1, the classification problem is said to be consistent.
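A small Java sketch of Definition 13: given the δ-neighborhoods of all samples with respect to a feature subset B and the decision labels, it counts the samples whose whole neighborhood carries one decision (the positive region) and returns the dependency γ_B(D) = |POS_B(D)| / |U|. The representation of neighborhoods as index arrays is an assumption made for illustration.

```java
public class DependencySketch {

    // neighborhoods[i] = indices of the samples in δ_B(x_i); labels[i] = decision of sample i.
    // A sample lies in POS_B(D) when every sample in its neighborhood shares its decision.
    public static double dependency(int[][] neighborhoods, int[] labels) {
        int positiveRegionSize = 0;
        for (int i = 0; i < neighborhoods.length; i++) {
            boolean consistent = true;
            for (int j : neighborhoods[i]) {
                if (labels[j] != labels[i]) { consistent = false; break; }
            }
            if (consistent) positiveRegionSize++;   // δ_B(x_i) ⊆ X for the class of x_i
        }
        return (double) positiveRegionSize / neighborhoods.length;   // γ_B(D)
    }
}
```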
1.4 Attribute reduction based on the neighborhood rough set.
Some attributes in the data may be irrelevant or redundant for a given classification decision problem. Such attributes can reduce the positive region of the decision, slow down the learning of the classifier, cause over-fitting of the learned model and make the classifier more complex. It is therefore necessary to identify these attributes and reduce them. The necessary and unnecessary attributes are defined as follows.
Definition 14: Given a neighborhood decision system NDT = <U, A∪D, V, f> and an attribute a ∈ B ⊆ A, if γ_B(D) > γ_{B−a}(D), then a is said to be necessary in B for the classification decision D; otherwise a is unnecessary.
Definition 15: Given a neighborhood decision system NDT = <U, A∪D, V, f> and B ⊆ A, if B satisfies the following two conditions, B is called a relative reduct of A:
1) necessity: ∀a ∈ B, γ_B(D) > γ_{B−a}(D);
2) sufficiency: γ_B(D) = γ_A(D).
The first condition guarantees that all attributes in B are necessary, i.e. there is no unnecessary attribute in the reduct. The second condition guarantees that the feature subset B can fully describe the classification positive region of the whole attribute set A.
The size of the positive region (or of the boundary) depends not only on the feature space B of the problem but also on the neighborhood information granularity δ; under different feature spaces and analysis granularities, the consistency of the classification is different.
Property 1 (attribute monotonicity): Given a neighborhood decision system NDT = <U, A∪D, V, f>, a metric Δ on U, a granularity δ and B_1, B_2 ⊆ A with B_1 ⊆ B_2, then
1) N_{B_1} ⊇ N_{B_2};
2) ∀X ⊆ U, \underline{N_{B_1}}X ⊆ \underline{N_{B_2}}X;
3) POS_{B_1}(D) ⊆ POS_{B_2}(D) and γ_{B_1}(D) ≤ γ_{B_2}(D).
Property 2 (granularity monotonicity): Given a neighborhood decision system NDT = <U, A∪D, V, f>, a metric Δ on U and B ⊆ A, if δ_1 ≤ δ_2, then
1) N_B^{δ_2} ⊇ N_B^{δ_1};
2) ∀X ⊆ U, \underline{N^{δ_2}}X ⊆ \underline{N^{δ_1}}X;
3) POS^{δ_2}(D) ⊆ POS^{δ_1}(D) and γ^{δ_2}(D) ≤ γ^{δ_1}(D).
Property 1 states that, under the same granularity δ, the more attributes there are, the larger the classification positive region, the smaller the boundary and the higher the consistency. Property 2 states that, in the same feature space, the smaller the neighborhood information granularity, the larger the classification positive region and the higher the consistency. With more attributes, the description of the samples is more precise; when B = A, the classification consistency of B is identical to that of A. With a smaller information granularity, the elementary concepts used to approximate the decision are finer, and the description of the classification is more accurate. When δ = 0, the system is said to be Pawlak consistent, i.e. consistent under the equivalence relation.
Definition 16: Given a neighborhood decision system NDT = <U, A∪D, V, f> and an attribute a ∈ B ⊆ A, the significance of attribute a is calculated as
SIG(a, B, D) = γ_B(D) − γ_{B−a}(D)
SIG(a, B, D) reflects the importance of attribute a for the decision attribute D.
1.5 Attribute evaluation methods.
1.5.1 The feature selection process.
We now know the significance index of a feature subset B for approximating the decision D and the method for computing the significance of an attribute a; next, through a certain strategy, we obtain the optimal feature subset that we expect. A typical feature selection process has four main parts:
1) candidate set generation: generate the next group of feature subsets to be evaluated;
2) attribute evaluation: evaluate the feature subsets;
3) stop condition: judge whether the selection process satisfies the termination condition;
4) validation: verify whether the selected feature subset is effective.
The candidate set generation process uses a certain search strategy to generate a group of feature subsets to be evaluated. The initial candidate set can be (i) the empty set, (ii) the full feature set, or (iii) a randomly generated group of feature subsets. In the first two cases, features are added or removed iteratively; in the last case, features may also be added or removed iteratively, or a new group of feature subsets may be generated at random in each search.
The evaluation method is used to evaluate the dependency of a feature subset and compare it with the best previous dependency value; if the dependency of the current subset is higher, the current feature subset becomes the best feature subset. The optimal feature subset always corresponds to a specific evaluation method (for example, the optimal feature subsets selected by different attribute evaluation methods are probably different).
Without a suitable stop condition, the feature selection process might run for a very long time, or depend entirely on the search strategy. Both the feature generation process and the evaluation process can serve as stop conditions. Stop conditions based on the feature generation process include the following two cases: (i) whether the predefined number of features has been selected; (ii) whether the predefined number of iterations has been reached. Stop conditions based on the evaluation process include the following two cases: (i) whether adding (or removing) a feature produces a better feature subset; (ii) whether the optimal feature subset has been obtained.
There are many variants of the feature selection process, but feature generation, evaluation and stop conditions exist in all of them.
1.5.2 Attribute evaluation based on the dependency degree.
Definition 17: Given a neighborhood decision system NDT = <U, A∪D, V, f>, B ⊆ A and δ_B(x) the neighborhood of sample x, if D(x_i) = D(x) for every x_i ∈ δ_B(x), then x is said to be a completely correctly classified sample and x ∈ POS_B(D). The dependency of the decision attribute D on the conditional attributes B is then
γ_B(D) = |POS_B(D)| / |U|
where |POS_B(D)| is the number of samples in the sample space U that can be classified completely and correctly.
1.5.3 Attribute evaluation based on consistency.
The dependency function considers only the samples that are classified completely correctly and ignores the classes containing inconsistent samples. However, the samples in the boundary region are not necessarily misclassified: under a Bayesian classifier, for example, a sample can be assigned to the decision class that holds the majority for a given feature attribute pattern, so the distribution of the inconsistent decision classes of a feature attribute also affects the final classification accuracy of the classifier.
Consistent samples can obviously be classified correctly, but inconsistent samples are not necessarily misclassified: only the samples belonging to the minority class among the inconsistent samples are misclassified, while the samples of the majority class can still be classified correctly. The consistency index solves the problem that the Pawlak rough set cannot tolerate inconsistent information inside an equivalence class and cannot describe in detail the class distribution in the boundary region.
Suppose P is the number of all samples, N is the number of all features, M is the number of features in the selected feature subset, S is the feature subset, c is the number of decision classes, and C is the set of decision classes.
Definition 18: The consistency method is defined through the inconsistency rate, which is computed as follows:
1. A pattern p of a feature subset S is called inconsistent if the set of samples matching this pattern contains at least two instances whose decision attributes differ.
2. The inconsistency count of a pattern p of the feature subset S equals the total number of times it appears in the data set minus the largest number of its appearances within any single class.
3. The inconsistency rate I_R(S) of a feature subset S is the sum of the inconsistency counts of all patterns under the feature subset divided by the total number of samples P.
The consistency method can then be applied to the following feature selection task: given a candidate feature subset S, compute its inconsistency rate I_R(S); if I_R(S) ≤ δ, where δ is a given threshold, S is considered consistent.
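The following Java sketch illustrates the inconsistency rate of Definition 18 for discrete data: samples are grouped by their value pattern on the feature subset S, and each pattern contributes its total count minus the size of its largest decision class. The string encoding of patterns is an implementation detail chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class InconsistencyRateSketch {

    // data[i][p] = (discrete) value of sample i on attribute p; labels[i] = decision class.
    // S = indices of the attributes in the candidate feature subset.
    public static double inconsistencyRate(int[][] data, int[] labels, int[] S) {
        // pattern -> (decision class -> count)
        Map<String, Map<Integer, Integer>> patterns = new HashMap<>();
        for (int i = 0; i < data.length; i++) {
            StringBuilder key = new StringBuilder();
            for (int p : S) key.append(data[i][p]).append('|');
            patterns.computeIfAbsent(key.toString(), k -> new HashMap<>())
                    .merge(labels[i], 1, Integer::sum);
        }
        long inconsistent = 0;
        for (Map<Integer, Integer> classCounts : patterns.values()) {
            int total = 0, max = 0;
            for (int count : classCounts.values()) { total += count; max = Math.max(max, count); }
            inconsistent += total - max;   // inconsistency count of this pattern
        }
        return (double) inconsistent / data.length;   // I_R(S)
    }
}
```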
Introducing the idea of consistency into the neighborhood rough set solves the problem of finely characterizing the classification ability of mixed variables.
Definition 19: Given a neighborhood decision system NDT = <U, A∪D, V, f>, B ⊆ A, x_i ∈ U and δ_B(x_i) the neighborhood of sample x_i, let P(ω_j | δ_B(x_i)), j = 1, 2, ..., c, be the probability of class ω_j within this neighborhood. The neighborhood decision function of x_i is then defined as
ND(x_i) = ω, if P(ω | δ_B(x_i)) = max_j P(ω_j | δ_B(x_i))
where P(ω_j | δ_B(x_i)) = n_j / N, N is the number of samples in the neighborhood δ_B(x_i), and n_j is the number of samples of class j in δ_B(x_i).
Introduce the 0-1 misclassification loss function
λ(ω(x_i) | ND(x_i)) = 0 if ω(x_i) = ND(x_i), and 1 if ω(x_i) ≠ ND(x_i)
where ω(x_i) is the true class of x_i.
Definition 20: The neighborhood decision inconsistency rate is
I_R(B) = (1/|U|) Σ_{i=1}^{|U|} λ(ω(x_i) | ND(x_i))
The neighborhood decision inconsistency rate is in fact obtained by reassigning each sample to a decision class by the majority principle according to the class distribution in its neighborhood, and then computing the rate of disagreement between the true class and the reassigned class.
The consistency rate of the neighborhood decision can then be expressed as 1 − I_R(B).
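The sketch below illustrates Definition 20 in Java: each sample is reassigned to the majority class of its δ_B-neighborhood, and the inconsistency rate is the fraction of samples whose true class differs from the reassigned one. Neighborhoods are again assumed to be precomputed index arrays.

```java
import java.util.HashMap;
import java.util.Map;

public class NeighborhoodDecisionSketch {

    // neighborhoods[i] = indices of the samples in δ_B(x_i); labels[i] = true class ω(x_i).
    public static double neighborhoodInconsistencyRate(int[][] neighborhoods, int[] labels) {
        int disagreements = 0;
        for (int i = 0; i < neighborhoods.length; i++) {
            // Class distribution inside the neighborhood of x_i.
            Map<Integer, Integer> counts = new HashMap<>();
            for (int j : neighborhoods[i]) counts.merge(labels[j], 1, Integer::sum);
            // Neighborhood decision ND(x_i): the majority class in the neighborhood.
            int majorityClass = labels[i];
            int majorityCount = -1;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() > majorityCount) { majorityCount = e.getValue(); majorityClass = e.getKey(); }
            }
            if (majorityClass != labels[i]) disagreements++;   // 0-1 loss λ
        }
        return (double) disagreements / neighborhoods.length;   // I_R(B)
    }
}
```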
Definition 21: Given a neighborhood decision system NDT = <U, A∪D, V, f> and B ⊆ A, the dependency of the decision attribute D on the conditional attributes B based on consistency is computed as
γ_B(D) = 1 − I_R(B)
We can see that the dependency function of Definition 17 is the proportion of instances that can be classified completely without error, while the consistency function is the proportion of samples that may be classified correctly. The consistency function considers not only the positive region of the dependency function, but also the samples in the boundary that belong to the largest class.
1.5.4 Search strategies.
An efficient search technique is very important. The ideal case is enumeration: for a feature set of N features, all 2^N feature subsets are checked one by one under the evaluation function to find the optimal feature subset. Searching for the optimal subset exhaustively in this way is usually impractical. We next introduce and compare several search strategies:
1) Focus: the exhaustive method (Exhaustive Search) is one of the earliest algorithms in machine learning. It generally starts from the empty set and adopts a breadth-first search strategy until it finds a smallest subset that can completely predict the classification. The search strategy of the exhaustive method is guaranteed to find an optimal subset, but as the size M of the optimal feature subset grows, the time complexity of the algorithm rises severely, and results may even become unobtainable. In general, when the number M of relevant features is small, the search space of the algorithm is small and the efficiency is high; on the contrary, when N − M is small, the efficiency of the algorithm is low.
2) ABB: unlike Focus, the complete search algorithm (Complete Search) ABB removes one feature at a time starting from the full feature set, so when N − M is small, ABB is very efficient.
3) SetCover: a heuristic search algorithm (Heuristic Search). When two instances belonging to different classes differ in at least one feature value, the two instances are said to be "covered". Note that finding the smallest consistent feature subset is equivalent to "covering" all pairs of instances from different classes. Starting from the empty set, this algorithm each time adds the feature that most increases the coverage of the selected feature subset, until the "coverage" reaches a set value, or the algorithm stops when all features have been exhausted.
4) LVF: a random probabilistic search algorithm (Probabilistic Search). This algorithm selects feature subsets at random with equal probability and keeps the best subset found so far that satisfies a certain evaluation criterion, continually comparing it with newly selected ones, until a subset satisfying the preset condition is found or the predefined number of attempts is reached.
Having introduced attribute evaluation methods and search strategies, we can design an attribute reduction algorithm by combining an attribute evaluation method with a search strategy. The design of the attribute reduction algorithm is introduced below through a concrete algorithm.
1.6 Design of the attribute reduction algorithm based on the dependency evaluation method of the neighborhood rough set.
The algorithm adopts the forward greedy search strategy (Forward Greedy Search) of heuristic search: starting from the empty set, at each step the feature whose addition increases the discriminating ability the most is added, until adding any attribute no longer increases the discriminating ability, or all attributes have been selected. The significance of an attribute is calculated as
SIG(a, B, D) = γ_B(D) − γ_{B−a}(D)
The algorithm has two important steps: computing the neighborhood of a sample and analyzing whether the samples in the neighborhood are consistent. The time complexity of computing the neighborhood of a sample is O(n log n), and the time complexity of judging whether the samples in the neighborhood are consistent is O(n). With N features and n samples, the time complexity of the algorithm is O(N^2 · n · log n).
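The following Java sketch shows the forward greedy search described above. It takes the dependency function γ as a parameter (any of the evaluation methods of Section 1.5 can be plugged in), adds in each round the attribute that increases γ the most, and stops when no attribute brings an improvement or all attributes are selected. Names and the improvement criterion are illustrative.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.ToDoubleFunction;

public class ForwardGreedyReductionSketch {

    // gamma evaluates γ_B(D) for a candidate feature subset B.
    public static Set<Integer> reduce(int numAttributes, ToDoubleFunction<Set<Integer>> gamma) {
        Set<Integer> B = new HashSet<>();
        double gammaB = gamma.applyAsDouble(B);
        while (B.size() < numAttributes) {
            int bestAttr = -1;
            double bestGamma = gammaB;
            for (int a = 0; a < numAttributes; a++) {
                if (B.contains(a)) continue;
                Set<Integer> candidate = new HashSet<>(B);
                candidate.add(a);
                double g = gamma.applyAsDouble(candidate);   // γ_{B ∪ {a}}(D)
                if (g > bestGamma) { bestGamma = g; bestAttr = a; }
            }
            if (bestAttr < 0) break;      // no attribute increases the discriminating ability
            B.add(bestAttr);              // add the feature with the largest significance
            gammaB = bestGamma;
        }
        return B;
    }
}
```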
2. Implementation of the rough set attribute reduction algorithm based on Hadoop.
2.1 Parallelization analysis of the rough set attribute reduction algorithm.
Each iteration of attribute reduction mainly performs two operations: (i) compute the significance of each attribute; (ii) select the attribute with the largest significance and add it to the candidate set. The second operation needs the execution result of the first, so the two steps must be executed serially. We can use two sub-jobs and a dependent MapReduce job chain to complete these two operations. After the two jobs have finished, the stop condition is checked; finally the reduced subset is output. The two operations in the iteration are the main computational part of the algorithm, so we consider parallelizing them.
The parallelization of the two operations in the iteration above is analyzed separately:
1) Let N be the number of all features, M the number of features in the already selected feature subset, and n the number of samples. The process of computing the significance of the N − M attributes can be parallelized, because the computation of the significance of each attribute is independent: the significance of each attribute is computed on the sample set in the space spanned by the feature subset B ∪ {a_i} as dimensions, and does not need the complete sample set. The time complexity of this significance computation was originally O(N · n · log n); after parallelization it is O(k · log k), where k is the number of samples in each split.
2) The first step computes the significances of the N − M candidate attributes, stored in the distributed HDFS files corresponding to the features. The operation of the second step is equivalent to computing the maximum significance value in the distributed files and obtaining the corresponding attribute as the candidate attribute, which is very simple for Hadoop.
In each iteration, the Map task of the first step reads the data set and sends the processed sample sets to the Reduce tasks through the sort and shuffle process. Let the time for the Map task to read the data be T_1, the time to process the data in the Map task O(N), the time of the sort and shuffle process T_2, and the time to compute the significance O(n); then the execution time of the first step is T_1 + O(N) + T_2 + O(n). The data handled in the second step are very small, so its time is ignored for the moment. The total time of the iterative attribute reduction algorithm is then N · (T_1 + O(N) + T_2 + O(n)), i.e. the time complexity is O(N · max(T_1, O(N), T_2, O(n))). We can predict that, with the number of features N fixed, the total running time grows linearly with the number of samples n.
2.2 Parallel implementation of the rough set attribute reduction algorithm.
2.2.1 The MapReduce framework of the attribute reduction algorithm.
The parallel attribute reduction algorithm mainly contains two jobs: the first job computes the attribute significances, and the second job obtains the attribute with the largest significance. The two jobs are executed serially. The stop-condition check judges whether the attribute with the largest significance satisfies the condition; if it does, the attribute is appended to the selected feature subset, and then it is judged whether the stop condition is met.
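As an illustration of this two-job chain, the following driver sketch submits Job1 and Job2 with the standard Hadoop MapReduce API and loops until the stop condition holds. The mapper, reducer and partitioner classes referenced here are the ones sketched in Section 2.2.2 below; the paths, the configuration key reduct.selected and the readBestAttribute helper are illustrative assumptions, not part of the patent.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReductionDriverSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String selected = "";                        // indices of the already selected attributes B
        for (int iter = 0; iter < 100; iter++) {     // upper bound on the number of iterations
            conf.set("reduct.selected", selected);

            // Job1: compute |POS_{B ∪ a}(D)| for every candidate attribute a in parallel.
            // (In the patent the number of reduce tasks equals the number of candidate attributes.)
            Job job1 = Job.getInstance(conf, "attribute-significance");
            job1.setJarByClass(ReductionDriverSketch.class);
            job1.setMapperClass(ImportanceMapper.class);
            job1.setReducerClass(ImportanceReducer.class);
            job1.setPartitionerClass(AttributePartitioner.class);
            job1.setMapOutputKeyClass(Text.class);
            job1.setMapOutputValueClass(Text.class);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job1, new Path(args[0]));
            FileOutputFormat.setOutputPath(job1, new Path(args[1] + "/sig-" + iter));
            if (!job1.waitForCompletion(true)) return;

            // Job2: select the attribute with the largest significance from Job1's output.
            Job job2 = Job.getInstance(conf, "best-attribute");
            job2.setJarByClass(ReductionDriverSketch.class);
            job2.setMapperClass(MaxSignificanceMapper.class);
            job2.setReducerClass(MaxSignificanceReducer.class);
            job2.setNumReduceTasks(1);
            job2.setMapOutputKeyClass(Text.class);
            job2.setMapOutputValueClass(Text.class);
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job2, new Path(args[1] + "/sig-" + iter));
            FileOutputFormat.setOutputPath(job2, new Path(args[1] + "/best-" + iter));
            if (!job2.waitForCompletion(true)) return;

            String best = readBestAttribute(conf, args[1] + "/best-" + iter);  // hypothetical helper
            if (best == null) break;                  // stop: no better feature subset found
            selected = selected.isEmpty() ? best : selected + "," + best;
        }
        System.out.println("Selected feature subset: " + selected);
    }

    private static String readBestAttribute(Configuration conf, String dir) {
        // Placeholder: read the single <attribute, POS> record from HDFS and return the
        // attribute if it improves on the current subset, otherwise return null.
        return null;
    }
}
```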
2.2.2 MapReduce design of the attribute reduction algorithm.
1) Job1: one Mapper class and one Reducer class are needed. The Mapper class reads in the sample data and, according to the set of attributes already selected during feature selection, assigns to each attribute to be evaluated its corresponding sample set as the input of the Reducer. The number of Reducer tasks is the number of attributes to be evaluated, and the corresponding sample sets are input into the corresponding Reducer tasks with the attribute number as the key. The multiple Reducer tasks are executed in parallel, thereby realizing the parallel computation of the attribute evaluation.
The data are divided into multiple splits distributed over the DataNode nodes; the HDFS file system of Hadoop divides files in units of blocks of BlockSize. A file may be divided into several blocks, and each block starts a Map task, so the file size and the block division directly affect the degree of parallelism of the Map phase.
The Mapper stage of Job1 undertakes a large amount of data processing. In order to better understand the task of this stage, we first look at a consistency-related property of rough sets.
Property 3: In the space spanned by the attribute set B, if two samples x and y belong to the same positive-region subspace, then x and y have consistent decisions and consistent values on the attributes of B, that is:
i) D(x) = D(y),
ii) ∪_{p∈B} x_p = ∪_{p∈B} y_p.
Based on Property 3, the implementation of the Mapper class and the Reducer class in Job1 is described below.
Mapper class: each Map task reads in one split and receives its input in <key, value> form. We use the default input format, TextInputFormat, so each line of the data set is read as one text value: the key is the byte offset of the line with respect to the beginning of the file, and the value is the sample record of that line. The attribute values inside the value are separated by a fixed delimiter; the sample record is split on this delimiter and stored in a string array whose length equals the attribute dimensionality of the sample.
According to the selected attribute set B and the unselected attributes, the Mapper produces, for every attribute a to be evaluated, the sample sub-vector over the dimensions B∪a, ∪_{p∈B∪a} x_p. It then emits the pair formed by the attribute number and this sub-vector as the key, and the decision of the sample as the value:
key: <a, ∪_{p∈B∪a} x_p>,  value: <D(x)>.
The output of map is partitioned according to the value of a. Since the number of reducers is set to the number of attributes to be evaluated, each partition contains the data of exactly one attribute; the shuffle phase then transfers the map output to the corresponding reducer according to the partition number.
Reducer class: the partitioning guarantees that each reducer receives all sample sub-vectors of exactly one attribute, grouped within the reducer by the composite key. The reduce function therefore only needs to traverse the value list of each composite key and count the samples whose decisions are consistent. The output is the number of consistently classified samples in the space spanned by B∪a, POS_B∪a(D); we use POS_B∪a(D) as the importance index of the feature subset B∪a with respect to the approximation of the decision D. The final key/value pair is output in the form <a, POS_B∪a(D)>.
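For concreteness, a minimal Java sketch of Job1 along these lines is given below. It is an illustration only, not the patented code: the comma separator, the column layout, the Configuration parameter names (feature.selected, feature.unselected, feature.decision.index), the use of plain Text keys in place of the composite-key types of section 2.3.1, and the accumulation of per-granule majority counts in cleanup() are all assumptions made for the sketch.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Job1Sketch {

    /** Emits, for every candidate attribute a, key = "a \t sample values on B∪a"
     *  and value = the decision D(x) of the sample. */
    public static class ImportanceMapper extends Mapper<LongWritable, Text, Text, Text> {

        private int[] selected;      // indices of the attributes already in B
        private int[] candidates;    // indices of the attributes still to be evaluated
        private int decisionIdx;     // index of the decision attribute D

        @Override
        protected void setup(Context ctx) {
            Configuration conf = ctx.getConfiguration();
            selected = parseIndices(conf.get("feature.selected", ""));
            candidates = parseIndices(conf.get("feature.unselected", ""));
            decisionIdx = conf.getInt("feature.decision.index", 0);
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            String decision = f[decisionIdx];
            for (int a : candidates) {
                StringBuilder granule = new StringBuilder();
                for (int p : selected) granule.append(f[p]).append('|');   // values on B
                granule.append(f[a]);                                      // plus candidate a
                ctx.write(new Text(a + "\t" + granule), new Text(decision));
            }
        }

        private static int[] parseIndices(String csv) {
            if (csv.isEmpty()) return new int[0];
            String[] parts = csv.split(",");
            int[] idx = new int[parts.length];
            for (int i = 0; i < parts.length; i++) idx[i] = Integer.parseInt(parts[i].trim());
            return idx;
        }
    }

    /** Each reduce() call sees one granule; its consistent part is taken as the size
     *  of its majority decision class.  Per-attribute sums are emitted in cleanup()
     *  as <a, POS_B∪a(D)>. */
    public static class ImportanceReducer
            extends Reducer<Text, Text, IntWritable, LongWritable> {

        private final Map<Integer, Long> pos = new HashMap<>();

        @Override
        protected void reduce(Text key, Iterable<Text> decisions, Context ctx) {
            int attr = Integer.parseInt(key.toString().split("\t", 2)[0]);
            Map<String, Long> freq = new HashMap<>();
            for (Text d : decisions) freq.merge(d.toString(), 1L, Long::sum);
            long majority = 0L;
            for (long c : freq.values()) majority = Math.max(majority, c);
            pos.merge(attr, majority, Long::sum);
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            for (Map.Entry<Integer, Long> e : pos.entrySet()) {
                ctx.write(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
            }
        }
    }
}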
2) Job2: takes the output of Job1 as its input; its function is very simple, namely to take the maximum value. The task of Job2 is to retrieve, from the records output by Job1, the pair <a, POS_B∪a(D)> with the largest POS_B∪a(D).
Mapper class: an appropriate input format InputFormat is chosen so that the records <a, POS_B∪a(D)> are read in and output unchanged in the form <a, POS_B∪a(D)>.
Reducer class: only one reducer task is needed; it returns the pair <a, POS_B∪a(D)> whose POS_B∪a(D) is maximal.
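A minimal sketch of such a Job2 reducer is given below; it assumes the <a, POS> records arrive as IntWritable/LongWritable pairs, which is an assumption of the sketch rather than a requirement of the method.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

/** Single Job2 reducer: keeps the <a, POS> record with the largest
 *  positive-region size seen in the Job1 output. */
public class MaxImportanceReducer
        extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {

    private int bestAttr = -1;
    private long bestPos = Long.MIN_VALUE;

    @Override
    protected void reduce(IntWritable attr, Iterable<LongWritable> posValues, Context ctx) {
        for (LongWritable pos : posValues) {
            if (pos.get() > bestPos) {        // running maximum over all attributes
                bestPos = pos.get();
                bestAttr = attr.get();
            }
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        if (bestAttr >= 0) {                  // emit the single best <a, POS> pair
            ctx.write(new IntWritable(bestAttr), new LongWritable(bestPos));
        }
    }
}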
3) Stop conditions: stop conditions based on the feature-generation process cover two cases: (i) whether a predefined number of features has been selected; (ii) whether a predefined number of iterations has been reached. Stop conditions based on the evaluation process cover two cases: (i) whether adding (or removing) a feature still produces a better feature subset; (ii) whether the optimal feature subset has been obtained.
These four cases are ordered by priority as stop conditions: (1) no better feature subset is produced, i.e. POS_B∪a(D) ≤ POS_B(D); (2) the optimal feature subset has been obtained, i.e. POS_B∪a(D) = POS_A(D); (3) the predefined number of iterations has been reached; (4) the predefined number of features has been selected.
If the first stop condition is not satisfied, the newly elected feature subset has a stronger descriptive power for the decision; the newly elected subset then becomes the selected feature subset, B = B∪a, and the newly added attribute a is deleted from the set of unselected attributes.
4) Feature Select: stores the state of the feature selection, mainly two data items: the set of selected features (selected) and the set of unselected features (unselected). Job1 needs these two sets to compute the importance of the remaining attributes, the stop-condition step updates them, and at the end they are output as the result. These two data items therefore need to be stored in the form of global parameters.
2.3 Key techniques of the parallel implementation.
Writing an application in the MapReduce framework is essentially the process of customizing the mapper and the reducer. Besides implementing the map and reduce functions according to one's own needs, one can also define custom input formats, custom Writable data types, global parameters, and so on.
2.3.1 Custom key and value data types.
In Hadoop, the inputs and outputs of the mapper and reducer are always key/value pairs, and the types of keys and values are not arbitrary. So that key/value pairs can be moved around the cluster, the MapReduce framework provides a serialization mechanism for them; only classes that support this serialization can serve as keys or values in the framework.
The serialization format provided by Hadoop is Writable [37]. A class that implements the Writable interface can be used as a value; to be used as a key it must implement the WritableComparable<T> interface, which inherits from both Writable and java.lang.Comparable. Keys are sorted in the Reduce phase and therefore need WritableComparable, whereas values are merely transmitted.
The org.apache.hadoop.io package shipped with Hadoop contains a number of predefined Writable classes.
The Writable classes provide wrappers for the Java primitive types; see Table 2-1.
Table 2-1 Writable classes for the Java primitive types
NullWritable is a special Writable type whose serialized length is zero; it neither reads from nor writes to the data stream and serves only as a placeholder. ArrayWritable, TwoDArrayWritable, MapWritable and SortedMapWritable are the four collection classes of Writable: ArrayWritable and TwoDArrayWritable implement arrays and two-dimensional arrays of Writable, while MapWritable and SortedMapWritable implement java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable> respectively.
The Writable implementations that Hadoop provides satisfy most needs, but sometimes the types required for keys and values go beyond the basic types that Hadoop itself supports. In that case we can build new implementations according to our own needs. A custom Writable type must implement the Writable interface (or WritableComparable<T>); with a custom type we gain full control over the binary representation and the sort order.
A composite key is used in Job1; a minimal implementation of such a composite-key type is sketched below.
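As the original listing is not reproduced in this text, the sketch below follows the standard Hadoop WritableComparable pattern; the field and accessor names are chosen for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/** Composite key for Job1: the attribute number followed by the sample
 *  sub-vector over the dimensions B∪a. */
public class IntTextPair implements WritableComparable<IntTextPair> {

    private final IntWritable first = new IntWritable();   // attribute number a
    private final Text second = new Text();                // sample sub-vector

    public IntTextPair() { }

    public IntTextPair(int attribute, String subVector) {
        first.set(attribute);
        second.set(subVector);
    }

    public IntWritable getFirst() { return first; }
    public Text getSecond() { return second; }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int compareTo(IntTextPair other) {       // sort by attribute, then by sub-vector
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    @Override
    public int hashCode() { return first.hashCode() * 163 + second.hashCode(); }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IntTextPair)) return false;
        IntTextPair p = (IntTextPair) o;
        return first.equals(p.first) && second.equals(p.second);
    }

    @Override
    public String toString() { return first + "\t" + second; }
}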
The custom IntTextPair composite type has an IntWritable as its first key and a Text as its second key; it is designed as the map output key of Job1. As more and more attributes are selected, the sample sub-vector over the dimensions B∪a grows longer and longer, so the second key of the map output key of Job1 becomes longer and longer; if the data are transferred to reduce in this format, the map output volume becomes very large.
The IntIntPair composite type is the second composite type we designed: its first key is an IntWritable and its second key is also an IntWritable. The second key of the map output key of Job1, the sample sub-vector over the dimensions B∪a, is replaced by its hashCode, and the output key type becomes IntIntPair. The benefit of this replacement is that no matter how long the sample sub-vector is, its hashCode has a fixed length, so the map output volume depends only on the number of output records. The replacement, however, introduces a potential source of error: in Java, two strings with the same value necessarily have the same hashCode, but the converse does not hold, so two strings with equal hashCodes need not be equal. Replacing a string by its hashCode therefore needs careful consideration.
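A concrete illustration of this caveat, using a well-known collision of Java's String.hashCode():

public class HashCodeCaveat {
    public static void main(String[] args) {
        // Equal strings always have equal hash codes, but the converse is false:
        // "Aa" and "BB" are different strings that share the hash code 2112.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println("Aa".equals("BB"));                  // false
    }
}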
In addition, the TextLongPair composite type is defined, with a Text as its first key and a LongWritable as its second key. This type is designed to represent a decision class together with its number of occurrences. A Combiner is designed in Job1, and the TextLongPair type is used as the output value type of map to satisfy the requirements of the Combiner function; using a Combiner can effectively improve the performance of MapReduce.
Having built a composite key, we expect the records of the same attribute to be assigned to the same reducer. Using the composite key alone, however, does not achieve this, because records are partitioned on the whole composite key, so records of the same attribute may be sent to different reducers. A partitioner that partitions according to the attribute in the key must also be provided, so that records of the same attribute are guaranteed to reach the same reducer; a custom partitioner class is defined that partitions on the leading field of the composite key.
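Such a partitioner could be sketched as follows; it relies on the IntTextPair type sketched above and would be installed on the job with job.setPartitionerClass(AttributePartitioner.class).

import org.apache.hadoop.mapreduce.Partitioner;

/** Partitions only on the attribute number (the first field of the composite
 *  key), so all records of one candidate attribute go to the same reducer. */
public class AttributePartitioner<V> extends Partitioner<IntTextPair, V> {
    @Override
    public int getPartition(IntTextPair key, V value, int numPartitions) {
        return (key.getFirst().get() & Integer.MAX_VALUE) % numPartitions;
    }
}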
2.3.2 Input and output formats.
How Hadoop splits and reads input files is defined by the implementations of the InputFormat interface. TextInputFormat is the default InputFormat: the key it returns is the byte offset of the line within the file, and the value is the content of the line; the byte offset is usually of no use.
Several commonly used InputFormat classes are listed in Table 2-2.
Table 2-2 Main InputFormat classes
KeyValueTextInputFormat is used for more structured input files: a predefined character, usually the tab character (\t), separates the key and the value of each line.
The data input and output formats can be customized by calling setInputFormatClass() and setOutputFormatClass() on the Job object.
The input of one MapReduce job is usually the output of some other MapReduce job. The output format can be set so that it matches the input format of the following job, which improves processing efficiency. When Hadoop writes data to files it uses an OutputFormat class; the default output format is text output, in which each key/value pair is separated by a tab.
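For illustration, the two jobs could be wired together roughly as follows; this is a fragment that assumes an existing Configuration object conf, and the choice of a SequenceFile as the intermediate format between the jobs is an assumption of the sketch.

// Job1 reads the text data set and writes its <a, POS> records in a binary
// form that Job2 can read back directly.
Job job1 = new Job(conf, "attribute-importance");
job1.setInputFormatClass(TextInputFormat.class);           // the default, shown explicitly
job1.setOutputFormatClass(SequenceFileOutputFormat.class);

// Job2's input format matches Job1's output format.
Job job2 = new Job(conf, "max-importance");
job2.setInputFormatClass(SequenceFileInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);         // final, human-readable result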
Hadoop provides several standard OutputFormat implementations, shown in Table 2-3.
Table 2-3 Main OutputFormat classes
2.3.3 Global variables.
In programming practice we sometimes use global variables, and using them sensibly can improve the efficiency of a program; for example, a static variable declared inside a class can serve as a piece of global state used by several methods.
Hadoop does not support global variables. The Mapper and Reducer classes execute separately, and different tasks execute different map and reduce functions, so a global variable cannot be passed between them; defining a global variable directly in a class therefore does not work. Yet a global variable can sometimes greatly simplify an implementation. Two ways of realizing global variables are introduced below.
i) Save the variable information in a file.
ii) Set the variable as a property in the Configuration.
The idea of the first method is very simple and easy to implement, but the file holding the variable information may be modified or deleted, which the program cannot foresee, so the program may fail to produce the expected result; this method therefore carries a data-safety risk.
Configuration is the core class for Hadoop's configuration information; it implements the Writable interface and is Hadoop's designated form of resource configuration. A resource holds a set of name/value pairs in XML format, and each resource is named by a string or a path. Hadoop uses two default resource paths: core-default.xml and core-site.xml.
An application can add extra resources, and Configuration provides various set and get methods for convenient parameter configuration; these methods can be used to obtain the effect of a global variable.
First construct a Configuration object conf and define the variables that are to be global; call the set() method to add them to the Configuration. Then use the new Job(Configuration) constructor so that conf becomes the configuration of the Job. Inside the Mapper and Reducer classes of the Job, the configuration can be obtained through the getConfiguration() method of the Context class, and the value of a named variable can then be read with the get() method.
Both the Mapper and the Reducer class define setup() and cleanup() methods: setup() is called automatically when a task starts, and cleanup() when it finishes. These two methods can be overridden to operate on the global variables. During task execution the global variables are all encapsulated in the Configuration instance, which guarantees the safety of the data.
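A small sketch of this pattern follows, using the same illustrative parameter names assumed in the Job1 sketch of section 2.2.2.

// Driver side: put the "global" feature-selection state into the Configuration
// before the Job is built; every map and reduce task then sees a copy of it.
Configuration conf = new Configuration();
conf.set("feature.selected", "2,5,9");             // attributes already in B
conf.set("feature.unselected", "0,1,3,4,6,7,8");   // candidate attributes
conf.setInt("feature.decision.index", 10);         // column of the decision attribute D
Job job1 = new Job(conf, "attribute-importance");

// Task side, inside a Mapper or Reducer: read the values back in setup().
// String[] selected = context.getConfiguration().getStrings("feature.selected");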
2.3.4 Using a Combiner to improve performance.
The bandwidth of a cluster is limited and thus limits the amount of data a MapReduce job can move. A large data stream flows between the map tasks and the reduce tasks; if the data transferred between map and reduce can be reduced, the execution efficiency of the job improves. A Combiner can be regarded as a local Reducer: a combining function is applied to the output of a map task, and the merged output serves as the input of reduce. Because the combining function is an optimization, Hadoop cannot determine how many times it will be called on any particular record of the map output; in other words, the result output by reduce must be the same whether or not the combining function is applied.
An example first illustrates how a Combiner improves MapReduce performance. Suppose there are M data splits with N numbers in each, and we write a MapReduce job that finds the maximum of all the data. In the traditional MapReduce job, each map reads the numbers of its split and transfers them to reduce, so M·N numbers reach the reduce function, which takes their maximum; the number of records output by the map tasks is M·N. If instead a combining function is applied to the map output, the local maximum of each split is found first, only the M local maxima are transferred to reduce, and reduce takes the overall maximum; the number of records output by the map tasks is then M, which is generally not very large. A Combiner can thus effectively reduce the volume of data transferred between map and reduce. Sometimes, however, a Combiner is not applicable, for example when computing an average, so Combiners must be used with care in a MapReduce job.
We use Combiners in both Job1 and Job2. The Combiner of Job2 uses the same function as its Reducer, namely taking the maximum. The definition of the Combiner in Job1 is described in detail below.
The reduce function of Job1 computes the value of POS_B∪a(D) for the condition attribute set B∪a: following the consistency-based method, it first counts the frequency of each decision and then takes the frequency of the most frequent decision as the contribution to POS_B∪a(D). The Combiner counts the frequency of each decision locally and passes the decision class together with its frequency, as a combined value, to the reduce function. This pre-processing by the Combiner reduces the number of records output by map and shortens the processing time of reduce. The output format of map and the input format of reduce are modified accordingly: the TextLongWritable composite value new TextLongWritable(D(value), 1) becomes the output value type of map, where D(value) is the decision of the sample and 1 is an attached count, and TextLongWritable(D(value), count) becomes the input value type of reduce, where count is the frequency of D(value) after merging. A sketch of the Combiner of Job1 follows.
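The original code listing is not reproduced in this text; the following sketch shows what such a combiner could look like, assuming a TextLongWritable pair type with getDecision() and getCount() accessors built in the same way as the composite key of section 2.3.1 (these accessor names are assumptions).

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.mapreduce.Reducer;

/** Local combiner for Job1: for one map output key it merges identical
 *  decisions, turning many (decision, 1) values into one (decision, n)
 *  value before the shuffle. */
public class DecisionCombiner
        extends Reducer<IntTextPair, TextLongWritable, IntTextPair, TextLongWritable> {

    @Override
    protected void reduce(IntTextPair key, Iterable<TextLongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        Map<String, Long> freq = new HashMap<>();
        for (TextLongWritable v : values) {
            freq.merge(v.getDecision(), v.getCount(), Long::sum);   // local frequency count
        }
        for (Map.Entry<String, Long> e : freq.entrySet()) {
            ctx.write(key, new TextLongWritable(e.getKey(), e.getValue()));
        }
    }
}

It would be registered with job1.setCombinerClass(DecisionCombiner.class); the key/value types of its input and output match the map output, as required.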
A Combiner must implement the Reducer interface and perform the merge in its reduce() method. The Combiner in fact performs an equivalence-preserving transformation: it takes the output of map as its input, and its output becomes the input of reduce, so the key/value types of the Combiner's input and output must be consistent with the output of map.
2.3.5 Choosing the number of reducers.
One step in parallelizing the reduction algorithm is parallelizing the computation of the importance of the N-M attributes. The obvious choice is to set the number of reducers to N-M, achieving full theoretical parallelism. In practice, however, more reducers does not always mean higher efficiency. How to choose a suitable number of reducers is discussed below.
By default there is only one reducer, hence only one partition: the partitioner places all the data in that single partition and its function becomes irrelevant, and because all intermediate data flow into one reducer task the operating efficiency is extremely low. In real applications more than one reducer is usually needed, and then the role of the partitioner is very important. Too many reducers, on the other hand, split the intermediate map output into many partitions, each of which needs a reducer task; the processors of the cluster are limited, only a limited number of reducer tasks can run at a time while many others wait, and since the intermediate map results are stored on local disk, too many reducer tasks increase the number of I/O operations and lower efficiency.
The optimal number of reducers is related to the number of task slots available in the cluster. The total number of slots is obtained by multiplying the number of nodes (tasktrackers) in the cluster by the number of task slots per node. A tasktracker can run several map tasks simultaneously, controlled by the mapred.tasktracker.map.tasks.maximum property, whose default value is 2; correspondingly, a tasktracker can run several reduce tasks simultaneously, controlled by mapred.tasktracker.reduce.tasks.maximum, whose default is also 2. These property values can be set in the Hadoop configuration files. The actual number of tasks that can run simultaneously on one tasktracker depends on how many processors the machine has. Suppose a node has 8 processors and we plan to run 2 processes per processor; then mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum can each be set to 7, allowing for the datanode and tasktracker processes themselves.
The setNumReduceTasks(int tasks) method of Job is used to set the number of reducers. A common approach is to use slightly fewer reducers than the total number of slots, tolerating the occurrence of some failed reducer tasks, with the formula tasks = 0.95 * (number of nodes * mapred.tasktracker.reduce.tasks.maximum). Alternatively, a somewhat larger number of reducers can be chosen, so that a fast node can start a second batch of reducer tasks after finishing its first, which helps load balancing; in that case the formula tasks = 1.75 * (number of nodes * mapred.tasktracker.reduce.tasks.maximum) can be used.
When computing the importance of the N-M attributes, the records of each attribute must be processed by one reducer task, and the partitioner divides the map output into at most N-M partitions; the algorithm of this article therefore needs at most N-M reducers, and the number of reducers is set to min(N-M, tasks).
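A fragment illustrating this choice; the node and slot figures are illustrative, and totalFeatures and selectedFeatures stand for N and M.

// Choosing the reducer count for Job1.
int nodes = 8;                               // number of tasktracker nodes (illustrative)
int reduceSlotsPerNode = 2;                  // mapred.tasktracker.reduce.tasks.maximum
int tasks = (int) (0.95 * nodes * reduceSlotsPerNode);
int candidates = totalFeatures - selectedFeatures;   // N - M attributes to evaluate
job1.setNumReduceTasks(Math.min(candidates, tasks));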
Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and variations according to the present invention, but all such corresponding changes and variations shall fall within the protection scope of the claims appended to the present invention.

Claims (3)

1. A Hadoop-based fast neighborhood rough set attribute reduction method, characterized in that it comprises the following steps:
a) establishing a Hadoop-based distributed platform: setting up the HDFS distributed file system and the MapReduce parallel programming model; the HDFS distributed file system adopts a master/worker architecture composed of one manager and several workers, wherein the manager manages the namespace of the file system and maintains the file system tree and all files and directories within the tree, and the workers are the working nodes of the file system, storing and retrieving data blocks as required and regularly sending "heartbeat" reports to the manager; if the manager does not receive a worker's "heartbeat" report within a specified period, the manager starts a fault-tolerance mechanism to handle it; the MapReduce parallel programming model divides a task into a number of small tasks for execution, each small task processing the data blocks stored locally on a cluster node;
b) defining the neighborhood rough set: in a database of mixed attributes, a neighborhood information system is expressed as NIS = <U, A, V, f>, where U is the set of samples, A is the set of attributes, V is the value domain of the attributes, and f is the information function f: U × A → V; if B is a numerical feature subset, then for B the neighborhood of x is δ_B(x) = { x_i | x_i ∈ δ_a(x), ∀ a ∈ B };
c) generating the candidate set: a search strategy is adopted to generate a group of feature subsets to be evaluated as the candidate set; the initial candidate set is the empty set, the full feature set, or a randomly generated group of feature subsets;
d) computing the importance degree of each attribute: a Mapper class and a Reducer class are established; the Mapper class reads in the sample data and, according to the already-selected attribute set, assigns to each attribute to be evaluated its corresponding sample set as the input of the Reducer class; each reducer accepts all sample subsets of exactly one attribute, partitioned within the reducer by the composite-key data; the number of Reducer tasks equals the number of attributes to be evaluated, the sample sets keyed by different attribute numbers are input to the corresponding Reducer tasks, and the Reducer tasks execute in parallel; given a neighborhood decision system NDT = <U, A∪D, V, f>, the importance degree of an attribute a is computed as SIG(a, B, D) = γ_B(D) - γ_{B-a}(D), where SIG(a, B, D) reflects the significance of attribute a for the decision attribute D and is used to evaluate the importance degree of each attribute;
e) selecting the attribute with the maximum importance and adding it to the candidate set: the output of step d) is taken as the input of this step and compared with the previous maximum importance value; if the importance value of the current attribute is higher, the current attribute is added to the candidate set as the best feature subset;
f) judging whether a stop condition is satisfied: the feature-generation process and the evaluation process serve as stop conditions; the feature-generation process has two stop conditions, namely whether a predefined number of features has been selected and whether a predefined number of iterations has been reached; the evaluation process has two stop conditions, namely whether adding or removing a feature still produces a better feature subset and whether the optimal feature subset has been obtained;
g) saving the state of the feature selection: the set of selected features and the set of unselected features are saved separately; step d) computes importance degrees for the unselected feature set, step f) updates the selected and unselected feature sets, and finally the selected feature set and the unselected feature set are output as the result.
2. The Hadoop-based fast neighborhood rough set attribute reduction method according to claim 1, characterized in that: the search strategy in step c) adopts a random-probability search algorithm, which selects feature subsets at random with equal probability, keeps the best subset found so far under a given evaluation criterion, and continually compares it with newly selected values until a subset satisfying the predefined condition is found or a predefined number of attempts is reached, at which point the algorithm stops.
3. The Hadoop-based fast neighborhood rough set attribute reduction method according to claim 1 or 2, characterized in that: the evaluation method in step d) adopts an attribute evaluation method based on dependency degree or an attribute evaluation method based on consistency.
CN201310224008.1A 2013-06-06 2013-06-06 Hadoop-based fast neighborhood rough set attribute reduction method Active CN103336790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310224008.1A CN103336790B (en) 2013-06-06 2013-06-06 Hadoop-based fast neighborhood rough set attribute reduction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310224008.1A CN103336790B (en) 2013-06-06 2013-06-06 Hadoop-based fast neighborhood rough set attribute reduction method

Publications (2)

Publication Number Publication Date
CN103336790A true CN103336790A (en) 2013-10-02
CN103336790B CN103336790B (en) 2015-02-25

Family

ID=49244955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310224008.1A Active CN103336790B (en) 2013-06-06 2013-06-06 Hadoop-based fast neighborhood rough set attribute reduction method

Country Status (1)

Country Link
CN (1) CN103336790B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588363A (en) * 2004-10-14 2005-03-02 上海交通大学 Backward coarse collecting attribute reducing method using directed search
CN101082925A (en) * 2007-07-09 2007-12-05 山西大学 Rough set property reduction method based on SQL language
US20100162230A1 (en) * 2008-12-24 2010-06-24 Yahoo! Inc. Distributed computing system for large-scale data handling
CN101763529A (en) * 2010-01-14 2010-06-30 中山大学 Rough set attribute reduction method based on genetic algorithm
CN102521534A (en) * 2011-12-03 2012-06-27 南京大学 Intrusion detection method based on crude entropy property reduction
CN102750309A (en) * 2012-03-19 2012-10-24 南京大学 Parallelization support vector machine (SVM) solving method based on Hadoop

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559205A (en) * 2013-10-09 2014-02-05 山东省计算中心 Parallel feature selection method based on MapReduce
CN103646118A (en) * 2013-12-27 2014-03-19 重庆绿色智能技术研究院 Confidence dominance-based rough set analysis model and attribute reduction methods
CN103646118B (en) * 2013-12-27 2017-01-18 中国科学院重庆绿色智能技术研究院 Confidence dominance-based rough set analysis model and attribute reduction methods
CN103927331B (en) * 2014-03-21 2017-03-22 珠海多玩信息技术有限公司 Data querying method, data querying device and data querying system
CN104063230B (en) * 2014-07-09 2017-03-01 中国科学院重庆绿色智能技术研究院 The parallel reduction method of rough set based on MapReduce, apparatus and system
CN104063230A (en) * 2014-07-09 2014-09-24 中国科学院重庆绿色智能技术研究院 Rough set parallel reduction method, device and system based on MapReduce
CN104462020A (en) * 2014-10-21 2015-03-25 西南交通大学 Matrix increment reduction method based on knowledge granularity
CN104462020B (en) * 2014-10-21 2017-07-07 西南交通大学 A kind of matrix increment reduction method of knowledge based granularity
CN104915430A (en) * 2015-06-15 2015-09-16 南京邮电大学 Method for obtaining constraint relation rough set rules based on MapReduce
CN104915430B (en) * 2015-06-15 2018-02-23 南京邮电大学 A kind of restriction relation rough set regulation obtaining method based on MapReduce
CN105245498A (en) * 2015-08-28 2016-01-13 中国航天科工集团第二研究院七〇六所 Attack digging and detecting method based on rough set
CN106708875A (en) * 2015-11-16 2017-05-24 阿里巴巴集团控股有限公司 Characteristic screening method and system
CN106708609A (en) * 2015-11-16 2017-05-24 阿里巴巴集团控股有限公司 Characteristics generation method and system
CN106708609B (en) * 2015-11-16 2020-06-26 阿里巴巴集团控股有限公司 Feature generation method and system
CN106886519A (en) * 2015-12-15 2017-06-23 中国移动通信集团公司 A kind of attribute processing methods and server
CN105911962A (en) * 2016-01-29 2016-08-31 中国科学院遥感与数字地球研究所 Optimization method for load equalization of meteorological data processing module
CN105911962B (en) * 2016-01-29 2019-03-15 中国科学院遥感与数字地球研究所 The optimization method of load balancing is carried out to process meteorological data module
CN106991070A (en) * 2016-10-11 2017-07-28 阿里巴巴集团控股有限公司 Real-time computing technique and device
CN106991070B (en) * 2016-10-11 2021-02-26 创新先进技术有限公司 Real-time computing method and device
CN106844054A (en) * 2017-01-23 2017-06-13 重庆邮电大学 A kind of optimization method of Hadoop MapReduce
CN109916404A (en) * 2017-12-12 2019-06-21 顺丰科技有限公司 A kind of paths planning method, device, storage medium and equipment
CN109325062A (en) * 2018-09-12 2019-02-12 哈尔滨工业大学 A kind of data dependence method for digging and system based on distributed computing
CN109325062B (en) * 2018-09-12 2020-09-25 哈尔滨工业大学 Data dependency mining method and system based on distributed computation
CN109431497A (en) * 2018-10-23 2019-03-08 南京医科大学 A kind of brain-electrical signal processing method and epilepsy detection system
CN109431497B (en) * 2018-10-23 2020-08-11 南京医科大学 Electroencephalogram signal processing method and epilepsy detection system
CN109670695A (en) * 2018-12-12 2019-04-23 太原科技大学 Mechanical Product's Machining process exception parallel detecting method based on outlier data digging
CN113345588A (en) * 2018-12-21 2021-09-03 西安交通大学 Rapid attribute reduction method for incomplete data set
CN109885051A (en) * 2019-02-21 2019-06-14 彭劲松 A kind of ecological environment health quality appraisal procedure
CN109934278A (en) * 2019-03-06 2019-06-25 宁夏医科大学 A kind of high-dimensional feature selection method of information gain mixing neighborhood rough set
CN109934278B (en) * 2019-03-06 2023-06-27 宁夏医科大学 High-dimensionality feature selection method for information gain mixed neighborhood rough set
CN111693726A (en) * 2019-03-14 2020-09-22 辽宁工程技术大学 Ventilation system fault diagnosis wind speed sensor arrangement method based on neighborhood rough set
CN111259947A (en) * 2020-01-13 2020-06-09 国网浙江省电力有限公司信息通信分公司 Power system fault early warning method and system based on multi-mode learning
CN113495800A (en) * 2020-04-02 2021-10-12 北京航空航天大学 Diagnostic prediction data and feature re-recognition method based on extended multi-attribute decision making
CN112435742B (en) * 2020-10-22 2023-10-20 北京工业大学 Neighborhood rough set method for feature reduction of fMRI brain function connection data
CN112435742A (en) * 2020-10-22 2021-03-02 北京工业大学 Neighborhood rough set method for feature reduction of fMRI brain function connection data
CN112434923A (en) * 2020-11-16 2021-03-02 太原科技大学 Mechanical product quality analysis method based on subspace clustering
CN112434923B (en) * 2020-11-16 2024-02-06 太原科技大学 Mechanical product quality analysis method based on subspace clustering
CN112699924A (en) * 2020-12-22 2021-04-23 安徽卡思普智能科技有限公司 Method for identifying lateral stability of vehicle
CN114023063A (en) * 2021-11-02 2022-02-08 大连理工大学 Intelligent traffic system collaborative decision-making method based on cognitive network
CN116342168A (en) * 2023-05-23 2023-06-27 山东灵动电子商务有限公司 Information big data intelligent acquisition management system

Also Published As

Publication number Publication date
CN103336790B (en) 2015-02-25

Similar Documents

Publication Publication Date Title
CN103336790B (en) Hadoop-based fast neighborhood rough set attribute reduction method
CN103336791B (en) Hadoop-based fast rough set attribute reduction method
JP7344327B2 (en) System and method for metadata-driven external interface generation of application programming interfaces
US9646262B2 (en) Data intelligence using machine learning
WO2016018942A1 (en) Systems and methods for an sql-driven distributed operating system
US20230316111A1 (en) Interpretation of machine leaning results using feature analysis
Gong et al. Keywords‐driven web APIs group recommendation for automatic app service creation process
Kalipe et al. Big Data Architectures: A detailed and application oriented review
Sahoo et al. An efficient fast algorithm for discovering closed+ high utility itemsets
Sanin et al. Manufacturing collective intelligence by the means of Decisional DNA and virtual engineering objects, process and factory
Niu Optimization of teaching management system based on association rules algorithm
CN111930944A (en) File label classification method and device
AU2020101842A4 (en) DAI- Dataset Discovery: DATASET DISCOVERY IN DATA ANALYTICS USING AI- BASED PROGRAMMING.
Hashem et al. A review of modeling toolbox for BigData
Liu Apache spark machine learning blueprints
Hashem et al. Pre-processing and modeling tools for bigdata
Ashari et al. A Systematic Literature Review: Database Optimization Techniques
Xu et al. Research on performance optimization and visualization tool of Hadoop
Wang et al. An intelligent DevOps platform research and design based on machine learning
Haldankar et al. A MapReduce based approach for classification
AU2020104034A4 (en) IML-Cloud Data Performance: Cloud Data Performance Improved using Machine Learning.
AU2020103522A4 (en) DAMA- Dataset Discovery: Dataset Discovery in Data Analytics and Machine Learning Algorithm
Uluwiyah Trusted big data for official statistics: Study case: Statistics Indonesia (BPS)
Dharavath et al. Quantitative analysis of frequent itemsets using Apriori algorithm on Apache Spark framework
Singh et al. An Analysis of Big Data Analytics

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant