CN1588363A - Backward coarse collecting attribute reducing method using directed search - Google Patents



Publication number
CN1588363A
CN1588363A (application CN200410067151A)
Authority
CN
China
Prior art keywords
attribute
memory block
transient state
attribute set
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410067151
Other languages
Chinese (zh)
Other versions
CN1300730C (en)
Inventor
杨胜 (Yang Sheng)
施鹏飞 (Shi Pengfei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CNB2004100671515A priority Critical patent/CN1300730C/en
Publication of CN1588363A publication Critical patent/CN1588363A/en
Application granted granted Critical
Publication of CN1300730C publication Critical patent/CN1300730C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The directed-search backward rough set attribute reduction method uses the mutual information and the redundancy-synergy coefficient of an attribute subset as the measures for rough set attribute reduction. The initial attribute set is first sorted; from the children of the initial attribute set, several equivalent attribute subsets with minimum redundancy-synergy coefficient are selected and stored in the directed memory block. Next, several equivalent attribute subsets with minimum redundancy-synergy coefficient are selected from the children of those subsets for further search, and so on, until no further equivalent attribute subset can be found. The attribute subsets ultimately stored in the directed memory block constitute the attribute reduction result. The method is flexible, simple, and general, and can be applied in all fields of rough set attribute reduction.

Description

Backward rough set attribute reduction method using directed search
Technical field
The present invention relates to a rough set attribute reduction method, and in particular to a backward rough set attribute reduction method that uses mutual information as the reduction measure and adopts a directed (beam) search technique, providing a good approach for rough set knowledge acquisition. The invention belongs to the field of information processing.
Background technology
With the rapid development of information technology and the wide application of database management systems (DBMS), the data people accumulate grow ever larger. The rapidly growing data hide much important information, and people hope to analyze them at a higher level in order to make better use of them. Current database systems can efficiently perform functions such as data entry, querying, and statistics, but they cannot discover the relations and rules hidden in the data, nor can they predict future trends from existing data. The lack of means to mine the knowledge hidden behind the data has caused the phenomenon of "data explosion but knowledge poverty." Research on methods that can draw conclusions from massive information therefore becomes increasingly important, yet advanced intelligent data analysis technology is still far from mature.
Rough set theory, proposed by Z. Pawlak, is a theoretical method for inducing and expressing uncertain and incomplete knowledge and data. It is widely used in data mining, machine learning, artificial intelligence, fault diagnosis, and other fields, and has become a research focus in recent years. Rough set theory obtains classification rules through attribute reduction and value reduction, and then handles classification problems. Attribute reduction is a basic operation in the classification rule acquisition process of rough set theory: it deletes irrelevant and redundant attributes while preserving the classification capability of the initial attribute set. On the basis of attribute reduction, further value reduction yields simplified classification rules.
Minimum (also called optimal) attribute reduction obtains a minimal attribute subset whose classification capability equals that of the initial attribute set. The goal of rough set attribute reduction is minimum attribute reduction, which has been proved to be NP-hard. Current attribute reduction methods fall into two broad classes:
(1) Complete search methods. A complete search evaluates every possible attribute subset and obtains the minimum attribute reduction result. The most complete search is exhaustive combinatorial search, i.e., evaluating every attribute combination; this is the most time-consuming approach, e.g., forward exhaustive combination search. When the evaluation measure is monotonic, branch-and-bound search can be used. When mutual information serves as the attribute reduction measure, branch-and-bound methods such as automatic branch and bound (ABB) and branch and bound (B&B) apply; both use the mutual information of the initial attribute set as the bound. The difference is that the former is a breadth-first search while the latter adopts depth-first search. Only complete search can guarantee minimum attribute reduction, but its time complexity is exponential; when the attribute set is too large (typically > 20 attributes), complete search becomes inapplicable because the running time is too long.
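To make the complete-search baseline concrete, the following is a minimal sketch, not the patent's code: the function names, the count-based mutual-information estimator, the tolerance, and the toy data are all illustrative assumptions. It evaluates every attribute combination, smallest first, and returns the minimum subsets that preserve the mutual information of the full attribute set.

```python
from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(samples, attrs, labels):
    # Count-based estimate of I(A;P); `attrs` is a tuple of column indices.
    n = len(samples)
    joint, am, pm = Counter(), Counter(), Counter(labels)
    for s, y in zip(samples, labels):
        k = tuple(s[i] for i in attrs)
        joint[(k, y)] += 1
        am[k] += 1
    return sum(c / n * log2((c / n) / (am[k] / n * pm[y] / n))
               for (k, y), c in joint.items())

def exhaustive_reduct(samples, labels, p, tol=1e-12):
    # Evaluate every attribute combination, smallest first; return all
    # minimum subsets whose MI with the class equals that of the full set.
    target = mutual_information(samples, tuple(range(p)), labels)
    for size in range(1, p + 1):
        hits = [set(c) for c in combinations(range(p), size)
                if abs(mutual_information(samples, c, labels) - target) < tol]
        if hits:
            return hits
    return [set(range(p))]

# Toy data (hypothetical): the label is f0 XOR f1; f2 is constant, hence removable.
samples = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
labels = [0, 1, 1, 0]
```

On this toy set the only minimum reduct is {f0, f1}; the exponential cost comes from the `combinations` enumeration, which is exactly what the beam-search method of the invention avoids.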
(2) Heuristic search. Heuristic search proceeds along a chosen direction; the most common form is best-first search. A common heuristic attribute reduction method examines each attribute in turn to see whether it can be deleted, so the result obviously varies with the order in which attributes are examined. There are also best-first heuristic attribute reduction methods based on mutual information, which start from the core and perform reduction with maximizing mutual information as the search direction. The drawback of heuristics is that they are unidirectional: only one direction is advanced and explored. Their running time is greatly reduced compared with complete search, but they often produce a poor attribute reduction result.
Summary of the invention
The objective of the present invention is to overcome the shortcomings of existing rough set attribute reduction methods and to provide a new rough set attribute reduction method that achieves high-quality attribute reduction with fast computation, meeting the practical needs of classification learning.
In order to realize such a purpose, the present invention uses the mutual information of an attribute subset together with its redundancy-synergy coefficient (RSC), defined as RSC(A) = I(A;P) / Σ_{i=1}^{a} I(f_i;P) for A = {f_i | i = 1,…,a}, as the measure for rough set attribute reduction. Starting from the sorted initial attribute set F, the method chooses from the children of the initial attribute set (a child is the attribute subset obtained by deleting one attribute) the M equivalent attribute subsets (an equivalent attribute subset is one whose mutual information is equal) with minimum redundancy-synergy coefficient and stores them in the directed memory block. Then, from the children of these M equivalent subsets, it again chooses the M equivalent subsets with minimum redundancy-synergy coefficient, stores them in the directed memory block, and searches further; and so on, until no equivalent attribute subset can be found. The attribute subsets finally stored in the directed memory block are the attribute reduction result.
The concrete steps of the inventive method are as follows:
1. Initialization: sort the attributes of the initial attribute set F in ascending order of mutual information, and deposit the sorted initial attribute set F into the directed memory block (Beam).
2. Beam search: empty the transient memory block (Queue). For each attribute subset in the directed memory block, by the properties of the redundancy-synergy coefficient, its M equivalent children with minimum redundancy-synergy coefficient — i.e., the first M equivalent children — can be found by deleting attributes one by one from front to back; deposit them into the transient memory block. Here the redundancy-synergy coefficient is RSC(A) = I(A;P) / Σ_{i=1}^{a} I(f_i;P), A = {f_i | i = 1,…,a}, where A denotes an attribute subset, f_i denotes an attribute, I(A;P) denotes the mutual information between A and the class attribute P, and I(f_i;P) denotes the mutual information between f_i and P. If an attribute subset has fewer than M equivalent children, deposit all of its equivalent children into the transient memory block.
3. Beam search stop condition: if the transient memory block contains attribute subsets, empty the directed memory block, find the M attribute subsets with minimum redundancy-synergy coefficient in the transient memory block, and deposit them into the directed memory block; if the transient memory block holds fewer than M attribute subsets, deposit all of them into the directed memory block; then continue with the beam search of step 2. If the transient memory block contains no attribute subset, output all attribute subsets in the directed memory block as the attribute reduction result.
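The three steps above can be sketched as follows — an illustrative Python rendering, not the patent's implementation; the count-based mutual-information estimator, the equality tolerance, and the toy data are assumptions:

```python
from collections import Counter
from math import log2

def mi(samples, attrs, labels):
    # Count-based estimate of I(A;P); `attrs` is a tuple of column indices.
    n = len(samples)
    joint, am, pm = Counter(), Counter(), Counter(labels)
    for s, y in zip(samples, labels):
        k = tuple(s[i] for i in attrs)
        joint[(k, y)] += 1
        am[k] += 1
    return sum(c / n * log2((c / n) / (am[k] / n * pm[y] / n))
               for (k, y), c in joint.items())

def rsc(samples, attrs, labels):
    # RSC(A) = I(A;P) / sum_i I(f_i;P); infinite when the denominator is 0.
    denom = sum(mi(samples, (i,), labels) for i in attrs)
    return mi(samples, attrs, labels) / denom if denom else float("inf")

def backward_reduction(samples, labels, p, M=2, tol=1e-12):
    # Step 1: sort attributes in ascending order of single-attribute MI.
    order = sorted(range(p), key=lambda i: mi(samples, (i,), labels))
    beam = [tuple(order)]                      # directed memory block
    target = mi(samples, tuple(range(p)), labels)
    while True:
        queue = []                             # transient memory block
        for parent in beam:
            kids = []
            for j in range(len(parent)):       # delete attributes front to back
                child = parent[:j] + parent[j + 1:]
                if child and abs(mi(samples, child, labels) - target) < tol:
                    kids.append(child)         # equivalent child found
                if len(kids) == M:
                    break                      # first M children suffice
            queue.extend(kids)
        if not queue:                          # step 3: stop condition
            return [set(b) for b in beam]
        queue = list(dict.fromkeys(queue))     # drop duplicates across parents
        beam = sorted(queue, key=lambda a: rsc(samples, a, labels))[:M]

# Toy data (hypothetical): label = f0 XOR f1; f2 is constant, hence removable.
samples = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
labels = [0, 1, 1, 0]
```

On the toy data the only reduct is {f0, f1}, found both with M = 2 and with M = 1, matching the observation that best-first search is the special case M = 1.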
The method of the present invention can guarantee both fast computation and the quality of the attribute reduction result through a flexible choice of M. An initial value of M can be set according to the size of the initial attribute set — the larger the initial attribute set, the smaller the initial M — and then adjusted according to the running time: if the running time is too long, decrease M; otherwise, increase M, until a satisfactory attribute reduction result is obtained. A larger M enlarges the search range and thus yields better attribute reduction results, while a smaller M guarantees fast computation. The present invention is a heuristic attribute reduction method; unlike the usual best-first method, it can be regarded as an extension of best-first search — or, equivalently, best-first search is its special case.
The present invention uses the mutual information of an attribute subset and an inter-attribute information redundancy measure — the redundancy-synergy coefficient — as the attribute reduction measures, performing attribute reduction in a single backward sweep. The method is flexible and simple to implement, well targeted and highly general, has polynomial time complexity, and can be applied in all fields of rough set attribute reduction.
Description of drawings
Fig. 1 is a schematic diagram of the beam search in the method of the invention.
Embodiment
For a better understanding of the technical scheme of the present invention, it is further described below with reference to the drawings and an embodiment.
(1) initialization:
Sort the attributes of the initial attribute set F in ascending order of mutual information I(f_i;P), and deposit the sorted initial attribute set F into the directed memory block (Beam). Arranging the attributes in ascending order of mutual information makes it convenient to find, for each attribute subset in the directed memory block, the first M equivalent children with minimum redundancy-synergy coefficient; this compresses the beam search space and reduces the search time.
Note that the redundancy-synergy coefficient describes, as a quotient of information quantities, the degree of redundancy and the combined synergy of an attribute subset. For A ⊆ F (A = {f_i | f_i ∈ A, i = 1,…,a}), RSC(A) is called the redundancy-synergy coefficient of the attribute subset A and is computed as in formula (1):

RSC(A) = I(A;P) / Σ_{i=1}^{a} I(f_i;P)    (1)
The redundancy-synergy coefficient is a relative information measure whose value ranges over (0, ∞). The smaller the redundancy-synergy coefficient, the weaker the combined synergy of the attributes and the greater the redundancy of class information among them, so the more attributes can be deleted while keeping the mutual information unchanged. It has the following two properties:
(1) If I(A;P) = I(B;P) and A ⊆ B, then RSC(A) ≥ RSC(B).
(2) For an attribute subset A ⊆ F, A = {f_1, f_2, …, f_a}, if I(f_1;P) < I(f_2;P) < … < I(f_a;P) and I(A−{f_i};P) = I(A;P) for i = 1, 2, …, a, then RSC(A−{f_1}) < RSC(A−{f_2}) < … < RSC(A−{f_a}).
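As an illustration of the measure's two regimes (a sketch under assumed toy data; the count-based MI estimator and the data are hypothetical), a fully redundant attribute pair gives RSC = 1/2, while a purely synergistic (XOR) pair drives the denominator to zero:

```python
from collections import Counter
from math import log2

def mi(samples, attrs, labels):
    # Count-based estimate of I(A;P); `attrs` is a tuple of column indices.
    n = len(samples)
    joint, am, pm = Counter(), Counter(), Counter(labels)
    for s, y in zip(samples, labels):
        k = tuple(s[i] for i in attrs)
        joint[(k, y)] += 1
        am[k] += 1
    return sum(c / n * log2((c / n) / (am[k] / n * pm[y] / n))
               for (k, y), c in joint.items())

def rsc(samples, attrs, labels):
    # RSC(A) = I(A;P) / sum_i I(f_i;P); infinite when the denominator is 0.
    denom = sum(mi(samples, (i,), labels) for i in attrs)
    return mi(samples, attrs, labels) / denom if denom else float("inf")

# Redundant pair: two copies of a perfect predictor -> small RSC (1/2),
# signalling that one attribute can be deleted without losing mutual information.
dup = [(0, 0), (1, 1), (0, 0), (1, 1)]
dup_y = [0, 1, 0, 1]
# Synergistic pair (XOR): each attribute alone carries no class information,
# so the denominator vanishes and RSC diverges.
xor = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_y = [0, 1, 1, 0]
```

Low RSC thus marks subsets with deletable redundancy, which is why the method keeps the M lowest-RSC equivalent subsets at each level.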
In the present invention the attributes of the initial attribute set F are first arranged in ascending order of mutual information. By property (2) of the redundancy-synergy coefficient, with this ordering the first M equivalent children of each parent attribute subset can be found simply by deleting one attribute at a time from front to back, without considering all children of the parent. Because for each node (attribute subset) in the directed memory block Beam these first M equivalent children have the minimum redundancy-synergy coefficients, this greatly saves running time. This is why the initialization step sorts the attributes of F in ascending order of mutual information.
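The effect of property (2) can be checked arithmetically: holding the numerator I(A;P) fixed across equivalent children, deleting the attribute with the smallest individual mutual information leaves the largest denominator and hence the smallest RSC. The figures below are made up purely for illustration:

```python
# Assume every child A - {f_i} stays equivalent to A, so all children share
# the numerator I(A;P); the numbers are hypothetical.
I_full = 0.9                        # I(A;P), shared by all equivalent children
single_mi = [0.1, 0.2, 0.4, 0.5]    # I(f_i;P), already sorted ascending
total = sum(single_mi)
# Child A - {f_i} keeps every term except I(f_i;P) in its denominator.
child_rsc = [I_full / (total - m) for m in single_mi]
# Deleting the front (smallest-MI) attribute leaves the largest denominator,
# hence the smallest RSC: front-to-back deletion yields children already
# ordered by ascending RSC, so the first M deletions are the M best.
```

This is the arithmetic reason the method only needs front-to-back deletion instead of examining all children of each parent.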
(2) beam search:
Best-first search normally takes the single node with the best evaluation measure as the starting point of the next search step, whereas beam search takes the M best-evaluated nodes as the starting points of the next step. Beam search can be viewed as a "tree search of finite width" whose tree search width M is called the directed width. The beam search process is shown in Fig. 1, where dark nodes are the nodes used for further search, white nodes are the nodes discarded during the search, and the directed width M is 2. At each level, the two best tree nodes satisfying the optimality condition serve as the starting points of the next search step, until the stop condition is satisfied; the final result is nodes 1 and 2. If only K (K < M) equivalent attribute subsets with minimum redundancy-synergy coefficient can be found, those K subsets are used for further search.
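Independently of attribute reduction, the finite-width tree search described here can be sketched generically; the toy node and score definitions below are illustrative assumptions, not part of the patent:

```python
import heapq

def beam_search(start, children, score, M):
    # Finite-width tree search: keep the M best-scoring nodes per level
    # (lower score is better) and stop when no node yields children.
    beam = [start]
    while True:
        level = [c for node in beam for c in children(node)]
        if not level:
            return beam
        beam = heapq.nsmallest(M, level, key=score)

# Toy tree (hypothetical): nodes are strings grown by 'a'/'b' up to length 3;
# the score penalizes 'b', so a width-2 beam steers toward all-'a' strings.
kids = lambda s: [s + c for c in "ab"] if len(s) < 3 else []
penalty = lambda s: s.count("b")
```

With M = 1 this degenerates to best-first search; with M equal to the branching factor times the depth it approaches a full breadth-first sweep, which mirrors the M trade-off described above.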
The redundancy-synergy coefficient measures the redundancy and combined synergy of the attributes in an attribute subset. The smaller the coefficient, the greater the redundancy and the more redundant attributes can be deleted — that is, the more likely a smaller attribute subset equivalent to F can be found. Therefore the redundancy-synergy coefficient can serve as the selection measure for attribute subsets and, combined with the beam search method, backward attribute-deleting reduction is performed.
(3) the beam search stop condition is differentiated:
When the transient memory block is empty, no equivalent attribute subset has been found, so the equivalent attribute subsets last stored in the directed memory block are taken as the smallest equivalent attribute subsets found; the beam search therefore stops and the attribute reduction result is obtained. Otherwise, further beam search can be made: find the M attribute subsets with minimum redundancy-synergy coefficient in the transient memory block and deposit them into the directed memory block (if the transient memory block holds fewer than M attribute subsets, deposit them all), and continue with the search of step (2).
The running time of the attribute reduction method of the present invention depends on two factors: (1) the computation of the mutual information of attribute subsets; (2) the search space, i.e., the number of attribute subsets evaluated. The time to evaluate one attribute subset depends on the partition of the sample set (p attributes, m samples) induced by the subset; using hashing to compute the partition, the time complexity of one evaluation is O(m). If r is the size of a resulting reduct, the number of attribute subsets evaluated by the method is at most 0.5·M·(p−r)·(p−1+r)+p+1, so the time complexity of the invention is O(mMp²). In practice, because attribute sorting and the child-generation scheme eliminate unnecessary subset evaluations, the search space of the invention is much smaller than 0.5·M·(p−r)·(p−1+r)+p+1. When M = 1, the time complexity of the invention is O(mp).
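For a sense of scale, the bound can be evaluated directly; the instance below uses Mushroom-sized figures (p = 22 attributes, reduct size r = 4, taken from the tables that follow), and even with M = p the bound is orders of magnitude below the 2^p subsets a complete search may face:

```python
def search_space_bound(M, p, r):
    # The patent's upper bound on the number of attribute subsets evaluated.
    return 0.5 * M * (p - r) * (p - 1 + r) + p + 1

# Mushroom-sized instance: p = 22 attributes, reduct size r = 4, M = p.
evaluated = search_space_bound(22, 22, 4)
exhaustive = 2 ** 22   # subsets an exhaustive combinatorial search may face
```

Here the bound evaluates to a few thousand subset evaluations versus roughly four million for exhaustive enumeration, consistent with the polynomial O(mMp²) complexity claimed above.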
The experiments use five UCI benchmark data sets: Corral, Monk1, Parity5+2, Vote, and Mushroom. First the ABB method was used for attribute reduction; its results and running times are shown in Table 1. For the Mushroom data set the running time exceeded 2 hours, so the ABB method is considered inapplicable there, denoted by "-". The attribute reduction results of the method of the invention with M set to 1, p, and 2p respectively are shown in Table 2. As the tables show, the method obtains almost all of the minimum attribute reduction subsets while the running time drops greatly relative to the ABB method. For the Mushroom data set the method also obtains good attribute reduction results, which the ABB method, being a complete search, cannot.
Table 1. Data set information and ABB attribute reduction results

Data set    Samples  Initial attributes  u  AS                               t (ms)
Corral      128      6                   2  {f1-f4}                          3
Monk1       432      6                   2  {f1,f2,f5}                       19
Parity5+2   1024     10                  2  {f1-f5}(1), {f1,f3-f5,f7}(2),    650
                                            {f2-f6}(3), {f3-f7}(4)
Vote        435      16                  2  {f1-f4,f9,f11,f13,f15,f16}       2697
Mushroom    8124     22                  2  -                                -

u is the number of classes, AS is the attribute reduction subset, and t is the running time.
Table 2. Attribute reduction results of the method of the invention

Data set    M    AS                                           t (ms)
Corral      2p   {f1-f4}                                      2
            p    {f1-f4}                                      2
            1    {f1-f4}                                      2
Monk1       2p   {f1,f2,f5}                                   13
            p    {f1,f2,f5}                                   13
            1    {f1,f2,f5}                                   4
Parity5+2   2p   {f3-f7}(1), {f1,f3-f5,f7}(2),                403
                 {f2-f6}(3), {f1-f5}(4)
            p    {f3-f7}(1), {f1,f3-f5,f7}(2), {f2-f6}(3)     397
            1    {f2-f5}                                      49
Vote        2p   {f1-f4,f9,f11,f13,f15,f16}                   985
            p    {f1-f4,f9,f11,f13,f15,f16}                   765
            1    {f1-f4,f9,f11,f13,f15,f16}                   42
Mushroom    2p   {f5,f20,f21,f22}(1), {f4,f5,f12,f22}(2)      659219
            p    15                                           369640
            1    {f5,f8,f12,f19,f20}                          2389

Claims (1)

1. A backward rough set attribute reduction method using directed search, characterized by comprising the steps of:
1) initialization: sorting the attributes of the initial attribute set in ascending order of mutual information, and depositing the sorted initial attribute set into a directed memory block;
2) beam search: emptying a transient memory block; for each attribute subset in the directed memory block, by the properties of the redundancy-synergy coefficient, finding its M equivalent children with minimum redundancy-synergy coefficient — i.e., the first M equivalent children — by deleting attributes one by one from front to back, and depositing them into the transient memory block, wherein the redundancy-synergy coefficient is RSC(A) = I(A;P) / Σ_{i=1}^{a} I(f_i;P), A = {f_i | i = 1,…,a}, A denotes an attribute subset, f_i denotes an attribute, I(A;P) denotes the mutual information between A and the class attribute P, and I(f_i;P) denotes the mutual information between f_i and P; if an attribute subset has fewer than M equivalent children, depositing all of its equivalent children into the transient memory block; wherein the initial value of M is set according to the size of the initial attribute set and is adjusted with the running time — the larger the initial attribute set, the smaller the initial M; if the running time is too long, M is decreased, otherwise M is increased, until a satisfactory attribute reduction result is obtained;
3) beam search stop condition: if the transient memory block contains attribute subsets, emptying the directed memory block, finding the M attribute subsets with minimum redundancy-synergy coefficient in the transient memory block and depositing them into the directed memory block — if the transient memory block holds fewer than M attribute subsets, depositing all of them into the directed memory block — and then continuing with the beam search of step 2); if the transient memory block contains no attribute subset, outputting all attribute subsets in the directed memory block as the attribute reduction result.
CNB2004100671515A 2004-10-14 2004-10-14 Backward coarse collecting attribute reducing method using directed search Expired - Fee Related CN1300730C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100671515A CN1300730C (en) 2004-10-14 2004-10-14 Backward coarse collecting attribute reducing method using directed search


Publications (2)

Publication Number Publication Date
CN1588363A true CN1588363A (en) 2005-03-02
CN1300730C CN1300730C (en) 2007-02-14

Family

ID=34604132

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100671515A Expired - Fee Related CN1300730C (en) 2004-10-14 2004-10-14 Backward coarse collecting attribute reducing method using directed search

Country Status (1)

Country Link
CN (1) CN1300730C (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103787969B (en) 2012-10-30 2016-07-06 上海京新生物医药有限公司 A kind of (1S)-1-phenyl-3,4-dihydro-2(1H) preparation method of-isoquinolinecarboxylic acid ester

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263332B1 (en) * 1998-08-14 2001-07-17 Vignette Corporation System and method for query processing of structured documents

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method
CN103336790B (en) * 2013-06-06 2015-02-25 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method
CN112435742A (en) * 2020-10-22 2021-03-02 北京工业大学 Neighborhood rough set method for feature reduction of fMRI brain function connection data
CN112435742B (en) * 2020-10-22 2023-10-20 北京工业大学 Neighborhood rough set method for feature reduction of fMRI brain function connection data

Also Published As

Publication number Publication date
CN1300730C (en) 2007-02-14


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070214

Termination date: 20091116