CA2494799A1 - Method for clustering decision trees in data classifiers - Google Patents

Method for clustering decision trees in data classifiers

Info

Publication number
CA2494799A1
Authority
CA
Canada
Prior art keywords
decision
decision trees
decision tree
trees
similar
Prior art date
Legal status
Abandoned
Application number
CA002494799A
Other languages
French (fr)
Inventor
Alexander F. Tulai
Current Assignee
ROCA ENGINEERING Ltd
Original Assignee
ROCA ENGINEERING Ltd
Priority date
Filing date
Publication date
Application filed by ROCA ENGINEERING Ltd filed Critical ROCA ENGINEERING Ltd
Priority to CA002494799A
Publication of CA2494799A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for clustering decision trees that in one embodiment can be implemented in a genetics-based data classifier for the purpose of speeding up the classification process and increasing the classification accuracy. The present invention relates to a decision tree clustering method whereby, in order to increase the classification speed and accuracy of a data classifier using groups of decision trees, similar decision trees are identified and clustered. When the method is presented with a group of decision trees encoding in each of their leaf nodes a class from the same group of classes, the method identifies similar decision trees, where two decision trees are said to be similar if all the data instances correctly classified by a first decision tree are included in the set of data instances correctly classified by a second decision tree, in which case the second decision tree is said to be greater than or equal to the first decision tree. The clustering is performed by placing a decision tree in the same cluster with another decision tree that is greater than or equal to it, and the process is repeated until no more clustering is possible.

Description

SPECIFICATIONS
TO WHOM IT MAY CONCERN, BE IT KNOWN THAT, Roca Engineering Ltd., a Canadian company, claims the rights to the invention of a new and useful "Method for clustering decision trees in data classifiers", method of which the following is a specification:
Method for increasing the classification speed and accuracy by clustering decision trees in data classifiers BACKGROUND OF THE INVENTION
Field of the Invention The present invention relates to data classification, and more particularly, to a method for increasing the classification speed and the accuracy of decision-tree-based data classifiers by decision tree clustering.
Related Background Art Conventionally, for constructing a data classifier from training examples, one procedure utilizes a single decision tree (also referred to in the literature as a classification tree) to sort the examples into categories.
Modern techniques build multiple such decision trees, have all of them vote on the possible class of a new instance of data, and select the class whose vote has the highest associated probability.
The conventional methods construct the decision trees in a top-down, recursive, divide-and-conquer manner, resulting in very complex decision trees that are not binary (each node may be linked down to more than two other nodes) and may have more internal nodes than the number of features characterizing the data.
Genetics-based classification methods may also use populations of decision trees and voting schemes for predicting the class of a new instance of data.
A recent genetics-based classification technique uses canonical binary decision trees that predict only two classes, rather than the complex decision trees generated through conventional techniques, which are built to predict all the classes. In a binary decision tree each internal node is connected to exactly two other lower-level nodes. A canonical decision tree is a decision tree in which each internal node is associated with a distinct feature and, as a consequence, the number of internal nodes is limited by the number of features characterizing the data. Canonical binary decision trees that predict only two classes are significantly simpler than traditional decision trees.
To compensate for the simplicity of the decision trees used, such genetics-based classification techniques use as many populations of two-class canonical binary decision trees as there are classes characterizing the given data, together with a voting system. A decision tree of such a population will recognize only a certain class (the class assigned to that population) and treat all the other classes (assigned to the other populations) as the unknown class.
Each population of decision trees is evolved to improve its capability of recognizing instances from a certain class (called positive examples) and also instances from all the other classes (called negative examples) labeled together as the unknown class.
The method used for evolving these decision trees is called genetics-based because it uses operators derived from genetics: mutation, crossover and selection.

In any decision tree (whether conventional or a canonical binary decision tree) each internal node is associated with a boolean-valued function (also called an attribute) built on the domain of a certain feature, while each leaf node is associated with a class. The path from the root node to any leaf node encodes a decision rule as a conjunction or disjunction of the attributes encountered along the path. Therefore each such tree encodes a set of decision rules (as many as the total number of leaf nodes), thus justifying the name decision tree.
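For illustration only, the following is a minimal Python sketch (not part of the patent) of such a tree and of the rule extraction just described: each internal node carries a boolean attribute test on one feature, each leaf carries a class, and every root-to-leaf path yields one decision rule. The node structure, function names and example features are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    feature: Optional[str] = None       # internal node: feature tested
    test: Optional[Callable] = None     # boolean attribute on that feature
    label: Optional[str] = None         # readable form of the test
    left: Optional["Node"] = None       # branch taken when the test is False
    right: Optional["Node"] = None      # branch taken when the test is True
    klass: Optional[str] = None         # leaf node: the encoded class

def classify(node: Node, instance: dict) -> str:
    """Follow the attribute tests from the root down to a leaf."""
    while node.klass is None:
        node = node.right if node.test(instance[node.feature]) else node.left
    return node.klass

def rules(node: Node, path=()):
    """Enumerate the decision rules encoded by the tree, one per leaf node."""
    if node.klass is not None:
        yield (" AND ".join(path) or "(always)", node.klass)
        return
    yield from rules(node.left, path + ("NOT " + node.label,))
    yield from rules(node.right, path + (node.label,))

# A hypothetical two-class canonical tree over features "age" and "income":
tree = Node(feature="age", test=lambda v: v > 30, label="age > 30",
            left=Node(klass="unknown"),
            right=Node(feature="income", test=lambda v: v > 50000,
                       label="income > 50000",
                       left=Node(klass="unknown"),
                       right=Node(klass="positive")))

for conjunction, klass in rules(tree):
    print("IF", conjunction, "THEN", klass)
print(classify(tree, {"age": 42, "income": 60000}))   # -> positive
```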
In the aforementioned genetics-based classification method using populations of two-class canonical binary decision trees, a restricted form of selection is used where a child tree can replace only its parent tree. This restriction on selection, placed in order to preserve the diversity of the classification rules discovered, may adversely affect the accuracy of the rules discovered and also the speed with which they are discovered.
This condition will be explained while referring to Fig. 1.
In Fig. 1, the different hills represent various decision rules and their size is proportional to the number of training instances classified by the respective rule. The smaller hills 101 correspond to weak decision rules (decision rules that correctly predict a small number of training instances) while the taller hills 102 correspond to strong decision rules (decision rules that correctly predict a large number of training instances).
The white dots 103, 104 represent the decision trees which, depending on the decision rules encoded, may be placed in different locations in the fitness landscape.
Because the decision trees are randomly initialized the distribution of white dots in the fitness landscape is random. The decision trees with the lowest fitness (for example white dot 103) are shown on the floor of the fitness landscape. These decision trees are not committed to any decision rule yet. Other decision trees are shown placed on different hills at various heights (for example white dot 104). Ideally we would like to make sure that all the decision rules are completely discovered (which would be represented by a white dot reaching the top of the hill 106).
If instead of restricted selection we place all the agents in a single population and allow them to compete freely, one can immediately see from Fig. 1 that, if at some point during the evolution most of the agents are close to the peaks of hills 102 and 107, the decision rules represented by smaller hills, like 101, may be abandoned and the accuracy of the classifier will decrease.
On the other hand, because of the restricted selection the process of the discovery of a certain rule by a group of decision trees pursuing that rule may be slow and incomplete resulting in decision rules that are not the best possible.
To explain why this problem occurs we use two pairs of decision trees, 108 and 109, linked by arrows. The white dot placed at the start of an arrow marks a parent decision tree and the white dot at the tip of the arrow marks an offspring decision tree.
If restricted selection is used, the offspring in the pair 108 will replace its parent while the offspring in the pair 109, having a lower fitness than its parent, will be eliminated. However, it is clear that preserving both the parent and the offspring of the pair 109 and discarding both decision trees of pair 108 would accelerate the process of climbing the fitness hill and consequently the process of discovering the associated decision rule. Moreover, because better decision trees would result through this process, there is an increased probability of discovering higher accuracy decision rules.

SUMMARY OF THE INVENTION
It is therefore one object of the present invention to provide a method for detecting the decision trees that are in the process of discovering the same decision rules, grouping them together and applying full selection to the groups so created, for the purpose of speeding up the process of discovering the decision rules and increasing the accuracy of the decision rules discovered.
According to the present invention, the positive and negative examples correctly classified by a decision tree can be used as a criterion for detecting decision trees that are discovering similar decision rules (such trees will therefore be called similar decision trees).
According to the present invention if all the positive and negative examples correctly classified by a first decision tree are also correctly classified by a second decision tree, the two decision trees are considered similar.
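As an illustrative sketch (not part of the patent), this similarity test reduces to set inclusion once each tree is summarized by the set of indices of the training instances it classifies correctly; the sets below are hypothetical.

```python
# Each tree is summarized by the set of indices of the training instances
# (positive and negative examples) that it classifies correctly.
def is_similar(first: frozenset, second: frozenset) -> bool:
    """True if the first tree is similar to the second, i.e. the second
    tree is 'greater than or equal to' the first."""
    return first <= second        # set inclusion

# Hypothetical correctly-classified sets for two trees:
a = frozenset({0, 3, 7})          # instances tree A classifies correctly
b = frozenset({0, 1, 3, 7, 9})    # instances tree B classifies correctly
assert is_similar(a, b) and not is_similar(b, a)
```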
According to the present invention, similar decision trees can be clustered (grouped) together. In Genetic Algorithms terminology these grouped individuals (decision trees) form a sub-population or a deme.
According to the present invention, the decision trees that are members of the same deme are subjected to genetic operators that apply to two or more individuals, such as selection and crossover.
According to the present invention, the processes for detecting similar decision trees can be applied continuously or periodically.
According to the present invention, when the method for detecting similar decision trees is subsequently applied, from all the individuals of a deme, only the decision tree with the highest fitness needs to be considered.
According to the present invention, when all the individuals in a population are clustered in one deme, the processes for detecting similar individuals could be stopped because no further clustering is possible.
According to the present invention, during the voting phase of a classification system using populations of decision trees, only the best decision tree from each cluster of trees is allowed to vote, thereby reducing the number of trees used during voting and also minimizing the compounded error introduced when similar trees making the same erroneous decision are all allowed to vote on the class of an instance.
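A minimal sketch of this voting phase follows, under the assumption that each cluster is represented as a list of (fitness, predict) pairs where predict returns a class label or None for the unknown class; the representation and names are illustrative, not the patent's.

```python
from collections import Counter

def vote(clusters, instance):
    """Let only the fittest tree of each cluster vote; return the class
    with the most votes, or None if every vote was 'unknown'."""
    ballots = []
    for cluster in clusters:
        fitness, predict = max(cluster, key=lambda pair: pair[0])
        label = predict(instance)
        if label is not None:             # None stands for the unknown class
            ballots.append(label)
    return Counter(ballots).most_common(1)[0][0] if ballots else None

# Hypothetical usage: two clusters, each holding (fitness, predictor) pairs.
clusters = [
    [(0.9, lambda x: "spam"), (0.4, lambda x: None)],
    [(0.7, lambda x: "spam")],
]
print(vote(clusters, instance={"subject": "win money"}))   # -> "spam"
```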
According to the present invention, the process of clustering similar decision trees results in a speeding up of the process of decision rule discovery as well as in increasing the accuracy of the decision rules discovered.
According to the present invention, the process of clustering similar decision trees results in an overall increase in classification accuracy.
The foregoing has outlined, in general, the aspects of the invention and is to serve as an aid to better understanding the more complete detailed description which is to follow. In reference to such, there is to be a clear understanding that the present invention is not limited to the method, or application of use described and illustrated herein.
Any other variation of implementation, use, or application should be considered apparent as an alternative embodiment of the present invention.

OBJECTS OF THE INVENTION
Accordingly several advantages and objects of the present invention are:
A principal object of the present invention is to provide a method for identifying and clustering similar decision trees that will overcome the deficiencies of the prior art in decision trees based data classifiers.
An object of the present invention is to provide a method for clustering decision trees that works on a group of decision trees encoding in each of their leaf nodes a class from the same group of classes.
Another object of the present invention is to provide a method for clustering decision trees that identifies similar decision trees, where two decision trees are said to be similar if all the data instances correctly classified by a first decision tree are included in the set of data instances correctly classified by a second decision tree, in which case the second decision tree is said to be greater than or equal to the first decision tree.
Another object of the present invention is to provide a method for clustering decision trees that places a decision tree in the same cluster with another decision tree that is greater than or equal to it.
Another object of the present invention is to provide a method for clustering decision trees that continues to repeat the clustering procedure until no more trees can be clustered.
Another object of the present invention is to provide a method for clustering decision trees that can be used to increase the classification speed and accuracy in genetics-based data classifiers.
It is intended that any other advantages and objects of the present invention that become apparent or obvious from the detailed description or illustrations contained herein are within the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS
The following drawings further describe by illustration the advantages and objects of the present invention. Drawings are referenced by corresponding figure reference characters within the "DETAILED DESCRIPTION OF THE INVENTION" section to follow. Fig. 1 has been referenced by corresponding figure reference characters within the "Related Background Art" section.
Fig. 1 describes a theoretical distribution of decision trees over the decision rule fitness landscape used for explaining the benefits and the problems associated with restricted selection, problems fixed through the method proposed in this invention.
Fig. 2 is a block diagram of the method for detecting and clustering similar decision trees according to the first embodiment.
Fig. 3 is a block diagram for explaining where the decision tree clustering method is used in an example of a genetics-based data classification algorithm according to the second embodiment.

DETAILED DESCRIPTION OF THE INVENTION
(First Embodiment) In a first embodiment, a description will be given of a method for clustering decision trees in conjunction with Fig. 2. Referring to Fig. 2, reference character 201 denotes a step during which the best decision trees from each cluster are selected for detecting similarity relationships. We assume there are N such best trees.
Initially each tree in a population of decision trees can be placed in a separate cluster such that when the clustering algorithm is performed the first time all the trees will be inspected and subjected to potential clustering.
By the best decision tree we mean the decision tree that obtained the highest score in classifying the instances in the training dataset.
Reference character 202 denotes a step during which a similarity matrix of dimension N×(N+1) is built and initialized as follows:
Each cell of coordinates (i, j) with 1 ≤ i ≤ N and 1 ≤ j ≤ N is set to 1 (TRUE). Each cell of coordinates (i, N+1) with 1 ≤ i ≤ N is set to N−1. Reference character 203 denotes a step during which, for each data instance in the training dataset, the following steps are performed:
Each decision tree selected at 201 tries to classify the selected data instance. If decision tree i recognizes the data instance and decision tree j doesn't, the location (i, j) in the similarity matrix built at 202 is set to 0 and the counter stored at location (i, N+1) is decremented by 1. Reference character 204 denotes a step during which, for each 1 ≤ i ≤ N, the following steps are performed:
The counter at location (i, N+1) is inspected and, if the counter (i, N+1) is greater than 0, decision tree i is similar to at least one other decision tree, so execute the next step.
Check if the cluster to which decision tree i belongs has already been re-clustered, in which case continue with the next i; if not, proceed to the next step.
Inspect locations (i, j) with 2 ≤ j ≤ N. The first location found that contains a value of 1 corresponds to a decision tree J that is similar to decision tree i.
Therefore we cluster decision tree i and decision tree J.
Proceed to the next i.
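Put together, steps 201 through 204 can be sketched as follows: an illustrative Python rendering, not the patent's code, assuming each best tree is summarized by the set of training-instance indices it classifies correctly and clusters are lists of tree identifiers.

```python
def cluster_similar_trees(correct_sets, clusters):
    """One clustering pass over the best tree of each of the N clusters.

    correct_sets[i] -- indices of the training instances correctly
    classified by the best tree of cluster i (step 201);
    clusters        -- list of N lists of tree identifiers.
    """
    n = len(correct_sets)
    # Step 202: N x (N+1) structure; all similarity cells TRUE, and the
    # extra column (the per-row counter) initialized to N-1.
    sim = [[1] * n for _ in range(n)]
    counter = [n - 1] * n
    # Step 203: for every training instance, if tree i recognizes it and
    # tree j doesn't, clear cell (i, j) once and decrement counter i.
    # (Only instances recognized by at least one tree can matter here.)
    for inst in set().union(*correct_sets):
        for i in range(n):
            if inst not in correct_sets[i]:
                continue
            for j in range(n):
                if i != j and sim[i][j] and inst not in correct_sets[j]:
                    sim[i][j] = 0
                    counter[i] -= 1
    # Step 204: a positive counter means tree i is similar to at least one
    # other tree; merge its cluster into that of the first such tree j,
    # skipping clusters that have already been re-clustered.
    merged = [False] * n
    for i in range(n):
        if counter[i] <= 0 or merged[i]:
            continue
        for j in range(n):
            if j != i and sim[i][j]:
                clusters[j].extend(clusters[i])
                clusters[i] = []
                merged[i] = True
                break
    return [c for c in clusters if c]

# Hypothetical usage: the tree of cluster 0 is subsumed by that of cluster 1.
sets = [frozenset({0, 1}), frozenset({0, 1, 2}), frozenset({5})]
print(cluster_similar_trees(sets, [["t0"], ["t1"], ["t2"]]))
# -> [['t1', 't0'], ['t2']]
```

Note that after all instances are processed, cell (i, j) remains TRUE exactly when every instance the i-th best tree classifies correctly is also classified correctly by the j-th, that is, when tree j is greater than or equal to tree i as defined above.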

(Second Embodiment) In the second embodiment, an example of a genetics-based data classifier using a decision tree clustering method will be described in conjunction with Fig. 3.
Referring to Fig. 3, reference character 301 denotes a step during which we create as many populations of two-class canonical binary decision trees as there are classes in the dataset.
Reference character 302 denotes a step during which each decision tree is placed in a separate cluster.
Reference character 303 denotes a step during which the decision trees in each cluster are evolved using Genetic Algorithms.
Reference character 304 denotes a step during which we check to see if the time to perform the clustering algorithm has arrived. Reference character 305 denotes the step during which the clustering algorithm is performed according to the first embodiment.
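The overall loop of Fig. 3 can be sketched as below; random_tree, evolve and recluster are hypothetical stubs standing in for the genetic machinery and for the Fig. 2 clustering pass of the first embodiment, so only the control structure reflects the text above.

```python
import random

def random_tree(target_class):        # hypothetical stub for a real tree
    return {"class": target_class, "fitness": random.random()}

def evolve(clusters):                 # stub for mutation/crossover/selection
    for cluster in clusters:
        for tree in cluster:
            tree["fitness"] = min(1.0, tree["fitness"] + 0.01 * random.random())

def recluster(clusters):              # stand-in for the Fig. 2 method
    return clusters

def train(classes, n_trees=20, generations=100, cluster_every=10):
    # 301 + 302: one population per class, every tree in its own cluster.
    clusters = {c: [[random_tree(c)] for _ in range(n_trees)] for c in classes}
    for gen in range(1, generations + 1):
        for c in classes:
            evolve(clusters[c])                        # 303: genetic step
        if gen % cluster_every == 0:                   # 304: time to cluster?
            clusters = {c: recluster(cl) for c, cl in clusters.items()}  # 305
    return clusters

populations = train(classes=["setosa", "versicolor", "virginica"])
print({c: len(cl) for c, cl in populations.items()})   # clusters per class
```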

Claims (7)

What is claimed is:
1. A method of detecting similar decision trees in a population of decision trees built to predict (classify) the same classes of data instances, comprising the steps of: selecting the best decision trees in the existing clusters; building a similarity matrix; checking the selected decision trees on each instance in the training dataset, updating the similarity matrix based on how a certain decision tree does when compared to other decision trees, and decrementing a counter that keeps track of how many decision trees are similar to the said decision tree; and, after all the instances in the training dataset have been used, using the aforementioned counters and the similarity matrix to establish similarity relationships and cluster the decision trees that are similar.
2. A method claimed as in 1, characterized in that, if there are no existing clusters, each decision tree in the population is considered to be placed in a separate cluster and therefore qualifies for clustering.
3. A method claimed as in 1, characterized in that said counters can be placed in the last column of the similarity matrix or in a separate linear array.
4. A method claimed as in 1, characterized in that said similarity matrix is initialized assuming that all decision trees are similar and is updated, after each instance in the training dataset has its class predicted by each selected decision tree, to reflect when a decision tree is found to be dissimilar from another decision tree.
5. A method claimed as in 3, characterized in that said counters are inspected and, when their content is greater than 0, at least one similarity has been found, as reflected in the corresponding row of the similarity matrix.
6. A method claimed as in 5, characterized in that when the content of a counter is greater than 0 the associated row in the similarity matrix is investigated in increasing order and the decision tree associated with the respective row is placed in the same cluster with the first decision tree found to be similar.
7. A method claimed as in 6, characterized in that the process is repeated for all the said counters found to be greater than 0.
CA002494799A 2005-01-28 2005-01-28 Method for clustering decision trees in data classifiers Abandoned CA2494799A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA002494799A CA2494799A1 (en) 2005-01-28 2005-01-28 Method for clustering decision trees in data classifiers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA002494799A CA2494799A1 (en) 2005-01-28 2005-01-28 Method for clustering decision trees in data classifiers

Publications (1)

Publication Number Publication Date
CA2494799A1 true CA2494799A1 (en) 2006-07-28

Family

ID=36702737

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002494799A Abandoned CA2494799A1 (en) 2005-01-28 2005-01-28 Method for clustering decision trees in data classifiers

Country Status (1)

Country Link
CA (1) CA2494799A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107870781A (en) * 2016-09-27 2018-04-03 华为数字技术(苏州)有限公司 A kind of data parallel clustering method and device
CN107870781B (en) * 2016-09-27 2020-09-11 华为数字技术(苏州)有限公司 Data parallel clustering method and device
CN109725892A (en) * 2017-10-31 2019-05-07 腾讯科技(上海)有限公司 Analyzing logic control method and device
CN114282924A (en) * 2020-09-28 2022-04-05 腾讯科技(深圳)有限公司 Account identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110070141B (en) Network intrusion detection method
CN108632279B (en) Multilayer anomaly detection method based on network traffic
CN107292097B (en) Chinese medicine principal symptom selection method based on feature group
CN110460605A (en) A kind of Abnormal network traffic detection method based on autocoding
Celik et al. Coabcminer: an algorithm for cooperative rule classification system based on artificial bee colony
CN111950622A (en) Behavior prediction method, behavior prediction device, behavior prediction terminal and storage medium based on artificial intelligence
Hruschka et al. Improving the efficiency of a clustering genetic algorithm
CN109842614B (en) Network intrusion detection method based on data mining
CN101923650A (en) Random forest classification method and classifiers based on comparison mode
CA2494799A1 (en) Method for clustering decision trees in data classifiers
Gao et al. A hybrid approach to coping with high dimensionality and class imbalance for software defect prediction
CN116910598A (en) Industrial control system intrusion detection method based on ensemble learning
Fu et al. Genetically engineered decision trees: population diversity produces smarter trees
Oliveira et al. Improving cascading classifiers with particle swarm optimization
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
Hui et al. Analysis of decision tree classification algorithm based on attribute reduction and application in criminal behavior
KR100727555B1 (en) Creating method for decision tree using time-weighted entropy and recording medium thereof
Hulley et al. Genetic algorithm based incremental learning for optimal weight and classifier selection
Luo et al. Network attack classification and recognition using hmm and improved evidence theory
CN114996256B (en) Data cleaning method based on class balance
Patil et al. Efficient processing of decision tree using ID3 & improved C4.5 algorithm
Qi et al. Active semi-supervised affinity propagation clustering algorithm based on pair-wise constraints
CN114611572B (en) Data hierarchical storage algorithm based on improved RBF neural network
CHEN et al. Effect Analysis of Resampling Techniques on the Performance of Customer Credit Scoring Models

Legal Events

Date Code Title Description
FZDE Discontinued