WO2019027451A1 - Training classifiers to reduce error rate - Google Patents

Training classifiers to reduce error rate

Info

Publication number
WO2019027451A1
Authority
WO
WIPO (PCT)
Prior art keywords
classifiers
data
classes
training
collection
Prior art date
Application number
PCT/US2017/045076
Other languages
French (fr)
Inventor
Steven J. Simske
Marie Vans
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to PCT/US2017/045076 priority Critical patent/WO2019027451A1/en
Publication of WO2019027451A1 publication Critical patent/WO2019027451A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer system includes a training module and an error reduction module. The training module trains a set of classifiers using training data that is selected from a data collection that is labelled in accordance with a classification schema. The error reduction module determines an overall error rate in which the trained set of classifiers misclassify data items from a representative portion of data collection. The error reduction module selects at least a class from the classification schema of the data collection, for use in filtering training data obtained from the collection. The selection of the class may be based on a determination that the set of classifiers, when trained on the filtered training data, reduce the overall error rate in which the set of classifiers misclassify data items from any representative data set of the collection.

Description

TRAINING CLASSIFIERS TO REDUCE ERROR RATE
BACKGROUND
[0001] Classifiers are programmatic mechanisms which operate to assign data items of a particular type into one of multiple pre-defined classes. The evaluation of a classifier (e.g., suitability of the classifier for a particular task) is based at least in part on the accuracy of the classifier, which, in turn, can depend on how the classifier was trained. In supervised training, classifiers are trained and evaluated using labeled data sets.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 illustrates an example classifier training system;
[0003] FIG. 2 illustrates an example method for filtering training data from a labeled data set.
[0004] FIG. 3 illustrates a computer system on which one or more examples may be implemented.
DETAILED DESCRIPTION
[0005] Labeled data sets are commonly used to train classifiers in a variety of applications. Examples recognize that labeled data sets may include classes which are not suitable for the purpose of training classifiers. In particular, examples identify non-predictive classifications in the classification schema of a labeled data set which are more likely to include data items for which the performance of the trained classifier will be deemed inaccurate. Examples determine a training data set for training classifiers by excluding data items that are labeled with the classifications that are deemed non-predictive.
[0006] By way of comparison, other approaches have utilized data pruning methods which independently learn different aspects of the data in order to identify examples which are most likely noisy or wrongly labeled. While such conventional approaches perform analysis on training data that is noisy for the purpose of pruning the training data, such approaches fail to determine which classifications of a pre-labeled training set contain data items which are more likely to yield inaccurate results when applied to classifiers. Such approaches also run the risk of losing the most important samples for the purpose of defining the boundary between two or more classes.
[0007] Still further, other conventional approaches have relied on techniques to correct labels of noisy training data. But such conventional approaches fail to identify classes of data in a pre-labeled data set which are correctly labeled but are detrimental to the training of classifiers. In contrast to such approaches, examples describe a system and method to identify classes of data in a labeled data set which are detrimental to training. As described in greater detail, examples filter data items of such classes from the training data, resulting in improved performance by classifiers that are trained using the filtered training data.
[0008] According to examples as described, a computer system includes a training module and an error reduction module. The training module trains a set of classifiers using training data that is selected from a data collection that is labelled in accordance with a classification schema. The error reduction module determines an overall error rate in which the trained set of classifiers misclassify data items from a representative portion of the data collection. Additionally, the error reduction module selects at least a class from the classification schema of the data collection, for use in filtering training data obtained from the collection. The selection of the class may be based on a determination that the set of classifiers, when trained on the filtered training data, reduce the overall error rate in which the set of classifiers misclassify data items from any representative data set of the collection.
[0009] Additionally, in some examples, the error reduction module selects multiple classifiers from the set of classifiers to deploy as an ensemble classifier, where each of the multiple classifiers are trained on the filtered training data before the selection is made. Once selected, the set of classifiers may be trained again for deployment, using another selection of training data from the data collection that is not filtered. This may correspond to a new training data or a superset of the filtered training data, depending on the application and the size of the ground truth set.
[0010] One or more examples described herein provide that methods, techniques, and actions performed by a computing device are performed programmatically, or as a computer-implemented method. Programmatically, as used herein, means through the use of code or computer-executable instructions. These instructions can be stored in one or more memory resources of the computing device. A programmatically performed step may or may not be automatic.
[0011] One or more examples described herein can be implemented using programmatic modules, engines, or components. A programmatic module, engine, or component can include a program, a sub-routine, a portion of a program, or a software component or a hardware component capable of performing one or more stated tasks or functions. As used herein, a module or component can exist on a hardware component independently of other modules or components. Alternatively, a module or component can be a shared element or process of other modules, programs or machines.
[0012] Some examples described herein can generally require the use of computing devices, including processing and memory resources. For example, one or more examples described herein may be implemented, in whole or in part, on computing devices such as servers, desktop computers, cellular or smartphones, and tablet devices. Memory, processing, and network resources may all be used in connection with the establishment, use, or performance of any example described herein (including with the performance of any method or with the implementation of any system).
[0013] Furthermore, one or more examples described herein may be implemented through the use of instructions that are executable by one or more processors. These instructions may be carried on a computer-readable medium. Machines shown or described with figures below provide examples of processing resources and computer-readable mediums on which instructions for implementing examples of the invention can be carried and/or executed. In particular, the numerous machines shown with examples of the invention include processor(s) and various forms of memory for holding data and instructions. Examples of computer-readable mediums include permanent memory storage devices, such as hard drives on personal computers or servers. Other examples of computer storage mediums include portable storage units, such as CD or DVD units, flash memory (such as carried on smartphones, multifunctional devices or tablets), and magnetic memory.
Computers, terminals, and network-enabled devices (e.g., mobile devices, such as cell phones) are all examples of machines and devices that utilize processors, memory, and instructions stored on computer-readable mediums. Additionally, examples may be implemented in the form of computer programs, or a computer-usable carrier medium capable of carrying such a program.
[0014] FIG. 1 illustrates an example classifier training system. A classifier training system 100 includes a training module 110 and an error reduction module 120 that utilize a labelled data collection 122, having a classification schema 123 that defines multiple classes 125. The classification schema 123 may define individual classes 125 of data items of the labelled data collection 122, including criteria by which data items are assigned to each class. As described in greater detail, the error reduction module 120 improves the performance of classifiers ("classifier 112") trained by the training module 110, by identifying and filtering data items that are of a class which is deemed detrimental to the effectiveness of training methods utilized by the training module 110.
[0015] In an example of FIG. 1, the training module 110 trains a collection of classifier 112 using a representative portion 136 of the labelled data collection 122 as training data 128. The representative portion 136 may include a distribution of classes that match or are substantially similar to the distribution of the classes of the larger data collection. Accordingly, the distribution of classes for the representative portion 136 may match or be similar to the classification schema 123 of the labelled data collection 122. Depending on implementation, the number of trained classifiers 112 may range between one and many.
[0016] The error reduction module 120 may evaluate the trained classifiers 112 using another representative portion 138 of the labelled data collection 122. The representative portion 138 used for evaluation of trained classifiers 112 may correspond to, or be selected from, a portion of the labelled data collection 122 that excludes the data items selected for the representative portion 136. Thus, the representative portion 138 that is used by the error reduction module 120 for evaluation may also reflect the classification schema 123 of the labelled data collection 122.
[0017] In evaluating the classifiers 112, the error reduction module 120 may determine an overall error rate for each trained classifier 112, where the determined error rate reflects a comparison (e.g., ratio) of each data item of the representative portion 138 which was misclassified by each classifier 112, as compared to the total number of data items of the representative portion 138 which were correctly classified by that classifier 112. For example, the error reduction module 120 may rank the trained classifiers 112 based on the respective error rates, and then include (or exclude) a designated number or portion of the trained classifiers 112 from the selected set of classifiers 114 based on the determined ranking. For example, the error reduction module 120 may determine the selected set of classifiers 114 by including (or excluding) the highest-ranked (or lowest-ranked) classifiers. Alternatively, the selected set of classifiers 114 may include (or exclude) each classifier that meets a given threshold or condition. For example, the selected set of classifiers 114 may include (or exclude) trained classifiers 112 having error rates that are above/below a given ranking threshold. In this way, the selected set of classifiers 114 may correspond to a selection of the trained classifiers 112, where the selected set of classifiers 114 excludes those classifiers which are determined to have error rates that fail to meet a corresponding threshold or condition.
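The per-classifier evaluation and thresholded selection described above might be sketched as follows. This is an illustrative sketch, not the patented implementation: classifiers are modeled as callables, and the function names and the `max_error` threshold are assumptions.

```python
def error_rate(classifier, eval_items, eval_labels):
    """Fraction of evaluation items that the classifier mislabels."""
    wrong = sum(1 for x, y in zip(eval_items, eval_labels) if classifier(x) != y)
    return wrong / len(eval_items)

def select_classifiers(classifiers, eval_items, eval_labels, max_error=0.5):
    """Rank classifiers by error rate, then keep those meeting the threshold."""
    rated = sorted(classifiers, key=lambda c: error_rate(c, eval_items, eval_labels))
    return [c for c in rated if error_rate(c, eval_items, eval_labels) <= max_error]
```

A ranking-based variant, also described above, would instead keep a fixed number of the best-ranked classifiers rather than applying an absolute error threshold.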
[0018] In some examples, the error reduction module 120 selects a set of classes 125 from the classification schema 123 of the labelled data collection 122 for use as a class filter 116, from which filtered training data 128 is determined. In some examples, the class filter 116 is used to filter the representative portion 136 of all data items that are associated with the classifier 112 in order to determine the filtered training data 128. In variations, the filtered training data 128 is determined by applying the class filter 116 to filter out some data items (e.g., a majority or substantial portion thereof) of the representative portion 136. The selection of classes 125 for use as the class filter 116 may be based on a determination that the classifiers 112, when trained on training data 128 filtered by the class filter 116, sufficiently reduce the overall error rate in which the trained classifiers 112 misclassify data items from any representative data set of a larger data collection (e.g., having data items of the same type, and sharing the same classification schema). The training module 110 may retrain individual classifiers 112 using the filtered training data 128. For example, the training module 110 may select the set of classifiers 114 based on an evaluation of the individual classifiers 112, then retrain the selected set of classifiers 114 using training data 128 that is filtered by the class filter 116. The error reduction module 120 then evaluates the individual classifiers 112, once the classifiers have been retrained using the filtered training data 128.
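Applying a class filter to a labelled collection, as described above, amounts to excluding every data item whose label falls in the filtered set. A minimal sketch, with hypothetical names not taken from the disclosure:

```python
def apply_class_filter(items, labels, class_filter):
    """Keep only the (item, label) pairs whose label is outside the filter."""
    return [(x, y) for x, y in zip(items, labels) if y not in class_filter]

items = ["a1", "a2", "b1", "c1", "c2"]
labels = ["A", "A", "B", "C", "C"]
# Data items of the suspect class "C" are excluded from the training set.
filtered = apply_class_filter(items, labels, class_filter={"C"})
```

The variation in which the filter is applied "more sparingly" could instead retain a sampled fraction of each suspect class rather than dropping it entirely.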
[0019] In some examples, the selected set of classifiers 114 is implemented as an ensemble set of classifiers based on results determined from training the larger set of classifiers 112 on filtered training data (e.g., training data subject to class filter 116). For example, once the classifiers 112 are trained using filtered training data (e.g., training data with non-cohesive classes removed), the classifiers 112 may be ranked by performance, from which selection is made for the selected set of classifiers 114. The selected set of classifiers 114 may then be trained as an ensemble configuration of classifiers using an unfiltered representation of the data collection. The retrained set of classifiers 114 may then be deployed to classify data items for all classes of the classification schema 123.
[0020] The error reduction module 120 identifies and excludes suspect classes of data items from the training data, before training classifiers for deployment. According to some examples, the error reduction module 120 selects a set of classes 125 for the class filter 116 by determining an error rate associated with each of the respective classes. As related to individual classes 125, the error rate may correspond to a comparison (e.g., ratio) of the data items in the representative portion 138 that were labeled incorrectly by one or multiple trained classifiers 112, as compared to the data items of the same class in the representative portion which were classified correctly by the trained classifiers 112. Depending on implementation, the error rate determination made for each class 125 is based on an aggregation of error rates for the respective class 125 from some or all of the trained classifiers 112. For example, the error rate determination may be made for each class 125, using results that are aggregated from the selected set of classifiers 114, or subset thereof.
[0021] In some examples, the error reduction module 120 selects one or multiple classes 125 for the class filter 116 by ranking the individual classes according to an error reduction metric. For each class 125, the error reduction metric may correspond to a figure of merit, which may be defined as a ratio of (i) the amount of overall error reduced by removing the particular class (or selecting the particular class for the class filter 116), and (ii) the error of remaining classes divided by the degrees of freedom of the number of remaining classes. Thus, the denominator may correspond to the expected value of error reduction for all of the classes. In other words, if the figure of merit is greater than unity, the particular class 125 may be assumed as suspect, and included in the class filter 116.
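The figure of merit in [0021] might be sketched as follows. The description leaves some details open, so this is an assumption-laden illustration: per-class error is represented as a misclassification count, and "degrees of freedom of the number of remaining classes" is taken to mean the number of remaining classes minus one.

```python
def figure_of_merit(errors_by_class, candidate):
    """Ratio of the error removed by filtering out `candidate` to the
    expected per-class error of the remaining classes (their total error
    divided by an assumed degrees-of-freedom count)."""
    removed = errors_by_class[candidate]
    remaining = {c: e for c, e in errors_by_class.items() if c != candidate}
    dof = max(len(remaining) - 1, 1)  # assumed interpretation of degrees of freedom
    expected = sum(remaining.values()) / dof
    return removed / expected

errors = {"A": 2, "B": 3, "C": 25}      # per-class misclassification counts
suspect = figure_of_merit(errors, "C")  # 25 / ((2 + 3) / 1) = 5.0 > 1, so "C" is suspect
```

Under this reading, a class whose figure of merit exceeds unity contributes disproportionately more error than the remaining classes would be expected to, and is therefore added to the class filter 116.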
[0022] According to some examples, the system 100 identifies and removes multiple non-predictive classes from use as training data 128. The system 100 may, for example, optimize the selection of classes for the class filter 116, by identifying one or multiple classes 125 of the classification schema 123 that, when removed from use as training data 128, cause the classifiers to have a lower error rate. In one example, the classes 125 which are non-predictive and detrimental to training are identified as those which have a figure of merit that is greater than one.
[0023] In some examples, the system 100 utilizes a recursive process in which training and evaluation are repeated to identify each of the non-predictive classes. For example, in an initial process, the training module 110 may initially train the classifiers using unfiltered training data. The error reduction module 120 may evaluate the trained classifiers 112 to identify an initial set of non-predictive classes (e.g., those classes which have a figure of merit that is greater than one) for use as the class filter 116. The process may be repeated by retraining the classifiers 112 using training data 128 that is filtered by the class filter 116, then evaluating the retrained classifiers 112 to identify whether additional non-predictive classes exist in the remainder. If the process identifies additional suspect classes, the process of training and evaluating to determine additional non-predictive classes for the class filter 116 is repeated, until a determination is made that removal of additional classes would not have merit.
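The recursive train-evaluate loop in [0023] can be sketched as follows. Here `train_fn` and `merit_fn` stand in for the training module and the per-class figure-of-merit evaluation; both names, and the termination condition, are illustrative assumptions rather than the patented implementation.

```python
def build_class_filter(train_fn, merit_fn, classes):
    """Grow the class filter until no remaining class has a
    figure of merit greater than one."""
    class_filter = set()
    while True:
        classifiers = train_fn(excluded=class_filter)      # retrain on filtered data
        suspects = {c for c in classes - class_filter
                    if merit_fn(classifiers, c) > 1.0}     # newly suspect classes
        if not suspects:
            return class_filter                            # removal no longer has merit
        class_filter |= suspects
```

Each pass retrains on data with the accumulated filter applied, so a class that only appears non-predictive after other suspect classes are removed can still be caught on a later iteration.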
[0024] According to some examples, the system 100 trains classifiers 112 for deployment using the determined class filter 116. In some variations, once the non-predictive classes are identified, the training module 110 may train classifiers for deployment using an optimal set of class filters 116. For example, the training module 110 may deploy a set of trained classifiers 118 by retraining the selected set of classifiers 114 using training data 128 that is filtered by the class filter 116 to exclude those non-predictive classes.
[0025] By excluding non-predictive class(es) from training, system 100 is able to train and deploy classifiers which have a reduced error rate with respect to the remaining classes of the classification schema. In variations, system 100 "splits" the classes of the classification schema 123 into those classes which are cohesive and those which are not (i.e., non-cohesive). Additionally, the system 100 may train and deploy alternative classifiers for the purpose of classifying data items of the non-cohesive classes. The system 100 may then identify suitable or optimal classifiers for either or both of the reduced sets. As an addition or variation, the system 100 may identify an optimal classifier for classifying data items from the entire set of classes. As described in greater detail with an example of FIG. 2, in such a hybrid approach, the system 100 may perform separate analysis to determine a classification deployment strategy which considers whether deployment of separate classifiers for each class set of the split is more optimal than deployment of one or more classifiers for the entire set.
[0026] In some examples, system 100 implements an evaluation process as described with an example of FIG. 1 in order to determine a filtering scheme for training a specific classifier or set of classifiers. For example, an initial set of classifiers 112 may be trained on unfiltered training data, in order to determine classes in the classification schema 123 of the data collection which are not sufficiently cohesive (e.g., figure of merit is greater than one) for the trained classifier. The selected set of classifiers 114 may then be retrained using training data that is filtered to reduce or eliminate the classes which have, for example, a figure of merit of greater than one. The set of classifiers 112 may be said to be optimized or made more optimal when the class filtering of the training data improves the performance of the classifier for those classes remaining in the filtered training set. In some examples, an optimized classifier 112 (or set thereof) may be subject to an additional training process that utilizes training data which includes data items of classes which were previously filtered out of the training data. For example, the optimized set of classifiers may be subjected to an additional training process that utilizes unfiltered training data, or a selective inclusion of one or more (but not all) of the classes that previously formed the class filter 116. Still further, the class filter 116 may be applied more sparingly so that data items of suspect classes are represented in the training data, but to a lesser degree.
[0027] FIG. 2 illustrates an example method for filtering training data from a labeled data set. In describing an example of FIG. 2, reference may be made to elements or aspects of an example of FIG. 1, in order to illustrate suitable components or functionality for implementing a step or sub-step being described.
[0028] With reference to FIG. 2, the system makes a determination of an overall error rate in which a trained set of classifiers misclassify data items from a representative portion of the labelled data collection 122 (210). The determination may be made in response to a set of classifiers 112 being trained on a representative portion of the labelled data collection 122, which encompasses the entire set of classes for the classification schema 123.
[0029] The system 100 may select at least a class 125 from multiple classes of the classification schema 123 as a class filter 116 for training data obtained from the labelled data collection 122 (220). The selection of the class 125 may be based on a determination that the set of classifiers, when trained on a representative portion of the labelled data collection 122 that is filtered using the class filter 116, reduce the overall error rate in which the set of classifiers misclassify data items from any representative data set of the labelled data collection 122.
[0030] In a variation, an optimal set of classes may be selected for the class filter 116. In an implementation, a figure of merit, or other metric of error prediction, is determined during an evaluation process or phase, with respect to each class of the classification schema. The optimal set of classes may be determined by, for example, selecting each class that has a figure of merit that is greater than one.
[0031] The set of classifiers are then trained on training data that is filtered using the class filter 116 (230). Once trained on the filtered training data, the trained set of classifiers can be evaluated again, to determine the performance of the classifiers without interference from data items of non-cohesive classes. Accordingly, at least a first classifier from the trained set of classifiers is selected based at least in part on the error rate of each classifier in the set (240). In some examples, multiple classifiers are selected, or are otherwise determined to be suitable for use as an ensemble classifier.
[0032] Once the classifier(s) are selected, the classifiers may be trained for deployment using training data from the collection that is not filtered (250). The classifiers may then be deployed to classify unclassified data items in accordance with the classification schema 123, using training data that is filtered by the class filter 116 (260).
[0033] Examples such as described with FIG. 1 and FIG. 2 recognize that some classifiers can experience a significant improvement in accuracy when such classifiers are trained using a labeled data set of cohesive classes. In other words, the presence of non-cohesive data sets may be detrimental to the training of such classifiers. Among other advantages, by identifying such non-cohesive classes in advance, examples can tailor the selection and deployment of classifiers to better suit the labeled data set of the available training data.
[0034] Additionally, the performance of classifiers may improve when trained on data items that are part of a cohesive subset of classes. As described with an example of FIG. 2, a set of classifiers may be selected after being trained using training data that filters out the non-cohesive classes (e.g., those classes with a figure of merit greater than one). In such examples, multiple classifiers can be selected for deployment as an ensemble classifier. The selected classifiers may be retrained on training data that encompasses a full representation (i.e., unfiltered) of the labeled data collection, before being deployed as an ensemble set of classifiers.
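The description above does not fix how the selected classifiers are combined once deployed as an ensemble. One common scheme is simple majority voting, sketched here as an assumption; classifiers are again modeled as callables:

```python
from collections import Counter

def ensemble_predict(classifiers, item):
    """Classify an item by majority vote across the ensemble members."""
    votes = Counter(clf(item) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Two of three members vote "A", so the ensemble outputs "A".
members = [lambda x: "A", lambda x: "A", lambda x: "B"]
```

Weighted voting, with weights derived from each member's error rate on the evaluation portion 138, would be a natural alternative under the same framework.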
[0035] Still further, in other examples, a classifier deployment strategy may be determined based on identification of the cohesive and non-cohesive classes of the classification schema 123. In some variations, an alternative classifier is used for a reduced portion of the labelled data collection 122 corresponding to data items of classes used for the class filter 116 (i.e., the non-cohesive set). The selection of the alternative classifier may be based on, for example, selecting a classifier that is more suited for less cohesive classifications of data sets. In other variations, the classifier deployment strategy may provide for deployment of a classifier (or set thereof) to classify data items across the entire set of classes in the classification schema 123. The determination to use a classifier across the entire set of classes may benefit from advance knowledge of those classes of the labeled data set which are not cohesive. For example, the system 100 may employ a method such as described with an example of FIG. 2 to train and evaluate classifiers that are suited for cohesive data sets. If the classification schema is determined to include non-cohesive classes, the system 100 may train and evaluate a classifier (or set thereof) that is suited for discontinuous data sets.
[0036] Still further, the system 100 may train and evaluate a selected classifier for the entire set of classes for accuracy, as compared to use of separate and independent classifiers for each of the separate class sets (e.g., the set of non-cohesive classes selected for the class filter 116 versus the remaining cohesive set of classes selected for training). The system 100 may select to deploy classifier(s) for the entire set based on the comparison.
[0037] In some examples, the classifier deployment strategy may be determined from features of the full set of classes versus features of the reduced sets of classes (e.g., those which are identified as being non-predictive). For example, the deployment of classifiers may employ boosting, in which case the set of boosters (simple classifiers) that are deemed appropriate from the reduced set may coincide with a superset of the boosters that are deemed appropriate from the full set. In such cases, the set of boosters that are deemed appropriate from the full set of classes may be selected and utilized.
[0038] FIG. 3 illustrates a computer system on which one or more examples may be implemented. As shown, a computer system 300 may communicate over a computer network with one or more remote data sources in order to obtain, for example, the labelled data collection 122. The computer system 300 may correspond to, for example, a server, work station, or terminal. In variations, the computer system 300 may correspond to a distributed computer system, such as one which may exist on multiple servers or workstations, or as between a workstation and server. The computer network may correspond to, for example, a local area network (LAN), a virtual LAN (VLAN), a wireless local area network (WLAN), a virtual private network (VPN), the Internet, or a combination thereof.

[0039] The computer system 300 may include at least one processing resource 310, and at least one machine-readable storage medium 320 that includes or is otherwise encoded with instructions (including instructions 322, 324, and 326) that are executable by the at least one processing resource 310 in order to implement functionality such as described with examples of FIG. 1 or FIG. 2. By way of example, the processing resource 310 may access the instructions 322 to make a determination of an overall error rate in which a trained set of classifiers misclassify data items from a representative portion of the labelled data collection 122. The processing resource may execute the instructions 324 to select a class 125, from the classification schema 123, as a class filter. Additionally, the processing resource 310 may execute the instructions 326 to deploy at least a first classifier to classify unclassified data items in accordance with the classification schema 123. In some examples, the instructions 326 may select the first classifier (or multiple classifiers) by retraining the selected set of classifiers using the filtered training data, and then evaluating the performance of the individual classifiers.
Once the classifiers are trained on training data that excludes non-cohesive classes (e.g., figure of merit less than one), a selection of one or multiple (e.g., ensemble) classifier(s) can be made for deployment. The selected classifier(s) may then be trained again on training data that is unfiltered, so as to include data items for both the cohesive and non-cohesive classes.
[0040] With further reference to an example of FIG. 3, at least one processing resource 310 may fetch, decode, and execute instructions stored on storage medium 320 to perform functionalities described above in relation to instructions stored on storage medium 320. In other examples, the functionalities of any of the instructions of storage medium 320 may be implemented in the form of electronic circuitry, in the form of executable instructions encoded on a machine-readable storage medium, or a
combination thereof. The storage medium 320 may be located either in the computer system 300 executing the machine-readable instructions, or remote from but accessible to the computer system 300 (e.g., via a computer network) for execution. In an example of FIG. 3, the storage medium 320 may be implemented by one machine-readable storage medium, or multiple machine-readable storage media.

[0041] Although specific examples have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific examples shown and described without departing from the scope of the disclosure. This application is intended to cover any adaptations or variations of the specific examples discussed herein.
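The recursive filter-selection process recited in claims 4 and 5 — repeatedly removing the lowest-merit class until every remaining class has a merit value of at least one — might be sketched as below. `merit_fn` is a hypothetical callback that retrains the set of classifiers on the remaining classes and returns a merit value per class; the disclosure does not fix its formula.

```python
def recursive_class_filter(classes, merit_fn):
    """Remove the lowest-merit class until all remaining merits are >= 1."""
    remaining = set(classes)
    while remaining:
        merits = merit_fn(remaining)          # retrain and score each class
        worst = min(remaining, key=lambda c: merits[c])
        if merits[worst] >= 1.0:              # every remaining class is cohesive
            break
        remaining.remove(worst)               # filter the non-cohesive class
    return remaining
```

Because the merit values are recomputed after each removal, a class that looked non-cohesive in one round may become cohesive once a worse class has been filtered out.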

Claims

WHAT IS CLAIMED IS:
1. A computer system comprising:
a training module to train a set of classifiers using training data that is selected from a data collection, wherein the data collection is labelled in accordance with a classification schema that defines a plurality of classes; and
an error reduction module to determine an overall error rate in which the trained set of classifiers misclassify data items from a representative portion of the collection, and wherein the error reduction module selects at least a class from the plurality of classes as a filter for training data obtained from the collection, based on a determination that the set of classifiers, when trained on the filtered training data, reduce the overall error rate in which the set of classifiers misclassify data items from any representative data set of the collection.
2. The computer system of claim 1, wherein the error reduction module selects multiple classifiers from the set of classifiers to deploy as an ensemble classifier, the multiple classifiers being trained on the filtered training data before selection, and then trained again for deployment using another selection of training data from the data collection that is not filtered.
3. The computer system of claim 1, wherein the error reduction module selects at least the class from the plurality of classes by implementing a selection process to determine a merit value for each of the plurality of classes, and selecting the class with the lowest merit value to filter training data from the collection.
4. The computer system of claim 3, wherein the selection process is implemented recursively to select multiple classes in succession to filter from the training data.
5. The computer system of claim 4, wherein the selection process is implemented until the error reduction module determines that each remaining class that is not filtered from the training data has a merit value that is greater than or equal to one.
6. The computer system of claim 1, wherein the error reduction module selects multiple classes from the plurality of classes to filter training data obtained from the collection.
7. The computer system of claim 1, wherein the error reduction module selects at least the class by identifying, for each of the plurality of classes, a class-specific error rate in which the set of trained classifiers misclassify data items of each class.
8. The computer system of claim 1, wherein the error reduction module selects an optimal set of classes to filter the training data obtained from the labeled data set, to minimize the overall error rate in which the set of classifiers misclassify data items from any representative data set of the collection.
9. The computer system of claim 1, wherein the error reduction module performs a recursive process that includes selecting one or a combination of classes to filter training data obtained from the collection in order to determine the optimal set.
10. A method for training classifiers, the method being implemented by one or more processors and comprising:
determining an overall error rate in which a trained set of classifiers misclassify data items from a representative portion of a data collection, wherein the data collection is labeled in accordance with a classification schema that defines a plurality of classes; and
selecting at least a class from the plurality of classes as a training filter, based on a determination that the set of classifiers, when trained on a representative portion of the data collection that is filtered using the training filter, reduce the overall error rate in which the set of classifiers misclassify data items from any representative data set of the collection;
training the set of classifiers on the representative portion of the data collection that is filtered using the training filter;
selecting at least a first classifier from the trained set of classifiers based at least in part on the error rate of each classifier in the set;
training at least the first classifier using another representative portion of the data collection without filtering; and
deploying at least the first classifier to classify unclassified data items in accordance with the classification schema.
11. The method of claim 10, wherein selecting at least the first classifier includes selecting multiple classifiers from the trained set of classifiers, and wherein deploying at least the first classifier includes deploying the multiple classifiers as a trained ensemble classifier.
12. The method of claim 10, further comprising:
triggering the first classifier to be trained using the representative portion of the data collection after the data collection is filtered using the training filter.
13. The method of claim 10, wherein selecting at least the class from the plurality of classes includes:
determining a merit value for each of the plurality of classes, based on an error rate associated with each of the plurality of classes; and
selecting at least the class based on the merit values of the plurality of classes.
14. The method of claim 10, wherein selecting at least the class from the plurality of classes includes determining an optimal set of one or multiple classes for the training filter.
15. A non-transitory computer-readable medium that stores instructions, which when executed by one or more processors of a computer system, cause the computer system to:
determine an overall error rate in which a trained set of classifiers misclassify data items from a representative portion of a data collection, wherein the data collection is labeled in accordance with a classification schema that defines a plurality of classes; and
select at least a class from the plurality of classes as a training filter, based on a determination that the set of classifiers, when trained on a representative portion of the data collection that is filtered using the training filter, reduce the overall error rate in which the set of classifiers misclassify data items from any representative data set of the collection; and
deploy at least a first classifier to classify unclassified data items in accordance with the classification schema.
PCT/US2017/045076 — Training classifiers to reduce error rate

Application: PCT/US2017/045076, filed 2017-08-02 (priority date 2017-08-02)
Publication: WO2019027451A1 (en), published 2019-02-07
Family ID: 65234026
Country status: WO — WO2019027451A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party

- US7542959B2 * — Health Discovery Corporation, "Feature selection method using support vector machine classifier" (priority 1998-05-01, published 2009-06-02)
- US6944616B2 * — Pavilion Technologies, Inc., "System and method for historical database training of support vector machines" (priority 2001-11-28, published 2005-09-13)
- US20080208781A1 * — David Snyder, "Reduction of classification error rates and monitoring system using an artificial class" (priority 2007-02-21, published 2008-08-28)
- US20100310158A1 * — Tencent Technology (Shenzhen) Company Limited, "Method and Apparatus for Training Classifier, Method and Apparatus for Image Recognition" (priority 2008-09-26, published 2010-12-09)



Legal Events

- NENP: Non-entry into the national phase — Ref country code: DE
- 122: PCT application non-entry in European phase — Ref document number: 17920460; Country of ref document: EP; Kind code of ref document: A1