WO2020188425A1 - Method for balancing datasets of multi-class instance data - Google Patents

Method for balancing datasets of multi-class instance data Download PDF

Info

Publication number
WO2020188425A1
WO2020188425A1 PCT/IB2020/052251
Authority
WO
WIPO (PCT)
Prior art keywords
class
instances
sampling
sequential data
labelled
Prior art date
Application number
PCT/IB2020/052251
Other languages
French (fr)
Inventor
Colin Brown
Original Assignee
Wrnch Inc.
Priority date
Filing date
Publication date
Application filed by Wrnch Inc. filed Critical Wrnch Inc.
Publication of WO2020188425A1 publication Critical patent/WO2020188425A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/08Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks


Abstract

This disclosure describes a method for balancing datasets of instances in which each instance may be labelled by a sequence, plurality or distribution of class labels. The disclosure includes performing stochastic under-sampling (removal of dataset instances) and oversampling (replication of dataset instances) based on the distribution of classes in each instance, to minimize the ratio between the sizes of the minority class (i.e. class labelling the fewest frames across all instances) and the majority class (i.e. class labelling the most frames across all instances).

Description

METHOD FOR BALANCING DATASETS OF MULTI-CLASS INSTANCE DATA
FIELD
[001] This disclosure relates to methods for balancing datasets of multi-class-labelled instances of data, which may be sequential.
BACKGROUND
[002] Trained classifiers are often sensitive to the distribution of labels in the training set but acquiring or creating class-balanced datasets can be challenging. The problem is exacerbated in the multi-class case when there exist more than two classes from which to choose.
[003] In many training databases for machine learning tasks that operate over sequential data, (e.g. human activity recognition, speech recognition), individual training instances are encoded as sequences of frames (or time-samples, elements, etc.), which each have a class label. For example, in the case of activity recognition from video, each frame of a labelled training video may be associated with an activity class representing the activity of a person in that frame. In general, each training instance may be associated with a sequence of class labels, which may be summarized as a distribution of labels for that training instance.
[004] In the case of sequentially labelled training instances, traditional methods for managing class imbalance, for example oversampling, under-sampling or SMOTE (Synthetic Minority Over-sampling Technique), may fail as these methods typically rely on the assumption that each instance is associated with a single label. In attempt to satisfy this assumption, sequential instances may be split into sub-sequences of uniform class labels, however 1) contextual and class-transition information that may be important for training could be lost and 2) instances may be broken into sub-sequences of varying lengths, which may be unsuitable or undesirable for the given training task.
SUMMARY
[005] This disclosure describes a method for balancing datasets of instances in which each instance may be labelled by a sequence, plurality or distribution of class labels. The disclosure includes performing stochastic under-sampling (removal of dataset instances) and oversampling (replication of dataset instances) based on the distribution of classes in each instance, to minimize the ratio between the sizes of the minority class (i.e. class labelling the fewest frames across all instances) and the majority class (i.e. class labelling the most frames across all instances).
BRIEF DESCRIPTION OF THE DRAWINGS
[006] Figure 1 is a representation of class instances within several training instances.
DETAILED DESCRIPTION
[007] The method for balancing datasets of instances may proceed in two stages: 1) oversampling and 2) under-sampling. Oversampling may be performed prior to under-sampling in order to reduce the loss of unique, original instances. In some cases under-sampling may result in the removal of all replicas of a specific instance. Depending on the requirements of the application, such as a limit on the number of instances, these two stages may be repeated, such as in an interleaved manner or sequential manner. Also depending on the requirements of the application, only one stage may be performed. For example, if it is desirable to train on a smaller dataset, the under-sampling stage may be performed without performing oversampling.
[008] Each of the two stages may comprise a number of repeated rounds. Within each round, a sequence of steps may be performed, which may include 1) determination of the minority class (for oversampling) or majority class (for under-sampling) in the current dataset, 2) stochastic sampling of one or more instances in the dataset, weighted by the count of that class in each instance, 3) replication (for oversampling) or deletion (for under-sampling) of those selected instances in the dataset. The count of a class may be the number of frames associated with that class in the dataset. These steps are described below in more detail.
[009] After performing this method on a dataset, the dataset may be used as a training set for a machine learning system. Typically, training sets contain thousands or more instances, each potentially labelled with multiple classes. After applying this method, there can be assurance that the difference in frequency between the minority class appearing the fewest times in the dataset and the majority class appearing the most times is within a specified limit. In other words, the proportions of the classes that appear in the dataset are more consistent. This can assist with the training of machine learning systems, which may fail or not work as desired for classes which do not appear frequently enough within the training dataset.
[0010] Each instance is preferably sequential data, such as data representing video, audio or text. These instances may be used, for example, for recognizing human activities, for speech recognition, or for classifying the content in documents. The sequential data may be video clips, skeleton data representing human joint positions or other similar information. Individual training instances may be encoded as sequences of frames, time-samples, elements, or similar, where each frame, or a range of frames from the clips, has a class label.
[0011] Determination of the minority class involves determining which class appears the fewest times in the dataset or has the shortest total portion labelled with that class. It may be performed by counting the total number of frames associated with each class (or keeping track of this count incrementally) and finding the class with the smallest count. Similarly, determination of the majority class may involve counting the total number of frames associated with each class and finding the class with the largest count or longest total portion labelled with that class. This may be done in a number of ways, such as by counting the number of frames for each class, or storing and incrementing a count of each class while each instance of the dataset is scanned.
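The counting described above can be sketched in Python. This is a minimal illustration, not the patent's own code; the encoding of an instance as a sequence of per-frame labels, and the `ignore` parameter for a null/transition class, are assumptions made for the example.

```python
from collections import Counter

def find_minority_majority(dataset, ignore=("null",)):
    """Count frames per class across all instances and return the
    minority (fewest frames) and majority (most frames) classes.

    `dataset` is assumed to be a list of instances, each a sequence of
    per-frame class labels; labels in `ignore` (e.g. a null or
    transition class) are excluded from the tally, as the disclosure
    specifies for unlabelled frames.
    """
    counts = Counter()
    for labels in dataset:
        counts.update(l for l in labels if l not in ignore)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    return minority, majority, counts
```

Applied to a toy dataset shaped like the Figure 1 example, this returns class 2 as the minority and class 3 as the majority.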
Note that the class which is the majority class and the class which is the minority class may change across each round and each stage as instances are replicated or deleted.
[0012] With reference to Figure 1, in an example, a dataset of one kind of data may contain five instances 10, numbered 1 through 5. The data may be labelled with one of three classes 20. For the purposes of representation in Figure 1, each class is represented by a different texture and each instance is represented as a bar indicating some number of frames of different classes denoted by the regions of associated textures. Some frames from each instance may be unlabelled, such as frames for transitions. These frames may be considered as part of a default or null class, or an additional 4th class. These frames, and any such associated class, are not included in the determination of a minority or majority class. This example represents a case of an imbalanced dataset in which class 3 is the majority class and class 2 is the minority class. Class 3 appears the most across the five instances in terms of the number of frames (length of the bars). In contrast, class 2 has the fewest frames in the five instances.
[0013] Stochastic sampling may select instances with frames labeled by the minority class during oversampling and may select instances with frames labeled by the majority class during under-sampling. Stochastic sampling of instances with frames of a particular class may be performed by 1) computing the frequency of that class in each instance and then 2) randomly selecting one or more instances, weighted by their frequency. Computing the frequency of a class in one instance may be performed by counting the number of frames associated with that class in the instance divided by the total number of frames associated with that class across all instances. The sum of the frequencies of one class across all instances may sum to 1. Randomly selecting instances may be performed by a weighted random sampling where the probability of selecting each instance is equal to the computed frequency.
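The two-step sampling just described (compute per-instance frequencies, then select with those frequencies as weights) can be sketched as follows. The function name and signature are illustrative assumptions, not the disclosure's own code.

```python
import random

def sample_weighted_by_class(dataset, target_class, k=1, rng=None):
    """Select k instance indices at random, weighted by each instance's
    share of all frames labelled `target_class`.

    The weights are the per-instance frequencies described in [0013]:
    frames of the class in the instance divided by frames of the class
    across all instances, so they sum to 1. Instances with no frames of
    the target class have weight zero and are never selected.
    """
    rng = rng or random.Random()
    per_instance = [sum(1 for l in labels if l == target_class)
                    for labels in dataset]
    total = sum(per_instance)
    if total == 0:
        raise ValueError("no frames labelled with the target class")
    weights = [c / total for c in per_instance]
    return rng.choices(range(len(dataset)), weights=weights, k=k)
```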
[0014] The number of instances to randomly select in a given round of oversampling or under-sampling may be decided given the requirements of the use-case. Selecting only one instance in each round may ensure that the exact distribution of classes is considered for each selected instance, however this strategy may incur a longer run-time as computing the distribution of classes across all instances may be costly in terms of time and computing power. In contrast, selecting many instances in each round may be more efficient but may mean that the distribution of classes computed for each instance is only approximate as the other instances sampled in that round are not considered.
[0015] Replication of a sampled instance may be performed by creating a copy of that instance and appending it to the dataset of all instances. Alternatively, a count of the number of copies of each instance to include in the dataset may be maintained, to reduce the resource requirements of duplicating the entirety of an instance. Deletion of a sampled instance may be performed by removing the instance from the dataset of all instances, or otherwise indicating that the instance is not to be included in the dataset.
[0016] In the stage of oversampling, the overall number of instances will increase, which increases the count of frames from the minority class, but also of other classes which appear in the replicated instances. Similarly, in the stage of under-sampling, the overall number of instances may decrease, which decreases the count of frames from the majority class, but also of other classes which appear in the deleted instances.
[0017] For oversampling, with reference to Figure 1, class 2 was previously identified as the minority class. The frequency of that class within each instance may be calculated, such as by utilizing the counts determined earlier. In this example, instances 1, 4 and 5 have a frequency of class 2 of zero. Instance 2 has a frequency of class 2 of 0.84 and instance 3 of 0.16. In other words, 84% of all appearances of class 2 appear in instance 2 while 16% of all appearances of class 2 appear in instance 3.
[0018] A random sampling may then be done based on these frequencies. In this example, instances 1, 4 and 5 will never be selected and instance 2 has a higher probability of being selected than instance 3. As a result of the sampling (of one instance in this case), the selected instance is replicated. In this example, instance 2 may be replicated to form a new instance 6, otherwise identical to instance 2, including data labelled as classes 3 and 2.
[0019] In this example, instead of one instance being selected as part of the oversampling, three instances may be selected using the weighted random sampling. As an example, two copies of instance 2 and one copy of instance 3 may be selected for replication and appended to the dataset.
[0020] Each stage may be repeated (i.e. performing all three steps) until a specified maximum number of rounds has been performed or until the class imbalance has been lowered below a specified threshold. The maximum number of rounds and the class imbalance threshold may be different for the oversampling stage and the under-sampling stage. The class imbalance may be computed as the fraction of the number of frames in the majority class over the number of frames in the minority class. Note that this fraction can only be calculated if the dataset contains at least one frame labelled with the minority class.
[0021] The result of performing this method on a dataset may be to increase the total number of frames labelled by the minority class and decrease the total number of frames labelled by the majority class, thereby reducing the overall class imbalance in the dataset.
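The imbalance ratio that drives this stopping criterion can be written directly. This is a minimal sketch; the `counts` mapping plays the role of the per-class totals (the c_sum array described in paragraph [0022]).

```python
def class_imbalance(counts):
    """Majority-class frame count divided by minority-class frame count.

    Only defined when every class has at least one labelled frame, as
    noted in [0020]; a perfectly balanced dataset gives 1.0.
    """
    if min(counts.values()) == 0:
        raise ValueError("minority class has no frames; ratio undefined")
    return max(counts.values()) / min(counts.values())
```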
[0022] The below pseudocode represents an embodiment of the above algorithm, doing several rounds of oversampling, followed by several rounds of under-sampling. The variables can be interpreted as follows: Y is a list of instance class labels for the entire dataset; X is a list of instances (e.g. data); M is the number of classes; T_o is the class imbalance threshold for oversampling; T_u is the class imbalance threshold for under-sampling; N_o is the maximum number of rounds for oversampling; N_u is the maximum number of rounds for under-sampling; C is a 2D array of size number of frames by number of classes, representing the number of frames per class in each instance; c_sum is an array representing the total count of each class in the dataset.
[Pseudocode listing: original figures imgf000007_0001 through imgf000009_0001]
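The pseudocode itself appears only as figures in the published document. The following Python sketch is one plausible reconstruction from the variable descriptions in paragraph [0022]: the default thresholds, the single-instance-per-round sampling, and all helper names are assumptions, not the disclosure's own code.

```python
import random
from collections import Counter

def balance_dataset(X, Y, T_o=2.0, T_u=1.5, N_o=100, N_u=100, rng=None):
    """Two-stage balancing: oversample until the imbalance is at most
    T_o (or N_o rounds elapse), then under-sample toward T_u (at most
    N_u rounds).

    X is a list of instances (the data); Y is a parallel list of
    per-frame label sequences, following the variable names in [0022].
    One instance is sampled per round, weighted by its count of the
    current minority (or majority) class.
    """
    rng = rng or random.Random()
    X, Y = list(X), list(Y)  # work on copies

    def class_counts():
        c = Counter()
        for labels in Y:
            c.update(labels)
        return c

    def imbalance(c):
        return max(c.values()) / min(c.values())

    def pick(target):
        # Weighted random choice of one instance index, weighted by the
        # instance's count of the target class.
        weights = [sum(1 for l in labels if l == target) for labels in Y]
        return rng.choices(range(len(Y)), weights=weights, k=1)[0]

    # Stage 1: oversampling -- replicate instances rich in the minority class.
    for _ in range(N_o):
        c = class_counts()
        if imbalance(c) <= T_o:
            break
        i = pick(min(c, key=c.get))
        X.append(X[i])
        Y.append(Y[i])

    # Stage 2: under-sampling -- delete instances rich in the majority class.
    for _ in range(N_u):
        c = class_counts()
        if imbalance(c) <= T_u:
            break
        i = pick(max(c, key=c.get))
        del X[i], Y[i]

    return X, Y
```

A weight-based bookkeeping variant (paragraph [0023]) would avoid the physical `append`/`del` operations.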
[0023] The above pseudocode describes only one way of implementing the proposed method but other equivalent implementations may be expressed. For example, instead of replicating and removing instances from the dataset, instances may equivalently be associated with integer (or fractional) weights representing the number of copies of that instance in the dataset.
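A minimal sketch of this weight-based variant (the helper names are hypothetical, not from the disclosure): maintain an integer multiplicity per instance index instead of physically copying or deleting data.

```python
from collections import Counter

# One integer weight per instance index; each starts with one copy.
copies = Counter({i: 1 for i in range(5)})  # five instances

def replicate(i):
    """Oversampling: record one more copy of instance i."""
    copies[i] += 1

def delete(i):
    """Under-sampling: drop one copy; remove the entry at zero."""
    copies[i] -= 1
    if copies[i] <= 0:
        del copies[i]

replicate(1)  # oversample instance 1
delete(4)     # under-sample instance 4 (its only copy is removed)
```

Downstream code then iterates each instance `copies[i]` times, so no instance data is ever duplicated in memory.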
[0024] The described method may be performed in isolation or conjunction with any number of other algorithms or operations and may be performed once or multiple times with respect to one or more datasets.
[0025] The described method may be implemented in any appropriate programming language such as Python, C++, CUDA, or Java; may be run in a single thread or in parallel on multiple threads; and may be run on any appropriate hardware, such as CPU hardware, GPU hardware, embedded systems, or FPGAs, on any appropriate operating system such as Linux, Windows, or MacOS. Appropriate memory, such as a combination of RAM, solid state drives or hard drives, may be used to maintain the instances of the dataset, as well as the counts and weights listed above.
[0026] The described method may be applied to any appropriate dataset, such as a dataset of instances containing sequential data with class labels for each sequential frame in the instance (or a dataset of instances where each instance has a plurality of classes, etc.). An appropriate dataset may be associated with a relevant application such as activity recognition or speech recognition and may be used for an appropriate purpose, such as training, validation, statistical analysis, visualisation or similar purposes.
[0027] The embodiments of the invention described above are intended to be exemplary only. The scope of the invention is therefore intended to be limited solely by the scope of the appended claims.

Claims

CLAIMS:
1. A system comprising: at least one memory storing a dataset comprising a plurality of instances of sequential data, and instructions, the instances of sequential data each having portions of the sequential data labelled with a class; at least one hardware processor interoperably coupled with the at least one memory, wherein the instructions instruct the at least one hardware processor to: over-sampling the instances by: identifying a minority class being the class that has the shortest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the minority class; replicating the selected one or more instances in the dataset; under-sampling the instances by: identifying a majority class being the class that has the longest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the majority class; removing the selected one or more instances from the dataset; wherein after the over-sampling and under-sampling, the class imbalance, computed as the fraction of the portion of the sequential data labelled with the majority class over the portion of the sequential data labelled with the minority class, is reduced.
2. The system of claim 1 wherein the sequential data comprises frames of data and each frame, in a portion of the sequential data labelled with a class, is labelled with the class.
3. The system of claim 2 wherein the portions of the sequential data labelled with a class comprise the number of frames in the sequential data labelled with the class.
4. The system of any one of claims 1 to 3 wherein the over-sampling is repeated one or more times.
5. The system of any one of claims 1 to 4 wherein the under-sampling is repeated one or more times.
6. The system of any one of claims 1 to 3 wherein, after the under-sampling, the over-sampling and under-sampling are repeated one or more times.
7. The system of claim 6 wherein the maximum number of repetitions is predetermined.
8. The system of claim 7 wherein the over-sampling and under-sampling are repeated until the class imbalance has been reduced to less than a determined amount or the maximum is reached.
9. A method of balancing a dataset comprising a plurality of instances of sequential data, the instances of sequential data each having portions of the sequential data labelled with a class, the method comprising: over-sampling the instances by: identifying a minority class being the class that has the shortest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the minority class; replicating the selected one or more instances in the dataset; under-sampling the instances by: identifying a majority class being the class that has the longest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the majority class; removing the selected one or more instances from the dataset; wherein after the over-sampling and under-sampling, the class imbalance, computed as the fraction of the portion of the sequential data labelled with the majority class over the portion of the sequential data labelled with the minority class, is reduced.
10. The method of claim 9 wherein the sequential data comprises frames of data and each frame in a portion of the sequential data labelled with a class is labelled with that class.
11. The method of claim 10 wherein the portion of the sequential data labelled with a class is measured as the number of frames in the sequential data labelled with the class.
12. The method of any one of claims 9 to 11 wherein the over-sampling is repeated one or more times.
13. The method of any one of claims 9 to 12 wherein the under-sampling is repeated one or more times.
14. The method of any one of claims 9 to 11 wherein, after the under-sampling, the over-sampling and under-sampling are repeated one or more times.
15. The method of claim 14 wherein the maximum number of repetitions is predetermined.
16. The method of claim 15 wherein the over-sampling and under-sampling are repeated until the class imbalance has been reduced to less than a determined amount or the maximum number of repetitions is reached.
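The stopping rule of claims 14–16 (repeat until the imbalance drops below a determined amount or a predetermined maximum number of repetitions is hit) can be sketched as a loop around any single balancing pass. The names here (`balance_until`, `one_pass`, the `"labels"` frame-list layout) are illustrative assumptions; the imbalance is computed as in claim 9, majority-class frames over minority-class frames.

```python
def balance_until(dataset, classes, one_pass, threshold=1.5, max_reps=100):
    """Repeat a balancing pass until the class imbalance falls below
    `threshold` or `max_reps` passes have run; returns the balanced
    dataset and the number of passes performed."""
    def imbalance(ds):
        # Majority-class frame count over minority-class frame count.
        counts = [sum(inst["labels"].count(c) for inst in ds) for c in classes]
        return max(counts) / max(min(counts), 1)  # guard empty minority

    reps = 0
    while imbalance(dataset) >= threshold and reps < max_reps:
        dataset = one_pass(dataset)
        reps += 1
    return dataset, reps
```

Taking `one_pass` as a parameter keeps the stopping criteria separate from the sampling strategy, so the same loop applies whether the pass does over-sampling only (claim 12), under-sampling only (claim 13), or both (claim 14).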
PCT/IB2020/052251 2019-03-15 2020-03-12 Method for balancing datasets of multi-class instance data WO2020188425A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA3,036,847 2019-03-15
CA3036847A CA3036847A1 (en) 2019-03-15 2019-03-15 Method for balancing datasets of multi-class instance data

Publications (1)

Publication Number Publication Date
WO2020188425A1 true WO2020188425A1 (en) 2020-09-24

Family

ID=72519781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2020/052251 WO2020188425A1 (en) 2019-03-15 2020-03-12 Method for balancing datasets of multi-class instance data

Country Status (2)

Country Link
CA (1) CA3036847A1 (en)
WO (1) WO2020188425A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395558A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter
CN112685515A (en) * 2021-01-08 2021-04-20 西安理工大学 Discrete data oversampling method based on D-SMOTE
US20220374410A1 (en) * 2021-05-12 2022-11-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation
US11947633B2 (en) * 2020-11-30 2024-04-02 Yahoo Assets Llc Oversampling for imbalanced test data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BATISTA ET AL.: "A study of the behaviour of several methods for balancing machine learning training data", ACM SIGKDD EXPLORATIONS NEWSLETTER, vol. 6, no. 1, June 2004 (2004-06-01), pages 20 - 29, XP058218362, DOI: 10.1145/1007730.1007735 *
CHOIRUNNISA SHABRINA; LIANTO JOKO: "Hybrid Method of Undersampling and Oversampling for Handling Imbalanced Data", 2018 INTERNATIONAL SEMINAR ON RESEARCH OF INFORMATION TECHNOLOGY AND INTELLIGENT SYSTEMS (ISRITI), November 2018 (2018-11-01), pages 276 - 280, XP033629349, ISBN: 978-1-5386-7422-2, DOI: 10.1109/ISRITI.2018.8864335 *
MAO ET AL.: "Extreme Learning Machine with Hybrid Sampling Strategy for Sequential Imbalanced Data", COGNITIVE COMPUTATION, vol. 9, no. 6, December 2017 (2017-12-01), pages 780 - 801, XP036379964, ISSN: 1866-9956, DOI: 10.1007/s12559-017-9504-2 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395558A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter
CN112395558B (en) * 2020-11-27 2023-05-26 广东电网有限责任公司肇庆供电局 Improved unbalanced data mixed sampling method suitable for historical fault data of intelligent electric meter
US11947633B2 (en) * 2020-11-30 2024-04-02 Yahoo Assets Llc Oversampling for imbalanced test data
CN112685515A (en) * 2021-01-08 2021-04-20 西安理工大学 Discrete data oversampling method based on D-SMOTE
US20220374410A1 (en) * 2021-05-12 2022-11-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation
US11797516B2 (en) * 2021-05-12 2023-10-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation

Also Published As

Publication number Publication date
CA3036847A1 (en) 2020-09-15

Similar Documents

Publication Publication Date Title
WO2020188425A1 (en) Method for balancing datasets of multi-class instance data
AU2018232914B2 (en) Techniques for correcting linguistic training bias in training data
US8983959B2 (en) Optimized partitions for grouping and differentiating files of data
CN110019779B (en) Text classification method, model training method and device
JP5523543B2 (en) Concept recognition method and concept recognition device based on co-learning
CN111243601B (en) Voiceprint clustering method and device, electronic equipment and computer-readable storage medium
EP2385471A1 (en) Measuring document similarity
US20230334154A1 (en) Byte n-gram embedding model
CN109299264A (en) File classification method, device, computer equipment and storage medium
EP3989216A1 (en) Automatic preparation of a new midi file
US20240078330A1 (en) A method and system for lossy compression of log files of data
KR20210004036A (en) Method and apparatus for operating independent classification model using metadata
CN110413580A (en) For the compression method of FPGA configuration bit stream, system, device
Khaleel Image Compression Using Swarm Intelligence.
US9553605B2 (en) Efficient data encoding
CN109299260B (en) Data classification method, device and computer readable storage medium
EP4237977B1 (en) Method for detection of malware
CN111737007B (en) Frequency division processing system and method for data object
CN111639496A (en) Text similarity calculation method and system based on intelligent weighted word segmentation technology
Nguyen et al. A proposed approach to compound file fragment identification
CN113486671B (en) Regular expression coding-based data expansion method, device, equipment and medium
EP4360016A1 (en) Method and system for selecting samples to represent a cluster
Reyssan et al. Multiple Character Modification for Huffman Algorithm
Chang et al. An efficient and effective method for VQ codebook design
CN113590173A (en) Code statistical method, system, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20774328

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20774328

Country of ref document: EP

Kind code of ref document: A1