WO2020188425A1 - Method for balancing datasets of multi-class instance data - Google Patents

Method for balancing datasets of multi-class instance data Download PDF

Info

Publication number
WO2020188425A1
WO2020188425A1 PCT/IB2020/052251
Authority
WO
WIPO (PCT)
Prior art keywords
class
instances
sampling
sequential data
labelled
Prior art date
Application number
PCT/IB2020/052251
Other languages
French (fr)
Inventor
Colin Brown
Original Assignee
Wrnch Inc.
Priority date
Filing date
Publication date
Application filed by Wrnch Inc. filed Critical Wrnch Inc.
Publication of WO2020188425A1 publication Critical patent/WO2020188425A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/08Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks


Abstract

This disclosure describes a method for balancing datasets of instances in which each instance may be labelled by a sequence, plurality or distribution of class labels. The disclosure includes performing stochastic under-sampling (removal of dataset instances) and oversampling (replication of dataset instances) based on the distribution of classes in each instance, to minimize the ratio between the sizes of the minority class (i.e. class labelling the fewest frames across all instances) and the majority class (i.e. class labelling the most frames across all instances).

Description

METHOD FOR BALANCING DATASETS OF MULTI-CLASS INSTANCE DATA
FIELD
[001] This disclosure relates to methods for balancing datasets of multi-class-labelled instances of data, which may be sequential.
BACKGROUND
[002] Trained classifiers are often sensitive to the distribution of labels in the training set but acquiring or creating class-balanced datasets can be challenging. The problem is exacerbated in the multi-class case when there exist more than two classes from which to choose.
[003] In many training databases for machine learning tasks that operate over sequential data, (e.g. human activity recognition, speech recognition), individual training instances are encoded as sequences of frames (or time-samples, elements, etc.), which each have a class label. For example, in the case of activity recognition from video, each frame of a labelled training video may be associated with an activity class representing the activity of a person in that frame. In general, each training instance may be associated with a sequence of class labels, which may be summarized as a distribution of labels for that training instance.
[004] In the case of sequentially labelled training instances, traditional methods for managing class imbalance, for example oversampling, under-sampling or SMOTE (Synthetic Minority Over-sampling Technique), may fail as these methods typically rely on the assumption that each instance is associated with a single label. In attempt to satisfy this assumption, sequential instances may be split into sub-sequences of uniform class labels, however 1) contextual and class-transition information that may be important for training could be lost and 2) instances may be broken into sub-sequences of varying lengths, which may be unsuitable or undesirable for the given training task.
SUMMARY
[005] This disclosure describes a method for balancing datasets of instances in which each instance may be labelled by a sequence, plurality or distribution of class labels. The disclosure includes performing stochastic under-sampling (removal of dataset instances) and oversampling (replication of dataset instances) based on the distribution of classes in each instance, to minimize the ratio between the sizes of the minority class (i.e. class labelling the fewest frames across all instances) and the majority class (i.e. class labelling the most frames across all instances).
BRIEF DESCRIPTION OF THE DRAWINGS
[006] Figure 1 is a representation of class instances within several training instances.
DETAILED DESCRIPTION
[007] The method for balancing datasets of instances may proceed in two stages: 1) oversampling and 2) under-sampling. Oversampling may be performed prior to under-sampling in order to reduce the loss of unique, original instances. In some cases under-sampling may result in the removal of all replicas of a specific instance. Depending on the requirements of the application, such as a limit on the number of instances, these two stages may be repeated, such as in an interleaved manner or sequential manner. Also depending on the requirements of the application, only one stage may be performed. For example, if it is desirable to train on a smaller dataset, the under-sampling stage may be performed without performing oversampling.
[008] Each of the two stages may comprise a number of repeated rounds. Within each round, a sequence of steps may be performed, which may include 1) determination of the minority class (for oversampling) or majority class (for under-sampling) in the current dataset, 2) stochastic sampling of one or more instances in the dataset, weighted by the count of that class in each instance, 3) replication (for oversampling) or deletion (for under-sampling) of those selected instances in the dataset. The count of a class may be the number of frames associated with that class in the dataset. These steps are described below in more detail.
[009] After performing this method on a dataset, the dataset may be used as a training set for a machine learning system. Typically, training sets contain thousands or more instances, each potentially labelled with multiple classes. After applying this method, there can be assurance that the difference in frequency between the minority class appearing the fewest times in the dataset and the majority class appearing the most times is within a specified limit. In other words, the proportions of the classes that appear in the dataset are more consistent. This can assist with the training of machine learning systems, which may fail or not work as desired for classes which do not appear frequently enough within the training dataset.
[0010] Each instance is preferably sequential data, such as data representing video, audio or text. These instances may be used, for example, for recognizing human activities, for speech recognition, or for classifying the content in documents. The sequential data may be video clips, skeleton data representing human joint positions or other similar information. Individual training instances may be encoded as sequences of frames, time-samples, elements, or similar, where each frame, or a range of frames from the clips, has a class label.
[0011] Determination of the minority class involves determining which class appears the fewest times in the dataset or has the shortest total portion labelled with that class. It may be performed by counting the total number of frames associated with each class (or keeping track of this count incrementally) and finding the class with the smallest count. Similarly, determination of the majority class may involve counting the total number of frames associated with each class and finding the class with the largest count or longest total portion labelled with that class. This may be done in a number of ways, such as by counting the number of frames for each class, or storing and incrementing a count of each class while each instance of the dataset is scanned.
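The counting described above can be sketched in Python. This is a minimal illustration, not the patent's own code; the encoding of an instance as a sequence of per-frame labels, and the `ignore` parameter for a null/transition class, are assumptions made for the example.

```python
from collections import Counter

def find_minority_majority(dataset, ignore=("null",)):
    """Count frames per class across all instances and return the
    minority (fewest frames) and majority (most frames) classes.

    `dataset` is assumed to be a list of instances, each a sequence of
    per-frame class labels; labels in `ignore` (e.g. a null or
    transition class) are excluded from the tally, as the disclosure
    specifies for unlabelled frames.
    """
    counts = Counter()
    for labels in dataset:
        counts.update(l for l in labels if l not in ignore)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    return minority, majority, counts
```

Applied to a toy dataset shaped like the Figure 1 example, this returns class 2 as the minority and class 3 as the majority.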
Note that the class which is the majority class and the class which is the minority class may change across each round and each stage as instances are replicated or deleted.
[0012] With reference to Figure 1, in an example, a dataset of one kind of data may contain five instances 10, numbered 1 through 5. The data may be labelled with one of three classes 20. For the purposes of representation in Figure 1, each class is represented by a different texture and each instance is represented as a bar indicating some number of frames of different classes denoted by the regions of associated textures. Some frames from each instance may be unlabelled, such as frames for transitions. These frames may be considered as part of a default or null class, or an additional 4th class. These frames, and any such associated class, are not included in the determination of a minority or majority class. This example represents a case of an imbalanced dataset in which class 3 is the majority class and class 2 is the minority class. Class 3 appears the most across the five instances in terms of the number of frames (length of the bars). In contrast, class 2 has the fewest frames in the five instances.
[0013] Stochastic sampling may select instances with frames labeled by the minority class during oversampling and may select instances with frames labeled by the majority class during under-sampling. Stochastic sampling of instances with frames of a particular class may be performed by 1) computing the frequency of that class in each instance and then 2) randomly selecting one or more instances, weighted by their frequency. Computing the frequency of a class in one instance may be performed by counting the number of frames associated with that class in the instance divided by the total number of frames associated with that class across all instances. The sum of the frequencies of one class across all instances may sum to 1. Randomly selecting instances may be performed by a weighted random sampling where the probability of selecting each instance is equal to the computed frequency.
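The two-step sampling just described (compute per-instance frequencies, then select with those frequencies as weights) can be sketched as follows. The function name and signature are illustrative assumptions, not the disclosure's own code.

```python
import random

def sample_weighted_by_class(dataset, target_class, k=1, rng=None):
    """Select k instance indices at random, weighted by each instance's
    share of all frames labelled `target_class`.

    The weights are the per-instance frequencies described in [0013]:
    frames of the class in the instance divided by frames of the class
    across all instances, so they sum to 1. Instances with no frames of
    the target class have weight zero and are never selected.
    """
    rng = rng or random.Random()
    per_instance = [sum(1 for l in labels if l == target_class)
                    for labels in dataset]
    total = sum(per_instance)
    if total == 0:
        raise ValueError("no frames labelled with the target class")
    weights = [c / total for c in per_instance]
    return rng.choices(range(len(dataset)), weights=weights, k=k)
```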
[0014] The number of instances to randomly select in a given round of oversampling or under-sampling may be decided given the requirements of the use-case. Selecting only one instance in each round may ensure that the exact distribution of classes is considered for each selected instance, however this strategy may incur a longer run-time as computing the distribution of classes across all instances may be costly in terms of time and computing power. In contrast, selecting many instances in each round may be more efficient but may mean that the distribution of classes computed for each instance is only approximate as the other instances sampled in that round are not considered.
[0015] Replication of a sampled instance may be performed by creating a copy of that instance and appending it to the dataset of all instances. Alternatively, a count of the number of copies of each instance to include in the dataset may be maintained, to reduce the resource requirements of duplicating the entirety of an instance. Deletion of a sampled instance may be performed by removing the instance from the dataset of all instances, or otherwise indicating that the instance is not to be included in the dataset.
[0016] In the stage of oversampling, the overall number of instances will increase, which increases the count of frames from the minority class, but also of other classes which appear in the replicated instances. Similarly, in the stage of under-sampling, the overall number of instances may decrease, which decreases the count of frames from the majority class, but also of other classes which appear in the deleted instances.
[0017] For oversampling, with reference to Figure 1, class 2 was previously identified as the minority class. The frequency of that class within each instance may be calculated, such as by utilizing the counts determined earlier. In this example, instances 1, 4 and 5 have a frequency of class 2 of zero. Instance 2 has a frequency of class 2 of 0.84 and instance 3 of 0.16. In other words, 84% of all appearances of class 2 appear in instance 2 while 16% of all appearances of class 2 appear in instance 3.
[0018] A random sampling may then be done based on these frequencies. In this example, instances 1, 4 and 5 will never be selected and instance 2 has a higher probability of being selected than instance 3. As a result of the sampling (of one instance in this case), the selected instance is replicated. In this example, instance 2 may be replicated to form a new instance 6, otherwise identical to instance 2, including data labelled as classes 3 and 2.
[0019] In this example, instead of one instance being selected as part of the oversampling, three instances may be selected using the weighted random sampling. As an example, two copies of instance 2 and one copy of instance 3 may be selected for replication and appended to the dataset.
[0020] Each stage may be repeated (i.e. performing all three steps) until a specified maximum number of rounds has been performed or until the class imbalance has been lowered below a specified threshold. The maximum number of rounds and the class imbalance threshold may be different for the oversampling stage and the under-sampling stage. The class imbalance may be computed as the fraction of the number of frames in the majority class over the number of frames in the minority class. Note that this fraction can only be calculated if the dataset contains at least one frame labelled with the minority class.
[0021] The result of performing this method on a dataset may be to increase the total number of frames labelled by the minority class and decrease the total number of frames labelled by the majority class, thereby reducing the overall class imbalance in the dataset.
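The imbalance ratio that drives this stopping criterion can be written directly. This is a minimal sketch; the `counts` mapping plays the role of the per-class totals (the c_sum array described in paragraph [0022]).

```python
def class_imbalance(counts):
    """Majority-class frame count divided by minority-class frame count.

    Only defined when every class has at least one labelled frame, as
    noted in [0020]; a perfectly balanced dataset gives 1.0.
    """
    if min(counts.values()) == 0:
        raise ValueError("minority class has no frames; ratio undefined")
    return max(counts.values()) / min(counts.values())
```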
[0022] The below pseudocode represents an embodiment of the above algorithm, doing several rounds of oversampling, followed by several rounds of under-sampling. The variables can be interpreted as follows: Y is a list of instance class labels for the entire dataset; X is a list of instances (e.g. data); M is the number of classes; T_o is the class imbalance threshold for oversampling; T_u is the class imbalance threshold for under-sampling; N_o is the maximum number of rounds for oversampling; N_u is the maximum number of rounds for under-sampling; C is a 2D array of size number of frames by number of classes, representing the number of frames per class in each instance; c_sum is an array representing the total count of each class in the dataset.
[Pseudocode listing: original figures imgf000007_0001 through imgf000009_0001]
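The pseudocode itself appears only as figures in the published document. The following Python sketch is one plausible reconstruction from the variable descriptions in paragraph [0022]: the default thresholds, the single-instance-per-round sampling, and all helper names are assumptions, not the disclosure's own code.

```python
import random
from collections import Counter

def balance_dataset(X, Y, T_o=2.0, T_u=1.5, N_o=100, N_u=100, rng=None):
    """Two-stage balancing: oversample until the imbalance is at most
    T_o (or N_o rounds elapse), then under-sample toward T_u (at most
    N_u rounds).

    X is a list of instances (the data); Y is a parallel list of
    per-frame label sequences, following the variable names in [0022].
    One instance is sampled per round, weighted by its count of the
    current minority (or majority) class.
    """
    rng = rng or random.Random()
    X, Y = list(X), list(Y)  # work on copies

    def class_counts():
        c = Counter()
        for labels in Y:
            c.update(labels)
        return c

    def imbalance(c):
        return max(c.values()) / min(c.values())

    def pick(target):
        # Weighted random choice of one instance index, weighted by the
        # instance's count of the target class.
        weights = [sum(1 for l in labels if l == target) for labels in Y]
        return rng.choices(range(len(Y)), weights=weights, k=1)[0]

    # Stage 1: oversampling -- replicate instances rich in the minority class.
    for _ in range(N_o):
        c = class_counts()
        if imbalance(c) <= T_o:
            break
        i = pick(min(c, key=c.get))
        X.append(X[i])
        Y.append(Y[i])

    # Stage 2: under-sampling -- delete instances rich in the majority class.
    for _ in range(N_u):
        c = class_counts()
        if imbalance(c) <= T_u:
            break
        i = pick(max(c, key=c.get))
        del X[i], Y[i]

    return X, Y
```

A weight-based bookkeeping variant (paragraph [0023]) would avoid the physical `append`/`del` operations.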
[0023] The above pseudocode describes only one way of implementing the proposed method but other equivalent implementations may be expressed. For example, instead of replicating and removing instances from the dataset, instances may equivalently be associated with integer (or fractional) weights representing the number of copies of that instance in the dataset.
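A minimal sketch of this weight-based variant (the helper names are hypothetical, not from the disclosure): maintain an integer multiplicity per instance index instead of physically copying or deleting data.

```python
from collections import Counter

# One integer weight per instance index; each starts with one copy.
copies = Counter({i: 1 for i in range(5)})  # five instances

def replicate(i):
    """Oversampling: record one more copy of instance i."""
    copies[i] += 1

def delete(i):
    """Under-sampling: drop one copy; remove the entry at zero."""
    copies[i] -= 1
    if copies[i] <= 0:
        del copies[i]

replicate(1)  # oversample instance 1
delete(4)     # under-sample instance 4 (its only copy is removed)
```

Downstream code then iterates each instance `copies[i]` times, so no instance data is ever duplicated in memory.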
[0024] The described method may be performed in isolation or conjunction with any number of other algorithms or operations and may be performed once or multiple times with respect to one or more datasets.
[0025] The described method may be implemented in any appropriate programming language such as Python, C++, CUDA, or Java; may be run in a single thread or in parallel on multiple threads; and may be run on any appropriate hardware, such as CPU hardware, GPU hardware, embedded systems, or FPGAs, on any appropriate operating system such as Linux, Windows, or MacOS. Appropriate memory, such as a combination of RAM, solid state drives or hard drives, may be used to maintain the instances of the dataset, as well as the counts and weights listed above.
[0026] The described method may be applied to any appropriate dataset, such as a dataset of instances containing sequential data with class labels for each sequential frame in the instance (or a dataset of instances where each instance has a plurality of classes, etc.). An appropriate dataset may be associated with a relevant application such as activity recognition or speech recognition and may be used for an appropriate purpose, such as training, validation, statistical analysis, visualisation or similar purposes.
[0027] The embodiments of the invention described above are intended to be exemplary only. The scope of the invention is therefore intended to be limited solely by the scope of the appended claims.

Claims

CLAIMS:
1. A system comprising: at least one memory storing a dataset comprising a plurality of instances of sequential data, and instructions, the instances of sequential data each having portions of the sequential data labelled with a class; at least one hardware processor interoperably coupled with the at least one memory, wherein the instructions instruct the at least one hardware processor to: over-sampling the instances by: identifying a minority class being the class that has the shortest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the minority class; replicating the selected one or more instances in the dataset; under-sampling the instances by: identifying a majority class being the class that has the longest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the majority class; removing the selected one or more instances from the dataset; wherein after the over-sampling and under-sampling, the class imbalance, computed as the fraction of the portion of the sequential data labelled with the majority class over the portion of the sequential data labelled with the minority class, is reduced.
2. The system of claim 1 wherein the sequential data comprises frames of data and each frame, in a portion of the sequential data labelled with a class, is labelled with the class.
3. The system of claim 2 wherein the portions of the sequential data labelled with a class comprise the number of frames in the sequential data labelled with the class.
4. The system of any one of claims 1 to 3 wherein the over-sampling is repeated one or more times.
5. The system of any one of claims 1 to 4 wherein the under-sampling is repeated one or more times.
6. The system of any one of claims 1 to 3 wherein, after the under-sampling, the over-sampling and under-sampling are repeated one or more times.
7. The system of claim 6 wherein the maximum number of repetitions is predetermined.
8. The system of claim 7 wherein the over-sampling and under-sampling are repeated until the class imbalance has been reduced to less than a determined amount or the maximum is reached.
9. A method of balancing a dataset comprising a plurality of instances of sequential data, the instances of sequential data each having portions of the sequential data labelled with a class, the method comprising: over-sampling the instances by: identifying a minority class being the class that has the shortest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the minority class; replicating the selected one or more instances in the dataset; under-sampling the instances by: identifying a majority class being the class that has the longest total portion of sequential data labelled with the class; randomly selecting one or more instances of the sequential data weighted by the portion of the sequential data labelled with the majority class; removing the selected one or more instances from the dataset; wherein after the over-sampling and under-sampling, the class imbalance, computed as the fraction of the portion of the sequential data labelled with the majority class over the portion of the sequential data labelled with the minority class, is reduced.
10. The method of claim 9 wherein the sequential data comprises frames of data and each frame in a portion of the sequential data labelled with a class is labelled with that class.
11. The method of claim 10 wherein the portion of the sequential data labelled with a class is measured as the number of frames in the sequential data labelled with the class.
12. The method of any one of claims 9 to 11 wherein the over-sampling is repeated one or more times.
13. The method of any one of claims 9 to 12 wherein the under-sampling is repeated one or more times.
14. The method of any one of claims 9 to 11 wherein, after the under-sampling, the over-sampling and under-sampling are repeated one or more times.
15. The method of claim 14 wherein the maximum number of repetitions is predetermined.
16. The method of claim 15 wherein the over-sampling and under-sampling are repeated until the class imbalance has been reduced to less than a determined amount or the maximum number of repetitions is reached.
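The stopping rule of claims 14–16 (repeat until the imbalance drops below a determined amount or a predetermined maximum number of repetitions is hit) can be sketched as a loop around any single balancing pass. The names here (`balance_until`, `one_pass`, the `"labels"` frame-list layout) are illustrative assumptions; the imbalance is computed as in claim 9, majority-class frames over minority-class frames.

```python
def balance_until(dataset, classes, one_pass, threshold=1.5, max_reps=100):
    """Repeat a balancing pass until the class imbalance falls below
    `threshold` or `max_reps` passes have run; returns the balanced
    dataset and the number of passes performed."""
    def imbalance(ds):
        # Majority-class frame count over minority-class frame count.
        counts = [sum(inst["labels"].count(c) for inst in ds) for c in classes]
        return max(counts) / max(min(counts), 1)  # guard empty minority

    reps = 0
    while imbalance(dataset) >= threshold and reps < max_reps:
        dataset = one_pass(dataset)
        reps += 1
    return dataset, reps
```

Taking `one_pass` as a parameter keeps the stopping criteria separate from the sampling strategy, so the same loop applies whether the pass does over-sampling only (claim 12), under-sampling only (claim 13), or both (claim 14).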
PCT/IB2020/052251 2019-03-15 2020-03-12 Method for balancing datasets of multi-class instance data WO2020188425A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA3,036,847 2019-03-15
CA3036847A CA3036847A1 (en) 2019-03-15 2019-03-15 Method for balancing datasets of multi-class instance data

Publications (1)

Publication Number Publication Date
WO2020188425A1 true WO2020188425A1 (en) 2020-09-24

Family

ID=72519781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2020/052251 WO2020188425A1 (en) 2019-03-15 2020-03-12 Method for balancing datasets of multi-class instance data

Country Status (2)

Country Link
CA (1) CA3036847A1 (en)
WO (1) WO2020188425A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395558A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter
CN112685515A (en) * 2021-01-08 2021-04-20 西安理工大学 Discrete data oversampling method based on D-SMOTE
US20220374410A1 (en) * 2021-05-12 2022-11-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation
US11947633B2 (en) * 2020-11-30 2024-04-02 Yahoo Assets Llc Oversampling for imbalanced test data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BATISTA ET AL.: "A study of the behaviour of several methods for balancing machine learning training data", ACM SIGKDD EXPLORATIONS NEWSLETTER, vol. 6, no. 1, June 2004 (2004-06-01), pages 20 - 29, XP058218362, DOI: 10.1145/1007730.1007735 *
CHOIRUNNISA SHABRINA; LIANTO JOKO: "Hybrid Method of Undersampling and Oversampling for Handling Imbalanced Data", 2018 INTERNATIONAL SEMINAR ON RESEARCH OF INFORMATION TECHNOLOGY AND INTELLIGENT SYSTEMS (ISRITI), November 2018 (2018-11-01), pages 276 - 280, XP033629349, ISBN: 978-1-5386-7422-2, DOI: 10.1109/ISRITI.2018.8864335 *
MAO ET AL.: "Extreme Learning Machine with Hybrid Sampling Strategy for Sequential Imbalanced Data", COGNITIVE COMPUTATION, vol. 9, no. 6, December 2017 (2017-12-01), pages 780 - 801, XP036379964, ISSN: 1866-9956, DOI: 10.1007/s12559-017-9504-2 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395558A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter
CN112395558B (en) * 2020-11-27 2023-05-26 广东电网有限责任公司肇庆供电局 Improved unbalanced data mixed sampling method suitable for historical fault data of intelligent electric meter
US11947633B2 (en) * 2020-11-30 2024-04-02 Yahoo Assets Llc Oversampling for imbalanced test data
CN112685515A (en) * 2021-01-08 2021-04-20 西安理工大学 Discrete data oversampling method based on D-SMOTE
US20220374410A1 (en) * 2021-05-12 2022-11-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation
US11797516B2 (en) * 2021-05-12 2023-10-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation

Also Published As

Publication number Publication date
CA3036847A1 (en) 2020-09-15

Similar Documents

Publication Publication Date Title
WO2020188425A1 (en) Method for balancing datasets of multi-class instance data
AU2018232914B2 (en) Techniques for correcting linguistic training bias in training data
US8983959B2 (en) Optimized partitions for grouping and differentiating files of data
CN110019779B (en) Text classification method, model training method and device
JP5523543B2 (en) Concept recognition method and concept recognition device based on co-learning
CN111243601B (en) Voiceprint clustering method and device, electronic equipment and computer-readable storage medium
EP2385471A1 (en) Measuring document similarity
US20230334154A1 (en) Byte n-gram embedding model
CN109299264A (en) File classification method, device, computer equipment and storage medium
EP3989216A1 (en) Automatic preparation of a new midi file
US20240078330A1 (en) A method and system for lossy compression of log files of data
KR20210004036A (en) Method and apparatus for operating independent classification model using metadata
CN110413580A (en) For the compression method of FPGA configuration bit stream, system, device
Khaleel Image Compression Using Swarm Intelligence.
US9553605B2 (en) Efficient data encoding
CN109299260B (en) Data classification method, device and computer readable storage medium
EP4237977B1 (en) Method for detection of malware
CN111737007B (en) Frequency division processing system and method for data object
CN111639496A (en) Text similarity calculation method and system based on intelligent weighted word segmentation technology
Nguyen et al. A proposed approach to compound file fragment identification
CN113486671B (en) Regular expression coding-based data expansion method, device, equipment and medium
EP4360016A1 (en) Method and system for selecting samples to represent a cluster
Reyssan et al. Multiple Character Modification for Huffman Algorithm
Chang et al. An efficient and effective method for VQ codebook design
CN113590173A (en) Code statistical method, system, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20774328

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20774328

Country of ref document: EP

Kind code of ref document: A1