CN109033148A - Multi-classification-oriented unbalanced data preprocessing method, device and equipment - Google Patents
- Publication number
- CN109033148A, CN201810599969.3A
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- sample set
- class
- minority
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-classification-oriented unbalanced data preprocessing method, device and equipment. The method includes: receiving the final sample set size and the imbalance ratio between the sample sets, and deriving the ideal number of samples for each class; determining the minority class sample sets and the majority class sample sets from the ideal and actual numbers of samples; for each sample in a minority class sample set, counting the other-class samples and the minority class samples among its k nearest neighbors and labeling the sample accordingly; deleting, keeping, copying or synthesizing the samples in the minority class sample set according to their labels to obtain the final minority class sample set; for each sample in a majority class sample set, counting the samples of that majority class and the other-class samples among its k nearest neighbors and labeling the sample accordingly; deleting or keeping the samples in the majority class sample set according to their labels to obtain the final majority class sample set; and generating the final sample set. The invention enables the final sample set to effectively improve the accuracy of multi-classification algorithms.
Description
Technical field
The present invention relates to the field of big data processing, and in particular to a multi-classification-oriented unbalanced data preprocessing method, device and equipment.
Background
With continuing technical progress, including faster Internet connections, successive generations of mobile Internet, and steady advances in hardware, data acquisition, storage and processing technology, data is growing at an unprecedented rate, and we have entered the era of big data. The characteristics of big data, namely huge scale (volume), high-speed generation (velocity), diverse forms (variety) and uncertain quality (veracity), pose unprecedented challenges to traditional data analysis and mining techniques when they are applied to big data.
Data classification is a basic technique in data analysis and mining; it has a wide range of applications and underlies many other data analysis and mining algorithms. In big data, almost all data sets are unbalanced; unbalanced data means that at least one class in the data set contains fewer samples than the other classes. Data imbalance is widespread in the real world, especially in big data applications. For example, in Internet text classification the classes are unevenly populated, and the classes of interest are often the small ones, such as sensitive information on the network or newly emerging topics. In e-commerce applications, the vast majority of user transaction and behavior data is normal, while the focus is usually on fraud and abnormal behavior; these records are submerged in a large volume of normal behavior data and form a severely skewed, unbalanced data set. Similar applications include medical diagnosis and the classification of satellite remote sensing data. Classification of unbalanced big data is therefore a key technical problem that urgently needs to be solved in national economic and social development, and it has broad application prospects.
Because the numbers of samples in the different classes of unbalanced big data differ so much, traditional classification learning algorithms struggle to achieve good results. Figure 1 shows an example of unbalanced data classification, in which circles are minority class samples and triangles are majority class samples; the imbalance ratio is 3:1, that is, there are 3 times as many majority class samples as minority class samples. In real big data sets the imbalance ratio is often 10000:1 or even higher, so the data must be preprocessed before classification to obtain a good learning result.
Existing preprocessing methods for unbalanced big data mainly target binary classification, where the data set has only two classes, a majority class and a minority class. Preprocessing undersamples the majority class, oversamples the minority class, or does both, to reduce the imbalance ratio and thereby improve classification. Preprocessing of unbalanced big data for multi-classification algorithms currently lacks dedicated research. In multi-classification the data set has multiple classes, and the classification algorithm learns through training to assign each data item to one of them. The current approach reduces the multi-classification problem to binary classification problems: the classes in the data set are split into multiple binary data sets, which are processed pairwise.
Converting a multi-classification problem into multiple binary classification problems faces the following problems:
1. The sample set of a given class may be the minority class in one binary classification problem and the majority class in another, and this approach cannot handle it effectively. As shown in Figure 2, the circle sample set is the minority class when paired with the triangle sample set, but the majority class when paired with the cross sample set.
2. A given sample may play different roles in different binary classification problems. For example, it may be noise that should be deleted in one binary problem, yet an important boundary sample that should be retained in another; existing methods cannot handle it effectively. As shown in Figure 2, the triangle sample inside the circles is noise in the binary problem with the circle samples and should be deleted, but it is an important boundary sample in the binary problem with the cross samples and should be retained.
In short, if a multi-classification problem is treated as multiple binary classification problems, the processing of a sample cannot take into account its differing situations across the classes, and the accuracy of the multi-classification algorithm cannot be effectively improved.
Summary of the invention
In view of the above problems, the purpose of the present invention is to provide a multi-classification-oriented unbalanced data preprocessing method, device and equipment that can adapt to the needs of different multi-classification algorithms and effectively improve their accuracy.
An embodiment of the present invention provides a multi-classification-oriented unbalanced data preprocessing method, including the following steps:
reading an original sample set, wherein the original sample set includes sample sets of at least two classes;
receiving the final sample set size and the imbalance ratio between the sample sets, both input by the user, to calculate the ideal number of samples for each sample set in the final sample set;
determining, from the ideal number of samples and the actual number of samples of each sample set, whether that sample set is a minority class sample set or a majority class sample set;
for each sample in each minority class sample set, counting, among the sample's k nearest neighbors, the other-class samples and the samples belonging to that minority class sample set, so as to classify the sample as a noise sample, unstable sample, boundary sample or stable sample and label it accordingly; wherein other-class samples are samples in sample sets other than that minority class sample set;
for the samples in each minority class sample set, deleting, keeping, copying or synthesizing according to each sample's label, to obtain the final minority class sample set corresponding to each minority class sample set;
for each sample in each majority class sample set, counting, among the sample's k nearest neighbors, the samples belonging to that majority class sample set and the other-class samples, so as to classify the sample as a noise sample, boundary sample or stable sample and label it accordingly;
for the samples in each majority class sample set, deleting or keeping according to each sample's label, to obtain the final majority class sample set corresponding to each majority class sample set;
generating the final sample set from the final minority class sample sets and majority class sample sets, thereby completing the preprocessing of the unbalanced data.
Preferably, determining from the ideal number of samples and the actual number of samples of each sample set whether that sample set is a minority class sample set or a majority class sample set is specifically:
for each sample set, if its ideal number of samples is greater than its actual number of samples, the sample set is judged to be a minority class sample set; if its ideal number of samples is less than or equal to its actual number of samples, the sample set is judged to be a majority class sample set.
Preferably, counting, for each sample in each minority class sample set, the other-class samples and the samples belonging to that minority class sample set among the sample's k nearest neighbors, so as to classify the sample as a noise sample, unstable sample, boundary sample or stable sample and label it accordingly, specifically includes:
when the great majority of a minority class sample's k nearest neighbors are other-class samples, labeling the sample as a noise sample;
when most of a minority class sample's k nearest neighbors are other-class samples, labeling the sample as an unstable sample;
when, among a minority class sample's k nearest neighbors, the number of other-class samples is close to the number of samples from that minority class sample set, labeling the sample as a boundary sample;
when most of a minority class sample's k nearest neighbors belong to that minority class sample set, labeling the sample as a stable sample.
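The four-way labeling rule can be sketched as follows. The text gives only qualitative criteria ("great majority", "most", "close to"), so the numeric thresholds and the function name `label_minority_sample` are illustrative assumptions, not values taken from the patent.

```python
def label_minority_sample(n_other, k, noise_min=0.9, unstable_min=0.7, boundary_min=0.4):
    """Label a minority class sample from the share of other-class samples
    among its k nearest neighbors. All three thresholds are assumed values."""
    share = n_other / k
    if share >= noise_min:       # "great majority" of neighbors are other-class
        return "noise"
    if share >= unstable_min:    # "most" neighbors are other-class
        return "unstable"
    if share >= boundary_min:    # other-class and own-class counts are "close"
        return "boundary"
    return "stable"              # most neighbors belong to the sample's own class


for n_other in (10, 8, 5, 2):
    print(n_other, label_minority_sample(n_other, k=10))
```

A real implementation would tune these thresholds per data set; the only property the text fixes is the ordering noise, unstable, boundary, stable as the share of other-class neighbors decreases.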
Preferably, deleting, keeping, copying or synthesizing the samples in each minority class sample set according to each sample's label to obtain the final minority class sample set is specifically:
for the samples in each minority class sample set:
deleting all noise samples in the minority class sample set;
adding all unstable samples to the corresponding final minority class sample set;
copying each boundary sample $|c-1|$ times, and adding the boundary sample together with its copies to the corresponding final minority class sample set; where $c$ is the replication ratio, and c = (ideal number of samples of the minority class sample set - number of unstable samples) / (actual number of samples of the minority class sample set - number of noise samples - number of unstable samples);
for each stable sample, synthesizing $|c-1|$ new samples with surrounding samples, and adding the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; where each synthesis randomly selects, from the k nearest neighbors of the stable sample $x_i$, a sample $x_j$ belonging to the minority class sample set, and the newly synthesized sample is $x_i' = x_i + (x_i - x_j) \cdot a$, with $a$ a random number between 0 and 1;
calculating the number $d$ of samples of the minority class sample set that still need to be generated; where d = ideal number of samples of the minority class sample set - current number of samples in the final minority class sample set;
randomly selecting $d$ stable samples, synthesizing one new sample from each of them with surrounding samples, and adding the new samples to the corresponding final minority class sample set;
obtaining the final minority class sample set corresponding to each minority class sample set.
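The arithmetic of these steps can be illustrated with a minimal sketch. The ideal and actual counts (2000 and 200) match the worked example in the description; the noise and unstable counts (20 and 30) are hypothetical, and rounding $|c-1|$ to the nearest integer is an assumption, since the text does not say how a fractional value is handled. The function names are also illustrative. Note that `synthesize` follows the formula exactly as written in the text, $x_i + (x_i - x_j) \cdot a$; the widely used SMOTE technique instead interpolates toward the neighbor.

```python
import random


def replication_ratio(ideal, actual, n_noise, n_unstable):
    """c = (ideal - unstable) / (actual - noise - unstable), as defined in the text."""
    return (ideal - n_unstable) / (actual - n_noise - n_unstable)


def synthesize(xi, xj, rng=random):
    """New sample x_i' = x_i + (x_i - x_j) * a, with a drawn from [0, 1)."""
    a = rng.random()
    return [p + (p - q) * a for p, q in zip(xi, xj)]


# Hypothetical minority class: 200 samples, of which 20 noise and 30 unstable.
c = replication_ratio(ideal=2000, actual=200, n_noise=20, n_unstable=30)
copies = round(abs(c - 1))                     # assumed rounding rule
# Unstable samples are kept once; every boundary or stable sample contributes
# itself plus `copies` replicated or synthesized samples.
current = 30 + (200 - 20 - 30) * (1 + copies)
d = 2000 - current                             # samples still to be synthesized
print(copies, d)  # 12 20
```

Under these assumptions the rounding leaves a shortfall of 20 samples, which is what the final step (synthesizing one extra sample from each of d randomly chosen stable samples) makes up.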
Preferably, counting, for each sample in each majority class sample set, the samples belonging to that majority class sample set and the other-class samples among the sample's k nearest neighbors, so as to classify the sample as a noise sample, boundary sample or stable sample and label it accordingly, is specifically:
when the great majority of a majority class sample's k nearest neighbors are other-class samples, labeling the sample as a noise sample;
when, among a majority class sample's k nearest neighbors, the number of other-class samples is close to the number of samples from that majority class sample set, labeling the sample as a boundary sample;
when most of a majority class sample's k nearest neighbors belong to that majority class sample set, labeling the sample as a stable sample.
Preferably, deleting or keeping the samples in each majority class sample set according to each sample's label to obtain the final majority class sample set specifically includes:
for the samples in each majority class sample set:
deleting the noise samples;
keeping all boundary samples;
for the stable samples, performing a selective deletion operation until e stable samples have been deleted; where e = actual number of samples of the majority class sample set - number of noise samples - ideal number of samples of the majority class sample set;
obtaining the final majority class sample set corresponding to each majority class sample set.
Preferably, performing the selective deletion operation on the stable samples until e stable samples have been deleted is specifically:
repeating the following steps until the number f of deleted stable samples equals e:
for the currently selected stable sample, calculating the distances from it to its k nearest neighbors that belong to the majority class sample set;
calculating the probability of deleting the stable sample from the distances; the smaller the distance, the greater the deletion probability;
if the deletion probability is greater than or equal to 0.5, deleting the stable sample and updating the number f of deleted stable samples;
selecting the next stable sample.
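A minimal sketch of the selective deletion loop follows. The text states only that a smaller distance gives a larger deletion probability, so the mapping `1 / (1 + mean distance)`, the Euclidean metric, and the single pass over the stable samples (to guarantee termination when too few samples reach the 0.5 threshold) are all assumptions; the function names are illustrative.

```python
import math


def deletion_probability(mean_dist):
    """Assumed mapping: decreases with distance and equals 0.5 at distance 1."""
    return 1.0 / (1.0 + mean_dist)


def selective_delete(points, stable_idx, k, e):
    """Return the indices of deleted stable samples (at most e of them).

    points     -- all samples of one majority class, as coordinate tuples
    stable_idx -- indices into points of the samples labeled stable
    """
    alive = set(range(len(points)))
    deleted = []
    for i in stable_idx:
        if len(deleted) == e:          # f == e: enough samples removed
            break
        # Distances to the k nearest same-class samples that are still present.
        dists = sorted(math.dist(points[i], points[j]) for j in alive if j != i)[:k]
        if not dists:
            continue
        mean_dist = sum(dists) / len(dists)
        if deletion_probability(mean_dist) >= 0.5:
            alive.discard(i)
            deleted.append(i)
    return deleted


pts = [(0, 0), (0.5, 0), (0, 0.5), (10, 10)]
print(selective_delete(pts, stable_idx=[0, 1, 2, 3], k=2, e=1))  # [0]
```

The effect is that samples in dense regions of the class are thinned first, while isolated stable samples are kept; because the assumed probability depends on the absolute distance scale, features would need to be normalized before using it.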
An embodiment of the present invention further provides a multi-classification-oriented unbalanced data preprocessing device, including:
a data reading unit, configured to read an original sample set, wherein the original sample set includes sample sets of at least two classes;
a data receiving unit, configured to receive the final sample set size and the imbalance ratio between the sample sets, both input by the user, to calculate the ideal number of samples for each sample set in the final sample set;
a judging unit, configured to determine, from the ideal number of samples and the actual number of samples of each sample set, whether that sample set is a minority class sample set or a majority class sample set;
a minority class sample classification unit, configured to count, for each sample in each minority class sample set, the other-class samples and the samples belonging to that minority class sample set among the sample's k nearest neighbors, so as to classify the sample as a noise sample, unstable sample, boundary sample or stable sample and label it accordingly; wherein other-class samples are samples in sample sets other than that minority class sample set;
a minority class sample processing unit, configured to delete, keep, copy or synthesize the samples in each minority class sample set according to each sample's label, to obtain the final minority class sample set corresponding to each minority class sample set;
a majority class sample classification unit, configured to count, for each sample in each majority class sample set, the samples belonging to that majority class sample set and the other-class samples among the sample's k nearest neighbors, so as to classify the sample as a noise sample, boundary sample or stable sample and label it accordingly;
a majority class sample processing unit, configured to delete or keep the samples in each majority class sample set according to each sample's label, to obtain the final majority class sample set corresponding to each majority class sample set;
a final sample set generating unit, configured to generate the final sample set from the final minority class sample sets and majority class sample sets, thereby completing the preprocessing of the unbalanced data.
An embodiment of the present invention further provides multi-classification-oriented unbalanced data preprocessing equipment, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; when executing the computer program, the processor implements the multi-classification-oriented unbalanced data preprocessing method described above.
The embodiments of the present invention implement a multi-classification-oriented unbalanced data preprocessing method that combines oversampling and undersampling so that the newly generated final sample set meets the needs of the classification algorithm and effectively improves classification accuracy on unbalanced data. Specifically, the embodiments allow the user to input the required total number of samples and the desired imbalance ratio between the sample sets; the ideal number of samples of each sample set is calculated from them, and each sample set is determined to be a majority class or a minority class according to the number of samples in the final sample set, which solves the problem in traditional methods that one sample set may be both a majority class and a minority class in different binary classification problems. When the samples of each sample set are processed, the sample sets of all other classes are merged and treated as one; minority class samples are divided into noise, unstable, boundary and stable samples and processed separately, and majority class samples are divided into noise, boundary and stable samples and processed separately, which solves the problem in traditional methods that one sample may fall into different categories in different binary classification problems, causing conflicting treatment. In summary, by applying the most appropriate processing to each sample, the embodiments enable the final sample set to effectively improve the accuracy of multi-classification algorithms.
Description of drawings
To explain the technical solutions of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is an example diagram of unbalanced data classification;
Figure 2 is another example diagram of unbalanced data classification;
Figure 3 is a schematic flowchart of the multi-classification-oriented unbalanced data preprocessing method provided by the first embodiment of the present invention;
Figure 4 is a schematic diagram of the multi-classification-oriented unbalanced data preprocessing device provided by the second embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Figure 3, the first embodiment of the present invention provides a multi-classification-oriented unbalanced data preprocessing method, which can be executed by multi-classification-oriented unbalanced data preprocessing equipment (hereinafter, the device) and includes at least the following steps:
S101: read the original sample set, wherein the original sample set includes sample sets of at least two classes.
In this embodiment, the device first reads the original sample set, which includes sample sets of at least two classes.
S102: receive the final sample set size and the imbalance ratio between the sample sets, both input by the user, to calculate the ideal number of samples for each sample set in the final sample set.
In this embodiment, the device is any device with data processing capability, for example a personal computer, notebook, tablet, server or server cluster; the present invention does not specifically limit it.
In this embodiment, the device first receives the desired final sample set size $x$ and the desired imbalance ratio $a_1 : a_2 : \cdots : a_n$ between the sample sets (assuming there are $n$ sample sets), both input by the user. It then calculates the ideal number of samples of the $i$-th sample set in the final sample set as $x_i = x \cdot a_i / (a_1 + a_2 + \cdots + a_n)$.
For example, suppose the user specifies a final sample set of 20000 samples with 4 sample sets whose imbalance ratio is 4:3:2:1; the ideal numbers of samples of the sample sets are then:
$x_1 = 20000 \times 4 / (4+3+2+1) = 8000$;
$x_2 = 20000 \times 3 / (4+3+2+1) = 6000$;
$x_3 = 20000 \times 2 / (4+3+2+1) = 4000$;
$x_4 = 20000 \times 1 / (4+3+2+1) = 2000$.
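The calculation in this worked example can be reproduced with a short sketch. The function name `ideal_counts` is illustrative, and the use of integer (floor) division is an assumption; the text does not say how a ratio that does not divide the final size evenly should be rounded.

```python
def ideal_counts(final_size, ratio):
    """Split final_size across the sample sets in proportion to a1:a2:...:an."""
    total = sum(ratio)
    # Floor division is an assumed rounding rule for non-divisible cases.
    return [final_size * a // total for a in ratio]


print(ideal_counts(20000, [4, 3, 2, 1]))  # [8000, 6000, 4000, 2000]
```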
S103, according to the ideal sample count and the actual sample count of each sample set, determine whether the sample set is a minority class sample set or a majority class sample set.
In this embodiment, for each sample set xi, the device determines whether it is a minority class sample set or a majority class sample set as follows:
For each sample set, if its ideal sample count is greater than its actual sample count (that is, its count in the original sample set), the sample set is judged to be a minority class sample set; if its ideal sample count is less than or equal to its actual sample count, the sample set is judged to be a majority class sample set.
For example, if sample set x1 has an ideal sample count of 8000 and an actual sample count of 12000, then x1 is a majority class sample set.
As another example, if sample set x4 has an ideal sample count of 2000 and an actual sample count of 200, then x4 is a minority class sample set.
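The rule in S103 reduces to one comparison per sample set. A minimal sketch, using the two examples from the text; the function name `classify_sets` and the dictionary layout are illustrative assumptions.

```python
def classify_sets(ideal, actual):
    """Label each sample set 'minority' if ideal > actual, otherwise 'majority'."""
    return {
        name: "minority" if ideal[name] > actual[name] else "majority"
        for name in ideal
    }


ideal = {"x1": 8000, "x4": 2000}
actual = {"x1": 12000, "x4": 200}
print(classify_sets(ideal, actual))
```

Note that a sample set whose ideal and actual counts are equal is treated as a majority class sample set, as the text specifies.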
S104, for the samples in each minority class sample set, count, among the k nearest neighbors of each sample, the samples of other classes and the samples belonging to that minority class sample set, so as to classify each sample as a noise sample, an unstable sample, a boundary sample, or a stable sample and label it accordingly; here, other-class samples are the samples in all sample sets other than that minority class sample set.
In this embodiment, when a sample set is processed, the sample sets of all other classes are merged into a single sample set, called the other-class sample set, and the samples in it are called other-class samples. For example, when sample set x1 is processed, x2, x3, and x4 are merged to form the other-class sample set. In this embodiment:
When the overwhelming majority of the k nearest neighbors of a sample in the minority class sample set are other-class samples, the sample is labeled a noise sample.
In this embodiment, the value of k is supplied by the user, because different classification algorithms perform best at different values of k. The k nearest neighbors of a sample are defined as follows: if the sample is located at xi, they are the k neighbor samples closest to xi.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a minority class sample to k is greater than a first threshold (such as 80%), the sample is labeled a noise sample. The first threshold can of course be set as needed, for example to 75% or 85%; the present invention does not fix its value.
When most of the k nearest neighbors of a sample in the minority class sample set are other-class samples, the sample is labeled an unstable sample.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a minority class sample to k is less than the first threshold (such as 80%) and greater than a second threshold (such as 60%), the sample is labeled an unstable sample. The second threshold can of course be set as needed, for example to 55% or 65%; the present invention does not fix its value.
When, among the k nearest neighbors of a sample in the minority class sample set, the number of other-class samples is close to the number of samples belonging to the minority class sample set, the sample is labeled a boundary sample.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a minority class sample to k is less than the second threshold (such as 60%) and greater than a third threshold (such as 40%), the sample is labeled a boundary sample.
When most of the k nearest neighbors of a sample in the minority class sample set belong to that minority class sample set, the sample is labeled a stable sample.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a minority class sample to k is less than the third threshold (such as 40%), the sample is labeled a stable sample.
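The labeling rule of S104 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, Euclidean distance is assumed for the neighbor search, the data set is assumed to contain more than k samples, and a ratio exactly equal to a threshold is assigned to the lower category because the text leaves that case open.

```python
import math


def other_class_ratio(samples, labels, index, k):
    """Fraction of the k nearest neighbors of samples[index] that belong to another class."""
    target = samples[index]
    # Rank every other sample by Euclidean distance to the target (assumed metric).
    neighbors = sorted(
        (i for i in range(len(samples)) if i != index),
        key=lambda i: math.dist(target, samples[i]),
    )[:k]
    others = sum(1 for i in neighbors if labels[i] != labels[index])
    return others / k


def label_minority_sample(ratio, t1=0.8, t2=0.6, t3=0.4):
    """Map the other-class ratio to a label using the example thresholds 80%/60%/40%."""
    if ratio > t1:
        return "noise"
    if ratio > t2:
        return "unstable"
    if ratio > t3:
        return "boundary"
    return "stable"  # ties at a threshold fall to the lower category (assumption)


samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["a", "a", "b", "b", "b"]
ratio = other_class_ratio(samples, labels, index=0, k=2)
print(ratio, label_minority_sample(ratio))
```

Here the two nearest neighbors of the first sample are one sample of its own class and one of another class, so the ratio is 0.5 and the sample is labeled a boundary sample.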
S105, for the samples in each minority class sample set, delete, keep, copy, or synthesize according to the label of each sample, so as to obtain the final minority class sample set corresponding to each minority class sample set.
In this embodiment, the samples in each minority class sample set are handled according to their labels as follows:
1. Delete all noise samples in the minority class sample set;
2. Add all unstable samples to the final minority class sample set corresponding to the minority class sample set;
3. Copy each boundary sample |c-1| times, and add the boundary sample together with its copies to the corresponding final minority class sample set; here c is the replication ratio, and c=(ideal sample count of the minority class sample set - number of unstable samples)/(actual sample count of the minority class sample set - number of noise samples - number of unstable samples);
Boundary samples are very important to the classification learning algorithm, and synthesizing new samples from a boundary sample and other samples easily causes the samples to drift, so boundary samples are copied rather than synthesized.
4. For each stable sample, synthesize |c-1| new samples from it and its surrounding samples, and add the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; each synthesis randomly selects, from the k nearest neighbors of the stable sample xi, a sample xj that belongs to the minority class sample set, and the newly synthesized sample is xi'=xi+(xi-xj)*a, where a is a random number between 0 and 1;
5. Compute the number d of samples of the minority class sample set that still need to be generated, where d=ideal sample count of the minority class sample set - current number of samples of that minority class in the final minority class sample set;
6. Randomly select d stable samples, synthesize one new sample from each of them and its surrounding samples, and add the synthesized samples to the corresponding final minority class sample set.
With this processing, samples with different labels are handled separately, which improves the quality of the newly generated samples and in turn the performance of the classification learning algorithm, and it ensures that the number of samples in the resulting final minority class sample set equals the ideal sample count preset by the user.
In this embodiment, the above processing is applied to each minority class sample set in turn, yielding the corresponding final minority class sample sets.
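The arithmetic of S105 is easier to follow with numbers. The sketch below computes the replication ratio c, the number of extra samples per boundary or stable sample, and the remainder d, and implements the synthesis formula exactly as written in the text. The counts (20 noise, 30 unstable, 150 boundary plus stable) are hypothetical, the function names are illustrative, and rounding |c-1| to the nearest integer is an assumption, since the text does not say how a fractional value is handled.

```python
import random


def replication_ratio(ideal, actual, noise, unstable):
    """c = (ideal - unstable) / (actual - noise - unstable), as defined in the text."""
    return (ideal - unstable) / (actual - noise - unstable)


def synthesize(xi, xj, rng=random):
    """New sample xi' = xi + (xi - xj) * a, with a drawn uniformly from [0, 1)."""
    a = rng.random()
    return [p + (p - q) * a for p, q in zip(xi, xj)]


# Hypothetical minority class: ideal 2000, actual 200, of which 20 noise, 30 unstable.
ideal, actual, noise, unstable = 2000, 200, 20, 30
boundary_plus_stable = actual - noise - unstable       # 150 samples remain

c = replication_ratio(ideal, actual, noise, unstable)  # 1970 / 150
extra_per_sample = round(abs(c - 1))                   # rounding is an assumption

# Each boundary or stable sample contributes itself plus its copies or syntheses.
current = unstable + boundary_plus_stable * (1 + extra_per_sample)
d = ideal - current                                    # samples still to synthesize

print(extra_per_sample, current, d)
print(synthesize([1.0, 1.0], [0.0, 0.0]))
```

With these numbers each of the 150 boundary and stable samples yields 13 samples, giving 1980 together with the 30 unstable samples, so d = 20 further samples are synthesized to reach the ideal count of 2000. Note that the formula as written moves the new sample away from the neighbor xj; the widely used SMOTE technique instead interpolates toward the neighbor.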
S106, for the samples in each majority class sample set, count, among the k nearest neighbors of each sample, the samples belonging to that majority class sample set and the other-class samples, so as to classify each sample as a noise sample, a boundary sample, or a stable sample and label it accordingly.
Specifically, when the overwhelming majority of the k nearest neighbors of a sample in the majority class sample set are other-class samples, the sample is labeled a noise sample;
When, among the k nearest neighbors of a sample in the majority class sample set, the number of other-class samples is close to the number of samples belonging to the majority class sample set, the sample is labeled a boundary sample;
When most of the k nearest neighbors of a sample in the majority class sample set belong to that majority class sample set, the sample is labeled a stable sample.
The proportions corresponding to "overwhelming majority", "most", and "close" above can follow those used for minority class samples and are not repeated here.
S107, for the samples in each majority class sample set, delete or keep according to the label of each sample, so as to obtain the final majority class sample set.
Specifically, for the samples in each majority class sample set:
Delete the noise samples;
Keep all boundary samples;
Boundary samples are very important to the classification learning algorithm, so they are retained; that is, no boundary sample is deleted.
For the stable samples, perform selective deletion until e stable samples have been deleted, where e=actual sample count of the majority class sample set - number of noise samples - ideal sample count of the majority class sample set.
In this embodiment, stable samples are deleted selectively, which ensures that the number of samples in the final majority class sample set equals the ideal sample count.
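A short worked example of the deletion count e, reusing the majority class sample set x1 from the earlier example (ideal 8000, actual 12000). The figure of 500 noise samples and the function name are hypothetical.

```python
def stable_deletions_needed(actual, noise, ideal):
    """e = actual - noise - ideal: stable samples to delete from a majority class set."""
    return actual - noise - ideal


actual, noise, ideal = 12000, 500, 8000   # 500 noise samples is an assumed figure
e = stable_deletions_needed(actual, noise, ideal)
remaining = actual - noise - e            # samples left after both deletions

print(e, remaining)
```

After removing the 500 noise samples and e = 3500 stable samples, exactly 8000 samples remain. The sketch presumes that the set contains at least e stable samples; the text does not discuss the case where it does not.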
In one implementation, stable samples can be selectively deleted as follows:
Repeat the following steps until the number f of deleted stable samples equals e;
For the currently selected stable sample, compute the distances from it to its k nearest neighbors that belong to the majority class sample set;
In this embodiment, the distance is computed differently depending on the objects being classified; for example, if the objects are word vectors, the Euclidean distance between vectors can be used.
Compute the probability of deleting the stable sample from these distances, where a smaller distance gives a larger deletion probability;
If the deletion probability is greater than or equal to 0.5, delete the stable sample and update the number f of deleted stable samples (that is, set f=f+1);
Select the next stable sample.
In the above embodiment, during selective deletion, samples in more densely populated regions are more likely to be deleted, so that the remaining samples preserve the characteristics of the whole sample set as far as possible. Handling different kinds of samples separately in this way improves the quality of the samples after undersampling and in turn the performance of the classification learning algorithm.
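The selective deletion loop can be sketched as below. Several parts are assumptions, because the text only requires that a smaller distance give a larger deletion probability: the mapping exp(-distance / scale), the `scale` parameter, the use of the mean distance to the k nearest same-class neighbors, and the early exit when a full pass deletes nothing (added so the loop cannot run forever). All function names are hypothetical.

```python
import math


def mean_knn_distance(sample, others, k):
    """Mean Euclidean distance from sample to its k nearest same-class neighbors."""
    nearest = sorted(math.dist(sample, o) for o in others)[:k]
    return sum(nearest) / len(nearest)


def deletion_probability(distance, scale=1.0):
    """Assumed mapping: the probability falls from 1 toward 0 as distance grows."""
    return math.exp(-distance / scale)


def selective_delete(stable, other_same_class, e, k, scale=1.0):
    """Delete up to e stable samples, preferring those in dense regions."""
    kept = list(range(len(stable)))   # indices of stable samples still present
    deleted = 0                       # the counter f in the text
    progress = True
    while deleted < e and progress:   # stop if a full pass deletes nothing
        progress = False
        for i in list(kept):
            if deleted == e:
                break
            # Neighbors are the remaining samples of the same majority class.
            neighbors = [stable[j] for j in kept if j != i] + list(other_same_class)
            if not neighbors:
                break
            p = deletion_probability(mean_knn_distance(stable[i], neighbors, k), scale)
            if p >= 0.5:
                kept.remove(i)
                deleted += 1
                progress = True
    return [stable[i] for i in kept]


stable = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0)]
print(selective_delete(stable, other_same_class=[], e=2, k=1))
```

In this toy run two of the three tightly clustered samples are removed, while the isolated sample at (10, 10) survives, which matches the stated goal of thinning dense regions first.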
In this embodiment, the above processing is applied to each majority class sample set in turn, yielding the corresponding final majority class sample sets.
S108, generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
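Step S108 amounts to merging the per-class results. A minimal sketch, assuming each final sample set is held as a list keyed by its class label; the function name and the data layout are illustrative only.

```python
def build_final_sample_set(final_sets):
    """Concatenate per-class final sets into a list of (sample, class_label) pairs."""
    return [
        (sample, label)
        for label, samples in final_sets.items()
        for sample in samples
    ]


# Tiny hypothetical result: one majority class set and one minority class set.
final_sets = {"x1": [(0.0, 0.0), (1.0, 1.0)], "x4": [(5.0, 5.0)]}
final = build_final_sample_set(final_sets)
print(len(final), final)
```

If every per-class set has reached its ideal count, the length of the merged list equals the final sample set size x requested by the user.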
The embodiment of the present invention implements a multi-class-oriented imbalanced data preprocessing method that combines oversampling and undersampling, so that the newly generated final sample set meets the needs of the classification algorithm and the classification accuracy on imbalanced data is effectively improved. Specifically, the embodiment allows the user to enter the required total number of samples and the desired imbalance ratio between the sample sets, computes the ideal sample count of each sample set, and decides whether each sample set is a majority class or a minority class from its count in the final sample set. This solves the problem in traditional methods that one sample set may be a majority class and a minority class at the same time in different binary classification algorithms. When the samples of one sample set are processed, the sample sets of all other classes are merged and treated together; minority class samples are divided into noise, unstable, boundary, and stable samples and handled separately, and majority class samples are divided into noise, boundary, and stable samples and handled separately. This solves the problem in traditional methods that one sample may fall into different categories in different binary classification algorithms, which leads to conflicting treatment of that sample. In summary, by applying the most suitable processing to each sample, the embodiment of the present invention produces a final sample set that can effectively improve the accuracy of multi-class classification algorithms.
Referring to FIG. 4, the second embodiment of the present invention provides an imbalanced data preprocessing apparatus, including:
a data reading unit 10, configured to read an original sample set, wherein the original sample set includes sample sets of at least two classes;
a data receiving unit 20, configured to receive the final sample set size and the imbalance ratio between the sample sets entered by the user, so as to compute the ideal sample count of each sample set in the final sample set;
a judging unit 30, configured to determine, according to the ideal sample count and the actual sample count of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
a minority class sample classification unit 40, configured to count, for the samples in each minority class sample set, the other-class samples and the samples belonging to that minority class sample set among the k nearest neighbors of each sample, so as to classify each sample as a noise sample, an unstable sample, a boundary sample, or a stable sample and label it accordingly; here, other-class samples are the samples in all sample sets other than that minority class sample set;
a minority class sample processing unit 50, configured to delete, keep, copy, or synthesize the samples in each minority class sample set according to the label of each sample, so as to obtain the final minority class sample set corresponding to each minority class sample set;
a majority class sample classification unit 60, configured to count, for the samples in each majority class sample set, the samples belonging to that majority class sample set and the other-class samples among the k nearest neighbors of each sample, so as to classify each sample as a noise sample, a boundary sample, or a stable sample and label it accordingly;
a majority class sample processing unit 70, configured to delete or keep the samples in each majority class sample set according to the label of each sample, so as to obtain the final majority class sample set corresponding to each majority class sample set;
a final sample set generating unit 80, configured to generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
Preferably, the judging unit 30 is specifically configured to:
for each sample set, judge the sample set to be a minority class sample set if its ideal sample count is greater than its actual sample count, and judge it to be a majority class sample set if its ideal sample count is less than or equal to its actual sample count.
Preferably, the minority class sample classification unit 40 is specifically configured to:
label a sample in the minority class sample set as a noise sample when the overwhelming majority of its k nearest neighbors are other-class samples;
label the sample as an unstable sample when most of its k nearest neighbors are other-class samples;
label the sample as a boundary sample when, among its k nearest neighbors, the number of other-class samples is close to the number of samples belonging to the minority class sample set;
label the sample as a stable sample when most of its k nearest neighbors belong to the minority class sample set.
Preferably, the minority class sample processing unit 50 is specifically configured to:
for the samples in each minority class sample set:
delete all noise samples in the minority class sample set;
add all unstable samples to the corresponding final minority class sample set;
copy each boundary sample |c-1| times, and add the boundary sample together with its copies to the corresponding final minority class sample set; here c is the replication ratio, and c=(ideal sample count of the minority class sample set - number of unstable samples)/(actual sample count of the minority class sample set - number of noise samples - number of unstable samples);
for each stable sample, synthesize |c-1| new samples from it and its surrounding samples, and add the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; each synthesis randomly selects, from the k nearest neighbors of the stable sample xi, a sample xj that belongs to the minority class sample set, and the newly synthesized sample is xi'=xi+(xi-xj)*a, where a is a random number between 0 and 1;
compute the number d of samples of the minority class sample set that still need to be generated, where d=ideal sample count of the minority class sample set - current number of samples of that minority class in the final minority class sample set;
randomly select d stable samples, synthesize one new sample from each of them and its surrounding samples, and add the synthesized samples to the corresponding final minority class sample set;
obtain the final minority class sample set corresponding to each minority class sample set.
Preferably, the majority class sample classification unit 60 is specifically configured to:
label a sample in the majority class sample set as a noise sample when the overwhelming majority of its k nearest neighbors are other-class samples;
label the sample as a boundary sample when, among its k nearest neighbors, the number of other-class samples is close to the number of samples belonging to the majority class sample set;
label the sample as a stable sample when most of its k nearest neighbors belong to the majority class sample set.
Preferably, the majority class sample processing unit 70 is specifically configured to:
for the samples in each majority class sample set:
delete the noise samples;
keep all boundary samples;
for the stable samples, perform selective deletion until e stable samples have been deleted, where e=actual sample count of the majority class sample set - number of noise samples - ideal sample count of the majority class sample set;
obtain the final majority class sample set corresponding to each majority class sample set.
Preferably, performing selective deletion on the stable samples until e stable samples have been deleted specifically comprises:
repeating the following steps until the number f of deleted stable samples equals e;
for the currently selected stable sample, computing the distances from it to its k nearest neighbors that belong to the majority class sample set;
computing the probability of deleting the stable sample from these distances, where a smaller distance gives a larger deletion probability;
if the deletion probability is greater than or equal to 0.5, deleting the stable sample and updating the number f of deleted stable samples;
selecting the next stable sample.
The third embodiment of the present invention further provides an imbalanced data preprocessing device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps described above are implemented. Alternatively, when the processor executes the computer program, the functions of the modules in the apparatus embodiments above are implemented.
By way of example, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the multi-class-oriented imbalanced data preprocessing device.
The multi-class-oriented imbalanced data preprocessing device may be a computing device such as a desktop computer, notebook, palmtop computer, or cloud server. The device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the schematic diagram is merely an example of the multi-class-oriented imbalanced data preprocessing device and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, it may also include input and output devices, network access devices, buses, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the multi-class-oriented imbalanced data preprocessing device and connects the parts of the whole device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the multi-class-oriented imbalanced data preprocessing device by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the units integrated in the multi-class-oriented imbalanced data preprocessing device are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes in the methods of the above embodiments may also be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be expanded or narrowed as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, under legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本发明提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。It should be noted that the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physically separated. A unit can be located in one place, or it can be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationship between the modules indicates that they have a communication connection, which can be specifically implemented as one or more communication buses or signal lines. It can be understood and implemented by those skilled in the art without creative effort.
The above describes preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements are also regarded as falling within the scope of protection of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810599969.3A CN109033148A (en) | 2018-06-11 | 2018-06-11 | One kind is towards polytypic unbalanced data preprocess method, device and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810599969.3A CN109033148A (en) | 2018-06-11 | 2018-06-11 | One kind is towards polytypic unbalanced data preprocess method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033148A (en) | 2018-12-18 |
Family
ID=64612664
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810599969.3A Pending CN109033148A (en) | 2018-06-11 | 2018-06-11 | One kind is towards polytypic unbalanced data preprocess method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033148A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978009A (en) * | 2019-02-27 | 2019-07-05 | 广州杰赛科技股份有限公司 | Behavior classification method, device and storage medium based on wearable intelligent equipment |
CN110378352A (en) * | 2019-07-11 | 2019-10-25 | 河海大学 | The anti-interference two-dimensional filtering navigation data denoising method of high-precision in complicated underwater environment |
CN110378352B (en) * | 2019-07-11 | 2021-03-19 | 河海大学 | High-precision anti-jamming two-dimensional filtering navigation data denoising method in complex underwater environment |
CN112749719A (en) * | 2019-10-31 | 2021-05-04 | 北京沃东天骏信息技术有限公司 | Method and device for sample balanced classification |
CN112766394A (en) * | 2021-01-26 | 2021-05-07 | 维沃移动通信有限公司 | Modeling sample generation method and device |
CN112766394B (en) * | 2021-01-26 | 2024-03-12 | 维沃移动通信有限公司 | Modeling sample generation method and device |
CN113298148A (en) * | 2021-05-25 | 2021-08-24 | 南京邮电大学 | Ecological environment evaluation-oriented unbalanced data resampling method |
CN113298148B (en) * | 2021-05-25 | 2022-08-05 | 南京邮电大学 | An unbalanced data resampling method for ecological environment assessment |
Similar Documents
Publication | Title |
---|---|
CN111652267B (en) | Method, device, electronic device, and storage medium for generating an adversarial example |
WO2022141869A1 (en) | Model training method and apparatus, model calling method and apparatus, computer device, and storage medium |
US20210295114A1 (en) | Method and apparatus for extracting structured data from image, and device |
CN109033148A (en) | One kind is towards polytypic unbalanced data preprocess method, device and equipment |
CN112052331A (en) | A method and terminal for processing text information |
CN111753863A (en) | An image classification method, device, electronic device and storage medium |
CN112148767A (en) | Group mining method, abnormal group identification method and device and electronic equipment |
CN108694413A (en) | Adaptively sampled unbalanced data classification processing method, device, equipment and medium |
CN108647727A (en) | Unbalanced data classification lack sampling method, apparatus, equipment and medium |
CN110264311B (en) | Business promotion information accurate recommendation method and system based on deep learning |
CN108537270A (en) | Image labeling method, terminal device and storage medium based on multi-tag study |
CN114168824A (en) | Method, system, device and medium for separating hot and cold data based on machine learning |
US10504002B2 (en) | Systems and methods for clustering of near-duplicate images in very large image collections |
CN115167913B (en) | Layering method of operating system, computing equipment and storage medium |
CN108647728B (en) | Imbalanced data classification oversampling method, apparatus, equipment and medium |
CN114860667B (en) | File classification method, device, electronic equipment and computer readable storage medium |
CN105790967A (en) | Weblog processing method and device |
CN114416986A (en) | Text data cleaning method and device and storage medium |
CN113515593A (en) | Topic detection method and device based on clustering model and computer equipment |
CN111143560B (en) | Short text classification method, terminal equipment and storage medium |
CN117407875A (en) | Malicious code classification method and system and electronic equipment |
CN111091198A (en) | Data processing method and device |
CN114254599B (en) | Table merging method, processing chip and electronic device |
CN116776173A (en) | A desensitization method for power measurement data based on convolutional neural network |
CN112633394B (en) | Intelligent user label determination method, terminal equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |