CN108647728A - Oversampling method, device, equipment and medium for imbalanced data classification - Google Patents


Info

Publication number
CN108647728A
CN108647728A (application CN201810453104.6A)
Authority
CN
China
Prior art keywords
sample
samples
small number
classification
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810453104.6A
Other languages
Chinese (zh)
Other versions
CN108647728B (en)
Inventor
韩伟红
李树栋
王乐
方滨兴
贾焰
黄子中
周斌
殷丽华
田志宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN201810453104.6A
Publication of CN108647728A
Application granted
Publication of CN108647728B
Legal status: Active
Anticipated expiration

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model, based on distances to training or reference patterns
    • G06F 18/24133: Distances to prototypes
    • G06F 18/24143: Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06Q: Information and Communication Technology [ICT] specially adapted for administrative, commercial, financial, managerial or supervisory purposes
    • G06Q 30/00: Commerce
    • G06Q 30/06: Buying, selling or leasing transactions
    • G06Q 30/0601: Electronic shopping [e-shopping]
    • G06Q 30/0609: Buyer or seller confidence or verification

Abstract

The invention discloses an oversampling method for imbalanced data classification, comprising: obtaining all minority samples in the imbalanced data to be processed; obtaining, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample; determining the category of the corresponding minority sample according to the number of majority samples; and performing, according to the category of each minority sample, the operation corresponding to that category. This increases the diversity of the minority samples, avoids the low precision of classification learning algorithms caused by the scarcity of minority-class samples, and solves the minority-sample shortage problem.

Description

Oversampling method, device, equipment and medium for imbalanced data classification
Technical field
The present invention relates to the field of imbalanced big data processing, and in particular to an oversampling method, device, equipment and medium for imbalanced data classification.
Background technology
With continuous advances in technology, including rising Internet speeds, the evolution of the mobile Internet and steady progress in hardware, data acquisition, storage and processing technologies have all developed significantly, and data are growing at an unprecedented rate: we have entered the big data era. Big data is characterized by huge scale (volume), high generation speed (velocity), diverse forms (variety) and data uncertainty (veracity), so traditional data analysis and mining techniques face unprecedented challenges when applied to the big data field.
Data classification is a fundamental algorithm in data analysis and mining; it has a wide range of application fields and underpins many other analysis and mining algorithms. In big data, almost every dataset is imbalanced: an imbalanced dataset is one in which at least one class contains far fewer samples than the other classes. Class imbalance is widespread in the real world, especially in big data applications. In Internet text classification, for example, the data of each class are imbalanced, and the classes of most interest are often the minority ones, such as sensitive information on the network or emerging topics. In electronic commerce, the vast majority of customer transaction and behavior data are normal, yet what matters is e-commerce fraud and abnormal behavior; these data are submerged in a large volume of normal behavior data and form a typical imbalanced dataset. Similar applications include medical diagnosis and satellite remote-sensing data classification. Imbalanced big data classification is therefore a key technical problem that urgently needs to be solved in economic and social development, with broad application prospects.
Because the numbers of samples in different classes differ so greatly, imbalanced big data makes it difficult for traditional classification learning algorithms to achieve good classification results. Fig. 1 shows a prior-art example of imbalanced data classification, in which circles are minority-class samples and triangles are majority-class samples; the imbalance ratio is 3:1, i.e. there are three majority-class samples for every minority-class sample. In real large datasets the imbalance ratio is often 10000:1 or even higher, so the data must first be preprocessed before classification.
Existing preprocessing methods for imbalanced big data mainly comprise oversampling of the minority class and undersampling of the majority class. Oversampling increases the number of minority-class samples by some method or technique, reducing the imbalance of the dataset by adjusting the sample set and thereby increasing the accuracy of the classification algorithm.
Random oversampling performs random sampling of the minority class on the raw dataset D: minority-class samples are randomly selected and replicated to obtain an additional dataset E, and D and E are then merged into a nearly balanced dataset D'. The size of E can be controlled freely, so D' can reach any desired imbalance ratio. In Fig. 2, the circled samples are the minority-class samples chosen for replication by random oversampling.
Heuristic oversampling also replicates minority-class samples and does not create new ones. The difference is that the samples to replicate are chosen selectively rather than at random: samples on the classifier boundary are replicated, increasing their weight in the classifier. Fig. 3 shows the copies selected by boundary-sample oversampling.
When implementing the embodiments of the present invention, the inventors found the following technical problems in the prior art. Because random oversampling picks samples at random, the replicated samples are often of low quality, e.g. noise samples, which lowers the performance of the classification learning algorithm. Heuristic oversampling selects the samples to replicate according to a rule, but it still merely repeats existing minority-class samples; this kind of sampling adds no information and may cause over-fitting during classification learning, i.e. the algorithm learns the training samples so thoroughly that its classification of the training set is nearly ideal while its performance on the test set declines instead, which is usually caused by too few training samples. Although random and heuristic oversampling increase the number of minority-class samples, they only replicate existing samples, so the low precision of classification learning caused by the scarcity of minority-class samples in imbalanced big data classification remains, and the shortage of minority-class samples is not fundamentally solved.
Summary of the invention
In view of the above problems, the purpose of the present invention is to provide an oversampling method for imbalanced data classification.
In a first aspect, the present invention provides an oversampling method for imbalanced data classification, comprising:
obtaining all minority samples in the imbalanced data to be processed;
obtaining, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample;
determining the category of the corresponding minority sample according to the number of majority samples;
performing, according to the category of each minority sample, the operation corresponding to that category.
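The counting step above can be sketched in plain Python. This is an illustrative sketch only: the claim fixes neither a distance metric nor a data representation, so Euclidean distance on coordinate tuples is an assumption here.

```python
import math

def majority_neighbor_counts(minority, majority, k):
    """For each minority sample, count the majority samples among its k
    nearest neighbors in the combined dataset. Euclidean distance is an
    assumption; the patent only names the k-nearest-neighbor algorithm."""
    labeled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    counts = []
    for x in minority:
        # Distance from x to every other sample, tagged 1 if majority.
        others = sorted(
            (math.dist(x, p), is_maj) for p, is_maj in labeled if p is not x
        )
        counts.append(sum(is_maj for _, is_maj in others[:k]))
    return counts
```

A minority point deep inside the majority cluster receives a high majority-neighbor count, which the later implementations use to place it in a category.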
In a first possible implementation of the first aspect, determining the category of the corresponding minority sample according to the number of majority samples comprises:
comparing the number of majority samples with preset thresholds to determine the category of the corresponding minority sample, wherein the categories include noise sample, boundary sample, unstable sample and stable sample.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the preset thresholds include a preset first threshold n, a preset second threshold p and a preset third threshold q,
and comparing the number of majority samples with the preset thresholds to determine the category of the corresponding minority sample comprises:
when the number of majority samples is greater than or equal to the preset first threshold n, the category of the corresponding minority sample is the noise sample, wherein the value range of the preset first threshold n is 2k/3 <= n <= k;
when the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the category of the corresponding minority sample is the unstable sample, wherein the value range of the preset second threshold p is k/2 <= p < n;
when the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the category of the corresponding minority sample is the boundary sample, wherein the value range of the preset third threshold q is k/3 <= q < p;
when the number of majority samples is less than the preset third threshold q, the category of the corresponding minority sample is the stable sample.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing, according to the category of each minority sample, the operation corresponding to that category comprises:
when the category of the corresponding minority sample is the noise sample, deleting the minority sample;
when the category of the corresponding minority sample is the unstable sample, retaining the minority sample;
when the category of the corresponding minority sample is the boundary sample, replicating the minority sample;
when the category of the corresponding minority sample is the stable sample, synthesizing from the minority sample.
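Taken together, the four per-category operations amount to a dispatch over the minority set. This sketch takes the replication count and a synthesis routine as parameters, since both are only defined in the later implementations; how they are computed is covered there.

```python
def apply_operations(samples, categories, h, synthesize):
    """Delete noise, keep unstable, add h copies of each boundary sample,
    and add synthesized samples for each stable sample (illustrative)."""
    out = []
    for x, cat in zip(samples, categories):
        if cat == "noise":
            continue          # noise samples are deleted
        out.append(x)         # every non-noise sample is retained
        if cat == "boundary":
            out.extend([x] * h)
        elif cat == "stable":
            out.extend(synthesize(x, h))
    return out
```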
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, replicating the minority sample when the category of the corresponding minority sample is the boundary sample comprises:
traversing and examining each minority sample among all the minority samples and obtaining an increase count h, wherein h = |(target minority sample count - unstable sample count) / (count of all minority samples - noise sample count - unstable sample count) - 1|;
replicating the minority sample according to the increase count h.
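Read literally, the increase count h is a single ratio over the whole minority set. A worked example of the formula follows; truncating h to an integer copy count is an assumption, since the text does not say how fractional values are handled.

```python
def increase_count(target, total, noise, unstable):
    """h = |(target minority count - unstable count)
           / (total minority count - noise count - unstable count) - 1|."""
    h = abs((target - unstable) / (total - noise - unstable) - 1)
    return int(h)  # assumed: truncate to a whole number of copies
```

For example, with a target of 100 minority samples, 40 existing minority samples, 5 of them noise and 5 unstable: (100 - 5) / (40 - 5 - 5) - 1 = 95/30 - 1, so h = 2 after truncation.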
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, synthesizing from the minority sample when the category of the corresponding minority sample is the stable sample comprises:
traversing and examining each minority sample among all the minority samples and obtaining the increase count h, wherein h = |(target minority sample count - unstable sample count) / (count of all minority samples - noise sample count - unstable sample count) - 1|;
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is less than or equal to a preset value, obtaining the rank of each minority sample j_i among the k minority-class samples nearest the stable sample, wherein the ranks are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbors of each minority sample j_i, and 1 < i <= k;
obtaining the selection probability, wherein the selection probability is the cube of a random number between 0 and 1 multiplied by the rank of each minority sample j_i, and 1 < i <= k;
randomly selecting a minority sample j_i according to the selection probability, to obtain the selected minority sample j_i;
synthesizing the selected minority sample j_i with the stable sample to obtain a new sample, wherein new sample = stable sample + (stable sample - selected minority sample j_i) * a, and a is a generated random number between 0 and 1.
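A sketch of the single-neighbor synthesis step. Two details are assumptions not fixed by the text: the weighted pick is implemented by taking the neighbor whose rank, scaled by an independent cubed random number, is largest, and the arithmetic is done coordinate-wise on tuples. The extrapolation formula itself, new = stable + (stable - selected) * a, is taken directly from the text.

```python
import random

def synthesize_one(stable, neighbors, ranks, rng=random):
    """Pick one of the k nearest minority neighbors with weight
    (random in [0,1])**3 * rank, then extrapolate from the stable sample:
    new = stable + (stable - selected) * a, with a uniform in [0, 1]."""
    weights = [rng.random() ** 3 * r for r in ranks]
    selected = max(zip(weights, neighbors))[1]  # assumed weighted pick
    a = rng.random()
    return tuple(s + (s - c) * a for s, c in zip(stable, selected))
```

Note the sign: unlike SMOTE-style interpolation toward a neighbor, this formula moves away from the selected neighbor, so the new sample lies on the far side of the stable sample.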
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, synthesizing from the minority sample when the category of the corresponding minority sample is the stable sample further comprises:
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtaining the rank of each minority sample x_n among the k minority-class samples nearest the stable sample, wherein the ranks are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbors of each minority sample x_n, and 1 < n <= k;
obtaining the selection probability, wherein the selection probability is the cube of a random number between 0 and 1 multiplied by the rank of each minority sample x_n, and 1 < n <= k;
randomly selecting s minority samples x_nj according to the selection probability, wherein 1 < s <= k and 1 < j <= s;
synthesizing each minority sample x_nj with the stable sample to obtain a new sample according to a synthesis formula,
wherein a_n is a random number between 0 and 1 generated for each selected sample, x_i' is the new sample, x_i is the stable sample, and 1 < s <= k.
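The multi-sample synthesis formula appears only as an image in the original patent and does not survive in this text, so the code below is not the patented formula. It is offered purely as a hypothetical stand-in that averages the single-neighbor rule of the fifth implementation over the s chosen samples, reducing to that rule when s = 1, and using only the variables the text does define (a_n, x_i, x_i').

```python
def synthesize_multi(stable, chosen, rand):
    """HYPOTHETICAL stand-in for the omitted formula: average the
    single-neighbor extrapolation stable + a_n * (stable - x_nj)
    over the s chosen samples. Not taken from the source."""
    s = len(chosen)
    return tuple(
        x + sum(a * (x - c[d]) for a, c in zip(rand, chosen)) / s
        for d, x in enumerate(stable)
    )
```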
In a second aspect, the present invention further provides an oversampling device for imbalanced data classification, comprising:
a minority sample acquisition module for obtaining all minority samples in the imbalanced data to be processed;
a majority sample count acquisition module for obtaining, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample;
a category determination module for determining the category of the corresponding minority sample according to the number of majority samples;
an operation module for performing, according to the category of each minority sample, the operation corresponding to that category.
In a third aspect, an embodiment of the present invention further provides oversampling equipment for imbalanced data classification, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the oversampling method for imbalanced data classification described in any of the above when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein when the computer program runs, the equipment on which the computer-readable storage medium resides is controlled to execute the oversampling method for imbalanced data classification described in any of the above.
The above technical solution has the following advantages. The number of majority samples among the k nearest neighbors of each minority sample is obtained with the k-nearest-neighbor algorithm; the category of the corresponding minority sample is determined from that number; and the operation corresponding to the category of each minority sample is performed. When addressing the low precision of classification learning caused by scarce minority-class samples in imbalanced big data classification, this avoids applying the same treatment to all minority samples, i.e. only replicating samples or only synthesizing new ones. By dividing the minority samples of the data to be processed into categories and performing different operations on different categories, the differentiated treatment increases the diversity of the minority samples, avoids the low precision of classification learning caused by minority-class scarcity, and solves the minority-sample shortage problem.
Description of the drawings
Fig. 1 is an example diagram of imbalanced data classification in the prior art;
Fig. 2 is an example diagram of random oversampling in the prior art;
Fig. 3 is an example diagram of boundary-sample oversampling in the prior art;
Fig. 4 is a flow diagram of the oversampling method for imbalanced data classification provided by the first embodiment of the invention;
Fig. 5 is a schematic diagram of obtaining the k nearest samples, provided by the first embodiment of the invention;
Fig. 6 is an example diagram of a synthesis method in the prior art;
Fig. 7 is a structural schematic diagram of an oversampling device for imbalanced data classification provided by the fifth embodiment of the invention;
Fig. 8 is a structural schematic diagram of oversampling equipment for imbalanced data classification provided by the sixth embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 4, the flow diagram of the oversampling method for imbalanced data classification provided by the first embodiment of the invention.
It should be noted that when generating minority-class samples, existing methods apply the same treatment to all of them, either replicating samples or synthesizing new ones; they all handle the imbalanced data in a single way. Such simple repetition of minority samples yields newly added samples of low quality and little value, and easily causes over-fitting during classification learning, i.e. the algorithm learns the training samples so thoroughly that its classification of the training set is nearly ideal while its performance on the test set declines instead. The low precision of classification learning caused by the scarcity of minority-class samples in imbalanced big data classification remains, and the shortage of minority-class samples cannot be fundamentally solved.
The oversampling method for imbalanced data classification provided by this embodiment can be executed by a terminal device, including but not limited to a mobile phone, a laptop, a tablet computer, a desktop computer, and the like.
The oversampling method for imbalanced data classification is as follows.
S11: obtain all minority samples in the imbalanced data to be processed.
It should be noted that in the embodiments of the present invention, when processing the minority samples in the imbalanced data, and given that in real large datasets the imbalance ratio between majority and minority samples is often 10000:1 or even higher, all minority samples in the data to be processed are obtained first in order to improve the quality of the newly added minority samples.
S12: obtain, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample.
It should be noted that k is an integer greater than 1, determined according to the actual situation; the present invention does not specifically limit it. The setting of k affects the performance of this method: as k increases, the performance tends to decline, while too small a k lowers the accuracy. In general, a value of k between 5 and 10 is reasonable; the present invention does not specifically limit this.
Specifically, referring to Fig. 5, the triangles in the figure are majority samples and the circles are minority samples, with the minority sample M circled by the rectangle used as the illustration. Assuming k is 4, the 4 nearest neighbors of the minority sample M are circled, and the number of majority samples among those four samples is 2.
S13: determine the category of the corresponding minority sample according to the number of majority samples.
In the embodiments of the present invention, the category of the corresponding minority sample is determined according to the number of majority samples, wherein the categories include noise sample, boundary sample, unstable sample and stable sample.
It should be noted that determining the category of a minority sample in this embodiment actually determines the property of that sample within the minority sample set of the imbalanced data to be processed, so that the corresponding operation can be performed on it as needed, ensuring that the processed data finally achieve the desired effect.
It should be noted that a minority sample that interferes is a noise sample; for example, in one situation the overwhelming majority of its neighbor samples are majority samples, i.e. the majority samples far outnumber the minority samples, and the minority sample is then the noise sample. A minority sample lying between the minority cluster and the majority cluster is a boundary sample; for example, in one situation the numbers of majority and minority samples among its neighbors are comparable, and the minority sample is then the boundary sample. A minority sample that belongs to the minority class but is unstable is an unstable sample; for example, in one situation the majority samples among its neighbors outnumber the minority samples, and the minority sample is then the unstable sample. A minority sample lying entirely within the minority cluster is a stable sample; for example, in one situation the majority samples among its neighbors are far fewer than the minority samples, i.e. the overwhelming majority of its neighbors are minority samples, and the minority sample is then the stable sample.
S14: perform, according to the category of each minority sample, the operation corresponding to that category.
It should be noted that in the embodiments of the present invention, the minority samples in the imbalanced data to be processed are oversampled to increase their diversity; during oversampling, a different operation is performed according to the category of each minority sample, wherein the operations include deletion, retention, replication and synthesis.
It should be noted that each minority sample corresponds to one and only one operation, i.e. each category has one and only one corresponding operation. Assuming the categories are b1, b2, b3 and b4, each of b1, b2, b3 and b4 has one corresponding operation; for example, b1 corresponds to retention, b2 to deletion, b3 also to deletion and b4 to synthesis. The present invention does not specifically limit this.
Specifically, all minority samples in the imbalanced data to be processed are obtained as a minority sample set A = [a1, a2, ..., an], where n is the number of all minority samples. Assuming the minority sample a1 is a noise sample, a1 needs to be deleted; however, if some preprocessing stipulates that noise samples are retained, then a1 is retained. Assuming the minority sample an is an unstable sample and some preprocessing stipulates that unstable samples are deleted, then an is deleted. The present invention does not specifically limit this.
Implementing this embodiment has the following beneficial effects:
By obtaining all minority samples in the imbalanced data to be processed, obtaining with the k-nearest-neighbor algorithm the number of majority samples among the k nearest neighbors of each minority sample, determining the category of the corresponding minority sample from that number, and performing the operation corresponding to the category of each minority sample, this embodiment avoids applying the same treatment to all minority samples, i.e. only replicating samples or only synthesizing new ones, which yields newly added minority samples of low quality. Different categories of minority samples undergo different operations, which increases the diversity of the processing and hence the diversity of the minority samples, improves the quality of the newly added minority samples, avoids the low precision of classification learning caused by the scarcity of minority-class samples, and solves the minority-sample shortage problem.
Embodiment two
On the basis of embodiment one,
determining the category of the corresponding minority sample according to the number of majority samples comprises:
comparing the number of majority samples with preset thresholds to determine the category of the corresponding minority sample, wherein the categories include noise sample, boundary sample, unstable sample and stable sample.
In embodiments of the present invention, the predetermined threshold value is to be set according to actual conditions.Specifically, the default threshold Value includes preset first threshold value n, presets second threshold p and default third threshold value q,
Comparing the number of majority samples with the predetermined thresholds to determine the class of the corresponding minority sample then includes:
When the number of majority samples is greater than or equal to the preset first threshold n, the class of the corresponding minority sample is the noise sample, where the preset first threshold n has a value range of 2k/3 <= n <= k.
In this embodiment, the preset first threshold n is the threshold for judging whether a minority sample is a noise sample. Its value range 2k/3 <= n <= k is a preferred range of this embodiment of the present invention, obtained through extensive experiments as a reasonable range for identifying noise samples.
When the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the class of the corresponding minority sample is the unstable sample, where the preset second threshold p has a value range of k/2 <= p < n.
In this embodiment, the preset second threshold p is the threshold for judging whether a minority sample is an unstable sample. Its value range k/2 <= p < n is a preferred range of this embodiment, obtained through extensive experiments as a reasonable range for identifying unstable samples.
When the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the class of the corresponding minority sample is the boundary sample, where the preset third threshold q has a value range of k/3 <= q < p.
In this embodiment, the preset third threshold q is the threshold for judging whether a minority sample is a boundary sample. Its value range k/3 <= q < p is a preferred range of this embodiment, obtained through extensive experiments as a reasonable range for identifying boundary samples.
When the number of majority samples is less than the preset third threshold q, the class of the corresponding minority sample is the stable sample.
It should be noted that the preset first threshold n, the preset second threshold p and the preset third threshold q are thresholds, obtained through extensive experiments, for reasonably distinguishing the different classes of samples. The specific values of n, p and q can be set as required, and the present invention does not specifically limit them.
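The threshold comparison described above can be sketched in a few lines. This is an illustrative outline only: the function name is invented here, and the defaults simply take the lower ends of the stated ranges 2k/3, k/2 and k/3 when no thresholds are supplied.

```python
def classify_minority_sample(majority_count, k, n=None, p=None, q=None):
    """Assign a category to a minority sample from the number of majority
    samples among its k nearest neighbours, using thresholds n > p > q."""
    # Defaults: the lower ends of the ranges given in the specification,
    # 2k/3 <= n <= k,  k/2 <= p < n,  k/3 <= q < p.
    if n is None:
        n = 2 * k / 3
    if p is None:
        p = k / 2
    if q is None:
        q = k / 3
    assert 2 * k / 3 <= n <= k and k / 2 <= p < n and k / 3 <= q < p
    if majority_count >= n:
        return "noise"      # almost surrounded by the majority class
    if majority_count >= p:
        return "unstable"
    if majority_count >= q:
        return "boundary"
    return "stable"         # fewer than q majority neighbours
```

For example, with k = 9 the defaults become n = 6, p = 4.5, q = 3, so a minority sample with 7 majority neighbours is classed as noise and one with 3 as a boundary sample.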
Performing, according to the class of each minority sample, the operation corresponding to that class then includes:
When the class of the corresponding minority sample is the noise sample, the minority sample is deleted.
It should be noted that deleting the noise samples improves the quality of the newly added samples and reduces the noise they would otherwise introduce into subsequent data processing.
When the class of the corresponding minority sample is the unstable sample, the minority sample is retained.
It should be noted that unstable samples are retained rather than deleted in order to increase the diversity of the minority samples, so that the minority samples better reflect the true data distribution.
When the class of the corresponding minority sample is the boundary sample, the minority sample is replicated.
It should be noted that boundary samples lie on the class boundary of the imbalanced big data to be processed. They are more valuable, as they best embody the distinguishing characteristics between the majority class and the minority class; therefore the minority-class samples at the class boundary are selected for processing, i.e. the samples on the classifier boundary are replicated, enhancing their weight in the imbalanced big data to be processed.
When the class of the corresponding minority sample is the stable sample, new samples are synthesized from the minority sample.
It should be noted that, in order to increase the number of minority samples, synthesizing from the stable samples can solve the problem of over-fitting.
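The four per-class operations above (delete noise, retain unstable, replicate boundary, synthesize from stable) can be sketched as a single dispatch routine. This is an illustrative outline only: the replication count and the synthesis routine are supplied by the caller rather than being the full procedures of the later embodiments.

```python
def process_minority_samples(labelled, synthesize, copies=1):
    """labelled: list of (sample, category) pairs; synthesize: callable that
    produces one new sample from a stable one; copies: how many replicas a
    boundary sample receives (the increase count h in the later embodiment)."""
    out = []
    for sample, category in labelled:
        if category == "noise":
            continue                         # noise samples are deleted
        out.append(sample)                   # all other categories keep the original
        if category == "boundary":
            out.extend([sample] * copies)    # boundary samples are replicated
        elif category == "stable":
            out.append(synthesize(sample))   # stable samples spawn a synthetic sample
    return out
```

For example, feeding one sample of each category through the routine drops the noise sample, keeps the unstable one, duplicates the boundary one and appends a synthetic neighbour of the stable one.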
Implementing this embodiment has the following beneficial effects:
After the minority samples have been accurately classified, the number of majority samples among the k nearest neighbours of each minority sample is compared with the predetermined thresholds, which are set as different conditions for judging the different classes of minority samples; different processing is then applied to each class of minority sample, effectively improving the classification accuracy in imbalanced big data.
Embodiment three
On the basis of embodiment one and embodiment two,
When the class of the corresponding minority sample is the boundary sample, replicating the minority sample includes:
traversing each minority sample among all the minority samples to obtain an increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|;
replicating the minority sample h times.
Specifically, after each minority sample among all the minority samples has been traversed, the noise samples deleted and the unstable samples retained, what remains are the as-yet unprocessed boundary samples and stable samples, i.e. the minority-sample data set with the noise samples removed and the unstable samples (which cannot be used for synthesis or replication) set aside. It is first necessary to compute how many samples still need to be added, i.e. the increase count h defined above. Here, the target minority sample count is the number of minority samples finally desired after the imbalanced data classification over-sampling has been applied to the pending imbalanced data; the unstable sample count is the number of unstable samples found after traversing all the minority samples; the count of all minority samples is the number of minority samples initially obtained from the pending imbalanced data; and the noise sample count is the number of noise samples found after traversing all the minority samples. Suppose the target minority sample count is 20000, the count of all minority samples is 5000, the noise sample count is 500 and the unstable sample count is 500; then h = |(20000 − 500)/(5000 − 500 − 500) − 1| = |4.875 − 1| = 3.875, rounded down to 3. When a minority sample is a boundary sample, it is replicated h times: for example, for a minority sample c with an increase count h of 3, replication yields 4 copies of the minority sample c (the original plus 3 replicas).
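The increase count h can be computed directly from the four sample counts. The sketch below reproduces the worked example above, rounding the result down as the text does; the function name is illustrative.

```python
import math

def increase_count(target, total_minority, noise, unstable):
    """h = |(target - unstable) / (total_minority - noise - unstable) - 1|,
    rounded down, as in the worked example of the specification."""
    h = abs((target - unstable) / (total_minority - noise - unstable) - 1)
    return math.floor(h)
```

With the numbers from the text (target 20000, 5000 minority samples, 500 noise, 500 unstable) this gives floor(|4.875 − 1|) = 3.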
Implementing this embodiment has the following beneficial effects:
The boundary samples in the pending imbalanced data are replicated. A boundary sample is a sample lying on the classifier boundary; it is more valuable, as it best embodies the distinguishing characteristics between the majority class and the minority class. Therefore the minority-class samples at the class boundary are selected for processing, i.e. the samples on the classifier boundary are replicated, enhancing their weight in the classifier and thereby improving the classification accuracy for the minority samples in imbalanced big data.
Embodiment Four
It should be noted that when the prior art synthesizes a new sample for a minority sample x using Euclidean distance, its k minority-class neighbours are, say, x1, x2, x3 and x4. If one of these 4 minority-class neighbours is selected at random, each has the same probability of being chosen; yet, as shown in Fig. 6, x3 lies among the majority-class samples and is most likely noise. If x3 happens to be chosen, the newly synthesized sample is likely to be noise as well, which not only fails to strengthen the minority class but also introduces more noise.
In this embodiment, by contrast, all the minority samples are candidates for synthesis. When the class of the corresponding minority sample is the stable sample, synthesizing from the minority sample includes:
traversing each minority sample among all the minority samples to obtain the increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|;
obtaining the average distance d from the stable sample to its k nearest minority-class samples.
Specifically, suppose a minority sample e is a stable sample and k is 4; the average distance from the 4 nearest-neighbour samples of the stable sample to it is then obtained. If the 4 nearest neighbours of the stable sample are o1, o2, o3 and o4, and the distances from o1, o2, o3 and o4 to the stable sample (i.e. the minority sample e) are respectively 10, 20, 30 and 20, then the average distance is (10 + 20 + 30 + 20)/4 = 20, where the distances to the stable sample (the minority sample e) are Euclidean distances.
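The average distance d of the worked example can be computed as follows. This is a sketch assuming plain Euclidean distance and that the sample itself is not contained in the candidate neighbour list; the function name is illustrative.

```python
import math

def avg_knn_distance(sample, minority_samples, k):
    """Mean Euclidean distance from `sample` to its k nearest minority-class
    neighbours (the average distance d used to pick the synthesis strategy)."""
    dists = sorted(math.dist(sample, other) for other in minority_samples)
    return sum(dists[:k]) / k
```

With one-dimensional points at distances 10, 20, 30, 20 (and one far point that is not among the 4 nearest), the routine returns the 20 of the worked example.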
When the average distance d is less than or equal to a preset value, the rank of each minority sample j_i among the k nearest minority-class samples of the stable sample is obtained, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample j_i, with 1 <= i <= k.
Specifically, for a stable sample f with k = 4, let the k nearest minority samples be j1, j2, j3 and j4, where the ratio of minority to majority samples among the neighbours of j1 is 2/2, of j2 is 3/1, of j3 is 1/3 and of j4 is 1/3. Sorted in ascending order, the ranks of j1, j2, j3 and j4 are then j3 = 1, j4 = 1, j1 = 2, j2 = 3.
The select probability of each neighbour j_i is then obtained, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample j_i, with 1 <= i <= k.
Specifically, with the ranks of j1, j2, j3 and j4 in ascending order being j3 = 1, j4 = 1, j1 = 2, j2 = 3, and the random numbers drawn for j1, j2, j3 and j4 being respectively 0.6, 0.5, 0.3 and 0.8, the select probabilities of j1, j2, j3 and j4 are respectively: j1: 0.6³ × 2 = 0.432; j2: 0.5³ × 3 = 0.375; j3: 0.3³ × 1 = 0.027; j4: 0.8³ × 1 = 0.512.
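The select probabilities of the worked example can be reproduced as follows. The injectable random source is an illustrative convenience for making the example deterministic; in use, a fresh uniform random number would be drawn per candidate.

```python
import random

def select_probabilities(ranks, rng=random.random):
    """probability_i = u_i**3 * rank_i, where u_i is a fresh uniform random
    number in [0, 1) and rank_i is the ascending rank of neighbour j_i."""
    return [rng() ** 3 * rank for rank in ranks]
```

Feeding the ranks (2, 3, 1, 1) with the random numbers 0.6, 0.5, 0.3, 0.8 from the text yields 0.432, 0.375, 0.027, 0.512.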
A minority sample j_i is then selected at random according to the select probabilities, giving the selected minority sample j_i.
It should be noted that the candidate with the largest select probability is not necessarily the one chosen: a larger select probability only makes the corresponding minority sample more likely to be selected, and a minority sample with a small select probability may still be chosen.
A new sample is then synthesized from the selected minority sample j_i and the stable sample, so as to obtain the new sample, where new sample = stable sample + (stable sample − selected minority sample j_i) × a, and a is a generated random number between 0 and 1.
It should be noted that when the average distance d is less than or equal to the preset value, the minority-class sample is very close to the surrounding minority-class samples; the selection among the surrounding minority-class samples is therefore based on the similarity of the minority samples in feature space, and one neighbouring minority sample is selected from the surrounding minority-class samples to synthesize a new sample with it.
Specifically, in the case k = 5, suppose x_i² is randomly selected from the 5 minority-class samples x_i¹, x_i², x_i³, x_i⁴, x_i⁵ nearest to x_i for new-sample synthesis. This approach both avoids the over-fitting problem and increases the weight of the minority-class samples, so that the classifier can tilt towards the minority class during learning, improving the classification performance for minority samples.
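The dense-case synthesis step above can be sketched as follows. Note that, as written in the text, new sample = stable + (stable − selected) × a extrapolates away from the selected neighbour rather than interpolating towards it; the sketch reproduces the formula as stated, and the function name is illustrative.

```python
import random

def synthesize_dense(stable, selected, a=None):
    """new = stable + (stable - selected) * a, with a uniform in [0, 1),
    applied component-wise to feature vectors."""
    if a is None:
        a = random.random()
    return [s + (s - t) * a for s, t in zip(stable, selected)]
```

For example, with stable = [1.0, 2.0], selected = [0.0, 0.0] and a fixed a = 0.5, the new sample is [1.5, 3.0].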
Preferably, when the class of the corresponding minority sample is the stable sample, synthesizing from the minority sample includes:
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtaining the rank of each minority sample x_n among the k nearest minority-class samples of the stable sample, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample x_n, with 1 <= n <= k;
obtaining the select probability of each candidate, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample x_n, with 1 <= n <= k;
randomly selecting s minority samples x_nj according to the select probabilities, where 1 < s <= k and 1 <= j <= s;
synthesizing a new sample from each selected minority sample x_nj together with the stable sample according to the synthesis formula, in which a_n is a generated random number between 0 and 1, x_i' is the new sample and x_i is the stable sample, with 1 < s <= k.
Specifically, if s = 3, three neighbour samples are selected to generate the new sample together with the original sample; for example, with a1 = 0.2, a2 = 0.8 and a3 = 0.4, the newly generated sample is obtained by the same formula.
It should be noted that when the average distance d is greater than the preset value, the minority-class sample is very loosely distributed relative to the surrounding minority-class samples; s samples are then selected from the surrounding minority-class samples, where s can be set as required subject to 1 < s <= k, to synthesize the new sample with them. That is, for a sample far from the surrounding minority-class samples, several samples are preferably combined to generate the new sample, so as to avoid the large deviation from the original data that could arise if only a single sample were used for synthesis.
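The sparse-case synthesis formula itself does not survive in this text, so the sketch below uses an assumed form, x' = x + (1/s) · Σ aₙ(xₙ − x), which is consistent with the stated symbols (one random weight aₙ in [0, 1) per selected neighbour, s neighbours combined with the stable sample) but is not necessarily the exact formula of the specification.

```python
import random

def synthesize_sparse(stable, neighbours, rng=random.random):
    """Assumed form of the multi-neighbour synthesis:
    x' = x + (1/s) * sum_n a_n * (x_n - x), one random weight per neighbour.
    `stable` and each entry of `neighbours` are feature vectors."""
    s = len(neighbours)
    weights = [rng() for _ in range(s)]   # a_1 .. a_s in [0, 1)
    return [
        x + sum(a * (nb[d] - x) for a, nb in zip(weights, neighbours)) / s
        for d, x in enumerate(stable)
    ]
```

With s = 3, weights 0.2, 0.8, 0.4 and one-dimensional neighbours at 1, 2, 3 around a stable sample at 0, this assumed form gives (0.2 + 1.6 + 1.2)/3 = 1.0.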
Implementing this embodiment has the following beneficial effects:
In the course of synthesizing new samples, existing methods randomly select one neighbour sample to synthesize a new sample with an existing sample, which is very likely to introduce noise samples or majority samples. Here, different synthesis methods are adopted according to the different distribution characteristics of the minority samples. For densely distributed minority samples, one neighbour sample is selected to synthesize a new sample with the minority sample, and when selecting, a neighbour surrounded by more majority-class samples has a lower probability of being chosen. For sparsely distributed samples, s samples are selected for synthesis, avoiding the case where a sparsely distributed sample and a single nearby sample synthesize a new sample that deviates markedly from normal values. The newly synthesized samples thus better conform to the sample distribution characteristics.
Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an imbalanced data classification over-sampling device provided by a fifth embodiment of the present invention, including:
a minority sample acquisition module 71, configured to obtain all minority samples in pending imbalanced data;
a majority sample count acquisition module 72, configured to obtain, according to a k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
a class determination module 73, configured to determine the class of the corresponding minority sample according to the number of majority samples;
an operation module 74, configured to perform, according to the class of each minority sample, the operation corresponding to that class.
Preferably, the class determination module 73 includes:
a class determination unit, configured to compare the number of majority samples with predetermined thresholds to determine the class of the corresponding minority sample, wherein the classes include the noise sample, the boundary sample, the unstable sample and the stable sample.
Preferably, the predetermined thresholds include a preset first threshold n, a preset second threshold p and a preset third threshold q, and the class determination unit is configured such that:
when the number of majority samples is greater than or equal to the preset first threshold n, the class of the corresponding minority sample is the noise sample, where the preset first threshold n has a value range of 2k/3 <= n <= k;
when the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the class of the corresponding minority sample is the unstable sample, where the preset second threshold p has a value range of k/2 <= p < n;
when the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the class of the corresponding minority sample is the boundary sample, where the preset third threshold q has a value range of k/3 <= q < p;
when the number of majority samples is less than the preset third threshold q, the class of the corresponding minority sample is the stable sample.
Preferably, the operation module 74 includes:
a deletion unit, configured to delete the minority sample when the class of the corresponding minority sample is the noise sample;
a retention unit, configured to retain the minority sample when the class of the corresponding minority sample is the unstable sample;
a replication unit, configured to replicate the minority sample when the class of the corresponding minority sample is the boundary sample;
a synthesis unit, configured to synthesize new samples from the minority sample when the class of the corresponding minority sample is the stable sample.
Preferably, the replication unit is configured to:
traverse each minority sample among all the minority samples to obtain the increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|; and
replicate the minority sample h times.
Preferably, the synthesis unit is configured to:
traverse each minority sample among all the minority samples to obtain the increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|;
obtain the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is less than or equal to a preset value, obtain the rank of each minority sample j_i among the k nearest minority-class samples of the stable sample, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample j_i, with 1 <= i <= k;
obtain the select probability of each neighbour j_i, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample j_i, with 1 <= i <= k;
randomly select a minority sample j_i according to the select probabilities, giving the selected minority sample j_i; and
synthesize a new sample from the selected minority sample j_i and the stable sample, where new sample = stable sample + (stable sample − selected minority sample j_i) × a, and a is a generated random number between 0 and 1.
Preferably, the synthesis unit is further configured to:
obtain the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtain the rank of each minority sample x_n among the k nearest minority-class samples of the stable sample, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample x_n, with 1 <= n <= k;
obtain the select probability of each candidate, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample x_n, with 1 <= n <= k;
randomly select s minority samples x_nj according to the select probabilities, where 1 < s <= k and 1 <= j <= s; and
synthesize a new sample from each selected minority sample x_nj together with the stable sample according to the synthesis formula, in which a_n is a generated random number between 0 and 1, x_i' is the new sample and x_i is the stable sample, with 1 < s <= k.
Implementing this embodiment has the following beneficial effects:
The number of majority samples among the k nearest neighbours of each minority sample is obtained according to a k-nearest-neighbour algorithm; the class of the corresponding minority sample is determined from that number; and the operation corresponding to the class of each minority sample is performed. When dealing with the problem of low precision of classification learning algorithms caused by the scarcity of minority-class samples in imbalanced big data, this avoids applying one and the same processing method to all minority samples, i.e. merely replicating samples or merely synthesizing new ones. By dividing the minority samples of the pending imbalanced data into classes and performing a different operation for each class, the diversity of the minority samples is increased through this differentiated processing, the low precision of classification learning algorithms caused by scarce minority-class samples is avoided, and the shortage of minority-class samples is resolved.
Referring to Fig. 8, Fig. 8 is a schematic diagram of an imbalanced data classification over-sampling apparatus provided by a sixth embodiment of the present invention, for executing the imbalanced data classification over-sampling method provided by embodiments of the present invention. As shown in Fig. 8, the imbalanced data classification over-sampling apparatus includes: at least one processor 11, such as a CPU; at least one network interface 14 or other user interface 13; a memory 15; and at least one communication bus 12, the communication bus 12 being used to realize the connection and communication between these components. The user interface 13 may optionally include a USB interface, other standard interfaces and wired interfaces. The network interface 14 may optionally include a Wi-Fi interface and other wireless interfaces. The memory 15 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk memory. The memory 15 may optionally include at least one storage device located remotely from the aforementioned processor 11.
In some embodiments, memory 15 stores following element, executable modules or data structures, or Their subset or their superset:
an operating system 151, including various system programs, for realizing various basic services and processing hardware-based tasks;
Program 152.
Specifically, the processor 11 is used to call the program 152 stored in the memory 15 and execute the imbalanced data classification over-sampling method described in the above embodiments.
The so-called processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control centre of the apparatus implementing the imbalanced data classification over-sampling method, and connects the various parts of the whole apparatus using various interfaces and lines.
The memory can be used to store the computer program and/or modules; the processor realizes the various functions of the electronic device for imbalanced data classification over-sampling by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system and the application programs required by at least one function (such as a sound-playing function, a text-conversion function, etc.), and the data storage area can store data created according to use (such as audio data, text message data, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk memory, a flash device, or another volatile solid-state memory device.
If the modules of the adaptively sampled imbalanced data classification are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the present invention realizes all or part of the flow of the above embodiment methods, which can also be completed by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be realized. The computer program includes computer program code, which can be in source code form, object code form, an executable file, certain intermediate forms, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules therein can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between modules indicates that there is a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are preferred embodiments of the present invention. It should be pointed out that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.
It should be noted that, in the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related description of other embodiments. Secondly, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Claims (10)

  1. An imbalanced data classification over-sampling method, characterized by comprising:
    obtaining all minority samples in pending imbalanced data;
    obtaining, according to a k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
    determining the class of the corresponding minority sample according to the number of majority samples;
    performing, according to the class of each minority sample, an operation corresponding to the class.
  2. The imbalanced data classification over-sampling method according to claim 1, characterized in that determining the class of the corresponding minority sample according to the number of majority samples comprises:
    comparing the number of majority samples with predetermined thresholds to determine the class of the corresponding minority sample, wherein the classes include a noise sample, a boundary sample, an unstable sample and a stable sample.
  3. The unbalanced data classification oversampling method according to claim 2, characterized in that
    the preset thresholds comprise a preset first threshold n, a preset second threshold p, and a preset third threshold q;
    and comparing the number of majority samples against the preset thresholds to determine the category of the corresponding minority sample comprises:
    when the number of majority samples is greater than or equal to the first threshold n, classifying the corresponding minority sample as a noise sample; wherein the first threshold n satisfies 2k/3 <= n <= k;
    when the number of majority samples is less than the first threshold n and greater than or equal to the second threshold p, classifying the corresponding minority sample as an unstable sample; wherein the second threshold p satisfies k/2 <= p < n;
    when the number of majority samples is less than the second threshold p and greater than or equal to the third threshold q, classifying the corresponding minority sample as a boundary sample; wherein the third threshold q satisfies k/3 <= q < p;
    when the number of majority samples is less than the third threshold q, classifying the corresponding minority sample as a stable sample.
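The threshold cascade of claim 3 can be sketched in Python. This is an illustrative reading of the claim, not the patent's reference implementation; the function name, the string labels, and the choice of default thresholds at the lower bounds of the claimed ranges are all assumptions:

```python
def categorize(majority_count, k, n=None, p=None, q=None):
    """Map the number of majority-class samples among a minority sample's
    k nearest neighbours to one of the four categories of claim 3.

    Default thresholds are taken at the lower bounds of the claimed
    ranges (an assumption): n = 2k/3, p = k/2, q = k/3.
    """
    n = n if n is not None else 2 * k / 3
    p = p if p is not None else k / 2
    q = q if q is not None else k / 3
    if majority_count >= n:
        return "noise"      # >= n majority neighbours
    if majority_count >= p:
        return "unstable"   # in [p, n)
    if majority_count >= q:
        return "boundary"   # in [q, p)
    return "stable"         # fewer than q majority neighbours
```

For k = 6 the defaults give n = 4, p = 3, q = 2, so a minority sample with 6 majority neighbours is noise and one with none is stable.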
  4. The unbalanced data classification oversampling method according to claim 3, characterized in that performing, according to the category of each minority sample, the operation corresponding to that category comprises:
    when the category of the corresponding minority sample is noise sample, deleting the minority sample;
    when the category of the corresponding minority sample is unstable sample, retaining the minority sample;
    when the category of the corresponding minority sample is boundary sample, replicating the minority sample;
    when the category of the corresponding minority sample is stable sample, synthesizing new samples from the minority sample.
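The four per-category operations of claim 4 amount to a dispatch step. The list-based bookkeeping below is an assumption made for illustration; the patent does not prescribe a data structure:

```python
def apply_operation(category, sample, kept, to_replicate, to_synthesize):
    """Route one minority sample to its claim-4 operation.

    Noise samples are deleted (not kept); all others are retained, and
    boundary / stable samples are additionally queued for replication /
    synthesis in later steps (claims 5 and 6).
    """
    if category == "noise":
        return                        # delete: simply do not keep it
    kept.append(sample)               # unstable, boundary, stable are retained
    if category == "boundary":
        to_replicate.append(sample)   # later copied h times (claim 5)
    elif category == "stable":
        to_synthesize.append(sample)  # later used to synthesize (claim 6)
```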
  5. The unbalanced data classification oversampling method according to claim 4, characterized in that, when the category of the corresponding minority sample is boundary sample, replicating the minority sample comprises:
    traversing all minority samples and, for each minority sample, obtaining an increase number h; wherein the increase number h = |(target minority sample count - unstable sample count) / (total minority sample count - noise sample count - unstable sample count) - 1|;
    replicating the minority sample according to the increase number h.
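The increase number h of claim 5 is a simple ratio. The sketch below follows the claimed formula; truncating h to an integer copy count is an assumption, since the claim leaves the rounding unspecified:

```python
def copies_per_sample(target_minority, total_minority, noise, unstable):
    """Increase number h of claim 5:
    h = |(target - unstable) / (total - noise - unstable) - 1|,
    truncated to an integer copy count (rounding is an assumption).
    """
    remaining = total_minority - noise - unstable  # minority samples kept
    return int(abs((target_minority - unstable) / remaining - 1))

def replicate(samples, h):
    """Return the boundary samples plus h extra copies of each."""
    return samples + [s for s in samples for _ in range(h)]
```

For example, with a target of 100 minority samples, 40 existing ones, 5 noise and 10 unstable, h = |90/25 - 1| = 2.6, i.e. 2 extra copies per boundary sample.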
  6. The unbalanced data classification oversampling method according to claim 5, characterized in that, when the category of the corresponding minority sample is stable sample, synthesizing new samples from the minority sample comprises:
    traversing all minority samples and, for each minority sample, obtaining the increase number h; wherein the increase number h = |(target minority sample count - unstable sample count) / (total minority sample count - noise sample count - unstable sample count) - 1|;
    obtaining the average distance d from the stable sample to its k nearest minority-class samples;
    when the average distance d is less than or equal to a preset value, obtaining a serial number for each sample j_i among the k nearest minority-class samples of the stable sample; wherein the serial numbers are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample j_i; wherein 1 < i <= k;
    obtaining a select probability for the stable sample; wherein the select probability equals the cube of a random number between 0 and 1, multiplied by the serial number of each minority sample j_i; wherein 1 < i <= k;
    randomly selecting a minority sample j_i according to the select probability, to obtain the selected minority sample j_i;
    synthesizing a new sample from the selected minority sample j_i and the stable sample; wherein the new sample = the stable sample + (the stable sample - the selected minority sample j_i) * a; wherein a is a generated random number between 0 and 1.
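The synthesis step of claim 6 can be sketched as follows. The selection rule here (pick the neighbour with the largest weight r^3 * serial number) is one possible reading of the claim's "select probability"; that interpretation, the function name, and the tuple representation of samples are assumptions:

```python
import random

def synthesize_one(stable, neighbours, ranks, rng=random.random):
    """Generate one new sample from a stable sample (claim 6 sketch).

    `neighbours` are the k nearest minority samples (equal-length numeric
    tuples); `ranks` are their serial numbers, assigned in ascending order
    of the minority/majority ratio in their own k-neighbourhoods.
    """
    # Weight each neighbour by (random in [0,1])**3 * serial number,
    # then select the neighbour with the largest weight (an assumption).
    weights = [rng() ** 3 * rank for rank in ranks]
    j = neighbours[max(range(len(neighbours)), key=weights.__getitem__)]
    a = rng()  # random number in [0, 1]
    # new sample = stable + (stable - selected neighbour) * a
    return tuple(x + (x - y) * a for x, y in zip(stable, j))
```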
  7. The unbalanced data classification oversampling method according to claim 6, characterized in that, when the category of the corresponding minority sample is stable sample, synthesizing new samples from the minority sample further comprises:
    obtaining the average distance d from the stable sample to its k nearest minority-class samples;
    when the average distance d is greater than the preset value, obtaining a serial number for each sample x_n among the k nearest minority-class samples of the stable sample; wherein the serial numbers are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample x_n; wherein 1 < n <= k;
    obtaining a select probability for the stable sample; wherein the select probability equals the cube of a random number between 0 and 1, multiplied by the serial number of each minority sample x_n; wherein 1 < n <= k;
    randomly selecting s minority samples x_nj according to the select probability; wherein 1 < s <= k; wherein 1 < j <= s;
    synthesizing a new sample from each minority sample x_nj and the stable sample according to a synthetic method; wherein the synthetic method is: [formula not reproduced in the source text];
    wherein a_n is a generated random number between 0 and 1; x_i' is the new sample; x_i is the stable sample; wherein 1 < s <= k.
  8. An unbalanced data classification sampling apparatus, characterized in that it comprises:
    a minority sample acquisition module, for obtaining all minority samples in the unbalanced data to be processed;
    a majority sample number acquisition module, for obtaining, by a k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
    a category determination module, for determining the category of the corresponding minority sample according to the number of majority samples;
    an operation module, for performing, according to the category of each minority sample, the operation corresponding to that category.
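The first two modules of claim 8 might be sketched as a small Python class. All names here are hypothetical; the Euclidean metric and the exclusion of the query point itself from its neighbour list are assumptions not stated in the claim:

```python
import math

class OversamplerDevice:
    """Sketch of the first two modules of claim 8 (names hypothetical)."""

    def __init__(self, k=5):
        self.k = k

    def get_minority_samples(self, data, labels, minority_label):
        # Minority sample acquisition module: collect all minority samples.
        return [x for x, y in zip(data, labels) if y == minority_label]

    def majority_neighbour_count(self, sample, data, labels, minority_label):
        # Majority sample number acquisition module: count majority-class
        # samples among the k nearest neighbours (Euclidean distance);
        # the nearest point is assumed to be the sample itself and skipped.
        ranked = sorted(zip(data, labels),
                        key=lambda t: math.dist(sample, t[0]))[1:self.k + 1]
        return sum(1 for _, y in ranked if y != minority_label)
```

Feeding each count into the thresholding of claim 3 would complete the category determination module.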
  9. An unbalanced data classification sampling device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the unbalanced data classification oversampling method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the unbalanced data classification oversampling method according to any one of claims 1 to 7.
CN201810453104.6A 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium Active CN108647728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810453104.6A CN108647728B (en) 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810453104.6A CN108647728B (en) 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN108647728A true CN108647728A (en) 2018-10-12
CN108647728B CN108647728B (en) 2019-04-19

Family

ID=63754913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810453104.6A Active CN108647728B (en) 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN108647728B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241969A (en) * 2020-01-06 2020-06-05 北京三快在线科技有限公司 Target detection method and device and corresponding model training method and device
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN112766394A (en) * 2021-01-26 2021-05-07 维沃移动通信有限公司 Modeling sample generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN102495901A (en) * 2011-12-16 2012-06-13 山东师范大学 Method for keeping balance of implementation class data through local mean
CN103324939A (en) * 2013-03-15 2013-09-25 江南大学 Deviation classification and parameter optimization method based on least square support vector machine technology
US20160335548A1 (en) * 2015-05-12 2016-11-17 Rolls-Royce Plc Methods and apparatus for predicting fault occurrence in mechanical systems and electrical systems



Also Published As

Publication number Publication date
CN108647728B (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN107301225B (en) Short text classification method and device
CN108647728B (en) Unbalanced data classification oversampler method, device, equipment and medium
WO2021142916A1 (en) Proxy-assisted evolutionary algorithm-based airfoil optimization method and apparatus
CN108647727A (en) Unbalanced data classification lack sampling method, apparatus, equipment and medium
CN108628971A (en) File classification method, text classifier and the storage medium of imbalanced data sets
CN106599935B (en) Three decision unbalanced data oversampler methods based on Spark big data platform
Zhang et al. 5Ws model for big data analysis and visualization
CN111860638A (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
CN109034194A (en) Transaction swindling behavior depth detection method based on feature differentiation
CN109816044A (en) A kind of uneven learning method based on WGAN-GP and over-sampling
CN108230010A (en) A kind of method and server for estimating ad conversion rates
Li et al. Imbalanced sentiment classification
CN108681970A (en) Finance product method for pushing, system and computer storage media based on big data
CN109033148A (en) One kind is towards polytypic unbalanced data preprocess method, device and equipment
Rai et al. The infinite hierarchical factor regression model
Buskirk Surveying the forests and sampling the trees: An overview of classification and regression trees and random forests with applications in survey research
CN110909222B (en) User portrait establishing method and device based on clustering, medium and electronic equipment
CN109871901A (en) A kind of unbalanced data classification method based on mixing sampling and machine learning
CN108694413A (en) Adaptively sampled unbalanced data classification processing method, device, equipment and medium
CN110457577A (en) Data processing method, device, equipment and computer storage medium
CN107944460A (en) One kind is applied to class imbalance sorting technique in bioinformatics
Sahin et al. A discrete dynamic artificial bee colony with hyper-scout for RESTful web service API test suite generation
CN102339278A (en) Information processing device, information processing method, and program
CN104731919A (en) Wechat public account user classifying method based on AdaBoost algorithm
CN110472659A (en) Data processing method, device, computer readable storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220630

Address after: No. 230, Waihuan West Road, Guangzhou University City, Guangzhou 510000

Patentee after: Guangzhou University

Patentee after: National University of Defense Technology

Address before: No. 230, Waihuan West Road, Guangzhou University City, Guangzhou 510000

Patentee before: Guangzhou University