CN108647728A - Unbalanced data classification oversampling method, device, equipment and medium - Google Patents
- Publication number
- CN108647728A CN108647728A CN201810453104.6A CN201810453104A CN108647728A CN 108647728 A CN108647728 A CN 108647728A CN 201810453104 A CN201810453104 A CN 201810453104A CN 108647728 A CN108647728 A CN 108647728A
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- small number
- classification
- threshold value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0609—Buyer or seller confidence or verification
Abstract
The invention discloses an unbalanced data classification oversampling method, including: obtaining all minority samples in the unbalanced data to be processed; obtaining, according to a k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample; determining the category of the corresponding minority sample according to the number of majority samples; and performing, according to the category of each minority sample, the operation corresponding to that category. This increases the diversity of the minority samples, avoids the low precision of classification learning algorithms caused by scarce minority-class samples, and solves the problem of minority-class sample shortage.
Description
Technical field
The present invention relates to the field of unbalanced big data processing, and more particularly to an unbalanced data classification oversampling method, device, equipment, and medium.
Background technology
With continuous technological progress, including rising Internet speeds, the evolution of the mobile Internet, and the continuous development of hardware technology, data acquisition, storage, and processing technologies have made significant progress, and data are growing at an unprecedented rate: we have entered the big data era. Big data is characterized by huge scale (volume), high generation speed (velocity), diverse forms (variety), and data uncertainty (veracity), and traditional data analysis and mining techniques encounter unprecedented challenges when applied to the big data field.
Data classification is a basic algorithm in data analysis and mining; it has a wide range of application fields and is the foundation of many other data analysis and mining algorithms. In big data, almost all datasets are unbalanced. Unbalanced data refers to a dataset in which at least one class contains relatively fewer samples than the other classes. The data imbalance problem is widespread in the real world, especially in big data application fields. For example, in Internet text classification, the data of each class are unbalanced, and what we pay close attention to is often the minority-class data, such as sensitive information on the network and emerging topics. In electronic commerce applications, the vast majority of customer transaction data and behavior data are normal, and what we pay close attention to is often the fraud and abnormal behavior in electronic commerce; these data are submerged in a large amount of normal behavior data and constitute a typical unbalanced dataset. Similar applications also include medical diagnosis and satellite remote-sensing data classification. Therefore, unbalanced big data classification is a key technical problem in national economic and social development that urgently needs to be solved, and it has broad application prospects.
Because the quantitative difference between data samples of different classes is excessive, unbalanced big data makes it difficult for traditional classification learning algorithms to obtain a good classification effect. Fig. 1 shows a prior-art example of unbalanced data classification, in which circles are minority-class samples and triangles are majority-class samples; the imbalance ratio is 3:1, i.e., the majority-class samples are 3 times the minority-class samples. In actual large datasets, the imbalance ratio is often 10000:1 or even higher, so the data first need to be preprocessed before classification.
Existing preprocessing methods for unbalanced big data mainly comprise oversampling of the minority class and undersampling of the majority class. Oversampling refers to increasing the minority-class samples using certain methods and techniques; by adjusting the sample set, it reduces the degree of imbalance of the big dataset and increases the accuracy of the classification algorithm.
Random oversampling performs random sampling of the minority class on the raw dataset D: minority samples are randomly selected and replicated to obtain an additional dataset E, and finally D and E are merged to obtain an almost balanced dataset D'. The size of E can be freely controlled, so that D' can reach an arbitrary imbalance ratio. In Fig. 2, the circled samples are the minority samples chosen for replication by the random oversampling method.
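The random oversampling described above can be sketched in a few lines of Python (a minimal illustration; the toy dataset and target size are hypothetical):

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Randomly pick minority samples to replicate (the additional set E)
    and merge them with the original minority samples, so the merged
    result reaches the desired size / imbalance ratio (the set D')."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra  # merging the minority part of D with E

minority = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4]]
balanced = random_oversample(minority, target_size=9)
print(len(balanced))  # 9
```

Every sample in the result is a copy of an original minority sample, which is exactly the weakness the inventor identifies below: no new information is added.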
Heuristic oversampling likewise replicates minority samples and does not itself create new samples. The difference is that the samples to replicate are chosen selectively rather than at random: the samples on the classifier boundary are replicated, strengthening their weight in the classifier. Fig. 3 shows the replicas selected by the boundary-sample oversampling method.
When implementing the embodiments of the present invention, the inventor found the following technical problems in the prior art. Because random oversampling selects samples at random, the replicated samples are prone to being of low quality, for example noise samples, which degrades the performance of classification learning algorithms. Although heuristic oversampling selects the samples to replicate according to certain rules, it still only repeats existing minority samples; this sampling method adds no information and may cause overfitting (over-fitting) during classification learning, i.e., the classification algorithm overlearns the training samples, so that it performs very well on the training sample set while its performance on the test set declines instead. Overfitting is often caused by too few training samples. Thus, although random oversampling and heuristic oversampling increase the number of minority-class samples, they merely duplicate samples; the problem of low classification precision caused by scarce minority-class samples when processing unbalanced big data remains, and the shortage of minority-class samples is not fundamentally solved.
Invention content
In view of the above problems, the purpose of the present invention is to provide an unbalanced data classification oversampling method.
In a first aspect, the present invention provides an unbalanced data classification oversampling method, including:
obtaining all minority samples in the unbalanced data to be processed;
obtaining, according to a k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample;
determining the category of the corresponding minority sample according to the number of majority samples;
performing, according to the category of each minority sample, the operation corresponding to that category.
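The neighbor-counting step can be sketched with a brute-force k-nearest-neighbor search (Euclidean distance and the toy data are assumptions; the description does not fix either):

```python
import math

def majority_neighbor_counts(minority, majority, k):
    """For each minority sample, count how many of its k nearest
    neighbors (searched over all other samples) are majority samples."""
    labeled = [(s, 0) for s in minority] + [(s, 1) for s in majority]
    counts = []
    for x in minority:
        ranked = sorted((math.dist(x, y), label)
                        for y, label in labeled if y is not x)
        counts.append(sum(label for _, label in ranked[:k]))
    return counts

minority = [(0.0, 0.0), (0.0, 1.0)]
majority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
print(majority_neighbor_counts(minority, majority, k=3))  # [2, 2]
```

Each count is then compared against the thresholds of the later implementations to decide the sample's category.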
In a first possible implementation of the first aspect, determining the category of the corresponding minority sample according to the number of majority samples includes:
comparing the number of majority samples with preset thresholds to determine the category of the corresponding minority sample; wherein the categories include noise sample, boundary sample, unstable sample, and stable sample.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the preset thresholds include a preset first threshold n, a preset second threshold p, and a preset third threshold q. Comparing the number of majority samples with the preset thresholds to determine the category of the corresponding minority sample then includes:
when the number of majority samples is greater than or equal to the preset first threshold n, the category of the corresponding minority sample is the noise sample; wherein the value range of the preset first threshold n is 2k/3 <= n <= k;
when the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the category of the corresponding minority sample is the unstable sample; wherein the value range of the preset second threshold p is k/2 <= p < n;
when the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the category of the corresponding minority sample is the boundary sample; wherein the value range of the preset third threshold q is k/3 <= q < p;
when the number of majority samples is less than the preset third threshold q, the category of the corresponding minority sample is the stable sample.
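A minimal sketch of the threshold comparison. The concrete defaults n = ceil(2k/3), p = ceil(k/2), q = ceil(k/3) are one choice inside the stated ranges (valid for k in the recommended 5-10 range), not values the description prescribes:

```python
import math

def classify_minority(majority_count, k):
    """Map the majority-neighbor count to a category using thresholds
    picked from the ranges above: n in [2k/3, k], p in [k/2, n),
    q in [k/3, p). These particular choices are assumptions."""
    n = math.ceil(2 * k / 3)   # first threshold: noise
    p = math.ceil(k / 2)       # second threshold: unstable
    q = math.ceil(k / 3)       # third threshold: boundary
    if majority_count >= n:
        return "noise"
    if majority_count >= p:
        return "unstable"
    if majority_count >= q:
        return "boundary"
    return "stable"

print([classify_minority(c, k=6) for c in (6, 3, 2, 1)])
# ['noise', 'unstable', 'boundary', 'stable']
```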
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing the operation corresponding to the category of each minority sample includes:
when the category of the corresponding minority sample is the noise sample, deleting the minority sample;
when the category of the corresponding minority sample is the unstable sample, retaining the minority sample;
when the category of the corresponding minority sample is the boundary sample, replicating the minority sample;
when the category of the corresponding minority sample is the stable sample, synthesizing from the minority sample.
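The category-to-operation mapping can be sketched as a simple dispatch over the four categories (the list names are illustrative):

```python
def apply_operations(samples_with_categories):
    """Delete noise, retain unstable, mark boundary samples for
    replication, and mark stable samples as synthesis seeds."""
    kept, to_replicate, to_synthesize = [], [], []
    for sample, category in samples_with_categories:
        if category == "noise":
            continue                      # delete: drop the sample
        kept.append(sample)               # all non-noise samples stay
        if category == "boundary":
            to_replicate.append(sample)   # will be copied h times
        elif category == "stable":
            to_synthesize.append(sample)  # will seed new samples
    return kept, to_replicate, to_synthesize

data = [([0, 0], "noise"), ([1, 0], "unstable"),
        ([0, 1], "boundary"), ([1, 1], "stable")]
print(apply_operations(data))
# ([[1, 0], [0, 1], [1, 1]], [[0, 1]], [[1, 1]])
```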
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, replicating the minority sample when its category is the boundary sample includes:
traversing and detecting each minority sample among all the minority samples, and obtaining an increase number h; wherein the increase number h = |(target minority sample count - unstable sample count) / (count of all minority samples - noise sample count - unstable sample count) - 1|;
replicating the minority sample according to the increase number h.
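Under the formula above, the increase number h can be computed directly (the counts used here are hypothetical):

```python
def increase_number(target_minority, total_minority, noise, unstable):
    """h = |(target minority count - unstable count)
           / (total minority count - noise count - unstable count) - 1|"""
    return abs((target_minority - unstable)
               / (total_minority - noise - unstable) - 1)

def replicate_boundary(sample, h):
    """Replicate one boundary sample h times (rounded to an integer)."""
    return [list(sample) for _ in range(round(h))]

h = increase_number(target_minority=100, total_minority=40, noise=5, unstable=5)
print(round(h, 4))  # (100-5)/(40-5-5) - 1 = 95/30 - 1, i.e. 2.1667
print(len(replicate_boundary([0.2, 0.4], h)))  # 2
```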
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, synthesizing from the minority sample when its category is the stable sample includes:
traversing and detecting each minority sample among all the minority samples, and obtaining the increase number h; wherein the increase number h = |(target minority sample count - unstable sample count) / (count of all minority samples - noise sample count - unstable sample count) - 1|;
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is less than or equal to a preset value, obtaining the rank of each minority sample j_i among the k nearest minority-class samples of the stable sample; wherein the ranks are assigned in ascending order of the ratio of minority samples to majority samples among the k nearest neighbors of each minority sample j_i, and 1 <= i <= k;
obtaining the selection probability of the candidates; wherein the selection probability is the cube of a random number between 0 and 1 multiplied by the rank of each minority sample j_i;
randomly selecting a minority sample j_i according to the selection probability, obtaining a selected minority sample j_i;
synthesizing the selected minority sample j_i with the stable sample to obtain a new sample; wherein the new sample = stable sample + (stable sample - selected minority sample j_i) * a, and a is a generated random number between 0 and 1.
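The steps above can be sketched as follows. Interpreting "selection probability" as a per-neighbor sampling score (u cubed times rank, with the highest score selected) is an assumption, as is the toy data; the synthesis line itself follows the stated formula new = stable + (stable - selected) * a:

```python
import random

def synthesize_close(stable, ranked_neighbors, seed=0):
    """ranked_neighbors must already be sorted ascending by the ratio of
    minority to majority samples among each neighbor's own k nearest
    neighbors; neighbor i gets score u**3 * (i + 1), u uniform in (0, 1).
    The highest-scoring neighbor is the selected sample j_i."""
    rng = random.Random(seed)
    scores = [rng.random() ** 3 * (i + 1)
              for i in range(len(ranked_neighbors))]
    selected = ranked_neighbors[scores.index(max(scores))]
    a = rng.random()  # random number between 0 and 1
    # new sample = stable + (stable - selected) * a, per the description
    return [s + (s - j) * a for s, j in zip(stable, selected)]

stable = [1.0, 1.0]
neighbors = [[0.5, 1.0], [1.0, 0.5], [1.5, 1.5], [0.8, 0.8]]
new_sample = synthesize_close(stable, neighbors)
print(len(new_sample))  # 2
```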
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, synthesizing from the minority sample when its category is the stable sample further includes:
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtaining the rank of each minority sample x_n among the k nearest minority-class samples of the stable sample; wherein the ranks are assigned in ascending order of the ratio of minority samples to majority samples among the k nearest neighbors of each minority sample x_n, and 1 <= n <= k;
obtaining the selection probability; wherein the selection probability is the cube of a random number between 0 and 1 multiplied by the rank of each minority sample x_n;
randomly selecting s minority samples x_nj according to the selection probability; wherein 1 <= s <= k and 1 <= j <= s;
synthesizing each minority sample x_nj with the stable sample according to the synthesis formula to obtain a new sample;
wherein a_n is a generated random number between 0 and 1, x_i' is the new sample, and x_i is the stable sample.
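The synthesis formula referenced here did not survive in the text; only its symbols are described (x_i the stable sample, x_i' the new sample, a_n random numbers between 0 and 1 for the s selected samples x_nj). The sketch below is purely an assumed SMOTE-style reconstruction that averages the s interpolation terms; it is not the patent's verbatim equation:

```python
import random

def synthesize_far(stable, selected, seed=0):
    """ASSUMED reconstruction: x_i' = x_i + (1/s) * sum_n a_n * (x_nj - x_i),
    with each a_n uniform in (0, 1). A guess consistent with the symbols
    described above, not the original formula."""
    rng = random.Random(seed)
    s = len(selected)
    new = list(stable)
    for x in selected:
        a_n = rng.random()
        for d in range(len(stable)):
            new[d] += a_n * (x[d] - stable[d]) / s
    return new

stable = [1.0, 1.0]
picked = [[0.0, 1.0], [2.0, 2.0], [1.0, 0.0]]
print(len(synthesize_far(stable, picked)))  # 2
```

By construction, if all selected samples coincide with the stable sample, the new sample equals the stable sample.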
In a second aspect, the present invention further provides an unbalanced data classification oversampling device, including:
a minority sample acquisition module, configured to obtain all minority samples in the unbalanced data to be processed;
a majority sample count acquisition module, configured to obtain, according to a k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample;
a category determination module, configured to determine the category of the corresponding minority sample according to the number of majority samples;
an operation module, configured to perform, according to the category of each minority sample, the operation corresponding to that category.
In a third aspect, an embodiment of the present invention further provides an unbalanced data classification oversampling apparatus, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the unbalanced data classification oversampling method according to any one of the above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, the computer-readable storage medium including a stored computer program, wherein, when the computer program runs, a device on which the computer-readable storage medium is located is controlled to execute the unbalanced data classification oversampling method according to any one of the above.
The above technical solution has the following advantages. The number of majority samples among the k nearest neighbors of each minority sample is obtained according to a k-nearest-neighbor algorithm; the category of the corresponding minority sample is determined according to that number; and the operation corresponding to each minority sample's category is then performed. When addressing the low precision of classification learning algorithms caused by scarce minority-class samples in unbalanced big data classification, this avoids applying the same treatment to all minority samples, i.e., only replicating samples or only synthesizing new samples. By dividing the minority samples of the unbalanced data to be processed into categories and performing different operations on the different categories of samples, these differentiated treatments increase the diversity of the minority samples, avoid the low precision of classification learning algorithms caused by scarce minority-class samples, and solve the problem of minority-class sample shortage.
Description of the drawings
Fig. 1 is an example diagram of unbalanced data classification in the prior art;
Fig. 2 is an example diagram of the random oversampling method in the prior art;
Fig. 3 is an example diagram of the boundary-sample oversampling method in the prior art;
Fig. 4 is a flow diagram of the unbalanced data classification oversampling method provided by the first embodiment of the present invention;
Fig. 5 is a schematic diagram of obtaining the k nearest-neighbor samples, provided by the first embodiment of the present invention;
Fig. 6 is an example diagram of a synthesis method in the prior art;
Fig. 7 is a structural schematic diagram of an unbalanced data classification oversampling device provided by the fifth embodiment of the present invention;
Fig. 8 is a structural schematic diagram of an unbalanced data classification oversampling apparatus provided by the sixth embodiment of the present invention.
Specific implementation mode
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 4, a flow diagram of the unbalanced data classification oversampling method provided by the first embodiment of the present invention is shown.
It should be noted that, when generating minority-class samples, existing methods apply the same treatment to all minority-class samples, whether by replicating samples or by synthesizing new samples; they all treat the unbalanced data in a single uniform way. The resulting simple repetition of minority samples means that the newly added minority samples are of low quality and little value, easily leading to overfitting (over-fitting) during classification learning, i.e., the classification algorithm overlearns the training samples, so that it performs very well on the training sample set while its performance on the test set declines instead. The problem of low classification precision caused by scarce minority-class samples in unbalanced big data classification is thus handled poorly, and the shortage of minority-class samples cannot be fundamentally solved.
The unbalanced data classification oversampling method provided in this embodiment may be executed by a terminal device, including but not limited to: a mobile phone, a laptop, a tablet computer, a desktop computer, and the like.
The unbalanced data classification oversampling method is as follows.
S11. Obtain all minority samples in the unbalanced data to be processed.
It should be noted that, in the embodiments of the present invention, when handling the minority samples in the unbalanced data to be processed, given that in actual large datasets the imbalance ratio of majority samples to minority samples is often 10000:1 or even higher, all minority samples in the unbalanced data to be processed are obtained first in order to improve the quality of the newly added minority samples.
S12. Obtain, according to a k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample.
It should be noted that the value of k is an integer greater than 1 and is determined according to actual conditions; the present invention does not specifically limit it. The setting of k affects the performance of this method: as k increases, the performance of this method tends to decline, while a k that is too small reduces the accuracy of this method. A value of k between 5 and 10 is generally reasonable; the present invention does not specifically limit this.
Specifically, referring to Fig. 5, the triangles in the figure are majority samples and the circles are minority samples; the minority sample circled with a rectangle is denoted M. Assuming the value of k is 4, the 4 nearest neighbors of minority sample M are circled, and among those four samples the number of majority samples is 2.
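The Fig. 5 scenario can be reproduced numerically. The coordinates below are hypothetical, arranged so that exactly 2 of M's 4 nearest neighbors are majority samples, matching the count described above:

```python
import math

M = (0.0, 0.0)                                   # the circled minority sample
minority = [(0.0, 1.0), (1.0, 0.0)]              # other minority samples
majority = [(0.5, 0.5), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]

# Rank every other sample by distance to M; label 1 marks a majority sample.
ranked = sorted([(math.dist(M, p), 1) for p in majority] +
                [(math.dist(M, p), 0) for p in minority])
majority_count = sum(label for _, label in ranked[:4])  # k = 4
print(majority_count)  # 2
```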
S13. Determine the category of the corresponding minority sample according to the number of majority samples.
In the embodiments of the present invention, the category of the corresponding minority sample is determined according to the number of majority samples, wherein the categories include noise sample, boundary sample, unstable sample, and stable sample.
It should be noted that determining the category of a minority sample in this embodiment is in practice determining the character of that minority sample within the minority set of the unbalanced data to be processed, so that the corresponding minority sample can be operated on according to actual needs, ensuring that the unbalanced data to be processed finally achieve the desired effect.
It should be noted that a minority sample is a noise sample when it is an interfering sample; for example, in one situation, the overwhelming majority of the neighbor samples of the minority sample are majority samples, i.e., the number of majority samples far exceeds that of minority samples, and the minority sample is then the noise sample. A minority sample lying between the minority-sample cluster and the majority-sample cluster is a boundary sample; for example, in one situation, the numbers of majority and minority samples among its neighbor samples are comparable, and the minority sample is then the boundary sample. A minority sample that belongs to the minority samples but is unstable is an unstable sample; for example, in one situation, the number of majority samples among its neighbor samples somewhat exceeds that of minority samples, and the minority sample is then the unstable sample. A minority sample lying entirely within the minority-sample cluster is a stable sample; for example, in one situation, the majority samples among its neighbor samples are far fewer than the minority samples, i.e., the overwhelming majority of its neighbor samples are minority samples, and the minority sample is then the stable sample.
S14. Perform, according to the category of each minority sample, the operation corresponding to that category.
It should be noted that, in the embodiments of the present invention, the minority samples in the unbalanced data to be processed are oversampled to increase the diversity of the minority samples; during oversampling, different operations are performed according to the category to which each minority sample belongs, wherein the operations include deleting, retaining, replicating, and synthesizing.
It should be noted that each minority sample corresponds to one and only one operation, i.e., each category has exactly one corresponding operation. Assuming the categories are b1, b2, b3, and b4, each of b1, b2, b3, and b4 has one corresponding operation; for example, b1 may correspond to retaining, b2 to deleting, b3 also to deleting, and b4 to synthesizing; the present invention does not specifically limit this.
Specifically, all minority samples in the unbalanced data to be processed are obtained to form a minority sample set A, A = [a1, a2, ..., an], where n is the number of all minority samples. Assuming the minority sample a1 is a noise sample, a deletion operation needs to be performed on a1; however, if some preprocessing stipulates that noise samples are to be retained, then the minority sample a1 is retained. Assuming the minority sample an is an unstable sample, and some preprocessing stipulates that a deletion operation is to be performed on unstable samples, then that minority sample is deleted; the present invention does not specifically limit this.
Implementing this embodiment has the following beneficial effects.
By obtaining all minority samples in the unbalanced data to be processed, obtaining according to a k-nearest-neighbor algorithm the number of majority samples among the k nearest neighbors of each minority sample, determining the category of the corresponding minority sample according to that number, and performing the operation corresponding to each minority sample's category, the problem of low-quality added minority samples caused by applying the same treatment to all minority samples, i.e., only replicating samples or only synthesizing new samples, is solved. Different categories of minority samples undergo different operations, which increases the diversity of minority-sample processing, thereby increasing the diversity of the minority samples and improving the quality of the newly added minority samples; this in turn avoids the low precision of classification learning algorithms caused by scarce minority-class samples and solves the problem of minority-class sample shortage.
Embodiment two
On the basis of embodiment one,
Determining the category of the corresponding minority sample according to the number of majority samples includes:
comparing the number of majority samples with preset thresholds to determine the category of the corresponding minority sample; wherein the categories include noise sample, boundary sample, unstable sample, and stable sample.
In the embodiments of the present invention, the preset thresholds are set according to actual conditions. Specifically, the preset thresholds include a preset first threshold n, a preset second threshold p, and a preset third threshold q. Comparing the number of majority samples with the preset thresholds to determine the category of the corresponding minority sample then includes:
when the number of majority samples is greater than or equal to the preset first threshold n, the category of the corresponding minority sample is the noise sample; wherein the value range of the preset first threshold n is 2k/3 <= n <= k.
In this embodiment, the preset first threshold n is the threshold for judging whether a minority sample is the noise sample; its value range 2k/3 <= n <= k is the preferred range of the embodiment of the present invention, a reasonable noise-sample value range obtained from a large number of tests.
When the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the category of the corresponding minority sample is the unstable sample; wherein the value range of the preset second threshold p is k/2 <= p < n.
In this embodiment, the preset second threshold p is the threshold for judging whether a minority sample is the unstable sample; its value range k/2 <= p < n is the preferred range of the embodiment of the present invention, a reasonable unstable-sample value range obtained from a large number of tests.
The number of the majority sample is less than the default second threshold p and is greater than or equal to the default third threshold value q
When, then the classification of corresponding a small number of samples is the boundary sample;Wherein, the default second threshold p value ranges are k/2
<=p<n;Wherein, the default third threshold value q value ranges are k/3<=q<p;
In the present embodiment, the default second threshold p is to judge whether a small number of samples are the unstable sample
Threshold value;The default third threshold value q be judge a small number of sample whether be the boundary sample threshold value;Wherein, described
Default third threshold value q value ranges are k/3<=q<P is the preferred scope of the embodiment of the present invention, is obtained according to a large amount of tests
A rational boundary sample value range.
When the number of majority samples is less than the preset third threshold q, the corresponding minority sample is classified as a stable sample; the preset third threshold q takes values in the range k/3 <= q < p.
In the present embodiment, the preset third threshold q is the threshold for judging whether a minority sample is a boundary sample.
It should be noted that the preset first threshold n, the preset second threshold p and the preset third threshold q are thresholds for reasonably distinguishing the different classes of samples, obtained through extensive testing; their specific values can be set as required, and the present invention does not specifically limit them.
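The three-threshold decision rule above can be sketched as follows; the function name and the illustrative default thresholds (taken from the stated preferred ranges 2k/3 <= n <= k, k/2 <= p < n and k/3 <= q < p) are assumptions for illustration, not part of the embodiment:

```python
def classify_minority_sample(majority_count, k, n=None, p=None, q=None):
    """Classify a minority sample by the number of majority samples
    among its k nearest neighbours, using thresholds n > p > q."""
    # Illustrative defaults drawn from the preferred ranges above.
    n = (2 * k) // 3 if n is None else n   # noise threshold
    p = k // 2 if p is None else p         # unstable threshold
    q = k // 3 if q is None else q         # boundary threshold
    if majority_count >= n:
        return "noise"       # delete
    if majority_count >= p:
        return "unstable"    # retain
    if majority_count >= q:
        return "boundary"    # replicate
    return "stable"          # synthesize
```

For example, with k = 6 the defaults become n = 4, p = 3 and q = 2, so a minority sample with 5 majority neighbours would be treated as noise and one with a single majority neighbour as stable.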
Performing the operation corresponding to the class of each minority sample then includes:
When the corresponding minority sample is classified as a noise sample, the minority sample is deleted. It should be noted that deleting noise samples improves the quality of the newly added samples and reduces the noise that newly added samples would otherwise introduce into subsequent data processing.
When the corresponding minority sample is classified as an unstable sample, the minority sample is retained. It should be noted that unstable samples are retained rather than deleted in order to increase the diversity of the minority samples, so that the minority samples better reflect the true situation.
When the corresponding minority sample is classified as a boundary sample, the minority sample is replicated. It should be noted that boundary samples lie on the boundary of the imbalanced data to be processed; they are more valuable because they best embody the distinguishing features between the majority and minority classes. The minority-class samples at the classification boundary are therefore selected for processing: the samples on the classifier boundary are replicated to enhance their weight in the imbalanced data to be processed.
When the corresponding minority sample is classified as a stable sample, the minority sample is synthesized. It should be noted that synthesizing stable samples increases the number of minority samples while alleviating the over-fitting problem.
Implementing the present embodiment has the following beneficial effects:
After the classes of the minority samples are accurately determined, the number of majority samples among the k nearest neighbours of each minority sample is compared with the preset thresholds, where the preset thresholds are set according to the different judgement conditions for the different classes of minority samples; different processing is then applied to the different classes of minority samples, effectively improving the classification accuracy on the imbalanced data.
Embodiment three
On the basis of embodiment one and embodiment two, when the corresponding minority sample is classified as a boundary sample, replicating the minority sample includes:
traversing each minority sample among all the minority samples to obtain the increase number h, where h = |(target minority sample count − unstable sample count) / (count of all minority samples − noise sample count − unstable sample count) − 1|;
replicating the minority sample according to the increase number h.
Specifically, after each minority sample among all the minority samples has been traversed — the noise samples deleted and the unstable samples retained — the remaining samples, not yet operated on, are the boundary samples and the stable samples; that is, the large data set excludes the noise samples and the unstable samples, which are not used for synthesis or replication. First, the number of samples still to be added, i.e., the increase number h, must be computed: h = |(target minority sample count − unstable sample count) / (count of all minority samples − noise sample count − unstable sample count) − 1|. Here, the target minority sample count is the number of minority samples ultimately desired after performing the imbalanced-data classification oversampling on the data to be processed; the unstable sample count is the number of unstable samples obtained after traversing each minority sample among all the minority samples; the count of all minority samples is the number of minority samples obtained initially from the imbalanced data to be processed; and the noise sample count is the number of noise samples obtained after traversing each minority sample among all the minority samples. Assuming the target minority sample count is 20000, the count of all minority samples is 5000, there are 500 noise samples and 500 unstable samples, then h = |(20000 − 500)/(5000 − 500 − 500) − 1| = |4.875 − 1| = 3.875, of which the integer part 3 is taken. When a minority sample is a boundary sample, it is replicated according to the increase number h; for example, for minority sample c with increase number h = 3, replicating minority sample c yields 4 copies of minority sample c.
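The computation of the increase number h and the replication step can be sketched as follows; taking the integer part of h, as the worked example above does, is an assumption, and the function names are illustrative:

```python
def increase_number(target_count, total_count, noise_count, unstable_count):
    """h = |(target - unstable) / (total - noise - unstable) - 1|,
    truncated to its integer part as in the worked example."""
    ratio = (target_count - unstable_count) / (total_count - noise_count - unstable_count)
    return int(abs(ratio - 1))

def replicate_boundary_sample(sample, h):
    """Replicating a boundary sample h times leaves h + 1 copies in total."""
    return [sample] * (h + 1)
```

With the figures of the example (target 20000, 5000 minority samples, 500 noise, 500 unstable), `increase_number` returns 3, and replicating minority sample c three times yields 4 copies.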
Implementing the present embodiment has the following beneficial effects:
The boundary samples in the imbalanced data to be processed are replicated. Boundary samples lie on the classifier boundary; they are more valuable because they best embody the distinguishing features between the majority and minority classes. The minority-class samples at the classification boundary are therefore selected for processing, i.e., the samples on the classifier boundary are replicated to enhance their weight in the classifier, thereby improving the classification accuracy of the minority samples in the imbalanced data.
Embodiment four
It should be noted that in the prior art, when a new sample is synthesized for a minority sample x using Euclidean distance, its k minority-class nearest neighbours are x1, x2, x3 and x4. If one of these 4 minority-class neighbours is selected at random, each has the same probability of being selected; as shown in Fig. 6, x3 lies among the majority-class samples and is most likely noise. If x3 happens to be selected, the newly synthesized sample is likely to be noise as well, which not only fails to strengthen the minority class but also introduces more noise.
In the present embodiment, by contrast, all the minority samples are considered for synthesis. When the corresponding minority sample is classified as a stable sample, synthesizing the minority sample includes:
traversing each minority sample among all the minority samples to obtain the increase number h, where h = |(target minority sample count − unstable sample count) / (count of all minority samples − noise sample count − unstable sample count) − 1|;
obtaining the average distance d from the stable sample to its k nearest minority-class samples.
Specifically, assume minority sample e is a stable sample and k is 4; the average distance from the 4 nearest-neighbour samples of the stable sample to the stable sample is then obtained. Let the 4 nearest neighbours of the stable sample be o1, o2, o3 and o4, with distances from o1, o2, o3 and o4 to the stable sample (i.e., minority sample e) of 10, 20, 30 and 20 respectively; the average distance is then (10 + 20 + 30 + 20)/4 = 20, where the distances to the stable sample, i.e., minority sample e, are Euclidean distances.
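The average-distance computation of the example can be sketched as follows (the function name is illustrative):

```python
import math

def average_neighbour_distance(sample, neighbours):
    """Average Euclidean distance from a minority sample to its
    k nearest minority-class neighbours."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(euclidean(sample, nb) for nb in neighbours) / len(neighbours)
```

Placing sample e at the origin of a one-dimensional feature space with neighbours at distances 10, 20, 30 and 20 reproduces the average of 20 from the example.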
When the average distance d is less than or equal to a preset value, the rank of each minority sample ji among the k nearest minority-class samples of the stable sample is obtained, where the ranks are assigned by sorting in ascending order of the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample ji; 1 <= i <= k.
Specifically, for stable sample f with k equal to 4, the k nearest samples are minority samples j1, j2, j3 and j4, where the ratio of minority to majority samples among the neighbours of j1 is 2/2, of j2 is 3/1, of j3 is 1/3 and of j4 is 1/3. Sorting in ascending order of these ratios gives the ranks j3 = 1, j4 = 1, j1 = 2, j2 = 3.
The select probability of each of these neighbours of the stable sample is obtained, where the select probability equals the cube of an arbitrary random number between 0 and 1 multiplied by the rank of the corresponding minority sample ji; 1 <= i <= k.
Specifically, with the ranks j3 = 1, j4 = 1, j1 = 2, j2 = 3 and random numbers 0.6, 0.5, 0.3 and 0.8 drawn for j1, j2, j3 and j4 respectively, the corresponding select probabilities are: j1 is 0.6³ × 2 = 0.432; j2 is 0.5³ × 3 = 0.375; j3 is 0.3³ × 1 = 0.027; j4 is 0.8³ × 1 = 0.512.
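The ranking and select-probability computation can be sketched as follows; the tie handling (tied ratios share a rank, with the next distinct ratio taking the next rank) is inferred from the worked example (j3 = 1, j4 = 1, j1 = 2, j2 = 3), and the function names are illustrative:

```python
import random

def neighbour_ranks(ratios):
    """Ascending ranks of the minority/majority ratios of each
    neighbour's own k nearest neighbours; tied ratios share a rank."""
    distinct = sorted(set(ratios))
    return [distinct.index(r) + 1 for r in ratios]

def select_probabilities(ranks, rng=random.random):
    """Select probability = (random number in [0, 1)) ** 3 * rank."""
    return [rng() ** 3 * rank for rank in ranks]
```

For the ratios 2/2, 3/1, 1/3 and 1/3 of j1 through j4 this gives ranks 2, 3, 1, 1, and with the random draws 0.6, 0.5, 0.3 and 0.8 the probabilities 0.432, 0.375, 0.027 and 0.512 of the example.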
A minority sample ji is randomly selected according to the select probabilities, yielding the selected minority sample ji.
It should be noted that the sample with the largest select probability is not necessarily chosen: the larger the select probability, the more likely the corresponding minority sample is to be selected, but a minority sample with a small select probability may still be chosen.
The selected minority sample ji is synthesized with the stable sample to obtain a new sample, where new sample = stable sample + (stable sample − selected minority sample ji) × a, a being a generated random number between 0 and 1.
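The single-neighbour synthesis formula above can be sketched per feature dimension as follows (the function name is illustrative):

```python
import random

def synthesize_single(stable, selected, a=None):
    """New sample = stable + (stable - selected) * a,
    a being a random number between 0 and 1, applied per feature."""
    a = random.random() if a is None else a
    return [s + (s - t) * a for s, t in zip(stable, selected)]
```

For example, with a = 0.5 the new sample lies on the ray from the selected neighbour through the stable sample, half a step beyond it.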
It should be noted that when the average distance d is less than or equal to the preset value, the minority-class sample is very close to the surrounding minority-class samples; the selection among the minority-class samples is based on the similarity of the minority samples in feature space, and a neighbouring minority sample is selected from the surrounding minority-class samples to synthesize the new sample with it.
Specifically, in the case of k = 5, one of the 5 minority-class samples nearest to xi, namely xi1, xi2, xi3, xi4 and xi5 — say xi2 — is randomly selected for new-sample synthesis. This method both avoids the over-fitting problem and increases the sample weight of the minority class, so that the classifier tilts toward the minority class during learning, improving the classification performance on the minority samples.
Preferably, when the corresponding minority sample is classified as a stable sample, synthesizing the minority sample includes:
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtaining the rank of each minority sample xn among the k nearest minority-class samples of the stable sample, where the ranks are assigned by sorting in ascending order of the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample xn; 1 <= n <= k;
obtaining the select probability of each of these neighbours, where the select probability equals the cube of an arbitrary random number between 0 and 1 multiplied by the rank of the corresponding minority sample xn; 1 <= n <= k;
randomly selecting s minority samples xnj according to the select probabilities; 1 < s <= k; 1 <= j <= s;
synthesizing each minority sample xnj with the stable sample according to the synthesis formula to obtain the new sample, where the synthesis formula is xi' = xi + (1/s) × Σ(j = 1 to s) aj × (xi − xnj), aj being a generated random number between 0 and 1, xi' the new sample and xi the stable sample; 1 < s <= k.
Specifically, if s = 3, then 3 neighbour samples are selected to generate the new sample together with the original sample; for example, a1 = 0.2, a2 = 0.8 and a3 = 0.4 may be substituted into the synthesis formula to obtain the newly generated sample.
It should be noted that when the average distance d is greater than the preset value, the minority-class sample is only loosely surrounded by minority-class samples; s samples are then selected from the surrounding minority-class samples, where s can be set as required but must satisfy 1 < s <= k, and the new sample is synthesized with them. That is, for a sample distant from the surrounding minority-class samples, as many samples as possible are selected to jointly generate the new sample, so as to avoid the large deviation from the original data that could result from synthesizing the new sample with only a single sample.
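A minimal sketch of the s-neighbour synthesis, assuming the averaged form xi' = xi + (1/s) Σ aj (xi − xnj) consistent with the single-neighbour rule; the function name and this assumed form are illustrative only:

```python
import random

def synthesize_multi(stable, neighbours, coeffs=None):
    """Assumed s-neighbour synthesis: average the single-neighbour
    rule x + a * (x - x_j) over the s selected neighbours."""
    s = len(neighbours)
    if coeffs is None:
        coeffs = [random.random() for _ in range(s)]  # a_j in [0, 1)
    new = list(stable)
    for a_j, neighbour in zip(coeffs, neighbours):
        for d in range(len(stable)):
            new[d] += a_j * (stable[d] - neighbour[d]) / s
    return new
```

With s = 3 and the example coefficients a1 = 0.2, a2 = 0.8 and a3 = 0.4, each of the three selected neighbours contributes to the new sample with its own weight, so no single neighbour dominates the result.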
Implementing the present embodiment has the following beneficial effects:
In synthesizing new samples, existing methods select one neighbour sample at random to synthesize a new sample with an existing sample, and are thus likely to introduce noise samples or even majority samples. Here, different synthesis methods are adopted according to the different distribution characteristics of the minority samples: for densely distributed minority samples, a neighbour sample is selected to synthesize the new sample with the minority sample, and the more majority-class samples surround a candidate, the lower its probability of being selected; for sparsely distributed samples, s samples are selected to synthesize the new sample, avoiding the deviation from normal values that occurs when a sparsely distributed sample synthesizes a new sample with a single nearby sample. The newly synthesized samples therefore better conform to the sample distribution characteristics.
Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an imbalanced-data classification oversampling apparatus provided by a fifth embodiment of the present invention, comprising:
a minority sample acquisition module 71 for obtaining all minority samples in the imbalanced data to be processed;
a majority sample number acquisition module 72 for obtaining, according to the k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
a category determination module 73 for determining the class of the corresponding minority sample according to the number of majority samples;
an operation module 74 for performing the operation corresponding to the class of each minority sample.
Preferably, the category determination module 73 includes:
a class determination unit for comparing the number of majority samples with the preset thresholds to determine the class of the corresponding minority sample, where the classes include noise sample, boundary sample, unstable sample and stable sample.
Preferably, the preset thresholds include a preset first threshold n, a preset second threshold p and a preset third threshold q, and the class determination unit is configured such that:
when the number of majority samples is greater than or equal to the preset first threshold n, the corresponding minority sample is classified as a noise sample, the preset first threshold n taking values in the range 2k/3 <= n <= k;
when the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the corresponding minority sample is classified as an unstable sample, the preset second threshold p taking values in the range k/2 <= p < n;
when the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the corresponding minority sample is classified as a boundary sample, the preset third threshold q taking values in the range k/3 <= q < p;
when the number of majority samples is less than the preset third threshold q, the corresponding minority sample is classified as a stable sample.
Preferably, the operation module includes:
a deletion unit for deleting the minority sample when the corresponding minority sample is classified as a noise sample;
a retention unit for retaining the minority sample when the corresponding minority sample is classified as an unstable sample;
a replication unit for replicating the minority sample when the corresponding minority sample is classified as a boundary sample;
a synthesis unit for synthesizing the minority sample when the corresponding minority sample is classified as a stable sample.
Preferably, the replication unit is configured to:
traverse each minority sample among all the minority samples to obtain the increase number h, where h = |(target minority sample count − unstable sample count) / (count of all minority samples − noise sample count − unstable sample count) − 1|;
replicate the minority sample according to the increase number h.
Preferably, the synthesis unit is configured to:
traverse each minority sample among all the minority samples to obtain the increase number h, where h = |(target minority sample count − unstable sample count) / (count of all minority samples − noise sample count − unstable sample count) − 1|;
obtain the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is less than or equal to the preset value, obtain the rank of each minority sample ji among the k nearest minority-class samples of the stable sample, where the ranks are assigned by sorting in ascending order of the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample ji, 1 <= i <= k;
obtain the select probability of each of these neighbours, where the select probability equals the cube of an arbitrary random number between 0 and 1 multiplied by the rank of the corresponding minority sample ji, 1 <= i <= k;
randomly select a minority sample ji according to the select probabilities, yielding the selected minority sample ji;
synthesize the selected minority sample ji with the stable sample to obtain a new sample, where new sample = stable sample + (stable sample − selected minority sample ji) × a, a being a generated random number between 0 and 1.
Preferably, the synthesis unit is further configured to:
obtain the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtain the rank of each minority sample xn among the k nearest minority-class samples of the stable sample, where the ranks are assigned by sorting in ascending order of the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample xn, 1 <= n <= k;
obtain the select probability of each of these neighbours, where the select probability equals the cube of an arbitrary random number between 0 and 1 multiplied by the rank of the corresponding minority sample xn, 1 <= n <= k;
randomly select s minority samples xnj according to the select probabilities, 1 < s <= k, 1 <= j <= s;
synthesize each minority sample xnj with the stable sample according to the synthesis formula xi' = xi + (1/s) × Σ(j = 1 to s) aj × (xi − xnj) to obtain the new sample, aj being a generated random number between 0 and 1, xi' the new sample and xi the stable sample, 1 < s <= k.
Implementing the present embodiment has the following beneficial effects:
The number of majority samples among the k nearest neighbours of each minority sample is obtained according to the k-nearest-neighbour algorithm; the class of the corresponding minority sample is determined according to the number of majority samples; and the operation corresponding to that class is performed for each minority sample. In handling the low classification-learning accuracy caused by scarce minority-class samples in imbalanced data, this avoids applying the same processing to all minority samples — merely replicating samples or merely synthesizing new ones. By dividing the minority samples of the imbalanced data to be processed into classes and performing different operations on the different classes, the diversity of the minority samples is increased, low classification-learning accuracy due to scarce minority-class samples is avoided, and the shortage of minority-class samples is resolved.
Referring to Fig. 8, Fig. 8 is a schematic diagram of an imbalanced-data classification oversampling device provided by a sixth embodiment of the present invention, for executing the imbalanced-data classification oversampling method provided by the embodiments of the present invention. As shown in Fig. 8, the imbalanced-data classification oversampling device includes: at least one processor 11, such as a CPU; at least one network interface 14 or other user interface 13; a memory 15; and at least one communication bus 12, the communication bus 12 being used to realize connection and communication between these components. The user interface 13 may optionally include a USB interface, other standard interfaces, or wired interfaces. The network interface 14 may optionally include a Wi-Fi interface and other wireless interfaces. The memory 15 may include a high-speed RAM memory and may also include non-volatile memory, such as at least one magnetic disk storage. The memory 15 may optionally include at least one storage device located remotely from the aforementioned processor 11.
In some embodiments, the memory 15 stores the following elements, executable modules or data structures, or a subset or superset thereof:
an operating system 151, including various system programs, for realizing various basic services and processing hardware-based tasks;
a program 152.
Specifically, the processor 11 is used to call the program 152 stored in the memory 15 and execute the imbalanced-data classification oversampling method described in the above embodiments.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control centre of the imbalanced-data classification oversampling device, connecting the various parts of the entire device by means of various interfaces and lines.
The memory can be used to store the computer program and/or modules; the processor realizes the various functions of the imbalanced-data classification oversampling electronic device by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function (such as a sound-playing function, a text-conversion function, etc.), and the data storage area can store data created according to the use of the device (such as audio data, text message data, etc.). In addition, the memory may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules of the adaptively sampled imbalanced-data classification are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the present invention realizes all or part of the flow of the above embodiment methods, which can also be completed by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium, and when executed by a processor, the steps of each of the above method embodiments can be realized. The computer program includes computer program code, which can be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunications signals.
It should be noted that the apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules therein can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationships between modules indicate that they have communication connections between them, which can be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.
It should be noted that in the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of the other embodiments. Secondly, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Claims (10)
- The oversampler method 1. a kind of unbalanced data is classified, which is characterized in that including:Obtain all a small number of samples in pending unbalanced data;The number of most samples in k sample of each a small number of sample arest neighbors is obtained according to k nearest neighbor algorithm;The classification of corresponding a small number of samples is determined according to the number of most samples;Operation corresponding with the classification is carried out according to the classification of each a small number of samples.
- The oversampler method 2. unbalanced data according to claim 1 is classified, which is characterized in that described according to the majority The number of sample determines that the classification of corresponding a small number of samples includes:Size comparison is carried out with predetermined threshold value according to the number of most samples, with the class of determination corresponding a small number of samples Not;Wherein, the classification includes noise sample, boundary sample, unstable sample, stablizes sample.
- The oversampler method 3. unbalanced data according to claim 2 is classified, which is characterized in thatThe predetermined threshold value includes preset first threshold value n, presets second threshold p and default third threshold value q,Then the number according to most samples is compared with predetermined threshold value, with the classification packet of determination corresponding a small number of samples It includes:When the number of the majority sample is greater than or equal to the preset first threshold value n, then the classification of the corresponding a small number of samples For the noise sample;Wherein, the preset first threshold value n value ranges are 2k/3<=n<=k;When the number of the majority sample is less than the preset first threshold value n and is greater than or equal to the default second threshold p, then The classification of corresponding a small number of samples is the unstable sample;Wherein, the preset first threshold value n value ranges are 2k/3< =n<=k;Wherein, the default second threshold p value ranges are k/2<=p<n;When the number of the majority sample is less than the default second threshold p and is greater than or equal to the default third threshold value q, then The classification of corresponding a small number of samples is the boundary sample;Wherein, the default second threshold p value ranges are k/2<=p< n;Wherein, the default third threshold value q value ranges are k/3<=q<p;The number of the majority sample is less than the default third threshold value q, then the classification of corresponding a small number of samples is described steady Random sample example;Wherein, the default third threshold value q value ranges are k/3<=q<p.
- 4. unbalanced data according to claim 3 is sorted to use method, which is characterized in that described according to each described The classification of a small number of samples carries out operation corresponding with the classification:When the classification of corresponding a small number of samples is the noise sample, a small number of samples are deleted;When the classification of corresponding a small number of samples is the unstable sample, a small number of samples are retained;When the classification of corresponding a small number of samples is the boundary sample, a small number of samples are replicated;When the classification of corresponding a small number of samples is the stable sample, a small number of samples are synthesized.
- 5. The unbalanced data classification oversampling method according to claim 4, characterized in that when the classification of the corresponding minority sample is the boundary sample, replicating the minority sample comprises: traversing each minority sample among all minority samples to obtain an increase number h, where the increase number h = |(target minority sample number − unstable sample number) / (number of all minority samples − noise sample number − unstable sample number) − 1|; and replicating the minority sample according to the increase number h.
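The increase number h of claim 5 is a single ratio. Rounding h to an integer copy count is our assumption, since the claim does not say how a fractional h is applied:

```python
def replication_count(target_minority, total_minority, noise_count, unstable_count):
    """h = |(target - unstable) / (total - noise - unstable) - 1|  (claim 5)."""
    return abs((target_minority - unstable_count)
               / (total_minority - noise_count - unstable_count) - 1)

def replicate(sample, h):
    """Return round(h) extra copies of a boundary sample
    (rounding is an assumption, not stated in the claim)."""
    return [sample] * round(h)
```

For example, targeting 100 minority samples from 50 existing ones with 5 noise and 5 unstable samples gives h = |95/40 − 1| = 1.375.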
- 6. The unbalanced data classification oversampling method according to claim 5, characterized in that when the classification of the corresponding minority sample is the stable sample, synthesizing the minority sample comprises: traversing each minority sample among all minority samples to obtain the increase number h, where the increase number h = |(target minority sample number − unstable sample number) / (number of all minority samples − noise sample number − unstable sample number) − 1|; obtaining the average distance d from the stable sample to its k nearest minority-class samples; when the average distance d is less than or equal to a preset value, obtaining the serial number of each sample j_i among the k nearest minority-class samples of the stable sample, where the serial numbers are assigned by sorting in ascending order of the ratio of minority samples to majority samples among each minority sample j_i's own k nearest neighbours, and 1 < i <= k; obtaining the selection probability of the stable sample, where the selection probability is the cube of an arbitrary random number between 0 and 1 multiplied by the serial number of each minority sample j_i; randomly selecting a minority sample j_i according to the selection probability to obtain a selected minority sample j_i; and synthesizing the selected minority sample j_i with the stable sample to obtain a new sample, where the new sample = the stable sample + (the stable sample − the selected minority sample j_i) * a, and a is a generated random number between 0 and 1.
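Claim 6's near-neighbour synthesis (the case d <= preset value) can be sketched as below. The ascending ranking, the cubed-random selection weight, and the outward interpolation new = stable + (stable − selected) · a follow the claim; drawing the neighbour with the largest weight is our reading of "randomly selecting according to the select probability", and all identifiers are ours:

```python
import random

def synthesize_near(stable, neighbors, minority_ratios):
    """One synthesis step of claim 6.

    stable          -- the stable minority sample (list of floats)
    neighbors       -- its k nearest minority-class samples
    minority_ratios -- for each neighbour, the minority/majority ratio
                       among that neighbour's own k nearest neighbours
    """
    # Serial numbers: ascending sort by the minority/majority ratio.
    order = sorted(range(len(neighbors)), key=lambda i: minority_ratios[i])
    # Weight for serial number r+1: (random in [0, 1]) cubed times r+1.
    weights = [(random.random() ** 3) * (r + 1) for r in range(len(order))]
    chosen = neighbors[order[max(range(len(weights)), key=weights.__getitem__)]]
    a = random.random()
    # new sample = stable + (stable - selected) * a
    return [s + (s - c) * a for s, c in zip(stable, chosen)]
```

Note that the interpolation pushes the new point away from the selected neighbour, on the far side of the stable sample, rather than between the two as classic SMOTE does.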
- 7. The unbalanced data classification oversampling method according to claim 6, characterized in that when the classification of the corresponding minority sample is the stable sample, synthesizing the minority sample further comprises: obtaining the average distance d from the stable sample to its k nearest minority-class samples; when the average distance d is greater than the preset value, obtaining the serial number of each sample x_n among the k nearest minority-class samples of the stable sample, where the serial numbers are assigned by sorting in ascending order of the ratio of minority samples to majority samples among each minority sample x_n's own k nearest neighbours, and 1 < n <= k; obtaining the selection probability of the stable sample, where the selection probability is the cube of an arbitrary random number between 0 and 1 multiplied by the serial number of each minority sample x_n; randomly selecting s minority samples x_nj according to the selection probability, where 1 < s <= k and 1 < j <= s; and synthesizing each minority sample x_nj with the stable sample according to a synthesis formula to obtain new samples, in which a_n is a generated random number between 0 and 1, x_i' is the new sample and x_i is the stable sample, where 1 < s <= k.
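Claim 7's synthesis formula did not survive the machine translation (only its variables a_n, x_i' and x_i are legible), so the sketch below substitutes a standard SMOTE-style interpolation x' = x_i + a_n · (x_nj − x_i), stated plainly as our substitution rather than the patent's literal formula:

```python
import random

def synthesize_far(stable, selected_neighbors):
    """Synthesize one new sample per selected neighbour (claim 7 case:
    average neighbour distance above the preset value).  The SMOTE-style
    direction used here is an assumption, not the claimed formula."""
    new_samples = []
    for neighbor in selected_neighbors:
        a = random.random()  # a_n in [0, 1)
        new_samples.append([x + a * (nx - x) for x, nx in zip(stable, neighbor)])
    return new_samples
```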
- 8. An unbalanced data classification oversampling apparatus, characterized by comprising: a minority sample acquisition module for obtaining all minority samples in the unbalanced data to be processed; a majority sample number acquisition module for obtaining, according to the k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample; a category determination module for determining the classification of the corresponding minority sample according to the number of majority samples; and an operation module for carrying out the operation corresponding to the classification of each minority sample.
- 9. An unbalanced data classification oversampling device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the unbalanced data classification oversampling method according to any one of claims 1 to 7.
- 10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the unbalanced data classification oversampling method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810453104.6A CN108647728B (en) | 2018-05-10 | 2018-05-10 | Unbalanced data classification oversampler method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647728A true CN108647728A (en) | 2018-10-12 |
CN108647728B CN108647728B (en) | 2019-04-19 |
Family
ID=63754913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810453104.6A Active CN108647728B (en) | 2018-05-10 | 2018-05-10 | Unbalanced data classification oversampler method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647728B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101980202A (en) * | 2010-11-04 | 2011-02-23 | 西安电子科技大学 | Semi-supervised classification method of unbalance data |
CN102495901A (en) * | 2011-12-16 | 2012-06-13 | 山东师范大学 | Method for keeping balance of implementation class data through local mean |
CN103324939A (en) * | 2013-03-15 | 2013-09-25 | 江南大学 | Deviation classification and parameter optimization method based on least square support vector machine technology |
US20160335548A1 (en) * | 2015-05-12 | 2016-11-17 | Rolls-Royce Plc | Methods and apparatus for predicting fault occurrence in mechanical systems and electrical systems |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241969A (en) * | 2020-01-06 | 2020-06-05 | 北京三快在线科技有限公司 | Target detection method and device and corresponding model training method and device |
CN111259964A (en) * | 2020-01-17 | 2020-06-09 | 上海海事大学 | Over-sampling method for unbalanced data set |
CN111259964B (en) * | 2020-01-17 | 2023-04-07 | 上海海事大学 | Over-sampling method for unbalanced data set |
CN112766394A (en) * | 2021-01-26 | 2021-05-07 | 维沃移动通信有限公司 | Modeling sample generation method and device |
CN112766394B (en) * | 2021-01-26 | 2024-03-12 | 维沃移动通信有限公司 | Modeling sample generation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108647728B (en) | 2019-04-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107301225B (en) | Short text classification method and device | |
CN108647728B (en) | Unbalanced data classification oversampler method, device, equipment and medium | |
WO2021142916A1 (en) | Proxy-assisted evolutionary algorithm-based airfoil optimization method and apparatus | |
CN108647727A (en) | Unbalanced data classification lack sampling method, apparatus, equipment and medium | |
CN108628971A (en) | File classification method, text classifier and the storage medium of imbalanced data sets | |
CN106599935B (en) | Three decision unbalanced data oversampler methods based on Spark big data platform | |
Zhang et al. | 5Ws model for big data analysis and visualization | |
CN111860638A (en) | Parallel intrusion detection method and system based on unbalanced data deep belief network | |
CN109034194A (en) | Transaction swindling behavior depth detection method based on feature differentiation | |
CN109816044A (en) | A kind of uneven learning method based on WGAN-GP and over-sampling | |
CN108230010A (en) | A kind of method and server for estimating ad conversion rates | |
Li et al. | Imbalanced sentiment classification | |
CN108681970A (en) | Finance product method for pushing, system and computer storage media based on big data | |
CN109033148A (en) | One kind is towards polytypic unbalanced data preprocess method, device and equipment | |
Rai et al. | The infinite hierarchical factor regression model | |
Buskirk | Surveying the forests and sampling the trees: An overview of classification and regression trees and random forests with applications in survey research | |
CN110909222B (en) | User portrait establishing method and device based on clustering, medium and electronic equipment | |
CN109871901A (en) | A kind of unbalanced data classification method based on mixing sampling and machine learning | |
CN108694413A (en) | Adaptively sampled unbalanced data classification processing method, device, equipment and medium | |
CN110457577A (en) | Data processing method, device, equipment and computer storage medium | |
CN107944460A (en) | One kind is applied to class imbalance sorting technique in bioinformatics | |
Sahin et al. | A discrete dynamic artificial bee colony with hyper-scout for RESTful web service API test suite generation | |
CN102339278A (en) | Information processing device, information processing method, and program | |
CN104731919A (en) | Wechat public account user classifying method based on AdaBoost algorithm | |
CN110472659A (en) | Data processing method, device, computer readable storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 2022-06-30. Address after: No. 230, Waihuan West Road, Guangzhou University City, Guangzhou 510000. Patentee after: Guangzhou University; National University of Defense Technology. Address before: No. 230, Waihuan West Road, Guangzhou University City, Guangzhou 510000. Patentee before: Guangzhou University.