CN108647728A - Oversampling method, device, equipment and medium for imbalanced data classification - Google Patents


Info

Publication number
CN108647728A
CN108647728A (application CN201810453104.6A)
Authority
CN
China
Prior art keywords
sample
samples
small number
classification
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810453104.6A
Other languages
Chinese (zh)
Other versions
CN108647728B (en)
Inventor
韩伟红
李树栋
王乐
方滨兴
贾焰
黄子中
周斌
殷丽华
田志宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN201810453104.6A
Publication of CN108647728A
Application granted
Publication of CN108647728B
Legal status: Active
Anticipated expiration

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model, based on distances to training or reference patterns
    • G06F 18/24133: Distances to prototypes
    • G06F 18/24143: Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06Q: Information and Communication Technology [ICT] specially adapted for administrative, commercial, financial, managerial or supervisory purposes
    • G06Q 30/00: Commerce
    • G06Q 30/06: Buying, selling or leasing transactions
    • G06Q 30/0601: Electronic shopping [e-shopping]
    • G06Q 30/0609: Buyer or seller confidence or verification

Abstract

The invention discloses an oversampling method for imbalanced data classification, comprising: obtaining all minority samples in the imbalanced data to be processed; obtaining, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample; determining the category of the corresponding minority sample according to the number of majority samples; and performing, according to the category of each minority sample, the operation corresponding to that category. This increases the diversity of the minority samples, avoids the low precision of classification learning algorithms caused by the scarcity of minority-class samples, and solves the minority-sample shortage problem.

Description

Oversampling method, device, equipment and medium for imbalanced data classification
Technical field
The present invention relates to the field of imbalanced big data processing, and in particular to an oversampling method, device, equipment and medium for imbalanced data classification.
Background technology
With continuous advances in technology, including rising Internet speeds, the evolution of the mobile Internet and steady progress in hardware, data acquisition, storage and processing technologies have all developed significantly, and data are growing at an unprecedented rate: we have entered the big data era. Big data is characterized by huge scale (volume), high generation speed (velocity), diverse forms (variety) and data uncertainty (veracity), so traditional data analysis and mining techniques face unprecedented challenges when applied to the big data field.
Data classification is a fundamental algorithm in data analysis and mining; it has a wide range of application fields and underpins many other analysis and mining algorithms. In big data, almost every dataset is imbalanced: an imbalanced dataset is one in which at least one class contains far fewer samples than the other classes. Class imbalance is widespread in the real world, especially in big data applications. In Internet text classification, for example, the data of each class are imbalanced, and the classes of most interest are often the minority ones, such as sensitive information on the network or emerging topics. In electronic commerce, the vast majority of customer transaction and behavior data are normal, yet what matters is e-commerce fraud and abnormal behavior; these data are submerged in a large volume of normal behavior data and form a typical imbalanced dataset. Similar applications include medical diagnosis and satellite remote-sensing data classification. Imbalanced big data classification is therefore a key technical problem that urgently needs to be solved in economic and social development, with broad application prospects.
Because the numbers of samples in different classes differ so greatly, imbalanced big data makes it difficult for traditional classification learning algorithms to achieve good classification results. Fig. 1 shows a prior-art example of imbalanced data classification, in which circles are minority-class samples and triangles are majority-class samples; the imbalance ratio is 3:1, i.e. there are three majority-class samples for every minority-class sample. In real large datasets the imbalance ratio is often 10000:1 or even higher, so the data must first be preprocessed before classification.
Existing preprocessing methods for imbalanced big data mainly comprise oversampling of the minority class and undersampling of the majority class. Oversampling increases the number of minority-class samples by some method or technique, reducing the imbalance of the dataset by adjusting the sample set and thereby increasing the accuracy of the classification algorithm.
Random oversampling performs random sampling of the minority class on the raw dataset D: minority-class samples are randomly selected and replicated to obtain an additional dataset E, and D and E are then merged into a nearly balanced dataset D'. The size of E can be controlled freely, so D' can reach any desired imbalance ratio. In Fig. 2, the circled samples are the minority-class samples chosen for replication by random oversampling.
Heuristic oversampling also replicates minority-class samples and does not create new ones. The difference is that the samples to replicate are chosen selectively rather than at random: samples on the classifier boundary are replicated, increasing their weight in the classifier. Fig. 3 shows the copies selected by boundary-sample oversampling.
When implementing the embodiments of the present invention, the inventors found the following technical problems in the prior art. Because random oversampling picks samples at random, the replicated samples are often of low quality, e.g. noise samples, which lowers the performance of the classification learning algorithm. Heuristic oversampling selects the samples to replicate according to a rule, but it still merely repeats existing minority-class samples; this kind of sampling adds no information and may cause over-fitting during classification learning, i.e. the algorithm learns the training samples so thoroughly that its classification of the training set is nearly ideal while its performance on the test set declines instead, which is usually caused by too few training samples. Although random and heuristic oversampling increase the number of minority-class samples, they only replicate existing samples, so the low precision of classification learning caused by the scarcity of minority-class samples in imbalanced big data classification remains, and the shortage of minority-class samples is not fundamentally solved.
Summary of the invention
In view of the above problems, the purpose of the present invention is to provide an oversampling method for imbalanced data classification.
In a first aspect, the present invention provides an oversampling method for imbalanced data classification, comprising:
obtaining all minority samples in the imbalanced data to be processed;
obtaining, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample;
determining the category of the corresponding minority sample according to the number of majority samples;
performing, according to the category of each minority sample, the operation corresponding to that category.
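The counting step above can be sketched in plain Python. This is an illustrative sketch only: the claim fixes neither a distance metric nor a data representation, so Euclidean distance on coordinate tuples is an assumption here.

```python
import math

def majority_neighbor_counts(minority, majority, k):
    """For each minority sample, count the majority samples among its k
    nearest neighbors in the combined dataset. Euclidean distance is an
    assumption; the patent only names the k-nearest-neighbor algorithm."""
    labeled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    counts = []
    for x in minority:
        # Distance from x to every other sample, tagged 1 if majority.
        others = sorted(
            (math.dist(x, p), is_maj) for p, is_maj in labeled if p is not x
        )
        counts.append(sum(is_maj for _, is_maj in others[:k]))
    return counts
```

A minority point deep inside the majority cluster receives a high majority-neighbor count, which the later implementations use to place it in a category.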
In a first possible implementation of the first aspect, determining the category of the corresponding minority sample according to the number of majority samples comprises:
comparing the number of majority samples with preset thresholds to determine the category of the corresponding minority sample, wherein the categories include noise sample, boundary sample, unstable sample and stable sample.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the preset thresholds include a preset first threshold n, a preset second threshold p and a preset third threshold q,
and comparing the number of majority samples with the preset thresholds to determine the category of the corresponding minority sample comprises:
when the number of majority samples is greater than or equal to the preset first threshold n, the category of the corresponding minority sample is the noise sample, wherein the value range of the preset first threshold n is 2k/3 <= n <= k;
when the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the category of the corresponding minority sample is the unstable sample, wherein the value range of the preset second threshold p is k/2 <= p < n;
when the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the category of the corresponding minority sample is the boundary sample, wherein the value range of the preset third threshold q is k/3 <= q < p;
when the number of majority samples is less than the preset third threshold q, the category of the corresponding minority sample is the stable sample.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing, according to the category of each minority sample, the operation corresponding to that category comprises:
when the category of the corresponding minority sample is the noise sample, deleting the minority sample;
when the category of the corresponding minority sample is the unstable sample, retaining the minority sample;
when the category of the corresponding minority sample is the boundary sample, replicating the minority sample;
when the category of the corresponding minority sample is the stable sample, synthesizing from the minority sample.
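Taken together, the four per-category operations amount to a dispatch over the minority set. This sketch takes the replication count and a synthesis routine as parameters, since both are only defined in the later implementations; how they are computed is covered there.

```python
def apply_operations(samples, categories, h, synthesize):
    """Delete noise, keep unstable, add h copies of each boundary sample,
    and add synthesized samples for each stable sample (illustrative)."""
    out = []
    for x, cat in zip(samples, categories):
        if cat == "noise":
            continue          # noise samples are deleted
        out.append(x)         # every non-noise sample is retained
        if cat == "boundary":
            out.extend([x] * h)
        elif cat == "stable":
            out.extend(synthesize(x, h))
    return out
```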
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, replicating the minority sample when the category of the corresponding minority sample is the boundary sample comprises:
traversing and examining each minority sample among all the minority samples and obtaining an increase count h, wherein h = |(target minority sample count - unstable sample count) / (count of all minority samples - noise sample count - unstable sample count) - 1|;
replicating the minority sample according to the increase count h.
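Read literally, the increase count h is a single ratio over the whole minority set. A worked example of the formula follows; truncating h to an integer copy count is an assumption, since the text does not say how fractional values are handled.

```python
def increase_count(target, total, noise, unstable):
    """h = |(target minority count - unstable count)
           / (total minority count - noise count - unstable count) - 1|."""
    h = abs((target - unstable) / (total - noise - unstable) - 1)
    return int(h)  # assumed: truncate to a whole number of copies
```

For example, with a target of 100 minority samples, 40 existing minority samples, 5 of them noise and 5 unstable: (100 - 5) / (40 - 5 - 5) - 1 = 95/30 - 1, so h = 2 after truncation.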
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, synthesizing from the minority sample when the category of the corresponding minority sample is the stable sample comprises:
traversing and examining each minority sample among all the minority samples and obtaining the increase count h, wherein h = |(target minority sample count - unstable sample count) / (count of all minority samples - noise sample count - unstable sample count) - 1|;
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is less than or equal to a preset value, obtaining the rank of each minority sample j_i among the k minority-class samples nearest the stable sample, wherein the ranks are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbors of each minority sample j_i, and 1 < i <= k;
obtaining the selection probability, wherein the selection probability is the cube of a random number between 0 and 1 multiplied by the rank of each minority sample j_i, and 1 < i <= k;
randomly selecting a minority sample j_i according to the selection probability, to obtain the selected minority sample j_i;
synthesizing the selected minority sample j_i with the stable sample to obtain a new sample, wherein new sample = stable sample + (stable sample - selected minority sample j_i) * a, and a is a generated random number between 0 and 1.
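A sketch of the single-neighbor synthesis step. Two details are assumptions not fixed by the text: the weighted pick is implemented by taking the neighbor whose rank, scaled by an independent cubed random number, is largest, and the arithmetic is done coordinate-wise on tuples. The extrapolation formula itself, new = stable + (stable - selected) * a, is taken directly from the text.

```python
import random

def synthesize_one(stable, neighbors, ranks, rng=random):
    """Pick one of the k nearest minority neighbors with weight
    (random in [0,1])**3 * rank, then extrapolate from the stable sample:
    new = stable + (stable - selected) * a, with a uniform in [0, 1]."""
    weights = [rng.random() ** 3 * r for r in ranks]
    selected = max(zip(weights, neighbors))[1]  # assumed weighted pick
    a = rng.random()
    return tuple(s + (s - c) * a for s, c in zip(stable, selected))
```

Note the sign: unlike SMOTE-style interpolation toward a neighbor, this formula moves away from the selected neighbor, so the new sample lies on the far side of the stable sample.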
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, synthesizing from the minority sample when the category of the corresponding minority sample is the stable sample further comprises:
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtaining the rank of each minority sample x_n among the k minority-class samples nearest the stable sample, wherein the ranks are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbors of each minority sample x_n, and 1 < n <= k;
obtaining the selection probability, wherein the selection probability is the cube of a random number between 0 and 1 multiplied by the rank of each minority sample x_n, and 1 < n <= k;
randomly selecting s minority samples x_nj according to the selection probability, wherein 1 < s <= k and 1 < j <= s;
synthesizing each minority sample x_nj with the stable sample to obtain a new sample according to a synthesis formula,
wherein a_n is a random number between 0 and 1 generated for each selected sample, x_i' is the new sample, x_i is the stable sample, and 1 < s <= k.
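The multi-sample synthesis formula appears only as an image in the original patent and does not survive in this text, so the code below is not the patented formula. It is offered purely as a hypothetical stand-in that averages the single-neighbor rule of the fifth implementation over the s chosen samples, reducing to that rule when s = 1, and using only the variables the text does define (a_n, x_i, x_i').

```python
def synthesize_multi(stable, chosen, rand):
    """HYPOTHETICAL stand-in for the omitted formula: average the
    single-neighbor extrapolation stable + a_n * (stable - x_nj)
    over the s chosen samples. Not taken from the source."""
    s = len(chosen)
    return tuple(
        x + sum(a * (x - c[d]) for a, c in zip(rand, chosen)) / s
        for d, x in enumerate(stable)
    )
```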
In a second aspect, the present invention further provides an oversampling device for imbalanced data classification, comprising:
a minority sample acquisition module for obtaining all minority samples in the imbalanced data to be processed;
a majority sample count acquisition module for obtaining, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample;
a category determination module for determining the category of the corresponding minority sample according to the number of majority samples;
an operation module for performing, according to the category of each minority sample, the operation corresponding to that category.
In a third aspect, an embodiment of the present invention further provides oversampling equipment for imbalanced data classification, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the oversampling method for imbalanced data classification described in any of the above when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein when the computer program runs, the equipment on which the computer-readable storage medium resides is controlled to execute the oversampling method for imbalanced data classification described in any of the above.
The above technical solution has the following advantages. The number of majority samples among the k nearest neighbors of each minority sample is obtained with the k-nearest-neighbor algorithm; the category of the corresponding minority sample is determined from that number; and the operation corresponding to the category of each minority sample is performed. When addressing the low precision of classification learning caused by scarce minority-class samples in imbalanced big data classification, this avoids applying the same treatment to all minority samples, i.e. only replicating samples or only synthesizing new ones. By dividing the minority samples of the data to be processed into categories and performing different operations on different categories, the differentiated treatment increases the diversity of the minority samples, avoids the low precision of classification learning caused by minority-class scarcity, and solves the minority-sample shortage problem.
Description of the drawings
Fig. 1 is an example diagram of imbalanced data classification in the prior art;
Fig. 2 is an example diagram of random oversampling in the prior art;
Fig. 3 is an example diagram of boundary-sample oversampling in the prior art;
Fig. 4 is a flow diagram of the oversampling method for imbalanced data classification provided by the first embodiment of the invention;
Fig. 5 is a schematic diagram of obtaining the k nearest samples, provided by the first embodiment of the invention;
Fig. 6 is an example diagram of a synthesis method in the prior art;
Fig. 7 is a structural schematic diagram of an oversampling device for imbalanced data classification provided by the fifth embodiment of the invention;
Fig. 8 is a structural schematic diagram of oversampling equipment for imbalanced data classification provided by the sixth embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 4, the flow diagram of the oversampling method for imbalanced data classification provided by the first embodiment of the invention.
It should be noted that when generating minority-class samples, existing methods apply the same treatment to all of them, either replicating samples or synthesizing new ones; they all handle the imbalanced data in a single way. Such simple repetition of minority samples yields newly added samples of low quality and little value, and easily causes over-fitting during classification learning, i.e. the algorithm learns the training samples so thoroughly that its classification of the training set is nearly ideal while its performance on the test set declines instead. The low precision of classification learning caused by the scarcity of minority-class samples in imbalanced big data classification remains, and the shortage of minority-class samples cannot be fundamentally solved.
The oversampling method for imbalanced data classification provided by this embodiment can be executed by a terminal device, including but not limited to a mobile phone, a laptop, a tablet computer, a desktop computer, and the like.
The oversampling method for imbalanced data classification is as follows.
S11: obtain all minority samples in the imbalanced data to be processed.
It should be noted that in the embodiments of the present invention, when processing the minority samples in the imbalanced data, and given that in real large datasets the imbalance ratio between majority and minority samples is often 10000:1 or even higher, all minority samples in the data to be processed are obtained first in order to improve the quality of the newly added minority samples.
S12: obtain, according to the k-nearest-neighbor algorithm, the number of majority samples among the k nearest neighbors of each minority sample.
It should be noted that k is an integer greater than 1, determined according to the actual situation; the present invention does not specifically limit it. The setting of k affects the performance of this method: as k increases, the performance tends to decline, while too small a k lowers the accuracy. In general, a value of k between 5 and 10 is reasonable; the present invention does not specifically limit this.
Specifically, referring to Fig. 5, the triangles in the figure are majority samples and the circles are minority samples, with the minority sample M circled by the rectangle used as the illustration. Assuming k is 4, the 4 nearest neighbors of the minority sample M are circled, and the number of majority samples among those four samples is 2.
S13: determine the category of the corresponding minority sample according to the number of majority samples.
In the embodiments of the present invention, the category of the corresponding minority sample is determined according to the number of majority samples, wherein the categories include noise sample, boundary sample, unstable sample and stable sample.
It should be noted that determining the category of a minority sample in this embodiment actually determines the property of that sample within the minority sample set of the imbalanced data to be processed, so that the corresponding operation can be performed on it as needed, ensuring that the processed data finally achieve the desired effect.
It should be noted that a minority sample that interferes is a noise sample; for example, in one situation the overwhelming majority of its neighbor samples are majority samples, i.e. the majority samples far outnumber the minority samples, and the minority sample is then the noise sample. A minority sample lying between the minority cluster and the majority cluster is a boundary sample; for example, in one situation the numbers of majority and minority samples among its neighbors are comparable, and the minority sample is then the boundary sample. A minority sample that belongs to the minority class but is unstable is an unstable sample; for example, in one situation the majority samples among its neighbors outnumber the minority samples, and the minority sample is then the unstable sample. A minority sample lying entirely within the minority cluster is a stable sample; for example, in one situation the majority samples among its neighbors are far fewer than the minority samples, i.e. the overwhelming majority of its neighbors are minority samples, and the minority sample is then the stable sample.
S14: perform, according to the category of each minority sample, the operation corresponding to that category.
It should be noted that in the embodiments of the present invention, the minority samples in the imbalanced data to be processed are oversampled to increase their diversity; during oversampling, a different operation is performed according to the category of each minority sample, wherein the operations include deletion, retention, replication and synthesis.
It should be noted that each minority sample corresponds to one and only one operation, i.e. each category has one and only one corresponding operation. Assuming the categories are b1, b2, b3 and b4, each of b1, b2, b3 and b4 has one corresponding operation; for example, b1 corresponds to retention, b2 to deletion, b3 also to deletion and b4 to synthesis. The present invention does not specifically limit this.
Specifically, all minority samples in the imbalanced data to be processed are obtained as a minority sample set A = [a1, a2, ..., an], where n is the number of all minority samples. Assuming the minority sample a1 is a noise sample, a1 needs to be deleted; however, if some preprocessing stipulates that noise samples are retained, then a1 is retained. Assuming the minority sample an is an unstable sample and some preprocessing stipulates that unstable samples are deleted, then an is deleted. The present invention does not specifically limit this.
Implementing this embodiment has the following beneficial effects:
By obtaining all minority samples in the imbalanced data to be processed, obtaining with the k-nearest-neighbor algorithm the number of majority samples among the k nearest neighbors of each minority sample, determining the category of the corresponding minority sample from that number, and performing the operation corresponding to the category of each minority sample, this embodiment avoids applying the same treatment to all minority samples, i.e. only replicating samples or only synthesizing new ones, which yields newly added minority samples of low quality. Different categories of minority samples undergo different operations, which increases the diversity of the processing and hence the diversity of the minority samples, improves the quality of the newly added minority samples, avoids the low precision of classification learning caused by the scarcity of minority-class samples, and solves the minority-sample shortage problem.
Embodiment two
On the basis of embodiment one,
determining the category of the corresponding minority sample according to the number of majority samples comprises:
comparing the number of majority samples with preset thresholds to determine the category of the corresponding minority sample, wherein the categories include noise sample, boundary sample, unstable sample and stable sample.
In embodiments of the present invention, the predetermined threshold value is to be set according to actual conditions.Specifically, the default threshold Value includes preset first threshold value n, presets second threshold p and default third threshold value q,
Comparing the number of majority samples with the predetermined thresholds to determine the class of the corresponding minority sample then includes:
When the number of majority samples is greater than or equal to the preset first threshold n, the class of the corresponding minority sample is the noise sample, where the preset first threshold n has a value range of 2k/3 <= n <= k.
In this embodiment, the preset first threshold n is the threshold for judging whether a minority sample is a noise sample. Its value range 2k/3 <= n <= k is a preferred range of this embodiment of the present invention, obtained through extensive experiments as a reasonable range for identifying noise samples.
When the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the class of the corresponding minority sample is the unstable sample, where the preset second threshold p has a value range of k/2 <= p < n.
In this embodiment, the preset second threshold p is the threshold for judging whether a minority sample is an unstable sample. Its value range k/2 <= p < n is a preferred range of this embodiment, obtained through extensive experiments as a reasonable range for identifying unstable samples.
When the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the class of the corresponding minority sample is the boundary sample, where the preset third threshold q has a value range of k/3 <= q < p.
In this embodiment, the preset third threshold q is the threshold for judging whether a minority sample is a boundary sample. Its value range k/3 <= q < p is a preferred range of this embodiment, obtained through extensive experiments as a reasonable range for identifying boundary samples.
When the number of majority samples is less than the preset third threshold q, the class of the corresponding minority sample is the stable sample.
It should be noted that the preset first threshold n, the preset second threshold p and the preset third threshold q are thresholds, obtained through extensive experiments, for reasonably distinguishing the different classes of samples. The specific values of n, p and q can be set as required, and the present invention does not specifically limit them.
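The threshold comparison described above can be sketched in a few lines. This is an illustrative outline only: the function name is invented here, and the defaults simply take the lower ends of the stated ranges 2k/3, k/2 and k/3 when no thresholds are supplied.

```python
def classify_minority_sample(majority_count, k, n=None, p=None, q=None):
    """Assign a category to a minority sample from the number of majority
    samples among its k nearest neighbours, using thresholds n > p > q."""
    # Defaults: the lower ends of the ranges given in the specification,
    # 2k/3 <= n <= k,  k/2 <= p < n,  k/3 <= q < p.
    if n is None:
        n = 2 * k / 3
    if p is None:
        p = k / 2
    if q is None:
        q = k / 3
    assert 2 * k / 3 <= n <= k and k / 2 <= p < n and k / 3 <= q < p
    if majority_count >= n:
        return "noise"      # almost surrounded by the majority class
    if majority_count >= p:
        return "unstable"
    if majority_count >= q:
        return "boundary"
    return "stable"         # fewer than q majority neighbours
```

For example, with k = 9 the defaults become n = 6, p = 4.5, q = 3, so a minority sample with 7 majority neighbours is classed as noise and one with 3 as a boundary sample.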
Performing, according to the class of each minority sample, the operation corresponding to that class then includes:
When the class of the corresponding minority sample is the noise sample, the minority sample is deleted.
It should be noted that deleting the noise samples improves the quality of the newly added samples and reduces the noise they would otherwise introduce into subsequent data processing.
When the class of the corresponding minority sample is the unstable sample, the minority sample is retained.
It should be noted that unstable samples are retained rather than deleted in order to increase the diversity of the minority samples, so that the minority samples better reflect the true data distribution.
When the class of the corresponding minority sample is the boundary sample, the minority sample is replicated.
It should be noted that boundary samples lie on the class boundary of the imbalanced big data to be processed. They are more valuable, as they best embody the distinguishing characteristics between the majority class and the minority class; therefore the minority-class samples at the class boundary are selected for processing, i.e. the samples on the classifier boundary are replicated, enhancing their weight in the imbalanced big data to be processed.
When the class of the corresponding minority sample is the stable sample, new samples are synthesized from the minority sample.
It should be noted that, in order to increase the number of minority samples, synthesizing from the stable samples can solve the problem of over-fitting.
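The four per-class operations above (delete noise, retain unstable, replicate boundary, synthesize from stable) can be sketched as a single dispatch routine. This is an illustrative outline only: the replication count and the synthesis routine are supplied by the caller rather than being the full procedures of the later embodiments.

```python
def process_minority_samples(labelled, synthesize, copies=1):
    """labelled: list of (sample, category) pairs; synthesize: callable that
    produces one new sample from a stable one; copies: how many replicas a
    boundary sample receives (the increase count h in the later embodiment)."""
    out = []
    for sample, category in labelled:
        if category == "noise":
            continue                         # noise samples are deleted
        out.append(sample)                   # all other categories keep the original
        if category == "boundary":
            out.extend([sample] * copies)    # boundary samples are replicated
        elif category == "stable":
            out.append(synthesize(sample))   # stable samples spawn a synthetic sample
    return out
```

For example, feeding one sample of each category through the routine drops the noise sample, keeps the unstable one, duplicates the boundary one and appends a synthetic neighbour of the stable one.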
Implementing this embodiment has the following beneficial effects:
After the minority samples have been accurately classified, the number of majority samples among the k nearest neighbours of each minority sample is compared with the predetermined thresholds, which are set as different conditions for judging the different classes of minority samples; different processing is then applied to each class of minority sample, effectively improving the classification accuracy in imbalanced big data.
Embodiment three
On the basis of embodiment one and embodiment two,
When the class of the corresponding minority sample is the boundary sample, replicating the minority sample includes:
traversing each minority sample among all the minority samples to obtain an increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|;
replicating the minority sample h times.
Specifically, after each minority sample among all the minority samples has been traversed, the noise samples deleted and the unstable samples retained, what remains are the as-yet unprocessed boundary samples and stable samples, i.e. the minority-sample data set with the noise samples removed and the unstable samples (which cannot be used for synthesis or replication) set aside. It is first necessary to compute how many samples still need to be added, i.e. the increase count h defined above. Here, the target minority sample count is the number of minority samples finally desired after the imbalanced data classification over-sampling has been applied to the pending imbalanced data; the unstable sample count is the number of unstable samples found after traversing all the minority samples; the count of all minority samples is the number of minority samples initially obtained from the pending imbalanced data; and the noise sample count is the number of noise samples found after traversing all the minority samples. Suppose the target minority sample count is 20000, the count of all minority samples is 5000, the noise sample count is 500 and the unstable sample count is 500; then h = |(20000 − 500)/(5000 − 500 − 500) − 1| = |4.875 − 1| = 3.875, rounded down to 3. When a minority sample is a boundary sample, it is replicated h times: for example, for a minority sample c with an increase count h of 3, replication yields 4 copies of the minority sample c (the original plus 3 replicas).
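The increase count h can be computed directly from the four sample counts. The sketch below reproduces the worked example above, rounding the result down as the text does; the function name is illustrative.

```python
import math

def increase_count(target, total_minority, noise, unstable):
    """h = |(target - unstable) / (total_minority - noise - unstable) - 1|,
    rounded down, as in the worked example of the specification."""
    h = abs((target - unstable) / (total_minority - noise - unstable) - 1)
    return math.floor(h)
```

With the numbers from the text (target 20000, 5000 minority samples, 500 noise, 500 unstable) this gives floor(|4.875 − 1|) = 3.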
Implementing this embodiment has the following beneficial effects:
The boundary samples in the pending imbalanced data are replicated. A boundary sample is a sample lying on the classifier boundary; it is more valuable, as it best embodies the distinguishing characteristics between the majority class and the minority class. Therefore the minority-class samples at the class boundary are selected for processing, i.e. the samples on the classifier boundary are replicated, enhancing their weight in the classifier and thereby improving the classification accuracy for the minority samples in imbalanced big data.
Embodiment Four
It should be noted that when the prior art synthesizes a new sample for a minority sample x using Euclidean distance, its k minority-class neighbours are, say, x1, x2, x3 and x4. If one of these 4 minority-class neighbours is selected at random, each has the same probability of being chosen; yet, as shown in Fig. 6, x3 lies among the majority-class samples and is most likely noise. If x3 happens to be chosen, the newly synthesized sample is likely to be noise as well, which not only fails to strengthen the minority class but also introduces more noise.
In this embodiment, by contrast, all the minority samples are candidates for synthesis. When the class of the corresponding minority sample is the stable sample, synthesizing from the minority sample includes:
traversing each minority sample among all the minority samples to obtain the increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|;
obtaining the average distance d from the stable sample to its k nearest minority-class samples.
Specifically, suppose a minority sample e is a stable sample and k is 4; the average distance from the 4 nearest-neighbour samples of the stable sample to it is then obtained. If the 4 nearest neighbours of the stable sample are o1, o2, o3 and o4, and the distances from o1, o2, o3 and o4 to the stable sample (i.e. the minority sample e) are respectively 10, 20, 30 and 20, then the average distance is (10 + 20 + 30 + 20)/4 = 20, where the distances to the stable sample (the minority sample e) are Euclidean distances.
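The average distance d of the worked example can be computed as follows. This is a sketch assuming plain Euclidean distance and that the sample itself is not contained in the candidate neighbour list; the function name is illustrative.

```python
import math

def avg_knn_distance(sample, minority_samples, k):
    """Mean Euclidean distance from `sample` to its k nearest minority-class
    neighbours (the average distance d used to pick the synthesis strategy)."""
    dists = sorted(math.dist(sample, other) for other in minority_samples)
    return sum(dists[:k]) / k
```

With one-dimensional points at distances 10, 20, 30, 20 (and one far point that is not among the 4 nearest), the routine returns the 20 of the worked example.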
When the average distance d is less than or equal to a preset value, the rank of each minority sample j_i among the k nearest minority-class samples of the stable sample is obtained, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample j_i, with 1 <= i <= k.
Specifically, for a stable sample f with k = 4, let the k nearest minority samples be j1, j2, j3 and j4, where the ratio of minority to majority samples among the neighbours of j1 is 2/2, of j2 is 3/1, of j3 is 1/3 and of j4 is 1/3. Sorted in ascending order, the ranks of j1, j2, j3 and j4 are then j3 = 1, j4 = 1, j1 = 2, j2 = 3.
The select probability of each neighbour j_i is then obtained, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample j_i, with 1 <= i <= k.
Specifically, with the ranks of j1, j2, j3 and j4 in ascending order being j3 = 1, j4 = 1, j1 = 2, j2 = 3, and the random numbers drawn for j1, j2, j3 and j4 being respectively 0.6, 0.5, 0.3 and 0.8, the select probabilities of j1, j2, j3 and j4 are respectively: j1: 0.6³ × 2 = 0.432; j2: 0.5³ × 3 = 0.375; j3: 0.3³ × 1 = 0.027; j4: 0.8³ × 1 = 0.512.
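The select probabilities of the worked example can be reproduced as follows. The injectable random source is an illustrative convenience for making the example deterministic; in use, a fresh uniform random number would be drawn per candidate.

```python
import random

def select_probabilities(ranks, rng=random.random):
    """probability_i = u_i**3 * rank_i, where u_i is a fresh uniform random
    number in [0, 1) and rank_i is the ascending rank of neighbour j_i."""
    return [rng() ** 3 * rank for rank in ranks]
```

Feeding the ranks (2, 3, 1, 1) with the random numbers 0.6, 0.5, 0.3, 0.8 from the text yields 0.432, 0.375, 0.027, 0.512.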
A minority sample j_i is then selected at random according to the select probabilities, giving the selected minority sample j_i.
It should be noted that the candidate with the largest select probability is not necessarily the one chosen: a larger select probability only makes the corresponding minority sample more likely to be selected, and a minority sample with a small select probability may still be chosen.
A new sample is then synthesized from the selected minority sample j_i and the stable sample, so as to obtain the new sample, where new sample = stable sample + (stable sample − selected minority sample j_i) × a, and a is a generated random number between 0 and 1.
It should be noted that when the average distance d is less than or equal to the preset value, the minority-class sample is very close to the surrounding minority-class samples; the selection among the surrounding minority-class samples is therefore based on the similarity of the minority samples in feature space, and one neighbouring minority sample is selected from the surrounding minority-class samples to synthesize a new sample with it.
Specifically, in the case k = 5, suppose x_i² is randomly selected from the 5 minority-class samples x_i¹, x_i², x_i³, x_i⁴, x_i⁵ nearest to x_i for new-sample synthesis. This approach both avoids the over-fitting problem and increases the weight of the minority-class samples, so that the classifier can tilt towards the minority class during learning, improving the classification performance for minority samples.
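The dense-case synthesis step above can be sketched as follows. Note that, as written in the text, new sample = stable + (stable − selected) × a extrapolates away from the selected neighbour rather than interpolating towards it; the sketch reproduces the formula as stated, and the function name is illustrative.

```python
import random

def synthesize_dense(stable, selected, a=None):
    """new = stable + (stable - selected) * a, with a uniform in [0, 1),
    applied component-wise to feature vectors."""
    if a is None:
        a = random.random()
    return [s + (s - t) * a for s, t in zip(stable, selected)]
```

For example, with stable = [1.0, 2.0], selected = [0.0, 0.0] and a fixed a = 0.5, the new sample is [1.5, 3.0].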
Preferably, when the class of the corresponding minority sample is the stable sample, synthesizing from the minority sample includes:
obtaining the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtaining the rank of each minority sample x_n among the k nearest minority-class samples of the stable sample, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample x_n, with 1 <= n <= k;
obtaining the select probability of each candidate, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample x_n, with 1 <= n <= k;
randomly selecting s minority samples x_nj according to the select probabilities, where 1 < s <= k and 1 <= j <= s;
synthesizing a new sample from each selected minority sample x_nj together with the stable sample according to the synthesis formula, in which a_n is a generated random number between 0 and 1, x_i' is the new sample and x_i is the stable sample, with 1 < s <= k.
Specifically, if s = 3, three neighbour samples are selected to generate the new sample together with the original sample; for example, with a1 = 0.2, a2 = 0.8 and a3 = 0.4, the newly generated sample is obtained by the same formula.
It should be noted that when the average distance d is greater than the preset value, the minority-class sample is very loosely distributed relative to the surrounding minority-class samples; s samples are then selected from the surrounding minority-class samples, where s can be set as required subject to 1 < s <= k, to synthesize the new sample with them. That is, for a sample far from the surrounding minority-class samples, several samples are preferably combined to generate the new sample, so as to avoid the large deviation from the original data that could arise if only a single sample were used for synthesis.
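The sparse-case synthesis formula itself does not survive in this text, so the sketch below uses an assumed form, x' = x + (1/s) · Σ aₙ(xₙ − x), which is consistent with the stated symbols (one random weight aₙ in [0, 1) per selected neighbour, s neighbours combined with the stable sample) but is not necessarily the exact formula of the specification.

```python
import random

def synthesize_sparse(stable, neighbours, rng=random.random):
    """Assumed form of the multi-neighbour synthesis:
    x' = x + (1/s) * sum_n a_n * (x_n - x), one random weight per neighbour.
    `stable` and each entry of `neighbours` are feature vectors."""
    s = len(neighbours)
    weights = [rng() for _ in range(s)]   # a_1 .. a_s in [0, 1)
    return [
        x + sum(a * (nb[d] - x) for a, nb in zip(weights, neighbours)) / s
        for d, x in enumerate(stable)
    ]
```

With s = 3, weights 0.2, 0.8, 0.4 and one-dimensional neighbours at 1, 2, 3 around a stable sample at 0, this assumed form gives (0.2 + 1.6 + 1.2)/3 = 1.0.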
Implementing this embodiment has the following beneficial effects:
In the course of synthesizing new samples, existing methods randomly select one neighbour sample to synthesize a new sample with an existing sample, which is very likely to introduce noise samples or majority samples. Here, different synthesis methods are adopted according to the different distribution characteristics of the minority samples. For densely distributed minority samples, one neighbour sample is selected to synthesize a new sample with the minority sample, and when selecting, a neighbour surrounded by more majority-class samples has a lower probability of being chosen. For sparsely distributed samples, s samples are selected for synthesis, avoiding the case where a sparsely distributed sample and a single nearby sample synthesize a new sample that deviates markedly from normal values. The newly synthesized samples thus better conform to the sample distribution characteristics.
Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an imbalanced data classification over-sampling device provided by a fifth embodiment of the present invention, including:
a minority sample acquisition module 71, configured to obtain all minority samples in pending imbalanced data;
a majority sample count acquisition module 72, configured to obtain, according to a k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
a class determination module 73, configured to determine the class of the corresponding minority sample according to the number of majority samples;
an operation module 74, configured to perform, according to the class of each minority sample, the operation corresponding to that class.
Preferably, the class determination module 73 includes:
a class determination unit, configured to compare the number of majority samples with predetermined thresholds to determine the class of the corresponding minority sample, wherein the classes include the noise sample, the boundary sample, the unstable sample and the stable sample.
Preferably, the predetermined thresholds include a preset first threshold n, a preset second threshold p and a preset third threshold q, and the class determination unit is configured such that:
when the number of majority samples is greater than or equal to the preset first threshold n, the class of the corresponding minority sample is the noise sample, where the preset first threshold n has a value range of 2k/3 <= n <= k;
when the number of majority samples is less than the preset first threshold n and greater than or equal to the preset second threshold p, the class of the corresponding minority sample is the unstable sample, where the preset second threshold p has a value range of k/2 <= p < n;
when the number of majority samples is less than the preset second threshold p and greater than or equal to the preset third threshold q, the class of the corresponding minority sample is the boundary sample, where the preset third threshold q has a value range of k/3 <= q < p;
when the number of majority samples is less than the preset third threshold q, the class of the corresponding minority sample is the stable sample.
Preferably, the operation module 74 includes:
a deletion unit, configured to delete the minority sample when the class of the corresponding minority sample is the noise sample;
a retention unit, configured to retain the minority sample when the class of the corresponding minority sample is the unstable sample;
a replication unit, configured to replicate the minority sample when the class of the corresponding minority sample is the boundary sample;
a synthesis unit, configured to synthesize new samples from the minority sample when the class of the corresponding minority sample is the stable sample.
Preferably, the replication unit is configured to:
traverse each minority sample among all the minority samples to obtain the increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|; and
replicate the minority sample h times.
Preferably, the synthesis unit is configured to:
traverse each minority sample among all the minority samples to obtain the increase count h, where h = |(target minority sample count − unstable sample count)/(count of all minority samples − noise sample count − unstable sample count) − 1|;
obtain the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is less than or equal to a preset value, obtain the rank of each minority sample j_i among the k nearest minority-class samples of the stable sample, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample j_i, with 1 <= i <= k;
obtain the select probability of each neighbour j_i, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample j_i, with 1 <= i <= k;
randomly select a minority sample j_i according to the select probabilities, giving the selected minority sample j_i; and
synthesize a new sample from the selected minority sample j_i and the stable sample, where new sample = stable sample + (stable sample − selected minority sample j_i) × a, and a is a generated random number between 0 and 1.
Preferably, the synthesis unit is further configured to:
obtain the average distance d from the stable sample to its k nearest minority-class samples;
when the average distance d is greater than the preset value, obtain the rank of each minority sample x_n among the k nearest minority-class samples of the stable sample, where the rank is assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample x_n, with 1 <= n <= k;
obtain the select probability of each candidate, where select probability = (the cube of a random number between 0 and 1) × the rank of the minority sample x_n, with 1 <= n <= k;
randomly select s minority samples x_nj according to the select probabilities, where 1 < s <= k and 1 <= j <= s; and
synthesize a new sample from each selected minority sample x_nj together with the stable sample according to the synthesis formula, in which a_n is a generated random number between 0 and 1, x_i' is the new sample and x_i is the stable sample, with 1 < s <= k.
Implementing this embodiment has the following beneficial effects:
The number of majority samples among the k nearest neighbours of each minority sample is obtained according to a k-nearest-neighbour algorithm; the class of the corresponding minority sample is determined from that number; and the operation corresponding to the class of each minority sample is performed. When dealing with the problem of low precision of classification learning algorithms caused by the scarcity of minority-class samples in imbalanced big data, this avoids applying one and the same processing method to all minority samples, i.e. merely replicating samples or merely synthesizing new ones. By dividing the minority samples of the pending imbalanced data into classes and performing a different operation for each class, the diversity of the minority samples is increased through this differentiated processing, the low precision of classification learning algorithms caused by scarce minority-class samples is avoided, and the shortage of minority-class samples is resolved.
Referring to Fig. 8, Fig. 8 is a schematic diagram of an imbalanced data classification over-sampling apparatus provided by a sixth embodiment of the present invention, for executing the imbalanced data classification over-sampling method provided by embodiments of the present invention. As shown in Fig. 8, the imbalanced data classification over-sampling apparatus includes: at least one processor 11, such as a CPU; at least one network interface 14 or other user interface 13; a memory 15; and at least one communication bus 12, the communication bus 12 being used to realize the connection and communication between these components. The user interface 13 may optionally include a USB interface, other standard interfaces and wired interfaces. The network interface 14 may optionally include a Wi-Fi interface and other wireless interfaces. The memory 15 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk memory. The memory 15 may optionally include at least one storage device located remotely from the aforementioned processor 11.
In some embodiments, memory 15 stores following element, executable modules or data structures, or Their subset or their superset:
an operating system 151, including various system programs, for realizing various basic services and processing hardware-based tasks;
Program 152.
Specifically, the processor 11 is used to call the program 152 stored in the memory 15 and execute the imbalanced data classification over-sampling method described in the above embodiments.
The so-called processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control centre of the apparatus implementing the imbalanced data classification over-sampling method, and connects the various parts of the whole apparatus using various interfaces and lines.
The memory can be used to store the computer program and/or modules; the processor realizes the various functions of the electronic device for imbalanced data classification over-sampling by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system and the application programs required by at least one function (such as a sound-playing function, a text-conversion function, etc.), and the data storage area can store data created according to use (such as audio data, text message data, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk memory, a flash device, or another volatile solid-state memory device.
If the modules of the adaptively sampled imbalanced data classification are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the present invention realizes all or part of the flow of the above embodiment methods, which can also be completed by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be realized. The computer program includes computer program code, which can be in source code form, object code form, an executable file, certain intermediate forms, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules therein can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between modules indicates that there is a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are preferred embodiments of the present invention. It should be pointed out that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.
It should be noted that, in the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related description of other embodiments. Secondly, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Claims (10)

  1. An imbalanced data classification over-sampling method, characterized by comprising:
    obtaining all minority samples in pending imbalanced data;
    obtaining, according to a k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
    determining the class of the corresponding minority sample according to the number of majority samples;
    performing, according to the class of each minority sample, an operation corresponding to the class.
  2. The imbalanced data classification over-sampling method according to claim 1, characterized in that determining the class of the corresponding minority sample according to the number of majority samples comprises:
    comparing the number of majority samples with predetermined thresholds to determine the class of the corresponding minority sample, wherein the classes include a noise sample, a boundary sample, an unstable sample and a stable sample.
  3. The unbalanced data classification oversampling method according to claim 2, characterized in that
    the preset thresholds comprise a preset first threshold n, a preset second threshold p, and a preset third threshold q;
    and comparing the number of majority samples against the preset thresholds to determine the category of the corresponding minority sample comprises:
    when the number of majority samples is greater than or equal to the first threshold n, classifying the corresponding minority sample as a noise sample; wherein the first threshold n satisfies 2k/3 <= n <= k;
    when the number of majority samples is less than the first threshold n and greater than or equal to the second threshold p, classifying the corresponding minority sample as an unstable sample; wherein the second threshold p satisfies k/2 <= p < n;
    when the number of majority samples is less than the second threshold p and greater than or equal to the third threshold q, classifying the corresponding minority sample as a boundary sample; wherein the third threshold q satisfies k/3 <= q < p;
    when the number of majority samples is less than the third threshold q, classifying the corresponding minority sample as a stable sample.
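The threshold cascade of claim 3 can be sketched in Python. This is an illustrative reading of the claim, not the patent's reference implementation; the function name, the string labels, and the choice of default thresholds at the lower bounds of the claimed ranges are all assumptions:

```python
def categorize(majority_count, k, n=None, p=None, q=None):
    """Map the number of majority-class samples among a minority sample's
    k nearest neighbours to one of the four categories of claim 3.

    Default thresholds are taken at the lower bounds of the claimed
    ranges (an assumption): n = 2k/3, p = k/2, q = k/3.
    """
    n = n if n is not None else 2 * k / 3
    p = p if p is not None else k / 2
    q = q if q is not None else k / 3
    if majority_count >= n:
        return "noise"      # >= n majority neighbours
    if majority_count >= p:
        return "unstable"   # in [p, n)
    if majority_count >= q:
        return "boundary"   # in [q, p)
    return "stable"         # fewer than q majority neighbours
```

For k = 6 the defaults give n = 4, p = 3, q = 2, so a minority sample with 6 majority neighbours is noise and one with none is stable.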
  4. The unbalanced data classification oversampling method according to claim 3, characterized in that performing, according to the category of each minority sample, the operation corresponding to that category comprises:
    when the category of the corresponding minority sample is noise sample, deleting the minority sample;
    when the category of the corresponding minority sample is unstable sample, retaining the minority sample;
    when the category of the corresponding minority sample is boundary sample, replicating the minority sample;
    when the category of the corresponding minority sample is stable sample, synthesizing new samples from the minority sample.
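The four per-category operations of claim 4 amount to a dispatch step. The list-based bookkeeping below is an assumption made for illustration; the patent does not prescribe a data structure:

```python
def apply_operation(category, sample, kept, to_replicate, to_synthesize):
    """Route one minority sample to its claim-4 operation.

    Noise samples are deleted (not kept); all others are retained, and
    boundary / stable samples are additionally queued for replication /
    synthesis in later steps (claims 5 and 6).
    """
    if category == "noise":
        return                        # delete: simply do not keep it
    kept.append(sample)               # unstable, boundary, stable are retained
    if category == "boundary":
        to_replicate.append(sample)   # later copied h times (claim 5)
    elif category == "stable":
        to_synthesize.append(sample)  # later used to synthesize (claim 6)
```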
  5. The unbalanced data classification oversampling method according to claim 4, characterized in that, when the category of the corresponding minority sample is boundary sample, replicating the minority sample comprises:
    traversing all minority samples and, for each minority sample, obtaining an increase number h; wherein the increase number h = |(target minority sample count - unstable sample count) / (total minority sample count - noise sample count - unstable sample count) - 1|;
    replicating the minority sample according to the increase number h.
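The increase number h of claim 5 is a simple ratio. The sketch below follows the claimed formula; truncating h to an integer copy count is an assumption, since the claim leaves the rounding unspecified:

```python
def copies_per_sample(target_minority, total_minority, noise, unstable):
    """Increase number h of claim 5:
    h = |(target - unstable) / (total - noise - unstable) - 1|,
    truncated to an integer copy count (rounding is an assumption).
    """
    remaining = total_minority - noise - unstable  # minority samples kept
    return int(abs((target_minority - unstable) / remaining - 1))

def replicate(samples, h):
    """Return the boundary samples plus h extra copies of each."""
    return samples + [s for s in samples for _ in range(h)]
```

For example, with a target of 100 minority samples, 40 existing ones, 5 noise and 10 unstable, h = |90/25 - 1| = 2.6, i.e. 2 extra copies per boundary sample.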
  6. The unbalanced data classification oversampling method according to claim 5, characterized in that, when the category of the corresponding minority sample is stable sample, synthesizing new samples from the minority sample comprises:
    traversing all minority samples and, for each minority sample, obtaining the increase number h; wherein the increase number h = |(target minority sample count - unstable sample count) / (total minority sample count - noise sample count - unstable sample count) - 1|;
    obtaining the average distance d from the stable sample to its k nearest minority-class samples;
    when the average distance d is less than or equal to a preset value, obtaining a serial number for each sample j_i among the k nearest minority-class samples of the stable sample; wherein the serial numbers are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample j_i; wherein 1 < i <= k;
    obtaining a select probability for the stable sample; wherein the select probability equals the cube of a random number between 0 and 1, multiplied by the serial number of each minority sample j_i; wherein 1 < i <= k;
    randomly selecting a minority sample j_i according to the select probability, to obtain the selected minority sample j_i;
    synthesizing a new sample from the selected minority sample j_i and the stable sample; wherein the new sample = the stable sample + (the stable sample - the selected minority sample j_i) * a; wherein a is a generated random number between 0 and 1.
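The synthesis step of claim 6 can be sketched as follows. The selection rule here (pick the neighbour with the largest weight r^3 * serial number) is one possible reading of the claim's "select probability"; that interpretation, the function name, and the tuple representation of samples are assumptions:

```python
import random

def synthesize_one(stable, neighbours, ranks, rng=random.random):
    """Generate one new sample from a stable sample (claim 6 sketch).

    `neighbours` are the k nearest minority samples (equal-length numeric
    tuples); `ranks` are their serial numbers, assigned in ascending order
    of the minority/majority ratio in their own k-neighbourhoods.
    """
    # Weight each neighbour by (random in [0,1])**3 * serial number,
    # then select the neighbour with the largest weight (an assumption).
    weights = [rng() ** 3 * rank for rank in ranks]
    j = neighbours[max(range(len(neighbours)), key=weights.__getitem__)]
    a = rng()  # random number in [0, 1]
    # new sample = stable + (stable - selected neighbour) * a
    return tuple(x + (x - y) * a for x, y in zip(stable, j))
```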
  7. The unbalanced data classification oversampling method according to claim 6, characterized in that, when the category of the corresponding minority sample is stable sample, synthesizing new samples from the minority sample further comprises:
    obtaining the average distance d from the stable sample to its k nearest minority-class samples;
    when the average distance d is greater than the preset value, obtaining a serial number for each sample x_n among the k nearest minority-class samples of the stable sample; wherein the serial numbers are assigned by sorting, in ascending order, the ratio of minority samples to majority samples among the k nearest neighbours of each minority sample x_n; wherein 1 < n <= k;
    obtaining a select probability for the stable sample; wherein the select probability equals the cube of a random number between 0 and 1, multiplied by the serial number of each minority sample x_n; wherein 1 < n <= k;
    randomly selecting s minority samples x_nj according to the select probability; wherein 1 < s <= k; wherein 1 < j <= s;
    synthesizing a new sample from each minority sample x_nj and the stable sample according to a synthetic method; wherein the synthetic method is: [formula not reproduced in the source text];
    wherein a_n is a generated random number between 0 and 1; x_i' is the new sample; x_i is the stable sample; wherein 1 < s <= k.
  8. An unbalanced data classification sampling apparatus, characterized in that it comprises:
    a minority sample acquisition module, for obtaining all minority samples in the unbalanced data to be processed;
    a majority sample number acquisition module, for obtaining, by a k-nearest-neighbour algorithm, the number of majority samples among the k nearest neighbours of each minority sample;
    a category determination module, for determining the category of the corresponding minority sample according to the number of majority samples;
    an operation module, for performing, according to the category of each minority sample, the operation corresponding to that category.
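The first two modules of claim 8 might be sketched as a small Python class. All names here are hypothetical; the Euclidean metric and the exclusion of the query point itself from its neighbour list are assumptions not stated in the claim:

```python
import math

class OversamplerDevice:
    """Sketch of the first two modules of claim 8 (names hypothetical)."""

    def __init__(self, k=5):
        self.k = k

    def get_minority_samples(self, data, labels, minority_label):
        # Minority sample acquisition module: collect all minority samples.
        return [x for x, y in zip(data, labels) if y == minority_label]

    def majority_neighbour_count(self, sample, data, labels, minority_label):
        # Majority sample number acquisition module: count majority-class
        # samples among the k nearest neighbours (Euclidean distance);
        # the nearest point is assumed to be the sample itself and skipped.
        ranked = sorted(zip(data, labels),
                        key=lambda t: math.dist(sample, t[0]))[1:self.k + 1]
        return sum(1 for _, y in ranked if y != minority_label)
```

Feeding each count into the thresholding of claim 3 would complete the category determination module.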
  9. An unbalanced data classification sampling device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the unbalanced data classification oversampling method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the unbalanced data classification oversampling method according to any one of claims 1 to 7.
CN201810453104.6A 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium Active CN108647728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810453104.6A CN108647728B (en) 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810453104.6A CN108647728B (en) 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN108647728A true CN108647728A (en) 2018-10-12
CN108647728B CN108647728B (en) 2019-04-19

Family

ID=63754913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810453104.6A Active CN108647728B (en) 2018-05-10 2018-05-10 Unbalanced data classification oversampler method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN108647728B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241969A (en) * 2020-01-06 2020-06-05 北京三快在线科技有限公司 Target detection method and device and corresponding model training method and device
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN112766394A (en) * 2021-01-26 2021-05-07 维沃移动通信有限公司 Modeling sample generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN102495901A (en) * 2011-12-16 2012-06-13 山东师范大学 Method for keeping balance of implementation class data through local mean
CN103324939A (en) * 2013-03-15 2013-09-25 江南大学 Deviation classification and parameter optimization method based on least square support vector machine technology
US20160335548A1 (en) * 2015-05-12 2016-11-17 Rolls-Royce Plc Methods and apparatus for predicting fault occurrence in mechanical systems and electrical systems



Also Published As

Publication number Publication date
CN108647728B (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN107301225B (en) Short text classification method and device
CN108647728B (en) Unbalanced data classification oversampler method, device, equipment and medium
WO2021142916A1 (en) Proxy-assisted evolutionary algorithm-based airfoil optimization method and apparatus
CN108647727A (en) Unbalanced data classification lack sampling method, apparatus, equipment and medium
CN108628971A (en) File classification method, text classifier and the storage medium of imbalanced data sets
CN106599935B (en) Three decision unbalanced data oversampler methods based on Spark big data platform
Zhang et al. 5Ws model for big data analysis and visualization
CN111860638A (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
CN109034194A (en) Transaction swindling behavior depth detection method based on feature differentiation
CN109816044A (en) A kind of uneven learning method based on WGAN-GP and over-sampling
CN108230010A (en) A kind of method and server for estimating ad conversion rates
Li et al. Imbalanced sentiment classification
CN108681970A (en) Finance product method for pushing, system and computer storage media based on big data
CN109033148A (en) One kind is towards polytypic unbalanced data preprocess method, device and equipment
Rai et al. The infinite hierarchical factor regression model
Buskirk Surveying the forests and sampling the trees: An overview of classification and regression trees and random forests with applications in survey research
CN110909222B (en) User portrait establishing method and device based on clustering, medium and electronic equipment
CN109871901A (en) A kind of unbalanced data classification method based on mixing sampling and machine learning
CN108694413A (en) Adaptively sampled unbalanced data classification processing method, device, equipment and medium
CN110457577A (en) Data processing method, device, equipment and computer storage medium
CN107944460A (en) One kind is applied to class imbalance sorting technique in bioinformatics
Sahin et al. A discrete dynamic artificial bee colony with hyper-scout for RESTful web service API test suite generation
CN102339278A (en) Information processing device, information processing method, and program
CN104731919A (en) Wechat public account user classifying method based on AdaBoost algorithm
CN110472659A (en) Data processing method, device, computer readable storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220630

Address after: No. 230, Waihuan West Road, Guangzhou University City, Guangzhou 510000

Patentee after: Guangzhou University

Patentee after: National University of Defense Technology

Address before: No. 230, Waihuan West Road, Guangzhou University City, Guangzhou 510000

Patentee before: Guangzhou University