CN109033148A - Multi-class-oriented imbalanced data preprocessing method, apparatus and device - Google Patents
- Publication number
- CN109033148A (CN201810599969.3A)
- Authority
- CN
- China
- Prior art keywords
- sample
- class
- minority
- sample set
- minority class
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a multi-class-oriented imbalanced data preprocessing method, apparatus and device. The method includes: receiving the final sample set size and the imbalance ratio between the sample sets, and obtaining the ideal sample number of each class; determining minority-class sample sets and majority-class sample sets according to the ideal sample number and the actual sample number; for each sample in a minority-class sample set, counting the other-class samples and the minority-class samples among its k nearest neighbors, and labeling the sample by category; deleting, saving, replicating or synthesizing the samples in the minority-class sample set according to their labels to obtain a final minority-class sample set; for each sample in a majority-class sample set, counting the majority-class samples and the other-class samples among its k nearest neighbors, and labeling the sample by category; deleting or saving the samples in the majority-class sample set according to their labels to obtain a final majority-class sample set; and generating the final sample set. The invention enables the final sample set to effectively improve the accuracy of multi-class classification algorithms.
Description
Technical Field
The present invention relates to the field of big data processing, and in particular to a multi-class-oriented imbalanced data preprocessing method, apparatus and device.
Background
With continuous technological progress, including faster Internet speeds, successive generations of mobile Internet and the steady development of hardware, data acquisition, storage and processing technologies have advanced significantly. Data is growing at an unprecedented rate, and we have entered the era of big data. The characteristics of big data, namely huge scale (volume), high-speed generation (velocity), diverse forms (variety) and uncertain data quality (veracity), pose unprecedented challenges when traditional data analysis and mining techniques are applied to big data.
Data classification is a basic algorithm in data analysis and mining. It has a wide range of applications and is the basis of many other data analysis and mining algorithms. In big data, almost all data sets are imbalanced. Imbalanced data means that at least one class in the data set contains fewer samples than the other classes. The data imbalance problem is widespread in the real world, especially in big data applications. For example, in Internet text classification the data of the classes are not balanced, and what we care about is often the data of a small class, such as sensitive information or emerging topics on the network. In e-commerce applications, the vast majority of customer transaction data and behavioral data are normal, and what we care about is often the fraud and abnormal behavior in e-commerce; these data are submerged in a large amount of normal behavior data, which makes the data set a typical imbalanced data set. Similar applications include medical diagnosis and satellite remote sensing data classification. Classification of imbalanced big data is therefore a key technical problem that urgently needs to be solved in national economic and social development, and it has broad application prospects.
Because the numbers of samples in different classes differ too much, imbalanced big data makes it difficult for traditional classification learning algorithms to achieve good classification results. Fig. 1 shows an example of imbalanced data classification, in which circles are minority-class samples and triangles are majority-class samples. The imbalance ratio is 3:1, that is, there are 3 times as many majority-class samples as minority-class samples. In actual large data sets the imbalance ratio is often 10000:1 or even higher, so the data must be preprocessed before classification in order to obtain a good learning effect.
Existing preprocessing methods for imbalanced big data mainly target binary classification algorithms, that is, the data set contains only two classes, a majority class and a minority class. During preprocessing, the majority class is undersampled, the minority class is oversampled, or both are done at the same time, which reduces the imbalance ratio of the data and thereby improves the classification result. Imbalanced big data preprocessing for multi-class classification algorithms currently lacks relevant research. In multi-class classification the data set contains multiple classes, and the classification algorithm learns through training to assign each data item to one of the classes. The current approach is to reduce the multi-class problem to binary classification problems, that is, to divide the classes in the data set into multiple binary data sets and process them pair by pair.
Converting a multi-class problem into multiple binary classification problems faces the following problems:
1. The sample set of a class may be the minority class in one binary classification problem and the majority class in another, so it cannot be handled effectively. As shown in Fig. 2, the circle sample set is the minority class in the classification against the triangle sample set, and the majority class in the classification against the cross sample set.
2. A sample may be assigned different sample categories in two different binary classification problems. For example, it may be noise that needs to be deleted in one binary classification and an important boundary sample that needs to be retained in another, so existing methods cannot handle it effectively. As shown in Fig. 2, the circled triangle sample is noise in the binary classification against the circle samples and needs to be deleted, but it is an important boundary sample in the binary classification against the cross samples and needs to be retained.
In short, if a multi-class problem is treated as multiple binary classification problems, the processing of a sample does not take into account its different situations in the individual classifications, so the accuracy of the multi-class classification algorithm cannot be effectively improved.
Summary of the Invention
In view of the above problems, the object of the present invention is to provide a multi-class-oriented imbalanced data preprocessing method, apparatus and device that can adapt to the requirements of different multi-class classification algorithms and effectively improve their accuracy.
An embodiment of the present invention provides a multi-class-oriented imbalanced data preprocessing method, including the following steps:
reading an original sample set, wherein the original sample set includes sample sets of at least two classes;
receiving the final sample set size and the imbalance ratio between the sample sets input by a user, so as to calculate the ideal sample number of each sample set in the final sample set;
determining, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set;
for the samples in each minority-class sample set, counting, among the k nearest neighbors of each sample, the other-class samples and the samples belonging to that minority-class sample set, classifying each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attaching the corresponding label, wherein other-class samples are the samples in all sample sets other than that minority-class sample set;
for the samples in each minority-class sample set, deleting, saving, replicating or synthesizing according to the label of each sample, so as to obtain the final minority-class sample set corresponding to each minority-class sample set;
for the samples in each majority-class sample set, counting, among the k nearest neighbors of each sample, the samples belonging to that majority-class sample set and the other-class samples, classifying each sample as a noise sample, a boundary sample or a stable sample, and attaching the corresponding label;
for the samples in each majority-class sample set, deleting or saving according to the label of each sample, so as to obtain the final majority-class sample set corresponding to each majority-class sample set;
generating the final sample set from the final minority-class sample sets and the final majority-class sample sets, thereby completing the preprocessing of the imbalanced data.
Preferably, determining, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set specifically includes:
for each sample set, if its ideal sample number is greater than its actual sample number, determining that the sample set is a minority-class sample set; if its ideal sample number is less than or equal to its actual sample number, determining that the sample set is a majority-class sample set.
Preferably, for the samples in each minority-class sample set, counting the other-class samples and the samples belonging to that minority-class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attaching the corresponding label specifically includes:
when the vast majority of the k nearest neighbors of a sample in the minority-class sample set are other-class samples, labeling the sample as a noise sample;
when most of the k nearest neighbors of a sample in the minority-class sample set are other-class samples, labeling the sample as an unstable sample;
when, among the k nearest neighbors of a sample in the minority-class sample set, the number of other-class samples is close to the number of samples from the minority-class sample set, labeling the sample as a boundary sample;
when most of the k nearest neighbors of a sample in the minority-class sample set belong to that minority-class sample set, labeling the sample as a stable sample.
Preferably, for the samples in each minority-class sample set, deleting, saving, replicating or synthesizing according to the label of each sample to obtain the final minority-class sample set specifically includes:
for the samples in each minority-class sample set:
deleting all noise samples in the minority-class sample set;
adding all unstable samples to the corresponding final minority-class sample set;
replicating each boundary sample, the number of copies being |c-1|, and adding the boundary sample together with its copies to the corresponding final minority-class sample set, wherein c is the replication ratio and c = (ideal sample number of the minority-class sample set - number of unstable samples) / (actual sample number of the minority-class sample set - number of noise samples - number of unstable samples);
for each stable sample, synthesizing new samples from it and its surrounding samples, the number of synthesized samples being |c-1|, and adding the stable sample together with the newly synthesized samples to the corresponding final minority-class sample set, wherein each synthesis randomly selects, from the k nearest neighbors of the stable sample xi, a sample xj belonging to the minority-class sample set, and the newly synthesized sample is xi' = xi + (xi - xj) * a, where a is a random number between 0 and 1;
calculating the number d of samples of the minority-class sample set that still need to be generated, wherein d = ideal sample number of the minority-class sample set - current number of samples belonging to that minority-class sample set in the final minority-class sample set;
randomly selecting d stable samples, synthesizing one new sample from each selected stable sample and its surrounding samples, and adding the synthesized samples to the corresponding final minority-class sample set;
obtaining the final minority-class sample set corresponding to each minority-class sample set.
Preferably, for the samples in each majority-class sample set, counting the samples belonging to that majority-class sample set and the other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, a boundary sample or a stable sample, and attaching the corresponding label specifically includes:
when the vast majority of the k nearest neighbors of a sample in the majority-class sample set are other-class samples, labeling the sample as a noise sample;
when, among the k nearest neighbors of a sample in the majority-class sample set, the number of other-class samples is close to the number of samples from the majority-class sample set, labeling the sample as a boundary sample;
when most of the k nearest neighbors of a sample in the majority-class sample set belong to that majority-class sample set, labeling the sample as a stable sample.
Preferably, for the samples in each majority-class sample set, deleting or saving according to the label of each sample to obtain the final majority-class sample set specifically includes:
for the samples in each majority-class sample set:
deleting the noise samples;
retaining all boundary samples;
performing a selective deletion operation on the stable samples until e stable samples have been deleted, wherein e = actual sample number of the majority-class sample set - number of noise samples - ideal sample number of the majority-class sample set;
obtaining the final majority-class sample set corresponding to each majority-class sample set.
Preferably, performing the selective deletion operation on the stable samples until e stable samples have been deleted specifically includes:
repeating the following steps until the number f of deleted stable samples equals e:
for the currently selected stable sample, calculating the distance from the stable sample to the surrounding k nearest neighbors that belong to the majority-class sample set;
calculating the probability of deleting the stable sample according to the distance, wherein the smaller the distance, the greater the deletion probability;
if the deletion probability is greater than or equal to 0.5, deleting the stable sample and updating the number f of deleted stable samples;
selecting the next stable sample.
An embodiment of the present invention also provides a multi-class-oriented imbalanced data preprocessing apparatus, including:
a data reading unit, configured to read an original sample set, wherein the original sample set includes sample sets of at least two classes;
a data receiving unit, configured to receive the final sample set size and the imbalance ratio between the sample sets input by a user, so as to calculate the ideal sample number of each sample set in the final sample set;
a judging unit, configured to determine, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set;
a minority-class sample classification unit, configured to, for the samples in each minority-class sample set, count the other-class samples and the samples belonging to that minority-class sample set among the k nearest neighbors of each sample, classify each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attach the corresponding label, wherein other-class samples are the samples in all sample sets other than that minority-class sample set;
a minority-class sample processing unit, configured to, for the samples in each minority-class sample set, delete, save, replicate or synthesize according to the label of each sample, so as to obtain the final minority-class sample set corresponding to each minority-class sample set;
a majority-class sample classification unit, configured to, for the samples in each majority-class sample set, count the samples belonging to that majority-class sample set and the other-class samples among the k nearest neighbors of each sample, classify each sample as a noise sample, a boundary sample or a stable sample, and attach the corresponding label;
a majority-class sample processing unit, configured to, for the samples in each majority-class sample set, delete or save according to the label of each sample, so as to obtain the final majority-class sample set corresponding to each majority-class sample set;
a final sample set generation unit, configured to generate the final sample set from the final minority-class sample sets and the final majority-class sample sets, thereby completing the preprocessing of the imbalanced data.
An embodiment of the present invention also provides a multi-class-oriented imbalanced data preprocessing device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the multi-class-oriented imbalanced data preprocessing method described above when executing the computer program.
The embodiments of the present invention implement a multi-class-oriented imbalanced data preprocessing method that combines oversampling and undersampling so that the newly generated final sample set meets the needs of the classification algorithm, effectively improving the classification accuracy on imbalanced data. Specifically, the embodiments allow the user to input the required total number of samples and the desired imbalance ratio between the sample sets, calculate the ideal sample number of each sample set, and determine from the number of samples in the final sample set whether each sample set is a majority class or a minority class, which solves the problem in conventional methods that a sample set may be both a majority class and a minority class in different binary classification problems. When the samples in each sample set are processed, the sample sets of all other classes are merged and treated as one. Minority-class samples are divided into noise samples, unstable samples, boundary samples and stable samples and processed separately, and majority-class samples are divided into noise samples, boundary samples and stable samples and processed separately, which solves the problem in conventional methods that a sample may belong to different categories in different binary classification problems, leading to conflicting processing of the sample. In summary, by applying the most suitable processing to each sample, the embodiments of the present invention enable the final sample set to effectively improve the accuracy of multi-class classification algorithms.
Brief Description of the Drawings
In order to explain the technical solutions of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an example diagram of imbalanced data classification;
Fig. 2 is another example diagram of imbalanced data classification;
Fig. 3 is a schematic flowchart of the multi-class-oriented imbalanced data preprocessing method provided by the first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the multi-class-oriented imbalanced data preprocessing apparatus provided by the second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 3, the first embodiment of the present invention provides a multi-class-oriented imbalanced data preprocessing method, which can be executed by a multi-class-oriented imbalanced data preprocessing device (hereinafter referred to as the device) and includes at least the following steps:
S101: reading an original sample set, wherein the original sample set includes sample sets of at least two classes.
In this embodiment, the device first reads the original sample set, which includes sample sets of at least two classes.
S102: receiving the final sample set size and the imbalance ratio between the sample sets input by a user, so as to calculate the ideal sample number of each sample set in the final sample set.
In this embodiment, the device is a device with data processing capability. For example, it may be a personal computer, a notebook, a tablet, a server or a server cluster; the present invention places no specific limitation on this.
In this embodiment, the device first receives the desired final sample set size x input by the user and the ideal imbalance ratio a1:a2:...:an between the sample sets (assuming there are n sample sets). The ideal sample number of the sample set of each class in the final sample set is then calculated as:
xi = x * ai / (a1 + a2 + ... + an), for i = 1, ..., n.
For example, assume that the ideal final sample set input by the user contains 20000 samples in 4 sample sets, and that the imbalance ratio between the sample sets is 4:3:2:1. The ideal sample numbers of the sample sets in the final result are then:
x1 = 20000 * 4 / (4+3+2+1) = 8000;
x2 = 20000 * 3 / (4+3+2+1) = 6000;
x3 = 20000 * 2 / (4+3+2+1) = 4000;
x4 = 20000 * 1 / (4+3+2+1) = 2000.
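The proportional split used in this worked example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ideal_counts` and its parameter names are hypothetical and do not appear in the patent.

```python
from typing import List, Sequence


def ideal_counts(total: int, ratios: Sequence[float]) -> List[float]:
    """Ideal sample number per class: x_i = x * a_i / (a_1 + ... + a_n)."""
    ratio_sum = sum(ratios)
    if ratio_sum <= 0:
        raise ValueError("the imbalance ratio terms must sum to a positive value")
    return [total * a / ratio_sum for a in ratios]


# The example from the text: 20000 samples split 4:3:2:1.
print(ideal_counts(20000, [4, 3, 2, 1]))  # [8000.0, 6000.0, 4000.0, 2000.0]
```

When the ratio does not divide the total evenly, the results are fractional; the text does not say how to round them, so the sketch leaves that choice to the caller.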
S103: determining, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set.
In this embodiment, for each sample set xi, the device determines whether the sample set is a minority-class sample set or a majority-class sample set as follows:
For each sample set, if its ideal sample number is greater than its actual sample number (the number of its samples in the original sample set), the sample set is determined to be a minority-class sample set; if its ideal sample number is less than or equal to its actual sample number, the sample set is determined to be a majority-class sample set.
For example, if sample set x1 has an ideal sample number of 8000 and an actual sample number of 12000, then x1 is a majority-class sample set.
As another example, if sample set x4 has an ideal sample number of 2000 and an actual sample number of 200, then x4 is a minority-class sample set.
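The decision rule of S103 amounts to one comparison per sample set. The sketch below illustrates it with the two examples from the text; the function name `split_classes` and the dictionary-based interface are assumptions made for illustration only.

```python
from typing import Dict


def split_classes(ideal: Dict[str, float], actual: Dict[str, int]) -> Dict[str, str]:
    """Mark each sample set as 'minority' (ideal > actual) or 'majority'."""
    return {
        name: "minority" if ideal[name] > actual[name] else "majority"
        for name in ideal
    }


# x1: ideal 8000, actual 12000; x4: ideal 2000, actual 200 (values from the text).
kinds = split_classes({"x1": 8000, "x4": 2000}, {"x1": 12000, "x4": 200})
print(kinds)  # {'x1': 'majority', 'x4': 'minority'}
```

Note that "minority" and "majority" here are relative to the user's target counts, not to the other classes, which is why one sample set can never be both at once.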
S104: for the samples in each minority-class sample set, counting the other-class samples and the samples belonging to that minority-class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attaching the corresponding label, wherein other-class samples are the samples in all sample sets other than that minority-class sample set.
In this embodiment, when a sample set is processed, the sample sets of all other classes are merged into a single sample set, called the other-class sample set, and its samples are called other-class samples. For example, when sample set x1 is processed, x2, x3 and x4 are merged and referred to as the other-class sample set. In this embodiment:
When the vast majority of the k nearest neighbors of a sample in the minority-class sample set are other-class samples, the sample is labeled as a noise sample.
In this embodiment, the value of k is input by the user; different classification algorithms achieve their best results under different values of k. The k nearest neighbors are defined as follows: for a sample at position xi, the k neighboring nodes (neighbor samples) closest to xi are selected.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a sample in the minority-class sample set to k is greater than a first threshold (such as 80%), the sample is labeled as a noise sample. The first threshold can of course be set according to actual needs, for example to 75% or 85%; the present invention does not limit it.
When most of the k nearest neighbors of a sample in the minority-class sample set are other-class samples, the sample is labeled as an unstable sample.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a sample in the minority-class sample set to k is less than the first threshold (such as 80%) and greater than a second threshold (such as 60%), the sample is labeled as an unstable sample. The second threshold can of course be set according to actual needs, for example to 55% or 65%; the present invention does not limit it.
When, among the k nearest neighbors of a sample in the minority-class sample set, the number of other-class samples is close to the number of samples from the minority-class sample set, the sample is labeled as a boundary sample.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a sample in the minority-class sample set to k is less than the second threshold (such as 60%) and greater than a third threshold (such as 40%), the sample is labeled as a boundary sample.
When most of the k nearest neighbors of a sample in the minority-class sample set belong to that minority-class sample set, the sample is labeled as a stable sample.
For example, when the ratio of the number of other-class samples among the k nearest neighbors of a sample in the minority-class sample set to k is less than the third threshold (such as 40%), the sample is labeled as a stable sample.
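The labeling rule of S104 can be sketched as a brute-force k-nearest-neighbor search followed by threshold checks. The sketch below is illustrative only: the function names are hypothetical, Euclidean distance is assumed, and because the text does not say which label applies when the ratio exactly equals a threshold, the sketch assigns such ties to the lower band.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def k_nearest(idx: int, points: List[Point], k: int) -> List[int]:
    """Indices of the k points closest to points[idx] (Euclidean, brute force)."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def label_minority(idx: int, points: List[Point], classes: List[str], k: int,
                   t1: float = 0.8, t2: float = 0.6, t3: float = 0.4) -> str:
    """Label a minority-class sample from the share of other-class neighbors."""
    neighbours = k_nearest(idx, points, k)
    if not neighbours:
        raise ValueError("at least one neighbor is required")
    # All classes other than the sample's own are merged into "other".
    other = sum(1 for j in neighbours if classes[j] != classes[idx])
    ratio = other / len(neighbours)
    if ratio > t1:
        return "noise"
    if ratio > t2:
        return "unstable"
    if ratio > t3:  # assumption: a ratio equal to a threshold falls in the lower band
        return "boundary"
    return "stable"
```

For example, with points `[(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (0.5, 0.5)]` and classes `a, a, a, a, b, b`, the sample at `(0.5, 0.5)` has only class-`a` neighbors for `k=4`, so it is labeled as noise.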
S105: for the samples in each minority-class sample set, deleting, saving, replicating or synthesizing according to the label of each sample, so as to obtain the final minority-class sample set corresponding to each minority-class sample set.
In this embodiment, the samples in each minority-class sample set are processed according to their labels as follows:
1. All noise samples in the minority-class sample set are deleted.
2. All unstable samples are added to the final minority-class sample set corresponding to the minority-class sample set.
3. Each boundary sample is replicated, the number of copies being |c-1|, and the boundary sample together with its copies is added to the corresponding final minority-class sample set, wherein c is the replication ratio and c = (ideal sample number of the minority-class sample set - number of unstable samples) / (actual sample number of the minority-class sample set - number of noise samples - number of unstable samples).
Boundary samples are very important to classification learning algorithms, and synthesizing new samples from them and other samples would easily cause the samples to drift, so boundary samples are replicated rather than synthesized.
4. For each stable sample, new samples are synthesized from it and its surrounding samples, the number of synthesized samples being |c-1|, and the stable sample together with the newly synthesized samples is added to the corresponding final minority-class sample set. Each synthesis randomly selects, from the k nearest neighbors of the stable sample xi, a sample xj belonging to the minority-class sample set, and the newly synthesized sample is xi' = xi + (xi - xj) * a, where a is a random number between 0 and 1.
5. The number d of samples of the minority-class sample set that still need to be generated is calculated, wherein d = ideal sample number of the minority-class sample set - current number of samples belonging to that minority-class sample set in the final minority-class sample set.
6. d stable samples are randomly selected, one new sample is synthesized from each selected stable sample and its surrounding samples, and the synthesized samples are added to the corresponding final minority-class sample set.
With the above processing, samples of different categories are handled differently, which improves the quality of the newly generated samples and in turn the performance of the classification learning algorithm, and ensures that the number of samples in the generated final minority-class sample set equals the ideal sample number preset by the user.
In this embodiment, the above processing is applied to each minority-class sample set in turn to obtain the corresponding final minority-class sample set.
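The minority-class processing of S105 can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: the number of copies |c-1| is rounded down to an integer, neighbors for synthesis are searched only among the retained minority samples rather than among all classes, and the names `synthesize` and `process_minority` are hypothetical. The synthesis formula is implemented exactly as written in the text, xi' = xi + (xi - xj) * a, which moves the new sample away from xj; the classic SMOTE technique instead interpolates toward the neighbor.

```python
import math
import random
from typing import List

Point = List[float]


def synthesize(x_i: Point, x_j: Point, rng: random.Random) -> Point:
    """Apply x_i' = x_i + (x_i - x_j) * a, with a drawn from [0, 1)."""
    a = rng.random()
    return [p + (p - q) * a for p, q in zip(x_i, x_j)]


def process_minority(samples: List[Point], tags: List[str], ideal: int,
                     k: int, seed: int = 0) -> List[Point]:
    """Build the final minority-class sample set from labeled samples."""
    rng = random.Random(seed)
    groups = {t: [s for s, g in zip(samples, tags) if g == t]
              for t in ("noise", "unstable", "boundary", "stable")}
    kept = groups["unstable"] + groups["boundary"] + groups["stable"]
    base = len(samples) - len(groups["noise"]) - len(groups["unstable"])
    if base == 0 or not groups["stable"] or len(kept) < 2:
        raise ValueError("sketch needs stable samples and at least two kept samples")
    c = (ideal - len(groups["unstable"])) / base  # replication ratio
    extra = int(abs(c - 1))                       # assumption: round down

    def neighbours(x: Point) -> List[Point]:
        # Assumption: search only among the retained minority samples.
        others = [s for s in kept if s is not x]
        return sorted(others, key=lambda s: math.dist(x, s))[:k]

    final = list(groups["unstable"])              # unstable: keep unchanged
    for b in groups["boundary"]:                  # boundary: keep plus copies
        final.append(b)
        final.extend(list(b) for _ in range(extra))
    for s in groups["stable"]:                    # stable: keep plus synthesis
        final.append(s)
        near = neighbours(s)
        final.extend(synthesize(s, rng.choice(near), rng) for _ in range(extra))
    d = ideal - len(final)                        # samples still missing
    for _ in range(max(d, 0)):
        s = rng.choice(groups["stable"])
        final.append(synthesize(s, rng.choice(neighbours(s)), rng))
    return final
```

Because a minority-class sample set has an ideal sample number larger than its actual sample number, c is always greater than 1 here. Rounding the number of copies down means the first passes never overshoot, so the final top-up of d samples brings the set to exactly the ideal sample number.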
S106: for the samples in each majority-class sample set, counting the samples belonging to that majority-class sample set and the other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, a boundary sample or a stable sample, and attaching the corresponding label.
Specifically, when the vast majority of the k nearest neighbors of a sample in the majority-class sample set are other-class samples, the sample is labeled as a noise sample;
when, among the k nearest neighbors of a sample in the majority-class sample set, the number of other-class samples is close to the number of samples from the majority-class sample set, the sample is labeled as a boundary sample;
when most of the k nearest neighbors of a sample in the majority-class sample set belong to that majority-class sample set, the sample is labeled as a stable sample.
The ratios corresponding to "vast majority", "most" and "close" can be taken from the minority-class case above and are not repeated here.
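A minimal sketch of the three-way majority-class labeling is given below. The text reuses the minority-class thresholds but does not say how a ratio between 60% and 80% should be labeled when there is no "unstable" category, so this sketch assumes that everything between the stable threshold and the noise threshold counts as a boundary sample. The function name and parameter names are hypothetical.

```python
def label_majority(other_count: int, k: int,
                   noise_threshold: float = 0.8,
                   stable_threshold: float = 0.4) -> str:
    """Label a majority-class sample from the count of other-class neighbors."""
    if k <= 0:
        raise ValueError("k must be positive")
    ratio = other_count / k
    if ratio > noise_threshold:
        return "noise"
    if ratio > stable_threshold:  # assumption: the whole middle band is boundary
        return "boundary"
    return "stable"
```

With `k=10`, nine other-class neighbors give "noise", five or seven give "boundary", and two give "stable".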
S107 is deleted or is saved according to the label of each sample to the sample in each most class sample sets, with
To final most class sample sets.
Specifically, to the sample in each most class sample sets:
Delete noise sample;
Retain all boundary samples;
Boundary sample is particularly significant to classification learning algorithm, therefore takes reservation operations to boundary sample, i.e., does not delete and appoint
What boundary sample.
To each stable sample, selectively removing operation is executed, until deleting e stable sample;Wherein, e=majority
The ideal sample number of practical sample number-noise sample number-of class sample set majority class sample set.
In the present embodiment, selectively removing is taken to operate to stablizing sample, to guarantee final most class sample sets
Sample number is ideal sample number.
In one implementation, the stable samples can be selectively deleted as follows:
Repeat the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, compute the distance from the stable sample to those of its k nearest neighbors that belong to the majority class sample set;
In this embodiment, the way the distance is computed depends on the kind of object being classified; for example, if the objects are word vectors, the Euclidean distance between vectors can be used.
Compute the probability of deleting the stable sample from the distance, where the smaller the distance, the larger the deletion probability;
If the deletion probability is greater than or equal to 0.5, delete the stable sample and update the number f of deleted stable samples (i.e., f = f + 1);
Select the next stable sample.
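The loop above might be sketched as follows. The text does not say how distance is mapped to probability, so the mapping `p = scale / (scale + d)`, with `scale` taken as the median of the remaining mean distances, is purely an assumption chosen so that p >= 0.5 exactly when a sample is at least as densely surrounded as the median. The function name, the use of the mean neighbor distance, and the requirement that `majority` contain the stable samples themselves are likewise illustrative.

```python
import numpy as np


def select_stable_to_delete(stable, majority, k, e):
    """Return indices (into `stable`) of up to e stable samples to delete."""
    stable = np.asarray(stable, dtype=float)
    majority = np.asarray(majority, dtype=float)
    # Euclidean distance from every stable sample to every same-class sample.
    dists = np.linalg.norm(stable[:, None, :] - majority[None, :, :], axis=2)
    dists.sort(axis=1)
    # Column 0 is the zero distance to the sample itself, so skip it.
    mean_dist = dists[:, 1:k + 1].mean(axis=1)

    remaining = list(range(len(stable)))
    deleted = []
    while len(deleted) < e and remaining:
        scale = np.median(mean_dist[remaining])
        kept = []
        for i in remaining:
            d = mean_dist[i]
            # Assumed mapping: a smaller distance gives a larger deletion probability.
            p = 1.0 if d == 0 else scale / (scale + d)
            if p >= 0.5 and len(deleted) < e:
                deleted.append(i)  # f = f + 1
            else:
                kept.append(i)
        remaining = kept
    return deleted


points = [[0.0], [1.0], [2.0], [50.0], [100.0]]
print(select_stable_to_delete(points, points, k=1, e=2))
```

Recomputing the median on each pass guarantees that at least one sample is deleted per pass, so the loop ends even when e exceeds half of the stable samples.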
In the above embodiment, during selective deletion, the more densely distributed a sample is, the more likely it is to be deleted, so that the remaining samples preserve the characteristics of all samples as far as possible. Handling the different kinds of samples separately in this way improves the quality of the samples that remain after undersampling, which in turn improves the performance of the classification learning algorithm.
In this embodiment, the above processing is applied to each majority class sample set in turn to obtain the corresponding final majority class sample set.
S108: generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
The embodiment of the present invention implements a multi-class-oriented imbalanced data preprocessing method that combines oversampling and undersampling so that the newly generated final sample set meets the needs of the classification algorithm and effectively improves the classification accuracy on imbalanced data. Specifically, the embodiment allows the user to input the desired total number of samples and the desired imbalance ratio among the sample sets. The ideal sample count of each sample set is computed, and each sample set is determined to be a majority class or a minority class according to its sample count in the final sample set. This solves the problem in conventional methods that one sample set may be a majority class and a minority class at the same time in different binary classification algorithms. When the samples in a given sample set are processed, the sample sets of all other classes are merged and treated as one. Minority class samples are divided into noise samples, unstable samples, boundary samples, and stable samples and handled separately; majority class samples are divided into noise samples, boundary samples, and stable samples and handled separately. This solves the problem in conventional methods that one sample may belong to different classes in different binary classification algorithms, which leads to conflicting treatment of that sample. In summary, by applying the most suitable processing to each sample, the embodiment enables the final sample set to effectively improve the accuracy of multi-class classification algorithms.
Referring to Fig. 4, the second embodiment of the present invention provides an imbalanced data preprocessing apparatus, comprising:
A data reading unit 10, configured to read an original sample set, wherein the original sample set includes sample sets of at least two classes;
A data receiving unit 20, configured to receive the final sample set size and the imbalance ratio among the sample sets input by a user, so as to compute the ideal sample count of each sample set in the final sample set;
A judging unit 30, configured to judge, according to the ideal sample count and the actual sample count of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
A minority class sample classification unit 40, configured to, for the samples in each minority class sample set, count the other-class samples and the samples belonging to that minority class sample set among the k nearest neighbors of each sample, classify each sample as a noise sample, an unstable sample, a boundary sample, or a stable sample, and attach the corresponding label; wherein other-class samples are the samples in the other sample sets, excluding the samples in that minority class sample set;
A minority class sample processing unit 50, configured to, for the samples in each minority class sample set, delete, retain, replicate, or synthesize according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set;
A majority class sample classification unit 60, configured to, for the samples in each majority class sample set, count the samples belonging to that majority class sample set and the other-class samples among the k nearest neighbors of each sample, classify each sample as a noise sample, a boundary sample, or a stable sample, and attach the corresponding label;
A majority class sample processing unit 70, configured to, for the samples in each majority class sample set, delete or retain according to the label of each sample, to obtain the final majority class sample set corresponding to each majority class sample set;
A final sample set generation unit 80, configured to generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
Preferably, the judging unit 30 is specifically configured to:
For each sample set, if its ideal sample count is greater than its actual sample count, judge the sample set to be a minority class sample set; if its ideal sample count is less than or equal to its actual sample count, judge the sample set to be a majority class sample set.
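The work of the data receiving unit and the judging unit can be sketched in Python as below. This is a minimal sketch that assumes the imbalance ratio is supplied as one weight per class and that ideal counts are proportional to those weights; the function names, the dictionary layout, and the rounding are all illustrative assumptions.

```python
def ideal_counts(total_size, ratios):
    """Split total_size across classes in proportion to the imbalance ratios."""
    total_ratio = sum(ratios.values())
    # Assumption: fractional counts are rounded to the nearest integer.
    return {label: round(total_size * r / total_ratio) for label, r in ratios.items()}


def split_minority_majority(ideal, actual):
    """Minority if the ideal count exceeds the actual count, otherwise majority."""
    minority = [c for c in ideal if ideal[c] > actual[c]]
    majority = [c for c in ideal if ideal[c] <= actual[c]]
    return minority, majority


ideal = ideal_counts(600, {"a": 3, "b": 2, "c": 1})
actual = {"a": 900, "b": 200, "c": 20}
print(ideal)
print(split_minority_majority(ideal, actual))
```

Each class is compared only against its own ideal count, so a class is assigned to exactly one of the two groups.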
Preferably, the minority class sample classification unit 40 is specifically configured to:
When the vast majority of the k nearest neighbors of a sample in a minority class sample set are other-class samples, mark the sample as a noise sample;
When most of the k nearest neighbors of a sample in a minority class sample set are other-class samples, mark the sample as an unstable sample;
When the number of other-class samples among the k nearest neighbors of a sample in a minority class sample set is close to the number of neighbors that belong to that minority class sample set, mark the sample as a boundary sample;
When most of the k nearest neighbors of a sample in a minority class sample set belong to that minority class sample set, mark the sample as a stable sample.
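A possible shape for this labeling step is sketched below. The numeric cut-offs (`noise_frac`, `unstable_frac`, `boundary_frac`) are hypothetical stand-ins for "vast majority", "most", and "close", whose actual proportions are not given in this passage; Euclidean distance and the function name are also assumptions.

```python
import numpy as np


def label_minority_samples(X, y, minority_label, k=5,
                           noise_frac=0.9, unstable_frac=0.6, boundary_frac=0.4):
    """Label each sample of one minority class by the makeup of its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # a sample is not its own neighbor
        neighbors = np.argsort(dist)[:k]
        # Fraction of the k neighbors that come from any other class.
        other_frac = np.mean(y[neighbors] != minority_label)
        if other_frac >= noise_frac:
            tag = "noise"
        elif other_frac >= unstable_frac:
            tag = "unstable"
        elif other_frac >= boundary_frac:
            tag = "boundary"
        else:
            tag = "stable"
        labels[int(i)] = tag
    return labels


X = [[0.0], [0.1], [0.2], [0.3], [10.0], [10.1], [10.2], [10.3], [10.4]]
y = ["m", "m", "m", "m", "m", "o", "o", "o", "o"]
print(label_minority_samples(X, y, "m", k=3))
```

All other classes are pooled into a single "other" group here, which matches the merging of the other sample sets described in the text.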
Preferably, the minority class sample processing unit 50 is specifically configured to:
For the samples in each minority class sample set:
Delete all noise samples in the minority class sample set;
Add all unstable samples to the corresponding final minority class sample set;
Replicate each boundary sample, the number of copies being ∣c-1∣, and add the boundary sample together with its copies to the corresponding final minority class sample set; wherein c is the replication ratio, and c = (ideal sample count of the minority class sample set - number of unstable samples) / (actual sample count of the minority class sample set - number of noise samples - number of unstable samples);
For each stable sample, synthesize new samples with surrounding samples, the number of synthesized samples being ∣c-1∣, and add the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; wherein the synthesis method is to randomly select, each time, one sample x_j belonging to the minority class sample set from the k nearest neighbors of the stable sample x_i, the newly synthesized sample being x_i' = x_i + (x_i - x_j) * a, where a is a random number between 0 and 1;
Compute the number d of samples belonging to the minority class sample set that still need to be generated; wherein d = (ideal sample count of the minority class sample set) - (current number of samples belonging to that minority class sample set in the final minority class sample set);
Randomly select d stable samples, have each of them synthesize one new sample with a surrounding sample, and add the synthesized samples to the corresponding final minority class sample set;
Obtain the final minority class sample set corresponding to each minority class sample set.
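The replication ratio and the synthesis formula can be tried out with the sketch below. The text prints the per-sample count as ∣c-1∣; because a count must be an integer, this sketch assumes it is rounded down, which is an assumption and not something the text states. The synthesis follows the formula exactly as printed, x_i + (x_i - x_j) * a; note that the widely used SMOTE interpolation uses (x_j - x_i) instead, which places the new point between the two samples. Function names are illustrative.

```python
import math
import random


def copies_per_sample(ideal, actual, noise, unstable):
    """Return the replication ratio c and the assumed integer count floor(|c - 1|)."""
    # The denominator is the number of boundary plus stable samples; it must be non-zero.
    c = (ideal - unstable) / (actual - noise - unstable)
    return c, math.floor(abs(c - 1))


def synthesize(x_i, x_j, a=None):
    """New sample x_i' = x_i + (x_i - x_j) * a, with a drawn from [0, 1) if not given."""
    if a is None:
        a = random.random()
    return [p + (p - q) * a for p, q in zip(x_i, x_j)]


c, n = copies_per_sample(ideal=100, actual=40, noise=5, unstable=5)
print(round(c, 3), n)
print(synthesize([1.0, 2.0], [0.0, 4.0], a=0.5))
```

Rounding down leaves the final set short of the ideal count, which is consistent with the later step that computes the remaining number d and synthesizes d more samples.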
Preferably, the majority class sample classification unit 60 is specifically configured to:
When the vast majority of the k nearest neighbors of a sample in a majority class sample set are other-class samples, mark the sample as a noise sample;
When the number of other-class samples among the k nearest neighbors of a sample in a majority class sample set is close to the number of neighbors that belong to that majority class sample set, mark the sample as a boundary sample;
When most of the k nearest neighbors of a sample in a majority class sample set belong to that majority class sample set, mark the sample as a stable sample.
Preferably, the majority class sample processing unit 70 is specifically configured to:
For the samples in each majority class sample set:
Delete the noise samples;
Retain all boundary samples;
For the stable samples, perform a selective deletion operation until e stable samples have been deleted; wherein e = (actual sample count of the majority class sample set) - (number of noise samples) - (ideal sample count of the majority class sample set);
Obtain the final majority class sample set corresponding to each majority class sample set.
Preferably, performing the selective deletion operation on the stable samples until e stable samples have been deleted specifically comprises:
Repeating the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, compute the distance from the stable sample to those of its k nearest neighbors that belong to the majority class sample set;
Compute the probability of deleting the stable sample from the distance, where the smaller the distance, the larger the deletion probability;
If the deletion probability is greater than or equal to 0.5, delete the stable sample and update the number f of deleted stable samples;
Select the next stable sample.
The third embodiment of the present invention further provides an imbalanced data preprocessing device, comprising a processor, a memory, and a computer program stored in the memory and runnable on the processor. The processor implements the steps described above when executing the computer program. Alternatively, the processor implements the functions of the units in the apparatus embodiment described above when executing the computer program.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program in the multi-class-oriented imbalanced data preprocessing device.
The multi-class-oriented imbalanced data preprocessing device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the schematic diagram is merely an example of the multi-class-oriented imbalanced data preprocessing device and does not limit it; the device may include more or fewer components than shown, may combine certain components, or may use different components. For example, the device may further include input/output devices, network access devices, buses, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the multi-class-oriented imbalanced data preprocessing device and connects the various parts of the entire device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the multi-class-oriented imbalanced data preprocessing device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the units integrated in the multi-class-oriented imbalanced data preprocessing device are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the method embodiments above may also be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each method embodiment above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also regarded as falling within the protection scope of the present invention.
Claims (9)
1. A multi-class-oriented imbalanced data preprocessing method, characterized by comprising the following steps:
Reading an original sample set, wherein the original sample set includes sample sets of at least two classes;
Receiving the final sample set size and the imbalance ratio among the sample sets input by a user, so as to compute the ideal sample count of each sample set in the final sample set;
Judging, according to the ideal sample count and the actual sample count of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
For the samples in each minority class sample set, counting the other-class samples and the samples belonging to that minority class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, an unstable sample, a boundary sample, or a stable sample, and attaching the corresponding label; wherein other-class samples are the samples in the other sample sets, excluding the samples in that minority class sample set;
For the samples in each minority class sample set, deleting, retaining, replicating, or synthesizing according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set;
For the samples in each majority class sample set, counting the samples belonging to that majority class sample set and the other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, a boundary sample, or a stable sample, and attaching the corresponding label;
For the samples in each majority class sample set, deleting or retaining according to the label of each sample, to obtain the final majority class sample set corresponding to each majority class sample set;
Generating the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
2. The multi-class-oriented imbalanced data preprocessing method according to claim 1, characterized in that judging, according to the ideal sample count and the actual sample count of each sample set, whether the sample set is a minority class sample set or a majority class sample set specifically comprises:
For each sample set, if its ideal sample count is greater than its actual sample count, judging the sample set to be a minority class sample set; if its ideal sample count is less than or equal to its actual sample count, judging the sample set to be a majority class sample set.
3. The multi-class-oriented imbalanced data preprocessing method according to claim 1, characterized in that, for the samples in each minority class sample set, counting the other-class samples and the samples belonging to that minority class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, an unstable sample, a boundary sample, or a stable sample, and attaching the corresponding label specifically comprises:
When the vast majority of the k nearest neighbors of a sample in a minority class sample set are other-class samples, marking the sample as a noise sample;
When most of the k nearest neighbors of a sample in a minority class sample set are other-class samples, marking the sample as an unstable sample;
When the number of other-class samples among the k nearest neighbors of a sample in a minority class sample set is close to the number of neighbors that belong to that minority class sample set, marking the sample as a boundary sample;
When most of the k nearest neighbors of a sample in a minority class sample set belong to that minority class sample set, marking the sample as a stable sample.
4. The multi-class-oriented imbalanced data preprocessing method according to claim 3, characterized in that, for the samples in each minority class sample set, deleting, retaining, replicating, or synthesizing according to the label of each sample to obtain the final minority class sample set corresponding to each minority class sample set specifically comprises:
For the samples in each minority class sample set:
Deleting all noise samples in the minority class sample set;
Adding all unstable samples to the corresponding final minority class sample set;
Replicating each boundary sample, the number of copies being ∣c-1∣, and adding the boundary sample together with its copies to the corresponding final minority class sample set; wherein c is the replication ratio, and c = (ideal sample count of the minority class sample set - number of unstable samples) / (actual sample count of the minority class sample set - number of noise samples - number of unstable samples);
For each stable sample, synthesizing new samples with surrounding samples, the number of synthesized samples being ∣c-1∣, and adding the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; wherein the synthesis method is to randomly select, each time, one sample x_j belonging to the minority class sample set from the k nearest neighbors of the stable sample x_i, the newly synthesized sample being x_i' = x_i + (x_i - x_j) * a, where a is a random number between 0 and 1;
Computing the number d of samples belonging to the minority class sample set that still need to be generated; wherein d = (ideal sample count of the minority class sample set) - (current number of samples belonging to that minority class sample set in the final minority class sample set);
Randomly selecting d stable samples, having each of them synthesize one new sample with a surrounding sample, and adding the synthesized samples to the corresponding final minority class sample set;
Obtaining the final minority class sample set corresponding to each minority class sample set.
5. The multi-class-oriented imbalanced data preprocessing method according to claim 3, characterized in that, for the samples in each majority class sample set, counting the samples belonging to that majority class sample set and the other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, a boundary sample, or a stable sample, and attaching the corresponding label specifically comprises:
When the vast majority of the k nearest neighbors of a sample in a majority class sample set are other-class samples, marking the sample as a noise sample;
When the number of other-class samples among the k nearest neighbors of a sample in a majority class sample set is close to the number of neighbors that belong to that majority class sample set, marking the sample as a boundary sample;
When most of the k nearest neighbors of a sample in a majority class sample set belong to that majority class sample set, marking the sample as a stable sample.
6. The multi-class-oriented imbalanced data preprocessing method according to claim 5, characterized in that, for the samples in each majority class sample set, deleting or retaining according to the label of each sample to obtain the final majority class sample set corresponding to each majority class sample set specifically comprises:
For the samples in each majority class sample set:
Deleting the noise samples;
Retaining all boundary samples;
For the stable samples, performing a selective deletion operation until e stable samples have been deleted; wherein e = (actual sample count of the majority class sample set) - (number of noise samples) - (ideal sample count of the majority class sample set);
Obtaining the final majority class sample set corresponding to each majority class sample set from the retained boundary samples and the remaining stable samples.
7. The multi-class-oriented imbalanced data preprocessing method according to claim 6, characterized in that performing the selective deletion operation on the stable samples until e stable samples have been deleted specifically comprises:
Repeating the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, computing the distance from the stable sample to those of its k nearest neighbors that belong to the majority class sample set;
Computing the probability of deleting the stable sample from the distance, where the smaller the distance, the larger the deletion probability;
If the deletion probability is greater than or equal to 0.5, deleting the stable sample and updating the number f of deleted stable samples;
Selecting the next stable sample.
8. A multi-class-oriented imbalanced data preprocessing apparatus, characterized by comprising:
A data reading unit, configured to read an original sample set, wherein the original sample set includes sample sets of at least two classes;
A data receiving unit, configured to receive the final sample set size and the imbalance ratio among the sample sets input by a user, so as to compute the ideal sample count of each sample set in the final sample set;
A judging unit, configured to judge, according to the ideal sample count and the actual sample count of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
A minority class sample classification unit, configured to, for the samples in each minority class sample set, count the other-class samples and the samples belonging to that minority class sample set among the k nearest neighbors of each sample, classify each sample as a noise sample, an unstable sample, a boundary sample, or a stable sample, and attach the corresponding label; wherein other-class samples are the samples in the other sample sets, excluding the samples in that minority class sample set;
A minority class sample processing unit, configured to, for the samples in each minority class sample set, delete, retain, replicate, or synthesize according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set;
A majority class sample classification unit, configured to, for the samples in each majority class sample set, count the samples belonging to that majority class sample set and the other-class samples among the k nearest neighbors of each sample, classify each sample as a noise sample, a boundary sample, or a stable sample, and attach the corresponding label;
A majority class sample processing unit, configured to, for the samples in each majority class sample set, delete or retain according to the label of each sample, to obtain the final majority class sample set corresponding to each majority class sample set;
A final sample set generation unit, configured to generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
9. A multi-class-oriented imbalanced data preprocessing device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the multi-class-oriented imbalanced data preprocessing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810599969.3A CN109033148A (en) | 2018-06-11 | 2018-06-11 | Multi-class-oriented imbalanced data preprocessing method, apparatus and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033148A true CN109033148A (en) | 2018-12-18 |
Family
ID=64612664
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810599969.3A Pending CN109033148A (en) | Multi-class-oriented imbalanced data preprocessing method, apparatus and device | 2018-06-11 | 2018-06-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033148A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978009A (en) * | 2019-02-27 | 2019-07-05 | 广州杰赛科技股份有限公司 | Behavior classification method, device and storage medium based on wearable intelligent equipment |
CN110378352A (en) * | 2019-07-11 | 2019-10-25 | 河海大学 | The anti-interference two-dimensional filtering navigation data denoising method of high-precision in complicated underwater environment |
CN110378352B (en) * | 2019-07-11 | 2021-03-19 | 河海大学 | High-precision anti-interference two-dimensional filtering navigation data denoising method in complex underwater environment |
CN112749719A (en) * | 2019-10-31 | 2021-05-04 | 北京沃东天骏信息技术有限公司 | Method and device for sample balanced classification |
CN112766394A (en) * | 2021-01-26 | 2021-05-07 | 维沃移动通信有限公司 | Modeling sample generation method and device |
CN112766394B (en) * | 2021-01-26 | 2024-03-12 | 维沃移动通信有限公司 | Modeling sample generation method and device |
CN113298148A (en) * | 2021-05-25 | 2021-08-24 | 南京邮电大学 | Ecological environment evaluation-oriented unbalanced data resampling method |
CN113298148B (en) * | 2021-05-25 | 2022-08-05 | 南京邮电大学 | Ecological environment evaluation-oriented unbalanced data resampling method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109033148A (en) | Multi-class-oriented imbalanced data preprocessing method, apparatus and device | |
CN103455542B (en) | Multiclass evaluator and multiclass recognition methods | |
US20210073669A1 (en) | Generating training data for machine-learning models | |
CN106874435A (en) | User portrait construction method and device | |
CN111428217B (en) | Fraudulent party identification method, apparatus, electronic device and computer readable storage medium | |
CN111275491A (en) | Data processing method and device | |
CN109766902A (en) | To the method, apparatus and equipment of the vehicle cluster in same region | |
CN109739985A (en) | Automatic document classification method, equipment and storage medium | |
CN107622326A (en) | User's classification, available resources Forecasting Methodology, device and equipment | |
CN107908796A (en) | E-Government duplicate checking method, apparatus and computer-readable recording medium | |
CN112035549A (en) | Data mining method and device, computer equipment and storage medium | |
CN109191167A (en) | A kind of method for digging and device of target user | |
CN108959516A (en) | Conversation message treating method and apparatus | |
CN108647727A (en) | Unbalanced data classification lack sampling method, apparatus, equipment and medium | |
CN108537270A (en) | Image labeling method, terminal device and storage medium based on multi-tag study | |
CN108346098A (en) | A kind of method and device of air control rule digging | |
CN107748739A (en) | A kind of extracting method and relevant apparatus of short message text masterplate | |
CN113656699A (en) | User feature vector determination method, related device and medium | |
CN107506407A (en) | A kind of document classification, the method and device called | |
CN102339278A (en) | Information processing device, information processing method, and program | |
CN111984842B (en) | Bank customer data processing method and device | |
CN108647728A (en) | Unbalanced data classification oversampler method, device, equipment and medium | |
CN108596271A (en) | Appraisal procedure, device, storage medium and the terminal of fingerprint developing algorithm | |
CN114697127B (en) | Service session risk processing method based on cloud computing and server | |
CN116977692A (en) | Data processing method, device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |