CN109033148A - Multi-class-oriented imbalanced data preprocessing method, apparatus and device - Google Patents

Multi-class-oriented imbalanced data preprocessing method, apparatus and device

Info

Publication number
CN109033148A
Authority
CN
China
Prior art keywords
sample
class
minority
sample set
minority class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810599969.3A
Other languages
Chinese (zh)
Inventor
韩伟红
李树栋
王乐
方滨兴
贾焰
黄子中
周斌
殷丽华
田志宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN201810599969.3A
Publication of CN109033148A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a multi-class-oriented imbalanced data preprocessing method, apparatus and device. The method includes: receiving the final sample set size and the imbalance ratio among the sample sets, and obtaining the ideal sample count of each class; determining minority-class sample sets and majority-class sample sets according to the ideal and actual sample counts; for each sample in a minority-class sample set, counting the other-class samples and minority-class samples among its k nearest neighbors and labeling the sample by category; deleting, keeping, copying or synthesizing samples in the minority-class sample set according to their labels to obtain the final minority-class sample set; for each sample in a majority-class sample set, counting the majority-class samples and other-class samples among its k nearest neighbors and labeling the sample by category; deleting or keeping samples in the majority-class sample set according to their labels to obtain the final majority-class sample set; and generating the final sample set. The invention enables the final sample set to effectively improve the accuracy of multi-class classification algorithms.

Description

Multi-class-oriented imbalanced data preprocessing method, apparatus and device
Technical field
The present invention relates to the field of big data processing, and in particular to a multi-class-oriented imbalanced data preprocessing method, apparatus and device.
Background
With continuous technological progress, including faster Internet speeds, the evolution of the mobile Internet and the ongoing development of hardware, data acquisition, storage and processing technologies have advanced considerably. Data is growing at an unprecedented rate, and we have entered the era of big data. The characteristics of big data, namely huge scale (volume), high generation speed (velocity), diverse forms (variety) and uncertain data quality (veracity), pose unprecedented challenges for traditional data analysis and mining techniques when they are applied to big data.
Data classification is a basic algorithm in data analysis and mining; it has a wide range of applications and underlies many other data analysis and mining algorithms. In big data, almost all data sets are imbalanced. Imbalanced data means that at least one class in the data set contains relatively few samples compared with the other classes. The data imbalance problem is widespread in the real world, especially in big data applications. For example, in Internet text classification the data of the classes is not balanced, and what we care about is often the small classes, such as sensitive information or emerging topics on the network. In e-commerce applications, the vast majority of customer transaction and behavior data is normal, while what we care about is often fraud and abnormal behavior; these data are submerged in a large volume of normal behavior data and form an extremely imbalanced data set. Similar applications include medical diagnosis and satellite remote sensing data classification. Imbalanced big data classification is therefore a key technical problem that urgently needs to be solved in national economic and social development, and it has broad application prospects.
Because the sample counts of the different classes differ too greatly, traditional classification learning algorithms struggle to achieve good results on imbalanced big data. Fig. 1 shows an example of imbalanced data classification, in which circles are minority-class samples and triangles are majority-class samples. The imbalance ratio is 3:1, i.e. there are 3 times as many majority-class samples as minority-class samples. In real large data sets the imbalance ratio is often 10000:1 or even higher, so the data must be preprocessed before classification to obtain a good learning effect.
Existing imbalanced big data preprocessing methods mainly target binary classification algorithms, in which the data set has only two classes, a majority class and a minority class. During preprocessing, the majority class is undersampled, the minority class is oversampled, or both are done at once, which reduces the imbalance ratio of the data and thereby improves classification. Research on imbalanced big data preprocessing for multi-class classification algorithms is currently lacking. In multi-class classification the data set contains multiple classes, and the classification algorithm learns through training to assign each data item to one of them. The current approach is to reduce the multi-class problem to binary classification problems, i.e. to split the classes in the data set into multiple binary data sets and process them pairwise.
Converting a multi-class problem into multiple binary classification problems faces the following problems:
1. The data set of a given class may be the minority class in one binary classification problem and the majority class in another, so it cannot be handled effectively. As shown in Fig. 2, the circle sample set is the minority class when classified against the triangle sample set, but the majority class when classified against the cross sample set.
2. A given sample may fall into different sample categories in two different binary classification problems. For example, it may be noise that needs to be deleted in one binary classification and an important boundary sample that needs to be retained in another, which existing methods cannot handle effectively. As shown in Fig. 2, the circled triangle sample is noise in the binary classification against the circle samples and needs to be deleted, yet it is an important boundary sample in the binary classification against the cross samples and needs to be retained.
In short, if a multi-class problem is treated as multiple binary classification problems, the handling of a sample does not take into account its differing situations in each classification, and the accuracy of the multi-class classification algorithm cannot be effectively improved.
Summary of the invention
In view of the above problems, the purpose of the present invention is to provide a multi-class-oriented imbalanced data preprocessing method, apparatus and device that can adapt to the needs of different multi-class classification algorithms and effectively improve their accuracy.
An embodiment of the present invention provides a multi-class-oriented imbalanced data preprocessing method, including the following steps:
Reading an original sample set, wherein the original sample set includes sample sets of at least two classes;
Receiving the final sample set size and the imbalance ratio among the sample sets, both input by the user, and calculating the ideal sample count of each sample set in the final sample set;
Determining, according to the ideal sample count and actual sample count of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set;
For the samples in each minority-class sample set, counting, among the k nearest neighbors of each sample, the other-class samples and the samples belonging to that minority-class sample set, classifying each sample as a noise sample, unstable sample, boundary sample or stable sample, and labeling it accordingly; wherein other-class samples are samples in sample sets other than that minority-class sample set;
For the samples in each minority-class sample set, deleting, keeping, copying or synthesizing according to the label of each sample, to obtain the final minority-class sample set corresponding to each minority-class sample set;
For the samples in each majority-class sample set, counting, among the k nearest neighbors of each sample, the samples belonging to that majority-class sample set and the other-class samples, classifying each sample as a noise sample, boundary sample or stable sample, and labeling it accordingly;
For the samples in each majority-class sample set, deleting or keeping according to the label of each sample, to obtain the final majority-class sample set corresponding to each majority-class sample set;
Generating the final sample set from the final minority-class sample sets and final majority-class sample sets, thereby completing the preprocessing of the imbalanced data.
Preferably, determining, according to the ideal sample count and actual sample count of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set specifically is:
For each sample set, if its ideal sample count is greater than its actual sample count, the sample set is determined to be a minority-class sample set; if its ideal sample count is less than or equal to its actual sample count, the sample set is determined to be a majority-class sample set.
Preferably, counting, for the samples in each minority-class sample set, the other-class samples and the samples belonging to that minority-class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, unstable sample, boundary sample or stable sample, and labeling it accordingly specifically includes:
When the overwhelming majority of the k nearest neighbor samples of a sample in the minority-class sample set are other-class samples, labeling the sample a noise sample;
When most of the k nearest neighbor samples of a sample in the minority-class sample set are other-class samples, labeling the sample an unstable sample;
When, among the k nearest neighbor samples of a sample in the minority-class sample set, the number of other-class samples is close to the number of samples belonging to the minority-class sample set, labeling the sample a boundary sample;
When most of the k nearest neighbor samples of a sample in the minority-class sample set belong to that minority-class sample set, labeling the sample a stable sample.
Preferably, deleting, keeping, copying or synthesizing the samples in each minority-class sample set according to the label of each sample to obtain the final minority-class sample set specifically is:
For the samples in each minority-class sample set:
Deleting all noise samples in the minority-class sample set;
Adding all unstable samples to the corresponding final minority-class sample set;
Copying each boundary sample, the number of copies being |c-1|, and adding the boundary sample together with its copies to the corresponding final minority-class sample set; wherein c is the replication ratio, and c = (ideal sample count of the minority-class sample set - unstable sample count) / (actual sample count of the minority-class sample set - noise sample count - unstable sample count);
For each stable sample, synthesizing new samples with surrounding samples, the number synthesized being |c-1|, and adding the sample together with the newly synthesized samples to the corresponding final minority-class sample set; wherein the synthesis method is, each time, to randomly select from the k nearest neighbors of the stable sample xi a sample xj belonging to the minority-class sample set, the newly synthesized sample being xi' = xi + (xi - xj) * a, where a is a random number between 0 and 1;
Calculating the number d of samples belonging to the minority-class sample set that still need to be generated; wherein d = ideal sample count of the minority-class sample set - current number of samples belonging to that minority-class sample set in the final minority-class sample set;
Randomly selecting d stable samples, each of which synthesizes one new sample with surrounding samples, and adding the newly synthesized samples to the corresponding final minority-class sample set;
Obtaining the final minority-class sample set corresponding to each minority-class sample set.
Preferably, counting, for the samples in each majority-class sample set, the samples belonging to that majority-class sample set and the other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, boundary sample or stable sample, and labeling it accordingly specifically is:
When the overwhelming majority of the k nearest neighbor samples of a sample in the majority-class sample set are other-class samples, labeling the sample a noise sample;
When, among the k nearest neighbor samples of a sample in the majority-class sample set, the number of other-class samples is close to the number of samples belonging to the majority-class sample set, labeling the sample a boundary sample;
When most of the k nearest neighbor samples of a sample in the majority-class sample set belong to that majority-class sample set, labeling the sample a stable sample.
Preferably, deleting or keeping the samples in each majority-class sample set according to the label of each sample to obtain the final majority-class sample set specifically includes:
For the samples in each majority-class sample set:
Deleting the noise samples;
Retaining all boundary samples;
For the stable samples, performing a selective deletion operation until e stable samples have been deleted; wherein e = actual sample count of the majority-class sample set - noise sample count - ideal sample count of the majority-class sample set;
Obtaining the final majority-class sample set corresponding to each majority-class sample set.
Preferably, performing the selective deletion operation on the stable samples until e stable samples have been deleted specifically is:
Repeating the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, calculating the distances from the stable sample to its surrounding k nearest neighbors that belong to the majority-class sample set;
Calculating the probability of deleting the stable sample from the distances; wherein the smaller the distance, the larger the deletion probability;
If the deletion probability is greater than or equal to 0.5, deleting the stable sample and updating the number f of deleted stable samples;
Selecting the next stable sample.
An embodiment of the present invention also provides a multi-class-oriented imbalanced data preprocessing apparatus, comprising:
A data reading unit, configured to read an original sample set, wherein the original sample set includes sample sets of at least two classes;
A data receiving unit, configured to receive the final sample set size and the imbalance ratio among the sample sets, both input by the user, and to calculate the ideal sample count of each sample set in the final sample set;
A judging unit, configured to determine, according to the ideal sample count and actual sample count of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set;
A minority-class sample classification unit, configured to count, for the samples in each minority-class sample set, the other-class samples and the samples belonging to that minority-class sample set among the k nearest neighbors of each sample, to classify each sample as a noise sample, unstable sample, boundary sample or stable sample, and to label it accordingly; wherein other-class samples are samples in sample sets other than that minority-class sample set;
A minority-class sample processing unit, configured to delete, keep, copy or synthesize the samples in each minority-class sample set according to the label of each sample, to obtain the final minority-class sample set corresponding to each minority-class sample set;
A majority-class sample classification unit, configured to count, for the samples in each majority-class sample set, the samples belonging to that majority-class sample set and the other-class samples among the k nearest neighbors of each sample, to classify each sample as a noise sample, boundary sample or stable sample, and to label it accordingly;
A majority-class sample processing unit, configured to delete or keep the samples in each majority-class sample set according to the label of each sample, to obtain the final majority-class sample set corresponding to each majority-class sample set;
A final sample set generation unit, configured to generate the final sample set from the final minority-class sample sets and final majority-class sample sets, thereby completing the preprocessing of the imbalanced data.
An embodiment of the present invention also provides a multi-class-oriented imbalanced data preprocessing device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the multi-class-oriented imbalanced data preprocessing method described above.
The embodiment of the present invention implements a multi-class-oriented imbalanced data preprocessing method that combines oversampling and undersampling so that the newly generated final sample set meets the needs of the classification algorithm, effectively improving the classification accuracy on imbalanced data. Specifically, the embodiment allows the user to input the required total sample count and the desired imbalance ratio among the sample sets. The ideal sample count of each sample set is calculated, and each sample set is determined to be a majority class or a minority class according to its sample count in the final sample set, which solves the problem in conventional methods that a sample set may be both a majority class and a minority class across different binary classifications. When the samples of a sample set are processed, the sample sets of all other classes are merged and treated as one; minority-class samples are divided into noise, unstable, boundary and stable samples and handled separately, and majority-class samples are divided into noise, boundary and stable samples and handled separately. This solves the problem in conventional methods that a sample may fall into different categories in different binary classifications, leading to conflicting treatment of the sample. In summary, by applying the most suitable processing to each sample, the embodiment enables the final sample set to effectively improve the accuracy of multi-class classification algorithms.
Brief description of the drawings
To illustrate the technical solution of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an example diagram of imbalanced data classification;
Fig. 2 is another example diagram of imbalanced data classification;
Fig. 3 is a schematic flowchart of the multi-class-oriented imbalanced data preprocessing method provided by the first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the multi-class-oriented imbalanced data preprocessing apparatus provided by the second embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 3, the first embodiment of the present invention provides a multi-class-oriented imbalanced data preprocessing method, which can be executed by a multi-class-oriented imbalanced data preprocessing device (hereinafter, the device) and includes at least the following steps:
S101: read the original sample set, wherein the original sample set includes sample sets of at least two classes.
In the present embodiment, the device first reads the original sample set, which includes sample sets of at least two classes.
S102: receive the final sample set size and the imbalance ratio among the sample sets, both input by the user, and calculate the ideal sample count of each sample set in the final sample set.
In the present embodiment, the device is a device with data processing capability; for example, it may be a personal computer, notebook, tablet, server or server cluster, and the present invention does not specifically limit it.
In the present embodiment, the device first receives the desired final sample set size x and the ideal imbalance ratio among the sample sets a1:a2:...:an, both input by the user (assuming there are n sample sets). The ideal sample count of the sample set of each class in the final sample set is then calculated as xi = x*ai/(a1+a2+...+an), for i = 1, ..., n.
For example, suppose the ideal final sample set input by the user contains 20000 samples across 4 sample sets, and the imbalance ratio among the sample sets is 4:3:2:1. The ideal sample counts of the sample sets in the final result are then:
x1=20000*4/ (4+3+2+1)=8000;
x2=20000*3/ (4+3+2+1)=6000;
x3=20000*2/ (4+3+2+1)=4000;
x4=20000*1/ (4+3+2+1)=2000.
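The calculation above can be sketched in Python as follows. This is a minimal sketch: the function name `ideal_counts` is hypothetical, and rounding to the nearest integer is an assumption, since the text does not say how non-integer results are handled.

```python
def ideal_counts(total, ratio):
    """Split `total` samples across classes in proportion to `ratio`.

    Rounding to the nearest integer is an assumption; the text only
    gives an example where every result is already an integer.
    """
    ratio_sum = sum(ratio)
    return [round(total * a / ratio_sum) for a in ratio]


# Worked example from the text: 20000 samples, imbalance ratio 4:3:2:1.
counts = ideal_counts(20000, [4, 3, 2, 1])
```

With the inputs from the text, `counts` holds the four ideal sample counts x1 to x4.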
S103: determine, according to the ideal sample count and actual sample count of each sample set, whether the sample set is a minority-class sample set or a majority-class sample set.
In the present embodiment, for each sample set xi, the device determines whether the sample set is a minority-class or majority-class sample set, as follows:
For each sample set, if its ideal sample count is greater than its actual sample count (i.e. its count in the original sample set), the sample set is determined to be a minority-class sample set; if its ideal sample count is less than or equal to its actual sample count, the sample set is determined to be a majority-class sample set.
For example, suppose that sample set x1 has an ideal sample count of 8000 and an actual sample count of 12000; then x1 is a majority-class sample set.
As another example, sample set x4 has an ideal sample count of 2000 and an actual sample count of 200; then x4 is a minority-class sample set.
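The decision rule of S103 can be sketched as a single comparison per sample set. The function name `split_minority_majority` and the dictionary layout are hypothetical choices made for illustration only.

```python
def split_minority_majority(ideal, actual):
    """Label each sample set by comparing its ideal and actual counts.

    ideal, actual: dicts mapping a sample-set name to a sample count.
    A set is 'minority' when ideal > actual, otherwise 'majority'
    (the equal case counts as majority, as stated in the text).
    """
    return {
        name: "minority" if ideal[name] > actual[name] else "majority"
        for name in ideal
    }


# The two examples from the text: x1 (8000 vs 12000) and x4 (2000 vs 200).
kinds = split_minority_majority(
    ideal={"x1": 8000, "x4": 2000},
    actual={"x1": 12000, "x4": 200},
)
```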
S104: for the samples in each minority-class sample set, count, among the k nearest neighbors of each sample, the other-class samples and the samples belonging to that minority-class sample set, classify each sample as a noise sample, unstable sample, boundary sample or stable sample, and label it accordingly; wherein other-class samples are samples in sample sets other than that minority-class sample set.
In the present embodiment, when a sample set is processed, the sample sets of all other classes are merged into one sample set, called the other-class sample set, whose samples are called other-class samples. For example, when sample set x1 is processed, x2, x3 and x4 are merged and called the other-class sample set. In the present embodiment:
When the overwhelming majority of the k nearest neighbor samples of a sample in the minority-class sample set are other-class samples, the sample is labeled a noise sample.
In the present embodiment, the value of k is input by the user; different classification algorithms perform best under different values of k. The k nearest neighbor samples are defined as follows: given a sample at position xi, the k samples (neighbor samples) nearest to xi are selected.
For example, when, among the k nearest neighbor samples of a sample in a minority-class sample set, the ratio of the number of other-class samples to k is greater than a first threshold (e.g. 80%), the sample is labeled a noise sample. The first threshold can of course be set according to actual needs, e.g. to 75% or 85%; the present invention does not specifically limit it.
When most of the k nearest neighbor samples of a sample in the minority-class sample set are other-class samples, the sample is labeled an unstable sample.
For example, when, among the k nearest neighbor samples of a sample in a minority-class sample set, the ratio of the number of other-class samples to k is less than the first threshold (e.g. 80%) and greater than a second threshold (e.g. 60%), the sample is labeled an unstable sample. The second threshold can of course be set according to actual needs, e.g. to 55% or 65%; the present invention does not specifically limit it.
When, among the k nearest neighbor samples of a sample in the minority-class sample set, the number of other-class samples is close to the number of samples belonging to the minority-class sample set, the sample is labeled a boundary sample.
For example, when, among the k nearest neighbor samples of a sample in a minority-class sample set, the ratio of the number of other-class samples to k is less than the second threshold (e.g. 60%) and greater than a third threshold (e.g. 40%), the sample is labeled a boundary sample.
When most of the k nearest neighbor samples of a sample in the minority-class sample set belong to that minority-class sample set, the sample is labeled a stable sample.
For example, when, among the k nearest neighbor samples of a sample in a minority-class sample set, the ratio of the number of other-class samples to k is less than the third threshold (e.g. 40%), the sample is labeled a stable sample.
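The labeling rules of S104 can be sketched with NumPy as below. Several details are assumptions rather than part of the text: the function names are hypothetical, Euclidean distance is used, the sample itself is excluded from its own neighbors, and a ratio exactly equal to a threshold falls into the lower category (the text does not specify the equal case).

```python
import numpy as np


def other_class_fraction(X, y, i, k):
    """Fraction of the k nearest neighbours of X[i] whose class differs from y[i].

    Euclidean distance is an assumption; X[i] itself is excluded.
    """
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf  # never pick the sample as its own neighbour
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y[nearest] != y[i]))


def label_minority(frac_other, t1=0.8, t2=0.6, t3=0.4):
    """Map the other-class fraction to a label using the example thresholds.

    Ties (a fraction exactly equal to a threshold) go to the lower
    category; this tie rule is an assumption.
    """
    if frac_other > t1:
        return "noise"
    if frac_other > t2:
        return "unstable"
    if frac_other > t3:
        return "boundary"
    return "stable"


# Toy data: class 1 is the minority class under inspection.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [-0.1, 0], [0, -0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [4.9, 5], [5, 4.9], [5.1, 5.1]])
y = np.array([1, 0, 0, 2, 2, 0, 1, 1, 1, 1, 1, 1])
label_isolated = label_minority(other_class_fraction(X, y, 0, k=5))
label_clustered = label_minority(other_class_fraction(X, y, 6, k=5))
```

In the toy data, sample 0 is surrounded only by samples of classes 0 and 2, which are merged into "other class" simply by testing `y != y[i]`, while sample 6 sits inside a cluster of its own class.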
S105: for the samples in each minority-class sample set, delete, keep, copy or synthesize according to the label of each sample, to obtain the final minority-class sample set corresponding to each minority-class sample set.
In the present embodiment, the samples in each minority-class sample set are processed according to their labels as follows:
1. Delete all noise samples in the minority-class sample set;
2. Add all unstable samples to the final minority-class sample set corresponding to the minority-class sample set;
3. Copy each boundary sample, the number of copies being |c-1|, and add the boundary sample together with its copies to the corresponding final minority-class sample set; wherein c is the replication ratio, and c = (ideal sample count of the minority-class sample set - unstable sample count) / (actual sample count of the minority-class sample set - noise sample count - unstable sample count);
Boundary samples are very important to classification learning algorithms, and synthesizing new samples from them together with other samples easily causes sample drift, so boundary samples are copied rather than synthesized.
4. For each stable sample, synthesize new samples with surrounding samples, the number synthesized being |c-1|, and add the sample together with the newly synthesized samples to the corresponding final minority-class sample set; wherein the synthesis method is, each time, to randomly select from the k nearest neighbors of the stable sample xi a sample xj belonging to the minority-class sample set, the newly synthesized sample being xi' = xi + (xi - xj) * a, where a is a random number between 0 and 1.
5. Calculate the number d of samples belonging to the minority-class sample set that still need to be generated; wherein d = ideal sample count of the minority-class sample set - current number of samples belonging to that minority-class sample set in the final minority-class sample set;
6. Randomly select d stable samples, each of which synthesizes one new sample with surrounding samples, and add the newly synthesized samples to the corresponding final minority-class sample set.
Through the above processing, samples of different categories are handled separately, which improves the quality of the newly generated samples and in turn the performance of the classification learning algorithm, and ensures that the sample count of the generated final minority-class sample set equals the ideal sample count preset by the user.
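The synthesis step for stable samples can be sketched as follows. The function name `synthesize` and its arguments are hypothetical, and the neighbors passed in are assumed to have been filtered already to those belonging to the same minority-class sample set. The formula is implemented exactly as written in the text, so the new point lies on the side of xi facing away from xj.

```python
import numpy as np


def synthesize(x_i, same_class_neighbours, rng):
    """Create one synthetic sample from a stable sample x_i.

    same_class_neighbours: the k nearest neighbours of x_i that belong
    to the same minority-class sample set (assumed pre-filtered).
    Implements x_i' = x_i + (x_i - x_j) * a with a drawn from [0, 1).
    """
    x_i = np.asarray(x_i, dtype=float)
    neighbours = np.asarray(same_class_neighbours, dtype=float)
    x_j = neighbours[rng.integers(len(neighbours))]  # random neighbour
    a = rng.random()                                 # random factor in [0, 1)
    return x_i + (x_i - x_j) * a


rng = np.random.default_rng(0)
new_sample = synthesize([1.0, 1.0], [[0.0, 1.0]], rng)
```

With a single neighbor at (0, 1), the synthetic point keeps the second coordinate of xi and moves its first coordinate somewhere between 1 and 2.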
In the present embodiment, the above processing is applied to each minority-class sample set in turn to obtain the corresponding final minority-class sample sets.
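The bookkeeping behind the replication ratio c and the top-up count d can be checked with a small worked example. The label counts below (20 noise, 30 unstable, 50 boundary, 100 stable) are hypothetical, as are the function names; the ideal and actual counts reuse the x4 example (2000 and 200). Rounding |c-1| to the nearest integer is an assumption, since the text does not say how a non-integer value is handled.

```python
def replication_ratio(ideal, actual, n_noise, n_unstable):
    """c = (ideal - unstable) / (actual - noise - unstable)."""
    return (ideal - n_unstable) / (actual - n_noise - n_unstable)


def copies_per_sample(c):
    """Number of copies or synthetic samples per sample, |c - 1|.

    Rounding to the nearest integer is an assumption.
    """
    return round(abs(c - 1))


ideal, actual = 2000, 200                                   # x4 in the text
n_noise, n_unstable, n_boundary, n_stable = 20, 30, 50, 100  # hypothetical

c = replication_ratio(ideal, actual, n_noise, n_unstable)   # 1970 / 150
m = copies_per_sample(c)

# Size of the final minority set after the copy and synthesis steps:
# unstable samples are kept as-is; each boundary or stable sample
# contributes itself plus m new samples.
current = n_unstable + n_boundary * (1 + m) + n_stable * (1 + m)
d = ideal - current   # samples still to be synthesized in the last step
```

Because of the rounding, the copy and synthesis steps fall slightly short of the ideal count, and d covers the remainder.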
S106: for the samples in each majority-class sample set, count, among the k nearest neighbors of each sample, the samples belonging to that majority-class sample set and the other-class samples, classify each sample as a noise sample, boundary sample or stable sample, and label it accordingly.
Specifically, when the overwhelming majority of the k nearest neighbor samples of a sample in the majority-class sample set are other-class samples, the sample is labeled a noise sample;
When, among the k nearest neighbor samples of a sample in the majority-class sample set, the number of other-class samples is close to the number of samples belonging to the majority-class sample set, the sample is labeled a boundary sample;
When most of the k nearest neighbor samples of a sample in the majority-class sample set belong to that majority-class sample set, the sample is labeled a stable sample.
For the ratios corresponding to "overwhelming majority", "most" and "close" above, refer to the minority-class case; they are not repeated here.
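A possible three-way labeling for majority-class samples is sketched below. The text gives no majority-specific thresholds and defines no "unstable" label for this case, so the sketch makes an assumption that should be read as such: fractions between the stable and noise thresholds are all treated as boundary samples. The function name and default thresholds (reusing the 80% and 40% examples) are hypothetical.

```python
def label_majority(frac_other, noise_threshold=0.8, stable_threshold=0.4):
    """Label a majority-class sample from its other-class neighbour fraction.

    Assumption: anything that is neither clearly noise nor clearly
    stable is kept as a boundary sample, which errs on the side of
    retaining samples.
    """
    if frac_other > noise_threshold:
        return "noise"
    if frac_other < stable_threshold:
        return "stable"
    return "boundary"


labels = [label_majority(f) for f in (0.9, 0.5, 0.1)]
```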
S107: for the samples in each majority class sample set, delete or keep each sample according to its label, to obtain the final majority class sample set.
Specifically, for the samples in each majority class sample set:
Delete the noise samples;
Keep all boundary samples;
Boundary samples are particularly important to the classification learning algorithm, so they are retained, that is, no boundary sample is deleted.
For the stable samples, perform a selective deletion operation until e stable samples have been deleted, where e = actual sample number of the majority class sample set - number of noise samples - ideal sample number of the majority class sample set.
In this embodiment, stable samples are selectively deleted so that the number of samples in the final majority class sample set equals the ideal sample number.
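As a worked example of the formula for e, the following short Python sketch computes how many stable samples must be deleted. The function name and the clamp to zero are assumptions; the text does not say what happens when removing the noise samples alone already brings the set to or below the ideal size.

```python
def stable_deletions_needed(actual_count, noise_count, ideal_count):
    """e = actual sample number - noise sample number - ideal sample number."""
    e = actual_count - noise_count - ideal_count
    # Assumption: never return a negative count.
    return max(e, 0)

# A majority class with 1000 samples, 50 of them noise, and an ideal size of 600:
# deleting the 50 noise samples and 350 stable samples leaves exactly 600.
print(stable_deletions_needed(1000, 50, 600))
```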
In one implementation, stable samples can be selectively deleted as follows:
Repeat the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, calculate its distance to the samples among its k nearest neighbors that belong to the majority class sample set;
In this embodiment, the distance calculation depends on the objects being classified; for example, if the objects are word vectors, the Euclidean distance between vectors can be used.
Calculate the probability of deleting the stable sample from the distance, where the smaller the distance, the greater the deletion probability;
If the deletion probability is greater than or equal to 0.5, delete the stable sample and update the number f of deleted stable samples (that is, let f = f + 1);
Select the next stable sample.
In the above embodiment, during selective deletion, samples in more densely populated regions are more likely to be deleted, so that the remaining samples retain the characteristics of the full sample set as far as possible. Handling the different kinds of samples separately in this way improves the quality of the samples after undersampling and in turn the performance of the classification learning algorithm.
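The selective deletion loop can be sketched in Python as below. The text does not give the mapping from distance to deletion probability, so the formula `1 / (1 + d / scale)`, the use of the mean distance to the k nearest same-class neighbors, and the median-based `scale` are all assumptions chosen only so that smaller distances yield larger probabilities. The sketch also makes a single pass over the stable samples to guarantee termination, so it may delete fewer than e samples.

```python
import math
from statistics import median

def mean_knn_distance(i, points, candidates, k):
    """Mean Euclidean distance from points[i] to its k nearest candidates."""
    dists = sorted(math.dist(points[i], points[j]) for j in candidates if j != i)
    nearest = dists[:k]
    if not nearest:                 # no neighbor left: treat as infinitely sparse
        return math.inf
    return sum(nearest) / len(nearest)

def selectively_delete(points, stable, e, k=3):
    """Delete up to e stable samples of one majority class, densest first.

    points : coordinates of all samples of the majority class
    stable : indices (into points) of the samples labeled stable
    Returns the set of deleted indices.
    """
    kept = set(range(len(points)))
    # Assumed reference scale: median of the mean k-NN distances (1.0 if zero).
    scale = median(mean_knn_distance(i, points, kept, k) for i in stable) or 1.0
    deleted = set()
    for i in stable:                           # select the next stable sample
        if len(deleted) >= e:                  # f has reached e
            break
        d = mean_knn_distance(i, points, kept, k)
        p_delete = 1.0 / (1.0 + d / scale)     # hypothetical: smaller d, larger p
        if p_delete >= 0.5:
            kept.discard(i)
            deleted.add(i)                     # f = f + 1
    return deleted
```

Because distances are recomputed against the samples still kept, deleting points from a dense cluster raises the mean distance of the cluster's remaining points and so lowers their deletion probability.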
In this embodiment, the above processing is applied to each majority class sample set in turn to obtain the corresponding final majority class sample set.
S108: generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
The embodiment of the present invention implements a multi-class-oriented imbalanced data preprocessing method that combines oversampling and undersampling so that the newly generated final sample set meets the needs of the classification algorithm, effectively improving classification accuracy on imbalanced data. Specifically, the embodiment allows the user to input the desired total number of samples and the desired imbalance ratio among the sample sets, calculates the ideal sample number of each sample set, and determines whether each sample set is a majority class or a minority class according to its number of samples in the final sample set. This solves the problem in conventional methods that one sample set may be both a majority class and a minority class in different binary classification algorithms. When the samples in a given sample set are processed, the sample sets of all other categories are merged and treated as one; minority class samples are divided into noise, unstable, boundary and stable samples and handled separately, and majority class samples are divided into noise, boundary and stable samples and handled separately. This solves the problem in conventional methods that one sample may belong to different categories in different binary classification algorithms, leading to conflicting processing of that sample. In summary, by applying the most suitable processing to each sample, the embodiment enables the final sample set to effectively improve the accuracy of multi-class classification algorithms.
Referring to Fig. 4, a second embodiment of the present invention provides a preprocessing device for imbalanced data, comprising:
A data reading unit 10, configured to read an original sample set, wherein the original sample set includes sample sets of at least two categories;
A data receiving unit 20, configured to receive the final sample set size and the imbalance ratio among the sample sets input by the user, so as to calculate the ideal sample number of each sample set in the final sample set;
A judging unit 30, configured to judge, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
A minority class sample classification unit 40, configured to, for the samples in each minority class sample set, count the number of other-class samples and the number of samples belonging to that minority class sample set among the k nearest neighbors of each sample, classify each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attach the corresponding label; wherein other-class samples are the samples in the sample sets other than that minority class sample set;
A minority class sample processing unit 50, configured to, for the samples in each minority class sample set, delete, keep, replicate or synthesize samples according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set;
A majority class sample classification unit 60, configured to, for the samples in each majority class sample set, count the number of samples belonging to that majority class sample set and the number of other-class samples among the k nearest neighbors of each sample, classify each sample as a noise sample, a boundary sample or a stable sample, and attach the corresponding label;
A majority class sample processing unit 70, configured to, for the samples in each majority class sample set, delete or keep each sample according to its label, to obtain the final majority class sample set corresponding to each majority class sample set;
A final sample set generation unit 80, configured to generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
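The calculation performed by the data receiving unit 20 can be illustrated with a short Python sketch. This section does not spell out the calculation, so the sketch assumes the imbalance ratio is given as integer parts per class (for example 5:3:2) and that any rounding remainder goes to the classes with the largest shares; the function name and both conventions are hypothetical.

```python
def ideal_sample_counts(final_size, imbalance_ratio):
    """Split final_size across classes in proportion to integer ratio parts.

    imbalance_ratio maps each class label to its integer part of the ratio.
    """
    total = sum(imbalance_ratio.values())
    counts = {c: final_size * r // total for c, r in imbalance_ratio.items()}
    # Assumption: hand out the rounding remainder, one sample each,
    # to the classes with the largest ratio parts.
    remainder = final_size - sum(counts.values())
    largest_first = sorted(imbalance_ratio, key=imbalance_ratio.get, reverse=True)
    for c in largest_first[:remainder]:
        counts[c] += 1
    return counts

# 1000 samples at a 5:3:2 ratio give 500, 300 and 200 samples.
print(ideal_sample_counts(1000, {"a": 5, "b": 3, "c": 2}))
```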
Preferably, the judging unit 30 is specifically configured to:
For each sample set, if its ideal sample number is greater than its actual sample number, judge the sample set to be a minority class sample set; if its ideal sample number is less than or equal to its actual sample number, judge the sample set to be a majority class sample set.
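The judging rule is a direct comparison of the two counts. A minimal Python sketch follows; the function name and the dictionary-based inputs are assumptions made for illustration.

```python
def split_minority_majority(ideal_counts, actual_counts):
    """Return (minority_classes, majority_classes) as lists of class labels.

    A class is a minority class when its ideal sample number is greater than
    its actual sample number, and a majority class otherwise.
    """
    minority, majority = [], []
    for label, ideal in ideal_counts.items():
        if ideal > actual_counts[label]:
            minority.append(label)   # needs oversampling
        else:
            majority.append(label)   # needs undersampling (or is already exact)
    return minority, majority
```

Note that a class whose actual count already equals its ideal count is treated as a majority class under this rule.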
Preferably, the minority class sample classification unit 40 is specifically configured to:
When the overwhelming majority of the k nearest neighbors of a sample in the minority class sample set are other-class samples, label the sample a noise sample;
When most of the k nearest neighbors of a sample in the minority class sample set are other-class samples, label the sample an unstable sample;
When, among the k nearest neighbors of a sample in the minority class sample set, the number of other-class samples is close to the number of samples belonging to the minority class sample set, label the sample a boundary sample;
When most of the k nearest neighbors of a sample in the minority class sample set belong to that minority class sample set, label the sample a stable sample.
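The four-way labeling of minority class samples can be sketched as a function of the two neighbor counts. The cut-off values below (`noise_ratio`, `unstable_ratio`, `boundary_ratio`) are hypothetical; this section names the categories only qualitatively ("overwhelming majority", "most", "close").

```python
def label_minority_sample(other_count, own_count,
                          noise_ratio=0.9, unstable_ratio=0.65,
                          boundary_ratio=0.35):
    """Label a minority class sample from its k-nearest-neighbor counts.

    other_count : neighbors belonging to other classes
    own_count   : neighbors belonging to the same minority class
    The three ratio thresholds are illustrative assumptions.
    """
    k = other_count + own_count
    if k == 0:
        raise ValueError("a sample needs at least one neighbor")
    other_fraction = other_count / k
    if other_fraction >= noise_ratio:
        return "noise"       # overwhelming majority are other-class
    if other_fraction >= unstable_ratio:
        return "unstable"    # most are other-class
    if other_fraction > boundary_ratio:
        return "boundary"    # the two counts are close
    return "stable"          # most belong to the same class
```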
Preferably, the minority class sample processing unit 50 is specifically configured to:
For the samples in each minority class sample set:
Delete all noise samples in the minority class sample set;
Add all unstable samples to the corresponding final minority class sample set;
Replicate each boundary sample |c - 1| times, and add the boundary sample together with its copies to the corresponding final minority class sample set; wherein c is the replication ratio, and c = (ideal sample number of the minority class sample set - number of unstable samples) / (actual sample number of the minority class sample set - number of noise samples - number of unstable samples);
For each stable sample, synthesize |c - 1| new samples from it and its surrounding samples, and add the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; wherein each synthesis randomly selects, from the k nearest neighbors of the stable sample x_i, one sample x_j belonging to the minority class sample set, and the newly synthesized sample is x_i' = x_i + (x_i - x_j) * a, where a is a random number between 0 and 1;
Calculate the number d of samples belonging to the minority class sample set that still need to be generated; wherein d = ideal sample number of the minority class sample set - current number of samples belonging to the minority class sample set in the final minority class sample set;
Randomly select d stable samples, synthesize one new sample from each selected stable sample and its surrounding samples, and add the newly synthesized samples to the corresponding final minority class sample set;
Obtain the final minority class sample set corresponding to each minority class sample set.
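The oversampling steps above can be sketched in Python as follows. Several points are assumptions rather than statements from the text: |c - 1| is rounded down to an integer, the d top-up samples are drawn with replacement, and the same-class neighbor lists are supplied by the caller. The synthesis formula follows the text as written, x_i + (x_i - x_j) * a; note that the widely used SMOTE technique interpolates toward the neighbor with (x_j - x_i) instead.

```python
import random

def synthesize(x_i, x_j, rng):
    """New sample x_i + (x_i - x_j) * a with a in [0, 1), as written in the text."""
    a = rng.random()
    return tuple(p + (p - q) * a for p, q in zip(x_i, x_j))

def process_minority(samples, tags, same_class_neighbors, ideal_count, seed=0):
    """Build the final minority class sample set for one class.

    samples : list of feature tuples
    tags    : per-sample label: 'noise', 'unstable', 'boundary' or 'stable'
    same_class_neighbors : dict mapping a stable sample's index to the indices
                           of its same-class k nearest neighbors (assumed given)
    """
    rng = random.Random(seed)
    unstable = [i for i, t in enumerate(tags) if t == "unstable"]
    boundary = [i for i, t in enumerate(tags) if t == "boundary"]
    stable = [i for i, t in enumerate(tags) if t == "stable"]
    # c = (ideal - unstable) / (actual - noise - unstable)
    c = (ideal_count - len(unstable)) / (len(boundary) + len(stable))
    extra = int(abs(c - 1))                        # assumption: round down
    final = [samples[i] for i in unstable]         # kept as is; noise is dropped
    for i in boundary:
        final.extend([samples[i]] * (1 + extra))   # original plus copies
    for i in stable:
        final.append(samples[i])
        for _ in range(extra):
            j = rng.choice(same_class_neighbors[i])
            final.append(synthesize(samples[i], samples[j], rng))
    d = ideal_count - len(final)                   # samples still missing
    for _ in range(max(d, 0)):
        i = rng.choice(stable)                     # assumption: with replacement
        j = rng.choice(same_class_neighbors[i])
        final.append(synthesize(samples[i], samples[j], rng))
    return final
```

For a minority class, the ideal sample number exceeds the actual sample number, so c is greater than 1 and the top-up count d stays below the combined number of boundary and stable samples.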
Preferably, the majority class sample classification unit 60 is specifically configured to:
When the overwhelming majority of the k nearest neighbors of a sample in the majority class sample set are other-class samples, label the sample a noise sample;
When, among the k nearest neighbors of a sample in the majority class sample set, the number of other-class samples is close to the number of samples belonging to the majority class sample set, label the sample a boundary sample;
When most of the k nearest neighbors of a sample in the majority class sample set belong to that majority class sample set, label the sample a stable sample.
Preferably, the majority class sample processing unit 70 is specifically configured to:
For the samples in each majority class sample set:
Delete the noise samples;
Keep all boundary samples;
For the stable samples, perform a selective deletion operation until e stable samples have been deleted; wherein e = actual sample number of the majority class sample set - number of noise samples - ideal sample number of the majority class sample set;
Obtain the final majority class sample set corresponding to each majority class sample set.
Preferably, performing the selective deletion operation on the stable samples until e stable samples have been deleted specifically comprises:
Repeating the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, calculate its distance to the samples among its k nearest neighbors that belong to the majority class sample set;
Calculate the probability of deleting the stable sample from the distance, where the smaller the distance, the greater the deletion probability;
If the deletion probability is greater than or equal to 0.5, delete the stable sample and update the number f of deleted stable samples;
Select the next stable sample.
A third embodiment of the present invention further provides preprocessing equipment for imbalanced data, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps described above. Alternatively, when executing the computer program, the processor implements the functions of the units in the device embodiment described above.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution process of the computer program in the multi-class-oriented imbalanced data preprocessing equipment.
The multi-class-oriented imbalanced data preprocessing equipment may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The equipment may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the schematic diagram is merely an example of the multi-class-oriented imbalanced data preprocessing equipment and does not limit it; the equipment may include more or fewer components than illustrated, combine certain components, or use different components. For example, the equipment may further include input/output devices, network access devices, buses and the like.
The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the multi-class-oriented imbalanced data preprocessing equipment and connects the various parts of the whole equipment through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the multi-class-oriented imbalanced data preprocessing equipment by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of a mobile phone (such as audio data and a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the units integrated in the multi-class-oriented imbalanced data preprocessing equipment are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the method embodiments above may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, under legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims (9)

1. A multi-class-oriented imbalanced data preprocessing method, characterized by comprising the following steps:
Reading an original sample set; wherein the original sample set includes sample sets of at least two categories;
Receiving the final sample set size and the imbalance ratio among the sample sets input by the user, so as to calculate the ideal sample number of each sample set in the final sample set;
Judging, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
For the samples in each minority class sample set, counting the number of other-class samples and the number of samples belonging to that minority class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attaching the corresponding label; wherein other-class samples are the samples in the sample sets other than that minority class sample set;
For the samples in each minority class sample set, deleting, keeping, replicating or synthesizing samples according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set;
For the samples in each majority class sample set, counting the number of samples belonging to that majority class sample set and the number of other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, a boundary sample or a stable sample, and attaching the corresponding label;
For the samples in each majority class sample set, deleting or keeping each sample according to its label, to obtain the final majority class sample set corresponding to each majority class sample set;
Generating the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
2. The multi-class-oriented imbalanced data preprocessing method according to claim 1, characterized in that judging, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority class sample set or a majority class sample set specifically comprises:
For each sample set, if its ideal sample number is greater than its actual sample number, judging the sample set to be a minority class sample set; if its ideal sample number is less than or equal to its actual sample number, judging the sample set to be a majority class sample set.
3. The multi-class-oriented imbalanced data preprocessing method according to claim 1, characterized in that, for the samples in each minority class sample set, counting the number of other-class samples and the number of samples belonging to that minority class sample set among the k nearest neighbors of each sample, classifying each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attaching the corresponding label specifically comprises:
When the overwhelming majority of the k nearest neighbors of a sample in the minority class sample set are other-class samples, labeling the sample a noise sample;
When most of the k nearest neighbors of a sample in the minority class sample set are other-class samples, labeling the sample an unstable sample;
When, among the k nearest neighbors of a sample in the minority class sample set, the number of other-class samples is close to the number of samples belonging to the minority class sample set, labeling the sample a boundary sample;
When most of the k nearest neighbors of a sample in the minority class sample set belong to that minority class sample set, labeling the sample a stable sample.
4. The multi-class-oriented imbalanced data preprocessing method according to claim 3, characterized in that, for the samples in each minority class sample set, deleting, keeping, replicating or synthesizing samples according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set, specifically comprises:
For the samples in each minority class sample set:
Deleting all noise samples in the minority class sample set;
Adding all unstable samples to the corresponding final minority class sample set;
Replicating each boundary sample |c - 1| times, and adding the boundary sample together with its copies to the corresponding final minority class sample set; wherein c is the replication ratio, and c = (ideal sample number of the minority class sample set - number of unstable samples) / (actual sample number of the minority class sample set - number of noise samples - number of unstable samples);
For each stable sample, synthesizing |c - 1| new samples from it and its surrounding samples, and adding the stable sample together with the newly synthesized samples to the corresponding final minority class sample set; wherein each synthesis randomly selects, from the k nearest neighbors of the stable sample x_i, one sample x_j belonging to the minority class sample set, and the newly synthesized sample is x_i' = x_i + (x_i - x_j) * a, where a is a random number between 0 and 1;
Calculating the number d of samples belonging to the minority class sample set that still need to be generated; wherein d = ideal sample number of the minority class sample set - current number of samples belonging to the minority class sample set in the final minority class sample set;
Randomly selecting d stable samples, synthesizing one new sample from each selected stable sample and its surrounding samples, and adding the newly synthesized samples to the corresponding final minority class sample set;
Obtaining the final minority class sample set corresponding to each minority class sample set.
5. The multi-class-oriented imbalanced data preprocessing method according to claim 3, characterized in that, for the samples in each majority class sample set, counting the number of samples belonging to that majority class sample set and the number of other-class samples among the k nearest neighbors of each sample, classifying each sample as a noise sample, a boundary sample or a stable sample, and attaching the corresponding label specifically comprises:
When the overwhelming majority of the k nearest neighbors of a sample in the majority class sample set are other-class samples, labeling the sample a noise sample;
When, among the k nearest neighbors of a sample in the majority class sample set, the number of other-class samples is close to the number of samples belonging to the majority class sample set, labeling the sample a boundary sample;
When most of the k nearest neighbors of a sample in the majority class sample set belong to that majority class sample set, labeling the sample a stable sample.
6. The multi-class-oriented imbalanced data preprocessing method according to claim 5, characterized in that, for the samples in each majority class sample set, deleting or keeping each sample according to its label, to obtain the final majority class sample set corresponding to each majority class sample set, specifically comprises:
For the samples in each majority class sample set:
Deleting the noise samples;
Keeping all boundary samples;
For the stable samples, performing a selective deletion operation until e stable samples have been deleted; wherein e = actual sample number of the majority class sample set - number of noise samples - ideal sample number of the majority class sample set;
Obtaining the final majority class sample set corresponding to each majority class sample set from the retained boundary samples and the remaining stable samples.
7. The multi-class-oriented imbalanced data preprocessing method according to claim 6, characterized in that performing the selective deletion operation on the stable samples until e stable samples have been deleted specifically comprises:
Repeating the following steps until the number f of deleted stable samples equals e:
For the currently selected stable sample, calculating its distance to the samples among its k nearest neighbors that belong to the majority class sample set;
Calculating the probability of deleting the stable sample from the distance; wherein the smaller the distance, the greater the deletion probability;
If the deletion probability is greater than or equal to 0.5, deleting the stable sample and updating the number f of deleted stable samples;
Selecting the next stable sample.
8. A multi-class-oriented imbalanced data preprocessing device, characterized by comprising:
A data reading unit, configured to read an original sample set; wherein the original sample set includes sample sets of at least two categories;
A data receiving unit, configured to receive the final sample set size and the imbalance ratio among the sample sets input by the user, so as to calculate the ideal sample number of each sample set in the final sample set;
A judging unit, configured to judge, according to the ideal sample number and the actual sample number of each sample set, whether the sample set is a minority class sample set or a majority class sample set;
A minority class sample classification unit, configured to, for the samples in each minority class sample set, count the number of other-class samples and the number of samples belonging to that minority class sample set among the k nearest neighbors of each sample, classify each sample as a noise sample, an unstable sample, a boundary sample or a stable sample, and attach the corresponding label; wherein other-class samples are the samples in the sample sets other than that minority class sample set;
A minority class sample processing unit, configured to, for the samples in each minority class sample set, delete, keep, replicate or synthesize samples according to the label of each sample, to obtain the final minority class sample set corresponding to each minority class sample set;
A majority class sample classification unit, configured to, for the samples in each majority class sample set, count the number of samples belonging to that majority class sample set and the number of other-class samples among the k nearest neighbors of each sample, classify each sample as a noise sample, a boundary sample or a stable sample, and attach the corresponding label;
A majority class sample processing unit, configured to, for the samples in each majority class sample set, delete or keep each sample according to its label, to obtain the final majority class sample set corresponding to each majority class sample set;
A final sample set generation unit, configured to generate the final sample set from the final minority class sample sets and the final majority class sample sets, thereby completing the preprocessing of the imbalanced data.
9. Multi-class-oriented imbalanced data preprocessing equipment, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the multi-class-oriented imbalanced data preprocessing method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810599969.3A CN109033148A (en) 2018-06-11 2018-06-11 One kind is towards polytypic unbalanced data preprocess method, device and equipment

Publications (1)

Publication Number Publication Date
CN109033148A true CN109033148A (en) 2018-12-18

Family

ID=64612664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810599969.3A Pending CN109033148A (en) 2018-06-11 2018-06-11 One kind is towards polytypic unbalanced data preprocess method, device and equipment

Country Status (1)

Country Link
CN (1) CN109033148A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218