CN110348486A - Imbalanced dataset transformation method and system based on sampling and feature reduction - Google Patents
- Publication number
- CN110348486A (application number CN201910508530.XA)
- Authority
- CN
- China
- Prior art keywords
- sample
- new
- dataset
- sampling
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The present invention provides an imbalanced dataset transformation method and system based on sampling and feature reduction. The method first resamples the imbalanced dataset so that the numbers of minority-class and majority-class samples become close to balanced. The features are then sorted in descending order of their correlation with the class label. Starting from the last feature in this order, one feature dimension is deleted at a time; after each deletion, the reduced dataset is fed into a random forest model and its corresponding ACC value is computed. All ACC values are compared, and the feature dimension with the maximum ACC value is chosen as the target dimension of the feature reduction. The new imbalanced dataset obtained by this transformation is fed to a multi-class SVM for training, which significantly improves classification accuracy.
Description
Technical field
The invention belongs to the field of imbalanced-data transformation techniques, and in particular relates to an imbalanced dataset transformation method and system based on sampling and feature reduction.
Background art
An imbalanced-dataset transformation method reconstructs the dataset at the data level before classification, in order to reduce the degree of imbalance and improve classification accuracy. Imbalanced-dataset classification refers to classification problems in which the amounts of data in the different classes are unequal. Taking binary classification as an example, the data samples of one class significantly outnumber those of the other class. The samples of the larger class form the majority-class sample set, and those of the smaller class form the minority-class sample set. Imbalanced data is very common in real life, for example in fields such as risk and intrusion detection, rare-disease prediction, and financial fraud detection.
The most common data-level approach is to oversample the minority-class sample set, increasing the number of minority-class samples until the dataset is relatively balanced.
Existing methods have two drawbacks: 1. existing methods for oversampling the minority-class sample set treat all minority-class samples alike, without considering that different minority-class samples matter differently to the classifier; 2. the features of the dataset strongly affect classifier performance; if the features contain few fields that are effective for the classification result, the classifier's training process incurs considerable extra complexity.
Summary of the invention
To solve the problems in the prior art, the present invention provides an imbalanced dataset transformation method based on sampling and feature reduction.
To achieve the above objective, the present invention adopts the following technical scheme:
The present invention provides an imbalanced dataset transformation method based on sampling and feature reduction, the method comprising:
obtaining an imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
sampling the imbalanced dataset to obtain a new imbalanced dataset;
applying dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset.
In a preferred technical scheme, sampling the imbalanced dataset comprises oversampling the minority-class sample set using the S-NKSMOTE algorithm, specifically:
obtaining the k nearest-neighbour samples of a sample x in the minority-class sample set;
comparing the number of minority-class samples among the k neighbours with the number of majority-class samples: when the minority-class samples outnumber the majority-class samples, x is labelled a safe sample; when the minority-class samples are fewer than the majority-class samples but at least one minority-class sample is present, x is labelled a danger sample; when all k neighbours are majority-class samples, x is labelled a noise sample;
when x is a noise sample, randomly selecting a sample x′ from the minority-class sample set and generating a new sample X_new close to the minority class as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0.5, 1) · (x′ − x)
when x is not a noise sample, randomly selecting one sample x′ from its k neighbours; if x′ belongs to the majority class, generating a new sample X_new close to x as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0, 0.5) · (x′ − x)
if x′ belongs to the minority class, generating the new sample X_new close to x according to the following equation; all new samples form the new minority-class sample set:
X_new = x + rand(0, 1) · (x′ − x)
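The labelling and synthesis steps above can be sketched as follows. This is an illustrative simplification, not the patent's implementation: plain Euclidean k-NN stands in for the kernel-space neighbour search, and the function names are ours.

```python
import numpy as np

def label_sample(x, X_min, X_maj, k):
    """Label a minority sample x (assumed to appear in X_min) as 'safe',
    'danger', or 'noise' from the class mix of its k nearest neighbours."""
    X_all = np.vstack([X_min, X_maj])
    y_all = np.array([1] * len(X_min) + [0] * len(X_maj))
    d = np.linalg.norm(X_all - x, axis=1)
    idx = np.argsort(d)[1:k + 1]          # drop index 0: x itself
    n_min = int(y_all[idx].sum())
    if n_min == 0:
        return 'noise'                     # all k neighbours are majority
    if n_min > k - n_min:
        return 'safe'                      # minority neighbours dominate
    return 'danger'                        # fewer minority, but at least one

def synthesize(x, x_prime, lo, hi, rng):
    """X_new = x + rand(lo, hi) * (x' - x), with (lo, hi) chosen per the
    three cases: (0.5, 1) for noise, (0, 0.5) toward a majority neighbour,
    (0, 1) toward a minority neighbour."""
    return x + rng.uniform(lo, hi) * (x_prime - x)
```

A noise sample is thus pulled well toward the minority class (factor at least 0.5), while a non-noise sample stays closer to x when x′ comes from the majority class.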
In a preferred technical scheme, the dimensionality reduction of the new imbalanced dataset is specifically:
analysing the correlation between the features of each class of sample in the new imbalanced dataset and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
starting from the last feature in this order, deleting one feature dimension at a time; after each deletion, feeding the reduced new imbalanced dataset into a random forest model and computing the corresponding ACC value after each reduction;
comparing all ACC values and choosing the feature dimension with the maximum ACC value as the feature dimension after feature reduction.
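The backward-deletion search described above can be sketched as follows. The implementation choices here are ours, not the patent's: scikit-learn's RandomForestClassifier, 3-fold cross-validated accuracy as the ACC value, and the function name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def best_feature_dim(X_sorted, y, seed=0):
    """X_sorted: columns already in descending order of relevance to y.
    Delete features from the tail one at a time, score each prefix with a
    random forest (ACC), and return the prefix length with the highest ACC."""
    best_dim, best_acc = X_sorted.shape[1], -1.0
    for dim in range(X_sorted.shape[1], 0, -1):   # full set down to 1 feature
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        acc = cross_val_score(clf, X_sorted[:, :dim], y, cv=3).mean()
        if acc > best_acc:
            best_dim, best_acc = dim, acc
    return best_dim, best_acc
```

When the tail features are uninformative, deleting them leaves the ACC unchanged or higher, so the search tends toward a smaller dimension.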
In a preferred technical scheme, sampling the imbalanced dataset further comprises undersampling the majority-class sample set, specifically:
obtaining the boundary sample set of the majority-class and minority-class sample sets;
obtaining the centre sample of the boundary sample set;
computing the distance of each majority-class sample in the majority-class sample set to the centre sample, and undersampling the majority-class sample set according to the computed distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set form the new imbalanced dataset.
In a preferred technical scheme, the boundary sample set of the majority-class and minority-class sample sets is obtained specifically as follows:
computing, for each majority-class sample in the majority-class sample set, its distance to its nearest minority-class sample;
computing, for each minority-class sample in the minority-class sample set, its distance to its nearest majority-class sample;
picking out the majority-class sample and the minority-class sample corresponding to the minimum of these distances;
obtaining the m nearest-neighbour samples of that majority-class sample and the n nearest-neighbour samples of that minority-class sample;
obtaining the boundary sample set D as the intersection of the two neighbourhoods, D = m ∩ n.
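The boundary-set construction above can be sketched like this; Euclidean distance again stands in for the kernel-space distance, and the names are illustrative.

```python
import numpy as np

def boundary_set(X_maj, X_min, m, n):
    """Find the closest majority/minority pair, then intersect the m nearest
    neighbours of the majority anchor with the n nearest neighbours of the
    minority anchor (D = m-NN ∩ n-NN)."""
    # pairwise distances between the two classes
    D = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmin(D), D.shape)   # closest cross-class pair
    X_all = np.vstack([X_maj, X_min])
    nn_i = set(np.argsort(np.linalg.norm(X_all - X_maj[i], axis=1))[:m])
    nn_j = set(np.argsort(np.linalg.norm(X_all - X_min[j], axis=1))[:n])
    return X_all[sorted(nn_i & nn_j)]                # samples common to both
```

The intersection keeps only samples that are simultaneously close to both anchors, i.e. samples sitting near the class boundary.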
In a preferred technical scheme, the centre sample of the boundary sample set is obtained specifically as follows:
computing, for each sample in the boundary sample set, its distances to every other sample in the boundary sample set;
computing, for each sample, the variance SD of those distances and their sum E;
computing the dispersion degree B = SD · E;
picking out the sample with the smallest dispersion degree as the centre sample.
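The centre-sample rule above is direct to implement; this minimal sketch (our naming) computes, for each boundary sample, the variance and sum of its distances to the others and minimises B = SD · E.

```python
import numpy as np

def center_sample(boundary):
    """Return the boundary sample minimising B = SD * E, where SD and E are
    the variance and the sum of its distances to every other sample."""
    D = np.linalg.norm(boundary[:, None, :] - boundary[None, :, :], axis=2)
    best, best_B = 0, np.inf
    for b in range(len(boundary)):
        d = np.delete(D[b], b)       # distances to every *other* sample
        B = d.var() * d.sum()        # dispersion degree B = SD * E
        if B < best_B:
            best, best_B = b, B
    return boundary[best]
```

A sample whose distances to the others are both small and uniform gets the smallest B, which matches the intuition of a centre.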
In a preferred technical scheme, computing the distance of each sample in the majority-class sample set to the centre sample and undersampling the majority-class sample set according to the computed distances is specifically:
computing the distance of each sample in the majority-class sample set to the centre sample;
sorting the distances in ascending order and forming an R × T matrix;
computing the relative standard deviation RSD of the distances in each row of the matrix;
comparing RSD with a threshold RSD1: when RSD ≤ RSD1, computing the mean of the row's distances and the difference between each distance in the row and that mean, and deleting the samples whose difference exceeds a threshold; when RSD > RSD1, deleting all samples of that row;
after each row's deletion, feeding the reduced majority-class sample set into a random forest model;
computing ΔGm1 = Gm_i − Gm, where Gm_i is the G_mean value output by the random forest model for the majority-class sample set after deleting the i-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
comparing ΔGm1 with a threshold ΔGm: when ΔGm1 ≥ ΔGm, stopping the undersampling; the samples at that point form the new majority-class sample set.
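The row-wise pruning rule above can be sketched as follows. This simplified version (our naming and thresholds) omits the G_mean-based stopping test that the patent layers on top, assumes the distances are positive, and keeps any leftover tail that does not fill a row.

```python
import numpy as np

def rsd_undersample(X_maj, center, rows, rsd_max, diff_max):
    """Sort majority samples by distance to the centre and split into `rows`
    rows; drop whole rows with RSD > rsd_max, and in the remaining rows drop
    samples whose |d - row mean| exceeds diff_max."""
    d = np.linalg.norm(X_maj - center, axis=1)
    order = np.argsort(d)                       # ascending distance to centre
    cols = len(order) // rows
    keep = []
    for r in range(rows):
        row = order[r * cols:(r + 1) * cols]
        dr = d[row]
        if dr.std() / dr.mean() > rsd_max:      # RSD > RSD1: delete the row
            continue
        keep.extend(row[np.abs(dr - dr.mean()) <= diff_max])
    keep.extend(order[rows * cols:])            # any leftover tail is kept
    return X_maj[sorted(keep)]
```

Rows with a high relative spread (likely mixing near and far samples) are discarded whole; in homogeneous rows only the outliers relative to the row mean are removed.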
In a preferred technical scheme, oversampling the minority-class sample set comprises computing the distance of each minority-class sample in the minority-class sample set to the centre sample, and oversampling the minority-class sample set according to the computed distances to obtain the new minority-class sample set, specifically:
computing the distance of each sample in the minority-class sample set to the centre sample;
sorting the distances in ascending order and forming an R′ × T′ matrix;
starting from the first row, oversampling the samples of each row using the S-NKSMOTE algorithm;
after each row's oversampling, feeding the sample set formed by the oversampling into a random forest model;
computing ΔGm2 = Gm_j − Gm, where Gm_j is the G_mean value output by the random forest model for the minority-class sample set formed after oversampling the j-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
comparing ΔGm2 with the threshold ΔGm: when ΔGm2 ≥ ΔGm, stopping the oversampling; the samples at that point form the new minority-class sample set.
Another aspect of the present invention provides an imbalanced dataset transformation system based on sampling and feature reduction, the transformation system comprising:
a data acquisition module that obtains the imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
a sampling module that samples the imbalanced dataset to obtain a new imbalanced dataset;
a dimensionality-reduction module that applies dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset.
A further aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the imbalanced dataset transformation method based on sampling and feature reduction provided by the invention are realised.
In the imbalanced dataset transformation method based on sampling and feature reduction provided by the invention, the samples of the imbalanced dataset are first resampled so that the number of minority-class samples approaches that of the majority-class samples, reducing the imbalance of the minority class. The features are then sorted in descending order of their correlation with the class label. Starting from the last feature in this order, one feature dimension is deleted at a time; after each deletion the reduced dataset is fed into a random forest model, whose ACC value serves as the fitness (deletion proceeds from the last dimension, the dataset being fed into the random forest once per deleted feature, until only the first feature dimension remains). All ACC values are compared, and the feature dimension with the maximum ACC value is chosen as the target dimension of the feature reduction. The new imbalanced data obtained by this transformation is fed to a multi-class SVM for training, which significantly improves classification accuracy.
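The final training step above can be sketched with scikit-learn's SVC, which handles the multi-class case via one-vs-one internally. The data here is a toy stand-in for the transformed, roughly balanced dataset, not the patent's data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy stand-in for the transformed dataset: three well-separated, balanced classes
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 2, 4)])
y = np.repeat([0, 1, 2], 30)

clf = SVC(kernel='rbf').fit(X, y)   # multi-class handled one-vs-one internally
train_acc = clf.score(X, y)
```

After the sampling and feature-reduction steps the classes are closer to balanced, which is exactly the regime where a standard SVM performs well without class weighting.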
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the imbalanced dataset transformation method based on sampling and feature reduction provided by embodiment 1 of the present invention;
Fig. 2 is a detailed flow chart of the oversampling in step S2 of embodiment 1;
Fig. 3 is a detailed flow chart of step S3 of embodiment 1;
Fig. 4 is a detailed flow chart of step S2 of embodiment 2;
Fig. 5 is a detailed flow chart of step S210;
Fig. 6 is a detailed flow chart of step S220;
Fig. 7 is a detailed flow chart of step S230;
Fig. 8 is a detailed flow chart of step S240;
Fig. 9 is a structural block diagram of the imbalanced dataset transformation system based on sampling and feature reduction provided by embodiment 3.
Specific embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
Embodiment 1
Embodiment 1 of the present invention provides an imbalanced dataset transformation method based on sampling and feature reduction. As shown in Fig. 1, the method comprises the following steps:
S1: obtaining an imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
S2: sampling the imbalanced dataset to obtain a new imbalanced dataset, including oversampling the minority-class sample set using the S-NKSMOTE algorithm; with reference to Fig. 2, specifically:
S21: obtaining the k nearest-neighbour samples of a sample x in the minority-class sample set;
here the k nearest neighbours are the k samples closest to x in kernel space, and the value of k can be set, for example to 100 or 500;
S22: comparing the number of minority-class samples among the k neighbours with the number of majority-class samples: when the minority-class samples outnumber the majority-class samples, x is labelled a safe sample; when the minority-class samples are fewer than the majority-class samples but at least one minority-class sample is present, x is labelled a danger sample; when all k neighbours are majority-class samples, x is labelled a noise sample;
S23: when x is a noise sample, randomly selecting a sample x′ from the minority-class sample set and generating a new sample X_new close to the minority class as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0.5, 1) · (x′ − x)
S24: when x is not a noise sample, randomly selecting one sample x′ from its k neighbours; if x′ belongs to the majority class, generating a new sample X_new close to x as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0, 0.5) · (x′ − x)
if x′ belongs to the minority class, generating the new sample X_new close to x according to the following equation; all new samples form the new minority-class sample set:
X_new = x + rand(0, 1) · (x′ − x)
where rand(a, b) in each step denotes a random number in the interval (a, b), with a = 0 or 0.5 and b = 0.5 or 1;
S3: applying dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset; with reference to Fig. 3, this comprises the following steps:
S31: analysing the correlation between the features of each class of sample in the new imbalanced dataset and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
the correlation analysis can follow an existing method, for example one based on information entropy or mutual information;
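As one concrete instance of the existing methods mentioned above, mutual information can supply the correlation scores. This sketch uses scikit-learn's estimator; the function name is ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, seed=0):
    """Column indices of X in descending order of mutual information with y,
    i.e. the feature ordering that step S31 asks for."""
    mi = mutual_info_classif(X, y, random_state=seed)
    return np.argsort(mi)[::-1]
```

Reordering the columns of X by this index makes the tail of the matrix the least relevant features, which the backward-deletion step then removes first.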
S32: starting from the last feature in this order, deleting one feature dimension at a time; after each deletion, feeding the reduced new imbalanced dataset into a random forest model and computing the corresponding ACC value;
here the ACC value denotes accuracy; deletion starts from the last dimension, and after each deleted feature the new imbalanced dataset is fed into the random forest once, until only the first feature dimension remains;
S33: comparing all ACC values and choosing the feature dimension with the maximum ACC value as the feature dimension after feature reduction.
For example, if the ACC value computed after the v-th deletion is the maximum, the first y − v feature dimensions are retained, where y is the dimension of the original features.
In the imbalanced dataset transformation method based on sampling and feature reduction provided by this embodiment of the invention, the S-NKSMOTE algorithm oversamples the minority-class sample set, increasing the number of minority-class samples and reducing their imbalance; correlation analysis is then combined with the random forest to realise the feature reduction of the new imbalanced dataset. The transformed dataset is fed to a multi-class SVM for training, significantly improving classification accuracy.
Embodiment 2
Embodiment 2 of the present invention provides an imbalanced dataset transformation method based on sampling and feature reduction, comprising the following steps:
S1: obtaining an imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
S2: sampling the imbalanced dataset to obtain a new imbalanced dataset; with reference to Fig. 4, step S2 specifically comprises:
S210: obtaining the boundary sample set of the majority-class and minority-class sample sets;
with reference to Fig. 5, step S210 is as follows, where all distances below are distances in kernel space:
S211: computing, for each majority-class sample in the majority-class sample set, its distance to its nearest minority-class sample;
S212: computing, for each minority-class sample in the minority-class sample set, its distance to its nearest majority-class sample;
S213: picking out the majority-class sample and the minority-class sample corresponding to the minimum of these distances;
S214: obtaining the m nearest-neighbour samples of that majority-class sample and the n nearest-neighbour samples of that minority-class sample;
where m and n are set values, positive integers greater than 1, for example 50 or 100;
S215: obtaining the boundary sample set D, D = m ∩ n.
The boundary sample set obtained is thus the intersection of the m-neighbourhood and the n-neighbourhood, formed by the samples common to both.
S220: obtaining the centre sample of the boundary sample set;
with reference to Fig. 6, step S220 is as follows:
S221: computing, for each sample in the boundary sample set, its distances to every other sample in the boundary sample set;
for a sample x_b in the boundary sample set, its distances S_bf to the other e samples are computed, the total number of samples in the boundary sample set being e + 1; b = 1, 2, ..., e, e + 1, and S_bf is the distance from x_b to x_f, f = 1, 2, ..., e, e + 1, f ≠ b;
S222: computing, for each sample, the variance SD of its distances and their sum E;
for sample x_b, SD is the variance of the distances S_b1, S_b2, ..., S_bf, ..., S_be, S_b(e+1), and E = S_b1 + S_b2 + ... + S_bf + ... + S_be + S_b(e+1), where S_b1, S_b2, ..., S_bf, ..., S_be and S_b(e+1) denote the distances from x_b to x_1, x_2, ..., x_f, ..., x_e and x_(e+1) respectively;
S223: computing the dispersion degree B, B = SD · E;
S224: picking out the sample with the smallest dispersion degree as the centre sample;
S230: computing the distance of each majority-class sample in the majority-class sample set to the centre sample, and undersampling the majority-class sample set according to the computed distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set form the new imbalanced dataset;
with reference to Fig. 7, step S230 is as follows:
S231: computing the distance of each sample in the majority-class sample set to the centre sample;
S232: sorting the distances in ascending order and forming an R × T matrix;
R and T are set values, which may be equal or different, for example 50, 100 or 200;
S233: computing the relative standard deviation RSD of the distances in each row of the matrix;
here RSD is the standard deviation of S_h1, S_h2, ..., S_hg divided by their mean, where S_h1, S_h2, ..., S_hg denote the distances of the 1st, 2nd, ..., g-th samples of the h-th row of the matrix to the centre sample, h = 1, 2, ..., T;
S234: comparing RSD with the threshold RSD1: when RSD ≤ RSD1, computing the mean of the row's distances and the difference between each distance in the row and that mean, and deleting the samples whose difference exceeds a threshold; when RSD > RSD1, deleting all samples of that row; the threshold RSD1 is a set value;
S235: after each row's deletion, feeding the reduced majority-class sample set into a random forest model; the random forest model is a trained model;
S236: computing ΔGm1 = Gm_i − Gm, where Gm_i is the G_mean value output by the random forest model for the majority-class sample set after deleting the i-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
S237: comparing ΔGm1 with the threshold ΔGm: when ΔGm1 ≥ ΔGm, stopping the undersampling; the samples at that point form the new majority-class sample set;
S240: computing the distance of each minority-class sample in the minority-class sample set to the centre sample, and oversampling the minority-class sample set according to the computed distances to obtain a new minority-class sample set;
with reference to Fig. 8, step S240 is as follows:
S241: computing the distance of each sample in the minority-class sample set to the centre sample;
S242: sorting the distances in ascending order and forming an R′ × T′ matrix;
R′ and T′ are set values, which may be equal or different, for example 50, 100 or 200;
S243: starting from the first row, oversampling the samples of each row using the S-NKSMOTE algorithm, following the method of embodiment 1 and Fig. 2;
S244: after each row's oversampling, feeding the sample set formed by the oversampling into the random forest model;
S245: computing ΔGm2 = Gm_j − Gm, where Gm_j is the G_mean value output by the random forest model for the minority-class sample set formed after oversampling the j-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
S246: comparing ΔGm2 with the threshold ΔGm: when ΔGm2 ≥ ΔGm, stopping the oversampling; the samples at that point form the new minority-class sample set.
S3: applying dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset; the detailed steps follow embodiment 1 and Fig. 3.
The imbalanced dataset transformation method based on sampling and feature reduction provided by embodiment 2 of the present invention can reduce the overfitting brought by prior-art oversampling and avoid the accidental deletion of samples carrying important information in prior-art undersampling. The transformed dataset is fed to a multi-class SVM for training, further improving classification accuracy and reducing classification time.
Embodiment 3
Embodiment 3 of the present invention provides an imbalanced dataset transformation system based on sampling and feature reduction. As shown in Fig. 9, the transformation system comprises:
a data acquisition module 1 that obtains the imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
a sampling module 2 that samples the imbalanced dataset to obtain a new imbalanced dataset;
a dimensionality-reduction module 3 that applies dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset.
With continued reference to Fig. 9, the sampling module 2 comprises:
a boundary-sample acquisition submodule 210, for obtaining the boundary sample set of the majority-class and minority-class sample sets;
where the boundary-sample acquisition submodule 210 comprises:
a first computing unit 211, for computing, for each majority-class sample in the majority-class sample set, its distance to its nearest minority-class sample;
a second computing unit 212, for computing, for each minority-class sample in the minority-class sample set, its distance to its nearest majority-class sample;
a first selection unit 213, for picking out the majority-class sample and the minority-class sample corresponding to the minimum distance;
an acquisition unit 214, for obtaining the m nearest-neighbour samples of that majority-class sample and the n nearest-neighbour samples of that minority-class sample, and obtaining the boundary sample set D, D = m ∩ n;
where m and n are set values, positive integers greater than 1, for example 50 or 100; the boundary sample set obtained is the intersection of the m-neighbourhood and the n-neighbourhood, formed by the samples common to both;
a centre-sample acquisition submodule 220, for obtaining the centre sample of the boundary sample set, the centre-sample acquisition submodule comprising:
a third computing unit 221, for computing, for each sample in the boundary sample set, its distances to every other sample in the boundary sample set;
a fourth computing unit 222, for computing, for each sample, the variance SD of its distances and their sum E;
a fifth computing unit 223, for computing the dispersion degree B, B = SD · E;
a second selection unit 224, for picking out the sample with the smallest dispersion degree as the centre sample;
an undersampling submodule 230, for computing the distance of each majority-class sample in the majority-class sample set to the centre sample and undersampling the majority-class sample set according to the computed distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set form the new imbalanced dataset;
The lack sampling handles submodule 230
6th computing unit 231: for calculating the distance of each sample distance center sample in most class sample sets;
First matrix forms unit 232: for being ranked up from small to large according to distance, then forming the matrix of R × T;
R and T is setting value, can be identical or different, can take 50,100 or 200 equivalences;
7th computing unit 233: the relative standard deviation RSD for row distance each in calculating matrix;
First comparing unit 234: it is used for relative standard deviation RSD and threshold value RSD1It is compared, as RSD≤RSD1When,
The average value of the row distance is calculated, and calculates the difference of each distance and the average value row Nei, difference is greater than threshold value pair
The sample answered is deleted;As RSD > RSD1When, the corresponding all samples of the row are deleted;Wherein threshold value RSD1For setting value;
First input unit 235: in matrix after every deletion a line sample, for most class samples after a line sample will to be reduced
This collection is input in Random Forest model;The Random Forest model is the model after training;
8th computing unit 236: for calculating Δ Gm1, Δ Gm1=Gmi- Gm, GmiTo delete the majority after the i-th row sample
Class sample set is input to the G_mean value exported in Random Forest model, and Gm is input to random gloomy for original non-equilibrium data collection
The G_mean value exported in woods model;
Second comparing unit 237: it is used for Δ Gm1It is compared with threshold value Δ Gm, as Δ Gm1When >=Δ Gm, stopping owes to adopt
Sample, sample at this time are new most class sample sets;
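The row-wise pruning rule of the undersampling submodule can be sketched as below. The random-forest/G_mean stopping loop is omitted here; `prune_rows` and both threshold parameters are illustrative names, and RSD is taken as std/mean, the usual definition of relative standard deviation.

```python
import numpy as np

def prune_rows(dist_sorted, R, T, rsd_thr, diff_thr):
    """One pass of the row-wise undersampling rule (illustrative sketch).

    dist_sorted: center distances of the majority samples, ascending,
                 reshaped row by row into an R x T matrix.
    Returns a boolean keep-mask over the first R*T samples.
    """
    mat = dist_sorted[:R * T].reshape(R, T)
    keep = np.ones((R, T), dtype=bool)
    for r in range(R):
        row = mat[r]
        rsd = row.std() / row.mean()  # relative standard deviation
        if rsd <= rsd_thr:
            # row is homogeneous: drop only samples far from the row mean
            keep[r] = np.abs(row - row.mean()) <= diff_thr
        else:
            # row is too dispersed: drop the whole row
            keep[r] = False
    return keep.ravel()
```

In the described system this pruning would be interleaved with the trained random forest: after each deleted row the reduced majority set is re-evaluated, and pruning stops once ΔGm1 = Gmi − Gm reaches the threshold ΔGm.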
Oversampling submodule 240: configured to compute the distance from each minority class sample in the minority class sample set to the central sample, and to oversample the minority class sample set according to the computed distances to obtain a new minority class sample set;
The oversampling submodule 240 comprises:
Ninth computing unit 241: configured to compute the distance from each sample in the minority class sample set to the central sample;
Second matrix forming unit 242: configured to sort the distances in ascending order and arrange them into an R × T matrix;
Oversampling unit 243: configured to oversample, starting from the first row, the samples corresponding to each row using the S-NKSMOTE algorithm; the oversampling unit specifically comprises the following subelements:
Neighbor acquisition subelement: configured to obtain the k nearest-neighbor samples of a sample x in the minority class sample set;
Comparison subelement: configured to compare the number of minority class samples with the number of majority class samples among the k nearest-neighbor samples; when the number of minority class samples exceeds the number of majority class samples, x is labeled a safe sample; when the number of minority class samples is less than the number of majority class samples but minority class samples are present, x is labeled a danger sample; when all k nearest-neighbor samples are majority class samples, x is labeled a noise sample;
First sample generation subelement: configured to, when x is a noise sample, randomly select a sample x′ from the minority class sample set and generate a new sample Xnew close to the minority class samples as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0.5, 1)(x′ − x)
Second sample generation subelement: configured to, when x is not a noise sample, randomly select one sample x′ from its k nearest-neighbor samples; if x′ belongs to the majority class, a new sample Xnew close to x is generated as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 0.5)(x′ − x)
if x′ belongs to the minority class, a new sample Xnew close to x is generated according to the following equation, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 1)(x′ − x);
wherein rand(a, b) in each step denotes a random number in the interval (a, b), with a = 0 or 0.5 and b = 0.5 or 1;
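The three S-NKSMOTE generation cases differ only in the interval from which the interpolation coefficient rand(a, b) is drawn. A short sketch of that core step (the `kind` labels are illustrative, not terms from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def synth(x, x_prime, kind):
    """Generate one synthetic minority sample for the three S-NKSMOTE cases."""
    if kind == "noise":
        # x is a noise sample: pull the new sample toward the minority x'
        lam = rng.uniform(0.5, 1.0)
    elif kind == "danger_maj":
        # x' is a majority-class neighbour: keep the new sample close to x
        lam = rng.uniform(0.0, 0.5)
    else:
        # x' is a minority-class neighbour: interpolate freely
        lam = rng.uniform(0.0, 1.0)
    return x + lam * (x_prime - x)
```

Biasing the coefficient this way keeps synthetic samples away from the majority region when the neighborhood is hostile, and away from an isolated noise point when x itself is unreliable.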
Second input unit 244: configured to input the sample set formed after oversampling into the random forest model each time the samples of a row of the matrix have been oversampled;
Tenth computing unit 245: configured to compute ΔGm2, ΔGm2 = Gmj − Gm, where Gmj is the G_mean value output by the random forest model for the minority class sample set formed after oversampling of the j-th row of samples, and Gm is the G_mean value output by the random forest model for the original unbalanced data set;
Third comparing unit 246: configured to compare ΔGm2 with the threshold ΔGm; when ΔGm2 ≥ ΔGm, the oversampling stops, and the samples at that point constitute the new minority class sample set.
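Both stopping criteria compare G_mean values returned by the trained random forest. The text does not spell the metric out, so the standard definition, the geometric mean of the per-class recalls, is assumed in this sketch:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G_mean as the geometric mean of per-class recalls (assumed standard
    definition; shown for any number of classes)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append((y_pred[mask] == c).mean())  # recall of class c
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

Because G_mean collapses to zero when any single class is missed entirely, it is a natural criterion for deciding when resampling of an unbalanced set has gone far enough.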
The dimension reduction module 3 comprises:
Analysis and sorting submodule 31: configured to analyze the correlation between the features of each class of samples in the new unbalanced data set and the corresponding class label, and to sort the features in descending order of their correlation with the class label;
Computing submodule 32: configured to delete features one dimension at a time in the sorted order, starting from the last dimension; each time a feature dimension is deleted, the new unbalanced data set with the reduced feature set is input into the random forest model, and the corresponding ACC value is computed;
Comparison submodule 33: configured to compare all ACC values and to select the feature dimensionality corresponding to the maximum ACC value, which is the feature dimensionality after feature reduction.
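The feature-reduction procedure (rank features by correlation with the label, then drop tail features while tracking ACC) might be sketched as follows. The evaluator `accuracy_of` stands in for the random-forest ACC computation and is an assumption, as is the use of Pearson correlation for the ranking:

```python
import numpy as np

def reduce_features(X, y, accuracy_of):
    """Return the indices of the retained features (illustrative sketch).

    accuracy_of(X_sub, y) is a caller-supplied evaluator, e.g. the ACC of a
    trained random forest on the reduced data set."""
    # rank features by |Pearson correlation| with the class label
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])
    order = np.argsort(-corr)  # most correlated first
    best_k, best_acc = X.shape[1], -1.0
    # delete the last (least correlated) remaining feature at each step
    for k in range(X.shape[1], 0, -1):
        acc = accuracy_of(X[:, order[:k]], y)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return order[:best_k]
```

Choosing the dimensionality with the maximum ACC, rather than stopping at the first drop, matches the "compare all ACC values" wording above.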
The unbalanced data set conversion method based on sampling and feature reduction provided by the embodiments of the present invention samples the unbalanced data set with a hybrid method combining the S-NKSMOTE algorithm and undersampling, which reduces the overfitting caused by oversampling in the prior art and avoids the accidental deletion of informative samples during undersampling in the prior art; the resulting data set is input into a multi-class SVM for training, which further improves classification accuracy and reduces classification time.
Embodiment 4
Embodiment 4 of the present invention further provides another computer-readable storage medium, which may be the computer-readable storage medium included in the memory of the above embodiments, or may exist independently without being assembled into a terminal. The computer-readable storage medium stores one or more programs, and the one or more programs are executed by one or more processors to perform the methods provided in Embodiment 1 and Embodiment 2.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to mutually, and each embodiment focuses on its differences from the other embodiments.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. An unbalanced data set conversion method based on sampling and feature reduction, characterized in that the method comprises:
obtaining an unbalanced data set, the unbalanced data set comprising a majority class sample set and a minority class sample set;
performing sampling processing on the unbalanced data set to obtain a new unbalanced data set;
performing dimension reduction on the new unbalanced data set to convert it into a feature-reduced new unbalanced data set.
2. The unbalanced data set conversion method based on sampling and feature reduction according to claim 1, characterized in that performing sampling processing on the unbalanced data set comprises oversampling the minority class sample set, including oversampling the minority class sample set with the S-NKSMOTE algorithm, specifically:
obtaining the k nearest-neighbor samples of a sample x in the minority class sample set;
comparing the number of minority class samples with the number of majority class samples among the k nearest-neighbor samples; when the number of minority class samples exceeds the number of majority class samples, labeling x a safe sample; when the number of minority class samples is less than the number of majority class samples but minority class samples are present, labeling x a danger sample; when all k nearest-neighbor samples are majority class samples, labeling x a noise sample;
when x is a noise sample, randomly selecting a sample x′ from the minority class sample set and generating a new sample Xnew close to the minority class samples as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0.5, 1)(x′ − x)
when x is not a noise sample, randomly selecting one sample x′ from its k nearest-neighbor samples; if x′ belongs to the majority class, generating a new sample Xnew close to x as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 0.5)(x′ − x)
if x′ belongs to the minority class, generating a new sample Xnew close to x according to the following equation, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 1)(x′ − x).
3. The unbalanced data set conversion method based on sampling and feature reduction according to claim 1, characterized in that performing dimension reduction on the new unbalanced data set specifically comprises:
analyzing the correlation between the features of each class of samples in the new unbalanced data set and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
deleting features one dimension at a time in the sorted order, starting from the last dimension; each time a feature dimension is deleted, inputting the new unbalanced data set with the reduced feature set into the random forest model and computing the corresponding ACC value;
comparing all ACC values and selecting the feature dimensionality corresponding to the maximum ACC value, which is the feature dimensionality after feature reduction.
4. The unbalanced data set conversion method based on sampling and feature reduction according to claim 2, characterized in that performing sampling processing on the unbalanced data set further comprises undersampling the majority class sample set, specifically:
obtaining the boundary sample set of the majority class sample set and the minority class sample set;
obtaining the central sample of the boundary sample set;
computing the distance from each majority class sample in the majority class sample set to the central sample, and undersampling the majority class sample set according to the computed distances to obtain a new majority class sample set; the new majority class sample set and the new minority class sample set form the new unbalanced data set.
5. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that obtaining the boundary sample set of the majority class sample set and the minority class sample set specifically comprises:
calculating, for each majority class sample in the majority class sample set, the distance to its nearest minority class sample;
calculating, for each minority class sample in the minority class sample set, the distance to its nearest majority class sample;
selecting the majority class sample and the minority class sample corresponding to the minimum distance;
obtaining the m nearest-neighbor samples of the selected majority class sample and the n nearest-neighbor samples of the selected minority class sample;
obtaining the boundary sample set D, D = m ∩ n.
6. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that obtaining the central sample of the boundary sample set specifically comprises:
computing, for each sample in the boundary sample set, its distances to all other samples in the boundary sample set;
computing, for each sample, the variance SD of those distances and their sum E;
computing the dispersion degree B, B = SD × E;
selecting the sample with the smallest dispersion degree as the central sample.
7. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that computing the distance from each sample in the majority class sample set to the central sample and undersampling the majority class sample set according to the computed distances specifically comprises:
computing the distance from each sample in the majority class sample set to the central sample;
sorting the distances in ascending order and arranging them into an R × T matrix;
computing the relative standard deviation RSD of the distances in each row of the matrix;
comparing the relative standard deviation RSD with a threshold RSD1; when RSD ≤ RSD1, computing the mean of the distances in that row and the difference between each distance in the row and that mean, and deleting the samples whose difference exceeds a threshold; when RSD > RSD1, deleting all samples of that row;
each time a row of samples has been deleted from the matrix, inputting the reduced majority class sample set into the random forest model;
computing ΔGm1, ΔGm1 = Gmi − Gm, where Gmi is the G_mean value output by the random forest model for the majority class sample set after deletion of the i-th row of samples, and Gm is the G_mean value output by the random forest model for the original unbalanced data set;
comparing ΔGm1 with a threshold ΔGm; when ΔGm1 ≥ ΔGm, stopping the undersampling, the samples at that point constituting the new majority class sample set.
8. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that oversampling the minority class sample set comprises computing the distance from each minority class sample in the minority class sample set to the central sample and oversampling the minority class sample set according to the computed distances to obtain a new minority class sample set, specifically comprising:
computing the distance from each sample in the minority class sample set to the central sample;
sorting the distances in ascending order and arranging them into an R′ × T′ matrix;
starting from the first row, oversampling the samples corresponding to each row using the S-NKSMOTE algorithm;
each time the samples of a row of the matrix have been oversampled, inputting the sample set formed after oversampling into the random forest model;
computing ΔGm2, ΔGm2 = Gmj − Gm, where Gmj is the G_mean value output by the random forest model for the minority class sample set formed after oversampling of the j-th row of samples, and Gm is the G_mean value output by the random forest model for the original unbalanced data set;
comparing ΔGm2 with the threshold ΔGm; when ΔGm2 ≥ ΔGm, stopping the oversampling, the samples at that point constituting the new minority class sample set.
9. An unbalanced data set conversion system based on sampling and feature reduction, characterized in that the conversion system comprises:
a data acquisition module for obtaining an unbalanced data set, the unbalanced data set comprising a majority class sample set and a minority class sample set;
a sampling processing module for performing sampling processing on the unbalanced data set to obtain a new unbalanced data set;
a dimension reduction module for performing dimension reduction on the new unbalanced data set to convert it into a feature-reduced new unbalanced data set.
10. A computer-readable storage medium on which a computer program is stored, characterized in that when the program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508530.XA CN110348486A (en) | 2019-06-13 | 2019-06-13 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
CN202010371648.5A CN112085046A (en) | 2019-06-13 | 2020-05-06 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508530.XA CN110348486A (en) | 2019-06-13 | 2019-06-13 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110348486A true CN110348486A (en) | 2019-10-18 |
Family
ID=68181860
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910508530.XA Pending CN110348486A (en) | 2019-06-13 | 2019-06-13 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
CN202010371648.5A Pending CN112085046A (en) | 2019-06-13 | 2020-05-06 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010371648.5A Pending CN112085046A (en) | 2019-06-13 | 2020-05-06 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110348486A (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076438B (en) * | 2021-04-28 | 2023-12-15 | 华南理工大学 | Classification method based on conversion from majority class to minority class under unbalanced data set |
CN113553581A (en) * | 2021-07-12 | 2021-10-26 | 华东师范大学 | Intrusion detection system for unbalanced data |
CN113901448A (en) * | 2021-09-03 | 2022-01-07 | 燕山大学 | Intrusion detection method based on convolutional neural network and lightweight gradient elevator |
CN115242431A (en) * | 2022-06-10 | 2022-10-25 | 国家计算机网络与信息安全管理中心 | Industrial Internet of things data anomaly detection method based on random forest and long-short term memory network |
CN117253095B (en) * | 2023-11-16 | 2024-01-30 | 吉林大学 | Image classification system and method based on biased shortest distance criterion |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582813B (en) * | 2009-06-26 | 2011-07-20 | 西安电子科技大学 | Distributed migration network learning-based intrusion detection system and method thereof |
CN103716204B (en) * | 2013-12-20 | 2017-02-08 | 中国科学院信息工程研究所 | Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process |
CN104598813B (en) * | 2014-12-09 | 2017-05-17 | 西安电子科技大学 | Computer intrusion detection method based on integrated study and semi-supervised SVM |
CN109150830B (en) * | 2018-07-11 | 2021-04-06 | 浙江理工大学 | Hierarchical intrusion detection method based on support vector machine and probabilistic neural network |
CN110348486A (en) * | 2019-06-13 | 2019-10-18 | 中国科学院计算机网络信息中心 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
- 2019-06-13: CN201910508530.XA filed (publication CN110348486A), status pending
- 2020-05-06: CN202010371648.5A filed (publication CN112085046A), status pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112085046A (en) * | 2019-06-13 | 2020-12-15 | 中国科学院计算机网络信息中心 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
CN113052198A (en) * | 2019-12-28 | 2021-06-29 | 中移信息技术有限公司 | Data processing method, device, equipment and storage medium |
WO2021135271A1 (en) * | 2019-12-30 | 2021-07-08 | 山东英信计算机技术有限公司 | Classification model training method and system, electronic device and storage medium |
US11762949B2 (en) | 2019-12-30 | 2023-09-19 | Shandong Yingxin Computer Technologies Co., Ltd. | Classification model training method, system, electronic device and strorage medium |
CN112036515A (en) * | 2020-11-04 | 2020-12-04 | 北京淇瑀信息科技有限公司 | Oversampling method and device based on SMOTE algorithm and electronic equipment |
CN112395558A (en) * | 2020-11-27 | 2021-02-23 | 广东电网有限责任公司肇庆供电局 | Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter |
Also Published As
Publication number | Publication date |
---|---|
CN112085046A (en) | 2020-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110348486A (en) | Based on sampling and feature brief non-equilibrium data collection conversion method and system | |
CN112308158B (en) | Multi-source field self-adaptive model and method based on partial feature alignment | |
CN111967343B (en) | Detection method based on fusion of simple neural network and extreme gradient lifting model | |
Fu et al. | Low-level feature extraction for edge detection using genetic programming | |
CN112613552B (en) | Convolutional neural network emotion image classification method combined with emotion type attention loss | |
CN112883839B (en) | Remote sensing image interpretation method based on adaptive sample set construction and deep learning | |
CN111556016B (en) | Network flow abnormal behavior identification method based on automatic encoder | |
CN102938054B (en) | Method for recognizing compressed-domain sensitive images based on visual attention models | |
Hafemann et al. | Meta-learning for fast classifier adaptation to new users of signature verification systems | |
CN112529638B (en) | Service demand dynamic prediction method and system based on user classification and deep learning | |
CN112580445B (en) | Human body gait image visual angle conversion method based on generation of confrontation network | |
CN114091661A (en) | Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm | |
CN110580510A (en) | clustering result evaluation method and system | |
CN113033567A (en) | Oracle bone rubbing image character extraction method fusing segmentation network and generation network | |
CN104615635B (en) | Palm vein classified index construction method based on direction character | |
CN108509588B (en) | Lawyer evaluation method and recommendation method based on big data | |
CN112200260B (en) | Figure attribute identification method based on discarding loss function | |
Zhang et al. | Intrusion detection model of CNN-BiLSTM algorithm based on mean control | |
CN115277159B (en) | Industrial Internet security situation assessment method based on improved random forest | |
Pandey et al. | A hierarchical clustering approach for image datasets | |
CN111582440A (en) | Data processing method based on deep learning | |
Rahmat et al. | Tree identification to calculate the amount of palm trees using haar-cascade classifier algorithm | |
CN111221915A (en) | Online learning resource quality analysis method based on CWK-means | |
CN112529637B (en) | Service demand dynamic prediction method and system based on context awareness | |
CN108427967B (en) | Real-time image clustering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20191018 |