CN110348486A - Imbalanced data set conversion method and system based on sampling and feature reduction - Google Patents

Imbalanced data set conversion method and system based on sampling and feature reduction

Info

Publication number
CN110348486A
Authority
CN
China
Prior art keywords
sample
new
data set
sampling
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910508530.XA
Other languages
Chinese (zh)
Inventor
龙春
魏金侠
万巍
赵静
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Priority to CN201910508530.XA
Publication of CN110348486A
Priority to CN202010371648.5A (CN112085046A)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides an imbalanced data set conversion method and system based on sampling and feature reduction. The method first samples the imbalanced data set so that the number of minority class samples approaches balance with the number of majority class samples. It then ranks the features in descending order of their correlation with the class label. Starting from the last-ranked feature, features are deleted one at a time in that order; after each deletion, the reduced sample data set is input to a random forest model and the corresponding accuracy (ACC) value is calculated. All ACC values are compared, and the feature dimension with the maximum ACC value is chosen as the target dimension of the feature reduction. Inputting the new imbalanced data set obtained by this conversion method into a multi-class SVM for training can significantly improve classification accuracy.

Description

Imbalanced data set conversion method and system based on sampling and feature reduction
Technical field
The invention belongs to the technical field of imbalanced data conversion, and in particular relates to an imbalanced data set conversion method and system based on sampling and feature reduction.
Background art
An imbalanced data set conversion method reconstructs the data set at the data level when an imbalanced data set is to be classified, so as to reduce the degree of imbalance and improve classification accuracy. Imbalanced data set classification refers to the classification problem in which the classes contain unequal numbers of samples. Taking binary classification as an example, the samples of one class far outnumber those of the other class. The samples with the larger proportion form the majority class sample set, and the samples with the smaller proportion form the minority class sample set. Imbalanced data are very common in real-world applications, in fields such as intrusion detection, rare disease prediction, and financial fraud detection.
The most common approach is to oversample the minority class sample set at the data level, adding minority class samples so that the data set becomes relatively balanced.
Existing methods have two problems. 1. Existing methods for oversampling the minority class sample set treat all minority class samples alike, without considering that different minority class samples differ in their importance to the classifier. 2. The features of the data set strongly influence classifier performance; if the features include many fields that contribute nothing to the classification result, they add considerable complexity to the training of the classifier.
Summary of the invention
To solve the problems in the prior art, the present invention provides an imbalanced data set conversion method based on sampling and feature reduction.
To achieve the above object, the present invention adopts the following technical solution:
The present invention provides an imbalanced data set conversion method based on sampling and feature reduction, the method comprising:
Obtaining an imbalanced data set, the imbalanced data set comprising a majority class sample set and a minority class sample set;
Sampling the imbalanced data set to obtain a new imbalanced data set;
Performing dimensionality reduction on the new imbalanced data set to convert it into a feature-reduced new imbalanced data set.
In a preferred technical solution, sampling the imbalanced data set includes sampling the minority class sample set, which includes oversampling the minority class sample set with the S-NKSMOTE algorithm, specifically:
Obtain the k nearest-neighbour samples of a sample x in the minority class sample set;
Compare the number of minority class samples with the number of majority class samples among the k nearest-neighbour samples: when the minority class samples outnumber the majority class samples, label x a safe sample; when the minority class samples are fewer than the majority class samples but at least one minority class sample is present, label x a dangerous sample; when the k nearest-neighbour samples are all majority class samples, label x a noise sample;
When x is a noise sample, randomly select a sample x' from the minority class sample set and generate a new sample $X_{new}$ close to that minority class sample as follows; all new samples form the new minority class sample set;
$X_{new} = x + \mathrm{rand}(0.5, 1)\,(x' - x)$
When x is not a noise sample, randomly select one sample x' from its k nearest-neighbour samples; if x' is a majority class sample, generate a new sample $X_{new}$ close to x as follows; all new samples form the new minority class sample set;
$X_{new} = x + \mathrm{rand}(0, 0.5)\,(x' - x)$
If x' is a minority class sample, generate a new sample $X_{new}$ close to x according to the following equation; all new samples form the new minority class sample set:
$X_{new} = x + \mathrm{rand}(0, 1)\,(x' - x)$.
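As an illustration only, the labelling and generation rules above can be sketched in Python. This is a minimal sketch, not the patented implementation: the function names (`k_nearest`, `label_sample`, `synthesize`) are hypothetical, plain Euclidean distance is assumed, a tie between the two class counts is treated as "danger" (the text does not cover that case), and x itself is excluded when a partner is drawn for a noise sample.

```python
import math
import random

def k_nearest(x, pool, k):
    """Return the k (point, label) pairs of pool closest to x, excluding x itself."""
    others = [p for p in pool if p[0] is not x]
    return sorted(others, key=lambda p: math.dist(x, p[0]))[:k]

def label_sample(neighbours):
    """Label a minority sample as safe, danger or noise from its neighbours' classes."""
    n_min = sum(1 for _, lab in neighbours if lab == "min")
    n_maj = len(neighbours) - n_min
    if n_min == 0:
        return "noise"                      # all neighbours are majority samples
    # Assumption: a tie (n_min == n_maj) is treated as "danger".
    return "safe" if n_min > n_maj else "danger"

def synthesize(x, minority, pool, k, rng):
    """Generate one new sample X_new = x + rand(lo, hi) * (x' - x)."""
    neighbours = k_nearest(x, pool, k)
    if label_sample(neighbours) == "noise":
        # Partner drawn from the minority set (assumption: x itself is excluded).
        candidates = [m for m in minority if m is not x] or [x]
        partner = rng.choice(candidates)
        lo, hi = 0.5, 1.0
    else:
        partner, lab = rng.choice(neighbours)
        lo, hi = (0.0, 0.5) if lab == "maj" else (0.0, 1.0)
    t = rng.uniform(lo, hi)
    return [xi + t * (pi - xi) for xi, pi in zip(x, partner)]

# Toy data: two clustered minority points and one isolated among majority points.
minority = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
majority = [[5.0, 5.2], [5.2, 5.0], [4.8, 5.0], [1.0, 1.0]]
pool = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
rng = random.Random(0)
for x in minority:
    print(label_sample(k_nearest(x, pool, 3)), synthesize(x, minority, pool, 3, rng))
```

The three interpolation ranges are the only place where the sample label changes the behaviour: a noise sample is pulled at least halfway towards another minority sample, while a sample paired with a majority neighbour stays within the half of the segment nearest to x.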
In a preferred technical solution, the dimensionality reduction of the new imbalanced data set is specifically:
Analyse the correlation between the features of each class of samples in the new imbalanced data set and the corresponding class label, and rank the features in descending order of their correlation with the class label;
Starting from the last-ranked feature, delete one feature at a time in that order; after each deletion, input the new imbalanced data set with that feature removed into the random forest model, and calculate the ACC value corresponding to the new imbalanced data set after each deletion;
Compare all ACC values and choose the feature dimension corresponding to the maximum ACC value, which is the feature dimension after feature reduction.
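The ranking-and-deletion loop above can be sketched as follows. This is a hedged illustration under stated assumptions: the names `pearson_abs`, `loo_1nn_accuracy` and `reduce_features` are hypothetical; absolute Pearson correlation stands in for the unspecified correlation measure; and, to keep the sketch dependency-free, a leave-one-out 1-nearest-neighbour accuracy is swapped in for the random forest ACC. Any scorer with the same `(X, y) -> accuracy` shape, such as a random forest, could be passed instead.

```python
import math

def pearson_abs(col, labels):
    """Absolute Pearson correlation between one feature column and numeric labels."""
    n = len(col)
    mc, ml = sum(col) / n, sum(labels) / n
    cov = sum((c - mc) * (l - ml) for c, l in zip(col, labels))
    var_c = sum((c - mc) ** 2 for c in col)
    var_l = sum((l - ml) ** 2 for l in labels)
    return abs(cov) / math.sqrt(var_c * var_l) if var_c and var_l else 0.0

def loo_1nn_accuracy(X, y):
    """Stand-in scorer (NOT a random forest): leave-one-out 1-NN accuracy."""
    hits = 0
    for i, xi in enumerate(X):
        j = min((j for j in range(len(X)) if j != i),
                key=lambda j: math.dist(xi, X[j]))
        hits += y[j] == y[i]
    return hits / len(X)

def reduce_features(X, y, score_fn):
    """Rank features by correlation, delete from the tail, keep the best prefix."""
    n_feat = len(X[0])
    order = sorted(range(n_feat),
                   key=lambda f: pearson_abs([row[f] for row in X], y),
                   reverse=True)
    best_score, best_dim = -1.0, n_feat
    # dim is the number of features left after deleting 1, 2, ... trailing ones.
    for dim in range(n_feat - 1, 0, -1):
        subset = [[row[f] for f in order[:dim]] for row in X]
        score = score_fn(subset, y)
        if score > best_score:
            best_score, best_dim = score, dim
    return order[:best_dim], best_score

# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.0, 3.0, 1.0], [0.1, -2.0, 4.0], [0.2, 5.0, -3.0],
     [1.0, -4.0, 2.0], [1.1, 4.0, -1.0], [1.2, -3.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
print(reduce_features(X, y, loo_1nn_accuracy))
```

As in the text, the score is computed only after each deletion, so the full feature set itself is never scored; on ties the sketch keeps the larger feature set.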
In a preferred technical solution, sampling the imbalanced data set further includes undersampling the majority class sample set, specifically:
Obtain the boundary sample set of the majority class sample set and the minority class sample set;
Obtain the central sample of the boundary sample set;
Calculate the distance from each majority class sample in the majority class sample set to the central sample, and undersample the majority class sample set according to the calculated distances to obtain a new majority class sample set; the new majority class sample set and the new minority class sample set together form the new imbalanced data set.
In a preferred technical solution, the boundary sample set of the majority class sample set and the minority class sample set is obtained as follows:
For each majority class sample in the majority class sample set, calculate its distance to the nearest minority class sample;
For each minority class sample in the minority class sample set, calculate its distance to the nearest majority class sample;
Select the majority class sample and the minority class sample corresponding to the minimum distance;
Obtain the m nearest-neighbour samples of that majority class sample and the n nearest-neighbour samples of that minority class sample;
Obtain the boundary sample set D, D = m ∩ n, that is, the samples common to the m nearest-neighbour samples and the n nearest-neighbour samples.
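The boundary-set construction above can be sketched as follows. This is a minimal sketch under assumptions the text does not settle: the names `nearest_cross_pair`, `neighbours` and `boundary_set` are hypothetical, Euclidean distance is used, neighbours are drawn from the union of both classes, and a sample is not counted among its own neighbours. Picking the minimum over both per-sample nearest distances is equivalent to finding the closest majority/minority pair, which is what the sketch does directly.

```python
import math

def nearest_cross_pair(majority, minority):
    """Return the (majority, minority) pair with the smallest cross-class distance."""
    return min(((a, b) for a in majority for b in minority),
               key=lambda pair: math.dist(pair[0], pair[1]))

def neighbours(x, pool, k):
    """The k samples of pool closest to x (x itself excluded)."""
    others = [p for p in pool if p != x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]

def boundary_set(majority, minority, m, n):
    """D = (m neighbours of the majority sample) ∩ (n neighbours of the minority sample)."""
    a, b = nearest_cross_pair(majority, minority)
    pool = majority + minority          # assumption: neighbours come from both classes
    return sorted(set(neighbours(a, pool, m)) & set(neighbours(b, pool, n)))

majority = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1), (2.9, 0.0)]
minority = [(3.5, 0.0), (4.7, 0.0), (6.0, 0.0)]
print(boundary_set(majority, minority, m=3, n=3))
```

Samples are tuples here so that the intersection can be taken with ordinary set operations.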
In a preferred technical solution, the central sample of the boundary sample set is obtained as follows:
For each sample in the boundary sample set, find its distance to every other sample in the boundary sample set;
For each sample, calculate the variance SD of its distances and the sum E of its distances;
Calculate the dispersion degree B, B = SD * E;
Select the sample with the smallest dispersion degree as the central sample.
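A small sketch of the dispersion-degree rule B = SD * E follows. It assumes Euclidean distance and the population variance (`statistics.pvariance`); the text does not say which variance normalisation is intended, and the function name `center_sample` is hypothetical.

```python
import math
import statistics

def center_sample(boundary):
    """Return (sample, B) for the boundary sample with the smallest B = SD * E."""
    best, best_b = None, math.inf
    for i, x in enumerate(boundary):
        # Distances from x to every other boundary sample.
        dists = [math.dist(x, y) for j, y in enumerate(boundary) if j != i]
        sd = statistics.pvariance(dists)   # assumption: population variance
        e = sum(dists)                     # sum of distances
        b = sd * e                         # dispersion degree
        if b < best_b:
            best, best_b = x, b
    return best, best_b

# Four corners of a square plus its midpoint: the midpoint is equidistant
# from all corners, so its variance, and therefore B, is zero.
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
print(center_sample(square))
```

A sample scores low when it is both close to the others (small E) and roughly equidistant from them (small SD), which is why the product favours a geometrically central point.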
In a preferred technical solution, calculating the distance from each sample in the majority class sample set to the central sample and undersampling the majority class sample set according to the calculated distances is specifically:
Calculate the distance from each sample in the majority class sample set to the central sample;
Sort the distances in ascending order and arrange them into an R × T matrix;
Calculate the relative standard deviation RSD of the distances in each row of the matrix;
Compare the relative standard deviation RSD with a threshold $RSD_1$: when $RSD \le RSD_1$, calculate the average of the distances in the row and the difference between each distance in the row and that average, and delete the samples whose difference exceeds a threshold; when $RSD > RSD_1$, delete all samples of the row;
Each time a row of the matrix has been processed in this way, input the majority class sample set with that row's deletions applied into the random forest model;
Calculate $\Delta Gm_1$, $\Delta Gm_1 = Gm_i - Gm$, where $Gm_i$ is the G_mean value output by the random forest model for the majority class sample set after deletion of the i-th row samples, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
Compare $\Delta Gm_1$ with a threshold $\Delta Gm$; when $\Delta Gm_1 \ge \Delta Gm$, stop undersampling, and the samples at that point form the new majority class sample set.
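The per-row deletion rule can be sketched as below; the random-forest stopping check is left out. The sketch rests on assumptions not fixed by the text: the names `rows_of` and `filter_row` are hypothetical, the sorted distances are laid out row by row, RSD is taken as population standard deviation divided by mean, the "difference" from the row average is read as an absolute difference, and the row mean is assumed to be non-zero.

```python
import statistics

def rows_of(sorted_items, row_len):
    """Split a sorted list into consecutive rows of row_len items (row-major layout)."""
    return [sorted_items[i:i + row_len]
            for i in range(0, len(sorted_items), row_len)]

def filter_row(row, rsd_max, diff_max):
    """row is a list of (distance, sample) pairs; return the samples that are kept."""
    dists = [d for d, _ in row]
    mean = statistics.mean(dists)               # assumed non-zero
    rsd = statistics.pstdev(dists) / mean       # relative standard deviation
    if rsd > rsd_max:
        return []                               # RSD > RSD1: delete the whole row
    # RSD <= RSD1: delete only samples far from the row average.
    return [s for d, s in row if abs(d - mean) <= diff_max]

row = [(1.0, "a"), (1.1, "b"), (0.9, "c"), (1.6, "d")]
print(filter_row(row, rsd_max=0.3, diff_max=0.3))   # "d" lies too far from the mean
print(filter_row(row, rsd_max=0.1, diff_max=0.3))   # row too scattered: all deleted
```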
In a preferred technical solution, oversampling the minority class sample set consists of calculating the distance from each minority class sample in the minority class sample set to the central sample and oversampling the minority class sample set according to the calculated distances to obtain a new minority class sample set, specifically including:
Calculate the distance from each sample in the minority class sample set to the central sample;
Sort the distances in ascending order and arrange them into an R' × T' matrix;
Starting from the first row, oversample the samples of each row with the S-NKSMOTE algorithm;
Each time the samples of a row of the matrix have been oversampled, input the resulting sample set into the random forest model;
Calculate $\Delta Gm_2$, $\Delta Gm_2 = Gm_j - Gm$, where $Gm_j$ is the G_mean value output by the random forest model for the minority class sample set formed after oversampling the j-th row samples, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
Compare $\Delta Gm_2$ with the threshold $\Delta Gm$; when $\Delta Gm_2 \ge \Delta Gm$, stop oversampling, and the samples at that point form the new minority class sample set.
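The stopping test shared by the undersampling and oversampling loops can be sketched as follows. The text does not define G_mean; the sketch assumes the usual definition, the geometric mean of the per-class recalls, and the function names `g_mean` and `should_stop` are hypothetical. The predictions would come from the random forest model, which is not reproduced here.

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition of G_mean)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return math.prod(recalls) ** (1 / len(recalls))

def should_stop(gm_after, gm_base, delta_threshold):
    """Stop sampling once the G_mean gain reaches the threshold (delta >= threshold)."""
    return (gm_after - gm_base) >= delta_threshold

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(g_mean(y_true, y_pred))            # recalls are 0.75 and 0.5
print(should_stop(0.80, 0.70, 0.05))     # gain of about 0.10 reaches the threshold
```

Unlike plain accuracy, this measure drops to zero as soon as one class is never predicted correctly, which makes it a natural stopping signal for resampling on imbalanced data.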
Another aspect of the present invention provides an imbalanced data set conversion system based on sampling and feature reduction, the conversion system comprising:
A data acquisition module for obtaining an imbalanced data set, the imbalanced data set comprising a majority class sample set and a minority class sample set;
A sampling processing module for sampling the imbalanced data set to obtain a new imbalanced data set;
A dimensionality reduction processing module for performing dimensionality reduction on the new imbalanced data set to convert it into a feature-reduced new imbalanced data set.
A further aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the imbalanced data set conversion method based on sampling and feature reduction provided by the present invention.
The imbalanced data set conversion method based on sampling and feature reduction provided by the present invention first samples the imbalanced data set so that the number of minority class samples approaches balance with the number of majority class samples, reducing the imbalance of the minority class samples. The features are then ranked in descending order of their correlation with the class label. Starting from the last-ranked feature, features are deleted one at a time in that order; after each deletion, the reduced sample data set is input to the random forest model, which calculates the corresponding ACC value as the fitness. Deletion and evaluation repeat, once per deleted feature, until only the first-ranked feature remains. All ACC values are compared, and the feature dimension with the maximum ACC value is chosen as the target dimension of the feature reduction. Inputting the new imbalanced data set obtained by this conversion method into a multi-class SVM for training can significantly improve classification accuracy.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of the imbalanced data set conversion method based on sampling and feature reduction provided by Embodiment 1 of the present invention;
Fig. 2 is a detailed flow chart of the oversampling in step S2 of Embodiment 1;
Fig. 3 is a detailed flow chart of step S3 of Embodiment 1;
Fig. 4 is a detailed flow chart of step S2 of Embodiment 2;
Fig. 5 is a detailed flow chart of step S210;
Fig. 6 is a detailed flow chart of step S220;
Fig. 7 is a detailed flow chart of step S230;
Fig. 8 is a detailed flow chart of step S240;
Fig. 9 is a structural block diagram of the imbalanced data set conversion system based on sampling and feature reduction provided by Embodiment 3.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
Embodiment 1 of the present invention provides an imbalanced data set conversion method based on sampling and feature reduction. As shown in Fig. 1, the method includes the following steps:
S1: obtain an imbalanced data set, the imbalanced data set comprising a majority class sample set and a minority class sample set;
S2: sample the imbalanced data set to obtain a new imbalanced data set, including oversampling the minority class sample set with the S-NKSMOTE algorithm; referring to Fig. 2, specifically:
S21: obtain the k nearest-neighbour samples of a sample x in the minority class sample set;
The k nearest-neighbour samples are the k samples closest to sample x in kernel space; the value of k can be set as needed, for example 100 or 500;
S22: compare the number of minority class samples with the number of majority class samples among the k nearest-neighbour samples: when the minority class samples outnumber the majority class samples, label x a safe sample; when the minority class samples are fewer than the majority class samples but at least one minority class sample is present, label x a dangerous sample; when the k nearest-neighbour samples are all majority class samples, label x a noise sample;
S23: when x is a noise sample, randomly select a sample x' from the minority class sample set and generate a new sample $X_{new}$ close to that minority class sample as follows; all new samples form the new minority class sample set;
$X_{new} = x + \mathrm{rand}(0.5, 1)\,(x' - x)$
S24: when x is not a noise sample, randomly select one sample x' from its k nearest-neighbour samples; if x' is a majority class sample, generate a new sample $X_{new}$ close to x as follows; all new samples form the new minority class sample set;
$X_{new} = x + \mathrm{rand}(0, 0.5)\,(x' - x)$
If x' is a minority class sample, generate a new sample $X_{new}$ close to x according to the following equation; all new samples form the new minority class sample set;
$X_{new} = x + \mathrm{rand}(0, 1)\,(x' - x)$;
Here rand(a, b) in each step denotes a random number in the interval (a, b), with a = 0 or 0.5 and b = 0.5 or 1;
S3: perform dimensionality reduction on the new imbalanced data set to convert it into a feature-reduced new imbalanced data set; referring to Fig. 3, this specifically includes the following steps:
S31: analyse the correlation between the features of each class of samples in the new imbalanced data set and the corresponding class label, and rank the features in descending order of their correlation with the class label;
The correlation analysis can follow existing methods, for example methods based on information entropy or mutual information;
S32: starting from the last-ranked feature, delete one feature at a time in that order; after each deletion, input the new imbalanced data set with that feature removed into the random forest model, and calculate the ACC value corresponding to the new imbalanced data set after each deletion;
Here the ACC value denotes accuracy. Deletion starts from the last-ranked feature, and after every deletion the new imbalanced data set is input to the random forest and evaluated once, until only the first-ranked feature remains;
S33: compare all ACC values and choose the feature dimension corresponding to the maximum ACC value, which is the feature dimension after feature reduction.
For example, if the ACC value calculated after the v-th deletion is the largest, the first y - v features are retained, where y is the dimension of the original features.
The imbalanced data set conversion method based on sampling and feature reduction provided by this embodiment oversamples the minority class sample set with the S-NKSMOTE algorithm, which increases the number of minority class samples and reduces their imbalance. It then combines correlation analysis with a random forest to perform feature reduction on the new imbalanced data set. The converted data set is input to a multi-class SVM for training, which significantly improves classification accuracy.
Embodiment 2
Embodiment 2 of the present invention provides an imbalanced data set conversion method based on sampling and feature reduction, the method including the following steps:
S1: obtain an imbalanced data set, the imbalanced data set comprising a majority class sample set and a minority class sample set;
S2: sample the imbalanced data set to obtain a new imbalanced data set; referring to Fig. 4, step S2 specifically includes:
S210: obtain the boundary sample set of the majority class sample set and the minority class sample set;
Referring to Fig. 5, step S210 is as follows, where all distances mentioned below are distances in kernel space;
S211: for each majority class sample in the majority class sample set, calculate its distance to the nearest minority class sample;
S212: for each minority class sample in the minority class sample set, calculate its distance to the nearest majority class sample;
S213: select the majority class sample and the minority class sample corresponding to the minimum distance;
S214: obtain the m nearest-neighbour samples of that majority class sample and the n nearest-neighbour samples of that minority class sample;
Here m and n are set values, each a positive integer greater than 1, for example 50 or 100;
S215: obtain the boundary sample set D, D = m ∩ n.
The boundary sample set obtained is the intersection of the m nearest-neighbour samples and the n nearest-neighbour samples, that is, it consists of the samples that appear in both.
S220: obtain the central sample of the boundary sample set;
Referring to Fig. 6, step S220 is specifically:
S221: for each sample in the boundary sample set, find its distance to every other sample in the boundary sample set;
Find the distances $S_{bf}$ from sample $x_b$ in the boundary sample set to the other e samples, where the boundary sample set contains e + 1 samples in total; b = 1, 2, ..., e, e+1; $S_{bf}$ is the distance from sample $x_b$ to sample $x_f$, f = 1, 2, ..., e, e+1, f ≠ b;
S222: for each sample, calculate the variance SD of its distances and the sum E of its distances;
For sample $x_b$, SD is the variance of the distances $S_{bf}$ (f ≠ b) and E is their sum, $E = S_{b1} + S_{b2} + \dots + S_{bf} + \dots + S_{be} + S_{b(e+1)}$, where $S_{b1}$, $S_{b2}$, ..., $S_{bf}$, ..., $S_{be}$ and $S_{b(e+1)}$ denote the distances from sample $x_b$ to samples $x_1$, $x_2$, ..., $x_f$, ..., $x_e$ and $x_{e+1}$ respectively;
S223: calculate the dispersion degree B, B = SD * E;
S224: select the sample with the smallest dispersion degree as the central sample;
S230: calculate the distance from each majority class sample in the majority class sample set to the central sample, and undersample the majority class sample set according to the calculated distances to obtain a new majority class sample set; the new majority class sample set and the new minority class sample set together form the new imbalanced data set;
Referring to Fig. 7, step S230 is specifically:
S231: calculate the distance from each sample in the majority class sample set to the central sample;
S232: sort the distances in ascending order and arrange them into an R × T matrix;
R and T are set values, which may be equal or different, for example 50, 100 or 200;
S233: calculate the relative standard deviation RSD of the distances in each row of the matrix;
Here RSD is the relative standard deviation (the standard deviation divided by the mean) of the distances in row h, and $S_{h1}$, $S_{h2}$, ..., $S_{hg}$ denote the distances from the 1st, 2nd, ..., g-th samples of row h of the matrix to the central sample; h = 1, 2, ..., T;
S234: compare the relative standard deviation RSD with a threshold $RSD_1$: when $RSD \le RSD_1$, calculate the average of the distances in the row and the difference between each distance in the row and that average, and delete the samples whose difference exceeds a threshold; when $RSD > RSD_1$, delete all samples of the row; the threshold $RSD_1$ is a set value;
S235: each time a row of the matrix has been processed in this way, input the majority class sample set with that row's deletions applied into the random forest model; the random forest model is a trained model;
S236: calculate $\Delta Gm_1$, $\Delta Gm_1 = Gm_i - Gm$, where $Gm_i$ is the G_mean value output by the random forest model for the majority class sample set after deletion of the i-th row samples, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
S237: compare $\Delta Gm_1$ with a threshold $\Delta Gm$; when $\Delta Gm_1 \ge \Delta Gm$, stop undersampling, and the samples at that point form the new majority class sample set;
S240: calculate the distance from each minority class sample in the minority class sample set to the central sample, and oversample the minority class sample set according to the calculated distances to obtain a new minority class sample set;
Referring to Fig. 8, step S240 is specifically:
S241: calculate the distance from each sample in the minority class sample set to the central sample;
S242: sort the distances in ascending order and arrange them into an R' × T' matrix;
R' and T' are set values, which may be equal or different, for example 50, 100 or 200;
S243: starting from the first row, oversample the samples of each row with the S-NKSMOTE algorithm, following the method of Embodiment 1 and Fig. 2;
S244: each time the samples of a row of the matrix have been oversampled, input the resulting sample set into the random forest model;
S245: calculate $\Delta Gm_2$, $\Delta Gm_2 = Gm_j - Gm$, where $Gm_j$ is the G_mean value output by the random forest model for the minority class sample set formed after oversampling the j-th row samples, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
S246: compare $\Delta Gm_2$ with the threshold $\Delta Gm$; when $\Delta Gm_2 \ge \Delta Gm$, stop oversampling, and the samples at that point form the new minority class sample set.
S3: perform dimensionality reduction on the new imbalanced data set to convert it into a feature-reduced new imbalanced data set; the specific steps are the same as in Embodiment 1 and Fig. 3.
The imbalanced data set conversion method based on sampling and feature reduction provided by Embodiment 2 of the present invention can reduce the overfitting caused by oversampling in the prior art and avoids the accidental deletion of samples carrying important information during undersampling in the prior art. The converted data set is input to a multi-class SVM for training, which further improves classification accuracy and reduces classification time.
Embodiment 3
Embodiment 3 of the present invention provides an imbalanced data set conversion system based on sampling and feature reduction. As shown in Fig. 9, the conversion system includes:
A data acquisition module 1 for obtaining an imbalanced data set, the imbalanced data set comprising a majority class sample set and a minority class sample set;
A sampling processing module 2 for sampling the imbalanced data set to obtain a new imbalanced data set;
A dimensionality reduction processing module 3 for performing dimensionality reduction on the new imbalanced data set to convert it into a feature-reduced new imbalanced data set.
With continued reference to Fig. 9, the sampling processing module 2 includes:
Boundary sample acquisition submodule 210: for obtaining the boundary sample set of the majority class sample set and the minority class sample set; the boundary sample acquisition submodule 210 includes:
First computing unit 211: for calculating, for each majority class sample in the majority class sample set, its distance to the nearest minority class sample;
Second computing unit 212: for calculating, for each minority class sample in the minority class sample set, its distance to the nearest majority class sample;
First selection unit 213: for selecting the majority class sample and the minority class sample corresponding to the minimum distance;
Acquisition unit 214: for obtaining the m nearest-neighbour samples of that majority class sample and the n nearest-neighbour samples of that minority class sample, and obtaining the boundary sample set D, D = m ∩ n;
Here m and n are set values, each a positive integer greater than 1, for example 50 or 100; the boundary sample set obtained is the intersection of the m nearest-neighbour samples and the n nearest-neighbour samples, that is, it consists of the samples that appear in both.
Central sample acquisition submodule 220: for obtaining the central sample of the boundary sample set. The central sample acquisition submodule includes:
Third computing unit 221: for calculating, for each sample in the boundary sample set, its distance to every other sample in the boundary sample set;
Fourth computing unit 222: for calculating, for each sample, the variance SD and the sum E of its distances;
Fifth computing unit 223: for calculating the dispersion degree B, B = SD × E;
Second selection unit 224: for selecting the sample with the smallest dispersion degree as the central sample;
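As an illustrative sketch only, the selection of the central sample by the dispersion degree B = SD × E could look as follows. The name `center_sample`, the Euclidean metric, and the use of the population variance for SD are assumptions of the example.

```python
import numpy as np

def center_sample(D):
    """Return (index of the central sample, dispersion degree of every sample)."""
    n = len(D)
    # Pairwise distances inside the boundary sample set.
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Drop each sample's zero distance to itself, leaving n-1 distances per row.
    others = dist[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    SD = others.var(axis=1)   # variance of the distances (assumption: population variance)
    E = others.sum(axis=1)    # sum of the distances
    B = SD * E                # dispersion degree
    return int(np.argmin(B)), B
```

Because B multiplies the variance by the sum, a sample whose distances to the others are nearly equal receives a small B even when those distances are large, so the result depends on how compact the boundary sample set is.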
Undersampling submodule 230: for calculating the distance from each majority-class sample in the majority-class sample set to the central sample, and undersampling the majority-class sample set according to the calculated distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set together form the new imbalanced data set;
The undersampling submodule 230 includes:
Sixth computing unit 231: for calculating the distance from each sample in the majority-class sample set to the central sample;
First matrix forming unit 232: for sorting the distances in ascending order and then forming an R × T matrix;
R and T are preset values, which may be the same or different, for example 50, 100, or 200;
Seventh computing unit 233: for calculating the relative standard deviation RSD of the distances in each row of the matrix;
First comparing unit 234: for comparing the relative standard deviation RSD with a threshold RSD_1; when RSD ≤ RSD_1, the average of the distances in the row is calculated, the difference between each distance in the row and that average is calculated, and the samples whose difference exceeds a threshold are deleted; when RSD > RSD_1, all samples corresponding to the row are deleted; the threshold RSD_1 is a preset value;
First input unit 235: for inputting, each time a row of samples has been deleted from the matrix, the majority-class sample set reduced by that row into a random forest model; the random forest model is a trained model;
Eighth computing unit 236: for calculating ΔGm_1, ΔGm_1 = Gm_i − Gm, where Gm_i is the G_mean value output by the random forest model for the majority-class sample set after the i-th row of samples has been deleted, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
Second comparing unit 237: for comparing ΔGm_1 with a threshold ΔGm; when ΔGm_1 ≥ ΔGm, undersampling stops, and the samples at that point form the new majority-class sample set;
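The row-wise undersampling rule can be sketched, under stated assumptions, as follows. The names `undersample_by_rows`, `rsd_max` (for RSD_1), `diff_max` (for the unnamed difference threshold), `gain`, and `gain_min` (for ΔGm) are hypothetical. RSD is taken here as the population standard deviation divided by the mean, the last row is allowed to be shorter than T, and the random forest evaluation is replaced by a caller-supplied callback `gain` standing in for Gm_i − Gm.

```python
import numpy as np

def undersample_by_rows(dist, T, rsd_max, diff_max, gain=None, gain_min=None):
    """Return the sorted indices of the majority-class samples that are kept."""
    order = np.argsort(dist, kind="stable")            # ascending distance to the center
    rows = [order[s:s + T] for s in range(0, len(order), T)]
    kept = set(order.tolist())
    for row in rows:
        d = dist[row]
        mean = d.mean()
        rsd = d.std() / mean if mean > 0 else 0.0      # assumption: RSD = std / mean
        if rsd <= rsd_max:
            # Compact row: delete only the samples far from the row average.
            drop = row[np.abs(d - mean) > diff_max]
        else:
            # Scattered row: delete every sample in the row.
            drop = row
        kept -= set(drop.tolist())
        # Hypothetical stand-in for the random forest check: stop once the
        # G_mean improvement reported by `gain` reaches the threshold.
        if gain is not None and gain(sorted(kept)) >= gain_min:
            break
    return sorted(kept)
```

In a fuller implementation `gain` would retrain or re-score a classifier on the reduced data set; passing it as a callback keeps the sketch independent of any particular library.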
Oversampling submodule 240: for calculating the distance from each minority-class sample in the minority-class sample set to the central sample, and oversampling the minority-class sample set according to the calculated distances to obtain a new minority-class sample set;
The oversampling submodule 240 includes:
Ninth computing unit 241: for calculating the distance from each sample in the minority-class sample set to the central sample;
Second matrix forming unit 242: for sorting the distances in ascending order and forming an R × T matrix;
Oversampling unit 243: for oversampling the samples corresponding to each row, starting from the first row, using the S-NKSMOTE algorithm. The oversampling unit includes the following subunits:
Neighbor acquisition subunit: for obtaining the k neighbor samples of a sample x in the minority-class sample set;
Comparison and judgment subunit: for comparing the number of minority-class samples with the number of majority-class samples among the k neighbor samples; when the minority-class samples outnumber the majority-class samples, x is marked as a safe sample; when the minority-class samples are fewer than the majority-class samples but at least one minority-class sample is present, x is marked as a dangerous sample; when all k neighbor samples are majority-class samples, x is marked as a noise sample;
First sample generation subunit: for, when x is a noise sample, randomly selecting a sample x′ from the minority-class sample set and generating a new sample X_new close to the minority-class samples as follows, all new samples forming the new minority-class sample set:
X_new = x + rand(0.5, 1) · (x′ − x)
Second sample generation subunit: for, when x is not a noise sample, randomly selecting one sample x′ from its k neighbor samples; if x′ is a majority-class sample, a new sample X_new close to x is generated as follows, all new samples forming the new minority-class sample set:
X_new = x + rand(0, 0.5) · (x′ − x)
If x′ is a minority-class sample, a new sample X_new close to x is generated according to the following equation, all new samples forming the new minority-class sample set:
X_new = x + rand(0, 1) · (x′ − x);
Here rand(a, b) in each step denotes a random number in the interval (a, b), with a = 0 or 0.5 and b = 0.5 or 1;
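For illustration, the labeling of x and the three interpolation rules can be sketched as below. The function name `snk_smote_sample` and its arguments are hypothetical, the k neighbors are assumed to have been found beforehand, and a tie between minority and majority counts (which the text leaves unspecified) is treated here as a dangerous sample.

```python
import numpy as np

def snk_smote_sample(x, neighbors, neighbor_is_minority, X_min, rng):
    """Label x and generate one synthetic sample X_new.

    neighbors: (k, d) array holding the k neighbor samples of x.
    neighbor_is_minority: boolean array of length k.
    X_min: the minority-class sample set; rng: a numpy random Generator.
    """
    k = len(neighbors)
    n_min = int(neighbor_is_minority.sum())
    if n_min == 0:
        # Noise sample: pick x' from the whole minority set, rand(0.5, 1).
        label = "noise"
        x_prime = X_min[rng.integers(len(X_min))]
        lo, hi = 0.5, 1.0
    else:
        # Assumption: a tie is treated as "danger".
        label = "safe" if n_min > k - n_min else "danger"
        j = rng.integers(k)
        x_prime = neighbors[j]
        # rand(0, 1) toward a minority neighbor, rand(0, 0.5) toward a majority one.
        lo, hi = (0.0, 1.0) if neighbor_is_minority[j] else (0.0, 0.5)
    x_new = x + rng.uniform(lo, hi) * (x_prime - x)
    return label, x_new
```

The narrower interval (0, 0.5) keeps a synthetic sample nearer to x when the chosen neighbor belongs to the majority class, while the interval (0.5, 1) moves a noise sample toward the randomly chosen minority-class sample.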
Second input unit 244: for inputting, after the samples of each row of the matrix have been oversampled, the sample set formed after oversampling into the random forest model;
Tenth computing unit 245: for calculating ΔGm_2, ΔGm_2 = Gm_j − Gm, where Gm_j is the G_mean value output by the random forest model for the minority-class sample set formed after the j-th row of samples has been oversampled, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
Third comparing unit 246: for comparing ΔGm_2 with the threshold ΔGm; when ΔGm_2 ≥ ΔGm, oversampling stops, and the samples at that point form the new minority-class sample set.
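The stopping rule shared by the undersampling and oversampling submodules can be illustrated as follows. This section does not define G_mean, so the sketch assumes the common two-class definition, the square root of the product of the true positive rate and the true negative rate; the names `g_mean` and `should_stop` are hypothetical, and the predictions would come from the random forest model in practice.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Two-class G_mean: sqrt(true positive rate * true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

def should_stop(gm_after, gm_baseline, delta_min):
    """Stop sampling once the G_mean improvement reaches the threshold."""
    return (gm_after - gm_baseline) >= delta_min
```

Here `gm_baseline` plays the role of Gm (the original imbalanced data set), `gm_after` the role of Gm_i or Gm_j, and `delta_min` the role of the threshold ΔGm.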
The dimension-reduction processing module 3 includes:
Analysis and sorting submodule 31: for analyzing the correlation between the features of each class of samples in the new imbalanced data set and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
Computing submodule 32: for deleting features one dimension at a time in that order, starting from the last dimension; each time a feature is deleted, the new imbalanced data set with one fewer feature is input into the random forest model, and the ACC value corresponding to that reduced data set is calculated;
Comparison submodule 33: for comparing all ACC values and selecting the feature dimension corresponding to the maximum ACC value, which is the feature dimension after feature reduction.
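A minimal sketch of this feature-reduction loop is given below, under several assumptions: the correlation measure is taken to be the absolute Pearson correlation with a numeric class label, ACC is taken to mean classification accuracy, and the random forest evaluation is represented by a hypothetical caller-supplied function `accuracy_of` that receives the retained feature indices. The name `reduce_features` is likewise illustrative.

```python
import numpy as np

def reduce_features(X, y, accuracy_of):
    """Return (indices of the retained features, best ACC value)."""
    # Correlation of every feature with the class label (assumption: |Pearson r|).
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    corr = np.nan_to_num(corr)                  # constant features count as 0
    order = np.argsort(-corr, kind="stable")    # descending correlation
    best_acc, best_k = -1.0, 0
    # Delete the last-ranked feature one at a time and score each reduced set.
    for k in range(len(order) - 1, 0, -1):
        acc = accuracy_of(order[:k].tolist())
        if acc > best_acc:
            best_acc, best_k = acc, k
    return order[:best_k].tolist(), best_acc
```

Following the text, scoring starts after the first deletion, so the full feature set is not itself a candidate in this sketch; including it would only require starting the loop at `len(order)`.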
The imbalanced data set conversion method based on sampling and feature reduction provided by the embodiments of the present invention samples the imbalanced data set with a hybrid of the S-NKSMOTE algorithm and undersampling. This reduces the overfitting caused by prior-art oversampling and avoids the accidental deletion of samples carrying important information in prior-art undersampling. The converted data set is input into a multi-class SVM for training, which further improves classification accuracy and reduces classification time.
Embodiment 4
Embodiment 4 of the present invention further provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the memory of the above embodiments, or it may exist on its own without being assembled into a terminal. The computer-readable storage medium stores one or more programs, and the one or more programs are executed by one or more processors to perform the methods provided in Embodiment 1 and Embodiment 2.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. An imbalanced data set conversion method based on sampling and feature reduction, characterized in that the method comprises:
obtaining an imbalanced data set, the imbalanced data set including a majority-class sample set and a minority-class sample set;
performing sampling processing on the imbalanced data set to obtain a new imbalanced data set;
performing dimension reduction on the new imbalanced data set, converting it into a feature-reduced new imbalanced data set.
2. The imbalanced data set conversion method based on sampling and feature reduction according to claim 1, characterized in that performing sampling processing on the imbalanced data set includes oversampling the minority-class sample set, including oversampling the minority-class sample set using the S-NKSMOTE algorithm, specifically:
obtaining the k neighbor samples of a sample x in the minority-class sample set;
comparing the number of minority-class samples with the number of majority-class samples among the k neighbor samples; when the minority-class samples outnumber the majority-class samples, marking x as a safe sample; when the minority-class samples are fewer than the majority-class samples but at least one minority-class sample is present, marking x as a dangerous sample; when all k neighbor samples are majority-class samples, marking x as a noise sample;
when x is a noise sample, randomly selecting a sample x′ from the minority-class sample set and generating a new sample X_new close to the minority-class samples as follows, all new samples forming the new minority-class sample set:
X_new = x + rand(0.5, 1) · (x′ − x)
when x is not a noise sample, randomly selecting one sample x′ from its k neighbor samples; if x′ is a majority-class sample, generating a new sample X_new close to x as follows, all new samples forming the new minority-class sample set:
X_new = x + rand(0, 0.5) · (x′ − x)
if x′ is a minority-class sample, generating a new sample X_new close to x according to the following equation, all new samples forming the new minority-class sample set:
X_new = x + rand(0, 1) · (x′ − x).
3. The imbalanced data set conversion method based on sampling and feature reduction according to claim 1, characterized in that performing dimension reduction on the new imbalanced data set specifically comprises:
analyzing the correlation between the features of each class of samples in the new imbalanced data set and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
deleting features one dimension at a time in that order, starting from the last dimension; each time a feature is deleted, inputting the new imbalanced data set with one fewer feature into a random forest model, and calculating the ACC value corresponding to that reduced data set;
comparing all ACC values and selecting the feature dimension corresponding to the maximum ACC value, which is the feature dimension after feature reduction.
4. The imbalanced data set conversion method based on sampling and feature reduction according to claim 2, characterized in that performing sampling processing on the imbalanced data set further includes undersampling the majority-class sample set, specifically:
obtaining the boundary sample set of the majority-class sample set and the minority-class sample set;
obtaining the central sample of the boundary sample set;
calculating the distance from each majority-class sample in the majority-class sample set to the central sample, and undersampling the majority-class sample set according to the calculated distances to obtain a new majority-class sample set, the new majority-class sample set and the new minority-class sample set together forming the new imbalanced data set.
5. The imbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that obtaining the boundary sample set of the majority-class sample set and the minority-class sample set specifically comprises:
calculating, for each majority-class sample in the majority-class sample set, the distance to its nearest minority-class sample;
calculating, for each minority-class sample in the minority-class sample set, the distance to its nearest majority-class sample;
selecting the majority-class sample and the minority-class sample that correspond to the minimum distance;
obtaining the m neighbor samples of that majority-class sample and the n neighbor samples of that minority-class sample;
obtaining the boundary sample set D, D = m ∩ n.
6. The imbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that obtaining the central sample of the boundary sample set specifically comprises:
calculating, for each sample in the boundary sample set, its distance to every other sample in the boundary sample set;
calculating, for each sample, the variance SD and the sum E of its distances;
calculating the dispersion degree B, B = SD × E;
selecting the sample with the smallest dispersion degree as the central sample.
7. The imbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that calculating the distance from each sample in the majority-class sample set to the central sample and undersampling the majority-class sample set according to the calculated distances specifically comprises:
calculating the distance from each sample in the majority-class sample set to the central sample;
sorting the distances in ascending order and then forming an R × T matrix;
calculating the relative standard deviation RSD of the distances in each row of the matrix;
comparing the relative standard deviation RSD with a threshold RSD_1; when RSD ≤ RSD_1, calculating the average of the distances in the row, calculating the difference between each distance in the row and that average, and deleting the samples whose difference exceeds a threshold; when RSD > RSD_1, deleting all samples corresponding to the row;
each time a row of samples has been deleted from the matrix, inputting the majority-class sample set reduced by that row into a random forest model;
calculating ΔGm_1, ΔGm_1 = Gm_i − Gm, where Gm_i is the G_mean value output by the random forest model for the majority-class sample set after the i-th row of samples has been deleted, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
comparing ΔGm_1 with a threshold ΔGm; when ΔGm_1 ≥ ΔGm, stopping undersampling, the samples at that point forming the new majority-class sample set.
8. The imbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that oversampling the minority-class sample set consists of calculating the distance from each minority-class sample in the minority-class sample set to the central sample and oversampling the minority-class sample set according to the calculated distances to obtain a new minority-class sample set, specifically comprising:
calculating the distance from each sample in the minority-class sample set to the central sample;
sorting the distances in ascending order and forming an R′ × T′ matrix;
starting from the first row, oversampling the samples corresponding to each row using the S-NKSMOTE algorithm;
after the samples of each row of the matrix have been oversampled, inputting the sample set formed after oversampling into a random forest model;
calculating ΔGm_2, ΔGm_2 = Gm_j − Gm, where Gm_j is the G_mean value output by the random forest model for the minority-class sample set formed after the j-th row of samples has been oversampled, and Gm is the G_mean value output by the random forest model for the original imbalanced data set;
comparing ΔGm_2 with a threshold ΔGm; when ΔGm_2 ≥ ΔGm, stopping oversampling, the samples at that point forming the new minority-class sample set.
9. An imbalanced data set conversion system based on sampling and feature reduction, characterized in that the conversion system comprises:
a data acquisition module for obtaining an imbalanced data set, the imbalanced data set including a majority-class sample set and a minority-class sample set;
a sampling processing module for performing sampling processing on the imbalanced data set to obtain a new imbalanced data set;
a dimension-reduction processing module for performing dimension reduction on the new imbalanced data set, converting it into a feature-reduced new imbalanced data set.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910508530.XA CN110348486A (en) 2019-06-13 2019-06-13 Based on sampling and feature brief non-equilibrium data collection conversion method and system
CN202010371648.5A CN112085046A (en) 2019-06-13 2020-05-06 Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910508530.XA CN110348486A (en) 2019-06-13 2019-06-13 Based on sampling and feature brief non-equilibrium data collection conversion method and system

Publications (1)

Publication Number Publication Date
CN110348486A (en) 2019-10-18

Family

ID=68181860

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910508530.XA Pending CN110348486A (en) 2019-06-13 2019-06-13 Based on sampling and feature brief non-equilibrium data collection conversion method and system
CN202010371648.5A Pending CN112085046A (en) 2019-06-13 2020-05-06 Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010371648.5A Pending CN112085046A (en) 2019-06-13 2020-05-06 Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion

Country Status (1)

Country Link
CN (2) CN110348486A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036515A (en) * 2020-11-04 2020-12-04 北京淇瑀信息科技有限公司 Oversampling method and device based on SMOTE algorithm and electronic equipment
CN112085046A (en) * 2019-06-13 2020-12-15 中国科学院计算机网络信息中心 Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion
CN112395558A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter
CN113052198A (en) * 2019-12-28 2021-06-29 中移信息技术有限公司 Data processing method, device, equipment and storage medium
WO2021135271A1 (en) * 2019-12-30 2021-07-08 山东英信计算机技术有限公司 Classification model training method and system, electronic device and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076438B (en) * 2021-04-28 2023-12-15 华南理工大学 Classification method based on conversion from majority class to minority class under unbalanced data set
CN113553581A (en) * 2021-07-12 2021-10-26 华东师范大学 Intrusion detection system for unbalanced data
CN113901448A (en) * 2021-09-03 2022-01-07 燕山大学 Intrusion detection method based on convolutional neural network and lightweight gradient elevator
CN115242431A (en) * 2022-06-10 2022-10-25 国家计算机网络与信息安全管理中心 Industrial Internet of things data anomaly detection method based on random forest and long-short term memory network
CN117253095B (en) * 2023-11-16 2024-01-30 吉林大学 Image classification system and method based on biased shortest distance criterion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582813B (en) * 2009-06-26 2011-07-20 西安电子科技大学 Distributed migration network learning-based intrusion detection system and method thereof
CN103716204B (en) * 2013-12-20 2017-02-08 中国科学院信息工程研究所 Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process
CN104598813B (en) * 2014-12-09 2017-05-17 西安电子科技大学 Computer intrusion detection method based on integrated study and semi-supervised SVM
CN109150830B (en) * 2018-07-11 2021-04-06 浙江理工大学 Hierarchical intrusion detection method based on support vector machine and probabilistic neural network
CN110348486A (en) * 2019-06-13 2019-10-18 中国科学院计算机网络信息中心 Based on sampling and feature brief non-equilibrium data collection conversion method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085046A (en) * 2019-06-13 2020-12-15 中国科学院计算机网络信息中心 Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion
CN113052198A (en) * 2019-12-28 2021-06-29 中移信息技术有限公司 Data processing method, device, equipment and storage medium
WO2021135271A1 (en) * 2019-12-30 2021-07-08 山东英信计算机技术有限公司 Classification model training method and system, electronic device and storage medium
US11762949B2 (en) 2019-12-30 2023-09-19 Shandong Yingxin Computer Technologies Co., Ltd. Classification model training method, system, electronic device and strorage medium
CN112036515A (en) * 2020-11-04 2020-12-04 北京淇瑀信息科技有限公司 Oversampling method and device based on SMOTE algorithm and electronic equipment
CN112395558A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter

Also Published As

Publication number Publication date
CN112085046A (en) 2020-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191018