CN110348486A - Imbalanced dataset transformation method and system based on sampling and feature reduction - Google Patents
- Publication number
- CN110348486A (application number CN201910508530.XA)
- Authority
- CN
- China
- Prior art keywords
- sample
- new
- dataset
- sampling
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The present invention provides an imbalanced dataset transformation method and system based on sampling and feature reduction. The method first resamples the imbalanced dataset so that the numbers of minority-class and majority-class samples become close to balanced. The features are then sorted in descending order of their correlation with the class label. Starting from the last feature in this order, one feature dimension is deleted at a time; after each deletion, the reduced dataset is fed into a random forest model and its corresponding ACC value is computed. All ACC values are compared, and the feature dimension with the maximum ACC value is chosen as the target dimension of the feature reduction. The new imbalanced dataset obtained by this transformation is fed to a multi-class SVM for training, which significantly improves classification accuracy.
Description
Technical field
The invention belongs to the field of imbalanced-data transformation techniques, and in particular relates to an imbalanced dataset transformation method and system based on sampling and feature reduction.
Background art
An imbalanced-dataset transformation method reconstructs the dataset at the data level before classification, in order to reduce the degree of imbalance and improve classification accuracy. Imbalanced-dataset classification refers to classification problems in which the amounts of data in the different classes are unequal. Taking binary classification as an example, the data samples of one class significantly outnumber those of the other class. The samples of the larger class form the majority-class sample set, and those of the smaller class form the minority-class sample set. Imbalanced data is very common in real life, for example in fields such as risk and intrusion detection, rare-disease prediction, and financial fraud detection.
The most common data-level approach is to oversample the minority-class sample set, increasing the number of minority-class samples until the dataset is relatively balanced.
Existing methods have two drawbacks: 1. existing methods for oversampling the minority-class sample set treat all minority-class samples alike, without considering that different minority-class samples matter differently to the classifier; 2. the features of the dataset strongly affect classifier performance; if the features contain few fields that are effective for the classification result, the classifier's training process incurs considerable extra complexity.
Summary of the invention
To solve the problems in the prior art, the present invention provides an imbalanced dataset transformation method based on sampling and feature reduction.
To achieve the above objective, the present invention adopts the following technical scheme:
The present invention provides an imbalanced dataset transformation method based on sampling and feature reduction, the method comprising:
obtaining an imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
sampling the imbalanced dataset to obtain a new imbalanced dataset;
applying dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset.
In a preferred technical scheme, sampling the imbalanced dataset comprises oversampling the minority-class sample set using the S-NKSMOTE algorithm, specifically:
obtaining the k nearest-neighbour samples of a sample x in the minority-class sample set;
comparing the number of minority-class samples among the k neighbours with the number of majority-class samples: when the minority-class samples outnumber the majority-class samples, x is labelled a safe sample; when the minority-class samples are fewer than the majority-class samples but at least one minority-class sample is present, x is labelled a danger sample; when all k neighbours are majority-class samples, x is labelled a noise sample;
when x is a noise sample, randomly selecting a sample x′ from the minority-class sample set and generating a new sample X_new close to the minority class as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0.5, 1) · (x′ − x)
when x is not a noise sample, randomly selecting one sample x′ from its k neighbours; if x′ belongs to the majority class, generating a new sample X_new close to x as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0, 0.5) · (x′ − x)
if x′ belongs to the minority class, generating the new sample X_new close to x according to the following equation; all new samples form the new minority-class sample set:
X_new = x + rand(0, 1) · (x′ − x)
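The labelling and synthesis steps above can be sketched as follows. This is an illustrative simplification, not the patent's implementation: plain Euclidean k-NN stands in for the kernel-space neighbour search, and the function names are ours.

```python
import numpy as np

def label_sample(x, X_min, X_maj, k):
    """Label a minority sample x (assumed to appear in X_min) as 'safe',
    'danger', or 'noise' from the class mix of its k nearest neighbours."""
    X_all = np.vstack([X_min, X_maj])
    y_all = np.array([1] * len(X_min) + [0] * len(X_maj))
    d = np.linalg.norm(X_all - x, axis=1)
    idx = np.argsort(d)[1:k + 1]          # drop index 0: x itself
    n_min = int(y_all[idx].sum())
    if n_min == 0:
        return 'noise'                     # all k neighbours are majority
    if n_min > k - n_min:
        return 'safe'                      # minority neighbours dominate
    return 'danger'                        # fewer minority, but at least one

def synthesize(x, x_prime, lo, hi, rng):
    """X_new = x + rand(lo, hi) * (x' - x), with (lo, hi) chosen per the
    three cases: (0.5, 1) for noise, (0, 0.5) toward a majority neighbour,
    (0, 1) toward a minority neighbour."""
    return x + rng.uniform(lo, hi) * (x_prime - x)
```

A noise sample is thus pulled well toward the minority class (factor at least 0.5), while a non-noise sample stays closer to x when x′ comes from the majority class.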
In a preferred technical scheme, the dimensionality reduction of the new imbalanced dataset is specifically:
analysing the correlation between the features of each class of sample in the new imbalanced dataset and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
starting from the last feature in this order, deleting one feature dimension at a time; after each deletion, feeding the reduced new imbalanced dataset into a random forest model and computing the corresponding ACC value after each reduction;
comparing all ACC values and choosing the feature dimension with the maximum ACC value as the feature dimension after feature reduction.
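The backward-deletion search described above can be sketched as follows. The implementation choices here are ours, not the patent's: scikit-learn's RandomForestClassifier, 3-fold cross-validated accuracy as the ACC value, and the function name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def best_feature_dim(X_sorted, y, seed=0):
    """X_sorted: columns already in descending order of relevance to y.
    Delete features from the tail one at a time, score each prefix with a
    random forest (ACC), and return the prefix length with the highest ACC."""
    best_dim, best_acc = X_sorted.shape[1], -1.0
    for dim in range(X_sorted.shape[1], 0, -1):   # full set down to 1 feature
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        acc = cross_val_score(clf, X_sorted[:, :dim], y, cv=3).mean()
        if acc > best_acc:
            best_dim, best_acc = dim, acc
    return best_dim, best_acc
```

When the tail features are uninformative, deleting them leaves the ACC unchanged or higher, so the search tends toward a smaller dimension.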
In a preferred technical scheme, sampling the imbalanced dataset further comprises undersampling the majority-class sample set, specifically:
obtaining the boundary sample set of the majority-class and minority-class sample sets;
obtaining the centre sample of the boundary sample set;
computing the distance of each majority-class sample in the majority-class sample set to the centre sample, and undersampling the majority-class sample set according to the computed distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set form the new imbalanced dataset.
In a preferred technical scheme, the boundary sample set of the majority-class and minority-class sample sets is obtained specifically as follows:
computing, for each majority-class sample in the majority-class sample set, its distance to its nearest minority-class sample;
computing, for each minority-class sample in the minority-class sample set, its distance to its nearest majority-class sample;
picking out the majority-class sample and the minority-class sample corresponding to the minimum of these distances;
obtaining the m nearest-neighbour samples of that majority-class sample and the n nearest-neighbour samples of that minority-class sample;
obtaining the boundary sample set D as the intersection of the two neighbourhoods, D = m ∩ n.
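The boundary-set construction above can be sketched like this; Euclidean distance again stands in for the kernel-space distance, and the names are illustrative.

```python
import numpy as np

def boundary_set(X_maj, X_min, m, n):
    """Find the closest majority/minority pair, then intersect the m nearest
    neighbours of the majority anchor with the n nearest neighbours of the
    minority anchor (D = m-NN ∩ n-NN)."""
    # pairwise distances between the two classes
    D = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmin(D), D.shape)   # closest cross-class pair
    X_all = np.vstack([X_maj, X_min])
    nn_i = set(np.argsort(np.linalg.norm(X_all - X_maj[i], axis=1))[:m])
    nn_j = set(np.argsort(np.linalg.norm(X_all - X_min[j], axis=1))[:n])
    return X_all[sorted(nn_i & nn_j)]                # samples common to both
```

The intersection keeps only samples that are simultaneously close to both anchors, i.e. samples sitting near the class boundary.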
In a preferred technical scheme, the centre sample of the boundary sample set is obtained specifically as follows:
computing, for each sample in the boundary sample set, its distances to every other sample in the boundary sample set;
computing, for each sample, the variance SD of those distances and their sum E;
computing the dispersion degree B = SD · E;
picking out the sample with the smallest dispersion degree as the centre sample.
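The centre-sample rule above is direct to implement; this minimal sketch (our naming) computes, for each boundary sample, the variance and sum of its distances to the others and minimises B = SD · E.

```python
import numpy as np

def center_sample(boundary):
    """Return the boundary sample minimising B = SD * E, where SD and E are
    the variance and the sum of its distances to every other sample."""
    D = np.linalg.norm(boundary[:, None, :] - boundary[None, :, :], axis=2)
    best, best_B = 0, np.inf
    for b in range(len(boundary)):
        d = np.delete(D[b], b)       # distances to every *other* sample
        B = d.var() * d.sum()        # dispersion degree B = SD * E
        if B < best_B:
            best, best_B = b, B
    return boundary[best]
```

A sample whose distances to the others are both small and uniform gets the smallest B, which matches the intuition of a centre.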
In a preferred technical scheme, computing the distance of each sample in the majority-class sample set to the centre sample and undersampling the majority-class sample set according to the computed distances is specifically:
computing the distance of each sample in the majority-class sample set to the centre sample;
sorting the distances in ascending order and forming an R × T matrix;
computing the relative standard deviation RSD of the distances in each row of the matrix;
comparing RSD with a threshold RSD1: when RSD ≤ RSD1, computing the mean of the row's distances and the difference between each distance in the row and that mean, and deleting the samples whose difference exceeds a threshold; when RSD > RSD1, deleting all samples of that row;
after each row's deletion, feeding the reduced majority-class sample set into a random forest model;
computing ΔGm1 = Gm_i − Gm, where Gm_i is the G_mean value output by the random forest model for the majority-class sample set after deleting the i-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
comparing ΔGm1 with a threshold ΔGm: when ΔGm1 ≥ ΔGm, stopping the undersampling; the samples at that point form the new majority-class sample set.
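The row-wise pruning rule above can be sketched as follows. This simplified version (our naming and thresholds) omits the G_mean-based stopping test that the patent layers on top, assumes the distances are positive, and keeps any leftover tail that does not fill a row.

```python
import numpy as np

def rsd_undersample(X_maj, center, rows, rsd_max, diff_max):
    """Sort majority samples by distance to the centre and split into `rows`
    rows; drop whole rows with RSD > rsd_max, and in the remaining rows drop
    samples whose |d - row mean| exceeds diff_max."""
    d = np.linalg.norm(X_maj - center, axis=1)
    order = np.argsort(d)                       # ascending distance to centre
    cols = len(order) // rows
    keep = []
    for r in range(rows):
        row = order[r * cols:(r + 1) * cols]
        dr = d[row]
        if dr.std() / dr.mean() > rsd_max:      # RSD > RSD1: delete the row
            continue
        keep.extend(row[np.abs(dr - dr.mean()) <= diff_max])
    keep.extend(order[rows * cols:])            # any leftover tail is kept
    return X_maj[sorted(keep)]
```

Rows with a high relative spread (likely mixing near and far samples) are discarded whole; in homogeneous rows only the outliers relative to the row mean are removed.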
In a preferred technical scheme, oversampling the minority-class sample set comprises computing the distance of each minority-class sample in the minority-class sample set to the centre sample, and oversampling the minority-class sample set according to the computed distances to obtain the new minority-class sample set, specifically:
computing the distance of each sample in the minority-class sample set to the centre sample;
sorting the distances in ascending order and forming an R′ × T′ matrix;
starting from the first row, oversampling the samples of each row using the S-NKSMOTE algorithm;
after each row's oversampling, feeding the sample set formed by the oversampling into a random forest model;
computing ΔGm2 = Gm_j − Gm, where Gm_j is the G_mean value output by the random forest model for the minority-class sample set formed after oversampling the j-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
comparing ΔGm2 with the threshold ΔGm: when ΔGm2 ≥ ΔGm, stopping the oversampling; the samples at that point form the new minority-class sample set.
Another aspect of the present invention provides an imbalanced dataset transformation system based on sampling and feature reduction, the transformation system comprising:
a data acquisition module that obtains the imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
a sampling module that samples the imbalanced dataset to obtain a new imbalanced dataset;
a dimensionality-reduction module that applies dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset.
A further aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the imbalanced dataset transformation method based on sampling and feature reduction provided by the invention are realised.
In the imbalanced dataset transformation method based on sampling and feature reduction provided by the invention, the samples of the imbalanced dataset are first resampled so that the number of minority-class samples approaches that of the majority-class samples, reducing the imbalance of the minority class. The features are then sorted in descending order of their correlation with the class label. Starting from the last feature in this order, one feature dimension is deleted at a time; after each deletion the reduced dataset is fed into a random forest model, whose ACC value serves as the fitness (deletion proceeds from the last dimension, the dataset being fed into the random forest once per deleted feature, until only the first feature dimension remains). All ACC values are compared, and the feature dimension with the maximum ACC value is chosen as the target dimension of the feature reduction. The new imbalanced data obtained by this transformation is fed to a multi-class SVM for training, which significantly improves classification accuracy.
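The final training step above can be sketched with scikit-learn's SVC, which handles the multi-class case via one-vs-one internally. The data here is a toy stand-in for the transformed, roughly balanced dataset, not the patent's data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy stand-in for the transformed dataset: three well-separated, balanced classes
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 2, 4)])
y = np.repeat([0, 1, 2], 30)

clf = SVC(kernel='rbf').fit(X, y)   # multi-class handled one-vs-one internally
train_acc = clf.score(X, y)
```

After the sampling and feature-reduction steps the classes are closer to balanced, which is exactly the regime where a standard SVM performs well without class weighting.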
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the imbalanced dataset transformation method based on sampling and feature reduction provided by embodiment 1 of the present invention;
Fig. 2 is a detailed flow chart of the oversampling in step S2 of embodiment 1;
Fig. 3 is a detailed flow chart of step S3 of embodiment 1;
Fig. 4 is a detailed flow chart of step S2 of embodiment 2;
Fig. 5 is a detailed flow chart of step S210;
Fig. 6 is a detailed flow chart of step S220;
Fig. 7 is a detailed flow chart of step S230;
Fig. 8 is a detailed flow chart of step S240;
Fig. 9 is a structural block diagram of the imbalanced dataset transformation system based on sampling and feature reduction provided by embodiment 3.
Specific embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
Embodiment 1
Embodiment 1 of the present invention provides an imbalanced dataset transformation method based on sampling and feature reduction. As shown in Fig. 1, the method comprises the following steps:
S1: obtaining an imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
S2: sampling the imbalanced dataset to obtain a new imbalanced dataset, including oversampling the minority-class sample set using the S-NKSMOTE algorithm; with reference to Fig. 2, specifically:
S21: obtaining the k nearest-neighbour samples of a sample x in the minority-class sample set;
here the k nearest neighbours are the k samples closest to x in kernel space, and the value of k can be set, for example to 100 or 500;
S22: comparing the number of minority-class samples among the k neighbours with the number of majority-class samples: when the minority-class samples outnumber the majority-class samples, x is labelled a safe sample; when the minority-class samples are fewer than the majority-class samples but at least one minority-class sample is present, x is labelled a danger sample; when all k neighbours are majority-class samples, x is labelled a noise sample;
S23: when x is a noise sample, randomly selecting a sample x′ from the minority-class sample set and generating a new sample X_new close to the minority class as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0.5, 1) · (x′ − x)
S24: when x is not a noise sample, randomly selecting one sample x′ from its k neighbours; if x′ belongs to the majority class, generating a new sample X_new close to x as follows; all new samples form the new minority-class sample set:
X_new = x + rand(0, 0.5) · (x′ − x)
if x′ belongs to the minority class, generating the new sample X_new close to x according to the following equation; all new samples form the new minority-class sample set:
X_new = x + rand(0, 1) · (x′ − x)
where rand(a, b) in each step denotes a random number in the interval (a, b), with a = 0 or 0.5 and b = 0.5 or 1;
S3: applying dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset; with reference to Fig. 3, this comprises the following steps:
S31: analysing the correlation between the features of each class of sample in the new imbalanced dataset and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
the correlation analysis can follow an existing method, for example one based on information entropy or mutual information;
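As one concrete instance of the existing methods mentioned above, mutual information can supply the correlation scores. This sketch uses scikit-learn's estimator; the function name is ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, seed=0):
    """Column indices of X in descending order of mutual information with y,
    i.e. the feature ordering that step S31 asks for."""
    mi = mutual_info_classif(X, y, random_state=seed)
    return np.argsort(mi)[::-1]
```

Reordering the columns of X by this index makes the tail of the matrix the least relevant features, which the backward-deletion step then removes first.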
S32: starting from the last feature in this order, deleting one feature dimension at a time; after each deletion, feeding the reduced new imbalanced dataset into a random forest model and computing the corresponding ACC value;
here the ACC value denotes accuracy; deletion starts from the last dimension, and after each deleted feature the new imbalanced dataset is fed into the random forest once, until only the first feature dimension remains;
S33: comparing all ACC values and choosing the feature dimension with the maximum ACC value as the feature dimension after feature reduction.
For example, if the ACC value computed after the v-th deletion is the maximum, the first y − v feature dimensions are retained, where y is the dimension of the original features.
In the imbalanced dataset transformation method based on sampling and feature reduction provided by this embodiment of the invention, the S-NKSMOTE algorithm oversamples the minority-class sample set, increasing the number of minority-class samples and reducing their imbalance; correlation analysis is then combined with the random forest to realise the feature reduction of the new imbalanced dataset. The transformed dataset is fed to a multi-class SVM for training, significantly improving classification accuracy.
Embodiment 2
Embodiment 2 of the present invention provides an imbalanced dataset transformation method based on sampling and feature reduction, comprising the following steps:
S1: obtaining an imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
S2: sampling the imbalanced dataset to obtain a new imbalanced dataset; with reference to Fig. 4, step S2 specifically comprises:
S210: obtaining the boundary sample set of the majority-class and minority-class sample sets;
with reference to Fig. 5, step S210 is as follows, where all distances below are distances in kernel space:
S211: computing, for each majority-class sample in the majority-class sample set, its distance to its nearest minority-class sample;
S212: computing, for each minority-class sample in the minority-class sample set, its distance to its nearest majority-class sample;
S213: picking out the majority-class sample and the minority-class sample corresponding to the minimum of these distances;
S214: obtaining the m nearest-neighbour samples of that majority-class sample and the n nearest-neighbour samples of that minority-class sample;
where m and n are set values, positive integers greater than 1, for example 50 or 100;
S215: obtaining the boundary sample set D, D = m ∩ n.
The boundary sample set obtained is thus the intersection of the m-neighbourhood and the n-neighbourhood, formed by the samples common to both.
S220: obtaining the centre sample of the boundary sample set;
with reference to Fig. 6, step S220 is as follows:
S221: computing, for each sample in the boundary sample set, its distances to every other sample in the boundary sample set;
for a sample x_b in the boundary sample set, its distances S_bf to the other e samples are computed, the total number of samples in the boundary sample set being e + 1; b = 1, 2, ..., e, e + 1, and S_bf is the distance from x_b to x_f, f = 1, 2, ..., e, e + 1, f ≠ b;
S222: computing, for each sample, the variance SD of its distances and their sum E;
for sample x_b, SD is the variance of the distances S_b1, S_b2, ..., S_bf, ..., S_be, S_b(e+1), and E = S_b1 + S_b2 + ... + S_bf + ... + S_be + S_b(e+1), where S_b1, S_b2, ..., S_bf, ..., S_be and S_b(e+1) denote the distances from x_b to x_1, x_2, ..., x_f, ..., x_e and x_(e+1) respectively;
S223: computing the dispersion degree B, B = SD · E;
S224: picking out the sample with the smallest dispersion degree as the centre sample;
S230: computing the distance of each majority-class sample in the majority-class sample set to the centre sample, and undersampling the majority-class sample set according to the computed distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set form the new imbalanced dataset;
with reference to Fig. 7, step S230 is as follows:
S231: computing the distance of each sample in the majority-class sample set to the centre sample;
S232: sorting the distances in ascending order and forming an R × T matrix;
R and T are set values, which may be equal or different, for example 50, 100 or 200;
S233: computing the relative standard deviation RSD of the distances in each row of the matrix;
here RSD is the standard deviation of S_h1, S_h2, ..., S_hg divided by their mean, where S_h1, S_h2, ..., S_hg denote the distances of the 1st, 2nd, ..., g-th samples of the h-th row of the matrix to the centre sample, h = 1, 2, ..., T;
S234: comparing RSD with the threshold RSD1: when RSD ≤ RSD1, computing the mean of the row's distances and the difference between each distance in the row and that mean, and deleting the samples whose difference exceeds a threshold; when RSD > RSD1, deleting all samples of that row; the threshold RSD1 is a set value;
S235: after each row's deletion, feeding the reduced majority-class sample set into a random forest model; the random forest model is a trained model;
S236: computing ΔGm1 = Gm_i − Gm, where Gm_i is the G_mean value output by the random forest model for the majority-class sample set after deleting the i-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
S237: comparing ΔGm1 with the threshold ΔGm: when ΔGm1 ≥ ΔGm, stopping the undersampling; the samples at that point form the new majority-class sample set;
S240: computing the distance of each minority-class sample in the minority-class sample set to the centre sample, and oversampling the minority-class sample set according to the computed distances to obtain a new minority-class sample set;
with reference to Fig. 8, step S240 is as follows:
S241: computing the distance of each sample in the minority-class sample set to the centre sample;
S242: sorting the distances in ascending order and forming an R′ × T′ matrix;
R′ and T′ are set values, which may be equal or different, for example 50, 100 or 200;
S243: starting from the first row, oversampling the samples of each row using the S-NKSMOTE algorithm, following the method of embodiment 1 and Fig. 2;
S244: after each row's oversampling, feeding the sample set formed by the oversampling into the random forest model;
S245: computing ΔGm2 = Gm_j − Gm, where Gm_j is the G_mean value output by the random forest model for the minority-class sample set formed after oversampling the j-th row's samples, and Gm is the G_mean value output by the random forest model for the original imbalanced dataset;
S246: comparing ΔGm2 with the threshold ΔGm: when ΔGm2 ≥ ΔGm, stopping the oversampling; the samples at that point form the new minority-class sample set.
S3: applying dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset; the detailed steps follow embodiment 1 and Fig. 3.
The imbalanced dataset transformation method based on sampling and feature reduction provided by embodiment 2 of the present invention can reduce the overfitting brought by prior-art oversampling and avoid the accidental deletion of samples carrying important information in prior-art undersampling. The transformed dataset is fed to a multi-class SVM for training, further improving classification accuracy and reducing classification time.
Embodiment 3
Embodiment 3 of the present invention provides an imbalanced dataset transformation system based on sampling and feature reduction. As shown in Fig. 9, the transformation system comprises:
a data acquisition module 1 that obtains the imbalanced dataset, the imbalanced dataset comprising a majority-class sample set and a minority-class sample set;
a sampling module 2 that samples the imbalanced dataset to obtain a new imbalanced dataset;
a dimensionality-reduction module 3 that applies dimensionality reduction to the new imbalanced dataset, transforming it into a feature-reduced new imbalanced dataset.
With continued reference to Fig. 9, the sampling module 2 comprises:
a boundary-sample acquisition submodule 210, for obtaining the boundary sample set of the majority-class and minority-class sample sets;
where the boundary-sample acquisition submodule 210 comprises:
a first computing unit 211, for computing, for each majority-class sample in the majority-class sample set, its distance to its nearest minority-class sample;
a second computing unit 212, for computing, for each minority-class sample in the minority-class sample set, its distance to its nearest majority-class sample;
a first selection unit 213, for picking out the majority-class sample and the minority-class sample corresponding to the minimum distance;
an acquisition unit 214, for obtaining the m nearest-neighbour samples of that majority-class sample and the n nearest-neighbour samples of that minority-class sample, and obtaining the boundary sample set D, D = m ∩ n;
where m and n are set values, positive integers greater than 1, for example 50 or 100; the boundary sample set obtained is the intersection of the m-neighbourhood and the n-neighbourhood, formed by the samples common to both;
a centre-sample acquisition submodule 220, for obtaining the centre sample of the boundary sample set, the centre-sample acquisition submodule comprising:
a third computing unit 221, for computing, for each sample in the boundary sample set, its distances to every other sample in the boundary sample set;
a fourth computing unit 222, for computing, for each sample, the variance SD of its distances and their sum E;
a fifth computing unit 223, for computing the dispersion degree B, B = SD · E;
a second selection unit 224, for picking out the sample with the smallest dispersion degree as the centre sample;
an undersampling submodule 230, for computing the distance of each majority-class sample in the majority-class sample set to the centre sample and undersampling the majority-class sample set according to the computed distances to obtain a new majority-class sample set; the new majority-class sample set and the new minority-class sample set form the new imbalanced dataset;
The lack sampling handles submodule 230
6th computing unit 231: for calculating the distance of each sample distance center sample in most class sample sets;
First matrix forms unit 232: for being ranked up from small to large according to distance, then forming the matrix of R × T;
R and T is setting value, can be identical or different, can take 50,100 or 200 equivalences;
7th computing unit 233: the relative standard deviation RSD for row distance each in calculating matrix;
First comparing unit 234: it is used for relative standard deviation RSD and threshold value RSD1It is compared, as RSD≤RSD1When,
The average value of the row distance is calculated, and calculates the difference of each distance and the average value row Nei, difference is greater than threshold value pair
The sample answered is deleted;As RSD > RSD1When, the corresponding all samples of the row are deleted;Wherein threshold value RSD1For setting value;
First input unit 235: in matrix after every deletion a line sample, for most class samples after a line sample will to be reduced
This collection is input in Random Forest model;The Random Forest model is the model after training;
8th computing unit 236: for calculating Δ Gm1, Δ Gm1=Gmi- Gm, GmiTo delete the majority after the i-th row sample
Class sample set is input to the G_mean value exported in Random Forest model, and Gm is input to random gloomy for original non-equilibrium data collection
The G_mean value exported in woods model;
Second comparing unit 237: it is used for Δ Gm1It is compared with threshold value Δ Gm, as Δ Gm1When >=Δ Gm, stopping owes to adopt
Sample, sample at this time are new most class sample sets;
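The row-wise pruning rule of the undersampling submodule can be sketched as below. The random-forest/G_mean stopping loop is omitted here; `prune_rows` and both threshold parameters are illustrative names, and RSD is taken as std/mean, the usual definition of relative standard deviation.

```python
import numpy as np

def prune_rows(dist_sorted, R, T, rsd_thr, diff_thr):
    """One pass of the row-wise undersampling rule (illustrative sketch).

    dist_sorted: center distances of the majority samples, ascending,
                 reshaped row by row into an R x T matrix.
    Returns a boolean keep-mask over the first R*T samples.
    """
    mat = dist_sorted[:R * T].reshape(R, T)
    keep = np.ones((R, T), dtype=bool)
    for r in range(R):
        row = mat[r]
        rsd = row.std() / row.mean()  # relative standard deviation
        if rsd <= rsd_thr:
            # row is homogeneous: drop only samples far from the row mean
            keep[r] = np.abs(row - row.mean()) <= diff_thr
        else:
            # row is too dispersed: drop the whole row
            keep[r] = False
    return keep.ravel()
```

In the described system this pruning would be interleaved with the trained random forest: after each deleted row the reduced majority set is re-evaluated, and pruning stops once ΔGm1 = Gmi − Gm reaches the threshold ΔGm.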
Oversampling submodule 240: configured to compute the distance from each minority class sample in the minority class sample set to the central sample, and to oversample the minority class sample set according to the computed distances to obtain a new minority class sample set;
The oversampling submodule 240 comprises:
Ninth computing unit 241: configured to compute the distance from each sample in the minority class sample set to the central sample;
Second matrix forming unit 242: configured to sort the distances in ascending order and arrange them into an R × T matrix;
Oversampling unit 243: configured to oversample, starting from the first row, the samples corresponding to each row using the S-NKSMOTE algorithm; the oversampling unit specifically comprises the following subelements:
Neighbor acquisition subelement: configured to obtain the k nearest-neighbor samples of a sample x in the minority class sample set;
Comparison subelement: configured to compare the number of minority class samples with the number of majority class samples among the k nearest-neighbor samples; when the number of minority class samples exceeds the number of majority class samples, x is labeled a safe sample; when the number of minority class samples is less than the number of majority class samples but minority class samples are present, x is labeled a danger sample; when all k nearest-neighbor samples are majority class samples, x is labeled a noise sample;
First sample generation subelement: configured to, when x is a noise sample, randomly select a sample x′ from the minority class sample set and generate a new sample Xnew close to the minority class samples as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0.5, 1)(x′ − x)
Second sample generation subelement: configured to, when x is not a noise sample, randomly select one sample x′ from its k nearest-neighbor samples; if x′ belongs to the majority class, a new sample Xnew close to x is generated as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 0.5)(x′ − x)
if x′ belongs to the minority class, a new sample Xnew close to x is generated according to the following equation, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 1)(x′ − x);
wherein rand(a, b) in each step denotes a random number in the interval (a, b), with a = 0 or 0.5 and b = 0.5 or 1;
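The three S-NKSMOTE generation cases differ only in the interval from which the interpolation coefficient rand(a, b) is drawn. A short sketch of that core step (the `kind` labels are illustrative, not terms from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def synth(x, x_prime, kind):
    """Generate one synthetic minority sample for the three S-NKSMOTE cases."""
    if kind == "noise":
        # x is a noise sample: pull the new sample toward the minority x'
        lam = rng.uniform(0.5, 1.0)
    elif kind == "danger_maj":
        # x' is a majority-class neighbour: keep the new sample close to x
        lam = rng.uniform(0.0, 0.5)
    else:
        # x' is a minority-class neighbour: interpolate freely
        lam = rng.uniform(0.0, 1.0)
    return x + lam * (x_prime - x)
```

Biasing the coefficient this way keeps synthetic samples away from the majority region when the neighborhood is hostile, and away from an isolated noise point when x itself is unreliable.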
Second input unit 244: configured to input the sample set formed after oversampling into the random forest model each time the samples of a row of the matrix have been oversampled;
Tenth computing unit 245: configured to compute ΔGm2, ΔGm2 = Gmj − Gm, where Gmj is the G_mean value output by the random forest model for the minority class sample set formed after oversampling of the j-th row of samples, and Gm is the G_mean value output by the random forest model for the original unbalanced data set;
Third comparing unit 246: configured to compare ΔGm2 with the threshold ΔGm; when ΔGm2 ≥ ΔGm, the oversampling stops, and the samples at that point constitute the new minority class sample set.
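Both stopping criteria compare G_mean values returned by the trained random forest. The text does not spell the metric out, so the standard definition, the geometric mean of the per-class recalls, is assumed in this sketch:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G_mean as the geometric mean of per-class recalls (assumed standard
    definition; shown for any number of classes)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append((y_pred[mask] == c).mean())  # recall of class c
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

Because G_mean collapses to zero when any single class is missed entirely, it is a natural criterion for deciding when resampling of an unbalanced set has gone far enough.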
The dimension reduction module 3 comprises:
Analysis and sorting submodule 31: configured to analyze the correlation between the features of each class of samples in the new unbalanced data set and the corresponding class label, and to sort the features in descending order of their correlation with the class label;
Computing submodule 32: configured to delete features one dimension at a time in the sorted order, starting from the last dimension; each time a feature dimension is deleted, the new unbalanced data set with the reduced feature set is input into the random forest model, and the corresponding ACC value is computed;
Comparison submodule 33: configured to compare all ACC values and to select the feature dimensionality corresponding to the maximum ACC value, which is the feature dimensionality after feature reduction.
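The feature-reduction procedure (rank features by correlation with the label, then drop tail features while tracking ACC) might be sketched as follows. The evaluator `accuracy_of` stands in for the random-forest ACC computation and is an assumption, as is the use of Pearson correlation for the ranking:

```python
import numpy as np

def reduce_features(X, y, accuracy_of):
    """Return the indices of the retained features (illustrative sketch).

    accuracy_of(X_sub, y) is a caller-supplied evaluator, e.g. the ACC of a
    trained random forest on the reduced data set."""
    # rank features by |Pearson correlation| with the class label
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])
    order = np.argsort(-corr)  # most correlated first
    best_k, best_acc = X.shape[1], -1.0
    # delete the last (least correlated) remaining feature at each step
    for k in range(X.shape[1], 0, -1):
        acc = accuracy_of(X[:, order[:k]], y)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return order[:best_k]
```

Choosing the dimensionality with the maximum ACC, rather than stopping at the first drop, matches the "compare all ACC values" wording above.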
The unbalanced data set conversion method based on sampling and feature reduction provided by the embodiments of the present invention samples the unbalanced data set with a hybrid method combining the S-NKSMOTE algorithm and undersampling, which reduces the overfitting caused by oversampling in the prior art and avoids the accidental deletion of informative samples during undersampling in the prior art; the resulting data set is input into a multi-class SVM for training, which further improves classification accuracy and reduces classification time.
Embodiment 4
Embodiment 4 of the present invention further provides another computer-readable storage medium, which may be the computer-readable storage medium included in the memory of the above embodiments, or may exist independently without being assembled into a terminal. The computer-readable storage medium stores one or more programs, and the one or more programs are executed by one or more processors to perform the methods provided in Embodiment 1 and Embodiment 2.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to mutually, and each embodiment focuses on its differences from the other embodiments.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. An unbalanced data set conversion method based on sampling and feature reduction, characterized in that the method comprises:
obtaining an unbalanced data set, the unbalanced data set comprising a majority class sample set and a minority class sample set;
performing sampling processing on the unbalanced data set to obtain a new unbalanced data set;
performing dimension reduction on the new unbalanced data set to convert it into a feature-reduced new unbalanced data set.
2. The unbalanced data set conversion method based on sampling and feature reduction according to claim 1, characterized in that performing sampling processing on the unbalanced data set comprises oversampling the minority class sample set, including oversampling the minority class sample set with the S-NKSMOTE algorithm, specifically:
obtaining the k nearest-neighbor samples of a sample x in the minority class sample set;
comparing the number of minority class samples with the number of majority class samples among the k nearest-neighbor samples; when the number of minority class samples exceeds the number of majority class samples, labeling x a safe sample; when the number of minority class samples is less than the number of majority class samples but minority class samples are present, labeling x a danger sample; when all k nearest-neighbor samples are majority class samples, labeling x a noise sample;
when x is a noise sample, randomly selecting a sample x′ from the minority class sample set and generating a new sample Xnew close to the minority class samples as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0.5, 1)(x′ − x)
when x is not a noise sample, randomly selecting one sample x′ from its k nearest-neighbor samples; if x′ belongs to the majority class, generating a new sample Xnew close to x as follows, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 0.5)(x′ − x)
if x′ belongs to the minority class, generating a new sample Xnew close to x according to the following equation, all new samples forming the new minority class sample set:
Xnew = x + rand(0, 1)(x′ − x).
3. The unbalanced data set conversion method based on sampling and feature reduction according to claim 1, characterized in that performing dimension reduction on the new unbalanced data set specifically comprises:
analyzing the correlation between the features of each class of samples in the new unbalanced data set and the corresponding class label, and sorting the features in descending order of their correlation with the class label;
deleting features one dimension at a time in the sorted order, starting from the last dimension; each time a feature dimension is deleted, inputting the new unbalanced data set with the reduced feature set into the random forest model and computing the corresponding ACC value;
comparing all ACC values and selecting the feature dimensionality corresponding to the maximum ACC value, which is the feature dimensionality after feature reduction.
4. The unbalanced data set conversion method based on sampling and feature reduction according to claim 2, characterized in that performing sampling processing on the unbalanced data set further comprises undersampling the majority class sample set, specifically:
obtaining the boundary sample set of the majority class sample set and the minority class sample set;
obtaining the central sample of the boundary sample set;
computing the distance from each majority class sample in the majority class sample set to the central sample, and undersampling the majority class sample set according to the computed distances to obtain a new majority class sample set; the new majority class sample set and the new minority class sample set form the new unbalanced data set.
5. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that obtaining the boundary sample set of the majority class sample set and the minority class sample set specifically comprises:
calculating, for each majority class sample in the majority class sample set, the distance to its nearest minority class sample;
calculating, for each minority class sample in the minority class sample set, the distance to its nearest majority class sample;
selecting the majority class sample and the minority class sample corresponding to the minimum distance;
obtaining the m nearest-neighbor samples of the selected majority class sample and the n nearest-neighbor samples of the selected minority class sample;
obtaining the boundary sample set D, D = m ∩ n.
6. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that obtaining the central sample of the boundary sample set specifically comprises:
computing, for each sample in the boundary sample set, its distances to all other samples in the boundary sample set;
computing, for each sample, the variance SD of those distances and their sum E;
computing the dispersion degree B, B = SD × E;
selecting the sample with the smallest dispersion degree as the central sample.
7. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that computing the distance from each sample in the majority class sample set to the central sample and undersampling the majority class sample set according to the computed distances specifically comprises:
computing the distance from each sample in the majority class sample set to the central sample;
sorting the distances in ascending order and arranging them into an R × T matrix;
computing the relative standard deviation RSD of the distances in each row of the matrix;
comparing the relative standard deviation RSD with a threshold RSD1; when RSD ≤ RSD1, computing the mean of the distances in that row and the difference between each distance in the row and that mean, and deleting the samples whose difference exceeds a threshold; when RSD > RSD1, deleting all samples of that row;
each time a row of samples has been deleted from the matrix, inputting the reduced majority class sample set into the random forest model;
computing ΔGm1, ΔGm1 = Gmi − Gm, where Gmi is the G_mean value output by the random forest model for the majority class sample set after deletion of the i-th row of samples, and Gm is the G_mean value output by the random forest model for the original unbalanced data set;
comparing ΔGm1 with a threshold ΔGm; when ΔGm1 ≥ ΔGm, stopping the undersampling, the samples at that point constituting the new majority class sample set.
8. The unbalanced data set conversion method based on sampling and feature reduction according to claim 4, characterized in that oversampling the minority class sample set comprises computing the distance from each minority class sample in the minority class sample set to the central sample and oversampling the minority class sample set according to the computed distances to obtain a new minority class sample set, specifically comprising:
computing the distance from each sample in the minority class sample set to the central sample;
sorting the distances in ascending order and arranging them into an R′ × T′ matrix;
starting from the first row, oversampling the samples corresponding to each row using the S-NKSMOTE algorithm;
each time the samples of a row of the matrix have been oversampled, inputting the sample set formed after oversampling into the random forest model;
computing ΔGm2, ΔGm2 = Gmj − Gm, where Gmj is the G_mean value output by the random forest model for the minority class sample set formed after oversampling of the j-th row of samples, and Gm is the G_mean value output by the random forest model for the original unbalanced data set;
comparing ΔGm2 with the threshold ΔGm; when ΔGm2 ≥ ΔGm, stopping the oversampling, the samples at that point constituting the new minority class sample set.
9. An unbalanced data set conversion system based on sampling and feature reduction, characterized in that the conversion system comprises:
a data acquisition module for obtaining an unbalanced data set, the unbalanced data set comprising a majority class sample set and a minority class sample set;
a sampling processing module for performing sampling processing on the unbalanced data set to obtain a new unbalanced data set;
a dimension reduction module for performing dimension reduction on the new unbalanced data set to convert it into a feature-reduced new unbalanced data set.
10. A computer-readable storage medium on which a computer program is stored, characterized in that when the program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508530.XA CN110348486A (en) | 2019-06-13 | 2019-06-13 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
CN202010371648.5A CN112085046A (en) | 2019-06-13 | 2020-05-06 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910508530.XA CN110348486A (en) | 2019-06-13 | 2019-06-13 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110348486A true CN110348486A (en) | 2019-10-18 |
Family
ID=68181860
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910508530.XA Pending CN110348486A (en) | 2019-06-13 | 2019-06-13 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
CN202010371648.5A Pending CN112085046A (en) | 2019-06-13 | 2020-05-06 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010371648.5A Pending CN112085046A (en) | 2019-06-13 | 2020-05-06 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110348486A (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076438B (en) * | 2021-04-28 | 2023-12-15 | 华南理工大学 | Classification method based on conversion from majority class to minority class under unbalanced data set |
CN113553581A (en) * | 2021-07-12 | 2021-10-26 | 华东师范大学 | Intrusion detection system for unbalanced data |
CN113901448A (en) * | 2021-09-03 | 2022-01-07 | 燕山大学 | Intrusion detection method based on convolutional neural network and lightweight gradient elevator |
CN115242431A (en) * | 2022-06-10 | 2022-10-25 | 国家计算机网络与信息安全管理中心 | Industrial Internet of things data anomaly detection method based on random forest and long-short term memory network |
CN117253095B (en) * | 2023-11-16 | 2024-01-30 | 吉林大学 | Image classification system and method based on biased shortest distance criterion |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582813B (en) * | 2009-06-26 | 2011-07-20 | 西安电子科技大学 | Distributed migration network learning-based intrusion detection system and method thereof |
CN103716204B (en) * | 2013-12-20 | 2017-02-08 | 中国科学院信息工程研究所 | Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process |
CN104598813B (en) * | 2014-12-09 | 2017-05-17 | 西安电子科技大学 | Computer intrusion detection method based on integrated study and semi-supervised SVM |
CN109150830B (en) * | 2018-07-11 | 2021-04-06 | 浙江理工大学 | Hierarchical intrusion detection method based on support vector machine and probabilistic neural network |
CN110348486A (en) * | 2019-06-13 | 2019-10-18 | 中国科学院计算机网络信息中心 | Based on sampling and feature brief non-equilibrium data collection conversion method and system |
- 2019-06-13: CN201910508530.XA filed (publication CN110348486A), status pending
- 2020-05-06: CN202010371648.5A filed (publication CN112085046A), status pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112085046A (en) * | 2019-06-13 | 2020-12-15 | 中国科学院计算机网络信息中心 | Intrusion detection method and system based on sampling and feature reduction for unbalanced data set conversion |
CN113052198A (en) * | 2019-12-28 | 2021-06-29 | 中移信息技术有限公司 | Data processing method, device, equipment and storage medium |
WO2021135271A1 (en) * | 2019-12-30 | 2021-07-08 | 山东英信计算机技术有限公司 | Classification model training method and system, electronic device and storage medium |
US11762949B2 (en) | 2019-12-30 | 2023-09-19 | Shandong Yingxin Computer Technologies Co., Ltd. | Classification model training method, system, electronic device and strorage medium |
CN112036515A (en) * | 2020-11-04 | 2020-12-04 | 北京淇瑀信息科技有限公司 | Oversampling method and device based on SMOTE algorithm and electronic equipment |
CN112395558A (en) * | 2020-11-27 | 2021-02-23 | 广东电网有限责任公司肇庆供电局 | Improved unbalanced data hybrid sampling method suitable for historical fault data of intelligent electric meter |
Also Published As
Publication number | Publication date |
---|---|
CN112085046A (en) | 2020-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110348486A (en) | Based on sampling and feature brief non-equilibrium data collection conversion method and system | |
CN112308158B (en) | Multi-source field self-adaptive model and method based on partial feature alignment | |
CN111967343B (en) | Detection method based on fusion of simple neural network and extreme gradient lifting model | |
Fu et al. | Low-level feature extraction for edge detection using genetic programming | |
CN112613552B (en) | Convolutional neural network emotion image classification method combined with emotion type attention loss | |
CN112883839B (en) | Remote sensing image interpretation method based on adaptive sample set construction and deep learning | |
CN111556016B (en) | Network flow abnormal behavior identification method based on automatic encoder | |
CN102938054B (en) | Method for recognizing compressed-domain sensitive images based on visual attention models | |
Hafemann et al. | Meta-learning for fast classifier adaptation to new users of signature verification systems | |
CN112529638B (en) | Service demand dynamic prediction method and system based on user classification and deep learning | |
CN112580445B (en) | Human body gait image visual angle conversion method based on generation of confrontation network | |
CN114091661A (en) | Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm | |
CN110580510A (en) | clustering result evaluation method and system | |
CN113033567A (en) | Oracle bone rubbing image character extraction method fusing segmentation network and generation network | |
CN104615635B (en) | Palm vein classified index construction method based on direction character | |
CN108509588B (en) | Lawyer evaluation method and recommendation method based on big data | |
CN112200260B (en) | Figure attribute identification method based on discarding loss function | |
Zhang et al. | Intrusion detection model of CNN-BiLSTM algorithm based on mean control | |
CN115277159B (en) | Industrial Internet security situation assessment method based on improved random forest | |
Pandey et al. | A hierarchical clustering approach for image datasets | |
CN111582440A (en) | Data processing method based on deep learning | |
Rahmat et al. | Tree identification to calculate the amount of palm trees using haar-cascade classifier algorithm | |
CN111221915A (en) | Online learning resource quality analysis method based on CWK-means | |
CN112529637B (en) | Service demand dynamic prediction method and system based on context awareness | |
CN108427967B (en) | Real-time image clustering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20191018 |