CN107578071A - Epoch-based method for solving data imbalance - Google Patents
Epoch-based method for solving data imbalance
- Publication number
- CN107578071A (application CN201710954471.XA)
- Authority
- CN
- China
- Prior art keywords
- epoch
- weight
- sample
- data
- resampling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an Epoch-based method for solving data imbalance, belonging to the field of deep learning. During training, random resampling is performed for each class according to weights in every Epoch, so that the samples in every Epoch of training are evenly represented. The main idea is to assign a resampling weight to each sample and then randomly resample, in proportion to those weights, a sample set of one Epoch in size from the sample library, so that the resampled Epoch data are relatively balanced. The invention can solve the data imbalance problem more effectively.
Description
Technical field
The invention belongs to the technical field of deep learning, and in particular relates to an Epoch-based method for solving data imbalance.
Background
Imbalanced learning has attracted increasing attention in both academia and industry. Imbalanced data appear in many areas of Internet applications, such as click prediction for search engines (clicked pages usually make up a very small proportion), product recommendation in e-commerce (the proportion of recommended products that are actually purchased is very low), credit card fraud detection and network attack identification.
The effects of imbalanced data are most commonly seen in classification problems. Consider a binary classification problem (two classes of data) with 100 rows of data, of which 90 rows belong to the first class and the remaining 10 rows belong to the second class. This is an imbalanced data set (class-imbalanced data): the ratio of first-class to second-class data is 90:10, or 9:1. Suppose a classification model built on this data set reaches an accuracy of 90%. On closer inspection, this 90% may be nothing more than the accuracy on a single class: a model that assigns every row to the first class scores exactly 90%. Class imbalance can arise in binary as well as multi-class classification problems, and most methods apply to both.
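The 90:10 example above can be checked with a few lines of Python. This is only an illustrative sketch: the "model" is a hypothetical classifier that always predicts the first class, and the helper name `recall` is not taken from the source.

```python
# Labels for the 100-row example: 90 rows of class 1, 10 rows of class 2.
labels = [1] * 90 + [2] * 10

# A degenerate "model" that always predicts the majority class.
predictions = [1] * len(labels)

# Overall accuracy: fraction of rows predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def recall(cls):
    """Fraction of the rows of class `cls` that were predicted correctly."""
    hits = sum(p == y for p, y in zip(predictions, labels) if y == cls)
    return hits / labels.count(cls)

print(accuracy, recall(1), recall(2))
```

The overall accuracy looks acceptable, yet no row of the second class is recognized, which is why accuracy alone is a poor measure on imbalanced data.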
Deep learning frameworks use the following three concepts. Batch Size is the number of training samples used in one forward propagation and one backpropagation pass. An iteration is one weight update: each update runs forward propagation on Batch Size samples to obtain the loss function, and then updates the parameters with the backpropagation algorithm. An Epoch is one forward propagation and one backpropagation pass over all training samples, so the number of samples taking part in one Epoch of training equals the total number of training samples. High-quality data are the key to machine learning and deep learning. A shortage of data hinders the development of a model, whereas a model trained on high-quality data tends to be more robust (less prone to over-fitting), and a good data set can even make training simpler and faster. Many methods are now available for handling data imbalance.
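As a brief aside on the Batch Size, iteration and Epoch concepts defined above, the following sketch computes how many iterations make up one Epoch. The sample count and batch size are hypothetical, and keeping the last, partially filled batch is an assumption.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to pass over all samples once.

    Assumes the final, partially filled batch is still used.
    """
    return math.ceil(num_samples / batch_size)

# Hypothetical numbers, chosen only for illustration.
num_samples = 7320
batch_size = 32
print(iterations_per_epoch(num_samples, batch_size))
```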
1) Sampling
Sampling methods process the training set so that an imbalanced data set becomes a balanced one, which in most cases improves the final result. Sampling is divided into up-sampling and down-sampling. Up-sampling makes multiple copies of the samples of the small classes; down-sampling removes some samples from the large classes, in other words selects only part of the samples of those classes.
After up-sampling, some samples occur repeatedly in the data set, so the trained model tends to over-fit to some degree. The main remedies for over-fitting are early stopping, adding L1 or L2 regularization, and adding Dropout layers. The shortcoming of down-sampling is equally apparent: the final training set loses data, and the model learns only part of the overall pattern.
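The two sampling strategies described above can be sketched as follows. The function names are hypothetical, and balancing every class to the size of the largest (or smallest) class is one possible choice that the source does not prescribe.

```python
import random
from collections import defaultdict

def group_by_class(samples, labels):
    """Collect the samples of each class into a separate list."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups

def upsample(samples, labels, rng):
    """Duplicate randomly chosen samples of smaller classes up to the largest class size."""
    groups = group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    out = []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        out.extend((x, y) for x in g + extra)
    return out

def downsample(samples, labels, rng):
    """Keep only a random subset of each class, as large as the smallest class."""
    groups = group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    out = []
    for y, g in groups.items():
        out.extend((x, y) for x in rng.sample(g, target))
    return out

rng = random.Random(0)
samples = list(range(100))
labels = [1] * 90 + [2] * 10
up = upsample(samples, labels, rng)
down = downsample(samples, labels, rng)
print(len(up), len(down))
```

With the 90:10 data, up-sampling yields 180 samples containing repeated minority samples, while down-sampling keeps only 20 samples and discards most of the majority class.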
2) Data augmentation
Applying data augmentation to the classes with little data can effectively suppress the effects of data imbalance. Data augmentation is very important in both machine learning and deep learning; suitable augmentation methods can effectively avoid problems such as over-fitting and improve the robustness of the model.
There are many data augmentation techniques for images. Let $A = [a_1, a_2, \ldots, a_7]$ be the set of data augmentation techniques, where $a_1$ is rotation, $a_2$ is reflection, $a_3$ is flipping, $a_4$ is zooming, $a_5$ is translation, $a_6$ is scaling and $a_7$ is contrast transformation.
Let $M_i = x_m, \ldots, x_n$ denote an operation sequence that applies data augmentation techniques step by step, where $i \in \mathbb{R}$ denotes the order of the operation sequence and $x_n \in A$, $n \in [1, 7]$, denotes one data augmentation technique. For example, $M = x_1, x_2$ denotes data augmentation by rotation followed by reflection.
More generally, $M_i = \lambda_1 x_1 + \ldots + \lambda_l x_l$ can be defined as a weighted operation sequence of data augmentation techniques, whose results are weighted, where $\lambda_l$ is the weight assigned to each data augmentation technique.
The overall order of the data augmentation operation sequences for a grayscale image is then:
$$y_1 = M_1(m)$$
$$y_2 = M_2(y_1)$$
$$\ldots$$
$$y_k = M_k(y_{k-1})$$
where $m$ is the input grayscale image data and $y_k$ is the result obtained after $k$ data augmentation operation sequences.
Define $D(m) = M_k(y_{k-1})$, where $D(m)$ denotes the final result obtained after the input data $m$ has passed through $k$ data augmentation operation sequences. Performing data augmentation on multidimensional grayscale images ultimately provides support for malicious code variant detection.
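The chained definition of D(m) above amounts to function composition. The sketch below illustrates this on a tiny image stored as a list of rows; the three operations and the helper `compose` are hypothetical stand-ins for the techniques in A, not the implementation used in the source.

```python
from functools import reduce

def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the order of the rows."""
    return img[::-1]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def compose(operations):
    """Build D(m) = M_k(...M_2(M_1(m))) from the ordered list [M_1, ..., M_k]."""
    return lambda m: reduce(lambda y, op: op(y), operations, m)

m = [[1, 2],
     [3, 4]]
D = compose([rotate_90, flip_horizontal])
print(D(m))
```

Because each step consumes the output of the previous one, the order of the operations matters: swapping the two steps in this example gives a different image.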
3) Weight-based methods
Training in deep learning is the process of training the weights of each layer of the neural network; each update of the layer weights is adjusted by backpropagating the error computed from the network output. Weight-based methods for handling data imbalance set a weight for each class: before the error is backpropagated, the error produced by each class is multiplied by the weight of that class and the weighted errors are summed. Backpropagation is then carried out on the resulting error.
Weight-based methods address data imbalance by weighting. The main idea is to apply different penalties to samples of different classes so as to produce a weighted error: misclassifying a class with few samples incurs a large cost, whereas misclassifying a class with many samples incurs a small cost. The difficulty of this approach lies in setting reasonable weights; in practice the weighted losses of the classes are usually made approximately equal.
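The weighting idea described above can be illustrated with a class-weighted cross-entropy loss. This is a minimal sketch: cross-entropy is an assumed choice of error function, and the probabilities and weights are hypothetical values, none of which come from the source.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy in which each sample's term is scaled by its class weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -class_weights[y] * math.log(p[y])
    return total / len(labels)

# Hypothetical setting: class 0 is frequent, class 1 is rare.
probs = [[0.9, 0.1], [0.9, 0.1]]   # predicted class probabilities for two samples
labels = [0, 1]                    # the second (rare-class) sample is misclassified

uniform = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 9.0})
print(uniform, weighted)
```

With the larger weight on the rare class, the same misclassification contributes a much larger error, so backpropagation penalizes it more strongly.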
Summary of the invention
The main object of the present invention is to propose a deep-learning-based method for solving data imbalance, namely a new method: the Epoch-based method for solving data imbalance.
To achieve the above object, the technical solution adopted by the present invention is an Epoch-based method for solving data imbalance. During training, random resampling is performed for each class according to weights in every Epoch, so that the samples in every Epoch of training are evenly represented. A resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample library in proportion to the weights, so that the resampled Epoch data are relatively balanced.
Assume that the total data set used for classification contains N samples belonging to M classes. Let $X = (x_1, x_2, \ldots, x_n)$ be all N training samples and $Y = (y_1, y_2, \ldots, y_m)$ be all M classes, where $n \in \mathbb{N}^+$ and $m \in \mathbb{N}^+$, that is, n and m are positive integers. If the number of samples of one class in the data set is 10 times or more the number of samples of the other classes, the data set is called an imbalanced data set. Deep learning training on an imbalanced data set easily leads to unreliable training results.
First, the Epoch-based method for solving data imbalance resamples the imbalanced data set according to the set weights, yielding a relatively balanced sample set whose size equals the total number of samples. The flow of the method is as follows.
From the initial resampling weight $W_{init}$ and the final resampling weight $W_{end}$, the weight $W_i$ to be set for the next round of resampling is calculated. The specific formula is:
$$W_i = r^{i-1} W_{init} + (1 - r^{i-1}) W_{end} \qquad (1)$$
where $W_{init} = (w_1, w_2, \ldots, w_m)$ is the initial resampling weight and $W_{end} = (w_1, w_2, \ldots, w_m)$ is the final resampling weight; $w_i$ is the weight set for the i-th class, $i \in [1, M]$. Classes with many samples are given relatively small weights, and classes with few samples relatively large weights. $r$ is the step size used in each iteration, and $i$ is the number of Epoch iterations.
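Formula (1) can be sketched in Python as follows. The sketch reads the term written with r and i-1 as the power r^(i-1) and assumes 0 < r < 1, so that the weights start at the initial weight for i = 1 and move towards the final weight as i grows; the numeric weights are hypothetical.

```python
def epoch_weights(w_init, w_end, r, i):
    """Class weights for the i-th Epoch (i starts at 1), following formula (1)."""
    decay = r ** (i - 1)
    return [decay * a + (1.0 - decay) * b for a, b in zip(w_init, w_end)]

# Hypothetical weights for two classes: the first class is the rare one.
w_init = [1.0, 1.0]
w_end = [60.0, 1.0]

for i in (1, 2, 3, 10):
    print(i, epoch_weights(w_init, w_end, 0.5, i))
```

A smaller r moves the weights to the final weight in fewer Epochs, while r close to 1 keeps the sampling close to the initial weights for longer.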
Each sample is then assigned a weight according to its class:
$$\mathrm{weight}_i = \frac{\sum_{m=1}^{M} 1\{i = m\}\, W_m}{\sum_{n=1}^{N} \sum_{m=1}^{M} 1\{l_n = m\}\, W_m} \qquad (2)$$
where $i$ is the class of the sample, $M$ is the number of classes, $N$ is the number of samples in an Epoch, $l_n$ is the class label to which the n-th sample belongs, and $1\{l_n = m\}$ is 1 if the class label of the sample equals class $m$ and 0 otherwise. $W_m$ is the weight of class $m$, and $\mathrm{weight}_i$ is the weight of that sample.
According to weight formula (2), weights are assigned by class to all samples of the Epoch, and the weights are normalized so that they sum to 1. Finally, Epoch-size data are randomly resampled according to the weights.
The algorithm that assigns weights by class and resamples the Epoch samples is shown in Figure 1.
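The weight assignment of formula (2) and the subsequent resampling step can be sketched as follows. Sampling with replacement, the function names and the class weights are assumptions made for illustration; the source only states that Epoch-size data are randomly resampled according to the weights.

```python
import random
from collections import Counter

def sample_weights(labels, class_weights):
    """Per-sample weights following formula (2); by construction they sum to 1."""
    raw = [class_weights[y] for y in labels]
    total = sum(raw)
    return [w / total for w in raw]

def resample_epoch(labels, class_weights, epoch_size, rng):
    """Draw `epoch_size` sample indices with replacement, proportionally to weight."""
    probs = sample_weights(labels, class_weights)
    return rng.choices(range(len(labels)), weights=probs, k=epoch_size)

# Hypothetical data: 90 samples of class 1 and 10 samples of class 2.
labels = [1] * 90 + [2] * 10
class_weights = {1: 1.0, 2: 9.0}

rng = random.Random(0)
indices = resample_epoch(labels, class_weights, len(labels), rng)
print(Counter(labels[i] for i in indices))
```

With these weights each class carries half of the total probability mass, so the resampled Epoch is expected to contain roughly equal numbers of both classes.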
Then, the roughly balanced data set produced by the Epoch-based method is fed into the deep learning neural network for training. After all the data of an Epoch have taken part in training the neural network, the Epoch iteration count i is incremented by 1, the weight for the next round is adjusted according to the initial weight, the final weight and the iteration count, one more Epoch of data is resampled, and training of the neural network continues until it becomes stable.
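The overall loop described above can be sketched as follows. This is not the implementation of the source: `train_one_epoch` is a hypothetical callback standing in for the neural network training step, a fixed number of Epochs replaces the "until training becomes stable" criterion, and the term in formula (1) is read as the power r^(i-1).

```python
import random

def epoch_weights(w_init, w_end, r, i):
    """Class weights for the i-th Epoch, following formula (1)."""
    decay = r ** (i - 1)
    return {c: decay * w_init[c] + (1.0 - decay) * w_end[c] for c in w_init}

def run_training(labels, w_init, w_end, r, num_epochs, train_one_epoch, seed=0):
    """Resample one Epoch of indices per round and pass them to the training step."""
    rng = random.Random(seed)
    results = []
    for i in range(1, num_epochs + 1):
        weights = epoch_weights(w_init, w_end, r, i)
        per_sample = [weights[y] for y in labels]
        # random.choices normalizes the weights and samples with replacement.
        indices = rng.choices(range(len(labels)), weights=per_sample, k=len(labels))
        results.append(train_one_epoch(indices))
    return results

# Hypothetical data and a stand-in "training step" that only reports
# the share of minority-class samples in the resampled Epoch.
labels = [1] * 90 + [2] * 10

def minority_share(indices):
    return sum(labels[j] == 2 for j in indices) / len(indices)

shares = run_training(labels, {1: 1.0, 2: 1.0}, {1: 1.0, 2: 9.0},
                      r=0.5, num_epochs=5, train_one_epoch=minority_share)
print(shares)
```

In this sketch the share of minority-class samples is expected to rise from about 10% towards about 50% as the class weights move from the initial to the final values.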
Compared with the prior art, the present invention has the following advantages.
1. A deep-learning-based method for data imbalance. High-quality data are the key to machine learning and deep learning. A shortage of data hinders the development of a model, whereas a model trained on high-quality data tends to be more robust, and a good data set can even make training simpler and faster. Many methods are now available for solving data imbalance: sampling, data augmentation and weight-based methods. The Epoch-based method for solving data imbalance proposed by the present invention can solve the data imbalance problem more effectively.
2. The present invention proposes an Epoch-based method for solving data imbalance. During training, the Epoch-based method performs random resampling for each class according to weights in every Epoch, so that the samples in every Epoch of training are evenly represented. The main idea is to assign a resampling weight to each sample and then randomly resample, in proportion to the weights, a sample set of one Epoch in size from the sample library, so that the resampled Epoch data are relatively balanced.
Brief description of the drawings
Fig. 1 is a flow chart of the Epoch-based method for solving data imbalance;
Fig. 2 shows the Epoch-based resampling method;
Fig. 3 shows the training error comparison curves;
Fig. 4 shows the training accuracy comparison curves;
Fig. 5 shows the test error comparison curves;
Fig. 6 shows the test accuracy comparison curves.
Embodiment
To make the purpose, technical solution and features of the present invention clearer, the invention is described in further detail below with reference to a specific embodiment and the accompanying drawings. The flow chart of the Epoch-based method for solving data imbalance is shown in Fig. 1.
The steps are explained as follows:
1) The full training data are fed into the Epoch-based method. According to the initialized class weights, a sample set of the same size as the training set is selected at random, giving a relatively balanced resampled sample set, which is fed into the deep learning neural network for training.
2) After the neural network has been trained on one complete Epoch of data, the class weights are adjusted, a new sample set is drawn by the Epoch-based method, and training continues. Over many iterations the class weights approach the final set weights arbitrarily closely, so that the training error of the neural network becomes stable and reaches a global minimum.
Experimental environment
This section verifies the practical effect of the proposed Epoch-based method for solving data imbalance by cross experiments. The experimental environment is a host running Ubuntu 14.04 with 8 GB of memory and a 1 TB hard disk. The experimental data are a manually generated imbalanced data set in which the class ratio reaches 1:60:
Class name | Quantity | Class name | Quantity |
Class 1 | 120 | Class 2 | 7200 |
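As a worked illustration of the class counts in the table above, the sketch below computes the imbalance ratio and the class shares expected in a resampled Epoch. The class weights, chosen inversely proportional to class size, are hypothetical; the source does not state the weights used in the experiment.

```python
counts = {"class 1": 120, "class 2": 7200}

# Imbalance ratio of the majority class to the minority class.
ratio = counts["class 2"] / counts["class 1"]

# Hypothetical final class weights, inversely proportional to class size.
class_weights = {c: 1.0 / n for c, n in counts.items()}

# Expected share of each class in a resampled Epoch: class weight times class
# size, normalized over all classes (formula (2) summed over each class).
total = sum(class_weights[c] * n for c, n in counts.items())
expected_share = {c: class_weights[c] * n / total for c, n in counts.items()}

print(ratio)
print(expected_share)
```

Under these assumed weights both classes are expected to make up about half of each resampled Epoch, even though the raw data are split 1:60.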
This experiment tests the effect of several methods for solving data imbalance on the same imbalanced data set. The methods compared are: the original data (unmodified), data augmentation, the Epoch-based method, and the Epoch-based method combined with data augmentation.
A combination of data augmentation techniques is applied to the images. Expanding the small classes with this combination enlarges the training data and addresses the over-fitting and incomplete training that an imbalanced amount of training data may cause. The data augmentation settings applied to the samples are as follows:
Attribute | Value |
rotation_range | 0.1 |
width_shift | 0.1 |
height_shift | 0.1 |
rescale | 1/255 |
shear_range | 0.1 |
zoom_range | 0.1 |
horizontal_flip | True |
fill_mode | nearest |
Here, rotation_range is the rotation range; width_shift is translation in the horizontal direction; height_shift is translation in the vertical direction; rescale enlarges or reduces the image by the specified scale factor; shear_range is horizontal or vertical projective transformation; zoom_range randomly scales the image size proportionally; horizontal_flip flips the image horizontally; and fill_mode is the way pixels are filled in after rotation or translation.
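The settings in the table above can be collected in a plain Python dictionary, as sketched below. The attribute names resemble keyword arguments of the `ImageDataGenerator` class in Keras, but the source does not state which library was used, so the renaming of the two shift attributes to Keras-style names is an assumption.

```python
# Settings copied from the table above, keyed by the names used in the source.
augmentation_settings = {
    "rotation_range": 0.1,
    "width_shift": 0.1,
    "height_shift": 0.1,
    "rescale": 1 / 255,
    "shear_range": 0.1,
    "zoom_range": 0.1,
    "horizontal_flip": True,
    "fill_mode": "nearest",
}

# Assumed renaming to Keras-style argument names; the source names no library.
KERAS_STYLE_NAMES = {
    "width_shift": "width_shift_range",
    "height_shift": "height_shift_range",
}

def to_keras_style(settings):
    """Return a copy of `settings` with the two shift keys renamed."""
    return {KERAS_STYLE_NAMES.get(k, k): v for k, v in settings.items()}

keras_kwargs = to_keras_style(augmentation_settings)
print(sorted(keras_kwargs))
```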
The comparison curves of the methods for solving data imbalance are shown in Fig. 3 for train_loss, in Fig. 4 for train_acc, in Fig. 5 for val_loss and in Fig. 6 for val_acc.
train_loss is the training error obtained each time an Epoch of data is trained, and train_acc is the corresponding training accuracy; val_loss is the test error obtained each time an Epoch of data is trained, and val_acc is the corresponding test accuracy.
The four comparison curves above show that, compared with the other traditional methods, the Epoch-based method proposed here considerably improves results on classification problems with imbalanced data samples.
Claims (1)
1. An Epoch-based method for solving data imbalance, in which, during training, random resampling is performed for each class according to weights in every Epoch, so that the samples in every Epoch of training are evenly represented; a resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample library in proportion to the weights, so that the resampled Epoch data are relatively balanced;
Assume that the total data set used for classification contains N samples belonging to M classes; let $X = (x_1, x_2, \ldots, x_n)$ be all N training samples and $Y = (y_1, y_2, \ldots, y_m)$ be all M classes; $n \in \mathbb{N}^+$, $m \in \mathbb{N}^+$, that is, n and m are positive integers; if the number of samples of one class in the data set is 10 times or more the number of samples of the other classes, the data set is called an imbalanced data set; deep learning training on an imbalanced data set easily leads to unreliable training results; characterized in that:
First, the Epoch-based method for solving data imbalance resamples the imbalanced data set according to the set weights, yielding a relatively balanced sample set whose size equals the total number of samples; the flow of the method is as follows:
From the initial resampling weight $W_{init}$ and the final resampling weight $W_{end}$, the weight $W_i$ to be set for the next round of resampling is calculated; the specific formula is:
$$W_i = r^{i-1} W_{init} + (1 - r^{i-1}) W_{end} \qquad (1)$$
where $W_{init} = (w_1, w_2, \ldots, w_m)$ is the initial resampling weight and $W_{end} = (w_1, w_2, \ldots, w_m)$ is the final resampling weight; $w_i$ is the weight set for the i-th class, $i \in [1, M]$; classes with many samples are given relatively small weights, and classes with few samples relatively large weights; $r$ is the step size used in each iteration, and $i$ is the number of Epoch iterations;
$$\mathrm{weight}_i = \frac{\sum_{m=1}^{M} 1\{i = m\}\, W_m}{\sum_{n=1}^{N} \sum_{m=1}^{M} 1\{l_n = m\}\, W_m} \qquad (2)$$
where $i$ is the class of the sample, $M$ is the number of classes, $N$ is the number of samples in an Epoch, $l_n$ is the class label to which the n-th sample belongs, and $1\{l_n = m\}$ is 1 if the class label of the sample equals class $m$ and 0 otherwise; $W_m$ is the weight of class $m$, and $\mathrm{weight}_i$ is the weight of that sample;
According to weight formula (2), weights are assigned by class to all samples of the Epoch; the weights are normalized so that they sum to 1; finally, Epoch-size data are randomly resampled according to the weights;
Then, the roughly balanced data set produced by the Epoch-based method is fed into the deep learning neural network for training; after all the data of an Epoch have taken part in training the neural network, the Epoch iteration count i is incremented by 1, the weight for the next round is adjusted according to the initial weight, the final weight and the iteration count, one more Epoch of data is resampled, and training of the neural network continues until it becomes stable.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710954471.XA CN107578071A (en) | 2017-10-13 | 2017-10-13 | Epoch-based method for solving data imbalance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107578071A (en) | 2018-01-12 |
Family
ID=61037130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710954471.XA (Pending; published as CN107578071A (en)) | Epoch-based method for solving data imbalance | 2017-10-13 | 2017-10-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107578071A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180112 |