CN107578071A - Epoch-based method for resolving data imbalance - Google Patents

Epoch-based method for resolving data imbalance

Info

Publication number
CN107578071A
CN107578071A CN201710954471.XA
Authority
CN
China
Prior art keywords
epoch
weight
sample
data
resampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710954471.XA
Other languages
Chinese (zh)
Inventor
赵建峰
宁振虎
蔡永泉
薛菲
公备
王昱波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710954471.XA priority Critical patent/CN107578071A/en
Publication of CN107578071A publication Critical patent/CN107578071A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an Epoch-based method for resolving data imbalance, belonging to the field of deep learning. During training, each Epoch performs random resampling of every class according to class weights, so that the samples within each Epoch are represented evenly. A resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample pool in proportion to these weights, so that the data within the resampled Epoch are relatively balanced. The method resolves the data-imbalance problem more effectively than existing approaches.

Description

Epoch-based method for resolving data imbalance
Technical field
The invention belongs to the field of deep learning, and more particularly relates to an Epoch-based method for resolving data imbalance.
Background art
Imbalanced learning is attracting growing attention in both academia and industry. Imbalanced-data scenarios appear throughout Internet applications, for example click prediction in search engines (clicked web pages account for only a tiny fraction of impressions), product recommendation in e-commerce (the proportion of recommended items that are actually purchased is very low), credit-card fraud detection, and network-attack identification.
The effect of imbalanced data is usually discussed in the context of classification. Consider a binary classification problem whose data set contains 100 rows, of which 90 rows belong to the first class and the remaining 10 rows belong to the second class. This is an imbalanced data set: the ratio of the first class to the second class is 90:10, i.e. 9:1. A classification model built on this data set may report 90% accuracy, yet deeper inspection of the data reveals that this 90% is merely the accuracy on the majority class. The class-imbalance problem can arise in both binary and multi-class classification, and most of the methods below apply to both.
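The following two-line check illustrates why that 90% figure is misleading (the numbers are purely illustrative and not part of the invention):

```python
import numpy as np

y_true = np.array([0] * 90 + [1] * 10)    # 90 rows of class 0, 10 rows of class 1
y_pred = np.zeros(100, dtype=int)         # a model that always predicts the majority class
print((y_pred == y_true).mean())          # 0.9 overall accuracy...
print((y_pred[y_true == 1] == 1).mean())  # ...but 0.0 accuracy on the minority class
```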
Three concepts arise in deep-learning frameworks. Batch size is the number of training samples used in one forward-propagation pass and one back-propagation pass. An iteration is one weight update: each update performs a forward pass over one batch of data to obtain the loss, after which the back-propagation algorithm updates the parameters. An Epoch means that all training samples have completed one forward pass and one backward pass, so the number of samples participating in one Epoch equals the total number of training samples. High-quality data is the key to machine learning and deep learning: data scarcity hinders the development of a model, whereas a model trained on high-quality data is usually more robust (less prone to over-fitting) and may even make training simpler and faster. Several methods already exist for resolving data imbalance.
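As a minimal sketch of how these three quantities relate (the sample and batch counts are illustrative, not part of the invention):

```python
import math

num_samples = 7320   # total training samples (hypothetical)
batch_size = 64      # samples per forward/backward pass

# one iteration = one weight update on one batch of data;
# one epoch = every training sample seen once
iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)  # 115 weight updates make up one epoch
```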
1) Sampling methods
Sampling methods process the training set so that an imbalanced data set becomes balanced, which in most cases improves the final result. Sampling is divided into over-sampling and under-sampling: over-sampling replicates the minority class several times, while under-sampling discards some samples of the majority class, i.e. keeps only part of the majority class.
After over-sampling, some samples appear in the data set repeatedly, so the trained model has a degree of over-fitting; the main remedies for over-fitting are early stopping, L1 or L2 regularization, and Dropout layers. The drawback of under-sampling is obvious: the final training set loses data, and the model learns only part of the overall distribution.
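A minimal sketch of plain random over-sampling and under-sampling (the array shapes and class sizes are illustrative):

```python
import numpy as np

def oversample(X_min: np.ndarray, target_size: int, rng: np.random.Generator) -> np.ndarray:
    """Replicate minority-class rows (with replacement) up to target_size."""
    idx = rng.integers(0, len(X_min), size=target_size)
    return X_min[idx]

def undersample(X_maj: np.ndarray, target_size: int, rng: np.random.Generator) -> np.ndarray:
    """Keep only target_size rows of the majority class (without replacement)."""
    idx = rng.choice(len(X_maj), size=target_size, replace=False)
    return X_maj[idx]

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(120, 8))     # e.g. 120 minority samples
X_majority = rng.normal(size=(7200, 8))    # e.g. 7200 majority samples

upsampled   = np.vstack([oversample(X_minority, 7200, rng), X_majority])
downsampled = np.vstack([X_minority, undersample(X_majority, 120, rng)])
```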
2) Data augmentation
Applying data augmentation to classes with little data can effectively suppress the impact of data imbalance. Data-augmentation methods are extremely important in machine learning and deep learning: suitable augmentation can prevent problems such as over-fitting and can effectively improve the robustness of a model.
There are many kinds of image augmentation techniques. Let A = [a1, a2, ..., a7] be the set of augmentation techniques, where a1 is rotation, a2 is reflection, a3 is flipping, a4 is zooming, a5 is translation, a6 is a scale transformation, and a7 is a contrast transformation.
Let Mi = x_m, ..., x_n denote an operation sequence that applies augmentation techniques step by step, where i ∈ R is the index of the operation sequence and x_n ∈ A, n ∈ [1, 7] is one particular augmentation technique. For example, M = x1, x2 means the data are augmented by rotation and reflection before being analyzed.
More generally, Mi = λ1·x1 + ... + λl·xl can be defined as a weighted operation sequence of augmentation techniques whose result is a weighted combination, where λl is the weight assigned to each augmentation technique.
The overall augmentation operation sequence applied to a grayscale image is then:
y1 = M1(m)
y2 = M2(y1)
…
yk = Mk(yk-1)
where m is the input grayscale image and yk is the result obtained after the sequence of k augmentation operations.
Define D(m) = Mk(yk-1), where D(m) denotes the final result obtained after the input m passes through the sequence of k augmentation operations. Applying such data augmentation to multi-dimensional grayscale images ultimately provides support for malicious-code variant detection.
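A minimal sketch of composing such an operation sequence D(m); the concrete transforms and function names are illustrative placeholders, not the augmentation set fixed by the invention:

```python
from typing import Callable, List
import numpy as np

Transform = Callable[[np.ndarray], np.ndarray]

def rotate90(img: np.ndarray) -> np.ndarray:          # stands in for a1 (rotation)
    return np.rot90(img)

def flip_horizontal(img: np.ndarray) -> np.ndarray:   # stands in for a3 (flipping)
    return np.fliplr(img)

def compose(ops: List[Transform]) -> Transform:
    """Build D(m) = M_k(...M_2(M_1(m))...) from an operation sequence."""
    def D(m: np.ndarray) -> np.ndarray:
        y = m
        for M in ops:          # y1 = M1(m), y2 = M2(y1), ...
            y = M(y)
        return y
    return D

gray_image = np.random.rand(64, 64)                   # stand-in for an input grayscale image m
augmented = compose([rotate90, flip_horizontal])(gray_image)
```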
3) Weight-based methods
The process of training a deep-learning model is the process of training the weights of every layer of the neural network, and each layer's weights are updated by back-propagating the error produced at the network output. The weight-based approach to imbalanced data sets a weight for every class: before the error is back-propagated, the error produced by each class is multiplied by that class's weight value and the weighted errors are summed, and back-propagation is then performed on the resulting error.
The weight-based method therefore resolves data imbalance through weighting. Its main idea is to apply different penalty strategies to different classes so as to generate a weight-based error: misclassifying a sample from a class with few samples incurs a large cost, while misclassifying a sample from a class with many samples incurs a small cost. The difficulty of this method lies in setting reasonable weights; in practice the weighted loss of each class is usually made approximately equal.
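A minimal sketch of such a class-weighted loss, written directly in NumPy; the weight values are illustrative assumptions, not values prescribed by the invention:

```python
import numpy as np

def weighted_cross_entropy(probs: np.ndarray, labels: np.ndarray,
                           class_weights: np.ndarray) -> float:
    """Mean cross-entropy in which each sample's error is scaled by its class weight."""
    sample_w = class_weights[labels]                       # weight of each sample's class
    nll = -np.log(probs[np.arange(len(labels)), labels])   # per-sample negative log-likelihood
    return float(np.mean(sample_w * nll))

# two classes: the small class gets the larger penalty weight
class_weights = np.array([60.0, 1.0])
probs = np.array([[0.2, 0.8], [0.9, 0.1]])   # softmax outputs for two samples
labels = np.array([0, 1])
print(weighted_cross_entropy(probs, labels, class_weights))
```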
Summary of the invention
The main object of the present invention is to propose a deep-learning-based method for resolving data imbalance, namely a new method of this kind: an Epoch-based method for resolving data imbalance.
To achieve the above object, the technical solution adopted by the present invention is an Epoch-based method for resolving data imbalance. During training, each Epoch performs random resampling of every class according to class weights, so that the samples within each Epoch of the training process are represented evenly. A resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample pool in proportion to these weights, so that the data within the resampled Epoch are relatively balanced.
Assume the total data set used for classification contains N samples belonging to M classes. Let X = (x1, x2, ..., xn) be all N training samples and Y = (y1, y2, ..., ym) be all M classes, where n ∈ N+ and m ∈ N+ are positive integers. If the number of samples of one class in the data set is 10 times (or more) the number of samples of another class, the data set is called an imbalanced data set. Training a deep-learning model on an imbalanced data set easily produces untrustworthy results.
First, the Epoch-based method resamples the imbalanced data set according to the configured weights and obtains a sample set whose overall class proportions are relatively balanced. The flow of the method is as follows:
The weight W_i to be set for the next round of resampling is calculated from the initial resampling weight W_init and the final resampling weight W_end. The specific formula is:
$$W_i = r^{\,i-1}\, W_{\mathrm{init}} + \left(1 - r^{\,i-1}\right) W_{\mathrm{end}} \qquad (1)$$
where W_init = (w1, w2, ..., wm) is the initial resampling weight, W_end = (w1, w2, ..., wm) is the final resampling weight, and w_i is the weight set for the i-th class, i ∈ [1, M]. Classes with many samples are given relatively small weights and classes with few samples relatively large weights; r is the step size used at each iteration, and i is the index of the current Epoch.
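A minimal sketch of this interpolation, reading r^(i-1) in formula (1) as a geometric decay of the initial weights toward the final weights; the concrete values of W_init, W_end and r are illustrative assumptions:

```python
import numpy as np

def epoch_class_weights(w_init: np.ndarray, w_end: np.ndarray,
                        r: float, epoch: int) -> np.ndarray:
    """Formula (1): W_i = r^(i-1) * W_init + (1 - r^(i-1)) * W_end."""
    decay = r ** (epoch - 1)
    return decay * w_init + (1.0 - decay) * w_end

w_init = np.array([1.0, 1.0])    # start close to the original class proportions
w_end = np.array([60.0, 1.0])    # end with the minority class weighted 60x
print(epoch_class_weights(w_init, w_end, r=0.9, epoch=5))
```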
The weight assigned to each sample is then computed as:
$$\mathrm{weight}_i = \frac{\sum_{m=1}^{M} \mathbf{1}\{i = m\}\, W_m}{\sum_{n=1}^{N} \sum_{m=1}^{M} \mathbf{1}\{l_n = m\}\, W_m} \qquad (2)$$
where i is the class of the current sample, M is the number of classes, N is the number of samples in one Epoch, l_n is the class label of the n-th sample, 1{l_n = m} equals 1 when the class label of the sample is m and 0 otherwise, W_m is the weight of class m, and weight_i is the weight of the i-th sample.
According to weight formula (2) above, a weight weights is assigned to every sample of the Epoch according to its class, and the whole weights vector is normalized so that it sums to 1. Finally, data of one Epoch in size are randomly resampled according to weights.
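A minimal sketch of assigning the normalized per-sample weights and resampling one Epoch; the array names and the 1:60 class sizes are illustrative:

```python
import numpy as np

def resample_epoch(labels: np.ndarray, class_w: np.ndarray,
                   epoch_size: int, rng: np.random.Generator) -> np.ndarray:
    """Formula (2) plus normalization: weight every sample by its class weight,
    normalize the weights to sum to 1, then draw one Epoch of sample indices."""
    sample_w = class_w[labels]              # class weight of each sample
    sample_w = sample_w / sample_w.sum()    # normalize so the weights sum to 1
    return rng.choice(len(labels), size=epoch_size, replace=True, p=sample_w)

rng = np.random.default_rng(0)
labels = np.array([0] * 120 + [1] * 7200)   # the 1:60 imbalance used in the experiments
class_w = np.array([60.0, 1.0])             # larger weight for the minority class
epoch_idx = resample_epoch(labels, class_w, epoch_size=len(labels), rng=rng)
print(np.bincount(labels[epoch_idx]))       # roughly balanced class counts
```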
The specific algorithm that assigns per-class weights weights and resamples the Epoch samples is shown in Fig. 1.
Then, the relatively balanced data set produced by the Epoch-based method is fed into the deep neural network for training. After all data of the Epoch have participated in training the network, the Epoch index i must be incremented by 1, the weight for the next round is adjusted according to the initial weight, the final weight and the iteration count, another Epoch of data is resampled, and the data continue to participate in training until training becomes stable.
Compared with the prior art, the present invention has the following advantages.
1. A deep-learning-oriented treatment of data imbalance. High-quality data is the key to machine learning and deep learning: data scarcity hinders the development of a model, whereas a model trained on high-quality data is usually more robust and may even make training simpler and faster. Several methods already exist for resolving data imbalance: sampling methods, data augmentation, and weight-based methods. The Epoch-based method proposed by the present invention resolves the data-imbalance problem more effectively.
2. The present invention proposes an Epoch-based method for resolving data imbalance. During training, each Epoch performs random resampling of every class according to class weights, so that the samples within each Epoch are represented evenly. The main idea is to assign a resampling weight to each sample and then randomly resample a sample set of one Epoch in size from the sample pool in proportion to these weights, so that the data within the resampled Epoch are relatively balanced.
Brief description of the drawings
Fig. 1 is a flow chart of the Epoch-based method for resolving data imbalance;
Fig. 2 shows the Epoch-based resampling method;
Fig. 3 shows the training-error (train_loss) comparison curves;
Fig. 4 shows the training-accuracy (train_acc) comparison curves;
Fig. 5 shows the validation-error (val_loss) comparison curves;
Fig. 6 shows the validation-accuracy (val_acc) comparison curves.
Detailed description of the embodiments
To make the purpose, technical solution and features of the present invention clearer, the invention is further described below with reference to a specific embodiment and the accompanying drawings. The flow chart of the Epoch-based method for resolving data imbalance is shown in Fig. 1.
Each step is explained as follows:
1) The full set of training data is fed into the Epoch-based method. According to the initialized class weights, a sample set of the required size is randomly selected, yielding a relatively balanced resampled sample set, which is then fed into the deep neural network for training.
2) After the neural network has been trained on one full Epoch of data, the class weights are adjusted and training continues through the Epoch-based method. Over repeated iterations the class weights approach the final configured weights, and the training error of the network becomes stable and reaches a global minimum.
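A minimal sketch of this overall loop, combining the weight interpolation of formula (1) with the Epoch resampling described above; train_one_epoch is a hypothetical placeholder for the actual network training step:

```python
import numpy as np

def train_with_epoch_resampling(X, labels, w_init, w_end, r,
                                num_epochs, train_one_epoch, rng):
    """Each Epoch: interpolate class weights (1), resample one Epoch (2), then train."""
    for epoch in range(1, num_epochs + 1):
        decay = r ** (epoch - 1)
        class_w = decay * w_init + (1.0 - decay) * w_end    # formula (1)
        sample_w = class_w[labels]
        sample_w = sample_w / sample_w.sum()                # formula (2), normalized
        idx = rng.choice(len(labels), size=len(labels), replace=True, p=sample_w)
        train_one_epoch(X[idx], labels[idx])                # placeholder training step

rng = np.random.default_rng(0)
X = rng.normal(size=(7320, 16))                  # dummy features
labels = np.array([0] * 120 + [1] * 7200)        # 1:60 imbalanced labels
train_with_epoch_resampling(X, labels,
                            w_init=np.array([1.0, 1.0]),
                            w_end=np.array([60.0, 1.0]),
                            r=0.9, num_epochs=3,
                            train_one_epoch=lambda xb, yb: None,  # dummy stand-in
                            rng=rng)
```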
Experimental environment
This section verifies the practical effect of the proposed Epoch-based method for resolving data imbalance through comparative experiments. The experimental environment is an ubuntu14.04 host with 8 GB of memory and a 1 TB hard disk. The experimental data come from an artificially generated imbalanced data set whose class imbalance reaches 1:60:
Item name    Quantity
Class one    120
Class two    7200
This experiment compares the effect of several methods for resolving data imbalance on the same imbalanced data set. The compared approaches are: the original data (unmodified), data augmentation, the Epoch-based method, and the Epoch-based method combined with data augmentation.
The images are augmented using a combination of augmentation techniques: combining augmentation with the small-sample class expands the scale of the training data, which addresses the over-fitting and incomplete training that an insufficient amount of training data may cause. The augmentation combination applied to the samples is as follows:
Attribute    Value
rotation_range 0.1
width_shift 0.1
height_shift 0.1
rescale 1/255
shear_range 0.1
zoom_range 0.1
horizontal_flip True
fill_mode nearest
Here, rotation_range is the rotation range; width_shift is the translation along the horizontal direction; height_shift is the translation along the vertical direction; rescale enlarges or shrinks the image by the specified scale factor; shear_range is the horizontal or vertical shear transformation; zoom_range randomly zooms the image size by the given proportion; horizontal_flip flips the image horizontally; and fill_mode is the way pixels are filled after rotation or translation.
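These attribute names match the parameters of Keras's ImageDataGenerator; assuming width_shift and height_shift correspond to its width_shift_range and height_shift_range parameters, a sketch of this configuration is:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Values copied from the table above; width_shift/height_shift are assumed to map to
# Keras's width_shift_range/height_shift_range parameters.
datagen = ImageDataGenerator(
    rotation_range=0.1,        # rotation range
    width_shift_range=0.1,     # horizontal translation
    height_shift_range=0.1,    # vertical translation
    rescale=1 / 255,           # scale pixel values by the given factor
    shear_range=0.1,           # shear transformation
    zoom_range=0.1,            # random zoom
    horizontal_flip=True,      # random horizontal flip
    fill_mode="nearest",       # how new pixels are filled after a transform
)
# Example use: augmented batches from an image array X with labels y
# for batch_x, batch_y in datagen.flow(X, y, batch_size=32): ...
```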
The comparison curves of the several methods on train_loss are shown in Fig. 3, on train_acc in Fig. 4, on val_loss in Fig. 5, and on val_acc in Fig. 6.
train_loss is the training error obtained for each training Epoch and train_acc is the corresponding training accuracy; val_loss is the validation error obtained after each training Epoch and val_acc is the corresponding validation accuracy.
The four comparison curves show that the proposed Epoch-based method outperforms the other traditional methods and brings a marked improvement on classification problems with imbalanced sample data.

Claims (1)

1. An Epoch-based method for resolving data imbalance, wherein during training each Epoch performs random resampling of every class according to class weights, so that the samples within each Epoch of the training process are represented evenly; a resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample pool in proportion to these weights, so that the data within the resampled Epoch are relatively balanced;
assume the total data set used for classification contains N samples belonging to M classes; let X = (x1, x2, ..., xn) be all N training samples and Y = (y1, y2, ..., ym) be all M classes, where n ∈ N+ and m ∈ N+ are positive integers; if the number of samples of one class in the data set is 10 times (or more) the number of samples of another class, the data set is called an imbalanced data set; training a deep-learning model on an imbalanced data set easily produces untrustworthy results; characterized in that:
first, the Epoch-based method resamples the imbalanced data set according to the configured weights and obtains a sample set whose overall class proportions are relatively balanced; the flow of the method is as follows:
the weight W_i to be set for the next round of resampling is calculated from the initial resampling weight W_init and the final resampling weight W_end; the specific formula is:
$$W_i = r^{\,i-1}\, W_{\mathrm{init}} + \left(1 - r^{\,i-1}\right) W_{\mathrm{end}} \qquad (1)$$
where W_init = (w1, w2, ..., wm) is the initial resampling weight, W_end = (w1, w2, ..., wm) is the final resampling weight, and w_i is the weight set for the i-th class, i ∈ [1, M]; classes with many samples are given relatively small weights and classes with few samples relatively large weights; r is the step size used at each iteration, and i is the index of the current Epoch;
$$\mathrm{weight}_i = \frac{\sum_{m=1}^{M} \mathbf{1}\{i = m\}\, W_m}{\sum_{n=1}^{N} \sum_{m=1}^{M} \mathbf{1}\{l_n = m\}\, W_m} \qquad (2)$$
where i is the class of the current sample, M is the number of classes, N is the number of samples in one Epoch, l_n is the class label of the n-th sample, 1{l_n = m} equals 1 when the class label of the sample is m and 0 otherwise; W_m is the weight of class m and weight_i is the weight of the i-th sample;
according to weight formula (2) above, a weight weights is assigned to every sample of the Epoch according to its class, and the whole weights vector is normalized so that it sums to 1; finally, data of one Epoch in size are randomly resampled according to weights;
then, the relatively balanced data set produced by the Epoch-based method is fed into the deep neural network for training; after all data of the Epoch have participated in training the network, the Epoch index i must be incremented by 1, the weight for the next round is adjusted according to the initial weight, the final weight and the iteration count, another Epoch of data is resampled, and the data continue to participate in training until training becomes stable.
CN201710954471.XA 2017-10-13 2017-10-13 Epoch-based method for resolving data imbalance Pending CN107578071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710954471.XA CN107578071A (en) 2017-10-13 2017-10-13 Epoch-based method for resolving data imbalance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710954471.XA CN107578071A (en) 2017-10-13 2017-10-13 Epoch-based method for resolving data imbalance

Publications (1)

Publication Number Publication Date
CN107578071A true CN107578071A (en) 2018-01-12

Family

ID=61037130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710954471.XA Pending CN107578071A (en) 2017-10-13 2017-10-13 Epoch-based method for resolving data imbalance

Country Status (1)

Country Link
CN (1) CN107578071A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304287A (en) * 2018-01-22 2018-07-20 腾讯科技(深圳)有限公司 A kind of disk failure detection method, device and relevant device
CN108304287B (en) * 2018-01-22 2021-05-28 腾讯科技(深圳)有限公司 Disk fault detection method and device and related equipment
CN108460277A (en) * 2018-02-10 2018-08-28 北京工业大学 A kind of automation malicious code mutation detection method
CN108564570A (en) * 2018-03-29 2018-09-21 哈尔滨工业大学(威海) A kind of method and apparatus of intelligentized pathological tissues positioning
CN109598349A (en) * 2018-11-23 2019-04-09 华南理工大学 Overhead transmission line fault detection data sample batch processing training method based on classification stochastical sampling
CN110188592A (en) * 2019-04-10 2019-08-30 西安电子科技大学 A kind of urinary formed element cell image disaggregated model construction method and classification method
CN110188592B (en) * 2019-04-10 2021-06-29 西安电子科技大学 Urine formed component cell image classification model construction method and classification method
CN110717515A (en) * 2019-09-06 2020-01-21 北京三快在线科技有限公司 Model training method and device and electronic equipment
CN110689066A (en) * 2019-09-24 2020-01-14 成都考拉悠然科技有限公司 Training method combining face recognition data equalization and enhancement
CN110991402A (en) * 2019-12-19 2020-04-10 湘潭大学 Skin disease classification device and method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180112)