CN107578071A

CN107578071A - Epoch-based method for solving data imbalance

Info

Publication number: CN107578071A
Application number: CN201710954471.XA
Authority: CN
Inventors: 赵建峰; 宁振虎; 蔡永泉; 薛菲; 公备; 王昱波
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2018-01-12

Abstract

The invention discloses an Epoch-based method for solving data imbalance, belonging to the field of deep learning. During training, each Epoch randomly resamples every class according to a weight, so that the samples in each Epoch of training are evenly represented. A resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample pool in proportion to the weights, so that the resampled Epoch data is relatively balanced. The Epoch-based method of the invention can solve the data imbalance problem more effectively.

Description

Epoch-based method for solving data imbalance

Technical field

The invention belongs to the technical field of deep learning, and more particularly relates to an Epoch-based method for solving data imbalance.

Background art

Imbalanced learning is attracting increasing attention in both academia and industry. Imbalanced data also appears in every aspect of Internet applications, such as click prediction for search engines (clicked pages tend to make up a very small proportion), product recommendation in e-commerce (the proportion of recommended products that are purchased is very low), credit card fraud detection, and network attack identification.

The effects of imbalanced data are most commonly seen in classification problems. Consider a binary classification problem (two classes of data) with 100 rows of data, of which 90 rows belong to the first class and the remaining 10 rows belong to the second class. This is an imbalanced dataset (class-imbalanced data) whose ratio of first-class to second-class data is 90:10, or 9:1. Suppose a classification model built on this dataset reaches 90% accuracy. On closer inspection of the data, this 90% accuracy may turn out to be the accuracy on a single class only. Class imbalance can occur in both binary and multi-class classification problems, and most methods can be applied to both.
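The 90:10 example above can be reproduced with a short sketch. The "model" here is a hypothetical one that always predicts the majority class; it is not part of the invention and only illustrates why overall accuracy is misleading on imbalanced data.

```python
# Hypothetical labels matching the 90:10 example: 90 rows of class 1, 10 rows of class 2.
labels = [1] * 90 + [2] * 10

# A degenerate "model" that always predicts the majority class.
predictions = [1] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def recall(cls):
    """Fraction of the samples of class `cls` that were predicted correctly."""
    hits = sum(p == y for p, y in zip(predictions, labels) if y == cls)
    return hits / labels.count(cls)

print(accuracy)   # 0.9
print(recall(1))  # 1.0
print(recall(2))  # 0.0
```

The overall accuracy is 90%, yet no sample of the second class is ever recognized.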

Deep learning frameworks involve the following three concepts. Batch Size is the number of training samples required for one forward propagation pass and one backpropagation pass. An Iteration is one weight update: each update runs forward propagation on Batch Size samples to obtain the loss function, and then updates the parameters with the backpropagation algorithm. An Epoch is one forward propagation pass and one backpropagation pass over all training samples, so the number of samples taking part in one Epoch of training is the total number of training samples. High-quality data is key to machine learning and deep learning. A shortage of data hinders the development of a model, whereas a model trained on high-quality data tends to be more robust (it resists overfitting), and a good dataset can even make training simpler and faster. Many methods now exist for solving data imbalance.
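The relationship between the three concepts can be sketched as follows. The sample count and batch size are hypothetical, and the sketch assumes that a final partial batch still counts as one Iteration.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of weight updates needed for one pass over all training samples."""
    return math.ceil(num_samples / batch_size)

# Hypothetical sizes: 7320 training samples, batches of 32.
print(iterations_per_epoch(7320, 32))  # 229
```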

1) Sampling

Sampling methods process the training set to turn an imbalanced dataset into a balanced one, which improves the final result in most cases. Sampling is divided into upsampling and downsampling. Upsampling copies the minority class multiple times; downsampling removes some samples from the majority class, or in other words selects only part of the majority-class samples.

After upsampling, some samples appear repeatedly in the dataset, so the trained model overfits to some degree. The main remedies for overfitting are early stopping, adding L1 or L2 regularization, and adding Dropout layers. The drawback of downsampling is obvious: the final training set has lost data, so the model learns only part of the overall pattern.
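A minimal sketch of the two sampling strategies, under the assumption that upsampling copies minority samples at random with replacement and downsampling keeps a random subset. The function names and the toy data are illustrative only.

```python
import random

def upsample(minority, target_size, rng):
    """Copy minority samples (with replacement) until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def downsample(majority, target_size, rng):
    """Keep a random subset of the majority class and discard the rest."""
    return rng.sample(majority, target_size)

rng = random.Random(0)
majority = [("a", i) for i in range(90)]
minority = [("b", i) for i in range(10)]

up = upsample(minority, len(majority), rng)
down = downsample(majority, len(minority), rng)
print(len(up), len(set(up)))  # 90 10
print(len(down))              # 10
```

The upsampled set has 90 entries but only 10 distinct samples, which is the source of the overfitting risk, while the downsampled set has discarded 80 majority samples.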

2) Data augmentation

Applying data augmentation to classes with little data can effectively suppress the effects of data imbalance. Data augmentation is very important in machine learning and deep learning: suitable augmentation methods can effectively prevent problems such as overfitting and improve the robustness of the model.

There are many image-oriented data augmentation techniques. Let $A = [a_1, a_2, \ldots, a_7]$ be the set of data augmentation techniques, where $a_1$ is rotation, $a_2$ is reflection, $a_3$ is flipping, $a_4$ is zooming, $a_5$ is translation, $a_6$ is scale transformation, and $a_7$ is contrast transformation.

Let $M_i = x_m, \ldots, x_n$ denote an operation sequence that applies data augmentation techniques step by step, where $i \in R$ is the order of the operation sequence and $x_n \in A$, $n \in [1, 7]$, is a particular data augmentation technique. For example, $M = x_1, x_2$ denotes data augmentation using rotation followed by reflection.

More generally, $M_i = \lambda_1 x_1 + \ldots + \lambda_l x_l$ can be defined as a weighted operation sequence of data augmentation techniques, whose results are weighted, where $\lambda_l$ is the weight assigned to each data augmentation technique.

The overall order of data augmentation operation sequences applied to a grayscale image is then:

$$y_1 = M_1(m)$$

$$y_2 = M_2(y_1)$$

$$\ldots$$

$$y_k = M_k(y_{k-1})$$

where $m$ is the input grayscale image data and $y_k$ is the result obtained after $k$ data augmentation operation sequences.

Define $D(m) = M_k(y_{k-1})$, where $D(m)$ is the final result obtained after the input data $m$ has passed through $k$ data augmentation operation sequences. Ultimately, applying data augmentation to multidimensional grayscale images provides support for detecting malicious code variants.
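The chained definition of D(m) amounts to function composition. The sketch below assumes each operation sequence is a single function on a 2-D list of gray values; `rotate90` and `flip_horizontal` are illustrative stand-ins for two of the techniques in A, not the implementation used in the invention.

```python
def rotate90(img):
    """Rotate a 2-D grayscale image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def D(m, operations):
    """Apply M_1, ..., M_k in order: y_1 = M_1(m), ..., y_k = M_k(y_{k-1})."""
    y = m
    for M in operations:
        y = M(y)
    return y

m = [[1, 2],
     [3, 4]]
print(D(m, [rotate90, flip_horizontal]))  # [[1, 3], [2, 4]]
```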

3) Weight-based methods

Training a deep learning model is the process of training the weights of each layer of the neural network, and each update of the layer weights is adjusted by backpropagating the error produced at the network output. Weight-based methods for handling data imbalance set a weight for each class: before the error is backpropagated, the error produced by each class is multiplied by that class's weight, and the weighted errors are summed. Backpropagation is then carried out on the resulting error.

Weight-based methods address data imbalance through weighting. The main idea is to apply different penalty strategies to samples of different classes so as to generate a weighted error, such that a class with few samples incurs a large cost when misclassified and a class with many samples incurs a small cost. The difficulty of this approach lies in setting reasonable weights; in practice, the weighted loss values of the classes are generally made approximately equal.
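A minimal sketch of a class-weighted error, assuming cross-entropy as the per-sample error and hypothetical weights that are inversely proportional to class frequency in a 9:1 dataset. Neither choice is specified in the text above.

```python
import math

def weighted_loss(probs, labels, class_weights):
    """Sum of per-sample cross-entropy errors, each multiplied by its class weight."""
    return sum(class_weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))

class_weights = [1.0, 9.0]            # hypothetical: class 1 is 9 times rarer
probs = [[0.5, 0.5], [0.5, 0.5]]      # predicted class probabilities per sample
labels = [0, 1]                       # true classes

loss = weighted_loss(probs, labels, class_weights)
print(loss)
```

With identical predictions, the minority-class sample contributes nine times as much to the error as the majority-class sample.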

Summary of the invention

The main object of the present invention is to propose a deep-learning-based method for solving data imbalance. It is a new method for solving data imbalance: the Epoch-based method for solving data imbalance.

To achieve the above object, the technical solution adopted by the present invention is an Epoch-based method for solving data imbalance. During training, each Epoch randomly resamples every class according to a weight, so that the samples in each Epoch of training are evenly represented. A resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample pool in proportion to the weights, so that the resampled Epoch data is relatively balanced.

Suppose the full dataset used for classification contains $N$ samples belonging to $M$ classes. Let $X = (x_1, x_2, \ldots, x_n)$ be all $N$ training samples and $Y = (y_1, y_2, \ldots, y_m)$ be all $M$ classes, where $n \in \mathbb{N}^+$ and $m \in \mathbb{N}^+$ are positive integers. If the number of samples of one class in the dataset is 10 or more times the number of samples of the other classes, the dataset is called an imbalanced dataset. Training a deep learning model on an imbalanced dataset easily produces unreliable training results.
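The definition of an imbalanced dataset can be checked directly from the class counts. The sketch below reads "10 or more times" as a comparison between the largest and the smallest class, which is one possible interpretation; the function name and the `ratio` parameter are illustrative.

```python
from collections import Counter

def is_imbalanced(labels, ratio=10):
    """True if the largest class has at least `ratio` times the samples of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) >= ratio * min(counts.values())

print(is_imbalanced([0] * 7200 + [1] * 120))  # True
print(is_imbalanced([0] * 50 + [1] * 10))     # False
```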

First, the Epoch-based method for solving data imbalance resamples the imbalanced dataset according to the set weights, yielding a relatively balanced sample set of the same size as the full sample set. The procedure of the method is as follows:

According to the initial weight W of resampling_initAnd the final weight W of resampling_endCalculate the progress next time of resampling The weight W of setting_i.Specific calculation formula is as follows：

W_i=r^i-1W_init+(1-r^i-1)W_end (1)

Wherein, W_init=(w₁,w₂,...,w_m) be resampling initial weight, W_end=(w₁,w₂,...,w_m) it is resampling Final weight.w_iFor the weight of the i-th category setting, i ∈ [1, M].The weighted value set more than sample size is relatively small, sample number It is relatively large to measure few weighted value.R is the paces size that each iteration is calculated, and i is iteration Epoch number.
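Formula (1) can be sketched as a small function. The weight values and r = 0.5 are hypothetical; the sketch assumes 0 < r < 1, so that the weights move from W_init toward W_end as the Epoch count grows.

```python
def epoch_weights(w_init, w_end, r, i):
    """Formula (1): W_i = r^(i-1) * W_init + (1 - r^(i-1)) * W_end, for Epoch i >= 1."""
    decay = r ** (i - 1)
    return [decay * a + (1 - decay) * b for a, b in zip(w_init, w_end)]

w_init = [1.0, 60.0]  # hypothetical initial class weights
w_end = [1.0, 1.0]    # hypothetical final class weights
for i in (1, 2, 3):
    print(i, epoch_weights(w_init, w_end, r=0.5, i=i))
# 1 [1.0, 60.0]
# 2 [1.0, 30.5]
# 3 [1.0, 15.75]
```

At i = 1 the weights equal W_init, and each further Epoch halves the remaining distance to W_end for this choice of r.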

The weight of each sample is calculated as follows:

$$\mathrm{weight}_i = \frac{\sum_{m=1}^{M} 1\{i = m\}\, W_m}{\sum_{n=1}^{N} \sum_{m=1}^{M} 1\{l_n = m\}\, W_m} \qquad (2)$$

where $i$ is the class of the sample in question, $M$ is the number of classes, $N$ is the number of samples in the Epoch, $l_n$ is the class label of the $n$-th sample, and $1\{l_n = m\}$ equals 1 if the class label of the sample equals class $m$ and 0 otherwise. $W_m$ is the weight of class $m$, and $\mathrm{weight}_i$ is the weight of the $i$-th sample.

According to weight formula (2) above, every sample in the Epoch is assigned a weight by class. All weights are normalized so that they sum to 1. Finally, Epoch-size data is randomly resampled according to the weights.

The specific algorithm for assigning weights by class and resampling the Epoch samples is shown in Fig. 1.
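The weight assignment and resampling step can be sketched as follows. This is a sketch under stated assumptions, not the algorithm of the figure: sampling is assumed to be with replacement, the class weights are hypothetical, and `resample_epoch` is an illustrative name.

```python
import random
from collections import Counter

def resample_epoch(labels, class_weights, epoch_size, rng):
    """Draw `epoch_size` sample indices with probabilities given by formula (2)."""
    raw = [class_weights[l] for l in labels]  # numerator: W_m of each sample's class
    total = sum(raw)                          # denominator: sum over all N samples
    weights = [w / total for w in raw]        # normalized weights, summing to 1
    return rng.choices(range(len(labels)), weights=weights, k=epoch_size)

labels = [0] * 7200 + [1] * 120    # the 60:1 split of the experimental dataset
class_weights = {0: 1.0, 1: 60.0}  # hypothetical class weights
rng = random.Random(0)

indices = resample_epoch(labels, class_weights, len(labels), rng)
print(Counter(labels[i] for i in indices))
```

With these weights each class carries half of the total probability mass, so the resampled Epoch contains roughly equal numbers of samples from both classes.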

Then the roughly balanced dataset produced by the Epoch-based method is fed into the deep learning neural network for training. Once all of the Epoch data has taken part in training the network, the Epoch iteration count $i$ is incremented by 1, the weights for the next round are adjusted according to the initial weights, the final weights, and the iteration count, and another Epoch of data is resampled to continue training the network, until training becomes stable.
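The overall loop can be sketched as below. A fixed number of Epochs stands in for the "until training becomes stable" criterion, and `train_one_epoch` is a hypothetical callback that replaces the actual network training; all numeric values are illustrative.

```python
import random

def train_with_epoch_resampling(labels, w_init, w_end, r, num_epochs,
                                train_one_epoch, rng):
    """Each Epoch: update class weights by formula (1), resample, then train."""
    for i in range(1, num_epochs + 1):
        decay = r ** (i - 1)
        class_w = [decay * a + (1 - decay) * b for a, b in zip(w_init, w_end)]
        sample_w = [class_w[l] for l in labels]
        # random.choices accepts relative weights, so no explicit normalization is needed.
        indices = rng.choices(range(len(labels)), weights=sample_w, k=len(labels))
        train_one_epoch(i, indices)

labels = [0] * 600 + [1] * 10
history = []

def record_class_share(epoch, indices):
    """Stand-in for real training: record the share of class 1 in this Epoch."""
    history.append(sum(labels[j] for j in indices) / len(indices))

train_with_epoch_resampling(labels, [1.0, 60.0], [1.0, 30.0], 0.5, 3,
                            record_class_share, random.Random(0))
print(history)
```

In a real setting the callback would run forward and backward passes over the resampled indices, and the loop would stop once the training error stops changing.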

Compared with the prior art, the present invention has the following advantages.

1. A deep-learning-based method for solving data imbalance. High-quality data is key to machine learning and deep learning. A shortage of data hinders the development of a model, whereas a model trained on high-quality data tends to be more robust, and a good dataset can even make training simpler and faster. Many methods now exist for solving data imbalance: sampling, data augmentation, and weight-based methods. The Epoch-based method for solving data imbalance proposed by the present invention can solve the data imbalance problem more effectively.

2. The present invention proposes an Epoch-based method for solving data imbalance. During training, each Epoch randomly resamples every class according to a weight, so that the samples in each Epoch are evenly represented. The main idea is to assign a resampling weight to each sample and then randomly resample a sample set of one Epoch in size from the sample pool in proportion to the weights, so that the resampled Epoch data is relatively balanced.

Brief description of the drawings

Fig. 1: Flow chart of the Epoch-based method for solving data imbalance;

Fig. 2: Epoch-based resampling method;

Fig. 3: Training error comparison curves;

Fig. 4: Training accuracy comparison curves;

Fig. 5: Test error comparison curves;

Fig. 6: Test accuracy comparison curves.

Detailed description of the embodiments

To make the purpose, technical solution, and features of the present invention clearer, the invention is described in further detail below with reference to a specific embodiment and the accompanying drawings. The flow chart of the Epoch-based method for solving data imbalance is shown in Fig. 1.

The steps are explained as follows:

1) The full training data is fed into the Epoch-based method. According to the initialized class weights, a sample set of the same size as the full sample set is randomly selected, yielding a relatively balanced resampled sample set, which is fed into the deep learning neural network for training.

2) After the neural network has been trained on a complete Epoch of data, the class weights are adjusted, the Epoch-based method is applied again, and training continues. Over many iterations the class weights approach the final weights that were set, so that the training error of the neural network becomes stable and reaches a global minimum.

Experimental environment

This section uses comparative experiments to verify the practical effect of the proposed Epoch-based method for solving data imbalance. The experimental environment is an Ubuntu 14.04 host with 8 GB of memory and a 1 TB hard disk. The experimental data is a manually generated imbalanced dataset whose class ratio reaches 1:60:

Class name	Quantity	Class name	Quantity
Class one	120	Class two	7200

The experiment tests the effect of several methods for solving data imbalance on the same imbalanced dataset. The methods compared are: the original data (unmodified), data augmentation, the Epoch-based method, and the Epoch-based method combined with data augmentation.

A combination of data augmentation techniques is applied to the images. Augmenting the small-sample class expands the scale of the training data and addresses the overfitting and incomplete training that imbalanced training data may cause. The data augmentation combination applied to the samples is as follows:

Attribute	Setting
rotation_range	0.1
width_shift	0.1
height_shift	0.1
rescale	1/255
shear_range	0.1
zoom_range	0.1
horizontal_flip	True
fill_mode	nearest

Here, rotation_range is the rotation range, width_shift is translation along the horizontal direction, height_shift is translation along the vertical direction, rescale enlarges or reduces the image values by the specified scale factor, shear_range is a horizontal or vertical shear (projection) transformation, zoom_range randomly zooms the image size by the given proportion, horizontal_flip flips the image horizontally, and fill_mode is the way pixels are filled in after rotation or translation.
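The attribute names resemble the arguments of an image augmentation generator such as the one in Keras, but the text does not name the library, so the sketch below avoids any library call. It applies only two of the listed settings, rescale and horizontal_flip, to a hypothetical 2x2 grayscale image, and it assumes the flip is applied with probability 0.5.

```python
import random

# Two settings copied from the table; the remaining ones are not applied in this sketch.
config = {"rescale": 1 / 255, "horizontal_flip": True}

def augment(image, config, rng):
    """Rescale pixel values and, with probability 0.5, mirror the image horizontally."""
    out = [[pixel * config["rescale"] for pixel in row] for row in image]
    if config["horizontal_flip"] and rng.random() < 0.5:
        out = [row[::-1] for row in out]
    return out

image = [[0, 255], [51, 102]]  # hypothetical 2x2 grayscale image
print(augment(image, config, random.Random(0)))
```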

The comparison curves of the several methods for solving data imbalance are shown in Fig. 3 for train_loss, in Fig. 4 for train_acc, in Fig. 5 for val_loss, and in Fig. 6 for val_acc.

train_loss is the training error obtained after each Epoch of training data, train_acc is the training accuracy obtained after each Epoch, val_loss is the test error obtained after each Epoch, and val_acc is the test accuracy obtained after each Epoch.

The four comparison curves above show that the proposed Epoch-based method, compared with the other conventional methods, greatly improves results on classification problems with imbalanced data samples.

Claims

1. An Epoch-based method for solving data imbalance, in which, during training, each Epoch randomly resamples every class according to a weight, so that the samples in each Epoch of training are evenly represented; a resampling weight is assigned to each sample, and a sample set of one Epoch in size is then randomly resampled from the sample pool in proportion to the weights, so that the resampled Epoch data is relatively balanced;

suppose the full dataset used for classification contains $N$ samples belonging to $M$ classes; let $X = (x_1, x_2, \ldots, x_n)$ be all $N$ training samples and $Y = (y_1, y_2, \ldots, y_m)$ be all $M$ classes; $n \in \mathbb{N}^+$ and $m \in \mathbb{N}^+$ are positive integers; if the number of samples of one class in the dataset is 10 or more times the number of samples of the other classes, the dataset is called an imbalanced dataset; training a deep learning model on an imbalanced dataset easily produces unreliable training results; characterized in that:

first, the Epoch-based method for solving data imbalance resamples the imbalanced dataset according to the set weights, yielding a relatively balanced sample set of the same size as the full sample set; the procedure of the method is as follows:

According to the initial weight W of resampling_initAnd the final weight W of resampling_endCalculate being set next time for resampling Weight W_i；Specific calculation formula is as follows：

W_i=r^i-1W_init+(1-r^i-1)W_end (1)

Wherein, W_init=(w₁,w₂,...,w_m) be resampling initial weight, W_end=(w₁,w₂,...,w_m) for resampling most Whole weight；w_iFor the weight of the i-th category setting, i ∈ [1, M]；The weighted value set more than sample size is relatively small, and sample size is few Weighted value it is relatively large；R is the paces size that each iteration is calculated, and i is iteration Epoch number；

$$\mathrm{weight}_i = \frac{\sum_{m=1}^{M} 1\{i = m\}\, W_m}{\sum_{n=1}^{N} \sum_{m=1}^{M} 1\{l_n = m\}\, W_m} \qquad (2)$$

where $i$ is the class of the sample in question, $M$ is the number of classes, $N$ is the number of samples in the Epoch, $l_n$ is the class label of the $n$-th sample, and $1\{l_n = m\}$ equals 1 if the class label of the sample equals class $m$ and 0 otherwise; $W_m$ is the weight of class $m$, and $\mathrm{weight}_i$ is the weight of the $i$-th sample;

according to weight formula (2) above, every sample in the Epoch is assigned a weight by class; all weights are normalized so that they sum to 1; finally, Epoch-size data is randomly resampled according to the weights;

then the roughly balanced dataset produced by the Epoch-based method is fed into the deep learning neural network for training; once all of the Epoch data has taken part in training the network, the Epoch iteration count $i$ is incremented by 1, the weights for the next round are adjusted according to the initial weights, the final weights, and the iteration count, and another Epoch of data is resampled to continue training the network, until training becomes stable.