CN109919301A - Deep neural network batch optimization method based on InfoMax criterion - Google Patents
Info
- Publication number
- CN109919301A (application number CN201910141284.9A)
- Authority
- CN
- China
- Prior art keywords
- batch
- random signal
- power
- random
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a deep neural network batch optimization method based on the InfoMax criterion. First, the principle of injecting a random signal is analyzed from a dynamics perspective to clarify the role the random signal plays in the deep neural network; on this basis, a random signal is injected into the input samples in units of batches. Second, the total random signal power is adjusted per batch: as the number of iterations increases, the total random signal power on each batch is reduced, helping the algorithm escape saddle points and converge to a minimum. Finally, at each iteration, the random signal power is reasonably allocated over the batch based on the InfoMax criterion, so that the finally obtained model achieves balanced performance and good recognition results.
Description
Technical Field
The invention belongs to the technical field of deep neural network optimization problems, and particularly relates to a deep neural network batch optimization method based on an information transmission maximization criterion.
Background
The deep neural network is the basis of deep learning. By performing a series of nonlinear operations on input data, it extracts representative features from the input samples and learns the internal structure and regularities of the data; it is widely applied to tasks such as image classification and image recognition and shows good performance. In deep neural networks, the loss function is typically non-convex, so a large number of saddle points are encountered during training, and as the number of network layers increases, the number of saddle points also increases. Therefore, common optimization algorithms easily fall into saddle points, and the resulting trained models perform poorly. To solve this problem, a great deal of current work studies how to escape saddle points so that the algorithm converges to a minimum. Methods for escaping saddle points based on Hessian information use the geometric information about the saddle point provided by the second-order information of the loss function to find an escape direction and train along that direction, thereby converging to a minimum. However, such methods are computationally expensive and complex in high-dimensional spaces. Another class of methods escapes saddle points using random signals: exploiting the unstable nature of saddle points, a random signal is added in each direction of the gradient descent so that, when the algorithm falls into a saddle point, it can continue to update the parameters under the push of the random signal and thereby escape. Although this class of methods is simple and effective, the performance of the trained models still needs to be improved.
Disclosure of Invention
The invention provides a deep neural network batch optimization method based on using random signals to escape saddle points. The principle of injecting random signals is analyzed from a dynamics perspective to determine the role the random signal plays in the deep neural network; on this basis, random signals are injected into the input samples in units of batches, and the total random signal power on a batch is reduced as the number of iterations increases, which ultimately helps the algorithm escape saddle points. Meanwhile, in order to further optimize the deep neural network and balance the performance of the trained model, the invention adjusts the random signal power within a batch based on the information transmission maximization criterion, so that the information in all kinds of input samples is fully utilized during training. The specific idea of the method is as follows: first, taking the optimization algorithms used in deep neural networks, namely the Gradient Descent (GD) algorithm and the batch Stochastic Gradient Descent (SGD) algorithm, as examples, the dynamics principle of injecting random signals is explained; then, based on this analysis, the input samples are randomly divided into several batches and a random signal is injected into each sample in units of batches; next, the total power of the random signals on a batch is reduced as the number of iterations increases; finally, at each iteration, the random signal power over the batch is allocated based on the information transmission maximization criterion.
For the convenience of describing the present invention, the following terms are first defined:
Definition 1: the dynamics principle of injecting random signals
First, the loss function of the deep neural network is expressed by the following formula

$$F(\omega) = \frac{1}{N}\sum_{i=1}^{N} f(x_i, \omega) \qquad (1)$$

where $F(\omega)$ represents the loss function over all input samples and is assumed to be continuous and twice differentiable, $F: \mathbb{R}^d \to \mathbb{R}$, with $d$ denoting the parameter dimension, a finite positive integer. $f(x_i, \omega)$ represents the loss corresponding to the ith input sample and is likewise assumed to be continuous and twice differentiable, $f: \mathbb{R}^d \to \mathbb{R}$. $\omega$ denotes the network parameters, $x_i$ represents the ith input sample, $i \in [1, N]$, and $N$ denotes the total number of training samples, a finite positive integer.
Assuming that the total number of iterations is T, the GD algorithm updates the parameters at iteration t as follows

$$\omega_{t+1} = \omega_t - \varepsilon \nabla F(\omega_t) \qquad (2)$$

where $\nabla F(\omega_t)$ is the gradient of the loss function $F(\omega)$ at the point $\omega_t$, i.e. the limit of the ratio of the increment of the function value to the increment of the argument as the argument increment approaches 0. $\varepsilon$ represents the learning rate, $0 < \varepsilon < 1$, and $t$ represents the current iteration, $0 < t < T$.
In the training process, the learning rate $\varepsilon$ is much smaller than 1 but not 0. Rearranging formula (2) gives

$$\frac{\omega_{t+1} - \omega_t}{\varepsilon} = -\nabla F(\omega_t) \qquad (3)$$

Assuming that the parameter ω is a function of time t and that the left-hand side of equation (3) approximates the derivative of ω, an ordinary differential dynamical system can be derived, expressed by the following ordinary differential equation

$$\frac{d\omega}{dt} = -\nabla F(\omega) \qquad (4)$$

where $d\omega/dt$ denotes the derivative of ω with respect to t, $d\omega$ is the infinitesimal increment of ω, and $dt$ is the infinitesimal increment of t.

Formula (3) can then be regarded as the Euler method applied to formula (4); that is, the parameter update of the GD algorithm actually solves the dynamical system (4) with the Euler method. Because (4) is an ordinary differential dynamical system, the solution for ω is a deterministic trajectory, and ω eventually converges to a stationary point $\omega^*$ with $\nabla F(\omega^*) = 0$. This point may be a minimum, but it may also be a saddle point. Since the deep neural network has a large number of saddle points in the training process, the GD algorithm, which relies only on first-order gradient information for parameter updates, very easily falls into a saddle point.
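As an illustration (not part of the patent) of why the Euler iteration (2) that implements GD can settle at a saddle point, the following sketch runs GD on a toy two-dimensional loss with a saddle at the origin, starting on the saddle's stable manifold:

```python
import numpy as np

def grad_F(w):
    # toy loss F(w) = w[0]**2 - w[1]**2 with a saddle point at the origin
    return np.array([2.0 * w[0], -2.0 * w[1]])

eps = 0.1                       # learning rate epsilon
w = np.array([1.0, 0.0])        # initialized on the stable manifold of the saddle (w[1] = 0)
for t in range(200):            # Euler discretization of dw/dt = -grad F(w), i.e. plain GD
    w = w - eps * grad_F(w)
print(w)                        # converges to the saddle point [0, 0], not a minimum
```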
The batch SGD algorithm differs from the GD algorithm in that, during gradient descent, the gradient over a batch $\nabla F_B(\omega)$ is used in place of the overall gradient $\nabla F(\omega)$ of the GD algorithm. The parameter update of the batch SGD algorithm at iteration t is as follows

$$\omega_{t+1} = \omega_t - \varepsilon \nabla F_B(\omega_t) \qquad (5)$$

where $\nabla F_B(\omega_t)$ is the gradient of the batch loss $F_B(\omega)$ at the point $\omega_t$, i.e. the limit of the ratio of the increment of the function value to the increment of the argument as the argument increment approaches 0, and B represents the size of the batch, a finite positive integer.
Therefore, compared with the GD algorithm, the batch SGD algorithm introduces a certain random signal at each iteration. Denoting the random signal introduced at iteration t by $\xi_t$, we have

$$\nabla F_B(\omega_t) = \nabla F(\omega_t) + \xi_t \qquad (6)$$

where $\xi_t$ is a random signal with mean 0 and variance $\sigma^2$. Since the mean of $\xi_t$ is 0, the variance $\sigma^2$ of $\xi_t$ is equal to the power of $\xi_t$.
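As a concrete numerical illustration of equation (6) (a sketch with made-up names, not the patent's code), the following computes a full gradient and a minibatch gradient for a toy per-sample loss and treats their difference as the injected random signal ξ_t:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, d = 1000, 32, 5
X = rng.normal(size=(N, d))          # input samples x_i
omega = rng.normal(size=d)           # network parameters (toy linear model)

def per_sample_grad(x, w):
    # gradient of f(x_i, w) = 0.5 * (x_i . w)**2 with respect to w
    return (x @ w) * x

full_grad = np.mean([per_sample_grad(x, omega) for x in X], axis=0)

batch_idx = rng.choice(N, size=B, replace=False)
batch_grad = np.mean([per_sample_grad(X[i], omega) for i in batch_idx], axis=0)

xi_t = batch_grad - full_grad        # the random signal introduced by the batch, eq. (6)
print("empirical power of xi_t:", np.mean(xi_t ** 2))
```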
At this time, after the random signal is added, the original ordinary differential dynamical system (4) becomes a stochastic dynamical system, expressed by the following Langevin equation

$$\frac{d\omega}{dt} = -\nabla F(\omega) + \xi(t) \qquad (7)$$

where ξ(t) is a Gaussian process.
According to the solution method for the Langevin equation, the solution for ω in the equilibrium state can be obtained as

$$P(\omega) \propto \exp\left(-\frac{F(\omega)}{\sigma^2}\right) \qquad (8)$$

It can be seen from equation (8) that the parameter ω is no longer a single deterministic value but follows a probability distribution that depends on the random signal power $\sigma^2$; the minimum point of the original F(ω) now becomes the maximum point of P(ω). The larger $\sigma^2$ is, the flatter P(ω) becomes and the more ω may "run" around, so the algorithm traverses more regions; the smaller $\sigma^2$ is, the "sharper" P(ω) becomes and the more prominent the region around the maximum point of P(ω), so once ω falls into this region it does not "run out" easily and gradually converges to a minimum.
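The qualitative effect of σ² described above can be checked numerically. The sketch below (an illustration under the assumed form of equation (8), not part of the patent) evaluates P(ω) ∝ exp(−F(ω)/σ²) for a one-dimensional double-well loss at two noise powers and shows that a larger σ² flattens the distribution:

```python
import numpy as np

omega = np.linspace(-2.5, 2.5, 1001)
F = (omega ** 2 - 1.0) ** 2             # double-well loss: minima at +/-1, barrier at 0

def stationary_density(F, sigma2):
    p = np.exp(-F / sigma2)             # assumed equilibrium form P(omega) ~ exp(-F/sigma^2)
    return p / np.trapz(p, omega)       # normalize so the density integrates to 1

for sigma2 in (0.05, 1.0):
    p = stationary_density(F, sigma2)
    print(f"sigma^2={sigma2}: peak density={p.max():.3f}, density at omega=0: {p[len(p)//2]:.3f}")
# larger sigma^2 -> flatter P(omega): the peak is lower and the barrier region keeps more mass
```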
Definition 2: method of injecting random signals in batches
In the batch SGD algorithm, the magnitude of the random signal power depends on the batch size and the learning rate: the smaller the batch and the larger the learning rate, the larger the random signal power; the larger the batch and the smaller the learning rate, the smaller the random signal power. The study by Smith et al. shows that slowly increasing the batch size as the number of iterations increases, or reducing the learning rate, reduces the random signal power and helps the algorithm converge to a minimum. The reason is that in the early stage of iteration, a somewhat smaller batch and a higher learning rate give a higher random signal power, which helps the algorithm traverse more regions; as the number of iterations increases, the algorithm slowly approaches the region of the minimum, and reducing the random signal power at that point prevents the step from being so large that it overshoots the minimum.
However, increasing the batch size places higher demands on computer hardware and increases the training cost, and reducing the learning rate slows the convergence of the algorithm. Therefore, in order to escape saddle points while saving cost and keeping the training speed as high as possible, the present invention does not change the batch size or the learning rate of the deep neural network during training; instead, random signals are added directly to the input samples, and the power of the random signals is reduced as the number of iterations increases, so that the algorithm finally converges to a minimum. Specifically:
1) dividing all N input samples into K batches, wherein the size of each batch is B;
2) a random signal is added to the input samples in units of batches. The relationship between each input sample and the random signal on the kth batch (k ∈ [1, K]) is as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad i \in [1, B] \qquad (9)$$

where $x_i$ represents the ith input sample on the kth batch, $\tilde{x}_i$ is the newly generated sample, $\xi_i \sim N(0, 1)$, and $\sigma_i^2$ represents the power of the random signal allocated to that sample;
3) the total power of the random signal on each batch, $\sigma_B^2 = \sum_{i=1}^{B} \sigma_i^2$, is reduced as the number of iterations increases (a sketch of steps 1) to 3) follows this list).
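A minimal sketch of steps 1) to 3) is given below; the linear decay schedule for the total power and the equal per-sample split are assumptions made for illustration (the patent does not fix a particular schedule, and Definition 3 below replaces the equal split with the InfoMax allocation):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batches(X, B):
    """Step 1): randomly split all N samples into K batches of size B."""
    idx = rng.permutation(len(X))
    return [X[idx[k:k + B]] for k in range(0, len(X) - B + 1, B)]

def inject_noise(batch, sigma2_per_sample):
    """Step 2): x_tilde_i = x_i + sigma_i * xi_i with xi_i ~ N(0, 1)."""
    xi = rng.standard_normal(batch.shape)
    return batch + np.sqrt(sigma2_per_sample)[:, None] * xi

def total_power_schedule(t, T, sigma2_max=1.0, sigma2_min=0.01):
    """Step 3): total power sigma_B^2 on a batch decays as iteration t grows (assumed linear decay)."""
    return sigma2_max + (sigma2_min - sigma2_max) * t / T

# usage: equal per-sample allocation here; Definition 3 replaces this with the InfoMax allocation
X = rng.normal(size=(1000, 5))
B, T = 32, 30
batches = make_batches(X, B)
for t, batch in enumerate(batches[:T]):
    sigma_B2 = total_power_schedule(t, T)
    sigma2_per_sample = np.full(B, sigma_B2 / B)
    noisy_batch = inject_noise(batch, sigma2_per_sample)
```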
Definition 3: method of adjusting random signal power based on the information transmission maximization criterion
Random signals are injected in batches and the total power of the random signals on the batches is reduced as the number of iterations increases. However, the total random signal power over the batch is constant at each iteration, thus involving the problem of random signal power allocation. In order to fully utilize information contained in various samples in the network training process and enable the model performance to reach balance, the random signal power on a batch is reasonably distributed based on the information transmission maximization criterion. Specifically, an objective function based on the information transmission maximization criterion is provided, and then the function is solved to obtain the method for distributing the random signal power in batches.
First, assume that after the random signal is introduced, the total amount of information in the deep neural network is $C = \sum_{k=1}^{K} c_k$, where $c_k$ is the amount of information on each batch and C equals the sum of the amounts of information over all batches. Thus, maximizing the total amount of information is equivalent to maximizing the amount of information on each batch, $\max c_k$.
In the water-filling algorithm, when the total signal power is fixed, reasonably allocating the signal power maximizes the channel capacity. Inspired by this idea, the following objective function based on information transmission maximization is proposed

$$\max_{\{\sigma_i^2\}} c_k = \sum_{i=1}^{B} \log\left(1 + \frac{\sigma_i^2}{r_i}\right) + \lambda\left(\sigma_B^2 - \sum_{i=1}^{B} \sigma_i^2\right) \qquad (10)$$

where $\sigma_i^2$ is the random signal power assigned to the ith input sample and λ is a Lagrangian constant. $p_i$ represents the signal power, $n_i$ represents the noise power, and $p_i + n_i = r_i$. Here $r_i$ denotes the power of the ith input sample; because a batch optimization strategy is adopted in this method, a certain amount of noise exists in each gradient update, so the sample power $r_i$ is regarded as being composed of two parts, the signal power $p_i$ (without batch optimization, $r_i = p_i$) and the noise power $n_i$.
The meaning of the objective function here is: at each iteration, the total power $\sigma_B^2$ of the random signal injected on the batch is fixed, and reasonably allocating the random signal powers $\sigma_i^2$ maximizes the information transmission.
To maximize $c_k$, let $\partial c_k / \partial \sigma_i^2 = 0$; after simplification, the method for reasonably allocating the random signal power on a batch is obtained as follows

$$\sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+ \qquad (11)$$

where $(a)^+$ denotes max(a, 0). Since $\sigma_i^2 \ge 0$, when $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i$ is negative, $\sigma_i^2$ is set to 0, i.e., no random signal power is allocated to the ith input sample.
The meaning of formula (11) is: when the power $r_i$ of an input sample is larger, a smaller random signal power $\sigma_i^2$, or even no power at all, is allocated to it so that the information already contained in such samples can be fully utilized; when $r_i$ is smaller, a larger $\sigma_i^2$ is allocated so that the network pays more attention to such input samples during training and the performance of the finally trained model is balanced.
Therefore, the final method of this invention for adding random signals to the input samples and allocating the random signal power on a batch at each iteration is as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad \sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+ \qquad (12)$$

where $\tilde{x}_i$ represents the new sample generated after adding the random signal to the input sample $x_i$.
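A sketch of the allocation rule (11), written as a single water-filling step under the assumption that the per-sample power $r_i$ is measured as the mean squared value of the sample, might look as follows:

```python
import numpy as np

def infomax_allocation(batch, sigma_B2):
    """Allocate the total random signal power sigma_B2 over a batch of samples.

    Implements sigma_i^2 = ((sigma_B2 + sum_j r_j) / B - r_i)^+  (formula (11)),
    where r_i is taken here as the mean squared value of sample i (an assumption).
    """
    B = len(batch)
    r = np.mean(batch ** 2, axis=1)               # per-sample power r_i
    water_level = (sigma_B2 + r.sum()) / B
    sigma2 = np.maximum(water_level - r, 0.0)     # (a)^+ = max(a, 0)
    return sigma2

# usage: samples with low power r_i receive more random signal power
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 5)) * np.linspace(0.5, 2.0, 8)[:, None]
print(infomax_allocation(batch, sigma_B2=4.0))
```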
The method comprises the following specific steps:
Step one: dividing all N training samples into K batches, wherein the size of each batch is B;
Step two: random signals are injected into the samples in units of batches. The random signal is injected into each sample on each batch as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad \xi_i \sim N(0,1), \quad i \in [1, B]$$

where the total power of the random signal on each batch is denoted $\sigma_B^2$, with $\sigma_B^2 = \sum_{i=1}^{B} \sigma_i^2$;
Step three: from the whole iterative process, the algorithm is far away from the minimum value point in the initial training period of the model, and the algorithm is carried out on each batchLarger, helping the algorithm to escape from the saddle point, and to go quickly towards the minimum point; as the number of training increases, the algorithm approaches a minimum value gradually, at which point it is setSmaller, preventIf the value is too large, the value is over the minimum value, and the next saddle point is sunk;
Step four: at each iteration, $\sigma_B^2$ is held fixed and the random signal power corresponding to each input sample on the batch is allocated based on the information transmission maximization criterion, specifically

$$\sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+$$

If $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i < 0$, the random signal power allocated to sample $x_i$ is 0 (a sketch combining steps one to four follows).
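Putting steps one to four together, a hedged end-to-end sketch of one pass over the data is given below; the toy model, the loss gradient, the decay schedule, and the way $r_i$ is measured are all illustrative assumptions, not prescribed by the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_with_infomax_noise(X, y, grad_fn, omega, B, T, eps=0.1,
                             sigma2_max=1.0, sigma2_min=0.01):
    """One pass over the data following steps one to four (a sketch, not the definitive method)."""
    # Step one: split the N samples into K batches of size B
    idx = rng.permutation(len(X))
    batches = [idx[k:k + B] for k in range(0, len(X) - B + 1, B)]
    for t, b in enumerate(batches[:T]):
        # Step three: total power on the batch shrinks as iterations grow (assumed linear decay)
        sigma_B2 = sigma2_max + (sigma2_min - sigma2_max) * t / T
        # Step four: InfoMax (water-filling) allocation of sigma_B2 over the batch
        xb, yb = X[b], y[b]
        r = np.mean(xb ** 2, axis=1)
        sigma2 = np.maximum((sigma_B2 + r.sum()) / len(b) - r, 0.0)
        # Step two: inject the random signal into the input samples
        xb_noisy = xb + np.sqrt(sigma2)[:, None] * rng.standard_normal(xb.shape)
        # ordinary batch SGD update on the noisy batch
        omega = omega - eps * grad_fn(xb_noisy, yb, omega)
    return omega

# usage with a toy linear regression model
def linreg_grad(xb, yb, w):
    return xb.T @ (xb @ w - yb) / len(xb)

X = rng.normal(size=(1000, 5)); w_true = rng.normal(size=5); y = X @ w_true
omega = train_with_infomax_noise(X, y, linreg_grad, np.zeros(5), B=32, T=30)
```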
The invention has the following advantages: aiming at the problem that deep neural networks fall into saddle points during training, a batch optimization algorithm based on information transmission maximization is provided, in which random signals are injected into the input samples and the total random signal power on a batch is reduced as the number of iterations increases, helping the algorithm converge quickly to a minimum. Meanwhile, in order to make full use of the information in all kinds of input samples, the random signal power is allocated over the batch based on the information transmission maximization criterion, so that the finally trained model has good, balanced performance.
Drawings
Fig. 1 is a diagram illustrating the allocation of random signal power over batches based on information transfer maximization criteria, assuming that each batch size is equal to 8.
Detailed Description
The following detailed description is further illustrative of the methods and techniques provided by the present invention and should not be construed as limiting the invention.
Taking the case that the size of each batch is equal to 8 as an example, the implementation flow of the invention is shown in fig. 1, and the specific implementation steps are as follows:
Step one: dividing all N training samples into K batches, wherein the size of each batch is 8;
Step two: random signals are injected into the samples in units of batches. The random signal is injected into each sample on each batch as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad \xi_i \sim N(0,1), \quad i \in [1, 8]$$

where the total power of the random signal on each batch is denoted $\sigma_B^2$, with $\sigma_B^2 = \sum_{i=1}^{8} \sigma_i^2$;
Step three: from the whole iterative process, the algorithm is far away from the minimum value point in the initial training period of the model, and the algorithm is carried out on each batchLarger, helping the algorithm to escape from the saddle point, and to go quickly towards the minimum point; as the number of training increases, the algorithm approaches a minimum value gradually, at which point it is setSmaller, preventIf the value is too large, the value is over the minimum value, and the next saddle point is sunk;
Step four: at each iteration, $\sigma_B^2$ is held fixed and the random signal power corresponding to each input sample on the batch is allocated based on the information transmission maximization criterion, specifically

$$\sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+$$

If $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i < 0$, the random signal power allocated to sample $x_i$ is 0 (a numeric illustration follows).
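For the batch size of 8 used in this embodiment (as in Fig. 1), a small numeric illustration of the step-four allocation, with made-up per-sample powers, is:

```python
import numpy as np

# hypothetical per-sample powers r_i for one batch of 8 samples
r = np.array([0.2, 0.5, 0.8, 1.0, 1.3, 1.7, 2.4, 3.1])
sigma_B2 = 4.0                                   # total random signal power on this batch

water_level = (sigma_B2 + r.sum()) / len(r)      # (sigma_B^2 + sum_j r_j) / B
sigma2 = np.maximum(water_level - r, 0.0)        # formula (11): low-power samples get more noise
print(np.round(sigma2, 3))
# samples with r_i above the water level (here the last two) receive no random signal power
```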
The above description of the embodiments is only intended to facilitate the understanding of the method of the invention and its core ideas. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (3)
1. The deep neural network batch optimization method based on the information transmission maximization criterion is characterized by comprising the following steps of:
step one, dividing all N training samples into K batches, wherein the size of each batch is B;
step two, injecting a random signal into the samples in units of batches, wherein on the kth batch (k ∈ [1, K]) the relationship between each input sample and the random signal is:

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad i \in [1, B]$$

wherein $x_i$ represents the ith input sample on the kth batch, $\tilde{x}_i$ is the newly generated sample, $\xi_i \sim N(0, 1)$, and $\sigma_i^2$ represents the power of the random signal;

the total power of the random signal on each batch is denoted $\sigma_B^2$, with $\sigma_B^2 = \sum_{i=1}^{B} \sigma_i^2$;
step three, iterative computation: the total power of the random signal on each batch is reduced as the number of iterations increases, helping the algorithm escape saddle points and converge to a minimum;
step four, at each iteration, $\sigma_B^2$ is held fixed and the random signal power corresponding to each input sample on the batch is allocated based on the information transmission maximization criterion, specifically:

let $\sigma_i^2 = \left(\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i\right)^+$, wherein $(a)^+$ represents max(a, 0), $r_i$ represents the power of the ith input sample, and $\sum_{j=1}^{B} r_j$ represents the total power of the input samples on the batch;

if $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i < 0$, the random signal power allocated to sample $x_i$ is 0.
2. The method of claim 1, wherein in step three, the total power $\sigma_B^2$ of the random signal on each batch is reduced as the number of iterations increases, preventing the update step from crossing the minimum and falling into the next saddle point.
3. The method of claim 1, wherein in step four, the objective function based on information transmission maximization is as follows:

$$\max_{\{\sigma_i^2\}} \sum_{i=1}^{B} \log\left(1 + \frac{\sigma_i^2}{p_i + n_i}\right) + \lambda\left(\sigma_B^2 - \sum_{i=1}^{B} \sigma_i^2\right)$$

wherein λ is a Lagrangian constant, $p_i$ represents the signal power, $n_i$ represents the noise power, and $p_i + n_i = r_i$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910141284.9A CN109919301A (en) | 2019-02-26 | 2019-02-26 | Deep neural network batch optimization method based on InfoMax criterion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910141284.9A CN109919301A (en) | 2019-02-26 | 2019-02-26 | Deep neural network batch optimization method based on InfoMax criterion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109919301A true CN109919301A (en) | 2019-06-21 |
Family
ID=66962418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910141284.9A Pending CN109919301A (en) | 2019-02-26 | 2019-02-26 | Deep neural network batch optimization method based on InfoMax criterion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109919301A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027717A (en) * | 2019-12-11 | 2020-04-17 | 支付宝(杭州)信息技术有限公司 | Model training method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190621 |