CN111860830A - Method, device, terminal and storage medium for dynamically optimizing sample number in model training - Google Patents
- Publication number
- CN111860830A CN111860830A CN202010566690.2A CN202010566690A CN111860830A CN 111860830 A CN111860830 A CN 111860830A CN 202010566690 A CN202010566690 A CN 202010566690A CN 111860830 A CN111860830 A CN 111860830A
- Authority
- CN
- China
- Prior art keywords
- samples
- gradient
- iteration
- model training
- cosine value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention discloses a method, a device, a terminal and a storage medium for dynamically optimizing the number of samples in model training. The method comprises: recording the gradient g_{k-h} of the (k-h)-th iteration; recording the gradient g_k of the k-th iteration; calculating the cosine of the angle between g_{k-h} and g_k; judging whether the calculated cosine value is smaller than a preset value; and, if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration. The method decides whether to adjust the number of samples by computing the cosine similarity of the gradients of two iterations. Dynamically adjusting the number of samples by monitoring the gradient update direction is simple and efficient, and tuning the number of samples according to the gradient improves model training performance, accelerates model convergence, shortens training time and saves resources.
Description
Technical Field
The invention relates to the field of sample number optimization in model training, in particular to a method, a device, a terminal and a storage medium for dynamically optimizing sample number in model training.
Background
Model optimization is one of the most difficult challenges in implementing neural network learning algorithms. Hyper-parameter optimization aims to find the hyper-parameters that optimize the performance of a deep learning algorithm on a validation dataset. Unlike ordinary model parameters, hyper-parameters are not learned during training but are set manually beforehand. A neural network has many hyper-parameters to set, such as the learning rate, the batch_size (the number of samples used in one training step), the number of network layers and the number of neuron nodes.
The setting of the hyper-parameters directly affects model performance, so knowing how to optimize them is important for maximizing it. Commonly used hyper-parameter optimization methods include manual tuning, grid search and random search; at present, manual tuning is the most widely used.
In deep neural networks, tuning the hyper-parameters is an essential skill: the current training state of the model is judged by observing monitoring indicators such as the loss function and accuracy during training, and the hyper-parameters are adjusted in time so that the model is trained more scientifically and resource utilization improves. Different hyper-parameters affect training performance differently. Take the learning rate: when it is too high, the model may fail to converge and the loss function keeps oscillating; when it is too low, the model converges slowly and needs longer training time. Increasing the batch size generally lets the network converge faster, but because of memory constraints an oversized batch may exhaust memory or crash the program.
Prior research has focused on the influence of the learning rate on accelerating model convergence; the influence of the batch_size on training performance has been studied relatively little. Yet increasing the batch_size within a reasonable range also benefits training and performance: 1) it improves memory utilization and the parallelization efficiency of large matrix multiplications; 2) it reduces the number of iterations needed to run one epoch (a pass over the full dataset), speeding up processing of the same data volume; 3) within a certain range, the larger the batch_size, generally the more accurate the determined descent direction and the smaller the training oscillation it causes.
The batch_size thus affects model-training performance, but in the prior art a fixed batch_size, preset from experience, is used throughout training and cannot be adjusted dynamically as needed, which is detrimental to model training.
Disclosure of Invention
In order to solve the problems, the invention provides a method, a device, a terminal and a storage medium for dynamically optimizing the number of samples in model training, wherein the number of samples is dynamically optimized in the training process, and the optimization mode is simple and efficient.
The technical scheme of the invention is as follows: a method for dynamically optimizing sample number in model training is based on small-batch gradient descent and comprises the following steps:
recording the gradient g_{k-h} of the (k-h)-th iteration;
recording the gradient g_k of the k-th iteration;
calculating the cosine of the angle between the gradient g_{k-h} and the gradient g_k;
judging whether the calculated cosine value is smaller than a preset value;
if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration.
Further, increasing the number of samples means increasing the number of samples to twice the number of original samples.
Further, h is 1.
Further, if the calculated cosine value is larger than the preset value, keeping the number of samples unchanged.
The technical scheme of the invention also comprises a device for dynamically optimizing the sample number in model training, which is based on small-batch gradient descent and comprises,
A gradient recording module: recording the gradient g of each iteration;
cosine value calculation module: calculating the cosine of the angle between the gradient g_{k-h} of the (k-h)-th iteration and the gradient g_k of the k-th iteration;
cosine value judging module: judging whether the calculated cosine value is smaller than a preset value;
a sample number optimization module: and if the calculated cosine value is smaller than the preset value, increasing the number of samples during the iteration of the step k + 1.
Further, the sample number optimization module increases the number of samples by twice as much as the original number of samples.
Further, h is 1.
Further, if the calculated cosine similarity is greater than a preset value, the sample number optimization module keeps the sample number unchanged.
The technical scheme of the invention also comprises a terminal, which comprises:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method described above.
The technical solution of the present invention also includes a computer readable storage medium storing a computer program, which when executed by a processor implements the method as described above.
With the method, device, terminal and storage medium for dynamically optimizing the number of samples in model training provided by the invention, the gradient of each iteration is recorded, and the cosine similarity of the gradients of two iterations is computed to decide whether to adjust the number of samples. This way of dynamically adjusting the number of samples by monitoring the gradient update direction is simple and efficient; tuning the number of samples according to the gradient improves model training performance, accelerates model convergence, shortens training time and saves resources.
Drawings
Fig. 1 shows the positional relationship between the a-vector and the b-vector.
FIG. 2 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a second embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings by way of specific examples, which are illustrative of the present invention and are not limited to the following embodiments.
Example one
In a deep neural network, the whole network can be regarded as a complex nonlinear function that serves as a fitting model for the training samples. The value of the loss function (also called the objective function) evaluates how good the hypothesized model is: the smaller the loss value, the better the model fits the training data. Gradient descent is in fact the process of minimizing the loss function, whose negative gradient indicates the direction of steepest descent.
Gradient descent is a common optimization method in machine learning. According to the number of samples used each time, it takes three forms: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent.
Among them, the small batch gradient descent is the most common optimization method, which randomly uses the batch _ size samples for parameter update each time. The batch _ size refers to the number of samples used in one training session, and is hereinafter referred to as the number of samples.
Assuming the batch_size is m and each sample is (x_i, y_i), for a mini-batch the gradient is the average of the per-sample gradients:

g = (1/m) · Σ_{i=1}^{m} ∇_θ L(f(x_i; θ), y_i)

where L is the loss function, f is the model and θ are its parameters.
It should be noted that the gradient g is a vector.
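As a concrete sketch, the averaging above can be written in a few lines of NumPy. The names `minibatch_gradient` and `grad_fn` (a per-sample gradient function) and the uniform sampling scheme are illustrative assumptions; the patent does not fix them:

```python
import numpy as np

def minibatch_gradient(grad_fn, X, y, theta, m, rng):
    """Average the per-sample gradients over a randomly drawn mini-batch of m samples."""
    idx = rng.choice(len(X), size=m, replace=False)   # draw a random batch of m samples
    grads = [grad_fn(theta, X[i], y[i]) for i in idx]  # per-sample gradients
    return np.mean(grads, axis=0)                      # the gradient g is a vector
```

For instance, with a squared-error gradient `2 * (theta @ x - y) * x` for a linear model, the function returns the averaged gradient vector of the sampled batch.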
Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; a cosine value of exactly 1 means the angle is 0 and the two vectors point in the same direction.
As shown in FIG. 1, the angle between the two vectors a and b is θ, and its cosine is cos θ = (a · b) / (|a| |b|). The closer the cosine value is to 1, the closer the angle θ is to 0 degrees, i.e., the more similar the vectors a and b are.
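The cosine formula above is a one-liner in code; this small helper (the function name is ours, not the patent's) computes it for any two vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (a . b) / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Orthogonal vectors give 0, parallel vectors give 1, and opposite vectors give -1.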
The method for dynamically optimizing the number of samples in model training provided by the embodiment is based on small-batch gradient descent.
Within a certain range, the larger the batch_size, generally the more accurate the determined descent direction and the smaller the training oscillation it causes. However, once the batch_size grows beyond a certain point the determined descent direction hardly changes any more, so an excessive batch_size contributes little to training precision and only increases the computational cost of training.
To measure the change of the gradient direction, the cosine similarity of two gradient vectors is used. If the cosine of the angle between two gradients is large, the angle has changed little and the gradient direction fluctuates little, so the batch_size need not be updated. If the cosine of the angle between the two gradients is small, the gradient direction fluctuates strongly, and the batch_size is updated.
As shown in fig. 2, specifically, the method includes the following steps:
S1, recording the gradient g_{k-h} of the (k-h)-th iteration;
S2, recording the gradient g_k of the k-th iteration;
S3, calculating the cosine of the angle between the gradient g_{k-h} and the gradient g_k;
S4, judging whether the calculated cosine value is smaller than a preset value;
S5, if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration.
Here h is an integer satisfying 1 ≤ h < k. Preferably h = 1, i.e., the cosine of the angle between the gradients of two adjacent iterations is calculated, which improves the optimization precision.
In this embodiment, increasing the number of samples means doubling it: when the cosine of the angle between the gradient of the k-th iteration and that of the (k-h)-th iteration is smaller than the preset value, the number of samples of the (k+1)-th iteration is set to twice that of the k-th iteration.
It should be noted that if the calculated cosine value is greater than the preset value, the number of samples is kept unchanged, i.e., the (k+1)-th iteration uses the same number of samples as the k-th.
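Steps S1–S5, together with the doubling and keep-unchanged rules, can be sketched as the following training loop. The threshold value, learning rate and the `grad_fn(theta, batch_size)` signature are illustrative assumptions; the patent only requires comparing the cosine against some preset value:

```python
import numpy as np

def train_dynamic_batch(grad_fn, theta, n_steps, batch_size, lr=0.1, h=1, threshold=0.9):
    """Sketch of the patent's scheme: double batch_size whenever the cosine of the
    angle between gradients h steps apart drops below a preset threshold."""
    history = []                                 # S1/S2: recorded gradients, one per iteration
    for k in range(n_steps):
        g = grad_fn(theta, batch_size)           # gradient of step k on batch_size samples
        if k >= h:
            g_prev = history[k - h]              # gradient of step k - h
            cos = np.dot(g, g_prev) / (np.linalg.norm(g) * np.linalg.norm(g_prev))  # S3
            if cos < threshold:                  # S4: direction fluctuated too much
                batch_size *= 2                  # S5: increase sample number for step k + 1
        history.append(g)
        theta = theta - lr * g                   # plain gradient-descent update
    return theta, batch_size
```

On a smooth quadratic objective the gradient direction is stable, so the batch size never grows; when the gradient direction oscillates, the batch size is doubled repeatedly, which matches the intuition of the scheme.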
Example two
Based on the first embodiment, the present embodiment provides a device for dynamically optimizing sample number in model training, and similarly, the device is based on small-batch gradient descent, and includes the following functional modules.
Gradient recording module 101: recording the gradient g of each iteration;
cosine value calculation module 102: calculating the cosine of the angle between the gradient g_{k-h} of the (k-h)-th iteration and the gradient g_k of the k-th iteration;
cosine value determination module 103: judging whether the calculated cosine value is smaller than a preset value;
sample number optimization module 104: and if the calculated cosine value is smaller than the preset value, increasing the number of samples during the iteration of the step k + 1.
Here h is an integer satisfying 1 ≤ h < k. Preferably h = 1, i.e., the cosine of the angle between the gradients of two adjacent iterations is calculated, which improves the optimization precision.
In this embodiment, increasing the number of samples means doubling it: when the cosine of the angle between the gradient of the k-th iteration and that of the (k-h)-th iteration is smaller than the preset value, the number of samples of the (k+1)-th iteration is set to twice that of the k-th iteration.
It should be noted that if the calculated cosine value is greater than the preset value, the number of samples is kept unchanged, i.e., the (k+1)-th iteration uses the same number of samples as the k-th.
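A minimal sketch of the device's four modules as a single class, with the gradient recording, cosine calculation, judgment and sample-number optimization steps marked in comments. The class name and the threshold of 0.9 are assumptions for illustration:

```python
import numpy as np

class SampleNumberOptimizer:
    """Combines the four modules of the device into one small helper class."""

    def __init__(self, batch_size, h=1, threshold=0.9):
        self.batch_size = batch_size
        self.h = h
        self.threshold = threshold
        self.gradients = []                      # gradient recording module 101

    def record(self, g):
        """Record the gradient of the current iteration."""
        self.gradients.append(np.asarray(g))

    def step(self):
        """Cosine calculation (102), judgment (103) and optimization (104) modules."""
        if len(self.gradients) <= self.h:
            return self.batch_size               # not enough history yet
        g_k = self.gradients[-1]                 # gradient of step k
        g_prev = self.gradients[-1 - self.h]     # gradient of step k - h
        cos = np.dot(g_k, g_prev) / (np.linalg.norm(g_k) * np.linalg.norm(g_prev))
        if cos < self.threshold:
            self.batch_size *= 2                 # double the sample number for step k + 1
        return self.batch_size
```

In a training loop one would call `record(g)` after each backward pass and read the batch size to use next from `step()`.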
EXAMPLE III
The present embodiments provide a terminal that includes a processor and a memory.
The memory stores the processor's execution instructions. It may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disk. When the executable instructions in the memory are executed by the processor, the terminal can perform some or all of the steps in the above method embodiments.
The processor is the control center of the terminal: it connects the parts of the whole electronic terminal via various interfaces and lines, and performs the functions of the electronic terminal and/or processes data by running or executing the software programs and/or modules stored in the memory and calling data stored in the memory. The processor may consist of integrated circuits (ICs), for example a single packaged IC, or several packaged ICs with the same or different functions connected together.
Example four
The present embodiment provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided in the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
The above disclosure is only for the preferred embodiments of the present invention, but the present invention is not limited thereto, and any non-inventive changes that can be made by those skilled in the art and several modifications and amendments made without departing from the principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for dynamically optimizing sample number in model training is based on small-batch gradient descent and is characterized by comprising the following steps of:
recording the gradient g_{k-h} of the (k-h)-th iteration;
recording the gradient g_k of the k-th iteration;
calculating the cosine of the angle between the gradient g_{k-h} and the gradient g_k;
judging whether the calculated cosine value is smaller than a preset value;
if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration.
2. The method for dynamically optimizing the number of samples in model training according to claim 1, wherein increasing the number of samples means increasing the number of samples to twice the number of original samples.
3. The method for dynamically optimizing the number of samples in model training according to claim 1 or 2, wherein h is 1.
4. The method for dynamically optimizing the number of samples in model training according to claim 1 or 2, wherein the number of samples is kept unchanged if the calculated cosine value is greater than a preset value.
5. A device for dynamically optimizing the number of samples in model training is based on small-batch gradient descent and is characterized by comprising,
a gradient recording module: recording the gradient g of each iteration;
cosine value calculation module: calculating the cosine of the angle between the gradient g_{k-h} of the (k-h)-th iteration and the gradient g_k of the k-th iteration;
cosine value judging module: judging whether the calculated cosine value is smaller than a preset value;
a sample number optimization module: and if the calculated cosine value is smaller than the preset value, increasing the number of samples during the iteration of the step k + 1.
6. The apparatus for dynamically optimizing the number of samples in model training according to claim 5, wherein the sample number optimizing module increases the number of samples to twice the number of original samples.
7. The device for dynamically optimizing the sample number in model training according to claim 5 or 6, wherein h is 1.
8. The apparatus for dynamically optimizing the number of samples in model training according to claim 5 or 6, wherein the sample number optimizing module keeps the number of samples unchanged if the calculated cosine similarity is greater than a preset value.
9. A terminal, comprising:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method of any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010566690.2A CN111860830A (en) | 2020-06-19 | 2020-06-19 | Method, device, terminal and storage medium for dynamically optimizing sample number in model training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111860830A true CN111860830A (en) | 2020-10-30 |
Family
ID=72986950
Legal Events

Date | Code | Title |
---|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
2020-10-30 | WW01 | Invention patent application withdrawn after publication |