CN111860830A - Method, device, terminal and storage medium for dynamically optimizing sample number in model training - Google Patents
- Publication number
- CN111860830A CN111860830A CN202010566690.2A CN202010566690A CN111860830A CN 111860830 A CN111860830 A CN 111860830A CN 202010566690 A CN202010566690 A CN 202010566690A CN 111860830 A CN111860830 A CN 111860830A
- Authority
- CN
- China
- Prior art keywords
- samples
- gradient
- iteration
- model training
- cosine value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention discloses a method, a device, a terminal and a storage medium for dynamically optimizing the number of samples in model training. The method comprises: recording the gradient g_{k-h} of the (k-h)-th iteration; recording the gradient g_k of the k-th iteration; calculating the cosine of the angle between g_{k-h} and g_k; judging whether the calculated cosine value is smaller than a preset value; and, if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration. The method decides whether to adjust the number of samples by computing the cosine similarity of the gradients of two iterations. Dynamically adjusting the number of samples by monitoring the gradient update direction is simple and efficient, and tuning the number of samples according to the gradient improves model training performance, accelerates model convergence, shortens training time and saves resources.
Description
Technical Field
The invention relates to the field of sample number optimization in model training, in particular to a method, a device, a terminal and a storage medium for dynamically optimizing sample number in model training.
Background
Model optimization is one of the most difficult challenges in implementing neural network learning algorithms. Hyper-parameter optimization aims to find the hyper-parameters that optimize the performance of a deep learning algorithm on a validation dataset. Unlike ordinary model parameters, hyper-parameters are not learned during training but are set manually beforehand. A neural network has many hyper-parameters to set, such as the learning rate, the batch_size (the number of samples used in one training step), the number of network layers and the number of neuron nodes.
The setting of the hyper-parameters directly affects model performance, so knowing how to optimize them is important for maximizing it. Commonly used hyper-parameter optimization methods include manual tuning, grid search and random search; at present, manual tuning is the most widely used.
In deep neural networks, tuning the hyper-parameters is an essential skill: the current training state of the model is judged by observing monitoring indicators such as the loss function and accuracy during training, and the hyper-parameters are adjusted in time so that the model is trained more scientifically and resource utilization improves. Different hyper-parameters affect training performance differently. Take the learning rate: when it is too high, the model may fail to converge and the loss function keeps oscillating; when it is too low, the model converges slowly and needs longer training time. Increasing the batch size generally lets the network converge faster, but because of memory constraints an oversized batch may exhaust memory or crash the program.
Prior research has focused on the influence of the learning rate on accelerating model convergence; the influence of the batch_size on training performance has been studied relatively little. Yet increasing the batch_size within a reasonable range also benefits training and performance: 1) it improves memory utilization and the parallelization efficiency of large matrix multiplications; 2) it reduces the number of iterations needed to run one epoch (a pass over the full dataset), speeding up processing of the same data volume; 3) within a certain range, the larger the batch_size, generally the more accurate the determined descent direction and the smaller the training oscillation it causes.
The batch_size thus affects model-training performance, but in the prior art a fixed batch_size, preset from experience, is used throughout training and cannot be adjusted dynamically as needed, which is detrimental to model training.
Disclosure of Invention
In order to solve the problems, the invention provides a method, a device, a terminal and a storage medium for dynamically optimizing the number of samples in model training, wherein the number of samples is dynamically optimized in the training process, and the optimization mode is simple and efficient.
The technical scheme of the invention is as follows: a method for dynamically optimizing sample number in model training is based on small-batch gradient descent and comprises the following steps:
recording the gradient g_{k-h} of the (k-h)-th iteration;
recording the gradient g_k of the k-th iteration;
calculating the cosine of the angle between the gradient g_{k-h} and the gradient g_k;
judging whether the calculated cosine value is smaller than a preset value;
if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration.
Further, increasing the number of samples means increasing the number of samples to twice the number of original samples.
Further, h is 1.
Further, if the calculated cosine value is larger than the preset value, keeping the number of samples unchanged.
The technical scheme of the invention also comprises a device for dynamically optimizing the sample number in model training, which is based on small-batch gradient descent and comprises,
A gradient recording module: recording the gradient g of each iteration;
cosine value calculation module: calculating the cosine of the angle between the gradient g_{k-h} of the (k-h)-th iteration and the gradient g_k of the k-th iteration;
cosine value judging module: judging whether the calculated cosine value is smaller than a preset value;
a sample number optimization module: and if the calculated cosine value is smaller than the preset value, increasing the number of samples during the iteration of the step k + 1.
Further, the sample number optimization module increases the number of samples by twice as much as the original number of samples.
Further, h is 1.
Further, if the calculated cosine similarity is greater than a preset value, the sample number optimization module keeps the sample number unchanged.
The technical scheme of the invention also comprises a terminal, which comprises:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method described above.
The technical solution of the present invention also includes a computer readable storage medium storing a computer program, which when executed by a processor implements the method as described above.
With the method, device, terminal and storage medium for dynamically optimizing the number of samples in model training provided by the invention, the gradient of each iteration is recorded, and the cosine similarity of the gradients of two iterations is computed to decide whether to adjust the number of samples. This way of dynamically adjusting the number of samples by monitoring the gradient update direction is simple and efficient; tuning the number of samples according to the gradient improves model training performance, accelerates model convergence, shortens training time and saves resources.
Drawings
Fig. 1 shows the positional relationship between the a-vector and the b-vector.
FIG. 2 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a second embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings by way of specific examples, which are illustrative of the present invention and are not limited to the following embodiments.
Example one
In a deep neural network, the whole network can be regarded as a complex nonlinear function that serves as a fitting model for the training samples. The value of the loss function (also called the objective function) evaluates how good the hypothesized model is: the smaller the loss value, the better the model fits the training data. Gradient descent is in fact the process of minimizing the loss function, whose negative gradient indicates the direction of steepest descent.
Gradient descent is a common optimization method in machine learning. According to the number of samples used each time, it takes three forms: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent.
Among them, the small batch gradient descent is the most common optimization method, which randomly uses the batch _ size samples for parameter update each time. The batch _ size refers to the number of samples used in one training session, and is hereinafter referred to as the number of samples.
Assuming the batch_size is m and each sample is (x_i, y_i), for a mini-batch the gradient is the average of the per-sample gradients:

g = (1/m) · Σ_{i=1}^{m} ∇_θ L(f(x_i; θ), y_i)

where L is the loss function, f is the model and θ are its parameters.
It should be noted that the gradient g is a vector.
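As a concrete sketch, the averaging above can be written in a few lines of NumPy. The names `minibatch_gradient` and `grad_fn` (a per-sample gradient function) and the uniform sampling scheme are illustrative assumptions; the patent does not fix them:

```python
import numpy as np

def minibatch_gradient(grad_fn, X, y, theta, m, rng):
    """Average the per-sample gradients over a randomly drawn mini-batch of m samples."""
    idx = rng.choice(len(X), size=m, replace=False)   # draw a random batch of m samples
    grads = [grad_fn(theta, X[i], y[i]) for i in idx]  # per-sample gradients
    return np.mean(grads, axis=0)                      # the gradient g is a vector
```

For instance, with a squared-error gradient `2 * (theta @ x - y) * x` for a linear model, the function returns the averaged gradient vector of the sampled batch.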
Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; a cosine value of exactly 1 means the angle is 0 and the two vectors point in the same direction.
As shown in FIG. 1, the angle between the two vectors a and b is θ, and its cosine is cos θ = (a · b) / (|a| |b|). The closer the cosine value is to 1, the closer the angle θ is to 0 degrees, i.e., the more similar the vectors a and b are.
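The cosine formula above is a one-liner in code; this small helper (the function name is ours, not the patent's) computes it for any two vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (a . b) / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Orthogonal vectors give 0, parallel vectors give 1, and opposite vectors give -1.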
The method for dynamically optimizing the number of samples in model training provided by the embodiment is based on small-batch gradient descent.
Within a certain range, the larger the batch_size, generally the more accurate the determined descent direction and the smaller the training oscillation it causes. However, once the batch_size grows beyond a certain point the determined descent direction hardly changes any more, so an excessive batch_size contributes little to training precision and only increases the computational cost of training.
To measure the change of the gradient direction, the cosine similarity of two gradient vectors is used. If the cosine of the angle between two gradients is large, the angle has changed little and the gradient direction fluctuates little, so the batch_size need not be updated. If the cosine of the angle between the two gradients is small, the gradient direction fluctuates strongly, and the batch_size is updated.
As shown in fig. 2, specifically, the method includes the following steps:
S1, recording the gradient g_{k-h} of the (k-h)-th iteration;
S2, recording the gradient g_k of the k-th iteration;
S3, calculating the cosine of the angle between the gradient g_{k-h} and the gradient g_k;
S4, judging whether the calculated cosine value is smaller than a preset value;
S5, if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration.
Here h is an integer satisfying 1 ≤ h < k. Preferably h = 1, i.e., the cosine of the angle between the gradients of two adjacent iterations is calculated, which improves the optimization precision.
In this embodiment, increasing the number of samples means doubling it: when the cosine of the angle between the gradient of the k-th iteration and that of the (k-h)-th iteration is smaller than the preset value, the number of samples of the (k+1)-th iteration is set to twice that of the k-th iteration.
It should be noted that if the calculated cosine value is greater than the preset value, the number of samples is kept unchanged, i.e., the (k+1)-th iteration uses the same number of samples as the k-th.
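Steps S1–S5, together with the doubling and keep-unchanged rules, can be sketched as the following training loop. The threshold value, learning rate and the `grad_fn(theta, batch_size)` signature are illustrative assumptions; the patent only requires comparing the cosine against some preset value:

```python
import numpy as np

def train_dynamic_batch(grad_fn, theta, n_steps, batch_size, lr=0.1, h=1, threshold=0.9):
    """Sketch of the patent's scheme: double batch_size whenever the cosine of the
    angle between gradients h steps apart drops below a preset threshold."""
    history = []                                 # S1/S2: recorded gradients, one per iteration
    for k in range(n_steps):
        g = grad_fn(theta, batch_size)           # gradient of step k on batch_size samples
        if k >= h:
            g_prev = history[k - h]              # gradient of step k - h
            cos = np.dot(g, g_prev) / (np.linalg.norm(g) * np.linalg.norm(g_prev))  # S3
            if cos < threshold:                  # S4: direction fluctuated too much
                batch_size *= 2                  # S5: increase sample number for step k + 1
        history.append(g)
        theta = theta - lr * g                   # plain gradient-descent update
    return theta, batch_size
```

On a smooth quadratic objective the gradient direction is stable, so the batch size never grows; when the gradient direction oscillates, the batch size is doubled repeatedly, which matches the intuition of the scheme.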
Example two
Based on the first embodiment, the present embodiment provides a device for dynamically optimizing sample number in model training, and similarly, the device is based on small-batch gradient descent, and includes the following functional modules.
Gradient recording module 101: recording the gradient g of each iteration;
cosine value calculation module 102: calculating the cosine of the angle between the gradient g_{k-h} of the (k-h)-th iteration and the gradient g_k of the k-th iteration;
cosine value determination module 103: judging whether the calculated cosine value is smaller than a preset value;
sample number optimization module 104: and if the calculated cosine value is smaller than the preset value, increasing the number of samples during the iteration of the step k + 1.
Here h is an integer satisfying 1 ≤ h < k. Preferably h = 1, i.e., the cosine of the angle between the gradients of two adjacent iterations is calculated, which improves the optimization precision.
In this embodiment, increasing the number of samples means doubling it: when the cosine of the angle between the gradient of the k-th iteration and that of the (k-h)-th iteration is smaller than the preset value, the number of samples of the (k+1)-th iteration is set to twice that of the k-th iteration.
It should be noted that if the calculated cosine value is greater than the preset value, the number of samples is kept unchanged, i.e., the (k+1)-th iteration uses the same number of samples as the k-th.
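A minimal sketch of the device's four modules as a single class, with the gradient recording, cosine calculation, judgment and sample-number optimization steps marked in comments. The class name and the threshold of 0.9 are assumptions for illustration:

```python
import numpy as np

class SampleNumberOptimizer:
    """Combines the four modules of the device into one small helper class."""

    def __init__(self, batch_size, h=1, threshold=0.9):
        self.batch_size = batch_size
        self.h = h
        self.threshold = threshold
        self.gradients = []                      # gradient recording module 101

    def record(self, g):
        """Record the gradient of the current iteration."""
        self.gradients.append(np.asarray(g))

    def step(self):
        """Cosine calculation (102), judgment (103) and optimization (104) modules."""
        if len(self.gradients) <= self.h:
            return self.batch_size               # not enough history yet
        g_k = self.gradients[-1]                 # gradient of step k
        g_prev = self.gradients[-1 - self.h]     # gradient of step k - h
        cos = np.dot(g_k, g_prev) / (np.linalg.norm(g_k) * np.linalg.norm(g_prev))
        if cos < self.threshold:
            self.batch_size *= 2                 # double the sample number for step k + 1
        return self.batch_size
```

In a training loop one would call `record(g)` after each backward pass and read the batch size to use next from `step()`.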
EXAMPLE III
The present embodiments provide a terminal that includes a processor and a memory.
The memory stores the processor's execution instructions. It may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disk. When the executable instructions in the memory are executed by the processor, the terminal can perform some or all of the steps in the above method embodiments.
The processor is the control center of the terminal: it connects the parts of the whole electronic terminal via various interfaces and lines, and performs the functions of the electronic terminal and/or processes data by running or executing the software programs and/or modules stored in the memory and calling data stored in the memory. The processor may consist of integrated circuits (ICs), for example a single packaged IC, or several packaged ICs with the same or different functions connected together.
Example four
The present embodiment provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided in the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
The above disclosure is only for the preferred embodiments of the present invention, but the present invention is not limited thereto, and any non-inventive changes that can be made by those skilled in the art and several modifications and amendments made without departing from the principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for dynamically optimizing sample number in model training is based on small-batch gradient descent and is characterized by comprising the following steps of:
recording the gradient g_{k-h} of the (k-h)-th iteration;
recording the gradient g_k of the k-th iteration;
calculating the cosine of the angle between the gradient g_{k-h} and the gradient g_k;
judging whether the calculated cosine value is smaller than a preset value;
if it is smaller than the preset value, increasing the number of samples in the (k+1)-th iteration.
2. The method for dynamically optimizing the number of samples in model training according to claim 1, wherein increasing the number of samples means increasing the number of samples to twice the number of original samples.
3. The method for dynamically optimizing the number of samples in model training according to claim 1 or 2, wherein h is 1.
4. The method for dynamically optimizing the number of samples in model training according to claim 1 or 2, wherein the number of samples is kept unchanged if the calculated cosine value is greater than a preset value.
5. A device for dynamically optimizing the number of samples in model training is based on small-batch gradient descent and is characterized by comprising,
a gradient recording module: recording the gradient g of each iteration;
cosine value calculation module: calculating the cosine of the angle between the gradient g_{k-h} of the (k-h)-th iteration and the gradient g_k of the k-th iteration;
cosine value judging module: judging whether the calculated cosine value is smaller than a preset value;
a sample number optimization module: and if the calculated cosine value is smaller than the preset value, increasing the number of samples during the iteration of the step k + 1.
6. The apparatus for dynamically optimizing the number of samples in model training according to claim 5, wherein the sample number optimizing module increases the number of samples to twice the number of original samples.
7. The device for dynamically optimizing the sample number in model training according to claim 5 or 6, wherein h is 1.
8. The apparatus for dynamically optimizing the number of samples in model training according to claim 5 or 6, wherein the sample number optimizing module keeps the number of samples unchanged if the calculated cosine similarity is greater than a preset value.
9. A terminal, comprising:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method of any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010566690.2A CN111860830A (en) | 2020-06-19 | 2020-06-19 | Method, device, terminal and storage medium for dynamically optimizing sample number in model training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111860830A true CN111860830A (en) | 2020-10-30 |
Family
ID=72986950
Legal Events

Date | Code | Title |
---|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
2020-10-30 | WW01 | Invention patent application withdrawn after publication |