CN112418422B - Deep neural network training data sampling method based on human brain memory mechanism - Google Patents

Deep neural network training data sampling method based on human brain memory mechanism

Info

Publication number
CN112418422B
CN112418422B
Authority
CN
China
Prior art keywords
training
samples
sample
waiting
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011307776.XA
Other languages
Chinese (zh)
Other versions
CN112418422A (en)
Inventor
何水兵
胡双
孙贤和
银燕龙
陈刚
任祖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab filed Critical Zhejiang University ZJU
Priority to CN202011307776.XA priority Critical patent/CN112418422B/en
Publication of CN112418422A publication Critical patent/CN112418422A/en
Application granted granted Critical
Publication of CN112418422B publication Critical patent/CN112418422B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep neural network training data sampling method based on a human brain memory mechanism, which comprises the following steps: S1, in the initial training period, setting the next round's sequence to be trained to the whole training set; S2, packing the data contained in the training sequence into a number of batches according to the batch size, putting the batches into the neural network for training, and obtaining the training loss value of each sample; S3, dividing the sample sequence into three categories, hard, middle and easy, according to the loss value; S4, adding one basic clock to every sample of the training sequence, where the middle and easy samples additionally compute their extra clock counts according to a countdown waiting function; S5, decrementing the clock count of every sample in the whole training set and putting the samples whose clock count reaches 0 into the next round's sequence to be trained; S6, repeating steps S2 to S5 until the neural network converges or the number of training periods ends.

Description

Deep neural network training data sampling method based on human brain memory mechanism
Technical Field
The invention relates to the technical field of neural networks, and in particular to a method and framework for importance sampling of deep neural network training data.
Background
With the development of deep learning in recent years, deep neural networks have achieved significant success in the fields of computer vision, speech recognition and natural language processing.
Training a high-precision deep neural network typically consumes a large amount of time and computing resources. The standard neural network training process treats all samples indiscriminately, but this ignores the variability between samples. In fact, not all samples contribute equally to gradient descent, and even the same sample contributes differently at different stages of the overall training. Treating all samples equally during training therefore wastes CPU, memory and IO resources, and misses the opportunity to shorten training time and accelerate training.
The whole training process can therefore be accelerated by skipping the training of unimportant samples. Two problems must be solved when sampling by importance: 1) how to evaluate the importance of a sample; 2) how many important samples should be selected in different training phases. An optimal sampling distribution can be obtained from the gradients of individual samples, but current deep learning frameworks (such as PyTorch or TensorFlow) cannot quickly obtain per-sample gradients, so this approach is impractical. Alternatively, the loss or a custom upper bound on the gradient can be used to replace or approximate the sample gradient, or an auxiliary neural network can be trained to predict sample importance. However, training an auxiliary network introduces additional computational overhead, and computing a gradient upper bound is more complex and time-consuming than computing the loss. Meanwhile, methods that evaluate sample importance using the loss have so far only been tested on small data sets and image classification tasks, so their range of application is limited.
Disclosure of Invention
In order to overcome the defects of the prior art, and to reduce computational complexity, widen the range of application and improve acceleration efficiency when sampling important samples during deep neural network training, the invention adopts the following technical scheme:
a deep neural network training data sampling method based on a human brain memory mechanism adopts a memory sampling mode to apply two characteristics of memory:
1. The emphasis of memory. Throughout the training process, the neural network should focus on samples that are frequently misjudged, rather than on samples that are judged correctly or are easy to judge.
2. The memory interval. To improve the effectiveness of memorized data, the interval between training periods of a sample is adjusted according to the sample's difficulty.
As shown in fig. 1, in the sampling stage, all samples may be sampled by the MSampler (Memorized Sampler) method proposed by the present invention alone; MSampler may also be used in series with other samplers (non-MSampler), i.e. the samples filtered by another sampler are used as the input of MSampler, or in parallel with other samplers, where the intersection of the samples filtered by the two samplers is used as the input data of each epoch (training period), as sketched below.
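The two composition modes can be expressed as simple operations on sets of sample indices. The following is a minimal Python sketch; other_select and msampler_select are hypothetical callables, not names from the patent, that map a list of candidate sample indices to the indices selected for the next epoch.

# Minimal sketch of composing MSampler with another sampler (hypothetical helpers).

def compose_serial(all_indices, other_select, msampler_select):
    # Serial use: the other sampler filters first; its output becomes MSampler's input.
    return msampler_select(other_select(all_indices))

def compose_parallel(all_indices, other_select, msampler_select):
    # Parallel use: the intersection of both samplers' selections is the epoch's input data.
    return sorted(set(other_select(all_indices)) & set(msampler_select(all_indices)))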
As shown in fig. 2 and 3, the steps of the sampling method are as follows:
1. In the initial training period, the next round's sequence to be trained (running_list) is set to the whole training set (total_list).
2. The data contained in the training sequence (running_list) are packed into a number of batches according to the batch size, put into the neural network for training, and the training loss value of each sample is obtained.
3. The sample sequence is divided into three categories, hard (Hard), middle (Middle) and easy (Easy), according to the loss value loss. The partition follows the rule below:
(Formula image: partition rule that assigns the Hard, Middle and Easy labels from the sorted loss values, the relaxation factor γ and the minimum ε.)
where N represents the total number of samples in the training set and γ represents the relaxation factor, whose main effect is to enlarge the number of Hard samples so that more samples are selected in the next epoch (training cycle). ε represents a minimum value (usually 0); samples whose loss is below this minimum are judged to be Easy samples. A hypothetical sketch of this partition is given below.
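Since the partition formula itself is only reproduced as an image, the following Python sketch is a hypothetical reading rather than the patent's formula: it assumes the samples are ranked by loss, that losses below ε are Easy, and that a base fraction hard_frac (an invented parameter) scaled by the relaxation factor γ determines how many of the remaining samples are labelled Hard.

import math

def partition_by_loss(losses, gamma=1.0, eps=0.0, hard_frac=1.0 / 3):
    """Return three lists of sample indices: (hard, middle, easy)."""
    n = len(losses)
    # Samples whose loss falls below the minimum eps are judged Easy.
    easy = [i for i in range(n) if losses[i] < eps]
    # Remaining samples, sorted by loss in descending order (hardest first).
    rest = sorted((i for i in range(n) if losses[i] >= eps),
                  key=lambda i: losses[i], reverse=True)
    # The relaxation factor gamma enlarges the Hard portion so that more samples
    # are selected in the next epoch; hard_frac is an invented base fraction.
    n_hard = min(len(rest), math.ceil(gamma * hard_frac * n))
    return rest[:n_hard], rest[n_hard:], easy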
4. One basic clock is added to every sample of the training sequence (running_list). Middle and Easy samples additionally compute an extra number of clocks according to a countdown waiting function (i.e. they must wait longer than Hard samples). This expresses that samples the neural network misjudges should immediately be trained further, while samples the network judges correctly can wait for some time before being trained again. In this way the total number of samples that need to be trained in the following training periods is reduced, which reduces training time.
5. Here, three countdown waiting functions are proposed, specifically as follows:
(1) Step-back waiting: the training waiting time of a sample is increased linearly every fixed number of cycles, as follows:
counts=bcount+1*(epoch/interval)
where counts represents the computed waiting time of each category of sample and bcount represents the waiting base of each level; for example, the base of the Middle category can be set to 2 and the base of the Easy category to 3, or the Middle base to 1 and the Easy base to 2, chosen as a hyperparameter according to the actual situation. epoch represents the training cycle (round) number and interval represents the number of rounds between count updates. A sketch of this function follows.
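The step-back waiting formula above translates directly into code; the only assumption is that epoch is divided by interval using integer division, since rounds are counted discretely.

def step_back_wait(bcount, epoch, interval):
    # Linear (step-back) countdown wait: counts = bcount + 1 * (epoch / interval),
    # where bcount is the per-level waiting base (e.g. Middle = 1, Easy = 2).
    return bcount + 1 * (epoch // interval)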
(2) Exponential backoff waiting: the training waiting time of a sample is increased exponentially every fixed number of cycles, as follows:
(Formula image: counts is taken as the minimum of an exponentially growing wait, with base increment_rate, and the upper limit largest_count.)
where min() takes the smaller of the two values in the parentheses; increment_rate is the base of the exponential growth rate of the exponential backoff waiting mode and must be set in advance as a hyperparameter; largest_count, also a hyperparameter set in advance, is the upper limit on the waiting time, which can never exceed largest_count. One plausible form of this function is sketched below.
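Because the exponential backoff formula is also only shown as an image, the sketch below is one plausible form consistent with the description (a wait that grows exponentially with base increment_rate and is capped at largest_count); it is an assumption, not the patent's exact expression.

def exponential_backoff_wait(bcount, epoch, interval, increment_rate, largest_count):
    # Plausible exponential backoff wait: grows with base increment_rate every
    # `interval` rounds and never exceeds largest_count (the upper wait limit).
    return min(bcount + increment_rate ** (epoch // interval), largest_count)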
(3) History sliding window: a history category value is maintained for each sample, and the number of waiting periods of the sample is determined by the sample's historical category and the number of consecutive occurrences.
If a sample is of the Middle category for 3 consecutive cycles, 2 waiting cycles are added; if it is Middle for 4 consecutive cycles, 3 waiting cycles are added, and so on. If a sample is Easy for 2 consecutive cycles, 2 waiting cycles are added; if it is Easy for 3 consecutive cycles, 3 waiting cycles are added or the sample is discarded (the Easy sample is removed from the training set), and so on. A sketch of this rule follows.
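The history sliding window rule above (and in claim 5) can be sketched as follows; the function name and the way the category history is passed in are illustrative.

def sliding_window_wait(history, window_len=3, m1=2, m2=2):
    """Return (extra_wait, discard) from a sample's recent category history.

    `history` holds the sample's most recent categories ('Hard'/'Middle'/'Easy'),
    newest last, at most `window_len` entries long.
    """
    if not history:
        return 0, False
    last, run = history[-1], 0
    for cat in reversed(history):            # length of the trailing run of `last`
        if cat != last:
            break
        run += 1
    if last == 'Middle' and run >= 3:
        return m1 + (run - 3), False         # 3 in a row -> m1, 4 in a row -> m1 + 1, ...
    if last == 'Easy' and run >= 2:
        # 2 in a row -> m2, 3 in a row -> m2 + 1, ...; once the run fills the whole
        # window, the Easy sample may instead be dropped from the training set.
        return m2 + (run - 2), run >= window_len
    return 0, False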
6. The clock count of every sample in the whole training set (total_list) is decremented by one, and the samples whose clock count reaches 0 are placed into the next round's sequence to be trained (the clock bookkeeping is sketched after these steps).
7. Steps 2-6 are repeated until the neural network converges or the number of training cycles is reached.
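Putting steps 4 to 6 together, the per-epoch clock bookkeeping might look like the following sketch; the function and argument names are illustrative, and wait_fn stands for whichever countdown waiting function is chosen.

def update_clocks_and_select(running_list, total_list, categories, clocks, wait_fn, epoch):
    """Clocks for one epoch: add waits after training, tick down, pick clock == 0."""
    for i in running_list:                       # step 4: every trained sample gets one basic clock
        clocks[i] = 1
        if categories[i] in ('Middle', 'Easy'):  # Middle/Easy wait extra, per the countdown function
            clocks[i] += wait_fn(categories[i], epoch)
    for i in total_list:                         # step 6: decrement every sample's clock
        clocks[i] = max(0, clocks[i] - 1)
    # Samples whose clock reaches 0 form the next round's sequence to be trained.
    return [i for i in total_list if clocks[i] == 0]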
The invention has the advantages and beneficial effects that:
During training, attention is focused on the samples that carry richer information and are harder, rather than on easy samples that are already predicted accurately. This reduces the total size of the neural network training set and the number of iterations in each training period, shortens the training time of the whole neural network, and thereby achieves the goal of accelerating neural network training.
The importance of a sample is judged in advance from its loss, which determines whether the sample is read at all, reducing the overhead on computation and IO bandwidth resources.
The method is orthogonal to other acceleration strategies that are independent of the training samples, such as loop perforation in approximate computing or accelerating training by low-rank decomposition in tensor computation.
By encapsulating the implementation details, the memory-based importance sampling of the invention only needs to expose the MSampler interface, which reduces the code modifications to the original training procedure and makes the method highly practical.
Drawings
Fig. 1 is an overall framework diagram of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
FIG. 3 is a diagram illustrating historical window rollback in the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples are given by way of illustration and explanation only and are not intended to limit the invention.
A deep neural network training data sampling method based on a human brain memory mechanism comprises the following steps:
1. When the epoch is 1, preprocessing operations (such as shuffling and data augmentation) are performed on all samples in the training set; the training set samples are divided into a number of batches according to the batch_size, and the batches are put into the neural network for training in turn.
2. As the neural network propagates forward, the loss of each sample is recorded in a loss_history_list.
3. The loss_history_list and the current epoch are passed as parameters into the custom sampler of the PyTorch deep learning framework.
4. The Sampler needs to predefine the sampling hyperparameters, including the relaxation factor (γ), interval, increment_rate, largest_count, ε, the waiting clock bases bcount, the length of the sliding window, and so on. By default one waiting clock period equals 1 epoch, the base of the Middle sample waiting clock period is 1 clock period and the base of the Easy sample waiting clock period is 2 clock periods; the length of the sliding window is 3 by default. A hypothetical default configuration is sketched below.
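A hypothetical default configuration consistent with the values listed above; the dictionary name is illustrative, and the entries marked as assumed have no stated default in the text.

MSAMPLER_DEFAULTS = {
    "gamma": 1.0,                         # relaxation factor (assumed value)
    "eps": 0.0,                           # minimum loss below which a sample is Easy (usually 0)
    "interval": 1,                        # clock-number update interval, in epochs (assumed value)
    "increment_rate": 2,                  # exponential growth base (assumed value)
    "largest_count": 8,                   # upper limit on the waiting time (assumed value)
    "bcount": {"Middle": 1, "Easy": 2},   # default waiting-clock bases from the text
    "window_len": 3,                      # default sliding-window length from the text
}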
5. The Sampler sorts the samples by the loss values in loss_history_list and then divides them into three levels, hard (Hard), middle (Middle) and easy (Easy), according to the following rule:
(Formula image: the same Hard/Middle/Easy partition rule as above, based on the sorted loss values, the relaxation factor γ and the minimum ε.)
where N represents the total number of training set samples and γ represents the relaxation factor, whose main effect is to enlarge the number of Hard samples so that more samples are selected in the next epoch (training cycle); ε represents a minimum value (usually 0), and samples whose loss is below this minimum are judged to be Easy samples.
6. If the history sliding window is used as the countdown function to determine the clocks that Middle and Easy samples must wait, the loss of each sample needs to be recorded in that sample's sliding window list (sliding_window) to facilitate the later calculation of the countdown waiting clock.
7. The number of clocks to add to each sample is calculated from the level to which the sample is assigned and the countdown waiting function.
8. The clock count of every sample in the whole training set is decremented by 1, and the samples whose clock count reaches 0 are placed into the sample sequence to be trained in the next period.
9. The resulting sample sequence is shuffled and returned; the custom sampler is passed as a parameter into the PyTorch DataLoader, which generates batch data that is put into the neural network for forward propagation and backward propagation to update the parameters.
10. Steps 2-9 are repeated until the training of the neural network is finished. A PyTorch sketch of this embodiment is given below.
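The embodiment above maps naturally onto a custom PyTorch Sampler. The following is a minimal sketch, not the patent's reference implementation: the class name MSampler and the update method are illustrative, and it reuses the hypothetical helpers partition_by_loss, step_back_wait and update_clocks_and_select sketched earlier.

import random
from torch.utils.data import Sampler, DataLoader

class MSampler(Sampler):
    """Memory-based importance sampler: yields the indices of the current running_list."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.clocks = [0] * num_samples                  # per-sample countdown clocks
        self.running_list = list(range(num_samples))     # epoch 1: the whole training set

    def update(self, loss_history_list, epoch):
        """Call once per epoch with the recorded per-sample losses (steps 3-9)."""
        hard, middle, easy = partition_by_loss(loss_history_list)
        categories = {i: 'Hard' for i in hard}
        categories.update({i: 'Middle' for i in middle})
        categories.update({i: 'Easy' for i in easy})
        # Step-back waiting as the countdown function; bases Middle = 1, Easy = 2.
        wait_fn = lambda level, ep: step_back_wait({'Middle': 1, 'Easy': 2}[level], ep, interval=1)
        self.running_list = update_clocks_and_select(
            self.running_list, list(range(self.num_samples)),
            categories, self.clocks, wait_fn, epoch)
        random.shuffle(self.running_list)                # shuffle before returning

    def __iter__(self):
        return iter(self.running_list)

    def __len__(self):
        return len(self.running_list)

# Usage (illustrative): the DataLoader draws batches only from samples whose clock is 0.
# sampler = MSampler(len(train_set))
# loader = DataLoader(train_set, batch_size=64, sampler=sampler)
# ... after each epoch: sampler.update(loss_history_list, epoch)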
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A deep neural network training data sampling method based on a human brain memory mechanism is characterized by comprising the following steps:
s1, setting the next round of sequences to be trained as the whole training set in the initial training period;
s2, packing the data contained in the training sequence into a plurality of batches according to the batch size batch _ size, putting the batches into a neural network for training, and obtaining the training loss value loss of the sample;
s3, dividing the sample sequence into three types of Hard Hard, Middle and simple Easy according to the loss value loss, wherein the division adopts the following formula:
(Formula image: partition rule that assigns the Hard, Middle and Easy labels from the sorted loss values, the relaxation factor γ and the minimum ε.)
wherein N represents the total number of training set samples, gamma represents a relaxation factor used for adjusting the number of Hard samples, and epsilon represents a minimum value;
s4, adding a basic clock to the samples of the whole training sequence, wherein the intermediate and simple samples need to respectively calculate the additionally added clock number of the samples according to a countdown waiting function;
s5, reducing the number of clocks for the samples of the whole training set, and putting the samples with the number of clocks being 0 into the sequence to be trained in the next round;
s6, repeating the steps S2-S5 until the neural network converges or the training period number ends.
2. The method as claimed in claim 1, wherein the countdown waiting function in step S4 uses step back waiting to linearly increase the training waiting time of the samples every fixed number of cycles, and the formula is as follows:
counts=bcount+1*(epoch/interval)
wherein counts represents the calculated waiting time of each category of sample, bcount represents the waiting base of samples of different grades, epoch represents the training cycle number, i.e. the number of rounds, and interval represents the number of rounds after which counts is increased again, i.e. the clock-number update interval.
3. The method as claimed in claim 1, wherein the countdown waiting function in step S4 employs exponential backoff waiting, and exponentially increases the training waiting time of the sample every fixed number of cycles, and the formula is as follows:
(Formula image: counts is taken as the minimum of an exponentially growing wait, with base increment_rate, and the upper limit largest_count.)
wherein min() represents taking the minimum, i.e. the smaller of the two values in the parentheses, increment_rate represents the base of the exponential growth rate, largest_count represents the upper limit of the waiting time, epoch represents the training cycle number, i.e. the number of rounds, and interval represents the number of rounds after which counts is increased again, i.e. the clock-number update interval.
4. The method as claimed in claim 1, wherein the countdown waiting function in step S4 uses a history sliding window to maintain a history category value for each sample, and determines the number of waiting periods of the sample according to the history type and the number of consecutive times of the sample.
5. The method for sampling deep neural network training data based on a human brain memory mechanism as claimed in claim 4, wherein the method for determining the number of sample waiting periods using said history sliding window is as follows: if a sample is of the Middle category for 3 consecutive periods, m1 waiting periods are added on the basis of bcount; if it is of the Middle category for 4 consecutive periods, m1+1 waiting periods are added, and so on, until the number of consecutive periods equals the sliding window length; if a sample is of the Easy category for 2 consecutive periods, m2 waiting periods are added on the basis of bcount; if it is of the Easy category for 3 consecutive periods, m2+1 waiting periods are added, and so on, until the number of consecutive periods equals the sliding window length, whereupon the Easy sample is removed from the training set; by default m1 = m2 = 2.
6. The method as claimed in claim 1, wherein in step S1 all samples in the training set are preprocessed, the preprocessing including shuffling and data augmentation.
7. The method as claimed in claim 1, wherein in step S2, during forward propagation of the neural network, the loss value loss of each sample is recorded in a history loss list, the history loss list and the current training cycle number are introduced as parameters into a Sampler customized within the deep learning framework, and after the Sampler has predefined the sampling hyper-parameters, the samples are sorted according to the loss values in the history loss list and then graded.
8. The method as claimed in claim 7, wherein the hyper-parameters include a relaxation factor γ, a clock update interval, an exponential growth rate, a waiting time upper limit, a minimum value, a waiting clock period, a counting base bcount, increment bases m1 and m2, or the length of the sliding window.
9. The method according to claim 1, wherein after step S5, the obtained sample sequence is subjected to shuffle operation and returned, the obtained custom sampler is introduced into a DataLoader of the deep learning framework as a parameter, and then batch data is generated and put into the neural network for forward propagation and backward propagation to update the parameter.
10. The sampling method for deep neural network training data based on a human brain memory mechanism as claimed in one of claims 1-9, wherein said sampling method is used in series or in parallel with other sampling methods: in serial use, the output of the other sampling method is used as the input of said sampling method; in parallel use, the samples output by the other sampling method are intersected with the samples output by said sampling method, and the intersection is used as the input data of each epoch (training period).
CN202011307776.XA 2020-11-20 2020-11-20 Deep neural network training data sampling method based on human brain memory mechanism Active CN112418422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011307776.XA CN112418422B (en) 2020-11-20 2020-11-20 Deep neural network training data sampling method based on human brain memory mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011307776.XA CN112418422B (en) 2020-11-20 2020-11-20 Deep neural network training data sampling method based on human brain memory mechanism

Publications (2)

Publication Number Publication Date
CN112418422A CN112418422A (en) 2021-02-26
CN112418422B true CN112418422B (en) 2022-05-27

Family

ID=74773265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011307776.XA Active CN112418422B (en) 2020-11-20 2020-11-20 Deep neural network training data sampling method based on human brain memory mechanism

Country Status (1)

Country Link
CN (1) CN112418422B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345219A (en) * 2018-03-01 2018-07-31 东华大学 Fypro production technology based on class brain memory GRU
CN111461294A (en) * 2020-03-16 2020-07-28 中国人民解放军空军工程大学 Intelligent aircraft brain cognitive learning method facing dynamic game
CN111626335A (en) * 2020-04-29 2020-09-04 杭州火烧云科技有限公司 Improved hard case mining training method and system of pixel-enhanced neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11348002B2 (en) * 2017-10-24 2022-05-31 International Business Machines Corporation Training of artificial neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345219A (en) * 2018-03-01 2018-07-31 东华大学 Fypro production technology based on class brain memory GRU
CN111461294A (en) * 2020-03-16 2020-07-28 中国人民解放军空军工程大学 Intelligent aircraft brain cognitive learning method facing dynamic game
CN111626335A (en) * 2020-04-29 2020-09-04 杭州火烧云科技有限公司 Improved hard case mining training method and system of pixel-enhanced neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REMIND Your Neural Network to Prevent Catastrophic Forgetting; Tyler L. Hayes et al.; arXiv; 2020-07-13; full text *
人工智能机理解释与数学方法探讨 (Discussion of mechanism interpretation and mathematical methods for artificial intelligence); 郭田德 et al.; 中国科学:数学 (Science China: Mathematics); 2020-05-28; full text *

Also Published As

Publication number Publication date
CN112418422A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
Dutta et al. On the discrepancy between the theoretical analysis and practical implementations of compressed communication for distributed deep learning
Lei et al. Less than a single pass: Stochastically controlled stochastic gradient
EP3540652A1 (en) Method, device, chip and system for training neural network model
CN113064879A (en) Database parameter adjusting method and device and computer readable storage medium
Rotman et al. Shuffling recurrent neural networks
CN104765589B (en) Grid parallel computation preprocess method based on MPI
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN111882060A (en) Single-step delay stochastic gradient descent training method for machine learning
US11334646B2 (en) Information processing apparatus and method for controlling sampling apparatus
JP2019153098A (en) Vector generation device, sentence pair leaning device, vector generation method, sentence pair learning method, and program
CN108229714A (en) Prediction model construction method, Number of Outpatients Forecasting Methodology and device
CN116865251A (en) Short-term load probability prediction method and system
CN110110860B (en) Self-adaptive data sampling method for accelerating machine learning training
CN112099931A (en) Task scheduling method and device
US20220207374A1 (en) Mixed-granularity-based joint sparse method for neural network
CN112418422B (en) Deep neural network training data sampling method based on human brain memory mechanism
WO2020039790A1 (en) Information processing device, information processing method, and program
CN107277118A (en) The method and apparatus for generating the conventional access path of node
CN112598078B (en) Hybrid precision training method and device, electronic equipment and storage medium
Hidaka et al. Quantifying the impact of active choice in word learning
CN109614999A (en) A kind of data processing method, device, equipment and computer readable storage medium
CN112085179A (en) Method for increasing deep learning training data volume
CN116636815B (en) Electroencephalogram signal-based sleeping quality assessment method and system for underwater operators
CN117236900B (en) Individual tax data processing method and system based on flow automation
CN118035645A (en) Electromagnetic method data prediction method and device based on panning optimization LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant