CN113656333B - Method for accelerating deep learning training task data loading - Google Patents


Info

Publication number
CN113656333B
CN113656333B · Application CN202111221953.7A
Authority
CN
China
Prior art keywords: data, cache, training, cur, period
Legal status: Active
Application number: CN202111221953.7A
Other languages: Chinese (zh)
Other versions: CN113656333A
Inventors: 朱春节, 银燕龙, 何水兵, 曾令仿, 秦亦, 周方
Current Assignee: Zhejiang Lab
Original Assignee: Zhejiang Lab
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-03-18
Application filed by Zhejiang Lab
Priority to CN202111221953.7A
Publication of CN113656333A
Application granted
Publication of CN113656333B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0893: Caches characterised by their organisation or structure
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/10: Providing a specific technical effect
    • G06F 2212/1016: Performance improvement

Abstract

The invention discloses a method for accelerating data loading for deep learning training tasks. Using a double-random-sequence scheme, the random sequence of the next epoch is computed in advance at the start of each training epoch, and a separate region of memory is requested to cache, ahead of time, the data needed at the start of the next epoch. While data are fed to the neural network in the order given by the current epoch's random sequence, the data needed at the start of the next epoch are copied from memory into the cache in time, guided by the next epoch's random sequence, so that all data needed at the start of the next epoch can be served from the cache. The method requires no modification of existing deep learning frameworks, is simple to implement, and introduces little computational cost; the cached data are always hit and can be reused across epochs, reducing reads from the back-end storage system, and the more training epochs there are, the more pronounced the acceleration.

Description

Method for accelerating deep learning training task data loading
Technical Field
The invention relates to the field of deep learning, in particular to a method for accelerating the loading of deep learning training task data.
Background
Deep learning is a branch of machine learning: an approach to representation learning on data based on artificial neural networks, widely applied in computer vision, speech recognition, natural language processing and other fields. The training process of a deep learning task is executed over many epochs, producing a converged model through repeated training. Each epoch can be divided into three stages: data loading, data enhancement, and neural network model training. The data loading stage must accomplish two things: read the training set from a back-end storage system into memory, and randomly shuffle the training set. The data enhancement stage performs operations such as flipping, rotation, scaling, cropping, shifting and color jittering on the training data in memory, enlarging the sample space covered by the training set.
In the neural network model training stage, the enhanced data are used to train a neural network model containing a large number of parameters.
The data loading stage is I/O-intensive, while the other two stages are compute-intensive. In recent years computing power has grown far faster than storage-side I/O performance, so the share of total training time taken by the data loading stage keeps increasing, and data loading has gradually become one of the bottlenecks of deep learning training.
Traditional methods for accelerating data loading focus on optimizing how the training set is organized and accessed in the back-end storage system. For example, the small files of the training set can be packed into bundles and loaded into memory bundle by bundle, avoiding poorly performing random reads of small files; or the small files can be loaded sequentially in storage order and then shuffled locally in memory, converting slow random reads into fast sequential reads. These methods use the I/O bandwidth of the back-end storage system effectively and speed up loading the training set into memory, but their acceleration of data loading has almost reached its limit.
To avoid overfitting in deep learning, the training set normally has to be shuffled globally and randomly during the data loading stage. Because the training set is too large, the global shuffle cannot be performed directly in memory; instead, a random sequence is computed in each training epoch and the training data are loaded into memory one by one in that order. Since data enhancement modifies the original data in memory in place, each sample can be used only once after being loaded from the back-end storage system and must be loaded again in the next epoch, leaving the back-end storage system with a heavy I/O burden. An effective solution to this problem is currently lacking.
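For concreteness, the conventional pattern just described can be sketched as follows; read_from_storage and train_step are hypothetical stand-ins for back-end I/O and the enhancement plus training stages, not code from any framework:

    import random

    def baseline_epoch(dataset_size, read_from_storage, train_step):
        # A fresh global permutation of sample IDs is drawn for every epoch.
        order = list(range(dataset_size))
        random.shuffle(order)
        for sample_id in order:
            # Every sample is fetched from back-end storage again, because the
            # in-memory copy was mutated by the previous epoch's enhancement.
            sample = read_from_storage(sample_id)
            train_step(sample)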
Disclosure of Invention
In order to overcome the shortcomings of the prior art and achieve the goals of reducing data reads from the back-end storage system, increasing data loading speed, and delivering greater acceleration as the number of training epochs grows, the invention adopts the following technical scheme:
a method for accelerating deep learning training task data loading comprises the following steps:
S1, when the deep learning training task is initialized, carving a region out of the memory occupied by the task to serve as a cache, denoted Cache_next, which both supplies the data the deep learning training task needs during the current epoch and caches, in advance, the data needed at the start of the next epoch;
S2, constructing a double-random-sequence scheme to determine the order in which training set data enter the neural network, wherein the elements of each random sequence correspond one-to-one with the training set data, and during each training epoch two distinct, mutually independent random sequences (old and new) exist simultaneously;
S3, before the first training epoch, generating a random sequence S_next. At the beginning of every training epoch, the existing sequence S_next is assigned to S_cur, which determines the order in which the current epoch's data enter the neural network, and a new random sequence, again denoted S_next, is then generated to determine the order for the next epoch. S_next contains a prefix subsequence S_next_prefix covering the training data to be used at the start of the next epoch. As the data loading stage executes epoch by epoch, each epoch traverses S_cur; for each element S_cur[i], the corresponding training sample is fetched from Cache_next or from the back-end storage system into memory, after which Cache_next is updated with reference to S_next_prefix, as follows:
S31, when S_cur[i] hits in the front segment curList of Cache_next, the corresponding training sample is copied from curList into memory and the entry for S_cur[i] is deleted from curList; if S_cur[i] also appears in S_next_prefix, the sample is inserted into the rear segment nexList of Cache_next;
S32, when S_cur[i] misses in the front segment curList of Cache_next, the corresponding training sample is read from the back-end storage system into memory; if S_cur[i] appears in S_next_prefix, the sample is inserted into the rear segment nexList of Cache_next;
S33, when the traversal of S_cur finishes, S_cur is cleared, leaving only the random sequence S_next;
S4, the current epoch completes; if the number of completed epochs is less than the preset number N, training returns to S3 to start the next epoch; once all training epochs are complete, the deep learning training task is finished. A minimal code sketch of these steps follows.
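The sketch below illustrates S1 to S4 under stated assumptions: read_from_storage and train_step are hypothetical stand-ins for back-end I/O and the enhancement plus training stages, and a plain dict replaces the doubly linked Cache_next described later, trading the constant-time head-of-list hit test for a hash lookup:

    import copy
    import random

    def run_training(dataset_size, cache_capacity, read_from_storage,
                     train_step, num_epochs):
        s_next = random.sample(range(dataset_size), dataset_size)  # before epoch 0
        cache = {}  # Cache_next contents prepared for the coming epoch

        for epoch in range(num_epochs):
            # Rotate sequences: the old S_next becomes S_cur (S3).
            s_cur, s_next = s_next, random.sample(range(dataset_size), dataset_size)
            prefix = set(s_next[:cache_capacity])  # S_next_prefix
            cur_list, nex_list = cache, {}         # nexList rolls over into curList

            for sample_id in s_cur:
                if sample_id in cur_list:                  # S31: hit in curList
                    data = cur_list.pop(sample_id)
                else:                                      # S32: miss, read back end
                    data = read_from_storage(sample_id)
                if sample_id in prefix:                    # pre-cache for next epoch;
                    nex_list[sample_id] = copy.copy(data)  # copy, since enhancement
                                                           # mutates data in place
                train_step(data)                           # enhancement + training

            cache = nex_list  # S33: S_cur is discarded, only S_next survives

    # For ImageNet-scale training this might be invoked as, e.g.:
    # run_training(dataset_size=1_281_167, cache_capacity=50_000,
    #              read_from_storage=..., train_step=..., num_epochs=90)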
Further, S2 comprises the following steps:
S21, before the first epoch of the deep learning training task starts, a random sequence is generated, denoted S_next;
S22, at the beginning of each epoch, S_next is assigned to S_cur, and S_cur determines the order in which the current epoch's data enter the neural network;
S23, another random sequence is generated with a new random seed and assigned to S_next, and S_next determines the order in which the next epoch's data enter the neural network, so that two distinct, mutually independent random sequences exist in the system simultaneously;
S24, when an epoch ends, S_cur is cleared and S_next is retained.
Further, the Cache_next of S1 is logically divided into curList, which caches data to be used in the current training epoch, and nexList, which caches data to be used in the next training epoch, maintained through the following steps (a code sketch follows the list):
S11, before the first training epoch begins, Cache_next is empty, so curList and nexList are also empty;
S12, during the first training epoch, curList remains empty, and all data inserted into Cache_next are located in nexList;
S13, when a non-first training epoch starts, all data of nexList are transferred into curList, and nexList becomes empty;
S14, during a non-first training epoch, data hit in curList are removed, so curList gradually shortens, while data newly inserted into Cache_next all enter nexList, which gradually lengthens;
S15, when a training epoch ends, the length of curList is zero and the length of nexList equals the length of Cache_next;
S16, the order of data in nexList is kept consistent with the order of their IDs in S_next_prefix.
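Under these invariants, the epoch-boundary handling of the two segments can be sketched as follows (an assumed structure, using Python deques in place of the linked list of S1):

    from collections import deque

    class CacheNextSegments:
        def __init__(self):
            # S11: before the first epoch, both segments are empty.
            self.cur_list = deque()  # data the current epoch will hit
            self.nex_list = deque()  # data cached for the next epoch,
                                     # kept in S_next_prefix order (S16)

        def start_epoch(self):
            # S13: at an epoch boundary, nexList becomes curList.
            # S15 guarantees curList has already drained to length zero.
            assert len(self.cur_list) == 0
            self.cur_list, self.nex_list = self.nex_list, deque()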
Further, in S1, Cache_next is organized as a linked list and logically divided into curList and nexList.
Further, in S1, the capacity of Cache_next is set by the developer according to the memory actually available in the system.
Further, in S2, the elements of the random sequences are the IDs of the training set data.
Further, in S2, the random sequences are generated by a random function, and the random seed required by the random function is initialized from the computer clock.
Further, in S2, the elements of each random sequence correspond one-to-one with the training set data, and the sequence length equals the total number of samples in the training set.
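Taken together, these clauses amount to the following generator, a sketch assuming Python's standard PRNG:

    import random
    import time

    def new_random_sequence(num_samples):
        # Seed the PRNG from the computer clock, then emit a permutation of all
        # sample IDs: one element per training sample, length == dataset size.
        rng = random.Random(time.time_ns())
        sequence = list(range(num_samples))
        rng.shuffle(sequence)
        return sequence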
Further, in S3, the length of S_next_prefix equals the number of nodes Cache_next can hold; since the capacity of Cache_next is predefined, the number of elements in S_next_prefix is set before the deep learning training task begins.
Further, in S3, after a training sample enters memory it passes through the data enhancement stage, whose operations modify the original data in memory in place; the enhanced sample then forms a batch together with other enhanced samples and enters the neural network model training stage.
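As an illustration of this stage, standard torchvision transforms can stand in for the crop, scale, flip and rotate operations named above; the sketch below assumes PIL images already loaded in memory, and the concrete transform list is an assumption rather than anything prescribed here:

    import torch
    from torchvision import transforms

    # A stand-in enhancement pipeline (assumed, not prescribed by this patent):
    # crop + scale, flip, and rotate, ending with conversion to a tensor.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15),
        transforms.ToTensor(),
    ])

    def make_batch(pil_images):
        # Each loaded image is enhanced in memory; the enhanced samples are
        # then stacked into a single batch for the model training stage.
        return torch.stack([augment(img) for img in pil_images])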
The invention has the advantages and beneficial effects that:
the invention additionally occupies a memory as a cache, and when each training period of the deep learning training task starts, the random sequence required by shuffling in the next period is calculated in advance, so that the data loading stage of each period has double random sequences. When the training set data is loaded into the memory according to the random sequence of the current period, the data to be used in the initial stage of the next period is cached in sequence by referring to another random sequence, so that the required data can be quickly read from the cache in the initial stage of the next period in the data loading stage without being read to a back-end storage system, the time overhead of the data loading stage is obviously reduced, and the I/O bottleneck of a deep learning training task is eliminated. The memory space occupied by the cache is configurable, the larger the configured memory is, the better the acceleration effect of the algorithm is, and in addition, the more the number of cycles executed by the deep learning training task is, the better the acceleration effect of the algorithm is. Finally, the situation that the data can only be used once after being loaded into the memory from the back-end storage system every time and needs to be loaded again in the next period is avoided, and the heavy I/O burden of the back-end storage system is relieved.
Drawings
FIG. 1 is a diagram of a working framework for accelerating deep learning training tasks using the method of the present invention.
FIG. 2 is a schematic diagram of the node design of Cache_next in the present invention.
FIG. 3 is a schematic diagram of the organization structure of Cache_next in the present invention.
FIG. 4 is a diagram of the double random sequence in the present invention.
FIG. 5 is a schematic diagram of the logical partitioning of the random sequence S_next in the present invention.
FIG. 6 is a flow chart of the method of the present invention.
FIG. 7 is a flow chart of the Cache_next update process in the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in FIG. 6, a method for accelerating data loading of a deep learning training task is provided, aiming to significantly reduce the time overhead of the data loading stage and eliminate the I/O bottleneck of deep learning training. The algorithm occupies an additional, configurable amount of memory; the more memory is configured, the better the acceleration, and the more epochs the training task runs, the better the acceleration. The algorithm computes the random sequence required for the next epoch in advance, so that the data loading stage of each epoch holds a double random sequence. While the training set is loaded in the order of the current epoch's sequence, the data to be used at the start of the next epoch are cached in order, guided by the other sequence. At the start of the next epoch, the data loading stage can then read the required data quickly from the cache rather than from the back-end storage system.
The method adopts double random sequences in the data loading stage: the two sequences indicate, respectively, the order in which the training set enters the neural network in the current epoch and in the next epoch, and only one of them needs to be regenerated per epoch. Correspondingly, the algorithm requests a separate memory region as Cache_next, whose purpose is to supply the data needed in the current epoch while caching, in time, the data to be used at the start of the next epoch.
Training a ResNet model with the ImageNet data set on the deep learning platform PyTorch using the method comprises the following steps:
1. As shown in FIG. 1, the device of the present invention is deployed as a component on the deep learning platform PyTorch.
1.1, when a ResNet model training task is initialized, the component requests a cache region for the task, denoted Cache_next; at this point Cache_next holds nothing. The capacity of Cache_next is preset by the user according to the memory available in the system; for example, if the system has 2 GB of free memory, the capacity of Cache_next cannot exceed 2 GB.
1.2, Cache_next uses a doubly linked list as its data structure, shown in FIG. 3. Cache_next maintains a pointer that divides the linked list into two parts: the part near the head of the list caches the pictures PyTorch will reference during epoch_m, denoted curList, and the part near the tail caches the pictures PyTorch will reference during epoch_(m+1), denoted nexList; the pointer points to the first node of nexList. The node design is shown in FIG. 2: each node has three members, pre, nex and data, where pre points to the node's predecessor in the list, nex points to its successor, and data points to the node's data field; here the data field of one node caches one picture of the ImageNet training set.
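The node and list layout of 1.2 might be expressed as follows; this is a sketch of FIG. 2 and FIG. 3 in Python rather than the actual implementation:

    class Node:
        __slots__ = ("pre", "nex", "data")

        def __init__(self, data):
            self.pre = None   # predecessor node in the linked list
            self.nex = None   # successor node in the linked list
            self.data = data  # data field: one picture of the training set

    class CacheNext:
        def __init__(self):
            self.head = None     # first node of curList (list head)
            self.tail = None     # last node of nexList (list tail)
            self.pointer = None  # first node of nexList; NULL while nexList is empty

        def append_tail(self, data):
            # New pictures always enter at the tail, joining (or starting)
            # nexList, which preserves S_next_prefix order (see step 3.2).
            node = Node(data)
            if self.tail is None:
                self.head = self.tail = node
            else:
                node.pre = self.tail
                self.tail.nex = node
                self.tail = node
            if self.pointer is None:
                self.pointer = node
            return node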
1.3, the ImageNet training set consists of a series of pictures, each referenced exactly once per epoch. PyTorch uses a shuffle function to generate a random sequence whose length equals the total number of pictures in the ImageNet training set; each element of the sequence corresponds to one picture, and PyTorch feeds pictures into the ResNet model in the order given by this random sequence.
2. The training process of the ResNet model is executed over N epochs.
2.1, before epoch_0 starts, S_cur is null. PyTorch uses the shuffle function to generate a random sequence, denoted S_next, which determines the order in which pictures of the ImageNet training set enter the ResNet model during epoch_0.
2.2, at the beginning of epoch_m (0 <= m < N-1), S_next is first assigned to S_cur, and S_cur determines the order in which pictures enter ResNet training during epoch_m; PyTorch then uses the shuffle function to generate a new random sequence and assigns it to S_next, which determines the order in which pictures enter ResNet training during epoch_(m+1). At this point PyTorch holds two mutually independent, distinct random sequences, as shown in FIG. 4.
2.3, once the two random sequences S_cur and S_next are ready, PyTorch, during epoch_m, loads pictures in order from Cache_next or the back-end storage system according to S_cur, and then updates Cache_next according to S_next. The process is shown in FIG. 6; the specific steps are as follows.
2.3.1, if S_cur[i] hits in Cache_next, PyTorch takes the corresponding picture from Cache_next into memory and deletes it from Cache_next; otherwise, PyTorch reads the picture from the back-end storage system into memory.
2.3.2, as shown in FIG. 5, S_next is logically divided into a front part and a rear part, denoted S_next_prefix and S_next_suffix, where the length of S_next_prefix is determined by the capacity of Cache_next; the pictures covered by S_next_prefix are inserted into Cache_next one by one during epoch_m.
If S_cur[i] appears in S_next_prefix, the picture corresponding to S_cur[i] is copied in memory and inserted into Cache_next, even if S_cur[i] was just removed from Cache_next in step 2.3.1 above; otherwise Cache_next is not updated.
2.4, now that the picture for S_cur[i] is ready in memory, it passes through the data enhancement stage, forms a batch together with other enhanced pictures, and enters ResNet model training.
2.5, when S_cur[i] is the last element of S_cur, data loading for epoch_m is entirely finished and S_cur is cleared.
2.6, when training on all pictures of the ImageNet training set has completed, epoch_m ends.
3. Cache_next supplies PyTorch with the data needed during epoch_m while caching the data needed at the start of epoch_(m+1), and the update of Cache_next executes concurrently with the deep learning training task. The update process of Cache_next, shown in FIG. 7, comprises the following steps.
3.1, when epoch_0 begins, Cache_next is empty, so curList and nexList are also empty; when epoch_m (0 <= m < N) begins, all nodes of nexList are transferred into curList, nexList becomes empty, and pointer points to NULL.
3.2, in step 2.3.2 above, inserting the picture corresponding to S_cur[i] into Cache_next proceeds as follows: if pointer is NULL, the picture is inserted at the tail of the Cache_next linked list and becomes the first node of nexList, and pointer is then set to this new node; otherwise the picture is inserted at the tail of Cache_next, lengthening nexList, while ensuring that the order of nexList's nodes stays consistent with the positions of their corresponding elements in S_next_prefix. At the end of an epoch, nexList is equivalent to Cache_next, and all pictures covered by S_next_prefix have been inserted into Cache_next.
3.3, in step 2.3.1 above, judging whether S_cur[i] hits in Cache_next only requires checking whether S_cur[i] is held by the first node of curList: if so, S_cur[i] hits in Cache_next; otherwise it misses. When S_cur[i] hits in Cache_next, it is removed from curList and curList shortens; at the end of an epoch, curList is empty.
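Continuing the CacheNext sketch from 1.2, the constant-time hit test of step 3.3 might look like the following; id_of is a hypothetical accessor mapping a node's data field to its sample ID:

    def lookup(cache, sample_id, id_of):
        # A hit can only ever be the head of curList: nexList was filled in
        # S_next_prefix order last epoch, and this epoch's S_cur is last
        # epoch's S_next, so curList is consumed strictly front to back.
        head = cache.head
        if head is not None and head is not cache.pointer \
                and id_of(head.data) == sample_id:
            # Hit: unlink the head node, shortening curList by one.
            cache.head = head.nex
            if cache.head is not None:
                cache.head.pre = None
            else:
                cache.tail = None
            return head.data
        return None  # miss: the caller reads from the back-end storage system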
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for accelerating deep learning training task data loading is characterized by comprising the following steps:
S1, dividing a region of memory to serve as a cache, denoted Cache_next;
S2, constructing a double-random-sequence scheme to determine the order in which training set data enter the neural network, wherein the elements of each random sequence correspond one-to-one with the training set data, and during each training epoch two distinct, mutually independent random sequences (old and new) exist simultaneously;
S3, before the first training epoch, generating a random sequence S_next; at the beginning of every training epoch, assigning the existing sequence S_next to S_cur, which determines the order in which the current epoch's data enter the neural network, and then generating a new random sequence, again denoted S_next, to determine the order for the next epoch, S_next containing a prefix subsequence S_next_prefix that covers the training data to be used at the start of the next epoch; as the data loading stage executes epoch by epoch, traversing S_cur in each epoch and, for each element S_cur[i], fetching the corresponding training sample from Cache_next or from the back-end storage system into memory, then updating Cache_next with reference to S_next_prefix, comprising:
S31, when S_cur[i] hits in the front segment curList of Cache_next, copying the corresponding training sample from curList into memory and deleting the entry for S_cur[i] from curList; if S_cur[i] also appears in S_next_prefix, inserting the sample into the rear segment nexList of Cache_next; Cache_next is logically divided into curList, which caches data to be used in the current training epoch, and nexList, which caches data to be used in the next training epoch;
S32, when S_cur[i] misses in the front segment curList of Cache_next, reading the corresponding training sample from the back-end storage system into memory; if S_cur[i] appears in S_next_prefix, inserting the sample into the rear segment nexList of Cache_next;
S33, when the traversal of S_cur finishes, clearing S_cur and leaving only the random sequence S_next;
S4, completing the current epoch; if the number of completed epochs is less than the preset number N, returning to S3 to start the next epoch; once all training epochs are complete, the deep learning training task ends.
2. The method for accelerating the loading of deep learning training task data according to claim 1, wherein said S2 comprises the following steps:
S21, before the first epoch of the deep learning training task starts, generating a random sequence denoted S_next;
S22, at the beginning of each epoch, assigning S_next to S_cur, where S_cur determines the order in which the current epoch's data enter the neural network;
S23, generating another random sequence with a new random seed and assigning it to S_next, where S_next determines the order in which the next epoch's data enter the neural network, so that two distinct, mutually independent random sequences exist in the system simultaneously;
S24, when an epoch ends, clearing S_cur and retaining S_next.
3. The method for accelerating the loading of deep learning training task data according to claim 1, wherein said S1 comprises the following steps:
S11, before the first training epoch begins, Cache_next is empty, so curList and nexList are also empty;
S12, during the first training epoch, curList remains empty, and all data inserted into Cache_next are located in nexList;
S13, when a non-first training epoch starts, all data of nexList are transferred into curList, and nexList becomes empty;
S14, during a non-first training epoch, data hit in curList are removed, so curList gradually shortens, while data newly inserted into Cache_next all enter nexList, which gradually lengthens;
S15, when a training epoch ends, the length of curList is zero and the length of nexList equals the length of Cache_next;
S16, the order of data in nexList is kept consistent with the order of their IDs in S_next_prefix.
4. The method for accelerating deep learning training task data loading according to claim 1, wherein in S1, Cache_next is organized as a linked list and logically divided into curList and nexList.
5. The method for accelerating deep learning training task data loading according to claim 1, wherein in S1, the capacity of Cache_next is determined according to the memory actually available in the system.
6. The method for accelerating deep learning training task data loading according to claim 1, wherein in S2, the elements of the random sequences are the IDs of the training set data.
7. The method for accelerating deep learning training task data loading according to claim 1, wherein in S2, the random sequences are generated by a random function, and the random seed required by the random function is initialized from the computer clock.
8. The method for accelerating deep learning training task data loading according to claim 1, wherein in S2, the elements of each random sequence correspond one-to-one with the training set data, and the sequence length equals the total number of samples in the training set.
9. The method for accelerating deep learning training task data loading according to claim 1, wherein in S3, the length of S_next_prefix equals the number of nodes Cache_next can hold; since the capacity of Cache_next is predefined, the number of elements in S_next_prefix is set before the deep learning training task begins.
10. The method for accelerating deep learning training task data loading according to claim 1, wherein in S3, after a training sample enters memory it passes through the data enhancement stage, whose operations modify the original data in memory in place; the enhanced sample forms a batch together with other enhanced samples and then enters the neural network model training stage.
CN202111221953.7A · Priority date: 2021-10-20 · Filing date: 2021-10-20 · Method for accelerating deep learning training task data loading · Active · CN113656333B

Priority Applications (1)

Application Number: CN202111221953.7A · Priority Date: 2021-10-20 · Filing Date: 2021-10-20 · Title: Method for accelerating deep learning training task data loading


Publications (2)

Publication Number · Publication Date
CN113656333A · 2021-11-16
CN113656333B · 2022-03-18

Family

ID=78494740

Family Applications (1)

Application Number: CN202111221953.7A · Title: Method for accelerating deep learning training task data loading · Priority Date: 2021-10-20 · Filing Date: 2021-10-20 · Status: Active

Country Status (1)

Country: CN · CN113656333B

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892484A (en) * 2018-07-11 2020-03-17 因美纳有限公司 Deep learning-based framework for identifying sequence patterns causing sequence-specific errors (SSEs)
CN111917474B (en) * 2020-07-22 2022-07-29 北京理工大学 Implicit triple neural network and optical fiber nonlinear damage balancing method
CN111858072B (en) * 2020-08-06 2024-02-09 华中科技大学 Resource management method and system for large-scale distributed deep learning
DE202020107550U1 (en) * 2020-12-23 2021-03-11 Ever Health Bio Medical International Co., Ltd. Personalized nutritional interpretation system based on transfer learning

Also Published As

Publication Number · Publication Date
CN113656333A · 2021-11-16

Similar Documents

Publication Publication Date Title
KR102465896B1 (en) Modification of machine learning models to improve locality
CN104050092B (en) A kind of data buffering system and method
JP2004514147A (en) Streaming architecture for waveform processing
EP0221358A2 (en) Sort string generation in a staged storage system
CN105739951B (en) A kind of L1 minimization problem fast solution methods based on GPU
CN114968588A (en) Data caching method and device for multi-concurrent deep learning training task
Helman et al. Designing practical efficient algorithms for symmetric multiprocessors
US20160224581A1 (en) Recursive Multi-Threaded File System Scanner For Serializing File System Metadata Exoskeleton
Chen et al. moDNN: Memory optimal DNN training on GPUs
CN113656333B (en) Method for accelerating deep learning training task data loading
CN115712583B (en) Method, device and medium for improving distributed cache cross-node access performance
CN110795042A (en) Method for writing and flushing metadata of full flash memory storage system and related components
CN102413170A (en) Graphic data client buffer memory method based on FLEX
CN116107754A (en) Memory management method and system for deep neural network
CN116205273A (en) Multi-agent reinforcement learning method for optimizing experience storage and experience reuse
CN107273310A (en) A kind of read method of multi-medium data, device, medium and equipment
US20230394307A1 (en) Data caching method and apparatus for multiple concurrent deep learning training tasks
CN117215973A (en) Processing method of cache data, deep learning training method and system
CN112561038A (en) Batch data set construction method and device, electronic equipment and storage medium
KR20200092900A (en) Method for overcoming catastrophic forgetting by neuron-level plasticity control and computing system performing the same
Wen et al. A swap dominated tensor re-generation strategy for training deep learning models
CN114528111B (en) FPGA chip for data recall and data recall method
CN111191774A (en) Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
US11455533B2 (en) Information processing apparatus, control method, and non-transitory computer-readable storage medium for storing information processing program
CN108228801B (en) Jump table multithreading optimization method and device based on multi-core processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant