CN114385554A - DL training data reading method based on index shuffle - Google Patents
Info
- Publication number
- CN114385554A CN114385554A CN202210062232.4A CN202210062232A CN114385554A CN 114385554 A CN114385554 A CN 114385554A CN 202210062232 A CN202210062232 A CN 202210062232A CN 114385554 A CN114385554 A CN 114385554A
- Authority
- CN
- China
- Prior art keywords
- shuffle
- index
- data
- array
- array index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000012549 training Methods 0.000 title claims abstract description 46
- 238000000034 method Methods 0.000 title claims abstract description 36
- 230000002085 persistent effect Effects 0.000 claims abstract description 6
- 238000013528 artificial neural network Methods 0.000 claims description 18
- 238000003491 array Methods 0.000 claims description 9
- 239000000523 sample Substances 0.000 description 13
- 230000006870 function Effects 0.000 description 5
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 230000007547 defect Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000003321 amplification Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 230000006993 memory improvement Effects 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0842—Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/176—Support for shared access to files; File sharing support
- G06F16/1767—Concurrency control, e.g. optimistic or pessimistic approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1021—Hit rate improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1041—Resource optimization
- G06F2212/1044—Space efficiency improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1048—Scalability
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1056—Simplification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses a DL training data reading method based on an index shuffle, comprising the following steps: persisting the data into non-volatile memory and constructing an array index; partitioning the array index and performing a multi-threaded lock-free parallel shuffle to obtain a shuffled array index; and traversing the shuffled array index and pre-reading the data from the non-volatile memory into DRAM accordingly. The method simplifies the file system, improves the read performance of the data set, and ultimately improves the training speed of DNNs. The index-shuffle-based DL training data reading method can be widely applied in the field of computer systems.
Description
Technical Field
The invention relates to the field of computer systems, in particular to a DL training data reading method based on an index shuffle.
Background
At present, shuffle strategies for existing DNN training data sets mainly suffer from the following disadvantages: 1) the index structure of a default file system is complex, and scalability to metadata-intensive, large-scale DNN data sets is poor; 2) shuffling the raw data itself places too heavy a load on memory and the CPU; 3) shuffling based on metadata makes disk I/O the main bottleneck; 4) the cache hit rate is extremely low, causing unexpected disk I/O that delays the data reading process; 5) a single-threaded shuffle is inefficient, while a multi-threaded shuffle incurs lock overhead.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a DL training data reading method based on an index shuffle that simplifies the file system, improves the reading performance of the data set, and ultimately improves the training speed of DNNs.
The first technical scheme adopted by the invention is as follows: a DL training data reading method based on an index shuffle comprises the following steps:
s1, persisting the data into non-volatile memory and constructing an array index;
s2, partitioning the array index and performing a multi-threaded lock-free parallel shuffle to obtain a shuffled array index;
s3, traversing the shuffled array index and pre-reading the data from the non-volatile memory into DRAM accordingly.
Further, the step of persisting the data into the non-volatile memory and constructing the array index specifically includes:
s11, acquiring a data set of the deep neural network;
s12, loading the data set into nonvolatile memory and recording the address of each sample in an array, one data set corresponding to one array;
and S13, obtaining an array index.
Further, the step of partitioning the array index and performing a multi-threaded lock-free parallel shuffle to obtain a shuffled array index specifically includes:
s21, randomly dividing the array index and generating a plurality of threads in the shuffle stage of each epoch in deep neural network training;
and S22, based on the threads, shuffling the array according to the array index to obtain a shuffled array index.
Further, still include:
and S4, if the target accuracy of the deep neural network training has not been reached, entering the next epoch and returning to step S21.
Further, the step of randomly dividing the array index and generating a plurality of threads in a shuffle stage of each epoch in deep neural network training specifically includes:
s211, randomly dividing the array index to generate random numbers and obtain a plurality of sub-arrays;
and S212, generating a corresponding number of threads according to the random number in the shuffle stage of each epoch in the deep neural network training.
Further, the random number equals the number of sub-arrays, each thread is responsible for the shuffle of only one sub-array, and the threads are isolated from each other.
Further, the traversal of the array index is a serpentine traversal, specifically:
in the first epoch, reading data with a forward traversal of the index, and recording the starting subscript of the cached data;
in the second epoch, using a reverse traversal of the index and reading all samples in descending subscript order;
in the third epoch, reading data with a forward traversal of the index;
and repeating these traversal steps until the target accuracy of the deep neural network training is reached, the traversal order of each epoch being opposite to that of the previous one.
The method has the following beneficial effects: addressing the shuffle and data-reading inefficiencies of DNN training computation frameworks, the invention provides a dedicated file system; designs an array index structure better suited to the data-set and access characteristics of DNN training; realizes, on top of this array structure, an efficient multi-threaded lock-free parallel shuffle strategy; and, through a serpentine traversal, preferentially reads the cache and initiates pre-reading to improve read throughput. The ultimate aim is to relieve the data-reading bottleneck in DNN training and improve the execution efficiency of DNN training jobs.
Drawings
FIG. 1 is a flowchart illustrating steps of a DL training data reading method based on an index shuffle according to the present invention;
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Based on a traditional Deep Neural Network (DNN) computing framework, the method moves the shuffle operation on the data set from the application layer down to the file system layer, introduces non-volatile memory (NVM), and exploits NVM byte addressability to design an efficient index structure, thereby simplifying metadata, realizing a multi-threaded lock-free parallel shuffle, and making the file system better match the data access pattern of DNN training. In traditional DNN computation frameworks, dedicated modules are responsible for shuffling the data set.
As shown in fig. 1, the present invention provides a DL training data reading method based on an index shuffle, which includes the following steps:
s1, persisting the data into non-volatile memory and constructing an array index;
s11, acquiring a data set of the deep neural network;
s12, loading the data set into nonvolatile memory and recording the address of each sample in an array, one data set corresponding to one array;
and S13, obtaining an array index.
Specifically, the index structure in the invention is designed as follows: when the data set is loaded into nonvolatile memory, the address of each sample is recorded in an array, with one data set corresponding to one array. Each sample needs only 4 bytes of metadata, so the total space required by the whole index structure is the number of samples multiplied by 4 bytes. The address space of the array in memory is contiguous, and each element can be located directly by its subscript, so there is no extra pointer overhead. Because there are no parent-child or similar semantics between elements, swapping any two elements cannot break the index structure. The greatest advantage of this design is that the index itself can be shuffled directly; there is no redundant metadata beyond the index, maximizing space utilization.
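As a minimal illustration of this index layout, the following sketch packs per-sample byte offsets into a flat, contiguous 4-bytes-per-entry array; `build_array_index` and the sample sizes are hypothetical names, not from the patent.

```python
from array import array

def build_array_index(sample_sizes):
    """Build a flat array index over samples packed back-to-back in NVM.

    Each entry is the byte offset of one sample, stored as an unsigned
    32-bit integer ("I"), i.e. about 4 bytes of metadata per sample as
    in the description above. `sample_sizes` is a hypothetical list of
    per-sample byte lengths.
    """
    offsets = array("I")        # contiguous, subscript-addressable, no pointers
    pos = 0
    for size in sample_sizes:
        offsets.append(pos)     # locate each sample directly by its subscript
        pos += size
    return offsets

index = build_array_index([128, 256, 64])
print(list(index))                    # → [0, 128, 384]
print(len(index) * index.itemsize)    # total metadata bytes (itemsize is 4 on common platforms)
```

Because the entries carry no parent-child relationships, any two can be swapped freely, which is what makes the in-place index shuffle possible.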
S2, partitioning the array index and performing a multi-threaded lock-free parallel shuffle to obtain a shuffled array index;
s21, randomly dividing the array index and generating a plurality of threads in the shuffle stage of each epoch in deep neural network training;
s211, randomly dividing the array index to generate random numbers and obtain a plurality of sub-arrays;
and S212, generating a corresponding number of threads according to the random number in the shuffle stage of each epoch in the deep neural network training.
And S22, based on the threads, shuffling the array according to the array index to obtain a shuffled array index.
Specifically, based on this index structure, the invention designs a multi-threaded lock-free parallel shuffle with the following strategy: the array is divided into several sub-arrays, and a shuffle thread is started for each; each thread is responsible for only one sub-array, the threads are isolated from one another, and no message passing or data sharing occurs. Each element can therefore be accessed by only one thread, no locking is needed during access, and the parallelism is extremely efficient. The shuffle algorithm itself is: traverse the sub-array and swap each element with an element at a random later position, with complexity O(n). Taken alone, this strategy is a pseudo-shuffle, because each element can only appear within its own sub-array. Therefore, to better guarantee the randomness of sample reading and hence the final accuracy of model training, the sub-arrays are divided randomly. Concretely: when dividing, a random number is generated; this number is the number of sub-arrays, i.e. the number of threads, and the array is then split evenly into that many groups of adjacent elements. In this way each element can land in a different position in different epochs, and, given enough epochs, every element has a probability of reaching any position in the whole array.
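The strategy above can be sketched as follows: a random thread count, equal contiguous sub-arrays, and a per-thread in-place Fisher-Yates pass. This is an illustrative sketch, not the patent's implementation; `lockfree_shuffle` and `shuffle_slice` are hypothetical names.

```python
import random
import threading

def shuffle_slice(arr, lo, hi, rng):
    # Fisher-Yates within [lo, hi): swap each element with one at a
    # random later position inside its own sub-array, O(n) overall.
    for i in range(lo, hi - 1):
        j = rng.randint(i, hi - 1)
        arr[i], arr[j] = arr[j], arr[i]

def lockfree_shuffle(index):
    """Pick a random thread count t, split the index into t contiguous
    sub-arrays, and shuffle each in its own thread. No locks are
    needed: the slices are disjoint, so no element is ever shared."""
    n = len(index)
    t = random.randint(2, 8)                 # random number = sub-array/thread count
    step = (n + t - 1) // t
    threads = []
    for k in range(t):
        lo, hi = k * step, min((k + 1) * step, n)
        rng = random.Random()                # per-thread RNG, nothing shared
        th = threading.Thread(target=shuffle_slice, args=(index, lo, hi, rng))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()
    return index

idx = lockfree_shuffle(list(range(100)))
print(sorted(idx) == list(range(100)))   # still a permutation → True
```

Because the sub-array boundaries change with the random `t` each epoch, an element's reachable positions differ from epoch to epoch, which is how the pseudo-shuffle of one epoch becomes a full shuffle over many epochs.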
In DNN training, the data-set access pattern in each epoch has two features: first, the access order of the samples is random; second, each sample is accessed exactly once. That is, each element of the index is read once per epoch. A tree structure does not support an in-place shuffle, so its elements cannot all be read by a single traversal; each sample access becomes a random read of the index with time complexity O(log n). With the array, however, all elements can be scanned in full simply by traversing the array, and the per-access time complexity drops to O(1).
S3, traversing the shuffled array index and pre-reading the data from the non-volatile memory into DRAM accordingly.
And S4, if the target accuracy of the deep neural network training has not been reached, entering the next epoch and returning to step S21.
Further, as a preferred embodiment of the method, the traversal of the array index is a serpentine traversal, specifically:
in the first epoch, reading data with a forward traversal of the index, and recording the starting subscript of the cached data;
in the second epoch, using a reverse traversal of the index and reading all samples in descending subscript order;
in the third epoch, reading data with a forward traversal of the index;
and repeating these traversal steps until the target accuracy of the deep neural network training is reached, the traversal order of each epoch being opposite to that of the previous one.
In particular, DRAM read performance is superior to that of NVM. To exploit this fully, the invention additionally provides a traversal mode for the index when reading the data set. The system still uses the operating system's default LRU cache replacement policy, so in each epoch the last group of samples accessed is cached in memory, and in the next epoch that cached group is read first. Concretely: in the first epoch, the index is traversed forward to read the data, and the starting subscript of the cached data is recorded; during the shuffle, the data are divided into a cached part and an uncached part, and the shuffle strategy is executed independently on each. In the second epoch, since the samples with the largest subscripts are all in the cache, all samples are read by a reverse traversal in descending subscript order. In the third epoch, the cache contents have been replaced by the samples accessed last in the second epoch, i.e. those with the smallest subscripts, so a forward traversal is used; subsequent epochs follow by analogy. Each traversal is thus in the opposite order to the previous one, forming a serpentine traversal. In every epoch the cached data are read first, raising the cache hit rate and thereby improving read performance. To fetch still more data from DRAM, the method further implements a pre-reading strategy: the data reading order is obtained from the index, and the data about to be accessed are read into DRAM in advance, guaranteeing the data supply to the GPU, reducing GPU wait time, and improving resource utilization.
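The alternating direction described above can be sketched in a few lines. This shows only the direction flip per epoch; in the actual scheme the subscripts would come from the shuffled index, and `serpentine_order` is a hypothetical name.

```python
def serpentine_order(n_samples, epoch):
    """Yield sample subscripts for one epoch: forward on even-numbered
    epochs (0, 2, ...), reverse on odd ones, so the tail of each pass -
    still resident in the DRAM cache under LRU - is read first on the
    next pass."""
    subscripts = range(n_samples)
    return list(subscripts) if epoch % 2 == 0 else list(reversed(subscripts))

print(serpentine_order(5, 0))  # → [0, 1, 2, 3, 4]
print(serpentine_order(5, 1))  # → [4, 3, 2, 1, 0]
```

A pre-reader would walk the same order slightly ahead of the consumer, copying upcoming samples from NVM into DRAM before the GPU asks for them.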
The invention implements a file system with the above functionality that exposes a shuffle(dataset) interface to upper-level PyTorch-based deep learning applications, where dataset specifies the data set used by the DNN training task. A developer calls this function at the start of each epoch to complete the shuffle of the data set.
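The per-epoch call pattern might look like the following toy sketch. `ToyFS`, its method names, and the data-set id are hypothetical stand-ins for the real file-system interface; only the shape of the loop (shuffle once per epoch, read each sample once) reflects the description.

```python
import random

class ToyFS:
    """Minimal in-memory stand-in for the dedicated file system."""
    def __init__(self, datasets):
        self.datasets = datasets

    def shuffle(self, dataset_id):
        # Placeholder for the lock-free index shuffle: return a fresh
        # permutation of sample subscripts for this epoch.
        index = list(range(len(self.datasets[dataset_id])))
        random.shuffle(index)
        return index

    def read_sample(self, dataset_id, subscript):
        return self.datasets[dataset_id][subscript]

def train_one_epoch(fs, dataset_id, step):
    # Called at each epoch start, as the description suggests.
    for subscript in fs.shuffle(dataset_id):
        step(fs.read_sample(dataset_id, subscript))

fs = ToyFS({"cifar": ["s0", "s1", "s2", "s3"]})
seen = []
train_one_epoch(fs, "cifar", seen.append)
print(sorted(seen))    # every sample read exactly once → ['s0', 's1', 's2', 's3']
```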
In summary, addressing the shuffle and data-reading inefficiencies of DNN training computation frameworks, the invention provides a dedicated file system; designs an array index structure better suited to the data-set and access characteristics of DNN training; realizes, on top of this array structure, an efficient multi-threaded lock-free parallel shuffle strategy; and, through a serpentine traversal, preferentially reads the cache and initiates pre-reading to improve read throughput. The ultimate aim is to relieve the data-reading bottleneck in DNN training and improve the execution efficiency of DNN training jobs.
The beneficial effects of the invention specifically comprise the following:
1. Improved deep learning training speed: without affecting the randomness of sample reading or the final accuracy of DNN training, a lock-free parallel shuffle is realized, improving shuffle efficiency by a multiple that scales with the number of threads. Data are read by traversing the array, and each index lookup is O(1), which is simple and fast. With the cache-first and pre-reading strategies, all data can be obtained from DRAM, raising the speed and throughput with which the GPU obtains data. The shuffle and data-reading stages are no longer bottlenecks, and the overall performance of DNN training improves.
2. Space savings: at the application layer, the path information of each sample no longer needs to be maintained; only the data set id is needed. At the file-system layer, the address space of the array structure is contiguous and no extra space is needed for pointers, so a large amount of memory can be saved.
3. Improved resource utilization: with faster data reading, the idle time the GPU spends waiting for data shrinks, and GPU resources are used more fully; the multi-threaded shuffle effectively exploits multi-core CPUs, improving CPU utilization; and the memory saved likewise raises memory utilization.
4. Improved system scalability: a metadata-based shuffle touches little data, and the array structure has low time and space overhead, can store large amounts of data, is easy to extend and maintain, and adapts well to data-set growth. The saved memory and CPU cycles can be used for data preprocessing or for storing intermediate data, so the system also accommodates the growth of DNN models well.
A DL training data reading device based on index shuffle:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one program causes the at least one processor to implement the index shuffle-based DL training data reading method as described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium having stored therein instructions executable by a processor, the storage medium comprising: the processor-executable instructions, when executed by the processor, are for implementing an index shuffle-based DL training data reading method as described above.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. A DL training data reading method based on an index shuffle is characterized by comprising the following steps:
s1, persisting the data into non-volatile memory and constructing an array index;
s2, partitioning the array index and performing a multi-threaded lock-free parallel shuffle to obtain a shuffled array index;
s3, traversing the shuffled array index and pre-reading the data from the non-volatile memory into DRAM accordingly.
2. The method as claimed in claim 1, wherein the step of persisting data into non-volatile memory and constructing an array index includes:
s11, acquiring a data set of the deep neural network;
s12, loading the data set into nonvolatile memory and recording the address of each sample in an array, one data set corresponding to one array;
and S13, obtaining an array index.
3. The method for reading DL training data based on the index shuffle as claimed in claim 2, wherein the step of partitioning the array index and performing a multi-threaded lock-free parallel shuffle to obtain a shuffled array index comprises:
s21, randomly dividing the array index and generating a plurality of threads in the shuffle stage of each epoch in deep neural network training;
and S22, based on the threads, shuffling the array according to the array index to obtain a shuffled array index.
4. The method for reading DL training data based on the index shuffle of claim 3, further comprising:
and S4, if the target accuracy of the deep neural network training has not been reached, entering the next epoch and returning to step S21.
5. The method as claimed in claim 4, wherein the step of randomly dividing the array index and generating a plurality of threads in the shuffle stage of each epoch in the deep neural network training specifically includes:
s211, randomly dividing the array index to generate random numbers and obtain a plurality of sub-arrays;
and S212, generating a corresponding number of threads according to the random number in the shuffle stage of each epoch in the deep neural network training.
6. The method as claimed in claim 5, wherein the random number corresponds to the number of sub-arrays, each thread is responsible for only one sub-array and the threads are isolated from each other.
7. The method as claimed in claim 6, wherein the traversal of the array index is a serpentine traversal, specifically:
in the first epoch, reading data with a forward traversal of the index, and recording the starting subscript of the cached data;
in the second epoch, using a reverse traversal of the index and reading all samples in descending subscript order;
in the third epoch, reading data with a forward traversal of the index;
and repeating these traversal steps until the target accuracy of the deep neural network training is reached, the traversal order of each epoch being opposite to that of the previous one.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210062232.4A CN114385554A (en) | 2022-01-19 | 2022-01-19 | DL training data reading method based on index shuffle |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210062232.4A CN114385554A (en) | 2022-01-19 | 2022-01-19 | DL training data reading method based on index shuffle |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114385554A true CN114385554A (en) | 2022-04-22 |
Family
ID=81202918
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210062232.4A Pending CN114385554A (en) | 2022-01-19 | 2022-01-19 | DL training data reading method based on index shuffle |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114385554A (en) |
-
2022
- 2022-01-19 CN CN202210062232.4A patent/CN114385554A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11349639B2 (en) | Circuit and method for overcoming memory bottleneck of ASIC-resistant cryptographic algorithms | |
US11086792B2 (en) | Cache replacing method and apparatus, heterogeneous multi-core system and cache managing method | |
US4991088A (en) | Method for optimizing utilization of a cache memory | |
US11526960B2 (en) | GPU-based data join | |
Helman et al. | Designing practical efficient algorithms for symmetric multiprocessors | |
CN1831824A (en) | Buffer data base data organization method | |
US7702875B1 (en) | System and method for memory compression | |
Kim et al. | Efficient multi-GPU memory management for deep learning acceleration | |
EP0974907A2 (en) | A method for determining an optimized data organization | |
Li et al. | A multi-hashing index for hybrid dram-nvm memory systems | |
CN1818887A (en) | Built-in file system realization based on SRAM | |
US11429299B2 (en) | System and method for managing conversion of low-locality data into high-locality data | |
Wang et al. | Circ-Tree: A B+-Tree variant with circular design for persistent memory | |
US20230385258A1 (en) | Dynamic random access memory-based content-addressable memory (dram-cam) architecture for exact pattern matching | |
Dasari et al. | High performance implementation of planted motif problem using suffix trees | |
Nakano et al. | The random address shift to reduce the memory access congestion on the discrete memory machine | |
CN114385554A (en) | DL training data reading method based on index shuffle | |
Chacón et al. | FM-index on GPU: A cooperative scheme to reduce memory footprint | |
US11816025B2 (en) | Hardware acceleration | |
US10579519B2 (en) | Interleaved access of memory | |
Nakano et al. | The super warp architecture with random address shift | |
Cheng et al. | Alleviating bottlenecks for dnn execution on gpus via opportunistic computing | |
CN112433672A (en) | Solid state disk reading method and device | |
US20230393746A1 (en) | Hardware revocation engine for temporal memory safety | |
CN110334251B (en) | Element sequence generation method for effectively solving rehash conflict |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||