CN112379849B - Parallel deep learning training data input method and system based on sequence predictability - Google Patents

Parallel deep learning training data input method and system based on sequence predictability

Info

Publication number
CN112379849B
CN112379849B (application CN202110062697.5A)
Authority
CN
China
Prior art keywords
data
node
training
size
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110062697.5A
Other languages
Chinese (zh)
Other versions
CN112379849A (en)
Inventor
何水兵
陈伟剑
杨斯凌
陈平
陈帅犇
曾令仿
任祖杰
杨弢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab filed Critical Zhejiang University ZJU
Priority to CN202110062697.5A
Publication of CN112379849A
Application granted
Publication of CN112379849B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0656 Data buffering arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0673 Single storage device
    • G06F 3/0674 Disk device
    • G06F 3/0676 Magnetic disk device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention provides a parallel deep learning training data input method based on sequence predictability. The method exploits the fact that the data access sequence can be determined in advance when data is prefetched and cached: the size of the prefetch data block read from the underlying parallel file system is chosen by jointly considering cache hit rate and disk access performance, and data is then distributed and cached accordingly, which greatly improves the local hit rate of the first round of training in large-scale training. In subsequent rounds, data requests are merged and cache replacement is performed in advance according to the data to be used in the next round, reducing communication overhead across the whole distributed training process and speeding up data input at every node. The invention also provides a data input system based on this method, comprising a random sequence generation module, a data prefetching module, and a cache replacement module; it accelerates reading data from storage while still guaranteeing globally random data access.

Description

Parallel deep learning training data input method and system based on sequence predictability
Technical Field
The invention belongs to the field of artificial intelligence within computer science, and in particular relates to accelerating data input in large-scale distributed neural network training scenarios.
Background
To train deep neural networks with higher prediction accuracy and stronger generalization, ever larger amounts of training data are being used, so distributed storage of the training data has become a necessity. Much research focuses on the computation and communication processes of distributed training, making the computation and communication of large-scale distributed neural network training efficient; however, when the number of training nodes is very large, the data supply speed becomes a key factor limiting the whole training process.
When a traditional parallel file system cannot meet the required I/O speed, methods such as prefetching and caching data with a burst buffer or designing a dedicated file system have been proposed as remedies. However, existing prefetching and caching methods for accelerating data input usually consider only the random-access nature of the sample sequence and do not fully incorporate the predictability of the I/O access sequence into the corresponding I/O optimization strategies. As a result, the data prefetched from the underlying file system or placed into the cache does not match what the upper-layer training task actually needs, the local cache hit rate drops, data input remains inefficient, and the extremely high I/O bandwidth demanded by the training process cannot be effectively guaranteed.
Disclosure of Invention
To solve the problem of low data input efficiency in large-scale distributed deep neural network training, the invention provides an efficient data input method and system for large-scale deep learning based on the predictability of the sample sequence. In particular, when data is prefetched and cached, the fact that the data access sequence can be determined in advance is fully exploited, greatly improving the cache hit rate of the first round of training in large-scale training. In subsequent rounds, cache replacement is performed according to the data to be used in the next round, and data requests are merged, reducing communication overhead across the whole distributed training process and speeding up data input at every node.
The technical solution of the invention is as follows:
Step one: when the first round of training of the neural network starts, the same global random sequence is generated at each node, and each node takes out the training data numbers belonging to it;
Step two: determine the size of the prefetch data block, i.e. the number of samples it contains, used when data is prefetched from the underlying parallel file system, by jointly considering cache hit rate and disk access performance;
Step three: cache the data blocks into each node according to the prefetch data block size determined in step two, using the allocation that yields the highest cache hit rate;
Step four: in each round of training after the first, generate the random sequence of the next round in advance, and perform cache replacement on each node according to the random sequence to be accessed in the next round, the locally cached data, and the data used in the current round, caching the data to be used in the next round on the node ahead of time; repeat until training finishes.
Further, step two specifically comprises the following substeps:
(2.1) Denote the prefetch data block size by bi, where i is a natural number indicating the iteration; the initial prefetch data block size is b0 = N/M, where N is the total number of training samples and M is the number of parallel training nodes;
(2.2) The master node performs a simulated data allocation to each node according to the prefetch data block size and determines the system's cache hit rate hi for the current block size: hi = ni/N, where N is the total number of training samples and ni is the total number of samples hit across all nodes;
(2.3) After the local cache hit rate of each iteration is calculated, compare hi with h(i-1); if hi is larger than h(i-1), go to step (2.4); otherwise, take the prefetch data block size of the current iteration as the final prefetch data block size;
(2.4) If bi is less than bmin, the final prefetch data block size is b = bmin; otherwise, the prefetch data block size of the next iteration is half that of the previous iteration, and the procedure returns to step (2.2) to recalculate the cache hit rate; here bmin = s × t is the minimum prefetch data block size, s is the disk seek time, and t is the disk's sustained transfer rate.
Further, in step (2.2), the master node allocates data to each node according to the prefetch data block size, specifically:
The master node divides all training data into blocks of the prefetch data block size, obtaining n data blocks, and each node caches k = n/M blocks;
Starting from node 1, the master node traverses all data blocks, finds the k blocks containing the largest number of hit samples for node 1, and assigns them to node 1; it then traverses the remaining data blocks for node 2, finds the k blocks with the largest number of hit samples for node 2, assigns them to node 2, and so on; the data blocks assigned to the nodes are mutually exclusive;
After the final prefetch data block size is determined, the master node communicates the numbers of the k data blocks assigned to each node to the other nodes.
Further, in step four, performing cache replacement according to the random sequence to be accessed in the next round, each node's locally cached data, and the data used in the current round specifically includes: each node traverses the training data numbers assigned to it for the next round; if a training data number is not in the node's local cache, a remote request is initiated to exchange the data, and the training data is deleted on the remote node; if the training data number is already in the node's local cache, it is left unchanged.
Further, in each round of training, if a node would send multiple requests to the same node, the request merging module merges them into a single request before sending, avoiding repeatedly sending small requests to the same node.
Based on the method, the invention also provides a parallel deep learning training data input system based on sequence predictability, which comprises:
the random sequence generating module is used for generating the same global random sequence at each node;
the data prefetching module is used for determining the size of a prefetched data block and performing data block allocation and caching on each node according to the size of the prefetched data block;
and the cache replacement module is used for carrying out cache replacement in each node in advance according to the random sequence to be accessed in the next round, the locally cached data and the data used in the round, and caching the data to be used in the next round in the node in advance.
Further, the data prefetch module includes:
the prefetch granularity decision module is used for determining the size of a prefetch data block when data are prefetched from a bottom layer parallel file system by combining the cache hit rate and the disk access performance;
and the pre-fetching data block distribution module is used for distributing and caching data to each node according to the size of the pre-fetching data block.
Furthermore, the system also comprises a request merging module, used in each round of training to merge a node's multiple small data requests to the same destination node into one large request before sending.
Beneficial effects of the invention: during parallel training of a large-scale neural network, the speed of reading data from storage is increased while globally random data access is still guaranteed.
Drawings
FIG. 1 is a diagram illustrating a conventional data prefetching method;
FIG. 2 is a diagram illustrating a data prefetching method according to the present invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 shows a conventional data prefetching method. If the number of training nodes is M, each training node first divides all training data into M groups and caches its own group locally from the parallel file system. To simplify metadata management, data is not moved between training nodes during the subsequent training. As a result, the data prefetched from the underlying file system or placed in the cache does not match what the upper-layer training task needs, so the cache hit rate drops and data input is inefficient. In addition, across the many iterations within each round of training, when a node requests several pieces of data from the same remote node multiple times, the requests are not merged; multiple small data requests are issued directly.
On this basis, the invention designs an efficient data input system for large-scale deep learning training, comprising a random sequence generation module, a data prefetching module, and a cache replacement module; the data prefetching module comprises a prefetch granularity decision module and a prefetch data block allocation module. Fig. 2 is a schematic diagram of the efficient data input method for large-scale deep learning training based on sample sequence predictability provided by the invention. The invention makes full use of the fact that the data access sequence can be determined in advance: the prefetch data block size used when prefetching from the underlying parallel file system is chosen by jointly considering cache hit rate and disk access performance, and data is then cached accordingly, effectively improving the cache hit rate. As shown in Fig. 2, the efficient data input method of the invention is implemented as follows:
the method comprises the following steps: when the first round of training is started, the random sequence generation module generates the same global random sequence at each node (wherein each node uses the training round number as a random seed, so the generated random sequences are the same), and then respectively extracts the training data numbers belonging to the node according to a remainder mode.
Step two: the prefetch granularity decision module jointly considers cache hit rate and disk access performance and determines the prefetch data block size used when data is prefetched from the underlying parallel file system.
Specifically, a heuristic algorithm is employed to determine the prefetch data block size, mainly comprising the following steps:
(2.1) Let the prefetch data block size be bi, where i is a natural number indicating the iteration; the initial prefetch data block size is b0 = N/M, where N is the total number of training samples and M is the number of parallel training nodes.
(2.2) The prefetch data block allocation module allocates data to each node according to the prefetch data block size and determines the system's cache hit rate hi for the current block size: hi = ni/N, where N is the total number of training samples and ni is the total number of samples hit across all nodes.
Note that the prefetch data block size is not final at this point: the prefetch data block allocation module only performs a pre-allocation in order to obtain the cache hit rate hi corresponding to the current block size bi, and no data is actually cached.
(2.3) After the local cache hit rate of each iteration is calculated, compare hi with h(i-1). If hi is larger than h(i-1), go to step (2.4); otherwise, take the prefetch data block size of the current iteration as the final prefetch data block size.
(2.4) If bi < bmin, the final prefetch data block size is b = bmin; otherwise, the data block size of the next iteration is half that of the previous iteration, i.e. bi = b(i-1)/2, and the procedure returns to step (2.2) to recalculate the cache hit rate. Here bmin = s × t is the minimum prefetch data block size, s is the disk seek time, and t is the disk's sustained transfer rate.
Step (2.4) avoids degrading system performance through the large amount of random disk access that an overly small prefetch data block would cause; a sketch of the full heuristic follows.
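A compact sketch of the heuristic of steps (2.1) to (2.4) is given below, written in Python with illustrative names. `simulate_hit_rate` is a placeholder for the master node's simulated allocation of step (2.2); one possible form of that allocation is sketched after the allocation description further down.

```python
def decide_prefetch_block_size(total_samples, num_nodes, seek_time, transfer_rate,
                               simulate_hit_rate):
    """Shrink the prefetch block size while the simulated hit rate keeps improving.

    simulate_hit_rate(block_size) stands for the master node's pre-allocation of
    step (2.2) and returns hi = ni / N for that block size; no data is cached.
    """
    b = total_samples // num_nodes           # (2.1) b0 = N / M
    b_min = int(seek_time * transfer_rate)   # bmin = s * t (assumed in samples, like b)
    prev_hit = -1.0
    while True:
        hit = simulate_hit_rate(b)           # (2.2) hit rate for the current size
        if hit <= prev_hit:                  # (2.3) no improvement: keep this size
            return b
        if b < b_min:                        # (2.4) too small for efficient disk access
            return b_min
        prev_hit = hit
        b //= 2                              # (2.4) halve and re-simulate
```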
The master node distributes data to each node according to the prefetch data block size, specifically as follows:
The master node partitions all training data into blocks of the prefetch data block size, obtaining n data blocks, and each node caches k = n/M blocks.
Starting from node 1, the master node traverses all data blocks, finds the k blocks containing the largest number of hit samples for node 1, and assigns them to node 1; it then traverses the remaining data blocks for node 2, finds the k blocks with the largest number of hit samples for node 2, assigns them to node 2, and so on until every node has been assigned. Blocks already assigned to earlier nodes are not assigned again, so the data blocks finally assigned to the nodes are mutually exclusive.
Finally, after the prefetch data block size is determined, the master node communicates the numbers of the k data blocks assigned to each node to the other nodes. A sketch of this greedy allocation follows.
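The greedy, mutually exclusive allocation above could be sketched as follows (illustrative names, not the patent's). Because the routine also accumulates the number of samples each node would hit in its own blocks, the returned value corresponds to hi = ni/N from step (2.2), so the same function can serve as the simulated allocation assumed by the block-size heuristic.

```python
def allocate_blocks(num_samples, block_size, node_sequences):
    """Greedily assign mutually exclusive data blocks to nodes (master node side).

    node_sequences[r] holds the sample numbers node r will read in the first
    round (its share of the global random sequence). Blocks are scored by how
    many of a node's samples fall inside them; each node takes its k best
    still-unclaimed blocks.
    """
    num_nodes = len(node_sequences)
    num_blocks = (num_samples + block_size - 1) // block_size
    k = num_blocks // num_nodes               # blocks cached per node
    unclaimed = set(range(num_blocks))
    assignment, total_hits = {}, 0
    for rank, seq in enumerate(node_sequences):
        # Count how many of this node's samples land in each unclaimed block.
        hits = {blk: 0 for blk in unclaimed}
        for sample_id in seq:
            blk = sample_id // block_size     # block holding this sample
            if blk in hits:
                hits[blk] += 1
        # Take the k blocks with the most hits; they become exclusive to this node.
        best = sorted(unclaimed, key=lambda blk: hits[blk], reverse=True)[:k]
        assignment[rank] = best
        total_hits += sum(hits[blk] for blk in best)
        unclaimed -= set(best)
    hit_rate = total_hits / num_samples       # hi = ni / N from step (2.2)
    return assignment, hit_rate
```

In this sketch, once the final block size is fixed the master node would broadcast `assignment` (the block numbers held by each node) to the other nodes, mirroring the communication step above.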
Step three: according to the determined prefetch data block size, the prefetch data block allocation module selects the allocation with the highest hit rate and caches the data blocks into each computing node.
Step four: the random sequence generation module generates the random sequence of the next round in advance, and the cache replacement module performs cache replacement according to the random sequence to be accessed in the next round, the locally cached data, and the data used in the current round, caching the data to be used in the next round locally ahead of time; this repeats until training finishes.
As a preferred embodiment, in this step, performing cache replacement according to the random sequence to be accessed in the next round, each node's locally cached data, and the data used in the current round specifically includes:
Each node traverses the training data numbers assigned to it for the next round; if a training data number is not in the node's local cache, a remote request is initiated to exchange the data, and the training data is deleted on the remote node; if the training data number is already in the node's local cache, it is left unchanged. Because the global random sequences generated by all nodes are identical, every node knows which data was assigned to each node in the previous round, i.e. the cache contents of every node before the current round starts; in other words, every node holds a dynamically updated view of the global metadata distribution. Therefore, when a piece of training data is not present on a node, the node holding it can be located and a remote cache-replacement request can be issued, as sketched below.
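A minimal sketch of one node's side of this cache replacement follows; `owner_of` and `fetch_remote` are hypothetical helpers standing for, respectively, the locally computable lookup of which node currently holds a sample and the remote exchange (after which the remote side drops its copy).

```python
def prepare_next_round(next_ids, local_cache, owner_of, fetch_remote):
    """Pull in everything this node needs for the next round (step four).

    next_ids: training data numbers assigned to this node for the next round.
    local_cache: dict mapping sample number -> data cached on this node.
    owner_of(sample_id): node currently holding the sample, derivable locally
        because every node can replay the same global random sequences.
    fetch_remote(node, sample_id): placeholder for the exchange request; the
        remote node deletes its copy after transferring the data.
    """
    for sample_id in next_ids:
        if sample_id in local_cache:
            continue                                  # already local: leave it as is
        remote = owner_of(sample_id)                  # locate the current holder
        local_cache[sample_id] = fetch_remote(remote, sample_id)
```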
As a preferred embodiment, the efficient data input system further comprises a request merging module: during training, if a node would send multiple requests to the same node, the request merging module merges them and sends a single request, avoiding repeatedly sending small requests to the same node, reducing network transmission overhead, and improving training efficiency.
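Request merging can be as simple as grouping the outstanding sample numbers by the node that currently holds them, as in the sketch below (illustrative names; the actual transport of the batched request is left abstract). Combined with the previous sketch, a node would first collect its missing sample numbers, merge them per destination, and then issue one exchange per remote node.

```python
from collections import defaultdict

def merge_requests(missing_ids, owner_of):
    """Batch this node's remote fetches by destination (request merging module).

    Instead of issuing one small request per sample, all sample numbers held by
    the same remote node are combined into a single request for that node.
    """
    batched = defaultdict(list)
    for sample_id in missing_ids:
        batched[owner_of(sample_id)].append(sample_id)
    return dict(batched)          # one merged request per destination node
```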
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively here, and obvious variations or modifications derived from the invention remain within its scope of protection.

Claims (7)

1. A parallel deep learning training data input method based on sequence predictability is characterized by comprising the following steps:
Step one: when the first round of training of the neural network starts, the same global random sequence is generated at each node, and each node obtains its respective training data numbers;
Step two: determine the size of the prefetch data block used when data is prefetched from the underlying parallel file system according to each node's training data numbers and by jointly considering cache hit rate and disk access performance; this specifically comprises the following substeps:
(2.1) Denote the prefetch data block size by bi, where i is a natural number indicating the iteration; the initial prefetch data block size is b0 = N/M, where N is the total number of training samples and M is the number of parallel training nodes;
(2.2) The master node performs a simulated data allocation to each node according to the prefetch data block size and determines the system's cache hit rate hi for the current block size: hi = ni/N, where N is the total number of training samples and ni is the total number of samples hit across all nodes;
(2.3) After the local cache hit rate of each iteration is calculated, compare hi with h(i-1); if hi is larger than h(i-1), go to step (2.4); otherwise, take the prefetch data block size of the current iteration as the final prefetch data block size;
(2.4) If bi is less than bmin, the final prefetch data block size is b = bmin; otherwise, the prefetch data block size of the next iteration is half that of the previous iteration, and the procedure returns to step (2.2) to recalculate the cache hit rate; here bmin = s × t is the minimum prefetch data block size, s is the disk seek time, and t is the disk's sustained transfer rate;
Step three: cache the data blocks into each node according to the prefetch data block size determined in step two, using the allocation that yields the highest cache hit rate;
Step four: in each round of training after the first, generate the random sequence of the next round in advance, and perform cache replacement on each node according to the random sequence to be accessed in the next round, the locally cached data, and the data used in the current round, caching the data to be used in the next round on the node ahead of time; repeat until training finishes.
2. The parallel deep learning training data input method based on sequence predictability according to claim 1, wherein in step (2.2), the master node performing a simulated data allocation to each node according to the prefetch data block size specifically comprises:
the master node divides all training data into blocks of the prefetch data block size, obtaining n data blocks, and each node caches k = n/M blocks;
starting from node 1, the master node traverses all data blocks, finds the k blocks containing the largest number of hit samples for node 1, and assigns them to node 1; it then traverses the remaining data blocks for node 2, finds the k blocks with the largest number of hit samples for node 2, assigns them to node 2, and so on; the data blocks assigned to the nodes are mutually exclusive;
finally, after the prefetch data block size is determined, the master node communicates the numbers of the k data blocks assigned to each node to the other nodes.
3. The parallel deep learning training data input method based on sequence predictability according to claim 1, wherein in step four, performing cache replacement according to the random sequence to be accessed in the next round, each node's locally cached data, and the data used in the current round specifically comprises:
each node traverses the training data numbers assigned to it for the next round; if a training data number is not in the node's local cache, a remote request is initiated to exchange the data, and the training data is deleted on the remote node; if the training data number is already in the node's local cache, it is left unchanged.
4. The parallel deep learning training data input method based on sequence predictability according to claim 1, wherein, in each round of training, a node's multiple requests to the same node are merged and transmitted as a single request.
5. A training data input system implementing the parallel deep learning training data input method based on sequence predictability of claim 1, comprising:
the random sequence generating module is used for generating the same global random sequence at each node;
the data prefetching module is used for determining the size of a prefetched data block and performing data block allocation and caching on each node according to the size of the prefetched data block;
and the cache replacement module is used for carrying out cache replacement in each node in advance according to the random sequence to be accessed in the next round, the locally cached data and the data used in the round, and caching the data to be used in the next round in the node in advance.
6. The training data input system of claim 5, wherein the data pre-fetch module comprises:
the prefetch granularity decision module is used for determining the size of a prefetch data block when data are prefetched from a bottom layer parallel file system by combining the cache hit rate and the disk access performance;
and the pre-fetching data block distribution module is used for distributing and caching data to each node according to the size of the pre-fetching data block.
7. The training data input system of claim 5, further comprising a request merging module for merging, in each round of training, a node's multiple small data requests to the same node into one large request for transmission.
CN202110062697.5A 2021-01-18 2021-01-18 Parallel deep learning training data input method and system based on sequence predictability Active CN112379849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110062697.5A CN112379849B (en) 2021-01-18 2021-01-18 Parallel deep learning training data input method and system based on sequence predictability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110062697.5A CN112379849B (en) 2021-01-18 2021-01-18 Parallel deep learning training data input method and system based on sequence predictability

Publications (2)

Publication Number Publication Date
CN112379849A CN112379849A (en) 2021-02-19
CN112379849B (en) 2021-04-09

Family

ID=74582007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110062697.5A Active CN112379849B (en) 2021-01-18 2021-01-18 Parallel deep learning training data input method and system based on sequence predictability

Country Status (1)

Country Link
CN (1) CN112379849B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563499A (en) * 2021-12-02 2023-01-03 华为技术有限公司 Method, device and system for training model and computing node
CN114968588A (en) * 2022-06-07 2022-08-30 之江实验室 Data caching method and device for multi-concurrent deep learning training task
CN116303974B (en) * 2023-05-04 2023-08-01 之江实验室 Response method and device based on target generation type response language model
CN116501696B (en) * 2023-06-30 2023-09-01 之江实验室 Method and device suitable for distributed deep learning training prefetching cache management

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180314981A1 (en) * 2017-04-28 2018-11-01 Cisco Technology, Inc. Data sovereignty compliant machine learning
CN110018970A (en) * 2018-01-08 2019-07-16 腾讯科技(深圳)有限公司 Cache prefetching method, apparatus, equipment and computer readable storage medium
CN111126619A (en) * 2019-12-06 2020-05-08 苏州浪潮智能科技有限公司 Machine learning method and device

Also Published As

Publication number Publication date
CN112379849A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112379849B (en) Parallel deep learning training data input method and system based on sequence predictability
CN108710639B (en) Ceph-based access optimization method for mass small files
CN103885728B (en) A kind of disk buffering system based on solid-state disk
US11928580B2 (en) Interleaving memory requests to accelerate memory accesses
CN104063330B (en) Data prefetching method and device
KR20130020050A (en) Apparatus and method for managing bucket range of locality sensitivie hash
CN106528451B (en) The cloud storage frame and construction method prefetched for the L2 cache of small documents
CN112667528A (en) Data prefetching method and related equipment
CN107368608A (en) The HDFS small documents buffer memory management methods of algorithm are replaced based on ARC
CN115712583B (en) Method, device and medium for improving distributed cache cross-node access performance
CN112486994A (en) Method for quickly reading data of key value storage based on log structure merging tree
Choi et al. Learning future reference patterns for efficient cache replacement decisions
CN107426315A (en) A kind of improved method of the distributed cache system Memcached based on BP neural network
CN112199304A (en) Data prefetching method and device
CN113821477A (en) Metadata caching method, system, equipment and medium
CN113064907A (en) Content updating method based on deep reinforcement learning
CN109168023B (en) Method for caching scalable video stream
CN105530303B (en) A kind of network-caching linear re-placement method
CN110381540A (en) The dynamic buffering update method of real-time response time-varying file popularity based on DNN
WO2017049488A1 (en) Cache management method and apparatus
CN107015865B (en) DRAM cache management method and system based on time locality
WO2022148306A1 (en) Data elimination method and apparatus, cache node, and cache system
CN114462590B (en) Importance-aware deep learning data cache management method and system
CN117076415A (en) Distributed deep learning caching method based on sample importance sampling
CN110362399B (en) Plant root system optimization method suitable for cloud storage copy layout

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant