CN110928481A - Distributed deep neural network and storage method of parameters thereof - Google Patents
Distributed deep neural network and storage method of parameters thereof
- Publication number
- CN110928481A (application CN201811092458.9A)
- Authority
- CN
- China
- Prior art keywords
- parameter
- blocks
- memory
- neural network
- deep neural
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention relates to a method for storing parameters in a distributed deep neural network. The method comprises the following steps: a management node constructs the parameters required by the network according to the deep neural network and divides them into M parameter blocks; M parameter service nodes are set, each parameter service node storing one of the M parameter blocks, where M is a natural number. Further, the number of copies is set to N and the memory on each parameter service node is divided into N memory blocks: of these, 1 memory block stores one of the M parameter blocks as the main parameter data block, and the remaining N-1 memory blocks store copies of N-1 other parameter blocks, where N is a natural number. By adopting distributed multiple copies, the invention ensures efficient access to the parameter data while also providing high availability.
Description
Technical Field
The invention relates to computer technology, and in particular to a distributed deep neural network and a method for storing its parameters.
Background
In recent years, with the development of the mobile internet, mobile payment has developed rapidly. At the same time, with the rapid development of big data and artificial intelligence technology, more and more deep neural network algorithms are used in customer analysis, marketing analysis, business decision-making, and other aspects of enterprise operations. Open-source deep learning frameworks represent the leading edge of this development and have been applied in many enterprises; they include TensorFlow, Caffe, PyTorch, and the like.
According to the operation mechanism of a deep neural network, the operation nodes are divided into parameter service nodes and gradient calculation nodes (workers). The gradient calculation nodes read the training data and perform gradient calculation, while the parameter service nodes are responsible for updating and distributing the parameters. Generally speaking, training a model requires many iterations. Considering that the parameters are large, each parameter service node stores a part of the parameters and keeps the parameter data in memory in order to improve the efficiency of data access. During operation, if a parameter service node fails, the parameter data is lost, so that the model training task fails.
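As a non-limiting illustration of this mechanism, the following Python sketch shows one training iteration under the parameter-server pattern. All names (ParameterServer, Worker, compute_gradient) and the placeholder gradient are assumptions introduced for illustration and do not come from any particular framework.

```python
# Minimal sketch of the parameter-server mechanism described above.
# Names and the placeholder gradient are illustrative assumptions.

import numpy as np

class ParameterServer:
    """Holds a shard of the model parameters in memory."""
    def __init__(self, shard: np.ndarray, lr: float = 0.01):
        self.shard = shard
        self.lr = lr

    def pull(self) -> np.ndarray:
        return self.shard.copy()          # workers read the latest parameters

    def push(self, grad: np.ndarray) -> None:
        self.shard -= self.lr * grad      # the server applies the gradient update

class Worker:
    """Reads training data and computes gradients against pulled parameters."""
    def compute_gradient(self, params: np.ndarray, batch: np.ndarray) -> np.ndarray:
        # placeholder gradient: a real framework would run backpropagation here
        return params - batch.mean(axis=0)

# one training iteration
server = ParameterServer(np.zeros(4))
worker = Worker()
batch = np.random.rand(8, 4)
grad = worker.compute_gradient(server.pull(), batch)
server.push(grad)
```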
Disclosure of Invention
In view of the above problems, the present invention aims to provide a distributed deep neural network and a method for storing its parameters which, by optimizing and improving the parameter service nodes in a distributed deep learning framework, enable a model training task to continue running when a parameter service node fails.
The method for storing the parameters in the distributed deep neural network is characterized by comprising the following steps:
a parameter dividing step, in which a management node divides parameters required in a network constructed according to a deep neural network into M parameter blocks; and
a parameter service node setting step, setting M parameter service nodes, each parameter service node storing one of M parameter blocks,
wherein M is a natural number.
Optionally, before the parameter dividing step, the method further includes:
a parameter node number estimation step of calculating the memory space required for storage according to the network scale at runtime of the deep neural network, and obtaining the minimum required number M of parameter service nodes by combining the number of backup copies and the memory size of each node, wherein the actual memory space must be larger than the memory space required for multi-copy storage.
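As a non-limiting illustration, the estimate of the minimum number M of parameter service nodes might be computed as in the following sketch; the function name and the headroom factor are assumptions introduced here, not specified by the method itself.

```python
import math

def min_parameter_service_nodes(param_bytes: int,
                                n_copies: int,
                                node_mem_bytes: int,
                                headroom: float = 1.1) -> int:
    """Estimate the minimum number M of parameter service nodes.

    The total memory required is the parameter size times the number of
    copies (one primary plus N-1 replicas); since the actual memory must
    be larger than this, a headroom factor (an assumed value) is applied.
    """
    total_needed = param_bytes * n_copies * headroom
    return math.ceil(total_needed / node_mem_bytes)

# e.g. 40 GB of parameters, N = 3 copies, 16 GB of usable memory per node
M = min_parameter_service_nodes(40 << 30, 3, 16 << 30)  # -> 9 nodes
```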
Optionally, further comprising:
setting the number of copies to N,
dividing the memory on each parameter service node into N memory blocks,
wherein, in the N memory blocks, 1 memory block stores one of the M parameter blocks as a main parameter data block, and the remaining N-1 memory blocks respectively store copies of N-1 of the remaining parameter blocks, wherein N is a natural number.
Optionally, the parameters are divided into M equal parameter blocks, and the memory on each parameter service node is divided into N equal memory blocks.
Optionally, N non-repeating parameter blocks are stored in the N memory blocks,
and the parameter service nodes are divided into N groups, each group covering all M parameter blocks.
Optionally, the management node randomly distributes copies of the parameter blocks to the remaining parameter nodes according to the number of copies of each parameter block, so that the parameter blocks stored on each parameter node are all different.
Optionally, each parameter service node synchronizes the parameters of the main parameter data block to the copies of the remaining N-1 parameter service nodes after updating the main parameter data block.
Optionally, each parameter service node persists data of the master parameter data block to the shared storage while updating the master parameter data block.
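As a non-limiting illustration of the two optional steps above (synchronizing an updated main parameter data block to its N-1 copies), consider the following sketch; the class and method names are assumptions, and real nodes would communicate over a network rather than by direct method calls.

```python
class ParameterServiceNode:
    """Sketch of replica synchronization: after updating its main
    parameter data block, a node pushes the update to the N-1 nodes
    holding copies of that block."""

    def __init__(self, node_id: int, primary_block_id: int):
        self.node_id = node_id
        self.primary_block_id = primary_block_id
        self.primary: dict = {}          # main parameter data block
        self.replicas: dict = {}         # block_id -> replica parameter block
        self.replica_holders: list = []  # the N-1 nodes holding copies

    def update_primary(self, new_params: dict) -> None:
        self.primary.update(new_params)
        # synchronize the update to the copies on the remaining N-1 nodes
        for node in self.replica_holders:
            node.receive_copy(self.primary_block_id, new_params)

    def receive_copy(self, block_id: int, params: dict) -> None:
        self.replicas.setdefault(block_id, {}).update(params)

# two-node example with N = 2: node 2 holds the copy of node 1's block
n1, n2 = ParameterServiceNode(1, 1), ParameterServiceNode(2, 2)
n1.replica_holders = [n2]
n1.update_primary({"w0": 0.5})
assert n2.replicas[1] == {"w0": 0.5}
```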
Optionally, setting M distributed memory service nodes,
the parameter service nodes are only responsible for the parameter updating and access operations of the parameter blocks, while the parameter data of the parameter blocks are stored distributively in the M distributed memory service nodes.
Optionally, the parameter service node persists the parameters of the parameter block to a shared storage through the distributed memory service node.
The distributed deep neural network of an aspect of the present invention is characterized by comprising:
the management node is used for constructing parameters required in the network according to the deep neural network and dividing the parameters into M parameter blocks; and
the system comprises M parameter service nodes, wherein each parameter service node stores one of M parameter blocks, and M is a natural number.
Optionally, the memory space required for storage is calculated according to the network scale at runtime of the deep neural network, and the minimum required number of parameter service nodes is obtained by combining the number of backup copies and the memory size of each node, wherein the actual memory space must be larger than the memory space required for multi-copy storage.
Optionally, further comprising:
the number of copies of the parameter service node is N,
the M parameter service nodes each have a memory, and the memory of each parameter service node is divided into N memory blocks,
wherein, in the N memory blocks, 1 memory block stores one of the M parameter blocks as a main parameter data block, and the remaining N-1 memory blocks respectively store copies of N-1 of the remaining parameter blocks,
wherein N is a natural number.
Optionally, N non-repeating parameter blocks are stored in the N memory blocks,
and the parameter service nodes are divided into N groups, each group covering all M parameter blocks.
Optionally, the management node randomly distributes copies of the parameter blocks to the remaining parameter nodes according to the number of copies of each parameter block, so that the parameter blocks stored on each parameter node are all different.
Optionally, each parameter service node synchronizes the parameters of the main parameter data block to the copies of the remaining N-1 parameter service nodes after updating the main parameter data block.
Optionally, the method further comprises:
and the shared storage is used for storing the data of the main parameter data block from each parameter service node.
Optionally, the method further comprises:
and the M distributed memory service nodes are used for storing the parameters of the parameter blocks from the M parameter service nodes in a distributed manner.
Optionally, the method further comprises:
and the shared storage is used for storing the parameters of the parameter blocks from the M distributed memory service nodes.
A computer-readable storage medium of an aspect of the present invention on which a computer program is stored is characterized in that the program, when executed by a processor, implements the above-described storage method of parameters in a distributed deep neural network.
A computer device according to an aspect of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and is characterized in that the processor implements the above-described method for storing parameters in the distributed deep neural network when executing the computer program.
As described above, according to the distributed deep neural network and the method for storing its parameters in one aspect of the present invention, distributed multi-copy storage ensures efficient access to the parameter data while also providing high availability. Persisting the parameter data to shared storage further improves data availability. Because the parameter data is divided into blocks, data synchronization copies whole data blocks, which improves synchronization efficiency. Furthermore, by dividing the data into N parts, a concise data division and efficient access are provided, so that the number of failed nodes tolerated is at most 2/3 of the nodes and at least N-1.
Drawings
Fig. 1 is a schematic diagram showing a storage method of parameters in a distributed deep neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram showing a storage method of parameters in the distributed deep neural network according to still another embodiment of the present invention.
Detailed Description
The following description is of some of the several embodiments of the invention and is intended to provide a basic understanding of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention.
The main technical scheme of the parameter storage method in the distributed deep neural network in one aspect of the invention is as follows:
calculating the memory space required for storage according to the network scale at runtime of the deep neural network, then obtaining the minimum required number of parameter service nodes by combining the number of backup copies and the memory size of each node, while ensuring that the actual memory space is slightly larger than the memory space required for multi-copy storage;
setting the number of the parameter service nodes as M and the number of the copies as N;
dividing the parameters into M approximately equal blocks, each parameter service node storing 1 of the M parameter data blocks;
dividing the memory on each parameter service node into N approximately equal blocks, of which 1 block stores the main parameter data block and the remaining space stores copies of N-1 other parameter blocks, i.e., each parameter service node synchronizes N-1 parameter data blocks from other nodes;
distributing the parameter blocks according to a random distribution principle under the following constraints: the memory blocks can be regarded as an N×M matrix in which each of the N rows contains the M parameter blocks without repetition, so that the parameter service nodes can be roughly divided into N groups, each group basically covering all M parameter blocks (one possible placement satisfying these constraints is sketched after this list);
each parameter service node updates the main parameter data block and simultaneously persists the parameter data into a shared storage;
after each parameter service node updates the main parameter data block, the main parameter data is synchronized to the remaining N-1 replica nodes,
wherein M and N are natural numbers.
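As a non-limiting illustration of the placement constraints above, the following sketch builds the N×M placement matrix by rejection sampling: row 0 is the primary assignment (node j holds block j), and each further row is a random permutation of the M blocks, redrawn until no node would store the same block twice. The function name and the use of rejection sampling are assumptions; the sketch presumes M is comfortably larger than N so that a valid permutation is found quickly.

```python
import random

def place_blocks(m: int, n: int, seed: int = 0) -> list:
    """Return an N x M matrix: matrix[r][j] is the parameter block held in
    memory block r of parameter service node j. Each row contains the M
    blocks without repetition, and each column holds N distinct blocks."""
    rng = random.Random(seed)
    matrix = [list(range(m))]                 # row 0: primary blocks
    while len(matrix) < n:
        row = list(range(m))
        rng.shuffle(row)
        # reject the permutation if some node (column) would get a duplicate
        if all(row[j] not in (r[j] for r in matrix) for j in range(m)):
            matrix.append(row)
    return matrix

placement = place_blocks(m=5, n=3)
# each of the N rows is a group covering all M parameter blocks
```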
Thus, according to the method for storing parameters in a distributed deep neural network, distributed multi-copy storage ensures efficient access to the parameter data while also providing high availability. Secondly, persisting the parameter data to shared storage further improves data availability. In addition, because the parameter data is divided into blocks, data synchronization copies whole data blocks, which improves synchronization efficiency. Furthermore, by dividing the data into N parts, a concise data division and efficient access are provided, so that the number of failed nodes tolerated is at most 2/3 of the nodes and at least N-1.
Next, a method of storing parameters in a distributed deep neural network and a system therefor according to embodiments of the present invention will be described.
Fig. 1 is a schematic diagram showing a storage method of parameters in a distributed deep neural network according to an embodiment of the present invention.
As shown in fig. 1, the distributed deep neural network according to the first embodiment of the present invention includes a management node 100, M parameter service nodes (i.e., parameter service node 1, parameter service node 2, …, parameter service node M in fig. 1), and a shared storage 200, where M is a natural number.
The management node 100 constructs the parameters required by the network according to the deep neural network and divides them into M parameter blocks of substantially equal size (i.e., parameter block 1, parameter block 2, …, parameter block M). As shown in fig. 1, each parameter service node stores one of the M parameter blocks; in fig. 1, parameter block 1 is stored by parameter service node 1, and parameter block 2 is stored by parameter service node 2.
Moreover, the management node 100 randomly distributes the copies to the remaining nodes according to the number of copies, ensuring that the parameter blocks stored on each parameter node are all different. For example: parameter service node 1 stores parameter blocks a and b as copies in addition to parameter block 1; parameter service node 2 stores parameter blocks c and d as copies in addition to parameter block 2; and parameter service node M stores parameter blocks e and f as copies in addition to parameter block M. That is, the memory on each parameter service node is divided into 3 approximately equal blocks (i.e., N = 3): 1 block stores the main parameter data block, and the remaining space stores copies of 2 (= N-1) parameter blocks, i.e., each parameter service node synchronizes 2 (= N-1) parameter data blocks from other nodes. The management node 100 coordinates the synchronization of copy parameter blocks between the parameter service nodes according to the copy distribution.
In the invention, all the parameter blocks are stored in memory, which ensures efficient updating and access of the parameter data. Moreover, after the main parameter block stored by each parameter service node is updated, the update is synchronously persisted to the shared storage 200; for example, the update may be recorded in the form of a log, or the parameter data in memory may be copied directly.
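As a non-limiting illustration of the log-form persistence mentioned above, the following sketch appends each update of the main parameter block to a per-block log file on the shared storage and replays the log to rebuild the block after a node failure. The path layout and JSON encoding are assumptions made for the example.

```python
import json
import os

class SharedStorageLog:
    """Append-only per-block update log on shared storage (illustrative)."""

    def __init__(self, root: str, block_id: int):
        os.makedirs(root, exist_ok=True)
        self.path = os.path.join(root, f"param_block_{block_id}.log")

    def record_update(self, update: dict) -> None:
        # persist one update record per line, write-ahead-log style
        with open(self.path, "a") as f:
            f.write(json.dumps(update) + "\n")

    def replay(self) -> dict:
        # rebuild the parameter block by replaying the log after a failure
        params: dict = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    params.update(json.loads(line))
        return params
```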
As described above, according to the method for storing parameters in a distributed deep neural network of this embodiment, distributed multi-copy storage ensures efficient access to the parameter data while also providing high availability. Secondly, persisting the parameter data to the shared storage 200 further improves data availability. In addition, because the parameter data is divided into blocks, data synchronization copies whole data blocks, which improves synchronization efficiency.
Next, a method of storing parameters in the distributed deep neural network according to still another embodiment of the present invention will be described.
Fig. 2 is a schematic diagram showing a storage method of parameters in the distributed deep neural network according to still another embodiment of the present invention.
As shown in fig. 2, the management node 110 configures M parameter service nodes (i.e., parameter service node 1, parameter service node 2, …, parameter service node M) and M distributed memory service nodes (i.e., memory service node 1, memory service node 2, …, memory service node M in fig. 2). The M parameter service nodes persist the parameters to the shared storage 210 through the M distributed memory service nodes.
For example, the management node 110 constructs the parameter information required by the network according to the deep neural network created by a task, and divides the parameters into M substantially equal parameter blocks according to the number of parameter service nodes; for instance, parameter block 1 is stored by parameter service node 1 and parameter block 2 by parameter service node 2.
Here, the M parameter service nodes are only responsible for the parameter updating and access operations of the parameter blocks, while the specific parameters are stored in the distributed memory service nodes 1 to M; that is, the parameter service and the memory storage are separated.
The parameter data of the parameter blocks of the M parameter service nodes are stored distributively in the M distributed memory service nodes, and the M parameter service nodes persist the parameters of the parameter blocks to the shared storage 210 through the M distributed memory service nodes; for example, updates may be recorded in the form of a log, or the parameter data in memory may be copied directly. Because the invention realizes distributed memory storage in which data is stored and updated in batches at the granularity of data blocks, storage and update efficiency can be greatly improved.
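As a non-limiting illustration of this separation, the sketch below splits the parameter service (update and access only) from the memory service (block storage plus persistence); all names are assumptions, and the shared storage is stood in for by a plain list of update records.

```python
class MemoryServiceNode:
    """Distributed memory service node: holds the actual parameter block
    and persists whole-block updates through to shared storage."""

    def __init__(self, storage_log: list):
        self.block: dict = {}
        self.storage_log = storage_log

    def batch_update(self, updates: dict) -> None:
        self.block.update(updates)               # batch update per data block
        self.storage_log.append(dict(updates))   # persist to shared storage

class ParameterServiceNodeFrontend:
    """Parameter service node that only handles update and access
    operations, delegating storage to its memory service node."""

    def __init__(self, memory_node: MemoryServiceNode):
        self.memory_node = memory_node

    def update(self, updates: dict) -> None:
        self.memory_node.batch_update(updates)

    def read(self, key):
        return self.memory_node.block.get(key)

log: list = []
front = ParameterServiceNodeFrontend(MemoryServiceNode(log))
front.update({"w0": 0.1, "w1": 0.2})
assert front.read("w1") == 0.2 and log == [{"w0": 0.1, "w1": 0.2}]
```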
The present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the above-described method of storing parameters in a distributed deep neural network.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above-mentioned method for storing parameters in the distributed deep neural network when executing the computer program.
The above examples mainly illustrate the distributed deep neural network of the present invention and the storage method of parameters in the distributed deep neural network of the present invention. Although only a few embodiments of the present invention have been described in detail, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.
Claims (21)
1. A method for storing parameters in a distributed deep neural network is characterized by comprising the following steps:
a parameter dividing step, in which a management node divides parameters required in a network constructed according to a deep neural network into M parameter blocks; and
a parameter service node setting step, setting M parameter service nodes, each parameter service node storing one of M parameter blocks,
wherein M is a natural number.
2. The method of storing parameters in a distributed deep neural network as claimed in claim 1, further comprising, before the parameter dividing step:
a parameter node number estimation step of calculating the memory space required for storage according to the network scale at runtime of the deep neural network, and obtaining the minimum required number M of parameter service nodes by combining the number of backup copies and the memory size of each node, wherein the actual memory space must be larger than the memory space required for multi-copy storage.
3. The method of storing parameters in a distributed deep neural network as claimed in claim 1, further comprising after said parameter service node setting step:
setting the number of copies to N,
dividing the memory on each parameter service node into N memory blocks,
wherein, in the N memory blocks, 1 memory block stores one of the M parameter blocks as a main parameter data block, and the remaining N-1 memory blocks respectively store copies of N-1 of the remaining parameter blocks,
wherein N is a natural number.
4. The method of storing parameters in a distributed deep neural network of claim 3,
wherein the parameters are divided into M equal parameter blocks, and the memory on each parameter service node is divided into N equal memory blocks.
5. The method of storing parameters in a distributed deep neural network of claim 3,
wherein N non-repeating parameter blocks are stored in the N memory blocks,
and the parameter service nodes are divided into N groups, each group covering all M parameter blocks.
6. The method of storing parameters in a distributed deep neural network of claim 3,
the management node randomly distributes copies of the parameter blocks to the remaining parameter nodes according to the number of copies of each parameter block, so that the parameter blocks stored on each parameter node are all different.
7. The method of storing parameters in a distributed deep neural network of claim 3,
after updating the main parameter data block, each parameter service node synchronizes the parameters of the main parameter data block to the copies of the remaining N-1 parameter service nodes.
8. The method of storing parameters in a distributed deep neural network of claim 3,
each parameter serving node persists data of the master parameter data block to shared storage while updating the master parameter data block.
9. The method of storing parameters in a distributed deep neural network of claim 2,
setting M distributed memory service nodes,
the parameter service nodes are only responsible for the parameter updating and access operations of the parameter blocks, and the parameter data of the parameter blocks are stored distributively in the M distributed memory service nodes.
10. The method of storing parameters in a distributed deep neural network of claim 9,
and the parameter service node persists the parameters of the parameter block to shared storage through the distributed memory service node.
11. A distributed deep neural network, comprising:
the management node is used for constructing parameters required in the network according to the deep neural network and dividing the parameters into M parameter blocks; and
the system comprises M parameter service nodes, wherein each parameter service node stores one of M parameter blocks, and M is a natural number.
12. The distributed deep neural network of claim 11,
wherein the memory space required for storage is calculated according to the network scale at runtime of the deep neural network, and the minimum required number M of parameter service nodes is obtained by combining the number of backup copies and the memory size of each node, wherein the actual memory space must be larger than the memory space required for multi-copy storage.
13. The distributed deep neural network of claim 11, further comprising:
the number of copies of the parameter service node is N,
the M parameter service nodes each have a memory, and the memory of each parameter service node is divided into N memory blocks,
wherein, in the N memory blocks, 1 memory block stores one of the M parameter blocks as a main parameter data block, and the remaining N-1 memory blocks respectively store copies of N-1 of the remaining parameter blocks,
wherein N is a natural number.
14. The distributed deep neural network of claim 11,
wherein N non-repeating parameter blocks are stored in the N memory blocks,
and the parameter service nodes are divided into N groups, each group covering all M parameter blocks.
15. The distributed deep neural network of claim 14,
the management node randomly distributes copies of the parameter blocks to the remaining parameter nodes according to the number of copies of each parameter block, so that the parameter blocks stored on each parameter node are all different.
16. The distributed deep neural network of claim 11,
after updating the main parameter data block, each parameter service node synchronizes the parameters of the main parameter data block to the copies of the remaining N-1 parameter service nodes.
17. The distributed deep neural network of claim 11, further comprising:
and the shared storage is used for storing the data of the main parameter data block from each parameter service node.
18. The distributed deep neural network of claim 11, further comprising:
and the M distributed memory service nodes are used for storing the parameters of the parameter blocks from the M parameter service nodes in a distributed manner.
19. The distributed deep neural network of claim 18, further comprising:
and the shared storage is used for storing the parameters of the parameter blocks from the M distributed memory service nodes.
20. A computer-readable storage medium on which a computer program is stored, which program, when executed by a processor, implements a method of storing parameters in a distributed deep neural network as claimed in any one of claims 1 to 10.
21. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements a method of storing parameters in a distributed deep neural network as claimed in any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811092458.9A CN110928481A (en) | 2018-09-19 | 2018-09-19 | Distributed deep neural network and storage method of parameters thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110928481A true CN110928481A (en) | 2020-03-27 |
Family
ID=69855103
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811092458.9A Pending CN110928481A (en) | 2018-09-19 | 2018-09-19 | Distributed deep neural network and storage method of parameters thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110928481A (en) |
- 2018-09-19: Application CN201811092458.9A filed in China (CN); publication CN110928481A, status Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101753349A (en) * | 2008-12-09 | 2010-06-23 | 中国移动通信集团公司 | Upgrading method of data node, upgrade dispatching node as well as upgrading system |
US20100274762A1 (en) * | 2009-04-24 | 2010-10-28 | Microsoft Corporation | Dynamic placement of replica data |
CN103034739A (en) * | 2012-12-29 | 2013-04-10 | 天津南大通用数据技术有限公司 | Distributed memory system and updating and querying method thereof |
CN106156810A (en) * | 2015-04-26 | 2016-11-23 | 阿里巴巴集团控股有限公司 | General-purpose machinery learning algorithm model training method, system and calculating node |
CN108073986A (en) * | 2016-11-16 | 2018-05-25 | 北京搜狗科技发展有限公司 | A kind of neural network model training method, device and electronic equipment |
CN107395745A (en) * | 2017-08-20 | 2017-11-24 | 长沙曙通信息科技有限公司 | A kind of distributed memory system data disperse Realization of Storing |
CN107578094A (en) * | 2017-10-25 | 2018-01-12 | 济南浪潮高新科技投资发展有限公司 | The method that the distributed training of neutral net is realized based on parameter server and FPGA |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112256440A (en) * | 2020-12-23 | 2021-01-22 | 上海齐感电子信息科技有限公司 | Memory management method and device for neural network inference |
CN112256440B (en) * | 2020-12-23 | 2021-03-09 | 上海齐感电子信息科技有限公司 | Memory management method and device for neural network inference |
CN115526302A (en) * | 2022-08-19 | 2022-12-27 | 北京应用物理与计算数学研究所 | Multilayer neural network computing method and device based on heterogeneous multi-core processor |
CN117909418A (en) * | 2024-03-20 | 2024-04-19 | 广东琴智科技研究院有限公司 | Deep learning model storage consistency method, computing subsystem and computing platform |
CN117909418B (en) * | 2024-03-20 | 2024-05-31 | 广东琴智科技研究院有限公司 | Deep learning model storage consistency method, computing subsystem and computing platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||