CN112261023A - Data transmission method and device of convolutional neural network - Google Patents

Data transmission method and device of convolutional neural network

Info

Publication number
CN112261023A
CN112261023A (application CN202011104673.3A)
Authority
CN
China
Prior art keywords
array
transmission
processing unit
data
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011104673.3A
Other languages
Chinese (zh)
Inventor
罗建刚 (Luo Jiangang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011104673.3A
Publication of CN112261023A
Legal status: Pending

Links

Images

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/04: Protocols for data compression, e.g. ROHC
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • H04L 47/00: Traffic control in data switching networks
    • H04L 47/10: Flow control; Congestion control

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data transmission method and apparatus for a convolutional neural network. The method comprises dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation: invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array; invoking communication resources to perform a transmission-mode-based reduction on the compressed array; invoking communication resources to perform transmission-mode-based aggregation on the compressed array; and invoking computing resources to decompress the compressed array at the target processing unit to recover the array. The invention reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.

Description

Data transmission method and device of convolutional neural network
Technical Field
The present invention relates to the field of neural networks, and more particularly, to a data transmission method and apparatus for a convolutional neural network.
Background
Increasingly sophisticated machine learning algorithms, such as deep neural networks (DNN) and convolutional neural networks (CNN), achieve unprecedented performance in many practical applications and solve difficult problems in areas such as speech recognition, text processing, and image recognition. However, training on a single graphics processing unit (GPU) often takes a long time, and this inefficiency limits practical application. The most widely used way to reduce training time is data-parallel training. In data-parallel training, each GPU holds a complete copy of the model parameters and frequently exchanges parameters with the other GPUs participating in training, which incurs significant communication cost and becomes a system bottleneck when communication is slow.
The communication bottleneck in training can be addressed from both the hardware and the software side: more advanced GPU interconnect technology in hardware, and advanced modern communication libraries in software. Ring communication is the most widely applied of the existing communication schemes; it combines effectively with pipelining, scales well, and is common in large-volume data transmission. However, on low-speed links, for example over some PCIe connections, the transmission speed is only about 7.5 GB/s, which has gradually become a bottleneck for GPU computation. Multi-node transmission usually goes over a network, which restricts GPU interactive computation even more severely.
For the prior-art problems of large communication data volume, long transmission time, and slow overall task progress in convolutional neural networks, no effective solution has yet been proposed.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a data transmission method and apparatus for a convolutional neural network that reduce the amount of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
In view of the above object, a first aspect of the embodiments of the present invention provides a data transmission method for a convolutional neural network, including dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
In some embodiments, performing sparse compression on the array to generate the compressed array comprises:
extracting the value and position of each element from the array to form (position, value) element pairs;
deleting the element pairs whose value is zero;
combining the remaining element pairs to form the compressed array.
In some embodiments, the method further comprises: after deleting the element pairs whose value is zero, additionally deleting, based on a predetermined filtering threshold, the element pairs whose values are below the threshold.
In some embodiments, the data partitioning and transmission modes are determined based on the processing unit topology.
In some embodiments, the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
In some embodiments, the data partitioning scheme is an even distribution based on the number of processing units; the transmission mode is ring transmission or ring all-reduce transmission; and the processing unit topology is a ring topology.
In some embodiments, further comprising: while the transport-based aggregation is being performed, the compute resources are also initially invoked to perform sparse compression on its next array.
In some embodiments, the method further comprises: pre-establishing a transmission interface for the convolutional neural network, and performing the transmission-mode-based reduction and aggregation on the compressed array through the transmission interface.
A second aspect of an embodiment of the present invention provides a data transmission apparatus for a convolutional neural network, including:
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, divides data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performs the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
In some embodiments, the data partitioning manner and the transmission manner are both determined based on the topology of the processing unit; the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
The invention has the following beneficial technical effects: the data transmission method and apparatus for a convolutional neural network provided by the embodiments of the present invention invoke computing resources at the source processing unit to perform sparse compression on an array to generate a compressed array; invoke communication resources to perform a transmission-mode-based reduction on the compressed array; invoke communication resources to perform transmission-mode-based aggregation on the compressed array; and invoke computing resources at the target processing unit to decompress the compressed array to recover the array. This technical scheme reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a data transmission method of a convolutional neural network provided in the present invention;
FIG. 2 is a block diagram of a data transmission method of a convolutional neural network according to the present invention;
fig. 3 is a schematic pipeline diagram of a data transmission method of a convolutional neural network provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention serve to distinguish two entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments do not repeat this note.
In view of the above object, a first aspect of embodiments of the present invention proposes an embodiment of a data transmission method of a convolutional neural network that reduces the amount of communication data while ensuring convergence accuracy. Fig. 1 is a schematic flow chart of a data transmission method of a convolutional neural network provided by the present invention.
As shown in fig. 1, the data transmission method of the convolutional neural network comprises dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation:
step S101, invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
step S103, invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
step S105, invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
step S107, invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
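To make the cooperation of steps S101 to S107 concrete, the following minimal Python sketch shows one batch of data flowing through the four steps. The helper names (sparse_compress, ring_reduce, ring_gather, sparse_decompress) and the Size/N chunking arithmetic are illustrative assumptions, not interfaces defined by this disclosure; sketches of the helpers themselves appear later in this description.

def transmit_all(data, num_units, sparse_compress, ring_reduce,
                 ring_gather, sparse_decompress):
    # Split the data to be transmitted into num_units arrays of Size/N
    # elements each (the data partitioning scheme of a ring topology).
    chunk = len(data) // num_units
    arrays = [data[i * chunk:(i + 1) * chunk] for i in range(num_units)]
    results = []
    for arr in arrays:
        packed = sparse_compress(arr)          # step S101, source processing unit
        packed = ring_reduce(packed)           # step S103, transmission-mode-based reduction
        packed = ring_gather(packed)           # step S105, transmission-mode-based aggregation
        results.append(sparse_decompress(packed, len(arr)))  # step S107, target unit
    return results

In practice the loop body for array k+1 would start as soon as aggregation of array k begins, which is the pipelining discussed below.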
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the corresponding method embodiments described above.
In some embodiments, performing sparse compression on the array to generate the compressed array comprises:
extracting the value and position of each element from the array to form (position, value) element pairs;
deleting the element pairs whose value is zero;
combining the remaining element pairs to form the compressed array.
In some embodiments, the method further comprises: after deleting the element pairs whose value is zero, additionally deleting, based on a predetermined filtering threshold, the element pairs whose values are below the threshold.
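A minimal NumPy sketch of the sparse compression and decompression just described; the function names and the use of the absolute value in the threshold test are assumptions made for illustration.

import numpy as np

def sparse_compress(arr, threshold=0.0):
    # Form (position, value) pairs, drop the zero-valued pairs, and
    # optionally drop pairs whose magnitude is below the predetermined
    # filtering threshold (threshold=0.0 keeps exactly the nonzeros).
    arr = np.asarray(arr)
    positions = np.nonzero(np.abs(arr) > threshold)[0]
    values = arr[positions]
    return positions, values

def sparse_decompress(packed, length):
    # Inverse operation: scatter the retained values back into a dense array.
    positions, values = packed
    out = np.zeros(length, dtype=values.dtype)
    out[positions] = values
    return out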
In some embodiments, the data partitioning and transmission modes are determined based on the processing unit topology.
In some embodiments, the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
In some embodiments, the data partitioning scheme is an even distribution based on the number of processing units; the transmission mode is ring transmission or ring all-reduce transmission; and the processing unit topology is a ring topology.
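For the ring transmission mentioned here, the per-step chunk scheduling can be made concrete as follows; the indexing convention is the usual ring all-reduce one and is an assumption, since the disclosure does not fix it.

# Reduce-scatter phase of a ring all-reduce over N processing units:
# the data is split into N arrays of Size/N, and in each of the N-1
# steps, rank r sends chunk (r - step) % N to rank (r + 1) % N and
# reduces the chunk it receives into its local copy.
N = 4
for step in range(N - 1):
    for rank in range(N):
        send_chunk = (rank - step) % N
        recv_chunk = (rank - step - 1) % N
        print(f"step {step}: rank {rank} -> rank {(rank + 1) % N}, "
              f"sends chunk {send_chunk}, reduces chunk {recv_chunk}")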
In some embodiments, the method further comprises: while the transmission-mode-based aggregation is being performed, computing resources are additionally invoked to begin sparse compression of the next array.
In some embodiments, the method further comprises: pre-establishing a transmission interface for the convolutional neural network, and performing the transmission-mode-based reduction and aggregation on the compressed array through the transmission interface.
The following further illustrates embodiments of the invention in accordance with the embodiments shown in fig. 2 and 3.
Referring to fig. 2, the framework is divided into three main parts. First, a deep learning framework data transmission interface is established (for PyTorch, TensorFlow, MXNet, etc.); this interface is kept consistent with NCCL to ensure the generality of the program. Second, topology establishment and selection: a lower-latency topology is built according to the GPU architecture and selected in combination with factors such as data volume. Different topologies call for different transmission modes and different data partitioning schemes; for example, in ring communication each GPU handles Size/N data at a time (where Size is the total size of the data to be transmitted and N is the number of GPUs). Third, the sparse-compression communication part: sparse storage uses a row-compression format, and since the data transmitted are all in one-dimensional array form, only the element values and their column indices are needed to express them. For example, the array to be transmitted:
(0,6,0,0,7,0,0,0,0,0,0,0,2,0,0,1)
can be expressed as:
(1,4,12,15)(6,7,2,1)
it can be seen that with a sparseness of 25%, the amount of transmission is only 50% of the original amount of data. And the matrix after sparse compression can be subjected to reduction operation (summation, maximum value taking and the like) under the compression condition, so that the method has higher acceleration effect compared with the traditional compression method.
However, sparse compression and decompression take up computing time and affect program efficiency. To reduce this overhead, the same strategy is adopted as in traditional compression: a pipeline is used to hide the sparse-compression time and thereby improve program efficiency. As shown in fig. 3, compression of the second array starts synchronously during the ring aggregation; since the ring aggregation and the ring reduction mainly occupy communication bandwidth and make little use of computing resources, the next array to be transmitted can undergo sparse compression during the current transmission, hiding the compression time and improving program efficiency.
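The pipelining described here can be sketched with a single-worker thread pool: compression of array k+1 is submitted before array k is transmitted, so the compression time is hidden behind the communication. A real implementation would overlap CUDA streams with NCCL transfers; this host-side sketch, including the helper names, is an assumption made for illustration.

from concurrent.futures import ThreadPoolExecutor

def pipelined_transmit(arrays, sparse_compress, transmit):
    # Overlap sparse compression (computing resources) with ring
    # reduction and aggregation (communication resources), as in fig. 3.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(sparse_compress, arrays[0])
        for k in range(len(arrays)):
            packed = pending.result()        # compression of array k is done
            if k + 1 < len(arrays):
                # start compressing array k+1 while array k is on the wire
                pending = pool.submit(sparse_compress, arrays[k + 1])
            transmit(packed)                 # ring reduction + aggregation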
The embodiment of the invention builds on ring and tree communication and adopts a sparse compression method to reduce the data volume during transmission and raise the effective transmission bandwidth. When the sparsified data is 1/n of the source data, an acceleration ratio of up to n/2 can be obtained (each retained element is transmitted as a value plus an index, hence the factor of 2). Tests show that, with an appropriate threshold, convergence of the deep learning framework is not adversely affected. Data sparsification therefore effectively improves the communication bandwidth of the GPU while guaranteeing convergence of the deep learning model, alleviating to a certain extent the problems of low-speed networks and low GPU communication efficiency.
As can be seen from the foregoing embodiments, the data transmission method for a convolutional neural network provided by the embodiments of the present invention invokes computing resources at the source processing unit to perform sparse compression on an array to generate a compressed array; invokes communication resources to perform a transmission-mode-based reduction on the compressed array; invokes communication resources to perform transmission-mode-based aggregation on the compressed array; and invokes computing resources at the target processing unit to decompress the compressed array to recover the array. This technical scheme reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
It should be particularly noted that the steps in the embodiments of the data transmission method of the convolutional neural network described above can be interchanged, replaced, added, or deleted, so data transmission methods of a convolutional neural network transformed by such reasonable permutations and combinations shall also fall within the scope of the present invention, and the scope of the present invention shall not be limited to the described embodiments.
In view of the above object, a second aspect of the embodiments of the present invention proposes an embodiment of a data transmission apparatus of a convolutional neural network that reduces the amount of communication data while ensuring convergence accuracy. The data transmission device of the convolutional neural network comprises:
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, divides data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performs the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
In some embodiments, the data partitioning manner and the transmission manner are both determined based on the topology of the processing unit; the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
As can be seen from the foregoing embodiments, the data transmission apparatus for a convolutional neural network according to the embodiments of the present invention invokes computing resources at the source processing unit to perform sparse compression on an array to generate a compressed array; invokes communication resources to perform a transmission-mode-based reduction on the compressed array; invokes communication resources to perform transmission-mode-based aggregation on the compressed array; and invokes computing resources at the target processing unit to decompress the compressed array to recover the array. This technical scheme reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
It should be particularly noted that the embodiment of the data transmission apparatus described above uses the embodiments of the data transmission method of the convolutional neural network to describe the working process of each module, and those skilled in the art can readily apply these modules to other embodiments of the method. Of course, since the steps in the method embodiments can be interchanged, replaced, added, or deleted, data transmission apparatuses of a convolutional neural network transformed by such reasonable permutations and combinations shall also fall within the scope of the present invention, and the scope of the present invention shall not be limited to the above embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples. Within the spirit of the embodiments of the present invention, technical features in the above embodiment or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention shall be included within the scope of protection of the embodiments of the present invention.

Claims (10)

1. A data transmission method for a convolutional neural network, characterized by comprising dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at a source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at a target processing unit to recover the array.
2. The method of claim 1, wherein performing sparse compression on the array to generate a compressed array comprises:
extracting the value and position of each element from the array to form (position, value) element pairs;
deleting the element pairs whose value is zero;
combining the remaining element pairs to form the compressed array.
3. The method of claim 2, further comprising: after deleting the element pairs whose value is zero, additionally deleting, based on a predetermined filtering threshold, the element pairs whose values are below the threshold.
4. The method of claim 1, wherein the data partitioning scheme and the transmission scheme are both determined based on processing unit topology.
5. The method of claim 4, wherein the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
6. The method of claim 5, wherein the data partitioning scheme is an even distribution based on the number of processing units; the transmission mode is ring transmission or ring all-reduce transmission; and the processing unit topology is a ring topology.
7. The method of claim 1, further comprising: while the transmission-mode-based aggregation is being performed, computing resources are additionally invoked to begin sparse compression of the next array.
8. The method of claim 1, further comprising: pre-establishing a transmission interface for the convolutional neural network, and performing the transmission-mode-based reduction and aggregation on the compressed array through the transmission interface.
9. A data transmission apparatus for a convolutional neural network, comprising:
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, divides data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performs the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at a source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at a target processing unit to recover the array.
10. The apparatus of claim 9, wherein the data partitioning scheme and the transmission scheme are both determined based on a processing unit topology; the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
CN202011104673.3A 2020-10-15 2020-10-15 Data transmission method and device of convolutional neural network Pending CN112261023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011104673.3A CN112261023A (en) 2020-10-15 2020-10-15 Data transmission method and device of convolutional neural network


Publications (1)

Publication Number Publication Date
CN112261023A 2021-01-22

Family

ID=74243614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011104673.3A Pending CN112261023A (en) 2020-10-15 2020-10-15 Data transmission method and device of convolutional neural network

Country Status (1)

Country Link
CN (1) CN112261023A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621514A (en) * 2009-07-24 2010-01-06 北京航空航天大学 Network data compressing method, network system and synthesis center equipment
US20150067009A1 (en) * 2013-08-30 2015-03-05 Microsoft Corporation Sparse matrix data structure
CN106775598A (en) * 2016-12-12 2017-05-31 温州大学 A kind of Symmetric Matrices method of the compression sparse matrix based on GPU
CN108229644A (en) * 2016-12-15 2018-06-29 上海寒武纪信息科技有限公司 The device of compression/de-compression neural network model, device and method
CN111699695A (en) * 2017-12-06 2020-09-22 V-诺瓦国际有限公司 Method and apparatus for decoding a received encoded data set
US20190190538A1 (en) * 2017-12-18 2019-06-20 Facebook, Inc. Accelerator hardware for compression and decompression
CN110134636A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 Model training method, server and computer readable storage medium
CN110377288A (en) * 2018-04-13 2019-10-25 赛灵思公司 Neural network compresses compiler and its compiling compression method
CN110909870A (en) * 2018-09-14 2020-03-24 中科寒武纪科技股份有限公司 Training device and method
CN111324630A (en) * 2020-03-04 2020-06-23 中科弘云科技(北京)有限公司 MPI-based neural network architecture search parallelization method and equipment
CN111737540A (en) * 2020-05-27 2020-10-02 中国科学院计算技术研究所 Graph data processing method and medium applied to distributed computing node cluster

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022222578A1 (en) * 2021-04-21 2022-10-27 华为技术有限公司 Aggregation communication method and system, and computer device

Similar Documents

Publication Publication Date Title
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
WO2021109699A1 (en) Artificial intelligence accelerator, device, chip and data processing method
WO2022105805A1 (en) Data processing method and in-memory computing chip
CN107066239A (en) A kind of hardware configuration for realizing convolutional neural networks forward calculation
WO2022001141A1 (en) Gpu communication method and device, and medium
CN112333234B (en) Distributed machine learning training method and device, electronic equipment and storage medium
US11948352B2 (en) Speculative training using partial gradients update
US20230244537A1 (en) Efficient gpu resource allocation optimization method and system
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN111079923A (en) Spark convolution neural network system suitable for edge computing platform and circuit thereof
CN114356578B (en) Parallel computing method, device, equipment and medium for natural language processing model
CN112817730A (en) Deep neural network service batch processing scheduling method and system and GPU
WO2022110860A1 (en) Hardware environment-based data operation method, apparatus and device, and storage medium
CN110600020B (en) Gradient transmission method and device
CN112261023A (en) Data transmission method and device of convolutional neural network
WO2020103883A1 (en) Method for executing matrix multiplication, circuit and soc
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN117273084A (en) Calculation method and device of neural network model, electronic equipment and storage medium
US20230083565A1 (en) Image data processing method and apparatus, storage medium, and electronic device
CN116431562B (en) Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
WO2020238106A1 (en) Data processing method, electronic apparatus, and computer-readable storage medium
US20230306236A1 (en) Device and method for executing lstm neural network operation
US20230128421A1 (en) Neural network accelerator
CN107894957B (en) Convolutional neural network-oriented memory data access and zero insertion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210122)