CN110503201A - Neural network distributed parallel training method and device - Google Patents

Neural network distributed parallel training method and device

Info

Publication number
CN110503201A
CN110503201A (Application CN201910810557.4A)
Authority
CN
China
Prior art keywords
layer
computing device
convolution
convolution kernel
programmable gate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910810557.4A
Other languages
Chinese (zh)
Inventor
高开
郭振华
曹芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201910810557.4A
Publication of CN110503201A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a neural network distributed parallel training method and device, comprising: dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices; independently performing, on each computing device, the convolution operations of each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution; and, starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer. The invention can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.

Description

Neural network distributed parallel training method and device
Technical field
The present invention relates to the field of computers, and more specifically to a neural network distributed parallel training method and device.
Background art
Deep learning has brought enormous progress to the field of artificial intelligence, but training a deep learning model requires a very large amount of computation. Completing one training run on a benchmark dataset such as ImageNet on a single machine with one modern GPU can take up to a week. Distributed training across multiple machines can reduce the training time, but the prior art lacks corresponding implementations.
With respect to the problem that the prior art lacks a practical distributed parallel training method, no effective solution has yet been proposed.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to propose a neural network distributed parallel training method and device that can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.
Based on the above purpose, a first aspect of the embodiments of the present invention provides a neural network distributed parallel training method, comprising:
dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
independently performing, on each computing device, the convolution operations of each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution;
starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
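For illustration only (this sketch is not part of the patent text): a minimal C++ example of the near-equal, channel-preserving split described above. It assumes channels are assigned in contiguous blocks so that per-device kernel counts differ by at most one; the function name partition_channels and the printed summary are hypothetical.

```cpp
#include <cstdio>
#include <vector>

// Split num_channels input channels of one layer across num_devices computing
// devices as evenly as possible (counts differ by at most one), so that all
// kernel slices of a given channel stay on a single device.
std::vector<std::vector<int>> partition_channels(int num_channels, int num_devices) {
    std::vector<std::vector<int>> assignment(num_devices);
    int base  = num_channels / num_devices;   // minimum channels per device
    int extra = num_channels % num_devices;   // first `extra` devices get one more
    int ch = 0;
    for (int d = 0; d < num_devices; ++d) {
        int count = base + (d < extra ? 1 : 0);
        for (int i = 0; i < count; ++i) assignment[d].push_back(ch++);
    }
    return assignment;
}

int main() {
    // Example from the detailed description: 28 input channels over 4 FPGA devices.
    auto plan = partition_channels(28, 4);
    for (std::size_t d = 0; d < plan.size(); ++d)
        std::printf("device %zu holds %zu channels\n", d, plan[d].size());
    return 0;
}
```

With 28 channels and 4 devices each device holds 7 channels, matching the example in the detailed description; with, say, 30 channels the first two devices would hold 8 channels and the remaining two 7, keeping the load nearly balanced.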
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
In some embodiments, the method further comprises: invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
A second aspect of the embodiments of the present invention provides a neural network distributed parallel training device, comprising:
a distribution module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform, on each computing device, the convolution operations of each layer based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution;
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
A third aspect of the embodiments of the present invention provides a field programmable gate array cluster, comprising:
multiple field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, performs the above neural network distributed parallel training method.
The present invention has the following advantageous effects: the neural network distributed parallel training method and device provided by the embodiments of the present invention, by dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices, independently performing the convolution operations of each layer on each computing device based on the assigned convolution kernels and passing the newly generated features to the next layer for further convolution, and back-propagating the loss error from the last layer and updating the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow diagram of the neural network distributed parallel training method provided by the invention;
Fig. 2 is a schematic diagram of the convolution operation in prior-art deep learning model training;
Fig. 3 is a schematic diagram of the convolution operation in the neural network distributed parallel training method provided by the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are further described below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the embodiments of the present invention, all expressions using "first" and "second" are intended to distinguish two entities or parameters that have the same name but are not identical. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; the subsequent embodiments will not explain this again one by one.
Based on the above purpose, a first aspect of the embodiments of the present invention proposes an embodiment of a neural network distributed parallel training method that reduces the training time. Fig. 1 shows a flow diagram of the neural network distributed parallel training method provided by the invention.
As shown in Fig. 1, the neural network distributed parallel training method comprises:
Step S101: dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
Step S103: independently performing, on each computing device, the convolution operations of each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution;
Step S105: starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
The embodiments of the present invention propose a distributed parallel training algorithm based on an FPGA (field programmable gate array) cluster platform: the convolution operations of a deep learning network model are assigned to different FPGA devices in the cluster by a specific partitioning method, so that each FPGA device reaches a load-balanced state. The results show that this method scales well on large FPGA clusters. With 6 transmitters configured on each FPGA, the distributed training performance increases linearly with the number of FPGA devices. In terms of energy consumption, compared with a comparable graphics processor cluster, this method performs on average 6.36 times better than the graphics processor cluster.
The embodiments of the present invention design a reasonable deep learning model partition strategy so that the whole network model reaches a load-balanced state on the FPGA cluster, and design a reasonable OpenCL description of the distributed parallel training algorithm for deep learning, so that it can be mapped to a more efficient FPGA hardware circuit structure. In this way the traditional single-device training algorithm is executed in a parallel, pipelined manner on multiple FPGA devices, which improves the performance of distributed training of deep learning models.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program can achieve effects identical or similar to those of any of the corresponding foregoing method embodiments.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
In some embodiments, the method further comprises: invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration. The host-side program runs on a general-purpose central processor, and the central processor and the field programmable gate array are connected through a high-speed serial computer expansion bus standard (PCIe) connection.
In the embodiments of the present invention, the convolution operations in deep learning model training are divided evenly across the FPGA cluster by input channel. The distributed training algorithm is then described in the OpenCL high-level language, the kernel program files are compiled and synthesized with the Altera SDK for OpenCL (AOC) high-level synthesis tool, and an AOCX file that can run on the FPGA is generated. Finally, a host-side program running on the central processor invokes the distributed training algorithm hardware circuit on the FPGA for hardware acceleration; the central processor and the FPGA are connected through a high-speed serial computer expansion bus standard (PCIe) interface for data communication, and the DDR3 memory on the FPGA development board is used as the data cache.
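For illustration only (not part of the patent text): a condensed host-side C++ sketch of the flow just described, using the standard OpenCL C API exposed by the Altera/Intel FPGA SDK for OpenCL: read a pre-synthesized AOCX image, create a program from the binary, and launch a kernel on the FPGA. The file name conv_forward.aocx and kernel name conv_forward are hypothetical, the sketch assumes a kernel that takes no arguments, and error handling is reduced to a single check macro.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(err) do { cl_int e_ = (err); if (e_ != CL_SUCCESS) { \
    std::fprintf(stderr, "OpenCL error %d at line %d\n", e_, __LINE__); std::exit(1); } } while (0)

// Read the AOCX image produced offline by the AOC high-level synthesis compiler.
static std::vector<unsigned char> read_binary(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) { std::perror("fopen"); std::exit(1); }
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> buf(static_cast<std::size_t>(size));
    if (std::fread(buf.data(), 1, buf.size(), f) != buf.size()) { std::perror("fread"); std::exit(1); }
    std::fclose(f);
    return buf;
}

int main() {
    cl_int err;
    cl_platform_id platform;   // e.g. the FPGA SDK for OpenCL platform
    CHECK(clGetPlatformIDs(1, &platform, nullptr));

    cl_device_id device;       // first FPGA accelerator board found
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr));

    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err); CHECK(err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err); CHECK(err);

    // FPGA kernels are loaded as pre-synthesized binaries rather than compiled from source.
    std::vector<unsigned char> image = read_binary("conv_forward.aocx");  // hypothetical file name
    const unsigned char* image_ptr = image.data();
    std::size_t image_size = image.size();
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &image_size,
                                                   &image_ptr, nullptr, &err); CHECK(err);
    CHECK(clBuildProgram(program, 1, &device, "", nullptr, nullptr));

    cl_kernel kernel = clCreateKernel(program, "conv_forward", &err); CHECK(err);  // hypothetical name

    // Buffers placed in the board's DDR3 memory would act as the data cache described
    // above; creating them, transferring data, and setting kernel arguments are omitted
    // (this sketch assumes a kernel with no arguments).
    std::size_t global = 1;    // many AOC kernels run as a single-work-item pipeline
    CHECK(clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr));
    CHECK(clFinish(queue));

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
```

In this flow the kernel source is compiled offline into the AOCX binary, so clCreateProgramWithBinary plus clBuildProgram replaces the usual compile-from-source path; a real host program would additionally enqueue PCIe data transfers to and from the board's DDR3 buffers before and after the kernel launch.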
The method disclosed according to the embodiments of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. When the computer program is executed by the CPU, it performs the above functions defined in the method disclosed in the embodiments of the present invention. The above method steps and system units may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to implement the above steps or unit functions.
The present invention is further explained below according to a specific embodiment.
Referring to Fig. 2, traditional deep learning model training mainly includes the following steps:
(1) Forward computation: starting from the first layer, the input features are convolved with the convolution kernels of the different channels, and the newly generated features are passed to the next layer for a new convolution operation, up to the last layer.
(2) Back-propagation: starting from the last layer, the computed loss error is propagated backwards.
(3) Gradient update: the weights of each layer are updated according to the gradients computed during back-propagation.
In contrast, the FPGA-cluster-based distributed training process of a deep learning model according to the embodiment of the present invention is shown in Fig. 3 and mainly comprises the following steps:
(1) Model partition: each layer of the model is divided across the different FPGA devices according to the different channels of the convolution kernels. Assuming the convolution kernels have 28 input channels and there are 4 FPGA devices, each device holds 7 channels.
(2) Forward computation: starting from the first layer, convolution is performed on each FPGA device between the input features and the convolution kernels held on that device; the features generated on the individual devices are then fused into one new feature, which is passed to the next layer for a new convolution operation, up to the last layer (a minimal numerical sketch of this partition-and-fuse step is given after these steps).
(3) Back-propagation: starting from the last layer, the computed loss error is propagated backwards.
(4) Gradient update: the weights of each layer are updated according to the gradients computed during back-propagation.
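For illustration only (not part of the patent text): a minimal C++ sketch of step (2), under the assumption that "fusing" the per-device features means element-wise summation of the partial results, which is what a split over input channels requires mathematically. It uses one 3x3 output filter, 4 input channels, and 2 simulated devices; all names and sizes are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Map = std::vector<std::vector<float>>;  // one 2D feature map

// Valid (no padding, stride 1) KxK convolution of a multi-channel input with one
// output filter, restricted to a subset of input channels. Each simulated device
// calls this with its own channel subset and the matching kernel slices.
Map conv_subset(const std::vector<Map>& input, const std::vector<Map>& kernel,
                const std::vector<int>& channels, int K = 3) {
    int H = static_cast<int>(input[0].size());
    int W = static_cast<int>(input[0][0].size());
    Map out(H - K + 1, std::vector<float>(W - K + 1, 0.0f));
    for (int c : channels)
        for (int y = 0; y + K <= H; ++y)
            for (int x = 0; x + K <= W; ++x)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        out[y][x] += input[c][y + ky][x + kx] * kernel[c][ky][kx];
    return out;
}

int main() {
    const int C = 4, H = 5, W = 5, K = 3;
    std::vector<Map> input(C, Map(H, std::vector<float>(W)));
    std::vector<Map> kernel(C, Map(K, std::vector<float>(K)));
    for (int c = 0; c < C; ++c)                 // arbitrary deterministic test data
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                input[c][y][x] = 0.1f * (c + 1) * (y + x + 1);
    for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                kernel[c][ky][kx] = 0.01f * (c + ky - kx);

    // "Device 0" holds channels {0,1}, "device 1" holds channels {2,3}.
    Map part0 = conv_subset(input, kernel, {0, 1});
    Map part1 = conv_subset(input, kernel, {2, 3});
    Map full  = conv_subset(input, kernel, {0, 1, 2, 3});

    // Fusion: element-wise sum of the partial maps reproduces the full convolution.
    float max_diff = 0.0f;
    for (std::size_t y = 0; y < full.size(); ++y)
        for (std::size_t x = 0; x < full[0].size(); ++x)
            max_diff = std::max(max_diff,
                                std::fabs(part0[y][x] + part1[y][x] - full[y][x]));
    std::printf("max |fused - full| = %g\n", max_diff);  // ~0 up to float rounding
    return 0;
}
```

Because the fusion is a plain addition, only the small partial output maps, not the raw inputs or the kernel weights, need to be exchanged between devices before the next layer starts.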
From above-described embodiment as can be seen that neural network distributed parallel training method provided in an embodiment of the present invention, leads to It crosses the convolution kernel in each layer by deep learning model in same channel and is divided into same calculating equipment;It is opened from first layer Begin, in each calculatings equipment independently based on convolution kernel execution convolution operation, by newly-generated feature be passed to next layer after Continuous convolution, to the last one layer;Since the last layer backpropagation lose error and update each layer gradient weight skill Art scheme can reduce the training time of distributed training parallel calculating method, improve the degree of parallelism of algorithm and improve throughput And performance.
It is important to note that the steps in the embodiments of the above neural network distributed parallel training method can be interleaved, replaced, added, or deleted; therefore, these reasonable permutations, combinations, and transformations of the neural network distributed parallel training method should also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the described embodiments.
Based on the above purpose, a second aspect of the embodiments of the present invention proposes an embodiment of a neural network distributed parallel training device that reduces the training time. The neural network distributed parallel training device comprises:
a distribution module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform, on each computing device, the convolution operations of each layer based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution;
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the functions of various exemplary components, blocks, modules, circuits, and steps have been described above in general terms. Whether such functions are implemented as software or as hardware depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the functions in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope disclosed by the embodiments of the present invention.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
Based on the above purpose, a third aspect of the embodiments of the present invention proposes an embodiment of a field programmable gate array cluster for neural network distributed parallel training that reduces the training time. The field programmable gate array cluster comprises:
multiple field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, performs the above neural network distributed parallel training method (a host-side sketch of enumerating such devices follows this list).
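For illustration only (not part of the patent text): a short host-side C++ sketch, under the same OpenCL assumptions as the earlier sketch, showing how the processor of such a cluster node might enumerate every FPGA board exposed by the platform and create one command queue per device, so that each board can run its share of the channel-partitioned model.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_platform_id platform;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;

    // Ask how many accelerator (FPGA) devices the platform exposes on this node.
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 0, nullptr, &num_devices);
    if (num_devices == 0) { std::puts("no FPGA devices found"); return 1; }

    std::vector<cl_device_id> devices(num_devices);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, num_devices, devices.data(), nullptr);

    cl_int err;
    cl_context context = clCreateContext(nullptr, num_devices, devices.data(),
                                         nullptr, nullptr, &err);
    if (err != CL_SUCCESS) return 1;

    // One in-order queue per FPGA: each board independently convolves its own
    // subset of input channels for every layer of the partitioned model.
    std::vector<cl_command_queue> queues;
    for (cl_uint d = 0; d < num_devices; ++d) {
        cl_command_queue q = clCreateCommandQueue(context, devices[d], 0, &err);
        if (err == CL_SUCCESS) queues.push_back(q);
    }
    std::printf("created %zu command queues for %u FPGA devices\n",
                queues.size(), num_devices);

    for (cl_command_queue q : queues) clReleaseCommandQueue(q);
    clReleaseContext(context);
    return 0;
}
```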
It can be seen from the above embodiments that the neural network distributed parallel training device and the field programmable gate array cluster provided by the embodiments of the present invention, by dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices, independently performing the convolution operations of each layer on each computing device based on the assigned convolution kernels and passing the newly generated features to the next layer for further convolution, and back-propagating the loss error from the last layer and updating the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.
It is important to note that the embodiments of the above neural network distributed parallel training device and field programmable gate array cluster use the embodiments of the neural network distributed parallel training method to illustrate the working process of each module, and those skilled in the art can readily conceive of applying these modules to other embodiments of the neural network distributed parallel training method. Of course, since the steps in the embodiments of the neural network distributed parallel training method can be interleaved, replaced, added, or deleted, these reasonable permutations, combinations, and transformations of the neural network distributed parallel training device and field programmable gate array cluster should also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the described embodiments.
The above are exemplary embodiments disclosed by the present invention. It should be noted that many modifications and variations may be made without departing from the scope disclosed by the embodiments of the present invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items. The serial numbers of the disclosed embodiments are for description only and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope disclosed by the embodiments of the present invention (including the claims) is limited to these examples. Under the concept of the embodiments of the present invention, the technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the present invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc., made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A neural network distributed parallel training method, characterized by comprising the following steps:
dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels, and passing the newly generated features to the next layer for further convolution;
starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
2. The method according to claim 1, characterized in that dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
3. The method according to claim 1, characterized in that the computing device is a field programmable gate array.
4. The method according to claim 3, characterized in that independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
5. The method according to claim 4, characterized by further comprising: invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
6. A neural network distributed parallel training device, characterized by comprising:
a distribution module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform, on each computing device, the convolution operations of each layer based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution;
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
7. The device according to claim 6, characterized in that dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
8. The device according to claim 6, characterized in that the computing device is a field programmable gate array.
9. The device according to claim 8, characterized in that independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
10. A field programmable gate array cluster, characterized by comprising:
multiple field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, performs the neural network distributed parallel training method according to any one of claims 1-8.
CN201910810557.4A 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device Pending CN110503201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810557.4A CN110503201A (en) 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910810557.4A CN110503201A (en) 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device

Publications (1)

Publication Number Publication Date
CN110503201A true CN110503201A (en) 2019-11-26

Family

ID=68590494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910810557.4A Pending CN110503201A (en) 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device

Country Status (1)

Country Link
CN (1) CN110503201A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736986A (en) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) accelerated execution method of deep learning model and related device
WO2022001134A1 (en) * 2020-06-28 2022-01-06 浪潮电子信息产业股份有限公司 Load balancing method, apparatus and device for parallel model training task, and storage medium
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN107027036A (en) * 2017-05-12 2017-08-08 郑州云海信息技术有限公司 A kind of FPGA isomeries accelerate decompression method, the apparatus and system of platform
CN107609646A (en) * 2017-10-12 2018-01-19 郑州云海信息技术有限公司 A kind of residual error network implementation approach, system, equipment and computer-readable storage medium
CN108764466A (en) * 2018-03-07 2018-11-06 东南大学 Convolutional neural networks hardware based on field programmable gate array and its accelerated method
KR20180125843A (en) * 2017-05-16 2018-11-26 광운대학교 산학협력단 A hardware classifier applicable to various CNN models
CN109740731A (en) * 2018-12-15 2019-05-10 华南理工大学 A kind of adaptive convolutional layer hardware accelerator design method
CN109993299A (en) * 2017-12-29 2019-07-09 中兴通讯股份有限公司 Data training method and device, storage medium, electronic device
CN110084356A (en) * 2018-01-26 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network data processing method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN107027036A (en) * 2017-05-12 2017-08-08 郑州云海信息技术有限公司 A kind of FPGA isomeries accelerate decompression method, the apparatus and system of platform
KR20180125843A (en) * 2017-05-16 2018-11-26 광운대학교 산학협력단 A hardware classifier applicable to various CNN models
CN107609646A (en) * 2017-10-12 2018-01-19 郑州云海信息技术有限公司 A kind of residual error network implementation approach, system, equipment and computer-readable storage medium
CN109993299A (en) * 2017-12-29 2019-07-09 中兴通讯股份有限公司 Data training method and device, storage medium, electronic device
CN110084356A (en) * 2018-01-26 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network data processing method and device
CN108764466A (en) * 2018-03-07 2018-11-06 东南大学 Convolutional neural networks hardware based on field programmable gate array and its accelerated method
CN109740731A (en) * 2018-12-15 2019-05-10 华南理工大学 A kind of adaptive convolutional layer hardware accelerator design method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUELEI LI ET AL.: "FPGA Accelerates Deep Residual Learning for Image Recognition", 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) *
HONG QIFEI: "Research on an FPGA Hardware Acceleration Platform for Deep Learning", China Excellent Doctoral and Master's Dissertations Full-text Database (Master's), Information Science and Technology series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736986A (en) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) accelerated execution method of deep learning model and related device
CN111736986B (en) * 2020-05-29 2023-06-23 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) acceleration execution method and related device of deep learning model
WO2022001134A1 (en) * 2020-06-28 2022-01-06 浪潮电子信息产业股份有限公司 Load balancing method, apparatus and device for parallel model training task, and storage medium
US11868817B2 (en) 2020-06-28 2024-01-09 Inspur Electronic Information Industry Co., Ltd. Load balancing method, apparatus and device for parallel model training task, and storage medium
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture
CN116303108B (en) * 2022-09-07 2024-05-14 芯砺智能科技(上海)有限公司 Weight address arrangement method suitable for parallel computing architecture

Similar Documents

Publication Publication Date Title
CN110503201A (en) A kind of neural network distributed parallel training method and device
CN107578095B (en) Neural computing device and processor comprising the computing device
CN107578098A (en) Neural network processor based on systolic arrays
CN109190756A (en) Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN107797962A (en) Computing array based on neutral net
KR20130090147A (en) Neural network computing apparatus and system, and method thereof
CN103324850B (en) Twice polycondensation parallel method of finite element two-stage subregion based on multifile stream
CN106201651A (en) The simulator of neuromorphic chip
JPH08508838A (en) New finite element method and analyzer
CN103617150A (en) GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN103345580B (en) Based on the parallel CFD method of lattice Boltzmann method
CN108509270A (en) The high performance parallel implementation method of K-means algorithms on a kind of domestic 26010 many-core processor of Shen prestige
CN109472361A (en) Neural network optimization
CN115310566A (en) Distributed training system, method, device, equipment and readable storage medium
CN106843997B (en) A kind of parallel virtual machine polymerization based on Spark with optimization MBBO algorithms
CN110333946A (en) One kind being based on artificial intelligence cpu data processing system and method
CN111125963A (en) Numerical simulation system and method based on Lagrange integral point finite element
CN110415160A (en) A kind of GPU topology partition method and device
CN110580519A (en) Convolution operation structure and method thereof
Rosenthal Monotonicity of the core and value in dynamic cooperative games
CN104536831A (en) Multi-core SoC software mapping method based on multi-objective optimization
Vaughan et al. Enabling tractable exploration of the performance of adaptive mesh refinement
Oliker et al. Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2
Xu et al. Balancing cpu-gpu collaborative high-order cfd simulations on the tianhe-1a supercomputer
CN105531602A (en) System and method of implementing finite difference time domain models with multiple accelerated processing components (APCs)

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20191126