CN110503201A - Neural network distributed parallel training method and device - Google Patents

Neural network distributed parallel training method and device

Info

Publication number
CN110503201A
CN110503201A (Application CN201910810557.4A)
Authority
CN
China
Prior art keywords
layer
computing device
convolution
convolution kernel
programmable gate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910810557.4A
Other languages
Chinese (zh)
Inventor
高开
郭振华
曹芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201910810557.4A
Publication of CN110503201A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a neural network distributed parallel training method and device, comprising: dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices; independently performing, on each computing device, the convolution operations of each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution; and, starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer. The invention can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.

Description

Neural network distributed parallel training method and device
Technical field
The present invention relates to the field of computers, and more specifically to a neural network distributed parallel training method and device.
Background art
Deep learning has brought enormous progress to the field of artificial intelligence, but training a deep learning model requires a very large amount of computation. Completing one training run on a benchmark dataset such as ImageNet on a single machine with one modern GPU can take up to a week. Distributed training across multiple machines can reduce the training time, but the prior art lacks corresponding implementations.
With respect to the problem that the prior art lacks a practical distributed parallel training method, no effective solution has yet been proposed.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to propose a neural network distributed parallel training method and device that can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.
Based on the above purpose, a first aspect of the embodiments of the present invention provides a neural network distributed parallel training method, comprising:
dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
independently performing, on each computing device, the convolution operations of each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution;
starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
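For illustration only (this sketch is not part of the patent text): a minimal C++ example of the near-equal, channel-preserving split described above. It assumes channels are assigned in contiguous blocks so that per-device kernel counts differ by at most one; the function name partition_channels and the printed summary are hypothetical.

```cpp
#include <cstdio>
#include <vector>

// Split num_channels input channels of one layer across num_devices computing
// devices as evenly as possible (counts differ by at most one), so that all
// kernel slices of a given channel stay on a single device.
std::vector<std::vector<int>> partition_channels(int num_channels, int num_devices) {
    std::vector<std::vector<int>> assignment(num_devices);
    int base  = num_channels / num_devices;   // minimum channels per device
    int extra = num_channels % num_devices;   // first `extra` devices get one more
    int ch = 0;
    for (int d = 0; d < num_devices; ++d) {
        int count = base + (d < extra ? 1 : 0);
        for (int i = 0; i < count; ++i) assignment[d].push_back(ch++);
    }
    return assignment;
}

int main() {
    // Example from the detailed description: 28 input channels over 4 FPGA devices.
    auto plan = partition_channels(28, 4);
    for (std::size_t d = 0; d < plan.size(); ++d)
        std::printf("device %zu holds %zu channels\n", d, plan[d].size());
    return 0;
}
```

With 28 channels and 4 devices each device holds 7 channels, matching the example in the detailed description; with, say, 30 channels the first two devices would hold 8 channels and the remaining two 7, keeping the load nearly balanced.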
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
In some embodiments, the method further comprises: invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
A second aspect of the embodiments of the present invention provides a neural network distributed parallel training device, comprising:
a distribution module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform, on each computing device, the convolution operations of each layer based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution;
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
A third aspect of the embodiments of the present invention provides a field programmable gate array cluster, comprising:
multiple field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, performs the above neural network distributed parallel training method.
The present invention has the following advantageous effects: the neural network distributed parallel training method and device provided by the embodiments of the present invention, by dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices, independently performing the convolution operations of each layer on each computing device based on the assigned convolution kernels and passing the newly generated features to the next layer for further convolution, and back-propagating the loss error from the last layer and updating the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow diagram of the neural network distributed parallel training method provided by the invention;
Fig. 2 is a schematic diagram of the convolution operation in prior-art deep learning model training;
Fig. 3 is a schematic diagram of the convolution operation in the neural network distributed parallel training method provided by the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are further described below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the embodiments of the present invention, all expressions using "first" and "second" are intended to distinguish two entities or parameters that have the same name but are not identical. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; the subsequent embodiments will not explain this again one by one.
Based on the above purpose, a first aspect of the embodiments of the present invention proposes an embodiment of a neural network distributed parallel training method that reduces the training time. Fig. 1 shows a flow diagram of the neural network distributed parallel training method provided by the invention.
As shown in Fig. 1, the neural network distributed parallel training method comprises:
Step S101: dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
Step S103: independently performing, on each computing device, the convolution operations of each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution;
Step S105: starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
The embodiments of the present invention propose a distributed parallel training algorithm based on an FPGA (field programmable gate array) cluster platform: the convolution operations of a deep learning network model are assigned to different FPGA devices in the cluster by a specific partitioning method, so that each FPGA device reaches a load-balanced state. The results show that this method scales well on large FPGA clusters. With 6 transmitters configured on each FPGA, the distributed training performance increases linearly with the number of FPGA devices. In terms of energy consumption, compared with a comparable graphics processor cluster, this method performs on average 6.36 times better than the graphics processor cluster.
The embodiments of the present invention design a reasonable deep learning model partition strategy so that the whole network model reaches a load-balanced state on the FPGA cluster, and design a reasonable OpenCL description of the distributed parallel training algorithm for deep learning, so that it can be mapped to a more efficient FPGA hardware circuit structure. In this way the traditional single-device training algorithm is executed in a parallel, pipelined manner on multiple FPGA devices, which improves the performance of distributed training of deep learning models.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program can achieve effects identical or similar to those of any of the corresponding foregoing method embodiments.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
In some embodiments, the method further comprises: invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration. The host-side program runs on a general-purpose central processor, and the central processor and the field programmable gate array are connected through a high-speed serial computer expansion bus standard (PCIe) connection.
In the embodiments of the present invention, the convolution operations in deep learning model training are divided evenly across the FPGA cluster by input channel. The distributed training algorithm is then described in the OpenCL high-level language, the kernel program files are compiled and synthesized with the Altera SDK for OpenCL (AOC) high-level synthesis tool, and an AOCX file that can run on the FPGA is generated. Finally, a host-side program running on the central processor invokes the distributed training algorithm hardware circuit on the FPGA for hardware acceleration; the central processor and the FPGA are connected through a high-speed serial computer expansion bus standard (PCIe) interface for data communication, and the DDR3 memory on the FPGA development board is used as the data cache.
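For illustration only (not part of the patent text): a condensed host-side C++ sketch of the flow just described, using the standard OpenCL C API exposed by the Altera/Intel FPGA SDK for OpenCL: read a pre-synthesized AOCX image, create a program from the binary, and launch a kernel on the FPGA. The file name conv_forward.aocx and kernel name conv_forward are hypothetical, the sketch assumes a kernel that takes no arguments, and error handling is reduced to a single check macro.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(err) do { cl_int e_ = (err); if (e_ != CL_SUCCESS) { \
    std::fprintf(stderr, "OpenCL error %d at line %d\n", e_, __LINE__); std::exit(1); } } while (0)

// Read the AOCX image produced offline by the AOC high-level synthesis compiler.
static std::vector<unsigned char> read_binary(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) { std::perror("fopen"); std::exit(1); }
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> buf(static_cast<std::size_t>(size));
    if (std::fread(buf.data(), 1, buf.size(), f) != buf.size()) { std::perror("fread"); std::exit(1); }
    std::fclose(f);
    return buf;
}

int main() {
    cl_int err;
    cl_platform_id platform;   // e.g. the FPGA SDK for OpenCL platform
    CHECK(clGetPlatformIDs(1, &platform, nullptr));

    cl_device_id device;       // first FPGA accelerator board found
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr));

    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err); CHECK(err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err); CHECK(err);

    // FPGA kernels are loaded as pre-synthesized binaries rather than compiled from source.
    std::vector<unsigned char> image = read_binary("conv_forward.aocx");  // hypothetical file name
    const unsigned char* image_ptr = image.data();
    std::size_t image_size = image.size();
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &image_size,
                                                   &image_ptr, nullptr, &err); CHECK(err);
    CHECK(clBuildProgram(program, 1, &device, "", nullptr, nullptr));

    cl_kernel kernel = clCreateKernel(program, "conv_forward", &err); CHECK(err);  // hypothetical name

    // Buffers placed in the board's DDR3 memory would act as the data cache described
    // above; creating them, transferring data, and setting kernel arguments are omitted
    // (this sketch assumes a kernel with no arguments).
    std::size_t global = 1;    // many AOC kernels run as a single-work-item pipeline
    CHECK(clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr));
    CHECK(clFinish(queue));

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
```

In this flow the kernel source is compiled offline into the AOCX binary, so clCreateProgramWithBinary plus clBuildProgram replaces the usual compile-from-source path; a real host program would additionally enqueue PCIe data transfers to and from the board's DDR3 buffers before and after the kernel launch.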
The method disclosed according to the embodiments of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. When the computer program is executed by the CPU, it performs the above functions defined in the method disclosed in the embodiments of the present invention. The above method steps and system units may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to implement the above steps or unit functions.
The present invention is further explained below according to a specific embodiment.
Referring to Fig. 2, traditional deep learning model training mainly includes the following steps:
(1) Forward computation: starting from the first layer, the input features are convolved with the convolution kernels of the different channels, and the newly generated features are passed to the next layer for a new convolution operation, up to the last layer.
(2) Back-propagation: starting from the last layer, the computed loss error is propagated backwards.
(3) Gradient update: the weights of each layer are updated according to the gradients computed during back-propagation.
In contrast, the FPGA-cluster-based distributed training process of a deep learning model according to the embodiment of the present invention is shown in Fig. 3 and mainly comprises the following steps:
(1) Model partition: each layer of the model is divided across the different FPGA devices according to the different channels of the convolution kernels. Assuming the convolution kernels have 28 input channels and there are 4 FPGA devices, each device holds 7 channels.
(2) Forward computation: starting from the first layer, convolution is performed on each FPGA device between the input features and the convolution kernels held on that device; the features generated on the individual devices are then fused into one new feature, which is passed to the next layer for a new convolution operation, up to the last layer (a minimal numerical sketch of this partition-and-fuse step is given after these steps).
(3) Back-propagation: starting from the last layer, the computed loss error is propagated backwards.
(4) Gradient update: the weights of each layer are updated according to the gradients computed during back-propagation.
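For illustration only (not part of the patent text): a minimal C++ sketch of step (2), under the assumption that "fusing" the per-device features means element-wise summation of the partial results, which is what a split over input channels requires mathematically. It uses one 3x3 output filter, 4 input channels, and 2 simulated devices; all names and sizes are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Map = std::vector<std::vector<float>>;  // one 2D feature map

// Valid (no padding, stride 1) KxK convolution of a multi-channel input with one
// output filter, restricted to a subset of input channels. Each simulated device
// calls this with its own channel subset and the matching kernel slices.
Map conv_subset(const std::vector<Map>& input, const std::vector<Map>& kernel,
                const std::vector<int>& channels, int K = 3) {
    int H = static_cast<int>(input[0].size());
    int W = static_cast<int>(input[0][0].size());
    Map out(H - K + 1, std::vector<float>(W - K + 1, 0.0f));
    for (int c : channels)
        for (int y = 0; y + K <= H; ++y)
            for (int x = 0; x + K <= W; ++x)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        out[y][x] += input[c][y + ky][x + kx] * kernel[c][ky][kx];
    return out;
}

int main() {
    const int C = 4, H = 5, W = 5, K = 3;
    std::vector<Map> input(C, Map(H, std::vector<float>(W)));
    std::vector<Map> kernel(C, Map(K, std::vector<float>(K)));
    for (int c = 0; c < C; ++c)                 // arbitrary deterministic test data
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                input[c][y][x] = 0.1f * (c + 1) * (y + x + 1);
    for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                kernel[c][ky][kx] = 0.01f * (c + ky - kx);

    // "Device 0" holds channels {0,1}, "device 1" holds channels {2,3}.
    Map part0 = conv_subset(input, kernel, {0, 1});
    Map part1 = conv_subset(input, kernel, {2, 3});
    Map full  = conv_subset(input, kernel, {0, 1, 2, 3});

    // Fusion: element-wise sum of the partial maps reproduces the full convolution.
    float max_diff = 0.0f;
    for (std::size_t y = 0; y < full.size(); ++y)
        for (std::size_t x = 0; x < full[0].size(); ++x)
            max_diff = std::max(max_diff,
                                std::fabs(part0[y][x] + part1[y][x] - full[y][x]));
    std::printf("max |fused - full| = %g\n", max_diff);  // ~0 up to float rounding
    return 0;
}
```

Because the fusion is a plain addition, only the small partial output maps, not the raw inputs or the kernel weights, need to be exchanged between devices before the next layer starts.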
From above-described embodiment as can be seen that neural network distributed parallel training method provided in an embodiment of the present invention, leads to It crosses the convolution kernel in each layer by deep learning model in same channel and is divided into same calculating equipment;It is opened from first layer Begin, in each calculatings equipment independently based on convolution kernel execution convolution operation, by newly-generated feature be passed to next layer after Continuous convolution, to the last one layer;Since the last layer backpropagation lose error and update each layer gradient weight skill Art scheme can reduce the training time of distributed training parallel calculating method, improve the degree of parallelism of algorithm and improve throughput And performance.
It is important to note that the steps in the embodiments of the above neural network distributed parallel training method can be interleaved, replaced, added, or deleted; therefore, these reasonable permutations, combinations, and transformations of the neural network distributed parallel training method should also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the described embodiments.
Based on the above purpose, a second aspect of the embodiments of the present invention proposes an embodiment of a neural network distributed parallel training device that reduces the training time. The neural network distributed parallel training device comprises:
a distribution module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform, on each computing device, the convolution operations of each layer based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution;
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the functions of various exemplary components, blocks, modules, circuits, and steps have been described above in general terms. Whether such functions are implemented as software or as hardware depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the functions in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope disclosed by the embodiments of the present invention.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
In some embodiments, the computing device is a field programmable gate array.
In some embodiments, independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
Based on the above purpose, a third aspect of the embodiments of the present invention proposes an embodiment of a field programmable gate array cluster for neural network distributed parallel training that reduces the training time. The field programmable gate array cluster comprises:
multiple field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, performs the above neural network distributed parallel training method (a host-side sketch of enumerating such devices follows this list).
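For illustration only (not part of the patent text): a short host-side C++ sketch, under the same OpenCL assumptions as the earlier sketch, showing how the processor of such a cluster node might enumerate every FPGA board exposed by the platform and create one command queue per device, so that each board can run its share of the channel-partitioned model.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_platform_id platform;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;

    // Ask how many accelerator (FPGA) devices the platform exposes on this node.
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 0, nullptr, &num_devices);
    if (num_devices == 0) { std::puts("no FPGA devices found"); return 1; }

    std::vector<cl_device_id> devices(num_devices);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, num_devices, devices.data(), nullptr);

    cl_int err;
    cl_context context = clCreateContext(nullptr, num_devices, devices.data(),
                                         nullptr, nullptr, &err);
    if (err != CL_SUCCESS) return 1;

    // One in-order queue per FPGA: each board independently convolves its own
    // subset of input channels for every layer of the partitioned model.
    std::vector<cl_command_queue> queues;
    for (cl_uint d = 0; d < num_devices; ++d) {
        cl_command_queue q = clCreateCommandQueue(context, devices[d], 0, &err);
        if (err == CL_SUCCESS) queues.push_back(q);
    }
    std::printf("created %zu command queues for %u FPGA devices\n",
                queues.size(), num_devices);

    for (cl_command_queue q : queues) clReleaseCommandQueue(q);
    clReleaseContext(context);
    return 0;
}
```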
It can be seen from the above embodiments that the neural network distributed parallel training device and the field programmable gate array cluster provided by the embodiments of the present invention, by dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices, independently performing the convolution operations of each layer on each computing device based on the assigned convolution kernels and passing the newly generated features to the next layer for further convolution, and back-propagating the loss error from the last layer and updating the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the degree of parallelism of the algorithm, and improve throughput and performance.
It is important to note that the embodiments of the above neural network distributed parallel training device and field programmable gate array cluster use the embodiments of the neural network distributed parallel training method to illustrate the working process of each module, and those skilled in the art can readily conceive of applying these modules to other embodiments of the neural network distributed parallel training method. Of course, since the steps in the embodiments of the neural network distributed parallel training method can be interleaved, replaced, added, or deleted, these reasonable permutations, combinations, and transformations of the neural network distributed parallel training device and field programmable gate array cluster should also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the described embodiments.
The above are exemplary embodiments disclosed by the present invention. It should be noted that many modifications and variations may be made without departing from the scope disclosed by the embodiments of the present invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items. The serial numbers of the disclosed embodiments are for description only and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope disclosed by the embodiments of the present invention (including the claims) is limited to these examples. Under the concept of the embodiments of the present invention, the technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the present invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc., made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A neural network distributed parallel training method, characterized by comprising the following steps:
dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels, and passing the newly generated features to the next layer for further convolution;
starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
2. The method according to claim 1, characterized in that dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
3. The method according to claim 1, characterized in that the computing device is a field programmable gate array.
4. The method according to claim 3, characterized in that independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
5. The method according to claim 4, characterized by further comprising: invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
6. A neural network distributed parallel training device, characterized by comprising:
a distribution module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform, on each computing device, the convolution operations of each layer based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution;
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
7. The device according to claim 6, characterized in that dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping each channel on a single device, assigning each computing device as nearly equal a number of convolution kernels as possible, so as to balance the load among the computing devices.
8. The device according to claim 6, characterized in that the computing device is a field programmable gate array.
9. The device according to claim 8, characterized in that independently performing, on each computing device, the convolution operations of each layer based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the distributed parallel training algorithm hardware circuit in the field programmable gate array to execute the AOCX file with hardware acceleration.
10. A field programmable gate array cluster, characterized by comprising:
multiple field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, performs the neural network distributed parallel training method according to any one of claims 1-8.
CN201910810557.4A 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device Pending CN110503201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810557.4A CN110503201A (en) 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910810557.4A CN110503201A (en) 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device

Publications (1)

Publication Number Publication Date
CN110503201A true CN110503201A (en) 2019-11-26

Family

ID=68590494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910810557.4A Pending CN110503201A (en) 2019-08-29 2019-08-29 A kind of neural network distributed parallel training method and device

Country Status (1)

Country Link
CN (1) CN110503201A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736986A (en) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) accelerated execution method of deep learning model and related device
WO2022001134A1 (en) * 2020-06-28 2022-01-06 浪潮电子信息产业股份有限公司 Load balancing method, apparatus and device for parallel model training task, and storage medium
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN107027036A (en) * 2017-05-12 2017-08-08 郑州云海信息技术有限公司 A kind of FPGA isomeries accelerate decompression method, the apparatus and system of platform
CN107609646A (en) * 2017-10-12 2018-01-19 郑州云海信息技术有限公司 A kind of residual error network implementation approach, system, equipment and computer-readable storage medium
CN108764466A (en) * 2018-03-07 2018-11-06 东南大学 Convolutional neural networks hardware based on field programmable gate array and its accelerated method
KR20180125843A (en) * 2017-05-16 2018-11-26 광운대학교 산학협력단 A hardware classifier applicable to various CNN models
CN109740731A (en) * 2018-12-15 2019-05-10 华南理工大学 A kind of adaptive convolutional layer hardware accelerator design method
CN109993299A (en) * 2017-12-29 2019-07-09 中兴通讯股份有限公司 Data training method and device, storage medium, electronic device
CN110084356A (en) * 2018-01-26 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network data processing method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN107027036A (en) * 2017-05-12 2017-08-08 郑州云海信息技术有限公司 A kind of FPGA isomeries accelerate decompression method, the apparatus and system of platform
KR20180125843A (en) * 2017-05-16 2018-11-26 광운대학교 산학협력단 A hardware classifier applicable to various CNN models
CN107609646A (en) * 2017-10-12 2018-01-19 郑州云海信息技术有限公司 A kind of residual error network implementation approach, system, equipment and computer-readable storage medium
CN109993299A (en) * 2017-12-29 2019-07-09 中兴通讯股份有限公司 Data training method and device, storage medium, electronic device
CN110084356A (en) * 2018-01-26 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network data processing method and device
CN108764466A (en) * 2018-03-07 2018-11-06 东南大学 Convolutional neural networks hardware based on field programmable gate array and its accelerated method
CN109740731A (en) * 2018-12-15 2019-05-10 华南理工大学 A kind of adaptive convolutional layer hardware accelerator design method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUELEI LI ET AL.: "FPGA Accelerates Deep Residual Learning for Image Recognition", 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) *
HONG QIFEI: "Research on an FPGA Hardware Acceleration Platform for Deep Learning", China Excellent Doctoral and Master's Dissertations Full-text Database (Master's), Information Science and Technology series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736986A (en) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) accelerated execution method of deep learning model and related device
CN111736986B (en) * 2020-05-29 2023-06-23 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) acceleration execution method and related device of deep learning model
WO2022001134A1 (en) * 2020-06-28 2022-01-06 浪潮电子信息产业股份有限公司 Load balancing method, apparatus and device for parallel model training task, and storage medium
US11868817B2 (en) 2020-06-28 2024-01-09 Inspur Electronic Information Industry Co., Ltd. Load balancing method, apparatus and device for parallel model training task, and storage medium
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture
CN116303108B (en) * 2022-09-07 2024-05-14 芯砺智能科技(上海)有限公司 Weight address arrangement method suitable for parallel computing architecture

Similar Documents

Publication Publication Date Title
CN110503201A (en) A kind of neural network distributed parallel training method and device
CN107578095B (en) Neural computing device and processor comprising the computing device
CN107578098A (en) Neural network processor based on systolic arrays
CN109190756A (en) Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN107797962A (en) Computing array based on neutral net
KR20130090147A (en) Neural network computing apparatus and system, and method thereof
CN103324850B (en) Twice polycondensation parallel method of finite element two-stage subregion based on multifile stream
CN106201651A (en) The simulator of neuromorphic chip
JPH08508838A (en) New finite element method and analyzer
CN103617150A (en) GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN103345580B (en) Based on the parallel CFD method of lattice Boltzmann method
CN108509270A (en) The high performance parallel implementation method of K-means algorithms on a kind of domestic 26010 many-core processor of Shen prestige
CN109472361A (en) Neural network optimization
CN115310566A (en) Distributed training system, method, device, equipment and readable storage medium
CN106843997B (en) A kind of parallel virtual machine polymerization based on Spark with optimization MBBO algorithms
CN110333946A (en) One kind being based on artificial intelligence cpu data processing system and method
CN111125963A (en) Numerical simulation system and method based on Lagrange integral point finite element
CN110415160A (en) A kind of GPU topology partition method and device
CN110580519A (en) Convolution operation structure and method thereof
Rosenthal Monotonicity of the core and value in dynamic cooperative games
CN104536831A (en) Multi-core SoC software mapping method based on multi-objective optimization
Vaughan et al. Enabling tractable exploration of the performance of adaptive mesh refinement
Oliker et al. Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2
Xu et al. Balancing cpu-gpu collaborative high-order cfd simulations on the tianhe-1a supercomputer
CN105531602A (en) System and method of implementing finite difference time domain models with multiple accelerated processing components (APCs)

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20191126