CN110503201A - Neural network distributed parallel training method and device - Google Patents
Neural network distributed parallel training method and device
- Publication number
- CN110503201A (application CN201910810557.4A)
- Authority
- CN
- China
- Prior art keywords
- layer
- computing device
- convolution
- convolution kernel
- programmable gate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Multi Processors (AREA)
Abstract
The invention discloses a neural network distributed parallel training method and device, comprising: dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices; in each computing device, independently performing the convolution operation for each layer based on the assigned convolution kernels and passing the newly generated features to the next layer for further convolution; and, starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer. The present invention reduces the training time of distributed parallel training, increases the parallelism of the algorithm, and improves throughput and performance.
Description
Technical field
The present invention relates to the field of computing, and more particularly to a neural network distributed parallel training method and device.
Background art
Deep learning has brought great progress to the field of artificial intelligence, but training a deep learning model requires a very large amount of computation. Completing one training pass over a benchmark dataset such as ImageNet on a single machine with a modern GPU may take up to a week. Distributed training across multiple machines can reduce the training time, but the prior art lacks a corresponding implementation.
For the problem that the prior art lacks a practical distributed parallel training method, no effective solution has yet been proposed.
Summary of the invention
In view of this, an object of the embodiments of the present invention is to provide a neural network distributed parallel training method and device that can reduce the training time of distributed parallel training, increase the parallelism of the algorithm, and improve throughput and performance.
Based on the above object, a first aspect of the embodiments of the present invention provides a neural network distributed parallel training method, comprising:
dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
in each computing device, independently performing the convolution operation for each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution; and
starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping kernels of the same channel on the same device, assigning each computing device a number of convolution kernels that is as close as possible to the numbers assigned to the other devices, so as to balance the load among the computing devices.
In some embodiments, the computing devices are field programmable gate arrays (FPGAs).
In some embodiments, independently performing the convolution operation for each layer in each computing device based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
In some embodiments, the method further comprises: invoking the hardware circuit of the distributed parallel training algorithm on the field programmable gate array to execute the AOCX file with hardware acceleration.
A second aspect of the embodiments of the present invention provides a neural network distributed parallel training device, comprising:
a partitioning module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform the convolution operation for each layer in each computing device based on the assigned convolution kernels, and to pass the newly generated features to the next layer for further convolution; and
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping kernels of the same channel on the same device, assigning each computing device a number of convolution kernels that is as close as possible to the numbers assigned to the other devices, so as to balance the load among the computing devices.
In some embodiments, the computing devices are field programmable gate arrays.
In some embodiments, independently performing the convolution operation for each layer in each computing device based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the hardware circuit of the distributed parallel training algorithm on the field programmable gate array to execute the AOCX file with hardware acceleration.
A third aspect of the embodiments of the present invention provides a field programmable gate array cluster, comprising:
a plurality of field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, the program code, when run, performing the above neural network distributed parallel training method.
The present invention has the following advantageous effects: the neural network distributed parallel training method and device provided by the embodiments of the present invention, by dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices, independently performing the convolution operation for each layer in each computing device based on the assigned convolution kernels, passing the newly generated features to the next layer for further convolution, and back-propagating the loss error from the last layer to update the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the parallelism of the algorithm, and improve throughput and performance.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of the neural network distributed parallel training method provided by the present invention;
Fig. 2 is a schematic diagram of the convolution operation in prior-art deep learning model training;
Fig. 3 is a schematic diagram of the convolution operation in the neural network distributed parallel training method provided by the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are further described below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the embodiments of the present invention, all expressions using "first" and "second" are intended to distinguish two different entities or parameters with the same name. "First" and "second" are used merely for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments will not repeat this note.
Based on the above object, a first aspect of the embodiments of the present invention provides an embodiment of a neural network distributed parallel training method that reduces the training time. Fig. 1 is a schematic flowchart of the neural network distributed parallel training method provided by the present invention.
As shown in Fig. 1, the neural network distributed parallel training method comprises:
Step S101: dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
Step S103: in each computing device, independently performing the convolution operation for each layer based on the assigned convolution kernels, and passing the newly generated features to the next layer for further convolution;
Step S105: starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
The embodiments of the present invention propose a distributed parallel training algorithm based on an FPGA (field programmable gate array) cluster platform. The convolution operations of a deep learning network model are assigned to different FPGA devices in the cluster by a defined partitioning method, so that each FPGA device reaches a load-balanced state. Results show that this method scales well on large FPGA clusters: when 6 transmitters are configured on each FPGA, the distributed training performance increases linearly with the number of FPGA devices. In terms of energy consumption, compared with an equivalent graphics processor cluster, this method achieves on average 6.36 times the efficiency of the graphics processor cluster.
The embodiments of the present invention design a reasonable deep learning model partitioning strategy so that the whole network model reaches a load-balanced state on the FPGA cluster, and design a reasonable OpenCL description of the deep learning distributed parallel training algorithm so that it can be mapped to a more efficient FPGA hardware circuit structure. In this way, the traditional single-device training algorithm is executed in a parallel, pipelined manner across multiple FPGA devices, thereby improving the performance of distributed deep learning model training.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. Such an embodiment of the computer program can achieve effects identical or similar to those of any corresponding method embodiment described above.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping kernels of the same channel on the same device, assigning each computing device a number of convolution kernels that is as close as possible to the numbers assigned to the other devices, so as to balance the load among the computing devices.
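For illustration only, the following minimal C sketch shows one way such a balanced channel split could be computed; the function name is an assumption, and the 28-channel, 4-device figures reuse the example given in the embodiments below.

```c
/* Sketch (not from the patent text): evenly partition c_in input channels
 * across num_devices FPGAs so that per-device channel counts differ by at
 * most one, keeping the load balanced. */
#include <stdio.h>

/* Returns the number of channels assigned to device d (0-based). */
static int channels_for_device(int c_in, int num_devices, int d) {
    int base  = c_in / num_devices;
    int extra = c_in % num_devices;      /* first 'extra' devices get one more */
    return base + (d < extra ? 1 : 0);
}

int main(void) {
    int c_in = 28, num_devices = 4;      /* example used in the description below */
    for (int d = 0; d < num_devices; ++d)
        printf("device %d: %d channels\n", d, channels_for_device(c_in, num_devices, d));
    return 0;                            /* prints 7 channels for each of the 4 devices */
}
```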
In some embodiments, the computing devices are field programmable gate arrays.
In some embodiments, independently performing the convolution operation for each layer in each computing device based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
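Purely as an illustration of what such an OpenCL description might contain, the sketch below shows a naive OpenCL C kernel that convolves only the slice of input channels held by one device and writes a partial output feature map. The kernel name, argument layout and loop structure are assumptions rather than the patent's actual code; in an Intel/Altera flow, a kernel file like this would be compiled offline by the AOC high-level synthesis tool into the AOCX binary mentioned above.

```c
/* Per-device convolution sketch: each FPGA convolves only its own slice of
 * input channels and writes a partial output feature map, which is later
 * fused (summed) with the other devices' results. Valid (no-padding)
 * convolution, 3D NDRange of (C_out, H_out, W_out). */
__kernel void partial_conv(__global const float *input,     /* [local_c][H][W]        */
                           __global const float *weights,   /* [C_out][local_c][K][K] */
                           __global float *partial_out,     /* [C_out][H_out][W_out]  */
                           const int local_c,               /* channels on this device */
                           const int H, const int W, const int K)
{
    const int oc = get_global_id(0);     /* output channel */
    const int oy = get_global_id(1);     /* output row     */
    const int ox = get_global_id(2);     /* output column  */
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;

    float acc = 0.0f;
    for (int c = 0; c < local_c; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                acc += input[(c * H + (oy + ky)) * W + (ox + kx)]
                     * weights[((oc * local_c + c) * K + ky) * K + kx];

    partial_out[(oc * H_out + oy) * W_out + ox] = acc;
}
```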
In some embodiments, the method further comprises: invoking the hardware circuit of the distributed parallel training algorithm on the field programmable gate array to execute the AOCX file with hardware acceleration. A host-side program runs on a general-purpose central processing unit, and the central processing unit and the field programmable gate array are connected via the high-speed serial computer expansion bus standard (PCI Express).
In the embodiments of the present invention, the convolution operations in deep learning model training are evenly divided over the FPGA cluster by input channel. The distributed training algorithm is then described in the OpenCL high-level language, and the kernel program files are compiled and synthesized with the Altera SDK for OpenCL (AOC) high-level synthesis tool to generate AOCX files that can run on the FPGAs. Finally, a host-side program runs on the central processing unit and invokes the distributed training algorithm hardware circuits on the FPGAs for hardware acceleration; the central processing unit and the FPGAs are connected and communicate via the high-speed serial computer expansion bus standard (PCI Express) interface, and the DDR3 memory on each FPGA development board serves as a data buffer.
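A host-side sketch along these lines is shown below. It uses only standard OpenCL host API calls (the binary-loading path through clCreateProgramWithBinary is what FPGA OpenCL flows rely on); the file name, kernel name and omitted error handling are simplifications for illustration, and the PCI Express transfers and DDR3 buffering described above are handled by the vendor runtime and board support package rather than by this code.

```c
/* Host-side sketch (assumptions, not the patent's code): the CPU loads the
 * aoc-generated .aocx binary onto the FPGA and prepares the kernel launch. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    /* Read the compiled FPGA bitstream (file name is illustrative). */
    FILE *f = fopen("partial_conv.aocx", "rb");
    fseek(f, 0, SEEK_END);
    size_t len = ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char *bin = malloc(len);
    fread(bin, 1, len, f);
    fclose(f);

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* FPGA OpenCL flows load a precompiled binary rather than source code. */
    const unsigned char *bins[] = { bin };
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &len, bins, NULL, &err);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "partial_conv", &err);

    /* ... create buffers with clCreateBuffer, set arguments with
     * clSetKernelArg, enqueue with clEnqueueNDRangeKernel on 'queue',
     * and read the partial results back with clEnqueueReadBuffer ... */

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    free(bin);
    return 0;
}
```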
The methods disclosed in the embodiments of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. When executed by the CPU, the computer program performs the functions defined in the methods disclosed in the embodiments of the present invention. The above method steps and system units may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to realize the above steps or unit functions.
The present invention is further explained below with reference to specific embodiments.
Referring to Fig. 2, traditional deep learning model training mainly comprises the following steps (a schematic sketch of this loop follows the list):
(1) forward computation: starting from the first layer, the input features are convolved with the convolution kernels of the different channels; the newly generated features are passed to the next layer for the next convolution operation, until the last layer;
(2) back-propagation: starting from the last layer, the computed loss error is propagated backwards;
(3) gradient update: the weights of each layer are updated according to the gradients obtained during back-propagation.
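Purely for illustration, the C sketch below restates this loop; Model, Tensor and the per-layer helpers (forward_layer, backward_layer, loss_gradient, update_weights) are hypothetical placeholders rather than anything defined in the patent.

```c
/* Schematic restatement of steps (1)-(3) above; all types and helpers are
 * hypothetical placeholders, not code from the patent. */
typedef struct Tensor Tensor;   /* opaque feature / gradient container */
typedef struct Layer  Layer;    /* one convolutional layer             */

const Tensor *forward_layer(Layer *l, const Tensor *in);          /* step (1), per layer */
Tensor *loss_gradient(const Tensor *output, const Tensor *label);
Tensor *backward_layer(Layer *l, Tensor *grad_out);               /* step (2), per layer */
void    update_weights(Layer *l);                                 /* step (3), per layer */

typedef struct { Layer *layers; int num_layers; } Model;

void train_step(Model *m, const Tensor *input, const Tensor *label) {
    /* (1) forward computation: from the first layer to the last */
    const Tensor *x = input;
    for (int l = 0; l < m->num_layers; ++l)
        x = forward_layer(&m->layers[l], x);

    /* (2) back-propagation: push the loss error back from the last layer */
    Tensor *grad = loss_gradient(x, label);
    for (int l = m->num_layers - 1; l >= 0; --l)
        grad = backward_layer(&m->layers[l], grad);

    /* (3) gradient update: apply each layer's gradients to its weights */
    for (int l = 0; l < m->num_layers; ++l)
        update_weights(&m->layers[l]);
}
```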
In contrast, the distributed training process of a deep learning model on an FPGA cluster according to the embodiments of the present invention is shown in Fig. 3 and mainly comprises the following steps:
(1) model partitioning: each layer of the model is divided across the different FPGA devices according to the different channels of its convolution kernels. For example, if the convolution kernels have 28 input channels and there are 4 FPGA devices, then each device holds 7 channels;
(2) forward computation: starting from the first layer, on each FPGA device the input features are convolved with the convolution kernels held on that device; the features generated on the individual devices are then fused into one new feature (a sketch of the fusion step follows this list), which is passed to the next layer for the next convolution operation, until the last layer;
(3) back-propagation: starting from the last layer, the computed loss error is propagated backwards;
(4) gradient update: the weights of each layer are updated according to the gradients obtained during back-propagation.
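To make the fusion in step (2) concrete: because the convolution is split over input channels, each device's output is a partial sum, and fusing the per-device features into one new feature amounts to an element-wise addition of those partial maps. The element-wise sum is an assumption that follows from the channel split; the text above only states that the features are fused. A minimal C sketch:

```c
/* Fuse the per-device partial output feature maps into one new feature by
 * element-wise summation (assumption drawn from the input-channel split). */
#include <string.h>

void fuse_partial_features(float *fused,                 /* [C_out * H_out * W_out] */
                           const float *const *partials, /* one buffer per device   */
                           int num_devices, size_t n)
{
    memset(fused, 0, n * sizeof(float));
    for (int d = 0; d < num_devices; ++d)
        for (size_t i = 0; i < n; ++i)
            fused[i] += partials[d][i];   /* accumulate each device's partial map */
}
```

With the 28-channel, 4-device example from step (1), each partials[d] would hold the partial sums over that device's 7 input channels.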
As can be seen from the above embodiments, the neural network distributed parallel training method provided by the embodiments of the present invention, by dividing the convolution kernels in each layer of the deep learning model that belong to the same channel onto the same computing device, starting from the first layer, performing the convolution operation independently in each computing device based on the assigned convolution kernels and passing the newly generated features to the next layer for further convolution until the last layer, and back-propagating the loss error from the last layer and updating the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the parallelism of the algorithm, and improve throughput and performance.
It should be particularly noted that the steps in the embodiments of the above neural network distributed parallel training method may be interleaved, replaced, added or deleted. Therefore, such reasonable permutations, combinations and transformations of the neural network distributed parallel training method shall also fall within the protection scope of the present invention, and the protection scope of the present invention shall not be limited to the described embodiments.
Based on the above object, a second aspect of the embodiments of the present invention provides an embodiment of a neural network distributed parallel training device that reduces the training time. The neural network distributed parallel training device comprises:
a partitioning module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform the convolution operation for each layer in each computing device based on the assigned convolution kernels, and to pass the newly generated features to the next layer for further convolution; and
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
The various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the functions of various illustrative components, blocks, modules, circuits and steps have been described above in general terms. Whether such functions are implemented as software or as hardware depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope disclosed by the embodiments of the present invention.
In some embodiments, dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping kernels of the same channel on the same device, assigning each computing device a number of convolution kernels that is as close as possible to the numbers assigned to the other devices, so as to balance the load among the computing devices.
In some embodiments, the computing devices are field programmable gate arrays.
In some embodiments, independently performing the convolution operation for each layer in each computing device based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the hardware circuit of the distributed parallel training algorithm on the field programmable gate array to execute the AOCX file with hardware acceleration.
Based on the above object, a third aspect of the embodiments of the present invention provides an embodiment of a field programmable gate array cluster for neural network distributed parallel training that reduces the training time. The field programmable gate array cluster comprises:
a plurality of field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, the program code, when run, performing the above neural network distributed parallel training method.
As can be seen from the above embodiments, the neural network distributed parallel training device and field programmable gate array cluster provided by the embodiments of the present invention, by dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices, independently performing the convolution operation for each layer in each computing device based on the assigned convolution kernels, passing the newly generated features to the next layer for further convolution, and back-propagating the loss error from the last layer to update the gradient weights of each layer, can reduce the training time of distributed parallel training, increase the parallelism of the algorithm, and improve throughput and performance.
It should be particularly noted that the above embodiments of the neural network distributed parallel training device and the field programmable gate array cluster use the embodiments of the neural network distributed parallel training method to illustrate the working process of each module, and those skilled in the art can readily conceive of applying these modules to other embodiments of the neural network distributed parallel training method. Of course, since the steps in the embodiments of the neural network distributed parallel training method may be interleaved, replaced, added or deleted, such reasonable permutations, combinations and transformations of the neural network distributed parallel training device and the field programmable gate array cluster shall also fall within the protection scope of the present invention, and the protection scope of the present invention shall not be limited to the described embodiments.
The above are exemplary embodiments of the present disclosure. It should be noted that many modifications and variations may be made without departing from the scope of the embodiments of the present invention as defined by the claims. The functions, steps and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be construed as plural unless expressly limited to the singular.
It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items. The serial numbers of the embodiments of the present disclosure are merely for description and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
It should be understood by those of ordinary skill in the art that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples. Within the spirit of the embodiments of the present invention, the technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the present invention as described above which, for brevity, are not described in detail. Therefore, any omission, modification, equivalent replacement, improvement and the like made within the spirit and principles of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.
Claims (10)
1. A neural network distributed parallel training method, characterized by comprising the following steps:
dividing the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
independently performing the convolution operation for each layer in each computing device based on the convolution kernels, and passing the newly generated features to the next layer for further convolution; and
starting from the last layer, back-propagating the loss error and updating the gradient weights of each layer.
2. The method according to claim 1, wherein dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping kernels of the same channel on the same device, assigning each computing device a number of convolution kernels that is as close as possible to the numbers assigned to the other devices, so as to balance the load among the computing devices.
3. The method according to claim 1, wherein the computing devices are field programmable gate arrays.
4. The method according to claim 3, wherein independently performing the convolution operation for each layer in each computing device based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array.
5. The method according to claim 4, further comprising: invoking the hardware circuit of the distributed parallel training algorithm on the field programmable gate array to execute the AOCX file with hardware acceleration.
6. A neural network distributed parallel training device, characterized by comprising:
a partitioning module, configured to divide the convolution kernels that belong to the same channel in each layer of a deep learning model onto the same computing device among multiple computing devices;
an execution module, configured to independently perform the convolution operation for each layer in each computing device based on the convolution kernels, and to pass the newly generated features to the next layer for further convolution; and
an update module, configured to back-propagate the loss error starting from the last layer and to update the gradient weights of each layer.
7. The device according to claim 6, wherein dividing the convolution kernels that belong to the same channel in each layer of the deep learning model onto the same computing device among the multiple computing devices comprises: on the premise of keeping kernels of the same channel on the same device, assigning each computing device a number of convolution kernels that is as close as possible to the numbers assigned to the other devices, so as to balance the load among the computing devices.
8. The device according to claim 6, wherein the computing devices are field programmable gate arrays.
9. The device according to claim 8, wherein independently performing the convolution operation for each layer in each computing device based on the convolution kernels and passing the newly generated features to the next layer for further convolution comprises: describing the distributed parallel training algorithm in OpenCL to generate code, compiling the code with a high-level synthesis tool to generate an AOCX file, and executing the AOCX file on the field programmable gate array; and invoking the hardware circuit of the distributed parallel training algorithm on the field programmable gate array to execute the AOCX file with hardware acceleration.
10. A field programmable gate array cluster, characterized by comprising:
a plurality of field programmable gate arrays;
a processor; and
a memory storing program code executable by the processor, the program code, when run, performing the neural network distributed parallel training method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910810557.4A CN110503201A (en) | 2019-08-29 | 2019-08-29 | Neural network distributed parallel training method and device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910810557.4A CN110503201A (en) | 2019-08-29 | 2019-08-29 | Neural network distributed parallel training method and device
Publications (1)
Publication Number | Publication Date |
---|---|
CN110503201A true CN110503201A (en) | 2019-11-26 |
Family
ID=68590494
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910810557.4A (CN110503201A, pending) | Neural network distributed parallel training method and device | 2019-08-29 | 2019-08-29
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503201A (en) |
- 2019-08-29: Application CN201910810557.4A filed in China (published as CN110503201A, legal status: Pending)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106600667A (en) * | 2016-12-12 | 2017-04-26 | Method for driving facial animation from video based on a convolutional neural network
CN107027036A (en) * | 2017-05-12 | 2017-08-08 | Decompression method, apparatus and system for an FPGA heterogeneous acceleration platform
KR20180125843A (en) * | 2017-05-16 | 2018-11-26 | A hardware classifier applicable to various CNN models
CN107609646A (en) * | 2017-10-12 | 2018-01-19 | Residual network implementation method, system, device and computer storage medium
CN109993299A (en) * | 2017-12-29 | 2019-07-09 | Data training method and device, storage medium, electronic device
CN110084356A (en) * | 2018-01-26 | 2019-08-02 | Deep neural network data processing method and device
CN108764466A (en) * | 2018-03-07 | 2018-11-06 | Convolutional neural network hardware based on a field programmable gate array and acceleration method thereof
CN109740731A (en) * | 2018-12-15 | 2019-05-10 | Adaptive convolutional layer hardware accelerator design method
Non-Patent Citations (2)
Title |
---|
XUELEI LI ET AL.: "FPGA Accelerates Deep Residual Learning for Image Recognition", 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) *
HONG QIFEI: "Research on an FPGA Hardware Acceleration Platform for Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111736986A (en) * | 2020-05-29 | 2020-10-02 | 浪潮(北京)电子信息产业有限公司 | FPGA (field programmable gate array) accelerated execution method for a deep learning model and related device |
CN111736986B (en) * | 2020-05-29 | 2023-06-23 | 浪潮(北京)电子信息产业有限公司 | FPGA (field programmable gate array) accelerated execution method and related device for a deep learning model |
WO2022001134A1 (en) * | 2020-06-28 | 2022-01-06 | 浪潮电子信息产业股份有限公司 | Load balancing method, apparatus and device for parallel model training task, and storage medium |
US11868817B2 (en) | 2020-06-28 | 2024-01-09 | Inspur Electronic Information Industry Co., Ltd. | Load balancing method, apparatus and device for parallel model training task, and storage medium |
CN116303108A (en) * | 2022-09-07 | 2023-06-23 | 芯砺智能科技(上海)有限公司 | Convolutional neural network weight address arrangement method suitable for parallel computing architecture |
CN116303108B (en) * | 2022-09-07 | 2024-05-14 | 芯砺智能科技(上海)有限公司 | Weight address arrangement method suitable for parallel computing architecture |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503201A (en) | 2019-11-26 | Neural network distributed parallel training method and device | |
CN107578095B (en) | Neural computing device and processor comprising the computing device | |
CN107578098A (en) | Neural network processor based on systolic arrays | |
CN109190756A (en) | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device | |
CN107797962A (en) | 2018-03-13 | Computing array based on neural network | |
KR20130090147A (en) | Neural network computing apparatus and system, and method thereof | |
CN103324850B (en) | 2016-01-20 | Two-stage partition, two-pass condensation parallel method for finite elements based on multi-file streams | |
CN106201651A (en) | The simulator of neuromorphic chip | |
JPH08508838A (en) | New finite element method and analyzer | |
CN103617150A (en) | GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system | |
CN103345580B (en) | 2016-08-10 | Parallel CFD method based on the lattice Boltzmann method | |
CN108509270A (en) | 2018-09-07 | High-performance parallel implementation method of the K-means algorithm on the domestic Sunway 26010 many-core processor | |
CN109472361A (en) | Neural network optimization | |
CN115310566A (en) | Distributed training system, method, device, equipment and readable storage medium | |
CN106843997B (en) | 2019-11-19 | Parallel virtual machine aggregation method based on Spark with an optimized MBBO algorithm | |
CN110333946A (en) | 2019-10-15 | Data processing system and method based on an artificial intelligence CPU | |
CN111125963A (en) | Numerical simulation system and method based on Lagrange integral point finite element | |
CN110415160A (en) | 2019-11-05 | GPU topology partitioning method and device | |
CN110580519A (en) | Convolution operation structure and method thereof | |
Rosenthal | Monotonicity of the core and value in dynamic cooperative games | |
CN104536831A (en) | Multi-core SoC software mapping method based on multi-objective optimization | |
Vaughan et al. | Enabling tractable exploration of the performance of adaptive mesh refinement | |
Oliker et al. | Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2 | |
Xu et al. | Balancing cpu-gpu collaborative high-order cfd simulations on the tianhe-1a supercomputer | |
CN105531602A (en) | System and method of implementing finite difference time domain models with multiple accelerated processing components (APCs) |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191126 |