CN111652365A - Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof - Google Patents
Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
- Publication number
- CN111652365A (Application CN202010366873.XA)
- Authority
- CN
- China
- Prior art keywords
- module
- calculation
- vmpu
- matrix multiplication
- vector matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof. The hardware architecture comprises: a general processor module, responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool; an external DDR memory, responsible for storing the experience pool of the Deep Q-Network algorithm; an AXI bus interface, a general AXI bus interface structure responsible for the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module; a Target Q module, responsible for the forward reasoning calculation of the Target Q network; and a Current Q module, responsible for the forward reasoning and back propagation of the Current Q network. The invention realizes real-time calculation of the Deep Q-Network algorithm on a highly optimized FPGA hardware architecture.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and in particular relates to a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof.
Background
Deep reinforcement learning is a novel artificial intelligence technology developed by combining traditional reinforcement learning algorithms with deep learning algorithms, and is mainly used in fields such as robot control, autonomous driving, and search and recommendation; the Deep Q-Network algorithm is a representative algorithm in the field of deep reinforcement learning. However, the calculation process of the Deep Q-Network algorithm involves two types of computation, forward reasoning and back propagation of the neural network, and when the neural network is large in scale it suffers from heavy storage resource consumption and high computational complexity. At present, deep reinforcement learning is usually researched and implemented on large-scale GPU (graphics processing unit) board servers, which makes it difficult to apply in edge computing scenarios where hardware resources and power consumption are limited, so its applicability is low; moreover, in the prior art the FPGA hardware computing architecture is not optimized and its design space is not explored.
Disclosure of Invention
The invention provides a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof to solve the above problems. The method can explore the design space of the hardware architecture according to the parameters of the Deep Q-Network and the resource parameters of the FPGA chip, and gives the optimal parallel parameters of the hardware architecture under the resource constraints of the FPGA chip, thereby realizing real-time calculation of the Deep Q-Network algorithm on a highly optimized FPGA hardware architecture.
The invention is realized by the following technical scheme:
a hardware architecture for accelerating Deep Q-Network algorithm comprises a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is responsible for storing an experience pool of a Deep Q-Network algorithm;
the AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module;
the Target Q module is responsible for realizing forward reasoning calculation of the Target Q network;
the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network;
the Target Q module and the Current Q module are both formed by vector matrix multiplication processing units VMPU cascaded through first-in first-out (FIFO) queues;
the Loss calculation module is responsible for receiving forward reasoning calculation results of the Target Q module and the Current Q module, calculating error gradient and transmitting the error gradient to the Current Q module for back propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
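For orientation only, the following is a minimal Python sketch of what the Loss calculation module and the weight updating unit compute in a standard Deep Q-Network training step; the discount factor, learning rate, and plain-SGD update are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def dqn_error_gradient(q_current, q_target_next, action, reward, gamma=0.99):
    """Loss calculation module: error gradient sent back into the Current Q network.

    q_current     : Current Q module forward reasoning output for state s (numpy array)
    q_target_next : Target Q module forward reasoning output for next state s'
    """
    target = reward + gamma * np.max(q_target_next)   # Bellman target from the Target Q network
    grad = np.zeros_like(q_current)
    # d/dQ(s,a) of 0.5*(Q(s,a) - target)^2; terminal-state masking omitted for brevity
    grad[action] = q_current[action] - target
    return grad

def update_weights(weights, weight_grads, lr=1e-3):
    """Weight updating unit: apply the accumulated weight gradients (plain SGD assumed)."""
    return [w - lr * g for w, g in zip(weights, weight_grads)]
```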
Furthermore, the vector matrix multiplication unit VMPU has two calculation modes, mode A and mode B, and can complete the matrix-vector multiplication and the activation function; mode A is a column-major calculation mode and mode B is a row-major calculation mode. In the column-major mode A, the single instruction multiple data (SIMD) parallelism is applied along the row dimension of the matrix and the processing element (PE) parallelism along the column dimension; in the row-major mode B, the SIMD parallelism is applied along the column dimension of the matrix and the PE parallelism along the row dimension.
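As a software reference for the two modes, the sketch below imitates the column-major and row-major schedules with SIMD and PE tiling; it assumes the weight matrix is stored with input neurons along rows and output neurons along columns, and the tile sizes and ReLU activation are illustrative, not part of the disclosure.

```python
import numpy as np

def vmpu_mode_a(x, W, simd, pe, act=lambda v: np.maximum(v, 0.0)):
    """Mode A (column-major): SIMD tiles the row dimension of W, PE tiles the columns.
    Each group of PE output neurons is fully accumulated before moving to the next group."""
    rows, cols = W.shape                      # rows = input length, cols = output length
    y = np.zeros(cols)
    for c0 in range(0, cols, pe):             # PE-parallel group of output neurons
        acc = np.zeros(pe)
        for r0 in range(0, rows, simd):       # SIMD-parallel slice of the input vector
            acc += x[r0:r0 + simd] @ W[r0:r0 + simd, c0:c0 + pe]
        y[c0:c0 + pe] = act(acc)              # final results produced per PE group
    return y

def vmpu_mode_b(x, W, simd, pe, act=lambda v: np.maximum(v, 0.0)):
    """Mode B (row-major): SIMD tiles the column dimension of W, PE tiles the rows.
    Partial sums for the outputs are kept until every input slice has streamed in."""
    rows, cols = W.shape
    y = np.zeros(cols)
    for r0 in range(0, rows, pe):             # PE-parallel slice of inputs / matrix rows
        for c0 in range(0, cols, simd):       # SIMD-parallel slice of outputs / columns
            y[c0:c0 + simd] += x[r0:r0 + pe] @ W[r0:r0 + pe, c0:c0 + simd]
    return act(y)
```

For an 8-to-4 layer, `vmpu_mode_a(x, W, simd=4, pe=2)` and `vmpu_mode_b(x, W, simd=2, pe=2)` both reproduce `np.maximum(x @ W, 0)`; only the loop order, and hence the order in which results complete, differs.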
Further, when the Target Q module is formed by cascading vector matrix multiplication processing units VMPU, the odd layers of forward reasoning use the column-major mode of the VMPU and the even layers use the row-major mode, and the row-major VMPUs and column-major VMPUs are alternately cascaded through first-in first-out (FIFO) queues.
Further, when the Current Q module is formed by cascading vector matrix multiplication processing units VMPU, the odd layers of forward reasoning use the column-major mode and the even layers use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers use the column-major mode; the row-major VMPUs and column-major VMPUs are alternately cascaded through first-in first-out (FIFO) queues.
Further, the vector matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between the row-major VMPUs and the column-major VMPUs; and the Target Q module and the Current Q module compute in parallel.
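A minimal software analogue of this alternating cascade, assuming Python generators stand in for the hardware FIFOs and an illustrative 8-4-4 network: the column-major stage emits each finished output neuron as soon as it is accumulated, and the row-major stage folds every arriving element into its partial sums, which is the behaviour that makes task-level pipelining between the two VMPUs possible.

```python
import numpy as np

def column_major_stage(x_stream, W, pe):
    """Column-major VMPU: pushes PE-sized groups of fully accumulated outputs downstream."""
    x = np.fromiter(x_stream, dtype=float)             # drain the incoming FIFO
    for c0 in range(0, W.shape[1], pe):
        for value in np.maximum(x @ W[:, c0:c0 + pe], 0.0):
            yield value                                # finished neuron enters the FIFO

def row_major_stage(in_stream, W, simd):
    """Row-major VMPU: updates SIMD-wide partial sums as each input element arrives."""
    y = np.zeros(W.shape[1])
    for r, value in enumerate(in_stream):              # consumes outputs as they are produced
        for c0 in range(0, W.shape[1], simd):
            y[c0:c0 + simd] += value * W[r, c0:c0 + simd]
    return np.maximum(y, 0.0)

# Hypothetical 8 -> 4 -> 4 network: the second stage starts working as soon as the
# first stage yields its first result, instead of waiting for the whole vector.
x = np.random.rand(8)
W1, W2 = np.random.rand(8, 4), np.random.rand(4, 4)
out = row_major_stage(column_major_stage(iter(x), W1, pe=2), W2, simd=2)
```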
A design space exploration method for accelerating a hardware architecture of a Deep Q-Network algorithm specifically comprises the following steps:
Step 1: set the FPGA resource model constraints, whereby the DSP resources and BRAM resources used by the hardware architecture may not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraints, whereby the single instruction multiple data (SIMD) parallelism of each stage's vector matrix multiplication unit in the forward reasoning calculation module of the hardware architecture may not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and processing element (PE) parallelism of each stage's vector matrix multiplication unit VMPU in Matlab, and thereby determine the SIMD parallelism and PE parallelism of each stage's VMPU in the hardware architecture.
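The search itself reduces to a few nested loops. Below is a hedged Python sketch rather than the Matlab implementation referred to above; only the stated constraints (resource ceilings and SIMD not exceeding the layer's neuron count) come from the text, while the DSP/BRAM cost functions and the latency model are caller-supplied placeholders, and "neuron number of the current layer" is interpreted as the input side of each VMPU stage.

```python
from itertools import product

def explore_design_space(layer_sizes, dsp_limit, bram_limit,
                         dsp_cost, bram_cost, latency):
    """Exhaustively search the (SIMD, PE) parallelism of every VMPU stage.

    layer_sizes : neurons per fully connected layer, e.g. [4, 64, 64, 2]
    dsp_cost / bram_cost / latency : caller-supplied models (assumptions, not the
        patent's resource model) mapping the parameter vectors to usage and cycles.
    """
    n_stages = len(layer_sizes) - 1
    # Step 2 constraint: SIMD of each stage may not exceed that layer's neuron count.
    simd_ranges = [range(1, layer_sizes[i] + 1) for i in range(n_stages)]
    pe_ranges = [range(1, layer_sizes[i + 1] + 1) for i in range(n_stages)]

    best, best_latency = None, float("inf")
    for point in product(*simd_ranges, *pe_ranges):
        simd, pe = point[:n_stages], point[n_stages:]
        # Step 1 constraint: stay within the FPGA DSP and BRAM budgets.
        if dsp_cost(simd, pe) > dsp_limit or bram_cost(simd, pe) > bram_limit:
            continue
        cycles = latency(simd, pe, layer_sizes)
        if cycles < best_latency:
            best, best_latency = {"SIMD": simd, "PE": pe}, cycles
    return best, best_latency
```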
The invention has the beneficial effects that:
1. The hardware of the invention can efficiently complete neural network reasoning and back propagation, and realizes pipelined computation within a neural network layer, pipelined computation between neural network layers, and parallel computation between the Target Q network and the Current Q network.
2. The method can evaluate resources required by deployment of the Deep Q-Network algorithm on the FPGA, and carry out constraint adjustment according to actual conditions.
3. Under the condition of limited computing resources and space, the invention can explore the design space of the hardware architecture according to the hardware resource model and the algorithm structure model, and use the optimal parallel parameters to deploy the algorithm on the FPGA.
Drawings
FIG. 1 is a hardware-software architecture diagram of the present invention for a field programmable gate array.
FIG. 2 is a schematic diagram of the coarse-grained structure of the vector matrix multiplication unit of the present invention.
Fig. 3 is a pseudo code of the vector matrix multiplication unit of the present invention.
FIG. 4 is a diagram of a fine-grained structure of a vector matrix multiplication unit in a column-based computation mode according to the present invention.
Fig. 5 is a schematic diagram of a fine-grained structure of a vector matrix multiplication unit in a row-based calculation mode according to the present invention.
FIG. 6 is a flowchart of a design space exploration process in designing a hardware architecture according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A hardware architecture for accelerating Deep Q-Network algorithm comprises a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is mainly responsible for storing a Deep Q-Network algorithm experience pool.
The AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module.
The Target Q module is responsible for realizing forward reasoning calculation of the Target Q network, and the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network.
The Target Q module and the Current Q module are both formed by vector matrix multiplication processing units VMPU cascaded through first-in first-out (FIFO) queues, with row-major VMPUs and column-major VMPUs cascaded alternately;
the Loss calculation module is responsible for receiving forward reasoning calculation results and calculating error gradients of the Target Q module and the Current Q module and transmitting the results and the error gradients to the Current Q module for back propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
The vector matrix multiplication unit VMPU has two calculation modes, A and B, and can complete the matrix-vector multiplication and the activation function; mode A is a column-major calculation mode and mode B is a row-major calculation mode. In the column-major mode A, the single instruction multiple data (SIMD) parallelism is applied along the row dimension of the matrix and the processing element (PE) parallelism along the column dimension; in the row-major mode B, the SIMD parallelism is applied along the column dimension of the matrix and the PE parallelism along the row dimension.
When the Target Q module is formed by cascading vector matrix multiplication processing units VMPU, the odd layers of forward reasoning use the column-major mode of the VMPU and the even layers use the row-major mode;
when the Current Q module is formed by cascading vector matrix multiplication processing units VMPU, the odd layers of forward reasoning use the column-major mode and the even layers use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers use the column-major mode;
the vector matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between the row-major VMPUs and the column-major VMPUs; and the Target Q module and the Current Q module compute in parallel;
further, the method for exploring the design space specifically comprises the following steps:
Step 1: set the FPGA resource model constraints, whereby the DSP resources and BRAM resources used by the hardware architecture may not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraints, whereby the single instruction multiple data (SIMD) parallelism of each stage's vector matrix multiplication unit in the forward reasoning calculation module of the hardware architecture may not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and processing element (PE) parallelism of each stage's vector matrix multiplication unit VMPU in Matlab, and thereby determine the SIMD parallelism and PE parallelism of each stage's VMPU in the hardware architecture.
Further, fig. 3 shows a pseudo code implementation of the vector matrix multiplication unit VMPU, which consists of three nested loops: the two innermost loops are fully unrolled in the hardware implementation to realize the SIMD parallelism and the PE parallelism, and the bound Fold of the outermost loop is the total folding factor, i.e. the product of the SIMD-dimension folding factor and the PE-dimension folding factor, indicating how many rounds of SIMD-parallel and PE-parallel computation are needed to complete the whole vector matrix multiplication.
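Since fig. 3 itself is not reproduced on this page, the following is only a hedged reconstruction of the three-loop structure described above; the loop ordering and variable names are assumptions. The outer loop runs Fold = SF x PF times, and the two inner loops, fully unrolled in hardware, cover the PE and SIMD lanes.

```python
def vmpu_three_loops(x, W, simd, pe):
    """Software model of the described loop nest; assumes the dimensions divide evenly."""
    rows, cols = len(W), len(W[0])        # rows follow the input, cols follow the output
    sf, pf = rows // simd, cols // pe     # SIMD-dimension and PE-dimension folding factors
    fold = sf * pf                        # total folding factor = SF * PF
    y = [0.0] * cols
    for f in range(fold):                 # outermost loop: Fold sequential iterations
        r0 = (f % sf) * simd              # which SIMD slice of the input this iteration uses
        c0 = (f // sf) * pe               # which PE group of outputs this iteration updates
        for p in range(pe):               # inner loop, fully unrolled in hardware (PE lanes)
            for s in range(simd):         # inner loop, fully unrolled in hardware (SIMD lanes)
                y[c0 + p] += x[r0 + s] * W[r0 + s][c0 + p]
    return y
```

With `x` of length 8, an 8x4 nested-list `W`, `simd=4` and `pe=2`, this performs the same Fold = 4 outer iterations as the mode-A example described below.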
Further, the fine-grained structures of two calculation modes a and B in fig. 4 and 5 are described below by taking an example in which the input is 8 neurons, the intermediate hidden layer is 4 neurons, and the output layer is 4 neurons.
Further, fig. 4 is a schematic diagram of the fine-grained structure of the VMPU in mode A, which here performs the forward calculation from the input layer to the hidden layer. The SIMD parallelism is set to 4 along the row dimension of the matrix, giving a folding factor SF of 2; the PE parallelism is 2 along the column dimension of the matrix, giving a folding factor PF of 2, so the total folding factor is 4. Every SF cycles, 2 final results are obtained, and the whole vector matrix multiplication is completed in 4 cycles of loop computation.
Further, fig. 5 is a schematic diagram of the fine-grained structure of the VMPU in mode B. The SIMD parallelism is set to 2 along the column dimension of the matrix, giving a folding factor SF of 2; the PE parallelism is 2 along the row dimension of the matrix, giving a folding factor PF of 2. Every SF cycles, 4 intermediate results are generated, and the whole vector matrix multiplication is completed in 4 cycles of loop computation.
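The folding factors in the two examples follow directly from the layer sizes and parallelism parameters; a worked check with the numbers quoted above:

```python
# Mode A example (input layer 8 -> hidden layer 4): SIMD tiles rows, PE tiles columns.
SF_A = 8 // 4            # row dimension / SIMD parallelism  = 2
PF_A = 4 // 2            # column dimension / PE parallelism = 2
FOLD_A = SF_A * PF_A     # 4 cycles of loop computation for the whole product

# Mode B example (hidden layer 4 -> output layer 4): SIMD tiles columns, PE tiles rows.
SF_B = 4 // 2            # column dimension / SIMD parallelism = 2
PF_B = 4 // 2            # row dimension / PE parallelism      = 2
FOLD_B = SF_B * PF_B     # again 4 cycles, matching the description above
```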
When back propagation of the network is carried out, the transposed weight matrices are needed for the gradient calculation. When forward reasoning of the neural network is computed in the FPGA, the weight matrices must be stored in BRAM in advance in either a row-major or a column-major layout; for example, for forward reasoning the weight matrices may be stored in the corresponding BRAMs in a row-major layout, and the storage locations are tied to the parallel parameters, namely the processing element PE and the single instruction multiple data SIMD. As a result, during back propagation the limited on-chip read bandwidth would prevent parallel computation. However, the matrix storage layouts of row-major calculation and column-major calculation are transposes of each other, so by using row-major calculation in one direction and column-major calculation in the other (for example, row-major during forward reasoning and column-major during back propagation), the bandwidth problem caused by the matrix data not being transposable is flexibly solved.
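A small software illustration of this storage argument, assuming a numpy array stands in for the weight BRAMs (the sketch does not model BRAM partitioning or read bandwidth, which is the actual hardware concern): the weights are stored once, the forward pass reads them in their stored orientation, and the backward pass obtains the effect of the transpose simply by switching to the other accumulation order over the same storage.

```python
import numpy as np

W = np.random.rand(8, 4)            # weights stored once (input rows x output columns)
x = np.random.rand(8)               # forward-pass input vector
delta = np.random.rand(4)           # error gradient arriving from the layer above

# Forward reasoning uses W as stored:       y  = x @ W          (8 -> 4)
y = x @ W

# Back propagation needs W transposed:      dx = delta @ W.T    (4 -> 8)
# Reading the same stored rows, but accumulating per input neuron instead of per
# output neuron, yields exactly the transposed product -- no physical transpose
# and no duplicated copy of the weights are required.
dx = np.array([delta @ W[r, :] for r in range(W.shape[0])])
assert np.allclose(dx, delta @ W.T)
```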
Claims (6)
1. A hardware architecture for accelerating Deep Q-Network algorithm is characterized by comprising a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is responsible for storing an experience pool of a Deep Q-Network algorithm;
the AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module;
the Target Q module is responsible for realizing forward reasoning calculation of the Target Q network;
the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network;
the Target Q module and the Current Q module are both formed by vector matrix multiplication processing units VMPU cascaded through first-in first-out (FIFO) queues;
the Loss calculation module is responsible for receiving forward reasoning calculation results of the Target Q module and the Current Q module, calculating error gradient and transmitting the error gradient to the Current Q module for back propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
2. The hardware architecture for accelerating the Deep Q-Network algorithm according to claim 1, wherein the vector matrix multiplication unit VMPU has two calculation modes, A and B, and can complete the matrix-vector multiplication and the activation function; mode A is a column-major calculation mode and mode B is a row-major calculation mode; in the column-major mode A, the single instruction multiple data (SIMD) parallelism is applied along the row dimension of the matrix and the processing element (PE) parallelism along the column dimension; in the row-major mode B, the SIMD parallelism is applied along the column dimension of the matrix and the PE parallelism along the row dimension.
3. The hardware architecture of claim 1, wherein when the Target Q module is formed by cascading vector matrix multiplication processing units VMPU, the odd layers of forward reasoning use the column-major mode of the VMPU and the even layers use the row-major mode, and the row-major VMPUs and column-major VMPUs are alternately cascaded through first-in first-out (FIFO) queues.
4. The hardware architecture of claim 1, wherein when the Current Q module is formed by cascading vector matrix multiplication processing units VMPU, the odd layers of forward reasoning use the column-major mode and the even layers use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers use the column-major mode, and the row-major VMPUs and column-major VMPUs are alternately cascaded through first-in first-out (FIFO) queues.
5. The hardware architecture for accelerating the Deep Q-Network algorithm of claim 2, wherein the vector matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between the row-major VMPUs and the column-major VMPUs; and the Target Q module and the Current Q module compute in parallel.
6. The method for exploring the design space of the hardware architecture for accelerating the Deep Q-Network algorithm, according to claim 1, is characterized in that the method for exploring the design space specifically comprises the following steps:
Step 1: set the FPGA resource model constraints, whereby the DSP resources and BRAM resources used by the hardware architecture may not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraints, whereby the single instruction multiple data (SIMD) parallelism of each stage's vector matrix multiplication unit in the forward reasoning calculation module of the hardware architecture may not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and processing element (PE) parallelism of each stage's vector matrix multiplication unit VMPU in Matlab, and thereby determine the SIMD parallelism and PE parallelism of each stage's VMPU in the hardware architecture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010366873.XA CN111652365B (en) | 2020-04-30 | 2020-04-30 | Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010366873.XA CN111652365B (en) | 2020-04-30 | 2020-04-30 | Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111652365A true CN111652365A (en) | 2020-09-11 |
CN111652365B CN111652365B (en) | 2022-05-17 |
Family
ID=72347887
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010366873.XA Active CN111652365B (en) | 2020-04-30 | 2020-04-30 | Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111652365B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190130248A1 (en) * | 2017-10-27 | 2019-05-02 | Salesforce.Com, Inc. | Generating dual sequence inferences using a neural network model |
CN109783412A (en) * | 2019-01-18 | 2019-05-21 | 电子科技大学 | A kind of method that deeply study accelerates training |
CA3032182A1 (en) * | 2018-01-31 | 2019-07-31 | Royal Bank Of Canada | Pre-training neural netwoks with human demonstrations for deep reinforcement learning |
GB201913353D0 (en) * | 2019-09-16 | 2019-10-30 | Samsung Electronics Co Ltd | Method for designing accelerator hardware |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190130248A1 (en) * | 2017-10-27 | 2019-05-02 | Salesforce.Com, Inc. | Generating dual sequence inferences using a neural network model |
CA3032182A1 (en) * | 2018-01-31 | 2019-07-31 | Royal Bank Of Canada | Pre-training neural netwoks with human demonstrations for deep reinforcement learning |
CN109783412A (en) * | 2019-01-18 | 2019-05-21 | 电子科技大学 | A kind of method that deeply study accelerates training |
GB201913353D0 (en) * | 2019-09-16 | 2019-10-30 | Samsung Electronics Co Ltd | Method for designing accelerator hardware |
Non-Patent Citations (4)
Title |
---|
JIANG SU等: "Neural Network Based Reinforcement Learning Acceleration on FPGA Platforms", 《COMPUTER ARCHITECTURE NEWS》 * |
JIANG SU等: "Neural Network Based Reinforcement Learning", 《ACM SIGARCH COMPUTER ARCHITECTURE NEWS》 * |
NARU SUGIMOTO等: "Trax solver on Zynq with Deep Q-Network", 《2015 INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE TECHNOLOGY(FPT)》 * |
GONG Lei: "Research on heterogeneous multi-core acceleration methods for convolutional neural networks on reconfigurable platforms", China Excellent Master's and Doctoral Dissertations Full-text Database (Doctoral) *
Also Published As
Publication number | Publication date |
---|---|
CN111652365B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11574195B2 (en) | Operation method | |
US10691996B2 (en) | Hardware accelerator for compressed LSTM | |
US10698657B2 (en) | Hardware accelerator for compressed RNN on FPGA | |
US10810484B2 (en) | Hardware accelerator for compressed GRU on FPGA | |
CN107797962B (en) | Neural network based computational array | |
CN107704916B (en) | Hardware accelerator and method for realizing RNN neural network based on FPGA | |
CN108090565A (en) | Accelerated method is trained in a kind of convolutional neural networks parallelization | |
US20140344203A1 (en) | Neural network computing apparatus and system, and method therefor | |
US20120166374A1 (en) | Architecture, system and method for artificial neural network implementation | |
CN107229967A (en) | A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA | |
US20240265234A1 (en) | Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment | |
TWI417797B (en) | A Parallel Learning Architecture and Its Method for Transferred Neural Network | |
CN108960414B (en) | Method for realizing single broadcast multiple operations based on deep learning accelerator | |
CN115803754A (en) | Hardware architecture for processing data in a neural network | |
Geng et al. | CQNN: a CGRA-based QNN framework | |
CN115310037A (en) | Matrix multiplication computing unit, acceleration unit, computing system and related method | |
CN110232441B (en) | Stack type self-coding system and method based on unidirectional pulsation array | |
Dias et al. | Deep learning in reconfigurable hardware: A survey | |
CN107368459B (en) | Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication | |
CN111652365B (en) | Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof | |
WO2023114417A2 (en) | One-dimensional computational unit for an integrated circuit | |
US20220036196A1 (en) | Reconfigurable computing architecture for implementing artificial neural networks | |
Mahajan et al. | Review of Artificial Intelligence Applications and Architectures | |
CN112836793B (en) | Floating point separable convolution calculation accelerating device, system and image processing method | |
CN111753974A (en) | Neural network accelerator |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||