CN111652365A - Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof - Google Patents

Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Info

Publication number
CN111652365A
Authority
CN
China
Prior art keywords
module
calculation
vmpu
matrix multiplication
vector matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010366873.XA
Other languages
Chinese (zh)
Other versions
CN111652365B (en)
Inventor
刘冰
凤雷
付平
李喜鹏
卢学翼
吴瑞东
王嘉晨
童启凡
周彦臻
谢宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202010366873.XA
Publication of CN111652365A
Application granted
Publication of CN111652365B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof. The hardware architecture comprises: a general-purpose processor module, which interacts with the external environment, computes the reward function, and maintains the Deep Q-Network experience pool; an external DDR memory, which stores the experience pool of the Deep Q-Network algorithm; an AXI bus interface, a generic AXI bus interface structure responsible for transferring and feeding back control signals and data signals between the general-purpose processor and the FPGA programmable logic module; a Target Q module, which performs the forward-inference computation of the Target Q network; and a Current Q module, which performs the forward inference and back propagation of the Current Q network. The invention achieves real-time computation of the Deep Q-Network algorithm on a highly optimized FPGA hardware architecture.

Description

Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
Technical Field
The invention belongs to the technical field of artificial intelligence, and in particular relates to a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof.
Background
Deep reinforcement learning is an emerging artificial intelligence technology that combines traditional reinforcement learning with deep learning. It is mainly applied in fields such as robot control, autonomous driving, and search and recommendation, and the Deep Q-Network algorithm is a representative algorithm in this field. However, the Deep Q-Network algorithm involves two kinds of computation, forward inference and back propagation of the neural network, and when the network is large it suffers from heavy storage-resource consumption and high computational complexity. At present, deep reinforcement learning is usually studied and implemented on large GPU (graphics processing unit) server boards, which makes it difficult to apply in edge-computing scenarios with limited hardware resources and power consumption, so its applicability is low; moreover, in the prior art the FPGA hardware computing architecture has not been optimized and its design space has not been explored.
Disclosure of Invention
The invention provides a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof to solve the above problems. The method can explore the design space of the hardware architecture according to the parameters of the Deep Q-Network and the resource parameters of the FPGA chip, and gives the optimal parallelism parameters of the hardware architecture under the resource constraints of the FPGA chip, thereby achieving real-time computation of the Deep Q-Network algorithm on a highly optimized FPGA hardware architecture.
The invention is realized by the following technical scheme:
a hardware architecture for accelerating Deep Q-Network algorithm comprises a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general-purpose processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is responsible for storing an experience pool of a Deep Q-Network algorithm;
the AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module;
the Target Q module is responsible for realizing forward reasoning calculation of the Target Q network;
the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network;
the Target Q module and the Current Q module are both built from vector-matrix multiplication processing units (VMPUs) cascaded through first-in first-out (FIFO) queues;
the Loss calculation module receives the forward-inference results of the Target Q module and the Current Q module, calculates the error gradient, and passes it to the Current Q module for back-propagation calculation (a sketch of this gradient computation is given after this list of modules);
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
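The patent does not spell out the Loss calculation itself. The following is a minimal, hypothetical sketch of how such a module can derive the error gradient from the two forward-inference results, assuming the standard Deep Q-Network temporal-difference target; the function and parameter names are illustrative and not taken from the patent.

```cpp
// Hypothetical sketch of the error-gradient computation a Loss module of this
// kind would perform, assuming the standard DQN temporal-difference target.
#include <algorithm>
#include <cstddef>

// Gradient of 0.5 * (Q_current(s, a) - y)^2 with respect to Q_current(s, a),
// where y = r + gamma * max_a' Q_target(s', a') (y = r on terminal steps).
float dqn_error_gradient(const float* q_current,   // Current Q forward result, one value per action
                         const float* q_target,    // Target Q forward result for the next state
                         std::size_t num_actions,
                         std::size_t action_taken, // action index stored in the experience pool
                         float reward,
                         float gamma,
                         bool terminal) {
    float max_next_q = 0.0f;
    if (!terminal) {
        max_next_q = *std::max_element(q_target, q_target + num_actions);
    }
    const float td_target = reward + gamma * max_next_q;
    // This scalar is fed back into the Current Q module as the output-layer
    // error for back propagation; non-selected actions receive zero error.
    return q_current[action_taken] - td_target;
}
```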
Furthermore, the vector-matrix multiplication processing unit VMPU has two calculation modes, mode A and mode B, and performs both the matrix-vector multiplication and the activation function; in mode A the matrix is processed column-major, and in mode B it is processed row-major. In the column-major mode A, the single-instruction-multiple-data (SIMD) parallelism corresponds to the row dimension of the matrix and the processing-element (PE) parallelism corresponds to the column dimension; in the row-major mode B, the SIMD parallelism corresponds to the column dimension and the PE parallelism corresponds to the row dimension.
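As an illustration of this mapping (a sketch, not code from the patent), the helper below computes the folding factors along the SIMD and PE dimensions for either mode of a rows x cols weight matrix; the struct and function names are assumptions made for the example.

```cpp
// Minimal sketch of how mode A and mode B map SIMD and PE parallelism onto a
// rows x cols weight matrix. Dimensions are assumed divisible by the parallelism.
struct Folding {
    int sf;  // number of folds along the SIMD dimension
    int pf;  // number of folds along the PE dimension
};

// Mode A (column-major): SIMD covers the row dimension, PE the column dimension.
// Mode B (row-major):    SIMD covers the column dimension, PE the row dimension.
Folding vmpu_folding(int rows, int cols, int simd, int pe, bool mode_a) {
    if (mode_a) {
        return Folding{rows / simd, cols / pe};
    }
    return Folding{cols / simd, rows / pe};
}
```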
Further, when the Target Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major computation mode of the VMPU and the even layers use the row-major mode, and the row-major VMPUs and column-major VMPUs are cascaded alternately through first-in first-out (FIFO) queues.
Further, when the Current Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major mode and the even layers use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers use the column-major mode; the row-major VMPUs and column-major VMPUs are cascaded alternately through FIFO queues.
Further, the vector-matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between a row-major VMPU and a column-major VMPU; and the Target Q module and the Current Q module compute in parallel.
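As an informal illustration of this task-level pipelining between a column-major producer and a row-major consumer, the sketch below is a behavioral software emulation (not the patent's HLS implementation); the ReLU activation and the function name are assumptions.

```cpp
// Behavioral sketch: the mode-A producer emits output neurons as soon as they
// are finished, and the mode-B consumer folds each arriving element into all
// of its partial sums immediately, so the second layer starts working before
// the first layer has finished.
#include <cstddef>
#include <queue>
#include <vector>

void cascade(const std::vector<std::vector<float>>& w1,   // layer 1, in x hidden
             const std::vector<std::vector<float>>& w2,   // layer 2, hidden x out
             const std::vector<float>& in,
             std::vector<float>& out) {
    std::queue<float> fifo;                       // models the FIFO between VMPUs
    std::vector<float> acc2(w2[0].size(), 0.0f);  // layer-2 partial sums
    std::size_t consumed = 0;

    for (std::size_t j = 0; j < w1[0].size(); ++j) {
        float acc1 = 0.0f;                        // mode A: finish one output at a time
        for (std::size_t i = 0; i < in.size(); ++i) acc1 += w1[i][j] * in[i];
        fifo.push(acc1 > 0.0f ? acc1 : 0.0f);     // ReLU assumed

        while (!fifo.empty()) {                   // mode B: consume data as it arrives
            float x = fifo.front(); fifo.pop();
            for (std::size_t k = 0; k < acc2.size(); ++k) acc2[k] += w2[consumed][k] * x;
            ++consumed;
        }
    }
    out = acc2;                                   // layer-2 activation omitted
}
```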
A design space exploration method for the hardware architecture that accelerates the Deep Q-Network algorithm comprises the following steps:
Step 1: set the FPGA resource model constraints, wherein the DSP resources and BRAM resources used by the hardware architecture must not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraint, wherein the SIMD parallelism of each vector-matrix multiplication unit stage in the forward-inference calculation modules must not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and PE parallelism of each VMPU stage in Matlab software, and determine the SIMD parallelism and PE parallelism of each VMPU stage of the hardware architecture.
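The sketch below shows what such an exhaustive search can look like. The patent performs the search in Matlab and does not disclose its resource cost model; the DSP, BRAM, and cycle-count models here, and all names, are assumptions introduced only to illustrate the structure of the search. A branch is pruned as soon as it exceeds the DSP or BRAM budget or can no longer beat the best latency found so far.

```cpp
// Hedged sketch of the exhaustive design-space search in steps 1-3.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Layer  { int in_neurons; int out_neurons; };   // one fully connected layer
struct Choice { int simd; int pe; };                   // parallelism of its VMPU

struct Search {
    std::vector<Layer> layers;
    int dsp_budget = 0;
    int bram_budget = 0;
    std::vector<Choice> best;
    std::uint64_t best_cycles = UINT64_MAX;

    // Placeholder cost models (assumptions, not the patent's models).
    static int dsp_cost(const Choice& c)  { return c.simd * c.pe; }
    static int bram_cost(const Layer& l)  { return (l.in_neurons * l.out_neurons) / 1024 + 1; }
    static std::uint64_t cycles(const Layer& l, const Choice& c) {
        // Total folding factor = SF * PF (see the VMPU loop structure).
        return std::uint64_t(l.in_neurons / c.simd) * (l.out_neurons / c.pe);
    }

    void recurse(std::size_t i, int dsp, int bram, std::uint64_t cyc,
                 std::vector<Choice>& cur) {
        if (dsp > dsp_budget || bram > bram_budget || cyc >= best_cycles) return;
        if (i == layers.size()) { best = cur; best_cycles = cyc; return; }
        const Layer& l = layers[i];
        // Step-2 constraint: SIMD may not exceed the neuron count of the layer
        // being computed (bound here to the input dimension as an assumption).
        for (int simd = 1; simd <= l.in_neurons; ++simd) {
            if (l.in_neurons % simd) continue;            // keep folds integral
            for (int pe = 1; pe <= l.out_neurons; ++pe) {
                if (l.out_neurons % pe) continue;
                Choice c{simd, pe};
                cur.push_back(c);
                recurse(i + 1, dsp + dsp_cost(c), bram + bram_cost(l),
                        cyc + cycles(l, c), cur);
                cur.pop_back();
            }
        }
    }
};
```

Usage would be along the lines of filling in `layers`, `dsp_budget`, and `bram_budget`, then calling `recurse(0, 0, 0, 0, cur)` with an empty `cur`; `best` then holds the per-stage (SIMD, PE) pairs.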
The invention has the beneficial effects that:
1. The hardware of the invention efficiently performs neural-network inference and back propagation, and realizes pipelined parallel computation within a network layer, pipelined computation between network layers, and parallel computation between the Target Q network and the Current Q network.
2. The method can evaluate the resources required to deploy the Deep Q-Network algorithm on an FPGA and adjust the constraints according to the actual situation.
3. Under limited computing resources and space, the invention can explore the design space of the hardware architecture according to the hardware resource model and the algorithm structure model, and deploy the algorithm on the FPGA with the optimal parallelism parameters.
Drawings
FIG. 1 is a diagram of the hardware-software architecture of the present invention on a field programmable gate array.
FIG. 2 is a schematic diagram of the coarse-grained structure of the vector matrix multiplication unit of the present invention.
Fig. 3 is a pseudo code of the vector matrix multiplication unit of the present invention.
FIG. 4 is a diagram of a fine-grained structure of a vector matrix multiplication unit in a column-based computation mode according to the present invention.
Fig. 5 is a schematic diagram of a fine-grained structure of a vector matrix multiplication unit in a row-based calculation mode according to the present invention.
FIG. 6 is a flowchart of a design space exploration process in designing a hardware architecture according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A hardware architecture for accelerating Deep Q-Network algorithm comprises a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general-purpose processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is mainly responsible for storing a Deep Q-Network algorithm experience pool.
The AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module.
The Target Q module is responsible for realizing forward reasoning calculation of the Target Q network, and the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network.
The Target Q module and the Current Q module are each built from vector-matrix multiplication processing units (VMPUs), with column-major VMPUs and row-major VMPUs cascaded alternately through first-in first-out (FIFO) queues;
the Loss calculation module receives the forward-inference results of the Target Q module and the Current Q module, calculates the error gradient, and passes it to the Current Q module for back-propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
The vector-matrix multiplication unit VMPU has two calculation modes, A and B, and performs both the matrix-vector multiplication and the activation function; in mode A the matrix is processed column-major, and in mode B it is processed row-major. In the column-major mode A, the SIMD parallelism corresponds to the row dimension of the matrix and the PE parallelism corresponds to the column dimension; in the row-major mode B, the SIMD parallelism corresponds to the column dimension and the PE parallelism corresponds to the row dimension.
When the Target Q module is built by cascading VMPUs, the odd layers of forward inference use the column-major computation mode and the even layers use the row-major mode;
when the Current Q module is built by cascading VMPUs, the odd layers of forward inference use the column-major mode, the even layers of forward inference use the row-major mode, the odd layers of back propagation use the row-major mode, and the even layers of back propagation use the column-major mode;
the VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between a row-major VMPU and a column-major VMPU; and the Target Q module and the Current Q module compute in parallel.
further, the method for exploring the design space specifically comprises the following steps:
step 1: setting FPGA resource model constraint conditions, wherein DSP resources and BRAM resources used by a hardware architecture cannot exceed constraint values;
step 2: setting a Deep Q-Network algorithm structure model constraint condition, wherein the single instruction multiple data stream SIMD parallelism of each level of vector matrix multiplication units in a forward reasoning calculation module in a hardware architecture does not exceed the neuron number of a current calculation full connection layer;
and step 3: under the current resource model constraint and algorithm model constraint, searching the single instruction multiple data stream SIMD parallelism and the processing unit PE parallelism of each level of vector matrix multiplication unit VMPU in Matlab software by using an exhaustion method, and determining the single instruction multiple data stream SIMD parallelism and the processing unit PE parallelism of each level of vector matrix multiplication unit VMPU in a hardware architecture.
Further, Fig. 3 shows the pseudo-code of the vector-matrix multiplication unit VMPU. It consists of three loops in total; the two innermost loops are fully unrolled in the hardware implementation to realize SIMD parallelism and PE parallelism, and the bound Fold of the outermost loop is the total folding factor, i.e. the product of the folding factor along the SIMD dimension and the folding factor along the PE dimension, indicating how many rounds of SIMD-parallel and PE-parallel computation are needed to complete the whole vector-matrix multiplication.
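The following is a reconstruction of that loop structure as an interpretation of the description of Fig. 3, not a verbatim copy, shown for the column-major mode A and using the 8-input / 4-output example sizes discussed below; the ReLU activation and the function name are assumptions.

```cpp
// Folded VMPU loop nest, mode A: one outer loop over Fold = SF * PF, and two
// inner loops over PE and SIMD that are fully unrolled in hardware.
constexpr int SIMD = 4, PE = 2, SF = 2, PF = 2;   // SF = rows/SIMD, PF = cols/PE
constexpr int ROWS = SF * SIMD, COLS = PF * PE;   // 8 x 4 weight matrix

static inline float activation(float x) { return x > 0.0f ? x : 0.0f; }  // ReLU assumed

void vmpu_mode_a(const float weight[ROWS][COLS], const float in_vec[ROWS],
                 float out_vec[COLS]) {
    float acc[PE] = {0.0f, 0.0f};
    int sf = 0, pf = 0;
    for (int fold = 0; fold < SF * PF; ++fold) {   // outer loop: total folding factor
        for (int p = 0; p < PE; ++p) {             // fully unrolled in hardware
            for (int s = 0; s < SIMD; ++s) {       // fully unrolled in hardware
                acc[p] += weight[sf * SIMD + s][pf * PE + p] * in_vec[sf * SIMD + s];
            }
        }
        if (++sf == SF) {                          // PE outputs finished after SF folds
            for (int p = 0; p < PE; ++p) {
                out_vec[pf * PE + p] = activation(acc[p]);
                acc[p] = 0.0f;
            }
            sf = 0;
            ++pf;
        }
    }
}
```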
Further, the fine-grained structures of the two calculation modes A and B shown in Figs. 4 and 5 are described below using an example in which the input layer has 8 neurons, the intermediate hidden layer has 4 neurons, and the output layer has 4 neurons.
Further, Fig. 4 is a schematic diagram of the fine-grained structure of the VMPU in mode A, in which the forward computation from the input layer to the hidden layer is performed. The SIMD parallelism is 4, corresponding to the row dimension of the matrix, with folding factor SF = 2; the PE parallelism is 2, corresponding to the column dimension of the matrix, with folding factor PF = 2; the total folding factor is 4. Two final results are completed every SF iterations, and the whole vector-matrix multiplication is finished in 4 rounds of computation.
Further, Fig. 5 is a schematic diagram of the fine-grained structure of the VMPU in mode B. In mode B the SIMD parallelism is 2, corresponding to the column dimension of the matrix, with folding factor SF = 2; the PE parallelism is 2, corresponding to the row dimension of the matrix, with folding factor PF = 2. Four intermediate results are generated in each SF iteration, and the whole vector-matrix multiplication is finished in 4 rounds of computation.
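Restated as arithmetic on the numbers given for these two examples (nothing beyond the figures above):

```latex
\text{Mode A } (8 \to 4 \text{ layer}):\quad SF = \frac{\text{rows}}{\mathrm{SIMD}} = \frac{8}{4} = 2,\quad
PF = \frac{\text{cols}}{\mathrm{PE}} = \frac{4}{2} = 2,\quad \mathrm{Fold} = SF \cdot PF = 4

\text{Mode B } (4 \to 4 \text{ layer}):\quad SF = \frac{\text{cols}}{\mathrm{SIMD}} = \frac{4}{2} = 2,\quad
PF = \frac{\text{rows}}{\mathrm{PE}} = \frac{4}{2} = 2,\quad \mathrm{Fold} = SF \cdot PF = 4
```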
During back propagation of the network, the transposed weight matrix is needed to back-propagate the gradient. When forward inference of the neural network is computed on the FPGA, the weight matrix must be stored in BRAM in advance in either row-major or column-major form; for example, during forward inference the weight matrix may be stored in the corresponding BRAM in row-major form, and its storage layout is tied to the PE and SIMD parallelism parameters. Consequently, during back propagation, parallel computation would be prevented by the limited on-chip read bandwidth. However, the storage layouts of row-major computation and column-major computation are transposes of each other, so if one mode is used during forward inference and the other during back propagation, the bandwidth problem caused by not being able to transpose the matrix data is flexibly avoided.
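A minimal software illustration of this point (not the patent's BRAM layout): the forward pass walks down the columns of W, while back propagation walks along its rows, which is exactly a column walk of the transposed matrix over the same stored data, so changing the computation mode stands in for physically transposing the weights. The sizes and function names below are assumptions for the example.

```cpp
#include <cstring>

constexpr int ROWS = 8, COLS = 4;   // example layer size from the description

// Forward inference: each output j accumulates down column j of W (mode A order).
void forward(const float w[ROWS][COLS], const float in[ROWS], float out[COLS]) {
    std::memset(out, 0, sizeof(float) * COLS);
    for (int j = 0; j < COLS; ++j)
        for (int i = 0; i < ROWS; ++i)
            out[j] += w[i][j] * in[i];
}

// Back propagation of the error: each input i accumulates along row i of W
// (mode B order), i.e. down a column of the transpose of W. The same stored
// weights are read in the opposite mode, so no physical transposition is needed.
void backward(const float w[ROWS][COLS], const float err_out[COLS], float err_in[ROWS]) {
    std::memset(err_in, 0, sizeof(float) * ROWS);
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            err_in[i] += w[i][j] * err_out[j];
}
```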

Claims (6)

1. A hardware architecture for accelerating Deep Q-Network algorithm is characterized by comprising a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general-purpose processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is responsible for storing an experience pool of a Deep Q-Network algorithm;
the AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module;
the Target Q module is responsible for realizing forward reasoning calculation of the Target Q network;
the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network;
the Target Q module and the Current Q module are both built from vector-matrix multiplication processing units VMPU cascaded through first-in first-out (FIFO) queues;
the Loss calculation module is responsible for receiving forward reasoning calculation results of the Target Q module and the Current Q module, calculating error gradient and transmitting the error gradient to the Current Q module for back propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
2. The hardware architecture for accelerating the Deep Q-Network algorithm according to claim 1, wherein the vector-matrix multiplication unit VMPU has two calculation modes, A and B, and performs both the matrix-vector multiplication and the activation function; in mode A the matrix is processed column-major, and in mode B it is processed row-major; in the column-major mode A, the single-instruction-multiple-data (SIMD) parallelism corresponds to the row dimension of the matrix and the processing-element (PE) parallelism corresponds to the column dimension of the matrix; in the row-major mode B, the SIMD parallelism corresponds to the column dimension of the matrix and the PE parallelism corresponds to the row dimension of the matrix.
3. The hardware architecture according to claim 1, wherein, when the Target Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major computation mode of the VMPU and the even layers of forward inference use the row-major mode, and the row-major VMPUs and column-major VMPUs are cascaded alternately through first-in first-out (FIFO) queues.
4. The hardware architecture according to claim 1, wherein, when the Current Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major mode and the even layers of forward inference use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers of back propagation use the column-major mode, and the row-major VMPUs and column-major VMPUs are cascaded alternately through FIFO queues.
5. The hardware architecture for accelerating the Deep Q-Network algorithm according to claim 2, wherein the vector-matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between a row-major VMPU and a column-major VMPU; and the Target Q module and the Current Q module compute in parallel.
6. The design space exploration method for the hardware architecture for accelerating the Deep Q-Network algorithm according to claim 1, characterized in that the method specifically comprises the following steps:
Step 1: set the FPGA resource model constraints, wherein the DSP resources and BRAM resources used by the hardware architecture must not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraint, wherein the SIMD parallelism of each vector-matrix multiplication unit stage in the forward-inference calculation modules must not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and PE parallelism of each VMPU stage in Matlab software, and determine the SIMD parallelism and PE parallelism of each VMPU stage of the hardware architecture.
CN202010366873.XA 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof Active CN111652365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366873.XA CN111652365B (en) 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010366873.XA CN111652365B (en) 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Publications (2)

Publication Number Publication Date
CN111652365A true CN111652365A (en) 2020-09-11
CN111652365B CN111652365B (en) 2022-05-17

Family

ID=72347887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366873.XA Active CN111652365B (en) 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Country Status (1)

Country Link
CN (1) CN111652365B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130248A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. Generating dual sequence inferences using a neural network model
CA3032182A1 (en) * 2018-01-31 2019-07-31 Royal Bank Of Canada Pre-training neural networks with human demonstrations for deep reinforcement learning
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
GB201913353D0 (en) * 2019-09-16 2019-10-30 Samsung Electronics Co Ltd Method for designing accelerator hardware

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIANG SU et al.: "Neural Network Based Reinforcement Learning Acceleration on FPGA Platforms", Computer Architecture News *
JIANG SU et al.: "Neural Network Based Reinforcement Learning", ACM SIGARCH Computer Architecture News *
NARU SUGIMOTO et al.: "Trax solver on Zynq with Deep Q-Network", 2015 International Conference on Field Programmable Technology (FPT) *
GONG Lei: "Research on heterogeneous multi-core acceleration methods for convolutional neural networks on reconfigurable platforms", China Excellent Doctoral and Master's Dissertations Full-text Database (Doctoral) *

Also Published As

Publication number Publication date
CN111652365B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
US11574195B2 (en) Operation method
US10691996B2 (en) Hardware accelerator for compressed LSTM
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
CN107797962B (en) Neural network based computational array
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN108090565A A parallel training acceleration method for convolutional neural networks
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
US20120166374A1 (en) Architecture, system and method for artificial neural network implementation
CN107229967A A hardware accelerator and method for implementing sparse GRU neural networks based on FPGA
US20240265234A1 (en) Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment
TWI417797B (en) A Parallel Learning Architecture and Its Method for Transferred Neural Network
CN108960414B (en) Method for realizing single broadcast multiple operations based on deep learning accelerator
CN115803754A (en) Hardware architecture for processing data in a neural network
Geng et al. CQNN: a CGRA-based QNN framework
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
CN110232441B (en) Stack type self-coding system and method based on unidirectional pulsation array
Dias et al. Deep learning in reconfigurable hardware: A survey
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN111652365B (en) Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
WO2023114417A2 (en) One-dimensional computational unit for an integrated circuit
US20220036196A1 (en) Reconfigurable computing architecture for implementing artificial neural networks
Mahajan et al. Review of Artificial Intelligence Applications and Architectures
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN111753974A (en) Neural network accelerator

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant