CN111652365A - Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof - Google Patents

Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Info

Publication number
CN111652365A
Authority
CN
China
Prior art keywords
module
calculation
vmpu
matrix multiplication
vector matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010366873.XA
Other languages
Chinese (zh)
Other versions
CN111652365B (en)
Inventor
刘冰
凤雷
付平
李喜鹏
卢学翼
吴瑞东
王嘉晨
童启凡
周彦臻
谢宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202010366873.XA
Publication of CN111652365A
Application granted
Publication of CN111652365B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof. The hardware architecture comprises: a general-purpose processor module, which interacts with the external environment, computes the reward function, and maintains the Deep Q-Network experience pool; an external DDR memory, which stores the experience pool of the Deep Q-Network algorithm; an AXI bus interface, a generic AXI bus interface structure responsible for transferring and feeding back control signals and data signals between the general-purpose processor and the FPGA programmable logic module; a Target Q module, which performs the forward-inference computation of the Target Q network; and a Current Q module, which performs the forward inference and back propagation of the Current Q network. The invention achieves real-time computation of the Deep Q-Network algorithm on a highly optimized FPGA hardware architecture.

Description

Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
Technical Field
The invention belongs to the technical field of artificial intelligence, and in particular relates to a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof.
Background
Deep reinforcement learning is an emerging artificial intelligence technology that combines traditional reinforcement learning with deep learning. It is mainly applied in fields such as robot control, autonomous driving, and search and recommendation, and the Deep Q-Network algorithm is a representative algorithm in this field. However, the Deep Q-Network algorithm involves two kinds of computation, forward inference and back propagation of the neural network, and when the network is large it suffers from heavy storage-resource consumption and high computational complexity. At present, deep reinforcement learning is usually studied and implemented on large GPU (graphics processing unit) server boards, which makes it difficult to apply in edge-computing scenarios with limited hardware resources and power consumption, so its applicability is low; moreover, in the prior art the FPGA hardware computing architecture has not been optimized and its design space has not been explored.
Disclosure of Invention
The invention provides a hardware architecture for accelerating the Deep Q-Network algorithm and a design space exploration method thereof to solve the above problems. The method can explore the design space of the hardware architecture according to the parameters of the Deep Q-Network and the resource parameters of the FPGA chip, and gives the optimal parallelism parameters of the hardware architecture under the resource constraints of the FPGA chip, thereby achieving real-time computation of the Deep Q-Network algorithm on a highly optimized FPGA hardware architecture.
The invention is realized by the following technical scheme:
a hardware architecture for accelerating Deep Q-Network algorithm comprises a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general-purpose processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is responsible for storing an experience pool of a Deep Q-Network algorithm;
the AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module;
the Target Q module is responsible for realizing forward reasoning calculation of the Target Q network;
the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network;
the Target Q module and the Current Q module are both built from vector-matrix multiplication processing units (VMPUs) cascaded through first-in first-out (FIFO) queues;
the Loss calculation module receives the forward-inference results of the Target Q module and the Current Q module, calculates the error gradient, and passes it to the Current Q module for back-propagation calculation (a sketch of this gradient computation is given after this list of modules);
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
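The patent does not spell out the Loss calculation itself. The following is a minimal, hypothetical sketch of how such a module can derive the error gradient from the two forward-inference results, assuming the standard Deep Q-Network temporal-difference target; the function and parameter names are illustrative and not taken from the patent.

```cpp
// Hypothetical sketch of the error-gradient computation a Loss module of this
// kind would perform, assuming the standard DQN temporal-difference target.
#include <algorithm>
#include <cstddef>

// Gradient of 0.5 * (Q_current(s, a) - y)^2 with respect to Q_current(s, a),
// where y = r + gamma * max_a' Q_target(s', a') (y = r on terminal steps).
float dqn_error_gradient(const float* q_current,   // Current Q forward result, one value per action
                         const float* q_target,    // Target Q forward result for the next state
                         std::size_t num_actions,
                         std::size_t action_taken, // action index stored in the experience pool
                         float reward,
                         float gamma,
                         bool terminal) {
    float max_next_q = 0.0f;
    if (!terminal) {
        max_next_q = *std::max_element(q_target, q_target + num_actions);
    }
    const float td_target = reward + gamma * max_next_q;
    // This scalar is fed back into the Current Q module as the output-layer
    // error for back propagation; non-selected actions receive zero error.
    return q_current[action_taken] - td_target;
}
```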
Furthermore, the vector-matrix multiplication processing unit VMPU has two calculation modes, mode A and mode B, and performs both the matrix-vector multiplication and the activation function; in mode A the matrix is processed column-major, and in mode B it is processed row-major. In the column-major mode A, the single-instruction-multiple-data (SIMD) parallelism corresponds to the row dimension of the matrix and the processing-element (PE) parallelism corresponds to the column dimension; in the row-major mode B, the SIMD parallelism corresponds to the column dimension and the PE parallelism corresponds to the row dimension.
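As an illustration of this mapping (a sketch, not code from the patent), the helper below computes the folding factors along the SIMD and PE dimensions for either mode of a rows x cols weight matrix; the struct and function names are assumptions made for the example.

```cpp
// Minimal sketch of how mode A and mode B map SIMD and PE parallelism onto a
// rows x cols weight matrix. Dimensions are assumed divisible by the parallelism.
struct Folding {
    int sf;  // number of folds along the SIMD dimension
    int pf;  // number of folds along the PE dimension
};

// Mode A (column-major): SIMD covers the row dimension, PE the column dimension.
// Mode B (row-major):    SIMD covers the column dimension, PE the row dimension.
Folding vmpu_folding(int rows, int cols, int simd, int pe, bool mode_a) {
    if (mode_a) {
        return Folding{rows / simd, cols / pe};
    }
    return Folding{cols / simd, rows / pe};
}
```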
Further, when the Target Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major computation mode of the VMPU and the even layers use the row-major mode, and the row-major VMPUs and column-major VMPUs are cascaded alternately through first-in first-out (FIFO) queues.
Further, when the Current Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major mode and the even layers use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers use the column-major mode; the row-major VMPUs and column-major VMPUs are cascaded alternately through FIFO queues.
Further, the vector-matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between a row-major VMPU and a column-major VMPU; and the Target Q module and the Current Q module compute in parallel.
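As an informal illustration of this task-level pipelining between a column-major producer and a row-major consumer, the sketch below is a behavioral software emulation (not the patent's HLS implementation); the ReLU activation and the function name are assumptions.

```cpp
// Behavioral sketch: the mode-A producer emits output neurons as soon as they
// are finished, and the mode-B consumer folds each arriving element into all
// of its partial sums immediately, so the second layer starts working before
// the first layer has finished.
#include <cstddef>
#include <queue>
#include <vector>

void cascade(const std::vector<std::vector<float>>& w1,   // layer 1, in x hidden
             const std::vector<std::vector<float>>& w2,   // layer 2, hidden x out
             const std::vector<float>& in,
             std::vector<float>& out) {
    std::queue<float> fifo;                       // models the FIFO between VMPUs
    std::vector<float> acc2(w2[0].size(), 0.0f);  // layer-2 partial sums
    std::size_t consumed = 0;

    for (std::size_t j = 0; j < w1[0].size(); ++j) {
        float acc1 = 0.0f;                        // mode A: finish one output at a time
        for (std::size_t i = 0; i < in.size(); ++i) acc1 += w1[i][j] * in[i];
        fifo.push(acc1 > 0.0f ? acc1 : 0.0f);     // ReLU assumed

        while (!fifo.empty()) {                   // mode B: consume data as it arrives
            float x = fifo.front(); fifo.pop();
            for (std::size_t k = 0; k < acc2.size(); ++k) acc2[k] += w2[consumed][k] * x;
            ++consumed;
        }
    }
    out = acc2;                                   // layer-2 activation omitted
}
```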
A design space exploration method for the hardware architecture that accelerates the Deep Q-Network algorithm comprises the following steps:
Step 1: set the FPGA resource model constraints, wherein the DSP resources and BRAM resources used by the hardware architecture must not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraint, wherein the SIMD parallelism of each vector-matrix multiplication unit stage in the forward-inference calculation modules must not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and PE parallelism of each VMPU stage in Matlab software, and determine the SIMD parallelism and PE parallelism of each VMPU stage of the hardware architecture.
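The sketch below shows what such an exhaustive search can look like. The patent performs the search in Matlab and does not disclose its resource cost model; the DSP, BRAM, and cycle-count models here, and all names, are assumptions introduced only to illustrate the structure of the search. A branch is pruned as soon as it exceeds the DSP or BRAM budget or can no longer beat the best latency found so far.

```cpp
// Hedged sketch of the exhaustive design-space search in steps 1-3.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Layer  { int in_neurons; int out_neurons; };   // one fully connected layer
struct Choice { int simd; int pe; };                   // parallelism of its VMPU

struct Search {
    std::vector<Layer> layers;
    int dsp_budget = 0;
    int bram_budget = 0;
    std::vector<Choice> best;
    std::uint64_t best_cycles = UINT64_MAX;

    // Placeholder cost models (assumptions, not the patent's models).
    static int dsp_cost(const Choice& c)  { return c.simd * c.pe; }
    static int bram_cost(const Layer& l)  { return (l.in_neurons * l.out_neurons) / 1024 + 1; }
    static std::uint64_t cycles(const Layer& l, const Choice& c) {
        // Total folding factor = SF * PF (see the VMPU loop structure).
        return std::uint64_t(l.in_neurons / c.simd) * (l.out_neurons / c.pe);
    }

    void recurse(std::size_t i, int dsp, int bram, std::uint64_t cyc,
                 std::vector<Choice>& cur) {
        if (dsp > dsp_budget || bram > bram_budget || cyc >= best_cycles) return;
        if (i == layers.size()) { best = cur; best_cycles = cyc; return; }
        const Layer& l = layers[i];
        // Step-2 constraint: SIMD may not exceed the neuron count of the layer
        // being computed (bound here to the input dimension as an assumption).
        for (int simd = 1; simd <= l.in_neurons; ++simd) {
            if (l.in_neurons % simd) continue;            // keep folds integral
            for (int pe = 1; pe <= l.out_neurons; ++pe) {
                if (l.out_neurons % pe) continue;
                Choice c{simd, pe};
                cur.push_back(c);
                recurse(i + 1, dsp + dsp_cost(c), bram + bram_cost(l),
                        cyc + cycles(l, c), cur);
                cur.pop_back();
            }
        }
    }
};
```

Usage would be along the lines of filling in `layers`, `dsp_budget`, and `bram_budget`, then calling `recurse(0, 0, 0, 0, cur)` with an empty `cur`; `best` then holds the per-stage (SIMD, PE) pairs.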
The invention has the beneficial effects that:
1. The hardware of the invention efficiently performs neural-network inference and back propagation, and realizes pipelined parallel computation within a network layer, pipelined computation between network layers, and parallel computation between the Target Q network and the Current Q network.
2. The method can evaluate the resources required to deploy the Deep Q-Network algorithm on an FPGA and adjust the constraints according to the actual situation.
3. Under limited computing resources and space, the invention can explore the design space of the hardware architecture according to the hardware resource model and the algorithm structure model, and deploy the algorithm on the FPGA with the optimal parallelism parameters.
Drawings
FIG. 1 is a diagram of the hardware-software architecture of the present invention on a field programmable gate array.
FIG. 2 is a schematic diagram of the coarse-grained structure of the vector matrix multiplication unit of the present invention.
Fig. 3 is a pseudo code of the vector matrix multiplication unit of the present invention.
FIG. 4 is a diagram of a fine-grained structure of a vector matrix multiplication unit in a column-based computation mode according to the present invention.
Fig. 5 is a schematic diagram of a fine-grained structure of a vector matrix multiplication unit in a row-based calculation mode according to the present invention.
FIG. 6 is a flowchart of a design space exploration process in designing a hardware architecture according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A hardware architecture for accelerating Deep Q-Network algorithm comprises a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general-purpose processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is mainly responsible for storing a Deep Q-Network algorithm experience pool.
The AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module.
The Target Q module is responsible for realizing forward reasoning calculation of the Target Q network, and the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network.
The Target Q module and the Current Q module are each built from vector-matrix multiplication processing units (VMPUs), with column-major VMPUs and row-major VMPUs cascaded alternately through first-in first-out (FIFO) queues;
the Loss calculation module receives the forward-inference results of the Target Q module and the Current Q module, calculates the error gradient, and passes it to the Current Q module for back-propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
The vector-matrix multiplication unit VMPU has two calculation modes, A and B, and performs both the matrix-vector multiplication and the activation function; in mode A the matrix is processed column-major, and in mode B it is processed row-major. In the column-major mode A, the SIMD parallelism corresponds to the row dimension of the matrix and the PE parallelism corresponds to the column dimension; in the row-major mode B, the SIMD parallelism corresponds to the column dimension and the PE parallelism corresponds to the row dimension.
When the Target Q module is built by cascading VMPUs, the odd layers of forward inference use the column-major computation mode and the even layers use the row-major mode;
when the Current Q module is built by cascading VMPUs, the odd layers of forward inference use the column-major mode, the even layers of forward inference use the row-major mode, the odd layers of back propagation use the row-major mode, and the even layers of back propagation use the column-major mode;
the VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between a row-major VMPU and a column-major VMPU; and the Target Q module and the Current Q module compute in parallel.
further, the method for exploring the design space specifically comprises the following steps:
step 1: setting FPGA resource model constraint conditions, wherein DSP resources and BRAM resources used by a hardware architecture cannot exceed constraint values;
step 2: setting a Deep Q-Network algorithm structure model constraint condition, wherein the single instruction multiple data stream SIMD parallelism of each level of vector matrix multiplication units in a forward reasoning calculation module in a hardware architecture does not exceed the neuron number of a current calculation full connection layer;
and step 3: under the current resource model constraint and algorithm model constraint, searching the single instruction multiple data stream SIMD parallelism and the processing unit PE parallelism of each level of vector matrix multiplication unit VMPU in Matlab software by using an exhaustion method, and determining the single instruction multiple data stream SIMD parallelism and the processing unit PE parallelism of each level of vector matrix multiplication unit VMPU in a hardware architecture.
Further, Fig. 3 shows the pseudo-code of the vector-matrix multiplication unit VMPU. It consists of three loops in total; the two innermost loops are fully unrolled in the hardware implementation to realize SIMD parallelism and PE parallelism, and the bound Fold of the outermost loop is the total folding factor, i.e. the product of the folding factor along the SIMD dimension and the folding factor along the PE dimension, indicating how many rounds of SIMD-parallel and PE-parallel computation are needed to complete the whole vector-matrix multiplication.
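The following is a reconstruction of that loop structure as an interpretation of the description of Fig. 3, not a verbatim copy, shown for the column-major mode A and using the 8-input / 4-output example sizes discussed below; the ReLU activation and the function name are assumptions.

```cpp
// Folded VMPU loop nest, mode A: one outer loop over Fold = SF * PF, and two
// inner loops over PE and SIMD that are fully unrolled in hardware.
constexpr int SIMD = 4, PE = 2, SF = 2, PF = 2;   // SF = rows/SIMD, PF = cols/PE
constexpr int ROWS = SF * SIMD, COLS = PF * PE;   // 8 x 4 weight matrix

static inline float activation(float x) { return x > 0.0f ? x : 0.0f; }  // ReLU assumed

void vmpu_mode_a(const float weight[ROWS][COLS], const float in_vec[ROWS],
                 float out_vec[COLS]) {
    float acc[PE] = {0.0f, 0.0f};
    int sf = 0, pf = 0;
    for (int fold = 0; fold < SF * PF; ++fold) {   // outer loop: total folding factor
        for (int p = 0; p < PE; ++p) {             // fully unrolled in hardware
            for (int s = 0; s < SIMD; ++s) {       // fully unrolled in hardware
                acc[p] += weight[sf * SIMD + s][pf * PE + p] * in_vec[sf * SIMD + s];
            }
        }
        if (++sf == SF) {                          // PE outputs finished after SF folds
            for (int p = 0; p < PE; ++p) {
                out_vec[pf * PE + p] = activation(acc[p]);
                acc[p] = 0.0f;
            }
            sf = 0;
            ++pf;
        }
    }
}
```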
Further, the fine-grained structures of the two calculation modes A and B shown in Figs. 4 and 5 are described below using an example in which the input layer has 8 neurons, the intermediate hidden layer has 4 neurons, and the output layer has 4 neurons.
Further, Fig. 4 is a schematic diagram of the fine-grained structure of the VMPU in mode A, in which the forward computation from the input layer to the hidden layer is performed. The SIMD parallelism is 4, corresponding to the row dimension of the matrix, with folding factor SF = 2; the PE parallelism is 2, corresponding to the column dimension of the matrix, with folding factor PF = 2; the total folding factor is 4. Two final results are completed every SF iterations, and the whole vector-matrix multiplication is finished in 4 rounds of computation.
Further, Fig. 5 is a schematic diagram of the fine-grained structure of the VMPU in mode B. In mode B the SIMD parallelism is 2, corresponding to the column dimension of the matrix, with folding factor SF = 2; the PE parallelism is 2, corresponding to the row dimension of the matrix, with folding factor PF = 2. Four intermediate results are generated in each SF iteration, and the whole vector-matrix multiplication is finished in 4 rounds of computation.
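Restated as arithmetic on the numbers given for these two examples (nothing beyond the figures above):

```latex
\text{Mode A } (8 \to 4 \text{ layer}):\quad SF = \frac{\text{rows}}{\mathrm{SIMD}} = \frac{8}{4} = 2,\quad
PF = \frac{\text{cols}}{\mathrm{PE}} = \frac{4}{2} = 2,\quad \mathrm{Fold} = SF \cdot PF = 4

\text{Mode B } (4 \to 4 \text{ layer}):\quad SF = \frac{\text{cols}}{\mathrm{SIMD}} = \frac{4}{2} = 2,\quad
PF = \frac{\text{rows}}{\mathrm{PE}} = \frac{4}{2} = 2,\quad \mathrm{Fold} = SF \cdot PF = 4
```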
During back propagation of the network, the transposed weight matrix is needed to back-propagate the gradient. When forward inference of the neural network is computed on the FPGA, the weight matrix must be stored in BRAM in advance in either row-major or column-major form; for example, during forward inference the weight matrix may be stored in the corresponding BRAM in row-major form, and its storage layout is tied to the PE and SIMD parallelism parameters. Consequently, during back propagation, parallel computation would be prevented by the limited on-chip read bandwidth. However, the storage layouts of row-major computation and column-major computation are transposes of each other, so if one mode is used during forward inference and the other during back propagation, the bandwidth problem caused by not being able to transpose the matrix data is flexibly avoided.
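A minimal software illustration of this point (not the patent's BRAM layout): the forward pass walks down the columns of W, while back propagation walks along its rows, which is exactly a column walk of the transposed matrix over the same stored data, so changing the computation mode stands in for physically transposing the weights. The sizes and function names below are assumptions for the example.

```cpp
#include <cstring>

constexpr int ROWS = 8, COLS = 4;   // example layer size from the description

// Forward inference: each output j accumulates down column j of W (mode A order).
void forward(const float w[ROWS][COLS], const float in[ROWS], float out[COLS]) {
    std::memset(out, 0, sizeof(float) * COLS);
    for (int j = 0; j < COLS; ++j)
        for (int i = 0; i < ROWS; ++i)
            out[j] += w[i][j] * in[i];
}

// Back propagation of the error: each input i accumulates along row i of W
// (mode B order), i.e. down a column of the transpose of W. The same stored
// weights are read in the opposite mode, so no physical transposition is needed.
void backward(const float w[ROWS][COLS], const float err_out[COLS], float err_in[ROWS]) {
    std::memset(err_in, 0, sizeof(float) * ROWS);
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            err_in[i] += w[i][j] * err_out[j];
}
```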

Claims (6)

1. A hardware architecture for accelerating Deep Q-Network algorithm is characterized by comprising a general processor module, an FPGA programmable logic module and an external DDR memory, wherein the FPGA programmable logic module comprises an AXI bus interface, a Target Q module, a Current Q module, a Loss calculation module, a mode control module, a parameter storage unit and a weight updating unit;
the general-purpose processor module is responsible for interacting with the external environment, computing the reward function, and maintaining the Deep Q-Network algorithm experience pool;
the external DDR memory is responsible for storing an experience pool of a Deep Q-Network algorithm;
the AXI bus interface is a general AXI bus interface structure and is responsible for realizing the transmission and feedback of control signals and data signals between the general processor and the FPGA programmable logic module;
the Target Q module is responsible for realizing forward reasoning calculation of the Target Q network;
the Current Q module is responsible for realizing forward reasoning and backward propagation of the Current Q network;
the Target Q module and the Current Q module are both built from vector-matrix multiplication processing units VMPU cascaded through first-in first-out (FIFO) queues;
the Loss calculation module is responsible for receiving forward reasoning calculation results of the Target Q module and the Current Q module, calculating error gradient and transmitting the error gradient to the Current Q module for back propagation calculation;
the mode control module performs data path control on three modes of weight initialization, decision and decision learning;
the parameter storage unit is used for storing weight parameters and weight gradient parameters of a Deep Q-Network algorithm;
and the weight updating unit is used for updating the weight parameters of the trained Deep Q-Network algorithm.
2. The hardware architecture for accelerating the Deep Q-Network algorithm according to claim 1, wherein the vector-matrix multiplication unit VMPU has two calculation modes, A and B, and performs both the matrix-vector multiplication and the activation function; in mode A the matrix is processed column-major, and in mode B it is processed row-major; in the column-major mode A, the single-instruction-multiple-data (SIMD) parallelism corresponds to the row dimension of the matrix and the processing-element (PE) parallelism corresponds to the column dimension of the matrix; in the row-major mode B, the SIMD parallelism corresponds to the column dimension of the matrix and the PE parallelism corresponds to the row dimension of the matrix.
3. The hardware architecture according to claim 1, wherein, when the Target Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major computation mode of the VMPU and the even layers of forward inference use the row-major mode, and the row-major VMPUs and column-major VMPUs are cascaded alternately through first-in first-out (FIFO) queues.
4. The hardware architecture according to claim 1, wherein, when the Current Q module is built by cascading vector-matrix multiplication processing units VMPU, the odd layers of forward inference use the column-major mode and the even layers of forward inference use the row-major mode, while the odd layers of back propagation use the row-major mode and the even layers of back propagation use the column-major mode, and the row-major VMPUs and column-major VMPUs are cascaded alternately through FIFO queues.
5. The hardware architecture for accelerating the Deep Q-Network algorithm according to claim 2, wherein the vector-matrix multiplication processing unit VMPU implements an internal multiply-accumulate pipeline; task-level pipelining is formed between a row-major VMPU and a column-major VMPU; and the Target Q module and the Current Q module compute in parallel.
6. The design space exploration method for the hardware architecture for accelerating the Deep Q-Network algorithm according to claim 1, characterized in that the method specifically comprises the following steps:
Step 1: set the FPGA resource model constraints, wherein the DSP resources and BRAM resources used by the hardware architecture must not exceed the constraint values;
Step 2: set the Deep Q-Network algorithm structure model constraint, wherein the SIMD parallelism of each vector-matrix multiplication unit stage in the forward-inference calculation modules must not exceed the number of neurons of the fully connected layer currently being computed;
Step 3: under the current resource model constraints and algorithm model constraints, exhaustively search the SIMD parallelism and PE parallelism of each VMPU stage in Matlab software, and determine the SIMD parallelism and PE parallelism of each VMPU stage of the hardware architecture.
CN202010366873.XA 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof Active CN111652365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366873.XA CN111652365B (en) 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010366873.XA CN111652365B (en) 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Publications (2)

Publication Number Publication Date
CN111652365A true CN111652365A (en) 2020-09-11
CN111652365B CN111652365B (en) 2022-05-17

Family

ID=72347887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366873.XA Active CN111652365B (en) 2020-04-30 2020-04-30 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof

Country Status (1)

Country Link
CN (1) CN111652365B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130248A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. Generating dual sequence inferences using a neural network model
CA3032182A1 (en) * 2018-01-31 2019-07-31 Royal Bank Of Canada Pre-training neural networks with human demonstrations for deep reinforcement learning
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
GB201913353D0 (en) * 2019-09-16 2019-10-30 Samsung Electronics Co Ltd Method for designing accelerator hardware

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIANG SU et al.: "Neural Network Based Reinforcement Learning Acceleration on FPGA Platforms", Computer Architecture News *
JIANG SU et al.: "Neural Network Based Reinforcement Learning", ACM SIGARCH Computer Architecture News *
NARU SUGIMOTO et al.: "Trax solver on Zynq with Deep Q-Network", 2015 International Conference on Field Programmable Technology (FPT) *
GONG Lei: "Research on heterogeneous multi-core acceleration methods for convolutional neural networks on reconfigurable platforms", China Excellent Doctoral and Master's Dissertations Full-text Database (Doctoral) *

Also Published As

Publication number Publication date
CN111652365B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
US11574195B2 (en) Operation method
US10691996B2 (en) Hardware accelerator for compressed LSTM
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
CN107797962B (en) Neural network based computational array
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN108090565A A parallel training acceleration method for convolutional neural networks
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
US20120166374A1 (en) Architecture, system and method for artificial neural network implementation
CN107229967A A hardware accelerator and method for implementing sparse GRU neural networks based on FPGA
US20240265234A1 (en) Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment
TWI417797B (en) A Parallel Learning Architecture and Its Method for Transferred Neural Network
CN108960414B (en) Method for realizing single broadcast multiple operations based on deep learning accelerator
CN115803754A (en) Hardware architecture for processing data in a neural network
Geng et al. CQNN: a CGRA-based QNN framework
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
CN110232441B (en) Stack type self-coding system and method based on unidirectional pulsation array
Dias et al. Deep learning in reconfigurable hardware: A survey
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN111652365B (en) Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
WO2023114417A2 (en) One-dimensional computational unit for an integrated circuit
US20220036196A1 (en) Reconfigurable computing architecture for implementing artificial neural networks
Mahajan et al. Review of Artificial Intelligence Applications and Architectures
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN111753974A (en) Neural network accelerator

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant