CN109146067B - Policy convolution neural network accelerator based on FPGA

Policy convolution neural network accelerator based on FPGA

Info

Publication number
CN109146067B
Authority
CN
China
Prior art keywords
data
module
input
output
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811373344.1A
Other languages
Chinese (zh)
Other versions
CN109146067A (en)
Inventor
李贞妮
高宇梁
王骄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201811373344.1A
Publication of CN109146067A
Application granted
Publication of CN109146067B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods

Abstract

The invention provides an FPGA (field programmable gate array)-based Policy convolutional neural network accelerator, and relates to the technical field of digital integrated circuits. The accelerator comprises an input buffer module, a convolution module, a scaling module and a Softmax module. The input buffer module feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module performs exponential calculation on the data stream output by the scaling module. The FPGA-based Policy convolutional neural network accelerator provided by the invention implements, on an FPGA platform, the forward propagation process of the Policy network of the deep reinforcement learning algorithm AlphaGo, and has great advantages in power consumption, processing speed, memory bandwidth requirements and the like.

Description

Policy convolution neural network accelerator based on FPGA
Technical Field
The invention relates to the technical field of digital integrated circuits, in particular to an FPGA-based Policy convolutional neural network accelerator.
Background
In recent years, artificial intelligence has risen rapidly, driving a revolution in machine learning. Among the many branches of machine learning, deep learning algorithms have gradually attracted attention owing to their excellent performance on large data sets and have become one of the mainstream technologies. Deep learning is derived from, and is an extension of, the traditional artificial neural network. The artificial intelligence AlphaGo defeated the world's top Go player Lee Sedol; the deep learning model it adopts comprises a policy network (also called the Policy network, mainly composed of a convolutional neural network), a value network and Monte Carlo tree search. As the problems solved by deep learning become more complex and abstract, large-scale data sets are required for training and testing, which makes the accuracy, computing power, computing efficiency and storage of a model undoubtedly important. At present, large-scale artificial intelligence computation relies mainly on central processing units (CPUs) and graphics processing units (GPUs), sometimes accelerated by processor cluster systems. To expand the range of applicable devices, the convolutional neural network algorithm needs to be implemented on devices with smaller size, lower power consumption and higher processing speed.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the present invention is to provide an FPGA-based Policy convolutional neural network accelerator to accelerate the operation of a convolutional neural network.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: an FPGA-based Policy convolutional neural network accelerator comprises an input buffer module FIFO, a convolution module, a scaling module and a Softmax module; the hardware platform on which it runs is the Xilinx VCU118 FPGA development kit; the input buffer module FIFO buffers against the mismatch between the input rate of the external feature map and the computation rate of the modules inside the FPGA, and feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module applies a corresponding offset to the values of the 361 (19×19) points output by the convolution module: it converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module performs exponential calculation on the data stream output by the scaling module one value at a time, all the exponential results are input into an accumulator, and the sum of all the values is calculated and fed into a floating-point division IP core provided by Xilinx as the divisor; the accumulator is composed of a floating-point addition IP core provided by Xilinx; the floating-point division IP core takes the exponential result of each value as the dividend and outputs the division results; in the division operation, an FIFO Generator IP core stores the data in sequence so that they are calculated in order; the Softmax module thus ensures that the order of the probability values computed for each point remains unchanged.
Preferably, the input buffer module FIFO is a synchronous-clock FIFO built from FPGA on-chip block RAM, with a depth of 4096, working in first-word-fall-through mode; the input buffer module FIFO comprises 3 input ports and 5 output ports; the 3 input ports are: the externally provided 4-bit input data din, the write enable wr_en provided by a finite state machine, and the read enable rd_en provided by the convolution module; the 5 output ports are: the 4-bit output data dout provided to the convolution module, the write-full indication full provided to the convolution module, the read-empty indication empty provided to the convolution module, the write reset status indication wr_rst_busy provided to the convolution module, and the read reset status indication rd_rst_busy provided to the convolution module.
Preferably, the convolution module comprises an input convolution module, an intermediate convolution module and an output convolution module; the input convolution module completes the calculation of the first convolutional layer of the Policy convolutional neural network and of the ReLU activation function; the intermediate convolution module completes the operations of the second to twelfth layers of the Policy convolutional neural network and of the ReLU activation function, and the output convolution module completes the operation of the thirteenth layer of the Policy convolutional neural network;
the input convolution module comprises a weight serial-parallel conversion module and a parallel multiply-add operation module; the feature map data and the weight data that have passed through the weight serial-parallel conversion module enter the parallel multiply-add operation module together for convolution operation, and after the operation result is summed with the bias input, the feature map data provided to the intermediate convolution module are generated;
the input convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: i_load_bias, i_data_bias, i_load_weight, i_data_weight, i_input_vld, i_input_dat and i_output_rdy; where i_load_bias is the bias data load enable provided by the finite state machine, i_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, i_load_weight is the weight data load enable provided by the finite state machine, i_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, i_input_vld is the feature map data valid signal provided by the input buffer module FIFO, i_input_dat is the 4-bit feature map data provided by the input buffer module FIFO, and i_output_rdy is the output enable signal provided by the intermediate convolution module; the 3 output signals are i_output_vld, i_output_dat and i_input_rdy; where i_output_vld is the feature map valid signal provided to the intermediate convolution module, i_output_dat is the 10-bit feature map data provided to the intermediate convolution module, and i_input_rdy is this module's input enable signal provided to the input buffer module FIFO;
the intermediate convolution module comprises a data memory, a data serial-parallel conversion module, an intermediate weight serial-parallel conversion module, a weight memory, a bias memory, an intermediate parallel multiply-add operation module, a finite state machine module and a bit width adjustment module; the feature map data output by the input convolution module are sent to the data memory and the data serial-parallel conversion module for block division and serial-parallel conversion; the weight data are input into the weight serial-parallel conversion module, converted into parallel data and stored in the weight memory; the converted feature map data and weight data are sent to the intermediate parallel multiply-add operation module for convolution operation, the bit width is adjusted after the operation result is summed with the bias data, and the result is output to the output convolution module in accordance with the state of the finite state machine; the data serial-parallel conversion module and the weight serial-parallel conversion module are designed to exploit the abundant programmable logic and storage resources inside the FPGA, splitting the matrix data for storage;
the intermediate convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: m_load_bias, m_data_bias, m_load_weight, m_data_weight, m_input_vld, m_input_dat and m_output_rdy; where m_load_bias is the bias data load enable provided by the finite state machine, m_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, m_load_weight is the weight data load enable provided by the finite state machine, m_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, m_input_vld is the feature map data valid signal provided by the input convolution module, m_input_dat is the 10-bit feature map data provided by the input convolution module, and m_output_rdy is the output enable signal provided by the output convolution module; the 3 output signals are m_output_vld, m_output_dat and m_input_rdy; where m_output_vld is the feature map valid signal provided to the output convolution module, m_output_dat is the 80-bit feature map data provided to the output convolution module, and m_input_rdy is this module's input enable signal provided to the input convolution module;
the output convolution module feeds the feature map data output by the intermediate convolution module into the output data serial-parallel conversion module instantiated inside it, feeds the weight data into the output weight serial-parallel conversion module, and feeds the bias signal into the bias signal memory; the converted feature map input signal and weight signal enter the output parallel multiply-add operation module for convolution operation, are summed with the bias signal, then pass through the bit width adjustment module, which outputs the feature map data provided to the scaling module;
the output convolution module comprises 6 input signals and 3 output signals; the 6 input signals are: o_load_bias, o_data_bias, o_load_weight, o_data_weight, o_input_vld and o_input_dat; where o_load_bias is the bias data load enable provided by the finite state machine, o_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, o_load_weight is the weight data load enable provided by the finite state machine, o_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, o_input_vld is the feature map data valid signal provided by the preceding convolution module, and o_input_dat is the 80-bit feature map data provided by the preceding convolution module; the 3 output signals are o_output_vld, o_output_dat and o_input_rdy; where o_output_vld is the feature map valid signal provided to the scaling module, o_output_dat is the 10-bit feature map data provided to the scaling module, and o_input_rdy is this module's input enable signal provided to the intermediate convolution module.
Preferably, the scaling module comprises 4 input signals and 2 output signals; the 4 input signals are: sc_load_bias, sc_data_bias, sc_input_vld and sc_input_dat; where sc_load_bias is the bias data load enable provided by the finite state machine, sc_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, sc_input_vld is the feature map data valid signal provided by the preceding convolution module, and sc_input_dat is the 10-bit feature map data provided by the preceding convolution module; the 2 output signals are sc_output_vld and sc_output_dat; sc_output_vld is the feature map valid signal provided to the Softmax module, and sc_output_dat is the feature map data in 32-bit single-precision floating-point format provided to the Softmax module.
Preferably, the Softmax module comprises 2 input signals and 2 output signals; the 2 input signals are: so_input_vld and so_input_dat; where so_input_vld is the feature map data valid signal provided by the scaling module, and so_input_dat is the feature map data in 32-bit single-precision floating-point format provided by the scaling module; the 2 output signals are so_output_vld and so_output_dat; so_output_vld is the feature map valid signal provided externally, and so_output_dat is the feature map data in 32-bit single-precision floating-point format provided externally.
The beneficial effects of the above technical scheme are as follows: the FPGA-based Policy convolutional neural network accelerator provided by the invention implements, on an FPGA platform, the forward propagation process of the Policy network in the deep reinforcement learning algorithm AlphaGo, and compared with a CPU or GPU it has greater advantages in power consumption, processing speed, memory bandwidth requirements and the like. The large number of DSP computing units provided by the FPGA is highly beneficial to the operations involved in a convolutional neural network, and the complex convolution operations can be refined and modularized. The Policy network model is mapped onto the FPGA platform and accelerated in parallel through pipelined processing, which effectively improves the processing speed of the forward propagation process of the Policy convolutional neural network in the AlphaGo algorithm. Moreover, FPGAs offer short development cycles, flexible design, strong configurability and obvious development potential, so constructing the Policy convolutional neural network accelerator on an FPGA platform has high research significance and practical value.
Drawings
FIG. 1 is a system architecture block diagram of an FPGA-based Policy convolutional neural network accelerator provided by an embodiment of the present invention;
FIG. 2 is a block diagram of a scaling module according to an embodiment of the present invention;
FIG. 3 is a design structure diagram of a Softmax module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the hardware structure of the FPGA-based Policy convolutional neural network accelerator according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the operation of the convolutional layer in the input convolution module according to an embodiment of the present invention;
FIG. 6 is a block diagram of an input convolution module according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the hardware structure of an input convolution module according to an embodiment of the present invention;
FIG. 8 is a block diagram of the design of an intermediate convolution module according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the hardware structure of an intermediate convolution module according to an embodiment of the present invention;
FIG. 10 is a block diagram of an output convolution module according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of the hardware structure of an output convolution module according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
In this embodiment, taking a Policy network as an example, the convolutional neural network in the Policy network is accelerated by using the FPGA-based Policy convolutional neural network accelerator of the present invention.
The convolutional neural network in the Policy network adopted in this embodiment includes thirteen convolutional layers, twelve ReLU functions, one Scale layer and one Softmax layer, which makes the network considerably deep. The structure is derived from the LeNet-5 network model, but contains none of the pooling layers widely used in convolutional neural network models. This is because the Policy network model is used in AlphaGo to predict the probability of each next move position on a 19×19 board; every point in the 19×19 feature map represents part of the board situation and therefore carries real meaning. Consequently, no pooling layer can be adopted anywhere in the model, and the data cannot be subsampled or averaged to reduce its size, which clearly distinguishes this model from other convolutional neural network models.
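To keep the layer dimensions described above in one place, the following C sketch tabulates the thirteen-layer configuration as stated in this description (a 19×19 input; 4 channels into a 5×5 first layer; 192-channel 3×3 middle layers; a single-channel final layer with no trailing ReLU). The struct and its field names are illustrative assumptions, not part of the patent.
```c
/* Hypothetical per-layer descriptor for the thirteen-layer Policy network
 * of this embodiment. Every feature map stays 19x19; only the channel
 * counts and kernel sizes change. Names are illustrative assumptions. */
typedef struct {
    int in_ch;   /* input channels                                  */
    int out_ch;  /* output channels = number of convolution kernels */
    int ksize;   /* kernels are ksize x ksize                       */
    int relu;    /* 1 if the layer is followed by a ReLU activation */
} conv_layer_t;

static const conv_layer_t policy_net[13] = {
    {  4, 192, 5, 1},   /* layer 1: 19x19x4 input, 192 kernels of 5x5       */
    {192, 192, 3, 1},   /* layers 2-12: 192 kernels of 3x3, ReLU after each */
    {192, 192, 3, 1}, {192, 192, 3, 1}, {192, 192, 3, 1},
    {192, 192, 3, 1}, {192, 192, 3, 1}, {192, 192, 3, 1},
    {192, 192, 3, 1}, {192, 192, 3, 1}, {192, 192, 3, 1},
    {192, 192, 3, 1},
    {192,   1, 3, 0},   /* layer 13: one output channel, no ReLU; the 3x3
                           size follows the figure given for the non-input
                           convolution modules in this description */
};
```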
The FPGA-based Policy convolutional neural network accelerator is shown in FIG. 1 and comprises an input buffer module FIFO, a convolution module, a scaling module and a Softmax module; the hardware platform on which it runs is the Xilinx VCU118 FPGA development kit; the input buffer module FIFO buffers against the mismatch between the input rate of the external feature map and the computation rate of the modules inside the FPGA, and feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module applies a corresponding offset to the values of the 361 (19×19) points output by the convolution module, as shown in FIG. 2: it converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module, as shown in FIG. 3, performs exponential calculation on the data stream output by the scaling module one value at a time; all the exponential results are input into an accumulator, and the sum of all the values is calculated and fed into a floating-point division IP core provided by Xilinx as the divisor; the accumulator is composed of a floating-point addition IP core provided by Xilinx; the floating-point division IP core takes the exponential result of each value as the dividend and outputs the division results; in the division operation, an FIFO Generator IP core stores the data in sequence so that they are calculated in order; the Softmax module thus ensures that the order of the probability values computed for each point remains unchanged.
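To make the scaling and Softmax dataflow concrete, the following C sketch gives a behavioral reference model of the two stages (it is not the RTL): fixed-point feature values are converted to floating point and a bias is added (the offset process), each value is exponentiated, the exponentials are accumulated into the divisor, and each exponential is then divided by the sum in its original arrival order, mirroring the ordering that the FIFO Generator core preserves in hardware. The function names, the per-point bias array and the fixed-point fraction width are illustrative assumptions not stated in the patent.
```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define NPOINTS     361   /* 19 x 19 board points                      */
#define Q_FRAC_BITS 6     /* assumed fixed-point fraction width (not
                             specified in this description)            */

/* Scaling stage: fixed-point -> float conversion, then bias addition.
 * The 10-bit fixed-point inputs are carried here in int16_t.          */
static void scale_stage(const int16_t *fixed_in, const float *bias,
                        float *float_out)
{
    for (size_t i = 0; i < NPOINTS; i++)
        float_out[i] = (float)fixed_in[i] / (float)(1 << Q_FRAC_BITS) + bias[i];
}

/* Softmax stage: exponentiate each value, accumulate the sum (divisor),
 * then divide each exponential (dividend) by the sum in original order,
 * as the hardware does with its FIFO-buffered exponential results.    */
static void softmax_stage(const float *in, float *prob_out)
{
    float exp_fifo[NPOINTS];   /* models the FIFO holding exponentials */
    float sum = 0.0f;          /* accumulator: floating-point adder IP */
    for (size_t i = 0; i < NPOINTS; i++) {
        exp_fifo[i] = expf(in[i]);
        sum += exp_fifo[i];
    }
    for (size_t i = 0; i < NPOINTS; i++)
        prob_out[i] = exp_fifo[i] / sum;   /* floating-point divider IP */
}
```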
The input buffer module FIFO is a synchronous-clock FIFO built from FPGA on-chip block RAM, with a depth of 4096, working in first-word-fall-through mode; its hardware structure is shown in FIG. 4 and comprises 3 input ports and 5 output ports; the 3 input ports are: the externally provided 4-bit input data din, the write enable wr_en provided by a finite state machine, and the read enable rd_en provided by the convolution module; the 5 output ports are: the 4-bit output data dout provided to the convolution module, the write-full indication full provided to the convolution module, the read-empty indication empty provided to the convolution module, the write reset status indication wr_rst_busy provided to the convolution module, and the read reset status indication rd_rst_busy provided to the convolution module.
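For intuition about the first-word-fall-through behavior, the following C sketch models a depth-4096 FIFO in which the oldest word is already visible on dout before rd_en is asserted, so a read-enable pulse merely advances to the next word. This is a software model only, not the block-RAM implementation, and everything beyond the port names above is an assumption.
```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 4096

/* Software model of a first-word-fall-through FIFO: unlike a standard
 * FIFO, the head word appears on dout as soon as it is written, without
 * waiting for a read-enable pulse. */
typedef struct {
    uint8_t mem[FIFO_DEPTH];   /* models the on-chip block RAM (4-bit data) */
    unsigned rd, wr, count;
} fwft_fifo_t;

static bool fifo_full(const fwft_fifo_t *f)  { return f->count == FIFO_DEPTH; }
static bool fifo_empty(const fwft_fifo_t *f) { return f->count == 0; }

static bool fifo_write(fwft_fifo_t *f, uint8_t din)   /* din: 4-bit data */
{
    if (fifo_full(f)) return false;
    f->mem[f->wr] = din & 0x0F;
    f->wr = (f->wr + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* In FWFT mode dout is valid whenever the FIFO is not empty. */
static bool fifo_dout(const fwft_fifo_t *f, uint8_t *dout)
{
    if (fifo_empty(f)) return false;
    *dout = f->mem[f->rd];
    return true;
}

/* rd_en just advances the read pointer to the next word. */
static void fifo_rd_en(fwft_fifo_t *f)
{
    if (!fifo_empty(f)) { f->rd = (f->rd + 1) % FIFO_DEPTH; f->count--; }
}
```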
The convolutional layers account for a large proportion of the Policy convolutional neural network model and are its core network layers, so the convolution module is likewise the core module of the FPGA-based Policy convolutional neural network accelerator. Among the thirteen convolutional layers in the Policy network model, the first convolutional layer, the second to twelfth convolutional layers, and the thirteenth convolutional layer differ in the size of the input feature map, the number of channels, and the size and number of the required convolution kernels. The first convolutional layer forms one category: its input feature map is 19×19 with 4 channels. The input feature maps of the second to twelfth convolutional layers keep the same size, with 192 channels. The thirteenth layer (the third category) has the same input as the second layer, but its number of output data channels is reduced to 1.
The convolution module contains thirteen convolutional layers and twelve ReLU functions. Except for the first layer, whose input is known data, the input of each layer is the output computed by the previous convolutional layer, so this part cannot be accelerated in parallel across layers. However, the operation of each layer is independent, and the operation of each feature map corresponding to each convolution kernel is also independent, so a parallel processing mode can be adopted within each convolutional layer for acceleration. In the convolution module, the complex operation of each convolutional layer is divided into independent basic operation units, and a pipeline structure is adopted to accelerate the computation as much as possible. In addition, the connections between convolutional layers place high demands on timing coordination, so that the operation of the convolutional neural network is completed correctly in the overall system of the FPGA-based Policy convolutional neural network accelerator. According to the differences in their data, the convolution module is further divided into three sub-modules, namely the input convolution module, the intermediate convolution module and the output convolution module; the input convolution module completes the calculation of the first convolutional layer of the Policy convolutional neural network and of the ReLU activation function; the intermediate convolution module completes the operations of the second to twelfth layers of the Policy convolutional neural network and of the ReLU activation function, and the output convolution module completes the operation of the thirteenth layer of the Policy convolutional neural network.
In this embodiment, the convolutional layer operation process of the input convolution module is taken as an example for analysis, giving the specific calculation flow of the convolutional layer shown in FIG. 5. First, the enables of the input buffer module FIFO are initialized by the finite state machine. The 19×19 input matrix of the first convolutional layer is edge-extended into a 23×23 matrix. The feature map data are input and processed in parallel, and the weight parameters and the bias parameter matrix are input. The corresponding feature map and weight data are multiplied, the products are summed, and the bias value is added; this convolution operation is the core calculation. The first convolutional layer takes 192 convolution kernels, each a 5×5 matrix. The convolutional layer operation process of the other convolution modules is similar, also with 192 input convolution kernels, but each kernel is a 3×3 matrix. The key point of the ReLU function is to judge whether the input value is positive or negative: data greater than 0 are output directly, and data less than or equal to 0 are output as 0. The ReLU portion therefore compares the input data with the value 0 using a comparator. In a specific implementation, the calculation of the ReLU function can be combined with that of the convolutional layer, as in the sketch below.
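This C sketch is a behavioral model, under stated assumptions, of one first-layer convolution kernel with the ReLU fused at the end, the per-kernel unit that the hardware replicates 192 times in parallel. Zero values are assumed for the edge extension (the description says the 19×19 matrix becomes 23×23 but not with what values), and all names are illustrative.
```c
#include <stdint.h>

#define BOARD 19   /* feature maps are 19 x 19              */
#define PAD    2   /* edge extension: 19 + 2*2 = 23         */
#define IN_CH  4   /* the first layer has 4 input channels  */
#define K      5   /* first-layer kernels are 5 x 5         */

/* Behavioral model of one first-layer kernel with a fused ReLU. The 4-bit
 * inputs and 9-bit weights are carried in wider C types; the hardware runs
 * 192 such units in parallel and trims the result to its 10-bit format. */
static void conv5x5_relu(const uint8_t in[IN_CH][BOARD][BOARD],
                         const int16_t weight[IN_CH][K][K],
                         int32_t bias,
                         int32_t out[BOARD][BOARD])
{
    for (int y = 0; y < BOARD; y++) {
        for (int x = 0; x < BOARD; x++) {
            int32_t acc = bias;                      /* add the bias value */
            for (int c = 0; c < IN_CH; c++)
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++) {
                        int iy = y + ky - PAD;       /* padded coordinates */
                        int ix = x + kx - PAD;
                        if (iy < 0 || iy >= BOARD || ix < 0 || ix >= BOARD)
                            continue;                /* zero-padded edge   */
                        acc += (int32_t)in[c][iy][ix] * weight[c][ky][kx];
                    }
            out[y][x] = acc > 0 ? acc : 0;           /* ReLU: compare to 0 */
        }
    }
}
```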
The input convolution module comprises a weight serial-parallel conversion module and a parallel multiply-add operation module, as shown in FIG. 6; the feature map data and the weight data that have passed through the weight serial-parallel conversion module enter the parallel multiply-add operation module together for convolution operation, and after the operation result is summed with the bias input, the feature map data provided to the intermediate convolution module are generated.
The hardware structure of the input convolution module is shown in FIG. 7 and comprises 7 input signals and 3 output signals; the 7 input signals are: i_load_bias, i_data_bias, i_load_weight, i_data_weight, i_input_vld, i_input_dat and i_output_rdy; where i_load_bias is the bias data load enable provided by the finite state machine, i_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, i_load_weight is the weight data load enable provided by the finite state machine, i_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, and i_input_vld is the feature map data valid signal provided by the input buffer module FIFO, which is high when the input buffer module FIFO is not empty and its read end is not in the reset state, with the hardware connection as shown in FIG. 4; i_input_dat is the 4-bit feature map data provided by the input buffer module FIFO, and i_output_rdy is the output enable signal provided by the intermediate convolution module; the 3 output signals are i_output_vld, i_output_dat and i_input_rdy; where i_output_vld is the feature map valid signal provided to the intermediate convolution module, i_output_dat is the 10-bit feature map data provided to the intermediate convolution module, and i_input_rdy is this module's input enable signal provided to the input buffer module FIFO.
As shown in FIG. 8, the intermediate convolution module comprises a data memory, a data serial-parallel conversion module, an intermediate weight serial-parallel conversion module, a weight memory, a bias memory, an intermediate parallel multiply-add operation module, a finite state machine module and a bit width adjustment module; the feature map data output by the input convolution module are sent to the data memory and the data serial-parallel conversion module for block division and serial-parallel conversion; the weight data are input into the weight serial-parallel conversion module, converted into parallel data and stored in the weight memory; the converted feature map data and weight data are sent to the intermediate parallel multiply-add operation module for convolution operation, the bit width is adjusted after the operation result is summed with the bias data, and the result is output to the output convolution module in accordance with the state of the finite state machine; the data serial-parallel conversion module and the weight serial-parallel conversion module are designed to exploit the abundant programmable logic and storage resources inside the FPGA, splitting the matrix data for storage, as pictured in the sketch below.
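The serial-parallel conversion can be pictured with the following C sketch: a serial stream of weight words is demultiplexed into one memory bank per kernel so that all 192 kernels can feed the parallel multiply-add array at once. The kernel-major stream order, the names and the use of int16_t to carry the 9-bit weights are illustrative assumptions; the actual streaming order is not stated in this description.
```c
#include <stdint.h>

#define N_KERNELS 192
#define K3        3
#define MID_CH    192
#define WORDS_PER_KERNEL (MID_CH * K3 * K3)   /* 3x3 taps per input channel */

/* Illustrative serial-to-parallel weight loader: the weight memory holds one
 * bank per kernel so all 192 kernels can be multiplied in parallel. Assumes
 * the serial stream is kernel-major. */
static int16_t weight_mem[N_KERNELS][WORDS_PER_KERNEL];

static void load_weights_serial(const int16_t *serial_stream, long nwords)
{
    for (long i = 0; i < nwords && i < (long)N_KERNELS * WORDS_PER_KERNEL; i++) {
        int bank = (int)(i / WORDS_PER_KERNEL);   /* which kernel's memory  */
        int off  = (int)(i % WORDS_PER_KERNEL);   /* position inside kernel */
        weight_mem[bank][off] = serial_stream[i];
    }
}
```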
The hardware structure of the intermediate convolution module is shown in FIG. 9 and comprises 7 input signals and 3 output signals; the 7 input signals are: m_load_bias, m_data_bias, m_load_weight, m_data_weight, m_input_vld, m_input_dat and m_output_rdy; where m_load_bias is the bias data load enable provided by the finite state machine, m_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, m_load_weight is the weight data load enable provided by the finite state machine, m_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, m_input_vld is the feature map data valid signal provided by the input convolution module, m_input_dat is the 10-bit feature map data provided by the input convolution module, and m_output_rdy is the output enable signal provided by the output convolution module; the 3 output signals are m_output_vld, m_output_dat and m_input_rdy; where m_output_vld is the feature map valid signal provided to the output convolution module, m_output_dat is the 80-bit feature map data provided to the output convolution module, and m_input_rdy is this module's input enable signal provided to the input convolution module.
As shown in FIG. 10, the output convolution module feeds the feature map data output by the intermediate convolution module into the output data serial-parallel conversion module instantiated inside it, feeds the weight data into the output weight serial-parallel conversion module, and feeds the bias signal into the bias signal memory; the converted feature map input signal and weight signal enter the output parallel multiply-add operation module for convolution operation, are summed with the bias signal, then pass through the bit width adjustment module, which outputs the feature map data provided to the scaling module.
The hardware structure of the output convolution module is shown in FIG. 11 and comprises 6 input signals and 3 output signals; the 6 input signals are: o_load_bias, o_data_bias, o_load_weight, o_data_weight, o_input_vld and o_input_dat; where o_load_bias is the bias data load enable provided by the finite state machine, o_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, o_load_weight is the weight data load enable provided by the finite state machine, o_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, o_input_vld is the feature map data valid signal provided by the preceding convolution module, and o_input_dat is the 80-bit feature map data provided by the preceding convolution module; the 3 output signals are o_output_vld, o_output_dat and o_input_rdy; where o_output_vld is the feature map valid signal provided to the scaling module, o_output_dat is the 10-bit feature map data provided to the scaling module, and o_input_rdy is this module's input enable signal provided to the intermediate convolution module.
The hardware structure of the scaling module is shown in FIG. 4 and comprises 4 input signals and 2 output signals; the 4 input signals are: sc_load_bias, sc_data_bias, sc_input_vld and sc_input_dat; where sc_load_bias is the bias data load enable provided by the finite state machine, sc_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, sc_input_vld is the feature map data valid signal provided by the preceding convolution module, and sc_input_dat is the 10-bit feature map data provided by the preceding convolution module; the 2 output signals are sc_output_vld and sc_output_dat; sc_output_vld is the feature map valid signal provided to the Softmax module, and sc_output_dat is the feature map data in 32-bit single-precision floating-point format provided to the Softmax module.
The hardware structure of the Softmax module is shown in FIG. 4 and comprises 2 input signals and 2 output signals; the 2 input signals are: so_input_vld and so_input_dat; where so_input_vld is the feature map data valid signal provided by the scaling module, and so_input_dat is the feature map data in 32-bit single-precision floating-point format provided by the scaling module; the 2 output signals are so_output_vld and so_output_dat; so_output_vld is the feature map valid signal provided externally, and so_output_dat is the feature map data in 32-bit single-precision floating-point format provided externally.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, without departing from the spirit and scope of the corresponding technical solutions of the present invention as defined by the appended claims.

Claims (4)

1. An FPGA-based Policy convolutional neural network accelerator, characterized in that: it comprises an input buffer module FIFO, a convolution module, a scaling module and a Softmax module; the hardware platform on which it runs is the Xilinx VCU118 FPGA development kit; the input buffer module FIFO buffers against the mismatch between the input rate of the external feature map and the computation rate of the modules inside the FPGA, and feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module applies a corresponding offset to the values of the 361 (19×19) points output by the convolution module: it converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module performs exponential calculation on the data stream output by the scaling module one value at a time, all the exponential results are input into an accumulator, and the sum of all the values is calculated and fed into a floating-point division IP core provided by Xilinx as the divisor; the accumulator is composed of a floating-point addition IP core provided by Xilinx; the floating-point division IP core takes the exponential result of each value as the dividend and outputs the division results; in the division operation, an FIFO Generator IP core stores the data in sequence so that they are calculated in order; the Softmax module thus ensures that the order of the probability values calculated for each point remains unchanged;
the convolution module comprises an input convolution module, an intermediate convolution module and an output convolution module; the input convolution module completes the calculation of the first convolutional layer of the Policy convolutional neural network and of the ReLU activation function; the intermediate convolution module completes the operations of the second to twelfth layers of the Policy convolutional neural network and of the ReLU activation function, and the output convolution module completes the operation of the thirteenth layer of the Policy convolutional neural network;
the input convolution module comprises a weight serial-parallel conversion module and a parallel multiply-add operation module; the feature map data and the weight data that have passed through the weight serial-parallel conversion module enter the parallel multiply-add operation module together for convolution operation, and after the operation result is summed with the bias input, the feature map data provided to the intermediate convolution module are generated;
the input convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: i_load_bias, i_data_bias, i_load_weight, i_data_weight, i_input_vld, i_input_dat and i_output_rdy; where i_load_bias is the bias data load enable provided by the finite state machine, i_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, i_load_weight is the weight data load enable provided by the finite state machine, i_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, i_input_vld is the feature map data valid signal provided by the input buffer module FIFO, i_input_dat is the 4-bit feature map data provided by the input buffer module FIFO, and i_output_rdy is the output enable signal provided by the intermediate convolution module; the 3 output signals are i_output_vld, i_output_dat and i_input_rdy; where i_output_vld is the feature map valid signal provided to the intermediate convolution module, i_output_dat is the 10-bit feature map data provided to the intermediate convolution module, and i_input_rdy is this module's input enable signal provided to the input buffer module FIFO;
the intermediate convolution module comprises a data memory, a data serial-parallel conversion module, an intermediate weight serial-parallel conversion module, a weight memory, a bias memory, an intermediate parallel multiply-add operation module, a finite state machine module and a bit width adjustment module; the feature map data output by the input convolution module are sent to the data memory and the data serial-parallel conversion module for block division and serial-parallel conversion; the weight data are input into the weight serial-parallel conversion module, converted into parallel data and stored in the weight memory; the converted feature map data and weight data are sent to the intermediate parallel multiply-add operation module for convolution operation, the bit width is adjusted after the operation result is summed with the bias data, and the result is output to the output convolution module in accordance with the state of the finite state machine; the data serial-parallel conversion module and the weight serial-parallel conversion module are designed to exploit the abundant programmable logic and storage resources inside the FPGA, splitting the matrix data for storage;
the intermediate convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: m_load_bias, m_data_bias, m_load_weight, m_data_weight, m_input_vld, m_input_dat and m_output_rdy; where m_load_bias is the bias data load enable provided by the finite state machine, m_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, m_load_weight is the weight data load enable provided by the finite state machine, m_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, m_input_vld is the feature map data valid signal provided by the input convolution module, m_input_dat is the 10-bit feature map data provided by the input convolution module, and m_output_rdy is the output enable signal provided by the output convolution module; the 3 output signals are m_output_vld, m_output_dat and m_input_rdy; where m_output_vld is the feature map valid signal provided to the output convolution module, m_output_dat is the 80-bit feature map data provided to the output convolution module, and m_input_rdy is this module's input enable signal provided to the input convolution module;
the output convolution module feeds the feature map data output by the intermediate convolution module into the output data serial-parallel conversion module instantiated inside it, feeds the weight data into the output weight serial-parallel conversion module, and feeds the bias signal into the bias signal memory; the converted feature map input signal and weight signal enter the output parallel multiply-add operation module for convolution operation, are summed with the bias signal, then pass through the bit width adjustment module, which outputs the feature map data provided to the scaling module;
the output convolution module comprises 6 input signals and 3 output signals; the 6 input signals are: o_load_bias, o_data_bias, o_load_weight, o_data_weight, o_input_vld and o_input_dat; where o_load_bias is the bias data load enable provided by the finite state machine, o_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, o_load_weight is the weight data load enable provided by the finite state machine, o_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, o_input_vld is the feature map data valid signal provided by the preceding convolution module, and o_input_dat is the 80-bit feature map data provided by the preceding convolution module; the 3 output signals are o_output_vld, o_output_dat and o_input_rdy; where o_output_vld is the feature map valid signal provided to the scaling module, o_output_dat is the 10-bit feature map data provided to the scaling module, and o_input_rdy is this module's input enable signal provided to the intermediate convolution module.
2. The FPGA-based Policy convolutional neural network accelerator of claim 1, wherein: the input buffer module FIFO is a synchronous-clock FIFO built from FPGA on-chip block RAM, with a depth of 4096, working in first-word-fall-through mode; the input buffer module FIFO comprises 3 input ports and 5 output ports; the 3 input ports are: the externally provided 4-bit input data din, the write enable wr_en provided by a finite state machine, and the read enable rd_en provided by the convolution module; the 5 output ports are: the 4-bit output data dout provided to the convolution module, the write-full indication full provided to the convolution module, the read-empty indication empty provided to the convolution module, the write reset status indication wr_rst_busy provided to the convolution module, and the read reset status indication rd_rst_busy provided to the convolution module.
3. The FPGA-based Policy convolutional neural network accelerator of claim 1, wherein: the scaling module comprises 4 input signals and 2 output signals; the 4 input signals are: sc_load_bias, sc_data_bias, sc_input_vld and sc_input_dat; where sc_load_bias is the bias data load enable provided by the finite state machine, sc_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, sc_input_vld is the feature map data valid signal provided by the preceding convolution module, and sc_input_dat is the 10-bit feature map data provided by the preceding convolution module; the 2 output signals are sc_output_vld and sc_output_dat; sc_output_vld is the feature map valid signal provided to the Softmax module, and sc_output_dat is the feature map data in 32-bit single-precision floating-point format provided to the Softmax module.
4. The FPGA-based Policy convolutional neural network accelerator of claim 1, wherein: the Softmax module comprises 2 input signals and 2 output signals; the 2 input signals are: so_input_vld and so_input_dat; where so_input_vld is the feature map data valid signal provided by the scaling module, and so_input_dat is the feature map data in 32-bit single-precision floating-point format provided by the scaling module; the 2 output signals are so_output_vld and so_output_dat; so_output_vld is the feature map valid signal provided externally, and so_output_dat is the feature map data in 32-bit single-precision floating-point format provided externally.
CN201811373344.1A 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA Active CN109146067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811373344.1A CN109146067B (en) 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811373344.1A CN109146067B (en) 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA

Publications (2)

Publication Number Publication Date
CN109146067A CN109146067A (en) 2019-01-04
CN109146067B true CN109146067B (en) 2021-11-05

Family

ID=64806153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811373344.1A Active CN109146067B (en) 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA

Country Status (1)

Country Link
CN (1) CN109146067B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188869B (en) * 2019-05-05 2021-08-10 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110929688A (en) * 2019-12-10 2020-03-27 齐齐哈尔大学 Construction method and acceleration method of rice weed recognition acceleration system
CN111047037A (en) * 2019-12-27 2020-04-21 北京市商汤科技开发有限公司 Data processing method, device, equipment and storage medium
CN111330255B (en) * 2020-01-16 2021-06-08 北京理工大学 Amazon chess-calling generation method based on deep convolutional neural network
CN112232499B (en) * 2020-10-13 2022-12-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Convolutional neural network accelerator
CN112541583A (en) * 2020-12-16 2021-03-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Neural network accelerator
CN112836793B (en) * 2021-01-18 2022-02-08 中国电子科技集团公司第十五研究所 Floating point separable convolution calculation accelerating device, system and image processing method
CN113392973B (en) * 2021-06-25 2023-01-13 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113609548B (en) * 2021-07-05 2023-10-24 中铁工程设计咨询集团有限公司 Bridge span distribution method, device, equipment and readable storage medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN108389183A (en) * 2018-01-24 2018-08-10 上海交通大学 Pulmonary nodule detects neural network accelerator and its control method
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
"A Survey of FPGA Based Neural Network Accelerator";KAIYUAN GUO et al.;《arXiv》;20180515;全文 *
"Dynamic scheduling Monte-Carlo framework for multi-accelerator hetergeneous clusters";Anson H.T.Tse et al.;《IEEE》;20110106;全文 *
"FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threased C Software";Jin Hee Kim et al.;《IEEE》;20171231;全文 *
"PipeCNN:An OpenCL-Based FPGA Accelerator for Large-Scale Convolutional Neuron Networks";Dong Wang et al.;《arXiv》;20161108;全文 *
"基于FPGA的大规模浮点矩阵乘加速器研究";沈俊忠;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180415(第04期);第I135-633页 *
"基于FPGA的高速网络流量采集系统设计";汪明;《中国优秀硕士学位论文全文数据库 信息科技辑》;20140515(第05期);全文 *
"深度学习中的卷积神经网络系统设计及硬件实现";王昆 等;《万方数据知识服务平台》;20180611;第44卷(第5期);第56-69页 *
"深度学习算法可重构加速器关键技术研究";刘志强;《中国优秀硕士学位论文全文数据库 信息科技辑》;20170315(第3期);全文 *

Also Published As

Publication number Publication date
CN109146067A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN109063825B (en) Convolutional neural network accelerator
CN102629189B (en) Water floating point multiply-accumulate method based on FPGA
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN113361695B (en) Convolutional neural network accelerator
CN103984560A (en) Embedded reconfigurable system based on large-scale coarseness and processing method thereof
Chen et al. A compact and configurable long short-term memory neural network hardware architecture
Yue et al. A 28nm 16.9-300TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture
CN113283587A (en) Winograd convolution operation acceleration method and acceleration module
CN115018062A (en) Convolutional neural network accelerator based on FPGA
WO2023070997A1 (en) Deep learning convolution acceleration method using bit-level sparsity, and processor
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
CN103279323A (en) Adder
CN102129419B (en) Based on the processor of fast fourier transform
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Tsai et al. An on-chip fully connected neural network training hardware accelerator based on brain float point and sparsity awareness
CN111882050A (en) FPGA-based design method for improving BCPNN speed
Wong et al. Low bitwidth CNN accelerator on FPGA using Winograd and block floating point arithmetic
CN113191494B (en) Efficient LSTM accelerator based on FPGA
CN115167815A (en) Multiplier-adder circuit, chip and electronic equipment
He et al. An LSTM acceleration engine for FPGAs based on caffe framework
Xia et al. Reconfigurable spatial-parallel stochastic computing for accelerating sparse convolutional neural networks
Nagarajan et al. Fixed point multi-bit approximate adder based convolutional neural network accelerator for digit classification inference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant