CN109146067B - Policy convolution neural network accelerator based on FPGA

Policy convolution neural network accelerator based on FPGA

Info

Publication number
CN109146067B
Authority
CN
China
Prior art keywords
data
module
input
output
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811373344.1A
Other languages
Chinese (zh)
Other versions
CN109146067A (en)
Inventor
李贞妮
高宇梁
王骄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201811373344.1A
Publication of CN109146067A
Application granted
Publication of CN109146067B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods

Abstract

The invention provides an FPGA (field programmable gate array)-based Policy convolutional neural network accelerator, and relates to the technical field of digital integrated circuits. The accelerator comprises an input buffer module, a convolution module, a scaling module and a Softmax module. The input buffer module feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module performs exponential calculation on the data stream output by the scaling module. The FPGA-based Policy convolutional neural network accelerator provided by the invention implements, on an FPGA platform, the forward propagation process of the Policy network of the deep reinforcement learning algorithm AlphaGo, and has great advantages in power consumption, processing speed, memory bandwidth requirements and the like.

Description

Policy convolution neural network accelerator based on FPGA
Technical Field
The invention relates to the technical field of digital integrated circuits, in particular to an FPGA-based Policy convolutional neural network accelerator.
Background
In recent years, artificial intelligence has risen rapidly, driving a revolution in machine learning. Among the many branches of machine learning, deep learning algorithms have gradually attracted attention owing to their excellent performance on large data sets and have become one of the mainstream technologies. Deep learning is derived from, and is an extension of, the traditional artificial neural network. The artificial intelligence AlphaGo defeated the world's top Go player Lee Sedol; the deep learning model it adopts comprises a policy network (also called the Policy network, mainly composed of a convolutional neural network), a value network and Monte Carlo tree search. As the problems solved by deep learning become more complex and abstract, large-scale data sets are required for training and testing, which makes the accuracy, computing power, computing efficiency and storage of a model undoubtedly important. At present, large-scale artificial intelligence computation relies mainly on central processing units (CPUs) and graphics processing units (GPUs), sometimes accelerated by processor cluster systems. To expand the range of applicable devices, the convolutional neural network algorithm needs to be implemented on devices with smaller size, lower power consumption and higher processing speed.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the present invention is to provide an FPGA-based Policy convolutional neural network accelerator to accelerate the operation of a convolutional neural network.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: an FPGA-based Policy convolutional neural network accelerator comprises an input buffer module FIFO, a convolution module, a scaling module and a Softmax module; the hardware platform on which it runs is the Xilinx VCU118 FPGA development kit; the input buffer module FIFO buffers against the mismatch between the input rate of the external feature map and the computation rate of the modules inside the FPGA, and feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module applies a corresponding offset to the values of the 361 (19×19) points output by the convolution module: it converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module performs exponential calculation on the data stream output by the scaling module one value at a time, all the exponential results are input into an accumulator, and the sum of all the values is calculated and fed into a floating-point division IP core provided by Xilinx as the divisor; the accumulator is composed of a floating-point addition IP core provided by Xilinx; the floating-point division IP core takes the exponential result of each value as the dividend and outputs the division results; in the division operation, an FIFO Generator IP core stores the data in sequence so that they are calculated in order; the Softmax module thus ensures that the order of the probability values computed for each point remains unchanged.
Preferably, the input buffer module FIFO is a synchronous-clock FIFO built from FPGA on-chip block RAM, with a depth of 4096, working in first-word-fall-through mode; the input buffer module FIFO comprises 3 input ports and 5 output ports; the 3 input ports are: the externally provided 4-bit input data din, the write enable wr_en provided by a finite state machine, and the read enable rd_en provided by the convolution module; the 5 output ports are: the 4-bit output data dout provided to the convolution module, the write-full indication full provided to the convolution module, the read-empty indication empty provided to the convolution module, the write reset status indication wr_rst_busy provided to the convolution module, and the read reset status indication rd_rst_busy provided to the convolution module.
Preferably, the convolution module comprises an input convolution module, an intermediate convolution module and an output convolution module; the input convolution module completes the calculation of the first convolutional layer of the Policy convolutional neural network and of the ReLU activation function; the intermediate convolution module completes the operations of the second to twelfth layers of the Policy convolutional neural network and of the ReLU activation function, and the output convolution module completes the operation of the thirteenth layer of the Policy convolutional neural network;
the input convolution module comprises a weight serial-parallel conversion module and a parallel multiply-add operation module; the feature map data and the weight data that have passed through the weight serial-parallel conversion module enter the parallel multiply-add operation module together for convolution operation, and after the operation result is summed with the bias input, the feature map data provided to the intermediate convolution module are generated;
the input convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: i_load_bias, i_data_bias, i_load_weight, i_data_weight, i_input_vld, i_input_dat and i_output_rdy; where i_load_bias is the bias data load enable provided by the finite state machine, i_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, i_load_weight is the weight data load enable provided by the finite state machine, i_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, i_input_vld is the feature map data valid signal provided by the input buffer module FIFO, i_input_dat is the 4-bit feature map data provided by the input buffer module FIFO, and i_output_rdy is the output enable signal provided by the intermediate convolution module; the 3 output signals are i_output_vld, i_output_dat and i_input_rdy; where i_output_vld is the feature map valid signal provided to the intermediate convolution module, i_output_dat is the 10-bit feature map data provided to the intermediate convolution module, and i_input_rdy is this module's input enable signal provided to the input buffer module FIFO;
the intermediate convolution module comprises a data memory, a data serial-parallel conversion module, an intermediate weight serial-parallel conversion module, a weight memory, a bias memory, an intermediate parallel multiply-add operation module, a finite state machine module and a bit width adjustment module; the feature map data output by the input convolution module are sent to the data memory and the data serial-parallel conversion module for block division and serial-parallel conversion; the weight data are input into the weight serial-parallel conversion module, converted into parallel data and stored in the weight memory; the converted feature map data and weight data are sent to the intermediate parallel multiply-add operation module for convolution operation, the bit width is adjusted after the operation result is summed with the bias data, and the result is output to the output convolution module in accordance with the state of the finite state machine; the data serial-parallel conversion module and the weight serial-parallel conversion module are designed to exploit the abundant programmable logic and storage resources inside the FPGA, splitting the matrix data for storage;
the intermediate convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: m_load_bias, m_data_bias, m_load_weight, m_data_weight, m_input_vld, m_input_dat and m_output_rdy; where m_load_bias is the bias data load enable provided by the finite state machine, m_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, m_load_weight is the weight data load enable provided by the finite state machine, m_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, m_input_vld is the feature map data valid signal provided by the input convolution module, m_input_dat is the 10-bit feature map data provided by the input convolution module, and m_output_rdy is the output enable signal provided by the output convolution module; the 3 output signals are m_output_vld, m_output_dat and m_input_rdy; where m_output_vld is the feature map valid signal provided to the output convolution module, m_output_dat is the 80-bit feature map data provided to the output convolution module, and m_input_rdy is this module's input enable signal provided to the input convolution module;
the output convolution module feeds the feature map data output by the intermediate convolution module into the output data serial-parallel conversion module instantiated inside it, feeds the weight data into the output weight serial-parallel conversion module, and feeds the bias signal into the bias signal memory; the converted feature map input signal and weight signal enter the output parallel multiply-add operation module for convolution operation, are summed with the bias signal, then pass through the bit width adjustment module, which outputs the feature map data provided to the scaling module;
the output convolution module comprises 6 input signals and 3 output signals; the 6 input signals are: o_load_bias, o_data_bias, o_load_weight, o_data_weight, o_input_vld and o_input_dat; where o_load_bias is the bias data load enable provided by the finite state machine, o_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, o_load_weight is the weight data load enable provided by the finite state machine, o_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, o_input_vld is the feature map data valid signal provided by the preceding convolution module, and o_input_dat is the 80-bit feature map data provided by the preceding convolution module; the 3 output signals are o_output_vld, o_output_dat and o_input_rdy; where o_output_vld is the feature map valid signal provided to the scaling module, o_output_dat is the 10-bit feature map data provided to the scaling module, and o_input_rdy is this module's input enable signal provided to the intermediate convolution module.
Preferably, the scaling module comprises 4 input signals and 2 output signals; the 4 input signals are: sc_load_bias, sc_data_bias, sc_input_vld and sc_input_dat; where sc_load_bias is the bias data load enable provided by the finite state machine, sc_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, sc_input_vld is the feature map data valid signal provided by the preceding convolution module, and sc_input_dat is the 10-bit feature map data provided by the preceding convolution module; the 2 output signals are sc_output_vld and sc_output_dat; sc_output_vld is the feature map valid signal provided to the Softmax module, and sc_output_dat is the feature map data in 32-bit single-precision floating-point format provided to the Softmax module.
Preferably, the Softmax module comprises 2 input signals and 2 output signals; the 2 input signals are: so_input_vld and so_input_dat; where so_input_vld is the feature map data valid signal provided by the scaling module, and so_input_dat is the feature map data in 32-bit single-precision floating-point format provided by the scaling module; the 2 output signals are so_output_vld and so_output_dat; so_output_vld is the feature map valid signal provided externally, and so_output_dat is the feature map data in 32-bit single-precision floating-point format provided externally.
The beneficial effects of the above technical scheme are as follows: the FPGA-based Policy convolutional neural network accelerator provided by the invention implements, on an FPGA platform, the forward propagation process of the Policy network in the deep reinforcement learning algorithm AlphaGo, and compared with a CPU or GPU it has greater advantages in power consumption, processing speed, memory bandwidth requirements and the like. The large number of DSP computing units provided by the FPGA is highly beneficial to the operations involved in a convolutional neural network, and the complex convolution operations can be refined and modularized. The Policy network model is mapped onto the FPGA platform and accelerated in parallel through pipelined processing, which effectively improves the processing speed of the forward propagation process of the Policy convolutional neural network in the AlphaGo algorithm. Moreover, FPGAs offer short development cycles, flexible design, strong configurability and obvious development potential, so constructing the Policy convolutional neural network accelerator on an FPGA platform has high research significance and practical value.
Drawings
FIG. 1 is a system architecture block diagram of an FPGA-based Policy convolutional neural network accelerator provided by an embodiment of the present invention;
FIG. 2 is a block diagram of a scaling module according to an embodiment of the present invention;
FIG. 3 is a design structure diagram of a Softmax module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the hardware structure of the FPGA-based Policy convolutional neural network accelerator according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the operation of the convolutional layer in the input convolution module according to an embodiment of the present invention;
FIG. 6 is a block diagram of an input convolution module according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the hardware structure of an input convolution module according to an embodiment of the present invention;
FIG. 8 is a block diagram of the design of an intermediate convolution module according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the hardware structure of an intermediate convolution module according to an embodiment of the present invention;
FIG. 10 is a block diagram of an output convolution module according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of the hardware structure of an output convolution module according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
In this embodiment, taking a Policy network as an example, the convolutional neural network in the Policy network is accelerated by using the FPGA-based Policy convolutional neural network accelerator of the present invention.
The convolutional neural network in the Policy network adopted in this embodiment includes thirteen convolutional layers, twelve ReLU functions, one Scale layer and one Softmax layer, which makes the network considerably deep. The structure is derived from the LeNet-5 network model, but contains none of the pooling layers widely used in convolutional neural network models. This is because the Policy network model is used in AlphaGo to predict the probability of each next move position on a 19×19 board; every point in the 19×19 feature map represents part of the board situation and therefore carries real meaning. Consequently, no pooling layer can be adopted anywhere in the model, and the data cannot be subsampled or averaged to reduce its size, which clearly distinguishes this model from other convolutional neural network models.
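To keep the layer dimensions described above in one place, the following C sketch tabulates the thirteen-layer configuration as stated in this description (a 19×19 input; 4 channels into a 5×5 first layer; 192-channel 3×3 middle layers; a single-channel final layer with no trailing ReLU). The struct and its field names are illustrative assumptions, not part of the patent.
```c
/* Hypothetical per-layer descriptor for the thirteen-layer Policy network
 * of this embodiment. Every feature map stays 19x19; only the channel
 * counts and kernel sizes change. Names are illustrative assumptions. */
typedef struct {
    int in_ch;   /* input channels                                  */
    int out_ch;  /* output channels = number of convolution kernels */
    int ksize;   /* kernels are ksize x ksize                       */
    int relu;    /* 1 if the layer is followed by a ReLU activation */
} conv_layer_t;

static const conv_layer_t policy_net[13] = {
    {  4, 192, 5, 1},   /* layer 1: 19x19x4 input, 192 kernels of 5x5       */
    {192, 192, 3, 1},   /* layers 2-12: 192 kernels of 3x3, ReLU after each */
    {192, 192, 3, 1}, {192, 192, 3, 1}, {192, 192, 3, 1},
    {192, 192, 3, 1}, {192, 192, 3, 1}, {192, 192, 3, 1},
    {192, 192, 3, 1}, {192, 192, 3, 1}, {192, 192, 3, 1},
    {192, 192, 3, 1},
    {192,   1, 3, 0},   /* layer 13: one output channel, no ReLU; the 3x3
                           size follows the figure given for the non-input
                           convolution modules in this description */
};
```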
The FPGA-based Policy convolutional neural network accelerator is shown in FIG. 1 and comprises an input buffer module FIFO, a convolution module, a scaling module and a Softmax module; the hardware platform on which it runs is the Xilinx VCU118 FPGA development kit; the input buffer module FIFO buffers against the mismatch between the input rate of the external feature map and the computation rate of the modules inside the FPGA, and feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module applies a corresponding offset to the values of the 361 (19×19) points output by the convolution module, as shown in FIG. 2: it converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module, as shown in FIG. 3, performs exponential calculation on the data stream output by the scaling module one value at a time; all the exponential results are input into an accumulator, and the sum of all the values is calculated and fed into a floating-point division IP core provided by Xilinx as the divisor; the accumulator is composed of a floating-point addition IP core provided by Xilinx; the floating-point division IP core takes the exponential result of each value as the dividend and outputs the division results; in the division operation, an FIFO Generator IP core stores the data in sequence so that they are calculated in order; the Softmax module thus ensures that the order of the probability values computed for each point remains unchanged.
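To make the scaling and Softmax dataflow concrete, the following C sketch gives a behavioral reference model of the two stages (it is not the RTL): fixed-point feature values are converted to floating point and a bias is added (the offset process), each value is exponentiated, the exponentials are accumulated into the divisor, and each exponential is then divided by the sum in its original arrival order, mirroring the ordering that the FIFO Generator core preserves in hardware. The function names, the per-point bias array and the fixed-point fraction width are illustrative assumptions not stated in the patent.
```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define NPOINTS     361   /* 19 x 19 board points                      */
#define Q_FRAC_BITS 6     /* assumed fixed-point fraction width (not
                             specified in this description)            */

/* Scaling stage: fixed-point -> float conversion, then bias addition.
 * The 10-bit fixed-point inputs are carried here in int16_t.          */
static void scale_stage(const int16_t *fixed_in, const float *bias,
                        float *float_out)
{
    for (size_t i = 0; i < NPOINTS; i++)
        float_out[i] = (float)fixed_in[i] / (float)(1 << Q_FRAC_BITS) + bias[i];
}

/* Softmax stage: exponentiate each value, accumulate the sum (divisor),
 * then divide each exponential (dividend) by the sum in original order,
 * as the hardware does with its FIFO-buffered exponential results.    */
static void softmax_stage(const float *in, float *prob_out)
{
    float exp_fifo[NPOINTS];   /* models the FIFO holding exponentials */
    float sum = 0.0f;          /* accumulator: floating-point adder IP */
    for (size_t i = 0; i < NPOINTS; i++) {
        exp_fifo[i] = expf(in[i]);
        sum += exp_fifo[i];
    }
    for (size_t i = 0; i < NPOINTS; i++)
        prob_out[i] = exp_fifo[i] / sum;   /* floating-point divider IP */
}
```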
The input buffer module FIFO is a synchronous-clock FIFO built from FPGA on-chip block RAM, with a depth of 4096, working in first-word-fall-through mode; its hardware structure is shown in FIG. 4 and comprises 3 input ports and 5 output ports; the 3 input ports are: the externally provided 4-bit input data din, the write enable wr_en provided by a finite state machine, and the read enable rd_en provided by the convolution module; the 5 output ports are: the 4-bit output data dout provided to the convolution module, the write-full indication full provided to the convolution module, the read-empty indication empty provided to the convolution module, the write reset status indication wr_rst_busy provided to the convolution module, and the read reset status indication rd_rst_busy provided to the convolution module.
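For intuition about the first-word-fall-through behavior, the following C sketch models a depth-4096 FIFO in which the oldest word is already visible on dout before rd_en is asserted, so a read-enable pulse merely advances to the next word. This is a software model only, not the block-RAM implementation, and everything beyond the port names above is an assumption.
```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 4096

/* Software model of a first-word-fall-through FIFO: unlike a standard
 * FIFO, the head word appears on dout as soon as it is written, without
 * waiting for a read-enable pulse. */
typedef struct {
    uint8_t mem[FIFO_DEPTH];   /* models the on-chip block RAM (4-bit data) */
    unsigned rd, wr, count;
} fwft_fifo_t;

static bool fifo_full(const fwft_fifo_t *f)  { return f->count == FIFO_DEPTH; }
static bool fifo_empty(const fwft_fifo_t *f) { return f->count == 0; }

static bool fifo_write(fwft_fifo_t *f, uint8_t din)   /* din: 4-bit data */
{
    if (fifo_full(f)) return false;
    f->mem[f->wr] = din & 0x0F;
    f->wr = (f->wr + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* In FWFT mode dout is valid whenever the FIFO is not empty. */
static bool fifo_dout(const fwft_fifo_t *f, uint8_t *dout)
{
    if (fifo_empty(f)) return false;
    *dout = f->mem[f->rd];
    return true;
}

/* rd_en just advances the read pointer to the next word. */
static void fifo_rd_en(fwft_fifo_t *f)
{
    if (!fifo_empty(f)) { f->rd = (f->rd + 1) % FIFO_DEPTH; f->count--; }
}
```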
The convolutional layers account for a large proportion of the Policy convolutional neural network model and are its core network layers, so the convolution module is likewise the core module of the FPGA-based Policy convolutional neural network accelerator. Among the thirteen convolutional layers in the Policy network model, the first convolutional layer, the second to twelfth convolutional layers, and the thirteenth convolutional layer differ in the size of the input feature map, the number of channels, and the size and number of the required convolution kernels. The first convolutional layer forms one category: its input feature map is 19×19 with 4 channels. The input feature maps of the second to twelfth convolutional layers keep the same size, with 192 channels. The thirteenth layer (the third category) has the same input as the second layer, but its number of output data channels is reduced to 1.
The convolution module contains thirteen convolutional layers and twelve ReLU functions. Except for the first layer, whose input is known data, the input of each layer is the output computed by the previous convolutional layer, so this part cannot be accelerated in parallel across layers. However, the operation of each layer is independent, and the operation of each feature map corresponding to each convolution kernel is also independent, so a parallel processing mode can be adopted within each convolutional layer for acceleration. In the convolution module, the complex operation of each convolutional layer is divided into independent basic operation units, and a pipeline structure is adopted to accelerate the computation as much as possible. In addition, the connections between convolutional layers place high demands on timing coordination, so that the operation of the convolutional neural network is completed correctly in the overall system of the FPGA-based Policy convolutional neural network accelerator. According to the differences in their data, the convolution module is further divided into three sub-modules, namely the input convolution module, the intermediate convolution module and the output convolution module; the input convolution module completes the calculation of the first convolutional layer of the Policy convolutional neural network and of the ReLU activation function; the intermediate convolution module completes the operations of the second to twelfth layers of the Policy convolutional neural network and of the ReLU activation function, and the output convolution module completes the operation of the thirteenth layer of the Policy convolutional neural network.
In this embodiment, the convolutional layer operation process of the input convolution module is taken as an example for analysis, giving the specific calculation flow of the convolutional layer shown in FIG. 5. First, the enables of the input buffer module FIFO are initialized by the finite state machine. The 19×19 input matrix of the first convolutional layer is edge-extended into a 23×23 matrix. The feature map data are input and processed in parallel, and the weight parameters and the bias parameter matrix are input. The corresponding feature map and weight data are multiplied, the products are summed, and the bias value is added; this convolution operation is the core calculation. The first convolutional layer takes 192 convolution kernels, each a 5×5 matrix. The convolutional layer operation process of the other convolution modules is similar, also with 192 input convolution kernels, but each kernel is a 3×3 matrix. The key point of the ReLU function is to judge whether the input value is positive or negative: data greater than 0 are output directly, and data less than or equal to 0 are output as 0. The ReLU portion therefore compares the input data with the value 0 using a comparator. In a specific implementation, the calculation of the ReLU function can be combined with that of the convolutional layer, as in the sketch below.
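This C sketch is a behavioral model, under stated assumptions, of one first-layer convolution kernel with the ReLU fused at the end, the per-kernel unit that the hardware replicates 192 times in parallel. Zero values are assumed for the edge extension (the description says the 19×19 matrix becomes 23×23 but not with what values), and all names are illustrative.
```c
#include <stdint.h>

#define BOARD 19   /* feature maps are 19 x 19              */
#define PAD    2   /* edge extension: 19 + 2*2 = 23         */
#define IN_CH  4   /* the first layer has 4 input channels  */
#define K      5   /* first-layer kernels are 5 x 5         */

/* Behavioral model of one first-layer kernel with a fused ReLU. The 4-bit
 * inputs and 9-bit weights are carried in wider C types; the hardware runs
 * 192 such units in parallel and trims the result to its 10-bit format. */
static void conv5x5_relu(const uint8_t in[IN_CH][BOARD][BOARD],
                         const int16_t weight[IN_CH][K][K],
                         int32_t bias,
                         int32_t out[BOARD][BOARD])
{
    for (int y = 0; y < BOARD; y++) {
        for (int x = 0; x < BOARD; x++) {
            int32_t acc = bias;                      /* add the bias value */
            for (int c = 0; c < IN_CH; c++)
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++) {
                        int iy = y + ky - PAD;       /* padded coordinates */
                        int ix = x + kx - PAD;
                        if (iy < 0 || iy >= BOARD || ix < 0 || ix >= BOARD)
                            continue;                /* zero-padded edge   */
                        acc += (int32_t)in[c][iy][ix] * weight[c][ky][kx];
                    }
            out[y][x] = acc > 0 ? acc : 0;           /* ReLU: compare to 0 */
        }
    }
}
```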
The input convolution module comprises a weight serial-parallel conversion module and a parallel multiply-add operation module, as shown in FIG. 6; the feature map data and the weight data that have passed through the weight serial-parallel conversion module enter the parallel multiply-add operation module together for convolution operation, and after the operation result is summed with the bias input, the feature map data provided to the intermediate convolution module are generated.
The hardware structure of the input convolution module is shown in FIG. 7 and comprises 7 input signals and 3 output signals; the 7 input signals are: i_load_bias, i_data_bias, i_load_weight, i_data_weight, i_input_vld, i_input_dat and i_output_rdy; where i_load_bias is the bias data load enable provided by the finite state machine, i_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, i_load_weight is the weight data load enable provided by the finite state machine, i_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, and i_input_vld is the feature map data valid signal provided by the input buffer module FIFO, which is high when the input buffer module FIFO is not empty and its read end is not in the reset state, with the hardware connection as shown in FIG. 4; i_input_dat is the 4-bit feature map data provided by the input buffer module FIFO, and i_output_rdy is the output enable signal provided by the intermediate convolution module; the 3 output signals are i_output_vld, i_output_dat and i_input_rdy; where i_output_vld is the feature map valid signal provided to the intermediate convolution module, i_output_dat is the 10-bit feature map data provided to the intermediate convolution module, and i_input_rdy is this module's input enable signal provided to the input buffer module FIFO.
As shown in FIG. 8, the intermediate convolution module comprises a data memory, a data serial-parallel conversion module, an intermediate weight serial-parallel conversion module, a weight memory, a bias memory, an intermediate parallel multiply-add operation module, a finite state machine module and a bit width adjustment module; the feature map data output by the input convolution module are sent to the data memory and the data serial-parallel conversion module for block division and serial-parallel conversion; the weight data are input into the weight serial-parallel conversion module, converted into parallel data and stored in the weight memory; the converted feature map data and weight data are sent to the intermediate parallel multiply-add operation module for convolution operation, the bit width is adjusted after the operation result is summed with the bias data, and the result is output to the output convolution module in accordance with the state of the finite state machine; the data serial-parallel conversion module and the weight serial-parallel conversion module are designed to exploit the abundant programmable logic and storage resources inside the FPGA, splitting the matrix data for storage, as pictured in the sketch below.
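The serial-parallel conversion can be pictured with the following C sketch: a serial stream of weight words is demultiplexed into one memory bank per kernel so that all 192 kernels can feed the parallel multiply-add array at once. The kernel-major stream order, the names and the use of int16_t to carry the 9-bit weights are illustrative assumptions; the actual streaming order is not stated in this description.
```c
#include <stdint.h>

#define N_KERNELS 192
#define K3        3
#define MID_CH    192
#define WORDS_PER_KERNEL (MID_CH * K3 * K3)   /* 3x3 taps per input channel */

/* Illustrative serial-to-parallel weight loader: the weight memory holds one
 * bank per kernel so all 192 kernels can be multiplied in parallel. Assumes
 * the serial stream is kernel-major. */
static int16_t weight_mem[N_KERNELS][WORDS_PER_KERNEL];

static void load_weights_serial(const int16_t *serial_stream, long nwords)
{
    for (long i = 0; i < nwords && i < (long)N_KERNELS * WORDS_PER_KERNEL; i++) {
        int bank = (int)(i / WORDS_PER_KERNEL);   /* which kernel's memory  */
        int off  = (int)(i % WORDS_PER_KERNEL);   /* position inside kernel */
        weight_mem[bank][off] = serial_stream[i];
    }
}
```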
The hardware structure of the intermediate convolution module is shown in FIG. 9 and comprises 7 input signals and 3 output signals; the 7 input signals are: m_load_bias, m_data_bias, m_load_weight, m_data_weight, m_input_vld, m_input_dat and m_output_rdy; where m_load_bias is the bias data load enable provided by the finite state machine, m_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, m_load_weight is the weight data load enable provided by the finite state machine, m_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, m_input_vld is the feature map data valid signal provided by the input convolution module, m_input_dat is the 10-bit feature map data provided by the input convolution module, and m_output_rdy is the output enable signal provided by the output convolution module; the 3 output signals are m_output_vld, m_output_dat and m_input_rdy; where m_output_vld is the feature map valid signal provided to the output convolution module, m_output_dat is the 80-bit feature map data provided to the output convolution module, and m_input_rdy is this module's input enable signal provided to the input convolution module.
As shown in FIG. 10, the output convolution module feeds the feature map data output by the intermediate convolution module into the output data serial-parallel conversion module instantiated inside it, feeds the weight data into the output weight serial-parallel conversion module, and feeds the bias signal into the bias signal memory; the converted feature map input signal and weight signal enter the output parallel multiply-add operation module for convolution operation, are summed with the bias signal, then pass through the bit width adjustment module, which outputs the feature map data provided to the scaling module.
The hardware structure of the output convolution module is shown in FIG. 11 and comprises 6 input signals and 3 output signals; the 6 input signals are: o_load_bias, o_data_bias, o_load_weight, o_data_weight, o_input_vld and o_input_dat; where o_load_bias is the bias data load enable provided by the finite state machine, o_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, o_load_weight is the weight data load enable provided by the finite state machine, o_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, o_input_vld is the feature map data valid signal provided by the preceding convolution module, and o_input_dat is the 80-bit feature map data provided by the preceding convolution module; the 3 output signals are o_output_vld, o_output_dat and o_input_rdy; where o_output_vld is the feature map valid signal provided to the scaling module, o_output_dat is the 10-bit feature map data provided to the scaling module, and o_input_rdy is this module's input enable signal provided to the intermediate convolution module.
The hardware structure of the scaling module is shown in FIG. 4 and comprises 4 input signals and 2 output signals; the 4 input signals are: sc_load_bias, sc_data_bias, sc_input_vld and sc_input_dat; where sc_load_bias is the bias data load enable provided by the finite state machine, sc_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, sc_input_vld is the feature map data valid signal provided by the preceding convolution module, and sc_input_dat is the 10-bit feature map data provided by the preceding convolution module; the 2 output signals are sc_output_vld and sc_output_dat; sc_output_vld is the feature map valid signal provided to the Softmax module, and sc_output_dat is the feature map data in 32-bit single-precision floating-point format provided to the Softmax module.
The hardware structure of the Softmax module is shown in FIG. 4 and comprises 2 input signals and 2 output signals; the 2 input signals are: so_input_vld and so_input_dat; where so_input_vld is the feature map data valid signal provided by the scaling module, and so_input_dat is the feature map data in 32-bit single-precision floating-point format provided by the scaling module; the 2 output signals are so_output_vld and so_output_dat; so_output_vld is the feature map valid signal provided externally, and so_output_dat is the feature map data in 32-bit single-precision floating-point format provided externally.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, without departing from the spirit and scope of the corresponding technical solutions of the present invention as defined by the appended claims.

Claims (4)

1. An FPGA-based Policy convolutional neural network accelerator, characterized in that: it comprises an input buffer module FIFO, a convolution module, a scaling module and a Softmax module; the hardware platform on which it runs is the Xilinx VCU118 FPGA development kit; the input buffer module FIFO buffers against the mismatch between the input rate of the external feature map and the computation rate of the modules inside the FPGA, and feeds the feature map data into the convolution module; the convolution module completes the operation of each convolutional layer of the Policy convolutional neural network and of the ReLU activation function, and outputs the operation results to the scaling module; the scaling module applies a corresponding offset to the values of the 361 (19×19) points output by the convolution module: it converts the feature map data output by the convolution module into floating-point data through a fixed-point-to-floating-point IP core, adds the floating-point data and the bias data to complete the offset process, and sends the resulting feature map data to the Softmax module; the Softmax module performs exponential calculation on the data stream output by the scaling module one value at a time, all the exponential results are input into an accumulator, and the sum of all the values is calculated and fed into a floating-point division IP core provided by Xilinx as the divisor; the accumulator is composed of a floating-point addition IP core provided by Xilinx; the floating-point division IP core takes the exponential result of each value as the dividend and outputs the division results; in the division operation, an FIFO Generator IP core stores the data in sequence so that they are calculated in order; the Softmax module thus ensures that the order of the probability values calculated for each point remains unchanged;
the convolution module comprises an input convolution module, an intermediate convolution module and an output convolution module; the input convolution module completes the calculation of the first convolutional layer of the Policy convolutional neural network and of the ReLU activation function; the intermediate convolution module completes the operations of the second to twelfth layers of the Policy convolutional neural network and of the ReLU activation function, and the output convolution module completes the operation of the thirteenth layer of the Policy convolutional neural network;
the input convolution module comprises a weight serial-parallel conversion module and a parallel multiply-add operation module; the feature map data and the weight data that have passed through the weight serial-parallel conversion module enter the parallel multiply-add operation module together for convolution operation, and after the operation result is summed with the bias input, the feature map data provided to the intermediate convolution module are generated;
the input convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: i_load_bias, i_data_bias, i_load_weight, i_data_weight, i_input_vld, i_input_dat and i_output_rdy; where i_load_bias is the bias data load enable provided by the finite state machine, i_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, i_load_weight is the weight data load enable provided by the finite state machine, i_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, i_input_vld is the feature map data valid signal provided by the input buffer module FIFO, i_input_dat is the 4-bit feature map data provided by the input buffer module FIFO, and i_output_rdy is the output enable signal provided by the intermediate convolution module; the 3 output signals are i_output_vld, i_output_dat and i_input_rdy; where i_output_vld is the feature map valid signal provided to the intermediate convolution module, i_output_dat is the 10-bit feature map data provided to the intermediate convolution module, and i_input_rdy is this module's input enable signal provided to the input buffer module FIFO;
the intermediate convolution module comprises a data memory, a data serial-parallel conversion module, an intermediate weight serial-parallel conversion module, a weight memory, a bias memory, an intermediate parallel multiply-add operation module, a finite state machine module and a bit width adjustment module; the feature map data output by the input convolution module are sent to the data memory and the data serial-parallel conversion module for block division and serial-parallel conversion; the weight data are input into the weight serial-parallel conversion module, converted into parallel data and stored in the weight memory; the converted feature map data and weight data are sent to the intermediate parallel multiply-add operation module for convolution operation, the bit width is adjusted after the operation result is summed with the bias data, and the result is output to the output convolution module in accordance with the state of the finite state machine; the data serial-parallel conversion module and the weight serial-parallel conversion module are designed to exploit the abundant programmable logic and storage resources inside the FPGA, splitting the matrix data for storage;
the intermediate convolution module comprises 7 input signals and 3 output signals; the 7 input signals are: m_load_bias, m_data_bias, m_load_weight, m_data_weight, m_input_vld, m_input_dat and m_output_rdy; where m_load_bias is the bias data load enable provided by the finite state machine, m_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, m_load_weight is the weight data load enable provided by the finite state machine, m_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, m_input_vld is the feature map data valid signal provided by the input convolution module, m_input_dat is the 10-bit feature map data provided by the input convolution module, and m_output_rdy is the output enable signal provided by the output convolution module; the 3 output signals are m_output_vld, m_output_dat and m_input_rdy; where m_output_vld is the feature map valid signal provided to the output convolution module, m_output_dat is the 80-bit feature map data provided to the output convolution module, and m_input_rdy is this module's input enable signal provided to the input convolution module;
the output convolution module feeds the feature map data output by the intermediate convolution module into the output data serial-parallel conversion module instantiated inside it, feeds the weight data into the output weight serial-parallel conversion module, and feeds the bias signal into the bias signal memory; the converted feature map input signal and weight signal enter the output parallel multiply-add operation module for convolution operation, are summed with the bias signal, then pass through the bit width adjustment module, which outputs the feature map data provided to the scaling module;
the output convolution module comprises 6 input signals and 3 output signals; the 6 input signals are: o_load_bias, o_data_bias, o_load_weight, o_data_weight, o_input_vld and o_input_dat; where o_load_bias is the bias data load enable provided by the finite state machine, o_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, o_load_weight is the weight data load enable provided by the finite state machine, o_data_weight is the neural network weight data in 9-bit fixed-point format provided by the external input, o_input_vld is the feature map data valid signal provided by the preceding convolution module, and o_input_dat is the 80-bit feature map data provided by the preceding convolution module; the 3 output signals are o_output_vld, o_output_dat and o_input_rdy; where o_output_vld is the feature map valid signal provided to the scaling module, o_output_dat is the 10-bit feature map data provided to the scaling module, and o_input_rdy is this module's input enable signal provided to the intermediate convolution module.
2. The FPGA-based Policy convolutional neural network accelerator of claim 1, wherein: the input buffer module FIFO is a synchronous-clock FIFO built from FPGA on-chip block RAM, with a depth of 4096, working in first-word-fall-through mode; the input buffer module FIFO comprises 3 input ports and 5 output ports; the 3 input ports are: the externally provided 4-bit input data din, the write enable wr_en provided by a finite state machine, and the read enable rd_en provided by the convolution module; the 5 output ports are: the 4-bit output data dout provided to the convolution module, the write-full indication full provided to the convolution module, the read-empty indication empty provided to the convolution module, the write reset status indication wr_rst_busy provided to the convolution module, and the read reset status indication rd_rst_busy provided to the convolution module.
3. The FPGA-based Policy convolutional neural network accelerator of claim 1, wherein: the scaling module comprises 4 input signals and 2 output signals; the 4 input signals are: sc_load_bias, sc_data_bias, sc_input_vld and sc_input_dat; where sc_load_bias is the bias data load enable provided by the finite state machine, sc_data_bias is the neural network bias data in 16-bit fixed-point format provided by the external input, sc_input_vld is the feature map data valid signal provided by the preceding convolution module, and sc_input_dat is the 10-bit feature map data provided by the preceding convolution module; the 2 output signals are sc_output_vld and sc_output_dat; sc_output_vld is the feature map valid signal provided to the Softmax module, and sc_output_dat is the feature map data in 32-bit single-precision floating-point format provided to the Softmax module.
4. The FPGA-based Policy convolutional neural network accelerator of claim 1, wherein: the Softmax module comprises 2 input signals and 2 output signals; the 2 input signals are: so_input_vld and so_input_dat; where so_input_vld is the feature map data valid signal provided by the scaling module, and so_input_dat is the feature map data in 32-bit single-precision floating-point format provided by the scaling module; the 2 output signals are so_output_vld and so_output_dat; so_output_vld is the feature map valid signal provided externally, and so_output_dat is the feature map data in 32-bit single-precision floating-point format provided externally.
CN201811373344.1A 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA Active CN109146067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811373344.1A CN109146067B (en) 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811373344.1A CN109146067B (en) 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA

Publications (2)

Publication Number Publication Date
CN109146067A CN109146067A (en) 2019-01-04
CN109146067B true CN109146067B (en) 2021-11-05

Family

ID=64806153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811373344.1A Active CN109146067B (en) 2018-11-19 2018-11-19 Policy convolution neural network accelerator based on FPGA

Country Status (1)

Country Link
CN (1) CN109146067B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188869B (en) * 2019-05-05 2021-08-10 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110929688A (en) * 2019-12-10 2020-03-27 齐齐哈尔大学 Construction method and acceleration method of rice weed recognition acceleration system
CN111047037A (en) * 2019-12-27 2020-04-21 北京市商汤科技开发有限公司 Data processing method, device, equipment and storage medium
CN111330255B (en) * 2020-01-16 2021-06-08 北京理工大学 Amazon chess-calling generation method based on deep convolutional neural network
CN112232499B (en) * 2020-10-13 2022-12-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Convolutional neural network accelerator
CN112541583A (en) * 2020-12-16 2021-03-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Neural network accelerator
CN112836793B (en) * 2021-01-18 2022-02-08 中国电子科技集团公司第十五研究所 Floating point separable convolution calculation accelerating device, system and image processing method
CN113392973B (en) * 2021-06-25 2023-01-13 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113609548B (en) * 2021-07-05 2023-10-24 中铁工程设计咨询集团有限公司 Bridge span distribution method, device, equipment and readable storage medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN108389183A (en) * 2018-01-24 2018-08-10 上海交通大学 Pulmonary nodule detects neural network accelerator and its control method
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
"A Survey of FPGA Based Neural Network Accelerator";KAIYUAN GUO et al.;《arXiv》;20180515;全文 *
"Dynamic scheduling Monte-Carlo framework for multi-accelerator hetergeneous clusters";Anson H.T.Tse et al.;《IEEE》;20110106;全文 *
"FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threased C Software";Jin Hee Kim et al.;《IEEE》;20171231;全文 *
"PipeCNN:An OpenCL-Based FPGA Accelerator for Large-Scale Convolutional Neuron Networks";Dong Wang et al.;《arXiv》;20161108;全文 *
"基于FPGA的大规模浮点矩阵乘加速器研究";沈俊忠;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180415(第04期);第I135-633页 *
"基于FPGA的高速网络流量采集系统设计";汪明;《中国优秀硕士学位论文全文数据库 信息科技辑》;20140515(第05期);全文 *
"深度学习中的卷积神经网络系统设计及硬件实现";王昆 等;《万方数据知识服务平台》;20180611;第44卷(第5期);第56-69页 *
"深度学习算法可重构加速器关键技术研究";刘志强;《中国优秀硕士学位论文全文数据库 信息科技辑》;20170315(第3期);全文 *

Also Published As

Publication number Publication date
CN109146067A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN109063825B (en) Convolutional neural network accelerator
CN102629189B (en) Water floating point multiply-accumulate method based on FPGA
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN113361695B (en) Convolutional neural network accelerator
CN103984560A (en) Embedded reconfigurable system based on large-scale coarseness and processing method thereof
Chen et al. A compact and configurable long short-term memory neural network hardware architecture
Yue et al. A 28nm 16.9-300TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture
CN113283587A (en) Winograd convolution operation acceleration method and acceleration module
CN115018062A (en) Convolutional neural network accelerator based on FPGA
WO2023070997A1 (en) Deep learning convolution acceleration method using bit-level sparsity, and processor
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
CN103279323A (en) Adder
CN102129419B (en) Based on the processor of fast fourier transform
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Tsai et al. An on-chip fully connected neural network training hardware accelerator based on brain float point and sparsity awareness
CN111882050A (en) FPGA-based design method for improving BCPNN speed
Wong et al. Low bitwidth CNN accelerator on FPGA using Winograd and block floating point arithmetic
CN113191494B (en) Efficient LSTM accelerator based on FPGA
CN115167815A (en) Multiplier-adder circuit, chip and electronic equipment
He et al. An LSTM acceleration engine for FPGAs based on caffe framework
Xia et al. Reconfigurable spatial-parallel stochastic computing for accelerating sparse convolutional neural networks
Nagarajan et al. Fixed point multi-bit approximate adder based convolutional neural network accelerator for digit classification inference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant