CN109784489B - Convolutional neural network IP core based on FPGA - Google Patents

Convolutional neural network IP core based on FPGA

Info

Publication number
CN109784489B
CN109784489B (application CN201910038533.1A)
Authority
CN
China
Prior art keywords
core
convolution
neural network
input
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910038533.1A
Other languages
Chinese (zh)
Other versions
CN109784489A (en)
Inventor
常瀛修
廖立伟
曹健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yu Dunshan
SCHOOL OF SOFTWARE AND MICROELECTRONICS PEKING UNIVERSITY
Original Assignee
Yu Dunshan
SCHOOL OF SOFTWARE AND MICROELECTRONICS PEKING UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yu Dunshan, SCHOOL OF SOFTWARE AND MICROELECTRONICS PEKING UNIVERSITY filed Critical Yu Dunshan
Priority to CN201910038533.1A priority Critical patent/CN109784489B/en
Publication of CN109784489A publication Critical patent/CN109784489A/en
Application granted granted Critical
Publication of CN109784489B publication Critical patent/CN109784489B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses an FPGA-based convolutional neural network IP core, aiming at accelerating the operation of convolutional neural networks on a field-programmable gate array (FPGA). Following the basic model of a convolutional neural network, the invention specifically comprises a convolution operation IP core, a pooling operation IP core, a fully-connected operation IP core, a bubbling-method convolutional layer, a bubbling-method pooling layer, a fully-connected layer, a feature map storage module, and a parameter storage module. The IP cores support the construction of convolutional neural networks of different scales: IP cores of different types and quantities are instantiated according to the required network model. Different neural network layers are built by instantiating IP cores, and the parallelism of the FPGA is fully exploited to accelerate convolutional neural network operation. The IP cores are designed in the Verilog HDL language, enabling migration across different FPGAs. The invention greatly improves the operation speed and efficiency of the convolutional neural network and reduces its processing power consumption.

Description

Convolutional neural network IP core based on FPGA
Technical Field
The invention relates to the field of hardware acceleration of convolutional neural networks, and in particular to the design of an FPGA-based convolutional neural network IP core.
Background
With the recent rise and advancement of machine learning, deep learning, and artificial intelligence, the artificial neural network (ANN) has developed continuously and, as an artificial-intelligence field where bioscience and computer science cross-pollinate, draws attention from both academia and industry. Early artificial neural networks resembled the structure of the biological nervous system: computational structures mimicking human cerebral neurons were proposed in the mid-20th century. The dendritic branches of a neuron are modeled as multiple input data, the axon as a single output datum, and the axon's nerve-signal output is realized through a fixed data transformation, namely linear weighting.
Because the thresholds and weights of such linear weightings were set manually, the best result was not necessarily achieved. In the 1970s, researchers argued that the perceptron model could not solve linearly inseparable problems; the computing power of the time was too low to realize multilayer neural network models, and neural network research entered a low tide.
In the late 1980s, to address the heavy computation and linear inseparability problems of neural networks, researchers proposed the back-propagation algorithm, which greatly reduced neural network training time. To this day, back-propagation remains the mainstream algorithm for neural network training. Training deep neural networks nevertheless remained difficult, because computational resources were still insufficient.
In 1989, researchers proposed LeNet-5, the first true convolutional neural network (CNN), and the convolutional neural network gradually became the most widely used of the various deep neural networks. With the development of neural network algorithms, convolutional neural networks are widely applied in image and pattern recognition, object detection, semantic segmentation, and related fields.
Because of its wide application, the convolutional neural network has gradually drawn attention from academia and industry; it has clear advantages in image processing, especially in scaling of image size, feature map extraction, and the like. To fit industrial applications, the learning and classification abilities of convolutional neural networks are continuously improved, so their structures gradually become more complicated; large-scale, deep networks need large numbers of samples to train their parameters, making the computation of the training process enormous. The massive training parameters of large-scale, deep convolutional neural networks demand higher storage resources, high-throughput data processing, and high parallelism, so control-flow architectures based on the traditional computer structure cannot exploit the characteristics of convolutional neural networks. NVIDIA's CUDA (Compute Unified Device Architecture) and Google's TensorFlow framework support high-performance numerical computation on GPUs and relieve the computing pressure on general-purpose CPUs to some extent, but their development and manufacturing cost and energy-efficiency ratio cannot meet the requirements of low power and high performance, and, limited by volume and portability, they hardly support convolutional neural network applications on terminals.
The field-programmable gate array (FPGA) has emerged as a semi-custom circuit in the field of application-specific integrated circuits (ASICs). The FPGA combines the high performance and high integration of the ASIC with the flexibility of a user-programmable device; it is characterized by reconfigurability, high performance and integration, and large headroom for hardware upgrades. Thanks to this reconfigurability, and in the current absence of a general-purpose chip architecture dedicated to convolutional neural networks, the FPGA matches the high-performance numerical computation of the convolutional neural network while avoiding the low energy-efficiency ratio of single-threaded processing on general-purpose CPUs and GPUs. FPGA products also reach the market quickly; at a time when neural network structures change day by day they can be deployed rapidly, avoiding the poor flexibility of ASIC chips, which can only be designed for a specific algorithm.
Existing FPGA acceleration schemes are roughly as follows:
1. Low-power, high-performance accelerator designs that raise memory access bandwidth by stacking a small number of processing elements (PEs); with few pipelines, however, data throughput is low, and the convolution process may even contain redundant pipeline stages.
2. Building convolution kernels of different sizes from smaller processing elements (PEs), which avoids computational bottlenecks but lengthens convolution latency and somewhat limits peak computing performance.
3. Frequency-domain accelerators that use variable-size OaA (overlap-and-add) convolution kernels to reduce the number of convolutions and improve generality across kernel sizes; but because the OaA stage is built from fixed-size FFTs, the convolution process requires zero-padding at FFT edges and tiling of the kernels, again lengthening convolution latency.
4. Hybrid neural network processors that accelerate different neural networks on demand through a 16x16 reconfigurable heterogeneous PE array, using adaptive bit-width configuration to cut power and raise efficiency. Such reconfigurable computing platforms exist mainly because the market has few general-purpose neural network processors, and accelerators are usually designed for one neural network or one network model. A reconfigurable platform can adapt to most existing neural networks or models, including CNN, FCN, and RNN, improving the platform's energy-efficiency ratio. However, because the platform hardens CNN, FCN, and RNN together, it occupies substantial logic resources, so the power consumption of its computing process cannot be reduced effectively; its scalability is low, its demands on FPGA resources and performance are high, and development cost rises.
In summary, the convolutional neural network and its terminal application market are very broad. Although network models are adapted to different application scenarios, the basic components of a CNN (convolution, pooling, full connection, activation functions, and so on) change little, so an FPGA can cope with CNNs for different scenarios. On the one hand, an IP core designed for the convolutional neural network contains these basic components; designed in the Verilog HDL language, it is easy to deploy on different FPGAs and embedded systems and is highly portable. On the other hand, hardware design, modification, and debugging are a threshold for software and algorithm engineers unfamiliar with hardware, which can increase cost and working hours for enterprises. To lower this threshold and reduce the time cost of enterprises and researchers, the IP cores designed for the convolutional neural network provide callable interfaces, making it convenient for users to build different convolutional neural network models on different FPGAs, which is of great significance in supporting hardware acceleration of convolutional neural networks.
Disclosure of Invention
The invention provides an FPGA-based convolutional neural network IP core, aiming to build the hardware structure of a convolutional neural network in an FPGA quickly and conveniently, accelerate the feedforward operation of the convolutional neural network, lower the hardware design threshold for software and algorithm engineers, and facilitate the development and verification of algorithms and terminal products.
The intended application scenario of the invention takes the FPGA as the accelerator hardware platform, with a reconfigurable convolutional neural network model and a data stream covering convolution, pooling, full connection, activation functions, and the convolutional neural network feature maps. It improves operational performance and efficiency while keeping power consumption low. The FPGA has a standard interface configuration and supports extension of the convolutional neural network.
In order to achieve the purpose, the invention adopts the following technical scheme:
the convolutional neural network IP core based on the FPGA is characterized in that the specific IP core and the composition module comprise a convolutional operation IP core, a pooling operation IP core, a full-connection operation IP core, a bubbling method convolutional layer, a bubbling method pooling layer, a full-connection layer, a feature map storage module and a parameter storage module.
The neural network layer formed by the IP cores is internally interconnected with the parameter storage module and the feature map storage module, and a hardware structure formed by the neural network layer, the parameter storage module and the feature map storage module is consistent with a required convolutional neural network algorithm structure.
The convolutional neural network IP core based on the FPGA is characterized in that the convolution operation IP core comprises an input feature map buffer, a weight parameter buffer, a multiplier, an adder, and an activation function module;
1) The convolution operation IP core reads the feature points of the feature map row by row, one per clock cycle, forming a data stream. The input feature map buffer is a register group with configurable depth, used to shift and buffer the input feature map data stream; its depth is modified according to the numbers of rows and columns of the input feature map to support convolutional neural networks of different scales, and the input feature map data at fixed address intervals are connected to the multipliers.
2) The weight parameter buffer is a register group with configurable depth, used to shift and buffer the weight parameters; it is held fixed once the register group is filled with weight parameters. Its hardware structure is the same as that of the input feature map buffer, i.e., the weight parameter data at the same fixed address intervals are connected to the multipliers.
3) The multipliers and adders connected to the fixed address intervals of the input feature map buffer and the weight parameter buffer form multiply-add pairs; together with the feature points of the feature map and the corresponding weight parameters, they form the complete convolution operation.
4) The convolution operation IP core adopts the common ReLU activation function; the activation function module is a MUX multiplexer, equivalent to the formula f(x) = max(0, x).
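To make the activation module concrete, here is a minimal Verilog sketch of the ReLU-as-multiplexer described above; the module and port names are illustrative assumptions, not taken from the patent:

```verilog
// Minimal sketch of the ReLU activation module as a MUX multiplexer.
// Assumes signed fixed-point data; module/port names are illustrative.
module relu_mux #(
    parameter WIDTH = 16
)(
    input  wire signed [WIDTH-1:0] din,   // convolution result x
    output wire signed [WIDTH-1:0] dout   // f(x) = max(0, x)
);
    // The sign bit selects between zero and the input value.
    assign dout = din[WIDTH-1] ? {WIDTH{1'b0}} : din;
endmodule
```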
The convolution neural network IP core based on the FPGA is characterized in that the pooling operation IP core comprises an input feature map buffer and a comparator;
1) the input characteristic diagram buffer is a register group with configurable depth, is used for shifting and buffering the input characteristic diagram data stream, can modify the depth of the register group according to the number of rows and columns of the input characteristic diagram to support convolutional neural networks with different scales, and the input characteristic diagram data of a fixed address interval is connected with the comparator.
2) The input port of the comparator is connected with 4 fixed address intervals of the input feature map buffer, and the output port is the maximum value of the 4 data.
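A minimal Verilog sketch of such a 2x2 max-pooling comparator follows; the widths, names, and the unsigned comparison (valid for post-ReLU, non-negative data) are assumptions:

```verilog
// Sketch of the 2x2 max-pooling comparator: four taps from the input
// feature map buffer, output is their maximum. Names are assumptions.
module max4 #(
    parameter WIDTH = 16
)(
    input  wire [WIDTH-1:0] d00, d01,  // taps IC*0+0, IC*0+1
    input  wire [WIDTH-1:0] d10, d11,  // taps IC*1+0, IC*1+1
    output wire [WIDTH-1:0] max_out    // output feature point
);
    wire [WIDTH-1:0] m0 = (d00 > d01) ? d00 : d01;
    wire [WIDTH-1:0] m1 = (d10 > d11) ? d10 : d11;
    assign max_out = (m0 > m1) ? m0 : m1;
endmodule
```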
The convolution neural network IP core based on the FPGA is characterized in that the full-connection operation IP core comprises a counter, an accumulator, a multiplier and an adder;
1) the multiplier receives the input node data and the one-to-one corresponding weight parameters, completing the multiplication of each input node with its weight parameter.
2) The accumulator is composed of a register and an adder, and the operation result of the multiplier is input into the register of the accumulator and is accumulated with the multiplication result of the next clock cycle to form multiply-accumulate operation.
3) The counter controls the iteration period of the multiply-accumulate operation; one iteration period means all input nodes and their corresponding weight parameters complete the multiply-accumulate operation, yielding the data of one output node. After one iteration period finishes, the counter gates the multiply-accumulate result, which is added to the bias parameter corresponding to that output node.
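The multiply-accumulate datapath described in items 1) to 3) can be sketched in Verilog as follows; all names, widths, and the valid handshake are illustrative assumptions rather than the patent's interface:

```verilog
// Sketch of the fully-connected multiply-accumulate datapath: a
// multiplier feeds an accumulator register; a counter closes one
// iteration period after N products and folds in the bias.
module fc_mac #(
    parameter WIDTH = 16,
    parameter N     = 256,          // number of input nodes
    parameter ACCW  = 2*WIDTH + 8   // accumulator width (headroom for N sums)
)(
    input  wire                     clk,
    input  wire                     rst,
    input  wire signed [WIDTH-1:0]  f_data,   // input node value
    input  wire signed [WIDTH-1:0]  w_data,   // matching weight parameter
    input  wire signed [WIDTH-1:0]  bias,     // bias of the current output node
    output reg  signed [ACCW-1:0]   outnode,  // output node value
    output reg                      valid     // pulses when outnode updates
);
    reg  [31:0]               cnt;            // iteration counter
    reg  signed [ACCW-1:0]    acc;            // accumulator register
    wire signed [2*WIDTH-1:0] prod = f_data * w_data;

    always @(posedge clk) begin
        if (rst) begin
            cnt <= 0; acc <= 0; valid <= 1'b0;
        end else if (cnt == N-1) begin        // one iteration period done
            outnode <= acc + prod + bias;     // fold in last product and bias
            acc     <= 0;
            cnt     <= 0;
            valid   <= 1'b1;
        end else begin
            acc   <= acc + prod;              // multiply-accumulate
            cnt   <= cnt + 1;
            valid <= 1'b0;
        end
    end
endmodule
```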
The convolution neural network IP core based on the FPGA is characterized in that the bubbling convolution layer comprises a bubbling controller and a convolution operation IP core;
1) The three clock sequences of the bubbling-method controller are the read input feature map sequence, the read weight parameter sequence, and the convolution process sequence. The waiting interval of the read input feature map sequence and the read weight parameter sequence is the time during which the input feature map data and the weight parameter data respectively enter the convolution operation IP core's input feature map buffer and weight parameter buffer; the convolution results in this interval are invalid. The holding interval of the read weight parameter sequence means the weight parameter buffer has been filled with weight parameter data and is held fixed; during this interval the input feature map data stream flows through the input feature map buffer, which is the actual convolution processing, and the convolution results of this interval are partially valid. The waiting interval of the convolution process sequence is the same as for the read input feature map and read weight parameter sequences. A valid interval in the convolution process indicates that the output feature map data are valid, i.e., valid results produced when the convolution kernel does not cross rows; an invalid interval indicates that the output feature map data are invalid, i.e., invalid results produced when the convolution kernel crosses rows. If the convolution process is regarded as liquid and the invalid intervals as gas, keeping only the valid results is equivalent to discharging excess gas from the liquid, hence the name bubbling.
2) And stacking convolution operation IP cores with the same number as the depth of the input feature map to form a single convolution core of the convolutional layer, instantiating the convolution core required by the convolutional layer, and forming the bubbling convolutional layer together with the bubbling controller.
The convolution neural network IP core based on the FPGA is characterized in that the bubbling method pooling layer comprises a bubbling method controller and a pooling operation IP core;
1) The four clock sequences of the bubbling-method controller are a read input feature map sequence, a column valid sequence, a row valid sequence, and a pooling process sequence. The waiting interval of the four clock sequences is the time during which the input feature map data enter the input feature map buffer of the pooling operation IP core; the pooling results in this interval are invalid. Each valid pooling result of the column valid sequence is separated by an invalid pooling result, indicating that the column stride of the pooling filter is 2; each valid row of the row valid sequence is separated by an invalid row, indicating that the row stride of the pooling filter is 2. If the pooling process is regarded as liquid and the invalid pooling results of the column valid and row valid sequences as gas, keeping only the valid results is equivalent to discharging excess gas from the liquid; similar to the bubbling-method convolutional layer controller, this is called bubbling.
2) And stacking the pooling operation IP cores with the same number as the depth of the input feature map to form a single pooling core of the pooling layer, and forming the bubbling method pooling layer together with the bubbling method controller.
The convolution neural network IP core based on the FPGA is characterized in that the full-connection layer comprises a full-connection layer controller and a full-connection operation IP core;
1) The fully-connected layer controller comprises a weight parameter address reader, a bias parameter address reader, a counter, and an input feature map address reader. The input feature map address reader and the weight parameter address reader each output an address signal per clock cycle, reading the one-to-one corresponding input node and weight parameter data into the fully-connected operation IP core. The bias parameter address reader is controlled by the counter: after the multiply-accumulate of the first output node completes, it reads the bias parameter corresponding to that output node.
2) And the control port of the full-connection layer controller is in point-to-point connection with the corresponding ports of the characteristic map storage module and the parameter storage module and forms a full-connection layer together with the full-connection operation IP core.
The convolutional neural network IP core based on the FPGA is characterized in that the feature map storage module comprises global feature map caches corresponding to all dimensions, whose internal structure is composed of shift buffers. The buffer depth is the number of feature map rows multiplied by the number of feature map columns, and the number of buffers is the depth of the three-dimensional feature map. The input port of the feature map storage module is connected point-to-point with the address port corresponding to each layer's controller, and the output port is connected with the input ports of each neural network layer's IP cores.
The convolutional neural network IP core based on the FPGA is characterized in that the parameter storage module comprises a global weight parameter and a bias parameter of the convolutional neural network model and is interconnected with each neural network layer.
Drawings
FIG. 1 is a system block diagram of the IP core of the FPGA-based convolutional neural network of the present invention;
FIG. 2 is a block diagram of the hardware structure and internal and external components of the convolution operation IP core of the present invention;
FIG. 3 is a schematic diagram of the timing control logic of the bubble convolutional layer of the present invention;
FIG. 4 is a block diagram of the bubble convolution layer hardware structure and internal and external components of the present invention;
FIG. 5 is a block diagram of the hardware structure and internal and external components of the pooling operation IP core of the present invention;
FIG. 6 is a schematic diagram of the timing control logic of the bubbling pooling layer of the present invention;
FIG. 7 is a block diagram of the bubbling-method pooling layer hardware structure and internal and external components of the present invention;
FIG. 8 is a schematic diagram of a fully-connected neural network according to the present invention;
FIG. 9 is a diagram of the hardware architecture of a fully-connected arithmetic IP core of the present invention;
FIG. 10 is a block diagram of the hardware architecture and internal and external components of the fully-connected layer controller according to the present invention;
FIG. 11 is a block diagram of the fully-connected layer hardware structure and internal and external components of the present invention;
fig. 12 is a schematic diagram of a power measurement result during FPGA operation.
Detailed Description
By analyzing the basic characteristics of the convolutional neural network, studying the current research at home and abroad together with its advantages and disadvantages, and exploiting the high parallelism, high energy-efficiency ratio, and reconfigurability of the FPGA, the invention designs FPGA-based convolutional neural network IP cores that accelerate the feedforward propagation of the convolutional neural network in its 3 computation-intensive aspects: convolution, pooling, and full connection. IP cores designed in the Verilog HDL language can efficiently build the required hardware structures with minimal logic resources and are easily ported to different types of FPGAs.
The following basic unit definitions are first given for the detailed description and its mathematical expressions:

Table 1 CNN basic unit definitions of the invention

| Symbol | Definition |
| --- | --- |
| IC | number of columns of the input feature map |
| IR | number of rows of the input feature map |
| ID | depth (number of channels) of the input feature map |
| CC | number of columns of the convolution kernel |
| CR | number of rows of the convolution kernel |
| n | number of convolution kernels in a convolutional layer |
Referring to fig. 1, a neural network accelerator composed of convolutional neural network IP cores based on FPGA includes a convolutional operation IP core, a pooling operation IP core, a full-connection operation IP core, a bubble convolution layer, a bubble pooling layer, a full-connection layer, a feature map storage module, and a parameter storage module.
Fig. 2 specifically describes an internal hardware structure of the convolution operation IP core and a connection manner of external component modules.
The convolution operation IP core comprises an input feature map buffer, a weight parameter buffer, a multiplier, an adder, and an activation function module. Before convolution begins, all parameters of the trained CNN are stored in the parameter storage module in BRAM to improve parameter access efficiency. The parameters IC and CR inside the convolution IP core are modified manually to match the size of the current convolutional layer. When convolution starts, the feature point pixel values in the feature map storage module pass row by row, as a data stream, through the input feature map buffer inside the convolution operation IP core. For the first convolutional layer, the data stream is the binarized and scaled real-time image stream collected by a camera. CC corresponds to a light shaded portion of the input feature map buffer; there are CR such portions in the buffer, and two adjacent portions are separated by IC-CC unit registers. The depth of the configurable input feature map buffer is (CR-1)xIC+CC; IC is modified to adapt to different input feature map sizes, supporting the construction of convolutional neural networks of different scales.
Similarly, the weight parameters in the parameter storage module also enter the weight parameter buffer inside the IP core as a data stream; the structure of this buffer is identical to that of the input feature map buffer. The weight parameters are stored with (IC-CC) zeros inserted after every CC valid weight parameters, and are read sequentially during the convolution process.
The weight parameter buffer stays unchanged after the data are stored, while the input feature map buffer keeps shifting to buffer the entire input feature map data stream. The two shift buffers have fixed address intervals: {ICx0+0, ICx0+1, ..., ICx0+(CC-1)}, {ICx1+0, ICx1+1, ..., ICx1+(CC-1)}, ..., {ICx(CR-1)+0, ICx(CR-1)+1, ..., ICx(CR-1)+(CC-1)}. The number of addresses inside each brace is the number of convolution kernel columns, and the number of braces is the number of kernel rows. The fixed address intervals of the two shift buffers are connected to the multiply-add units, forming the abstract convolution operation IP core.
The convolution process runs from the moment the first feature point of the input feature map enters the input feature map buffer until the last feature point enters; equivalently, the convolution kernel shifts from the upper-left corner to the lower-right corner of the feature map, scanning the whole input feature map.
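A hedged Verilog sketch of this shift buffer with its fixed-address taps follows; the parameter defaults and the flattened window bus are assumptions, not the patent's actual code:

```verilog
// Sketch of the input feature map shift buffer: depth (CR-1)*IC+CC,
// with the CR x CC window taken from fixed address intervals IC*r+c.
module fmap_linebuf #(
    parameter WIDTH = 16,
    parameter IC    = 28,   // input feature map columns
    parameter CR    = 5,    // kernel rows
    parameter CC    = 5     // kernel columns
)(
    input  wire                   clk,
    input  wire                   en,      // stream one feature point per cycle
    input  wire [WIDTH-1:0]       din,
    output wire [CR*CC*WIDTH-1:0] window   // CR x CC taps to the multipliers
);
    localparam DEPTH = (CR-1)*IC + CC;
    reg [WIDTH-1:0] buf_r [0:DEPTH-1];
    integer i;

    always @(posedge clk) if (en) begin
        buf_r[0] <= din;
        for (i = 1; i < DEPTH; i = i + 1)
            buf_r[i] <= buf_r[i-1];        // shift the data stream
    end

    genvar r, c;
    generate
        for (r = 0; r < CR; r = r + 1) begin : g_row
            for (c = 0; c < CC; c = c + 1) begin : g_col
                // fixed address interval IC*r+c feeds one multiplier
                assign window[(r*CC+c)*WIDTH +: WIDTH] = buf_r[IC*r + c];
            end
        end
    endgenerate
endmodule
```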
The bubble method convolutional layer needs to be matched with a bubble method controller, and details of the bubble method timing control logic are described with reference to fig. 3:
Referring to fig. 3(a), the three clock sequences of the bubbling-method controller are a read input feature map sequence, a read weight parameter sequence, and a convolution process sequence. The waiting intervals of the three clock sequences are the time during which the input feature map data and the weight parameter data respectively enter the input feature map buffer and the weight parameter buffer inside the convolution operation IP core; the convolution results in these intervals are invalid. This process delays clk_wait = (CR-1)xIC + CC - 1 clock cycles. The holding interval of the read weight parameter sequence means the weight parameter buffer has been filled with weight parameter data and is held fixed; during this interval the input feature map data stream passes through the input feature map buffer, which is the actual convolution processing, and the convolution results of this interval are partially valid.
The valid interval of the convolution process sequence indicates that the output feature map data are valid, corresponding to the blank valid output feature map data of fig. 3(b), which are stored immediately; per row this takes clk_val = IC - CC + 1 clock cycles. The invalid interval of the convolution process sequence indicates that the output feature map data are invalid, i.e., the invalid data produced when the convolution kernel crosses rows, corresponding to the grey invalid output feature map data in fig. 3(b), which are filtered out; per row this takes clk_unval = CC - 1 clock cycles. If the convolution process is regarded as liquid and the invalid intervals as gas, keeping only the valid results is equivalent to discharging excess gas from the liquid, hence the name bubbling. The total convolution delay is:
clk_conv_total = clk_wait + {[(IC-CC+1) + (CC-1)]x(IR-CR) + (IC-CC+1)} = ICxIR clock cycles.
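The delay bookkeeping above can be expressed as a small valid-flag generator. The following Verilog sketch is an assumption-laden illustration: names are invented, the modulo is written for clarity (a synthesizable design would keep a column counter), and end-of-frame handling after ICxIR cycles is omitted:

```verilog
// Sketch of the bubbling valid flag for the convolution stream.
// After the initial wait of (CR-1)*IC+CC-1 cycles, each image row
// yields IC-CC+1 valid results followed by CC-1 invalid "bubbles".
module conv_bubble_ctrl #(
    parameter IC = 28,  // input feature map columns
    parameter CR = 5,   // kernel rows
    parameter CC = 5    // kernel columns
)(
    input  wire clk,
    input  wire rst,
    output wire valid   // high when the convolution result is valid
);
    localparam WAIT = (CR-1)*IC + CC - 1;
    reg  [31:0] cnt;                        // cycles since start
    wire [31:0] col = (cnt - WAIT) % IC;    // position within the row

    always @(posedge clk)
        if (rst) cnt <= 32'd0;
        else     cnt <= cnt + 32'd1;

    // Valid once the buffers are full and while the kernel window
    // does not cross a row boundary.
    assign valid = (cnt >= WAIT) && (col <= IC - CC);
endmodule
```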
Referring to fig. 4, the feature map storage module has registered the input feature map of the current convolutional layer, whose depth is ID. The hardware structure of a single convolution kernel (dashed frame) is constructed by instantiating ID general convolution operation IP cores; the n convolution kernels of the convolutional layer are then instantiated to increase the number of channels, achieving kernel-parallel convolution operation. The instantiated convolution operation IP cores, convolution kernels, and bubbling-method controller construct CNN convolutional layers of different sizes. The feature map produced by the convolution operation is again stored in the feature map storage module.
Referring to fig. 5, the hardware structure of the pooling operation IP core is similar to that of the convolution operation IP core, and the output feature map data of the previous layer stored in the feature map storage module is first read into the input feature map buffer of the pooling operation IP core in a data stream manner, where the buffer is a register group with configurable depth, and the depth is IC × (CR-1) + CC. Parameters IC and CR of the pooling operation IP core are manually modified to adapt to different input characteristic diagram sizes, so that the convolutional neural network building with different scales is supported.
When the input feature map data stream continuously flows into the input feature map buffer, the sliding translation of the pooling operation IP core is realized. The invention defines the size of the pooling filter to be 2 x 2, the step size is 2, so the fixed address interval of the shift register is: { IC × 0+0, IC × 0+1}, { IC × 1+0, IC × 1+1} are connected to the input ports of the comparators. The comparator takes the 4 maximum data values as the characteristic points of the output characteristic diagram and stores the characteristic points in the characteristic diagram storage module.
The bubbling-method pooling layer must be paired with a bubbling-method controller; the bubbling timing control logic of the pooling layer is detailed with reference to fig. 6:
Referring to fig. 6, the bubbling timing control logic of the pooling layer has 4 clock sequences: a read input feature map sequence, a column valid sequence, a row valid sequence, and a pooling process sequence. The pooling layer reads the input feature map data row by row from the feature map storage module into the pooling operation IP core. The waiting intervals of the 4 clock sequences indicate that the input feature map data stream is filling the buffer inside the pooling operation IP core, delaying clk_wait = (CR-1)xIC + CC - 1 clock cycles.
Referring to fig. 6, reading the input signature sequence shaded region indicates that the input signature data has filled the buffer of the IP core of the pooling operation, and the pooling process begins. The bubble method controller makes the input characteristic diagram input a data column by column and row by row into the buffer memory of the pooling operation IP core according to a clock cycle, and this process actually causes the covered part of the filter to move by step size 1, which is not in accordance with the present invention. Therefore, the column valid sequence selects or rejects the pooled output data, starting from the buffer of the IP core for which the input feature map data fills up the pooled operation, the first result is valid data of the output feature map, the result obtained after the next clock cycle enters the buffer is invalid data, and so on. The column valid sequence shaded region represents column valid data of the output feature map.
The step size of the pooling filter is 2, when the buffer shifts to store data according to the original clock period, the step size after crossing rows of the covered part of the filter is actually 1, so the output characteristic diagram data of the next row after crossing rows is invalid data, refer to the blank area in the valid sequence of row 6.
The sequence of pooling processes is similar to the sequence of convolution processes, and referring to fig. 6, the pooling process is regarded as liquid, the invalid data in the column valid sequence and the invalid data in the row valid sequence, i.e., the blank space, are regarded as gas, and the pooling process is equivalent to a process of exhausting air in the liquid, and is therefore called bubbling. The shaded regions of the pooling process sequence represent valid data, which is stored in the feature map storage module.
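Analogously to the convolution case, the column and row valid decisions can be sketched as counter parity checks. The following Verilog is illustrative only; the names, the division/modulo style (for clarity, not synthesis), and the default sizes are assumptions:

```verilog
// Sketch of the pooling bubbling flags: 2x2 filter, stride 2, so a
// window result is kept only on even columns and even rows and when
// the window does not cross a row boundary.
module pool_bubble_ctrl #(
    parameter IC = 24,  // input feature map columns
    parameter CR = 2,   // pooling filter rows
    parameter CC = 2    // pooling filter columns
)(
    input  wire clk,
    input  wire rst,
    output wire keep    // high for pooling results that are stored
);
    localparam WAIT = (CR-1)*IC + CC - 1;
    reg  [31:0] cnt;
    wire [31:0] pos = cnt - WAIT;
    wire [31:0] col = pos % IC;   // column valid sequence: even columns
    wire [31:0] row = pos / IC;   // row valid sequence: even rows

    always @(posedge clk)
        if (rst) cnt <= 32'd0;
        else     cnt <= cnt + 32'd1;

    assign keep = (cnt >= WAIT)
               && (col[0] == 1'b0)     // column stride 2
               && (row[0] == 1'b0)     // row stride 2
               && (col <= IC - CC);    // drop row-crossing windows
endmodule
```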
For the hardware structure of bubbling-method pooling, refer to fig. 7: the bubbling-method pooling layer comprises a pooling core and a bubbling-method controller. ID pooling operation IP cores are instantiated to construct the pooling filter, and a bubbling-method controller is instantiated to control the pooling filter's processing of the input feature map. A CNN pooling layer has only one pooling filter, but the number of pooling operation IP cores composing the filter is not fixed: a pooling layer generally follows a convolutional layer, and the depth of the output feature map produced by the convolutional layer varies, so the number of pooling operation IP cores in the filter must be modified manually.
The instantiated pooling operation IP cores, pooling filter, and bubbling-method controller construct CNN pooling layers of different sizes. The pooled feature map is again stored in the feature map storage module.
The last layer of a CNN model is usually a fully-connected layer; the algorithm model of the fully-connected layer is shown in fig. 8. FIG. 8(a) shows a single-layer fully-connected neural network with n input nodes and m output nodes, where x represents the input node values and y represents the output node values. FIG. 8(b) takes the y1 output node as an example; the numerical result of any output node j is calculated by the following formula:
yj = x1xw1j + x2xw2j + ... + xnxwnj + biasj
In the CNN feedforward process, the fully-connected layer functions as a "classifier". Operations such as the convolutional and pooling layers map the raw data to a feature space, and the fully-connected layer maps the learned "distributed feature representation" to the sample label space. The result of the fully-connected layer is put through a softmax function for probability distribution calculation to obtain an effective image classification, which realizes the final image recognition.
Fig. 9 specifically illustrates an internal hardware structure of the fully-connected arithmetic IP core.
The fully-connected operation IP core is responsible for the operations on the weight parameters and the input node data, and outputs the operation result, the value of an output node, through the Outnode port. The fully-connected operation IP core comprises a counter, an accumulator, a multiplier, and an adder. Each input node has a one-to-one weight parameter with respect to each output node, and each output node has a one-to-one bias parameter. With reference to fig. 8(b), the value of the first output node is calculated by the one-to-one multiply-add of the n input nodes with the first n of the nxm weight parameters, plus the bias; similarly, the value of the second output node is calculated from the n input nodes and the second n of the nxm weight parameters, plus the bias.
Referring to fig. 9, n data of the input nodes need to be read circularly m times, and the weight parameter data are read sequentially, so that the data of each input node and the weight parameter can be ensured to be in one-to-one correspondence. The first clock period full-connection operation IP core inputs a first pair of input node data x through f data and w _ data ports respectively1And weight parameter w1,x1And w1Firstly, performing multiplication calculation and inputting a result into an accumulator for temporary storage; second clock cycle input x2And w2Performing multiplication calculation and adding x in accumulator1×w1And accumulating the result, and temporarily storing the accumulated result in an accumulator. By analogy, after n clock cycles, the accumulator completes n times of multiplication accumulation operation, the counter controls the gate switch to open and output the accumulation result, and then the accumulation result is added with the Bias parameter (Bias), and the final result is output through the egress port. The result is calculated from the following formula:
y1=x1×w1+x2×w2+...+xn×wn+bias1
n clock cycles may result in the first of the output nodes. Similarly, the second output node needs to read in the input node data and the weight parameter again, and the result of the second output node is calculated through n clock cycles. Since the number of output nodes is m, it takes n × m clock cycles to obtain the results of all output nodes.
The fully-connected operation IP core is responsible for taking in the input node data and the corresponding weight parameters and computing; among the final m output node data, only the magnitudes need to be compared, which finally completes the feedforward process of the fully-connected layer.
Referring to fig. 10, a hardware structure of the fully-connected layer controller is specifically described, and details of a sequential control logic of the fully-connected layer controller are described in conjunction with fig. 8:
the full-connection layer controller is responsible for the operations of data, weight parameters, bias parameters, memory access and the like of the input nodes. The hardware structure of the fully-connected layer controller is shown in fig. 10. The input characteristic diagram address reading device and the weight parameter address reading device output an address signal in each clock cycle, respectively read one data from the characteristic diagram storage module and the parameter storage module and simultaneously enter the full-connection operation IP core. The offset parameter reading addresser needs to be controlled by a counter, and after all input node parameters and weight parameters of the first output node are read, the offset parameter corresponding to the first output node is read.
With reference to fig. 8, it is assumed that the number of input nodes of the full connection layer is n, the number of output nodes is m, the cycle period of the input feature map address reading device is n clock cycles, the cycle period of the weight parameter address reading device is n × m clock cycles, and the offset parameter address reading device reads one data for every n clock cycles. The time required for the full link layer to complete the computation is n × m clock cycles.
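A minimal Verilog sketch of these three address generators follows; the port names and widths are assumptions rather than the patent's interface:

```verilog
// Sketch of the three address generators of the fully-connected
// layer controller: the input feature map address cycles every N,
// the weight address runs sequentially to N*M, and the bias address
// advances once per N cycles.
module fc_ctrl #(
    parameter N = 256,  // input nodes
    parameter M = 10    // output nodes
)(
    input  wire                  clk,
    input  wire                  rst,
    output reg [$clog2(N)-1:0]   f_addr,  // input node address
    output reg [$clog2(N*M)-1:0] w_addr,  // weight parameter address
    output reg [$clog2(M)-1:0]   b_addr   // bias parameter address
);
    always @(posedge clk) begin
        if (rst) begin
            f_addr <= 0; w_addr <= 0; b_addr <= 0;
        end else begin
            w_addr <= w_addr + 1;          // weights read sequentially
            if (f_addr == N-1) begin
                f_addr <= 0;               // re-read the n input nodes
                b_addr <= b_addr + 1;      // next output node's bias
            end else begin
                f_addr <= f_addr + 1;
            end
        end
    end
endmodule
```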
Referring to fig. 11, the instantiated fully-connected operation IP cores and fully-connected layer controller construct CNN fully-connected layers of different sizes. Fig. 11(a) is a schematic diagram of a single-layer fully-connected neural network, and fig. 11(b) is the hardware structure diagram of that layer. Here x1, x2, ..., xn denote the n input nodes of the fully-connected layer, y1 denotes one of the output nodes, and w1, w2, ..., wn denote the weight parameters used to compute the y1 output node.
In the CNN network structure, the layer above the fully-connected layer is usually a pooling or convolutional layer, so the two- or three-dimensional output feature map must be converted into a one-dimensional data stream for storage. The fully-connected layer controller cyclically reads the n input feature map data from the feature map storage module into the fully-connected operation IP core through the f_data port. Meanwhile, the nxm weight parameters and m bias parameters of the fully-connected layer are reshaped into one-dimensional matrices of shape (nxm, 1) and (m, 1); the fully-connected layer controller reads the weight and bias parameter data sequentially from the parameter memory into the fully-connected operation IP core through the w_data and Bias ports, respectively.
A fully-connected neural network needs a softmax layer for probability distribution calculation during training, but in the CNN feedforward process different pictures can be classified simply by comparing the magnitudes of the results, so no softmax layer is needed in the hardware structure.
Through the steps, a convolutional neural network hardware structure of any scale can be constructed by using the IP core of the convolutional neural network based on the FPGA to accelerate the feedforward process of the convolutional neural network, but a peripheral module is still required to support the convolutional neural network accelerator. The peripheral module comprises a feature map storage module and a parameter storage module.
The feature map storage module caches the global feature maps of all dimensions; its internal structure consists of shift buffers built from register groups with configurable depth. A single shift buffer caches one unit-depth two-dimensional feature map of the three-dimensional feature map; the buffer depth is the number of feature map rows multiplied by the number of columns, and the number of buffers is the depth of the three-dimensional feature map. The input port of the feature map storage module is connected point-to-point with the address ports of each layer's controller, and the output port is connected with the input ports of each neural network layer's IP cores.
The parameter storage module is composed of a read-only memory, weight parameters and bias parameters of each neural network layer trained by a PC end are stored in the parameter storage module as initial files of the read-only memory, an input port of the parameter storage module is connected with an address port corresponding to each layer of controller to access data, and an output port of the parameter storage module is connected with a corresponding data input port of each layer of IP core to read data.
The constructed convolutional neural network hardware accelerator requires an external Micro Control Unit (MCU) to regulate and control the convolution process. The MCU control system is described in detail with reference to fig. 1:
TABLE 2 MCU control states

| State | Description |
| --- | --- |
| IDLE | Idle state, also the initial state of the state machine in the MCU |
| Start | Start state; zeroes the registers in each convolutional layer and the feature map storage module |
| Conv1 | The first convolutional layer's enable signal is pulled high to perform the convolution operation |
| Pool1 | The first convolutional layer's enable signal is pulled low and the enable signal of the pooling layer following it is pulled high to perform the pooling operation |
| FCN | The enable signals of the convolutional and pooling layers in the neural network are pulled low and the fully-connected layer's enable signal is pulled high to perform the fully-connected operation |
Table 3 shows the MCU state jump conditions and assembly instruction protocol

| Present state | Next state | Jump condition |
| --- | --- | --- |
| IDLE | Start | The external switch of the FPGA is pulled high, indicating that an image begins to enter the convolutional neural network accelerator |
| Start | Conv1 | On-chip storage completes binarization and storage of the image; the assembly instruction is [00...0001], whose bit width is the number of neural network layers |
| Conv1 | Pool1 | The convolution operation completes after ICxIR clock cycles; the assembly instruction is [00...0010] |
| Pool1 | Conv2 | The pooling operation completes; the assembly instruction is [00...0100] |
| Conv2 | Pool2 | The convolution operation completes after ICxIR clock cycles; the assembly instruction is [00...1000] |
| ... | ... | ... |
| Poolx | FCN | After the last pooling layer finishes, the fully-connected layer operation proceeds; the assembly instruction is [10...0000] |
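The state table above suggests a simple one-hot sequencer. The following Verilog sketch is an assumption-based illustration of such an MCU state machine, not the patent's actual controller; the Start state's register clearing is omitted for brevity:

```verilog
// Sketch of the MCU layer-sequencing state machine: a one-hot enable
// word (the "assembly instruction") with one bit per network layer,
// shifted left as each layer reports completion.
module cnn_mcu #(
    parameter LAYERS = 5
)(
    input  wire              clk,
    input  wire              rst,
    input  wire              switcher,    // external start switch
    input  wire [LAYERS-1:0] layer_done,  // per-layer completion flags
    output reg  [LAYERS-1:0] layer_en     // one-hot layer enables
);
    localparam IDLE = 1'b0, RUN = 1'b1;
    reg state;

    always @(posedge clk) begin
        if (rst) begin
            state    <= IDLE;
            layer_en <= {LAYERS{1'b0}};
        end else case (state)
            IDLE: if (switcher) begin
                state    <= RUN;
                layer_en <= {{LAYERS-1{1'b0}}, 1'b1};  // [00..0001] = Conv1
            end
            RUN: if (|(layer_en & layer_done)) begin
                if (layer_en[LAYERS-1]) begin          // FCN finished
                    state    <= IDLE;
                    layer_en <= {LAYERS{1'b0}};
                end else begin
                    layer_en <= layer_en << 1;         // enable next layer
                end
            end
        endcase
    end
endmodule
```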
The present invention takes the MNIST handwritten digit set as the verification data set; the CNN model structure is given in Table 4. Prototype verification of delay time, performance, and efficiency was carried out on an Intel Xeon E3-1230 V2 CPU, an NVIDIA Quadro 4000 GPU, and a De2i-150 FPGA, respectively.
TABLE 4 CNN model

| Neural network layer | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| ID | 1 | 3 | 3 | 6 | - |
| n | 3 | 1 | 6 | 1 | - |
| IC=IR | 28 | 24 | 12 | 8 | - |
| CC=CR | 5 | 2 | 5 | 2 | - |
| Step size | 1 | 2 | 1 | 2 | - |
| Layer type | Convolution | Max pooling | Convolution | Max pooling | Fully connected |
TABLE 5 Convolutional neural network feedforward delay comparison across hardware platforms

| Neural network layer | This design | GPU | CPU |
| --- | --- | --- | --- |
| First layer (proportion) | 7.84x10^-6 s (31.0%) | 8.52x10^-3 s (31.4%) | 1.13x10^-2 s (30.4%) |
| Second layer (proportion) | 5.76x10^-6 s (22.8%) | 6.63x10^-3 s (24.4%) | 1.25x10^-2 s (33.7%) |
| Third layer (proportion) | 1.44x10^-6 s (5.7%) | 9.00x10^-3 s (33.2%) | 0.74x10^-2 s (19.9%) |
| Fourth layer (proportion) | 0.64x10^-6 s (2.5%) | 2.58x10^-3 s (9.5%) | 0.56x10^-2 s (15.1%) |
| Fifth layer (proportion) | 9.60x10^-6 s (38.0%) | 0.40x10^-3 s (1.5%) | 0.03x10^-2 s (0.9%) |
| Total time | 25.28x10^-6 s (100%) | 27.13x10^-3 s (100%) | 3.71x10^-2 s (100%) |
| Ratio | 1.00x | 1073.18x | 1468.35x |
TABLE 6 Performance of the accelerator composed of FPGA-based convolutional neural network IP cores

TABLE 7 Power consumption and efficiency of the accelerator composed of FPGA-based convolutional neural network IP cores
Referring to fig. 12, the power of a hardware accelerator built from the FPGA-based convolutional neural network IP cores was measured on the FPGA, using a TECMAN model TM9 miniature power meter. The FPGA consumes 1.69 W when performing image recognition on a single picture, corresponding to the data in Table 7.

Claims (7)

1. The convolutional neural network IP core based on the FPGA is characterized in that the specific IP core and the composition module comprise a convolutional operation IP core, a pooling operation IP core, a full-connection operation IP core, a bubbling method convolutional layer, a bubbling method pooling layer, a full-connection layer, a feature map storage module and a parameter storage module; the neural network layer formed by each IP core is internally interconnected with the parameter storage module and the characteristic map storage module, and a hardware structure formed by the neural network layer, the parameter storage module and the characteristic map storage module is consistent with a required convolutional neural network algorithm structure;
the bubbling method pooling layer comprises a bubbling method controller and a pooling operation IP core:
1) the four clock sequences of the bubble method controller are respectively a read input characteristic diagram sequence, a column effective sequence, a row effective sequence and a pooling process sequence; the waiting intervals of the four clock sequences are the time when the input feature map data enter the input feature map buffer of the pooling operation IP core, and the interval pooling result is invalid; each effective pooling result of the column effective sequence is separated by an ineffective pooling result, which indicates that the step length of the pooling filter column is 2; the result of each effective line of the line effective sequence is separated by the result of one ineffective line, and the step length of the pooling filtering line is 2; assuming that the pooling process is liquid and the ineffective pooling results of the column active sequences and the row active sequences are gas, the pooling process is equivalent to discharging excessive gas from the liquid, so the method is called bubbling;
2) stacking pooling operation IP cores with the same number as the depth of the input feature map to form a single pooling core of a pooling layer, and forming a bubbling method pooling layer together with a bubbling method controller;
the bubbling method rolling layer can be obtained by the same method as the bubbling method pooling layer.
2. The FPGA-based convolutional neural network IP core of claim 1, wherein the convolutional operation IP core comprises an input feature map buffer, a weight parameter buffer, a multiplier, an adder, and an activation function module:
1) the convolution operation IP core reads the feature points in the feature map one by one according to rows in each clock cycle to form a data stream; the input characteristic diagram buffer is a register group with configurable depth, is used for shifting and buffering the data stream of the input characteristic diagram, modifies the depth of the register group according to the number of rows and columns of the input characteristic diagram, supports convolutional neural networks with different scales, and the input characteristic diagram data of a fixed address interval is connected with a multiplier;
2) the weight parameter buffer is a register group with configurable depth and is used for shifting and buffering weight parameters, and the weight parameter buffer is fixed after the register group is filled with the weight parameters; the hardware structure of the weight parameter buffer is the same as that of the input characteristic diagram buffer, namely, the weight parameter data of the same fixed address interval is connected with the multiplier;
3) the multiplier and the adder connected with the input feature map buffer and the weight parameter buffer fixed address interval form a multiplication-addition pair, and the feature points in the feature map and the corresponding weight parameters form complete convolution operation;
4) the convolution operation IP core adopts a common ReLU activation function, and the activation function module is a MUX multiplexer, which is equivalent to the formula f(x) = max(0, x).
3. The FPGA-based convolutional neural network IP core of claim 1, wherein the pooling operation IP core comprises an input profile buffer and a comparator:
1) the input characteristic diagram buffer is a register group with configurable depth, is used for shifting and buffering the input characteristic diagram data stream, can modify the depth of the register group according to the number of rows and columns of the input characteristic diagram to support convolutional neural networks with different scales, and the input characteristic diagram data of a fixed address interval is connected with the comparator;
2) the input port of the comparator is connected with 4 fixed address intervals of the input feature map buffer, and the output port is the maximum value of the 4 data.
4. The FPGA-based convolutional neural network IP core of claim 1, wherein the fully-connected arithmetic IP core comprises a counter, an accumulator, a multiplier and an adder:
1) the multiplier inputs the input node data and the weight parameters which correspond one to one respectively to complete the multiplication operation of the input node and the weight parameters;
2) the accumulator is composed of a register and an adder, and the operation result of the multiplier is input into the register of the accumulator and is accumulated with the multiplication result of the next clock cycle to form multiply-accumulate operation;
3) the counter controls the iteration period of the multiply-accumulate operation, and after the iteration period is completed, the counter controls the addition of the multiply-accumulate result and the offset parameter corresponding to the output node.
5. The FPGA-based convolutional neural network IP core of claim 1, wherein the bubbling convolutional layer comprises a bubbling controller and convolution operation IP cores:
1) the bubbling controller follows three clock sequences: a read-input-feature-map sequence, a read-weight-parameter sequence and a convolution-process sequence; the waiting interval of the read-input-feature-map sequence and the read-weight-parameter sequence is the time during which the input feature map data and the weight parameter data enter, respectively, the input feature map buffer and the weight parameter buffer of the convolution operation IP core, and the convolution results in this interval are invalid; in the holding interval of the read-weight-parameter sequence, the weight parameter buffer is filled with weight parameter data and held for a fixed time while the input feature map data stream passes through the input feature map buffer; this holding interval is the actual convolution process, and its convolution results are partially valid; the waiting interval of the convolution-process sequence is the same as that of the read-input-feature-map and read-weight-parameter sequences; a valid interval of the convolution process indicates that the output feature map data are valid, i.e. valid results produced by a convolution window that does not cross rows; an invalid interval indicates that the output feature map data are invalid, i.e. invalid results produced by a convolution window that crosses rows; if the convolution process is regarded as a liquid and the invalid intervals as gas, keeping only the valid results is equivalent to expelling excess gas from the liquid, hence the name "bubbling";
2) convolution operation IP cores equal in number to the depth of the input feature map are stacked to form a single convolution kernel of the convolutional layer; the convolution kernels required by the convolutional layer are instantiated and, together with the bubbling controller, constitute the bubbling convolutional layer.
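The essence of the "bubbling" scheme of claim 5 is a flag that marks cross-row windows as invalid so they can be squeezed out of the result stream. A minimal sketch of such a valid-flag generator follows; the column counter, the parameter values and the port names are our assumptions, and pipeline start-up (the first kernel-height rows) is not handled.

```verilog
// Hypothetical sketch of the valid-flag ("bubbling") logic of claim 5.
// Streaming a row-major feature map through a shift-register window yields
// K-1 junk results wherever the window straddles two rows; a column counter
// flags them invalid so downstream logic can squeeze these bubbles out.
module bubble_valid #(
    parameter ROW_W = 28,  // feature-map row length (assumption)
    parameter K     = 3    // convolution kernel width (assumption)
) (
    input  wire clk,
    input  wire rst,
    input  wire conv_en,      // a new convolution result appears this cycle
    output wire result_valid  // high only for windows inside one row
);
    reg [$clog2(ROW_W)-1:0] col;  // column of the window's rightmost pixel

    always @(posedge clk) begin
        if (rst)
            col <= 0;
        else if (conv_en)
            col <= (col == ROW_W-1) ? 0 : col + 1;
    end

    // A window whose left edge falls in the previous row is a "bubble".
    assign result_valid = conv_en && (col >= K-1);
endmodule
```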
6. The FPGA-based convolutional neural network IP core of claim 1, wherein the fully-connected layer comprises a fully-connected layer controller and a fully-connected operation IP core:
1) the fully-connected layer controller comprises a weight parameter read addresser, a bias parameter read addresser, a counter and an input feature map read addresser; the input feature map read addresser and the weight parameter read addresser output an address signal in each clock cycle, so that the one-to-one corresponding input node data and weight parameter data are read and fed into the fully-connected operation IP core; the bias parameter read addresser is controlled by the counter to read the bias parameter corresponding to an output node once the multiply-accumulate for that output node is completed;
2) the control ports of the fully-connected layer controller are connected point-to-point with the corresponding ports of the feature map storage module and the parameter storage module, and together with the fully-connected operation IP core they form the fully-connected layer.
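Claim 6's controller is essentially three address generators tied to one counter. The sketch below illustrates one plausible arrangement; N_IN, N_OUT and every port name are assumptions rather than the patent's actual interface.

```verilog
// Hypothetical sketch of the fully-connected layer controller of claim 6:
// node and weight read addressers step together every clock cycle, and a
// counter-driven bias addresser advances once per finished output node.
module fc_controller #(
    parameter N_IN  = 1024,  // input nodes per output node (assumption)
    parameter N_OUT = 10     // number of output nodes (assumption)
) (
    input  wire                          clk,
    input  wire                          rst,
    input  wire                          run,
    output reg  [$clog2(N_IN)-1:0]       node_addr,    // feature-map address
    output reg  [$clog2(N_IN*N_OUT)-1:0] weight_addr,  // weight address
    output reg  [$clog2(N_OUT)-1:0]      bias_addr,    // bias address
    output wire                          last_in       // ends one iteration
);
    assign last_in = (node_addr == N_IN-1);

    always @(posedge clk) begin
        if (rst) begin
            node_addr <= 0; weight_addr <= 0; bias_addr <= 0;
        end else if (run) begin
            weight_addr <= weight_addr + 1;   // one weight per cycle
            if (last_in) begin
                node_addr <= 0;               // re-sweep the input nodes
                bias_addr <= bias_addr + 1;   // bias of the next output node
            end else begin
                node_addr <= node_addr + 1;
            end
        end
    end
endmodule
```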
7. The FPGA-based convolutional neural network IP core of claim 1, wherein the feature map storage module comprises one global feature map buffer for each dimension, internally structured as a shift buffer; the buffer depth is the total number of feature points of the feature map, i.e. rows x columns, and the number of buffers equals the depth of the three-dimensional feature map; the input ports of the feature map storage module are connected point-to-point to the address ports of the controller of each layer, and the output ports are connected to the input ports of the IP core of each neural network layer.
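Claim 7's per-channel global buffer can likewise be read as one long shift register per channel. The following sketch (hypothetical interface; DEPTH is rows x columns of one channel) shows the idea; a real implementation would more likely map this storage to block RAM than to discrete registers.

```verilog
// Hypothetical sketch of one per-channel global feature-map buffer of
// claim 7: a shift buffer whose depth is rows x columns of one channel.
// One instance is made per channel of the 3-D feature map; the oldest
// entry streams out to the next layer's IP core while new data shifts in.
module fmap_shift_buffer #(
    parameter DATA_W = 16,     // fixed-point word width (assumption)
    parameter DEPTH  = 28*28   // rows x columns of one channel (assumption)
) (
    input  wire              clk,
    input  wire              shift_en,
    input  wire [DATA_W-1:0] din,
    output wire [DATA_W-1:0] dout
);
    reg [DATA_W-1:0] mem [0:DEPTH-1];
    integer i;
    always @(posedge clk) begin
        if (shift_en) begin
            for (i = DEPTH-1; i > 0; i = i - 1)
                mem[i] <= mem[i-1];
            mem[0] <= din;
        end
    end
    assign dout = mem[DEPTH-1];
endmodule
```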
CN201910038533.1A 2019-01-16 2019-01-16 Convolutional neural network IP core based on FPGA Active CN109784489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910038533.1A CN109784489B (en) 2019-01-16 2019-01-16 Convolutional neural network IP core based on FPGA

Publications (2)

Publication Number Publication Date
CN109784489A (en) 2019-05-21
CN109784489B (en) 2021-07-30

Family

ID=66500538

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN110543939B (en) * 2019-06-12 2022-05-03 电子科技大学 Hardware acceleration realization device for convolutional neural network backward training based on FPGA
TWI719512B (en) * 2019-06-24 2021-02-21 瑞昱半導體股份有限公司 Method and system for algorithm using pixel-channel shuffle convolution neural network
CN110399591B (en) * 2019-06-28 2021-08-31 苏州浪潮智能科技有限公司 Data processing method and device based on convolutional neural network
CN110489077B (en) * 2019-07-23 2021-12-31 瑞芯微电子股份有限公司 Floating point multiplication circuit and method of neural network accelerator
CN110390392B (en) * 2019-08-01 2021-02-19 上海安路信息科技有限公司 Convolution parameter accelerating device based on FPGA and data reading and writing method
CN110472442A (en) * 2019-08-20 2019-11-19 厦门理工学院 A kind of automatic detection hardware Trojan horse IP kernel
CN112506087A (en) * 2019-09-16 2021-03-16 阿里巴巴集团控股有限公司 FPGA acceleration system and method, electronic device, and computer-readable storage medium
CN110717588B (en) * 2019-10-15 2022-05-03 阿波罗智能技术(北京)有限公司 Apparatus and method for convolution operation
CN110782022A (en) * 2019-10-31 2020-02-11 福州大学 Method for implementing small neural network for programmable logic device mobile terminal
CN110780923B (en) * 2019-10-31 2021-09-14 合肥工业大学 Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN110991632B (en) * 2019-11-29 2023-05-23 电子科技大学 Heterogeneous neural network calculation accelerator design method based on FPGA
US20210174198A1 (en) * 2019-12-10 2021-06-10 GM Global Technology Operations LLC Compound neural network architecture for stress distribution prediction
CN112966807B (en) * 2019-12-13 2022-09-16 上海大学 Convolutional neural network implementation method based on storage resource limited FPGA
CN111210019B (en) * 2020-01-16 2022-06-24 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111242289B (en) * 2020-01-19 2023-04-07 清华大学 Convolutional neural network acceleration system and method with expandable scale
CN111242295B (en) * 2020-01-20 2022-11-25 清华大学 Method and circuit capable of configuring pooling operator
EP4094194A1 (en) 2020-01-23 2022-11-30 Umnai Limited An explainable neural net architecture for multidimensional data
CN111339027B (en) * 2020-02-25 2023-11-28 中国科学院苏州纳米技术与纳米仿生研究所 Automatic design method of reconfigurable artificial intelligent core and heterogeneous multi-core chip
CN111325327B (en) * 2020-03-06 2022-03-08 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
CN111427838B (en) * 2020-03-30 2022-06-21 电子科技大学 Classification system and method for dynamically updating convolutional neural network based on ZYNQ
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN111582451B (en) * 2020-05-08 2022-09-06 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111860781A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 Convolutional neural network feature decoding system realized based on FPGA
CN111914999B (en) * 2020-07-30 2024-04-19 云知声智能科技股份有限公司 Method and equipment for reducing calculation bandwidth of neural network accelerator
CN111797982A (en) * 2020-07-31 2020-10-20 北京润科通用技术有限公司 Image processing system based on convolution neural network
CN112100118B (en) * 2020-08-05 2021-09-10 中科驭数(北京)科技有限公司 Neural network computing method, device and storage medium
CN111931925B (en) * 2020-08-10 2024-02-09 西安电子科技大学 Acceleration system of binary neural network based on FPGA
CN112346704B (en) * 2020-11-23 2021-09-17 华中科技大学 Full-streamline type multiply-add unit array circuit for convolutional neural network
CN112435270B (en) * 2020-12-31 2024-02-09 杭州电子科技大学 Portable burn depth identification equipment and design method thereof
CN112732638B (en) * 2021-01-22 2022-05-06 上海交通大学 Heterogeneous acceleration system and method based on CTPN network
CN112926733B (en) * 2021-03-10 2022-09-16 之江实验室 Special chip for voice keyword detection
CN113301221B (en) * 2021-03-19 2022-09-09 西安电子科技大学 Image processing method of depth network camera and terminal
CN112905213B (en) * 2021-03-26 2023-08-08 中国重汽集团济南动力有限公司 Method and system for realizing ECU (electronic control Unit) refreshing parameter optimization based on convolutional neural network
CN113065647B (en) * 2021-03-30 2023-04-25 西安电子科技大学 Calculation-storage communication system and communication method for accelerating neural network
CN115145839A (en) * 2021-03-31 2022-10-04 广东高云半导体科技股份有限公司 Deep convolution accelerator and method for accelerating deep convolution by using same
CN113344179B (en) * 2021-05-31 2022-06-14 哈尔滨理工大学 IP core of binary convolution neural network algorithm based on FPGA
CN113298237A (en) * 2021-06-23 2021-08-24 东南大学 Convolutional neural network on-chip training accelerator based on FPGA
CN113762491B (en) * 2021-08-10 2023-06-30 南京工业大学 Convolutional neural network accelerator based on FPGA
CN113778940B (en) * 2021-09-06 2023-03-07 电子科技大学 High-precision reconfigurable phase adjustment IP core based on FPGA
CN113762480B (en) * 2021-09-10 2024-03-19 华中科技大学 Time sequence processing accelerator based on one-dimensional convolutional neural network
WO2023131252A1 (en) * 2022-01-06 2023-07-13 深圳鲲云信息科技有限公司 Data flow architecture-based image size adjustment structure, adjustment method, and image resizing method and apparatus
CN114781629B (en) * 2022-04-06 2024-03-05 合肥工业大学 Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462800A (en) * 2014-04-11 2017-02-22 谷歌公司 Parallelizing the training of convolutional neural networks
CN107392218B (en) * 2017-04-11 2020-08-04 创新先进技术有限公司 Vehicle loss assessment method and device based on image and electronic equipment
CN108229670B (en) * 2018-01-05 2021-10-08 中国科学技术大学苏州研究院 Deep neural network acceleration platform based on FPGA
CN108665059A (en) * 2018-05-22 2018-10-16 中国科学技术大学苏州研究院 Convolutional neural networks acceleration system based on field programmable gate array
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm

Also Published As

Publication number Publication date
CN109784489A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN109784489B (en) Convolutional neural network IP core based on FPGA
CN107578098B (en) Neural network processor based on systolic array
US20190087713A1 (en) Compression of sparse deep convolutional network weights
US20170236053A1 (en) Configurable and Programmable Multi-Core Architecture with a Specialized Instruction Set for Embedded Application Based on Neural Networks
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN110991631A (en) Neural network acceleration system based on FPGA
CN110766127B (en) Neural network computing special circuit and related computing platform and implementation method thereof
Li et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition
CN110543939A (en) hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
Hoffmann et al. A survey on CNN and RNN implementations
Liu et al. FPGA-NHAP: A general FPGA-based neuromorphic hardware acceleration platform with high speed and low power
Xiao et al. FPGA implementation of CNN for handwritten digit recognition
Chen et al. A 67.5 μJ/prediction accelerator for spiking neural networks in image segmentation
Lien et al. Sparse compressed spiking neural network accelerator for object detection
Chen et al. Cerebron: A reconfigurable architecture for spatiotemporal sparse spiking neural networks
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Mukhopadhyay et al. Systematic realization of a fully connected deep and convolutional neural network architecture on a field programmable gate array
CN110738317A (en) FPGA-based deformable convolution network operation method, device and system
CN114462587A (en) FPGA implementation method for photoelectric hybrid computing neural network
Clere et al. FPGA based reconfigurable coprocessor for deep convolutional neural network training
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
Wen FPGA-Based Deep Convolutional Neural Network Optimization Method
Chang et al. A Real-Time 1280×720 Object Detection Chip With 585 MB/s Memory Traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant