CN114819120A - PYNQ platform-based neural network universal acceleration processing method


Info

Publication number
CN114819120A
CN114819120A
Authority
CN
China
Prior art keywords
module
data
arm processor
ddr memory
calculation result
Prior art date
Legal status
Pending
Application number
CN202210180230.5A
Other languages
Chinese (zh)
Inventor
王树龙
孙承坤
薛慧敏
赵银锋
刘钰
马兰
刘红侠
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210180230.5A
Publication of CN114819120A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1668 Details of memory controller
    • G06F 13/1678 Details of memory controller using bus width
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F 5/06 Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
    • G06F 5/065 Partitioned buffers, e.g. allowing multiple independent queues, bidirectional FIFO's

Abstract

The invention relates to the technical fields of artificial intelligence and FPGA design, and in particular to a general acceleration processing method for neural networks based on the PYNQ platform. The invention adds general neural network acceleration to the original PYNQ platform, adopts a parallel scheme across multiple input and output channels, and optimizes the network structure for VGG-16 and tiny-YOLOv3, thereby improving the data processing speed, performance and generality of the neural network while keeping resource consumption and power consumption low, and improving acceleration performance and acceleration efficiency.

Description

PYNQ platform-based neural network universal acceleration processing method
Technical Field
The invention relates to the technical fields of artificial intelligence and FPGA design, and in particular to a general acceleration processing method for neural networks based on the PYNQ platform.
Background
With the rapid development of artificial intelligence technology, neural network algorithms have become a leading research topic. The convolutional neural network (CNN) algorithm has important application value and notable research significance in fields such as image recognition and classification, speech analysis and retrieval, and target detection and monitoring. Neural network computation mostly proceeds layer by layer: neurons within the same layer can be computed in parallel, each layer depends only on the outputs of the previous layer, and data can be reused; these properties are the key factors that make computational acceleration of CNNs possible.
However, as convolutional neural networks have evolved, their hundreds of millions of parameters and operations place ever higher demands on hardware performance. On the one hand, as process technology advances, the computing performance of current acceleration platforms keeps improving, and how to accelerate high-performance convolutional neural network algorithms on these new acceleration platforms still needs further exploration; on the other hand, in research on lightweight convolutional neural networks, how to reduce the amount of computation and data transfer while preserving performance as far as possible is also an important research direction.
The PYNQ platform from Xilinx adopts a Processing System (PS) plus Programmable Logic (PL) architecture. The PS end is compatible with various peripheral interfaces such as GigE, USB and CAN, and the official image already includes drivers for USB devices such as cameras and network cards, which makes later use convenient. The PS end also supports various external storage devices, such as Flash, DRAM and SRAM. The PL end is mainly responsible for high-speed peripheral interfaces, such as PCIe, HDMI, Pmod and audio input. A Linux image is flashed onto the SD card, and the Linux operating system enables joint development between the PL-end FPGA and the host computer, ultimately providing parallel computation, high-speed image processing, hardware-accelerated algorithms, real-time signal processing, high-speed communication and low-latency control.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a PYNQ platform-based neural network universal acceleration processing method.
To achieve this purpose, the invention adopts the following technical scheme.
A PYNQ platform-based neural network universal acceleration processing method comprises the following steps:
Step 1: on the PYNQ platform, the ARM processor at the PS end acquires feature map data and weight data from the host computer and stores them in the DDR memory;
Step 2: the ARM processor configures the CNN registers at the PL end by assignment and provides a lookup table to the nonlinear processing module at the PL end; the DMA memory-copy module at the PL end loads the feature map data and weight data, or the intermediate calculation results, from the DDR memory into the on-chip cache FIFO; under the control of the CNN registers, the convolution module, the pooling module or the nonlinear processing module at the PL end operates on the data in the on-chip cache FIFO to obtain an intermediate calculation result and sends it back to the on-chip cache FIFO; the DMA memory-copy module then stores the calculation result in the on-chip cache FIFO into the DDR memory;
Step 3: step 2 is repeated as required by the computation to obtain the final calculation result;
Step 4: the ARM processor retrieves the final calculation result of the PL end from the DDR memory, performs the probability operation, and transmits the probability result to the host computer.
Furthermore, the ARM processor also accesses the DDR memory through the ARM memory access interface, fetches an intermediate calculation result, performs an auxiliary operation on it, and writes the auxiliary operation result back to the DDR memory as a new intermediate calculation result.
Compared with the prior art, the invention has the following beneficial effects: general neural network acceleration is added to the original PYNQ platform, a parallel scheme across multiple input and output channels is adopted, and the VGG-16 and tiny-YOLOv3 network structures are optimized, so that the data processing speed, performance and generality of the neural network are improved while resource consumption and power consumption remain low, and the acceleration performance and acceleration efficiency are improved.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
FIG. 1 is a diagram illustrating an architecture of a PYNQ platform according to the present invention;
FIG. 2 is a schematic diagram of the convolution module;
FIG. 3 is a schematic diagram of the operation flow of the pooling module;
FIG. 4 is a diagram illustrating the architecture and operation flow of the nonlinear processing module;
FIG. 5 is the target image predicted by VGG16 in the embodiment;
FIG. 6 is the predicted target image obtained from the tiny-YOLOv3 calculation results in the embodiment.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
A PYNQ platform-based neural network universal acceleration processing method comprises the following steps:
Step 1: on the PYNQ platform, the ARM processor at the PS end acquires feature map data and weight data from the host computer and stores them in the DDR memory.
Specifically, referring to FIG. 1, the PYNQ platform includes a processing system PS end, a programmable logic PL end and a DDR memory, where the DDR memory is connected to the processing system PS end and the programmable logic PL end through an AXI4 bus, respectively; the DDR memory is also connected with the ARM processor through an ARM memory access interface;
the PS end of the processing system comprises an ARM processor; the ARM processor comprises a Softmax module and an auxiliary operation module;
the ARM processor is used for acquiring feature diagram data and weight data from the upper computer and storing the feature diagram data and the weight data into the DDR memory through an AXI4 bus; the ARM processor is also used for assigning values to the CNN register through the AXI4 bus; the ARM processor is also used for providing a lookup table for the nonlinear processing module through the AXI4 bus; the ARM processor is also used for acquiring an intermediate calculation result in the DDR memory through the ARM memory access interface and performing auxiliary operation on the intermediate calculation result by using an auxiliary operation module; the ARM processor is also used for carrying out probability operation on the final calculation result by using a Softmax module and transmitting the probability operation result to the upper computer;
the DDR memory is used for storing the characteristic diagram data and the weight data, and temporarily storing an intermediate calculation result of the PL end of the programmable logic and an auxiliary operation result of the ARM processor;
characteristic diagram data storage mode: the feature map is divided into subblocks of Tc × 1, and the subblocks are stored in every 32 channels in the order of C, H, W (since the input feature is a three-dimensional tensor, data of channel 1, width 1, and height 1 are stored, 32-dimensional channel data is stored first, the high dimension is incremented, and then the 32-dimensional channel data is stored until the high dimension reaches the high value of the input feature map, and the wide dimension is incremented, and the above process is repeated until each element of the feature map is stored in the DDR memory). Where Tc is the number of input channels, 32 channel input, and W, H, C are the width and height of the feature map and the channel, respectively. The calculated feature map (i.e. the intermediate calculation result at the PL side and the auxiliary calculation result at the ARM processor) is divided into subblocks of Tk × 1, and feature data is output in the same order of C, H, W. Wherein, Tk is the number of output channels, and is 32-channel output.
Weight data storage format: the convolution kernels are grouped with Tk kernels per group (the last group may contain fewer than Tk); each convolution kernel is divided into sub-blocks of size Tc × 1 × 1, and if the total number of input channels of the convolution kernels is not divisible by 32, zero elements are appended so that it becomes a multiple of 32. The weight elements of the convolution kernels are stored in the DDR memory in the order of input channel Cin, output channel Cout, height H and width W. This storage scheme allows the matrix operation between the input feature map and the weights to be performed along the channel dimension.
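A similarly hedged sketch of the weight packing is given below. It assumes the channel dimensions are placed innermost so that the feature map and weights can be combined along the channel dimension; the exact inner ordering is only one plausible reading of the Cin, Cout, H, W sequence stated above, and the names are illustrative.

# A hedged sketch of the weight packing (assumed ordering, not the patented layout).
import numpy as np

def pack_weights(w_oihw: np.ndarray, tc: int = 32, tk: int = 32) -> np.ndarray:
    """w_oihw: int16 tensor of shape (Cout, Cin, Kh, Kw) -> flat DDR image (illustrative)."""
    cout, cin, kh, kw = w_oihw.shape
    cin_pad = ((cin + tc - 1) // tc) * tc             # zero-pad input channels to a multiple of 32
    cout_pad = ((cout + tk - 1) // tk) * tk           # last output-channel group may be partial
    padded = np.zeros((cout_pad, cin_pad, kh, kw), dtype=np.int16)
    padded[:cout, :cin] = w_oihw
    # (Cout, Cin, Kh, Kw) -> (Cout/Tk, Tk, Cin/Tc, Tc, Kh, Kw) -> group-major layout
    # with the (Tk, Tc) channel blocks innermost.
    blocked = padded.reshape(cout_pad // tk, tk, cin_pad // tc, tc, kh, kw)
    blocked = blocked.transpose(0, 2, 4, 5, 1, 3)     # (OutGrp, InGrp, Kh, Kw, Tk, Tc)
    return np.ascontiguousarray(blocked).reshape(-1)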
The programmable logic PL end comprises a DMA memory-copy module, a convolution module, a pooling module, a nonlinear processing module and the CNN registers; the DMA memory-copy module comprises a read module, a write module and the on-chip cache FIFO; the read module comprises an rd_cmd_fifo buffer unit and a 5-to-1 arbiter; the write module comprises a wr_cmd_fifo buffer unit and a 2-to-1 arbiter; referring to FIG. 2, the convolution module comprises 4 RAMs and a multiply-add operation unit;
the DMA copy memory module is used for loading feature map data, weight data, intermediate calculation results and auxiliary operation results in the DDR memory to an on-chip cache FIFO (first in first out) through an AXI4 bus by using a reading module, namely reading operation; and (3) reading: the rd _ cmd _ FIFO buffer unit stores the read command of each internal bus, and when certain conditions (during data transmission, handshake signals need to be designed on the PL end and the AXI port, namely, data transmission, a read signal is transmitted through the AXI end, the PL end receives the read signal, a receiving control signal is fed back to the AXI end, then the rd _ cmd _ FIFO buffer unit at the PL end reads data of the DDR memory through the AXI4 in order to prevent a read data channel of the AXI4 bus from being blocked due to the full read data FIFO in the module) are met, a request from the rd _ cmd _ FIFO buffer unit is received through a five-selected arbiter, and if the five-selected arbiter permits the request, the request is sent to a read address channel of the AXI4 bus. The ID flag bit of the AXI4 bus read request is changed to a, and the requested data is returned through the read data channel, followed by a flag bit ID with a value of a. When a read request of the nth (N is 0 to 4) internal bus is granted, the DMA access module will send the request to the read address channel of the AXI4 bus, and at the same time, set the flag ID of the request on the read address channel to N. Therefore, it can be known through the ID signal on the AXI4 bus read data channel which is the response to the data requested by which read port, and the DMA access module can send the data to the data return channel of the corresponding port. The DMA copy memory module is also used for storing the intermediate calculation result output to the on-chip cache FIFO by the convolution module or the pooling module or the non-linear processing module into the DDR memory through the AXI4 bus by using the writing module, namely writing operation; and (3) writing: the wr _ cmd _ FIFO buffer stores the write command of the internal bus, when a certain condition is met (a PL end and an AXI port are required to design handshake signals in the data transmission process, namely, when data transmission is carried out, a read signal is transmitted through the AXI port, a receiving control signal is fed back to the AXI end when the PL end receives the read signal, then a wr _ cmd _ FIFO buffer unit of the PL end reads data of a DDR4 through an AXI4 bus, and the aim is to prevent a write data channel of an AXI4 bus from being blocked due to the fact that a write data FIFO in a module is full), the request from the wr _ cmd _ FIFO buffer unit can be received through a designed alternative arbiter, and if the alternative arbiter permits the request, the request can be sent to the write address channel of the AXI4 bus. The write data of the internal bus is buffered by the wr _ dat _ fifo buffer unit, and the buffered data is written into the DDR memory through the write data channel of the AXI4 bus.
The convolution module is used for performing convolution operations and fully connected operations on the data in the on-chip cache FIFO. Referring to FIG. 2, the 4 RAMs of the convolution module implement preloading of the weights and hide the latency required to load them. The multiply-add unit comprises Tc × Tk 16-bit multipliers and (Tc-1) × Tk adders of (32 + log2(Tc)) bits (the extra log2(Tc) bits prevent data overflow during accumulation), and provides a peak compute capability of 2 × Tc × Tk operations per clock cycle, where Tc is the number of input channels and Tk is the number of output channels. Since one array operation requires all the weights in the weight storage area, the RAM can only be realized as a register file, i.e. 2 × Tc × Tk 16-bit registers are needed. The multiply-add array also introduces a very long logic delay, so to shorten the critical path it is organized as a pipeline; with 5 pipeline stages, roughly 5 × Tc × Tk / 2 × (32 + log2(Tc)) bits of pipeline registers are required, because the adder tree compresses the data so that the total bit width shrinks stage by stage, the adder-tree input being Tc × Tk × (32 + log2(Tc)) bits.
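As a worked example of this sizing, the short Python snippet below simply evaluates the formulas above for the Tc = Tk = 32 configuration used elsewhere in the description.

# Worked numbers for the multiply-add array sizing, taking Tc = Tk = 32.
from math import log2

Tc, Tk = 32, 32
multipliers      = Tc * Tk                          # 16-bit multipliers: 1024
adder_width      = 32 + int(log2(Tc))               # 37 bits, guards against overflow
adders           = (Tc - 1) * Tk                    # 37-bit adders: 992
peak_ops_per_clk = 2 * Tc * Tk                      # 2048 multiply-accumulate operations
weight_reg_bits  = 2 * Tc * Tk * 16                 # 2048 x 16-bit weight registers
pipeline_bits    = 5 * Tc * Tk // 2 * adder_width   # roughly 94,720 bits of pipeline registers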
The pooling module is used for pooling the data in the on-chip cache FIFO. Referring to FIG. 3, pooling along the width direction is performed first, producing a temporary matrix Ftp of height H and width Wout. The element in row i and column j of Ftp is the result of the "maximum", "minimum" or "average" operation applied to the row vector of length Kw taken from row i of the input feature layer, columns Sw × j through Sw × j + Kw - 1, where Kw and Sw are the pooling kernel size and stride in the width direction. Pooling along the height direction is then performed, taking the temporary matrix Ftp from the previous step as its input; this yields a matrix of height Hout and width Wout, which is the output feature layer Fout. The element in row i and column j of Fout is the result of the same "maximum", "minimum" or "average" operation applied to the column vector of length Kh taken from column j of Ftp, rows Sh × i through Sh × i + Kh - 1, where Kh and Sh are the pooling kernel size and stride in the height direction.
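A NumPy sketch of this two-pass pooling flow (width first, then height) is given below for illustration; max pooling stands in for the max/min/average choice, and the function name and signature are illustrative.

# Two-pass (separable) pooling: width direction first, then height direction.
import numpy as np

def pool2d_separable(fin: np.ndarray, kw: int, sw: int, kh: int, sh: int) -> np.ndarray:
    h, w = fin.shape
    wout = (w - kw) // sw + 1
    hout = (h - kh) // sh + 1
    # Pass 1: pool along the width; the temporary matrix Ftp has shape (H, Wout).
    ftp = np.empty((h, wout), dtype=fin.dtype)
    for j in range(wout):
        ftp[:, j] = fin[:, sw * j:sw * j + kw].max(axis=1)
    # Pass 2: pool Ftp along the height; the output Fout has shape (Hout, Wout).
    fout = np.empty((hout, wout), dtype=fin.dtype)
    for i in range(hout):
        fout[i, :] = ftp[sh * i:sh * i + kh, :].max(axis=0)
    return fout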
The nonlinear processing module is used for performing activation function operation on the data in the on-chip cache FIFO according to a lookup table provided by the ARM processor;
the lookup table comprises intermediate results x of all input corresponding data and the value range of the input x, and the value range is 16 bits, so that the data volume of the lookup table is compressed from 16M to 1M in order to reduce the storage space of the lookup table at the PL end, the truncated 5-bit data is set as 0, and the 6 th bit data is set as 1 to finish the processing of the input intermediate results x, so that certain calculation precision can be ensured while flexibility is considered, and nonlinear operations such as various activation functions, quantization and the like are realized. The generation of the lookup table is completed at the PS terminal, the C code is firstly written, a 1024-scale array is established to store an output result, all input 1024 data are traversed from 0 through circulation, wherein the input data are added with 0.5 because the 6 th position of the 16bit data is one. Calibrating input data x into floating point type data before the activation function is input through shifting operation according to a scaling factor before the activation function is input, obtaining an activation function calculation result through the operation of the activation function, obtaining 16-bit fixed point data after the activation function result is quantized through shifting, and inputting the 16-bit fixed point data to the next layer. The method comprises the steps that Python codes are written in a Pythroch frame in a PC, the weight, the offset and the input characteristics of 32 floating point type data are quantized into fixed point type 16bit data through 16bit symmetric quantization, the obtained quantization weight offset data and image data are convoluted to obtain corresponding weight scaling factors q _ w, offset scaling factors q _ b and characteristic scaling factors q _ i, input result scaling factors q _ o and quantization coefficients are all integer multiples of 2. The fixed-point numbers may be dequantized to floating-point numbers by dividing the fixed-point data by a scaling factor power of 2.
Referring to FIG. 4, the operation flow of the nonlinear processing module is as follows. The input features and weights, both 16-bit fixed-point numbers, are multiplied in 16-bit multiplications, and the products are stored in a 44-bit register to prevent overflow; the products of all multiplications within one convolution operation are accumulated, and the accumulated result is kept in the 44-bit register. The result is then calibrated: the convolution result is shifted left by the bias scaling factor q_b bits, while the bias is shifted left by the input-feature scaling factor q_i plus the weight scaling factor q_w bits, bringing both to a common scale; the calibrated bias is added to the accumulated result of the convolution multiplications to obtain the result of the convolution operation. After this bias-accumulated result is obtained, it is shifted right by the difference between the sum of the three scaling factors (q_i, q_w and q_b) and the output scaling factor q_o, which performs the dequantization and yields the complete convolution result to be fed into the activation function table; finally, the result is passed through the activation function table to obtain the operation result of the activation function. When the activation output enters the next convolution operation, the above procedure is repeated.
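A minimal integer-only sketch of this shift-based flow follows, under the assumption that q_i, q_w, q_b and q_o denote the power-of-two exponents of the corresponding scaling factors; the function name and the 10-bit LUT indexing step are illustrative, with sign handling simplified.

# Integer-only sketch of one output value: 16-bit MAC, shift calibration, LUT activation.
import numpy as np

def conv_point_int16(x_tc, w_tc, bias, q_i, q_w, q_b, q_o, lut):
    acc = int(np.dot(x_tc.astype(np.int64), w_tc.astype(np.int64)))  # accumulate (fits in 44 bits)
    acc = (acc << q_b) + (int(bias) << (q_i + q_w))   # calibrate result and bias to a common scale
    acc >>= (q_i + q_w + q_b - q_o)                   # dequantize toward the output scale q_o
    idx = (acc >> 6) & 0x3FF                          # upper 10 bits of the 16-bit value index the LUT
    return lut[idx]                                   # activation via the lookup table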
The CNN registers are used for controlling the DMA memory-copy module to read from and write to the DDR memory according to the values assigned by the ARM processor, and for controlling the operation of the convolution module, the pooling module and the nonlinear processing module. The CNN registers are located at the PL end: a specific number of registers is implemented at the PL end, the PS end writes data into designated registers through the AXI4 bus, and specific control signals are generated to control the read and write operations of the DMA memory-copy module at the PL end and the operation of each calculation module at the PL end.
Step 2, the ARM processor performs assignment configuration on a CNN register of the PL end and provides a lookup table for a nonlinear processing module of the PL end; a DMA copy-memory module at the PL end loads feature map data and weight data or intermediate calculation results in the DDR memory to an on-chip cache FIFO; a convolution module or a pooling module or a nonlinear processing module at the PL end calculates data in the on-chip cache FIFO to obtain an intermediate calculation result under the control of the CNN register, and sends the intermediate calculation result back to the on-chip cache FIFO; the DMA copy-memory module stores the calculation result in the on-chip cache FIFO into the DDR memory;
step 3, repeating the step 2 according to the calculation requirement to obtain a final calculation result;
and 4, taking out the final calculation result of the PL end from the DDR memory by the ARM processor, completing probability operation, and transmitting the probability operation result to the upper computer.
Furthermore, the ARM processor accesses the DDR memory through the ARM memory access interface, fetches an intermediate calculation result, performs an auxiliary operation on it, and writes the auxiliary operation result back to the DDR memory as a new intermediate calculation result. Given that the feature storage order is W, H, C and that every 16 input channels form a group, the ARM processor addresses the DDR memory through the ARM memory access interface according to the following formula:
address = mem_map_base + (CH / TK) × feature_surface_stride + row × feature_line_stride + col × TK × 2 + (CH % TK) × 2
where address is the address of the accessed element; mem_map_base is the base address of the feature storage in the DDR; CH is the channel index of the element within the input feature map; TK is the number of parallel channels, here 16; feature_surface_stride is the total address span of the feature map of a single channel, i.e. the product of the height and width multiplied by 2; feature_line_stride is the total address span of one feature row, i.e. the width multiplied by 2; and row and col are the row and column coordinates of the element within the feature map.
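The formula can be transcribed directly into a small helper function (the name and default values are illustrative):

# Direct transcription of the address formula above (16 channels per group, 2 bytes per element).
def feature_address(mem_map_base, ch, row, col,
                    feature_surface_stride, feature_line_stride, tk=16):
    return (mem_map_base
            + (ch // tk) * feature_surface_stride
            + row * feature_line_stride
            + col * tk * 2
            + (ch % tk) * 2)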
Simulation experiment process:
and accessing the PYNQ platform through the PC, writing a C code through an ARM chip at the PS end to control a CNN register at the PL end, realizing the function of an accelerator, and compiling and operating VGG16 and tiny-Y0L0v 3.
The detection result of VGG16, i.e. the running time of each layer of VGG16 at a frequency of 150 MHz, is shown in Table 1. Referring to FIG. 5, which shows the target image predicted by VGG16, it can be seen that VGG16 still obtains correct results after being accelerated by the method of the present invention.
TABLE 1
(Table 1, giving the per-layer running time of VGG16 at 150 MHz, is provided as an image in the original publication.)
The running time of each layer of tiny-YOLOv3 at a frequency of 150 MHz is shown in Table 2. Referring to FIG. 6, which shows the target image predicted from the tiny-YOLOv3 calculation results, it can be seen that tiny-YOLOv3 also obtains correct results after being accelerated by the method of the present invention.
TABLE 2
(Table 2, giving the per-layer running time of tiny-YOLOv3 at 150 MHz, is provided as an image in the original publication.)
Referring to table 1, table 2, fig. 5 and fig. 6, it can be seen that the acceleration method of the present invention can effectively accelerate the VGG16 network and the tiny-YOLOv3 network while ensuring accuracy.
Although the present invention has been described in detail in this specification with reference to specific embodiments and illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the present invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (3)

1. A PYNQ platform-based neural network universal acceleration processing method is characterized by comprising the following steps:
Step 1: on the PYNQ platform, the ARM processor at the PS end acquires feature map data and weight data from the host computer and stores them in the DDR memory;
Step 2: the ARM processor configures the CNN registers at the PL end by assignment and provides a lookup table to the nonlinear processing module at the PL end; the DMA memory-copy module at the PL end loads the feature map data and weight data, or the intermediate calculation results, from the DDR memory into the on-chip cache FIFO; under the control of the CNN registers, the convolution module, the pooling module or the nonlinear processing module at the PL end operates on the data in the on-chip cache FIFO to obtain an intermediate calculation result and sends it back to the on-chip cache FIFO; the DMA memory-copy module then stores the calculation result in the on-chip cache FIFO into the DDR memory;
Step 3: step 2 is repeated as required by the computation to obtain the final calculation result;
Step 4: the ARM processor retrieves the final calculation result of the PL end from the DDR memory, performs the probability operation, and transmits the probability result to the host computer.
2. The PYNQ platform-based neural network universal acceleration processing method according to claim 1, characterized in that the PYNQ platform comprises a processing system PS end, a programmable logic PL end and a DDR memory, wherein the DDR memory is connected to the processing system PS end and the programmable logic PL end through an AXI4 bus, respectively; the DDR memory is also connected with the ARM processor through an ARM memory access interface;
the PS end of the processing system comprises an ARM processor; the ARM processor comprises a Softmax module and an auxiliary operation module;
the ARM processor is used for acquiring the feature map data and weight data from the host computer and storing them in the DDR memory through the AXI4 bus; the ARM processor is also used for assigning values to the CNN registers through the AXI4 bus; the ARM processor is also used for providing the lookup table to the nonlinear processing module through the AXI4 bus; the ARM processor is also used for fetching intermediate calculation results from the DDR memory through the ARM memory access interface and performing auxiliary operations on them with the auxiliary operation module; and the ARM processor is also used for performing the probability operation on the final calculation result with the Softmax module and transmitting the probability result to the host computer;
the DDR memory is used for storing the feature map data and the weight data, and for temporarily storing the intermediate calculation results of the programmable logic PL end and the auxiliary operation results of the ARM processor;
the programmable logic PL end comprises a DMA memory-copy module, a convolution module, a pooling module, a nonlinear processing module and the CNN registers; the DMA memory-copy module comprises a read module, a write module and the on-chip cache FIFO; the read module comprises an rd_cmd_fifo buffer unit and a 5-to-1 arbiter; the write module comprises a wr_cmd_fifo buffer unit and a 2-to-1 arbiter; referring to FIG. 2, the convolution module comprises 4 RAMs and a multiply-add operation unit;
the DMA memory-copy module is used for loading feature map data, weight data, intermediate calculation results and auxiliary operation results from the DDR memory into the on-chip cache FIFO (first-in first-out) through the AXI4 bus by means of the read module, namely the read operation; the DMA memory-copy module is also used for storing the intermediate calculation result output to the on-chip cache FIFO by the convolution module, the pooling module or the nonlinear processing module into the DDR memory through the AXI4 bus by means of the write module, namely the write operation;
the convolution module is used for performing convolution operation and full connection operation on the data in the on-chip cache FIFO;
the pooling module is used for pooling data in the on-chip cache FIFO;
the nonlinear processing module is used for performing activation function operation on the data in the on-chip cache FIFO according to a lookup table provided by the ARM processor;
and the CNN registers are used for controlling the DMA memory-copy module to read from and write to the DDR memory according to the values assigned by the ARM processor, and for controlling the operation of the convolution module, the pooling module or the nonlinear processing module.
3. The PYNQ platform-based neural network universal acceleration processing method as recited in claim 1, characterized in that the ARM processor further accesses the DDR memory through the ARM memory access interface, fetches an intermediate calculation result, performs an auxiliary operation on it, and writes the auxiliary operation result back to the DDR memory as a new intermediate calculation result.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210180230.5A CN114819120A (en) 2022-02-25 2022-02-25 PYNQ platform-based neural network universal acceleration processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210180230.5A CN114819120A (en) 2022-02-25 2022-02-25 PYNQ platform-based neural network universal acceleration processing method

Publications (1)

Publication Number Publication Date
CN114819120A true CN114819120A (en) 2022-07-29

Family

ID=82528303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210180230.5A Pending CN114819120A (en) 2022-02-25 2022-02-25 PYNQ platform-based neural network universal acceleration processing method

Country Status (1)

Country Link
CN (1) CN114819120A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination