CN112862079B - Design method of a pipelined convolution computing architecture and residual network acceleration system - Google Patents

Design method of a pipelined convolution computing architecture and residual network acceleration system

Info

Publication number
CN112862079B
Authority
CN
China
Prior art keywords
convolution
output
processing array
convolution processing
buffer area
Prior art date
Legal status
Active
Application number
CN202110262425.XA
Other languages
Chinese (zh)
Other versions
CN112862079A (en)
Inventor
黄以华
黄俊源
陈志炜
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202110262425.XA
Publication of CN112862079A
Application granted
Publication of CN112862079B
Legal status: Active

Classifications

    • G06N 3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08: Learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a design method for a pipelined convolution computing architecture and a residual network acceleration system. The method divides the hardware acceleration architecture into an on-chip buffer, convolution processing arrays, and a point-by-point addition module. The main path of the hardware acceleration architecture consists of three serially arranged convolution processing arrays, with two pipeline buffers inserted between them to realize inter-layer pipelining of the three main-path convolution layers. A fourth convolution processing array processes a convolution layer with a kernel size of 1×1 in parallel; by configuring a register inside the fourth convolution processing array, its working mode can be changed so that it can also compute the head convolution layer or the fully connected layer of the residual network, and when the shortcut branch of the residual building block contains no convolution, the fourth convolution processing array is skipped and performs no convolution. A point-by-point addition module adds, element by element, the corresponding output feature pixels of the main-path output features and the shortcut-branch output features of the residual block.

Description

Design method of a pipelined convolution computing architecture and residual network acceleration system
Technical Field
The invention relates to the field of computer vision processing methods, and in particular to a pipelined convolution computing architecture design method and a residual network acceleration system.
Background
Convolutional Neural Networks (CNNs) are widely used in a variety of computer vision scenarios and exhibit superior performance. However, because of their complex, intensive computational requirements and huge storage requirements, deploying and accelerating convolutional neural networks on power-sensitive mobile devices and embedded platforms with real-time requirements remains a challenge.
In a convolutional neural network, the convolution layers account for more than 90% of the network's total computation time, so accelerating the convolution operations is the most important part of accelerating the network. An accelerator design should therefore make full use of the parallelism available within and between the convolution layers of the network, while customizing the convolution operation modules to the characteristics of the network model.
The Field Programmable Gate Array (FPGA) is a semi-custom, programmable logic device. With the continuous advance of semiconductor technology, mainstream FPGAs now contain abundant logic, storage, and routing resources and have the advantage of low power consumption, giving researchers enough design space to build dedicated convolutional neural network acceleration hardware that fully exploits the parallelism of CNN computation to accelerate the operation process.
The residual network is a convolutional neural network model that has attracted wide attention in the computer vision field in recent years. Unlike the simple layer-by-layer stacking of traditional convolutional neural networks, the residual network uses branch shortcut connections to construct residual building blocks, which effectively alleviates the degradation of training and testing accuracy as the network deepens, so network performance can be improved more easily by stacking more layers. However, relatively few studies have deployed residual networks on FPGAs. Because a residual network has more layers, the layers differ in size, and shortcut connections accumulate feature maps across every two or three adjacent layers, the network structure is highly irregular; compared with a traditional CNN, deploying a residual network on an FPGA is therefore more difficult. Many existing studies design a single convolution processing array module to handle the convolution operations of the residual network: the array computes one layer of the network at a time, and a central processor repeatedly invokes it to compute all the convolution layers of the residual network layer by layer.
The structure of a residual network consists mainly of a stack of residual building blocks with branch shortcuts (Fig. 1). The main path is usually composed of three convolutions with kernel sizes 1×1, 3×3, and 1×1 (hence it is also called a bottleneck building block). The branch shortcut has two cases: 1) the input features are computed by a convolution layer with a kernel size of 1×1, and the result is added point by point to the corresponding pixels of the main-path output features; 2) the input feature data are added point by point to the corresponding pixels of the main-path output features directly, without any processing.
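For reference, this bottleneck structure can be restated in a few lines of PyTorch. The sketch below only illustrates the dataflow of Fig. 1 and is not part of the claimed hardware; the channel parameters and the omission of strides and batch normalization are simplifying assumptions.

    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        """Bottleneck residual building block: a 1x1 -> 3x3 -> 1x1 main
        path, plus a shortcut that is a 1x1 convolution or an identity."""
        def __init__(self, in_ch, mid_ch, out_ch, project_shortcut):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
            self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
            self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1)
            # Case 1: shortcut carries a 1x1 convolution; case 2: identity.
            self.shortcut = (nn.Conv2d(in_ch, out_ch, kernel_size=1)
                             if project_shortcut else nn.Identity())
            self.relu = nn.ReLU()

        def forward(self, x):
            main = self.relu(self.conv1(x))
            main = self.relu(self.conv2(main))
            main = self.conv3(main)
            # Point-by-point addition of main-path and shortcut pixels.
            return self.relu(main + self.shortcut(x))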
Fig. 2 shows the conventional execution flow of computing one layer of a residual network with a single convolution processing array; each pass through the flow completes the computation of one layer. This existing solution, which accelerates the convolutional neural network with a single convolution processing array module, suits CNN models with the traditional, simple layer-by-layer stacking structure and has a certain generality. However, the external memory must be accessed before and after the computation of every convolution layer, and a residual network usually has many layers, which brings considerable energy consumption and memory access latency. Because of the particular structure of the residual network, a single convolution processing array can only execute the convolution layers of the main path and the shortcut branch of a residual building block serially before performing the point-by-point addition, so the parallelism of the structure cannot be fully exploited. Meanwhile, the convolution layers of a residual network come in various sizes, and processing convolutions of different sizes with a single array generally cannot achieve high hardware resource utilization.
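In pseudocode, the conventional flow of Fig. 2 amounts to one external-memory round trip per layer. The plain-Python sketch below makes that cost explicit; the layer dictionary fields and the conv_array callable are hypothetical names used only for illustration.

    def run_network_single_array(layers, external_memory, conv_array):
        """Conventional layer-by-layer execution (Fig. 2): every layer
        pays one external-memory read for features and weights and one
        write for results, which dominates latency and energy for deep
        residual networks."""
        for layer in layers:
            # DMA read: off-chip memory -> on-chip buffers.
            features = external_memory[layer["input_addr"]]
            weights = external_memory[layer["weight_addr"]]
            # Reconfigure the single shared array, then compute one layer.
            output = conv_array(features, weights, layer["config"])
            # DMA write: on-chip buffer -> off-chip memory.
            external_memory[layer["output_addr"]] = output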
Disclosure of Invention
The invention provides a design method for a pipelined convolution computing architecture with higher hardware utilization.
A further object of the present invention is to design a residual network acceleration system using the pipelined convolution computing architecture design method.
In order to achieve the above technical effects, the technical solution of the invention is as follows:
a design method of a running water type convolution computing architecture comprises the following steps:
s1: dividing the hardware acceleration architecture into an on-chip buffer area, a convolution processing array and a point-by-point addition module;
s2: the main route of the hardware acceleration architecture is composed of three serially arranged convolution processing arrays, two assembly line buffer areas are inserted between the three serially arranged convolution processing arrays and used for realizing interlayer pipelining of three layers of convolution of the main route, and the assembly line buffer areas are arranged in an on-chip buffer area;
s3: setting a fourth convolution processing array for parallel processing of a convolution layer with the kernel size of 1 multiplied by 1, changing the working mode of the fourth convolution processing array by configuring a register in the fourth convolution processing array, so that the fourth convolution processing array can be used for calculating a head convolution layer or a full connection layer of a residual network, and skipping the fourth convolution processing array to execute no convolution when the branches of the residual building block have no convolution;
s4: and setting a point-by-point addition module to add the pixels of the corresponding output characteristics by elements of the output characteristics of the main path of the residual block and the output characteristics of the branch quick connection.
Further, the on-chip buffer comprises an input buffer, pipeline buffers, output buffers, and weight buffers. The input buffer caches the feature data slices read from the off-chip memory and is shared by the first convolution processing array of the residual-block main path and the fourth convolution processing array to provide feature input. Pipeline buffers are applied at the outputs of the first and second convolution processing arrays, which compute the main-path convolutions of the residual building block; each pipeline buffer caches the output features of the first convolution processing array, which are the input features of the second. A first output buffer is placed at the output of the third convolution processing array of the residual-block main path, and a second output buffer at the output of the fourth convolution processing array of the shortcut-branch portion, to store the convolution output feature results; depending on the subsequent operation, data in the output buffers can be sent to the point-by-point addition module or the pooling unit, or written back to the external memory through the direct memory access module. The weight buffers cache the weight data slices corresponding to each convolution layer. Because the three main-path convolution layers are processed as a pipeline, and in order to let the next pipeline stage start sooner while minimizing the pipeline buffer size, the loop order of the convolution computation is designed to first compute all output channels corresponding to one output feature, then move to the next output feature, and so on for all output features. This order avoids repeatedly loading input feature slices into the buffer, but causes the weight buffers to be reloaded repeatedly as weight slices are replaced; for this reason each convolution processing array is designed with two weight buffers, weight buffer a and weight buffer b, which ping-pong-buffer the weight slices so as to overlap convolution computation latency with weight loading latency. The point-by-point addition module performs element-by-element addition of the corresponding output feature pixels of the main-path output features and the shortcut-branch output features.
The point-by-point addition module first reads the corresponding output features from the first and second output buffers and adds them, then applies the activation operation, and sends the result back to the first output buffer of the main path; depending on the subsequent operation, the data in the first output buffer can be sent to the pooling unit for a pooling operation or written back to the external memory through the direct memory access module.
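The loop order described above (all output channels of one output feature first) and the ping-pong weight buffers a/b can be sketched as follows. Here fetch_slice and compute are hypothetical stand-ins for the weight DMA load and the multiply-accumulate work, and in hardware the prefetch and the compute run concurrently rather than sequentially as in this plain-Python model.

    def conv_loop_with_pingpong(num_pixels, num_out_channels,
                                fetch_slice, compute):
        """Output-feature-major loop with ping-pong weight buffers:
        finish all output channels of one output feature before moving
        on, so the next pipeline stage starts sooner and pipeline
        buffers stay small; prefetch the next weight slice into the
        idle buffer to hide the reload latency this order causes."""
        # Visit (output feature, output channel) pairs in the order the
        # architecture uses.
        schedule = [(p, c) for p in range(num_pixels)
                    for c in range(num_out_channels)]
        buffers = [fetch_slice(schedule[0][1]), None]  # buffer a preloaded
        active = 0
        for i, (pixel, ch) in enumerate(schedule):
            if i + 1 < len(schedule):
                # Load the next slice into the idle (standby) buffer; in
                # hardware this DMA transfer overlaps the compute below.
                buffers[1 - active] = fetch_slice(schedule[i + 1][1])
            compute(pixel, ch, buffers[active])
            active = 1 - active                        # swap ping and pong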
Further, the register configuration module in each of the first to fourth convolution processing arrays receives and registers the parameters of that array, including the size of the convolution layer and the working mode. According to the register values in the register configuration module, the logic control module feeds the weight and feature data streams into the multiply-accumulate unit, bias unit, or activation unit of the convolution processing array in the specified manner, and emits the computation result in the specified data stream format.
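Concretely, the register contents could be modeled as a small record. The field names below are assumptions (the patent only specifies that the registers hold the convolution layer size and the working mode), and the three compute units are passed in as callables.

    from dataclasses import dataclass

    @dataclass
    class ConvArrayConfig:
        """Assumed register-configuration contents of one convolution
        processing array; actual register names/widths are not given."""
        kernel_size: int    # e.g. 1 or 3
        in_channels: int
        out_channels: int
        feature_size: int   # input feature map width/height
        mode: str           # e.g. 'main', 'branch', 'head_conv', 'fc', 'bypass'

    def logic_control(cfg, weights, features, mac_unit, bias_unit, act_unit):
        """Route the weight and feature streams through the array's
        compute units in the manner the configuration registers specify."""
        if cfg.mode == "bypass":
            return features                  # array skipped: no convolution
        result = mac_unit(features, weights, cfg)  # multiply-accumulate unit
        result = bias_unit(result, cfg)            # bias calculation unit
        return act_unit(result, cfg)               # activation unit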
A residual network acceleration system comprises: a direct memory access module, a pipelined convolution computing architecture module, a pooling unit, and a global control logic unit.
the direct memory access module sends a read data command to the off-chip memory, so that data in the off-chip memory is transmitted to an input buffer area on the chip; transmitting a data writing command to an off-chip memory, and writing the final output characteristics calculated by the current residual block into the external memory from the data of the output buffer;
the pooling operation unit is used for carrying out average pooling operation or maximum pooling operation; when the pooling operation is required to be executed, the pooling operation unit reads the characteristic data from the output buffer zone of the running water convolution computing architecture module to execute corresponding pooling operation, and then writes the executed result back to the output buffer zone;
the global control logic unit is used for controlling the starting, execution sequence and data flow of each module of the whole system; tracking the number of layers executed by the current network; transferring parameters required by the direct memory access module; the data in the off-chip memory comprises characteristic data for identification of a convolutional neural network and corresponding weight data; the global control logic unit is also used for configuring the working mode of the running water type convolution computing architecture and loading the kernel size and characteristic size parameters of the current computing layer into the register configuration modules of the convolution processing arrays.
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
the special central computing unit is designed according to the characteristics of the residual building block module, so that the central computing unit can complete the computation of a plurality of convolution layers through one-time configuration; by combining the design idea of a pipeline type design neural network accelerator, the pipeline buffer is inserted between three layers of convolution layers of a main path in a central computing unit, so that the computing parallelism is enhanced, and the computing delay is reduced; multiple accesses to an external memory are avoided, memory access delay is reduced, and power consumption is reduced; by using the design idea of hardware parallelism, a special convolution processing array is arranged for the branch convolution of the residual block, so that the branch convolution and the main convolution can be operated in parallel, and the calculation delay is reduced; the branched convolution processing array 4 can be used for calculating the head convolution and the full connection layer through configuration of the working mode; ping-pong buffers are designed for weights. The circulation sequence of the convolution calculation of the accelerator is designed to finish the calculation of all output channels corresponding to a certain output feature, and then finish the calculation of all output features, which can cause frequent weight slice replacement, so that a ping-pong buffer area is designed for weights to overlap memory access delay and calculation delay.
Drawings
FIG. 1 is a schematic diagram of a residual block of a residual network of the prior art;
FIG. 2 is a flow chart of calculating one layer of a residual network in the prior art;
FIG. 3 is a flow chart of the design method of the present invention;
FIG. 4 is a block diagram of a convolution processing array;
FIG. 5 is a block diagram of the residual network acceleration system;
FIG. 6 is a flow chart of the acceleration system computing a residual network.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent. For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged, or reduced, and do not represent actual product dimensions. It will be appreciated by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in Fig. 3, a design method for a pipelined convolution computing architecture includes the following steps:
s1: dividing the hardware acceleration architecture into an on-chip buffer area, a convolution processing array and a point-by-point addition module;
s2: the main route of the hardware acceleration architecture is composed of three serially arranged convolution processing arrays, two assembly line buffer areas are inserted between the three serially arranged convolution processing arrays and used for realizing interlayer pipelining of three layers of convolution of the main route, and the assembly line buffer areas are arranged in an on-chip buffer area;
s3: setting a fourth convolution processing array for parallel processing of a convolution layer with the kernel size of 1 multiplied by 1, changing the working mode of the fourth convolution processing array by configuring a register in the fourth convolution processing array, so that the fourth convolution processing array can be used for calculating a head convolution layer or a full connection layer of a residual network, and skipping the fourth convolution processing array to execute no convolution when the branches of the residual building block have no convolution;
s4: and setting a point-by-point addition module to add the pixels of the corresponding output characteristics by elements of the output characteristics of the main path of the residual block and the output characteristics of the branch quick connection.
The on-chip buffer comprises an input buffer, pipeline buffers, output buffers, and weight buffers. The input buffer caches the feature data slices read from the off-chip memory and is shared by the first convolution processing array of the residual-block main path and the fourth convolution processing array to provide feature input. Pipeline buffers are applied at the outputs of the first and second convolution processing arrays, which compute the main-path convolutions of the residual building block; each pipeline buffer caches the output features of the first convolution processing array, which are the input features of the second. A first output buffer is placed at the output of the third convolution processing array of the residual-block main path, and a second output buffer at the output of the fourth convolution processing array of the shortcut-branch portion, to store the convolution output feature results; depending on the subsequent operation, data in the output buffers can be sent to the point-by-point addition module or the pooling unit, or written back to the external memory through the direct memory access module. The weight buffers cache the weight data slices corresponding to each convolution layer. Because the three main-path convolution layers are processed as a pipeline, and in order to let the next pipeline stage start sooner while minimizing the pipeline buffer size, the loop order of the convolution computation is designed to first compute all output channels corresponding to one output feature, then move to the next output feature, and so on for all output features. This order avoids repeatedly loading input feature slices into the buffer, but causes the weight buffers to be reloaded repeatedly as weight slices are replaced; for this reason each convolution processing array is designed with two weight buffers, weight buffer a and weight buffer b, which ping-pong-buffer the weight slices so as to overlap convolution computation latency with weight loading latency. The point-by-point addition module performs element-by-element addition of the corresponding output feature pixels of the main-path output features and the shortcut-branch output features.
The point-by-point addition module first reads the corresponding output features from the first and second output buffers and adds them, then applies the activation operation, and sends the result back to the first output buffer of the main path; depending on the subsequent operation, the data in the first output buffer can be sent to the pooling unit for a pooling operation or written back to the external memory through the direct memory access module.
As shown in Fig. 4, the register configuration module in each of the first to fourth convolution processing arrays receives and registers the parameters of that array, including the size of the convolution layer and the working mode; according to the register values in the register configuration module, the logic control module feeds the weight and feature data streams into the multiply-accumulate unit, bias unit, or activation unit of the convolution processing array in the specified manner, and emits the computation result in the specified data stream format.
Example 2
As shown in Fig. 5, a residual network acceleration system is designed using the pipelined convolution computing architecture design method and comprises: a direct memory access module, a pipelined convolution computing architecture module, a pooling unit, and a global control logic unit.
the direct memory access module sends a read data command to the off-chip memory, so that data in the off-chip memory is transmitted to an input buffer area on the chip; transmitting a data writing command to an off-chip memory, and writing the final output characteristics calculated by the current residual block into the external memory from the data of the output buffer;
the pooling operation unit is used for carrying out average pooling operation or maximum pooling operation; when the pooling operation is required to be executed, the pooling operation unit reads the characteristic data from the output buffer zone of the running water convolution computing architecture module to execute corresponding pooling operation, and then writes the executed result back to the output buffer zone;
the global control logic unit is used for controlling the starting, execution sequence and data flow of each module of the whole system; tracking the number of layers executed by the current network; transferring parameters required by the direct memory access module; the data in the off-chip memory comprises characteristic data for identification of a convolutional neural network and corresponding weight data; the global control logic unit is also used for configuring the working mode of the running water type convolution computing architecture and loading the kernel size and characteristic size parameters of the current computing layer into the register configuration modules of the convolution processing arrays.
The execution flow of the residual network acceleration system is shown in Fig. 6.
A dedicated central computing unit is designed according to the characteristics of the residual building block, so that the central computing unit can complete the computation of several convolution layers with a single configuration. Following the idea of pipelined neural network accelerator design, pipeline buffers are inserted between the three main-path convolution layers inside the central computing unit, which increases computational parallelism and reduces computation latency; repeated accesses to the external memory are avoided, reducing memory access latency and power consumption. Using hardware parallelism, a dedicated convolution processing array is provided for the shortcut-branch convolution of the residual block, so that the branch convolution and the main-path convolutions can run in parallel, further reducing computation latency; through configuration of its working mode, the fourth convolution processing array can also compute the head convolution layer and the fully connected layer. Finally, ping-pong buffers are designed for the weights: the loop order of the accelerator's convolution computation first completes all output channels corresponding to one output feature and then completes the remaining output features, which causes frequent replacement of weight slices, so a ping-pong weight buffer is used to overlap memory access latency with computation latency.
The same or similar reference numerals correspond to the same or similar components. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (9)

1. A design method for a pipelined convolution computing architecture, characterized by comprising the following steps:
S1: dividing the pipelined convolution computing architecture into an on-chip buffer, convolution processing arrays, and a point-by-point addition module;
S2: forming the main path of the pipelined convolution computing architecture from three serially arranged convolution processing arrays, with two pipeline buffers inserted between the three arrays to realize inter-layer pipelining of the three main-path convolution layers, the pipeline buffers being placed in the on-chip buffer;
S3: providing a fourth convolution processing array for processing a convolution layer with a kernel size of 1×1 in parallel, and changing the working mode of the fourth convolution processing array by configuring a register inside it, so that it can be used to compute the head convolution layer or the fully connected layer of a residual network, and is skipped and performs no convolution when the shortcut branch of the residual building block contains no convolution;
S4: providing a point-by-point addition module that adds, element by element, the corresponding output feature pixels of the main-path output features and the shortcut-branch output features of the residual block;
wherein the on-chip buffer comprises an input buffer, pipeline buffers, output buffers, and weight buffers; the input buffer caches the feature data slices read from the off-chip memory and is shared by the first convolution processing array of the residual-block main path and the fourth convolution processing array to provide feature input; and the pipeline buffers are applied at the outputs of the first and second convolution processing arrays, which compute the main-path convolutions of the residual building block.
2. The design method for a pipelined convolution computing architecture according to claim 1, wherein each pipeline buffer caches the output features of the first convolution processing array, which are the input features of the second convolution processing array.
3. The design method for a pipelined convolution computing architecture according to claim 2, wherein a first output buffer is provided at the output of the third convolution processing array of the residual-block main path, and a second output buffer is provided at the output of the fourth convolution processing array of the shortcut-branch portion, for storing the convolution output feature results; and wherein, depending on the subsequent operation, data in the output buffers can be sent to the point-by-point addition module or the pooling unit, or written back to the external memory through the direct memory access module.
4. The design method for a pipelined convolution computing architecture according to claim 3, wherein the weight buffers cache the weight data slices corresponding to each convolution layer; because the three main-path convolution layers are processed as a pipeline, and in order to let the next pipeline stage start sooner while minimizing the pipeline buffer size, the loop order of the convolution computation is designed to first compute all output channels corresponding to one output feature, then move to the next output feature, and so on for all output features, which avoids repeatedly loading input feature slices into the buffer but causes the weight buffers to be reloaded repeatedly as weight slices are replaced; and wherein, for this reason, each convolution processing array is designed with two weight buffers, weight buffer a and weight buffer b, which ping-pong-buffer the weight slices so as to overlap convolution computation latency with weight loading latency.
5. The design method for a pipelined convolution computing architecture according to claim 4, wherein the point-by-point addition module is configured to perform element-by-element addition of the corresponding output feature pixels of the main-path output features and the shortcut-branch output features;
the module first reads the corresponding output features from the first and second output buffers and adds them, then applies the activation operation, and sends the result back to the first output buffer of the main path; depending on the subsequent operation, the data in the first output buffer can be sent to the pooling unit for a pooling operation or written back to the external memory through the direct memory access module.
6. The design method for a pipelined convolution computing architecture according to any one of claims 1-4, wherein the register configuration module in each of the first to fourth convolution processing arrays is configured to receive and register the parameters of that array, including the size of the convolution layer and the working mode; and wherein, according to the register values in the register configuration module, the logic control module feeds the weight and feature data streams into the multiply-accumulate unit, bias unit, or activation unit of the convolution processing array in the specified manner, and emits the computation result in the specified data stream format.
7. A residual network acceleration system designed using the design method of claim 6, comprising: a direct memory access module, a pipelined convolution computing architecture module, a pooling unit, and a global control logic unit;
wherein the direct memory access module sends read commands to the off-chip memory so that data in the off-chip memory are transferred to the on-chip input buffer, and sends write commands to the off-chip memory so that the final output features computed for the current residual block are written from the output buffer to the external memory;
the pooling unit is configured to perform average pooling or max pooling; when a pooling operation is required, the pooling unit reads feature data from an output buffer of the pipelined convolution computing architecture module, performs the corresponding pooling operation, and writes the result back to the output buffer; and
the global control logic unit is configured to control the start-up, execution order, and data flow of each module of the whole system, to track the number of network layers currently executed, and to pass the parameters required by the direct memory access module.
8. The residual network acceleration system of claim 7, wherein the data in the off-chip memory comprise the feature data to be recognized by the convolutional neural network and the corresponding weight data.
9. The residual network acceleration system of claim 8, wherein the global control logic unit is further configured to configure the working mode of the pipelined convolution computing architecture and to load the kernel-size and feature-size parameters of the current computation layer into the register configuration modules of the convolution processing arrays.
CN202110262425.XA 2021-03-10 2021-03-10 Design method of a pipelined convolution computing architecture and residual network acceleration system Active CN112862079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262425.XA CN112862079B (en) 2021-03-10 2021-03-10 Design method of a pipelined convolution computing architecture and residual network acceleration system

Publications (2)

Publication Number Publication Date
CN112862079A (en) 2021-05-28
CN112862079B (en) 2023-04-28

Family

ID=75993917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262425.XA Active CN112862079B (en) 2021-03-10 2021-03-10 Design method of a pipelined convolution computing architecture and residual network acceleration system

Country Status (1)

Country Link
CN (1) CN112862079B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202071B * 2022-02-17 2022-05-27 浙江光珀智能科技有限公司 Deep convolutional neural network inference acceleration method based on a dataflow mode

Citations (1)

Publication number Priority date Publication date Assignee Title
CN109447254A * 2018-11-01 2019-03-08 济南浪潮高新科技投资发展有限公司 Hardware acceleration method and device for convolutional neural network inference

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US10678508B2 (en) * 2018-03-23 2020-06-09 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
CN110163215B (en) * 2018-06-08 2022-08-23 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer readable medium and electronic equipment
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 FPGA parallel system for a convolutional neural network algorithm
CN109934339B (en) * 2019-03-06 2023-05-16 东南大学 General convolutional neural network accelerator based on a one-dimensional systolic array
US11797345B2 (en) * 2019-04-30 2023-10-24 Prakash C R J Naidu Hardware accelerator for efficient convolution processing
CN112200302B (en) * 2020-09-27 2021-08-17 四川翼飞视科技有限公司 Construction method of a weighted residual neural network for image classification

Also Published As

Publication number Publication date
CN112862079A (en) 2021-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant