CN108647779B - Reconfigurable computing unit of low-bit-width convolutional neural network


Info

Publication number
CN108647779B
CN108647779B (application CN201810318783.6A)
Authority
CN
China
Prior art keywords
register
data
shift
current period
reconfigurable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810318783.6A
Other languages
Chinese (zh)
Other versions
CN108647779A (en
Inventor
曹伟
王伶俐
罗成
谢亮
范锡添
周学功
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201810318783.6A priority Critical patent/CN108647779B/en
Publication of CN108647779A publication Critical patent/CN108647779A/en
Application granted granted Critical
Publication of CN108647779B publication Critical patent/CN108647779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a reconfigurable computing unit for a low-bit-width convolutional neural network. The unit includes a plurality of reconfigurable shift-accumulation modules, a multiplexer and a quantization processing module. Each reconfigurable shift-accumulation module comprises a controller, a first register, a second register, a third register and a shift accumulator, constructed to exploit the sparsity of the network. The controller judges whether the fixed-point data and the exponential weight of the current period are zero; once either is detected to be zero, the third register is controlled to output the shift-accumulated data of the current period according to a first trigger signal sent by the first register or a second trigger signal sent by the second register. The invention realizes flexible 4-bit and 8-bit fixed-point multiply-accumulate operation, improves the shift-accumulate operation speed, and reduces the memory and power consumption occupied by the operation.

Description

Reconfigurable computing unit of low-bit-width convolutional neural network
Technical Field
The invention relates to the technical field of reconfigurable computing, in particular to a reconfigurable computing unit of a low-bit-width convolutional neural network.
Background
With the development of artificial intelligence, deep learning has achieved great success in fields such as speech recognition, computer vision and automatic driving, and has in turn promoted the further development of those fields. The core technology driving deep learning research is the convolutional neural network (CNN). A target recognition system using a convolutional neural network defeated traditional image recognition methods in the large-scale image recognition competition ILSVRC 2012 held in 2012, announcing the arrival of the deep learning era. With the continuous development of deep learning technology, the structure of convolutional neural networks has been continuously optimized and their recognition performance continuously improved. In the large-scale image recognition competition ILSVRC 2015 held in 2015, a convolutional neural network surpassed human image recognition capability for the first time. This milestone marks the great success of deep learning techniques.
As the performance of convolutional neural networks improves, their structures become more and more complex, with correspondingly greater computing and storage requirements. To support convolutional neural network computation, the network processing flow is generally run on servers and in data centers; interacting with a data center requires transmitting a large amount of data, which introduces significant latency and hinders the application of convolutional neural networks in embedded devices such as smartphones and smart cars. To address this problem, academia and industry began to study how to deploy convolutional neural networks onto accelerators in embedded hardware systems. Many effective convolutional neural network accelerators have therefore been designed with specialized computational units (PEs), typically using fixed computational units across different convolutional neural network models. Because of the diversity of convolutional neural networks, a fixed computational unit may no longer be suitable when the network model changes, which increases data movement and degrades power efficiency. Moreover, such fixed convolution mapping methods do not scale well across convolution parameters, and mismatches between network shapes and computational resources can occur, reducing resource utilization and performance. How to design reconfigurable computing units for different networks has therefore become a topic of intense research in the art.
Existing reconfigurable computing units basically adopt dedicated DSP (Digital Signal Processing) blocks for computation, but DSP computing units are designed for floating-point operation. In common floating-point convolutional neural network hardware designs, a DSP unit is usually used for the multiply-accumulate operation (MAC), and one DSP can complete one multiply-accumulate operation in one clock cycle. However, the DSP computing unit is not well suited to low-bit-width multiply-accumulate operation, and this disadvantage prevents it from exerting its full capability in low-bit-width hardware designs.
To solve this problem, Xilinx introduced a dedicated DSP mapping technique: for Xilinx FPGA chips, each DSP computing unit can perform two eight-bit multiply-accumulate operations in parallel. This technique exploits the full computing power of the DSP blocks on the FPGA chip and improves the area and power consumption of the design. However, its application range is too narrow: it supports only eight-bit fixed-point multiply-accumulate operation and cannot serve the special operation requirements of exponential convolutional neural networks. How to overcome these limitations is a problem to be solved in the art.
Disclosure of Invention
The invention aims to provide a reconfigurable computing unit for a low-bit-width convolutional neural network that meets the operation requirements of exponential convolutional neural networks: it realizes flexible 4-bit and 8-bit fixed-point multiply-accumulate operation, improves the shift-accumulate operation rate, and reduces the memory and power consumption occupied by the operation.
To achieve the above object, the present invention provides a low-bit-width convolutional neural network reconfigurable computing unit, applied to the shift-accumulation operation of an exponential convolutional neural network, which includes: a plurality of reconfigurable shift-accumulation modules, a multiplexer and a quantization processing module;
the multiplexer is connected to each reconfigurable shift-accumulation module and is used for selecting the shift-accumulated data of the current period output by a reconfigurable shift-accumulation module; the quantization processing module is connected to the multiplexer and is used for performing quantization on the shift-accumulated data of the current period to obtain quantized data; wherein:
each reconfigurable shift-accumulation module comprises a controller, a first register, a second register, a third register and a shift accumulator;
the controller is used for judging whether the exponential weight data of the current period is negative; if the exponential weight data of the current period is negative, no shift-accumulation operation is needed and the module waits to judge the exponential weight data of the next period; if the exponential weight data of the current period is not negative, the controller judges whether it is 0; if the exponential weight data of the current period is not 0, the controller controls the first register to store it; if the exponential weight data of the current period is 0, the controller controls the first register to send out a first trigger signal;
the controller is also used for judging whether the fixed-point data of the current period is negative; if the fixed-point data of the current period is negative, no shift-accumulation operation is needed and the module waits to judge the fixed-point data of the next period; if the fixed-point data of the current period is not negative, the controller judges whether it is 0; if the fixed-point data of the current period is not 0, the controller controls the second register to store it; if the fixed-point data of the current period is 0, the controller controls the second register to send out a second trigger signal;
the third register is connected to the first register and the second register, and outputs the shift-accumulated data of the current period in response to the first trigger signal sent by the first register or the second trigger signal sent by the second register; the third register is also used for storing the shift-accumulated data of the previous period;
the shift accumulator is connected to the first register, the second register and the third register, and is used for determining the shift-accumulated data of the current period from the exponential weight data of the previous period stored in the first register, the fixed-point data of the previous period stored in the second register and the shift-accumulated data of the previous period stored in the third register, and storing the shift-accumulated data of the current period in the third register.
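The per-period behavior described above can be sketched in Python as follows. This is a behavioral illustration only: the class and attribute names, the treatment of negative operands as "skip", and the wrap of the 18-bit accumulator are assumptions drawn from the surrounding text, not the patent's circuit description.

```python
class ShiftAccumulateModule:
    """Behavioral sketch of one reconfigurable shift-accumulation module.

    Assumed encoding (not specified at circuit level by the patent text):
    - exponential weight: a shift amount; negative means "skip this period";
    - fixed-point data: a non-negative activation; negative means "skip";
    - the third register (accumulator) is 18 bits wide.
    """

    ACC_BITS = 18

    def __init__(self):
        self.first_reg = None   # latest nonzero exponential weight
        self.second_reg = None  # latest nonzero fixed-point datum
        self.third_reg = 0      # running shift-accumulated sum

    def cycle(self, exp_weight, fixed_data):
        """Process one period. Returns the accumulated value when a zero
        operand raises the first/second trigger signal, else None."""
        # Controller: a negative operand means no shift-accumulation is
        # needed this period; wait for the next period's data.
        if exp_weight < 0 or fixed_data < 0:
            return None
        # A zero operand makes the multiplication meaningless, so the
        # third register outputs the current accumulation instead.
        if exp_weight == 0 or fixed_data == 0:
            return self.third_reg
        # Otherwise latch both operands and shift-accumulate:
        # data * 2**exp_weight is realized as a left shift.
        self.first_reg, self.second_reg = exp_weight, fixed_data
        total = self.third_reg + (self.second_reg << self.first_reg)
        self.third_reg = total & ((1 << self.ACC_BITS) - 1)  # keep 18 bits
        return None
```

For example, feeding the pair (3, 5) accumulates 5 << 3 = 40, and a subsequent zero operand emits the running total.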
Preferably, the shift accumulator includes:
the shifter is respectively connected with the first register and the second register and used for determining shift data according to the exponential weight data stored in the first register and the fixed point data stored in the second register;
and the accumulator is respectively connected with the shifter and the third register and used for determining the shift accumulated data of the current period according to the shift data determined by the shifter and the first shift accumulated data of the previous period stored by the third register.
Preferably, the low bit width reconfigurable shift accumulation module further includes:
and the output register is connected with the third register and is used for storing the shift accumulated data of the current period output by the third register.
Preferably, the exponential weight data is 4 bits.
Preferably, the fixed point number data is 8 bits.
Preferably, the shifted accumulated data is 18 bits.
Preferably, the quantization processing data is 8-bit data.
Compared with the prior art, the invention has the following technical effects:
the method comprises the steps that a controller, a first register, a second register, a third register and a shift accumulator are constructed by utilizing network discreteness, whether fixed point data and index weight of a current period are zero or not is judged through the controller, and once the fixed point data and the index weight of the current period are detected to be zero, the third register is controlled to output shift accumulated data of the current period according to a first trigger signal sent by the first register and a second trigger signal sent by the second register; the method can realize flexible fixed-point multiply-accumulate operation of 4 bits and 8 bits, improve the shift accumulation operation rate and reduce the memory and power consumption occupied by operation.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a block diagram of a reconfigurable computing unit of a low bit width convolutional neural network according to an embodiment of the present invention;
fig. 2 is a structural diagram of a low bit width reconfigurable shift accumulation module according to an embodiment of the present invention.
10: reconfigurable shift accumulation module; 11: controller; 12: first register; 13: second register; 14: shifter; 15: accumulator; 16: third register; 17: output register; 20: multiplexer; 30: quantization processing module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a reconfigurable computing unit of a low-bit-width convolutional neural network, which is used for meeting the operation requirement of an exponential convolutional neural network, not only realizing flexible fixed-point multiply-accumulate operation of 4 bits and 8 bits, but also improving the shift-accumulate operation rate and reducing the memory and power consumption occupied by operation.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a block diagram of a reconfigurable computing unit of a low-bit-width convolutional neural network according to an embodiment of the present invention; FIG. 2 is a block diagram of a low-bit-width reconfigurable shift accumulation module according to an embodiment of the present invention. As shown in FIGS. 1 and 2, the present invention provides a low-bit-width reconfigurable computing unit of a convolutional neural network, applied to the shift-accumulation operation of an exponential convolutional neural network, which includes: a plurality of reconfigurable shift accumulation modules 10, a multiplexer 20 and a quantization processing module 30.
the reconfigurable shift accumulation module 10 comprises a controller 11, a first register 12, a second register 13, a third register 16 and a shift accumulator 15.
The controller 11 is configured to judge whether the exponential weight data of the current period is negative; if so, no shift-accumulation operation is needed and the module waits to judge the exponential weight data of the next period; if not, the controller judges whether the exponential weight data of the current period is 0; if it is not 0, the controller controls the first register 12 to store it; if it is 0, the controller controls the first register 12 to send out a first trigger signal.
The controller 11 is further configured to judge whether the fixed-point data of the current period is negative; if so, no shift-accumulation operation is needed and the module waits to judge the fixed-point data of the next period; if not, the controller judges whether the fixed-point data of the current period is 0; if it is not 0, the controller controls the second register 13 to store it; if it is 0, the controller controls the second register 13 to send out a second trigger signal.
The third register 16 is connected to the first register 12 and the second register 13, and outputs the shift-accumulated data of the current period in response to the first trigger signal sent by the first register 12 or the second trigger signal sent by the second register 13; the third register 16 is also used for storing the shift-accumulated data of the previous period.
The shift accumulator 15 is connected to the first register 12, the second register 13 and the third register 16, and determines the shift-accumulated data of the current period from the exponential weight data of the previous period stored in the first register 12, the fixed-point data of the previous period stored in the second register 13 and the shift-accumulated data of the previous period stored in the third register 16, then stores the shift-accumulated data of the current period in the third register 16.
And the multiplexer 20 is respectively connected to each of the low-bit-width reconfigurable shift accumulation modules 10, and is configured to select shift accumulation data of the current period output by the low-bit-width reconfigurable shift accumulation module 10.
And the quantization processing module 30 is connected to the multiplexer 20 and configured to perform quantization processing on the shift accumulated data of the current period to obtain quantized data. The quantization processing data is 8-bit data.
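The patent states only that the quantization processing module reduces the 18-bit shift-accumulated value to 8 bits; the rescaling rule itself is not given. A minimal sketch, assuming a right-shift-and-saturate scheme with a hypothetical, layer-dependent `shift` parameter:

```python
def quantize_18_to_8(acc, shift=10):
    """Hypothetical quantization of an 18-bit shift-accumulated value to
    8 bits: arithmetic right shift to rescale, then saturation to the
    unsigned 8-bit range. The actual rescaling rule used by module 30 is
    not specified in the patent text; `shift` is an assumed parameter."""
    q = acc >> shift             # rescale by a power of two
    return max(0, min(255, q))   # saturate to [0, 255]
```

For example, with the assumed `shift` of 10, the largest 18-bit value (262143) saturates to 255, while 5 << 10 maps back to 5.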
The shift accumulator 15 of the present invention includes:
and the shifter 14 is respectively connected with the first register 12 and the second register 13, and is used for determining shift data according to the exponential weight data stored in the first register 12 and the fixed point data stored in the second register 13.
And the accumulator 15 is respectively connected with the shifter 14 and the third register 16, and is configured to determine shift accumulated data of a current period according to the shift data determined by the shifter 14 and the first shift accumulated data of a previous period stored in the third register 16.
The reconfigurable shift accumulation module 10 of the low bit width convolutional neural network of the present invention further comprises: and the output register 17 is connected with the third register 16 and is used for storing the shift accumulated data of the current period output by the third register 16.
The exponential weight data is 4 bits.
The fixed point number data is 8 bits.
The shift accumulation data of the present invention is 18 bits.
Because convolutional neural networks contain a large degree of sparsity, fully exploiting it can greatly improve the power performance of a hardware design; to further improve the performance of the reconfigurable computing unit, the invention therefore exploits the sparsity of the convolutional neural network. Research shows that about 40%-60% of the input data in a convolutional neural network are zero values, and a large part of the small values in the weight data can be pruned without affecting the accuracy of the network. Multiply-add operations involving such zero values are therefore meaningless and do not affect the output result, so once the fixed-point data or the exponential weight of a period is detected to be zero, the third register 16 is controlled to output the shift-accumulated data of the current period according to the first trigger signal sent by the first register 12 or the second trigger signal sent by the second register 13.
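A small simulation illustrates how much work this zero-skipping can avoid: with roughly half the activations equal to zero, as the 40%-60% figure suggests, more than half of the shift-accumulate operations can be bypassed. The operand distribution below is invented purely for illustration.

```python
import random

def count_skipped(pairs):
    """Count operand pairs the controller would skip: a pair is skipped
    when the weight is zero or negative (pruned) or the datum is zero."""
    return sum(1 for w, x in pairs if w <= 0 or x == 0)

random.seed(0)
# Illustrative operand stream: weights uniform in 0..7, activations zero
# about half the time (mimicking the reported CNN input sparsity).
pairs = [(random.randint(0, 7), random.choice([0, 0, 1, 200]))
         for _ in range(10000)]
skipped = count_skipped(pairs)
fraction = skipped / len(pairs)  # roughly half the MACs avoided
```

With this distribution the expected skip fraction is 1 - (7/8)(1/2) = 0.5625, so a little over half of the shift-accumulate cycles do no arithmetic at all.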
The quantization processing module 30 of the present invention quantizes the 18-bit shift accumulated data to obtain 8-bit quantization processed data.
In the shift-accumulation calculation, the widths of the previous period's shift-accumulated data and of the output shift-accumulated data of the current period are noticeably larger than the widths of the fixed-point data and the exponential weight data, because a larger numeric range is needed to avoid overflow during accumulation. The invention sets the width of the shift-accumulated data of the current period to 18 bits, which can completely accommodate the shift-accumulated data produced by all shift-accumulation operations.
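The headroom argument can be checked with simple arithmetic, under the assumption (not stated explicitly in the text) that the fixed-point data are 8-bit unsigned and the largest usable exponent weight is 7, i.e. part of the 4-bit exponent field encodes the negative/skip case:

```python
MAX_DATA = (1 << 8) - 1           # largest 8-bit fixed-point datum: 255
MAX_SHIFT = 7                     # assumed largest usable exponent weight
MAX_TERM = MAX_DATA << MAX_SHIFT  # largest single shifted product: 32640

# One worst-case term fits in 15 bits...
assert MAX_TERM < (1 << 15)
# ...so an 18-bit accumulator leaves 3 guard bits: at least 8 worst-case
# terms can be summed before any risk of overflow.
assert 8 * MAX_TERM < (1 << 18)
```

In practice operands are rarely all at their maxima, so the 18-bit register comfortably covers the accumulations of a convolution window under these assumptions.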
Tests were performed on an experiment board of model xc7z020clg400-2, with the following results: (1) The reconfigurable computing unit designed by the invention improves the shift-accumulation operation rate. In tests, a common neural network accelerator structure using an ordinary reconfigurable multiply-accumulate unit occupies 95 LUTs with a computation power consumption of 1.658 W, while the same structure using the reconfigurable computing unit designed by the invention occupies only 46 LUTs with a computation power consumption of only 1 W; the reconfigurable computing unit designed by the invention thus reaches nearly twice the operation rate of the ordinary multiply-accumulate unit. (2) The reconfigurable computing unit designed by the invention fully exploits reconfigurability, supporting network structures with multiple bit widths and configurations and realizing flexible 4-8 bit width configuration. (3) The reconfigurable computing unit designed by the invention fully exploits the sparsity of the network and further improves hardware performance. (4) The invention enables exponential convolutional neural networks to be effectively mapped onto embedded systems, further reducing area and power overhead.
TABLE 1 comparison table of reconfigurable computing unit and reconfigurable multiply-accumulate unit
(Table 1 is provided as an image in the original patent document.)
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A low-bit-width convolutional neural network reconfigurable computing unit, characterized in that the low-bit-width convolutional neural network reconfigurable computing unit is applied to the shift-accumulation operation of an exponential convolutional neural network and comprises: a plurality of reconfigurable shift-accumulation modules, a multiplexer and a quantization processing module; the multiplexer is connected to each reconfigurable shift-accumulation module and is used for selecting the shift-accumulated data of the current period output by a reconfigurable shift-accumulation module; the quantization processing module is connected to the multiplexer and is used for performing quantization on the shift-accumulated data of the current period to obtain quantized data; wherein:
each reconfigurable shift-accumulation module comprises a controller, a first register, a second register, a third register and a shift accumulator;
the controller is used for judging whether the exponential weight data of the current period is negative; if the exponential weight data of the current period is negative, no shift-accumulation operation is needed and the judgment of the exponential weight data of the next period is awaited; if the exponential weight data of the current period is not negative, judging whether the exponential weight data of the current period is 0; if the exponential weight data of the current period is not 0, controlling the first register to store the exponential weight data of the current period; if the exponential weight data of the current period is 0, controlling the first register to send out a first trigger signal;
the controller is also used for judging whether the fixed-point data of the current period is negative; if the fixed-point data of the current period is negative, no shift-accumulation operation is needed and the judgment of the fixed-point data of the next period is awaited; if the fixed-point data of the current period is not negative, judging whether the fixed-point data of the current period is 0; if the fixed-point data of the current period is not 0, controlling the second register to store the fixed-point data of the current period; if the fixed-point data of the current period is 0, controlling the second register to send out a second trigger signal;
the third register is connected to the first register and the second register, and is used for outputting the shift-accumulated data of the current period in response to a first trigger signal sent by the first register or a second trigger signal sent by the second register; the third register is also used for storing the shift-accumulated data of the previous period;
the shift accumulator is connected to the first register, the second register and the third register, and is used for determining the shift-accumulated data of the current period according to the exponential weight data of the previous period stored in the first register, the fixed-point data of the previous period stored in the second register and the shift-accumulated data of the previous period stored in the third register, and storing the shift-accumulated data of the current period in the third register.
2. The low bit width convolutional neural network reconfigurable computing unit of claim 1, wherein the shift accumulator comprises:
the shifter is respectively connected with the first register and the second register and used for determining shift data according to the exponential weight data stored in the first register and the fixed point data stored in the second register;
and the accumulator is respectively connected with the shifter and the third register and used for determining the shift accumulated data of the current period according to the shift data determined by the shifter and the first shift accumulated data of the previous period stored by the third register.
3. The low bit width convolutional neural network reconfigurable computing unit of claim 1, wherein the reconfigurable shift accumulation module further comprises:
and the output register is connected with the third register and is used for storing the shift accumulated data of the current period output by the third register.
4. The low bit width convolutional neural network reconfigurable computing unit of claim 1, wherein the exponential weight data is 4 bits.
5. The low bit width convolutional neural network reconfigurable computing unit of claim 1, wherein the fixed point number data is 8 bits.
6. The low bit width convolutional neural network reconfigurable computing unit of claim 1, wherein the shift accumulation data is 18 bits.
7. The low bit width convolutional neural network reconfigurable computing unit of claim 1, wherein the quantized processed data is 8-bit data.
CN201810318783.6A 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network Active CN108647779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810318783.6A CN108647779B (en) 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network

Publications (2)

Publication Number Publication Date
CN108647779A CN108647779A (en) 2018-10-12
CN108647779B true CN108647779B (en) 2021-06-04

Family

ID=63745967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810318783.6A Active CN108647779B (en) 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network

Country Status (1)

Country Link
CN (1) CN108647779B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389212B (en) * 2018-12-30 2022-03-25 南京大学 Reconfigurable activation quantization pooling system for low-bit-width convolutional neural network
CN110084362B (en) * 2019-03-08 2021-07-20 中国科学院计算技术研究所 Logarithmic quantization device and method for neural network
CN110109646B (en) * 2019-03-28 2021-08-27 北京迈格威科技有限公司 Data processing method, data processing device, multiplier-adder and storage medium
CN111767980B (en) * 2019-04-02 2024-03-05 杭州海康威视数字技术股份有限公司 Model optimization method, device and equipment
CN110728365B (en) * 2019-09-12 2022-04-01 东南大学 Method for selecting calculation bit width of multi-bit-width PE array and calculation precision control circuit
US10872295B1 (en) 2019-09-19 2020-12-22 Hong Kong Applied Science and Technology Institute Company, Limited Residual quantization of bit-shift weights in an artificial neural network
CN111738427B (en) * 2020-08-14 2020-12-29 电子科技大学 Operation circuit of neural network
CN113610222B (en) * 2021-07-07 2024-02-27 绍兴埃瓦科技有限公司 Method, system and hardware device for calculating convolutional operation of neural network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060730B2 (en) * 2008-05-30 2011-11-15 Freescale Semiconductor, Inc. Selective MISR data accumulation during exception processing
CN104539263A (en) * 2014-12-25 2015-04-22 University of Electronic Science and Technology of China Reconfigurable low-power digital FIR filter
CN106775599A (en) * 2017-01-09 2017-05-31 Nanjing Tech University Coarse-grained reconfigurable system with multiple computing units and method for recurrent neural networks
CN107077322A (en) * 2014-11-03 2017-08-18 Arm Limited Apparatus and method for performing translation operations
CN107580712A (en) * 2015-05-08 2018-01-12 Qualcomm Incorporated Reduced computational complexity for fixed-point neural networks
CN107797962A (en) * 2017-10-17 2018-03-13 Tsinghua University Computing array based on neural network
CN107844826A (en) * 2017-10-30 2018-03-27 Institute of Computing Technology, Chinese Academy of Sciences Neural network processing unit and processing system comprising the same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A quantum-implementable neural network model"; Jialin Chen et al.; Quantum Information Processing; 2017-10-31; Vol. 16, No. 10; full text *
"Research on Parallel Architectures for FPGA-Based Convolutional Neural Networks"; Lu Zhijian; China Doctoral Dissertations Full-text Database, Information Science and Technology Series; 2014-04-15; full text *

Also Published As

Publication number Publication date
CN108647779A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108647779B (en) Reconfigurable computing unit of low-bit-width convolutional neural network
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
WO2020258841A1 (en) Deep neural network hardware accelerator based on power exponent quantisation
CN107451659B (en) Neural network accelerator for bit width partition and implementation method thereof
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN109325591B (en) Winograd convolution-oriented neural network processor
CN109190756B (en) Arithmetic device based on Winograd convolution and neural network processor comprising same
WO2019218896A1 (en) Computing method and related product
CN107423816B (en) Multi-calculation-precision neural network processing method and system
CN110689126A (en) Device for executing neural network operation
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN109409511A (en) Convolution operation data stream scheduling method for a dynamically reconfigurable array
CN107256424B (en) Three-value weight convolution network processing system and method
CN109478144A (en) Data processing apparatus and method
CN110163358B (en) Computing device and method
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
EP3444757A1 (en) Discrete data representation supported device and method for forward operation of artificial neural network
CN110580519A (en) Convolution operation structure and method thereof
CN109325590B (en) Device for realizing neural network processor with variable calculation precision
CN111930681B (en) Computing device and related product
EP3444758B1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN110059797B (en) Computing device and related product
CN110059809B (en) Computing device and related product
CN111047034A (en) On-site programmable neural network array based on multiplier-adder unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant