CN113568597A - Convolution neural network-oriented DSP packed word multiplication method and system - Google Patents
- Publication number
- CN113568597A (application number CN202110802058.8A)
- Authority
- CN
- China
- Prior art keywords
- dsp
- multiplication
- packed word
- neural network
- oriented
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a convolutional neural network-oriented DSP packed word multiplication method and system, and designs a packed word multiplication calculation mode realized with DSP resources on an FPGA. Packed word multiplication exploits the low bit width of quantized data to complete several four-bit multiplications in a single DSP, improving resource utilization efficiency. In addition, because the FPGA specially optimizes the cascade connection between DSP units, the invention also uses the DSP cascade to realize packed word multiply-accumulation: after multiple packed word multiplications and accumulations are finished, the operation results are extracted from the packed word product. The invention makes full use of the characteristics of the DSP, improves DSP utilization efficiency and benefits the optimization of the system's energy efficiency ratio.
Description
Technical Field
The invention relates to the technical field of convolutional neural networks, in particular to a convolution neural network-oriented DSP packed word multiplication method and system.
Background
Neural network technology is an important branch of artificial intelligence. A large number of interconnected neurons form a layered structure similar to that of the human brain; this structure is a neural network, generally consisting of an input layer, an output layer and several hidden layers.
Neural networks offer high accuracy and strong learning ability, and have wide and important applications in fields such as image and speech recognition and pattern recognition. There are many types of neural networks, including BP neural networks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Among them, the convolutional neural network plays an important role in image recognition thanks to characteristics such as weight sharing and local feature extraction. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the best image recognition results have been achieved by convolutional-neural-network-based algorithms.
However, convolutional neural networks are computation- and parameter-intensive models, which places high demands on the computing power and storage capacity of the hardware. Considering the real-time and safety requirements of applications, the forward inference of the model is often deployed at the edge, near the data source. The edge is an energy- and resource-constrained environment, which challenges the efficient execution of convolutional neural networks there. On the premise of preserving model accuracy, how to improve throughput while reducing power consumption and resource usage has become a topic of great interest in the industry.
In order to break through the bottleneck of deploying convolutional neural networks at the edge, current research focuses mainly on two aspects, algorithms and hardware design: at the algorithm level, the original model is compressed on the premise of preserving accuracy, or with only a small accuracy loss, for example by model quantization, which quantizes the weights and activation values to low bit widths; at the hardware level, efficient dedicated acceleration designs matching the operation patterns of convolutional neural networks are built to meet the requirements of edge deployment. The FPGA supports fine-grained design, has good reconfigurability and facilitates the rapid deployment of various convolutional neural network models.
The core operations of the convolutional neural network (i.e., multiply-accumulate operations) are usually mapped onto the DSP units of the FPGA. However, a DSP on the FPGA platform supports 27-bit × 18-bit multiplication, whereas if both the weights and the activation values are quantized to four bits, only 4-bit × 4-bit multiplications are needed in the convolution calculation. In this case, without a dedicated hardware design, the EDA tool usually maps each multiplication in the hardware description language to one DSP, which greatly wastes DSP resources: it not only degrades the energy efficiency ratio of the accelerator but also makes DSP resources a constraint on deploying the network at the edge.
Publication CN101976044A discloses a neural-network-based wind power system modeling and DSP implementation method. By analyzing the working mechanisms of the wind power generation system and the neural network, the method determines their input and output signals: the input signals comprise wind speed and pitch angle, and the output signals comprise power, wind wheel rotating speed and wind wheel torque. A BP neural network is combined with the wind power generation system; by establishing a BP neural network model with enough hidden-layer nodes to reach arbitrary training precision, the weights of each layer are determined and the behavior of the modeled object is well fitted, making its application feasible.
Therefore, a packed word multiplication calculation mode is proposed, in which several low-bit multiplications are mapped into one DSP unit, improving the utilization of hardware resources and the energy efficiency ratio of model deployment.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a convolution neural network-oriented DSP packed word multiplication method and system.
The invention provides a convolution neural network-oriented DSP packed word multiplication method, which comprises the following steps:
step S1: packing the four inputs of the multiply-accumulate unit, namely two weights and two input activation values, respectively through a shift-add module;
step S2: taking packed word form as the operand of DSP;
step S3: using DSP to complete multiplication operation at the same time;
step S4: extracting the calculation results of the multiplications from the output result of the DSP, obtaining the four partial sums of the convolution multiply-accumulate; the partial sums are further accumulated to complete the full convolution operation.
Preferably, the weights in step S1 are two 4-bit values, and the input activation values are two 4-bit values.
Preferably, the number of operands in the step S2 is two.
Preferably, in step S3, four multiplication operations are performed by using one DSP.
Preferably, the method is used for realizing efficient mapping of the multiply-accumulate operations of the convolutional neural network onto the FPGA; the same input activation value is multiplied by two different weights, regarded as output channel parallelism with degree 2; the same weight is multiplied by two different activation values, regarded as convolution window parallelism, also with degree 2.
Preferably, the calculation result of each multiplication occupies 11bits in the packed word product, and the extraction of the calculation result is performed after the completion of multiple multiply-accumulate operations.
The invention also provides a convolution neural network-oriented DSP packed word multiplication system, which comprises the following modules:
module M1: packing the weight and the input activation value respectively;
module M2: taking packed word form as the operand of DSP;
module M3: using DSP to complete multiplication operation at the same time;
module M4: the calculation result of the multiplication operation is extracted from the output result of the DSP.
Preferably, the weights in the module M1 are two 4-bit values, and the input activation values are two 4-bit values;
the number of the operands in the module M2 is two;
four multiplication operations are performed in the module M3 using one DSP.
Preferably, the system is used for realizing efficient mapping of the multiply-accumulate operations of the convolutional neural network onto the FPGA; the same input activation value is multiplied by two different weights, regarded as output channel parallelism with degree 2; the same weight is multiplied by two different activation values, regarded as convolution window parallelism, also with degree 2.
Preferably, the calculation result of each multiplication occupies 11bits in the packed word product, and the extraction of the calculation result is performed after the completion of multiple multiply-accumulate operations.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention designs a packed word multiplication calculation mode based on the digital signal processing units (DSPs) of a field programmable gate array (FPGA), utilizing the low bit width obtained after model quantization, so as to improve the energy efficiency ratio of convolutional neural networks deployed at the edge;
2. the invention fully utilizes the characteristics of the DSP, improves the utilization efficiency of the DSP and is beneficial to the optimization of the energy efficiency ratio of the system;
3. the convolution operation link provided by the invention can make full use of the optimized circuits inside the FPGA, facilitates placement and routing, and is beneficial to improving performance and reducing power consumption.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of a packed word multiplication operation according to the present invention;
FIG. 3 is a diagram of a one-dimensional convolution operation link structure based on DSP according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that it would be obvious to those skilled in the art that various changes and modifications can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention provides a convolutional neural network-oriented DSP packed word multiplication method, in which the packed word multiplication calculation mode packs two 4-bit weights and two 4-bit input activation values respectively and uses the packed words as the two operands of a DSP, so that one DSP simultaneously completes four multiplication operations. The results of the four multiplications can then be extracted from the output of the DSP.
Referring to FIG. 1 and FIG. 2, a 4-bit input activation value a0 is shifted left by 22 bits and added to another 4-bit activation value a1 to form one operand of the DSP; a 4-bit weight w0 is shifted left by 11 bits and added to another 4-bit weight w1 to form the other operand of the DSP. The multiplication performed by the DSP, corresponding to FIG. 1, is shown in equation (1), where P denotes the result of the multiplication.

P = (a0 << 22 + a1) × (w0 << 11 + w1)
= (w1 × a1) + (w0 × a1 << 11) + (w1 × a0 << 22) + (w0 × a0 << 33)    (1)

Since the product of a signed 4-bit weight and an unsigned 4-bit input activation value occupies 8 bits, and considering the effect of two's complement representation on the above operation flow, the results of the four multiplications can be extracted from the packed word product P by bit selection, namely P[10:0], P[21:11] + P[10], P[32:22] + P[21], and P[43:33] + P[32], as shown in equation (2).
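As an illustrative check (not part of the claimed invention), the packing of equation (1) and the bit-selection extraction of equation (2) can be simulated with plain integers; the function names packed_mul and extract are assumptions of this sketch:

```python
def packed_mul(a0, a1, w0, w1):
    """Pack two unsigned 4-bit activations and two signed 4-bit weights
    into the two DSP operands and return their single product (equation (1))."""
    act = (a0 << 22) | a1        # unsigned activations in slots 22 and 0
    wgt = (w0 << 11) + w1        # signed weights in slots 11 and 0
    return act * wgt


def extract(P, width=11, slots=4):
    """Recover the four products from the packed word product P (equation (2)):
    each 11-bit slot, corrected by the MSB of the slot below, read as signed."""
    mask = (1 << width) - 1
    half = 1 << (width - 1)
    out, carry = [], 0
    for k in range(slots):
        raw = (P >> (width * k)) & mask
        val = raw + carry                    # two's-complement borrow fix
        out.append(val - (1 << width) if val >= half else val)
        carry = raw >> (width - 1)           # this slot's MSB corrects the next
    return out                               # [w1*a1, w0*a1, w1*a0, w0*a0]


# exhaustive check over all signed 4-bit weights and unsigned 4-bit activations
for a0 in range(16):
    for w0 in range(-8, 8):
        for a1, w1 in ((3, -5), (15, 7), (0, -8)):
            P = packed_mul(a0, a1, w0, w1)
            assert extract(P) == [w1 * a1, w0 * a1, w1 * a0, w0 * a0]
```

The exhaustive loop confirms that the MSB-correction terms of equation (2) recover all four signed products exactly, including the borrow propagation between slots.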
Based on the above calculation mode, the same input activation value is multiplied by two different weights, which can be regarded as output channel parallelism with degree 2; the same weight is multiplied by two different activation values, which can be regarded as convolution window parallelism, also with degree 2. The packed word multiplication calculation mode therefore combines the two parallel schemes, convolution window parallelism and output channel parallelism, for a total parallelism of 4.
The result of each multiplication occupies 11 bits in the packed word product, while the product of a 4-bit multiplication occupies only 8 bits, so the extraction of the results can be deferred until after multiple multiply-accumulate operations. Equations (1) and (2) can accordingly be rewritten as equations (3) and (4), where a0(i), a1(i), w0(i) and w1(i) respectively denote the first input activation value, the second input activation value, the first weight and the second weight in the i-th group of inputs:

P = Σ (i = 1..N) (a0(i) << 22 + a1(i)) × (w0(i) << 11 + w1(i))    (3)

and equation (4) extracts the four accumulated partial sums from P by the same bit selection as equation (2).
where N is the number of accumulated products. Under this calculation mode, the range of N is derived as follows:

An m-bit signed number lies in the range [-2^(m-1), 2^(m-1) - 1] and an m-bit unsigned number lies in the range [0, 2^m - 1], so their product lies in [-2^(m-1) × (2^m - 1), (2^(m-1) - 1) × (2^m - 1)]. After accumulating N such products, the value lies in [-N × 2^(m-1) × (2^m - 1), N × (2^(m-1) - 1) × (2^m - 1)]. As shown in equation (4), owing to the two's complement effect, the actual value range of the packed word product is [-N × 2^(m-1) × (2^m - 1) - 1, N × (2^(m-1) - 1) × (2^m - 1) - 1]. If this value is to be expressed as a q-bit signed number, equation (5) must be satisfied.
Based on the above analysis, the present invention proposes a DSP-based one-dimensional convolution operation link structure. Referring to FIG. 3, the packing of the two weight values is completed by an independent adder, while the packing of the two input activation values is realized by the 27-bit pre-adder inside the DSP; after the data are packed, the multiplier inside the DSP completes the packed word multiplication; the packed word product is then accumulated with the partial sum output by the previous-stage DSP, using the accumulator inside the DSP, to obtain a new partial sum as the input of the next-stage DSP. After the accumulation of the whole link is completed, the packed word product is split according to equation (4) to obtain 4 partial sums as the output of the module.
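The operation link above can be sketched behaviorally in Python (the real structure is a cascade of FPGA DSP slices; the function names dsp_stage and split are illustrative assumptions):

```python
def dsp_stage(partial_in, a0, a1, w0, w1):
    """One DSP in the cascade: the pre-adder packs the activations, the
    multiplier forms the packed product, the accumulator adds the
    previous stage's partial sum arriving on the cascade input."""
    act = (a0 << 22) + a1          # pre-adder inside the DSP
    wgt = (w0 << 11) + w1          # weights packed by an independent adder
    return partial_in + act * wgt  # accumulate with the cascade input

def split(P, width=11, slots=4):
    """Split the accumulated packed word into 4 partial sums (equation (4))."""
    mask, half, out, carry = (1 << width) - 1, 1 << (width - 1), [], 0
    for k in range(slots):
        raw = (P >> (width * k)) & mask
        val = raw + carry
        out.append(val - (1 << width) if val >= half else val)
        carry = raw >> (width - 1)
    return out

# drive an N-stage link with sample data (N <= 8 keeps the slots from overflowing)
groups = [(5, 3, 2, -1), (1, 2, -8, 7), (9, 14, 3, 3)]
P = 0
for a0, a1, w0, w1 in groups:
    P = dsp_stage(P, a0, a1, w0, w1)
expected = [sum(w1 * a1 for a0, a1, w0, w1 in groups),
            sum(w0 * a1 for a0, a1, w0, w1 in groups),
            sum(w1 * a0 for a0, a1, w0, w1 in groups),
            sum(w0 * a0 for a0, a1, w0, w1 in groups)]
assert split(P) == expected
```

Note that the splitting is performed only once, after the whole chain has accumulated, which is exactly what lets one DSP per stage carry four multiply-accumulate lanes.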
The structure makes full use of the resources inside the DSP and maps most operations of the packed word multiplication into the DSP, reducing the use of logic resources on the FPGA. In addition, the DSPs in the FPGA are arranged in arrays, and the FPGA specially optimizes the cascade connection between DSP units; therefore the convolution operation link provided by the invention can make full use of the optimized circuits inside the FPGA, facilitates placement and routing, and is beneficial to improving performance and reducing power consumption.
In addition, to further improve throughput, the invention introduces a pipeline structure into the operation link, with the number of pipeline stages equal to the number N of DSP units. Because the amount of convolution computation is huge, the clock cycles spent filling and draining the pipeline are negligible, so in practical analysis the structure is approximately equivalent to N DSP units operating in parallel, realizing either intra-kernel multiplication parallelism or input channel parallelism. Analyzed from the perspective of three-dimensional convolution, intra-kernel multiplication parallelism and input channel parallelism have much in common; considering the design of the subsequent data flow, the one-dimensional convolution operation link is used to realize input channel parallelism. Combined with the parallel scheme realized by packed word multiplication, a one-dimensional convolution operation link can simultaneously realize convolution window parallelism with degree 2, output channel parallelism with degree 2 and input channel parallelism with degree N.
Aiming at the multiply-accumulate operations in convolutional neural networks, the invention designs the packed word multiplication calculation mode by exploiting the low bit width of quantized data, realizing several four-bit multiplications in one DSP. The invention also realizes packed word multiply-accumulation by using the special structure of the DSP cascade in the FPGA. The design makes full use of the characteristics of the DSP, improves DSP utilization efficiency and benefits the optimization of the system's parallelism and energy efficiency ratio.
The invention also provides a convolution neural network-oriented DSP packed word multiplication system, which comprises the following modules:
module M1: packing the weight and the input activation value respectively; the weight is two 4bits and the input activation value is two 4 bits.
Module M2: taking packed word form as the operand of DSP; the number of operands is two.
Module M3: using DSP to complete multiplication operation at the same time; four multiplication operations are performed using one DSP.
Module M4: the calculation result of the multiplication operation is extracted from the output result of the DSP.
The system is used for realizing efficient mapping of the multiply-accumulate operations of the convolutional neural network onto the FPGA: the same input activation value is multiplied by two different weights, regarded as output channel parallelism with degree 2; the same weight is multiplied by two different activation values, regarded as convolution window parallelism, also with degree 2. The result of each multiplication occupies 11 bits in the packed word product, and the extraction of the results is performed after multiple multiply-accumulate operations have been completed.
The invention designs a packed word multiplication calculation mode based on the digital signal processing units (DSPs) of a field programmable gate array (FPGA), utilizing the low bit width obtained after model quantization, so as to improve the energy efficiency ratio of convolutional neural networks deployed at the edge; the characteristics of the DSP are fully utilized, DSP utilization efficiency is improved, and the optimization of the system's energy efficiency ratio is facilitated; the proposed convolution operation link can make full use of the optimized circuits inside the FPGA, facilitates placement and routing, and benefits both performance and power consumption.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A convolutional neural network-oriented DSP packed word multiplication method, comprising the steps of:
step S1: packing the four inputs of the multiply-accumulate unit, namely two weights and two input activation values, respectively through a shift-add module;
step S2: taking packed word form as the operand of DSP;
step S3: using DSP to complete multiplication operation at the same time;
step S4: extracting the calculation results of the multiplications from the output result of the DSP, obtaining the four partial sums of the convolution multiply-accumulate; the partial sums are further accumulated to complete the full convolution operation.
2. The convolutional neural network-oriented DSP packed word multiplication method as claimed in claim 1, wherein the weights in step S1 are two 4-bit values, and the input activation values are two 4-bit values.
3. The convolutional neural network-oriented DSP packed word multiplication method of claim 1, wherein the number of operands in the step S2 is two.
4. The convolutional neural network-oriented DSP packed word multiplication method of claim 1, wherein four multiplication operations are performed in step S3 using one DSP.
5. The convolutional neural network-oriented DSP packed word multiplication method as claimed in claim 1, wherein the method is used for realizing efficient mapping of the multiply-accumulate operations of the convolutional neural network onto the FPGA; the same input activation value is multiplied by two different weights, regarded as output channel parallelism with degree 2; the same weight is multiplied by two different activation values, regarded as convolution window parallelism, also with degree 2.
6. The convolutional neural network-oriented DSP packed word multiplication method of claim 1, wherein the calculation result of each multiplication occupies 11 bits in the packed word product, and the extraction of the calculation results is performed after multiple multiply-accumulate operations have been completed.
7. A convolutional neural network-oriented DSP packed word multiplication system, comprising:
module M1: packing the weight and the input activation value respectively;
module M2: taking packed word form as the operand of DSP;
module M3: using DSP to complete multiplication operation at the same time;
module M4: the calculation result of the multiplication operation is extracted from the output result of the DSP.
8. The convolutional neural network-oriented DSP packed word multiplication system of claim 7, wherein the weights in the module M1 are two 4-bit values, and the input activation values are two 4-bit values;
the number of the operands in the module M2 is two;
four multiplication operations are performed in the module M3 using one DSP.
9. The convolutional neural network-oriented DSP packed word multiplication system of claim 7, wherein the system is configured to implement efficient mapping of the multiply-accumulate operations of the convolutional neural network onto the FPGA; the same input activation value is multiplied by two different weights, regarded as output channel parallelism with degree 2; the same weight is multiplied by two different activation values, regarded as convolution window parallelism, also with degree 2.
10. The convolutional neural network-oriented DSP packed word multiplication system of claim 7, wherein the calculation result of each multiplication occupies 11 bits in the packed word product, and the extraction of the calculation results is performed after multiple multiply-accumulate operations have been completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110802058.8A CN113568597B (en) | 2021-07-15 | 2021-07-15 | Convolution neural network-oriented DSP packed word multiplication method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113568597A true CN113568597A (en) | 2021-10-29 |
CN113568597B CN113568597B (en) | 2024-07-26 |
Family
ID=78165006
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110802058.8A Active CN113568597B (en) | 2021-07-15 | 2021-07-15 | Convolution neural network-oriented DSP packed word multiplication method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113568597B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06274167A (en) * | 1993-03-18 | 1994-09-30 | Casio Comput Co Ltd | Device and method for adding effect |
CN110555516A (en) * | 2019-08-27 | 2019-12-10 | 上海交通大学 | FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method |
CN110780845A (en) * | 2019-10-17 | 2020-02-11 | 浙江大学 | Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof |
WO2020215124A1 (en) * | 2019-04-26 | 2020-10-29 | The University Of Sydney | An improved hardware primitive for implementations of deep neural networks |
CN112434801A (en) * | 2020-10-30 | 2021-03-02 | 西安交通大学 | Convolution operation acceleration method for carrying out weight splitting according to bit precision |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | 中国电子科技集团公司第十五研究所 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
Non-Patent Citations (3)
Title |
---|
YONGQUAN SHI et al.: "Fast FPGA-Based Emulation for ReRAM-Enabled Deep Neural Network Accelerator", 2021 IEEE International Symposium on Circuits and Systems * |
YUNHE WANG et al.: "AdderNet and Its Minimalist Hardware Design for Energy-Efficient Artificial Intelligence", arXiv:2101.10015v2 [cs.LG] * |
LI Yongbo et al.: "Design of a Sparse Convolutional Neural Network Accelerator", Microelectronics & Computer, vol. 37, no. 6, pages 30-39 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220012593A1 (en) | Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization | |
CN111459877B (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration | |
Wang et al. | PipeCNN: An OpenCL-based open-source FPGA accelerator for convolution neural networks | |
CN109146067B (en) | Policy convolution neural network accelerator based on FPGA | |
CN107423816B (en) | Multi-calculation-precision neural network processing method and system | |
CN111832719A (en) | Fixed point quantization convolution neural network accelerator calculation circuit | |
CN110543939B (en) | Hardware acceleration realization device for convolutional neural network backward training based on FPGA | |
CN110991631A (en) | Neural network acceleration system based on FPGA | |
CN112434801B (en) | Convolution operation acceleration method for carrying out weight splitting according to bit precision | |
CN110543936B (en) | Multi-parallel acceleration method for CNN full-connection layer operation | |
CN113283587B (en) | Winograd convolution operation acceleration method and acceleration module | |
Xiao et al. | FPGA implementation of CNN for handwritten digit recognition | |
CN115018062A (en) | Convolutional neural network accelerator based on FPGA | |
Véstias et al. | A configurable architecture for running hybrid convolutional neural networks in low-density FPGAs | |
Yang et al. | A sparse CNN accelerator for eliminating redundant computations in intra-and inter-convolutional/pooling layers | |
CN113568597B (en) | Convolution neural network-oriented DSP packed word multiplication method and system | |
Reddy et al. | Low Power and Efficient Re-Configurable Multiplier for Accelerator | |
CN102185585B (en) | Lattice type digital filter based on genetic algorithm | |
Adel et al. | Accelerating deep neural networks using FPGA | |
Jha et al. | Performance analysis of single-precision floating-point MAC for deep learning | |
Kumar et al. | Complex multiplier: implementation using efficient algorithms for signal processing application | |
Alhussain et al. | Hardware-efficient template-based deep CNNs accelerator design | |
CN112836793A (en) | Floating point separable convolution calculation accelerating device, system and image processing method | |
Cruz et al. | Extensible hardware inference accelerator for fpga using models from tensorflow lite | |
Li | A single precision floating point multiplier for machine learning hardware acceleration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||