CN116596034A - Three-dimensional convolutional neural network accelerator and method on complex domain - Google Patents


Publication number
CN116596034A
Authority
CN
China
Prior art keywords: complex, neural network, unit, convolutional neural, dimensional convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310440957.7A
Other languages
Chinese (zh)
Inventor
宫磊 (Gong Lei)
王超 (Wang Chao)
周学海 (Zhou Xuehai)
李曦 (Li Xi)
陈香兰 (Chen Xianglan)
朱宗卫 (Zhu Zongwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Original Assignee
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Suzhou Institute Of Higher Studies, University Of Science And Technology Of China
Priority to CN202310440957.7A
Publication of CN116596034A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing such computations using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/4806 - Computations with complex numbers
    • G06F 7/4812 - Complex multiplication
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation using electronic means
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a three-dimensional convolutional neural network accelerator on the complex domain and a method thereof. The accelerator comprises: a buffer unit for storing input features, output features and weight data in the complex domain; an AXI DMA unit for transferring data between the accelerator and off-chip memory; a computing unit for accelerating the computation of the convolutional and fully-connected layers; a post-processing unit for computing the fused quantization, pooling, batch normalization and activation layers; and a control unit for controlling and scheduling the working states of the buffer unit, the AXI DMA unit, the computing unit and the post-processing unit. The scheme can remarkably improve the performance and energy efficiency of 3D CNN deployment.

Description

Three-dimensional convolutional neural network accelerator and method on complex domain
Technical Field
The invention belongs to the technical field of convolutional neural networks, and particularly relates to a three-dimensional convolutional neural network accelerator and method on the complex domain.
Background
In recent years, deep convolutional neural networks have achieved great success in the field of image processing. However, when processing higher-dimensional data such as video, a conventional two-dimensional convolutional neural network cannot effectively capture the temporal information therein and thus cannot achieve satisfactory results. The three-dimensional convolutional neural network solves this problem: through three-dimensional convolution it can capture the spatio-temporal information in video simultaneously, and it plays a major role in video classification and medical image analysis. However, compared with a two-dimensional convolutional neural network, a three-dimensional convolutional neural network has huge storage and computation overhead, which poses serious challenges for deployment in edge scenarios such as embedded devices.
To address this problem, the industry began to accelerate the 3D CNN algorithm with specialized hardware. In the cloud, the GPU has become the mainstream hardware acceleration platform owing to its high computational parallelism and high memory bandwidth. At the edge, due to constraints on resources, power consumption and other factors, hardware acceleration based on ASICs and FPGAs is generally adopted, improving the deployment efficiency of 3D CNNs by providing higher parallelism at the computation level and increasing data reuse as much as possible at the memory-access level. An ASIC is an integrated circuit chip designed and developed for a particular application; relative to other platforms it offers the highest performance, lowest power consumption and smallest area. The FPGA is reconfigurable, with a shorter development cycle and lower development difficulty than an ASIC, while still offering high performance and low power consumption; FPGA-based accelerators can therefore achieve high performance and energy efficiency and are better suited to the rapid iteration of deep learning algorithms.
Currently, mainstream hardware accelerators mainly adopt structures such as vector inner-product units, systolic arrays and line buffers. The vector inner-product unit mainly unrolls along the convolution's input and output channels to exploit parallelism in those two dimensions; a typical example is DianNao from the Institute of Computing Technology, Chinese Academy of Sciences. The systolic array supports three dataflows, namely input-stationary, output-stationary and weight-stationary, and reuses data through transfers between neighboring PEs, improving the performance and energy efficiency of the accelerator; a typical example is Google's tensor processing unit. The line-buffer architecture achieves parallel computation in the convolution-kernel dimension by caching the input features of a K×K window; under this hardware architecture, pipelined processing of the feature map is easy to realize, so a higher throughput can be achieved.
Research shows that computing neural networks in the complex domain has many advantages. However, research on three-dimensional convolutional neural network accelerators over the complex domain remains scarce.
Disclosure of Invention
To address the lack of research on three-dimensional convolutional neural network accelerators in the complex domain, the invention provides a three-dimensional convolutional neural network accelerator on the complex domain and a method thereof, which can remarkably improve the performance and energy efficiency of 3D CNN deployment.
The aim of the invention is achieved by the following technical scheme:
the first aspect of the present invention provides a three-dimensional convolutional neural network accelerator on a complex domain, the three-dimensional convolutional neural network comprising a convolutional layer, a full-connection layer, a pooling layer, an activation layer and a batch normalization layer, wherein the accelerator comprises:
the buffer memory unit is used for storing input characteristics, output characteristics and weight data in a complex domain;
an AXI DMA unit, which is used for carrying out data transmission between the accelerator and the off-chip memory;
the computing unit is used for accelerating the computation of the convolution layer and the full connection layer;
the post-processing unit is used for computing the fused quantization, pooling, batch normalization and activation layers;
and the control unit is used for controlling and scheduling the working states of the buffer memory unit, the AXI DMA unit, the computing unit and the post-processing unit.
The cache unit includes:
the input feature caching unit is used for storing input features in a complex domain;
the output characteristic buffer unit is used for storing output characteristics in a complex domain;
and the weight caching unit is used for caching weights in the complex domain.
The calculation unit includes:
an arithmetic unit matrix comprising a plurality of arithmetic units PE arranged as a two-dimensional matrix of size T_m/B × B, each arithmetic unit PE comprising T_n/B parallel complex multipliers and one complex addition tree for summing the outputs of the T_n/B parallel complex multipliers, where T_m is the block size of the output channel, B is the block size of the two-dimensional matrix, and T_n is the block size of the input channel;
an address generator for generating address data of the input feature, the output feature, and the weight data;
and the PE controller is used for controlling the working states of the operation unit PE and the address generator.
The AXI DMA unit includes:
the data packaging unit is used for packaging the output data of the buffer unit to increase the bandwidth of the output data;
the data disassembling unit is used for disassembling the data of the off-chip memory to obtain the data required by the accelerator;
and the AXI DMA controller is used for controlling the working states of the data packaging unit and the data disassembling unit.
The second aspect of the invention discloses a three-dimensional convolutional neural network acceleration method on the complex domain, comprising the following steps:
quantizing the three-dimensional convolutional neural network;
deploying a three-dimensional convolutional neural network;
accelerating the three-dimensional convolutional neural network using a three-dimensional convolutional neural network accelerator on the complex domain as described in the first aspect and any one of its possible designs.
The quantizing of the three-dimensional convolutional neural network comprises:
calculating the scale factors s_w^l of the real and imaginary parts of the weights and the scale factors s_a^l of the real and imaginary parts of the activations;
based on the scale factors s_w^l of the real and imaginary parts of the weights and the scale factors s_a^l of the real and imaginary parts of the activations, calculating a pseudo-quantization operator comprising a quantization operator CQuant and a dequantization operator CDequant;
inserting the pseudo-quantization operator into the computation graph of the three-dimensional convolutional neural network;
wherein s_w^l = max(|w^l|)/127, where l = 1, 2 denote the scale factor of the real part and of the imaginary part of the weights respectively;
s_a^l ← (1 - β)·s_a^l + β·max(|a^l|)/127, where a^l with l = 1, 2 denote the real and imaginary parts of the activations respectively, and β ∈ [0, 1];
CQuant(z) = Quant(z_r) + j·Quant(z_i),
CDequant(z) = Dequant(z_r) + j·Dequant(z_i),
where z is the complex number to be quantized, z_r and z_i denote the real and imaginary parts of z respectively, and j is the square root of -1;
Quant is a symmetric quantization operator on the real domain: Quant(x) = clamp([x/s], -127, 127), where [·] denotes rounding and clamp(x, a, b) constrains the value of x to [a, b], returning a if x is less than a, b if x is greater than b, and x otherwise;
Dequant is the corresponding dequantization operator on the real domain: Dequant(x) = x × s.
The deploying of the three-dimensional convolutional neural network further comprises the following steps:
obtaining the complex sequence produced by applying the FFT to the weights and the activations;
compressing the storage and computation of the complex sequence according to the conjugate symmetry of the FFT of a real sequence: when N is even, only X_0, X_1, ..., X_{N/2}, i.e. N/2 + 1 complex numbers in total, need to be stored, and the real numbers X_0 and X_{N/2} can further be packed into a single complex number X_0 + jX_{N/2}; in terms of computation, only the products of the first N/2 + 1 complex numbers need to be computed.
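A minimal NumPy sketch of this symmetry and the packing step (the length N = 8 and all variable names are illustrative assumptions):

```python
import numpy as np

N = 8
x = np.random.default_rng(0).standard_normal(N)  # a real sequence of even length
X = np.fft.fft(x)

# Conjugate symmetry of the FFT of a real sequence: X[N-k] = conj(X[k]),
# so only X[0..N/2] (N/2 + 1 complex numbers) carries information.
for k in range(1, N // 2):
    assert np.allclose(X[N - k], np.conj(X[k]))

half = X[: N // 2 + 1]
# X[0] and X[N/2] are purely real and can be packed into one complex number,
# leaving N/2 stored complex values in total.
packed = np.concatenate(([half[0].real + 1j * half[-1].real], half[1 : N // 2]))
print(packed.size)  # 4 for N = 8
```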
The method further comprises, before deploying the three-dimensional convolutional neural network, a step of optimizing complex multiplication:
obtaining a first complex number z_1 and a second complex number z_2 to be multiplied, where z_1 = a + bj, z_2 = c + dj, a, b, c, d are real numbers, and j denotes the square root of -1;
converting the product of the first complex number z_1 and the second complex number z_2 into (A - B) + (B - C)j, where A = (a + b)×c, B = (c + d)×b, C = (b - a)×d.
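The identity above can be checked numerically; this small sketch (the function name cmul3 is an assumption) verifies that three real multiplications suffice:

```python
def cmul3(a, b, c, d):
    """Product of (a + bj) and (c + dj) using 3 real multiplications
    (instead of the naive 4), following the transformation above."""
    A = (a + b) * c
    B = (c + d) * b
    C = (b - a) * d
    return A - B, B - C  # real part, imaginary part

re_, im_ = cmul3(3, 4, 5, 6)
print(re_, im_)  # -9 38, matching (3+4j)*(5+6j)
```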
The method further comprises, before deploying the three-dimensional convolutional neural network, a step of optimizing complex multiplication:
obtaining a third complex number x to be multiplied with a fourth complex number w_1 and a fifth complex number w_2, where x = a + bj, w_1 = x_1 + jy_1, w_2 = x_2 + jy_2, a, b, x_1, y_1, x_2, y_2 are real numbers, and j denotes the square root of -1;
converting the product of the third complex number x and the fourth complex number w_1 into (A_1 - B_1) + (B_1 - C_1)j, where A_1 = (a + b)×x_1, B_1 = (x_1 + y_1)×b, C_1 = (b - a)×y_1;
converting the product of the third complex number x and the fifth complex number w_2 into (A_2 - B_2) + (B_2 - C_2)j, where A_2 = (a + b)×x_2, B_2 = (x_2 + y_2)×b, C_2 = (b - a)×y_2.
The method further comprises:
shifting a first multiplier r_1 left by 18 bits, sign-extending a second multiplier r_2 to 27 bits, and summing the two to obtain r_1 << 18 + r_2, where r_1 and r_2 are real numbers;
calculating the product of r_1 << 18 + r_2 and a third multiplier r_3 to obtain o = r_3 × (r_1 << 18 + r_2);
separating the results r_3 × r_1 and r_3 × r_2 from o.
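A minimal sketch of this packing trick (the function name is an assumption; the 18-bit field width follows the description, and the bound on |r_3 × r_2| is an added assumption needed for the separation to be exact):

```python
def pack_two_mults(r1, r2, r3, w=18):
    """Compute r3*r1 and r3*r2 with a single wide multiplication, as on a
    27x18 DSP slice: pack r1 and r2 into one operand, multiply once, then
    split the product. Assumes |r3*r2| < 2**(w-1) so the low field does
    not overflow into the high field."""
    o = r3 * ((r1 << w) + r2)          # single multiplication
    low = o & ((1 << w) - 1)           # low w bits hold r3*r2 (two's complement)
    if low >= 1 << (w - 1):
        low -= 1 << w                  # sign-extend the low field
    high = (o - low) >> w              # remaining bits hold r3*r1
    return high, low

hi_, lo_ = pack_two_mults(5, -3, 7)
print(hi_, lo_)  # 35 -21  (= 7*5 and 7*-3)
```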
Compared with the prior art, the invention has at least the following advantages and beneficial effects:
1. The accelerator of this scheme adopts optimization methods such as loop tiling, loop unrolling and double buffering, effectively improving the performance and energy efficiency of the three-dimensional convolutional neural network during deployment.
2. The method of this scheme adopts a series of optimization measures targeted at the characteristics of complex arithmetic, effectively reducing storage space and computation; by using fast algorithms for complex multiplication it reduces computational complexity, lowers hardware resource consumption and improves the computational efficiency of the accelerator.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic block diagram of an accelerator of the present invention;
FIG. 2 is a diagram of complex conjugate symmetry optimization according to the present invention;
FIG. 3 is a schematic diagram of the INT9 multiplication DSP optimization of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without collision.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be noted that, directions or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., are directions or positional relationships based on those shown in the drawings, or are directions or positional relationships conventionally put in use of the inventive product, or are directions or positional relationships conventionally understood by those skilled in the art, are merely for convenience of describing the present invention and for simplifying the description, and are not to indicate or imply that the apparatus or element to be referred to must have a specific direction, be constructed and operated in a specific direction, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Research shows that the calculation of the neural network in the complex domain has a plurality of advantages.
For example, the conventional 2D/3D FFT algorithm can significantly reduce the computational complexity of two-dimensional/three-dimensional convolution. The algorithm first zero-pads the feature map and the weights to the same size, transforms both to the frequency domain by a 2D/3D FFT, performs element-wise multiplication in the frequency domain and accumulates along the channel, and finally transforms the result back to the time domain by a 2D/3D IFFT. Using this algorithm to accelerate convolution can reduce the amount of computation by a factor approaching K^2 or K^3. However, the above method requires a large number of zero-padding operations when the feature map and the convolution kernel differ significantly in size, which increases the memory overhead of the computation. For this reason, researchers have proposed Overlap-and-Add (OaA) to alleviate this problem: OaA first tiles the feature map into blocks, pads each block and the corresponding convolution kernel to the same size, transforms both to the frequency domain by FFT, performs element-wise multiplication and accumulation in the frequency domain, and finally transforms back to the time domain. Since OaA tiles the feature map, the size difference between a block and the convolution kernel is no longer large, which relieves the extra memory overhead of the computation.
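A one-dimensional NumPy sketch of the Overlap-and-Add principle (the block size, signal sizes and names are illustrative assumptions; the patent's context is 2D/3D convolution):

```python
import numpy as np

def oaa_conv(x, k, block=8):
    """1D overlap-and-add convolution: tile the input, FFT-convolve each
    block with the kernel, and add the overlapping tails back together."""
    n = block + len(k) - 1            # FFT size for one block (linear conv)
    y = np.zeros(len(x) + len(k) - 1)
    K = np.fft.rfft(k, n)             # kernel transformed once, reused per block
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        Y = np.fft.rfft(seg, n) * K   # element-wise product in the frequency domain
        y[start:start + n] += np.fft.irfft(Y, n)[: len(y) - start]
    return y

rng = np.random.default_rng(1)
x, k = rng.standard_normal(32), rng.standard_normal(3)
print(np.allclose(oaa_conv(x, k), np.convolve(x, k)))  # True
```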
Another class of neural networks over the complex domain is based on a circulant-matrix compression algorithm: after the neural network is compressed with circulant matrices, the computation can further be moved to the frequency domain through a 1D FFT according to the circular convolution theorem, achieving acceleration. Unlike the conventional 2D/3D FFT algorithm, the 1D FFT based on circulant-matrix compression involves no zero-padding, so the storage and computation cost of the model can be reduced remarkably. In addition, the method can trade off between compression ratio and accuracy by adjusting the circulant matrix size: the larger the circulant matrix, the higher the compression ratio it provides and the larger the accuracy loss, and vice versa.
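A minimal sketch of the circular convolution theorem underlying this compression (matrix size and names are illustrative assumptions):

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply a circulant matrix (defined by its first column c) by x
    via the circular convolution theorem: C @ x = IFFT(FFT(c) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

rng = np.random.default_rng(2)
c, x = rng.standard_normal(8), rng.standard_normal(8)
# Direct construction of the circulant matrix for comparison: column i is c
# rotated down by i positions.
C = np.stack([np.roll(c, i) for i in range(8)], axis=1)
print(np.allclose(C @ x, circulant_matvec_fft(c, x)))  # True
```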
In addition to complex-domain neural networks obtained by transforming to the frequency domain via the FFT, another class of complex-domain neural networks is constructed directly on the complex domain. Related studies have shown that, when training neural networks, complex numbers are easier to optimize, have better generalization, faster learning, and permit a noise-robust memory mechanism. For example, in recurrent neural networks, the use of complex numbers can give the network richer representational capability. As a result, complex-domain neural networks have begun to receive increasing attention. In such a network, the input feature map, the weights and the output feature map all consist of complex numbers; corresponding to the time-domain neural network, such a network likewise has a complex convolutional layer, a complex fully-connected layer, a complex pooling layer, a complex activation layer and a complex batch normalization layer.
Because the storage resources on the FPGA chip are limited, the feature map and the weights cannot be held on chip all at once; the invention therefore applies loop tiling when designing the accelerator. For convenience of explanation, by convention T_n and T_m are the block sizes of the accelerator's input and output channels respectively, T_d, T_r and T_c are the block sizes of the output feature map in the time, height and width dimensions, and B is the block size of the circulant matrix. A first aspect of the present invention provides a three-dimensional convolutional neural network accelerator over the complex domain, the three-dimensional convolutional neural network comprising a convolutional layer, a fully-connected layer, a pooling layer, an activation layer and a batch normalization layer. As shown in FIG. 1, the accelerator comprises a buffer unit, an AXI DMA unit, a computing unit, a post-processing unit and a control unit. The buffer unit stores input features, output features and weight data in the complex domain; the AXI DMA unit transfers data between the accelerator and off-chip memory; the computing unit PEs accelerates the computation of the convolutional and fully-connected layers; the post-processing unit computes the fused quantization, pooling, batch normalization and activation layers, mainly comprising operations such as linear transformation, truncation and rounding; the control unit controls and schedules the working states of the buffer unit, the AXI DMA unit, the computing unit PEs and the post-processing unit.
In order to provide the matched on-chip access bandwidth for the computing unit, the caching unit comprises an input feature caching unit, an output feature caching unit and a weight caching unit, wherein the input feature caching unit is used for storing input features in a complex domain; the output characteristic caching unit is used for storing output characteristics in a complex domain; the weight caching unit is used for caching weights in a complex domain. The input feature buffer, the output feature buffer and the weight buffer are all composed of a plurality of banks, and data of different channels are stored in different banks, so that parallel reading and writing of the data can be realized.
To mask data-transfer time, the accelerator may also adopt double buffering: the input feature buffer unit, output feature buffer unit and weight buffer unit are each provided in duplicate, so that while the producer writes data into buffer 1 (or 2), the consumer reads data from buffer 2 (or 1) for subsequent use, realizing coarse-grained pipelining and further increasing the accelerator's throughput. The accelerator's operation is divided into three stages: data loading, data computation and data write-back. With two buffers each for the input feature map, the weights and the output feature map, the data-transfer time is hidden by ping-pong operation, achieving coarse-grained pipelining across the three stages.
Illustratively, the computing unit PEs comprises an arithmetic unit matrix, an address generator and a PE controller. The arithmetic unit matrix comprises a plurality of arithmetic units PE arranged as a two-dimensional matrix of size T_m/B × B, each arithmetic unit PE comprising T_n/B parallel complex multipliers and one complex addition tree for summing the outputs of the T_n/B parallel complex multipliers. This structure has a total of T_m/B × B = T_m arithmetic units, so the total parallelism is T_m × T_n/B, i.e. T_m × T_n/B complex multiplications can be completed per cycle. The address generator generates the address data of the input features, output features and weight data; the PE controller controls the working states of the arithmetic units PE and the address generator.
The accelerator uses loop tiling to block the computation along five dimensions, namely the time, height and width of the output feature map, the input channel and the output channel, with block sizes T_d, T_r, T_c, T_n and T_m respectively; meanwhile, the accelerator performs loop unrolling of sizes T_n and T_m on the input and output channels.
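A minimal Python sketch of the tiled loop nest, shown for a complex fully-connected computation with hypothetical tile sizes T_m and T_n (in hardware the two inner loops would be fully unrolled, and the feature-map tiling over T_d, T_r, T_c would wrap around this nest):

```python
import numpy as np

M, N = 8, 6            # output channels, input channels
Tm, Tn = 4, 3          # tile (block) sizes, chosen to divide M and N
rng = np.random.default_rng(3)
w = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
a = rng.standard_normal(N) + 1j * rng.standard_normal(N)

out = np.zeros(M, dtype=complex)
for m0 in range(0, M, Tm):           # tile over output channels
    for n0 in range(0, N, Tn):       # tile over input channels
        # inner loops: unrolled in hardware, one tile of weights/features on chip
        for m in range(m0, m0 + Tm):
            for n in range(n0, n0 + Tn):
                out[m] += w[m, n] * a[n]

print(np.allclose(out, w @ a))  # True
```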
The AXI DMA unit is used for carrying data between the off-chip memory and the accelerator and comprises a data packaging unit, a data disassembling unit and an AXI DMA controller, wherein the data packaging unit is used for packaging the output data of the cache unit so as to increase the bandwidth of the output data; the data disassembling unit is used for disassembling the data of the off-chip memory to obtain the data required by the accelerator; and the AXI DMA controller is used for controlling the working states of the data packaging unit and the data disassembling unit.
Based on the above accelerator structure, the second aspect of the present invention discloses a method for accelerating a three-dimensional convolutional neural network in a complex domain, which comprises steps S01 to S04.
Step S01: quantize the three-dimensional convolutional neural network.
INT8 quantization-aware training is performed on the three-dimensional convolutional neural network over the complex domain to further reduce storage and computation cost. INT8 quantization-aware training on the complex domain applies max-absolute-value-based symmetric quantization to the real part and the imaginary part of each complex number, with independent scale factors for the two, so that model accuracy is well maintained while the computation and storage cost of the model is reduced. Specifically, the quantization step comprises steps S011 to S013.
Step S011, calculating the scale factors s_{w_l} of the real part and imaginary part of the weights and the scale factors s_{a_l} of the real part and imaginary part of the activations,

wherein

s_{w_l} = max(|w_l|) / 127 (l = 1, 2),

s_{a_l} ← (1 − β) · s_{a_l} + β · max(|a_l|) / 127 (l = 1, 2),

where w_l (l = 1, 2) are the real and imaginary parts of the weights, a_l (l = 1, 2) are the real and imaginary parts of the activations, and β ∈ [0, 1]. β plays a regulating role: the smaller β is, the greater the influence of the historical value; conversely, the current value dominates.

The scale factors s_{w_l} of the real and imaginary parts of the weights in this step are computed dynamically from the weights w_l during training, whereas the activation scale factors s_{a_l} are updated with the exponential moving average above, in order to smooth them and prevent strong oscillation during training.
Furthermore, during back-propagation, since the rounding function round is not differentiable, a straight-through estimator (STE) can be employed to estimate its derivative, treating round as the identity in the backward pass, i.e. ∂round(x)/∂x ≈ 1.
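A minimal sketch of the fake-quantization forward pass and the STE backward pass (the function names and sample values are illustrative assumptions, not the patent's implementation):

```python
def fake_quant_forward(x, s):
    """Forward pass: quantize-dequantize with scale s.
    round() here is the non-differentiable step."""
    q = max(-127, min(127, round(x / s)))
    return q * s

def fake_quant_grad_ste(upstream):
    """Backward pass with the straight-through estimator:
    d(round(u))/du is approximated by 1, so the upstream
    gradient passes through unchanged (clipping ignored for brevity)."""
    return upstream

print(fake_quant_forward(3.2, 0.5))    # quantize-dequantize round trip -> 3.0
print(fake_quant_forward(100.0, 0.5))  # saturates at 127 * s -> 63.5
```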
Step S012, calculating the pseudo-quantization operators from the scale factors s_{w_l} of the real and imaginary parts of the weights and the scale factors s_{a_l} of the real and imaginary parts of the activations, the pseudo-quantization operators comprising a quantization operator CQuant and a dequantization operator CDequant.
Wherein CQuant(z) = Quant(z_r) + jQuant(z_i),

CDequant(z) = Dequant(z_r) + jDequant(z_i),

z is the complex number to be quantized, z_r and z_i respectively denote the real part and imaginary part of z, and j is the square root of −1;
Quant is the symmetric quantization operator on the real domain, Quant(x) = clamp(round(x/s), −127, 127), where clamp(x, a, b) constrains the value of x between a and b: it returns a if x is less than a, b if x is greater than b, and x otherwise;

Dequant is the corresponding dequantization operator on the real domain, Dequant(x) = x × s.
During quantization, in order to reduce the impact of quantization on model accuracy, independent scale factors are used for the weights and the activations, and for the real part and the imaginary part of each.
Step S013, inserting the pseudo-quantization operators into the computation graph of the three-dimensional convolutional neural network, thereby realizing quantization-aware training.
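The complex pseudo-quantization of steps S011 to S013 can be sketched as follows. This is a hedged illustration using Python complex numbers; the helper names and sample weight are assumptions, not the patent's code:

```python
def quant(x, s):
    """Symmetric real-domain quantization: clamp(round(x/s), -127, 127)."""
    return max(-127, min(127, round(x / s)))

def dequant(q, s):
    """Real-domain dequantization: Dequant(x) = x * s."""
    return q * s

def cquant(z, s_r, s_i):
    """CQuant: quantize real and imaginary parts with independent scales."""
    return complex(quant(z.real, s_r), quant(z.imag, s_i))

def cdequant(q, s_r, s_i):
    """CDequant: map the integer pair back to an approximate complex value."""
    return complex(dequant(q.real, s_r), dequant(q.imag, s_i))

# Max-absolute-value symmetric scales for each part, as in step S011
# (computed here from a single sample weight for illustration).
w = complex(0.5, -0.25)
s_r, s_i = abs(w.real) / 127, abs(w.imag) / 127
q = cquant(w, s_r, s_i)        # integer-valued complex, each part in [-127, 127]
w_hat = cdequant(q, s_r, s_i)  # pseudo-quantized weight seen by the network
```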
Step S02: optimizing the quantized three-dimensional convolutional neural network. The optimization is based on the characteristics of complex arithmetic and specifically comprises: optimization exploiting the conjugate symmetry of the real-sequence FFT, optimization based on a fast algorithm for complex multiplication, and DSP-based optimization of low-bit-width complex multiplication.
Step S021 mainly exploits the conjugate symmetry of the FFT of a real sequence. Complex-domain three-dimensional convolutional neural networks include those based on the traditional 3D FFT acceleration algorithm, those based on circulant-matrix compression with 1D FFT acceleration, and purely complex three-dimensional convolutional neural networks. For the first two, which are derived from the real domain by FFT transformation, the activations and weights in the complex domain satisfy conjugate symmetry, and the present invention exploits this property to further optimize the accelerator's storage and computation.
Taking the 1D FFT as an example, without loss of generality let x = [x_0, x_1, …, x_{N−1}]^T be a real sequence of length N; after the FFT it becomes the complex sequence X = [X_0, X_1, …, X_{N−1}]^T, which satisfies conjugate symmetry:

when N % 2 = 0, X_0 and X_{N/2} are real, and X_{N−i} = X_i^* (i = 1, 2, …, N/2 − 1);

when N % 2 = 1, X_0 is real, and X_{N−i} = X_i^* (i = 1, 2, …, (N − 1)/2),
where * denotes the complex conjugate, i.e. (a + bj)^* = a − bj. Therefore, when storing the frequency-domain weights (which may be converted to the frequency domain in advance) and activations, only part of the values need to be stored. For even N, only the N/2 + 1 complex numbers X_0, X_1, …, X_{N/2} need to be stored; moreover, since X_0 and X_{N/2} are both real, they can be further packed into a single complex number X_0 + jX_{N/2}, thereby halving the storage space, as shown in Fig. 2. In addition, since the product of conjugates equals the conjugate of the product, (z_1 z_2)^* = z_1^* z_2^* for arbitrary complex numbers z_1 and z_2, when performing element-wise multiplication of complex vectors in the frequency domain only the products of the first N/2 + 1 terms need to be computed; the results of the remaining N/2 − 1 products are obtained by conjugating the former, reducing the computation by nearly 50%. Accordingly, the main method of this step is: obtain the complex sequences produced by FFT transformation of the weights and activations; compress storage according to the conjugate symmetry of the real-sequence FFT, storing for even N only the N/2 + 1 complex numbers X_0, X_1, …, X_{N/2}, with the real numbers X_0 and X_{N/2} further packed into one complex number X_0 + jX_{N/2}; computationally, only the products of the first N/2 + 1 terms need to be calculated.
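The conjugate symmetry X_{N−i} = X_i^*, and the resulting ability to keep only the first N/2 + 1 terms, can be checked with a small pure-Python DFT (a naive O(N²) illustration with an assumed even-length sample sequence, not the accelerator's FFT):

```python
import cmath

def dft(x):
    """Naive DFT of a real sequence (illustration only, O(N^2))."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # real sequence, N = 8 (even)
X = dft(x)
N = len(x)

# Only X[0] .. X[N/2] (N/2 + 1 complex values) need to be kept;
# X[0] and X[N/2] are numerically real and could be packed as X[0] + j*X[N/2].
stored = X[:N // 2 + 1]

# The discarded tail is recovered by conjugate symmetry: X[N-i] = conj(X[i]).
recovered = stored + [z.conjugate() for z in reversed(stored[1:-1])]
```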
Step S022, this step is based on a fast algorithm for complex multiplication. Complex multiplication is the core operation of a complex-domain three-dimensional convolutional neural network. Unlike real multiplication, however, it has a more involved form: for complex numbers z_1 = a + bj and z_2 = c + dj, where a, b, c, d are real numbers and j denotes the square root of −1, the product of z_1 and z_2 is:

(a + bj)(c + dj) = (ac − bd) + (ad + bc)j

The above equation shows that 1 complex multiplication consists of 4 real multiplications and 2 real additions.
In order to reduce the computational complexity of complex multiplication, this step proceeds as follows: obtain a first complex number z_1 and a second complex number z_2 to be multiplied, with z_1 = a + bj and z_2 = c + dj, where a, b, c, d are real numbers and j denotes the square root of −1; convert the multiplication of z_1 and z_2 into (A − B) + (B − C)j, where A = (a + b)c, B = (c + d)b, C = (b − a)d.
After this optimization, the cost of a complex multiplication falls from the original 4 real multiplications and 2 real additions to 3 real multiplications and 5 real additions. Since the computational complexity of addition is much lower than that of multiplication, the optimization effectively reduces the overall complexity of complex multiplication, cutting hardware resource consumption (e.g., DSP slices) by about 25%.
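The fast algorithm of step S022 (the well-known Karatsuba-style 3-multiplication complex product) can be verified directly; this sketch cross-checks it against the direct 4-multiplication form:

```python
def cmul_fast(a, b, c, d):
    """Compute (a+bj)(c+dj) with 3 real multiplications and 5 real additions:
    A = (a+b)*c, B = (c+d)*b, C = (b-a)*d  ->  (A-B) + (B-C)j."""
    A = (a + b) * c
    B = (c + d) * b
    C = (b - a) * d
    return A - B, B - C   # (real part, imaginary part)

# Cross-check against the direct form (ac - bd) + (ad + bc)j.
re, im = cmul_fast(3, -2, 5, 7)
assert (re, im) == (3 * 5 - (-2) * 7, 3 * 7 + (-2) * 5)
print(re, im)  # 29 11
```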
Step S023, this step is based on DSP optimization of low-bit-width complex multiplication. The DSP is an important computing resource in an FPGA. Taking Xilinx FPGAs as an example, in the UltraScale/UltraScale+ series a DSP48E2 slice contains a 27 × 18-bit multiplier and a 48-bit adder. Because the DSP supports a fairly large bit width, it can be further optimized when the operands participating in the computation are narrow and data sharing exists, improving DSP utilization. Meanwhile, 3D-CNNs exhibit rich data reuse, including input feature map reuse, convolution kernel reuse (filter reuse), and convolutional reuse, which makes such data sharing possible. Furthermore, DNN model quantization has become an essential technique for high-performance DNN inference, and related research has demonstrated that low-bit-width quantization such as INT8, or even INT4, can still maintain good model accuracy on many neural networks.
Based on these three observations, the invention takes INT8 quantization as an example to describe an optimization method based on the DSP packing technique in the complex domain.
Consider three complex numbers x = a + bj, w_1 = x_1 + jy_1 and w_2 = x_2 + jy_2, where a, b, x_1, y_1, x_2, y_2 are real numbers and j denotes the square root of −1, and suppose o_1 = x × w_1 and o_2 = x × w_2 must both be computed; that is, one multiplicand is shared. This situation arises, for example, from input feature map reuse in the convolution computation.
To reduce the usage of DSP resources, the steps include:
obtaining a third complex number x, a fourth complex number w_1 and a fifth complex number w_2 to be multiplied, wherein the third complex number x undergoes a product operation with the fourth complex number w_1 and with the fifth complex number w_2 respectively, and x = a + bj, w_1 = x_1 + jy_1, w_2 = x_2 + jy_2, where a, b, x_1, y_1, x_2, y_2 are real numbers and j denotes the square root of −1;

converting the product of the third complex number x and the fourth complex number w_1 into (A_1 − B_1) + (B_1 − C_1)j, where A_1 = (a + b) × x_1, B_1 = (x_1 + y_1) × b, C_1 = (b − a) × y_1;

converting the product of the third complex number x and the fifth complex number w_2 into (A_2 − B_2) + (B_2 − C_2)j, where A_2 = (a + b) × x_2, B_2 = (x_2 + y_2) × b, C_2 = (b − a) × y_2.
A_1 and A_2 share the multiplicand a + b, B_1 and B_2 share the multiplicand b, and C_1 and C_2 share the multiplicand b − a. When the real and imaginary parts are 8-bit signed integers, a + b, b − a, x_1 + y_1 and x_2 + y_2 all require 9-bit signed integers, so the problem can be restated as: given two signed INT9 multiplications that share one multiplicand, how can the optimization reduce the number of DSP resources used?
The A_1/A_2, B_1/B_2 and C_1/C_2 pairs are thus abstracted as one shared multiplicand multiplied by two other numbers, which can be optimized further to reduce DSP resource consumption. Let r_1, r_2, r_3 all be of type INT9, with r_3 × r_1 and r_3 × r_2 to be computed while minimizing DSP usage. This step packs r_1 and r_2 into one wider word, multiplies it by r_3, and then splits r_3 × r_1 and r_3 × r_2 out of the result; the key is that r_3 × r_1 and r_3 × r_2 remain separable. Since r_3 × r_1 (or r_3 × r_2) needs at most 18 bits of storage, and the core multiplier of a DSP48E2 slice in the Xilinx ZCU102 FPGA is 18 × 27 bits, r_1 is first shifted left by 18 bits while r_2 is sign-extended to 27 bits; summing the two gives r_1 << 18 + r_2, which is then multiplied by r_3, i.e. o = r_3 × (r_1 << 18 + r_2). Finally, r_3 × r_1 and r_3 × r_2 are separated from the product o as: r_3 × r_1 = o_{35:18} + o_{17}, r_3 × r_2 = o_{17:0}, where o_{35:18} denotes taking bits 35 down to 18 of o as an 18-bit signed integer, and o_{17:0} likewise denotes taking bits 17 down to 0 of o as an 18-bit signed integer, as shown in Fig. 3; the added bit o_{17} corrects the borrow that occurs in the high part when r_3 × r_2 is negative. Through this optimization, a single DSP48E2 slice computes two INT9 multiplications (with one shared multiplicand), fully exploiting the computing capability of the DSP48E2 slice, markedly reducing DSP resource consumption, lowering the accelerator's resource usage, and improving overall computational efficiency.
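The INT9 packing trick of step S023 can be simulated in software to confirm that r_3 × r_1 and r_3 × r_2 are separable from the single wide product o = r_3 × (r_1 << 18 + r_2), including the o_{17} carry correction (a behavioral sketch of the arithmetic, not the DSP primitive itself):

```python
def dsp_pack_mul(r1, r2, r3):
    """Emulate one DSP48E2-style wide multiply computing two INT9 products
    that share the multiplicand r3. r1, r2, r3 are signed 9-bit values."""
    o = r3 * ((r1 << 18) + r2)          # single wide multiplication
    lo = o & 0x3FFFF                    # bits 17:0
    if lo & 0x20000:                    # sign-extend the low 18 bits
        lo -= 1 << 18
    hi = (o >> 18) + ((o >> 17) & 1)    # bits 35:18 plus the o_17 correction
    return hi, lo                       # (r3*r1, r3*r2)

# Spot-check the separation, including negative operands where the
# borrow correction via bit 17 matters.
for r1, r2, r3 in [(255, -256, -7), (-5, 7, -3), (100, -100, 255)]:
    assert dsp_pack_mul(r1, r2, r3) == (r3 * r1, r3 * r2)
```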
Step S03: deploying the three-dimensional convolutional neural network. Deployment follows existing deployment practice and is not elaborated further in this scheme.
Step S04, accelerating the three-dimensional convolutional neural network using a three-dimensional convolutional neural network accelerator as described in the first aspect and any one of its possible designs.
The accelerator and acceleration method of the invention realize acceleration of the complex-domain three-dimensional convolutional neural network and effectively alleviate the excessive storage and computation overhead of deploying three-dimensional convolutional neural networks. The 3D CNN undergoes quantization-aware training in the complex domain, and an efficient hardware architecture is designed. Targeting the characteristics of complex arithmetic, a series of optimizations further reduce storage and computation costs and improve accelerator performance, including: exploiting the conjugate symmetry of the real-sequence FFT to reduce storage and computation overhead; using a fast algorithm for complex multiplication to further reduce computation; and using the DSP packing technique in the FPGA, combined with the rich data reuse in 3D CNNs, to reduce the consumption of DSP and other resources and improve the accelerator's computational efficiency.
To assess the technical advantages of this solution, two 3D CNN models, C3D and 3D ResNet-18, were selected for experiments. During training, INT8 quantization was realized by inserting pseudo-quantization nodes. In the experiments, the circulant-matrix size was set to 8, with corresponding accuracy losses of 1.25% and 1.71% respectively, within an acceptable range. The hardware platform was a Xilinx ZCU102 FPGA with Vivado HLS 2019.2 as the development tool, and corresponding hardware accelerators were designed for C3D and 3D ResNet-18 respectively. For C3D, the accelerator is configured as (T_n, T_m) = (16, 64) and (T_d, T_r, T_c) = (2, 7, 7). For 3D ResNet-18, two hardware acceleration cores were designed, accelerating the 1 × 1 × 1 and 3 × 3 × 3 convolutions respectively, with parallelism (T_n, T_m) = (8, 32) and (32, 32); the block sizes (T_d, T_r, T_c) are both (2, 7, 7). The off-chip memory exchanges data with the accelerator over the AXI bus; the bit width of the AXI bus interface is set to 128, so 8 complex numbers (each with 8 bits for the real part and 8 bits for the imaginary part, 16 bits per complex number) can be read in parallel each time. After the accelerator design was completed, Xilinx Vivado 2019.2 was used for synthesis, place-and-route, and bitstream generation; the final clock frequency of both accelerators is 200 MHz. Table 1 shows the final resource and power consumption of the two accelerators, and Table 2 compares the two accelerators with other related works.
TABLE 1  Resources and power consumption of the accelerators

              LUT     FF      BRAM    DSP   Power (W)
C3D           69636   59264   448.5   452   5.461
3D ResNet-18  93150   89372   502.5   611   6.477
Table 2 comparison with other works
Comparison of results:
As can be seen from Table 2, the power consumption of the C3D and 3D ResNet-18 accelerators is 5.461 W and 6.477 W respectively, lower than all other works listed in the table, mainly because the invention's efficient compression scheme greatly reduces the power consumed by memory accesses and computation. Moreover, the two accelerators consume only 452 and 611 DSP slices, far fewer than the other works, yet their throughputs reach 1476.81 GOP/s and 876.27 GOP/s respectively. Among these, the C3D accelerator's 1476.81 GOP/s is the best computational performance among the listed works, a speedup of 1.1 to 9.15 times. The performance finally achieved by the 3D ResNet-18 accelerator is lower than that of the C3D accelerator, at 876.27 GOP/s, higher than works [2] and [3] but lower than work [1], mainly because the 3D ResNet-18 network contains 1 × 1 × 1 convolutions and convolutions with stride 2, whose computation-to-memory-access ratio is lower than that of the C3D accelerator and whose inter-layer differences are larger, leaving a larger gap between actual performance and the theoretical value. In terms of computational efficiency, the C3D accelerator reaches 3.267 GOP/s/DSP, 2.47 times that of work [1] and 11.19 to 15.78 times the remaining works in the table. Although the 3D ResNet-18 accelerator's performance is not the best, it attains the highest computational efficiency relative to the remaining works in the table (1.08 to 6.93 times), fully demonstrating the advantages of the complex-domain three-dimensional convolutional neural network accelerator designed by the invention in both performance and computational efficiency.
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and changes may be made without departing from the spirit and principles of the present invention.

Claims (10)

1. A three-dimensional convolutional neural network accelerator over a complex domain, the three-dimensional convolutional neural network comprising a convolutional layer, a fully-connected layer, a pooling layer, an activation layer, and a batch normalization layer, the accelerator comprising:
the cache unit is used for storing input features, output features and weight data in a complex domain;
the AXI DMA unit is used for data transmission between the accelerator and the off-chip memory;
the computing unit is used for accelerating the computation of the convolution layer and the full connection layer;
the post-processing unit is used for the integrated computation of the quantization layer, the pooling layer, the batch normalization layer and the activation layer;
and the control unit is used for controlling and scheduling the working states of the cache unit, the AXI DMA unit, the computing unit and the post-processing unit.
2. The three-dimensional convolutional neural network accelerator over complex domain of claim 1, wherein the buffer unit comprises:
the input feature caching unit is used for storing input features in a complex domain;
the output characteristic buffer unit is used for storing output characteristics in a complex domain;
and the weight caching unit is used for caching weights in the complex domain.
3. The three-dimensional convolutional neural network accelerator over complex domain of claim 1, wherein the computing unit comprises:
an arithmetic unit matrix comprising a plurality of arithmetic units PE, the arithmetic units PE being arranged as a two-dimensional matrix of size (T_m/B) × B, each arithmetic unit PE comprising T_n/B parallel complex multipliers and one complex addition tree for summing the outputs of the T_n/B parallel complex multipliers, where T_m is the block size of the output channel, B is the block size of the two-dimensional matrix, and T_n is the block size of the input channel; an address generator for generating address data of the input features, the output features and the weight data;
and the PE controller is used for controlling the working states of the operation unit PE and the address generator.
4. The three-dimensional convolutional neural network accelerator over complex domain of claim 1, wherein the AXI DMA unit comprises:
the data packaging unit is used for packaging the output data of the buffer unit to increase the bandwidth of the output data;
the data disassembling unit is used for disassembling the data of the off-chip memory to obtain the data required by the accelerator;
and the AXI DMA controller is used for controlling the working states of the data packaging unit and the data disassembling unit.
5. The three-dimensional convolutional neural network acceleration method on the complex domain is characterized by comprising the following steps of: quantizing the three-dimensional convolutional neural network;
deploying a three-dimensional convolutional neural network;
accelerating the three-dimensional convolutional neural network using a three-dimensional convolutional neural network accelerator on a complex domain as defined in any one of claims 1-4.
6. The method for accelerating a three-dimensional convolutional neural network over a complex domain according to claim 5, wherein said quantizing the three-dimensional convolutional neural network comprises:
calculating the scale factors s_{w_l} of the real part and imaginary part of the weights and the scale factors s_{a_l} of the real part and imaginary part of the activations; calculating pseudo-quantization operators from these scale factors, the pseudo-quantization operators comprising a quantization operator CQuant and a dequantization operator CDequant; and inserting the pseudo-quantization operators into the computation graph of the three-dimensional convolutional neural network;

wherein

s_{w_l} = max(|w_l|) / 127 (l = 1, 2) are the scale factors of the real part and imaginary part of the weights respectively,

s_{a_l} ← (1 − β) · s_{a_l} + β · max(|a_l|) / 127 (l = 1, 2),

w_l and a_l (l = 1, 2) respectively denote the real and imaginary parts of the weights and of the activations, and β ∈ [0, 1];
CQuant(z) = Quant(z_r) + jQuant(z_i),

CDequant(z) = Dequant(z_r) + jDequant(z_i),

wherein z is the complex number to be quantized, z_r and z_i respectively denote the real part and imaginary part of z, and j is the square root of −1;
Quant is the symmetric quantization operator on the real domain, Quant(x) = clamp(round(x/s), −127, 127), where clamp(x, a, b) constrains the value of x between a and b: it returns a if x is less than a, b if x is greater than b, and x otherwise;

Dequant is the corresponding dequantization operator on the real domain, Dequant(x) = x × s.
7. The three-dimensional convolutional neural network acceleration method over the complex domain according to claim 5, characterized in that the method further comprises, before deploying the three-dimensional convolutional neural network:
acquiring a complex sequence obtained by FFT conversion of a weight value and an activation value;
when N is even, storing only the N/2 + 1 complex numbers X_0, X_1, …, X_{N/2} of the complex sequence, and packing the real numbers X_0 and X_{N/2} into one complex number X_0 + jX_{N/2}.
8. The method for accelerating a three-dimensional convolutional neural network on a complex domain according to claim 5, wherein the method comprises the following steps: the method further comprises the step of optimizing the multiplication of complex numbers before the deployment of the three-dimensional convolutional neural network:
obtaining a first complex number z_1 and a second complex number z_2 to be multiplied, the first complex number z_1 = a + bj and the second complex number z_2 = c + dj, where a, b, c, d are real numbers and j denotes the square root of −1;

converting the multiplication of the first complex number z_1 and the second complex number z_2 into (A − B) + (B − C)j, where A = (a + b)c, B = (c + d)b, C = (b − a)d.
9. The method for accelerating a three-dimensional convolutional neural network on a complex domain according to claim 5, wherein the method comprises the following steps: the method further comprises the step of optimizing the multiplication of complex numbers before the deployment of the three-dimensional convolutional neural network:
obtaining a third complex number x, a fourth complex number w_1 and a fifth complex number w_2 to be multiplied, wherein the third complex number x undergoes a product operation with the fourth complex number w_1 and with the fifth complex number w_2 respectively, and x = a + bj, w_1 = x_1 + jy_1, w_2 = x_2 + jy_2, where a, b, x_1, y_1, x_2, y_2 are real numbers and j denotes the square root of −1;

converting the product of the third complex number x and the fourth complex number w_1 into (A_1 − B_1) + (B_1 − C_1)j, where A_1 = (a + b) × x_1, B_1 = (x_1 + y_1) × b, C_1 = (b − a) × y_1;

converting the product of the third complex number x and the fifth complex number w_2 into (A_2 − B_2) + (B_2 − C_2)j, where A_2 = (a + b) × x_2, B_2 = (x_2 + y_2) × b, C_2 = (b − a) × y_2.
10. The method for accelerating a three-dimensional convolutional neural network on a complex domain according to claim 9, wherein: further comprises:
shifting a first multiplier r_1 left by 18 bits and sign-extending a second multiplier r_2 to 27 bits, then summing the two to obtain r_1 << 18 + r_2, wherein r_1 and r_2 are real numbers;

calculating the product of r_1 << 18 + r_2 and a third multiplier r_3 to obtain o = r_3 × (r_1 << 18 + r_2);

separating the results of r_3 × r_1 and r_3 × r_2 from o.

Publications (1)

Publication Number: CN116596034A; Publication Date: 2023-08-15



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination