CN113554163A - Convolutional neural network accelerator - Google Patents

Convolutional neural network accelerator

Info

Publication number
CN113554163A
Authority
CN
China
Prior art keywords
scaling
multiplication
result
value
operation result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110849929.1A
Other languages
Chinese (zh)
Other versions
CN113554163B (en)
Inventor
Zheng Haisheng
Yu Bei
Shen Xiaoyong
Lv Jiangbo
Jia Jiaya
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Smartmore Technology Co Ltd
Shanghai Smartmore Technology Co Ltd
Original Assignee
Shenzhen Smartmore Technology Co Ltd
Shanghai Smartmore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Smartmore Technology Co Ltd, Shanghai Smartmore Technology Co Ltd filed Critical Shenzhen Smartmore Technology Co Ltd
Priority to CN202110849929.1A priority Critical patent/CN113554163B/en
Publication of CN113554163A publication Critical patent/CN113554163A/en
Application granted granted Critical
Publication of CN113554163B publication Critical patent/CN113554163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of artificial intelligence and provides a convolutional neural network accelerator, which comprises: a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree. The multiply-add tree is configured to perform a multiply-add operation on the quantized weight data and the quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel; the scaling operator is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel; the adder is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel, and the result output by the adder is used as the quantized convolution result. The accelerator provides a computation framework for the quantized convolutional neural network, realizes parallel operation on the channels of the feature map, increases the computation bandwidth per unit time, and thus achieves computation acceleration.

Description

Convolutional neural network accelerator
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a convolutional neural network accelerator.
Background
With their continuing development, convolutional neural networks are widely applied in fields such as computer vision, speech recognition, natural language processing and automatic driving.
A convolutional neural network has a large number of parameters and a large network volume, and therefore occupies considerable memory. Before a convolutional neural network is deployed on a computing chip, it is generally quantized to reduce the network volume and the memory footprint. It is therefore necessary to design an efficient accelerator for the quantized convolutional neural network.
Disclosure of Invention
In view of the above, it is necessary to provide a convolutional neural network accelerator in order to solve the above technical problems.
A convolutional neural network accelerator, comprising: a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree;
the multiply-add tree is configured to perform a multiply-add operation on quantized weight data and quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel;
the scaling operator is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel;
the adder is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel; the result output by the adder is used as a quantized convolution result.
In one embodiment, the multiply-add tree includes a first multiplier having a first input port and a second input port, wherein the input data bit width of the first input port is smaller than the input data bit width of the second input port;
the first multiplier is configured to obtain quantized weight data through the first input port, obtain two pieces of quantized feature data through the second input port, and multiply each of the two pieces of quantized feature data by the quantized weight data; the feature maps to which the two pieces of quantized feature data belong correspond to different pictures.
In one embodiment, the input data bit width of the second input port is equal to the number of input sub-ports included in the second input port;
the first multiplier is configured to obtain one of the two pieces of quantized feature data through a preset number of input sub-ports counted forward from the first input sub-port of the second input port, and to obtain the other through the preset number of input sub-ports counted backward from the last input sub-port of the second input port; the preset number is equal to the bit width of the quantized feature data.
In one embodiment, the scaling operator comprises a second multiplier and a shifter;
the second multiplier is configured to obtain an integer scaling value produced by amplifying a floating-point scaling value, multiply the multiply-add operation result by the integer scaling value, and send the resulting multiplication result to the shifter;
the shifter is configured to obtain the amplification ratio from the floating-point scaling value to the integer scaling value and divide the multiplication result by the amplification ratio.
In one embodiment, the integer scaling value is formed in the following way: if the floating-point scaling value is within a first numerical range, the floating-point scaling value is repeatedly doubled until the doubled value falls within a second numerical range; the doubled value is amplified according to the calculation bit width of the second multiplier, and the amplification result is rounded to obtain the integer scaling value;
wherein a value falling within the first numerical range is rounded to a value smaller than itself, and a value falling within the second numerical range is rounded to a value greater than itself.
In one embodiment, the integer scaling value is further formed by: raising 2 to the power of the calculation bit width of the second multiplier, and amplifying the repeatedly doubled floating-point scaling value by the result of this power operation.
In one embodiment, the amplification ratio is 2 to the power of n, where n is the sum of the calculation bit width of the second multiplier and the number of doubling iterations.
In one embodiment,
the shifter is further configured to perform a shift operation on the binary representation of the multiplication result, based on the binary representations of the multiplication result and the amplification ratio, to obtain a shift operation result;
the shifter is further configured to perform an exclusive-or operation on the binary representations of the multiplication result and the amplification ratio, and to determine the scaling operation result based on the exclusive-or operation result and the shift operation result.
In one embodiment,
the shifter is further configured to take the shift operation result plus 1 as the scaling operation result if the exclusive-or operation result is greater than half of the amplification ratio;
the shifter is further configured to take the shift operation result as the scaling operation result if the exclusive-or operation result is less than or equal to half of the amplification ratio.
In one embodiment, the convolutional neural network accelerator includes: an on-chip cache combined with the computation path, configured to store intermediate feature maps formed by the computation path during inference.
The above convolutional neural network accelerator comprises a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree; the multiply-add tree performs a multiply-add operation on quantized weight data and quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel; the scaling operator scales the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel; the adder performs zero-point adjustment on the scaling operation result of the feature map of the corresponding channel, and the result output by the adder is used as the quantized convolution result. In this application, each multiply-add tree of the convolutional neural network accelerator is connected in series with a scaling operator and an adder, which provides a computation framework for the quantized convolutional neural network; and because the accelerator comprises a plurality of multiply-add trees, the channels of the feature map are processed in parallel, which increases the computation bandwidth per unit time and thus accelerates the computation.
Drawings
FIG. 1 is a schematic diagram of the computation path of a convolution operation in a convolutional neural network in one embodiment;
FIG. 2 is a diagram illustrating split use of DSP multipliers sharing weights in one embodiment;
FIG. 3 is a diagram illustrating split use of DSP multipliers sharing weights in one embodiment;
FIG. 4 is an overall block diagram of a convolutional neural network accelerator in one embodiment;
FIG. 5 is a diagram of an NN computing architecture in one embodiment;
FIG. 6 is an architecture diagram of a PE in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The present application provides a convolutional neural network accelerator for a quantized (INT8) convolutional neural network, which can be implemented on an FPGA (Field-Programmable Gate Array). Specifically, as shown in fig. 1, the convolutional neural network accelerator mainly includes: a plurality of multiply-add trees 101 for operating on the channels of a multi-channel feature map in parallel, and a scaling operator 102 and an adder 103 connected in series after each multiply-add tree.
The multiply-add tree 101 is configured to perform a multiply-add operation on the quantized weight data and the quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel.
Each picture corresponds to a multi-channel feature map (also called a multi-dimensional feature map), and the feature data of the feature map of each channel is operated on with the weight data of the convolution kernel. When the convolutional neural network is quantized, both the feature data and the weight data are quantized; the quantized feature data is referred to as quantized feature data, and the quantized weight data as quantized weight data.
Taking a 3 × 3 convolution kernel as an example, the multiply-add tree 101 shown in fig. 1 includes 9 multipliers, each of which multiplies a piece of quantized weight data by a piece of quantized feature data of the feature map, and an adder included in the multiply-add tree 101 adds the multiplication results output by the multipliers to obtain the multiply-add operation result of the feature map.
The scaling operator 102 is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel.
The adder 103 is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel; the result output by the adder is used as the quantized convolution result.
Each multiply-add tree in the convolutional neural network accelerator is connected in series with a scaling operator and an adder, which provides a computation framework for the quantized convolutional neural network; and because the accelerator includes a plurality of multiply-add trees, the channels of the feature map are processed in parallel, which increases the computation bandwidth per unit time and thus accelerates the computation.
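For illustration, the following is a minimal bit-accurate sketch (in Python) of one channel's computation path, assuming a 3 × 3 kernel and INT8 operands; the function and parameter names are illustrative rather than taken from the patent, and the rounding shown uses the common add-half-then-shift form (the patent's own rounding scheme is described further below).

```python
import numpy as np

def channel_path(q_feat, q_weight, m1, n, zero_point):
    """Sketch of one channel's path: multiply-add tree -> scaling -> zero-point adjust.
    q_feat, q_weight: 3x3 INT8 arrays; m1, n, zero_point are illustrative names."""
    # Multiply-add tree: 9 element-wise products followed by an adder tree.
    acc = int(np.sum(q_feat.astype(np.int32) * q_weight.astype(np.int32)))
    # Scaling operator: integer multiply by the integer scaling value m1,
    # then divide by 2**n (the amplification ratio, described later in the text).
    prod = acc * m1
    scaled = (prod + (1 << (n - 1))) >> n   # rounded right shift (illustrative)
    # Adder: zero-point adjustment; the output is the quantized convolution result.
    return scaled + zero_point

# Several such paths run in parallel, one multiply-add tree per channel,
# which is what raises the computation bandwidth per unit time.
```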
In one embodiment, the scaling operator 102 includes a multiplier and a shifter. To distinguish the multipliers included in the multiply-add tree 101 from the multiplier included in the scaling operator 102, the former are referred to as first multipliers and the latter as the second multiplier.
The second multiplier is configured to obtain an integer scaling value produced by amplifying a floating-point scaling value, multiply the multiply-add operation result by the integer scaling value, and send the resulting multiplication result to the shifter; the shifter is configured to obtain the amplification ratio from the floating-point scaling value to the integer scaling value and divide the multiplication result by the amplification ratio.
The way the floating-point scaling value is amplified into the integer scaling value and the way the shifter performs the division are described below.
Firstly, the calculation path design of the convolution operation in the convolutional neural network accelerator specifically includes the following contents:
the INT8 convolution calculation formula is:
q_out = Z_out + M × Σ_i (q_w,i - Z_w) × (q_f,i - Z_f)
wherein
M = (S_w × S_f) / S_out
where S and Z denote the quantization scale and zero point of the weights (w), the input features (f) and the output (out), and q denotes the corresponding quantized values.
M may be referred to as a scaling value; in general, M is a value in the range [0, 1]. Multiplication by a floating-point value not only consumes computing resources but is also less efficient than integer multiplication, so this application converts M into an integer value for the computation.
[Algorithm pseudocode ①: converting the floating-point scaling value M into the integer scaling value M1]
As shown in algorithm pseudocode ①, the floating-point scaling value M is converted into the integer scaling value M1 using the idea of approximate calculation. Specifically: if the floating-point scaling value is within a first numerical range, it is repeatedly doubled until the doubled value falls within a second numerical range; the doubled value is then amplified according to the calculation bit width of the second multiplier (for example, a calculation bit width of 32), and the amplification result is rounded to obtain the integer scaling value. A value falling within the first numerical range is rounded to a value smaller than itself, whereas a value falling within the second numerical range is rounded to a value greater than itself.
Further, amplifying the repeatedly doubled floating-point scaling value according to the calculation bit width of the second multiplier may specifically be done as follows: 2 is raised to the power of the calculation bit width of the second multiplier, and the repeatedly doubled floating-point scaling value is amplified by the result of this power operation.
Illustratively, if M lies within the range [0, 0.5], M is repeatedly doubled until it lies within the range [0.5, 1], and the number of doublings is recorded (denoted Shift); the doubled M is then amplified by 2^32, and the amplified value is rounded to the nearest integer to obtain M1, i.e. the integer scaling value.
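A sketch of this conversion in Python is given below, under the assumption that the second multiplier's calculation bit width is 32 and that M initially lies in (0, 0.5]; the names are illustrative.

```python
def float_scale_to_int(m: float, calc_bit_width: int = 32):
    """Convert the floating-point scaling value M into the integer scaling
    value M1 plus a doubling count Shift, following the description above."""
    shift = 0
    while m < 0.5:              # repeatedly double M until it is in [0.5, 1)
        m *= 2.0
        shift += 1
    m1 = int(round(m * (1 << calc_bit_width)))   # amplify by 2**32 and round
    return m1, shift            # M is approximately M1 / 2**(shift + calc_bit_width)

# Example: for M = 0.3, one doubling gives 0.6, and float_scale_to_int(0.3)
# returns (2576980378, 1), so M is approximately 2576980378 / 2**33.
```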
In the above manner, the integer scaling value M1 retains at least 2^30 of precision relative to the floating-point scaling value M, which guarantees the network inference accuracy.
After the floating-point scaling value M is converted into the integer scaling value M1, the above convolution calculation formula becomes:
q_out = Z_out + (M1 × Σ_i (q_w,i - Z_w) × (q_f,i - Z_f)) / 2^(Shift+32)
It can be seen that the amplification ratio from the floating-point scaling value M to the integer scaling value M1 is 2^(Shift+32); that is, the amplification ratio is 2 to the power of n, where n is the sum of the calculation bit width of the second multiplier and the number of doubling iterations.
Ignoring the fractional part, in binary arithmetic dividing by a power of 2 is equivalent to shifting the binary representation of the dividend to the right by that number of bits, and a shift operation is a very efficient operation for the computation logic.
[Algorithm pseudocode ②: shift-based division with rounding]
For INT8 convolutional neural network inference, the rounding operation only needs to consider the integer bits, so a single shift operation is sufficient to obtain the same calculation result as the above expression, as shown in algorithm pseudocode ②.
In one embodiment, the shifter is further configured to perform a shift operation on the binary representation of the multiplication result, based on the binary representations of the multiplication result and the amplification ratio, to obtain a shift operation result; the shifter is further configured to perform an exclusive-or operation on the binary representations of the multiplication result and the amplification ratio, and to determine the scaling operation result based on the exclusive-or operation result and the shift operation result.
Take 15 ÷ 2^3 as an example: the binary representation of 15 is 1111 and that of 2^3 is 1000, so right-shifting the binary representation of 15 by 3 bits gives a shift operation result of 1. However, the actual (rounded) result of 15 ÷ 2^3 is 2, so direct shifting would produce a calculation error.
In this case, the binary representation of the multiplication result and the binary representation of the amplification ratio may be XORed, and the scaling operation result determined based on the exclusive-or operation result and the shift operation result. For example, the binary 1111 of 15 and the binary 1000 of 2^3 are XORed to obtain an exclusive-or operation result of 111, and the scaling operation result is then determined based on this exclusive-or operation result and the shift operation result.
Further, the shifter is configured to take the shift operation result plus 1 as the scaling operation result if the exclusive-or operation result is greater than half of the amplification ratio, and to take the shift operation result itself as the scaling operation result if the exclusive-or operation result is less than or equal to half of the amplification ratio.
Continuing with the exclusive-or operation result 111 as an example: 111 (i.e. 7) is greater than half of 2^3 (i.e. 4), so 1 is added to the shift operation result 1, and the resulting value 2 is the scaling operation result. Similarly, if the exclusive-or operation result were less than or equal to half of 2^3, the shift operation result would be taken directly as the scaling operation result.
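A Python sketch of this shift-based division with rounding is shown below for a non-negative multiplication result; the patent's worked example expresses the remainder extraction as an XOR of the product with the amplification ratio, and here the equivalent comparison of the low n bits (the remainder) with half of the ratio is used, which reproduces that example.

```python
def rounded_shift_divide(product: int, n: int) -> int:
    """Divide a non-negative product by 2**n with rounding, using one shift."""
    shifted = product >> n                 # shift operation result
    remainder = product & ((1 << n) - 1)   # low n bits of the product
    if remainder > (1 << n) // 2:          # more than half the amplification ratio
        return shifted + 1                 # round up
    return shifted                         # otherwise keep the shift result

# Worked example from the description: 15 / 2**3.
# 15 >> 3 = 1, remainder 7 > 4, so rounded_shift_divide(15, 3) returns 2.
```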
With timing optimization, this method can be completed in a single clock cycle, which improves the calculation efficiency.
In one embodiment, the multiply-add tree 101 includes a first multiplier having a first input port and a second input port, wherein the input data bit width of the first input port is smaller than that of the second input port. The first multiplier is configured to obtain quantized weight data through the first input port, obtain two pieces of quantized feature data through the second input port, and multiply each of the two pieces of quantized feature data by the quantized weight data; the feature maps to which the two pieces of quantized feature data belong correspond to different pictures.
Illustratively, in an FPGA the multiplier may be implemented with a DSP (Digital Signal Processing) resource on the FPGA chip. The DSP has two input ports, such as the input port A and input port B shown in fig. 2, whose input data bit widths are 25 bits and 18 bits, respectively.
The DSP obtains two pieces of quantized feature data (Feature1 and Feature2) through input port A, where the feature maps to which Feature1 and Feature2 belong correspond to different pictures. The DSP obtains the quantized weight data (Weight) through input port B and multiplies Feature1 and Feature2 by the quantized weight data respectively; that is, the quantized weight data is shared by the two pictures.
In the above manner, one input port of the multiplier is single-path multiplexed and a single multiplier is split for use, forming the schematic diagram shown in fig. 3, so that the quantized feature data of two pictures is inferred at the same time; correspondingly, the inference Batch Size parameter is set to 2. Only a small amount of logic code needs to be changed: in the computation path, only the data bit width and the number of adders need to be increased.
Further, the input data bit width of the second input port is equal to the number of input sub-ports included in the second input port. The first multiplier is configured to obtain one of the two pieces of quantized feature data through a preset number of input sub-ports counted forward from the first input sub-port of the second input port, and to obtain the other through the preset number of input sub-ports counted backward from the last input sub-port of the second input port; the preset number is equal to the bit width of the quantized feature data.
The following description takes the first input port as input port B and the second input port as input port A: the input data bit width of input port A is 25 bits, i.e. input port A includes 25 input sub-ports, which can be denoted A[24:0]; the input data bit width of input port B is 18 bits, i.e. input port B includes 18 input sub-ports, which can be denoted B[17:0]. If the bit width of the quantized feature data is 8 bits, the DSP may obtain Feature2 through input sub-ports A[7:0] and Feature1 through input sub-ports A[24:17].
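A small Python model of this split use of one DSP multiplication is sketched below, assuming unsigned 8-bit operands for simplicity (signed operation would need additional correction logic not shown here); port and variable names follow the description above, but the function itself is illustrative.

```python
def packed_dsp_multiply(feature1: int, feature2: int, weight: int):
    """Two 8-bit features share one multiplication with a common 8-bit weight."""
    # Pack Feature1 into A[24:17] and Feature2 into A[7:0]; bits A[16:8] stay zero.
    port_a = (feature1 << 17) | feature2
    product = port_a * weight                # one DSP multiplication, B = Weight
    # Feature2 * Weight fits in 16 bits, so the two results do not overlap.
    f2_times_w = product & ((1 << 17) - 1)   # lower field: Feature2 * Weight
    f1_times_w = product >> 17               # upper field: Feature1 * Weight
    return f1_times_w, f2_times_w

# Example: packed_dsp_multiply(100, 200, 50) returns (5000, 10000).
```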
In the above manner, input port A, which receives the feature data, is single-path multiplexed while input port B, which receives the weight data, is not, for the following reason: if both input port A and input port B were single-path multiplexed, the bit width of the quantized data would be too small and the accuracy of the network would drop.
Illustratively, let the bit width of the quantized feature data and the quantized weight data be W bits. If both input ports A and B are split into two paths, the split inputs are a1, a2, b1 and b2, and the zero-valued guard bit widths in the middle of A and B are 25 - 2W and 18 - 2W, respectively; A and B can then be expressed as:
A = a2 × 2^(25-W) + a1, B = b2 × 2^(18-W) + b1
and the product P can be expressed as:
P = A × B = a2·b2 × 2^(43-2W) + a2·b1 × 2^(25-W) + a1·b2 × 2^(18-W) + a1·b1
When the DSP is split for use, the required operation results are a2·b2 and a1·b1, and the cross terms a2·b1 and a1·b2 may contaminate the results of a2·b2 and a1·b1. The maximum bit width of the a2·b2 and a1·b1 results is 2W, so in order to ensure that these results are not contaminated, W must satisfy the following three inequalities:
25 - W > 2W, 18 - W > 2W, max(25 + W, 18 + W) + 1 < 43 - 2W
Solving these three inequalities gives a maximum value of W of 5; however, when the feature data and weight data of a convolutional neural network are quantized to 4 bits, the accuracy drops. Therefore only input port A of the DSP is single-path multiplexed, and the above expressions reduce to:
P = A × B = a2·b × 2^(25-W) + a1·b, with the constraint 25 - W > 2W
The maximum integer value of W obtained from this constraint is 8, so only input port A is single-path multiplexed in order to guarantee the detection precision and accuracy of the network.
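The bit-width constraints above can be checked with a few lines of Python; the helper names are illustrative.

```python
def both_ports_split_ok(w: int) -> bool:
    # Guard conditions so the cross terms do not contaminate a2*b2 and a1*b1.
    return (25 - w > 2 * w) and (18 - w > 2 * w) and (max(25 + w, 18 + w) + 1 < 43 - 2 * w)

def port_a_only_split_ok(w: int) -> bool:
    # Only the 25-bit port A carries two operands.
    return 25 - w > 2 * w

print(max(w for w in range(1, 16) if both_ports_split_ok(w)))    # prints 5
print(max(w for w in range(1, 16) if port_a_only_split_ok(w)))   # prints 8
```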
Regarding the on-chip cache design of the convolutional neural network accelerator: the convolution calculation needs to repeatedly read and write the weight parameters and the feature maps. If these operations interacted directly with the DDR, the logic code would be simpler to write, but power consumption would increase and the read bandwidth would decrease; this problem can therefore be solved with an on-chip cache.
During the calculation of a single layer, the computation path of that layer reads the Input feature map for calculation while producing the Output feature map of the layer, and the Output of this layer is the Input of the next layer's calculation. The concept of ping-pong operation (FIFO) is therefore introduced into the on-chip cache design: while a feature map is being written into one on-chip cache block, the computation path of the next layer can read a feature map from the other cache block. This avoids communication with the DDR (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory), reduces power consumption and at the same time improves the data read/write bandwidth.
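As a rough illustration, a toy Python model of the ping-pong cache behaviour described above is given below; buffer granularity, sizes and method names are illustrative and not taken from the patent.

```python
class PingPongCache:
    """Two on-chip buffers: one is written by the current layer while the
    other is read by the next layer's computation path, avoiding DDR traffic."""
    def __init__(self):
        self.buffers = [None, None]
        self.write_idx = 0                        # buffer written by the current layer

    def write_output(self, feature_map):
        self.buffers[self.write_idx] = feature_map

    def read_input(self):
        return self.buffers[self.write_idx ^ 1]   # the other buffer feeds the next layer

    def swap(self):
        self.write_idx ^= 1                       # exchange roles at the layer boundary
```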
Based on the above computation path and storage structure designs, the overall architecture of the convolutional neural network accelerator is given (as shown in fig. 4 and fig. 5). The convolutional neural network accelerator includes an on-chip cache combined with the computation path, which stores the intermediate feature maps formed by the computation path during inference and can be reused multiple times; all data scheduling and computation path configuration are completed by the controller inside each Processing Element (PE) in coordination with the main controller. The architecture of a PE is shown in fig. 6.
The convolutional neural network accelerator provided by the application may include an input instruction module, an operation module, a data handling module and the like. By designing an efficient INT8 accelerator for the quantized neural network model on an FPGA, an operation efficiency tens of times higher than that of a Graphics Processing Unit (GPU) can be obtained.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A convolutional neural network accelerator, comprising: a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree;
the multiply-add tree is configured to perform a multiply-add operation on quantized weight data and quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel;
the scaling operator is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel;
the adder is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel, and the result output by the adder is used as a quantized convolution result.
2. The convolutional neural network accelerator of claim 1, wherein the multiply-add tree comprises a first multiplier having a first input port and a second input port, and the input data bit width of the first input port is smaller than the input data bit width of the second input port;
the first multiplier is configured to obtain quantized weight data through the first input port, obtain two pieces of quantized feature data through the second input port, and multiply each of the two pieces of quantized feature data by the quantized weight data; the feature maps to which the two pieces of quantized feature data belong correspond to different pictures.
3. The convolutional neural network accelerator of claim 2, wherein the input data bit width of the second input port is equal to the number of input sub-ports included in the second input port;
the first multiplier is configured to obtain one of the two pieces of quantized feature data through a preset number of input sub-ports counted forward from the first input sub-port of the second input port, and to obtain the other through the preset number of input sub-ports counted backward from the last input sub-port of the second input port; the preset number is equal to the bit width of the quantized feature data.
4. The convolutional neural network accelerator of claim 1, wherein the scaling operator comprises a second multiplier and a shifter;
the second multiplier is configured to obtain an integer scaling value produced by amplifying a floating-point scaling value, multiply the multiply-add operation result by the integer scaling value, and send the resulting multiplication result to the shifter;
the shifter is configured to obtain the amplification ratio from the floating-point scaling value to the integer scaling value and divide the multiplication result by the amplification ratio.
5. The convolutional neural network accelerator of claim 4, wherein the integer scaling value is formed in the following way: if the floating-point scaling value is within a first numerical range, the floating-point scaling value is repeatedly doubled until the doubled value falls within a second numerical range; the doubled value is amplified according to the calculation bit width of the second multiplier, and the amplification result is rounded to obtain the integer scaling value;
wherein a value falling within the first numerical range is rounded to a value smaller than itself, and a value falling within the second numerical range is rounded to a value greater than itself.
6. The convolutional neural network accelerator of claim 5, wherein the integer scaling value is further formed by: raising 2 to the power of the calculation bit width of the second multiplier, and amplifying the repeatedly doubled floating-point scaling value by the result of this power operation.
7. The convolutional neural network accelerator of claim 6, wherein the amplification ratio is 2 to the power of n, and n is the sum of the calculation bit width of the second multiplier and the number of doubling iterations.
8. The convolutional neural network accelerator of claim 4, wherein
the shifter is further configured to perform a shift operation on the binary representation of the multiplication result, based on the binary representations of the multiplication result and the amplification ratio, to obtain a shift operation result;
the shifter is further configured to perform an exclusive-or operation on the binary representations of the multiplication result and the amplification ratio, and to determine the scaling operation result based on the exclusive-or operation result and the shift operation result.
9. The convolutional neural network accelerator of claim 8, wherein
the shifter is further configured to take the shift operation result plus 1 as the scaling operation result if the exclusive-or operation result is greater than half of the amplification ratio;
the shifter is further configured to take the shift operation result as the scaling operation result if the exclusive-or operation result is less than or equal to half of the amplification ratio.
10. The convolutional neural network accelerator of claim 1, further comprising: an on-chip cache combined with the computation path, configured to store intermediate feature maps formed by the computation path during inference.
CN202110849929.1A 2021-07-27 2021-07-27 Convolutional neural network accelerator Active CN113554163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110849929.1A CN113554163B (en) 2021-07-27 2021-07-27 Convolutional neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110849929.1A CN113554163B (en) 2021-07-27 2021-07-27 Convolutional neural network accelerator

Publications (2)

Publication Number Publication Date
CN113554163A true CN113554163A (en) 2021-10-26
CN113554163B CN113554163B (en) 2024-03-29

Family

ID=78132958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110849929.1A Active CN113554163B (en) 2021-07-27 2021-07-27 Convolutional neural network accelerator

Country Status (1)

Country Link
CN (1) CN113554163B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5473730A (en) * 1993-11-09 1995-12-05 At&T Ipm Corp. High efficiency learning network
CN109063825A (en) * 2018-08-01 2018-12-21 清华大学 Convolutional neural networks accelerator
CN110059733A (en) * 2019-04-01 2019-07-26 苏州科达科技股份有限公司 The optimization and fast target detection method, device of convolutional neural networks
WO2019183202A1 (en) * 2018-03-23 2019-09-26 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd
US20190303103A1 (en) * 2018-03-30 2019-10-03 Intel Corporation Common factor mass multiplication circuitry
CN111709522A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Deep learning target detection system based on server-embedded cooperation
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
US20210056397A1 (en) * 2019-08-23 2021-02-25 Nvidia Corporation Neural network accelerator using logarithmic-based arithmetic

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BENOIT JACOB等: "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", 《ARXIV:1712.05877 [CS.LG]》, pages 1 - 14 *
ZIHAOZHAO: "深度学习加速器入门(二)数据复用", pages 1 - 2, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/366450277> *
高晗,等: "深度学习模型压缩与加速综述", 《软件学报》, pages 68 - 92 *

Also Published As

Publication number Publication date
CN113554163B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN109063825B (en) Convolutional neural network accelerator
CN107145939B (en) Computer vision processing method and device of low-computing-capacity processing equipment
CN108647773B (en) Hardware interconnection system capable of reconstructing convolutional neural network
CN106445471A (en) Processor and method for executing matrix multiplication on processor
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN112668708B (en) Convolution operation device for improving data utilization rate
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN112799634B (en) Based on base 2 2 MDC NTT structured high performance loop polynomial multiplier
US20210044303A1 (en) Neural network acceleration device and method
TW202013261A (en) Arithmetic framework system and method for operating floating-to-fixed arithmetic framework
CN110766136B (en) Compression method of sparse matrix and vector
CN113554163B (en) Convolutional neural network accelerator
US20230259743A1 (en) Neural network accelerator with configurable pooling processing unit
CN113870090B (en) Method, graphics processing apparatus, system, and medium for implementing functions
US20220156043A1 (en) Apparatus and Method for Processing Floating-Point Numbers
Lu et al. A reconfigurable DNN training accelerator on FPGA
CN113138748B (en) Configurable CNN multiplication accumulator supporting 8bit and 16bit data based on FPGA
CN115167815A (en) Multiplier-adder circuit, chip and electronic equipment
CN114168106A (en) Data processing method, device and equipment based on convolutional neural network
US10761847B2 (en) Linear feedback shift register for a reconfigurable logic unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zheng Haisheng

Inventor after: Shen Xiaoyong

Inventor after: Lv Jiangbo

Inventor before: Zheng Haisheng

Inventor before: Yu Bei

Inventor before: Shen Xiaoyong

Inventor before: Lv Jiangbo

Inventor before: Jia Jiaya

GR01 Patent grant