CN113554163A - Convolutional neural network accelerator - Google Patents

Convolutional neural network accelerator

Info

Publication number
CN113554163A
Authority
CN
China
Prior art keywords
scaling
multiplication
result
value
operation result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110849929.1A
Other languages
Chinese (zh)
Other versions
CN113554163B (en)
Inventor
Zheng Haisheng
Yu Bei
Shen Xiaoyong
Lv Jiangbo
Jia Jiaya
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Smartmore Technology Co Ltd
Shanghai Smartmore Technology Co Ltd
Original Assignee
Shenzhen Smartmore Technology Co Ltd
Shanghai Smartmore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Smartmore Technology Co Ltd, Shanghai Smartmore Technology Co Ltd filed Critical Shenzhen Smartmore Technology Co Ltd
Priority to CN202110849929.1A priority Critical patent/CN113554163B/en
Publication of CN113554163A publication Critical patent/CN113554163A/en
Application granted granted Critical
Publication of CN113554163B publication Critical patent/CN113554163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of artificial intelligence and provides a convolutional neural network accelerator, which comprises: a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree. The multiply-add tree is configured to perform a multiply-add operation on the quantized weight data and the quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel; the scaling operator is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel; the adder is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel, and the result output by the adder is used as the quantized convolution result. The accelerator provides a computation framework for the quantized convolutional neural network, realizes parallel operation on the channels of the feature map, increases the computation bandwidth per unit time, and thus achieves computation acceleration.

Description

Convolutional neural network accelerator
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a convolutional neural network accelerator.
Background
With their continuing development, convolutional neural networks are widely applied in fields such as computer vision, speech recognition, natural language processing and automatic driving.
A convolutional neural network has a large number of parameters and a large network volume, and therefore occupies considerable memory. Before a convolutional neural network is deployed on a computing chip, it is generally quantized to reduce the network volume and the memory footprint. It is therefore necessary to design an efficient accelerator for the quantized convolutional neural network.
Disclosure of Invention
In view of the above, it is necessary to provide a convolutional neural network accelerator in order to solve the above technical problems.
A convolutional neural network accelerator, comprising: a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree;
the multiply-add tree is configured to perform a multiply-add operation on quantized weight data and quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel;
the scaling operator is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel;
the adder is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel; the result output by the adder is used as a quantized convolution result.
In one embodiment, the multiply-add tree includes a first multiplier having a first input port and a second input port, wherein the input data bit width of the first input port is smaller than the input data bit width of the second input port;
the first multiplier is configured to obtain quantized weight data through the first input port, obtain two pieces of quantized feature data through the second input port, and multiply each of the two pieces of quantized feature data by the quantized weight data; the feature maps to which the two pieces of quantized feature data belong correspond to different pictures.
In one embodiment, the input data bit width of the second input port is equal to the number of input sub-ports included in the second input port;
the first multiplier is configured to obtain one of the two pieces of quantized feature data through a preset number of input sub-ports counted forward from the first input sub-port of the second input port, and to obtain the other through the preset number of input sub-ports counted backward from the last input sub-port of the second input port; the preset number is equal to the bit width of the quantized feature data.
In one embodiment, the scaling operator comprises a second multiplier and a shifter;
the second multiplier is configured to obtain an integer scaling value produced by amplifying a floating-point scaling value, multiply the multiply-add operation result by the integer scaling value, and send the resulting multiplication result to the shifter;
the shifter is configured to obtain the amplification ratio from the floating-point scaling value to the integer scaling value and divide the multiplication result by the amplification ratio.
In one embodiment, the integer scaling value is formed in the following way: if the floating-point scaling value is within a first numerical range, the floating-point scaling value is repeatedly doubled until the doubled value falls within a second numerical range; the doubled value is amplified according to the calculation bit width of the second multiplier, and the amplification result is rounded to obtain the integer scaling value;
wherein a value falling within the first numerical range is rounded to a value smaller than itself, and a value falling within the second numerical range is rounded to a value greater than itself.
In one embodiment, the integer scaling value is further formed by: raising 2 to the power of the calculation bit width of the second multiplier, and amplifying the repeatedly doubled floating-point scaling value by the result of this power operation.
In one embodiment, the amplification ratio is 2 to the power of n, where n is the sum of the calculation bit width of the second multiplier and the number of doubling iterations.
In one embodiment,
the shifter is further configured to perform a shift operation on the binary representation of the multiplication result, based on the binary representations of the multiplication result and the amplification ratio, to obtain a shift operation result;
the shifter is further configured to perform an exclusive-or operation on the binary representations of the multiplication result and the amplification ratio, and to determine the scaling operation result based on the exclusive-or operation result and the shift operation result.
In one embodiment,
the shifter is further configured to take the shift operation result plus 1 as the scaling operation result if the exclusive-or operation result is greater than half of the amplification ratio;
the shifter is further configured to take the shift operation result as the scaling operation result if the exclusive-or operation result is less than or equal to half of the amplification ratio.
In one embodiment, the convolutional neural network accelerator includes: an on-chip cache combined with the computation path, configured to store intermediate feature maps formed by the computation path during inference.
The above convolutional neural network accelerator comprises a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree; the multiply-add tree performs a multiply-add operation on quantized weight data and quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel; the scaling operator scales the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel; the adder performs zero-point adjustment on the scaling operation result of the feature map of the corresponding channel, and the result output by the adder is used as the quantized convolution result. In this application, each multiply-add tree of the convolutional neural network accelerator is connected in series with a scaling operator and an adder, which provides a computation framework for the quantized convolutional neural network; and because the accelerator comprises a plurality of multiply-add trees, the channels of the feature map are processed in parallel, which increases the computation bandwidth per unit time and thus accelerates the computation.
Drawings
FIG. 1 is a schematic diagram of the computation path of a convolution operation in a convolutional neural network in one embodiment;
FIG. 2 is a diagram illustrating split use of DSP multipliers sharing weights in one embodiment;
FIG. 3 is a diagram illustrating split use of DSP multipliers sharing weights in one embodiment;
FIG. 4 is an overall block diagram of a convolutional neural network accelerator in one embodiment;
FIG. 5 is a diagram of an NN computing architecture in one embodiment;
FIG. 6 is an architecture diagram of a PE in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The present application provides a convolutional neural network accelerator for a quantized (INT8) convolutional neural network, which can be implemented on an FPGA (Field-Programmable Gate Array). Specifically, as shown in fig. 1, the convolutional neural network accelerator mainly includes: a plurality of multiply-add trees 101 for operating on the channels of a multi-channel feature map in parallel, and a scaling operator 102 and an adder 103 connected in series after each multiply-add tree.
The multiply-add tree 101 is configured to perform a multiply-add operation on the quantized weight data and the quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel.
Each picture corresponds to a multi-channel feature map (also called a multi-dimensional feature map), and the feature data of the feature map of each channel is operated on with the weight data of the convolution kernel. When the convolutional neural network is quantized, both the feature data and the weight data are quantized; the quantized feature data is referred to as quantized feature data, and the quantized weight data as quantized weight data.
Taking a 3 × 3 convolution kernel as an example, the multiply-add tree 101 shown in fig. 1 includes 9 multipliers, each of which multiplies a piece of quantized weight data by a piece of quantized feature data of the feature map, and an adder included in the multiply-add tree 101 adds the multiplication results output by the multipliers to obtain the multiply-add operation result of the feature map.
The scaling operator 102 is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel.
The adder 103 is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel; the result output by the adder is used as the quantized convolution result.
Each multiply-add tree in the convolutional neural network accelerator is connected in series with a scaling operator and an adder, which provides a computation framework for the quantized convolutional neural network; and because the accelerator includes a plurality of multiply-add trees, the channels of the feature map are processed in parallel, which increases the computation bandwidth per unit time and thus accelerates the computation.
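For illustration, the following is a minimal bit-accurate sketch (in Python) of one channel's computation path, assuming a 3 × 3 kernel and INT8 operands; the function and parameter names are illustrative rather than taken from the patent, and the rounding shown uses the common add-half-then-shift form (the patent's own rounding scheme is described further below).

```python
import numpy as np

def channel_path(q_feat, q_weight, m1, n, zero_point):
    """Sketch of one channel's path: multiply-add tree -> scaling -> zero-point adjust.
    q_feat, q_weight: 3x3 INT8 arrays; m1, n, zero_point are illustrative names."""
    # Multiply-add tree: 9 element-wise products followed by an adder tree.
    acc = int(np.sum(q_feat.astype(np.int32) * q_weight.astype(np.int32)))
    # Scaling operator: integer multiply by the integer scaling value m1,
    # then divide by 2**n (the amplification ratio, described later in the text).
    prod = acc * m1
    scaled = (prod + (1 << (n - 1))) >> n   # rounded right shift (illustrative)
    # Adder: zero-point adjustment; the output is the quantized convolution result.
    return scaled + zero_point

# Several such paths run in parallel, one multiply-add tree per channel,
# which is what raises the computation bandwidth per unit time.
```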
In one embodiment, the scaling operator 102 includes a multiplier and a shifter. To distinguish the multipliers included in the multiply-add tree 101 from the multiplier included in the scaling operator 102, the former are referred to as first multipliers and the latter as the second multiplier.
The second multiplier is configured to obtain an integer scaling value produced by amplifying a floating-point scaling value, multiply the multiply-add operation result by the integer scaling value, and send the resulting multiplication result to the shifter; the shifter is configured to obtain the amplification ratio from the floating-point scaling value to the integer scaling value and divide the multiplication result by the amplification ratio.
The way the floating-point scaling value is amplified into the integer scaling value and the way the shifter performs the division are described below.
Firstly, the calculation path design of the convolution operation in the convolutional neural network accelerator specifically includes the following contents:
the INT8 convolution calculation formula is:
q_out = Z_out + M × Σ_i (q_w,i - Z_w) × (q_f,i - Z_f)
wherein
M = (S_w × S_f) / S_out
where S and Z denote the quantization scale and zero point of the weights (w), the input features (f) and the output (out), and q denotes the corresponding quantized values.
M may be referred to as a scaling value; in general, M is a value in the range [0, 1]. Multiplication by a floating-point value not only consumes computing resources but is also less efficient than integer multiplication, so this application converts M into an integer value for the computation.
[Algorithm pseudocode ①: converting the floating-point scaling value M into the integer scaling value M1]
As shown in algorithm pseudocode ①, the floating-point scaling value M is converted into the integer scaling value M1 using the idea of approximate calculation. Specifically: if the floating-point scaling value is within a first numerical range, it is repeatedly doubled until the doubled value falls within a second numerical range; the doubled value is then amplified according to the calculation bit width of the second multiplier (for example, a calculation bit width of 32), and the amplification result is rounded to obtain the integer scaling value. A value falling within the first numerical range is rounded to a value smaller than itself, whereas a value falling within the second numerical range is rounded to a value greater than itself.
Further, amplifying the repeatedly doubled floating-point scaling value according to the calculation bit width of the second multiplier may specifically be done as follows: 2 is raised to the power of the calculation bit width of the second multiplier, and the repeatedly doubled floating-point scaling value is amplified by the result of this power operation.
Illustratively, if M lies within the range [0, 0.5], M is repeatedly doubled until it lies within the range [0.5, 1], and the number of doublings is recorded (denoted Shift); the doubled M is then amplified by 2^32, and the amplified value is rounded to the nearest integer to obtain M1, i.e. the integer scaling value.
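A sketch of this conversion in Python is given below, under the assumption that the second multiplier's calculation bit width is 32 and that M initially lies in (0, 0.5]; the names are illustrative.

```python
def float_scale_to_int(m: float, calc_bit_width: int = 32):
    """Convert the floating-point scaling value M into the integer scaling
    value M1 plus a doubling count Shift, following the description above."""
    shift = 0
    while m < 0.5:              # repeatedly double M until it is in [0.5, 1)
        m *= 2.0
        shift += 1
    m1 = int(round(m * (1 << calc_bit_width)))   # amplify by 2**32 and round
    return m1, shift            # M is approximately M1 / 2**(shift + calc_bit_width)

# Example: for M = 0.3, one doubling gives 0.6, and float_scale_to_int(0.3)
# returns (2576980378, 1), so M is approximately 2576980378 / 2**33.
```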
In the above manner, the integer scaling value M1 retains at least 2^30 of precision relative to the floating-point scaling value M, which guarantees the network inference accuracy.
After the floating-point scaling value M is converted into the integer scaling value M1, the above convolution calculation formula becomes:
q_out = Z_out + (M1 × Σ_i (q_w,i - Z_w) × (q_f,i - Z_f)) / 2^(Shift+32)
It can be seen that the amplification ratio from the floating-point scaling value M to the integer scaling value M1 is 2^(Shift+32); that is, the amplification ratio is 2 to the power of n, where n is the sum of the calculation bit width of the second multiplier and the number of doubling iterations.
Ignoring the fractional part, in binary arithmetic dividing by a power of 2 is equivalent to shifting the binary representation of the dividend to the right by that number of bits, and a shift operation is a very efficient operation for the computation logic.
[Algorithm pseudocode ②: shift-based division with rounding]
For INT8 convolutional neural network inference, the rounding operation only needs to consider the integer bits, so a single shift operation is sufficient to obtain the same calculation result as the above expression, as shown in algorithm pseudocode ②.
In one embodiment, the shifter is further configured to perform a shift operation on the binary representation of the multiplication result, based on the binary representations of the multiplication result and the amplification ratio, to obtain a shift operation result; the shifter is further configured to perform an exclusive-or operation on the binary representations of the multiplication result and the amplification ratio, and to determine the scaling operation result based on the exclusive-or operation result and the shift operation result.
Take 15 ÷ 2^3 as an example: the binary representation of 15 is 1111 and that of 2^3 is 1000, so right-shifting the binary representation of 15 by 3 bits gives a shift operation result of 1. However, the actual (rounded) result of 15 ÷ 2^3 is 2, so direct shifting would produce a calculation error.
In this case, the binary representation of the multiplication result and the binary representation of the amplification ratio may be XORed, and the scaling operation result determined based on the exclusive-or operation result and the shift operation result. For example, the binary 1111 of 15 and the binary 1000 of 2^3 are XORed to obtain an exclusive-or operation result of 111, and the scaling operation result is then determined based on this exclusive-or operation result and the shift operation result.
Further, the shifter is configured to take the shift operation result plus 1 as the scaling operation result if the exclusive-or operation result is greater than half of the amplification ratio, and to take the shift operation result itself as the scaling operation result if the exclusive-or operation result is less than or equal to half of the amplification ratio.
Continuing with the exclusive-or operation result 111 as an example: 111 (i.e. 7) is greater than half of 2^3 (i.e. 4), so 1 is added to the shift operation result 1, and the resulting value 2 is the scaling operation result. Similarly, if the exclusive-or operation result were less than or equal to half of 2^3, the shift operation result would be taken directly as the scaling operation result.
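A Python sketch of this shift-based division with rounding is shown below for a non-negative multiplication result; the patent's worked example expresses the remainder extraction as an XOR of the product with the amplification ratio, and here the equivalent comparison of the low n bits (the remainder) with half of the ratio is used, which reproduces that example.

```python
def rounded_shift_divide(product: int, n: int) -> int:
    """Divide a non-negative product by 2**n with rounding, using one shift."""
    shifted = product >> n                 # shift operation result
    remainder = product & ((1 << n) - 1)   # low n bits of the product
    if remainder > (1 << n) // 2:          # more than half the amplification ratio
        return shifted + 1                 # round up
    return shifted                         # otherwise keep the shift result

# Worked example from the description: 15 / 2**3.
# 15 >> 3 = 1, remainder 7 > 4, so rounded_shift_divide(15, 3) returns 2.
```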
With timing optimization, this method can be completed in a single clock cycle, which improves the calculation efficiency.
In one embodiment, the multiply-add tree 101 includes a first multiplier having a first input port and a second input port, wherein the input data bit width of the first input port is smaller than that of the second input port. The first multiplier is configured to obtain quantized weight data through the first input port, obtain two pieces of quantized feature data through the second input port, and multiply each of the two pieces of quantized feature data by the quantized weight data; the feature maps to which the two pieces of quantized feature data belong correspond to different pictures.
Illustratively, in an FPGA the multiplier may be implemented with a DSP (Digital Signal Processing) resource on the FPGA chip. The DSP has two input ports, such as the input port A and input port B shown in fig. 2, whose input data bit widths are 25 bits and 18 bits, respectively.
The DSP obtains two pieces of quantized feature data (Feature1 and Feature2) through input port A, where the feature maps to which Feature1 and Feature2 belong correspond to different pictures. The DSP obtains the quantized weight data (Weight) through input port B and multiplies Feature1 and Feature2 by the quantized weight data respectively; that is, the quantized weight data is shared by the two pictures.
In the above manner, one input port of the multiplier is single-path multiplexed and a single multiplier is split for use, forming the schematic diagram shown in fig. 3, so that the quantized feature data of two pictures is inferred at the same time; correspondingly, the inference Batch Size parameter is set to 2. Only a small amount of logic code needs to be changed: in the computation path, only the data bit width and the number of adders need to be increased.
Further, the input data bit width of the second input port is equal to the number of input sub-ports included in the second input port. The first multiplier is configured to obtain one of the two pieces of quantized feature data through a preset number of input sub-ports counted forward from the first input sub-port of the second input port, and to obtain the other through the preset number of input sub-ports counted backward from the last input sub-port of the second input port; the preset number is equal to the bit width of the quantized feature data.
The following description takes the first input port as input port B and the second input port as input port A: the input data bit width of input port A is 25 bits, i.e. input port A includes 25 input sub-ports, which can be denoted A[24:0]; the input data bit width of input port B is 18 bits, i.e. input port B includes 18 input sub-ports, which can be denoted B[17:0]. If the bit width of the quantized feature data is 8 bits, the DSP may obtain Feature2 through input sub-ports A[7:0] and Feature1 through input sub-ports A[24:17].
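A small Python model of this split use of one DSP multiplication is sketched below, assuming unsigned 8-bit operands for simplicity (signed operation would need additional correction logic not shown here); port and variable names follow the description above, but the function itself is illustrative.

```python
def packed_dsp_multiply(feature1: int, feature2: int, weight: int):
    """Two 8-bit features share one multiplication with a common 8-bit weight."""
    # Pack Feature1 into A[24:17] and Feature2 into A[7:0]; bits A[16:8] stay zero.
    port_a = (feature1 << 17) | feature2
    product = port_a * weight                # one DSP multiplication, B = Weight
    # Feature2 * Weight fits in 16 bits, so the two results do not overlap.
    f2_times_w = product & ((1 << 17) - 1)   # lower field: Feature2 * Weight
    f1_times_w = product >> 17               # upper field: Feature1 * Weight
    return f1_times_w, f2_times_w

# Example: packed_dsp_multiply(100, 200, 50) returns (5000, 10000).
```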
In the above manner, input port A, which receives the feature data, is single-path multiplexed while input port B, which receives the weight data, is not, for the following reason: if both input port A and input port B were single-path multiplexed, the bit width of the quantized data would be too small and the accuracy of the network would drop.
Illustratively, let the bit width of the quantized feature data and the quantized weight data be W bits. If both input ports A and B are split into two paths, the split inputs are a1, a2, b1 and b2, and the zero-valued guard bit widths in the middle of A and B are 25 - 2W and 18 - 2W, respectively; A and B can then be expressed as:
A = a2 × 2^(25-W) + a1, B = b2 × 2^(18-W) + b1
and the product P can be expressed as:
P = A × B = a2·b2 × 2^(43-2W) + a2·b1 × 2^(25-W) + a1·b2 × 2^(18-W) + a1·b1
When the DSP is split for use, the required operation results are a2·b2 and a1·b1, and the cross terms a2·b1 and a1·b2 may contaminate the results of a2·b2 and a1·b1. The maximum bit width of the a2·b2 and a1·b1 results is 2W, so in order to ensure that these results are not contaminated, W must satisfy the following three inequalities:
25 - W > 2W, 18 - W > 2W, max(25 + W, 18 + W) + 1 < 43 - 2W
Solving these three inequalities gives a maximum value of W of 5; however, when the feature data and weight data of a convolutional neural network are quantized to 4 bits, the accuracy drops. Therefore only input port A of the DSP is single-path multiplexed, and the above expressions reduce to:
P = A × B = a2·b × 2^(25-W) + a1·b, with the constraint 25 - W > 2W
The maximum integer value of W obtained from this constraint is 8, so only input port A is single-path multiplexed in order to guarantee the detection precision and accuracy of the network.
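The bit-width constraints above can be checked with a few lines of Python; the helper names are illustrative.

```python
def both_ports_split_ok(w: int) -> bool:
    # Guard conditions so the cross terms do not contaminate a2*b2 and a1*b1.
    return (25 - w > 2 * w) and (18 - w > 2 * w) and (max(25 + w, 18 + w) + 1 < 43 - 2 * w)

def port_a_only_split_ok(w: int) -> bool:
    # Only the 25-bit port A carries two operands.
    return 25 - w > 2 * w

print(max(w for w in range(1, 16) if both_ports_split_ok(w)))    # prints 5
print(max(w for w in range(1, 16) if port_a_only_split_ok(w)))   # prints 8
```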
Regarding the on-chip cache design of the convolutional neural network accelerator: the convolution calculation needs to repeatedly read and write the weight parameters and the feature maps. If these operations interacted directly with the DDR, the logic code would be simpler to write, but power consumption would increase and the read bandwidth would decrease; this problem can therefore be solved with an on-chip cache.
During the calculation of a single layer, the computation path of that layer reads the Input feature map for calculation while producing the Output feature map of the layer, and the Output of this layer is the Input of the next layer's calculation. The concept of ping-pong operation (FIFO) is therefore introduced into the on-chip cache design: while a feature map is being written into one on-chip cache block, the computation path of the next layer can read a feature map from the other cache block. This avoids communication with the DDR (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory), reduces power consumption and at the same time improves the data read/write bandwidth.
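As a rough illustration, a toy Python model of the ping-pong cache behaviour described above is given below; buffer granularity, sizes and method names are illustrative and not taken from the patent.

```python
class PingPongCache:
    """Two on-chip buffers: one is written by the current layer while the
    other is read by the next layer's computation path, avoiding DDR traffic."""
    def __init__(self):
        self.buffers = [None, None]
        self.write_idx = 0                        # buffer written by the current layer

    def write_output(self, feature_map):
        self.buffers[self.write_idx] = feature_map

    def read_input(self):
        return self.buffers[self.write_idx ^ 1]   # the other buffer feeds the next layer

    def swap(self):
        self.write_idx ^= 1                       # exchange roles at the layer boundary
```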
Based on the above computation path and storage structure designs, the overall architecture of the convolutional neural network accelerator is given (as shown in fig. 4 and fig. 5). The convolutional neural network accelerator includes an on-chip cache combined with the computation path, which stores the intermediate feature maps formed by the computation path during inference and can be reused multiple times; all data scheduling and computation path configuration are completed by the controller inside each Processing Element (PE) in coordination with the main controller. The architecture of a PE is shown in fig. 6.
The convolutional neural network accelerator provided by the application may include an input instruction module, an operation module, a data handling module and the like. By designing an efficient INT8 accelerator for the quantized neural network model on an FPGA, an operation efficiency tens of times higher than that of a Graphics Processing Unit (GPU) can be obtained.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A convolutional neural network accelerator, comprising: a plurality of multiply-add trees for operating on the channels of a multi-channel feature map in parallel, and a scaling operator and an adder connected in series after each multiply-add tree;
the multiply-add tree is configured to perform a multiply-add operation on quantized weight data and quantized feature data of the feature map of one of the channels to obtain a multiply-add operation result of the feature map of the corresponding channel;
the scaling operator is configured to scale the multiply-add operation result of the feature map of the corresponding channel to obtain a scaling operation result of the feature map of the corresponding channel;
the adder is configured to perform zero-point adjustment on the scaling operation result of the feature map of the corresponding channel, and the result output by the adder is used as a quantized convolution result.
2. The convolutional neural network accelerator of claim 1, wherein the multiply-add tree comprises a first multiplier having a first input port and a second input port, and the input data bit width of the first input port is smaller than the input data bit width of the second input port;
the first multiplier is configured to obtain quantized weight data through the first input port, obtain two pieces of quantized feature data through the second input port, and multiply each of the two pieces of quantized feature data by the quantized weight data; the feature maps to which the two pieces of quantized feature data belong correspond to different pictures.
3. The convolutional neural network accelerator of claim 2, wherein the input data bit width of the second input port is equal to the number of input sub-ports included in the second input port;
the first multiplier is configured to obtain one of the two pieces of quantized feature data through a preset number of input sub-ports counted forward from the first input sub-port of the second input port, and to obtain the other through the preset number of input sub-ports counted backward from the last input sub-port of the second input port; the preset number is equal to the bit width of the quantized feature data.
4. The convolutional neural network accelerator of claim 1, wherein the scaling operator comprises a second multiplier and a shifter;
the second multiplier is configured to obtain an integer scaling value produced by amplifying a floating-point scaling value, multiply the multiply-add operation result by the integer scaling value, and send the resulting multiplication result to the shifter;
the shifter is configured to obtain the amplification ratio from the floating-point scaling value to the integer scaling value and divide the multiplication result by the amplification ratio.
5. The convolutional neural network accelerator of claim 4, wherein the integer scaling value is formed in the following way: if the floating-point scaling value is within a first numerical range, the floating-point scaling value is repeatedly doubled until the doubled value falls within a second numerical range; the doubled value is amplified according to the calculation bit width of the second multiplier, and the amplification result is rounded to obtain the integer scaling value;
wherein a value falling within the first numerical range is rounded to a value smaller than itself, and a value falling within the second numerical range is rounded to a value greater than itself.
6. The convolutional neural network accelerator of claim 5, wherein the integer scaling value is further formed by: raising 2 to the power of the calculation bit width of the second multiplier, and amplifying the repeatedly doubled floating-point scaling value by the result of this power operation.
7. The convolutional neural network accelerator of claim 6, wherein the amplification ratio is 2 to the power of n, and n is the sum of the calculation bit width of the second multiplier and the number of doubling iterations.
8. The convolutional neural network accelerator of claim 4, wherein
the shifter is further configured to perform a shift operation on the binary representation of the multiplication result, based on the binary representations of the multiplication result and the amplification ratio, to obtain a shift operation result;
the shifter is further configured to perform an exclusive-or operation on the binary representations of the multiplication result and the amplification ratio, and to determine the scaling operation result based on the exclusive-or operation result and the shift operation result.
9. The convolutional neural network accelerator of claim 8, wherein
the shifter is further configured to take the shift operation result plus 1 as the scaling operation result if the exclusive-or operation result is greater than half of the amplification ratio;
the shifter is further configured to take the shift operation result as the scaling operation result if the exclusive-or operation result is less than or equal to half of the amplification ratio.
10. The convolutional neural network accelerator of claim 1, further comprising: an on-chip cache combined with the computation path, configured to store intermediate feature maps formed by the computation path during inference.
CN202110849929.1A 2021-07-27 2021-07-27 Convolutional neural network accelerator Active CN113554163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110849929.1A CN113554163B (en) 2021-07-27 2021-07-27 Convolutional neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110849929.1A CN113554163B (en) 2021-07-27 2021-07-27 Convolutional neural network accelerator

Publications (2)

Publication Number Publication Date
CN113554163A true CN113554163A (en) 2021-10-26
CN113554163B CN113554163B (en) 2024-03-29

Family

ID=78132958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110849929.1A Active CN113554163B (en) 2021-07-27 2021-07-27 Convolutional neural network accelerator

Country Status (1)

Country Link
CN (1) CN113554163B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5473730A (en) * 1993-11-09 1995-12-05 At&T Ipm Corp. High efficiency learning network
CN109063825A (en) * 2018-08-01 2018-12-21 清华大学 Convolutional neural networks accelerator
CN110059733A (en) * 2019-04-01 2019-07-26 苏州科达科技股份有限公司 The optimization and fast target detection method, device of convolutional neural networks
WO2019183202A1 (en) * 2018-03-23 2019-09-26 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd
US20190303103A1 (en) * 2018-03-30 2019-10-03 Intel Corporation Common factor mass multiplication circuitry
CN111709522A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Deep learning target detection system based on server-embedded cooperation
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
US20210056397A1 (en) * 2019-08-23 2021-02-25 Nvidia Corporation Neural network accelerator using logarithmic-based arithmetic

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BENOIT JACOB等: "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", 《ARXIV:1712.05877 [CS.LG]》, pages 1 - 14 *
ZIHAOZHAO: "深度学习加速器入门(二)数据复用", pages 1 - 2, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/366450277> *
高晗,等: "深度学习模型压缩与加速综述", 《软件学报》, pages 68 - 92 *

Also Published As

Publication number Publication date
CN113554163B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN109063825B (en) Convolutional neural network accelerator
CN107145939B (en) Computer vision processing method and device of low-computing-capacity processing equipment
CN108647773B (en) Hardware interconnection system capable of reconstructing convolutional neural network
CN106445471A (en) Processor and method for executing matrix multiplication on processor
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN112668708B (en) Convolution operation device for improving data utilization rate
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN112799634B (en) Based on base 2 2 MDC NTT structured high performance loop polynomial multiplier
US20210044303A1 (en) Neural network acceleration device and method
TW202013261A (en) Arithmetic framework system and method for operating floating-to-fixed arithmetic framework
CN110766136B (en) Compression method of sparse matrix and vector
CN113554163B (en) Convolutional neural network accelerator
US20230259743A1 (en) Neural network accelerator with configurable pooling processing unit
CN113870090B (en) Method, graphics processing apparatus, system, and medium for implementing functions
US20220156043A1 (en) Apparatus and Method for Processing Floating-Point Numbers
Lu et al. A reconfigurable DNN training accelerator on FPGA
CN113138748B (en) Configurable CNN multiplication accumulator supporting 8bit and 16bit data based on FPGA
CN115167815A (en) Multiplier-adder circuit, chip and electronic equipment
CN114168106A (en) Data processing method, device and equipment based on convolutional neural network
US10761847B2 (en) Linear feedback shift register for a reconfigurable logic unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zheng Haisheng

Inventor after: Shen Xiaoyong

Inventor after: Lv Jiangbo

Inventor before: Zheng Haisheng

Inventor before: Yu Bei

Inventor before: Shen Xiaoyong

Inventor before: Lv Jiangbo

Inventor before: Jia Jiaya

GR01 Patent grant