WO2020057162A1 - A Convolutional Neural Network Accelerator - Google Patents
A Convolutional Neural Network Accelerator
- Publication number: WO2020057162A1 (application PCT/CN2019/087771)
- Authority: WIPO (PCT)
- Prior art keywords: weight, matrix, kneading, bit, activation value
- Prior art date
Classifications
- G06N3/04 — Architecture, e.g. interconnection topology
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/063 — Physical realisation of neural networks using electronic means
- G06N3/08 — Learning methods
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06F17/15 — Correlation function computation including computation of convolution operations
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F5/01 — Methods or arrangements for data conversion for shifting, e.g. justifying, scaling, normalising
- G06F7/50 — Adding; Subtracting
- G06F7/5443 — Sum of products
- G06F2207/3804, G06F2207/386 — Indexing scheme details and special constructional features
- H03M7/3059 — Digital compression and data reduction techniques, e.g. lossy compression
- H03M7/40 — Conversion to or from variable length codes, e.g. Shannon-Fano, Huffman, Morse
- H03M7/4031 — Fixed length to variable length coding
- H03M7/4037 — Prefix coding
- H03M7/70 — Type of the data to be coded, other than image and sound
Definitions
- the invention relates to the field of neural network computing, and in particular to a convolutional neural network accelerator.
- Deep convolutional neural networks have made significant progress in machine learning applications, such as real-time image recognition, detection, and natural language processing.
- the advanced deep convolutional neural network (DCNN) architecture has complex connections and a large number of neurons and synapses to meet the needs of high-precision and complex tasks.
- the weights are multiplied by their corresponding activation values and the products are accumulated; each weight and activation value thus forms a pair.
- DCNN consists of multiple layers, from tens to hundreds or even thousands. Nearly 98% of the calculations in an entire DCNN come from convolution operations, making convolution the most important factor affecting power and performance. Improving the computational efficiency of convolution without compromising the robustness of the learned model has therefore become an effective way to accelerate DCNNs, especially on resource-limited, low-power lightweight devices such as smartphones and robots.
- some existing methods exploit the fact that a fixed-point multiplication can be decomposed into a series of single-bit multiplications and shift-adds, and propose bit-level serialization when performing the MAC.
- the basic bits (1s) may appear anywhere in a fixed-point number, so this scheme must account for the worst-case position of a "1" bit: for a 16-bit fixed-point number (fp16), a 16-bit register must hold the position information of the basic bits. Different weights or activation values may cause different latencies during acceleration, and therefore produce unpredictable cycle counts.
- the hardware design will inevitably cover the worst case.
- the worst-case cycle must be used as the accelerator cycle, which not only lengthens the processing cycle and lowers the accelerator frequency, but also increases design complexity.
- the classic DCNN accelerator performs multiply-add operations by deploying multipliers and adders on each activation value and weight channel.
- the multiplication operands may be 32-bit floating-point, 16-bit fixed-point, or 8-bit integer values.
- the multiplier determines the delay of the convolution operation.
- an 8-bit fixed-point two-operand multiplier takes 3.4 times as long as an adder.
- different DCNN models require different accuracy, and even different levels in the same model have different accuracy requirements, so the multiplier designed for the convolutional neural network accelerator must be able to cover the worst case.
- the main components of a classic DCNN accelerator are multipliers and adders, and their main problem is that they perform invalid operations.
- Invalid calculations appear in two forms: operands that are zero values, and operands composed mostly of zero bits. Zero values make up only a small fraction of the weights, and can easily be eliminated by advanced microarchitecture design or memory-level compression techniques, keeping them out of the multipliers.
- Figure 1 shows that zero bits average as high as 68.9% of all bit positions, compared to basic bits (1s).
- the traditional CNN puts each weight/activation pair into a processing element (PE) and completes the multiply-add within one cycle; however, computing the zero bits is unavoidable. If the time spent on these invalid calculations can be reduced, PE throughput improves.
- a "0" bit in a PE is referred to as a "slack bit".
- the distribution of basic bits has two characteristics: (1) each bit position holds a basic bit with a probability of roughly 50% to 60%, which means each position is slack with a probability of 40% to 50%; (2) certain bit positions of the weights are almost entirely slack. For example, the third to fifth bits contain less than 1% of the basic bits. Yet the multiplier does not distinguish slack bits from basic bits when performing multiply-add operations. To improve the efficiency of the convolution operation, the slack bits must be exploited.
- an object of the present invention is to provide an efficient convolution operation method that includes a weight-kneading technique for a convolutional neural network accelerator.
- Weight kneading exploits the bit-level slack that is common in modern DCNN models. Unlike data compression or pruning, it reduces the number of weights processed without loss of accuracy.
- a convolutional neural network accelerator which includes:
- a weight-kneading module, used to obtain multiple sets of activation values to be operated on and their corresponding original weights; arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix; remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies; move the basic bits in each column of the reduced matrix, in calculation order, into the vacancies to obtain an intermediate matrix; remove the empty rows of the intermediate matrix and set its remaining empty positions to 0 to obtain the kneading matrix, each row of which is used as a kneading weight;
- a split accumulation module, used to obtain, from the correspondence between the activation values and the basic bits of the original weights, the position information of the activation value for each bit of the kneading weight, and to send the kneading weight to the split accumulator; the split accumulator divides the kneading weight into a plurality of weight segments according to the position information, sums each weight segment with its corresponding activation value, and sends the processing results to the adder tree, which performs shift-add on them to obtain the output feature map.
- in the convolution operation optimization system of the convolutional neural network accelerator, the split accumulator includes a splitter and segment adders: the splitter performs bitwise division of the kneading weight, and the segment adders sum the weight segments with their corresponding activation values.
- in the convolution operation optimization system of the convolutional neural network accelerator, the split accumulation module uses Huffman coding to save the position information of the activation value corresponding to each bit of the kneading weight.
- the convolution operation optimization system of the convolutional neural network accelerator wherein the activation value is a pixel value of an image.
- the technical progress of the present invention includes:
- the invention's kneading of weights and activation values, and its splitting calculation on the kneaded values, reduce storage and accelerate calculation;
- the Tetris accelerator architecture, including the SAC structure and the splitter structure, further accelerates the convolution calculation.
- Vivado HLS is used to simulate the two methods and obtain their execution times.
- Figure 2 shows the inference acceleration of the two modes of DaDN and the Tetris accelerator proposed by the present invention.
- the Tetris method achieves 1.30x acceleration in fp16 mode and an average of 1.50x in int8 mode; compared to PRA, it achieves nearly 1.15x acceleration.
- Figure 1 shows the proportion of zero values and zero bits in some modern deep convolutional neural network models
- FIG. 3 is a schematic view of weight kneading
- FIG. 4 is a schematic diagram of a split accumulator SAC
- Figure 5 is a Tetris accelerator architecture diagram
- FIG. 6 is a structural diagram of the splitter
- FIG. 7 is a schematic diagram of a result stored in a register of an arithmetic unit.
- the present invention reconstructs the inference computation mode of the DCNN model.
- the present invention uses a split accumulator (SAC) instead of the classic calculation mode, the MAC.
- SAC: split accumulator
- the present invention can make full use of the basic bits in the weight, and is composed of an adder, a shifter, and the like, without a multiplier.
- each weight / activation value pair is subjected to a shift and sum operation, where “weight / activation value” means “weight and activation value”.
- the present invention accumulates multiple weight / activation value pairs multiple times and performs only one shift addition and summation, thereby achieving a significant acceleration.
- the present invention proposes a Tetris accelerator to tap the maximum potential of the weight-kneading technique and the split-accumulator SAC.
- the Tetris accelerator consists of an array of split-accumulator (SAC) units, and uses kneaded weights and activation values to achieve high-throughput, low-power inference. Experiments with high-level synthesis tools show that the Tetris accelerator achieves the best results compared with the prior art.
- the activation values of the first layer are the inputs; the activation values of the second layer are the outputs of the first layer, and so on. If the input is an image, the activation values of the first layer are the image's pixel values.
- the present invention includes:
- Step 1: Obtain multiple sets of activation values to be calculated and their corresponding original weights. Arrange the original weights in calculation order and align them bitwise to obtain a weight matrix. Remove the slack bits from the weight matrix, leaving vacancies, to obtain a reduced matrix. Move the basic bits in each column of the reduced matrix, in calculation order, into the vacancies to obtain an intermediate matrix. Remove the empty rows (rows that are entirely vacant) from the intermediate matrix and set its remaining vacant positions to 0 to obtain the kneading matrix; each row of the kneading matrix is used as a kneading weight;
- Step 2: According to the correspondence between the activation values and the basic bits of the original weights, obtain the position information of the activation value corresponding to each bit (element) of the kneading weight;
- Step 3: Send the kneading weight to a split accumulator, which divides it bit by bit into a plurality of weight segments. Based on the position information, each weight segment is summed with its corresponding activation value; the processing results are sent to the adder tree, which performs shift-add on them to obtain the output feature map.
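Step 1 can be illustrated with a minimal sketch. The representation below is ours, not the patent's hardware: each weight is a Python int, and each kneaded row records, per bit position, which original weight contributed the basic bit; that index is exactly the position information that Step 2 attaches to each bit.

```python
# Minimal sketch of the weight-kneading step (Step 1). Representation is ours:
# each weight is a Python int; bit column c = 0 is the most-significant bit.
def knead(weights, bits=4):
    """Pack the basic (1) bits of a batch of weights column by column.

    Returns a list of kneaded rows; row[c] is the index of the original
    weight whose basic bit fills bit position c, or None for a padded 0.
    The index doubles as the position information that later routes each
    bit to its activation value.
    """
    # One column per bit position: which weights have a basic bit there.
    columns = [[i for i, w in enumerate(weights) if (w >> (bits - 1 - c)) & 1]
               for c in range(bits)]
    depth = max((len(col) for col in columns), default=0)  # rows after kneading
    return [[col[r] if r < len(col) else None for col in columns]
            for r in range(depth)]
```

For example, `knead([0b1010, 0b0110, 0b0001])` returns `[[0, 1, 0, 2], [None, None, 1, None]]`: three weight/activation pairs compact into two kneaded rows, mirroring how Figure 3 reduces six weights to three.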
- the split accumulator includes a splitter and segment adders.
- the splitter performs bitwise division of the kneading weight.
- the segment adders sum the weight segments with their corresponding activation values.
- Figure 3 shows the weight-kneading method. Assuming a batch of 6 weights, it normally takes 6 cycles to complete the 6 weight/activation multiply-add operations. Slack occurs in two orthogonal dimensions: (1) within a weight, i.e. in W1, W2, W3, W4 and W6, slack bits appear at arbitrary positions; (2) across the weight dimension, i.e. W5 is an all-zero-bit (zero-valued) weight, so it does not appear in Figure 3(b). By kneading the basic and slack bits together, the 6 MAC cycles are reduced to 3 cycles, as shown in Figure 3(c).
- W′1, W′2 and W′3 are obtained by kneading; each of their bits is a basic bit drawn from W1 to W6. When more weights are allowed, the slack bits are filled with basic bits, which the present invention calls the "weight kneading" operation: the slack bits of earlier original weights are replaced by the basic bits of subsequent original weights.
- the zero bits are likewise replaced by basic bits, which removes the effect of slack in both dimensions.
- a kneading weight must be computed against multiple activation values, rather than against a single activation value.
- suppose the activation value a has 4 bits, and the weight w also has 4 bits, w1, w2, w3, w4, with w1 the most significant bit.
- then a·w = a·w1·2³ + a·w2·2² + a·w3·2¹ + a·w4·2⁰.
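The shift-add identity above can be sketched directly. The minimal Python example below (function and variable names are ours, not from the patent) multiplies an activation by a weight using only the weight's basic bits, skipping the slack bits; it indexes bits from the least-significant end, which yields the same sum:

```python
# Shift-add decomposition of a * w: only the basic (1) bits of w contribute,
# each as the activation shifted by the bit's position. Slack bits cost nothing.
def shift_add_multiply(a: int, w: int, bits: int = 4) -> int:
    total = 0
    for i in range(bits):          # i = 0 is the least-significant bit of w
        if (w >> i) & 1:           # basic bit -> contribute a shifted copy of a
            total += a << i        # slack (0) bit -> skipped entirely
    return total
```

With `w = 0b1011`, only three shift-adds are performed instead of a full multiplication.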
- the present invention reduces the number of shift additions.
- the processing mode of the present invention is different from the traditional MAC and canonical bit serializers.
- SAC stands for "split and accumulate".
- the SAC operation first splits the kneaded weights, then looks up the referenced activation values, and finally accumulates each activation value into its specific segment register.
- Figure 4 shows the SAC process for a weight/activation pair. The figure only illustrates the SAC concept; in actual use the weights are first kneaded and then sent to the SAC for calculation. Multiple SACs form an array to build the accelerator of the invention.
- the split accumulator (SAC) shown in FIG. 4 differs from the MAC, which aims to compute each pairwise multiplication exactly: the SAC first splits the weights, then sums the activation values in the corresponding segment adders according to the split results. After a certain number of weight/activation pairs have been processed, a single final shift-add produces the final sum. This "certain number" is determined experimentally and is related to the weight distribution of the convolutional neural network model; a good value can be set for a specific model, or a compromise value chosen that suits most models.
- the splitter is part of the SAC. After a weight enters the SAC, it first passes through the splitter, which parses (splits) the weight and assigns the results to the subsequent segment registers and segment adders; the adder tree then produces the final result.
- the splitter is responsible for dispatching each segment to its corresponding adder. For example, if the second bit of the weight is a basic bit, the activation value is passed to S1 in the figure; the same applies to the other bits.
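The split-and-accumulate idea can be sketched in a few lines of Python. This is an arithmetic model only, under our own simplifications (4-bit weights, no pipeline or kneading): each segment register accumulates the activations routed to its bit position, and a single final shift-add replaces all per-pair multiplications.

```python
# Sketch of split-and-accumulate (SAC): each set bit of a weight routes its
# activation into the segment register for that bit position; one final
# shift-add then yields the whole dot product, with no multiplier anywhere.
def sac_dot(weights, activations, bits=4):
    segments = [0] * bits                  # segment registers S0..S(bits-1)
    for w, a in zip(weights, activations):
        for c in range(bits):              # c = 0 is the least-significant bit
            if (w >> c) & 1:               # basic bit -> segment adder c
                segments[c] += a           # accumulate, don't multiply
    # Final shift-add over the segment sums.
    return sum(s << c for c, s in enumerate(segments))
```

For example, `sac_dot([3, 5, 0], [2, 4, 9])` returns 26, matching 3·2 + 5·4 + 0·9; the zero-valued weight contributes nothing to any segment.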
- the SAC requires 3 cycles. Although the SAC must read the position information of the activation values, this is all register access, which takes far less time than a multiplication. Multiplication is the most time-consuming part of a convolution, and the present invention has no multiplication operations and fewer shift operations than a MAC. The invention sacrifices a very small amount of additional storage for this, a sacrifice that is worthwhile.
- Tetris accelerator architecture: each SAC unit consists of 16 splitters organized in parallel (Figure 6), which accept a series of kneading weights and activation values (their number given by the kneading stride parameter), the same number of segment adders, and a following adder tree responsible for computing the final partial sum.
- KS: kneading stride
- "Tetris" refers to the kneading process of the present invention, which resembles stacking in the game Tetris.
- the Tetris accelerator proposed by the present invention uses the kneaded weights; the SAC only assigns activation values to the segment adders for accumulation.
- Figure 5 shows the architecture of Tetris.
- Each SAC unit accepts weight / activation values.
- the accelerator consists of a symmetric matrix of SAC cells, which is connected to a throttle buffer and accepts kneaded weights from on-chip eDRAM. If fixed-point 16-bit weights are used, each cell contains 16 splitters that make up a splitter array.
- the splitter operates on the activation and weight values according to the parameter KS, i.e. the number of weights kneaded together.
- the splitter micro-architecture is shown in Figure 6.
- the weight-kneading technique requires that the splitter accept a set of activation values for each kneading weight, and each basic bit must be identifiable so as to indicate its relevant activation value within the KS range.
- the Tetris method uses ⟨Wi, p⟩ in the splitter to represent the activation related to a particular basic bit.
- KS determines the range of the p bits; e.g. a 4-bit p corresponds to a kneading stride of 16 weights. A comparator determines whether a bit of the kneading weight is zero, because even after kneading some positions may be invalid (e.g. W′3 in Fig. 3(c)), depending on KS. If the bit is slack, the multiplexer after the comparator outputs zero to the following structure; otherwise it decodes p and outputs the corresponding activation value.
- the splitter fetches the target activation value from the throttle buffer only when necessary, and need not store activation values multiple times.
- the newly introduced position information p occupies a small amount of space, but p is only used to decode activations and consists of only a few bits (16 activations require 4 bits), so it does not introduce significant on-chip resource or power overhead.
- each segment adder receives activation values from all 16 splitters and adds them to the value in its local segment register.
- the intermediate segment sums are stored in registers S0 to S15; once all addable tasks are complete, a control signal notifies the multiplexer to pass each segment value to the subsequent adder tree.
- the final stage of the adder tree generates the final sum and passes it to the output non-linear activation function determined by the network model, such as ReLU or sigmoid.
- the end of the addable activation/weight pairs is marked by a tag sent to the detector in each SAC unit; when all tags reach the end, the adder tree outputs the final sum.
- KS: a parameter controlling the weight kneading
- different channels will have different numbers of kneaded weights, so in most cases the tag may reside anywhere in the throttle buffer. When a new activation/weight pair is filled into the buffer, the tag moves backwards, so it does not affect the computation of each segment's partial sum.
- the weights are kneaded and then stored in on-chip eDRAM; the kneading weights are fetched from the eDRAM.
- the weights are parsed by the SAC splitter and assigned to the subsequent segment registers and adders; the adder tree then produces the final result.
- the final result is the feature map, which is the input of the next layer.
- each bit of W′ may correspond to a different activation value A. If a bit is "1", additional overhead is required to store which A corresponds to it; if it is "0", nothing need be stored.
- the present invention does not limit the encoding mode of the additionally stored position information (which A corresponds to each bit); common encodings such as Huffman coding may be used.
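As an illustration of how such position information might be encoded (the patent names Huffman coding but specifies no construction; the function below is our sketch), a standard heap-based Huffman builder assigns shorter codes to activation positions that occur more often:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code for a stream of position indices.

    Frequent positions get shorter codes; ties are broken by an insertion
    id so heap entries never compare their dict payloads. Returns a
    symbol -> bit-string mapping that is prefix-free.
    """
    freq = Counter(symbols)
    heap = [[n, i, {s: ""}] for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)           # two least-frequent subtrees
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}   # prepend branch bits
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2] if heap else {}
```

For the position stream `[0, 0, 0, 1, 1, 2]`, position 0 receives a 1-bit code while positions 1 and 2 receive 2-bit codes, so the common position costs the least storage.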
- W′ and the additionally stored position information are sent to the SAC, which dispatches these weights and activation values to the corresponding computing units.
- the result in the register of the arithmetic unit is shown in FIG. 7, and then the result in the register is shifted and added to obtain the final result.
- the invention also discloses a convolutional neural network accelerator, which includes:
- a weight-kneading module, used to obtain multiple sets of activation values to be operated on and their corresponding original weights; arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix; remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies; move the basic bits in each column of the reduced matrix, in calculation order, into the vacancies to obtain an intermediate matrix; remove the empty rows of the intermediate matrix and set its remaining empty positions to 0 to obtain the kneading matrix, each row of which is used as a kneading weight;
- a split accumulation module, used to obtain, from the correspondence between the activation values and the basic bits of the original weights, the position information of the activation value for each bit of the kneading weight, and to send the kneading weight to the split accumulator; the split accumulator divides the kneading weight into a plurality of weight segments according to the position information, sums each weight segment with its corresponding activation value, and sends the processing results to the adder tree, which performs shift-add on them to obtain the output feature map;
- the split accumulator includes a segment register for storing the weighted segment.
- the convolutional neural network accelerator wherein the split accumulator includes a splitter and segment adders: the splitter performs bitwise division of the kneading weight, and the segment adders sum and process the weight segments with their corresponding activation values.
- the split accumulation module uses Huffman coding to store the position information of the activation value corresponding to each bit of the kneaded weight.
- the invention relates to a convolutional neural network accelerator, which includes: arranging the original weights in calculation order and aligning them bit by bit to obtain a weight matrix; removing the slack bits from the weight matrix to obtain a reduced matrix with vacancies; letting the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix; removing the empty rows of the intermediate matrix and setting its remaining vacancies to 0 to obtain the kneading matrix, each row of which serves as a kneaded weight; obtaining, from the correspondence between the activation values and the basic bits of the original weights, the position information of the activation value corresponding to each bit of the kneaded weight; and sending the kneaded weight to the split accumulator, which divides it into multiple weight segments according to the position information, sums each weight segment with its corresponding activation values, and sends the processing results to the adder tree, where shift-addition yields the output feature map.
- the present invention can reduce storage requirements and accelerate the operation of the convolutional neural network.
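The weight-kneading procedure described above can be sketched in Python. This is a minimal illustration of the scheme, not the hardware implementation; the function name, the 4-bit width in the example, and the MSB-first column ordering are assumptions made for the sketch:

```python
def knead_weights(weights, n_bits=8):
    """Build the kneading matrix from a group of weights.

    weights: unsigned integer weights, in calculation order.
    Returns (kneaded, positions): the rows of the kneading matrix
    (MSB-first bits) and, in the same shape, the index of the
    activation value each basic bit belongs to (None for a padded 0).
    """
    # Step 1: bit-align the weights into a matrix, one row per weight,
    # one column per bit position. Slack (zero) bits are dropped, so
    # each column keeps only the owners of its basic bits.
    col_owners = []
    for c in range(n_bits):
        bit = n_bits - 1 - c  # MSB-first column ordering
        owners = [i for i, w in enumerate(weights) if (w >> bit) & 1]
        col_owners.append(owners)
    # Step 2: compact each column upward (basic bits fill the vacancies
    # in calculation order), trim empty rows, and set leftover
    # vacancies to 0.
    n_rows = max((len(o) for o in col_owners), default=0)
    kneaded, positions = [], []
    for r in range(n_rows):
        row_bits, row_pos = [], []
        for c in range(n_bits):
            if r < len(col_owners[c]):
                row_bits.append(1)
                row_pos.append(col_owners[c][r])
            else:
                row_bits.append(0)   # vacancy set to 0
                row_pos.append(None)
        kneaded.append(row_bits)
        positions.append(row_pos)
    return kneaded, positions
```

For the 4-bit weights 5 (0101), 3 (0011) and 12 (1100), the three original rows compress into two kneaded rows, so one processing cycle is saved for this group.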
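The split-accumulate step described above can also be sketched in software (hypothetical names; the kneaded rows and position information are written out literally for three 4-bit weights 5, 3, 12 kneaded into two rows). Each basic bit routes its activation value to the segment register of its bit position, and a final shift-add over the segment registers reproduces the dot product without any multiplier:

```python
def split_accumulate(kneaded_rows, positions, activations, n_bits=4):
    """Multiplier-free evaluation of sum(w_j * a_j) from a kneaded matrix.

    kneaded_rows: rows of the kneading matrix, MSB-first bits.
    positions:    same shape; index of the activation value each basic
                  bit belongs to (None for a padded zero).
    """
    segments = [0] * n_bits  # one segment register per bit position
    for bits, owners in zip(kneaded_rows, positions):
        for col, (bit, owner) in enumerate(zip(bits, owners)):
            if bit:  # the separator routes the bit to its segment adder
                segments[n_bits - 1 - col] += activations[owner]
    # adder tree: shift-add the segment registers into the final result
    return sum(s << p for p, s in enumerate(segments))

# Kneading matrix for the weights 5 (0101), 3 (0011), 12 (1100);
# each `positions` entry names the weight a basic bit came from.
kneaded_rows = [[1, 1, 1, 1], [0, 1, 0, 1]]
positions = [[2, 0, 1, 0], [None, 2, None, 1]]
activations = [2, 7, 4]
result = split_accumulate(kneaded_rows, positions, activations)
```

Here `result` equals 5*2 + 3*7 + 12*4 = 79, the same dot product a conventional multiply-accumulate unit would produce, but obtained in two row-passes instead of three.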
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Neurology (AREA)
- Databases & Information Systems (AREA)
- Algebra (AREA)
- Complex Calculations (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Claims (5)
- 1. A convolutional neural network accelerator, characterized in that it comprises: a weight kneading module, configured to obtain multiple groups of activation values to be operated on and their corresponding original weights, arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix, remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, remove the empty rows of the intermediate matrix and set its remaining vacancies to 0 to obtain a kneading matrix, each row of which serves as a kneaded weight; and a split accumulation module, configured to obtain, from the correspondence between the activation values and the basic bits of the original weights, the position information of the activation value corresponding to each bit of the kneaded weight, and to send the kneaded weight to a split accumulator, which divides the kneaded weight bit by bit into multiple weight segments, sums each weight segment with its corresponding activation values according to the position information, and sends the processing results to an adder tree, where shift-addition is performed on them to obtain an output feature map; wherein the split accumulator comprises segment registers for storing the weight segments.
- 2. The convolutional neural network accelerator according to claim 1, characterized in that the split accumulator comprises a separator and segment adders, the separator being configured to divide the kneaded weight bit by bit, and the segment adders being configured to sum each weight segment with its corresponding activation values.
- 3. The convolutional neural network accelerator according to claim 1 or 2, characterized in that the split accumulation module uses Huffman coding to store the position information of the activation value corresponding to each bit of the kneaded weight.
- 4. The convolutional neural network accelerator according to claim 1 or 2, characterized in that the activation values are pixel values of an image.
- 5. The convolutional neural network accelerator according to claim 3, characterized in that the activation values are pixel values of an image.
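The Huffman coding of the position information mentioned in the claims can be illustrated with a short sketch (the function name and the sample data are hypothetical; it only shows that activation indices referenced more often receive shorter codes, compressing the per-bit position metadata):

```python
import heapq
from collections import Counter

def huffman_codes(position_info):
    """Build a Huffman code book for the activation indices stored in
    the position information (None entries, i.e. padded zeros, are skipped)."""
    symbols = [p for row in position_info for p in row if p is not None]
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct index
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, unique tiebreaker, {symbol: partial code})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # merge the two rarest subtrees, prefixing their codes with 0/1
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]
```

Applied to the position information of a kneaded group, the resulting code book is prefix-free, so the per-bit activation indices can be decoded unambiguously from a packed bit stream.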
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/250,889 US20210350204A1 (en) | 2018-09-20 | 2019-05-21 | Convolutional neural network accelerator |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811100309 | 2018-09-20 | ||
CN201811100309.2 | 2018-09-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020057162A1 true WO2020057162A1 (zh) | 2020-03-26 |
Family
ID=65843951
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/087769 WO2020057161A1 (zh) | 2018-09-20 | 2019-05-21 | 一种用于卷积神经网络加速器的拆分累加器 |
PCT/CN2019/087767 WO2020057160A1 (zh) | 2018-09-20 | 2019-05-21 | 一种基于权重捏合的卷积神经网络计算方法和系统 |
PCT/CN2019/087771 WO2020057162A1 (zh) | 2018-09-20 | 2019-05-21 | 一种卷积神经网络加速器 |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/087769 WO2020057161A1 (zh) | 2018-09-20 | 2019-05-21 | 一种用于卷积神经网络加速器的拆分累加器 |
PCT/CN2019/087767 WO2020057160A1 (zh) | 2018-09-20 | 2019-05-21 | 一种基于权重捏合的卷积神经网络计算方法和系统 |
Country Status (3)
Country | Link |
---|---|
US (3) | US20210350214A1 (zh) |
CN (3) | CN109543140B (zh) |
WO (3) | WO2020057161A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11500811B2 (en) * | 2020-06-12 | 2022-11-15 | Alibaba Group Holding Limited | Apparatuses and methods for map reduce |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543140B (zh) * | 2018-09-20 | 2020-07-10 | 中国科学院计算技术研究所 | 一种卷积神经网络加速器 |
CN110059733A (zh) * | 2019-04-01 | 2019-07-26 | 苏州科达科技股份有限公司 | 卷积神经网络的优化及快速目标检测方法、装置 |
CN110245324B (zh) * | 2019-05-19 | 2023-01-17 | 南京惟心光电系统有限公司 | 一种基于光电计算阵列的反卷积运算加速器及其方法 |
CN110245756B (zh) * | 2019-06-14 | 2021-10-26 | 第四范式(北京)技术有限公司 | 用于处理数据组的可编程器件及处理数据组的方法 |
CN110633153A (zh) * | 2019-09-24 | 2019-12-31 | 上海寒武纪信息科技有限公司 | 一种用多核处理器实现神经网络模型拆分方法及相关产品 |
WO2021081854A1 (zh) * | 2019-10-30 | 2021-05-06 | 华为技术有限公司 | 一种卷积运算电路和卷积运算方法 |
CN110807522B (zh) * | 2019-10-31 | 2022-05-06 | 合肥工业大学 | 一种神经网络加速器的通用计算电路 |
CN113627600B (zh) * | 2020-05-07 | 2023-12-29 | 合肥君正科技有限公司 | 一种基于卷积神经网络的处理方法及其系统 |
CN113919477A (zh) * | 2020-07-08 | 2022-01-11 | 嘉楠明芯(北京)科技有限公司 | 一种卷积神经网络的加速方法及装置 |
CN112580787B (zh) * | 2020-12-25 | 2023-11-17 | 北京百度网讯科技有限公司 | 神经网络加速器的数据处理方法、装置、设备及存储介质 |
CN116888575A (zh) * | 2021-03-26 | 2023-10-13 | 上海科技大学 | 精简近似的基于共享的单输入多权重乘法器 |
CN114021710A (zh) * | 2021-10-27 | 2022-02-08 | 中国科学院计算技术研究所 | 利用比特级稀疏性的深度学习卷积加速方法及处理器 |
CN114168991B (zh) * | 2022-02-10 | 2022-05-20 | 北京鹰瞳科技发展股份有限公司 | 对加密数据进行处理的方法、电路及相关产品 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239824A (zh) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | 用于实现稀疏卷积神经网络加速器的装置和方法 |
CN107392308A (zh) * | 2017-06-20 | 2017-11-24 | 中国科学院计算技术研究所 | 一种基于可编程器件的卷积神经网络加速方法与系统 |
US20170344882A1 (en) * | 2016-05-31 | 2017-11-30 | Canon Kabushiki Kaisha | Layer-based operations scheduling to optimise memory for CNN applications |
US20180046916A1 (en) * | 2016-08-11 | 2018-02-15 | Nvidia Corporation | Sparse convolutional neural network accelerator |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103679185B (zh) * | 2012-08-31 | 2017-06-16 | 富士通株式会社 | 卷积神经网络分类器系统、其训练方法、分类方法和用途 |
US9513870B2 (en) * | 2014-04-22 | 2016-12-06 | Dialog Semiconductor (Uk) Limited | Modulo9 and modulo7 operation on unsigned binary numbers |
US10049322B2 (en) * | 2015-05-21 | 2018-08-14 | Google Llc | Prefetching weights for use in a neural network processor |
US20160379109A1 (en) * | 2015-06-29 | 2016-12-29 | Microsoft Technology Licensing, Llc | Convolutional neural networks on hardware accelerators |
US11074492B2 (en) * | 2015-10-07 | 2021-07-27 | Altera Corporation | Method and apparatus for performing different types of convolution operations with the same processing elements |
US11170294B2 (en) * | 2016-01-07 | 2021-11-09 | Intel Corporation | Hardware accelerated machine learning |
US20170344876A1 (en) * | 2016-05-31 | 2017-11-30 | Samsung Electronics Co., Ltd. | Efficient sparse parallel winograd-based convolution scheme |
WO2018058426A1 (zh) * | 2016-09-29 | 2018-04-05 | 清华大学 | 硬件神经网络转换方法、计算装置、编译方法和神经网络软硬件协作系统 |
CN106529670B (zh) * | 2016-10-27 | 2019-01-25 | 中国科学院计算技术研究所 | 一种基于权重压缩的神经网络处理器、设计方法、芯片 |
US10871964B2 (en) * | 2016-12-29 | 2020-12-22 | Qualcomm Incorporated | Architecture for sparse neural network acceleration |
CN107086910B (zh) * | 2017-03-24 | 2018-08-10 | 中国科学院计算技术研究所 | 一种针对神经网络处理的权重加解密方法和系统 |
CN107341544B (zh) * | 2017-06-30 | 2020-04-10 | 清华大学 | 一种基于可分割阵列的可重构加速器及其实现方法 |
CN107844826B (zh) * | 2017-10-30 | 2020-07-31 | 中国科学院计算技术研究所 | 神经网络处理单元及包含该处理单元的处理系统 |
CN107918794A (zh) * | 2017-11-15 | 2018-04-17 | 中国科学院计算技术研究所 | 基于计算阵列的神经网络处理器 |
CN108182471B (zh) * | 2018-01-24 | 2022-02-15 | 上海岳芯电子科技有限公司 | 一种卷积神经网络推理加速器及方法 |
SG11202007532TA (en) * | 2018-02-16 | 2020-09-29 | Governing Council Univ Toronto | Neural network accelerator |
CN108510066B (zh) * | 2018-04-08 | 2020-05-12 | 湃方科技(天津)有限责任公司 | 一种应用于卷积神经网络的处理器 |
CN109543140B (zh) * | 2018-09-20 | 2020-07-10 | 中国科学院计算技术研究所 | 一种卷积神经网络加速器 |
-
2018
- 2018-10-18 CN CN201811214310.8A patent/CN109543140B/zh active Active
- 2018-10-18 CN CN201811214639.4A patent/CN109543830B/zh active Active
- 2018-10-18 CN CN201811214323.5A patent/CN109543816B/zh active Active
-
2019
- 2019-05-21 US US17/250,892 patent/US20210350214A1/en active Pending
- 2019-05-21 WO PCT/CN2019/087769 patent/WO2020057161A1/zh active Application Filing
- 2019-05-21 US US17/250,890 patent/US20210357735A1/en active Pending
- 2019-05-21 US US17/250,889 patent/US20210350204A1/en not_active Abandoned
- 2019-05-21 WO PCT/CN2019/087767 patent/WO2020057160A1/zh active Application Filing
- 2019-05-21 WO PCT/CN2019/087771 patent/WO2020057162A1/zh active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170344882A1 (en) * | 2016-05-31 | 2017-11-30 | Canon Kabushiki Kaisha | Layer-based operations scheduling to optimise memory for CNN applications |
US20180046916A1 (en) * | 2016-08-11 | 2018-02-15 | Nvidia Corporation | Sparse convolutional neural network accelerator |
CN107239824A (zh) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | 用于实现稀疏卷积神经网络加速器的装置和方法 |
CN107392308A (zh) * | 2017-06-20 | 2017-11-24 | 中国科学院计算技术研究所 | 一种基于可编程器件的卷积神经网络加速方法与系统 |
Non-Patent Citations (1)
Title |
---|
WANG, YING ET AL.: "Re-architecting the On-chip Memory Sub-system of Machine-Learning Accelerator for Embedded Devices", 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 10 January 2017 (2017-01-10), pages 1 - 6, XP033048844, ISSN: 1558-2434 *
Also Published As
Publication number | Publication date |
---|---|
CN109543830A (zh) | 2019-03-29 |
CN109543830B (zh) | 2023-02-03 |
US20210357735A1 (en) | 2021-11-18 |
CN109543140A (zh) | 2019-03-29 |
CN109543816B (zh) | 2022-12-06 |
CN109543140B (zh) | 2020-07-10 |
WO2020057161A1 (zh) | 2020-03-26 |
CN109543816A (zh) | 2019-03-29 |
US20210350214A1 (en) | 2021-11-11 |
WO2020057160A1 (zh) | 2020-03-26 |
US20210350204A1 (en) | 2021-11-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020057162A1 (zh) | 一种卷积神经网络加速器 | |
US20220012593A1 (en) | Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization | |
CN107862374B (zh) | 基于流水线的神经网络处理系统和处理方法 | |
Samimi et al. | Res-DNN: A residue number system-based DNN accelerator unit | |
KR20230113408A (ko) | 가속화된 수학 엔진 | |
US20220164663A1 (en) | Activation Compression Method for Deep Learning Acceleration | |
CN110555516B (zh) | 基于FPGA的YOLOv2-tiny神经网络低延时硬件加速器实现方法 | |
US20200104669A1 (en) | Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations | |
CN113010213B (zh) | 基于阻变忆阻器的精简指令集存算一体神经网络协处理器 | |
US12079592B2 (en) | Deep neural network accelerator including lookup table based bit-serial processing elements | |
JP7292297B2 (ja) | 確率的丸めロジック | |
Khaleghi et al. | Shear er: highly-efficient hyperdimensional computing by software-hardware enabled multifold approximation | |
Frickenstein et al. | DSC: Dense-sparse convolution for vectorized inference of convolutional neural networks | |
CN110851779A (zh) | 用于稀疏矩阵运算的脉动阵列架构 | |
Shu et al. | High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination | |
Wu et al. | A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation | |
CN110716751B (zh) | 高并行度计算平台、系统及计算实现方法 | |
Guo et al. | A high-efficiency fpga-based accelerator for binarized neural network | |
He et al. | FTW-GAT: An FPGA-based accelerator for graph attention networks with ternary weights | |
CN116888591A (zh) | 一种矩阵乘法器、矩阵计算方法及相关设备 | |
CN110659014B (zh) | 乘法器及神经网络计算平台 | |
CN116842304A (zh) | 一种不规则稀疏矩阵的计算方法及系统 | |
Lu et al. | SparseNN: A performance-efficient accelerator for large-scale sparse neural networks | |
WO2022174733A1 (zh) | 一种神经元加速处理方法、装置、设备及可读存储介质 | |
CN115222028A (zh) | 基于fpga的一维cnn-lstm加速平台及实现方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19861835 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19861835 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.09.2021) |
|