WO2020057161A1 - Split accumulator for convolutional neural network accelerator - Google Patents

Split accumulator for convolutional neural network accelerator

Info

Publication number
WO2020057161A1
WO2020057161A1 PCT/CN2019/087769 CN2019087769W WO2020057161A1 WO 2020057161 A1 WO2020057161 A1 WO 2020057161A1 CN 2019087769 W CN2019087769 W CN 2019087769W WO 2020057161 A1 WO2020057161 A1 WO 2020057161A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
matrix
kneading
bit
activation value
Prior art date
Application number
PCT/CN2019/087769
Other languages
English (en)
French (fr)
Inventor
李晓维
魏鑫
路航
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 filed Critical 中国科学院计算技术研究所
Priority to US17/250,890 priority Critical patent/US20210357735A1/en
Publication of WO2020057161A1 publication Critical patent/WO2020057161A1/zh

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/4031Fixed length to variable length coding
    • H03M7/4037Prefix coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70Type of the data to be coded, other than image and sound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the invention relates to the field of neural network computing, and in particular to a split accumulator for a convolutional neural network accelerator.
  • Deep convolutional neural networks have made significant progress in machine learning applications, such as real-time image recognition, detection, and natural language processing.
  • the advanced deep convolutional neural network (DCNN) architecture has complex connections and a large number of neurons and synapses to meet the needs of high-precision and complex tasks.
  • during a convolution operation, each weight is multiplied by its corresponding activation value and the products are summed; a weight and its activation value therefore form a pair.
  • DCNN consists of multiple layers, from tens to hundreds or even thousands. Nearly 98% of the calculations in the entire DCNN come from convolution operations, which are the most important factor affecting power and performance. Improving the computational efficiency of convolution without compromising the robustness of the learning model has become an effective way to accelerate DCNNs, especially on lightweight devices (such as smartphones and robots) with limited resources and low power consumption.
  • some existing methods take advantage of the fact that a fixed-point multiplication can be decomposed into a series of single-bit multiplications followed by shift-and-add, and propose using bit-level serial processing when performing MAC operations.
  • however, the basic bit (a "1" bit) may appear anywhere in the fixed-point number, so this scheme must take the worst-case position of the bit "1" into account; for a 16-bit fixed-point number (fp16), a 16-bit register is needed to hold the position information of the basic bits. Different weights or activation values may cause different waiting times during the acceleration process and therefore produce unpredictable cycles.
  • the hardware design will inevitably cover the worst case.
  • the worst-case cycle must be adopted as the accelerator cycle, which not only lengthens the processing cycle and lowers the accelerator frequency, but also increases design complexity.
  • the classic DCNN accelerator performs multiply-add operations by deploying multipliers and adders on each activation value and weight channel.
  • to balance speed and accuracy, the multiplication may use 32-bit floating-point operands, 16-bit fixed-point numbers, or 8-bit integers.
  • compared with a fixed-point adder, the multiplier determines the latency of the convolution operation; an 8-bit fixed-point two-operand multiplier takes 3.4 times as long as the adder.
  • different DCNN models require different precision, and even different layers of the same model have different precision requirements, so the multiplier designed for a convolutional neural network accelerator must be able to cover the worst case.
  • the main components of a classic DCNN accelerator are multiplier-adders, and their main problem is that they perform invalid operations.
  • invalid computation manifests in two aspects: first, the operands contain many zero values or values consisting mostly of zero bits. Compared with zero bits, zero values occupy only a small fraction of the weights; this small fraction of zeros can easily be eliminated by advanced microarchitecture design or memory-level compression techniques, so that they are never fed to the multipliers as input.
  • Figure 1 shows that, compared with the basic bits ("1" bits), the average proportion of zero bits is as high as 68.9%.
  • the traditional CNN places each weight/activation value pair into a processing element (PE) and completes the multiply-add operation within one cycle; computing zero bits is therefore unavoidable. If the time spent on invalid computation can be reduced, the throughput of the PE will improve.
  • a bit position holding a "0" bit in a PE is referred to as a "slack bit".
  • the distribution of basic bits has two characteristics: (1) at each bit position, a bit is basic with a probability of about 50% to 60%, which also means it is slack with a probability of 40% to 50%; (2) for certain weights, most of the bits are slack. For example, bits 3 to 5 contain less than 1% of the basic bits. These bits consist almost entirely of slack bits, but the multiplier does not distinguish slack bits from basic bits when performing multiply-add operations. To improve the efficiency of the convolution operation, the slack bits must be exploited.
  • an object of the present invention is to provide a split accumulator for a convolutional neural network accelerator, including a weight-kneading technique for convolutional neural network accelerators.
  • weight kneading exploits the zero-bit slack that is pervasive in modern DCNN models. Unlike data compression or pruning, it reduces the number of weights without loss of accuracy.
  • the present invention discloses a split accumulator for a convolutional neural network accelerator, which includes:
  • a weight-kneading module, used to obtain multiple groups of activation values to be operated on and their corresponding original weights, arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix, remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, remove the empty rows from the intermediate matrix, and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
  • a split-accumulate module, used to obtain, according to the correspondence between the activation values and the basic bits in the original weights, the position information indicating which activation value each bit of the kneading weight corresponds to, split the kneading weight bitwise into multiple weight segments, sum each weight segment with its corresponding activation value according to the position information, send the processing results to an adder tree, and perform shift-and-add on the processing results to obtain an output feature map.
  • the split accumulator for a convolutional neural network accelerator further includes a separator and segment adders; the separator performs bitwise segmentation of the kneading weight, and the segment adders sum the weight segments with the corresponding activation values.
  • the split-accumulate module uses Huffman coding to store the position information indicating which activation value each bit of the kneading weight corresponds to.
  • the activation value is a pixel value of an image.
  • the technical advances of the present invention include:
  • the method of kneading weights and activation values, and the way the kneaded values are parsed and computed, reduce storage and accelerate computation;
  • the Tetris accelerator architecture, including the structure of the SAC and the structure of the separator, further accelerates the convolution computation.
  • Vivado HLS is used to simulate the two methods and obtain the execution times.
  • Figure 2 shows the inference speedup of DaDN and of the two modes of the Tetris accelerator proposed by the present invention.
  • by kneading weights, the Tetris method achieves a 1.30x speedup in fp16 mode and an average 1.50x speedup in int8 mode; compared with PRA, it achieves a speedup of nearly 1.15x.
  • Figure 1 shows the proportion of zero values and zero bits in some modern deep convolutional neural network models
  • FIG. 2 is a comparison diagram of technical effects
  • FIG. 3 is a schematic diagram of weight kneading
  • FIG. 4 is a schematic diagram of a split accumulator SAC
  • Figure 5 is a Tetris accelerator architecture diagram
  • FIG. 6 is a structural diagram of a separator
  • FIG. 7 is a schematic diagram of a result stored in a register of an arithmetic unit.
  • the present invention restructures the inference computation mode of the DCNN model.
  • the present invention uses a split accumulator (SAC) instead of the classic computation mode, the MAC; the classic multiplication is replaced by a series of low-cost adders.
  • SAC: split accumulator
  • the present invention makes full use of the basic bits in the weights and is composed of adders, shifters, and the like, without any multiplier.
  • in a traditional multiplier, each weight/activation value pair undergoes one shift-and-sum operation, where "weight/activation value" means "weight and activation value".
  • the present invention accumulates multiple weight/activation value pairs over multiple steps and performs only one shift-and-add summation, thereby achieving a significant acceleration.
  • the present invention proposes a Tetris accelerator to tap the maximum potential of the weight-kneading technique and the split-accumulator SAC.
  • the Tetris accelerator consists of a series of split-accumulator (SAC) units and uses kneaded weights and activation values to achieve high-throughput, low-power inference. Experiments with high-level synthesis tools show that, compared with the prior art, the Tetris accelerator achieves the best results.
  • the activation values of the first layer are the input; the activation values of the second layer are the output of the first layer, and so on. If the input is an image, the activation values of the first layer are the pixel values of the image.
  • the present invention includes:
  • Step 1: obtain multiple groups of activation values to be operated on and their corresponding original weights; arrange the original weights in calculation order and align them bitwise to obtain a weight matrix; remove the slack bits from the weight matrix (a removed slack bit becomes a vacancy) to obtain a reduced matrix with vacancies; let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix; remove the empty rows from the intermediate matrix (an empty row is a row of the intermediate matrix consisting entirely of vacancies); and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
  • Step 2: according to the correspondence between the activation values and the basic bits in the original weights, obtain the position information indicating which activation value each bit (element) of the kneading weight corresponds to;
  • Step 3: the kneading weight is sent to a split accumulator, which divides the kneading weight bitwise into a plurality of weight segments; based on the position information, each weight segment is summed with the corresponding activation value, the processing results are sent to an adder tree, and a shift-and-add is performed on the processing results to obtain an output feature map.
  • the split accumulator includes a separator and segment adders; the separator performs bitwise division of the kneading weight, and the segment adders sum the weight segments with the corresponding activation values.
  • Figure 3 shows the weight-kneading method. Assuming 6 groups of weights form a batch, it usually takes 6 cycles to complete the 6 weight/activation multiply-add pairs. Slack occurs in two orthogonal dimensions: (1) within a weight, i.e., in W1, W2, W3, W4, and W6, the slack bits are distributed arbitrarily; (2) slack also appears in the weight dimension, i.e., W5 is an all-zero-bit (zero-valued) weight and therefore does not appear in Figure 3(b). By kneading the basic bits and the slack bits, the 6 MAC computation cycles are reduced to 3 cycles, as shown in Figure 3(c).
  • W′1, W′2, and W′3 are obtained by weight kneading, and each of their bits is a basic bit drawn from W1 to W6. If more weights are allowed, these slack bits will be filled with basic bits; the present invention refers to this as the "weight kneading" operation, that is, replacing the slack bits in earlier original weights with the basic bits of subsequent original weights.
  • one benefit of weight kneading is that it automatically eliminates the effect of zero values without introducing extra operations; another is that zero bits are also replaced by basic bits, which avoids the effect of slack in both dimensions.
  • however, a kneading weight must be computed with multiple activation values rather than with an individual activation value.
  • for example, if the activation value a has 4 bits and the weight w also has 4 bits, w1, w2, w3, w4 with w1 the most significant bit, then a*w = a*w1*2^3 + a*w2*2^2 + a*w3*2^1 + a*w4*2^0. Multiplying by 2^3 is a left shift by 3 bits, by 2^2 a left shift by 2 bits, and by 2^1 a left shift by 1 bit, so a traditional multiplier performs a shift-and-add after every w*a product. The present invention reduces the number of shift-and-add operations.
  • the processing mode of the present invention differs from the traditional MAC and from canonical bit-serial schemes.
  • in the present invention it is called "SAC", which stands for "split and accumulate".
  • the SAC operation first splits the kneaded weight, looks up the referenced activation values, and finally accumulates each activation value into a specific segment register.
  • Figure 4 shows the SAC process for each weight/activation value pair. The figure is only an example illustrating the SAC concept; in actual use, kneaded-weight SAC is employed, i.e., the data first undergoes weight kneading and is then sent to the SAC for computation. Multiple SACs form an array that constitutes the accelerator of the invention.
  • the split accumulator (SAC) shown in FIG. 4 differs from the MAC, which aims to compute each pairwise multiplication exactly; the SAC first splits the weight and, according to the split result, sums the activation values in the corresponding segment adders. After a certain number of weight/activation value pairs have been processed, a final shift-and-add operation is performed to obtain the final sum. The "certain number" is determined experimentally and is related to the weight distribution of the convolutional neural network model; a suitable value can be set for a specific model, or a compromise value suitable for most models can be chosen. In detail, the SAC is a split accumulator: the data first undergoes weight kneading, and the output of weight kneading is the input of the SAC.
  • the separator is part of the SAC. After a weight enters the SAC, it first passes through the separator, which parses (splits) the weight and assigns the segments to the subsequent segment registers and segment adders; the adder tree then produces the final result.
  • the separator is responsible for dispatching each segment to its corresponding adder. For example, if the second bit of the weight is a basic bit, the activation value is passed to S1 in the figure; the same applies to the other bits.
  • computing 6 weight/activation pairs with a MAC requires 6 multiplications, whereas the SAC requires only 3 cycles. Although the SAC must read the position information of the activation values, this happens entirely within registers, takes a very short time, and is far cheaper than a multiplication. The most time-consuming part of a convolution is the multiplication, and the present invention performs no multiplication; it also performs fewer shift operations than the MAC. The invention sacrifices very little additional storage for this purpose, and the sacrifice is worthwhile.
  • Tetris accelerator architecture: each SAC unit consists of 16 separators organized in parallel (see Figure 6), which accept a series of kneading weights and activation values whose number is given by the kneading-stride (KS) parameter, the same number of segment adders, and a following adder tree that is responsible for computing the final partial sum.
  • KS: kneading stride
  • the name Tetris refers to the fact that the kneading process used in the present invention resembles the stacking process of the game Tetris.
  • the Tetris accelerator proposed by the present invention uses kneaded weights, and the SAC only assigns activation values to the segment adders for accumulation.
  • Figure 5 shows the architecture of Tetris.
  • each SAC unit accepts weight/activation value pairs.
  • the accelerator consists of a symmetric matrix of SAC units, connected to a throttle buffer, which accepts kneaded weights from the on-chip eDRAM. If 16-bit fixed-point weights are used, each unit contains 16 separators that make up a separator array.
  • the separator operates on the activation values and weight values according to the parameter KS, i.e., the number of weights kneaded together.
  • the separator micro-architecture is shown in Figure 6.
  • the weight-kneading technique requires the separator to accept a set of activation values for the kneading weights, and each basic bit must be identifiable so as to indicate the relevant activation value within the KS range.
  • the Tetris method uses the pair <Wi, p> in the separator to represent the activation related to a particular basic bit.
  • KS determines the range represented by the p bits; for example, a 4-bit p corresponds to a kneading stride of 16 weights. A comparator is assigned to determine whether a bit of the kneading weight is zero, because even after kneading some positions may still be invalid (e.g., W′3 in Fig. 3(c)), depending on KS. If the bit is slack, the multiplexer after the comparator outputs zero to the following structure; otherwise, it decodes p and outputs the corresponding activation value.
  • the separator fetches the target activation value from the throttle buffer only when necessary and does not need to store the activation value multiple times.
  • the newly introduced position information p occupies a small amount of space, but p is used only to decode activations and consists of only a few bits (16 activations require 4 bits), so it does not introduce significant on-chip resource or power overhead.
  • each segment adder receives activation values from all 16 separators and adds them to the value in its local segment register.
  • the intermediate segment sums are stored in the S0-S15 registers; once all pending additions are completed, a "control signal" notifies the multiplexer to pass each segment value to the subsequent adder tree.
  • the final stage of the adder tree generates the final sum and passes it to the output nonlinear activation function, which is determined by the network model, e.g., ReLU or sigmoid.
  • in the throttle buffer unit, the end of the addable activation/weight pairs is marked by a tag and sent to the detector in each SAC unit; when all tags reach the end, the adder tree outputs the final sum.
  • since KS is used as a parameter to control weight kneading, different channels will have different numbers of kneaded weights, so in most cases the tag can reside anywhere in the throttle buffer. If a new activation value/weight pair is filled into the buffer, the tag moves backwards, so it does not affect the computation of each segment's partial sum.
  • in short, the weights are first kneaded and stored in the on-chip eDRAM, and the kneaded weights are obtained from the eDRAM.
  • the weights are parsed by the SAC separator and assigned to the subsequent segment registers and adders; the adder tree then produces the final result.
  • the final result is the feature map, which is the input of the next layer.
  • each bit of W′ may correspond to a different A (activation value). If a bit is "1", additional overhead is required to store which A it corresponds to; if the bit is "0", nothing needs to be stored.
  • as for the additionally stored position information (which A a bit corresponds to), the present invention does not restrict the encoding; a commonly used encoding such as Huffman coding can be used.
  • W′ and the additionally stored position information are sent to the SAC, which dispatches the weights and activation values to the corresponding arithmetic units.
  • kneading the weights as in Figure 3, the results held in the registers of the arithmetic unit are shown in FIG. 7; the register contents are then shift-added to obtain the final result.
  • the invention also discloses a split accumulator for a convolutional neural network accelerator, which includes:
  • a weight-kneading module, used to obtain multiple groups of activation values to be operated on and their corresponding original weights, arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix, remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, remove the empty rows from the intermediate matrix, and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
  • a split-accumulate module, used to obtain, according to the correspondence between the activation values and the basic bits in the original weights, the position information indicating which activation value each bit of the kneading weight corresponds to, split the kneading weight bitwise into multiple weight segments, sum each weight segment with its corresponding activation value according to the position information, send the processing results to an adder tree, and perform shift-and-add on the processing results to obtain an output feature map.
  • the split accumulator for a convolutional neural network accelerator further includes a separator and segment adders; the separator performs bitwise segmentation of the kneading weight, and the segment adders sum the weight segments with the corresponding activation values.
  • the split-accumulate module uses Huffman coding to store the position information indicating which activation value each bit of the kneading weight corresponds to.
  • the activation value is a pixel value of an image.
  • the invention relates to a split accumulator for a convolutional neural network accelerator, which transforms the original weights into a weight matrix, removes the slack bits from the weight matrix to obtain a reduced matrix with vacancies, lets the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, removes the empty rows from the intermediate matrix, and sets the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight. According to the correspondence between the activation values and the basic bits in the original weights, the position information indicating which activation value each bit of the kneading weight corresponds to is obtained. The kneading weight is sent to the split accumulator, which splits it bitwise into multiple weight segments; according to the position information, each weight segment is summed with the corresponding activation value, the processing results are sent to an adder tree, and a shift-and-add is performed on the processing results to obtain an output feature map.
  • by kneading the weights and activation values, the present invention reduces storage and accelerates the computation of the convolutional neural network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a split accumulator for a convolutional neural network accelerator. The original weights are arranged in calculation order and aligned bitwise to obtain a weight matrix; the slack bits are removed from the weight matrix to obtain a reduced matrix with vacancies; the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix; the empty rows are removed from the intermediate matrix and the remaining vacant positions are set to 0 to obtain a kneading matrix, each row of which is used as a kneading weight. According to the correspondence between the activation values and the basic bits in the original weights, position information indicating which activation value each bit of the kneading weight corresponds to is obtained. The kneading weight is sent to the split accumulator, which splits it bitwise into multiple weight segments; according to the position information, each weight segment is summed with the corresponding activation value, the processing results are sent to an adder tree, and a shift-and-add is performed on the processing results to obtain an output feature map.

Description

Split accumulator for a convolutional neural network accelerator
Technical field
The present invention relates to the field of neural network computing, and in particular to a split accumulator for a convolutional neural network accelerator.
Background art
Deep convolutional neural networks have made significant progress in machine learning applications such as real-time image recognition, detection, and natural language processing. To improve accuracy, advanced deep convolutional neural network (DCNN) architectures have complex connections and large numbers of neurons and synapses to meet the demands of high-accuracy, complex tasks. In a convolution operation, each weight is multiplied by its corresponding activation value and the products are summed; a weight and its activation value therefore form a pair.
Given the limitations of traditional general-purpose processor architectures, many researchers have proposed dedicated accelerators targeting the specific computation patterns of modern DCNNs. A DCNN consists of many layers, from tens to hundreds or even thousands. Nearly 98% of the computation in a DCNN comes from convolution operations, which are the dominant factor affecting power and performance. Improving the computational efficiency of convolution without compromising the robustness of the learning model has become an effective way to accelerate DCNNs, especially on lightweight devices with limited resources and low-power requirements, such as smartphones and autonomous robots.
To address this challenge, some existing methods exploit the fact that a fixed-point multiplication can be decomposed into a series of single-bit multiplications followed by shift-and-add, and propose bit-level serial processing when performing MACs. However, a basic bit (a "1") may appear anywhere in a fixed-point number, so such a scheme must account for the worst-case position of the "1" bit; for a 16-bit fixed-point number (fp16), a 16-bit register is needed to hold the position information of the basic bits. Different weights or activation values may incur different waiting times during acceleration, producing unpredictable cycle counts. The hardware design must therefore cover the worst case and adopt the worst-case cycle as the accelerator cycle, which not only lengthens the processing cycle and lowers the accelerator frequency but also increases design complexity.
To accelerate this operation, classic DCNN accelerators deploy multipliers and adders on each activation-value and weight channel to perform multiply-add operations. To balance speed and accuracy, the multiplication may use 32-bit floating-point operands, 16-bit fixed-point numbers, or 8-bit integers. Compared with a fixed-point adder, the multiplier determines the latency of the convolution operation; an 8-bit fixed-point two-operand multiplier takes 3.4 times as long as the adder. Moreover, different DCNN models require different precision, and even different layers of the same model have different precision requirements, so a multiplier designed for a convolutional neural network accelerator must cover the worst case.
The main components of classic DCNN accelerators are multiply-accumulate units, and their main problem is invalid computation. Invalid computation manifests in two aspects. First, the operands contain many zero values or values composed mostly of zero bits. Compared with zero bits, zero values occupy only a small fraction of the weights; this small fraction of zeros can easily be eliminated by advanced microarchitecture design or memory-level compression techniques, so that they are never fed to the multipliers. Figure 1 shows that, compared with basic bits ("1" bits), the average proportion of zero bits is as high as 68.9%. Second, the intermediate results of multiply-add operations are useless; for example, for y = ab + cd + de we only need the value of y, not the value of ab. This means that optimizing away zero bits and multiply-add operations improves computational efficiency and reduces power consumption.
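As an illustration of the statistic above, the following minimal Python sketch (not part of the patent; the function name and the toy weight values are assumptions) counts the fraction of zero bits in a batch of 16-bit fixed-point weights, the quantity summarized in Figure 1.

```python
# Minimal sketch: estimating the zero-bit ("slack") ratio of 16-bit fixed-point weights.
def zero_bit_ratio(weights, bits=16):
    """Fraction of bit positions that are 0 across all weights."""
    total = zero = 0
    for w in weights:
        w &= (1 << bits) - 1                  # treat the weight as an unsigned bit pattern
        for i in range(bits):
            zero += 1 - ((w >> i) & 1)
            total += 1
    return zero / total

# Toy example; real DCNN layers show average ratios near 68.9% (Figure 1).
print(zero_bit_ratio([0b0000001100000001, 0b0000000000010000, 0b0100000000000011]))
```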
Many works use quantization techniques to accelerate computation, for example converting weights to binary values or, more precisely, ternary values, so that multiplications can be converted into pure shift or add operations. However, this inevitably sacrifices result accuracy, and the accuracy loss of these schemes is severe, especially on large datasets. It is therefore highly necessary to devise a high-accuracy accelerator.
A traditional CNN places each weight/activation-value pair into a processing element (PE) and completes the multiply-add operation in one cycle, but computing zero bits is unavoidable. If the time spent on invalid computation can be reduced, the throughput of the PE improves. In the present invention, a bit position holding a "0" bit in a PE is called a "slack bit".
The distribution of basic bits ("1" bits) has two characteristics: (1) at each bit position, a bit is basic with a probability of about 50%-60%, which means it is slack with a probability of 40%-50%; (2) for certain weights, most bits are slack. For example, bits 3 to 5 contain less than 1% of the basic bits. These bits consist almost entirely of slack bits, yet the multiplier does not distinguish slack bits from basic bits when performing multiply-add operations. To improve the efficiency of the convolution operation, the slack bits must be exploited.
If the slack bits of earlier weights can be replaced by the basic bits of subsequent weights, invalid computation can be reduced and multiple weight/activation-value pairs can be processed in one cycle. By squeezing the weights in this way, the total weights can be compressed to nearly half their initial volume; in other words, inference time can be reduced by 50%. Achieving this, however, is difficult, because the existing MAC computation mode must be modified and the hardware architecture must be rebuilt to support the new computation mode.
Disclosure of the invention
To solve the above technical problems, an object of the present invention is to provide a split accumulator for a convolutional neural network accelerator, including a weight-kneading technique for convolutional neural network accelerators. Weight kneading alleviates the zero-bit slack that is pervasive in modern DCNN models; unlike data compression or pruning, it reduces the number of weights without loss of accuracy.
Specifically, the present invention discloses a split accumulator for a convolutional neural network accelerator, comprising:
a weight-kneading module, configured to obtain multiple groups of activation values to be operated on and their corresponding original weights, arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix, remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, remove the empty rows from the intermediate matrix, and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
a split-accumulate module, configured to obtain, according to the correspondence between the activation values and the basic bits in the original weights, the position information indicating which activation value each bit of the kneading weight corresponds to, split the kneading weight bitwise into multiple weight segments, sum each weight segment with its corresponding activation value according to the position information, send the processing results to an adder tree, and perform shift-and-add on the processing results to obtain an output feature map.
The split accumulator for a convolutional neural network accelerator further comprises a separator and segment adders; the separator performs bitwise segmentation of the kneading weight, and the segment adders sum the weight segments with the corresponding activation values.
In the split accumulator for a convolutional neural network accelerator, the split-accumulate module uses Huffman coding to store the position information indicating which activation value each bit of the kneading weight corresponds to.
In the split accumulator for a convolutional neural network accelerator, the activation values are the pixel values of an image.
The technical advances of the present invention include:
1. The method of kneading weights and activation values, and the way the kneaded values are parsed and computed, reduce storage and accelerate computation;
2. The Tetris accelerator architecture, including the structure of the SAC and the structure of the separator, further accelerates the convolution computation.
Vivado HLS is used to simulate the two methods and obtain the execution times. Figure 2 shows the inference speedup of DaDN and of the two modes of the Tetris accelerator proposed by the present invention. We observe that, by kneading weights, the Tetris method achieves a 1.30x speedup in fp16 and an average 1.50x speedup in int8 mode; compared with PRA, it achieves a speedup of nearly 1.15x.
Brief description of the drawings
Figure 1 shows the proportion of zero values and zero bits in some modern deep convolutional neural network models;
Figure 2 is a comparison diagram of technical effects;
Figure 3 is a schematic diagram of weight kneading;
Figure 4 is a schematic diagram of the split accumulator (SAC);
Figure 5 is a diagram of the Tetris accelerator architecture;
Figure 6 is a structural diagram of the separator;
Figure 7 is a schematic diagram of the results stored in the registers of the arithmetic unit.
Best mode for carrying out the invention
The present invention restructures the inference computation mode of the DCNN model. The present invention uses a split accumulator (SAC) instead of the classic computation mode, the MAC; there is no classic multiplication, which is replaced by a series of low-cost adders. The present invention makes full use of the basic bits in the weights and is composed of adders, shifters, and the like, without any multiplier. In a traditional multiplier, each weight/activation-value pair undergoes one shift-and-sum operation, where "weight/activation value" means "weight and activation value". The present invention instead accumulates multiple weight/activation-value pairs over multiple steps and performs only one shift-and-add summation, thereby achieving a significant speedup.
Finally, the present invention proposes the Tetris accelerator to exploit the full potential of the weight-kneading technique and the split accumulator (SAC). The Tetris accelerator consists of a series of SAC units and uses kneaded weights and activation values to achieve high-throughput, low-power inference. Experiments with high-level synthesis tools show that, compared with the prior art, the Tetris accelerator achieves the best results. The activation values of the first layer are the input; the activation values of the second layer are the output of the first layer, and so on. If the input is an image, the activation values of the first layer are the pixel values of the image.
Specifically, the present invention comprises:
Step 1: Obtain multiple groups of activation values to be operated on and their corresponding original weights; arrange the original weights in calculation order and align them bitwise to obtain a weight matrix; remove the slack bits from the weight matrix (a removed slack bit becomes a vacancy) to obtain a reduced matrix with vacancies; let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix; remove the empty rows from the intermediate matrix (an empty row is a row of the intermediate matrix consisting entirely of vacancies); and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
Step 2: According to the correspondence between the activation values and the basic bits in the original weights, obtain position information indicating which activation value each bit (element) of the kneading weight corresponds to;
Step 3: Send the kneading weight to a split accumulator, which splits the kneading weight bitwise into multiple weight segments; according to the position information, sum each weight segment with its corresponding activation value, send the processing results to an adder tree, and perform shift-and-add on the processing results to obtain an output feature map.
The split accumulator comprises a separator and segment adders; the separator performs bitwise segmentation of the kneading weight, and the segment adders sum the weight segments with the corresponding activation values.
To make the above features and effects of the present invention clearer and easier to understand, embodiments are given below and described in detail with reference to the accompanying drawings.
Figure 3 illustrates the weight-kneading method. Assume 6 groups of weights are processed as a batch; it normally takes 6 cycles to complete the 6 weight/activation multiply-add pairs. Slack appears in two orthogonal dimensions: (1) within a weight, i.e., in W1, W2, W3, W4, and W6, the slack bits are distributed arbitrarily; (2) slack also appears in the weight dimension, i.e., W5 is an all-zero-bit (zero-valued) weight and therefore does not appear in Figure 3(b). By kneading the basic bits and the slack bits, the 6 MAC cycles are reduced to 3 cycles, as shown in Figure 3(c). The present invention obtains W′1, W′2, and W′3 by weight kneading; each of their bits is a basic bit drawn from W1 to W6. If more weights are allowed, the slack bits are filled with basic bits; the present invention calls this the "weight kneading" operation, i.e., replacing the slack bits of earlier original weights with the basic bits of subsequent original weights. Clearly, one benefit of weight kneading is that it automatically eliminates the effect of zero values without introducing extra operations; another is that zero bits are also replaced by basic bits during kneading, which avoids the effect of slack in both dimensions. However, a kneading weight must be computed against multiple activation values rather than a single activation value.
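To make the kneading procedure concrete, the following Python sketch (an illustration only, not the patent's hardware; the function name, data layout, and toy values are assumptions) packs the basic bits of a batch of original weights column by column, drops the rows that end up empty, and records for every surviving basic bit which original weight it came from.

```python
# Illustrative sketch of weight kneading on a small batch of 4-bit weights.
# Column-wise, the basic ('1') bits of the original weights are packed upward so that
# slack ('0') bits of earlier weights are replaced by basic bits of later weights.
# pos[r][b] records which original weight the basic bit at row r, bit position b came from.
def knead(weights, bits=4):
    cols = []
    for b in range(bits):                      # one column per bit position
        cols.append([i for i, w in enumerate(weights) if (w >> b) & 1])
    rows = max((len(c) for c in cols), default=0)   # all-empty rows are dropped
    kneaded, pos = [], []
    for r in range(rows):
        w_knead, origin = 0, {}
        for b in range(bits):
            if r < len(cols[b]):               # a basic bit fills this slot
                w_knead |= 1 << b
                origin[b] = cols[b][r]         # index of the source weight
        kneaded.append(w_knead)                # remaining vacancies stay 0
        pos.append(origin)
    return kneaded, pos

# Toy batch of 6 four-bit weights (the all-zero W5 vanishes, as in Figure 3):
kneaded, pos = knead([0b1010, 0b0011, 0b0100, 0b1001, 0b0000, 0b0110])
print([bin(k) for k in kneaded], pos)          # 3 kneading weights instead of 6
```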
To support this efficient computation, it is important to explore an accelerator architecture that differs from the traditional architecture built around the classic MAC. We use an equivalent shift-and-add to obtain a partial sum; the partial sum is not exactly the sum of a series of weight/activation multiplications, so it is not necessary to shift by b bits immediately after computing one kneading weight w′. The final shift by b bits can be performed after the kneading weights have been computed. The size of b follows from the shift-and-add principle. For example, if the activation value a has 4 bits and the weight w also has 4 bits, w1, w2, w3, w4 with w1 the most significant bit, then a*w = a*w1*2^3 + a*w2*2^2 + a*w3*2^1 + a*w4*2^0. Multiplying by 2^3 is a left shift by 3 bits, by 2^2 a left shift by 2 bits, and by 2^1 a left shift by 1 bit, so a traditional multiplier performs a shift-and-add after every w*a product. The present invention reduces the number of shift-and-add operations. Its processing mode differs from the traditional MAC and from canonical bit-serial schemes; in the present invention it is called "SAC", which stands for "split and accumulate". The SAC operation first splits the kneaded weight, looks up the referenced activation values, and finally accumulates each activation value into a specific segment register. Figure 4 shows the SAC process for each weight/activation-value pair. The figure is only an example illustrating the SAC concept; in actual use, kneaded-weight SAC is employed, i.e., the data first undergoes weight kneading and is then sent to the SAC for computation. Multiple SACs form an array that constitutes the accelerator of the invention.
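The bit-level identity above can be checked with a short fragment (illustrative only; the function name and the example values are assumptions): each basic bit of the weight contributes one shifted copy of the activation, and the slack bits contribute nothing, which is exactly what the SAC exploits.

```python
# Bit-serial decomposition: a*w = sum over i of a*w_i*2^i, where w_i is bit i of w.
def bit_serial_product(a, w, bits=4):
    acc = 0
    for i in range(bits):
        if (w >> i) & 1:                 # basic bit -> add a shifted copy of a
            acc += a << i
    return acc

a, w = 5, 0b1011                         # bits of w (MSB first): 1, 0, 1, 1
assert bit_serial_product(a, w) == a * w # 5 * 11 = 55
```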
The split accumulator (SAC) shown in Figure 4 differs from a MAC, which aims to compute each pairwise multiplication exactly. The SAC first splits the weight and, according to the split result, sums the activation values in the corresponding segment adders. After a certain number of weight/activation-value pairs have been processed, a final shift-and-add is performed to obtain the final sum. The "certain number" is determined experimentally and is related to the weight distribution of the convolutional neural network model; a suitable value can be set for a specific model, or a compromise value suitable for most models can be chosen. In detail, the SAC is a split accumulator: after the data enters, it first undergoes weight kneading, and the output of weight kneading is the input of the SAC. The separator is part of the SAC; after a weight enters the SAC, it first passes through the separator, which parses (splits) the weight and assigns the segments to the subsequent segment registers and segment adders. The adder tree then produces the final result.
The SAC instantiates "separators" according to the required bit length: if 16-bit fixed-point numbers are used for each weight value, then 16 segment registers (p = 16 in the figure) and 16 adders are needed; the number of fixed-point bits is chosen according to precision and speed requirements. Higher precision means lower speed, and this choice depends on the application; models also differ in their sensitivity to precision—for some models 8 bits suffice, while others need 16 bits. Typically, the separator is responsible for dispatching each segment to its corresponding adder; for example, if the second bit of the weight is a basic bit, the activation value is passed to S1 in the figure, and the same applies to the other bits. After this weight is "split", subsequent weights are processed in the same way, so each weight-segment register accumulates new activations. The subsequent adder tree performs one shift-and-add to obtain the final partial sum. Unlike a MAC, the SAC does not try to obtain intermediate partial sums, because in a real CNN model the output feature map only needs the "final" sum, i.e., the sum over all channels of the convolution kernel and its corresponding input feature map. The benefit over the MAC is even more pronounced when kneaded weights are used, as shown in Figures 3 and 7. Computing 6 pairs of weights and activation values with a MAC requires computing a1*w1, then a2*w2, and so on—6 multiplication operations—whereas the SAC needs 3 cycles. Although the SAC must read the position information of the activation values, this happens entirely within registers and takes far less time than one multiplication. The most time-consuming part of a convolution is multiplication, and the present invention performs no multiplication; it also performs fewer shift operations than a MAC. The invention sacrifices a very small amount of additional storage for this, but the sacrifice is worthwhile.
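The split-and-accumulate flow can be summarized in a few lines of Python (illustrative only, reusing knead() from the earlier sketch; the function and variable names are assumptions): every basic bit of a kneading weight routes its referenced activation into the segment register of that bit position, and a single shift-and-add over the segment registers at the end reproduces the ordinary sum of weight-activation products.

```python
# Illustrative SAC flow for kneaded weights: segment registers S0..S{bits-1}
# accumulate activations; one final shift-and-add produces the partial sum.
def sac(kneaded, pos, activations, bits=4):
    seg = [0] * bits                                   # segment registers
    for r, w_knead in enumerate(kneaded):              # one kneading weight per cycle
        for b in range(bits):
            if (w_knead >> b) & 1:                     # basic bit -> accumulate activation
                seg[b] += activations[pos[r][b]]
    return sum(s << b for b, s in enumerate(seg))      # single final shift-and-add

# Consistency check against the plain MAC result:
weights = [0b1010, 0b0011, 0b0100, 0b1001, 0b0000, 0b0110]
acts = [3, 1, 4, 1, 5, 9]
kneaded, pos = knead(weights)
assert sac(kneaded, pos, acts) == sum(w * a for w, a in zip(weights, acts))
```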
Figure 5: Tetris accelerator architecture. Each SAC unit consists of 16 separators organized in parallel (see Figure 6), which accept a series of kneading weights and activation values whose number is given by the kneading-stride (KS) parameter, the same number of segment adders, and a following adder tree responsible for computing the final partial sum. The name Tetris refers to the fact that the kneading process used in the present invention resembles the stacking process of the game Tetris.
In a traditional accelerator design, pairwise accumulation cannot improve inference efficiency because it does not distinguish invalid computation; even if the current bit of a weight is zero, the corresponding weight/activation pair is still computed. The Tetris accelerator proposed by the present invention therefore uses kneaded weights, and the SAC only assigns activation values to the segment adders for accumulation. Figure 5 shows the Tetris architecture. Each SAC unit accepts weight/activation-value pairs. Specifically, the accelerator consists of a symmetric matrix of SAC units, connected to a throttle buffer, which accepts kneaded weights from the on-chip eDRAM. If 16-bit fixed-point weights are used, each unit contains 16 separators forming a separator array. The separator operates on the activation values and weight values according to the parameter KS, i.e., the number of weights kneaded together. There are 16 output data paths for activations to reach the target segment register of each separator, so a fully connected structure is formed between the separator array and the segment adders.
The separator microarchitecture is shown in Figure 6. The weight-kneading technique requires the separator to accept a set of activation values for the kneading weights, and each basic bit must be identifiable so as to indicate the relevant activation value within the KS range. The Tetris method uses the pair <Wi, p> in the separator to represent the activation related to a particular basic bit. KS determines the range represented by the p bits; for example, a 4-bit p corresponds to a kneading stride of 16 weights. A comparator is assigned to determine whether a bit of the kneading weight is zero, because even after kneading some positions may still be invalid (e.g., W′3 in Figure 3(c)), depending on KS. If the bit is slack, the multiplexer after the comparator outputs zero to the following structure; otherwise, it decodes p and outputs the corresponding activation value.
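The per-bit behaviour of the separator can be sketched as follows (an illustration only; the <W_i, p> pairing follows the text above, but the function name and values are assumptions): a comparator checks whether the bit of the kneading weight is slack, and the multiplexer then outputs either zero or the activation selected by decoding p.

```python
# One bit lane of the separator in Figure 6: comparator + multiplexer + p decoder.
def separator_bit(w_bit, p, activations):
    if w_bit == 0:              # comparator: slack position (possible even after kneading)
        return 0                # multiplexer forwards zero to the segment adder
    return activations[p]       # decode p and forward the referenced activation

acts = [3, 1, 4, 1, 5, 9]          # KS activations held in the throttle buffer
print(separator_bit(1, 5, acts))   # basic bit referring to activation index 5 -> 9
print(separator_bit(0, 0, acts))   # slack bit -> 0
```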
The separator fetches the target activation value from the throttle buffer only when necessary and does not need to store the activation value multiple times. The newly introduced position information p occupies a small amount of space, but p is used only to decode activations and consists of only a few bits (16 activations require 4 bits), so it does not introduce significant on-chip resource or power overhead in the accelerator.
Each segment adder receives activation values from all 16 separators and adds them to the value in its local segment register. The intermediate segment sums are stored in the S0-S15 registers; once all pending additions are completed, a "control signal" notifies the multiplexer to pass each segment value to the subsequent adder tree. The last stage of the adder tree generates the final sum and passes it to the output nonlinear activation function, which is determined by the network model, e.g., ReLU or sigmoid. In the throttle buffer unit, the end of the addable activation/weight pairs is marked by a tag, which is sent to the detector in each SAC unit; when all tags reach the end, the adder tree outputs the final sum. Since we use KS as a parameter to control weight kneading, different channels will have different numbers of kneaded weights, so in most cases the tag can reside anywhere in the throttle buffer. If a new activation-value/weight pair is filled into the buffer, the tag moves backwards, so it does not affect the computation of each segment's partial sum.
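A software analogue of the segment-adder bank and the final adder tree might look like the sketch below (illustrative only; the class and method names are assumptions, and ReLU is used merely as one example of the output non-linearity): activations routed by the separators are accumulated into S0-S15, and a flush triggered by the control signal performs the single shift-and-add and applies the activation function.

```python
# Segment adders S0..S15 fed by the separators, flushed through the adder tree.
class SegmentAdderBank:
    def __init__(self, bits=16):
        self.seg = [0] * bits                    # S0..S15 segment registers

    def accumulate(self, bit_pos, activation):
        self.seg[bit_pos] += activation          # value arriving from a separator

    def flush(self):
        partial = sum(s << b for b, s in enumerate(self.seg))   # adder-tree shift-and-add
        self.seg = [0] * len(self.seg)           # ready for the next output position
        return max(partial, 0)                   # ReLU as an example output activation

bank = SegmentAdderBank()
bank.accumulate(1, 9)                            # basic bit at position 1, activation 9
bank.accumulate(3, 4)                            # basic bit at position 3, activation 4
print(bank.flush())                              # (9 << 1) + (4 << 3) = 50, then ReLU
```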
In short, the weights are first kneaded and stored in the on-chip eDRAM; the kneaded weights are fetched from the eDRAM, parsed by the SAC separator, and assigned to the subsequent segment registers and adders; the adder tree then produces the final result. This final result is the feature map, i.e., the input of the next layer.
Each bit of W′ may correspond to a different activation value A. If a bit is "1", additional overhead is needed to store which A it corresponds to; if the bit is "0", nothing needs to be stored. As for the additionally stored position information (which A a bit corresponds to), the present invention does not restrict the encoding; a commonly used encoding such as Huffman coding can be used. W′ and the additionally stored position information are sent to the SAC, which dispatches the weights and activation values to the corresponding arithmetic units.
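The side information that records which activation each basic bit of W′ refers to could be stored, for example, as fixed-width indices (4 bits per basic bit when KS = 16) or compressed further with Huffman coding; the patent does not fix the encoding. The sketch below (illustrative only; the packing order and the names are assumptions) shows the fixed-width variant.

```python
# Pack, for every basic bit of a kneading weight, the index of the activation it refers to.
def encode_positions(origin, bits_per_index=4):
    """origin maps bit position -> activation index (only '1' bits have entries)."""
    payload = 0
    for rank, bit_pos in enumerate(sorted(origin)):
        payload |= origin[bit_pos] << (rank * bits_per_index)   # little-end-first packing
    return payload

# Row 0 of the kneading example above referenced activations {0: 1, 1: 0, 2: 2, 3: 0}.
print(hex(encode_positions({0: 1, 1: 0, 2: 2, 3: 0})))          # -> 0x201
```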
Kneading the weights as in Figure 3, the results in the registers of the arithmetic unit are as shown in Figure 7; the register contents are then shift-added to obtain the final result.
The following is a system embodiment corresponding to the above method embodiment; this embodiment can be implemented in cooperation with the above embodiment. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiment.
The present invention also discloses a split accumulator for a convolutional neural network accelerator, comprising:
a weight-kneading module, configured to obtain multiple groups of activation values to be operated on and their corresponding original weights, arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix, remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, remove the empty rows from the intermediate matrix, and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
a split-accumulate module, configured to obtain, according to the correspondence between the activation values and the basic bits in the original weights, the position information indicating which activation value each bit of the kneading weight corresponds to, split the kneading weight bitwise into multiple weight segments, sum each weight segment with its corresponding activation value according to the position information, send the processing results to an adder tree, and perform shift-and-add on the processing results to obtain an output feature map.
The split accumulator for a convolutional neural network accelerator further comprises a separator and segment adders; the separator performs bitwise segmentation of the kneading weight, and the segment adders sum the weight segments with the corresponding activation values.
In the split accumulator for a convolutional neural network accelerator, the split-accumulate module uses Huffman coding to store the position information indicating which activation value each bit of the kneading weight corresponds to.
In the split accumulator for a convolutional neural network accelerator, the activation values are the pixel values of an image.
Industrial applicability
The invention relates to a split accumulator for a convolutional neural network accelerator. The original weights are transformed into a weight matrix; the slack bits are removed from the weight matrix to obtain a reduced matrix with vacancies; the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix; the empty rows are removed from the intermediate matrix and the remaining vacant positions are set to 0 to obtain a kneading matrix, each row of which is used as a kneading weight. According to the correspondence between the activation values and the basic bits in the original weights, position information indicating which activation value each bit of the kneading weight corresponds to is obtained. The kneading weight is sent to the split accumulator, which splits it bitwise into multiple weight segments; according to the position information, each weight segment is summed with the corresponding activation value, the processing results are sent to an adder tree, and a shift-and-add is performed on the processing results to obtain an output feature map. By kneading the weights and activation values, the present invention reduces storage and accelerates the computation of the convolutional neural network.

Claims (4)

  1. A split accumulator for a convolutional neural network accelerator, characterized by comprising:
    a weight-kneading module, configured to obtain multiple groups of activation values to be operated on and their corresponding original weights, arrange the original weights in calculation order and align them bit by bit to obtain a weight matrix, remove the slack bits from the weight matrix to obtain a reduced matrix with vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in calculation order to obtain an intermediate matrix, remove the empty rows from the intermediate matrix, and set the remaining vacant positions of the intermediate matrix to 0 to obtain a kneading matrix, each row of which is used as a kneading weight;
    a split-accumulate module, configured to obtain, according to the correspondence between the activation values and the basic bits in the original weights, the position information indicating which activation value each bit of the kneading weight corresponds to, split the kneading weight bitwise into multiple weight segments, sum each weight segment with its corresponding activation value according to the position information, send the processing results to an adder tree, and perform shift-and-add on the processing results to obtain an output feature map.
  2. The split accumulator for a convolutional neural network accelerator according to claim 1, characterized by further comprising a separator and segment adders, wherein the separator performs bitwise segmentation of the kneading weight and the segment adders sum the weight segments with the corresponding activation values.
  3. The split accumulator for a convolutional neural network accelerator according to claim 1 or 2, characterized in that the split-accumulate module uses Huffman coding to store the position information indicating which activation value each bit of the kneading weight corresponds to.
  4. The split accumulator for a convolutional neural network accelerator according to claim 1 or 2, characterized in that the activation values are the pixel values of an image.
PCT/CN2019/087769 2018-09-20 2019-05-21 Split accumulator for convolutional neural network accelerator WO2020057161A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/250,890 US20210357735A1 (en) 2018-09-20 2019-05-21 Split accumulator for convolutional neural network accelerator

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811100309 2018-09-20
CN201811100309.2 2018-09-20

Publications (1)

Publication Number Publication Date
WO2020057161A1 true WO2020057161A1 (zh) 2020-03-26

Family

ID=65843951

Family Applications (3)

Application Number Title Priority Date Filing Date
PCT/CN2019/087771 WO2020057162A1 (zh) 2018-09-20 2019-05-21 一种卷积神经网络加速器
PCT/CN2019/087767 WO2020057160A1 (zh) 2018-09-20 2019-05-21 一种基于权重捏合的卷积神经网络计算方法和系统
PCT/CN2019/087769 WO2020057161A1 (zh) 2018-09-20 2019-05-21 一种用于卷积神经网络加速器的拆分累加器

Family Applications Before (2)

Application Number Title Priority Date Filing Date
PCT/CN2019/087771 WO2020057162A1 (zh) 2018-09-20 2019-05-21 一种卷积神经网络加速器
PCT/CN2019/087767 WO2020057160A1 (zh) 2018-09-20 2019-05-21 一种基于权重捏合的卷积神经网络计算方法和系统

Country Status (3)

Country Link
US (3) US20210357735A1 (zh)
CN (3) CN109543140B (zh)
WO (3) WO2020057162A1 (zh)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543140B (zh) * 2018-09-20 2020-07-10 中国科学院计算技术研究所 一种卷积神经网络加速器
CN110059733A (zh) * 2019-04-01 2019-07-26 苏州科达科技股份有限公司 卷积神经网络的优化及快速目标检测方法、装置
CN110245324B (zh) * 2019-05-19 2023-01-17 南京惟心光电系统有限公司 一种基于光电计算阵列的反卷积运算加速器及其方法
CN110245756B (zh) * 2019-06-14 2021-10-26 第四范式(北京)技术有限公司 用于处理数据组的可编程器件及处理数据组的方法
CN110633153A (zh) * 2019-09-24 2019-12-31 上海寒武纪信息科技有限公司 一种用多核处理器实现神经网络模型拆分方法及相关产品
WO2021081854A1 (zh) * 2019-10-30 2021-05-06 华为技术有限公司 一种卷积运算电路和卷积运算方法
CN110807522B (zh) * 2019-10-31 2022-05-06 合肥工业大学 一种神经网络加速器的通用计算电路
CN113627600B (zh) * 2020-05-07 2023-12-29 合肥君正科技有限公司 一种基于卷积神经网络的处理方法及其系统
US11500811B2 (en) * 2020-06-12 2022-11-15 Alibaba Group Holding Limited Apparatuses and methods for map reduce
CN113919477A (zh) * 2020-07-08 2022-01-11 嘉楠明芯(北京)科技有限公司 一种卷积神经网络的加速方法及装置
CN112580787B (zh) * 2020-12-25 2023-11-17 北京百度网讯科技有限公司 神经网络加速器的数据处理方法、装置、设备及存储介质
CN116888575A (zh) * 2021-03-26 2023-10-13 上海科技大学 精简近似的基于共享的单输入多权重乘法器
CN114021710A (zh) * 2021-10-27 2022-02-08 中国科学院计算技术研究所 利用比特级稀疏性的深度学习卷积加速方法及处理器
CN114168991B (zh) * 2022-02-10 2022-05-20 北京鹰瞳科技发展股份有限公司 对加密数据进行处理的方法、电路及相关产品

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344882A1 (en) * 2016-05-31 2017-11-30 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
CN108182471A (zh) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 一种卷积神经网络推理加速器及方法
CN109543830A (zh) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 一种用于卷积神经网络加速器的拆分累加器

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679185B (zh) * 2012-08-31 2017-06-16 富士通株式会社 卷积神经网络分类器系统、其训练方法、分类方法和用途
US9513870B2 (en) * 2014-04-22 2016-12-06 Dialog Semiconductor (Uk) Limited Modulo9 and modulo7 operation on unsigned binary numbers
US10049322B2 (en) * 2015-05-21 2018-08-14 Google Llc Prefetching weights for use in a neural network processor
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
US11074492B2 (en) * 2015-10-07 2021-07-27 Altera Corporation Method and apparatus for performing different types of convolution operations with the same processing elements
US11170294B2 (en) * 2016-01-07 2021-11-09 Intel Corporation Hardware accelerated machine learning
US20170344876A1 (en) * 2016-05-31 2017-11-30 Samsung Electronics Co., Ltd. Efficient sparse parallel winograd-based convolution scheme
US10528864B2 (en) * 2016-08-11 2020-01-07 Nvidia Corporation Sparse convolutional neural network accelerator
US11544539B2 (en) * 2016-09-29 2023-01-03 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN106529670B (zh) * 2016-10-27 2019-01-25 中国科学院计算技术研究所 一种基于权重压缩的神经网络处理器、设计方法、芯片
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法
US10871964B2 (en) * 2016-12-29 2020-12-22 Qualcomm Incorporated Architecture for sparse neural network acceleration
CN107086910B (zh) * 2017-03-24 2018-08-10 中国科学院计算技术研究所 一种针对神经网络处理的权重加解密方法和系统
CN107392308B (zh) * 2017-06-20 2020-04-03 中国科学院计算技术研究所 一种基于可编程器件的卷积神经网络加速方法与系统
CN107341544B (zh) * 2017-06-30 2020-04-10 清华大学 一种基于可分割阵列的可重构加速器及其实现方法
CN107844826B (zh) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 神经网络处理单元及包含该处理单元的处理系统
CN107918794A (zh) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 基于计算阵列的神经网络处理器
WO2019157599A1 (en) * 2018-02-16 2019-08-22 The Governing Council Of The University Of Toronto Neural network accelerator
CN108510066B (zh) * 2018-04-08 2020-05-12 湃方科技(天津)有限责任公司 一种应用于卷积神经网络的处理器

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344882A1 (en) * 2016-05-31 2017-11-30 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
CN108182471A (zh) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 一种卷积神经网络推理加速器及方法
CN109543830A (zh) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 一种用于卷积神经网络加速器的拆分累加器
CN109543140A (zh) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 一种卷积神经网络加速器

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANG, LIN ET AL.: "Design and Implementation of Convolutional Neural Network Based on FPGA", MICROELECTRONICS & COMPUTER, vol. 35, no. 8, 31 August 2018 (2018-08-31) *

Also Published As

Publication number Publication date
CN109543140B (zh) 2020-07-10
WO2020057162A1 (zh) 2020-03-26
CN109543830A (zh) 2019-03-29
CN109543140A (zh) 2019-03-29
US20210357735A1 (en) 2021-11-18
WO2020057160A1 (zh) 2020-03-26
US20210350204A1 (en) 2021-11-11
US20210350214A1 (en) 2021-11-11
CN109543816B (zh) 2022-12-06
CN109543830B (zh) 2023-02-03
CN109543816A (zh) 2019-03-29

Similar Documents

Publication Publication Date Title
WO2020057161A1 (zh) 一种用于卷积神经网络加速器的拆分累加器
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
KR102557589B1 (ko) 가속화된 수학 엔진
Jang et al. Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC
CN107862374B (zh) 基于流水线的神经网络处理系统和处理方法
CN107704916A (zh) 一种基于fpga实现rnn神经网络的硬件加速器及方法
CN110555516B (zh) 基于FPGA的YOLOv2-tiny神经网络低延时硬件加速器实现方法
KR20180083030A (ko) 이진 파라미터를 갖는 컨볼루션 신경망 시스템 및 그것의 동작 방법
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
JP7292297B2 (ja) 確率的丸めロジック
CN113010213B (zh) 基于阻变忆阻器的精简指令集存算一体神经网络协处理器
KR20190089685A (ko) 데이터를 처리하는 방법 및 장치
Khaleghi et al. Shear er: highly-efficient hyperdimensional computing by software-hardware enabled multifold approximation
Kung et al. Term revealing: Furthering quantization at run time on quantized dnns
Bao et al. LSFQ: A low precision full integer quantization for high-performance FPGA-based CNN acceleration
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
Wu et al. A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation
CN116888591A (zh) 一种矩阵乘法器、矩阵计算方法及相关设备
CN110659014B (zh) 乘法器及神经网络计算平台
CN110716751B (zh) 高并行度计算平台、系统及计算实现方法
He et al. FTW-GAT: An FPGA-based accelerator for graph attention networks with ternary weights
CN116842304A (zh) 一种不规则稀疏矩阵的计算方法及系统
CN112906863B (zh) 一种神经元加速处理方法、装置、设备及可读存储介质
CN115222028A (zh) 基于fpga的一维cnn-lstm加速平台及实现方法
CN109901993B (zh) 一种单路径线性约束的循环程序终止性判断方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19862815

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19862815

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19862815

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19862815

Country of ref document: EP

Kind code of ref document: A1