US20210357735A1 - Split accumulator for convolutional neural network accelerator - Google Patents

Info

Publication number
US20210357735A1
Authority
US
United States
Prior art keywords
weight
kneading
matrix
bit
activations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/250,890
Inventor
Xiaowei Li
Xin Wei
Hang Lu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Assigned to INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES. Assignment of assignors' interest (see document for details). Assignors: LI, XIAOWEI; LU, HANG; WEI, XIN
Publication of US20210357735A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/4031 Fixed length to variable length coding
    • H03M7/4037 Prefix coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059 Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/386 Special constructional features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The invention relates to the field of neural network computation, and particularly to a split accumulator for a convolutional neural network accelerator.
  • Deep convolutional neural networks have made significant progress in machine learning applications such as real-time image recognition, detection and natural language processing.
  • To meet the accuracy demands of complex tasks, advanced deep convolutional neural network (DCNN) architectures have complex connectivity and massive numbers of neurons and synapses.
  • In a convolution operation, weights are multiplied by their corresponding activations, and the products are then summed. That is, each weight and its activation form a pair.
  • A DCNN is formed of multiple layers, from dozens to hundreds or even thousands. Nearly 98% of the computation in the entire DCNN comes from the convolution operation, which makes convolution the major factor affecting power and performance. Improving the computing efficiency of convolution without damaging the robustness of the learned model is therefore an effective way to accelerate the DCNN, particularly on lightweight devices (e.g., smartphones and autonomous robots) with limited resources and low power consumption requirements.
  • An essential bit may be at any position of a fixed-point number, so such a solution must account for the worst-case position of the bit “1”; for a 16-bit fixed-point number (fp16), a 16-bit register is needed to save the positional information of the essential bits (“1”).
  • Different weights or activations may incur different latencies during acceleration, so the number of cycles becomes unpredictable.
  • The hardware design must cover the worst case, so the worst-case cycle has to serve as the cycle of the accelerator. This lengthens the processing cycle, lowers the frequency of the accelerator, and adds design complexity.
  • A typical DCNN accelerator performs multiply-and-add operations by deploying a multiplier and an adder on each activation and weight lane.
  • The multiplication operands can be 32-bit floating-point numbers, 16-bit fixed-point numbers, or 8-bit integers.
  • The multiplier determines the delay of the convolution operation: the time required by an 8-bit fixed-point two-operand multiplier is 3.4 times that of the adder.
  • Different DCNN models require different accuracies, and even different layers of the same model have different accuracy requirements, so a multiplier designed for a convolutional neural network accelerator must cover the worst case.
  • The main component of a typical DCNN accelerator is the multiply-adder, and the main problem of the multiply-adder is that it performs invalid operations.
  • Invalid operations arise in two ways: the operands include many zero values, or the operands consist mostly of zero bits. Compared with zero bits, zero values account for only a small fraction of the weights. This small fraction of zero values can easily be eliminated through advanced microarchitecture design or memory-level compression, so that they are never fed to the multipliers.
  • FIG. 1 shows that the zero bits, as opposed to the essential bits (the “1” bits), account for an average ratio of up to 68.9%.
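  • As an illustration of the distinction between zero values and zero bits, the following minimal Python sketch counts both in a list of unsigned fixed-point weights. The function names, the 8-bit width and the sample weights are hypothetical and are not taken from the patent or from FIG. 1.

```python
def zero_value_ratio(weights):
    """Fraction of weights whose value is exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


def zero_bit_ratio(weights, bit_width=8):
    """Fraction of zero bits over all bit positions of all weights."""
    mask = (1 << bit_width) - 1
    one_bits = sum(bin(w & mask).count("1") for w in weights)
    return 1 - one_bits / (len(weights) * bit_width)


# Hypothetical 8-bit weight magnitudes, for illustration only.
sample = [0b00000001, 0b00000000, 0b11110000, 0b00001010]
print(zero_value_ratio(sample))  # 0.25
print(zero_bit_ratio(sample))    # 0.78125
```

  • In this made-up sample only one weight in four is a zero value, yet more than three quarters of the bits are zero, which is the kind of gap the description attributes to real models.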
  • In some solutions, the weights are converted into binary values, or into more accurate ternary values, so that the multiplication operation can be converted into pure shift or add operations. However, this necessarily sacrifices the accuracy of the result, and on large data sets in particular the accuracy of these solutions is seriously degraded. A high-accuracy accelerator is therefore needed.
  • The traditional CNN places each weight/activation pair in a processing element (PE) and completes the multiply-and-add operation within one cycle. However, it cannot avoid computing the zero bits. Reducing the time spent on invalid computation would improve the throughput of the PE.
  • The invention refers to a “0” bit in the PE as a “slack bit”.
  • The distribution of the essential bits has two characteristics: (1) each position has a probability of about 50% to 60% of being valid, which also means a probability of about 40% to 50% of being slack; (2) in some weights, most of the bits are slack. For example, the third to fifth positions contain less than 1% of the essential bits. These positions consist almost entirely of slack bits, but the multiplier does not distinguish slack bits from essential bits when executing the multiply-and-add operation. To improve the efficiency of the convolution operation, the slack bits must be exploited.
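  • A per-position view of this distribution can be sketched as follows. This is a minimal Python illustration, assuming unsigned weights with index 0 as the least significant bit; the function name and sample values are hypothetical.

```python
def essential_bit_distribution(weights, bit_width=8):
    """Per-position fraction of weights that have an essential (1) bit.

    Index 0 of the returned list is the least significant bit.
    """
    n = len(weights)
    return [sum((w >> b) & 1 for w in weights) / n for b in range(bit_width)]


# Hypothetical 4-bit weights, for illustration only.
sample = [0b1010, 0b0110, 0b0000, 0b1001]
dist = essential_bit_distribution(sample, bit_width=4)
print(dist)  # [0.25, 0.5, 0.25, 0.5]
```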
  • An object of the invention is to provide a split accumulator for a convolutional neural network accelerator, including a weight kneading technique for the convolutional neural network accelerator.
  • Weight kneading exploits the zero slack that commonly exists in modern DCNN models and, unlike data compression or pruning, reduces the number of weights without loss of accuracy.
  • the invention discloses a split accumulator for a convolutional neural network accelerator, comprising:
  • a weight kneading module for acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning by bit to obtain a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; and
  • a split accumulation module for obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight, dividing the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
  • Storage can be reduced and computation accelerated by kneading the weights and activations and by the way the kneading values are parsed and operated on.
  • The architecture of the Tetris accelerator comprises the structure of the SAC and the structure of the splitters, and can further accelerate convolution computation.
  • FIG. 2 shows the inference speedup of DaDN and of the Tetris accelerator proposed in the invention.
  • Through weight kneading, the Tetris method achieves a speedup of 1.3 times on fp16 and an average speedup of 1.5 times on an int8 model. Compared with PRA, it achieves a speedup of nearly 1.15 times.
  • FIG. 1 shows the ratios of zero values and zero bits in some modern deep convolutional neural network models.
  • FIG. 2 is a comparison diagram of technical effects.
  • FIG. 3 is a schematic diagram of kneading weights.
  • FIG. 4 is a schematic diagram of a split accumulator SAC.
  • FIG. 5 is an architecture diagram of a Tetris accelerator.
  • FIG. 6 is a structural diagram of a splitter.
  • FIG. 7 is a schematic diagram of results saved in registers of an arithmetic unit.
  • The invention restructures the inference computing mode of the DCNN model.
  • The invention replaces the typical multiply-accumulate (MAC) computing mode with a split accumulator (SAC).
  • The typical multiplication operation is eliminated and replaced with a series of adders that have a low operation cost.
  • The invention makes full use of the essential bits in the weights, and the split accumulator SAC is built from adders and shifters without multipliers.
  • In a traditional multiplier, each weight/activation pair undergoes one shift-and-add operation, where “weight/activation” means “weight and activation”.
  • The invention performs several accumulations over multiple weight/activation pairs but only one shift-and-add summation, thereby achieving a large speedup.
  • The invention proposes a Tetris accelerator to tap the maximum potential of the weight kneading technique and the split accumulator SAC.
  • The Tetris accelerator is formed of a series of split accumulator (SAC) units and uses the kneading weights and activations to realize high-throughput, low-power inference computation. Tests with advanced synthesis tools show that the Tetris accelerator achieves the best results compared with the prior art.
  • The activation of the first layer is the input, the activation of the second layer is the output of the first layer, and so on. If the input is an image, the activations of the first layer are the pixel values of the image.
  • The invention comprises:
  • step 1 acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning by bit to obtain a weight matrix, removing slack bits in the weight matrix, i.e., deleting the slack bits to be vacancies, to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, and removing null rows, which are rows where an entire row of the intermediate matrix is vacancies, in the intermediate matrix and placing zeros at vacancies of the intermediate matrix after removing the null rows to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight;
  • step 2 obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit (element) of the kneading weight;
  • step 3 sending the kneading weight to a split accumulator, which divides the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
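  • Steps 1 and 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: weights are unsigned integers, bit position 0 is the least significant bit, and the names `knead_weights` and `kneaded_dot` as well as the sample values are hypothetical. Null rows never appear here because the number of rows is set by the tallest column; vacancies are represented by a zero bit with a `None` position.

```python
def knead_weights(weights, bit_width=8):
    """Knead weights column by column.

    Returns (kneaded, positions): kneaded[r] is the r-th kneading weight, and
    positions[r][b] is the index of the original weight (and hence of the
    activation) that supplied bit b, or None for a zero-filled vacancy.
    """
    # For every bit column, list the weights holding an essential bit there,
    # in computation order.
    columns = [
        [i for i, w in enumerate(weights) if (w >> b) & 1]
        for b in range(bit_width)
    ]
    n_rows = max((len(col) for col in columns), default=0)
    kneaded, positions = [], []
    for r in range(n_rows):
        value, pos = 0, [None] * bit_width
        for b, col in enumerate(columns):
            if r < len(col):          # an essential bit moved up into row r
                value |= 1 << b
                pos[b] = col[r]
        kneaded.append(value)
        positions.append(pos)
    return kneaded, positions


def kneaded_dot(positions, activations):
    """Reference check: rebuild sum(w * a) from the positional information."""
    total = 0
    for pos in positions:
        for b, src in enumerate(pos):
            if src is not None:
                total += activations[src] << b
    return total


# Hypothetical 4-bit weights and activations, for illustration only.
weights = [0b1010, 0b0110, 0b0000, 0b1001]
activations = [3, 5, 7, 2]
kneaded, positions = knead_weights(weights, bit_width=4)
print([format(k, "04b") for k in kneaded])   # ['1111', '1010']
print(kneaded_dot(positions, activations))   # 78
```

  • In this example four original weights collapse into two kneading weights, and the sum reconstructed from the positional information equals the ordinary sum of products.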
  • the split accumulator comprises splitters for dividing the kneading weight by bit, and segment adders for processing summation of the weight segments and the corresponding activations.
  • FIG. 3 shows a method for kneading weights. Assuming that six groups of weights form a batch, six cycles are normally required to complete the multiply-and-add operations of six weight/activation pairs. Slack occurs in two orthogonal dimensions: (1) within a weight, i.e., in W1, W2, W3, W4 and W6, where the slack bits are arbitrarily distributed; (2) across the weight dimension, i.e., W5, which is an all-zero-bit (zero-valued) weight and therefore does not appear in FIG. 3(b).
  • The essential bits and the slack bits are rearranged through kneading, and the six computing cycles of the MAC are reduced to three cycles, as shown in FIG. 3(c).
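  • The cycle reduction can be estimated with a short Python sketch, assuming one weight/activation pair per MAC cycle and one kneading weight per SAC cycle. The six 4-bit weights below are hypothetical and are not the values drawn in FIG. 3; they are merely chosen so that six cycles become three.

```python
def mac_cycles(weights):
    """One cycle per weight/activation pair, including zero weights."""
    return len(weights)


def kneaded_cycles(weights, bit_width=8):
    """One cycle per kneading weight: the tallest column of essential bits."""
    return max(
        (sum((w >> b) & 1 for w in weights) for b in range(bit_width)),
        default=0,
    )


# Hypothetical batch of six 4-bit weights; the fifth is an all-zero weight.
batch = [0b1010, 0b0110, 0b0011, 0b1001, 0b0000, 0b0101]
print(mac_cycles(batch))                   # 6
print(kneaded_cycles(batch, bit_width=4))  # 3
```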
  • Through weight kneading, the invention obtains W′1, W′2 and W′3, each bit of which comes from the essential bits of W1 to W6. If more weights are allowed, the remaining slack bits are filled by further essential bits. This is referred to as the “weight kneading” operation in the invention, i.e., replacing the slack bits in a preceding original weight with the essential bits of subsequent original weights.
  • An advantage of weight kneading is that it automatically eliminates the influence of zero values without introducing additional operations.
  • The zero bits are likewise replaced by essential bits during kneading, which avoids the influence of slack in both dimensions.
  • Kneading the weights means that the current kneading weight must operate on multiple activations rather than on a single activation.
  • The activation a has four bits. Multiplying by 2^3 is a left shift by three bits, multiplying by 2^2 is a left shift by two bits, and multiplying by 2^1 is a left shift by one bit. Therefore, the traditional multiplier performs a shift-and-add operation each time one w*a is computed.
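  • The shift-and-add view of a single multiplication can be written out as a small Python sketch. It assumes an unsigned weight and uses a hypothetical function name; it illustrates the conventional per-pair behaviour, not the SAC.

```python
def shift_add_multiply(a, w, bit_width=4):
    """Compute w * a by adding a shifted copy of a for every 1 bit of w."""
    product = 0
    for i in range(bit_width):
        if (w >> i) & 1:
            product += a << i   # multiplying by 2**i is a left shift by i bits
    return product


print(shift_add_multiply(5, 0b1011))  # 55
```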
  • The invention reduces the number of shift-and-add operations. Its processing mode differs from both the traditional MAC and the standard bit-serial method and is referred to in the invention as “SAC”, where SAC represents “split and accumulation”.
  • The SAC operation splits the kneading weights, references the corresponding activations, and finally accumulates each activation into a specific segment register.
  • FIG. 4 shows the SAC process for each weight/activation pair.
  • The figure only illustrates the concept. In practice, the data first undergo weight kneading and are then sent to the SAC, which computes with the kneading weights; an array of multiple SACs forms the accelerator of the invention.
  • The split accumulator (SAC) shown in FIG. 4 differs from the MAC, which aims to compute each pairwise multiplication exactly.
  • The split accumulator SAC splits the weight and sums the activations in the corresponding segment adders according to the splitting result. After a certain number of weight/activation pairs have been processed, the final shift-and-add operation is executed to obtain the final sum. This “certain number” is determined experimentally and depends on the weight distribution of the convolutional neural network model; a good value can be set for a specific model, or a compromise value suitable for most models can be chosen.
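  • The split-then-accumulate idea can be sketched in Python as below, assuming unsigned weights and integer activations. The function names and sample values are hypothetical; the sketch operates on original weights to show the concept, whereas the accelerator feeds kneading weights to the SAC.

```python
def sac_accumulate(weights, activations, bit_width=8):
    """Add each activation into the segment register of every 1 bit of its weight."""
    segments = [0] * bit_width
    for w, a in zip(weights, activations):
        for b in range(bit_width):
            if (w >> b) & 1:
                segments[b] += a   # addition only, no multiplication
    return segments


def final_shift_add(segments):
    """Single shift-and-add pass performed after all pairs are accumulated."""
    return sum(s << b for b, s in enumerate(segments))


# Hypothetical 4-bit weights and activations, for illustration only.
weights = [0b1010, 0b0110, 0b0000, 0b1001]
activations = [3, 5, 7, 2]
segments = sac_accumulate(weights, activations, bit_width=4)
print(segments)                   # [2, 8, 5, 5]
print(final_shift_add(segments))  # 78
```

  • The shifts are applied once per segment rather than once per weight/activation pair, which is where the reduction in shift-and-add operations comes from.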
  • The SAC is a split accumulator.
  • Incoming data first undergo weight kneading, and the output of the weight kneading serves as the input of the SAC.
  • The splitters are part of the SAC. After entering the SAC, the weights first pass through the splitters, which parse (split) the weights and distribute them to the subsequent segment registers and segment adders. The final result is then obtained through an adder tree.
  • With the MAC, a1*w1 is computed first, then a2*w2, and so on, so six multiplication operations are required.
  • The SAC needs only three cycles. Although the SAC must read the positional information of the activations, this operation is performed in registers and takes far less time than one multiplication operation. Multiplication is the most time-consuming part of the convolution operation, and the invention has no multiplication operation. Moreover, it performs fewer shift operations than the MAC. The invention sacrifices a small amount of additional storage, but the sacrifice is worthwhile.
  • FIG. 5 shows an architecture of a Tetris accelerator.
  • Each SAC unit is formed of sixteen parallel splitters (see FIG. 6) and accepts a series of kneading weights and activations, the number of which is given by the kneading stride (KS) parameter.
  • The same number of segment adders and the subsequent adder tree are responsible for computing the final partial sum.
  • The name Tetris reflects that the kneading process in the invention resembles the stacking process of the game Tetris.
  • The paired accumulator cannot improve inference efficiency, because it does not distinguish invalid computation: even if the current bit of the weight is zero, it still computes the corresponding weight/activation pair. Therefore, in the Tetris accelerator proposed in the invention, the SAC uses the kneading weights and only distributes the activations to the segment adders for accumulation.
  • As shown in FIG. 5, each SAC unit accepts weight/activation pairs.
  • The accelerator is formed of a symmetric matrix of SAC units connected to a throttle buffer, and accepts the kneading weights from an on-chip eDRAM.
  • Each unit includes sixteen splitters, which form a splitter array.
  • According to the parameter KS, i.e., the number of weights and activations kneaded together, each splitter has sixteen output data paths through which an activation reaches its destination segment register, so a fully connected fabric is formed between the splitter array and the segment adders.
  • a microarchitecture of the splitter is shown in FIG. 6 .
  • The weight kneading technique requires that the splitter accept the group of activations belonging to a kneading weight, and that each essential bit be identifiable so that it indicates its associated activation within the range of the KS.
  • The Tetris method uses <Wi, p> in the splitter to express the activation associated with a specific essential bit.
  • The KS determines the range represented by the p bits, i.e., a four-bit p corresponds to a kneading stride of sixteen weights.
  • The splitter uses a comparator to determine whether a bit of the kneading weight is zero, because even after kneading some positions may still be invalid, e.g., in W′3 in FIG. 3(c), depending on the KS. If the bit is slack, a multiplexer after the comparator outputs zero to the subsequent structure. Otherwise, the splitter decodes p and outputs the corresponding activation.
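  • The comparator, multiplexer and decode behaviour of one splitter can be modelled in software as follows. This is a behavioural Python sketch, not a hardware description; the function name, the list-based representation of p and the sample values are assumptions made for illustration.

```python
def splitter(kneading_weight, positions, activation_window, bit_width=16):
    """Return one output per bit position of the kneading weight.

    positions[b] plays the role of p for bit b: the index, within the
    KS-wide activation window, of the activation tied to that essential bit.
    """
    outputs = []
    for b in range(bit_width):
        if (kneading_weight >> b) & 1 == 0:    # comparator: slack bit
            outputs.append(0)                  # multiplexer forwards zero
        else:
            outputs.append(activation_window[positions[b]])  # decode p
    return outputs


# Hypothetical 4-bit kneading weight with positional information.
print(splitter(0b1010, [None, 1, None, 3], [3, 5, 7, 2], bit_width=4))
# [0, 5, 0, 2]
```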
  • The splitter fetches the target activation from the throttle buffer, so the activations need not be stored multiple times.
  • The newly introduced positional information p occupies a small amount of space, but p is used only to decode the activation and consists of only a few bits, e.g., sixteen activations need four bits, so the splitter does not introduce significant on-chip resource or power cost in the accelerator.
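  • The size of p follows directly from the kneading stride. The Python sketch below assumes a fixed-length p stored for each essential bit only; the function names are hypothetical, and the actual encoding is not fixed by the description.

```python
def position_bits(kneading_stride):
    """Bits needed for p to address any of `kneading_stride` activations."""
    if kneading_stride < 1:
        raise ValueError("kneading stride must be at least 1")
    return (kneading_stride - 1).bit_length()


def position_overhead_bits(kneading_weight, kneading_stride, bit_width=16):
    """Extra storage for one kneading weight, assuming one p per essential bit."""
    mask = (1 << bit_width) - 1
    essential = bin(kneading_weight & mask).count("1")
    return essential * position_bits(kneading_stride)


print(position_bits(16))                   # 4
print(position_overhead_bits(0b1010, 16))  # 8
```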
  • Each segment adder accepts the activations from all sixteen splitters and adds them to the value in its local segment register. The intermediate segment sums are stored in the registers S0 to S15. Once all possible adding tasks are completed, a control signal notifies the multiplexer to pass the value of each segment to the subsequent adder tree. The final stage of the adder tree generates the final sum, which is passed to the output non-linear activation function.
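  • The segment adders and the adder tree can be modelled as below. This Python sketch makes several assumptions: the shift by the segment index is applied as the values enter the tree, the tree is a plain pairwise reduction, and all names and sample values are hypothetical.

```python
def segment_accumulate(segment_regs, splitter_outputs):
    """Each segment adder adds the outputs of all splitters to its register."""
    for outputs in splitter_outputs:
        for b, value in enumerate(outputs):
            segment_regs[b] += value
    return segment_regs


def adder_tree(values):
    """Pairwise reduction, one tree level per loop iteration."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)   # pad odd levels with a zero input
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0] if values else 0


def final_sum(segment_regs):
    """Shift each segment by its bit position, then reduce with the tree."""
    return adder_tree(s << b for b, s in enumerate(segment_regs))


# Hypothetical outputs of two splitters for a 4-segment unit.
regs = segment_accumulate([0] * 4, [[2, 3, 5, 3], [0, 5, 0, 2]])
print(regs)             # [2, 8, 5, 5]
print(final_sum(regs))  # 78
```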
  • The non-linear activation function is determined by the network model, for example ReLU or sigmoid.
  • The ends of the activation/weight pairs that can be added are identified by marks, which are sent to detectors in each SAC unit. When all marks have reached the end, the adder tree outputs the final sum.
  • The marks can reside at any position in the throttle buffer. If new activation/weight pairs are filled into the buffer, the marks are shifted backward, so the computation of the partial sum of each segment is not affected.
  • The weights are first kneaded and stored in the on-chip eDRAM. The kneading weights are then fetched from the eDRAM, parsed by the splitters of the SAC, and distributed to the subsequent segment registers and adders. The final result is then obtained through the adder tree. The final result is the output feature map, i.e., the input of the next layer.
  • Each bit of W′ may correspond to a different A (activation). If the bit is ‘1’, additional storage is required to record which A corresponds to this bit; if the bit is ‘0’, no such storage is necessary.
  • The invention does not limit how the additionally stored positional information (i.e., which A a bit corresponds to) is encoded; Huffman coding is one common choice.
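  • Since Huffman coding is named as one possible encoding, a textbook Huffman construction over the position indices is sketched below in Python. It is a generic illustration rather than the encoder of the invention; the function name and the sample positions are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical position indices attached to the essential bits.
stored_positions = [3, 0, 1, 0, 1, 3, 0, 0]
codes = huffman_code(stored_positions)
encoded_bits = sum(len(codes[p]) for p in stored_positions)
print(encoded_bits)  # 12
```

  • For these sample positions a fixed two-bit index would need 16 bits, so the variable-length code saves space only because position 0 occurs more often than the others.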
  • W′ and the additionally stored positional information are sent to the SAC, and the SAC sends the weights and activations to the corresponding arithmetic units.
  • The results in the registers of the arithmetic units are shown in FIG. 7; these results are then shifted and added to obtain the final result.
  • the invention further discloses a split accumulator for a convolutional neural network accelerator, comprising:
  • a weight kneading module for acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning by bit to obtain a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; and
  • a split accumulation module for obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight, dividing the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
  • the split accumulator for a convolutional neural network accelerator further comprises splitters for dividing the kneading weight by bit, and segment adders for processing summation of the weight segments and the corresponding activations.
  • the split accumulation module comprises saving the positional information of the activation corresponding to each bit of the kneading weight with Huffman coding.
  • the activations are pixel values of an image.
  • the invention relates to split accumulator for a convolutional neural network accelerator, comprising: converting original weights into a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight; sending the kneading weight to a split accumulator, which divides the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed embodiments relate to a split accumulator for a convolutional neural network accelerator, comprising: arranging original weights in a computation sequence and aligning them by bit to obtain a weight matrix; removing slack bits in the weight matrix; allowing essential bits in each column of the weight matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix; removing null rows in the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; obtaining positional information of the activation corresponding to each bit of the kneading weight; dividing the kneading weight by bit into multiple weight segments; summing the weight segments and the corresponding activations according to the positional information; and sending the processing result to an adder tree, which executes shift-and-add on the processing result to obtain an output feature map.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a national application of PCT/CN2019/087769, filed on May 21, 2019. The contents of PCT/CN2019/087769 are all hereby incorporated by reference.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The invention relates to the field of neural network computation, and particularly to a split accumulator for a convolutional neural network accelerator.
  • 2. Related Art
  • Deep convolutional neural networks have achieved significant progress in machine learning applications such as real-time image recognition, detection and natural language processing. To improve accuracy, an advanced deep convolutional neural network (DCNN) architecture has complex connections and a massive number of neurons and synapses to meet the requirements of high-accuracy, complex tasks. In a convolution operation, weights are multiplied by the corresponding activations, and the products are then summed. That is, each weight and its activation form a pair.
  • Given the architectural limitations of traditional general-purpose processors, many researchers have proposed specialized accelerators for the specific computing mode of modern DCNNs. A DCNN is formed of multiple layers, from dozens to hundreds or even thousands. Across the entire DCNN, nearly 98% of the computation comes from the convolution operation, so convolution is the major factor affecting power and performance. Improving the computing efficiency of convolution without damaging the robustness of the learned model is therefore an effective way to accelerate the DCNN, in particular on lightweight devices (e.g., smartphones and autonomous robots) with limited resources and low power consumption requirements.
  • To address this challenge, some existing methods execute the MAC bit-serially, exploiting the fact that a fixed-point multiplication can be decomposed into a series of single-bit multiplications and shift-and-add operations. However, an essential bit (a “1”) may be at any position of a fixed-point number, so such a solution must account for the worst-case position of the “1” bits: for a 16-bit fixed-point number (fp16), a 16-bit register is needed to save the positional information of the essential bits. Different weights or activations may produce different latencies during acceleration, so the number of cycles is unpredictable. The hardware design must cover the worst case, and using the worst-case cycle as the cycle of the accelerator lengthens the processing cycle, lowers the frequency of the accelerator and adds design complexity.
  • To accelerate the operation, a typical DCNN accelerator performs multiply-and-add by deploying a multiplier and an adder on each activation and weight lane. To balance speed and accuracy, the multiplication may use 32-bit floating-point operands, 16-bit fixed-point numbers or 8-bit integers. Compared with the fixed-point adder, the multiplier determines the delay of the convolution operation: an 8-bit fixed-point two-operand multiplier takes 3.4 times as long as the adder. Moreover, different DCNN models require different accuracies, and even different layers of the same model have different accuracy requirements, so a multiplier designed for a convolutional neural network accelerator must cover the worst case.
  • A main component of the typical DCNN accelerator is the multiply-adder, whose main problem is that it performs invalid operations. Invalid operations arise in two ways. Firstly, an operand may be a zero value or may consist mostly of zero bits. Compared with zero bits, zero values make up only a small part of the weights. This small share of zero values can easily be eliminated through an advanced microarchitecture design or a memory-level compression technique, so that they are never fed to the multipliers. FIG. 1 shows that, relative to the essential bits (the 1s), the average ratio of zero bits reaches 68.9%. Secondly, the intermediate results of the multiply-and-add operation are useless. For example, if y=ab+cd+de, only the value of y is needed, while the value of ab is unnecessary. This means that optimizing the zero bits and the multiply-and-add operation improves computing efficiency and reduces power consumption.
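  • As a non-limiting illustration of the distinction drawn above between zero values and zero bits, the following Python sketch computes both ratios for a list of unsigned fixed-point weights. The function name `zero_stats`, the 8-bit width and the sample weights are assumptions made for illustration only; they are not the data behind FIG. 1.

```python
def zero_stats(weights, bits=8):
    """Return (ratio of zero-valued weights, ratio of zero bits).

    `weights` are assumed to be unsigned integers that fit in `bits` bits.
    """
    n = len(weights)
    # A zero value is a weight whose every bit is slack.
    zero_values = sum(1 for w in weights if w == 0) / n
    # A zero bit is any '0' inside a weight, whether or not the weight is zero.
    zero_bits = sum(bits - bin(w).count("1") for w in weights) / (bits * n)
    return zero_values, zero_bits


# Hypothetical sample: one zero value out of four weights, 21 zero bits out of 32.
print(zero_stats([0, 1, 3, 255], bits=8))  # (0.25, 0.65625)
```

Even in this small sample the zero-bit ratio is much higher than the zero-value ratio, which is the kind of gap the passage above describes.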
  • Many approaches accelerate computation through quantization. For example, the weights are converted into binary values, or into more accurate ternary values, so that a multiplication can be converted into a pure shift or add operation. However, this necessarily sacrifices the accuracy of the result, and on large data sets in particular the accuracy of these solutions is seriously damaged. A high-accuracy accelerator is therefore needed.
  • A traditional CNN accelerator puts each weight/activation pair in a processing element (PE) and completes the multiply-and-add within one cycle. However, computing the zero bits cannot be avoided. If the time spent on invalid computation can be reduced, the throughput of the PE will improve. The invention refers to a “0” bit in the PE as a “slack bit”.
  • The distribution of the essential bits (the 1s) has two characteristics: (1) each position has a probability of about 50% to 60% of being essential, which also means that each position has a probability of about 40% to 50% of being slack; (2) some bit positions of the weights are almost entirely slack. For example, the third to fifth positions contain less than 1% of the essential bits. These positions consist almost entirely of slack bits, but the multiplier does not distinguish slack bits from essential bits when executing the multiply-and-add operation. To improve the efficiency of the convolution operation, the slack bits must be exploited.
  • If the slack bits in a preceding weight can be replaced by the essential bits of a subsequent weight, invalid computation is reduced and multiple weight/activation pairs are processed within one cycle. When the weights are squeezed together in this way, the total weight volume can be compressed to nearly half of the original. In other words, inference time can be reduced by about 50%. However, this is difficult to achieve, because it requires modifying the current MAC computing mode and reconstructing the hardware architecture to support the new computing mode.
  • SUMMARY OF THE INVENTION
  • In order to solve the above technical problems, an object of the invention is to provide a split accumulator for a convolutional neural network accelerator that incorporates a weight kneading technique. Weight kneading exploits the zero slack that commonly exists in modern DCNN models and, unlike data compression or pruning, reduces the number of weights without loss of accuracy.
  • Specifically, the invention discloses a split accumulator for a convolutional neural network accelerator, comprising:
  • a weight kneading module for acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning by bit to obtain a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; and
  • a split accumulation module for obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight, dividing the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
  • Technical advances of the invention include:
  • 1. Storage is reduced and operation is accelerated by kneading the weights and activations and by the way the kneaded values are parsed and operated on;
  • 2. The architecture of the Tetris accelerator comprises the SAC structure and the splitter structure, and may further accelerate convolution computation.
  • Execution times for the two approaches were obtained by Vivado HLS simulation, and FIG. 2 shows the inference speedup of DaDN and of the Tetris accelerator proposed in the invention. With kneading weights, the Tetris method achieves a speedup of 1.3 times on fp16 and of 1.5 times on average on int8 models. Compared with PRA, it achieves a speedup of nearly 1.15 times.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows the ratios of zero values and zero bits in some modern deep convolutional neural network models.
  • FIG. 2 is a comparison diagram of technical effects.
  • FIG. 3 is a schematic diagram of kneading weights.
  • FIG. 4 is a schematic diagram of a split accumulator SAC.
  • FIG. 5 is an architecture diagram of a Tetris accelerator.
  • FIG. 6 is a structural diagram of a splitter.
  • FIG. 7 is a schematic diagram of results saved in registers of an arithmetic unit.
  • PREFERABLE EMBODIMENTS OF THE INVENTION
  • The invention reconstructs the inference computing mode of the DCNN model, replacing the typical MAC computing mode with a split accumulator (SAC). The typical multiplication operation is eliminated and replaced by a series of low-cost adders. The invention makes full use of the essential bits in the weights, and the split accumulator SAC is formed of adders and shifters without multipliers. In a traditional multiplier, each weight/activation pair undergoes one shift-and-add summation, where “weight/activation” means “weight and activation”. The invention, by contrast, performs several accumulations over multiple weight/activation pairs but only one shift-and-add summation, thereby obtaining a large speedup.
  • Finally, the invention proposes a Tetris accelerator to tap the maximum potential of the kneading weight technique and the split accumulator SAC. The Tetris accelerator is formed of a series of split accumulator SAC units, and uses the kneading weights and activations to realize inference computation with high throughput and low power consumption. Tests with high-level synthesis tools show that the Tetris accelerator outperforms the prior art. The activation of the first layer is the network input, the activation of the second layer is the output of the first layer, and so on. If the input is an image, the activations of the first layer are the pixel values of the image.
  • Specifically, the invention comprises:
  • step 1, acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning them by bit to obtain a weight matrix, removing slack bits in the weight matrix, i.e., deleting the slack bits to leave vacancies, to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows of the intermediate matrix, i.e., rows consisting entirely of vacancies, and placing zeros at the remaining vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight;
  • step 2, obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit (element) of the kneading weight;
  • step 3, sending the kneading weight to a split accumulator, which divides the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
  • The split accumulator comprises splitters for dividing the kneading weight by bit, and segment adders for processing summation of the weight segments and the corresponding activations.
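  • Steps 1 and 2 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed hardware: the weights are taken to be unsigned integers, index 0 of each row is the least significant bit, and the function names `knead` and `row_value` are hypothetical. Each cell of a kneaded row holds either `None` (a zero placed at a vacancy) or the index of the original weight that supplied the essential bit, which is also the index of the activation that bit must be paired with, i.e., the positional information.

```python
def knead(weights, bits=8):
    """Knead weights; return rows of per-bit source indices (None = zero)."""
    # Weight matrix with slack bits removed: an essential bit is recorded as
    # the index of its weight, a slack bit becomes a vacancy (None).
    reduced = [
        [i if (w >> c) & 1 else None for c in range(bits)]
        for i, w in enumerate(weights)
    ]
    # Within each column, essential bits move up to fill the vacancies while
    # keeping the computation sequence (the original weight order).
    columns = [
        [row[c] for row in reduced if row[c] is not None] for c in range(bits)
    ]
    # Rows below the tallest column are null rows and are dropped; vacancies
    # left in the surviving rows are zeros (None).
    depth = max((len(col) for col in columns), default=0)
    return [
        [col[r] if r < len(col) else None for col in columns]
        for r in range(depth)
    ]


def row_value(row):
    """Bit pattern of one kneading weight as an integer."""
    return sum(1 << c for c, src in enumerate(row) if src is not None)


rows = knead([0b0101, 0b0000, 0b0110, 0b0011], bits=4)
print(rows)                          # [[0, 2, 0, None], [3, 3, 2, None]]
print([row_value(r) for r in rows])  # [7, 7]
```

In this hypothetical batch, four original weights collapse into two kneading weights, and the all-zero weight disappears without any special handling.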
  • To make the above features and effects of the invention clearer, a detailed explanation is given below with reference to examples and the accompanying drawings.
  • FIG. 3 shows a method for kneading weights. Assuming that a batch consists of six weights, six cycles are normally required to complete the multiply-and-add of six weight/activation pairs. Slack occurs in two orthogonal dimensions: (1) within the weights, i.e., in W1, W2, W3, W4 and W6, where the slack bits are arbitrarily distributed; (2) across the weights, i.e., W5 is an all-zero-bit (zero-value) weight, so it does not appear in FIG. 3(b). Kneading optimizes the essential bits and the slack bits, and the six computing cycles of the MAC are reduced to three cycles, as shown in FIG. 3(c). The invention obtains W′1, W′2 and W′3 through weight kneading, each bit of which comes from the essential bits of W1 to W6. If more weights are allowed, the remaining slack bits are filled by their essential bits. This is referred to as a “weight kneading” operation in the invention, i.e., replacing the slack bits in a preceding original weight with the essential bits of subsequent original weights. On the one hand, weight kneading automatically eliminates the influence of zero values without introducing additional operations. On the other hand, the zero bits are also replaced by essential bits, avoiding the influence of slack in both dimensions. However, kneading the weights means that each kneading weight must operate on multiple activations instead of a single activation.
  • To support such efficient computation, it is important to explore an accelerator architecture that differs from the traditional MAC architecture. We use the equivalent shift-and-add to obtain one partial sum, and the partial sum is not exactly a sum of products of a series of weight/activation pairs. Therefore, it is unnecessary to shift by b bits immediately after each kneading weight w′ is computed. The final shift by b bits can be performed after the kneading weights have been computed, where b follows from the principle of shift-and-add. For example, suppose the activation a has four bits and the weight w also has four bits $w_1$, $w_2$, $w_3$ and $w_4$, where $w_1$ is the highest bit; then $a \cdot w = a \cdot w_1 \cdot 2^3 + a \cdot w_2 \cdot 2^2 + a \cdot w_3 \cdot 2^1 + a \cdot w_4 \cdot 2^0$. Multiplying by $2^3$ is a left shift by three bits, multiplying by $2^2$ is a left shift by two bits, and multiplying by $2^1$ is a left shift by one bit. The traditional multiplier therefore performs a shift-and-add each time one w*a is computed. The invention reduces the number of shift-and-add operations. Its processing mode differs from both the traditional MAC and the standard bit-serial method and is referred to as “SAC” in the invention, where SAC stands for “split and accumulation”. The SAC operation first splits the kneading weights, then references the activations, and finally accumulates each activation into a specific segment register. FIG. 4 shows the SAC processing of each weight/activation pair. The figure only illustrates the concept of the SAC; in practical use, the data first undergo weight kneading and the kneading weights are then sent to the SAC for computation, and an array of multiple SACs forms the accelerator of the invention.
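  • The difference between shifting after every product and deferring the shift can be illustrated with the following Python sketch. It assumes unsigned 4-bit weights and non-negative integer activations; the function names and sample pairs are hypothetical. `per_pair_shift_add` shifts for every essential bit of every weight, whereas `split_accumulate` only adds activations into one segment register per bit position and shifts once per segment at the end.

```python
def per_pair_shift_add(pairs, bits=4):
    """Bit-serial multiply: shift and add for each essential bit of each weight."""
    total = 0
    for a, w in pairs:
        for k in range(bits):
            if (w >> k) & 1:
                total += a << k  # a * 2^k
    return total


def split_accumulate(pairs, bits=4):
    """SAC-style: accumulate activations per bit position, shift once at the end."""
    segments = [0] * bits  # one segment register per bit position
    for a, w in pairs:
        for k in range(bits):
            if (w >> k) & 1:
                segments[k] += a  # addition only, no shift yet
    # Single final shift-and-add over the segment registers.
    return sum(s << k for k, s in enumerate(segments))


pairs = [(11, 0b1101), (7, 0b0110)]  # (activation, weight)
print(per_pair_shift_add(pairs))      # 185
print(split_accumulate(pairs))        # 185
```

Both functions return 11*13 + 7*6 = 185; the second reaches it with one shift per segment regardless of how many pairs were accumulated.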
  • The split accumulator (SAC) shown in FIG. 4 differs from the MAC, which aims to compute each pairwise multiplication exactly. The SAC first splits the weight and sums the activation in the corresponding segment adders according to the split result. After a certain number of weight/activation pairs have been processed, the final shift-and-add operation is executed to obtain the final sum. The “certain number” is obtained from tests and depends on the weight distribution of the convolutional neural network model; a good value can be set for a specific model, or a compromise value suitable for most models can be selected. Specifically, incoming data first undergo weight kneading, and the output of the weight kneading is the input of the SAC. The splitters are a part of the SAC: after entering the SAC, the weights first pass through the splitters, which parse (split) the weights and distribute them to the subsequent segment registers and segment adders. The final result is then obtained through an adder tree.
  • The SAC instantiates the splitters according to the required bit length: if each weight is a 16-bit fixed-point number, sixteen segment registers (p=16 in the figure) and sixteen adders are needed. The bit width of the fixed-point number is chosen according to the requirements for accuracy and speed: the higher the accuracy, the lower the speed, so the choice depends on the requirements. Moreover, models differ in their sensitivity to accuracy; some models work well with eight bits, and some need sixteen. In general, the splitters are responsible for dispatching each segment to the corresponding adder. For example, if the second bit of the weight is an essential bit, the activation is passed to S1 in the figure. The same applies to the other bits. After one weight has been split, the subsequent weights are processed in the same way, so each segment register accumulates new activations. The subsequent adder tree executes one shift-and-add to obtain the final partial sum. Unlike the MAC, the SAC does not attempt to obtain intermediate partial sums, because the output feature maps of a real CNN model only need the “final” sum, i.e., the sum over all lanes of a convolution kernel and the corresponding input feature maps. The advantage over the MAC is even more obvious when kneading weights are used, as shown in FIGS. 3 and 7. When six weight/activation pairs are computed using the MAC, a1*w1 is computed first, then a2*w2, and so on, requiring six multiplication operations. The SAC needs only three cycles, and although the SAC must read the positional information of the activations, this is done in registers and takes far less time than one multiplication operation. The most time-consuming part of the convolution operation is the multiplication, and the invention has no multiplication operation. It also performs fewer shift operations than the MAC. Although the invention costs a small amount of additional storage, the cost is worthwhile.
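  • The cycle comparison above can be reproduced with a short Python sketch that kneads a batch and then accumulates it the SAC way. The six weights and activations are hypothetical 4-bit values chosen for illustration and are not the values of FIG. 3; the function names are likewise assumptions. One loop iteration over `r` stands for one kneading weight, i.e., one SAC cycle, while the MAC reference performs one multiplication per pair.

```python
def mac(activations, weights):
    """Reference result: one multiplication per weight/activation pair."""
    return sum(a * w for a, w in zip(activations, weights))


def sac_on_kneaded(activations, weights, bits=4):
    """Return (sum, cycles) using kneaded weights and segment registers."""
    # Per bit position, the indices of the weights holding an essential bit,
    # kept in computation order. Row r of the kneading matrix takes the r-th
    # entry of every queue.
    queues = [
        [i for i, w in enumerate(weights) if (w >> c) & 1] for c in range(bits)
    ]
    cycles = max((len(q) for q in queues), default=0)
    segments = [0] * bits
    for r in range(cycles):          # one kneading weight per cycle
        for c, q in enumerate(queues):
            if r < len(q):           # essential bit: add the activation it points to
                segments[c] += activations[q[r]]
    # One shift-and-add over the segment registers at the very end.
    total = sum(s << c for c, s in enumerate(segments))
    return total, cycles


weights = [0b0101, 0b0010, 0b1001, 0b0100, 0b0000, 0b0011]
activations = [1, 2, 3, 4, 5, 6]
print(mac(activations, weights))             # 70
print(sac_on_kneaded(activations, weights))  # (70, 3)
```

For this batch the kneaded form needs three accumulation cycles instead of six multiplications and yields the same sum.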
  • FIG. 5 shows the architecture of the Tetris accelerator. Each SAC unit is formed of sixteen parallel splitters (see FIG. 6) and accepts a series of kneading weights and activations, the number of which is expressed by the kneading stride (KS) parameter. The same number of segment adders and the subsequent adder tree are responsible for computing the final partial sum. The name Tetris reflects that the kneading process of the invention resembles the stacking process in the game Tetris.
  • In the design of a traditional accelerator, the pairwise accumulator cannot improve inference efficiency, because it does not identify invalid computation: even if the current bit of the weight is zero, it still computes the corresponding weight/activation pair. In the Tetris accelerator proposed in the invention, by contrast, the SAC uses the kneading weights and only dispatches activations to the segment adders for accumulation. FIG. 5 shows the architecture of Tetris. Each SAC unit accepts weight/activation pairs. Specifically, the accelerator is formed of a symmetrical matrix of SAC units connected to a throttle buffer, and accepts the kneading weights from an on-chip eDRAM. If 16-bit fixed-point weights are used, each unit includes sixteen splitters that form a splitter array. According to the parameter KS, i.e., the number of weights and activations kneaded together, the splitters have sixteen output data paths through which an activation reaches its destination segment register, so a fully connected fabric is formed between the splitter array and the segment adders.
  • The microarchitecture of the splitter is shown in FIG. 6. The weight kneading technique requires that the splitter accept the group of activations belonging to a kneading weight, and each essential bit must be identifiable so as to indicate its associated activation within the range of the KS. The Tetris method uses <Wi, p> in the splitter to express the activation associated with a specific essential bit. The KS is the range that can be represented by the bits of p, i.e., a four-bit p corresponds to a kneading stride of sixteen weights. The splitter assigns a comparator to determine whether a bit of the kneading weight is zero, because even after kneading some positions may still be invalid (e.g., in W′3 in FIG. 3(c)), depending on the KS. If the bit is slack, a multiplexer after the comparator outputs zero to the subsequent structure. Otherwise, the splitter decodes p and outputs the corresponding activation.
  • The splitter fetches the target activation from the throttle buffer only when necessary, so activations need not be stored several times. The newly introduced positional information p occupies a small amount of space, but p is used only to decode the activation and consists of only a few bits (sixteen activations need four bits), so the splitter does not introduce significant on-chip resource or power cost in the accelerator.
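  • The comparator-and-multiplexer behaviour of the splitter and the size of p can be modelled in software as below. This is a behavioural sketch only, under the assumption that one splitter handles one kneading weight together with one positional index per bit; the names `position_bits` and `split`, the list-based activation window and the sample values are hypothetical and do not describe the circuit of FIG. 6.

```python
def position_bits(ks):
    """Number of bits of p needed to address one of `ks` activations."""
    return max(1, (ks - 1).bit_length())


def split(kneaded_weight, positions, window, bits=16):
    """Return, per segment, the activation to add (0 for a slack bit).

    `positions[k]` is the index p into `window` for bit k of the kneading
    weight; it is ignored when bit k is zero.
    """
    outputs = []
    for k in range(bits):
        bit = (kneaded_weight >> k) & 1
        if bit == 0:
            outputs.append(0)                     # comparator sees zero: mux outputs 0
        else:
            outputs.append(window[positions[k]])  # decode p, fetch the activation
    return outputs


print(position_bits(16))  # 4
window = [10, 20, 30, 40]  # activations within the kneading stride
print(split(0b0101, [2, None, 0, None], window, bits=4))  # [30, 0, 10, 0]
```

Each output feeds the segment adder of the same bit position, so a zero output leaves that segment register unchanged.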
  • Each segment adder accepts the activations from all sixteen splitters and adds them to the value in its local segment register. The intermediate segment sums are stored in the registers S0 to S15, and once all pending additions are completed, a control signal notifies the multiplexer to pass the value of each segment to the subsequent adder tree. The final stage of the adder tree generates the final sum, which is passed to the output non-linear activation function. The non-linear activation function, such as ReLU or sigmoid, is determined by the network model. In the throttle buffer unit, marks identify the end of the activation/weight pairs to be added and are sent to detectors in each SAC unit. When all marks reach the end, the adder tree outputs the final sum. Since the KS is used as the parameter controlling the kneading weights, different lanes have different numbers of kneading weights, so in most cases the marks can reside at any position of the throttle buffer. If new activation/weight pairs are filled into the buffer, the marks are shifted backward, so the computation of the partial sum of each segment is not affected.
  • In brief, the weights are first kneaded and stored in the on-chip eDRAM; the kneading weights are then fetched from the eDRAM, parsed by the splitters of the SAC, and distributed to the subsequent segment registers and adders. The final result is then obtained through the adder tree. The final result is the output feature map, i.e., the input of the next layer.
  • Each bit of W′ may correspond to a different A (activation). If the bit is ‘1’, additional storage is required to record which A corresponds to this bit; if the bit is ‘0’, nothing needs to be stored. The invention does not limit how the additionally stored positional information (which A corresponds to the bit) is encoded; Huffman coding is one common choice. W′ and the additionally stored positional information are sent to the SAC, and the SAC sends these weights and activations to the corresponding arithmetic units.
  • For the kneaded weights of FIG. 3, the results in the registers of the arithmetic unit are shown in FIG. 7; the results in the registers are then shifted and added to obtain the final result.
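  • Since Huffman coding is named as one possible encoding of the positional information, the following Python sketch shows how Huffman code lengths could be derived for a stream of positional indices and compared with a fixed-width encoding. It is an illustration only: the function name, the sample indices and the use of the standard `heapq` and `collections.Counter` modules are assumptions, and the disclosure does not prescribe this or any other encoder.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a standard Huffman code."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): 1}  # a lone symbol still needs one bit
    # Heap entries: (subtree frequency, unique tiebreak, {symbol: depth}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Hypothetical positional indices stored for the '1' bits of kneading weights.
positions = [0, 0, 0, 0, 1, 1, 2, 3]
lengths = huffman_code_lengths(positions)
print(lengths)                              # {0: 1, 1: 2, 2: 3, 3: 3}
print(sum(lengths[p] for p in positions))   # 14 bits, versus 16 with a fixed 2-bit p
```

Whether a variable-length code is worthwhile depends on how unevenly the indices are distributed; a fixed-width p, as in the four-bit example for sixteen activations, is simpler to decode.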
  • A system embodiment corresponding to the above method embodiment is explained below, and it can be carried out in combination with the above embodiment. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiment.
  • The invention further discloses a split accumulator for a convolutional neural network accelerator, comprising:
  • a weight kneading module for acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning by bit to obtain a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; and
  • a split accumulation module for obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight, dividing the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
  • The split accumulator for a convolutional neural network accelerator further comprises splitters for dividing the kneading weight by bit, and segment adders for processing summation of the weight segments and the corresponding activations.
  • In the split accumulator for a convolutional neural network accelerator, the split accumulation module comprises saving the positional information of the activation corresponding to each bit of the kneading weight with Huffman coding.
  • In the split accumulator for a convolutional neural network accelerator, the activations are pixel values of an image.
  • INDUSTRIAL APPLICABILITY
  • The invention relates to a split accumulator for a convolutional neural network accelerator, comprising: converting original weights into a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight; sending the kneading weight to a split accumulator, which divides the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result. The invention can reduce storage and increase the operational speed of the convolutional neural network through the kneading weights and the activations.

Claims (6)

1. A split accumulator for a convolutional neural network accelerator, comprising:
a weight kneading module for acquiring multiple groups of activations to be operated and corresponding original weights, arranging the original weights in a computation sequence and aligning by bit to obtain a weight matrix, removing slack bits in the weight matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to obtain an intermediate matrix, removing null rows in the intermediate matrix, and placing zeros at vacancies of the intermediate matrix to obtain a kneading matrix, wherein each row of the kneading matrix serves as a kneading weight; and
a split accumulation module for obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the kneading weight, dividing the kneading weight by bit into multiple weight segments, processing summation of the weight segments and the corresponding activations according to the positional information, and sending a processing result to an adder tree to obtain an output feature map by means of executing shift-and-add on the processing result.
2. The split accumulator for a convolutional neural network accelerator according to claim 1, further comprising splitters for dividing the kneading weight by bit, and segment adders for processing summation of the weight segments and the corresponding activations.
3. The split accumulator for a convolutional neural network accelerator according to claim 1, wherein the split accumulation module comprises saving the positional information of the activation corresponding to each bit of the kneading weight with Huffman coding.
4. The split accumulator for a convolutional neural network accelerator according to claim 1, wherein the activations are pixel values of an image.
5. The split accumulator for a convolutional neural network accelerator according to claim 2, wherein the split accumulation module comprises saving the positional information of the activation corresponding to each bit of the kneading weight with Huffman coding.
6. The split accumulator for a convolutional neural network accelerator according to claim 2, wherein the activations are pixel values of an image.
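The split accumulation recited in claims 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the claimed circuit: weights are unsigned, each weight segment is a single bit, the function name `split_accumulate` is hypothetical, and the kneading weights and positional information are written out by hand for the original weights 5, 3 and 6 (binary 101, 011, 110).

```python
def split_accumulate(kneaded, positions, activations, bits=8):
    """Accumulate activations per bit segment, then shift-and-add (sketch)."""
    # One accumulator per bit position, standing in for a segment adder.
    segment_sums = [0] * bits
    for word, pos in zip(kneaded, positions):
        for b in range(bits):
            if (word >> b) & 1:
                # Positional information selects the activation paired
                # with this bit of the kneading weight.
                segment_sums[b] += activations[pos[b]]
    # Adder-tree stage: weight each segment sum by its bit significance.
    return sum(s << b for b, s in enumerate(segment_sums))


# Hand-kneaded form of the weights [5, 3, 6] with 3-bit precision:
# both kneading weights are 0b111, and positions[r][b] names the source weight.
kneaded = [0b111, 0b111]
positions = [[0, 1, 0], [1, 2, 2]]
activations = [2, 4, 1]

result = split_accumulate(kneaded, positions, activations, bits=3)
reference = 5 * 2 + 3 * 4 + 6 * 1   # ordinary multiply-accumulate
```

Here `result` equals `reference`, so in this unsigned example the per-bit sums followed by a single shift-and-add reproduce the ordinary multiply-accumulate result without any multiplications.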
US17/250,890 2018-09-20 2019-05-21 Split accumulator for convolutional neural network accelerator Pending US20210357735A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201811100309 2018-09-20
CN201811100309.2 2018-09-20
PCT/CN2019/087769 WO2020057161A1 (en) 2018-09-20 2019-05-21 Split accumulator for convolutional neural network accelerator

Publications (1)

Publication Number Publication Date
US20210357735A1 (en) 2021-11-18

Family

Family ID: 65843951

Family Applications (3)

Application Number Title Priority Date Filing Date
US17/250,890 Pending US20210357735A1 (en) 2018-09-20 2019-05-21 Split accumulator for convolutional neural network accelerator
US17/250,889 Pending US20210350204A1 (en) 2018-09-20 2019-05-21 Convolutional neural network accelerator
US17/250,892 Pending US20210350214A1 (en) 2018-09-20 2019-05-21 Convolutional neural network computing method and system based on weight kneading

Country Status (3)

Country Link
US (3) US20210357735A1 (en)
CN (3) CN109543140B (en)
WO (3) WO2020057160A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543140B (en) * 2018-09-20 2020-07-10 中国科学院计算技术研究所 Convolutional neural network accelerator
CN110059733A (en) * 2019-04-01 2019-07-26 苏州科达科技股份有限公司 Optimization and fast target detection method and device for convolutional neural networks
CN110245324B (en) * 2019-05-19 2023-01-17 南京惟心光电系统有限公司 Deconvolution operation accelerator based on photoelectric computing array and method thereof
CN110245756B (en) * 2019-06-14 2021-10-26 第四范式(北京)技术有限公司 Programmable device for processing data sets and method for processing data sets
CN110633153A (en) * 2019-09-24 2019-12-31 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
WO2021081854A1 (en) * 2019-10-30 2021-05-06 华为技术有限公司 Convolution operation circuit and convolution operation method
CN110807522B (en) * 2019-10-31 2022-05-06 合肥工业大学 General calculation circuit of neural network accelerator
CN113627600B (en) * 2020-05-07 2023-12-29 合肥君正科技有限公司 Processing method and system based on convolutional neural network
US11500811B2 (en) * 2020-06-12 2022-11-15 Alibaba Group Holding Limited Apparatuses and methods for map reduce
CN113919477A (en) * 2020-07-08 2022-01-11 嘉楠明芯(北京)科技有限公司 Acceleration method and device of convolutional neural network
CN112580787B (en) * 2020-12-25 2023-11-17 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN116888575A (en) * 2021-03-26 2023-10-13 上海科技大学 Simple approximation based shared single-input multiple-weight multiplier
CN114021710A (en) * 2021-10-27 2022-02-08 中国科学院计算技术研究所 Deep learning convolution acceleration method and processor by using bit-level sparsity
CN114168991B (en) * 2022-02-10 2022-05-20 北京鹰瞳科技发展股份有限公司 Method, circuit and related product for processing encrypted data

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679185B (en) * 2012-08-31 2017-06-16 富士通株式会社 Convolutional neural network classifier system, its training method, classification method, and use
US9513870B2 (en) * 2014-04-22 2016-12-06 Dialog Semiconductor (Uk) Limited Modulo9 and modulo7 operation on unsigned binary numbers
US10049322B2 (en) * 2015-05-21 2018-08-14 Google Llc Prefetching weights for use in a neural network processor
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
US11074492B2 (en) * 2015-10-07 2021-07-27 Altera Corporation Method and apparatus for performing different types of convolution operations with the same processing elements
US11170294B2 (en) * 2016-01-07 2021-11-09 Intel Corporation Hardware accelerated machine learning
US20170344876A1 (en) * 2016-05-31 2017-11-30 Samsung Electronics Co., Ltd. Efficient sparse parallel winograd-based convolution scheme
AU2016203619A1 (en) * 2016-05-31 2017-12-14 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
US10997496B2 (en) * 2016-08-11 2021-05-04 Nvidia Corporation Sparse convolutional neural network accelerator
US11544539B2 (en) * 2016-09-29 2023-01-03 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN106529670B (en) * 2016-10-27 2019-01-25 中国科学院计算技术研究所 Neural network processor based on weight compression, design method, and chip
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolutional neural network accelerator
US10871964B2 (en) * 2016-12-29 2020-12-22 Qualcomm Incorporated Architecture for sparse neural network acceleration
CN107086910B (en) * 2017-03-24 2018-08-10 中国科学院计算技术研究所 Weight encryption and decryption method and system for neural network processing
CN107392308B (en) * 2017-06-20 2020-04-03 中国科学院计算技术研究所 Convolutional neural network acceleration method and system based on programmable device
CN107341544B (en) * 2017-06-30 2020-04-10 清华大学 Reconfigurable accelerator based on divisible array and implementation method thereof
CN107844826B (en) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 Neural network processing unit and processing system comprising same
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
CN108182471B (en) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 Convolutional neural network reasoning accelerator and method
US20210004668A1 (en) * 2018-02-16 2021-01-07 The Governing Council Of The University Of Toronto Neural network accelerator
CN108510066B (en) * 2018-04-08 2020-05-12 湃方科技(天津)有限责任公司 Processor applied to convolutional neural network
CN109543140B (en) * 2018-09-20 2020-07-10 中国科学院计算技术研究所 Convolutional neural network accelerator

Also Published As

Publication number Publication date
WO2020057161A1 (en) 2020-03-26
CN109543816B (en) 2022-12-06
CN109543830A (en) 2019-03-29
CN109543140A (en) 2019-03-29
WO2020057160A1 (en) 2020-03-26
CN109543830B (en) 2023-02-03
US20210350214A1 (en) 2021-11-11
WO2020057162A1 (en) 2020-03-26
CN109543816A (en) 2019-03-29
US20210350204A1 (en) 2021-11-11
CN109543140B (en) 2020-07-10

Legal Events

Date Code Title Description
AS Assignment

Owner name: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LI, XIAOWEI;WEI, XIN;LU, HANG;REEL/FRAME:055657/0718

Effective date: 20210301

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION