WO2021232422A1 - Neural network computing device and control method thereof


Info

Publication number: WO2021232422A1
Authority: WIPO (PCT)
Application number: PCT/CN2020/091883
Other languages: English (en), French (fr)
Inventors: 韩峰, 杨康
Original Assignee: 深圳市大疆创新科技有限公司
Application filed by 深圳市大疆创新科技有限公司
Priority: CN202080004753.3A (CN112639839A); PCT/CN2020/091883
Publication: WO2021232422A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands

Description

  • This application relates to the field of neural networks, and more specifically, to a neural network computing device and its control method.
  • In a neural network computing framework, floating-point numbers are commonly used as the data format required for computing unit operations.
  • The weight coefficients and the per-layer output feature data obtained after training in the neural network computing framework are therefore all floating-point numbers. Since a fixed-point arithmetic device occupies a smaller area and consumes less power than a floating-point arithmetic device, neural network acceleration devices generally use fixed-point numbers as the data format required by their computing units. The weight coefficients and per-layer output feature data obtained from training therefore need to be converted to fixed-point when deployed on a neural network acceleration device.
  • Fixed-point conversion refers to the process of converting data from floating-point numbers to fixed-point numbers.
  • Some deep convolutional neural networks only need fixed-point numbers with a smaller bit width to meet their arithmetic accuracy requirements, while other deep convolutional neural networks need fixed-point numbers with a larger bit width to meet their arithmetic accuracy requirements.
  • However, a current neural network computing device supports only one fixed-point bit width, which makes it impossible to meet the fixed-point computing accuracy requirements in applications.
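  • For illustration only, below is a minimal sketch of one common fixed-point conversion scheme (symmetric linear quantization); the application does not prescribe a particular conversion method, so the choice of scale here is an assumption. It shows why the bit width matters: a 16-bit conversion of the same values has a much finer resolution than an 8-bit one.

```python
def to_fixed_point(values, bit_width):
    """Convert a list of floats to signed fixed-point integers plus a scale.

    A minimal sketch of symmetric linear quantization; the actual conversion
    scheme used with the claimed device is not specified in this application.
    """
    qmax = 2 ** (bit_width - 1) - 1                 # e.g. 127 for 8 bits
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

weights = [0.53, -1.20, 0.07, 0.98]                 # floating-point training output
q8, s8 = to_fixed_point(weights, 8)                 # smaller bit width, coarser steps
q16, s16 = to_fixed_point(weights, 16)              # larger bit width, finer steps
print(q8, s8)    # e.g. [56, -127, 7, 104] with scale ~0.00945
print(q16, s16)
```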
  • This application provides a neural network computing device and a control method thereof.
  • The computing device can support a variety of fixed-point bit widths, so as to meet the fixed-point arithmetic accuracy requirements in applications.
  • In a first aspect, an embodiment of the present application provides a neural network computing device. The computing device includes a systolic array; the processing unit of the systolic array is a first calculation unit, and the first calculation unit supports multiplication operands with a fixed-point bit width of n bits, where n is 2 to the power of m and m is a positive integer.
  • The first calculation unit can perform a shift-then-accumulate operation, so that a plurality of the first calculation units in 2 rows and c columns in the systolic array as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits, where c is 1 or 2.
  • In a second aspect, an embodiment of the present application provides a neural network accelerator, including: a processing module, where the processing module is the neural network computing device provided in the first aspect; an input module, used to read input feature data and weights into the processing module; and an output module, used to store the output feature data obtained by the processing module in an external memory.
  • In a third aspect, an embodiment of the present application provides a method for controlling a neural network computing device. The computing device includes a systolic array, the processing unit of the systolic array is a first calculation unit, and the first calculation unit supports multiplication operands with a fixed-point bit width of n bits, where n is 2 to the power of m and m is a positive integer.
  • The first calculation unit may perform a shift-then-accumulate operation, so that a plurality of the first calculation units in 2 rows and c columns in the systolic array as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits, where c is 1 or 2.
  • The control method includes: when the computing device needs to process input feature data with a fixed-point bit width of n bits, controlling the first calculation units not to perform the shift-then-accumulate operation, so that the systolic array processes the input feature data at a fixed-point bit width of n bits; and when the computing device needs to process input feature data with a fixed-point bit width of 2n bits, controlling one or more of the first calculation units in the 2 rows and c columns used to form the second calculation unit to perform the shift-then-accumulate operation, so that the systolic array processes the input feature data at a fixed-point bit width of 2n bits.
  • an embodiment of the present application provides a device for executing the method in the third aspect.
  • an embodiment of the present application provides a device. The device includes a memory and a processor; the memory is used to store instructions, the processor is used to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to perform the method of the third aspect.
  • an embodiment of the present application provides a chip that includes a processing module and a communication interface; the processing module is used to control the communication interface to communicate with the outside, and the processing module is also used to implement the method of the third aspect.
  • the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a computer, the computer implements the method of the third aspect.
  • the present application provides a computer program product containing instructions that, when executed by a computer, cause the computer to implement the method of the third aspect.
  • the computer may be the above-mentioned device.
  • an embodiment of the present application provides a neural network processing chip on which the neural network computing device provided by the first aspect or the neural network accelerator provided by the second aspect is integrated.
  • In the technical solution provided by this application, the calculation units in the systolic array can perform a shift-then-accumulate operation, which enables the computing device to support multiple fixed-point bit widths and thus satisfy multiple fixed-point precision requirements in applications.
  • Figure 1 is a schematic diagram of the framework of a deep convolutional neural network.
  • Figure 2 is a schematic diagram of the convolution operation.
  • Figure 3 is a schematic diagram of the architecture of the neural network acceleration device.
  • FIGS. 4 to 7 are schematic diagrams of the time sequence of implementing a convolution operation or an average pooling operation using the neural network processing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a neural network computing device provided by an embodiment of the application.
  • FIG. 9 is another schematic block diagram of a neural network computing device provided by an embodiment of the application.
  • FIG. 10 is a schematic diagram of a control method of a neural network computing device provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of first calculation units in 2 rows and 2 columns in the systolic array of the computing device provided by an embodiment of the present application equivalently forming a second calculation unit supporting a fixed-point bit width of 2n bits.
  • FIG. 12 is a schematic structural diagram of a first calculation unit in a systolic array in a computing device provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of the structure of the ACC in the ACC array in the computing device provided by the embodiment of the application.
  • FIG. 14 is a schematic flow chart of performing a convolution operation with a fixed-point bit width of n bits using the neural network computing device provided by an embodiment of the present application.
  • FIG. 15 is a schematic flow chart of performing a convolution operation with a fixed-point bit width of 2n bits using the neural network computing device provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of the format in which feature data with a fixed-point bit width of 2n bits is stored in the SRAM.
  • FIG. 17 is a schematic diagram of the format in which feature data with a fixed-point bit width of n bits is stored in the SRAM.
  • FIG. 18 is a schematic block diagram of a neural network accelerator provided by an embodiment of the application.
  • FIG. 19 is a schematic block diagram of a neural network processing device provided by an embodiment of the application.
  • Figure 1 is a schematic diagram of the framework of a deep convolutional neural network.
  • The input value input from the input layer of the deep convolutional neural network passes through the hidden layer, which may perform operations such as convolution, transposed convolution (or deconvolution), batch normalization (BN), and scaling (Scale), to obtain the output value output by the output layer.
  • the operations that may be involved in the hidden layer of the neural network in the embodiment of the present application are not limited to the foregoing operations.
  • the hidden layer of the deep convolutional neural network may include cascaded multiple layers.
  • the input of each layer is the output of the previous layer, which is a feature map.
  • Each layer performs at least one of the aforementioned operations on one or more sets of input feature maps to obtain the output of the layer.
  • the output of each layer is also a feature map.
  • each layer is named after the function it implements.
  • the layer that implements the convolution operation is called the convolution layer
  • the layer that implements the pooling operation is called the pooling layer.
  • the hidden layers of a deep convolutional neural network can also include transposed convolutional layers, BN layers, Scale layers, pooling layers, fully connected layers, concatenation layers, element-wise addition layers, and activation layers, which are not enumerated one by one here. Under normal circumstances, a convolutional layer is followed by an activation layer; after the BN layer was proposed, more and more neural networks place a BN layer after the convolutional layer, followed by the activation layer.
  • the convolution operation process of the convolution layer is to perform a vector inner product operation on a set of weight values and a set of input feature data, and output a set of output feature data.
  • a set of weight values can be called filters or convolution kernels.
  • a set of input feature data is part of the feature values in the input feature map.
  • a set of output feature data is part of the feature values in the output feature map.
  • Each output feature data of the convolutional layer is obtained by inner product operation of part of the feature value in the input feature map and the weight value in the convolution kernel.
  • the convolution kernel, input feature map and output feature map can all be represented as a multi-dimensional matrix.
  • the convolution kernel can be represented as a three-dimensional matrix R×R×N; the width and height of the convolution kernel are both R, and the depth is N;
  • the input feature map can be represented as a three-dimensional matrix H×H×M; the width and height of the input feature map are both H, and the depth is M (not shown in the figure);
  • the output feature map can be represented as a three-dimensional matrix E×E×L; the width and height of the output feature map are both E, and the depth is L.
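  • As a concrete illustration of these dimensions, the sketch below computes the output feature map size for a convolution; the stride of 1, the absence of padding, and the specific numbers are assumptions (the text above only gives the matrix shapes), and the kernel depth N is taken to equal the input depth M as in a standard convolution.

```python
# Shapes follow the text: kernel R x R x N, input H x H x M, output E x E x L.
H, M = 8, 3       # input feature map width/height and depth (assumed values)
R = 3             # kernel width/height
N = M             # kernel depth matches input depth in a standard convolution
L = 16            # number of kernels, i.e. output feature map depth (assumed)

E = H - R + 1     # output width/height for stride 1, no padding (assumption)
print(f"output feature map: {E} x {E} x {L}")        # 6 x 6 x 16
print(f"MACs per output value: {R * R * M}")         # inner product length: 27
```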
  • Each layer (including the input layer and the output layer) of the deep convolutional neural network can have one input and/or one output, and can also have multiple inputs and/or multiple outputs.
  • The width and height of the feature maps tend to decrease layer by layer (as shown in Figure 1, the width and height of the input, feature map #1, feature map #2, feature map #3, and output decrease layer by layer).
  • After the width and height of the feature map have been reduced to a certain extent, they may be increased layer by layer through a transposed convolution operation or an upsampling operation.
  • the layers that require more weight parameters for calculation are: convolutional layer, fully connected layer, transposed convolutional layer, and BN layer.
  • Neural network accelerator means a hardware circuit dedicated to processing neural network operations.
  • an acceleration device dedicated to accelerating the operation of the convolutional layer may be referred to as a deep convolutional neural network acceleration device.
  • FIG. 3 is a schematic diagram of the architecture of the neural network acceleration device.
  • The neural network acceleration device 300 includes an input feature data input module (IFM_Loader) 310, a weight input module (or filter input module, Filt_Loader) 320, a calculation module (or multiply-accumulate processing module, MAU) 330, and an output module (OFM_Packer) 340.
  • IFM_Loader: input feature data input module
  • MAU: multiply-accumulate processing module
  • OFM_Packer: output module
  • the input characteristic data input module 310 is configured to read the input characteristic data from an external memory (a static random-access memory (SRAM) is taken as an example in FIG. 3), and send it to the processing module 330.
  • SRAM static random-access memory
  • the weight input module 320 is used to read the weight value from the SRAM and send it to the processing module 330.
  • the calculation module 330 is used for multiplying and accumulating the input feature data and the weight value to obtain and output the output feature data.
  • the output module 340 is used to write the output characteristic data output by the processing module 330 into the SRAM.
  • the calculation module 330 includes a systolic array 331 and an output processing unit 332.
  • the output processing unit 332 includes a memory for storing intermediate results of neural network operations.
  • The input feature map data sent by the input feature data input module 310 is sent to the systolic array 331 and multiplied and accumulated with the previously loaded weight values.
  • If the memory buffers an intermediate result, the output processing unit 332 accumulates the output result of the systolic array 331 with the intermediate result in the memory again. If the accumulated result is still an intermediate result, the output processing unit 332 continues to store it in the memory; otherwise, it outputs it to the output module 340 for subsequent processing.
  • Fixed-point conversion refers to the process of converting data from floating-point numbers to fixed-point numbers.
  • For the concepts of floating-point numbers, fixed-point numbers, and fixed-point conversion, reference can be made to the prior art, which will not be described in detail in this article.
  • However, a current neural network computing device supports only a single fixed-point bit width, for example, only an 8-bit fixed-point bit width or only a 16-bit fixed-point bit width, which makes it impossible to meet the fixed-point calculation accuracy requirements in applications.
  • In view of this, this application proposes a neural network computing device that can support multiple fixed-point bit widths.
  • the systolic array 331 shown in FIG. 3 includes calculation units with 3 rows and 3 columns as shown in FIG. 4: C00, C01, C02, C10, C11, C12, C20, C21, and C22.
  • the output processing unit 332 is connected to the calculation units C20, C21, and C22, and is used to obtain output characteristic data according to the calculation results output by them.
  • the flow of performing the convolution operation using the systolic array 331 is as follows.
  • the input characteristic data a11 enters the calculation unit C00, where the input characteristic data a11 is loaded from the left side of the calculation unit C00 and flows from left to right.
  • the calculation result of the calculation unit C00 is a11*W11.
  • the calculation result a11*W11 of the calculation unit C00 flows from top to bottom.
  • the input characteristic data a11 flows right into the calculation unit C01, and the calculation result a11*W11 flows downward into the calculation unit C10; at the same time, the input characteristic data a12 is loaded into the calculation unit C00, and the input characteristic data a21 is loaded into the calculation unit C10.
  • the calculation result of the calculation unit C00 is a12*W11
  • the calculation result of the calculation unit C01 is a11*W12
  • the calculation result of the calculation unit C10 is a11*W11+a21*W21.
  • the calculation results of each calculation unit flow from top to bottom.
  • the input feature data a11 flows right into the calculation unit C02, a12 flows right into the calculation unit C01, and a21 flows right into the calculation unit C11; the calculation result a12*W11 flows downward into the calculation unit C10, the calculation result a11*W12 flows downward into the calculation unit C11, and the calculation result a11*W11+a21*W21 flows downward into the calculation unit C20.
  • a13 is loaded into the calculation unit C00
  • a22 is loaded into the calculation unit C10
  • a31 is loaded into the calculation unit C20.
  • the calculation result of the calculation unit C00 is a13*W11
  • the calculation result of the calculation unit C01 is a12*W12
  • the calculation result of the calculation unit C02 is a11*W13
  • the calculation result of the calculation unit C10 is a12*W11+a22*W21
  • the calculation result of the calculation unit C11 is a11*W12+a21*W22
  • the calculation result of the calculation unit C20 is a11*W11+a21*W21+a31*W31.
  • At the end of the fifth cycle, the calculation unit C21 outputs the calculation result a12*W12+a22*W22+a32*W32, and at the end of the seventh cycle, the calculation unit C22 outputs the calculation result a13*W13+a23*W23+a33*W33.
  • The accumulation of the calculation result a11*W11+a21*W21+a31*W31 of the calculation unit C20 at the end of the third cycle, the calculation result a12*W12+a22*W22+a32*W32 of the calculation unit C21 at the end of the fifth cycle, and the calculation result a13*W13+a23*W23+a33*W33 of the calculation unit C22 at the end of the seventh cycle is the result of the convolution operation of the input feature data and the weights.
  • The output processing unit 332 is used to receive the calculation results output by the calculation units C20, C21, and C22 (it should be understood that these are intermediate results of the convolution operation), and to accumulate the calculation result of the calculation unit C20 at the end of the third cycle, the calculation result of the calculation unit C21 at the end of the fifth cycle, and the calculation result of the calculation unit C22 at the end of the seventh cycle, to obtain the result of the convolution operation of the input feature data and the weights.
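  • The following minimal cycle-level simulation reproduces the walkthrough above for a 3x3 weight-stationary array: input feature data flows left to right with each row delayed by one cycle, products plus incoming partial sums flow top to bottom, and the bottom-row outputs at the ends of the third, fifth, and seventh cycles are accumulated into the convolution result. The numeric values are stand-ins, and the input scheduling expression is an assumption consistent with the cycles described.

```python
N = 3
A = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]   # stand-ins for a11..a33
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]            # stand-ins for W11..W33

prev = [[0] * N for _ in range(N)]   # each cell's output at the previous cycle
total = 0
for t in range(1, 3 * N - 1):        # cycles 1..7 drain the whole array
    cur = [[0] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            j = t - 1 - r - c        # input column reaching cell (r, c) now
            prod = A[r][j] * W[r][c] if 0 <= j < N else 0
            above = prev[r - 1][c] if r > 0 else 0
            cur[r][c] = prod + above          # multiply, add psum from above
    if t in (3, 5, 7):               # bottoms of columns 0, 1, 2 finish here
        total += cur[N - 1][(t - 3) // 2]     # role of the output processing unit
    prev = cur

assert total == sum(A[i][j] * W[i][j] for i in range(N) for j in range(N))
print(total)                         # a11*W11 + a12*W12 + ... + a33*W33
```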
  • FIG. 8 is a schematic block diagram of a neural network computing device 800 provided by an embodiment of the application.
  • the neural network computing device 800 includes a systolic array 810.
  • the processing unit of the systolic array 810 is the first calculation unit 811.
  • the first calculation unit 811 will also be denoted as MC.
  • the first calculation unit 811 supports multiplication operands with a fixed-point bit width of n bits, where n is 2 to the power of m and m is a positive integer.
  • For example, the first calculation unit 811 can support operations in which the fixed-point bit width of the multiplication operands is 2 bits, 4 bits, 8 bits, 16 bits, or another number of bits that is a power of 2.
  • That the first calculation unit 811 supports multiplication operands with a fixed-point bit width of n bits means that the first calculation unit 811 supports a multiplication operation x1*y1 in which the fixed-point bit widths of x1 and y1 are both n bits.
  • the first calculation unit 811 can perform a shift-then-accumulate operation, so that the multiple first calculation units 811 in 2 rows and c columns in the systolic array 810 as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits.
  • c is 1 or 2.
  • the second calculation unit 812 will also be denoted as MU.
  • That the first calculation unit 811 can perform a shift-then-accumulate operation means that the first calculation unit 811 can shift its own operation result and then accumulate it with the operation results of other calculation units, where the other calculation units may include the calculation unit adjacent in the same row, the calculation unit adjacent in the same column, or the calculation unit located on the diagonal of this calculation unit.
  • For example, the first calculation unit 811 may first shift its own calculation result to the left by n bits, and then accumulate it with the calculation result of the adjacent preceding first calculation unit.
  • That the multiple first calculation units 811 in 2 rows and c columns in the systolic array 810 as a whole form a second calculation unit 812 supporting multiplication operands with a fixed-point bit width of 2n bits means that the first calculation units 811 in 2 rows and c columns in the systolic array 810 as a whole can be equivalent to a second calculation unit 812, where the second calculation unit 812 supports multiplication operands with a fixed-point bit width of 2n bits.
  • That the second calculation unit 812 supports multiplication operands with a fixed-point bit width of 2n bits means that the second calculation unit 812 supports a multiplication operation x2*y2 in which the larger of the fixed-point bit widths of x2 and y2 is 2n bits.
  • For example, the fixed-point bit widths of x2 and y2 are both 2n bits, or one of x2 and y2 has a fixed-point bit width of 2n bits and the other has a fixed-point bit width of n bits.
  • The systolic array 810 itself supports operations with a fixed-point bit width of n bits. If the multiple first calculation units 811 in 2 rows and c columns in the systolic array 810 as a whole form a second calculation unit 812 supporting multiplication operands with a fixed-point bit width of 2n bits, the systolic array 810 can also support operations with a fixed-point bit width of 2n bits.
  • In other words, the systolic array 810 can support both operations with a fixed-point bit width of n bits and operations with a fixed-point bit width of 2n bits.
  • In the technical solution of the embodiments of the present application, the calculation units in the systolic array can perform a shift-then-accumulate operation, which enables the computing device to support a variety of fixed-point bit widths and thus meet a variety of fixed-point accuracy requirements in applications.
  • the second calculation unit 812 is only introduced for ease of understanding and description, and does not mean that the second calculation unit 812 is actually included in the systolic array 810.
  • the calculation unit formed by the multiple first calculation units 811 in 2 rows and c columns in the systolic array 810 as a whole is denoted as the second calculation unit 812.
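  • The algebraic identity behind this equivalence can be sketched as follows: splitting each 2n-bit operand into high and low n-bit halves reduces one 2n bits*2n bits multiplication to four n bits*n bits multiplications combined by shifting and then accumulating, which is exactly what lets first calculation units in 2 rows and c columns act as one second calculation unit. Unsigned operands are assumed in this sketch.

```python
n = 8
mask = (1 << n) - 1

def mul_2n_from_n_bit_units(x2, y2):
    """One 2n-bit multiply built from four n-bit multiplies plus shifts/adds."""
    x_lo, x_hi = x2 & mask, x2 >> n          # low and high n bits of x2
    y_lo, y_hi = y2 & mask, y2 >> n
    return ((x_hi * y_hi) << (2 * n)) \
         + ((x_hi * y_lo + x_lo * y_hi) << n) \
         + (x_lo * y_lo)

x2, y2 = 0xBEEF, 0x1234                      # two 2n-bit (16-bit) operands
assert mul_2n_from_n_bit_units(x2, y2) == x2 * y2
```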
  • FIG. 9 is another schematic block diagram of a neural network computing device 800 provided by an embodiment of the application.
  • the first calculation unit 811 is denoted as MC
  • the second calculation unit 812 is denoted as MU.
  • c is 2; that is, the first calculation units in 2 rows and 2 columns (i.e., 4 first calculation units) in the systolic array 810 as a whole can be equivalent to one second calculation unit 812.
  • When the MCs do not perform the shift-then-accumulate operation, the systolic array 810 can perform operations on multiplication operands with a fixed-point bit width of n bits.
  • When the MCs in the same MU perform the shift-then-accumulate operation, the MU can support multiplication operands with a fixed-point bit width of 2n bits, and the systolic array 810 can perform operations on multiplication operands with a fixed-point bit width of 2n bits.
  • In this way, the neural network computing device can support multiple fixed-point bit widths without adding additional hardware.
  • the neural network computing device 800 provided in the embodiment of the present application can be applied to a convolutional layer, and can also be applied to a pooling layer. That is, the computing device 800 can be used to process convolution operations, and can also be used to process pooling operations.
  • the computing device 800 can switch between a variety of different fixed-point digital bit widths under the control of the control unit.
  • the computing device 800 further includes a control unit 820.
  • the control unit 820 is configured to send control signaling to the systolic array 810 to control the operation mode of the systolic array 810. It can be understood that the control unit 820 may send control signaling to the first calculation unit 811 to control the operation mode of the first calculation unit 811.
  • control unit 820 is used to perform operations S1010 and S1020 as shown in FIG. 10.
  • S1010: when the computing device 800 needs to process input feature data with a fixed-point bit width of n bits, control the first calculation units 811 not to perform the shift-then-accumulate operation, so that the systolic array 810 processes the input feature data at a fixed-point bit width of n bits.
  • S1020: when the computing device 800 needs to process input feature data with a fixed-point bit width of 2n bits, control one or more of the first calculation units 811 in the 2 rows and c columns used to form the second calculation unit 812 to perform the shift-then-accumulate operation, so that the second calculation unit 812 supports multiplication operands with a fixed-point bit width of 2n bits, and the systolic array 810 processes the input feature data at a fixed-point bit width of 2n bits.
  • Optionally, the control unit 820 is used to: when the computing device 800 needs to process input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of n bits, control the first calculation unit 811 in the second row of the 2 rows and 1 column of first calculation units 811 used to form the second calculation unit 812 to perform the shift-then-accumulate operation, so that the second calculation unit 812 supports the calculation of input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of n bits, and the computing device 800 can thus perform calculations on input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of n bits.
  • Optionally, the control unit 820 is used to: when the computing device 800 needs to process input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of 2n bits, control part of the first calculation units 811 in the 2 rows and 2 columns of first calculation units 811 used to form the second calculation unit 812 to perform the shift-then-accumulate operation, so that the second calculation unit 812 supports the calculation of input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of 2n bits, and the computing device 800 can thus perform calculations on input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of 2n bits.
  • Specifically, the control unit 820 is used to control part of the first calculation units 811 in the 2 rows and 2 columns included in the second calculation unit 812 to perform the shift-then-accumulate operation, so that the two first calculation units 811 in the last row of the two rows of first calculation units 811 included in the second calculation unit 812 output the low 2n bits and the high 2n bits of the 4n-bit operation result of the second calculation unit 812, respectively. See the (3) 2n bits*2n bits operation example described below.
  • Optionally, the control unit 820 is further configured to send input feature data with a fixed-point bit width of n bits to the first calculation units 811.
  • For example, input feature data with a fixed-point bit width of n bits is sent to the systolic array 810 in the manner described below in the (1) n bits*n bits operation.
  • Optionally, the control unit 820 is also used to send the low n bits and the high n bits of input feature data with a fixed-point bit width of 2n bits to the 2 rows of first calculation units 811 included in the second calculation unit 812, respectively.
  • For example, input feature data with a fixed-point bit width of 2n bits is sent to the systolic array 810 in the manner described below in the (2) 2n bits*n bits operation and the (3) 2n bits*2n bits operation.
  • Optionally, c is 2.
  • The control unit 820 is further configured to send the low n bits and the high n bits of a weight with a fixed-point bit width of 2n bits to the two columns of first calculation units 811 included in the second calculation unit 812, respectively.
  • Optionally, the control unit 820 is used to send the low n bits and the high n bits of input feature data with a fixed-point bit width of 2n bits to the 2 rows of first calculation units 811 included in the second calculation unit 812, respectively, and to send the low n bits and the high n bits of a weight with a fixed-point bit width of 2n bits to the 2 columns of first calculation units 811 included in the second calculation unit 812, respectively.
  • For example, input feature data and weights with a fixed-point bit width of 2n bits are sent to the systolic array 810 in the manner described below in the (3) 2n bits*2n bits operation.
  • the computing device 800 further includes a feature data input unit 840 and a weight input unit 830.
  • the weight input unit 830 is used for buffering the weights to be processed, and sending the weights into the systolic array 810 according to the control signaling of the control unit 820.
  • the weight input unit 830 is responsible for caching the weight values (for example, input by the weight input module (Filt_Loader module) in the accelerator 1800 described below), and loads the weights for the systolic array 810 under the control of the control unit 820.
  • The weight input unit 830 has one and only one interface to each column of first calculation units 811 of the systolic array 810, and the interface can transmit only one weight value per clock cycle.
  • Weight loading is divided into two stages: shifting and loading. In the shifting stage, the weight input unit 830 sequentially sends the weight values required by the first calculation units 811 of the same column into the systolic array 810 through the same interface.
  • The received weight values are passed down in sequence from the first calculation unit 811 at the interface.
  • In the loading stage, the first calculation units 811 in the same column of the systolic array 810 simultaneously load the buffered weight values into their respective registers for use by the subsequent multiply-accumulate units.
  • When the weight input unit 830 loads data for the first calculation units 811 in two adjacent columns, there is a delay of one clock cycle.
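  • A minimal sketch of this two-stage loading for one column of an M-row array is shown below; the ordering (the weight destined for the farthest cell is sent first) is an assumption consistent with a shift-register chain.

```python
M = 3
column_weights = [7, 5, 9]              # weights for rows 0, 1, 2 (stand-ins)

# Shift stage: one weight per clock cycle enters through the single interface
# and every cell passes its buffered value down to the next cell.
shift_regs = [None] * M                 # weight shift register in each cell
for w in reversed(column_weights):      # farthest cell's weight goes in first
    shift_regs = [w] + shift_regs[:-1]

# Load stage: all cells in the column latch their shifted-in value at once.
weight_regs = list(shift_regs)
assert weight_regs == column_weights
```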
  • the characteristic data input unit 840 is used to buffer the input characteristic data to be processed, and send the input characteristic data into the systolic array 810 according to the control signaling of the control unit 820.
  • the feature data input unit 840 is responsible for buffering the input feature data (for example, the feature data input module (IFM_Loader module) in the accelerator 1800 described below), and loads the input feature data for the systolic array 810 under the control of the control unit 820 .
  • The feature data input unit 840 has only one interface to each row of first calculation units 811 of the systolic array 810, and the interface can transmit only one input feature data per clock cycle.
  • The received input feature data is transferred in sequence from the first calculation unit 811 at the interface to the right, until the last first calculation unit 811.
  • When the feature data input unit 840 loads data for the first calculation units 811 of two adjacent rows, there is a delay of one clock cycle.
  • In the technical solutions of the embodiments of the present application, the shift-then-accumulate operation enables the computing device to support a variety of fixed-point bit widths, so that the computing device can be switched between multiple fixed-point bit widths according to application requirements, meeting the various fixed-point precision requirements in applications.
  • the control unit 820 is responsible for controlling each unit in the arithmetic device 800 to implement the convolution operation.
  • First, the control unit 820 controls the weight input unit 830 to load the weight values into the systolic array 810; then the control unit 820 controls the feature data input unit 840 to send the feature map data into the systolic array 810, and controls the systolic array 810 to perform the convolution operation.
  • the above process is repeated in sequence until all the convolution operations are completed.
  • The following describes how the first calculation units 811 in 2 rows and c columns in the systolic array 810 equivalently form a second calculation unit supporting a fixed-point bit width of 2n bits.
  • In the following description, the first calculation unit 811 is denoted as MC; that is, an MC can complete an n bits*n bits multiply-accumulate operation. The second calculation unit 812 is denoted as MU.
  • As shown in FIG. 11, the first calculation units 811 (MC) in 2 rows and 2 columns in the systolic array 810 equivalently form a second calculation unit 812 (MU).
  • The four first calculation units 811 in 2 rows and 2 columns are respectively labeled MC(U0_0), MC(U0_1), MC(U1_0), and MC(U1_1); that is, MC(U0_0), MC(U0_1), MC(U1_0), and MC(U1_1) as a whole can be equivalent to one second calculation unit 812 (MU).
  • MC has the following input terminals: bi (n bits), si, ai (n bits), and ci.
  • MC has the following output terminals: bo (n bits), ar, acr, ao (n bits), mr.
  • the input terminal bi(n bits) is configured to input the input weight value of n bits.
  • the input terminal bi(n bits) of MC(U0_0) inputs the input weight value b_lsb of n bits.
  • the input terminal si is configured to input the accumulated result output by the previous MC.
  • the input terminal si of MC (U0_0) inputs the accumulation result s_lsb output by the previous stage MC.
  • the input terminal ai(n bits) is configured to input the input characteristic value of n bits.
  • the input terminal ai (n bits) of MC (U0_0) inputs n bits of input characteristic data a_lsb.
  • The input terminal ci is configured to input the intermediate result of an adjacent MC that is not the previous-stage MC. As described below in the (3) 2n bits*2n bits operation, the input terminal ci of MC(U1_0) inputs the intermediate result "RM(U0_1)[7:0]" of MC(U0_1), the input terminal ci of MC(U0_1) inputs the intermediate result "RM(U1_0)[31:8]" of MC(U1_0), and the input terminal ci of MC(U1_1) inputs the intermediate result "RA(U1_0)[31:16]" of MC(U1_0).
  • each output terminal of MC is as follows.
  • The output terminal bo (n bits) is configured to output the n-bit input weight value to the next-stage MC.
  • For example, the output terminal bo (n bits) of MC(U0_0) outputs the n-bit input weight value b_lsb to MC(U1_0).
  • the output terminal ar is configured to output the calculation result of the current MC.
  • the output terminal ar of MC (U0_0) outputs the calculation result RA (U0_0) of MC (U0_0).
  • the output terminal acr is configured to output the intermediate result of the current MU to the next MC in the direction in which the weight value flows. As described below in (3) 2n bits*2n bits operation, the output terminal acr of MC(U1_0) outputs the intermediate result RM(U1_0)[31:8] of MC(U1_0).
  • The output terminal ao (n bits) is configured to output the n-bit input feature data to the next MC in the direction in which the input feature data flows. For example, the output terminal ao (n bits) of MC(U0_0) outputs the n-bit input feature data to MC(U0_1).
  • the output terminal mr is configured to output the intermediate result of the current MU to the diagonally located MCs belonging to the same MU. As described below in (3) 2n bits*2n bits operation, the output terminal mr of MC(U1_0) outputs the intermediate result RA(U1_0)[31:16] of MC(U1_0) to MC(U1_1).
  • The MU shown in Figure 11 can simultaneously complete 4 n bits*n bits multiply-accumulate operations, or 2 2n bits*n bits multiply-accumulate operations, or 1 2n bits*2n bits multiply-accumulate operation.
  • the specific description is as follows.
  • (1) n bits*n bits operation: the input ports a_lsb, a_msb, b_lsb, and b_msb input four different n-bit operands.
  • The 4 MC units in the MU respectively complete four different n bits*n bits multiply-accumulate operations.
  • The calculation process (1) is shown below.
  • MC(U0_0): RM(U0_0) = a_lsb*b_lsb
  • MC(U1_0): RM(U1_0) = a_msb*b_lsb
  • RA(U1_0) = RM(U1_0) + RA(U0_0)
  • MC(U0_1): RM(U0_1) = a_lsb*b_msb
  • MC(U1_1): RM(U1_1) = a_msb*b_msb
  • RA(U1_1) = RM(U1_1) + RA(U0_1)
  • a_msb and a_lsb represent two different n-bit input feature data values; b_msb and b_lsb represent two different n-bit input weight values.
  • s_msb and s_lsb represent the accumulated results output by the previous MU.
  • so_lsb and so_msb represent the accumulated results output by the current MU.
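  • A minimal numeric sketch of calculation process (1) is shown below; the step RA(U0_0) = RM(U0_0) + s_lsb (and likewise for the other column) is an assumption implied by, but not written in, the equations above.

```python
a_lsb, a_msb = 11, 22      # two independent n-bit input feature data values
b_lsb, b_msb = 3, 4        # two independent n-bit input weight values
s_lsb, s_msb = 100, 200    # accumulated results arriving from the previous MU

RA_U0_0 = a_lsb * b_lsb + s_lsb      # MC(U0_0)
so_lsb  = a_msb * b_lsb + RA_U0_0    # MC(U1_0): plain accumulate, no shift
RA_U0_1 = a_lsb * b_msb + s_msb      # MC(U0_1)
so_msb  = a_msb * b_msb + RA_U0_1    # MC(U1_1): plain accumulate, no shift
print(so_lsb, so_msb)                # two independent n bits*n bits MAC chains
```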
  • (2) 2n bits*n bits operation: the 2n-bit input feature value a is sent into the MU through the input ports a_msb and a_lsb; the input port a_lsb receives the low n bits of the input feature value a, and the input port a_msb receives the high n bits of the input feature value a.
  • The input ports b_lsb and b_msb receive two different n-bit input weight values.
  • the four MC units are divided into two groups, MC (U0_0) and MC (U1_0) are the first group, and MC (U0_1) and MC (U1_1) are the second group.
  • the two sets of MC units respectively complete two different 2n bits*n bits multiply and accumulate operations.
  • the calculation process (2) is shown below.
  • MC(U0_0): RM(U0_0) = a_lsb*b_lsb
  • MC(U1_0): RM(U1_0) = a_msb*b_lsb
  • RA(U1_0) = (RM(U1_0) << 8) + RA(U0_0)
  • MC(U0_1): RM(U0_1) = a_lsb*b_msb
  • MC(U1_1): RM(U1_1) = a_msb*b_msb
  • RA(U1_1) = (RM(U1_1) << 8) + RA(U0_1)
  • {a_msb, a_lsb} is one 2n-bit input feature value; b_msb and b_lsb are two different n-bit input weight values.
  • s_msb and s_lsb are the accumulated results output by the previous MU.
  • so_lsb and so_msb are the two accumulated results output by the current MU.
  • MC(U0_0) and MC(U1_0) as the first group can be regarded as one second calculation unit 812,
  • and MC(U0_1) and MC(U1_1) as the second group can be regarded as another second calculation unit 812.
  • MC (U0_0), MC (U1_0), MC (U0_1), and MC (U1_1) can be regarded as one second calculation unit 812 as a whole.
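  • A minimal numeric sketch of calculation process (2) is shown below, with n = 8 and unsigned values assumed: shifting the second-row product left by n bits before accumulating makes each column behave as one 2n bits*n bits multiply-accumulate unit.

```python
n = 8
a = 0x1F3C                       # one 2n-bit input feature value {a_msb, a_lsb}
a_lsb, a_msb = a & 0xFF, a >> n
b_lsb, b_msb = 3, 4              # two independent n-bit input weight values
s_lsb, s_msb = 100, 200          # accumulated results from the previous MU

so_lsb = ((a_msb * b_lsb) << n) + (a_lsb * b_lsb + s_lsb)   # column 0
so_msb = ((a_msb * b_msb) << n) + (a_lsb * b_msb + s_msb)   # column 1
assert so_lsb == a * b_lsb + s_lsb
assert so_msb == a * b_msb + s_msb
```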
  • (3) 2n bits*2n bits operation: the 2n-bit input feature value a is sent into the MU through the input ports a_msb and a_lsb; the input port a_lsb receives the low n bits of the input feature value a, and the input port a_msb receives the high n bits of the input feature value a.
  • The 2n-bit input weight value b is sent into the MU through the input ports b_msb and b_lsb; the input port b_lsb receives the low n bits of the input weight value b, and the input port b_msb receives the high n bits of the input weight value b.
  • MC(U0_0) and MC(U1_0) output the low 2n bits of the 2n bits*2n bits multiply-accumulate result,
  • and MC(U0_1) and MC(U1_1) output the high 2n bits of the 2n bits*2n bits multiply-accumulate result.
  • the calculation process (3) is shown below.
  • MC(U0_0): RM(U0_0) = a_lsb*b_lsb
  • MC(U1_0): RM(U1_0) = a_msb*b_lsb
  • RA(U1_0) = RA(U0_0) + (RM(U1_0)[7:0] << 8) + (RM(U0_1)[7:0] << 8)
  • MC(U0_1): RM(U0_1) = a_lsb*b_msb
  • RA(U0_1) = RM(U0_1)[31:8] + RM(U1_0)[31:8] + s_msb
  • MC(U1_1): RM(U1_1) = a_msb*b_msb
  • RA(U1_1) = RM(U1_1) + RA(U0_1) + RA(U1_0)[31:16]
  • {a_msb, a_lsb} represents one 2n-bit input feature value
  • {b_msb, b_lsb} represents one 2n-bit input weight value
  • {s_msb, s_lsb} represents the accumulated result output by the previous MU
  • {so_msb, so_lsb} represents the accumulated result output by the current MU.
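  • The following sketch replays calculation process (3) numerically with n = 8, confirming that the last-row MCs yield the low and high 2n bits of the 4n-bit result; unsigned operands and the step RA(U0_0) = RM(U0_0) + s_lsb are assumptions.

```python
n = 8
a, b = 0xBEEF, 0x1234                 # 2n-bit feature value and 2n-bit weight
s_lsb, s_msb = 0x0123, 0x0456         # incoming partial sum {s_msb, s_lsb}
a_lsb, a_msb = a & 0xFF, a >> n
b_lsb, b_msb = b & 0xFF, b >> n

RM00, RM10 = a_lsb * b_lsb, a_msb * b_lsb
RM01, RM11 = a_lsb * b_msb, a_msb * b_msb
RA00 = RM00 + s_lsb                                    # assumed first step
RA10 = RA00 + ((RM10 & 0xFF) << n) + ((RM01 & 0xFF) << n)
RA01 = (RM01 >> n) + (RM10 >> n) + s_msb
RA11 = RM11 + RA01 + (RA10 >> (2 * n))                 # carry from the low side

so_lsb = RA10 & 0xFFFF                # low 2n bits of the 4n-bit result
so_msb = RA11                         # high 2n bits of the 4n-bit result
assert (so_msb << (2 * n)) + so_lsb == a * b + (s_msb << (2 * n)) + s_lsb
```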
  • When the control unit 820 controls the first calculation units 811 to perform the above calculation process (1), the systolic array 810 can perform n bits*n bits calculations; that is, the computing device 800 supports n bits*n bits operation.
  • When the control unit 820 controls the first calculation units 811 to perform the above calculation process (2), the systolic array 810 can perform 2n bits*n bits calculations; that is, the computing device 800 supports 2n bits*n bits operation.
  • When the control unit 820 controls the first calculation units 811 to perform the above calculation process (3), the systolic array 810 can perform 2n bits*2n bits calculations; that is, the computing device 800 supports 2n bits*2n bits operation.
  • The computing device 800 provided by the embodiments of the present application can not only support the two fixed-point bit widths of n bits and 2n bits, but can also support 4n bits, 8n bits, and many other fixed-point bit widths, as described below.
  • An MC supports a fixed-point bit width of n bits;
  • an MU composed of MCs in 2 rows and 2 columns supports a fixed-point bit width of 2n bits;
  • a calculation unit composed of MUs in 2 rows and 2 columns, that is, a calculation unit composed of MCs in 4 rows and 4 columns, can support a fixed-point bit width of 4n bits;
  • a calculation unit composed of MUs in 4 rows and 4 columns, that is, a calculation unit composed of MCs in 8 rows and 8 columns, can support a fixed-point bit width of 8n bits; and so on.
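  • The doubling pattern can be sketched as a recursion, assuming unsigned values: each doubling of the bit width splits both operands in half and combines four half-width multiplies with shifts and accumulation, so a k-fold doubling needs a 2^k x 2^k block of MCs (2x2 for 2n bits, 4x4 for 4n bits, 8x8 for 8n bits).

```python
def mul_recursive(x, y, width, base_width):
    """Multiply two `width`-bit values using only `base_width`-bit multiplies."""
    if width <= base_width:
        return x * y                       # a single MC handles this case
    half = width // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half
    lo = mul_recursive(x_lo, y_lo, half, base_width)
    mid = (mul_recursive(x_hi, y_lo, half, base_width)
           + mul_recursive(x_lo, y_hi, half, base_width))
    hi = mul_recursive(x_hi, y_hi, half, base_width)
    return (hi << width) + (mid << half) + lo          # shift, then accumulate

# A 4n-bit (32-bit) multiply decomposed down to n-bit (8-bit) multiplies,
# i.e. the work of a 4x4 block of MCs:
assert mul_recursive(0xDEADBEEF, 0x12345678, 32, 8) == 0xDEADBEEF * 0x12345678
```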
  • In other words, the calculation mode of the first calculation units 811 can be set according to application requirements, so that the computing device 800 supports a fixed-point bit width that meets the required calculation accuracy.
  • FIG. 12 is a schematic diagram of the structure of the first calculation unit 811.
  • The first calculation unit 811 includes a weight shift register (Weight Shift Register), a feature map data shift register (FM Data Shift Register), a weight register (Weight Register), a feature map data register (FM Data Register), a multiplier circuit (Multiplier Circuit), a product register (Product Register), a shift-then-accumulate operation circuit (Carry Adder Circuit), and an accumulation circuit (Accumulate Adder Circuit).
  • The weight shift register is responsible for buffering the weight value sent from the weight input unit 830 or the upper-stage first calculation unit 811. In the shift phase of weight loading, the weight value buffered by the weight shift register is passed down to the next-stage first calculation unit 811. In the load phase of weight loading, the weight value buffered by the weight shift register is latched into the weight register.
  • the feature map data shift register is responsible for buffering the feature map data sent from the feature data input unit 840 or the first calculation unit 811 on the left.
  • the feature map data stored in the feature map data shift register will be latched to the feature map data register, and at the same time will be sent to the first calculation unit 811 on the right.
  • The first calculation unit 811 on the left is the previous first calculation unit 811 in the flow direction of the input feature values in the systolic array.
  • The first calculation unit 811 on the right is the subsequent first calculation unit 811 in the flow direction of the input feature values in the systolic array.
  • the multiplication circuit is responsible for multiplying the weight value and the characteristic value buffered by the weight register and the characteristic map data register, and the operation result is sent to the product register.
  • The shift-then-accumulate operation circuit is responsible for shifting the data in the product register and then accumulating it with the calculation results of the other first calculation units 811 in the second calculation unit 812 to which the current first calculation unit 811 belongs.
  • The accumulation result of the shift-then-accumulate operation circuit is accumulated again in the accumulation circuit with the multiply-accumulate result sent by the previous-stage first calculation unit 811, and is then passed down to the next-stage first calculation unit 811.
  • For example, when the first calculation unit 811 serves as MC(U1_0) in FIG. 11, the shift-then-accumulate operation circuit in the first calculation unit 811 is responsible for computing RA(U1_0) as given in the calculation processes above.
  • For example, when the first calculation unit 811 serves as MC(U0_1) in FIG. 11, the shift-then-accumulate operation circuit in the first calculation unit 811 is responsible for computing RA(U0_1) as given in the calculation processes above.
  • For example, when the first calculation unit 811 serves as MC(U1_1) in FIG. 11, the shift-then-accumulate operation circuit in the first calculation unit 811 is responsible for computing RA(U1_1) as given in the calculation processes above.
  • In other cases, the shift-then-accumulate operation circuit shifts and accumulates the data in the product register with 0, as shown in FIG. 12; in these cases, the shift-then-accumulate operation circuit only passes the data through transparently and does not process it.
  • the accumulation circuit directly sends the accumulation result obtained to the output processing unit 850 (the output processing unit 850 will be described below).
  • the computing device 800 further includes an output processing unit 850.
  • the output processing unit 850 is configured to process the operation result output by the systolic array 810 to obtain output characteristic data.
  • For example, if the systolic array 810 includes the calculation units C00, C01, C02, C10, C11, C12, C20, C21, and C22 in the example described above in conjunction with FIGS. 4-7, the output processing unit 850 is used to receive the calculation results output by the calculation units C20, C21, and C22 (it should be understood that these are intermediate results of the convolution operation), and to accumulate the calculation result of the calculation unit C20 at the end of the third cycle, the calculation result of the calculation unit C21 at the end of the fifth cycle, and the calculation result of the calculation unit C22 at the end of the seventh cycle, to obtain the output characteristic data a11*W11+a12*W12+a13*W13+a21*W21+a22*W22+a23*W23+a31*W31+a32*W32+a33*W33.
  • the output processing unit 850 includes an accumulate (Accumulate, ACC) array 851, a result processing (Rslt_Proc) unit 852, and a storage (Psum_Mem) unit 853.
  • the column size of the ACC array 851 is the same as the column size of the systolic array 810.
  • Assuming the size of the systolic array is M*N, that is, the systolic array 810 includes first calculation units 811 in M rows and N columns,
  • the size of the ACC array 851 is 1*N; that is, the ACC array 851 includes ACCs in 1 row and N columns.
  • If the first calculation units 811 in 2 rows and c columns as a whole can form a second calculation unit 812, then every c ACCs in the ACC array 851 as a whole can form an ACC group.
  • For example, if the first calculation units 811 in 2 rows and 2 columns as a whole can form a second calculation unit 812, then every 2 ACCs in the ACC array 851 as a whole can form an ACC group (ACC_GRP) unit; that is, the ACC array 851 has N/2 ACC group (ACC_GRP) units in total.
  • the ACC group (ACC_GRP) unit is introduced only for ease of understanding and description, and does not mean that the ACC array 851 actually includes the ACC group unit.
  • the unit formed by every two ACCs in the ACC array 851 as a whole is recorded as an ACC group unit.
  • the result processing (Rslt_Proc) unit 852 is responsible for processing the calculation result output by the ACC array 851.
  • Take the computing device 800 performing a convolution operation as an example: if the calculation result output by the ACC array 851 is the final result of the convolution calculation, the result processing unit 852 outputs it, for example to an output module outside the computing device 800 for subsequent processing; if the calculation result output by the ACC array 851 is an intermediate result of the convolution calculation, the result processing unit 852 sends it into the storage (Psum_Mem) unit 853.
  • the storage (Psum_Mem) unit 853 is responsible for caching the intermediate results output by the ACC array 851. Taking the computing device 800 for performing a convolution operation as an example, the storage unit 853 is responsible for caching the intermediate result of the convolution calculation.
  • the storage unit 853 may include FIFOs whose number matches the column size of the systolic array 810. Assuming that the size of the systolic array 810 is M*N, that is, the column size is N, the storage unit 853 may be composed of N FIFOs.
  • Each FIFO in the storage unit 853 can perform read and write operations at the same time.
  • the N FIFOs are divided into different groups according to the size of the convolution kernel. Different FIFO groups buffer the intermediate calculation results of different convolution kernels.
  • As mentioned above, the result processing unit 852 sends the output calculation result into the storage (Psum_Mem) unit 853; specifically, the result processing unit 852 sends the output calculation result to the corresponding FIFO group in the storage unit 853.
  • the ACC array 851 in the output processing unit 850 also includes a splicing unit 854, and the splicing unit 854 corresponds to the ACC group unit one-to-one.
  • the splicing unit 854 is used to splice the input data of the two ACCs forming the ACC group unit.
  • The output processing unit 850 is configured to perform the following operations: 1) splice the low-order and high-order results output by the second calculation unit 812 into its complete operation result, and 2) accumulate the spliced operation results.
  • Operation 1) is executed by the splicing unit 854; operation 2) is executed by the ACC in the ACC group unit corresponding to the first calculation unit 811 that outputs the high 2n bits.
  • For a convolution operation, the weight matrix is a convolution kernel.
  • For a pooling operation, the weight matrix is a pooling matrix.
  • the output processing unit 850 may perform the accumulation operation including the splicing action or the accumulation operation not including the splicing action according to the control instruction of the control unit 820.
  • Specifically, the control unit 820 sends control instruction 1 to the output processing unit 850 when the computing device 800 needs to perform the (1) n bits*n bits operation or the (2) 2n bits*n bits operation described above, and sends control instruction 2 to the output processing unit 850 when the computing device 800 needs to perform the (3) 2n bits*2n bits operation described above.
  • When the output processing unit 850 receives control instruction 1, it switches its working mode to mode (MODE) 0, as shown by mode (MODE) 0 in FIG. 9; when it receives control instruction 2, it switches its working mode to mode 1, as shown in FIG. 9.
  • The operation flow of the output processing unit 850 in mode 0 is as follows: the ACCs in the ACC array 851 obtain the output results of the systolic array 810 from the corresponding first calculation units 811, and then accumulate these output results to obtain the output characteristic data. That is, in mode 0, the output processing unit 850 does not perform the splicing action.
  • The operation flow of the output processing unit 850 in mode 1 is as follows: the splicing unit 854 splices the low 2n bits and the high 2n bits respectively output by the two first calculation units 811 in the second calculation unit 812 to obtain the 4n-bit calculation result of the second calculation unit 812, and sends the 4n-bit calculation result to the high-order ACC in the ACC group unit to which the splicing unit 854 belongs (that is, the ACC corresponding to the first calculation unit 811 that outputs the high 2n bits);
  • the high-order ACC accumulates the 4n-bit operation results of the P second calculation units 812 to obtain the output characteristic data.
  • the structure diagram of the ACC in the ACC array 851 is shown in FIG. 13.
  • The ACC unit includes a systolic array accumulation register (mc_psum Register), an ACC accumulation register (acc_psum Register), a sum register (sum Register), a filter circuit (Filter Circuit), a delay circuit (Delay Circuit), a first-stage accumulation circuit (First Stage Adder Circuit), and a second-stage accumulation circuit (Second Stage Adder Circuit).
  • The filter circuit filters out redundant accumulated values (Psum values) output by the systolic array 810 according to the stride value (Stride value) parameter input during the convolution calculation, and sends the unfiltered accumulated values (Psum values) to the systolic array accumulation register (mc_psum Register).
  • the delay circuit (Delay circuit) delays the accumulated value (Psum value) output by the left-level ACC by a specified clock cycle and then sends it to the ACC accumulation register (acc_psum Register). The number of delayed clock cycles is determined by the parameter expansion value ( Dilation value) is calculated.
  • the first-stage accumulation circuit (First Stage Adder circuit) is responsible for accumulating the data buffered in the systolic array accumulation register (mc_psum Register) and the ACC accumulation register (acc_psum Register) and then sends it to the sum register (sum Register).
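The stride filter and dilation delay can be pictured with the following sketch; treating the Psum streams as Python lists is, of course, a simplification of the clocked hardware:

```python
def filter_by_stride(psums, stride):
    # Filter circuit: keep every stride-th Psum, drop the redundant ones.
    return [p for i, p in enumerate(psums) if i % stride == 0]

def delay(psums, cycles):
    # Delay circuit: postpone the left-stage ACC output by some cycles.
    return [None] * cycles + psums  # None marks an empty pipeline slot

print(filter_by_stride([10, 11, 12, 13, 14], stride=2))  # [10, 12, 14]
print(delay([10, 12, 14], cycles=1))                     # [None, 10, 12, 14]
```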
  • When the convolution kernels of a convolution operation are mapped to the systolic array, N consecutive ACCs are mapped to the same kernel, where N equals the width of the kernel.
  • The first of the N ACCs does not need to receive a Psum value from the ACC one stage to its left; for example, it can receive a system preset signal instead.
  • The last of the N ACCs does not output the Psum value buffered in its sum register (sum Register) to the ACC one stage to its right; instead, in the second-stage accumulation circuit (Second Stage Adder circuit), it accumulates the buffered Psum value with the Psum value read back from the storage (Psum_Mem) unit 853 and outputs the result to the result processing (Rslt_Proc) unit 852, as sketched below.
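At the chain level, the behavior of the N ACCs can be summarized as below; acc_chain() is a hypothetical helper that folds the first-stage additions along the chain and the final read-back addition into one pass:

```python
def acc_chain(psums, preset=0, mem_readback=0):
    # The first ACC starts from a preset value; each ACC adds its own Psum.
    running = preset
    for p in psums:
        running += p
    # Last ACC: the second-stage adder also adds the Psum_Mem read-back.
    return running + mem_readback

assert acc_chain([1, 2, 3], preset=0, mem_readback=10) == 16
```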
  • In FIGS. 14 and 15, the first calculation unit 811 is denoted MC, the second calculation unit 812 is denoted MU, and an MU composed of 2 rows and 2 columns of MCs is taken as an example.
  • FIG. 14 is a schematic flowchart of performing a convolution operation with a fixed-point bit width of n bits using the neural network computing device 800 provided by an embodiment of the present application.
  • The units (MC, ACC) marked with dotted lines in FIG. 14 are only responsible for transferring data and do not participate in the convolution calculation.
  • In FIG. 14, the size of the convolution kernel is 1*3*3.
  • The meanings of the symbols in FIG. 14 are as follows.
  • KhaDb represents the b-th value in the input feature map corresponding to the a-th row of the convolution kernel.
  • Kwc represents the weight value vector of the c-th column of the convolution kernel; it is deployed to the corresponding MCs at the start of the convolution operation.
  • KwcDd represents the d-th Psum value of the output feature map corresponding to the c-th column of the convolution kernel.
  • Bias represents the bias value input to the convolution operation.
  • SxTy represents the accumulated value (Psum value) output by the x-th stage ACC at time y.
  • At the start of the convolution operation, the weight value vectors Kwc of the convolution kernel are sent into the systolic array 810 (MAC Array) over three clock cycles, and each MC is loaded with the weight value at the corresponding position of the 3*3 convolution kernel.
  • After the weights are loaded, the feature values of the input feature map are sent into the systolic array 810 (MAC Array) in the order shown in FIG. 14, where they are multiplied and accumulated with the weight values.
  • The order of the accumulated values (Psum values) output by the systolic array 810 is shown in FIG. 14.
  • The Psum values output from the systolic array 810 are sent to the corresponding ACCs for further accumulation.
  • The calculation performed by each ACC unit at each moment is shown in FIG. 14; after the third-stage ACC completes its accumulation, the feature value of the final output feature map is obtained (see the arithmetic sketch below).
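At the arithmetic level, the FIG. 14 dataflow reduces to per-column partial sums that the ACC stages add together; the sketch below checks this equivalence on made-up 3*3 values (the clock-by-clock data movement is not modeled):

```python
patch  = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]       # 3*3 input window (example values)
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]    # 3*3 convolution kernel (example values)

# Per-column partial sums, i.e. the KwcDd values each MC column produces:
col_psums = [sum(patch[r][c] * kernel[r][c] for r in range(3)) for c in range(3)]
direct    = sum(patch[r][c] * kernel[r][c] for r in range(3) for c in range(3))
assert sum(col_psums) == direct  # ACC chain total equals the direct convolution
```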
  • FIG. 15 is a schematic flowchart of performing a convolution operation with a fixed-point bit width of 2n bits using the neural network computing device 800 provided by an embodiment of the present application.
  • The units (MC, ACC) marked with dotted lines in FIG. 15 are only responsible for transferring data and do not participate in the convolution calculation.
  • In FIG. 15, the size of the convolution kernel is 1*3*3.
  • The meanings of the symbols in FIG. 15 are as follows.
  • KhaDb_LSB represents the low n bits, and KhaDb_MSB the high n bits, of the b-th value in the input feature map corresponding to the a-th row of the convolution kernel.
  • Kwc_LSB represents the low n bits, and Kwc_MSB the high n bits, of the weight value vector of the c-th column of the convolution kernel; both are deployed to the corresponding MCs at the start of the convolution operation.
  • KwcDd_LSB represents the low bits, and KwcDd_MSB the high bits, of the d-th Psum value of the output feature map corresponding to the c-th column of the convolution kernel.
  • Bias represents the bias value input to the convolution operation.
  • SxTy represents the Psum value output by the x-th stage ACC unit at time y.
  • At the start of the convolution operation, the weight value vectors Kwc_LSB and Kwc_MSB of the convolution kernel are sent into the systolic array 810 (MAC Array) over six clock cycles.
  • Each MC unit is loaded with the corresponding n bits of the weight value at the corresponding position of the 3*3 convolution kernel.
  • After the weights are loaded, the feature values of the input feature map are sent into the systolic array 810 in the order shown in FIG. 15, where they are multiplied and accumulated with the weight values.
  • The order of the Psum values output by the systolic array 810 is shown in FIG. 15.
  • In the ACC group unit (ACC_GRP), the low bits and the high bits of each Psum value output from the systolic array 810 are first assembled into a complete Psum value, which is then sent to the ACC unit for further accumulation.
  • The calculation performed by each ACC unit at each moment is shown in FIG. 15.
  • Each ACC unit passes one Psum value to the next stage every clock cycle; after the second ACC module of the third-stage ACC_GRP unit completes its accumulation, the feature value of the final output feature map is obtained. The bit-level decomposition that this dataflow realizes is verified in the sketch below.
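The identity behind this LSB/MSB dataflow is the standard four-partial-product decomposition; the sketch below verifies it for n = 8 with unsigned operands (sign handling in the hardware is not modeled here):

```python
n = 8
mask = (1 << n) - 1

def mul_2n_by_2n(a, b):
    a_lsb, a_msb = a & mask, a >> n
    b_lsb, b_msb = b & mask, b >> n
    # Four n*n products, shifted and accumulated as in an MU:
    return (a_lsb * b_lsb
            + ((a_msb * b_lsb) << n)
            + ((a_lsb * b_msb) << n)
            + ((a_msb * b_msb) << (2 * n)))

for a, b in [(0xABCD, 0x1234), (0xFFFF, 0xFFFF), (0x0102, 0x0FF0)]:
    assert mul_2n_by_2n(a, b) == a * b
```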
  • The storage format of input feature data with a fixed-point bit width of 2n bits in the external memory is as follows: the low n bits and the high n bits of each row of input feature data in the input feature map are each stored together in a contiguous block.
  • The format for storing feature data with a fixed-point bit width of 2n bits in the SRAM is shown in FIG. 16.
  • The high n bits and the low n bits of each row of feature values in the feature map are stored in separate contiguous blocks.
  • The storage format of weights with a fixed-point bit width of 2n bits in the external memory is analogous: the low n bits and the high n bits of each row of the weight matrix are stored in separate contiguous blocks.
  • The storage format of feature data with a fixed-point bit width of n bits in the SRAM is shown in FIG. 17.
  • The storage format of weights with a fixed-point bit width of n bits in the SRAM is similar to that shown in FIG. 17. A sketch of the 2n-bit layout follows.
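A sketch of the FIG. 16 layout for one row, assuming n = 8: all low bytes of a row are stored contiguously, then all high bytes, so the LSB and MSB streams of FIG. 15 can each be read sequentially. pack_row/unpack_row are illustrative names, not part of the embodiment:

```python
n = 8
mask = (1 << n) - 1

def pack_row(row_2n):
    lows  = [v & mask for v in row_2n]   # low n bits, stored together
    highs = [v >> n for v in row_2n]     # high n bits, stored together
    return lows + highs

def unpack_row(packed):
    half = len(packed) // 2
    return [(hi << n) | lo for lo, hi in zip(packed[:half], packed[half:])]

row = [0x1234, 0xABCD, 0x00FF]
assert unpack_row(pack_row(row)) == row
```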
  • The computing device 800 provided in this application can be applied to a deep neural network accelerator.
  • As shown in FIG. 18, an embodiment of the present application also provides a neural network accelerator 1800.
  • The neural network accelerator 1800 includes a processing module 1810, a weight input module 1820, a feature data input module 1830, and an output module 1840.
  • The processing module 1810 is the neural network computing device 800 provided in the above method embodiments.
  • The weight input module 1820 is used to read weights from the external memory and send them to the processing module 1810; referring to FIGS. 8 and 9, it sends the weights to the weight input unit 830 in the processing module 1810.
  • The feature data input module 1830 is used to read feature data from the external memory and send it to the processing module 1810; referring to FIGS. 8 and 9, it sends the feature data to the feature data input unit 840 in the processing module 1810.
  • The output module 1840 is used to store the output feature data produced by the processing module 1810 in the external memory. A dataflow sketch of these modules follows.
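Purely as an orientation aid, and not part of the disclosed embodiment, the module wiring can be sketched as a single function; the dictionary-based memory model and all identifiers other than the module numbers are assumptions:

```python
def run_accelerator_1800(external_memory, processing_module_1810):
    # processing_module_1810 is any callable standing in for the device 800.
    weights  = external_memory["weights"]    # read by weight input module 1820
    features = external_memory["features"]   # read by feature data input module 1830
    ofm = processing_module_1810(weights, features)
    external_memory["ofm"] = ofm             # output module 1840 writes back
    return external_memory
```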
  • When the feature data has a fixed-point bit width of n bits, its storage format in the external memory is as shown in FIG. 17.
  • When the feature data has a fixed-point bit width of 2n bits, its storage format in the external memory is as shown in FIG. 16.
  • When the weights have a fixed-point bit width of n bits, their storage format in the external memory is as shown in FIG. 17.
  • When the weights have a fixed-point bit width of 2n bits, their storage format in the external memory is as shown in FIG. 16.
  • An embodiment of the present application also provides a control method for a neural network computing device.
  • The computing device includes a systolic array.
  • The processing unit of the systolic array is a first calculation unit.
  • The first calculation unit supports multiplication operands with a fixed-point bit width of n bits, where n is 2 to the power of m and m is a positive integer.
  • The first calculation unit can perform a shift-then-accumulate operation, so that multiple first calculation units in 2 rows and c columns of the systolic array as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits.
  • Here c can be 1 or 2: either the two first calculation units in 2 rows and 1 column of the systolic array as a whole form the second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits, or the four first calculation units in 2 rows and 2 columns do so. The c = 1 case is sketched below.
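For the c = 1 case, the shift-then-accumulate step can be pictured as below, again assuming n = 8 and unsigned values; the two partial products correspond to the two row-adjacent first calculation units:

```python
n = 8
mask = (1 << n) - 1

def mul_2n_by_n(a_2n, b_n):
    partial_low  = (a_2n & mask) * b_n        # row 1: a_lsb * b
    partial_high = (a_2n >> n) * b_n          # row 2: a_msb * b
    return (partial_high << n) + partial_low  # shift first, then accumulate

assert mul_2n_by_n(0xABCD, 0x55) == 0xABCD * 0x55
```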
  • The control method includes operations S1010 and S1020 as shown in FIG. 10; see the description above for details, which are not repeated here.
  • S1020 includes: when the computing device needs to operate on input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of 2n bits, controlling some of the first calculation units included in the second calculation unit to perform the shift-then-accumulate operation, so that the two first calculation units in the second of the two rows included in the second calculation unit output, respectively, the low 2n bits and the high 2n bits of the 4n-bit result of the second calculation unit.
  • The control method further includes: when the computing device needs to process input feature data with a fixed-point bit width of 2n bits, sending the low n bits and the high n bits of the input feature data into the two rows of first calculation units included in the second calculation unit, respectively.
  • The control method further includes: when the computing device needs to process weights with a fixed-point bit width of 2n bits, sending the low n bits and the high n bits of the weights into the two columns of first calculation units included in the second calculation unit, respectively.
  • The control method further includes: splicing the low 2n-bit result and the high 2n-bit result output by the systolic array for the same second calculation unit to obtain the 4n-bit result of that second calculation unit, and accumulating the 4n-bit results of the p second calculation units corresponding to the same weight matrix to obtain the output feature data corresponding to the weight matrix, where p equals the width of the weight matrix.
  • The storage format of input feature data with a fixed-point bit width of 2n bits in the external memory is: the low n bits and the high n bits of each row of input feature data in the input feature map are stored in separate contiguous blocks.
  • The computing device is used to perform a convolution operation or a pooling operation. The bit-width dispatch performed by the control method is sketched below.
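The dispatch of operations S1010/S1020 can be summarized as a plain function; `device` and its two methods are hypothetical handles standing in for the control unit's signaling, and the mode values follow the MODE 0 / MODE 1 convention above:

```python
def control(device, feat_width, weight_width, n):
    if feat_width == n and weight_width == n:            # S1010
        device.enable_shift_accumulate(False)
        device.set_output_mode(0)                        # no splicing
    elif feat_width == 2 * n and weight_width == n:      # S1020, c = 1 grouping
        device.enable_shift_accumulate(True)
        device.set_output_mode(0)                        # still no splicing
    elif feat_width == 2 * n and weight_width == 2 * n:  # S1020, c = 2 grouping
        device.enable_shift_accumulate(True)
        device.set_output_mode(1)                        # splice 2n-bit halves
    else:
        raise ValueError("unsupported fixed-point bit-width combination")
```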
  • As shown in FIG. 19, an embodiment of the present application also provides a neural network processing device 1900.
  • The neural network processing device 1900 includes a memory 1910 and a processor 1920.
  • The memory 1910 is used to store instructions.
  • The processor 1920 is used to execute the instructions stored in the memory 1910, and execution of those instructions causes the processor 1920 to perform the control method provided in the above method embodiments.
  • Optionally, as shown in FIG. 19, the neural network processing device 1900 further includes a data interface 1930 for data transmission with external devices.
  • An embodiment of the present application also provides a computer storage medium on which a computer program is stored.
  • When the computer program is executed by a computer, the computer performs the control method provided in the above method embodiments.
  • An embodiment of the present application also provides a computer program product containing instructions which, when executed by a computer, cause the computer to perform the control method provided in the above method embodiments.
  • The computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • For example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid-state disk (SSD)).
  • The disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • The functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

Abstract

A neural network computing device and a control method therefor are provided. The computing device includes a systolic array; the processing unit of the systolic array is a first calculation unit, which supports multiplication operands with a fixed-point bit width of n bits, where n is 2 to the power of m and m is a positive integer; the first calculation unit can perform a shift-then-accumulate operation, so that multiple first calculation units in 2 rows and c columns of the systolic array as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits, where c is 1 or 2. By configuring the calculation units in the systolic array to perform the shift-then-accumulate operation, the computing device can support multiple fixed-point bit widths and thereby meet the various fixed-point precision requirements of applications.


Claims (20)

  1. A neural network computing device, characterized in that it comprises a systolic array;
    the processing unit of the systolic array is a first calculation unit, the first calculation unit supporting multiplication operands with a fixed-point bit width of n bits, where n is 2 to the power of m and m is a positive integer;
    the first calculation unit can perform a shift-then-accumulate operation, so that multiple first calculation units in 2 rows and c columns of the systolic array as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits, c being 1 or 2.
  2. The computing device of claim 1, characterized in that it further comprises a control unit configured to:
    when the computing device needs to process input feature data with a fixed-point bit width of n bits, control the first calculation units not to perform the shift-then-accumulate operation, so that the systolic array processes the input feature data with a fixed-point bit width of n bits;
    when the computing device needs to process input feature data with a fixed-point bit width of 2n bits, control one or more of the first calculation units in the 2 rows and c columns forming the second calculation unit to perform the shift-then-accumulate operation, so that the systolic array processes the input feature data with a fixed-point bit width of 2n bits.
  3. The computing device of claim 2, characterized in that c is 2, and the control unit is configured to, when the computing device needs to operate on input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of 2n bits, control some of the first calculation units included in the second calculation unit to perform the shift-then-accumulate operation, so that the two first calculation units in the second of the two rows included in the second calculation unit output, respectively, the low 2n bits and the high 2n bits of the 4n-bit result of the second calculation unit.
  4. The computing device of claim 2 or 3, characterized in that the control unit is further configured to, when the computing device needs to process input feature data with a fixed-point bit width of 2n bits, send the low n bits and the high n bits of the input feature data into the two rows of first calculation units included in the second calculation unit, respectively.
  5. The computing device of any one of claims 2-4, characterized in that c is 2;
    the control unit is further configured to, when the computing device needs to process weights with a fixed-point bit width of 2n bits, send the low n bits and the high n bits of the weights into the two columns of first calculation units included in the second calculation unit, respectively.
  6. The computing device of claim 3, characterized in that it further comprises an output processing unit configured to:
    splice the low 2n-bit result and the high 2n-bit result output by the systolic array for the same second calculation unit to obtain the 4n-bit result of that second calculation unit;
    accumulate the 4n-bit results of the p second calculation units corresponding to the same weight matrix to obtain the output feature data corresponding to the weight matrix, p being equal to the width of the weight matrix.
  7. The computing device of any one of claims 2-6, characterized in that it further comprises:
    a feature data input unit for buffering the input feature data to be processed and sending it into the systolic array according to control signaling from the control unit;
    a weight input unit for buffering the weights to be processed and sending them into the systolic array according to control signaling from the control unit.
  8. The computing device of any one of claims 3-6, characterized in that the storage format of the input feature data with a fixed-point bit width of 2n bits in an external memory is: the low n bits and the high n bits of each row of input feature data in the input feature map are stored in separate contiguous blocks.
  9. The computing device of any one of claims 1-8, characterized in that the computing device is used to perform a convolution operation.
  10. A neural network accelerator, characterized in that it comprises:
    a processing module, the processing module being the neural network computing device of any one of claims 1-9;
    an input module for reading feature data and weights from an external memory and sending them into the processing module;
    an output module for storing the output feature data output by the processing module into the external memory.
  11. A control method for a computing device, characterized in that the computing device comprises a systolic array, the processing unit of the systolic array is a first calculation unit, the first calculation unit supports multiplication operands with a fixed-point bit width of n bits, n is 2 to the power of m, m is a positive integer, and the first calculation unit can perform a shift-then-accumulate operation so that multiple first calculation units in 2 rows and c columns of the systolic array as a whole form a second calculation unit supporting multiplication operands with a fixed-point bit width of 2n bits, c being 1 or 2;
    the control method comprising:
    when the computing device needs to process input feature data with a fixed-point bit width of n bits, controlling the first calculation units not to perform the shift-then-accumulate operation, so that the systolic array processes the input feature data with a fixed-point bit width of n bits;
    when the computing device needs to process input feature data with a fixed-point bit width of 2n bits, controlling one or more of the first calculation units in the 2 rows and c columns forming the second calculation unit to perform the shift-then-accumulate operation, so that the systolic array processes the input feature data with a fixed-point bit width of 2n bits.
  12. The control method of claim 11, characterized in that controlling one or more of the first calculation units in the 2 rows and c columns forming the second calculation unit to perform the shift-then-accumulate operation comprises:
    when the computing device needs to operate on input feature data with a fixed-point bit width of 2n bits and weights with a fixed-point bit width of 2n bits, controlling some of the first calculation units included in the second calculation unit to perform the shift-then-accumulate operation, so that the two first calculation units in the second of the two rows included in the second calculation unit output, respectively, the low 2n bits and the high 2n bits of the 4n-bit result of the second calculation unit.
  13. The control method of claim 11 or 12, characterized in that the control method further comprises:
    when the computing device needs to process input feature data with a fixed-point bit width of 2n bits, sending the low n bits and the high n bits of the input feature data into the two rows of first calculation units included in the second calculation unit, respectively.
  14. The control method of any one of claims 11-13, characterized in that c is 2;
    the control method further comprising:
    when the computing device needs to process weights with a fixed-point bit width of 2n bits, sending the low n bits and the high n bits of the weights into the two columns of first calculation units included in the second calculation unit, respectively.
  15. The control method of claim 12, characterized in that the control method further comprises:
    splicing the low 2n-bit result and the high 2n-bit result output by the systolic array for the same second calculation unit to obtain the 4n-bit result of that second calculation unit;
    accumulating the 4n-bit results of the p second calculation units corresponding to the same weight matrix to obtain the output feature data corresponding to the weight matrix, p being equal to the width of the weight matrix.
  16. The control method of any one of claims 12-15, characterized in that the storage format of the input feature data with a fixed-point bit width of 2n bits in an external memory is: the low n bits and the high n bits of each row of input feature data in the input feature map are stored in separate contiguous blocks.
  17. The control method of any one of claims 11-16, characterized in that the computing device is used to perform a convolution operation.
  18. A neural network processing device, characterized in that it comprises a memory and a processor, the memory being used to store instructions and the processor being used to execute the instructions stored in the memory, execution of which causes the processor to perform the method of any one of claims 11-17.
  19. A computer storage medium, characterized in that a computer program is stored thereon which, when executed by a computer, causes the computer to perform the method of any one of claims 11-17.
  20. A computer program product containing instructions, characterized in that the instructions, when executed by a computer, cause the computer to perform the method of any one of claims 11-17.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080004753.3A 2020-05-22 2020-05-22 Neural network computing device and control method therefor
PCT/CN2020/091883 2020-05-22 2020-05-22 Neural network computing device and control method therefor

Publications (1)

Publication Number Publication Date
WO2021232422A1 true WO2021232422A1 (zh) 2021-11-25

Also Published As

Publication number Publication date
CN112639839A (zh) 2021-04-09
