CN112989269A - Accelerator and on-chip calculation module thereof - Google Patents

Accelerator and on-chip calculation module thereof

Info

Publication number
CN112989269A
CN112989269A (application CN202110326325.9A)
Authority
CN
China
Prior art keywords
module
accelerator
computation
adder
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110326325.9A
Other languages
Chinese (zh)
Other versions
CN112989269B (en)
Inventor
谭黎敏
吕斌
宋捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Westwell Information Technology Co Ltd
Original Assignee
Shanghai Westwell Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Westwell Information Technology Co Ltd filed Critical Shanghai Westwell Information Technology Co Ltd
Priority to CN202110326325.9A
Publication of CN112989269A
Application granted
Publication of CN112989269B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • G06F7/501Half or full adders, i.e. basic adder cells for one denomination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an accelerator and an on-chip computation module for the accelerator. The on-chip computation module of the accelerator comprises: a parameter distribution module configured to distribute calculation parameters; a data distribution module configured to distribute calculation data; a multiply-add module comprising a first adder, a first multiplier, a second adder and a second multiplier connected in sequence, the first adder being connected to the data distribution module; and a plurality of selectors, each selector comprising a first input connected to the data distribution module, a second input connected to the parameter distribution module, and an output, the outputs of the selectors being connected respectively to the first adder, the first multiplier, the second adder and the second multiplier. The first adder, the first multiplier, the second adder, the second multiplier and the selectors are configured so that the on-chip computation module can execute different computation functions. The invention reduces the data bandwidth requirement, improves calculation efficiency, and reduces power consumption in convolutional neural network calculation.

Description

Accelerator and on-chip calculation module thereof
Technical Field
The invention relates to the field of convolutional neural networks, in particular to an accelerator and an on-chip calculation module of the accelerator.
Background
A Convolutional Neural Network (CNN) is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding receptive field, and it performs well on large-scale image processing. It mainly comprises convolutional layers and pooling layers. Convolutional neural networks have been widely used for image classification, object recognition, and target tracking.
Convolutional neural network calculation can be implemented on hardware such as an FPGA (Field-Programmable Gate Array) or a dedicated chip.
The calculations of a convolutional neural network fall into two major categories. The first category is tensor convolution operations, characterized by multiply-accumulate computation as the core; the corresponding operator types in the neural network are convolution, deconvolution, dilated (atrous) convolution, full connection, and the like. The second category comprises batch normalization, linear rectification, the Sigmoid function, quantization calculations, tensor addition, the tensor Hadamard product, and similar calculations. These operations are characterized by element-wise computation, i.e., the operation is applied independently to each element of the tensor.
These calculations can be performed with 16-bit/24-bit/32-bit quantized data. In recent years, low-precision quantized calculation methods, represented by NVIDIA and Google 8-bit quantization, have appeared and offer clear advantages in storage bandwidth, hardware resources, calculation speed, and power consumption.
Convolutional neural network calculation achieves high precision, but it still suffers from large data bandwidth requirements, large storage resource usage, and high power consumption.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an accelerator and an on-chip computation module for the accelerator, which reduce the data bandwidth requirement, storage resource usage, and power consumption of convolutional neural network computation.
According to an aspect of the present invention, there is provided an accelerator on-chip computation module, including:
a parameter distribution module configured to distribute the calculation parameters;
a data distribution module configured to distribute computing data;
the multiplication and addition module comprises a first adder, a first multiplier, a second adder and a second multiplier which are connected in sequence, and the first adder is connected to the data distribution module;
a plurality of selectors, each selector including a first input terminal connected to the data distribution module, a second input terminal connected to the parameter distribution module, and an output terminal, the output terminals of the selectors being respectively connected to the first adder, the first multiplier, the second adder, and the second multiplier, each selector being configured to select data from its first input terminal or its second input terminal to be output at its output terminal,
wherein the first adder, the first multiplier, the second adder, the second multiplier, and the selector are configured to cause the on-accelerator-chip computation module to perform different computation functions.
In some embodiments of the present application, a comparator is further connected between the first multiplier and the first adder, and whether the comparator is enabled is determined according to configuration.
In some embodiments of the present application, the on-accelerator-chip computation module further comprises:
an inverse quantization module connected to an output end of the multiply-add module.
In some embodiments of the present application, the on-accelerator-chip computation module supports cascaded computation of multiple computation functions.
In some embodiments of the present application, the parameter distribution module and the data distribution module read from memory, in a single access, the calculation parameters and calculation data required by the current set of cascaded calculation functions.
In some embodiments of the present application, the computation functions comprise: tensor addition, the tensor Hadamard product, the Sigmoid function, linear rectification, and batch normalization.
In some embodiments of the present application, the computing functionality further comprises: a quantization conversion function.
In some embodiments of the present application, the parameter distribution module and the data distribution module are controlled via a clock signal, and each addition/multiplication in the multiply-add module takes one clock cycle.
In some embodiments of the present application, the adders and multipliers in the multiply-add module that are not in operation transmit data directly through a bypass.
According to still another aspect of the present invention, there is also provided an accelerator, including:
a tensor convolution calculation module;
the on-chip computation module of the accelerator as described above, wherein an input end of the on-chip computation module of the accelerator is connected to an output end of the tensor convolution computation module; and
and an output module, an input end of which is connected to an output end of the on-accelerator-chip computation module.
Compared with the prior art, the invention has the advantages that:
according to the invention, the multiply-add module of the calculation module in the accelerator chip consists of the first adder, the first multiplier, the second adder and the second multiplier which are sequentially connected, so that different calculation functions and combination of the calculation functions can be realized according to different operator requirements, and multiple multiplexing modes can be realized.
Drawings
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 shows a schematic diagram of an accelerator according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of an on-accelerator-chip computation module according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a Sigmoid function calculation using an on-accelerator computation module according to an embodiment of the invention;
FIG. 4 illustrates a schematic diagram of tensor addition calculations using an on-accelerator computation module, according to an embodiment of the present invention;
FIG. 5 illustrates a schematic diagram of tensor Hadamard product computation using an on-accelerator computation module, according to an embodiment of the invention;
FIG. 6 shows a schematic diagram of a linear rectification calculation using an on-accelerator computation module according to an embodiment of the invention;
FIG. 7 illustrates a schematic diagram of batch normalization calculations using an on-accelerator-chip calculation module, according to an embodiment of the invention;
FIG. 8 shows a diagram of a quantization transformation calculation using an on-accelerator-chip calculation module according to an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Tensor element-by-element computations in a neural network include operators such as tensor addition, the tensor Hadamard product, linear rectification (PReLU, ReLU and ReLU6), batch normalization (BatchNormal), and Sigmoid. In a hardware implementation, if each operator is implemented independently, each operator occupies fixed hardware resources, and even more resources are occupied during multi-path parallel computation. After fully analyzing the characteristics of element-by-element computation, and combining this with optimizations of the operator implementations, the invention designs an on-chip computation module for a neural network accelerator in which 8-bit/16-bit fixed-point computation resources can be reused. The invention defines a neural network accelerator architecture aimed at optimizing hardware computing resources; it can compute element-by-element operators such as Sigmoid, BatchNormal, PReLU, point-wise multiplication, point-wise addition, and ReLU6 for a 16-bit quantized network in a pipelined manner, and also supports computation for an 8-bit quantized network.
Specifically, the invention provides an accelerator and an on-chip calculation module of the accelerator. The accelerator provided by the present invention will be described with reference to fig. 1; the accelerator on-chip computation module provided by the present invention is further described in conjunction with fig. 2-8.
As shown in fig. 1, the accelerator 100 includes a tensor convolution calculation module 110, an on-accelerator-chip calculation module 120, and an output module 130. The input of the on-chip computation module 120 of the accelerator is connected to the output of the tensor convolution computation module 110. The input end of the output module 130 is connected to the output end of the accelerator on-chip computation module 120.
Specifically, the accelerator 100 receives feature and weight data; the data enters the tensor convolution calculation module 110 for convolution calculation and then enters the on-accelerator-chip computation module 120. The on-accelerator-chip computation module 120 is designed so that fixed-point computation resources can be reused; it can perform the operations of operators such as Sigmoid, linear rectification (PReLU, ReLU and ReLU6), batch normalization (BatchNormal), tensor addition, and the tensor Hadamard product, and has the capability of fusing and cascading new operators. The final calculation result is sent to the output module 130, which completes the data format conversion and outputs the data.
The structure of the on-accelerator-chip computation module 120 is shown in fig. 2. The on-accelerator-chip computation module 120 includes a parameter distribution module 121, a data distribution module 122, a multiply-add module 123, and a plurality of selectors 124a-124d.
The parameter distribution module 121 is configured to distribute calculation parameters. The data distribution module 122 is configured to distribute calculation data. The multiply-add module 123 includes a first adder 123a, a first multiplier 123b, a second adder 123c, and a second multiplier 123d connected in sequence. The first adder 123a is connected to the data distribution module 122. Specifically, the adders and multipliers in the multiply-add module 123 that are not in operation transmit data directly through a bypass. Each selector 124a-124d includes a first input connected to the data distribution module 122, a second input connected to the parameter distribution module 121, and an output. The outputs of the selectors 124a-124d are connected respectively to the first adder 123a, the first multiplier 123b, the second adder 123c, and the second multiplier 123d. Each selector 124a-124d is configured to select data from its first input or its second input to be output at its output. The first adder 123a, the first multiplier 123b, the second adder 123c, the second multiplier 123d, and the selectors 124a-124d are configured so that the on-accelerator-chip computation module 120 can perform different computation functions. In particular, the on-accelerator-chip computation module 120 may support cascaded computation of multiple computation functions. The computation functions may include, but are not limited to, tensor addition, the tensor Hadamard product, the Sigmoid function, linear rectification, and batch normalization. The computation functions may also include a quantization conversion function.
Specifically, the parameter distribution module 121 and the data distribution module 122 read from external storage, in a single access, the calculation parameters and calculation data required by the current set of cascaded calculation functions. This reduces the number of data reads and improves calculation efficiency. Further, the parameter distribution module 121 and the data distribution module 122 are controlled by a clock signal, and each addition/multiplication in the multiply-add module 123 takes one clock cycle. Pipelined control is thereby realized, increasing the calculation speed.
In some embodiments of the present application, a comparator 123e is further connected between the first multiplier 123b and the second adder 123c, and the comparator 123e is enabled or bypassed according to configuration. Specifically, the comparator 123e can be used in computation functions such as ReLU6/ReLU that need to perform data comparisons.
In some embodiments of the present application, the accelerator on-chip computation module 120 may further include an inverse quantization module 125. The inverse quantization module 125 is connected to the output of the multiply-add module 123 to perform a quantization transformation function.
Thus, in the on-accelerator-chip computation module 120 provided by the present invention, the single-path or dual-path data to be computed enters the unit and is sent to the data distribution module 122, which distributes the fixed-point quantized data to the nodes (adders and multipliers) of the multiply-add module 123 according to the computation requirements of the current operator. The parameter distribution module 121 receives calculation parameters preset by the user and distributes them to the nodes (adders and multipliers) of the multiply-add module 123. Each selector selects the data input to its computation node according to the computation requirements of the current operator. The multiply-add module 123 consists of 4 computation nodes, namely 2 adders and 2 multipliers; the computation nodes required by a given operator (computation function) are enabled according to its computation characteristics, and the final result is output. A comparator is provided at a computation node to support operators that need data comparison. The inverse quantization module 125 can convert linear quantized data, such as the INT16 output of the Sigmoid function, into INT8. The data and parameter inputs are linearly quantized data with bit widths such as 8, 12, 16, 24, and 32 bits.
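To make the dataflow concrete, the following Python sketch gives one possible behavioural model of the four-node pipeline and its selectors; the function names, the configuration dictionary and the scalar example values are illustrative assumptions, not the patent's hardware description.

# Behavioural sketch (an assumption for illustration, not the patent's RTL) of the
# four-node multiply-add pipeline: first adder -> first multiplier -> second adder
# -> second multiplier. Each node's second operand comes from its selector, which
# picks either distributed data or a distributed parameter; nodes not used by the
# current operator are bypassed.

def run_pipeline(x, cfg):
    """cfg maps a node name to (enabled, operand); operand is whatever that node's
    selector has chosen (data or parameter) for the current operator."""
    def node(value, name, op):
        enabled, operand = cfg.get(name, (False, None))
        if not enabled:                         # bypass: pass the data straight through
            return value
        return op(value, operand)

    v = node(x, "add1", lambda a, b: a + b)     # first adder      (fed by selector 124a)
    v = node(v, "mul1", lambda a, b: a * b)     # first multiplier (selector 124b)
    v = node(v, "add2", lambda a, b: a + b)     # second adder     (selector 124c)
    v = node(v, "mul2", lambda a, b: a * b)     # second multiplier (selector 124d)
    return v

# Tensor addition: only the first adder works and its selector passes the second
# data path; all other nodes are bypassed.
print(run_pipeline(3, {"add1": (True, 5)}))     # 8
# Tensor Hadamard product: only the first multiplier works.
print(run_pipeline(3, {"mul1": (True, 5)}))     # 15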
Thus, the invention designs a resource-multiplexing computation structure that supports multiple element-by-element computation operators in neural network computation. During neural network computation this structure is connected after the tensor convolution computation structure (the first category above), so that composite operations of multiple operators can be realized, reducing the use of storage resources, improving calculation efficiency, and reducing power consumption. Meanwhile, the architecture can be applied independently to the computation of 16-bit and 8-bit quantized networks, realizing dual support for 16-bit and 8-bit quantized computation with a single set of fixed hardware resources.
Referring now to fig. 3, fig. 3 is a diagram illustrating Sigmoid function calculation using an on-accelerator computation module according to an embodiment of the present invention.
Among current element-by-element computation operators, the most complex is the activation function Sigmoid. Hardware implementations of the activation function include the look-up-table method, coordinate rotation digital computation (CORDIC), and piecewise linear function approximation. Considering hardware resources and processing speed, the invention implements the Sigmoid function by a piecewise function approximation: the Sigmoid function is divided into several segments, and each segment is fitted with a second-order polynomial. The fitting polynomial is:
f(x) = p0·x² + p1·x + p2    formula (1)
This form requires 4 multiplications and 2 additions. To reduce computational resources, the invention transforms the above formula; the fitted function after transformation is:
f(x) = p′0·[x·(x + p′1) + p′2]    formula (2)
This form uses 2 multiplications and 2 additions, i.e., 2 fewer multiplications than formula (1).
Therefore, the multiply-add module 123 is designed as a computation structure with two additions and two multiplications so that it can implement the Sigmoid function.
When a Sigmoid operation is performed, the operating principle of the on-accelerator-chip computation module 120 is as shown in fig. 3; dashed lines denote units or paths that are not currently working. The coefficients p′0, p′1 and p′2 in formula (2) are sent through selector 124d, selector 124a and selector 124c, respectively, to the corresponding computation nodes. The computation time of each node is 1 clock cycle, and the result of the Sigmoid function can be output after 4 clock cycles.
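The saving from the rearrangement can be checked numerically. The sketch below fits one illustrative segment of the Sigmoid curve and verifies that formula (2) with p′0 = p0, p′1 = p1/p0 and p′2 = p2/p0 reproduces formula (1); the segment boundaries and the use of numpy's polyfit are assumptions for demonstration only, since the patent fixes only the functional form.

import numpy as np

# Numerical check (illustrative assumption) that the rearranged fit
#     f(x) = p0' * (x*(x + p1') + p2')        # 2 multiplications, 2 additions
# equals the direct second-order polynomial
#     f(x) = p0*x^2 + p1*x + p2
# when p0' = p0, p1' = p1/p0, p2' = p2/p0.
xs = np.linspace(0.0, 1.0, 200)                       # one example segment
sig = 1.0 / (1.0 + np.exp(-xs))
p0, p1, p2 = np.polyfit(xs, sig, 2)                   # direct fit: p0*x^2 + p1*x + p2
q0, q1, q2 = p0, p1 / p0, p2 / p0                     # rearranged coefficients

direct     = p0 * xs**2 + p1 * xs + p2
rearranged = q0 * (xs * (xs + q1) + q2)               # add(+q1), mul(*x), add(+q2), mul(*q0)
print(np.max(np.abs(direct - rearranged)))            # ~1e-16: the two forms agree
print(np.max(np.abs(direct - sig)))                   # fitting error on this segment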
Referring now to fig. 4, fig. 4 is a schematic diagram illustrating tensor addition calculations using an on-accelerator computation module according to an embodiment of the present invention.
FIG. 4 shows the hardware implementation structure for tensor addition; modules shown in dashed boxes are not working, and only the modules shown in solid boxes work. When adding two tensors of the same dimensions, the data distribution module 122 receives two paths of tensor data: one path is sent directly to the first adder 123a, and the other path is sent to the first adder 123a through selector 124a; the first adder 123a completes the element-by-element addition of the two tensors.
Referring now to fig. 5, fig. 5 is a schematic diagram illustrating tensor hadamard product computation using an on-accelerator computation module according to an embodiment of the invention.
The Hadamard product is an operation on matrices: if A = (aij) and B = (bij) are two matrices of the same order, and cij = aij × bij, then the matrix C = (cij) is called the Hadamard product of A and B.
In the hardware implementation structure for the Hadamard product, the modules shown in dashed boxes are not working, and only the modules shown in solid boxes work. When the Hadamard product is computed, the data distribution module 122 receives two paths of tensor data: one path is sent to the first multiplier 123b through the (currently inactive) first adder 123a, and the other path is sent to the first multiplier 123b through selector 124b; the first multiplier 123b completes the element-by-element multiplication of the two tensors.
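As a small numerical illustration of the definition above (the matrix values are arbitrary):

import numpy as np

# Element-by-element (Hadamard) product of two same-order matrices: c_ij = a_ij * b_ij.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A * B)     # [[ 5 12] [21 32]] -- numpy's * is already element-wise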
Referring now to FIG. 6, FIG. 6 illustrates a schematic diagram of a linear rectification computation using an on-accelerator computation module, according to an embodiment of the invention.
The following description takes PReLU as an example. PReLU is a ReLU with parameters and is defined as follows:
f(xi) = xi when xi > 0;  f(xi) = ai·xi when xi ≤ 0
The parameter distribution module 121 distributes the PReLU parameter ai to selector 124b; the first multiplier 123b receives the parameter ai and multiplies it by the corresponding xi distributed by the data distribution module 122. After the multiplication, the comparator 123e compares the sign bit: if xi is positive, the original xi is output; if negative, the result of the multiplier is output.
When the ReLU6/ReLU operator is computed, the multiplier 123b and selector 124b are not needed by the formula; only the comparator 123e is reused to compare data magnitudes and output the result.
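The following sketch models the PReLU path described above (first multiplier plus comparator) and the comparator-only ReLU6 path; the vector values and the parameter a are illustrative assumptions.

import numpy as np

# PReLU on the pipeline nodes: the first multiplier computes a*x, then the comparator
# checks the sign of x and selects the original x (positive) or the multiplier result
# (negative). With a = 0 this reduces to ReLU; clamping at 6 gives ReLU6, for which
# the multiplier is bypassed and only the comparator is used.
def prelu(x, a):
    scaled = a * x                              # first multiplier 123b (parameter a via selector 124b)
    return np.where(x > 0, x, scaled)           # comparator 123e on the sign bit

def relu6(x):
    return np.minimum(np.maximum(x, 0), 6)      # comparator-only path

x = np.array([-2.0, -0.5, 0.0, 3.0, 9.0])
print(prelu(x, 0.25))                           # [-0.5 -0.125 0. 3. 9.]
print(relu6(x))                                 # [0. 0. 0. 3. 6.]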
Referring now to FIG. 7, FIG. 7 illustrates a schematic diagram of batch normalization calculations using an on-accelerator-chip calculation module, according to an embodiment of the invention.
In an actual network, the BatchNormal operator may follow a convolution-class operator or an activation function such as ReLU. During computation, a pipelined mode in which more operators are cascaded in hardware reduces the use of external storage bandwidth and achieves faster calculation with given computing resources; therefore, the invention converts formula (3) into formula (4):
f(x) = t·x + p    formula (3)
f(x) = t·(x + p/t)    formula (4)
Here the coefficients t and p/t are obtained and distributed by the parameter distribution module 121, x is the data distributed by the data distribution module 122, and the BatchNormal computation is completed by the second adder 123c and the second multiplier 123d.
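A short numerical check of the rewrite from formula (3) to formula (4); the folded scale t and shift p below are illustrative values.

import numpy as np

# f(x) = t*x + p computed as t*(x + p/t): the second adder adds the precomputed p/t
# and the second multiplier multiplies by t, preserving the pipeline's "add then
# multiply" order. t and p stand for the folded BatchNorm scale and shift.
t, p = 0.8, 1.6
x = np.array([-1.0, 0.0, 2.5])

direct   = t * x + p                  # formula (3)
pipeline = t * (x + p / t)            # formula (4): second adder, then second multiplier
print(np.allclose(direct, pipeline))  # True
print(pipeline)                       # [0.8 1.6 3.6]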
Referring now to fig. 8, fig. 8 is a diagram illustrating a quantization transformation calculation using an on-accelerator computation module according to an embodiment of the present invention.
To achieve faster inference on mobile terminals, an INT8 quantization method with a lower bit width is often adopted in the network; lower quantization precision reduces data throughput and computational power consumption. The present computation unit can also be incorporated into the computation of an INT8 quantized network after completing INT16 computation: the conversion from INT16 to INT8 can be realized by the following design.
The commonly used INT8 quantization schemes are NVIDIA's TensorRT scheme and Google's INT8 scheme. In NVIDIA's TensorRT scheme, 16-bit quantized data is converted into 8-bit quantized data as follows:
Xint8_nvidia = [Xint16 × M]
where Xint16 is the data distributed by the data distribution module 122, Xint8_nvidia is the 8-bit quantized output data to be obtained, and M is a conversion parameter.
The conversion from 16-bit to 8-bit in Google's INT8 scheme is:
Xint8_google = [Xint16 × M] + Zx/quant
where Xint16 is the data distributed by the data distribution module 122, Xint8_google is the output data to be obtained, M is a conversion parameter, and Zx/quant is the quantization zero point (not constantly 0) in this scheme.
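The two conversion formulas can be sketched as follows, assuming the square brackets denote rounding to the nearest integer; the scale M, the zero point and the clipping to the INT8 range are illustrative assumptions added for the sketch.

import numpy as np

# INT16-to-INT8 conversion sketches. The symmetric variant follows the TensorRT-style
# formula above (no zero point); the asymmetric variant follows the Google-style
# formula (adds a zero point). M and zero_point are example values, not from the patent.
def int8_symmetric(x_int16, M):
    return np.clip(np.rint(x_int16 * M), -128, 127).astype(np.int8)

def int8_asymmetric(x_int16, M, zero_point):
    return np.clip(np.rint(x_int16 * M) + zero_point, -128, 127).astype(np.int8)

x = np.array([-20000, -3000, 0, 5000, 30000], dtype=np.int32)
M = 127.0 / 32767.0                      # example scale mapping the INT16 range onto INT8
print(int8_symmetric(x, M))              # [-78 -12   0  19 116]
print(int8_asymmetric(x, M, 5))          # same values shifted by the zero point (and clipped)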
Further, with reference to fig. 3, consider Sigmoid: because the input of Sigmoid is INT16 data, in order to use the Sigmoid operator in an 8-bit low-precision network, an inverse quantization module 125 is designed in the unit; the conversion from the INT16/INT24/INT32 output to Google 8-bit quantization can be realized by simple shifting, using few resources. The Sigmoid conversion from INT16 to INT8 is:
(Sigmoid INT16-to-INT8 conversion formula)
where Sigmoid_int8 is the data to be obtained, Xfloat is the difference between the maximum and minimum values of the floating-point number X, Sfloat is 255, Zero_x_quant is the conversion zero point (not constantly 0) of the conversion scheme, and Xint16 is the output data of the multiply-add module 123.
Thus, by controlling each computation node, the on-accelerator-chip computation module 120 provided by the invention can realize cascading among more operators, reducing the use of storage resources and improving calculation efficiency.
If ReLU/ReLU6/PReLU is taken as the basis, the operators can be cascaded:
ADD (adder) + ReLU/ReLU6/PReLU
MUL (multiplier) + ReLU/ReLU6/PReLU
ReLU/ReLU6/PReLU+BatchNormal
ADD (adder) + ReLU/ReLU6/PReLU + BatchNormal
MUL (multiplier) + ReLU/ReLU6/PReLU + BatchNormal
If the BatchNormal operator is taken as the basis, the operators can be cascaded respectively:
ADD (adder) + BatchNormal
MUL (multiplier) + BatchNormal
ADD (adder) + MUL (multiplier) + BatchNormal
Combined cascades among operators such as ADD (adder) + MUL (multiplier) can also be realized, and more tensor element-by-element operators can be supported through flexible control of the computation nodes; a sketch of such a cascade follows.
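As an illustration, the sketch below evaluates one such cascade, ADD + PReLU + BatchNormal, as a single pass over the pipeline nodes; the parameter values are illustrative assumptions.

import numpy as np

# One pass through the pipeline for the cascade ADD + PReLU + BatchNormal:
# first adder does the tensor addition, first multiplier plus comparator do PReLU,
# second adder and second multiplier do the folded BatchNormal (t*(s + p/t)),
# so the composite operator needs a single read of data and parameters.
def add_prelu_batchnormal(x, y, a, t, p):
    s = x + y                              # first adder: tensor addition
    s = np.where(s > 0, s, a * s)          # first multiplier + comparator: PReLU
    return t * (s + p / t)                 # second adder + second multiplier: BatchNormal

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 0.2, 0.5, 1.0])
print(add_prelu_batchnormal(x, y, a=0.1, t=0.8, p=1.6))   # [1.536 2.4 4.]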
Further, the on-accelerator-chip computation module 120, with the first adder 123a, the first multiplier 123b, the second adder 123c and the second multiplier 123d as its core, can complete computation centered on Sigmoid; the Sigmoid implementation uses few computation units (only 2 multiplications and 2 additions) and few computation cycles, and its input and output data are compatible with the quantization formats INT8/INT16/INT24/INT32.
The invention realizes a hardware structure for operators with resource multiplexing: on the basis of the hardware resources of the computation unit, tensor element-by-element operators in the neural network, such as tensor addition, Hadamard multiplication, ReLU/ReLU6/PReLU, BatchNormal, and INT16-to-INT8 conversion, are realized by reusing computation resources and flexibly controlling the nodes of the computation unit.
The invention realizes the conversion of Sigmoid output to INT8: after the input Sigmoid computation is done in INT16, the output INT16/INT24/INT32 data is converted into Google INT8 quantization. Based on analysis of the conversion principle, the conversion here is completed by a simple shift and has the advantage of low resource usage.
The invention realizes a method for cascading multiple operators: by controlling the computation nodes, nodes are enabled or bypassed and different computation data are transmitted, so that cascades among multiple different operators can be realized on this architecture.
Compared with the prior art, the invention has the advantages that:
by making the multiply-add module of the on-accelerator-chip computation module consist of a first adder, a first multiplier, a second adder and a second multiplier connected in sequence, the invention can realize different computation functions and combinations of computation functions according to different operator requirements, achieving multiple modes of resource multiplexing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. An on-chip computation module for an accelerator, comprising:
a parameter distribution module configured to distribute the calculation parameters;
a data distribution module configured to distribute computing data;
the multiplication and addition module comprises a first adder, a first multiplier, a second adder and a second multiplier which are connected in sequence, and the first adder is connected to the data distribution module;
a plurality of selectors, each selector including a first input terminal connected to the data distribution module, a second input terminal connected to the parameter distribution module, and an output terminal, the output terminals of the selectors being respectively connected to the first adder, the first multiplier, the second adder, and the second multiplier, each selector being configured to select data from its first input terminal or its second input terminal to be output at its output terminal,
wherein the first adder, the first multiplier, the second adder, the second multiplier, and the selector are configured to cause the on-accelerator-chip computation module to perform different computation functions.
2. The on-accelerator-chip computation module of claim 1, wherein a comparator is further connected between the first multiplier and the first adder, and whether the comparator is enabled is determined according to configuration.
3. The on-accelerator-chip computation module of claim 1, further comprising:
an inverse quantization module connected to an output end of the multiply-add module.
4. The on-accelerator-chip computing module of claim 1, wherein the on-accelerator-chip computing module supports cascaded computation of a plurality of computing functions.
5. The on-accelerator-chip computation module of claim 4, wherein the parameter distribution module and the data distribution module read from memory computation parameters and computation data required for a current plurality of cascaded computation functions at a time.
6. The on-accelerator-chip computation module of any one of claims 1 to 5, wherein the computation functions comprise: tensor addition, the tensor Hadamard product, the Sigmoid function, linear rectification, and batch normalization.
7. The on-accelerator-chip computation module of any one of claims 1 to 5, wherein the computation function further comprises: a quantization conversion function.
8. The on-chip computation module of any one of claims 1 to 5, wherein the parameter distribution module and the data distribution module are controlled via a clock signal, and a time of computation of each addition/multiplication in the multiply-add module is a time of one clock signal.
9. The on-chip computation module of any one of claims 1 to 5, wherein the adders and multipliers in the multiply-add module that are not in operation directly transmit data by bypassing.
10. An accelerator, comprising:
a tensor convolution calculation module;
the on-chip computation module of any one of claims 1 to 9, an input of the on-chip computation module being connected to an output of the tensor convolution computation module; and
and an output module, an input end of which is connected to an output end of the on-accelerator-chip computation module.
CN202110326325.9A 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator Active CN112989269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326325.9A CN112989269B (en) 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110326325.9A CN112989269B (en) 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator

Publications (2)

Publication Number Publication Date
CN112989269A true CN112989269A (en) 2021-06-18
CN112989269B CN112989269B (en) 2023-07-25

Family

ID=76333862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326325.9A Active CN112989269B (en) 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator

Country Status (1)

Country Link
CN (1) CN112989269B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310816A (en) * 1998-07-22 2001-08-29 Motorola Inc. Circuit and method of modulo multiplication
JP2013239120A (en) * 2012-05-17 2013-11-28 Olympus Corp Image processing device
CN102799412A (en) * 2012-07-09 2012-11-28 Shanghai University CORDIC (coordinate rotation digital computer) accelerator based on parallel pipeline design
CN111461313A (en) * 2020-03-27 2020-07-28 Hefei University of Technology Convolution neural network hardware accelerator based on lightweight network and calculation method thereof
US20200320375A1 (en) * 2020-05-05 2020-10-08 Intel Corporation Accelerating neural networks with low precision-based multiplication and exploiting sparsity in higher order bits
CN111667051A (en) * 2020-05-27 2020-09-15 Shanghai StarFive Technology Co., Ltd. Neural network accelerator suitable for edge equipment and neural network acceleration calculation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FLAVIUS OPRITOIU et al.: "FPGA-Based Single Precision Iterative Floating Point Multiplier for Educational Use", 2014 IEEE 20th International Symposium for Design and Technology in Electronic Packaging (SIITME), pages 305-308
LIU Qinrang; LIU Chongyang: "Computation Optimization of Convolutional Neural Networks Utilizing Parameter Sparsity and Its FPGA Accelerator Design", Journal of Electronics &amp; Information Technology, no. 06

Also Published As

Publication number Publication date
CN112989269B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
EP3460726B1 (en) Hardware implementation of a deep neural network with variable output data format
US11645224B2 (en) Neural processing accelerator
CN109828744B (en) Configurable floating point vector multiplication IP core based on FPGA
CN110705703B (en) Sparse neural network processor based on systolic array
EP3931756A1 (en) Neural network layer processing with normalization and transformation of data
CN110222833B (en) Data processing circuit for neural network
CN111694544B (en) Multi-bit multiplexing multiply-add operation device, neural network operation system, and electronic apparatus
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
TW202020654A (en) Digital circuit with compressed carry
CN112989269B (en) Accelerator and on-chip computing module for accelerator
CN107783935B (en) Approximate calculation reconfigurable array based on dynamic precision configurable operation
CN112884146A (en) Method and system for training model based on data quantization and hardware acceleration
CN110766136A (en) Compression method of sparse matrix and vector
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN115545175A (en) Running bidirectional recurrent neural networks in hardware
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
Tayal et al. Partial product based improved reconfigurable FIR filter with control logic for automated guided vehicles on virtex-7 FPGA
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
US11687336B2 (en) Extensible multi-precision data pipeline for computing non-linear and arithmetic functions in artificial neural networks
CN113793601B (en) Voice recognition method and device
CN115081604A (en) Buffer for temporarily storing Winograd weight, computing device, integrated circuit device and board card
CN113504893B (en) Intelligent chip architecture and method for efficiently processing data
EP4293576A1 (en) Hardware implementation of an attention-based neural network
CN115079927A (en) Temporary storage of convolution results, computing device, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 503-3, 398 Jiangsu Road, Changning District, Shanghai 200050

Applicant after: Shanghai Xijing Technology Co.,Ltd.

Address before: Room 503-3, 398 Jiangsu Road, Changning District, Shanghai 200050

Applicant before: SHANGHAI WESTWELL INFORMATION AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant