CN112989269A - Accelerator and on-chip calculation module thereof - Google Patents

Accelerator and on-chip calculation module thereof

Info

Publication number
CN112989269A
CN112989269A (application CN202110326325.9A)
Authority
CN
China
Prior art keywords
module
accelerator
computation
adder
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110326325.9A
Other languages
Chinese (zh)
Other versions
CN112989269B (en)
Inventor
谭黎敏
吕斌
宋捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Westwell Information Technology Co Ltd
Original Assignee
Shanghai Westwell Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Westwell Information Technology Co Ltd filed Critical Shanghai Westwell Information Technology Co Ltd
Priority to CN202110326325.9A
Publication of CN112989269A
Application granted
Publication of CN112989269B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • G06F7/501Half or full adders, i.e. basic adder cells for one denomination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an accelerator and an on-chip computation module for the accelerator. The on-chip computation module of the accelerator comprises: a parameter distribution module configured to distribute calculation parameters; a data distribution module configured to distribute calculation data; a multiply-add module comprising a first adder, a first multiplier, a second adder and a second multiplier connected in sequence, the first adder being connected to the data distribution module; and a plurality of selectors, each selector comprising a first input connected to the data distribution module, a second input connected to the parameter distribution module, and an output, the outputs of the selectors being connected respectively to the first adder, the first multiplier, the second adder and the second multiplier. The first adder, the first multiplier, the second adder, the second multiplier and the selectors are configured so that the on-chip computation module can execute different computation functions. The invention reduces the data bandwidth requirement, improves calculation efficiency, and reduces power consumption in convolutional neural network calculation.

Description

Accelerator and on-chip calculation module thereof
Technical Field
The invention relates to the field of convolutional neural networks, in particular to an accelerator and an on-chip calculation module of the accelerator.
Background
A Convolutional Neural Network (CNN) is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding receptive field, and it performs well on large-scale image processing. It mainly comprises convolutional layers and pooling layers. Convolutional neural networks have been widely used for image classification, object recognition, and target tracking.
Convolutional neural network calculation can be implemented on hardware such as an FPGA (Field-Programmable Gate Array) or a dedicated chip.
The calculations of a convolutional neural network fall into two major categories. The first category is tensor convolution operations, characterized by multiply-accumulate computation as the core; the corresponding operator types in the neural network are convolution, deconvolution, dilated (atrous) convolution, full connection, and the like. The second category comprises batch normalization, linear rectification, the Sigmoid function, quantization calculations, tensor addition, the tensor Hadamard product, and similar calculations. These operations are characterized by element-wise computation, i.e., the operation is applied independently to each element of the tensor.
These calculations can be performed with 16-bit/24-bit/32-bit quantized data. In recent years, low-precision quantized calculation methods, represented by NVIDIA and Google 8-bit quantization, have appeared and offer clear advantages in storage bandwidth, hardware resources, calculation speed, and power consumption.
Convolutional neural network calculation achieves high precision, but it still suffers from large data bandwidth requirements, large storage resource usage, and high power consumption.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an accelerator and an on-chip computation module for the accelerator, which reduce the data bandwidth requirement, storage resource usage, and power consumption of convolutional neural network computation.
According to an aspect of the present invention, there is provided an accelerator on-chip computation module, including:
a parameter distribution module configured to distribute the calculation parameters;
a data distribution module configured to distribute computing data;
the multiplication and addition module comprises a first adder, a first multiplier, a second adder and a second multiplier which are connected in sequence, and the first adder is connected to the data distribution module;
a plurality of selectors, each selector including a first input terminal connected to the data distribution module, a second input terminal connected to the parameter distribution module, and an output terminal, the output terminals of the selectors being respectively connected to the first adder, the first multiplier, the second adder, and the second multiplier, each selector being configured to select data from its first input terminal or its second input terminal to be output at its output terminal,
wherein the first adder, the first multiplier, the second adder, the second multiplier, and the selector are configured to cause the on-accelerator-chip computation module to perform different computation functions.
In some embodiments of the present application, a comparator is further connected between the first multiplier and the first adder, and whether the comparator is enabled is determined according to configuration.
In some embodiments of the present application, the on-accelerator-chip computation module further comprises:
an inverse quantization module connected to an output end of the multiply-add module.
In some embodiments of the present application, the on-accelerator-chip computation module supports cascaded computation of multiple computation functions.
In some embodiments of the present application, the parameter distribution module and the data distribution module read from memory, in a single access, the calculation parameters and calculation data required by the current set of cascaded calculation functions.
In some embodiments of the present application, the computation functions comprise: tensor addition, the tensor Hadamard product, the Sigmoid function, linear rectification, and batch normalization.
In some embodiments of the present application, the computing functionality further comprises: a quantization conversion function.
In some embodiments of the present application, the parameter distribution module and the data distribution module are controlled via a clock signal, and each addition/multiplication in the multiply-add module takes one clock cycle.
In some embodiments of the present application, the adders and multipliers in the multiply-add module that are not in operation transmit data directly through a bypass.
According to still another aspect of the present invention, there is also provided an accelerator, including:
a tensor convolution calculation module;
the on-chip computation module of the accelerator as described above, wherein an input end of the on-chip computation module of the accelerator is connected to an output end of the tensor convolution computation module; and
and an output module, an input end of which is connected to an output end of the on-accelerator-chip computation module.
Compared with the prior art, the invention has the advantages that:
according to the invention, the multiply-add module of the calculation module in the accelerator chip consists of the first adder, the first multiplier, the second adder and the second multiplier which are sequentially connected, so that different calculation functions and combination of the calculation functions can be realized according to different operator requirements, and multiple multiplexing modes can be realized.
Drawings
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 shows a schematic diagram of an accelerator according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of an on-accelerator-chip computation module according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a Sigmoid function calculation using an on-accelerator computation module according to an embodiment of the invention;
FIG. 4 illustrates a schematic diagram of tensor addition calculations using an on-accelerator computation module, according to an embodiment of the present invention;
FIG. 5 illustrates a schematic diagram of tensor Hadamard product computation using an on-accelerator computation module, according to an embodiment of the invention;
FIG. 6 shows a schematic diagram of a linear rectification calculation using an on-accelerator computation module according to an embodiment of the invention;
FIG. 7 illustrates a schematic diagram of batch normalization calculations using an on-accelerator-chip calculation module, according to an embodiment of the invention;
FIG. 8 shows a diagram of a quantization transformation calculation using an on-accelerator-chip calculation module according to an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Tensor element-by-element computations in a neural network include operators such as tensor addition, the tensor Hadamard product, linear rectification (PReLU, ReLU and ReLU6), batch normalization (BatchNormal), and Sigmoid. In a hardware implementation, if each operator is implemented independently, each operator occupies fixed hardware resources, and even more resources are occupied during multi-path parallel computation. After fully analyzing the characteristics of element-by-element computation, and combining this with optimizations of the operator implementations, the invention designs an on-chip computation module for a neural network accelerator in which 8-bit/16-bit fixed-point computation resources can be reused. The invention defines a neural network accelerator architecture aimed at optimizing hardware computing resources; it can compute element-by-element operators such as Sigmoid, BatchNormal, PReLU, point-wise multiplication, point-wise addition, and ReLU6 for a 16-bit quantized network in a pipelined manner, and also supports computation for an 8-bit quantized network.
Specifically, the invention provides an accelerator and an on-chip calculation module of the accelerator. The accelerator provided by the present invention will be described with reference to fig. 1; the accelerator on-chip computation module provided by the present invention is further described in conjunction with fig. 2-8.
As shown in fig. 1, the accelerator 100 includes a tensor convolution calculation module 110, an on-accelerator-chip calculation module 120, and an output module 130. The input of the on-chip computation module 120 of the accelerator is connected to the output of the tensor convolution computation module 110. The input end of the output module 130 is connected to the output end of the accelerator on-chip computation module 120.
Specifically, the accelerator 100 receives feature and weight data; the data enters the tensor convolution calculation module 110 for convolution calculation and then enters the on-accelerator-chip computation module 120. The on-accelerator-chip computation module 120 is designed so that fixed-point computation resources can be reused; it can perform the operations of operators such as Sigmoid, linear rectification (PReLU, ReLU and ReLU6), batch normalization (BatchNormal), tensor addition, and the tensor Hadamard product, and has the capability of fusing and cascading new operators. The final calculation result is sent to the output module 130, which completes the data format conversion and outputs the data.
The structure of the on-accelerator-chip computation module 120 is shown in fig. 2. The on-accelerator-chip computation module 120 includes a parameter distribution module 121, a data distribution module 122, a multiply-add module 123, and a plurality of selectors 124a-124d.
The parameter distribution module 121 is configured to distribute calculation parameters. The data distribution module 122 is configured to distribute calculation data. The multiply-add module 123 includes a first adder 123a, a first multiplier 123b, a second adder 123c, and a second multiplier 123d connected in sequence. The first adder 123a is connected to the data distribution module 122. Specifically, the adders and multipliers in the multiply-add module 123 that are not in operation transmit data directly through a bypass. Each selector 124a-124d includes a first input connected to the data distribution module 122, a second input connected to the parameter distribution module 121, and an output. The outputs of the selectors 124a-124d are connected respectively to the first adder 123a, the first multiplier 123b, the second adder 123c, and the second multiplier 123d. Each selector 124a-124d is configured to select data from its first input or its second input to be output at its output. The first adder 123a, the first multiplier 123b, the second adder 123c, the second multiplier 123d, and the selectors 124a-124d are configured so that the on-accelerator-chip computation module 120 can perform different computation functions. In particular, the on-accelerator-chip computation module 120 may support cascaded computation of multiple computation functions. The computation functions may include, but are not limited to, tensor addition, the tensor Hadamard product, the Sigmoid function, linear rectification, and batch normalization. The computation functions may also include a quantization conversion function.
Specifically, the parameter distribution module 121 and the data distribution module 122 read from external storage, in a single access, the calculation parameters and calculation data required by the current set of cascaded calculation functions. This reduces the number of data reads and improves calculation efficiency. Further, the parameter distribution module 121 and the data distribution module 122 are controlled by a clock signal, and each addition/multiplication in the multiply-add module 123 takes one clock cycle. Pipelined control is thereby realized, increasing the calculation speed.
In some embodiments of the present application, a comparator 123e is further connected between the first multiplier 123b and the second adder 123c, and the comparator 123e is enabled or bypassed according to configuration. Specifically, the comparator 123e can be used in computation functions such as ReLU6/ReLU that need to perform data comparisons.
In some embodiments of the present application, the accelerator on-chip computation module 120 may further include an inverse quantization module 125. The inverse quantization module 125 is connected to the output of the multiply-add module 123 to perform a quantization transformation function.
Thus, in the on-accelerator-chip computation module 120 provided by the present invention, the single-path or dual-path data to be computed enters the unit and is sent to the data distribution module 122, which distributes the fixed-point quantized data to the nodes (adders and multipliers) of the multiply-add module 123 according to the computation requirements of the current operator. The parameter distribution module 121 receives calculation parameters preset by the user and distributes them to the nodes (adders and multipliers) of the multiply-add module 123. Each selector selects the data input to its computation node according to the computation requirements of the current operator. The multiply-add module 123 consists of 4 computation nodes, namely 2 adders and 2 multipliers; the computation nodes required by a given operator (computation function) are enabled according to its computation characteristics, and the final result is output. A comparator is provided at a computation node to support operators that need data comparison. The inverse quantization module 125 can convert linear quantized data, such as the INT16 output of the Sigmoid function, into INT8. The data and parameter inputs are linearly quantized data with bit widths such as 8, 12, 16, 24, and 32 bits.
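To make the dataflow concrete, the following Python sketch gives one possible behavioural model of the four-node pipeline and its selectors; the function names, the configuration dictionary and the scalar example values are illustrative assumptions, not the patent's hardware description.

# Behavioural sketch (an assumption for illustration, not the patent's RTL) of the
# four-node multiply-add pipeline: first adder -> first multiplier -> second adder
# -> second multiplier. Each node's second operand comes from its selector, which
# picks either distributed data or a distributed parameter; nodes not used by the
# current operator are bypassed.

def run_pipeline(x, cfg):
    """cfg maps a node name to (enabled, operand); operand is whatever that node's
    selector has chosen (data or parameter) for the current operator."""
    def node(value, name, op):
        enabled, operand = cfg.get(name, (False, None))
        if not enabled:                         # bypass: pass the data straight through
            return value
        return op(value, operand)

    v = node(x, "add1", lambda a, b: a + b)     # first adder      (fed by selector 124a)
    v = node(v, "mul1", lambda a, b: a * b)     # first multiplier (selector 124b)
    v = node(v, "add2", lambda a, b: a + b)     # second adder     (selector 124c)
    v = node(v, "mul2", lambda a, b: a * b)     # second multiplier (selector 124d)
    return v

# Tensor addition: only the first adder works and its selector passes the second
# data path; all other nodes are bypassed.
print(run_pipeline(3, {"add1": (True, 5)}))     # 8
# Tensor Hadamard product: only the first multiplier works.
print(run_pipeline(3, {"mul1": (True, 5)}))     # 15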
Thus, the invention designs a resource-multiplexing computation structure that supports multiple element-by-element computation operators in neural network computation. During neural network computation this structure is connected after the tensor convolution computation structure (the first category above), so that composite operations of multiple operators can be realized, reducing the use of storage resources, improving calculation efficiency, and reducing power consumption. Meanwhile, the architecture can be applied independently to the computation of 16-bit and 8-bit quantized networks, realizing dual support for 16-bit and 8-bit quantized computation with a single set of fixed hardware resources.
Referring now to fig. 3, fig. 3 is a diagram illustrating Sigmoid function calculation using an on-accelerator computation module according to an embodiment of the present invention.
Among current element-by-element computation operators, the most complex is the activation function Sigmoid. Hardware implementations of the activation function include the look-up-table method, coordinate rotation digital computation (CORDIC), and piecewise linear function approximation. Considering hardware resources and processing speed, the invention implements the Sigmoid function by a piecewise function approximation: the Sigmoid function is divided into several segments, and each segment is fitted with a second-order polynomial. The fitting polynomial is:
f(x) = p0·x² + p1·x + p2    formula (1)
This form requires 4 multiplications and 2 additions. To reduce computational resources, the invention transforms the above formula; the fitted function after transformation is:
f(x) = p′0·[x·(x + p′1) + p′2]    formula (2)
This form uses 2 multiplications and 2 additions, i.e., 2 fewer multiplications than formula (1).
Therefore, the multiply-add module 123 is designed as a computation structure with two additions and two multiplications so that it can implement the Sigmoid function.
When a Sigmoid operation is performed, the operating principle of the on-accelerator-chip computation module 120 is as shown in fig. 3; dashed lines denote units or paths that are not currently working. The coefficients p′0, p′1 and p′2 in formula (2) are sent through selector 124d, selector 124a and selector 124c, respectively, to the corresponding computation nodes. The computation time of each node is 1 clock cycle, and the result of the Sigmoid function can be output after 4 clock cycles.
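The saving from the rearrangement can be checked numerically. The sketch below fits one illustrative segment of the Sigmoid curve and verifies that formula (2) with p′0 = p0, p′1 = p1/p0 and p′2 = p2/p0 reproduces formula (1); the segment boundaries and the use of numpy's polyfit are assumptions for demonstration only, since the patent fixes only the functional form.

import numpy as np

# Numerical check (illustrative assumption) that the rearranged fit
#     f(x) = p0' * (x*(x + p1') + p2')        # 2 multiplications, 2 additions
# equals the direct second-order polynomial
#     f(x) = p0*x^2 + p1*x + p2
# when p0' = p0, p1' = p1/p0, p2' = p2/p0.
xs = np.linspace(0.0, 1.0, 200)                       # one example segment
sig = 1.0 / (1.0 + np.exp(-xs))
p0, p1, p2 = np.polyfit(xs, sig, 2)                   # direct fit: p0*x^2 + p1*x + p2
q0, q1, q2 = p0, p1 / p0, p2 / p0                     # rearranged coefficients

direct     = p0 * xs**2 + p1 * xs + p2
rearranged = q0 * (xs * (xs + q1) + q2)               # add(+q1), mul(*x), add(+q2), mul(*q0)
print(np.max(np.abs(direct - rearranged)))            # ~1e-16: the two forms agree
print(np.max(np.abs(direct - sig)))                   # fitting error on this segment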
Referring now to fig. 4, fig. 4 is a schematic diagram illustrating tensor addition calculations using an on-accelerator computation module according to an embodiment of the present invention.
FIG. 4 shows the hardware implementation structure for tensor addition; modules shown in dashed boxes are not working, and only the modules shown in solid boxes work. When adding two tensors of the same dimensions, the data distribution module 122 receives two paths of tensor data: one path is sent directly to the first adder 123a, and the other path is sent to the first adder 123a through selector 124a; the first adder 123a completes the element-by-element addition of the two tensors.
Referring now to fig. 5, fig. 5 is a schematic diagram illustrating tensor hadamard product computation using an on-accelerator computation module according to an embodiment of the invention.
The Hadamard product is an operation on matrices: if A = (aij) and B = (bij) are two matrices of the same order, and cij = aij × bij, then the matrix C = (cij) is called the Hadamard product of A and B.
In the hardware implementation structure for the Hadamard product, the modules shown in dashed boxes are not working, and only the modules shown in solid boxes work. When the Hadamard product is computed, the data distribution module 122 receives two paths of tensor data: one path is sent to the first multiplier 123b through the (currently inactive) first adder 123a, and the other path is sent to the first multiplier 123b through selector 124b; the first multiplier 123b completes the element-by-element multiplication of the two tensors.
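As a small numerical illustration of the definition above (the matrix values are arbitrary):

import numpy as np

# Element-by-element (Hadamard) product of two same-order matrices: c_ij = a_ij * b_ij.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A * B)     # [[ 5 12] [21 32]] -- numpy's * is already element-wise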
Referring now to FIG. 6, FIG. 6 illustrates a schematic diagram of a linear rectification computation using an on-accelerator computation module, according to an embodiment of the invention.
The following description takes PReLU as an example. PReLU is a ReLU with parameters and is defined as follows:
f(xi) = xi when xi > 0;  f(xi) = ai·xi when xi ≤ 0
The parameter distribution module 121 distributes the PReLU parameter ai to selector 124b; the first multiplier 123b receives the parameter ai and multiplies it by the corresponding xi distributed by the data distribution module 122. After the multiplication, the comparator 123e compares the sign bit: if xi is positive, the original xi is output; if negative, the result of the multiplier is output.
When the ReLU6/ReLU operator is computed, the multiplier 123b and selector 124b are not needed by the formula; only the comparator 123e is reused to compare data magnitudes and output the result.
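The following sketch models the PReLU path described above (first multiplier plus comparator) and the comparator-only ReLU6 path; the vector values and the parameter a are illustrative assumptions.

import numpy as np

# PReLU on the pipeline nodes: the first multiplier computes a*x, then the comparator
# checks the sign of x and selects the original x (positive) or the multiplier result
# (negative). With a = 0 this reduces to ReLU; clamping at 6 gives ReLU6, for which
# the multiplier is bypassed and only the comparator is used.
def prelu(x, a):
    scaled = a * x                              # first multiplier 123b (parameter a via selector 124b)
    return np.where(x > 0, x, scaled)           # comparator 123e on the sign bit

def relu6(x):
    return np.minimum(np.maximum(x, 0), 6)      # comparator-only path

x = np.array([-2.0, -0.5, 0.0, 3.0, 9.0])
print(prelu(x, 0.25))                           # [-0.5 -0.125 0. 3. 9.]
print(relu6(x))                                 # [0. 0. 0. 3. 6.]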
Referring now to FIG. 7, FIG. 7 illustrates a schematic diagram of batch normalization calculations using an on-accelerator-chip calculation module, according to an embodiment of the invention.
In an actual network, the BatchNormal operator may follow a convolution-class operator or an activation function such as ReLU. During computation, a pipelined mode in which more operators are cascaded in hardware reduces the use of external storage bandwidth and achieves faster calculation with given computing resources; therefore, the invention converts formula (3) into formula (4):
f(x) = t·x + p    formula (3)
f(x) = t·(x + p/t)    formula (4)
Here the coefficients t and p/t are obtained and distributed by the parameter distribution module 121, x is the data distributed by the data distribution module 122, and the BatchNormal computation is completed by the second adder 123c and the second multiplier 123d.
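A short numerical check of the rewrite from formula (3) to formula (4); the folded scale t and shift p below are illustrative values.

import numpy as np

# f(x) = t*x + p computed as t*(x + p/t): the second adder adds the precomputed p/t
# and the second multiplier multiplies by t, preserving the pipeline's "add then
# multiply" order. t and p stand for the folded BatchNorm scale and shift.
t, p = 0.8, 1.6
x = np.array([-1.0, 0.0, 2.5])

direct   = t * x + p                  # formula (3)
pipeline = t * (x + p / t)            # formula (4): second adder, then second multiplier
print(np.allclose(direct, pipeline))  # True
print(pipeline)                       # [0.8 1.6 3.6]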
Referring now to fig. 8, fig. 8 is a diagram illustrating a quantization transformation calculation using an on-accelerator computation module according to an embodiment of the present invention.
To achieve faster inference on mobile terminals, an INT8 quantization method with a lower bit width is often adopted in the network; lower quantization precision reduces data throughput and computational power consumption. The present computation unit can also be incorporated into the computation of an INT8 quantized network after completing INT16 computation: the conversion from INT16 to INT8 can be realized by the following design.
The commonly used INT8 quantization schemes are NVIDIA's TensorRT scheme and Google's INT8 scheme. In NVIDIA's TensorRT scheme, 16-bit quantized data is converted into 8-bit quantized data as follows:
Xint8_nvidia = [Xint16 × M]
where Xint16 is the data distributed by the data distribution module 122, Xint8_nvidia is the 8-bit quantized output data to be obtained, and M is a conversion parameter.
The conversion from 16-bit to 8-bit in Google's INT8 scheme is:
Xint8_google = [Xint16 × M] + Zx/quant
where Xint16 is the data distributed by the data distribution module 122, Xint8_google is the output data to be obtained, M is a conversion parameter, and Zx/quant is the quantization zero point (not constantly 0) in this scheme.
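The two conversion formulas can be sketched as follows, assuming the square brackets denote rounding to the nearest integer; the scale M, the zero point and the clipping to the INT8 range are illustrative assumptions added for the sketch.

import numpy as np

# INT16-to-INT8 conversion sketches. The symmetric variant follows the TensorRT-style
# formula above (no zero point); the asymmetric variant follows the Google-style
# formula (adds a zero point). M and zero_point are example values, not from the patent.
def int8_symmetric(x_int16, M):
    return np.clip(np.rint(x_int16 * M), -128, 127).astype(np.int8)

def int8_asymmetric(x_int16, M, zero_point):
    return np.clip(np.rint(x_int16 * M) + zero_point, -128, 127).astype(np.int8)

x = np.array([-20000, -3000, 0, 5000, 30000], dtype=np.int32)
M = 127.0 / 32767.0                      # example scale mapping the INT16 range onto INT8
print(int8_symmetric(x, M))              # [-78 -12   0  19 116]
print(int8_asymmetric(x, M, 5))          # same values shifted by the zero point (and clipped)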
Further, with reference to fig. 3, consider Sigmoid: because the input of Sigmoid is INT16 data, in order to use the Sigmoid operator in an 8-bit low-precision network, an inverse quantization module 125 is designed in the unit; the conversion from the INT16/INT24/INT32 output to Google 8-bit quantization can be realized by simple shifting, using few resources. The Sigmoid conversion from INT16 to INT8 is:
(Sigmoid INT16-to-INT8 conversion formula)
where Sigmoid_int8 is the data to be obtained, Xfloat is the difference between the maximum and minimum values of the floating-point number X, Sfloat is 255, Zero_x_quant is the conversion zero point (not constantly 0) of the conversion scheme, and Xint16 is the output data of the multiply-add module 123.
Thus, by controlling each computation node, the on-accelerator-chip computation module 120 provided by the invention can realize cascading among more operators, reducing the use of storage resources and improving calculation efficiency.
If ReLU/ReLU6/PReLU is taken as the basis, the operators can be cascaded:
ADD (adder) + ReLU/ReLU6/PReLU
MUL (multiplier) + ReLU/ReLU6/PReLU
ReLU/ReLU6/PReLU+BatchNormal
ADD (adder) + ReLU/ReLU6/PReLU + BatchNormal
MUL (multiplier) + ReLU/ReLU6/PReLU + BatchNormal
If the BatchNormal operator is taken as the basis, the operators can be cascaded respectively:
ADD (adder) + BatchNormal
MUL (multiplier) + BatchNormal
ADD (adder) + MUL (multiplier) + BatchNormal
Combined cascades among operators such as ADD (adder) + MUL (multiplier) can also be realized, and more tensor element-by-element operators can be supported through flexible control of the computation nodes; a sketch of such a cascade follows.
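As an illustration, the sketch below evaluates one such cascade, ADD + PReLU + BatchNormal, as a single pass over the pipeline nodes; the parameter values are illustrative assumptions.

import numpy as np

# One pass through the pipeline for the cascade ADD + PReLU + BatchNormal:
# first adder does the tensor addition, first multiplier plus comparator do PReLU,
# second adder and second multiplier do the folded BatchNormal (t*(s + p/t)),
# so the composite operator needs a single read of data and parameters.
def add_prelu_batchnormal(x, y, a, t, p):
    s = x + y                              # first adder: tensor addition
    s = np.where(s > 0, s, a * s)          # first multiplier + comparator: PReLU
    return t * (s + p / t)                 # second adder + second multiplier: BatchNormal

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 0.2, 0.5, 1.0])
print(add_prelu_batchnormal(x, y, a=0.1, t=0.8, p=1.6))   # [1.536 2.4 4.]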
Further, the on-accelerator-chip computation module 120, with the first adder 123a, the first multiplier 123b, the second adder 123c and the second multiplier 123d as its core, can complete computation centered on Sigmoid; the Sigmoid implementation uses few computation units (only 2 multiplications and 2 additions) and few computation cycles, and its input and output data are compatible with the quantization formats INT8/INT16/INT24/INT32.
The invention realizes a hardware structure for operators with resource multiplexing: on the basis of the hardware resources of the computation unit, tensor element-by-element operators in the neural network, such as tensor addition, Hadamard multiplication, ReLU/ReLU6/PReLU, BatchNormal, and INT16-to-INT8 conversion, are realized by reusing computation resources and flexibly controlling the nodes of the computation unit.
The invention realizes the conversion of Sigmoid output to INT8: after the input Sigmoid computation is done in INT16, the output INT16/INT24/INT32 data is converted into Google INT8 quantization. Based on analysis of the conversion principle, the conversion here is completed by a simple shift and has the advantage of low resource usage.
The invention realizes a method for cascading multiple operators: by controlling the computation nodes, nodes are enabled or bypassed and different computation data are transmitted, so that cascades among multiple different operators can be realized on this architecture.
Compared with the prior art, the invention has the advantages that:
by making the multiply-add module of the on-accelerator-chip computation module consist of a first adder, a first multiplier, a second adder and a second multiplier connected in sequence, the invention can realize different computation functions and combinations of computation functions according to different operator requirements, achieving multiple modes of resource multiplexing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. An on-chip computation module for an accelerator, comprising:
a parameter distribution module configured to distribute the calculation parameters;
a data distribution module configured to distribute computing data;
the multiplication and addition module comprises a first adder, a first multiplier, a second adder and a second multiplier which are connected in sequence, and the first adder is connected to the data distribution module;
a plurality of selectors, each selector including a first input terminal connected to the data distribution module, a second input terminal connected to the parameter distribution module, and an output terminal, the output terminals of the selectors being respectively connected to the first adder, the first multiplier, the second adder, and the second multiplier, each selector being configured to select data from its first input terminal or its second input terminal to be output at its output terminal,
wherein the first adder, the first multiplier, the second adder, the second multiplier, and the selector are configured to cause the on-accelerator-chip computation module to perform different computation functions.
2. The on-accelerator-chip computation module of claim 1, wherein a comparator is further connected between the first multiplier and the first adder, and whether the comparator is enabled is determined according to configuration.
3. The on-accelerator-chip computation module of claim 1, further comprising:
an inverse quantization module connected to an output end of the multiply-add module.
4. The on-accelerator-chip computing module of claim 1, wherein the on-accelerator-chip computing module supports cascaded computation of a plurality of computing functions.
5. The on-accelerator-chip computation module of claim 4, wherein the parameter distribution module and the data distribution module read from memory computation parameters and computation data required for a current plurality of cascaded computation functions at a time.
6. The on-accelerator-chip computation module of any one of claims 1 to 5, wherein the computation functions comprise: tensor addition, the tensor Hadamard product, the Sigmoid function, linear rectification, and batch normalization.
7. The on-accelerator-chip computation module of any one of claims 1 to 5, wherein the computation function further comprises: a quantization conversion function.
8. The on-chip computation module of any one of claims 1 to 5, wherein the parameter distribution module and the data distribution module are controlled via a clock signal, and a time of computation of each addition/multiplication in the multiply-add module is a time of one clock signal.
9. The on-chip computation module of any one of claims 1 to 5, wherein the adders and multipliers in the multiply-add module that are not in operation directly transmit data by bypassing.
10. An accelerator, comprising:
a tensor convolution calculation module;
the on-chip computation module of any one of claims 1 to 9, an input of the on-chip computation module being connected to an output of the tensor convolution computation module; and
and an output module, an input end of which is connected to an output end of the on-accelerator-chip computation module.
CN202110326325.9A 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator Active CN112989269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326325.9A CN112989269B (en) 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110326325.9A CN112989269B (en) 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator

Publications (2)

Publication Number Publication Date
CN112989269A true CN112989269A (en) 2021-06-18
CN112989269B CN112989269B (en) 2023-07-25

Family

ID=76333862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326325.9A Active CN112989269B (en) 2021-03-26 2021-03-26 Accelerator and on-chip computing module for accelerator

Country Status (1)

Country Link
CN (1) CN112989269B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310816A (en) * 1998-07-22 2001-08-29 Motorola Inc. Circuit and method of modulo multiplication
JP2013239120A (en) * 2012-05-17 2013-11-28 Olympus Corp Image processing device
CN102799412A (en) * 2012-07-09 2012-11-28 Shanghai University CORDIC (coordinate rotation digital computer) accelerator based on parallel pipeline design
CN111461313A (en) * 2020-03-27 2020-07-28 Hefei University of Technology Convolution neural network hardware accelerator based on lightweight network and calculation method thereof
US20200320375A1 (en) * 2020-05-05 2020-10-08 Intel Corporation Accelerating neural networks with low precision-based multiplication and exploiting sparsity in higher order bits
CN111667051A (en) * 2020-05-27 2020-09-15 Shanghai StarFive Technology Co., Ltd. Neural network accelerator suitable for edge equipment and neural network acceleration calculation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FLAVIUS OPRITOIU et al.: "FPGA-Based Single Precision Iterative Floating Point Multiplier for Educational Use", 2014 IEEE 20th International Symposium for Design and Technology in Electronic Packaging (SIITME), pages 305-308
LIU Qinrang; LIU Chongyang: "Computation Optimization of Convolutional Neural Networks Utilizing Parameter Sparsity and Its FPGA Accelerator Design", Journal of Electronics &amp; Information Technology, no. 06

Also Published As

Publication number Publication date
CN112989269B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
EP3460726B1 (en) Hardware implementation of a deep neural network with variable output data format
US11645224B2 (en) Neural processing accelerator
CN109828744B (en) Configurable floating point vector multiplication IP core based on FPGA
CN110705703B (en) Sparse neural network processor based on systolic array
EP3931756A1 (en) Neural network layer processing with normalization and transformation of data
CN110222833B (en) Data processing circuit for neural network
CN111694544B (en) Multi-bit multiplexing multiply-add operation device, neural network operation system, and electronic apparatus
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
TW202020654A (en) Digital circuit with compressed carry
CN112989269B (en) Accelerator and on-chip computing module for accelerator
CN107783935B (en) Approximate calculation reconfigurable array based on dynamic precision configurable operation
CN112884146A (en) Method and system for training model based on data quantization and hardware acceleration
CN110766136A (en) Compression method of sparse matrix and vector
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN115545175A (en) Running bidirectional recurrent neural networks in hardware
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
Tayal et al. Partial product based improved reconfigurable FIR filter with control logic for automated guided vehicles on virtex-7 FPGA
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
US11687336B2 (en) Extensible multi-precision data pipeline for computing non-linear and arithmetic functions in artificial neural networks
CN113793601B (en) Voice recognition method and device
CN115081604A (en) Buffer for temporarily storing Winograd weight, computing device, integrated circuit device and board card
CN113504893B (en) Intelligent chip architecture and method for efficiently processing data
EP4293576A1 (en) Hardware implementation of an attention-based neural network
CN115079927A (en) Temporary storage of convolution results, computing device, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 503-3, 398 Jiangsu Road, Changning District, Shanghai 200050

Applicant after: Shanghai Xijing Technology Co.,Ltd.

Address before: Room 503-3, 398 Jiangsu Road, Changning District, Shanghai 200050

Applicant before: SHANGHAI WESTWELL INFORMATION AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant