Disclosure of Invention
The application aims to solve the problems in the background art that an application realized purely through software is large in size, high in power consumption and very high in cost, while an application realized by transmitting data over a network to a back end is limited by the network: a wired network requires network cabling, and a wireless network is constrained by transmission distance, available transmission bandwidth, and the hardware expense of maintaining the network.
The application provides a lightweight artificial intelligence acceleration module, which comprises a main controller, an internal storage component, an acceleration core and a general core; the main controller is externally connected to a coprocessor interface and an AHB bus interface respectively, the main controller is in communication connection with the internal storage component, and the internal storage component is in communication connection with the acceleration core and the general core.
The main controller comprises a coprocessor interface processing module, an AHB interface processing module, a control unit module, a storage control module and a system register module;
the acceleration core includes the following sub-modules: a weight prefetcher, which prefetches weight data for the high-performance multiply-add component; an activation value loader, which loads activation values for the high-performance multiply-add component; and the core accelerator, which comprises the high-performance multiply-add component, containing MAC units supporting INT8 × INT8 + INT16 calculation, and a result accumulation buffer built from registers;
the general core includes the following sub-modules: a control engine, responsible for parsing the commands of the general core, dispatching them to the corresponding computing components, and controlling the reading of operand data and the write-back of computation results; an Element-Wise component, including an INT16 Dot-Mul unit, Dot-Add unit and ReLU activation unit; an activation look-up-table component supporting nonlinear activation functions; and a quantization/dequantization component that converts fixed-point INT16 to single-precision floating-point FP32, or FP32 to INT16.
The internal storage component is a local buffer implemented in SRAM (static random access memory); its size is 64 KB, and it operates in a double-buffer mode.
Further, the module has a main frequency of 300 MHz under the TSMC 55-nanometer chip manufacturing process.
Further, the module uses an AMBA 5 AHB master interface for bus interaction, and the bit width of the external AHB bus is 32 bits or 64 bits.
Furthermore, the module supports a coprocessor interface of the ARM or RISC-V architecture for interaction with the processor.
Further, the acceleration cores inside the module can be configured into INT8 and INT16 calculation modes: INT8 × INT8 + INT16 calculation is performed when configured as INT8, and INT16 × INT16 + INT32 calculation is performed when configured as INT16. A single acceleration core is a SIMD structure 16 lanes wide in INT8 or 8 lanes wide in INT16, and the computing power is configured by setting the number of acceleration cores; the remaining internal arithmetic units support INT16 calculation.
Further, the module internally comprises: a local buffer with a total capacity of 32 KB, constructed from SRAM and supporting read-write access by the coprocessor; a weight buffer with a total capacity of 32 KB, constructed from SRAM and supporting read access by the coprocessor; and the result accumulation buffers inside the acceleration cores, each buffer being 16 entries deep and 32 bits wide and built from registers, so that the accumulation buffer capacity in each acceleration core is 128 B, with the total accumulation buffer capacity of the whole module depending on the number of acceleration cores. All three buffers adopt a double-buffer mode.
Further, the module supports convolution operator operations, matrix multiplication operator operations, activation operator operations, pooling operator operations, Element-Wise operator operations, and Batch-Normalization operator operations.
Furthermore, the module triggers operator operations through coprocessor instructions, so no new instructions need to be customized.
Beneficial effects:
1. The lightweight artificial intelligence acceleration module is realized in hardware; while meeting the requirements of high performance and high computing power, it is small in size, low in power consumption and low in cost.
2. The lightweight artificial intelligence acceleration module supports convolution, matrix multiplication, activation, pooling, Element-Wise and Batch-Normalization operator operations; it triggers the operator operations through coprocessor instructions (MCR, MCRR, MCR2, MCRR2, MRC, MRRC, MRC2, MRRC2, CDP and CDP2) without customizing new instructions, so it interacts well with the processor; data processing capacity is improved and latency is reduced.
Detailed Description
The present invention is described in detail with reference to the accompanying drawings. As shown in fig. 1, the present application provides a lightweight artificial intelligence acceleration module, which includes a main controller, an internal storage component, an acceleration core, and a general core; the main controller is externally connected to a coprocessor interface and an AHB bus interface respectively, the main controller is in communication connection with the internal storage component, and the internal storage component is in communication connection with the acceleration core and the general core.
The main controller comprises a coprocessor interface processing module, an AHB interface processing module, a control unit module, a storage control module and a system register module;
the acceleration core includes the following sub-modules: a weight prefetcher, which prefetches weight data for the high-performance multiply-add component; an activation value loader, which loads activation values for the high-performance multiply-add component; and the core accelerator, which comprises the high-performance multiply-add component, containing MAC units supporting INT8 × INT8 + INT16 calculation, and a result accumulation buffer built from registers. The internal acceleration cores can be configured into two calculation modes, INT8 and INT16: INT8 × INT8 + INT16 when configured as INT8, and INT16 × INT16 + INT32 when configured as INT16. A single acceleration core is a SIMD structure 16 lanes wide in INT8 or 8 lanes wide in INT16. The computing power is configured by setting the number of acceleration cores, with at least 2 configured; the minimum computing power is 2 × 16 × 0.3 GHz × 2 = 19.2 GMAC, and 4 cores (38.4 GMAC), 8 cores (76.8 GMAC) or 16 cores (153.6 GMAC) can also be configured. The remaining internal arithmetic components support INT16 calculation, with a computing power of 8 × 0.3 GHz × 2 = 4.8 GMAC;
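As a cross-check, the following minimal sketch reproduces the computing-power arithmetic above; the 0.3 GHz clock, the 16-lane INT8 SIMD width, the core counts and the factor of 2 are taken directly from the text, and nothing else is assumed:

```c
#include <stdio.h>

/* Reproduces the document's computing-power arithmetic:
   cores x 16 INT8 lanes x 0.3 GHz x 2 = quoted GMAC figure. */
int main(void) {
    const double freq_ghz = 0.3;           /* 300 MHz main frequency */
    const int lanes_int8  = 16;            /* SIMD width of one core */
    const int core_counts[] = {2, 4, 8, 16};
    for (int i = 0; i < 4; i++) {
        double gmac = core_counts[i] * lanes_int8 * freq_ghz * 2;
        printf("%2d cores: %5.1f GMAC\n", core_counts[i], gmac);
    }
    /* General-core arithmetic: 8 INT16 lanes x 0.3 GHz x 2 = 4.8 GMAC */
    printf("general core: %.1f GMAC\n", 8 * freq_ghz * 2);
    return 0;
}
```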
the general core includes the following sub-modules: a control engine, responsible for parsing the commands of the general core, dispatching them to the corresponding computing components, and controlling the reading of operand data and the write-back of computation results; an Element-Wise component, including an INT16 Dot-Mul unit, Dot-Add unit and ReLU activation unit; an activation look-up-table component supporting nonlinear activation functions; and a quantization/dequantization component that converts fixed-point INT16 to single-precision floating-point FP32, or FP32 to INT16.
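For illustration, a minimal sketch of the INT16 ↔ FP32 conversion performed by the quantization/dequantization component; the symmetric scale factor and the round-and-saturate policy below are assumptions, since the text only states the conversion directions:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical symmetric quantization; the scale parameter and the
   round/saturate behavior are assumptions, not the module's documented
   scheme -- the patent only states INT16 <-> FP32 conversion. */
static int16_t quantize_fp32_to_int16(float x, float scale) {
    float q = roundf(x / scale);            /* round to nearest */
    if (q >  32767.0f) q =  32767.0f;       /* saturate to INT16 range */
    if (q < -32768.0f) q = -32768.0f;
    return (int16_t)q;
}

static float dequantize_int16_to_fp32(int16_t q, float scale) {
    return (float)q * scale;
}

int main(void) {
    float scale = 0.01f;                    /* illustrative scale only */
    int16_t q = quantize_fp32_to_int16(1.2345f, scale);
    printf("quantized: %d, dequantized: %f\n", q,
           dequantize_int16_to_fp32(q, scale));
    return 0;
}
```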
The internal storage component is a local buffer implemented in SRAM (static random access memory); its size is 64 KB, and it operates in a double-buffer mode.
The parameter performance of the lightweight artificial intelligence acceleration module under the TSMC 55-nanometer chip manufacturing process is shown in Table 1:
table 1: performance of parameter
The main frequency is 300 MHz; bus interaction is carried out through an AMBA 5 AHB master interface, and the bit width of the external AHB bus is 32 bits or 64 bits; the coprocessor interface, supporting the ARM or RISC-V architecture, is an external coprocessor interface, to facilitate interaction with the processor.
The lightweight artificial intelligence acceleration module comprises a local buffer, a weight buffer and an accumulation buffer. The three buffers are as follows: the local buffer has a total capacity of 32 KB, is constructed from SRAM (static random access memory), supports read-write access by the coprocessor, and has a minimum access granularity of 32 bits; the weight buffer (Weight Buffer) has a total capacity of 32 KB, is constructed from SRAM, supports read access by the coprocessor but not write access, and has a minimum access granularity of 128 bits; the result accumulation buffers inside the acceleration cores are small, each buffer being 16 entries deep and 32 bits wide and built from registers, so the accumulation buffer capacity in each acceleration core is 16 × 32 bit × 2 = 128 B, and the total accumulation buffer capacity of the whole module depends on the number of acceleration cores.
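A minimal sketch of the double-buffer (ping-pong) organization used by these buffers; the sizes follow the text, while the bank hand-off shown is a generic illustration rather than the module's actual control logic:

```c
#include <stdint.h>

#define ACC_DEPTH 16   /* per the text: depth 16, width 32 bits */

/* Two register banks per accumulation buffer: 16 x 32 bit x 2 = 128 B.
   While compute drains one bank, the loader refills the other, and the
   banks then swap roles; the local and weight buffers use the same
   ping-pong idea over their SRAM halves. */
typedef struct {
    int32_t bank[2][ACC_DEPTH];
    int active;                 /* bank currently owned by compute */
} acc_buffer_t;

static int32_t *compute_bank(acc_buffer_t *b) { return b->bank[b->active];     }
static int32_t *fill_bank(acc_buffer_t *b)    { return b->bank[b->active ^ 1]; }
static void     swap_banks(acc_buffer_t *b)   { b->active ^= 1;                }
```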
The lightweight artificial intelligence acceleration module triggers operator operations through coprocessor instructions (MCR, MCRR, MCR2, MCRR2, MRC, MRRC, MRC2, MRRC2, CDP and CDP2), and no new instructions need to be customized.
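A hypothetical example of triggering an operator through one of the standard coprocessor instructions named above (MCR). The coprocessor number p1, the opcode fields and the meaning of the transferred register are illustrative assumptions, not the module's documented encoding:

```c
/* Sketch: issue a command word to the accelerator via a standard ARM
   MCR instruction -- no custom instruction is needed.  Coprocessor
   number and opcode fields here are placeholders; compile for ARM,
   e.g. with arm-none-eabi-gcc. */
static inline void accel_start(unsigned int cmd) {
#if defined(__arm__)
    __asm__ volatile("mcr p1, 0, %0, c0, c0, 0" : : "r"(cmd));
#else
    (void)cmd;   /* non-ARM build: no-op placeholder */
#endif
}
```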
Examples
The following describes an implementation of the lightweight artificial intelligence acceleration module supporting the convolution operator:
The convolution calculation proceeds as follows. The operation data comprise three-dimensional input feature images, three-dimensional convolution kernels, and the generated output feature images. There are N input feature images in total, each of size H × W × C, and M convolution kernels in total, each of size R × S × C. Convolving one three-dimensional convolution kernel with a three-dimensional input image yields a two-dimensional output feature image of size E × F; all M convolution kernels acting on one input feature map yield a three-dimensional output feature image with M channels, of size E × F × M; and all M convolution kernels acting on the N three-dimensional input feature images yield N three-dimensional output feature images with M channels. The data dimensions are as follows (a reference loop nest is given after this list):
Input feature images (input fmaps): [NHWC]
Convolution kernels (filters/weights): [RSCM]
Output feature images (output fmaps): [NEFM]
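Before any splitting, the convolution above corresponds to the following plain loop nest (a reference sketch; stride 1 and a pre-padded input are assumed, so E = H − R + 1 and F = W − S + 1, and INT16 inputs accumulate into INT32 as in the module's INT16 mode):

```c
#include <stdint.h>

/* Reference NHWC convolution matching the dimension layout above. */
void conv2d_nhwc(const int16_t *in,   /* [N][H][W][C] */
                 const int16_t *wt,   /* [R][S][C][M] */
                 int32_t *out,        /* [N][E][F][M] */
                 int N, int H, int W, int C,
                 int R, int S, int M) {
    int E = H - R + 1, F = W - S + 1;
    for (int n = 0; n < N; n++)
      for (int e = 0; e < E; e++)
        for (int f = 0; f < F; f++)
          for (int m = 0; m < M; m++) {
            int32_t acc = 0;                      /* INT32 accumulator */
            for (int r = 0; r < R; r++)
              for (int s = 0; s < S; s++)
                for (int c = 0; c < C; c++)
                  acc += in[((n*H + e + r)*W + f + s)*C + c]
                       * wt[((r*S + s)*C + c)*M + m];
            out[((n*E + e)*F + f)*M + m] = acc;
          }
}
```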
Because the amount of data calculated by a single coprocessor instruction is limited, the loops are split to obtain pseudo code suitable for the coprocessor; specifically, several layers of loops need to be added to the original loop nest:
Splitting is performed along the channel direction C of the input feature image, because the coprocessor calculates 16 channels at a time; assuming the convolutional layer has 128 channels, the C direction can be computed in 128/16 = 8 passes.
Splitting is also performed along the E/F directions of the output feature image, the two directions being equivalent. Taking the F direction as an example: according to the MUL_ADD instruction mentioned above, at most 16 results are calculated at a time, so one pass yields at most 16 results. Assuming the output feature image size F is 32, the calculation can be divided into 2 passes of 16 results each, and the C direction must be fully accumulated before the final results are obtained. A sketch of this split is given below.
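The following sketch prints the schedule of MUL_ADD-style steps produced by the two splits just described, for C = 128 and F = 32; the tile sizes (16 channels, 16 results) come from the text, while the function and its printout are purely illustrative:

```c
#include <stdio.h>

#define CT 16   /* channels consumed per instruction (from the text)   */
#define FT 16   /* output results produced per instruction (from text) */

/* One line per MUL_ADD-style step: the C loop is cut into 128/16 = 8
   passes and the F loop into 32/16 = 2 passes; outputs of an f-tile
   become final only after all 8 channel passes have accumulated. */
int main(void) {
    const int C = 128, F = 32;
    for (int f0 = 0; f0 < F; f0 += FT)
        for (int c0 = 0; c0 < C; c0 += CT)
            printf("MUL_ADD: f=%2d..%2d, c=%3d..%3d%s\n",
                   f0, f0 + FT - 1, c0, c0 + CT - 1,
                   (c0 + CT == C) ? "  -> results final" : "");
    return 0;
}
```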
The loop is developed with a specific example. Suppose one convolution layer in the classical neural network model ResNet-50 is selected for analysis: the input feature image size NHWC is 64 × 14 × 14 × 128 (64 × 16 × 16 × 128 after padding), the convolution kernel size RSCM is 3 × 3 × 128 × 128, and the output feature image size NEFM is 64 × 14 × 14 × 128; Sw/Sh denote the sliding strides of the convolution kernel in the horizontal/vertical directions;
It should be noted that the number of acceleration cores Core_Num of the current coprocessor supports parameterizable configuration; it is not fixed, and the configuration can be made according to Core_Num.
The convolution kernel data and the input feature image data are randomly generated, in 16-bit fixed-point format (the 8-bit fixed-point flow is tested at a later stage; the test flow is exactly the same). The specific format is as follows:
Input feature images are in NHWC format, i.e. the C direction varies first, then W, then H, and finally the N direction. Assume there are two input feature images stored in NHWC format; the actual order in which the data are stored is then: [0,0,0,1,1,1, …,8,8,8,0,0,0,1,1,1, …,8,8,8]
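The quoted sequence can be reproduced with the following sketch, assuming (the text does not state this explicitly) that each image is 3 × 3 with 3 channels and that each pixel's value 0..8 is replicated across its channels:

```c
#include <stdio.h>

/* NHWC flattening: channels vary fastest, so each pixel value appears
   C times in a row; the whole image pattern then repeats per image. */
int main(void) {
    const int N = 2, H = 3, W = 3, C = 3;   /* sizes are an assumption */
    for (int n = 0; n < N; n++)
        for (int h = 0; h < H; h++)
            for (int w = 0; w < W; w++)
                for (int c = 0; c < C; c++)
                    printf("%d,", h * W + w);   /* pixel value 0..8 */
    printf("\n");   /* -> 0,0,0,1,1,1,...,8,8,8 repeated for n=0,1 */
    return 0;
}
```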
The format of the convolution kernels is related to the number of acceleration cores of the coprocessor (after the model is trained, the weights no longer change, so the data needs to be rearranged only once). The convolution kernel format taken from the actual framework is RSCM, and after the format is adjusted the final result can be obtained.
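As a sketch of such a one-time rearrangement: the patent states only that the on-device layout depends on Core_Num without defining it, so the interleaving below (output channels dealt round-robin across cores, assuming Core_Num divides M) is purely an illustrative assumption:

```c
#include <stdint.h>

/* Hypothetical weight rearrangement from framework RSCM layout to a
   per-core interleaved layout; the actual device format is not
   documented in the text.  Assumes M % core_num == 0. */
void reorder_weights(const int16_t *rscm, int16_t *dev,
                     int R, int S, int C, int M, int core_num) {
    int idx = 0;
    for (int m0 = 0; m0 < M; m0 += core_num)    /* one group per round  */
        for (int k = 0; k < core_num; k++)      /* one m per core       */
            for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                    for (int c = 0; c < C; c++)
                        dev[idx++] =
                            rscm[((r*S + s)*C + c)*M + (m0 + k)];
}
```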
The technical solutions described above only represent the preferred technical solutions of the present invention; modifications that those skilled in the art may make to some parts of the technical solutions without departing from the principles of the present invention all fall within the protection scope of the present invention.