CN112232517B - Artificial intelligence acceleration engine and artificial intelligence processor - Google Patents

Artificial intelligence acceleration engine and artificial intelligence processor

Info

Publication number
CN112232517B
Authority
CN
China
Prior art keywords
data
vector
calculation unit
instruction
multiply
Prior art date
Legal status
Active
Application number
CN202011018763.0A
Other languages
Chinese (zh)
Other versions
CN112232517A (en)
Inventor
贾兆荣
景璐
杨继林
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011018763.0A
Publication of CN112232517A
Application granted
Publication of CN112232517B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 - Methods or arrangements for performing computations using non-contact-making devices, for evaluating functions by calculation
    • G06F7/5443 - Sum of products
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an artificial intelligence acceleration engine and an artificial intelligence processor. The artificial intelligence acceleration engine comprises: a cache module configured to cache the instructions and configuration data of the current task of the acceleration engine; a decoding module configured to acquire the instructions from the cache module and decode them; a data processing module comprising a multidimensional multiply-add calculation unit, a scalar calculation unit and a vector calculation unit, which receives the instruction decoded by the decoding module and, based on the decoded instruction, sends data to the corresponding unit for data processing; and a storage module configured to store the data required by the acceleration engine for data processing and the data processed by the data processing module. By using the scheme of the invention, high-bandwidth data access can be provided for the computing units, the high-density parallel computing requirement of deep learning is met, computing blockage caused by insufficient data is reduced, and computing efficiency is improved.

Description

Artificial intelligence acceleration engine and artificial intelligence processor
Technical Field
The present invention relates to the field of computers, and more particularly to an artificial intelligence acceleration engine and an artificial intelligence processor.
Background
In the first two decades of the 21st century, driven by technologies such as large-scale GPU-server parallel computing, big data, deep learning algorithms and brain-inspired chips, human society has successively entered the Internet era, the big data era and the artificial intelligence era.
Artificial intelligence endows machines with intelligence so that they can replace humans in certain tasks. Its core algorithms are built mainly by studying and summarizing human consciousness, thinking and information processing, giving machines human-like abilities to think, analyze, recognize and judge.
The development of artificial intelligence is of great significance to the national economy. By integrating various production factors, artificial intelligence acts on national economic activity, helps raise productivity and supports the real economy, which is mainly reflected in the following four aspects. First, artificial intelligence can rely on big data to process huge information resources and extract effective data through analysis, thereby avoiding wrong economic decisions and promoting sustained and stable economic development. Second, through intelligent and precise control, artificial intelligence can reduce the waste of resources and improve production levels and efficiency. Third, artificial intelligence can empower the business ecosystem; powered by electricity, it can reduce carbon emissions and achieve energy saving and environmental protection. Fourth, driven by artificial intelligence, the industrial economy and the information economy merge with each other, changing the traditional production mode of demand-design-manufacture-sale-service. Owing to the application of information technologies such as the Internet, the relationships among different industries keep changing, new industries keep emerging, and cross-border, integrated development has become an important characteristic of the industrial ecosystem, improving the quality of economic growth and promoting adjustment of the overall economic structure.
Disclosure of Invention
In view of this, an objective of the embodiments of the present invention is to provide an artificial intelligence acceleration engine and an artificial intelligence processor, which can provide high bandwidth data access for a computing unit, meet the requirement of deep learning for high-density parallel computing, reduce computing congestion caused by insufficient data, and improve computing efficiency.
In view of the above object, an aspect of an embodiment of the present invention provides an artificial intelligence acceleration engine, including:
the cache module is configured to cache the instruction and the configuration data of the current task of the acceleration engine;
the decoding module is configured to acquire the instruction from the cache module and decode the instruction;
the data processing module comprises a multidimensional multiplication and addition computing unit, a scalar computing unit and a vector computing unit, and receives the instruction decoded by the decoding module and sends data to one or more of the multidimensional multiplication and addition computing unit, the scalar computing unit and the vector computing unit for data processing based on the decoded instruction;
and the storage module is configured to store data required by the data processing of the acceleration engine and data processed by the data processing module.
According to one embodiment of the present invention, the multi-dimensional multiply-add calculation unit includes a two-dimensional multiply-add calculation unit and a three-dimensional multiply-add calculation unit.
According to one embodiment of the present invention, a two-dimensional multiply-add calculation unit includes:
the data vector cache region is configured to store a group of data vectors to be processed in each clock cycle;
the model parameter vector cache region is configured to cache model parameters required by processing data vectors;
the multiplication and addition computing unit array consists of a plurality of rows of multiplication and addition computing units, and each multiplication and addition computing unit is provided with a model parameter vector cache region;
and the accumulator is configured to accumulate the data processed by the multiplication and addition computing unit array to obtain a final result.
According to one embodiment of the present invention, a three-dimensional multiply-add calculation unit is composed of a plurality of two-dimensional multiply-add calculation units arranged in parallel.
According to one embodiment of the invention, the two-dimensional multiply-add computing unit is configured to buffer multiple sets of model parameters required for data processing into each model parameter vector buffer area in the multiply-add computing unit array simultaneously, and one model parameter in the multiple sets of model parameters is used for data processing in each clock cycle.
According to one embodiment of the invention, an accumulator comprises:
an intermediate result buffer configured to buffer intermediate results of the accumulated computations;
and the final result cache region is configured to cache the final result of the accumulation calculation and send the final result to the storage module.
According to one embodiment of the present invention, a scalar calculation unit includes:
the data cache region is configured to store data before data processing;
the calculation unit comprises an addition and subtraction calculation area, a multiplication and division calculation area, a logic operation area and a nonlinear operation area, receives an instruction for data processing, reads data from the data cache area to a corresponding block for data processing based on the instruction, and retransmits the processed data to the data cache area.
According to an embodiment of the invention, the data processing module is further configured to connect the scalar calculation unit to the multidimensional multiply-add calculation unit through a plurality of cascade buses and to connect the vector calculation unit to the scalar calculation unit, so that both the scalar calculation unit and the vector calculation unit can receive the data processed by the multidimensional multiply-add calculation unit and further process it.
According to one embodiment of the invention, the scalar calculation unit and the vector calculation unit are both provided with an online mode and an offline mode, when the online mode is set, the scalar calculation unit and the vector calculation unit both receive the data processed by the multidimensional multiply-add calculation unit and further process the processed data, and when the offline mode is set, the scalar calculation unit and the vector calculation unit only receive the data of the storage module for data processing.
In another aspect of the embodiments of the present invention, there is also provided an artificial intelligence processor including the artificial intelligence acceleration engine described above.
The invention has the following beneficial technical effects. In the artificial intelligence acceleration engine provided by the embodiments of the invention, a cache module is configured to cache the instructions and configuration data of the current task of the acceleration engine; the decoding module is configured to acquire the instruction from the cache module and decode it; the data processing module comprises a multidimensional multiply-add calculation unit, a scalar calculation unit and a vector calculation unit, and receives the instruction decoded by the decoding module and sends data to one or more of the multidimensional multiply-add calculation unit, the scalar calculation unit and the vector calculation unit for data processing based on the decoded instruction; and the storage module is configured to store the data required by the acceleration engine for data processing and the data processed by the data processing module. In this way, high-bandwidth data access can be provided for the computing unit, the requirement of deep learning for high-density parallel computing is met, computing blockage caused by insufficient data is reduced, and computing efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a schematic block diagram of an artificial intelligence acceleration engine in accordance with one embodiment of the present invention;
FIG. 2 is a schematic diagram of a two-dimensional MAC calculation array, according to one embodiment of the invention;
FIG. 3 is a schematic diagram of convolving an input data multiplex weight according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a three-dimensional MAC calculation array, according to one embodiment of the invention;
FIG. 5 is a schematic diagram of a scalar calculation unit according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of an artificial intelligence processor, according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
In view of the above objects, a first aspect of embodiments of the present invention proposes an embodiment of an artificial intelligence acceleration engine.
As shown in FIG. 1, the artificial intelligence acceleration engine may include:
The cache module is configured to cache the instructions and configuration data of the current task of the acceleration engine. It mainly comprises, first, an instruction and data storage (I & D SRAM/Cache) unit, which caches the instructions and configuration data of the current task of the acceleration engine, the instruction fetch period being one clock cycle; and, second, an instruction fetch and decode (Instruction Fetch & Instruction Decode) unit, which reads instructions from the I & D SRAM/Cache and decodes them. If the instruction storage space is of the SRAM type, this unit needs to maintain an instruction prefetch buffer (instruction prefetch buffer), the cached instructions and a program counter (PC); when an exception or interrupt occurs, the instruction prefetch buffer is cleared and new instructions are fetched from the new PC.
The decoding module is configured to acquire instructions from the cache module and decode them. The instruction decoding (ID) module controls the whole decode and execute process; it contains a controller that controls the overall flow of the acceleration engine and a multiplexer that dispatches instructions. The decoding module supports very long instruction words (x-way VLIW), a function completed jointly by the decoder and the controller.
The data processing module comprises a multidimensional multiply-add calculation unit, a scalar calculation unit and a vector calculation unit. It receives the instruction decoded by the decoding module and, based on the decoded instruction, sends data to one or more of the multidimensional multiply-add calculation unit, the scalar calculation unit and the vector calculation unit for data processing. The multidimensional multiply-add calculation unit is a vector architecture based on a RISC instruction set, in which a large number of MACs (multiply-add calculation units) form a 1-dimensional, 2-dimensional or 3-dimensional array that completes intensive convolution and matrix multiplication. The vector length and vector dimension supported by the MAC array can be configured flexibly to meet the computation requirements of different convolutional networks. In addition, the MAC array supports multi-precision calculation: one MAC can complete 1 FP32 operation, 2 FP16 or INT16 operations, 4 INT8 operations, 8 INT4 operations, 16 INT2 operations or 32 INT1 operations. The scalar calculation unit mainly completes scalar (single-point) calculations and nonlinear algorithms, such as the nonlinear activation functions of a CNN network. This unit can be set to an online mode, in which it directly receives the results of the MAC array for further processing, or to an offline mode, in which it loads data from the SRAM and runs independently of the other calculation units. The vector calculation unit mainly comprises several vector calculation modules and mainly performs scalar-vector or vector-vector dot multiplication or addition, such as the linear transformation, data migration, data cropping, data scaling and precision conversion that follow a convolution. Its vector length is consistent with that of the MAC array, and it is configured into online or offline mode through a register: in online mode it receives the calculation results of the scalar calculation unit, and in offline mode its data come from the SRAM.
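As an illustration of the multi-precision capability described above, the short sketch below (not part of the patent) computes the peak number of multiply-add operations a MAC array can issue, assuming the stated 1/2/4/8/16/32 packing ratios describe throughput per clock cycle; the 32x32 array shape is a hypothetical value chosen only for the example.

    # Peak multiply-add operations of a MAC array per clock cycle for the
    # packing ratios stated above (1x FP32, 2x FP16/INT16, 4x INT8, 8x INT4,
    # 16x INT2, 32x INT1). The array shape is illustrative, not from the patent.
    OPS_PER_MAC = {
        "FP32": 1, "FP16": 2, "INT16": 2,
        "INT8": 4, "INT4": 8, "INT2": 16, "INT1": 32,
    }

    def mac_array_ops_per_cycle(rows: int, cols: int, precision: str) -> int:
        """Operations the whole array can issue in one clock cycle."""
        return rows * cols * OPS_PER_MAC[precision]

    if __name__ == "__main__":
        for precision in OPS_PER_MAC:
            print(precision, mac_array_ops_per_cycle(32, 32, precision))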
The storage module is configured to store the data required by the acceleration engine for data processing and the data processed by the data processing module. The storage module consists of a local SRAM, generally with a capacity of hundreds of MB, so it can hold almost all of the acceleration engine's data: the input data it needs, the model parameters of the neural network, and the intermediate or final results of calculation. This architecture, which places the computing units close to the storage unit, lets the data access bandwidth fully meet the large data throughput demanded by the computing units, removes intermediate caching stages and reduces data access latency. In addition, to allow data sharing between adjacent AI Engines and reduce accesses to external DRAM, the SRAM is provided with four direct-access ports (MEM IF), so it can be accessed directly by the AI Engines above, below, to the left and to the right (in the case where a plurality of AI Engines are interconnected by a mesh network to form an AIPU). In this case, neighbouring AI Engines can jointly complete a computation task through appropriate pipelining, and the access frequency of the DRAM can be reduced.
In addition, the acceleration engine further comprises a data reordering module (Data Reshape), which reorders the computed result data to satisfy the data order required by the next operation, including data splicing, splitting, dimension transformation and transposition; in essence it changes the storage order of the data in the SRAM so that the next data read can proceed continuously. It also comprises a stall handler module (Stall Handler), which handles a stall request when the AI Engine's internal computing modules have a memory access conflict, when a computing module lacks input data during pipelined operation, or when other situations require a stall, and which ends the module's stall state once it detects that the stall condition no longer holds. A debugging module (Debug & Trace) is used for debugging the AI Engine.
The artificial intelligence acceleration engine (AI Engine) provided by the invention is an acceleration IP designed specifically for deep learning algorithms and is a constituent unit of an artificial intelligence processing chip (AIPU). To make the AI Engine suitable for deep learning algorithms, the AI Engine architecture of the present invention has the following basic characteristics:
1) it is provided with a large number of multiply-add operation units (MACs) and can perform high-density convolution or matrix multiplication to realize data parallelism;
2) very long instruction words (VLIW) are supported, realizing instruction-level parallelism (ILP) and further improving data processing speed;
3) it is provided with a large number of MAC arrays, scalar calculation units (scalar units) and vector calculation units (vector units), meeting the heterogeneous computing requirements of deep learning algorithms;
4) a large-capacity, high-bandwidth on-chip (local) SRAM caches input data, model parameters and intermediate results, reducing the access latency of external DRAM;
5) multi-precision calculation is supported;
6) a general instruction set is supported, making the AI Engine compatible with general software;
7) a universal high-speed communication interface is supported, and a plurality of AI Engines can be interconnected through an on-chip interconnection network (NoC) to form a more powerful AIPU (AI Process Unit).
The AI Engine is an accelerated calculator customized for deep learning algorithms. It carries a large-capacity SRAM that provides high-bandwidth data access for the computing units, and its high-density MAC array meets the high-density parallel computing requirement of deep learning; the array can be configured for 2D data-parallel computation, 3D data-parallel computation or simple 1D computation as required, and provides a heterogeneous computing function for deep learning tasks. The computing units can form a pipeline, which increases data reuse, reduces SRAM accesses, reduces the computing blockage caused by insufficient data, and improves computing efficiency.
In a preferred embodiment of the present invention, the multi-dimensional multiply-add calculation unit includes a two-dimensional multiply-add calculation unit and a three-dimensional multiply-add calculation unit. Whether to use a two-dimensional multiply-add calculation unit or a three-dimensional multiply-add calculation unit may be selected according to the data to be processed.
In a preferred embodiment of the present invention, as shown in fig. 2, the two-dimensional multiply-add calculating unit includes:
the data vector cache region is configured to store a group of data vectors to be processed in each clock cycle;
the model parameter vector cache region is configured to cache model parameters required by processing data vectors;
the multiplication and addition computing unit array consists of a plurality of rows of multiplication and addition computing units, and each multiplication and addition computing unit is provided with a model parameter vector cache region;
and the accumulator is configured to accumulate the data processed by the multiplication and addition computing unit array to obtain a final result.
The MAC array is the core of the AI Engine and completes high-density calculation. The vector architecture of the unit enables the MAC array to complete parallel computation of a plurality of vectors under one instruction. The compute unit supports the RISC vector instruction set, according to which each vector register may contain N elements, each element being SEW wide. The vector registers may form register banks, and the number of vector registers in each register bank is LMUL. The RISC vector instruction set supports one instruction to perform vector computations in LMUL vector registers. In this architecture, the MAC array corresponds to LMUL vector registers, and one instruction can complete the calculation of the MAC array.
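To make the vector-register parameters concrete, the sketch below (an illustration only; it assumes the grouping mirrors RISC-V-style LMUL/SEW parameters, whereas the patent text only says "RISC instruction set", and the numeric values are hypothetical) computes how many multiply-add lanes one vector instruction drives when the MAC array corresponds to LMUL vector registers of N elements each.

    # One vector instruction covers LMUL vector registers, each holding N
    # elements of SEW bits, so the MAC array performs LMUL * N multiply-adds
    # per instruction. All numbers below are illustrative, not from the patent.

    def elements_per_register(register_bits: int, sew: int) -> int:
        """How many SEW-bit elements fit in one vector register (N)."""
        return register_bits // sew

    def lanes_per_instruction(lmul: int, n_elements: int) -> int:
        """Multiply-add lanes driven by a single vector instruction."""
        return lmul * n_elements

    if __name__ == "__main__":
        N = elements_per_register(register_bits=512, sew=16)   # N = 32
        print("N =", N, "lanes per instruction =", lanes_per_instruction(lmul=8, n_elements=N))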
So that the MAC array can compute continuously in every clock cycle without being blocked by a shortage of data, a data vector buffer and a model parameter vector buffer (weight vector buffer) are provided; these two buffers supply sufficient bandwidth to the MAC array. Meanwhile, the SRAM provides two ports to the MAC array, a data port and a model parameter port, which guarantees the bandwidth requirement of the MAC array.
In addition, in order to reduce the access frequency of the SRAM, the neural network model parameters (weights) are reused as much as possible according to the principle of convolution. Fig. 3 is a schematic diagram of weight reuse over convolution input data; for example, LMUL groups of weight vectors (N vectors per group) are reused M times, as follows:
1) according to the vector instruction, weight vectors of an LMUL group are taken out from the SRAM, and are respectively cached in the weight vector buffer of each MAC row;
2) When the weight caching is complete, calculation begins; at this point the weights are loaded in parallel from the buffers into the MAC array. Then, according to the convolution principle, one group of data vectors is fetched in each clock cycle, stored into the data vector buffer, and broadcast to all LMUL groups for sharing. In fig. 3, each small cell of the weights is a vector; there are seven groups of vectors in total, and each vector contains N elements of width SEW. The input data contains 3 batches (batch = 3) of data (which can be understood as 3 pictures), and one vector of one batch is fetched every clock cycle (in the vector order 1->2->3->4->5->...), this vector being shared by the LMUL groups of weights. Splicing several pictures into one batch increases the number of times each weight vector is reused, reduces the total number of SRAM accesses, and relieves the pressure on bandwidth.
3) Assuming that each weight vector is reused M times, the next weight vectors are continuously loaded from the SRAM and cached into the emptied weight vector buffers during the M clock cycles after calculation starts. In order not to block the reuse calculation of the next group of weights, M >= LMUL must hold; otherwise the next group of weights will not be completely loaded;
4) After the reuse calculation of the previous group of weights is completed, the weights in the buffers are loaded into the whole MAC array in parallel, so the overall calculation process is never blocked.
in the calculation process, weight is continuously cached and calculated, and data is taken out in real time and cached in buffer, so that the condition that the whole calculation process has enough data for MAC calculation is ensured.
In the whole process, the MAC array works at full speed in parallel, and because the SRAM and the buffer guarantee the bandwidth requirement, the calculation of the MAC array cannot be blocked due to insufficient data.
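The weight-reuse schedule in steps 1) to 4) can be summarized by the following behavioural sketch (a simplified software model, not the hardware implementation; the function and variable names are invented for illustration, and it assumes the next group of LMUL weight vectors is loaded at one vector per cycle). It also shows why M >= LMUL is required: the background load of the next group must finish within the current group's M reuse cycles.

    # Behavioural sketch of the double-buffered weight-reuse schedule: while
    # the MAC array reuses the current weight group for M cycles, the next
    # group of LMUL weight vectors is streamed from SRAM into the emptied
    # weight buffers, so computation never stalls.

    def weight_reuse_schedule(num_groups: int, lmul: int, m_reuse: int):
        assert m_reuse >= lmul, "M >= LMUL, otherwise the next group loads too late"
        events = []
        for group in range(num_groups):
            for cycle in range(m_reuse):
                events.append((group, cycle, "compute with current weights"))
                if group + 1 < num_groups and cycle < lmul:
                    # background load of the next group overlaps with compute
                    events.append((group, cycle, f"load next weight vector {cycle}"))
        return events

    if __name__ == "__main__":
        for event in weight_reuse_schedule(num_groups=2, lmul=4, m_reuse=6)[:12]:
            print(event)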
In a preferred embodiment of the present invention, as shown in fig. 4, the three-dimensional multiply-add computing unit is composed of a plurality of two-dimensional multiply-add computing units arranged in parallel. The 3D MAC array unit places several 2D MAC arrays one behind another; the corresponding caches and accumulators are extended into 3-dimensional modules, and the data-parallel computation is likewise extended to 3 dimensions. According to the principle of convolution, and similarly to a 2D MAC array, one group of weight vectors is fetched and broadcast along the third dimension into the weight vector buffers of each 2D MAC array; that is, the weight vectors used by all the 2D MAC arrays belong to the same group.
As in the convolution example shown in fig. 3, in a 2D MAC array the input data vectors are computed sequentially as 1->2->3->4->5->..., but in a 3D MAC array the input data vectors are computed in parallel. Assuming the 3D MAC array contains 4 2D MAC arrays, the input data vectors are computed in the order (1,2,3,4) -> (5,6,7,8) -> ..., the vectors within each parenthesis being executed in parallel, one per 2D MAC array. The data throughput of the 3D MAC array is therefore P times that of a 2D MAC array, where P is the number of 2D MAC arrays (here P = 4).
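The difference between the two dispatch orders can be illustrated with a small sketch (illustrative only; the helper functions are written for this example): a 2D array consumes one data vector per cycle, while a 3D array built from P parallel 2D arrays consumes P per cycle, giving P times the throughput.

    # 2D MAC array: data vectors consumed one per cycle (1, 2, 3, ...).
    # 3D MAC array built from P parallel 2D arrays: vectors consumed in
    # chunks of P per cycle, e.g. (1,2,3,4) -> (5,6,7,8) for P = 4.

    def dispatch_2d(vectors):
        return [(v,) for v in vectors]          # one vector per cycle

    def dispatch_3d(vectors, p):
        return [tuple(vectors[i:i + p]) for i in range(0, len(vectors), p)]

    if __name__ == "__main__":
        data = list(range(1, 9))
        print("2D order:", dispatch_2d(data))           # 8 cycles
        print("3D order (P=4):", dispatch_3d(data, 4))  # 2 cycles -> 4x throughput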
In a preferred embodiment of the present invention, the two-dimensional multiply-add computing unit is configured to buffer multiple sets of model parameters required for data processing into each model parameter vector buffer in the multiply-add computing unit array simultaneously, and one of the multiple sets of model parameters is used for data processing in each clock cycle.
In a preferred embodiment of the invention, the accumulator comprises:
an intermediate result buffer configured to buffer intermediate results of the accumulated computations;
The final result cache region is configured to cache the final result of the accumulation calculation and send the final result to the storage module. Two groups of caches are arranged inside the accumulator (Accumulator): an Assembly RAM (intermediate result cache region) and a Delivery RAM (final result cache region). The Assembly RAM caches the intermediate result of the calculation; each new partial result is accumulated onto it and stored back into the Assembly RAM, and this repeats until the convolution result is complete, at which point the output data are cached into the Delivery RAM and the final result is stored to the SRAM. If the vector unit and the MAC array calculation unit are pipelined, the result in the Delivery RAM is cached directly into the buffer of the vector calculation unit.
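A minimal model of this two-buffer accumulator might look like the sketch below (a software illustration under the assumption that partial sums arrive one vector at a time; the class and method names are invented, not the patent's terminology). Partial results accumulate in an assembly buffer until the convolution output is complete, then move to a delivery buffer from which they are handed to the SRAM, or directly to the vector unit when the two units are pipelined.

    # Sketch of the accumulator's two cache groups: an assembly buffer that
    # accumulates partial sums, and a delivery buffer that holds a completed
    # result until it is written to SRAM (or forwarded to the vector unit).

    class Accumulator:
        def __init__(self, length: int):
            self.assembly = [0.0] * length   # intermediate result cache
            self.delivery = None             # final result cache

        def accumulate(self, partial):
            """Add one partial result onto the cached intermediate result."""
            for i, value in enumerate(partial):
                self.assembly[i] += value

        def finish(self):
            """Move the completed result to the delivery buffer and reset."""
            self.delivery, self.assembly = self.assembly, [0.0] * len(self.assembly)
            return self.delivery             # e.g. written to SRAM or vector unit

    if __name__ == "__main__":
        acc = Accumulator(4)
        acc.accumulate([1, 2, 3, 4])
        acc.accumulate([1, 1, 1, 1])
        print(acc.finish())                  # [2.0, 3.0, 4.0, 5.0]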
In a preferred embodiment of the present invention, the scalar calculation unit includes:
the data cache region is configured to store data before data processing;
The calculation unit comprises an addition and subtraction calculation area, a multiplication and division calculation area, a logic operation area and a nonlinear operation area; it receives an instruction for data processing, reads data from the data cache region into the corresponding area based on the instruction, and writes the processed data back to the data cache region. The principle of the scalar calculation unit is shown in fig. 5: the load/store function reads data from the SRAM into the buffer according to the instruction. Since access conflicts may occur when several computing modules access the SRAM, the controller puts as much input data as possible into the buffer whenever the SRAM is accessible. The input data in the buffer are loaded into the addition, subtraction, multiplication, division, logic operation and nonlinear operation units; the output results are cached in the buffer, and the controller requests a write to the SRAM. The controller contains a state machine for the SIMD mode. The data throughput of the scalar calculation unit must match that of the MAC array, so when the data throughput of the MAC array is high, this unit may contain several computing modules and several cascade buses, enabling multi-module parallel operation, raising data throughput and preventing the MAC array's calculation from being blocked.
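The scalar unit's routing and its online/offline behaviour can be sketched as follows (a simplified software model, not the RTL; the operation names, the mode argument and the sigmoid example are illustrative assumptions): in online mode the operand comes from the MAC array's output, in offline mode it is loaded from the SRAM, and in both cases the result is buffered before being written back.

    # Simplified model of the scalar calculation unit: data enter either from
    # the MAC array (online mode) or from SRAM (offline mode), are routed to
    # one of the operation areas, and the result is buffered before being
    # written back to SRAM or handed on to the vector unit.
    import math
    from typing import Optional

    BLOCKS = {
        "add":    lambda a, b: a + b,
        "mul":    lambda a, b: a * b,
        "logic":  lambda a, b: float(int(a) & int(b)),
        "nonlin": lambda a, _b: 1.0 / (1.0 + math.exp(-a)),   # e.g. a CNN activation
    }

    def scalar_unit(op: str, a: float, b: float = 0.0, mode: str = "online",
                    mac_result: Optional[float] = None) -> float:
        """Route one operand set through the selected operation area."""
        x = mac_result if mode == "online" else a   # online: data come from the MAC array
        return BLOCKS[op](x, b)                     # result is buffered, then written back

    if __name__ == "__main__":
        print(scalar_unit("nonlin", a=0.0, mode="offline"))                      # 0.5
        print(scalar_unit("add", a=0.0, b=2.0, mode="online", mac_result=3.0))   # 5.0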
In a preferred embodiment of the present invention, the data processing module is further configured to connect the scalar calculation unit to the multidimensional multiply-add calculation unit through a plurality of cascade buses and to connect the vector calculation unit to the scalar calculation unit, so that both the scalar calculation unit and the vector calculation unit can receive the data processed by the multidimensional multiply-add calculation unit and further process it. When a computing unit finishes its processing, if the next unit is needed for further computation, the processed data are stored in the data cache, from which the next computing unit reads and processes them; if no further unit is needed, the processed data are stored in the SRAM.
In a preferred embodiment of the present invention, the scalar calculation unit and the vector calculation unit are both provided with an online mode and an offline mode, and when the online mode is set, the scalar calculation unit and the vector calculation unit both receive the data processed by the multidimensional multiply-add calculation unit and further process the processed data, and when the offline mode is set, the scalar calculation unit and the vector calculation unit only receive the data of the storage module for data processing.
In view of the above objects, a second aspect of the embodiments of the present invention provides an artificial intelligence processor, which includes the artificial intelligence acceleration engine described above. The AI Engine is an IP core, and multiple AI Engines may be interconnected through a mesh network-on-chip (NoC) to form an architecture with greater computing power (for example, the artificial intelligence processor AIPU), as shown in fig. 6. The MAC array inside an AI Engine can be made larger or smaller, and the number of AI Engines can likewise be scaled, which is very flexible and suits different hardware scales. The AI Engines interconnected through the network support single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD) modes, making software programming more flexible.
According to the technical scheme, high-bandwidth data access can be provided for the computing unit, the high-density parallel computing requirement of deep learning is met, computing blockage caused by insufficient data is reduced, and computing efficiency is improved.
The embodiments described above, particularly any "preferred" embodiments, are possible examples of implementations and are presented merely to clearly understand the principles of the invention. Many variations and modifications may be made to the above-described embodiments without departing from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure and protected by the following claims.

Claims (10)

1. An artificial intelligence acceleration engine, comprising:
a cache module configured to cache instructions and configuration data for a current task of the acceleration engine;
a decode module configured to fetch the instruction from the cache module and decode the instruction;
the data processing module comprises a multidimensional multiply-add calculation unit, a scalar calculation unit and a vector calculation unit, the data processing module receives the instruction decoded by the decoding module and sends data to one or more of the multidimensional multiply-add calculation unit, the scalar calculation unit and the vector calculation unit for data processing based on the decoded instruction, and the data processing by the data processing module comprises the following steps: LMUL groups of weight vectors are taken out according to the vector instruction and are respectively cached in the model parameter vector cache region of each MAC row; a group of data vectors is taken out in each clock cycle, stored in the data vector cache region and sent to all LMUL groups of weight vectors for sharing; the data are calculated with the weight vectors during the clock cycles in which the weight vectors are reused, while weight vectors are continuously loaded from the SRAM and cached into the emptied model parameter vector cache region; and when the reuse calculation of the previous group of weight vectors is completed, the weight vectors in the model parameter vector cache region are loaded into the whole MAC array in parallel;
the storage module is configured to store data required by the acceleration engine for data processing and data processed by the data processing module.
2. The acceleration engine of claim 1, wherein the multi-dimensional multiply-add computation unit comprises a two-dimensional multiply-add computation unit and a three-dimensional multiply-add computation unit.
3. The acceleration engine of claim 2, wherein the two-dimensional multiply-add computation unit comprises:
a data vector buffer configured to store a set of data vectors to be processed per clock cycle;
a model parameter vector cache region configured to cache model parameters required for processing the data vectors;
a multiply-add computing unit array, wherein the multiply-add computing unit array is composed of a plurality of rows of multiply-add computing units, and each multiply-add computing unit is provided with a model parameter vector cache region;
an accumulator configured to accumulate the data processed by the array of multiply-add computation units to obtain a final result.
4. The acceleration engine of claim 3, wherein the three-dimensional multiply-add computing unit is comprised of a plurality of the two-dimensional multiply-add computing units arranged in parallel.
5. The acceleration engine of claim 3, wherein the two-dimensional multiply-add computing unit is configured to buffer sets of model parameters required for data processing into each of the model parameter vector buffers in the multiply-add computing unit array at the same time, and one of the sets of model parameters is used for data processing every clock cycle.
6. The acceleration engine of claim 3, wherein the accumulator comprises:
an intermediate result buffer configured to buffer intermediate results of the accumulated computations;
a final result buffer configured to buffer a final result of the accumulation calculation and send the final result to the storage module.
7. The acceleration engine of claim 1, wherein the scalar calculation unit comprises:
the data cache region is configured to store data before data processing;
the calculation unit comprises an addition and subtraction calculation area, a multiplication and division calculation area, a logic operation area and a nonlinear operation area, receives an instruction for data processing, reads data from the data cache area to a corresponding block for data processing based on the instruction, and retransmits the processed data to the data cache area.
8. The acceleration engine of claim 1, wherein the data processing module is further configured to connect the scalar calculation unit to the multi-dimensional multiply-add calculation unit and the vector calculation unit to the scalar calculation unit via a plurality of cascaded buses, and wherein both the scalar calculation unit and the vector calculation unit are capable of receiving the data processed by the multi-dimensional multiply-add calculation unit and further processing the processed data.
9. The acceleration engine of claim 1, wherein the scalar calculation unit and the vector calculation unit are both configured to have an online mode and an offline mode, and when the online mode is set, the scalar calculation unit and the vector calculation unit both receive the data processed by the multidimensional multiplication and addition calculation unit and further process the processed data, and when the offline mode is set, the scalar calculation unit and the vector calculation unit only receive the data of the storage module for data processing.
10. An artificial intelligence processor comprising an artificial intelligence acceleration engine as claimed in any one of claims 1-9.
CN202011018763.0A 2020-09-24 2020-09-24 Artificial intelligence acceleration engine and artificial intelligence processor Active CN112232517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018763.0A CN112232517B (en) 2020-09-24 2020-09-24 Artificial intelligence acceleration engine and artificial intelligence processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011018763.0A CN112232517B (en) 2020-09-24 2020-09-24 Artificial intelligence acceleration engine and artificial intelligence processor

Publications (2)

Publication Number Publication Date
CN112232517A CN112232517A (en) 2021-01-15
CN112232517B true CN112232517B (en) 2022-05-31

Family

ID=74108119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011018763.0A Active CN112232517B (en) 2020-09-24 2020-09-24 Artificial intelligence acceleration engine and artificial intelligence processor

Country Status (1)

Country Link
CN (1) CN112232517B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set
CN113535637B (en) * 2021-07-20 2022-11-15 珠海市一微星科技有限公司 Operation acceleration unit and operation method thereof
CN113535638B (en) * 2021-07-20 2022-11-15 珠海市一微星科技有限公司 Parallel operation acceleration system and operation method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN109657782B (en) * 2018-12-14 2020-10-27 安徽寒武纪信息科技有限公司 Operation method, device and related product
CN109740747B (en) * 2018-12-29 2019-11-12 北京中科寒武纪科技有限公司 Operation method, device and Related product

Also Published As

Publication number Publication date
CN112232517A (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant