CN114492777A - Computing power extensible multi-core neural network tensor processor


Info

Publication number
CN114492777A
Authority
CN
China
Prior art keywords
core
tensor
neural network
cores
ltc
Prior art date
Legal status
Pending
Application number
CN202210102921.3A
Other languages
Chinese (zh)
Inventor
罗闳訚
周志新
何日辉
尤培坤
汤梦饶
Current Assignee
Xiamen Yipu Intelligent Technology Co ltd
Original Assignee
Xiamen Yipu Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Yipu Intelligent Technology Co ltd
Priority to CN202210102921.3A
Publication of CN114492777A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a computing power extensible multi-core neural network tensor processor, which comprises a PCIE controller, M MTC cores and M SDRAM controllers. Each MTC core comprises S STC cores, and each STC core comprises L LTC cores; all LTC cores in the same multi-core neural network tensor processor are configured as the same functional module. The PCIE controller implements access control between the multi-core neural network tensor processor and external devices; each SDRAM controller accesses the off-chip SDRAM memory corresponding to its MTC core. The multi-core neural network tensor processor adopts a modular multiplexing design: a tensor processor of a given computing power specification is formed by repeatedly instantiating a minimum computing module and combining the instances. This structure greatly reduces the complexity of design and verification.

Description

Computing power extensible multi-core neural network tensor processor
Technical Field
The invention relates to the field of neural network tensor processors, and in particular to a computing power extensible multi-core neural network tensor processor.
Background
Existing neural network tensor processors typically have a fixed internal structure and provide fixed computational performance (computing power for short). For example, in earlier patent 1 (titled "Neural network multi-core tensor processor", application No. 202011423696.0) and earlier patent 2 (titled "Neural network tensor processor", application No. 202011421828.6), the computing power of the tensor processor is determined by the amount of computing resources, and within the processor those computing resources must be matched to resources such as storage capacity and bus bandwidth. Once the structure of a tensor processor is determined, the maximum computing power it can support is essentially fixed and difficult to change.
On the other hand, for a neural network tensor processor with very large computing power (e.g., 4096 TOPS), the chip area may reach several hundred square millimeters. For a chip of that size, front-end design, verification, back-end layout, and related work become very time-consuming and complex, requiring a large investment of manpower and material resources.
Disclosure of Invention
In order to solve the above problems, the present invention provides a computing power extensible multi-core neural network tensor processor. The tensor processor adopts the design idea of modular multiplexing: a minimum computing unit is designed and then multiplexed to form structured computing modules, realizing a computing power extensible multi-core neural network tensor processor structure. This modular multiplexing method greatly reduces the complexity of design and verification. Through simple combinations of modules, configurations of various computing powers can be realized, which greatly reduces the design complexity of high-computing-power neural network tensor processors, shortens the design and verification cycle, and ultimately meets the design requirements of tensor processors of various computing powers at minimal design and verification cost.
In order to achieve the purpose, the invention provides a computing power extensible multi-core neural network tensor processor, which comprises a PCIE controller, M MTC cores and M SDRAM controllers; each MTC core comprises S STC cores, and each STC core comprises L LTC cores; all LTC cores in the same multi-core neural network tensor processor are configured to be the same functional module; the PCIE controller is used for realizing access control of the multi-core neural network tensor processor and external equipment; the SDRAM controller is used for accessing the off-chip SDRAM memory corresponding to the MTC core.
Preferably, S is a multiple of 2 and L is a multiple of 2.
Further, the LTC core includes a 4D tensor core, a 1D tensor core, an instruction control unit, a local cache unit, a memory read unit, a memory write unit, and a special unit. The 4D tensor core is used for realizing basic operations on 4D tensor data; the 1D tensor core is used for realizing basic operations on 1D data; the instruction control unit is used for acquiring configuration parameters and instructions of the 4D tensor core and the 1D tensor core from an external memory; the local cache unit is used for storing input data required by the 4D tensor core; the memory read unit is used for providing each module in the minimum computing module with direct read/write access to an external memory; the special unit is used for realizing computing functions related to coordinate transformation.
Further preferably, the LTC core comprises one 4D tensor core and two 1D tensor cores.
Further, the 4D tensor core contains P1 FP16 MACs and 2 × P1 INT8 MACs inside.
More preferably, P1 is 2 to the power of M1, and M1 is not less than 8.
Further, the 1D tensor core includes P2 FP16 MACs.
More preferably, P2 is 2 to the power of M2, and M2 is 4, 5 or 6.
Further, the basic operations on 4D tensor data include multiplication, addition, multiply-accumulate and maximum value; the basic operations on 1D data include multiplication, addition, linear activation operations and nonlinear activation operations.
Furthermore, a cascade path is arranged between the 4D tensor core and the 1D tensor core, and is used for directly inputting the output data of the 4D tensor core into the 1D tensor core.
Further, the memory read and write units are configured as follows: the 4D tensor core has two independent memory read units for reading, respectively, the data and the parameters required by 4D computation; the 1D tensor core has two independent memory read units for reading, respectively, the data and the parameters required by 1D computation; the instruction control unit and the special unit share one memory read unit; the 1D tensor core has an independent memory write unit for writing out the 4D/1D tensor core computation results; the special unit has an independent memory write unit for writing out its computation results.
Further, the configuration parameters are used for configuring the 4D tensor core as a specific computing function and configuring the 1D tensor core as a specific computing function; the instructions are for implementing control of a computational process.
Further preferably, the capacity of the local cache unit is P3 × 32KB, where P3 is an integer not less than 8.
Further, the STC core of the neural network multi-core tensor processor also comprises a shared memory and a data control unit; the shared memory is used for caching input data and intermediate data which are commonly required by all LTC cores in the STC or parameters which are commonly required; the data control unit is used for pre-fetching the shared data or shared parameters existing in the off-chip SDRAM into the shared memory in advance.
Further, the interconnection relationship among the MTC cores, the STC cores, and the LTC cores is as follows: an LTC core can only access its own local cache unit and cannot access the local cache units of other LTC cores; an LTC core can access the shared memory of the STC core to which it belongs and can also access the shared memories of other STC cores; an LTC core can access the off-chip SDRAM memory of the MTC core to which it belongs and can also access the off-chip SDRAM memories of other MTC cores.
Further, the capacity of the shared memory is greater than or equal to the capacity of the local cache unit.
The invention realizes the following technical effects:
(1) the multi-core neural network tensor processor adopts a modular multiplexing design scheme, and forms the neural network tensor processor with a certain computational power specification by repeatedly calling the minimum computation modules and combining the minimum computation modules together.
(2) The multi-core neural network tensor processor can form tensor processors with various computing powers by flexibly setting the number of the MTC cores, the STC cores and the LTC cores and combining different LTC cores, STC cores and MTC cores, so that a tensor processor framework with extensible computing powers is realized.
Drawings
FIG. 1 is a block diagram of the internal architecture of a minimal computing module (LTC core) of the present invention;
FIG. 2 is a block diagram of the internal architecture of the STC core of the present invention;
FIG. 3 is a block diagram of the MTC core internal structure of the invention;
fig. 4 is a block diagram of the internal structure of the neural network multi-core tensor processor of the present invention.
Detailed Description
To further illustrate the various embodiments, the present invention provides the accompanying figures. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The invention provides a computing power extensible multi-core neural network tensor processor. The tensor processor adopts a modular multiplexing design scheme, and forms the neural network tensor processor with a certain computational power specification by repeatedly calling the minimum computation modules and combining the minimum computation modules together.
Let the computing power of the minimum computing module be A. Computing power extensibility means that the computing power specification of the multi-core neural network tensor processor is determined by the number N of minimum computing modules actually instantiated, the computing power being exactly A × N, where N can be any positive integer.
Although the computing power is extensible, the neural network computing functionality of the multi-core neural network tensor processor is the same at every computing power. The function of the multi-core neural network tensor processor is determined by the function of the minimum computing module: a tensor processor with N = 1 is functionally identical to a tensor processor with N equal to any other positive integer.
The minimum computing module (LTC core) is composed of a 4D Tensor Core, a 1D Tensor Core, an instruction control unit (Instruction Unit), a local cache unit (Local Memory), a memory read unit (LD), a memory write unit (ST), and a special unit; its internal structure is shown in fig. 1.
In this embodiment, the minimum computing module includes one 4D Tensor Core and two 1D Tensor Cores (in other implementations, the number of 1D tensor cores may differ, e.g., 1). The 4D tensor core includes 1024 FP16 MACs and 2048 INT8 MACs (in other implementations these counts may differ; the 4D tensor core is usually set to contain P1 FP16 MACs and 2 × P1 INT8 MACs, where P1 is 2 to the power of M1 and M1 is not less than 8). Each 1D tensor core contains 16 FP16 MACs (in other implementations this count may differ; the 1D tensor core is usually set to contain P2 FP16 MACs, where P2 is 2 to the power of M2 and M2 is 4, 5 or 6).
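For illustration only (this Python sketch is not part of the patent; the function name ltc_mac_config and its return format are assumptions), the MAC-count parameterization just described can be expressed and checked as follows:

```python
def ltc_mac_config(m1: int, m2: int) -> dict:
    """MAC counts of one LTC core: P1 = 2**M1 FP16 MACs and 2*P1 INT8
    MACs in the 4D tensor core; P2 = 2**M2 FP16 MACs per 1D tensor core."""
    assert m1 >= 8, "4D tensor core requires M1 >= 8"
    assert m2 in (4, 5, 6), "1D tensor core requires M2 in {4, 5, 6}"
    p1, p2 = 2 ** m1, 2 ** m2
    return {"4d_fp16_macs": p1, "4d_int8_macs": 2 * p1, "1d_fp16_macs": p2}

# The embodiment above: M1 = 10 gives 1024 FP16 / 2048 INT8 MACs,
# M2 = 4 gives 16 FP16 MACs per 1D tensor core.
print(ltc_mac_config(10, 4))
```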
Most of the computing functions of the minimum computing module are realized by the 4D tensor core and the 1D tensor core; these functions include convolution, full connection, pooling and the like. The main function of the 4D tensor core is to implement basic operations on 4D tensor data (data with (n, c, h, w) dimensions), including multiplication, addition, multiply-accumulate, maximum value, and the like. The main function of the 1D tensor core is to implement basic operations on 1D data (e.g., data with dimension (w)), including multiplication, addition, various linear activation operations (e.g., ReLU), various nonlinear activation operations (e.g., Sigmoid), and the like.
A cascade path is provided between the 4D tensor core and the 1D tensor core: output data of the 4D tensor core can be input directly into the 1D tensor core, so that a 1D operation task can be completed immediately after a 4D operation task. One pass of the minimum computing module can therefore execute several operators at once, such as convolution + ReLU, achieving high computational efficiency.
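As a purely illustrative sketch (NumPy standing in for the hardware datapath; the function fused_conv1x1_relu, the shapes, and the use of a 1×1 convolution as the simplest 4D operation are assumptions, not the patent's implementation), the cascade of a 4D operation directly into a 1D activation can be modeled as:

```python
import numpy as np

def fused_conv1x1_relu(x, w):
    """x: (n, c, h, w) input; w: (c_out, c_in) weights of a 1x1 convolution."""
    y = np.einsum("oc,nchw->nohw", w, x)  # 4D tensor core stage (conv as matmul)
    return np.maximum(y, 0)  # 1D tensor core stage (ReLU); fused, no memory round trip

x = np.random.randn(1, 8, 4, 4).astype(np.float16)
w = np.random.randn(16, 8).astype(np.float16)
print(fused_conv1x1_relu(x, w).shape)  # (1, 16, 4, 4)
```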
The instruction control unit is mainly used for acquiring configuration parameters and instructions of the 4D tensor core and the 1D tensor core from an external memory (the external memory may be an off-chip SDRAM or other shared memory in a tensor processor chip). The configuration parameters are used to configure the 4D tensor kernel as a specific computational function (e.g., a convolution computational function) and the 1D tensor kernel as a specific computational function (e.g., a Relu computational function). The instructions are used to implement control of the computational flow, such as start, pause, end, etc.
The local cache unit capacity is set to a multiple of 32 KB, and the multiple is usually no less than 8; a typical capacity is 320 KB. The main function of the local cache unit is to store the input data required by the 4D tensor core and to rearrange that data into the order required by the 4D tensor core, which reduces the complexity of the subsequent computing circuits. The 1D tensor core needs no separate cache unit: its input can come from the output of the 4D tensor core (the 1D tensor core can be regarded as indirectly using the local cache of the 4D tensor core), or directly from external memory, in which case the input data is computed and output without being cached.
The memory read unit LD and the memory write unit ST mainly provide each module within the minimum computing module with direct read/write access to external memory. The 4D tensor core has two independent memory read units LD for reading, respectively, the data and the parameters required by 4D computation; the 1D tensor core has two independent memory read units LD for reading, respectively, the data and the parameters required by 1D computation; the instruction control unit and the special unit share one memory read unit LD; the 1D tensor core has an independent memory write unit ST for writing out the 4D/1D tensor core computation results (the computation results of the 4D tensor core are written out through the 1D tensor core); and the special unit has an independent memory write unit ST for writing out its computation results.
The special unit mainly implements computing functions related to coordinate transformation. It contains no computing resources such as multipliers or adders; its main function is to read input data and output it in a new data arrangement. Typical functions of the special unit are Reshape, Concat, and the like.
Effectively organizing multiple LTC cores together forms a neural network multi-core tensor processor with large computing power. The STC core is the first organizational level of the neural network multi-core tensor processor, as shown in fig. 2.
Each STC core comprises L LTC cores; preferably, L is a multiple of 2, with typical values of 4 or 8. These LTC cores are identical in design, and the multi-core design is achieved by repeatedly instantiating the same LTC core. In addition, the STC core further includes a shared memory (Shared Memory) and a data control unit (Streaming Unit).
The capacity of the shared memory is usually the same as or slightly larger than the capacity of the local cache unit of the LTC core (for example, the typical capacity of the local cache unit of the LTC core is 320KB, and the typical capacity of the shared memory is 352 KB). The shared memory is used for buffering input data and intermediate data which are commonly required by all LTC cores in the STC or parameters which are commonly required.
Depending on the computing mode of the multi-core neural network tensor processor, the LTC cores may apply different parameters to the same data, or the same parameters to different data. If the shared data or parameters reside in off-chip SDRAM, the data control unit can prefetch them into the shared memory in advance; during computation the LTC cores then fetch the required data or parameters directly from the shared memory, saving off-chip memory bandwidth. If the shared memory capacity is insufficient to hold the shared data or parameters, the LTC cores read them directly from the off-chip SDRAM during computation.
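The prefetch decision described here can be sketched as follows (an illustration under assumed names; the 352 KB capacity is the typical shared-memory size mentioned above, and plan_reads is a hypothetical helper, not patent text):

```python
SHARED_MEM_CAPACITY = 352 * 1024  # typical STC shared-memory size (352 KB)

def plan_reads(shared_bytes: int) -> str:
    """Decide where the LTC cores fetch shared data/parameters from."""
    if shared_bytes <= SHARED_MEM_CAPACITY:
        # Data control unit prefetches once into shared memory; all LTC
        # cores then read it there, saving off-chip memory bandwidth.
        return "prefetch_to_shared_memory"
    # Too large to stage: LTC cores read directly from off-chip SDRAM.
    return "read_from_sdram"

print(plan_reads(256 * 1024))   # prefetch_to_shared_memory
print(plan_reads(1024 * 1024))  # read_from_sdram
```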
The MTC core is the second organizational level of the neural network multi-core tensor processor, as shown in fig. 3.
Each MTC core comprises S STC cores; preferably, S is a multiple of 2, with typical values of 4 or 8. These STC cores are identical in design, and designs with more cores can be realized by repeatedly instantiating the same STC core.
All STC cores in an MTC core access the same off-chip SDRAM memory through the same SDRAM controller. The computing units in an MTC core therefore use a three-level storage hierarchy: LTC local cache unit -> STC shared memory -> off-chip SDRAM memory. The off-chip SDRAM memory stores all data required by all computing units, the STC shared memory stores data required by several LTC cores, and the LTC local cache unit stores data required by the computation of the units inside one LTC core.
An MTC core comprises a plurality of STC cores, and each STC core comprises a plurality of LTC cores; the interconnection relationship among the LTC cores is as follows:
an LTC core can only access its local cache unit, but cannot access local cache units of other LTC cores;
the LTC core can access the shared memory of the STC core to which it belongs, and can also access the shared memory of other STC cores;
the LTC core can access the off-chip SDRAM of the MTC core to which it belongs, and can also access the off-chip SDRAM of other MTC cores.
in the MTC core, only the LTC local cache unit is exclusive to each LTC core and cannot be accessed by other LTC cores, and the STC shared memory and the off-chip SDRAM memory are shared and can be accessed by any STC core in the MTC core. In the initial operation stage of the multi-core neural network tensor processor, all data required by calculation are stored in an off-chip SDRAM memory. During operation, if certain data is accessed by multiple STC cores of an MTC core, the data is prefetched into an STC shared memory, if desired. Further, during operation, some data may be prefetched into the LTC local cache unit as needed.
The highest level organization mode of the neural network multi-core tensor processor is shown in fig. 4.
The neural network multi-core tensor processor comprises a PCIE controller (PCI Express controller), a plurality of MTC cores, and a plurality of SDRAM controllers, where the number of MTC cores is the same as the number of SDRAM controllers.
The neural network multi-core tensor processor comprises M MTC cores, each MTC core comprises S STC cores, and each STC core comprises L LTC cores. By flexibly setting the numbers of MTC cores, STC cores and LTC cores, and combining LTC cores, STC cores and MTC cores in different ways, tensor processors of various computing powers can be formed, thereby realizing a computing power extensible tensor processor architecture.
Assuming that the computing power of an LTC core is A, the computing power of the neural network multi-core tensor processor is M × S × L × A, where M, S and L can each be any positive integer. The neural network multi-core tensor processor therefore has flexible computing power extensibility: by setting different values of M, S and L, tensor processors of different computing powers can be realized. For example, if the computing power A of the LTC core is 4 TOPS and the processor is configured with M = 8, S = 8 and L = 16, the total computing power of the tensor processor is 8 × 8 × 16 × 4 = 4096 TOPS.
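A quick check of this computing-power formula, using the worked example above (an illustration; total_tops is an assumed helper name):

```python
def total_tops(m: int, s: int, l: int, a_tops: float) -> float:
    """Total computing power: M MTC cores x S STC cores x L LTC cores x A."""
    return m * s * l * a_tops

print(total_tops(m=8, s=8, l=16, a_tops=4))  # 4096 TOPS
```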
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (16)

1. A computing power extensible multi-core neural network tensor processor is characterized by comprising a PCIE controller, M MTC cores and M SDRAM controllers; each MTC core comprises S STC cores, and each STC core comprises L LTC cores; the LTC cores are minimum computation modules, and all the LTC cores in the same multi-core neural network tensor processor are configured to be the same functional module;
the PCIE controller is used for realizing access control of the multi-core neural network tensor processor and external equipment;
the SDRAM controller is used for accessing the off-chip SDRAM memory corresponding to the MTC core.
2. The computing power extensible multi-core neural network tensor processor of claim 1, wherein S is a multiple of 2 and L is a multiple of 2.
3. The computing power extensible multi-core neural network tensor processor of claim 1, wherein the LTC core comprises a 4D tensor core, a 1D tensor core, an instruction control unit, a local cache unit, a memory read unit, a memory write unit, and a special unit; the 4D tensor core is used for realizing basic operations on 4D tensor data; the 1D tensor core is used for realizing basic operations on 1D data; the instruction control unit is used for acquiring configuration parameters and instructions of the 4D tensor core and the 1D tensor core from an external memory; the local cache unit is used for storing input data required by the 4D tensor core; the memory read unit is used for providing each module in the minimum computing module with direct read/write access to an external memory; the special unit is used for realizing computing functions related to coordinate transformation.
4. The computing power extensible multi-core neural network tensor processor of claim 2, wherein the LTC core comprises one 4D tensor core and two 1D tensor cores.
5. The computing power extensible multi-core neural network tensor processor of claim 3, wherein the 4D tensor core internally contains P1 FP16 MACs and 2 × P1 INT8 MACs.
6. The computing power extensible multi-core neural network tensor processor of claim 5, wherein P1 is 2 to the power of M1 and M1 is not less than 8.
7. The computing power extensible multi-core neural network tensor processor of claim 3, wherein the 1D tensor core internally contains P2 FP16 MACs.
8. The computing power extensible multi-core neural network tensor processor of claim 7, wherein P2 is 2 to the power of M2 and M2 is 4, 5 or 6.
9. The computing power extensible multi-core neural network tensor processor of claim 3, wherein the basic operations on 4D tensor data comprise multiplication, addition, multiply-accumulate and maximum value; the basic operations on 1D data comprise multiplication, addition, linear activation operations and nonlinear activation operations.
10. The computing power extensible multi-core neural network tensor processor of claim 3, wherein a cascade path is provided between the 4D tensor core and the 1D tensor core for inputting output data of the 4D tensor core directly into the 1D tensor core.
11. The computing power extensible multi-core neural network tensor processor of claim 10, wherein the memory read and write units are configured as follows: the 4D tensor core has two independent memory read units for reading, respectively, the data and the parameters required by 4D computation; the 1D tensor core has two independent memory read units for reading, respectively, the data and the parameters required by 1D computation; the instruction control unit and the special unit share one memory read unit; the 1D tensor core has an independent memory write unit for writing out the 4D/1D tensor core computation results; the special unit has an independent memory write unit for writing out its computation results.
12. The computing power extensible multi-core neural network tensor processor of claim 3, wherein the configuration parameters are used for configuring the 4D tensor core for a specific computing function and configuring the 1D tensor core for a specific computing function; the instructions are used for implementing control of the computation process.
13. The computing power extensible multi-core neural network tensor processor of claim 3, wherein the capacity of the local cache unit is P3 × 32KB, where P3 is an integer not less than 8.
14. The computing power extensible multi-core neural network tensor processor of claim 1, wherein the STC core further comprises a shared memory and a data control unit;
the shared memory is used for caching input data and intermediate data commonly required by all LTC cores in the STC core, or commonly required parameters;
the data control unit is used for prefetching shared data or shared parameters residing in the off-chip SDRAM into the shared memory in advance.
15. The computing power extensible multi-core neural network tensor processor of claim 14, wherein the interconnection relationship among the MTC cores, the STC cores and the LTC cores is: an LTC core can only access its own local cache unit and cannot access the local cache units of other LTC cores; an LTC core can access the shared memory of the STC core to which it belongs and can also access the shared memories of other STC cores; an LTC core can access the off-chip SDRAM memory of the MTC core to which it belongs and can also access the off-chip SDRAM memories of other MTC cores.
16. The computing power extensible multi-core neural network tensor processor of claim 14, wherein the capacity of the shared memory is greater than or equal to the capacity of the local cache unit.

Priority Applications (1)

Application Number: CN202210102921.3A
Priority Date: 2022-01-27
Filing Date: 2022-01-27
Title: Computing power extensible multi-core neural network tensor processor

Publications (1)

Publication Number: CN114492777A
Publication Date: 2022-05-13

Family

ID=81476037

Family Applications (1)

Application Number: CN202210102921.3A
Title: Computing power extensible multi-core neural network tensor processor
Priority Date: 2022-01-27
Filing Date: 2022-01-27
Status: Pending

Country Status (1)

Country: CN; Publication: CN114492777A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination