CN115796248A - Multi-core tensor processor based on distributed on-chip storage


Info

Publication number
CN115796248A
Authority
CN
China
Prior art keywords
module, read, write, channel, storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211331229.4A
Other languages
Chinese (zh)
Inventor
罗闳訚
汤梦饶
周志新
何日辉
尤培坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yipu Intelligent Technology Co ltd
Original Assignee
Xiamen Yipu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yipu Intelligent Technology Co ltd filed Critical Xiamen Yipu Intelligent Technology Co ltd
Priority to CN202211331229.4A
Publication of CN115796248A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Time-Division Multiplex Systems (AREA)

Abstract

The invention discloses a multi-core tensor processor based on distributed on-chip storage. The multi-core tensor processor comprises a plurality of storage-compute integrated modules. Each storage-compute integrated module comprises two master device interfaces, M0 and M1, and two slave device interfaces, S0 and S1; each master and slave device interface consists of a read/write channel and a read/write return channel. The storage-compute integrated modules are interconnected end to end. Each storage-compute integrated module comprises a tensor processor, an on-chip memory, MUX2 modules, a MUX3 module, DEMUX2 modules, a DEMUX3 module, an address decoding module, a tag decoding module, an arbitration module and the like; the tensor processor and the on-chip memory within a storage-compute integrated module form a complete computing system capable of executing a neural network algorithm. The multi-core tensor processor offers high storage bandwidth and excellent scalability, enabling a large-scale multi-core tensor processor while maintaining high bandwidth and without increasing interconnect complexity.

Description

Multi-core tensor processor based on distributed on-chip storage
Technical Field
The invention relates to the field of neural network tensor processors, in particular to a multi-core tensor processor based on distributed on-chip storage.
Background
Existing SoC (System on Chip) designs use a bus structure to implement data interaction among the modules in the system. The bus typically conforms to a standard protocol (e.g., the AMBA protocol), so that any module conforming to that protocol can readily access the other modules on the bus.
Buses generally adopt one of two technical schemes: the multi-selector bus or the crossbar bus. The multi-selector bus provides interconnection for M masters and N slaves, but only one master is allowed to access one slave over the bus at a time. The crossbar bus provides interconnection for M masters and N slaves and allows the M masters to access M slaves in parallel at the same time (assuming the number of masters M is less than the number of slaves N), provided the masters' access targets do not conflict.
The multi-selector bus is simple to design, has low power and area overhead, supports a large number of devices, and scales well, making it suitable for systems with modest bus-bandwidth requirements. The crossbar bus is complex to design and has high power and area overhead that grows as the number of devices increases; it suits systems that demand high bus bandwidth but contain only a small number of modules.
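For illustration, the difference in concurrency between the two schemes can be modeled in a few lines of Python (a toy sketch with invented names, not part of the patent):

```python
# Toy model (illustrative only) of grants per cycle under the two bus schemes.

def multi_selector_grants(requests):
    """Multi-selector bus: at most ONE master-slave pair is granted per cycle,
    no matter how many non-conflicting requests are pending."""
    return requests[:1]

def crossbar_grants(requests):
    """Crossbar bus: every request is granted unless its target slave is
    already claimed by an earlier (higher-priority) master this cycle."""
    granted, busy_slaves = [], set()
    for master, slave in requests:
        if slave not in busy_slaves:
            granted.append((master, slave))
            busy_slaves.add(slave)
    return granted

# Four masters, each addressing a distinct slave in the same cycle:
requests = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(len(multi_selector_grants(requests)))  # 1 -> accesses are serialized
print(len(crossbar_grants(requests)))        # 4 -> accesses proceed in parallel
```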
For tensor processors, in particular multi-core tensor processors, the number of processor cores may be very large, for example 1024. Moreover, a multi-core neural network tensor processor typically performs computation tasks with high data volume, placing high bandwidth demands on the interconnect. Therefore, for a neural network multi-core tensor processor that requires both a large number of interconnected devices and high interconnect bandwidth, neither the traditional multi-selector bus nor the traditional crossbar bus can meet the interconnection requirements.
In addition, for multi-core tensor processors, computational performance is determined jointly by compute capability and memory bandwidth. Conventional schemes use off-chip memory (e.g., DDR DRAM) to hold data. Since off-chip memory bandwidth is typically a fixed value limited by the physical interface, it is common for the multi-core tensor processor's compute capability to grow while the off-chip memory bandwidth stays unchanged. This is the well-known "memory wall" problem: memory bandwidth limits the computational performance of the processor.
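The memory wall can be made concrete with simple roofline-style arithmetic; all numbers below are assumptions chosen for illustration, not figures from the patent:

```python
# Illustrative roofline arithmetic (all numbers are assumed, not from the patent):
# with a fixed off-chip bandwidth, attainable performance stops scaling with cores.

PEAK_PER_CORE = 1.0e12   # FLOP/s per core (assumed)
OFFCHIP_BW    = 1.0e11   # bytes/s, fixed by the physical interface (assumed)
INTENSITY     = 20.0     # FLOP per byte fetched for this workload (assumed)

bandwidth_bound = OFFCHIP_BW * INTENSITY   # 2e12 FLOP/s ceiling

for cores in (1, 2, 4, 16, 64):
    peak = cores * PEAK_PER_CORE
    attainable = min(peak, bandwidth_bound)
    print(f"{cores:3d} cores: peak {peak:.1e}, attainable {attainable:.1e} FLOP/s")
# Past 2 cores the ceiling is hit: added compute is wasted -- the memory wall.
```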
Disclosure of Invention
In order to solve the above problems, the present invention provides a multi-core tensor processor based on distributed on-chip storage. The multi-core tensor processor adopts a distributed on-chip storage structure built on a modular interconnect: the number of on-chip memories is directly proportional to the number of tensor processor cores, and the modular interconnect provides high-bandwidth interconnection.
The technical scheme of the invention is as follows:
a multi-core tensor processor based on distributed on-chip storage, comprising n storage-compute integrated modules; each storage-compute integrated module comprises two master device interfaces M0 and M1 and two slave device interfaces S0 and S1; each master and slave device interface consists of a read/write channel and a read/write return channel;
the n storage-compute integrated modules are interconnected end to end; the master device interface M1 of the (n-1)-th storage-compute integrated module is connected with the slave device interface S0 of the n-th storage-compute integrated module; the slave device interface S1 of the (n-1)-th module is connected with the master device interface M0 of the n-th module; the master device interface M0 of the (n-1)-th module is connected with the slave device interface S1 of the (n-2)-th module; and the slave device interface S0 of the (n-1)-th module is connected with the master device interface M1 of the (n-2)-th module;
each storage-compute integrated module comprises a tensor processor, an on-chip memory, MUX2 modules, a MUX3 module, DEMUX2 modules, a DEMUX3 module, an address decoding module, a tag decoding module and an arbitration module;
the tensor processor sends read address or write address information to the on-chip memory through the read/write channel and receives read data or write confirmation information from the on-chip memory through the read/write return channel;
the on-chip memory is used for storing data required by the tensor processor for calculation;
the MUX2 module and the DEMUX2 module are two-way channel-selection modules; there are two MUX2 modules, used for channel selection of the slave device interfaces S0 and S1 respectively, and two DEMUX2 modules, used for channel selection of the master device interfaces M0 and M1 respectively;
the MUX3 module and the DEMUX3 module are three-way channel-selection modules; the MUX3 module is used for channel selection of the on-chip memory, and the DEMUX3 module is used for channel selection of the tensor processor;
inside the storage-compute integrated module, the slave device interface S0 is connected, through its MUX2 module, to the DEMUX3 module of the tensor processor and to the DEMUX2 module of the master device interface M1; the tensor processor is connected with the slave device interface S0 through the DEMUX3 module and the MUX2 module of the slave device interface S0; the master device interface M1 is connected with the slave device interface S0 through the DEMUX2 module of the master device interface M1 and the MUX2 module of the slave device interface S0;
the slave device interface S1 is connected, through its MUX2 module, to the DEMUX3 module of the tensor processor and to the DEMUX2 module of the master device interface M0; the tensor processor is connected with the slave device interface S1 through the DEMUX3 module and the MUX2 module of the slave device interface S1; the master device interface M0 is connected with the slave device interface S1 through the DEMUX2 module of the master device interface M0 and the MUX2 module of the slave device interface S1;
the master device interface M0 is connected, through its DEMUX2 module, to the MUX3 module of the on-chip memory and to the MUX2 module of the slave device interface S1; the on-chip memory is connected with the master device interface M0 through the MUX3 module and the DEMUX2 module of the master device interface M0; the slave device interface S1 is connected with the master device interface M0 through the MUX2 module of the slave device interface S1 and the DEMUX2 module of the master device interface M0;
the master device interface M1 is connected, through its DEMUX2 module, to the MUX3 module of the on-chip memory and to the MUX2 module of the slave device interface S0; the on-chip memory is connected with the master device interface M1 through the MUX3 module and the DEMUX2 module of the master device interface M1; the slave device interface S0 is connected with the master device interface M1 through the MUX2 module of the slave device interface S0 and the DEMUX2 module of the master device interface M1;
the tensor processor is connected, through the DEMUX3 module, to the MUX3 module of the on-chip memory, the MUX2 module of the slave device interface S0, and the MUX2 module of the slave device interface S1; a read/write request of the tensor processor is sent to the local on-chip memory through the DEMUX3 module, or to an on-chip memory in any other storage-compute integrated module through the slave device interface S0 or S1;
the address decoding module acquires the address on a read/write channel and, according to built-in fixed address intervals and port-mapping information, determines the channel selection of the internal data distributor of the DEMUX2 or DEMUX3 module;
the tag decoding module acquires the tag on a read/write return channel and, according to a built-in fixed tag and port-mapping information, determines the channel selection of the internal data distributor of the MUX2 or MUX3 module;
the arbitration module acquires the valid-request information of a read/write channel or read/write return channel and, according to built-in fixed priorities, determines the channel selection of the internal multiplexer of the DEMUX2, DEMUX3, MUX2 or MUX3 module.
Furthermore, the MUX2 module internally comprises a two-way data distributor and a 2-to-1 multiplexer; the two-way data distributor connects the input read/write return channel to one of two output read/write return channels, with the channel selection determined by the tag decoding module; the 2-to-1 multiplexer connects one of two input read/write channels to the output read/write channel, with the channel selection determined by the arbitration module.
Furthermore, the MUX3 module internally comprises a three-way data distributor and a 3-to-1 multiplexer; the three-way data distributor connects the input read/write return channel to one of three output read/write return channels, with the channel selection determined by the tag decoding module; the 3-to-1 multiplexer connects one of three input read/write channels to the output read/write channel, with the channel selection determined by the arbitration module.
Furthermore, the DEMUX2 module internally comprises a two-way data distributor and a 2-to-1 multiplexer; the two-way data distributor connects the input read/write channel to one of two output read/write channels, with the channel selection determined by the address decoding module; the 2-to-1 multiplexer connects one of two input read/write return channels to the output read/write return channel, with the channel selection determined by the arbitration module.
Furthermore, the DEMUX3 module is a three-way channel-selection module that internally comprises a three-way data distributor and a 3-to-1 multiplexer; the three-way data distributor connects the input read/write channel to one of three output read/write channels, with the channel selection determined by the address decoding module; the 3-to-1 multiplexer connects one of three input read/write return channels to the output read/write return channel, with the channel selection determined by the arbitration module.
Furthermore, each storage-compute integrated module is assigned a unique ID number; the ID number is carried as side information and sent by the tensor processor to the on-chip memory through the read/write channel; when the on-chip memory responds to a read/write request from the read/write channel, it attaches the corresponding ID number to the corresponding read/write return channel information, and this ID number is the tag of the read/write return channel.
Furthermore, the read/write channel may be configured either as a single channel shared by read and write information or as two separate read and write channels; likewise, the read/write return channel may be configured as a single shared channel or as two separate channels.
The invention realizes the following technical effects:
the integral storage and calculation module comprises a tensor processor for calculation and an on-chip memory for storage, and the tensor processor and the on-chip memory in the integral storage and calculation module can form a complete calculation system to complete calculation of a neural network algorithm. Therefore, the calculation power and the storage bandwidth can be increased simultaneously by increasing the number of the storage and calculation integrated modules, and the limit of a memory wall can be broken, so that the high performance of the multi-core system can be realized. On the other hand, based on the structural characteristics of the calculation-integrated module, the N-fold expansion of the calculation-integrated module does not cause the increase of interconnection complexity, and the N-fold expansion of the calculation-integrated module can be realized easily, so the invention has excellent expandable characteristics.
Drawings
FIG. 1 is a schematic diagram of the connection of two storage-compute integrated modules of the present invention;
FIG. 2 is a schematic diagram of the connection of n storage-compute integrated modules of the present invention;
FIG. 3 shows the internal structure of a storage-compute integrated module of the present invention;
FIG. 4 is a functional schematic of the MUX2 module of the present invention;
FIG. 5 is a functional schematic of the MUX3 module of the present invention;
FIG. 6 is a functional schematic of the DEMUX2 module of the present invention;
FIG. 7 is a functional schematic of the DEMUX3 module of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The invention provides a multi-core tensor processor based on distributed on-chip storage, composed of a plurality of storage-compute integrated modules. As shown in Fig. 1, each storage-compute integrated module includes two master device interfaces, M0 and M1, and two slave device interfaces, S0 and S1. Each master and slave device interface consists of a read/write channel and a read/write return channel. The read/write channel is the data channel from the master device interface to the slave device interface and mainly carries read-address or write-address information; the read/write return channel is the data channel from the slave device interface to the master device interface and mainly carries read data or write confirmation information.
The storage-compute integrated modules are interconnected end to end; the connection rule is shown in Fig. 2: the master device interface M1 of module n-1 is connected with the slave device interface S0 of module n; the slave device interface S1 of module n-1 is connected with the master device interface M0 of module n; the master device interface M0 of module n-1 is connected with the slave device interface S1 of module n-2; and the slave device interface S0 of module n-1 is connected with the master device interface M1 of module n-2.
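The connection rule can be enumerated programmatically; the sketch below uses invented module/port names and simply lists the links for n = 4 modules:

```python
# Minimal sketch of the end-to-end connection rule (module/port names assumed):
# M1 of module i connects to S0 of module i+1 (one direction), and
# M0 of module i+1 connects to S1 of module i (the opposite direction).

def chain_links(n):
    links = []
    for i in range(n - 1):
        links.append((f"module[{i}].M1", f"module[{i+1}].S0"))
        links.append((f"module[{i+1}].M0", f"module[{i}].S1"))
    return links

for src, dst in chain_links(4):
    print(f"{src} <-> {dst}")
# Each adjacent pair is joined by one link per direction, so adding a module
# adds a constant number of links -- interconnect complexity stays flat.
```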
The internal structure of the storage-compute integrated module is shown in Figs. 3 and 4, where A/W denotes a read/write channel and R denotes a read/write return channel.
The storage-compute integrated module consists of a tensor processor, an on-chip memory, MUX2 modules, a MUX3 module, DEMUX2 modules, a DEMUX3 module, an address decoding module, a tag decoding module and an arbitration module.
The tensor processor is a processor that performs neural network algorithm computation. It sends read-address or write-address information to the on-chip memory through the read/write channel and receives read data or write confirmation information from the on-chip memory through the read/write return channel. The on-chip memory is a memory located inside the chip that stores the data the tensor processor needs for computation.
The MUX2 module is a two-way channel-selection module containing a two-way data distributor and a 2-to-1 multiplexer. The two-way data distributor connects the input read/write return channel to one of two output read/write return channels; the channel selection is determined by the tag decoding module. The 2-to-1 multiplexer connects one of two input read/write channels to the output read/write channel; the channel selection is determined by the arbitration module.
The MUX3 module is a three-way channel-selection module containing a three-way data distributor and a 3-to-1 multiplexer. The three-way data distributor connects the input read/write return channel to one of three output read/write return channels; the channel selection is determined by the tag decoding module. The 3-to-1 multiplexer connects one of three input read/write channels to the output read/write channel; the channel selection is determined by the arbitration module.
The DEMUX2 module is a two-way channel-selection module containing a two-way data distributor and a 2-to-1 multiplexer. The two-way data distributor connects the input read/write channel to one of two output read/write channels; the channel selection is determined by the address decoding module. The 2-to-1 multiplexer connects one of two input read/write return channels to the output read/write return channel; the channel selection is determined by the arbitration module.
The DEMUX3 module is a three-way channel-selection module containing a three-way data distributor and a 3-to-1 multiplexer. The three-way data distributor connects the input read/write channel to one of three output read/write channels; the channel selection is determined by the address decoding module. The 3-to-1 multiplexer connects one of three input read/write return channels to the output read/write return channel; the channel selection is determined by the arbitration module.
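A compact behavioral model of these channel-selection modules is sketched below. This is an assumed software illustration, not the patent's hardware implementation: MUXk pairs a k-to-1 multiplexer on the request path with a k-way distributor on the return path, and DEMUXk mirrors it with the distributor on the request path.

```python
# Behavioral model (assumed, for illustration) of the MUX2/MUX3 modules:
# a k-to-1 multiplexer on read/write channels plus a k-way distributor on
# read/write return channels. DEMUX2/DEMUX3 swap the two roles.

class MuxK:
    def __init__(self, k):
        self.k = k

    def select_request(self, requests, grant):
        """k-to-1 multiplexer: arbitration picks which input drives the output."""
        return requests[grant]

    def distribute_response(self, response, port):
        """k-way distributor: tag decoding picks which return channel is driven."""
        outputs = [None] * self.k
        outputs[port] = response
        return outputs

mux3 = MuxK(3)                       # e.g. the on-chip memory's MUX3
print(mux3.select_request(["req_tp", "req_m0", "req_m1"], grant=1))
print(mux3.distribute_response("read_data", port=0))
```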
The address decoding module obtains the address on a read/write channel and, according to built-in fixed address intervals and port-mapping information, determines the channel selection of the internal data distributor of the DEMUX2 or DEMUX3 module.
The tag decoding module obtains the tag on a read/write return channel and, according to a built-in fixed tag and port-mapping information, determines the channel selection of the internal data distributor of the MUX2 or MUX3 module.
The arbitration module obtains the valid-request information of a read/write channel or read/write return channel and, according to built-in fixed priorities, determines the channel selection of the internal multiplexer of the DEMUX2, DEMUX3, MUX2 or MUX3 module.
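The three control mechanisms can be sketched as follows; the address map, tag mapping, and priority order are invented placeholders, since the patent states only that they are fixed and built in:

```python
# Sketch of the selection logic (the address map and priority order below are
# invented for illustration; the patent fixes them at design time).

ADDRESS_MAP = [
    (0x0000_0000, 0x0FFF_FFFF, "local on-chip memory"),
    (0x1000_0000, 0x1FFF_FFFF, "slave interface S0"),
    (0x2000_0000, 0x2FFF_FFFF, "slave interface S1"),
]

def address_decode(addr):
    """Map a read/write-channel address to a DEMUX output port."""
    for lo, hi, port in ADDRESS_MAP:
        if lo <= addr <= hi:
            return port
    raise ValueError(f"unmapped address {addr:#010x}")

def tag_decode(tag, tag_to_port):
    """Map a return-channel tag (module ID) to a MUX distributor port."""
    return tag_to_port[tag]

def arbitrate(valid):
    """Fixed priority: grant the lowest-indexed channel with a valid request."""
    for i, v in enumerate(valid):
        if v:
            return i
    return None

print(address_decode(0x1234_5678))         # -> 'slave interface S0'
print(tag_decode(3, {3: "port_to_M1"}))    # -> 'port_to_M1'
print(arbitrate([False, True, True]))      # -> 1
```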
The address decoding, tag decoding and arbitration modules are generic processor building blocks; practitioners can generate the corresponding modules according to the parameter requirements (e.g., data interface width and number of data channels) of the MUX2, MUX3, DEMUX2 and DEMUX3 modules.
The multi-core tensor processor based on distributed on-chip storage is composed of a plurality of storage-compute integrated modules, each assigned a unique ID number. The ID number is carried as side information and sent by the tensor processor to the on-chip memory through the read/write channel. When the on-chip memory responds to a read/write request from the read/write channel, it attaches the corresponding ID number to the corresponding read/write return channel information; this ID number is the tag of the read/write return channel.
The specific connection relationships inside the storage-compute integrated module are as follows:
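A minimal sketch of this return-path tagging (field and variable names are assumptions) is:

```python
# Hedged sketch of the return-path tagging described above: each request
# carries its source module's unique ID, and the on-chip memory copies that
# ID onto the response so tag decoders can steer it back to the requester.

from dataclasses import dataclass

@dataclass
class ReadRequest:
    src_id: int      # unique ID of the requesting storage-compute module
    addr: int

@dataclass
class ReadResponse:
    tag: int         # ID copied from the request; consumed by tag decoding
    data: int

MEMORY = {0x100: 42}  # toy on-chip memory contents (assumed)

def memory_respond(req: ReadRequest) -> ReadResponse:
    # the on-chip memory attaches the requester's ID as the return-channel tag
    return ReadResponse(tag=req.src_id, data=MEMORY[req.addr])

resp = memory_respond(ReadRequest(src_id=7, addr=0x100))
print(resp)  # ReadResponse(tag=7, data=42) -> routed back to module 7
```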
and the slave device interface S0 is respectively connected with the DEMUX3 module of the tensor processor and the DEMUX2 module of the main device interface M1 through the MUX2 module in the storage and calculation integrated module. The tensor processor may be connected with the slave device interface S0 through the DEMUX3 module and the MUX2 module of the slave device interface S0. The master device interface M1 may be connected to the slave device interface S0 through the DEMUX2 module of the master device interface M1 and the MUX2 module of the slave device interface S0.
And inside the storage and calculation integrated module, the slave device interface S1 is respectively connected with the DEMUX3 module of the tensor processor and the DEMUX2 module of the main device interface M0 through the MUX2 module. The tensor processor may be connected with the slave device interface S1 through the DEMUX3 module and the MUX2 module of the slave device interface S1. The master interface M0 may be connected to the slave interface S1 through the DEMUX2 module of the master interface M0 and the MUX2 module of the slave interface S1.
And inside the storage and calculation integrated module, the main equipment interface M0 is respectively connected with the MUX3 module of the on-chip memory and the MUX2 module of the slave equipment interface S1 through the DEMUX2 module. The on-chip memory may be connected to the host interface M0 via a MUX3 module and a DEMUX2 module of the host interface M0. The slave interface S1 may be connected to the master interface M0 through the MUX2 module of the slave interface S1 and the DEMUX2 module of the master interface M0.
And the main equipment interface M1 is respectively connected with the MUX3 module of the on-chip memory and the MUX2 module of the slave equipment interface S0 through the DEMUX2 module in the storage and calculation integrated module. The on-chip memory may be connected to the host interface M1 via a MUX3 module and a DEMUX2 module of the host interface M1. The slave interface S0 may be connected to the master interface M1 through the MUX2 module of the slave interface S0 and the DEMUX2 module of the master interface M1.
And inside the storage and calculation integrated module, the tensor processor is respectively connected with the MUX3 module of the on-chip memory, the MUX2 module of the slave device interface S0 and the MUX2 module of the slave device interface S1 through the DEMUX3 module. The read/write request of the tensor processor can be sent to a local on-chip memory through the DEMUX3 module, and can also be sent to any on-chip memory located in other memory modules through the S0 or the slave device interface S1.
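One plausible routing policy consistent with this topology is sketched below; the patent does not prescribe this exact rule, so treat it as an assumption: requests for lower-numbered modules leave through S0 (which links toward module n-1) and requests for higher-numbered modules leave through S1.

```python
# Assumed routing policy (illustrative, not mandated by the patent): a request
# targeting module `dst_id` stays local, or leaves through S0 (toward lower-
# numbered modules) or S1 (toward higher-numbered modules).

def route(local_id, dst_id):
    if dst_id == local_id:
        return "local on-chip memory (DEMUX3 -> MUX3)"
    return "slave device interface S0" if dst_id < local_id else "slave device interface S1"

for dst in (2, 5, 9):
    print(f"module 5 -> module {dst}: {route(5, dst)}")
```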
In the multi-core tensor processor based on distributed on-chip storage, the read/write channel may be a single channel shared by read and write information or two separate read and write channels; the read/write return channel may likewise be a single shared channel or two separate channels.
Each storage-compute integrated module contains a tensor processor for computation and an on-chip memory for storage, which together form a complete computing system capable of executing a neural network algorithm. Increasing the number of storage-compute integrated modules therefore increases computing power and storage bandwidth simultaneously. Ideally, compiler software can partition the neural network algorithm into N parts and load them onto N storage-compute integrated modules; the N tensor processors then access the N on-chip memories in parallel and complete the computation. Compared with a single module, N modules offer N times the computing power and N times the storage bandwidth. The memory-wall limit can thus be broken, enabling high multi-core performance. Moreover, owing to the modules' structural characteristics, an N-fold expansion does not increase interconnect complexity and is easy to realize, so the invention is highly scalable.
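Under the ideal N-way partition described above, both resources scale together; the toy figures below (assumed, not from the patent) contrast this with the fixed off-chip bandwidth case from the Background section:

```python
# Assumed per-module figures, for illustration only: distributed on-chip
# storage scales aggregate bandwidth with N, unlike a fixed off-chip interface.

PER_MODULE_COMPUTE = 1.0e12   # FLOP/s per module (assumed)
PER_MODULE_BW      = 2.0e11   # bytes/s per on-chip memory (assumed)

for n in (1, 4, 16, 64):
    print(f"N={n:3d}: compute {n * PER_MODULE_COMPUTE:.1e} FLOP/s, "
          f"aggregate bandwidth {n * PER_MODULE_BW:.1e} B/s")
# The compute-to-bandwidth ratio stays constant, so scaling N does not
# re-create the memory wall.
```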
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A multi-core tensor processor based on distributed on-chip storage, comprising n storage-compute integrated modules; each storage-compute integrated module comprises two master device interfaces M0 and M1 and two slave device interfaces S0 and S1; each master and slave device interface consists of a read/write channel and a read/write return channel;
the n storage-compute integrated modules are interconnected end to end; the master device interface M1 of the (n-1)-th storage-compute integrated module is connected with the slave device interface S0 of the n-th storage-compute integrated module; the slave device interface S1 of the (n-1)-th module is connected with the master device interface M0 of the n-th module; the master device interface M0 of the (n-1)-th module is connected with the slave device interface S1 of the (n-2)-th module; and the slave device interface S0 of the (n-1)-th module is connected with the master device interface M1 of the (n-2)-th module;
each storage-compute integrated module comprises a tensor processor, an on-chip memory, MUX2 modules, a MUX3 module, DEMUX2 modules, a DEMUX3 module, an address decoding module, a tag decoding module and an arbitration module;
the tensor processor sends read address or write address information to the on-chip memory through the read/write channel and receives read data or write confirmation information from the on-chip memory through the read/write return channel;
the on-chip memory is used for storing data required by the tensor processor for calculation;
the MUX2 module and the DEMUX2 module are two-way channel-selection modules; there are two MUX2 modules, used for channel selection of the slave device interfaces S0 and S1 respectively, and two DEMUX2 modules, used for channel selection of the master device interfaces M0 and M1 respectively;
the MUX3 module and the DEMUX3 module are three-way channel-selection modules; the MUX3 module is used for channel selection of the on-chip memory, and the DEMUX3 module is used for channel selection of the tensor processor;
inside the storage-compute integrated module, the slave device interface S0 is connected, through its MUX2 module, to the DEMUX3 module of the tensor processor and to the DEMUX2 module of the master device interface M1; the tensor processor is connected with the slave device interface S0 through the DEMUX3 module and the MUX2 module of the slave device interface S0; the master device interface M1 is connected with the slave device interface S0 through the DEMUX2 module of the master device interface M1 and the MUX2 module of the slave device interface S0;
the slave device interface S1 is connected, through its MUX2 module, to the DEMUX3 module of the tensor processor and to the DEMUX2 module of the master device interface M0; the tensor processor is connected with the slave device interface S1 through the DEMUX3 module and the MUX2 module of the slave device interface S1; the master device interface M0 is connected with the slave device interface S1 through the DEMUX2 module of the master device interface M0 and the MUX2 module of the slave device interface S1;
the master device interface M0 is connected, through its DEMUX2 module, to the MUX3 module of the on-chip memory and to the MUX2 module of the slave device interface S1; the on-chip memory is connected with the master device interface M0 through the MUX3 module and the DEMUX2 module of the master device interface M0; the slave device interface S1 is connected with the master device interface M0 through the MUX2 module of the slave device interface S1 and the DEMUX2 module of the master device interface M0;
the master device interface M1 is connected, through its DEMUX2 module, to the MUX3 module of the on-chip memory and to the MUX2 module of the slave device interface S0; the on-chip memory is connected with the master device interface M1 through the MUX3 module and the DEMUX2 module of the master device interface M1; the slave device interface S0 is connected with the master device interface M1 through the MUX2 module of the slave device interface S0 and the DEMUX2 module of the master device interface M1;
the tensor processor is connected, through the DEMUX3 module, to the MUX3 module of the on-chip memory, the MUX2 module of the slave device interface S0, and the MUX2 module of the slave device interface S1; a read/write request of the tensor processor is sent to the local on-chip memory through the DEMUX3 module, or to an on-chip memory in any other storage-compute integrated module through the slave device interface S0 or S1;
the address decoding module acquires the address on a read/write channel and, according to built-in fixed address intervals and port-mapping information, determines the channel selection of the internal data distributor of the DEMUX2 or DEMUX3 module;
the tag decoding module acquires the tag on a read/write return channel and, according to a built-in fixed tag and port-mapping information, determines the channel selection of the internal data distributor of the MUX2 or MUX3 module;
the arbitration module acquires the valid-request information of a read/write channel or read/write return channel and, according to built-in fixed priorities, determines the channel selection of the internal multiplexer of the DEMUX2, DEMUX3, MUX2 or MUX3 module.
2. The multi-core tensor processor based on distributed on-chip storage of claim 1, wherein the MUX2 module internally comprises a two-way data distributor and a 2-to-1 multiplexer; the two-way data distributor connects the input read/write return channel to one of two output read/write return channels, with the channel selection determined by the tag decoding module; and the 2-to-1 multiplexer connects one of two input read/write channels to the output read/write channel, with the channel selection determined by the arbitration module.
3. The multi-core tensor processor based on distributed on-chip storage of claim 1, wherein the MUX3 module internally comprises a three-way data distributor and a 3-to-1 multiplexer; the three-way data distributor connects the input read/write return channel to one of three output read/write return channels, with the channel selection determined by the tag decoding module; and the 3-to-1 multiplexer connects one of three input read/write channels to the output read/write channel, with the channel selection determined by the arbitration module.
4. The multi-core tensor processor based on distributed on-chip storage of claim 1, wherein the DEMUX2 module internally comprises a two-way data distributor and a 2-to-1 multiplexer; the two-way data distributor connects the input read/write channel to one of two output read/write channels, with the channel selection determined by the address decoding module; and the 2-to-1 multiplexer connects one of two input read/write return channels to the output read/write return channel, with the channel selection determined by the arbitration module.
5. The multi-core tensor processor based on distributed on-chip storage of claim 1, wherein the DEMUX3 module is a three-way channel-selection module that internally comprises a three-way data distributor and a 3-to-1 multiplexer; the three-way data distributor connects the input read/write channel to one of three output read/write channels, with the channel selection determined by the address decoding module; and the 3-to-1 multiplexer connects one of three input read/write return channels to the output read/write return channel, with the channel selection determined by the arbitration module.
6. The multi-core tensor processor based on distributed on-chip storage of claim 1, wherein each storage-compute integrated module is assigned a unique ID number; the ID number is carried as side information and sent by the tensor processor to the on-chip memory through the read/write channel; and when the on-chip memory responds to a read/write request from the read/write channel, it attaches the corresponding ID number to the corresponding read/write return channel information, this ID number being the tag of the read/write return channel.
7. The multi-core tensor processor based on distributed on-chip storage of claim 1, wherein the read/write channel is configured either as a single channel shared by read and write information or as two separate read and write channels; and the read/write return channel is configured either as a single channel shared by read and write information or as two separate channels.
CN202211331229.4A 2022-10-28 2022-10-28 Multi-core tensor processor based on distributed on-chip storage Pending CN115796248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211331229.4A CN115796248A (en) 2022-10-28 2022-10-28 Multi-core tensor processor based on distributed on-chip storage


Publications (1)

Publication Number Publication Date
CN115796248A true CN115796248A (en) 2023-03-14

Family

ID=85434176


Country Status (1)

Country Link
CN (1) CN115796248A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination