CN114970844A - Universal neural network tensor processor - Google Patents

Universal neural network tensor processor

Info

Publication number
CN114970844A
Authority
CN
China
Prior art keywords
kernel
data
instruction
flow
address space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210031623.XA
Other languages
Chinese (zh)
Inventor
罗闳訚
尤培坤
周志新
何日辉
汤梦饶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yipu Intelligent Technology Co ltd
Original Assignee
Xiamen Yipu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yipu Intelligent Technology Co ltd filed Critical Xiamen Yipu Intelligent Technology Co ltd
Priority to CN202210031623.XA
Publication of CN114970844A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F 9/30087 Synchronisation or serialisation instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/30141 Implementation provisions of register files, e.g. ports
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/32 Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F 9/322 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F 9/327 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for interrupts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F 9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a universal neural network tensor processor composed of five modules: an instruction stream kernel, a data stream kernel, a system bus, an instruction stream peripheral bus, and a data stream peripheral bus. The built-in data stream kernel, which supports a limited set of computation types and is implemented with dataflow computing techniques, provides high-computing-power, low-energy-consumption, high-performance data computation. The built-in instruction stream kernel, which runs a Turing-complete instruction set, can extend the supported computation types without limit, and is implemented with instruction set computing techniques, provides highly flexible data operations (data movement, recombination, and the like). The data stream kernel is chiefly responsible for compute-intensive tasks and the instruction stream kernel for non-compute tasks; the two complement each other, realizing neural network accelerated computation that is operator-extensible, high in computing power, low in energy consumption, high in performance, flexible, and universal.

Description

Universal neural network tensor processor
Technical Field
The invention relates to the field of neural network tensor processors, and in particular to a universal neural network tensor processor.
Background
A neural network tensor processor efficiently accelerates the computation of neural network algorithms. Existing neural network tensor processors typically support only a finite set of computation types, each commonly referred to as an operator; such a processor accelerates a neural network algorithm by supporting a specific set of operators. For examples of such processors, see patent 1 (titled "Neural network multi-core tensor processor", application number 202011423696.0) or patent 2 (titled "Neural network tensor processor", application number 202011421828.6).
Because existing neural network tensor processors are built on a finite operator set, their functionality is inherently limited: for operators that are unsupported and cannot be supported indirectly by combining existing operators, the processor simply cannot run the computation.
Artificial intelligence is developing rapidly, and the pace of algorithm innovation keeps accelerating. New algorithms and new operators appear continuously, so a traditional neural network tensor processor based on a fixed operator set tends to fall behind as algorithms evolve and eventually cannot support the operation of new algorithms.
Disclosure of Invention
To solve these problems, the invention provides a universal neural network tensor processor with flexible operator extensibility that can meet the evolving requirements of continuously innovating artificial intelligence algorithms.
The specific technical solution provided by the invention is as follows:
a universal neural network tensor processor comprising: the instruction flow kernel is used for realizing the function of instruction set operation, wherein the instruction set is a graphic complete instruction set; the data flow kernel is used for configuring and realizing a data flow operator; the system bus is externally realized as an AXI system main equipment interface, and arbitrates the system main interface internally realizing the instruction stream kernel and the data stream kernel, and allows one path of the data stream kernel or the instruction stream kernel to access an external system at the same time; the peripheral bus of the instruction flow, is used for realizing the switch of the visit of the peripheral of different address spaces of kernel of the instruction flow, a part of the address space maps to the kernel of the dataflow, a part maps to the external AXI bus; and a data stream peripheral bus for arbitrating access requests from the instruction stream core or the external AXI device and allowing one of the access requests to the register of the data stream core at the same time.
Viewed from its functional-module interfaces, the instruction stream kernel further includes: a system master interface for fetching instructions and data from external memory; a peripheral master interface for actively reading and writing registers of other devices; an external interrupt interface through which an external system triggers the instruction stream kernel to start computation; and an internal signal interface for signal interaction between the instruction stream kernel and the data stream kernel.
Viewed from its functional-module interfaces, the data stream kernel further includes: a system master interface for fetching data from external memory; a peripheral slave interface for access by the instruction stream kernel or an external master device; and an internal signal interface for signal interaction between the data stream kernel and the instruction stream kernel.
Viewed from its functional modules, the instruction stream kernel further includes: an instruction pipeline module for executing all instructions; a system bus bit width adaptation module for converting the data bit width of the instruction pipeline module to match the data bit width of the data stream kernel, where the data bit width of the data stream kernel is 2^N times the data bit width of the instruction pipeline module and N is an integer greater than or equal to 2; an address space configuration module for configuring address space attributes, the attribute configuration information coming from the data stream kernel; an interrupt controller module for receiving two kinds of interrupt trigger signals, one from the data stream kernel and the other from an external system, the interrupt controller triggering the instruction pipeline to execute different tasks according to the interrupt type; and a data stream kernel control module for interacting with the data stream kernel, whose input and output signals comprise an interrupt output signal, a synchronization input signal, and a synchronization output signal, where asserting the interrupt output signal triggers the data stream kernel to start a new task, asserting the synchronization input signal indicates that the data stream kernel has finished computing, and asserting the synchronization output signal reports to the data stream kernel that the instruction stream kernel has finished computing.
Furthermore, the data bit width of the data stream kernel is 256, 512, or 1024 bits, and the data bit width of the instruction pipeline module is 64 or 32 bits.
Further, the address space attributes of the instruction stream kernel's operation include: a shared data address space, accessible by both the instruction stream kernel and the data stream kernel, for storing data shared between them; it is readable and writable, data in it are stored strictly aligned to the data bit width of the data stream kernel, and it offers the highest data read/write performance; a temporary data address space, exclusive to the instruction stream kernel, for storing data used by the instruction stream kernel alone; it is readable and writable, imposes no address alignment requirement, can be read and written at byte granularity, and offers the highest flexibility; and an instruction address space, exclusive to the instruction stream kernel, for storing the instructions used by the instruction stream kernel alone; instructions in it are stored strictly aligned to the data bit width of the data stream kernel, giving the highest read performance.
Viewed from its functional modules, the data stream kernel further includes: a main controller for exposing control and status interfaces so that an external control unit can control the data stream kernel; a data stream reconfiguration controller for configuring the data stream computation module to implement a particular operator function; a data stream computation module for performing the actual data computation, its specific function being determined by the data stream reconfiguration controller; and an instruction stream kernel controller for interaction between the data stream kernel and the instruction stream kernel, whose input and output signals comprise an interrupt input signal, a synchronization input signal, a synchronization output signal, an address control configuration signal, and an interrupt output signal, where asserting the interrupt input signal triggers the data stream kernel to start a new computation, asserting the synchronization input signal indicates that the instruction stream kernel has finished computing, asserting the synchronization output signal reports to the instruction stream kernel that the data stream kernel has finished computing, the address control configuration signal configures the address space attributes of the instruction stream kernel, and asserting the interrupt output signal triggers the instruction stream kernel to start a new task.
The invention achieves the following technical effects:
the general neural network tensor processor provides high-computation-power, low-energy-consumption and high-performance data computation through the built-in data flow kernel which has a limited computation type and is realized by adopting a data flow computation technology, and provides high-flexibility data operation (carrying, combining and the like) through the built-in instruction flow kernel which has a graphic complete instruction set, can infinitely expand computation types and is realized by adopting an instruction set computation technology. The data flow kernel is mainly responsible for high-computing power computing tasks, the instruction flow kernel is mainly responsible for non-computing tasks, and the data flow kernel and the instruction flow kernel are complementary to each other, so that the neural network accelerated computing with operator expandability, high computing power, low energy consumption, high performance and flexibility and universality is realized.
Drawings
FIG. 1 shows the processor architecture of the present invention;
FIG. 2 shows an example instruction stream;
FIG. 3 shows an example data stream;
FIG. 4 shows the instruction stream kernel architecture of the present invention;
FIG. 5 shows the data stream kernel architecture of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The overall architecture of the universal neural network tensor processor provided by the invention is shown in FIG. 1. The processor is composed of five modules: an instruction stream kernel, a data stream kernel, a system bus, an instruction stream peripheral bus, and a data stream peripheral bus. Specifically:
the instruction flow kernel mainly implements the function of running an instruction set, the instruction refers to a rule for operating data, different rules can define different instructions, and the set of all instructions is called an instruction set. The instruction set may be any instruction set having graphic maturity, such as the RISC-V instruction set. The instruction flow core reads the instruction flow and executes each instruction in sequence to implement the computational function defined by the instruction flow, one instruction flow being shown in fig. 2. Although the types of instructions are limited, the combination of instructions is unlimited, and thus the computational functions that the instruction flow core can implement are determined by the combination of instructions and the instruction flow core can implement any of the functions defined by the instruction flow.
The instruction stream kernel has a system master interface, a peripheral master interface, an external interrupt interface, and an internal signal interface. The system master interface fetches instructions and data from external memory; the peripheral master interface actively reads and writes registers of other devices; the external interrupt interface lets an external system trigger the instruction stream kernel to start computation; and the internal signal interface carries the signal interaction between the instruction stream kernel and the data stream kernel.
The data stream kernel mainly implements dataflow computation. It can implement computations with specific functions, which are defined as dataflow operators. A dataflow operator is itself an indivisible computational function, such as convolution or pooling. The set of dataflow operators is not Turing-complete: arbitrary other operators cannot be realized by combining existing ones. The data stream kernel supports a limited number of dataflow operator types and can therefore implement only a limited set of computational functions. Once the kernel's operator function has been configured, its computational function is fixed; the kernel then reads the data stream and processes each datum in sequence, performing data processing for that particular function. An example data stream is shown in FIG. 3.
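Again purely for illustration, a C sketch of this configure-then-stream behavior, using 1-D max pooling as the fixed operator (the window size and data are invented):

    #include <stdio.h>

    /* Dataflow-style operator: the function is fixed by configuration
     * (here the window size), after which each element of the input
     * stream is consumed in arrival order. */
    static void maxpool1d_stream(const float *in, int n, int window, float *out) {
        for (int i = 0; i + window <= n; i += window) {
            float m = in[i];
            for (int k = 1; k < window; k++)
                if (in[i + k] > m) m = in[i + k];
            out[i / window] = m;
        }
    }

    int main(void) {
        float stream[] = {1, 5, 3, 2, 9, 4, 7, 8};
        float out[4];
        maxpool1d_stream(stream, 8, 2, out);   /* window = 2 is the configuration */
        for (int i = 0; i < 4; i++) printf("%.0f ", out[i]);   /* 5 3 9 8 */
        printf("\n");
        return 0;
    }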
The data stream kernel has a system master interface, a peripheral slave interface, and an internal signal interface. The system master interface fetches data from external memory; the peripheral slave interface is accessed by the instruction stream kernel or an external master device; and the internal signal interface carries the signal interaction between the data stream kernel and the instruction stream kernel.
The system bus is implemented externally as an AXI system master interface and is mainly used to access the external system (for example, DDR memory). Internally, the system bus arbitrates between the system master interfaces of the instruction stream kernel and the data stream kernel, allowing one of them to access the external system at any given time.
The instruction stream peripheral bus switches the instruction stream kernel's accesses among peripherals in different address spaces: part of the address space is mapped to the data stream kernel, and the other part is mapped to the external AXI bus.
The data stream peripheral bus arbitrates access requests from the instruction stream kernel or an external AXI device, allowing one of them to access the registers of the data stream kernel at any given time.
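The two peripheral-bus roles can be pictured with a small C sketch; the address ranges and the fixed-priority arbitration policy are assumptions, since the patent specifies neither:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical address split for the instruction stream peripheral bus. */
    #define DATAFLOW_REG_BASE  0x40000000u
    #define DATAFLOW_REG_LIMIT 0x4000FFFFu

    typedef enum { TARGET_DATAFLOW_KERNEL, TARGET_EXTERNAL_AXI } target_t;

    /* Instruction stream peripheral bus: decode an access to one of two targets. */
    static target_t decode(uint32_t addr) {
        return (addr >= DATAFLOW_REG_BASE && addr <= DATAFLOW_REG_LIMIT)
                   ? TARGET_DATAFLOW_KERNEL
                   : TARGET_EXTERNAL_AXI;
    }

    /* Data stream peripheral bus: grant one of two requesters at a time.
     * Fixed priority is an assumed policy, used only to make the sketch concrete. */
    typedef enum { GRANT_NONE, GRANT_INSN_KERNEL, GRANT_EXT_AXI } grant_t;

    static grant_t arbitrate(bool req_insn_kernel, bool req_ext_axi) {
        if (req_insn_kernel) return GRANT_INSN_KERNEL;
        if (req_ext_axi)     return GRANT_EXT_AXI;
        return GRANT_NONE;
    }

    int main(void) {
        printf("decode(0x40000010) -> %d\n", decode(0x40000010u));
        printf("arbitrate(1,1)     -> %d\n", arbitrate(true, true));
        return 0;
    }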
The instruction stream kernel and the data stream kernel interact through dedicated internal signals of three types: synchronization signals, interrupt signals, and address space configuration signals.
The synchronization signals allow the data stream kernel and the instruction stream kernel to finish a task together. In actual operation the two kernels can run in parallel; when they execute different parts of the same task, each uses the synchronization signals to wait for the other, so that the task ends only when both kernels have completed their computation.
The interrupt signals allow the data stream kernel to trigger the instruction stream kernel to start computation, and the instruction stream kernel to trigger the data stream kernel likewise.
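A thread-based C sketch of the synchronize-at-the-end behavior (illustrative only; a pthread barrier stands in for the hardware synchronization signals, and the printed messages are invented):

    /* build: cc -pthread sync_sketch.c */
    #include <pthread.h>
    #include <stdio.h>

    /* The barrier models the mutual wait: neither "kernel" proceeds past
     * the end of the shared task until both have arrived. */
    static pthread_barrier_t sync_point;

    static void *dataflow_kernel(void *arg) {
        (void)arg;
        puts("data stream kernel: compute part done, asserting sync out");
        pthread_barrier_wait(&sync_point);   /* wait for instruction stream kernel */
        puts("data stream kernel: both done, task complete");
        return NULL;
    }

    static void *instruction_kernel(void *arg) {
        (void)arg;
        puts("instruction stream kernel: data rearrangement done, asserting sync out");
        pthread_barrier_wait(&sync_point);   /* wait for data stream kernel */
        puts("instruction stream kernel: both done, task complete");
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_barrier_init(&sync_point, NULL, 2);
        pthread_create(&a, NULL, dataflow_kernel, NULL);
        pthread_create(&b, NULL, instruction_kernel, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_barrier_destroy(&sync_point);
        return 0;
    }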
The address space configuration signals configure the attributes of the address space in which the instruction stream kernel operates. That address space has three attribute classes: the shared data address space, the temporary data address space, and the instruction address space.
The shared data address space is accessible by both the instruction stream kernel and the data stream kernel and stores data shared between them. It is readable and writable; data in it are stored strictly 512-bit aligned, giving the highest data read/write performance.
The temporary data address space is exclusive to the instruction stream kernel and stores data used by the instruction stream kernel alone. It is readable and writable, imposes no address alignment requirement, and can be read and written at byte (8-bit) granularity, giving the highest flexibility.
The instruction address space is exclusive to the instruction stream kernel and stores the instructions used by the instruction stream kernel alone. It is read-only; instructions in it are stored strictly 512-bit aligned, giving the highest read performance.
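The three attribute classes might be tabulated as in the following C sketch; the base addresses and sizes are invented, while the alignment and access rules follow the description above:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative encoding of the three address space attribute classes. */
    typedef struct {
        const char *name;
        uint64_t    base, size;      /* hypothetical placement */
        unsigned    align_bits;      /* required storage alignment, in bits */
        bool        writable;
        bool        shared_with_dataflow_kernel;
    } addr_space_t;

    static const addr_space_t address_map[] = {
        { "shared data",    0x80000000u, 0x01000000u, 512, true,  true  },
        { "temporary data", 0x90000000u, 0x00100000u,   8, true,  false },
        { "instructions",   0xA0000000u, 0x00100000u, 512, false, false },
    };

    static const addr_space_t *classify(uint64_t addr) {
        for (size_t i = 0; i < sizeof address_map / sizeof address_map[0]; i++)
            if (addr >= address_map[i].base &&
                addr < address_map[i].base + address_map[i].size)
                return &address_map[i];
        return NULL;
    }

    int main(void) {
        const addr_space_t *s = classify(0x90000040u);
        if (s) printf("0x90000040 lies in the %s space (align %u bits)\n",
                      s->name, s->align_bits);
        return 0;
    }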
The instruction stream kernel is composed of a system bus bit width adaptation module, a 64-bit/32-bit instruction pipeline module, an address space configuration module, an interrupt controller module, and a data stream kernel control module, as shown in FIG. 4.
The 64-bit/32-bit instruction pipeline module implements the execution of all instructions: it fetches instructions and data through the system master interface, executes the instructions, and computes on the data to realize specific functions. The module can also read and write registers of other devices through the peripheral interface. Its data bit width is 64 or 32 bits.
The system bus bit width adaptation module converts the 64-bit or 32-bit data bit width of the instruction pipeline module to match the data bit width of the data stream kernel, typically 512 bits. The data bit width of the data stream kernel is set to 2^N times the data bit width of the instruction pipeline module, with N an integer greater than or equal to 2. In other embodiments, the data bit width of the data stream kernel may also be 256 bits, 1024 bits, and so on.
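The 2^N width relationship can be made concrete with a short C sketch (hypothetical widths: a 64-bit pipeline and a 512-bit stream word, so the ratio is 512/64 = 8 = 2^3):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Width adaptation sketch: pack 2^N narrow pipeline words into one
     * wide data stream word (here N = 3, consistent with N >= 2). */
    #define PIPE_BITS   64
    #define STREAM_BITS 512
    #define RATIO       (STREAM_BITS / PIPE_BITS)

    typedef struct { uint64_t lane[RATIO]; } stream_word_t;   /* one 512-bit beat */

    static stream_word_t pack(const uint64_t beats[RATIO]) {
        stream_word_t w;
        memcpy(w.lane, beats, sizeof w.lane);   /* gather 8 pipeline beats */
        return w;
    }

    int main(void) {
        uint64_t beats[RATIO] = {0, 1, 2, 3, 4, 5, 6, 7};
        stream_word_t w = pack(beats);
        printf("ratio = %d lanes, lane[7] = %llu\n", RATIO,
               (unsigned long long)w.lane[7]);
        return 0;
    }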
The address space configuration module configures the address space attributes; the attribute configuration information comes from the data stream kernel, and the module supplies address space information to the instruction pipeline according to the configured attributes. The instruction pipeline applies different read/write strategies when accessing data in different address spaces.
The interrupt controller module receives two kinds of interrupt trigger signals: one from the data stream kernel, the other from the external system. On receiving an interrupt, the interrupt controller triggers the instruction pipeline to execute different tasks according to the interrupt type.
The data stream kernel control module mainly handles interaction with the data stream kernel: asserting its interrupt output signal triggers the data stream kernel to start a new task, an asserted synchronization input signal indicates that the data stream kernel has finished computing, and asserting the synchronization output signal reports to the data stream kernel that the instruction stream kernel has finished computing.
The data stream kernel is composed of a main controller, a data stream reconfiguration controller, a data stream computation module, and an instruction stream kernel controller, as shown in FIG. 5.
The main controller exposes control and status interfaces so that an external control unit (for example, a CPU or the instruction stream kernel attached to the same AXI bus) can control the data stream kernel: configuring algorithm data addresses, input data addresses, compute enable, and so on.
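How such control might look from software is sketched below in C; every register name, offset, and field is invented, since the patent does not publish a register map, and a plain array stands in for the device so the sketch runs on a host:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical register map of the data stream kernel's main controller.
     * On real hardware DF_REG would dereference the peripheral's MMIO base. */
    static volatile uint32_t fake_regs[8];
    #define DF_REG(off)    (fake_regs[(off) / 4])
    #define DF_ALGO_ADDR   0x00u   /* algorithm (weight) data address */
    #define DF_INPUT_ADDR  0x04u   /* input data address */
    #define DF_OUTPUT_ADDR 0x08u   /* output data address */
    #define DF_CTRL        0x0Cu   /* bit 0: compute enable */
    #define DF_STATUS      0x10u   /* bit 0: done */

    static void dataflow_start(uint32_t algo, uint32_t in, uint32_t out) {
        DF_REG(DF_ALGO_ADDR)   = algo;
        DF_REG(DF_INPUT_ADDR)  = in;
        DF_REG(DF_OUTPUT_ADDR) = out;
        DF_REG(DF_CTRL)        = 1u;   /* compute enable */
        DF_REG(DF_STATUS)      = 1u;   /* stand-in: hardware would set done */
    }

    int main(void) {
        dataflow_start(0x1000u, 0x2000u, 0x3000u);
        if (DF_REG(DF_STATUS) & 1u)
            puts("data stream kernel reports done");
        return 0;
    }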
The main function of the data stream reconfiguration controller is to configure the data stream computation module so that it implements a particular operator function, such as convolution or pooling.
The main function of the data stream computation module is to perform the actual data computation; its specific function is determined by the data stream reconfiguration controller.
The instruction stream kernel controller mainly handles interaction with the instruction stream kernel: asserting its interrupt input signal triggers the data stream kernel to start a new computation, an asserted synchronization input signal indicates that the instruction stream kernel has finished computing, asserting the synchronization output signal reports to the instruction stream kernel that the data stream kernel has finished computing, the address control configuration signal configures the address space attributes of the instruction stream kernel, and asserting the interrupt output signal triggers the instruction stream kernel to start a new task.
In summary, the built-in data stream kernel, with its limited computation types and dataflow implementation, provides high-computing-power, low-energy-consumption, high-performance data computation, while the built-in instruction stream kernel, with its Turing-complete instruction set, provides unlimited extensibility of computation types and highly flexible data operations (data movement, recombination, and the like). With the data stream kernel handling compute-intensive tasks and the instruction stream kernel handling non-compute tasks, the two complement each other to realize operator-extensible, high-computing-power, low-energy-consumption, high-performance, flexible, and universal neural network accelerated computation.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A universal neural network tensor processor, comprising:
an instruction stream kernel for running an instruction set, wherein the instruction set is a Turing-complete instruction set;
a data stream kernel configurable to implement dataflow operators;
a system bus, implemented externally as an AXI system master interface, which internally arbitrates between the system master interfaces of the instruction stream kernel and the data stream kernel and allows one of them to access an external system at any given time;
an instruction stream peripheral bus for switching the instruction stream kernel's peripheral accesses among different address spaces, part of the address space being mapped to the data stream kernel and part to the external AXI bus; and
a data stream peripheral bus for arbitrating access requests from the instruction stream kernel or an external AXI device, allowing one of them to access the registers of the data stream kernel at any given time.
2. The universal neural network tensor processor of claim 1, wherein the instruction stream kernel comprises:
a system master interface for fetching instructions and data from an external memory;
a peripheral master interface for actively reading and writing registers of other devices;
an external interrupt interface through which an external system triggers the instruction stream kernel to start computation; and
an internal signal interface for signal interaction between the instruction stream kernel and the data stream kernel.
3. The universal neural network tensor processor of claim 1, wherein the data stream kernel comprises:
a system master interface for fetching data from an external memory;
a peripheral slave interface for access by the instruction stream kernel or an external master device; and
an internal signal interface for signal interaction between the data stream kernel and the instruction stream kernel.
4. The universal neural network tensor processor of claim 1, wherein the instruction stream kernel comprises:
an instruction pipeline module for executing all instructions;
a system bus bit width adaptation module for converting the data bit width of the instruction pipeline module to match the data bit width of the data stream kernel, wherein the data bit width of the data stream kernel is 2^N times the data bit width of the instruction pipeline module, N being an integer greater than or equal to 2;
an address space configuration module for configuring address space attributes, the attribute configuration information coming from the data stream kernel;
an interrupt controller module for receiving two kinds of interrupt trigger signals, one from the data stream kernel and the other from an external system, wherein on receiving an interrupt the interrupt controller triggers the instruction pipeline to execute different tasks according to the interrupt type; and
a data stream kernel control module for interacting with the data stream kernel, the input and output signals of which comprise an interrupt output signal, a synchronization input signal, and a synchronization output signal, wherein asserting the interrupt output signal triggers the data stream kernel to start a new task, asserting the synchronization input signal indicates that the data stream kernel has finished computing, and asserting the synchronization output signal reports to the data stream kernel that the instruction stream kernel has finished computing.
5. The universal neural network tensor processor of claim 4, wherein the data bit width of the data stream kernel is 256, 512, or 1024 bits, and the data bit width of the instruction pipeline module is 64 or 32 bits.
6. The universal neural network tensor processor of claim 4, wherein the address space attributes of the instruction stream kernel's operation comprise:
a shared data address space, accessible by both the instruction stream kernel and the data stream kernel, for storing data shared between them; it is readable and writable, and data in it are stored strictly aligned to the data bit width of the data stream kernel;
a temporary data address space, exclusive to the instruction stream kernel, for storing data used by the instruction stream kernel alone; it is readable and writable, imposes no address alignment requirement, and can be read and written at byte granularity; and
an instruction address space, exclusive to the instruction stream kernel, for storing the instructions used by the instruction stream kernel alone; instructions in it are stored strictly aligned to the data bit width of the data stream kernel.
7. The universal neural network tensor processor of claim 1, wherein the data stream kernel comprises:
a main controller for exposing control and status interfaces so that an external control unit can control the data stream kernel;
a data stream reconfiguration controller for configuring the data stream computation module to implement a particular operator function;
a data stream computation module for performing data computation, the specific function of which is determined by the data stream reconfiguration controller; and
an instruction stream kernel controller for interaction between the data stream kernel and the instruction stream kernel, the input and output signals of which comprise an interrupt input signal, a synchronization input signal, a synchronization output signal, an address control configuration signal, and an interrupt output signal, wherein asserting the interrupt input signal triggers the data stream kernel to start a new computation, asserting the synchronization input signal indicates that the instruction stream kernel has finished computing, asserting the synchronization output signal reports to the instruction stream kernel that the data stream kernel has finished computing, the address control configuration signal configures the address space attributes of the instruction stream kernel, and asserting the interrupt output signal triggers the instruction stream kernel to start a new task.
CN202210031623.XA | Priority Date: 2022-01-12 | Filing Date: 2022-01-12 | Universal neural network tensor processor | Status: Pending | Published as CN114970844A

Priority Applications (1)

Application Number: CN202210031623.XA | Priority Date: 2022-01-12 | Filing Date: 2022-01-12 | Title: Universal neural network tensor processor

Applications Claiming Priority (1)

Application Number: CN202210031623.XA | Priority Date: 2022-01-12 | Filing Date: 2022-01-12 | Title: Universal neural network tensor processor

Publications (1)

Publication Number: CN114970844A | Publication Date: 2022-08-30

Family

ID=82974828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210031623.XA Pending CN114970844A (en) 2022-01-12 2022-01-12 Universal neural network tensor processor

Country Status (1)

Country Link
CN (1) CN114970844A (en)

Similar Documents

Publication Publication Date Title
US5689677A (en) Circuit for enhancing performance of a computer for personal use
JPH0258649B2 (en)
JP5408677B2 (en) Computer system
CN113495861A (en) System and method for computing
US5781763A (en) Independent control of DMA and I/O resources for mixed-endian computing systems
JPH03219345A (en) Multiport cache memory control device
US8145804B2 (en) Systems and methods for transferring data to maintain preferred slot positions in a bi-endian processor
CN112580792B (en) Neural network multi-core tensor processor
CN115033188B (en) Storage hardware acceleration module system based on ZNS solid state disk
WO2023173642A1 (en) Instruction scheduling method, processing circuit and electronic device
US9910801B2 (en) Processor model using a single large linear registers, with new interfacing signals supporting FIFO-base I/O ports, and interrupt-driven burst transfers eliminating DMA, bridges, and external I/O bus
Van Lunteren et al. Coherently attached programmable near-memory acceleration platform and its application to stencil processing
US6327648B1 (en) Multiprocessor system for digital signal processing
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
KR100267092B1 (en) Single instruction multiple data processing of multimedia signal processor
WO2023065748A1 (en) Accelerator and electronic device
CN114970844A (en) Universal neural network tensor processor
US10983932B2 (en) Processor and information processing apparatus
JPH02306361A (en) Microprocessor
CN112486904A (en) Register file design method and device for reconfigurable processing unit array
US11886737B2 (en) Devices and systems for in-memory processing determined
JPH0926945A (en) Information processor
CN110245096B (en) Method for realizing direct connection of processor with expansion calculation module
Bove Jr et al. Media processing with field-programmable gate arrays on a microprocessor's local bus
Chiu et al. Design and Implementation of the Link-List DMA Controller for High Bandwidth Data Streaming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination