CN116431214A - Instruction set device for reconfigurable deep neural network accelerator - Google Patents
- Publication number
- CN116431214A (application number CN202310334605.3A)
- Authority
- CN
- China
- Prior art keywords
- module
- configuration
- hardware
- instruction
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an instruction set device for a reconfigurable deep neural network accelerator, comprising an instruction controller and a plurality of hardware modules, where the hardware modules include an input/output module, a matrix calculation module and a vector calculation module. A microkernel programming paradigm provides multi-level hardware configuration: a computing task of the deep neural network accelerator is compiled into a plurality of microkernels, and each microkernel is encoded as a plurality of hardware instructions. Each hardware instruction performs module hardware configuration control and time-level configuration control for a specific computation or data-movement operation, and includes the fields: instruction type, module type, configuration address, dependency flags and module configuration content. By using hardware instructions to represent the data-flow reconfiguration and functional reconfiguration of the reconfigurable deep neural network accelerator, the invention achieves efficient programming of various complex reconfigurable neural network hardware accelerators.
Description
Technical Field
The invention relates to a hardware instruction set architecture technology, in particular to an instruction set interaction interface device of a reconfigurable neural network accelerator.
Background
An instruction set architecture (ISA), also known simply as an instruction set, is the programming-related part of a computer architecture; it covers basic data types, the instruction set, registers, addressing modes, the storage system, interrupts, exception handling, and external I/O. An instruction set architecture comprises a set of opcodes (machine language) and the basic commands executed by a particular processor.
Deep neural network (DNN) accelerators are a new type of computer hardware architecture for efficiently processing various kinds of neural network applications. Compared with traditional computers, DNN accelerators have the following characteristics. (1) High parallelism: an accelerator contains thousands of computing units (Processing Elements, PEs) arranged in rectangular or tree-shaped interconnection arrays, with data transferred between PEs by hardware data flows. (2) Limited algorithm support: DNN accelerators typically only need to support DNN operations such as matrix multiplication, convolution and activation functions, and need not support general-purpose programming. (3) Simple control logic: a DNN accelerator typically acts as a subsystem of a complete computer, so it need not support the full functionality of a complete computer, such as complex branch control or interrupts. (4) Explicit memory access: unlike traditional computers, which cache data through multi-level caches, DNN accelerators use an explicit memory-access mechanism; the user must specify, via instructions, the exact location of data at each memory level and the access sequence to each level in each cycle.
The reconfigurable DNN accelerator is a novel DNN accelerator hardware structure implementing one or more reconfigurable features: data-flow reconfiguration, functional reconfiguration and multi-module reconfiguration. Data-flow reconfiguration dynamically adjusts how data is transferred within the PE array. Functional reconfiguration dynamically adjusts the algorithm implemented by the ALU units. Multi-module reconfiguration runs different DNN computing tasks in multiple sub-modules.
Several recent efforts have designed instruction set architectures for DNN accelerators, including Cambricon [1], VTA [2] and Gemmini [3]. Each of these instruction set architectures is adapted to specific DNN accelerator hardware, but existing instruction sets offer limited support for reconfigurable DNN accelerators, typically covering only one reconfigurable feature. For accelerators with multiple reconfigurable features, a new instruction set architecture is needed to support the multiple reconfigurable features of multiple DNN accelerators and thereby improve accelerator operating efficiency.
Reference to the literature
[1] Liu, Shaoli, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. "Cambricon: An instruction set architecture for neural networks." ACM SIGARCH Computer Architecture News 44, no. 3 (2016): 393-405.
[2] Moreau, Thierry, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm et al. "A hardware–software blueprint for flexible deep learning specialization." IEEE Micro 39, no. 5 (2019): 8-16.
[3] Genc, Hasan, et al. "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration." 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an instruction set device for a reconfigurable deep neural network accelerator: an instruction set architecture supporting complex reconfigurable-function neural network accelerators. The data-flow reconfiguration and functional reconfiguration features of the neural network accelerator are expressed by hardware instructions, enabling efficient programming of complex neural network hardware accelerators, reducing the length of programming code, and supporting higher hardware operating efficiency.
For convenience, the following abbreviations are used:
PE (Processing Element) computing unit
DMA (Direct Memory Access) direct memory access
FSM (Finite State Machine) finite state machine
SRAM (Static Random-Access Memory) Static Random Access Memory
DRAM (Dynamic Random Access Memory) dynamic random access memory
The technical scheme of the invention is as follows:
an instruction set apparatus for a reconfigurable deep neural network accelerator,
compared with the prior art, the invention has the beneficial effects that:
existing instruction set architectures are typically capable of supporting only one type of reconfigurable DNN accelerator hardware. The instruction set architecture provided by the invention can support various hardware reconfiguration characteristics, thereby improving the programming efficiency of the reconfigurable hardware accelerator.
Detailed Description
The invention is further described by the following examples, which are not intended to limit the scope of the invention in any way.
The invention provides an instruction set device for a reconfigurable deep neural network accelerator, which supports various complex reconfigurable-function neural network accelerators by using instructions to represent the accelerator's data-flow reconfiguration and functional reconfiguration features.
The invention provides an instruction set device for a reconfigurable deep neural network accelerator, which comprises an instruction set format, and comprises the following components: instruction type, module type, configuration address, dependency flags and module configuration content.
The present invention designs an instruction set architecture for a reconfigurable deep neural network accelerator. The accelerator comprises an instruction controller and a plurality of hardware modules; the hardware modules include an input/output module, a matrix calculation module and a vector calculation module, and can process various deep neural network computing tasks.
Overall format of instruction set architecture:
The instruction set architecture of the present invention employs a microkernel programming paradigm to provide multi-level hardware configuration. The overall deep neural network (DNN) computing task is compiled into a plurality of microkernels, each encoded as a plurality of hardware instructions. Each hardware instruction performs module hardware configuration control and time-level configuration control for a particular computation or data-movement operation. The module hardware configuration determines the data flow and function of each hardware module in the reconfigurable deep neural network accelerator. The time-level configuration programs the finite state machines (FSMs) in the accelerator's computing modules and input/output (DMA) modules to realize time-level control of the multi-layer nested loop algorithms and data transmission tasks in the accelerator's deep neural network.
Table 1 instruction set architecture format
| Bits | 0–1 | 2–3 | 4–11 | 12–15 | 16–127 |
|---|---|---|---|---|---|
| Field | Instruction type | Module type | Dependency flags | Configuration address | Configuration content |
Table 1 shows the format of the 128-bit instruction set architecture (ISA) employed by the present invention. Each hardware instruction contains 5 fields: Inst Type (instruction type) occupies 2 bits, Module Type occupies 2 bits, Dep. Flags (dependency flags) occupies 8 bits, Config Addr (configuration address) occupies 4 bits, and Config Payload (configuration content) occupies 112 bits. The meaning and purpose of each field are described below.
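The field layout of Table 1 can be sketched as a pair of pack/unpack helpers. This is an illustrative sketch only: it follows the bit positions in Table 1, and treating bit 0 as the least-significant bit of the 128-bit word is an assumption not stated in the text.

```python
def pack_instruction(inst_type: int, module_type: int, dep_flags: int,
                     config_addr: int, payload: int) -> int:
    """Pack the five fields of Table 1 into one 128-bit instruction word.

    Bit positions per Table 1: instruction type 0-1, module type 2-3,
    dependency flags 4-11, configuration address 12-15, payload 16-127.
    Assumption: bit 0 is the least-significant bit of the word.
    """
    assert 0 <= inst_type < 4 and 0 <= module_type < 4
    assert 0 <= dep_flags < 256 and 0 <= config_addr < 16
    assert 0 <= payload < (1 << 112)
    return (inst_type
            | (module_type << 2)
            | (dep_flags << 4)
            | (config_addr << 12)
            | (payload << 16))


def unpack_instruction(word: int):
    """Inverse of pack_instruction: split a word back into its five fields."""
    return (word & 0x3,          # instruction type
            (word >> 2) & 0x3,   # module type
            (word >> 4) & 0xFF,  # dependency flags
            (word >> 12) & 0xF,  # configuration address
            word >> 16)          # configuration content
```

Packing and unpacking round-trip, which is a quick sanity check that the field widths sum to 128 bits without overlap.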
1) Instruction type, module type, configuration address
The first two bits (Inst Type) determine the type of the hardware instruction, which is one of two: a) pure configuration, or b) configuration-execute. If the type is configuration-execute, the instruction controller, after writing the corresponding configuration, sends a start signal to the corresponding hardware module (input/output module, matrix calculation module or vector calculation module) to invoke the microkernel and waits for its completion. Otherwise, the instruction configuration of the microkernel is not yet complete, and the next instruction is fetched from the user's program to continue configuration. A two-dimensional table configures each hardware module with different contents, as shown in Table 2.
In each instruction, the module type field selects the column index of Table 2, indicating the number of the hardware module to be configured in the reconfigurable deep neural network accelerator. The hardware module type is one of the following 4 types: a) input module; b) matrix calculation module; c) vector calculation module; d) output module, with module numbers 0, 1, 2 and 3, respectively. The configuration address field selects the row index of Table 2, indicating the configuration type inside the hardware module. When a microkernel computing task is translated (encoded) into hardware instructions, every valid configuration address of the targeted hardware module must be configured, one hardware instruction per address. Addresses 3 to 7 of the input and output modules are invalid, so only addresses 0 to 2 need configuration, requiring 3 instructions; the matrix and vector calculation modules must configure all addresses 0 to 7, requiring 8 instructions. Accordingly, when programming a microkernel containing K instructions, the first K-1 instructions use the pure-configuration type and the last instruction uses the configuration-execute type to initiate hardware execution.
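The per-module instruction counts above can be illustrated with a small emitter that produces one instruction tuple per valid configuration address, marking only the last one configuration-execute. The numeric encodings of the two instruction types are assumptions for illustration; the valid-address ranges and module numbers follow the text.

```python
# Module numbers from the text: 0 = input, 1 = matrix, 2 = vector, 3 = output.
PURE_CONFIG, CONFIG_EXECUTE = 0, 1   # assumed encodings of the 2-bit type field

# Valid configuration addresses per module (input/output: 0-2; matrix/vector: 0-7).
VALID_ADDRS = {0: range(0, 3), 1: range(0, 8), 2: range(0, 8), 3: range(0, 3)}


def emit_microkernel(module: int, payloads: dict, dep_flags: int = 0):
    """Emit one instruction per valid configuration address of a module.

    Returns a list of (inst_type, module, dep_flags, addr, payload) tuples;
    the last instruction uses CONFIG_EXECUTE to start the module.
    """
    addrs = list(VALID_ADDRS[module])
    kernel = []
    for i, addr in enumerate(addrs):
        itype = CONFIG_EXECUTE if i == len(addrs) - 1 else PURE_CONFIG
        kernel.append((itype, module, dep_flags, addr, payloads.get(addr, 0)))
    return kernel
```

For an input module (number 0) this yields 3 instructions, and for a matrix calculation module (number 1) it yields 8, matching the counts stated above.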
2) Dependency flag
We use an 8-bit dependency flag field to encode dependencies between different microkernels. When bit x (counting from low to high) of the lower 4 bits is 1, the instruction must wait for the ready signal sent after the module numbered x completes before it can start executing; when bit x of the upper 4 bits is 1, a ready signal is sent to module x after the instruction completes. For example, if a vector calculation instruction depends on a read instruction, the dependency flags of the vector calculation module are encoded as 00000001: bit 0 of the lower 4 bits is 1, indicating that the vector calculation instruction waits for the ready signal from module 0 (the read module). Correspondingly, the dependency flags of the read module are set to 01000000: bit 2 of the upper 4 bits is 1, indicating that after the read instruction completes, a ready signal is sent to module 2 (the vector calculation module).
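The encoding above can be sketched as a small helper; the function name is illustrative, but the bit layout (lower nibble = wait-on modules, upper nibble = signal-to modules) follows the text and reproduces its two worked examples.

```python
def dep_flags(wait_on, signal_to):
    """Encode the 8-bit dependency flag field.

    wait_on:   module numbers (0-3) whose ready signal this instruction
               waits for; set in the lower 4 bits.
    signal_to: module numbers (0-3) to send a ready signal to on
               completion; set in the upper 4 bits.
    """
    flags = 0
    for m in wait_on:
        flags |= 1 << m          # lower nibble, bit m
    for m in signal_to:
        flags |= 1 << (4 + m)    # upper nibble, bit m
    return flags
```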
Table 2 configuration content of each module at different addresses
3) Configuring content
Table 2 shows the configuration content of each type of module at different addresses. The details of each configuration are described below.
3.1 Global interconnect network configuration
All modules share the same global interconnect network configuration register at address 1. This register determines which module each memory is written by and read by. Each memory uses 4 bits of interconnect network configuration information: the first two bits are the write information, i.e. the module type that writes to the memory; the next two bits are the read information, i.e. the module type that reads the memory. The correspondence between module types and values is shown in Table 3. Each instruction changes the configuration of only some of the memories; the configuration of the remaining memories is unchanged, so their positions are filled with 0. Since this register is shared by all modules, writes to it are masked (mask bits at address 0) to avoid conflicts: every position filled with a non-zero value must have its corresponding mask bits set to 3.
Table 3 global interconnect network configuration values corresponding to module types
| Module type | Value |
|---|---|
| Matrix calculation module | 1 |
| Vector calculation module | 2 |
| Input/output module | 3 |
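The per-memory 4-bit grouping and the accompanying write mask can be sketched as follows. This is an assumption-laden illustration: placing the writer in the low two bits of each 4-bit group, and setting the mask bits to all-ones (i.e. value 3 in each changed 2-bit field) for configured memories, are inferred from the text rather than stated as an exact bit order.

```python
# Module codes from Table 3; 0 means "leave unchanged".
MATRIX, VECTOR, IO = 1, 2, 3


def interconnect_config(mem_settings):
    """Build the global interconnect register value and its write mask.

    mem_settings maps memory index -> (writer_module, reader_module).
    Each memory occupies one 4-bit group; assumption: writer in the low
    two bits, reader in the high two bits of the group. Untouched
    memories stay 0 and are masked out of the write.
    """
    value = mask = 0
    for mem, (writer, reader) in mem_settings.items():
        value |= (writer | (reader << 2)) << (4 * mem)
        # mask bits set to 3 in both changed 2-bit fields of this group
        mask |= 0xF << (4 * mem)
    return value, mask
```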
3.2 Module specific configuration
At address 2, each module configures its module-specific runtime information. For the input and output modules, the configuration content comprises a DRAM base address (32 bits), an SRAM base address (16 bits) and a read/write length (16 bits). For the matrix calculation module, the configuration content comprises a reset bit (determining whether the output matrix is reset: 1 = reset, 0 = not reset) and a data-flow bit (determining the data transfer mode in the matrix calculation module: 1 = output-stationary, 0 = weight-stationary). For the vector calculation module, the operation configuration and the data-flow configuration each contain several items, discussed separately below.
3.3 Operation configuration of the vector calculation module
An accelerator may contain multiple vector calculation modules. For a calculation using an immediate, "use immediate" is configured as 1 and "immediate value" is configured as the immediate required by the calculation; otherwise, "use immediate" is configured as 0. In addition, the operation, the source port of each input operand, and the output destination port are configured. The operation configuration content of each vector calculation module is shown in Table 4.
Table 4 operation configuration content and bit width occupied by each configuration
| Configuration item | Bit width |
|---|---|
| Use immediate | 1 |
| Immediate value | 16 |
| Operation | 4 |
| Input operand A source port | 2 |
| Input operand B source port | 2 |
| Input operand C source port | 2 |
| Output destination port | 2 |
The configuration value for each operation is shown in Table 5. For example, to set the operation of a vector calculation module to Mul, the operation field of that module is set to 7.
Table 5 operation definitions and corresponding configuration values

| Operation | Configuration value |
|---|---|
| Min | 1 |
| Max | 2 |
| Add | 3 |
| Sub | 4 |
| Shl | 5 |
| Shr | 6 |
| Mul | 7 |
| Mac | 8 |
| exp | 9 |
| log | 10 |
| sigmoid | 11 |
| tanh | 12 |
| nop | 13 |
For the input operand A source, input operand B source, input operand C source and output destination, the configuration values denote port numbers, as shown in Table 6.
TABLE 6 configuration of input operand sources and output destinations
| Port | Configuration value |
|---|---|
| No input / no output | 0 |
| Port 1 | 1 |
| Port 2 | 2 |
| Port 3 | 3 |
3.4 Data flow configuration of the vector calculation module
The data-flow configuration of the vector calculation module defines the data flow used by each of its operands. The accelerator may contain multiple vector calculation modules, but they all share the same data-flow configuration. The bit width of each configuration item is shown in Table 7.
TABLE 7 data stream configuration content and bit width occupied by each configuration
| Configuration item | Bit width |
|---|---|
| Input operand A data stream | 2 |
| Input operand B data stream | 2 |
| Input operand C data stream | 2 |
| Output operand data stream | 2 |
The correspondence between the data stream of operand A and the configuration value is shown in Table 8.
Table 8 configuration values corresponding to the data stream of input operand a
| Operand A data stream | Configuration value |
|---|---|
| Horizontal multicast | 0 |
| Horizontal systolic | 1 |
| Horizontal systolic, vertical multicast | 2 |
| Horizontal unicast, vertical multicast | 3 |
The correspondence between the data streams of operands B and C and of the output operand and their configuration values is shown in Table 9.
Table 9 configuration values corresponding to the data stream of input operand B or C and output operand
3.5 Loop and memory access configuration
The ISA designed by the invention uses the remaining configuration registers (addresses 3-7) to handle the control and data access of any 4-level perfect nested loop in the vector calculation module and the matrix calculation module, including the bounds of each loop level, the on-chip buffer base address, and the data-access stride of each loop level. The loop bounds are configured at address 3: there are 4 loop levels, each occupying 16 bits, 64 bits in total.
Addresses 4-7 configure, for each operand, the base address for memory access and the access stride of each loop level. The configuration address used by each operand is shown in Table 10.
Table 10 configuration Address used per operand
For each operand, the memory address Addr it accesses can be expressed by the following formula:
Addr = Base + St1×Idx1 + St2×Idx2 + St3×Idx3 + St4×Idx4
where Idx1 to Idx4 are the indices of the 4 loop levels, generated dynamically while the accelerator runs. For each operand, Base and St1 to St4 are configured at the corresponding configuration addresses; each value occupies 16 bits, 80 bits in total.
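The address formula above can be exercised with a short sketch that mirrors the address generation for one operand across the 4-level loop nest. The function names are illustrative; the arithmetic follows the formula directly (wrap-around of the 16-bit fields is ignored here).

```python
import itertools


def element_address(base, strides, idx):
    """Addr = Base + St1*Idx1 + St2*Idx2 + St3*Idx3 + St4*Idx4."""
    assert len(strides) == len(idx) == 4
    return base + sum(s * i for s, i in zip(strides, idx))


def loop_nest_addresses(base, strides, bounds):
    """Enumerate the addresses visited by a full 4-level perfect loop nest,
    with Idx1..Idx4 each running from 0 to its configured bound - 1."""
    return [element_address(base, strides, list(idx))
            for idx in itertools.product(*(range(b) for b in bounds))]
```

For example, strides (10, 0, 0, 1) with indices (2, 0, 0, 3) and base 100 give address 100 + 20 + 3 = 123.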
The instruction set architecture provided by the invention offers a software/hardware programming interface and supports the controllers, data flows and interconnection modes required by the hardware. On the software side, the deep neural network runtime framework should include a compiler for the reconfigurable accelerator that generates configuration code conforming to this instruction set definition. On the hardware side, the reconfigurable deep neural network accelerator should implement the corresponding functions according to the configurations described in the invention, including the instruction controller, data flows, operations and interconnect configuration, so that various deep neural network applications can run efficiently.
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.
Claims (10)
1. An instruction set device for a reconfigurable deep neural network accelerator is characterized in that the reconfigurable deep neural network accelerator comprises an instruction controller and a plurality of hardware modules, and the hardware modules comprise an input-output module, a matrix calculation module and a vector calculation module; the data flow reconstruction and the function reconstruction of the reconfigurable deep neural network accelerator are represented by using hardware instructions, so that efficient programming of various complex reconfigurable functional neural network hardware accelerators is realized;
the instruction set device adopts a microkernel programming paradigm to provide multi-level hardware configuration; compiling a computing task of the deep neural network accelerator into a plurality of microkernels, wherein each microkernel is encoded into a plurality of hardware instructions; each hardware instruction is used for module hardware configuration control and time-plane configuration control of specific computation or data movement operation;
the format of the instruction set adopts the format of a 128-bit Instruction Set Architecture (ISA); each hardware instruction includes the following fields: instruction type, module type, configuration address, dependency mark and module configuration content;
the instruction type of the hardware instruction includes a pure configuration or a configuration-execution; if the type is configuration-execution, the instruction controller sends a starting signal to the corresponding hardware module after writing the corresponding configuration to call the microkernel and wait for the microkernel to finish; otherwise, the next instruction is fetched and configuration is continued;
the module type field of each instruction is used for representing the module number corresponding to the hardware module type which needs to be configured by the reconfigurable deep neural network accelerator; the hardware module types include: an input module; a matrix calculation module; a vector calculation module; an output module;
the configuration address field is used for representing the configuration type inside the hardware module; all configuration addresses in a hardware module configured by the hardware instruction are configured by adopting one hardware instruction;
the dependency flag field is used for encoding the dependency relationships between different microkernels; when bit x (from low to high) of the lower 4 bits is 1, the hardware instruction can start executing only after receiving the ready signal sent upon completion of the module numbered x; when bit x (from low to high) of the upper 4 bits is 1, a ready signal is sent to module x after the hardware instruction completes;
the module configuration content comprises global internet configuration data, module specific configuration, operation configuration of a vector calculation module, data flow configuration and circulation and memory access configuration of a vector operation module; all modules share the same global internet configuration data and are used for determining the modules of which each memory is read or written; the module specific configuration content includes module specific runtime configuration information; the operation configuration of the vector calculation module is used for configuring the corresponding operation of the vector calculation module; the data flow configuration of the vector operation module is used for defining the data flow used by each operand of the vector operation module, and a plurality of vector operation modules share the same data flow configuration; the loop and memory access configuration is used for processing control and data access of any 4-layer perfect nested loop in the vector calculation module and the matrix calculation module, and comprises a scope of each layer of loop, an on-chip cache base address and a data access step length of each layer of loop.
2. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 1, wherein the module hardware configuration determines the data flow and function of each hardware module in the reconfigurable deep neural network accelerator; the time layer configuration realizes the time layer control of the multi-layer nested loop algorithm and the data transmission task in the deep neural network of the accelerator by programming the finite state machine in the calculation module and the input/output module of the accelerator.
3. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 1, wherein each hardware instruction includes a field in which an instruction type occupies 2 bits; the module type occupies 2 bits; the dependency mark occupies 8 bits; the configuration address occupies 4 bits; the configuration content occupies 112 bits.
4. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 3, wherein the hardware module types have module numbers of 0,1,2,3, respectively.
5. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 4, wherein when translating a microkernel computing task into a hardware instruction, all configuration addresses in a hardware module configured by the hardware instruction are configured by one hardware instruction; the 3 rd to 7 th addresses of the input module and the output module are invalid, and 3 instructions of the addresses 0 to 2 are required to be configured; the matrix and vector calculation module needs to configure all addresses from 0 to 7 for 8 instructions in total.
6. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 5, wherein when programming a microkernel containing K instructions, the first K-1 instruction is of a pure configuration type and the last instruction is of a configuration-execution type to initiate hardware execution.
7. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 5, wherein each memory uses 4-bit global internetwork configuration information; the first two bits are the written information, namely the module type written into the memory; the second two bits are read information, i.e. the module type of the memory is read.
8. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 7, wherein in the module-specific configuration, for the input module and the output module, the configuration contents include a DRAM base address, an SRAM base address, and a read-write length; for the matrix computation module, the configuration content includes reset bits and data streams.
9. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 7, wherein in the module-specific configuration, for the vector computation module, the operational configuration and the data flow configuration each comprise a plurality of pieces of content, comprising: the configuration uses immediate computation, operands, source ports for each input operand, and output destination ports.
10. The instruction set apparatus for a reconfigurable deep neural network accelerator of claim 9, wherein the loop and memory-access configuration uses the configuration registers at addresses 3 to 7 to control the vector calculation module and the matrix calculation module, and their data accesses, with an arbitrary 4-level perfectly nested loop; the range of each loop level is configured at address 3, with 4 levels in total, each level occupying 16 bits, for 64 bits in total.
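The address-3 register layout of claim 10 (four loop ranges, 16 bits each, packed into 64 bits) can be sketched as follows. The field order, with loop level 0 in the low bits, is an assumption; only the 4x16-bit packing is stated in the claim.

```python
def pack_loop_ranges(ranges: list[int]) -> int:
    """Pack four 16-bit loop trip counts into the 64-bit register value
    written to configuration address 3 (claim 10)."""
    assert len(ranges) == 4, "a 4-level perfectly nested loop"
    word = 0
    for level, r in enumerate(ranges):
        assert 0 <= r < (1 << 16), "each loop range must fit in 16 bits"
        word |= r << (16 * level)  # assumed order: level 0 in the low bits
    return word

def unpack_loop_ranges(word: int) -> list[int]:
    """Recover the four loop ranges from the 64-bit register value."""
    return [(word >> (16 * level)) & 0xFFFF for level in range(4)]
```

A consequence of the 16-bit fields is that each loop level is limited to 65535 iterations; larger problem sizes would have to be tiled across multiple microkernels.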
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310334605.3A CN116431214A (en) | 2023-03-31 | 2023-03-31 | Instruction set device for reconfigurable deep neural network accelerator |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116431214A true CN116431214A (en) | 2023-07-14 |
Family
ID=87084830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310334605.3A Pending CN116431214A (en) | 2023-03-31 | 2023-03-31 | Instruction set device for reconfigurable deep neural network accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116431214A (en) |
- 2023-03-31: CN application CN202310334605.3A filed (publication CN116431214A); status: active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ankit et al. | PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference | |
Hajinazar et al. | SIMDRAM: A framework for bit-serial SIMD processing using DRAM | |
CN111630502B (en) | Unified memory organization for neural network processors | |
Fujiki et al. | In-memory data parallel processor | |
US8984256B2 (en) | Thread optimized multiprocessor architecture | |
US11055613B2 (en) | Method and apparatus for a binary neural network mapping scheme utilizing a gate array architecture | |
US20080250227A1 (en) | General Purpose Multiprocessor Programming Apparatus And Method | |
EP3497624A1 (en) | Apparatuses, methods, and systems for neural networks | |
US20210224185A1 (en) | Data layout optimization on processing in memory architecture for executing neural network model | |
US20140351563A1 (en) | Advanced processor architecture | |
CN1853164B (en) | Combinational method for developing building blocks of DSP compiler | |
CN111752530A (en) | Machine learning architecture support for block sparsity | |
CN112445454A (en) | System for performing unary functions using range-specific coefficient set fields | |
Challapalle et al. | FARM: A flexible accelerator for recurrent and memory augmented neural networks | |
AskariHemmat et al. | RISC-V barrel processor for deep neural network acceleration | |
Liu et al. | Establishing high performance AI ecosystem on Sunway platform | |
US10990384B2 (en) | System, apparatus and method for dynamic update to code stored in a read-only memory (ROM) | |
CN116431214A (en) | Instruction set device for reconfigurable deep neural network accelerator | |
Basalama et al. | SPAR-2: A SIMD processor array for machine learning in IoT devices | |
Han et al. | Polyhedral-based compilation framework for in-memory neural network accelerators | |
US20220414050A1 (en) | Apparatus for Memory Configuration for Array Processor and Associated Methods | |
US20220413850A1 (en) | Apparatus for Processor with Macro-Instruction and Associated Methods | |
Mambu et al. | Dedicated instruction set for pattern-based data transfers: an experimental validation on systems containing in-memory computing units | |
CN111857824A (en) | Control system and method for fractal intelligent processor and electronic equipment | |
Zhao et al. | A microcode-based control unit for deep learning processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||