CN117634569A - Quantized neural network acceleration processor based on RISC-V expansion instruction - Google Patents

Quantized neural network acceleration processor based on RISC-V expansion instruction

Info

Publication number
CN117634569A
CN117634569A
Authority
CN
China
Prior art keywords
module
instruction
bit
neural network
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311581806.XA
Other languages
Chinese (zh)
Other versions
CN117634569B (en)
Inventor
黄科杰
刘佳沂
沈海斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311581806.XA priority Critical patent/CN117634569B/en
Priority claimed from CN202311581806.XA external-priority patent/CN117634569B/en
Publication of CN117634569A publication Critical patent/CN117634569A/en
Application granted granted Critical
Publication of CN117634569B publication Critical patent/CN117634569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a quantized neural network acceleration processor based on RISC-V extension instructions. The processor adopts a four-stage pipeline consisting of instruction fetch, decode, execute, and write-back stages. It supports a custom extended instruction set with high data bit widths, which improves computational parallelism and accelerates operation. Correspondingly, the processor internally provides three data paths of different bit widths, with matching register sets, to support grouped quantization and parallel computation of the neural network. Fast computation of convolutional layers and activation functions is achieved through a computing unit supporting the extended instruction set and a lookup table, while dynamic fixed-point computation improves the adaptability and computational precision of the group-quantized neural network. The invention has the advantages of high computational efficiency, low energy consumption, and a wide range of applications.

Description

Quantized neural network acceleration processor based on RISC-V expansion instruction
Technical Field
The invention belongs to the technical fields of neural network hardware acceleration and RISC-V instruction set extension processors, and particularly relates to a quantized neural network acceleration processor based on RISC-V extension instructions.
Background
With the further development of the Internet of Things, applications such as real-time positioning, real-time environment detection, secure data acquisition, timely reporting of large-scale sensor data, and intelligent manufacturing impose strict real-time requirements and large data transmission volumes. All of them require the edge to provide, at controllable cost, larger communication bandwidth, stable communication capability, acceleration of complex algorithms, and large data processing capability.
However, the resources of Internet of Things terminal devices are limited, and the computing resources of their on-board processors are usually scarce. IoT hardware has become a bottleneck restricting the development and deployment of AIoT at the edge, so the development of a low-power, low-cost, small dedicated processor for the AIoT edge is urgent.
Disclosure of Invention
The invention extends parallel instructions on the basis of the RISC-V instruction set and designs a low-power, low-cost, small dedicated processor for the AIoT edge. The processor features high energy efficiency, high speed, and low resource occupation, and can support various data processing algorithms on the Internet of Things end side, thereby removing the constraint that the limited resources and performance of IoT terminal devices place on AIoT development.
To achieve the above object, the present invention provides a quantized neural network acceleration processor based on RISC-V extension instructions, comprising: an instruction fetch module, a decode module, an execution module, a write-back module, a data path module, and a controller;
the instruction fetch module acquires, from an external bus, instructions converted from the external neural network model and judges whether each acquired instruction is a compressed instruction; if not, the instruction is transmitted directly to the decode module, and if so, the compressed instruction is decompressed and then transmitted to the decode module;
the decode module comprises a decoder and a register set; the decoder decodes the instructions transmitted by the instruction fetch module to obtain instruction control signals, and the register set is addressed by the instruction control signals generated by the decoder and stores the write-back data transmitted by the write-back module;
the execution module comprises an arithmetic execution module and a status register; the arithmetic execution module performs specific computations or accesses the register set of the decode module according to the instruction control signals decoded by the decode module, and the status register stores the status information of each module in the processor;
the write-back module exchanges data with the external bus and writes the data it reads, together with the computation results of the execution module, back into the register set of the decode module or into the arithmetic execution module of the execution module;
the data path module comprises three data paths with bit widths of 32, 128, and 136 bits; the 32-bit data path transmits 32-bit data among the decode module, the execution module, and the write-back module; the 128-bit data path exchanges data with the outside; the 136-bit data path transmits 136-bit data among the decode module, the execution module, and the write-back module;
the controller controls each module in the processor according to the status information of each module stored in the status register.
As a preferred embodiment of the present invention, the instruction fetch module includes an instruction interface, a prefetch buffer, and a compressed instruction decoder, where the instruction interface connects to the external bus; the prefetch buffer fetches instructions in advance through the instruction interface, reducing instruction-access latency and improving execution efficiency; and the compressed instruction decoder judges whether an acquired instruction is a compressed instruction and decompresses it.
As a preferred embodiment of the present invention, the instructions decoded by the decoder include the RV32IC instruction set and extension instructions; the extension instructions comprise 128-bit load and store instructions, 128-bit arithmetic instructions, activation function instructions, and dynamic fixed-point instructions.
As a preferred embodiment of the present invention, the register set includes a 32-bit general-purpose register set (General Purpose Register, GPR) for storing data processed by the RV32IC instruction set and a 136-bit vector register set (Vector Register) for storing data processed by the extension instructions. The 136-bit vector register set is physically composed of 30 128-bit data registers and 32 8-bit scaling-factor registers; neural network model parameters transmitted over the external bus are stored in the 128-bit data registers, while neural network scaling factors transmitted over the external bus are stored separately in the 8-bit scaling-factor registers. On output, the 136-bit vector register set concatenates a 128-bit group of neural network model parameters with its 8-bit scaling factor to form a 136-bit source operand.
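For illustration, the following C sketch models the register organization just described. It is a minimal model under the assumptions stated in the comments, not the patent's implementation, and every type and function name is invented for this example:

```c
#include <stdint.h>

#define NUM_DATA_REGS 30   /* 30 x 128-bit data registers          */
#define NUM_SF_REGS   32   /* 32 x 8-bit scaling-factor registers  */

/* Hypothetical model of the 136-bit vector register file. */
typedef struct {
    uint8_t data[NUM_DATA_REGS][16];  /* 128-bit neural network parameters */
    int8_t  sf[NUM_SF_REGS];          /* 8-bit scaling factors             */
} vector_regfile_t;

/* A 136-bit source operand: 8-bit scaling factor (bits 135:128)
 * concatenated with 128-bit data (bits 127:0). */
typedef struct {
    int8_t  sf;
    uint8_t data[16];
} operand136_t;

/* Reading register idx splices the scaling factor onto the data word. */
static operand136_t vrf_read(const vector_regfile_t *vrf, int idx) {
    operand136_t op;
    op.sf = vrf->sf[idx];
    for (int i = 0; i < 16; i++)
        op.data[i] = vrf->data[idx][i];
    return op;
}
```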
As a preferred embodiment of the present invention, the neural network quantization scaling factors are determined by the externally input neural network model; when a plain fixed-point operation rather than a dynamic fixed-point operation is performed, the scaling-factor portion is filled with zeros.
As a preferred embodiment of the present invention, the addresses of the 32-bit general-purpose registers and the 136-bit vector registers are derived from the control signals generated by the decoder, and the data written into them comes either from results produced when the execution module executes an instruction or from data written back by the write-back module.
As a preferred embodiment of the present invention, the execution module comprises a status register, a multiplication module, an arithmetic logic unit, a vector arithmetic logic unit, and a lookup table. The multiplication module performs 32-bit multiplication as well as vector fixed-point and dynamic fixed-point multiplication and multiply-add operations; the arithmetic logic unit performs arithmetic and logic operations on 32-bit data; the vector arithmetic logic unit performs vector dynamic fixed-point and fixed-point logic operations as well as nonlinear activation function computation; and the lookup table stores precomputed activation function values and supports the activation function instructions among the extension instructions.
As a preferred embodiment of the present invention, the dynamic fixed-point multiplication and multiply-add operations in the multiplication module and the vector dynamic fixed-point operations in the vector arithmetic logic unit are all dynamic fixed-point computations. The computing resources needed for dynamic fixed-point computation comprise 16 8-bit multipliers for multiplication, 16 8-bit adders for addition and multiply-add operations, and 3 128-bit shifters; the shifters are multiplexed among the multiplication, addition, and multiply-add operations, and the adders and multipliers are also multiplexed for plain fixed-point operations.
The invention also provides a device comprising the above processor.
Compared with the prior art, the invention has the following beneficial effects:
1) For neural network computation, the invention extends convolution operations and activation function computation on the basis of the RISC-V instruction set; the extended instructions accelerate operation and reduce the instruction count, thereby reducing the number of cycles required for computation;
2) To address the precision loss caused by fixed-point operation, an additional dynamic fixed-point instruction is designed that also handles bit-width changes during computation. The processor is equipped with dedicated units for extended parallel computation, activation function computation, and dynamic fixed-point computation; different bit widths are designed for instructions and data, and efficient parallel data streams reduce the total number of data reads and writes while improving parallelism, thereby lowering energy consumption and increasing operation speed.
Drawings
FIG. 1 is a processor pipeline architecture diagram of the present invention;
FIG. 2 is a schematic diagram of a register according to the present invention;
FIG. 3 is a schematic diagram of a dynamic fixed-point computing process in the present invention;
FIG. 4 is a schematic diagram of the structure of the arithmetic unit when implementing dynamic fixed-point calculation in the present invention;
FIG. 5 is a schematic diagram of the structure of the operation unit when nonlinear activation function calculation is implemented in the present invention.
Detailed Description
The invention is further illustrated and described below with reference to specific embodiments. The described embodiments are merely some embodiments of the invention and do not limit its scope. The technical features of the embodiments of the invention can be combined with one another provided they do not conflict.
As shown in FIG. 1, an embodiment of the present invention provides a quantized neural network acceleration processor based on RISC-V extension instructions. The processor has a four-stage pipeline and comprises an instruction fetch module, a decode module, an execution module, a write-back module, and a controller. The instruction fetch module acquires, from an external bus, instructions converted from the external neural network model and judges whether each acquired instruction is a compressed instruction; if not, the instruction is transmitted directly to the decode module, and if so, the compressed instruction is decompressed and then transmitted to the decode module.
The decode module comprises a decoder and a register set; the decoder decodes the instructions transmitted by the instruction fetch module to obtain instruction control signals, and the register set is addressed by the instruction control signals generated by the decoder and stores the write-back data transmitted by the write-back module.
The execution module comprises an arithmetic execution module and a status register; the arithmetic execution module performs specific computations or accesses the register set of the decode module according to the instruction control signals decoded by the decode module, and the status register stores the status information of each module in the processor.
The write-back module exchanges data with the external bus and writes the data it reads, together with the computation results of the execution module, back into the register set of the decode module or into the arithmetic execution module of the execution module.
The data path module comprises three data paths with bit widths of 32, 128, and 136 bits; the 32-bit data path transmits 32-bit data among the decode module, the execution module, and the write-back module; the 128-bit data path exchanges data with the outside; the 136-bit data path transmits 136-bit data among the decode module, the execution module, and the write-back module.
The controller controls each module in the processor according to the status information of each module stored in the status register.
In one embodiment of the invention, the instruction fetch module includes an instruction interface, a prefetch buffer, and a compressed instruction decoder. The prefetch buffer fetches instructions in advance, reducing instruction-access latency and improving execution efficiency; the compressed instruction decoder supports the RV32C instruction set, judging whether an instruction is a compressed instruction and decoding it into a normal instruction before sending it to the decode module. The decode module includes a decoder, a 32-bit general-purpose register set (General Purpose Register, GPR), and a 136-bit vector register set (Vector Register). The addresses of the 32-bit and 136-bit registers are derived from control signals generated by the decoder, and the written data comes either from forwarded results produced by the execution module or from write-back data produced by the write-back module. The execution module includes a control and status register (Control and Status Register, CSR), a multiplication module, an arithmetic logic unit (Arithmetic Logic Unit, ALU), a vector arithmetic logic unit (Vector Arithmetic Logic Unit), and a lookup table (Look Up Table, LUT). The status register stores control and status information of the processor; the multiplication module performs 32-bit multiplication as well as vector fixed-point and dynamic fixed-point multiplication and multiply-add operations; the vector arithmetic logic unit performs vector dynamic fixed-point and fixed-point logic operations as well as nonlinear activation function computation. In the dynamic fixed-point computation path, the input operands include the source operands and the output scaling factor (fl_out) passed down from the decode stage. The write-back module writes the computation results of the execution module and the data read from the data interface back into the registers. The controller controls the operations of the processor according to the control signals supplied by the status register.
The processor supports the 32-bit instruction set and the 128-bit extended vector instruction set. 32-bit data computation uses the 32-bit data path inside the processor. 128-bit fixed-point computation and nonlinear activation function computation use the 136-bit data path, with the upper 8 bits filled with zeros. 128-bit dynamic fixed-point computation uses the 136-bit data path, with the upper 8 bits filled with the scaling factor corresponding to that group of data; the quantization scaling factors are determined by the externally input neural network model. The data path between the data interface and the load-store module is 128 bits wide, so a write from the load-store module to a register concatenates the 128-bit data with zeros in the upper 8 bits to form 136-bit data.
As shown in FIG. 2, the vector register set in the processor is physically composed of 30 128-bit data registers and 32 8-bit scaling-factor registers. When the 128-bit load instruction (lw128) or store instruction (sw128) is executed, the register set is treated as 32 128-bit data registers, and the upper 8 bits of the input are discarded. When the register address is less than 30, the data is stored in the corresponding data register; when the address is 30, the data is stored in scaling-factor registers 0-15; when the address is 31, the data is stored in scaling-factor registers 16-31. When a dynamic fixed-point instruction is executed, the 136-bit input is split into the upper 8-bit scaling factor and the lower 128-bit data, which are stored in the corresponding data register and scaling-factor register according to the input address.
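A hedged C sketch of this register-file write addressing, reusing the hypothetical vector_regfile_t type from the earlier sketch (the function name and signature are likewise assumptions):

```c
/* Register-file write path for the 128-bit memory instructions described
 * above: addresses 0-29 select a 128-bit data register, address 30 fills
 * scaling-factor registers 0-15, and address 31 fills registers 16-31.
 * The upper 8 bits of the 136-bit input have already been discarded. */
static void vrf_write128(vector_regfile_t *vrf, int addr,
                         const uint8_t data128[16]) {
    if (addr < 30) {                     /* ordinary 128-bit data register */
        for (int i = 0; i < 16; i++)
            vrf->data[addr][i] = data128[i];
    } else {
        int base = (addr - 30) * 16;     /* 30 -> sf[0..15], 31 -> sf[16..31] */
        for (int i = 0; i < 16; i++)
            vrf->sf[base + i] = (int8_t)data128[i];
    }
}
```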
As shown in FIG. 3, in one embodiment of the present invention, the processor uses a dynamic fixed-point data representation and dynamic fixed-point addition. A dynamic fixed-point number consists of 8-bit data and an 8-bit scaling factor, with the most significant bit of the data serving as the sign bit. The value actually represented by a dynamic fixed-point number equals the data itself multiplied by 2 raised to the power fl, where fl is the scaling factor. When two numbers with different scaling factors are added, both must be shifted so that their integer parts align with the format determined by the output scaling factor; a fixed-point addition is then performed to obtain the output value. In this design, the 128-bit data consists of 16 parallel 8-bit values that share the same scaling factor.
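A minimal C sketch of this representation and the alignment-then-add rule, assuming value = data × 2^fl as stated above; all names are illustrative and rounding/saturation behavior is simplified:

```c
#include <stdint.h>

/* Re-express x from format fl_in in format fl_out.
 * Since value = x * 2^fl_in = x' * 2^fl_out, x' = x * 2^(fl_in - fl_out). */
static int8_t dfp_align(int8_t x, int8_t fl_in, int8_t fl_out) {
    int d = fl_in - fl_out;
    return (int8_t)((d >= 0) ? (x << d) : (x >> -d));
}

/* Dynamic fixed-point addition: align both operands to the output
 * scaling factor, then add them as plain fixed-point integers. */
static int8_t dfp_add(int8_t a, int8_t fl_a,
                      int8_t b, int8_t fl_b, int8_t fl_out) {
    return (int8_t)(dfp_align(a, fl_a, fl_out) + dfp_align(b, fl_b, fl_out));
}
```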
As shown in FIG. 4, the computing resources used for dynamic fixed-point computation comprise 16 8-bit multipliers for multiplication, 16 8-bit adders for addition and multiply-add operations, and 3 128-bit shifters. For an addition, the scaling factors of source operand a (operand_a) and source operand b (operand_b) are compared with the specified output scaling factor (fl, fractional length) to determine how each operand must be shifted; after shifting, the 16 parallel 8-bit adders produce the output. For a multiplication, operand a and operand b are multiplied by the 16 parallel 8-bit multipliers, bits beyond 8 in each product are discarded, and the multiplier outputs are then shifted according to the operand scaling factors and the specified output scaling factor to obtain the multiplication result. For a multiply-add, operand a and operand b first undergo a dynamic fixed-point multiplication, and the result is added to source operand c (operand_c) by a dynamic fixed-point addition to produce the final output.
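The 16-lane multiply-add can be sketched as below, reusing dfp_align from the previous sketch. The lane width and the truncation of products to 8 bits follow the description above; everything else, including the function name and signature, is an assumption:

```c
/* 16-lane dynamic fixed-point multiply-add (one 128-bit vector operation).
 * Each 8x8 product is truncated to 8 bits, shifted from its natural
 * format fl_a + fl_b to fl_out, then accumulated with operand c. */
static void dfp_mac16(const int8_t a[16], int8_t fl_a,
                      const int8_t b[16], int8_t fl_b,
                      const int8_t c[16], int8_t fl_c,
                      int8_t out[16], int8_t fl_out) {
    for (int lane = 0; lane < 16; lane++) {
        int8_t p = (int8_t)(a[lane] * b[lane]);  /* bits beyond 8 discarded */
        int8_t prod = dfp_align(p, (int8_t)(fl_a + fl_b), fl_out);
        out[lane] = (int8_t)(prod + dfp_align(c[lane], fl_c, fl_out));
    }
}
```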
As shown in FIG. 5, for nonlinear activation function computation the lookup table stores the values of the tanh function at different inputs. During computation, the controller maps the input value to a lookup-table address and fetches the corresponding function value from the lookup-table memory. When the input value is small, the input value itself is selected directly as the output. When the tanh function is computed, the sign bit is attached to the output of the previous step to form the final result; when the sigmoid function is computed, the output of the previous step is offset by 0.5 and the sign bit is attached to obtain the final result.
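A hedged floating-point sketch of this lookup scheme. The table size, input range, small-input threshold, and the sigmoid identity used (sigmoid(x) = tanh(x/2)/2 + 0.5, one way to realize the 0.5 offset described above) are all assumptions for illustration:

```c
#include <math.h>

#define LUT_SIZE 64
static float lut[LUT_SIZE];               /* tanh magnitudes sampled on [0, 4) */

static void lut_init(void) {
    for (int i = 0; i < LUT_SIZE; i++)
        lut[i] = tanhf(4.0f * i / LUT_SIZE);
}

static float lut_tanh(float x) {
    float ax = fabsf(x);
    if (ax < 0.03125f)                     /* small input: tanh(x) ~= x  */
        return x;
    if (ax > 3.9999f) ax = 3.9999f;        /* saturate at the table edge */
    int idx = (int)(ax * LUT_SIZE / 4.0f); /* map input to table address */
    float mag = lut[idx];
    return (x < 0.0f) ? -mag : mag;        /* reattach the sign bit      */
}

static float lut_sigmoid(float x) {
    /* sigmoid(x) = tanh(x/2) / 2 + 0.5 reuses the same tanh table */
    return 0.5f * lut_tanh(0.5f * x) + 0.5f;
}
```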
The invention also provides a device comprising the quantized neural network acceleration processor based on RISC-V extension instructions.
The foregoing examples illustrate only a few embodiments of the invention; they are described in detail but should not be construed as limiting the scope of the invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.

Claims (9)

1. A quantized neural network acceleration processor based on RISC-V extension instructions, comprising: an instruction fetch module, a decode module, an execution module, a write-back module, a data path module, and a controller;
the instruction fetch module acquires, from an external bus, instructions converted from the external neural network model and judges whether each acquired instruction is a compressed instruction; if not, the instruction is transmitted directly to the decode module, and if so, the compressed instruction is decompressed and then transmitted to the decode module;
the decode module comprises a decoder and a register set; the decoder decodes the instructions transmitted by the instruction fetch module to obtain instruction control signals, and the register set is addressed by the instruction control signals generated by the decoder and stores the write-back data transmitted by the write-back module;
the execution module comprises an arithmetic execution module and a status register; the arithmetic execution module performs specific computations or accesses the register set of the decode module according to the instruction control signals decoded by the decode module, and the status register stores the status information of each module in the processor;
the write-back module exchanges data with the external bus and writes the data it reads, together with the computation results of the execution module, back into the register set of the decode module or into the arithmetic execution module of the execution module;
the data path module comprises three data paths with bit widths of 32, 128, and 136 bits; the 32-bit data path transmits 32-bit data among the decode module, the execution module, and the write-back module; the 128-bit data path exchanges data with the outside; the 136-bit data path transmits 136-bit data among the decode module, the execution module, and the write-back module;
the controller controls each module in the processor according to the status information of each module stored in the status register.
2. The quantized neural network acceleration processor of claim 1, wherein the instruction fetch module comprises an instruction interface, a prefetch buffer, and a compressed instruction decoder; the instruction interface connects to the external bus; the prefetch buffer fetches instructions in advance through the instruction interface, reducing instruction-access latency and improving execution efficiency; and the compressed instruction decoder judges whether an acquired instruction is a compressed instruction and decompresses it.
3. The quantized neural network acceleration processor of claim 1, wherein the instructions decoded by the decoder comprise the RV32IC instruction set and extension instructions; the extension instructions comprise 128-bit load and store instructions, 128-bit arithmetic instructions, activation function instructions, and dynamic fixed-point instructions.
4. The quantized neural network acceleration processor of claim 3, wherein the register set comprises a 32-bit general-purpose register set for storing data processed by the RV32IC instruction set and a 136-bit vector register set for storing data processed by the extension instructions; the 136-bit vector register set is physically composed of 30 128-bit data registers and 32 8-bit scaling-factor registers; neural network model parameters transmitted over the external bus are stored in the 128-bit data registers, while neural network scaling factors transmitted over the external bus are stored separately in the 8-bit scaling-factor registers; on output, the 136-bit vector register set concatenates a 128-bit group of neural network model parameters with its 8-bit scaling factor to form a 136-bit source operand.
5. The quantized neural network acceleration processor of claim 4, wherein the neural network quantization scaling factors are determined by the externally input neural network model, and the scaling-factor portion is filled with zeros when a plain fixed-point operation rather than a dynamic fixed-point operation is performed.
6. The quantized neural network acceleration processor of claim 4, wherein the addresses of the 32-bit general-purpose registers and the 136-bit vector registers are derived from the instruction control signals generated by the decoder, and the data written into the 32-bit general-purpose registers and the 136-bit vector registers comes either from results produced when the execution module executes an instruction or from data written back by the write-back module.
7. The quantized neural network acceleration processor of claim 1, wherein the execution module comprises a status register, a multiplication module, an arithmetic logic unit, a vector arithmetic logic unit, and a lookup table; the multiplication module performs 32-bit multiplication as well as vector fixed-point and dynamic fixed-point multiplication and multiply-add operations; the arithmetic logic unit performs arithmetic and logic operations on 32-bit data; the vector arithmetic logic unit performs vector dynamic fixed-point and fixed-point logic operations as well as nonlinear activation function computation; and the lookup table stores precomputed activation function values and supports the activation function instructions among the extension instructions.
8. The quantized neural network acceleration processor of claim 7, wherein the dynamic fixed-point multiplication and multiply-add operations in the multiplication module and the vector dynamic fixed-point operations in the vector arithmetic logic unit are all dynamic fixed-point computations; the computing resources needed for dynamic fixed-point computation comprise 16 8-bit multipliers for multiplication, 16 8-bit adders for addition and multiply-add operations, and 3 128-bit shifters; the shifters are multiplexed among the multiplication, addition, and multiply-add operations, and the adders and multipliers are also multiplexed for plain fixed-point operations.
9. An apparatus comprising the processor of any of claims 1-8.
CN202311581806.XA 2023-11-24 Quantized neural network acceleration processor based on RISC-V expansion instruction Active CN117634569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311581806.XA CN117634569B (en) 2023-11-24 Quantized neural network acceleration processor based on RISC-V expansion instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311581806.XA CN117634569B (en) 2023-11-24 Quantized neural network acceleration processor based on RISC-V expansion instruction

Publications (2)

Publication Number Publication Date
CN117634569A 2024-03-01
CN117634569B (en) 2024-06-28


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118092853A (en) * 2024-04-26 2024-05-28 中科亿海微电子科技(苏州)有限公司 Instruction set expansion method and device based on RISC-V floating point overrunning function


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110007961A (en) * 2019-02-01 2019-07-12 中山大学 A kind of edge calculations hardware structure based on RISC-V
CN111915003A (en) * 2019-05-09 2020-11-10 深圳大普微电子科技有限公司 Neural network hardware accelerator
US20230074229A1 (en) * 2020-02-05 2023-03-09 The Trustees Of Princeton University Scalable array architecture for in-memory computing
WO2023016481A1 (en) * 2021-08-13 2023-02-16 华为技术有限公司 Data processing method and related apparatus
CN114239806A (en) * 2021-12-16 2022-03-25 浙江大学 RISC-V structured multi-core neural network processor chip
CN114911526A (en) * 2022-06-01 2022-08-16 中国人民解放军国防科技大学 Brain-like processor based on brain-like instruction set and application method thereof
CN116258185A (en) * 2023-01-11 2023-06-13 阿里巴巴(中国)有限公司 Processor, convolution network computing method with variable precision and computing equipment
CN116432765A (en) * 2023-01-16 2023-07-14 浙江大学 RISC-V-based special processor for post quantum cryptography algorithm
CN115983348A (en) * 2023-02-08 2023-04-18 天津大学 RISC-V accelerator system supporting convolution neural network extended instruction
CN116700796A (en) * 2023-05-29 2023-09-05 中国人民解放军93216部队 Implementation architecture and method of RISC-V information security expansion instruction on five-stage pipeline structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
傅思扬; 陈华; 郁发新: "Design and Implementation of a Convolutional Neural Network Processor Based on RISC-V", 微电子学与计算机 (Microelectronics & Computer), no. 04, 5 April 2020 (2020-04-05) *
李理: "Research on the Development of Edge Computing Technology in 2019", 无人系统技术 (Unmanned Systems Technology), no. 02, 15 March 2020 (2020-03-15) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant