CN115983348A - RISC-V accelerator system supporting convolution neural network extended instruction - Google Patents

RISC-V accelerator system supporting convolution neural network extended instruction

Info

Publication number: CN115983348A
Application number: CN202310081218.3A
Authority: CN (China)
Prior art keywords: instruction, matrix, module, data, control information
Priority/filing date: 2023-02-08
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 魏继增, 王兹哲
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A RISC-V accelerator system supporting convolutional neural network extension instructions comprises an external memory that stores all instructions and data, an AXI bus for data transmission, and an instruction fetching module, a decoding module, an execution module, a memory access module, and a write-back module connected in series to form a five-stage pipeline. The output of the write-back module is connected to a general register file, into which the operation result of the current instruction is written for later retrieval by the decoding module. The invention provides a general, modular, and extensible instruction set capable of handling all convolutional layer operations; because it builds on the RISC-V base instruction set, the generality of the processor is greatly improved. Combining the dedicated matrix extension instructions with the RISC-V base instruction set markedly improves computational performance and reduces resource usage for convolutional neural network workloads.

Description

RISC-V accelerator system supporting convolution neural network extended instruction
Technical Field
The invention relates to neural network accelerators, and in particular to a RISC-V accelerator system that is implemented on the RISC-V instruction set and supports convolutional neural network extension instructions.
Background
In recent years, with the rapid development of artificial intelligence technology, the depth and computational cost of the Convolutional Neural Network (CNN), a staple algorithm of deep learning, have been growing rapidly. A defining feature of deep learning is its enormous and still-growing computation demand, of which convolution makes up a large part, and a Central Processing Unit (CPU) designed for general-purpose logic cannot keep up with such workloads. The common approach at present is to offload neural network computation to a GPU (Graphics Processing Unit), an ASIC (Application-Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array). On embedded mobile platforms with limited computing and storage resources, GPUs and ASICs suffer from drawbacks such as high cost, limited flexibility, and poor scalability. The low power consumption of FPGAs and ASICs, on the other hand, opens up wider application fields, such as power-constrained embedded platforms. This work therefore focuses on Convolutional Neural Network (CNN) accelerators built on these two kinds of platform. Previous work shows that such accelerators typically accelerate only a specific network architecture or a specific type of layer, following relatively fixed patterns with limited flexibility.
However, current technical solutions in this field, both domestic and foreign, still have problems. For the many convolutional neural networks in use, the prior art either improves the computational performance of CPUs and GPUs, which greatly increases power consumption and cost and puts heavy pressure on a general-purpose processor that must support diverse workloads, or relies on individually designed dedicated hardware accelerators, which improve efficiency but lack flexibility and struggle in scenarios that demand diversity.
Combining a processor that is highly general, flexible, and low-power with the CNN algorithm, and designing a RISC-V accelerator that supports user-defined matrix instructions, is a feasible way to address both the flexibility problem and the energy-efficiency problem.
Disclosure of Invention
The invention aims to solve the technical problem of providing a RISC-V accelerator system supporting convolutional neural network extension instructions that further reduces the running power consumption of convolutional neural networks, lowers cost, and improves flexibility and portability.
The technical scheme adopted by the invention is as follows: a RISC-V accelerator system supporting convolutional neural network extension instructions comprises an external memory that stores all instructions and data, an AXI bus for data transmission, and an instruction fetching module, a decoding module, an execution module, a memory access module, and a write-back module connected sequentially in series to form a five-stage pipeline, wherein the output of the write-back module is connected to a general register file, into which the operation result of the current instruction is written for later retrieval by the decoding module; wherein:
the instruction fetching module is connected to the external memory through the AXI bus, fetches instructions from the external memory, and sends them to the decoding module;
the decoding module translates each incoming instruction into an instruction type, general register file addresses, and memory-access information, reads the data the instruction requires from the general register file, and finally passes the instruction type, the memory-access information, and the instruction's data to the execution module;
the execution module performs the operation corresponding to the incoming instruction type on the instruction's data, and passes the operation result, together with the memory-access information, to the memory access module;
the memory access module checks whether the current instruction is a memory-access instruction; if it is, the module connects to the external memory through the AXI bus, interacts with it according to the memory-access information received from the execution module, and sends both the data obtained from the external memory and the execution module's operation result to the write-back module; otherwise it forwards the execution module's result to the write-back module;
the write-back module writes the received operation result into the general register file;
the general register file stores the data required by instructions, updates that data with the operation results received from the write-back module, and supplies it to the decoding module when the decoding module fetches it.
The RISC-V accelerator system supporting convolutional neural network extension instructions provides a general, modular, and extensible instruction set capable of handling all convolutional layer operations; because it builds on the RISC-V base instruction set, the generality of the processor is greatly improved. Combining the dedicated matrix extension instructions with the RISC-V base instruction set markedly improves computational performance and reduces resource usage for convolutional neural network workloads.
The system can execute the convolutional layers of a neural network with high performance on embedded devices whose computing and storage resources are very limited, allowing powerful neural networks to run smoothly on such devices. The architecture also suits almost all neural networks whose main computation is convolution, giving it broad applicability. It has been verified on the ZYNQ-7000 series embedded FPGA heterogeneous platform, where the AlexNet and LeNet-5 convolutional neural networks were tested and achieved high accuracy and prediction performance.
Drawings
FIG. 1 is a block diagram of a RISC-V accelerator system supporting convolutional neural network extended instructions in accordance with the present invention;
FIG. 2 is a schematic diagram of the internal structure of a matrix processing unit according to the present invention;
FIG. 3 is a diagram of the internal structure of each processing unit PE in FIG. 2.
Detailed Description
The RISC-V accelerator system supporting convolutional neural network extension instructions of the present invention is described in detail below with reference to embodiments and the accompanying drawings. It should be noted that the specific embodiments described here merely illustrate the invention and are not intended to limit it. To make the objects, technical solutions, and advantages of the present invention clearer, the following example is implemented on the premise of the technical solution of the invention, and detailed embodiments and concrete operating procedures are given.
As shown in FIG. 1, the RISC-V accelerator system supporting convolutional neural network extension instructions of the present invention includes an external memory 1 that stores all instructions and data, an AXI bus 2 for data transmission, and an instruction fetching module 3, a decoding module 4, an execution module 5, a memory access module 6, and a write-back module 7 connected in series to form a five-stage pipeline, wherein the output of the write-back module 7 is connected to a general register file 8, into which the operation result of the current instruction is written for retrieval by the decoding module 4. Specifically:
the external memory 1: the Flash memory is used as an external memory, belongs to one of memory components, is a nonvolatile memory, and is large in capacity, so that the Flash memory is mainly used as the external memory to store all instructions and data.
The AXI bus 2: AXI (Advanced eXtensible Interface) is a bus protocol. The AXI-4 bus has five independent transmission channels: read address, read data, write address, write data, and write response, each with its own handshake protocol. The channels operate independently without interfering with one another, which makes AXI-4 highly efficient at transmitting data. In this invention it is used mainly for communication between the processor core and the memory.
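As a sketch of the handshake each channel runs, here is a minimal model of the standard AXI-4 valid/ready rule; this reflects the general protocol, not anything specific to this patent. A transfer completes only in a cycle where both VALID and READY are high:

```python
def axi_transfers(valid, ready):
    """Cycles in which a transfer completes on one AXI channel: both
    VALID (driven by the sender) and READY (driven by the receiver)
    must be high in the same cycle."""
    return [t for t, (v, r) in enumerate(zip(valid, ready)) if v and r]

# Sender asserts VALID from cycle 1; receiver becomes READY at cycle 3,
# so the single transfer completes at cycle 3.
print(axi_transfers([0, 1, 1, 1, 0], [0, 0, 0, 1, 1]))   # -> [3]
```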
The instruction fetching module 3 is connected to the external memory 1 through the AXI bus 2, fetches instruction information from the external memory 1, and sends it to the decoding module 4.
The decoding module 4 translates the incoming instruction, according to the decoding rules, into an instruction type, general register file addresses, and memory-access information; reads the data the instruction requires from the general register file 8 at those addresses; and finally sends the instruction type, the memory-access information, and the instruction's data to the execution module 5. Instruction types are divided into scalar instructions and matrix instructions.
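For illustration, the field extraction this stage performs might look like the sketch below. The field layout is the standard RISC-V R-type encoding; routing matrix instructions through the custom-0 opcode space is purely an assumption for the example, since the patent does not specify how its extension instructions are encoded:

```python
def decode(insn: int) -> dict:
    """Slice the standard RISC-V R-type fields out of a 32-bit word."""
    return {
        "opcode": insn & 0x7F,            # bits [6:0] select the instruction group
        "rd":     (insn >> 7)  & 0x1F,    # destination register index
        "funct3": (insn >> 12) & 0x7,
        "rs1":    (insn >> 15) & 0x1F,    # source register indices: these are the
        "rs2":    (insn >> 20) & 0x1F,    #   general-register-file read addresses
        "funct7": (insn >> 25) & 0x7F,
    }

CUSTOM_0 = 0b0001011   # RISC-V opcode space reserved for custom extensions

fields = decode(0x0020810B)               # rd=x2, rs1=x1, rs2=x2, opcode=custom-0
is_matrix = fields["opcode"] == CUSTOM_0  # assumption: matrix ops live here
print(fields, is_matrix)
```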
The execution module 5 performs the operation corresponding to the incoming instruction type on the data the instruction requires.
The memory access module 6 checks whether the current instruction is a memory-access instruction; if it is, it connects to the external memory 1 through the AXI bus 2, interacts with the external memory 1 according to the memory-access information received from the execution module 5, and sends both the data obtained from the external memory 1 and the execution module 5's result to the write-back module 7; otherwise it forwards the execution module 5's result to the write-back module 7.
The general register file 8 stores the data required by instructions, updates that data with the operation results received from the write-back module 7, and supplies it to the decoding module 4 when the decoding module 4 fetches it.
The execution module 5 comprises a scalar processing unit 5.2 that handles all scalar instructions and a matrix processing unit 5.1 that handles matrix instructions. After the decoding module 4 translates a received instruction, an instruction of scalar type, together with its data, is passed to the scalar processing unit 5.2. For an instruction of matrix type, the module first checks whether the matrix processing unit 5.1 is ready; if it is, the matrix instruction and its data are sent to the matrix processing unit 5.1. If it is not ready, the instruction type and its data are held, the pipeline stalls until the next cycle, and at the start of each subsequent cycle the readiness of the matrix processing unit 5.1 is checked again, until it is ready and the matrix instruction and its data can be sent to it.
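A minimal sketch of that ready/stall protocol follows; the three-cycle matrix-unit latency is an arbitrary assumption for the example, since the patent does not state one:

```python
class MatrixUnit:
    """Toy occupancy model: busy for `latency` cycles after each issue."""
    def __init__(self, latency=3):       # 3-cycle latency is an assumption
        self.latency = latency
        self.busy = 0
    def ready(self):
        return self.busy == 0
    def issue(self):
        self.busy = self.latency
    def tick(self):                      # called once per clock cycle
        self.busy = max(0, self.busy - 1)

def dispatch(kind, scalar_issued, mxu):
    """Scalar instructions issue immediately; a matrix instruction is held
    and the pipeline stalls until the matrix unit reports ready."""
    stalls = 0
    if kind == "scalar":
        scalar_issued.append(kind)
        return stalls
    while not mxu.ready():               # re-checked at each cycle boundary
        mxu.tick()
        stalls += 1
    mxu.issue()
    return stalls

mxu = MatrixUnit()
print(dispatch("matrix", [], mxu))   # 0 stalls: unit was idle
print(dispatch("matrix", [], mxu))   # 3 stalls: waits out the busy window
```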
Because a convolutional neural network often needs to perform convolutions over several channels at the same time, the invention implements three matrix multiplication units to raise the processor's degree of parallelism, allowing matrix operations over three channels to proceed simultaneously and greatly improving parallel processing efficiency. As shown in FIG. 2, the matrix processing unit 5.1 includes a matrix multiplication unit controller 5.1.1 that receives the instruction control information output by the decoding module 4 and is connected, respectively, to an on-chip cache 5.1.2 that receives its output information, a layer number controller 5.1.3, and an array unit controller 5.1.4; the on-chip cache 5.1.2 is connected to an array multiplier 5.1.6 through a multiplexer group 5.1.5, and the outputs of the layer number controller 5.1.3 and the array unit controller 5.1.4 are each connected to the array multiplier 5.1.6. Specifically:
the matrix multiplication unit controller 5.1.1 is a control core of the whole matrix processing unit 5.1, and is configured to analyze received instruction control information to obtain data control information and array multiplier control information, where the data control information is sent to a cache controller in an on-chip cache 5.1.2, the array multiplier control information includes layer number control information and array unit control information, the layer number control information is sent to the layer number controller 5.1.3, and the array unit control information is sent to the array unit controller 5.1.4.
The on-chip cache 5.1.2 is the data storage area of the matrix processing unit 5.1 and comprises a cache controller a and two data caches b and c. The cache controller a receives the data control information from the matrix multiplication unit controller 5.1.1 and, according to it, directs the two data caches b and c to exchange data with the array multiplier 5.1.6 through the multiplexer group 5.1.5; one data cache stores the feature map matrix participating in the operation, and the other stores the convolution kernel matrix participating in the operation.
The layer number controller 5.1.3 and the array unit controller 5.1.4 control the computation mode of the array multiplier 5.1.6 according to the data control information and the array multiplier control information received from the matrix multiplication unit controller 5.1.1. Because the invention implements three matrix multiplication units, each supporting up to 49 processing elements (PEs), the layer number controller selects the number of channels the current instruction requires, choosing among single-channel, dual-channel, and triple-channel modes, while the array unit controller selects the number of processing units the current instruction requires and can handle any matrix size up to 7 × 7.
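As an illustrative sketch of the selection the two controllers perform (function and field names here are invented, not the patent's):

```python
def configure(channels: int, rows: int, cols: int) -> dict:
    """Map an instruction's shape onto the array: the layer number
    controller picks 1-3 channels, and the array unit controller picks
    how many of the 7x7 = 49 PEs per channel take part."""
    assert 1 <= channels <= 3, "hardware provides three channels"
    assert 1 <= rows <= 7 and 1 <= cols <= 7, "array handles up to 7x7"
    return {"channel_mode": channels, "active_pes_per_channel": rows * cols}

print(configure(3, 3, 3))   # triple-channel 3x3 kernel -> 9 PEs per channel
```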
The array multiplier 5.1.6 carries most of the matrix computation, so choosing an efficient matrix multiplication implementation is essential. The array multiplier 5.1.6 comprises three identically structured channels d, e, f. Each channel has a feature map matrix input port h and a convolution kernel matrix input port g that receive the feature map matrix and the convolution kernel matrix participating in the operation, a 7 × 7 array of 49 processing units PE that performs the matrix operation in systolic array fashion, and a feature map matrix output port m that delivers the operation result; through hardware multiplexing it can support matrices of various shapes such as 1 × 1, 3 × 3, and 5 × 5. The layer number controller 5.1.3 selects 1-3 channel matrices to operate simultaneously according to the layer number control information from the matrix multiplication unit controller 5.1.1, and the array unit controller 5.1.4 selects 1-49 processing units PE within each channel according to the array unit control information, so that smaller matrix multiplications are also supported. The on-chip cache 5.1.2 feeds the feature map matrix into the operation through the feature map matrix input port h and the convolution kernel matrix through the convolution kernel matrix input port g; the result of each channel d, e, f returns to the on-chip cache 5.1.2 through the feature map matrix output port m via the multiplexer group 5.1.5, ready for the next operation.
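To make the systolic schedule concrete, the sketch below simulates an output-stationary systolic multiply, in which C[i][j] accumulates inside PE(i,j) while operands skew across the array one cycle per hop. This is a standard formulation consistent with the PE of FIG. 3, not a cycle-accurate model of the patented array:

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic product: rows of A stream in from the
    left, columns of B from the top, each skewed one cycle per row/column,
    and PE(i,j) accumulates C[i,j] locally (its J3 register in FIG. 3)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    # With the skew, A[i,s] and B[s,j] meet in PE(i,j) at cycle t = i+j+s.
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
assert np.array_equal(systolic_matmul(A, B), A @ B)   # matches dense product
```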
As shown in FIG. 3, the 49 processing units PE share the same structure, each comprising a multiplier C, an accumulator L, a first register J1, a second register J2, and a third register J3. The input of the first register J1 is connected to a first input A, through which it receives and latches, row by row, the feature map matrix sent from the on-chip cache 5.1.2 to participate in the operation. The input of the second register J2 is connected to a second input B, through which it receives and latches, column by column, the convolution kernel matrix sent from the on-chip cache 5.1.2. The output of the first register J1 is connected to the multiplier C and to a first output D, to which the feature map value is sent; the output of the second register J2 is connected to the multiplier C and to a second output F, to which the convolution kernel value is sent. The multiplier C multiplies the received feature map and convolution kernel values; its output is connected to one input of the accumulator L, to which the product is sent. The other input of the accumulator L is connected to the third register J3 and receives the data held there; the accumulator L adds the data on its two inputs and sends the sum to the third register J3, which stores the accumulation result.
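A behavioral sketch of one such PE follows (register and port names follow FIG. 3; the single-cycle latch-then-accumulate timing is a simplification):

```python
class PE:
    """One processing unit from FIG. 3: J1/J2 latch the incoming feature-map
    and kernel values, multiplier C and accumulator L form a MAC whose
    running sum lives in J3, and the latched operands are forwarded onward."""
    def __init__(self):
        self.j1 = 0   # feature-map value, latched from input A
        self.j2 = 0   # kernel value, latched from input B
        self.j3 = 0   # accumulated partial sum

    def tick(self, a_in, b_in):
        """One clock edge: latch inputs, accumulate J1*J2 into J3, and
        return the pair presented on outputs D and F for the right and
        lower neighbours."""
        self.j1, self.j2 = a_in, b_in
        self.j3 += self.j1 * self.j2      # L adds C = J1*J2 into J3
        return self.j1, self.j2           # D, F

# One PE computing a dot product as operand pairs stream past it:
pe = PE()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    pe.tick(a, b)
print(pe.j3)   # 1*4 + 2*5 + 3*6 = 32
```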

Claims (9)

1. A RISC-V accelerator system supporting convolutional neural network extension instructions, comprising an external memory (1) for storing all instructions and data and an AXI bus (2) for data transmission, characterized by further comprising an instruction fetching module (3), a decoding module (4), an execution module (5), a memory access module (6), and a write-back module (7) connected sequentially in series to form a five-stage pipeline, wherein the output of the write-back module (7) is connected to a general register file (8), into which the operation result of the current instruction is written for retrieval by the decoding module (4); wherein:
the instruction fetching module (3) is connected to the external memory (1) through the AXI bus (2), fetches instruction information from the external memory (1), and sends it to the decoding module (4);
the decoding module (4) translates the incoming instruction into an instruction type, general register file addresses, and memory-access information, reads the data the instruction requires from the general register file (8), and finally sends the instruction type, the memory-access information, and the instruction's data to the execution module (5);
the execution module (5) performs the operation corresponding to the incoming instruction type on the instruction's data, and passes the operation result, together with the memory-access information, to the memory access module;
the memory access module (6) checks whether the current instruction is a memory-access instruction; if it is, it connects to the external memory (1) through the AXI bus (2), interacts with the external memory (1) according to the memory-access information received from the execution module (5), and sends both the data obtained from the external memory (1) and the execution module (5)'s operation result to the write-back module (7); otherwise it forwards the execution module (5)'s result to the write-back module (7);
the write-back module (7) writes the received operation result into the general register file (8);
the general register file (8) stores the data required by instructions, updates that data with the operation results received from the write-back module (7), and supplies it to the decoding module (4) when the decoding module (4) fetches it.
2. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 1, wherein the instruction types are divided into scalar instructions and matrix instructions.
3. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 1, wherein the execution module (5) comprises a scalar processing unit (5.2) for processing all scalar instructions and a matrix processing unit (5.1) for processing matrix instructions; after the decoding module (4) translates a received instruction, an instruction of scalar type, together with its data, is passed to the scalar processing unit (5.2); for an instruction of matrix type, it is first detected whether the matrix processing unit (5.1) is ready, and if so, the matrix instruction and its data are sent to the matrix processing unit (5.1); if the matrix processing unit is not ready, the instruction type and its data are held and the pipeline stalls until the next cycle, at the start of which the readiness of the matrix processing unit (5.1) is detected again, repeating until it is ready and the matrix instruction and its data are sent to the matrix processing unit (5.1).
4. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 1, wherein the matrix processing unit (5.1) comprises: a matrix multiplication unit controller (5.1.1) for receiving the instruction control information output by the decoding module (4), connected respectively to an on-chip cache (5.1.2) that receives the information output by the matrix multiplication unit controller (5.1.1), a layer number controller (5.1.3), and an array unit controller (5.1.4); wherein the on-chip cache (5.1.2) is connected to an array multiplier (5.1.6) through a multiplexer group (5.1.5), and the outputs of the layer number controller (5.1.3) and the array unit controller (5.1.4) are each connected to the array multiplier (5.1.6).
5. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 4, wherein the matrix multiplication unit controller (5.1.1) is the control core of the whole matrix processing unit (5.1) and analyzes the received instruction control information to produce data control information and array multiplier control information, wherein the data control information is sent to the on-chip cache (5.1.2), and the array multiplier control information comprises layer number control information, which is sent to the layer number controller (5.1.3), and array unit control information, which is sent to the array unit controller (5.1.4).
6. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 4, wherein the on-chip cache (5.1.2) is the data storage area of the matrix processing unit (5.1) and comprises a cache controller (a) and two data caches (b, c); the cache controller (a) receives the data control information from the matrix multiplication unit controller (5.1.1) and, according to it, directs the two data caches (b, c) to exchange data with the array multiplier (5.1.6) through the multiplexer group (5.1.5); one data cache stores the feature map matrix participating in the operation, and the other stores the convolution kernel matrix participating in the operation.
7. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 4, wherein the layer number controller (5.1.3) and the array unit controller (5.1.4) control the computation mode of the array multiplier (5.1.6) according to the data control information and array multiplier control information received from the matrix multiplication unit controller (5.1.1).
8. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 4, wherein the array multiplier (5.1.6) comprises three identically structured channels (d, e, f), each of which has a feature map matrix input port (h) and a convolution kernel matrix input port (g) for receiving the feature map matrix and convolution kernel matrix participating in the operation, a 7 × 7 array of 49 processing units (PE) that performs the matrix operation in systolic array fashion, and a feature map matrix output port (m) that delivers the operation result; the layer number controller (5.1.3) selects 1-3 channel matrices to operate simultaneously according to the layer number control information from the matrix multiplication unit controller (5.1.1), and the array unit controller (5.1.4) selects 1-49 processing units (PE) within each channel according to the array unit control information from the matrix multiplication unit controller (5.1.1); the on-chip cache (5.1.2) feeds the feature map matrix into the operation through the feature map matrix input port (h) and the convolution kernel matrix through the convolution kernel matrix input port (g); the operation result of each channel (d, e, f) is transmitted to the on-chip cache (5.1.2) through the feature map matrix output port (m) via the multiplexer group (5.1.5), ready for the next operation.
9. The RISC-V accelerator system supporting convolutional neural network extension instructions as claimed in claim 8, wherein the 49 processing units (PE) are identical in structure, each comprising a multiplier (C), an accumulator (L), a first register (J1), a second register (J2), and a third register (J3); the input of the first register (J1) is connected to a first input (A), through which it receives and latches, row by row, the feature map matrix sent from the on-chip cache (5.1.2) to participate in the operation; the input of the second register (J2) is connected to a second input (B), through which it receives and latches, column by column, the convolution kernel matrix sent from the on-chip cache (5.1.2); the output of the first register (J1) is connected to the multiplier (C) and to a first output (D), to which the feature map value is sent; the output of the second register (J2) is connected to the multiplier (C) and to a second output (F), to which the convolution kernel value is sent; the multiplier (C) multiplies the received feature map and convolution kernel values, its output is connected to one input of the accumulator (L), and the product is sent to that input; the other input of the accumulator (L) is connected to the third register (J3) and receives the data held there; the accumulator (L) adds the data on its two inputs and sends the sum to the third register (J3); the third register (J3) stores the accumulation result from the accumulator (L).
CN202310081218.3A 2023-02-08 2023-02-08 RISC-V accelerator system supporting convolution neural network extended instruction Pending CN115983348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310081218.3A CN115983348A (en) 2023-02-08 2023-02-08 RISC-V accelerator system supporting convolution neural network extended instruction

Publications (1)

Publication Number Publication Date
CN115983348A (en) 2023-04-18

Family

ID=85976064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310081218.3A Pending CN115983348A (en) 2023-02-08 2023-02-08 RISC-V accelerator system supporting convolution neural network extended instruction

Country Status (1)

Country Link
CN (1) CN115983348A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629320A (en) * 2023-07-21 2023-08-22 美智纵横科技有限责任公司 Neural network optimization method, device, storage medium and chip
CN116629320B (en) * 2023-07-21 2023-11-28 美智纵横科技有限责任公司 Neural network optimization method, device, storage medium and chip
CN117313803A (en) * 2023-11-28 2023-12-29 进迭时空(杭州)科技有限公司 Sliding window 2D convolution computing method based on RISC-V vector processor architecture
CN117313803B (en) * 2023-11-28 2024-02-02 进迭时空(杭州)科技有限公司 Sliding window 2D convolution computing method based on RISC-V vector processor architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination