CN118070855B - Convolutional neural network accelerator based on RISC-V architecture - Google Patents

Convolutional neural network accelerator based on RISC-V architecture

Info

Publication number
CN118070855B
Authority
CN
China
Prior art keywords
module
data
instruction
calculation
convolution
Prior art date
Legal status
Active
Application number
CN202410467665.7A
Other languages
Chinese (zh)
Other versions
CN118070855A (en)
Inventor
张伟
陈雪聪
陈云芳
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202410467665.7A priority Critical patent/CN118070855B/en
Publication of CN118070855A publication Critical patent/CN118070855A/en
Application granted granted Critical
Publication of CN118070855B publication Critical patent/CN118070855B/en
Abstract

The invention provides a convolutional neural network accelerator based on the RISC-V architecture, belonging to the field of hardware acceleration of convolution computation. The accelerator comprises an instruction analysis module, a control unit, a data caching module, a convolution operation module and a post-processing module. The convolution operation module comprises a first input processing module, a systolic array, an activation module, a pooling module and a fully-connected module; the post-processing module comprises a second input processing module, an NMS calculation module and an output processing module. On the one hand, the invention hides convolution waiting time by caching preloaded data and processes convolution calculations in the systolic array pipeline, improving computational efficiency; on the other hand, the dedicated post-processing module broadens the accelerator's application scenarios, and the hardware implementation of the NMS algorithm is more efficient than a software implementation.

Description

Convolutional neural network accelerator based on RISC-V architecture
Technical Field
The invention belongs to the field of hardware design of convolutional accelerators, and particularly relates to a convolutional neural network accelerator based on a RISC-V architecture.
Background
With the wide application of deep learning in fields such as computer vision and natural language processing, the computational demands of deep learning models such as convolutional neural networks (CNNs) continue to grow. To meet this demand, researchers are continually exploring new hardware architectures and accelerator designs to improve the training and inference efficiency of deep learning models.
Conventional general-purpose computing devices, such as central processing units (CPUs) and graphics processing units (GPUs), exhibit performance bottlenecks in deep learning tasks. Convolution is a computation-intensive operation in deep learning; CPUs are better suited to control flow, branching and cache management, and are ill-suited to such highly parallel numerical computation. The GPU is better suited to convolution, but its high power consumption limits the deployment of convolutional neural network models on resource-constrained platforms such as edge devices, internet-of-things devices and embedded devices.
In addition, application-specific integrated circuits (ASICs) are widely used to accelerate convolution; their high degree of customization enables high-performance, low-power convolution for specific domains. However, ASICs have low versatility, require large development investment and long development cycles, and in the face of the continuous evolution of convolutional neural networks, earlier circuit designs quickly become unsuitable. The field-programmable gate array (FPGA) overcomes these drawbacks of the ASIC and can implement algorithms by configuring hardware, but this in turn increases development difficulty and complexity.
Disclosure of Invention
Purpose of the invention: to overcome the defects of existing convolutional neural network acceleration hardware, the invention provides a convolutional neural network accelerator based on a RISC-V architecture.
The technical scheme is as follows: to achieve the above object, the present invention provides a convolutional neural network accelerator based on a RISC-V architecture, which comprises an instruction analysis module, a control unit, a data caching module, a convolution operation module and a post-processing module;
The instruction analysis module receives and parses custom instructions sent by the RISC-V processor, converting each instruction into control signals that the convolution accelerator can act on;
The control unit issues control signals according to the instruction parsing result and coordinates the execution of the data caching module, the convolution operation module and the post-processing module;
the data caching module caches the data participating in the convolution operation process, which comprises data from an external memory and intermediate results of the convolution operation; the data from the external memory comprises preprocessed image data, weight data, lookup table data and NMS threshold data, and caching these reduces data-read time during operation; the intermediate results are the feature map data produced by convolution, activation and pooling calculations;
the convolution operation module completes the convolution, activation, pooling and fully-connected calculations on the feature map according to the parsed instructions;
And the post-processing module processes the output of the fully-connected module and screens candidate boxes by applying the NMS (non-maximum suppression) algorithm.
Further, the custom instructions are custom data instructions, custom convolution operation instructions and custom post-processing instructions extended under the custom-0 instruction group; the custom data instructions comprise CLOAD, a data-read instruction, and CSTORE, a data-store instruction; the custom convolution operation instructions comprise CCONV, which calls the systolic array, CRELU, which calls the activation module, CPOOL, which calls the pooling module, and CFULLC, which calls the fully-connected module; the custom post-processing instructions comprise CSORT, which calls the sorting module, CIOU, which calls the intersection-over-union (IoU) calculation module, and CCMP, a comparison operation instruction.
Further, the data caching module comprises a feature map caching module, a weight caching module, a lookup table caching module and a threshold caching module;
the feature map caching module caches two types of data: the first is the preprocessed image data stored in the external memory, and the second is the feature map data output by a given layer during the convolution operation;
The weight caching module caches the convolution kernel weight data;
The lookup table caching module caches the lookup table data; the mapping values in the lookup table simplify the calculation of the activation function;
and the threshold caching module caches the threshold data used in the NMS calculation.
Further, the convolution operation module comprises a first input processing module, a systolic array, an activation module, a pooling module and a fully-connected module;
the first input processing module rearranges the input feature map data to suit the pipelined processing of the systolic array;
The systolic array connects PE units of identical structure; the input data required by the convolution calculation passes through the PE units one by one, realizing the convolution in pipelined form;
The activation module applies nonlinear processing to the convolution results, enabling higher-level features to be extracted from the input data;
The pooling module downsamples the convolution results while retaining key features;
and the fully-connected module maps all information in the feature map according to the weights and outputs category information and candidate box information.
Further, the post-processing module comprises a second input processing module, an NMS module and an output processing module;
The second input processing module parses the calculation results of the fully-connected module, extracts and regroups the candidate box information belonging to the same class, and calculates the confidence of each candidate box;
The NMS module implements the non-maximum suppression algorithm in hardware and screens the candidate boxes;
and the output processing module reprocesses the results of the NMS module, combining the candidate box data and the classification data according to the data format of the convolution operation module to ensure format consistency.
Further, the NMS module comprises a sorting module, an IoU calculation module, an IoU comparison module and a threshold comparison module;
The sorting module arranges the candidate boxes in descending order of confidence;
the IoU calculation module calculates the IoU of each candidate box with the remaining candidate boxes; if a remaining candidate box has a greater confidence than the current box, the IoU result for that pair is zeroed;
The IoU comparison module compares, one by one, the IoU values of a single candidate box with the remaining candidate boxes, retains the maximum among them, and performs this operation for every candidate box;
and the threshold comparison module compares each maximum IoU with the threshold, deletes the candidate boxes whose value exceeds the threshold, and retains those that do not.
Compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
the invention caches the required data in advance, reducing waiting time; convolution operations are processed in parallel in the systolic array pipeline, accelerating the convolution calculation; and the optional post-processing hardware module is more efficient than a software-implemented post-processing algorithm, broadening the application range of the convolutional neural network accelerator.
Drawings
FIG. 1 is a schematic diagram of a convolutional neural network accelerator of the present invention;
FIG. 2 is a diagram of the internal architecture of the convolutional neural network accelerator of the present invention;
FIG. 3 is a diagram of a RISC-V custom instruction encoding format;
FIG. 4 is a diagram of a 3x3 feature map and two 2x2 convolution kernels;
FIG. 5 is a schematic diagram of the systolic array structure and input data processing;
FIG. 6 is a schematic diagram of the data flow in the first cycle of the systolic array convolution calculation.
Detailed Description
Example 1
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.
As shown in fig. 1, the invention provides a convolutional neural network accelerator based on a RISC-V architecture, which comprises an instruction analysis module, a control unit, a data caching module, a convolution operation module and a post-processing module; in fig. 1, dashed lines represent control signals and arrows represent transmitted output results.
The instruction analysis module receives and parses custom instructions sent by the RISC-V processor, converting each instruction into control signals that the convolution accelerator can act on;
The control unit issues control signals according to the instruction parsing result and coordinates the execution of the data caching module, the convolution operation module and the post-processing module;
the data caching module caches the data participating in the convolution operation process, which comprises data from an external memory and intermediate results of the convolution operation; the data from the external memory comprises preprocessed image data, weight data, lookup table data and NMS threshold data, and caching these reduces data-read time during operation; the intermediate results are the feature map data produced by convolution, activation and pooling calculations;
the convolution operation module completes the convolution, activation, pooling and fully-connected calculations on the feature map according to the parsed instructions;
And the post-processing module processes the output of the fully-connected module and screens candidate boxes by applying the NMS algorithm.
The data caching module caches the preprocessed image data, weight data, lookup table data and NMS threshold data from the external memory, reducing data-read time during operation, and also caches the feature map data generated during the convolution operation for subsequent calculation.
The instruction analysis module parses the convolution-acceleration custom instructions sent by the RISC-V processor. The RISC-V instruction set was designed from the outset for openness and extensibility: besides its optional base modular instruction sets, RISC-V reserves a large instruction-encoding space for custom extensions and defines four custom instruction groups for direct use, named custom-0/1/2/3. The instruction encoding is shown in fig. 3. The opcode field designates the instruction group; the custom-0 group is selected, whose opcode is 0001011. The funct7 field encodes each user-defined instruction within the group; its 7-bit width allows up to 128 distinct instructions. rs1 and rs2 are source registers and rd is the destination register; xs1 indicates whether the register designated by the rs1 field must be read, xs2 is analogous to xs1, and xd indicates whether the result is written back to the destination register.
Based on the above RISC-V features, the following custom instructions are designed to support convolution operations in deep learning:
1) Data instructions: CLOAD, the data-read instruction; CSTORE, the data-store instruction;
2) Convolution operation instructions: CCONV, which calls the systolic array; CRELU, which calls the activation module; CPOOL, which calls the pooling module; CFULLC, which calls the fully-connected module;
3) Post-processing instructions: CSORT, which calls the sorting module; CIOU, which calls the IoU calculation module; CCMP, the comparison operation instruction.
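The field layout described above can be illustrated with a small encoder. This is a behavioral sketch only: the bit positions follow the standard RoCC-style R-format implied by fig. 3, and the funct7 values assigned to CLOAD, CCONV, etc. are hypothetical placeholders, since the patent does not publish concrete encodings.

```python
# Sketch of the custom-0 encoding. Bit layout assumed from fig. 3;
# the funct7 assignments below are hypothetical, not published values.
OPCODE_CUSTOM0 = 0b0001011

# Hypothetical funct7 codes for the nine accelerator instructions.
CLOAD, CSTORE, CCONV, CRELU, CPOOL, CFULLC, CSORT, CIOU, CCMP = range(9)

def encode_custom0(funct7, rs2, rs1, rd, xd=1, xs1=1, xs2=1):
    """Pack one 32-bit custom-0 instruction word:
    [31:25] funct7 | [24:20] rs2 | [19:15] rs1 |
    [14] xs2 | [13] xs1 | [12] xd | [11:7] rd | [6:0] opcode."""
    return ((funct7 & 0x7F) << 25) | ((rs2 & 0x1F) << 20) | ((rs1 & 0x1F) << 15) \
         | ((xs2 & 1) << 14) | ((xs1 & 1) << 13) | ((xd & 1) << 12) \
         | ((rd & 0x1F) << 7) | OPCODE_CUSTOM0

word = encode_custom0(funct7=CCONV, rs2=2, rs1=1, rd=3)
assert word & 0x7F == OPCODE_CUSTOM0   # opcode selects the custom-0 group
assert (word >> 25) & 0x7F == CCONV    # funct7 selects the instruction
```

With 7 funct7 bits, all nine instructions fit comfortably within the 128 encodings available under custom-0.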
The data caching module comprises a feature map caching module, a weight caching module, a lookup table caching module and a threshold caching module;
the feature map caching module caches two types of data: the first is the preprocessed image data stored in the external memory, and the second is the feature map data output by a given layer during the convolution operation;
The weight caching module caches the convolution kernel weight data;
The lookup table caching module caches the lookup table data; the mapping values in the lookup table simplify the calculation of the activation function;
and the threshold caching module caches the threshold data used in the NMS calculation.
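How a cached lookup table can replace a direct activation computation may be sketched as follows. The table size, input range and fixed-step indexing are assumptions for illustration; the patent does not specify the LUT format, and ReLU is used only as an example activation.

```python
def build_activation_lut(f, lo=-8.0, hi=8.0, entries=256):
    """Precompute f over [lo, hi) so hardware can replace the activation
    computation with a single table read (range/size are assumptions)."""
    step = (hi - lo) / entries
    return [f(lo + i * step) for i in range(entries)], lo, step

def lut_activate(x, lut, lo, step):
    """Clamp x into the table's range and return the cached mapping value."""
    idx = min(max(int((x - lo) / step), 0), len(lut) - 1)
    return lut[idx]

# ReLU chosen as an illustrative activation (matching the CRELU instruction).
lut, lo, step = build_activation_lut(lambda v: max(v, 0.0))
assert lut_activate(-3.0, lut, lo, step) == 0.0
```

Inputs outside the tabulated range clamp to the first or last entry, so the module never indexes outside the cached table.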
The convolution operation module comprises a first input processing module, a systolic array, an activation module, a pooling module and a fully-connected module;
the first input processing module rearranges the input feature map data to suit the pipelined processing of the systolic array;
The systolic array connects PE units of identical structure; the input data required by the convolution calculation passes through the PE units one by one, realizing the convolution in pipelined form;
The activation module applies nonlinear processing to the convolution results, enabling higher-level features to be extracted from the input data;
The pooling module downsamples the convolution results while retaining key features;
and the fully-connected module maps all information in the feature map according to the weights and outputs category information and candidate box information.
For the input processing module: suppose the feature map and convolution kernels are as shown in fig. 4, where the feature map F is 3x3 and the convolution kernels W and K are 2x2. When the feature map is convolved with a stride of 1, each kernel slides over the feature map four times, and the feature map data in the receptive fields of the four sliding positions are respectively F00-F01-F03-F04, F01-F02-F04-F05, F03-F04-F06-F07 and F04-F05-F07-F08. Each convolution is independent of the others, so the data in all receptive fields can be fetched and operated on simultaneously. In addition, the input data is processed as follows: the current receptive-field data is flattened in order, e.g. the receptive field F00-F01-F03-F04 is input in the order F00, F01, F03, F04; in any cycle with no data to input, a 0 is supplied instead.
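The receptive-field flattening performed by the first input processing module is essentially an im2col-style rearrangement, which can be sketched as follows (representing F00..F08 of the fig. 4 example by the values 0..8 purely for illustration):

```python
def flatten_receptive_fields(fmap, k, stride=1):
    """Rearrange each kxk sliding window of `fmap` into one flat row,
    mirroring the first input processing module (an im2col-style step)."""
    h, w = len(fmap), len(fmap[0])
    rows = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            rows.append([fmap[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows

F = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]   # F00..F08 laid out row by row
assert flatten_receptive_fields(F, 2) == [
    [0, 1, 3, 4], [1, 2, 4, 5], [3, 4, 6, 7], [4, 5, 7, 8]]
```

The four rows reproduce exactly the four receptive fields listed above, ready to be streamed into the systolic array one element per cycle.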
As shown in fig. 5, L1, L2, L3 and L4 are the receptive-field data corresponding to the four sliding positions, and the systolic array is composed of PE units of identical structure. The systolic array performs the following operations:
1) Each convolution kernel's weight values are preloaded vertically into the PE units;
2) The convolution values are computed cycle by cycle. In the first cycle, as shown in fig. 6, F00 of row L1 is passed rightward into the PE unit holding W00 and multiplied by W00, giving F00xW00, recorded as R00. In the second cycle, F00 of row L1 continues rightward into the PE unit holding K00 and is multiplied by K00, giving F00xK00; F01 of row L1 is passed rightward into the PE unit holding W00 and multiplied by W00, giving W00xF01, recorded as R01; F01 of row L2 is passed rightward into the PE unit holding W01 while R00 is passed downward into the same PE unit, where F01 is multiplied by W01 to give W01xF01, this value is added to R00, and R00 is updated to W00xF00+W01xF01. Continuing in this manner, the convolution output values are obtained after 7 cycles.
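The cycle-by-cycle behavior described above can be modeled with a small schedule simulation. This is a sketch rather than the patent's actual RTL: it assumes element r of window w reaches PE(r, c) at cycle t = w + r + c (the skew implied by the walkthrough), with each column holding one kernel's weights and partial sums accumulating down the column.

```python
def systolic_conv(windows, kernels):
    """Schedule-level sketch of a weight-stationary systolic array.
    windows : flattened receptive fields from the input processing module.
    kernels : flattened convolution kernels (one PE column per kernel).
    Inputs enter row r skewed by r cycles and shift right one column per
    cycle, while partial sums accumulate down each column."""
    R = len(kernels[0])                 # PE rows = weights per kernel
    C = len(kernels)                    # PE columns = number of kernels
    W = len(windows)
    out = [[0] * W for _ in range(C)]   # out[c][w] = window w . kernel c
    for t in range(W + R + C - 2):      # cycles until the pipeline drains
        for c in range(C):
            for r in range(R):
                w = t - r - c           # window index reaching PE(r, c) now
                if 0 <= w < W:
                    out[c][w] += windows[w][r] * kernels[c][r]
    return out
```

With the four windows of the fig. 4 example (F00..F08 taken as 0..8) and two illustrative 2x2 kernels, each column yields one convolution output per cycle once the pipeline fills.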
The post-processing module comprises a second input processing module, an NMS module and an output processing module;
The second input processing module parses the calculation results of the fully-connected module, extracts and regroups the candidate box information belonging to the same class, and calculates the confidence of each candidate box;
The NMS module implements the non-maximum suppression algorithm in hardware and screens the candidate boxes;
And the output processing module reprocesses the results of the NMS module, combining the candidate box data and the classification data according to the data format of the convolution operation module to ensure format consistency.
The NMS module comprises a sorting module, an IoU calculation module, an IoU comparison module and a threshold comparison module;
The sorting module arranges the candidate boxes in descending order of confidence;
the IoU calculation module calculates the IoU of each candidate box with the remaining candidate boxes; if a remaining candidate box has a greater confidence than the current box, the IoU result for that pair is zeroed;
The IoU comparison module compares, one by one, the IoU values of a single candidate box with the remaining candidate boxes, retains the maximum among them, and performs this operation for every candidate box;
and the threshold comparison module compares each maximum IoU with the threshold, deletes the candidate boxes whose value exceeds the threshold, and retains those that do not.
The conventional NMS algorithm sorts all candidate boxes in descending order of confidence, selects the box with the highest confidence, calculates the IoU of each remaining box with the selected box, eliminates any box whose IoU with the selected box exceeds the set threshold, and repeats these steps until all boxes have been processed; the boxes finally retained are the NMS result. On this basis, the iterative IoU computation can be optimized: the IoU of every box with every other box is obtained in a single matrix calculation, and the increased parallelism greatly reduces the time required. The IoU calculation module can use the systolic array to compute, simultaneously for multiple candidate boxes, the IoU of each box with the remaining boxes. The PE units can zero the IoU of a candidate box with itself and the IoU results of low-confidence candidate boxes against high-confidence ones, preventing these from affecting the accuracy of the result. The IoU comparison module is connected vertically to the systolic array; it receives the IoU outputs along each column and retains the single maximum of each output column. The threshold comparison module stores the retained candidate box data in the NMS module's buffer; when the entire NMS calculation is complete, all retained candidate boxes are obtained and output together to the output processing module.
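A behavioral reference for this screening can be sketched in a few lines. The masking direction below follows the conventional NMS criterion (each box is tested only against higher-confidence boxes); the (x1, y1, x2, y2) box format and the concrete values are illustrative assumptions, and the hardware's parallel matrix evaluation is collapsed into plain loops.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_parallel(boxes, scores, thresh):
    """Keep box i only if its largest IoU against any higher-confidence
    box stays at or below `thresh`; IoUs against self and lower-confidence
    boxes are masked out (the zeroing step described above)."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for pos, i in enumerate(order):
        max_iou = max((iou(boxes[i], boxes[j]) for j in order[:pos]),
                      default=0.0)
        if max_iou <= thresh:
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
assert nms_parallel(boxes, [0.9, 0.8, 0.7], 0.5) == [0, 2]
```

The second box overlaps the first (IoU 0.81) and has lower confidence, so it is suppressed; the disjoint third box survives.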
The specific workflow and principle of the convolution accelerator are described below with reference to fig. 2. The specific steps are as follows:
Step one: when the RISC-V processor transmits a custom instruction to the convolutional neural network accelerator, the instruction analysis module parses the instruction and generates the corresponding control signals;
Step two: the control unit coordinates the corresponding modules according to the control signals;
Step three: the feature map, weight, threshold or lookup table data are loaded into the data caching module according to the data address;
Step four: the convolution operation module reads the feature map data from the feature map caching module into the first input processing module according to the CSTORE instruction, and reads the weight data from the weight caching module into the systolic array, preloading it into the PE units. The feature map data, after processing by the first input processing module, is input to the systolic array, and the convolution operation is performed according to the CCONV instruction. The activation module reads the lookup table data from the lookup table caching module and performs the activation operation on the systolic array's output according to the CRELU instruction. The pooling module takes the activation module's output, performs max pooling according to the CPOOL instruction and outputs the result to the fully-connected module. The fully-connected module obtains a preliminary output result after the fully-connected operation according to the CFULLC instruction; the result comprises classes, class probabilities and candidate box data. The convolution, activation and pooling steps may be run multiple times.
Step five: the post-processing module passes the result of the convolution operation module to the second input processing module, which extracts the candidate box information of data representing the same class and calculates the confidence of each candidate box; the second input processing module outputs the result to the sorting module, which sorts the data in descending order of confidence according to the CSORT instruction.
The IoU calculation module receives the sorting module's results and calculates the IoU of each candidate box with the remaining boxes; if a remaining box's confidence is greater than that of the current box, the IoU result for that pair is zeroed. The IoU comparison module receives the IoU calculation module's results, compares the IoU values of each candidate box with the remaining boxes one by one, and retains the maximum among them, performing this operation for every candidate box.
The threshold comparison module receives the IoU comparison module's results, loads the threshold data from the threshold caching module, compares the IoU data one by one according to the CCMP instruction, and retains in the buffer the candidate box data that does not exceed the threshold. After the candidate boxes of all classes have been compared, the buffer transmits the saved results to the output processing module, which regroups the data according to the output format of the fully-connected module; the data is then returned to the RISC-V processor for processing or written back to memory for saving.
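The instruction sequence of the steps above can be summarized as a hypothetical host-side driver. This is purely illustrative: issue() stands in for executing the corresponding custom-0 instruction on the RISC-V core, and the exact placement of CLOAD/CSTORE around the compute instructions is an assumption based on the workflow, not a published program.

```python
def run_inference(issue, layers):
    """Hypothetical host-side sequence mirroring the workflow steps;
    `issue` executes one custom-0 instruction by mnemonic."""
    issue("CLOAD")            # step three: fill the data caches
    for _ in range(layers):   # step four: conv/activate/pool may repeat
        issue("CCONV")
        issue("CRELU")
        issue("CPOOL")
    issue("CFULLC")           # classes, probabilities, candidate boxes
    issue("CSORT")            # step five: sort by confidence
    issue("CIOU")             # pairwise IoU with zeroing
    issue("CCMP")             # threshold comparison, keep survivors
    issue("CSTORE")           # write results back to memory

trace = []
run_inference(trace.append, layers=2)
assert trace[0] == "CLOAD" and trace[-1] == "CSTORE"
```

Recording the issued mnemonics in a trace makes the per-layer repetition of the convolution/activation/pooling instructions explicit.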
The above embodiments are provided to illustrate the technical concept and features of the present invention and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made in accordance with the spirit of the present invention should be construed to be included in the scope of the present invention.

Claims (5)

1. A convolutional neural network accelerator based on a RISC-V architecture, characterized by comprising an instruction analysis module, a control unit, a data caching module, a convolution operation module and a post-processing module;
The instruction analysis module receives and parses custom instructions sent by the RISC-V processor, converting each instruction into control signals that the convolution accelerator can act on;
The control unit issues control signals according to the instruction parsing result and coordinates the execution of the data caching module, the convolution operation module and the post-processing module;
the data caching module caches the data participating in the convolution operation process, which comprises data from an external memory and intermediate results of the convolution operation; the data from the external memory comprises preprocessed image data, weight data, lookup table data and NMS threshold data, and caching these reduces data-read time during operation; the intermediate results are the feature map data produced by convolution, activation and pooling calculations;
the convolution operation module completes the convolution, activation, pooling and fully-connected calculations on the feature map according to the parsed instructions;
The post-processing module processes the output of the fully-connected module and screens candidate boxes by applying the NMS algorithm;
the convolution operation module comprises a first input processing module, a systolic array, an activation module, a pooling module and a fully-connected module;
the first input processing module rearranges the input feature map data to suit the pipelined processing of the systolic array;
The systolic array connects PE units of identical structure; the input data required by the convolution calculation passes through the PE units one by one, realizing the convolution in pipelined form;
The activation module applies nonlinear processing to the convolution results, enabling higher-level features to be extracted from the input data;
The pooling module downsamples the convolution results while retaining key features;
and the fully-connected module maps all information in the feature map according to the weights and outputs category information and candidate box information.
2. The convolutional neural network accelerator based on the RISC-V architecture of claim 1, wherein the custom instructions are custom data instructions, custom convolution operation instructions and custom post-processing instructions extended under the custom-0 opcode space; the custom data instructions comprise a CLOAD data-read instruction and a CSTORE data-store instruction; the custom convolution operation instructions comprise a CCONV instruction invoking the systolic array, a CRELU instruction invoking the activation module, a CPOOL instruction invoking the pooling module and a CFULLC instruction invoking the fully-connected module; the custom post-processing instructions comprise a CSORT instruction invoking the sorting module, a CIOU instruction invoking the intersection-over-union (IoU) calculation module and a CCMP comparison instruction.
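For illustration, the bit layout of an R-type instruction in the RISC-V custom-0 opcode space (major opcode 0b0001011) can be sketched as follows. The funct7/funct3 codes assigned to the example CCONV encoding are assumptions; the patent does not disclose the actual field assignments.

```python
# Hypothetical encoder for R-type custom instructions in the RISC-V
# custom-0 opcode space. Field widths follow the standard R-type format:
# funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0].
CUSTOM0_OPCODE = 0b0001011  # major opcode reserved for custom-0

def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode=CUSTOM0_OPCODE):
    """Pack the standard RISC-V R-type fields into a 32-bit instruction word."""
    return ((funct7 & 0x7F) << 25) | ((rs2 & 0x1F) << 20) \
         | ((rs1 & 0x1F) << 15) | ((funct3 & 0x7) << 12) \
         | ((rd & 0x1F) << 7) | (opcode & 0x7F)

# Example: a hypothetical CCONV with rd=x10, rs1=x11, rs2=x12, funct7=0x02
word = encode_r_type(funct7=0x02, rs2=12, rs1=11, funct3=0b000, rd=10)
```

An instruction parsing module would decode such words by masking out the same fields and dispatching on funct7/funct3 to the corresponding computation module.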
3. The convolutional neural network accelerator based on the RISC-V architecture of claim 1, wherein the data caching module comprises a feature-map cache module, a weight cache module, a lookup-table cache module and a threshold cache module;
the feature-map cache module caches two types of data: the first is the preprocessed image data stored in the external memory, and the second is the feature-map data output by a given layer during the convolution operation;
the weight cache module caches the convolution-kernel weight data;
the lookup-table cache module caches the lookup-table data, whose mapped values simplify the evaluation of the activation function;
and the threshold cache module caches the threshold data used in the NMS calculation.
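A minimal sketch of lookup-table-based activation, the technique served by the lookup-table cache module: a nonlinear function is precomputed over a fixed input range and evaluated by index lookup instead of by exponentiation. The table size, input range, and the choice of sigmoid are illustrative assumptions.

```python
import math

# Precompute a sigmoid table over [-8, 8]; sizes are illustrative.
LUT_SIZE = 256
IN_MIN, IN_MAX = -8.0, 8.0
STEP = (IN_MAX - IN_MIN) / (LUT_SIZE - 1)
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(IN_MIN + i * STEP))) for i in range(LUT_SIZE)]

def lut_sigmoid(x):
    """Evaluate sigmoid by nearest-index table lookup, clamping to the table range."""
    x = max(IN_MIN, min(IN_MAX, x))
    idx = round((x - IN_MIN) / STEP)
    return SIGMOID_LUT[idx]
```

In hardware, the index computation reduces to fixed-point truncation, so the activation costs one memory read per element instead of an exponential evaluation.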
4. The convolutional neural network accelerator based on the RISC-V architecture of claim 1, wherein the post-processing module comprises a second input processing module, an NMS module and an output processing module;
the second input processing module parses the calculation results of the fully-connected module, extracts and recombines the candidate-box information belonging to the same class, and computes the confidence of each candidate box;
the NMS module implements the non-maximum suppression algorithm in hardware and screens the candidate boxes;
and the output processing module reprocesses the results of the NMS module, combining the candidate-box data and the class data according to the data format of the convolution operation module to keep the formats consistent.
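The regrouping step of the second input processing module might behave as sketched below. The record layout and the confidence formula (objectness times class probability, as in YOLO-style detectors) are assumptions for illustration, not details disclosed in the patent.

```python
from collections import defaultdict

# Hypothetical regrouping of fully-connected outputs: gather candidate
# boxes of the same class and attach a per-box confidence score.
def regroup_by_class(records):
    """records: iterable of (class_id, box, objectness, class_prob).
    Returns {class_id: [(x1, y1, x2, y2, confidence), ...]}."""
    groups = defaultdict(list)
    for class_id, box, objectness, class_prob in records:
        confidence = objectness * class_prob  # assumed definition
        groups[class_id].append((*box, confidence))
    return dict(groups)
```

Each per-class group is then passed independently through the NMS module, since boxes of different classes do not suppress one another.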
5. The convolutional neural network accelerator based on the RISC-V architecture of claim 4, wherein the NMS module comprises a sorting module, an intersection-over-union (IoU) calculation module, an IoU comparison module and a threshold comparison module;
the sorting module arranges the candidate boxes in descending order of confidence;
the IoU calculation module computes the IoU between a single candidate box and each of the remaining candidate boxes; if the confidence of a remaining candidate box is greater than that of this box, the corresponding IoU result is set to zero;
the IoU comparison module compares, one by one, the IoU values between a single candidate box and the remaining candidate boxes and retains the maximum of them; this operation is performed for every candidate box;
and the threshold comparison module compares the maximum IoU with the threshold, deletes the candidate boxes whose maximum IoU exceeds the threshold, and retains those that do not.
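Behaviorally, the sort/IoU/threshold pipeline of claims 4-5 corresponds to greedy non-maximum suppression. The sketch below is a sequential software reference, whereas the claimed design evaluates the IoU comparisons in hardware; the box format (x1, y1, x2, y2, confidence) is an assumption.

```python
# Reference greedy NMS: keep a box only if its maximum IoU with every
# already-kept (higher-confidence) box stays at or below the threshold.
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, thresh):
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)  # CSORT step
    kept = []
    for b in boxes:
        # CIOU + CCMP steps: maximum IoU against the kept boxes
        max_iou = max((iou(b, k) for k in kept), default=0.0)
        if max_iou <= thresh:
            kept.append(b)
    return kept
```

The claimed hardware variant instead computes all pairwise IoUs for a box in parallel and zeroes the entries excluded by the confidence condition, which yields the same keep/delete decisions without the sequential dependency.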
CN202410467665.7A 2024-04-18 Convolutional neural network accelerator based on RISC-V architecture Active CN118070855B (en)

Publications (2)

Publication Number Publication Date
CN118070855A CN118070855A (en) 2024-05-24
CN118070855B true CN118070855B (en) 2024-07-09

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020258528A1 (en) * 2019-06-25 2020-12-30 东南大学 Configurable universal convolutional neural network accelerator
WO2020258529A1 (en) * 2019-06-28 2020-12-30 东南大学 Bnrp-based configurable parallel general convolutional neural network accelerator

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant