CN113240101B - Method for implementing a heterogeneous SoC (system on chip) with software and hardware cooperative acceleration of a convolutional neural network
Info
- Publication number
- CN113240101B (application CN202110521611.0A)
- Authority
- CN
- China
- Prior art keywords
- layer
- calculation
- module
- neural network
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method for implementing a heterogeneous SoC with software and hardware cooperative acceleration of a convolutional neural network, comprising the following steps: the on-chip processor acquires the current picture to be detected and preprocesses it; after preprocessing, the preprocessed picture is sent to the programmable logic circuit through the memory, and the next picture to be detected is acquired as the current picture for preprocessing; the programmable logic circuit receives the preprocessed picture, computes it with a preset convolutional neural network hardware accelerator according to a preset convolutional neural network model and, once the computation is finished, sends the result back to the on-chip processor through the memory; the on-chip processor receives the computed result, post-processes it and outputs the detection result of the current picture. The method achieves high-speed, real-time image processing on a low-cost embedded SoC.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method for implementing a heterogeneous SoC with software and hardware cooperative acceleration of a convolutional neural network.
Background
Embedded heterogeneous SoCs (Systems on Chip) are regarded as the new generation of computer processor solution after single-core and multi-core processors. They realize "cooperative computing and mutual acceleration" between computing units with different instruction sets and different system architectures, thereby breaking through the computing bottleneck of a single hardware architecture and effectively addressing problems such as energy consumption and scalability. The SoC device used in the invention is a computing device built around an FPGA (Field Programmable Gate Array) and an ARM (Advanced RISC Machine) processor. The FPGA is a semi-customized circuit chip on which parallel acceleration strategies for an algorithm can be designed at the gate level; it forms the programmable logic (PL) part of the heterogeneous SoC. The ARM core is a low-power processor based on the RISC architecture and, as the control module of the SoC, forms the processing system (PS, on-chip processor) part of the heterogeneous SoC. Both the PL and PS parts are programmable and exchange data through a dedicated high-speed interface. In heterogeneous accelerated computing, the PL part generally constitutes the high-performance computing circuit, while the PS part controls the operation of that circuit and coordinates interaction with external devices.
In recent years, deep learning algorithms represented by convolutional neural networks (CNNs) have excelled at tasks such as image classification, object detection and pattern recognition in the field of computer vision. Algorithms based on convolutional neural networks continually set new accuracy records for image classification, detection and recognition on public data sets such as ImageNet, FDDB and PubFig. However, these deep learning models are parameter- and computation-intensive: they are often difficult to deploy on low-cost embedded platforms, image processing tasks based on convolutional neural networks cannot be realized at all on resource-scarce embedded platforms, or the processing speed is low.
Disclosure of Invention
In view of these technical problems, the invention provides a method for implementing a heterogeneous SoC (system on chip) with software and hardware cooperative acceleration of a convolutional neural network, so as to achieve high-speed, real-time image processing on a low-cost embedded SoC.
The technical scheme adopted by the invention for solving the technical problems is as follows:
in one embodiment, the method comprises the steps of:
step S100: the on-chip processor acquires a current picture to be detected for preprocessing, and after the current picture to be detected is preprocessed, the preprocessed current picture to be detected is sent to the programmable logic circuit through the memory, and a next picture to be detected is acquired as the current picture to be detected for preprocessing;
step S400: the programmable logic circuit receives the preprocessed current picture to be detected, carries out calculation according to a preset convolutional neural network hardware accelerator and a preset convolutional neural network model, and sends the calculated current picture to be detected to the on-chip processor through the memory after the current picture to be detected is calculated; the preset convolutional neural network hardware accelerator is designed based on the compressed convolutional neural network model;
step S500: and the on-chip processor receives the calculated current picture to be detected, performs post-processing on the calculated current picture to be detected and outputs a detection result of the current picture to be detected.
Preferably, step S400 is preceded by:
step S200: the method comprises the steps of obtaining an initial convolutional neural network model, compressing the initial convolutional neural network model to obtain a compressed convolutional neural network model, training the compressed convolutional neural network model to obtain a network parameter file, carrying out fixed point number quantization on the network parameter file to obtain fixed point network parameters, obtaining a trained convolutional neural network model according to the compressed convolutional neural network model and the fixed point network parameters, and taking the trained convolutional neural network model as a preset convolutional neural network model.
Preferably, the initial convolutional neural network model includes a common convolutional layer, a fully-connected layer, and a nonlinear layer, and is characterized in that the step S200 of compressing the initial convolutional neural network model to obtain a compressed convolutional neural network model includes:
step S210: replacing a common convolutional layer in the initial convolutional neural network model with a depth separable convolutional layer, wherein the depth separable convolutional layer comprises a depth convolutional layer and a point-by-point convolution;
step S220: setting a nonlinear layer of the initial convolutional neural network model as a ReLU function;
step S230: and carrying out sparse pruning on the full connection layer parameters of the initial convolutional neural network model according to preset sparse parameters.
Preferably, step S400 is preceded by:
step S300: the method comprises the steps that a preset convolutional neural network hardware accelerator structure is designed into a hardware accelerator comprising a control module, an input cache module, an output cache module, a convolutional layer calculation module, a pooling layer calculation module, a full-connection layer calculation module, a first nonlinear layer and bias calculation module, a second nonlinear layer and bias calculation module and a network parameter storage module; wherein the control module is respectively connected with the input cache module, the output cache module, the convolution layer calculation module, the pooling layer calculation module and the full-connection layer calculation module, the system comprises a first nonlinear layer and bias calculation module, a second nonlinear layer and bias calculation module and a network parameter storage module, wherein an input cache module is connected with a convolutional layer calculation module, a full-connection layer calculation module and an output cache module, the convolutional layer calculation module is connected with the first nonlinear layer and bias calculation module, the first nonlinear layer and bias calculation module is connected with a pooling layer calculation module, the pooling layer calculation module is connected with an output cache module, the full-connection layer calculation module is connected with the second nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module is connected with the output cache module, and the network parameter storage module is connected with the convolutional layer calculation module, the full-connection layer calculation module, the first nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module.
Preferably, step S300 further comprises: the convolutional layer calculation module receives a first control signal of the control module, loads input images of a preset number of calculation channels from the input cache module according to the first control signal, each calculation channel corresponds to an independent multiply-accumulate group, the number of the multiply-accumulate groups is determined by a preset block size of the input image, reads corresponding convolutional layer network weights from the network parameter storage module, controls calculation time sequence according to the size of a convolutional kernel of a preset convolutional neural network model, performs convolutional calculation on the loaded input images in a clock cycle of the size of the convolutional kernel according to the convolutional layer network weights to obtain a first calculation result, and sends the first calculation result to the first nonlinear layer and the bias calculation module.
Preferably, step S300 further comprises: the full-connection layer calculation module receives a second control signal sent by the control module, loads input images of a preset number of calculation channels from the input cache module, reads full-connection layer network weights corresponding to each calculation channel from the network parameter storage module, and performs multiplication and accumulation calculation according to the pixels of the current input image and the network weights of the corresponding output channels until the full-connection layer calculation of all the calculation channels is finished when a preset multiplication and accumulation unit with the same number as the preset output channels detects that the pixels of the current input image are not zero and the network weights of the corresponding output channels are not zero in parallel, so as to obtain a second calculation result and send the second calculation result to the second nonlinear layer and the offset calculation module, wherein the weight size of each output channel is consistent with the size of the corresponding input image.
Preferably, step S300 further comprises: the first nonlinear layer and bias calculation module receives a third control signal sent by the control module, reads a first calculation result from the convolutional layer calculation module, reads a bias value of the convolutional layer from the network parameter storage module, reserves a value of a pixel of an input image in the first calculation result, which is larger than zero, according to a preset ReLU function, sets a value of the pixel of the input image in the first calculation result, which is smaller than zero, as a zero value to obtain an updated first calculation result, performs addition operation according to the bias value of the convolutional layer and the updated first calculation result to obtain a third calculation result, and sends the third calculation result to the pooling layer calculation module; the second nonlinear layer and offset calculation module receives a fourth control signal sent by the control module, reads a second calculation result from the full connection layer calculation module, reads an offset value of the full connection layer from the network parameter storage module, reserves a value of a pixel of an input image, which is greater than zero, in the second calculation result according to a preset ReLU function, sets a value of the pixel of the input image, which is less than zero, in the second calculation result to a zero value, obtains an updated second calculation result, performs addition operation according to the offset value of the full connection layer and the updated second calculation result to obtain a fourth calculation result, and sends the fourth calculation result to the output buffer module.
Preferably, step S300 further comprises: the pooling layer calculating module receives a fifth control signal of the control module, reads a third calculating result from the first nonlinear layer and the bias calculating module, performs maximum pooling calculation on the third calculating result through a preset register with the number corresponding to the preset maximum pooled window size to obtain a fifth calculating result, and sends the fifth calculating result to the output cache module.
Preferably, step S300 further comprises: the network parameter storage module stores the convolutional layer network weight, the full-connection layer network weight, the bias value of the convolutional layer and the bias value of the full-connection layer, and the convolutional layer network weight, the full-connection layer network weight, the bias value of the convolutional layer and the bias value of the full-connection layer are grouped and divided in parallel according to parallel strategies designed by the convolutional layer calculation module, the full-connection layer calculation module, the first nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module.
Preferably, step S300 further comprises: the input cache module receives an input image sent by the memory, and the output cache module stores a calculation result data stream to be sent to the memory.
According to the above method for implementing a heterogeneous SoC with software and hardware cooperative acceleration of a convolutional neural network, at the algorithm-model level the structure and parameters of the convolutional neural network model are compressed, reducing its computation and parameter amounts; at the hardware-circuit level, a convolutional neural network hardware accelerator is designed on the programmable logic circuit of the heterogeneous SoC on the basis of the compressed neural network model, fully exploiting the parallelism and reusability of the hardware to accelerate the convolutional neural network computation; and in the cooperative work of software and hardware, the pipelined operation of the PS end and the PL end in the heterogeneous SoC further solves the cooperative acceleration of the heterogeneous computing parts, so that the requirement of a low-cost embedded SoC system for high-speed, real-time image processing based on a convolutional neural network algorithm can be met.
Drawings
Fig. 1 is a flowchart of the method for implementing a heterogeneous SoC with software and hardware cooperative acceleration of a convolutional neural network according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of heterogeneous SoC software and hardware deployment and data interaction in the present invention;
FIG. 3 is a schematic diagram illustrating software and hardware cooperative acceleration based on a heterogeneous SoC according to the present invention;
fig. 4 is a flowchart of the method for implementing a heterogeneous SoC with software and hardware cooperative acceleration of a convolutional neural network according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of the algorithm compression of a convolutional neural network model according to the present invention;
FIG. 6 is a schematic diagram of a convolutional neural network hardware accelerator in accordance with the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is further described in detail below with reference to the accompanying drawings.
In one embodiment, as shown in fig. 1, a method for implementing a heterogeneous SoC with software and hardware cooperative acceleration of a convolutional neural network includes the following steps:
step S100: the on-chip processor acquires a current picture to be detected for preprocessing, and after the current picture to be detected is preprocessed, the preprocessed current picture to be detected is sent to the programmable logic circuit through the memory, and a next picture to be detected is acquired and used as the current picture to be detected for preprocessing.
Step S400: the programmable logic circuit receives the preprocessed current picture to be detected and computes it with the preset convolutional neural network hardware accelerator according to the preset convolutional neural network model; after the computation of the current picture to be detected is finished, the computed result is sent to the on-chip processor through the memory. The preset convolutional neural network model is a compressed convolutional neural network model, and the preset convolutional neural network hardware accelerator is designed based on the compressed convolutional neural network model.
Step S500: and the on-chip processor receives the calculated current picture to be detected, performs post-processing on the calculated current picture to be detected and outputs a detection result of the current picture to be detected.
Specifically, as shown in fig. 2, an image processing application is deployed on the ARM processor at the PS (Processing System, on-chip processor) end; it coordinates the cooperative work of the preset convolutional neural network hardware accelerator at the PL (Programmable Logic) end and the DDR (Double Data Rate) external memory.
The image processing application at the PS end executes the predetermined algorithm according to the actual processing requirements (image classification, detection, identification, etc.), realizing the lighter-weight computations such as preprocessing and post-processing of the corresponding image. For the convolutional neural network computation, which is the main computational load of the algorithm, the application sends the corresponding control signals to the accelerator according to the preset convolutional neural network model, organizes the partitioned image in the DDR external memory into an input calculation stream, sends that stream to the preset convolutional neural network hardware accelerator at the PL end for the convolutional-neural-network-related calculation, and organizes the DDR external memory to receive the calculation result sent back by the accelerator.
Further, the PS end carries out two main tasks, image preprocessing and image post-processing, while the PL-end convolutional neural network computation serves as the intermediate step between them. If the program were executed serially, each picture would be processed in the order: (1) image preprocessing; (2) convolutional neural network computation; (3) image post-processing. Serial execution is inefficient, so the cooperative acceleration of the PS end and the PL end in the heterogeneous SoC exploits the fact that the PL and PS parts work independently of each other, as shown in fig. 3. Specifically: in the first round, the PS end preprocesses the first picture to be detected. In the second round, the preprocessed first picture is transferred through the DDR to the convolutional neural network hardware accelerator at the PL end for computation, while the PS end preprocesses the second picture. In the third round, the preprocessed second picture is transferred through the DDR to the PL-end accelerator for computation; meanwhile, the computed first picture is fetched from the PL end and transferred through the DDR to the post-processing module at the PS end, which outputs the detection result of the first picture after post-processing and then preprocesses the third picture. In the fourth round, the preprocessed third picture is transferred through the DDR to the PL-end accelerator for computation; meanwhile, the computed second picture is fetched from the PL end and transferred through the DDR to the PS-end post-processing module, which outputs the detection result of the second picture after post-processing and then preprocesses the fourth picture.
With this cooperative acceleration scheme, the preprocessing and post-processing on the PS end overlap with the computation on the PL end in a pipelined fashion, so their latencies hide behind one another and the detection time per picture is shortened.
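The following sketch illustrates this pipelined cooperation in software form; the routines preprocess, pl_accelerator_run and postprocess are hypothetical placeholders for the PS-end software and the PL-end accelerator, and the single worker thread stands in for the independently running PL fabric:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(img):          # PS end: scaling, normalization, blocking (placeholder)
    return img

def pl_accelerator_run(img):  # PL end: CNN accelerator computation (placeholder)
    return img

def postprocess(result):      # PS end: NMS, box regression, etc. (placeholder)
    return result

def detect_pipeline(pictures):
    """Overlap PS pre/post-processing with PL computation, one picture per stage."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pl:    # models the independent PL fabric
        pending = None                               # picture currently in flight on the PL end
        for pic in pictures:
            pre = preprocess(pic)                    # PS works on picture i ...
            if pending is not None:                  # ... while PL finishes picture i-1
                results.append(postprocess(pending.result()))
            pending = pl.submit(pl_accelerator_run, pre)   # hand picture i to the PL end
        if pending is not None:
            results.append(postprocess(pending.result()))  # drain the last picture
    return results
```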
In one embodiment, as shown in fig. 4, step S400 is preceded by:
step S200: the method comprises the steps of obtaining an initial convolutional neural network model, compressing the initial convolutional neural network model to obtain a compressed convolutional neural network model, training the compressed convolutional neural network model to obtain a network parameter file, carrying out fixed point number quantization on the network parameter file to obtain fixed point network parameters, obtaining a trained convolutional neural network model according to the compressed convolutional neural network model and the fixed point network parameters, and taking the trained convolutional neural network model as a preset convolutional neural network model.
Specifically, various model compression means are combined to change the structure of the convolutional neural network model, producing a compressed lightweight neural network model. The compressed model is trained to obtain its network parameter file, which is parsed to obtain all floating-point parameters of the network. The value range of these floating-point parameters is then analyzed, the detection performance of fixed-point network parameters with different combinations of integer and fractional bit widths is tested, the combination whose bit widths cause the smallest loss of detection accuracy is selected, and the fixed-point quantized network model parameters are obtained.
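As an illustration of this fixed-point quantization step, the following sketch analyzes the value range of a parameter array, chooses integer/fractional bit widths for a 16-bit signed format, and rounds the floating-point values to that grid; the function names and the simple range rule are assumptions, and saturation handling is omitted:

```python
import numpy as np

def choose_fixed_point_format(params, total_bits=16):
    """Pick an integer/fractional bit split so the observed value range does not overflow."""
    max_abs = np.max(np.abs(params))
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))))  # bits for the integer part
    frac_bits = total_bits - 1 - int_bits                       # 1 bit reserved for the sign
    return int_bits, frac_bits

def quantize(params, frac_bits):
    """Round floats to the nearest representable fixed-point value (no clipping/saturation)."""
    scale = 2 ** frac_bits
    return np.round(params * scale) / scale

# Example: weights concentrated in [-1.5, 1.5] need only one integer bit,
# so most of the 16 bits go to the fractional part.
w = np.random.uniform(-1.5, 1.5, size=1000).astype(np.float32)
int_bits, frac_bits = choose_fixed_point_format(w)
w_q = quantize(w, frac_bits)
print(int_bits, frac_bits, np.max(np.abs(w - w_q)))
```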
In one embodiment, as shown in fig. 5, the initial convolutional neural network model includes a normal convolutional layer, a fully-connected layer, and a non-linear layer, and the compressing the initial convolutional neural network model in step S200 to obtain a compressed convolutional neural network model includes:
step S210: replacing the normal convolutional layers in the initial convolutional neural network model with depth separable convolutional layers, which include depth convolutional and point-by-point convolutions.
Specifically, the normal convolutional layers of the convolutional neural network model are selectively replaced with depth separable convolutional layers. Assuming that the input image size of a convolutional layer is H × H, the number of input channels is C_in, the number of output channels is C_out, and the size of the convolution kernel is K × K, the amount of network parameters T_weight is: T_weight = K × K × C_in × C_out.
The depth separable convolution splits the convolutional layer into a depthwise convolution and a pointwise convolution. The depthwise convolution performs an independent convolution on each channel of the input image; the pointwise convolution then applies 1 × 1 convolution kernels to the result of the depthwise convolution, computed in the same way as an ordinary convolution. The parameter amount T_weight' of the depth separable convolution is T_weight' = K × K × C_in + C_in × C_out. Comparing the two, T_weight' / T_weight = 1/C_out + 1/(K × K); since the number of output channels C_out is usually much larger than the kernel size K × K, the total amount of parameters required by the depth separable convolution is about 1/(K × K) of that of the normal convolution. If the convolution kernel size is 3 × 3, the total amount of parameters drops to about 1/9 after the depth separable convolution replaces the normal convolution.
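The parameter comparison above can be checked with a short calculation; the 3 × 3, 64-in, 128-out example below is illustrative only:

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard convolution: K x K x C_in x C_out."""
    return k * k * c_in * c_out

def dsconv_params(k, c_in, c_out):
    """Depthwise (K x K x C_in) plus pointwise (1 x 1 x C_in x C_out) weights."""
    return k * k * c_in + c_in * c_out

# Example: 3x3 kernel, 64 input channels, 128 output channels.
std = conv_params(3, 64, 128)      # 73728
dsc = dsconv_params(3, 64, 128)    # 8768
print(dsc / std)                   # ~0.119, close to the 1/9 ratio stated above
```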
Step S220: the non-linear layer of the initial convolutional neural network model is set as the ReLU function.
Specifically, the nonlinear layer of the convolutional neural network model is set as the ReLU function f(x) = max(0, x): an input pixel greater than 0 is kept as it is, and the output is 0 if the input pixel is less than or equal to 0. Using the ReLU function adds more zero values to the output feature map, which facilitates the subsequent accelerated sparse-matrix computation.
Step S230: and carrying out sparse pruning on the full connection layer parameters of the initial convolutional neural network model according to preset sparse parameters.
Specifically, sparse pruning is performed on the fully-connected layer parameters of the initial convolutional neural network model. Because the neurons of a fully-connected layer in a convolutional neural network are fully connected, its parameter quantity is usually several orders of magnitude higher than that of the convolutional layers. The huge number of parameters of the fully-connected layer and its inefficient computation are among the main factors limiting the computation speed of a CNN, and the large number of fully-connected parameters is also one of the causes of overfitting in convolutional neural networks. To reduce the parameter count and improve the generalization performance of the model, sparse pruning is applied to the fully-connected layer: parameters whose weights fall below a threshold are set to zero, yielding fully-connected layer parameters in sparse-matrix form and thus reducing the parameter count. The specific sparsity strategy adopts the progressive pruning scheme provided by Tensorflow (a symbolic mathematical system based on dataflow programming): hyperparameters such as the target sparsity and the pruning interval are set, a weight threshold is adaptively derived from the global weight magnitudes, and the weights of neurons below the threshold are cleared at intervals. The calculation formula is as follows:
in the formula, the hyper-parameter corresponding to each symbol is as follows: stFor the current sparsity, SfTarget sparsity (set to 0.9), SiTo initial sparsity (default 0), t0The progressive sparsity function start time (default to 0 in terms of the global step number), n the progressive sparsity function duration (default to 100 in terms of the global step number), t the sparse matrix mask update frequency (default to 10 in terms of the global step number), and e the sparse index (default to 3).
In one embodiment, as shown in fig. 6, step S400 is preceded by:
step S300: the method comprises the steps that a preset convolutional neural network hardware accelerator structure is designed into a hardware accelerator comprising a control module, an input cache module, an output cache module, a convolutional layer calculation module, a pooling layer calculation module, a full-connection layer calculation module, a first nonlinear layer and bias calculation module, a second nonlinear layer and bias calculation module and a network parameter storage module; wherein the control module is respectively connected with the input cache module, the output cache module, the convolution layer calculation module, the pooling layer calculation module and the full-connection layer calculation module, the system comprises a first nonlinear layer and bias calculation module, a second nonlinear layer and bias calculation module and a network parameter storage module, wherein an input cache module is connected with a convolutional layer calculation module, a full-connection layer calculation module and an output cache module, the convolutional layer calculation module is connected with the first nonlinear layer and bias calculation module, the first nonlinear layer and bias calculation module is connected with a pooling layer calculation module, the pooling layer calculation module is connected with an output cache module, the full-connection layer calculation module is connected with the second nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module is connected with the output cache module, and the network parameter storage module is connected with the convolutional layer calculation module, the full-connection layer calculation module, the first nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module.
Specifically, the preset convolutional neural network hardware accelerator is designed in a modular manner on the PL end of the heterogeneous SoC and mainly comprises a control module, an input cache module, an output cache module, a convolutional layer calculation module (corresponding to the convolutional layer in fig. 6), a pooling layer calculation module (corresponding to the pooling layer in fig. 6), a full-connection layer calculation module, a first nonlinear layer and bias calculation module (corresponding to the ReLU + Bias between the convolutional layer and the pooling layer in fig. 6), a second nonlinear layer and bias calculation module (corresponding to the ReLU + Bias after the full-connection layer in fig. 6), and a network parameter storage module. The convolutional neural network hardware accelerator uses the programmable logic gate array resources inside the PL end to accelerate the computation-intensive convolutional neural network operations from the circuit level.
The control module is the top-level control module of the preset convolutional neural network hardware accelerator and is connected with the input cache module, the output cache module, the convolutional layer calculation module, the pooling layer calculation module, the full-connection layer calculation module, the first nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module, and the network parameter storage module. On the one hand, the control module uses registers to send control signals to each module in the accelerator and coordinates their calculation logic and preset timing; on the other hand, it exchanges control signals with the ARM control program at the PS end over the AXI bus using the AXI-Lite protocol.
In one embodiment, step S300 further comprises: the convolutional layer calculation module receives a first control signal of the control module, loads input images of a preset number of calculation channels from the input cache module according to the first control signal, each calculation channel corresponds to an independent multiply-accumulate group, the number of the multiply-accumulate groups is determined by a preset block size of the input image, reads corresponding convolutional layer network weights from the network parameter storage module, controls calculation time sequence according to the size of a convolutional kernel of a preset convolutional neural network model, performs convolutional calculation on the loaded input images in a clock cycle of the size of the convolutional kernel according to the convolutional layer network weights to obtain a first calculation result, and sends the first calculation result to the first nonlinear layer and the bias calculation module.
Specifically, the convolutional layer calculation module receives a first control signal sent by the control module, reads a calculation data stream from the input buffer module, reads a corresponding convolutional layer network weight from the network parameter storage module, performs convolutional calculation, and sends a calculation result to the first nonlinear layer and the offset calculation module. The deep separable convolution can be realized by using a calculation framework of common convolution, wherein a single-channel convolution algorithm is as follows:
y(i, j) = Σ_m Σ_n x(i + m, j + n) · w(m, n)

where x denotes the input feature image, w the weight, i and j the pixel coordinates of the convolution output, and m and n the pixel coordinates within the convolution window of the input feature. Because the storage resources inside the FPGA are limited and images are generally large, a whole image cannot be brought on chip at once; it is first partitioned into blocks and then loaded, so the convolutional layer computation is performed block by block. According to the computing resources of the FPGA chip and the structure of the convolutional neural network, as many calculation-channel data streams as possible are loaded from the input cache module each time, and an independent multiply-accumulate group is designed for each calculation channel, meeting the requirement of parallel computation between channels. The number of multiply-accumulate units in each group is determined by the size of the input image block (for example, for a 3 × 1024 RGB image with blocks of 3 × 128 image pixels, the image must be divided into 8 blocks, i.e. computed 8 times, to cover the complete picture), and the convolution computation is optimized by means such as register shifting and array multiplexing. In addition, the calculation timing is controlled according to the convolution kernel size (p × q), so that all images loaded in the current round are computed within p × q clock cycles. In this way, acceleration is achieved both inside each convolutional calculation channel and across different calculation channels, speeding up the computation of the convolutional layer.
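A simplified software model of this blocked, channel-parallel single-channel convolution is given below; the block size, channel count and kernel size are illustrative, and in hardware each channel of the outer loop would map to an independent multiply-accumulate group finishing within K × K clock cycles:

```python
import numpy as np

def conv_block_channelwise(blocks, weights):
    """blocks:  (C, Hb, Wb) image block, one 2-D tile per calculation channel.
       weights: (C, K, K) one kernel per channel (depthwise-style single-channel convolution)."""
    c, hb, wb = blocks.shape
    _, k, _ = weights.shape
    out = np.zeros((c, hb - k + 1, wb - k + 1), dtype=blocks.dtype)
    # Outer channel loop: in hardware these iterations run on parallel MAC groups.
    for ch in range(c):
        # Inner K*K loop: one multiply-accumulate per clock cycle per output pixel.
        for m in range(k):
            for n in range(k):
                out[ch] += weights[ch, m, n] * blocks[ch, m:m + out.shape[1], n:n + out.shape[2]]
    return out

# Example: 64 channels of 14x14 blocks with 3x3 kernels -> 64 channels of 12x12 outputs.
y = conv_block_channelwise(np.random.rand(64, 14, 14), np.random.rand(64, 3, 3))
```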
In one embodiment, step S300 further comprises: the full-connection layer calculation module receives a second control signal sent by the control module, loads input images of a preset number of calculation channels from the input cache module, reads full-connection layer network weights corresponding to each calculation channel from the network parameter storage module, and performs multiplication and accumulation calculation according to the pixels of the current input image and the network weights of the corresponding output channels until the full-connection layer calculation of all the calculation channels is finished when a preset multiplication and accumulation unit with the same number as the preset output channels detects that the pixels of the current input image are not zero and the network weights of the corresponding output channels are not zero in parallel, so as to obtain a second calculation result and send the second calculation result to the second nonlinear layer and the offset calculation module, wherein the weight size of each output channel is consistent with the size of the corresponding input image.
Specifically, the full-connection layer calculation module computes the fully-connected layers of the neural network. The module receives the second control signal sent by the control module, reads the input images of the loaded preset number of calculation channels from the input cache module, reads the full-connection layer network weight corresponding to each calculation channel from the network parameter storage module, performs the full-connection layer calculation and sends the calculation result to the output cache module. The hardware acceleration of the full-connection layer has three parts: (1) all output channels of the full-connection layer are fully unrolled, and the same number of multiply-accumulate units computes all channels in parallel; (2) one input feature is computed serially per clock cycle, and before the calculation it is checked whether the input-feature pixel is 0; if so, the multiply-accumulate calculation for that input feature is skipped; (3) because the network parameters of the full-connection layer form a sparse matrix, multiply-accumulate calculations whose network parameter is 0 are also skipped, following the sparse-matrix computation scheme. Through the sparse computation on input features and on network parameters, the data transfer and calculation steps for input pixels that are 0 (together with the weight rows corresponding to them) can be skipped, as can the calculation steps whose input pixel is non-zero but whose weight parameter is 0. In summary, the three levels of parallel computation across different channels, sparse computation on input features and sparse computation on network parameters cooperate in parallel and accelerate the computation of the full-connection layer.
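The zero-skipping scheme of the full-connection layer can be sketched as follows; the vectorized inner update stands in for the per-output-channel multiply-accumulate units working in parallel, and the shapes are illustrative:

```python
import numpy as np

def sparse_fc(x, w):
    """x: (N,) input features; w: (N, M) sparse fully-connected weights (M output channels).
       In hardware all M output channels accumulate in parallel; here they share one vector."""
    acc = np.zeros(w.shape[1], dtype=x.dtype)
    for i, xi in enumerate(x):
        if xi == 0:                      # input-feature sparsity: skip the whole weight row
            continue
        row = w[i]
        nz = row != 0                    # weight sparsity: skip zero parameters
        acc[nz] += xi * row[nz]          # non-zero channels update in the same step,
                                         # mirroring the per-channel multiply-accumulate units
    return acc
```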
In one embodiment, step S300 further comprises: the first nonlinear layer and offset calculation module receives a third control signal sent by the control module, reads a first calculation result from the convolutional layer calculation module, reads an offset value of the convolutional layer from the network parameter storage module, reserves a value of a pixel of an input image in the first calculation result, which is larger than zero, according to a preset ReLU function, sets a value of the pixel of the input image in the first calculation result, which is smaller than zero, as a zero value to obtain an updated first calculation result, performs addition operation according to the offset value of the convolutional layer and the updated first calculation result to obtain a third calculation result, and sends the third calculation result to the pooling layer calculation module; the second nonlinear layer and offset calculation module receives a fourth control signal sent by the control module, reads a second calculation result from the full connection layer calculation module, reads an offset value of the full connection layer from the network parameter storage module, reserves a value of a pixel of an input image, which is greater than zero, in the second calculation result according to a preset ReLU function, sets a value of the pixel of the input image, which is less than zero, in the second calculation result to a zero value, obtains an updated second calculation result, performs addition operation according to the offset value of the full connection layer and the updated second calculation result to obtain a fourth calculation result, and sends the fourth calculation result to the output buffer module.
Specifically, the first nonlinear layer and bias calculation module and the second nonlinear layer and bias calculation module perform the nonlinear-layer and bias operations for the convolutional layers and the full-connection layers of the neural network. The first nonlinear layer and bias calculation module receives the third control signal sent by the control module, reads the calculation data stream (the first calculation result) from the convolutional layer calculation module, reads the bias value of the convolutional layer from the network parameter storage module, performs the nonlinear and bias calculation to obtain the third calculation result, and sends the third calculation result to the pooling layer calculation module. The second nonlinear layer and bias calculation module receives the fourth control signal sent by the control module, reads the calculation data stream (the second calculation result) from the full-connection layer calculation module, reads the bias value of the full-connection layer from the network parameter storage module, performs the nonlinear and bias calculation to obtain the fourth calculation result, and sends the fourth calculation result to the output cache module.
Further, the nonlinear calculation maintains pixel values greater than zero and changes pixel values less than zero to zero values according to the ReLU function; the offset calculation is an addition operation using the offset value in the network parameter and the calculation flow (i.e. the corresponding calculation result) input into the module.
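A one-line software analogue of the two ReLU + bias modules, assuming the order described above (ReLU first, then bias addition) and a bias that is broadcastable over the feature map:

```python
import numpy as np

def relu_then_bias(x, bias):
    """Per the module description: clamp negative values to zero first, then add the bias."""
    return np.maximum(x, 0) + bias

# e.g. a (channels, height, width) feature map with one bias value per channel
y = relu_then_bias(np.random.randn(16, 7, 7), np.random.randn(16, 1, 1))
```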
In one embodiment, step S300 further comprises: the pooling layer calculation module receives the fifth control signal of the control module, reads the third calculation result from the first nonlinear layer and bias calculation module, performs the maximum pooling calculation on the third calculation result using a preset number of registers corresponding to the preset maximum-pooling window size to obtain a fifth calculation result, and sends the fifth calculation result to the output cache module.
In particular, the pooling layer calculation module computes the pooling layers of the neural network. The pooling layer calculation module receives the fifth control signal sent by the control module, reads the calculation data stream from the first nonlinear layer and bias calculation module, performs the maximum pooling calculation and sends the calculation result to the output cache module. When the pooling layer executes maximum pooling, all calculation channels coming from the convolutional layer are processed in parallel, and, according to the preset maximum-pooling window size, the calculation inside each pooling window is accelerated in grouped, parallel fashion by a corresponding preset number of registers. This achieves cooperative parallel acceleration both across different channels and within the same channel of the pooling layer, speeding up the computation of the pooling layer.
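A software sketch of the maximum pooling performed by this module; all channels are reduced together in each window, mirroring the per-window register groups, and the non-overlapping window assumption is illustrative:

```python
import numpy as np

def max_pool(feature_map, window):
    """feature_map: (C, H, W); non-overlapping max pooling over window x window regions.
       In hardware each window is reduced by a small group of registers holding the running max."""
    c, h, w = feature_map.shape
    h_out, w_out = h // window, w // window
    out = np.empty((c, h_out, w_out), dtype=feature_map.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = feature_map[:, i*window:(i+1)*window, j*window:(j+1)*window]
            out[:, i, j] = patch.reshape(c, -1).max(axis=1)   # all channels handled in parallel
    return out
```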
In one embodiment, step S300 further comprises: the network parameter storage module stores the convolutional layer network weight, the full-connection layer network weight, the bias value of the convolutional layer and the bias value of the full-connection layer, and the convolutional layer network weight, the full-connection layer network weight, the bias value of the convolutional layer and the bias value of the full-connection layer are grouped and divided in parallel according to parallel strategies designed by the convolutional layer calculation module, the full-connection layer calculation module, the first nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module.
Specifically, the network parameter storage module receives the sixth control signal sent by the control module, outputs the network weights of the corresponding network layer, and feeds them into the convolutional layer calculation module or the full-connection layer calculation module to participate in the calculation. The stored network parameters mainly comprise the convolutional layer parameters, the full-connection layer parameters and the bias parameters. All of these parameters are obtained by parsing the parameter file produced by the combined compression means of step S200, and they are grouped and partitioned in parallel according to the parallel strategies designed for the convolutional layer calculation module, the full-connection layer calculation module, the first nonlinear layer and bias calculation module, and the second nonlinear layer and bias calculation module, so that in each clock cycle those modules can read all the network parameters that need to participate in the parallel calculation of that cycle, accelerating the loading of network parameters.
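The grouping of stored parameters for parallel access can be pictured as partitioning the weight tensor into banks, as in the illustrative sketch below (the bank size and tensor shape are assumptions):

```python
import numpy as np

def partition_weights(conv_w, channels_per_group):
    """Split convolution weights (C_out, C_in, K, K) into groups that are stored in
       separate banks, so all weights needed in one clock cycle can be read together."""
    c_out = conv_w.shape[0]
    return [conv_w[g:g + channels_per_group]
            for g in range(0, c_out, channels_per_group)]

banks = partition_weights(np.random.rand(128, 64, 3, 3), channels_per_group=16)
print(len(banks), banks[0].shape)   # 8 banks of (16, 64, 3, 3)
```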
In one embodiment, step S300 further comprises: the input cache module receives an input image sent by the memory, and the output cache module stores a calculation result data stream to be sent to the memory.
Specifically, the input cache module and the output cache module buffer the data streams entering and leaving the convolutional neural network hardware accelerator, respectively. These modules use the AXI-Stream protocol to exchange calculation data streams with the external DDR memory: the input cache module receives the calculation data stream sent by the external DDR, and the output cache module stores the calculation-result data stream to be sent to the external DDR. Both the input and output cache modules adopt a double-BRAM (Block RAM) ping-pong buffering technique to form a pipelined mode of operation, which speeds up the reading and writing of the calculation data streams.
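A minimal software model of the double-BRAM ping-pong buffering used by the input and output cache modules; the class and its methods are illustrative, not the actual cache interface:

```python
class PingPongBuffer:
    """Two buffers alternate roles: one is filled from DDR while the other feeds the compute
       pipeline, modelling the double-BRAM ping-pong scheme of the input/output caches."""
    def __init__(self, depth):
        self.banks = [[None] * depth, [None] * depth]
        self.write_sel = 0                      # bank currently being written

    def write(self, data):
        self.banks[self.write_sel][:len(data)] = data

    def swap(self):
        self.write_sel ^= 1                     # exchange read/write roles after each block

    def read(self):
        return self.banks[self.write_sel ^ 1]   # the bank not being written is readable
```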
In this way, the convolutional neural network is first accelerated in software: the network structure and parameters are compressed through combined means such as grouped (depthwise separable) convolution, sparse pruning and fixed-point quantization, reducing the computation and parameter amounts of the convolutional neural network. It is then accelerated in hardware: a convolutional neural network hardware accelerator is designed for the compressed model in the programmable logic part of the heterogeneous SoC, fully exploiting the parallelism and reusability of the hardware to speed up the convolutional neural network computation. Finally, a pipelined acceleration scheme in which the PS and PL parts of the heterogeneous SoC work cooperatively is designed, forming a software and hardware co-acceleration method that enables the deep learning model based on the convolutional neural network to perform low-cost, high-speed, real-time image processing on the embedded SoC while maintaining detection accuracy.
The method of the present invention will be further described with reference to the drawings and MTCNN (Multi-task convolutional neural network) model as an example.
The method comprises the following specific steps:
step (1): and combining a plurality of model compression means, changing the MTCNN model structure, generating a compressed lightweight LW-MTCNN neural network model and training. The specific algorithm compression flow is shown in fig. 5:
the original MTCNN network parameters are about 4.9 × 105And 32-bit floating point numbers, which occupy about 14.95Mbyte of data storage space.
All convolutional layers in the MTCNN model are replaced with depth separable convolutional layers. At this point, the amount of network parameters drops to about 4.0 × 10^5 32-bit floating-point numbers, occupying about 12.2 Mbyte of data storage space.
The nonlinear layers in all sub-networks are changed from the PReLU function to the ReLU function. At this point, the amount of network model parameters drops to about 3.9 × 10^5 32-bit floating-point numbers, occupying about 11.9 Mbyte of data storage space.
The fully-connected layer parameters in all sub-networks are set to a sparse-matrix form with 90% sparsity, i.e. non-zero values occupy only about 10% of the fully-connected layer weights. At this point, the amount of network model parameters drops to about 6.1 × 10^4 32-bit floating-point numbers, occupying about 1.87 Mbyte of data storage space.
The compressed LW-MTCNN network model is trained on the FDDB data set in Tensorflow to obtain a model parameter file, which is parsed to obtain all floating-point parameters of the network model. The value range of the network floating-point parameters is analyzed, and the detection performance of 16-bit fixed-point network parameters with different combinations of integer and fractional bit widths is tested. Under the condition that no data overflow occurs, the input features are given a fixed-point format with a 1-bit sign, 5 integer bits and 10 fractional bits, and the network parameters are given a fixed-point format with a 1-bit sign, 4 integer bits and 11 fractional bits. The trained floating-point parameters of the network model are quantized to fixed point in this format, yielding the fixed-point quantized network model parameters. At this point, the number of network model parameters is about 6.1 × 10^4 16-bit fixed-point numbers, occupying about 0.94 Mbyte of data storage space.
So far, the MTCNN network model parameters have been reduced from about 4.9 × 10^5 to about 6.1 × 10^4, and the occupied storage space from 14.95 Mbyte to 0.94 Mbyte, a model compression of more than one order of magnitude, while the final detection accuracy drops by only about 1%. The model combination compression method provided by the invention can thus greatly compress the size of a convolutional neural network model while keeping high detection accuracy, achieving computational acceleration at the software algorithm level.
Step (2): and designing a convolutional neural network hardware accelerator at the PL end of the heterogeneous SoC.
The schematic diagram of the convolutional neural network hardware accelerator is shown in fig. 6:
the convolutional neural network hardware accelerator is designed in a modular mode based on a PL end of a heterogeneous SoC, and the design of each module is as follows:
the control module is a top-level control module in the convolutional neural network hardware accelerator, on one hand, the register is used for sending instructions to control each module in the accelerator, and the calculation logic and the preset time sequence of each module in the accelerator are coordinated; on the other hand, the AXI bus utilizes an AXI-Lite protocol to interact with an ARM control program at the PS end for control signals. The specific control register mainly stores instructions such as the number of input channels, the number of output channels, the size of a convolution window, the size of an input image, the type of a network layer, the number of current calculation blocks and the like.
The input cache module and the output cache module use the AXI-Stream protocol to exchange calculation data streams with the external DDR memory: the input cache module receives the calculation data stream sent by the external DDR, and the output cache module stores the calculation-result data stream to be sent to the external DDR. Both adopt the double-BRAM ping-pong buffering technique to form a pipelined mode of operation.
The convolutional layer calculation module computes block by block. Each time, 64 calculation channels of 14 × 14 input image blocks are loaded from the input cache module; internally, 16 groups of 7 × 7 multiply-accumulate units (MACs) compute the convolution in parallel and, with register shifting, array multiplexing and similar techniques, complete the 7 × 7 output images of 16 channels in parallel within 9 clock cycles. This achieves cooperative parallel acceleration both across different channels and within the same channel of the convolutional layer.
The pooling layer calculation module uses 2 groups of registers (2 registers per group) or 3 groups of registers (3 registers per group), according to the 2 × 2 or 3 × 3 maximum-pooling window size in the original network structure, to accelerate the calculation inside each pooling window in parallel. This achieves cooperative parallel acceleration both across different channels and within the same channel of the pooling layer.
The full-connection layer calculation module is illustrated with a fully-connected layer whose input is 3 × 3 × 64 and whose output is 128: (1) all 128 output channels of the layer are unrolled and computed in parallel by 128 multiply-accumulate units; (2) the 3 × 3 × 64 input features are processed serially, one input feature per clock cycle, and before each calculation it is checked whether the input feature is 0; if so, its multiply-accumulate calculation is skipped; (3) because the network parameters of the full-connection layer in this embodiment form a sparse matrix with 90% zero values, multiply-accumulate calculations whose network parameter is 0 are likewise skipped, following the sparse-matrix computation scheme. In summary, the three levels of parallel computation across different channels, sparse computation on input features and sparse computation on network parameters cooperate to accelerate the full-connection layer in parallel.
The first nonlinear layer and bias calculation module and the second nonlinear layer and bias calculation module keep pixel values greater than zero and set pixel values less than zero to zero, following the ReLU function; the bias calculation then adds the bias value from the network parameters to the calculation stream fed into the module.
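As a one-line illustration, the combined operation can be written as below; the ordering (ReLU first, then the bias addition) follows the module description above.

```cpp
// Minimal sketch of the nonlinear-layer-and-bias step: keep positive values,
// zero out negative values, then add the layer's bias value.
#include <cstdint>

inline int32_t relu_then_bias(int32_t v, int32_t bias) {
    return (v > 0 ? v : 0) + bias;
}
```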
The network parameter storage module mainly stores the convolutional layer parameters, the fully-connected layer parameters and the bias parameters. All parameters are obtained by parsing the parameter file produced by the combined compression means in step (1). The parameters are grouped in parallel according to the parallel strategies designed for the convolutional layer calculation module, the fully-connected layer calculation module, the first nonlinear layer and bias calculation module and the second nonlinear layer and bias calculation module, so that in each clock cycle each of these modules can read all of the network parameters that need to participate in that cycle's parallel computation.
Step (3): a convolutional neural network software-hardware cooperative acceleration system based on a heterogeneous SoC is designed, specifically as follows:
A development board is used as the heterogeneous SoC carrier, and an operating system is deployed on the ARM processor on the PS side.
The heterogeneous SoC software and hardware deployment and data interaction schematic diagram is shown in fig. 2:
An MTCNN-based face detection application is designed. According to the preset algorithm flow, preprocessing and post-processing functions such as image scaling, non-maximum suppression and candidate-box coordinate regression are implemented on the PS side; the control signal registers are defined according to the requirements of the PL-side convolutional neural network hardware accelerator, and the PS side controls the PL-side accelerator through the AXI-Lite protocol; peripherals such as the DDR are mounted, and the computation data stream is exchanged with the PL side through the AXI-Stream protocol.
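As an example of the PS-side post-processing, a generic non-maximum suppression routine is sketched below; the box representation, the IoU definition and the 0.7 threshold are illustrative assumptions rather than the exact MTCNN implementation used by the application.

```cpp
// Generic non-maximum suppression sketch (assumed box layout and threshold):
// keep the highest-scoring boxes and discard boxes that overlap a kept box
// by more than the IoU threshold.
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

static float iou(const Box& a, const Box& b) {
    const float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    const float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
    const float inter = iw * ih;
    const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter);
}

std::vector<Box> nms(std::vector<Box> boxes, float thresh = 0.7f) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (iou(cand, k) > thresh) { suppressed = true; break; }
        if (!suppressed) kept.push_back(cand);
    }
    return kept;
}
```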
A software and hardware cooperative acceleration schematic diagram based on a heterogeneous SoC is shown in fig. 3:
The PS side and the PL side of the heterogeneous SoC cooperate to achieve acceleration as follows. First, in the first round the PS side preprocesses the first picture to be detected. Second, the preprocessed first picture is transferred through the DDR to the PL-side convolutional neural network hardware accelerator for computation while the PS side preprocesses the second picture to be detected. Third, the preprocessed second picture is transferred through the DDR to the PL-side accelerator for computation; meanwhile the first picture, whose PL-side computation has finished, is fetched and transferred through the DDR to the PS-side post-processing module, the detection result of the first picture is output after post-processing, and the PS side then preprocesses the third picture to be detected. Fourth, the preprocessed third picture is transferred through the DDR to the PL-side accelerator for computation; meanwhile the second picture, whose PL-side computation has finished, is fetched and transferred through the DDR to the PS-side post-processing module, the detection result of the second picture is output after post-processing, and the PS side then preprocesses the fourth picture to be detected, and so on.
Through this PS-PL cooperative acceleration scheme, the PS-side preprocessing and post-processing and the PL-side computation overlap within the same computation latency in a pipelined manner, which shortens the detection time of each picture.
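A minimal software model of this three-stage schedule is sketched below; the stage functions are placeholders for the application's actual preprocessing, accelerator call and post-processing routines, and on the real SoC the PL computation of one round overlaps with the PS work of the same round instead of running sequentially.

```cpp
// Software model of the PS-PL pipeline: in round n the PS side post-processes
// picture n-2, the PL accelerator computes on picture n-1, and the PS side
// preprocesses picture n. Stage functions are placeholders (assumptions).
#include <cstddef>
#include <cstdio>

struct Image  { int id; };   // stands in for a preprocessed picture in DDR
struct Result { int id; };   // stands in for the accelerator output in DDR

static Image  preprocess(std::size_t n)               { return Image{static_cast<int>(n)}; }
static Result pl_compute(const Image& img)            { return Result{img.id}; }
static void   postprocess_and_output(const Result& r) { std::printf("picture %d done\n", r.id); }

void detect_pipeline(std::size_t num_pictures) {
    bool   have_img = false, have_res = false;
    Image  img_in_flight{};
    Result res_in_flight{};
    for (std::size_t n = 0; n < num_pictures + 2; ++n) {
        if (have_res) postprocess_and_output(res_in_flight);   // picture n-2
        have_res = false;
        if (have_img) {                                        // picture n-1
            res_in_flight = pl_compute(img_in_flight);
            have_res = true;
        }
        have_img = false;
        if (n < num_pictures) {                                // picture n
            img_in_flight = preprocess(n);
            have_img = true;
        }
    }
}
```

Calling `detect_pipeline(N)` processes N pictures; the two extra rounds at the end drain the pictures still in flight in the accelerator and in post-processing.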
The method for realizing a convolutional neural network software-hardware cooperative acceleration heterogeneous SoC provided by the invention has been described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the core concepts of the invention. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.
Claims (7)
1. A method for realizing a convolutional neural network software-hardware cooperative acceleration heterogeneous SoC, characterized by comprising the following steps:
step S100: the on-chip processor acquires a current picture to be detected for preprocessing, and after the current picture to be detected is preprocessed, the preprocessed current picture to be detected is sent to the programmable logic circuit through the memory, and a next picture to be detected is acquired as the current picture to be detected for preprocessing;
step S400: the programmable logic circuit receives the preprocessed current picture to be detected, calculates according to a preset convolutional neural network hardware accelerator and a preset convolutional neural network model, and sends the calculated current picture to be detected to the on-chip processor through the memory after the current picture to be detected is calculated; the preset convolutional neural network hardware accelerator is designed based on the compressed convolutional neural network model;
step S500: the on-chip processor receives the calculated current picture to be detected, performs post-processing on the calculated current picture to be detected, and outputs a detection result of the current picture to be detected;
the step S400 comprises, before:
step S200: acquiring an initial convolutional neural network model, compressing the initial convolutional neural network model to obtain a compressed convolutional neural network model, training the compressed convolutional neural network model to obtain a network parameter file, performing fixed point number quantization on the network parameter file to obtain fixed point network parameters, obtaining a trained convolutional neural network model according to the compressed convolutional neural network model and the fixed point network parameters, and taking the trained convolutional neural network model as a preset convolutional neural network model;
the initial convolutional neural network model includes a common convolutional layer, a fully-connected layer and a nonlinear layer, and the compressing of the initial convolutional neural network model in step S200 to obtain a compressed convolutional neural network model comprises:
step S210: replacing the common convolutional layer in the initial convolutional neural network model with a depthwise separable convolutional layer, the depthwise separable convolutional layer comprising a depthwise convolutional layer and a pointwise convolution;
step S220: setting a nonlinear layer of the initial convolutional neural network model as a ReLU function;
step S230: performing sparse pruning on the fully-connected layer parameters of the initial convolutional neural network model according to preset sparsity parameters;
before the step S400, the method further comprises:
step S300: designing the preset convolutional neural network hardware accelerator as a hardware accelerator comprising a control module, an input cache module, an output cache module, a convolutional layer calculation module, a pooling layer calculation module, a fully-connected layer calculation module, a first nonlinear layer and bias calculation module, a second nonlinear layer and bias calculation module, and a network parameter storage module;
wherein the control module is respectively connected with the input cache module, the output cache module, the convolutional layer calculation module, the pooling layer calculation module, the fully-connected layer calculation module, the first nonlinear layer and bias calculation module, the second nonlinear layer and bias calculation module, and the network parameter storage module; the input cache module is connected with the convolutional layer calculation module, the fully-connected layer calculation module and the output cache module; the convolutional layer calculation module is connected with the first nonlinear layer and bias calculation module; the first nonlinear layer and bias calculation module is connected with the pooling layer calculation module; the pooling layer calculation module is connected with the output cache module; the fully-connected layer calculation module is connected with the second nonlinear layer and bias calculation module; the second nonlinear layer and bias calculation module is connected with the output cache module; and the network parameter storage module is connected with the convolutional layer calculation module, the fully-connected layer calculation module, the first nonlinear layer and bias calculation module, and the second nonlinear layer and bias calculation module.
2. The method of claim 1, wherein step S300 further comprises: the convolutional layer calculation module receives a first control signal from the control module; according to the first control signal, it loads the input images of a preset number of calculation channels from the input cache module, each calculation channel corresponding to an independent multiply-accumulate group and the number of multiply-accumulate groups being determined by the preset block size of the input images; it reads the corresponding convolutional layer network weights from the network parameter storage module, controls the calculation timing according to the convolution kernel size of the preset convolutional neural network model, performs the convolution calculation on the loaded input images within a number of clock cycles corresponding to the convolution kernel size according to the convolutional layer network weights to obtain a first calculation result, and sends the first calculation result to the first nonlinear layer and bias calculation module.
3. The method according to claim 2, wherein step S300 further comprises: the fully-connected layer calculation module receives a second control signal sent by the control module, loads the input images of a preset number of calculation channels from the input cache module, and reads the fully-connected layer network weight corresponding to each calculation channel from the network parameter storage module; using a preset number of multiply-accumulate units equal to the preset number of output channels, when parallel detection shows that the current input image pixel is non-zero and the network weight of the corresponding output channel is non-zero, it performs the multiply-accumulate calculation on the current input image pixel and the network weight of the corresponding output channel; this continues until the fully-connected layer calculation of all calculation channels is completed, whereupon a second calculation result is obtained and sent to the second nonlinear layer and bias calculation module, wherein the weight size of each output channel is consistent with the size of the corresponding input image.
4. The method according to claim 3, wherein step S300 further comprises: the first nonlinear layer and bias calculation module receives a third control signal sent by the control module, reads the first calculation result from the convolutional layer calculation module, reads the bias value of the convolutional layer from the network parameter storage module, retains the values of the input image pixels in the first calculation result that are greater than zero according to a preset ReLU function, sets the values of the input image pixels in the first calculation result that are less than zero to zero to obtain an updated first calculation result, performs an addition operation on the bias value of the convolutional layer and the updated first calculation result to obtain a third calculation result, and sends the third calculation result to the pooling layer calculation module; the second nonlinear layer and bias calculation module receives a fourth control signal sent by the control module, reads the second calculation result from the fully-connected layer calculation module, reads the bias value of the fully-connected layer from the network parameter storage module, retains the values of the input image pixels in the second calculation result that are greater than zero according to the preset ReLU function, sets the values of the input image pixels in the second calculation result that are less than zero to zero to obtain an updated second calculation result, performs an addition operation on the bias value of the fully-connected layer and the updated second calculation result to obtain a fourth calculation result, and sends the fourth calculation result to the output cache module.
5. The method according to claim 4, wherein step S300 further comprises: the pooling layer calculation module receives a fifth control signal from the control module, reads the third calculation result from the first nonlinear layer and bias calculation module, performs the maximum pooling calculation on the third calculation result through a preset number of registers corresponding to the preset maximum pooling window size to obtain a fifth calculation result, and sends the fifth calculation result to the output cache module.
6. The implementation method according to any one of claims 1 to 5, wherein step S300 further comprises: the network parameter storage module stores the convolutional layer network weights, the fully-connected layer network weights, the bias values of the convolutional layer and the bias values of the fully-connected layer, and groups the convolutional layer network weights, the fully-connected layer network weights, the bias values of the convolutional layer and the bias values of the fully-connected layer according to the parallel strategies designed for the convolutional layer calculation module, the fully-connected layer calculation module, the first nonlinear layer and bias calculation module, and the second nonlinear layer and bias calculation module.
7. The method of claim 1, wherein step S300 further comprises: the input cache module receives the input image sent by the memory, and the output cache module stores the calculation result data stream to be sent to the memory.