CN112508184B - Design method of fast image recognition accelerator based on convolutional neural network

Design method of fast image recognition accelerator based on convolutional neural network

Info

Publication number
CN112508184B
CN112508184B
Authority
CN
China
Prior art keywords
input
data
convolution
parallel
image
Prior art date
Legal status
Active
Application number
CN202011486673.4A
Other languages
Chinese (zh)
Other versions
CN112508184A
Inventor
向敏
刘榆
赵小翔
周闰
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202011486673.4A
Publication of CN112508184A
Application granted
Publication of CN112508184B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention relates to a design method of a fast image recognition accelerator based on a convolutional neural network, and belongs to the technical field of image recognition. The method comprises the following steps: 1) on an Internet-of-Things terminal combining an ARM and an FPGA, the ARM end configures the parameters of the camera and processes the collected image data and weight data; 2) a pipeline processing scheme combining software and hardware is designed, an operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is adopted, and a model of terminal resources and recognition time is established based on this strategy; 3) the optimal image block size and convolution parallelism parameters are obtained by solving the model, and a convolutional neural network model is constructed on the FPGA end to recognize the image. The method can make full use of the on-chip resources of a resource-limited Internet-of-Things terminal, improving resource utilization and effectively increasing image recognition speed.

Description

Design method of fast image recognition accelerator based on convolutional neural network
Technical Field
The invention belongs to the technical field of image recognition, and relates to a design method of a fast image recognition accelerator based on a convolutional neural network.
Background
With the development of the Internet of Things and artificial intelligence, requirements for target detection and face recognition keep rising in Internet-of-Things scenes such as intelligent traffic and video monitoring, and the Convolutional Neural Network (CNN), as a common algorithm for image recognition, plays a crucial role in such scenes. However, the convolutional neural network algorithm consumes a large amount of resources, while Internet-of-Things terminal devices are usually resource-constrained and have strict requirements on cost, power consumption and real-time performance. How to apply the convolutional neural network algorithm on a resource-limited Internet-of-Things terminal to achieve maximum utilization of terminal resources and fast image recognition is a current research focus.
The FPGA provides a feasible scheme for deploying deep-learning algorithms on Internet-of-Things terminals. Although good computing performance has been achieved on FPGA-based Internet-of-Things terminals, few designs consider the resource-limited terminal, and room for improvement remains. To achieve fast image recognition on a resource-limited Internet-of-Things terminal, a software and hardware cooperative pipeline architecture is first designed so that the ARM end and the FPGA end process data cooperatively, greatly improving computing performance; then, in accordance with the FPGA's characteristic of scarce on-chip storage resources but abundant computing resources, a parallel strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is adopted; finally, a model of terminal resources and recognition time is established, and the optimal image block size and convolution parallelism parameters are solved while satisfying the constraints of on-chip storage and computing resources. Resource utilization is thereby improved on the resource-limited Internet-of-Things terminal, and image recognition speed is effectively increased.
Disclosure of Invention
In view of the above, the present invention provides a method for designing a fast image recognition accelerator based on a convolutional neural network, which includes designing a pipeline processing scheme combining software and hardware, adopting an operation strategy combining image block parallelism, input channel parallelism and output channel parallelism, establishing a model of terminal resources and recognition time based on the operation strategy, and solving the model to obtain an optimal image block size and convolutional parallelism parameters to realize fast image recognition.
In order to achieve the purpose, the invention provides the following technical scheme:
a design method of a fast image recognition accelerator based on a convolutional neural network comprises the following steps:
s1: on an Internet of things terminal based on the combination of an ARM and an FPGA, the ARM acquires various target images in a living scene through a camera module;
s2: the ARM processes the acquired target image data and uses the processed target image data as an input characteristic diagram, then performs data processing on convolution kernel data and offset data, generates related instruction data, and finally stores the data in an off-chip DDR memory;
s3: designing a software and hardware cooperative operation pipeline scheme based on the terminal of the Internet of things;
s4: aiming at the input feature map in the step S2, establishing an input feature map blocking principle;
s5: in the convolution operation of the FPGA, an operation strategy of combining image block parallel, input channel parallel and output channel parallel is adopted;
s6: taking the on-chip storage resources and the computing resources of the FPGA as limiting conditions, and establishing a model of the Internet of things terminal resources and the image recognition time based on the strategy in the step S5;
s7: solving the optimal image block size and convolution parallel parameters of the model in the step S6, and blocking the input feature map according to the blocking principle in the step S4, wherein the blocked input feature map is stored in an off-chip DDR memory;
s8: and the FPGA reads the convolution kernel weight data and the offset data according to the convolution parallel parameters solved in the step S7, reads the input characteristic diagram data partitioned in the step S7, and performs operation on the data to obtain an image identification result.
Further, in step S1, the various objects in the life scene are common objects, including vehicles, ships, fruits, animals, people, keyboards, mice and computers.
Further, step S2 specifically includes: the collected target image is processed, namely the size of the target image is adjusted to be consistent with the size of an input characteristic diagram of the convolutional neural network model;
the data processing of the convolution kernel data and the bias data is to quantize the convolution kernel data and the bias data obtained in the forward training of the convolution neural network into 16-bit fixed point numbers;
and the generated related instruction data are related parameters for controlling the running state of the FPGA.
Further, in step S3, the software and hardware cooperative pipeline scheme is as follows:
first, the ARM end performs the software blocking and data preparation for picture 1, and then outputs the data to the FPGA end for hardware processing; meanwhile, the ARM end performs the blocking and data preparation for picture 2, thereby achieving the effect of a cooperative processing pipeline.
Further, in step S4, the input feature map has N channels and size H × W; the convolution kernels have N channels, size K × K, the number of kernels is M, and the sliding stride is S; the output feature map has M channels and size R × C; the length and width of the input feature map are equal, i.e. H = W, and the length and width of the output feature map are equal, i.e. R = C;

the image blocking principle is as follows: when the input feature map is not zero-padded, the side length of the output feature map satisfies

R = (H - K)/S + 1

the output feature map is divided into four small output blocks, so the side length of the blocked output feature map is RB = R/2; according to the relation between the input and output feature maps,

RB = (HB - K)/S + 1 and R = (H - K)/S + 1

and solving these two formulas gives

HB = (H + K - S)/2

so the input feature map with side length H is divided into four input blocks with side length HB; when the input feature map is zero-padded, the input feature map is directly divided into four input blocks with side length

HB = H/2
Further, in step S5, the input channel parallelism is PN representing the number of parallel input channels, the convolution kernel unfolding parallelism is PW representing the number of convolution kernel expansions, the input feature map block parallelism is PB representing the number of block parallels, and the output channel parallelism is PM representing the number of parallel output channels;
the operation strategy of combining the image block parallel, the input channel parallel and the output channel parallel is as follows: parallel output channels, representing that each input feature map after partitioning is simultaneously convoluted with PM convolution kernels, and generating output feature maps with PM channels after the convolution; the input channels are parallel, representing that PN input channels are parallel, and data among the PN input channels are operated simultaneously; and (3) the input images are parallel in blocks, representing that PB blocks of feature maps in each input channel are input in parallel, and each feature map is simultaneously multiplied by the number of PW (pseudo wire) expanded by the convolution kernel.
Further, in step S6, taking the on-chip storage resources and computing resources of the FPGA as limiting conditions, the process of establishing the model of Internet-of-Things terminal resources and image recognition time is as follows:

S61: let the total multiplier resource of the FPGA be D_a, the total on-chip storage resource be B_a, and the number of convolution layers in the convolutional neural network model be r; the number of DSP resources occupied by each convolution layer satisfies:

(K_i × K_i × PB_i × PN_i × PM_i) × D_c ≤ D_a

where K_i is the side length of the convolution kernel of the i-th layer (1 ≤ i ≤ r), PB_i, PN_i and PM_i are respectively the block parallelism, input-channel parallelism and output-channel parallelism of the i-th layer convolution, and D_c is the number of DSP resources required by a single multiplier;

S62: each parallel input computation requires the on-chip storage resources to satisfy:

(HB_i × WB_i) × PB_i × PN_i × B_w/B_h + (RB_i × CB_i) × PM_i × PB_i × B_w/B_h + (K_i × K_i) × PM_i × PB_i × PN_i × B_w/B_h ≤ B_a

where HB_i and WB_i are respectively the length and width of the blocked input feature map of the i-th layer, RB_i and CB_i are respectively the length and width of the blocked output feature map of the i-th layer, B_w is the bit width of the data, and B_h is the storage depth of a single BRAM block;

S63: the input-channel parallelism PN_i must evenly divide the total number of input channels N_i, i.e. N_i % PN_i = 0; the output-channel parallelism PM_i must evenly divide the total number of output channels M_i, i.e. M_i % PM_i = 0; the convolution-kernel-unfolding parallelism PW_i of the i-th layer convolution is determined by the kernel size, i.e. PW_i = K_i × K_i; the block parallelism PB_i must not exceed the total number of blocks of the input feature map, i.e. PB_i ≤ (H_i × W_i)/(HB_i × WB_i);

S64: in the convolution layers, the hardware execution time TH_i of a single parallel input computation of the i-th layer convolution consists of the input image transmission time, the convolution kernel transmission time and the convolution computation time; TH_i is expressed by the following formula:

TH_i = (HB_i × WB_i) × t_clk + (K_i × K_i) × WK_i × t_clk + (RB_i × CB_i) × WK_i/PM_i × t_clk

where WK_i is the total number of convolution kernels required by the i-th layer convolution and t_clk is the system clock period;

S65: the execution time of each convolution layer is the product of the single parallel input computation time and the number of computations required to finish that layer, so the time to recognize one picture is:

T = Σ_{i=1}^{r} TH_i × X_i

where X_i denotes the number of parallel inputs required for the i-th layer convolution to complete the convolution of one input image, given one parallel input computation at a time; X_i is expressed by the following formula:

X_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i)

S66: since the output feature map is obtained by convolving the input feature map, the length of the blocked output feature map is RB_i = (HB_i - K_i)/S + 1 and its width is CB_i = (WB_i - K_i)/S + 1; substituting the formulas of steps S64 and S66 into S65, when K_i = 3 and S = 1, gives:

T = Σ_{i=1}^{r} (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) × (HB_i × WB_i + 9 × WK_i + (HB_i - 2) × (WB_i - 2) × WK_i/PM_i) × t_clk

S67: according to step S66, taking the blocked input feature map to be square (WB_i = HB_i), let

T(HB_i) = (C_i/HB_i^2) × (HB_i^2 + 9 × WK_i + (HB_i - 2)^2 × WK_i/PM_i) × t_clk

where C_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/PB_i collects the factors independent of HB_i;

S68: combining the resource restriction conditions, the optimization objective is defined as:

min T = Σ_{i=1}^{r} T(HB_i), subject to the constraints of steps S61, S62 and S63.
further, in step S7, the model is solved to obtain the optimal image block size and the convolution parallel parameters, and the input feature map is blocked according to the blocking principle, which specifically includes the following steps:
s71: for T (HB) in step S67i) And (5) derivation to obtain:
Figure BDA0002839478300000051
s72: let T' (HB)i) 0 to yield HBi=2+(PWi×PMi) 2, then for T (HB)i) Calculating the second derivative, and adding HBi=2+(PWi×PMi) The second derivative T' (HB) is introduced by/2i) In (b), T' (HB) is obtainedi) > 0, indicating T (HB)i) In the interval (0, 2+ (PW)i×PMi)/2]Monotonically decreasing and at T (HB)i) Satisfies HBi=2+(PWi×PMi) At the condition of/2, T (HB)i) Obtaining a minimum value, wherein the image recognition time is shortest;
s73: determining convolution parallel parameters PN, PW, PB and PM according to the resource limitation conditions of S61 and S62; and taken to step S72, according to HBi=2+(PWi×PMi) Solving side length HB of input feature map blocki
S74: obtaining HB according to step S73iAfter the value is obtained, the block principle is combined to determine the length l of each block of image after the block is formed, and then the input characteristic is subjected to the input characteristic matching according to the length lambda of the original input characteristic graph and the first address a of the block imagePartitioning the graph: the ARM reads l data according to the first row address a and writes the data into an off-chip memory with continuous addresses, then obtains the first address of a secondary row according to a + l and reads the data again, and writes the data into the last row of addresses stored in the off-chip memory until l multiplied by l data are read.
Further, in step S8, the FPGA uses the input feature map data, convolution kernel data and offset data to perform operations, including the convolution operation, activation function operation, pooling operation and fully-connected layer operation.
The invention has the beneficial effects that: the invention designs a pipeline scheme of software and hardware cooperative operation, realizes the cooperative processing of data by ARM and FPGA, and greatly improves the computing performance; meanwhile, an image blocking method is provided, an operation strategy combining image blocking parallelism, input channel parallelism and convolution kernel unfolding parallelism is adopted, a model of terminal resources and identification time is established based on the operation strategy, and the optimal image blocking size and convolution parallelism parameters are obtained by solving the model, so that the resource utilization rate is improved, the rapid identification of the image is realized, and the rapid identification of the image on the resource-limited internet of things terminal is facilitated.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a general flow chart of the fast image recognition accelerator based on the convolutional neural network according to the present invention;
FIG. 2 is a schematic diagram of a software and hardware co-production line scheme according to the present invention;
FIG. 3 is a block diagram of an input feature map according to the present invention;
FIG. 4 is a schematic diagram of parallel computing according to the present invention;
FIG. 5 is a schematic diagram of the input operation of the line cache architecture of the present invention;
FIG. 6 is a diagram illustrating the reading and writing of data in a pipeline according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustrative purposes only and are not intended to limit the invention; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of the actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that orientation or positional terms such as "upper", "lower", "left", "right", "front" and "rear" are based on the orientations or positional relationships shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; such terms are therefore used for illustrative purposes only and are not to be construed as limiting the present invention, and their specific meaning can be understood by those skilled in the art according to the specific situation.
As shown in fig. 1, for the overall flowchart of the fast image recognition accelerator based on the convolutional neural network, the ARM is responsible for configuring parameters of the camera to collect image data and process the collected image, the ARM also needs to send instruction data and preprocess the weight data, and finally writes the data into the off-chip DDR memory and informs the FPGA to read the data. The FPGA needs to realize a control unit module, an on-chip memory module, a DMA read-write module, a convolution calculation module, a pooling module and a full connection layer operation module. The control unit module is responsible for receiving an instruction of the ARM end, controls data to run in each module of the FPGA after receiving the instruction, and finally outputs an image recognition result after convolution, function activation, pooling and full connection layer. And finally, returning the calculation result to the off-chip DDR memory, and then reading and further processing the calculation result by the ARM end. Each of which will be described in detail below.
1) Image data acquisition and processing stage
In consideration of the fact that the target image needs to be identified on the internet of things terminal, image data needs to be collected on the internet of things terminal, and relevant data needs to be processed.
S1: on the Internet-of-Things terminal based on the combination of the ARM and the FPGA, the ARM configures the camera module to collect various targets in living scenes. This step mainly involves the ARM configuring parameters of the camera module such as the image acquisition resolution and the output pixel format of the image data, and collecting common target images including but not limited to vehicles, ships, fruits, animals, people, keyboards, mice and computers.
S2: the ARM first processes the collected image data and uses it as the input feature map, then processes the convolution kernel data and bias data and generates the related instruction data, and finally stores the data in the off-chip memory. The specific operations are as follows:

First, the size of the collected target image data is adjusted: the convolutional neural network model is set to yolov2-tiny, whose input feature map is 256 × 256, while the collected target image is 640 × 480, so the image is converted and scaled down to 256 × 256 and used as the input feature map. The convolution kernel data and bias data are processed by discarding the redundant data bits of the 32-bit floating-point convolution kernel and bias data and quantizing them into fixed-point data with 16-bit precision. The generated instruction data are the parameters that control the running state of the FPGA, including instructions for when to start the FPGA hardware, the related parameters of the convolution operation, and so on.
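To illustrate this quantization step, the following is a minimal Python sketch of converting 32-bit floating-point weights to 16-bit fixed-point values; the Q8.8 split between integer and fractional bits is an assumption for illustration, since the text only specifies 16-bit precision.

```python
import numpy as np

def quantize_fixed16(x, frac_bits=8):
    """Quantize float32 values to 16-bit fixed point (assumed Q8.8 format)."""
    scale = 1 << frac_bits
    q = np.round(x * scale).astype(np.int64)
    # Saturate to the int16 range instead of letting values wrap around.
    return np.clip(q, -(1 << 15), (1 << 15) - 1).astype(np.int16)

def dequantize_fixed16(q, frac_bits=8):
    """Recover approximate float values, e.g. to measure quantization error."""
    return q.astype(np.float32) / (1 << frac_bits)

weights = np.random.randn(16, 3, 3, 3).astype(np.float32)  # stand-in kernels
qw = quantize_fixed16(weights)
print("max abs quantization error:", np.abs(dequantize_fixed16(qw) - weights).max())
```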
2) Model building stage
After the ARM processes the input characteristic diagram data, the convolution kernel data and the weight data and writes the processed data into the off-chip DDR memory, the ARM starts to send instruction data to the FPGA, and informs the FPGA to read the off-chip memory data and perform data operation. Before the FPGA reads data, a corresponding model is established to determine how the FPGA reads the data for hardware calculation. The process of establishing the model of the Internet of things terminal resource and the image recognition time is divided into the following steps.
S3: based on the Internet-of-Things terminal, the software and hardware cooperative pipeline scheme is designed as follows:

Referring to fig. 2, for the overall implementation of the accelerator, a software and hardware cooperative pipeline scheme is designed: first, the ARM is responsible for the software blocking, data preparation and related work for picture 1; the data are then output to the FPGA end for hardware processing, while the ARM end performs the blocking and data preparation for picture 2, thereby achieving the effect of a cooperative processing pipeline. In this way, the time each picture spends being processed in the FPGA hardware is the time required to recognize one picture.
S4: for the input feature map of the convolutional neural network, the input feature map blocking principle is established as follows:

In order to improve the resource utilization of the Internet-of-Things terminal, the blocked feature maps should have consistent sizes and high regularity. Let the input feature map have N channels and size H × W; the convolution kernels have N channels, size K × K, the number of kernels is M, and the stride is S; the output feature map has M channels and size R × C. The length and width of the input feature map are equal, i.e. H = W, and the length and width of the output feature map are equal, i.e. R = C. Referring to fig. 3, the image blocking principle is: when the input feature map is not zero-padded, the side length of the output feature map satisfies

R = (H - K)/S + 1

The output feature map is divided into four small output blocks, so the side length of the blocked output feature map is RB = R/2; according to the relation between the input and output feature maps,

RB = (HB - K)/S + 1 and R = (H - K)/S + 1

and solving these two formulas gives

HB = (H + K - S)/2

so the input feature map with side length H is divided into four input blocks with side length HB; when the input feature map is zero-padded, the input feature map is directly divided into four input blocks with side length HB = H/2.
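A minimal sketch of this blocking computation, under the formulas as reconstructed above (HB = (H + K - S)/2 without padding, HB = H/2 with full zero padding):

```python
def block_side_length(H, K, S, zero_padded):
    """Side length HB of the four input blocks under the blocking principle."""
    if zero_padded:
        return H // 2
    # Without padding: R = (H - K)/S + 1, the blocked output side is RB = R/2,
    # and mapping RB back through RB = (HB - K)/S + 1 yields HB = (H + K - S)/2.
    R = (H - K) // S + 1
    RB = R // 2
    return (RB - 1) * S + K

# Example: a 256x256 input feature map, 3x3 kernel, stride 1, no padding.
print(block_side_length(256, 3, 1, zero_padded=False))  # -> 129 = (256 + 3 - 1)/2
```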
S5: the operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism in the convolution operation of the FPGA is as follows:

Selecting a suitable parallel strategy for the convolution operation can greatly increase the image recognition speed. Referring to fig. 4, with the convolution operation on the FPGA performed by convolution kernel unfolding, an operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is adopted. The input-channel parallelism PN denotes the number of parallel input channels, the convolution-kernel-unfolding parallelism PW denotes the number of unfolded kernel values, the input-feature-map block parallelism PB denotes the number of parallel blocks, and the output-channel parallelism PM denotes the number of parallel output channels. First, output channels are parallel: each blocked input feature map is convolved with PM convolution kernels simultaneously, producing an output feature map with PM channels after convolution. Then, input channels are parallel: PN input channels run in parallel and their data are operated on simultaneously. Finally, input images are block-parallel: PB blocked feature maps are input in parallel in each input channel, and each feature map is simultaneously multiplied with the PW (PW = K × K) unfolded values of the convolution kernel. Referring to fig. 5, for the input feature map, a line-buffer structured input operation achieves PW parallel multiplications within one clock cycle.
S6: taking on-chip storage resources and computing resources of the FPGA as limiting conditions, and establishing a model process of the Internet of things terminal resources and image recognition time as follows:
(a) Let the total multiplier resource of the FPGA be D_a, the total on-chip storage resource be B_a, and the number of convolution layers in the convolutional neural network model be r. The number of DSP resources occupied by each convolution layer satisfies:

(K_i × K_i × PB_i × PN_i × PM_i) × D_c ≤ D_a (1)

In formula (1), K_i is the side length of the convolution kernel of the i-th layer (1 ≤ i ≤ r), PB_i, PN_i and PM_i are respectively the block parallelism, input-channel parallelism and output-channel parallelism of the i-th layer convolution, and D_c is the number of DSP resources required by a single multiplier.

(b) Each parallel input computation requires the on-chip storage resources to satisfy:

(HB_i × WB_i) × PB_i × PN_i × B_w/B_h + (RB_i × CB_i) × PM_i × PB_i × B_w/B_h + (K_i × K_i) × PM_i × PB_i × PN_i × B_w/B_h ≤ B_a (2)

In formula (2), HB_i and WB_i are respectively the length and width of the blocked input feature map of the i-th layer, RB_i and CB_i are respectively the length and width of the blocked output feature map of the i-th layer, B_w is the bit width of the data, and B_h is the storage depth of a single BRAM block.

(c) The input-channel parallelism PN_i should evenly divide the total number of input channels N_i, i.e. N_i % PN_i = 0. The output-channel parallelism PM_i should also evenly divide the total number of output channels M_i, i.e. M_i % PM_i = 0. The convolution-kernel-unfolding parallelism PW_i of the i-th layer convolution is determined by the kernel size, i.e. PW_i = K_i × K_i. The block parallelism PB_i should be less than the total number of blocks of the input feature map, i.e. PB_i ≤ (H_i × W_i)/(HB_i × WB_i).
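A minimal check of constraints (1) and (2) for a single layer; the resource totals D_a and B_a, the BRAM depth B_h and the per-multiplier DSP cost D_c below are illustrative assumptions, not values given in the patent.

```python
def check_layer_resources(K, PB, PN, PM, HB, WB, RB, CB,
                          Bw=16, Bh=1024, Dc=1, Da=2520, Ba=1824):
    """Return whether one convolution layer fits the DSP and BRAM budgets."""
    dsp = (K * K * PB * PN * PM) * Dc                       # formula (1)
    bram = ((HB * WB) * PB * PN * Bw / Bh                   # input blocks
            + (RB * CB) * PM * PB * Bw / Bh                 # output blocks
            + (K * K) * PM * PB * PN * Bw / Bh)             # kernels, formula (2)
    return dsp <= Da and bram <= Ba

print(check_layer_resources(K=3, PB=2, PN=4, PM=8, HB=34, WB=34, RB=32, CB=32))
```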
(d) In the convolution layers, the hardware execution time TH_i of a single parallel input computation of the i-th layer convolution mainly consists of the input image transmission time, the convolution kernel transmission time and the convolution computation time. TH_i can be expressed by the following formula:

TH_i = (HB_i × WB_i) × t_clk + (K_i × K_i) × WK_i × t_clk + (RB_i × CB_i) × WK_i/PM_i × t_clk (3)

In formula (3), WK_i is the total number of convolution kernels required by the i-th layer convolution and t_clk is the system clock period.

(e) The execution time of each convolution layer is the product of the single parallel input computation time and the number of computations required to finish that layer, so the time to recognize one picture is:

T = Σ_{i=1}^{r} TH_i × X_i (4)

In formula (4), X_i denotes the number of parallel inputs required for the i-th layer convolution to complete the convolution of one input image, given one parallel input computation at a time. X_i is expressed by the following formula:

X_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) (5)

(f) Since the output feature map is obtained by convolving the input feature map, the length of the blocked output feature map is RB_i = (HB_i - K_i)/S + 1 and its width is CB_i = (WB_i - K_i)/S + 1. Substituting formulas (3) and (5) into formula (4), when K_i = 3 and S = 1, gives:

T = Σ_{i=1}^{r} (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) × (HB_i × WB_i + 9 × WK_i + (HB_i - 2) × (WB_i - 2) × WK_i/PM_i) × t_clk (6)

(g) According to formula (6), taking the blocked input feature map to be square (WB_i = HB_i), let

T(HB_i) = (C_i/HB_i^2) × (HB_i^2 + 9 × WK_i + (HB_i - 2)^2 × WK_i/PM_i) × t_clk (7)

where C_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/PB_i collects the factors independent of HB_i.

(h) Combining the resource restriction conditions, the optimization objective is defined as:

min T = Σ_{i=1}^{r} T(HB_i), subject to constraints (1) and (2) (8)
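A minimal evaluation of the timing model of formulas (3)-(5); the two layers below are hypothetical, X_i follows the reconstruction above, and t_clk = 1 simply counts clock cycles.

```python
def layer_time(HB, WB, K, WK, PM, N, M, H, W, PN, PB, t_clk=1.0):
    """TH_i * X_i for one layer with stride S = 1."""
    RB, CB = HB - K + 1, WB - K + 1
    TH = (HB * WB + K * K * WK + RB * CB * WK / PM) * t_clk       # formula (3)
    X = (N / PN) * (M / PM) * (H * W) / (HB * WB * PB)            # formula (5)
    return TH * X

layers = [  # hypothetical parameters for two convolution layers
    dict(HB=34, WB=34, K=3, WK=16, PM=8, N=3,  M=16, H=256, W=256, PN=3, PB=2),
    dict(HB=18, WB=18, K=3, WK=32, PM=8, N=16, M=32, H=128, W=128, PN=4, PB=2),
]
T = sum(layer_time(**p) for p in layers)                          # formula (4)
print(f"estimated cycles per picture: {T:.0f}")
```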
3) Model solving stage
After the model of terminal resources and recognition time is established, the relevant parameters must be substituted in to solve the model for the optimal image block size and convolution parallelism parameters, so that the FPGA can read the corresponding data according to these parameters for computation. The specific procedure is as follows.
S7: the optimal image block size and convolution parallelism parameters are solved according to the model, and the input feature map is blocked according to the blocking principle; the blocked input feature map is stored in the off-chip DDR3. The specific steps are as follows:
a: differentiating T(HB_i) in formula (7) gives:

T'(HB_i) = C_i × t_clk × WK_i × (4 × HB_i - 8 - 2 × PW_i × PM_i)/(PM_i × HB_i^3) (9)

b: setting T'(HB_i) = 0 yields HB_i = 2 + (PW_i × PM_i)/2; the second derivative of T(HB_i) is then computed, and substituting HB_i = 2 + (PW_i × PM_i)/2 into the second derivative T''(HB_i) gives T''(HB_i) > 0, indicating that T(HB_i) decreases monotonically on the interval (0, 2 + (PW_i × PM_i)/2] and reaches its minimum when HB_i = 2 + (PW_i × PM_i)/2 is satisfied, at which point the image recognition time is shortest.

c: According to the on-chip resource limitations of formulas (1) and (2), the input-channel parallelism PN, the convolution-kernel-unfolding parallelism PW, the input-feature-map block parallelism PB and the output-channel parallelism PM are determined. The parameters are substituted into HB_i = 2 + (PW_i × PM_i)/2 to solve the side length HB_i of the input feature map block.
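A minimal sketch of steps b and c: the stationary point HB_i = 2 + (PW_i × PM_i)/2 is computed directly, and the feasible block side nearest to it is chosen; the candidate list stands in for the sizes permitted by the blocking principle and by constraints (1) and (2).

```python
def optimal_block_side(PW, PM):
    """Stationary point of T(HB_i) from step b: HB* = 2 + PW * PM / 2."""
    return 2 + PW * PM / 2

def choose_block_side(PW, PM, candidates):
    """Pick the feasible block side length closest to the optimum."""
    hb_star = optimal_block_side(PW, PM)
    return min(candidates, key=lambda hb: abs(hb - hb_star))

# Example: 3x3 kernel (PW = 9) and PM = 8 give HB* = 38.
print(optimal_block_side(9, 8))                   # 38.0
print(choose_block_side(9, 8, [32, 36, 40, 48]))  # 36 (ties go to the first candidate)
```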
d: According to the obtained HB_i value, combined with the blocking principle and taking the regularity requirement of the blocks into account, the specific side length l of each blocked image is determined near the HB_i value, together with the side length λ of the original image and the first address a of the block image. The input feature map is read and blocked as follows: the ARM end first reads l data starting from the first-row address a and writes them into off-chip storage at consecutive addresses, then obtains the first address of the next row as a + λ, reads l data again, and writes them after the last address stored in the off-chip memory, until l × l data have been read. The reading and writing of data proceed in a pipelined manner, as shown in fig. 6.
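A minimal model of this read-and-write pattern, with a flattened NumPy array standing in for the DDR address space; (a_row, a_col) plays the role of the first address a, and stepping to the next row adds the original row length λ:

```python
import numpy as np

def copy_block(feature_map, a_row, a_col, l):
    """Gather an l x l block row by row into contiguous storage."""
    lam = feature_map.shape[1]                   # original side length (lambda)
    flat = feature_map.ravel()                   # model of the DDR address space
    block = np.empty(l * l, dtype=feature_map.dtype)
    addr = a_row * lam + a_col                   # first address a
    for r in range(l):
        block[r * l:(r + 1) * l] = flat[addr:addr + l]  # read l data, append them
        addr += lam                              # next row's first address: a + lambda
    return block.reshape(l, l)

fm = np.arange(16, dtype=np.int16).reshape(4, 4)
print(copy_block(fm, 0, 0, 2))  # [[0 1] [4 5]]
```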
S8: the FPGA reads the convolution kernel weight data, the bias data and the blocked input feature map data according to the solved convolution parallelism parameters, and operates on the data to finally obtain the image recognition result; the specific computation process is as follows:

After the control unit module of the FPGA receives the working instruction sent by the ARM, it controls the data flow through each module of the FPGA: the DMA module is controlled to read the convolution kernel data, bias data and input feature map data into the on-chip cache; the data are then sent to the convolution computation module for the convolution operation; after the convolution operation, the feature map passes through the activation function and is sent to the pooling module for pooling; finally the data are input to the fully-connected layer module, the image recognition result is obtained, and the result is sent back to the off-chip DDR3 for storage, where it can be read by the ARM.
According to the design method of the fast image recognition accelerator based on the convolutional neural network, a pipeline scheme of software and hardware cooperative operation is designed, the ARM and the FPGA can cooperatively process data, and the calculation performance is greatly improved; meanwhile, an image blocking method is provided, an operation strategy combining image blocking parallelism, input channel parallelism and convolution kernel unfolding parallelism is adopted, a model of terminal resources and identification time is established based on the operation strategy, and the optimal image blocking size and convolution parallelism parameters are obtained by solving the model, so that the resource utilization rate is improved, the rapid identification of the image is realized, and the rapid identification of the image on the resource-limited internet of things terminal is facilitated.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (9)

1. A design method of a fast image recognition accelerator based on a convolutional neural network is characterized in that: the method comprises the following steps:
s1: on an Internet of things terminal based on the combination of an ARM and an FPGA, the ARM acquires various target images in a living scene through a camera module;
s2: the ARM processes the acquired target image data and uses the processed target image data as an input characteristic diagram, then performs data processing on convolution kernel data and offset data, generates related instruction data, and finally stores the data in an off-chip DDR memory;
s3: designing a software and hardware cooperative operation pipeline scheme based on the terminal of the Internet of things;
s4: aiming at the input feature map in the step S2, establishing an input feature map blocking principle;
s5: in the convolution operation of the FPGA, an operation strategy of combining image block parallel, input channel parallel and output channel parallel is adopted;
s6: taking the on-chip storage resources and the computing resources of the FPGA as limiting conditions, and establishing a model of the Internet of things terminal resources and the image recognition time based on the strategy in the step S5;
s7: solving the optimal image block size and convolution parallel parameters of the model in the step S6, and blocking the input feature map according to the blocking principle in the step S4, wherein the blocked input feature map is stored in an off-chip DDR memory;
s8: and the FPGA reads the convolution kernel weight data and the offset data according to the convolution parallel parameters solved in the step S7, reads the input characteristic diagram data partitioned in the step S7, and performs operation on the data to obtain an image identification result.
2. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S1, the various objects in the life scene are common objects, including vehicles, ships, fruits, animals, people, keyboards, mice, and computers.
3. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: step S2 specifically includes: the collected target image is processed, namely the size of the target image is adjusted to be consistent with the size of an input characteristic diagram of the convolutional neural network model;
the data processing of the convolution kernel data and the bias data is to quantize the convolution kernel data and the bias data obtained in the forward training of the convolution neural network into 16-bit fixed point numbers;
and the generated related instruction data are related parameters for controlling the running state of the FPGA.
4. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S3, the software and hardware cooperative pipeline scheme is as follows:
first, the ARM end performs the software blocking and data preparation for picture 1, and then outputs the data to the FPGA end for hardware processing; meanwhile, the ARM end performs the blocking and data preparation for picture 2, thereby achieving the effect of a cooperative processing pipeline.
5. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S4, the input feature map has N channels and size H × W; the convolution kernels have N channels, size K × K, the number of kernels is M, and the sliding stride is S; the output feature map has M channels and size R × C; the length and width of the input feature map are equal, i.e. H = W, and the length and width of the output feature map are equal, i.e. R = C;

the image blocking principle is as follows: when the input feature map is not zero-padded, the side length of the output feature map satisfies

R = (H - K)/S + 1

the output feature map is divided into four small output blocks, so the side length of the blocked output feature map is RB = R/2; according to the relation between the input and output feature maps,

RB = (HB - K)/S + 1 and R = (H - K)/S + 1

and solving these two formulas gives

HB = (H + K - S)/2

so the input feature map with side length H is divided into four input blocks with side length HB; when the input feature map is zero-padded, the input feature map is directly divided into four input blocks with side length

HB = H/2
6. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S5, the input-channel parallelism PN denotes the number of parallel input channels, the convolution-kernel-unfolding parallelism PW denotes the number of unfolded kernel values, the input-feature-map block parallelism PB denotes the number of parallel blocks, and the output-channel parallelism PM denotes the number of parallel output channels;

the operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is as follows: output channels are parallel, meaning that each blocked input feature map is convolved with PM convolution kernels simultaneously, producing an output feature map with PM channels after convolution; input channels are parallel, meaning that PN input channels run in parallel and the data of the PN input channels are operated on simultaneously; input images are block-parallel, meaning that PB blocked feature maps in each input channel are input in parallel, and each feature map is simultaneously multiplied with the PW unfolded values of the convolution kernel.
7. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S6, taking the on-chip storage resources and computing resources of the FPGA as limiting conditions, the process of establishing the model of Internet-of-Things terminal resources and image recognition time is as follows:

S61: let the total multiplier resource of the FPGA be D_a, the total on-chip storage resource be B_a, and the number of convolution layers in the convolutional neural network model be r; the number of DSP resources occupied by each convolution layer satisfies:

(K_i × K_i × PB_i × PN_i × PM_i) × D_c ≤ D_a

where K_i is the side length of the convolution kernel of the i-th layer (1 ≤ i ≤ r), PB_i, PN_i and PM_i are respectively the block parallelism, input-channel parallelism and output-channel parallelism of the i-th layer convolution, and D_c is the number of DSP resources required by a single multiplier;

S62: each parallel input computation requires the on-chip storage resources to satisfy:

(HB_i × WB_i) × PB_i × PN_i × B_w/B_h + (RB_i × CB_i) × PM_i × PB_i × B_w/B_h + (K_i × K_i) × PM_i × PB_i × PN_i × B_w/B_h ≤ B_a

where HB_i and WB_i are respectively the length and width of the blocked input feature map of the i-th layer, RB_i and CB_i are respectively the length and width of the blocked output feature map of the i-th layer, B_w is the bit width of the data, and B_h is the storage depth of a single BRAM block;

S63: the input-channel parallelism PN_i must evenly divide the total number of input channels N_i, i.e. N_i % PN_i = 0; the output-channel parallelism PM_i must evenly divide the total number of output channels M_i, i.e. M_i % PM_i = 0; the convolution-kernel-unfolding parallelism PW_i of the i-th layer convolution is determined by the kernel size, i.e. PW_i = K_i × K_i; the block parallelism PB_i must not exceed the total number of blocks of the input feature map, i.e. PB_i ≤ (H_i × W_i)/(HB_i × WB_i);

S64: in the convolution layers, the hardware execution time TH_i of a single parallel input computation of the i-th layer convolution consists of the input image transmission time, the convolution kernel transmission time and the convolution computation time; TH_i is expressed by the following formula:

TH_i = (HB_i × WB_i) × t_clk + (K_i × K_i) × WK_i × t_clk + (RB_i × CB_i) × WK_i/PM_i × t_clk

where WK_i is the total number of convolution kernels required by the i-th layer convolution and t_clk is the system clock period;

S65: the execution time of each convolution layer is the product of the single parallel input computation time and the number of computations required to finish that layer, so the time to recognize one picture is:

T = Σ_{i=1}^{r} TH_i × X_i

where X_i denotes the number of parallel inputs required for the i-th layer convolution to complete the convolution of one input image, given one parallel input computation at a time; X_i is expressed by the following formula:

X_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i)

S66: since the output feature map is obtained by convolving the input feature map, the length of the blocked output feature map is RB_i = (HB_i - K_i)/S + 1 and its width is CB_i = (WB_i - K_i)/S + 1; substituting the formulas of steps S64 and S66 into S65, when K_i = 3 and S = 1, gives:

T = Σ_{i=1}^{r} (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) × (HB_i × WB_i + 9 × WK_i + (HB_i - 2) × (WB_i - 2) × WK_i/PM_i) × t_clk

S67: according to step S66, taking the blocked input feature map to be square (WB_i = HB_i), let

T(HB_i) = (C_i/HB_i^2) × (HB_i^2 + 9 × WK_i + (HB_i - 2)^2 × WK_i/PM_i) × t_clk

where C_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/PB_i collects the factors independent of HB_i;

S68: combining the resource restriction conditions, the optimization objective is defined as:

min T = Σ_{i=1}^{r} T(HB_i), subject to the constraints of steps S61, S62 and S63.
8. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S7, the model is solved to obtain the optimal image block size and convolution parallelism parameters, and the input feature map is blocked according to the blocking principle, specifically including the following steps:

S71: differentiating T(HB_i) in step S67 gives:

T'(HB_i) = C_i × t_clk × WK_i × (4 × HB_i - 8 - 2 × PW_i × PM_i)/(PM_i × HB_i^3)

S72: setting T'(HB_i) = 0 yields HB_i = 2 + (PW_i × PM_i)/2; the second derivative of T(HB_i) is then computed, and substituting HB_i = 2 + (PW_i × PM_i)/2 into the second derivative T''(HB_i) gives T''(HB_i) > 0, indicating that T(HB_i) decreases monotonically on the interval (0, 2 + (PW_i × PM_i)/2] and that, when HB_i = 2 + (PW_i × PM_i)/2 is satisfied, T(HB_i) reaches its minimum and the image recognition time is shortest;

S73: the convolution parallelism parameters PN, PW, PB and PM are determined according to the resource restriction conditions of S61 and S62, and are substituted into step S72 to solve the side length HB_i of the input feature map block according to HB_i = 2 + (PW_i × PM_i)/2;

S74: after the value of HB_i is obtained in step S73, the side length l of each blocked image is determined in combination with the blocking principle, and the input feature map is then blocked according to the side length λ of the original input feature map and the first address a of the block image: the ARM reads l data starting from the first-row address a and writes them to consecutive addresses in the off-chip memory, then obtains the first address of the next row as a + λ and reads l data again, writing them after the last address stored in the off-chip memory, until l × l data have been read.
9. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S8, the FPGA uses the input feature map data, convolution kernel data and offset data to perform operations, including the convolution operation, activation function operation, pooling operation and fully-connected layer operation.
CN202011486673.4A 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network Active CN112508184B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011486673.4A CN112508184B (en) 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011486673.4A CN112508184B (en) 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN112508184A 2021-03-16
CN112508184B 2022-04-29

Family

ID=74972653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011486673.4A Active CN112508184B (en) 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN112508184B

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010469B (en) * 2021-03-18 2023-05-26 恒睿(重庆)人工智能技术研究院有限公司 Image feature extraction method, device and computer readable storage medium
CN112991144B (en) * 2021-05-10 2021-08-24 同方威视技术股份有限公司 Method and system for partitioning image data of neural network
CN115470176B (en) * 2021-06-10 2024-04-09 中科寒武纪科技股份有限公司 Computing device, method for implementing convolution operation by utilizing computing device and related product
CN113554095B (en) * 2021-07-26 2022-08-19 湖南国科微电子股份有限公司 Feature map processing method and device and computer equipment
CN113705803A (en) * 2021-08-31 2021-11-26 南京大学 Image hardware identification system based on convolutional neural network and deployment method
CN113688069B (en) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN113792687A (en) * 2021-09-18 2021-12-14 兰州大学 Human intrusion behavior early warning system based on monocular camera
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN107392308A (en) * 2017-06-20 2017-11-24 中国科学院计算技术研究所 A kind of convolutional neural networks accelerated method and system based on programming device
CN107766812A (en) * 2017-10-12 2018-03-06 东南大学—无锡集成电路技术研究所 A kind of real-time face detection identifying system based on MiZ702N
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN108932548A (en) * 2018-05-22 2018-12-04 中国科学技术大学苏州研究院 A kind of degree of rarefication neural network acceleration system based on FPGA
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN110503127A (en) * 2018-05-17 2019-11-26 国际商业机器公司 The acceleration of convolutional neural networks on analog array
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111416743A (en) * 2020-03-19 2020-07-14 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111860784A (en) * 2020-07-24 2020-10-30 上海仪电(集团)有限公司中央研究院 Convolutional neural recognition system and method based on ARM and FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10802992B2 (en) * 2016-08-12 2020-10-13 Xilinx Technology Beijing Limited Combining CPU and special accelerator for implementing an artificial neural network


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Di Wu et al.; A High-performance CNN Processor Based on FPGA for MobileNets; 2019 29th International Conference on Field Programmable Logic and Applications (FPL); 20160929; 136-143 *
XIANGHONG HU et al.; A Resources-Efficient Configurable Accelerator for Deep Convolutional Neural Networks; IEEE Access; 20190613; 72113-72124 *
Liqiang Lu et al.; Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs; 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines; 20170703; 101-108 *
Xiong Jun; FPGA deep learning acceleration based on convolutional neural network; arXiv; 20201117; 1-7 *
Lei Xiaokang et al.; Fixed-point acceleration of convolutional neural networks based on FPGA; Journal of Computer Applications; 20201010; vol. 40, no. 10; 2811-2816 *
Wang Kun et al.; Framework design of a real-time recognition hardware system based on deep learning; Artificial Intelligence; 20181006; vol. 44, no. 10; 11-14 *
Wu Yanxia et al.; Progress and trends of deep learning FPGA accelerators; Chinese Journal of Computers; 20190114; vol. 42, no. 11; 2461-2480 *
Zeng Chenglong et al.; Design of a high-performance convolutional neural network accelerator for embedded FPGAs; Journal of Computer-Aided Design & Computer Graphics; 20190930; vol. 31, no. 9; 1643-1652 *

Also Published As

Publication number Publication date
CN112508184A 2021-03-16


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant