CN112508184B - Design method of fast image recognition accelerator based on convolutional neural network

Design method of fast image recognition accelerator based on convolutional neural network

Info

Publication number
CN112508184B
CN112508184B
Authority
CN
China
Prior art keywords
input
data
convolution
parallel
image
Prior art date
Legal status
Active
Application number
CN202011486673.4A
Other languages
Chinese (zh)
Other versions
CN112508184A
Inventor
向敏
刘榆
赵小翔
周闰
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202011486673.4A
Publication of CN112508184A
Application granted
Publication of CN112508184B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention relates to a design method of a fast image recognition accelerator based on a convolutional neural network, and belongs to the technical field of image recognition. The method comprises the following steps: 1) on an Internet-of-Things terminal combining an ARM and an FPGA, the ARM end configures the parameters of the camera and processes the collected image data and weight data; 2) a pipeline processing scheme combining software and hardware is designed, an operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is adopted, and a model of terminal resources and recognition time is established based on this strategy; 3) the optimal image block size and convolution parallelism parameters are obtained by solving the model, and a convolutional neural network model is constructed on the FPGA end to recognize the image. The method can make full use of the on-chip resources of a resource-limited Internet-of-Things terminal, improving resource utilization and effectively increasing image recognition speed.

Description

Design method of fast image recognition accelerator based on convolutional neural network
Technical Field
The invention belongs to the technical field of image recognition, and relates to a design method of a fast image recognition accelerator based on a convolutional neural network.
Background
With the development of the Internet of Things and artificial intelligence, requirements for target detection and face recognition keep rising in Internet-of-Things scenes such as intelligent traffic and video monitoring, and the Convolutional Neural Network (CNN), as a common algorithm for image recognition, plays a crucial role in such scenes. However, the convolutional neural network algorithm consumes a large amount of resources, while Internet-of-Things terminal devices are usually resource-constrained and have strict requirements on cost, power consumption and real-time performance. How to apply the convolutional neural network algorithm on a resource-limited Internet-of-Things terminal to achieve maximum utilization of terminal resources and fast image recognition is a current research focus.
The FPGA provides a feasible scheme for deploying deep-learning algorithms on Internet-of-Things terminals. Although good computing performance has been achieved on FPGA-based Internet-of-Things terminals, few designs consider the resource-limited terminal, and room for improvement remains. To achieve fast image recognition on a resource-limited Internet-of-Things terminal, a software and hardware cooperative pipeline architecture is first designed so that the ARM end and the FPGA end process data cooperatively, greatly improving computing performance; then, in accordance with the FPGA's characteristic of scarce on-chip storage resources but abundant computing resources, a parallel strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is adopted; finally, a model of terminal resources and recognition time is established, and the optimal image block size and convolution parallelism parameters are solved while satisfying the constraints of on-chip storage and computing resources. Resource utilization is thereby improved on the resource-limited Internet-of-Things terminal, and image recognition speed is effectively increased.
Disclosure of Invention
In view of the above, the present invention provides a method for designing a fast image recognition accelerator based on a convolutional neural network, which includes designing a pipeline processing scheme combining software and hardware, adopting an operation strategy combining image block parallelism, input channel parallelism and output channel parallelism, establishing a model of terminal resources and recognition time based on the operation strategy, and solving the model to obtain an optimal image block size and convolutional parallelism parameters to realize fast image recognition.
In order to achieve the purpose, the invention provides the following technical scheme:
a design method of a fast image recognition accelerator based on a convolutional neural network comprises the following steps:
s1: on an Internet of things terminal based on the combination of an ARM and an FPGA, the ARM acquires various target images in a living scene through a camera module;
s2: the ARM processes the acquired target image data and uses the processed target image data as an input characteristic diagram, then performs data processing on convolution kernel data and offset data, generates related instruction data, and finally stores the data in an off-chip DDR memory;
s3: designing a software and hardware cooperative operation pipeline scheme based on the terminal of the Internet of things;
s4: aiming at the input feature map in the step S2, establishing an input feature map blocking principle;
s5: in the convolution operation of the FPGA, an operation strategy of combining image block parallel, input channel parallel and output channel parallel is adopted;
s6: taking the on-chip storage resources and the computing resources of the FPGA as limiting conditions, and establishing a model of the Internet of things terminal resources and the image recognition time based on the strategy in the step S5;
s7: solving the optimal image block size and convolution parallel parameters of the model in the step S6, and blocking the input feature map according to the blocking principle in the step S4, wherein the blocked input feature map is stored in an off-chip DDR memory;
s8: and the FPGA reads the convolution kernel weight data and the offset data according to the convolution parallel parameters solved in the step S7, reads the input characteristic diagram data partitioned in the step S7, and performs operation on the data to obtain an image identification result.
Further, in step S1, the various objects in the life scene are common objects, including vehicles, ships, fruits, animals, people, keyboards, mice and computers.
Further, step S2 specifically includes: the collected target image is processed, namely the size of the target image is adjusted to be consistent with the size of an input characteristic diagram of the convolutional neural network model;
the data processing of the convolution kernel data and the bias data is to quantize the convolution kernel data and the bias data obtained in the forward training of the convolution neural network into 16-bit fixed point numbers;
and the generated related instruction data are related parameters for controlling the running state of the FPGA.
Further, in step S3, the software and hardware cooperative pipeline scheme is as follows:
first, the ARM end performs the software blocking and data preparation for picture 1, and then outputs the data to the FPGA end for hardware processing; meanwhile, the ARM end performs the blocking and data preparation for picture 2, thereby achieving the effect of a cooperative processing pipeline.
Further, in step S4, the input feature map has N channels and size H × W; the convolution kernels have N channels, size K × K, the number of kernels is M, and the sliding stride is S; the output feature map has M channels and size R × C; the length and width of the input feature map are equal, i.e. H = W, and the length and width of the output feature map are equal, i.e. R = C;

the image blocking principle is as follows: when the input feature map is not zero-padded, the side length of the output feature map satisfies

R = (H - K)/S + 1

the output feature map is divided into four small output blocks, so the side length of the blocked output feature map is RB = R/2; according to the relation between the input and output feature maps,

RB = (HB - K)/S + 1 and R = (H - K)/S + 1

and solving these two formulas gives

HB = (H + K - S)/2

so the input feature map with side length H is divided into four input blocks with side length HB; when the input feature map is zero-padded, the input feature map is directly divided into four input blocks with side length

HB = H/2
Further, in step S5, the input channel parallelism is PN representing the number of parallel input channels, the convolution kernel unfolding parallelism is PW representing the number of convolution kernel expansions, the input feature map block parallelism is PB representing the number of block parallels, and the output channel parallelism is PM representing the number of parallel output channels;
the operation strategy of combining the image block parallel, the input channel parallel and the output channel parallel is as follows: parallel output channels, representing that each input feature map after partitioning is simultaneously convoluted with PM convolution kernels, and generating output feature maps with PM channels after the convolution; the input channels are parallel, representing that PN input channels are parallel, and data among the PN input channels are operated simultaneously; and (3) the input images are parallel in blocks, representing that PB blocks of feature maps in each input channel are input in parallel, and each feature map is simultaneously multiplied by the number of PW (pseudo wire) expanded by the convolution kernel.
Further, in step S6, taking the on-chip storage resources and computing resources of the FPGA as limiting conditions, the process of establishing the model of Internet-of-Things terminal resources and image recognition time is as follows:

S61: let the total multiplier resource of the FPGA be D_a, the total on-chip storage resource be B_a, and the number of convolution layers in the convolutional neural network model be r; the number of DSP resources occupied by each convolution layer satisfies:

(K_i × K_i × PB_i × PN_i × PM_i) × D_c ≤ D_a

where K_i is the side length of the convolution kernel of the i-th layer (1 ≤ i ≤ r), PB_i, PN_i and PM_i are respectively the block parallelism, input-channel parallelism and output-channel parallelism of the i-th layer convolution, and D_c is the number of DSP resources required by a single multiplier;

S62: each parallel input computation requires the on-chip storage resources to satisfy:

(HB_i × WB_i) × PB_i × PN_i × B_w/B_h + (RB_i × CB_i) × PM_i × PB_i × B_w/B_h + (K_i × K_i) × PM_i × PB_i × PN_i × B_w/B_h ≤ B_a

where HB_i and WB_i are respectively the length and width of the blocked input feature map of the i-th layer, RB_i and CB_i are respectively the length and width of the blocked output feature map of the i-th layer, B_w is the bit width of the data, and B_h is the storage depth of a single BRAM block;

S63: the input-channel parallelism PN_i must evenly divide the total number of input channels N_i, i.e. N_i % PN_i = 0; the output-channel parallelism PM_i must evenly divide the total number of output channels M_i, i.e. M_i % PM_i = 0; the convolution-kernel-unfolding parallelism PW_i of the i-th layer convolution is determined by the kernel size, i.e. PW_i = K_i × K_i; the block parallelism PB_i must not exceed the total number of blocks of the input feature map, i.e. PB_i ≤ (H_i × W_i)/(HB_i × WB_i);

S64: in the convolution layers, the hardware execution time TH_i of a single parallel input computation of the i-th layer convolution consists of the input image transmission time, the convolution kernel transmission time and the convolution computation time; TH_i is expressed by the following formula:

TH_i = (HB_i × WB_i) × t_clk + (K_i × K_i) × WK_i × t_clk + (RB_i × CB_i) × WK_i/PM_i × t_clk

where WK_i is the total number of convolution kernels required by the i-th layer convolution and t_clk is the system clock period;

S65: the execution time of each convolution layer is the product of the single parallel input computation time and the number of computations required to finish that layer, so the time to recognize one picture is:

T = Σ_{i=1}^{r} TH_i × X_i

where X_i denotes the number of parallel inputs required for the i-th layer convolution to complete the convolution of one input image, given one parallel input computation at a time; X_i is expressed by the following formula:

X_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i)

S66: since the output feature map is obtained by convolving the input feature map, the length of the blocked output feature map is RB_i = (HB_i - K_i)/S + 1 and its width is CB_i = (WB_i - K_i)/S + 1; substituting the formulas of steps S64 and S66 into S65, when K_i = 3 and S = 1, gives:

T = Σ_{i=1}^{r} (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) × (HB_i × WB_i + 9 × WK_i + (HB_i - 2) × (WB_i - 2) × WK_i/PM_i) × t_clk

S67: according to step S66, taking the blocked input feature map to be square (WB_i = HB_i), let

T(HB_i) = (C_i/HB_i^2) × (HB_i^2 + 9 × WK_i + (HB_i - 2)^2 × WK_i/PM_i) × t_clk

where C_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/PB_i collects the factors independent of HB_i;

S68: combining the resource restriction conditions, the optimization objective is defined as:

min T = Σ_{i=1}^{r} T(HB_i), subject to the constraints of steps S61, S62 and S63.
further, in step S7, the model is solved to obtain the optimal image block size and the convolution parallel parameters, and the input feature map is blocked according to the blocking principle, which specifically includes the following steps:
s71: for T (HB) in step S67i) And (5) derivation to obtain:
Figure BDA0002839478300000051
s72: let T' (HB)i) 0 to yield HBi=2+(PWi×PMi) 2, then for T (HB)i) Calculating the second derivative, and adding HBi=2+(PWi×PMi) The second derivative T' (HB) is introduced by/2i) In (b), T' (HB) is obtainedi) > 0, indicating T (HB)i) In the interval (0, 2+ (PW)i×PMi)/2]Monotonically decreasing and at T (HB)i) Satisfies HBi=2+(PWi×PMi) At the condition of/2, T (HB)i) Obtaining a minimum value, wherein the image recognition time is shortest;
s73: determining convolution parallel parameters PN, PW, PB and PM according to the resource limitation conditions of S61 and S62; and taken to step S72, according to HBi=2+(PWi×PMi) Solving side length HB of input feature map blocki
S74: obtaining HB according to step S73iAfter the value is obtained, the block principle is combined to determine the length l of each block of image after the block is formed, and then the input characteristic is subjected to the input characteristic matching according to the length lambda of the original input characteristic graph and the first address a of the block imagePartitioning the graph: the ARM reads l data according to the first row address a and writes the data into an off-chip memory with continuous addresses, then obtains the first address of a secondary row according to a + l and reads the data again, and writes the data into the last row of addresses stored in the off-chip memory until l multiplied by l data are read.
Further, in step S8, the FPGA uses the input feature map data, convolution kernel data and offset data to perform operations, including the convolution operation, activation function operation, pooling operation and fully-connected layer operation.
The invention has the beneficial effects that: the invention designs a pipeline scheme of software and hardware cooperative operation, realizes the cooperative processing of data by ARM and FPGA, and greatly improves the computing performance; meanwhile, an image blocking method is provided, an operation strategy combining image blocking parallelism, input channel parallelism and convolution kernel unfolding parallelism is adopted, a model of terminal resources and identification time is established based on the operation strategy, and the optimal image blocking size and convolution parallelism parameters are obtained by solving the model, so that the resource utilization rate is improved, the rapid identification of the image is realized, and the rapid identification of the image on the resource-limited internet of things terminal is facilitated.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a general flow chart of the fast image recognition accelerator based on the convolutional neural network according to the present invention;
FIG. 2 is a schematic diagram of a software and hardware co-production line scheme according to the present invention;
FIG. 3 is a block diagram of an input feature map according to the present invention;
FIG. 4 is a schematic diagram of parallel computing according to the present invention;
FIG. 5 is a schematic diagram of the input operation of the line cache architecture of the present invention;
FIG. 6 is a diagram illustrating the reading and writing of data in a pipeline according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustrative purposes only and are not intended to limit the invention; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of the actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that orientation or positional terms such as "upper", "lower", "left", "right", "front" and "rear" are based on the orientations or positional relationships shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; such terms are therefore used for illustrative purposes only and are not to be construed as limiting the present invention, and their specific meaning can be understood by those skilled in the art according to the specific situation.
As shown in fig. 1, for the overall flowchart of the fast image recognition accelerator based on the convolutional neural network, the ARM is responsible for configuring parameters of the camera to collect image data and process the collected image, the ARM also needs to send instruction data and preprocess the weight data, and finally writes the data into the off-chip DDR memory and informs the FPGA to read the data. The FPGA needs to realize a control unit module, an on-chip memory module, a DMA read-write module, a convolution calculation module, a pooling module and a full connection layer operation module. The control unit module is responsible for receiving an instruction of the ARM end, controls data to run in each module of the FPGA after receiving the instruction, and finally outputs an image recognition result after convolution, function activation, pooling and full connection layer. And finally, returning the calculation result to the off-chip DDR memory, and then reading and further processing the calculation result by the ARM end. Each of which will be described in detail below.
1) Image data acquisition and processing stage
In consideration of the fact that the target image needs to be identified on the internet of things terminal, image data needs to be collected on the internet of things terminal, and relevant data needs to be processed.
S1: on the Internet-of-Things terminal based on the combination of the ARM and the FPGA, the ARM configures the camera module to collect various targets in living scenes. This step mainly involves the ARM configuring parameters of the camera module such as the image acquisition resolution and the output pixel format of the image data, and collecting common target images including but not limited to vehicles, ships, fruits, animals, people, keyboards, mice and computers.
S2: the ARM first processes the collected image data and uses it as the input feature map, then processes the convolution kernel data and bias data and generates the related instruction data, and finally stores the data in the off-chip memory. The specific operations are as follows:

First, the size of the collected target image data is adjusted: the convolutional neural network model is set to yolov2-tiny, whose input feature map is 256 × 256, while the collected target image is 640 × 480, so the image is converted and scaled down to 256 × 256 and used as the input feature map. The convolution kernel data and bias data are processed by discarding the redundant data bits of the 32-bit floating-point convolution kernel and bias data and quantizing them into fixed-point data with 16-bit precision. The generated instruction data are the parameters that control the running state of the FPGA, including instructions for when to start the FPGA hardware, the related parameters of the convolution operation, and so on.
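To illustrate this quantization step, the following is a minimal Python sketch of converting 32-bit floating-point weights to 16-bit fixed-point values; the Q8.8 split between integer and fractional bits is an assumption for illustration, since the text only specifies 16-bit precision.

```python
import numpy as np

def quantize_fixed16(x, frac_bits=8):
    """Quantize float32 values to 16-bit fixed point (assumed Q8.8 format)."""
    scale = 1 << frac_bits
    q = np.round(x * scale).astype(np.int64)
    # Saturate to the int16 range instead of letting values wrap around.
    return np.clip(q, -(1 << 15), (1 << 15) - 1).astype(np.int16)

def dequantize_fixed16(q, frac_bits=8):
    """Recover approximate float values, e.g. to measure quantization error."""
    return q.astype(np.float32) / (1 << frac_bits)

weights = np.random.randn(16, 3, 3, 3).astype(np.float32)  # stand-in kernels
qw = quantize_fixed16(weights)
print("max abs quantization error:", np.abs(dequantize_fixed16(qw) - weights).max())
```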
2) Model building stage
After the ARM processes the input characteristic diagram data, the convolution kernel data and the weight data and writes the processed data into the off-chip DDR memory, the ARM starts to send instruction data to the FPGA, and informs the FPGA to read the off-chip memory data and perform data operation. Before the FPGA reads data, a corresponding model is established to determine how the FPGA reads the data for hardware calculation. The process of establishing the model of the Internet of things terminal resource and the image recognition time is divided into the following steps.
S3: based on the Internet-of-Things terminal, the software and hardware cooperative pipeline scheme is designed as follows:

Referring to fig. 2, for the overall implementation of the accelerator, a software and hardware cooperative pipeline scheme is designed: first, the ARM is responsible for the software blocking, data preparation and related work for picture 1; the data are then output to the FPGA end for hardware processing, while the ARM end performs the blocking and data preparation for picture 2, thereby achieving the effect of a cooperative processing pipeline. In this way, the time each picture spends being processed in the FPGA hardware is the time required to recognize one picture.
S4: for the input feature map of the convolutional neural network, the input feature map blocking principle is established as follows:

In order to improve the resource utilization of the Internet-of-Things terminal, the blocked feature maps should have consistent sizes and high regularity. Let the input feature map have N channels and size H × W; the convolution kernels have N channels, size K × K, the number of kernels is M, and the stride is S; the output feature map has M channels and size R × C. The length and width of the input feature map are equal, i.e. H = W, and the length and width of the output feature map are equal, i.e. R = C. Referring to fig. 3, the image blocking principle is: when the input feature map is not zero-padded, the side length of the output feature map satisfies

R = (H - K)/S + 1

The output feature map is divided into four small output blocks, so the side length of the blocked output feature map is RB = R/2; according to the relation between the input and output feature maps,

RB = (HB - K)/S + 1 and R = (H - K)/S + 1

and solving these two formulas gives

HB = (H + K - S)/2

so the input feature map with side length H is divided into four input blocks with side length HB; when the input feature map is zero-padded, the input feature map is directly divided into four input blocks with side length HB = H/2.
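A minimal sketch of this blocking computation, under the formulas as reconstructed above (HB = (H + K - S)/2 without padding, HB = H/2 with full zero padding):

```python
def block_side_length(H, K, S, zero_padded):
    """Side length HB of the four input blocks under the blocking principle."""
    if zero_padded:
        return H // 2
    # Without padding: R = (H - K)/S + 1, the blocked output side is RB = R/2,
    # and mapping RB back through RB = (HB - K)/S + 1 yields HB = (H + K - S)/2.
    R = (H - K) // S + 1
    RB = R // 2
    return (RB - 1) * S + K

# Example: a 256x256 input feature map, 3x3 kernel, stride 1, no padding.
print(block_side_length(256, 3, 1, zero_padded=False))  # -> 129 = (256 + 3 - 1)/2
```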
S5: the operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism in the convolution operation of the FPGA is as follows:

Selecting a suitable parallel strategy for the convolution operation can greatly increase the image recognition speed. Referring to fig. 4, with the convolution operation on the FPGA performed by convolution kernel unfolding, an operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is adopted. The input-channel parallelism PN denotes the number of parallel input channels, the convolution-kernel-unfolding parallelism PW denotes the number of unfolded kernel values, the input-feature-map block parallelism PB denotes the number of parallel blocks, and the output-channel parallelism PM denotes the number of parallel output channels. First, output channels are parallel: each blocked input feature map is convolved with PM convolution kernels simultaneously, producing an output feature map with PM channels after convolution. Then, input channels are parallel: PN input channels run in parallel and their data are operated on simultaneously. Finally, input images are block-parallel: PB blocked feature maps are input in parallel in each input channel, and each feature map is simultaneously multiplied with the PW (PW = K × K) unfolded values of the convolution kernel. Referring to fig. 5, for the input feature map, a line-buffer structured input operation achieves PW parallel multiplications within one clock cycle.
S6: taking on-chip storage resources and computing resources of the FPGA as limiting conditions, and establishing a model process of the Internet of things terminal resources and image recognition time as follows:
(a) Let the total multiplier resource of the FPGA be D_a, the total on-chip storage resource be B_a, and the number of convolution layers in the convolutional neural network model be r. The number of DSP resources occupied by each convolution layer satisfies:

(K_i × K_i × PB_i × PN_i × PM_i) × D_c ≤ D_a (1)

In formula (1), K_i is the side length of the convolution kernel of the i-th layer (1 ≤ i ≤ r), PB_i, PN_i and PM_i are respectively the block parallelism, input-channel parallelism and output-channel parallelism of the i-th layer convolution, and D_c is the number of DSP resources required by a single multiplier.

(b) Each parallel input computation requires the on-chip storage resources to satisfy:

(HB_i × WB_i) × PB_i × PN_i × B_w/B_h + (RB_i × CB_i) × PM_i × PB_i × B_w/B_h + (K_i × K_i) × PM_i × PB_i × PN_i × B_w/B_h ≤ B_a (2)

In formula (2), HB_i and WB_i are respectively the length and width of the blocked input feature map of the i-th layer, RB_i and CB_i are respectively the length and width of the blocked output feature map of the i-th layer, B_w is the bit width of the data, and B_h is the storage depth of a single BRAM block.

(c) The input-channel parallelism PN_i should evenly divide the total number of input channels N_i, i.e. N_i % PN_i = 0. The output-channel parallelism PM_i should also evenly divide the total number of output channels M_i, i.e. M_i % PM_i = 0. The convolution-kernel-unfolding parallelism PW_i of the i-th layer convolution is determined by the kernel size, i.e. PW_i = K_i × K_i. The block parallelism PB_i should be less than the total number of blocks of the input feature map, i.e. PB_i ≤ (H_i × W_i)/(HB_i × WB_i).
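A minimal check of constraints (1) and (2) for a single layer; the resource totals D_a and B_a, the BRAM depth B_h and the per-multiplier DSP cost D_c below are illustrative assumptions, not values given in the patent.

```python
def check_layer_resources(K, PB, PN, PM, HB, WB, RB, CB,
                          Bw=16, Bh=1024, Dc=1, Da=2520, Ba=1824):
    """Return whether one convolution layer fits the DSP and BRAM budgets."""
    dsp = (K * K * PB * PN * PM) * Dc                       # formula (1)
    bram = ((HB * WB) * PB * PN * Bw / Bh                   # input blocks
            + (RB * CB) * PM * PB * Bw / Bh                 # output blocks
            + (K * K) * PM * PB * PN * Bw / Bh)             # kernels, formula (2)
    return dsp <= Da and bram <= Ba

print(check_layer_resources(K=3, PB=2, PN=4, PM=8, HB=34, WB=34, RB=32, CB=32))
```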
(d) In the convolution layers, the hardware execution time TH_i of a single parallel input computation of the i-th layer convolution mainly consists of the input image transmission time, the convolution kernel transmission time and the convolution computation time. TH_i can be expressed by the following formula:

TH_i = (HB_i × WB_i) × t_clk + (K_i × K_i) × WK_i × t_clk + (RB_i × CB_i) × WK_i/PM_i × t_clk (3)

In formula (3), WK_i is the total number of convolution kernels required by the i-th layer convolution and t_clk is the system clock period.

(e) The execution time of each convolution layer is the product of the single parallel input computation time and the number of computations required to finish that layer, so the time to recognize one picture is:

T = Σ_{i=1}^{r} TH_i × X_i (4)

In formula (4), X_i denotes the number of parallel inputs required for the i-th layer convolution to complete the convolution of one input image, given one parallel input computation at a time. X_i is expressed by the following formula:

X_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) (5)

(f) Since the output feature map is obtained by convolving the input feature map, the length of the blocked output feature map is RB_i = (HB_i - K_i)/S + 1 and its width is CB_i = (WB_i - K_i)/S + 1. Substituting formulas (3) and (5) into formula (4), when K_i = 3 and S = 1, gives:

T = Σ_{i=1}^{r} (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) × (HB_i × WB_i + 9 × WK_i + (HB_i - 2) × (WB_i - 2) × WK_i/PM_i) × t_clk (6)

(g) According to formula (6), taking the blocked input feature map to be square (WB_i = HB_i), let

T(HB_i) = (C_i/HB_i^2) × (HB_i^2 + 9 × WK_i + (HB_i - 2)^2 × WK_i/PM_i) × t_clk (7)

where C_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/PB_i collects the factors independent of HB_i.

(h) Combining the resource restriction conditions, the optimization objective is defined as:

min T = Σ_{i=1}^{r} T(HB_i), subject to constraints (1) and (2) (8)
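A minimal evaluation of the timing model of formulas (3)-(5); the two layers below are hypothetical, X_i follows the reconstruction above, and t_clk = 1 simply counts clock cycles.

```python
def layer_time(HB, WB, K, WK, PM, N, M, H, W, PN, PB, t_clk=1.0):
    """TH_i * X_i for one layer with stride S = 1."""
    RB, CB = HB - K + 1, WB - K + 1
    TH = (HB * WB + K * K * WK + RB * CB * WK / PM) * t_clk       # formula (3)
    X = (N / PN) * (M / PM) * (H * W) / (HB * WB * PB)            # formula (5)
    return TH * X

layers = [  # hypothetical parameters for two convolution layers
    dict(HB=34, WB=34, K=3, WK=16, PM=8, N=3,  M=16, H=256, W=256, PN=3, PB=2),
    dict(HB=18, WB=18, K=3, WK=32, PM=8, N=16, M=32, H=128, W=128, PN=4, PB=2),
]
T = sum(layer_time(**p) for p in layers)                          # formula (4)
print(f"estimated cycles per picture: {T:.0f}")
```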
3) Model solving stage
After the model of terminal resources and recognition time is established, the relevant parameters must be substituted in to solve the model for the optimal image block size and convolution parallelism parameters, so that the FPGA can read the corresponding data according to these parameters for computation. The specific procedure is as follows.
S7: the optimal image block size and convolution parallelism parameters are solved according to the model, and the input feature map is blocked according to the blocking principle; the blocked input feature map is stored in the off-chip DDR3. The specific steps are as follows:
a: differentiating T(HB_i) in formula (7) gives:

T'(HB_i) = C_i × t_clk × WK_i × (4 × HB_i - 8 - 2 × PW_i × PM_i)/(PM_i × HB_i^3) (9)

b: setting T'(HB_i) = 0 yields HB_i = 2 + (PW_i × PM_i)/2; the second derivative of T(HB_i) is then computed, and substituting HB_i = 2 + (PW_i × PM_i)/2 into the second derivative T''(HB_i) gives T''(HB_i) > 0, indicating that T(HB_i) decreases monotonically on the interval (0, 2 + (PW_i × PM_i)/2] and reaches its minimum when HB_i = 2 + (PW_i × PM_i)/2 is satisfied, at which point the image recognition time is shortest.

c: According to the on-chip resource limitations of formulas (1) and (2), the input-channel parallelism PN, the convolution-kernel-unfolding parallelism PW, the input-feature-map block parallelism PB and the output-channel parallelism PM are determined. The parameters are substituted into HB_i = 2 + (PW_i × PM_i)/2 to solve the side length HB_i of the input feature map block.
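A minimal sketch of steps b and c: the stationary point HB_i = 2 + (PW_i × PM_i)/2 is computed directly, and the feasible block side nearest to it is chosen; the candidate list stands in for the sizes permitted by the blocking principle and by constraints (1) and (2).

```python
def optimal_block_side(PW, PM):
    """Stationary point of T(HB_i) from step b: HB* = 2 + PW * PM / 2."""
    return 2 + PW * PM / 2

def choose_block_side(PW, PM, candidates):
    """Pick the feasible block side length closest to the optimum."""
    hb_star = optimal_block_side(PW, PM)
    return min(candidates, key=lambda hb: abs(hb - hb_star))

# Example: 3x3 kernel (PW = 9) and PM = 8 give HB* = 38.
print(optimal_block_side(9, 8))                   # 38.0
print(choose_block_side(9, 8, [32, 36, 40, 48]))  # 36 (ties go to the first candidate)
```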
d: According to the obtained HB_i value, combined with the blocking principle and taking the regularity requirement of the blocks into account, the specific side length l of each blocked image is determined near the HB_i value, together with the side length λ of the original image and the first address a of the block image. The input feature map is read and blocked as follows: the ARM end first reads l data starting from the first-row address a and writes them into off-chip storage at consecutive addresses, then obtains the first address of the next row as a + λ, reads l data again, and writes them after the last address stored in the off-chip memory, until l × l data have been read. The reading and writing of data proceed in a pipelined manner, as shown in fig. 6.
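A minimal model of this read-and-write pattern, with a flattened NumPy array standing in for the DDR address space; (a_row, a_col) plays the role of the first address a, and stepping to the next row adds the original row length λ:

```python
import numpy as np

def copy_block(feature_map, a_row, a_col, l):
    """Gather an l x l block row by row into contiguous storage."""
    lam = feature_map.shape[1]                   # original side length (lambda)
    flat = feature_map.ravel()                   # model of the DDR address space
    block = np.empty(l * l, dtype=feature_map.dtype)
    addr = a_row * lam + a_col                   # first address a
    for r in range(l):
        block[r * l:(r + 1) * l] = flat[addr:addr + l]  # read l data, append them
        addr += lam                              # next row's first address: a + lambda
    return block.reshape(l, l)

fm = np.arange(16, dtype=np.int16).reshape(4, 4)
print(copy_block(fm, 0, 0, 2))  # [[0 1] [4 5]]
```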
S8: the FPGA reads the convolution kernel weight data, the bias data and the blocked input feature map data according to the solved convolution parallelism parameters, and operates on the data to finally obtain the image recognition result; the specific computation process is as follows:

After the control unit module of the FPGA receives the working instruction sent by the ARM, it controls the data flow through each module of the FPGA: the DMA module is controlled to read the convolution kernel data, bias data and input feature map data into the on-chip cache; the data are then sent to the convolution computation module for the convolution operation; after the convolution operation, the feature map passes through the activation function and is sent to the pooling module for pooling; finally the data are input to the fully-connected layer module, the image recognition result is obtained, and the result is sent back to the off-chip DDR3 for storage, where it can be read by the ARM.
According to the design method of the fast image recognition accelerator based on the convolutional neural network, a pipeline scheme of software and hardware cooperative operation is designed, the ARM and the FPGA can cooperatively process data, and the calculation performance is greatly improved; meanwhile, an image blocking method is provided, an operation strategy combining image blocking parallelism, input channel parallelism and convolution kernel unfolding parallelism is adopted, a model of terminal resources and identification time is established based on the operation strategy, and the optimal image blocking size and convolution parallelism parameters are obtained by solving the model, so that the resource utilization rate is improved, the rapid identification of the image is realized, and the rapid identification of the image on the resource-limited internet of things terminal is facilitated.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (9)

1. A design method of a fast image recognition accelerator based on a convolutional neural network is characterized in that: the method comprises the following steps:
s1: on an Internet of things terminal based on the combination of an ARM and an FPGA, the ARM acquires various target images in a living scene through a camera module;
s2: the ARM processes the acquired target image data and uses the processed target image data as an input characteristic diagram, then performs data processing on convolution kernel data and offset data, generates related instruction data, and finally stores the data in an off-chip DDR memory;
s3: designing a software and hardware cooperative operation pipeline scheme based on the terminal of the Internet of things;
s4: aiming at the input feature map in the step S2, establishing an input feature map blocking principle;
s5: in the convolution operation of the FPGA, an operation strategy of combining image block parallel, input channel parallel and output channel parallel is adopted;
s6: taking the on-chip storage resources and the computing resources of the FPGA as limiting conditions, and establishing a model of the Internet of things terminal resources and the image recognition time based on the strategy in the step S5;
s7: solving the optimal image block size and convolution parallel parameters of the model in the step S6, and blocking the input feature map according to the blocking principle in the step S4, wherein the blocked input feature map is stored in an off-chip DDR memory;
s8: and the FPGA reads the convolution kernel weight data and the offset data according to the convolution parallel parameters solved in the step S7, reads the input characteristic diagram data partitioned in the step S7, and performs operation on the data to obtain an image identification result.
2. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S1, the various objects in the life scene are common objects, including vehicles, ships, fruits, animals, people, keyboards, mice, and computers.
3. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: step S2 specifically includes: the collected target image is processed, namely the size of the target image is adjusted to be consistent with the size of an input characteristic diagram of the convolutional neural network model;
the data processing of the convolution kernel data and the bias data is to quantize the convolution kernel data and the bias data obtained in the forward training of the convolution neural network into 16-bit fixed point numbers;
and the generated related instruction data are related parameters for controlling the running state of the FPGA.
4. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S3, the software and hardware cooperative pipeline scheme is as follows:
first, the ARM end performs the software blocking and data preparation for picture 1, and then outputs the data to the FPGA end for hardware processing; meanwhile, the ARM end performs the blocking and data preparation for picture 2, thereby achieving the effect of a cooperative processing pipeline.
5. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S4, the input feature map has N channels and size H × W; the convolution kernels have N channels, size K × K, the number of kernels is M, and the sliding stride is S; the output feature map has M channels and size R × C; the length and width of the input feature map are equal, i.e. H = W, and the length and width of the output feature map are equal, i.e. R = C;

the image blocking principle is as follows: when the input feature map is not zero-padded, the side length of the output feature map satisfies

R = (H - K)/S + 1

the output feature map is divided into four small output blocks, so the side length of the blocked output feature map is RB = R/2; according to the relation between the input and output feature maps,

RB = (HB - K)/S + 1 and R = (H - K)/S + 1

and solving these two formulas gives

HB = (H + K - S)/2

so the input feature map with side length H is divided into four input blocks with side length HB; when the input feature map is zero-padded, the input feature map is directly divided into four input blocks with side length

HB = H/2
6. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S5, the input-channel parallelism PN denotes the number of parallel input channels, the convolution-kernel-unfolding parallelism PW denotes the number of unfolded kernel values, the input-feature-map block parallelism PB denotes the number of parallel blocks, and the output-channel parallelism PM denotes the number of parallel output channels;

the operation strategy combining image-block parallelism, input-channel parallelism and output-channel parallelism is as follows: output channels are parallel, meaning that each blocked input feature map is convolved with PM convolution kernels simultaneously, producing an output feature map with PM channels after convolution; input channels are parallel, meaning that PN input channels run in parallel and the data of the PN input channels are operated on simultaneously; input images are block-parallel, meaning that PB blocked feature maps in each input channel are input in parallel, and each feature map is simultaneously multiplied with the PW unfolded values of the convolution kernel.
7. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S6, taking the on-chip storage resources and computing resources of the FPGA as limiting conditions, the process of establishing the model of Internet-of-Things terminal resources and image recognition time is as follows:

S61: let the total multiplier resource of the FPGA be D_a, the total on-chip storage resource be B_a, and the number of convolution layers in the convolutional neural network model be r; the number of DSP resources occupied by each convolution layer satisfies:

(K_i × K_i × PB_i × PN_i × PM_i) × D_c ≤ D_a

where K_i is the side length of the convolution kernel of the i-th layer (1 ≤ i ≤ r), PB_i, PN_i and PM_i are respectively the block parallelism, input-channel parallelism and output-channel parallelism of the i-th layer convolution, and D_c is the number of DSP resources required by a single multiplier;

S62: each parallel input computation requires the on-chip storage resources to satisfy:

(HB_i × WB_i) × PB_i × PN_i × B_w/B_h + (RB_i × CB_i) × PM_i × PB_i × B_w/B_h + (K_i × K_i) × PM_i × PB_i × PN_i × B_w/B_h ≤ B_a

where HB_i and WB_i are respectively the length and width of the blocked input feature map of the i-th layer, RB_i and CB_i are respectively the length and width of the blocked output feature map of the i-th layer, B_w is the bit width of the data, and B_h is the storage depth of a single BRAM block;

S63: the input-channel parallelism PN_i must evenly divide the total number of input channels N_i, i.e. N_i % PN_i = 0; the output-channel parallelism PM_i must evenly divide the total number of output channels M_i, i.e. M_i % PM_i = 0; the convolution-kernel-unfolding parallelism PW_i of the i-th layer convolution is determined by the kernel size, i.e. PW_i = K_i × K_i; the block parallelism PB_i must not exceed the total number of blocks of the input feature map, i.e. PB_i ≤ (H_i × W_i)/(HB_i × WB_i);

S64: in the convolution layers, the hardware execution time TH_i of a single parallel input computation of the i-th layer convolution consists of the input image transmission time, the convolution kernel transmission time and the convolution computation time; TH_i is expressed by the following formula:

TH_i = (HB_i × WB_i) × t_clk + (K_i × K_i) × WK_i × t_clk + (RB_i × CB_i) × WK_i/PM_i × t_clk

where WK_i is the total number of convolution kernels required by the i-th layer convolution and t_clk is the system clock period;

S65: the execution time of each convolution layer is the product of the single parallel input computation time and the number of computations required to finish that layer, so the time to recognize one picture is:

T = Σ_{i=1}^{r} TH_i × X_i

where X_i denotes the number of parallel inputs required for the i-th layer convolution to complete the convolution of one input image, given one parallel input computation at a time; X_i is expressed by the following formula:

X_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i)

S66: since the output feature map is obtained by convolving the input feature map, the length of the blocked output feature map is RB_i = (HB_i - K_i)/S + 1 and its width is CB_i = (WB_i - K_i)/S + 1; substituting the formulas of steps S64 and S66 into S65, when K_i = 3 and S = 1, gives:

T = Σ_{i=1}^{r} (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/(HB_i × WB_i × PB_i) × (HB_i × WB_i + 9 × WK_i + (HB_i - 2) × (WB_i - 2) × WK_i/PM_i) × t_clk

S67: according to step S66, taking the blocked input feature map to be square (WB_i = HB_i), let

T(HB_i) = (C_i/HB_i^2) × (HB_i^2 + 9 × WK_i + (HB_i - 2)^2 × WK_i/PM_i) × t_clk

where C_i = (N_i/PN_i) × (M_i/PM_i) × (H_i × W_i)/PB_i collects the factors independent of HB_i;

S68: combining the resource restriction conditions, the optimization objective is defined as:

min T = Σ_{i=1}^{r} T(HB_i), subject to the constraints of steps S61, S62 and S63.
8. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S7, the model is solved to obtain the optimal image block size and convolution parallelism parameters, and the input feature map is blocked according to the blocking principle, specifically including the following steps:

S71: differentiating T(HB_i) in step S67 gives:

T'(HB_i) = C_i × t_clk × WK_i × (4 × HB_i - 8 - 2 × PW_i × PM_i)/(PM_i × HB_i^3)

S72: setting T'(HB_i) = 0 yields HB_i = 2 + (PW_i × PM_i)/2; the second derivative of T(HB_i) is then computed, and substituting HB_i = 2 + (PW_i × PM_i)/2 into the second derivative T''(HB_i) gives T''(HB_i) > 0, indicating that T(HB_i) decreases monotonically on the interval (0, 2 + (PW_i × PM_i)/2] and that, when HB_i = 2 + (PW_i × PM_i)/2 is satisfied, T(HB_i) reaches its minimum and the image recognition time is shortest;

S73: the convolution parallelism parameters PN, PW, PB and PM are determined according to the resource restriction conditions of S61 and S62, and are substituted into step S72 to solve the side length HB_i of the input feature map block according to HB_i = 2 + (PW_i × PM_i)/2;

S74: after the value of HB_i is obtained in step S73, the side length l of each blocked image is determined in combination with the blocking principle, and the input feature map is then blocked according to the side length λ of the original input feature map and the first address a of the block image: the ARM reads l data starting from the first-row address a and writes them to consecutive addresses in the off-chip memory, then obtains the first address of the next row as a + λ and reads l data again, writing them after the last address stored in the off-chip memory, until l × l data have been read.
9. The convolutional neural network-based fast image recognition accelerator design method of claim 1, wherein: in step S8, the FPGA uses the input feature map data, convolution kernel data and offset data to perform operations, including the convolution operation, activation function operation, pooling operation and fully-connected layer operation.
CN202011486673.4A 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network Active CN112508184B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011486673.4A CN112508184B (en) 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011486673.4A CN112508184B (en) 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN112508184A 2021-03-16
CN112508184B 2022-04-29

Family

ID=74972653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011486673.4A Active CN112508184B (en) 2020-12-16 2020-12-16 Design method of fast image recognition accelerator based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN112508184B

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010469B (en) * 2021-03-18 2023-05-26 恒睿(重庆)人工智能技术研究院有限公司 Image feature extraction method, device and computer readable storage medium
CN112991144B (en) * 2021-05-10 2021-08-24 同方威视技术股份有限公司 Method and system for partitioning image data of neural network
CN115470176B (en) * 2021-06-10 2024-04-09 中科寒武纪科技股份有限公司 Computing device, method for implementing convolution operation by utilizing computing device and related product
CN113554095B (en) * 2021-07-26 2022-08-19 湖南国科微电子股份有限公司 Feature map processing method and device and computer equipment
CN113705803A (en) * 2021-08-31 2021-11-26 南京大学 Image hardware identification system based on convolutional neural network and deployment method
CN113688069B (en) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN113792687A (en) * 2021-09-18 2021-12-14 兰州大学 Human intrusion behavior early warning system based on monocular camera
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN107392308A (en) * 2017-06-20 2017-11-24 中国科学院计算技术研究所 A kind of convolutional neural networks accelerated method and system based on programming device
CN107766812A (en) * 2017-10-12 2018-03-06 东南大学—无锡集成电路技术研究所 A kind of real-time face detection identifying system based on MiZ702N
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN108932548A (en) * 2018-05-22 2018-12-04 中国科学技术大学苏州研究院 A kind of degree of rarefication neural network acceleration system based on FPGA
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN110503127A (en) * 2018-05-17 2019-11-26 国际商业机器公司 The acceleration of convolutional neural networks on analog array
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111416743A (en) * 2020-03-19 2020-07-14 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111860784A (en) * 2020-07-24 2020-10-30 上海仪电(集团)有限公司中央研究院 Convolutional neural recognition system and method based on ARM and FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10802992B2 (en) * 2016-08-12 2020-10-13 Xilinx Technology Beijing Limited Combining CPU and special accelerator for implementing an artificial neural network


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Di Wu et al.; A High-performance CNN Processor Based on FPGA for MobileNets; 2019 29th International Conference on Field Programmable Logic and Applications (FPL); 20160929; 136-143 *
XIANGHONG HU et al.; A Resources-Efficient Configurable Accelerator for Deep Convolutional Neural Networks; IEEE Access; 20190613; 72113-72124 *
Liqiang Lu et al.; Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs; 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines; 20170703; 101-108 *
Xiong Jun; FPGA deep learning acceleration based on convolutional neural network; arXiv; 20201117; 1-7 *
Lei Xiaokang et al.; Fixed-point acceleration of convolutional neural networks based on FPGA; Journal of Computer Applications; 20201010; vol. 40, no. 10; 2811-2816 *
Wang Kun et al.; Framework design of a real-time recognition hardware system based on deep learning; Artificial Intelligence; 20181006; vol. 44, no. 10; 11-14 *
Wu Yanxia et al.; Progress and trends of deep learning FPGA accelerators; Chinese Journal of Computers; 20190114; vol. 42, no. 11; 2461-2480 *
Zeng Chenglong et al.; Design of a high-performance convolutional neural network accelerator for embedded FPGAs; Journal of Computer-Aided Design & Computer Graphics; 20190930; vol. 31, no. 9; 1643-1652 *

Also Published As

Publication number Publication date
CN112508184A 2021-03-16


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant