CN110209627A - SSD hardware acceleration method for intelligent terminals - Google Patents
- Publication number
- CN110209627A CN110209627A CN201910474860.1A CN201910474860A CN110209627A CN 110209627 A CN110209627 A CN 110209627A CN 201910474860 A CN201910474860 A CN 201910474860A CN 110209627 A CN110209627 A CN 110209627A
- Authority
- CN
- China
- Prior art keywords
- ssd
- fpga
- hardware
- intelligent terminal
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F15/7846: Architectures of general-purpose stored-program computers; single CPU with memory on one IC chip; on-chip cache and off-chip main memory
- G06F15/7867: Architectures of general-purpose stored-program computers; single CPU with reconfigurable architecture
- G06N3/045: Neural networks; combinations of networks
- G06V10/94: Hardware or software architectures specially adapted for image or video understanding
- G06V2201/07: Target detection
- Y02D10/00: Energy-efficient computing, e.g. low-power processors, power management or thermal management
Abstract
The invention discloses an SSD hardware acceleration method for intelligent terminals, belonging to the technical fields of FPGA hardware acceleration, target detection, computer vision, and heterogeneous computing. The method adopts an ARM+FPGA heterogeneous architecture on the edge-side smart device terminal and accelerates computation in hardware for edge-side target detection application scenarios. The model training of the SSD algorithm is completed by a cloud data center, personalized algorithms are designed for different FPGAs, and they are dynamically loaded onto the FPGA of the smart device terminal. For the SSD algorithm, 3x3 convolution, adder-tree, and ReLU computing units are designed using dot-product operations and tree-structured adders. The SSD hardware acceleration method for intelligent terminals of the invention satisfies edge-side requirements for computing power, real-time performance, and power consumption, and has good application value.
Description
Technical field
The present invention relates to the technical fields of FPGA hardware acceleration, target detection, computer vision, and heterogeneous computing, and specifically provides an SSD hardware acceleration method for intelligent terminals.
Background art
An FPGA (Field-Programmable Gate Array) is a semiconductor device that can be programmed for a specific application or functional requirement. FPGAs have been widely used in heterogeneous acceleration, where they show better performance than general-purpose CPUs. A CPU is designed mainly for logic computation; unlike the CPU and GPU, the FPGA is a typical non-von-Neumann architecture in which the hardware is adapted to the software, so the degree of parallelism can be adjusted flexibly according to system resources and algorithm characteristics to achieve an optimal fit, and its energy efficiency is therefore higher than that of CPUs and GPUs. FPGAs are especially good at digital signal processing, are compatible with interfaces at multiple voltage standards, and can interconnect various high-speed electronic components such as high-speed optical-fiber transceivers. Their low power consumption and low cost make them widely used in many fields.
In recent years computer vision has developed rapidly and is widely applied in fields such as security, transportation, robotics, and autonomous vehicles, with target detection being an important research direction. SSD (Single Shot MultiBox Detector) is a typical target detection algorithm. By using prior boxes (default boxes) of different scales and aspect ratios and detecting on feature maps of different scales, it extracts features with a CNN and then performs classification and regression directly, which alleviates the difficulty of detecting small objects that is common in target detection while remaining fast; it can be used in scenarios such as autonomous driving and security cameras. However, the terminals in these application scenarios are limited in computing power, size, and power consumption, while the SSD algorithm is also required to execute faster.
Summary of the invention
The technical task of the present invention is, in view of the above problems, to provide an SSD hardware acceleration method for intelligent terminals that satisfies edge-side requirements for computing power, real-time performance, and power consumption, supports dynamic FPGA updates without power loss, and continuously improves algorithm efficiency.
To achieve the above object, the present invention provides the following technical solution:
An SSD hardware acceleration method for intelligent terminals adopts an ARM+FPGA heterogeneous architecture on the edge-side smart device terminal and accelerates computation in hardware for edge-side target detection application scenarios. The model training of the SSD algorithm is completed by a cloud data center, personalized algorithms are designed for different FPGAs, and they are dynamically loaded onto the smart device terminal. For the SSD algorithm, 1x1 convolution units, 3x3 convolution units, adder-tree units, and ReLU units are designed using dot-product operations and tree-structured adders; combinations of these computing units are assembled, computation proceeds according to a configured loading order of data and network parameters, and the SSD algorithm is realized in cooperation with the ARM.
Targeting the network structure of the SSD algorithm, the SSD hardware acceleration method for intelligent terminals makes effective use of the FPGA's low power consumption and strong real-time parallel processing capability, adopts the ARM+FPGA heterogeneous architecture to accelerate computation for edge-side target detection scenarios, and satisfies edge-side requirements for computing power, real-time performance, and power consumption. It fully considers the resources of the edge-side device and the compute and on-chip storage resources of the FPGA, reasonably runs operations such as convolution, bias addition, and ReLU on the FPGA, designs 3x3 and 1x1 convolution units, adopts a loading order for convolution data in which a forward scan is followed by a reverse scan, makes full use of the on-chip cache, and reduces the number of data exchanges between the DDR memory and the FPGA on-chip cache. Operations unsuited to FPGA pipelining, including the PriorBox and pooling computations in the SSD algorithm, are handed to the ARM processor, which guarantees FPGA execution efficiency and reduces system complexity. While the ARM executes PriorBox-related computation, idle time on the FPGA is used to accelerate the convolution operations, improving overall efficiency. The method thus realizes edge-side hardware acceleration of the SSD algorithm, increases image target detection speed, achieves energy-efficient computation, and improves the overall performance of the terminal. In addition, the cloud data center continuously optimizes the model and, exploiting the FPGA's dynamic loading capability, updates the FPGA dynamically without power loss, continuously improving algorithm efficiency.
Preferably, the smart device terminal adopts the ARM+FPGA heterogeneous architecture, has both internal memory and external storage, provides image acquisition, and realizes edge-side real-time image target detection.
Preferably, the cloud data center collects a target detection training set, completes training with the SSD network, and customizes the resulting SSD network model according to the FPGA specifications of different devices.
Preferably, after the model is customized for the FPGA, the loading and execution order of data and network parameters is determined, and the personalized SSD network model is downloaded to the smart device terminal according to the FPGA hardware configuration.
Preferably, the FPGA implements the convolution circuits, designing the 3x3 convolution and ReLU computing units from dot-product operations and tree-structured adders.
Preferably, the ARM controls the FPGA and reads data and network parameters from memory into the FPGA through a DMA controller, while implementing the max pooling in SSD.
The ARM realizes the SSD operations MaxPool (max pooling), PriorBox, Permute, Normalize, and Flatten.
Preferably, according to the SSD algorithm and the limits of the PE and cache resources on the FPGA chip, a channels-first computation scheme is used: a group of 64-channel 3x3 convolutions is designed as one unit, containing 64 3x3 inner-product computations that produce the 64 convolution values; an adder-tree unit then reduces them to a single numerical result, completing one group of 64 3x3 convolutions.
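This group unit can be modelled in software to check the arithmetic. The sketch below (Python/NumPy, illustrative only; the patent specifies a hardware circuit, and the function names are mine) computes 64 3x3 inner products and reduces them with a tree-shaped adder:

```python
import numpy as np

def adder_tree(values):
    """Reduce a list of partial sums in log2(n) stages, pairing
    neighbours at each stage, as a tree-shaped adder would."""
    values = list(values)
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def conv3x3_group64(patches, kernels):
    """One group unit: 64 channels, each a 3x3 inner product,
    reduced to a single value by the adder tree.
    patches, kernels: arrays of shape (64, 3, 3)."""
    partial = [float(np.sum(patches[c] * kernels[c])) for c in range(64)]
    return adder_tree(partial)
```

With 64 inputs (a power of two) the tree closes in exactly six addition stages, which is what makes it attractive for a pipelined FPGA circuit.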
Preferably, a group of 64-channel 1x1 convolutions is designed as one unit, containing 64 1x1 multiplications that produce the 64 convolution values; an adder-tree unit then reduces them to a single numerical result.
Preferably, an FPGA convolutional-layer computing unit is designed in which the input data cache holds the 3x3 or 1x1 data nodes of all channels, and the parameter cache holds the 3x3 or 1x1 convolution kernel parameters of all channels of one filter plus one bias parameter.
A ReLU computing node realizing the function Relu(x) = Max(0, x) is implemented in the circuit. One layer of convolution is completed by combining multiple convolutional-layer computing units, and the output is stored to external memory.
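A software model of one such convolutional-layer computing unit, combining the channel-wise inner products, the bias parameter, and the ReLU node (illustrative Python/NumPy sketch; the function name and array layout are my assumptions, not the patent's):

```python
import numpy as np

def conv_unit(input_cache, param_cache, bias):
    """Model of one convolutional-layer computing unit: the input cache
    holds the 3x3 (or 1x1) patches of all channels, the parameter cache
    the matching kernel weights of one filter, plus one bias value.
    Output = Relu(sum over all channels of the inner products + bias)."""
    acc = float(np.sum(input_cache * param_cache)) + bias
    return max(0.0, acc)  # Relu(x) = Max(0, x), as in the circuit
```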
Compared with the prior art, the SSD hardware acceleration method for intelligent terminals of the present invention has the following prominent beneficial effects:
(1) Targeting the network structure of the SSD algorithm, it makes effective use of the FPGA's low power consumption and strong real-time parallel processing capability, adopts the ARM+FPGA heterogeneous architecture to accelerate computation for edge-side target detection scenarios, and satisfies edge-side requirements for computing power, real-time performance, and power consumption;
(2) It fully considers the resources of the edge-side device and the compute and on-chip storage resources of the FPGA, reasonably runs operations such as convolution, bias addition, and ReLU on the FPGA, designs 3x3 and 1x1 convolution units, adopts a forward-scan-then-reverse-scan loading order for convolution data, makes full use of the on-chip cache, and reduces the number of data exchanges between the DDR memory and the FPGA on-chip cache;
(3) Operations unsuited to FPGA pipelining, including the PriorBox and pooling computations in the SSD algorithm, are executed in cooperation with the ARM processor, which guarantees FPGA execution efficiency and reduces system complexity;
(4) While the ARM executes PriorBox-related computation, idle time on the FPGA is used to accelerate the convolution operations, improving overall efficiency, realizing edge-side hardware acceleration of the SSD algorithm, increasing image target detection speed, achieving energy-efficient computation, and improving the overall performance of the terminal;
(5) The cloud data center continuously optimizes the model and, exploiting the FPGA's dynamic loading capability, updates the FPGA dynamically without power loss, continuously improving algorithm efficiency, which has good application value.
Brief description of the drawings
Fig. 1 is a schematic diagram of the smart device terminal structure and its nodes in the SSD hardware acceleration method for intelligent terminals of the present invention;
Fig. 2 is a schematic diagram of the SSD acceleration algorithm in the SSD hardware acceleration method for intelligent terminals of the present invention;
Fig. 3 is a flow chart of the SSD hardware acceleration of the smart device terminal in the SSD hardware acceleration method for intelligent terminals of the present invention.
Specific embodiment
The SSD hardware acceleration method for intelligent terminals of the present invention is described in further detail below with reference to the drawings and an embodiment.
Embodiment
As shown in Fig. 1 and Fig. 2, the SSD hardware acceleration method for intelligent terminals of the present invention adopts an ARM+FPGA heterogeneous architecture on the edge-side smart device terminal and accelerates computation in hardware for edge-side target detection application scenarios, satisfying edge-side requirements for computing power, real-time performance, and power consumption. The cloud data center completes the training and optimization of the SSD algorithm model, designs personalized algorithms for the different FPGA specifications, and dynamically loads them onto the smart device terminal. For the SSD algorithm, basic computing elements such as the 1x1 convolution unit, 3x3 convolution unit, adder-tree unit, and ReLU unit are designed using dot-product operations and tree-structured adders, fully considering the on-chip cache size; combinations of these computing units are assembled, computation proceeds according to a configured loading order of data and parameters, and the SSD algorithm is realized in cooperation with the ARM.
The cloud data center is responsible for collecting the target detection training set and completing training with the SSD network; it customizes the resulting SSD network model according to the FPGA specifications of different devices, determines the loading and execution order of data and network parameters, and downloads the personalized model to the smart device terminal according to the FPGA hardware configuration. The smart device terminal adopts the ARM+FPGA heterogeneous architecture, has both internal memory and external storage, provides an image acquisition function, and realizes edge-side real-time image target detection. The FPGA implements the convolution circuits, realizing the 3x3 dot-product operations, the adder-tree computing unit, and the ReLU unit. The ARM controls the FPGA, reads data and network model parameters from memory into the FPGA through a DMA controller, and implements SSD operations such as max pooling (MaxPool), PriorBox, Permute, Normalize, and Flatten. According to the characteristics of the SSD algorithm and the limits of the PE and cache resources on the FPGA chip, a channels-first computation scheme is used. A group of 64-channel 3x3 convolutions is designed as one unit, containing 64 3x3 inner-product computations that produce the 64 convolution values; an adder-tree unit then reduces them to a single numerical result, completing one group of 64 3x3 convolutions. A group of 64-channel 1x1 convolutions is likewise designed as one unit, containing 64 1x1 multiplications that produce the 64 convolution values; an adder-tree unit (which can be shared with the 3x3 convolution) then reduces them to a single numerical result. An FPGA convolutional-layer computing unit is designed in which the input data cache holds the 3x3 or 1x1 data nodes of all channels, and the parameter cache holds the 3x3 or 1x1 convolution kernel parameters of all channels of one filter plus one bias parameter. A ReLU computing node realizing Relu(x) = Max(0, x) is implemented in the circuit. One layer of convolution is completed by combining multiple convolutional-layer computing units, and the output is stored to external memory.
As shown in Fig. 3, the SSD hardware acceleration method for intelligent terminals performs SSD hardware acceleration on the smart device terminal as follows:
S1. The cloud data center collects a target detection training set, completes training with the SSD network, and generates a model.
S2. The cloud data center customizes the model according to the FPGA specifications of different devices and downloads the personalized model to the smart device terminal.
S3. The smart device terminal acquires external images through an image acquisition device such as a camera, and converts the image resolution to meet the requirements of the SSD algorithm.
S4. The smart device terminal reads the image into DDR memory, encoded as three RGB channels.
S5. Following the steps of the SSD algorithm, the FPGA loads from memory via DMA the model parameters of one 3x3 convolution kernel of convolutional layer CONV1_1 (the first convolution layer; each kernel spans three channels, and CONV1_1 has 64 kernels in total).
S6. Image data is read from memory via DMA in 3x3 three-channel units, scanning the image from the upper left toward the lower right, and is computed with the 64-channel group convolution unit, of which only 3 channels are used.
S7. The 3x3 convolutions produce the results of the three channels; the adder tree then accumulates them, and finally the bias value is added to obtain a single result.
S8. The result of S7 passes through the ReLU unit, is written to the on-chip cache, and is finally output to DDR memory.
S9. S5 to S8 are repeated, making full use of the PE units on the FPGA chip for the convolution operations: a convolution kernel is loaded first, then the image data is read and computed. When the next convolution kernel is loaded, the image data is loaded starting from the lower-right position of the image and proceeding toward the upper left, guaranteeing that the image data cached at the end of the previous pass can be reused and reducing the number of memory reads.
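The alternating scan direction of S9 can be illustrated as follows (Python sketch; the function name and the even/odd-kernel convention are illustrative assumptions, not part of the patent):

```python
def scan_order(height, width, kernel_index):
    """Illustrative load order for S9: even-numbered kernels scan the
    image from the top-left toward the bottom-right, odd-numbered
    kernels scan in the reverse direction, so the window cached at the
    end of one pass is the first window needed by the next pass."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return coords if kernel_index % 2 == 0 else coords[::-1]
```

Because the last position of one pass is the first position of the next, the data already sitting in the on-chip cache is reused instead of being re-read from DDR memory.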
S10. Convolutional layer CONV1_2 (the second convolution layer) is loaded: its 3x3 kernels are loaded by the same rule and computed with the 64-channel group unit, and the results pass through the ReLU unit and are output to DDR storage.
S11. The ARM computes the 2x2 max pooling (MaxPool), completing the first network stage, CONV1.
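The ARM-side 2x2 max pooling of S11 can be modelled as follows (illustrative Python/NumPy sketch; the function name and the even-dimension assumption are mine):

```python
import numpy as np

def maxpool2x2(fmap):
    """ARM-side 2x2 max pooling with stride 2 on a single channel.
    Assumes the feature map height and width are both even."""
    h, w = fmap.shape
    # Group the map into 2x2 blocks, then take the max of each block.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```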
S12. CONV2 to CONV5 of the SSD network are computed following the S5-to-S10 sequence. The output of CONV4_3 is written to DDR memory; the ARM completes the Normalize regularization and performs the CONV_MBOX PriorBox-related computation, including operations such as convolution, Permute, and Flatten.
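S12 hands the PriorBox computation to the ARM. The patent does not specify the box scales or aspect ratios of its model, so the sketch below only illustrates generic SSD-style prior-box generation with assumed values (Python; all names and parameters are hypothetical):

```python
def prior_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Hypothetical SSD-style prior boxes (cx, cy, w, h) in relative
    coordinates for one square feature map. The scale and aspect
    ratios are assumed example values, not taken from the patent."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            # Box centres sit at the middle of each feature-map cell.
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
    return boxes
```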
S13. The 13x13 convolution of the FC6 layer is decomposed into 3x3 convolutions, five per dimension (covering a 15x15 region), and the positions beyond 13x13 are padded with 0 before computation.
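The decomposition of S13 can be checked numerically: zero-padding the 13x13 kernel to 15x15 and covering it with 3x3 tiles reproduces the full inner product. Illustrative Python/NumPy model (the names and structure are mine, not the patent's):

```python
import numpy as np

def conv_as_3x3_tiles(patch15, kernel13):
    """Model of S13: one 13x13 inner product computed as 3x3 tiles.
    The 13x13 kernel is zero-padded to 15x15 (the positions beyond
    13x13 filled with 0), then covered by five 3x3 tiles per
    dimension; each tile is one 3x3 inner product, and the tile
    results are accumulated."""
    k15 = np.zeros((15, 15))
    k15[:13, :13] = kernel13
    total = 0.0
    for r in range(0, 15, 3):
        for c in range(0, 15, 3):
            total += float(np.sum(patch15[r:r+3, c:c+3] * k15[r:r+3, c:c+3]))
    return total
```

The zero-padded tiles contribute nothing to the sum, so the tiled result matches the direct 13x13 inner product while reusing the same 3x3 hardware unit.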
S14. The FC7 layer uses the 64-channel group of 1x1 convolutions, followed by the adder tree and the addition of the bias value to obtain a single result; like the 3x3 convolution in S9, the computation is repeated, and the final output is stored to DDR as the input of the next layer, CONV6, while the ARM processes the results to complete the PriorBox-related operations.
S15. CONV6 through CONV10 all use a similar computation flow: the 1x1 convolution layers, 3x3 convolution layers, and ReLU layers are completed by the FPGA, still scanning each feature map from the upper left toward the lower right and returning in the opposite direction on the next pass, with the results output to DDR memory.
S16. The NORMBOX- and PriorBox-related operations of CONV6 to CONV9 are handed to the ARM; when CONV10 finishes its computation, the convolution operation of the NORMBOX of CONV10 is handed to the FPGA.
S17. After the NORMBOX-related operations following CONV10 are fully completed, convolution operations may continue to be assigned to the FPGA according to the computation status of CONV6 to CONV9.
S18. After all preceding operations are completed, the ARM executes the final NORMBOX_priorbox confidence (MBOX_CONF) and location (MBOX_LOC) computations and produces the final output.
S19. S13 to S18 are repeated to perform target detection continuously.
S20. The cloud data center continuously optimizes the model and dynamically loads updates to the terminal-side model.
The embodiment described above is only a preferred specific embodiment of the present invention, and the usual variations and substitutions carried out by those skilled in the art within the scope of the technical solution of the present invention should all be included within the scope of the present invention.
Claims (9)
1. An SSD hardware acceleration method for intelligent terminals, characterized in that: an ARM+FPGA heterogeneous architecture is adopted on the edge-side smart device terminal, and computation is accelerated in hardware for edge-side target detection application scenarios; the model training of the SSD algorithm is completed by a cloud data center, personalized algorithms are designed for different FPGAs, and they are dynamically loaded onto the smart device terminal; for the SSD algorithm, 1x1 convolution units, 3x3 convolution units, adder-tree units, and ReLU units are designed using dot-product operations and tree-structured adders, combinations of these computing units are assembled, computation proceeds according to a configured loading order of data and network parameters, and the SSD algorithm is realized in cooperation with the ARM.
2. The SSD hardware acceleration method for intelligent terminals according to claim 1, characterized in that: the smart device terminal adopts the ARM+FPGA heterogeneous architecture, has both internal memory and external storage, provides image acquisition, and realizes edge-side real-time image target detection.
3. The SSD hardware acceleration method for intelligent terminals according to claim 1 or 2, characterized in that: the cloud data center collects a target detection training set, completes training with the SSD network, and customizes the resulting SSD network model according to the FPGA specifications of different devices.
4. The SSD hardware acceleration method for intelligent terminals according to claim 3, characterized in that: after the model is customized for the FPGA, the loading and execution order of data and network parameters is determined, and the personalized SSD network model is downloaded to the smart device terminal according to the FPGA hardware configuration.
5. The SSD hardware acceleration method for intelligent terminals according to claim 4, characterized in that: the FPGA implements the convolution circuits and designs the 3x3 convolution and ReLU computing units from dot-product operations and tree-structured adders.
6. The SSD hardware acceleration method for intelligent terminals according to claim 5, characterized in that: the ARM controls the FPGA, reads data and network parameters from memory into the FPGA through a DMA controller, and implements the max pooling in SSD.
7. The SSD hardware acceleration method for intelligent terminals according to claim 6, characterized in that: according to the SSD algorithm and the limits of the PE and cache resources on the FPGA chip, a channels-first computation scheme is used, in which a group of 64-channel 3x3 convolutions is designed as one unit, containing 64 3x3 inner-product computations that produce the 64 convolution values; an adder-tree unit then reduces them to a single numerical result, completing one group of 64 3x3 convolutions.
8. The SSD hardware acceleration method for intelligent terminals according to claim 7, characterized in that: a group of 64-channel 1x1 convolutions is designed as one unit, containing 64 1x1 multiplications that produce the 64 convolution values; an adder-tree unit then reduces them to a single numerical result.
9. The SSD hardware acceleration method for intelligent terminals according to claim 8, characterized in that: an FPGA convolutional-layer computing unit is designed in which the input data cache holds the 3x3 or 1x1 data nodes of all channels, and the parameter cache holds the 3x3 or 1x1 convolution kernel parameters of all channels of one filter plus one bias parameter.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201910474860.1A | 2019-06-03 | 2019-06-03 | SSD hardware acceleration method for intelligent terminals |
Publications (1)
| Publication Number | Publication Date |
| --- | --- |
| CN110209627A | 2019-09-06 |
Family
ID=67790309
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN201910474860.1A (Pending) | CN110209627A (en) | 2019-06-03 | 2019-06-03 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110209627A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112887093A (en) * | 2021-03-30 | 2021-06-01 | 矩阵元技术(深圳)有限公司 | Hardware acceleration system and method for implementing cryptographic algorithms |
CN115309407A (en) * | 2022-10-12 | 2022-11-08 | 中国移动通信有限公司研究院 | Method and system capable of realizing calculation power abstraction |
CN115550607A (en) * | 2020-09-27 | 2022-12-30 | 北京天玛智控科技股份有限公司 | Model reasoning accelerator realized based on FPGA and intelligent visual perception terminal |
US11687279B2 (en) | 2020-01-27 | 2023-06-27 | Samsung Electronics Co., Ltd. | Latency and throughput centric reconfigurable storage device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106228240A (en) * | 2016-07-30 | 2016-12-14 | 复旦大学 | Degree of depth convolutional neural networks implementation method based on FPGA |
CN108256636A (en) * | 2018-03-16 | 2018-07-06 | 成都理工大学 | A kind of convolutional neural networks algorithm design implementation method based on Heterogeneous Computing |
CN108280514A (en) * | 2018-01-05 | 2018-07-13 | 中国科学技术大学 | Sparse neural network acceleration system based on FPGA and design method |
CN108764466A (en) * | 2018-03-07 | 2018-11-06 | 东南大学 | Convolutional neural networks hardware based on field programmable gate array and its accelerated method |
CN108932548A (en) * | 2018-05-22 | 2018-12-04 | 中国科学技术大学苏州研究院 | A kind of degree of rarefication neural network acceleration system based on FPGA |
- 2019-06-03: Application filed as CN201910474860.1A; published as CN110209627A (en); status Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11687279B2 (en) | 2020-01-27 | 2023-06-27 | Samsung Electronics Co., Ltd. | Latency and throughput centric reconfigurable storage device |
CN115550607A (en) * | 2020-09-27 | 2022-12-30 | 北京天玛智控科技股份有限公司 | FPGA-based model inference accelerator and intelligent visual perception terminal |
CN112887093A (en) * | 2021-03-30 | 2021-06-01 | 矩阵元技术(深圳)有限公司 | Hardware acceleration system and method for implementing cryptographic algorithms |
CN115309407A (en) * | 2022-10-12 | 2022-11-08 | 中国移动通信有限公司研究院 | Method and system for computing-power abstraction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110209627A (en) | SSD hardware acceleration method for intelligent terminals | |
CN112214726B (en) | Operation accelerator | |
CN109102065B (en) | Convolutional neural network accelerator based on PSoC | |
CN109740534B (en) | Image processing method, device and processing equipment | |
CN107844832A (en) | Information processing method and related product | |
Pestana et al. | A full featured configurable accelerator for object detection with YOLO | |
CN110458279A (en) | FPGA-based binary neural network acceleration method and system | |
CN108647773B (en) | Reconfigurable convolutional neural network hardware interconnection system | |
CN111667051A (en) | Neural network accelerator for edge devices and neural network acceleration computation method | |
CN111898733B (en) | Depthwise separable convolutional neural network accelerator architecture | |
CN112163601B (en) | Image classification method, system, computer device and storage medium | |
CN111047008B (en) | Convolutional neural network accelerator and acceleration method | |
CN109416756A (en) | Convolver and artificial intelligence processing device applying the same | |
US11996105B2 (en) | Information processing method and terminal device | |
CN111738433A (en) | Reconfigurable convolution hardware accelerator | |
CN109598250A (en) | Feature extraction method, device, electronic equipment and computer-readable medium | |
CN110598844A (en) | FPGA-based parallel convolutional neural network accelerator and acceleration method | |
CN114329324A (en) | Data processing circuit, data processing method and related product | |
CN117501245A (en) | Neural network model training method and device, and data processing method and device | |
CN117217274B (en) | Vector processor, neural network accelerator, chip and electronic equipment | |
CN110716751B (en) | High-parallelism computing platform, system and computing implementation method | |
CN110222835A (en) | Convolutional neural network hardware system based on zero-value detection and operation method | |
CN113837922A (en) | Computing device, data processing method and related product | |
CN113128673B (en) | Data processing method, storage medium, neural network processor and electronic device | |
CN114581952A (en) | Pedestrian re-identification method, system, device, equipment and computer medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190906 |