CN110348357A

CN110348357A - A fast object detection method based on deep convolutional neural networks

Info

Publication number: CN110348357A
Application number: CN201910594388.5A
Authority: CN
Inventors: 王蒙; 李威
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-07-03
Filing date: 2019-07-03
Publication date: 2019-10-18
Anticipated expiration: 2039-07-03
Also published as: CN110348357B

Abstract

The invention discloses a fast object detection method based on deep convolutional neural networks. First, a basic SSD detection model is constructed and trained on preprocessed data to obtain an original trained model. Then, by measuring the importance of the convolution kernels, a kernel pruning strategy removes unimportant kernels, which simplifies the feature extraction network of the detection model and yields a compressed model. Specifically, a subset of the input channels of layer i+1 is selected so that the output computed from that subset is as close as possible to the original output of layer i+1; the other input channels of layer i+1 can then be removed, and in turn the corresponding convolution kernels in layer i are removed, which prunes the convolution kernels of the model. After each convolutional layer is pruned, the compressed model is fine-tuned to restore detection accuracy. Once all convolutional layers have been pruned, the final compressed detection model is obtained. Through model compression, the invention allows the model to be deployed on mobile devices, increasing detection speed while maintaining detection accuracy.

Description

A fast object detection method based on deep convolutional neural networks

Technical field

The present invention relates to the field of computer vision, and more particularly to a fast object detection method based on deep convolutional neural networks.

Technical background

The most direct ways for humans to obtain information are hearing and vision. Studies show that 80% to 90% of the information humans acquire is obtained through the eyes. For this reason, visual mechanisms have long been an important area of research, and computer vision in particular has achieved many major breakthroughs in recent years with the development of deep learning. Object detection is an important part of visual perception and understanding: the speed and accuracy with which objects are detected directly determine the quality of the information obtained. Object detection also has broad application prospects. In autonomous driving, 3D object detection enables fast localization so that a vehicle can avoid obstacles and drive smoothly; in manufacturing, AI-based quality inspection automatically identifies defects in products and quickly finds substandard items, which improves the accuracy and speed of inspection and saves a large amount of labor; in traffic systems, real-time detection on video allows license plate numbers to be recognized quickly. In short, object detection, and fast object detection in particular, plays an increasingly important role in daily life.

Object detection consists of two processes: localization and recognition. Compared with a general image recognition task it adds only the localization step, but the model is considerably more complex to implement. In traditional object detection, the difficulty lies in feature extraction and feature classification. Such methods extract hand-crafted features from the object using Haar wavelets, LBP, SIFT, HOG (histogram of oriented gradients) and similar methods, and then classify them using the AdaBoost cascade classifier, support vector machines, DPM and similar methods. However, because the extracted features are mainly based on low-level information, high-level features that carry rich semantic information are extracted insufficiently; the features are also task-specific. As a result, recognition accuracy is low and only a narrow range of object types can be recognized.

Therefore, despite many years of research, traditional detection methods were never widely adopted. In 2012 the AlexNet model was introduced and computer vision research achieved a historic breakthrough: the model attracted wide attention in large-scale image recognition and took first place in the ImageNet competition. This progress resulted, on the one hand, from improvements in computer hardware, which greatly increased big-data storage capacity and computing speed, and on the other hand from advances in machine learning algorithms, especially deep neural networks, which greatly improved feature extraction and high-level feature extraction in particular. Since then, object detection methods based on convolutional neural networks have emerged rapidly and flourished.

At present, object detection methods based on convolutional neural networks fall into two main classes: two-stage detection and single-stage detection. The best-known two-stage detectors are the R-CNN family. They use region-based detection and divide object detection into two processes: one selects bounding-box proposals, mainly by regression; the other classifies the object. Fast R-CNN, Faster R-CNN, Mask R-CNN and other models were proposed subsequently. Although detection speed and accuracy improved with each, they remain far from real-time detection in practical applications. The other approach extracts features from the input image with a convolutional neural network and then performs regression directly, treating localization and recognition as a single process; typical representatives are YOLO and SSD. Although detection speed is greatly improved, these models still cannot meet the real-time requirements of detection in production applications.

Summary of the invention

The main technical problem solved by the invention is to provide a fast object detection method based on deep convolutional neural networks. Pruning the convolution kernels of the original detection model reduces the model size, which facilitates model deployment. At the same time, detection speed is increased while detection accuracy is preserved, which addresses problems such as video surveillance in intelligent transportation systems and pedestrian detection.

The invention proposes a fast object detection method based on deep convolutional neural networks, comprising: training an original model, compressing the model, and fine-tuning the compressed model. Specifically, it mainly comprises the following four steps:

Step1: Preprocess the image data of the training set. Specifically, augment the image data by randomly cropping a fixed region, randomly cropping a region of random size, color transformation and brightness distortion; then apply a random horizontal flip; finally, normalize the augmented images and resize them to a fixed size of w × h.
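
The geometric part of this preprocessing also has to be applied to the annotated boxes, which the text does not spell out. The following is a minimal sketch of that bookkeeping only (pixel operations are omitted). The `(xmin, ymin, xmax, ymax)` box format, the function names and the 300 × 300 default target size (taken from the embodiment) are assumptions made for illustration.

```python
import random


def hflip_box(box, img_w):
    """Mirror an (xmin, ymin, xmax, ymax) box about the vertical image axis."""
    xmin, ymin, xmax, ymax = box
    return (img_w - xmax, ymin, img_w - xmin, ymax)


def resize_box(box, src_size, dst_size):
    """Rescale a box when the image is resized from src_size to dst_size (w, h)."""
    (sw, sh), (dw, dh) = src_size, dst_size
    xmin, ymin, xmax, ymax = box
    return (xmin * dw / sw, ymin * dh / sh, xmax * dw / sw, ymax * dh / sh)


def augment_boxes(boxes, src_size, dst_size=(300, 300), flip_prob=0.5, rng=random):
    """Apply a random horizontal flip, then the fixed-size resize, to all boxes."""
    flip = rng.random() < flip_prob  # one decision per image, shared by all boxes
    out = []
    for box in boxes:
        if flip:
            box = hflip_box(box, src_size[0])
        out.append(resize_box(box, src_size, dst_size))
    return out, flip
```

The same flip decision must be applied to the image itself; returning `flip` makes that possible for the caller.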

The preprocessed training set is then fed into the SSD model for training to obtain the initial model:

(1) Construct the SSD model: VGG16 with its fully connected layers removed serves as the base feature extraction network, to which six convolutional layers Conv6, Conv7, Conv8, Conv9, Conv10 and Conv11 are added; the feature maps of layers Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are extracted as prediction layers;

(2) Feed the preprocessed training set into the SSD model for feature extraction, and generate on each prediction layer a fixed number of prior candidate boxes of different sizes and aspect ratios. The prior candidate boxes are then matched against the ground-truth boxes annotated in the training images to obtain the positive and negative samples used in training, on which classification and regression prediction are performed respectively. First, each ground-truth box is matched with the prior candidate box that has the largest intersection over union (IoU) with it. Then, each remaining prior candidate box is matched with a ground-truth box if their IoU is greater than 0.5. Prior candidate boxes matched with a ground-truth box are positive samples and all others are negative samples. The negative samples are sorted in descending order of prediction confidence and the top-ranked ones are selected so that the ratio of positive to negative samples is 1:3.
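
The two-pass matching and the 1:3 negative selection described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: boxes are `(xmin, ymin, xmax, ymax)` tuples, all function names are hypothetical, and `neg_scores` stands for whatever per-prior confidence value is used to rank negatives.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_priors(priors, truths, threshold=0.5):
    """Return, per prior, the index of its matched ground-truth box, or -1."""
    matched = [-1] * len(priors)
    # Pass 1: every ground-truth box claims the prior with the largest IoU.
    for t, truth in enumerate(truths):
        best = max(range(len(priors)), key=lambda p: iou(priors[p], truth))
        matched[best] = t
    # Pass 2: remaining priors match a ground-truth box if IoU > threshold.
    for p, prior in enumerate(priors):
        if matched[p] != -1:
            continue
        best_t = max(range(len(truths)), key=lambda t: iou(prior, truths[t]))
        if iou(prior, truths[best_t]) > threshold:
            matched[p] = best_t
    return matched


def mine_hard_negatives(matched, neg_scores, neg_pos_ratio=3):
    """Keep the top-ranked negatives, at most neg_pos_ratio per positive."""
    num_pos = sum(1 for m in matched if m != -1)
    negatives = [p for p, m in enumerate(matched) if m == -1]
    negatives.sort(key=lambda p: neg_scores[p], reverse=True)
    return negatives[: neg_pos_ratio * num_pos]
```

Pass 1 guarantees that every ground-truth box has at least one positive prior even when no prior exceeds the 0.5 threshold.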

During this training, the network is trained by backpropagation using the SGD gradient descent optimization algorithm to obtain the initial trained model. The loss function used in training is:

$$L(z, p, l, g) = \frac{1}{N}\left(L_{conf}(z, p) + \alpha L_{loc}(z, l, g)\right) \tag{1}$$

where $N$ is the number of prior candidate boxes matched with ground-truth boxes, $L_{loc}$ is the localization loss, $L_{conf}$ is the classification confidence loss, $\alpha$ is the weighting (regularization) parameter, $z$ is the input image, $p$ is the object class, $l$ is the box predicted by the model, and $g$ is the annotated ground-truth box.
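
As a worked illustration of how such a loss is evaluated, the sketch below assumes the standard SSD form: the confidence loss plus α times the localization loss, divided by N. The choice of Smooth L1 for $L_{loc}$ and softmax cross-entropy for $L_{conf}$ follows the original SSD formulation and is an assumption here, since the text does not name them; all function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty on one coordinate residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def ssd_loss(loc_preds, loc_targets, conf_logits, conf_targets, num_pos, alpha=1.0):
    """(L_conf + alpha * L_loc) / N.

    loc_*  : box offsets for the positive priors only.
    conf_* : class logits and labels for positives plus the mined negatives.
    """
    if num_pos == 0:  # no matched priors: the loss is defined as 0
        return 0.0
    l_loc = sum(
        smooth_l1(p - t)
        for pred, tgt in zip(loc_preds, loc_targets)
        for p, t in zip(pred, tgt)
    )
    l_conf = sum(softmax_cross_entropy(lg, c) for lg, c in zip(conf_logits, conf_targets))
    return (l_conf + alpha * l_loc) / num_pos
```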

Step2: Apply the convolution kernel pruning strategy to the initial model obtained in Step1, compressing it to obtain a compressed model. In the base convolutional network VGG16 (Conv1_1-Conv4_3) described above, a pruning strategy based on kernel channels and sizes is applied: kernel channels that contribute strongly to convolutional feature extraction are retained, and those with little influence on feature extraction are discarded. Specifically, features are extracted from the images. Let the feature layer of the i-th convolutional layer have $C_i$ channels, width $H_i$ and height $D_i$; it is denoted $I_i$, with $I_i \in \mathbb{R}^{C_i \times H_i \times D_i}$. Correspondingly, the convolution kernels have size $n_i \times C_i \times K_i \times K_i$, i.e. there are $n_i$ kernels, each with $C_i$ channels and with width and height both equal to $K_i$; the kernels of layer i are denoted $W_i$, with $W_i \in \mathbb{R}^{n_i \times C_i \times K_i \times K_i}$. The aim is to remove the unimportant kernels in $W_i$ in order to reduce the number of model parameters. The main idea of kernel-based pruning is as follows: for layer i+1, the input is the output of layer i and the output is the input of layer i+2. If the output of layer i+1 can be approximated using only a subset of the channels of its input (the output of layer i), then the other channels of that input can be removed, and the corresponding convolution kernels in layer i can be removed as well.
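
To make the effect on tensor shapes concrete, the sketch below counts parameters before and after pruning the channels between two consecutive layers. The example uses the first two VGG16 layers (Conv1_1 with 64 kernels of size 3 × 3 on 3 input channels, Conv1_2 with 64 kernels on 64 input channels) and a compression ratio r = 0.5 as in the embodiment; the helper names are hypothetical.

```python
def conv_params(n_out, c_in, k):
    """Weights plus biases of a conv layer with n_out kernels of size c_in x k x k."""
    return n_out * c_in * k * k + n_out


def pruned_shapes(n_i, c_i, n_next, k, r):
    """Kernel shapes of layers i and i+1 after keeping round(n_i * r) channels.

    Removing an input channel of layer i+1 removes the matching kernel of
    layer i, so both tensors shrink along the shared channel dimension.
    """
    keep = round(n_i * r)
    return (keep, c_i, k, k), (n_next, keep, k, k)


shape_i, shape_next = pruned_shapes(n_i=64, c_i=3, n_next=64, k=3, r=0.5)
before = conv_params(64, 3, 3) + conv_params(64, 64, 3)
after = conv_params(shape_i[0], shape_i[1], 3) + conv_params(shape_next[0], shape_next[1], 3)
print(shape_i, shape_next, before, after)
```

Both layers lose parameters from a single channel-selection decision, which is why pruning layer i is driven by the input channels of layer i+1.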

The detailed procedure is as follows:

Let $y$ be a randomly sampled point in the input feature layer of layer i+2, i.e. a randomly sampled point in the output feature layer of layer i+1:

$$y = \sum_{c=1}^{C} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \hat{W}_{c,k_1,k_2}\, x_{c,k_1,k_2} + b \tag{2}$$

where $\hat{W}$ and $x$ are, respectively, the convolution kernel and the sliding window of layer i+1 that produce this response, $c$ indexes the kernel channel, $C$ is the number of channels, $k_1$ indexes width and $k_2$ indexes height, both with maximum value $K$, and $b$ is the corresponding bias;

The output of each channel of the convolution between the kernel and the sliding window of layer i+1 is

$$\hat{x}_c = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \hat{W}_{c,k_1,k_2}\, x_{c,k_1,k_2} \tag{3}$$

The randomly sampled point in the output feature layer of layer i+1 is then expressed as:

$$\hat{y} = \sum_{c=1}^{C} \hat{x}_c \tag{4}$$

where $\hat{y} = y - b$.
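
The decomposition of a convolution response into per-channel contributions can be checked numerically. The toy kernel below (C = 2 channels, K = 2) uses arbitrary values chosen for illustration; the function names are hypothetical.

```python
def channel_outputs(W, x):
    """Per-channel responses: x_hat[c] = sum over k1, k2 of W[c][k1][k2] * x[c][k1][k2]."""
    return [
        sum(wv * xv for w_row, x_row in zip(W[c], x[c]) for wv, xv in zip(w_row, x_row))
        for c in range(len(W))
    ]


def conv_response(W, x, b):
    """Full response y of one kernel at one spatial position, bias included."""
    total = b
    for w_ch, x_ch in zip(W, x):
        for w_row, x_row in zip(w_ch, x_ch):
            for wv, xv in zip(w_row, x_row):
                total += wv * xv
    return total


# Toy kernel and sliding window, indexed [channel][k1][k2].
W = [[[1.0, 2.0], [0.0, -1.0]], [[0.5, 0.5], [1.0, 1.0]]]
x = [[[2.0, 1.0], [3.0, 4.0]], [[4.0, 2.0], [1.0, 1.0]]]
b = 0.25

x_hat = channel_outputs(W, x)
y = conv_response(W, x, b)
print(x_hat, y, y - b == sum(x_hat))
```

Because the response is a plain sum over channels, dropping a channel changes the output by exactly that channel's contribution, which is what the selection criterion measures.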

If the output of layer i+1 can be approximated using only the input formed by a channel subset of layer i, the other channels of layer i that feed layer i+1 can be removed. Specifically, $\hat{y} = \sum_{c \in S} \hat{x}_c$, where $S$ is the channel subset of the feature layer and $S \subset \{1, 2, \ldots, C\}$. If this formula holds, the channel feature $\hat{x}_c$ corresponding to any $c \notin S$ can be removed. Since the input of layer i+1 is the output of layer i, the corresponding convolution kernels in layer i can also be removed, which prunes the kernels of layer i;

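During training, let the training set be $\{(\hat{x}_m, \hat{y}_m)\}_{m=1}^{M}$, where $M$ is the product of the number of images and the number of spatial positions sampled in the convolutional features; $\hat{x}_m = (\hat{x}_{m,1}, \hat{x}_{m,2}, \ldots, \hat{x}_{m,C})$ is the m-th input convolutional feature, obtained by formula (3), with $\hat{x}_{m,j}$ the output of the j-th channel of the convolution between the corresponding kernel and sliding window; and $\hat{y}_m$, obtained by formula (4), is the m-th randomly sampled point in the output feature layer. The original channel selection problem then becomes the following optimization problem:

$$\arg\min_{S} \sum_{m=1}^{M} \Big( \hat{y}_m - \sum_{j \in S} \hat{x}_{m,j} \Big)^2 \quad \text{s.t.} \quad |S| = C \times r,\; S \subset \{1, 2, \ldots, C\} \tag{5}$$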

where $|S|$ is the number of elements in the channel subset $S$ and $r$ is the compression ratio. Let $T$ denote the subset of removed channels of the feature layer; the intersection of $T$ and $S$ is then empty and their union is the channel set $\{1, 2, \ldots, C\}$. The above formula is converted to:

$$\arg\min_{T} \sum_{m=1}^{M} \Big( \sum_{j \in T} \hat{x}_{m,j} \Big)^2 \quad \text{s.t.} \quad |T| = C \times (1 - r),\; T \subset \{1, 2, \ldots, C\} \tag{6}$$

In general $|T| < |S|$, so in actual training the pruning of kernel channels is implemented by optimizing formula (6). This optimization yields the kernels of layer i that are to be removed. To preserve model performance after the kernels are removed, the reconstruction error is also minimized:

$$\hat{w} = \arg\min_{w} \sum_{m=1}^{M} \big( \hat{y}_m - w^{T} \hat{x}^{*}_m \big)^2 \tag{7}$$

where $\hat{x}^{*}_m$ denotes $\hat{x}_m$ restricted to the retained channels.

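Solving formula (7) by ordinary least squares gives

$$\hat{w} = (X^{T} X)^{-1} X^{T} y$$

where each row of $X$ holds the channel features of the retained channels for one sample and $y$ is the vector of the sampled outputs $\hat{y}_m$.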
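
The closed-form least-squares solution can be sketched in pure Python by forming the normal equations and solving them with Gaussian elimination. This is an illustration only: the function names are hypothetical, `X` is assumed to be a list of M rows with one entry per retained channel, and in practice a numerical library routine would normally be used instead.

```python
def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; channels are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def reconstruction_weights(X, y):
    """w = (X^T X)^{-1} X^T y for X given as a list of rows."""
    n = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(n)] for a in range(n)]
    Xty = [sum(row[a] * t for row, t in zip(X, y)) for a in range(n)]
    return solve_linear(XtX, Xty)


# Three samples, two retained channels; y was generated with w = [2, 3].
w = reconstruction_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, 3.0, 5.0])
print(w)
```

The resulting vector has one coefficient per retained channel, so it can be used to rescale those channels before fine-tuning.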

Following the above steps, the convolution kernels of the first 10 convolutional layers (Conv1_1-Conv4_3) of VGG16, the base network of the detection model, are pruned to obtain the compressed model.

Step3: After each feature layer of the model has been pruned in Step2, the model is immediately fine-tuned: the compressed model is trained on the preprocessed training set and saved. Fine-tuning uses the training procedure of Step1 and yields the detection model after kernel pruning, improving the detection accuracy of the pruned model. In general, training is repeated for one to two epochs and the resulting model is saved.

Step4: Repeat Step2-Step3 several times: the kernel pruning strategy of Step2 is applied again to the model fine-tuned in Step3 to compress it further, until all convolution kernels that have little influence on detection performance have been removed. The model saved after the final fine-tuning is the final compressed detection model.
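
The interplay of Step2 to Step4 amounts to a simple control loop: prune one layer, fine-tune, move to the next layer, and repeat the whole pass. The sketch below shows only that control flow; `prune_layer` and `fine_tune` are hypothetical callables supplied by the caller, and the default of two fine-tuning epochs reflects the "one to two epochs" stated above.

```python
def compress(model, layers, prune_layer, fine_tune, rounds=1, epochs=2):
    """Alternate layer-wise pruning and fine-tuning.

    model       : any object representing the detector
    layers      : names of the convolutional layers to prune, in order
    prune_layer : callable (model, layer_name) -> pruned model   (hypothetical)
    fine_tune   : callable (model, epochs) -> fine-tuned model   (hypothetical)
    rounds      : how many times the whole Step2-Step3 pass is repeated
    """
    for _ in range(rounds):
        for name in layers:
            model = prune_layer(model, name)  # Step2 on a single layer
            model = fine_tune(model, epochs)  # Step3 right after pruning
    return model
```

Fine-tuning after every pruned layer, rather than once at the end, gives the network a chance to recover accuracy before the next layer is pruned.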

Beneficial effects of the present invention:

The invention increases detection speed while reducing model size and preserving detection accuracy as far as possible.

In Step1, augmenting the data makes the model more robust to variations in object size. The accuracy of the original detection model is also raised as far as possible, which raises the upper limit of the detection accuracy of the compressed model.

In Step2, the importance of the convolution kernels in the original model is measured and kernels with little influence on detection performance are removed, which compresses the model while maintaining detection accuracy.

In Step3, the compressed network is fine-tuned so that the detection performance of the compressed model is optimized again.

In Step4, Step2 and Step3 are repeated several times to prune the kernels of all convolutional layers while maintaining detection accuracy, yielding the final detection model.

Brief description of the drawings

Fig. 1 is a flow diagram of a preferred embodiment of the invention;

Fig. 2 shows the basic detector network structure SSD of the embodiment of the invention, in which the base network is VGG-16.

Specific embodiment

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.

Embodiment 1: The invention can be applied in many fields: in traffic systems, to locate targets in real time by detection on surveillance video; in criminal investigation, to locate suspects by fast detection; in autonomous driving, to locate elements of the road scene quickly so as to avoid pedestrians and obstacles. To show the generality of the method, the following uses experiments on the public Pascal VOC dataset, which has 20 object classes in total, to illustrate how the invention is used. The experiments were run on Ubuntu 18.04 with an Intel i7-8700K CPU (3.7 GHz, 6 cores) and an NVIDIA GeForce RTX 2070 graphics card; the programming language was Python 3.6 and the deep learning framework was PyTorch 1.0.

Step1: Preprocess the image data of the training set and feed the preprocessed training set into the SSD model for training to obtain the initial model. First, the SSD network model is constructed with VGG16, with its fully connected layers removed, as the base feature extraction network; the overall network structure is shown in Fig. 2. The VOC2007 and VOC2012 training and validation sets are used as the training dataset, 16551 training images in total; the test set is the VOC2007 test set, 4952 images in total. The data are then preprocessed: the image data are augmented by randomly cropping a fixed region, randomly cropping a region of random size, color transformation and brightness distortion, followed by a random horizontal flip. Finally, the augmented images are normalized and resized to a fixed size of 300 × 300. The preprocessed data are fed into the SSD detection model for feature extraction, and classification and regression are performed on each of the six prediction layers of different scales. In training, the batch size is 32 and the total number of iterations is 120000; the network is trained by backpropagation using the SGD gradient descent optimization algorithm to obtain the initial trained model.

Step2: Apply the convolution kernel pruning strategy to the initial model obtained in Step1, compressing it to obtain a compressed model. Features are extracted from the training data using the initial model to obtain the feature layers of the first 10 convolutional layers of VGG16 (Conv1_1-Conv4_3). Let $y$ be a point in the input of layer i+2, i.e. in the output of feature layer i+1; by the convolution operation,

$$y = \sum_{c=1}^{C} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \hat{W}_{c,k_1,k_2}\, x_{c,k_1,k_2} + b$$

Further, the output of each channel of the convolution between the kernel and the sliding window of layer i+1 is

$$\hat{x}_c = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \hat{W}_{c,k_1,k_2}\, x_{c,k_1,k_2}$$

The randomly sampled point in the output feature layer of layer i+1 can then be expressed as:

$$\hat{y} = \sum_{c=1}^{C} \hat{x}_c$$

where $\hat{y} = y - b$.

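From the above formulas, the training set $\{(\hat{x}_m, \hat{y}_m)\}_{m=1}^{M}$ is obtained. The aim is to optimize the following formula so as to remove unimportant convolutional feature channels:

$$\arg\min_{T} \sum_{m=1}^{M} \Big( \sum_{j \in T} \hat{x}_{m,j} \Big)^2 \quad \text{s.t.} \quad |T| = C \times (1 - r),\; T \subset \{1, 2, \ldots, C\}$$

where $T$ is the set of removed convolutional channels, $C$ is the number of channels of the convolutional layer (with channel set $\{1, 2, \ldots, C\}$), and $r$ is the compression ratio.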

To solve the above optimization problem, first set $T = \varnothing$, i.e. $|T| = 0$, set the compression ratio $r = 0.5$, and set the initial best value of the objective above to min_val → +∞. Then, while $|T| < C \times (1 - r)$, perform the following operation: for each channel $m \in \{1, 2, \ldots, C\}$, set $T' = T \cup \{m\}$ and evaluate the objective above with $T'$, denoting the result val; if val < min_val, update min_val = val and keep $T'$ as the new candidate for $T$; otherwise continue with the next channel. The $T'$ that gives the smallest value is taken as the updated $T$, so that the value of the objective is minimal each time a channel is added.

With this method the value of the objective is minimal each time a channel is added, and the set $T$ of convolution kernels to be removed is obtained, which compresses the network model.
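
The greedy procedure just described can be sketched as follows. This is one reading of the text, under stated assumptions: the best value is reset at the start of every round (so that each round picks the best remaining channel), channels already in T are skipped, and the number of channels to remove is rounded to the nearest integer. `X_hat` is assumed to be a list of M samples, each a list of C per-channel outputs; the function name is hypothetical.

```python
def greedy_prune_set(X_hat, r):
    """Greedily choose the channels T whose summed output is closest to zero."""
    C = len(X_hat[0])
    num_remove = round(C * (1 - r))  # assumption: round C * (1 - r)
    T = []
    partial = [0.0] * len(X_hat)  # per-sample sum over channels already in T
    while len(T) < num_remove:
        best_c, min_val = None, float("inf")  # assumption: reset every round
        for c in range(C):
            if c in T:
                continue
            # Objective for T' = T + {c}: sum over samples of (sum over T')^2.
            val = sum((p + row[c]) ** 2 for p, row in zip(partial, X_hat))
            if val < min_val:
                best_c, min_val = c, val
        T.append(best_c)
        partial = [p + row[best_c] for p, row in zip(partial, X_hat)]
    return T


# Four channels, two samples: channels 1 and 2 nearly cancel each other.
X_hat = [[1.0, 0.1, -0.1, 2.0], [1.0, -0.1, 0.1, 2.0]]
removed = greedy_prune_set(X_hat, r=0.5)
print(sorted(removed))
```

Keeping a running per-sample sum avoids recomputing the contribution of channels already selected, so each round costs one pass over the remaining channels.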

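At the same time, to preserve model performance after the kernels are removed, the reconstruction error shown in the following formula is minimized:

$$\hat{w} = \arg\min_{w} \sum_{m=1}^{M} \big( \hat{y}_m - w^{T} \hat{x}^{*}_m \big)^2$$

where $\hat{x}^{*}_m$ denotes $\hat{x}_m$ restricted to the retained channels. Solving this formula by ordinary least squares gives

$$\hat{w} = (X^{T} X)^{-1} X^{T} y$$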

Step3: After each feature layer of the model has been pruned in Step2, the model is immediately fine-tuned and saved. Fine-tuning uses the training procedure of Step1 and yields the detection model after kernel pruning, improving the detection accuracy of the pruned model. In general, training is repeated for one to two epochs and the resulting model is saved.

Step4: Step2 and Step3 are repeated several times; 3 repetitions are used in this embodiment, giving the final detection model.

Following the above steps, the detection performance of the original model and of the model after kernel pruning is obtained. Table 1 below gives the model size, detection speed and detection accuracy before and after pruning. As the table shows, compared with the original model, detection accuracy drops slightly, from 77.3% to 75.2%, but the model size is reduced from 105.2 M to 13.8 M and detection speed rises from 46 FPS (frames per second) to 200 FPS. At the cost of a small loss in detection accuracy, this can meet the requirements of mobile deployment and real-time detection in practical production applications.

Table 1. Performance comparison of the original model and the compressed model

Model	Model size (M)	Detection speed (FPS)	Detection accuracy (mAP, %)
Original model	105.2	46	77.3
Compressed model	13.8	200	75.2
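
The relative gains implied by these figures follow from simple arithmetic on the reported values. The snippet below only derives ratios; the dictionary keys are illustrative names.

```python
original = {"size_m": 105.2, "fps": 46, "map": 77.3}
compressed = {"size_m": 13.8, "fps": 200, "map": 75.2}


def summarize(orig, comp):
    """Derive size reduction, speedup and accuracy drop from two result rows."""
    return {
        "size_reduction": orig["size_m"] / comp["size_m"],  # times smaller
        "speedup": comp["fps"] / orig["fps"],               # times faster
        "map_drop": orig["map"] - comp["map"],              # mAP points lost
    }


summary = summarize(original, compressed)
print({k: round(v, 2) for k, v in summary.items()})
```

This corresponds to a model roughly 7.6 times smaller and 4.3 times faster, for a loss of 2.1 mAP points.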

Compared with other existing methods, this embodiment trains an initial detection model on the training data, then assesses the importance of the convolution kernels of the detector's feature extraction network and removes the unimportant kernels, thereby simplifying the model. Unlike earlier approaches that assess kernel importance using statistics of the i-th feature layer, the feature layer i+1 is used here to guide the assessment of the kernels of feature layer i. Throughout kernel pruning the structure of the original model is not changed, which helps to preserve model accuracy. After kernel pruning, the model is fine-tuned so that the performance of the compressed model is optimized. The algorithm compresses the model so that it can be deployed on mobile devices, increases detection speed, and largely maintains detection accuracy.

The above is only an embodiment of the invention and does not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the invention.

Claims

1. A fast object detection method based on deep convolutional neural networks, characterized by comprising the following steps:

Step1: preprocessing the image data of the training set, and feeding the preprocessed training set into an SSD model for training to obtain an initial model;

Step2: applying a convolution kernel pruning strategy to the initial model obtained in Step1, compressing the initial model to obtain a compressed model;

Step3: training the compressed model on the preprocessed training set, i.e. fine-tuning the compressed model;

Step4: repeating Step2-Step3 several times to obtain the final detection model.

2. The fast object detection method based on deep convolutional neural networks according to claim 1, characterized in that the preprocessing of the image data of the training set in Step1 is specifically: augmenting the image data by randomly cropping a fixed region, randomly cropping a region of random size, color transformation and brightness distortion; then applying a random horizontal flip; and finally normalizing the augmented images and resizing them to a fixed size of w × h.

3. The fast object detection method based on deep convolutional neural networks according to claim 1, characterized in that the detailed process of training the SSD model in Step1 to obtain the initial trained model is:

(1) constructing the SSD model, with VGG16 with its fully connected layers removed as the base feature extraction network, then adding six convolutional layers Conv6, Conv7, Conv8, Conv9, Conv10 and Conv11, and extracting the feature maps of layers Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 as prediction layers;

(2) feeding the preprocessed training set into the SSD model for feature extraction, generating on each prediction layer a fixed number of prior candidate boxes of different sizes and aspect ratios, then matching the prior candidate boxes against the ground-truth boxes annotated in the training images to obtain the positive and negative samples used in training, and performing classification and regression prediction respectively; during this training, the network is trained by backpropagation using the SGD gradient descent optimization algorithm to obtain the initial trained model.

4. The fast object detection method based on deep convolutional neural networks according to claim 3, characterized in that the strategy for matching the prior candidate boxes against the ground-truth boxes annotated in the training images is: first, each ground-truth box is matched with the prior candidate box that has the largest intersection over union (IoU) with it; then, each remaining prior candidate box is matched with a ground-truth box if their IoU is greater than 0.5; prior candidate boxes matched with a ground-truth box are positive samples and all others are negative samples; the negative samples are sorted in descending order of prediction confidence and the top-ranked ones are selected so that the ratio of positive to negative samples is 1:3.

5. The fast object detection method based on deep convolutional neural networks according to claim 3, characterized in that the loss function used in the training process is:

$$L(z, p, l, g) = \frac{1}{N}\left(L_{conf}(z, p) + \alpha L_{loc}(z, l, g)\right)$$

where $N$ is the number of prior candidate boxes matched with ground-truth boxes, $L_{loc}$ is the localization loss, $L_{conf}$ is the classification confidence loss, $\alpha$ is the weighting (regularization) parameter, $z$ is the input image, $p$ is the object class, $l$ is the box predicted by the model, and $g$ is the annotated ground-truth box.

6. The fast object detection method based on deep convolutional neural networks according to claim 1, characterized in that the convolution kernel pruning strategy is: extracting features from the preprocessed training set obtained in Step1 using the initial model, obtaining the feature layers of the first 10 convolutional layers of VGG16, and pruning the convolution kernels, the detailed process being:

Let $y$ be a randomly sampled point in the input feature layer of layer i+2, i.e. a randomly sampled point in the output feature layer of layer i+1:

$$y = \sum_{c=1}^{C} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \hat{W}_{c,k_1,k_2}\, x_{c,k_1,k_2} + b$$

where $\hat{W}$ and $x$ are, respectively, the convolution kernel and the sliding window of layer i+1 that produce this response, $c$ indexes the kernel channel, $C$ is the number of channels, $k_1$ indexes width and $k_2$ indexes height, both with maximum value $K$, and $b$ is the corresponding bias;

The output of each channel of the convolution between the kernel and the sliding window of layer i+1 is $\hat{x}_c = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \hat{W}_{c,k_1,k_2}\, x_{c,k_1,k_2}$, and the randomly sampled point in the output feature layer of layer i+1 is expressed as $\hat{y} = \sum_{c=1}^{C} \hat{x}_c$, where $\hat{y} = y - b$;

If the output of layer i+1 can be approximated using only the input formed by a channel subset of layer i, the other channels of layer i that feed layer i+1 can be removed; specifically, $\hat{y} = \sum_{c \in S} \hat{x}_c$, where $S$ is the channel subset of the feature layer and $S \subset \{1, 2, \ldots, C\}$. If this formula holds, the channel feature $\hat{x}_c$ corresponding to any $c \notin S$ can be removed; since the input of layer i+1 is the output of layer i, the corresponding convolution kernels in layer i can also be removed, which prunes the kernels of layer i;

During training, let the training set be $\{(\hat{x}_m, \hat{y}_m)\}_{m=1}^{M}$, where $M$ is the product of the number of images and the number of spatial positions sampled in the convolutional features; $\hat{x}_m = (\hat{x}_{m,1}, \hat{x}_{m,2}, \ldots, \hat{x}_{m,C})$ is the m-th input convolutional feature, with $\hat{x}_{m,j}$ the output of the j-th channel of the convolution between the corresponding kernel and sliding window; and $\hat{y}_m$ is the m-th randomly sampled point in the output feature layer. The original channel selection problem then becomes the following optimization problem:

$$\arg\min_{S} \sum_{m=1}^{M} \Big( \hat{y}_m - \sum_{j \in S} \hat{x}_{m,j} \Big)^2 \quad \text{s.t.} \quad |S| = C \times r,\; S \subset \{1, 2, \ldots, C\}$$

where $|S|$ is the number of elements in the channel subset $S$ and $r$ is the compression ratio. Let $T$ denote the subset of removed channels of the feature layer; the intersection of $T$ and $S$ is then empty and their union is the channel set $\{1, 2, \ldots, C\}$. The above formula is converted to:

$$\arg\min_{T} \sum_{m=1}^{M} \Big( \sum_{j \in T} \hat{x}_{m,j} \Big)^2 \quad \text{s.t.} \quad |T| = C \times (1 - r),\; T \subset \{1, 2, \ldots, C\}$$

This optimization yields the kernels of layer i that are to be removed; at the same time, to preserve model performance after the kernels are removed, the reconstruction error is minimized.