CN109241982A - Object detection method based on deep and shallow convolutional neural networks - Google Patents
Object detection method based on deep and shallow convolutional neural networks
- Publication number
- CN109241982A (application CN201811035114.4A)
- Authority
- CN
- China
- Prior art keywords
- region
- frame
- feature
- target detection
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The present invention discloses an object detection method based on deep and shallow convolutional neural networks. First, a feature extraction network comprising a deep neural network and a shallow neural network is applied to a training image to obtain a joint feature map. Second, a region proposal network (RPN) further processes the joint feature map to obtain region proposal feature maps. Feature dimensionality reduction is then applied to the region proposal feature maps. Next, the model is trained for classification and regression using the reduced region proposal feature maps, yielding a target detection model. Finally, the target detection model is used to detect objects in test images. The invention markedly improves detection accuracy for small targets while keeping detection accuracy for large targets unchanged and without significantly reducing detection speed.
Description
Technical field
The present invention relates to the field of target detection technology, and in particular to a target detection method based on deep and shallow convolutional neural networks.
Background technique
Target detection technology is widely applied in intelligent transportation, road detection, and military target recognition. With the emergence of deep learning and large-scale visual recognition datasets, deep-learning-based target detection has developed rapidly. The most representative approaches are the two-stage detection frameworks based on R-CNN (Region-based Convolutional Neural Networks) and the single-stage frameworks based on direct regression. An R-CNN-style detection framework consists of three parts: convolutional feature generation, object candidate region proposal, and candidate region classification and regression. Compared with traditional detection methods, R-CNN-series methods eliminate the subjectivity and one-sidedness of hand-crafted feature extraction while unifying target feature extraction and classification into a single process. Single-stage methods based on direct regression omit the candidate region proposal step and complete detection by directly regressing target classes and bounding boxes at multiple positions in the image. Compared with direct-regression methods, R-CNN-series methods are slower but generally more accurate; since most detection tasks emphasize accuracy, R-CNN-series methods are more widely applied.
The paper "Rich feature hierarchies for accurate object detection and semantic segmentation: Tech report" (published in the Conference on Computer Vision and Pattern Recognition) proposed R-CNN, a detection method combining region proposals with CNNs, and opened a new era of deep-learning-based target detection. The paper "Fast R-CNN" (published in the International Conference on Computer Vision) embedded a region-of-interest pooling layer into R-CNN and added a multi-task loss function to the network for bounding-box regression, making the detected candidate boxes more accurate. The paper "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (published in the International Conference on Neural Information Processing Systems) embedded a Region Proposal Network (RPN) into Fast R-CNN, significantly increasing detection speed and truly achieving end-to-end training and testing. As this line of work shows, however, current methods of this type all improve the model structure or add new detection mechanisms to raise detection accuracy and speed, while most pay little attention to how target features are extracted. As a result, detection ability for small objects is insufficient. It is therefore necessary to invent a method that can improve small-object detection while preserving large-object detection ability.
Summary of the invention
The problem to be solved by the present invention is that existing target detection methods have poor real-time performance and low accuracy on small targets. The invention provides a target detection method based on deep and shallow convolutional neural networks that markedly improves detection accuracy for small targets while keeping detection accuracy for large targets unchanged and without significantly reducing detection speed.
To solve the above problems, the present invention is realized by the following technical solution:

A target detection method based on deep and shallow convolutional neural networks comprises the following steps:

Step 1: pre-train the target detection model on the ImageNet dataset and use it to initialize the model parameters.

Step 2: perform feature extraction on the training image, namely:

Step 2.1: apply convolution operations to the training image to extract its convolutional features;

Step 2.2: feed the convolutional features obtained in step 2.1 into a shallow convolutional neural network and a deep convolutional neural network, respectively, for feature extraction;

Step 2.3: combine the features extracted by the shallow and deep convolutional neural networks of step 2.2 and compress them into one unified space, obtaining a joint feature map.

Step 3: traverse the joint feature map obtained in step 2 with a sliding window and convolution, generate a certain number of region proposal boxes on the joint feature map using the anchor mechanism of a region proposal network, and extract region proposal feature maps from the joint feature map according to these proposal boxes.

Step 4: reduce the dimensionality of the region proposal feature maps obtained in step 3 using a feature dimension reducer.

Step 5: feed the reduced region proposal feature maps of step 4 into the target detection model and perform classification training and regression training, obtaining the final target detection model.

Step 6: feed the image to be tested into the final target detection model obtained in step 5 and obtain the classification and regression results for that image.
As an improvement, step 3 further comprises a process of screening the region proposal boxes on the joint feature map: apply non-maximum suppression to the obtained proposal boxes; retain proposal boxes whose overlap ratio with the ground-truth region is greater than or equal to α as positive-sample proposal boxes, and boxes whose overlap ratio is less than β as negative-sample proposal boxes; finally, according to the proposal scores, select the δ highest-scoring boxes from the positive- and negative-sample proposals as the proposal boxes ultimately retained on the joint feature map, and extract region proposal feature maps from the joint feature map according to these retained proposals. Here α, β, and δ are preset values with 0 < α < 1, 0 < β < 1, and δ > 1.

In the above scheme, α > β.
Compared with the prior art, the present invention has the following characteristics:

(1) In the feature extraction network, the deep network mainly captures the high-level semantics of large targets, while the shallow network mainly retains the low-level image features of small targets. By combining the two networks, the invention makes full use of the features of different convolutional layers to improve detection capability.

(2) The RPN used here generates candidate regions directly on the convolutional feature map via the "anchor" mechanism. Although it is still essentially a sliding-window approach, region proposal, classification, and regression share the convolutional features, so the detection speed of the whole network increases dramatically.

(3) The feature dimension reducer not only makes the structure more compact but also reduces the dimensionality of the feature maps; moreover, it replaces one fully connected layer, which increases speed.

(4) The SoftmaxWithLoss and SmoothL1Loss functions are currently the most popular target detection loss functions and therefore accomplish the detection task well.
Description of the drawings

Fig. 1 is a schematic diagram of the target detection method based on deep and shallow convolutional neural networks.

Fig. 2 is a schematic diagram of dilated convolution, where (a) is an ordinary convolution feature map and (b) is a dilated convolution feature map with dilation factor 2.
Specific embodiment

To better understand the technical solution of the present invention, embodiments of the invention are described in detail below with reference to the drawings. The embodiments are implemented on the basis of the technical solution of the invention and give detailed implementations and concrete operating procedures, but the protection scope of the invention is not limited to the following embodiments.
The target detection model based on deep and shallow convolutional neural networks is roughly divided into four parts: the first is the feature extraction network, comprising the deep neural network and the shallow neural network; the second is the region proposal network (RPN); the third is the feature dimension reduction network; the fourth consists of the fully connected layers and the classification and regression layers.
In the shallow neural network, we no longer need to capture the high-level semantic features of the image; instead we want to obtain low-level image features, so a very deep network, i.e., a large number of convolutional layers, is unnecessary. To let the parallel structure achieve a better effect, starting from conv2-1 we use only 4 convolutional layers, each with 24 filters of kernel size 5 × 5 and padding 3. To give the deep and shallow networks the same spatial resolution, an average pooling layer with kernel size 4 × 4 and stride 2 is placed after each convolutional layer of the shallow network; in this structure, average pooling ensures that excessive image information is not lost, as it would be with max pooling.
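The claim that the two branches end at the same spatial resolution can be checked with the standard convolution output-size formula; a minimal sketch, assuming a 224 × 224 input (the input resolution is not fixed by the text):

```python
def conv_out(n, k, p, s=1):
    """Output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

size = 224                        # assumed input resolution
size = conv_out(size, 3, 1)       # first shared 3x3 conv, padding 1
size = conv_out(size, 3, 1)       # second shared 3x3 conv -> still 224

shallow = size
for _ in range(4):                # 4 x (5x5 conv pad 3, then 4x4 avg pool stride 2)
    shallow = conv_out(shallow, 5, 3)      # 5x5 conv, padding 3: grows by 2
    shallow = conv_out(shallow, 4, 0, 2)   # 4x4 average pool, stride 2: halves

deep = size
for _ in range(4):                # VGG16 pools 1-4 (its 3x3 pad-1 convs keep size)
    deep = conv_out(deep, 2, 0, 2)
# conv5-1..conv5-3: 3x3 kernel, dilation 2 -> effective extent 5, padding 2
deep = conv_out(deep, 5, 2)

assert shallow == deep == 14      # both branches reach the same 14 x 14 map
```

This is consistent with the 14 × 14 feature map mentioned later in step (3).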
In the deep neural network, the parameters from conv1-1 to pool4 are identical to VGG16, and we change the three layers conv5-1 to conv5-3 into dilated convolution layers with padding 2, kernel size 3 × 3, stride 1, and dilation factor 2. Dilated convolution is a common method in the field of image segmentation; it enlarges the receptive field without changing the feature map size, so that more global information is included. Its principle is shown in Fig. 2, where (a) is an ordinary convolution feature map and (b) is a dilated convolution feature map with dilation factor 2. For a 7 × 7 feature area, the actual kernel size is 3 × 3 with a hole value of 1, i.e., all weights other than the 9 marked points are 0. Although the kernel size is unchanged relative to ordinary convolution, the receptive field is enlarged to 7 × 7, so each convolution output contains more global information.
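How dilated convolution spreads the kernel taps while preserving the feature-map size can be shown with a minimal single-channel NumPy sketch (an illustration under assumed shapes, not the patent's implementation):

```python
import numpy as np

def dilated_conv2d(x, w, dilation=2):
    """Single-channel 2-D dilated convolution, stride 1.

    Padding is set to dilation * (k - 1) // 2 so that a 3x3 kernel
    preserves the spatial size, as in the conv5 layers described above.
    """
    k = w.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    span = dilation * (k - 1) + 1            # extent covered by the dilated kernel
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input at spread-out tap positions (step = dilation)
            patch = xp[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = np.sum(patch * w)
    return out

x = np.random.rand(14, 14)                   # pool4-sized feature map
w = np.zeros((3, 3)); w[1, 1] = 1.0          # center-tap-only kernel
y = dilated_conv2d(x, w, dilation=2)
assert y.shape == x.shape                    # spatial size preserved
assert np.allclose(y, x)                     # center tap reproduces the input
```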
The model requires joint training using an alternating optimization strategy. The specific training process of the whole model is as follows. In the first step, the feature extraction network is initialized with a model pre-trained on the ImageNet dataset, and the RPN is trained separately on the PASCAL VOC dataset to obtain candidate boxes. In the second step, the feature extraction network is re-initialized with the ImageNet pre-trained model, the candidate boxes generated in the first step are added, and a separate detection network is trained on PASCAL VOC using the DS-CNN model; the purpose is to obtain the convolutional layer parameters from the loss of the fully connected layers and the candidate boxes of the RPN. In the third step, the DS-CNN model is retrained: it is initialized with the model from the second step and the convolutional layer parameters are fixed so that they do not participate in back-propagation; the RPN parameters in DS-CNN are then initialized with the separately trained RPN of the first step and fixed, so that the RPN does not participate in back-propagation either. The purpose of this step is to connect the feature extraction network with the RPN. In the fourth step, the DS-CNN model is re-initialized with the convolutional layer and RPN parameters of the third-step model, which are kept fixed so that the convolutional layers and RPN do not participate in back-propagation; the purpose of this step is to fine-tune the fully connected layers and obtain the optimized result.
Referring to Fig. 1, an object detection method based on deep and shallow convolutional neural networks specifically comprises the following steps.

Step (1): initialize the parameters of the target detection training model with a model pre-trained on the ImageNet dataset.

Step (2): perform feature extraction on the training image.

Step (2.1): pass the input image through two convolutional layers, each with 64 filters of kernel size 3 × 3, to extract the convolutional features of the image. The parameters of these two layers are fixed from the ImageNet pre-trained model and do not participate in back-propagation.

Step (2.2): feed the convolutional features obtained in step (2.1) into the shallow convolutional neural network for feature extraction. The shallow convolutional neural network comprises 4 convolutional layers and 4 average pooling layers.

Step (2.3): in parallel with step (2.2), the convolutional features of step (2.1) are also fed into the deep convolutional neural network for feature extraction. The deep convolutional neural network comprises 11 convolutional layers and 4 max pooling layers.

Step (2.4): the features of steps (2.2) and (2.3) are combined by the Concat feature combiner and compressed into one unified space; the dimensionality after combination is 536.
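The stated 536 joint dimensions are consistent with concatenating the shallow branch's 24 channels and the deep branch's 512 channels (the VGG16 conv5 width) along the channel axis; a minimal sketch, assuming a channels-first layout:

```python
import numpy as np

shallow = np.random.rand(24, 14, 14)    # 24 channels from the 4-layer shallow branch
deep = np.random.rand(512, 14, 14)      # 512 channels from the VGG16-style deep branch

joint = np.concatenate([shallow, deep], axis=0)  # Concat along the channel axis
assert joint.shape == (536, 14, 14)     # 24 + 512 = 536 joint dimensions
```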
Step (3): obtain the object candidate region proposal feature maps.

The joint feature map is traversed and convolved with a 3 × 3 sliding window. Using the "anchor" mechanism of the Region Proposal Network (RPN), each position of the window center generates 12 kinds of "anchors", i.e., 12 region proposal boxes, and a certain number of proposal feature maps are extracted from the joint feature map according to these boxes, one proposal feature map per region proposal box.

In this embodiment, the RPN traverses and convolves the joint feature map with a 3 × 3 sliding window. The center of the 3 × 3 window corresponds to 4 scales of the input image (64, 128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1), so 12 different region proposal boxes, i.e., 12 kinds of anchors, can be generated in total. Therefore, for an input 14 × 14 feature map there are about 2300 (14 × 14 × 12 = 2352) region proposal boxes in total.
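The 12 anchors per window position can be enumerated directly. The area-preserving parameterization below follows the usual Faster R-CNN convention and is an assumption, since the text only lists the scales and aspect ratios:

```python
# Enumerate the 12 anchor shapes (4 scales x 3 aspect ratios) at one window center.
scales = [64, 128, 256, 512]          # scales in input-image pixels
ratios = [(1, 1), (1, 2), (2, 1)]     # height:width aspect ratios

anchors = []
for s in scales:
    for rh, rw in ratios:
        # keep the anchor area ~ s*s while applying the aspect ratio
        h = s * (rh / rw) ** 0.5
        w = s * (rw / rh) ** 0.5
        anchors.append((h, w))

assert len(anchors) == 12             # 12 anchor shapes per position
assert 14 * 14 * len(anchors) == 2352 # ~2300 proposals on a 14x14 feature map
```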
To screen effective proposal boxes, the RPN parameters are adjusted and non-maximum suppression is applied: boxes whose overlap ratio with the ground-truth region is greater than or equal to 0.5 are retained as positive samples, and boxes whose overlap ratio is less than 0.3 serve as negative samples. Finally, according to the proposal scores, the 500 highest-scoring region proposal boxes are chosen as the proposals ultimately retained on the joint feature map.

After the region proposal boxes are obtained, the RPN takes each proposal box together with its corresponding features as a new feature map, i.e., a region feature map. Thus, for every training image input, 500 region proposal feature maps are ultimately output.
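A sketch of the labeling rule, assuming "overlap ratio" means intersection-over-union (IoU), as is standard; the thresholds 0.5 and 0.3 are those given in the text:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_proposal(box, gt, alpha=0.5, beta=0.3):
    """Label one proposal against a ground-truth box (alpha=0.5, beta=0.3 per the text)."""
    o = iou(box, gt)
    if o >= alpha:
        return "positive"
    if o < beta:
        return "negative"
    return "ignored"                  # neither positive nor negative

gt = (10, 10, 50, 50)
assert label_proposal((10, 10, 50, 50), gt) == "positive"      # full overlap
assert label_proposal((100, 100, 140, 140), gt) == "negative"  # no overlap
```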
Step (4): reduce the dimensionality of the region proposal feature maps with the feature dimension reducer.

The feature dimension reducer consists of a region-of-interest pooling layer and a single-kernel convolutional layer. The region-of-interest pooling layer outputs fixed-size feature maps after the RPN and here serves to compress the feature maps. The single-kernel convolutional layer is a convolutional layer with kernel size 1 × 1 and stride 1; placing it after the region-of-interest pooling layer not only makes the structure more compact but also reduces the dimensionality of the region proposal feature maps. Using the dimension reducer, the feature map size is fixed to 7 × 7 and the dimensionality is reduced from 536 to 512 before the features are fed into the fully connected layers.
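The single-kernel (1 × 1) convolution that reduces 536 channels to 512 is simply a per-pixel linear map over channels; a sketch with illustrative random weights (the weights and layout are assumptions):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels.

    x: (C_in, H, W) feature map; w: (C_out, C_in) kernel weights.
    """
    c_in, h, wd = x.shape
    # flatten spatial dims, project channels, restore the spatial layout
    return (w @ x.reshape(c_in, -1)).reshape(w.shape[0], h, wd)

roi = np.random.rand(536, 7, 7)       # RoI-pooled 7x7 map with 536 channels
w = np.random.rand(512, 536)          # illustrative random weights
out = conv1x1(roi, w)
assert out.shape == (512, 7, 7)       # 536 -> 512 channel reduction
```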
Step (5): perform classification and regression training on the model using the region proposal feature maps, obtaining the trained target detection model.

The reduced region proposal feature maps are processed by fully connected (FC) layers to obtain 4096-dimensional feature maps, which then undergo classification training and regression training.

The classification layer contains 2 elements, used to estimate the probabilities of target versus non-target. In classification training, the 4096-dimensional feature maps pass through the fully connected cls_score layer to obtain 21-dimensional features. The cls_score layer, used for classification, outputs a (K+1)-dimensional array p representing the probability of belonging to each of the K classes or to the background. Because the training dataset PASCAL VOC has K = 20 classes and the background forms 1 additional class, the output of this fully connected layer is 21-dimensional.

The loss_cls layer uses the SoftmaxWithLoss function as the classification loss. It is determined by the probability corresponding to the true class u, with the calculation formula

L_cls = -log p_u    (1)
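Equation (1) in code: a softmax over the 21 class scores followed by the negative log-probability of the true class (the score values and class index below are illustrative):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cls_loss(scores, u):
    """L_cls = -log p_u for true class index u (equation 1)."""
    return -math.log(softmax(scores)[u])

scores = [0.0] * 21                      # 21-way output: 20 VOC classes + background
loss = cls_loss(scores, u=5)
assert abs(loss - math.log(21)) < 1e-9   # uniform scores give loss = log(21)
```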
The regression layer contains 4 coordinate elements (x, y, w, h), used to determine the target position. In regression training, the 4096-dimensional feature maps pass through the fully connected bbox_prdict layer to obtain 84-dimensional features (4 × 21 = 84). The bbox_prdict layer, used for adjusting candidate region positions, outputs a 4K-dimensional array t representing the translation and scaling parameters to apply when the proposal belongs to each of the K classes.
The loss_bbox layer uses the SmoothL1Loss function as the loss for detection box localization. It compares the translation-and-scaling parameters t^u predicted for the true class u during regression with the ground-truth parameters v, with the calculation formula

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} g(t_i^u - v_i)    (2)

where the function g is the Smooth L1 error, with the value formula

g(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.    (3)

The SoftmaxWithLoss and SmoothL1Loss functions are used as loss functions to iterate and minimize the loss; the final trained model is obtained when iteration completes. The result of the loss function is a weighted sum of the classification result and the regression result, with the regression loss not considered when the proposal is classified as background:

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)    (4)
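The localization and total losses of equations (2)-(4) can be sketched as follows; λ = 1 is an assumption, since the text only says "weighted sum":

```python
def smooth_l1(x):
    """Smooth L1 error g: quadratic near 0, linear beyond |x| = 1 (equation 3)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def loc_loss(t_u, v):
    """L_loc: sum of smooth L1 over the 4 box parameters x, y, w, h (equation 2)."""
    return sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v))

def total_loss(l_cls, u, l_loc, lam=1.0):
    """Weighted sum; box loss dropped when the true class u is background, u = 0 (eq. 4)."""
    return l_cls + (lam * l_loc if u >= 1 else 0.0)

assert smooth_l1(0.5) == 0.125                     # quadratic branch: 0.5 * 0.5^2
assert smooth_l1(2.0) == 1.5                       # linear branch: |2| - 0.5
assert loc_loss((1, 1, 1, 1), (1, 1, 1, 1)) == 0.0 # perfect regression: zero loss
assert total_loss(0.7, u=0, l_loc=5.0) == 0.7      # background: no box loss
```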
Step (6): use the model obtained by training as the initialization model, fix all parameters in the test network, and use the softmax classifier and the object candidate region proposal method to obtain the classification and regression results for the image to be tested.
It should be noted that although the above embodiment of the present invention is illustrative, it does not limit the invention, and the invention is therefore not limited to the above specific embodiment. Without departing from the principles of the invention, any other embodiment obtained by those skilled in the art under the inspiration of the invention shall be regarded as within the protection of the invention.
Claims (3)
1. A target detection method based on deep and shallow convolutional neural networks, characterized by comprising the following steps:

Step 1: pre-train the target detection model on the ImageNet dataset and use it to initialize the model parameters;

Step 2: perform feature extraction on the training image, namely:

Step 2.1: apply convolution operations to the training image to extract its convolutional features;

Step 2.2: feed the convolutional features obtained in step 2.1 into a shallow convolutional neural network and a deep convolutional neural network, respectively, for feature extraction;

Step 2.3: combine the features extracted by the shallow and deep convolutional neural networks of step 2.2 and compress them into one unified space, obtaining a joint feature map;

Step 3: traverse the joint feature map obtained in step 2 with a sliding window and convolution, generate a certain number of region proposal boxes on the joint feature map using the anchor mechanism of a region proposal network, and extract region proposal feature maps from the joint feature map according to these proposal boxes;

Step 4: reduce the dimensionality of the region proposal feature maps obtained in step 3 using a feature dimension reducer;

Step 5: feed the reduced region proposal feature maps of step 4 into the target detection model and perform classification training and regression training, obtaining the final target detection model;

Step 6: feed the image to be tested into the final target detection model obtained in step 5 and obtain the classification and regression results for that image.
2. The target detection method based on deep and shallow convolutional neural networks according to claim 1, characterized in that step 3 further comprises a process of screening the region proposal boxes on the joint feature map: apply non-maximum suppression to the obtained proposal boxes; retain proposal boxes whose overlap ratio with the ground-truth region is greater than or equal to α as positive-sample proposal boxes, and boxes whose overlap ratio is less than β as negative-sample proposal boxes; finally, according to the proposal scores, select the δ highest-scoring boxes from the positive- and negative-sample proposals as the proposal boxes ultimately retained on the joint feature map, and extract region proposal feature maps from the joint feature map according to these retained proposals; where α, β, and δ are preset values with 0 < α < 1, 0 < β < 1, and δ > 1.
3. The target detection method based on deep and shallow convolutional neural networks according to claim 2, characterized in that α > β.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811035114.4A CN109241982B (en) | 2018-09-06 | 2018-09-06 | Target detection method based on deep and shallow layer convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109241982A | 2019-01-18 |
CN109241982B | 2021-01-29 |
Family
ID=65060758
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811035114.4A Expired - Fee Related CN109241982B (en) | 2018-09-06 | 2018-09-06 | Target detection method based on deep and shallow layer convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241982B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902732A (en) * | 2019-02-22 | 2019-06-18 | 哈尔滨工业大学(深圳) | Automobile automatic recognition method and relevant apparatus |
CN109934115A (en) * | 2019-02-18 | 2019-06-25 | 苏州市科远软件技术开发有限公司 | Construction method, face identification method and the electronic equipment of human face recognition model |
CN110032975A (en) * | 2019-04-15 | 2019-07-19 | 禁核试北京国家数据中心 | A kind of pick-up method of seismic phase |
CN110276289A (en) * | 2019-06-17 | 2019-09-24 | 厦门美图之家科技有限公司 | Generate the method and human face characteristic point method for tracing of Matching Model |
CN110516670A (en) * | 2019-08-26 | 2019-11-29 | 广西师范大学 | Suggested based on scene grade and region from the object detection method for paying attention to module |
CN111209962A (en) * | 2020-01-06 | 2020-05-29 | 电子科技大学 | Combined image classification method based on CNN (CNN) feature extraction network) and combined heat map feature regression |
CN111310760A (en) * | 2020-02-13 | 2020-06-19 | 辽宁师范大学 | Method for detecting onychomycosis characters by combining local prior characteristics and depth convolution characteristics |
CN111461161A (en) * | 2019-01-22 | 2020-07-28 | 斯特拉德视觉公司 | Object detection method and device based on CNN and strong fluctuation resistance |
CN111538916A (en) * | 2020-04-20 | 2020-08-14 | 重庆大学 | Interest point recommendation method based on neural network and geographic influence |
CN111583655A (en) * | 2020-05-29 | 2020-08-25 | 苏州大学 | Traffic flow detection method, device, equipment and medium |
CN111783942A (en) * | 2020-06-08 | 2020-10-16 | 北京航天自动控制研究所 | Brain cognition process simulation method based on convolution cyclic neural network |
EP3690731A3 (en) * | 2019-01-31 | 2020-10-28 | StradVision, Inc. | Method and device for attention-driven resource allocation by using reinforcement learning and v2x communication to thereby achieve safety of autonomous driving |
CN113449406A (en) * | 2020-03-27 | 2021-09-28 | 华晨宝马汽车有限公司 | Tightening tool scheme recommendation method and device and storage medium |
CN113516040A (en) * | 2021-05-12 | 2021-10-19 | 山东浪潮科学研究院有限公司 | Method for improving two-stage target detection |
CN113657933A (en) * | 2021-08-16 | 2021-11-16 | 浙江新再灵科技股份有限公司 | Preparation method of elevator advertisement recommendation data |
CN114882596A (en) * | 2022-07-08 | 2022-08-09 | 深圳市信润富联数字科技有限公司 | Behavior early warning method and device, electronic equipment and storage medium |
CN116910371A (en) * | 2023-09-07 | 2023-10-20 | 南京大数据集团有限公司 | Recommendation method and system based on deep relation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512661A (en) * | 2015-11-25 | 2016-04-20 | 中国人民解放军信息工程大学 | Multi-mode-characteristic-fusion-based remote-sensing image classification method |
CN106599939A (en) * | 2016-12-30 | 2017-04-26 | 深圳市唯特视科技有限公司 | Real-time target detection method based on region convolutional neural network |
WO2017079522A1 (en) * | 2015-11-04 | 2017-05-11 | Nec Laboratories America, Inc. | Subcategory-aware convolutional neural networks for object detection |
CN107240066A (en) * | 2017-04-28 | 2017-10-10 | 天津大学 | Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks |
CN107239731A (en) * | 2017-04-17 | 2017-10-10 | 浙江工业大学 | A kind of gestures detection and recognition methods based on Faster R CNN |
CN107316001A (en) * | 2017-05-31 | 2017-11-03 | 天津大学 | Small and intensive method for traffic sign detection in a kind of automatic Pilot scene |
CN107437245A (en) * | 2017-06-26 | 2017-12-05 | 西南交通大学 | High-speed railway touching net method for diagnosing faults based on depth convolutional neural networks |
CN107742099A (en) * | 2017-09-30 | 2018-02-27 | 四川云图睿视科技有限公司 | A kind of crowd density estimation based on full convolutional network, the method for demographics |
CN107862287A (en) * | 2017-11-08 | 2018-03-30 | 吉林大学 | A kind of front zonule object identification and vehicle early warning method |
CN108230269A (en) * | 2017-12-28 | 2018-06-29 | 北京智慧眼科技股份有限公司 | Grid method, device, equipment and storage medium are gone based on depth residual error network |
- 2018-09-06: Application CN201811035114.4A filed; granted as CN109241982B; current status: Expired - Fee Related
Non-Patent Citations (4)
Title |
---|
Fisher Yu et al.: "Multi-scale context aggregation by dilated convolutions", arXiv * |
Shaoqing Ren et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence * |
张飞云: "Research on vehicle localization and vehicle type recognition based on deep learning", China Masters' Theses Full-text Database, Engineering Science and Technology II * |
曹诗雨 et al.: "Vehicle target detection based on Fast R-CNN", Journal of Image and Graphics * |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461161A (en) * | 2019-01-22 | 2020-07-28 | 斯特拉德视觉公司 | Object detection method and device based on CNN and strong fluctuation resistance |
CN111461161B (en) * | 2019-01-22 | 2024-03-15 | 斯特拉德视觉公司 | CNN-based object detection method and device with strong fluctuation resistance |
US11010668B2 (en) | 2019-01-31 | 2021-05-18 | StradVision, Inc. | Method and device for attention-driven resource allocation by using reinforcement learning and V2X communication to thereby achieve safety of autonomous driving |
EP3690731A3 (en) * | 2019-01-31 | 2020-10-28 | StradVision, Inc. | Method and device for attention-driven resource allocation by using reinforcement learning and v2x communication to thereby achieve safety of autonomous driving |
CN109934115A (en) * | 2019-02-18 | 2019-06-25 | 苏州市科远软件技术开发有限公司 | Construction method, face identification method and the electronic equipment of human face recognition model |
CN109934115B (en) * | 2019-02-18 | 2021-11-02 | 苏州市科远软件技术开发有限公司 | Face recognition model construction method, face recognition method and electronic equipment |
CN109902732B (en) * | 2019-02-22 | 2021-08-27 | 哈尔滨工业大学(深圳) | Automatic vehicle classification method and related device |
CN109902732A (en) * | 2019-02-22 | 2019-06-18 | 哈尔滨工业大学(深圳) | Automatic vehicle classification method and related device |
CN110032975B (en) * | 2019-04-15 | 2021-09-07 | 禁核试北京国家数据中心 | Seismic facies picking method |
CN110032975A (en) * | 2019-04-15 | 2019-07-19 | 禁核试北京国家数据中心 | Seismic phase picking method |
CN110276289A (en) * | 2019-06-17 | 2019-09-24 | 厦门美图之家科技有限公司 | Generate the method and human face characteristic point method for tracing of Matching Model |
CN110276289B (en) * | 2019-06-17 | 2021-09-07 | 厦门美图之家科技有限公司 | Method for generating matching model and face characteristic point tracking method |
CN110516670B (en) * | 2019-08-26 | 2022-04-22 | 广西师范大学 | Target detection method based on scene level and area suggestion self-attention module |
CN110516670A (en) * | 2019-08-26 | 2019-11-29 | 广西师范大学 | Target detection method based on scene level and region proposal self-attention module |
CN111209962B (en) * | 2020-01-06 | 2023-02-03 | 电子科技大学 | Combined image classification method based on CNN feature extraction network and combined heat map feature regression |
CN111209962A (en) * | 2020-01-06 | 2020-05-29 | 电子科技大学 | Combined image classification method based on CNN feature extraction network and combined heat map feature regression |
CN111310760A (en) * | 2020-02-13 | 2020-06-19 | 辽宁师范大学 | Method for detecting oracle bone inscription characters by combining local prior features and deep convolutional features |
CN111310760B (en) * | 2020-02-13 | 2023-05-26 | 辽宁师范大学 | Method for detecting oracle bone inscription characters by combining local prior features and deep convolutional features |
CN113449406A (en) * | 2020-03-27 | 2021-09-28 | 华晨宝马汽车有限公司 | Tightening tool scheme recommendation method and device and storage medium |
CN113449406B (en) * | 2020-03-27 | 2024-01-23 | 华晨宝马汽车有限公司 | Tightening tool scheme recommendation method and device and storage medium |
CN111538916B (en) * | 2020-04-20 | 2023-04-18 | 重庆大学 | Interest point recommendation method based on neural network and geographic influence |
CN111538916A (en) * | 2020-04-20 | 2020-08-14 | 重庆大学 | Interest point recommendation method based on neural network and geographic influence |
CN111583655B (en) * | 2020-05-29 | 2021-12-24 | 苏州大学 | Traffic flow detection method, device, equipment and medium |
CN111583655A (en) * | 2020-05-29 | 2020-08-25 | 苏州大学 | Traffic flow detection method, device, equipment and medium |
CN111783942B (en) * | 2020-06-08 | 2023-08-01 | 北京航天自动控制研究所 | Brain cognitive process simulation method based on convolutional recurrent neural network |
CN111783942A (en) * | 2020-06-08 | 2020-10-16 | 北京航天自动控制研究所 | Brain cognitive process simulation method based on convolutional recurrent neural network |
CN113516040B (en) * | 2021-05-12 | 2023-06-20 | 山东浪潮科学研究院有限公司 | Method for improving two-stage target detection |
CN113516040A (en) * | 2021-05-12 | 2021-10-19 | 山东浪潮科学研究院有限公司 | Method for improving two-stage target detection |
CN113657933A (en) * | 2021-08-16 | 2021-11-16 | 浙江新再灵科技股份有限公司 | Preparation method of elevator advertisement recommendation data |
CN114882596B (en) * | 2022-07-08 | 2022-11-15 | 深圳市信润富联数字科技有限公司 | Behavior early warning method and device, electronic equipment and storage medium |
CN114882596A (en) * | 2022-07-08 | 2022-08-09 | 深圳市信润富联数字科技有限公司 | Behavior early warning method and device, electronic equipment and storage medium |
CN116910371A (en) * | 2023-09-07 | 2023-10-20 | 南京大数据集团有限公司 | Recommendation method and system based on deep relation |
CN116910371B (en) * | 2023-09-07 | 2024-01-23 | 南京大数据集团有限公司 | Recommendation method and system based on deep relation |
Also Published As
Publication number | Publication date |
---|---|
CN109241982B (en) | 2021-01-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241982A (en) | Object detection method based on depth layer convolutional neural networks | |
CN111797716B (en) | Single target tracking method based on Siamese network | |
CN110472627B (en) | End-to-end SAR image recognition method, device and storage medium | |
CN111640125B (en) | Building detection and segmentation method and device for aerial images based on Mask R-CNN | |
CN108416266B (en) | Method for rapidly identifying video behaviors by extracting moving object through optical flow | |
CN108537824B (en) | Feature map enhanced network structure optimization method based on alternating deconvolution and convolution | |
CN109034210A (en) | Object detection method based on hyper-feature fusion and multi-scale pyramid network | |
CN110033473B (en) | Moving target tracking method based on template matching and depth classification network | |
CN108009509A (en) | Vehicle target detection method | |
CN109961034A (en) | Video object detection method based on convolutional gated recurrent neural unit | |
CN110516539A (en) | Remote sensing image building extraction method, system, storage medium and equipment based on adversarial network | |
CN110443763B (en) | Convolutional neural network-based image shadow removing method | |
CN112784736B (en) | Character interaction behavior recognition method based on multi-modal feature fusion | |
Wang et al. | Object instance detection with pruned Alexnet and extended training data | |
CN111401293B (en) | Gesture recognition method based on head-lightweight Mask Scoring R-CNN | |
CN113160062B (en) | Infrared image target detection method, device, equipment and storage medium | |
CN106023257A (en) | Target tracking method based on rotor UAV platform | |
Pavel et al. | Recurrent convolutional neural networks for object-class segmentation of RGB-D video | |
CN111833322B (en) | Garbage multi-target detection method based on improved YOLOv3 | |
CN111460980A (en) | Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion | |
CN112613350A (en) | High-resolution optical remote sensing image airplane target detection method based on deep neural network | |
CN110334656A (en) | Multi-source remote sensing image water body extraction method and device based on information source probability weighting | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
Zhou et al. | Building segmentation from airborne VHR images using Mask R-CNN | |
CN110363218A (en) | Non-invasive embryo assessment method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 2021-01-29; Termination date: 2021-09-06