CN109145769A - Target detection network design method fusing image segmentation features - Google Patents
- Publication number
- CN109145769A (application number CN201810860392.7A)
- Authority
- CN
- China
- Prior art keywords
- network
- target
- feature
- image segmentation
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
Abstract
A target detection network design method fusing image segmentation features; the method is especially effective for large targets. It combines the general-purpose detection framework Mask RCNN with fused image segmentation features: on top of the basic Mask RCNN algorithm, the target segmentation features are integrated with the features produced by the ResNet-101 convolutional network and fed jointly into the RPN, RoI Pooling, and RoI Align modules. Experiments show that the method is significantly effective for large targets; given an image segmentation algorithm that also performs well on small targets, the method can be improved across the board, enabling self-iteration of Mask RCNN.
Description
Technical field
The invention belongs to the field of pedestrian detection methods, and in particular relates to a target detection network design method fusing image segmentation features.
Background technique
With the development of science and technology and the progress of the times, our way of life keeps changing. Modes of travel are constantly updated, and the automobile is the most common vehicle in the contemporary environment. According to statistics from the Traffic Management Bureau of the Ministry of Public Security, by the end of June 2017 the number of registered motor vehicles in China had reached 304 million, of which 205 million were automobiles, making road traffic safety a particularly pressing problem. According to incomplete statistics, the annual number of traffic-accident deaths in low- and middle-income countries accounts for more than 90% of the global total, even though these countries own only 48% of the world's vehicles. Behind these alarming figures lies a deeper question about traffic safety: analysis of the causes of traffic accidents reveals many contributing factors, but insufficient attention to pedestrians is among the most important.
To address this problem, researchers at home and abroad have proposed many solutions, the most typical being the Advanced Driver Assistance System (ADAS), in which the most critical technology is pedestrian detection. The ultimate goal of pedestrian detection is to judge whether pedestrians are present in a video sequence or image and, on that basis, to accurately outline their positions. Although current research can identify pedestrians to a certain degree, many cases still cannot be identified and distinguished with full accuracy.
In the field of pedestrian detection, early feature extraction mainly used the HOG feature. But because HOG is hand-designed, its extraction procedure is fixed, and pedestrians are recognized well only in an upright standing pose. Many researchers at the time therefore proposed the idea of feature fusion, blending HOG with other image features such as image segmentation features, image depth features, and image edge features. Recently, convolutional neural networks have developed rapidly in computer vision and have gradually replaced hand-designed features, but their performance still leaves room for improvement, and the idea of feature fusion remains applicable.
In a pedestrian detection model designed with the Mask RCNN algorithm, recognizing smaller targets requires enlarging the image so that small targets fall within the range of the region proposal windows; however, enlargement also magnifies some non-pedestrian details, making their shapes approximately resemble pedestrian shapes.
Summary of the invention
The object of the present invention is to provide a target detection network design method fusing image segmentation features; the method is significantly effective for large targets.
The method combines the general-purpose detection framework Mask RCNN with fused image segmentation features: on top of the basic Mask RCNN algorithm, the target segmentation features and the features obtained by the ResNet-101 convolutional network are integrated and fed jointly into the RPN, RoI Pooling, and RoI Align modules. The method is tested on the MS COCO test set to verify its validity.
The target segmentation image in this multi-feature-fusion pedestrian detection method has a useful property: it does not generate excessive detail after enlargement. Based on this idea, image segmentation features are fused into the Mask RCNN algorithm to improve its performance.
The advantages are as follows:
The technical solution describes in detail the algorithm design for fusing image segmentation features into Mask RCNN. It first explains the motivation for introducing an image segmentation method and the procedure of the method. It then describes the DeepLabv3 image segmentation network and explains the effect of the newly introduced dilated (atrous) convolution. Next it details the feature fusion between the image segmentation network and the feature pyramid. Finally, experiments show that the method is significantly effective for large targets; given an image segmentation algorithm that also works well on small targets, it can be improved comprehensively, enabling self-iteration of Mask RCNN.
Detailed description of the invention
Fig. 1 is a schematic diagram of the Mask RCNN algorithm structure fused with image segmentation features.
Fig. 2 is a schematic diagram of dilated convolution: (a) a 3*3 convolution kernel.
Fig. 3 is a schematic diagram of dilated convolution: (b) a 3*3 dilated kernel equivalent to a 7*7 kernel.
Fig. 4 is a schematic diagram of dilated convolution: (c) a 3*3 dilated kernel equivalent to a 15*15 kernel.
Fig. 5 is a comparison before and after image segmentation network processing: (a) the original image.
Fig. 6 is a comparison before and after image segmentation network processing: (b) the result after processing.
Fig. 7 is a comparison before and after image segmentation network processing.
Fig. 8 is a schematic diagram of fusing image segmentation features.
Fig. 9 is a schematic diagram of the residual network unit structure.
Fig. 10 is a schematic diagram of part of the residual network structure.
Fig. 11 is the region proposal network algorithm flowchart.
Fig. 12 is the candidate window classification and segmentation processing flowchart.
Fig. 13 is the histogram of person-target height distribution in the MS COCO training set.
Fig. 14 shows Cityscapes test results at different resolutions: (a) histogram at 0.5x scaling.
Fig. 15 shows Cityscapes test results at different resolutions: (b) histogram at 1x scaling.
Fig. 16 shows Cityscapes test results at different resolutions: (c) histogram at 2x scaling.
Fig. 17 shows Cityscapes test results at different scaling ratios.
Fig. 18 shows false detection of in-vehicle occupants as pedestrians: (a) a passenger inside a vehicle.
Fig. 19 shows false detection of in-vehicle occupants as pedestrians: (b) a driver inside a vehicle.
Fig. 20 is a comparison before and after applying the optimized algorithm: (a) before the improvement.
Fig. 21 is a comparison before and after applying the optimized algorithm: (b) after the improvement.
Fig. 22 shows the optimized algorithm test results.
Specific embodiment
The target detection network design method fusing image segmentation features:
Network structure design:
The structure of the Mask RCNN algorithm fused with image segmentation features is shown in Fig. 1.
Image segmentation network introduction:
As can be seen from Fig. 1, unlike the target segmentation branch inside Mask RCNN, the target segmentation network added for this feature fusion is a module with independent processing capability. The DeepLabv3 semantic segmentation algorithm is chosen as the target segmentation network here. The DeepLabv3 method consists of two steps:
(1) Obtain a preliminary segmentation result with a fully convolutional network and interpolate it back to the original image size.
(2) Apply the fully connected CRF algorithm to the interpolated segmentation result for fine correction of details, iterating several times to obtain the optimal segmentation.
The fully convolutional design of DeepLabv3 makes it end-to-end and very convenient to train. It uses dilated convolution (also called atrous convolution), which can effectively replace pooling layers and reduce information loss. Image segmentation algorithms based on convolutional neural networks use an encoder-decoder model: the encoder progressively reduces the spatial dimensions of the input, and the decoder (e.g., a deconvolution network) progressively restores target details and their spatial positions.
The encoder usually enlarges the receptive field with pooling layers that reduce the input size, but pooling loses a great deal of information; enlarging the receptive field by increasing the convolution kernel size instead greatly increases computation, especially in the middle of the encoder, where the feature matrix typically has 512 or 1024 channels, so that replacing a 3*3 kernel with 5*5 or 7*7 there makes the computation explode. Dilated convolution is therefore introduced to replace pooling; schematic diagrams are shown in Figs. 2-4.
Fig. 2(a) is a normal 3*3 kernel; Fig. 3(b) shows a 3*3 dilated kernel replacing a 7*7 kernel; Fig. 4(c) shows a 3*3 dilated kernel replacing a 15*15 kernel. The blue region is the area covered by the kernel; apart from the red dots, the rest of the kernel is zero padding. As Figs. 2-4 show, dilated convolution can replace pooling, increasing the receptive field without losing information or increasing computation, so that each convolution output covers a larger area.
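The receptive-field arithmetic behind Figs. 2-4 can be sketched in a few lines. The dilation schedule 1, 2, 4 is the classic exponential setting and is assumed here for illustration; it is not stated explicitly in the text.

```python
# Receptive field of stacked 3x3 dilated convolutions (illustrative;
# the dilation rates 1, 2, 4 are an assumed exponential schedule).

def effective_kernel(k, d):
    """Effective size of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += effective_kernel(k, d) - 1
    return rf

# One 3x3 conv covers 3x3; adding a dilation-2 layer covers 7x7;
# adding a dilation-4 layer covers 15x15 -- without pooling and
# without increasing the number of weights per layer.
print(receptive_field([3], [1]))              # 3
print(receptive_field([3, 3], [1, 2]))        # 7
print(receptive_field([3, 3, 3], [1, 2, 4]))  # 15
```

This matches the equivalences drawn in Figs. 2-4: the 3*3 dilated kernels cover the same areas as dense 7*7 and 15*15 kernels.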
The before/after comparison of the image segmentation network is shown in Figs. 5-6.
Fig. 5 is the original image and Fig. 6 the processed result. After the image segmentation network, the contours of nearby pedestrians are very clear, while the contours of distant pedestrians are not. The result obtained after feature fusion, the region proposal network, and candidate window classification and segmentation processing is shown in Fig. 7.
The numbers beside the persons in Fig. 7 fall in the range 0.994-0.998: 0.998 appears most often (seven times), 0.994 twice, and 0.996 three times.
Comparing Figs. 5-6 with Fig. 7 shows that the image segmentation result cannot accurately locate every target in the figure, but it can roughly separate the complicated background from the targets, laying the foundation for subsequent, more accurate target detection and segmentation. This shows that fusing image segmentation features is highly important.
Because the Mask RCNN network in this design introduces a feature pyramid structure in feature extraction, the added target segmentation network must be adjusted accordingly. Earlier feature fusion work used Faster RCNN as the basic framework, with part of the Vgg16 convolutional network as the feature extractor, producing a single overall feature matrix; the feature pyramid introduced here instead produces feature matrices at 5 successively decreasing resolutions. The fusion with the DeepLabv3 image segmentation network is shown in Fig. 8. As the figure shows, to blend the feature matrix output by the segmentation network with the feature pyramid outputs, pooling layers repeatedly reduce its resolution. Several fusion methods are possible, such as element-wise addition or multiplication at equal resolution, or channel concatenation. Here the input segmentation feature has 3 channels, which does not match the 256 channels output by the feature pyramid, so channel concatenation is used, raising the channel count to 259; a convolutional layer then reduces it back to 256 channels so that it can connect directly to the following network.
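A minimal numpy sketch of this fusion step, with random placeholder tensors: the 32*32 spatial size and the weights are illustrative only, and just the 256 + 3 = 259 → 256 channel arithmetic follows the text.

```python
import numpy as np

# Channel-concatenation fusion followed by a 1x1 convolution
# (random placeholder tensors; shapes follow the channel counts
# described in the text).
rng = np.random.default_rng(0)

fpn_feat = rng.standard_normal((32, 32, 256))  # one FPN pyramid level
seg_feat = rng.standard_normal((32, 32, 3))    # segmentation output, 3 channels

# Stack along the channel axis: 256 + 3 = 259 channels.
fused = np.concatenate([fpn_feat, seg_feat], axis=-1)

# A 1x1 convolution is just a per-pixel linear map over channels,
# reducing 259 back to 256 so later layers plug in unchanged.
w = rng.standard_normal((259, 256)) * 0.01
reduced = fused @ w

print(fused.shape)    # (32, 32, 259)
print(reduced.shape)  # (32, 32, 256)
```

The 1*1 convolution here plays exactly the dimension-reduction role described above, so the rest of the network sees the same 256-channel input it was designed for.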
The feature extraction network of Mask RCNN is a single track; the newly added target segmentation network and the original feature extraction network form two branches. During training, when gradients back-propagate, the original network weights must be frozen and only the weights of the newly added target segmentation network trained. Without freezing, the error introduced by the new network would seriously pollute the original network and severely degrade model performance.
The weight training of the new network builds on the original Mask RCNN, again using MS COCO as the training set, and is divided into three steps: (1) first freeze all original weights and train only the newly added target segmentation network weights until the loss returns to its original value; (2) unfreeze and train the network weights between the input and the FPN together with the target segmentation network weights, keeping the remaining weights frozen; (3) after the network stabilizes, train all weights.
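The three-stage schedule can be sketched as toggling per-module trainable flags, in the spirit of Keras layer freezing. The module names and the exact stage-2 grouping are an interpretation for illustration, not taken verbatim from the text.

```python
# Staged freezing sketch: which module groups are trainable per stage.
# Module names ("backbone", "fpn", ...) are illustrative placeholders.

modules = ["backbone", "fpn", "rpn", "heads", "seg_branch"]

def set_trainable(stage):
    if stage == 1:    # train only the newly added segmentation network
        trainable = {"seg_branch"}
    elif stage == 2:  # also unfreeze everything between input and FPN
        trainable = {"backbone", "fpn", "seg_branch"}
    else:             # stage 3: fine-tune all weights
        trainable = set(modules)
    return {m: (m in trainable) for m in modules}

stage1 = set_trainable(1)
stage2 = set_trainable(2)
stage3 = set_trainable(3)
print(stage1)  # only seg_branch is trainable in stage 1
```

In a Keras reproduction this would correspond to setting `layer.trainable` per layer group and recompiling before each stage.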
Preliminary experimental analysis of the foregoing:
After fusing image segmentation features, Mask RCNN's recognition rate in pedestrian detection improves little for small and medium targets, but improves significantly for large targets. The underlying reason is that the fused image segmentation feature has poor clarity on small targets and can only act as seed points, whereas its effect on extracting larger targets is significant. As image segmentation algorithms continue to improve, the performance gain brought by feature fusion will keep growing. The specific experimental data are shown in Table 1. The test set for this experiment is the person targets in the MS COCO test set.
Table 1 Accuracy comparison
Feature extraction network design:
An open-source CNN with good current performance is chosen as the feature extraction network. The ResNet-101 network performs excellently in feature extraction; compared with other convolutional neural networks it adds a residual function, which allows a CNN to reach great depth without degradation. In computer vision, as the number of network layers increases, the extracted features become higher-level and closer to semantic information. Before residual networks appeared, overly deep networks suffered gradient vanishing or gradient explosion; with the degradation problem solved, performance keeps rising with depth, as in the steady improvement from ResNet-50 to ResNet-101 to ResNet-152.
The specific residual function structure is shown in Fig. 9. Let the input feature matrix be x and the intermediate weight network be F; then the feature matrix output to the next layer is H(x) = F(x) + x, and the function the unit network fits is F(x) = H(x) - x. The original purpose is to model identity learning: it is believed that learning the mapping F(x) = x is harder for a network than learning F(x) = 0. In addition, the residual structure makes the output feature matrix more sensitive to changes in the intermediate weight network F. The idea of the residual network is to remove the identical main part and thereby highlight changes in the feature matrix, which is quite similar to a differential amplifier in a circuit: as the differential amplifier suppresses line interference in long-distance signal transmission, the residual network suppresses gradient vanishing and gradient explosion in deep networks.
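A minimal numpy rendering of the unit H(x) = F(x) + x; the two-layer F with small random weights is an illustrative stand-in for the intermediate weight network, not the ResNet-101 block itself.

```python
import numpy as np

# H(x) = F(x) + x as a minimal residual unit (weights are random
# placeholders, illustrative only).
rng = np.random.default_rng(0)
w1 = rng.standard_normal((64, 64)) * 0.01
w2 = rng.standard_normal((64, 64)) * 0.01

def F(x):
    # weight layer -> ReLU -> weight layer
    return np.maximum(x @ w1, 0) @ w2

def residual_unit(x):
    # identity shortcut: the unit only has to fit H(x) - x
    return F(x) + x

x = rng.standard_normal((1, 64))
h = residual_unit(x)

# With near-zero weights F(x) is ~0, so the unit starts close to the
# identity map -- the "easier to learn" case described above.
print(h.shape)
```

With F initialized near zero the unit already implements the identity, which is exactly why fitting F(x) = 0 is easier than fitting H(x) = x from scratch.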
The structure of the feature extraction part is shown in Fig. 10.
In Fig. 10, the x branch after the max pooling layer adds a convolutional layer with a 1*1 kernel and a BN (Batch Normalization) layer, whose role is to change the matrix dimension; subsequent residual units omit this convolutional layer and BN layer, and the input feature matrix is added directly to F(x). Each F(x) structure uses three convolutions: the first and last use 1*1 kernels to change the matrix dimension, and the middle one uses a 3*3 kernel. The rest of the residual network is a repeated stack of this unit; the kernel stays 3*3 throughout while the channel count of the feature matrix keeps changing.
The Mask RCNN algorithm is reproduced with the Keras deep learning library on the TensorFlow backend, and the model is trained on the MS COCO data set. The current mainstream approaches in target detection are FPN and integrating context information. Integrating context information means combining low-level visual information with high-level semantic information, which can be done in several ways, such as element-wise addition or multiplication, or feature-map concatenation that increases the channel count; element-wise addition works best, so it is adopted here.
TensorFlow supports multi-GPU parallel training well. With stochastic gradient descent, the more sample images per batch, the better the model's generalization and the more stable the loss decline; but MS COCO image resolutions vary. To improve batching, input images are uniformly resized to 1024*1024 resolution while preserving the original aspect ratio, with the remaining area zero-padded.
Table 2 lists the specific dimensions of the ResNet-101 residual network when combined with the feature pyramid, and the resolutions after the feature pyramid reduces their dimensionality. The feature pyramid has 5 layers, while the residual network outputs only 4 feature matrices; the last-layer feature matrix is obtained by directly downsampling the second-to-last layer.
Integrating context information helps small target detection, and pedestrian detection involves many small targets. Therefore, after the FPN feature matrices are obtained, the high-level feature matrix is deconvolved to match the dimension of the previous layer and fused with it by element-wise matrix addition.
Table 2 Feature matrix resolution comparison
At this point, feature extraction is complete and each image has been converted into 5 feature matrices; the region proposal algorithm then finds foreground targets from these 5 feature matrices.
Region proposal network design:
The basic procedure of the RCNN family is: first extract features from the image, then obtain foreground targets from the feature matrix. Earlier methods selected foreground targets with a sliding window, which can be viewed as many small tasks accumulated and processed serially. Faster RCNN instead proposed the RPN region proposal structure using anchors, turning the serial sliding-window task into a parallel anchor task and greatly accelerating processing. Mask RCNN selects foreground targets almost the same way as Faster RCNN, except that because of the FPN, each anchor at each FPN layer does not use the 3 scales and 3 aspect ratios (9 combinations in total) of Faster RCNN but only 3 shapes at a single scale: vertical rectangle, horizontal rectangle, and square. For example, the first FPN layer's feature matrix has resolution 256*256*256 and generates 256*256*3 preselection windows in total; the second layer's feature matrix has resolution 128*128*256 and generates 128*128*3 preselection windows. From the original image resolution and each feature matrix resolution, the coordinates of the three preselection windows of each anchor can be calculated; a convolutional layer with a 3*3 kernel then produces a new matrix whose values are the target probability and the four coordinate offsets generated for each window. Four variables P_cx, P_cy, P_w, P_h denote the center abscissa, center ordinate, window width, and window height of each anchor preselection window. The four coordinate offsets d_x, d_y, d_w, d_h are the translation of the center abscissa, the translation of the center ordinate, the window width scaling factor, and the window height scaling factor. The refined window values P'_cx, P'_cy, P'_w, P'_h are generated as in formula (1):
P'_cx = P_cx + P_w * d_x, P'_cy = P_cy + P_h * d_y, P'_w = P_w * exp(d_w), P'_h = P_h * exp(d_h) (1).
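The refinement of formula (1) corresponds to the standard Faster RCNN box parameterization, assumed here and sketched below: center shifts are width/height-relative, and sizes are scaled in log space.

```python
import math

# Applying the four regression offsets to an anchor window
# (standard Faster RCNN parameterization, illustrative values).

def refine_window(pcx, pcy, pw, ph, dx, dy, dw, dh):
    cx = pcx + pw * dx     # shift center by a width-relative amount
    cy = pcy + ph * dy     # shift center by a height-relative amount
    w = pw * math.exp(dw)  # scale width in log space
    h = ph * math.exp(dh)  # scale height in log space
    return cx, cy, w, h

# Zero offsets leave the anchor unchanged; dw = ln 2 doubles the width.
print(refine_window(100.0, 100.0, 64.0, 128.0, 0.0, 0.0, 0.0, 0.0))
```

The log-space scaling keeps predicted widths and heights positive regardless of the raw network output.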
Faster RCNN generates all anchor windows on a single feature layer, while Mask RCNN with FPN generates anchor windows of different sizes on different feature layers: as the feature layers go up they become more abstract, and each anchor corresponds to a larger area of the original image. Table 3 gives the window size corresponding to each RPN layer's anchors.
Table 3 RPN anchor window settings
As Table 3 shows, the anchor window settings broadly cover targets of every size in the MS COCO data set.
The overall flowchart of the region proposal network is shown in Fig. 11.
The 5 feature matrices from the feature extraction network are processed by the flow shown in Fig. 11 to finally obtain the proposal window list. This design does not use an SVM classifier but a Softmax classifier: unlike SVM scores, Softmax scores have probabilistic meaning, since Softmax maps the scores into probability space, so the final score is exactly the probability that the target belongs to the category.
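A sketch of the softmax mapping that gives the scores their probability meaning (the max-subtraction is the usual numerical-stability trick):

```python
import math

# Softmax maps raw class scores to probabilities that sum to 1, which
# is why a score can be read directly as a class probability.

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))            # 1.0
print(probs[0] > probs[1] > probs[2])  # score ordering is preserved
```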
With input images adjusted to 1024*1024 resolution, the number of generated windows is huge; even after anchor windows that extend beyond the image border are removed, the quantity remains very large, and classifying, regressing, and segmenting each one would require enormous computation. The windows are therefore sorted by target probability and processed with non-maximum suppression, after which 2000 proposal windows are retained in the training stage and 1000 in the test stage.
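A greedy non-maximum suppression sketch: windows are assumed to be (x1, y1, x2, y2, score) tuples, and the overlap threshold and keep limit are illustrative parameters.

```python
# Greedy NMS over score-sorted windows (illustrative sketch).

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, thresh=0.7, keep_top=2000):
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        # keep a window only if it does not overlap a kept one too much
        if all(iou(b, k) <= thresh for k in kept):
            kept.append(b)
        if len(kept) == keep_top:
            break
    return kept

windows = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.7)]
print(len(nms(windows, thresh=0.5)))  # 2: the overlapping pair collapses
```

With `keep_top=2000` in training and `keep_top=1000` at test time this matches the retention counts described above.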
Regarding matching anchors to sample target boxes: among all anchors, the anchor with the highest overlap with a sample target box, and any anchor with overlap greater than 0.7, are positive samples; anchors with overlap less than 0.3 are negative samples; the rest are neutral. To balance positive and negative samples, the number of positives must not exceed half of the total selected anchors, and any excess positive or negative samples are set to neutral. All anchors are the union of the anchors from the different FPN layers.
Candidate window classification and segmentation processing design:
As described above, the RPN region proposal network yields preliminary foreground targets; this subsection processes the resulting proposals.
Because of the fully connected layers in Faster RCNN, images of arbitrary resolution must be quantized to a uniform size before the fully connected layers. Faster RCNN does this with a RoI Pooling layer, but RoI Pooling cannot guarantee pixel-level one-to-one correspondence between input and output; this hardly affects classification, but it significantly affects pixel-level target segmentation. The RoI Align layer of Mask RCNN removes all the quantization in RoI Pooling. In this reproduction, the target detection branch keeps the 7*7 output size of Faster RCNN's RoI Pooling, while the target segmentation branch uses 14*14. The design uses bilinear interpolation to reduce the error introduced when extracting regions of interest from the feature map.
Both the target segmentation network and the target detection network receive candidate windows from the region proposal network and, combined with the 5 feature pyramid layers obtained earlier, extract the local feature matrices corresponding to the candidate windows, i.e., the feature matrices output by the RoI Pooling and RoI Align layers. The two operate on the same principle and differ only in resolution.
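The bilinear interpolation that RoI Align uses in place of quantization can be sketched at a single fractional coordinate; the 2*2 feature map is an illustrative toy.

```python
# Bilinear interpolation at a fractional feature-map coordinate -- the
# operation RoI Align uses instead of rounding to integer cells.

def bilinear(feat, y, x):
    """feat: 2D list of values; (y, x): fractional position."""
    y0, x0 = int(y), int(x)
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0  # fractional parts act as blend weights
    return ((1 - wy) * (1 - wx) * feat[y0][x0]
            + (1 - wy) * wx * feat[y0][x1]
            + wy * (1 - wx) * feat[y1][x0]
            + wy * wx * feat[y1][x1])

feat = [[0.0, 1.0], [2.0, 3.0]]
# Halfway between all four cells: the exact average, no rounding error.
print(bilinear(feat, 0.5, 0.5))  # 1.5
```

RoI Pooling would snap (0.5, 0.5) to a cell corner instead, which is exactly the pixel-level misalignment the text describes.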
The candidate window classification and segmentation flowchart is shown in Fig. 12. As the figure shows, in this reproduction the feature matrix from the RoI Pooling layer is processed by two convolutional layers with 1*1 kernels, which act as fully connected layers. Because the MS COCO data set is relatively complex, the dropout layers that Faster RCNN adds after the convolutional layers are not copied. Each candidate window yields an 81-dimensional vector, which after the sigmoid function gives 81 probability values corresponding to 80 object classes and the background; the class with the highest probability is the window's target class, and in the final model output an object is marked in the image only if its probability exceeds a set threshold. While the candidate window probability is obtained, the window's position coordinates are further refined according to the target's class. Because the outputs are numerous, the same object may be labeled repeatedly, so non-maximum suppression is applied to the final detection output to remove targets with high overlap.
While detection results are obtained, the target segmentation network also processes the proposals. The feature matrix of each candidate window obtained after the RoI Align layer undergoes several convolutions followed by a deconvolution operation, and a binary segmentation map is generated separately for each class, with binarization performed by the sigmoid function. Which class's binary map the final region of interest uses is decided by the target class output by the detection branch; this also avoids the inter-class competition that target segmentation usually faces.
Loss function design:
Regarding the loss function, Mask RCNN adds an L_mask term on top of Faster RCNN, computing a cross entropy over the predicted target segmentation binary map; this is a multi-task learning formulation. The overall loss function is shown in formula (2):
L = L_cls + L_box + L_mask + L_p + L_r (2).
where L_mask is the target segmentation loss, L_cls the target detection classification loss, L_box the target detection coordinate regression loss, L_r the weight regularization loss, and L_p the region proposal network loss.
(1) Detection classification loss:
During training, the detection network obtains 200 region proposal windows, with a positive-to-negative sample ratio of 1:2. Let p be the probability value of the correct class; L_cls denotes the classification loss over the 200 proposal windows, with cross entropy selected as the metric, computed as shown in formula (3):
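Formula (3) itself is not reproduced in this text; a standard cross-entropy average over the correct-class probabilities of the proposal windows, consistent with the description above, would be:

```python
import math

def classification_loss(correct_class_probs):
    """L_cls over the region proposal windows: correct_class_probs holds,
    for each of the (here 200) windows, the predicted probability p of
    that window's correct class; the loss is the mean of -ln p."""
    n = len(correct_class_probs)
    return -sum(math.log(p) for p in correct_class_probs) / n
```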
(2) Detection coordinate regression loss:
The coordinate regression loss uses a different metric from the detection classification loss; smooth L1 is chosen as its metric. It is computed as shown in formula (4):
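Formula (4) is likewise not reproduced here; the standard smooth-L1 function it refers to, applied to the four box regression offsets, can be sketched as:

```python
def smooth_l1(x):
    # quadratic near zero, linear further out: stable gradients for small
    # errors, reduced sensitivity to outliers for large ones
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def box_loss(pred, target):
    # sum of smooth-L1 over the 4 predicted box coordinate offsets
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))
```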
(3) Target segmentation loss:
During training, the detection network obtains 200 region proposal windows, and target segmentation outputs 200 matrices of size 28*28, each element of which is a probability value between 0 and 1. The logarithmic loss function is selected to measure the segmentation result. Its definition on a single data point is given below, as shown in formula (5):
Cost(y, p(y|x)) = -y ln p(y|x) - (1-y) ln(1-p(y|x)) (5).
Since the segmentation image matrix of each window has dimension 28*28, L_mask is as shown in formula (6):
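Formula (6) is not reproduced in this text; applying formula (5) to every pixel and averaging over the 28*28 matrix, which is the usual reading, gives:

```python
import math

def mask_loss(pred, gt):
    """Per-window mask loss: pred is an HxW matrix of sigmoid
    probabilities in (0, 1), gt the matching binary ground-truth mask.
    Formula (5) is evaluated per pixel and averaged (the averaging is
    an assumption here, since formula (6) is not shown)."""
    h, w = len(pred), len(pred[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            y, p = gt[i][j], pred[i][j]
            total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / (h * w)
```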
(4) Region proposal network loss:
The RPN network only needs to distinguish whether a candidate window is foreground, so it is a binary classification problem, and its loss can be computed by analogy with L_cls and L_box.
(5) Weight regularization loss:
L_r is the sum of squares of all weight coefficients multiplied by the proportionality coefficient α, as shown in formula (7), where w denotes the trainable weight parameters of the network.
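As described, formula (7) is a plain L2 penalty; a direct sketch:

```python
def weight_regularization(weights, alpha):
    # L_r = alpha * sum of squares of all trainable weight parameters w
    return alpha * sum(w * w for w in weights)
```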
Network hyperparameter settings:
The Mask RCNN network is designed end to end, which makes training very convenient: it improves the efficiency of the overall workflow and effectively lowers the barrier for the operator. But this also has a drawback: when performance improvement stalls, it is hard to notice where the problem lies. With training executed stage by stage, the problem can be located through comparative experiments, so that the bottleneck can be broken and performance improved. During training, part of the weights can be deliberately frozen to meet the needs of staged training.
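The patent does not tie the weight freezing to any particular framework; a framework-free sketch of the three-stage schedule described later in claim 5 (the layer-name prefixes are illustrative):

```python
class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = True  # whether gradient updates touch this layer

def freeze_for_stage(layers, stage):
    """Stage 1: train only the newly added segmentation branch.
    Stage 2: freeze input-to-FPN and segmentation weights, train the rest.
    Stage 3: train all weights."""
    for layer in layers:
        if stage == 1:
            layer.trainable = layer.name.startswith("seg_")
        elif stage == 2:
            layer.trainable = not (layer.name.startswith("backbone_")
                                   or layer.name.startswith("seg_"))
        else:
            layer.trainable = True
    return layers
```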
When training a deep learning model, one must not only prepare a labeled dataset, the network structure, and initialized network weight parameters, but also set the hyperparameters that control the training process; the hyperparameters of the present network are listed in Table 4.
Table 4. Hyperparameter settings
Second experimental analysis of the foregoing:
The training set of the MS COCO dataset is selected as the training set for Mask RCNN. It contains 80 different target classes, and its target annotation, especially for small targets, is more careful and clear than that of other multi-class datasets. Before training, the training images are scaled to 1024*1024 resolution while the aspect ratio is preserved, and the sizes of the human targets are tallied; the result is shown in Figure 13. The histogram clearly shows that the distribution of target sizes is wide but uneven, with the heights of most human targets concentrated near 30 pixels.
In Mask RCNN, the region-of-interest areas corresponding to each anchor of the bottom feature map of the FPN are 16*64, 32*32, and 64*16. In training and testing, the required overlap ratio between a positive-sample region of interest and an anchor is 0.7; if the target area is too small, sufficient overlap cannot be obtained, so the recognition rate for smaller targets is poor. We scale the Cityscapes dataset by factors of 0.5, 1, and 2 and, for each of the three resolutions, test with the model trained on the MS COCO training set. The test result is shown in Figure 13: the blue histogram gives the heights of all human targets in the test set, and the red histogram gives the heights of the targets the model correctly recalled. The figure shows that after the input image is scaled by 0.5, small targets whose height is below 16 pixels cannot be recognized.
In general, when an image classification model is trained, the larger the training set, the more accurate the learned model. In the MS COCO dataset used here, the number of samples differs greatly across human-target scales, which produces different degrees of training at each scale: scales with sufficient samples train well, while scales with insufficient samples train relatively poorly. Comparing Figure 15(b) with Figure 16(c) in Figures 14-16 shows that recognition of the same target becomes relatively poor after it is enlarged.
In practice, the labor and time costs of producing a dataset are enormous, and such an uneven target-scale distribution is hard to avoid or change; but the model's performance can be improved without changing the dataset by certain methods:
(1) For the problem of fixed anchor sizes, in an actual query the test set can be enlarged by the size of its smallest samples until they fall within the recognizable range.
(2) While small targets are enlarged, normal targets are enlarged as well and their recognition rate drops. When the cost of GPU memory is not a concern, several scaled versions of the same image can be processed simultaneously, the resulting target statistics merged, and duplicate targets removed from the final result with a non-maximum suppression algorithm.
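Method (2) above can be sketched as follows, assuming a detector callback `detect(image, scale)` that returns (box, score) pairs in the scaled image's coordinates; the callback and the 0.5 NMS threshold are illustrative, not specified by the patent:

```python
def iou(a, b):
    # intersection-over-union of two boxes [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def multiscale_detect(detect, image, scales=(0.5, 1.0, 2.0), thresh=0.5):
    """Run the detector on several scaled copies of one image, map the
    boxes back to original-image coordinates, pool them, and delete
    duplicates by greedy non-maximum suppression."""
    boxes, scores = [], []
    for s in scales:
        for (x1, y1, x2, y2), score in detect(image, s):
            boxes.append([x1 / s, y1 / s, x2 / s, y2 / s])
            scores.append(score)
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return [(boxes[i], scores[i]) for i in keep]
```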
For the pedestrian detection problem, using the general Mask RCNN detection framework, the way the region proposal network algorithm is formulated means that small-target recognition has a hard critical value: targets whose area is below a certain threshold cannot be recognized. Adjusting the resolution of the input image can improve small-target detection. In the field of pedestrian detection there are many evaluation criteria for model performance, of which the MR-FPPI (miss rate against false positives per image) curve is relatively reasonable. Here the Mask RCNN model is trained with the training set of the MS COCO dataset; as the test set we select the finely annotated training and validation sets of Cityscapes, scale them by 0.5, 1, and 2, test each, and plot the test results as MR-FPPI curves, as shown in Figure 17.
The horizontal axis is the FPPI index and the vertical axis the MR index; the red, blue, and green lines correspond to the test results after scaling by 0.5, 1, and 2, respectively. When the average number of false detections per image is 1, scalings of 1 and 2 obtain accuracies of 0.7 and 0.73 respectively. The smaller the area enclosed by a curve and the two axes, the better the true performance of the model, i.e., the lower the miss rate obtained while the number of false positives per image is kept as low as possible.
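The two indices of the curve can be computed per score threshold as follows (a sketch; sweeping the detector's score threshold traces the whole curve):

```python
def mr_fppi_point(num_missed, num_gt, num_false_pos, num_images):
    """One operating point of the MR-FPPI curve:
    MR   = missed ground-truth targets / all ground-truth targets,
    FPPI = false positive detections / number of test images."""
    return num_missed / num_gt, num_false_pos / num_images
```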
Figure 17 shows that when the FPPI index is low, the three curves interweave, i.e., the detection performance of the model is similar for the three resolutions. As the FPPI index keeps increasing, i.e., the number of false positives per image grows, the miss rate on the high-resolution images drops markedly, while the miss rate on the low-resolution images has a floor. The reason is that when a target is too small, its score is too low to pass the feature pyramid and anchor mechanism, so it is filtered out in the subsequent region proposal network, and the target is then simply lost to the subsequent detection and segmentation networks. After the image is enlarged, the corresponding targets are enlarged with it; and since the number of large targets in the high-resolution image is smaller than the number of small targets in the low-resolution image, the MR-FPPI curves behave as shown in Figure 17.
Optimization of the pedestrian detection algorithm:
Everything described above is a common optimization applicable to small-target detection in general. The specific pedestrian detection problem has its particularity: not every human in an image is a pedestrian; only people walking on the road are pedestrians, and people inside vehicles do not belong to that class. In practice, however, drivers and passengers inside vehicles are often misjudged as pedestrians, as shown in Figures 18-19.
In Mask RCNN, the edge of an object can be localized to pixel level rather than just a rectangular box. On this basis, whether a person is inside or outside a vehicle can be judged accurately. The judging process is as follows:
(1) Detect whether there is a motorcycle or bicycle below the person: a driver or passenger never has a bicycle or motorcycle appearing below them, so such a person is not inside a vehicle.
(2) Judge whether the person is inside or outside a vehicle by the pixel overlap ratio between the person and the vehicle.
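The two rules can be sketched on instance masks as follows (masks as sets of pixel coordinates; the 0.3 overlap threshold is an illustrative value, not specified by the patent):

```python
def outside_vehicle(person_mask, vehicle_masks, has_bike_below,
                    overlap_thresh=0.3):
    """person_mask / vehicle_masks: sets of (row, col) pixels from the
    instance masks. Rule (1): a bicycle or motorcycle below the person
    means a rider, never an in-vehicle driver or passenger. Rule (2):
    a high pixel overlap ratio with a vehicle means the person is
    inside it and should not be labeled a pedestrian."""
    if has_bike_below:
        return True
    for vehicle in vehicle_masks:
        overlap = len(person_mask & vehicle) / len(person_mask)
        if overlap > overlap_thresh:
            return False
    return True
```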
The detection results before and after applying the algorithm are shown in Figures 20-21: drivers and passengers inside vehicles are no longer labeled, while larger pedestrians overlapped by motorcycles and vehicles are not misjudged as in-vehicle passengers, which improves accuracy to a certain degree.
For the problem of in-vehicle drivers and passengers being misjudged in pedestrian detection, performance is again tested with the MR-FPPI curve, applied to the Cityscapes dataset at the better-performing 2x scaling; the test result is shown in Figure 22. In Figure 22, the blue line after optimization lies below the red line before optimization; with the input image scaled to 2x resolution, the accuracy after optimization is 0.75.
Claims (8)
1. A target detection network design method fusing image segmentation features, characterized in that it comprises the following steps:
the Mask RCNN algorithm fusing image segmentation features: the input image is passed separately through feature extraction and the image segmentation network, then through the region proposal network, candidate window classification, and the segmentation network.
2. The target detection network design method fusing image segmentation features according to claim 1, characterized in that it comprises the following steps: the image segmentation network is a module with independent processing capability; the target segmentation network selects the DeepLabv3 semantic segmentation algorithm, and the DeepLabv3 method is divided into two steps:
1) obtaining a preliminary segmentation result map with a fully convolutional network and interpolating it to the original image size;
2) finely correcting the details of the interpolated image segmentation result with the fully connected CRFs algorithm, iterating several times to obtain the optimal segmentation result;
the encoder first progressively reduces the spatial dimension of the input data, and the decoder then progressively restores the details of the target and the corresponding spatial positions.
3. The target detection network design method fusing image segmentation features according to claim 1, characterized in that it comprises the following steps: in order to fuse with the feature matrices output by the feature pyramid, the feature matrix output by the image segmentation network must first pass through a pooling layer to reduce its resolution.
4. The target detection network design method fusing image segmentation features according to claim 1, characterized in that it comprises the following steps: the feature extraction network changes from a single-track structure into two branches, target segmentation and image segmentation; during training, a gradient backpropagation problem is encountered, so the original network weight parameters must be frozen and only the weight parameters of the newly added target segmentation network trained.
5. The target detection network design method fusing image segmentation features according to claim 4, characterized in that it comprises the following steps: training the weight parameters of the newly added target segmentation network: the weight training of the new network is divided into three steps: 1) first freeze all original weight parameters and train only the newly added target segmentation network weights until the loss function returns to its original value; 2) freeze the network weights between the input and the FPN and the weights of the target segmentation network, and train the remaining weights; 3) after the network stabilizes, train all of its weights.
6. The target detection network design method fusing image segmentation features according to claim 1, characterized in that it comprises the following steps: feature extraction: after the Max pooling layer, the x branch adds a convolutional layer with a 1*1 kernel and a BN layer, whose role is to change the matrix dimension; the subsequent residual structures have no such convolutional layer and BN layer, and the input feature matrix is added directly to F(x); each F(x) structure uses three convolutions, the first and last with 1*1 kernels, which serve to change the matrix dimension, and a 3*3 kernel in the middle; the subsequent residual network is a repeated stacking of this structure;
when the residual network is combined with the feature pyramid, the specific dimension values of the feature matrices and the resolutions after the feature pyramid reduces their dimensionality are listed; the feature pyramid has one layer more than the number of feature matrices output by the residual network, and the feature matrix of its last layer is obtained by directly downsampling the second-to-last layer;
combining contextual information for small-target detection: the pedestrian detection task presents a large number of small targets that must be detected; after the FPN feature matrices are obtained, a deconvolution operation is applied to the high-level feature matrix to make its dimension consistent with the preceding-layer feature matrix, and the two are fused by element-wise matrix addition;
at this point the feature extraction of the image is complete: one image has been converted into multiple feature matrices, and the subsequent region proposal algorithm finds foreground targets from these feature matrices.
7. The target detection network design method fusing image segmentation features according to claim 1, characterized in that it comprises the following steps: design of the region proposal network:
each feature matrix obtained from the feature extraction network passes through two branches of convolutional layer, BN layer, ReLU layer, convolutional layer, BN layer, and ReLU layer; one branch then passes through a Sigmoid layer to produce the anchor-window probability list, and the other produces the anchor-window coordinate transforms and anchor-window position list; the two are then processed jointly by non-maximum suppression to obtain the final proposal window list.
8. The target detection network design method fusing image segmentation features according to claim 1, characterized in that it comprises the following steps: design of candidate window classification and segmentation processing:
the feature matrix obtained from the RoI Pooling layer is processed by two convolutional layers, and each candidate window yields a multidimensional vector; after a sigmoid function, multiple probability values are obtained, corresponding to the probabilities of the various object classes and the background; the class with the highest probability value is the class of the candidate window;
while the candidate-window probability values are obtained, the position coordinates are further refined according to the object class the target belongs to; because the outputs are numerous, the same object may be labeled repeatedly, so a non-maximum suppression algorithm is applied to the final detection output to delete targets with excessive overlap;
while the detection results are obtained, the target segmentation network also processes the data; after the RoI Align layer, the feature matrix of each candidate window is obtained, several convolution operations are applied to it followed by a deconvolution operation, a binary segmentation map is generated independently for each class, the binarization is completed by a sigmoid function, and which class's binary map the region of interest finally uses is determined by the target category output by the detection branch.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810860392.7A CN109145769A (en) | 2018-08-01 | 2018-08-01 | The target detection network design method of blending image segmentation feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810860392.7A CN109145769A (en) | 2018-08-01 | 2018-08-01 | The target detection network design method of blending image segmentation feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109145769A true CN109145769A (en) | 2019-01-04 |
Family
ID=64799266
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810860392.7A Withdrawn CN109145769A (en) | 2018-08-01 | 2018-08-01 | The target detection network design method of blending image segmentation feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145769A (en) |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108764115A (en) * | 2018-05-24 | 2018-11-06 | 东北大学 | A kind of truck danger based reminding method |
CN109815931A (en) * | 2019-02-01 | 2019-05-28 | 广东工业大学 | A kind of method, apparatus, equipment and the storage medium of video object identification |
CN109871798A (en) * | 2019-02-01 | 2019-06-11 | 浙江大学 | A kind of remote sensing image building extracting method based on convolutional neural networks |
CN109886990A (en) * | 2019-01-29 | 2019-06-14 | 理光软件研究所(北京)有限公司 | A kind of image segmentation system based on deep learning |
CN109902677A (en) * | 2019-01-30 | 2019-06-18 | 深圳北斗通信科技有限公司 | A kind of vehicle checking method based on deep learning |
CN109934161A (en) * | 2019-03-12 | 2019-06-25 | 天津瑟威兰斯科技有限公司 | Vehicle identification and detection method and system based on convolutional neural network |
CN109949334A (en) * | 2019-01-25 | 2019-06-28 | 广西科技大学 | Profile testing method based on the connection of deeply network residual error |
CN109948444A (en) * | 2019-02-19 | 2019-06-28 | 重庆理工大学 | Method for synchronously recognizing, system and the robot of fruit and barrier based on CNN |
CN109978886A (en) * | 2019-04-01 | 2019-07-05 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110059772A (en) * | 2019-05-14 | 2019-07-26 | 温州大学 | Remote sensing images semantic segmentation method based on migration VGG network |
CN110070030A (en) * | 2019-04-18 | 2019-07-30 | 北京迈格威科技有限公司 | Image recognition and the training method of neural network model, device and system |
CN110110702A (en) * | 2019-05-20 | 2019-08-09 | 哈尔滨理工大学 | It is a kind of that algorithm is evaded based on the unmanned plane for improving ssd target detection network |
CN110136141A (en) * | 2019-04-24 | 2019-08-16 | 佛山科学技术学院 | A kind of image, semantic dividing method and device towards complex environment |
CN110148135A (en) * | 2019-04-03 | 2019-08-20 | 深兰科技(上海)有限公司 | A kind of road surface dividing method, device, equipment and medium |
CN110288082A (en) * | 2019-06-05 | 2019-09-27 | 北京字节跳动网络技术有限公司 | Convolutional neural networks model training method, device and computer readable storage medium |
CN110349138A (en) * | 2019-06-28 | 2019-10-18 | 歌尔股份有限公司 | The detection method and device of the target object of Case-based Reasoning segmentation framework |
CN110530875A (en) * | 2019-08-29 | 2019-12-03 | 珠海博达创意科技有限公司 | A kind of FPCB open defect automatic detection algorithm based on deep learning |
CN110568445A (en) * | 2019-08-30 | 2019-12-13 | 浙江大学 | Laser radar and vision fusion perception method of lightweight convolutional neural network |
CN110609320A (en) * | 2019-08-28 | 2019-12-24 | 电子科技大学 | Pre-stack seismic reflection pattern recognition method based on multi-scale feature fusion |
CN110689061A (en) * | 2019-09-19 | 2020-01-14 | 深动科技(北京)有限公司 | Image processing method, device and system based on alignment feature pyramid network |
CN110852176A (en) * | 2019-10-17 | 2020-02-28 | 陕西师范大学 | High-resolution three-number SAR image road detection method based on Mask-RCNN |
CN110851633A (en) * | 2019-11-18 | 2020-02-28 | 中山大学 | Fine-grained image retrieval method capable of realizing simultaneous positioning and Hash |
CN110941995A (en) * | 2019-11-01 | 2020-03-31 | 中山大学 | Real-time target detection and semantic segmentation multi-task learning method based on lightweight network |
CN111144484A (en) * | 2019-12-26 | 2020-05-12 | 深圳集智数字科技有限公司 | Image identification method and device |
CN111339882A (en) * | 2020-02-19 | 2020-06-26 | 山东大学 | Power transmission line hidden danger detection method based on example segmentation |
CN111415106A (en) * | 2020-04-29 | 2020-07-14 | 上海东普信息科技有限公司 | Truck loading rate identification method, device, equipment and storage medium |
CN111462128A (en) * | 2020-05-28 | 2020-07-28 | 南京大学 | Pixel-level image segmentation system and method based on multi-modal spectral image |
CN111461130A (en) * | 2020-04-10 | 2020-07-28 | 视研智能科技(广州)有限公司 | High-precision image semantic segmentation algorithm model and segmentation method |
CN111580151A (en) * | 2020-05-13 | 2020-08-25 | 浙江大学 | SSNet model-based earthquake event time-of-arrival identification method |
CN111640125A (en) * | 2020-05-29 | 2020-09-08 | 广西大学 | Mask R-CNN-based aerial photograph building detection and segmentation method and device |
CN111695380A (en) * | 2019-03-13 | 2020-09-22 | 杭州海康威视数字技术股份有限公司 | Target detection method and device |
CN111723829A (en) * | 2019-03-18 | 2020-09-29 | 四川大学 | Full-convolution target detection method based on attention mask fusion |
CN111753579A (en) * | 2019-03-27 | 2020-10-09 | 杭州海康威视数字技术股份有限公司 | Detection method and device for designated walk-substituting tool |
CN111932553A (en) * | 2020-07-27 | 2020-11-13 | 北京航空航天大学 | Remote sensing image semantic segmentation method based on area description self-attention mechanism |
CN111985473A (en) * | 2020-08-20 | 2020-11-24 | 中再云图技术有限公司 | Method for identifying private business of store |
CN112001225A (en) * | 2020-07-06 | 2020-11-27 | 西安电子科技大学 | Online multi-target tracking method, system and application |
CN112215128A (en) * | 2020-10-09 | 2021-01-12 | 武汉理工大学 | FCOS-fused R-CNN urban road environment identification method and device |
CN112396582A (en) * | 2020-11-16 | 2021-02-23 | 南京工程学院 | Mask RCNN-based equalizing ring skew detection method |
CN113298036A (en) * | 2021-06-17 | 2021-08-24 | 浙江大学 | Unsupervised video target segmentation method |
CN113435271A (en) * | 2021-06-10 | 2021-09-24 | 中国电子科技集团公司第三十八研究所 | Fusion method based on target detection and instance segmentation model |
CN113487622A (en) * | 2021-05-25 | 2021-10-08 | 中国科学院自动化研究所 | Head and neck organ image segmentation method and device, electronic equipment and storage medium |
CN115546483A (en) * | 2022-09-30 | 2022-12-30 | 哈尔滨市科佳通用机电股份有限公司 | Method for measuring residual using amount of carbon slide plate of subway pantograph based on deep learning |
CN116452600A (en) * | 2023-06-15 | 2023-07-18 | 上海蜜度信息技术有限公司 | Instance segmentation method, system, model training method, medium and electronic equipment |
CN117252790A (en) * | 2023-08-23 | 2023-12-19 | 成都理工大学 | Multi-image fusion method based on NSCT-RCNN |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103984915A (en) * | 2014-02-28 | 2014-08-13 | 中国计量学院 | Pedestrian re-recognition method in monitoring video |
CN106874894A (en) * | 2017-03-28 | 2017-06-20 | 电子科技大学 | A kind of human body target detection method based on the full convolutional neural networks in region |
CN107507126A (en) * | 2017-07-27 | 2017-12-22 | 大连和创懒人科技有限公司 | A kind of method that 3D scenes are reduced using RGB image |
CN107527351A (en) * | 2017-08-31 | 2017-12-29 | 华南农业大学 | A kind of fusion FCN and Threshold segmentation milking sow image partition method |
CN108062756A (en) * | 2018-01-29 | 2018-05-22 | 重庆理工大学 | Image, semantic dividing method based on the full convolutional network of depth and condition random field |
CN108346154A (en) * | 2018-01-30 | 2018-07-31 | 浙江大学 | The method for building up of Lung neoplasm segmenting device based on Mask-RCNN neural networks |
Non-Patent Citations (2)
Title |
---|
KAIMING HE et al.: "Mask R-CNN", arXiv *
TSUNG-YI LIN et al.: "Feature Pyramid Networks for Object Detection", 2017 IEEE Conference on Computer Vision and Pattern Recognition *
Cited By (67)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108764115B (en) * | 2018-05-24 | 2021-12-14 | 东北大学 | Truck danger reminding method |
CN108764115A (en) * | 2018-05-24 | 2018-11-06 | 东北大学 | A kind of truck danger based reminding method |
CN109949334B (en) * | 2019-01-25 | 2022-10-04 | 广西科技大学 | Contour detection method based on deep reinforced network residual error connection |
CN109949334A (en) * | 2019-01-25 | 2019-06-28 | 广西科技大学 | Profile testing method based on the connection of deeply network residual error |
CN109886990A (en) * | 2019-01-29 | 2019-06-14 | 理光软件研究所(北京)有限公司 | A kind of image segmentation system based on deep learning |
CN109902677A (en) * | 2019-01-30 | 2019-06-18 | 深圳北斗通信科技有限公司 | A kind of vehicle checking method based on deep learning |
CN109902677B (en) * | 2019-01-30 | 2021-11-12 | 深圳北斗通信科技有限公司 | Vehicle detection method based on deep learning |
CN109871798A (en) * | 2019-02-01 | 2019-06-11 | 浙江大学 | A kind of remote sensing image building extracting method based on convolutional neural networks |
CN109815931A (en) * | 2019-02-01 | 2019-05-28 | 广东工业大学 | A kind of method, apparatus, equipment and the storage medium of video object identification |
CN109948444A (en) * | 2019-02-19 | 2019-06-28 | 重庆理工大学 | Method for synchronously recognizing, system and the robot of fruit and barrier based on CNN |
CN109934161A (en) * | 2019-03-12 | 2019-06-25 | 天津瑟威兰斯科技有限公司 | Vehicle identification and detection method and system based on convolutional neural network |
CN111695380B (en) * | 2019-03-13 | 2023-09-26 | 杭州海康威视数字技术股份有限公司 | Target detection method and device |
CN111695380A (en) * | 2019-03-13 | 2020-09-22 | 杭州海康威视数字技术股份有限公司 | Target detection method and device |
CN111723829B (en) * | 2019-03-18 | 2022-05-06 | 四川大学 | Full-convolution target detection method based on attention mask fusion |
CN111723829A (en) * | 2019-03-18 | 2020-09-29 | 四川大学 | Full-convolution target detection method based on attention mask fusion |
CN111753579A (en) * | 2019-03-27 | 2020-10-09 | 杭州海康威视数字技术股份有限公司 | Detection method and device for designated walk-substituting tool |
CN109978886A (en) * | 2019-04-01 | 2019-07-05 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110148135A (en) * | 2019-04-03 | 2019-08-20 | 深兰科技(上海)有限公司 | A kind of road surface dividing method, device, equipment and medium |
CN110070030B (en) * | 2019-04-18 | 2021-10-15 | 北京迈格威科技有限公司 | Image recognition and neural network model training method, device and system |
CN110070030A (en) * | 2019-04-18 | 2019-07-30 | 北京迈格威科技有限公司 | Image recognition and the training method of neural network model, device and system |
CN110136141B (en) * | 2019-04-24 | 2023-07-11 | 佛山科学技术学院 | Image semantic segmentation method and device oriented to complex environment |
CN110136141A (en) * | 2019-04-24 | 2019-08-16 | 佛山科学技术学院 | A kind of image, semantic dividing method and device towards complex environment |
CN110059772B (en) * | 2019-05-14 | 2021-04-30 | 温州大学 | Remote sensing image semantic segmentation method based on multi-scale decoding network |
CN110059772A (en) * | 2019-05-14 | 2019-07-26 | 温州大学 | Remote sensing images semantic segmentation method based on migration VGG network |
CN110110702A (en) * | 2019-05-20 | 2019-08-09 | 哈尔滨理工大学 | It is a kind of that algorithm is evaded based on the unmanned plane for improving ssd target detection network |
CN110288082A (en) * | 2019-06-05 | 2019-09-27 | 北京字节跳动网络技术有限公司 | Convolutional neural networks model training method, device and computer readable storage medium |
CN110349138B (en) * | 2019-06-28 | 2021-07-27 | 歌尔股份有限公司 | Target object detection method and device based on example segmentation framework |
CN110349138A (en) * | 2019-06-28 | 2019-10-18 | 歌尔股份有限公司 | The detection method and device of the target object of Case-based Reasoning segmentation framework |
CN110609320A (en) * | 2019-08-28 | 2019-12-24 | 电子科技大学 | Pre-stack seismic reflection pattern recognition method based on multi-scale feature fusion |
CN110530875A (en) * | 2019-08-29 | 2019-12-03 | 珠海博达创意科技有限公司 | A kind of FPCB open defect automatic detection algorithm based on deep learning |
CN110568445A (en) * | 2019-08-30 | 2019-12-13 | 浙江大学 | Laser radar and vision fusion perception method of lightweight convolutional neural network |
CN110689061A (en) * | 2019-09-19 | 2020-01-14 | 深动科技(北京)有限公司 | Image processing method, device and system based on alignment feature pyramid network |
CN110689061B (en) * | 2019-09-19 | 2023-04-28 | 小米汽车科技有限公司 | Image processing method, device and system based on alignment feature pyramid network |
CN110852176A (en) * | 2019-10-17 | 2020-02-28 | 陕西师范大学 | High-resolution three-number SAR image road detection method based on Mask-RCNN |
CN110941995A (en) * | 2019-11-01 | 2020-03-31 | 中山大学 | Real-time target detection and semantic segmentation multi-task learning method based on lightweight network |
CN110851633B (en) * | 2019-11-18 | 2022-04-22 | 中山大学 | Fine-grained image retrieval method capable of realizing simultaneous positioning and Hash |
CN110851633A (en) * | 2019-11-18 | 2020-02-28 | 中山大学 | Fine-grained image retrieval method capable of realizing simultaneous positioning and Hash |
CN111144484A (en) * | 2019-12-26 | 2020-05-12 | 深圳集智数字科技有限公司 | Image identification method and device |
CN111339882B (en) * | 2020-02-19 | 2022-05-31 | 山东大学 | Power transmission line hidden danger detection method based on example segmentation |
CN111339882A (en) * | 2020-02-19 | 2020-06-26 | 山东大学 | Power transmission line hidden danger detection method based on example segmentation |
CN111461130A (en) * | 2020-04-10 | 2020-07-28 | 视研智能科技(广州)有限公司 | High-precision image semantic segmentation algorithm model and segmentation method |
CN111461130B (en) * | 2020-04-10 | 2021-02-09 | 视研智能科技(广州)有限公司 | High-precision image semantic segmentation algorithm model and segmentation method |
CN111415106A (en) * | 2020-04-29 | 2020-07-14 | 上海东普信息科技有限公司 | Truck loading rate identification method, device, equipment and storage medium |
CN111580151A (en) * | 2020-05-13 | 2020-08-25 | 浙江大学 | SSNet model-based earthquake event time-of-arrival identification method |
CN111580151B (en) * | 2020-05-13 | 2021-04-20 | 浙江大学 | SSNet model-based earthquake event time-of-arrival identification method |
CN111462128B (en) * | 2020-05-28 | 2023-12-12 | 南京大学 | Pixel-level image segmentation system and method based on multi-mode spectrum image |
CN111462128A (en) * | 2020-05-28 | 2020-07-28 | 南京大学 | Pixel-level image segmentation system and method based on multi-modal spectral image |
CN111640125A (en) * | 2020-05-29 | 2020-09-08 | 广西大学 | Mask R-CNN-based aerial photograph building detection and segmentation method and device |
CN111640125B (en) * | 2020-05-29 | 2022-11-18 | 广西大学 | Aerial photograph building detection and segmentation method and device based on Mask R-CNN |
CN112001225B (en) * | 2020-07-06 | 2023-06-23 | 西安电子科技大学 | Online multi-target tracking method, system and application |
CN112001225A (en) * | 2020-07-06 | 2020-11-27 | 西安电子科技大学 | Online multi-target tracking method, system and application |
CN111932553A (en) * | 2020-07-27 | 2020-11-13 | 北京航空航天大学 | Remote sensing image semantic segmentation method based on area description self-attention mechanism |
CN111985473A (en) * | 2020-08-20 | 2020-11-24 | 中再云图技术有限公司 | Method for identifying private business of store |
CN112215128A (en) * | 2020-10-09 | 2021-01-12 | 武汉理工大学 | FCOS-fused R-CNN urban road environment identification method and device |
CN112215128B (en) * | 2020-10-09 | 2024-04-05 | 武汉理工大学 | FCOS-fused R-CNN urban road environment recognition method and device |
CN112396582B (en) * | 2020-11-16 | 2024-04-26 | 南京工程学院 | Mask RCNN-based equalizing ring skew detection method |
CN112396582A (en) * | 2020-11-16 | 2021-02-23 | 南京工程学院 | Mask RCNN-based equalizing ring skew detection method |
CN113487622B (en) * | 2021-05-25 | 2023-10-31 | 中国科学院自动化研究所 | Head-neck organ image segmentation method, device, electronic equipment and storage medium |
CN113487622A (en) * | 2021-05-25 | 2021-10-08 | 中国科学院自动化研究所 | Head and neck organ image segmentation method and device, electronic equipment and storage medium |
CN113435271A (en) * | 2021-06-10 | 2021-09-24 | 中国电子科技集团公司第三十八研究所 | Fusion method based on target detection and instance segmentation model |
CN113298036A (en) * | 2021-06-17 | 2021-08-24 | 浙江大学 | Unsupervised video target segmentation method |
CN113298036B (en) * | 2021-06-17 | 2023-06-02 | 浙江大学 | Unsupervised video target segmentation method |
CN115546483A (en) * | 2022-09-30 | 2022-12-30 | 哈尔滨市科佳通用机电股份有限公司 | Method for measuring residual usage amount of subway pantograph carbon slide plate based on deep learning |
CN115546483B (en) * | 2022-09-30 | 2023-05-12 | 哈尔滨市科佳通用机电股份有限公司 | Deep learning-based method for measuring residual usage amount of carbon slide plate of subway pantograph |
CN116452600B (en) * | 2023-06-15 | 2023-10-03 | 上海蜜度信息技术有限公司 | Instance segmentation method, system, model training method, medium and electronic equipment |
CN116452600A (en) * | 2023-06-15 | 2023-07-18 | 上海蜜度信息技术有限公司 | Instance segmentation method, system, model training method, medium and electronic equipment |
CN117252790A (en) * | 2023-08-23 | 2023-12-19 | 成都理工大学 | Multi-image fusion method based on NSCT-RCNN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145769A (en) | The target detection network design method of blending image segmentation feature | |
CN109284669A (en) | Pedestrian detection method based on Mask RCNN | |
CN110188705B (en) | Remote traffic sign detection and identification method suitable for vehicle-mounted system | |
CN110363215B (en) | Method for converting SAR images into optical images based on generative adversarial network |
CN109977793A (en) | Roadside image pedestrian segmentation method based on variable-scale multi-feature fusion convolutional network |
CN109359684A (en) | Fine-grained model recognition method based on weakly supervised localization and subclass similarity measurement |
CN108509978A (en) | Multi-class target detection method and model based on CNN multi-stage feature fusion |
CN111553201B (en) | Traffic light detection method based on YOLOv3 optimization algorithm | |
CN111126202A (en) | Optical remote sensing image target detection method based on dilated feature pyramid network |
CN107871119A (en) | Object detection method based on object spatial knowledge and two-stage prediction learning |
CN108985269A (en) | Fusion network driving environment perception model based on convolution and dilated convolution structures |
CN110119728A (en) | Optical remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network |
CN108009518A (en) | Hierarchical traffic sign recognition method based on fast binary convolutional neural networks |
CN107122776A (en) | Road traffic sign detection and recognition method based on convolutional neural networks |
CN110532946B (en) | Method for identifying axle type of green-traffic vehicle based on convolutional neural network | |
CN109785344A (en) | Dual-channel residual network remote sensing image segmentation method based on feature recalibration |
CN110197152A (en) | Road target recognition method for automated driving systems |
CN113160062B (en) | Infrared image target detection method, device, equipment and storage medium | |
CN107092884A (en) | Rapid coarse-fine cascade pedestrian detection method | |
CN104657980A (en) | Improved multi-channel image partitioning algorithm based on Meanshift | |
CN110060273A (en) | Remote sensing image landslide plotting method based on deep neural network | |
CN111882620A (en) | Road drivable area segmentation method based on multi-scale information | |
CN105894030A (en) | High-resolution remote sensing image scene classification method based on hierarchical multi-feature fusion |
CN110633727A (en) | Deep neural network ship target fine-grained identification method based on selective search | |
CN109635726A (en) | Landslide identification method based on deep network fusion of symmetric multi-scale pooling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190104 |