CN106709568A

CN106709568A - RGB-D image object detection and semantic segmentation method based on deep convolution network

Info

Publication number: CN106709568A
Application number: CN201611168200.3A
Authority: CN
Inventors: 刘波; 邓广晖
Original assignee: Beijing University of Technology
Current assignee: Shenzhen Xiaofeng Technology Co ltd
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2017-05-24
Anticipated expiration: 2036-12-16
Also published as: CN106709568B

Abstract

The invention discloses an RGB-D image object detection and semantic segmentation method based on a deep convolutional network, which belongs to the field of deep learning and machine vision. In the technical scheme of the invention, Faster-RCNN replaces the original, slow RCNN; Faster-RCNN runs on a GPU, which makes feature extraction fast, and it generates region proposals inside the network, so the whole training process is end-to-end. FCN is used to carry out semantic segmentation of RGB-D images: FCN uses a GPU and the deep convolutional network to rapidly extract the deep features of an image, deconvolution is used to fuse the deep and shallow convolutional features of the image, and the local semantic information of the image is integrated into the global semantic information.

Description

Object detection and semantic segmentation method for RGB-D images based on a deep convolutional network

Technical field

The invention belongs to the field of deep learning and machine vision, and more particularly relates to an object detection and semantic segmentation method for RGB-D images. The method has a wide range of applications in real-world scenes, for example pedestrian detection and tracking in surveillance video, unmanned aerial vehicle navigation, and autonomous driving.

Background technology

Object detection and semantic segmentation are two important research fields of computer vision. Object detection is mainly used to detect the position and the category of objects in an image, and it comprises two main tasks: the first is to find the region proposals (Region Proposals) of the objects, where a region proposal is a pre-selected box indicating the approximate position of an object in the image; the second is to classify the object in each pre-selected box. Semantic segmentation addresses the problem of assigning the correct label to every pixel of an image; it is mainly used for scene understanding and has many potential applications. With the rise of deep convolutional neural networks, object detection based on deep convolutional neural networks has become the most prevalent detection approach, and likewise semantic segmentation based on Region Proposals has become the most prevalent semantic segmentation approach.

First, the more prevalent methods for generating Region Proposals are as follows.

There are many traditional methods for generating Region Proposals. Selective search (Selective Search) merges superpixels (adjacent pixel blocks with similar features) according to different color features of the image (such as the HSV color space and the Lab color space); generating Region Proposals for one image with selective search in CPU mode takes 2 s. Multiscale combinatorial grouping (MCG) merges superpixels according to contour features at several scales to generate candidate regions, and then ranks them by two-dimensional features of the region proposals such as area, perimeter, and boundary strength. Edge boxes (EdgeBoxes) generates region proposals with a sliding window and ranks the Region Proposals using edge information (the number of contours inside the box and the number of contours overlapping the box edge). All of the above methods run in CPU mode. The region proposal network (Region Proposal Network, abbreviated RPN) runs in GPU mode and produces Region Proposals while the deep convolutional neural network extracts image features, which greatly increases the speed of object detection.

Second, the fast region-based deep convolutional neural networks used for object detection are as follows.

After the fast region-based deep convolutional neural network brought marked improvements in the speed and accuracy of object detection, many faster deep convolutional neural networks appeared. Faster-RCNN consists of two networks: an RPN, which produces Region Proposals, and a fast region-based deep convolutional neural network, which recognizes objects. YOLO combines box selection and recognition so that a single network completes both at once, but YOLO produces only 98 region proposals per image, so the accuracy of its boxes is relatively low. SSD produces default boxes on the feature map of each layer; its advantage is that it can produce accurate boxes even for low-resolution input images, and its disadvantage is that the detection system is very sensitive to box size and performs poorly on small objects. R-FCN is an object detection network based on FCN; the network discards the classifier layer, converts the fully connected layers into convolutional layers, uses ResNet-101 as the backbone, and proposes a position-sensitive mapping method to deal with the translation variance of objects.

Third, the semantic segmentation networks are as follows.

The fully convolutional network (FCN) is adapted from the convolutional neural network by replacing the fully connected layers with convolutional layers. To achieve semantic segmentation of an image, FCN performs one or more deconvolution operations on the deep feature map so that it has the same size as the original image, and then classifies each pixel with a Softmax classifier. This realizes end-to-end, pixel-to-pixel semantic segmentation of the whole image, but the deconvolution operation does not take into account the information lost during downsampling. SegNet does not use deconvolution; instead it upsamples the deep feature map layer by layer, again bringing it to the size of the original image, and then classifies each pixel with a Softmax classifier. It takes into account the information the image loses through downsampling during convolution, but at the cost of very large memory consumption. The DeepLab model adds a conditional random field (Conditional Random Field, abbreviated CRF) as a post-processing operation after FCN to refine the edge details of the segmented image, but this is not an end-to-end process. To solve this problem, CRFasRNN combines CRF with deep learning so that the whole network structure is trainable end to end.

The research above concentrates mainly on RGB color images. With the spread of depth image sensors such as the Intel RealSense 3D Camera, Asus Xtion PRO LIVE, and Microsoft Kinect, more and more researchers have shifted their focus to RGB-D images, for example in object detection, three-dimensional reconstruction, robot vision, virtual reality, and image segmentation. Image segmentation work concentrates mainly on semantic segmentation, instance segmentation, and scene labeling.

Among the research on RGB-D images, the most typical is that of Gupta et al., who build on RCNN to make full use of RGB-D images for object detection and for semantic segmentation based on superpixel features. For object detection, they propose a novel method of converting the depth image into a three-channel image, which they name HHA; they first generate region proposals by multiscale combinatorial grouping, then train separate RCNNs on the RGB and HHA images, merge the features extracted by the two networks, and finally classify each region proposal with a support vector machine. For semantic segmentation, they predict superpixel class labels with a support vector machine based on superpixel depth features (geocentric pose) and geometric features (size, shape). The method is very slow, however: generating region proposals by multiscale combinatorial grouping is a very slow process, the RCNN it uses is slow and redundant, training is split into multiple pipeline stages, and computing superpixel features is a complex and slow process.

Summary of the invention

To solve the problems above, the technical scheme of this method replaces the original, slow RCNN with Faster-RCNN. Faster-RCNN not only extracts features very quickly on a GPU but also generates region proposals inside the network, so the whole training process is end-to-end. FCN is used for semantic segmentation of RGB-D images: FCN uses a GPU and a deep convolutional network to rapidly extract the deep features of an image, uses deconvolution to fuse the deep and shallow convolutional features of the image, and integrates the local semantic information of the image into the global semantic information.

To achieve the above purpose, the technical solution adopted by the present invention is an object detection and semantic segmentation method for RGB-D images based on a deep convolutional network. For the object detection and semantic segmentation tasks, the method comprises:

S1. Compute a grayscale image from the RGB image and merge the grayscale image with the HHA image into an HHG image. As shown in Fig. 2, among the spectrum images of the discrete Fourier transforms of the three HHA channels, the one that differs most clearly is that of the A channel: the intensity of its DC component, i.e., along the horizontal and vertical coordinate axes, is very weak, so this channel is discarded. Because the spectrum images of the discrete Fourier transforms of the three RGB channels are all similar and their DC components are strong, the grayscale image of the RGB image replaces the A channel of the HHA image. The resulting three-channel image, which fuses the RGB image and the depth image, is the HHG image.
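The fusion in S1 can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the text does not say how the grayscale image is computed, so the BT.601 luma weights are an assumption, as is the channel order (H, H, A) of the HHA array. The function names `rgb_to_gray` and `make_hhg` are hypothetical.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted grayscale of an (H, W, 3) RGB image.
    Assumption: BT.601 luma weights; the source does not name the weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def make_hhg(rgb, hha):
    """Build an HHG image: keep the first two HHA channels and replace
    the third (A) channel with the grayscale of the RGB image.
    Assumption: hha is stored in channel order H, H, A."""
    if rgb.shape != hha.shape or rgb.shape[-1] != 3:
        raise ValueError("rgb and hha must both have shape (H, W, 3)")
    gray = np.clip(np.rint(rgb_to_gray(rgb)), 0, 255).astype(np.uint8)
    hhg = hha.copy()          # leave the caller's HHA image untouched
    hhg[..., 2] = gray
    return hhg

# Tiny example: a white RGB image and a constant HHA image.
rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
hha = np.full((2, 2, 3), 7, dtype=np.uint8)
hhg = make_hhg(rgb, hha)
```

The output has the same shape as the inputs, so it can be fed to a network that expects a three-channel image.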

S2. Use Faster-RCNN as the object detection system for HHG images. The HHG image is the input to the network; the RPN in Faster-RCNN produces Region Proposals, Fast-RCNN extracts the features of the Region Proposals, and each Region Proposal is then classified. The detection result of this method is a rectangular box in the HHG image marking the position and extent of each object, labeled with the category of the object in the box, as shown in the object detection image in Fig. 1.

S3. Modify the mechanism by which non-maximum suppression (Non-Maximum Suppression, abbreviated NMS) retains boxes, using the number of boxes surrounding a box as the evaluation factor, as shown in Fig. 3. The specific steps are as follows:

Each box is a 5-tuple (x1, y1, x2, y2, score), where (x1, y1) is the coordinate of the upper-left corner of the box, (x2, y2) is the coordinate of the lower-right corner, and score is the confidence that the box contains an object. The tuples are first sorted in ascending order of score. The intersection-over-union (Intersection-over-Union) overlap ratio of the boxes is then calculated as follows:

O_{(i, j)} = \frac{inter_{(i, j)}}{area_{(i)} + area_{(j)} - inter_{(i, j)}}, \quad i \in [1, n - δ], \; j \in [i + 1, n]

where O_(i,j) denotes the intersection-over-union overlap ratio of box i and box j, inter_(i,j) denotes the overlapping area of box i and box j, area_(i) denotes the area of box i, and area_(j) denotes the area of box j. For box i, count the quantity Sum_i (the expression being counted is missing from this text); if Sum_i >= δ, box i is discarded, otherwise it is retained. n denotes the total number of boxes and δ denotes the discard threshold.
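A minimal sketch of the modified retention rule in S3 follows. The text does not preserve the condition that defines Sum_i, so this sketch assumes it counts the higher-scored boxes j whose overlap ratio with box i exceeds a threshold; the parameter `overlap_thresh` and its default of 0.5 are hypothetical, as are the function names.

```python
def iou(a, b):
    """Overlap ratio O(i, j) of two boxes given as (x1, y1, x2, y2, score)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def modified_nms(boxes, delta, overlap_thresh=0.5):
    """Discard box i when at least `delta` higher-scored boxes overlap it.
    Assumption: overlap_thresh stands in for the counting condition
    that is missing from the source text."""
    order = sorted(boxes, key=lambda b: b[4])   # ascending by score
    n = len(order)
    keep = []
    for i, box in enumerate(order):
        if i < n - delta:                        # i in [1, n - delta], 1-based
            count = sum(
                1 for j in range(i + 1, n)       # j in [i + 1, n], 1-based
                if iou(box, order[j]) > overlap_thresh
            )
            if count >= delta:
                continue                         # Sum_i >= delta: discard
        keep.append(box)
    return keep

# Four identical boxes plus one isolated box, with delta = 2.
boxes = [
    (0, 0, 10, 10, 0.1), (0, 0, 10, 10, 0.2), (0, 0, 10, 10, 0.3),
    (0, 0, 10, 10, 0.9), (100, 100, 110, 110, 0.5),
]
kept = modified_nms(boxes, delta=2)
```

In this example the two lowest-scored duplicates are dropped because each has at least two higher-scored boxes on top of it, while the isolated box and the best-scored duplicates survive. Boxes ranked above position n - δ can never be dropped, since fewer than δ boxes score higher than they do.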

S4. Complete the semantic segmentation task for RGB-D images using HHG images and FCN. The HHG image is the input to FCN; after FCN extracts semantic features and classifies them, it outputs the class label of every pixel in the HHG image, and the label value is used as the value of that pixel. In the segmentation result, pixels of the HHG image that belong to the same category are displayed in the same color, as shown in the semantic segmentation image in Fig. 1.
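The output stage of S4 can be illustrated with a short sketch: per-pixel class scores are reduced to a label map, and the label map is then rendered with one color per class. The score layout (classes, height, width), the palette, and the function names are assumptions made for illustration only.

```python
import numpy as np

def scores_to_labels(scores):
    """scores: (num_classes, H, W) per-pixel class scores.
    Returns an (H, W) map whose pixel values are the class labels."""
    return np.argmax(scores, axis=0).astype(np.uint8)

def colorize(labels, palette):
    """Render a label map so pixels of the same class share one color.
    palette: (num_classes, 3) array of RGB colors (hypothetical values)."""
    return palette[labels]

scores = np.array([
    [[0.9, 0.2], [0.1, 0.3]],   # class 0
    [[0.1, 0.8], [0.2, 0.3]],   # class 1
    [[0.0, 0.0], [0.7, 0.4]],   # class 2
])
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0]], dtype=np.uint8)

labels = scores_to_labels(scores)
colored = colorize(labels, palette)
```

Taking the argmax of raw scores gives the same labels as taking it after a Softmax, because Softmax preserves the ordering of the scores.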

The structural framework of the object detection and semantic segmentation described here is shown in Fig. 1.

Brief description of the drawings

Fig. 1: Flow chart and result images of object detection and semantic segmentation for RGB-D images

Fig. 2: Comparison of HHG, RGB, and HHA images

Fig. 3: Comparison of nms' and top2000 when reducing the number of boxes

Specific embodiment

The present invention is described in further detail below with reference to drawings and Examples.

The present invention is described in the following aspects: the fusion of the RGB image and the depth image, the modified NMS, the training of the models, and the experimental results.

The object detection and semantic segmentation method for RGB-D images based on a deep convolutional network comprises the following steps:

First, fuse the RGB image and the depth image into an HHG image according to the method described above;

Second, train the object detection system model;

Faster-RCNN can be trained in three ways: alternating training (Alternating Training), approximate joint training (Approximate Joint Training), and non-approximate joint training (Non-approximate Joint Training). This method uses the alternating training scheme, whose idea is to let the region proposal network and Fast-RCNN share convolutional layer parameters while fine-tuning the parameters that belong to each network. The scheme first trains the region proposal network, then trains the Fast-RCNN model on the region proposals that the region proposal network produces, and then uses the Fast-RCNN model to initialize the region proposal network. This process can be repeated.

This method uses a 4-step alternating training scheme. Step 1: initialize the network with a model pre-trained on the ImageNet dataset and train the region proposal network. Step 2: use the region proposals produced by the region proposal network of step 1 as the pre-detection boxes of Fast-RCNN, initialize the network with a model pre-trained on the ImageNet dataset, and train Fast-RCNN. Step 3: initialize the region proposal network with the network trained in step 2 and train it; because the convolutional layer parameters are shared, only the layers that belong to the region proposal network are fine-tuned here. Step 4: again with shared convolutional layer parameters, initialize Fast-RCNN with the network trained in step 2 and fine-tune only the layers that belong to that network. Steps 1 and 2 are called the first stage, and steps 3 and 4 are called the second stage.
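The 4-step schedule can be written down as a small data table, which makes the initialization source and the trainable part of each step explicit. This is only a summary of the paragraph above, not training code; the class and field names are hypothetical, and `shared_conv_fixed` encodes the reading that the shared convolutional layers are not updated in steps 3 and 4.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrainingStep:
    step: int
    stage: int
    trains: str                     # network trained in this step
    init_from: str                  # source of the initial weights
    proposals_from: Optional[str]   # None where not applicable or not stated
    shared_conv_fixed: bool         # True: only network-specific layers tuned

SCHEDULE = [
    TrainingStep(1, 1, "RPN", "ImageNet pre-trained model", None, False),
    TrainingStep(2, 1, "Fast-RCNN", "ImageNet pre-trained model",
                 "step 1 RPN", False),
    TrainingStep(3, 2, "RPN", "step 2 network", None, True),
    # The text does not say which proposals step 4 uses, so it stays None.
    TrainingStep(4, 2, "Fast-RCNN", "step 2 network", None, True),
]

def describe(step):
    """One-line description of what a step trains and updates."""
    scope = "network-specific layers only" if step.shared_conv_fixed else "all layers"
    return f"step {step.step} (stage {step.stage}): train {step.trains}, {scope}"

lines = [describe(s) for s in SCHEDULE]
```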

The loss function here follows the multi-task loss (Multi-task loss) formula of Faster-RCNN, expressed as follows:

L(\{p_{i}\}, \{t_{i}\}) = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_{i}, p_{i}^{*}) + λ \frac{1}{N_{reg}} \sum_{i} p_{i}^{*} L_{reg}(t_{i}, t_{i}^{*})

where p_i denotes the predicted probability that the i-th anchor box (Anchors box) contains an object, and p_i^* denotes the ground-truth (Ground-Truth) label: p_i^* = 1 if the anchor box is a positive example and p_i^* = 0 if it is a negative example. t_i denotes the coordinates (4 parameters) of predicted box i, and t_i^* denotes the ground-truth box associated with a positive anchor box. L_cls denotes the Softmax classification loss and L_reg denotes the box regression loss. N_cls denotes the mini-batch size, with N_cls = 256 in the experiments; N_reg denotes the number of anchor boxes; λ is a balancing coefficient, with λ = 10 in the experiments. L_reg is computed as in Fast-RCNN.
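The multi-task loss can be sketched numerically as below. The regression formula is not reproduced in this text; the sketch assumes it is the smooth L1 loss used in Fast-RCNN, and it writes the two-class Softmax loss as a log loss over object versus non-object. Normalizing by the number of anchors passed in when `n_reg` is omitted is also an assumption, and the function names are hypothetical.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (assumed from Fast-RCNN): 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def multitask_loss(p, p_star, t, t_star, n_cls=256, n_reg=None, lam=10.0):
    """p: (N,) predicted object probabilities; p_star: (N,) labels in {0, 1};
    t, t_star: (N, 4) predicted and ground-truth box parameters."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)           # avoid log(0)
    # Classification term: log loss over the two classes.
    l_cls = -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
    # Regression term: summed over the 4 box parameters of each anchor.
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    if n_reg is None:
        n_reg = len(p)                            # assumption, see lead-in
    # p_star switches the regression term on for positive anchors only.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
t_star = np.array([[0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]])
loss = multitask_loss(p, p_star, t, t_star)
```

The second anchor is a negative example, so its large box error contributes nothing to the regression term.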

Third, train the semantic segmentation system model;

When training the fully convolutional network, backpropagation likewise uses stochastic gradient descent (Stochastic Gradient Descent, abbreviated SGD), and the loss is the sum of the Softmax losses of all pixels. The fully convolutional training network (only the VGG-16 network is used here) comes in three variants. The first applies one deconvolution with a stride (Stride) of 32 after the conv7 convolutional layer (FCN-32s). The second applies one deconvolution with a stride of 2 to the conv7 convolutional layer, fuses (averages) the result with the output of the pool4 pooling layer, and then applies one deconvolution with a stride of 16 to the fused result (FCN-16s). The third fuses three results: the result of one stride-4 deconvolution of conv7, the result of one stride-2 deconvolution of the pool4 pooling layer, and the output of the pool3 pooling layer; one deconvolution with a stride of 8 is then applied to the fused result (FCN-8s). The second and third variants are called the skip architecture of the fully convolutional network.
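The stride arithmetic of the three variants can be checked with a shape-only sketch. Nearest-neighbor upsampling stands in for the learned deconvolution, the inputs are assumed to be class-score maps with matching channel counts at 1/32, 1/16, and 1/8 of the input resolution, and averaging the three maps in FCN-8s is an assumption (the text states averaging only for FCN-16s).

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) map.
    Stand-in for a learned deconvolution with the same stride."""
    return np.repeat(np.repeat(x, factor, axis=-2), factor, axis=-1)

def fcn32s(conv7):
    return upsample(conv7, 32)

def fcn16s(conv7, pool4):
    fused = (upsample(conv7, 2) + pool4) / 2.0           # averaged fusion
    return upsample(fused, 16)

def fcn8s(conv7, pool4, pool3):
    # Assumption: the three maps are averaged, as in the two-map case.
    fused = (upsample(conv7, 4) + upsample(pool4, 2) + pool3) / 3.0
    return upsample(fused, 8)

# Score maps for a 64 x 64 input with 3 classes.
conv7 = np.ones((3, 2, 2))    # 1/32 resolution
pool4 = np.ones((3, 4, 4))    # 1/16 resolution
pool3 = np.ones((3, 8, 8))    # 1/8 resolution

out32 = fcn32s(conv7)
out16 = fcn16s(conv7, pool4)
out8 = fcn8s(conv7, pool4, pool3)
```

All three outputs come back at the input resolution, which is what allows a per-pixel Softmax loss to be computed against the label image.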

Training of a fully convolutional network generally starts by fine-tuning the FCN-32s network from an already trained model; FCN-16s, with its skip architecture, is then fine-tuned from the FCN-32s model, and finally FCN-8s is fine-tuned from the FCN-16s model. This training procedure is followed here, and the model trained with FCN-8s is used for testing and gives the final experimental result.

Fourth, experimental results;

Experiments are carried out on the HHG image, the fusion of the RGB color image and the depth image. Object detection with HHG images reaches an average accuracy of 37.6% (Table 1, row 6), 5.1 percentage points higher than the result of Gupta et al.

Table 1: Rows 1, 2, and 3 are the experimental results of Gupta et al., and rows 4 to 8 are the experimental results of this method, where nms' denotes the experiment using the modified non-maximum suppression. Results are average accuracy in percent.

The region proposal network produces about 17,000 boxes. Non-maximum suppression then removes overlapping boxes, leaving 2,000 to 3,000 boxes; this takes 0.71 s on average. One further pass of the modified non-maximum suppression reduces the number of boxes to 2,000 (±50); this takes 0.133 s on average. The values of δ are given in Table 2: δ (δ ∈ [8, 13]) takes a different value for each interval of the box count between 2,050 and 3,000. When the count is below 2,050, the modified non-maximum suppression is not performed; when the count exceeds 3,000 (which seldom happens), the 2,000 boxes with the highest score values are taken. This raises the average accuracy by 1.6% over the HHG result (Table 1, row 7).
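The three cases described above can be sketched as a small decision function. The contents of Table 2 are not reproduced in this text, so the mapping from box count to δ is passed in by the caller; `example_lookup` below is a purely illustrative placeholder, not the patent's values, and the function names are hypothetical.

```python
def plan_box_reduction(num_boxes, delta_lookup, low=2050, high=3000, top_k=2000):
    """Decide how to reduce the boxes left after the first NMS pass.
    delta_lookup: callable mapping a box count to delta (Table 2 values,
    which are not available here and must be supplied by the caller)."""
    if num_boxes < low:
        return ("keep_all", None)            # modified NMS is not performed
    if num_boxes > high:
        return ("top_k_by_score", top_k)     # rare case: keep the best 2,000
    delta = delta_lookup(num_boxes)
    if not 8 <= delta <= 13:
        raise ValueError("delta is expected to lie in [8, 13]")
    return ("modified_nms", delta)

def example_lookup(num_boxes):
    """Illustrative placeholder only; real values come from Table 2."""
    return 8 if num_boxes < 2500 else 13

plans = [plan_box_reduction(n, example_lookup) for n in (1500, 2400, 2900, 3500)]
```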

Finally, the VGG-16 network model is used for the final object detection experiment; the average accuracy is 43.7% (Table 1, row 8), 11.2 percentage points higher than the result of Gupta et al.

Table 2: The first row gives the intervals of the number of boxes remaining after the first non-maximum suppression, and the second row gives the value of δ for each interval.

The segmentation results are shown in Table 3. This method obtains its best segmentation result using HHG images with the FCN-8s network, raising the mean intersection-over-union from the 28.6% of Gupta et al. to 30.9%.

Table 3: IU (%) for the 40 semantic segmentation labels. The first row is the semantic segmentation result of Gupta et al.; the second to fourth rows are our semantic segmentation results using HHG images with the FCN-32s, FCN-16s, and FCN-8s networks, respectively.

Claims

1. An object detection and semantic segmentation method for RGB-D images based on a deep convolutional network, characterized in that:

S1. A grayscale image is computed from the RGB image, and the grayscale image and the HHA image are merged into an HHG image; among the spectrum images of the discrete Fourier transforms of the three HHA channels, the one that differs most clearly is that of the A channel: the intensity of its DC component, i.e., along the horizontal and vertical coordinate axes, is very weak, so this channel is discarded; because the spectrum images of the discrete Fourier transforms of the three RGB channels are all similar and their DC components are strong, the grayscale image of the RGB image replaces the A channel of the HHA image, and the resulting three-channel image, which fuses the RGB image and the depth image, is the HHG image;

S2. Faster-RCNN is used as the object detection system for HHG images; the HHG image is the input to the network, the RPN in Faster-RCNN produces Region Proposals, Fast-RCNN extracts the features of the Region Proposals, and each Region Proposal is then classified; the detection result of this method is a rectangular box in the HHG image marking the position and extent of each object, labeled with the category of the object in the box;

S3. The mechanism by which non-maximum suppression (NMS) retains boxes is modified, using the number of boxes surrounding a box as the evaluation factor; the specific steps are as follows:

Each box is a 5-tuple (x1, y1, x2, y2, score), where (x1, y1) is the coordinate of the upper-left corner of the box, (x2, y2) is the coordinate of the lower-right corner, and score is the confidence that the box contains an object; the tuples are first sorted in ascending order of score; the intersection-over-union overlap ratio of the boxes is calculated as follows;

O_{(i, j)} = \frac{inter_{(i, j)}}{area_{(i)} + area_{(j)} - inter_{(i, j)}}, \quad i \in [1, n - δ], \; j \in [i + 1, n]

where O_(i,j) denotes the intersection-over-union overlap ratio of box i and box j, inter_(i,j) denotes the overlapping area of box i and box j, area_(i) denotes the area of box i, and area_(j) denotes the area of box j; for box i, the quantity Sum_i is counted (the expression being counted is missing from this text); if Sum_i >= δ, box i is discarded, otherwise it is retained; n denotes the total number of boxes and δ denotes the discard threshold;

S4. The semantic segmentation task for RGB-D images is completed using HHG images and FCN; the HHG image is the input to FCN, and after FCN extracts semantic features and classifies them, it outputs the class label of every pixel in the HHG image, the label value being used as the value of that pixel.

2. The object detection and semantic segmentation method for RGB-D images based on a deep convolutional network according to claim 1, characterized in that:

Faster-RCNN can be trained in three ways: alternating training, approximate joint training, and non-approximate joint training; this method uses the alternating training scheme, whose idea is to let the region proposal network and Fast-RCNN share convolutional layer parameters while fine-tuning the parameters that belong to each network; the scheme first trains the region proposal network, then trains the Fast-RCNN model on the region proposals that the region proposal network produces, and then uses the Fast-RCNN model to initialize the region proposal network; this process can be repeated.

3. The object detection and semantic segmentation method for RGB-D images based on a deep convolutional network according to claim 2, characterized in that:

This method uses a 4-step alternating training scheme: in step 1, the network is initialized with a model pre-trained on the ImageNet dataset and the region proposal network is trained; in step 2, the region proposals produced by the region proposal network of step 1 are used as the pre-detection boxes of Fast-RCNN, the network is initialized with a model pre-trained on the ImageNet dataset, and Fast-RCNN is trained; in step 3, the region proposal network is initialized with the network trained in step 2 and trained, and because the convolutional layer parameters are shared, only the layers that belong to the region proposal network are fine-tuned; in step 4, again with shared convolutional layer parameters, Fast-RCNN is initialized with the network trained in step 2 and only the layers that belong to that network are fine-tuned; steps 1 and 2 are called the first stage, and steps 3 and 4 are called the second stage;

The loss function follows the multi-task loss formula of Faster-RCNN, expressed as follows:

L(\{p_{i}\}, \{t_{i}\}) = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_{i}, p_{i}^{*}) + λ \frac{1}{N_{reg}} \sum_{i} p_{i}^{*} L_{reg}(t_{i}, t_{i}^{*})

where p_i denotes the predicted probability that the i-th anchor box contains an object, and p_i^* denotes the ground-truth label: p_i^* = 1 if the anchor box is a positive example and p_i^* = 0 if it is a negative example; t_i denotes the coordinates (4 parameters) of predicted box i, and t_i^* denotes the ground-truth box associated with a positive anchor box; L_cls denotes the Softmax classification loss and L_reg denotes the box regression loss; N_cls denotes the mini-batch size, with N_cls = 256 in the experiments; N_reg denotes the number of anchor boxes; λ is a balancing coefficient, with λ = 10 in the experiments; L_reg is computed as in Fast-RCNN.

4. The object detection and semantic segmentation method for RGB-D images based on a deep convolutional network according to claim 1, characterized in that:

When training the fully convolutional network, backpropagation likewise uses stochastic gradient descent, and the loss is the sum of the Softmax losses of all pixels; the fully convolutional training network comes in three variants; the first applies one deconvolution with a stride of 32 after the conv7 convolutional layer (FCN-32s); the second applies one deconvolution with a stride of 2 to the conv7 convolutional layer, fuses the result with the output of the pool4 pooling layer, and then applies one deconvolution with a stride of 16 to the fused result (FCN-16s); the third fuses three results, namely the result of one stride-4 deconvolution of conv7, the result of one stride-2 deconvolution of the pool4 pooling layer, and the output of the pool3 pooling layer, and then applies one deconvolution with a stride of 8 to the fused result (FCN-8s); the second and third variants are called the skip architecture of the fully convolutional network;

Training of a fully convolutional network generally starts by fine-tuning the FCN-32s network from an already trained model; FCN-16s, with its skip architecture, is then fine-tuned from the FCN-32s model, and finally FCN-8s is fine-tuned from the FCN-16s model; this training procedure is followed here, and the model trained with FCN-8s is used for testing and gives the final experimental result.