CN108596102B - RGB-D-based indoor scene object segmentation classifier construction method - Google Patents

RGB-D-based indoor scene object segmentation classifier construction method

Publication number
CN108596102B
CN108596102B (application number CN201810382977.2A)
Authority
CN
China
Prior art keywords
rgb
network
depth
pixel
picture
Prior art date
Legal status
Active
Application number
CN201810382977.2A
Other languages
Chinese (zh)
Other versions
CN108596102A (en)
Inventor
沈旭昆
周锋
迟小羽
Current Assignee
Qingdao Research Institute Of Beihang University
Original Assignee
Qingdao Research Institute Of Beihang University
Priority date
Filing date
Publication date
Application filed by Qingdao Research Institute Of Beihang University filed Critical Qingdao Research Institute Of Beihang University
Priority to CN201810382977.2A
Publication of CN108596102A
Application granted
Publication of CN108596102B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/35 - Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/36 - Indoor scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an RGB-D-based indoor scene object segmentation classifier construction method. The method comprises the steps of collecting RGB modal pictures of an indoor scene and depth modal pictures at the same pose; extracting RGB modal picture features and depth modal picture features; carrying out semantic analysis on the collected RGB modal pictures and depth modal pictures and adding a corresponding category label to each pixel in the collected pictures; and connecting the extracted RGB features and depth features together and inputting them into a full convolution network embedded with an RPN module to carry out object segmentation of the indoor scene. The method can be applied to the understanding of indoor scenes; by effectively segmenting the currently captured scene, it can effectively help indoor robot navigation and indoor real-time reconstruction.

Description

RGB-D-based indoor scene object segmentation classifier construction method
Technical Field
The invention belongs to the technical field of computer applications, and particularly relates to a construction method of an indoor scene object segmentation classifier.
Background
Perception and understanding of scenes, and particularly of indoor scenes, is challenging. In outdoor scenes, labeling tasks such as segmentation and detection can be handled well using RGB images alone; for indoor scenes, however, which are complex and changeable, it is difficult to understand the scene using only RGB images. The complexity, variability, and occlusion of indoor scenes are among the research focuses in the field of scene cognition and understanding, and this problem has long been an urgent one for virtual reality, artificial intelligence, intelligent robots, and machine vision.
Object detection is a prerequisite for many advanced visual tasks, including virtual reality and augmented reality, as well as intelligent video surveillance, content-based image retrieval, and robot navigation. A large number of excellent object detection algorithms have been proposed. For example, one algorithm framework based on AdaBoost uses Haar-like wavelet features for classification and then uses a sliding-window method to locate the object to be detected in the image; this was the first target detection algorithm that could achieve real-time performance with good detection accuracy. As another example, the HOG feature combined with a support vector machine (SVM) classifier has been used for pedestrian detection. The multi-scale deformable part model (DPM) algorithm, the most influential method before deep neural networks became popular, consists of a root filter and several part filters, with the deformation between parts derived through hidden variables; it inherits the advantages of the HOG-plus-SVM classifier, but it is difficult to use because it locates objects with a sliding window and additionally requires manually specifying the number of parts and the relations between them. The best target detection algorithms and results before 2012 were based on DPM or on improved versions of DPM. After 2012, as the AlexNet deep neural network greatly surpassed the classic work of the time on the image recognition task, deep neural networks gradually came to dominate the various fields of computer vision and computer graphics. The pioneering application of deep learning to target detection is the RCNN network. This algorithm has a disadvantage similar to the classical DPM algorithm: it is very slow because regions need to be detected repeatedly. To overcome the defects of RCNN, subsequent researchers integrated the region feature extraction process into the network and proposed the RPN, which removes the need to repeatedly extract features from the picture, saves a large amount of time, and overcomes the drawback that RCNN extracts regions with the Selective Search algorithm at great computational cost.
Scene semantic segmentation is a pixel-level picture classification task: given a picture, the segmentation algorithm outputs a picture of pixel-by-pixel labels with the same size as the input. That is, each sample can be expressed as x_i ∈ R^(w×h×d) (the corresponding formula appears as an equation image in the original text), where x_i represents the ith picture, w×h represents the picture size, and d is the dimension of a picture pixel point; through the picture semantic segmentation algorithm, the output is a label map of size w×h (also given as an equation image), where c ∈ {1, 2, 3, ..., C} indicates that each pixel belongs to one of the C classes. Since the pixels in a picture are correlated, the relationships between the variables need to be considered when classifying the pixels. The sketch below illustrates this formulation.
Although object detection and semantic segmentation of scenes have each been solved fairly well, existing work mainly addresses either the problem of object positioning in indoor scenes or the problem of object semantic segmentation in indoor scenes. The former provides coarse-scale semantic information about the indoor scene: the approximate position of an object is known, but not which pixels belong to it. The latter provides finer-scale indoor semantic information, assigning a semantic label to each pixel in the indoor scene, but objects of the same class are not distinguished from each other. Thus, these two separate tasks have not yet been well integrated and cannot provide more robust information for the semantic understanding of indoor scenes.
Disclosure of Invention
To solve the problem that a single network cannot simultaneously provide both the positions and the pixel labels of indoor objects, the invention provides a refined object identification method, namely a construction method of an RGB-D-based indoor scene pixel-by-pixel object segmentation classifier. The scheme is as follows:
An RGB-D-based indoor scene object segmentation classifier construction method comprises the following steps:
Step one, acquiring an RGB (red, green and blue) modal picture and a depth modal picture of an indoor scene;
Step two, counting the types of objects contained in the RGB modal picture and the depth modal picture, and then carrying out category marking on each pixel in the picture;
Step three, inputting the collected RGB modal picture into an RGB sub-network of a full convolution network embedded with an RPN module, simultaneously inputting the collected depth modal picture into a depth sub-network of the same full convolution network, extracting the features of the RGB modal picture and the depth modal picture in parallel, and respectively obtaining the feature f_rgb output by the RGB sub-network and the feature f_depth output by the depth sub-network;
Step four, defining an RGB-D loss function, and connecting the RGB sub-network and the depth sub-network together to construct an RGB-D multi-modal network structure for training an RGB-D-based indoor scene object segmentation pixel-by-pixel classifier_rgbd;
Step five, in the network inference stage, inputting test-sample RGB-D data into the trained RGB-D multi-modal network according to data modality: the RGB sub-network extracts f_rgb from the input RGB modal picture, the depth sub-network extracts f_depth from the input depth modal picture, the two extracted modal features are spliced together and input into the pixel-by-pixel classifier_rgbd, and detection and segmentation of indoor scene objects are carried out.
Further, in the fourth step, the RGB-D loss function is defined as follows (the loss function and several of the symbols below appear only as equation images in the original text and are indicated here by placeholders):

[equation image: overall RGB-D loss function]

wherein,

[equation image: loss term of the RGB modality]

[equation image: loss term of the depth modality]

As above, λ and γ are balance factors for balancing the proportions of the RGB modal data and the depth modal data when calculating the loss; α and β are balance factors for balancing the proportions of the final calculated loss contributed by the Reg network and the Seg network; N represents the number of anchor points; when j is an anchor point, [equation image] holds, otherwise [equation image] holds; I_i represents the ith RGB training data and D_i the ith depth training data; the label l_i ∈ {0, 1, ..., C} gives one label value for each pixel in the given training data; [equation image] is the bounding-box label corresponding to the ith training data; [equation image] represents the pixel classification result obtained by computing the input ith RGB training data with the weight w and the corresponding parameter θ, where k denotes the kth pixel; [equation image] denotes the weight that maps the ith RGB training data from the classification layer to the label domain; and [equation image] represents the feature expression extracted from the layer preceding the classification layer based on the parameter [equation image].
Further, in the third step, the RGB sub-network includes two parts: the network responsible for detecting objects in the indoor scene is defined as the Reg network, and the network responsible for semantic segmentation of the indoor scene is defined as the Seg network. The process of extracting features from the input RGB modal data using the RGB sub-network is as follows. The picture is input into the Reg network, which extracts the position of each object in the indoor scene image input into the network:
C(3,64,1)-C(3,128,1)-C(3,256,1)-C(3,256,1)-C(3,512,1)-C(3,512,1)-RPN(9)-F(4096)-F(4096)
At the same time, the RGB image input into the network is input into the Seg network, which extracts the category of each pixel in the indoor scene image input into the network:
C(3,64,1)-C(3,64,1)-C(3,64,1)-C(3,128,1)-C(3,128,1)-C(3,256,1)-C(3,256,1)-C(3,256,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-ASPP(6,12,18,24)
where C denotes a convolution operation in the network; in C(k, n, s), k represents the kernel size of the convolution kernel, n represents the number of convolution kernels, and s represents the stride of the convolution kernel in the convolution operation; ASPP(d_1, d_2, d_3, d_4) denotes an atrous spatial pyramid pooling structure built from hole (dilated) convolutions, where d_i indicates the dilation amplitude of the hole convolution kernels.
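To make the ASPP(6, 12, 18, 24) notation above concrete, the following is a minimal PyTorch sketch of an atrous spatial pyramid pooling head built from hole (dilated) convolutions with those four dilation rates; the channel counts, class count, and the summation fusion are assumptions (a DeepLab-style choice) rather than the exact configuration of the invention.

import torch
import torch.nn as nn

class ASPP(nn.Module):
    # Hole-convolution spatial pyramid: parallel 3x3 convolutions with
    # different dilation rates applied to the same feature map, then summed.
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, num_classes, kernel_size=3,
                      padding=r, dilation=r)   # padding=r keeps the spatial size
            for r in rates
        ])

    def forward(self, x):
        # Sum the per-branch score maps (an assumed fusion rule).
        return torch.stack([b(x) for b in self.branches], dim=0).sum(dim=0)

# Usage sketch: 512-channel backbone features -> per-pixel class scores.
features = torch.randn(1, 512, 40, 30)
aspp = ASPP(in_ch=512, num_classes=21, rates=(6, 12, 18, 24))
print(aspp(features).shape)  # torch.Size([1, 21, 40, 30])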
Further, in the first step, the indoor scene is collected by using the Microsoft depth sensor Kinect, and the Kinect can be held by a hand to walk indoors at a constant speed in the collection process.
Compared with the prior art, the invention has the advantages and positive effects that:
the invention provides a novel RGB-D-based indoor scene pixel-by-pixel object segmentation classifier construction method, which can analyze the pixel-by-pixel category of objects in an acquired indoor scene picture according to acquired RGB and depth modal information, namely, the position of the objects in the indoor scene and the label of each pixel can be simultaneously output through a complete RGB-D network, and the method belongs to a multi-task network and provides finer-scale semantic understanding information for semantic understanding of the indoor scene.
In addition, the invention is a multi-task end-to-end learning network, which can optimize end to end, perfectly embed RPN network into FCN semantic segmentation network through designed loss function, and can well realize end-to-end indoor scene pixel-by-pixel object segmentation algorithm.
Detailed Description
The design concept of the invention is as follows:
the invention mainly focuses on object segmentation in an indoor scene, and in order to well solve the problem of object segmentation in the indoor scene, the problems of object detection and semantic segmentation of the scene in the indoor scene need to be solved.
For generating object bounding boxes, an RPN network was originally intended to be employed. The RPN can quickly locate the positions of objects in an indoor scene, but it can only provide the position of each type of object and cannot provide the pixel-by-pixel categories in the indoor scene.
To solve the above problem, a full convolution network with hole (dilated) convolution was then adopted. The hole convolution can segment the input image, but the convolution and pooling operations reduce the image size, whereas the segmentation output should be a score map with the same size as the input. To solve this size inconsistency caused by the network computation, the applicant adopts bilinear interpolation, as sketched below. Together, the two networks can provide the object positioning and the pixel-by-pixel segmentation required for indoor scene understanding, but because they are separate, they cannot be optimized end to end. To solve this problem, the RPN network is finally embedded into the full convolution network with hole convolution, and practice proves that this brings two benefits: first, the whole network can be optimized end to end; second, weights can be shared, since the low-level features extracted by the first several layers of the deep neural network are shareable, which allows fine-tuning.
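A brief sketch of the bilinear interpolation step mentioned above, in PyTorch: the downsampled score map produced by the convolution and pooling operations is resized back to the input resolution so that every input pixel receives a class score; the image size, class count, and downsampling factor are illustrative assumptions.

import torch
import torch.nn.functional as F

# The backbone has reduced a 480x640 input to a 60x80 score map with C classes.
C = 21
score_map = torch.randn(1, C, 60, 80)

# Bilinear interpolation restores the score map to the input size so that
# every input pixel receives a class score (and hence a label).
full_res = F.interpolate(score_map, size=(480, 640),
                         mode="bilinear", align_corners=False)
labels = full_res.argmax(dim=1)      # (1, 480, 640) pixel-by-pixel labels
print(labels.shape)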
In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described with reference to the following examples.
This embodiment provides a construction method of an RGB-D-based indoor scene pixel-by-pixel object segmentation classifier, which comprises the following steps:
Step one, acquiring an RGB (red, green and blue) modal picture and a depth modal picture of an indoor scene;
In this embodiment, the Microsoft depth sensor Kinect is mainly used to collect the indoor scene. The Kinect depth sensor can simultaneously collect RGB modal data and depth modal data from the same viewing angle, so as to construct a picture sample set. During collection, the Kinect can be held in the hand and moved through the room at a constant speed.
Step two, counting the types of objects contained in the RGB modal picture and the depth modal picture, and then carrying out category marking on each pixel in the picture;
in this embodiment, the acquired RGB-D data is mainly analyzed manually, the types of objects included in the picture are counted, and then each pixel in the picture is labeled by a category, and since the paired RGB modal picture and depth modal picture describe the same scene, the pixel label of the RGB modal picture and the pixel label of the depth modal picture are the same.
Step three, inputting the collected RGB pictures into an RGB sub-network of a full convolution network embedded with an RPN module, simultaneously inputting the collected depth modal pictures into a depth sub-network of the same full convolution network, extracting the features of the RGB modal pictures and the depth modal pictures in parallel, and respectively obtaining the feature f_rgb output by the RGB sub-network and the feature f_depth output by the depth sub-network. In this embodiment, the RGB sub-network includes two parts: the network responsible for detecting objects in the indoor scene is defined as the Reg network, and the network responsible for semantic segmentation of the indoor scene is defined as the Seg network. The specific feature extraction process is as follows. The picture is input into the Reg network:
C(3,64,1)-C(3,64,1)-C(3,128,1)-C(3,128,1)-C(3,256,1)-C(3,256,1)-C(3,256,1)
-C(3,256,1)-C(3,512,1)-C(3,512,1)-RPN(9)-F(4096)-F(4096)
At the same time, the RGB image input into the network is input into the Seg network:
C(3,64,1)-C(3,64,1)-C(3,64,1)-C(3,128,1)-C(3,128,1)-C(3,256,1)-C(3,256,1)
-C(3,256,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)
-ASPP(6,12,18,24)
where C denotes a convolution operation in the network; in C(k, n, s), k represents the kernel size of the convolution kernel, n represents the number of convolution kernels, and s represents the stride of the convolution kernel in the convolution operation; ASPP(d_1, d_2, d_3, d_4) denotes an atrous spatial pyramid pooling structure built from hole (dilated) convolutions, where d_i indicates the dilation amplitude of the hole convolution kernels. The position of each object in the indoor scene image input into the network is extracted by the above Reg network, and the category of each pixel in the input picture is extracted by the Seg network; the architecture strings are expanded into concrete layers in the sketch below.
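To show how the C(k, n, s) architecture strings above translate into layers, the following is a hedged PyTorch sketch that expands such a string into a stack of convolutions; the "same" padding, the ReLU activations, and the omission of any pooling layers are assumptions made only for illustration.

import torch
import torch.nn as nn

def build_conv_stack(spec, in_ch=3):
    # Expand a list of (k, n, s) triples, e.g. [(3, 64, 1), (3, 128, 1), ...],
    # into a sequential stack of Conv2d + ReLU layers.
    # Assumption: "same" padding (k // 2) and ReLU after every convolution.
    layers = []
    for k, n, s in spec:
        layers += [nn.Conv2d(in_ch, n, kernel_size=k, stride=s, padding=k // 2),
                   nn.ReLU(inplace=True)]
        in_ch = n
    return nn.Sequential(*layers)

# Seg-network convolutional trunk from the architecture string above
# (the ASPP(6,12,18,24) head would follow this trunk).
seg_spec = [(3, 64, 1), (3, 64, 1), (3, 64, 1), (3, 128, 1), (3, 128, 1),
            (3, 256, 1), (3, 256, 1), (3, 256, 1), (3, 512, 1), (3, 512, 1),
            (3, 512, 1), (3, 512, 1), (3, 512, 1)]
seg_trunk = build_conv_stack(seg_spec)
print(seg_trunk(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 512, 64, 64])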
Step four, defining an RGB-D loss function, and connecting the RGB sub-network and the depth sub-network together to construct an RGB-D multi-modal network structure for training the RGB-D-based indoor scene object segmentation pixel-by-pixel classifier_rgbd.
The process of constructing the RGB-D multi-modal network structure and training the pixel-by-pixel classifier_rgbd is as follows:
the RGB-D loss function is first defined as follows:
Figure BDA0001641443310000071
wherein
Figure BDA0001641443310000072
Figure BDA0001641443310000073
As above, λ and γ are a balance factor for balancing RGB mode data and depth mode dataThe proportion of loss in calculation, alpha and beta are balance factors for balancing the proportion of loss in final calculation in the Reg network and the Seg network, N represents the position number of anchor points, and when j belongs to the anchor points
Figure BDA0001641443310000081
Otherwise, the reverse is carried out
Figure BDA0001641443310000082
IiRepresenting the ith RGB training data, DiDenoted the ith depth training data, label liE {0, 1.,. C }, one label value for each pixel in the given training data,
Figure BDA0001641443310000083
given is the bounding box label corresponding to the ith training data,
Figure BDA0001641443310000084
representing a pixel classification result obtained by calculating the ith RGB training data of the input through a weight w and a corresponding parameter theta, wherein k represents the kth pixel,
Figure BDA0001641443310000085
denoted is the weight that the ith RGB training data maps from the classification layer to the label domain,
Figure BDA0001641443310000086
it is shown that the fc7 layer is the layer before the classification layer (in this embodiment, the fc7 layer before the softmax layer) based on the parameter
Figure BDA0001641443310000087
Extracted feature expressions). And updating the learning of the network through the calculated loss value so as to obtain a final indoor scene pixel-by-pixel object segmentation classifier.
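Since the loss formulas themselves are given only as equation images, the following is a hedged PyTorch sketch of how a weighted multi-task, multi-modal loss of the kind described (λ and γ balancing the RGB and depth modalities, α and β balancing the Reg and Seg branches) might be assembled; the concrete term functions (smooth L1 for regression, cross-entropy for segmentation), the tensor shapes, and all weight values are assumptions, not the patent's definition.

import torch
import torch.nn.functional as F

def rgbd_loss(reg_rgb, seg_rgb, reg_d, seg_d, box_targets, pix_labels,
              lam=1.0, gamma=1.0, alpha=1.0, beta=1.0):
    # Hedged sketch of an RGB-D multi-task loss:
    #   reg_* : bounding-box regression outputs of the Reg branch, per modality
    #   seg_* : per-pixel class scores (N, C, H, W) of the Seg branch, per modality
    #   alpha/beta weight the Reg vs. Seg terms; lam/gamma weight RGB vs. depth.
    def branch_loss(reg_out, seg_out):
        loss_reg = F.smooth_l1_loss(reg_out, box_targets)   # assumed regression term
        loss_seg = F.cross_entropy(seg_out, pix_labels)     # assumed segmentation term
        return alpha * loss_reg + beta * loss_seg
    return lam * branch_loss(reg_rgb, seg_rgb) + gamma * branch_loss(reg_d, seg_d)

# Illustrative shapes only.
box_targets = torch.randn(8, 4)
pix_labels = torch.randint(0, 21, (1, 60, 80))
loss = rgbd_loss(torch.randn(8, 4), torch.randn(1, 21, 60, 80),
                 torch.randn(8, 4), torch.randn(1, 21, 60, 80),
                 box_targets, pix_labels)
print(loss.item())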
Step five, in the network inference stage, the test-sample RGB-D data are input into the trained RGB-D multi-modal network according to data modality: the RGB sub-network extracts f_rgb from the input RGB modal picture, and the depth sub-network extracts f_depth from the input depth modal picture. The two extracted modal features are spliced together and input into the pixel-by-pixel classifier_rgbd, which carries out the detection and segmentation tasks for indoor scene objects.
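A minimal sketch of the inference-stage feature splicing described in step five: the RGB and depth feature maps are concatenated along the channel dimension and fed to a per-pixel classification head; the feature shapes and the 1x1-convolution classifier are assumptions used only to illustrate the data flow.

import torch
import torch.nn as nn

# Features extracted by the two sub-networks from the paired test pictures.
f_rgb = torch.randn(1, 512, 60, 80)    # output of the RGB sub-network (assumed shape)
f_depth = torch.randn(1, 512, 60, 80)  # output of the depth sub-network (assumed shape)

# Splice the two modal features together along the channel dimension.
f_rgbd = torch.cat([f_rgb, f_depth], dim=1)   # (1, 1024, 60, 80)

# Pixel-by-pixel classifier over the fused features (illustrative 1x1 conv head).
num_classes = 21
classifier_rgbd = nn.Conv2d(1024, num_classes, kernel_size=1)
pixel_labels = classifier_rgbd(f_rgbd).argmax(dim=1)   # (1, 60, 80)
print(pixel_labels.shape)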
The execution environment of the invention is a computer with a 4.0 GHz quad-core central processing unit and 128 GB of memory; in addition, to accelerate the training and inference of the object recognition network, four GeForce GTX 1080 Ti GPU graphics cards are used for accelerated computation. The construction program of the RGB-D indoor scene pixel-by-pixel object segmentation classifier is written in C++ and Python; other execution environments can also be used and are not described here.
The prior-art approaches based on hand-crafted features construct RGB and depth modal features and then input the obtained features into an SVM classifier; they require a strong professional background, are complex, cannot be optimized end to end, and their stage-wise optimization easily falls into local optima. Compared with these approaches, the present invention can moreover distinguish differences within a class, so its output is discriminative within a class. For example, the invention can not only separate the chair and the table in an indoor scene and output their positions, but can also separate two chairs in the indoor scene from each other.
The method can be applied to understanding of indoor scenes, and can effectively help indoor robot navigation and indoor real-time reconstruction by effectively segmenting the currently captured scenes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in other forms. Any person skilled in the art may use the technical content disclosed above to make modified or equivalent embodiments; however, any simple modification, equivalent variation, or alteration made to the above embodiments according to the technical essence of the present invention still falls within the technical scope of the present invention.

Claims (3)

1. An RGB-D-based indoor scene object segmentation classifier construction method, characterized by comprising the following steps:
Step one, acquiring an RGB (red, green and blue) modal picture and a depth modal picture of an indoor scene;
Step two, counting the types of objects contained in the RGB modal picture and the depth modal picture, and then carrying out category marking on each pixel in the picture;
Step three, inputting the collected RGB modal picture into an RGB sub-network of a full convolution network embedded with an RPN module, simultaneously inputting the collected depth modal picture into a depth sub-network of the same full convolution network, extracting the features of the RGB modal picture and the depth modal picture in parallel, and respectively obtaining the feature f_rgb output by the RGB sub-network and the feature f_depth output by the depth sub-network;
Step four, defining an RGB-D loss function, and connecting the RGB sub-network and the depth sub-network together to construct an RGB-D multi-modal network structure for training an RGB-D-based indoor scene object segmentation pixel-by-pixel classifier_rgbd;
the RGB-D loss function is defined as follows (the loss function and several of the symbols below appear only as equation images in the original claims and are indicated here by placeholders):

[equation image: overall RGB-D loss function]

wherein,

[equation image: loss term of the RGB modality]

[equation image: loss term of the depth modality]

As above, λ and γ are balance factors for balancing the proportions of the RGB modal data and the depth modal data when calculating the loss; α and β are balance factors for balancing the proportions of the final calculated loss contributed by the Reg network and the Seg network; N represents the number of anchor points; when j is an anchor point, [equation image] holds, otherwise [equation image] holds; I_i represents the ith RGB training data and D_i the ith depth training data; the label l_i ∈ {0, 1, ..., C} gives one label value for each pixel in the given training data; [equation image] is the bounding-box label corresponding to the ith training data; [equation image] represents the pixel classification result obtained by computing the input ith RGB training data with the weight w and the corresponding parameter θ, where k denotes the kth pixel; [equation image] denotes the weight that maps the ith RGB training data from the classification layer to the label domain; and [equation image] represents the feature expression extracted from the layer preceding the classification layer based on the parameter [equation image];
Step five, in the network inference stage, inputting the test-sample RGB-D data into the trained RGB-D multi-modal network according to data modality: the RGB sub-network extracts f_rgb from the input RGB modal picture, the depth sub-network extracts f_depth from the input depth modal picture, the two extracted modal features are spliced together and input into the pixel-by-pixel classifier_rgbd, and detection and segmentation of indoor scene objects are carried out.
2. The RGB-D based indoor scene object segmentation classifier construction method as claimed in claim 1, wherein: in the third step, the RGB sub-network includes two parts, wherein the network responsible for detecting the objects in the indoor scene is defined as the Reg network, the network responsible for semantic segmentation of the indoor scene is defined as the Seg network, and the process of extracting the features of the input RGB modal data by using the RGB sub-network is as follows: inputting the picture into a Reg network, and extracting the position of each object in an indoor scene image input into the network;
[equation image: Reg network architecture, given as an image in the original claims]
simultaneously inputting the RGB image input into the network into the Seg network to extract the category of each pixel in the indoor scene image input into the network
C(3,64,1)-C(3,64,1)-C(3,64,1)-C(3,128,1)-C(3,128,1)-C(3,256,1)-C(3,256,1)-C(3,256,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-C(3,512,1)-ASPP(6,12,18,24)
where C represents a convolution operation in the network; in C(k, n, s), k represents the kernel size of the convolution kernel, n represents the number of convolution kernels, and s represents the stride of the convolution kernel in the convolution operation; ASPP(d_1, d_2, d_3, d_4) denotes an atrous spatial pyramid pooling structure built from hole (dilated) convolutions, where d_1, d_2, d_3, d_4 indicate the dilation amplitudes of the hole convolution kernels.
3. The RGB-D based indoor scene object segmentation classifier construction method as claimed in claim 1, wherein: in the first step, the indoor scene is collected by using the Microsoft depth sensor Kinect, and the Kinect can be held by a hand to walk indoors at a constant speed in the collection process.
CN201810382977.2A 2018-04-26 2018-04-26 RGB-D-based indoor scene object segmentation classifier construction method Active CN108596102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810382977.2A CN108596102B (en) 2018-04-26 2018-04-26 RGB-D-based indoor scene object segmentation classifier construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810382977.2A CN108596102B (en) 2018-04-26 2018-04-26 RGB-D-based indoor scene object segmentation classifier construction method

Publications (2)

Publication Number Publication Date
CN108596102A CN108596102A (en) 2018-09-28
CN108596102B true CN108596102B (en) 2022-04-05

Family

ID=63609387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810382977.2A Active CN108596102B (en) 2018-04-26 2018-04-26 RGB-D-based indoor scene object segmentation classifier construction method

Country Status (1)

Country Link
CN (1) CN108596102B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492666B (en) * 2018-09-30 2021-07-06 北京百卓网络技术有限公司 Image recognition model training method and device and storage medium
CN109598268B (en) * 2018-11-23 2021-08-17 安徽大学 RGB-D (Red Green blue-D) significant target detection method based on single-stream deep network
CN109766822B (en) * 2019-01-07 2021-02-05 山东大学 Gesture recognition method and system based on neural network
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN110737941A (en) * 2019-10-12 2020-01-31 南京我爱我家信息科技有限公司 house decoration degree recognition system and method based on probability model and pixel statistical model
CN110705653A (en) * 2019-10-22 2020-01-17 Oppo广东移动通信有限公司 Image classification method, image classification device and terminal equipment
CN111506940B (en) * 2019-12-13 2022-08-12 江苏艾佳家居用品有限公司 Furniture, ornament and lamp integrated intelligent layout method based on 3D structured light
CN112818837B (en) * 2021-01-29 2022-11-11 山东大学 Aerial photography vehicle weight recognition method based on attitude correction and difficult sample perception
CN113222003B (en) * 2021-05-08 2023-08-01 北方工业大学 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D
CN114426069B (en) * 2021-12-14 2023-08-25 哈尔滨理工大学 Indoor rescue vehicle based on real-time semantic segmentation and image semantic segmentation method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867288A (en) * 2011-07-07 2013-01-09 三星电子株式会社 Depth image conversion apparatus and method
CN103226828A (en) * 2013-04-09 2013-07-31 哈尔滨工程大学 Image registration method of acoustic and visual three-dimensional imaging with underwater vehicle
CN105956532A (en) * 2016-04-25 2016-09-21 大连理工大学 Traffic scene classification method based on multi-scale convolution neural network
CN106651765A (en) * 2016-12-30 2017-05-10 深圳市唯特视科技有限公司 Method for automatically generating thumbnail by use of deep neutral network
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107341440A (en) * 2017-05-08 2017-11-10 西安电子科技大学昆山创新研究院 Indoor RGB D scene image recognition methods based on multitask measurement Multiple Kernel Learning
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
WO2018047033A1 (en) * 2016-09-07 2018-03-15 Nokia Technologies Oy Method and apparatus for facilitating stereo vision through the use of multi-layer shifting

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933755B (en) * 2014-03-18 2017-11-28 华为技术有限公司 A kind of stationary body method for reconstructing and system
CN106612427B (en) * 2016-12-29 2018-07-06 浙江工商大学 A kind of generation method of the space-time consistency depth map sequence based on convolutional neural networks
CN107622244B (en) * 2017-09-25 2020-08-28 华中科技大学 Indoor scene fine analysis method based on depth map

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867288A (en) * 2011-07-07 2013-01-09 三星电子株式会社 Depth image conversion apparatus and method
CN103226828A (en) * 2013-04-09 2013-07-31 哈尔滨工程大学 Image registration method of acoustic and visual three-dimensional imaging with underwater vehicle
CN105956532A (en) * 2016-04-25 2016-09-21 大连理工大学 Traffic scene classification method based on multi-scale convolution neural network
WO2018047033A1 (en) * 2016-09-07 2018-03-15 Nokia Technologies Oy Method and apparatus for facilitating stereo vision through the use of multi-layer shifting
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
CN106651765A (en) * 2016-12-30 2017-05-10 深圳市唯特视科技有限公司 Method for automatically generating thumbnail by use of deep neutral network
CN107341440A (en) * 2017-05-08 2017-11-10 西安电子科技大学昆山创新研究院 Indoor RGB D scene image recognition methods based on multitask measurement Multiple Kernel Learning
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs; Liang-Chieh Chen et al.; arXiv; 2017-05-14; pp. 1-14 *
Learning Rich Features from RGB-D Images for Object Detection and Segmentation; Saurabh Gupta et al.; European Conference on Computer Vision 2014; 2014; pp. 345-360 *
Research and Application of Indoor Object Detection Based on Hand-held Object Learning (基于手持物体学习的室内物体检测研究及应用); Qiao Leixian; China Masters' Theses Full-text Database, Information Science and Technology; 2017-10-15 (No. 10); pp. I138-231 *
A Preliminary Study of a Gaussian-Model-Based Target Recognition Method for Remote Sensing Images (基于高斯模型的遥感影像目标识别方法的初探); Li Yan et al.; Journal of System Simulation; 2009-10-23; Vol. 21, No. S1; pp. 57-60 *

Also Published As

Publication number Publication date
CN108596102A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108596102B (en) RGB-D-based indoor scene object segmentation classifier construction method
CN109344701B (en) Kinect-based dynamic gesture recognition method
WO2021022970A1 (en) Multi-layer random forest-based part recognition method and system
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN106991370B (en) Pedestrian retrieval method based on color and depth
JP2012243313A (en) Image processing method and image processing device
CN108596256B (en) Object recognition classifier construction method based on RGB-D
CN109685045A (en) A kind of Moving Targets Based on Video Streams tracking and system
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN113052295B (en) Training method of neural network, object detection method, device and equipment
CN111353447A (en) Human skeleton behavior identification method based on graph convolution network
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN113515655A (en) Fault identification method and device based on image classification
CN114332911A (en) Head posture detection method and device and computer equipment
CN115816460A (en) Manipulator grabbing method based on deep learning target detection and image segmentation
Ge et al. Coarse-to-fine foraminifera image segmentation through 3D and deep features
CN113658129B (en) Position extraction method combining visual saliency and line segment strength
CN105404682B (en) A kind of book retrieval method based on digital image content
Akanksha et al. A Feature Extraction Approach for Multi-Object Detection Using HoG and LTP.
CN113033386A (en) High-resolution remote sensing image-based transmission line channel hidden danger identification method and system
CN111738264A (en) Intelligent acquisition method for data of display panel of machine room equipment
CN108109125A (en) Information extracting method and device based on remote sensing images
CN114549809A (en) Gesture recognition method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant