CN107622244B - Indoor scene fine analysis method based on depth map - Google Patents


Publication number
CN107622244B
CN107622244B
Authority
CN
China
Prior art keywords
depth map
indoor scene
analyzed
value
category
Prior art date
Legal status
Active
Application number
CN201710874793.3A
Other languages
Chinese (zh)
Other versions
CN107622244A (en)
Inventor
曹治国
杭凌霄
肖阳
赵峰
张博深
王立
李涛
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710874793.3A
Publication of CN107622244A
Application granted
Publication of CN107622244B
Active legal status
Anticipated expiration


Abstract

The invention discloses a depth-map-based indoor scene fine analysis method, applied in the technical field of digital image processing and pattern recognition, comprising the following steps: extracting three-channel features of the depth map, and segmenting the targets in the depth map of the indoor scene to be analyzed with a trained full convolution network; on the depth feature map, refining and optimizing the boundaries of the segmentation result with a fully connected conditional random field to obtain the category label vectors of all pixels in the indoor scene depth map to be analyzed; and converting the indoor scene depth map to be analyzed into a point cloud and analyzing the three-dimensional structure of the targets based on the category label vectors to obtain the spatial posture of the targets. The invention uses only the depth map as input, realizes semantic segmentation of the indoor scene, gives the spatial posture of specific objects in three-dimensional coordinates, can effectively overcome occlusion, separates the foreground from the background, and protects the privacy of the user.

Description

Indoor scene fine analysis method based on depth map
Technical Field
The invention belongs to the technical field of digital image processing and pattern recognition, and particularly relates to a depth map-based indoor scene fine analysis method.
Background
Indoor scene parsing is a task that integrates target detection and image segmentation techniques. It requires the computer to understand an image at multiple levels, and involves all-round, multi-angle algorithm design from 2D to 3D, from low-level object localization, recognition and segmentation up to high-level scene recognition and indoor object layout analysis.
Traditional scene parsing is mainly based on color images and relies on limited information sources, chiefly color, texture and the like. Existing algorithms adopt a bottom-up framework that classifies image superpixels and then optimizes the segmentation result with a graphical model. However, these algorithms have two defects: first, under heavy indoor occlusion and with complex objects, their robustness is poor and it is difficult to distinguish targets from the background; second, a planar color image is inherently short of information and cannot provide the position of a target in three-dimensional space.
In recent years, the popularization of depth cameras has provided a new dimension for solving these problems and has greatly raised the level of indoor scene parsing and understanding. Depth images provide a view closer to the real world: the difference between foreground and background is reflected by distance, and surface geometric information is added on top of the visual information. These unique characteristics of depth images greatly facilitate 3D analysis of indoor scenes.
The existing depth-map-based indoor scene parsing techniques are similar in approach to traditional color-image methods: they only use the depth information as an additional feature and do not fully exploit the unique characteristics of the depth map. It is worth noting that, in practical applications, both traditional methods based solely on color images and methods that rely on color images in addition to depth images inevitably fail when the lights are turned off at night. Furthermore, using a color camera carries the risk of revealing the user's privacy.
Disclosure of Invention
Aiming at the above defects and improvement needs of the prior art, the present invention provides a depth-map-based indoor scene fine analysis method, which solves the technical problem that existing depth-map-based indoor scene parsing techniques, because they also rely on color images, cannot recognize an indoor scene in the absence of illumination.
In order to achieve the above object, according to an aspect of the present invention, there is provided a depth map-based indoor scene refinement analysis method, including:
(1) extracting a three-channel characteristic diagram of an indoor scene depth map to be analyzed, taking the extracted three-channel characteristic diagram as the input of a trained full convolution network, and segmenting a target in the indoor scene depth map to be analyzed;
(2) according to the extracted three-channel characteristic diagram, utilizing a full-connection conditional random field to perfect and optimize the boundary of the segmentation result to obtain category label vectors of all pixels in the indoor scene depth map to be analyzed;
(3) and converting the indoor scene depth map to be analyzed into point cloud, analyzing the three-dimensional structure of the target based on the category label vector, and obtaining the space posture of the target.
Preferably, step (1) specifically comprises:
(1.1) encoding the indoor scene depth map I to be analyzed into a three-channel map I_E, where the pixels of each channel image correspond one-to-one to the pixels in the indoor scene depth map I to be analyzed, and the three channels respectively represent the disparity value, the height from the ground, and the angle between the normal vector and the gravity direction;
(1.2) taking the three-channel map I_E as the input of the trained full convolution network, extracting multi-level CNN features layer by layer, where the convolution feature map obtained at one layer is downsampled and then fed to the next layer to extract a new convolution feature map;
(1.3) passing the convolution feature maps of different layers through deconvolution layers to upsample them to the same size, then fusing the feature maps of the different layers with one another, and feeding the fused feature map into a softmax layer;
(1.4) predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point belongs to each category, where the category corresponding to the maximum probability value is the initial category label of the pixel point.
Preferably, step (1.1) specifically comprises:
(1.1.1) obtaining, from the relation between disparity and depth, d ∝ 1/Z, the disparity value d corresponding to each pixel point from its depth value Z;
(1.1.2) obtaining the normal vector of each pixel point as n = norm[(∂P/∂u) × (∂P/∂v)], where norm[·] denotes normalization of the vector, the symbol × denotes the vector outer product, (u, v) denotes the pixel location on the two-dimensional plane of the indoor scene depth map to be analyzed, and P = (x, y, z) denotes the coordinates in the three-dimensional space of the indoor scene depth map to be analyzed; the conversion relation between the two-dimensional and three-dimensional coordinates is Z·[u, v, 1]^T = K·[x, y, z]^T, where K is the internal reference matrix of the depth camera;
(1.1.3) constructing the parallel set N∥ = {n_i : θ(n_i, g) < ρ or θ(n_i, g) > 180° − ρ} and the perpendicular set N⊥ = {n_i : |θ(n_i, g) − 90°| < ρ}, where n_i denotes the normal vector of a pixel point, g denotes the gravity direction, θ(n_i, g) is the angle between the normal vector and the gravity direction, and ρ denotes the angle error margin;
(1.1.4) taking the eigenvector of the matrix N⊥N⊥^T − N∥N∥^T associated with its smallest eigenvalue as the updated gravity vector, continuing to execute step (1.1.3) with the updated gravity vector until the estimate is stable and unchanged to obtain the target gravity vector, and calculating the angle between the normal vector of each pixel in the point cloud and the target gravity direction, where the point cloud denotes the three-dimensional point cloud formed by the coordinates (x, y, z) in three-dimensional space corresponding to all pixel points;
(1.1.5) taking the target gravity vector as the reference axis, calculating the projection value of each point along the target gravity vector, finding the lowest point, and taking the difference between the projection value of each other point along the target gravity vector and that of the lowest point as the height from the ground.
Preferably, step (1.4) specifically comprises:
predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point i belongs to each category, P_i(l) = exp(z_i^l) / Σ_{c=1}^{C} exp(z_i^c), where l ∈ {1, 2, …, C} denotes a category label and z_i^l denotes the output of the last layer of the full convolution network (without considering the softmax layer) for pixel point i and category l; the category corresponding to the maximum probability value max_l P_i(l) is taken as the initial category label of the pixel point.
Preferably, the step (2) specifically comprises:
(2.1) defining the conditional random field distribution by the conditional probability P(X = x | I) = exp(−E(x | I)) / Z(I), where X is the random vector composed of X_1, X_2, …, X_N, X_i (i = 1, 2, …, N) denotes the initial category label to which the ith pixel belongs, Z(I) = Σ_x exp(−E(x | I)) denotes the sum of the exp(·) terms over all possible labelings x, and E(x | I) denotes the total energy function of the conditional random field;
(2.2) obtaining the total energy function E(x | I) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j), in which the unary term is ψ_u(x_i) = −log P(x_i) and the binary term is ψ_p(x_i, x_j) = [x_i ≠ x_j] · ( w_1 · exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_E^i − I_E^j‖²/(2σ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2σ_γ²)) ), where p_i denotes the position of pixel point i, p_j denotes the position of pixel point j, the hyper-parameters σ_α, σ_β and σ_γ govern the Gaussian kernels and specify the range of adjacent pixels that have an effect on a given pixel, w_1 and w_2 denote the weights of the Gaussian kernel functions in the two different feature spaces, P(x_i) denotes the probability of the class label x_i at pixel point i, I_E^i denotes the value of the ith pixel point of the three-channel map I_E, I_E^j denotes the value of the jth pixel point of the three-channel map I_E, x_i denotes a possible label value of pixel point i, and x_j denotes a possible label value of pixel point j;
(2.3) solving for the labeling x that maximizes the conditional probability P(X = x | I); this labeling is the optimized segmentation result of the indoor scene depth map I to be analyzed, from which the target class label vectors of all pixels in the indoor scene depth map I to be analyzed are obtained.
Preferably, the method further comprises:
obtaining the error function of the full convolution network as E(z) = −(1/N) Σ_{i=1}^{N} log( exp(z_i^{y_i}) / Σ_{j=1}^{C} exp(z_i^{j}) ), where z denotes the output of the last layer of the full convolution network, N denotes the total number of pixels in the depth map, y_i ∈ {1, 2, …, C} denotes the manually labeled true category corresponding to pixel point i, C denotes the total number of categories, z_i^{j} denotes the output of the last layer of the full convolution network for pixel point i and category j, and z_i^{y_i} denotes the output of the last layer of the full convolution network for pixel point i and its true category y_i;
the method comprises the steps of training by utilizing a neural network framework Caffe, initializing full convolution network parameters, updating the full convolution network parameters by using a back propagation algorithm, stopping training when an error function value is not changed any more, and obtaining a trained full convolution network, wherein in the training process of the full convolution network, results obtained by shallow neural network layers are fused and output.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects: the invention adopts only the depth map as input; since the depth map is not affected by illumination conditions and reflects the spatial geometric characteristics of a complex indoor environment, scene segmentation and understanding carried out on this basis can effectively overcome occlusion, separate the foreground from the background, and give the spatial posture of the target object in three-dimensional coordinates.
Drawings
Fig. 1 is a schematic flowchart of an indoor scene refinement analysis method based on a depth map according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of another depth map-based indoor scene refinement analysis method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of estimating object spatial position information based on the 2D segmentation result according to the present invention, in which (1) shows the projection onto the xy plane in space of the depth-map pixels labeled as the object, (2) shows the result after filtering out noise points through morphological operations, (3) shows the 4 corner points found, V_i, i = 1, 2, 3, 4, and (4) shows the three-dimensional bounding box drawn after estimating the height of the object in space;
fig. 4 shows scene parsing results of experiments in a bedroom and a hospital ward, where the first row shows the input depth maps and the second row the corresponding refined parsing results.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides refined indoor scene parsing using only a depth map. Relying on the strong understanding and generalization capability of a deep convolutional neural network, information such as edges and shapes in the depth map can be learned automatically, yielding pixel-level segmentation and recognition results for the main indoor objects. On this basis, the position of each object in the 2D image plane is given, and, combined with traditional optimization methods, the position and posture information of the target in three-dimensional space can be derived. Owing to its good robustness and its ability to protect user privacy, this technology can provide powerful help for behavior analysis and intelligent nursing of the elderly. The daily activities of the elderly are closely related to large furniture such as beds, chairs, tables and sofas and to indoor structures such as floors and walls; for example, detecting whether an elderly person has fallen out of bed depends on locating the bed surface and the floor. In the future, the research and implementation of household robots serving the elderly will likewise depend on the computer's detailed understanding of indoor scenes.
In order to realize the purpose, the invention is mainly divided into three steps: first, scene parsing. Firstly, training a full convolution network for depth image analysis aiming at an indoor scene database, and during testing, segmenting an input depth image of a new scene by using the trained full convolution network to obtain an initial analysis result. Secondly, optimizing the analysis result. And calculating an energy function aiming at the whole graph by using the full-connection conditional random field to obtain an optimized segmentation result. Thirdly, on the basis of the analysis result, the depth map is converted into a three-dimensional point cloud, and the position and the posture of the target in the three-dimensional coordinate are estimated.
Fig. 1 is a schematic flowchart illustrating an indoor scene refinement analysis method based on a depth map according to an embodiment of the present invention; in the method shown in fig. 1, the following steps are included:
(1) extracting a three-channel characteristic diagram of the indoor scene depth diagram to be analyzed, taking the extracted three-channel characteristic diagram as the input of a trained full convolution network, and segmenting a target in the indoor scene depth diagram to be analyzed;
in an optional embodiment, the method further comprises the step of training the full convolutional network:
obtaining the error function of the full convolution network as E(z) = −(1/N) Σ_{i=1}^{N} log( exp(z_i^{y_i}) / Σ_{j=1}^{C} exp(z_i^{j}) ), where z denotes the output of the last layer of the full convolution network, N denotes the total number of pixels in the depth map, y_i ∈ {1, 2, …, C} denotes the manually labeled true category corresponding to pixel point i, C denotes the total number of categories, z_i^{j} denotes the output of the last layer of the full convolution network for pixel point i and category j, and z_i^{y_i} denotes the output of the last layer of the full convolution network for pixel point i and its true category y_i;
the method comprises the steps of training by utilizing a neural network framework Caffe, initializing parameters of a full convolution network, updating the parameters of the full convolution network by using a back propagation algorithm, stopping training when an error function value is not changed any more, and obtaining the trained full convolution network, wherein in order to obtain a more refined segmentation result, in the training process of the full convolution network, results obtained by a shallow neural network layer are fused and output.
As an optional implementation, fig. 2 shows a flowchart of an indoor scene depth image parsing method for intelligent nursing according to an embodiment of the present invention. When training the full convolution network for the indoor scene segmentation task, in order to obtain a network with sufficient generalization capability, the input sample images may be organized into a training data set specific to ward scenes on the basis of the existing NYUD2 indoor scene database; this training data set contains 100 depth pictures and is labeled mainly for beds, the floor, walls and other large indoor targets.
In the embodiment of the present invention, a VGG16 network model trained on the ImageNet data set may be used, the number of network layers may be increased or decreased according to actual needs, or other network structures, such as AlexNet and ResNet, may be used to initialize the neural network parameters. The embodiments of the present invention do not uniquely limit which network model is specifically adopted.
In an optional embodiment, step (1) specifically includes:
(1.1) encoding the indoor scene depth map I to be analyzed into a three-channel map I_E, where the pixels of each channel image correspond one-to-one to the pixels in the indoor scene depth map I to be analyzed, and the three channels respectively represent the disparity value, the height from the ground, and the angle between the normal vector and the gravity direction;
wherein, the step (1.1) specifically comprises the following steps:
(1.1.1) calculating the disparity value: from the relation between disparity and depth, d ∝ 1/Z, obtain the disparity value d corresponding to each pixel point from its depth value Z;
(1.1.2) calculating the angle between the normal vector and the gravity direction: the pixel position (u, v) on the two-dimensional plane of the indoor scene depth map to be analyzed and the coordinates (x, y, z) in the three-dimensional space of the indoor scene depth map to be analyzed satisfy the conversion relation between the two-dimensional and three-dimensional coordinates, Z·[u, v, 1]^T = K·[x, y, z]^T, where K is the internal reference matrix of the depth camera. The coordinates (x, y, z) corresponding to all pixel points form a three-dimensional point cloud, and the normal vector corresponding to each pixel point is computed as n = norm[(∂P/∂u) × (∂P/∂v)] with P = (x, y, z), where norm[·] denotes normalization of the vector and the symbol × denotes the vector outer product;
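As an illustration of step (1.1.2), the sketch below back-projects a depth map into a point cloud with the pinhole relation and estimates per-pixel normals by finite differences and a cross product. It is a minimal NumPy sketch under assumed conventions; the Kinect-like intrinsic matrix K, the function names and the finite-difference scheme are placeholders and are not taken from the patent.

```python
import numpy as np

def depth_to_pointcloud(depth, K):
    """Back-project a depth map (H, W) into camera coordinates using the
    pinhole relation Z*[u, v, 1]^T = K*[x, y, z]^T."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack([x, y, depth])               # (H, W, 3)

def normals_from_pointcloud(P):
    """Per-pixel normal n = norm[(dP/du) x (dP/dv)] via finite differences."""
    dPdu = np.gradient(P, axis=1)                 # derivative along image columns
    dPdv = np.gradient(P, axis=0)                 # derivative along image rows
    n = np.cross(dPdu, dPdv)
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-12)

# toy usage with a hypothetical Kinect-like intrinsic matrix
K = np.array([[575.0, 0.0, 319.5], [0.0, 575.0, 239.5], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)                  # a flat surface 2 m away
cloud = depth_to_pointcloud(depth, K)
normals = normals_from_pointcloud(cloud)
```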
(1.1.3) for all pixel normals in the point cloud, construct the parallel set N∥ = {n_i : θ(n_i, g) < ρ or θ(n_i, g) > 180° − ρ} and the perpendicular set N⊥ = {n_i : |θ(n_i, g) − 90°| < ρ}, where n_i denotes the normal vector of pixel point i, g denotes the gravity direction, whose initial value may be taken as the vertical axis of the camera coordinate system, θ(n_i, g) is the angle between the normal vector and the gravity direction, and ρ denotes the angular error margin, preferably ρ = 5°;
(1.1.4) take the eigenvector of the matrix N⊥N⊥^T − N∥N∥^T associated with its smallest eigenvalue as the updated gravity vector, and continue to execute step (1.1.3) with the updated gravity vector until the estimate is stable and unchanged, obtaining the target gravity vector; then calculate the angle between the normal vector of each pixel in the point cloud and the target gravity direction, where the point cloud denotes the three-dimensional point cloud formed by the coordinates (x, y, z) in three-dimensional space corresponding to all pixel points;
(1.1.5) calculating the height from the ground: taking the target gravity vector as the reference axis, compute the projection value of each point along the target gravity vector, find the lowest point, and take the difference between the projection value of each other point along the target gravity vector and that of the lowest point as its height from the ground.
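A possible NumPy sketch of the iterative gravity estimation and height computation of steps (1.1.3)-(1.1.5) is given below. The initial gravity along the camera y-axis, the choice of the eigenvector with the smallest eigenvalue, the iteration cap and the sign convention that larger projections mean greater height are assumptions made for illustration where the original formula images are not recoverable.

```python
import numpy as np

def estimate_gravity(normals, g0=np.array([0.0, 1.0, 0.0]), rho_deg=5.0, iters=10):
    """Iteratively refine the gravity direction from per-pixel unit normals
    (steps (1.1.3)-(1.1.4)); returns a unit vector."""
    N = normals.reshape(-1, 3)
    N = N[np.isfinite(N).all(axis=1)]
    g = g0 / np.linalg.norm(g0)
    rho = np.deg2rad(rho_deg)
    for _ in range(iters):
        cos_t = np.clip(N @ g, -1.0, 1.0)
        theta = np.arccos(np.abs(cos_t))              # angle folded into [0, 90 deg]
        N_par = N[theta < rho]                        # normals parallel to gravity (floor/ceiling)
        N_perp = N[np.abs(theta - np.pi / 2) < rho]   # normals perpendicular to gravity (walls)
        M = N_perp.T @ N_perp - N_par.T @ N_par
        w, V = np.linalg.eigh(M)
        g_new = V[:, 0]                               # eigenvector of the smallest eigenvalue
        if np.dot(g_new, g) < 0:                      # keep a consistent sign between iterations
            g_new = -g_new
        if np.allclose(g_new, g, atol=1e-6):
            break
        g = g_new
    return g

def height_from_ground(cloud, g):
    """Step (1.1.5): project every 3D point onto the gravity axis and subtract the
    lowest projection value; assumes g is oriented so that larger projections are higher."""
    proj = cloud.reshape(-1, 3) @ g
    return (proj - proj.min()).reshape(cloud.shape[:2])
```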
(1.2) taking the three-channel map I_E as the input of the trained full convolution network, extracting multi-level Convolutional Neural Network (CNN) features layer by layer, where the convolution feature map obtained at one layer is downsampled and then fed to the next layer to extract a new convolution feature map;
in specific implementation, the structure of the full convolution network and the parameters of each layer of convolution kernel adopted in the embodiment of the present invention are shown in fig. 2.
(1.3) respectively passing the convolution characteristic graphs in different layers through deconvolution layers, upsampling to the same size, then mutually fusing the characteristic graphs in different layers, and sending the fused characteristic graphs into a softmax layer;
specifically, taking fig. 2 as an example, the effect of the deconvolution layer and the convolution layer are exactly opposite, and the two operate in reverse. The characteristic diagram of the pool5 layer is up-sampled to 2 times of the original size through deconvolution, namely, the characteristic diagram has the same size as the pool4 layer, is up-sampled to 2 times of the size through deconvolution after being overlapped, namely, the characteristic diagram has the same size as the pool3 layer, and a final up-sampling result can be obtained after being overlapped and used as the input of the next softmax layer.
(1.4) predicting the category of each pixel point through a softmax layer, outputting the probability that each pixel point belongs to each category, wherein the category corresponding to the maximum probability value is the initial category label of the pixel point.
Wherein, the step (1.4) specifically comprises the following steps:
predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point i belongs to each category, P_i(l) = exp(z_i^l) / Σ_{c=1}^{C} exp(z_i^c), where l ∈ {1, 2, …, C} denotes a category label and z_i^l denotes the output of the last layer of the full convolution network (without considering the softmax layer) for pixel point i and category l; the category corresponding to the maximum probability value max_l P_i(l) is taken as the initial category label of the pixel point.
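For illustration, a minimal NumPy sketch of step (1.4) — a per-pixel softmax over the last-layer scores followed by an argmax that yields the initial labels — might look as follows; the (C, H, W) layout is an assumed convention.

```python
import numpy as np

def initial_labels(z):
    """Step (1.4): per-pixel softmax over last-layer scores z of shape (C, H, W),
    followed by an argmax giving the initial category label of every pixel."""
    z = z - z.max(axis=0, keepdims=True)           # numerical stability
    prob = np.exp(z)
    prob /= prob.sum(axis=0, keepdims=True)        # P_i(l) for every pixel i and label l
    return prob.argmax(axis=0), prob               # labels (H, W), probabilities (C, H, W)

labels, prob = initial_labels(np.random.randn(5, 480, 640))
```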
(2) According to the extracted three-channel characteristic diagram, utilizing a full-connection conditional random field to perfect and optimize the boundary of the segmentation result to obtain category label vectors of all pixels in the indoor scene depth map to be analyzed;
in an optional embodiment, step (2) specifically includes:
(2.1) defining the conditional random field distribution by the conditional probability P(X = x | I) = exp(−E(x | I)) / Z(I), where X is the random vector composed of X_1, X_2, …, X_N, X_i (i = 1, 2, …, N) denotes the initial category label to which the ith pixel belongs, Z(I) = Σ_x exp(−E(x | I)) denotes the sum of the exp(·) terms over all possible labelings x, and E(x | I) denotes the total energy function of the conditional random field;
(2.2) obtaining the total energy function E(x | I) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j), in which the unary term is ψ_u(x_i) = −log P(x_i) and the binary term is ψ_p(x_i, x_j) = [x_i ≠ x_j] · ( w_1 · exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_E^i − I_E^j‖²/(2σ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2σ_γ²)) ), where p_i denotes the position of pixel point i, p_j denotes the position of pixel point j, the hyper-parameters σ_α, σ_β and σ_γ govern the Gaussian kernels and specify the range of adjacent pixels that have an effect on a given pixel, w_1 and w_2 denote the weights of the Gaussian kernel functions in the two different feature spaces, P(x_i) denotes the probability of the class label x_i at pixel point i, I_E^i denotes the value of the ith pixel point of the three-channel map I_E, I_E^j denotes the value of the jth pixel point of the three-channel map I_E, x_i denotes a possible label value of pixel point i, and x_j denotes a possible label value of pixel point j;
In implementations of the present invention, a cross-validation method can be used to determine the optimal combination of the above parameters. First, w_2 and σ_γ are set to 3 by default; then 100 samples are randomly selected from the validation data set to search for w_1, σ_α and σ_β, with the search ranges set to w_1 ∈ (0, 20), σ_α ∈ (0, 100) and σ_β ∈ (0, 20); the optimal values of w_1, σ_α and σ_β are determined from this experiment.
(2.3) solving for the labeling x that maximizes the conditional probability P(X = x | I); this labeling is the optimized segmentation result of the indoor scene depth map I to be analyzed, from which the target class label vectors of all pixels in the indoor scene depth map I to be analyzed are obtained.
Carrying out approximate inference on the probability distribution of the model with an efficient high-dimensional filtering algorithm markedly improves the speed of the optimization.
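To make the energy E(x | I) of steps (2.1)-(2.2) concrete, the brute-force NumPy sketch below evaluates the unary term and the two Gaussian pairwise kernels for a candidate labeling on a tiny crop. The Potts compatibility [x_i ≠ x_j] and the default kernel weights and bandwidths are placeholders; a practical system would instead run mean-field inference with high-dimensional filtering (for example via a dense-CRF library) rather than this O(N²) loop.

```python
import numpy as np

def dense_crf_energy(labels, prob, I_E, w1=5.0, w2=3.0,
                     sigma_alpha=50.0, sigma_beta=5.0, sigma_gamma=3.0):
    """Total energy E(x|I) of the fully connected CRF for a candidate labeling.

    labels -- candidate label x_i per pixel, shape (H, W)
    prob   -- softmax probabilities from the FCN, shape (C, H, W)
    I_E    -- three-channel encoded map, shape (H, W, 3)
    """
    H, W = labels.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    unary = -np.log(prob[labels, v, u] + 1e-12).sum()          # sum_i -log P(x_i)

    pos = np.stack([v, u], axis=-1).reshape(-1, 2).astype(float)   # pixel positions p_i
    feat = I_E.reshape(-1, 3).astype(float)                        # three-channel values
    lab = labels.reshape(-1)
    pairwise = 0.0
    n = H * W
    for i in range(n):            # O(N^2); real systems use high-dimensional filtering
        dp = ((pos[i] - pos) ** 2).sum(axis=1)
        df = ((feat[i] - feat) ** 2).sum(axis=1)
        k = (w1 * np.exp(-dp / (2 * sigma_alpha ** 2) - df / (2 * sigma_beta ** 2))
             + w2 * np.exp(-dp / (2 * sigma_gamma ** 2)))
        pairwise += (k * (lab[i] != lab)).sum()                    # Potts term [x_i != x_j]
    return unary + 0.5 * pairwise                                  # each unordered pair once

# toy usage on an 8x8 crop with 4 categories
H, W, C = 8, 8, 4
prob = np.random.dirichlet(np.ones(C), size=H * W).T.reshape(C, H, W)
I_E = np.random.rand(H, W, 3)
labels = prob.argmax(axis=0)
print(dense_crf_energy(labels, prob, I_E))
```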
(3) And converting the indoor scene depth map to be analyzed into point cloud, analyzing the three-dimensional structure of the target based on the category label vector, and obtaining the space posture of the target.
In the embodiment of the present invention, as shown in fig. 3, the spatial orientation analysis of the bed surface is taken as an example; other types of targets can be handled with simple modifications according to the practical application, and the principle is unchanged. In fig. 3, (1) shows the projection onto the xy plane in space of the depth-map pixels labeled as the object, (2) shows the result after filtering out noise points through morphological operations, (3) shows the 4 corner points found, V_i, i = 1, 2, 3, 4, and (4) shows the three-dimensional bounding box drawn after estimating the height of the object in space. Step (3) specifically comprises the following sub-steps:
and (3.1) projecting the segmentation result on the two-dimensional image plane of the indoor scene depth map to be analyzed to the coordinates of the three-dimensional point cloud. The calculation method of the projection is completely the same as the step (1.1.2);
(3.2) projecting the three-dimensional coordinates of the pixel points with the bed surface labels to an xy plane, performing morphological operation (for example, performing morphological corrosion operation and then performing morphological expansion operation) and filtering noise points;
(3.3) finding the points with the maximum and minimum coordinates in the x direction and the y direction and marking them as V_i, i = 1, 2, 3, 4, representing the 4 corner points of the bed surface; connecting the V_i in order forms a closed geometric figure, and the normal vectors of all points inside the closed figure represent the orientation of the plane in space, expressing the attitude and structural information of the target in three-dimensional space;
(3.4) calculating the distance between the upper plane and the ground plane as the height h of the space occupied by the object, and using V_i and h to draw the cuboid frame of the stereo attitude estimate in the three-dimensional coordinate system. Fig. 4 shows scene parsing results of experiments in a bedroom and a hospital ward, where the first row shows the input depth maps and the second row the corresponding refined parsing results.
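A minimal sketch of steps (3.1)-(3.4) is given below, assuming the point cloud has already been rotated so that its z axis is aligned with the target gravity vector (height above the floor) and that the floor height is known from the parsing result. The function name, the bed_label and floor_height parameters, the opening radius and the use of the median bed-surface height are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np
from scipy import ndimage

def estimate_bed_box(cloud, labels, bed_label, floor_height=0.0):
    """Fit a 3D box to the pixels labelled as the bed surface.

    cloud  -- gravity-aligned point cloud, shape (H, W, 3), z = height
    labels -- parsing result, shape (H, W)
    Returns the 4 corner points V_i on the xy plane and the height h.
    """
    mask = labels == bed_label
    # (3.2) morphological opening (erosion then dilation) removes noise points
    mask = ndimage.binary_dilation(ndimage.binary_erosion(mask, iterations=2),
                                   iterations=2)
    pts = cloud[mask]                              # (M, 3) bed-surface points
    if pts.size == 0:
        return None, 0.0
    xy = pts[:, :2]                                # projection onto the xy plane
    # (3.3) extreme points in x and y as the 4 corner points V_1..V_4
    corners = np.array([xy[xy[:, 0].argmin()], xy[xy[:, 1].argmax()],
                        xy[xy[:, 0].argmax()], xy[xy[:, 1].argmin()]])
    # (3.4) height of the occupied space: bed-surface plane above the floor plane
    h = float(np.median(pts[:, 2]) - floor_height)
    return corners, h
```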
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A depth map-based indoor scene refinement analysis method is characterized by comprising the following steps:
(1) extracting a three-channel characteristic diagram of an indoor scene depth map to be analyzed, taking the extracted three-channel characteristic diagram as the input of a trained full convolution network, and segmenting a target in the indoor scene depth map to be analyzed;
(2) according to the extracted three-channel characteristic diagram, utilizing a full-connection conditional random field to perfect and optimize the boundary of the segmentation result to obtain category label vectors of all pixels in the indoor scene depth map to be analyzed;
in the three-channel characteristic diagram, pixels of each channel image correspond to pixels in the indoor scene depth diagram to be analyzed one by one, and the three channels respectively represent parallax values, ground height and the size of an included angle between a normal vector and the gravity direction;
the method for extracting the three-channel characteristic diagram of the indoor scene depth map to be analyzed comprises the following steps:
(1.1.1) obtaining, from the relation between disparity and depth, d ∝ 1/Z, the disparity value d corresponding to each pixel point from its depth value Z;
(1.1.2) obtaining the normal vector of each pixel point as n = norm[(∂P/∂u) × (∂P/∂v)], where norm[·] denotes normalization of the vector, the symbol × denotes the vector outer product, (u, v) denotes the pixel location on the two-dimensional plane of the indoor scene depth map to be analyzed, and P = (x, y, z) denotes the coordinates in the three-dimensional space of the indoor scene depth map to be analyzed; the conversion relation between the two-dimensional and three-dimensional coordinates is Z·[u, v, 1]^T = K·[x, y, z]^T, where K is the internal reference matrix of the depth camera;
(1.1.3) constructing the parallel set N∥ = {n_i : θ(n_i, g) < ρ or θ(n_i, g) > 180° − ρ} and the perpendicular set N⊥ = {n_i : |θ(n_i, g) − 90°| < ρ}, where n_i denotes the normal vector of a pixel point, g denotes the gravity direction, θ(n_i, g) is the angle between the normal vector and the gravity direction, and ρ denotes the angle error margin;
(1.1.4) taking the eigenvector of the matrix N⊥N⊥^T − N∥N∥^T associated with its smallest eigenvalue as the updated gravity vector, continuing to execute step (1.1.3) with the updated gravity vector until the estimate is stable and unchanged to obtain the target gravity vector, and calculating the angle between the normal vector of each pixel in the point cloud and the target gravity direction, where the point cloud denotes the three-dimensional point cloud formed by the coordinates (x, y, z) in three-dimensional space corresponding to all pixel points;
(1.1.5) calculating a projection value of each point along the target gravity vector by taking the target gravity vector as a reference axis, finding the lowest point, and taking the difference value between the projection value of other points along the target gravity vector and the lowest point as the height from the ground;
(3) and converting the indoor scene depth map to be analyzed into point cloud, analyzing the three-dimensional structure of the target based on the category label vector, and obtaining the space posture of the target.
2. The method according to claim 1, wherein step (1) comprises in particular:
(1.1) encoding the indoor scene depth map I to be analyzed into a three-channel map I_E;
(1.2) taking the three-channel map I_E as the input of the trained full convolution network, extracting multi-level CNN features layer by layer, where the convolution feature map obtained at one layer is downsampled and then fed to the next layer to extract a new convolution feature map;
(1.3) respectively passing the convolution characteristic graphs in different layers through deconvolution layers, upsampling to the same size, then mutually fusing the characteristic graphs in different layers, and sending the fused characteristic graphs into a softmax layer;
(1.4) predicting the category of each pixel point through a softmax layer, outputting the probability that each pixel point belongs to each category, wherein the category corresponding to the maximum probability value is the initial category label of the pixel point.
3. The method according to claim 2, characterized in that step (1.4) comprises in particular:
predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point i belongs to each category, P_i(l) = exp(z_i^l) / Σ_{c=1}^{C} exp(z_i^c), where l ∈ {1, 2, …, C} denotes a category label and z_i^l denotes the output of the last layer of the full convolution network (without considering the softmax layer) for pixel point i and category l; the category corresponding to the maximum probability value is taken as the initial category label of the pixel point.
4. The method according to claim 2 or 3, characterized in that step (2) comprises in particular:
(2.1) defining the conditional random field distribution by the conditional probability P(X = x | I) = exp(−E(x | I)) / Z(I), where X is the random vector composed of X_1, X_2, …, X_N, X_i (i = 1, 2, …, N) denotes the initial category label to which the ith pixel belongs, Z(I) = Σ_x exp(−E(x | I)) denotes the sum of the exp(·) terms over all possible labelings x, and E(x | I) denotes the total energy function of the conditional random field;
(2.2) obtaining the total energy function E(x | I) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j), in which the unary term is ψ_u(x_i) = −log P(x_i) and the binary term is ψ_p(x_i, x_j) = [x_i ≠ x_j] · ( w_1 · exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_E^i − I_E^j‖²/(2σ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2σ_γ²)) ), where p_i denotes the position of pixel point i, p_j denotes the position of pixel point j, the hyper-parameters σ_α, σ_β and σ_γ govern the Gaussian kernels and specify the range of adjacent pixels that have an effect on a given pixel, w_1 and w_2 denote the weights of the Gaussian kernel functions in the two different feature spaces, P(x_i) denotes the probability of the class label x_i at pixel point i, I_E^i denotes the value of the ith pixel point of the three-channel map I_E, I_E^j denotes the value of the jth pixel point of the three-channel map I_E, x_i denotes a possible label value of pixel point i, and x_j denotes a possible label value of pixel point j;
(2.3) solving for the labeling x that maximizes the conditional probability P(X = x | I); this labeling is the optimized segmentation result of the indoor scene depth map I to be analyzed, from which the target class label vectors of all pixels in the indoor scene depth map I to be analyzed are obtained.
5. The method of claim 1, further comprising:
obtaining the error function of the full convolution network as E(z) = −(1/N) Σ_{i=1}^{N} log( exp(z_i^{y_i}) / Σ_{j=1}^{C} exp(z_i^{j}) ), where z denotes the output of the last layer of the full convolution network, N denotes the total number of pixels in the depth map, y_i ∈ {1, 2, …, C} denotes the manually labeled true category corresponding to pixel point i, C denotes the total number of categories, z_i^{j} denotes the output of the last layer of the full convolution network for pixel point i and category j, and z_i^{y_i} denotes the output of the last layer of the full convolution network for pixel point i and its true category y_i;
the method comprises the steps of training by utilizing a neural network framework Caffe, initializing full convolution network parameters, updating the full convolution network parameters by using a back propagation algorithm, stopping training when an error function value is not changed any more, and obtaining a trained full convolution network, wherein in the training process of the full convolution network, results obtained by shallow neural network layers are fused and output.
CN201710874793.3A 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map Active CN107622244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710874793.3A CN107622244B (en) 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710874793.3A CN107622244B (en) 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map

Publications (2)

Publication Number Publication Date
CN107622244A CN107622244A (en) 2018-01-23
CN107622244B true CN107622244B (en) 2020-08-28

Family

ID=61090539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710874793.3A Active CN107622244B (en) 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map

Country Status (1)

Country Link
CN (1) CN107622244B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596102B (en) * 2018-04-26 2022-04-05 北京航空航天大学青岛研究院 RGB-D-based indoor scene object segmentation classifier construction method
CN109034183B (en) * 2018-06-07 2021-05-18 苏州飞搜科技有限公司 Target detection method, device and equipment
CN109118490B (en) * 2018-06-28 2021-02-26 厦门美图之家科技有限公司 Image segmentation network generation method and image segmentation method
CN110378359B (en) * 2018-07-06 2021-11-05 北京京东尚科信息技术有限公司 Image identification method and device
CN109064455B (en) * 2018-07-18 2021-06-25 清华大学深圳研究生院 BI-RADS-based classification method for breast ultrasound image multi-scale fusion
CN110827337B (en) * 2018-08-08 2023-01-24 深圳地平线机器人科技有限公司 Method and device for determining posture of vehicle-mounted camera and electronic equipment
CN110160502B (en) 2018-10-12 2022-04-01 腾讯科技(深圳)有限公司 Map element extraction method, device and server
CN109452914A (en) * 2018-11-01 2019-03-12 北京石头世纪科技有限公司 Intelligent cleaning equipment, cleaning mode selection method, computer storage medium
CN109409376B (en) * 2018-11-05 2020-10-30 昆山紫东智能科技有限公司 Image segmentation method for solid waste object, computer terminal and storage medium
CN109635685B (en) * 2018-11-29 2021-02-12 北京市商汤科技开发有限公司 Target object 3D detection method, device, medium and equipment
CN109658449B (en) * 2018-12-03 2020-07-10 华中科技大学 Indoor scene three-dimensional reconstruction method based on RGB-D image
CN110046747B (en) * 2019-03-19 2021-07-27 华中科技大学 Method and system for planning paths among users of social network facing to image flow
CN109917419B (en) * 2019-04-12 2021-04-13 中山大学 Depth filling dense system and method based on laser radar and image
CN110047047B (en) * 2019-04-17 2023-02-10 广东工业大学 Method for interpreting three-dimensional morphology image information device, apparatus and storage medium
CN110222767B (en) * 2019-06-08 2021-04-06 西安电子科技大学 Three-dimensional point cloud classification method based on nested neural network and grid map
CN110569709A (en) * 2019-07-16 2019-12-13 浙江大学 Scene analysis method based on knowledge reorganization
CN111325135B (en) * 2020-02-17 2022-11-29 天津中科智能识别产业技术研究院有限公司 Novel online real-time pedestrian tracking method based on deep learning feature template matching
CN111507266A (en) * 2020-04-17 2020-08-07 四川长虹电器股份有限公司 Human body detection method and device based on depth image
CN112818756A (en) * 2021-01-13 2021-05-18 上海西井信息科技有限公司 Target detection method, system, device and storage medium
CN113052971B (en) * 2021-04-09 2022-06-10 杭州群核信息技术有限公司 Neural network-based automatic layout design method, device and system for indoor lamps and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105979244A (en) * 2016-05-31 2016-09-28 十二维度(北京)科技有限公司 Method and system used for converting 2D image to 3D image based on deep learning
CN106296728A (en) * 2016-07-27 2017-01-04 昆明理工大学 A kind of Segmentation of Moving Object method in unrestricted scene based on full convolutional network
CN106600571A (en) * 2016-11-07 2017-04-26 中国科学院自动化研究所 Brain tumor automatic segmentation method through fusion of full convolutional neural network and conditional random field
CN106815563A (en) * 2016-12-27 2017-06-09 浙江大学 A kind of crowd's quantitative forecasting technique based on human body apparent structure
CN106934765A (en) * 2017-03-14 2017-07-07 长沙全度影像科技有限公司 Panoramic picture fusion method based on depth convolutional neural networks Yu depth information


Also Published As

Publication number Publication date
CN107622244A (en) 2018-01-23

Similar Documents

Publication Publication Date Title
CN107622244B (en) Indoor scene fine analysis method based on depth map
US11816907B2 (en) Systems and methods for extracting information about objects from scene information
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
He et al. Deep learning based 3D segmentation: A survey
CN108269266B (en) Generating segmented images using Markov random field optimization
Häne et al. Dense semantic 3d reconstruction
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN111798475A (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
TW202034215A (en) Mapping object instances using video data
CN107798725B (en) Android-based two-dimensional house type identification and three-dimensional presentation method
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
Tang et al. BIM generation from 3D point clouds by combining 3D deep learning and improved morphological approach
Liu et al. 3D Point cloud analysis
Qian et al. Learning pairwise inter-plane relations for piecewise planar reconstruction
Pahwa et al. Locating 3D object proposals: A depth-based online approach
US20230334727A1 (en) 2d and 3d floor plan generation
Pintore et al. Automatic modeling of cluttered multi‐room floor plans from panoramic images
Wang et al. Understanding of wheelchair ramp scenes for disabled people with visual impairments
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
Mohan et al. Room layout estimation in indoor environment: a review
Zhang et al. A robust visual odometry based on RGB-D camera in dynamic indoor environments
Pintore et al. Automatic 3D reconstruction of structured indoor environments
CN116030335A (en) Visual positioning method and system based on indoor building framework constraint
Zioulis et al. Monocular spherical depth estimation with explicitly connected weak layout cues
Zhang et al. Geometric and Semantic Modeling from RGB-D Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant