CN107622244B - Indoor scene fine analysis method based on depth map - Google Patents


Publication number
CN107622244B
CN107622244B
Authority
CN
China
Prior art keywords
depth map
indoor scene
analyzed
value
category
Prior art date
Legal status
Active
Application number
CN201710874793.3A
Other languages
Chinese (zh)
Other versions
CN107622244A (en)
Inventor
曹治国
杭凌霄
肖阳
赵峰
张博深
王立
李涛
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710874793.3A
Publication of CN107622244A
Application granted
Publication of CN107622244B
Active legal status
Anticipated expiration


Abstract

The invention discloses a depth-map-based indoor scene fine analysis method, applied in the technical field of digital image processing and pattern recognition, comprising the following steps: extracting three-channel features of the depth map, and segmenting the targets in the depth map of the indoor scene to be analyzed with a trained full convolution network; on the depth feature map, refining and optimizing the boundaries of the segmentation result with a fully connected conditional random field to obtain the category label vectors of all pixels in the indoor scene depth map to be analyzed; and converting the indoor scene depth map to be analyzed into a point cloud and analyzing the three-dimensional structure of the targets based on the category label vectors to obtain the spatial posture of the targets. The invention uses only the depth map as input, realizes semantic segmentation of the indoor scene, gives the spatial posture of specific objects in three-dimensional coordinates, can effectively overcome occlusion, separates the foreground from the background, and protects the privacy of the user.

Description

Indoor scene fine analysis method based on depth map
Technical Field
The invention belongs to the technical field of digital image processing and pattern recognition, and particularly relates to a depth map-based indoor scene fine analysis method.
Background
Indoor scene parsing is a task that integrates target detection and image segmentation techniques. It requires the computer to understand an image at multiple levels, and involves all-round, multi-angle algorithm design from 2D to 3D, from low-level object localization, recognition and segmentation up to high-level scene recognition and indoor object layout analysis.
Traditional scene parsing is mainly based on color images and relies on limited information sources, chiefly color, texture and the like. Existing algorithms adopt a bottom-up framework that classifies image superpixels and then optimizes the segmentation result with a graphical model. However, these algorithms have two defects: first, under heavy indoor occlusion and with complex objects, their robustness is poor and it is difficult to distinguish targets from the background; second, a planar color image is inherently short of information and cannot provide the position of a target in three-dimensional space.
In recent years, the popularization of depth cameras has provided a new dimension for solving these problems and has greatly raised the level of indoor scene parsing and understanding. Depth images provide a view closer to the real world: the difference between foreground and background is reflected by distance, and surface geometric information is added on top of the visual information. These unique characteristics of depth images greatly facilitate 3D analysis of indoor scenes.
The existing depth-map-based indoor scene parsing techniques are similar in approach to traditional color-image methods: they only use the depth information as an additional feature and do not fully exploit the unique characteristics of the depth map. It is worth noting that, in practical applications, both traditional methods based solely on color images and methods that rely on color images in addition to depth images inevitably fail when the lights are turned off at night. Furthermore, using a color camera carries the risk of revealing the user's privacy.
Disclosure of Invention
Aiming at the above defects and improvement needs of the prior art, the present invention provides a depth-map-based indoor scene fine analysis method, which solves the technical problem that existing depth-map-based indoor scene parsing techniques, because they also rely on color images, cannot recognize an indoor scene in the absence of illumination.
In order to achieve the above object, according to an aspect of the present invention, there is provided a depth map-based indoor scene refinement analysis method, including:
(1) extracting a three-channel characteristic diagram of an indoor scene depth map to be analyzed, taking the extracted three-channel characteristic diagram as the input of a trained full convolution network, and segmenting a target in the indoor scene depth map to be analyzed;
(2) according to the extracted three-channel characteristic diagram, utilizing a full-connection conditional random field to perfect and optimize the boundary of the segmentation result to obtain category label vectors of all pixels in the indoor scene depth map to be analyzed;
(3) and converting the indoor scene depth map to be analyzed into point cloud, analyzing the three-dimensional structure of the target based on the category label vector, and obtaining the space posture of the target.
Preferably, step (1) specifically comprises:
(1.1) encoding the indoor scene depth map I to be analyzed into a three-channel map I_E, where the pixels of each channel image correspond one-to-one to the pixels in the indoor scene depth map I to be analyzed, and the three channels respectively represent the disparity value, the height from the ground, and the angle between the normal vector and the gravity direction;
(1.2) taking the three-channel map I_E as the input of the trained full convolution network, extracting multi-level CNN features layer by layer, where the convolution feature map obtained at one layer is downsampled and then fed to the next layer to extract a new convolution feature map;
(1.3) passing the convolution feature maps of different layers through deconvolution layers to upsample them to the same size, then fusing the feature maps of the different layers with one another, and feeding the fused feature map into a softmax layer;
(1.4) predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point belongs to each category, where the category corresponding to the maximum probability value is the initial category label of the pixel point.
Preferably, step (1.1) specifically comprises:
(1.1.1) obtaining, from the relation between disparity and depth, d ∝ 1/Z, the disparity value d corresponding to each pixel point from its depth value Z;
(1.1.2) obtaining the normal vector of each pixel point as n = norm[(∂P/∂u) × (∂P/∂v)], where norm[·] denotes normalization of the vector, the symbol × denotes the vector outer product, (u, v) denotes the pixel location on the two-dimensional plane of the indoor scene depth map to be analyzed, and P = (x, y, z) denotes the coordinates in the three-dimensional space of the indoor scene depth map to be analyzed; the conversion relation between the two-dimensional and three-dimensional coordinates is Z·[u, v, 1]^T = K·[x, y, z]^T, where K is the internal reference matrix of the depth camera;
(1.1.3) constructing the parallel set N∥ = {n_i : θ(n_i, g) < ρ or θ(n_i, g) > 180° − ρ} and the perpendicular set N⊥ = {n_i : |θ(n_i, g) − 90°| < ρ}, where n_i denotes the normal vector of a pixel point, g denotes the gravity direction, θ(n_i, g) is the angle between the normal vector and the gravity direction, and ρ denotes the angle error margin;
(1.1.4) taking the eigenvector of the matrix N⊥N⊥^T − N∥N∥^T associated with its smallest eigenvalue as the updated gravity vector, continuing to execute step (1.1.3) with the updated gravity vector until the estimate is stable and unchanged to obtain the target gravity vector, and calculating the angle between the normal vector of each pixel in the point cloud and the target gravity direction, where the point cloud denotes the three-dimensional point cloud formed by the coordinates (x, y, z) in three-dimensional space corresponding to all pixel points;
(1.1.5) taking the target gravity vector as the reference axis, calculating the projection value of each point along the target gravity vector, finding the lowest point, and taking the difference between the projection value of each other point along the target gravity vector and that of the lowest point as the height from the ground.
Preferably, step (1.4) specifically comprises:
predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point i belongs to each category, P_i(l) = exp(z_i^l) / Σ_{c=1}^{C} exp(z_i^c), where l ∈ {1, 2, …, C} denotes a category label and z_i^l denotes the output of the last layer of the full convolution network (without considering the softmax layer) for pixel point i and category l; the category corresponding to the maximum probability value max_l P_i(l) is taken as the initial category label of the pixel point.
Preferably, the step (2) specifically comprises:
(2.1) defining the conditional random field distribution by the conditional probability P(X = x | I) = exp(−E(x | I)) / Z(I), where X is the random vector composed of X_1, X_2, …, X_N, X_i (i = 1, 2, …, N) denotes the initial category label to which the ith pixel belongs, Z(I) = Σ_x exp(−E(x | I)) denotes the sum of the exp(·) terms over all possible labelings x, and E(x | I) denotes the total energy function of the conditional random field;
(2.2) obtaining the total energy function E(x | I) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j), in which the unary term is ψ_u(x_i) = −log P(x_i) and the binary term is ψ_p(x_i, x_j) = [x_i ≠ x_j] · ( w_1 · exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_E^i − I_E^j‖²/(2σ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2σ_γ²)) ), where p_i denotes the position of pixel point i, p_j denotes the position of pixel point j, the hyper-parameters σ_α, σ_β and σ_γ govern the Gaussian kernels and specify the range of adjacent pixels that have an effect on a given pixel, w_1 and w_2 denote the weights of the Gaussian kernel functions in the two different feature spaces, P(x_i) denotes the probability of the class label x_i at pixel point i, I_E^i denotes the value of the ith pixel point of the three-channel map I_E, I_E^j denotes the value of the jth pixel point of the three-channel map I_E, x_i denotes a possible label value of pixel point i, and x_j denotes a possible label value of pixel point j;
(2.3) solving for the labeling x that maximizes the conditional probability P(X = x | I); this labeling is the optimized segmentation result of the indoor scene depth map I to be analyzed, from which the target class label vectors of all pixels in the indoor scene depth map I to be analyzed are obtained.
Preferably, the method further comprises:
obtaining the error function of the full convolution network as E(z) = −(1/N) Σ_{i=1}^{N} log( exp(z_i^{y_i}) / Σ_{j=1}^{C} exp(z_i^{j}) ), where z denotes the output of the last layer of the full convolution network, N denotes the total number of pixels in the depth map, y_i ∈ {1, 2, …, C} denotes the manually labeled true category corresponding to pixel point i, C denotes the total number of categories, z_i^{j} denotes the output of the last layer of the full convolution network for pixel point i and category j, and z_i^{y_i} denotes the output of the last layer of the full convolution network for pixel point i and its true category y_i;
the method comprises the steps of training by utilizing a neural network framework Caffe, initializing full convolution network parameters, updating the full convolution network parameters by using a back propagation algorithm, stopping training when an error function value is not changed any more, and obtaining a trained full convolution network, wherein in the training process of the full convolution network, results obtained by shallow neural network layers are fused and output.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects: the invention adopts only the depth map as input; since the depth map is not affected by illumination conditions and reflects the spatial geometric characteristics of a complex indoor environment, scene segmentation and understanding carried out on this basis can effectively overcome occlusion, separate the foreground from the background, and give the spatial posture of the target object in three-dimensional coordinates.
Drawings
Fig. 1 is a schematic flowchart of an indoor scene refinement analysis method based on a depth map according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of another depth map-based indoor scene refinement analysis method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of estimating object spatial position information based on the 2D segmentation result according to the present invention, in which (1) shows the projection onto the xy plane in space of the depth-map pixels labeled as the object, (2) shows the result after filtering out noise points through morphological operations, (3) shows the 4 corner points found, V_i, i = 1, 2, 3, 4, and (4) shows the three-dimensional bounding box drawn after estimating the height of the object in space;
fig. 4 shows scene parsing results of experiments in a bedroom and a hospital ward, where the first row shows the input depth maps and the second row the corresponding refined parsing results.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides refined indoor scene parsing using only a depth map. Relying on the strong understanding and generalization capability of a deep convolutional neural network, information such as edges and shapes in the depth map can be learned automatically, yielding pixel-level segmentation and recognition results for the main indoor objects. On this basis, the position of each object in the 2D image plane is given, and, combined with traditional optimization methods, the position and posture information of the target in three-dimensional space can be derived. Owing to its good robustness and its ability to protect user privacy, this technology can provide powerful help for behavior analysis and intelligent nursing of the elderly. The daily activities of the elderly are closely related to large furniture such as beds, chairs, tables and sofas and to indoor structures such as floors and walls; for example, detecting whether an elderly person has fallen out of bed depends on locating the bed surface and the floor. In the future, the research and implementation of household robots serving the elderly will likewise depend on the computer's detailed understanding of indoor scenes.
In order to realize the purpose, the invention is mainly divided into three steps: first, scene parsing. Firstly, training a full convolution network for depth image analysis aiming at an indoor scene database, and during testing, segmenting an input depth image of a new scene by using the trained full convolution network to obtain an initial analysis result. Secondly, optimizing the analysis result. And calculating an energy function aiming at the whole graph by using the full-connection conditional random field to obtain an optimized segmentation result. Thirdly, on the basis of the analysis result, the depth map is converted into a three-dimensional point cloud, and the position and the posture of the target in the three-dimensional coordinate are estimated.
Fig. 1 is a schematic flowchart illustrating an indoor scene refinement analysis method based on a depth map according to an embodiment of the present invention; in the method shown in fig. 1, the following steps are included:
(1) extracting a three-channel characteristic diagram of the indoor scene depth diagram to be analyzed, taking the extracted three-channel characteristic diagram as the input of a trained full convolution network, and segmenting a target in the indoor scene depth diagram to be analyzed;
in an optional embodiment, the method further comprises the step of training the full convolutional network:
obtaining the error function of the full convolution network as E(z) = −(1/N) Σ_{i=1}^{N} log( exp(z_i^{y_i}) / Σ_{j=1}^{C} exp(z_i^{j}) ), where z denotes the output of the last layer of the full convolution network, N denotes the total number of pixels in the depth map, y_i ∈ {1, 2, …, C} denotes the manually labeled true category corresponding to pixel point i, C denotes the total number of categories, z_i^{j} denotes the output of the last layer of the full convolution network for pixel point i and category j, and z_i^{y_i} denotes the output of the last layer of the full convolution network for pixel point i and its true category y_i;
the method comprises the steps of training by utilizing a neural network framework Caffe, initializing parameters of a full convolution network, updating the parameters of the full convolution network by using a back propagation algorithm, stopping training when an error function value is not changed any more, and obtaining the trained full convolution network, wherein in order to obtain a more refined segmentation result, in the training process of the full convolution network, results obtained by a shallow neural network layer are fused and output.
As an optional implementation, fig. 2 shows a flowchart of an indoor scene depth image parsing method for intelligent nursing according to an embodiment of the present invention. When training the full convolution network for the indoor scene segmentation task, in order to obtain a network with sufficient generalization capability, the input sample images may be organized into a training data set specific to ward scenes on the basis of the existing NYUD2 indoor scene database; this training data set contains 100 depth pictures and is labeled mainly for beds, the floor, walls and other large indoor targets.
In the embodiment of the present invention, a VGG16 network model trained on the ImageNet data set may be used, the number of network layers may be increased or decreased according to actual needs, or other network structures, such as AlexNet and ResNet, may be used to initialize the neural network parameters. The embodiments of the present invention do not uniquely limit which network model is specifically adopted.
In an optional embodiment, step (1) specifically includes:
(1.1) encoding the indoor scene depth map I to be analyzed into a three-channel map I_E, where the pixels of each channel image correspond one-to-one to the pixels in the indoor scene depth map I to be analyzed, and the three channels respectively represent the disparity value, the height from the ground, and the angle between the normal vector and the gravity direction;
wherein, the step (1.1) specifically comprises the following steps:
(1.1.1) calculating the disparity value: from the relation between disparity and depth, d ∝ 1/Z, obtain the disparity value d corresponding to each pixel point from its depth value Z;
(1.1.2) calculating the angle between the normal vector and the gravity direction: the pixel position (u, v) on the two-dimensional plane of the indoor scene depth map to be analyzed and the coordinates (x, y, z) in the three-dimensional space of the indoor scene depth map to be analyzed satisfy the conversion relation between the two-dimensional and three-dimensional coordinates, Z·[u, v, 1]^T = K·[x, y, z]^T, where K is the internal reference matrix of the depth camera. The coordinates (x, y, z) corresponding to all pixel points form a three-dimensional point cloud, and the normal vector corresponding to each pixel point is computed as n = norm[(∂P/∂u) × (∂P/∂v)] with P = (x, y, z), where norm[·] denotes normalization of the vector and the symbol × denotes the vector outer product;
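As an illustration of step (1.1.2), the sketch below back-projects a depth map into a point cloud with the pinhole relation and estimates per-pixel normals by finite differences and a cross product. It is a minimal NumPy sketch under assumed conventions; the Kinect-like intrinsic matrix K, the function names and the finite-difference scheme are placeholders and are not taken from the patent.

```python
import numpy as np

def depth_to_pointcloud(depth, K):
    """Back-project a depth map (H, W) into camera coordinates using the
    pinhole relation Z*[u, v, 1]^T = K*[x, y, z]^T."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack([x, y, depth])               # (H, W, 3)

def normals_from_pointcloud(P):
    """Per-pixel normal n = norm[(dP/du) x (dP/dv)] via finite differences."""
    dPdu = np.gradient(P, axis=1)                 # derivative along image columns
    dPdv = np.gradient(P, axis=0)                 # derivative along image rows
    n = np.cross(dPdu, dPdv)
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-12)

# toy usage with a hypothetical Kinect-like intrinsic matrix
K = np.array([[575.0, 0.0, 319.5], [0.0, 575.0, 239.5], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)                  # a flat surface 2 m away
cloud = depth_to_pointcloud(depth, K)
normals = normals_from_pointcloud(cloud)
```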
(1.1.3) for all pixel normals in the point cloud, construct the parallel set N∥ = {n_i : θ(n_i, g) < ρ or θ(n_i, g) > 180° − ρ} and the perpendicular set N⊥ = {n_i : |θ(n_i, g) − 90°| < ρ}, where n_i denotes the normal vector of pixel point i, g denotes the gravity direction, whose initial value may be taken as the vertical axis of the camera coordinate system, θ(n_i, g) is the angle between the normal vector and the gravity direction, and ρ denotes the angular error margin, preferably ρ = 5°;
(1.1.4) take the eigenvector of the matrix N⊥N⊥^T − N∥N∥^T associated with its smallest eigenvalue as the updated gravity vector, and continue to execute step (1.1.3) with the updated gravity vector until the estimate is stable and unchanged, obtaining the target gravity vector; then calculate the angle between the normal vector of each pixel in the point cloud and the target gravity direction, where the point cloud denotes the three-dimensional point cloud formed by the coordinates (x, y, z) in three-dimensional space corresponding to all pixel points;
(1.1.5) calculating the height from the ground: taking the target gravity vector as the reference axis, compute the projection value of each point along the target gravity vector, find the lowest point, and take the difference between the projection value of each other point along the target gravity vector and that of the lowest point as its height from the ground.
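A possible NumPy sketch of the iterative gravity estimation and height computation of steps (1.1.3)-(1.1.5) is given below. The initial gravity along the camera y-axis, the choice of the eigenvector with the smallest eigenvalue, the iteration cap and the sign convention that larger projections mean greater height are assumptions made for illustration where the original formula images are not recoverable.

```python
import numpy as np

def estimate_gravity(normals, g0=np.array([0.0, 1.0, 0.0]), rho_deg=5.0, iters=10):
    """Iteratively refine the gravity direction from per-pixel unit normals
    (steps (1.1.3)-(1.1.4)); returns a unit vector."""
    N = normals.reshape(-1, 3)
    N = N[np.isfinite(N).all(axis=1)]
    g = g0 / np.linalg.norm(g0)
    rho = np.deg2rad(rho_deg)
    for _ in range(iters):
        cos_t = np.clip(N @ g, -1.0, 1.0)
        theta = np.arccos(np.abs(cos_t))              # angle folded into [0, 90 deg]
        N_par = N[theta < rho]                        # normals parallel to gravity (floor/ceiling)
        N_perp = N[np.abs(theta - np.pi / 2) < rho]   # normals perpendicular to gravity (walls)
        M = N_perp.T @ N_perp - N_par.T @ N_par
        w, V = np.linalg.eigh(M)
        g_new = V[:, 0]                               # eigenvector of the smallest eigenvalue
        if np.dot(g_new, g) < 0:                      # keep a consistent sign between iterations
            g_new = -g_new
        if np.allclose(g_new, g, atol=1e-6):
            break
        g = g_new
    return g

def height_from_ground(cloud, g):
    """Step (1.1.5): project every 3D point onto the gravity axis and subtract the
    lowest projection value; assumes g is oriented so that larger projections are higher."""
    proj = cloud.reshape(-1, 3) @ g
    return (proj - proj.min()).reshape(cloud.shape[:2])
```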
(1.2) taking the three-channel map I_E as the input of the trained full convolution network, extracting multi-level Convolutional Neural Network (CNN) features layer by layer, where the convolution feature map obtained at one layer is downsampled and then fed to the next layer to extract a new convolution feature map;
in specific implementation, the structure of the full convolution network and the parameters of each layer of convolution kernel adopted in the embodiment of the present invention are shown in fig. 2.
(1.3) respectively passing the convolution characteristic graphs in different layers through deconvolution layers, upsampling to the same size, then mutually fusing the characteristic graphs in different layers, and sending the fused characteristic graphs into a softmax layer;
specifically, taking fig. 2 as an example, the effect of the deconvolution layer and the convolution layer are exactly opposite, and the two operate in reverse. The characteristic diagram of the pool5 layer is up-sampled to 2 times of the original size through deconvolution, namely, the characteristic diagram has the same size as the pool4 layer, is up-sampled to 2 times of the size through deconvolution after being overlapped, namely, the characteristic diagram has the same size as the pool3 layer, and a final up-sampling result can be obtained after being overlapped and used as the input of the next softmax layer.
(1.4) predicting the category of each pixel point through a softmax layer, outputting the probability that each pixel point belongs to each category, wherein the category corresponding to the maximum probability value is the initial category label of the pixel point.
Wherein, the step (1.4) specifically comprises the following steps:
predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point i belongs to each category, P_i(l) = exp(z_i^l) / Σ_{c=1}^{C} exp(z_i^c), where l ∈ {1, 2, …, C} denotes a category label and z_i^l denotes the output of the last layer of the full convolution network (without considering the softmax layer) for pixel point i and category l; the category corresponding to the maximum probability value max_l P_i(l) is taken as the initial category label of the pixel point.
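For illustration, a minimal NumPy sketch of step (1.4) — a per-pixel softmax over the last-layer scores followed by an argmax that yields the initial labels — might look as follows; the (C, H, W) layout is an assumed convention.

```python
import numpy as np

def initial_labels(z):
    """Step (1.4): per-pixel softmax over last-layer scores z of shape (C, H, W),
    followed by an argmax giving the initial category label of every pixel."""
    z = z - z.max(axis=0, keepdims=True)           # numerical stability
    prob = np.exp(z)
    prob /= prob.sum(axis=0, keepdims=True)        # P_i(l) for every pixel i and label l
    return prob.argmax(axis=0), prob               # labels (H, W), probabilities (C, H, W)

labels, prob = initial_labels(np.random.randn(5, 480, 640))
```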
(2) According to the extracted three-channel characteristic diagram, utilizing a full-connection conditional random field to perfect and optimize the boundary of the segmentation result to obtain category label vectors of all pixels in the indoor scene depth map to be analyzed;
in an optional embodiment, step (2) specifically includes:
(2.1) defining the conditional random field distribution by the conditional probability P(X = x | I) = exp(−E(x | I)) / Z(I), where X is the random vector composed of X_1, X_2, …, X_N, X_i (i = 1, 2, …, N) denotes the initial category label to which the ith pixel belongs, Z(I) = Σ_x exp(−E(x | I)) denotes the sum of the exp(·) terms over all possible labelings x, and E(x | I) denotes the total energy function of the conditional random field;
(2.2) obtaining the total energy function E(x | I) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j), in which the unary term is ψ_u(x_i) = −log P(x_i) and the binary term is ψ_p(x_i, x_j) = [x_i ≠ x_j] · ( w_1 · exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_E^i − I_E^j‖²/(2σ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2σ_γ²)) ), where p_i denotes the position of pixel point i, p_j denotes the position of pixel point j, the hyper-parameters σ_α, σ_β and σ_γ govern the Gaussian kernels and specify the range of adjacent pixels that have an effect on a given pixel, w_1 and w_2 denote the weights of the Gaussian kernel functions in the two different feature spaces, P(x_i) denotes the probability of the class label x_i at pixel point i, I_E^i denotes the value of the ith pixel point of the three-channel map I_E, I_E^j denotes the value of the jth pixel point of the three-channel map I_E, x_i denotes a possible label value of pixel point i, and x_j denotes a possible label value of pixel point j;
In implementations of the present invention, a cross-validation method can be used to determine the optimal combination of the above parameters. First, w_2 and σ_γ are set to 3 by default; then 100 samples are randomly selected from the validation data set to search for w_1, σ_α and σ_β, with the search ranges set to w_1 ∈ (0, 20), σ_α ∈ (0, 100) and σ_β ∈ (0, 20); the optimal values of w_1, σ_α and σ_β are determined from this experiment.
(2.3) solving for the labeling x that maximizes the conditional probability P(X = x | I); this labeling is the optimized segmentation result of the indoor scene depth map I to be analyzed, from which the target class label vectors of all pixels in the indoor scene depth map I to be analyzed are obtained.
Carrying out approximate inference on the probability distribution of the model with an efficient high-dimensional filtering algorithm markedly improves the speed of the optimization.
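To make the energy E(x | I) of steps (2.1)-(2.2) concrete, the brute-force NumPy sketch below evaluates the unary term and the two Gaussian pairwise kernels for a candidate labeling on a tiny crop. The Potts compatibility [x_i ≠ x_j] and the default kernel weights and bandwidths are placeholders; a practical system would instead run mean-field inference with high-dimensional filtering (for example via a dense-CRF library) rather than this O(N²) loop.

```python
import numpy as np

def dense_crf_energy(labels, prob, I_E, w1=5.0, w2=3.0,
                     sigma_alpha=50.0, sigma_beta=5.0, sigma_gamma=3.0):
    """Total energy E(x|I) of the fully connected CRF for a candidate labeling.

    labels -- candidate label x_i per pixel, shape (H, W)
    prob   -- softmax probabilities from the FCN, shape (C, H, W)
    I_E    -- three-channel encoded map, shape (H, W, 3)
    """
    H, W = labels.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    unary = -np.log(prob[labels, v, u] + 1e-12).sum()          # sum_i -log P(x_i)

    pos = np.stack([v, u], axis=-1).reshape(-1, 2).astype(float)   # pixel positions p_i
    feat = I_E.reshape(-1, 3).astype(float)                        # three-channel values
    lab = labels.reshape(-1)
    pairwise = 0.0
    n = H * W
    for i in range(n):            # O(N^2); real systems use high-dimensional filtering
        dp = ((pos[i] - pos) ** 2).sum(axis=1)
        df = ((feat[i] - feat) ** 2).sum(axis=1)
        k = (w1 * np.exp(-dp / (2 * sigma_alpha ** 2) - df / (2 * sigma_beta ** 2))
             + w2 * np.exp(-dp / (2 * sigma_gamma ** 2)))
        pairwise += (k * (lab[i] != lab)).sum()                    # Potts term [x_i != x_j]
    return unary + 0.5 * pairwise                                  # each unordered pair once

# toy usage on an 8x8 crop with 4 categories
H, W, C = 8, 8, 4
prob = np.random.dirichlet(np.ones(C), size=H * W).T.reshape(C, H, W)
I_E = np.random.rand(H, W, 3)
labels = prob.argmax(axis=0)
print(dense_crf_energy(labels, prob, I_E))
```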
(3) And converting the indoor scene depth map to be analyzed into point cloud, analyzing the three-dimensional structure of the target based on the category label vector, and obtaining the space posture of the target.
In the embodiment of the present invention, as shown in fig. 3, the spatial orientation analysis of the bed surface is taken as an example; other types of targets can be handled with simple modifications according to the practical application, and the principle is unchanged. In fig. 3, (1) shows the projection onto the xy plane in space of the depth-map pixels labeled as the object, (2) shows the result after filtering out noise points through morphological operations, (3) shows the 4 corner points found, V_i, i = 1, 2, 3, 4, and (4) shows the three-dimensional bounding box drawn after estimating the height of the object in space. Step (3) specifically comprises the following sub-steps:
and (3.1) projecting the segmentation result on the two-dimensional image plane of the indoor scene depth map to be analyzed to the coordinates of the three-dimensional point cloud. The calculation method of the projection is completely the same as the step (1.1.2);
(3.2) projecting the three-dimensional coordinates of the pixel points with the bed surface labels to an xy plane, performing morphological operation (for example, performing morphological corrosion operation and then performing morphological expansion operation) and filtering noise points;
(3.3) finding the points with the maximum and minimum coordinates in the x direction and the y direction and marking them as V_i, i = 1, 2, 3, 4, representing the 4 corner points of the bed surface; connecting the V_i in order forms a closed geometric figure, and the normal vectors of all points inside the closed figure represent the orientation of the plane in space, expressing the attitude and structural information of the target in three-dimensional space;
(3.4) calculating the distance between the upper plane and the ground plane as the height h of the space occupied by the object, and using V_i and h to draw the cuboid frame of the stereo attitude estimate in the three-dimensional coordinate system. Fig. 4 shows scene parsing results of experiments in a bedroom and a hospital ward, where the first row shows the input depth maps and the second row the corresponding refined parsing results.
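A minimal sketch of steps (3.1)-(3.4) is given below, assuming the point cloud has already been rotated so that its z axis is aligned with the target gravity vector (height above the floor) and that the floor height is known from the parsing result. The function name, the bed_label and floor_height parameters, the opening radius and the use of the median bed-surface height are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np
from scipy import ndimage

def estimate_bed_box(cloud, labels, bed_label, floor_height=0.0):
    """Fit a 3D box to the pixels labelled as the bed surface.

    cloud  -- gravity-aligned point cloud, shape (H, W, 3), z = height
    labels -- parsing result, shape (H, W)
    Returns the 4 corner points V_i on the xy plane and the height h.
    """
    mask = labels == bed_label
    # (3.2) morphological opening (erosion then dilation) removes noise points
    mask = ndimage.binary_dilation(ndimage.binary_erosion(mask, iterations=2),
                                   iterations=2)
    pts = cloud[mask]                              # (M, 3) bed-surface points
    if pts.size == 0:
        return None, 0.0
    xy = pts[:, :2]                                # projection onto the xy plane
    # (3.3) extreme points in x and y as the 4 corner points V_1..V_4
    corners = np.array([xy[xy[:, 0].argmin()], xy[xy[:, 1].argmax()],
                        xy[xy[:, 0].argmax()], xy[xy[:, 1].argmin()]])
    # (3.4) height of the occupied space: bed-surface plane above the floor plane
    h = float(np.median(pts[:, 2]) - floor_height)
    return corners, h
```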
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A depth map-based indoor scene refinement analysis method is characterized by comprising the following steps:
(1) extracting a three-channel characteristic diagram of an indoor scene depth map to be analyzed, taking the extracted three-channel characteristic diagram as the input of a trained full convolution network, and segmenting a target in the indoor scene depth map to be analyzed;
(2) according to the extracted three-channel characteristic diagram, utilizing a full-connection conditional random field to perfect and optimize the boundary of the segmentation result to obtain category label vectors of all pixels in the indoor scene depth map to be analyzed;
in the three-channel characteristic diagram, pixels of each channel image correspond to pixels in the indoor scene depth diagram to be analyzed one by one, and the three channels respectively represent parallax values, ground height and the size of an included angle between a normal vector and the gravity direction;
the method for extracting the three-channel characteristic diagram of the indoor scene depth map to be analyzed comprises the following steps:
(1.1.1) obtaining, from the relation between disparity and depth, d ∝ 1/Z, the disparity value d corresponding to each pixel point from its depth value Z;
(1.1.2) obtaining the normal vector of each pixel point as n = norm[(∂P/∂u) × (∂P/∂v)], where norm[·] denotes normalization of the vector, the symbol × denotes the vector outer product, (u, v) denotes the pixel location on the two-dimensional plane of the indoor scene depth map to be analyzed, and P = (x, y, z) denotes the coordinates in the three-dimensional space of the indoor scene depth map to be analyzed; the conversion relation between the two-dimensional and three-dimensional coordinates is Z·[u, v, 1]^T = K·[x, y, z]^T, where K is the internal reference matrix of the depth camera;
(1.1.3) constructing the parallel set N∥ = {n_i : θ(n_i, g) < ρ or θ(n_i, g) > 180° − ρ} and the perpendicular set N⊥ = {n_i : |θ(n_i, g) − 90°| < ρ}, where n_i denotes the normal vector of a pixel point, g denotes the gravity direction, θ(n_i, g) is the angle between the normal vector and the gravity direction, and ρ denotes the angle error margin;
(1.1.4) taking the eigenvector of the matrix N⊥N⊥^T − N∥N∥^T associated with its smallest eigenvalue as the updated gravity vector, continuing to execute step (1.1.3) with the updated gravity vector until the estimate is stable and unchanged to obtain the target gravity vector, and calculating the angle between the normal vector of each pixel in the point cloud and the target gravity direction, where the point cloud denotes the three-dimensional point cloud formed by the coordinates (x, y, z) in three-dimensional space corresponding to all pixel points;
(1.1.5) calculating a projection value of each point along the target gravity vector by taking the target gravity vector as a reference axis, finding the lowest point, and taking the difference value between the projection value of other points along the target gravity vector and the lowest point as the height from the ground;
(3) and converting the indoor scene depth map to be analyzed into point cloud, analyzing the three-dimensional structure of the target based on the category label vector, and obtaining the space posture of the target.
2. The method according to claim 1, wherein step (1) comprises in particular:
(1.1) encoding the indoor scene depth map I to be analyzed into a three-channel map I_E;
(1.2) taking the three-channel map I_E as the input of the trained full convolution network, extracting multi-level CNN features layer by layer, where the convolution feature map obtained at one layer is downsampled and then fed to the next layer to extract a new convolution feature map;
(1.3) respectively passing the convolution characteristic graphs in different layers through deconvolution layers, upsampling to the same size, then mutually fusing the characteristic graphs in different layers, and sending the fused characteristic graphs into a softmax layer;
(1.4) predicting the category of each pixel point through a softmax layer, outputting the probability that each pixel point belongs to each category, wherein the category corresponding to the maximum probability value is the initial category label of the pixel point.
3. The method according to claim 2, characterized in that step (1.4) comprises in particular:
predicting the category of each pixel point through the softmax layer and outputting the probability that each pixel point i belongs to each category, P_i(l) = exp(z_i^l) / Σ_{c=1}^{C} exp(z_i^c), where l ∈ {1, 2, …, C} denotes a category label and z_i^l denotes the output of the last layer of the full convolution network (without considering the softmax layer) for pixel point i and category l; the category corresponding to the maximum probability value is taken as the initial category label of the pixel point.
4. The method according to claim 2 or 3, characterized in that step (2) comprises in particular:
(2.1) defining the conditional random field distribution by the conditional probability P(X = x | I) = exp(−E(x | I)) / Z(I), where X is the random vector composed of X_1, X_2, …, X_N, X_i (i = 1, 2, …, N) denotes the initial category label to which the ith pixel belongs, Z(I) = Σ_x exp(−E(x | I)) denotes the sum of the exp(·) terms over all possible labelings x, and E(x | I) denotes the total energy function of the conditional random field;
(2.2) obtaining the total energy function E(x | I) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j), in which the unary term is ψ_u(x_i) = −log P(x_i) and the binary term is ψ_p(x_i, x_j) = [x_i ≠ x_j] · ( w_1 · exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_E^i − I_E^j‖²/(2σ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2σ_γ²)) ), where p_i denotes the position of pixel point i, p_j denotes the position of pixel point j, the hyper-parameters σ_α, σ_β and σ_γ govern the Gaussian kernels and specify the range of adjacent pixels that have an effect on a given pixel, w_1 and w_2 denote the weights of the Gaussian kernel functions in the two different feature spaces, P(x_i) denotes the probability of the class label x_i at pixel point i, I_E^i denotes the value of the ith pixel point of the three-channel map I_E, I_E^j denotes the value of the jth pixel point of the three-channel map I_E, x_i denotes a possible label value of pixel point i, and x_j denotes a possible label value of pixel point j;
(2.3) solving for the labeling x that maximizes the conditional probability P(X = x | I); this labeling is the optimized segmentation result of the indoor scene depth map I to be analyzed, from which the target class label vectors of all pixels in the indoor scene depth map I to be analyzed are obtained.
5. The method of claim 1, further comprising:
obtaining the error function of the full convolution network as E(z) = −(1/N) Σ_{i=1}^{N} log( exp(z_i^{y_i}) / Σ_{j=1}^{C} exp(z_i^{j}) ), where z denotes the output of the last layer of the full convolution network, N denotes the total number of pixels in the depth map, y_i ∈ {1, 2, …, C} denotes the manually labeled true category corresponding to pixel point i, C denotes the total number of categories, z_i^{j} denotes the output of the last layer of the full convolution network for pixel point i and category j, and z_i^{y_i} denotes the output of the last layer of the full convolution network for pixel point i and its true category y_i;
the method comprises the steps of training by utilizing a neural network framework Caffe, initializing full convolution network parameters, updating the full convolution network parameters by using a back propagation algorithm, stopping training when an error function value is not changed any more, and obtaining a trained full convolution network, wherein in the training process of the full convolution network, results obtained by shallow neural network layers are fused and output.
CN201710874793.3A 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map Active CN107622244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710874793.3A CN107622244B (en) 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710874793.3A CN107622244B (en) 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map

Publications (2)

Publication Number Publication Date
CN107622244A CN107622244A (en) 2018-01-23
CN107622244B true CN107622244B (en) 2020-08-28

Family

ID=61090539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710874793.3A Active CN107622244B (en) 2017-09-25 2017-09-25 Indoor scene fine analysis method based on depth map

Country Status (1)

Country Link
CN (1) CN107622244B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596102B (en) * 2018-04-26 2022-04-05 北京航空航天大学青岛研究院 RGB-D-based indoor scene object segmentation classifier construction method
CN109034183B (en) * 2018-06-07 2021-05-18 苏州飞搜科技有限公司 Target detection method, device and equipment
CN109118490B (en) * 2018-06-28 2021-02-26 厦门美图之家科技有限公司 Image segmentation network generation method and image segmentation method
CN110378359B (en) * 2018-07-06 2021-11-05 北京京东尚科信息技术有限公司 Image identification method and device
CN109064455B (en) * 2018-07-18 2021-06-25 清华大学深圳研究生院 BI-RADS-based classification method for breast ultrasound image multi-scale fusion
CN110827337B (en) * 2018-08-08 2023-01-24 深圳地平线机器人科技有限公司 Method and device for determining posture of vehicle-mounted camera and electronic equipment
CN110160502B (en) 2018-10-12 2022-04-01 腾讯科技(深圳)有限公司 Map element extraction method, device and server
CN109452914A (en) * 2018-11-01 2019-03-12 北京石头世纪科技有限公司 Intelligent cleaning equipment, cleaning mode selection method, computer storage medium
CN109409376B (en) * 2018-11-05 2020-10-30 昆山紫东智能科技有限公司 Image segmentation method for solid waste object, computer terminal and storage medium
CN109635685B (en) * 2018-11-29 2021-02-12 北京市商汤科技开发有限公司 Target object 3D detection method, device, medium and equipment
CN109658449B (en) * 2018-12-03 2020-07-10 华中科技大学 Indoor scene three-dimensional reconstruction method based on RGB-D image
CN110046747B (en) * 2019-03-19 2021-07-27 华中科技大学 Method and system for planning paths among users of social network facing to image flow
CN109917419B (en) * 2019-04-12 2021-04-13 中山大学 Depth filling dense system and method based on laser radar and image
CN110047047B (en) * 2019-04-17 2023-02-10 广东工业大学 Method for interpreting three-dimensional morphology image information device, apparatus and storage medium
CN110222767B (en) * 2019-06-08 2021-04-06 西安电子科技大学 Three-dimensional point cloud classification method based on nested neural network and grid map
CN110569709A (en) * 2019-07-16 2019-12-13 浙江大学 Scene analysis method based on knowledge reorganization
CN111325135B (en) * 2020-02-17 2022-11-29 天津中科智能识别产业技术研究院有限公司 Novel online real-time pedestrian tracking method based on deep learning feature template matching
CN111507266A (en) * 2020-04-17 2020-08-07 四川长虹电器股份有限公司 Human body detection method and device based on depth image
CN112818756A (en) * 2021-01-13 2021-05-18 上海西井信息科技有限公司 Target detection method, system, device and storage medium
CN113052971B (en) * 2021-04-09 2022-06-10 杭州群核信息技术有限公司 Neural network-based automatic layout design method, device and system for indoor lamps and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105979244A (en) * 2016-05-31 2016-09-28 十二维度(北京)科技有限公司 Method and system used for converting 2D image to 3D image based on deep learning
CN106296728A (en) * 2016-07-27 2017-01-04 昆明理工大学 A kind of Segmentation of Moving Object method in unrestricted scene based on full convolutional network
CN106600571A (en) * 2016-11-07 2017-04-26 中国科学院自动化研究所 Brain tumor automatic segmentation method through fusion of full convolutional neural network and conditional random field
CN106815563A (en) * 2016-12-27 2017-06-09 浙江大学 A kind of crowd's quantitative forecasting technique based on human body apparent structure
CN106934765A (en) * 2017-03-14 2017-07-07 长沙全度影像科技有限公司 Panoramic picture fusion method based on depth convolutional neural networks Yu depth information


Also Published As

Publication number Publication date
CN107622244A (en) 2018-01-23

Similar Documents

Publication Publication Date Title
CN107622244B (en) Indoor scene fine analysis method based on depth map
US11816907B2 (en) Systems and methods for extracting information about objects from scene information
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
He et al. Deep learning based 3D segmentation: A survey
CN108269266B (en) Generating segmented images using Markov random field optimization
Häne et al. Dense semantic 3d reconstruction
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN111798475A (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
TW202034215A (en) Mapping object instances using video data
CN107798725B (en) Android-based two-dimensional house type identification and three-dimensional presentation method
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
Tang et al. BIM generation from 3D point clouds by combining 3D deep learning and improved morphological approach
Liu et al. 3D Point cloud analysis
Qian et al. Learning pairwise inter-plane relations for piecewise planar reconstruction
Pahwa et al. Locating 3D object proposals: A depth-based online approach
US20230334727A1 (en) 2d and 3d floor plan generation
Pintore et al. Automatic modeling of cluttered multi‐room floor plans from panoramic images
Wang et al. Understanding of wheelchair ramp scenes for disabled people with visual impairments
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
Mohan et al. Room layout estimation in indoor environment: a review
Zhang et al. A robust visual odometry based on RGB-D camera in dynamic indoor environments
Pintore et al. Automatic 3D reconstruction of structured indoor environments
CN116030335A (en) Visual positioning method and system based on indoor building framework constraint
Zioulis et al. Monocular spherical depth estimation with explicitly connected weak layout cues
Zhang et al. Geometric and Semantic Modeling from RGB-D Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant