CN113139965A - Indoor real-time three-dimensional semantic segmentation method based on depth map - Google Patents
Indoor real-time three-dimensional semantic segmentation method based on depth map Download PDFInfo
- Publication number
- CN113139965A (application CN202110297418.3A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- time
- real
- model
- voxel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an indoor real-time three-dimensional semantic segmentation method based on a depth map. First, real-time reconstruction is completed with an RGB-D camera using a three-dimensional dense real-time reconstruction technique, yielding a dense TSDF voxel model. Next, plane detection is performed with computer-graphics methods; once the real-time plane detection removes the plane parts, what remains of the model is a set of voxel models of mutually independent three-dimensional objects. Finally, the voxel model of each isolated object is fed into a three-dimensional convolutional neural network, quickly and accurately accomplishing real-time three-dimensional semantic segmentation. The segmentation result preserves object texture information well and can be used directly in AR or VR scenes. The method is more interpretable, greatly reduces the required computation, achieves real-time semantic segmentation, and is better suited to three-dimensional semantic segmentation of large scenes.
Description
Technical Field
The invention belongs to the field of three-dimensional object semantic segmentation in computer vision, and particularly relates to a method that makes full use of computer graphics to improve the efficiency and accuracy of three-dimensional semantic segmentation. It can be applied to three-dimensional dense reconstruction, dynamic reconstruction, AR, and VR.
Background
At present, most semantic segmentation in three-dimensional space is performed either by segmenting two-dimensional images and projecting the results into three-dimensional space, or by running semantic segmentation directly on point clouds acquired by a three-dimensional laser radar. Some methods instead perform three-dimensional reconstruction with an RGB-D camera and feed the entire dense reconstruction result into an end-to-end three-dimensional convolutional neural network for semantic segmentation.
Although simple three-dimensional semantic segmentation can be achieved through two-dimensional segmentation, two-dimensional images inherently lack the distance dimension of three-dimensional space, which limits three-dimensional segmentation accuracy in principle. Such methods are also constrained by the accuracy of the two-dimensional segmentation itself: in dark or highlighted scenes the two-dimensional result often fails, which in turn degrades the three-dimensional segmentation. These methods therefore suffer from low robustness and cannot fully exploit the depth information of three-dimensional objects.
A currently popular approach in the autonomous-driving field is to apply deep learning directly to the three-dimensional point cloud produced by a laser radar to accomplish semantic segmentation. Although this makes further use of depth information, it cannot exploit the color information of objects, so its ability to distinguish objects with similar shapes but clearly different appearances is insufficient. Moreover, the laser radar used for data acquisition is expensive, unsuitable for popularization in the home market, and limited in its application scenarios. Point cloud data also struggles to express the surface texture of three-dimensional objects and cannot meet the segmentation-quality requirements of dense three-dimensional reconstruction, AR, and VR.
In recent years, some methods use existing three-dimensional reconstruction technology to store the final reconstruction result in a three-dimensional data format such as voxels or triangular patches, take the whole three-dimensional model as input, and train a three-dimensional neural network end to end to accomplish semantic segmentation. These methods make further use of the color and depth information of three-dimensional objects and preserve object texture well. However, as the scene model grows, the heavy use of multi-scale three-dimensional convolutional neural networks causes the computation to expand sharply, and the large number of three-dimensional RPNs (region proposal networks) increases it further, so such methods cannot be applied directly to real-time three-dimensional semantic segmentation.
This patent discloses a method for indoor real-time three-dimensional semantic segmentation that directly uses the information acquired by an RGB-D camera.
Content of the scheme
To address the defects of the prior art, the invention provides an indoor real-time three-dimensional semantic segmentation method based on a depth map.
During the real-time reconstruction process, the invention performs plane detection in real time using computer-graphics methods, spatially separates the different three-dimensional objects in an indoor scene through that plane detection, and then performs object recognition and semantic segmentation on the separated three-dimensional objects one by one. Because plane detection does most of the separation work, the use of the three-dimensional convolutional neural network is greatly reduced, lowering the computational load and enabling real-time indoor three-dimensional semantic segmentation.
An indoor real-time three-dimensional semantic segmentation method based on a depth map comprises the following steps:
step (1), a real-time dense reconstruction stage based on an RGB-D camera;
the real-time reconstruction is completed by an RGB-D camera by using a three-dimensional dense real-time reconstruction technology (Fastfusion), a three-dimensional dense TSDF voxel three-dimensional model is obtained, and the real-time three-dimensional model is built for the next real-time three-dimensional semantic segmentation.
Step (2), reconstructing a plane detection stage of the three-dimensional model in real time;
After the three-dimensional real-time reconstruction model (i.e. the TSDF voxel model) is obtained, plane detection is performed by computer-graphics methods. Once the real-time plane detection removes the plane parts, what remains of the model is the voxel model of each independent three-dimensional object; that is, the plane detection mutually isolates the indoor three-dimensional objects.
Step (3), carrying out three-dimensional semantic detection and segmentation on a voxel model of an independent three-dimensional object in a real-time three-dimensional reconstruction scene;
The voxel models of the mutually isolated three-dimensional objects obtained in step (2) are fed into a three-dimensional convolutional neural network, quickly and accurately accomplishing the task of real-time three-dimensional semantic segmentation.
The invention has the following beneficial effects:
(1) The method performs semantic segmentation on a real-time dense three-dimensional model. Unlike the widely used real-time semantic segmentation on point cloud maps or sparse real-time models, its segmentation result preserves the texture information of objects well and can be used directly in AR or VR scenes.
(2) Compared with methods that rely solely on an end-to-end three-dimensional convolutional neural network, the use of computer graphics makes the method more interpretable and far cheaper computationally, enabling real-time semantic segmentation: three-dimensional semantic segmentation runs during camera scanning rather than only after the three-dimensional model is fully built.
(3) The method is better suited to three-dimensional semantic segmentation of large scenes. Because plane detection separates the objects in the three-dimensional model at little computational cost, only object models that are small relative to the whole scene are fed into the three-dimensional convolutional neural network. As the scene grows, only the current frame and the objects near it need to be put through the network, so the computational load does not increase; in the traditional approach of feeding the whole scene model into a three-dimensional convolutional network, the computation grows geometrically with scene size.
Drawings
FIG. 1 is a schematic diagram of indoor real-time three-dimensional semantic segmentation based on a depth map;
FIG. 2 is a schematic diagram of a real-time three-dimensional reconstruction;
FIG. 3 is a schematic diagram of a two-dimensional TSDF model;
FIG. 4 is a schematic diagram of semantic segmentation of a three-dimensional object model.
Detailed Description
The following detailed description of specific embodiments of the invention is provided in connection with the appended drawings.
With reference to fig. 1, the flow of indoor three-dimensional semantic segmentation based on depth maps mainly comprises the following implementation stages:
step (1) real-time dense reconstruction stage based on RGB-D camera:
In this stage, the depth map collected by the RGB-D camera is first converted and preprocessed: the gray values of the depth map are converted into floating-point depth values in meters, the depth map is bilaterally filtered according to need, and the three-dimensional point cloud corresponding to the depth map and the normal vector of each point are then computed.
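The preprocessing steps just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the bilateral-filter parameters, the camera intrinsics, and the `depth_scale` of 0.001 (millimeters to meters, typical for RGB-D sensors) are assumptions.

```python
import numpy as np

def bilateral_filter(depth, radius=2, sigma_s=2.0, sigma_r=0.05):
    """Minimal (slow) bilateral filter for a floating-point depth map."""
    h, w = depth.shape
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    pad = np.pad(depth, radius, mode='edge')
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range kernel: nearby-in-depth pixels get higher weight.
            rng = np.exp(-(patch - depth[i, j]) ** 2 / (2 * sigma_r ** 2))
            wgt = spatial * rng
            out[i, j] = (wgt * patch).sum() / wgt.sum()
    return out

def preprocess_depth(depth_raw, fx, fy, cx, cy, depth_scale=0.001):
    """Raw sensor depth -> metric depth -> per-pixel vertices and normals."""
    # Gray (raw) values -> floating-point meters, then edge-preserving smoothing.
    depth_m = bilateral_filter(depth_raw.astype(np.float32) * depth_scale)
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole back-projection: one 3D vertex per pixel.
    verts = np.stack([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m], axis=-1)
    # Normals from the cross product of the image-space vertex gradients.
    du = np.gradient(verts, axis=1)
    dv = np.gradient(verts, axis=0)
    normals = np.cross(du, dv)
    normals /= np.clip(np.linalg.norm(normals, axis=-1, keepdims=True),
                       1e-12, None)
    return verts, normals
```

For a flat wall at constant depth, the recovered normals all point along the camera's optical axis, as expected.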
Next, the vertices and normal vectors of the current camera input, together with the model vertices and normal vectors obtained by projection from the TSDF voxel space, are introduced, and the change in camera pose between the two frames is computed with the ICP (Iterative Closest Point) algorithm, giving the current camera pose. Only when the pose of the moving camera is obtained accurately can the indoor three-dimensional environment be reconstructed accurately in real time.
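A single linearized point-to-plane ICP update of the kind used in this style of tracking can be sketched as below. This is an illustrative sketch under the small-angle assumption; correspondences are taken as given, whereas a real pipeline obtains them by projective data association between the current frame and the model projection.

```python
import numpy as np

def point_to_plane_icp_step(src_pts, dst_pts, dst_normals):
    """One linearized point-to-plane ICP step.

    Solves for small rotation angles (alpha, beta, gamma) and a translation t
    minimizing sum(((R p + t - q) . n)^2) over corresponding points p -> q
    with destination normals n, then returns the 4x4 incremental pose."""
    A = np.zeros((len(src_pts), 6))
    b = np.zeros(len(src_pts))
    for i, (p, q, n) in enumerate(zip(src_pts, dst_pts, dst_normals)):
        A[i, :3] = np.cross(p, n)   # derivative w.r.t. the rotation angles
        A[i, 3:] = n                # derivative w.r.t. the translation
        b[i] = np.dot(n, q - p)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    alpha, beta, gamma = x[:3]
    # Small-angle rotation matrix.
    R = np.array([[1.0, -gamma, beta],
                  [gamma, 1.0, -alpha],
                  [-beta, alpha, 1.0]])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = x[3:]
    return T
```

Applied to a point set displaced by a pure translation, the step recovers that translation and an identity rotation.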
The current depth map is then fused into the reconstructed TSDF voxel space: using the camera pose obtained by tracking, the scene observed by the current depth camera is merged into the voxel space with a weighted update, so that the historical observation frames and the new observation frame fuse into one three-dimensional model.
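The weighted fusion step can be sketched as a running-average TSDF update in the KinectFusion style. The truncation distance, maximum weight, and array layout below are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

def fuse_depth_into_tsdf(tsdf, weight, origin, voxel_size, depth, K, T_wc,
                         trunc=0.05, max_weight=64.0):
    """Fuse one metric depth frame into a TSDF volume (weighted running average).

    tsdf, weight: (X, Y, Z) contiguous arrays updated in place.
    origin: world position of voxel (0, 0, 0); T_wc: 4x4 camera-to-world pose;
    K: 3x3 camera intrinsics."""
    X, Y, Z = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                             indexing='ij')
    pts_w = origin + voxel_size * np.stack([ii, jj, kk], -1).reshape(-1, 3)
    # Transform voxel centers into the camera frame and project them.
    T_cw = np.linalg.inv(T_wc)
    pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]
    z = pts_c[:, 2]
    valid_z = z > 1e-6
    z_safe = np.where(valid_z, z, 1.0)
    uv = pts_c @ K.T
    u = np.round(uv[:, 0] / z_safe).astype(int)
    v = np.round(uv[:, 1] / z_safe).astype(int)
    H, W = depth.shape
    valid = valid_z & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    # Signed distance along the ray, truncated to [-1, 1].
    sdf = d - z
    update = valid & (d > 0) & (sdf > -trunc)
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)
    # Weighted running average: old observations are never discarded,
    # only averaged with the new frame.
    flat_t, flat_w = tsdf.ravel(), weight.ravel()
    w_old = flat_w[update]
    flat_t[update] = (flat_t[update] * w_old + tsdf_new[update]) / (w_old + 1.0)
    flat_w[update] = np.minimum(w_old + 1.0, max_weight)
    return tsdf, weight
```

Fusing a constant-depth frame leaves near-zero TSDF values at the observed surface, +1 in front of it, and untouched voxels behind the truncation band.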
Finally, the surface in the TSDF voxel space is projected to the current camera pose to support pose estimation at the next frame: a ray-casting method computes, by interpolation, the depth image at the current camera pose, and the corresponding vertices and normal vectors are computed as the input for tracking the camera pose in the next frame.
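The ray-casting step can be sketched per pixel as marching a ray through the volume until the stored signed distance changes sign, then interpolating the zero crossing. This single-ray sketch uses nearest-voxel sampling for brevity; a full pipeline casts one ray per pixel and trilinearly interpolates the TSDF.

```python
import numpy as np

def raycast_tsdf(tsdf, origin, voxel_size, ray_o, ray_d, t_max=1.0):
    """March one ray through a TSDF volume; return the distance t at which the
    signed distance crosses zero (positive -> negative, i.e. the front
    surface), refined by linear interpolation between the straddling samples."""
    step = voxel_size * 0.5
    ray_d = np.asarray(ray_d, float)
    ray_d = ray_d / np.linalg.norm(ray_d)
    shape = np.array(tsdf.shape)

    def sample(p):
        idx = np.floor((p - origin) / voxel_size).astype(int)
        if np.any(idx < 0) or np.any(idx >= shape):
            return None  # outside the volume
        return tsdf[tuple(idx)]

    prev_t = prev_v = None
    t = 0.0
    while t < t_max:
        v = sample(ray_o + t * ray_d)
        if v is not None:
            if prev_v is not None and prev_v > 0 >= v:
                # Linear interpolation of the zero crossing.
                return prev_t + (t - prev_t) * prev_v / (prev_v - v)
            prev_t, prev_v = t, v
        t += step
    return None  # no surface hit along this ray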
Initially, the TSDF model is generated directly from the depth map and color image acquired by the RGB-D camera in the first frame.
In the stage, the depth map is continuously fused into the three-dimensional TSDF model in real time, so that real-time three-dimensional dense reconstruction is realized, and the three-dimensional TSDF model is provided for the subsequent plane detection and object semantic segmentation.
Step (2), reconstructing a plane detection stage of the three-dimensional model in real time:
In the part of the three-dimensional voxel model newly reconstructed in the current frame, an individual voxel point P is chosen at random; it forms several small planes with the surrounding voxels, and the normal vector of each plane is computed. By comparing the angles between these normal vectors, the local curvature is obtained, which determines whether the small planes belong to one large plane. When the angle between the normal vectors of two planes is less than 2 degrees, the two small planes are judged to lie in the same large plane, and the corresponding plane ID is set and updated; the whole newly reconstructed model is traversed in this way starting from the voxel point. When the area of a plane with the same ID exceeds a set threshold, it is judged to be a plane that carries different objects in the three-dimensional model, such as the floor or a table top, rather than part of any specific object. All other voxel points, apart from those composing such carrying planes, are regarded as voxels belonging to objects.
After all voxels have been traversed, the voxels of the carrying planes are deleted. Because the planes that carried the objects are gone, every object in the space now stands alone in three-dimensional space. The voxels are then traversed again, and voxels that are connected together are regarded as belonging to the same object; in other words, all three-dimensional objects in the space are isolated from one another by a computer-graphics method (plane detection).
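The "delete the carrying planes, then group connected voxels" idea can be sketched as follows. This is a simplified stand-in for the normal-angle plane detection described above: here a carrying plane is approximated as any horizontal z-layer whose occupied area exceeds a threshold, and objects are then the 6-connected components of what remains. The threshold value is an assumption.

```python
import numpy as np
from collections import deque

def isolate_objects(occ, plane_area_thresh=50):
    """Remove large horizontal 'carrying' planes from a boolean occupancy
    grid, then label the remaining 6-connected components as objects.
    Returns (labels, number_of_objects)."""
    occ = occ.copy()
    # 1. Plane removal: any z-layer with enough occupied voxels is treated
    #    as a carrying plane (floor, table top) and cleared.
    for k in range(occ.shape[2]):
        if occ[:, :, k].sum() >= plane_area_thresh:
            occ[:, :, k] = False
    # 2. Connected-component labelling via 6-neighbourhood BFS.
    labels = np.zeros(occ.shape, dtype=int)
    next_label = 0
    nbrs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
            (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(occ)):
        if labels[start]:
            continue  # already assigned to an object
        next_label += 1
        labels[start] = next_label
        q = deque([start])
        while q:
            x, y, z = q.popleft()
            for dx, dy, dz in nbrs:
                n = (x + dx, y + dy, z + dz)
                if all(0 <= n[i] < occ.shape[i] for i in range(3)) \
                        and occ[n] and not labels[n]:
                    labels[n] = next_label
                    q.append(n)
    return labels, next_label
```

With a fully occupied floor layer and two separate boxes resting on it, removing the floor leaves exactly two labelled components.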
Step (3), detection and semantic segmentation stage of the three-dimensional object:
Through the plane detection stage of the real-time reconstruction, the voxel model (TSDF) of each separated three-dimensional object is obtained. Each object model is then fed into a three-dimensional convolutional neural network for three-dimensional feature extraction. The network part uses 8 three-dimensional convolutional layers: the first layer has a 2 x 2 x 2 kernel, stride 2, padding 0, with ReLU as the activation function; the second layer a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; the third a 3 x 3 x 3 kernel, stride 1, padding 1, with ReLU; the fourth a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; the fifth a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; the sixth a 3 x 3 x 3 kernel, stride 1, padding 1, with ReLU; the seventh a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; and the eighth a 2 x 2 x 2 kernel, stride 2, padding 0, with ReLU. This yields a feature map (volume) of the three-dimensional model. The three-dimensional features are passed through three fully connected layers to output the object category, and the feature map of the three-dimensional object is passed through a fully convolutional layer to perform mask semantic segmentation of the three-dimensional object.
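The architecture just described can be sketched in PyTorch. The kernel sizes, strides, and paddings follow the text; the channel widths, the input grid resolution (32 cubed), the fully connected layer sizes, and the class count are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class VoxelSegNet(nn.Module):
    """Sketch of the 8-layer 3D CNN with a classification head (three fully
    connected layers) and a fully convolutional mask head."""

    def __init__(self, in_ch=1, width=16, n_classes=20, grid=32):
        super().__init__()
        specs = [  # (kernel, stride, padding) for the 8 conv layers, per the text
            (2, 2, 0), (1, 1, 0), (3, 1, 1), (1, 1, 0),
            (1, 1, 0), (3, 1, 1), (1, 1, 0), (2, 2, 0),
        ]
        layers, ch = [], in_ch
        for k, s, p in specs:
            layers += [nn.Conv3d(ch, width, k, stride=s, padding=p), nn.ReLU()]
            ch = width
        self.backbone = nn.Sequential(*layers)
        feat = grid // 4  # the two stride-2 layers halve the grid twice
        # Classification head: three fully connected layers -> object category.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(width * feat ** 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )
        # Mask head: a 1x1x1 fully convolutional layer over the feature map.
        self.mask_head = nn.Conv3d(width, n_classes, 1)

    def forward(self, vox):
        feat = self.backbone(vox)
        return self.classifier(feat), self.mask_head(feat)
```

For a 32-cubed input, the two stride-2 layers reduce the feature map to 8 cubed, on which the mask head predicts per-voxel class scores.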
Because plane detection has already roughly separated the different objects, which basically meets the segmentation requirements of indoor environments, the three-dimensional neural network does not need to generate large numbers of three-dimensional proposal boxes from anchors of different sizes, as traditional networks do, nor does it need bounding-box selection and refinement over many candidate boxes. This saves a great deal of computation and makes real-time three-dimensional semantic segmentation achievable.
Fig. 3 is a schematic diagram of a two-dimensional TSDF model; the three-dimensional TSDF model used in the method adds one dimension to this two-dimensional representation, with the same data storage and expression.
Fig. 4 is a schematic diagram of semantic segmentation of a three-dimensional object model, illustrating the overall flow of the three-dimensional convolutional neural network through which the individual objects separated by plane detection are passed.
Claims (4)
1. An indoor real-time three-dimensional semantic segmentation method based on a depth map is characterized by comprising the following steps:
step (1), a real-time dense reconstruction stage based on an RGB-D camera;
the real-time reconstruction is completed by an RGB-D camera by using a three-dimensional dense real-time reconstruction technology to obtain a three-dimensional dense TSDF voxel three-dimensional model, and the real-time three-dimensional model is constructed for the next real-time three-dimensional semantic segmentation;
step (2), reconstructing a plane detection stage of the three-dimensional model in real time;
after a three-dimensional real-time reconstruction model is obtained, namely a TSDF voxel three-dimensional model, plane detection is carried out by a computer graphics method, and after a plane part is removed by real-time plane detection, the three-dimensional model is the voxel model of each independent three-dimensional object, namely, the mutual isolation of the indoor three-dimensional objects is finished by the plane detection;
step (3), carrying out three-dimensional semantic detection and segmentation on a voxel model of an independent three-dimensional object in a real-time three-dimensional reconstruction scene;
and (3) putting the voxel models of the three-dimensional objects which are isolated from each other and obtained in the step (2) into a three-dimensional convolution neural network, and further quickly and accurately realizing the task of real-time three-dimensional semantic segmentation.
2. The method for indoor real-time three-dimensional semantic segmentation based on the depth map as claimed in claim 1, wherein the step (1) specifically operates as follows:
in the stage, firstly, a depth map acquired by an RGB-D camera is converted and preprocessed, firstly, the gray value of the depth map is converted into a floating point depth value with the meter as a unit, then, the depth map is subjected to bilateral filtering processing according to different requirements, and then, three-dimensional point clouds corresponding to the depth map and normal vectors of all points are calculated;
secondly, introducing the vertices and normal vectors currently input by the camera and the model vertices and normal vectors obtained by projection from the TSDF voxel space, and calculating the change in camera pose between the two frames with the ICP (Iterative Closest Point) algorithm so as to obtain the current camera pose;
then fusing the current depth map into the reconstructed TSDF voxel space, and fusing the scene observed by the current depth camera into the TSDF voxel space in a weighted manner through the camera position and posture obtained by tracking so that the historical observation frame and the new observation frame can be fused into a three-dimensional model;
finally, projecting the surface in the TSDF voxel space to the current camera position and posture for estimating the camera position and posture at the next frame time, calculating the depth image at the current camera position and posture by interpolation through a ray projection method, and calculating the corresponding vertex and normal vector as the input of the next frame tracking camera posture;
under the initial condition, directly generating a TSDF model through a depth map and a color picture acquired by an obtained RGB-D camera of a first frame;
in the stage, the depth map is continuously fused into the three-dimensional TSDF model in real time, so that real-time three-dimensional dense reconstruction is realized, and the three-dimensional TSDF model is provided for the subsequent plane detection and object semantic segmentation.
3. The method for indoor real-time three-dimensional semantic segmentation based on the depth map as claimed in claim 2, wherein the step (2) specifically operates as follows:
randomly determining an individual voxel point P in the newly reconstructed three-dimensional voxel model of the current frame, forming a plurality of planes from the individual voxel point P and other surrounding voxels, and calculating the respective normal vectors of the planes; obtaining the curvature of the planes by comparing the included angles between the corresponding normal vectors, thereby judging whether the small planes form one large plane; when the included angle between the normal vectors of two planes is less than 2 degrees, judging that the two small planes lie in one large plane, and setting and updating the corresponding plane ID, so that the whole newly reconstructed model is traversed from the voxel point; when the area of a plane with the same ID is larger than a set threshold, judging that the plane is one capable of carrying different objects in the three-dimensional model, such as the floor or a table top, and not part of any specific object in three-dimensional space; all other voxel points, except the voxels forming the planes carrying different objects, are regarded as the constituent voxels of objects;
after traversing all voxels, deleting all voxels of the plane bearing the object, wherein all objects in the space are deleted because the plane bearing the objects is deleted, each object is independently in a three-dimensional space, traversing all voxels at the time, and regarding the voxels which are connected together at the time as the same object, namely, independently isolating all three-dimensional objects in the three-dimensional space by a computer graphics method.
4. The method for indoor real-time three-dimensional semantic segmentation based on the depth map as claimed in claim 3, wherein the step (3) is specifically operated as follows:
by implementing the plane detection stage of reconstructing the three-dimensional model, the voxel model of each separated three-dimensional object is obtained; the different three-dimensional object models are then respectively put into a three-dimensional convolutional neural network for three-dimensional feature extraction, the three-dimensional convolutional neural network part adopting 8 three-dimensional convolutional layers: the convolution kernel of the first layer is 2 x 2 x 2 with stride 2 and padding 0, using ReLU as the activation function; the second layer has a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; the third a 3 x 3 x 3 kernel, stride 1, padding 1, with ReLU; the fourth a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; the fifth a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; the sixth a 3 x 3 x 3 kernel, stride 1, padding 1, with ReLU; the seventh a 1 x 1 x 1 kernel, stride 1, padding 0, with ReLU; and the eighth a 2 x 2 x 2 kernel, stride 2, padding 0, with ReLU; a feature map (volume) of the three-dimensional model is thereby obtained, the three-dimensional features are passed through three fully connected layers, and finally the object category of the three-dimensional object is output; the feature map (volume) of the three-dimensional object from the three-dimensional convolutional neural network is passed through a fully convolutional layer to perform mask semantic segmentation of the three-dimensional object.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110297418.3A CN113139965A (en) | 2021-03-19 | 2021-03-19 | Indoor real-time three-dimensional semantic segmentation method based on depth map |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110297418.3A CN113139965A (en) | 2021-03-19 | 2021-03-19 | Indoor real-time three-dimensional semantic segmentation method based on depth map |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113139965A true CN113139965A (en) | 2021-07-20 |
Family
ID=76811512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110297418.3A Withdrawn CN113139965A (en) | 2021-03-19 | 2021-03-19 | Indoor real-time three-dimensional semantic segmentation method based on depth map |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139965A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926699A (en) * | 2022-07-20 | 2022-08-19 | 深圳大学 | Indoor three-dimensional point cloud semantic classification method, device, medium and terminal |
- 2021-03-19: application CN202110297418.3A filed, published as CN113139965A (not active, withdrawn)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109410307B (en) | Scene point cloud semantic segmentation method | |
Kato et al. | Neural 3d mesh renderer | |
CN106803267B (en) | Kinect-based indoor scene three-dimensional reconstruction method | |
CN106910242B (en) | Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera | |
Lafarge et al. | A hybrid multiview stereo algorithm for modeling urban scenes | |
Ladicky et al. | From point clouds to mesh using regression | |
Guo et al. | Streetsurf: Extending multi-view implicit surface reconstruction to street views | |
US20100284607A1 (en) | Method and system for generating a 3d model from images | |
Zhang et al. | Critical regularizations for neural surface reconstruction in the wild | |
Holzmann et al. | Semantically aware urban 3d reconstruction with plane-based regularization | |
US11893690B2 (en) | 3D reconstruction with smooth maps | |
Poullis et al. | 3d reconstruction of urban areas | |
Liu et al. | High-quality textured 3D shape reconstruction with cascaded fully convolutional networks | |
Han et al. | Urban scene LOD vectorized modeling from photogrammetry meshes | |
CN114782417A (en) | Real-time detection method for digital twin characteristics of fan based on edge enhanced image segmentation | |
Poullis | Large-scale urban reconstruction with tensor clustering and global boundary refinement | |
CN114677479A (en) | Natural landscape multi-view three-dimensional reconstruction method based on deep learning | |
CN117274515A (en) | Visual SLAM method and system based on ORB and NeRF mapping | |
CN115984441A (en) | Method for rapidly reconstructing textured three-dimensional model based on nerve shader | |
Jung et al. | Deformable 3d gaussian splatting for animatable human avatars | |
CN113139965A (en) | Indoor real-time three-dimensional semantic segmentation method based on depth map | |
Maxim et al. | A survey on the current state of the art on deep learning 3D reconstruction | |
Wolf et al. | Surface Reconstruction from Gaussian Splatting via Novel Stereo Views | |
Bhardwaj et al. | SingleSketch2Mesh: generating 3D mesh model from sketch | |
CN113808006B (en) | Method and device for reconstructing three-dimensional grid model based on two-dimensional image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20210720 |