CN111340939A

CN111340939A - Indoor three-dimensional semantic map construction method

Info

Publication number: CN111340939A
Application number: CN202010108398.6A
Authority: CN
Inventors: 赵芳; 曾碧
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2020-06-26
Anticipated expiration: 2040-02-21
Also published as: CN111340939B

Abstract

The invention belongs to the field of three-dimensional reconstruction and scene understanding, and particularly relates to an indoor three-dimensional semantic map construction method. It aims to solve the technical problem of enabling a home service robot to understand the semantic information of its surroundings, so as to facilitate human-machine interaction and the execution of high-level intelligent operations. The method first acquires images of an indoor scene with an RGB-D sensor, then performs target detection or semantic segmentation on the two-dimensional color image to obtain the corresponding semantic information while repairing the depth image for three-dimensional reconstruction, and finally fuses the image semantic information into the three-dimensional map to obtain the indoor three-dimensional semantic map. The technical scheme of the invention can realize rapid and accurate three-dimensional information perception, is of great significance for home service robots, and is also suitable for applications such as indoor augmented reality and three-dimensional interior design.

Description

Indoor three-dimensional semantic map construction method

Technical Field

The invention relates to the field of three-dimensional reconstruction and scene understanding, in particular to an indoor three-dimensional semantic map construction method and system.

Background

Rapid and accurate three-dimensional information perception is a key technology for emerging applications such as home service robots, indoor augmented reality and three-dimensional interior design. In recent years, with the development of depth sensors (e.g., Microsoft Kinect, Intel RealSense, etc.), three-dimensional scanning technology has advanced greatly. The depth maps and color maps collected by these sensors can conveniently be used to generate a dense three-dimensional model of the scanned object, which has in turn promoted research on three-dimensional semantic map construction for indoor scenes. Semantic maps can be widely applied in fields such as robotics, navigation and human-computer interaction. An indoor semantic map typically includes spatial attribute information, such as the floor structure of a building and the distribution of rooms, as well as semantic attribute information, such as the attributes and functions of individual rooms and the classes and locations of objects within a room. The goal of semantic map construction is to label semantic information accurately on the map.

A search of the prior-art literature shows the following. Document 1 (Wu Hao. Robot map construction research [D]. Jinan: Shandong University, 2011.) uses QR code technology, pasting two-dimensional codes as artificial landmarks on large objects in a semi-unknown home environment, to construct a semantic map that describes object-room affiliation relationships. Document 2 (Zhao Cheng. Indoor hierarchical map construction and navigation system based on visual-voice interaction [D]. Xiamen: Xiamen University, 2014.) builds a grid-topology-semantic multi-level map from the bottom up by visually tracking a person and using voice labeling, but relies on manual intervention during map construction. Document 3 (SHENG W, DU J, CHENG Q, et al. Robot semantic mapping through human activity recognition: A wearable sensing and computing approach [J]. Robotics and Autonomous Systems, 2015, 68(C): 47-58.) proposes using wearable devices to recognize human actions and establishes a Bayesian framework based on the relationship between human actions and object types to construct semantic maps, but wearing such devices is somewhat cumbersome in practical applications.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, and provides an indoor three-dimensional semantic map construction method based on an RGB-D sensor, which can construct a map containing room semantic information and room object semantic information so that a robot can execute high-level intelligent operation and better serve human beings.

In order to solve the technical problems, the technical scheme of the invention is as follows:

the invention provides a method for constructing an indoor three-dimensional semantic map, which comprises the following steps:

step S1, data acquisition; collecting color depth RGB-D image information of an indoor environment by using an RGB-D sensor, wherein the color depth RGB-D image information comprises an RGB image and a depth image;

step S2, obtaining semantic information: carrying out target detection or semantic segmentation on the acquired two-dimensional RGB image by using a deep learning algorithm to obtain corresponding semantic information;

step S3, repairing the depth image;

step S4, building a three-dimensional map of the indoor environment: constructing a three-dimensional map by using the repaired indoor environment RGB-D image;

step S5, forming a three-dimensional semantic map: fusing the targets carrying the semantic information obtained in step S2 with the indoor three-dimensional map obtained in step S4 through coordinate position conversion, and labeling the map with the corresponding labels to form the three-dimensional semantic map of the indoor environment.
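The coordinate position conversion in step S5 can be illustrated with a minimal sketch. The patent does not specify the camera model, so the pinhole model, the intrinsic parameters `fx, fy, cx, cy`, the 4 x 4 camera-to-world `pose`, and the helper names below are all assumptions introduced for illustration only. The sketch back-projects every valid-depth pixel inside a detection box into world coordinates and attaches the detected label.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject(u: int, v: int, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Pinhole back-projection of pixel (u, v) with depth in metres (assumed model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(p_cam: Point3, pose: Sequence[Sequence[float]]) -> Point3:
    """Apply a 4x4 row-major camera-to-world transform to a camera-frame point."""
    x, y, z = p_cam
    return tuple(pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
                 for r in range(3))


def label_points(box: Tuple[int, int, int, int], label: str,
                 depth: List[List[float]],
                 intrinsics: Tuple[float, float, float, float],
                 pose: Sequence[Sequence[float]]) -> List[Tuple[Point3, str]]:
    """Return (world_point, label) for each valid-depth pixel inside a 2D box.

    box is (u0, v0, u1, v1) with exclusive upper bounds; depth <= 0 marks
    an invalid pixel (hypothetical convention).
    """
    u0, v0, u1, v1 = box
    fx, fy, cx, cy = intrinsics
    labeled = []
    for v in range(v0, v1):
        for u in range(u0, u1):
            d = depth[v][u]
            if d <= 0:
                continue  # skip invalid depth
            p_cam = backproject(u, v, d, fx, fy, cx, cy)
            labeled.append((to_world(p_cam, pose), label))
    return labeled
```

In a full system the labeled points would then be associated with the corresponding elements of the reconstructed map; that association step is not shown here.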

In a preferred embodiment, the specific steps of step S1 are as follows:

a user holding a device equipped with an RGB-D sensor, or a mobile robot equipped with an RGB-D sensor, scans the indoor environment to obtain continuous RGB-D images.

In a preferred embodiment, the target detection method in step S2 is YOLOv3.

In a preferred embodiment, the step S3 uses a parallelized real-time depth image restoration algorithm based on the CUDA technique.

In a preferred embodiment, the step S4 employs an improved BundleFusion three-dimensional reconstruction algorithm.

The invention provides an indoor three-dimensional semantic map construction system in a second aspect, which comprises a data acquisition module, a three-dimensional dense reconstruction module and a semantic fusion dense reconstruction module;

the data acquisition module acquires color depth RGB-D image information of an indoor environment and divides the color depth RGB-D image information into an RGB image and a depth image; respectively carrying out RGB image target detection/semantic segmentation and CUDA depth image restoration;

the three-dimensional dense reconstruction module performs corresponding relation matching between frames on the input aligned color and depth data streams, then performs global pose optimization, corrects the overall drift, and keeps the model in a continuously dynamic updating state in the whole reconstruction process;

the semantic fusion dense reconstruction module is used for carrying out target detection or semantic segmentation on the image acquired by the camera, integrating the semantic result of the obtained image into three-dimensional dense point cloud reconstruction through a fusion algorithm based on Bayes updating, and realizing the construction of an indoor scene three-dimensional semantic map facing the service robot.

In a preferred scheme, the CUDA depth image restoration method specifically includes the following steps:

the invalid points on each depth image are filtered using equation (1).

In the formula: I_dest is the restored image, I_src is the original image, ω(i, j) is the weight of the filter at point (i, j), Ω_inv is the region of invalid points on the image, Ω_n is the pixel neighborhood with invalid points removed, and ω_p is the normalization term, calculated by formula (2);

the weight ω(i, j) is linearly related to both the spatial domain and the value domain of the pixel: the closer the distance and the smaller the change in pixel value, the higher the correlation. The filter kernel function is defined as follows:

in the formula, the two Gaussian parameters are the standard deviation of the spatial-domain Gaussian function and the standard deviation of the value-domain Gaussian function; x, y are the coordinates of a pixel within the filter window; i, j are the pixel coordinates of the invalid point currently being processed; and I denotes the value of a pixel on the depth image.
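The formulas themselves are not reproduced in this text, so the following is only a minimal sketch of a filter consistent with the symbol definitions above, under stated assumptions: the restored value at an invalid point is taken to be a normalized weighted average over the valid neighbors, and the weight is taken to be the product of a spatial Gaussian and a value-domain Gaussian. Because the depth at an invalid point is itself unusable, the sketch uses the median of the valid neighbors as the reference value for the value-domain term; that choice, the names `sigma_s` and `sigma_r`, and the convention that depth <= 0 marks an invalid point are assumptions, not taken from the patent.

```python
import math
from statistics import median
from typing import List


def fill_invalid(depth: List[List[float]], radius: int = 2,
                 sigma_s: float = 2.0, sigma_r: float = 0.1) -> List[List[float]]:
    """Fill invalid depth pixels from valid neighbors (illustrative sketch).

    depth[i][j] <= 0 marks an invalid point. Valid pixels are copied unchanged.
    """
    h, w = len(depth), len(depth[0])
    restored = [row[:] for row in depth]
    for i in range(h):
        for j in range(w):
            if depth[i][j] > 0:
                continue  # only invalid points are filtered
            # Neighborhood with invalid points removed.
            nbrs = [(x, y, depth[x][y])
                    for x in range(max(0, i - radius), min(h, i + radius + 1))
                    for y in range(max(0, j - radius), min(w, j + radius + 1))
                    if depth[x][y] > 0]
            if not nbrs:
                continue  # nothing to interpolate from; leave invalid
            ref = median(v for _, _, v in nbrs)  # assumed reference value
            num = 0.0
            norm = 0.0  # normalization term (sum of weights)
            for x, y, v in nbrs:
                weight = math.exp(-((x - i) ** 2 + (y - j) ** 2) / (2 * sigma_s ** 2)
                                  - (v - ref) ** 2 / (2 * sigma_r ** 2))
                num += weight * v
                norm += weight
            if norm > 0:
                restored[i][j] = num / norm
    return restored
```

In the CUDA version described in the text, the body of the two outer loops would correspond to the work done by one GPU thread per pixel.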

In a preferred embodiment, said three-dimensional dense reconstruction module,

in terms of matching, a coarse-to-fine parallel global optimization method is used: sparse SIFT feature points are first used for coarse registration, and dense photometric and geometric constraints are then used for finer registration;

in terms of pose optimization, a hierarchical local-to-global optimization method is used, divided into two layers in total: on the lowest layer, every 10 consecutive frames form a chunk, the first frame serves as the key frame, and local pose optimization is performed on all frames in the chunk; on the second layer, only the key frames of all chunks are associated with one another and then globally optimized; the advantage of this method is that the key frames can be separated out, reducing storage and the amount of data to be processed;

in terms of dense scene reconstruction, reconstruction errors caused by accumulated drift or by computation in featureless regions are corrected based on the pose estimation.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

the indoor three-dimensional semantic map construction method provided by the invention scans the surroundings of an indoor scene with an RGB-D sensor to build a three-dimensional scene map, and at the same time uses a deep learning algorithm to obtain the semantic information (walls, doors and windows, the floor, various furniture and the like) that allows a robot to understand its surroundings automatically, thereby constructing the three-dimensional semantic map of the indoor scene. The method is of great significance for enabling a home service robot to genuinely understand its surroundings and achieve intelligent semantic perception, and is a valuable reference for acquiring three-dimensional scene information in emerging applications such as indoor augmented reality and three-dimensional interior design.

Drawings

Fig. 1 is a flowchart of a method for constructing an indoor scene three-dimensional semantic map according to the present invention.

FIG. 2 is a schematic flow chart of an indoor scene three-dimensional semantic map construction system according to the present invention;

FIG. 3 is an original depth image generated by Kinect;

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

step S3, repairing the depth image;

step S5, forming a three-dimensional semantic map: fusing the targets carrying the semantic information obtained in step S2 with the indoor three-dimensional map obtained in step S4 through coordinate position conversion, and labeling the map with the corresponding labels to form the three-dimensional semantic map of the indoor environment.

In a preferred embodiment, the specific steps of step S1 are as follows:

In a preferred embodiment, the target detection method in step S2 is YOLOv3.

Example 2

the invalid points on each depth image are filtered using equation (1).

in the formula:

is the standard deviation of a spatial gaussian function,

In a preferred embodiment, said three-dimensional dense reconstruction module,

Example 3

This embodiment of the invention provides a detailed flow diagram of the indoor scene three-dimensional semantic map construction method. The method mainly comprises three modules: data acquisition, three-dimensional dense reconstruction, and semantic fusion dense reconstruction.

Data acquisition uses an RGB-D sensor. In embodiments of the invention, the indoor environment may be scanned by a user holding a device equipped with a depth sensor (e.g., Kinect v2) or by a mobile robot equipped with a depth sensor, collecting continuous image data. The RGB-D image data includes an RGB color image and a depth image. The depth image directly reflects real three-dimensional information about the environment, as shown in Fig. 3. Owing to limitations of the device itself, the surface materials of objects, occlusion and similar factors, the raw depth image generated by the Kinect contains a large number of invalid regions such as black borders and black holes, which greatly impairs the use of the depth image. This embodiment uses a parallelized real-time depth image restoration algorithm based on CUDA technology to restore the depth image effectively and in real time on the mobile robot.

In this embodiment, the image is partitioned so that the restoration program can be parallelized. The Kinect v2 depth image is 512 × 424 pixels; the 12 rows of pixels at the top and the 12 rows at the bottom of the image are discarded, leaving 512 × 400 pixels. Each block covers 32 × 20 pixels, so the blocks form a grid of 16 × 20. After partitioning, the image is uploaded to the GPU, which executes the restoration program in parallel, and the invalid points on each image are filtered using formula (1).

In the formula: I_dest is the restored image, I_src is the original image, ω(i, j) is the weight of the filter at point (i, j), Ω_inv is the region of invalid points on the image, Ω_n is the pixel neighborhood with invalid points removed, and ω_p is the normalization term, calculated by formula (2).

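In the filter kernel function, the two Gaussian parameters are the standard deviation of the spatial-domain Gaussian function and the standard deviation of the value-domain Gaussian function.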
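The partition arithmetic described above (512 × 424 image, 12 rows dropped at top and bottom, 32 × 20 blocks, 16 × 20 grid) can be checked with a short sketch. The function names and the mapping from block and thread indices to pixel coordinates are hypothetical; they mirror the usual CUDA block/thread indexing convention but are not taken from the patent.

```python
from typing import Tuple


def launch_layout(width: int = 512, height: int = 424, skip_rows: int = 12,
                  block: Tuple[int, int] = (32, 20)) -> Tuple[int, int]:
    """Return the grid size (blocks in x, blocks in y) for the cropped image."""
    usable_rows = height - 2 * skip_rows  # rows left after dropping top and bottom
    block_w, block_h = block
    if width % block_w or usable_rows % block_h:
        raise ValueError("image does not divide evenly into blocks")
    return (width // block_w, usable_rows // block_h)


def pixel_of(block_idx: Tuple[int, int], thread_idx: Tuple[int, int],
             skip_rows: int = 12, block: Tuple[int, int] = (32, 20)) -> Tuple[int, int]:
    """Map (block index, thread index) to the (row, column) in the full image."""
    bx, by = block_idx
    tx, ty = thread_idx
    row = by * block[1] + ty + skip_rows  # offset past the discarded top rows
    col = bx * block[0] + tx
    return (row, col)
```

With the default values, `launch_layout()` gives a 16 × 20 grid, and the last thread of the last block maps to row 411, column 511, which is the last row kept after discarding the bottom 12 rows.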

The three-dimensional dense reconstruction module in Fig. 2 is based mainly on the BundleFusion algorithm. In this embodiment, invalid-point restoration is first applied to the acquired raw depth image, to address the accumulation of keypoint matching errors caused by sensor noise. Frame-to-frame correspondence matching is then performed on the input aligned color and depth data streams, followed by global pose optimization to correct overall drift, and the model is kept in a continuously, dynamically updated state throughout the reconstruction process.

In terms of matching, a coarse-to-fine parallel global optimization method is used. A coarse registration is first performed using sparse SIFT feature points, and a finer registration is then performed using dense photometric and geometric constraints.

In terms of pose optimization, a hierarchical local-to-global optimization method is used. The method is divided into two layers in total. On the lowest layer, every 10 consecutive frames form a chunk, the first frame serves as the key frame, and local pose optimization is performed on all frames in the chunk. On the second layer, only the key frames of all chunks are associated with one another and then globally optimized. The advantage of this method is that the key frames can be separated out, reducing storage and the amount of data to be processed.
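The two-layer grouping described above can be sketched as follows. Only the chunk size of 10 and the use of the first frame as the key frame come from the text; the function name and the representation of frames as integer IDs are assumptions, and the pose optimization itself is not shown.

```python
from typing import List, Sequence, Tuple


def make_chunks(frame_ids: Sequence[int],
                chunk_size: int = 10) -> Tuple[List[List[int]], List[int]]:
    """Group consecutive frames into chunks and pick each chunk's key frame.

    Layer 1 (local): poses of all frames within a chunk are optimized together.
    Layer 2 (global): only the returned key frames take part in optimization.
    """
    chunks = [list(frame_ids[k:k + chunk_size])
              for k in range(0, len(frame_ids), chunk_size)]
    keyframes = [chunk[0] for chunk in chunks]  # first frame of each chunk
    return chunks, keyframes
```

For 25 frames this yields three chunks (the last one partial) and three key frames, so the global layer handles 3 frames rather than 25.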

In terms of dense scene reconstruction, the key point is the symmetric update of the model: when an updated pose estimate for a frame is to be added, the old frame is first removed and then re-integrated at the new pose. On this basis, reconstruction errors caused by accumulated drift or by computation in featureless regions can be corrected whenever a better pose estimate is obtained, so that the model becomes progressively more accurate.
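The symmetric update can be illustrated on a single map cell that stores a weighted running average, a common representation in volumetric fusion. The `Voxel` class and its fields are hypothetical and are not taken from the patent; the point of the sketch is only that removing a sample is the exact inverse of adding it, so a frame can be taken out and re-integrated once its pose estimate improves.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    """One map cell holding a weighted running average (illustrative only)."""
    value: float = 0.0
    weight: float = 0.0

    def integrate(self, sample: float, w: float = 1.0) -> None:
        """Add one observation to the running average."""
        total = self.weight + w
        self.value = (self.value * self.weight + sample * w) / total
        self.weight = total

    def deintegrate(self, sample: float, w: float = 1.0) -> None:
        """Remove a previously integrated observation (inverse of integrate)."""
        total = self.weight - w
        if total <= 1e-9:
            # No observations remain: reset the cell.
            self.value, self.weight = 0.0, 0.0
        else:
            self.value = (self.value * self.weight - sample * w) / total
            self.weight = total


def reintegrate(voxel: Voxel, old_sample: float, new_sample: float) -> None:
    """Replace the contribution a frame made at its old pose with the one at its new pose."""
    voxel.deintegrate(old_sample)
    voxel.integrate(new_sample)
```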

The semantic information in the semantic fusion dense reconstruction module in Fig. 2 can be obtained by a target detection method or a semantic segmentation method. Benefiting from the development of deep learning in recent years, computer vision has achieved many notable results, including target detection and semantic segmentation of images. Among the better-performing target detection algorithms is the YOLO series, which can meet the requirements of real-time detection tasks; YOLOv3 balances speed and accuracy by changing the size of the model structure. Among the better-performing semantic segmentation methods, DeepLabv3 reaches an average precision of 85.2%. These algorithms are used to perform target detection or semantic segmentation on the images acquired by the camera, and the resulting image semantics are integrated into the three-dimensional dense point cloud reconstruction through a fusion algorithm based on Bayesian updating, thereby constructing a three-dimensional semantic map of the indoor scene for the service robot.
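The patent does not give the details of the Bayesian-update fusion, so the following is a minimal sketch of one common form, stated as an assumption: each map point keeps a probability distribution over classes, and every new per-frame prediction for that point is multiplied in and renormalized. The function name and the example class list are hypothetical.

```python
from typing import List, Sequence


def bayes_update(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Recursive Bayesian update of one map point's class distribution.

    prior:      current class probabilities stored in the map.
    likelihood: class probabilities predicted for this point in the new frame.
    """
    if len(prior) != len(likelihood):
        raise ValueError("class vectors must have the same length")
    unnormalized = [p * q for p, q in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        return list(prior)  # contradictory evidence: keep the previous belief
    return [u / total for u in unnormalized]


# Example with hypothetical classes ["chair", "table"]: two consistent
# observations sharpen an initially uniform belief.
belief = [0.5, 0.5]
for observation in ([0.8, 0.2], [0.8, 0.2]):
    belief = bayes_update(belief, observation)
```

Repeated consistent observations from different viewpoints concentrate the probability on one class, which makes the fused label more robust than any single-frame prediction.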

The method scans the surroundings of an indoor scene with an RGB-D sensor to build a three-dimensional scene map, and uses a deep learning algorithm to obtain the semantic information (walls, doors and windows, the floor, various furniture and the like) that allows a robot to understand its surroundings automatically, thereby constructing the three-dimensional semantic map of the indoor scene. The method is of great significance for enabling a home service robot to genuinely understand its surroundings and achieve intelligent semantic perception, and is a valuable reference for acquiring three-dimensional scene information in emerging applications such as indoor augmented reality and three-dimensional interior design.

The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;

it should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. An indoor three-dimensional semantic map construction method is characterized by comprising the following steps:

s1, data acquisition; collecting color depth RGB-D image information of an indoor environment by using an RGB-D sensor, wherein the color depth RGB-D image information comprises an RGB image and a depth image;

s2, obtaining semantic information: carrying out target detection or semantic segmentation on the acquired two-dimensional RGB image by using a deep learning algorithm to obtain corresponding semantic information;

s3, repairing the depth image;

s4, constructing an indoor environment three-dimensional map: constructing a three-dimensional map by using the repaired indoor environment RGB-D image;

s5, forming a three-dimensional semantic map: fusing the targets carrying the semantic information obtained in step S2 with the indoor three-dimensional map obtained in step S4 through coordinate position conversion, and labeling the map with the corresponding labels to form the three-dimensional semantic map of the indoor environment.

2. The indoor three-dimensional semantic map construction method according to claim 1, wherein the specific steps of the step S1 are as follows:

3. The indoor three-dimensional semantic map construction method according to claim 2, wherein the target detection method in the step S2 is YOLOv3.

4. The indoor three-dimensional semantic map construction method according to claim 3, wherein the step S3 uses a parallelized real-time depth image restoration algorithm based on CUDA technology.

5. The indoor three-dimensional semantic map construction method according to claim 3, wherein the step S4 adopts an improved BundleFusion three-dimensional reconstruction algorithm.

6. An indoor three-dimensional semantic map construction system based on the method of claims 1-5, which is characterized by comprising a data acquisition module, a three-dimensional dense reconstruction module and a semantic fusion dense reconstruction module;

7. The indoor three-dimensional semantic map construction system according to claim 6, wherein the CUDA depth image restoration comprises the following specific steps:

the invalid points on each depth image are filtered using equation (1).

in the formula:

is the standard deviation of a spatial gaussian function,

8. The indoor three-dimensional semantic map building system according to claim 6, wherein the three-dimensional dense reconstruction module,

in terms of pose optimization, a hierarchical local-to-global optimization method is used, divided into two layers in total: on the lowest layer, every 10 consecutive frames form a chunk, the first frame serves as the key frame, and local pose optimization is performed on all frames in the chunk; on the second layer, only the key frames of all chunks are associated with one another and then globally optimized;