CN113313824B - Three-dimensional semantic map construction method - Google Patents
Classifications
- G06T17/05 — Three-dimensional [3D] modelling; geographic models
- G06T7/10 — Image analysis; segmentation, edge detection
- G06T7/33 — Image registration using feature-based methods
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration)
- G06T2207/10024 — Color image
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
- G06T2207/20221 — Image fusion; image merging
- Y02D10/00 — Energy efficient computing
Abstract
The invention belongs to the technical field of map construction and specifically relates to a three-dimensional semantic map construction method comprising a registration-image thread, a local-map and global-map thread, a semantic-map thread, a fusion thread, and a global thread, all of which can be processed in parallel on a GPU. Pose solving, semantic segmentation, image fusion, matching, and other computations are performed on the scene images concurrently, so the SLAM system runs closer to real time and builds maps faster. Semantic information is fused into the three-dimensional map, enriching its forms of expression, so that unmanned mobile platforms such as drones and robots can understand scene maps along more dimensions, control their motion trajectories more accurately, and deliver better overall performance.
Description
Technical Field
The invention belongs to the technical field of map construction, and particularly relates to a three-dimensional semantic map construction method.
Background
SLAM (Simultaneous Localization and Mapping) is a technique that acquires three-dimensional information of a scene through sensors, allowing a device to localize itself and recognize its environment from the scene information. SLAM comes in two main variants. Laser SLAM, whose scene-data sensor is a lidar, is generally used in the aerospace and automotive industries; it offers high precision but at high cost. Visual SLAM acquires scene image data through a camera at much lower cost and is widely used for autonomous navigation of unmanned aerial vehicles and robots.
In the fields of unmanned aerial vehicles and robots, traditional maps cannot meet diversified application requirements, and with the development of depth sensors, semantic maps have become widely applied to autonomous navigation of unmanned aerial vehicles and robots. Semantic maps typically include spatial attribute information, such as the planar structure of a building and room distribution, as well as semantic attribute information, such as individual room attributes and functions and object class and location information within a room. The goal of semantic map construction is to mark semantic information precisely on a map.
For example, Chinese patent CN111080659A discloses an environmental semantic perception method based on visual information, comprising: acquiring environmental image information with a Kinect V1.0 camera to obtain registered color and depth images; computing the camera's three-dimensional pose through an ORB_SLAM2 process from the ORB feature points extracted in each frame; performing semantic segmentation on each frame to generate semantic color information; synchronously generating a point cloud from the input depth map and the camera's intrinsic matrix; registering the semantic color information into the point cloud to obtain a local semantic point cloud result; fusing the camera pose information with the local semantic point cloud result to obtain new global semantic point cloud information; and representing the fused global semantic point cloud with an octree map to obtain the final three-dimensional octree semantic map. In practice, however, because ORB feature extraction is used, map construction is not fast enough, which seriously affects the response speed and trajectory-control accuracy of the drone or robot and degrades the user experience.
Disclosure of Invention
The invention provides a three-dimensional semantic map construction method to overcome at least one defect of the prior art: based on GPU multithreaded processing, it improves map construction speed and achieves real-time mapping.
In order to solve the technical problems, the invention adopts the following technical scheme:
the three-dimensional semantic map construction method comprises the following steps:
registering image threads, local map and global map threads, semantic map threads, fusion threads and global threads which can be processed in parallel based on a GPU (graphics processor);
the registration image thread is used for acquiring a color image and a depth image of a scene, and preprocessing the color image and the depth image to obtain a registration image;
the local map and global map thread is used for solving the pose between the multi-frame images according to the registration image and the depth image, and performing three-dimensional reconstruction by using the pose, the color image and the depth image to obtain a local map and a global map; the semantic map thread is used for carrying out semantic segmentation on the plurality of registration images by using a PSP Net (Pyramid Scene Parsing Network, pyramid scene analysis network) to obtain two-dimensional semantic images;
the fusion thread is used for respectively fusing the two-dimensional semantic image with the local map and the global map to obtain the local semantic map and the global semantic map;
the global thread is used for matching the local semantic map and the global semantic map to obtain a global consistency dense semantic map.
With this scheme, GPU-based multithreading performs pose solving, semantic segmentation, image fusion, matching, and other computations on the scene images, making the SLAM system more real-time and map construction faster. At the same time, semantic information is fused into the three-dimensional map, enriching its forms of expression, so that unmanned mobile platforms such as drones and robots can understand scene maps along more dimensions, control motion trajectories more accurately, and perform better.
Preferably, the registering image thread specifically includes:
calibrating a depth camera comprising an infrared camera and a color camera to obtain an internal parameter and an external parameter of the depth camera;
respectively utilizing an infrared camera and a color camera in the depth camera to acquire a depth image and a color image of a multi-frame scene;
and registering the depth image and the color image according to the extrinsic and intrinsic parameters to obtain multi-frame registered images.
Preferably, the local map and global map threads include:
performing block division on the multi-frame registered images to obtain a number of image blocks, with overlapping frames between adjacent image blocks;
performing feature extraction on the registration image in each image block by using a Scale Invariant Feature Transform (SIFT) extraction algorithm based on Graphic Processing Unit (GPU) acceleration to obtain feature points, and selecting a coordinate system of a frame of registration image as a world coordinate system;
matching the feature points with the GMS matching algorithm and filtering out mismatches; storing intra-block associations as the local image association match M_1 and the weaker, cross-block associations as the global image association match M_2; solving the pose between registered frames from M_1 and M_2 with the Gauss-Newton method, and performing loop closure detection on the current pose;
and carrying out three-dimensional dense reconstruction on the scene according to the pose and the depth image and the color image obtained in the registration image thread to obtain a local map and a global map.
Preferably, the magnitude of a feature point in the SIFT extraction algorithm is expressed as:

A(x, y) = sqrt[ (I(x+1, y) − I(x−1, y))² + (I(x, y+1) − I(x, y−1))² ]

and its direction as:

θ(x, y) = arctan[ (I(x, y+1) − I(x, y−1)) / (I(x+1, y) − I(x−1, y)) ]

wherein A(x, y) is the magnitude of the feature point, x and y are its pixel position in the image, I(x+1, y), I(x−1, y), I(x, y+1), and I(x, y−1) are its neighbouring pixels in the Gaussian difference pyramid, and θ(x, y) is the orientation of the feature point.
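As a concrete illustration, here is a minimal NumPy sketch of the magnitude and orientation computation above. The function name and the toy patch are illustrative, not from the patent:

```python
import numpy as np

def keypoint_magnitude_orientation(img, x, y):
    """Gradient magnitude A(x, y) and orientation theta(x, y) of a keypoint,
    computed from its four neighbours in one pyramid level."""
    dx = float(img[y, x + 1]) - float(img[y, x - 1])
    dy = float(img[y + 1, x]) - float(img[y - 1, x])
    magnitude = np.hypot(dx, dy)      # A(x, y) = sqrt(dx^2 + dy^2)
    orientation = np.arctan2(dy, dx)  # theta(x, y), in radians
    return magnitude, orientation

# Toy 3x3 patch standing in for one level of the Gaussian difference pyramid
patch = np.array([[0, 1, 0],
                  [0, 2, 4],
                  [0, 3, 0]], dtype=np.float32)
A, theta = keypoint_magnitude_orientation(patch, 1, 1)
```

In a real extractor this is evaluated at every detected extremum, and the orientations of the neighbourhood are accumulated into a histogram to assign the dominant direction.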
Preferably, the probability model in the GMS matching algorithm is:

P = (mean_true − mean_false) / (std_true + std_false)

and the evaluation score of a feature point pair is:

S_ij = Σ_{k=1}^{K} |X_{i_k j_k}|

wherein P is the separation between correct and incorrect matching, p_true denotes a correct match and p_false an incorrect match, mean_true and mean_false are the means of the correct-match and incorrect-match distributions, and std_true and std_false their respective variances; |F_1i| is the number of features in the grid cell of the feature match; i and j are the matched regions in the two frames, k is the current grid-cell index, K is the total number of grid cells, and |X_{i_k j_k}| is the number of matches between the cell pair {i_k, j_k}.
Preferably, the registering the depth image and the color image according to the external and internal parameters specifically includes:
converting coordinates of all pixel points in the depth image to an infrared camera coordinate system;
converting coordinates of all points under the infrared camera coordinate system to a world coordinate system;
converting the coordinates of all points in the world coordinate system to a color camera coordinate system;
mapping the coordinates of all points under the coordinate system of the color camera to a color plane of the normalized plane;
and obtaining a transformation matrix between the infrared camera and the color camera.
Preferably, the semantic map thread specifically includes:
extracting features of the registration images to obtain feature layers;
pooling the feature layers to generate pyramid pooling features;
flattening and upsampling pyramid pooling features;
and performing CONCAT (merging) with the feature layer, and obtaining a local semantic map and a global semantic map through a convolutional neural network.
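The pooling, flattening/upsampling, and CONCAT steps above can be sketched in NumPy. This is a simplified illustration under assumptions: real PSPNet applies a convolution to each pooled branch before upsampling, which is omitted here, and all function names are hypothetical:

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map down to (C, bins, bins)."""
    c, h, w = feat.shape
    out = np.zeros((c, bins, bins), dtype=feat.dtype)
    ys = np.linspace(0, h, bins + 1).astype(int)
    xs = np.linspace(0, w, bins + 1).astype(int)
    for i in range(bins):
        for j in range(bins):
            out[:, i, j] = feat[:, ys[i]:ys[i+1], xs[j]:xs[j+1]].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, h, w):
    """Nearest-neighbour upsampling of (C, h0, w0) back to (C, h, w)."""
    c, h0, w0 = feat.shape
    yi = np.arange(h) * h0 // h
    xi = np.arange(w) * w0 // w
    return feat[:, yi][:, :, xi]

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
    """Pool at several scales, upsample back, and CONCAT with the input."""
    c, h, w = feat.shape
    branches = [feat]
    for b in bin_sizes:
        branches.append(upsample_nearest(adaptive_avg_pool(feat, b), h, w))
    return np.concatenate(branches, axis=0)  # 5 * C channels here

feat = np.random.rand(4, 12, 12).astype(np.float32)
ppm = pyramid_pooling(feat)
```

The bin sizes match the 1x1, 2x2, 3x3, and 6x6 pooling kernels mentioned later in the embodiment; the concatenated tensor would then be fed to the convolutional classification head.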
Preferably, the specific formula for fusing the local map and the global map with the TSDF model in the local map and global map thread is:

D(v) = (D(v)·W(v) + d_i(v)·w_i(v)) / (W(v) + w_i(v)),  W(v) = W(v) + w_i(v)

and the de-fusion construction is:

D′(v) = (D(v)·W(v) − d_i(v)·w_i(v)) / (W(v) − w_i(v)),  W′(v) = W(v) − w_i(v)

wherein D(v) is the signed distance value of the voxel, W(v) is the voxel weight, d_i(v) and w_i(v) are respectively the projection distance from the voxel to the i-th frame depth image and the integration weight, and D′(v) is the updated voxel signed distance value.
Preferably, the fusion model adopted in the fusion thread is:

C_i(o) = (C_{i−1}(o)·W_{i−1}(o) + c_i(p)·w_i(p)) / (W_{i−1}(o) + w_i(p)),  W_i(o) = W_{i−1}(o) + w_i(p)

wherein C_{i−1}(o) and W_{i−1}(o) are the fused class confidence and reliability weight of the voxel before frame i is fused, and c_i(p) and w_i(p) are the class confidence and reliability weight of pixel p in the i-th frame image.
Preferably, the specific formula for matching the local semantic map and the global semantic map in the global thread is:

Map(v) = (W_local·Map(v, C_{i−1}(o))_local + W_global·Map(v, C_{i−1}(o))_global) / (W_local + W_global)

with de-fusion performed as the inverse of this weighted combination, and the accuracy calculated as:

Accuracy = (k_1·S_1 + k_2·S_2) / S

wherein W_local and W_global are the weight values of the local and global semantic maps, Map(v, C_{i−1}(o))_local and Map(v, C_{i−1}(o))_global are the local and global semantic maps respectively; S_1 and S_2 are the surface areas of the three-dimensional semantic models measured with the MeshLab tool, S is the surface area of the three-dimensional reconstruction model measured with the same tool, and k_1 and k_2 are the weight coefficients of S_1 and S_2.
Compared with the prior art, the invention has the following beneficial effects:
Compared with traditional ORB feature extraction, GPU-accelerated SIFT feature extraction is faster and more robust. In addition, GPU-based multithreaded processing performs semantic segmentation, pose calculation, and image fusion on the registered images simultaneously, and fused images can be released one by one so that the GPU has enough memory to fuse and render images in real time. This achieves real-time map construction and fuses three-dimensional imagery with semantic information, improving the environmental understanding of unmanned mobile platforms such as drones and robots, making their motion more accurate and flexible, and improving product performance.
Drawings
FIG. 1 is a schematic block diagram of a process of a local map and global map thread of a three-dimensional semantic map construction method according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a block division in a local map and global map thread of a three-dimensional semantic map construction method according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a semantic map thread in a three-dimensional semantic map construction method according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged, or reduced and do not represent actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted. The positional relationships described in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components. In the description of the present invention, it should be understood that orientations or positional relationships indicated by terms such as "upper", "lower", "left", "right", "long", and "short" are based on the orientations or positional relationships shown in the drawings, merely for convenience of description and simplification; they do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. The terms describing positional relationships in the drawings are therefore exemplary only and are not to be construed as limitations of this patent; those of ordinary skill in the art can understand the specific meaning of the terms above according to specific circumstances.
The technical scheme of the invention is further specifically described by the following specific embodiments with reference to the accompanying drawings:
examples:
as shown in fig. 1, a three-dimensional semantic map construction method includes:
registering image threads, local map and global map threads, semantic map threads, fusion threads and global threads which can be processed in parallel based on the GPU;
the registration image thread is used for acquiring a color image and a depth image of a scene, and preprocessing the color image and the depth image to obtain a registered image; the registered image is a color image;
the local map and global map thread is used for solving the pose between the multi-frame images according to the registration image and the depth image, and performing three-dimensional reconstruction by using the pose, the color image and the depth image to obtain a local map and a global map;
the semantic map thread is used for performing semantic segmentation on the registered images with the PSPNet to obtain two-dimensional semantic images;
the fusion thread is used for respectively fusing the two-dimensional semantic map with the local map and the global map to obtain the local semantic map and the global semantic map;
the global thread is used for matching the local semantic map and the global semantic map to obtain a global consistency dense semantic map.
The registration image thread in this embodiment specifically includes:
calibrating a depth camera comprising an infrared camera and a color camera to obtain its intrinsic and extrinsic parameters; the depth camera may be a Kinect V2: a checkerboard is photographed with the Kinect V2 and the camera is calibrated to obtain the intrinsic matrix

K = [ f_x 0 c_x ; 0 f_y c_y ; 0 0 1 ]

and the extrinsic matrix

T = [ R t ; 0 1 ]

wherein R is a 3×3 rotation matrix, t is a 3×1 translation vector, f_x and f_y are the normalized focal lengths along the image x- and y-axes, and c_x and c_y are the coordinates of the image center point;
respectively utilizing an infrared camera and a color camera in the depth camera to acquire a depth image and a color image of a multi-frame scene;
and registering the depth image and the color image according to the extrinsic and intrinsic parameters to obtain multi-frame registered images.
The local map and global map threads in this embodiment include:
Taking fifteen frames as a unit, the multi-frame registered images are divided into blocks to obtain a number of image blocks, with a three-frame overlap between adjacent blocks; of course, this block size and inter-block overlap are only one reference embodiment and should not be construed as limiting the scheme.
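The block division just described can be sketched as follows; the function name is illustrative:

```python
def split_into_blocks(frames, block_size=15, overlap=3):
    """Split a frame sequence into blocks of `block_size` frames, where each
    block shares `overlap` frames with its predecessor (the embodiment's
    example: fifteen-frame blocks with a three-frame overlap)."""
    step = block_size - overlap
    blocks = []
    start = 0
    while start < len(frames):
        blocks.append(frames[start:start + block_size])
        if start + block_size >= len(frames):
            break
        start += step
    return blocks

# 30 frames -> three blocks: [0..14], [12..26], [24..29]
blocks = split_into_blocks(list(range(30)), block_size=15, overlap=3)
```

The overlapping frames are what link adjacent blocks, so intra-block matches feed the local association match and cross-block matches feed the global one.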
Performing feature extraction on the registration image in each image block by using a Scale Invariant Feature Transform (SIFT) extraction algorithm based on Graphic Processing Unit (GPU) acceleration to obtain feature points, and selecting a coordinate system of a frame of registration image as a world coordinate system;
matching the feature points with the GMS matching algorithm and filtering out mismatches; storing intra-block associations as the local image association match M_1 and the weaker, cross-block associations as the global image association match M_2; solving the pose between registered frames from M_1 and M_2 with the Gauss-Newton method, and performing loop closure detection on the current pose, wherein the pose comprises a local pose and a global pose;
in addition, solving the pose by the gauss newton method in this embodiment specifically includes:
constructing a nonlinear optimization objective function:
X * =argminE align (X),
the specific calculation process is as follows:
R=3N corr +|E|(|D i |+|I i |),
F(X k )=F(X k-1 )+J F (X k-1 )ΔX,
J F (X k-1 ) T J F (X k-1 )ΔX * =-J F (X k-1 ) T F(X k-1 ),
wherein X is the pose of the camera, X * Is the optimal solution of the pose X, E align (X) is the alignment objective function of the coefficient features and the dense luminosity and set constraint, r i (X) residual term for pose representation, N corr Is the total corresponding relation quantity in the image block, |D i I and I i The i is the size of the i frame depth image and the color image after downsampling, which are 64X53 = 3392, the i E is the number of frame pair sets, and the E is a frame pair set comprising a frame pair (i, j), i frame and j frame, F (X) k-1 ) In the form of vector of the residual error term of the pose of the previous frame of image, J F For the jacobian matrix corresponding to the vector, Δx=x k -X k-1 Delta X is the difference between the pose of the current frame and the pose of the previous frame * Deviation value for pose optimal solution, (X) k-1 ) T Is a matrix (X) k-1 ) Is a transposed matrix of (a);
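A minimal sketch of the Gauss-Newton iteration above, applied to a toy least-squares problem rather than the patent's pose-alignment objective (all names illustrative):

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Minimize ||F(x)||^2 by repeatedly solving the normal equations
    J^T J dx = -J^T F (the linear system in the text) and updating x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        F = residual(x)
        J = jacobian(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ F)
        x = x + dx
        if np.linalg.norm(dx) < 1e-12:
            break
    return x

# Toy alignment problem: fit x = (a, b) so that a*t_i + b matches y_i.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2t + 1
res = lambda x: x[0] * t + x[1] - y
jac = lambda x: np.stack([t, np.ones_like(t)], axis=1)
x_star = gauss_newton(res, jac, x0=[0.0, 0.0])
```

In the SLAM setting, x would be the stacked pose parameters, F the residual vector built from the feature correspondences in M_1 and M_2, and each iteration would re-linearize the alignment energy at the current pose.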
and then, carrying out three-dimensional dense reconstruction on the scene according to the pose and the depth image and the color image obtained in the registration image thread to obtain a local map and a global map.
The magnitude of a feature point in the SIFT extraction algorithm in this embodiment is expressed as:

A(x, y) = sqrt[ (I(x+1, y) − I(x−1, y))² + (I(x, y+1) − I(x, y−1))² ]

and its direction as:

θ(x, y) = arctan[ (I(x, y+1) − I(x, y−1)) / (I(x+1, y) − I(x−1, y)) ]

wherein A(x, y) is the magnitude of the feature point, x and y are its pixel position in the image, I(x+1, y), I(x−1, y), I(x, y+1), and I(x, y−1) are its neighbouring pixels in the Gaussian difference pyramid, and θ(x, y) is the orientation of the feature point.
The probability model in the GMS matching algorithm in this embodiment is:

P = (mean_true − mean_false) / (std_true + std_false)

and the evaluation score of a feature point pair is:

S_ij = Σ_{k=1}^{K} |X_{i_k j_k}|

wherein P is the separation between correct and incorrect matching, p_true denotes a correct match and p_false an incorrect match, mean_true and mean_false are the means of the correct-match and incorrect-match distributions, and std_true and std_false their respective variances; |F_1i| is the number of features in the grid cell of the feature match; i and j are the matched regions in the two frames, k is the current grid-cell index, K is the total number of grid cells, and |X_{i_k j_k}| is the number of matches between the cell pair {i_k, j_k}.
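The grid-based match counting that underlies the GMS score can be illustrated as follows. This is a simplified sketch: full GMS also aggregates the 3x3 cell neighbourhood and thresholds the score against the feature density, which is omitted here, and all names and example values are hypothetical:

```python
from collections import defaultdict

def gms_cell_scores(matches, img_shape, grid=(20, 20)):
    """Count feature matches per grid-cell pair {i_k, j_k}: true matches
    concentrate in a few cell pairs, while false matches scatter, so a
    cell pair's count serves as its support score |X_ik jk|."""
    gh, gw = grid
    h, w = img_shape
    def cell(pt):  # pt = (x, y) pixel coordinate
        return (min(int(pt[1] * gh / h), gh - 1),
                min(int(pt[0] * gw / w), gw - 1))
    scores = defaultdict(int)
    for p1, p2 in matches:
        scores[(cell(p1), cell(p2))] += 1
    return scores

# Tiny example: three mutually consistent matches plus one stray match.
matches = [((10, 10), (12, 11)), ((11, 12), (13, 13)),
           ((12, 11), (14, 12)), ((300, 300), (15, 20))]
scores = gms_cell_scores(matches, img_shape=(480, 640))
best = max(scores.values())
```

Matches falling in a cell pair with high support are kept; the stray match lands in a cell pair of its own with support 1 and would be filtered out.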
Registering the depth image and the color image according to the extrinsic and intrinsic parameters in this embodiment specifically includes:

converting the coordinates of all pixel points in the depth image into the infrared camera coordinate system:

P_IR_camera = Z_c · K_IR⁻¹ · p

wherein Z_c is the depth value, i.e. the distance from the object in space to the depth camera, K_IR⁻¹ is the inverse of the infrared camera's intrinsic matrix, p is the pixel coordinate of a point in the depth image, and P_IR_camera is that point's coordinate in the infrared camera coordinate system;

converting the coordinates of all points in the infrared camera coordinate system into the world coordinate system:

P_w = T_wIR⁻¹ · P_IR_camera

wherein T_wIR⁻¹ is the inverse of the transformation matrix from the world coordinate system to the infrared camera coordinate system, and P_w is the world coordinate of the point in the depth image;

converting the coordinates of all points in the world coordinate system into the color camera coordinate system:

P_Color_camera = T_wColor_camera · P_w

wherein T_wColor_camera is the transformation matrix from the world coordinate system to the color camera coordinate system, and P_Color_camera is the color-camera coordinate corresponding to the point in the depth image;

mapping the coordinates of all points in the color camera coordinate system onto the color plane of the normalized plane Z_c = 1:

p_color = K_Color_camera · (P_Color_camera / Z_c)

wherein K_Color_camera is the intrinsic matrix of the color camera, P_Color_camera / Z_c is the normalized mapping plane, and p_color is the pixel point in the final registered image;

letting Z = 1, the pixels of the registered image and the pixels of the depth image then satisfy:

p_color = K_Color_camera · T_IR2Color · Z_c · K_IR⁻¹ · p

Removing the intrinsics of the two cameras, K_Color_camera and K_IR⁻¹, finally yields the transformation matrix between the infrared camera and the color camera:

T_IR2Color = T_wColor_camera · T_wIR⁻¹

which, after expansion and simplification, gives:

R_IR2Color = R_w2Color · R_w2IR⁻¹,  t_IR2Color = t_w2Color − R_w2Color · R_w2IR⁻¹ · t_w2IR

wherein T_wColor_camera is the transformation matrix from the world coordinate system to the color camera coordinate system and T_wIR⁻¹ its inverse counterpart for the infrared camera; R_w2Color is the rotation matrix from world to color-camera coordinates and R_w2IR⁻¹ the inverse of the rotation matrix from world to infrared-camera coordinates; t_w2Color and t_w2IR are the translation vectors from world coordinates to color-camera and infrared-camera coordinates respectively; and T_IR2Color is the 4×4 transformation matrix from the infrared camera to the color camera.
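The full pixel-to-pixel registration chain above can be sketched in NumPy. The intrinsic values and the 2.5 cm baseline below are hypothetical placeholders, not calibration results from the patent:

```python
import numpy as np

def register_depth_to_color(u, v, z, K_ir, K_color, T_ir2color):
    """Map one depth pixel (u, v) with depth z through the chain in the
    text: pixel -> IR camera coords -> color camera coords -> color pixel."""
    # Back-project: P_IR_camera = z * K_IR^{-1} [u, v, 1]^T
    p_ir = z * np.linalg.inv(K_ir) @ np.array([u, v, 1.0])
    # Apply the 4x4 IR-to-color transform T_IR2Color = [R | t]
    p_color = T_ir2color[:3, :3] @ p_ir + T_ir2color[:3, 3]
    # Project onto the normalized plane (Z_c = 1), then apply K_color
    uv1 = K_color @ (p_color / p_color[2])
    return uv1[:2]

# Hypothetical intrinsics and a pure-translation extrinsic (2.5 cm baseline)
K_ir = np.array([[580.0, 0, 320.0], [0, 580.0, 240.0], [0, 0, 1]])
K_color = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1]])
T = np.eye(4)
T[0, 3] = 0.025
uv = register_depth_to_color(320, 240, 1.0, K_ir, K_color, T)
```

Running this over every depth pixel, with T_IR2Color obtained from the calibrated extrinsics as T_wColor_camera · T_wIR⁻¹, produces the registered image.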
The semantic map thread in this embodiment specifically includes:
extracting features of the registration images to obtain feature layers;
pooling the feature layers to generate pyramid pooling features; the sizes of the pooling cores are 1x1,2x2,3x3 and 6x6 respectively;
flattening and upsampling pyramid pooling features;
performing CONCAT with the feature layer, and obtaining a local semantic map and a global semantic map through a convolutional neural network;
the network is trained by adopting a VOC2007 data set containing 21 kinds of information, the PSP Net backbone network is MobileNet V2, the number of training epochs (training generation number) is 140, and the ratio of the training set to the verification set is 9: and 1, performing freezing training on the first 50 epochs, namely freezing a part of training weights to accelerate the training speed. The BacthSize was set to 4, thawing was started when epoch=51, and all weights were trained. It should be noted that, the parameters used in this embodiment are all reference embodiments, and are not to be construed as limiting the present scheme, and in the specific implementation process, the parameters may be changed according to the device performance, training accuracy, and the like.
The local map and global map thread in this embodiment fuses depth frames into the local map and the global map with the TSDF model using the update:

D'(v) = (D(v)·W(v) + d_i(v)·w_i(v)) / (W(v) + w_i(v)),  W'(v) = W(v) + w_i(v)

and removes a previously integrated frame (de-fusion) with:

D'(v) = (D(v)·W(v) - d_i(v)·w_i(v)) / (W(v) - w_i(v)),  W'(v) = W(v) - w_i(v)

wherein D(v) is the signed distance value of the voxel, W(v) is the voxel weight value, d_i(v) and w_i(v) are respectively the projective distance from the voxel to the i-th frame depth image and the integration weight, and D'(v) is the updated voxel signed distance value.
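A minimal sketch of the TSDF integration and de-integration updates used by the local/global map thread (weighted running average over frames; illustrative, not the patented implementation):

```python
import numpy as np

def tsdf_integrate(D, W, d_i, w_i):
    """Fold frame i's projective distance d_i (weight w_i) into the running
    signed-distance D and weight W of each voxel."""
    D_new = (D * W + d_i * w_i) / (W + w_i)
    W_new = W + w_i
    return D_new, W_new

def tsdf_deintegrate(D, W, d_i, w_i):
    """Remove a previously integrated frame (de-fusion): exact inverse of
    tsdf_integrate, so a frame can be re-integrated after pose refinement."""
    W_new = W - w_i
    D_new = (D * W - d_i * w_i) / W_new
    return D_new, W_new
```

De-integration being the exact inverse of integration is what lets the pipeline correct a voxel block when loop closure revises a camera pose.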
The fusion model adopted in the fusion thread in this embodiment is:

C_i(o) = (C_{i-1}(o)·W_{i-1}(o) + c_i^p·w_i^p) / (W_{i-1}(o) + w_i^p),  W_i(o) = W_{i-1}(o) + w_i^p

wherein C_{i-1}(o) and W_{i-1}(o) are respectively the fused class confidence and reliability weight of the voxel after frame i-1, and c_i^p and w_i^p are the class confidence and reliability weight of the pixel p in the i-th frame image.
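The per-voxel class-confidence fusion can be sketched the same way (assumed weighted-average form, mirroring the TSDF update; argument names are illustrative):

```python
def fuse_class_confidence(c_prev, w_prev, c_pix, w_pix):
    """Fuse the class confidence c_pix (reliability weight w_pix) of a pixel in
    the current frame into the voxel's running class confidence c_prev
    (weight w_prev), as a weighted running average."""
    c_new = (c_prev * w_prev + c_pix * w_pix) / (w_prev + w_pix)
    w_new = w_prev + w_pix
    return c_new, w_new
```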
In order to refine the details of the global semantic map with the local semantic map, the global thread in this embodiment matches the local semantic map and the global semantic map by de-fusing the overlapping voxels and re-fusing them with a weighted blend:

Map(v, C(o)) = (W_local · Map(v, C_{i-1}(o))_local + W_global · Map(v, C_{i-1}(o))_global) / (W_local + W_global)

and the accuracy is calculated as:

Accuracy = (k_1·S_1 + k_2·S_2) / S

wherein W_local and W_global are the weight values of the local semantic map and the global semantic map, Map(v, C_{i-1}(o))_local and Map(v, C_{i-1}(o))_global are respectively the fused local semantic map and global semantic map; S_1 and S_2 are the surface areas of the three-dimensional semantic models measured using the MeshLab tool, S is the surface area of the three-dimensional reconstruction model measured using the MeshLab tool, and k_1 and k_2 are the weight coefficients of S_1 and S_2 respectively.
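An illustrative sketch, under the assumption that the matching is a weight-blended average and the accuracy an area-weighted ratio (both forms are inferred from the variable definitions, not stated verbatim in the source):

```python
def match_semantic_maps(map_local, map_global, w_local, w_global):
    """Weighted blend of local and global semantic map values (assumed form)."""
    return (w_local * map_local + w_global * map_global) / (w_local + w_global)

def accuracy(s1, s2, s, k1, k2):
    """Accuracy as a weighted ratio of semantic-model surface areas s1, s2 to
    the reconstruction-model surface area s (assumed form; areas would come
    from a mesh tool such as MeshLab)."""
    return (k1 * s1 + k2 * s2) / s
```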
The present invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
While embodiments of the present invention have been shown and described above, it will be understood that they are illustrative and not to be construed as limiting the invention. Variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art without departing from the scope of the invention, and it is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention is intended to fall within the protection scope of the claims.
Claims (6)
1. The three-dimensional semantic map construction method is characterized by comprising the following steps of:
a registration image thread, a local map and global map thread, a semantic map thread, a fusion thread and a global thread, which can be processed in parallel based on a GPU;
the registration image thread is used for acquiring a color image and a depth image of a scene, and preprocessing the color image and the depth image to obtain a registration image;
the local map and global map thread is used for solving the pose between the images according to the registration image and the depth image, and performing three-dimensional reconstruction by using the pose, the color image and the depth image to obtain a local map and a global map;
the local map and global map thread comprises:
performing block division on the multi-frame registration image to obtain a plurality of image blocks, wherein frame stacking exists between adjacent image blocks;
carrying out feature extraction on the registration images in each image block by using a SIFT extraction algorithm based on GPU acceleration to obtain feature points, and selecting a coordinate system of a frame of registration images as a world coordinate system;
matching the feature points according to a GMS matching algorithm, filtering out mismatched points, saving the intra-block associations as the local image association matching M1, and saving the associations with poor intra-block relevance as the global image association matching M2;
according to M1 and M2, solving the pose between each frame of registration images by using the Gauss-Newton method, and carrying out loop detection on the current pose;
according to the pose and the depth image and the color image obtained in the registration image thread, carrying out three-dimensional dense reconstruction on the scene to obtain a local map and a global map;
the semantic map thread is used for carrying out semantic segmentation on the plurality of registration images by using the PSP Net to obtain a two-dimensional semantic image;
the fusion thread is used for respectively fusing the two-dimensional semantic image with the local map and the global map to obtain a local semantic map and a global semantic map;
the global thread is used for matching the local semantic map and the global semantic map to obtain a global consistency dense semantic map.
2. The three-dimensional semantic map building method according to claim 1, wherein the registering image thread specifically comprises:
calibrating a depth camera comprising an infrared camera and a color camera to obtain an internal parameter and an external parameter of the depth camera;
respectively utilizing an infrared camera and a color camera in the depth camera to continuously acquire a depth image and a color image of a multi-frame scene;
and registering the depth image and the color image according to the external participation and the internal reference to obtain a multi-frame registration image.
3. The method of claim 2, wherein the local map and global map threads comprise:
performing block division on the multi-frame registration image to obtain a plurality of image blocks, wherein frame stacking exists between adjacent image blocks;
carrying out feature extraction on the registration images in each image block by using a SIFT extraction algorithm based on GPU acceleration to obtain feature points, and selecting a coordinate system of a frame of registration images as a world coordinate system;
matching the feature points according to a GMS matching algorithm, filtering out mismatched points, saving the intra-block associations as the local image association matching M1, and saving the associations with poor intra-block relevance as the global image association matching M2;
according to said M1 and M2, solving the pose between each frame of registration images by using the Gauss-Newton method, and carrying out loop detection on the current pose;
and carrying out three-dimensional dense reconstruction on the scene according to the pose and the depth image and the color image obtained in the registration image thread to obtain a local map and a global map.
4. A three-dimensional semantic map building method according to claim 3, wherein the probability model in the GMS matching algorithm is:

P = (mean_true - mean_false) / (std_true + std_false)

and the evaluation score formula of the feature point pair is:

S_ij = Σ_{k=1}^{K} |X_{i_k, j_k}|

wherein P measures the separation between correct matches and incorrect matches, p_true is a correct match, p_false is an incorrect match, mean_true and mean_false are the means of the correct-match and incorrect-match score distributions, and std_true and std_false are their standard deviations; |F_1i| is the number of features in the feature point matching grid; i and j are the matching point regions in the two frames of images respectively, k is the current grid index, K is the total number of grids, and |X_{i_k, j_k}| is the number of matches between the cell pair {i_k, j_k}.
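A minimal illustration of the GMS-style grid scoring in claim 4 (the 3x3 cell neighbourhood and the separability measure are standard choices in GMS; this is a sketch, not the claimed implementation):

```python
def gms_cell_score(match_counts):
    """Score S_ij for a grid-cell pair: total number of matches over the K
    neighbouring cell pairs {i_k, j_k} (K = 9 for a 3x3 neighbourhood)."""
    return sum(match_counts)

def partition_quality(mean_true, mean_false, std_true, std_false):
    """P: how well the correct-match and incorrect-match score distributions
    separate; a larger P means the grid score discriminates matches better."""
    return (mean_true - mean_false) / (std_true + std_false)
```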
5. The three-dimensional semantic map construction method according to claim 2, wherein registering the depth image and the color image according to the external participation and the internal reference specifically comprises:
converting coordinates of all pixel points in the depth image to an infrared camera coordinate system;
converting points under the infrared camera coordinate system into a world coordinate system;
converting points in the world coordinate system into a color camera coordinate system;
mapping points under a color camera coordinate system to a color plane of the normalized plane;
and obtaining a transformation matrix between the infrared camera and the color camera.
6. A three-dimensional semantic map construction method according to claim 3, wherein the semantic map thread specifically comprises:
extracting features of the registration images to obtain feature layers;
pooling the feature layers to generate pyramid pooling features;
flattening and upsampling the pyramid pooling feature;
merging with the feature layer, and obtaining a local semantic map and a global semantic map through a convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110394816.7A CN113313824B (en) | 2021-04-13 | 2021-04-13 | Three-dimensional semantic map construction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313824A CN113313824A (en) | 2021-08-27 |
CN113313824B true CN113313824B (en) | 2024-03-15 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116817892B (en) * | 2023-08-28 | 2023-12-19 | 之江实验室 | Cloud integrated unmanned aerial vehicle route positioning method and system based on visual semantic map |
CN117788306A (en) * | 2023-12-18 | 2024-03-29 | 上海贝特威自动化科技有限公司 | Multithreading-based multi-focal-length tab image fusion method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111080659A (en) * | 2019-12-19 | 2020-04-28 | 哈尔滨工业大学 | Environmental semantic perception method based on visual information |
Non-Patent Citations (3)
Title |
---|
Application of the SSE instruction set in the image reconstruction algorithm of a 60Co container CT system; Song Qi, Luo Zhiyu, Cong Peng; Nuclear Electronics & Detection Technology (01); full text * |
Semantic map construction based on laser SLAM and deep learning; He Song, Sun Jing, Guo Lejiang, Chen Liang; Computer Technology and Development (09); full text * |
Video stabilization algorithm based on feature matching and motion compensation; Tang Jialin, Zheng Jiefeng, Li Xiying, Su Binghua; Application Research of Computers (02); full text * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||