CN112258618A - Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map


Info

Publication number: CN112258618A (application CN202011213908.2A); granted as CN112258618B
Other languages: Chinese (zh)
Inventors: Li Jing (李京), Gong Jianhua (龚建华)
Original and current assignee: Aerospace Information Research Institute of CAS
Legal status: Granted; Active


Classifications

    • G06T 15/04: Texture mapping (3D image rendering)
    • G06T 7/12: Edge-based segmentation
    • G06T 7/13: Edge detection
    • G06T 7/85: Stereo camera calibration
    • G06T 2207/10012: Stereo images
    • G06T 2207/10032: Satellite or aerial image; remote sensing

Abstract

The invention discloses a semantic mapping and positioning method based on the fusion of a prior laser point cloud and a depth map, which comprises the following steps: s1, collecting prior laser point cloud data; s2, acquiring a depth image and an RGB image, generating an RGB-D point cloud based on the depth image, and carrying out initialization registration on the prior laser point cloud and the RGB-D point cloud; s3, providing camera pose constraint through the registered prior laser point cloud to correct the camera pose; s4, constructing a three-dimensional geometric point cloud map by adopting a front and back window optimization method; s5, performing geometric incremental segmentation on the three-dimensional geometric point cloud map, performing object identification and semantic segmentation on the RGB image, and fusing geometric incremental segmentation and semantic segmentation results to obtain a 3D geometric segmentation map of semantic enhanced geometric segmentation; and S6, performing semantic association and segmentation probability distribution updating on the object to complete the construction of the semantic map. The invention can effectively eliminate the accumulated error of large-scale indoor map building and positioning, and has high precision and real-time performance.

Description

Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
Technical Field
The invention relates to the technical field of positioning and map construction, in particular to a semantic map construction and positioning method based on the fusion of a prior laser point cloud and a depth map.
Background
At present, the main technical difficulties hindering applications such as indoor and outdoor large-scene augmented reality fusion, unmanned driving, and robot navigation and positioning are the dynamic tracking and recording of the camera position and posture in a real scene and the construction of a three-dimensional semantic map. A semantic map is a traditional three-dimensional point cloud map whose elements are endowed with corresponding attribute values (such as the ground, walls or buildings). Because the global satellite positioning system cannot meet the requirements on precision and robustness in many application scenes (such as outdoor high-precision augmented reality and unmanned driving), and the error of an inertial attitude sensor gradually increases as it drifts over time, real-time three-dimensional mapping and camera position and attitude calculation using computer vision and three-dimensional laser radar point clouds are continuously being developed and perfected. Conventional methods include the identification of markers previously placed outdoors from an image, matching based on natural feature points, and algorithms based on a three-dimensional model of a known environment. In recent years, camera positioning has become a very active research field due to the rapid development of mobile computing, computer vision, augmented reality (AR), indoor GIS and simultaneous localization and mapping (SLAM) technologies. The use of mobile terminals and unmanned aerial vehicles (UAVs) is increasing, and the development of these technologies is being promoted to meet practical application requirements in fields such as AR, robotics, indoor navigation and positioning, and automatic driving. In these emerging applications, camera localization and semantic mapping are often key enabling technologies. Recent sensors such as portable laser-ranging LiDAR and RGB-D cameras supplement the RGB image with depth information, which provides new opportunities for developing robust and practical applications, whereas an RGB camera alone is far from adequate.
Camera positioning is a key step in both online SLAM (Simultaneous Localization And Mapping) and offline SFM (Structure From Motion). SLAM incrementally estimates the pose of a mobile platform while simultaneously constructing a map of the surrounding environment; owing to its autonomous positioning and environment perception capabilities, SLAM has become a key prerequisite for the autonomous operation of unmanned systems such as robots in unknown environments. Visual SLAM is an algorithm that uses a camera as the data input sensor and is widely used on indoor mobile platforms. Compared with ranging instruments such as laser radars, the visual sensor has the advantages of low power consumption and small size, and can provide richer environment texture information for the mobile platform. Thus, visual SLAM has attracted increasing attention in the research community; for example, many researchers have attempted to improve the efficiency and robustness of visual SLAM methods. To improve efficiency, feature extraction algorithms such as SURF, BRISK and ORB features have been proposed. Furthermore, some algorithms introduce parallel computation to improve efficiency, such as parallel tracking and mapping for small-scene augmented reality (AR) workspaces, which was the first SLAM algorithm to separate the positioning and mapping functions into two threads and achieve real-time performance. A recent interesting development in real-time visual SLAM is ORB-SLAM3, which incorporates recent techniques and uses real-time ORB features for tracking, mapping, positioning and loop closure in large indoor and outdoor environments. In addition, with the rapid development of new sensors, new applications and new computing methods represented by deep learning, simultaneous localization and mapping (SLAM) algorithms are gradually breaking through the limitations of traditional methods based on filters, small scenes, pixel-by-pixel matching and the like, and are developing in a more intelligent direction. At present, semantic SLAM can acquire the geometric structure information of a scene, identify independent individual objects in the scene, acquire semantic information such as the positions, poses and functional attributes of objects, and provide support for completing intelligent service tasks in complex scenes. Semantic SLAM is a research frontier and hotspot in the current computer vision field and provides underlying algorithmic support for applications such as augmented reality, scene understanding, robotics, navigation and positioning, human-machine interaction and automatic driving, so simultaneous localization and semantic map construction technology with high precision, high robustness and high real-time performance has very important scientific significance and application value. With the recent development of two-dimensional image semantic segmentation and object detection algorithms, many methods effectively combine these algorithms with state-of-the-art SLAM methods to obtain highly accurate semantically segmented three-dimensional maps.
Current mainstream methods have two drawbacks. On the one hand, they require a priori known 3D models; since each instance of a semantic class (such as a chair) can vary significantly even within a single environment, an accurate 3D model would be needed for every variation, which limits practical implementation. On the other hand, they allocate a semantic label to each 3D point using a probabilistic model such as a hierarchical conditional random field, which consumes excessive computing resources and memory space. These methods therefore still compromise the real-time performance of the semantic SLAM algorithm.
Online real-time methods (such as SLAM) inevitably accumulate errors as the camera moves: the position and orientation errors of the camera typically accumulate over time, especially in large scenes, and may eventually destroy the usability of the application. Application scenarios are currently limited to small workspaces in general, and the robustness and real-time performance of unknown-camera positioning in large indoor and outdoor scenes remain challenging, which may prevent the widespread use of positioning and mapping techniques. In addition, because the illumination conditions and texture features of large indoor and outdoor scenes are complex and change dynamically, mapping based on a single RGB video stream or laser radar point cloud often loses feature points or fails to match between point cloud frames, so a single sensor cannot meet the robustness and real-time requirements of large-scene three-dimensional semantic reconstruction and positioning in an unknown environment.
Disclosure of Invention
The invention aims to provide a semantic mapping and positioning method based on the fusion of a prior laser point cloud and a depth map, which aims to solve the technical problems in the prior art, effectively eliminate the accumulated error of large-scale indoor mapping and positioning and have high mapping and positioning precision and real-time performance.
In order to achieve the purpose, the invention provides the following scheme: the invention provides a semantic mapping and positioning method based on the fusion of a prior laser point cloud and a depth map, which comprises the following steps:
s1, collecting integral prior laser point cloud data of a scene, and preprocessing the collected data;
s2, acquiring a depth image and an RGB image of a scene in real time through a camera fixed on a mobile platform, generating an RGB-D point cloud based on the depth image, and performing initialization registration on preprocessed prior laser point cloud data and the RGB-D point cloud by adopting a deep learning positioning neural network and a nearest neighbor ICP iterative algorithm;
s3, providing camera pose constraint through the registered prior laser point cloud data, correcting the camera pose and completing camera positioning;
s4, constructing a three-dimensional geometric point cloud map by adopting a multithread map segmentation front-back window optimization method based on camera pose constraints provided by prior laser point cloud data;
s5, performing geometric incremental segmentation on the three-dimensional geometric point cloud map, performing object identification and semantic segmentation on each frame of RGB image by adopting an object detector based on a convolutional network, fusing a geometric incremental segmentation result and a semantic segmentation result, and acquiring an object boundary to obtain a 3D geometric segmentation map of scene semantic enhanced geometric segmentation;
s6, updating the 3D geometric segmentation map subjected to semantic enhancement geometric segmentation through the correction result of the camera pose in the step S3, performing semantic association and segmentation probability distribution updating on the objects in the scene based on the updating result, and completing construction of the semantic map.
Preferably, in step S2, the method for performing initial registration on the preprocessed a priori laser point cloud data and the RGB-D point cloud includes:
firstly, constructing a deep learning network, aligning preprocessed prior laser point cloud data with the RGB-D point cloud to obtain continuous video frames marked by a camera pose, and training the deep learning positioning neural network through continuous video frame images and the camera pose;
secondly, inputting RGB image video frames collected by a camera into a trained deep learning positioning neural network to obtain an initial camera pose;
and updating the initial camera pose by adopting an ICP iterative algorithm to complete the initialization registration of the prior laser point cloud data and the RGB-D point cloud.
Preferably, the step S3 specifically includes the following steps:
s3.1, selecting a key frame to be corrected from a depth image acquired by a camera in real time, and generating RGB-D point cloud data through dynamic object elimination and depth image data denoising;
s3.2, acquiring local prior laser point cloud data according to the initial posture of the camera generated by the visual tracking thread;
and S3.3, based on the local prior laser point cloud data, acquiring the camera pose of the key frame to be corrected by adopting a point-to-surface ICP iterative algorithm, and transmitting the camera pose of the key frame to be corrected to a local map through a key frame visibility map.
Preferably, in the step S3.1, RGB-D point cloud data is generated based on the depth information of the camera and the pose of the tracking thread.
Preferably, the step S4 includes:
s4.1, acquiring key frames and 3-D ORB characteristic points of the depth images, and constructing a sparse map;
s4.2, adopting camera pose constraint provided by prior laser point cloud data to locally optimize the constructed sparse map;
s4.3, performing global optimization on the map subjected to local optimization by adopting a multithreading map segmentation optimization method;
and S4.4, further optimizing the map after global optimization by adopting a double-window optimization method to obtain the three-dimensional geometric point cloud map.
Preferably, in step S4.4, the dual-window optimization method includes a front window and a rear window; the front window constraint is a pose-map point constraint, and the front window is used for controlling the construction of a local map; the rear window constraint is a pose-pose constraint between the rear window and the front window key frame, and a pose-map point constraint between the rear window and the front window map point; the rear window is used to control global consistency of the map.
Preferably, the front window comprises a plurality of key frames in the map segment, a plurality of key frames with overlapped front and back map segments, and a map point set observed by the key frames; the back window includes the optimized set of keyframes in the front window.
Preferably, the step S5 specifically includes:
s5.1, segmenting the three-dimensional geometric point cloud map by adopting a normal vector edge analysis method of a concave punishment mechanism and a three-dimensional vertex distance threshold segmentation method to obtain a geometric segmentation map of the depth image;
s5.2, performing semantic segmentation and object identification on each frame of RGB image by adopting an object detector based on a convolutional network to obtain class information and class probability of an object and an object boundary segmentation map;
s5.3, carrying out superposition statistics on the geometric segmentation map and the object boundary segmentation map of the depth image, detecting whether the same object is divided into different parts or not according to the state that different image labels in the geometric segmentation map of the depth image fall into a first object boundary frame of the object boundary segmentation map, if so, updating the boundary of the object in the object boundary segmentation map, and obtaining the 3D geometric segmentation map of the scene semantic enhanced geometric segmentation.
Preferably, in step S6, the segmentation probability of each object in the 3D geometric segmentation map assigned to the semantically enhanced geometric segmentation is updated in real time by using a confidence-based method.
Preferably, the confidence-based method specifically includes:
based on the camera pose correction result in step S3, back-projecting the 3D geometric segmentation map of the semantic enhanced geometric segmentation to the current camera plane to obtain a label rendering segmentation map; and adopting an object detector based on a convolutional network to carry out object identification on the label rendering segmentation map to obtain a second object boundary box and a corresponding segmentation probability, associating the second object boundary box with an object boundary box in the 3D geometric segmentation map of the semantic enhanced geometric segmentation, and updating the object segmentation probability in the 3D geometric segmentation map of the semantic enhanced geometric segmentation through the segmentation probability corresponding to the second object boundary box to complete the construction of the semantic map.
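As an illustration of the confidence-based update described above, the following sketch (all class and method names are hypothetical, not part of the invention) renders the semantic 3D segmentation into the current camera view, runs the object detector on the rendered label map, and fuses the returned probabilities into the stored object segmentation probabilities with a simple averaging rule assumed here for brevity.
```python
def update_segmentation_probabilities(semantic_map, camera_pose, detector, iou_thresh=0.5):
    # Back-project the semantically enhanced 3D segmentation map to the current camera plane.
    label_render = semantic_map.render_labels(camera_pose)
    # Second object bounding boxes and their class probabilities from the CNN detector.
    boxes, labels, probs = detector(label_render)
    for box, label, prob in zip(boxes, labels, probs):
        # Associate the detected box with an object box stored in the 3D segmentation map.
        obj = semantic_map.associate(box, label, iou_thresh)
        if obj is not None:
            # Assumed fusion rule: running average of old and new segmentation probability.
            obj.seg_prob = 0.5 * (obj.seg_prob + float(prob))
    return semantic_map
```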
The invention discloses the following technical effects:
(1) the invention provides a semantic mapping and positioning method based on fusion of a priori laser point cloud and a depth map, which is characterized in that the priori laser point cloud data is used as the knowledge of indoor 3-D scene construction and navigation, the high-precision geometric control information in the laser radar point cloud data of the known scene is utilized to eliminate the accumulated error of an RGB-D SLAM system in large-scale indoor operation, and the mapping and positioning precision is obviously improved;
(2) the invention combines deep learning and geometric SLAM system, realizes automatic initialization and motion recovery, thereby improving the stability of the system in severe environment;
(3) the method adopts a double-window multithreading map segment optimization method, ensures that the map building and positioning system runs in real time and at high precision under the guidance of prior laser point cloud data, constructs a large-scale precise and consistent dense map, and ensures the accuracy and the real-time performance of map construction; compared with centimeter-precision prior laser point cloud data, the standard deviation of 3D map construction in the travel distance of 100m is about 20cm, the relative accumulated error of the camera in a closed loop experiment is about 0.1%, and the relative accumulated error is improved by two orders of magnitude compared with a typical SLAM algorithm (the accumulated error is 3% -5%); in addition, 2D abundant texture image data from a camera and high-precision 3D ground laser point cloud data are automatically fused, so that the defect of insufficient texture information of the laser point cloud data is overcome;
(4) the invention adopts a probability updating mechanism based on the segmentation class labels instead of map elements, greatly reduces the calculation complexity and the consumption of calculation resources, and can ensure the real-time construction of the high-precision semantic map.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a semantic mapping and positioning method based on the fusion of a prior laser point cloud and a depth map according to the present invention;
FIG. 2 shows the prior laser point cloud data before and after preprocessing in an embodiment of the invention; FIG. 2(a) is the prior laser point cloud data before preprocessing, and FIG. 2(b) is the prior laser point cloud data after preprocessing;
FIG. 3 is a schematic diagram of obtaining local prior laser point cloud data according to a camera pose provided by a tracking thread in an embodiment of the present invention;
FIG. 4 shows the registration result of RGB-D depth point cloud and prior lidar point cloud in the embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, the present embodiment provides a semantic mapping and positioning method based on a priori laser point cloud and depth map fusion, which specifically includes the following steps:
s1, collecting integral prior laser point cloud data of a scene, and preprocessing the collected data;
in the embodiment, a 3D ground LiDAR scanner is adopted to collect prior laser point cloud data; A3D terrestrial LiDAR scanner can provide a measurement range of 100m to 1000m with a measurement accuracy of centimeters or more, a measurement frequency of 300,000 points/second, and a scanning field of view of 100 DEG x 360 deg. And combining the single-station laser scanning data together to form the integral laser point cloud data of the scene.
In this embodiment, the collected prior laser point cloud data is preprocessed with downsampling and a KD-tree data structure: using the CloudCompare point cloud processing software, the prior laser point cloud data is downsampled with a voxel resolution of 5 cm, which reduces the volume of the laser point cloud data while retaining the maximum amount of geometric information with the minimum data size. Because a 3D terrestrial LiDAR scanner produces a large amount of data, this preprocessing effectively improves the efficiency and real-time performance of semantic mapping and positioning. The prior laser point cloud data before and after preprocessing are shown in fig. 2(a) and fig. 2(b), respectively.
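A minimal preprocessing sketch, assuming the Open3D library is used in place of the CloudCompare GUI mentioned above: the prior cloud is voxel-downsampled at 5 cm and indexed with a KD-tree for fast neighbour queries (the file name is hypothetical).
```python
import open3d as o3d

prior_cloud = o3d.io.read_point_cloud("prior_terrestrial_lidar.pcd")  # hypothetical file name
prior_down = prior_cloud.voxel_down_sample(voxel_size=0.05)           # 5 cm voxel grid
prior_kdtree = o3d.geometry.KDTreeFlann(prior_down)                   # KD-tree index for fast queries

# Example query: indices of the 10 prior points nearest to an arbitrary position.
_, idx, _ = prior_kdtree.search_knn_vector_3d([0.0, 0.0, 0.0], 10)
```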
S2, acquiring a depth image and an RGB image of the scene in real time through a camera fixed on a mobile platform, generating an RGB-D point cloud based on the depth image, and performing initialization registration of the preprocessed prior laser point cloud data and the RGB-D point cloud by adopting a deep learning positioning neural network and an ICP (Iterative Closest Point) iterative algorithm; the camera is an RGB-D camera.
If the prior laser point cloud data cannot be accurately matched to the coordinate frame of the RGB-D point cloud, it cannot be used as the basis for subsequent optimization. The method adopts a deep learning positioning neural network to align the preprocessed prior laser point cloud data with the RGB-D point cloud. The deep learning positioning neural network is adapted from GoogLeNet, a 22-layer convolutional network with six inception modules and two additional intermediate classifiers; affine regressors are used to replace all of the classifiers in GoogLeNet. At the same time, the softmax layers in GoogLeNet are removed and each final fully-connected layer is modified to output a 7-dimensional pose vector representing the camera position and orientation.
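A hedged sketch of such a pose-regression network, assuming the torchvision implementation of GoogLeNet: the classifier head is replaced by an affine regressor that outputs a 7-dimensional vector (3-D position plus a unit quaternion for orientation). The class name and layer choices are illustrative, not prescribed by the patent.
```python
import torch
import torch.nn as nn
import torchvision

class PoseRegressor(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        # GoogLeNet backbone without auxiliary classifiers; the softmax classifier is dropped.
        self.backbone = torchvision.models.googlenet(weights=None, aux_logits=False)
        # Replace the final fully connected layer with a 7-D affine pose regressor.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 7)

    def forward(self, rgb):                       # rgb: (N, 3, H, W) video frames
        out = self.backbone(rgb)
        t, q = out[:, :3], out[:, 3:]             # translation and quaternion parts
        q = q / q.norm(dim=1, keepdim=True)       # normalise the orientation quaternion
        return t, q
```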
The specific registration method comprises the following steps:
Firstly, a deep learning network is constructed: the prior laser point cloud data is manually aligned with the RGB-D point cloud to obtain continuous video frames labelled with camera poses, and the deep learning positioning neural network is trained with these continuous video frame images and the camera poses obtained from the manual alignment; the trained network yields an average position error of 1 m and an average angle error of 4 degrees, measured relative to the poses generated by manually matching the prior laser point cloud data and the RGB-D point cloud.
Secondly, the RGB image video frames acquired by the camera are input into the trained deep learning positioning neural network to obtain a rough initial camera pose; this rough initial camera pose has an average position error of 1 m and an average angle error of 4 degrees, i.e. the prior laser point cloud data is not yet accurately aligned with the RGB-D point cloud.
And updating the initial camera pose based on the rough initial camera pose by adopting an ICP iterative algorithm to obtain an accurate initial camera pose, finishing the initialization accurate registration of the prior laser point cloud data and the RGB-D point cloud, and finishing the automatic initialization of the RGB-D SLAM algorithm system guided by the laser data in real time. After initialization, the prior laser point cloud data is used as a true value data framework to provide real-time constraint for positioning and mapping.
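The initialization step can be sketched as follows, under the assumption that Open3D's ICP is used and that the network pose is expressed as a 4 x 4 matrix; the 0.5 m correspondence distance is an illustrative choice, not taken from the patent.
```python
import open3d as o3d

def refine_initial_pose(rgbd_cloud, prior_cloud, T_init, max_corr_dist=0.5):
    """T_init: rough 4x4 camera pose predicted by the positioning network."""
    result = o3d.pipelines.registration.registration_icp(
        rgbd_cloud, prior_cloud, max_corr_dist, T_init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # refined 4x4 pose: accurate initial registration
```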
S3, providing camera pose constraint through the registered prior laser point cloud data, correcting the camera pose and completing camera positioning; the method specifically comprises the following steps:
s3.1, selecting a key frame to be corrected from a depth image acquired by a camera in real time, and generating RGB-D point cloud data through dynamic object elimination and depth image data denoising;
firstly, selecting a key frame to be corrected:
in a typical visual SLAM approach, errors in camera pose estimation typically accumulate over time. Although it is not necessary to correct the accumulated error for each frame using a priori laser point cloud data, the accumulated error must be kept at a sufficiently low level within the initial range required for the ICP to converge to a global optimum. In order to ensure drift error free and real-time performance, in the present embodiment, the accumulated error of 50 key frames per interval is eliminated by a priori laser point cloud data. The interval of every 50 key frames is chosen because it is experimentally verified that the choice can maintain the best balance between successful convergence of the ICP algorithm and the real-time performance of the system. For example, if the interval of the keyframes is less than 15, the a priori laser point cloud data optimization thread will lag the tracking thread and cause a delay in the camera pose correction of the current frame. Intervals of more than 60 keyframes produce a higher error accumulation and result in failure of ICP to converge to the global optimum because the camera coarse pose accumulation error provided by the tracking thread is large and does not allow the ICP algorithm to converge successfully.
Furthermore, due to the complexity of the test environment, the keyframes to be corrected are further selected according to the actual operating conditions of the proposed algorithm. For example, when the camera rotates rapidly, the tracking image becomes blurred and the feature matching failure rate increases, so the accumulated error of the camera grows accordingly and the interval between correction keyframes needs to be narrowed. The selection criterion for the keyframes to be corrected is therefore whether the change of the camera forward direction between two adjacent keyframes exceeds a preset threshold; if so, the camera is rotating too fast and keyframe correction is required. In this embodiment, the preset threshold is 47 degrees.
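The selection rule just described can be sketched as below (function and variable names are illustrative): a keyframe is sent to the laser-correction thread every 50 keyframes, or earlier when the camera forward direction changes by more than 47 degrees between adjacent keyframes.
```python
import numpy as np

def needs_laser_correction(kf_index, R_prev, R_curr, interval=50, angle_thresh_deg=47.0):
    # Forward direction taken as the third column of the rotation matrix (assumed convention).
    f_prev, f_curr = R_prev[:, 2], R_curr[:, 2]
    cos_angle = np.clip(np.dot(f_prev, f_curr), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return (kf_index % interval == 0) or (angle_deg > angle_thresh_deg)
```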
Secondly, dynamic object elimination, hole filling and denoising are carried out on the depth image:
the embodiment adopts the consistency of the optical flow to detect the dynamic object and eliminates the detected dynamic object so as to eliminate the influence of the data of the dynamic object;
Additionally, the RGB-D camera uses optical coding technology to acquire depth images of the scene. Compared with a gray-scale image, a depth image carries three-dimensional shape information of objects and is not affected by the illumination direction of the light source, the emission characteristics of the object surface, or shadows, so it expresses the three-dimensional geometry of the target object surface more clearly. Compared with extracting the geometric characteristics of three-dimensional objects from gray-scale images, depth image processing imposes no special restrictions on the geometric and physical characteristics of the scene and directly uses three-dimensional information, which greatly simplifies the recognition and positioning of three-dimensional objects. However, illumination changes of the scene during measurement can still cause a certain instability of the depth data, and the infrared reflection characteristics of object surface materials may generate noise and create holes at object edges and occluded areas, which is unfavorable for subsequent mapping and positioning. Therefore, the depth image needs hole filling and denoising to improve the accuracy of subsequent map construction and positioning.
Hole filling estimates depth values at hole locations from the pixel depth values of the neighborhood around the hole, exploiting spatial correlation. Owing to the influence of the complex environment, the RGB-D camera cannot directly obtain depth information inside a hole; but because the surrounding pixels within a small neighborhood contain continuous depth values, the depth values of the surrounding pixels of the depth map can be used to estimate the missing depth information.
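A minimal hole-filling sketch based on this spatial-correlation idea, assuming SciPy is available: each invalid (zero) depth pixel takes the value of its nearest valid neighbour, which is a simpler stand-in for the neighbourhood estimate described above.
```python
import numpy as np
from scipy import ndimage

def fill_depth_holes(depth):
    """depth: H x W array in metres; 0 marks holes."""
    invalid = depth == 0
    # Indices of the nearest valid pixel for every position in the image.
    idx = ndimage.distance_transform_edt(invalid, return_distances=False, return_indices=True)
    return depth[tuple(idx)]
```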
Thirdly, based on the denoised depth image, generating RGB-D point cloud data of the key frame to be corrected:
after a keyframe to be corrected is selected and depth image noise elimination is performed, RGB-D point cloud data are generated based on depth information of the RGB-D camera and a rough pose of a tracking thread. Because the data volume of the RGB-D point cloud is huge, the efficiency of the ICP iterative algorithm is influenced, and therefore the RGB-D point cloud voxels are downsampled into point clouds with the resolution of 6cm, and the matching of the RGB-D sparse point cloud data of the key frame to be corrected and the prior laser point cloud data is quickly obtained.
S3.2, acquiring local prior laser point cloud data according to the initial posture of the camera generated by the visual tracking thread;
in order to acquire the accurate pose of the camera from the local prior laser point cloud data, the local prior laser point cloud data is required in addition to the RGB-D point cloud generated by the key frame to be corrected according to the depth image information. According to the rough camera pose visually tracked by the key frame to be corrected, filtering the prior laser point cloud data of the whole scene to obtain local point cloud data to be corrected around the rough pose of the key frame to be corrected, as shown in fig. 3, ensuring that the local prior laser point cloud data and the RGB-D point cloud of the key frame to be corrected have higher overlapping degree, and ensuring that a subsequent iterative algorithm converges to a global optimal value; the overlapping degree between the acquired local prior laser point cloud data and the RGB-D point cloud of the key frame to be corrected depends on the accumulated error of the camera pose of the key frame to be corrected. In this embodiment, the filtering method for obtaining the local prior laser point cloud data according to the rough camera pose is as follows:
$$T_{wc} = \begin{bmatrix} a_1 & a_2 & a_3 & t_x \\ b_1 & b_2 & b_3 & t_y \\ c_1 & c_2 & c_3 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
$$d_i = \sqrt{(q_i(x)-t_x)^2 + (q_i(y)-t_y)^2 + (q_i(z)-t_z)^2}$$
$$\theta_i = \arccos\frac{(q_i - t)\cdot \vec{f}}{\lVert q_i - t \rVert\,\lVert \vec{f} \rVert}$$
$$q = \{\, q_i \mid d_i < 7.0 \ \wedge\ \theta_i < \theta_0 + 5^{\circ} \,\}$$
where $a_1, a_2, a_3, b_1, b_2, b_3, c_1, c_2, c_3$ are the components of the rotation matrix of the camera pose, and $T_{wc}$ is the inverse of the camera pose matrix of the current keyframe to be corrected; $Q$ is the prior laser point cloud data and $q_i \in Q$; $t = (t_x, t_y, t_z)$ is the camera centre in the prior laser point cloud data coordinate system; $d_i$ is the distance from the $i$-th prior laser point to the camera centre; $\theta_i$ is the angle between the vector from the camera centre to the $i$-th prior laser point and the forward direction vector $\vec{f}$ of the camera; $\theta_0$ is the maximum field angle of the camera; and $q$ is the set of prior laser points less than 7 metres from the camera centre and within a field of view 5 degrees larger than the forward field angle of the camera, which guarantees a high degree of overlap between the local prior laser point cloud data and the RGB-D point cloud of the keyframe to be corrected. Because the prior laser point cloud data is organized and managed with a KD-tree and has been down-sampled, the local prior laser point cloud data can be obtained efficiently with this filtering algorithm.
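The filtering rule above can be sketched as follows (a NumPy sketch under the stated 7 m and theta0 + 5 degree limits; the forward-axis convention is an assumption):
```python
import numpy as np

def filter_local_prior(points, T_wc, theta0_deg, max_dist=7.0):
    """points: N x 3 prior laser cloud; T_wc: 4x4 camera-to-laser-frame pose of the keyframe."""
    t = T_wc[:3, 3]                      # camera centre (t_x, t_y, t_z)
    forward = T_wc[:3, 2]                # camera forward direction (assumed to be the z axis)
    v = points - t
    dist = np.linalg.norm(v, axis=1)
    cos_theta = v @ forward / (dist * np.linalg.norm(forward) + 1e-12)
    theta = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    keep = (dist < max_dist) & (theta < theta0_deg + 5.0)
    return points[keep]
```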
S3.3, based on local prior laser point cloud data, acquiring the camera pose of the key frame to be corrected by adopting a point-to-surface ICP iterative algorithm, and transmitting the camera pose to a local map through a key frame visibility map;
The RGB-D point cloud data and the local accurate prior laser point cloud data around the rough camera pose of the keyframe to be corrected have now both been obtained, so the next step is to obtain the camera pose correction from these two point clouds. The ICP iterative algorithm takes the rough, visually tracked camera pose of the keyframe to be corrected as its initial value; since a laser correction is performed every 50 keyframes, the accumulated error of this rough camera pose is not large. At the same time, because the indoor environment structure satisfies the Manhattan-world assumption and is mainly composed of rectangular blocks in all directions, the ICP iterative algorithm in this embodiment adopts a point-to-plane distance error metric. Considering these factors and the voxel-downsampled sparse RGB-D point cloud data, the ICP iterative algorithm in this embodiment constructs a point-to-plane cost function model and sets the number of iterations to 10, so that it converges to the global optimum robustly and quickly.
The point-to-plane ICP iteration, based on constructing a point-to-plane cost function model, comprises the following steps:
First: select a certain number of points from the RGB-D point cloud data of the keyframe to be corrected according to a uniform sampling rule;
Second: construct corresponding point pairs between the source point cloud data and the target point cloud data using the nearest-point matching rule; the source point cloud data is the RGB-D point cloud data and the target point cloud data is the prior laser point cloud data;
Third: calculate the rotation and translation parameters between the two point clouds from the corresponding point pairs to obtain the camera pose;
Fourth: transform the source point cloud data with the calculated rotation and translation parameters, and construct a cost function model between the transformed point cloud data and the target point cloud data using the point-to-plane error distance; further optimize the camera pose obtained in the third step with the cost function model to obtain the camera pose correction;
Fifth: check the camera pose correction; if it is smaller than a preset threshold or the number of iterations exceeds the maximum number of iterations, stop iterating and output the final rigid transformation parameters; otherwise return to the first step and continue iterating until the condition is met.
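A hedged sketch of this correction step with Open3D: the coarse tracked pose seeds a point-to-plane ICP between the downsampled keyframe RGB-D cloud (source) and the local prior laser cloud (target), capped at 10 iterations as described above; the 0.3 m correspondence distance and the normal-estimation radius are illustrative assumptions.
```python
import open3d as o3d

def correct_keyframe_pose(rgbd_cloud, local_prior_cloud, T_coarse, max_corr_dist=0.3):
    # Point-to-plane ICP needs normals on the target (prior laser) cloud.
    local_prior_cloud.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=0.2, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        rgbd_cloud, local_prior_cloud, max_corr_dist, T_coarse,
        o3d.pipelines.registration.TransformationEstimationPointToPlane(),
        o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=10))
    return result.transformation  # corrected camera pose of the keyframe
```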
It can be known from the ICP iteration steps that the initial relative positions of the two point sets have a certain influence on the convergence of the algorithm, and the ICP registration is preferably performed under a "close enough" condition, and fig. 4 is a registration result of RGB-D depth point cloud and prior lidar point cloud.
The accurate camera pose of the keyframe to be corrected is obtained through the ICP algorithm. Because the pose of a single corrected keyframe controls only a limited part of the map, the map cannot be sufficiently optimized by using that pose as the only control information, so the pose needs to be propagated to the local map to expand its control range. From the visual SLAM algorithm it is known that local map construction mainly comprises acquiring the camera pose and building the local map by means of camera motion tracking and local map tracking, so the relative pose between each pair of co-visible keyframes is accurate and the accumulated error is small. The acquired accurate pose of the keyframe to be corrected can therefore be propagated locally through the accurate relative poses between co-visible keyframes, correcting the local map points observed by the corresponding keyframes and further expanding the control range of the accurate pose information. Through this accurate pose correction, the accurate camera pose is propagated into the local map, which improves the accuracy of the local map, expands the range of control information in the map, and prepares for the next map segment optimization.
S4, constructing a three-dimensional geometric point cloud map by adopting a multithread map segmentation front-back window optimization method based on camera pose constraints provided by prior laser point cloud data;
Although the prior laser point cloud data could be used as a map model of the scene, it contains only geometric spatial information and no texture information. In addition, occlusion during the laser scanning process causes some detailed parts of the scene to be missing from the laser point cloud data. Therefore, the invention constructs a high-precision map with texture information under the constraint of the prior laser point cloud data, so as to enrich the content and improve the utilization of the prior laser point cloud map and repair gaps in the data.
S4.1, acquiring key frames and 3-D ORB characteristic points of the depth images, and constructing a sparse map;
the map construction method is based on an ORB-SLAM algorithm framework, and the ORB-SLAM algorithm framework has sparse map construction threads. The sparse map is composed of key frames and corresponding 3-D ORB feature points, wherein the key frames and the 3-D ORB feature points are created by the method comprising the following steps:
firstly, establishing 3-D ORB characteristic points through projection according to the pose of a key frame camera and the depth data of the ORB characteristic points;
secondly, a selection strategy of every 50 frames is applied to the candidate objects in the local mapping thread, and the final key frame and the 3-D ORB characteristic points in the local map are preferentially selected.
S4.2, local optimization is carried out on the constructed sparse map by adopting accurate camera pose constraint provided by prior laser point cloud data;
local optimization and global optimization of sparse maps are applied in ORB-SLAM without external constraints. Therefore, after long-distance tracking, errors in the camera pose and the sparse map may accumulate. The invention further optimizes the sparse graph constructed by the ORB-SLAM by using the accurate camera pose constraint provided by the prior laser point cloud data so as to eliminate the accumulated error. Because the optimization operation is performed once every 50 key frames, the optimization frequency is high. Therefore, in order to obtain good real-time performance and realize the accuracy and consistency of sparse map optimization, a multi-thread map segmentation-based front-back window optimization strategy is adopted. After optimization, accurate camera poses of all key frames can be obtained, and a dense point cloud model of a scene is recovered by integrating depth image data of all key frames. Then, a semi-dense map is constructed based on the keyframes, and the map is fused with 2D rich texture information from the image data and high-precision 3D geometric information from ground prior laser point cloud data.
S4.3, performing global optimization on the map subjected to local optimization by adopting a multithreading map segmentation optimization method;
Although the camera pose is acquired through ICP (Iterative Closest Point) matching with the laser point cloud and propagated through the keyframe co-visibility graph, the map has not yet been globally optimized. Therefore, in order to obtain good real-time performance and a globally optimal and consistent map, the invention relies on a multithreaded map segmentation double-window optimization strategy, which specifically comprises: correcting the first and last keyframes of the current map segment with the ICP algorithm, taking the accurate poses obtained from the prior laser point cloud data as ground truth, and optimizing the map segment with a double-window method.
In multi-thread segment-based map optimization, a visual tracking thread runs on a current map segment, and an a priori laser point cloud data optimization thread runs on a previous map segment; when the visual tracking thread starts tracking from t1To t2When the next map segment is obtained, the prior laser point cloud data optimization thread simultaneously completes the optimization of the previous map segment and starts the optimization of the next map segment, so that the real-time laser data optimization and pose tracking performance can be ensured. The pose graph of the visual SLAM consists of a set of keyframe vertices M, a set of three-dimensional points P, and a set of opposing edges ε. Each key frame vertex miSaving absolute pose T of cameraiAnd the absolute pose T of the slave camera is recordediThree-dimensional feature point x observed underkE P, and the pixel observation z of the three-dimensional characteristic points on the imageik. At two position and posture vertexes miAnd mjEdge E betweenijThere is a co-visibility weight wijI.e. at miAnd mjThe number of commonly visible feature points in (a). Two keyframe pose vertices relative pose is implicitly defined as
$$T_{ij} = T_i \, T_j^{-1}$$
After the camera pose correction, the local map around the keyframe to be corrected has accurate pose control information. In a map segment containing 45 keyframes, the keyframe at the front end of the segment is the tail end of the previously optimized segment, and the tail-end part of the segment has also been corrected with the prior laser point cloud data, so both the front and rear ends of the map segment to be optimized have accurate pose control information. Meanwhile, in order to maintain the consistency of the map, in this embodiment each map segment being optimized overlaps the previous map segment by 10 keyframes, so that the maps at both ends share common constraints; at the same time, the map points observed in the map segment and by the already-optimized keyframes are included in the optimization variables of the segment optimization process, common map points are merged, and a map with consistent accuracy and scale is finally obtained.
S4.4, further optimizing the map after global optimization by adopting a double-window optimization method to obtain a three-dimensional geometric point cloud map;
The invention optimizes the globally optimized map of step S4.3 with a double-window optimization strategy. The optimization window is divided into a front window and a rear window. The front window comprises the 45 keyframes in the map segment, the 10 keyframes overlapping between the front and rear map segments, and the set of map points observed by these keyframes; the front-window constraint is a pose-map point constraint and is used to control the accurate construction of the local map. The rear window comprises the set of already-optimized keyframes in the front window; the rear-window constraints are pose-pose constraints with the front-window keyframes and pose-map point constraints with the front-window map points, and the rear window is used to control the global consistency between the map segment and the optimized map. Meanwhile, since some map points in the front window may already have been optimized many times, and rear-window keyframes whose poses are fixed but which are far from the front-window map have little constraining influence, in order to improve optimization efficiency and reduce memory consumption, map points whose number of optimizations exceeds a preset threshold and optimized keyframes far from the map segment (i.e. more than 4 m away) are removed from the current optimization variables; in this embodiment, the preset threshold is 6.
The elimination strategy for front-window map points is as follows: the front window first initially determines the optimized keyframes, including the newly created keyframes and the keyframes overlapping with the previous map segment; the map points observed by these keyframes are included in the optimization variables of the map segment, provided that the number of times a map point has been optimized is less than or equal to the preset threshold. If the number of optimizations of a map point is greater than the preset threshold, the map point is considered accurate enough and is not added to the optimization variables of the map segment. The front-window map point optimization variable set $V_p$ is finally obtained through this screening.
The elimination strategy for rear-window keyframes is as follows: the map points remaining after the elimination of front-window map points, together with the already-optimized keyframes, are added to the fixed-keyframe optimization variables of the rear window as ground-truth constraints of the map segment to ensure the consistency of the front and rear maps. At the same time, considering that fixed keyframes far from the map segment have little constraining influence on it, fixed keyframes far from the map segment are deleted; only fixed keyframes whose minimum distance to the map segment is less than a preset threshold are retained, and their poses are kept fixed during the optimization as ground-truth constraints. The fixed-keyframe optimization variable set $V_{kf\text{-}outer}$ of the rear window is finally obtained through this screening.
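As an illustration of how the two windows could be assembled (all names and attributes here are hypothetical), the following sketch drops front-window map points that have already been optimized more than 6 times and keeps as fixed constraints only rear-window keyframes within 4 m of the segment:
```python
def build_windows(segment_keyframes, overlap_keyframes, rear_keyframes,
                  max_opt_count=6, max_fixed_dist=4.0):
    # Front window: ~45 keyframes of the segment plus 10 overlapping keyframes.
    front_kfs = segment_keyframes + overlap_keyframes
    # V_p: map points observed by the front window, unless already optimized too often.
    front_points = {p for kf in front_kfs for p in kf.observed_points
                    if p.opt_count <= max_opt_count}
    # V_kf-outer: fixed rear-window keyframes close enough to constrain the segment.
    fixed_kfs = [kf for kf in rear_keyframes
                 if kf.distance_to_segment(front_kfs) < max_fixed_dist]
    return front_kfs, front_points, fixed_kfs
```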
The method for constructing the error equation of the front window comprises the following steps:
and obtaining the map point optimization variable of the map section through the screening of the map points. Based on the map point-pose constraint, an error equation is constructed for the map point optimization variable and the keyframe pose variable of the front window, as shown in the following formula:
$$e_{ij} = z_{ij} - h(\xi_i, p_{ij}), \qquad p_{ij} \in V_p$$
where $h(\cdot)$ is the projection function that maps a three-dimensional feature point $p_{ij}$ to pixel coordinates through the camera perspective projection, $z_{ij}$ is the pixel coordinate value in the image of the $j$-th map point observed in the $i$-th front-window keyframe, $\xi_i$ is the Lie-algebra representation of the camera pose, $p_{ij}$ is the world coordinate of the $j$-th map point observed by the keyframe, and $e_{ij}$ is the error between the observed and calculated values. Summing the squared back-projection errors of all map points in the $i$-th keyframe gives:
$$E(i) = \sum_{j=1}^{n} e_{ij}^{T} e_{ij}$$
where $n$ is the number of map points observed by the keyframe. Summing the $E(i)$ errors of all the optimized keyframes in the front window gives the total error of the map:
$$E_l = \sum_{i=1}^{m} E(i)$$
where $kf_i$ denotes the keyframes contained in the front window, $E_l$ is the overall front-window error function based on the map point-pose constraint, and $m$ is the number of keyframes in the front window.
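A numeric sketch of the front-window residual $e_{ij} = z_{ij} - h(\xi_i, p_{ij})$ under a standard pinhole model; here the pose of keyframe $i$ is represented by a rotation matrix and translation vector rather than the Lie-algebra vector $\xi_i$ used in the patent, and the keyframe data structure is assumed.
```python
import numpy as np

def reprojection_error(z_ij, p_ij, R_i, t_i, K):
    p_cam = R_i @ p_ij + t_i            # map point expressed in the camera frame
    u = K @ (p_cam / p_cam[2])          # perspective projection h(.)
    return z_ij - u[:2]                 # 2-D residual e_ij

def front_window_error(front_keyframes):
    total = 0.0
    for kf in front_keyframes:          # E_l = sum_i sum_j e_ij^T e_ij
        for z, p in kf.observations:    # (pixel observation, 3-D map point) pairs
            e = reprojection_error(z, p, kf.R, kf.t, kf.K)
            total += float(e @ e)
    return total
```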
The error equation construction method of the rear window comprises the following steps:
In order to maintain the consistency of the map, the keyframes of the rear window are incorporated into the total error equation. The constraints that the rear window imposes on the front-window map comprise map point-pose constraints and pose-pose constraints, and error equations are constructed for these two kinds of constraints respectively;
the map point-pose constraint error equation is shown as follows:
$$E_{outer} = \sum_{k=1}^{\alpha} \sum_{j} e_{kj}^{T} e_{kj}, \qquad e_{kj} = z_{kj} - h(\xi_k, p_{kj})$$
where $kf_k$ denotes the keyframes contained in the rear window, $kf_k \in V_{kf\text{-}outer}$; $p_{kj} \in V_p$; $\alpha$ is the number of keyframes in the rear window; and $k$ is the index of the keyframes.
The pose-pose constraint error equation is shown as follows:
$$E_{pose} = \sum_{k=1}^{\alpha} \sum_{i=1}^{m} e_{ki}^{T}\, \Lambda_{ki}\, e_{ki}, \qquad e_{ki} = \log_{SE(3)}\!\left(T_{ki}\, T_i\, T_k^{-1}\right)^{\vee}$$
where $e_{ki}$ is the relative pose error in the SE(3) tangent space, $T_k$ is the pose of the rear-window keyframe vertex $v_k$, and $T_i$ is the pose of the front-window keyframe vertex $v_i$; $\Lambda_{ki}$ is the accuracy (information) matrix of the binary constraint $T_{ki}$,
$$\Lambda_{ki} = w_{ki}\, I$$
where $w_{ki}$ is the number of map points observed jointly by the two keyframe poses and $I$ is the identity matrix.
The error equations obtained from the front window and the rear window are jointly balanced and optimized to optimize the map segment. In order to keep the front and rear maps consistent, the keyframe poses $\xi_k$ of the rear window are treated as fixed values during the segment optimization, set as constants to serve as ground-truth constraints. The pose of the end keyframe in the front window is also set to a constant, because it is the accurate pose obtained by ICP matching against the prior laser point cloud data; it is therefore used as a ground-truth constraint rather than as an unknown optimization variable. Since the total error equation is nonlinear, it is differentiated with respect to all unknown optimization variables (the camera pose variables of the front-window keyframes and the three-dimensional coordinate variables of the map points). For each map point-pose constraint error equation $e_{ij}$ of the front window, the optimization variables comprise the 6 pose variables of the camera (i.e. the 6-degree-of-freedom Lie algebra of the camera pose) and the 3 coordinate variables of the map point (i.e. its three-dimensional coordinates $(x, y, z)$). The Jacobian matrix obtained by differentiation is:
$$J_{ij} = \frac{\partial e_{ij}}{\partial x} = \left[\; \frac{\partial e_{ij}}{\partial \xi_i} \quad \frac{\partial e_{ij}}{\partial p_j} \;\right]$$
where $x$ denotes all keyframe pose variables and map point three-dimensional coordinate variables in the front window, $\xi_i$ is the Lie algebra corresponding to the camera pose matrix of keyframe vertex $v_i$, and $p_j$ is the three-dimensional coordinate of the $j$-th map point observed by keyframe vertex $v_i$. Since the error equation is related only to keyframe vertex $v_i$ and the map points it observes, the partial derivatives with respect to other map points and other keyframe poses are zero, $0_{2\times 3}$ and $0_{2\times 6}$ respectively. For a keyframe vertex $v_k$ in the rear window and a front-window map point $p_j$, the generated map point-pose constraint error equation is $e_{kj}$; the camera pose of keyframe vertex $v_k$ is taken as a constant rather than as an optimization variable, and the derivative is taken only with respect to the three-dimensional coordinate variable of the map point:
$$J_{kj} = \frac{\partial e_{kj}}{\partial p_j}$$
where $p_j$ is the three-dimensional coordinate of the front-window map point $j$ observed by rear-window keyframe vertex $v_k$.
For the pose-pose constraint error equation $e_{ki}$ generated between a rear-window keyframe vertex $v_k$ and a front-window keyframe vertex $v_i$, the error equation contains no map point three-dimensional coordinate variable, so the derivative is taken only with respect to the camera pose variable of the front-window keyframe vertex $v_i$:
$$J_{ki} = \frac{\partial e_{ki}}{\partial \xi_i}$$
where $\xi_i$ is the camera pose of front-window keyframe vertex $v_i$.
The Jacobian matrices obtained by differentiating all the error equations are combined to construct the Hessian matrix $H$, and the error cost function model finally simplifies to:
$$L = H\,\Delta x$$
where $H$ is the Hessian matrix of the total error equation. A Schur elimination method is used to marginalize the sparse Hessian matrix $H$, and the update $\Delta x$, i.e. the corrections of all keyframe pose variables and map point three-dimensional coordinate variables, is finally obtained through a Levenberg-Marquardt (LM) adaptive-step iterative algorithm. The camera poses and map point coordinates among the front-window optimization variables thereby reach an accuracy equivalent to that of the laser data, and the global consistency of the front and rear map segments is guaranteed.
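A compact sketch of the Schur elimination mentioned above: the normal equations are split into a pose block and a map-point block, the point block is marginalized, and the reduced pose system is solved first. The dense matrix layout here is purely illustrative; in practice the point block is block-diagonal and inverted block by block.
```python
import numpy as np

def schur_solve(H, b, n_pose):
    """Solve H @ dx = b by marginalizing the map-point block; n_pose = size of the pose block."""
    Hpp, Hpm = H[:n_pose, :n_pose], H[:n_pose, n_pose:]
    Hmp, Hmm = H[n_pose:, :n_pose], H[n_pose:, n_pose:]
    bp, bm = b[:n_pose], b[n_pose:]
    Hmm_inv = np.linalg.inv(Hmm)                       # block-diagonal in a real BA problem
    S = Hpp - Hpm @ Hmm_inv @ Hmp                      # Schur complement
    dx_pose = np.linalg.solve(S, bp - Hpm @ Hmm_inv @ bm)
    dx_point = Hmm_inv @ (bm - Hmp @ dx_pose)
    return np.concatenate([dx_pose, dx_point])         # corrections Delta x
```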
S5, performing geometric incremental segmentation on the three-dimensional geometric point cloud map, performing object identification and semantic segmentation on each frame of RGB image by adopting an object detector based on a convolutional network, fusing a geometric incremental segmentation result and a semantic segmentation result, and acquiring an object boundary to obtain a 3D geometric segmentation map of scene semantic enhanced geometric segmentation; the method specifically comprises the following steps:
S5.1, the prior laser point cloud data of step S4 guide the RGB-D SLAM to construct a high-precision three-dimensional geometric point cloud map of the indoor scene, and geometric incremental segmentation is performed on the constructed map; to ensure the real-time performance of the subsequently constructed semantic map, the depth images acquired by the RGB-D camera are geometrically segmented frame by frame. Since most real-world objects are composed of convex shapes, the method adopts a normal-vector edge analysis with a concavity penalty mechanism together with a three-dimensional vertex distance threshold segmentation to obtain the geometric segmentation map of the depth image.
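A minimal per-frame sketch of such a segmentation is given below, assuming an organized depth image and pinhole intrinsics. The 4-neighbour region growing, the specific convexity test, and the thresholds angle_thresh and dist_thresh are illustrative assumptions; the source only specifies that a concavity-penalised normal-vector edge analysis and a vertex distance threshold are used.

```python
import numpy as np
from collections import deque

def segment_depth(depth, fx, fy, cx, cy, angle_thresh=0.96, dist_thresh=0.03):
    """Grow convex segments over an organized depth image and return a label map.

    Neighbouring pixels stay in one segment only if their back-projected vertices
    are close (distance threshold) and the local surface is not strongly concave
    (normal-vector / concavity penalty)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.astype(np.float64)
    pts = np.dstack(((u - cx) * z / fx, (v - cy) * z / fy, z))   # vertex map

    # normals from finite differences of the vertex map
    dx = np.gradient(pts, axis=1)
    dy = np.gradient(pts, axis=0)
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n, axis=2, keepdims=True) + 1e-9

    labels = -np.ones((h, w), dtype=np.int32)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != -1 or z[sy, sx] <= 0:
                continue
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < h and 0 <= nx < w) or labels[ny, nx] != -1:
                        continue
                    step = pts[ny, nx] - pts[y, x]
                    close = np.linalg.norm(step) < dist_thresh
                    convex = np.dot(n[y, x], step) <= 0            # concavity penalty
                    similar = np.dot(n[y, x], n[ny, nx]) > angle_thresh
                    if close and (convex or similar):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels
```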
S5.2, performing semantic segmentation and object identification on each frame of RGB image by adopting an object detector based on a convolutional network to obtain class information and class probability of an object and an object boundary segmentation map;
To identify object instances in the scene, an object detector based on a convolutional neural network (CNN) is applied to the current input RGB image. Recent research shows that CNN-based object detectors can achieve real-time performance while maintaining high recognition accuracy; here Fast-RCNN performs real-time frame-by-frame object detection on the input images. Given an input image I_t(u), with pixel coordinates u = (x, y), 0 ≤ x < W, 0 ≤ y < H, Fast-RCNN outputs a set of bounding boxes B_i, 0 < i < K, where K is the number of output bounding boxes, and assigns to each bounding box B_i a classification class c_i and the probability p(c_i) of that class.
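The sketch below runs a per-frame detector in the spirit of this step, but substitutes torchvision's pretrained Faster R-CNN for the Fast-RCNN named above; the score threshold and the helper name detect_objects are assumptions made for illustration.

```python
import torch
import torchvision

# Stand-in detector: torchvision's pretrained Faster R-CNN (not the patent's network).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_objects(rgb_hwc_uint8, score_thresh=0.5):
    """Return bounding boxes B_i, class labels c_i and class probabilities p(c_i)
    for one H x W x 3 uint8 RGB frame."""
    img = torch.from_numpy(rgb_hwc_uint8).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([img])[0]                 # dict: 'boxes', 'labels', 'scores'
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```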
S5.3, the geometric segmentation map of the depth image and the object boundary segmentation map are overlaid and statistics are accumulated; according to whether different segment labels of the geometric segmentation map of the depth image fall into the same first object bounding box of the object boundary segmentation map, it is detected whether one object has been divided into different parts; if so, the boundary of the object in the object boundary segmentation map is updated, yielding the 3D geometric segmentation map of scene semantic enhanced geometric segmentation;
Since the geometric segmentation of the depth image relies only on geometric information, some whole objects in the target scene may be split into different segment parts (for example, a chair split into a backrest part and a seat part), so the segmentation exhibits over-segmentation. The embodiment therefore enhances the geometric segmentation with the semantic information provided, in the form of bounding boxes, by the CNN object detection and semantic segmentation results, specifically as follows:
The geometric segmentation map of the depth image is overlaid with the CNN object detection result of the RGB image and statistics are accumulated. If two segment labels l_n and l_m of the geometric segmentation map of the depth image fall into the same first object bounding box B_i of the object boundary segmentation map and the segmented regions satisfy the following condition, the regions labeled l_n and l_m are regarded as belonging to the same object, and B_i is taken as the boundary of that object:

[Equation image in original: the condition set Ψ, a ratio test over the pixel sets defined below with threshold σ_same]

where Ψ is the set of conditions under which the geometric segments labeled l_n and l_m are different parts of the same object, one term denotes the set of pixels labeled with class l_i, another denotes the set of pixels inside bounding box B_i that are labeled with class l_i, and σ_same is a ratio threshold. The condition examines the target label l_i within bounding box B_i to verify whether label l_i is part of the object instance; if the condition is satisfied, the geometric segments (l_n, l_m) are merged as different parts of the same object.
By adopting this deep-learning object-detection enhanced geometric segmentation strategy, real-time segmentation and higher segmentation precision can be ensured, and complete and consistent semantic information is provided for the subsequent high-precision semantic map construction.
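A minimal sketch of the overlap test above, assuming the condition Ψ reduces to a per-label ratio test (the exact condition appears only as an equation image in the source); sigma_same and the helper name merge_segments_in_box are illustrative.

```python
import numpy as np

def merge_segments_in_box(geo_labels, box, sigma_same=0.8):
    """Merge geometric segment labels that look like parts of one detected object.

    geo_labels : H x W integer label map from the depth-image geometric segmentation
    box        : (x0, y0, x1, y1) first object bounding box B_i from the detector
    """
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    inside = np.zeros(geo_labels.shape, dtype=bool)
    inside[y0:y1, x0:x1] = True

    parts = []
    for lbl in np.unique(geo_labels[inside]):
        if lbl < 0:                                  # skip invalid / background
            continue
        pix = geo_labels == lbl
        # assumed reading of the condition: most of this label's pixels lie in B_i
        if (pix & inside).sum() / pix.sum() > sigma_same:
            parts.append(lbl)

    merged = geo_labels.copy()
    for lbl in parts[1:]:                            # e.g. backrest + seat of a chair
        merged[merged == lbl] = parts[0]
    return merged
```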
S6, the semantically enhanced 3D geometric segmentation map is updated with the camera pose correction result of step S3, and semantic association and segmentation probability distribution updating are performed on the objects in the scene based on the updated result, completing the construction of the semantic map. While traditional methods assign classification class probabilities to each element that makes up the 3D map, the present invention assigns class probabilities to each neural network semantic segmentation instance. In this way, each semantic object instance identified by the neural network segmentation is assigned a discrete probability distribution P(c | I_{1...t}) and the corresponding probability confidence. Assuming N classification classes, the spatial complexity of storing the classification class probabilities is O(N_l), where N_l is the number of segmentation labels, compared with O(N_s) for conventional methods, where N_s is the number of elements of the three-dimensional map. Since N_s >> N_l, the invention reduces memory occupation and computational complexity, and assigning semantics by region rather than by element accords with the more natural way humans identify objects through semantic labels.

Based on this concept, in order to correctly fuse the output of the object detector with the three-dimensional geometric segments, the embodiment adopts a confidence-based method to update in real time the class probability of the label l_i assigned to each classification category. Using the camera pose T_t estimated by the SLAM framework and the 3D position x(k) associated with each surface s_k, the semantically enhanced 3D geometric segmentation map is rendered onto the current image plane, with each segmented region represented by its segmentation class label l_i, generating an associated label-rendered segmentation map. An object detector based on a convolutional network is then applied to the label-rendered segmentation map to obtain a second object bounding box and its corresponding segmentation probability; the second object bounding box is associated with an object bounding box in the semantically enhanced 3D geometric segmentation map, and the object segmentation probability in that map is updated with the segmentation probability corresponding to the second object bounding box, completing the construction of the high-precision semantic map.
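A minimal sketch of the per-instance probability bookkeeping described above: each segmentation instance keeps one discrete class distribution (O(N_l) storage rather than O(N_s)), updated whenever a detection on the label-rendered segmentation map is associated with it. The multiplicative fusion used here is an assumption; the source states only that a confidence-based update is applied.

```python
import numpy as np

class InstanceBelief:
    """Discrete class distribution P(c | I_1..t) kept per segmentation instance."""

    def __init__(self, num_classes):
        self.p = np.full(num_classes, 1.0 / num_classes)    # uniform prior

    def update(self, detected_class, score):
        # Spread the detector confidence over the other classes, fuse, renormalize.
        likelihood = np.full_like(self.p, (1.0 - score) / (len(self.p) - 1))
        likelihood[detected_class] = score
        self.p *= likelihood
        self.p /= self.p.sum()

    def best(self):
        c = int(np.argmax(self.p))
        return c, float(self.p[c])

# After associating a "second object bounding box" with an instance id:
#     beliefs[instance_id].update(detected_class, detection_score)
# refines that instance's segmentation probability in the semantic map.
```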
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements of the technical solutions of the present invention made by those skilled in the art without departing from the spirit of the invention shall fall within the protection scope defined by the claims of the present invention.

Claims (10)

1. A semantic mapping and positioning method based on the fusion of a prior laser point cloud and a depth map is characterized by comprising the following steps:
s1, collecting integral prior laser point cloud data of a scene, and preprocessing the collected data;
s2, acquiring a depth image and an RGB image of a scene in real time through a camera fixed on a mobile platform, generating an RGB-D point cloud based on the depth image, and performing initialization registration on preprocessed prior laser point cloud data and the RGB-D point cloud by adopting a deep learning positioning neural network and a nearest neighbor ICP iterative algorithm;
s3, providing camera pose constraint through the registered prior laser point cloud data, correcting the camera pose and completing camera positioning;
s4, constructing a three-dimensional geometric point cloud map by adopting a multithread map segmentation front-back window optimization method based on camera pose constraints provided by prior laser point cloud data;
s5, performing geometric incremental segmentation on the three-dimensional geometric point cloud map, performing object identification and semantic segmentation on each frame of RGB image by adopting an object detector based on a convolutional network, fusing a geometric incremental segmentation result and a semantic segmentation result, and acquiring an object boundary to obtain a 3D geometric segmentation map of scene semantic enhanced geometric segmentation;
s6, updating the 3D geometric segmentation map subjected to semantic enhancement geometric segmentation through the correction result of the camera pose in the step S3, performing semantic association and segmentation probability distribution updating on the objects in the scene based on the updating result, and completing construction of the semantic map.
2. The semantic mapping and localization method based on the fusion of the prior laser point cloud and the depth map of claim 1, wherein in the step S2, the method for performing the initialization registration on the preprocessed prior laser point cloud data and the RGB-D point cloud comprises:
firstly, constructing a deep learning network, aligning preprocessed prior laser point cloud data with the RGB-D point cloud to obtain continuous video frames marked by a camera pose, and training the deep learning positioning neural network through continuous video frame images and the camera pose;
secondly, inputting RGB image video frames collected by a camera into a trained deep learning positioning neural network to obtain an initial camera pose;
and updating the initial camera pose by adopting an ICP iterative algorithm to complete the initialization registration of the prior laser point cloud data and the RGB-D point cloud.
3. The semantic mapping and localization method based on the fusion of the prior laser point cloud and the depth map according to claim 1, wherein the step S3 specifically comprises the following steps:
s3.1, selecting a key frame to be corrected from a depth image acquired by a camera in real time, and generating RGB-D point cloud data through dynamic object elimination and depth image data denoising;
s3.2, acquiring local prior laser point cloud data according to the initial posture of the camera generated by the visual tracking thread;
and S3.3, based on the local prior laser point cloud data, acquiring the camera pose of the key frame to be corrected by adopting a point-to-surface ICP iterative algorithm, and transmitting the camera pose of the key frame to be corrected to a local map through a key frame visibility map.
4. The method of semantic mapping and localization based on a priori laser point cloud and depth map fusion of claim 3, wherein in step S3.1, RGB-D point cloud data is generated based on the depth information of the camera and the pose of the tracking thread.
5. The semantic mapping and localization method based on the fusion of the prior laser point cloud and the depth map of claim 1, wherein the step S4 comprises:
s4.1, acquiring key frames and 3-D ORB characteristic points of the depth images, and constructing a sparse map;
s4.2, adopting camera pose constraint provided by prior laser point cloud data to locally optimize the constructed sparse map;
s4.3, performing global optimization on the map subjected to local optimization by adopting a multithreading map segmentation optimization method;
and S4.4, further optimizing the map after global optimization by adopting a double-window optimization method to obtain the three-dimensional geometric point cloud map.
6. The method for semantic mapping and localization based on a priori laser point cloud and depth map fusion of claim 5, wherein in the step S4.4, the dual-window optimization method comprises a front window and a rear window; the front window constraint is a pose-map point constraint, and the front window is used for controlling the construction of a local map; the rear window constraint is a pose-pose constraint between the rear window and the front window key frame, and a pose-map point constraint between the rear window and the front window map point; the rear window is used to control global consistency of the map.
7. The method of claim 6, wherein the front window comprises a plurality of key frames in a map segment, a plurality of key frames with overlapping front and back map segments, and a set of map points observed by the key frames; the back window includes the optimized set of keyframes in the front window.
8. The semantic mapping and localization method based on the fusion of the prior laser point cloud and the depth map according to claim 1, wherein the step S5 specifically includes:
s5.1, segmenting the three-dimensional geometric point cloud map by adopting a normal vector edge analysis method of a concave punishment mechanism and a three-dimensional vertex distance threshold segmentation method to obtain a geometric segmentation map of the depth image;
s5.2, performing semantic segmentation and object identification on each frame of RGB image by adopting an object detector based on a convolutional network to obtain class information and class probability of an object and an object boundary segmentation map;
s5.3, carrying out superposition statistics on the geometric segmentation map and the object boundary segmentation map of the depth image, detecting whether the same object is divided into different parts or not according to the state that different image labels in the geometric segmentation map of the depth image fall into a first object boundary frame of the object boundary segmentation map, if so, updating the boundary of the object in the object boundary segmentation map, and obtaining the 3D geometric segmentation map of the scene semantic enhanced geometric segmentation.
9. The method for semantic mapping and localization based on a fusion of a priori laser point cloud and a depth map as claimed in claim 8, wherein in step S6, a confidence-based method is used to update the segmentation probability of each object in the 3D geometric segmentation map assigned to the semantically enhanced geometric segmentation in real time.
10. The semantic mapping and localization method based on the fusion of the prior laser point cloud and the depth map according to claim 9, wherein the confidence-based method specifically comprises:
based on the camera pose correction result in step S3, back-projecting the 3D geometric segmentation map of the semantic enhanced geometric segmentation to the current camera plane to obtain a label rendering segmentation map; and adopting an object detector based on a convolutional network to carry out object identification on the label rendering segmentation map to obtain a second object boundary box and a corresponding segmentation probability, associating the second object boundary box with an object boundary box in the 3D geometric segmentation map of the semantic enhanced geometric segmentation, and updating the object segmentation probability in the 3D geometric segmentation map of the semantic enhanced geometric segmentation through the segmentation probability corresponding to the second object boundary box to complete the construction of the semantic map.
CN202011213908.2A 2020-11-04 2020-11-04 Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map Active CN112258618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213908.2A CN112258618B (en) 2020-11-04 2020-11-04 Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map

Publications (2)

Publication Number Publication Date
CN112258618A true CN112258618A (en) 2021-01-22
CN112258618B CN112258618B (en) 2021-05-14

Family

ID=74268696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213908.2A Active CN112258618B (en) 2020-11-04 2020-11-04 Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map

Country Status (1)

Country Link
CN (1) CN112258618B (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690338A (en) * 2021-07-30 2023-02-03 北京图森智途科技有限公司 Map construction method, map construction device, map construction equipment and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160154999A1 (en) * 2014-12-02 2016-06-02 Nokia Technologies Oy Objection recognition in a 3d scene
US20180288385A1 (en) * 2017-04-03 2018-10-04 Sony Corporation Using super imposition to render a 3d depth map
CN107862735A (en) * 2017-09-22 2018-03-30 北京航空航天大学青岛研究院 A kind of RGBD method for reconstructing three-dimensional scene based on structural information
CN108509918A (en) * 2018-04-03 2018-09-07 中国人民解放军国防科技大学 Target detection and tracking method fusing laser point cloud and image
CN109597096A (en) * 2018-12-01 2019-04-09 北醒(北京)光子科技有限公司 A kind of laser radar point cloud processing system and method
CN111609852A (en) * 2019-02-25 2020-09-01 北京奇虎科技有限公司 Semantic map construction method, sweeping robot and electronic equipment
CN110378196A (en) * 2019-05-29 2019-10-25 电子科技大学 A kind of road vision detection method of combination laser point cloud data
CN110349250A (en) * 2019-06-28 2019-10-18 浙江大学 A kind of three-dimensional rebuilding method of the indoor dynamic scene based on RGBD camera
CN110666791A (en) * 2019-08-29 2020-01-10 江苏大学 RGBD robot nursing system and method based on deep learning
CN110969649A (en) * 2019-11-29 2020-04-07 上海有个机器人有限公司 Matching evaluation method, medium, terminal and device of laser point cloud and map
CN110956651A (en) * 2019-12-16 2020-04-03 哈尔滨工业大学 Terrain semantic perception method based on fusion of vision and vibrotactile sense
CN111080659A (en) * 2019-12-19 2020-04-28 哈尔滨工业大学 Environmental semantic perception method based on visual information
CN111242000A (en) * 2020-01-09 2020-06-05 电子科技大学 Road edge detection method combining laser point cloud steering
CN111209978A (en) * 2020-04-20 2020-05-29 浙江欣奕华智能科技有限公司 Three-dimensional visual repositioning method and device, computing equipment and storage medium
CN111581313A (en) * 2020-04-25 2020-08-25 华南理工大学 Semantic SLAM robustness improvement method based on instance segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KONG, Liyan: "RGB-D Based 3D Point Cloud Object Segmentation", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112902953A (en) * 2021-01-26 2021-06-04 中国科学院国家空间科学中心 Autonomous pose measurement method based on SLAM technology
CN113156419A (en) * 2021-02-24 2021-07-23 清华大学 Specific language navigation method based on radar and visual multi-mode fusion
CN113222940B (en) * 2021-05-17 2022-07-12 哈尔滨工业大学 Method for automatically grabbing workpiece by robot based on RGB-D image and CAD model
CN113222940A (en) * 2021-05-17 2021-08-06 哈尔滨工业大学 Method for automatically grabbing workpiece by robot based on RGB-D image and CAD model
CN113409331A (en) * 2021-06-08 2021-09-17 Oppo广东移动通信有限公司 Image processing method, image processing apparatus, terminal, and readable storage medium
CN113378694A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Method and device for generating target detection and positioning system and target detection and positioning
CN113409331B (en) * 2021-06-08 2024-04-12 Oppo广东移动通信有限公司 Image processing method, image processing device, terminal and readable storage medium
CN113432600A (en) * 2021-06-09 2021-09-24 北京科技大学 Robot instant positioning and map construction method and system based on multiple information sources
CN113432600B (en) * 2021-06-09 2022-08-16 北京科技大学 Robot instant positioning and map construction method and system based on multiple information sources
CN113362247B (en) * 2021-06-11 2023-08-15 山东大学 Semantic real scene three-dimensional reconstruction method and system for laser fusion multi-view camera
CN113362247A (en) * 2021-06-11 2021-09-07 山东大学 Semantic live-action three-dimensional reconstruction method and system of laser fusion multi-view camera
CN113625288A (en) * 2021-06-15 2021-11-09 中国科学院自动化研究所 Camera and laser radar pose calibration method and device based on point cloud registration
CN113689540A (en) * 2021-07-22 2021-11-23 清华大学 Object reconstruction method and device based on RGB video
CN113689540B (en) * 2021-07-22 2024-04-23 清华大学 Object reconstruction method and device based on RGB video
CN113587933A (en) * 2021-07-29 2021-11-02 山东山速机器人科技有限公司 Indoor mobile robot positioning method based on branch-and-bound algorithm
CN113587933B (en) * 2021-07-29 2024-02-02 山东山速机器人科技有限公司 Indoor mobile robot positioning method based on branch-and-bound algorithm
CN113739786A (en) * 2021-07-30 2021-12-03 国网江苏省电力有限公司电力科学研究院 Indoor environment sensing method, device and equipment for quadruped robot
CN113778096A (en) * 2021-09-15 2021-12-10 上海景吾智能科技有限公司 Positioning and model building method and system for indoor robot
CN113808262A (en) * 2021-10-08 2021-12-17 合肥安达创展科技股份有限公司 Building model generation system based on depth map analysis
CN114216452A (en) * 2021-12-06 2022-03-22 北京云迹科技股份有限公司 Robot positioning method and device
CN114216452B (en) * 2021-12-06 2024-03-19 北京云迹科技股份有限公司 Positioning method and device for robot
CN114565616A (en) * 2022-03-03 2022-05-31 湖南大学无锡智能控制研究院 Unstructured road state parameter estimation method and system
CN115019270A (en) * 2022-05-31 2022-09-06 电子科技大学 Automatic driving night target detection method based on sparse point cloud prior information
CN115019270B (en) * 2022-05-31 2024-04-19 电子科技大学 Automatic driving night target detection method based on sparse point cloud priori information
CN114742893A (en) * 2022-06-09 2022-07-12 浙江大学湖州研究院 3D laser data training and rapid positioning method based on deep learning
CN115187843A (en) * 2022-07-28 2022-10-14 中国测绘科学研究院 Depth map fusion method based on object space voxel and geometric feature constraint
CN115187843B (en) * 2022-07-28 2023-03-14 中国测绘科学研究院 Depth map fusion method based on object space voxel and geometric feature constraint
CN115841510A (en) * 2022-09-19 2023-03-24 南京航空航天大学 Method for estimating depth and normal vector of single image in desktop curling scene based on geometric knowledge and deep learning
CN115841510B (en) * 2022-09-19 2023-12-12 南京航空航天大学 Depth and normal vector estimation method of single image in desktop curling scene based on geometric knowledge and deep learning
CN115937383A (en) * 2022-09-21 2023-04-07 北京字跳网络技术有限公司 Method and device for rendering image, electronic equipment and storage medium
CN115937383B (en) * 2022-09-21 2023-10-10 北京字跳网络技术有限公司 Method, device, electronic equipment and storage medium for rendering image
CN115376109A (en) * 2022-10-25 2022-11-22 杭州华橙软件技术有限公司 Obstacle detection method, obstacle detection device, and storage medium
CN116299500A (en) * 2022-12-14 2023-06-23 江苏集萃清联智控科技有限公司 Laser SLAM positioning method and device integrating target detection and tracking
CN116299500B (en) * 2022-12-14 2024-03-15 江苏集萃清联智控科技有限公司 Laser SLAM positioning method and device integrating target detection and tracking
CN116817892A (en) * 2023-08-28 2023-09-29 之江实验室 Cloud integrated unmanned aerial vehicle route positioning method and system based on visual semantic map
CN116817892B (en) * 2023-08-28 2023-12-19 之江实验室 Cloud integrated unmanned aerial vehicle route positioning method and system based on visual semantic map
CN117739954A (en) * 2024-02-07 2024-03-22 未来机器人(深圳)有限公司 Map local updating method and device and electronic equipment

Also Published As

Publication number Publication date
CN112258618B (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
Kim et al. Deep learning based vehicle position and orientation estimation via inverse perspective mapping image
EP3977346A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN111862201B (en) Deep learning-based spatial non-cooperative target relative pose estimation method
CN111201451A (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN113537208A (en) Visual positioning method and system based on semantic ORB-SLAM technology
CN114424250A (en) Structural modeling
Zhao et al. OFM-SLAM: a visual semantic SLAM for dynamic indoor environments
CN111998862B (en) BNN-based dense binocular SLAM method
Zhang et al. Vehicle global 6-DoF pose estimation under traffic surveillance camera
Rangesh et al. Ground plane polling for 6dof pose estimation of objects on the road
Alcantarilla et al. Large-scale dense 3D reconstruction from stereo imagery
Han et al. Urban scene LOD vectorized modeling from photogrammetry meshes
Zhu et al. A review of 6d object pose estimation
CN115018999A (en) Multi-robot-cooperation dense point cloud map construction method and device
Herb et al. Lightweight semantic mesh mapping for autonomous vehicles
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
WO2023030062A1 (en) Flight control method and apparatus for unmanned aerial vehicle, and device, medium and program
Wirges et al. Self-supervised flow estimation using geometric regularization with applications to camera image and grid map sequences
CN114972491A (en) Visual SLAM method, electronic device, storage medium and product
CN114202701A (en) Unmanned aerial vehicle vision repositioning method based on object semantics
Leishman et al. Robust Motion Estimation with RBG-D Cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant