CN115035260A - Indoor mobile robot three-dimensional semantic map construction method - Google Patents
Indoor mobile robot three-dimensional semantic map construction method
- Publication number
- CN115035260A (application number CN202210594240.3A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- key frame
- semantic
- map
- dimensional
- Prior art date: 2022-05-27
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/05—Geographic models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Abstract
The invention discloses a three-dimensional semantic map construction method for an indoor mobile robot, which comprises the steps of running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and pose of the robot, and forming a space point cloud by back projection of an RGB image and a depth image of a key frame; filtering the spatial point clouds of the key frames and splicing the spatial point clouds according to the poses to form a three-dimensional dense point cloud map; performing target identification on the RGB image of the key frame by using a YOLO V5 network to obtain 2D semantic information, and obtaining a 3D semantic label according to back projection; partitioning the dense point cloud map through a point cloud partitioning algorithm; and fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct and obtain the 3D semantic map. According to the method, the RGB-D information acquired by the RGB-D depth camera is processed, and the semantic information and the SLAM result are fused to form the three-dimensional map with more abundant information.
Description
Technical Field
The invention belongs to the technical field of robot semantic map construction, relates to a robot map construction method, and particularly relates to a construction method of an indoor mobile robot three-dimensional semantic map.
Background
With the rapid development of intelligent technology, the mobile robot industry is booming. Advanced mobile robots carry a number of task-specific sensors and complete assigned tasks in a variety of scenes without direct human operation. For intelligent robots whose application scene is indoors, the map construction task is generally completed through autonomous exploration by the robot.
At present, typical semantic map construction methods for mobile robots fall into two categories: methods based on visual images and methods based on point cloud processing. For example, Zhang Cheng et al., in the paper "Semantic SLAM algorithm in dynamic environments", use monocular image sequences to carry out semantic segmentation, visual odometry and related research, and finally construct a global semantic map; other researchers, in the paper "Structure-variable robust semantic SLAM algorithm", use point cloud images to study the robustness of semantic SLAM.
The prior art has the following problems in constructing semantic maps for mobile robots: first, visual images provide only RGB color information, from which three-dimensional spatial information such as point clouds is difficult to extract; second, it is difficult to obtain the semantics of the objects represented by point cloud data, and difficult to fuse semantic information with the SLAM mapping and localization results.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem to be solved by the invention is to provide a three-dimensional semantic map construction method for an indoor mobile robot that processes the RGB-D information collected by an RGB-D depth camera and fuses semantic information with the SLAM results to form a three-dimensional map with richer information.
In order to solve the technical problem, the invention provides a method for constructing a three-dimensional semantic map of an indoor mobile robot, which comprises the following steps:
step 1: running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding poses of the robot, and forming a single-frame space point cloud by back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
step 3: performing target identification on the RGB image corresponding to the key frame by using a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining 3D semantic labels by back projection;
step 4: partitioning the dense point cloud map by a point cloud segmentation algorithm;
step 5: fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct the 3D semantic map.
Further, step 1 comprises:
Processing of key frames is added in the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to each key frame. Specifically, the following operation is performed on all pixels of the key frame to obtain the point cloud data of the key frame in the camera coordinate system: let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to this pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system:
P_c = d · K^(-1) · [u, v, 1]^T
After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. Applying this operation to all pixels of the key frame and using the point cloud construction functions of the PCL (Point Cloud Library) yields the point cloud data of the key frame in the camera coordinate system.
The key-frame point cloud in the camera coordinate system is then converted into point cloud data in the world coordinate system: let P_c be transformed to the point P_w(X, Y, Z) in the world coordinate system; in homogeneous coordinates P_w(X, Y, Z) satisfies:
P_w = T_cw^(-1) · P_c
where T_cw is the camera pose.
Further, the filtering of the spatial point cloud corresponding to the key frame in step 2 comprises:
Outliers in the point cloud data are removed with the outlier removal filter of the PCL point cloud library: the average distance from each point of the spatial point cloud to all of its neighboring points is computed, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data. The point cloud data is then down-sampled by voxel grid filtering.
Further, the outlier distance threshold d_pt is determined as follows:
the average distances of all points in the spatial point cloud are assumed to obey the normal distribution N(μ, σ²); with the outlier judgment threshold parameter H_out, the outlier distance threshold d_pt is:
d_pt = μ + H_out · σ.
Further, forming the three-dimensional dense point cloud map in step 2 by stitching according to the poses corresponding to the key frames comprises:
The LoopClosing thread in the ORB-SLAM2 system optimizes the poses of the key frames in the detected closed loop; the key frames whose camera poses have changed after closed-loop optimization are extracted, their point cloud data in the world coordinate system is deleted, the corresponding point cloud data is regenerated according to the camera poses after closed-loop optimization and projected into the world coordinate system, and the three-dimensional dense point cloud map is constructed.
Further, step3 comprises:
step 3.1: establishing data set based on YOLO V5 network model
Photographing target objects to obtain a data set, annotating the collected data set with the labeling program LabelImg, and selecting a training set and a test set from the photographed pictures; selecting pictures containing the target objects from the COCO data set as a further training set and test set; carrying out data configuration and model training parameter setting for the YOLO V5 network, the data configuration comprising the number of object classes and the object class names; training the parameters of the YOLO V5 network model with the training set and verifying the network model with the test set;
step 3.2: extracting 2D object semantics and constructing 3D semantic tags
Enabling a YOLO V5 network to call an RGB-D depth camera to collect RGB image information, and acquiring a 2D recognition box in the RGB image in real time to serve as 2D target semantic information;
An interface is added to the Tracking thread of ORB-SLAM2 to read the key frames generated in real time, and target detection is performed on each newly generated key frame with the trained YOLO V5 network model to obtain the pixel coordinates of the center of each target object and the length and width of its bounding box. When the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixels inside the 2D bounding box of a target object is changed to preset color information, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label. The obtained key-frame point cloud data is then back-projected into the world coordinate system together with the pose estimation result to complete point cloud stitching, yielding a global dense map with 3D semantic labels, i.e. color information is added to the point clouds of target objects in the point cloud map.
Further, partitioning the dense point cloud map with a point cloud segmentation algorithm in step 4 comprises:
Step 4.1: segmenting the dense point cloud map with a supervoxel clustering method, converting the point cloud into a surface-patch structure, and representing the segmentation result by an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε contains the edges (v_i, v_j) connecting adjacent surface patches; each surface patch has a centroid c_i and a normal vector n_i;
Step 4.2: after the surface patches are obtained, processing them with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, computing the distance d(c_i, pc_m) from the patch centroid c_i to the plane pc_m, and setting a threshold δ to obtain all patches whose distance to the plane pc_m is within δ, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}; defining a plane weight D(pc_m), where η is the constraint that distinguishes foreground objects from the background; the graph partitioning problem is solved by minimizing the energy:
P* = arg min_P E(P)
where E(P) is the fitting energy, defined over the vertex set V and edge set ε of the adjacency graph G = {V, ε} corresponding to the cluster segmentation.
Further, step 5 comprises:
Obtaining a bounding box of the target object's point cloud from the 2D recognition box, the width b of the bounding box being the maximum depth difference over all pixels inside the 2D recognition box: let the set of all pixels inside the 2D recognition box be P = {P_1, P_2, P_3, …, P_n}; then
b = max(Depth(P_i) − Depth(P_j)),  P_i, P_j ∈ P
where Depth(P_i) is the depth value corresponding to pixel P_i; the 2D pixels are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
and examining the result of the point cloud segmentation: only when all points of a cluster lie inside the bounding box are the random color attribute labels of that cluster replaced with the 3D semantic labels obtained in step 3, yielding the semantic map.
Further, the RGB-D depth camera uses a Kinect V2 camera.
Further, the YOLO V5 network employs a YOLO V5s network.
The invention has the beneficial effects that: the method obtains RGB-D information of an indoor environment by carrying an RGB-D depth camera (such as a Kinect camera) through a robot, finishes target recognition based on a key frame generated by an ORB-SLAM2 system, obtains semantic information through a trained YOLO v5s network model, and then converts the semantic information into a 3D semantic label. And generating a three-dimensional dense point cloud map according to the RGB map and the depth map corresponding to the key frame and the pose information of the key frame, and fusing the semantic tags and the dense point cloud map to construct a three-dimensional semantic map with rich information after dividing the point cloud map on the basis.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a flow chart of dense point cloud construction;
FIG. 3 is a dense point cloud map;
FIG. 4 is a sample of a portion of a data set collected;
FIG. 5 is a calibration of a data sample;
FIG. 6 shows the real-time results of YOLO v5s detection;
FIG. 7 is a flow chart for generating point cloud labels;
FIG. 8 is a point cloud label diagram in a real scene;
FIG. 9 is a super voxel cluster segmentation result;
FIG. 10 is a diagram of a point cloud segmentation result;
fig. 11 is a semantic map after fusion.
Detailed Description
The invention is further described with reference to the drawings and examples.
A Kinect V2 camera is selected as the visual sensor. Key frames during camera motion are obtained with the ORB-SLAM2 algorithm, and the camera poses are obtained through feature point matching, back-end optimization and other processes on the key frames. The RGB image and depth image corresponding to each key frame are back-projected to form a spatial point cloud; after filtering, the spatial point clouds are stitched using the camera pose information to form a three-dimensional dense point cloud map. Meanwhile, target recognition is performed on the RGB image of each key frame with a YOLO V5 network model to obtain the semantic information of targets, which is projected onto the spatial point cloud to construct 3D semantic labels. The dense point cloud map is then segmented with a point cloud segmentation algorithm, and finally the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map.
1. Improved ORB-SLAM2 construction of three-dimensional dense point cloud
The method realizes the goal of acquiring the dense point cloud by improving partial threads in the ORB-SLAM2, and the specific flow is shown in FIG. 2, and the main steps are as follows:
step 1: back projection of two-dimensional pixel point to three-dimensional point cloud data
The Tracking thread of the ORB-SLAM2 system is modified to add processing of key frames and obtain the 3D point cloud data corresponding to each key frame. The principle is to back-project the two-dimensional pixel coordinates and depth data of the environment into the world coordinate system through the camera intrinsics and extrinsics. Let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to this pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system:
P_c = d · K^(-1) · [u, v, 1]^T
After the spatial coordinates of P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. Applying this operation to all pixels of the key frame and using the point cloud construction functions of the PCL library yields the point cloud data of the key frame in the camera coordinate system. After the point cloud data corresponding to the key frame is obtained, it is converted into a point cloud in the world coordinate system: let P_c be transformed to the point P_w(X, Y, Z) in the world coordinate system, where T_cw is the camera pose; in homogeneous coordinates
P_w = T_cw^(-1) · P_c
step 2: dense point cloud data filtering
The outlier removal filter in the PCL point cloud library is used to remove abnormal noise points from the point cloud data; the point cloud obtained after outlier removal is relatively smooth and suitable for the subsequent point cloud clustering. The point cloud data is then down-sampled by voxel grid filtering. During outlier removal, the average distance from each point to all of its neighboring points is computed; the resulting distances are assumed to follow a Gaussian distribution whose shape is determined by the parameters (μ, σ²), and points whose average distance lies outside the standard range are considered outliers and removed from the data.
Assuming the distances of all points in the point cloud obey the normal distribution N(μ, σ²) and the outlier judgment threshold parameter is H_out, the outlier distance threshold d_pt can be expressed as:
d_pt = μ + H_out · σ (3)
That is, points whose average distance lies outside the interval (0, d_pt) are considered outliers and rejected.
In experiments on a hardware platform with an Intel Core i5-10400F CPU and an NVIDIA GTX 1650, the best results were obtained when the voxel grid filtering resolution was set to 0.01, i.e. a voxel grid size of 1 cm × 1 cm × 1 cm.
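A minimal sketch of this filtering stage with PCL is given below. PCL's StatisticalOutlierRemoval implements the mean-neighbor-distance rule d_pt = μ + H_out·σ (setStddevMulThresh playing the role of H_out), and VoxelGrid performs the 1 cm down-sampling; the neighbor count of 50 and the default H_out value are assumptions for the example.

```cpp
// Sketch only: statistical outlier removal (threshold mu + H_out * sigma over
// mean neighbor distances) followed by 1 cm voxel-grid down-sampling with PCL.
// The neighbor count (50) and the default H_out value are illustrative assumptions.
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>

pcl::PointCloud<pcl::PointXYZRGB>::Ptr filterKeyframeCloud(
    const pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud, float H_out = 1.0f)
{
    // 1) Remove outliers: points whose mean distance to their 50 nearest
    //    neighbors exceeds mu + H_out * sigma are discarded.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr inliers(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::StatisticalOutlierRemoval<pcl::PointXYZRGB> sor;
    sor.setInputCloud(cloud);
    sor.setMeanK(50);
    sor.setStddevMulThresh(H_out);   // plays the role of H_out in d_pt = mu + H_out * sigma
    sor.filter(*inliers);

    // 2) Down-sample with a voxel grid of 1 cm x 1 cm x 1 cm.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr downsampled(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::VoxelGrid<pcl::PointXYZRGB> grid;
    grid.setInputCloud(inliers);
    grid.setLeafSize(0.01f, 0.01f, 0.01f);
    grid.filter(*downsampled);
    return downsampled;
}
```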
Step 3: point cloud stitching
The global dense point cloud data obtained directly according to Step 1 depends on the camera pose T_cw of each key frame. However, in the ORB-SLAM2 system the LoopClosing thread optimizes the poses of the key frames in the whole closed loop according to the detected loop closures, so the pose estimates of some key frames change.
To avoid ghosting in the constructed dense point cloud map, the method extracts the key frames whose camera poses have changed after closed-loop optimization, deletes their point cloud data from the world coordinate system, regenerates the corresponding point clouds according to the optimized poses and projects them into the world coordinate system, thereby completing the point cloud stitching after loop detection. The 3D dense point cloud map finally constructed on the basis of the ORB-SLAM2 system is shown in FIG. 3.
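A hedged sketch of this re-stitching bookkeeping is shown below; the KeyframeEntry container, the pose-change test and the rebuild callback are illustrative assumptions rather than the ORB-SLAM2 API, with the callback standing in for the per-key-frame back projection of the earlier sketch.

```cpp
// Sketch only: rebuild the point clouds of key frames whose poses changed after
// loop closure, then reassemble the global dense map. The data structures and
// the pose-change test are illustrative assumptions.
#include <functional>
#include <map>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <Eigen/Dense>

using CloudPtr = pcl::PointCloud<pcl::PointXYZRGB>::Ptr;

struct KeyframeEntry {
    Eigen::Matrix4f Tcw;   // pose the stored cloud was generated with
    CloudPtr worldCloud;   // key-frame cloud already expressed in the world frame
};

// 'rebuild' regenerates the world-frame cloud of a key frame from its RGB-D data
// and its newly optimized pose (see the back-projection sketch above).
CloudPtr restitchAfterLoopClosure(
    std::map<long, KeyframeEntry>& entries,
    const std::map<long, Eigen::Matrix4f>& optimizedPoses,
    const std::function<CloudPtr(long, const Eigen::Matrix4f&)>& rebuild)
{
    for (auto& [id, entry] : entries) {
        auto it = optimizedPoses.find(id);
        if (it == optimizedPoses.end()) continue;
        if (!entry.Tcw.isApprox(it->second, 1e-6f)) {     // pose changed by the loop closure
            entry.worldCloud = rebuild(id, it->second);   // drop the old cloud, regenerate
            entry.Tcw = it->second;
        }
    }
    CloudPtr global(new pcl::PointCloud<pcl::PointXYZRGB>);
    for (const auto& kv : entries) *global += *kv.second.worldCloud;   // splice all key frames
    return global;
}
```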
2. 3D semantic label construction based on YOLO v5s network model
In the method, target recognition needs to be carried out on the key frames generated by the Tracking thread of the ORB-SLAM2 system, and the obtained 2D semantic information is then mapped onto the spatial point cloud data to complete the construction of the 3D semantic labels. The specific steps are as follows:
step 1: establishing data set based on YOLO v5s network model
In this patent, a camera is used to photograph common objects in the laboratory, mainly common items such as computer equipment and books. Part of the acquired images are annotated with the labeling program LabelImg; a sample of the data set is shown in FIG. 4, and, for example, the process of labeling a "display" with the class "tv" is shown in FIG. 5. 500 of the photographed pictures are selected as a training set and 200 as a test set. In addition, 1500 pictures containing indoor-scene objects are selected from COCO as a training set and 300 as a test set, with the annotation files stored in json format for joint use with the COCO data set. In total, 35 object classes common in indoor environments are used in the invention.
Before training, the configuration files of the YOLO v5s data and the model need to be modified, the object class number nc of the two files is changed to 35, and the object class names are modified in the class list names. The training set comprises 2000 pictures, the test set comprises 500 pictures, the YOLO v5s network model parameters are trained after the data set is prepared, and the hyper-parameter configuration of the training model is shown in table 1.
TABLE 1 parameter settings for the YOLO v5s network model
Step 2: extracting 2D object semantics and constructing 3D semantic tags
After training of the YOLO v5s network is finished, the Kinect V2 camera is used to collect RGB image information in real time, enabling real-time detection. As shown in FIG. 6, 2D target semantic information in the RGB images is obtained, which meets the invention's requirement of constructing 3D semantic labels in real time.
The process of mapping 2D semantic information to 3D semantic tags is shown in fig. 7, and first, an interface is added to the Tracking thread of ORB-SLAM2, and based on the trained YOLO v5s network, target detection is performed on each newly generated key frame to obtain the pixel coordinates of the center point of the target object and the length and width of the bounding box. And then, when generating the spatial point cloud data according to the RGB image and the depth image corresponding to the key frame, changing the RGB information of all pixel points in the 2D boundary frame of the target object into set color information, wherein the single-frame point cloud data generated according to the key frame has a semantic label. And then, back projecting the obtained key frame point cloud data and a pose estimation result to a world coordinate system to complete point cloud splicing to obtain a global dense point cloud map with a 3D semantic label, namely adding color information to the point cloud of the target object in the point cloud map.
The key frame generated in ORB-SLAM2 is used as input to obtain a key frame image with a 2D bounding box and confidence, and then a corresponding 3D semantic label in the key frame is obtained by back projection, and the result is shown in fig. 8, where the point cloud of the display is set to red, the point cloud of the keyboard is set to green, and the point cloud of the mouse is set to blue.
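A minimal sketch of this recoloring step is given below, assuming a hypothetical DetectionBox structure for the YOLO output and the class colors described for FIG. 8; the pixels inside each detection box are overwritten with the class color before back projection, so that the generated key-frame point cloud carries the 3D semantic label.

```cpp
// Sketch only: paint the pixels inside each 2D detection box with a per-class
// color before back projection, so the resulting key-frame point cloud carries
// a 3D semantic label. The DetectionBox struct and color table are assumptions.
#include <map>
#include <string>
#include <vector>
#include <opencv2/opencv.hpp>

struct DetectionBox {
    std::string label;   // class name from YOLO v5 (e.g. "tv", "keyboard", "mouse")
    cv::Rect rect;       // detector center/size converted to a pixel rectangle
};

void paintSemanticLabels(cv::Mat& rgb, const std::vector<DetectionBox>& boxes)
{
    // Per-class label colors (BGR order), matching the colors described for FIG. 8.
    static const std::map<std::string, cv::Scalar> kLabelColors = {
        {"tv",       cv::Scalar(0, 0, 255)},   // display -> red
        {"keyboard", cv::Scalar(0, 255, 0)},   // keyboard -> green
        {"mouse",    cv::Scalar(255, 0, 0)},   // mouse -> blue
    };
    for (const auto& box : boxes) {
        auto it = kLabelColors.find(box.label);
        if (it == kLabelColors.end()) continue;                        // unknown class: leave unchanged
        cv::Rect roi = box.rect & cv::Rect(0, 0, rgb.cols, rgb.rows);  // clip to the image
        rgb(roi).setTo(it->second);                                    // overwrite with the label color
    }
}
```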
3. Dense point cloud segmentation
The method performs point cloud segmentation on the constructed dense point cloud map to obtain a scene the robot can understand, and extracts accurate target point clouds from the segmentation result for subsequent combination with the 3D semantic labels. First, the original dense point cloud map is segmented with a supervoxel-clustering-based method, which converts the irregular point cloud into surface-patch structures according to the similarity of points; every surface patch formed by supervoxel segmentation has a centroid and a normal vector, so the point cloud segmentation can be cast as a graph partitioning problem, which is then solved with a graph-partitioning method. The detailed steps are as follows:
step 1: supervoxel cluster segmentation
Supervoxel segmentation exploits the spatial octree structure of the point cloud: clustering is performed with the SupervoxelClustering class of the PCL library, labels are added to the voxel centroids (voxel_centroid_cluster) of the octree structure according to the segmentation result, and random colors are assigned and stored in the clustered_voxel_cluster parameter.
The segmentation principle is to segment the point cloud with region growing based on k-means clustering. The supervoxel clustering process is similar to crystallization, in which multiple crystal nuclei crystallize after a solution becomes supersaturated: crystal nuclei for growth are distributed uniformly throughout the space according to a certain rule, the nucleus spacing and the minimum grain size are set, and smaller grains are absorbed by larger ones. During supervoxel segmentation, the expansion of the seed points is determined by a feature distance computed in the feature space of spatial position, color and normal vector. The distance formula in supervoxel cluster segmentation is:
D = sqrt(w_c · D_c² + w_s · D_s² + w_n · D_n²)
In the above formula, D_c is the difference in color, D_n the difference in normal direction, and D_s the spatial distance difference of the point clouds; w_c, w_s and w_n are the weights of the respective terms. By searching around a crystal nucleus, the voxel with the smallest D is found as the most similar one and represents the next nucleus for growth.
The invention selects a Kinect V2 camera as the sensor and sets appropriate parameters: voxel_resolution = 0.008 (voxel size, the resolution of the spatial octree) and seed_resolution = 0.5 (seed resolution). After adjusting the weights of color, spatial distance and normal vector for the laboratory environment, the supervoxel cluster segmentation result shown in FIG. 9 is obtained.
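A minimal PCL sketch of this supervoxel clustering step is shown below, using the voxel_resolution and seed_resolution values given above; the color, spatial and normal importance weights are illustrative assumptions to be tuned for the actual environment.

```cpp
// Sketch only: supervoxel clustering of the dense map with PCL, using the
// voxel_resolution (0.008) and seed_resolution (0.5) given in the text.
// The color/spatial/normal importance weights are illustrative assumptions.
#include <cstdint>
#include <map>
#include <pcl/point_types.h>
#include <pcl/segmentation/supervoxel_clustering.h>

using PointT = pcl::PointXYZRGBA;   // SupervoxelClustering is commonly used with XYZRGBA

std::map<std::uint32_t, pcl::Supervoxel<PointT>::Ptr> segmentSupervoxels(
    const pcl::PointCloud<PointT>::Ptr& cloud,
    std::multimap<std::uint32_t, std::uint32_t>& adjacency)   // edges of the adjacency graph G
{
    const float voxel_resolution = 0.008f;   // resolution of the spatial octree (voxel size)
    const float seed_resolution  = 0.5f;     // seed (crystal nucleus) spacing

    pcl::SupervoxelClustering<PointT> super(voxel_resolution, seed_resolution);
    super.setInputCloud(cloud);
    super.setColorImportance(0.2f);    // w_c: weight of the color difference D_c
    super.setSpatialImportance(0.4f);  // w_s: weight of the spatial difference D_s
    super.setNormalImportance(1.0f);   // w_n: weight of the normal difference D_n

    std::map<std::uint32_t, pcl::Supervoxel<PointT>::Ptr> clusters;
    super.extract(clusters);                    // surface patches with centroid c_i and normal n_i
    super.getSupervoxelAdjacency(adjacency);    // adjacency graph edges (v_i, v_j)
    return clusters;
}
```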
Step 2: point cloud segmentation based on geometric information
The supervoxel segmentation result obtained in the previous step is represented by an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε contains the edges (v_i, v_j) connecting adjacent surface patches. After supervoxel segmentation each surface patch has a centroid c_i and a normal vector n_i, so the scene segmentation can be regarded as a graph partitioning problem.
Suppose there are K actual planes {s_1, s_2, …, s_K} in the point cloud map and these planes have been divided into surface patches. Define variables {b_i}_1^N with b_i ∈ [0, K], where b_i = k means that surface patch i belongs to plane s_k and N is the number of surface patches into which the actual plane s_k has been divided (for example, the desktop is divided into two clusters in FIG. 9). If the planes of all objects in the point cloud map can be extracted and the surface patches assigned to them accordingly, a more accurate point cloud segmentation result is obtained.
Specifically, after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, and d(c_i, pc_m), the distance from the patch centroid c_i to the plane pc_m, is computed. By adding a threshold δ, all patches whose distance to the plane pc_m is within δ are obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}. A plane weight D(pc_m) is defined, where η is the constraint that distinguishes foreground objects from the background; in the experiments η = 40 and δ = 4 cm were set. The graph partitioning problem is solved by minimizing the energy:
P* = arg min_P E(P)
where E(P) is the fitting energy, defined over the vertex set V and edge set ε of the adjacency graph G = {V, ε} corresponding to the cluster segmentation.
Based on minimum-cut theory for point clouds, the edges selected by the energy minimization are cut, i.e. adjacent surface patches are merged according to the constraints to obtain the point cloud segmentation result; the purpose of this segmentation stage is to merge clusters that belong to the same object. The segmentation result is shown in FIG. 10: target objects such as displays, books, cups and bottles are segmented well, and background objects such as tables, storage cabinets and floors are merged well, meeting the requirements of the invention.
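For illustration only, the sketch below shows a greatly simplified greedy stand-in for the plane assignment: each patch centroid within δ = 4 cm of a RANSAC candidate plane is assigned to that plane and the remaining patches are kept as foreground objects. This approximates the foreground/background split but is not the graph-cut energy minimization itself; the Patch and Plane types are assumptions.

```cpp
// Sketch only: a greatly simplified, greedy stand-in for the plane assignment.
// Each surface patch whose centroid lies within delta of a candidate plane is
// assigned to that plane (background); the rest are kept as foreground object
// patches. This is NOT the graph-cut energy minimization itself.
#include <cmath>
#include <vector>
#include <Eigen/Dense>

struct Patch {
    Eigen::Vector3f centroid;   // c_i of the surface patch
    Eigen::Vector3f normal;     // n_i of the surface patch
    int plane = -1;             // index of the assigned candidate plane, -1 = foreground
};

// Candidate planes in Hessian normal form a*x + b*y + c*z + d = 0, e.g. from RANSAC.
using Plane = Eigen::Vector4f;

void assignPatchesToPlanes(std::vector<Patch>& patches,
                           const std::vector<Plane>& candidates,
                           float delta = 0.04f)   // 4 cm threshold from the text
{
    for (auto& patch : patches) {
        float best = delta;
        for (int m = 0; m < static_cast<int>(candidates.size()); ++m) {
            const Plane& pc = candidates[m];
            float dist = std::abs(pc.head<3>().dot(patch.centroid) + pc[3])
                         / pc.head<3>().norm();          // d(c_i, pc_m)
            if (dist < best) { best = dist; patch.plane = m; }
        }
    }
}
```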
4. Fusing semantic information to construct semantic map
After dense point cloud segmentation is completed, the segmentation result is combined with the 3D semantic tags, the segmentation result is optimized through semantic information, and targets which cannot be extracted in point cloud segmentation can be segmented; meanwhile, semantic information can be added to each segmented target, and a more accurate semantic map is constructed. After point cloud segmentation, each cluster has an attribute label, and after the attribute labels of the clusters are replaced by the semantic labels of specific targets, specific objects in the point cloud have semantic information, so that a semantic map is constructed. The specific implementation steps are as follows:
step 1: storage of semantic tags
After the dense point cloud map is divided, the point clouds with different colors represent a point cloud set and a spatial sampling area in each hyper-voxel, and the color is an attribute label of the cluster. However, the attribute label at this time is only a random label without any meaning, and the label is also an attribute label of each cluster in the final point cloud segmentation result.
In the second part, the present invention performs object recognition on the keyframe generated in ORB-SLAM2 by YOLO v5s, and adds a color tag with semantic information to the result of point cloud segmentation according to the 3D semantic tag obtained from the recognition result, thereby completing the storage of object semantic tags.
At this point, all pixels inside the 2D target recognition bounding box are directly mapped, in combination with the depth data of the Kinect camera, to the semantic labels formed in 3D. In the complex environment of a laboratory, the bounding box of target recognition is not accurate enough, which easily adds wrong semantic information to the point cloud surrounding the target object; as shown in FIG. 8, a semantic label is erroneously added to the point cloud around the target object.
Step 2: fusion of semantic tags and point cloud bounding boxes
To obtain a more accurate semantic map, the bounding box of the target object's point cloud is derived from the 2D recognition box. The width b of the bounding box is the maximum depth difference over all pixels inside the 2D recognition box. Let the set of all pixels inside the 2D recognition box be P = {P_1, P_2, P_3, …, P_n}; then
b = max(Depth(P_i) − Depth(P_j)),  P_i, P_j ∈ P (8)
where Depth(P_i) is the depth value corresponding to pixel P_i. The length and height of the bounding box are obtained by mapping the length and width of the 2D recognition box into the world coordinate system; the principle is the back projection of 2D pixels into the world coordinate system described above and is not repeated here.
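A minimal sketch of this bounding-box computation is given below; the BoundingBox3D type and the use of the near depth when back-projecting the box edges are assumptions made for illustration.

```cpp
// Sketch only: the width b of the 3D bounding box as the maximum depth
// difference over the pixels inside the 2D recognition box (equation (8)),
// with length and height obtained by back-projecting the box edges. The
// BoundingBox3D type and the use of the near depth are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <limits>
#include <opencv2/opencv.hpp>

struct BoundingBox3D { float length = 0, height = 0, width = 0; };

BoundingBox3D boxFromDetection(const cv::Mat& depth,   // depth image in meters (CV_32F)
                               const cv::Rect& rect,   // 2D recognition box
                               float fx, float fy)
{
    cv::Rect r = rect & cv::Rect(0, 0, depth.cols, depth.rows);   // clip to the image
    float dmin = std::numeric_limits<float>::max(), dmax = 0.0f;
    for (int v = r.y; v < r.y + r.height; ++v)
        for (int u = r.x; u < r.x + r.width; ++u) {
            float d = depth.at<float>(v, u);
            if (d <= 0.0f || std::isnan(d)) continue;             // skip invalid depth
            dmin = std::min(dmin, d);
            dmax = std::max(dmax, d);
        }
    BoundingBox3D box;
    if (dmax <= 0.0f) return box;             // no valid depth inside the box
    box.width  = dmax - dmin;                 // b = max(Depth(P_i) - Depth(P_j))
    box.length = r.width  * dmin / fx;        // 2D box edges back-projected at the near depth
    box.height = r.height * dmin / fy;
    return box;
}
```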
The results of point cloud segmentation are then examined, and specific semantic information, i.e. a specific color attribute, is added only when all points of a cluster lie inside the bounding box, yielding a more accurate semantic map. Comparing FIG. 11 with FIG. 10, fusing the bounding boxes obtained from the semantic labels with the point cloud segmentation result allows the object "tv" to be segmented more completely, and correct semantic information is successfully added to the clusters of point clouds at the actual object edges. Compared with FIG. 8, where the bounding box obtained directly from 2D target recognition adds semantic information to the point cloud, the method also effectively avoids adding wrong semantic information to the point cloud around the target. Meanwhile, combining semantic information improves the point cloud segmentation itself: in FIG. 11, the points with added semantic information are re-clustered after their color attributes change, and small objects on the desktop such as the keyboard and mouse can be segmented. Following the above two steps, the three-dimensional semantic map shown in FIG. 11 is formed.
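A minimal sketch of the containment test and label replacement is shown below; the axis-aligned box representation and function names are illustrative assumptions.

```cpp
// Sketch only: give a segmented cluster the semantic label color only when all
// of its points lie inside the object's bounding box in the world frame. The
// axis-aligned box representation is an illustrative assumption.
#include <cstdint>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <Eigen/Dense>

struct AlignedBox { Eigen::Vector3f min, max; };   // world-frame bounding box of the object

static bool inside(const pcl::PointXYZRGB& p, const AlignedBox& b)
{
    return p.x >= b.min.x() && p.x <= b.max.x() &&
           p.y >= b.min.y() && p.y <= b.max.y() &&
           p.z >= b.min.z() && p.z <= b.max.z();
}

// Returns true (and recolors the cluster) only if every point lies inside the box.
bool labelClusterIfContained(pcl::PointCloud<pcl::PointXYZRGB>& cluster,
                             const AlignedBox& box,
                             std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    for (const auto& p : cluster)
        if (!inside(p, box)) return false;                       // one point outside -> keep the random label
    for (auto& p : cluster) { p.r = r; p.g = g; p.b = b; }       // replace with the semantic color
    return true;
}
```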
Claims (10)
1. A three-dimensional semantic map construction method for an indoor mobile robot is characterized by comprising the following steps:
step 1: running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding poses of the robot, and forming a single-frame space point cloud by back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
and step 3: performing target identification on the RGB image corresponding to the key frame by using a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic label according to back projection;
and 4, step 4: partitioning the dense point cloud map by a point cloud partitioning algorithm;
and 5: and fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct and obtain the 3D semantic map.
2. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step1 comprises the following steps:
adding processing of key frames in the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to each key frame, specifically: performing the following operation on all pixels of the key frame to obtain the point cloud data of the key frame in the camera coordinate system: letting the coordinates of a pixel P' in the image be [u, v]^T, the camera intrinsic matrix be K and the depth corresponding to the pixel be d, the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system:
P_c = d · K^(-1) · [u, v, 1]^T
after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c; applying this operation to all pixels of the key frame and using the point cloud construction functions of the PCL library yields the point cloud data of the key frame in the camera coordinate system;
then converting the point cloud data of the key frame in the camera coordinate system into point cloud data in the world coordinate system:
letting P_c be converted into the point P_w(X, Y, Z) in the world coordinate system, P_w(X, Y, Z) satisfies, in homogeneous coordinates:
P_w = T_cw^(-1) · P_c
wherein T_cw is the camera pose.
3. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: step2, the filtering processing of the spatial point cloud corresponding to the key frame comprises:
removing outliers from the point cloud data with the outlier removal filter of the PCL point cloud library: computing the average distance from each point of the spatial point cloud to all of its neighboring points, identifying points whose average distance exceeds the outlier distance threshold d_pt as outliers and removing them from the data; then down-sampling the point cloud data by voxel grid filtering.
4. The indoor mobile robot three-dimensional semantic map construction method according to claim 3, characterized in that: the outlier distance threshold d_pt is determined as follows:
the average distances of all points in the spatial point cloud are assumed to obey the normal distribution N(μ, σ²); with the outlier judgment threshold parameter H_out, the outlier distance threshold d_pt is:
d_pt = μ + H_out · σ.
5. the indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: step2, the step of splicing the poses corresponding to the key frames to form the three-dimensional dense point cloud map comprises the following steps:
the LoopClosing thread in the ORB-SLAM2 system optimizes the poses of the key frames in the detected closed loop; the key frames whose camera poses have changed after closed-loop optimization are extracted, their point cloud data in the world coordinate system is deleted, the corresponding point cloud data is regenerated according to the camera poses after closed-loop optimization and projected into the world coordinate system, and the three-dimensional dense point cloud map is constructed.
6. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step3 comprises the following steps:
step 3.1: establishing data set based on YOLO V5 network model
Photographing target objects to obtain a data set, annotating the collected data set with the labeling program LabelImg, and selecting a training set and a test set from the photographed pictures; selecting pictures containing the target objects from the COCO data set as a further training set and test set; carrying out data configuration and model training parameter setting for the YOLO V5 network, the data configuration comprising the number of object classes and the object class names; training the parameters of the YOLO V5 network model with the training set and verifying the network model with the test set;
step 3.2: extracting 2D object semantics and constructing 3D semantic tags
Enabling a YOLO V5 network to call an RGB-D depth camera to acquire RGB image information, and acquiring a 2D recognition frame in the RGB image in real time to serve as 2D target semantic information;
adding an interface in a Tracking thread of ORB-SLAM2 to read key frames generated in real time, performing target detection on each newly generated key frame by using a trained YOLO V5 network model to obtain the pixel coordinates of the center point of a target object and the length and width of a boundary frame, then changing the RGB information of all pixel points in a 2D boundary frame of the target object into set color information when generating space point cloud data according to an RGB image and a depth image corresponding to the key frame, wherein the single-frame point cloud data generated according to the key frame has a 3D semantic label, then reversely projecting the obtained key frame point cloud data and a result of pose estimation to a world coordinate system to complete point cloud splicing, and obtaining a global dense map with the 3D semantic label, namely adding color information to the point cloud of the target object in the point cloud map.
7. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: step 4 of partitioning the dense point cloud map with a point cloud segmentation algorithm comprises:
step 4.1: segmenting the dense point cloud map with a supervoxel clustering method, converting the point cloud into a surface-patch structure, and representing the segmentation result by an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε contains the edges (v_i, v_j) connecting adjacent surface patches; each surface patch has a centroid c_i and a normal vector n_i;
step 4.2: after the surface patches are obtained, processing them with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, computing the distance d(c_i, pc_m) from the patch centroid c_i to the plane pc_m, and setting a threshold δ to obtain all patches whose distance to the plane pc_m is within δ, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}; defining a plane weight D(pc_m), where η is the constraint that distinguishes foreground objects from the background; the graph partitioning problem is solved by minimizing the energy:
P* = arg min_P E(P)
where E(P) is the fitting energy, defined over the vertex set V and edge set ε of the adjacency graph G = {V, ε} corresponding to the cluster segmentation.
8. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step 5 comprises the following steps:
obtaining a bounding box of the target object's point cloud from the 2D recognition box, the width b of the bounding box being the maximum depth difference over all pixels inside the 2D recognition box: let the set of all pixels inside the 2D recognition box be P = {P_1, P_2, P_3, …, P_n}; then
b = max(Depth(P_i) − Depth(P_j)),  P_i, P_j ∈ P
wherein Depth(P_i) is the depth value corresponding to pixel P_i; the 2D pixels are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
and detecting the result obtained by point cloud segmentation, and replacing the random color attribute labels of the clusters with the 3D semantic labels obtained in the step (3) to obtain a semantic map only when all point clouds in a certain cluster are located in the bounding box.
9. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to any one of claims 1 to 8, wherein: the RGB-D depth camera employs a Kinect V2 camera.
10. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to any one of claims 1 to 8, wherein: the YOLO V5 network adopts a YOLOv5s network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210594240.3A CN115035260A (en) | 2022-05-27 | 2022-05-27 | Indoor mobile robot three-dimensional semantic map construction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210594240.3A CN115035260A (en) | 2022-05-27 | 2022-05-27 | Indoor mobile robot three-dimensional semantic map construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115035260A true CN115035260A (en) | 2022-09-09 |
Family
ID=83121169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210594240.3A Pending CN115035260A (en) | 2022-05-27 | 2022-05-27 | Indoor mobile robot three-dimensional semantic map construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035260A (en) |
- 2022-05-27: application CN202210594240.3A filed in China (CN); published as CN115035260A, status pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110243370A (en) * | 2019-05-16 | 2019-09-17 | 西安理工大学 | A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning |
WO2021249575A1 (en) * | 2020-06-09 | 2021-12-16 | 全球能源互联网研究院有限公司 | Area semantic learning and map point identification method for power transformation operation scene |
WO2022021739A1 (en) * | 2020-07-30 | 2022-02-03 | 国网智能科技股份有限公司 | Humanoid inspection operation method and system for semantic intelligent substation robot |
CN112288857A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Robot semantic map object recognition method based on deep learning |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115908734A (en) * | 2022-11-25 | 2023-04-04 | 贵州电网有限责任公司信息中心 | Power grid map updating method, device, equipment and storage medium |
CN116363217A (en) * | 2023-06-01 | 2023-06-30 | 中国人民解放军国防科技大学 | Method, device, computer equipment and medium for measuring pose of space non-cooperative target |
CN116363217B (en) * | 2023-06-01 | 2023-08-11 | 中国人民解放军国防科技大学 | Method, device, computer equipment and medium for measuring pose of space non-cooperative target |
CN116433493A (en) * | 2023-06-07 | 2023-07-14 | 湖南大学 | Workpiece point cloud set splicing method based on metric learning |
CN116433493B (en) * | 2023-06-07 | 2023-09-22 | 湖南大学 | Workpiece point cloud set splicing method based on metric learning |
CN117542008A (en) * | 2023-10-12 | 2024-02-09 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Semantic point cloud fusion automatic driving scene identification method and storage medium |
CN117517864A (en) * | 2023-11-08 | 2024-02-06 | 南京航空航天大学 | Laser radar-based power transmission line near electricity early warning method and device |
CN117517864B (en) * | 2023-11-08 | 2024-04-26 | 南京航空航天大学 | Laser radar-based power transmission line near electricity early warning method and device |
CN117788730A (en) * | 2023-12-08 | 2024-03-29 | 中交机电工程局有限公司 | Semantic point cloud map construction method |
CN117611762A (en) * | 2024-01-23 | 2024-02-27 | 常熟理工学院 | Multi-level map construction method, system and electronic equipment |
CN117611762B (en) * | 2024-01-23 | 2024-04-30 | 常熟理工学院 | Multi-level map construction method, system and electronic equipment |
CN117906595A (en) * | 2024-03-20 | 2024-04-19 | 常熟理工学院 | Scene understanding navigation method and system based on feature point method vision SLAM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |