CN115035260A - Indoor mobile robot three-dimensional semantic map construction method - Google Patents

Indoor mobile robot three-dimensional semantic map construction method

Info

Publication number
CN115035260A
Authority
CN
China
Prior art keywords
point cloud
key frame
semantic
map
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210594240.3A
Other languages
Chinese (zh)
Inventor
王立鹏
张佳鹏
王学武
马文龙
张智
张秋雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202210594240.3A priority Critical patent/CN115035260A/en
Publication of CN115035260A publication Critical patent/CN115035260A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Remote Sensing (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional semantic map construction method for an indoor mobile robot. The ORB-SLAM2 algorithm is run with an RGB-D depth camera carried by the robot to obtain key frame information and the corresponding robot poses, and the RGB image and depth image of each key frame are back-projected to form a spatial point cloud; the spatial point clouds of the key frames are filtered and stitched according to the poses to form a three-dimensional dense point cloud map; target identification is performed on the RGB images of the key frames with a YOLO V5 network to obtain 2D semantic information, and 3D semantic labels are obtained by back projection; the dense point cloud map is segmented with a point cloud segmentation algorithm; finally, the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map. By processing the RGB-D information acquired by the RGB-D depth camera and fusing the semantic information with the SLAM result, the method forms a three-dimensional map with richer information.

Description

Indoor mobile robot three-dimensional semantic map construction method
Technical Field
The invention belongs to the technical field of robot semantic map construction, relates to a robot map construction method, and particularly relates to a construction method of an indoor mobile robot three-dimensional semantic map.
Background
With the rapid development of intelligent technology, the mobile robot industry is also developing vigorously. Advanced mobile robots carry a number of sensors with specific functions and complete specified tasks in various scenes without direct human operation. For intelligent robots whose application scenes are indoors, autonomous exploration by the robot is generally adopted to complete the map construction task.
At present, typical semantic mapping methods for mobile robots mainly include methods based on visual images and methods based on point cloud processing. For example, Zhang Cheng et al. use monocular image sequences in the document "Semantic SLAM algorithm in dynamic environments" to carry out semantic segmentation, visual odometry and related work, and finally construct a global semantic map; other researchers use point cloud images in the document "Structure-variable robust semantic SLAM algorithm" to study the robustness of semantic SLAM.
The prior art has the following problems in building semantic maps for mobile robots: first, visual images only provide RGB color information, from which three-dimensional spatial information such as a point cloud is difficult to extract; second, it is difficult to obtain from point cloud information the semantics of the objects it represents, and such semantic information is difficult to fuse with the SLAM mapping and localization results.
Disclosure of Invention
In view of the prior art, the technical problem to be solved by the invention is to provide a three-dimensional semantic map construction method for an indoor mobile robot, which processes the RGB-D information collected by an RGB-D depth camera and fuses the semantic information with the SLAM results to form a three-dimensional map with richer information.
In order to solve the technical problem, the invention provides a method for constructing a three-dimensional semantic map of an indoor mobile robot, which comprises the following steps:
step 1: running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding poses of the robot, and forming a single-frame space point cloud by back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
step 3: Performing target identification on the RGB image corresponding to the key frame by using a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic label according to back projection;
step 4: Partitioning the dense point cloud map by a point cloud partitioning algorithm;
step 5: Fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct and obtain the 3D semantic map.
Further, step 1 comprises:
adding processing of key frames in the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to each key frame, specifically: performing the following operation on all pixel points in the key frame to obtain the point cloud data of the key frame in the camera coordinate system: let the coordinates of a pixel point P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system by

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. All pixel points in the key frame are processed as above, and the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL (Point Cloud Library);
the point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
let P_c be converted into the point P_w(X, Y, Z) in the world coordinate system; P_w(X, Y, Z) satisfies

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T_{cw} \begin{bmatrix} X' \\ Y' \\ Z' \\ 1 \end{bmatrix}$$

where T_cw is the camera pose.
Further, the filtering of the spatial point cloud corresponding to the key frame in step 2 comprises:
removing outliers in the point cloud data with the outlier removal filter of the PCL point cloud library: the average distance from each point in the spatial point cloud to all of its neighboring points is calculated, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data; the point cloud data is then down-sampled by voxel grid filtering.
Further, the outlier distance threshold d_pt is determined as follows:
the average distances of all points in the spatial point cloud are assumed to obey a normal distribution (μ, σ²); let the outlier threshold parameter be H_out; then the outlier distance threshold d_pt is:
d_pt = μ + H_out · σ.
further, the step2 of forming the three-dimensional dense point cloud map by splicing according to the poses corresponding to the key frames comprises:
the Loopcolising thread in the ORB-SLAM2 system optimizes the detected key frame pose in the closed loop, extracts the key frame with the changed camera pose after the closed loop optimization, deletes the point cloud data corresponding to the key frame in the world coordinate system, regenerates the corresponding point cloud data according to the camera pose after the closed loop optimization, and projects the point cloud data into the world coordinate system to construct the three-dimensional dense point cloud map.
Further, step 3 comprises:
step 3.1: establishing data set based on YOLO V5 network model
Shooting pictures of target objects to obtain a data set, and annotating the collected data set with the labeling program Labelimg; selecting a training set and a test set from the captured pictures; additionally selecting pictures containing the target objects from the COCO data set as a further training set and test set; carrying out data configuration and model training parameter setting for the YOLO V5 network, the data configuration including the number of object classes and the object class names; training the parameters of the YOLO V5 network model with the training set, and verifying the network model with the test set;
step 3.2: extracting 2D object semantics and constructing 3D semantic tags
Enabling a YOLO V5 network to call an RGB-D depth camera to collect RGB image information, and acquiring a 2D recognition box in the RGB image in real time to serve as 2D target semantic information;
adding an interface in the Tracking thread of ORB-SLAM2 to read the key frames generated in real time, and performing target detection on each newly generated key frame with the trained YOLO V5 network model to obtain the pixel coordinates of the center point of each target object and the length and width of its bounding box. Then, when generating the spatial point cloud data from the RGB image and depth image corresponding to the key frame, the RGB information of all pixel points inside the 2D bounding box of the target object is changed to the set color information, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label. The key frame point cloud data thus obtained is then back-projected to the world coordinate system together with the pose estimation result to complete the point cloud stitching, and a global dense map with 3D semantic labels is obtained, i.e. color information is added to the point cloud of the target objects in the point cloud map.
Further, segmenting the dense point cloud map with a point cloud segmentation algorithm in step 4 comprises:
step 4.1: segmenting the dense point cloud map with a supervoxel clustering segmentation method, converting the point cloud into a surface structure, and representing the segmentation result with an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε denotes the edges connecting adjacent surface patches (v_i, v_j); each surface patch has a centroid c_i and a normal vector n_i;
step 4.2: after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}; the distance d(c_i, pc_m) from the centroid c_i of each surface patch to the plane pc_m is then calculated, and with a threshold δ the set of all surface patches whose distance to the plane pc_m is within δ is obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}; define:
$$D(pc_m) = \begin{cases} |\Pi|, & |\Pi| \ge \eta \\ 0, & \text{otherwise} \end{cases}$$
where η is the constraint that distinguishes foreground objects from the background and D(pc_m) is the weight of the point cloud plane; the energy-minimizing solution P* of the graph segmentation problem is:
$$P^{*} = \arg\min_{P} E(P)$$
where E(P) is the fitting energy, satisfying:
$$E(P) = \sum_{v_i \in V} E_{data}(v_i) + \sum_{(v_i, v_j) \in E} E_{smooth}(v_i, v_j)$$
where V and E are the sets of vertices and edges in the graph, respectively, and the adjacency graph corresponding to the cluster segmentation is G = {V, ε}.
Further, step 5 comprises:
obtaining the bounding box of the point cloud of the target object from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixel points in the 2D recognition box; assuming that all pixel points in the 2D recognition box form the set P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j))
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel point P_i; the 2D pixel points are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
the result obtained by point cloud segmentation is then checked, and only when all point clouds in a cluster lie inside the bounding box is the random color attribute label of that cluster replaced with the 3D semantic label obtained in step 3, yielding the semantic map.
Further, the RGB-D depth camera uses a Kinect V2 camera.
Further, the YOLO V5 network employs a YOLO V5s network.
The invention has the beneficial effects that: the method acquires RGB-D information of an indoor environment with an RGB-D depth camera (such as a Kinect camera) carried by the robot, performs target recognition on the key frames generated by the ORB-SLAM2 system, obtains semantic information with a trained YOLO v5s network model, and converts the semantic information into 3D semantic labels. A three-dimensional dense point cloud map is generated from the RGB image and depth image corresponding to each key frame together with the key frame pose information; on this basis, the point cloud map is segmented and the semantic labels are fused with the dense point cloud map to construct a three-dimensional semantic map with rich information.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a flow chart of dense point cloud construction;
FIG. 3 is a dense point cloud map;
FIG. 4 is a sample of a portion of a data set collected;
FIG. 5 is a calibration of a data sample;
FIG. 6 shows the real-time results of YOLO v5s detection;
FIG. 7 is a flow chart for generating point cloud labels;
FIG. 8 is a point cloud label diagram in a real scene;
FIG. 9 is a super voxel cluster segmentation result;
FIG. 10 is a diagram of a point cloud segmentation result;
fig. 11 is a semantic map after fusion.
Detailed Description
The invention is further described with reference to the drawings and examples.
In the method, a Kinect V2 camera is selected as the visual sensor, and key frames during camera motion are obtained based on the ORB-SLAM2 algorithm. After the camera poses are obtained through feature point matching, back-end optimization and related processing of the key frames, the RGB image and depth image corresponding to each key frame are back-projected to form a spatial point cloud; after filtering, the spatial point clouds are stitched together using the camera pose information to form a three-dimensional dense point cloud map. Meanwhile, target identification is performed on the RGB image corresponding to each key frame with a YOLO V5 network model to obtain the semantic information of the targets, which is projected onto the spatial point cloud to construct 3D semantic labels. The dense point cloud map is then segmented with a point cloud segmentation algorithm, and finally the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map.
1. Improved ORB-SLAM2 construction of three-dimensional dense point cloud
The method realizes the goal of acquiring the dense point cloud by improving partial threads in the ORB-SLAM2, and the specific flow is shown in FIG. 2, and the main steps are as follows:
step 1: back projection of two-dimensional pixel point to three-dimensional point cloud data
Modify the Tracking thread in the ORB-SLAM2 system and add processing of the key frames to obtain the 3D point cloud data corresponding to each key frame. The principle of this process is to back-project the two-dimensional pixel coordinates and depth data of the environment into the world coordinate system through the camera intrinsic and extrinsic parameters. Let the coordinates of a pixel point P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system by the following formula:

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{1}$$

After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. All pixel points in the key frame are processed as above, and the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library. After the point cloud data corresponding to the key frame is obtained, it is converted into the point cloud in the world coordinate system: let P_c be converted into the point P_w(X, Y, Z) in the world coordinate system; the process can be represented by the following formula, where T_cw is the camera pose:

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T_{cw} \begin{bmatrix} X' \\ Y' \\ Z' \\ 1 \end{bmatrix} \tag{2}$$
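As an illustration of this back-projection step, the following is a minimal C++ sketch (not the patented implementation itself) that converts one RGB-D key frame into a colored PCL point cloud and transforms it into the world coordinate system. The intrinsics fx, fy, cx, cy, the depth scale, the 16-bit depth format and the camera-to-world pose argument Twc are assumptions chosen for the example.

```cpp
// Minimal sketch of key-frame back-projection with PCL and OpenCV.
// Assumptions: 16-bit depth image (Kinect-style), depthScale converts raw
// depth to meters, and Twc maps camera coordinates to world coordinates.
#include <cstdint>
#include <Eigen/Geometry>
#include <opencv2/core.hpp>
#include <pcl/common/transforms.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

pcl::PointCloud<pcl::PointXYZRGB>::Ptr keyFrameToWorldCloud(
    const cv::Mat& rgb, const cv::Mat& depth,
    float fx, float fy, float cx, float cy, float depthScale,
    const Eigen::Isometry3f& Twc)
{
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr cam(new pcl::PointCloud<pcl::PointXYZRGB>);
  for (int v = 0; v < depth.rows; ++v) {
    for (int u = 0; u < depth.cols; ++u) {
      float d = depth.at<std::uint16_t>(v, u) / depthScale;  // depth of pixel P'
      if (d <= 0.f) continue;                                // skip invalid pixels
      pcl::PointXYZRGB p;
      p.z = d;                                               // Z' = d
      p.x = (u - cx) * d / fx;                               // X' from d * K^{-1} [u v 1]^T
      p.y = (v - cy) * d / fy;                               // Y'
      const cv::Vec3b c = rgb.at<cv::Vec3b>(v, u);           // copy the RGB of P'
      p.b = c[0]; p.g = c[1]; p.r = c[2];
      cam->push_back(p);
    }
  }
  // Project the single-frame cloud into the world coordinate system (formula (2)).
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr world(new pcl::PointCloud<pcl::PointXYZRGB>);
  pcl::transformPointCloud(*cam, *world, Twc.matrix());
  return world;
}
```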
step 2: dense point cloud data filtering
The method uses the outlier removal filter of the PCL point cloud library to remove abnormal noise points from the point cloud data; the point cloud obtained after outlier removal is relatively smooth and suitable for the subsequent point cloud clustering. The point cloud data is then down-sampled with a voxel grid filter. During outlier removal, the average distance from each point to all of its neighboring points is calculated; the resulting distances obey a Gaussian distribution whose shape is determined by the parameters (μ, σ²), and points whose average distance lies outside the standard range are regarded as outliers and removed from the data.
Assuming that the average distances of all points in the point cloud obey a normal distribution (μ, σ²) and that the outlier threshold parameter is H_out, the outlier distance threshold d_pt can be expressed as:
d_pt = μ + H_out · σ (3)
That is, points whose average distance lies outside the interval (0, d_pt) are regarded as outliers and rejected.
Experiments were carried out on a hardware platform with an Intel Core i5-10400F CPU and an NVIDIA GTX 1650; the best results were obtained when the voxel grid filter resolution was set to 0.01, i.e. a voxel grid size of 1 cm × 1 cm × 1 cm.
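The two filtering stages described above can be sketched with the corresponding PCL filters. The snippet below is illustrative only: the neighbour count (50) is an assumed value, the standard-deviation multiplier plays the role of H_out, and the 1 cm leaf size matches the resolution reported in the experiment.

```cpp
// Sketch of the key-frame cloud filtering: statistical outlier removal
// (points with mean neighbour distance beyond mu + H_out * sigma are dropped)
// followed by voxel-grid down-sampling with 1 cm voxels.
#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

void filterKeyFrameCloud(pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud)
{
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr denoised(new pcl::PointCloud<pcl::PointXYZRGB>);
  pcl::StatisticalOutlierRemoval<pcl::PointXYZRGB> sor;
  sor.setInputCloud(cloud);
  sor.setMeanK(50);             // number of neighbours used for the mean distance (assumed)
  sor.setStddevMulThresh(1.0);  // H_out in d_pt = mu + H_out * sigma
  sor.filter(*denoised);

  pcl::VoxelGrid<pcl::PointXYZRGB> voxel;
  voxel.setInputCloud(denoised);
  voxel.setLeafSize(0.01f, 0.01f, 0.01f);  // 1 cm x 1 cm x 1 cm voxel grid
  voxel.filter(*cloud);                    // down-sampled result written back
}
```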
Step 3: point cloud stitching
The global dense point cloud data obtained directly according to step 1 depends on the camera pose T_cw of each key frame. However, in the ORB-SLAM2 system, the LoopClosing thread optimizes the poses of the key frames in the whole closed loop when a loop closure is detected, and the pose estimates of some key frames change.
To avoid ghosting in the constructed dense point cloud map, the method extracts the key frames whose camera poses have changed after closed-loop optimization, deletes the point cloud data corresponding to these key frames in the world coordinate system, regenerates the corresponding point cloud data according to the poses after closed-loop optimization, and projects it into the world coordinate system, completing the point cloud stitching after loop detection. The 3D dense point cloud map finally constructed based on the ORB-SLAM2 system is shown in FIG. 3.
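The bookkeeping needed to avoid such ghosting can be pictured as follows. This is a simplified sketch under assumed data structures (a map from key frame id to its stored pose and world-frame cloud) and an assumed onLoopClosure() hook; it is not code taken from ORB-SLAM2 itself.

```cpp
// Sketch: after LoopClosing optimises the key-frame poses, drop and regenerate
// the world-frame clouds of every key frame whose pose changed. The containers
// and the onLoopClosure() hook are assumptions made for this illustration.
#include <Eigen/Geometry>
#include <map>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

struct DenseMapBuilder {
  std::map<long, pcl::PointCloud<pcl::PointXYZRGB>::Ptr> kfClouds;  // cloud per key frame
  std::map<long, Eigen::Isometry3f> kfPoses;                        // pose used to build it

  void onLoopClosure(const std::map<long, Eigen::Isometry3f>& optimizedPoses) {
    for (const auto& [id, pose] : optimizedPoses) {
      auto it = kfPoses.find(id);
      if (it == kfPoses.end()) continue;
      if (!pose.isApprox(it->second, 1e-4f)) {  // pose changed by the optimisation
        kfClouds.erase(id);                     // delete the stale world-frame cloud
        // Regenerate from the stored RGB-D images with the new pose, e.g.
        // kfClouds[id] = keyFrameToWorldCloud(rgb[id], depth[id], ..., pose);
        it->second = pose;
      }
    }
  }
};
```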
2. 3D semantic label construction based on YOLO v5s network model
In the method, target recognition needs to be performed on the key frames generated by the Tracking thread of the ORB-SLAM2 system, and the resulting 2D semantic information is then mapped onto the spatial point cloud data to complete the construction of the 3D semantic labels. The specific steps are as follows:
step 1: establishing data set based on YOLO v5s network model
In this patent, a camera is used to photograph common objects in a laboratory, mainly everyday items such as computer equipment and books. Part of the collected images, a sample of which is shown in FIG. 4, are annotated with the labeling program Labelimg; for example, the process of labeling a "display" with the class "tv" is shown in FIG. 5. From the captured pictures, 500 are selected as a training set and 200 as a test set. In addition, 1500 pictures containing objects in indoor scenes are selected from COCO as a further training set and 300 as a test set, and the annotation files are stored in json format so that they can be used together with the COCO data set. In total, the invention uses 35 object categories that are common in indoor environments.
Before training, the YOLO v5s data and model configuration files need to be modified: the number of object classes nc in both files is changed to 35, and the object class names are modified in the class list names. The training set contains 2000 pictures and the test set 500 pictures. After the data set is prepared, the YOLO v5s network model parameters are trained; the hyper-parameter configuration of the training model is shown in Table 1.
TABLE 1 parameter settings for the YOLO v5s network model
Step 2: extracting 2D object semantics and constructing 3D semantic tags
After the YOLO V5s network has been trained, the Kinect V2 camera is used to collect the corresponding RGB image information in real time, so that real-time detection can be realized. As shown in FIG. 6, the 2D target semantic information in the RGB images can be obtained, which meets the invention's requirement of constructing 3D semantic labels in real time.
The process of mapping 2D semantic information to 3D semantic labels is shown in FIG. 7. First, an interface is added to the Tracking thread of ORB-SLAM2, and target detection is performed on each newly generated key frame with the trained YOLO v5s network to obtain the pixel coordinates of the center point of each target object and the length and width of its bounding box. Then, when the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixel points inside the 2D bounding box of the target object is changed to the set color information, so that the single-frame point cloud data generated from the key frame carries a semantic label. The key frame point cloud data thus obtained is then back-projected to the world coordinate system together with the pose estimation result to complete the point cloud stitching, giving a global dense point cloud map with 3D semantic labels, i.e. color information is added to the point cloud of the target objects in the point cloud map.
The key frame generated in ORB-SLAM2 is used as input to obtain a key frame image with a 2D bounding box and confidence, and then a corresponding 3D semantic label in the key frame is obtained by back projection, and the result is shown in fig. 8, where the point cloud of the display is set to red, the point cloud of the keyboard is set to green, and the point cloud of the mouse is set to blue.
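A minimal sketch of this colouring step is given below: while the single-frame cloud is being generated, any pixel that falls inside a detected 2D box is written with a fixed per-class colour instead of its original RGB. The Detection2D structure and the three example colours (display = red, keyboard = green, mouse = blue, matching FIG. 8) are assumptions made for the illustration.

```cpp
// Sketch: override the RGB of points whose source pixel lies inside a YOLO
// 2D bounding box, so the generated point cloud carries a 3D semantic label.
#include <cstdint>
#include <vector>
#include <pcl/point_types.h>

struct Detection2D {        // assumed per-detection record from the YOLO model
  float cx, cy, w, h;       // box centre (pixels) and box size (pixels)
  int classId;              // 0 = display, 1 = keyboard, 2 = mouse (example classes)
};

static const std::uint8_t kLabelColor[3][3] = {
    {255, 0, 0}, {0, 255, 0}, {0, 0, 255}};  // red, green, blue label colours

inline void applySemanticColor(int u, int v,
                               const std::vector<Detection2D>& detections,
                               pcl::PointXYZRGB& p)
{
  for (const auto& d : detections) {
    if (u >= d.cx - d.w / 2 && u <= d.cx + d.w / 2 &&
        v >= d.cy - d.h / 2 && v <= d.cy + d.h / 2) {
      p.r = kLabelColor[d.classId][0];
      p.g = kLabelColor[d.classId][1];
      p.b = kLabelColor[d.classId][2];
      return;  // first matching box wins in this simplified sketch
    }
  }
}
```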
3. Dense point cloud segmentation
The method segments the constructed dense point cloud map in order to obtain a scene the robot can understand, and accurate target point clouds are obtained from the segmentation result of the dense point cloud map for subsequent combination with the 3D semantic labels. First, the original dense point cloud map is segmented with a supervoxel-clustering-based method, which converts the irregular point cloud into a surface structure according to the similarity of points; every surface patch formed by the supervoxel segmentation has a centroid and a normal vector, so the point cloud segmentation can be formulated as a graph segmentation problem, which is then solved with a graph segmentation method. The detailed steps are as follows:
step 1: supervoxel cluster segmentation
Supervoxel segmentation exploits the spatial octree structure of the point cloud: clustering is performed with the SupervoxelClustering class of the PCL library, a label is added to each voxel centroid (voxel_centroid_cluster) of the octree structure according to the segmentation result, and a random color is assigned and stored in the clustered_voxel_cluster parameter.
The segmentation principle is region growing based on k-means clustering. The supervoxel clustering process is similar to crystallization, in which many crystal nuclei form once a solution becomes supersaturated: crystal nuclei for growth are distributed uniformly throughout the space according to a certain rule, the nucleus spacing and minimum grain size are set, and smaller grains are absorbed by larger ones. During supervoxel segmentation, the expansion of the seed points is governed by a feature distance computed in a feature space of spatial position, color and normal vector; the distance formula in supervoxel clustering segmentation is:

$$D = \sqrt{w_c D_c^2 + w_s D_s^2 + w_n D_n^2} \tag{4}$$

In the above formula, D_c denotes the difference in color, D_n the difference in normal direction, and D_s the spatial distance between points; w_c, w_s and w_n are the weights of the respective terms. By searching around a crystal nucleus, the voxel with the smallest D value is found to be the most similar and serves as the next nucleus for growth.
The invention selects a Kinect V2 camera as the sensor and sets appropriate parameters: voxel_resolution = 0.008 (voxel size, i.e. the resolution of the spatial octree) and seed_resolution = 0.5 (seed resolution). After adjusting the weights of color, spatial distance and normal vector to the laboratory environment, the supervoxel cluster segmentation result shown in FIG. 9 is obtained.
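The PCL call sequence corresponding to this step is sketched below with the reported voxel_resolution and seed_resolution; the three importance weights are placeholder values, since the text only states that they were tuned to the laboratory environment.

```cpp
// Sketch of supervoxel clustering with PCL. voxel_resolution and seed_resolution
// follow the values given in the text; the colour/spatial/normal importances
// (w_c, w_s, w_n) are assumed placeholder values.
#include <cstdint>
#include <map>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/segmentation/supervoxel_clustering.h>

void supervoxelSegment(const pcl::PointCloud<pcl::PointXYZRGBA>::Ptr& cloud)
{
  const float voxel_resolution = 0.008f;  // voxel size / spatial octree resolution
  const float seed_resolution = 0.5f;     // seed ("crystal nucleus") spacing

  pcl::SupervoxelClustering<pcl::PointXYZRGBA> super(voxel_resolution, seed_resolution);
  super.setInputCloud(cloud);
  super.setColorImportance(0.2f);    // w_c: weight of the colour difference D_c
  super.setSpatialImportance(0.4f);  // w_s: weight of the spatial difference D_s
  super.setNormalImportance(1.0f);   // w_n: weight of the normal difference D_n

  std::map<std::uint32_t, pcl::Supervoxel<pcl::PointXYZRGBA>::Ptr> clusters;
  super.extract(clusters);           // each supervoxel patch has a centroid and a normal

  std::multimap<std::uint32_t, std::uint32_t> adjacency;  // edges of G = {V, epsilon}
  super.getSupervoxelAdjacency(adjacency);
}
```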
Step 2: point cloud segmentation based on geometric information
The supervoxel segmentation result obtained in the previous step is represented by an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε denotes the edges connecting adjacent surface patches (v_i, v_j). After the supervoxel segmentation, each surface patch has a centroid c_i and a normal vector n_i, so the scene segmentation can be regarded as a graph segmentation problem.
Suppose there are K actual planes {s_1, s_2, …, s_K} in the point cloud map and that these planes have been divided into surface patches. A variable {b_i}_1^N is defined with b_i ∈ [0, K], where b_i = k indicates that the surface patch belongs to plane s_k and N is the number of surface patches into which the actual planes are divided; for example, the desktop in FIG. 9 is divided into two clusters. If the planes of all objects in the point cloud map can be extracted and the surface patches assigned to them accordingly, a more accurate point cloud segmentation result can be obtained.
Specifically, after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, and d(c_i, pc_m), i.e. the distance from the centroid c_i of a surface patch to the plane pc_m, is calculated. By introducing a threshold δ, the set of all surface patches whose distance to the plane pc_m is within δ is obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}. The following formula is defined:
$$D(pc_m) = \begin{cases} |\Pi|, & |\Pi| \ge \eta \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

In the above formula, η is the constraint that distinguishes foreground objects from the background, and D(pc_m) is the weight of the point cloud plane. In the experiments, η = 40 and δ = 4 cm were set. The energy-minimizing solution P* of the graph segmentation problem can be expressed as:
$$P^{*} = \arg\min_{P} E(P) \tag{6}$$

In the above formula, E(P) is the fitting energy, given by the following formula:
$$E(P) = \sum_{v_i \in V} E_{data}(v_i) + \sum_{(v_i, v_j) \in E} E_{smooth}(v_i, v_j) \tag{7}$$
where V and E are the sets of vertices and edges in the graph, respectively, and the adjacency graph corresponding to the cluster segmentation is G = {V, ε}.
Based on the point cloud minimum-cut theory, the edges that satisfy the energy minimization are cut, i.e. adjacent surface patches can be merged according to the constraints to obtain the point cloud segmentation result. The purpose of this segmentation step is to merge clusters belonging to the same object in the clustering result. The segmentation result is shown in FIG. 10: target objects such as displays, books, cups and bottles are segmented well, and background objects such as tables, storage cabinets and floors are also merged well, meeting the requirements of the invention.
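The plane-candidate side of this step can be illustrated with PCL's RANSAC segmentation applied to the patch centroids. This is only a sketch of fitting one candidate plane pc_m and testing d(c_i, pc_m) < δ, with δ = 4 cm as in the experiment; it does not reproduce the full graph-cut energy minimization.

```cpp
// Sketch: fit one candidate plane pc_m to the patch centroids with RANSAC and
// test which centroids c_i satisfy d(c_i, pc_m) < delta (delta = 4 cm here).
#include <cmath>
#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/sample_consensus/method_types.h>
#include <pcl/sample_consensus/model_types.h>
#include <pcl/segmentation/sac_segmentation.h>

pcl::ModelCoefficients fitCandidatePlane(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& centroids,  // one point per surface patch
    pcl::PointIndices& patchesNearPlane)                   // indices with d(c_i, pc_m) < delta
{
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setModelType(pcl::SACMODEL_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setDistanceThreshold(0.04);  // delta = 4 cm
  seg.setInputCloud(centroids);
  pcl::ModelCoefficients coeffs;   // plane pc_m: a x + b y + c z + d = 0
  seg.segment(patchesNearPlane, coeffs);
  return coeffs;
}

// Point-to-plane distance d(c_i, pc_m) for an arbitrary centroid.
float centroidPlaneDistance(const pcl::PointXYZ& c, const pcl::ModelCoefficients& pl)
{
  const auto& v = pl.values;  // [a, b, c, d]
  return std::fabs(v[0] * c.x + v[1] * c.y + v[2] * c.z + v[3]) /
         std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}
```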
4. Fusing semantic information to construct semantic map
After dense point cloud segmentation is completed, the segmentation result is combined with the 3D semantic tags, the segmentation result is optimized through semantic information, and targets which cannot be extracted in point cloud segmentation can be segmented; meanwhile, semantic information can be added to each segmented target, and a more accurate semantic map is constructed. After point cloud segmentation, each cluster has an attribute label, and after the attribute labels of the clusters are replaced by the semantic labels of specific targets, specific objects in the point cloud have semantic information, so that a semantic map is constructed. The specific implementation steps are as follows:
step 1: storage of semantic tags
After the dense point cloud map is divided, the point clouds with different colors represent a point cloud set and a spatial sampling area in each hyper-voxel, and the color is an attribute label of the cluster. However, the attribute label at this time is only a random label without any meaning, and the label is also an attribute label of each cluster in the final point cloud segmentation result.
In the second part, the present invention performs object recognition on the keyframe generated in ORB-SLAM2 by YOLO v5s, and adds a color tag with semantic information to the result of point cloud segmentation according to the 3D semantic tag obtained from the recognition result, thereby completing the storage of object semantic tags.
At this stage, all pixel points inside the 2D target recognition bounding box are mapped directly to 3D semantic labels by combining them with the depth data of the Kinect camera. In the complex environment of a laboratory, the bounding box from target recognition is not accurate enough, which easily causes wrong semantic information to be added to the point cloud surrounding a target object; as shown in FIG. 8, semantic labels are erroneously added to the point cloud around the target object.
Step 2: fusion of semantic tags and point cloud bounding boxes
To obtain a more accurate semantic map, the bounding box of the point cloud of the target object is obtained from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixel points in the 2D recognition box. Assuming that all pixel points in the 2D recognition box form the set P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j)) (8)
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel point P_i. The length and height of the bounding box are obtained by mapping the length and width of the 2D recognition box; the principle is that the 2D pixel points are back-projected into the world coordinate system, which is not repeated here.
The invention then checks the result obtained by point cloud segmentation: only when all the point clouds in a cluster lie inside the bounding box is the specific semantic information, i.e. the specific color attribute, added, so that a more accurate semantic map is obtained. Comparing FIG. 11 with FIG. 10, the object TV is segmented more completely by fusing the bounding box obtained from the semantic label with the point cloud segmentation result, and correct semantic information is successfully added to the clusters of point clouds at the edge of the actual object. Compared with FIG. 8, where the bounding box obtained directly from 2D target recognition adds semantic information to the point cloud, this method also effectively avoids adding semantic information to the point cloud around the target by mistake. At the same time, combining the semantic information improves the point cloud segmentation: in FIG. 11, for example, after the color attributes are changed by the added semantic information and clustering is performed again, small objects such as the keyboard and mouse on the desktop can be segmented. Following the above two steps, the three-dimensional semantic map shown in FIG. 11 is formed.
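The two checks used in this fusion step can be expressed compactly. The sketch below shows the width computation b = max(Depth(P_i) − Depth(P_j)) and the all-points-inside-the-bounding-box test that gates the relabelling; the axis-aligned BBox structure is an assumption made for the illustration.

```cpp
// Sketch of the fusion checks: bounding-box width from the 2D recognition box
// depths, and the test that every point of a cluster lies inside the box.
#include <algorithm>
#include <vector>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// b = max(Depth(P_i) - Depth(P_j)) over all pixels of the 2D recognition box.
float boundingBoxWidth(const std::vector<float>& boxPixelDepths)
{
  if (boxPixelDepths.empty()) return 0.f;
  const auto [mn, mx] =
      std::minmax_element(boxPixelDepths.begin(), boxPixelDepths.end());
  return *mx - *mn;
}

struct BBox {  // assumed axis-aligned bounding box of the target in the world frame
  float xmin, xmax, ymin, ymax, zmin, zmax;
};

// A cluster is relabelled with the 3D semantic colour only if all of its
// points lie inside the bounding box.
bool clusterInsideBox(const pcl::PointCloud<pcl::PointXYZRGB>& cluster, const BBox& b)
{
  for (const auto& p : cluster.points) {
    if (p.x < b.xmin || p.x > b.xmax || p.y < b.ymin || p.y > b.ymax ||
        p.z < b.zmin || p.z > b.zmax)
      return false;
  }
  return true;
}
```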

Claims (10)

1. A three-dimensional semantic map construction method for an indoor mobile robot is characterized by comprising the following steps:
step 1: running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding poses of the robot, and forming a single-frame space point cloud by back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
step 3: Performing target identification on the RGB image corresponding to the key frame by using a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic label according to back projection;
step 4: Partitioning the dense point cloud map by a point cloud partitioning algorithm;
step 5: Fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct and obtain the 3D semantic map.
2. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: step 1 comprises the following steps:
adding processing of key frames in the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to each key frame, specifically: performing the following operation on all pixel points in the key frame to obtain the point cloud data of the key frame in the camera coordinate system: let the coordinates of a pixel point P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system by

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c; all pixel points in the key frame are processed as above, and the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library;
the point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
let P_c be converted into the point P_w(X, Y, Z) in the world coordinate system; P_w(X, Y, Z) satisfies

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T_{cw} \begin{bmatrix} X' \\ Y' \\ Z' \\ 1 \end{bmatrix}$$

where T_cw is the camera pose.
3. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the filtering of the spatial point cloud corresponding to the key frame in step 2 comprises:
removing outliers in the point cloud data with the outlier removal filter of the PCL point cloud library: the average distance from each point in the spatial point cloud to all of its neighboring points is calculated, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data; the point cloud data is then down-sampled by voxel grid filtering.
4. The indoor mobile robot three-dimensional semantic map construction method according to claim 3, characterized in that: the outlier distance threshold d_pt is determined as follows:
the average distances of all points in the spatial point cloud are assumed to obey a normal distribution (μ, σ²); let the outlier threshold parameter be H_out; then the outlier distance threshold d_pt is:
d_pt = μ + H_out · σ.
5. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: stitching according to the poses corresponding to the key frames to form the three-dimensional dense point cloud map in step 2 comprises:
the LoopClosing thread in the ORB-SLAM2 system optimizes the poses of the key frames in the detected closed loop; the key frames whose camera poses have changed after closed-loop optimization are extracted, the point cloud data corresponding to these key frames in the world coordinate system is deleted, the corresponding point cloud data is regenerated according to the camera poses after closed-loop optimization and projected into the world coordinate system, and the three-dimensional dense point cloud map is constructed.
6. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step3 comprises the following steps:
step 3.1: establishing data set based on YOLO V5 network model
Shooting pictures of target objects to obtain a data set, and annotating the collected data set with the labeling program Labelimg; selecting a training set and a test set from the captured pictures; additionally selecting pictures containing the target objects from the COCO data set as a further training set and test set; carrying out data configuration and model training parameter setting for the YOLO V5 network, the data configuration including the number of object classes and the object class names; training the parameters of the YOLO V5 network model with the training set, and verifying the network model with the test set;
step 3.2: extracting 2D object semantics and constructing 3D semantic tags
Enabling a YOLO V5 network to call an RGB-D depth camera to acquire RGB image information, and acquiring a 2D recognition frame in the RGB image in real time to serve as 2D target semantic information;
adding an interface in the Tracking thread of ORB-SLAM2 to read the key frames generated in real time, and performing target detection on each newly generated key frame with the trained YOLO V5 network model to obtain the pixel coordinates of the center point of each target object and the length and width of its bounding box; then, when generating the spatial point cloud data from the RGB image and depth image corresponding to the key frame, changing the RGB information of all pixel points inside the 2D bounding box of the target object to the set color information, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label; then back-projecting the obtained key frame point cloud data together with the pose estimation result to the world coordinate system to complete the point cloud stitching and obtain a global dense map with 3D semantic labels, i.e. adding color information to the point cloud of the target objects in the point cloud map.
7. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: segmenting the dense point cloud map with a point cloud segmentation algorithm in step 4 comprises:
step 4.1: segmenting the dense point cloud map with a supervoxel clustering segmentation method, converting the point cloud into a surface structure, and representing the segmentation result with an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε denotes the edges connecting adjacent surface patches (v_i, v_j); each surface patch has a centroid c_i and a normal vector n_i;
step 4.2: after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}; the distance d(c_i, pc_m) from the centroid c_i of each surface patch to the plane pc_m is then calculated, and with a threshold δ the set of all surface patches whose distance to the plane pc_m is within δ is obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}; define:
$$D(pc_m) = \begin{cases} |\Pi|, & |\Pi| \ge \eta \\ 0, & \text{otherwise} \end{cases}$$
where η is the constraint that distinguishes foreground objects from the background and D(pc_m) is the weight of the point cloud plane; the energy-minimizing solution P* of the graph segmentation problem is:
$$P^{*} = \arg\min_{P} E(P)$$
where E(P) is the fitting energy, satisfying:
$$E(P) = \sum_{v_i \in V} E_{data}(v_i) + \sum_{(v_i, v_j) \in E} E_{smooth}(v_i, v_j)$$
where V and E are the sets of vertices and edges in the graph, respectively, and the adjacency graph corresponding to the cluster segmentation is G = {V, ε}.
8. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step 5 comprises the following steps:
obtaining the bounding box of the point cloud of the target object from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixel points in the 2D recognition box; assuming that all pixel points in the 2D recognition box form the set P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j))
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel point P_i; the 2D pixel points are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
and detecting the result obtained by point cloud segmentation, and replacing the random color attribute labels of the clusters with the 3D semantic labels obtained in the step (3) to obtain a semantic map only when all point clouds in a certain cluster are located in the bounding box.
9. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to any one of claims 1 to 8, wherein: the RGB-D depth camera employs a Kinect V2 camera.
10. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to any one of claims 1 to 8, wherein: the YOLO V5 network adopts a YOLOv5s network.
CN202210594240.3A 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method Pending CN115035260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210594240.3A CN115035260A (en) 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210594240.3A CN115035260A (en) 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method

Publications (1)

Publication Number Publication Date
CN115035260A true CN115035260A (en) 2022-09-09

Family

ID=83121169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210594240.3A Pending CN115035260A (en) 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method

Country Status (1)

Country Link
CN (1) CN115035260A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908734A (en) * 2022-11-25 2023-04-04 贵州电网有限责任公司信息中心 Power grid map updating method, device, equipment and storage medium
CN116363217A (en) * 2023-06-01 2023-06-30 中国人民解放军国防科技大学 Method, device, computer equipment and medium for measuring pose of space non-cooperative target
CN116363217B (en) * 2023-06-01 2023-08-11 中国人民解放军国防科技大学 Method, device, computer equipment and medium for measuring pose of space non-cooperative target
CN116433493A (en) * 2023-06-07 2023-07-14 湖南大学 Workpiece point cloud set splicing method based on metric learning
CN116433493B (en) * 2023-06-07 2023-09-22 湖南大学 Workpiece point cloud set splicing method based on metric learning
CN117517864A (en) * 2023-11-08 2024-02-06 南京航空航天大学 Laser radar-based power transmission line near electricity early warning method and device
CN117517864B (en) * 2023-11-08 2024-04-26 南京航空航天大学 Laser radar-based power transmission line near electricity early warning method and device
CN117788730A (en) * 2023-12-08 2024-03-29 中交机电工程局有限公司 Semantic point cloud map construction method
CN117611762A (en) * 2024-01-23 2024-02-27 常熟理工学院 Multi-level map construction method, system and electronic equipment
CN117611762B (en) * 2024-01-23 2024-04-30 常熟理工学院 Multi-level map construction method, system and electronic equipment
CN117906595A (en) * 2024-03-20 2024-04-19 常熟理工学院 Scene understanding navigation method and system based on feature point method vision SLAM

Similar Documents

Publication Publication Date Title
CN115035260A (en) Indoor mobile robot three-dimensional semantic map construction method
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
CN108564616B (en) Fast robust RGB-D indoor three-dimensional scene reconstruction method
CN106204572B (en) Road target depth estimation method based on scene depth mapping
CN111665842B (en) Indoor SLAM mapping method and system based on semantic information fusion
CN110111338B (en) Visual tracking method based on superpixel space-time saliency segmentation
CN103646391B (en) A kind of real-time video camera tracking method for dynamic scene change
Gao et al. Robust RGB-D simultaneous localization and mapping using planar point features
CN107944428B (en) Indoor scene semantic annotation method based on super-pixel set
CN112927353B (en) Three-dimensional scene reconstruction method, storage medium and terminal based on two-dimensional target detection and model alignment
CN105809716B (en) Foreground extraction method integrating superpixel and three-dimensional self-organizing background subtraction method
Wang et al. An overview of 3d object detection
CN110070578B (en) Loop detection method
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN110751153B (en) Semantic annotation method for indoor scene RGB-D image
Lugo et al. Semi-supervised learning approach for localization and pose estimation of texture-less objects in cluttered scenes
Ghafarianzadeh et al. Efficient, dense, object-based segmentation from RGBD video
Huang et al. Road scene segmentation via fusing camera and lidar data
Yang et al. Cascaded superpixel pedestrian object segmentation algorithm
CN112070840A (en) Human body space positioning and tracking method with integration of multiple depth cameras
Zhou et al. UAV based indoor localization and objection detection
Zhang et al. Consistent target tracking via multiple underwater cameras
SrirangamSridharan et al. Object localization and size estimation from RGB-D images
CN111462181B (en) Video single-target tracking method based on rectangular asymmetric inverse layout model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination