CN115035260A - Indoor mobile robot three-dimensional semantic map construction method - Google Patents

Indoor mobile robot three-dimensional semantic map construction method

Info

Publication number
CN115035260A
Authority
CN
China
Prior art keywords
point cloud
key frame
semantic
map
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210594240.3A
Other languages
Chinese (zh)
Inventor
王立鹏
张佳鹏
王学武
马文龙
张智
张秋雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202210594240.3A priority Critical patent/CN115035260A/en
Publication of CN115035260A publication Critical patent/CN115035260A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Remote Sensing (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional semantic map construction method for an indoor mobile robot. The ORB-SLAM2 algorithm is run with an RGB-D depth camera carried by the robot to obtain key frame information and the corresponding robot poses, and the RGB image and depth image of each key frame are back-projected to form a spatial point cloud; the spatial point clouds of the key frames are filtered and stitched according to the poses to form a three-dimensional dense point cloud map; target identification is performed on the RGB images of the key frames with a YOLO V5 network to obtain 2D semantic information, and 3D semantic labels are obtained by back projection; the dense point cloud map is segmented with a point cloud segmentation algorithm; finally, the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map. By processing the RGB-D information acquired by the RGB-D depth camera and fusing the semantic information with the SLAM result, the method forms a three-dimensional map with richer information.

Description

Indoor mobile robot three-dimensional semantic map construction method
Technical Field
The invention belongs to the technical field of robot semantic map construction, relates to a robot map construction method, and particularly relates to a construction method of an indoor mobile robot three-dimensional semantic map.
Background
With the rapid development of intelligent technology, the mobile robot industry is also developing vigorously. Advanced mobile robots carry a number of sensors with specific functions and complete specified tasks in various scenes without direct human operation. For intelligent robots whose application scenes are indoors, autonomous exploration by the robot is generally adopted to complete the map construction task.
At present, typical semantic mapping methods for mobile robots mainly include methods based on visual images and methods based on point cloud processing. For example, Zhang Cheng et al. use monocular image sequences in the document "Semantic SLAM algorithm in dynamic environments" to carry out semantic segmentation, visual odometry and related work, and finally construct a global semantic map; other researchers use point cloud images in the document "Structure-variable robust semantic SLAM algorithm" to study the robustness of semantic SLAM.
The prior art has the following problems in building semantic maps for mobile robots: first, visual images only provide RGB color information, from which three-dimensional spatial information such as a point cloud is difficult to extract; second, it is difficult to obtain from point cloud information the semantics of the objects it represents, and such semantic information is difficult to fuse with the SLAM mapping and localization results.
Disclosure of Invention
In view of the prior art, the technical problem to be solved by the invention is to provide a three-dimensional semantic map construction method for an indoor mobile robot, which processes the RGB-D information collected by an RGB-D depth camera and fuses the semantic information with the SLAM results to form a three-dimensional map with richer information.
In order to solve the technical problem, the invention provides a method for constructing a three-dimensional semantic map of an indoor mobile robot, which comprises the following steps:
step 1: running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding poses of the robot, and forming a single-frame space point cloud by back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
step 3: Performing target identification on the RGB image corresponding to the key frame by using a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic label according to back projection;
step 4: Partitioning the dense point cloud map by a point cloud partitioning algorithm;
step 5: Fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct and obtain the 3D semantic map.
Further, step 1 comprises:
adding processing of key frames in the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to each key frame, specifically: performing the following operation on all pixel points in the key frame to obtain the point cloud data of the key frame in the camera coordinate system: let the coordinates of a pixel point P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system by

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. All pixel points in the key frame are processed as above, and the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL (Point Cloud Library);
the point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
let P_c be converted into the point P_w(X, Y, Z) in the world coordinate system; P_w(X, Y, Z) satisfies

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T_{cw} \begin{bmatrix} X' \\ Y' \\ Z' \\ 1 \end{bmatrix}$$

where T_cw is the camera pose.
Further, the filtering of the spatial point cloud corresponding to the key frame in step 2 comprises:
removing outliers in the point cloud data with the outlier removal filter of the PCL point cloud library: the average distance from each point in the spatial point cloud to all of its neighboring points is calculated, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data; the point cloud data is then down-sampled by voxel grid filtering.
Further, the outlier distance threshold d_pt is determined as follows:
the average distances of all points in the spatial point cloud are assumed to obey a normal distribution (μ, σ²); let the outlier threshold parameter be H_out; then the outlier distance threshold d_pt is:
d_pt = μ + H_out · σ.
further, the step2 of forming the three-dimensional dense point cloud map by splicing according to the poses corresponding to the key frames comprises:
the Loopcolising thread in the ORB-SLAM2 system optimizes the detected key frame pose in the closed loop, extracts the key frame with the changed camera pose after the closed loop optimization, deletes the point cloud data corresponding to the key frame in the world coordinate system, regenerates the corresponding point cloud data according to the camera pose after the closed loop optimization, and projects the point cloud data into the world coordinate system to construct the three-dimensional dense point cloud map.
Further, step 3 comprises:
step 3.1: establishing data set based on YOLO V5 network model
Shooting pictures of target objects to obtain a data set, and annotating the collected data set with the labeling program Labelimg; selecting a training set and a test set from the captured pictures; additionally selecting pictures containing the target objects from the COCO data set as a further training set and test set; carrying out data configuration and model training parameter setting for the YOLO V5 network, the data configuration including the number of object classes and the object class names; training the parameters of the YOLO V5 network model with the training set, and verifying the network model with the test set;
step 3.2: extracting 2D object semantics and constructing 3D semantic tags
Enabling a YOLO V5 network to call an RGB-D depth camera to collect RGB image information, and acquiring a 2D recognition box in the RGB image in real time to serve as 2D target semantic information;
adding an interface in the Tracking thread of ORB-SLAM2 to read the key frames generated in real time, and performing target detection on each newly generated key frame with the trained YOLO V5 network model to obtain the pixel coordinates of the center point of each target object and the length and width of its bounding box. Then, when generating the spatial point cloud data from the RGB image and depth image corresponding to the key frame, the RGB information of all pixel points inside the 2D bounding box of the target object is changed to the set color information, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label. The key frame point cloud data thus obtained is then back-projected to the world coordinate system together with the pose estimation result to complete the point cloud stitching, and a global dense map with 3D semantic labels is obtained, i.e. color information is added to the point cloud of the target objects in the point cloud map.
Further, segmenting the dense point cloud map with a point cloud segmentation algorithm in step 4 comprises:
step 4.1: segmenting the dense point cloud map with a supervoxel clustering segmentation method, converting the point cloud into a surface structure, and representing the segmentation result with an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε denotes the edges connecting adjacent surface patches (v_i, v_j); each surface patch has a centroid c_i and a normal vector n_i;
step 4.2: after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}; the distance d(c_i, pc_m) from the centroid c_i of each surface patch to the plane pc_m is then calculated, and with a threshold δ the set of all surface patches whose distance to the plane pc_m is within δ is obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}; define:
$$D(pc_m) = \begin{cases} |\Pi|, & |\Pi| \ge \eta \\ 0, & \text{otherwise} \end{cases}$$
where η is the constraint that distinguishes foreground objects from the background and D(pc_m) is the weight of the point cloud plane; the energy-minimizing solution P* of the graph segmentation problem is:
$$P^{*} = \arg\min_{P} E(P)$$
where E(P) is the fitting energy, satisfying:
$$E(P) = \sum_{v_i \in V} E_{data}(v_i) + \sum_{(v_i, v_j) \in E} E_{smooth}(v_i, v_j)$$
where V and E are the sets of vertices and edges in the graph, respectively, and the adjacency graph corresponding to the cluster segmentation is G = {V, ε}.
Further, step 5 comprises:
obtaining the bounding box of the point cloud of the target object from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixel points in the 2D recognition box; assuming that all pixel points in the 2D recognition box form the set P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j))
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel point P_i; the 2D pixel points are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
the result obtained by point cloud segmentation is then checked, and only when all point clouds in a cluster lie inside the bounding box is the random color attribute label of that cluster replaced with the 3D semantic label obtained in step 3, yielding the semantic map.
Further, the RGB-D depth camera uses a Kinect V2 camera.
Further, the YOLO V5 network employs a YOLO V5s network.
The invention has the beneficial effects that: the method acquires RGB-D information of an indoor environment with an RGB-D depth camera (such as a Kinect camera) carried by the robot, performs target recognition on the key frames generated by the ORB-SLAM2 system, obtains semantic information with a trained YOLO v5s network model, and converts the semantic information into 3D semantic labels. A three-dimensional dense point cloud map is generated from the RGB image and depth image corresponding to each key frame together with the key frame pose information; on this basis, the point cloud map is segmented and the semantic labels are fused with the dense point cloud map to construct a three-dimensional semantic map with rich information.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a flow chart of dense point cloud construction;
FIG. 3 is a dense point cloud map;
FIG. 4 is a sample of a portion of a data set collected;
FIG. 5 is a calibration of a data sample;
FIG. 6 shows the real-time results of YOLO v5s detection;
FIG. 7 is a flow chart for generating point cloud labels;
FIG. 8 is a point cloud label diagram in a real scene;
FIG. 9 is a super voxel cluster segmentation result;
FIG. 10 is a diagram of a point cloud segmentation result;
fig. 11 is a semantic map after fusion.
Detailed Description
The invention is further described with reference to the drawings and examples.
In the method, a Kinect V2 camera is selected as the visual sensor, and key frames during camera motion are obtained based on the ORB-SLAM2 algorithm. After the camera poses are obtained through feature point matching, back-end optimization and related processing of the key frames, the RGB image and depth image corresponding to each key frame are back-projected to form a spatial point cloud; after filtering, the spatial point clouds are stitched together using the camera pose information to form a three-dimensional dense point cloud map. Meanwhile, target identification is performed on the RGB image corresponding to each key frame with a YOLO V5 network model to obtain the semantic information of the targets, which is projected onto the spatial point cloud to construct 3D semantic labels. The dense point cloud map is then segmented with a point cloud segmentation algorithm, and finally the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map.
1. Improved ORB-SLAM2 construction of three-dimensional dense point cloud
The method realizes the goal of acquiring the dense point cloud by improving partial threads in the ORB-SLAM2, and the specific flow is shown in FIG. 2, and the main steps are as follows:
step 1: back projection of two-dimensional pixel point to three-dimensional point cloud data
Modify the Tracking thread in the ORB-SLAM2 system and add processing of the key frames to obtain the 3D point cloud data corresponding to each key frame. The principle of this process is to back-project the two-dimensional pixel coordinates and depth data of the environment into the world coordinate system through the camera intrinsic and extrinsic parameters. Let the coordinates of a pixel point P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system by the following formula:

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{1}$$

After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. All pixel points in the key frame are processed as above, and the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library. After the point cloud data corresponding to the key frame is obtained, it is converted into the point cloud in the world coordinate system: let P_c be converted into the point P_w(X, Y, Z) in the world coordinate system; the process can be represented by the following formula, where T_cw is the camera pose:

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T_{cw} \begin{bmatrix} X' \\ Y' \\ Z' \\ 1 \end{bmatrix} \tag{2}$$
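As an illustration of this back-projection step, the following is a minimal C++ sketch (not the patented implementation itself) that converts one RGB-D key frame into a colored PCL point cloud and transforms it into the world coordinate system. The intrinsics fx, fy, cx, cy, the depth scale, the 16-bit depth format and the camera-to-world pose argument Twc are assumptions chosen for the example.

```cpp
// Minimal sketch of key-frame back-projection with PCL and OpenCV.
// Assumptions: 16-bit depth image (Kinect-style), depthScale converts raw
// depth to meters, and Twc maps camera coordinates to world coordinates.
#include <cstdint>
#include <Eigen/Geometry>
#include <opencv2/core.hpp>
#include <pcl/common/transforms.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

pcl::PointCloud<pcl::PointXYZRGB>::Ptr keyFrameToWorldCloud(
    const cv::Mat& rgb, const cv::Mat& depth,
    float fx, float fy, float cx, float cy, float depthScale,
    const Eigen::Isometry3f& Twc)
{
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr cam(new pcl::PointCloud<pcl::PointXYZRGB>);
  for (int v = 0; v < depth.rows; ++v) {
    for (int u = 0; u < depth.cols; ++u) {
      float d = depth.at<std::uint16_t>(v, u) / depthScale;  // depth of pixel P'
      if (d <= 0.f) continue;                                // skip invalid pixels
      pcl::PointXYZRGB p;
      p.z = d;                                               // Z' = d
      p.x = (u - cx) * d / fx;                               // X' from d * K^{-1} [u v 1]^T
      p.y = (v - cy) * d / fy;                               // Y'
      const cv::Vec3b c = rgb.at<cv::Vec3b>(v, u);           // copy the RGB of P'
      p.b = c[0]; p.g = c[1]; p.r = c[2];
      cam->push_back(p);
    }
  }
  // Project the single-frame cloud into the world coordinate system (formula (2)).
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr world(new pcl::PointCloud<pcl::PointXYZRGB>);
  pcl::transformPointCloud(*cam, *world, Twc.matrix());
  return world;
}
```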
step 2: dense point cloud data filtering
The method uses the outlier removal filter of the PCL point cloud library to remove abnormal noise points from the point cloud data; the point cloud obtained after outlier removal is relatively smooth and suitable for the subsequent point cloud clustering. The point cloud data is then down-sampled with a voxel grid filter. During outlier removal, the average distance from each point to all of its neighboring points is calculated; the resulting distances obey a Gaussian distribution whose shape is determined by the parameters (μ, σ²), and points whose average distance lies outside the standard range are regarded as outliers and removed from the data.
Assuming that the average distances of all points in the point cloud obey a normal distribution (μ, σ²) and that the outlier threshold parameter is H_out, the outlier distance threshold d_pt can be expressed as:
d_pt = μ + H_out · σ (3)
That is, points whose average distance lies outside the interval (0, d_pt) are regarded as outliers and rejected.
Experiments were carried out on a hardware platform with an Intel Core i5-10400F CPU and an NVIDIA GTX 1650; the best results were obtained when the voxel grid filter resolution was set to 0.01, i.e. a voxel grid size of 1 cm × 1 cm × 1 cm.
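The two filtering stages described above can be sketched with the corresponding PCL filters. The snippet below is illustrative only: the neighbour count (50) is an assumed value, the standard-deviation multiplier plays the role of H_out, and the 1 cm leaf size matches the resolution reported in the experiment.

```cpp
// Sketch of the key-frame cloud filtering: statistical outlier removal
// (points with mean neighbour distance beyond mu + H_out * sigma are dropped)
// followed by voxel-grid down-sampling with 1 cm voxels.
#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

void filterKeyFrameCloud(pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud)
{
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr denoised(new pcl::PointCloud<pcl::PointXYZRGB>);
  pcl::StatisticalOutlierRemoval<pcl::PointXYZRGB> sor;
  sor.setInputCloud(cloud);
  sor.setMeanK(50);             // number of neighbours used for the mean distance (assumed)
  sor.setStddevMulThresh(1.0);  // H_out in d_pt = mu + H_out * sigma
  sor.filter(*denoised);

  pcl::VoxelGrid<pcl::PointXYZRGB> voxel;
  voxel.setInputCloud(denoised);
  voxel.setLeafSize(0.01f, 0.01f, 0.01f);  // 1 cm x 1 cm x 1 cm voxel grid
  voxel.filter(*cloud);                    // down-sampled result written back
}
```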
Step 3: point cloud stitching
The global dense point cloud data obtained directly according to step 1 depends on the camera pose T_cw of each key frame. However, in the ORB-SLAM2 system, the LoopClosing thread optimizes the poses of the key frames in the whole closed loop when a loop closure is detected, and the pose estimates of some key frames change.
To avoid ghosting in the constructed dense point cloud map, the method extracts the key frames whose camera poses have changed after closed-loop optimization, deletes the point cloud data corresponding to these key frames in the world coordinate system, regenerates the corresponding point cloud data according to the poses after closed-loop optimization, and projects it into the world coordinate system, completing the point cloud stitching after loop detection. The 3D dense point cloud map finally constructed based on the ORB-SLAM2 system is shown in FIG. 3.
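The bookkeeping needed to avoid such ghosting can be pictured as follows. This is a simplified sketch under assumed data structures (a map from key frame id to its stored pose and world-frame cloud) and an assumed onLoopClosure() hook; it is not code taken from ORB-SLAM2 itself.

```cpp
// Sketch: after LoopClosing optimises the key-frame poses, drop and regenerate
// the world-frame clouds of every key frame whose pose changed. The containers
// and the onLoopClosure() hook are assumptions made for this illustration.
#include <Eigen/Geometry>
#include <map>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

struct DenseMapBuilder {
  std::map<long, pcl::PointCloud<pcl::PointXYZRGB>::Ptr> kfClouds;  // cloud per key frame
  std::map<long, Eigen::Isometry3f> kfPoses;                        // pose used to build it

  void onLoopClosure(const std::map<long, Eigen::Isometry3f>& optimizedPoses) {
    for (const auto& [id, pose] : optimizedPoses) {
      auto it = kfPoses.find(id);
      if (it == kfPoses.end()) continue;
      if (!pose.isApprox(it->second, 1e-4f)) {  // pose changed by the optimisation
        kfClouds.erase(id);                     // delete the stale world-frame cloud
        // Regenerate from the stored RGB-D images with the new pose, e.g.
        // kfClouds[id] = keyFrameToWorldCloud(rgb[id], depth[id], ..., pose);
        it->second = pose;
      }
    }
  }
};
```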
2. 3D semantic label construction based on YOLO v5s network model
In the method, target recognition needs to be performed on the key frames generated by the Tracking thread of the ORB-SLAM2 system, and the resulting 2D semantic information is then mapped onto the spatial point cloud data to complete the construction of the 3D semantic labels. The specific steps are as follows:
step 1: establishing data set based on YOLO v5s network model
In this patent, a camera is used to photograph common objects in a laboratory, mainly everyday items such as computer equipment and books. Part of the collected images, a sample of which is shown in FIG. 4, are annotated with the labeling program Labelimg; for example, the process of labeling a "display" with the class "tv" is shown in FIG. 5. From the captured pictures, 500 are selected as a training set and 200 as a test set. In addition, 1500 pictures containing objects in indoor scenes are selected from COCO as a further training set and 300 as a test set, and the annotation files are stored in json format so that they can be used together with the COCO data set. In total, the invention uses 35 object categories that are common in indoor environments.
Before training, the YOLO v5s data and model configuration files need to be modified: the number of object classes nc in both files is changed to 35, and the object class names are modified in the class list names. The training set contains 2000 pictures and the test set 500 pictures. After the data set is prepared, the YOLO v5s network model parameters are trained; the hyper-parameter configuration of the training model is shown in Table 1.
TABLE 1 parameter settings for the YOLO v5s network model
Step 2: extracting 2D object semantics and constructing 3D semantic tags
After the YOLO V5s network has been trained, the Kinect V2 camera is used to collect the corresponding RGB image information in real time, so that real-time detection can be realized. As shown in FIG. 6, the 2D target semantic information in the RGB images can be obtained, which meets the invention's requirement of constructing 3D semantic labels in real time.
The process of mapping 2D semantic information to 3D semantic labels is shown in FIG. 7. First, an interface is added to the Tracking thread of ORB-SLAM2, and target detection is performed on each newly generated key frame with the trained YOLO v5s network to obtain the pixel coordinates of the center point of each target object and the length and width of its bounding box. Then, when the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixel points inside the 2D bounding box of the target object is changed to the set color information, so that the single-frame point cloud data generated from the key frame carries a semantic label. The key frame point cloud data thus obtained is then back-projected to the world coordinate system together with the pose estimation result to complete the point cloud stitching, giving a global dense point cloud map with 3D semantic labels, i.e. color information is added to the point cloud of the target objects in the point cloud map.
The key frame generated in ORB-SLAM2 is used as input to obtain a key frame image with a 2D bounding box and confidence, and then a corresponding 3D semantic label in the key frame is obtained by back projection, and the result is shown in fig. 8, where the point cloud of the display is set to red, the point cloud of the keyboard is set to green, and the point cloud of the mouse is set to blue.
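A minimal sketch of this colouring step is given below: while the single-frame cloud is being generated, any pixel that falls inside a detected 2D box is written with a fixed per-class colour instead of its original RGB. The Detection2D structure and the three example colours (display = red, keyboard = green, mouse = blue, matching FIG. 8) are assumptions made for the illustration.

```cpp
// Sketch: override the RGB of points whose source pixel lies inside a YOLO
// 2D bounding box, so the generated point cloud carries a 3D semantic label.
#include <cstdint>
#include <vector>
#include <pcl/point_types.h>

struct Detection2D {        // assumed per-detection record from the YOLO model
  float cx, cy, w, h;       // box centre (pixels) and box size (pixels)
  int classId;              // 0 = display, 1 = keyboard, 2 = mouse (example classes)
};

static const std::uint8_t kLabelColor[3][3] = {
    {255, 0, 0}, {0, 255, 0}, {0, 0, 255}};  // red, green, blue label colours

inline void applySemanticColor(int u, int v,
                               const std::vector<Detection2D>& detections,
                               pcl::PointXYZRGB& p)
{
  for (const auto& d : detections) {
    if (u >= d.cx - d.w / 2 && u <= d.cx + d.w / 2 &&
        v >= d.cy - d.h / 2 && v <= d.cy + d.h / 2) {
      p.r = kLabelColor[d.classId][0];
      p.g = kLabelColor[d.classId][1];
      p.b = kLabelColor[d.classId][2];
      return;  // first matching box wins in this simplified sketch
    }
  }
}
```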
3. Dense point cloud segmentation
The method segments the constructed dense point cloud map in order to obtain a scene the robot can understand, and accurate target point clouds are obtained from the segmentation result of the dense point cloud map for subsequent combination with the 3D semantic labels. First, the original dense point cloud map is segmented with a supervoxel-clustering-based method, which converts the irregular point cloud into a surface structure according to the similarity of points; every surface patch formed by the supervoxel segmentation has a centroid and a normal vector, so the point cloud segmentation can be formulated as a graph segmentation problem, which is then solved with a graph segmentation method. The detailed steps are as follows:
step 1: supervoxel cluster segmentation
Supervoxel segmentation exploits the spatial octree structure of the point cloud: clustering is performed with the SupervoxelClustering class of the PCL library, a label is added to each voxel centroid (voxel_centroid_cluster) of the octree structure according to the segmentation result, and a random color is assigned and stored in the clustered_voxel_cluster parameter.
The segmentation principle is region growing based on k-means clustering. The supervoxel clustering process is similar to crystallization, in which many crystal nuclei form once a solution becomes supersaturated: crystal nuclei for growth are distributed uniformly throughout the space according to a certain rule, the nucleus spacing and minimum grain size are set, and smaller grains are absorbed by larger ones. During supervoxel segmentation, the expansion of the seed points is governed by a feature distance computed in a feature space of spatial position, color and normal vector; the distance formula in supervoxel clustering segmentation is:

$$D = \sqrt{w_c D_c^2 + w_s D_s^2 + w_n D_n^2} \tag{4}$$

In the above formula, D_c denotes the difference in color, D_n the difference in normal direction, and D_s the spatial distance between points; w_c, w_s and w_n are the weights of the respective terms. By searching around a crystal nucleus, the voxel with the smallest D value is found to be the most similar and serves as the next nucleus for growth.
The invention selects a Kinect V2 camera as the sensor and sets appropriate parameters: voxel_resolution = 0.008 (voxel size, i.e. the resolution of the spatial octree) and seed_resolution = 0.5 (seed resolution). After adjusting the weights of color, spatial distance and normal vector to the laboratory environment, the supervoxel cluster segmentation result shown in FIG. 9 is obtained.
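The PCL call sequence corresponding to this step is sketched below with the reported voxel_resolution and seed_resolution; the three importance weights are placeholder values, since the text only states that they were tuned to the laboratory environment.

```cpp
// Sketch of supervoxel clustering with PCL. voxel_resolution and seed_resolution
// follow the values given in the text; the colour/spatial/normal importances
// (w_c, w_s, w_n) are assumed placeholder values.
#include <cstdint>
#include <map>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/segmentation/supervoxel_clustering.h>

void supervoxelSegment(const pcl::PointCloud<pcl::PointXYZRGBA>::Ptr& cloud)
{
  const float voxel_resolution = 0.008f;  // voxel size / spatial octree resolution
  const float seed_resolution = 0.5f;     // seed ("crystal nucleus") spacing

  pcl::SupervoxelClustering<pcl::PointXYZRGBA> super(voxel_resolution, seed_resolution);
  super.setInputCloud(cloud);
  super.setColorImportance(0.2f);    // w_c: weight of the colour difference D_c
  super.setSpatialImportance(0.4f);  // w_s: weight of the spatial difference D_s
  super.setNormalImportance(1.0f);   // w_n: weight of the normal difference D_n

  std::map<std::uint32_t, pcl::Supervoxel<pcl::PointXYZRGBA>::Ptr> clusters;
  super.extract(clusters);           // each supervoxel patch has a centroid and a normal

  std::multimap<std::uint32_t, std::uint32_t> adjacency;  // edges of G = {V, epsilon}
  super.getSupervoxelAdjacency(adjacency);
}
```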
Step 2: point cloud segmentation based on geometric information
The supervoxel segmentation result obtained in the previous step is represented by an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε denotes the edges connecting adjacent surface patches (v_i, v_j). After the supervoxel segmentation, each surface patch has a centroid c_i and a normal vector n_i, so the scene segmentation can be regarded as a graph segmentation problem.
Suppose there are K actual planes {s_1, s_2, …, s_K} in the point cloud map and that these planes have been divided into surface patches. A variable {b_i}_1^N is defined with b_i ∈ [0, K], where b_i = k indicates that the surface patch belongs to plane s_k and N is the number of surface patches into which the actual planes are divided; for example, the desktop in FIG. 9 is divided into two clusters. If the planes of all objects in the point cloud map can be extracted and the surface patches assigned to them accordingly, a more accurate point cloud segmentation result can be obtained.
Specifically, after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, and d(c_i, pc_m), i.e. the distance from the centroid c_i of a surface patch to the plane pc_m, is calculated. By introducing a threshold δ, the set of all surface patches whose distance to the plane pc_m is within δ is obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}. The following formula is defined:
$$D(pc_m) = \begin{cases} |\Pi|, & |\Pi| \ge \eta \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

In the above formula, η is the constraint that distinguishes foreground objects from the background, and D(pc_m) is the weight of the point cloud plane. In the experiments, η = 40 and δ = 4 cm were set. The energy-minimizing solution P* of the graph segmentation problem can be expressed as:
$$P^{*} = \arg\min_{P} E(P) \tag{6}$$

In the above formula, E(P) is the fitting energy, given by the following formula:
$$E(P) = \sum_{v_i \in V} E_{data}(v_i) + \sum_{(v_i, v_j) \in E} E_{smooth}(v_i, v_j) \tag{7}$$
where V and E are the sets of vertices and edges in the graph, respectively, and the adjacency graph corresponding to the cluster segmentation is G = {V, ε}.
Based on the point cloud minimum-cut theory, the edges that satisfy the energy minimization are cut, i.e. adjacent surface patches can be merged according to the constraints to obtain the point cloud segmentation result. The purpose of this segmentation step is to merge clusters belonging to the same object in the clustering result. The segmentation result is shown in FIG. 10: target objects such as displays, books, cups and bottles are segmented well, and background objects such as tables, storage cabinets and floors are also merged well, meeting the requirements of the invention.
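The plane-candidate side of this step can be illustrated with PCL's RANSAC segmentation applied to the patch centroids. This is only a sketch of fitting one candidate plane pc_m and testing d(c_i, pc_m) < δ, with δ = 4 cm as in the experiment; it does not reproduce the full graph-cut energy minimization.

```cpp
// Sketch: fit one candidate plane pc_m to the patch centroids with RANSAC and
// test which centroids c_i satisfy d(c_i, pc_m) < delta (delta = 4 cm here).
#include <cmath>
#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/sample_consensus/method_types.h>
#include <pcl/sample_consensus/model_types.h>
#include <pcl/segmentation/sac_segmentation.h>

pcl::ModelCoefficients fitCandidatePlane(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& centroids,  // one point per surface patch
    pcl::PointIndices& patchesNearPlane)                   // indices with d(c_i, pc_m) < delta
{
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setModelType(pcl::SACMODEL_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setDistanceThreshold(0.04);  // delta = 4 cm
  seg.setInputCloud(centroids);
  pcl::ModelCoefficients coeffs;   // plane pc_m: a x + b y + c z + d = 0
  seg.segment(patchesNearPlane, coeffs);
  return coeffs;
}

// Point-to-plane distance d(c_i, pc_m) for an arbitrary centroid.
float centroidPlaneDistance(const pcl::PointXYZ& c, const pcl::ModelCoefficients& pl)
{
  const auto& v = pl.values;  // [a, b, c, d]
  return std::fabs(v[0] * c.x + v[1] * c.y + v[2] * c.z + v[3]) /
         std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}
```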
4. Fusing semantic information to construct semantic map
After dense point cloud segmentation is completed, the segmentation result is combined with the 3D semantic tags, the segmentation result is optimized through semantic information, and targets which cannot be extracted in point cloud segmentation can be segmented; meanwhile, semantic information can be added to each segmented target, and a more accurate semantic map is constructed. After point cloud segmentation, each cluster has an attribute label, and after the attribute labels of the clusters are replaced by the semantic labels of specific targets, specific objects in the point cloud have semantic information, so that a semantic map is constructed. The specific implementation steps are as follows:
step 1: storage of semantic tags
After the dense point cloud map is divided, the point clouds with different colors represent a point cloud set and a spatial sampling area in each hyper-voxel, and the color is an attribute label of the cluster. However, the attribute label at this time is only a random label without any meaning, and the label is also an attribute label of each cluster in the final point cloud segmentation result.
In the second part, the present invention performs object recognition on the keyframe generated in ORB-SLAM2 by YOLO v5s, and adds a color tag with semantic information to the result of point cloud segmentation according to the 3D semantic tag obtained from the recognition result, thereby completing the storage of object semantic tags.
At this stage, all pixel points inside the 2D target recognition bounding box are mapped directly to 3D semantic labels by combining them with the depth data of the Kinect camera. In the complex environment of a laboratory, the bounding box from target recognition is not accurate enough, which easily causes wrong semantic information to be added to the point cloud surrounding a target object; as shown in FIG. 8, semantic labels are erroneously added to the point cloud around the target object.
Step 2: fusion of semantic tags and point cloud bounding boxes
To obtain a more accurate semantic map, the bounding box of the point cloud of the target object is obtained from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixel points in the 2D recognition box. Assuming that all pixel points in the 2D recognition box form the set P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j)) (8)
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel point P_i. The length and height of the bounding box are obtained by mapping the length and width of the 2D recognition box; the principle is that the 2D pixel points are back-projected into the world coordinate system, which is not repeated here.
The invention then checks the result obtained by point cloud segmentation: only when all the point clouds in a cluster lie inside the bounding box is the specific semantic information, i.e. the specific color attribute, added, so that a more accurate semantic map is obtained. Comparing FIG. 11 with FIG. 10, the object TV is segmented more completely by fusing the bounding box obtained from the semantic label with the point cloud segmentation result, and correct semantic information is successfully added to the clusters of point clouds at the edge of the actual object. Compared with FIG. 8, where the bounding box obtained directly from 2D target recognition adds semantic information to the point cloud, this method also effectively avoids adding semantic information to the point cloud around the target by mistake. At the same time, combining the semantic information improves the point cloud segmentation: in FIG. 11, for example, after the color attributes are changed by the added semantic information and clustering is performed again, small objects such as the keyboard and mouse on the desktop can be segmented. Following the above two steps, the three-dimensional semantic map shown in FIG. 11 is formed.
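The two checks used in this fusion step can be expressed compactly. The sketch below shows the width computation b = max(Depth(P_i) − Depth(P_j)) and the all-points-inside-the-bounding-box test that gates the relabelling; the axis-aligned BBox structure is an assumption made for the illustration.

```cpp
// Sketch of the fusion checks: bounding-box width from the 2D recognition box
// depths, and the test that every point of a cluster lies inside the box.
#include <algorithm>
#include <vector>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// b = max(Depth(P_i) - Depth(P_j)) over all pixels of the 2D recognition box.
float boundingBoxWidth(const std::vector<float>& boxPixelDepths)
{
  if (boxPixelDepths.empty()) return 0.f;
  const auto [mn, mx] =
      std::minmax_element(boxPixelDepths.begin(), boxPixelDepths.end());
  return *mx - *mn;
}

struct BBox {  // assumed axis-aligned bounding box of the target in the world frame
  float xmin, xmax, ymin, ymax, zmin, zmax;
};

// A cluster is relabelled with the 3D semantic colour only if all of its
// points lie inside the bounding box.
bool clusterInsideBox(const pcl::PointCloud<pcl::PointXYZRGB>& cluster, const BBox& b)
{
  for (const auto& p : cluster.points) {
    if (p.x < b.xmin || p.x > b.xmax || p.y < b.ymin || p.y > b.ymax ||
        p.z < b.zmin || p.z > b.zmax)
      return false;
  }
  return true;
}
```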

Claims (10)

1. A three-dimensional semantic map construction method for an indoor mobile robot is characterized by comprising the following steps:
step 1: running an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding poses of the robot, and forming a single-frame space point cloud by back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
step 3: Performing target identification on the RGB image corresponding to the key frame by using a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic label according to back projection;
step 4: Partitioning the dense point cloud map by a point cloud partitioning algorithm;
step 5: Fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct and obtain the 3D semantic map.
2. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: step 1 comprises the following steps:
adding processing of key frames in the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to each key frame, specifically: performing the following operation on all pixel points in the key frame to obtain the point cloud data of the key frame in the camera coordinate system: let the coordinates of a pixel point P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is then converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system by

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c; all pixel points in the key frame are processed as above, and the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library;
the point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
let P_c be converted into the point P_w(X, Y, Z) in the world coordinate system; P_w(X, Y, Z) satisfies

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T_{cw} \begin{bmatrix} X' \\ Y' \\ Z' \\ 1 \end{bmatrix}$$

where T_cw is the camera pose.
3. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the filtering of the spatial point cloud corresponding to the key frame in step 2 comprises:
removing outliers in the point cloud data with the outlier removal filter of the PCL point cloud library: the average distance from each point in the spatial point cloud to all of its neighboring points is calculated, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data; the point cloud data is then down-sampled by voxel grid filtering.
4. The indoor mobile robot three-dimensional semantic map construction method according to claim 3, characterized in that: the outlier distance threshold d_pt is determined as follows:
the average distances of all points in the spatial point cloud are assumed to obey a normal distribution (μ, σ²); let the outlier threshold parameter be H_out; then the outlier distance threshold d_pt is:
d_pt = μ + H_out · σ.
5. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: stitching according to the poses corresponding to the key frames to form the three-dimensional dense point cloud map in step 2 comprises:
the LoopClosing thread in the ORB-SLAM2 system optimizes the poses of the key frames in the detected closed loop; the key frames whose camera poses have changed after closed-loop optimization are extracted, the point cloud data corresponding to these key frames in the world coordinate system is deleted, the corresponding point cloud data is regenerated according to the camera poses after closed-loop optimization and projected into the world coordinate system, and the three-dimensional dense point cloud map is constructed.
6. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step3 comprises the following steps:
step 3.1: establishing data set based on YOLO V5 network model
Shooting pictures of target objects to obtain a data set, and annotating the collected data set with the labeling program Labelimg; selecting a training set and a test set from the captured pictures; additionally selecting pictures containing the target objects from the COCO data set as a further training set and test set; carrying out data configuration and model training parameter setting for the YOLO V5 network, the data configuration including the number of object classes and the object class names; training the parameters of the YOLO V5 network model with the training set, and verifying the network model with the test set;
step 3.2: extracting 2D object semantics and constructing 3D semantic tags
Enabling a YOLO V5 network to call an RGB-D depth camera to acquire RGB image information, and acquiring a 2D recognition frame in the RGB image in real time to serve as 2D target semantic information;
adding an interface in the Tracking thread of ORB-SLAM2 to read the key frames generated in real time, and performing target detection on each newly generated key frame with the trained YOLO V5 network model to obtain the pixel coordinates of the center point of each target object and the length and width of its bounding box; then, when generating the spatial point cloud data from the RGB image and depth image corresponding to the key frame, changing the RGB information of all pixel points inside the 2D bounding box of the target object to the set color information, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label; then back-projecting the obtained key frame point cloud data together with the pose estimation result to the world coordinate system to complete the point cloud stitching and obtain a global dense map with 3D semantic labels, i.e. adding color information to the point cloud of the target objects in the point cloud map.
7. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: segmenting the dense point cloud map with a point cloud segmentation algorithm in step 4 comprises:
step 4.1: segmenting the dense point cloud map with a supervoxel clustering segmentation method, converting the point cloud into a surface structure, and representing the segmentation result with an adjacency graph G = {V, ε}, where v_i ∈ V is a surface patch obtained by the segmentation and ε denotes the edges connecting adjacent surface patches (v_i, v_j); each surface patch has a centroid c_i and a normal vector n_i;
step 4.2: after the surface patches are obtained, they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}; the distance d(c_i, pc_m) from the centroid c_i of each surface patch to the plane pc_m is then calculated, and with a threshold δ the set of all surface patches whose distance to the plane pc_m is within δ is obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}; define:
$$D(pc_m) = \begin{cases} |\Pi|, & |\Pi| \ge \eta \\ 0, & \text{otherwise} \end{cases}$$
where η is the constraint that distinguishes foreground objects from the background and D(pc_m) is the weight of the point cloud plane; the energy-minimizing solution P* of the graph segmentation problem is:
$$P^{*} = \arg\min_{P} E(P)$$
where E(P) is the fitting energy, satisfying:
$$E(P) = \sum_{v_i \in V} E_{data}(v_i) + \sum_{(v_i, v_j) \in E} E_{smooth}(v_i, v_j)$$
where V and E are the sets of vertices and edges in the graph, respectively, and the adjacency graph corresponding to the cluster segmentation is G = {V, ε}.
8. The indoor mobile robot three-dimensional semantic map construction method according to claim 1, characterized in that: the step 5 comprises the following steps:
obtaining the bounding box of the point cloud of the target object from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixel points in the 2D recognition box; assuming that all pixel points in the 2D recognition box form the set P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j))
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel point P_i; the 2D pixel points are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
and detecting the result obtained by point cloud segmentation, and replacing the random color attribute labels of the clusters with the 3D semantic labels obtained in the step (3) to obtain a semantic map only when all point clouds in a certain cluster are located in the bounding box.
9. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to any one of claims 1 to 8, wherein: the RGB-D depth camera employs a Kinect V2 camera.
10. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to any one of claims 1 to 8, wherein: the YOLO V5 network adopts a YOLOv5s network.
CN202210594240.3A 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method Pending CN115035260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210594240.3A CN115035260A (en) 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210594240.3A CN115035260A (en) 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method

Publications (1)

Publication Number Publication Date
CN115035260A true CN115035260A (en) 2022-09-09

Family

ID=83121169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210594240.3A Pending CN115035260A (en) 2022-05-27 2022-05-27 Indoor mobile robot three-dimensional semantic map construction method

Country Status (1)

Country Link
CN (1) CN115035260A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908734A (en) * 2022-11-25 2023-04-04 贵州电网有限责任公司信息中心 Power grid map updating method, device, equipment and storage medium
CN116363217A (en) * 2023-06-01 2023-06-30 中国人民解放军国防科技大学 Method, device, computer equipment and medium for measuring pose of space non-cooperative target
CN116363217B (en) * 2023-06-01 2023-08-11 中国人民解放军国防科技大学 Method, device, computer equipment and medium for measuring pose of space non-cooperative target
CN116433493A (en) * 2023-06-07 2023-07-14 湖南大学 Workpiece point cloud set splicing method based on metric learning
CN116433493B (en) * 2023-06-07 2023-09-22 湖南大学 Workpiece point cloud set splicing method based on metric learning
CN117517864A (en) * 2023-11-08 2024-02-06 南京航空航天大学 Laser radar-based power transmission line near electricity early warning method and device
CN117517864B (en) * 2023-11-08 2024-04-26 南京航空航天大学 Laser radar-based power transmission line near electricity early warning method and device
CN117788730A (en) * 2023-12-08 2024-03-29 中交机电工程局有限公司 Semantic point cloud map construction method
CN117611762A (en) * 2024-01-23 2024-02-27 常熟理工学院 Multi-level map construction method, system and electronic equipment
CN117611762B (en) * 2024-01-23 2024-04-30 常熟理工学院 Multi-level map construction method, system and electronic equipment
CN117906595A (en) * 2024-03-20 2024-04-19 常熟理工学院 Scene understanding navigation method and system based on feature point method vision SLAM

Similar Documents

Publication Publication Date Title
CN115035260A (en) Indoor mobile robot three-dimensional semantic map construction method
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
CN108564616B (en) Fast robust RGB-D indoor three-dimensional scene reconstruction method
CN106204572B (en) Road target depth estimation method based on scene depth mapping
CN111665842B (en) Indoor SLAM mapping method and system based on semantic information fusion
CN110111338B (en) Visual tracking method based on superpixel space-time saliency segmentation
CN103646391B (en) A kind of real-time video camera tracking method for dynamic scene change
Gao et al. Robust RGB-D simultaneous localization and mapping using planar point features
CN107944428B (en) Indoor scene semantic annotation method based on super-pixel set
CN112927353B (en) Three-dimensional scene reconstruction method, storage medium and terminal based on two-dimensional target detection and model alignment
CN105809716B (en) Foreground extraction method integrating superpixel and three-dimensional self-organizing background subtraction method
Wang et al. An overview of 3d object detection
CN110070578B (en) Loop detection method
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN110751153B (en) Semantic annotation method for indoor scene RGB-D image
Lugo et al. Semi-supervised learning approach for localization and pose estimation of texture-less objects in cluttered scenes
Ghafarianzadeh et al. Efficient, dense, object-based segmentation from RGBD video
Huang et al. Road scene segmentation via fusing camera and lidar data
Yang et al. Cascaded superpixel pedestrian object segmentation algorithm
CN112070840A (en) Human body space positioning and tracking method with integration of multiple depth cameras
Zhou et al. UAV based indoor localization and objection detection
Zhang et al. Consistent target tracking via multiple underwater cameras
SrirangamSridharan et al. Object localization and size estimation from RGB-D images
CN111462181B (en) Video single-target tracking method based on rectangular asymmetric inverse layout model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination