CN117906595A - Scene understanding navigation method and system based on feature point method vision SLAM - Google Patents


Info

Publication number
CN117906595A
Authority
CN
China
Prior art keywords: point cloud, map, ROS, point, camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410317058.2A
Other languages
Chinese (zh)
Inventor
鲁明丽
周理想
徐本连
从金亮
施健
周旭
顾苏杭
朱培逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changshu Institute of Technology filed Critical Changshu Institute of Technology
Priority to CN202410317058.2A priority Critical patent/CN117906595A/en
Publication of CN117906595A publication Critical patent/CN117906595A/en
Pending legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a scene understanding navigation method and system based on feature point method vision SLAM, comprising the following steps: acquiring real-time pose information of a camera and publishing it through ROS; synthesizing the acquired RGB image and depth image into a 3D point cloud; acquiring the relative positional relationship between the robot and the object from semantic and depth information; stitching each frame of point cloud in the 3D point cloud into a global point cloud map according to the pose information, and publishing the global point cloud map to ROS; acquiring map point information and converting it into a point cloud to obtain a point cloud map, publishing the point cloud map data to ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map; and publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network as topics through the ROS system for communication, and performing navigation with the ROS navigation package. ORB-SLAM3 is combined with a real-time instance segmentation method to build a dense semantic map, enabling scene understanding.

Description

Scene understanding navigation method and system based on feature point method vision SLAM
Technical Field
The invention belongs to the technical field of machine vision and robots, and relates to a scene understanding navigation method and system based on a feature point method vision SLAM.
Background
Simultaneous Localization and Mapping (SLAM) is a popular research topic in the robotics field and a key technology for enabling robots to move autonomously in complex scenes; it is widely applied in mobile robots, autonomous driving, unmanned aerial vehicles, augmented reality (AR), virtual reality (VR), and other fields.
However, conventional visual SLAM based on the feature point method (such as ORB-SLAM3) can only build a sparse point cloud map for localization; it cannot build a two-dimensional occupancy grid map or a three-dimensional octree map as laser SLAM does, so the resulting map cannot be used for navigation, obstacle avoidance, and other operations. On the other hand, the visual sensor required by visual SLAM can acquire a large amount of environmental information such as color, texture, and shape, and some special visual sensors (e.g., depth cameras) can also acquire the relative distance between the camera and objects in the environment; a sparse point cloud map wastes this abundant and useful information.
In recent years, deep learning has made great progress and achieved many breakthroughs in the field of computer vision. By combining deep learning methods, semantic information, including the identification and labeling of scene elements such as objects, can be obtained and fused into the SLAM system, so that the robot can better understand, from a human perspective, the environmental structure, the semantic meaning, and the metric information between itself and surrounding objects. In addition, the Robot Operating System (ROS) provides rich functionality and interfaces. At present, although deep learning methods have been fused into SLAM systems, communication between the deep learning method and the SLAM system is not efficient, and real-time performance is poor.
One prior disclosure describes a deep-learning-based visual SLAM method and system for indoor dynamic scenes: based on the ORB-SLAM2 framework, a GCNv feature extraction algorithm replaces the corresponding module of ORB-SLAM2, and a dynamic point removal module and a point cloud mapping thread are added, effectively reducing the interference of dynamic feature points on the visual odometry. However, the GCNv feature extraction algorithm only extracts features from the image information; it cannot build a semantic map and cannot help a robot understand the environment from a human perspective.
Disclosure of Invention
The invention aims to provide a scene understanding navigation method and system based on feature point method vision SLAM, which expands the sparse point cloud established by ORB-SLAM3 into a map usable for navigation and a semantic metric map that enables scene understanding, with the results of the modules published and subscribed as topics through the ROS system for communication.
The technical solution for realizing the purpose of the invention is as follows:
A scene understanding navigation method based on feature point method vision SLAM comprises the following steps:
S01: acquiring real-time pose information of the camera from ORB-SLAM3 and publishing it through ROS;
S02: synthesizing the RGB image and the depth image obtained by the depth camera into a 3D point cloud;
S03: acquiring the relative positional relationship between the robot and the object through the semantic and depth information obtained from an instance segmentation network;
S04: stitching each frame of point cloud in the 3D point cloud into a global point cloud map according to the real-time pose information, and publishing the global point cloud map to ROS;
S05: acquiring map point information of ORB-SLAM3 and converting it into a point cloud to obtain a point cloud map, publishing the point cloud map data to ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map;
S06: publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network as topics through the ROS system for communication, and performing navigation using the ROS navigation package.
In a preferred embodiment, the method for acquiring real-time pose information of the camera from the ORB-SLAM3 and issuing the pose information through ROS in step S01 includes:
S11: acquiring Sophus of the camera pose in SE3f format from TrackRGBD interface of ORB-SLAM 3;
S12: the pose of Sophus:SE 3f format is converted into cv:Mat format Tcw, ,/>For rotation matrix,/>T is a translation vector,/>As a function of the rotation matrix of the camera pose,/>A translation matrix function for the camera pose;
S13: converting the reverse pose Tcw of the camera into a positive pose Twc, and extracting a rotation matrix RWC and a translation vector tWC from the positive pose Twc;
S14: constructing a rotation matrix RWC as a quaternion q;
S15: the translation vector tWC and the rotation quaternion q of the camera are respectively copied to a variable of the ROS message type and used for issuing camera pose information.
In a preferred technical solution, the method for synthesizing the 3D point cloud in step S02 includes:
S21: reading pixel points of the depth image to obtain a depth value d;
S22: calculating the three-dimensional coordinates of the current pixel point according to the camera internal parameters and the depth value d
The calculation method comprises the following steps:
Wherein: is the pixel point coordinate in the pixel coordinate system,/> Is a three-dimensional coordinate value of the pixel point p,,/>,/>,/>Is an internal reference of the camera,/>Is a scaling factor;
S23: acquiring RGB color values of pixel points corresponding to the color image, and storing the RGB color values in b, g and r member variables of a point p;
s24: performing the above processing on each frame of image to generate a 3D point cloud, performing rotation and stitching operation on the generated point cloud, and converting the generated point cloud into a specified coordinate system;
S25: voxel filtering is performed on the point cloud.
In a preferred embodiment, the method for acquiring the relative positional relationship between the robot and the object in step S03 includes:
s31: deploying YOLOv-seg instance partition network by using a C++ version OpenCV-DNN module;
S32: writing the deployed YOLOv-seg into ROS nodes, and issuing detection and segmentation results;
S33: sending the RGB image to YOLOv-seg example segmentation network to obtain target detection results of each category;
s34: reading the coordinates of the rectangular frames of each category in the pixel coordinate system from the result, and solving the coordinates of the central points of the rectangular frames
S35: indexing the point cloud coordinate system according to the center point coordinates of each category in the pixel coordinate system to obtain the center point coordinates in the image pixel coordinate systemCorresponding point cloud coordinates/>
S36: calculating the coordinate transformation relation between the pose of the camera and the position of the center point of each type of object in the point cloud through tf coordinate transformation under the ROS;
s37: the coordinate transformation relationships are published through ROS and then subscribed to in rviz.
In a preferred technical solution, the method for stitching the global point cloud map in step S04 includes:
S41: converting the acquired real-time pose of the camera into a transformation matrix from the camera to a world coordinate system;
S42: calculating the relative transformation between two adjacent frames of point clouds according to the transformation matrix;
S43: registering the point cloud in the current frame with the corresponding pose, and adding the registered point cloud into the global point cloud map.
In a preferred technical solution, the method for obtaining the point cloud map in the step S05 includes:
S51: reading map point information in ORB-SLAM 3;
S52: creating sensor_msgs::PointCloud2 type point cloud data and initializing the point cloud data;
S53: traversing the map points, converting them into Eigen::Vector3d type, storing the converted coordinates in an array, copying the data in the array to the corresponding positions of the sensor_msgs::PointCloud2 point cloud data using the memcpy function, and publishing the point cloud data to ROS.
In a preferred embodiment, the method for converting the point cloud map into the three-dimensional octree map and the two-dimensional grid map in the step S05 includes:
s61: creating and initializing an empty OccupancyGrid data structure according to the range, resolution and grid map size of the point cloud;
S62: for each point cloud data, converting the coordinates of the point cloud data into grid units on OccupancyGrid, and marking the corresponding grid units as corresponding states, wherein the states comprise idle states, free states or unknown states;
s63: publishing the generated OccupancyGrid data to corresponding topics;
s64: creating and initializing a OccupancyMap data structure and setting basic parameters thereof;
s65: traversing each cell in OccupancyGrid, mapping cell information into OccupancyMap;
S66: and publishing the generated OccupancyMap data to corresponding topics.
In a preferred embodiment, the method for performing navigation operation using the ROS navigation pack in step S06 includes:
S71: a mobile robot or a camera incrementally builds a grid map of the desired environment;
S72: after the establishment is completed, storing the grid map;
S73: using the move_base node of ROS, a navigation operation is performed at rviz.
The invention also discloses a scene understanding navigation system based on the feature point method vision SLAM, which comprises the following steps:
The camera real-time pose acquisition and release module acquires real-time pose information of the camera from the ORB-SLAM3 and releases the real-time pose information through the ROS;
The 3D point cloud synthesis module synthesizes the RGB image and the depth image obtained by the depth camera into a 3D point cloud;
the scene understanding module is used for acquiring the relative position relation between the robot and the object through semantic and depth information acquired by the instance segmentation network;
the global point cloud map generation and release module is used for splicing each frame of point cloud in the 3D point cloud into a global point cloud map according to the real-time pose information and releasing the global point cloud map into the ROS;
the ORB-SLAM3 visual navigation map module is used for acquiring map point information of ORB-SLAM3, converting the map point information into point cloud to obtain a point cloud map, releasing the point cloud map data into ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map;
And the navigation module is used for publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network in the form of topics through the ROS system for communication transmission, and performing navigation operation by using the ROS navigation package.
The invention also discloses a computer storage medium, on which a computer program is stored, which when executed realizes the scene understanding navigation method based on the feature point method vision SLAM.
Compared with the prior art, the invention has the remarkable advantages that:
1. The invention remedies the limitation that ORB-SLAM3 can only build a sparse map: it combines the YOLOv-seg real-time instance segmentation method, which currently has the highest speed and accuracy, to build a dense semantic map, helping the robot understand the environment from a human perspective. In addition, the relative position and distance between the robot and the prior objects in the environment are published in real time, which makes it convenient for the robot to perform actions such as grasping.
2. The invention converts the sparse map into an occupancy grid map similar to that built by a laser radar. By analyzing and processing the grid map, functions such as path planning, obstacle avoidance, and navigation can be realized, which greatly expands the working range of visual SLAM and makes the visual SLAM system more comprehensive and flexible.
3. In the invention, each module runs under the ROS framework; the ORB-SLAM3 module and the YOLOv-seg instance segmentation module are rewritten as ROS nodes, and the results between the modules are published and subscribed as topics through the ROS system for communication. This architecture makes the system run more efficiently and in real time, and facilitates the extension of other functional modules. Moreover, since the ROS framework is widely used in the robotics field, the invention also makes the whole system easier to integrate and work cooperatively with other ROS nodes, further expanding the application potential of visual SLAM in robot navigation, environment perception, and other aspects.
Drawings
Fig. 1 is a flowchart of a scene understanding navigation method based on feature point method visual SLAM of the present embodiment;
FIG. 2 is a flowchart of the scene understanding navigation system based on the feature point method visual SLAM of the present embodiment;
FIG. 3 is a three-dimensional octree map built purely from vision;
FIG. 4 is a two-dimensional occupancy grid map built purely from vision;
fig. 5 is a schematic view of a scene understanding.
Detailed Description
The principle of the invention is as follows: the sparse point cloud established by ORB-SLAM3 is expanded into a map usable for navigation and a semantic metric map that enables scene understanding, and the results of the modules are published and subscribed as topics through the ROS system for communication.
Example 1:
As shown in FIG. 1, a scene understanding navigation method based on feature point method vision SLAM includes the following steps:
S01: acquiring real-time pose information of the camera from ORB-SLAM3 and publishing it through ROS;
S02: synthesizing the RGB image and the depth image obtained by the depth camera into a 3D point cloud;
S03: acquiring the relative positional relationship between the robot and the object through the semantic and depth information obtained from an instance segmentation network;
S04: stitching each frame of point cloud in the 3D point cloud into a global point cloud map according to the real-time pose information, and publishing the global point cloud map to ROS;
S05: acquiring map point information of ORB-SLAM3 and converting it into a point cloud to obtain a point cloud map, publishing the point cloud map data to ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map;
S06: publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network as topics through the ROS system for communication, and performing navigation using the ROS navigation package.
For the specific implementation of each step, refer to the workflow of the scene understanding navigation system based on feature point method vision SLAM described below; it is not repeated here.
In another embodiment, a computer storage medium has a computer program stored thereon, the computer program when executed implementing the scene understanding navigation method based on the feature point method visual SLAM described above.
The scene understanding navigation method based on feature point method vision SLAM can be any one of the feature point method visual SLAM navigation methods described above, and its detailed implementation is not repeated here.
In yet another embodiment, a scene understanding navigation system based on feature point method visual SLAM includes:
The camera real-time pose acquisition and release module acquires real-time pose information of the camera from the ORB-SLAM3 and releases the real-time pose information through the ROS;
The 3D point cloud synthesis module synthesizes the RGB image and the depth image obtained by the depth camera into a 3D point cloud;
the scene understanding module is used for acquiring the relative position relation between the robot and the object through semantic and depth information acquired by the instance segmentation network;
the global point cloud map generation and release module is used for splicing each frame of point cloud in the 3D point cloud into a global point cloud map according to the real-time pose information and releasing the global point cloud map into the ROS;
the ORB-SLAM3 visual navigation map module is used for acquiring map point information of ORB-SLAM3, converting the map point information into point cloud to obtain a point cloud map, releasing the point cloud map data into ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map;
And the navigation module is used for publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network in the form of topics through the ROS system for communication transmission, and performing navigation operation by using the ROS navigation package.
Specifically, taking a preferred embodiment as an example, the workflow of the scene understanding navigation system based on feature point method vision SLAM is described below, as shown in FIG. 2:
s1: real-time pose information of the camera is acquired from the ORB-SLAM3 and published by ROS.
The specific implementation method comprises the following steps:
S11: installing and running ORB-SLAM3;
S12: acquiring Sophus of the camera pose in SE3f format from TrackRGBD interface of ORB-SLAM 3;
s13: converting Sophus:: the pose in SE3f format into cv:: mat format Tcw;
Specifically, trackRGBD is the interface in ORB-SLAM3 for processing RGB-D camera data, and Sophus is returned to the camera pose in SE3f format; sophus SE3f is a data type in the Sophus C ++ library representing rigid body transformations in three dimensions, including translation and rotation, The function returns a rotating part,/>The function returns a translated portion.
Rotation matrix; Translation vectorThen/>
S14: converting the reverse pose Tcw of the camera into a positive pose Twc, and extracting a rotation matrix RWC and a translation vector tWC from the positive pose Twc;
S15: constructing a rotation matrix RWC as a quaternion q;
s16: the translation vector tWC and the rotation quaternion q of the camera are respectively copied to a variable of the ROS message type and used for issuing camera pose information.
S2: and synthesizing the 3D point cloud by the RGB image and the depth image acquired by the depth camera.
The specific implementation method comprises the following steps:
S21: reading pixel points of the depth image to obtain a depth value d;
S22: calculating the three-dimensional coordinates of the current pixel point according to the camera internal parameters and the depth value d
Specifically, the calculation method is as follows:
Wherein: is the pixel point coordinate in the pixel coordinate system,/> Is the value of the three-dimensional coordinate p-,/>,/>Is an internal reference of the camera,/>Is a scaling factor.
S23: acquiring RGB color values of pixel points corresponding to the color image, and storing the RGB color values in b, g and r member variables of a point p;
S24: repeating the steps, and generating a 3D point cloud for each frame of image;
S25: performing rotation and stitching operation on the generated point cloud, and converting the point cloud into a designated coordinate system;
s26: voxel filtering is performed on the point cloud to reduce the amount of point cloud data and noise.
S3: and combining the semantic and depth information to acquire the relative position relation between the robot and the object. As shown in fig. 5. The a priori objects in the environment have been marked in different colors while the relative position and distance relationship of the robot to the a priori objects is indicated in real time by the RGB-D camera.
The specific implementation method comprises the following steps:
s31: deploying YOLOv-seg instance partition network by using a C++ version OpenCV-DNN module;
S32: writing the deployed YOLOv-seg into ROS nodes, and issuing detection and segmentation results;
S33: sending the RGB image to YOLOv-seg example segmentation network to obtain target detection results of each category;
s34: reading the coordinates of the rectangular frames of each category in the pixel coordinate system from the result, and solving the coordinates of the central points of the rectangular frames
S35: indexing the point cloud coordinate system according to the center point coordinates of each category in the pixel coordinate system to obtain the center point coordinates in the pixel coordinate system of the pictureCorresponding point cloud coordinates/>;
Specific:
Wherein: For index value,/> Is the lateral resolution of the picture or camera. The index value is input into the generated single-frame point cloud to obtain the coordinate/>, of the point cloudWherein/>Is the distance between the robot and the object.
S36: calculating a coordinate transformation relation between the pose of the camera and the position of the center point of each type of object in the point cloud through tf coordinate transformation under the ROS, wherein the coordinate transformation relation is the relative position relation between the robot and the object; the calculated amount is small.
S37: the coordinate transformation relationships are published through ROS and then subscribed to in rviz.
S4: and (3) combining the pose information acquired in the step (S1), and splicing each frame of point cloud in the step (S2) into a 3D point cloud with the same total.
RViz (Robot Visualization) is an open source tool for visualizing the robotic system for displaying and debugging sensor data, status information, motion planning, etc. of the robot.
The specific implementation method comprises the following steps:
s41: converting the pose in the S1 into a transformation matrix from the camera to a world coordinate system;
S42: calculating the relative transformation between two adjacent frames of point clouds according to the transformation matrix;
S43: registering the point cloud in the current frame with the corresponding pose, and adding the registered point cloud into a global point cloud map;
S44: the global point cloud map is published into ROS.
S5: map point information of the ORB-SLAM3 is acquired and converted into a point cloud.
The specific implementation method comprises the following steps:
S51: reading map point information in ORB-SLAM 3;
s52: a sensor_ msgs: pointCloud2 type variable is created and initialized;
S53: traversing map points, converting the map points into Eigen, namely Vector3d type, storing the converted coordinates in an array, and copying data in the array to specific positions of point cloud data by using memcpy function;
S54: and releasing the point cloud data to the ROS to simulate the point cloud generated by the multi-line laser radar.
S6: the point cloud map is converted into a three-dimensional octree map and a two-dimensional grid map as shown in fig. 3 and 4. The square in fig. 3 represents an obstacle present in the environment. The other areas in fig. 4 represent the portions that the robot can pass through, and the black areas represent the presence of obstacles at that location, which the robot is required to avoid when planning and navigating the path.
The memcpy function is a standard library function in the C/C++ language for copying data between memories.
The specific implementation method comprises the following steps:
S61: creating and initializing an empty OccupancyGrid data structure according to the range, resolution and grid map size of the point cloud; occupancyGrid is a message type defined in ROS (robot operating system) for representing a three-dimensional probability map, i.e. octomap map.
S62: for each point cloud data, converting the coordinates of the point cloud data into grid units on OccupancyGrid, and marking the corresponding grid units as corresponding states, such as idle, free or unknown;
s63: publishing the generated OccupancyGrid data to corresponding topics;
S64: creating and initializing a OccupancyMap data structure and setting basic parameters thereof; occupancyMap are used primarily to convert a three-dimensional OccupancyGrid probability map to a two-dimensional occupancy grid map.
S65: traversing each cell in OccupancyGrid, mapping cell information into OccupancyMap;
S66: and publishing the generated OccupancyMap data to corresponding topics.
S7: and storing the established grid map, and using the ROS navigation package to implement navigation operation.
The specific implementation method comprises the following steps:
S71: slowly moving the robot or the camera, and incrementally establishing a grid map of the required environment;
S72: after the establishment is completed, storing the grid map;
S73: using the move_base node of ROS, a navigation operation is performed at rviz.
In ROS, move_base is an important software package for implementing the robot navigation function. The move_base comprises a series of nodes and algorithms for realizing the path planning, obstacle avoidance and navigation functions of the robot in an unknown environment.
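For completeness, a hypothetical C++ snippet for sending a single goal to move_base from code (instead of clicking "2D Nav Goal" in rviz) is sketched below; the goal coordinates and the node name are placeholders.
#include <ros/ros.h>
#include <move_base_msgs/MoveBaseAction.h>
#include <actionlib/client/simple_action_client.h>

int main(int argc, char** argv) {
  ros::init(argc, argv, "send_goal_example");
  actionlib::SimpleActionClient<move_base_msgs::MoveBaseAction> client("move_base", true);
  client.waitForServer();                        // wait until move_base is running

  move_base_msgs::MoveBaseGoal goal;
  goal.target_pose.header.frame_id = "map";      // goal expressed in the saved grid map
  goal.target_pose.header.stamp = ros::Time::now();
  goal.target_pose.pose.position.x = 1.0;        // placeholder target position
  goal.target_pose.pose.position.y = 0.5;
  goal.target_pose.pose.orientation.w = 1.0;

  client.sendGoal(goal);
  client.waitForResult();
  bool ok = client.getState() == actionlib::SimpleClientGoalState::SUCCEEDED;
  ROS_INFO("Navigation %s", ok ? "succeeded" : "failed");
  return 0;
}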
The foregoing examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the foregoing examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principles of the present invention should be made therein and are intended to be equivalent substitutes within the scope of the present invention.

Claims (10)

1. A scene understanding navigation method based on feature point method vision SLAM is characterized by comprising the following steps:
S01: acquiring real-time pose information of the camera from ORB-SLAM3 and publishing it through ROS;
S02: synthesizing the RGB image and the depth image obtained by the depth camera into a 3D point cloud;
S03: acquiring the relative positional relationship between the robot and the object through the semantic and depth information obtained from an instance segmentation network;
S04: stitching each frame of point cloud in the 3D point cloud into a global point cloud map according to the real-time pose information, and publishing the global point cloud map to ROS;
S05: acquiring map point information of ORB-SLAM3 and converting it into a point cloud to obtain a point cloud map, publishing the point cloud map data to ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map;
S06: publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network as topics through the ROS system for communication, and performing navigation using the ROS navigation package.
2. The scene understanding navigation method based on feature point method vision SLAM of claim 1, wherein the method of acquiring real-time pose information of a camera from ORB-SLAM3 and issuing by ROS in step S01 comprises:
S11: acquiring Sophus of the camera pose in SE3f format from TrackRGBD interface of ORB-SLAM 3;
S12: the pose of Sophus:SE 3f format is converted into cv:Mat format Tcw, ,/>For rotation matrix,/>T is a translation vector,/>As a function of the rotation matrix of the camera pose,/>A translation matrix function for the camera pose;
S13: converting the reverse pose Tcw of the camera into a positive pose Twc, and extracting a rotation matrix RWC and a translation vector tWC from the positive pose Twc;
S14: constructing a rotation matrix RWC as a quaternion q;
S15: the translation vector tWC and the rotation quaternion q of the camera are respectively copied to a variable of the ROS message type and used for issuing camera pose information.
3. The scene understanding navigation method based on feature point method vision SLAM of claim 1, wherein the method for synthesizing 3D point cloud in step S02 comprises:
S21: reading pixel points of the depth image to obtain a depth value d;
S22: calculating the three-dimensional coordinates of the current pixel point according to the camera internal parameters and the depth value d
The calculation method comprises the following steps:
Wherein: is the pixel point coordinate in the pixel coordinate system,/> Is the three-dimensional coordinate value of the pixel point p,/>,/>,/>Is an internal reference of the camera,/>Is a scaling factor;
S23: acquiring RGB color values of pixel points corresponding to the color image, and storing the RGB color values in b, g and r member variables of a point p;
s24: performing the above processing on each frame of image to generate a 3D point cloud, performing rotation and stitching operation on the generated point cloud, and converting the generated point cloud into a specified coordinate system;
S25: voxel filtering is performed on the point cloud.
4. The scene understanding navigation method based on feature point method vision SLAM of claim 1, wherein the method for acquiring the relative positional relationship between the robot and the object in step S03 comprises:
s31: deploying YOLOv-seg instance partition network by using a C++ version OpenCV-DNN module;
S32: writing the deployed YOLOv-seg into ROS nodes, and issuing detection and segmentation results;
S33: sending the RGB image to YOLOv-seg example segmentation network to obtain target detection results of each category;
s34: reading the coordinates of the rectangular frames of each category in the pixel coordinate system from the result, and solving the coordinates of the central points of the rectangular frames
S35: indexing the point cloud coordinate system according to the center point coordinates of each category in the pixel coordinate system to obtain the center point coordinates in the image pixel coordinate systemCorresponding point cloud coordinates/>
S36: calculating the coordinate transformation relation between the pose of the camera and the position of the center point of each type of object in the point cloud through tf coordinate transformation under the ROS;
s37: the coordinate transformation relationships are published through ROS and then subscribed to in rviz.
5. The scene understanding navigation method based on the feature point method visual SLAM according to claim 1, wherein the method for stitching into the global point cloud map in step S04 comprises:
S41: converting the acquired real-time pose of the camera into a transformation matrix from the camera to a world coordinate system;
S42: calculating the relative transformation between two adjacent frames of point clouds according to the transformation matrix;
S43: registering the point cloud in the current frame with the corresponding pose, and adding the registered point cloud into the global point cloud map.
6. The scene understanding navigation method based on feature point method vision SLAM of claim 1, wherein the method for obtaining the point cloud map in step S05 comprises:
S51: reading map point information in ORB-SLAM 3;
S52: creating sensor_msgs::PointCloud2 type point cloud data and initializing the point cloud data;
S53: traversing the map points, converting them into Eigen::Vector3d type, storing the converted coordinates in an array, copying the data in the array to the corresponding positions of the sensor_msgs::PointCloud2 point cloud data using the memcpy function, and publishing the point cloud data to ROS.
7. The scene understanding navigation method based on the feature point method vision SLAM according to claim 1, wherein the method of converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map in step S05 comprises:
s61: creating and initializing an empty OccupancyGrid data structure according to the range, resolution and grid map size of the point cloud;
S62: for each point cloud data, converting the coordinates of the point cloud data into grid units on OccupancyGrid, and marking the corresponding grid units as corresponding states, wherein the states comprise idle states, free states or unknown states;
s63: publishing the generated OccupancyGrid data to corresponding topics;
s64: creating and initializing a OccupancyMap data structure and setting basic parameters thereof;
s65: traversing each cell in OccupancyGrid, mapping cell information into OccupancyMap;
S66: and publishing the generated OccupancyMap data to corresponding topics.
8. The scene understanding navigation method based on feature point method vision SLAM of claim 1, wherein the method of performing navigation operation using ROS navigation pack in step S06 comprises:
S71: a mobile robot or a camera incrementally builds a grid map of the desired environment;
S72: after the establishment is completed, storing the grid map;
S73: using the move_base node of ROS, a navigation operation is performed at rviz.
9. A scene understanding navigation system based on feature point method vision SLAM, comprising:
The camera real-time pose acquisition and release module acquires real-time pose information of the camera from the ORB-SLAM3 and releases the real-time pose information through the ROS;
The 3D point cloud synthesis module synthesizes the RGB image and the depth image obtained by the depth camera into a 3D point cloud;
the scene understanding module is used for acquiring the relative position relation between the robot and the object through semantic and depth information acquired by the instance segmentation network;
the global point cloud map generation and release module is used for splicing each frame of point cloud in the 3D point cloud into a global point cloud map according to the real-time pose information and releasing the global point cloud map into the ROS;
the ORB-SLAM3 visual navigation map module is used for acquiring map point information of ORB-SLAM3, converting the map point information into point cloud to obtain a point cloud map, releasing the point cloud map data into ROS, and converting the point cloud map into a three-dimensional octree map and a two-dimensional grid map;
And the navigation module is used for publishing and subscribing the results of the ORB-SLAM3 module and the instance segmentation network in the form of topics through the ROS system for communication transmission, and performing navigation operation by using the ROS navigation package.
10. A computer storage medium having stored thereon a computer program, wherein the computer program when executed implements the scene understanding navigation method based on feature point method vision SLAM of any of claims 1-8.
CN202410317058.2A 2024-03-20 2024-03-20 Scene understanding navigation method and system based on feature point method vision SLAM Pending CN117906595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410317058.2A CN117906595A (en) 2024-03-20 2024-03-20 Scene understanding navigation method and system based on feature point method vision SLAM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410317058.2A CN117906595A (en) 2024-03-20 2024-03-20 Scene understanding navigation method and system based on feature point method vision SLAM

Publications (1)

Publication Number Publication Date
CN117906595A true CN117906595A (en) 2024-04-19

Family

ID=90686301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410317058.2A Pending CN117906595A (en) 2024-03-20 2024-03-20 Scene understanding navigation method and system based on feature point method vision SLAM

Country Status (1)

Country Link
CN (1) CN117906595A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264563A (en) * 2019-05-23 2019-09-20 武汉科技大学 A kind of Octree based on ORBSLAM2 builds drawing method
CN111814683A (en) * 2020-07-09 2020-10-23 北京航空航天大学 Robust visual SLAM method based on semantic prior and deep learning features
WO2022257801A1 (en) * 2021-06-09 2022-12-15 山东大学 Slam-based mobile robot mine scene reconstruction method and system
WO2023104207A1 (en) * 2021-12-10 2023-06-15 深圳先进技术研究院 Collaborative three-dimensional mapping method and system
CN115035260A (en) * 2022-05-27 2022-09-09 哈尔滨工程大学 Indoor mobile robot three-dimensional semantic map construction method
CN114898062A (en) * 2022-05-28 2022-08-12 广东工业大学 Map construction method and device based on SLAM in dynamic scene
CN115393538A (en) * 2022-08-22 2022-11-25 山东大学 Visual SLAM method and system for indoor dynamic scene based on deep learning
CN116030130A (en) * 2022-12-29 2023-04-28 西北工业大学 Hybrid semantic SLAM method in dynamic environment
CN116295412A (en) * 2023-03-01 2023-06-23 南京航空航天大学 Depth camera-based indoor mobile robot dense map building and autonomous navigation integrated method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈国军; 陈巍; 郁汉琪; 王涵立: "Research on autonomous navigation method for mobile robots based on the semantic ORB-SLAM2 algorithm" (基于语义ORB-SLAM2算法的移动机器人自主导航方法研究), Machine Tool & Hydraulics (机床与液压), no. 09, 15 May 2020 (2020-05-15) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination