CN110738667A - RGB-D SLAM method and system based on dynamic scene - Google Patents

RGB-D SLAM method and system based on dynamic scene

Info

Publication number
CN110738667A
Authority
CN
China
Prior art keywords
region
image
potential dynamic
dynamic
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910913318.1A
Other languages
Chinese (zh)
Inventor
吉长江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yingpu Technology Co Ltd
Original Assignee
Beijing Yingpu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yingpu Technology Co Ltd filed Critical Beijing Yingpu Technology Co Ltd
Priority to CN201910913318.1A
Publication of CN110738667A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/579Depth or shape recovery from multiple images from motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an RGB-D SLAM method and system based on dynamic scenes. The method comprises the steps of: adopting a semantic segmentation network model based on deep learning to determine a potential dynamic region of an image; adopting a motion consistency method to identify the potential dynamic region as a dynamic region; extracting ORB feature points in the potential dynamic region and the background region; and adopting an ICP algorithm to match the ORB feature points to obtain pose information of the robot, so as to initially optimize the pose of the robot.

Description

RGB-D SLAM method and system based on dynamic scene
Technical Field
The application relates to the technical field of robot positioning and mapping, and in particular to an RGB-D SLAM method and system based on dynamic scenes.
Background
Current SLAM (simultaneous localization and mapping) technology enables a robot to start from an unknown place in an unknown environment, localize its own position and posture through repeatedly observed map features (such as corners and columns) during motion, and incrementally build a map according to its own position, thereby achieving simultaneous localization and mapping.
A conventional SLAM system usually assumes that the surrounding environment of the robot is static. In practical application scenarios, however, the surrounding environment of the robot is often in dynamic change, which reduces the accuracy with which the robot locates its own position: objects in a dynamic environment may corrupt the mapping of the environment, causing the robot to produce an erroneous position estimate.
Existing positioning and mapping methods include an end-to-end visual odometry method based on a deep recurrent convolutional neural network, which combines a recurrent neural network with VO (visual odometry). Video clips or monocular image sequences are taken as input. In each time interval, the RGB image frames are preprocessed by subtracting the mean RGB value of the training set, and the image size is optionally adjusted to a multiple of 64. Two consecutive images are stacked to form a tensor, from which the deep RCNN learns how to extract motion information and estimate poses. The image tensor is fed into the CNN to obtain effective features for monocular VO and then passed to the RNN for sequence learning. In each time interval, each image pair passing through the network produces a pose estimate, and the VO system advances and estimates new poses as images are captured.
After the CNN, an RNN is adopted to perform learning on the key-frame sequence; that is, the motion model and the data-association model are implicitly modeled over the CNN feature sequence. An LSTM determines which previous hidden states to discard or retain when updating the current state, so that motion can be learned during pose estimation and associations between temporally distant images can be found and exploited. Two LSTM layers are stacked to construct a deep RNN, and based on the visual features obtained from the CNN, a pose estimate is output at each time interval. This prior-art positioning and mapping process proceeds along with the motion of the camera and the acquisition of images.
However, most existing VO (visual odometry) or SLAM systems assume that the environment is static; once a SLAM system operates in a complex dynamic environment, its performance may degrade, so that the robot cannot locate its own position accurately enough. Methods and systems capable of accurate localization in dynamic scenes are therefore urgently needed.
Disclosure of Invention
It is an object of the present application to overcome the above problems or to at least partially solve or mitigate the above problems.
According to one aspect of the present application, there is provided an RGB-D SLAM method based on dynamic scenes, the method comprising the following steps:
adopting a semantic segmentation network model based on deep learning to determine a potential dynamic region of the image;
determining, by a motion consistency method, whether the feature points of two continuous frame images correspond to each other, so as to judge whether the object to be identified in the potential dynamic region and in the background region of the images is consistent, and if it is consistent, identifying the potential dynamic region as a dynamic region;
respectively extracting ORB feature points in the potential dynamic region and the background region, if the region where the ORB feature points are located is the dynamic region, deleting the ORB feature points in the current image frame and the reference frame, and otherwise, keeping the ORB feature points in the potential dynamic region and the background region;
and matching the ORB feature points by adopting an ICP (Iterative Closest Point) algorithm to obtain pose information of the robot.
Optionally, the step of adopting the semantic segmentation network model based on deep learning to determine the potential dynamic region of the image includes the following sub-steps:
extracting the outline edge of the object to be recognized, and inputting the RGB image containing the outline edge into a Mask-RCNN model for semantic segmentation to obtain a Mask image of the object to be recognized;
carrying out contour restoration on a mask image of an object to be identified: constructing a contour feature of the object to be identified, wherein the contour feature comprises an edge centroid; obtaining coordinate values of the edge centroid, and calculating the distance between the contour edge point of the object to be recognized and the edge centroid; and if the distance is greater than a preset distance threshold value, removing the contour edge point to determine a potential dynamic area in the image.
Optionally, a canny edge algorithm is adopted to repair the mask image of the object to be identified.
Optionally, the identifying of the potential dynamic region as a dynamic region includes the following sub-steps:
calculating the optical flow value of each pixel in the image of the potential dynamic region, and obtaining the optical flow field of each point according to the optical flow values;
tracking sparse points inside and outside an object to be identified in a potential dynamic area by adopting a Lucas-Kanade optical flow method, and dividing a background area of an image according to an optical flow field of each point;
constructing a standardized histogram of the potential dynamic region and the background region, determining the range of each interval of the standardized histogram, and distributing all optical flow vectors to different clusters to form a plurality of bins; constructing motion vectors of the potential dynamic region and the background region according to the optical flow vectors in each bin;
and calculating cosine similarity of the motion vectors of the potential dynamic area and the background area, if the cosine similarity is greater than the measured motion state tolerance, determining that the potential dynamic area moves, and identifying the potential dynamic area as the dynamic area.
Optionally, the method further comprises the step of: performing global optimization on the pose information based on the closed-loop detection mode and the constraints among all frames of the image.
According to another aspect of the present application, there is provided an RGB-D SLAM system based on dynamic scenes, the system comprising a determination module, a recognition module, an extraction module, and an initial optimization module, wherein:
the determination module determines potential dynamic regions of the image based on the deep-learning semantic segmentation network model;
the identification module determines, by a motion consistency method, whether the points of two continuous frame images correspond to each other, judges whether the object to be identified in the potential dynamic region and in the background region of the images is consistent, and identifies the potential dynamic region as a dynamic region if it is consistent;
the extraction module performs the following operations: respectively extracting ORB feature points in the potential dynamic region and the background region, if the region where the ORB feature points are located is the dynamic region, deleting the ORB feature points in the current image frame and the reference frame, and otherwise, keeping the ORB feature points in the potential dynamic region and the background region;
and the initial optimization module adopts an ICP algorithm to match the ORB characteristic points so as to obtain the pose information of the robot.
Optionally, the determining module includes a semantic segmentation unit and a repair unit:
the semantic segmentation unit performs the following operations: extracting the outline edge of the object to be recognized, and inputting the RGB image containing the outline edge into a Mask-RCNN model for semantic segmentation to obtain a Mask image of the object to be recognized;
the repair unit performs the following operations: constructing a contour feature of the object to be identified, wherein the contour feature comprises an edge centroid; obtaining coordinate values of the edge centroid, and calculating the distance between the contour edge point of the object to be recognized and the edge centroid; and if the distance is greater than a preset distance threshold value, removing the contour edge point to determine a potential dynamic area in the image.
Optionally, the canny edge algorithm is adopted to repair the mask image of the object to be identified.
Optionally, the identification module includes an optical flow field obtaining unit, a background region obtaining unit, a construction unit, and a cosine similarity obtaining unit;
the optical flow field acquisition unit is used for calculating the optical flow value of each pixel in the image of the potential dynamic region and obtaining the optical flow field of each point according to the optical flow values;
the background area acquisition unit tracks sparse points inside and outside an object to be identified in a potential dynamic area by adopting a Lucas-Kanade optical flow method, and divides a background area of an image according to an optical flow field of each point;
the construction unit is used for constructing a standardized histogram of the potential dynamic region and the background region, determining the range of each interval of the standardized histogram, and distributing all optical flow vectors to different clusters to form a plurality of bins; and constructing motion vectors of the potential dynamic region and the background region according to the optical flow vectors in each bin;
the cosine similarity obtaining unit is used for calculating the cosine similarity of the motion vectors of the potential dynamic area and the background area, if the cosine similarity is larger than the measured motion state tolerance, the potential dynamic area is determined to move, and the potential dynamic area is identified as the dynamic area.
Optionally, the system further includes a global optimization module:
the global optimization module is used for carrying out global optimization on the pose information based on a closed loop detection mode and constraints among all frames of the image.
According to yet another aspect of the application, there is provided a computer electronic device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, the computer program being stored in a space for program code in the memory and, when executed by the processor, implementing the steps of any one of the above methods.
According to yet another aspect of the application, there is provided a computer-readable storage medium comprising a storage unit for program code, the storage unit being provided with a program for performing the steps of any one of the above methods, the program being executed by a processor.
According to yet another aspect of the application, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of any one of the above methods.
According to the present application, the potential dynamic region of the image is determined by the deep-learning semantic segmentation network model, so the image can be segmented accurately, and the dynamic region is identified among the potential dynamic regions by a motion consistency method. Even when the SLAM system is in a dynamic environment, the 3D motion track of each input image of the camera can be accurately estimated from the ORB feature points in the potential dynamic region and the background region, and the pose information of the robot can be accurately acquired.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
The detailed description of the specific embodiments of the present application will be presented by way of example and not limitation with reference to the accompanying figures in which like references indicate similar or analogous elements or parts.
FIG. 1 is a schematic flow chart of an RGB-D SLAM method based on dynamic scenes according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an RGB-D SLAM system based on dynamic scenes according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer-readable storage medium according to an embodiment of the application.
Detailed Description
Definition of Terms
To facilitate the description of the details of the present embodiment, the following terms are first defined.
1. ORB: Oriented FAST and Rotated BRIEF, currently one of the fastest and most stable feature point detection and extraction algorithms; many image stitching and target tracking techniques are implemented using ORB features.
2. RGB-D = RGB + Depth Map;
3. The RGB color model is an industry color standard: a wide range of colors is obtained by varying the three color channels of red (R), green (G) and blue (B) and superimposing them on one another, where RGB represents the colors of the red, green and blue channels. The standard covers almost all colors perceivable by human vision and is currently the most widely used color system.
4. The COCO dataset is a large image dataset for object detection, segmentation, human keypoint detection, semantic segmentation, subtitle generation, and the like.
5. Mask-RCNN model: Mask RCNN is a network framework based on Faster RCNN; a fully convolutional mask segmentation sub-network is added after the basic feature network, changing the original classification and regression detection tasks into classification, regression and segmentation detection tasks.
6. The Canny edge algorithm is a multi-stage edge detection algorithm; the contour edges of an image can be obtained with it, and the gradient limit of edge detection can be changed by setting a threshold.
7. Optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. Optical flow methods use the change of pixels in an image sequence in the time domain and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby calculate the motion information of objects between adjacent frames. Generally speaking, optical flow is generated by the movement of foreground objects in the scene, the motion of the camera, or the joint motion of both.
8. The Lucas-Kanade optical flow algorithm is a two-frame differential optical flow estimation algorithm based on the following three assumptions: the gray level of pixels in a region remains unchanged when the external light source is stable and the time interval Δt is small; the continuous change of time does not cause drastic change of the motion position of an object; and pixels adjacent on a surface in the scene have similar motion changes.
9. The ICP (Iterative Closest Point) algorithm consists of two parts, searching for corresponding points and solving for the pose; its aim is to find the matching relation between point sets, and the solved result is the translation and rotation between the two point sets.
RGB-D SLAM method based on dynamic scene
The experimental dataset used in this embodiment is from the TUM (Technical University of Munich, Germany) dataset, a large dataset containing RGB-D data and ground-truth data, whose aim is to establish a new benchmark for the evaluation of visual odometry and visual SLAM systems.
The TUM dataset contains color and depth images from a Microsoft Kinect sensor along the sensor's ground-truth trajectory, recorded at full frame rate (30 Hz) and sensor resolution (640×480). The ground-truth trajectory is obtained from a high-precision motion capture system with eight high-speed tracking cameras (100 Hz). The workflow of the RGB-D SLAM method based on dynamic scenes is described below.
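By way of illustration, the TUM sequences list the timestamped color and depth frames in plain-text index files (rgb.txt and depth.txt), and color and depth frames must be associated by timestamp before use. A minimal sketch, assuming that standard layout (function names here are illustrative, not part of the patent):

```python
# Minimal sketch: associate TUM RGB-D color and depth frames by timestamp.
# Assumes the standard TUM layout: rgb.txt / depth.txt, one
# "timestamp filename" pair per line, comment lines starting with '#'.

def read_file_list(path):
    """Read (timestamp, filename) pairs from a TUM index file."""
    pairs = []
    with open(path) as f:
        for line in f:
            if line.strip() and not line.startswith("#"):
                ts, name = line.split()[:2]
                pairs.append((float(ts), name))
    return pairs

def associate(rgb_list, depth_list, max_dt=0.02):
    """Pair each RGB frame with the depth frame closest in time (within max_dt seconds)."""
    matches = []
    for ts, rgb in rgb_list:
        dt, depth = min((abs(ts - td), d) for td, d in depth_list)
        if dt <= max_dt:
            matches.append((rgb, depth))
    return matches

# Example usage:
# matches = associate(read_file_list("rgb.txt"), read_file_list("depth.txt"))
```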
FIG. 1 is a flow chart of an RGB-D SLAM method based on dynamic scenes according to an embodiment of the present application. As shown in FIG. 1, the method comprises the following steps:
Step 100: determining potential dynamic regions of an image based on a deep-learning semantic segmentation network model;
Specifically, taking an indoor environment where a person is the main dynamic object as an example, images containing human activity in the TUM dataset may first be adopted, the person is then defined as the object to be identified, and each frame of the image is input into the deep-learning semantic segmentation network model to determine the dynamic or potentially dynamic regions in the image;
Preferably, the training samples of the deep-learning semantic segmentation network are from the COCO dataset, which is used to detect and classify images so as to determine the dynamic or potentially dynamic regions in the images.
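The patent does not prescribe a particular implementation of this segmentation step. A minimal sketch, assuming torchvision's COCO-pretrained Mask R-CNN and taking "person" as the only potentially dynamic class (both are assumptions, not requirements of the patent):

```python
# Sketch of step 100: per-frame potential dynamic (person) mask from a
# COCO-pretrained Mask R-CNN. torchvision is an assumption; the patent
# only specifies a Mask-RCNN model trained on the COCO dataset.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Newer torchvision versions prefer weights="DEFAULT" over pretrained=True.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

PERSON_CLASS_ID = 1  # COCO label id for "person"

def potential_dynamic_mask(rgb_image, score_thresh=0.5):
    """Return a boolean mask marking potentially dynamic (person) pixels, or None."""
    with torch.no_grad():
        pred = model([to_tensor(rgb_image)])[0]
    keep = (pred["labels"] == PERSON_CLASS_ID) & (pred["scores"] > score_thresh)
    if keep.sum() == 0:
        return None
    masks = pred["masks"][keep, 0] > 0.5   # binarize per-instance soft masks
    return masks.any(dim=0).numpy()        # union over all detected persons
```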
Specifically, step 100 includes the following sub-steps 110 and 120.
step 110, extracting the outline edge of the object to be recognized, inputting the RGB image containing the outline edge into a Mask-RCNN model for semantic segmentation to obtain a Mask image of the object to be recognized;
since useless points may still be detected around the object to be recognized after the RGB image including the contour edge is input to the Mask-RCNN model for semantic segmentation, contour refinement needs to be performed on the Mask image of the object to be recognized by a further step.
Step 120: performing contour restoration on the mask image of the object to be identified: constructing a contour feature of the object to be identified, wherein the contour feature comprises an edge centroid; obtaining the coordinate values of the edge centroid, and calculating the distance between each contour edge point of the object to be recognized and the edge centroid; if the distance is larger than a preset distance threshold, removing that contour edge point, so as to determine the potential dynamic region in the image;
Preferably, a Canny edge algorithm is adopted to detect the contour edges of the image and repair the mask image of the object to be identified.
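A minimal sketch of the contour repair in step 120, assuming OpenCV's Canny detector; the centroid computation and the way the distance threshold is applied are illustrative choices, since the patent only specifies an edge centroid and a preset distance threshold:

```python
# Sketch of step 120: repair the Mask-RCNN mask contour by discarding
# edge points that lie too far from the edge centroid.
import cv2
import numpy as np

def repair_mask_contour(mask, dist_thresh):
    """Keep only contour edge points within dist_thresh of the edge centroid."""
    edges = cv2.Canny(mask.astype(np.uint8) * 255, 100, 200)
    ys, xs = np.nonzero(edges)
    if xs.size == 0:
        return edges
    cx, cy = xs.mean(), ys.mean()          # edge centroid
    dist = np.hypot(xs - cx, ys - cy)      # distance of each edge point to centroid
    keep = dist <= dist_thresh
    repaired = np.zeros_like(edges)
    repaired[ys[keep], xs[keep]] = 255
    return repaired
```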
Step 200: determining, by a motion consistency method, whether the feature points of two continuous frame images correspond to each other, so as to judge whether the object to be identified in the potential dynamic region and in the background region of the images is consistent; if it is consistent, the potential dynamic region is identified as a dynamic region.
Preferably, after the potential dynamic region is determined in step 100, the motion consistency method adopted in this step, such as an optical flow method, can further be used to determine whether the object to be identified in the potential dynamic region and in the background region of the image is consistent; that is, spatial-temporal consistency of the object to be identified is assumed first, and then whether the points of two continuous frame images correspond to each other is determined, so as to judge whether the potential dynamic region and the background region are consistent.
Specifically, step 200 includes the following substeps:
Step 210: calculating the optical flow value of each pixel in the image of the potential dynamic region, and obtaining the optical flow field of each point according to the optical flow values;
Step 220: tracking sparse points inside and outside the object to be identified in the potential dynamic region by the Lucas-Kanade optical flow method, and dividing out the background region of the image according to the optical flow field of each point;
Step 230: constructing a standardized histogram of the potential dynamic region and the background region, determining the range of each interval of the standardized histogram, and distributing all optical flow vectors to different clusters to form a plurality of bins; and constructing motion vectors of the potential dynamic region and the background region according to the optical flow vectors in each bin;
The motion vectors of the potential dynamic region and the background region are constructed from the optical flow vectors in each bin through the following sub-steps 231 to 233:
Step 231: calculating the motion vector of each bin in the potential dynamic region:
H[R] = Σ_i m_i(R),
where m_i(R) is the magnitude of the i-th optical flow vector in the R-th bin, that is, the magnitude of an optical flow vector assigned to the R-th bin of the potential dynamic region;
Step 232: calculating the motion vector of each bin in the background region:
H[R'] = Σ_i m_i(R'),
where m_i(R') is the magnitude of the i-th optical flow vector in the R'-th bin, that is, the magnitude of an optical flow vector assigned to the R'-th bin of the background region;
Step 233: constructing the motion vector of the potential dynamic region from the motion vectors of the bins in the potential dynamic region, and constructing the motion vector of the background region from the motion vectors of the bins in the background region:
V_D = (H[1], H[2], H[3], ..., H[R]);
V_B = (H[1'], H[2'], H[3'], ..., H[R']);
where V_D and V_B are the motion vectors of the potential dynamic region and the background region, respectively; R is the bin index; H[R] is the motion vector of the R-th bin in the potential dynamic region, and H[R'] is the motion vector of the R'-th bin in the background region.
Step 240: calculating the cosine similarity of the motion vectors of the potential dynamic region and the background region; if the cosine similarity is greater than the measured motion-state tolerance γ, it is determined that the potential dynamic region moves, and the potential dynamic region is identified as a dynamic region.
cos Δ = (V_D · V_B) / (|V_D| |V_B|),
where cos Δ is the cosine similarity between the motion vectors of the potential dynamic region and the background region; D denotes the potential dynamic region and B the background region; V_D is the motion vector of the potential dynamic region and V_B that of the background region. In this embodiment, the measured motion-state tolerance γ is a preset threshold.
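A minimal sketch of steps 210 to 240, assuming OpenCV's pyramidal Lucas-Kanade tracker and binning of flow vectors by orientation; the orientation-based binning is an assumption, since the patent only says that all optical flow vectors are distributed to different clusters:

```python
# Sketch of steps 210-240: histogram motion vectors and cosine-similarity test.
import cv2
import numpy as np

def sparse_flow(prev_gray, gray, points):
    """Lucas-Kanade tracking; points is a float32 (N, 1, 2) array
    (e.g. from cv2.goodFeaturesToTrack). Returns (M, 2) flow vectors."""
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    ok = status.ravel() == 1
    return (nxt - points).reshape(-1, 2)[ok]

def motion_vector(flow, n_bins=8):
    """Normalized histogram: H[r] sums the flow magnitudes falling in bin r."""
    mag = np.hypot(flow[:, 0], flow[:, 1])
    ang = np.arctan2(flow[:, 1], flow[:, 0])                  # flow orientation
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    H = np.bincount(bins, weights=mag, minlength=n_bins)
    return H / (H.sum() + 1e-12)

def region_is_dynamic(flow_d, flow_b, gamma, n_bins=8):
    """Step 240: compare V_D and V_B; per the patent, cos Δ > γ flags motion."""
    vd, vb = motion_vector(flow_d, n_bins), motion_vector(flow_b, n_bins)
    cos_delta = vd @ vb / (np.linalg.norm(vd) * np.linalg.norm(vb) + 1e-12)
    return cos_delta > gamma
```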
Step 300: respectively extracting ORB feature points in the potential dynamic region and the background region; if the region where an ORB feature point is located is a dynamic region, the ORB feature point is ignored in the current image frame and the reference frame; otherwise, the ORB feature points in the potential dynamic region and the background region are kept.
Specifically, in this embodiment, a Kinect sensor may be used to obtain ORB feature points of the potential dynamic region and the background region for feature extraction, so as to determine whether the ORB feature points are damaged;
Moreover, the 3D motion track of each input image of the camera can be accurately estimated from the ORB feature points in the potential dynamic region and the background region; here, the reference frame refers to the image adjacent to the current frame.
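A minimal sketch of step 300, assuming OpenCV's ORB implementation; masking out the pixels of regions identified as dynamic is one illustrative way to realize the deletion of their feature points:

```python
# Sketch of step 300: extract ORB features only outside dynamic regions.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)

def extract_static_orb(gray, dynamic_mask):
    """Detect ORB keypoints everywhere except pixels flagged as dynamic.
    dynamic_mask: boolean array, True where the region was identified as dynamic."""
    detect_mask = np.where(dynamic_mask, 0, 255).astype(np.uint8)
    return orb.detectAndCompute(gray, detect_mask)
```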
Step 400: matching the ORB feature points by an ICP (Iterative Closest Point) algorithm to obtain the pose information of the robot;
In this embodiment, the error between the matched 3D points in world coordinates is minimized: key-point matching is performed on the ORB feature points, and the resulting errors between the 3D points are used to initially optimize the pose of the robot;
Specifically, first, the positions of the feature points in the potential dynamic region and the background region are determined;
then, carrying out fusion matching on the positions of the feature points, namely carrying out feature matching on the common feature points, reserving different feature points and removing noise points;
and measuring and minimizing the errors of the positions of the feature points so as to initially optimize the pose of the robot.
In another embodiment, after step 400 is completed, the RGB-D SLAM method based on dynamic scenes further includes step 500:
performing global optimization on the pose information based on a closed loop detection mode and constraints among all frames of the image;
that is, the pose information is globally optimized using the closed-loop detection mode and the constraints between all frames of the image, which are input to the back end of the SLAM system.
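A heavily simplified sketch of step 500, reducing the problem to an SE(2) pose graph solved with scipy for brevity (an assumption: a real RGB-D back end optimizes SE(3) poses, typically with a dedicated graph optimizer); loop-closure detections enter simply as extra relative-pose constraints:

```python
# Sketch of step 500: global pose-graph optimization with odometry and
# loop-closure constraints, in SE(2) (x, y, theta) for brevity.
import numpy as np
from scipy.optimize import least_squares

def relative(xi, xj):
    """Pose of xj expressed in the frame of xi."""
    c, s = np.cos(xi[2]), np.sin(xi[2])
    dx, dy = xj[0] - xi[0], xj[1] - xi[1]
    return np.array([c * dx + s * dy, -s * dx + c * dy, xj[2] - xi[2]])

def residuals(flat, edges):
    poses = flat.reshape(-1, 3)
    res = [poses[0]]                       # anchor the first pose (gauge fix)
    for i, j, z in edges:                  # z: measured relative pose i -> j
        r = relative(poses[i], poses[j]) - z
        r[2] = (r[2] + np.pi) % (2 * np.pi) - np.pi   # wrap the angle error
        res.append(r)
    return np.concatenate(res)

# edges: consecutive-frame odometry plus loop closures, e.g.
#   edges = [(0, 1, z01), (1, 2, z12), (2, 3, z23), (0, 3, z03_loop)]
# initial_poses: (N, 3) front-end estimates.
# result = least_squares(residuals, initial_poses.ravel(), args=(edges,))
# optimized = result.x.reshape(-1, 3)
```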
The method of this embodiment identifies potential dynamic regions of an image by utilizing the semantic information of the image, then verifies the consistency of the optical flow in the potential dynamic regions and the background region by a motion consistency method so as to identify dynamic regions among the potential dynamic regions, and applies the identified dynamic regions at the front end of the RGB-D SLAM system, so that the trajectory of the robot and the map can be tracked simultaneously.
RGB-D SLAM system based on dynamic scene
FIG. 2 is a schematic structural diagram of an RGB-D SLAM system based on dynamic scenes according to an embodiment of the present application. Referring to FIG. 2, the system includes a determination module, a recognition module, an extraction module, and an initial optimization module, wherein:
the determination module determines potential dynamic regions of the image based on the deep-learning semantic segmentation network model;
the identification module determines, by a motion consistency method, whether the points of two continuous frame images correspond to each other, judges whether the object to be identified in the potential dynamic region and in the background region of the images is consistent, and identifies the potential dynamic region as a dynamic region if it is consistent;
the extraction module performs the following operations: respectively extracting ORB feature points in the potential dynamic region and the background region, if the region where the ORB feature points are located is the dynamic region, deleting the ORB feature points in the current image frame and the reference frame, and otherwise, keeping the ORB feature points in the potential dynamic region and the background region;
and the initial optimization module adopts an ICP algorithm to match the ORB characteristic points so as to obtain the pose information of the robot.
In this embodiment, optionally, the determining module includes a semantic segmentation unit and a repair unit:
the semantic segmentation unit performs the following operations: extracting the outline edge of the object to be recognized, and inputting the RGB image containing the outline edge into a Mask-RCNN model for semantic segmentation to obtain a Mask image of the object to be recognized;
the repair unit performs the following operations: constructing a contour feature of the object to be identified, wherein the contour feature comprises an edge centroid; obtaining coordinate values of the edge centroid, and calculating the distance between the contour edge point of the object to be recognized and the edge centroid; and if the distance is greater than a preset distance threshold value, removing the contour edge point to determine a potential dynamic area in the image.
In this embodiment, optionally, the mask image of the object to be identified is repaired by using a canny edge algorithm.
In this embodiment, optionally, the identification module includes an optical flow field obtaining unit, a background region obtaining unit, a construction unit, and a cosine similarity obtaining unit;
the optical flow field acquisition unit is used for calculating the optical flow value of each pixel in the image of the potential dynamic region and obtaining the optical flow field of each point according to the optical flow values;
the background area acquisition unit tracks sparse points inside and outside an object to be identified in a potential dynamic area by adopting a Lucas-Kanade optical flow method, and divides a background area of an image according to an optical flow field of each point;
the construction unit is used for constructing a standardized histogram of the potential dynamic region and the background region, determining the range of each interval of the standardized histogram, and distributing all optical flow vectors to different clusters to form a plurality of bins; and constructing motion vectors of the potential dynamic region and the background region according to the optical flow vectors in each bin;
the cosine similarity obtaining unit is used for calculating the cosine similarity of the motion vectors of the potential dynamic area and the background area, if the cosine similarity is larger than the measured motion state tolerance, the potential dynamic area is determined to move, and the potential dynamic area is identified as the dynamic area.
In this embodiment, optionally, the system further includes a global optimization module: the global optimization module is used for carrying out global optimization on the pose information based on a closed loop detection mode and constraints among all frames of the image.
The system provided by this embodiment may execute the method provided by any one of the RGB-D SLAM methods based on a dynamic scene, and the detailed process is described in the method embodiment and is not described herein again.
An embodiment of the present application further provides a computing device. Referring to FIG. 3, the device comprises a memory 620, a processor 610, and a computer program stored in the memory 620 and executable by the processor 610; the computer program is stored in a space 630 for program code in the memory 620 and, when executed by the processor 610, implements the method steps 631 of any one of the above methods.
An embodiment of the present application further provides a computer-readable storage medium. Referring to FIG. 4, the computer-readable storage medium comprises a storage unit for program code, and the storage unit is provided with a program 631' which, when executed by a processor, performs the steps of any one of the above methods.
A computer program product containing instructions is also provided which, when run on a computer, causes the computer to perform the steps of any one of the above methods.
The computer instructions may be stored in a computer-readable storage medium, or transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer storage medium may be any available medium accessible by a computer, or a data storage device such as a server or a data center integrating one or more available media, such as a magnetic medium, an optical medium, or a semiconductor medium (e.g., a solid state disk).
It should further be appreciated that the exemplary units and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the exemplary components and steps have been described above generally in terms of their functionality.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An RGB-D SLAM method based on dynamic scenes, the method comprising the following steps:
adopting a semantic segmentation network model based on deep learning to determine a potential dynamic region of the image;
determining, by a motion consistency method, whether the feature points in two continuous frame images correspond to each other, so as to judge whether the object to be identified in the potential dynamic region and in the background region of the images is consistent, and if it is consistent, identifying the potential dynamic region as a dynamic region;
respectively extracting ORB feature points in a potential dynamic region and a background region, if the region where the ORB feature points are located is the dynamic region, deleting the ORB feature points in the current image frame and the reference frame, and otherwise, keeping the ORB feature points in the potential dynamic region and the background region;
and matching the ORB feature points by adopting an ICP (Iterative Closest Point) algorithm to obtain pose information of the robot.
2. The method of claim 1, wherein the adopting of the semantic segmentation network model based on deep learning to determine the potential dynamic region of the image comprises the following sub-steps:
extracting the outline edge of the object to be recognized, and inputting the RGB image containing the outline edge into a Mask-RCNN model for semantic segmentation to obtain a Mask image of the object to be recognized;
carrying out contour restoration on a mask image of an object to be identified: constructing a contour feature of the object to be identified, wherein the contour feature comprises an edge centroid; obtaining coordinate values of the edge centroid, and calculating the distance between the contour edge point of the object to be recognized and the edge centroid; and if the distance is greater than a preset distance threshold value, removing the contour edge point to determine a potential dynamic area in the image.
3. The method of claim 2, wherein: and repairing the mask image of the object to be identified by adopting a canny edge algorithm.
4. The method of claim 1, wherein:
the identifying of the potential dynamic region as a dynamic region comprises the following sub-steps:
calculating the optical flow value of each pixel in the image of the potential dynamic region, and obtaining the optical flow field of each point according to the optical flow values;
tracking sparse points inside and outside an object to be identified in a potential dynamic area by adopting a Lucas-Kanade optical flow method, and dividing a background area of an image according to an optical flow field of each point;
constructing a standardized histogram of the potential dynamic region and the background region, determining the range of each interval of the standardized histogram, and distributing all optical flow vectors to different clusters to form a plurality of bins; constructing motion vectors of the potential dynamic region and the background region according to the optical flow vectors in each bin;
and calculating cosine similarity of the motion vectors of the potential dynamic area and the background area, if the cosine similarity is greater than the measured motion state tolerance, determining that the potential dynamic area moves, and identifying the potential dynamic area as the dynamic area.
5. The method according to any one of claims 1 to 4, further comprising the step of:
and performing global optimization on the pose information based on a closed loop detection mode and constraints among all frames of the image.
6. An RGB-D SLAM system based on dynamic scenes, the system comprising a determination module, a recognition module, an extraction module, and an initial optimization module, wherein:
the determination module determines potential dynamic regions of the image based on the deep-learning semantic segmentation network model;
the identification module determines, by a motion consistency method, whether the points of two continuous frame images correspond to each other, judges whether the object to be identified in the potential dynamic region and in the background region of the images is consistent, and identifies the potential dynamic region as a dynamic region if it is consistent;
the extraction module performs the following operations: respectively extracting ORB feature points in the potential dynamic region and the background region, if the region where the ORB feature points are located is the dynamic region, deleting the ORB feature points in the current image frame and the reference frame, and otherwise, keeping the ORB feature points in the potential dynamic region and the background region;
and the initial optimization module adopts an ICP algorithm to match the ORB characteristic points so as to obtain the pose information of the robot.
7. The system of claim 6, wherein the determination module comprises a semantic segmentation unit and a repair unit:
the semantic segmentation unit performs the following operations: extracting the outline edge of the object to be recognized, and inputting the RGB image containing the outline edge into a Mask-RCNN model for semantic segmentation to obtain a Mask image of the object to be recognized;
the repair unit performs the following operations: constructing a contour feature of the object to be identified, wherein the contour feature comprises an edge centroid; obtaining coordinate values of the edge centroid, and calculating the distance between the contour edge point of the object to be recognized and the edge centroid; and if the distance is greater than a preset distance threshold value, removing the contour edge point to determine a potential dynamic area in the image.
8. The system according to claim 7, wherein the repairing unit adopts canny edge algorithm to repair the mask image of the object to be identified.
9. The system according to claim 6, wherein the identification module comprises an optical flow field acquisition unit, a background region acquisition unit, a construction unit and a cosine similarity acquisition unit;
the optical flow field acquisition unit is used for calculating the optical flow value of each pixel in the image of the potential dynamic region and obtaining the optical flow field of each point according to the optical flow values;
the background area acquisition unit tracks sparse points inside and outside an object to be identified in a potential dynamic area by adopting a Lucas-Kanade optical flow method, and divides a background area of an image according to an optical flow field of each point;
the construction unit is used for constructing a standardized histogram of the potential dynamic region and the background region, determining the range of each interval of the standardized histogram, and distributing all optical flow vectors to different clusters to form a plurality of bins; and constructing motion vectors of the potential dynamic region and the background region according to the optical flow vectors in each bin;
the cosine similarity obtaining unit is used for calculating the cosine similarity of the motion vectors of the potential dynamic area and the background area, if the cosine similarity is larger than the measured motion state tolerance, the potential dynamic area is determined to move, and the potential dynamic area is identified as the dynamic area.
10. The system according to any one of claims 6 to 9, wherein the system further comprises a global optimization module:
the global optimization module is used for carrying out global optimization on the pose information based on a closed loop detection mode and constraints among all frames of the image.
CN201910913318.1A 2019-09-25 2019-09-25 RGB-D SLAM method and system based on dynamic scene Pending CN110738667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910913318.1A CN110738667A (en) 2019-09-25 2019-09-25 RGB-D SLAM method and system based on dynamic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910913318.1A CN110738667A (en) 2019-09-25 2019-09-25 RGB-D SLAM method and system based on dynamic scene

Publications (1)

Publication Number Publication Date
CN110738667A 2020-01-31

Family

ID=69269616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910913318.1A Pending CN110738667A (en) 2019-09-25 2019-09-25 RGB-D SLAM method and system based on dynamic scene

Country Status (1)

Country Link
CN (1) CN110738667A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709982A (en) * 2020-05-22 2020-09-25 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN112381841A (en) * 2020-11-27 2021-02-19 广东电网有限责任公司肇庆供电局 Semantic SLAM method based on GMS feature matching in dynamic scene
CN112529934A (en) * 2020-12-02 2021-03-19 北京航空航天大学杭州创新研究院 Multi-target tracking method and device, electronic equipment and storage medium
CN112884835A (en) * 2020-09-17 2021-06-01 中国人民解放军陆军工程大学 Visual SLAM method for target detection based on deep learning
WO2022217794A1 (en) * 2021-04-12 2022-10-20 深圳大学 Positioning method of mobile robot in dynamic environment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658449A (en) * 2018-12-03 2019-04-19 华中科技大学 A kind of indoor scene three-dimensional rebuilding method based on RGB-D image

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658449A (en) * 2018-12-03 2019-04-19 华中科技大学 A kind of indoor scene three-dimensional rebuilding method based on RGB-D image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LILI ZHAO et al.: "A Compatible Framework for RGB-D SLAM in Dynamic Scenes", IEEE ACCESS *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709982A (en) * 2020-05-22 2020-09-25 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN112884835A (en) * 2020-09-17 2021-06-01 中国人民解放军陆军工程大学 Visual SLAM method for target detection based on deep learning
CN112381841A (en) * 2020-11-27 2021-02-19 广东电网有限责任公司肇庆供电局 Semantic SLAM method based on GMS feature matching in dynamic scene
CN112529934A (en) * 2020-12-02 2021-03-19 北京航空航天大学杭州创新研究院 Multi-target tracking method and device, electronic equipment and storage medium
CN112529934B (en) * 2020-12-02 2023-12-19 北京航空航天大学杭州创新研究院 Multi-target tracking method, device, electronic equipment and storage medium
WO2022217794A1 (en) * 2021-04-12 2022-10-20 深圳大学 Positioning method of mobile robot in dynamic environment

Similar Documents

Publication Publication Date Title
JP6095018B2 (en) Detection and tracking of moving objects
CN110738667A (en) RGB-D SLAM method and system based on dynamic scene
Crivellaro et al. Robust 3D tracking with descriptor fields
Greene et al. Multi-level mapping: Real-time dense monocular slam
CN108198201A (en) A kind of multi-object tracking method, terminal device and storage medium
WO2015017539A1 (en) Rolling sequential bundle adjustment
US10268929B2 (en) Method and device for generating binary descriptors in video frames
US10249046B2 (en) Method and apparatus for object tracking and segmentation via background tracking
KR20110021500A (en) Method for real-time moving object tracking and distance measurement and apparatus thereof
Tang et al. Fmd stereo slam: Fusing mvg and direct formulation towards accurate and fast stereo slam
Zhang et al. An optical flow based moving objects detection algorithm for the UAV
Wientapper et al. Composing the feature map retrieval process for robust and ready-to-use monocular tracking
CN115511970B (en) Visual positioning method for autonomous parking
JP2014102805A (en) Information processing device, information processing method and program
Suttasupa et al. Plane detection for Kinect image sequences
Yu et al. Accurate motion detection in dynamic scenes based on ego-motion estimation and optical flow segmentation combined method
Lee et al. Multisensor fusion-based object detection and tracking using active shape model
Ahn et al. Human tracking and silhouette extraction for human–robot interaction systems
CN110910418B (en) Target tracking algorithm based on rotation invariance image feature descriptor
Zhou et al. Speeded-up robust features based moving object detection on shaky video
Rong et al. IMU-Assisted Online Video Background Identification
Talouki et al. An introduction to various algorithms for video completion and their features: a survey
Dargazany Human body parts tracking: Applications to activity recognition
Raju et al. Motion detection and optical flow
Hu et al. Research on a line-expanded visual odometry in dynamic environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200131