CN116524026B - Dynamic vision SLAM method based on frequency domain and semantics - Google Patents
- Publication number
- CN116524026B (application CN202310505675.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/254—Analysis of motion involving subtraction of images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/30—Determination of transform parameters for the alignment of images, i.e. image registration
- G06T7/33—Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/30—Determination of transform parameters for the alignment of images, i.e. image registration
- G06T7/37—Determination of transform parameters for the alignment of images, i.e. image registration using transform domain methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/579—Depth or shape recovery from multiple images from motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a dynamic SLAM method based on the frequency domain and semantics, used to complete localization and mapping tasks in highly dynamic environments with complex illumination. First, to accurately obtain the motion region of an object, images are registered in the frequency domain using the Fourier-Mellin transform to compensate for camera motion, and an inter-frame difference algorithm is then applied to obtain a motion mask. At the same time, the image is semantically segmented by a Short-Term Dense Concatenate (STDC) network to obtain a mask of potentially moving objects. The motion mask and the object mask are combined to obtain the final object motion region, and feature points falling in this region are eliminated. Finally, tracking and optimization are performed on the remaining stable static feature points, improving pose accuracy. Test results on a public dataset and in a real environment show that the method achieves good localization accuracy and robustness in complex dynamic scenes and effectively reduces the influence of motion blur and illumination change on motion detection.
Description
(I) Technical field
The invention belongs to the field of computer vision and relates to simultaneous localization and mapping (SLAM), in particular to a dynamic visual SLAM method based on the frequency domain and semantics.
(II) background art
Simultaneous localization and mapping (SLAM) refers to building a map of the surrounding environment from sensor data in real time, without any prior knowledge, while inferring the sensor's own pose within that map. SLAM based on visual sensors is known as visual SLAM (VSLAM). With the availability of RGB-D cameras offering fast acquisition, rich information, and relatively low price, VSLAM has been widely applied in many fields.
Over the past 30 years, many scholars have studied SLAM and produced notable systems such as ORB-SLAM2 and RGBD-SLAM-V2. However, most conventional SLAM systems rest on the assumption of a static environment, while dynamic objects inevitably appear in real operating environments; the feature points on such objects are unstable and interfere with SLAM, degrading its performance. In a feature-based SLAM system, tracking unstable feature points severely affects pose estimation, leading to large trajectory errors or even system failure. Performance degradation and lack of robustness in dynamic scenes have therefore become major obstacles to practical application.
In the paper 'Visual simultaneous localization and mapping based on semantic and optical-flow constraints in dynamic scenes', semantic and optical-flow information is used to eliminate the feature points of dynamic objects in the scene, reducing their interference with SLAM and thereby improving its accuracy and robustness. However, the optical-flow method rests on the brightness-constancy assumption and cannot be applied to scenes with changing illumination. In the paper 'Research on mapping technology for indoor mobile robots based on dynamic target detection', epipolar constraints are used to screen dynamic feature points, and semantic information together with these feature points is used to filter out the dynamic parts of the scene, improving the accuracy of pose estimation. The epipolar constraint, however, assumes that static regions occupy the overwhelming majority of the scene, which does not hold in most dynamic scenes, especially those with motion blur. The present invention also uses a deep-learning method to acquire semantic information, but mainly improves the motion-detection algorithm: images are registered by the Fourier-Mellin transform before motion detection, so the method remains robust under severe illumination change and motion blur.
To address the lack of robustness of the prior art under severe illumination change and motion blur, the invention provides a dynamic visual SLAM method based on the frequency domain and semantics, which effectively improves the accuracy and robustness of SLAM in such environments.
(III) summary of the invention
The invention exploits the unique advantages of the Fourier-Mellin transform in image registration, combines it with an inter-frame difference (Temporal Difference, TD) algorithm to realize a highly robust motion-detection algorithm, and integrates this with ORB-SLAM2 and the STDC semantic-segmentation network to obtain a visual SLAM algorithm for dynamic scenes based on the Fourier-Mellin transform. First, to accurately obtain the motion region of an object, the Fourier-Mellin transform is used for registration to compensate for camera motion, after which an inter-frame difference algorithm yields a motion mask. At the same time, the image is passed through the STDC semantic-segmentation network to obtain a mask of potentially moving objects. The motion mask and the object mask are combined to obtain the final object motion region, and the feature points falling in this region are eliminated. Finally, tracking and optimization on the stable static feature points improve pose accuracy.
In order to achieve the above purpose, the invention adopts the following technical scheme:
s1, acquiring an input image sequence comprising RGB images and corresponding depth images;
s2, extracting ORB feature points from the RGB image of the input frame, specifically comprising the following substeps:
s21, converting the input RGB image into a grayscale image;
s22, initializing the image pyramid parameters, including the number of feature points to extract, the pyramid scaling factor, the number of pyramid layers, the number of feature points pre-allocated to each layer, the initial FAST feature-point extraction parameters, and so on;
s23, constructing the image pyramid, scaling each pyramid layer during construction and padding its borders;
s24, traversing the images of all pyramid layers, dividing each image into a grid, and calling OpenCV functions within each grid cell to extract FAST corners;
s25, culling the feature points with an octree method according to the number pre-allocated to each layer, and computing the orientation of each feature point by the gray centroid method;
s26, waiting for motion detection to finish in order to obtain the moving-object region;
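The gray centroid method of step S25 can be sketched as follows (a minimal NumPy illustration, not the patent's implementation; the function name `patch_orientation` and the square-patch simplification are assumptions, as ORB proper uses a circular patch):

```python
import numpy as np

def patch_orientation(patch: np.ndarray) -> float:
    """Orientation of a feature point by the gray (intensity) centroid method.

    Moments m10 and m01 locate the intensity centroid relative to the
    patch centre; the angle from centre to centroid is the orientation.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # coordinates relative to the patch centre
    xs = xs - (w - 1) / 2.0
    ys = ys - (h - 1) / 2.0
    m10 = float(np.sum(xs * patch))
    m01 = float(np.sum(ys * patch))
    return float(np.arctan2(m01, m10))  # radians

# A patch brighter on its right half should point along +x (angle near 0).
patch = np.zeros((7, 7))
patch[:, 4:] = 1.0
theta = patch_orientation(patch)
```

Keypoints rotated by this angle make the subsequent BRIEF-style descriptor rotation-invariant, which is why ORB computes it per feature point.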
s3, taking the input current frame as the image to be registered and the previous frame as the reference image, and registering the two in the frequency domain using the Fourier-Mellin transform, specifically comprising the following substeps:
s31, converting the RGB reference image and the RGB image to be registered into grayscale images;
s32, performing the discrete Fourier transform on both grayscale images and applying a high-pass filter to the resulting frequency-domain images;
s33, applying a log-polar transformation to the high-pass-filtered frequency-domain images and feeding the transformed images into the phase-correlation step to obtain response coordinates (x, y);
s34, converting the response coordinates (x, y) obtained from the phase-correlation step into a rotation angle θ and a scale factor s, and rotating and scaling the image to be registered accordingly;
s35, feeding the rotated and scaled image to be registered together with the reference image into the phase-correlation step again to obtain response coordinates (x, y), and translating the image to be registered accordingly to obtain the final registered image;
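The phase-correlation core used in steps S33 and S35 can be sketched with NumPy FFTs (a translation-only sketch; the full Fourier-Mellin pipeline additionally applies high-pass filtering and log-polar resampling so that rotation and scale also appear as translations; `phase_correlate` is an illustrative name):

```python
import numpy as np

def phase_correlate(f0: np.ndarray, f1: np.ndarray):
    """Recover the (dy, dx) translation between two same-sized images.

    The normalized cross-power spectrum keeps only phase information;
    its inverse FFT peaks at the translation offset (response coordinates).
    """
    F0 = np.fft.fft2(f0)
    F1 = np.fft.fft2(f1)
    cross = F0 * np.conj(F1)
    cross /= np.abs(cross) + 1e-12          # keep phase only
    response = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    # wrap offsets larger than half the image into negative shifts
    h, w = f0.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

rng = np.random.default_rng(0)
img = rng.random((64, 64))
shifted = np.roll(img, shift=(5, -3), axis=(0, 1))
dy, dx = phase_correlate(shifted, img)
```

Because the phase term is independent of image contrast, this peak search is far less sensitive to illumination change than direct intensity matching, which is the registration property the method relies on.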
s4, performing motion detection between the registered image and the previous frame by the inter-frame difference method, and removing noise through thresholding, edge detection, contour clustering, and related operations, specifically comprising the following substeps:
s41, feeding the registered image and the previous frame together into the inter-frame difference module to obtain a difference image, where the inter-frame difference formula is:

D_i(x, y) = | f_i(x, y) - f_{i+1}(x, y) |

where D_i(x, y) is the i-th frame difference image, f_i(x, y) is the i-th frame grayscale image, and f_{i+1}(x, y) is the registered (i+1)-th frame grayscale image;
s42, thresholding the difference image according to:

R_i(x, y) = 255 if D_i(x, y) > T, and R_i(x, y) = 0 otherwise

where R_i(x, y) is the i-th frame threshold map and T = 40 is the binarization threshold; that is, pixels of the difference image with value greater than 40 are set to 255 and pixels with value less than 40 are set to 0, giving the threshold map;
s43, applying a Canny edge detection operator to the threshold map to perform edge detection to obtain an edge mask;
s44, fitting oriented rectangles to the contours of the edge mask, computing the aspect ratio of each rectangle, classifying a rectangle as an afterimage (ghost) if its aspect ratio is smaller than 0.1, and setting the pixels of afterimage regions to 0 to eliminate them, yielding the final motion mask;
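Steps S41, S42, and the aspect-ratio test of S44 can be sketched as follows (NumPy only, with an axis-aligned bounding box standing in for the patent's oriented rectangle fit; the Canny edge detection and contour extraction of S43, omitted here, would typically come from OpenCV):

```python
import numpy as np

T = 40  # binarization threshold from step S42

def motion_mask(prev_gray: np.ndarray, reg_gray: np.ndarray) -> np.ndarray:
    """Inter-frame difference D_i = |f_i - f_{i+1}| followed by thresholding."""
    diff = np.abs(prev_gray.astype(np.int16) - reg_gray.astype(np.int16))
    return np.where(diff > T, 255, 0).astype(np.uint8)

def is_ghost(mask_region: np.ndarray, min_aspect: float = 0.1) -> bool:
    """Step S44's afterimage test, simplified to an axis-aligned box:
    very elongated responses (aspect ratio below 0.1) are treated as
    registration ghosts rather than real moving objects."""
    ys, xs = np.nonzero(mask_region)
    if len(ys) == 0:
        return True
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    return min(h, w) / max(h, w) < min_aspect

prev = np.zeros((32, 32), dtype=np.uint8)
curr = prev.copy()
curr[10:20, 10:20] = 200            # a moving block appears
mask = motion_mask(prev, curr)
```

The aspect-ratio test exploits the fact that residual misalignment after registration leaves thin strips along object edges, whereas genuinely moving regions are roughly as tall as they are wide.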
s5, feeding the RGB image of the input frame into a Short-Term Dense Concatenate (STDC) network for semantic segmentation to obtain an object mask carrying object semantic information;
s6, judging object motion by combining the object mask obtained from the STDC network with the motion mask, specifically comprising the following substeps:
s61, for the object mask and the motion mask, calculating the motion probability ρ_i of each object as:

ρ_i = m_i / M_i

where M_i is the total number of pixels of the i-th object in the object mask and m_i is the number of pixels in the corresponding region of the motion mask;
s62, setting a threshold ε = 0.1: if the motion probability ρ_i exceeds ε, the object is regarded as moving, otherwise as static; setting the pixels of static-object regions to 0 yields a prior dynamic-object mask;
s63, fusing the prior dynamic object mask and the motion mask to obtain a final dynamic object mask;
s64, feeding the dynamic-object mask into step S26 and eliminating the feature points that fall in the dynamic-object region according to the mask.
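The object-motion decision of steps S61 to S64 reduces to ρ_i = m_i / M_i compared against ε = 0.1; a minimal sketch, assuming boolean masks and (x, y) keypoint tuples (all names illustrative):

```python
import numpy as np

EPSILON = 0.1  # motion-probability threshold from step S62

def is_dynamic(object_mask: np.ndarray, motion_mask: np.ndarray) -> bool:
    """rho_i = m_i / M_i: fraction of the object's pixels flagged as moving."""
    M_i = int(np.count_nonzero(object_mask))
    if M_i == 0:
        return False
    m_i = int(np.count_nonzero(object_mask & motion_mask))
    return m_i / M_i > EPSILON

def cull_features(keypoints, dynamic_mask: np.ndarray):
    """Step S64: drop feature points that fall inside the dynamic region."""
    return [(x, y) for (x, y) in keypoints if not dynamic_mask[y, x]]

obj = np.zeros((8, 8), dtype=bool); obj[2:6, 2:6] = True   # 16-pixel object
mov = np.zeros((8, 8), dtype=bool); mov[2:6, 2:4] = True   # 8 of them moving
dyn = is_dynamic(obj, mov)          # rho = 8/16 = 0.5 > 0.1
kept = cull_features([(3, 3), (0, 0)], obj)
```

Requiring only a 10% overlap means an object is discarded as soon as any substantial part of it moves, which is conservative in favour of pose accuracy over feature count.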
The invention has the following beneficial effects:
(1) The invention registers images through an improved Fourier-Mellin transform to realize motion compensation and obtains a motion mask by inter-frame difference, reducing the influence of motion blur and illumination change on motion detection;
(2) The invention combines motion detection and semantic segmentation into a dynamic feature-point filtering method that effectively eliminates the interference of dynamic objects with pose estimation and mapping;
(3) Compared with conventional dynamic SLAM, the invention obtains better results in highly dynamic environments. On the high-dynamic sequences, the absolute trajectory error of the invention is reduced on average by more than 95% relative to ORB-SLAM2 and by more than 30% relative to DS-SLAM, showing higher accuracy and robustness in dynamic environments.
(IV) description of the drawings
FIG. 1 is a general flow diagram of a SLAM system;
FIG. 2 is a flowchart of Fourier-Mellin transform image registration;
FIG. 3 is an example of image registration;
FIG. 4 is a flow chart of motion detection;
FIG. 5 is an exemplary diagram of motion detection;
FIG. 6 is a graph of mask extraction effect under motion blur;
FIG. 7 is a graph of the mask extraction effect under illumination change.
(V) detailed description of the invention
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and test examples. It should be understood that the specific embodiments described here serve only to illustrate the invention and are not intended to limit its scope. The overall flow chart of the system is shown in FIG. 1.
s1, acquiring an input image sequence comprising RGB images and corresponding depth images;
s2, extracting ORB feature points from the RGB image of the input frame, specifically comprising the following substeps:
s21, converting the input RGB image into a grayscale image;
s22, initializing the image pyramid parameters, including the number of feature points to extract, the pyramid scaling factor, the number of pyramid layers, the number of feature points pre-allocated to each layer, the initial FAST feature-point extraction parameters, and so on;
s23, constructing the image pyramid, scaling each pyramid layer during construction and padding its borders;
s24, traversing the images of all pyramid layers, dividing each image into a grid, and calling OpenCV functions within each grid cell to extract FAST corners;
s25, culling the feature points with an octree method according to the number pre-allocated to each layer, and computing the orientation of each feature point by the gray centroid method;
s26, waiting for motion detection to finish in order to obtain the moving-object region;
s3, taking the input current frame as the image to be registered and the previous frame as the reference image, and registering the two in the frequency domain using the Fourier-Mellin transform; the image-registration flow chart is shown in FIG. 2, and the step specifically comprises the following substeps:
s31, converting the RGB reference image and the RGB image to be registered into grayscale images;
s32, performing the discrete Fourier transform on both grayscale images and applying a high-pass filter to the resulting frequency-domain images;
s33, applying a log-polar transformation to the high-pass-filtered frequency-domain images and feeding the transformed images into the phase-correlation step to obtain response coordinates (x, y);
s34, converting the response coordinates (x, y) obtained from the phase-correlation step into a rotation angle θ and a scale factor s, and rotating and scaling the image to be registered accordingly;
s35, feeding the rotated and scaled image to be registered together with the reference image into the phase-correlation step again to obtain response coordinates (x, y), and translating the image to be registered accordingly to obtain the final registered image; an example of image registration is shown in FIG. 3;
s4, performing motion detection between the registered image and the previous frame by the inter-frame difference method, and removing noise through thresholding, edge detection, contour clustering, and related operations; the motion-detection flow chart is shown in FIG. 4, and the step specifically comprises the following substeps:
s41, feeding the registered image and the previous frame together into the inter-frame difference module to obtain a difference image, where the inter-frame difference formula is:

D_i(x, y) = | f_i(x, y) - f_{i+1}(x, y) |

where D_i(x, y) is the i-th frame difference image, f_i(x, y) is the i-th frame grayscale image, and f_{i+1}(x, y) is the registered (i+1)-th frame grayscale image;
s42, thresholding the difference image according to:

R_i(x, y) = 255 if D_i(x, y) > T, and R_i(x, y) = 0 otherwise

where R_i(x, y) is the i-th frame threshold map and T = 40 is the binarization threshold; that is, pixels of the difference image with value greater than 40 are set to 255 and pixels with value less than 40 are set to 0, giving the threshold map;
s43, applying a Canny edge detection operator to the threshold map to perform edge detection to obtain an edge mask;
s44, fitting oriented rectangles to the contours of the edge mask, computing the aspect ratio of each rectangle, classifying a rectangle as an afterimage (ghost) if its aspect ratio is smaller than 0.1, and setting the pixels of afterimage regions to 0 to eliminate them, yielding the final motion mask; an example of motion detection is shown in FIG. 5;
s5, feeding the RGB image of the input frame into a Short-Term Dense Concatenate (STDC) network for semantic segmentation to obtain an object mask carrying object semantic information;
s6, judging object motion by combining the object mask obtained from the STDC network with the motion mask, specifically comprising the following substeps:
s61, for the object mask and the motion mask, calculating the motion probability ρ_i of each object as:

ρ_i = m_i / M_i

where M_i is the total number of pixels of the i-th object in the object mask and m_i is the number of pixels in the corresponding region of the motion mask;
s62, setting a threshold ε = 0.1: if the motion probability ρ_i exceeds ε, the object is regarded as moving, otherwise as static; setting the pixels of static-object regions to 0 yields a prior dynamic-object mask;
s63, fusing the prior dynamic object mask and the motion mask to obtain a final dynamic object mask;
s64, feeding the dynamic-object mask into step S26 and eliminating the feature points that fall in the dynamic-object region according to the mask.
The present invention uses the absolute trajectory error (ATE) and relative pose error (RPE) to evaluate method performance, with the root mean square error (RMSE) and standard deviation (SD) as evaluation indicators. Performance on the TUM dataset is shown in Tables 1 and 2.
TABLE 1
TABLE 2
As can be seen from Tables 1 and 2, on the high-dynamic sequences the RMSE and SD of the absolute trajectory error of the invention are significantly better than those of ORB-SLAM2 and DS-SLAM, indicating higher accuracy and a more compact error distribution. The improvement is most pronounced on the fr3/w/xyz and fr3/w/half sequences. On fr3/w/xyz, the RMSE and SD of the invention are reduced by about 98.39% and 97.79% relative to ORB-SLAM2, and by about 35.33% and 42.94% relative to DS-SLAM. On fr3/w/half, they are reduced by about 97.71% and 97.13% relative to ORB-SLAM2 and by about 39.91% and 53.37% relative to DS-SLAM. This shows that the invention is more robust than ORB-SLAM2 and DS-SLAM in dynamic scenes.
On the low-dynamic sequences, however, the RMSE and SD of the absolute trajectory error of the invention are reduced by only 17.86% and 15.82% relative to ORB-SLAM2, and are 7.80% and 3.98% higher than those of DS-SLAM. The reason is that in low-dynamic sequences the potentially moving object is stationary part of the time: its feature points are used for localization while it is stationary but removed once it moves, which affects the accuracy of the subsequent global optimization. In addition, although high-pass filtering and edge-mask construction are used to suppress the environmental noise caused by registration error, this noise is difficult to eliminate completely in some cases, especially under severe camera motion, which is another reason the invention is slightly less accurate than the DS-SLAM algorithm on low-dynamic sequences.
Compared with ORB-SLAM2 and the classical DS-SLAM, the method significantly improves localization accuracy in dynamic scenes. For low-dynamic scenes, the accuracy of the invention improves by about 15% over ORB-SLAM2. For high-dynamic scenes the effect is more pronounced: the accuracy improves consistently by more than 95% over ORB-SLAM2 and by more than 30% over DS-SLAM. These results show that the method accurately eliminates the interference of dynamic targets and thereby reduces pose error during optimization. Because the invention refines the dynamic-mask extraction strategy, it does not require the assumption that static regions dominate the scene or the region of interest, unlike DS-SLAM, which obtains dynamic feature points by computing outliers with RANSAC. As shown in FIGS. 6 and 7, the invention can accurately extract the dynamic region even when motion blur covers a large part of the image or the scene undergoes severe illumination change.
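The ATE statistics cited above can be computed from trajectories as follows (a minimal sketch of RMSE and SD over per-frame translational errors; a real evaluation would first align the estimated trajectory to ground truth, e.g. with a Horn or Umeyama fit, which is omitted here):

```python
import numpy as np

def ate_rmse_sd(gt: np.ndarray, est: np.ndarray):
    """Absolute trajectory error: per-frame translational error norms,
    summarized by root mean square error and standard deviation.

    gt, est: (N, 3) arrays of corresponding, pre-aligned positions.
    """
    errors = np.linalg.norm(gt - est, axis=1)     # one error per frame
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    sd = float(np.std(errors))
    return rmse, sd

gt = np.zeros((5, 3))
est = np.full((5, 3), 0.1)          # constant 0.1 m offset on every axis
rmse, sd = ate_rmse_sd(gt, est)
```

RMSE captures the overall error magnitude, while SD captures how tightly the errors cluster, which is why the text reports both to argue for a "more compact error distribution".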
The above embodiments further illustrate the objects, technical solutions and advantageous effects of the present invention, and the above examples are only for illustrating the technical solutions of the present invention, but not for limiting the scope of protection of the present invention, and it should be understood by those skilled in the art that modifications, equivalents and alternatives to the technical solutions of the present invention are included in the scope of protection of the present invention.
Claims (1)
1. A dynamic visual SLAM method based on the frequency domain and semantics, characterized by comprising the following steps:
s1, acquiring an input image sequence, wherein the input image sequence comprises RGB images and corresponding depth images;
s2, extracting ORB feature points from the RGB image of the input frame, specifically comprising the following substeps:
s21, converting the input RGB image into a gray scale image;
s22, initializing image pyramid parameters, wherein the parameters comprise the number of extracted feature points, pyramid scaling factors, pyramid layers, the number of feature points pre-allocated for each layer and the extraction parameters of initial FAST feature points;
s23, constructing an image pyramid, scaling each layer of pyramid image in the construction process, and filling the periphery;
s24, traversing the images of all pyramid layers, gridding each image, and calling opencv functions in the grids to extract FAST corner points;
s25, culling the feature points with an octree method according to the number pre-allocated to each layer, and computing the orientation of each feature point by the gray centroid method;
s26, waiting for the completion of the motion detection to acquire a moving object region;
s3, taking the input current frame as the image to be registered and the previous frame as the reference image, and registering the two in the frequency domain using the Fourier-Mellin transform, specifically comprising the following substeps:
s31, converting RGB images of an input registration image and an image to be registered into a gray scale image;
s32, performing discrete Fourier transform on the registered gray level images, and performing high-pass filtering on the frequency domain images subjected to the discrete Fourier transform;
s33, carrying out logarithmic polar coordinate transformation on the frequency domain diagram after high-pass filtering, and inputting the image subjected to the logarithmic polar coordinate transformation into a phase correlation step to obtain response coordinates (x, y);
s34, carrying out coordinate transformation on response coordinates (x, y) obtained in the phase correlation step to obtain a rotation angle theta and a scale factor S, and carrying out rotation and scaling on the image to be registered according to the rotation angle theta and the scale factor S;
s35, inputting the rotated and scaled image to be registered and the registration image into the phase correlation step again to obtain response coordinates (x, y), and translating the image to be registered according to the response coordinates (x, y) to obtain a final registration image;
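The phase correlation step used in S33 and S35 can be sketched in NumPy as follows (a minimal, translation-only sketch; the full Fourier-Mellin pipeline additionally applies the high-pass filter and log-polar resampling of S32-S34, which are omitted here):

```python
import numpy as np

def phase_correlation(ref, mov):
    """Return the (dy, dx) circular shift that aligns `mov` with `ref`,
    i.e. np.roll(mov, (dy, dx), axis=(0, 1)) best matches `ref`."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(mov)
    cross = F_ref * np.conj(F_mov)
    cross /= np.abs(cross) + 1e-12           # normalize: keep phase only
    corr = np.real(np.fft.ifft2(cross))      # impulse at the relative shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # peaks past the midpoint correspond to negative shifts (wrap-around)
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return int(dy), int(dx)
```

Feeding the log-polar magnitude spectra of the two frames into the same routine turns the recovered shift into the rotation angle θ and scale factor S of S34.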
S4, performing motion detection on the registered image and the previous frame using the inter-frame difference method, and removing noise through thresholding, edge detection, and contour-fitting operations, specifically comprising the following sub-steps:
S41, inputting the registered image and the previous frame together into the inter-frame difference module to obtain a difference image, wherein the inter-frame difference formula is:

D_i(x, y) = |f_i(x, y) - f_{i+1}(x, y)|

wherein D_i(x, y) is the i-th frame difference image, f_i(x, y) is the i-th frame grayscale image, and f_{i+1}(x, y) is the registered (i+1)-th frame grayscale image;
S42, thresholding the difference map, wherein the thresholding formula is:

R_i(x, y) = 255 if D_i(x, y) > T, and R_i(x, y) = 0 otherwise

wherein R_i(x, y) is the i-th frame threshold map and T = 40 is the binarization threshold, that is, pixels in the difference map with a value greater than 40 are set to 255 and pixels with a value less than 40 are set to 0, yielding the threshold map;
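Sub-steps S41 and S42 amount to an absolute difference followed by fixed-threshold binarization; a minimal NumPy sketch (T = 40 as in the claim; the function name is illustrative):

```python
import numpy as np

def frame_difference_mask(f_i, f_next, T=40):
    """S41/S42: |f_i - f_{i+1}| thresholded at T -> binary threshold map."""
    # widen the dtype so uint8 subtraction cannot wrap around
    diff = np.abs(f_i.astype(np.int16) - f_next.astype(np.int16))
    return np.where(diff > T, 255, 0).astype(np.uint8)

f0 = np.array([[10, 200], [50, 50]], dtype=np.uint8)
f1 = np.array([[10, 100], [95, 50]], dtype=np.uint8)
mask = frame_difference_mask(f0, f1)   # per-pixel diffs: 0, 100, 45, 0
```

The dtype widening matters: subtracting two `uint8` images directly would wrap modulo 256 and corrupt the difference map.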
S43, applying the Canny edge-detection operator to the threshold map to perform edge detection and obtain an edge mask;
S44, fitting oriented rectangular boxes to the contours of the edge mask, computing an aspect ratio for each rectangular box, classifying a box as an afterimage if its aspect ratio is smaller than 0.1, and setting the pixels of the afterimage region to 0 to eliminate it, thereby obtaining the final motion mask;
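The aspect-ratio test of S44 can be isolated as a small predicate (the 0.1 threshold is from the claim; taking the short/long side ratio so that box orientation does not matter is an assumption):

```python
def is_afterimage(width, height, ratio_thresh=0.1):
    """S44: a fitted box is classified as an afterimage when its short
    side is less than ratio_thresh times its long side, i.e. the box is
    a thin sliver typical of registration ghosting."""
    short, long_ = sorted((width, height))
    return long_ > 0 and short / long_ < ratio_thresh
```

A 200×5 sliver is rejected as ghosting, while an 80×60 box survives as a plausible moving-object contour.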
S5, inputting the RGB image of the input frame into a Short-Term Dense Concatenate (STDC) network for semantic segmentation to obtain an object mask containing object semantic information;
S6, judging the motion of each object by combining the object mask obtained from the STDC network with the motion mask, specifically comprising the following sub-steps:
S61, for the object mask and the motion mask, computing the motion probability ρ_i of each object by the following formula:

ρ_i = M_i / F_i

wherein F_i is the total number of pixels of the i-th object in the object mask, and M_i is the total number of pixels of the corresponding region in the motion mask;
S62, setting a threshold ε = 0.1; if the motion probability ρ_i is greater than the threshold ε, the object is regarded as a moving object, otherwise it is regarded as a static object, and the prior dynamic-object mask is obtained by setting the pixels of the static-object regions to 0;
S63, fusing the prior dynamic-object mask with the motion mask to obtain the final dynamic-object mask;
S64, inputting the dynamic-object mask into step S26, and eliminating the feature points falling within the dynamic-object region according to the dynamic-object mask.
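Sub-steps S61 through S63 can be sketched as follows, assuming ρ_i is the fraction of an object's pixels that the motion mask marks as moving (the published formula is garbled, so ρ_i = M_i / F_i is inferred from the variable definitions; label values, the background convention, and function names are illustrative):

```python
import numpy as np

def motion_probability(object_mask, motion_mask, label):
    """S61: rho_i = M_i / F_i for the object with the given label."""
    obj = object_mask == label                       # F_i region
    F_i = int(obj.sum())
    M_i = int(np.logical_and(obj, motion_mask > 0).sum())
    return M_i / F_i if F_i else 0.0

def dynamic_object_mask(object_mask, motion_mask, eps=0.1):
    """S62/S63: keep objects with rho_i > eps, then fuse with the motion mask."""
    prior = np.zeros_like(motion_mask)
    for label in np.unique(object_mask):
        if label == 0:                               # 0 = background (assumed)
            continue
        if motion_probability(object_mask, motion_mask, label) > eps:
            prior[object_mask == label] = 255
    return np.maximum(prior, motion_mask)            # union of prior and motion mask

# toy example: object 1 has one moving pixel (rho = 0.25), object 2 none
object_mask = np.array([[1, 1, 0, 2],
                        [1, 1, 0, 2],
                        [0, 0, 0, 2],
                        [0, 0, 0, 2]])
motion_mask = np.zeros((4, 4), dtype=np.uint8)
motion_mask[0, 0] = 255
dyn = dynamic_object_mask(object_mask, motion_mask)
```

With ε = 0.1, the whole of object 1 is promoted to the dynamic mask while object 2 stays static, which is exactly the region S64 then uses to discard feature points.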
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310505675.0A CN116524026B (en) | 2023-05-08 | 2023-05-08 | Dynamic vision SLAM method based on frequency domain and semantics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116524026A CN116524026A (en) | 2023-08-01 |
CN116524026B true CN116524026B (en) | 2023-10-27 |
Family
ID=87402762
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310505675.0A Active CN116524026B (en) | 2023-05-08 | 2023-05-08 | Dynamic vision SLAM method based on frequency domain and semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524026B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036408B (en) * | 2023-08-22 | 2024-03-29 | 哈尔滨理工大学 | Object SLAM method combining multi-target tracking under dynamic environment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102970528A (en) * | 2012-12-28 | 2013-03-13 | 北京航空航天大学 | Video object division method based on change detection and frame difference accumulation |
CN106127801A (en) * | 2016-06-16 | 2016-11-16 | 乐视控股(北京)有限公司 | A kind of method and apparatus of moving region detection |
CN110334762A (en) * | 2019-07-04 | 2019-10-15 | 华南师范大学 | A kind of feature matching method combining ORB and SIFT based on quaternary tree |
CN110942484A (en) * | 2019-11-26 | 2020-03-31 | 福州大学 | Camera self-motion estimation method based on occlusion perception and feature pyramid matching |
CN112465858A (en) * | 2020-12-10 | 2021-03-09 | 武汉工程大学 | Semantic vision SLAM method based on probability grid filtering |
JP2021082265A (en) * | 2019-11-15 | 2021-05-27 | Guangdong University Of Technology | Drone visual travel distance measuring method based on depth point line feature |
CN114140527A (en) * | 2021-11-19 | 2022-03-04 | 苏州科技大学 | Dynamic environment binocular vision SLAM method based on semantic segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||