CN110490900B - Binocular vision positioning method and system under dynamic environment - Google Patents

Binocular vision positioning method and system under dynamic environment

Info

Publication number
CN110490900B
CN110490900B CN201910634021.1A CN201910634021A
Authority
CN
China
Prior art keywords
map
pose
frame
matching
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910634021.1A
Other languages
Chinese (zh)
Other versions
CN110490900A (en)
Inventor
秦家虎
余雷
陈宇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910634021.1A priority Critical patent/CN110490900B/en
Publication of CN110490900A publication Critical patent/CN110490900A/en
Application granted granted Critical
Publication of CN110490900B publication Critical patent/CN110490900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

A binocular vision positioning method and system in a dynamic environment, applied to the technical field of robot positioning. Feature points of the image are extracted, and feature points on moving objects are distinguished from feature points on non-moving objects by clustering the feature points; this approach can effectively reject the feature points on moving objects while avoiding the rejection of feature points on stationary objects that merely have dynamic attributes, thereby retaining more usable feature points. The inter-frame pose transformation matrix obtained by an IMU pre-integration algorithm is used to screen, among the poses estimated from different feature point sets, the one closest to the IMU data, so that outliers such as feature points belonging to moving objects are effectively rejected. The pose and depth estimation method fusing monocular motion vision and binocular stereo vision can effectively improve estimation accuracy.

Description

Binocular vision positioning method and system under dynamic environment
Technical Field
The invention relates to the technical field of robot positioning, in particular to a binocular vision positioning method and system in a dynamic environment.
Background
The traditional GPS positioning technology depends strongly on satellite signals; its accuracy drops sharply in environments with many buildings and severe occlusion, and it is essentially unusable indoors. Inertial navigation based on an inertial navigation module (gyroscope and accelerometer) does not depend on external information, and the current pose of the carrier can be obtained by integrating the inertial measurements, but the inertial positioning error grows over time, and the accumulated error leads to poor long-term accuracy. Vision-based positioning determines the pose of the robot by analyzing an image sequence; it can be used indoors and in other occluded environments, and the rich visual information can also be used for closed-loop detection to eliminate accumulated error and achieve long-term high-precision positioning. In the field of robotics, vision-based positioning is mainly based on visual Simultaneous Localization and Mapping (SLAM), referred to as visual SLAM for short; visual SLAM without map building is also called visual odometry, and both are collectively referred to as visual SLAM herein. Current visual SLAM is divided into monocular SLAM, binocular SLAM and RGB-D SLAM according to the sensor used. Monocular SLAM suffers from scale uncertainty, and RGB-D SLAM is costly and of low precision, so binocular SLAM is adopted here: it does not have the monocular scale-uncertainty problem and its cost is relatively low. At present, most SLAM positioning technologies are based on a static-environment assumption; however, most practical environments are complex dynamic scenes, in which the accuracy of SLAM algorithms based on the static assumption drops greatly. In recent years, some semantic-based dynamic-scene SLAM techniques, referred to as semantic SLAM for short, have been proposed and used. Semantic SLAM detects object classes with dynamic attributes in the scene through deep learning or other computer vision methods, and removes objects of these dynamic-attribute classes before pose estimation is performed with the image information, thereby reducing the influence of dynamic objects on pose estimation.
However, this approach has the following disadvantages: (1) Image detection has high computational complexity, while positioning has strict real-time requirements; adding deep-learning stages and the like to a SLAM system greatly degrades real-time performance. (2) In a SLAM system, and in particular in a visual odometer without an environment map, what affects pose estimation accuracy when estimating pose from image information is not the object with dynamic attributes but the object that is actually moving. For example, a car parked in a parking lot has dynamic attributes, but it may stay in the parking lot for a long time; at least for the current image it does not affect pose estimation, and it even provides more feature information for pose estimation. Especially in feature-poor scenes, a temporarily static dynamic-attribute object can provide more features and increase pose estimation accuracy.
Disclosure of Invention
The invention mainly aims to provide a binocular vision positioning method and system in a dynamic environment.
In order to achieve the above object, a first aspect of the embodiments of the present invention provides a binocular vision positioning method in a dynamic environment, including:
acquiring a left eye image and a right eye image under binocular vision of a camera, and extracting feature points corresponding to map points in a current frame to which the left eye image and the right eye image belong and a previous frame to which the left eye image and the right eye image belong, wherein the map points are feature points added into a map;
according to the characteristic points, obtaining the left-eye and right-eye movement visual matching characteristic point pairs and stereoscopic vision matching characteristic point pairs;
screening the shared matching characteristic point pairs of the stereoscopic vision matching characteristic point pair and the motion vision matching characteristic point pair;
respectively carrying out stereo depth estimation on the characteristic point pairs in the common matching characteristic point pair set to obtain a stereo depth estimation value;
clustering the left-eye and right-eye motion visual matching feature point pairs according to the optical flow velocity and the rotation angle between the motion visual matching feature point pairs by using a K-means++ clustering algorithm to obtain clustering results, and classifying the motion visual matching feature point pairs into a plurality of categories according to the clustering results;
calculating pose estimation results of the IMU of the left eye and the right eye in a time period between a current frame and a previous frame by utilizing an IMU pre-integration algorithm, wherein the pose estimation results comprise inter-frame poses between the current frame and the previous frame;
respectively carrying out EPNP (Efficient Perspective-n-Point) pose estimation and map point estimation on the motion visual matching feature point pairs under each category to obtain a pose estimation result and a map point estimation result of the EPNP;
selecting the pose estimation result of the EPNP closest to the pose estimation result of the IMU as an optimal EPNP pose estimation result and a map point estimation result, and recording effective matching feature point pairs corresponding to the optimal EPNP pose estimation result and the map point estimation result;
and selecting a characteristic point pair corresponding to the effective matching characteristic point pair from the common matching characteristic point pair, and performing Kalman filtering fusion on a stereo depth estimation value corresponding to the characteristic point pair and an EPNP estimated depth value to obtain an estimated pose and an estimated map point.
Further, the IMU pre-integration algorithm:
$$\alpha^{c_k}_{c_{k+1}} = \iint_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t\left(a_t - b_{a_t} - n_a\right)\,dt^2$$

$$\beta^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t\left(a_t - b_{a_t} - n_a\right)\,dt$$

$$\gamma^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} \tfrac{1}{2}\,\Omega\!\left(w_t - b_{w_t} - n_w\right)\gamma^{c_k}_t\,dt$$

wherein $\alpha^{c_k}_{c_{k+1}}$, $\beta^{c_k}_{c_{k+1}}$ and $\gamma^{c_k}_{c_{k+1}}$ respectively denote the displacement increment, the velocity increment and the quaternion of the rotation-angle change over the period from the k-th frame to the (k+1)-th frame, expressed in the k-th frame camera coordinate system; $a_t$ denotes the accelerometer measurement and $w_t$ the angular velocity measurement; $b_{a_t}$ and $b_{w_t}$ denote the biases of the accelerometer and the gyroscope respectively, and $n_a$ and $n_w$ the respective noises; $R^{c_k}_t$ is the rotation transformation matrix of time t relative to the k-th frame camera coordinate system, obtained from the quaternion $\gamma^{c_k}_t$; $\Omega$ is the matrix related to the quaternion calculation; $\Delta t_k$ denotes the time interval from $t_k$ to $t_{k+1}$; and $g^w$ denotes the gravity vector in the world coordinate system.
Further, after the classifying the moving visual matching feature point pairs into a plurality of categories according to the clustering result, the method includes:
performing histogram statistics on the number of the feature points in each category;
and deleting any category, together with the feature points under it, for which the number of corresponding feature points is less than a first preset number and differs from the numbers of corresponding feature points under the other categories by more than a second preset number.
Further, the performing EPNP pose estimation and map point estimation on the motion visual matching feature point pairs under each category, respectively, to obtain a pose estimation result and a map point estimation result of the EPNP includes:
respectively carrying out EPNP pose and map point estimation on the feature points under each category, and rejecting outliers from the pose estimation result of each category by using the RANSAC (random sample consensus) algorithm, so as to obtain the EPNP pose estimation result and the map point estimation result corresponding to each category.
Further, after obtaining the estimated pose and the estimated map point, the method includes:
optimizing the estimated pose and the estimated map points by utilizing a graph-based optimization algorithm according to the inter-frame pose of the IMU to obtain optimized map points;
the optimization algorithm of the graph is as follows:
$$\min_{T^{k+1}_{k},\,P_k}\ \sum_{k}\left(\left\| z_{k+1} - \pi_c\!\left(T^{k+1}_{k}\,P_k\right)\right\|^2 + \left\|\ln\!\left(\left(\hat{T}^{k+1}_{k,\mathrm{IMU}}\right)^{-1} T^{k+1}_{k}\right)^{\vee}\right\|^2\right)$$

wherein $T^{k+1}_{k}$ is the pose transformation matrix being optimized, initialized with $\hat{T}^{k+1}_{k,\mathrm{EPNP}}$, the estimated pose transformation matrix obtained by EPNP pose estimation; $\hat{T}^{k+1}_{k,\mathrm{IMU}}$ represents the pose transformation matrix obtained by the IMU pre-integration algorithm; $z_{k+1}$ represents the pixel coordinates in the camera image; $P_k$ represents the 3D coordinates of a map point in the k-th frame camera coordinate system; and $\pi_c(\cdot)$ is the projective transformation based on the camera model that converts 3D coordinates in the camera coordinate system into 2D pixel coordinates in the camera image.
Further, after the optimizing the estimated pose and the estimated map point to obtain the optimized map point, the method includes:
replacing the map points before optimization by the optimized map points, updating the local map, and eliminating map points which do not belong to the local map;
comparing whether the number of co-visible feature points between the current frame and the latest key frame is higher than a preset threshold;
if the current key frame is higher than the preset threshold, replacing the latest key frame with the current key frame;
and if the current frame is lower than the preset threshold, adding the current frame into a key frame set.
Further, the obtaining the left-eye and right-eye motion visual matching feature point pairs according to the feature points includes:
carrying out ORB descriptor matching on the feature points corresponding to the map points in the current frame to which the left eye image belongs and the map points in the previous frame to which the left eye image belongs to obtain a left eye matching feature point pair;
carrying out ORB descriptor matching on the feature points corresponding to the map points in the current frame to which the right-eye image belongs and the map points in the previous frame to which the right-eye image belongs to obtain right-eye matched feature point pairs; wherein the left-eye and right-eye moving visual matching feature point pairs include the left-eye matching feature point pair and the right-eye matching feature point pair.
Further, the obtaining the stereoscopic vision matching feature point pair of the left eye and the right eye according to the feature point comprises:
and respectively carrying out ORB descriptor matching and SAD stereo matching on the feature points in the current frame to which the left eye image belongs and the feature points in the current frame to which the right eye image belongs to obtain the stereo matching feature point pairs of the left eye and the right eye.
A second aspect of the embodiments of the present invention provides a binocular vision positioning system in a dynamic environment, including:
the system comprises a main thread, a pose estimation thread, a loop thread and a back-end optimization thread;
the main thread is used for acquiring a left eye image and a right eye image under binocular vision of the camera, initializing the left eye image and the right eye image, receiving a restart signal sent by the pose estimation thread or the loop thread after the initialization is successful, and initializing the pose estimation thread or the loop thread again according to the restart signal;
the pose estimation thread is used for acquiring a left eye image and a right eye image under binocular vision of the camera, extracting feature points in a current frame to which the left eye image and the right eye image belong and feature points corresponding to a map point in a previous frame to which the left eye image and the right eye image belong, and calculating an estimated pose and an estimated map point of the current frame according to the feature points;
the loop thread is used for detecting a loop frame, estimating a pose transformation matrix between a current frame and the loop frame, and updating the pose of the current frame by using the pose of the loop frame as a reference;
and the back-end optimization thread is used for optimizing the poses of all the frames according to the pose estimation result of the IMU pre-integration algorithm and the pose transformation matrix between the current frame and the loop frame.
Further, the calculating the estimated pose and the estimated map point of the current frame according to the feature points includes:
according to the characteristic points, obtaining the left-eye and right-eye movement visual matching characteristic point pairs and stereoscopic vision matching characteristic point pairs;
screening the shared matching characteristic point pairs of the stereoscopic vision matching characteristic point pair and the motion vision matching characteristic point pair;
respectively carrying out stereo depth estimation on the characteristic point pairs in the common matching characteristic point pair set to obtain a stereo depth estimation value;
clustering the left-eye and right-eye motion visual matching feature point pairs according to the optical flow velocity and the rotation angle between the motion visual matching feature point pairs by using a K-means++ clustering algorithm to obtain clustering results, and classifying the motion visual matching feature point pairs into a plurality of categories according to the clustering results;
calculating pose estimation results of the IMU of the left eye and the right eye in a time period between a current frame and a previous frame by utilizing an IMU pre-integration algorithm, wherein the pose estimation results comprise inter-frame poses between the current frame and the previous frame;
respectively carrying out EPNP (Efficient Perspective-n-Point) pose estimation and map point estimation on the motion visual matching feature point pairs under each category to obtain a pose estimation result and a map point estimation result of the EPNP;
selecting the pose estimation result of the EPNP closest to the pose estimation result of the IMU as an optimal EPNP pose estimation result and a map point estimation result, and recording effective matching feature point pairs corresponding to the optimal EPNP pose estimation result and the map point estimation result;
and selecting a characteristic point pair corresponding to the effective matching characteristic point pair from the common matching characteristic point pair, and performing Kalman filtering fusion on a stereo depth estimation value corresponding to the characteristic point pair and an EPNP estimated depth value to obtain an estimated pose and an estimated map point.
The method extracts image feature points and distinguishes feature points on moving objects from feature points on non-moving objects by clustering the feature points, so that feature points on moving objects can be rejected effectively while feature points on stationary dynamic-attribute objects are not rejected, retaining more usable feature points. The inter-frame pose transformation matrix obtained by the IMU pre-integration algorithm is used to screen, among the poses estimated from the different feature point sets, the one closest to the IMU data, effectively rejecting outliers such as feature points belonging to moving objects. The pose and depth estimation method fusing monocular motion vision and binocular stereo vision can effectively improve estimation accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
fig. 2 is a general frame schematic diagram of a binocular vision positioning method under a dynamic environment according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a main thread of a binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a pose estimation thread of a binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
fig. 5 is a partially enlarged view of a schematic flow chart of a pose estimation thread of the binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
fig. 6 is a partially enlarged view of a schematic flow chart of a pose estimation thread of the binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
fig. 7 is a partially enlarged view of a schematic flow chart of a pose estimation thread of the binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a feature point pair of a binocular vision positioning method in a dynamic environment according to an embodiment of the present invention illustrating relative speed and angle;
fig. 9 is a schematic flowchart of a looping thread of a binocular vision positioning method in a dynamic environment according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of loop detection according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a back-end optimization process according to an embodiment of the present invention;
fig. 12 is a schematic diagram of a backend optimization effect according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the method for screening moving objects is not only suitable for a binocular camera but can also be used with monocular and RGB-D cameras, so that monocular and RGB-D cameras can likewise achieve the purpose of the invention. The embodiments of this application take a visual SLAM algorithm as an example; however, for a laser sensor, the mechanism provided by the invention of screening moving objects rather than dynamic-attribute objects can also be implemented in laser SLAM, so that by clustering similar features in laser SLAM, pose estimation can likewise be completed in a dynamic environment.
Referring to fig. 1, fig. 1 is a schematic flow chart of a binocular vision positioning method under a dynamic environment according to an embodiment of the present invention, the method mainly includes:
s101, acquiring a left eye image and a right eye image under binocular vision of a camera;
s102, extracting the feature points in the current frame to which the left eye image and the right eye image belong and the feature points corresponding to map points in the previous frame to which the left eye image and the right eye image belong, wherein the map points are feature points that have been added into a map;
s103, obtaining the left-eye and right-eye movement visual matching characteristic point pairs and stereoscopic vision matching characteristic point pairs according to the characteristic points;
s104, screening a common matching characteristic point pair of the stereoscopic vision matching characteristic point pair and the motion vision matching characteristic point pair;
s105, respectively carrying out stereo depth estimation on the characteristic point pairs in the common matching characteristic point pair set to obtain a stereo depth estimation value;
s106, clustering the left-eye and right-eye motion visual matching feature point pairs according to the optical flow velocity and the rotation angle between the motion visual matching feature point pairs by using a K-means + + clustering algorithm to obtain a clustering result, and classifying the motion visual matching feature point pairs into a plurality of categories according to the clustering result;
s107, calculating pose estimation results of the IMU of the left eye and the right eye in a time period between the current frame and the previous frame by utilizing an IMU pre-integration algorithm;
s108, respectively carrying out EPNP (Efficient Perspective-n-Point) pose estimation and map point estimation on the motion vision matching feature point pairs under each category to obtain an EPNP pose estimation result and a map point estimation result;
s109, selecting the pose estimation result of the EPNP closest to the pose estimation result of the IMU as an optimal EPNP pose estimation result and a map point estimation result, and recording effective matching feature point pairs corresponding to the optimal EPNP pose estimation result and the map point estimation result;
s110, selecting a characteristic point pair corresponding to the effective matching characteristic point pair from the common matching characteristic point pair, and performing Kalman filtering fusion on a stereo depth estimation value corresponding to the characteristic point pair and an EPNP estimated depth value to obtain an estimated pose and an estimated map point.
Further, the IMU pre-integration algorithm:
$$\alpha^{c_k}_{c_{k+1}} = \iint_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t\left(a_t - b_{a_t} - n_a\right)\,dt^2$$

$$\beta^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t\left(a_t - b_{a_t} - n_a\right)\,dt$$

$$\gamma^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} \tfrac{1}{2}\,\Omega\!\left(w_t - b_{w_t} - n_w\right)\gamma^{c_k}_t\,dt$$

wherein $\alpha^{c_k}_{c_{k+1}}$, $\beta^{c_k}_{c_{k+1}}$ and $\gamma^{c_k}_{c_{k+1}}$ respectively denote the displacement increment, the velocity increment and the quaternion of the rotation-angle change over the period from the k-th frame to the (k+1)-th frame, expressed in the k-th frame camera coordinate system; $a_t$ denotes the accelerometer measurement and $w_t$ the angular velocity measurement; $b_{a_t}$ and $b_{w_t}$ denote the biases of the accelerometer and the gyroscope respectively, and $n_a$ and $n_w$ the respective noises; $R^{c_k}_t$ is the rotation transformation matrix of time t relative to the k-th frame camera coordinate system, obtained from the quaternion $\gamma^{c_k}_t$; $\Omega$ is the matrix related to the quaternion calculation; $\Delta t_k$ denotes the time interval from $t_k$ to $t_{k+1}$; and $g^w$ denotes the gravity vector in the world coordinate system.
Further, after the classifying the moving visual matching feature point pairs into a plurality of categories according to the clustering result, the method includes:
performing histogram statistics on the number of the feature points in each category;
and deleting any category, together with the feature points under it, for which the number of corresponding feature points is less than a first preset number and differs from the numbers of corresponding feature points under the other categories by more than a second preset number.
Further, the step of respectively performing EPNP pose estimation and map point estimation on the motion visual matching feature point pairs under each category to obtain an EPNP pose estimation result and a map point estimation result includes:
respectively carrying out EPNP pose and map point estimation on the feature points under each category, and rejecting outliers from the pose estimation result of each category by using the RANSAC (random sample consensus) algorithm, so as to obtain the EPNP pose estimation result and the map point estimation result corresponding to each category.
Further, after obtaining the estimated pose and the estimated map point, the method includes:
optimizing the estimated pose and the estimated map points by utilizing a graph-based optimization algorithm according to the pose estimation result of the IMU to obtain optimized map points;
the optimization algorithm of the graph is as follows:
$$\min_{T^{k+1}_{k},\,P_k}\ \sum_{k}\left(\left\| z_{k+1} - \pi_c\!\left(T^{k+1}_{k}\,P_k\right)\right\|^2 + \left\|\ln\!\left(\left(\hat{T}^{k+1}_{k,\mathrm{IMU}}\right)^{-1} T^{k+1}_{k}\right)^{\vee}\right\|^2\right)$$

wherein $T^{k+1}_{k}$ is the pose transformation matrix being optimized, initialized with $\hat{T}^{k+1}_{k,\mathrm{EPNP}}$, the estimated pose transformation matrix obtained by EPNP pose estimation; $\hat{T}^{k+1}_{k,\mathrm{IMU}}$ represents the pose transformation matrix obtained by the IMU pre-integration algorithm; $z_{k+1}$ represents the pixel coordinates in the camera image; $P_k$ represents the 3D coordinates of a map point in the k-th frame camera coordinate system; and $\pi_c(\cdot)$ is the projective transformation based on the camera model that converts 3D coordinates in the camera coordinate system into 2D pixel coordinates in the camera image.
Further, the optimizing the estimated pose and the estimated map point to obtain an optimized map point includes:
replacing the map points before optimization by the optimized map points, updating the local map, and eliminating points which do not belong to the local map;
comparing whether the number of co-visible feature points between the current frame and the latest key frame is higher than a preset threshold;
if the current key frame is higher than the preset threshold, replacing the latest key frame with the current key frame;
if the current frame is lower than the preset threshold, adding the current frame into the key frame set.
Further, the obtaining the pair of left-eye and right-eye motion visual matching feature points according to the feature points comprises:
carrying out ORB descriptor matching on the feature points corresponding to the map points in the current frame to which the left eye image belongs and the map points in the previous frame to which the left eye image belongs to obtain a left eye matching feature point pair;
carrying out ORB descriptor matching on the feature point in the current frame to which the right-eye image belongs and the feature point corresponding to the map point in the previous frame to which the right-eye image belongs to obtain a right-eye matching feature point pair; wherein the left-eye and right-eye moving visual matching feature point pairs include the left-eye matching feature point pair and the right-eye matching feature point pair.
Further, the obtaining the pair of stereoscopic vision matching feature points of the left eye and the right eye according to the feature points comprises:
and respectively carrying out ORB descriptor matching and SAD stereo matching on the feature points in the current frame to which the left eye image belongs and the feature points in the current frame to which the right eye image belongs to obtain the stereo matching feature point pairs of the left eye and the right eye.
In the embodiment of the invention, the moving object is removed instead of the dynamic object, the moving object and the non-moving object are distinguished in a clustering mode, and the speed and the angle of the inter-frame feature point are clustered, so that the matching feature point set of the moving object is distinguished from the non-moving object feature point set; screening poses which are closest to IMU data and estimated based on different feature point sets through an interframe pose transformation matrix obtained by an IMU pre-integration algorithm, and effectively removing external points such as feature points belonging to moving objects; the pose and depth estimation method integrating monocular motion vision and binocular stereo vision can effectively improve the estimation precision.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a binocular vision positioning system under a dynamic environment according to another embodiment of the present invention, the system mainly includes:
the system comprises a main thread, a pose estimation thread, a loop thread and a back-end optimization thread;
the main thread is used for acquiring a left eye image and a right eye image under binocular vision of the camera, initializing the images, receiving a restart signal sent by the pose estimation thread or the loop thread after the initialization is successful, and initializing the images again according to the restart signal;
specifically, as shown in fig. 3, the main thread of the system mainly functions to start other threads and initialize a map, and provide initial trusted data for the subsequent other threads, and the main flow and the working principle are as follows:
the method comprises the steps of respectively obtaining images from the left eye and the right eye of a binocular camera, extracting feature points of the left eye image and the right eye image by using FAST corner detection, dividing the images into 10 sub-image areas, and selecting N corner points with the maximum response values from large to small in each sub-image area according to Harris response values to serve as a final feature point set. And calculating the rotation direction of each feature point, wherein the rotation direction of the feature point is defined as a direction vector from the geometric center to the centroid of the image block with the feature point as the center. Then, a descriptor of the feature point is calculated, which uses the same binary BRIEF descriptor as orb (organized FAST and Rotated BRIEF).
ORB descriptor matching is first carried out between the feature points of the left-eye image and those of the right-eye image, followed by SAD (sum of absolute differences) sub-pixel-level matching; valid feature points are then screened using the rotation-angle consistency of the feature points, so as to obtain accurate matching feature point pairs between the left and right camera images.
The real depth of each matched feature point is calculated using the binocular camera model

$$z = \frac{f\,b}{u_L - u_R}$$

where z represents the real coordinate of the feature point along the z-axis in the camera reference frame, f represents the focal length of the camera, b is the camera baseline, and $u_L$, $u_R$ are respectively the pixel abscissae of the feature point in the left-eye and right-eye images. It is then checked whether z is positive: if the depth estimate is valid, the 3D coordinates of the feature point are recovered, and these 3D coordinates are called a map point; if the depth estimate is invalid, depth estimation of the feature point has failed and the feature point pair is rejected.
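For illustration, a minimal sketch of this depth check follows (an assumption-laden example, not the original code); the intrinsics fx, fy, cx, cy and the helper names are hypothetical.

```python
# Minimal sketch of the stereo depth model z = f*b/(uL - uR) and the
# positivity check described above; a non-positive result rejects the pair.
def stereo_depth(uL, uR, f, b):
    """Return the metric depth of a matched point, or None if invalid."""
    disparity = uL - uR
    if disparity <= 0:          # would give a non-positive depth -> reject pair
        return None
    return f * b / disparity

def back_project(uL, vL, z, fx, fy, cx, cy):
    """Recover the 3D map-point coordinates in the left camera frame."""
    return ((uL - cx) * z / fx, (vL - cy) * z / fy, z)
```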
If the number of successfully matched feature point pairs is larger than a certain threshold, initialization is considered successful; otherwise, initialization is performed again.
Once the initialization validation is successful, map points are added to the local map and global map classes and the current frame is added as a key frame and marked as the previous frame for subsequent processing. The local map is a map point set composed of map points corresponding to all matched feature point pairs on the current image frame and the past frame having a common view relationship with the current image frame, and the global map is a map point set composed of all past map points. The key frame set is used for maintaining the data set size, and only the image frames meeting the conditions are stored locally as key frames for subsequent closed-loop detection and back-end optimization.
After the whole system is initialized successfully, the system enters a state of polling signals of other threads, and if restart signals sent by other threads are received, the system enters the state of acquiring images from the left eye and the right eye of the binocular camera again to start to initialize the system again.
The pose estimation thread is used for acquiring a left eye image and a right eye image under binocular vision of the camera, extracting feature points in a current frame to which the left eye image and the right eye image belong and feature points corresponding to a map point in a previous frame to which the left eye image and the right eye image belong, and calculating an estimated pose and an estimated map point of the current frame according to the feature points;
specifically, referring to fig. 4, 5, 6, and 7, the working process of the pose estimation thread is as follows:
Images are acquired from the left eye and the right eye of the binocular camera respectively. Feature points of the left-eye and right-eye images are extracted using FAST corner detection; each image is divided into 10 sub-image areas, and in each sub-image area the N corner points with the largest Harris response values are selected, from large to small, as the final feature point set. The rotation direction of each feature point is calculated, defined as the direction vector from the geometric center to the centroid of the image block centered on the feature point. A descriptor is then computed for each feature point, using the same binary BRIEF descriptor as ORB (Oriented FAST and Rotated BRIEF). The acceleration $a_t$ and angular velocity $w_t$ are read from a preset inertial navigation module (IMU module), and the acceleration and angular velocity are filtered.
The IMU inertial navigation module is used here as the auxiliary means for screening the correct visual positioning result in a dynamic environment; sensors such as a code disc, an encoder or an electronic compass, which can obtain the displacement, velocity, angular velocity and angle of the robot, can likewise be used as the auxiliary means to achieve the same effect of cooperating with the visual sensor in a dynamic environment.
ORB descriptor matching is performed, for the left eye and the right eye respectively, between the feature points of the current frame and the feature points corresponding to map points in the previous frame; the DBOW3 bag-of-words technique is used to accelerate feature matching, and rotation-angle consistency of the feature points is then used to screen valid matches, yielding the left-eye matching feature point pairs and the right-eye matching feature point pairs, collectively called the left-eye and right-eye motion vision matching feature point pairs. ORB descriptor matching and SAD stereo matching optimization are also performed between the current left-eye frame and the current right-eye frame to obtain more accurate left-right stereo matching feature point pairs, i.e., the stereoscopic vision matching feature point pairs. The set of matching feature point pairs common to the motion vision matching pairs and the stereoscopic vision matching pairs is then screened out, stereo depth estimation is performed on the corresponding feature points, and their 3D coordinates are recovered.
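As an illustration of the rotation-angle consistency screening mentioned above, the following sketch histograms the orientation difference of each match and keeps only the dominant bins; this ORB-SLAM-style check, the bin count and the number of kept bins are assumptions, not taken from the original text.

```python
# Sketch (assumed ORB-SLAM-style check): screen matches by rotation-angle
# consistency. Matches whose keypoint-orientation difference falls outside the
# dominant histogram bins are discarded as unreliable.
def rotation_consistent_matches(kps_prev, kps_cur, matches, n_bins=30, keep=3):
    hist = [[] for _ in range(n_bins)]
    for i, m in enumerate(matches):
        dtheta = (kps_prev[m.queryIdx].angle - kps_cur[m.trainIdx].angle) % 360.0
        hist[int(dtheta / (360.0 / n_bins))].append(i)
    best = sorted(range(n_bins), key=lambda b: len(hist[b]), reverse=True)[:keep]
    keep_idx = sorted(i for b in best for i in hist[b])
    return [matches[i] for i in keep_idx]
```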
Calculating the optical flow velocity v and the rotation angle theta between the left-eye and right-eye motion visual matching feature point pairs, wherein the principle of the optical flow velocity and the rotation angle is schematically shown in fig. 8, and the calculation formula is as follows:
$$v_x = \frac{u_{k+1}-u_k}{\Delta t},\qquad v_y = \frac{v_{k+1}-v_k}{\Delta t}$$

$$v = \sqrt{v_x^2 + v_y^2},\qquad \theta = \arctan\frac{v_y}{v_x}$$

where $(u_k, v_k)$ and $(u_{k+1}, v_{k+1})$ are the pixel coordinates of the matched feature point in the previous frame and the current frame, and $\Delta t$ is the time interval between the two frames.
The K-means++ clustering algorithm is used to cluster the left-eye and right-eye motion vision matching feature point pairs into M categories according to the optical flow velocity and the rotation angle. In a real environment the velocity and direction of a moving object differ greatly from those of the background elements, so clustering can separate moving objects with different velocities from the feature points of static objects. After the clustering result is obtained, histogram statistics is performed on the feature points of each category, and categories whose counts are too small and differ greatly from the other categories, together with their corresponding feature point pairs, are screened out.
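A minimal sketch of this clustering step is given below, assuming scikit-learn's KMeans with k-means++ seeding; the cluster count M, the pruning threshold, and the (cos, sin) encoding of the angle are illustrative assumptions rather than details of the original disclosure.

```python
# Sketch of clustering matched pairs by (optical-flow speed, rotation angle)
# and dropping sparse clusters, as described above.
import numpy as np
from sklearn.cluster import KMeans

def cluster_matches(flow_speed, flow_angle_deg, M=4, min_count=8):
    """Return {cluster id: indices of matched pairs} after pruning tiny clusters."""
    # Encode the angle as (cos, sin) so that 359 deg and 1 deg stay close.
    feats = np.column_stack([flow_speed,
                             np.cos(np.radians(flow_angle_deg)),
                             np.sin(np.radians(flow_angle_deg))])
    labels = KMeans(n_clusters=M, init="k-means++", n_init=10,
                    random_state=0).fit_predict(feats)
    clusters = {}
    for c in range(M):
        idx = np.flatnonzero(labels == c)
        if len(idx) >= min_count:        # histogram-based pruning of tiny clusters
            clusters[c] = idx
    return clusters
```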
The IMU pre-integration technology is utilized to calculate the interframe displacement of the left eye camera and the right eye camera in the time period of the previous frame and the current frame, and the core formula of the IMU pre-integration principle is as follows
$$\alpha^{c_k}_{c_{k+1}} = \iint_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t\left(a_t - b_{a_t} - n_a\right)\,dt^2$$

$$\beta^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t\left(a_t - b_{a_t} - n_a\right)\,dt$$

$$\gamma^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} \tfrac{1}{2}\,\Omega\!\left(w_t - b_{w_t} - n_w\right)\gamma^{c_k}_t\,dt$$

wherein $\alpha^{c_k}_{c_{k+1}}$, $\beta^{c_k}_{c_{k+1}}$ and $\gamma^{c_k}_{c_{k+1}}$ respectively denote the displacement increment, the velocity increment and the quaternion of the rotation-angle change over the period from the k-th frame to the (k+1)-th frame, expressed in the k-th frame camera coordinate system; $a_t$ denotes the accelerometer measurement and $w_t$ the angular velocity measurement; $b_{a_t}$ and $b_{w_t}$ denote the biases of the accelerometer and the gyroscope respectively, and $n_a$ and $n_w$ the respective noises; $R^{c_k}_t$ is the rotation transformation matrix of time t relative to the k-th frame camera coordinate system, obtained from the quaternion $\gamma^{c_k}_t$; $\Omega$ is the matrix related to the quaternion calculation; $\Delta t_k$ denotes the time interval from $t_k$ to $t_{k+1}$; and $g^w$ denotes the gravity vector in the world coordinate system.
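For clarity only, a rough discrete-time sketch of the pre-integration above is given below (simple Euler integration, noise terms omitted, biases assumed constant over the interval); a practical system would use mid-point integration and also propagate the covariance and bias Jacobians.

```python
# Illustrative Euler-integration sketch of the pre-integrated increments.
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(acc, gyro, dts, ba, bw):
    """acc, gyro: (N,3) raw IMU samples between frames k and k+1; dts: (N,) steps.
    Returns (alpha, beta, gamma): displacement/velocity increments in the
    frame-k axes and the accumulated rotation as a scipy Rotation."""
    alpha = np.zeros(3)          # displacement increment in the frame-k axes
    beta = np.zeros(3)           # velocity increment in the frame-k axes
    gamma = Rotation.identity()  # rotation of time t relative to frame k
    for a, w, dt in zip(acc, gyro, dts):
        R_kt = gamma.as_matrix()           # R^{c_k}_t
        a_k = R_kt @ (a - ba)              # bias-corrected accel, rotated to frame k
        alpha += beta * dt + 0.5 * a_k * dt * dt
        beta += a_k * dt
        gamma = gamma * Rotation.from_rotvec((w - bw) * dt)
    return alpha, beta, gamma
```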
Using the different clusters of the motion vision matching feature point set obtained above, EPNP (Efficient Perspective-n-Point) pose estimation and map point estimation are carried out separately on each category of matching feature points, and outliers are rejected from the pose estimation result of each category by the RANSAC (random sample consensus) algorithm. After the pose estimation result corresponding to each category is obtained, it is compared with the IMU pose result. In general, the filtered short-term IMU pose estimate is accurate, so the IMU inter-frame pose estimate can effectively screen out the true EPNP result and eliminate the pose estimation results coming from moving-object feature point sets, making the image-based pose estimation more accurate and reducing the influence of moving objects on the estimation result. With this IMU-data screening mechanism, accurate EPNP pose and map point estimation results are obtained; these are called the optimal EPNP pose and map point estimation results, and the corresponding matching feature point set is called the set of valid motion vision matching feature point pairs.
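The sketch below illustrates this screening mechanism under stated assumptions: OpenCV's EPnP solver inside RANSAC is used per cluster, and the distance to the IMU inter-frame pose is taken as translation error plus a weighted rotation-angle error; the metric, the weight and the array layouts are illustrative choices, not details fixed by the original text.

```python
# Sketch: per-cluster EPnP + RANSAC, then keep the cluster whose pose is
# closest to the IMU inter-frame pose (R_imu, t_imu).
import cv2
import numpy as np

def best_cluster_pose(clusters, pts3d, pts2d, K, R_imu, t_imu, w_rot=1.0):
    """pts3d: (N,3) float32 map points; pts2d: (N,2) float32 pixels."""
    best = None
    for cid, idx in clusters.items():
        if len(idx) < 4:                       # EPnP needs at least 4 points
            continue
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d[idx], pts2d[idx], K, None, flags=cv2.SOLVEPNP_EPNP)
        if not ok or inliers is None:
            continue
        R, _ = cv2.Rodrigues(rvec)
        # Distance to the IMU pose: translation error + weighted rotation angle
        dt = np.linalg.norm(tvec.ravel() - np.ravel(t_imu))
        dR = np.arccos(np.clip((np.trace(R_imu.T @ R) - 1.0) / 2.0, -1.0, 1.0))
        score = dt + w_rot * dR
        if best is None or score < best[0]:
            best = (score, cid, R, tvec, idx[inliers.ravel()])
    return best   # (score, cluster id, R, t, indices of valid matching pairs)
```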
Feature points belonging to the valid matching feature point pair set are then screened out from the common matching feature point set, i.e., the feature point pairs belonging to the feature point classes corresponding to the optimal EPNP pose estimation result are selected, and Kalman filtering fusion is performed between the stereo depth estimate corresponding to each such feature point pair and the depth value estimated by EPNP, thereby obtaining more accurate map points.
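As a one-dimensional illustration of this fusion step (an assumed per-feature formulation, with variances supplied by the respective estimators), a single scalar Kalman update merging the two depth estimates can be sketched as follows.

```python
# Scalar Kalman fusion of the stereo depth and the EPnP-estimated depth
# of the same feature point.
def fuse_depth(z_stereo, var_stereo, z_epnp, var_epnp):
    """Return the fused depth and its variance (standard scalar Kalman update)."""
    k = var_stereo / (var_stereo + var_epnp)     # Kalman gain
    z_fused = z_stereo + k * (z_epnp - z_stereo)
    var_fused = (1.0 - k) * var_stereo
    return z_fused, var_fused
```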
The obtained pose and map points are then optimized using a graph-based optimization algorithm. The result of IMU pre-integration is blended into the optimization and also used as one of the optimization constraint conditions, achieving a fused optimization. The core formula of the optimization is as follows:
$$\min_{T^{k+1}_{k},\,P_k}\ \sum_{k}\left(\left\| z_{k+1} - \pi_c\!\left(T^{k+1}_{k}\,P_k\right)\right\|^2 + \left\|\ln\!\left(\left(\hat{T}^{k+1}_{k,\mathrm{IMU}}\right)^{-1} T^{k+1}_{k}\right)^{\vee}\right\|^2\right)$$

wherein $T^{k+1}_{k}$ is the pose transformation matrix being optimized, initialized with $\hat{T}^{k+1}_{k,\mathrm{EPNP}}$, the estimated pose transformation matrix obtained by EPNP pose estimation; $\hat{T}^{k+1}_{k,\mathrm{IMU}}$ represents the pose transformation matrix obtained by the IMU pre-integration algorithm; $z_{k+1}$ represents the pixel coordinates in the camera image; $P_k$ represents the 3D coordinates of a map point in the k-th frame camera coordinate system; and $\pi_c(\cdot)$ is the projective transformation based on the camera model that converts 3D coordinates in the camera coordinate system into 2D pixel coordinates in the camera image.
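The following sketch illustrates a fused optimization of this kind under simplifying assumptions: only the inter-frame pose is refined (map points are held fixed for brevity), the pose is parameterized as a rotation vector plus translation, the IMU term compares rotation vectors directly, and the IMU weight is arbitrary. It is an illustration, not the patented optimizer.

```python
# Sketch: reprojection error + IMU pose prior, starting from the EPnP estimate.
import cv2
import numpy as np
from scipy.optimize import least_squares

def optimize_pose(rvec0, tvec0, rvec_imu, tvec_imu, pts3d, pts2d, K, w_imu=1.0):
    """rvec*, tvec*: (3,1) rotation vectors / translations; pts3d: (N,3); pts2d: (N,2)."""
    def residuals(x):
        rvec, tvec = x[:3].reshape(3, 1), x[3:].reshape(3, 1)
        proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)   # pi_c(T * P_k)
        r_reproj = (proj.reshape(-1, 2) - pts2d).ravel()
        r_imu = w_imu * np.concatenate([(rvec - rvec_imu).ravel(),
                                        (tvec - tvec_imu).ravel()])
        return np.concatenate([r_reproj, r_imu])
    x0 = np.concatenate([np.ravel(rvec0), np.ravel(tvec0)])   # EPnP initial value
    sol = least_squares(residuals, x0)
    return sol.x[:3], sol.x[3:]      # refined rotation vector and translation
```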
After pose estimation is completed, the correct map points are added to the local map and the global map, the local map is updated and replaced, and points that do not belong to the local map are removed. At the same time, the number of co-visible feature points between the current frame and the latest key frame is compared with a threshold: if it is higher than the threshold, the overlap between the current frame and the latest key frame is excessive, no new key frame needs to be added, the latest key frame is replaced by the current frame, and the local and global map points are partially replaced and deleted. If the information difference between the current frame and the latest key frame is larger than the threshold, the current frame is added to the key frame set to increase the information capacity of the environment.
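This key-frame decision can be summarized by the small sketch below; the co-visibility threshold is an assumed parameter and the function name is hypothetical.

```python
# Sketch of the key-frame decision described above.
def update_keyframes(keyframes, current, covisible_count, threshold=60):
    if covisible_count > threshold:
        keyframes[-1] = current      # too much overlap: replace the latest key frame
    else:
        keyframes.append(current)    # enough new information: add a new key frame
```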
Bundle Adjustment optimization and outlier rejection are performed on the poses of the key frames that have enough co-visible feature points with the latest key frame, and on the map points associated with those key frames. The previous frame and the key frames having a co-visibility relationship with the current frame are added to the co-visibility graph of the current frame, and the data and map structures are updated. The process then enters the next loop iteration.
The above process is carried out on the premise that there are enough feature points matched between the map points of the previous frame and the current frame. If the matched feature points are insufficient, feature points need to be supplemented, and there are two main ways to do so: one is to search the local map and add map points that can be matched; the other is to perform feature matching between the previous frame and the current frame again, with the matching range no longer limited to the previous-frame feature points corresponding to map points, so that all feature points in the previous frame participate in matching, which effectively fills the gap. Once the feature points meet the requirement, pose estimation continues. If they still do not meet the requirement, the IMU is used to record the short-term pose transformation so as to keep the trajectory continuous, and the loop proceeds to the next round until enough feature points are found. If enough feature points are not found for a long time, positioning fails; at that point the IMU is also inaccurate due to accumulated error, and the system needs to be re-initialized.
The loop thread is used for detecting a loop frame, estimating a pose transformation matrix between a current frame and the loop frame, and updating the pose of the current frame by using the pose of the loop frame as a reference;
Specifically, referring to fig. 9 and 10, loop detection means detecting whether the robot has returned to a place it has already visited. If so, a geometric constraint between the current frame and the loop frame can be established, and the error accumulated between the loop frame and the current frame can be eliminated through optimization. Loop detection is similar to the feature matching process and mainly uses DBOW3 bag-of-words acceleration: among all past key frames, frames with a very high matching rate with the latest key frame are detected; once such a frame is detected it is likely to be a loop frame, and whether it really is one is determined by a series of screening mechanisms such as geometric consistency, group matching and temporal consistency. Once a loop frame is found in the key frame set, the pose transformation matrix between the current frame and the loop frame is estimated, Bundle Adjustment optimization and outlier rejection are performed on this pose transformation matrix and on the map points of the current frame, and finally, taking the pose of the loop frame as the reference, the pose of the current frame is updated through the pose transformation matrix between the loop frame and the current frame; the poses of the co-visible frames with high co-visibility with the current frame are updated, and the map points associated with the related frames are updated. The whole loop optimization eliminates part of the accumulated error, establishes constraints and provides constraint conditions for globally consistent optimization.
The back-end optimization thread is used for optimizing the poses of all the frames according to the pose estimation result of the IMU pre-integration algorithm and the pose transformation matrix between the current frame and the loop frame.
Specifically, referring to fig. 11 and 12, back-end optimization is the key to improving positioning accuracy and reducing the global accumulated error; the back-end optimization module optimizes all key frames together with the co-visible frames and global map points of the current frame. The whole optimization process is divided into two major steps: in the first step, the poses of the key frames are optimized, and all poses are optimized by introducing the inter-frame pose constraints from IMU pre-integration and the constraint between the current frame and the loop frame.
The invention changes the existing way of handling a dynamic environment. By rejecting moving objects rather than dynamic-attribute objects, a large number of valid feature points can be kept, and the static feature points of dynamic-attribute objects can be retained to improve pose estimation accuracy. Compared with rejecting dynamic-attribute objects, the invention can also adapt to extreme cases, for example a scene containing only stationary dynamic-attribute objects: if dynamic-attribute objects were rejected, enough feature points could not be obtained, whereas the moving-object rejection method keeps the static feature points of the dynamic-attribute objects for pose estimation.
The invention also changes the existing SLAM approach of dealing with a dynamic environment through deep-learning-based semantic label screening. It proposes clustering the velocity and rotation direction of inter-frame matched feature point pairs to distinguish moving objects from non-moving objects (static objects and stationary dynamic-attribute objects), and uses IMU data as auxiliary constraint information, screening, among the pose estimation results of the different clustered data sets, the pose estimate closest to the IMU data as the optimal pose estimate, thereby eliminating the interference of moving objects on pose estimation in a dynamic scene. Compared with the deep-learning-based semantic approach, this greatly reduces computational complexity and improves the real-time performance of the positioning algorithm.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no acts or modules are necessarily required of the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The above description is provided for the binocular vision positioning method and system in a dynamic environment, and for those skilled in the art, there may be variations in the specific implementation and application scope according to the ideas of the embodiments of the present invention, and in summary, the content of the present description should not be construed as limiting the present invention.

Claims (10)

1. A binocular vision positioning method under a dynamic environment is characterized by comprising the following steps:
acquiring a left eye image and a right eye image under binocular vision of a camera, and extracting feature points corresponding to map points in a current frame to which the left eye image and the right eye image belong and a previous frame to which the left eye image and the right eye image belong, wherein the map points are feature points added into a map;
according to the characteristic points, obtaining the left-eye and right-eye movement visual matching characteristic point pairs and stereoscopic vision matching characteristic point pairs;
screening the shared matching characteristic point pairs of the stereoscopic vision matching characteristic point pair and the motion vision matching characteristic point pair;
respectively carrying out stereo depth estimation on the characteristic point pairs in the common matching characteristic point pair set to obtain a stereo depth estimation value;
clustering the left-eye and right-eye motion visual matching feature point pairs according to the optical flow velocity and the rotation angle between the motion visual matching feature point pairs by using a K-means++ clustering algorithm to obtain clustering results, and classifying the motion visual matching feature point pairs into a plurality of categories according to the clustering results;
calculating pose estimation results of the IMU of the left eye and the right eye in a time period between a current frame and a previous frame by utilizing an IMU pre-integration algorithm, wherein the pose estimation results comprise inter-frame poses between the current frame and the previous frame;
respectively performing EPnP (Efficient Perspective-n-Point) pose estimation and map point estimation on the motion vision matching feature point pairs under each category to obtain EPnP pose estimation results and map point estimation results;
selecting the EPnP pose estimation result closest to the IMU pose estimation result as the optimal EPnP pose estimation result and map point estimation result, and recording the effective matching feature point pairs corresponding to the optimal EPnP pose estimation result and map point estimation result;
and selecting the feature point pairs corresponding to the effective matching feature point pairs from the common matching feature point pairs, and performing Kalman filtering fusion on the stereo depth estimation values corresponding to these feature point pairs and the EPnP estimated depth values to obtain an estimated pose and estimated map points.
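By way of a non-limiting sketch of the final fusion step in claim 1: when both a stereo depth and an EPnP-derived depth are available for the same feature, a one-dimensional Kalman-style update reduces to inverse-variance weighting. The variance values and the function name below are illustrative assumptions, not part of the claim.

```python
# Illustrative 1-D Kalman-style fusion of two depth estimates for one feature.
def fuse_depths(stereo_depth, stereo_var, epnp_depth, epnp_var):
    """Treat the stereo depth as the prior and the EPnP depth as the
    measurement; the Kalman gain reduces to inverse-variance weighting."""
    gain = stereo_var / (stereo_var + epnp_var)          # Kalman gain
    fused_depth = stereo_depth + gain * (epnp_depth - stereo_depth)
    fused_var = (1.0 - gain) * stereo_var                # reduced uncertainty
    return fused_depth, fused_var

# Example: a 5.2 m stereo estimate (var 0.25) fused with a 5.6 m EPnP
# estimate (var 0.75) gives 5.3 m with variance 0.1875.
depth, var = fuse_depths(5.2, 0.25, 5.6, 0.75)
```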
2. The binocular vision positioning method under the dynamic environment of claim 1, wherein the IMU pre-integration algorithm is:

$$\alpha^{c_k}_{c_{k+1}} = \iint_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t \left(a_t - b_{a_t} - n_a\right)\,\mathrm{d}t^2$$

$$\beta^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} R^{c_k}_t \left(a_t - b_{a_t} - n_a\right)\,\mathrm{d}t$$

$$\gamma^{c_k}_{c_{k+1}} = \int_{t\in[t_k,\,t_{k+1}]} \tfrac{1}{2}\,\Omega\!\left(w_t - b_{w_t} - n_w\right)\gamma^{c_k}_t\,\mathrm{d}t$$

wherein $\alpha^{c_k}_{c_{k+1}}$, $\beta^{c_k}_{c_{k+1}}$ and $\gamma^{c_k}_{c_{k+1}}$ respectively represent the displacement increment, the velocity increment and the quaternion of the rotation angle change over the period from the k-th frame to the (k+1)-th frame, expressed in the k-th frame camera coordinate system; $a_t$ represents the accelerometer measurements; $w_t$ represents the angular velocity measurements; $b_{a_t}$ and $b_{w_t}$ represent the biases of the accelerometer and the gyroscope, respectively; $n_a$ and $n_w$ respectively represent the corresponding noise; $R^{c_k}_t$ is the rotation transformation matrix of time t relative to the k-th frame camera coordinate system and is obtained from the $\gamma^{c_k}_t$ equation; $\Omega$ is the matrix related to the quaternion calculation; $\Delta t_k$ represents the time interval from $t_k$ to $t_{k+1}$; and $g^w$ represents the gravity vector in the world coordinate system.
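For readers unfamiliar with pre-integration, the discrete-time sketch below propagates the displacement, velocity and rotation increments from raw accelerometer and gyroscope samples in the spirit of the continuous-time formulas above. The Euler discretization and the quaternion helpers are simplifying assumptions for illustration, not the patent's exact discretization.

```python
# Illustrative Euler-discretized IMU pre-integration over one frame interval.
import numpy as np

def quat_mult(q, r):
    """Hamilton product of quaternions given as [w, x, y, z]."""
    w0, x0, y0, z0 = q
    w1, x1, y1, z1 = r
    return np.array([
        w0*w1 - x0*x1 - y0*y1 - z0*z1,
        w0*x1 + x0*w1 + y0*z1 - z0*y1,
        w0*y1 - x0*z1 + y0*w1 + z0*x1,
        w0*z1 + x0*y1 - y0*x1 + z0*w1,
    ])

def quat_to_rot(q):
    """3x3 rotation matrix from a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def preintegrate(acc, gyr, dt, ba, bw):
    """acc, gyr: (N, 3) samples between frame k and k+1; dt: sample period."""
    alpha = np.zeros(3)                  # displacement increment
    beta = np.zeros(3)                   # velocity increment
    gamma = np.array([1.0, 0, 0, 0])     # rotation increment as a quaternion
    for a_t, w_t in zip(acc, gyr):
        R = quat_to_rot(gamma)                         # attitude of sample t in frame k
        a = R @ (a_t - ba)                             # bias-corrected, rotated accel
        alpha += beta * dt + 0.5 * a * dt * dt
        beta += a * dt
        dq = np.concatenate([[1.0], 0.5 * (w_t - bw) * dt])
        gamma = quat_mult(gamma, dq)
        gamma /= np.linalg.norm(gamma)                 # keep unit norm
    return alpha, beta, gamma
```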
3. The binocular vision positioning method under the dynamic environment according to claim 1, wherein after the classifying the moving vision matching feature point pairs into a plurality of categories according to the clustering result, the method comprises:
performing histogram statistics on the number of the feature points in each category;
and deleting those categories, together with their corresponding feature points, in which the number of feature points is less than a first preset number and differs from the number of feature points in the other categories by more than a second preset number.
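One possible reading of this pruning rule, as a short sketch: count the matches per cluster and drop clusters that are both small in absolute terms and much smaller than the largest cluster. The two threshold values below are illustrative assumptions.

```python
# Illustrative pruning of sparsely populated clusters (thresholds are assumed).
import numpy as np

def prune_clusters(labels, min_count=10, min_gap=30):
    counts = np.bincount(labels)                 # histogram of cluster sizes
    largest = counts.max()
    keep = [c for c, n in enumerate(counts)
            if not (n < min_count and largest - n > min_gap)]
    mask = np.isin(labels, keep)                 # matches that survive pruning
    return mask, keep
```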
4. The binocular vision positioning method under the dynamic environment of claim 3, wherein the respectively performing EPnP pose estimation and map point estimation on the motion vision matching feature point pairs under each category to obtain the EPnP pose estimation results and map point estimation results comprises:
respectively performing EPnP pose estimation and map point estimation on the feature points under each category, and using the RANSAC (random sample consensus) algorithm to eliminate outliers from the pose estimation result of each category, so as to obtain the EPnP pose estimation result and the map point estimation result corresponding to each category.
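One common way to realize a per-cluster EPnP-with-RANSAC step is OpenCV's solvePnPRansac. Using OpenCV is an assumption rather than part of the claim, and the camera intrinsics and thresholds below are placeholders.

```python
# Illustrative per-cluster EPnP pose estimation with RANSAC outlier rejection.
import numpy as np
import cv2

def estimate_cluster_pose(map_pts_3d, img_pts_2d, K, dist=None):
    """map_pts_3d: (N, 3) map points; img_pts_2d: (N, 2) pixel observations."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        map_pts_3d.astype(np.float64),
        img_pts_2d.astype(np.float64),
        K, dist,
        flags=cv2.SOLVEPNP_EPNP,          # EPnP as the minimal solver
        reprojectionError=2.0,            # pixel threshold (assumed)
        iterationsCount=100,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)            # axis-angle to rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T, inliers.ravel()             # cluster pose plus inlier indices
```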
5. The binocular vision positioning method under the dynamic environment according to claim 4, wherein after obtaining the estimated pose and the estimated map point, the method comprises:
optimizing the estimated pose and the estimated map points by using a graph-based optimization algorithm according to the inter-frame pose of the IMU, to obtain optimized map points;
the graph optimization algorithm is as follows:

$$\min_{\hat T^{k+1}_k,\,P_k} \sum \left( \left\| z_{k+1} - \pi_c\!\left(\hat T^{k+1}_k P_k\right) \right\|^2 + \left\| \ln\!\left( \left(\tilde T^{k+1}_k\right)^{-1} \hat T^{k+1}_k \right)^{\vee} \right\|^2 \right)$$

wherein $\hat T^{k+1}_k$ represents the estimated pose transformation matrix obtained by the EPnP pose estimation, $\tilde T^{k+1}_k$ represents the pose transformation matrix obtained by the IMU pre-integration algorithm, $z_{k+1}$ represents the pixel coordinates in the camera image, $P_k$ represents the 3D coordinates of a map point in the k-th frame camera coordinate system, and $\pi_c(\cdot)$ is the projective transformation based on the camera model, used for transforming 3D coordinates in the camera coordinate system into 2D pixel coordinates in the camera image.
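As a sketch of the cost being minimized, assuming the reprojection-plus-IMU-prior form shown above, the residual below combines pixel reprojection error with the discrepancy between the optimized inter-frame pose and the IMU pre-integrated pose. The weighting and the pose parameterization are assumptions, and a real implementation would typically hand this to g2o or Ceres.

```python
# Illustrative residual for the pose/map-point refinement step.
import numpy as np

def project(K, T, P_k):
    """Project a 3-D point in frame k into frame k+1 pixels via pose T."""
    P = T[:3, :3] @ P_k + T[:3, 3]
    uvw = K @ P
    return uvw[:2] / uvw[2]

def residual(T_est, points_3d, pixels_next, K, T_imu, w_imu=1.0):
    """Reprojection error for every map point plus an IMU relative-pose prior."""
    reproj = np.concatenate([
        project(K, T_est, P) - z for P, z in zip(points_3d, pixels_next)
    ])
    # Discrepancy between the visual and IMU inter-frame poses.
    dT = np.linalg.inv(T_imu) @ T_est
    rot_err = np.arccos(np.clip((np.trace(dT[:3, :3]) - 1) / 2, -1.0, 1.0))
    imu_term = w_imu * np.concatenate([dT[:3, 3], [rot_err]])
    return np.concatenate([reproj, imu_term])   # fed to a least-squares solver
```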
6. The binocular vision positioning method under the dynamic environment according to claim 5, wherein the optimizing the estimated pose and the estimated map points to obtain the optimized map points comprises:
replacing the map points before optimization by the optimized map points, updating a local map, and eliminating map points which do not belong to the local map;
comparing whether the number of co-visible feature points between the current frame and the second newest key frame is higher than a preset threshold;
if it is higher than the preset threshold, replacing the newest key frame with the current frame;
and if it is lower than the preset threshold, adding the current frame to the key frame set.
7. The binocular vision positioning method under the dynamic environment according to claim 1, wherein the obtaining left-eye and right-eye motion vision matching feature point pairs according to the feature points comprises:
performing ORB descriptor matching between the feature points corresponding to the map points in the current frame to which the left-eye image belongs and those corresponding to the map points in the previous frame to which the left-eye image belongs, to obtain left-eye matching feature point pairs;
performing ORB descriptor matching between the feature points corresponding to the map points in the current frame to which the right-eye image belongs and those corresponding to the map points in the previous frame to which the right-eye image belongs, to obtain right-eye matching feature point pairs; wherein the left-eye and right-eye motion vision matching feature point pairs include the left-eye matching feature point pairs and the right-eye matching feature point pairs.
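A common way to perform the ORB descriptor matching described in this claim is a brute-force Hamming matcher with a ratio test; the ratio value and feature count below are assumptions.

```python
# Illustrative ORB matching between the previous and current left-eye frames.
import cv2

def match_orb(img_prev, img_curr, ratio=0.75):
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = []
    for pair in pairs:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])                    # Lowe ratio test survivor
    prev_pts = [kp1[m.queryIdx].pt for m in good]   # matched pixel coordinates
    curr_pts = [kp2[m.trainIdx].pt for m in good]
    return prev_pts, curr_pts
```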
8. The binocular vision positioning method under the dynamic environment according to claim 1, wherein the obtaining the stereoscopic vision matching feature point pairs of the left and right eyes according to the feature points comprises:
and respectively performing ORB descriptor matching and SAD stereo matching between the feature points in the current frame to which the left-eye image belongs and the feature points in the current frame to which the right-eye image belongs, to obtain the left-eye and right-eye stereoscopic vision matching feature point pairs.
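The SAD stereo-matching step can be illustrated by the minimal block matcher below, which scans the same row of the rectified right image for the window with the smallest sum of absolute differences. The window size and disparity range are assumed values, and border handling is omitted.

```python
# Illustrative SAD block matching for one left-image feature along its row.
import numpy as np

def sad_match(left, right, x, y, win=5, max_disp=64):
    """left/right: rectified grayscale images; (x, y): an interior left feature."""
    h = win // 2
    patch = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.int32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        xr = x - d                                   # candidate column in right image
        if xr - h < 0:
            break
        cand = right[y - h:y + h + 1, xr - h:xr + h + 1].astype(np.int32)
        cost = np.abs(patch - cand).sum()            # sum of absolute differences
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d                                    # disparity of the best match
```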
9. A binocular vision positioning system in a dynamic environment, comprising:
the system comprises a main thread, a pose estimation thread, a loop thread and a back-end optimization thread;
the main thread is used for acquiring a left eye image and a right eye image under binocular vision of the camera, initializing the left eye image and the right eye image, receiving a restart signal sent by the pose estimation thread or the loop thread after the initialization is successful, and initializing the pose estimation thread or the loop thread again according to the restart signal;
the pose estimation thread is used for acquiring a left eye image and a right eye image under binocular vision of the camera, extracting feature points in a current frame to which the left eye image and the right eye image belong and feature points corresponding to a map point in a previous frame to which the left eye image and the right eye image belong, and calculating an estimated pose and an estimated map point of the current frame according to the feature points;
the loop thread is used for detecting a loop frame, estimating a pose transformation matrix between a current frame and the loop frame, and updating the pose of the current frame by using the pose of the loop frame as a reference;
the back-end optimization thread is used for calculating pose estimation results of the IMU of the left eye and the right eye in a time period between a current frame and a previous frame by utilizing an IMU pre-integration algorithm, wherein the pose estimation results comprise inter-frame poses between the current frame and the previous frame, and the poses of all the frames are optimized according to the pose estimation results of the IMU pre-integration algorithm and a pose transformation matrix between the current frame and a loop frame.
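A thread layout matching the multi-thread system of claim 9 might be organised as sketched below. The queue-based hand-off, the stub worker bodies, and the omission of the loop thread are assumptions made only to illustrate the structure.

```python
# Illustrative multi-thread skeleton mirroring the system structure of claim 9.
import queue
import threading
import time

frame_q = queue.Queue()   # main thread -> pose estimation thread
pose_q = queue.Queue()    # pose estimation thread -> back-end optimization thread

def main_thread(n_frames=5):
    for i in range(n_frames):                      # stand-in for stereo acquisition
        frame_q.put(("left_image", "right_image", i))
        time.sleep(0.01)
    frame_q.put(None)                              # poison pill shuts the pipeline down

def pose_estimation_thread():
    while (item := frame_q.get()) is not None:
        left, right, idx = item                    # feature extraction, clustering,
        pose_q.put({"frame": idx, "pose": "T_k"})  # EPnP and fusion would run here
    pose_q.put(None)

def backend_optimization_thread():
    while (item := pose_q.get()) is not None:
        pass                                       # IMU pre-integration + graph optimization

workers = [threading.Thread(target=f) for f in
           (main_thread, pose_estimation_thread, backend_optimization_thread)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```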
10. The binocular vision positioning system under dynamic environment of claim 9, wherein the calculating the estimated pose and the estimated map point of the current frame according to the feature points comprises:
obtaining left-eye and right-eye motion vision matching feature point pairs and stereoscopic vision matching feature point pairs according to the feature points;
screening the common matching feature point pairs shared by the stereoscopic vision matching feature point pairs and the motion vision matching feature point pairs;
respectively performing stereo depth estimation on the feature point pairs in the common matching feature point pair set to obtain stereo depth estimation values;
clustering the left-eye and right-eye motion vision matching feature point pairs according to the optical flow velocity and the rotation angle between the motion vision matching feature point pairs by using a K-means++ clustering algorithm to obtain clustering results, and classifying the motion vision matching feature point pairs into a plurality of categories according to the clustering results;
respectively performing EPnP (Efficient Perspective-n-Point) pose estimation and map point estimation on the motion vision matching feature point pairs under each category to obtain EPnP pose estimation results and map point estimation results;
selecting the EPnP pose estimation result closest to the IMU pose estimation result as the optimal EPnP pose estimation result and map point estimation result, and recording the effective matching feature point pairs corresponding to the optimal EPnP pose estimation result and map point estimation result;
and selecting the feature point pairs corresponding to the effective matching feature point pairs from the common matching feature point pairs, and performing Kalman filtering fusion on the stereo depth estimation values corresponding to these feature point pairs and the EPnP estimated depth values to obtain an estimated pose and estimated map points.
CN201910634021.1A 2019-07-12 2019-07-12 Binocular vision positioning method and system under dynamic environment Active CN110490900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910634021.1A CN110490900B (en) 2019-07-12 2019-07-12 Binocular vision positioning method and system under dynamic environment


Publications (2)

Publication Number Publication Date
CN110490900A CN110490900A (en) 2019-11-22
CN110490900B true CN110490900B (en) 2022-04-19

Family

ID=68546103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910634021.1A Active CN110490900B (en) 2019-07-12 2019-07-12 Binocular vision positioning method and system under dynamic environment

Country Status (1)

Country Link
CN (1) CN110490900B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091621A (en) * 2019-12-11 2020-05-01 东南数字经济发展研究院 Binocular vision synchronous positioning and composition method, device, equipment and storage medium
CN111016887A (en) * 2019-12-23 2020-04-17 深圳市豪恩汽车电子装备股份有限公司 Automatic parking device and method for motor vehicle
CN111060113B (en) * 2019-12-31 2022-04-08 歌尔股份有限公司 Map updating method and device
CN111476812A (en) * 2020-04-03 2020-07-31 浙江大学 Map segmentation method and device, pose estimation method and equipment terminal
CN111598939B (en) * 2020-05-22 2021-01-26 中原工学院 Human body circumference measuring method based on multi-vision system
CN111639658B (en) * 2020-06-03 2023-07-21 北京维盛泰科科技有限公司 Method and device for detecting and eliminating dynamic feature points in image matching
CN111683203B (en) 2020-06-12 2021-11-09 达闼机器人有限公司 Grid map generation method and device and computer readable storage medium
CN111862150B (en) * 2020-06-19 2024-06-14 杭州易现先进科技有限公司 Image tracking method, device, AR equipment and computer equipment
CN111914784B (en) * 2020-08-10 2022-02-08 北京大成国测科技有限公司 Method and device for detecting intrusion of trackside obstacle in real time and electronic equipment
CN113759384B (en) * 2020-09-22 2024-04-05 北京京东乾石科技有限公司 Method, device, equipment and medium for determining pose conversion relation of sensor
CN112241983A (en) * 2020-10-19 2021-01-19 深圳市目心智能科技有限公司 Perception system and robot based on initiative binocular vision
CN112444246B (en) * 2020-11-06 2024-01-26 北京易达恩能科技有限公司 Laser fusion positioning method in high-precision digital twin scene
CN112734842B (en) * 2020-12-31 2022-07-01 武汉第二船舶设计研究所(中国船舶重工集团公司第七一九研究所) Auxiliary positioning method and system for centering installation of large ship equipment
CN113220818B (en) * 2021-05-27 2023-04-07 南昌智能新能源汽车研究院 Automatic mapping and high-precision positioning method for parking lot
CN113503873B (en) * 2021-07-14 2024-03-12 北京理工大学 Visual positioning method for multi-sensor fusion
CN113920194B (en) * 2021-10-08 2023-04-21 电子科技大学 Positioning method of four-rotor aircraft based on visual inertia fusion
CN114066824B (en) * 2021-10-28 2024-05-14 华南理工大学 Binocular vision odometer method with dynamic target detection function

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469405A (en) * 2015-11-26 2016-04-06 清华大学 Visual ranging-based simultaneous localization and map construction method
CN105953796A (en) * 2016-05-23 2016-09-21 北京暴风魔镜科技有限公司 Stable motion tracking method and stable motion tracking device based on integration of simple camera and IMU (inertial measurement unit) of smart cellphone
CN107193279A (en) * 2017-05-09 2017-09-22 复旦大学 Robot localization and map structuring system based on monocular vision and IMU information
CN109166149A (en) * 2018-08-13 2019-01-08 武汉大学 A kind of positioning and three-dimensional wire-frame method for reconstructing and system of fusion binocular camera and IMU
CN109307508A (en) * 2018-08-29 2019-02-05 中国科学院合肥物质科学研究院 A kind of panorama inertial navigation SLAM method based on more key frames
CN109465832A (en) * 2018-12-18 2019-03-15 哈尔滨工业大学(深圳) High-precision vision and the tight fusion and positioning method of IMU and system
CN109631855A (en) * 2019-01-25 2019-04-16 西安电子科技大学 High-precision vehicle positioning method based on ORB-SLAM
CN109993113A (en) * 2019-03-29 2019-07-09 东北大学 A kind of position and orientation estimation method based on the fusion of RGB-D and IMU information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Simultaneous localization and mapping: Present, future, and the robust-perception age; CADENA C; IEEE Transactions on Robotics; 20161231; full text *
Vision-IMU based simultaneous localization and mapping algorithm for robots; Yao Erliang; Chinese Journal of Scientific Instrument; 20180115; Vol. 39 (No. 4); full text *
Research on binocular vision SLAM fusing IMU information; Xu Kuan; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 01); full text *

Also Published As

Publication number Publication date
CN110490900A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110490900B (en) Binocular vision positioning method and system under dynamic environment
CN109993113B (en) Pose estimation method based on RGB-D and IMU information fusion
CN112734852B (en) Robot mapping method and device and computing equipment
CN112634451B (en) Outdoor large-scene three-dimensional mapping method integrating multiple sensors
Tanskanen et al. Live metric 3D reconstruction on mobile phones
CN112233177B (en) Unmanned aerial vehicle pose estimation method and system
CN111024066A (en) Unmanned aerial vehicle vision-inertia fusion indoor positioning method
CN114623817B (en) Self-calibration-contained visual inertial odometer method based on key frame sliding window filtering
US12073630B2 (en) Moving object tracking method and apparatus
CN112115980A (en) Binocular vision odometer design method based on optical flow tracking and point line feature matching
CN113313763B (en) Monocular camera pose optimization method and device based on neural network
CN111882602B (en) Visual odometer implementation method based on ORB feature points and GMS matching filter
CN113223045A (en) Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation
CN112419497A (en) Monocular vision-based SLAM method combining feature method and direct method
CN111623773B (en) Target positioning method and device based on fisheye vision and inertial measurement
CN113012224B (en) Positioning initialization method and related device, equipment and storage medium
CN115406447A (en) Autonomous positioning method of quad-rotor unmanned aerial vehicle based on visual inertia in rejection environment
CN115661341A (en) Real-time dynamic semantic mapping method and system based on multi-sensor fusion
CN112907633B (en) Dynamic feature point identification method and application thereof
CN112731503B (en) Pose estimation method and system based on front end tight coupling
CN116147618B (en) Real-time state sensing method and system suitable for dynamic environment
WO2023130842A1 (en) Camera pose determining method and apparatus
CN116045965A (en) Multi-sensor-integrated environment map construction method
CN112509006A (en) Sub-map recovery fusion method and device
CN113011212A (en) Image recognition method and device and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant