CN116151320A - Visual odometer method and device for resisting dynamic target interference - Google Patents

Visual odometer method and device for resisting dynamic target interference

Info

Publication number
CN116151320A
Authority
CN
China
Prior art keywords
adjacent frames
input images
visual
camera
rotation matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211171786.4A
Other languages
Chinese (zh)
Inventor
刘成
熊帅
程松
李平力
胡芳
李芳�
张�杰
宋志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Muxing Technology Co ltd
Original Assignee
Beijing Muxing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Muxing Technology Co ltd filed Critical Beijing Muxing Technology Co ltd
Priority to CN202211171786.4A priority Critical patent/CN116151320A/en
Publication of CN116151320A publication Critical patent/CN116151320A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C22/00Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a visual odometer method and a visual odometer device for resisting dynamic target interference. The method comprises the following steps: constructing a vehicle target detection network for detecting vehicles; acquiring road images of adjacent frames output by a vision camera, detecting target vehicles in the road images with the vehicle target detection network, and marking each identified target vehicle with an anchor frame; removing the pixel content within each anchor frame from the road images, and using the result as the input images processed by the visual odometer; the visual odometer acquires the input images of the adjacent frames and extracts feature points from each input image with a feature extraction algorithm; performing feature point matching between the input images of the adjacent frames with a feature matching algorithm based on the feature points; and acquiring a rotation matrix and a translation vector between the input images of the adjacent frames based on the feature point matching result, and performing motion estimation for the vision camera.

Description

Visual odometer method and device for resisting dynamic target interference
Technical Field
The invention relates to the field of automatic driving, in particular to a visual odometer method and a visual odometer device for resisting dynamic target interference.
Background
Visual odometry (VO) is the process of estimating the ego-motion of a platform using a single camera or multiple cameras as input; its application fields include autonomous driving, robotics, unmanned aerial vehicles, augmented reality and so on. The VO concept was introduced by Nister in a landmark 2004 paper. The term is deliberately analogous to wheel odometry, which incrementally estimates the motion of a vehicle by integrating the number of wheel revolutions over time. Likewise, VO uses an on-board camera and detects image motion changes to incrementally estimate the carrier pose. For VO to work effectively, there must be sufficient illumination in the environment and enough texture in the static scene so that motion features can be extracted. In addition, sequential images need to be captured with sufficient scene overlap.
Compared with wheel odometry, VO has the advantage that it is not affected by wheel slip on uneven ground or under other adverse conditions. VO also provides more accurate trajectory estimation, with relative position errors in the range of 0.1% to 2%. This makes VO a beneficial complement to wheel odometry and to other navigation systems such as global satellite navigation systems (e.g. the Global Positioning System, GPS), inertial measurement units (Inertial Measurement Unit, IMU) and radar ranging systems. VO is particularly important in environments where GPS fails, such as among urban high-rise buildings, in tunnels, under water or in space.
The main VO approaches are classified into feature-point methods and direct methods. The feature-point method is currently the mainstream; it still works when the noise is larger and the camera moves faster, but the resulting map consists only of sparse feature points. The direct method can build a dense map without extracting features, but suffers from a large computational load and poor robustness.
Under conditions of rich texture, good illumination and no dynamic target interference, existing VO systems can achieve the desired accuracy and performance. However, in environments such as urban roads, the presence of a large number of dynamic targets such as vehicles and pedestrians significantly affects the extraction of VO feature points and the calculation of optical flow, introducing errors into the estimation of the carrier pose; in severe cases the VO system even diverges and cannot work normally. For example, Chinese patent CN 109813334A discloses a real-time high-precision vehicle odometry calculation method based on binocular vision, in which feature extraction and matching are performed directly on the acquired adjacent frame images, so the scheme is easily affected by objects such as dynamic vehicles.
Disclosure of Invention
The invention aims to provide a visual odometer method and a visual odometer device for resisting dynamic target interference.
In order to achieve the above object, the present invention provides a visual odometer method for resisting dynamic target interference, comprising:
s1, constructing a vehicle target detection network for detecting a vehicle;
s2, acquiring road images of adjacent frames output by a vision camera, respectively detecting target vehicles on the road images based on the vehicle target detection network, and respectively marking the identified target vehicles by adopting anchor frames;
s3, eliminating pixel contents in the anchor frame range in the road image respectively to serve as an input image processed by a visual odometer;
s4, the visual odometer acquires the input images of the adjacent frames, and adopts a feature extraction algorithm to extract feature points in the input images respectively;
s5, adopting a feature matching algorithm and carrying out feature point matching on the input images of the adjacent frames based on the feature points;
s6, based on the characteristic point matching result of the input images of the adjacent frames, acquiring a rotation matrix and a translation vector between the input images of the adjacent frames, and performing motion estimation on the vision camera.
According to one aspect of the present invention, in step S6, based on a feature point matching result of the input images of adjacent frames, a rotation matrix and a translation vector between the input images of adjacent frames are acquired, and the step of performing motion estimation on the visual camera includes:
constructing a relative pose relationship when the visual camera captures the road image of the adjacent frame based on the rotation matrix and the translation vector;
acquiring absolute pose tracks of the visual camera when capturing an initial frame road image based on the relative pose relation;
and performing motion estimation on the visual camera based on the absolute pose track.
According to one aspect of the invention, the step of acquiring absolute pose tracks of the visual camera with respect to capturing an initial frame road image based on the relative pose relationship comprises:
acquiring the relative pose change of the visual camera corresponding to the road image of the adjacent frame based on the relative pose relation;
acquiring an absolute pose of the visual camera when capturing the road image of each frame based on the pose of the visual camera when capturing the road image of an initial frame and based on the relative pose change;
and based on the absolute pose of the vision camera, sequentially connecting according to a time sequence, and performing incremental pose track reconstruction to acquire the absolute pose track.
According to one aspect of the present invention, in the step of acquiring absolute pose tracks of the visual camera with respect to capturing an initial frame road image based on the relative pose relationship, the method further comprises:
extracting a local track in the absolute pose track to perform local track optimization; wherein, include:
extracting track segments containing m absolute pose of the vision camera from the absolute pose track as the local track;
and iteratively calculating the minimum value of the sum of the 3D point cloud re-projection error squares of the road image corresponding to the m absolute poses in the local track, and completing the optimization of the local track.
According to one aspect of the invention, in the step of constructing a relative pose relationship when the visual camera captures the road image of adjacent frames based on the rotation matrix and the translation vector, the relative pose relationship is expressed as:

$$T_{k,k-1}=\begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}$$

wherein $R_{k,k-1}$ is the rotation matrix from the image coordinates at time k-1 to the image coordinates at time k, and $t_{k,k-1}$ is the corresponding translation vector.
According to one aspect of the invention, the vision camera employs a monocular, binocular or depth camera.
According to one aspect of the invention, if the vision camera adopts a monocular camera, the rotation matrix and the translation vector are obtained in a 2D-to-2D mode, or the rotation matrix and the translation vector are obtained in a 3D-to-2D mode;
if the rotation matrix and the translation vector are obtained in a 2D-to-2D mode, in step S5, feature point matching is performed by adopting the input images of two adjacent frames;
in step S6, the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result of the input images of the adjacent frames includes:
acquiring the essential matrix of the input images of the adjacent frames based on the feature point matching result;
decomposing the rotation matrix and the translation vector from the essential matrix;
if the rotation matrix and the translation vector are obtained in a 3D-to-2D manner, in step S5, feature point matching is performed by using the input images of three adjacent frames, which includes:
triangularizing the input images by adopting the first two frames to obtain a 3D image;
performing feature point matching based on the 3D image and the input image of a third frame;
in step S6, in the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result of the input images of the adjacent frames, the rotation matrix and the translation vector are obtained by adopting PnP algorithm based on the feature point matching result.
According to one aspect of the invention, if the vision camera adopts a binocular camera or a depth camera, the rotation matrix and the translation vector are acquired in a 3D-3D manner, or the rotation matrix and the translation vector are acquired in a 3D-2D manner;
if the rotation matrix and the translation vector are obtained in a 3D-3D mode, in step S5, feature point matching is respectively carried out on the input images of two adjacent frames with left eyes and the input images of two adjacent frames with right eyes; or, matching the input images of the left eye and the right eye at the same moment to generate 3D matching images, and matching the characteristic points based on the 3D matching images of the adjacent frames;
in step S6, the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result of the input images of the adjacent frames includes:
triangularizing the matched characteristic points based on the characteristic point matching result;
acquiring 3D features based on the triangulated feature points, and calculating the rotation matrix and the translation vector based on the 3D features;
if the rotation matrix and the translation vector are acquired in a 3D-to-2D manner, step S5 includes:
triangularizing the input images of the left eye and the right eye at the previous moment to obtain a 3D image;
performing feature point matching based on the 3D image and the input image of the left eye or the right eye at the later moment;
in step S6, in the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result of the input images of the adjacent frames, the rotation matrix and the translation vector are obtained by adopting PnP algorithm based on the feature point matching result.
In order to achieve the above object, the present invention provides a visual odometer device adopting the above visual odometer method, comprising:
the visual camera is used for acquiring an external road image;
the vehicle target detection module is used for receiving the road image output by the visual camera, detecting a target vehicle and marking an anchor frame;
the image processing module is used for eliminating pixels in the anchor frame range in the road image and used as an input image processed by the visual odometer;
and the visual odometer module is used for matching the input images of the adjacent frames and outputting characteristic point matching results, and acquiring a rotation matrix and a translation vector between the input images of the adjacent frames based on the characteristic point matching results so as to perform motion estimation on the visual camera.
According to one aspect of the invention, the vision camera is a monocular camera, a binocular camera, or a depth camera.
According to the scheme of the invention, by detecting and removing objects such as vehicles and pedestrians from the image, the possible interference of dynamic targets is reduced, and the accuracy and robustness of the visual odometer in application scenes such as urban roads are improved.
According to one aspect of the invention, the functions of the invention are implemented by algorithms and do not depend on external sensors such as lidar, so the application cost is low. Compared with other dynamic target suppression algorithms, the method eliminates potential dynamic target interference more thoroughly and the effect is more obvious.
According to one scheme of the invention, the method is applicable to different types of vision cameras and therefore has extremely high usability and broad application prospects.
Drawings
FIG. 1 schematically illustrates a block diagram of steps of a visual odometry method according to an embodiment of the invention;
FIG. 2 schematically illustrates a flow chart of a visual odometry method according to an embodiment of the invention;
FIG. 3 schematically illustrates an original image taken by a vision camera in a vision odometry method in accordance with one embodiment of the present invention;
FIG. 4 schematically illustrates a vehicle target identified and detected using a vehicle target detection network in a visual odometry method according to an embodiment of the invention;
FIG. 5 schematically illustrates an input image for use in a visual odometer calculation after rejection of a vehicle target in an anchor frame in a visual odometer method in accordance with an embodiment of the invention;
FIG. 6 schematically illustrates a diagram of tracking an environmental image and its features by camera movement in a visual odometry method according to an embodiment of the invention;
FIG. 7 schematically illustrates a diagram of epipolar constraint geometry in a visual odometry method according to one embodiment of the invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
In describing embodiments of the present invention, the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer" and the like are used in terms of orientation or positional relationship based on that shown in the drawings, which are merely for convenience of description and to simplify the description, rather than to indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operate in a specific orientation, and thus the above terms should not be construed as limiting the present invention.
Referring to fig. 1 and 2, according to an embodiment of the present invention, a visual odometer method for resisting dynamic target interference includes:
s1, constructing a vehicle target detection network for detecting a vehicle;
s2, acquiring road images of adjacent frames output by the vision camera, respectively detecting target vehicles on the road images based on a vehicle target detection network, and respectively marking the identified target vehicles by adopting anchor frames;
s3, eliminating pixel contents in an anchor frame range in the road image respectively to serve as an input image processed by the visual odometer;
s4, acquiring input images of adjacent frames by a visual odometer, and respectively extracting characteristic points in the input images by a characteristic extraction algorithm;
s5, adopting a feature matching algorithm and carrying out feature point matching on the input images of the adjacent frames based on the feature points;
s6, based on the characteristic point matching result of the input images of the adjacent frames, a rotation matrix and a translation vector between the input images of the adjacent frames are obtained and used for carrying out motion estimation on the vision camera.
According to an embodiment of the present invention, in step S1, the step of constructing a vehicle object detection network for detecting a vehicle includes:
s11, acquiring and loading a vehicle target detection data set; in the present embodiment, a large open source data set BBD 100K is used. It is a large diverse dataset commonly used for autopilot applications, annotated with more than 100000 images, category containing buses, pedestrians, bicycles, trucks, cars, trains, and riders, etc. Of course, in this embodiment, instead of using the existing public data set, a custom data set may be used for training, but the pictures in the custom data set need to be manually marked and converted into a required input format.
In this embodiment, the acquired vehicle target detection data set is divided into two parts, and one part of the data set for training the deep learning target detection network may be referred to as a "training set"; one is a data set for evaluating the deep learning object detection network, which may be referred to as a "test set".
S12, building and training a vehicle target detection network based on deep learning; in this embodiment, a deep learning target detection algorithm, such as region-based convolutional neural networks (Region Convolutional Neural Networks, RCNN), the Single Shot MultiBox Detector (SSD), YOLO (You Only Look Once), spatial pyramid pooling (Spatial Pyramid Pooling, SPP), feature pyramid networks (Feature Pyramid Networks, FPN), RetinaNet and the like, is used to build the deep learning target detection network.
Taking the YOLOv5 algorithm as an example, the target detection network consists of a feature extraction network and a detection network. The feature extraction network is typically a pre-trained convolutional neural network (Convolutional Neural Networks, CNN), although other pre-trained networks may also be used. Compared with the feature extraction network, the detection network is a small CNN composed of a few convolution layers and layers specific to YOLOv5. When the YOLOv5 target detection network is created, parameters such as the input size, the number of anchor frames and the feature extraction network are designed according to actual needs.
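As an illustration of how such a network could be instantiated, the following minimal sketch loads a pretrained YOLOv5 model through the public Ultralytics torch hub entry point and keeps only vehicle-class detections. The model name, the image file 'road_frame.jpg' and the COCO class ids are illustrative assumptions; the patent itself trains its own network on BDD100K.

```python
import numpy as np
import torch

# Pretrained YOLOv5 small model from the Ultralytics hub (assumption for illustration)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.4                          # detection confidence threshold

results = model('road_frame.jpg')         # hypothetical road image frame
det = results.xyxy[0].cpu().numpy()       # rows of [x1, y1, x2, y2, conf, cls]
vehicle_cls = {2, 5, 7}                   # COCO ids for car, bus, truck (pretrained weights)
boxes = det[np.isin(det[:, 5].astype(int), list(vehicle_cls))][:, :4]
```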
S13, performing data enhancement on the training set during the training of the vehicle target detection network built in step S12, in order to further strengthen the training process; in this embodiment, data enhancement improves the accuracy of network training by randomly transforming the original data in the training set during training. Data enhancement effectively extends the size of the training set without increasing the number of actually annotated training samples. In this embodiment, the training set may be enhanced by randomly flipping the images and the associated bounding-box labels horizontally.
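A minimal sketch of the horizontal-flip augmentation described above, assuming the box labels are given as [x1, y1, x2, y2] pixel coordinates; the function name and flip probability are illustrative assumptions.

```python
import numpy as np

def random_hflip(image, boxes, p=0.5):
    """Randomly flip an image and its [x1, y1, x2, y2] box labels horizontally."""
    if np.random.rand() < p:
        h, w = image.shape[:2]
        image = image[:, ::-1].copy()        # mirror the pixel columns
        x1 = w - boxes[:, 2]                 # new left edge comes from the old right edge
        x2 = w - boxes[:, 0]
        boxes = np.stack([x1, boxes[:, 1], x2, boxes[:, 3]], axis=1)
    return image, boxes
```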
S14, evaluating the trained vehicle target detection network; in this embodiment, the trained vehicle target detection network is evaluated on the test set to measure its performance. Specifically, general metrics such as mean average precision (Mean Average Precision, mAP) and log-average miss rate (Log Average Miss Rate, LAMR) may be calculated. For example, when mAP is used to evaluate performance, the average precision combines the ability of the vehicle target detection network to make correct classifications (precision) and the ability of the detector to find all relevant targets (recall). The precision/recall (PR) curve shows the precision of the detector at different recall levels; ideally it is 1 at every point. Therefore, to improve the average precision, more training data may be used to improve the training effect.
As shown in fig. 1,2, 3 and 4, in step S2, the steps of acquiring road images of adjacent frames output by the vision camera, respectively detecting target vehicles on the road images based on the vehicle target detection network, and respectively marking the identified target vehicles by using anchor frames include:
s21, continuously shooting road images in the travelling direction based on a vision camera fixed on a carrier (or intercepting the road images based on video streams), and outputting the acquired road images to a vehicle target detection network running on the upper line according to time sequence;
s22, sequentially acquiring corresponding road images by a vehicle target detection network and detecting the vehicle target;
s23, respectively performing anchor frame marking on the vehicle targets in the road image based on the vehicle target detection result output by the vehicle target detection network.
As shown in fig. 5, in step S3, in the step of removing the pixel content within the anchor frame range in the road image to obtain the input image processed by the visual odometer, the anchor frames of the vehicle targets marked in the foregoing step are used as boundaries, and the pixels containing the vehicle target inside each anchor frame are deleted or masked; the resulting image is used as the input image of the visual odometer.
With the above arrangement, since vehicle targets are no longer present in the road picture, the image feature points recognized and extracted by the visual odometer no longer fall on vehicle targets, thereby eliminating the potential interference of moving vehicles.
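A minimal sketch of this masking step, assuming the anchor boxes are available as an (N, 4) array of pixel coordinates from the detection step; the fill value and function name are illustrative assumptions.

```python
import numpy as np

def mask_vehicle_boxes(image, boxes, fill=0):
    """Remove the pixel content inside each detected vehicle anchor box (x1, y1, x2, y2)."""
    out = image.copy()
    h, w = out.shape[:2]
    for x1, y1, x2, y2 in boxes.astype(int):
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        out[y1:y2, x1:x2] = fill             # masked region is ignored by the visual odometer
    return out
```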
As shown in fig. 5, in step S4, in the step of acquiring the input images of adjacent frames and extracting feature points from each input image with a feature extraction algorithm, representative points, i.e. feature points, are first selected from the input images (i.e. the images obtained by removing the vehicle targets inside the anchor frames in the foregoing step). Feature points should remain stable under small changes of the camera viewing angle, so that the same feature points can be extracted from successive input images. On the basis of the feature points and their matching, the pose estimation of the visual camera and the localization of the feature points can then be carried out. In this embodiment, one of the SIFT, SURF and ORB algorithms is used to extract the feature points, so that the obtained feature points have good repeatability and distinguishability as well as high efficiency and locality.
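A minimal sketch of the feature extraction step using OpenCV's ORB implementation (SIFT or SURF could be substituted where available); the function name and the number of features are illustrative assumptions.

```python
import cv2

def extract_features(masked_image, n_features=2000):
    """ORB keypoints and descriptors on the vehicle-free input image."""
    gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors
```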
According to one embodiment of the present invention, after the feature points in the adjacent frame images have been computed and extracted, the feature points need to be matched. Specifically, feature point matching solves the data association problem in the visual odometer, i.e. it determines the correspondence between the feature points seen now and those seen previously. By accurately matching descriptors between images, or between an image and a map, a great deal of burden can be removed from subsequent operations such as pose estimation and optimization. In step S5, in the step of performing feature point matching on the input images of adjacent frames with a feature matching algorithm based on the feature points, consider the input images at two adjacent times k and k+1: if feature points $x^{(k)}_m$, m = 1, 2, …, M, are extracted from image $I_k$ and feature points $x^{(k+1)}_n$, n = 1, 2, …, N, are extracted from image $I_{k+1}$, matching can be performed by brute-force matching (Brute-Force match), i.e. for each feature point $x^{(k)}_m$ the descriptor distance to every $x^{(k+1)}_n$ is measured, the distances are sorted, and the nearest feature point is taken as the matching point. The descriptor distance represents the similarity between two features, and different distance norms can be used in practice. In this embodiment, when the number of feature points is large, the computational load of brute-force matching becomes heavy, especially when matching an input image of a certain frame against a map; the resulting delay makes it difficult to meet real-time requirements. In this case, the fast approximate nearest neighbour (Fast Library for Approximate Nearest Neighbors, FLANN) algorithm is better suited to the situation where the number of matching points is large.
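A minimal sketch of brute-force descriptor matching with OpenCV, assuming binary (ORB) descriptors; for large numbers of features a FLANN-based matcher could be used instead, as noted above. The function name and the cap on the number of matches are illustrative assumptions.

```python
import cv2

def match_features(des_k, des_k1, max_matches=500):
    """Brute-force Hamming matching between the descriptors of frames k and k+1."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des_k, des_k1)
    matches = sorted(matches, key=lambda m: m.distance)   # nearest descriptors first
    return matches[:max_matches]
```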
As shown in fig. 6, in step S6, based on the feature point matching result of the input images of the adjacent frames, a rotation matrix and a translation vector between the input images of the adjacent frames are obtained, so as to perform motion estimation on the vision camera, which includes:
constructing the relative pose relationship when the visual camera captures the road images of adjacent frames, based on the rotation matrix and the translation vector; in this embodiment, the coordinate frame of the image frames of the vision camera is taken as the carrier coordinate frame. If the vision camera is a binocular camera or a depth camera, the left-eye coordinate frame is taken as the origin without loss of generality. In this embodiment, the relative pose relationship $T_{k,k-1}$ between adjacent camera positions (or camera-rig positions) is computed from visual features, and the absolute pose $C_k$ of each input image frame with respect to the initial coordinate frame at k = 0 is then obtained from these relative pose relationships.
In this embodiment, the change in the relative positions of the vision camera at successive times k-1 and k is described by the transformation $T_{k,k-1}\in\mathbb{R}^{4\times4}$, and the relative pose relationship is obtained as:

$$T_{k,k-1}=\begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}$$

wherein $R_{k,k-1}$ is the rotation matrix from the image coordinates at time k-1 to the image coordinates at time k, and $t_{k,k-1}$ is the corresponding translation vector.
Further, absolute pose tracks of the visual camera with respect to capturing the initial frame road image are acquired based on the relative pose relationship; wherein, include:
acquiring the relative pose change of the visual camera corresponding to the road images of adjacent frames, based on the relative pose relationship; in this embodiment, the set of relative pose changes of the consecutive input image frames, $T_{1:n}=\{T_{1,0},\dots,T_{n,n-1}\}$, can be obtained from the aforementioned relative pose relationship; this set contains the motion of the vision camera over the whole sequence. For simplicity, $T_{k,k-1}$ is written as $T_k$.
acquiring the absolute pose of the visual camera when capturing each frame of the road image, based on the pose of the visual camera when capturing the initial road image frame and on the relative pose changes; in this embodiment, the visual camera poses are denoted $C_{0:n}=\{C_0,\dots,C_n\}$, which contains the pose of the vision camera at the initial time k = 0. The current pose $C_n$ is obtained by chaining the transformations $T_k$ (k = 1, …, n), so that $C_n = C_{n-1}T_n$, where $C_0$ is the camera pose at k = 0 and can be specified arbitrarily by the user.
connecting the absolute poses of the vision camera sequentially in time order and performing incremental pose trajectory reconstruction to obtain the absolute pose trajectory. In this embodiment, given the acquired camera poses $C_{0:n}$, the motion trajectory of the visual camera (i.e. the absolute pose trajectory) can be reconstructed incrementally by connecting adjacent poses in sequence, as sketched after these sub-steps.
And performing motion estimation on the visual camera based on the absolute pose track.
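A minimal sketch of the incremental pose chaining described in the sub-steps above, assuming the relative poses $T_k$ are already available as 4x4 NumPy arrays; the function name is an illustrative assumption.

```python
import numpy as np

def chain_poses(T_list, C0=np.eye(4)):
    """Accumulate relative poses T_1..T_n (4x4) into absolute poses C_k = C_{k-1} @ T_k."""
    poses = [C0]
    for T in T_list:
        poses.append(poses[-1] @ T)      # C_k = C_{k-1} T_k
    return poses
```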
According to one embodiment of the present invention, the step of acquiring absolute pose tracks of the visual camera with respect to capturing the initial frame road image based on the relative pose relationship further comprises:
extracting a local track in the absolute pose track to perform local track optimization; wherein, include:
extracting track segments containing the absolute pose of m visual cameras from the absolute pose track as local tracks;
and iteratively computing the minimum of the sum of squared 3D point-cloud reprojection errors of the road images corresponding to the m absolute poses in the local track (this is called windowed bundle adjustment (Bundle Adjustment, BA) because the adjustment is performed over a sub-window of m image frames), thereby completing the optimization of the local track.
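For reference, a standard form of the windowed bundle adjustment objective described above can be written as follows; the notation for the projection function g and the observed feature points p is an assumption for illustration, not taken from the patent text:

$$\{X_i, C_k\}^{*}=\arg\min_{X_i,\,C_k}\ \sum_{k=n-m+1}^{n}\ \sum_{i}\left\| p_i^{\,k}-g\!\left(X_i, C_k\right)\right\|^{2}$$

where $X_i$ are the 3D points observed in the window of the last m frames, $C_k$ are the corresponding absolute camera poses, $p_i^{\,k}$ is the observed image feature of point i in frame k, and $g(X_i, C_k)$ projects $X_i$ into frame k.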
According to one embodiment of the invention, the vision camera employs a monocular, binocular or depth camera.
According to one embodiment of the invention, if the vision camera adopts a monocular camera, the rotation matrix and the translation vector are obtained in a 2D-to-2D mode, or the rotation matrix and the translation vector are obtained in a 3D-to-2D mode;
if the rotation matrix and the translation vector are acquired in the 2D-to-2D manner, the two sets of corresponding features $f_{k-1}$ and $f_k$ at times k-1 and k are both specified in 2D image coordinates; in step S5, feature point matching is performed using the input images of two adjacent frames;
in step S6, the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result of the input images of the adjacent frames includes:
acquiring the essential matrix of the input images of the adjacent frames based on the feature point matching result;
decomposing the rotation matrix and the translation vector from the essential matrix;
if the rotation matrix and the translation vector are acquired in the 3D-to-2D manner, the feature $f_{k-1}$ at time k-1 is given in 3D coordinates while the feature $f_k$ at time k is given in 2D image coordinates; in step S5, feature point matching is performed using the input images of three adjacent frames, which includes:
triangularizing the first two frames of input images to obtain a 3D image;
performing feature point matching based on the 3D image and the third frame input image;
in step S6, in the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result, the rotation matrix and the translation vector are obtained with the PnP algorithm.
According to one embodiment of the invention, if the vision camera adopts a binocular camera or a depth camera, acquiring a rotation matrix and a translation vector in a 3D-to-3D manner, or acquiring the rotation matrix and the translation vector in a 3D-to-2D manner;
if the rotation matrix and the translation vector are acquired in the 3D-to-3D manner, the two sets of corresponding features $f_{k-1}$ and $f_k$ at times k-1 and k are both specified in 3D coordinates. In step S5, feature point matching is performed on the left-eye input images of two adjacent frames and on the right-eye input images of two adjacent frames, respectively; alternatively, the left-eye and right-eye input images at the same time are matched to generate 3D matching images, and feature point matching is then performed on the 3D matching images of adjacent frames;
in step S6, the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result of the input images of the adjacent frames includes:
triangularizing the matched characteristic points based on the characteristic point matching result;
acquiring 3D features based on the triangulated feature points, and calculating a rotation matrix and a translation vector based on the 3D features;
if the rotation matrix and the translation vector are acquired in a 3D-to-2D manner, step S5 includes:
triangularizing the left-eye and right-eye input images at the previous moment to obtain a 3D image;
performing feature point matching based on the 3D image and the left-eye or right-eye input image at the later moment;
in step S6, in the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result, the rotation matrix and the translation vector are obtained with the PnP algorithm.
To further illustrate the present invention, the workflow of the present invention is further exemplified.
Example 1
The vision camera is a monocular camera, and the rotation matrix and translation vector are acquired in the 2D-to-2D manner. The method specifically comprises the following steps:
1) Acquiring a new image frame $I_k$;
2) Extracting feature points from the input images of the adjacent frames $I_{k-1}$ and $I_k$;
3) Performing feature matching on the input images of the adjacent frames $I_{k-1}$ and $I_k$;
4) Computing the essential matrix between the input images of the adjacent frames $I_{k-1}$ and $I_k$; in this embodiment, the geometric relationship between the input images of adjacent frames established by feature matching is represented by the essential matrix E. The camera motion parameters contained in the essential matrix E carry an unknown scale factor on the translation t, and E can be written as

$$E_k \simeq \hat{t}_k R_k,\qquad \hat{t}_k=\begin{bmatrix}0 & -t_z & t_y\\ t_z & 0 & -t_x\\ -t_y & t_x & 0\end{bmatrix}$$

where the symbol $\simeq$ denotes equality up to multiplication by a scalar, and $(t_x, t_y, t_z)$ are the components of the translation vector $t_k$ along the three coordinate directions.
In this embodiment, the most important property for the subsequent motion estimation in the 2D-to-2D case is the epipolar constraint, which, for each feature point in one image, defines the line in the other image on which the corresponding feature point must lie, as shown in fig. 7. The epipolar constraint is expressed as

$$\bar{p}^{\,\prime\top} E\,\bar{p}=0$$

where $\bar{p}$ is the normalized image coordinate of a feature point p in image $I_k$ and $\bar{p}^{\,\prime}$ is the normalized image coordinate of the corresponding feature point $p^{\prime}$ in image $I_{k-1}$.
In this embodiment, the essential matrix is computed from the 2D-to-2D feature matches using the epipolar constraint. The minimal solution uses 5 of the 2D-to-2D correspondences; here, n ≥ 8 non-coplanar point correspondences are selected and the essential matrix is computed directly with the 8-point algorithm of Longuet-Higgins. Writing the stacked entries of E as the vector $\mathbf{e}=[e_1\ e_2\ e_3\ e_4\ e_5\ e_6\ e_7\ e_8\ e_9]^{\top}$, each pair of matched features with normalized coordinates $\bar{p}_i=(\bar{u}_i,\bar{v}_i,1)$ in $I_k$ and $\bar{p}^{\,\prime}_i=(\bar{u}^{\prime}_i,\bar{v}^{\prime}_i,1)$ in $I_{k-1}$ gives one constraint equation

$$\left[\bar{u}^{\prime}_i\bar{u}_i\ \ \bar{u}^{\prime}_i\bar{v}_i\ \ \bar{u}^{\prime}_i\ \ \bar{v}^{\prime}_i\bar{u}_i\ \ \bar{v}^{\prime}_i\bar{v}_i\ \ \bar{v}^{\prime}_i\ \ \bar{u}_i\ \ \bar{v}_i\ \ 1\right]\mathbf{e}=0.$$

Stacking the constraint equations of the n matched pairs yields the homogeneous linear system

$$A\,\mathbf{e}=0$$

where the i-th row of A is built as above from the image coordinates of the i-th pair of matched features. The essential matrix E is obtained by solving this homogeneous linear system with singular value decomposition (Singular Value Decomposition, SVD). If more than 8 points are used, the system is over-determined (there are more equations than unknowns); if the constraints it imposes are too strict for an exact solution to exist, a least-squares fit is performed and the least-squares solution is taken. Writing the SVD of the estimated matrix as $E=USV^{\top}$, a valid essential matrix must have singular values $\operatorname{diag}(S)=\{s,s,0\}$, i.e. the first and second singular values are equal and the third is zero. To obtain a valid essential matrix satisfying this constraint, the estimate is projected onto the space of valid essential matrices; the projected essential matrix is

$$\bar{E}=U\operatorname{diag}\{s,s,0\}V^{\top}.$$
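A minimal NumPy sketch of the 8-point estimation and the projection step described above, assuming the matched feature points are already given in normalized image coordinates; the function name is an illustrative assumption.

```python
import numpy as np

def essential_from_matches(pa, pb):
    """Estimate the essential matrix from >= 8 correspondences.

    pa, pb: (N, 2) arrays of normalized image coordinates of matched features in the
    two adjacent frames; returns E such that pb_hom^T E pa_hom = 0 for each pair.
    """
    u1, v1 = pa[:, 0], pa[:, 1]
    u2, v2 = pb[:, 0], pb[:, 1]
    # Each correspondence contributes one row of the homogeneous system A e = 0
    A = np.column_stack([u2 * u1, u2 * v1, u2,
                         v2 * u1, v2 * v1, v2,
                         u1, v1, np.ones_like(u1)])
    # Least-squares solution: right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the space of valid essential matrices (singular values {s, s, 0})
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt
```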
when the points in 3D space are coplanar, the 8-point algorithm scheme degrades. Accordingly, a 5-point algorithm may be applied to calculate the coplanar points. It should be noted that the 8-point algorithm is applicable to both calibrated (perspective or panoramic) cameras and non-calibrated cameras, and the 5-point algorithm is applicable only to calibrated (perspective or panoramic) cameras.
5) Decomposing the essential matrix into the rotation matrix $R_k$ and the translation vector $t_k$, and forming the relative pose relationship $T_k$ of the vision camera positions from $R_k$ and $t_k$; in this embodiment, the rotation matrix R and the translation vector t are extracted in this step from the essential matrix $\bar{E}$ estimated in the previous step. In general there are 4 different solutions of R, t for the same essential matrix; the correct pair of R and t can be identified by triangulating a single point and checking that it lies in front of both cameras. With the SVD $\bar{E}=USV^{\top}$, the solutions are

$$R=U\left(\pm W^{\top}\right)V^{\top},\qquad \hat{t}=U\left(\pm W\right)S\,U^{\top},\qquad W=\begin{bmatrix}0 & -1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\end{bmatrix}.$$
6) Computing the relative scale and rescaling the translation vector $t_k$ accordingly; in this embodiment, with the estimated R, t as initial values, the rotation and translation parameters can be refined with a nonlinear optimization method. On this basis, in order to reconstruct the trajectory of the image sequence, the different transformations $T_{0:n}$ have to be chained together. Since the absolute scale of the translation between two images cannot be computed from a monocular sequence, their relative scale has to be computed instead. One approach is to triangulate the 3D points $X_{k-1}$ and $X_k$ of two image subsets; the distance between any two 3D points can then be computed from their coordinates, and the corresponding scale is obtained from the ratio r of the distances between matching point pairs in $X_{k-1}$ and $X_k$:

$$r=\frac{\left\| X_{k-1,i}-X_{k-1,j} \right\|}{\left\| X_{k,i}-X_{k,j} \right\|}.$$

For robustness, many scale ratios can be computed and their mean used; if outliers occur, the median is taken instead. The translation vector $t_k$ is then rescaled with this distance ratio r. Computing the relative scale requires that features have been matched (or tracked) over at least three image frames.
7) Chaining the pose transformations between input image frames by computing $C_k = C_{k-1}T_k$ to obtain the absolute pose trajectory of the visual camera;
8) Repeating from step 1).
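A minimal OpenCV sketch of one iteration of this 2D-to-2D pipeline; the function name is an illustrative assumption, and the composition direction of the relative pose and the monocular scale handling are simplified and may need to be adapted to the conventions above.

```python
import cv2
import numpy as np

def monocular_vo_step(img_prev, img_curr, K, C_prev):
    """One 2D-to-2D step (Example 1); for a monocular camera the scale of t is arbitrary."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts_curr = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Steps 4-5: essential matrix with RANSAC, then decomposition into R, t
    E, mask = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=mask)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t.ravel()
    # Step 7: chain poses; depending on the chosen convention T may need to be inverted,
    # and t should be rescaled with the relative scale r of step 6.
    return C_prev @ T
```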
Example 2
The vision camera is a binocular camera or a depth camera, and the rotation matrix and translation vector are acquired in the 3D-to-3D manner. For 3D-to-3D feature correspondences, the camera motion $T_k$ can be computed from the rigid transformation that aligns the two sets of 3D features. The 3D-to-3D feature correspondence is only applicable to stereo vision. The method specifically comprises the following steps:
1) Acquiring the image pairs of two adjacent frames, $I_{l,k-1}$, $I_{r,k-1}$ and $I_{l,k}$, $I_{r,k}$;
2) Extracting feature points from the input images of the adjacent frames $I_{l,k-1}$, $I_{r,k-1}$ and $I_{l,k}$, $I_{r,k}$;
3) Performing feature matching on the input images of the adjacent frames $I_{l,k-1}$, $I_{r,k-1}$ and $I_{l,k}$, $I_{r,k}$;
4) Triangulating the matched features for each image pair;
5) Computing the rotation matrix $R_k$ and translation vector $t_k$ from the 3D features $X_{k-1}$ and $X_k$, forming the relative pose relationship of the vision camera positions from $R_k$ and $t_k$, and computing $T_k$; in this embodiment, the general approach to computing $T_k$ is to minimize the L2 distance between the two sets of 3D features:

$$\arg\min_{T_k}\sum_i \left\| \tilde{X}^{\,i}_{k} - T_k\,\tilde{X}^{\,i}_{k-1} \right\|$$

where i denotes the i-th feature and $\tilde{X}^{\,i}_{k}$, $\tilde{X}^{\,i}_{k-1}$ are the homogeneous coordinates of the 3D feature points $X^{\,i}_{k}$, $X^{\,i}_{k-1}$.
At least 3 pairs of 3D features $X_{k-1}$, $X_k$ are required to compute $T_k$. Specifically, the translation vector $t_k$ is computed as

$$t_k=\bar{X}_k - R_k\,\bar{X}_{k-1}$$

where the bar denotes the arithmetic mean (centroid) of each point set. The rotation matrix $R_k$ can be computed by singular value decomposition (Singular Value Decomposition, SVD) as

$$R_k=VU^{\top}$$

where $USV^{\top}$ is the SVD of the cross-covariance matrix of the two centred point sets.
since the transformation calculation of the 3D to 3D correspondence has an absolute scale, the trajectory of the image sequence can be obtained directly by connecting the individual transformation processes.
6) Chaining the pose transformations between image frames by computing $C_k = C_{k-1}T_k$;
7) Repeating from step 1).
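A minimal NumPy sketch of the 3D-to-3D alignment described above; the centroid/SVD construction follows the standard Arun-style solution that the formulas above correspond to, and the function name is an illustrative assumption.

```python
import numpy as np

def align_3d_3d(X_prev, X_curr):
    """Rigid alignment of two matched 3D point sets.

    X_prev, X_curr: (N, 3) arrays of triangulated points at times k-1 and k.
    Returns R_k, t_k such that X_curr ≈ R_k @ X_prev + t_k (per point).
    """
    c_prev = X_prev.mean(axis=0)
    c_curr = X_curr.mean(axis=0)
    # Cross-covariance of the centred point sets
    W = (X_prev - c_prev).T @ (X_curr - c_curr)
    U, _, Vt = np.linalg.svd(W)
    R = Vt.T @ U.T                       # R_k = V U^T
    if np.linalg.det(R) < 0:             # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_curr - R @ c_prev              # t_k = mean(X_k) - R_k mean(X_{k-1})
    return R, t
```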
Example 3
The vision camera is a monocular camera, a binocular camera or a depth camera, and the rotation matrix and translation vector are acquired in the 3D-to-2D manner. In this embodiment, the PnP (Perspective-n-Point) method is used to solve the 3D-to-2D point motion, for example P3P, which estimates the pose from 3 points, the direct linear transformation (Direct Linear Transformation, DLT), EPnP (Efficient PnP), UPnP and so on. Alternatively, a nonlinear optimization approach can be used to construct a least-squares problem and solve it iteratively, i.e. BA.
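A minimal OpenCV sketch of this 3D-to-2D step using EPnP inside RANSAC; the function name, variable names and parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def pnp_pose(obj_pts, img_pts, K):
    """Estimate R_k, t_k from 3D points triangulated earlier and their 2D matches.

    obj_pts: (N, 3) float32 3D points, img_pts: (N, 2) float32 image points, K: 3x3 intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None,               # no lens distortion assumed
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=2.0)
    if not ok:
        raise RuntimeError("PnP failed")
    R_k, _ = cv2.Rodrigues(rvec)                 # rotation vector -> rotation matrix
    # rvec/tvec map the 3D points into the current camera frame; invert the transform
    # if the camera pose expressed in the world frame is required.
    return R_k, tvec
```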
According to an embodiment of the present invention, a visual odometer device for the aforementioned visual odometer method of the present invention includes:
the visual camera is used for acquiring an external road image;
the vehicle target detection module is used for receiving the road image output by the visual camera, detecting a target vehicle and marking an anchor frame;
the image processing module is used for eliminating pixels in the anchor frame range in the road image and is used as an input image processed by the visual odometer;
the visual odometer module is used for matching input images of adjacent frames and outputting characteristic point matching results, and acquiring a rotation matrix and a translation vector between the input images of the adjacent frames based on the characteristic point matching results, so as to perform motion estimation on the visual camera.
As shown in fig. 1, according to one embodiment of the present invention, the vision camera is a monocular camera, a binocular camera, or a depth camera.
The foregoing is merely exemplary of embodiments of the invention and, as regards devices and arrangements not explicitly described in this disclosure, it should be understood that this can be done by general purpose devices and methods known in the art.
The above description is only one embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A visual odometer method of combating dynamic target disturbances, comprising:
s1, constructing a vehicle target detection network for detecting a vehicle;
s2, acquiring road images of adjacent frames output by a vision camera, respectively detecting target vehicles on the road images based on the vehicle target detection network, and respectively marking the identified target vehicles by adopting anchor frames;
s3, eliminating pixel contents in the anchor frame range in the road image respectively to serve as an input image processed by a visual odometer;
s4, the visual odometer acquires the input images of the adjacent frames, and adopts a feature extraction algorithm to extract feature points in the input images respectively;
s5, adopting a feature matching algorithm and carrying out feature point matching on the input images of the adjacent frames based on the feature points;
s6, based on the characteristic point matching result of the input images of the adjacent frames, acquiring a rotation matrix and a translation vector between the input images of the adjacent frames, and performing motion estimation on the vision camera.
2. The method according to claim 1, wherein in step S6, based on the feature point matching result of the input images of the adjacent frames, the step of obtaining a rotation matrix and a translation vector between the input images of the adjacent frames for motion estimation of the vision camera includes:
constructing a relative pose relationship when the visual camera captures the road image of the adjacent frame based on the rotation matrix and the translation vector;
acquiring absolute pose tracks of the visual camera when capturing an initial frame road image based on the relative pose relation;
and performing motion estimation on the visual camera based on the absolute pose track.
3. The visual odometry method of claim 2, wherein the step of obtaining absolute pose trajectories of the visual camera with respect to capturing an initial frame of road image based on the relative pose relationship comprises:
acquiring the relative pose change of the visual camera corresponding to the road image of the adjacent frame based on the relative pose relation;
acquiring an absolute pose of the visual camera when capturing the road image of each frame based on the pose of the visual camera when capturing the road image of an initial frame and based on the relative pose change;
and based on the absolute pose of the vision camera, sequentially connecting according to a time sequence, and performing incremental pose track reconstruction to acquire the absolute pose track.
4. A visual odometry method according to claim 3, wherein the step of obtaining absolute pose trajectories of the visual camera with respect to capturing an initial frame of road image based on the relative pose relationship further comprises:
extracting a local track in the absolute pose track to perform local track optimization; wherein, include:
extracting track segments containing m absolute pose of the vision camera from the absolute pose track as the local track;
and iteratively calculating the minimum value of the sum of the 3D point cloud re-projection error squares of the road image corresponding to the m absolute poses in the local track, and completing the optimization of the local track.
5. The visual odometer method of claim 4, wherein in the step of constructing a relative pose relationship for the visual camera capturing the road image of adjacent frames based on the rotation matrix and the translation vector, the relative pose relationship is expressed as:
$$T_{k,k-1}=\begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}$$

wherein $R_{k,k-1}$ is the rotation matrix from the image coordinates at time k-1 to the image coordinates at time k, and $t_{k,k-1}$ is the corresponding translation vector.
6. The visual odometry method of claim 5, wherein the visual camera is a monocular, binocular or depth camera.
7. The visual odometer method according to claim 6, wherein if the visual camera is a monocular camera, the rotation matrix and the translation vector are obtained in a 2D-to-2D manner or in a 3D-to-2D manner;
if the rotation matrix and the translation vector are obtained in the 2D-to-2D manner, in step S5, feature point matching is performed on the input images of two adjacent frames;
and in step S6, the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result comprises:
acquiring an essential matrix between the input images of the adjacent frames based on the feature point matching result;
and decomposing the essential matrix to obtain the rotation matrix and the translation vector;
if the rotation matrix and the translation vector are obtained in the 3D-to-2D manner, in step S5, feature point matching is performed using the input images of three adjacent frames, which comprises:
triangulating the input images of the first two frames to obtain a 3D image;
and performing feature point matching between the 3D image and the input image of the third frame;
and in step S6, in the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result, the rotation matrix and the translation vector are obtained with a PnP algorithm based on the feature point matching result. A sketch of the 3D-to-2D branch follows this claim.
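A minimal sketch of the 3D-to-2D branch described above, assuming OpenCV, correspondences of the same feature points tracked across the three adjacent frames, known intrinsics K, and an already-recovered pose (R12, t12) between the first two frames; the projection-matrix construction and all names are assumptions.

```python
import cv2
import numpy as np

def pose_from_three_frames(pts1, pts2, pts3, K, R12, t12):
    """3D-to-2D monocular case: triangulate frames 1-2, then solve PnP against frame 3.

    pts1, pts2, pts3: (N, 2) pixel coordinates of the same tracked feature points in
    three adjacent frames; (R12, t12): pose of frame 2 relative to frame 1.
    """
    # Projection matrices of the first two frames, with frame 1 as the reference.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R12, np.asarray(t12).reshape(3, 1)])

    # Triangulate the first two frames into the 3D structure ("3D image" in the claim).
    pts_h = cv2.triangulatePoints(P1, P2,
                                  pts1.astype(np.float64).T, pts2.astype(np.float64).T)
    pts_3d = (pts_h[:3] / pts_h[3]).T                        # N x 3 Euclidean points

    # Match the 3D structure against the third frame and solve PnP with RANSAC.
    _, rvec, tvec, _ = cv2.solvePnPRansac(pts_3d.astype(np.float32),
                                          pts3.astype(np.float32), K, None)
    R3, _ = cv2.Rodrigues(rvec)
    return R3, tvec
```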
8. The visual odometer method according to claim 6, wherein if the visual camera is a binocular camera or a depth camera, the rotation matrix and the translation vector are obtained in a 3D-to-3D manner or in a 3D-to-2D manner;
if the rotation matrix and the translation vector are obtained in the 3D-to-3D manner, in step S5, feature point matching is performed separately on the left-eye input images of the two adjacent frames and on the right-eye input images of the two adjacent frames; or, the left-eye and right-eye input images at the same moment are matched to generate 3D matching images, and feature point matching is performed on the 3D matching images of the adjacent frames;
and in step S6, the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result comprises:
triangulating the matched feature points based on the feature point matching result;
and acquiring 3D features from the triangulated feature points, and calculating the rotation matrix and the translation vector based on the 3D features;
if the rotation matrix and the translation vector are obtained in the 3D-to-2D manner, step S5 comprises:
triangulating the left-eye and right-eye input images at the earlier moment to obtain a 3D image;
and performing feature point matching between the 3D image and the left-eye or right-eye input image at the later moment;
and in step S6, in the step of obtaining the rotation matrix and the translation vector between the input images of the adjacent frames based on the feature point matching result, the rotation matrix and the translation vector are obtained with a PnP algorithm based on the feature point matching result. A sketch of the 3D-to-3D alignment follows this claim.
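A minimal sketch of the 3D-to-3D branch, assuming that stereo matching and triangulation have already produced two sets of corresponding 3D feature points for the adjacent frames; the rotation and translation are recovered here with the standard SVD-based (Kabsch) rigid alignment, which is one common way to realise this step and is not mandated by the claim.

```python
import numpy as np

def rigid_transform_3d(pts_prev, pts_curr):
    """Estimate R and t such that pts_curr ≈ R @ pts_prev + t (3D-to-3D case).

    pts_prev, pts_curr: (N, 3) corresponding triangulated feature points of adjacent frames.
    """
    c_prev = pts_prev.mean(axis=0)
    c_curr = pts_curr.mean(axis=0)

    # Cross-covariance of the centred point sets.
    H = (pts_prev - c_prev).T @ (pts_curr - c_curr)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against a reflection solution
        Vt[2, :] *= -1
        R = Vt.T @ U.T
    t = c_curr - R @ c_prev
    return R, t
```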
9. A visual odometer device employing the visual odometer method of any one of claims 1 to 8, comprising:
a visual camera for acquiring external road images;
a vehicle target detection module for receiving the road images output by the visual camera, detecting target vehicles, and marking anchor frames;
an image processing module for removing the pixels within the anchor frames from the road images to obtain the input images to be processed by the visual odometer;
and a visual odometer module for matching the input images of adjacent frames, outputting feature point matching results, and obtaining a rotation matrix and a translation vector between the input images of the adjacent frames based on the feature point matching results, so as to perform motion estimation of the visual camera. A structural sketch of these modules follows claim 10.
10. The visual odometer device of claim 9, wherein the visual camera is a monocular, binocular or depth camera.
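A minimal structural sketch of the device of claims 9 and 10, assuming the relative_motion helper from the earlier pipeline sketch, a camera object with a read() method, and a detector object with a detect() method; class and method names are illustrative, not part of the claims.

```python
class VisualOdometerDevice:
    """Structural sketch: visual camera, vehicle target detection, image processing, odometer."""

    def __init__(self, camera, vehicle_detector, K):
        self.camera = camera                          # visual camera: monocular, binocular or depth
        self.vehicle_detector = vehicle_detector      # vehicle target detection module (anchor frames)
        self.K = K                                    # camera intrinsics (assumption)
        self.prev_image = None
        self.prev_boxes = None

    def step(self):
        """Process one new frame and return the estimated relative motion (R, t), if available."""
        image = self.camera.read()
        boxes = self.vehicle_detector.detect(image)   # anchor frames around detected target vehicles
        motion = None
        if self.prev_image is not None:
            # Image processing + visual odometer modules (see the relative_motion sketch above).
            motion = relative_motion(self.prev_image, image, self.prev_boxes, boxes, self.K)
        self.prev_image, self.prev_boxes = image, boxes
        return motion
```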
CN202211171786.4A 2022-09-26 2022-09-26 Visual odometer method and device for resisting dynamic target interference Pending CN116151320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211171786.4A CN116151320A (en) 2022-09-26 2022-09-26 Visual odometer method and device for resisting dynamic target interference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211171786.4A CN116151320A (en) 2022-09-26 2022-09-26 Visual odometer method and device for resisting dynamic target interference

Publications (1)

Publication Number Publication Date
CN116151320A true CN116151320A (en) 2023-05-23

Family

ID=86353201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211171786.4A Pending CN116151320A (en) 2022-09-26 2022-09-26 Visual odometer method and device for resisting dynamic target interference

Country Status (1)

Country Link
CN (1) CN116151320A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116399350A (en) * 2023-05-26 2023-07-07 北京理工大学 Method for determining semi-direct method visual odometer fused with YOLOv5
CN116399350B (en) * 2023-05-26 2023-09-01 北京理工大学 Method for determining semi-direct method visual odometer fused with YOLOv5

Similar Documents

Publication Publication Date Title
EP3504682B1 (en) Simultaneous localization and mapping with an event camera
CN110285793B (en) Intelligent vehicle track measuring method based on binocular stereo vision system
Kneip et al. Robust real-time visual odometry with a single camera and an IMU
US10260862B2 (en) Pose estimation using sensors
Alcantarilla et al. On combining visual SLAM and dense scene flow to increase the robustness of localization and mapping in dynamic environments
US11064178B2 (en) Deep virtual stereo odometry
Clipp et al. Robust 6dof motion estimation for non-overlapping, multi-camera systems
US8305430B2 (en) System and method for multi-camera visual odometry
Honegger et al. Embedded real-time multi-baseline stereo
CN105043350A (en) Binocular vision measuring method
US20220051425A1 (en) Scale-aware monocular localization and mapping
CN106033614B (en) A kind of mobile camera motion object detection method under strong parallax
CN112802096A (en) Device and method for realizing real-time positioning and mapping
US11069071B1 (en) System and method for egomotion estimation
KR20140054710A (en) Apparatus and method for generating 3d map
CN111738032A (en) Vehicle driving information determination method and device and vehicle-mounted terminal
Burlacu et al. Obstacle detection in stereo sequences using multiple representations of the disparity map
Oreifej et al. Horizon constraint for unambiguous UAV navigation in planar scenes
CN116151320A (en) Visual odometer method and device for resisting dynamic target interference
CN108090930A (en) Barrier vision detection system and method based on binocular solid camera
Sheikh et al. Geodetic alignment of aerial video frames
Peng et al. Fast 3D map reconstruction using dense visual simultaneous localization and mapping based on unmanned aerial vehicle
Pagel Robust monocular egomotion estimation based on an iekf
Diskin et al. UAS exploitation by 3D reconstruction using monocular vision
Lee et al. Globally consistent video depth and pose estimation with efficient test-time training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination