CN117218210A - Binocular active vision semi-dense depth estimation method based on bionic eyes


Info

Publication number: CN117218210A
Application number: CN202311362069.4A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Pending
Inventors: 李恒宇, 刘峻玮, 刘靖逸, 刘军, 王曰英, 谢少荣, 罗均
Assignees: Jining University; University of Shanghai for Science and Technology
Application filed by Jining University and University of Shanghai for Science and Technology


Abstract

The binocular active vision semi-dense depth estimation method based on the bionic eye comprises the following steps: S100, acquiring images with a binocular bionic eye camera; S200, inputting the images acquired in S100 into a SLAM system, which tracks key frames and estimates the camera pose and landmark coordinates; S300, using the key frame poses obtained in S200 and the corresponding key frame image data, combining epipolar search and block matching with triangulation, and applying a depth filter to achieve semi-dense depth estimation. During block matching, NCC computations for image blocks whose gradients do not meet the requirements are skipped, so the method satisfies real-time requirements better than a dense direct method.

Description

Binocular active vision semi-dense depth estimation method based on bionic eyes
Technical Field
The invention belongs to the technical field of computer vision, relates to bionic eye motion control, camera pose and landmark point coordinate estimation optimization and semi-dense depth estimation, and particularly relates to a binocular active vision semi-dense depth estimation method based on a bionic eye.
Background
In the field of robotics, the ability of mobile robots to navigate autonomously in unknown environments has long been an attractive and vital research direction. SLAM technology provides a key solution for mobile robots to localize themselves in an unknown environment and build maps. Binocular and monocular passive vision SLAM currently receive much attention in the SLAM field. However, in outdoor and large-scale scenes with dynamic objects and unevenly distributed texture, conventional passive vision SLAM still has many limitations, which degrade the accuracy and robustness of the algorithm and can even cause it to fail. The reason is that the camera cannot actively select the scene it captures, and therefore the negative influence of low-texture regions and dynamic objects on the SLAM algorithm cannot be avoided.
Currently, mainstream feature-point visual SLAM estimates only the three-dimensional positions of sparse feature points, and the resulting map contains only sparse landmark points. While such sparse maps are adequate for relocalizing a robot within the map, their usefulness is limited for higher-level needs such as navigation, obstacle avoidance, and three-dimensional scene reconstruction. The ability to perform semi-dense depth estimation is therefore critical for improving map usability and pushing SLAM techniques toward practical application. Today, visual SLAM performs semi-dense or dense depth estimation mainly through camera characteristics, deep learning, or the direct method; however, most current robotic applications instead acquire the depth of every pixel directly with an RGB-D camera based on the TOF (Time of Flight) method or structured light, because an RGB-D camera is easy to use and measures depth directly in hardware without consuming large computing resources on estimation. However, dense mapping with RGB-D cameras is only applicable in certain specific scenes: RGB-D cameras do not work well outdoors or in large scenes, and their power consumption and cost are high.
Disclosure of Invention
The invention aims to provide a binocular active vision semi-dense depth estimation method based on bionic eyes.
Based on the above purpose, the invention adopts the following technical scheme:
a binocular active vision semi-dense depth estimation method based on bionic eyes comprises the following steps:
s100, acquiring images by using a binocular bionic eye camera;
s200, inputting the image acquired in the S100 into an SLAM system, tracking a key frame by using the SLAM system and estimating the pose of a camera and the coordinates of road points;
s300, utilizing the key frame pose obtained in S200 and the corresponding key frame image data thereof, combining polar line searching and block matching with triangulation, and utilizing a depth filter to realize semi-dense depth estimation, specifically comprising the following steps:
s301, selecting a frame entering the step S300 as a reference key frame, and supposing inverse depths corresponding to all pixel depths in image data corresponding to the reference key frame to meet an initial Gaussian distribution to provide an initial value for subsequent depth information fusion;
s302, inserting a new key frame, and determining the position of a projection point through polar line searching and block matching;
s303, calculating the depth of the triangulated three-dimensional point and uncertainty thereof according to the geometric relationship between the reference key frame and the current key frame;
S304, fusing Gaussian distribution of inverse depth corresponding to the current observation depth into the last estimation to establish the distribution of inverse depth after fusion; if the uncertainty is smaller than a certain threshold, the depth data is considered to be converged, the depth of the three-dimensional point is determined, and otherwise, the step S302 is returned to continue information fusion.
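By way of illustration only (not the claimed implementation), the following Python sketch shows how steps S301 to S304 behave as a per-pixel depth filter; the Gaussian inverse-depth state (mu, sigma2), the helper functions epipolar_block_match and triangulate_inverse_depth, and the convergence threshold are hypothetical stand-ins for steps S302 and S303.

```python
import numpy as np

def update_depth_filter(mu, sigma2, keyframes, pixel,
                        epipolar_block_match, triangulate_inverse_depth,
                        conv_thresh=1e-4):
    """Iterate S302-S304 for one reference-key-frame pixel.

    mu, sigma2                : current Gaussian estimate of the inverse depth (S301 initial value)
    keyframes                 : stream of (image, pose) tuples of new key frames
    epipolar_block_match      : hypothetical helper implementing S302
    triangulate_inverse_depth : hypothetical helper implementing S303
    """
    for image, pose in keyframes:
        match = epipolar_block_match(pixel, image, pose)        # S302
        if match is None:                                       # NCC below threshold: wait for next key frame
            continue
        mu_obs, sigma2_obs = triangulate_inverse_depth(pixel, match, pose)  # S303
        # S304: fuse the new observation by the product of two Gaussians
        mu = (sigma2_obs * mu + sigma2 * mu_obs) / (sigma2 + sigma2_obs)
        sigma2 = (sigma2 * sigma2_obs) / (sigma2 + sigma2_obs)
        if sigma2 < conv_thresh:                                # uncertainty below threshold: converged
            break
    return mu, sigma2
```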
The step S302 specifically includes:
let a pixel in the image corresponding to the reference key frame be u_1, and let the projection of the depth range of this pixel in the image corresponding to the current key frame be the epipolar line l_2; the image corresponding to the reference key frame is denoted I_k and the image corresponding to the current key frame I_t; the rotation matrices of the world coordinate system relative to the reference key frame and the current key frame are R_kW and R_tW respectively, the translation vectors of the world coordinate system relative to the reference key frame and the current key frame are t_kW and t_tW respectively, and the intrinsic matrix K_L of the left camera is obtained by calibration; the relation between a pixel of the reference key frame and the corresponding pixel of the current key frame in the pixel coordinate system is:
d_t K_L^{-1} u_t = R_tW R_kW^{-1} (d_k K_L^{-1} u_k - t_kW) + t_tW
where u_k is a pixel (in homogeneous pixel coordinates) in the reference key frame, d_k is the depth of the spatial point corresponding to that pixel, and d_t is the depth of the same point in the current key frame; when d_k and u_k are known, the position of u_t can be computed; when the two components of u_k are given increments du and dv, the corresponding increments du_t and dv_t of u_t can be computed, which yields the affine transformation, a locally linear relation between the coordinate transformations of the key frame and the current frame image:
[du_t, dv_t]^T = A [du, dv]^T
next, the rotation of the camera is taken into account in block matching via the affine transformation matrix:
the pixel block around u_1 is taken as A, and the pixel block on l_2 is denoted B; the correlation between the pixel block around u_1 and the pixel blocks on the epipolar line is compared using the de-meaned normalized cross-correlation; when the angle between the pixel-block gradient and the epipolar line is large, the reliability of block matching is low, so only pixel blocks whose pixel gradient lies along the epipolar direction, or within a certain threshold of that direction, are matched; according to the definition of NCC:
S(A, B)_NCC = Σ_{i,j} (A(i,j) - Ā)(B(i,j) - B̄) / sqrt( Σ_{i,j} (A(i,j) - Ā)² · Σ_{i,j} (B(i,j) - B̄)² )
after determining the step length, by traversing NCC values of the specified pixel block and the source pixel block on the corresponding polar line, a non-convex NCC distribution function can be obtained, and the point with the highest NCC is searched for as a matched pixel block of the source pixel block.
In step S302, if the highest NCC value found by block matching is still below the threshold, the epipolar search and block matching for that source pixel block in the image corresponding to the current key frame is considered invalid; matching and information-fusion convergence are carried out after a new key frame, together with the corresponding camera pose optimized by the SLAM system of S200, is entered.
In the step S303, the three-dimensional point depth is calculated by triangulation, and the calculation method is as follows:
two frames of images are denoted I_1 and I_2 respectively; I_1 is taken as the reference, and the transformation of I_2 relative to it is T; the camera optical centres are O_1 and O_2; a feature point p_1 in I_1 corresponds to a feature point p_2 in I_2; connecting the optical centre O_1 with the feature point p_1, and the optical centre O_2 with the feature point p_2, gives the rays O_1p_1 and O_2p_2; when the camera motion is not a pure rotation, the lines O_1p_1 and O_2p_2 intersect at a point P in the scene, which is the position in the three-dimensional scene of the map point corresponding to the two feature points p_1 and p_2;
let x_1 and x_2 be the normalized coordinates corresponding to the feature points p_1 and p_2; they satisfy the relation:
s_2 x_2 = s_1 R x_1 + t, where s_1 and s_2 are respectively the depths of the two feature points p_1 and p_2, and the rotation matrix R and translation vector t are known;
from a geometric perspective, one can search along the ray O_1p_1 for the three-dimensional point whose projection is close to p_2; in the same way, the search can be performed along the ray O_2p_2;
taking the computation of s_1 as an example, the above equation is transformed into an equation that involves only s_1: s_1 x_2^ R x_1 + x_2^ t = 0, where x_2^ denotes the skew-symmetric (cross-product) matrix of x_2; solving this equation directly yields s_1, and s_2 is obtained in the same way; at this point the depths of the two feature points are preliminarily determined;
owing to the influence of noise, O_1p_1 and O_2p_2 may fail to intersect, in which case the solution is obtained by the least-squares method.
In step S303, when new current key frames keep arriving, the corresponding pixel blocks are found by epipolar search and block matching and the depth is estimated by triangulation; the observed depth in the reference key frame is calculated and converted into an inverse depth, which also follows a Gaussian distribution:
P(d_obs^{-1}) = N(μ_obs, σ_obs²),
where σ_obs is obtained by combining the pixel error after epipolar search with the depth deviation calculated from the triangulation geometry.
In the step S100, a panoramic camera is adopted to perform fixation guidance on the binocular bionic eye camera, and the steps include:
s101, acquiring a panoramic image, and then detecting key scene information, including feature points and dynamic objects, in the panoramic image, which influences the execution of SLAM tasks;
s102, assigning values to pixel points in the panoramic image based on key scene information, and constructing a panoramic value image;
s103, respectively projecting the view field areas of the left bionic eye camera and the right bionic eye camera into the panoramic value image to obtain a left view field projection area and a right view field projection area; simultaneously controlling the left view field projection area and the right view field projection area to overlap; combining the left view field projection area and the right view field projection area to obtain a current binocular view field projection area;
S104, obtaining the value average of the current binocular field projection area in the panoramic value map, and comparing it with the value threshold: if the value average of the current binocular field projection area is greater than the value threshold, the current binocular field projection area is not moved, and the high-value images currently collected by the left and right bionic eye cameras are used as input to the SLAM system; if the value average of the current binocular field projection area is less than or equal to the value threshold, a target binocular field projection area whose value average is higher than the value threshold is searched for in the panoramic value map, and step S105 is then entered;
s105, respectively calculating the displacement needed by the left bionic eye camera and the right bionic eye camera according to the current binocular field projection area and the target binocular field projection area; and respectively moving the left bionic eye camera and the right bionic eye camera according to the displacement, and finally taking high-value images acquired by the left bionic eye camera and the right bionic eye camera in the target binocular visual field projection area as input of the SLAM system.
In the step S104,
a rectangle of the same size as the projection of the current bionic eye camera gaze area in the panoramic image gaze range is used as a sliding window; the search proceeds outward from the current gaze area with a certain step length s, and the value average V_m within each window is obtained from the values of the panoramic value map.
A value threshold V_th is defined. If the value average of the current gaze area is higher than the value threshold, the camera is not moved; otherwise, if the value average of the current gaze area is less than or equal to the value threshold, the gaze is transferred to the area whose value average is higher than V_th and which is closest to the current gaze area.
In the step S105, the rotational motion of the left bionic eye camera is represented by a rotation matrix R_LCLT and the rotational motion of the right camera by a rotation matrix R_RCRT. R_LCLT is solved as follows:
the centre pixel coordinate u_PT of the target gaze area in the panoramic value map is back-projected through the projection model onto the projection sphere of the panoramic camera; the projection point P_PT on the sphere is then transformed into the initial pose coordinate system O_LO of the left camera, giving the target optical-axis direction P_LOT of the left camera. The centre point of the pixel plane of the current gaze area of the left bionic eye camera is back-projected through the camera projection model onto the normalization plane and then transformed into the initial pose coordinate system O_LO of the left bionic eye camera, giving the current attitude P_LOC of the left camera in O_LO. Thus the left bionic eye camera needs to move from the current attitude P_LOC to the target attitude P_LOT.
The rotation axis between the current attitude P_LOC and the target attitude P_LOT is computed as:
ω_LCLT = (P_LOC × P_LOT) / ||P_LOC × P_LOT||;
from the definition of the dot product, the rotation angle θ_LCLT satisfies:
θ_LCLT = arccos( (P_LOC · P_LOT) / (||P_LOC|| ||P_LOT||) );
the rotation matrix R_LCLT is then obtained from the rotation angle and the rotation axis by the Rodrigues formula.
R_RCRT is solved in the same way.
By solving the rotation matrix of the left and right cameras, the rotation instruction which is required to be issued by the bionic eye control module to the left and right bionic eye control motors is obtained, so that the bionic eye cameras transfer the vision to a high-value area in the scene, and fixation guidance is realized.
The SLAM system adopts a SLAM algorithm improved on the ORB-SLAM2 framework and comprises four modules, namely a tracking thread, a local mapping thread, a loop closing thread and a global BA thread:
(1) Tracking of each frame is achieved by matching the feature points of the current frame against the local map; the camera pose is initialized using a motion model, PnP relocalization, or a reference key frame, the local map is tracked, the camera pose is estimated by minimizing the PnP reprojection error, and key frames are selected; these operations run as the tracking thread;
(2) Managing the local map and executing local BA optimization through a local map creation process;
(3) Large loops are detected with the DBoW2 library, and the accumulated camera-trajectory error is corrected by pose graph optimization, forming the loop closing thread;
(4) After the pose graph optimization is completed, a global BA thread is started to acquire the globally consistent camera pose and map structure.
In the SLAM system, if the current frame satisfies any one of the following conditions, the current frame is set as a key frame:
(1) The current frame is a first frame after the bionic eye sight control reaches the target gazing area;
(2) More than 13 frames have passed since the last global relocalization;
(3) The local mapping thread is in an idle state;
(4) More than 15 frames from the last set key frame;
(5) The translation distance between the current frame and the last set key frame exceeds a threshold t_th;
(6) The number of feature points successfully tracked in the current frame reaches more than 70;
(7) The number of feature points successfully tracked in the current frame is less than 85% of that in the reference key frame.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention is based on the visual mechanism of human eyes, when the characteristic points in the scene are unevenly distributed, the region with relatively dense characteristic points can be endowed with high value by constructing the panoramic value graph, so that the bionic eyes are guided to move to the region. The dense feature points are beneficial to pose estimation optimization of the bionic eye camera and construction of a global map, and stability of the SLAM system is improved. Meanwhile, a texture foundation is provided for the follow-up bionic eye camera to perform semi-dense depth estimation by utilizing block matching.
(2) According to the invention, interference of a dynamic object on camera state estimation optimization and semi-dense depth estimation is eliminated in the process of constructing the panoramic value map, if the dynamic object appears in the semi-dense depth estimation process, the block matching process is influenced, after the dynamic object moves, the pixel points which are originally matched on the corresponding polar line disappear, and then the wrong pixel blocks are matched, so that the matching failure or the depth estimation error is caused. Thus, the step of excluding dynamic objects provides better quality image data for subsequent semi-dense depth estimation.
(3) The invention provides an improved optimization algorithm for pose estimation and optimization in the process of relative pose transformation of the bionic eye camera, can well optimize the pose of the camera, the relative pose of the left bionic eye camera and the right bionic eye camera and the coordinates of road sign points through four threads, and provides a basis for the follow-up semi-dense depth estimation step.
(4) The invention exploits the foregoing advantages to perform semi-dense depth estimation, eliminating most of the factors that interfere with the depth estimation process and providing good data. Meanwhile, to estimate depth more accurately, the method imposes a constraint on the pixel-block gradient to exclude interference and optimizes the depth estimation procedure.
(5) Compared with dense direct-method depth estimation, the invention better satisfies real-time requirements. Dense direct methods mainly perform dense depth estimation by minimizing, within a non-convex optimization framework, a key-frame-based globally spatially regularized energy function; this involves computation for every pixel in the image, so real-time performance is difficult to achieve. In the block matching process of the invention, NCC computations for image blocks whose gradients do not meet the requirements are skipped, so the method satisfies real-time requirements better than a dense direct method.
(6) Compared with depth estimation using an RGB-D camera, the method can be applied well in outdoor and large-scale scenes with uneven texture distribution, while reducing power consumption and cost. In outdoor large scenes with uneven texture distribution, an RGB-D camera may be disturbed by its limited range, by illumination, or by infrared light emitted from other sensors. Second, without proper modulation, the simultaneous use of multiple RGB-D cameras may cause mutual interference, making it difficult to acquire accurate depth information. In addition, using multiple RGB-D cameras simultaneously usually entails higher cost and power consumption. Conventional binocular passive vision, although not limited in range, cannot avoid the influence of uneven texture distribution and dynamic objects on depth estimation. Therefore, by using binocular active vision the invention can be applied well in outdoor and large-scale scenes with uneven texture distribution, while reducing power consumption and cost.
Drawings
FIG. 1 is an overall flowchart of a binocular active visual semi-dense depth estimation method based on bionic eyes in an embodiment;
FIG. 2 is a flow chart of visual guidance of a binocular bionic eye camera using a panoramic camera in an embodiment;
FIG. 3 is a flow chart of tracking key frames and estimating camera pose and landmark coordinates using the SLAM system in an embodiment;
FIG. 4 is a flow chart of semi-dense depth estimation in an embodiment;
fig. 5 is a schematic diagram of triangulation.
Detailed Description
The present invention will be described in detail with reference to specific examples.
As shown in fig. 1, the embodiment provides a more specific binocular active vision semi-dense depth estimation method based on bionic eyes, which includes the following steps:
s100, performing visual guidance on the binocular bionic eye camera by using the panoramic camera, and acquiring high-value images by the binocular bionic eye camera, wherein the process is as follows, as shown in fig. 2:
s101, capturing surrounding environment information by adopting a panoramic camera, and detecting key information affecting SLAM operation precision and robustness in a panoramic image acquired by the panoramic camera, wherein the key information comprises characteristic points and dynamic objects:
in the panoramic image, feature points are extracted with the FAST key-point detection algorithm, so that the subsequent binocular active SLAM algorithm can extract the same feature points, the validity of the panoramic value map built from those feature points is guaranteed, and real-time performance of the algorithm is maintained. Because the panoramic value map only provides feature-point information for controlling the movement of the binocular bionic eyes, the BRIEF descriptor computation used when the SLAM algorithm extracts ORB feature points is omitted.
For a dynamic object that seriously damages the performance of the SLAM algorithm, it should be avoided as much as possible to appear in the input image of the SLAM algorithm, so that the dynamic object in the image is detected first. By using the existing dynamic object detection algorithm, potential dynamic objects such as pedestrians, animals and the like in the panoramic image or potential dynamic objects with negative effects on the SLAM algorithm can be detected, and the interference of the dynamic objects can be eliminated as much as possible.
S102, evaluating each region in the panoramic image based on the key scene information to determine the value of each region to the SLAM algorithm, assigning values to pixel points in the panoramic image, and constructing a panoramic value image:
according to the feature points extracted from the panoramic image, a high value is assigned to each feature point and to the pixels in its surrounding neighborhood; the value decreases gradually as the neighborhood radius increases. If the same pixel is assigned several times in a region where feature points are concentrated, the pixel takes the maximum value assigned to it.
Further, all pixels in the dynamic object detection target frame are endowed with low value, and even the value of the pixels is lower than that of the area where no characteristic point is detected, so that the influence of the dynamic object on the performance of the subsequent SLAM algorithm is eliminated.
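As an illustrative sketch of S102 (the decay law, the concrete values and the function name build_value_map are assumptions, since the text only specifies a high value around feature points that decays with radius, the maximum on repeated assignment, and a low value inside dynamic-object boxes):

```python
import numpy as np

def build_value_map(h, w, keypoints, dyn_boxes,
                    v_peak=1.0, v_floor=0.1, v_dyn=0.0, radius=15):
    """keypoints: iterable of (u, v) pixel coords; dyn_boxes: iterable of (u0, v0, u1, v1)."""
    value = np.full((h, w), v_floor, dtype=np.float32)   # featureless background value
    ys, xs = np.mgrid[0:h, 0:w]
    for (u, v) in keypoints:
        dist = np.sqrt((xs - u) ** 2 + (ys - v) ** 2)
        contrib = np.clip(v_peak * (1.0 - dist / radius), 0.0, v_peak)  # value decays with radius
        value = np.maximum(value, contrib)                # overlapping assignments keep the maximum
    for (u0, v0, u1, v1) in dyn_boxes:                    # dynamic objects get the lowest value
        value[v0:v1, u0:u1] = v_dyn
    return value
```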
S103, respectively projecting the view field areas of the left bionic eye camera and the right bionic eye camera into the panoramic value image to obtain a left view field projection area and a right view field projection area; simultaneously controlling the left view field projection area and the right view field projection area to overlap; combining the left view field projection area and the right view field projection area to obtain a current binocular view field projection area:
here, the pose of the panoramic camera relative to the initial pose coordinate system of the left camera is denoted T_LOP, and its pose relative to the initial pose coordinate system of the right camera is denoted T_ROP; both transformation matrices can be determined by calibration. From the feedback of the bionic eye control module, the poses T_LOL and T_ROR of the left and right bionic eyes relative to their initial poses are available in real time. The normalized-plane projection equation of the two bionic eye cameras and the projection equation from the panoramic camera sphere to its pixel plane are defined by the respective camera models. Taking the left camera as an example, the four corner pixels u_L(0,0), u_L(m,0), u_L(m,n) and u_L(0,n) of the left bionic eye camera (where m and n are the maximum pixel coordinates along the width W and height H of the bionic eye camera pixel plane, respectively) are back-projected onto its normalization plane P_L.
This yields the projections P_L(0,0), P_L(m,0), P_L(m,n) and P_L(0,n) of the four corner pixels of the left bionic eye camera on the normalization plane P_L; the right bionic eye camera is handled in the same way.
Further, the four corner projection points on the normalization plane of the left camera are transformed into the coordinate system of the panoramic camera.
Because the bionic eye camera has only two degrees of freedom, left-right rotation and up-down pitching, its current pose coordinate system coincides in origin with its initial pose coordinate system, and the relative position between the panoramic camera origin and the bionic eye camera origin remains unchanged; the translation vector t_PL0 of the left bionic eye camera relative to the panoramic camera can therefore be obtained from the calibrated T_LOP. Connecting t_PL0 with each of the four points P_L(0,0), P_L(m,0), P_L(m,n) and P_L(0,n) gives four line equations L_P(0,0), L_P(m,0), L_P(m,n) and L_P(0,n), each of the form:
(x - x_PL0)/(X_P - x_PL0) = (y - y_PL0)/(Y_P - y_PL0) = (z - z_PL0)/(Z_P - z_PL0) (3)
where x_PL0, y_PL0 and z_PL0 are the x-, y- and z-coordinates of the left bionic eye camera origin transformed into the panoramic camera coordinate system, and (X_P, Y_P, Z_P) is the corresponding corner projection point expressed in the panoramic camera coordinate system.
The normalized spherical equation of the panoramic camera spherical projection is:
x^2 + y^2 + z^2 = 1 (4),
Solving (3) and (4) simultaneously gives the intersection points of the four lines with the projection sphere of the panoramic camera. Each line intersects the sphere at two points, but because the motion range of the bionic eye camera is limited, scenes with x < 0 in the panoramic camera coordinate system cannot be seen; the solutions with x < 0 are therefore discarded and the intersections with x > 0 retained. The intersections of the four lines with the projection sphere of the panoramic camera are thus P_P1, P_P2, P_P3 and P_P4.
Further, by the projection equation of the panoramic camera from the sphere to the pixel plane, the four intersection points are projected onto the pixel plane of the panoramic camera.
the four projection points are connected, and due to the distortion problem in the projection process of the panoramic camera, the connection is followed by a trapezoid, and the trapezoid is the projection of the left bionic eye camera in the panoramic image. And the projection of the right bionic eye camera in the panoramic image can be obtained in the same way. According to the mechanical structure limitation of the bionic eye camera, the movement range which can be covered by the bionic eye camera in the panoramic image can be obtained. Meanwhile, in order to enable the photographed images to be located in the movement ranges of the left and right bionic eye cameras, the movable ranges of the left and right bionic eye cameras need to be further limited. It should be noted that, in order to facilitate calculation of the panoramic value algorithm, the integrated gaze area projection is approximated to its maximum bounding rectangle at this stage; the purpose of this is to simplify the calculation process, approximating the complex irregular shape as a rectangle, so that it is more convenient to handle in the subsequent algorithm.
S104, obtaining the value average of the current binocular field projection area in the panoramic value map, and comparing it with the value threshold: if the value average of the current binocular field projection area is greater than the value threshold, the current binocular field projection area is not moved, and the high-value images currently collected by the left and right bionic eye cameras are used as input to the SLAM system; if the value average of the current binocular field projection area is less than or equal to the value threshold, a target binocular field projection area whose value average is higher than the value threshold is searched for in the panoramic value map, and the process then proceeds to step S105.
A rectangle of the same size as the projection of the current bionic eye camera gaze area is used as a sliding window within the panoramic image gaze range; searching proceeds outward from the current gaze area with a certain step length s, and, combined with the panoramic value map, the value average V_m within each window is obtained. A value threshold V_th is defined here in order to reduce the effect of jitter produced by camera rotation on the SLAM algorithm. If the value average of the current gaze area is above the value threshold (i.e. V_m > V_th), the current gaze area already meets the requirements for good SLAM operation, and the camera is not moved. Conversely, if V_m ≤ V_th, the number of extractable feature points in the current scene is insufficient or the scene contains dynamic objects; to improve the running quality of the SLAM algorithm, the gaze is transferred to the area whose value average exceeds V_th and which is closest to the current gaze area.
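An illustrative sketch of the sliding-window search in S104; the window size, the step s, the threshold V_th and the near-to-far ordering by window-origin distance are assumptions made for the example.

```python
import numpy as np

def find_target_gaze(value_map, cur_win, step, v_th):
    """cur_win = (u0, v0, w, h) of the current gaze projection in the value map.
    Returns None if the current window already suffices, else the nearest window
    whose value average exceeds v_th (or None if no such window exists)."""
    u0, v0, w, h = cur_win
    H, W = value_map.shape
    if value_map[v0:v0 + h, u0:u0 + w].mean() > v_th:      # V_m > V_th: do not move the camera
        return None
    candidates = []
    for v in range(0, H - h + 1, step):
        for u in range(0, W - w + 1, step):
            v_m = value_map[v:v + h, u:u + w].mean()
            if v_m > v_th:
                dist = np.hypot(u - u0, v - v0)            # distance to the current window
                candidates.append((dist, (u, v, w, h)))
    return min(candidates)[1] if candidates else None
```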
S105, respectively calculating the displacement needed by the left bionic eye camera and the right bionic eye camera according to the current binocular field projection area and the target binocular field projection area; and respectively moving the left bionic eye camera and the right bionic eye camera according to the displacement, and finally taking high-value images acquired by the left bionic eye camera and the right bionic eye camera in the target binocular visual field projection area as input of the SLAM system.
As described above, the bionic eye has two degrees of freedom, left-right rotation and up-down pitching, and performs only pure rotation relative to the robot platform, so the rotational motion of the left bionic eye camera can be expressed as a rotation matrix R_LCLT and the rotational motion of the right camera as a rotation matrix R_RCRT.
Taking the left camera as an example, the centre pixel coordinate u_PT of the target gaze area in the panoramic value map is back-projected through the projection model onto the projection sphere of the panoramic camera; the projection point P_PT on the sphere is then transformed into the initial pose coordinate system O_LO of the left camera, giving the target optical-axis direction P_LOT of the left camera.
Further, the centre point of the pixel plane of the current gaze area of the left bionic eye camera is back-projected through the camera projection model onto the normalization plane and then transformed into the initial pose coordinate system O_LO of the left bionic eye camera, giving the current attitude P_LOC of the left camera in O_LO.
Thus, the left bionic eye camera needs to move from the current attitude P_LOC to the target attitude P_LOT.
Further, the rotation matrix R_LCLT between the current attitude P_LOC and the target attitude P_LOT of the left bionic eye camera is computed; the rotation vector between the two attitude vectors is first calculated from them.
The rotation axis between the two vectors is determined as:
ω_LCLT = (P_LOC × P_LOT) / ||P_LOC × P_LOT||;
since the rotation axis of a rotation vector is a unit vector, the cross product is normalized in the above equation. From the definition of the dot product, the rotation angle θ_LCLT is:
θ_LCLT = arccos( (P_LOC · P_LOT) / (||P_LOC|| ||P_LOT||) );
the rotation matrix R_LCLT is then obtained from the rotation angle and the rotation axis by the Rodrigues formula.
By solving the rotation matrix R_LCLT of the left camera, the rotation command that the bionic eye control module needs to issue to the left bionic eye control motor is obtained, so that the bionic eye camera transfers its gaze to a high-value area in the scene, achieving the purpose of active vision. Similarly, the rotation matrix of the right camera can be obtained to control the rotation of the right camera and improve the operating quality of the SLAM algorithm.
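A sketch of the axis-angle construction above, assuming P_LOC and P_LOT are 3-vectors expressed in the left camera initial pose coordinate system O_LO; the function name and the handling of the degenerate parallel case are illustrative assumptions.

```python
import numpy as np

def gaze_rotation(p_loc, p_lot):
    """Rotation matrix taking the current optical-axis direction p_loc to the target p_lot."""
    a = p_loc / np.linalg.norm(p_loc)
    b = p_lot / np.linalg.norm(p_lot)
    axis = np.cross(a, b)
    n = np.linalg.norm(axis)
    if n < 1e-12:                       # directions already aligned: no well-defined axis
        return np.eye(3)
    axis = axis / n                     # unit rotation axis from the cross product
    theta = np.arccos(np.clip(a @ b, -1.0, 1.0))   # rotation angle from the dot product
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])        # skew-symmetric matrix of the axis
    # Rodrigues formula
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)
```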
S200, inputting the high-value image acquired in the S100 into a SLAM algorithm, and estimating the pose of the camera and the coordinates of the landmark points by using the SLAM algorithm.
As shown in FIG. 3, the embodiment adopts an improved SLAM algorithm based on an ORB-SLAM2 frame, so that the left and right bionic eye cameras can still keep stable running when relative movement occurs, and the method has higher accuracy in the aspects of pose and landmark point coordinate estimation of the cameras.
The SLAM algorithm comprises four modules, namely a tracking thread, a local mapping thread, a loop closing thread and a global BA thread, and specifically comprises the following steps:
S201, tracking thread: estimating the camera pose and selecting key frames:
First, an initial local map is constructed: after the images acquired by the left and right bionic eye cameras are input into the SLAM algorithm framework, the ORB feature points of the two images are extracted in a preprocessing module; all subsequent algorithm modules operate on the extracted ORB feature points, so the representation of an image is converted from a set of pixels to a set of ORB feature points, reducing the amount of data cached at run time. According to the pose T_L0R0 between the left and right bionic eyes obtained by calibration, an initial local map is then constructed from binocular parallax.
Secondly, the left camera is set as the main camera; ORB feature points are extracted from the preprocessed images input by the left camera, and the pose of the left camera is initialized using a motion model, PnP relocalization, or a reference key frame. The precision of this initial pose estimate is low, so the local map is tracked to optimize the pose estimate further. ORB feature points matching the current left-camera frame are searched in the local map, and a PnP problem minimizing the reprojection error is then constructed and solved to optimize the pose of the left camera.
For the error term defined in the nonlinear optimization:
e_i = u_i - (1/s_i) K_L T P_i
where u_i represents the pixel coordinates of the observed point, P_i the three-dimensional coordinates of the corresponding point P in the local map, s_i its depth in the camera frame, and e_i the reprojection error. Taking the camera pose as an example, constructing the least-squares problem requires obtaining the optimal camera pose:
T* = argmin_T (1/2) Σ_i || u_i - (1/s_i) K_L T P_i ||²
Using the Lie algebra, this least-squares problem can be written as an unconstrained optimization problem, which is convenient to solve with an optimization algorithm. In iterative optimization, the derivative of the reprojection error with respect to the camera pose, i.e. the Jacobian matrix required to linearize the objective function, is the 2x6 matrix
∂e/∂δξ = - [ f_x/Z', 0, -f_x X'/Z'², -f_x X'Y'/Z'², f_x + f_x X'²/Z'², -f_x Y'/Z' ;
0, f_y/Z', -f_y Y'/Z'², -f_y - f_y Y'²/Z'², f_y X'Y'/Z'², f_y X'/Z' ]
where (X', Y', Z') = T P_i is the point expressed in the camera coordinate system and f_x, f_y are the focal-length parameters of K_L. From this, the update of the left camera pose can be solved.
Further, the pose estimation task of the tracking thread also needs to estimate the relative pose of the left and right cameras. The poses T_LOL and T_ROR of the left and right bionic eyes relative to their initial pose coordinate systems are obtained during the motion control of the bionic eyes, from which the initial value T_LR of the relative pose between the left and right bionic eyes is calculated.
After the pose estimation of the left camera is completed, matches between it and the landmark points in the local map have been successfully established. At this point, based on the ORB feature-point matches between the left and right cameras and the matches between ORB feature points and landmark points in the left camera, a PnP problem is constructed that minimizes the feature-point reprojection error between the left and right cameras to obtain the optimized relative pose T_LR of the left and right cameras. The current pose of the left camera and the relative pose of the left and right cameras are thus preliminarily optimized. Using the optimized relative pose transformation matrix T_LR of the left and right cameras and the well-matched feature points obtained in the images, triangulation yields a large number of high-quality binocular-matched landmark points. These high-quality landmark points supplement the landmark points in the local map, improving the accuracy and robustness of the tracking thread.
Finally, key frames are selected to serve the subsequent threads. The key frame selected here refers to all the feature points of the left camera image of a frame that meets the requirements; the image data corresponding to the key frame does not itself participate in the SLAM algorithm but enters the semi-dense depth estimation module. The current frame is set as a key frame if it satisfies any one of the following conditions:
(1) The current frame is a first frame after the bionic eye sight control reaches the target gazing area;
(2) More than 13 frames have passed since the last global relocalization;
(3) The local mapping thread is in an idle state;
(4) More than 15 frames from the last set key frame;
(5) The translation distance between the current frame and the last set key frame exceeds a threshold t_th;
(6) The number of feature points successfully tracked in the current frame reaches more than 70;
(7) The number of feature points successfully tracked in the current frame is less than 85% of that in the reference key frame.
And providing a precondition for the subsequent threads through screening of the key frames.
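The key-frame decision can be summarised as a simple predicate over per-frame statistics, as in the illustrative sketch below; the FrameStats structure and its field names are hypothetical, while the numerical conditions mirror the list above.

```python
from dataclasses import dataclass

@dataclass
class FrameStats:
    first_after_gaze_shift: bool    # condition (1)
    frames_since_reloc: int         # condition (2)
    local_mapping_idle: bool        # condition (3)
    frames_since_keyframe: int      # condition (4)
    translation_to_last_kf: float   # condition (5)
    tracked_points: int             # conditions (6)/(7)
    ref_kf_points: int

def is_keyframe(s: FrameStats, t_th: float) -> bool:
    return (s.first_after_gaze_shift
            or s.frames_since_reloc > 13
            or s.local_mapping_idle
            or s.frames_since_keyframe > 15
            or s.translation_to_last_kf > t_th
            or s.tracked_points > 70
            or s.tracked_points < 0.85 * s.ref_kf_points)
```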
S202, local mapping thread: manages the local map and performs local BA optimization, the purpose being to update and cull key frames and landmark points in the local map and to optimize the camera poses of the key frames and the coordinates of the landmark points.
Firstly, updating key frames; in the tracking thread, once judging that the current frame meets the condition of being set as a key frame, associating the new key frame with the previous key frame; this association process is based on the co-view relationship between the new key frame and the previous key frame to the landmark points and calculates the bag of words expression of the new key frame based on the DBoW bag of words library.
Meanwhile, updating and maintaining the road mark points; if a waypoint is to remain in the local map, it must satisfy two conditions in the first three keyframes after it is created:
(1) It must be successfully tracked in more than 25% of the frames in which it is predicted, from the pose, to be visible;
(2) If more than one key frame is newly added after the landmark point is created, the landmark point must be observed by not less than three key frames.
In the local mapping thread, local BA optimization essentially adds the coordinates of the landmark points as parameters to the optimization; the approach is analogous to optimizing the camera pose by minimizing the PnP reprojection error in the tracking thread. By the chain rule, the derivative of the reprojection error with respect to the coordinates of a landmark point is:
∂e/∂P = (∂e/∂P') · (∂P'/∂P) = (∂e/∂P') R
where P' = RP + t is the landmark point expressed in the camera coordinate system. From this the update of the landmark coordinates can be obtained, and when the iteration proceeds until the error of the objective function converges, the optimal key-frame camera poses and landmark coordinates in the local map are obtained.
Further, in order to prevent the data scale in the local BA from gradually increasing to influence the real-time performance of the local mapping process, key frames are removed; when 90% of the waypoints in one key frame can be observed in at least three other key frames, the key frame is rejected, but its image data and its pose still enter the semi-dense depth estimation module.
S203, loop closing thread: the large loop is detected by using the DBoW2 library, and the camera track accumulated error is corrected by executing Pose Graph (Pose Graph) optimization, so that a loop closing thread is formed. The main task of this thread is to eliminate the accumulated errors generated by the tracking thread and the local mapping thread considering only the neighboring time key frames.
Firstly, in the loop detection stage, the system detects possible loops and finds candidate loop key frames; the relative pose between the current key frame and a candidate loop key frame is then computed, and the loop is closed by means of the co-visibility relationship between them. The loop is corrected using the computed relative pose of the current key frame and the loop key frame, and pose graph (Pose Graph) optimization is performed. Pose graph optimization does not optimize the landmark points but only the trajectory, i.e. a graph optimization containing only the trajectory is constructed. Suppose there are two key frames K_i and K_j whose camera poses before loop correction are T_Wi and T_Wj; after the camera poses are corrected in loop detection, the relative pose T_ij between the two frames is obtained, from which an error e_ij is constructed:
e_ij = ln( T_ij^{-1} T_Wi^{-1} T_Wj )^∨
Based on this error, a nonlinear optimization is constructed; solving the derivatives of the error term with respect to T_Wi and T_Wj for linearization achieves the optimization of the pose graph.
S204, global BA threads. After the pose graph optimization is completed, a global BA thread is started. The main task of the thread is to optimize the final accurate track and map by using BA after no new data is input, and obtain the globally consistent camera pose and map structure.
After the loop closing thread completes the optimization of the pose graph, global optimization is performed to obtain a globally optimal solution in the whole process; merging the updated key frame camera pose with the non-updated key frame camera pose through a spanning tree after the optimization is finished, and correcting coordinates of the landmark points according to the updating of the reference key frame camera pose; if a new loop is detected in the global BA optimization process, the thread is terminated, and the process is started again after the pose graph optimization is finished.
S300, inputting the optimized key frame pose obtained in the S200 and the corresponding key frame image data into a semi-dense depth estimation module, wherein the semi-dense depth estimation module realizes semi-dense depth estimation by combining epipolar search and block matching with triangulation and then utilizing a depth filter, as shown in fig. 4, and further realizes the construction of a semi-dense map, and the steps comprise:
s301, selecting one frame entering the semi-dense depth estimation module as a reference key frame, and assuming inverse depths corresponding to all pixel depths in image data corresponding to the reference key frame to meet an initial Gaussian distribution. The purpose is to provide an initial value for subsequent depth information fusion.
Taking the image data corresponding to a reference key frame as an example, the inverse depth value of a pixel is denoted d^{-1} and obeys a Gaussian distribution:
P(d^{-1}) = N(μ, σ²) (17),
where the initial mean and variance of the inverse depth distribution have some effect on the convergence speed.
S302, when the optimized new key frame pose and the corresponding image data enter a semi-dense depth estimation module, polar line searching is carried out by utilizing the optimized key frame pose, and block matching is further carried out to determine the position of the projection point.
Let a pixel in the image corresponding to the reference key frame be u_1, and let the projection of the depth range of this pixel in the image corresponding to the current key frame be the epipolar line l_2. The image corresponding to the reference key frame is denoted I_k and the image corresponding to the current key frame I_t. The rotation matrices of the world coordinate system relative to the reference key frame and the current key frame are R_kW and R_tW respectively, and the translation vectors of the world coordinate system relative to the reference key frame and the current key frame are t_kW and t_tW respectively; these are obtained from the SLAM algorithm described above. Meanwhile, the intrinsic matrix K_L of the left camera is obtained from the calibration. The relation between a pixel of the reference key frame and the corresponding pixel of the current key frame in the pixel coordinate system is:
d_t K_L^{-1} u_t = R_tW R_kW^{-1} (d_k K_L^{-1} u_k - t_kW) + t_tW
where u_k is a pixel (in homogeneous pixel coordinates) in the reference key frame, d_k is the depth of the spatial point corresponding to that pixel, and d_t is the depth of the same point in the current key frame. When d_k and u_k are known, the position of u_t can be computed; when the two components of u_k are given increments du and dv, the corresponding increments du_t and dv_t of u_t can be computed, which yields the affine transformation, a locally linear relation between the coordinate transformations of the key frame and the current frame image:
[du_t, dv_t]^T = A [du, dv]^T
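An illustrative sketch of the projection relation above: given a reference pixel u_k and a candidate depth d_k, it computes the corresponding pixel u_t in the current key frame; sweeping d_k over the assumed depth range of the pixel traces out the epipolar segment l_2. The function name is an assumption.

```python
import numpy as np

def project_to_current(u_k, d_k, K_L, R_kW, t_kW, R_tW, t_tW):
    """Project reference-frame pixel u_k (2-vector) at depth d_k into the current key frame."""
    K_inv = np.linalg.inv(K_L)
    u_k_h = np.array([u_k[0], u_k[1], 1.0])           # homogeneous pixel coordinates
    P_w = R_kW.T @ (d_k * (K_inv @ u_k_h) - t_kW)     # spatial point in the world frame
    P_t = R_tW @ P_w + t_tW                           # point in the current camera frame
    u_t_h = K_L @ (P_t / P_t[2])                      # perspective projection
    return u_t_h[:2]
```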
then, rotation of the camera is considered into block matching according to the affine transformation matrix. In the feature matching method, it is necessary to calculate matching between features from descriptors, but at this time, each pixel cannot be regarded as a feature point calculation descriptor. Therefore, a means of comparing luminance values is required. Since the brightness of individual pixels is hardly distinguishable, it is necessary to compare the brightness of blocks of pixels, i.e. in a block matching manner.
The pixel block around u_1 is taken as A, and the pixel block on l_2 is denoted B. The correlation between the pixel block around u_1 and the pixel blocks on the epipolar line is compared using the de-meaned normalized cross-correlation (Normalized Cross Correlation); removing the mean allows the overall brightness of a pixel block to vary somewhat, which makes this measure more reliable than plain normalized cross-correlation.
When the angle between the pixel-block gradient and the epipolar line is large, the reliability of block matching is low, so only pixel blocks whose pixel gradient lies along the epipolar direction, or within a certain threshold of that direction, are matched. According to the definition of NCC:
S(A, B)_NCC = Σ_{i,j} (A(i,j) - Ā)(B(i,j) - B̄) / sqrt( Σ_{i,j} (A(i,j) - Ā)² · Σ_{i,j} (B(i,j) - B̄)² )
After determining the step size, a non-convex NCC distribution function is obtained by traversing the NCC values between the candidate pixel blocks on the corresponding epipolar line and the source pixel block, and the point with the highest NCC is taken as the matched pixel block of the source pixel block. If even the highest matching value is below the threshold, the epipolar search and block matching for this source pixel block in the image corresponding to the current key frame is considered invalid, and matching and information fusion toward convergence are carried out after new image and pose data are obtained.
Restricting the pixel-block gradient effectively reduces the amount of de-meaned normalized cross-correlation computation during matching and the mismatching of pixel blocks, so that a more accurate depth is obtained after triangulation. Without this restriction, the amount of computation increases and the uncertainty of block matching grows; failing to obtain a valid match may produce an erroneous depth and affect the subsequent information-fusion process.
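For illustration, the sketch below computes the de-meaned NCC and applies the gradient restriction by skipping the NCC search entirely when the source block's gradient direction deviates too far from the epipolar direction; the gradient estimate, the angular threshold and the NCC threshold are assumptions (one possible reading of the restriction described above).

```python
import numpy as np

def zncc(A, B):
    """De-meaned (zero-mean) normalized cross-correlation of two equal-size pixel blocks."""
    a = A - A.mean()
    b = B - B.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else -1.0

def gradient_ok(block, epipolar_dir, max_angle_deg=45.0):
    """Accept a block only if its mean gradient is close to the epipolar direction."""
    gy, gx = np.gradient(block.astype(np.float32))
    g = np.array([gx.mean(), gy.mean()])
    if np.linalg.norm(g) < 1e-6:
        return False                              # textureless block: skip NCC entirely
    cosang = abs(g @ epipolar_dir) / (np.linalg.norm(g) * np.linalg.norm(epipolar_dir))
    return np.degrees(np.arccos(np.clip(cosang, 0.0, 1.0))) <= max_angle_deg

def best_match(src_block, candidate_blocks, epipolar_dir, ncc_thresh=0.85):
    """Return the index of the best candidate on the epipolar line, or None if matching is skipped or fails."""
    if not gradient_ok(src_block, epipolar_dir):
        return None                               # gradient does not meet the requirement: NCC filtered out
    scores = [zncc(src_block, c) for c in candidate_blocks]
    best = int(np.argmax(scores))
    return best if scores[best] >= ncc_thresh else None
```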
Further, the estimated depth in the reference key frame obtained by block matching and triangulation of the images corresponding to the different current key frames and the reference key frame may be different, however, the coordinates corresponding to a point in three-dimensional space in reality are determined. It is therefore necessary to introduce a depth filter, describing the depth values by probability distribution.
S303, calculating the depth of the three-dimensional point after triangulation and uncertainty thereof according to the geometric relation between the reference key frame and the current key frame, wherein the main task is to provide Gaussian distribution of inverse depth corresponding to the new observation depth so as to facilitate subsequent information fusion to enable depth data to be converged.
The three-dimensional point depth is calculated by triangulation, the schematic diagram of which is shown in fig. 5, and the calculation method is as follows:
Two frames of images are denoted I_1 and I_2 respectively; I_1 is taken as the reference, and the transformation of I_2 relative to it is T. The camera optical centres are O_1 and O_2. A feature point p_1 in I_1 corresponds to a feature point p_2 in I_2. Connecting the optical centre O_1 with the feature point p_1, and the optical centre O_2 with the feature point p_2, gives the rays O_1p_1 and O_2p_2. In theory, when the camera motion is not a pure rotation, the lines O_1p_1 and O_2p_2 intersect at a point P in the scene, which is the position in the three-dimensional scene of the map point corresponding to the two feature points p_1 and p_2. Owing to noise, the two lines may fail to intersect, in which case a least-squares solution is used.
Let x be 1 And x 2 Is the characteristic point p 1 And p is as follows 2 The corresponding normalized coordinates satisfy the relationship:
s_2 x_2 = s_1 R x_1 + t (21),
where s_1 and s_2 are respectively the depths of the two feature points p_1 and p_2, and the rotation matrix R and translation vector t are known. Geometrically, one can search along the ray O_1p_1 for the three-dimensional point whose projection is close to p_2; similarly, the search can be performed along the ray O_2p_2. Taking the computation of s_1 as an example, the above equation is transformed into an equation that involves only s_1:
s_1 x_2^ R x_1 + x_2^ t = 0
where x_2^ denotes the skew-symmetric (cross-product) matrix of x_2. Solving this equation directly yields s_1; s_2 can be obtained in the same way. The depths of the two feature points are thus preliminarily determined. Because of the presence of noise, a least-squares solution is required.
When new current key frames continue to appear, the corresponding pixel blocks are found by epipolar search and block matching and the depth is estimated by triangulation; the observed depth in the reference key frame is then calculated and converted into an inverse depth. The inverse depth at this time is also a Gaussian distribution:
P(d_obs^{-1}) = N(μ_obs, σ_obs²),
where σ_obs can be obtained by combining the error of the pixels after epipolar search with the depth deviation calculated from the triangulation geometry.
S304, the Gaussian distribution of the inverse depth corresponding to the current observed depth is fused into the previous estimate to establish the fused inverse-depth distribution; if the uncertainty is smaller than a certain threshold, the depth data are considered to have converged, otherwise information fusion continues. Steps S303 and S304 together constitute the depth filter.
First, the Gaussian distributions of the new and old inverse depths are fused; according to the product of Gaussian distributions, the fused distribution N(μ_fuse, σ_fuse²) satisfies:
μ_fuse = (σ_obs² μ + σ² μ_obs) / (σ² + σ_obs²),   σ_fuse² = (σ² σ_obs²) / (σ² + σ_obs²).
By iterating the above-mentioned inverse depth observation calculation and fusion process, when the uncertainty obtained by the latest information fusion is smaller than a certain threshold value, the depth data can be considered to have converged.
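A minimal sketch of the fusion and convergence test of S304, using the product-of-Gaussians update given above; the convergence threshold is an illustrative assumption.

```python
def fuse_inverse_depth(mu, sigma2, mu_obs, sigma2_obs, conv_thresh=1e-4):
    """Fuse the previous inverse-depth Gaussian with a new observation (product of Gaussians)."""
    mu_fuse = (sigma2_obs * mu + sigma2 * mu_obs) / (sigma2 + sigma2_obs)
    sigma2_fuse = (sigma2 * sigma2_obs) / (sigma2 + sigma2_obs)
    converged = sigma2_fuse < conv_thresh          # uncertainty below threshold: depth has converged
    return mu_fuse, sigma2_fuse, converged
```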
Finally, a semi-dense depth map is obtained through rendering, data read-out and similar post-processing.
Following the visual mechanism of the human eye, when feature points are unevenly distributed in a scene a panoramic value map is constructed and regions with dense feature points are assigned higher value. This strategy steers the bionic eye camera towards such regions. Dense feature points benefit pose-estimation optimization and global map construction, thereby strengthening the stability of the SLAM system. At the same time, feature-rich regions provide the texture basis for the subsequent semi-dense depth estimation by block matching, since regions with salient texture are far less prone to mismatches. Secondly, the interference of dynamic objects with semi-dense depth estimation is excluded when the panoramic value map is constructed. If a dynamic object appears during semi-dense depth estimation it disturbs block matching: after the object moves, the originally matching pixels on the corresponding epipolar line disappear and wrong pixel blocks are matched instead, causing matching failure or erroneous depth estimates. Excluding dynamic objects therefore provides better-quality image data for the subsequent semi-dense depth estimation.
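By way of illustration only (the Gaussian-bump weighting, the kernel width, the window step and the helper names below are assumptions, not details given by the invention), such a panoramic value map could be assembled by accumulating value around detected feature points, zeroing out dynamic-object regions, and then searching for the nearest window whose mean value exceeds a threshold:

```python
import numpy as np

def build_value_map(shape, feature_pts, dynamic_mask, sigma=25.0):
    """Panoramic value map: higher value near dense feature points,
    zero inside dynamic-object regions (weighting scheme is assumed)."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    value = np.zeros((H, W), dtype=np.float64)
    for u, v in feature_pts:                       # one Gaussian bump per feature point
        value += np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    value[dynamic_mask] = 0.0                      # exclude dynamic objects from gaze selection
    return value

def best_window(value_map, win_h, win_w, v_th, center, step=10):
    """Sliding-window search: window whose mean value exceeds v_th and whose
    center is closest to `center` (step size is an illustrative assumption)."""
    H, W = value_map.shape
    best, best_dist = None, np.inf
    for top in range(0, H - win_h + 1, step):
        for left in range(0, W - win_w + 1, step):
            v_m = value_map[top:top + win_h, left:left + win_w].mean()
            d = (top + win_h / 2 - center[1]) ** 2 + (left + win_w / 2 - center[0]) ** 2
            if v_m > v_th and d < best_dist:
                best, best_dist = (top, left), d
    return best

# Example: two feature points and a dynamic-object box in a 100x200 panorama.
mask = np.zeros((100, 200), dtype=bool)
mask[40:60, 150:180] = True
vmap = build_value_map((100, 200), [(30, 50), (35, 55)], mask)
target = best_window(vmap, 40, 60, v_th=0.5, center=(100, 50))
```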
Meanwhile, for binocular active vision, the SLAM algorithm is improved to handle changes in the relative pose of the bionic eye cameras, providing the basis for the subsequent semi-dense depth estimation.
In the semi-dense depth estimation module, conventional passive binocular semi-dense depth estimation relies on the fixed baseline between the two passive cameras, so that pixel depth can be computed from the left-right disparity. In the present invention the relative pose between the binocular bionic eye cameras changes, so the traditional disparity-based approach is no longer applicable. The improved SLAM algorithm of the invention yields filtered key frame poses; with these poses and the corresponding image data, the matching pixel block is found by epipolar search and block matching, and the depth is estimated by triangulation, where constraining the direction of the pixel-block gradient during block matching yields a more accurate triangulated depth. Moreover, binocular active vision provides more texture information than traditional monocular passive semi-dense depth estimation, which further improves the accuracy of the depth estimate. The invention therefore has significant practical value and offers an effective route to autonomous navigation and localization for robots in unknown environments.
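A minimal sketch of the de-meaned NCC score and of the gradient-direction gate described above (the 45° threshold, the use of the mean patch gradient and the helper names are illustrative assumptions):

```python
import numpy as np

def zncc(A, B):
    """De-meaned (zero-mean) normalized cross-correlation of two equal-size patches."""
    A = A - A.mean()
    B = B - B.mean()
    denom = np.sqrt((A * A).sum() * (B * B).sum()) + 1e-12
    return float((A * B).sum() / denom)

def gradient_aligned(patch, epipolar_dir, max_angle_deg=45.0):
    """Accept a patch for matching only if its mean intensity gradient lies within
    a threshold angle of the epipolar direction; otherwise skip the NCC evaluation."""
    gy, gx = np.gradient(patch.astype(np.float64))
    g = np.array([gx.mean(), gy.mean()])
    if np.linalg.norm(g) < 1e-6:
        return False                               # nearly texture-less patch: skip it
    cos_a = abs(g @ epipolar_dir) / (np.linalg.norm(g) * np.linalg.norm(epipolar_dir))
    return np.degrees(np.arccos(np.clip(cos_a, 0.0, 1.0))) <= max_angle_deg

# Example: score a candidate block on the epipolar line only when the gate passes.
ref_patch = np.random.rand(9, 9)
cand_patch = ref_patch + 0.01 * np.random.rand(9, 9)
if gradient_aligned(ref_patch, np.array([1.0, 0.0])):
    score = zncc(ref_patch, cand_patch)
```

Skipping blocks whose gradient is nearly perpendicular to the epipolar line avoids NCC evaluations that would be unreliable anyway, which is what lets the method stay closer to real time than a dense direct approach.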

Claims (10)

1. The binocular active vision semi-dense depth estimation method based on the bionic eye is characterized by comprising the following steps of:
s100, acquiring images by using a binocular bionic eye camera;
s200, inputting the image acquired in S100 into an SLAM system, tracking key frames by using the SLAM system and estimating the pose of the camera and the coordinates of landmark points;
s300, utilizing the key frame pose obtained in S200 and the corresponding key frame image data thereof, combining polar line searching and block matching with triangulation, and utilizing a depth filter to realize semi-dense depth estimation, specifically comprising the following steps:
s301, selecting a frame entering step S300 as the reference key frame, and assuming that the inverse depths corresponding to all pixel depths in the image data of the reference key frame follow an initial Gaussian distribution, so as to provide an initial value for the subsequent depth-information fusion;
s302, inserting a new key frame, and determining the position of a projection point through polar line searching and block matching;
s303, calculating the depth of the triangulated three-dimensional point and uncertainty thereof according to the geometric relationship between the reference key frame and the current key frame;
s304, fusing the Gaussian distribution of the inverse depth corresponding to the current observation into the previous estimate to obtain the fused inverse-depth distribution; if the uncertainty is below a certain threshold, the depth data are considered converged and the depth of the three-dimensional point is determined; otherwise the method returns to step S302 to continue information fusion.
2. The binocular active vision semi-dense depth estimation method based on the bionic eye according to claim 1, wherein the step S302 is specifically:
let a pixel in the image corresponding to the reference key frame be u_k, whose admissible depth range projects into the image corresponding to the current key frame as an epipolar line l_2; the image corresponding to the reference key frame is denoted I_k and the image corresponding to the current key frame I_t; the rotation matrices of the world coordinate system relative to the reference key frame and the current key frame are R_kw and R_tw respectively, the corresponding translation vectors are t_kW and t_tW, and the intrinsic matrix K_L of the left camera is obtained by calibration; in pixel coordinates the reference key frame and the current key frame image are then related by

d_t u_t = K_L ( R_tw R_kw^(-1) ( d_k K_L^(-1) u_k - t_kW ) + t_tW ),
where u_k is the pixel in the reference key frame, d_k is the depth of the space point corresponding to that pixel, and d_t is the depth of the same point in the current key frame; when d_k and u_k are known, the position of u_t can be computed; when the two components of u_k are perturbed by increments du and dv, the corresponding increments du_t and dv_t of u_t can be computed; the affine transformation between the key frame and current frame image coordinates in a local neighbourhood is then the 2×2 matrix whose columns are the increments (du_t, dv_t) obtained for the unit perturbations (du, dv) = (1, 0) and (0, 1) of u_k.
The camera rotation is then taken into account in block matching through this affine transformation matrix.
The pixel block around u_k is denoted A, and a candidate pixel block on the epipolar line l_2 is denoted B; the correlation between the block around u_k and the blocks on the epipolar line is compared using de-meaned (zero-mean) normalized cross-correlation; the larger the angle between the pixel-block gradient and the epipolar line, the less reliable the block matching, so only pixel blocks whose gradient lies along the epipolar direction, or within a certain threshold angle of it, are matched; according to the definition of the NCC this gives:

S(A, B) = Σ_{i,j} (A(i,j) − Ā)(B(i,j) − B̄) / sqrt( Σ_{i,j} (A(i,j) − Ā)² · Σ_{i,j} (B(i,j) − B̄)² );
after the step length is determined, traversing the NCC values between the source pixel block and the candidate pixel blocks along the corresponding epipolar line yields a non-convex NCC distribution, and the block with the highest NCC is taken as the match of the source pixel block.
3. The binocular active vision semi-dense depth estimation method based on bionic eyes as claimed in claim 2, wherein in step S302, if the highest NCC value obtained by block matching is still below the threshold, the epipolar search and block matching of that source pixel block in the image of the current key frame is considered invalid, and matching and information-fusion convergence is carried out after a new key frame arrives together with its camera pose optimized by the SLAM system of S200.
4. The binocular active vision semi-dense depth estimation method based on the bionic eye according to claim 1, wherein in the step S303, the three-dimensional point depth is calculated by triangulation in the following manner:
the two frames are I_1 and I_2, with I_1 as the reference frame and T the transform of I_2 relative to I_1; the camera optical centers are O_1 and O_2; a feature point p_1 in I_1 corresponds to a feature point p_2 in I_2; connecting optical center O_1 with feature point p_1 and optical center O_2 with feature point p_2 gives the rays O_1p_1 and O_2p_2; when the camera motion is not a pure rotation, the lines O_1p_1 and O_2p_2 intersect at a point P in the scene, which is the position in the three-dimensional scene of the map point corresponding to the two feature points p_1 and p_2;
let x_1 and x_2 be the normalized coordinates corresponding to the feature points p_1 and p_2; they satisfy the relationship:
s_2 x_2 = s_1 R x_1 + t, where s_1 and s_2 are the depths of the two feature points p_1 and p_2 respectively, and the rotation matrix R and translation vector t are known;
geometrically, this corresponds to searching along the ray O_1p_1 for the three-dimensional point whose projection is closest to p_2; the search can likewise be carried out along the ray O_2p_2;
taking the computation of s_1 as an example, left-multiplying both sides by the skew-symmetric matrix [x_2]_× of x_2 yields an equation in s_1 alone: s_1 [x_2]_× R x_1 + [x_2]_× t = 0; solving this equation directly gives s_1, and s_2 is obtained in the same way; the depths of the two feature points are thereby determined preliminarily;
when, owing to noise, O_1p_1 and O_2p_2 do not intersect, the solution is obtained by least squares.
5. The binocular active vision semi-dense depth estimation method of claim 4, wherein each time a new current key frame arrives, the corresponding pixel block is found by epipolar search and block matching, the depth is estimated by triangulation, and the observed depth in the reference key frame is computed and converted into an inverse depth; this observed inverse depth again follows a Gaussian distribution N(μ_obs, σ_obs²),
where the observation uncertainty σ_obs² is obtained by combining the pixel error after epipolar search with the depth deviation computed from the triangulation geometry.
6. The binocular active vision semi-dense depth estimation method based on bionic eyes according to claim 1, wherein in the step S100, the binocular bionic eye camera is adopted for gaze guidance, and the steps include:
s101, acquiring a panoramic image, and detecting in it the key scene information that influences the execution of SLAM tasks, including feature points and dynamic objects;
s102, assigning values to pixel points in the panoramic image based on key scene information, and constructing a panoramic value image;
s103, respectively projecting the view field areas of the left bionic eye camera and the right bionic eye camera into the panoramic value image to obtain a left view field projection area and a right view field projection area; simultaneously controlling the left view field projection area and the right view field projection area to overlap; combining the left view field projection area and the right view field projection area to obtain a current binocular view field projection area;
S104, obtaining the value average of the current binocular field projection area in the panoramic value image, and comparing it with the value threshold: if the value average of the current binocular field projection area is greater than the value threshold, the current binocular field projection area is kept unchanged and the high-value images currently collected by the left and right bionic eye cameras are used as input to the SLAM system; if the value average of the current binocular field projection area is less than or equal to the value threshold, a target binocular field projection area whose value average exceeds the value threshold is searched for in the panoramic value image, and the method proceeds to step S105;
s105, respectively calculating the displacement needed by the left bionic eye camera and the right bionic eye camera according to the current binocular field projection area and the target binocular field projection area; and respectively moving the left bionic eye camera and the right bionic eye camera according to the displacement, and finally taking high-value images acquired by the left bionic eye camera and the right bionic eye camera in the target binocular visual field projection area as input of the SLAM system.
7. The binocular active vision semi-dense depth estimation method based on the bionic eye according to claim 6, wherein in the step S104,
a rectangle of the same size as the projection of the current bionic eye camera gazing area is used as a sliding window over the panoramic value image; searching proceeds from the current gazing area outwards, from near to far, with a certain step S, and the value average V_m within each window is computed from the panoramic value image;
a value threshold V_th is defined; if the value average of the current gazing area is above the value threshold, the camera is not moved; otherwise, if the value average of the current gazing area is less than or equal to the value threshold, the gaze is shifted to the area whose value average exceeds the threshold V_th and which is closest to the current gazing area.
8. The binocular active vision semi-dense depth estimation method of claim 6, wherein in step S105 the rotational motion of the left bionic eye camera is represented as a rotation matrix R_LCLT and the rotational motion of the right camera as a rotation matrix R_RCRT; R_LCLT is solved as follows:
the center pixel coordinate u_PT of the target fixation area in the panoramic value map is back-projected through the panoramic camera projection model onto the projection sphere of the panoramic camera, and the back-projected point P_PT on the sphere is transformed into the initial pose coordinate system O_LO of the left camera, giving the target optical-axis direction P_LOT of the left camera; the center point of the pixel plane of the current gazing area of the left bionic eye camera is back-projected through the camera projection model onto the normalization plane and likewise transformed into the initial pose coordinate system O_LO of the left bionic eye camera, giving the current optical-axis direction P_LOC of the left camera in O_LO; the left bionic eye camera therefore has to rotate from the current direction P_LOC to the target direction P_LOT;
the rotation axis between the current direction P_LOC and the target direction P_LOT is computed as the normalized cross product n = (P_LOC × P_LOT) / |P_LOC × P_LOT|;
according to the definition of the dot product, the rotation angle is θ_LCLT = arccos( P_LOC · P_LOT / ( |P_LOC| |P_LOT| ) );
the rotation matrix R_LCLT is then obtained from the rotation angle and rotation axis by the Rodrigues formula;
R_RCRT is solved in the same way;
solving the rotation matrices of the left and right cameras yields the rotation commands that the bionic eye control module must issue to the left and right bionic eye motors, so that the bionic eye cameras shift their gaze to a high-value area of the scene, realizing fixation guidance; an illustrative sketch of this rotation computation is given after the claims.
9. The binocular active vision semi-dense depth estimation method based on bionic eyes according to any one of claims 1 to 8, wherein the SLAM system adopts a SLAM algorithm modified from the ORB-SLAM2 framework, comprising four modules, namely a tracking thread, a local mapping thread, a loop closing thread and a global BA thread:
(1) The tracking thread tracks each frame by matching feature points of the current frame against the local map; the camera pose is initialized and the local map tracked using a motion model, PnP relocalization, or the reference key frame, the camera pose is estimated by minimizing the PnP reprojection error, and key frame decisions are made at the same time;
(2) The local mapping thread manages the local map and performs local BA optimization;
(3) The loop closing thread detects large loops using the DBoW2 library and corrects the accumulated camera-trajectory error by performing pose-graph optimization;
(4) After the pose graph optimization is completed, a global BA thread is started to acquire the globally consistent camera pose and map structure.
10. The binocular active vision semi-dense depth estimation method based on the bionic eye according to claim 9, wherein the SLAM system finally sets the current frame as a key frame if the current frame satisfies any one of the following conditions:
(1) The current frame is the first frame after the bionic eye gaze control reaches the target gazing area;
(2) More than 13 frames have passed since the last global relocalization;
(3) The local mapping thread is in an idle state;
(4) More than 15 frames have passed since the last key frame was set;
(5) The translation distance between the current frame and the last set key frame exceeds a threshold t_th;
(6) The number of successfully tracked feature points in the current frame exceeds 70;
(7) The number of successfully tracked feature points in the current frame is less than 85% of that in the reference key frame.
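For the gaze-shift rotation solved in claim 8, a minimal sketch follows (the helper name and the handling of the degenerate parallel-direction case are assumptions); it computes the axis from the cross product, the angle from the dot product, and the rotation matrix via the Rodrigues formula:

```python
import numpy as np

def rotation_between(p_current, p_target):
    """Rotation matrix turning direction p_current onto p_target:
    axis from the cross product, angle from the dot product, Rodrigues formula."""
    a = p_current / np.linalg.norm(p_current)
    b = p_target / np.linalg.norm(p_target)
    axis = np.cross(a, b)
    n = np.linalg.norm(axis)
    if n < 1e-12:                    # directions already (anti-)parallel
        return np.eye(3)
    axis /= n
    theta = np.arccos(np.clip(a @ b, -1.0, 1.0))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# Example: rotate the left camera's current optical axis onto a target direction.
R_LCLT = rotation_between(np.array([0.0, 0.0, 1.0]), np.array([0.2, 0.1, 1.0]))
```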
CN202311362069.4A 2023-08-29 2023-10-20 Binocular active vision semi-dense depth estimation method based on bionic eyes Pending CN117218210A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311102020 2023-08-29
CN2023111020205 2023-08-29

Publications (1)

Publication Number Publication Date
CN117218210A true CN117218210A (en) 2023-12-12

Family

ID=89044601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311362069.4A Pending CN117218210A (en) 2023-08-29 2023-10-20 Binocular active vision semi-dense depth estimation method based on bionic eyes

Country Status (1)

Country Link
CN (1) CN117218210A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456124A (en) * 2023-12-26 2024-01-26 浙江大学 Dense SLAM method based on back-to-back binocular fisheye camera
CN117456124B (en) * 2023-12-26 2024-03-26 浙江大学 Dense SLAM method based on back-to-back binocular fisheye camera
CN117649454A (en) * 2024-01-29 2024-03-05 北京友友天宇系统技术有限公司 Binocular camera external parameter automatic correction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
Yang et al. Monocular object and plane slam in structured environments
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
Park et al. Elastic lidar fusion: Dense map-centric continuous-time slam
CN109211241B (en) Unmanned aerial vehicle autonomous positioning method based on visual SLAM
CN110116407B (en) Flexible robot position and posture measuring method and device
Kerl et al. Robust odometry estimation for RGB-D cameras
CN111210463B (en) Virtual wide-view visual odometer method and system based on feature point auxiliary matching
CN117218210A (en) Binocular active vision semi-dense depth estimation method based on bionic eyes
CN107563323A (en) A kind of video human face characteristic point positioning method
CN110260866A (en) A kind of robot localization and barrier-avoiding method of view-based access control model sensor
Sandy et al. Object-based visual-inertial tracking for additive fabrication
Agrawal et al. PCE-SLAM: A real-time simultaneous localization and mapping using LiDAR data
CN111998862A (en) Dense binocular SLAM method based on BNN
Theodorou et al. Visual SLAM algorithms and their application for AR, mapping, localization and wayfinding
Li et al. Metric sensing and control of a quadrotor using a homography-based visual inertial fusion method
Fan et al. A nonlinear optimization-based monocular dense mapping system of visual-inertial odometry
Meng et al. A tightly coupled monocular visual lidar odometry with loop closure
Dai et al. A Review of Common Techniques for Visual Simultaneous Localization and Mapping
Sizintsev et al. Long-Range Augmented Reality with Dynamic Occlusion Rendering
Kawasaki et al. Motion estimation for non-overlapping cameras by improvement of feature points matching based on urban 3D structure
CN114972491A (en) Visual SLAM method, electronic device, storage medium and product
Lin et al. Contour-SLAM: A robust object-level SLAM based on contour alignment
Youji et al. A SLAM method based on LOAM for ground vehicles in the flat ground
Hu et al. M³LVI: a multi-feature, multi-metric, multi-loop, LiDAR-visual-inertial odometry via smoothing and mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination