CN111899276A - SLAM method and system based on binocular event camera - Google Patents

SLAM method and system based on binocular event camera

Info

Publication number
CN111899276A
Authority
CN
China
Prior art keywords
event
camera
time
points
imu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010647021.8A
Other languages
Chinese (zh)
Inventor
余磊
周游龙
杨公宇
杨文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010647021.8A priority Critical patent/CN111899276A/en
Publication of CN111899276A publication Critical patent/CN111899276A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85 Stereo camera calibration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a binocular event camera based SLAM method and system. Motion compensation is performed on the input left and right event camera data with IMU assistance to obtain corresponding reconstructed images; in the IMU-assisted motion compensation, the coordinates of the event points are projected into a reference coordinate system through the relative pose obtained by IMU integration, and the depth of each event point is replaced by the median of the depths of the adjacent three-dimensional space points. Feature point detection and tracking are then carried out separately on the reconstructed images of the left and right event cameras. The detected and tracked feature points are triangulated to obtain the three-dimensional coordinate points of the target and the pose changes between images, and the camera pose is calculated with a PnP method. Finally, back-end BA optimization combined with IMU pre-integration yields the camera motion trajectory and scene mapping information. The technical scheme of the invention can cope with scenes with large illumination changes and high-speed motion, and addresses the problem that the robot in a conventional SLAM system easily fails when the motion or the environment is too complex.

Description

SLAM method and system based on binocular event camera
Technical Field
The invention belongs to the field of image processing, and particularly relates to a technical scheme for realizing SLAM by using a binocular event camera.
Background
In the past decades, interest in robotic perception has grown steadily thanks to the research and development of computer vision methods. Conventional optical cameras can capture rich information about the camera's surroundings and, owing to their low cost and wide availability, have become the most popular sensors in a variety of applications.
Simultaneous Localization and Mapping (SLAM) is one of the most important milestones in the field of robot perception and has achieved significant success over the last 30 years. Monocular event camera SLAM cannot recover the true scale of the scene, and a monocular camera needs a certain amount of time to initialize; otherwise an incorrect trajectory and mapping result are obtained. Existing SLAM systems are typically built on conventional optical cameras, which exhibit some limitations by design. On the one hand, they output images at a fixed frame rate regardless of how much new information each image contains, so the incoming information is usually heavily redundant, and the redundant data wastes valuable computational resources. On the other hand, highly dynamic scenes or fast camera motion may introduce motion blur into conventional image frames and may leave insufficient overlap between subsequent frames, so that corner detection and tracking based on conventional cameras degrade, which also limits the further development of SLAM. At the same time, because of the special data format of the event camera, many existing mature SLAM methods cannot be applied directly to the event camera, which limits the application of event cameras.
An event camera, or Dynamic Vision Sensor (DVS), simulates the retina with a chip and responds with pulses to pixel-level illumination changes caused by motion. FIG. 1 shows an event camera and a normal camera shooting a rotating disk with a dot: the standard camera output is the luminance image of the scene at a specific time point, while the DVS output is a stream of event data. More specifically, when the brightness increment at pixel position u_j = (x_j, y_j) at time t_j reaches a threshold ±c (c > 0), an event e_j = (x_j, y_j, t_j, p_j) is triggered, where p_j ∈ {+1, -1} is the polarity of the event, a positive sign indicating an increase in brightness and a negative sign a decrease. The event camera therefore outputs an asynchronous event stream, as shown in FIG. 1, and since events only record incremental changes, the absolute brightness of the scene is no longer directly visible. In contrast to conventional frame-based cameras, event cameras can capture brightness changes at an almost unlimited frame rate and record events at specific points in time and image locations. Especially for moving scenes, the event camera has great advantages in data rate, speed and dynamic range, and is expected to solve the failures caused by overly fast motion in conventional SLAM systems. Newer event cameras, such as the DAVIS (Dynamic and Active-pixel Vision Sensor), integrate an IMU (Inertial Measurement Unit) module. The IMU measures three-axis linear acceleration and angular velocity, is often used to acquire three-dimensional motion information of the camera for self-localization in SLAM (Simultaneous Localization and Mapping), navigation and other applications, and can be time-synchronized with the event points and brightness images. However, technical difficulties such as image reconstruction and time alignment of the binocular images still remain for localization and map construction.
Disclosure of Invention
The invention provides a SLAM scheme based on a binocular event camera, aiming at the problem that the real depth information of a scene cannot be directly recovered in the conventional SLAM method based on a monocular event camera.
The technical scheme of the invention provides a SLAM method based on a binocular event camera, which comprises the following steps:
step 1, performing motion compensation on input left and right event camera data by utilizing IMU assistance to obtain corresponding reconstructed images; the motion compensation using IMU assistance is implemented as follows,
setting the event frame start time as
Figure BDA0002573438640000021
When motion compensation is performed, the start of an event frame is taken as a reference system, and a certain event point e in an accumulation window is aimed atjNoting the corresponding time stamp as tjObtained by IMU integration
Figure BDA0002573438640000022
To tjRelative position and attitude of
Figure BDA0002573438640000023
E is to bejCoordinate x ofjProjected coordinate x 'in reference coordinate system'jComprises the following steps:
Figure BDA0002573438640000024
wherein K is the camera internal reference matrix, K-1Is its inverse matrix, Z (x)j) The depth of the event point is replaced by a median value of the depths of the adjacent three-dimensional space points;
step 2, respectively carrying out feature point detection and tracking on corresponding reconstructed images input by the left event camera and the right event camera;
step 3, according to the result obtained in the step 2, triangularization calculation is carried out on the detected and tracked feature points to obtain three-dimensional coordinate points corresponding to the target and pose changes among images, and the camera pose is calculated by using a PnP method;
and 4, performing back-end BA optimization by combining IMU pre-integration to obtain a camera motion track and scene mapping information.
Furthermore, the time alignment is performed when the image is reconstructed in step 1, which is realized as follows,
1) the left event camera accumulates the event points within a 30 ms window into a binary image frame, motion compensation is performed, and the time of the first event point within the 30 ms window is taken as the timestamp of the left event camera's reconstructed image frame;
2) in the event stream data of the right event camera, the event point whose time is closest to the timestamp of the left camera's reconstructed image frame is searched for; taking the time of the found event point as the start time, the right event camera accumulates the event points of a 30 ms window into a binary image frame, and motion compensation is performed.
In step 2, the feature point detection is realized by adopting a Shi-Tomasi method.
In step 2, tracking is realized by adopting the Kanade-Lucas-Tomasi method.
The invention also provides a SLAM system based on the binocular event camera, which is used to implement the above SLAM method based on the binocular event camera.
The method mainly exploits the event camera's high temporal resolution and high dynamic range, together with the binocular camera's ability to acquire the real depth of the scene; it thereby avoids the problem that a traditional camera cannot capture enough image features in high dynamic range scenes, and computes the true depth information of the scene through the binocular setup. The binocular event camera based method can achieve higher trajectory estimation precision in SLAM. The technical scheme of the invention can cope with scenes with large illumination changes and high-speed motion, and addresses the problem that the robot in a conventional SLAM system easily fails when the motion or the environment is too complex.
Drawings
Fig. 1 is a schematic diagram comparing data of a conventional camera and a DVS camera.
Fig. 2 is a diagram of image reconstruction results according to an embodiment of the present invention, in which fig. 2(a) is an event image generated at a fixed time interval of a slow moving scene, fig. 2(b) is an event image generated at a fixed time interval of a fast moving scene, fig. 2(c) is an event image generated by a fixed number of event points of a scene with a simple environment, and fig. 2(d) is an event image generated by a fixed number of event points of a scene with a complex environment.
Fig. 3 is a schematic diagram of motion compensation as used in the present invention.
Fig. 4 shows the result of image reconstruction after motion compensation according to an embodiment of the present invention, in which fig. 4(a) is the image before motion compensation and fig. 4(b) is the image after motion compensation.
FIG. 5 is a schematic diagram of triangulation according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of back-end BA optimization according to an embodiment of the present invention.
FIG. 7 is a block diagram of a flow chart according to an embodiment of the present invention.
Detailed Description
In order to more clearly understand the present invention, the technical solutions of the present invention are specifically described below with reference to the accompanying drawings and examples.
In the method, the left and right event cameras are treated as two independent threads that simultaneously perform motion compensation, time alignment, and feature point detection and tracking. The three-dimensional coordinates of the feature points are then solved by triangulation, the camera pose is solved from the two-dimensional image coordinates of the feature points together with the corresponding three-dimensional space coordinates, and finally the back end, combined with IMU pre-integration, optimizes and outputs the motion trajectory of the cameras and the construction of the surrounding map.
Referring to fig. 7, an embodiment of the present invention provides a method for SLAM based on a binocular event camera, including the following steps:
step 1, performing motion compensation on input left and right event camera data by using IMU assistance to obtain a reconstructed gray value image.
Left and right event camera data and the event camera IMU data are first input. Because the data format of an event camera differs from that of an ordinary optical camera, images have to be reconstructed from the events; event images can be generated by accumulating event points either by a fixed number of events or over fixed time intervals.
The event stream data recorded while the camera shoots is given by formula (1). The set {e_i(x, t)} represents all event points generated by the camera during shooting, where x is the pixel coordinate of an event point and t is its generation time. The i-th event point e_i(x, t) carries its generation time t_i, its pixel coordinate x_i and its polarity σ_i, and δ(·) is the Dirac function:

$$e_i(\mathbf{x}, t) = \sigma_i\, \delta(\mathbf{x} - \mathbf{x}_i)\, \delta(t - t_i), \quad i \in 1, 2, 3, \ldots \qquad (1)$$

The data output by the event camera is an asynchronous event stream, whereas conventional image processing methods operate on framed images. In order to process images or visualize the event stream, event image frames need to be generated from the event stream; usually an event image is formed by directly accumulating events.
Accumulating event data directly over a period of time and drawing it onto an image in a fixed color is the simplest way to generate an event frame. In other words, in a blank image, the values of the coordinate pixels of all events received within a certain period are set to 255, and locations that received no event are set to 0, giving a binary event image. The accumulation window can be chosen either as a fixed length of time or as a fixed number of events. Event images generated at fixed time intervals are shown in fig. 2(a) and (b), two event images generated with 30 ms intervals, where fig. 2(a) is a slowly moving scene and fig. 2(b) is a fast moving scene. It can be seen that if accumulation is performed over a fixed time interval, the edges of the event image become thick when the camera moves too fast, and an obvious smearing phenomenon appears. Fig. 2(c) and (d) show event images generated with a fixed number of event points, here 10000 points each; fig. 2(c) is a scene with a simple environment and fig. 2(d) a scene with a complex environment. The more complex the environment, the more events are activated, and with this fixed-event-count imaging method the time interval between adjacent frames grows in simple scenes. Building images from accumulated events therefore requires choosing appropriate means and parameters according to the scene and motion pattern. This patent preferably employs event images generated at fixed time intervals, and the embodiment accumulates events over 30 ms intervals to generate the event images.
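As an illustration of the fixed-time-interval accumulation described above, the following Python sketch builds a binary event frame from a 30 ms window; the event array layout (columns x, y, t) and the sensor resolution used here are assumptions for the example, not values taken from the patent.

```python
import numpy as np

def accumulate_event_frame(events, t_start, window=0.030, height=260, width=346):
    """Binary event image from all events with t in [t_start, t_start + window)."""
    frame = np.zeros((height, width), dtype=np.uint8)
    mask = (events[:, 2] >= t_start) & (events[:, 2] < t_start + window)
    xs = events[mask, 0].astype(int)
    ys = events[mask, 1].astype(int)
    frame[ys, xs] = 255        # pixels that received at least one event
    return frame
```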
Since an image formed by directly accumulating events over a period of time has coarse event-image edges, which is detrimental to subsequent feature point detection, the event points need to be motion-compensated.
As shown in FIG. 3, on the time axis t the dots represent event points, the squares represent IMU data, and $I_1, I_2, I_3, I_4$ above denote the corresponding image frame sequence. Integrating the IMU data within the time interval $[t_{I_2}, t_{I_3}]$ yields the transformation matrix $T_{I_2 \to I_3}$ between the two frames $I_2$ and $I_3$, namely:

$$T_{I_2 \to I_3} = \begin{bmatrix} R_{I_2 \to I_3} & p_{I_2 \to I_3} \\ 0 & 1 \end{bmatrix} \qquad (2)$$

where double integration of the linear acceleration gives the translation increment $p_{I_2 \to I_3}$ and integration of the angular velocity gives the rotation increment $R_{I_2 \to I_3}$. For an arbitrary event point e_j with timestamp t_j inside this interval, linear interpolation of $T_{I_2 \to I_3}$ over time gives the transformation of e_j relative to the reference timestamp, i.e. the relative pose between the reference time and t_j.
Let the event frame start time be t_ref and the accumulation window size be Δt. When motion compensation is performed, the event frame start t_ref is taken as the reference frame. For an event point e_j in the accumulation window with timestamp t_j, the relative pose between t_ref and t_j calculated by IMU integration, applied here as the transform $T_{\mathrm{ref} \leftarrow t_j}$ that maps coordinates at time t_j into the reference frame, allows the coordinate x_j of e_j to be projected into the reference coordinate system. The projected coordinate x'_j is:

$$x'_j = \pi\!\left(K\, T_{\mathrm{ref} \leftarrow t_j}\, Z(x_j)\, K^{-1}\, \tilde{x}_j\right)$$

where K is the camera intrinsic matrix, known after camera calibration, K^{-1} is its inverse, $\tilde{x}_j$ is the homogeneous pixel coordinate of x_j, $\pi(\cdot)$ is the perspective projection onto the image plane, and Z(x_j) is the depth of the event point. This depth is normally derived from the projection depth of the optimized three-dimensional space points in the region, which is given by the BA optimization in step 4; to avoid the influence of errors in the computed depths of individual space points, the median of the depths of the adjacent three-dimensional space points is used instead. The effect of motion compensation is shown in fig. 4, where fig. 4(a) is the image before motion compensation and fig. 4(b) the image after motion compensation; it can be seen that the edges of the motion-compensated image are thinner, which facilitates subsequent feature point detection and tracking.
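The warping of a single event point into the reference frame can be sketched as below; the function and variable names are ours, and the 4x4 transform T_ref_from_j (mapping points expressed at the event time into the reference frame) and the median depth z_med are assumed to be available from the IMU interpolation and depth lookup described above.

```python
import numpy as np

def compensate_event(x_j, K, T_ref_from_j, z_med):
    """Warp event pixel x_j = (u, v), observed at time t_j, into the reference frame."""
    K_inv = np.linalg.inv(K)
    p_j = z_med * (K_inv @ np.array([x_j[0], x_j[1], 1.0]))   # back-project with the median depth
    p_ref = T_ref_from_j @ np.append(p_j, 1.0)                # express the 3D point in the reference frame
    uvw = K @ p_ref[:3]
    return uvw[:2] / uvw[2]                                   # compensated pixel coordinate x'_j
```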
For time alignment, the specific reconstruction method of step 1 in the embodiment is as follows:
1) the left event camera accumulates the event points within a 30 ms window into a binary image frame, i.e., in a blank image the coordinate pixels of all events received during that 30 ms are set to 255 and locations receiving no event are set to 0; motion compensation is performed, and the time of the first event point within the 30 ms window is used as the timestamp of the left event camera's reconstructed image frame;
2) in the event stream data of the right event camera, the event point whose time is closest to the timestamp of the left camera's reconstructed image frame is searched for; taking the time of that event point as the start time, the right event camera accumulates the event points of a 30 ms window into a binary image frame in the same way (pixels of received events set to 255, all others to 0), and motion compensation is performed. Because the temporal resolution of the event cameras is extremely high, the reconstructed images of the left and right event cameras end up very similar, achieving the goal of time alignment.
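A hedged sketch of the time-alignment rule for the right camera follows: the right camera's 30 ms accumulation window starts at the event whose timestamp is closest to the left frame's timestamp. The sorted timestamp array and function name are illustrative assumptions.

```python
import numpy as np

def align_right_window(right_timestamps, left_frame_timestamp):
    """Start time of the right camera's window: the event time closest to the left frame timestamp."""
    idx = np.searchsorted(right_timestamps, left_frame_timestamp)
    candidates = right_timestamps[max(idx - 1, 0): idx + 1]
    return candidates[np.argmin(np.abs(candidates - left_frame_timestamp))]
```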
Step 2, carrying out Shi-Tomasi method characteristic point detection and Kanade-Lucas-Tomasi method tracking on corresponding reconstructed images input by the left and right event cameras:
the left and right camera images are tracked separately in step 2.
The embodiment adopts the Shi-Tomasi method to detect feature points in the reconstructed image: a fixed window is moved through the image along arbitrary directions, and whether a pixel is a corner point is judged from the grey-level change of the image inside the window, thereby realizing feature point detection. The Kanade-Lucas-Tomasi method, abbreviated KLT, is used for optical-flow tracking: under the assumption that the brightness of the same object remains constant over a short time, the optical flow of the feature points is computed and used to track them.
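A possible realization of step 2 with OpenCV's standard Shi-Tomasi detector and pyramidal Lucas-Kanade (KLT) tracker is sketched below on two consecutive reconstructed event frames; the parameter values are illustrative and not taken from the patent.

```python
import cv2
import numpy as np

def detect_and_track(prev_frame, curr_frame):
    """Shi-Tomasi corners on prev_frame, tracked into curr_frame with pyramidal KLT."""
    pts_prev = cv2.goodFeaturesToTrack(prev_frame, maxCorners=200,
                                       qualityLevel=0.01, minDistance=7)
    if pts_prev is None:
        return np.empty((0, 2)), np.empty((0, 2))
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_frame, curr_frame, pts_prev, None,
                                                   winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    return pts_prev[good].reshape(-1, 2), pts_curr[good].reshape(-1, 2)
```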
Step 3, triangulating the detected and tracked feature points to calculate the corresponding three-dimensional coordinate points of the target and the pose change between images, and calculating the pose of the camera by using a PnP method:
more accurate scene depth values are obtained more efficiently by binocular camera initialization at step 3. In an embodiment, step 3 comprises the following sub-steps:
step 3.1, calculating three-dimensional coordinates of the feature points by using triangulation:
since the binocular camera knows the baseline distance between the two cameras, the absolute scale information can be obtained by triangulating the three-dimensional coordinates of the feature points detected in the reconstructed images of the left and right event cameras, and therefore, more accurate scene depth values can be calculated when the three-dimensional coordinates of the triangulated target from the feature point coordinates of the left and right cameras are calculated. As shown in FIG. 5, the coordinate of a point P in space in the world coordinate system is X, which is represented by O1The coordinate in the imaging plane of the camera 1 being the optical center is X1In the presence of O2The coordinate in the imaging plane of the camera 2 being the optical center is X2,R1、T1Is a rotation matrix and a translation matrix of the camera 1 relative to the initial pose, and has the same principle of R2、T2For the rotation matrix and translation matrix of camera 2 with respect to the initial pose, R, T is the rotation and translation matrix between cameras 1 and 2, and equation 3 can be derived from the camera imaging model.
Figure BDA0002573438640000061
Wherein K is a camera internal reference matrix s1And s2The distances from the optical centers of the camera 1 and the camera 2 to the target point P are approximately equal, the formula 3 is transformed to obtain a formula 4, and the three-dimensional point coordinates of the target point P can be obtained by utilizing SVD (singular value decomposition) decomposition solution. Wherein, the simplification mark K-1X2=X′2
Figure BDA0002573438640000062
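The SVD-based linear triangulation can be sketched as follows, assuming the two projection matrices P1 = K[R1 T1] and P2 = K[R2 T2] have been formed; this is the standard DLT formulation, and the exact arrangement used in the patent may differ.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """3D point from pixel match x1 (camera 1) and x2 (camera 2); P1, P2 are 3x4 projection matrices."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]        # inhomogeneous coordinates of the target point P
```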
Step 3.2, calculating the pose of the camera by using a PnP method:
the PnP (Passive-n-Point) method is used for solving the problem of estimation of the camera pose when three-dimensional space Point coordinates under a known partial world coordinate system and two-dimensional camera coordinate systems of the three-dimensional space Point coordinates and the two-dimensional camera coordinate system coordinates are known. In the invention, the two-dimensional image coordinates and the three-dimensional space coordinates of the known feature points in the step 2 and the step 3.1 are used for solving the pose change of a left camera continuously inputting two frames of images by triangulation, when the camera moves to a new position to obtain a new third frame of event frame, because the translation vector T in the relative transformation obtained by triangulation does not have a real scale, if the pose of the new camera is continuously solved by the triangulation, only a certain relative pose can be obtained, and thus the scales between the 3 camera poses (the camera poses corresponding to the first frame, the second frame and the third frame) are inconsistent. Therefore, for the subsequent camera pose, the relation between the three-dimensional coordinates and the two-dimensional pixel coordinates of the feature points can be utilized for solving, namely the PnP method. In the invention, a PnP method is utilized, three-dimensional coordinates of feature points are calculated by triangularization of left and right camera images of a first frame, two-dimensional image coordinates corresponding to the feature points of two continuous frames are detected by the left event camera feature in step 2, at least 6 groups of three-dimensional and two-dimensional matched points are selected for PnP pose resolving, and then camera pose transformation can be obtained.
Step 4, back-end BA (Bundle Adjustment) optimization is performed in combination with IMU pre-integration to obtain the camera motion trajectory and scene mapping information:
Step 4.1, IMU pre-integration. The IMU outputs the three-axis acceleration $\hat{a}_t$ and angular velocity $\hat{\omega}_t$ of the sensor at a high frequency. However, because of the IMU's own bias and noise, there is a certain difference between the output measurements and the true values. The relationship between the IMU measurements and the true values can be represented by formula (5):

$$\hat{a}_t = a_t + b_{a_t} + R_w^t\, g^w + n_a, \qquad \hat{\omega}_t = \omega_t + b_{\omega_t} + n_\omega \qquad (5)$$

where $\hat{a}_t$ and $\hat{\omega}_t$ are the measured acceleration and angular velocity and $a_t$ and $\omega_t$ are the corresponding true values. $b_{a_t}$ and $b_{\omega_t}$ are the biases of the acceleration and angular velocity, which obey a random-walk model. $n_a$ and $n_\omega$ are the noise of the acceleration and angular velocity: $n_a$ is the three-axis acceleration noise, obeying a Gaussian normal distribution with mean 0 and variance $\sigma_a^2$, and $n_\omega$ is the angular velocity noise, obeying a Gaussian normal distribution with mean 0 and variance $\sigma_\omega^2$. $R_w^t$ is the rotation matrix from the world coordinate system to the camera coordinate system at time t, and $g^w$ is the gravitational acceleration in the world coordinate system.
Integrating the IMU observations between the event frame b_k at time t_k and the event frame b_{k+1} at time t_{k+1} gives the translation $p^w_{b_{k+1}}$, velocity $v^w_{b_{k+1}}$ and rotation $q^w_{b_{k+1}}$ of frame b_{k+1} in the world coordinate system, from which the pose change of the camera can be obtained.
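The following sketch illustrates plain numeric integration of bias-corrected IMU samples between two event frames, in the spirit of equation (5) and the paragraph above; a full pre-integration with noise propagation is omitted, and the sample layout (t, ax, ay, az, wx, wy, wz) is an assumption of this example.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def integrate_imu(samples, R_wb, p_wb, v_wb, b_a, b_w, g_w=np.array([0.0, 0.0, -9.81])):
    """Propagate rotation R_wb, position p_wb and velocity v_wb over IMU samples."""
    for k in range(len(samples) - 1):
        dt = samples[k + 1, 0] - samples[k, 0]
        acc = samples[k, 1:4] - b_a            # bias-corrected specific force (body frame)
        gyr = samples[k, 4:7] - b_w            # bias-corrected angular velocity
        a_w = R_wb @ acc + g_w                 # acceleration in the world frame
        p_wb = p_wb + v_wb * dt + 0.5 * a_w * dt ** 2
        v_wb = v_wb + a_w * dt
        R_wb = R_wb @ Rotation.from_rotvec(gyr * dt).as_matrix()
    return R_wb, p_wb, v_wb
```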
Step 4.2, back-end BA optimization. As more new camera poses keep appearing, the relative poses obtained by solving accumulate a certain error. At this point the overall poses and the three-dimensional point cloud are usually adjusted by the BA optimization method. In essence, BA is an optimization model: it refines the poses obtained by PnP solving and the three-dimensional world coordinates of the feature points obtained by triangulation by minimizing the reprojection error.
Assuming there are n camera poses and m feature points whose three-dimensional world coordinates are to be solved, the BA optimization problem can be represented by formula (6):

$$\min_{P_i,\,X_j} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \chi_{ij} \,\bigl\| x_{ij} - \pi(K, P_i, X_j) \bigr\|^2 \qquad (6)$$

where the coefficient $\chi_{ij}$ is 1 when camera c_i observes feature point j and 0 otherwise, $x_{ij}$ is the two-dimensional pixel coordinate of feature point j observed by camera c_i, K is the camera intrinsic matrix, $P_i$ is the pose of the i-th camera c_i, $X_j$ is the three-dimensional world coordinate of feature point j, and $\pi(\cdot)$ denotes its projection into camera c_i. FIG. 6 illustrates the case of 3 camera poses observing three-dimensional feature points, where C_1, C_2, C_3 are the camera observation points, X_1, X_2, X_3 are the three-dimensional coordinates of the feature points, x_11 is the two-dimensional image coordinate of a three-dimensional feature point observed at C_1, and the other two-dimensional image coordinates x_12, x_13 are defined likewise.
The BA optimization problem is usually solved with the LM (Levenberg-Marquardt) method, optimizing the camera poses P and the three-dimensional feature point coordinates X. At this point all the steps of this patent are complete, and the camera's own motion poses and the map constructed from the feature points are obtained.
Fig. 7 is the flowchart of this patent: the left and right input event camera data are motion-compensated and time-aligned to reconstruct image forms similar to those of a conventional camera; the detection and tracking of the corresponding feature points on the reconstructed images, the PnP pose calculation, the triangulated depth calculation, the IMU pre-integration and the back-end optimization can then be completed with existing methods, and finally the map consisting of the camera's own poses and the optimized feature point coordinates is output.
In specific implementation, the method can be realized as an automatic process using computer software technology, and a corresponding system device implementing the method process also falls within the protection scope of the invention.
It should be understood that the above-mentioned embodiments are described in some detail, and not intended to limit the scope of the invention, and those skilled in the art will be able to make alterations and modifications without departing from the scope of the invention as defined by the appended claims.

Claims (5)

1. A SLAM method based on a binocular event camera is characterized by comprising the following steps:
step 1, performing motion compensation on input left and right event camera data by utilizing IMU assistance to obtain corresponding reconstructed images; the motion compensation using IMU assistance is implemented as follows,
setting the event frame start time as t_ref; when motion compensation is performed, the start of the event frame is taken as the reference frame; for an event point e_j in the accumulation window, noting its corresponding timestamp as t_j, the relative pose between t_ref and t_j is obtained by IMU integration and applied as the transform $T_{\mathrm{ref} \leftarrow t_j}$ that maps coordinates at time t_j into the reference frame; the coordinate x_j of e_j is projected into the reference coordinate system as x'_j:

$$x'_j = \pi\!\left(K\, T_{\mathrm{ref} \leftarrow t_j}\, Z(x_j)\, K^{-1}\, \tilde{x}_j\right)$$

wherein K is the camera intrinsic matrix, K^{-1} is its inverse, $\tilde{x}_j$ is the homogeneous pixel coordinate of x_j, $\pi(\cdot)$ is the perspective projection onto the image plane, and Z(x_j) is the depth of the event point, which is replaced by the median of the depths of the adjacent three-dimensional space points;
step 2, respectively carrying out feature point detection and tracking on corresponding reconstructed images input by the left event camera and the right event camera;
step 3, according to the result obtained in the step 2, triangularization calculation is carried out on the detected and tracked feature points to obtain three-dimensional coordinate points corresponding to the target and pose changes among images, and the camera pose is calculated by using a PnP method;
and 4, performing back-end BA optimization by combining IMU pre-integration to obtain a camera motion track and scene mapping information.
2. The binocular event camera based SLAM method of claim 1, wherein: the time alignment is performed when the image is reconstructed in step 1, which is realized as follows,
1) the left event camera accumulates the event points within a 30 ms window into a binary image frame, motion compensation is performed, and the time of the first event point within the 30 ms window is taken as the timestamp of the left event camera's reconstructed image frame;
2) in the event stream data of the right event camera, the event point whose time is closest to the timestamp of the left camera's reconstructed image frame is searched for; taking the time of the found event point as the start time, the right event camera accumulates the event points of a 30 ms window into a binary image frame, and motion compensation is performed.
3. The binocular event camera based SLAM method of claim 1 or 2, wherein: in step 2, a Shi-Tomasi method is adopted to realize feature point detection.
4. The binocular event camera based SLAM method of claim 1 or 2, wherein: in the step 2, tracking is realized by adopting a Kanade-Lucas-Tomasi method.
5. A SLAM system based on a binocular event camera, characterized in that: the system is configured to perform the binocular event camera based SLAM method according to any one of claims 1 to 4.
CN202010647021.8A 2020-07-07 2020-07-07 SLAM method and system based on binocular event camera Pending CN111899276A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010647021.8A CN111899276A (en) 2020-07-07 2020-07-07 SLAM method and system based on binocular event camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010647021.8A CN111899276A (en) 2020-07-07 2020-07-07 SLAM method and system based on binocular event camera

Publications (1)

Publication Number Publication Date
CN111899276A true CN111899276A (en) 2020-11-06

Family

ID=73191664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010647021.8A Pending CN111899276A (en) 2020-07-07 2020-07-07 SLAM method and system based on binocular event camera

Country Status (1)

Country Link
CN (1) CN111899276A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631314A (en) * 2021-03-15 2021-04-09 季华实验室 Robot control method and system based on multi-line laser radar and event camera SLAM
CN112809679A (en) * 2021-01-25 2021-05-18 清华大学深圳国际研究生院 Method and device for grabbing deformable object and computer readable storage medium
CN112967316A (en) * 2021-03-05 2021-06-15 中国科学技术大学 Motion compensation optimization method and system for 3D multi-target tracking
CN114022949A (en) * 2021-09-27 2022-02-08 中国电子科技南湖研究院 Event camera motion compensation method and device based on motion model
CN115997234A (en) * 2020-12-31 2023-04-21 华为技术有限公司 Pose estimation method and related device
CN116389682A (en) * 2023-03-07 2023-07-04 华中科技大学 Dual-event camera synchronous acquisition system and noise event suppression method
CN117372548A (en) * 2023-12-06 2024-01-09 北京水木东方医用机器人技术创新中心有限公司 Tracking system and camera alignment method, device, equipment and storage medium
CN117739996A (en) * 2024-02-21 2024-03-22 西北工业大学 Autonomous positioning method based on event camera inertial tight coupling

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665540A (en) * 2018-03-16 2018-10-16 浙江工业大学 Robot localization based on binocular vision feature and IMU information and map structuring system
CN110415344A (en) * 2019-06-24 2019-11-05 武汉大学 Motion compensation process based on event camera
CN111340851A (en) * 2020-05-19 2020-06-26 北京数字绿土科技有限公司 SLAM method based on binocular vision and IMU fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665540A (en) * 2018-03-16 2018-10-16 浙江工业大学 Robot localization based on binocular vision feature and IMU information and map structuring system
CN110415344A (en) * 2019-06-24 2019-11-05 武汉大学 Motion compensation process based on event camera
CN111340851A (en) * 2020-05-19 2020-06-26 北京数字绿土科技有限公司 SLAM method based on binocular vision and IMU fusion

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115997234A (en) * 2020-12-31 2023-04-21 华为技术有限公司 Pose estimation method and related device
CN112809679A (en) * 2021-01-25 2021-05-18 清华大学深圳国际研究生院 Method and device for grabbing deformable object and computer readable storage medium
CN112967316A (en) * 2021-03-05 2021-06-15 中国科学技术大学 Motion compensation optimization method and system for 3D multi-target tracking
CN112967316B (en) * 2021-03-05 2022-09-06 中国科学技术大学 Motion compensation optimization method and system for 3D multi-target tracking
CN112631314A (en) * 2021-03-15 2021-04-09 季华实验室 Robot control method and system based on multi-line laser radar and event camera SLAM
CN112631314B (en) * 2021-03-15 2021-06-04 季华实验室 Robot control method and system based on multi-line laser radar and event camera SLAM
CN114022949A (en) * 2021-09-27 2022-02-08 中国电子科技南湖研究院 Event camera motion compensation method and device based on motion model
CN116389682A (en) * 2023-03-07 2023-07-04 华中科技大学 Dual-event camera synchronous acquisition system and noise event suppression method
CN116389682B (en) * 2023-03-07 2024-02-06 华中科技大学 Dual-event camera synchronous acquisition system and noise event suppression method
CN117372548A (en) * 2023-12-06 2024-01-09 北京水木东方医用机器人技术创新中心有限公司 Tracking system and camera alignment method, device, equipment and storage medium
CN117372548B (en) * 2023-12-06 2024-03-22 北京水木东方医用机器人技术创新中心有限公司 Tracking system and camera alignment method, device, equipment and storage medium
CN117739996A (en) * 2024-02-21 2024-03-22 西北工业大学 Autonomous positioning method based on event camera inertial tight coupling
CN117739996B (en) * 2024-02-21 2024-04-30 西北工业大学 Autonomous positioning method based on event camera inertial tight coupling

Similar Documents

Publication Publication Date Title
CN111899276A (en) SLAM method and system based on binocular event camera
CN110070615B (en) Multi-camera cooperation-based panoramic vision SLAM method
Zhu et al. The multivehicle stereo event camera dataset: An event camera dataset for 3D perception
JP6768156B2 (en) Virtually enhanced visual simultaneous positioning and mapping systems and methods
CN107888828B (en) Space positioning method and device, electronic device, and storage medium
US10825197B2 (en) Three dimensional position estimation mechanism
US10260862B2 (en) Pose estimation using sensors
CN110533719B (en) Augmented reality positioning method and device based on environment visual feature point identification technology
WO2018142496A1 (en) Three-dimensional measuring device
CN110310362A (en) High dynamic scene three-dimensional reconstruction method, system based on depth map and IMU
CN109540126A (en) A kind of inertia visual combination air navigation aid based on optical flow method
CN108700946A (en) System and method for parallel ranging and fault detect and the recovery of building figure
WO2015134795A2 (en) Method and system for 3d capture based on structure from motion with pose detection tool
US11262837B2 (en) Dual-precision sensor system using high-precision sensor data to train low-precision sensor data for object localization in a virtual environment
CN110139031B (en) Video anti-shake system based on inertial sensing and working method thereof
KR20150013709A (en) A system for mixing or compositing in real-time, computer generated 3d objects and a video feed from a film camera
US20110249095A1 (en) Image composition apparatus and method thereof
CN111798485B (en) Event camera optical flow estimation method and system enhanced by IMU
KR20180030446A (en) Method and device for blurring a virtual object in a video
CN110544278B (en) Rigid body motion capture method and device and AGV pose capture system
Bapat et al. Towards kilo-hertz 6-dof visual tracking using an egocentric cluster of rolling shutter cameras
CN118135526A (en) Visual target recognition and positioning method for four-rotor unmanned aerial vehicle based on binocular camera
Huttunen et al. A monocular camera gyroscope
CN112432653B (en) Monocular vision inertial odometer method based on dotted line characteristics
US10540809B2 (en) Methods and apparatus for tracking a light source in an environment surrounding a device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106