CN111354043A - Three-dimensional attitude estimation method and device based on multi-sensor fusion - Google Patents

Three-dimensional attitude estimation method and device based on multi-sensor fusion

Info

Publication number
CN111354043A
CN111354043A (application CN202010108437.2A)
Authority
CN
China
Prior art keywords
objective function
depth image
point
imu
measured object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010108437.2A
Other languages
Chinese (zh)
Inventor
蔡国榕
吴则彪
陈桂锟
沈启金
王宗跃
吴云东
徐焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jimei University
Original Assignee
Jimei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jimei University filed Critical Jimei University
Priority to CN202010108437.2A priority Critical patent/CN111354043A/en
Publication of CN111354043A publication Critical patent/CN111354043A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G06T7/337Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Abstract

The invention provides a three-dimensional attitude estimation method and device based on multi-sensor fusion. The method comprises the following steps: acquiring an RGB image of a measured object, extracting FAST feature points and tracking the feature points; collecting IMU measurements of the measured object and integrating them to obtain the position, velocity and rotation of the measured object at the current moment; acquiring a depth image of the measured object and processing the depth image to obtain the iterative closest point; constructing an objective function based on the prior information, the iterative closest point, the IMU measurements and the visual reprojection; and solving the objective function for its minimum by an iterative method to obtain the optimal pose of the object. The method performs attitude estimation based on multiple sensors, constructs an objective function for attitude estimation and solves it with a least-squares method; the inputs of the function are fed through a sliding window of a fixed size, which reduces the solving time and improves the pose estimation accuracy.

Description

Three-dimensional attitude estimation method and device based on multi-sensor fusion
Technical Field
The invention relates to the technical field of three-dimensional reconstruction, in particular to a three-dimensional attitude estimation method and device based on multi-sensor fusion.
Background
The attitude estimation problem is that of determining the position and orientation of a three-dimensional target object. Attitude estimation has applications in many areas, such as robot vision, motion tracking and single-camera calibration. The sensors used for attitude estimation differ from field to field.
Model-based methods typically estimate the attitude using the geometric relationships of an object or of its feature points. The basic idea is to represent the structure and shape of the object with a geometric model, to establish a correspondence between the model and the image by extracting certain object features, and then to estimate the spatial attitude of the object by geometric or other methods. The model used here may be a simple geometric form, such as a plane, a cylinder or another geometric body, or a three-dimensional model obtained by laser scanning or other means. A model-based attitude estimation method compares the real image with a synthetic image to compute their similarity and update the attitude of the object. To avoid an optimization search over the global state space, existing model-based methods generally reduce the optimization problem to a matching problem over several local features, and therefore depend on accurate detection of those local features. When the noise is large and accurate local features cannot be extracted, the robustness of such methods suffers greatly.
Learning-based methods learn the correspondence between two-dimensional observations and three-dimensional attitudes from training samples acquired in advance under different attitudes by means of machine learning, apply the learned decision rule or regression function to a sample, and take the result as the attitude estimate of that sample. Learning-based methods generally use global observation features, do not need to detect or identify local features of the object, and are therefore more robust. Their disadvantage is that the precision and continuity of the attitude estimate cannot be guaranteed, because the dense sampling required for continuous estimation in a high-dimensional space cannot be obtained.
Disclosure of Invention
The present invention provides the following technical solutions to overcome the above-mentioned drawbacks in the prior art.
A three-dimensional attitude estimation method based on multi-sensor fusion comprises the following steps:
an RGB image acquisition step, wherein an image sensor is used for acquiring an RGB image of a measured object, FAST characteristic points of the RGB image are extracted, and a KLT sparse optical flow method is used for tracking the characteristic points;
an IMU measurement value acquisition step, namely acquiring an IMU measurement value of a measured object by using an IMU sensor, and integrating the IMU measurement value to obtain the position, the speed and the rotation of the measured object at the current moment;
a depth image obtaining step, namely obtaining a depth image of the measured object by using a depth image sensor, and processing the depth image by using an Iterative Closest Point (ICP) algorithm to obtain an iterative closest point;
an objective function construction step, namely constructing an objective function based on prior information, an iteration closest point, an IMU (inertial measurement Unit) measurement value and visual reprojection;
and a pose estimation step, namely solving the minimum value of the objective function by using an iterative method to obtain the optimal pose of the object.
Still further, the method comprises: a loop detection step of detecting whether a loop occurs for the current frame and, if so, performing global pose graph optimization to correct the positioning trajectory of the object so as to eliminate accumulated errors.
Further, the objective function is:

min(R,t){ PP + ∑||D||² + ∑||I||² + ∑||V||² };

the most suitable motion parameters, namely the rotation parameter R and the displacement parameter t, are obtained by the least squares method by making the sum of the four residual terms PP, ∑||D||², ∑||I||² and ∑||V||² the lowest, where PP represents the prior distribution of the prior information, D represents the iterative closest point residual term, I represents the difference in the amounts of change of position, velocity, rotation, gyroscope bias and acceleration bias between two frames of the object, and V represents the reprojection error.
Further, the least-squares problem is solved based on a sliding window: the input parameters required by the least-squares calculation are contained in the sliding window, and every error term of the least-squares problem except the prior information is provided by values within the sliding window.
Further, the key frame in the RGB image is determined based on the following conditions:
the difference between the average parallax of the frame and that of the previous keyframe exceeds a first threshold;
alternatively, the number of feature points tracked by the frame is less than a second threshold.
Further, the spatial geometry is derived from the depth image to estimate the transformation matrix between two frames, and the Iterative Closest Point (ICP) residual term D uses the point-to-plane distance:

D = (T·P_j − P_i)·n_i

where T is the rigid transformation composed of the rotation parameter R and the displacement parameter t, i.e. the motion parameters between two frames, P_j and P_i are matched three-dimensional points in the depth images, and n_i is the corresponding normal vector of the previous frame;

the reprojection error V serves as a visual constraint and is defined on the tangent plane of the unit sphere:

V = [b_1, b_2]ᵀ·(P̂ − P/||P||)

where P̂ is the predicted unit vector of the three-dimensional point after the rigid-body transformation of the feature point, P is the three-dimensional point obtained by back-projecting the pixel coordinates of the feature point through the pinhole camera model, and b_1 and b_2 are any two orthogonal bases of the tangent plane corresponding to P.
The invention also provides a three-dimensional attitude estimation device based on multi-sensor fusion, which comprises:
an RGB image acquisition unit, which is used for acquiring an RGB image of a measured object by using an image sensor, extracting FAST characteristic points of the RGB image and tracking the characteristic points by using a KLT sparse optical flow method;
the IMU measurement value acquisition unit is used for acquiring an IMU measurement value of the measured object by using an IMU sensor and integrating the IMU measurement value to obtain the position, the speed and the rotation of the measured object at the current moment;
the depth image acquisition unit is used for acquiring a depth image of the measured object by using a depth image sensor and processing the depth image by using an Iterative Closest Point (ICP) algorithm to obtain an iterative closest point;
the objective function construction unit is used for constructing an objective function based on the prior information, the iteration closest point, the IMU measurement value and the visual re-projection;
and the pose estimation unit is used for solving the minimum value of the objective function by using an iterative method to obtain the optimal pose of the object.
Furthermore, the device comprises a loop detection unit for detecting whether a loop occurs for the current frame; if so, global pose graph optimization is performed to correct the positioning trajectory of the object so as to eliminate accumulated errors.
Further, the objective function is:

min(R,t){ PP + ∑||D||² + ∑||I||² + ∑||V||² };

the most suitable motion parameters, namely the rotation parameter R and the displacement parameter t, are obtained by the least squares method by making the sum of the four residual terms PP, ∑||D||², ∑||I||² and ∑||V||² the lowest, where PP represents the prior distribution of the prior information, D represents the iterative closest point residual term, I represents the difference in the amounts of change of position, velocity, rotation, gyroscope bias and acceleration bias between two frames of the object, and V represents the reprojection error.
Further, the least-squares problem is solved based on a sliding window: the input parameters required by the least-squares calculation are contained in the sliding window, and every error term of the least-squares problem except the prior information is provided by values within the sliding window.
Further, the key frame in the RGB image is determined based on the following conditions:
the difference between the average parallax of the frame and that of the previous keyframe exceeds a first threshold;
alternatively, the number of feature points tracked by the frame is less than a second threshold.
Further, the spatial geometry is derived from the depth image to estimate the transformation matrix between two frames, and the Iterative Closest Point (ICP) residual term D uses the point-to-plane distance:

D = (T·P_j − P_i)·n_i

where T is the rigid transformation composed of the rotation parameter R and the displacement parameter t, i.e. the motion parameters between two frames, P_j and P_i are matched three-dimensional points in the depth images, and n_i is the corresponding normal vector of the previous frame;

the reprojection error V serves as a visual constraint and is defined on the tangent plane of the unit sphere:

V = [b_1, b_2]ᵀ·(P̂ − P/||P||)

where P̂ is the predicted unit vector of the three-dimensional point after the rigid-body transformation of the feature point, P is the three-dimensional point obtained by back-projecting the pixel coordinates of the feature point through the pinhole camera model, and b_1 and b_2 are any two orthogonal bases of the tangent plane corresponding to P.
The invention has the technical effects that: the invention discloses a three-dimensional attitude estimation method based on multi-sensor fusion, comprising: acquiring an RGB image of a measured object with an image sensor, extracting FAST (Features from Accelerated Segment Test) feature points of the RGB image and tracking the feature points with the KLT sparse optical flow method; collecting IMU measurements of the measured object with an IMU sensor and integrating them to obtain the position, velocity and rotation of the measured object at the current moment; acquiring a depth image of the measured object with a depth image sensor and processing the depth image with the Iterative Closest Point (ICP) algorithm to obtain the iterative closest point; constructing an objective function based on the prior information, the iterative closest point, the IMU measurements and the visual reprojection; and solving the objective function for its minimum by an iterative method to obtain the optimal pose of the object. The method performs attitude estimation based on the RGB images, IMU measurements and depth images of the moving object acquired by multiple sensors, constructs an objective function for attitude estimation and solves it with a least-squares method; the inputs of the function are fed through a sliding window whose size is fixed, which reduces the solving time and improves the pose estimation accuracy. A loop detection process is also designed: when a loop occurs, global pose graph optimization is performed to correct the positioning trajectory of the object and eliminate accumulated errors. Corresponding residual functions are also specifically designed.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
Fig. 1 is a flowchart of a three-dimensional attitude estimation method based on multi-sensor fusion according to an embodiment of the present invention.
Fig. 2 is a block diagram of a three-dimensional attitude estimation apparatus based on multi-sensor fusion according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows a three-dimensional attitude estimation method based on multi-sensor fusion, which comprises the following steps:
an RGB image acquisition step S101 of acquiring an RGB image of a measurement object using an image sensor, extracting FAST feature points of the RGB image, and tracking the feature points using a KLT sparse optical flow method; the image sensor can be a camera, a video camera and the like, and can acquire a plurality of frames of images of the measured object.
Every time a new RGB image arrives, the system automatically extracts FAST feature points and then uses the KLT sparse optical flow method to track them. To ensure tracking accuracy while keeping the computational cost relatively small, the number of tracked feature points can be set in the range of 100 to 300, and new feature points are detected during tracking so that the number of tracked feature points always stays within this range.
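As an illustration only, a minimal sketch of this front end using OpenCV is given below; the detector threshold, optical-flow window size and the 100-300 feature budget are assumptions chosen to match the description, not a prescribed configuration.

```python
# Sketch of FAST corner extraction and KLT sparse optical flow tracking with OpenCV,
# assuming grayscale frames `prev_gray` / `curr_gray` as numpy arrays.
import cv2
import numpy as np

MAX_FEATURES = 300      # upper bound on tracked features (100-300 per the description)
MIN_FEATURES = 100      # replenish with new detections below this count

fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def detect_fast(gray, max_pts=MAX_FEATURES):
    # Keep the strongest FAST responses, shaped (N, 1, 2) as OpenCV expects
    kps = sorted(fast.detect(gray, None), key=lambda k: k.response, reverse=True)[:max_pts]
    return np.float32([k.pt for k in kps]).reshape(-1, 1, 2)

def track(prev_gray, curr_gray, prev_pts):
    # KLT sparse optical flow; keep only the points tracked successfully
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, prev_pts, None, **lk_params)
    good = status.reshape(-1) == 1
    matched_prev, matched_curr = prev_pts[good], curr_pts[good]
    # top up with fresh FAST detections so the track count stays in the 100-300 range
    tracked = matched_curr
    if len(tracked) < MIN_FEATURES:
        fresh = detect_fast(curr_gray, MAX_FEATURES - len(tracked))
        if len(fresh):
            tracked = np.vstack([tracked, fresh])
    return matched_prev, matched_curr, tracked
```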
An IMU measurement value acquisition step S102, wherein an IMU sensor is used to collect IMU measurements of the measured object, and the IMU measurements are integrated to obtain the position, velocity and rotation of the measured object at the current moment; the IMU sensor is also called an inertial sensor, and the position, velocity and rotation at the current moment can be obtained by integrating the discrete IMU measurements, so that the motion of the object carrying the sensor can be estimated.
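A minimal sketch of this integration step is shown below, assuming the accelerometer measures specific force in the body frame, that biases and noise are ignored, and that the gravity constant and the sample list are supplied by the caller.

```python
# Hedged sketch of integrating discrete IMU samples between two image timestamps
# to propagate position p, velocity v and rotation R.
import numpy as np
from scipy.spatial.transform import Rotation as Rot

GRAVITY = np.array([0.0, 0.0, -9.81])   # assumed world-frame gravity vector

def propagate(p, v, R, imu_samples):
    """imu_samples: list of (dt, acc_body, gyro_body) tuples between two frames."""
    for dt, acc, gyro in imu_samples:
        acc_world = R.apply(np.asarray(acc)) + GRAVITY   # body acceleration into world frame
        p = p + v * dt + 0.5 * acc_world * dt ** 2       # position update
        v = v + acc_world * dt                           # velocity update
        R = R * Rot.from_rotvec(np.asarray(gyro) * dt)   # rotation update via exponential map
    return p, v, R
```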
A depth image obtaining step S103, wherein a depth image sensor is used to obtain a depth image of the measured object, and the depth image is processed with the Iterative Closest Point (ICP) algorithm to obtain the iterative closest point; the depth image sensor can be a depth camera or a depth video camera, and the geometric structure of the measured object's space can be obtained from the depth image, so that the transformation matrix between two frames can be estimated.
In addition, the system estimates the real scale of the scene through the depth image. The depth image provides the real distance of each pixel relative to the camera optical center, so that pixel coordinates can be transformed to space coordinates through the pinhole camera model and the real scale of the scene can be recovered.
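A short sketch of this pixel-to-space transformation through the pinhole model follows, assuming a metric depth map and known intrinsics fx, fy, cx, cy (names chosen for illustration).

```python
# Back-project a depth image to metric 3-D points with a pinhole camera model.
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """depth: (H, W) array of metric depths; returns an (N, 3) point cloud of valid pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy          # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[z > 0]           # keep pixels with valid depth only
```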
Although the acquisition frequency of the IMU measurements in S102 is higher than that of the depth images in S103, the IMU measurements accumulated within the time difference between two adjacent depth-image frames are generally not sufficient for integration, because the IMU data and the image data need to be aligned. The invention therefore accumulates IMU measurements over the acquisition interval of four consecutive image frames, so as to avoid fluctuations in the integration result caused by insufficient IMU data. Since processing depth images that are four frames apart yields an iterative closest point result with a larger error, the invention retains the iterative closest point results of adjacent frames and combines the results of four consecutive frames as the iterative closest point result over those four frames.
An objective function construction step S104, constructing an objective function based on the prior information, the iterative closest point, the IMU measurements and the visual reprojection. Each sensor has environments to which it is not suited, and the environment during motion cannot be controlled manually; therefore, to improve robustness across multiple scenes and obtain more accurate positioning, the invention fuses data from multiple sensors, improving the accuracy and robustness of the pose estimation.
The invention comprehensively uses three kinds of sensor information, namely the RGB image, the depth image and the IMU measurements, to estimate the optimal attitude between two frames in a tightly-coupled manner, and solves the resulting least-squares problem iteratively.
Specifically, the objective function is:

min(R,t){ PP + ∑||D||² + ∑||I||² + ∑||V||² };

the most suitable motion parameters, namely the rotation parameter R and the displacement parameter t, are obtained by the least squares method by making the sum of the four residual terms PP, ∑||D||², ∑||I||² and ∑||V||² the lowest, where PP represents the prior distribution of the prior information, D represents the iterative closest point residual term, I represents the difference in the amounts of change of position, velocity, rotation, gyroscope bias and acceleration bias between two frames of the object, and V represents the reprojection error; that is, PP, D, I and V may also be referred to as the four constraints, i.e. the four residual terms, of the objective function.
The states outside the sliding window can also provide certain constraint information. To obtain a more faithful constraint relation, these states are also taken into account (but not optimized), and the system uses a marginalization technique to convert their constraint information into a prior distribution over the variables to be optimized, namely the term PP in the least-squares problem.
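For illustration, the four residual blocks can be stacked into a single least-squares problem over the sliding-window states and handed to a generic solver, as in the sketch below; the residual callbacks, state packing and iteration cap are assumptions, not the patented solver.

```python
# Sketch of stacking the prior PP, ICP D, IMU I and reprojection V residual blocks
# into one vector-valued residual over the flattened window state x.
import numpy as np
from scipy.optimize import least_squares

def build_residual(prior_residual, icp_residuals, imu_residuals, vis_residuals):
    """Each argument is a callable (or list of callables) mapping the state vector x
    to a residual array; all callbacks here are hypothetical placeholders."""
    def residual(x):
        blocks = [np.atleast_1d(prior_residual(x))]
        blocks += [np.atleast_1d(r(x)) for r in icp_residuals]
        blocks += [np.atleast_1d(r(x)) for r in imu_residuals]
        blocks += [np.atleast_1d(r(x)) for r in vis_residuals]
        return np.concatenate(blocks)
    return residual

# Example usage (all names hypothetical):
# fun = build_residual(pp_term, icp_terms, imu_terms, vis_terms)
# sol = least_squares(fun, x0, max_nfev=8)  # x0 from IMU propagation; small cap for real time
```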
For the spatial geometry derived from the depth images to estimate the transformation matrix between two frames, the Iterative Closest Point (ICP) residual term D uses the point-to-plane distance:

D = (T·P_j − P_i)·n_i

where T is the rigid transformation composed of the rotation parameter R and the displacement parameter t, i.e. the motion parameters between two frames, P_j and P_i are matched three-dimensional points in the depth images, and n_i is the corresponding normal vector of the previous frame.
According to the calculation characteristics of the iteration closest point, the system adopts the GPU to improve the calculation efficiency of the iteration closest point. Because the depth image similarity between two frames is high, the number of iterations is limited to 5 times at most in order to avoid unnecessary iterations.
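A sketch of evaluating the point-to-plane residual for correspondences that have already been associated by ICP is given below, assuming the rotation R and translation t are supplied as a 3x3 matrix and a 3-vector; the array layout is an assumption.

```python
# Point-to-plane residuals for matched depth points.
import numpy as np

def point_to_plane_residuals(R, t, pts_j, pts_i, normals_i):
    """R: (3,3) rotation, t: (3,) translation, pts_j/pts_i: (N,3) matched points,
    normals_i: (N,3) normals of the target frame."""
    transformed = pts_j @ R.T + t                              # apply T = [R | t] to source points
    return np.sum((transformed - pts_i) * normals_i, axis=1)   # signed point-to-plane distances
```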
For the IMU constraint I, given two frames of IMU measurements (acceleration and angular velocity), IMU pre-integration can compute position, velocity, and rotation. The system defines the IMU error term as the difference in the amount of change in position, velocity, rotation, gyroscope bias, and acceleration bias between two frames.
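A hedged sketch of such an IMU error term follows, assuming pre-integrated quantities dp, dv, dR between frames k and k+1 have already been computed; the state layout and sign conventions are assumptions.

```python
# IMU residual as the difference between the pre-integrated state change and the change
# implied by the optimised states of frames k and k+1 (biases modelled as random walks).
import numpy as np

def imu_residual(state_k, state_k1, preint, gravity, dt):
    """Each state: dict with p, v (arrays), R (scipy Rotation), bg, ba.
    preint: dict with dp, dv (arrays) and dR (scipy Rotation)."""
    Rk_inv = state_k["R"].inv()
    r_p = Rk_inv.apply(state_k1["p"] - state_k["p"] - state_k["v"] * dt
                       - 0.5 * gravity * dt ** 2) - preint["dp"]
    r_v = Rk_inv.apply(state_k1["v"] - state_k["v"] - gravity * dt) - preint["dv"]
    r_R = (preint["dR"].inv() * (Rk_inv * state_k1["R"])).as_rotvec()
    r_bg = state_k1["bg"] - state_k["bg"]   # gyroscope bias change
    r_ba = state_k1["ba"] - state_k["ba"]   # accelerometer bias change
    return np.concatenate([r_p, r_v, r_R, r_bg, r_ba])
```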
For the reprojection error V, as a visual constraint, it is defined on the tangent plane of the unit sphere:

V = [b_1, b_2]ᵀ·(P̂ − P/||P||)

where P̂ is the predicted unit vector of the three-dimensional point after the rigid-body transformation of the feature point, P is the three-dimensional point obtained by back-projecting the pixel coordinates of the feature point through the pinhole camera model, and b_1 and b_2 are any two orthogonal bases of the tangent plane corresponding to P. For projection and back-projection, π(·) is the projection function derived from the pinhole camera model, mapping a three-dimensional space point to a two-dimensional image point, defined as:

π([X, Y, Z]ᵀ) = [f_x·X/Z + c_x, f_y·Y/Z + c_y]ᵀ

where (f_x, f_y) and (c_x, c_y) are the camera intrinsic parameters, which can be obtained by a checkerboard-based camera intrinsic calibration method. The inverse operation of the projection is the back-projection.
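The projection and the tangent-plane residual can be sketched as follows; the choice of tangent basis b1, b2 is arbitrary (any orthonormal pair spanning the tangent plane works), and the helper names are assumptions.

```python
# Pinhole projection and a reprojection residual expressed on the tangent plane of the unit sphere.
import numpy as np

def project(P, fx, fy, cx, cy):
    """Pinhole projection of a 3-D point P = (X, Y, Z) to pixel coordinates."""
    X, Y, Z = P
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def tangent_basis(u):
    """Two orthonormal vectors spanning the plane orthogonal to unit vector u."""
    a = np.array([1.0, 0.0, 0.0]) if abs(u[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    b1 = np.cross(u, a)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(u, b1)
    return b1, b2

def reprojection_residual(P_pred, P_obs):
    """P_pred: 3-D point predicted after the rigid transform; P_obs: back-projected observation."""
    u_pred = P_pred / np.linalg.norm(P_pred)
    u_obs = P_obs / np.linalg.norm(P_obs)
    b1, b2 = tangent_basis(u_obs)
    diff = u_pred - u_obs
    return np.array([b1 @ diff, b2 @ diff])   # 2-D residual on the tangent plane
```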
And a pose estimation step S105, solving the minimum value of the objective function by using an iterative method to obtain the optimal pose of the object.
All residual terms are added together to form the objective function, which is minimized with an iterative method to obtain the optimal pose. The direction of each iteration is determined by the first derivative of the objective function. To ensure the real-time performance of the system, the maximum number of iterations for solving the pose is set to 8.
In one embodiment, before the method is carried out, initial values are provided for it by roughly estimating the states on which the system depends (gravity vector, gyroscope bias, velocity, etc.) through a sliding-window-based structure from motion (SfM) module and a visual-inertial correction module.
In one embodiment, the method further comprises: a loop detection step S106, detecting whether a loop occurs for the current frame and, if so, performing global pose graph optimization to correct the positioning trajectory of the object so as to eliminate accumulated errors. The method uses loop detection based on a bag-of-words model: if the current frame is determined to be a keyframe, the system converts the RGB image into a vector according to the dictionary and submits the vector to loop detection, which judges whether a loop has occurred for the current frame and returns the result. If a loop is detected, global pose graph (Pose-Graph) optimization is performed to correct the positioning trajectory and eliminate accumulated errors.
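A simplified illustration of the bag-of-words check is given below, assuming a pre-trained visual vocabulary given as an array of descriptor centroids rather than a full DBoW-style dictionary; the similarity threshold is an assumption.

```python
# Simplified bag-of-words loop check over keyframe descriptor histograms.
import numpy as np

def bow_vector(descriptors, vocab):
    """Normalised histogram of nearest vocabulary words for a frame's feature descriptors.
    descriptors: (N, d) float array; vocab: (K, d) array of word centroids."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

def detect_loop(current_vec, keyframe_vecs, threshold=0.8):
    """Return the index of the best-matching past keyframe if similarity exceeds the threshold."""
    if not keyframe_vecs:
        return None
    sims = np.array([current_vec @ v for v in keyframe_vecs])
    best = int(np.argmax(sims))
    return best if sims[best] > threshold else None
```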
In one embodiment, the least-squares problem is solved based on a sliding window that contains the input parameters required for the least-squares calculation, and every error term except the prior information is provided by values within the sliding window. That is, the invention implements the least-squares problem in a sliding-window form: all the inputs required by the least squares (pixel points, three-dimensional space points, states and the like) are contained in the sliding window, and each error term (except the prior information) is provided by values in the sliding window. To limit the solving time of the least-squares problem, setting the size of the sliding window to 10 is sufficient to derive an accurate pose (attitude) estimate between two frames.
In one embodiment, the keyframes in the RGB image are determined based on the following conditions: the average parallax of the frame differs from that of the previous keyframe by more than a first threshold (for example, setting the first threshold to 1.0 gives good positioning performance in most scenes); or the number of feature points tracked by the frame is less than a second threshold (for example, setting the second threshold to 10 ensures high robustness of the system without generating too many keyframes).
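A small sketch of these two keyframe tests follows, using the example threshold values mentioned above (mean parallax 1.0, minimum of 10 tracked features); the input layout is an assumption.

```python
# Keyframe decision based on mean parallax against the last keyframe and tracked-feature count.
import numpy as np

PARALLAX_THRESHOLD = 1.0     # example first threshold from the description
MIN_TRACKED_FEATURES = 10    # example second threshold from the description

def is_keyframe(prev_kf_pts, curr_pts, num_tracked):
    """prev_kf_pts / curr_pts: (N, 2) matched pixel coordinates against the last keyframe."""
    if num_tracked < MIN_TRACKED_FEATURES:
        return True
    mean_parallax = np.mean(np.linalg.norm(curr_pts - prev_kf_pts, axis=1))
    return mean_parallax > PARALLAX_THRESHOLD
```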
The method performs attitude estimation based on the RGB images, IMU measurements and depth images of the moving object acquired by multiple sensors, constructs an objective function for attitude estimation and solves it with a least-squares method; the inputs of the function are fed through a sliding window whose size is fixed, which reduces the solving time and improves the pose estimation accuracy. A loop detection process is also designed: when a loop occurs, global pose graph optimization is performed to correct the positioning trajectory of the object and eliminate accumulated errors. Corresponding residual functions are also specifically designed.
Fig. 2 shows a three-dimensional attitude estimation device based on multi-sensor fusion of the present invention, which comprises:
an RGB image acquisition unit 201 for acquiring an RGB image of a measured object using an image sensor, extracting FAST feature points of the RGB image, and tracking the feature points using a KLT sparse optical flow method; the image sensor can be a camera, a video camera and the like, and can acquire a plurality of frames of images of the measured object.
Every time a new RGB image arrives, the system automatically extracts FAST feature points and then uses the KLT sparse optical flow method to track them. To ensure tracking accuracy while keeping the computational cost relatively small, the number of tracked feature points can be set in the range of 100 to 300, and new feature points are detected during tracking so that the number of tracked feature points always stays within this range.
An IMU measurement value acquisition unit 202, configured to collect IMU measurements of the measured object using an IMU sensor and integrate them to obtain the position, velocity and rotation of the measured object at the current moment; the IMU sensor is also called an inertial sensor, and the position, velocity and rotation at the current moment can be obtained by integrating the discrete IMU measurements, so that the motion of the object carrying the sensor can be estimated.
A depth image obtaining unit 203, configured to obtain a depth image of the measured object using a depth image sensor and process the depth image with the Iterative Closest Point (ICP) algorithm to obtain the iterative closest point; the depth image sensor can be a depth camera or a depth video camera, and the geometric structure of the measured object's space can be obtained from the depth image, so that the transformation matrix between two frames can be estimated.
An objective function construction unit 204, configured to construct an objective function based on the prior information, the iterative closest point, the IMU measurements and the visual reprojection. Each sensor has environments to which it is not suited, and the environment during motion cannot be controlled manually; therefore, to improve robustness across multiple scenes and obtain more accurate positioning, the invention fuses data from multiple sensors, improving the accuracy and robustness of the pose estimation.
The invention comprehensively uses three kinds of sensor information, namely the RGB image, the depth image and the IMU measurements, to estimate the optimal attitude between two frames in a tightly-coupled manner, and solves the resulting least-squares problem iteratively.
Specifically, the objective function is:

min(R,t){ PP + ∑||D||² + ∑||I||² + ∑||V||² };

the most suitable motion parameters, namely the rotation parameter R and the displacement parameter t, are obtained by the least squares method by making the sum of the four residual terms PP, ∑||D||², ∑||I||² and ∑||V||² the lowest, where PP represents the prior distribution of the prior information, D represents the iterative closest point residual term, I represents the difference in the amounts of change of position, velocity, rotation, gyroscope bias and acceleration bias between two frames of the object, and V represents the reprojection error; that is, PP, D, I and V may also be referred to as the four constraints, i.e. the four residual terms, of the objective function.
The states outside the sliding window can also provide certain constraint information. To obtain a more faithful constraint relation, these states are also taken into account (but not optimized), and the system uses a marginalization technique to convert their constraint information into a prior distribution over the variables to be optimized, namely the term PP in the least-squares problem.
For the spatial geometry derived from the depth images to estimate the transformation matrix between two frames, the Iterative Closest Point (ICP) residual term D uses the point-to-plane distance:

D = (T·P_j − P_i)·n_i

where T is the rigid transformation composed of the rotation parameter R and the displacement parameter t, i.e. the motion parameters between two frames, P_j and P_i are matched three-dimensional points in the depth images, and n_i is the corresponding normal vector of the previous frame.
According to the calculation characteristics of the iteration closest point, the system adopts the GPU to improve the calculation efficiency of the iteration closest point. Because the depth image similarity between two frames is high, the number of iterations is limited to 5 times at most in order to avoid unnecessary iterations.
For the IMU constraint I, given two frames of IMU measurements (acceleration and angular velocity), IMU pre-integration can compute position, velocity, and rotation. The system defines the IMU error term as the difference in the amount of change in position, velocity, rotation, gyroscope bias, and acceleration bias between two frames.
For the reprojection error V, as a visual constraint, it is defined on the tangent plane of the unit sphere:

V = [b_1, b_2]ᵀ·(P̂ − P/||P||)

where P̂ is the predicted unit vector of the three-dimensional point after the rigid-body transformation of the feature point, P is the three-dimensional point obtained by back-projecting the pixel coordinates of the feature point through the pinhole camera model, and b_1 and b_2 are any two orthogonal bases of the tangent plane corresponding to P. For projection and back-projection, π(·) is the projection function derived from the pinhole camera model, mapping a three-dimensional space point to a two-dimensional image point, defined as:

π([X, Y, Z]ᵀ) = [f_x·X/Z + c_x, f_y·Y/Z + c_y]ᵀ

where (f_x, f_y) and (c_x, c_y) are the camera intrinsic parameters, which can be obtained by a checkerboard-based camera intrinsic calibration method. The inverse operation of the projection is the back-projection.
And the pose estimation unit 205 is configured to solve the minimum value for the objective function by using an iterative method to obtain an optimal pose of the object.
All residual terms are added together to form the objective function, which is minimized with an iterative method to obtain the optimal pose. The direction of each iteration is determined by the first derivative of the objective function. To ensure the real-time performance of the system, the maximum number of iterations for solving the pose is set to 8.
In one embodiment, before the apparatus is put into operation, initial values are provided for it by roughly estimating the states on which the system depends (gravity vector, gyroscope bias, velocity, etc.) through a sliding-window-based structure from motion (SfM) module and a visual-inertial correction module.
In one embodiment, the apparatus further comprises: a loop detection unit 206, configured to detect whether a loop occurs for the current frame and, if so, perform global pose graph optimization to correct the positioning trajectory of the object so as to eliminate accumulated errors. The device uses loop detection based on a bag-of-words model: if the current frame is determined to be a keyframe, the system converts the RGB image into a vector according to the dictionary and submits the vector to loop detection, which judges whether a loop has occurred for the current frame and returns the result. If a loop is detected, global pose graph (Pose-Graph) optimization is performed to correct the positioning trajectory and eliminate accumulated errors.
In one embodiment, the least-squares problem is solved based on a sliding window that contains the input parameters required for the least-squares calculation, and every error term except the prior information is provided by values within the sliding window. That is, the invention implements the least-squares problem in a sliding-window form: all the inputs required by the least squares (pixel points, three-dimensional space points, states and the like) are contained in the sliding window, and each error term (except the prior information) is provided by values in the sliding window. To limit the solving time of the least-squares problem, setting the size of the sliding window to 10 is sufficient to derive an accurate pose (attitude) estimate between two frames.
In one embodiment, the keyframes in the RGB image are determined based on the following conditions: the average parallax of the frame differs from that of the previous keyframe by more than a first threshold (for example, setting the first threshold to 1.0 gives good positioning performance in most scenes); or the number of feature points tracked by the frame is less than a second threshold (for example, setting the second threshold to 10 ensures high robustness of the system without generating too many keyframes).
The device performs attitude estimation based on the RGB images, IMU measurements and depth images of the moving object acquired by multiple sensors, constructs an objective function for attitude estimation and solves it with a least-squares method; the inputs of the function are fed through a sliding window whose size is fixed, which reduces the solving time and improves the pose estimation accuracy. A loop detection process is also designed: when a loop occurs, global pose graph optimization is performed to correct the positioning trajectory of the object and eliminate accumulated errors. Corresponding residual functions are also specifically designed.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus the necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application, or the portions that contribute to the prior art, may essentially be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in some portions of the embodiments, of the present application.
Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the invention, which is intended to be covered by the appended claims.

Claims (12)

1. A three-dimensional attitude estimation method based on multi-sensor fusion is characterized by comprising the following steps:
an RGB image acquisition step, wherein an image sensor is used for acquiring an RGB image of a measured object, FAST characteristic points of the RGB image are extracted, and a KLT sparse optical flow method is used for tracking the characteristic points;
an IMU measurement value acquisition step, namely acquiring an IMU measurement value of a measured object by using an IMU sensor, and integrating the IMU measurement value to obtain the position, the speed and the rotation of the measured object at the current moment;
a depth image obtaining step, namely obtaining a depth image of the measured object by using a depth image sensor, and processing the depth image by using an Iterative Closest Point (ICP) algorithm to obtain an iterative closest point;
an objective function construction step, namely constructing an objective function based on prior information, an iteration closest point, an IMU (inertial measurement Unit) measurement value and visual reprojection;
and a pose estimation step, namely solving the minimum value of the objective function by using an iterative method to obtain the optimal pose of the object.
2. The method of claim 1, further comprising:
and a loop detection step, namely detecting whether a loop occurs in the current frame, if so, executing global pose graph optimization to correct the positioning track of the object so as to eliminate the accumulated error.
3. The method of claim 2, wherein the objective function is:

min(R,t){ PP + ∑||D||² + ∑||I||² + ∑||V||² };

the most suitable motion parameters, namely the rotation parameter R and the displacement parameter t, are obtained by the least squares method by making the sum of the four residual terms PP, ∑||D||², ∑||I||² and ∑||V||² the lowest, where PP represents the prior distribution of the prior information, D represents the iterative closest point residual term, I represents the difference in the amounts of change of position, velocity, rotation, gyroscope bias and acceleration bias between two frames of the object, and V represents the reprojection error.
4. The method of claim 3, wherein the least-squares problem is solved based on a sliding window: the input parameters required by the least-squares calculation are contained in the sliding window, and every error term of the least-squares problem except the prior information is provided by values within the sliding window.
5. The method of claim 4, wherein the key frame in the RGB image is determined based on the following conditions:
the difference between the average parallax of the frame and that of the previous keyframe exceeds a first threshold;
alternatively, the number of feature points tracked by the frame is less than a second threshold.
6. The method of claim 3, wherein

the spatial geometry is derived from the depth image to estimate the transformation matrix between two frames, and the Iterative Closest Point (ICP) residual term D uses the point-to-plane distance:

D = (T·P_j − P_i)·n_i

where T is the rigid transformation composed of the rotation parameter R and the displacement parameter t, i.e. the motion parameters between two frames, P_j and P_i are matched three-dimensional points in the depth images, and n_i is the corresponding normal vector of the previous frame;

the reprojection error V serves as a visual constraint and is defined on the tangent plane of the unit sphere:

V = [b_1, b_2]ᵀ·(P̂ − P/||P||)

where P̂ is the predicted unit vector of the three-dimensional point after the rigid-body transformation of the feature point, P is the three-dimensional point obtained by back-projecting the pixel coordinates of the feature point through the pinhole camera model, and b_1 and b_2 are any two orthogonal bases of the tangent plane corresponding to P.
7. A three-dimensional attitude estimation device based on multi-sensor fusion is characterized by comprising:
an RGB image acquisition unit, which is used for acquiring an RGB image of a measured object by using an image sensor, extracting FAST characteristic points of the RGB image and tracking the characteristic points by using a KLT sparse optical flow method;
the IMU measurement value acquisition unit is used for acquiring an IMU measurement value of the measured object by using an IMU sensor and integrating the IMU measurement value to obtain the position, the speed and the rotation of the measured object at the current moment;
the depth image acquisition unit is used for acquiring a depth image of the measured object by using a depth image sensor and processing the depth image by using an Iterative Closest Point (ICP) algorithm to obtain an iterative closest point;
the objective function construction unit is used for constructing an objective function based on the prior information, the iteration closest point, the IMU measurement value and the visual re-projection;
and the pose estimation unit is used for solving the minimum value of the objective function by using an iterative method to obtain the optimal pose of the object.
8. The apparatus of claim 7, further comprising:
and the loop detection unit is used for detecting whether a loop occurs to the current frame, and if so, performing global pose graph optimization to correct the positioning track of the object so as to eliminate the accumulated error.
9. The apparatus of claim 8, wherein the objective function is:

min(R,t){ PP + ∑||D||² + ∑||I||² + ∑||V||² };

the most suitable motion parameters, namely the rotation parameter R and the displacement parameter t, are obtained by the least squares method by making the sum of the four residual terms PP, ∑||D||², ∑||I||² and ∑||V||² the lowest, where PP represents the prior distribution of the prior information, D represents the iterative closest point residual term, I represents the difference in the amounts of change of position, velocity, rotation, gyroscope bias and acceleration bias between two frames of the object, and V represents the reprojection error.
10. The apparatus of claim 9, wherein the least-squares problem is solved based on a sliding window: the input parameters required by the least-squares calculation are contained in the sliding window, and every error term of the least-squares problem except the prior information is provided by values within the sliding window.
11. The apparatus of claim 10, wherein the key frame in the RGB image is determined based on the following conditions:
the difference between the average parallax of the frame and that of the previous keyframe exceeds a first threshold;
alternatively, the number of feature points tracked by the frame is less than a second threshold.
12. The apparatus of claim 9, wherein

the spatial geometry is derived from the depth image to estimate the transformation matrix between two frames, and the Iterative Closest Point (ICP) residual term D uses the point-to-plane distance:

D = (T·P_j − P_i)·n_i

where T is the rigid transformation composed of the rotation parameter R and the displacement parameter t, i.e. the motion parameters between two frames, P_j and P_i are matched three-dimensional points in the depth images, and n_i is the corresponding normal vector of the previous frame;

the reprojection error V serves as a visual constraint and is defined on the tangent plane of the unit sphere:

V = [b_1, b_2]ᵀ·(P̂ − P/||P||)

where P̂ is the predicted unit vector of the three-dimensional point after the rigid-body transformation of the feature point, P is the three-dimensional point obtained by back-projecting the pixel coordinates of the feature point through the pinhole camera model, and b_1 and b_2 are any two orthogonal bases of the tangent plane corresponding to P.
CN202010108437.2A 2020-02-21 2020-02-21 Three-dimensional attitude estimation method and device based on multi-sensor fusion Pending CN111354043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010108437.2A CN111354043A (en) 2020-02-21 2020-02-21 Three-dimensional attitude estimation method and device based on multi-sensor fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010108437.2A CN111354043A (en) 2020-02-21 2020-02-21 Three-dimensional attitude estimation method and device based on multi-sensor fusion

Publications (1)

Publication Number Publication Date
CN111354043A true CN111354043A (en) 2020-06-30

Family

ID=71194092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010108437.2A Pending CN111354043A (en) 2020-02-21 2020-02-21 Three-dimensional attitude estimation method and device based on multi-sensor fusion

Country Status (1)

Country Link
CN (1) CN111354043A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934827A (en) * 2015-12-31 2017-07-07 杭州华为数字技术有限公司 The method for reconstructing and device of three-dimensional scenic
US20170278231A1 (en) * 2016-03-25 2017-09-28 Samsung Electronics Co., Ltd. Device for and method of determining a pose of a camera
CN108876897A (en) * 2018-04-20 2018-11-23 杭州电子科技大学 The quickly scene three-dimensional reconstruction method under movement
CN109993113A (en) * 2019-03-29 2019-07-09 东北大学 A kind of position and orientation estimation method based on the fusion of RGB-D and IMU information

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112230242A (en) * 2020-09-30 2021-01-15 深兰人工智能(深圳)有限公司 Pose estimation system and method
CN112230242B (en) * 2020-09-30 2023-04-25 深兰人工智能(深圳)有限公司 Pose estimation system and method
CN112097768B (en) * 2020-11-17 2021-03-02 深圳市优必选科技股份有限公司 Robot posture determining method and device, robot and storage medium
CN112097768A (en) * 2020-11-17 2020-12-18 深圳市优必选科技股份有限公司 Robot posture determining method and device, robot and storage medium
CN114554030B (en) * 2020-11-20 2023-04-07 空客(北京)工程技术中心有限公司 Device detection system and device detection method
CN114554030A (en) * 2020-11-20 2022-05-27 空客(北京)工程技术中心有限公司 Device detection system and device detection method
CN112734765A (en) * 2020-12-03 2021-04-30 华南理工大学 Mobile robot positioning method, system and medium based on example segmentation and multi-sensor fusion
CN112734765B (en) * 2020-12-03 2023-08-22 华南理工大学 Mobile robot positioning method, system and medium based on fusion of instance segmentation and multiple sensors
CN113436254A (en) * 2021-06-29 2021-09-24 杭州电子科技大学 Cascade decoupling pose estimation method
CN113436254B (en) * 2021-06-29 2022-07-05 杭州电子科技大学 Cascade decoupling pose estimation method
CN113487674A (en) * 2021-07-12 2021-10-08 北京未来天远科技开发有限公司 Human body pose estimation system and method
CN113487674B (en) * 2021-07-12 2024-03-08 未来元宇数字科技(北京)有限公司 Human body pose estimation system and method
CN113610149A (en) * 2021-08-05 2021-11-05 上海氢枫能源技术有限公司 Pose real-time display method and system of hydrogen compressor
CN113610149B (en) * 2021-08-05 2024-03-26 上海氢枫能源技术有限公司 Method and system for displaying pose of hydrogen compressor in real time
CN114608569A (en) * 2022-02-22 2022-06-10 杭州国辰机器人科技有限公司 Three-dimensional pose estimation method, system, computer device and storage medium
CN114608569B (en) * 2022-02-22 2024-03-01 杭州国辰机器人科技有限公司 Three-dimensional pose estimation method, system, computer equipment and storage medium
CN116310083A (en) * 2023-02-03 2023-06-23 杭州百世伽信息科技有限公司 Human body posture depth image generation method and human body posture recognition system
CN116310083B (en) * 2023-02-03 2023-11-14 杭州百世伽信息科技有限公司 Human body posture depth image generation method and human body posture recognition system
CN116912948A (en) * 2023-09-12 2023-10-20 南京硅基智能科技有限公司 Training method, system and driving system for digital person
CN116912948B (en) * 2023-09-12 2023-12-01 南京硅基智能科技有限公司 Training method, system and driving system for digital person

Similar Documents

Publication Publication Date Title
CN111354043A (en) Three-dimensional attitude estimation method and device based on multi-sensor fusion
Teed et al. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras
Jiao et al. Robust odometry and mapping for multi-lidar systems with online extrinsic calibration
US10225473B2 (en) Threshold determination in a RANSAC algorithm
EP1679657B1 (en) Estimation system, estimation method, and estimation program for estimating object state
CN101398934B (en) Method and system for tracking objects in images
EP2572319B1 (en) Method and system for fusing data arising from image sensors and from motion or position sensors
US11138742B2 (en) Event-based feature tracking
CN108051002A (en) Transport vehicle space-location method and system based on inertia measurement auxiliary vision
CN108229416B (en) Robot SLAM method based on semantic segmentation technology
US20070237359A1 (en) Method and apparatus for adaptive mean shift tracking
CN110726406A (en) Improved nonlinear optimization monocular inertial navigation SLAM method
Michot et al. Bi-objective bundle adjustment with application to multi-sensor slam
CN111932674A (en) Optimization method of line laser vision inertial system
Roberts et al. Learning general optical flow subspaces for egomotion estimation and detection of motion anomalies
EP3633617A2 (en) Image processing device
US10229508B2 (en) Dynamic particle filter parameterization
CN114485640A (en) Monocular vision inertia synchronous positioning and mapping method and system based on point-line characteristics
CN112179373A (en) Measuring method of visual odometer and visual odometer
White et al. An iterative pose estimation algorithm based on epipolar geometry with application to multi-target tracking
CN114693754A (en) Unmanned aerial vehicle autonomous positioning method and system based on monocular vision inertial navigation fusion
Dang et al. Stereo calibration in vehicles
Schill et al. Estimating ego-motion in panoramic image sequences with inertial measurements
Jiang et al. Icp stereo visual odometry for wheeled vehicles based on a 1dof motion prior
US20200184656A1 (en) Camera motion estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination