WO2021035669A1 - Pose prediction method, map construction method, movable platform, and storage medium - Google Patents

Pose prediction method, map construction method, movable platform, and storage medium

Info

Publication number
WO2021035669A1
Authority
WO
WIPO (PCT)
Prior art keywords
image frame
current image
encoder
map
frame
Prior art date
Application number
PCT/CN2019/103599
Other languages
French (fr)
Chinese (zh)
Inventor
朱张豪
Original Assignee
深圳市大疆创新科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to CN201980033455.4A (publication CN112219087A)
Priority to PCT/CN2019/103599
Publication of WO2021035669A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C 21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C 21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C 21/165 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C 21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C 21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C 21/28 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C 21/30 Map- or contour-matching
    • G01C 21/32 Structuring or formatting of map data
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 3/00 Measuring distances in line of sight; Optical rangefinders
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 17/00 Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S 17/02 Systems using the reflection of electromagnetic waves other than radio waves
    • G01S 17/06 Systems determining position data of a target
    • G01S 17/08 Systems determining position data of a target for measuring distance only
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 17/00 Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S 17/86 Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Definitions

  • the embodiment of the present invention relates to the field of machine vision, in particular to a pose prediction method, a map construction method, a movable platform and a storage medium.
  • Visual SLAM (Simultaneous Localization and Mapping): real-time positioning and map construction based on camera images
  • SLAM (Simultaneous Localization and Mapping): real-time positioning and map construction
  • the existing visual SLAM methods fail under special conditions such as sparse feature points in the image (no texture) or too much repetition with the previous image frame, blur or non-overlapping regions caused by high-speed motion, and accidental reduction of the camera frame rate.
  • under these conditions, the tracking function of visual SLAM fails, that is, the relative pose of the current image frame with respect to the previous image frame cannot be obtained, so visual SLAM cannot continue to work, and vision can only be restored by relocalization when a loop closure is triggered.
  • the existing visual SLAM methods therefore have poor stability and robustness. When a visual failure occurs, vision cannot be restored until a loop closure occurs, so visual SLAM is interrupted during the failure and the map points and trajectory of the failure period are missing, causing the map points and trajectories before and after the failure to fail to connect and visual SLAM to operate unstably.
  • the embodiments of the present invention provide a pose prediction method, a map construction method, a movable platform, and a storage medium to improve the accuracy of pose acquisition for the current image frame and the stability and robustness of visual SLAM, so that visual SLAM can still run stably when a visual failure occurs.
  • the first aspect of the embodiments of the present invention is to provide a pose prediction method suitable for a movable platform.
  • the movable platform includes a first rolling wheel, a second rolling wheel, a camera, a first encoder, and a second encoder.
  • the first rolling wheel and the second rolling wheel are used to displace the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide;
  • the camera is used to collect image frames, the first encoder is used to detect the speed of the first rolling wheel, and the second encoder is used to detect the speed of the second rolling wheel; the method includes:
  • the second aspect of the embodiments of the present invention is to provide a movable platform, including: a first rolling wheel, a second rolling wheel, a camera, a first encoder, a second encoder, a memory, and a processor;
  • the first rolling wheel and the second rolling wheel are used for displacing the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide;
  • the camera is used to collect image frames
  • the first encoder is used to detect the speed of the first rolling wheel
  • the second encoder is used to detect the speed of the second rolling wheel
  • the memory is used to store program codes
  • the processor calls the program code, and when the program code is executed, is used to perform the following operations:
  • a third aspect of the embodiments of the present invention is to provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the first aspect.
  • the fourth aspect of the embodiments of the present invention is to provide a map construction method suitable for a movable platform, where the movable platform includes a first rolling wheel, a second rolling wheel, a camera, a first encoder, and a second encoder,
  • the first rolling wheel and the second rolling wheel are used to displace the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide;
  • the camera is used to collect image frames,
  • the first encoder is used to detect the speed of the first rolling wheel, and the second encoder is used to detect the speed of the second rolling wheel;
  • the method includes:
  • after the predicted pose of the current image frame is obtained, the constructed map is updated according to the current image frame on the basis of the constructed map; the constructed map is built from key frames determined before the current image frame.
  • the fifth aspect of the embodiments of the present invention is to provide a movable platform, including: a first rolling wheel, a second rolling wheel, a camera, a first encoder, a second encoder, a memory, and a processor;
  • the first rolling wheel and the second rolling wheel are used for displacing the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide;
  • the camera is used to collect image frames
  • the first encoder is used to detect the speed of the first rolling wheel
  • the second encoder is used to detect the speed of the second rolling wheel
  • the memory is used to store program codes
  • the processor calls the program code, and when the program code is executed, is used to perform the following operations:
  • after the predicted pose of the current image frame is obtained, the constructed map is updated according to the current image frame on the basis of the constructed map; the constructed map is built from key frames determined before the current image frame.
  • the sixth aspect of the embodiments of the present invention is to provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the fourth aspect.
  • the first time period is acquired according to the time stamp of the current image frame and the time stamp of the previous image frame; the measurement data of the first encoder and the measurement data of the second encoder within the first time period are acquired; a predicted transformation matrix of the current image frame relative to the previous image frame is obtained according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period; and the predicted pose of the current image frame is obtained according to the predicted transformation matrix and the pose information of the previous image frame.
  • the embodiments of the present invention apply the encoders to visual SLAM and predict the pose of the current image frame through the encoder measurement data, which can improve the accuracy of pose acquisition and the stability and robustness of visual SLAM, so that visual SLAM can still run stably when a visual failure occurs.
  • FIG. 1 is a flowchart of a map construction method provided by an embodiment of the present invention.
  • FIG. 2 is a flowchart of a pose prediction method provided by an embodiment of the present invention.
  • FIG. 3 is a flowchart of a map construction method provided by another embodiment of the present invention.
  • FIG. 5 is a flowchart of a map construction method provided by another embodiment of the present invention.
  • FIG. 6 is a flowchart of a map construction method provided by another embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a movable platform provided by another embodiment of the present invention.
  • when a component is referred to as being "fixed to" another component, it can be directly on the other component or an intervening component may also exist. When a component is considered to be "connected to" another component, it can be directly connected to the other component or an intervening component may exist at the same time.
  • the embodiment of the present invention provides a map construction method.
  • the map construction method is applied to a movable platform, which may be a robot, an unmanned vehicle, etc. As shown in FIG. 7, the movable platform 700 specifically includes a first rolling wheel 701, a second rolling wheel 702, a first encoder 703, a second encoder 704, and a camera 705, and may further include a processor 706 and a memory 707.
  • the first rolling wheel 701 and the second rolling wheel 702 are used to displace the movable platform, and the axes of the first rolling wheel 701 and the second rolling wheel 702 coincide; the camera 705 is used to collect image frames, the first encoder 703 is used to detect the speed of the first rolling wheel 701, and the second encoder 704 is used to detect the speed of the second rolling wheel 702.
  • the camera 705 can be a monocular camera, a binocular camera, an RGBD depth camera, etc.
  • when the camera 705 is a monocular camera, it is usually tightly coupled with an IMU (Inertial Measurement Unit).
  • the IMU is a measurement unit integrating two types of sensors: a gyroscope for measuring three-dimensional rotation and an accelerometer for measuring three-dimensional acceleration. Of course, when the camera 705 is a binocular camera or an RGBD depth camera, it can also be used together with an IMU to further improve accuracy, but using an IMU imposes higher requirements on sensor calibration.
  • an encoder is an angular displacement sensor that measures the speed of a wheel by detecting the angle, in radians, through which the robot wheel has turned in a certain period of time. Encoders are mainly divided into three types: photoelectric, contact, and electromagnetic.
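As an illustration of how such a sensor is read out, the following is a minimal sketch, not taken from the patent, that converts an incremental encoder's tick count into a linear wheel speed; ticks_per_rev and wheel_radius are assumed example parameters.

```python
import math

def wheel_speed(delta_ticks: int, dt: float,
                ticks_per_rev: int = 1024, wheel_radius: float = 0.05) -> float:
    """Linear wheel speed in m/s from the ticks counted over dt seconds."""
    delta_angle = 2.0 * math.pi * delta_ticks / ticks_per_rev  # radians turned
    return (delta_angle / dt) * wheel_radius                   # v = omega * r
```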
  • the map construction method according to the embodiment of the present invention can be divided into a tracking thread S11, a local mapping thread S12, and a loop detection thread S13.
  • the tracking thread S11 includes predicting the pose of an image frame. As shown in FIG. 2, predicting the pose of an image frame specifically includes:
  • Step S101 Obtain a first time period according to the time stamp of the current image frame and the time stamp of the previous image frame.
  • the preprocessing process specifically includes:
  • for an RGBD camera, the collected images include an RGB image and a depth image. The RGB image can first be converted to a grayscale image, feature points are extracted from the grayscale image, and the stereo coordinates of the feature points are then calculated from the grayscale image and the depth image, that is, the coordinates of each feature point relative to a virtual right camera; in this way the image collected by the RGBD camera is equivalent to the images collected by a binocular camera.
  • for a binocular camera, the collected images are also converted into grayscale images, feature points are extracted from the grayscale images, and the feature points in the left-camera and right-camera images are matched to calculate the depth of the feature points and obtain their stereo coordinates.
  • the feature points extracted in this embodiment can be ORB (Oriented FAST and Rotated BRIEF) feature points.
  • the ORB feature is rotation-invariant, has high stability, is insensitive to lighting and moving objects, and has a small computational cost, so it can be processed in real time using only the CPU of a conventional notebook computer; of course, the feature points can also be Harris corners, SIFT (Scale-Invariant Feature Transform) feature points, SURF (Speeded Up Robust Features) feature points, etc.
  • Step S102 Obtain the measurement data of the first encoder and the measurement data of the second encoder in the first time period.
  • after the measurement data of the first encoder or the second encoder is acquired, it can be determined whether the measurement data is available, specifically as follows:
  • the acquisition frequency measured by the encoder is usually high, such as 200 Hz, while the acquisition frequency of the image is usually low, such as 30 Hz.
  • encoder measurement data collected at exactly the same time as the image therefore may not exist, so the encoder measurement data closest to the image acquisition time is found, that is, the time error between the acquisition time of the encoder measurement and the acquisition time of the current image frame must not exceed a preset time threshold.
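The nearest-sample lookup described above can be sketched as follows; this is an assumed illustration, with max_dt standing in for the preset time threshold and the encoder timestamps assumed sorted and non-empty.

```python
import bisect

def nearest_encoder_sample(encoder_stamps, image_stamp, max_dt=0.005):
    """Index of the encoder sample closest to image_stamp, or None when the
    time error exceeds the preset threshold max_dt (all times in seconds)."""
    i = bisect.bisect_left(encoder_stamps, image_stamp)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(encoder_stamps)]
    best = min(candidates, key=lambda j: abs(encoder_stamps[j] - image_stamp))
    return best if abs(encoder_stamps[best] - image_stamp) <= max_dt else None
```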
  • Step S103 Obtain a prediction transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder in the first time period.
  • Step S104 Obtain the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame.
  • the camera pose of the current image frame can be predicted according to the encoder data to obtain the predicted pose of the current image frame.
  • it is assumed here that the movement of the movable platform is two-dimensional (on a plane or a curved surface) and that the ground provides sufficient friction.
  • for a curved surface, the idea of Taylor expansion can be adopted: the surface is approximated as a plane over a short time, so as long as the frame rate is high enough, curved-surface movement can also be handled.
  • the predicted transformation matrix T_10 may be derived through a plane motion model, such as the motion model of a differential two-wheeled vehicle.
  • the x-axis component v_E of the measured velocity and the z-axis component ω_E of the measured angular velocity satisfy the differential two-wheel motion model, as shown in formula (2), where E denotes the encoder coordinate system of the current image frame, with the midpoint of the two-wheel track as the origin and the x-axis pointing forward:
  • the first rolling wheel speed and the second rolling wheel speed measured by the encoders are all the data between the encoder samples closest to the acquisition times of the current image frame and the previous image frame. It is understandable that, in the process of obtaining the predicted transformation matrix T_10, either all of the wheel-speed data in the first time period or only part of it may be used.
  • E_i represents the encoder coordinate system of the previous image frame
  • [ω_E]_× represents the skew-symmetric matrix of the measured angular velocity of the current image frame, and Ṙ and ṗ respectively represent the first derivatives of the rotation matrix and the position of the current image frame with respect to time.
  • after converting the predicted rotation matrix and the predicted translation from the encoder coordinate system to the camera coordinate system, the predicted transformation matrix between frames i and j can be obtained as shown in formula (4), i.e. T_CjCi = T_CE · T_EjEi · T_CE⁻¹.
  • T_CjCi is the aforementioned predicted transformation matrix T_10.
  • the fixed transformation matrix T_CE of the camera coordinate system relative to the encoder coordinate system used in this conversion needs to be calibrated in advance.
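A minimal dead-reckoning sketch of this prediction step is given below. It assumes the standard differential-drive relations (forward speed is the mean of the two wheel speeds, yaw rate their difference over the wheel track) and simple Euler integration; the wheel_track value and the planar SE(2) state are illustrative assumptions, not the patent's exact pre-integration.

```python
import math

def predict_transform(samples, wheel_track=0.30):
    """samples: list of (dt, v1, v2) wheel-speed measurements covering the
    first time period. Returns (x, y, theta): the planar pose of the current
    encoder frame expressed in the previous encoder frame."""
    x = y = theta = 0.0
    for dt, v1, v2 in samples:
        v = 0.5 * (v1 + v2)              # forward speed of the axle midpoint
        w = (v2 - v1) / wheel_track      # yaw rate of the differential model
        x += v * math.cos(theta) * dt    # Euler step in the previous frame
        y += v * math.sin(theta) * dt
        theta += w * dt
    return x, y, theta
```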
  • in this way, the encoder is applied to visual SLAM and the pose of the current image frame is predicted through the encoder measurement data. This can overcome visual-failure problems such as sparse feature points (no texture) or too much repetition, blur or non-overlapping regions caused by high-speed motion, and accidental reduction of the camera frame rate, so that the map points and trajectories before and after the failure can be connected, the SLAM system is not interrupted, and the stability and robustness of visual SLAM are improved.
  • the tracking thread S11 further includes optimizing the predicted pose, specifically:
  • the predicted pose of the current image frame is optimized by a preset motion-only BA (motion-only bundle adjustment) to obtain the optimized pose of the current image frame;
  • the pure motion bundle adjustment includes a reprojection error and an encoder error.
  • the predicted pose of the current image frame is optimized by pure motion bundle adjustment. BA (bundle adjustment) is a nonlinear optimization method that adjusts the camera parameters (mainly the extrinsic parameters, i.e. the poses of many key frames, and possibly the intrinsic parameters, i.e. the focal length and optical-center pixel coordinates) and the structure (the positions of the map points seen by the camera) so as to minimize the reprojection error: an edge connecting two key frames corresponds to a set of matching point pairs, and after a point in key frame 1 is projected onto key frame 2 through the geometric relations, its coordinate difference from the matching point is the error; the sum of the squared norms of these errors weighted by the information matrix is minimized as the objective function. Motion-only BA adjusts only the camera parameters (mainly the poses of the key frames), but its objective function is still a nonlinear optimization based on reprojection errors.
  • the reprojection error is obtained by reprojecting the map points of the local map onto the current image frame using its predicted pose and computing the error between the reprojected points and the corresponding feature points of the current image frame.
  • the optimization formula of the objective function of the pure motion bundle adjustment may be as shown in formula (5):
  • the part to the right of min is the objective function.
  • the first term is the prior error, which is generally 0. However, if the poses of the two consecutive frames are optimized at the same time when tightly coupled with the IMU, the Hessian matrix from the Motion-only BA of the previous image frame can be used as the information matrix Σ0⁻¹ to keep the pose of the previous image frame as unchanged as possible; the second term is the reprojection error term corresponding to the visual observation z_il;
  • the last term is the error term formed by the encoder information E_ij (that is, the squared norm of the error vector weighted by the inverse of the covariance matrix).
  • the encoder error term can be obtained by the following formula (6):
  • e_R represents the error of the relative rotation matrix (expressed via the logarithm map in its Lie algebra, it is a three-dimensional vector)
  • e_p represents the error of the relative translation (a three-dimensional vector); the encoder error is therefore a six-dimensional vector, the corresponding covariance matrix is a 6×6 square matrix, and its inverse, the information matrix, is used here.
  • e_R, e_p, and the covariance matrix are all calculated from the relative rotation matrix and the relative translation obtained by the above-mentioned pre-integration.
  • R_CiW and p_CiW are the pose of the previous image frame relative to the world coordinate system, and R_CE and p_CE are the elements of the fixed transformation matrix T_CE involved in formula (4), which needs to be calibrated in advance.
  • the aforementioned Motion-only BA uses nonlinear optimization to calculate the optimized pose R_CjW and p_CjW of the current image frame that minimizes the objective function.
  • the Jacobian matrix used in the nonlinear optimization process in this embodiment can be numerical; of course, the analytical solution derived from the relevant Lie algebra formulas can also be used, and the latter often makes the optimization more stable.
  • in an embodiment, the pure motion bundle adjustment further includes an IMU error term.
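To make the structure of the encoder error term concrete, here is a schematic numpy sketch in the spirit of formula (6); the pose convention (world-to-camera rotation R and translation p) and the assumption that the pre-integrated encoder motion has already been converted to the camera frame via T_CE are illustrative, not the patent's exact formulation.

```python
import numpy as np

def so3_log(R):
    """Axis-angle (Lie algebra) vector of a rotation matrix."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def encoder_error(R_ciw, p_ciw, R_cjw, p_cjw, dR_enc, dp_enc):
    """6-D encoder error: e_R compares the relative rotation implied by the
    two camera poses with the pre-integrated rotation dR_enc; e_p does the
    same for the translation dp_enc."""
    R_rel = R_cjw @ R_ciw.T            # relative rotation, camera i -> j
    p_rel = p_cjw - R_rel @ p_ciw      # relative translation in frame j
    e_R = so3_log(dR_enc.T @ R_rel)    # 3-D rotation error
    e_p = p_rel - dp_enc               # 3-D translation error
    return np.concatenate([e_R, e_p])

def encoder_error_term(e, information_6x6):
    """Squared norm of the 6-D error weighted by the information matrix."""
    return float(e @ information_6x6 @ e)
```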
  • in the prior art, a predictive brute-force matching method is usually used: the feature points of the previous image frame are projected onto the current image frame using the predicted transformation matrix of a uniform motion model (comprising a rotation matrix and a translation vector), and a small rectangular range is then selected for fast brute-force matching.
  • that is, the pose and speed of the previous image frame are used to estimate the initial pose of the current image frame, and the initial pose is then optimized to obtain the optimized pose.
  • the present invention instead uses the wheel speeds measured by the encoders to estimate the initial pose of the current image frame and then optimizes the initial pose to obtain the optimized pose, which is more robust.
  • the method may further include:
  • Step S202 Judge whether the vision has failed.
  • If the vision has failed, go to step S203; if the vision has not failed, go to step S204.
  • Step S203 Use the predicted pose of the current image frame as the pose of the current image frame.
  • Step S204 Use the optimized pose of the current image frame as the pose of the current image frame.
  • when the pure motion bundle adjustment fails, that is, when the optimized pose cannot be obtained through the objective function of the pure motion bundle adjustment described above, it is determined that the vision has failed. A visual failure means that the Motion-only BA process is invalid, i.e. the optimized pose obtained by the final optimization is inaccurate and deviates greatly from the real pose, so it is not used, and the predicted pose is used as the pose of the current image frame instead. If the vision has not failed, that is, the Motion-only BA process succeeded, the optimized pose can be used as the pose of the current image frame.
  • the visual failure can be caused by the following: the feature points in the image are sparse (no texture) or repeat too much with the previous image frame; blur or non-overlapping regions are caused by high-speed motion; or the camera frame rate accidentally decreases.
  • when the optimization result cannot be obtained through the pure motion bundle adjustment model, it is determined that the current image frame is visually invalid relative to the previous image frame.
  • specifically, the optimized pose obtained by Motion-only BA can be used to re-match the feature points of the current image frame against the frames that share feature points with it (or to match the map points in the map against the feature points of the current image frame). During matching, the shared feature points in those frames can be projected into the current image frame.
  • if the distance between a projected feature point and the corresponding feature point of the current image frame is less than a threshold, it is determined to be an inlier; otherwise it is an outlier. Then judge whether the number of inliers in the current image frame reaches the preset number threshold: if it is greater than or equal to the preset number threshold, the Motion-only BA succeeded, that is, the vision has not failed; if it is less than the preset number threshold, the Motion-only BA failed, that is, the vision has failed.
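The inlier test can be sketched as follows; the pinhole projection, the 5-pixel distance threshold, and the minimum inlier count are assumed example values rather than the patent's.

```python
import numpy as np

def count_inliers(R_cw, p_cw, points_w, matched_px, K, px_thresh=5.0):
    """Project candidate map points with the optimized pose (R_cw, p_cw) and
    count projections that land within px_thresh pixels of their matches."""
    inliers = 0
    for X_w, z in zip(points_w, matched_px):
        X_c = R_cw @ np.asarray(X_w) + p_cw
        if X_c[2] <= 0:                      # behind the camera: outlier
            continue
        u = (K @ (X_c / X_c[2]))[:2]         # pinhole projection
        if np.linalg.norm(u - np.asarray(z)) < px_thresh:
            inliers += 1
    return inliers

def vision_failed(num_inliers, min_inliers=15):
    """Motion-only BA is declared failed when too few inliers remain."""
    return num_inliers < min_inliers
```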
  • in this way, the pose information of the current image frame relative to the previous image frame is obtained, and the movement track of the movable platform can be tracked according to the relative pose.
  • when the camera collects the image of the current image frame, the pose of the current image frame can be predicted according to the camera pose at the previous image frame and the speeds of the first rolling wheel and the second rolling wheel detected by the encoders, yielding the predicted pose of the current image frame; feature-point matching is then performed according to the predicted pose of the current image frame, the image of the current image frame, and the key frames sharing feature points with the current image frame, the predicted pose of the current image frame is optimized, and the optimized pose of the current image frame is obtained.
  • if the vision fails temporarily, that is, the optimized pose cannot be obtained from the current image,
  • the predicted pose of the current image frame can be used as the pose information of the current image frame to ensure that the subsequent process continues, avoiding the situation in which the map points and trajectory before and after the visual failure cannot be connected and work cannot continue.
  • since the predicted pose of the current image frame is obtained from the encoders, and encoders do not suffer from zero drift, the accumulated error over a short period of time stays small, which bridges the transition from a short visual failure to visual recovery and thereby improves the stability and robustness of visual SLAM.
  • the local mapping thread S12 can then be performed according to the image frame and the corresponding pose information. Specifically, it can first be determined whether the current image frame is a key frame; if it is, the feature points of the key frame can be matched at the back end to add new map points, realizing the construction of the local map and the optimization of the global map.
  • in summary, the visual SLAM method obtains camera images and encoder data, obtains camera pose information according to the image frames and the measurement data of the encoders, and constructs the map according to the image frames and the pose information.
  • by applying the encoders to visual SLAM, the stability and robustness of visual SLAM can be improved, and the situation in which a short visual failure leaves the map points and trajectories before and after the failure unconnected so that work cannot continue is avoided.
  • in the local mapping thread S12, after the pose of the current image frame is obtained, the method may further include:
  • Step S301 Determine whether the current image frame is a key frame
  • if so, the current image frame is added to the local mapping thread S12 to construct the map. Specifically, this may include:
  • Step S302 When it is determined that the current image frame is a key frame, on the basis of the constructed map, the constructed map is updated according to the current image frame;
  • the constructed map is constructed by key frames determined before the current image frame.
  • map points that are sufficiently close to each other can also be directly inserted into the map as new map points.
  • the current image frame will be used as the new key frame; or, when the number of matched feature points of the current image frame is greater than the preset matching-number threshold and sufficient new map points are provided, the current image frame will also be used as the new key frame.
  • the number of feature-point matches refers to the number of matches between the feature points of the current image frame and the map points in the map, or between the feature points of the current image frame and the feature points of a key frame that shares feature points with the current image frame;
  • the time interval between the current image frame and the previous key frame must not be too long, that is, when the time interval between the current image frame and the previous key frame exceeds the preset time-interval threshold, the current image frame must be used as the new key frame.
  • when the vision has not failed, the key frame judgment strategy of the prior art can be used. When the vision fails, since the predicted pose of the current image frame can be obtained through the encoder data, two frames are allowed to be connected through the encoder (that is, pure encoder edges are allowed); it is then not required that the distance between the current image frame and the previous key frame be greater than the minimum value of the preset distance range, nor that the number of matched feature points of the current image frame be greater than the preset matching-number threshold, so these two restrictions can be removed. However, for RGBD or binocular cameras, the minimum number of map points created by the current image frame needs to be increased, that is, the number of map points created by the current image frame needs to be greater than a preset map-point number threshold, to prevent the frame from contributing too few map points and losing the meaning of being added; for the case where a monocular camera is tightly coupled with an IMU, a minimum time-interval limit is added, and in the local mapping thread some map points are tentatively constructed by triangulation to restore visual tracking.
  • after the vision recovers, the key frame judgment strategy can be restored to the key frame judgment strategy of the prior art; a simplified sketch of the overall key-frame decision follows.
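This sketch combines the rules above with assumed example thresholds; the exact threshold values and the monocular-IMU special case are left out.

```python
def is_new_keyframe(dt_since_kf, n_matches, n_new_map_points, vision_failed,
                    max_dt=1.0, min_matches=50, min_new_points=30):
    """Simplified key-frame decision: force a key frame when too much time
    has passed; during visual failure allow a pure encoder edge but require
    the frame to contribute enough new map points (RGBD/binocular case)."""
    if dt_since_kf > max_dt:
        return True
    if vision_failed:
        return n_new_map_points >= min_new_points
    return n_matches > min_matches and n_new_map_points > 0
```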
  • updating the constructed map according to the current image frame includes:
  • Step S401 Remove redundant key frames and redundant map points according to a preset strategy, and perform a triangulation operation according to the current image frame to generate new map points;
  • Step S402 After generating new map points, construct the common view (covisibility graph) of the current image frame and perform local bundle adjustment to optimize the other key frames that are covisible with the current image frame, the map points that can be seen in the current image frame, and the map points that can be seen in those other key frames; the local bundle adjustment includes an encoder error term.
  • the redundant map points can be removed first.
  • the redundant map points can include map points that are not seen near the current key frame and in the previous several frames, and map points whose ratio of continuous observations is small; for feature points in the current image frame that can be fused with existing map points, the existing map points can be replaced by fusion.
  • redundant key frames can also be removed.
  • redundant key frames are key frames for which there exists at least one other key frame seeing the same map points in a proportion exceeding a threshold (for example, 90%); in this embodiment, it is determined whether the key frames (frame nodes in the common view) that can see the same map points as the current key frame are redundant.
  • in an embodiment, constructing the common view of the current image frame to perform local bundle adjustment includes:
  • Step S501 When any key frame is connected to the previous key frame by a pure encoder edge and is also connected to the next key frame by a pure encoder edge, remove the key frame, and acquire a second time period according to the time stamp of the previous key frame and the time stamp of the next key frame; a pure encoder edge is an edge created when the visual failure occurs between adjacent key frames;
  • Step S502 Obtain all measurement data of the first encoder and all measurement data of the second encoder in the second time period;
  • Step S503 Obtain a predicted transformation matrix of the next key frame relative to the previous key frame according to all the measurement data of the first encoder and all the measurement data of the second encoder in the second time period.
  • in this way, key frames connected by consecutive pure encoder edges can be removed. Consider a chain of key frames connected by consecutive pure encoder edges, such as "key frame 1 - encoder edge 1 - key frame 2 - encoder edge 2 - key frame 3".
  • the relative pose of key frame 2 with respect to key frame 1 is the predicted pose obtained from the encoder data (i.e. encoder edge 1), and the relative pose of key frame 3 with respect to key frame 2 is also the predicted pose obtained from the encoder data (i.e. encoder edge 2).
  • removing consecutive pure encoder edges can reduce the amount of computation in subsequent optimization and prevent triangulation from relying only on encoder data, whose error is large (the limited accuracy of the encoder and the friction with the ground make the encoder less accurate than vision).
  • therefore, key frame 2 in the chain of consecutive pure encoder edges can be deleted to obtain "key frame 1 - encoder edge - key frame 3", where the encoder edge needs to be recalculated, that is, the predicted pose of key frame 3 relative to key frame 1 is computed through the pre-integration process of the foregoing embodiment.
  • in the triangulation operation, a new map point can be created by triangulation: from two points (the relative pose of two key frames) and the rays passing through them (the pixel coordinates of a matching point pair, that is, the viewing directions), the position of the ray intersection relative to these two points (the depth or position of the matching point pair, i.e. the 3D point) is obtained.
  • for example, a 3D point X is observed by the left and right cameras of a binocular camera; the imaging position of the 3D point on the left camera is x, and the imaging position on the right camera is x'.
  • the images collected by the left and right cameras of the binocular camera can be replaced by two adjacent frames (the relative pose between the two frames is known) that simultaneously observe a certain fixed point in space.
  • the triangulation process for the 3D point X is then the same as that of a binocular camera, so it is not repeated here.
  • as mentioned above, the image collected by the RGBD camera is equivalent to the images collected by a binocular camera, so the same triangulation process as for the binocular camera can be used; it is not repeated here.
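The standard linear (DLT) triangulation can serve as a concrete reference for this step; the patent does not name a specific solver, so the following numpy sketch is only an assumed illustration. P1 and P2 are the 3x4 projection matrices of the two views, and x1, x2 the matched pixel coordinates.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Return the 3D point X minimizing the algebraic error of
    x1 ~ P1 @ X_h and x2 ~ P2 @ X_h (X_h homogeneous)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)     # null-space direction of A
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]         # dehomogenize
```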
  • after the triangulation operation, the key-frame poses and the map points can be optimized by local BA; the local BA process is similar to the Motion-only BA optimization in the foregoing embodiment, and the objective function of the local BA also includes an encoder error term.
  • the key frames participating in the local BA are usually those in the common view of the current key frame (that is, the graph contains two types of vertices, frames and map points, together with the reprojection edges between a frame and the map points it can see; these key frames can see the map points seen by the current key frame, and the map points in the graph are all the map points visible to these key frames).
  • in this embodiment, the key frames participating in the local BA are changed to the N consecutive key frames preceding the current image frame (N is greater than or equal to 1), and the objective function of the local BA is constructed from these N consecutive key frames so as to optimize the pose and map points of the current key frame, which reduces the amount of computation.
  • after local mapping, the current key frame is sent to the loop detection thread S13 for closed-loop detection, to determine whether the movable platform is at a previously visited position.
  • closed-loop detection in the prior art queries the map database, calculates the similarity transformation (SE3 or Sim3) between the current key frame and the closed-loop key frame through the RANSAC (Random Sample Consensus) algorithm, and merges, through closed-loop fusion, the map points seen by the current key frame and by the key frames near the closed-loop key frame.
  • the RANSAC algorithm repeatedly computes the transformation from several randomly selected pairs of matching points, uses probability statistics to find the most correct transformation, and distinguishes which matching point pairs are correct (correct matching point pairs are called inliers).
  • specifically, the bag-of-words (BoW) vector of each key frame can be obtained, and the distance between the bag-of-words vector of any key frame and the bag-of-words vector of the current key frame (for example, the Euclidean distance) is computed; whether there is a closed loop is judged from the distance between the two bag-of-words vectors. The bag-of-words vector is obtained from the feature points of the key frame and a preset feature-point vocabulary (bag of words); the preset vocabulary is constructed by clustering feature points and contains feature-point words arranged in a predetermined order, and the key frame's feature points are matched against each word in the preset vocabulary.
  • a word is set to 1 if it appears among the key frame's feature points and to 0 if it does not, so that the bag-of-words vector of the key frame is obtained.
  • the bag-of-words vector of each key frame is stored in the database.
  • the distance between the bag-of-words vector of the current key frame and the bag-of-words vector of each key frame in the database can then be used to judge whether the current key frame has a closed loop.
  • the key frame whose bag-of-words vector is close enough is the closed-loop key frame of the current image frame.
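A toy illustration of this binary bag-of-words scheme is given below; real systems typically use weighted vocabularies (for example DBoW2), and the distance threshold here is an assumed example.

```python
import numpy as np

def bow_vector(word_ids, vocab_size):
    """Binary BoW vector: entry 1 if the vocabulary word appears among the
    key frame's feature points, 0 otherwise."""
    v = np.zeros(vocab_size)
    v[sorted(set(word_ids))] = 1.0
    return v

def is_loop_candidate(v_current, v_candidate, dist_thresh=4.0):
    """Loop-closure candidate test via Euclidean distance of BoW vectors."""
    return np.linalg.norm(v_current - v_candidate) < dist_thresh
```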
  • in this embodiment, the thresholds for detecting a closed loop can be lowered after a visual failure.
  • the thresholds for detecting a closed loop include the number of repeated detections, the threshold on the number of matching points for calculating the similarity transformation, and the number of inliers.
  • the pose graph can be optimized.
  • the pose graph is a graph whose vertices contain no map points but only all the key frames, and whose edges contain only the relative transformations between frames on the spanning tree and the similarity transformations corresponding to all closed-loop edges.
  • the spanning tree is a loop-free graph connecting the key frames of the entire map.
  • in the spanning tree, two key frames consecutive in time become parent and child nodes, but when a parent node is deleted, the parent node of the child node is modified according to the number of commonly seen map points (at this time the corresponding key-frame removal operation updates the pre-integration of the child nodes); the root node is the initial key frame created when the visual SLAM system starts.
  • the pose graph optimization adjusts the pose of each key frame; the objective function is the sum of the squared errors between each edge's relative transformation and the relative pose of its two vertices, weighted by the covariance matrix (originally a 3×3 identity matrix I).
  • in this embodiment, a modified covariance matrix is generated according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between adjacent key frames.
  • specifically, the covariance matrix obtained by pre-integration can be used to modify the covariance matrix of the relative transformation corresponding to the pure encoder edges on the spanning tree.
  • its diagonal elements are generally greater than 1, which means that the accuracy of a pure encoder edge is lower than that of an edge tightly coupling the encoder with vision; this reduces the error introduced by pure encoder edges during nonlinear optimization.
  • in this way, the error introduced by a pure encoder edge can be dispersed over all edges. For example, on the spanning tree "key frame 1 - key frame 2 - key frame 3", when a closed loop between key frame 3 and key frame 1 occurs, a closed-loop edge is formed between key frame 1 and key frame 3 (the similarity transformation SE3 or Sim3 between the current key frame and the closed-loop key frame), which disperses the accumulated error between key frame 1 and key frame 3 over the edges of "key frame 1 - key frame 2 - key frame 3", so the error of each edge is relatively small and within an acceptable range.
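The down-weighting of pure encoder edges can be sketched as follows; the 6-D edge dimension and the identity baseline are assumptions consistent with the description above, not the patent's exact matrices.

```python
import numpy as np

def edge_information(cov_preintegrated=None, dim=6):
    """Information matrix for one pose-graph edge: visual edges keep the
    original identity covariance, while a pure encoder edge uses the
    pre-integrated encoder covariance (diagonal generally > 1), which
    lowers its weight in the optimization."""
    cov = np.eye(dim) if cov_preintegrated is None else np.asarray(cov_preintegrated)
    return np.linalg.inv(cov)
```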
  • global BA optimization can also be performed.
  • the global BA is similar to the Motion-only BA and local BA optimization processes in the above embodiments.
  • the objective function of the global BA also includes the encoder error term; the difference from the local BA optimization process is that the key frames participating in the global BA are all the key frames and map points in the entire map library. After completing the global BA, the map can be updated: all map points are updated, and all key frames in the map library are updated up to the latest key frame.
  • for a monocular camera tightly coupled with an IMU, the IMU needs to be initialized to obtain the biases of the IMU, including the bias of the gyroscope and the bias of the accelerometer; the gravity vector can also be obtained.
  • for a monocular camera, the map scale needs to be determined through IMU initialization, while binocular cameras and RGBD cameras usually do not need IMU initialization to determine the map scale; when a binocular or RGBD camera determines the map scale, the initialization result is close to 1.
  • when the first predetermined number of key frames (for example, 20) has been collected,
  • the zero bias, gravity vector, and map scale of the IMU can be obtained according to these key frames.
  • the map scale is then updated to all the key frames, including the predetermined number of key frames and the new key frames obtained during the initialization process, and global BA optimization is performed to update the map; after that, visual SLAM can run at the updated map scale.
  • this embodiment also provides a lightweight positioning method.
  • when saving the map, the simplest saving method can be adopted, that is, only the key frame and map point information of the necessary map is saved, and binary file storage can be used.
  • the main stored content involves: the sensor type; for each key frame, its sequence number, pose, bag-of-words vector, feature-point sequence and inter-frame sensor information sequence, the sequence numbers of the map points observable from the frame, and the sequence numbers of the closed-loop key frames connected to the frame; for each map point, its sequence number, position, reference key frame sequence number, and the sequence numbers of the key frames from which it can be observed; and the spanning-tree structure (namely each node's parent sequence number).
  • the sequence numbers used when saving the key frames and map points may not be the original sequence numbers.
  • in this way, the map information of a long-running visual SLAM system can be stored in a small storage space, which increases the maximum working time of the visual SLAM system, but more time is needed to reconstruct the map and trajectory when reloading.
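As an illustration of such a compact binary layout, here is a minimal sketch using Python's struct packing; the field set and record layout are assumed for illustration and are not the patent's actual format.

```python
import struct

def save_map(path, keyframes, map_points):
    """keyframes: list of (kf_id, pose7), pose7 = (tx, ty, tz, qx, qy, qz, qw);
    map_points: list of (pt_id, x, y, z, ref_kf_id)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(keyframes), len(map_points)))
        for kf_id, pose in keyframes:
            f.write(struct.pack("<I7d", kf_id, *pose))   # id + 7-DoF pose
        for pt_id, x, y, z, ref_kf in map_points:
            f.write(struct.pack("<I3dI", pt_id, x, y, z, ref_kf))
```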
  • for lightweight positioning, visual SLAM can start only the tracking thread S11 for positioning, without starting the local mapping thread S12 and the loop detection thread S13. It should be noted that, for the case where a monocular camera is tightly coupled with an IMU, the IMU still needs to be initialized (to obtain the initial zero bias and the gravitational acceleration).
  • FIG. 7 is a structural diagram of a movable platform provided by an embodiment of the present invention.
  • the movable platform 700 includes a first rolling wheel 701, a second rolling wheel 702, a first encoder 703, a second encoder 704, a camera 705, a memory 707, and a processor 706;
  • the first rolling wheel 701 and the second rolling wheel 702 are used for displacing the movable platform, and the axes of the first rolling wheel 701 and the second rolling wheel 702 coincide;
  • the camera 705 is used to collect image frames
  • the first encoder 703 is used to detect the speed of the first rolling wheel 701
  • the second encoder 704 is used to detect the speed of the second rolling wheel 702;
  • the memory 707 is used to store program codes
  • the processor 706 calls the program code, and when the program code is executed, is configured to perform the following operations:
  • when the processor 706 obtains the predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder in the first time period,
  • the processor is configured to:
  • obtain, according to the measurement data of the first encoder and the measurement data of the second encoder in the first time period, the first derivative of the rotation matrix of the current image frame with respect to time and the first derivative of the position of the current image frame with respect to time;
  • the predicted rotation matrix and the predicted translation matrix of the current image frame relative to the previous image frame are then obtained.
  • after the processor 706 obtains the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame, the processor 706 is further configured to:
  • optimize the predicted pose of the current image frame according to a preset pure motion bundle adjustment model, where the pure motion bundle adjustment model includes an encoder error term, and the encoder error term is obtained from the measurement data of the first encoder and the measurement data of the second encoder.
  • processor 706 is further configured to:
  • if the vision has failed, the pose of the current image frame is the predicted pose;
  • if the vision has not failed, the pose of the current image frame is the optimization result of the pure motion bundle adjustment model.
  • processor 706 is further configured to:
  • the constructed map is updated according to the current image frame; the constructed map is built from key frames determined before the current image frame.
  • when the processor 706 updates the constructed map according to the current image frame, the processor 706 is configured to:
  • when the processor 706 constructs the common view of the current image frame to perform local bundle adjustment,
  • the processor 706 is configured to:
  • obtain the predicted transformation matrix of the next key frame relative to the previous key frame.
  • when the processor 706 constructs the common view of the current image frame to perform local bundle adjustment, the processor 706 is further configured to:
  • N is greater than or equal to 1;
  • optimize, according to the encoder error and the reprojection error, the current image frame and the map points it can see, as well as the N consecutive key frames and the map points they can see, and update the poses of all key frames and the positions of all map points in the local optimization map.
  • the processor 706 is further configured to:
  • construct a pose graph to optimize all the key frames and update the poses of all the key frames; the pose graph includes all the key frames and the relative transformations between pairs of key frames.
  • when the processor 706 constructs a pose graph to optimize all key frames, the processor 706 is configured to:
  • a modified covariance matrix is generated according to the predictive transformation matrix to modify the covariance matrix of the relative transformation between the adjacent key frames.
  • the processor 706 is further configured to:
  • processor 706 calls the program code, and when the program code is executed, is configured to perform the following operations:
  • after the predicted pose of the current image frame is obtained, the constructed map is updated according to the current image frame on the basis of the constructed map; the constructed map is built from key frames determined before the current image frame.
  • the movable platform provided in this embodiment obtains the first time period according to the time stamp of the current image frame and the time stamp of the previous image frame; obtains the measurement data of the first encoder and the measurement data of the second encoder in the first time period; obtains the predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder in the first time period; and obtains the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame.
  • in this way, the encoder is applied to visual SLAM and the pose of the current image frame is predicted through the encoder measurement data, which can improve the accuracy of pose acquisition and the stability and robustness of visual SLAM, so that visual SLAM can still run stably when the vision fails.
  • this embodiment also provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the pose prediction method and/or map construction method described in the foregoing embodiments.
  • this embodiment also provides a computer program, including program code.
  • when the program code runs, it executes the pose prediction method and/or the map construction method described in the foregoing embodiments.
  • the disclosed device and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium.
  • the above-mentioned software functional unit is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) or a processor execute part of the steps of the methods described in the embodiments of the present invention.
  • the aforementioned storage media include: USB flash drive, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk, optical disk, and other media that can store program code.

Abstract

A pose prediction method, a map construction method, a movable platform, and a storage medium. The method comprises: acquiring a first time period according to a timestamp of the current image frame and a timestamp of a previous image frame (S101); acquiring measurement data of a first encoder and measurement data of a second encoder within the first time period (S102); obtaining a prediction transformation matrix of the current image frame with respect to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period (S103); and obtaining a predicted pose of the current image frame according to the prediction transformation matrix and pose information of the previous image frame (S104). By applying the encoders to visual SLAM, and predicting the pose of the current image frame by means of the measurement data of the encoders, the accuracy of pose acquisition can be improved, the stability and robustness of visual SLAM is improved, and visual SLAM can still stably operate during visual failure.

Description

Pose prediction method, map construction method, movable platform and storage medium
Technical Field
Embodiments of the present invention relate to the field of machine vision, and in particular to a pose prediction method, a map construction method, a movable platform, and a storage medium.
Background
Visual SLAM (Simultaneous Localization and Mapping) refers to a robot, whose own position is uncertain, creating a map of a completely unknown environment from images collected by a camera while using the map for autonomous localization and navigation.
Existing visual SLAM methods lose tracking when special problems arise, such as sparse feature points in the image (no texture) or too much repetition with the previous image frame, blur or no overlapping region caused by high-speed motion, or an occasional drop in the camera frame rate: the relative pose of the current image frame with respect to the previous image frame cannot be obtained, visual SLAM cannot continue to work, and vision can only be recovered by relocalization when a loop closure is triggered.
Existing visual SLAM methods therefore have poor stability and robustness. When a visual failure occurs, vision cannot be recovered until a loop closure occurs, so visual SLAM is interrupted during the failure and the map points and trajectory during the failure are missing; as a result, the map points and trajectories before and after the visual failure cannot be connected, and visual SLAM cannot run stably.
Summary of the Invention
Embodiments of the present invention provide a pose prediction method, a map construction method, a movable platform, and a storage medium, so as to improve the accuracy of acquiring the pose of the current image frame and the stability and robustness of visual SLAM, such that visual SLAM can still run stably during a visual failure.
A first aspect of the embodiments of the present invention provides a pose prediction method applicable to a movable platform. The movable platform includes a first rolling wheel, a second rolling wheel, a camera, a first encoder, and a second encoder; the first rolling wheel and the second rolling wheel are used to displace the movable platform, and their axes coincide; the camera is used to collect image frames, the first encoder is used to detect the speed of the first rolling wheel, and the second encoder is used to detect the speed of the second rolling wheel. The method includes:
obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;
obtaining the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a prediction transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a predicted pose of the current image frame according to the prediction transformation matrix and the pose information of the previous image frame.
A second aspect of the embodiments of the present invention provides a movable platform, including: a first rolling wheel, a second rolling wheel, a camera, a first encoder, a second encoder, a memory, and a processor.
The first rolling wheel and the second rolling wheel are used to displace the movable platform, and their axes coincide.
The camera is used to collect image frames.
The first encoder is used to detect the speed of the first rolling wheel, and the second encoder is used to detect the speed of the second rolling wheel.
The memory is used to store program code.
The processor calls the program code and, when the program code is executed, performs the following operations:
obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;
obtaining the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a prediction transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a predicted pose of the current image frame according to the prediction transformation matrix and the pose information of the previous image frame.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method described in the first aspect.
A fourth aspect of the embodiments of the present invention provides a map construction method applicable to a movable platform. The movable platform includes a first rolling wheel, a second rolling wheel, a camera, a first encoder, and a second encoder; the first rolling wheel and the second rolling wheel are used to displace the movable platform, and their axes coincide; the camera is used to collect image frames, the first encoder is used to detect the speed of the first rolling wheel, and the second encoder is used to detect the speed of the second rolling wheel. The method includes:
obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;
obtaining the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a prediction transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a predicted pose of the current image frame according to the prediction transformation matrix and the pose information of the previous image frame;
after the predicted pose of the current image frame is obtained, updating a constructed map according to the current image frame on the basis of the constructed map, the constructed map being constructed from key frames determined before the current image frame.
A fifth aspect of the embodiments of the present invention provides a movable platform, including: a first rolling wheel, a second rolling wheel, a camera, a first encoder, a second encoder, a memory, and a processor.
The first rolling wheel and the second rolling wheel are used to displace the movable platform, and their axes coincide.
The camera is used to collect image frames.
The first encoder is used to detect the speed of the first rolling wheel, and the second encoder is used to detect the speed of the second rolling wheel.
The memory is used to store program code.
The processor calls the program code and, when the program code is executed, performs the following operations:
obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;
obtaining the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a prediction transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
obtaining a predicted pose of the current image frame according to the prediction transformation matrix and the pose information of the previous image frame;
after the predicted pose of the current image frame is obtained, updating a constructed map according to the current image frame on the basis of the constructed map, the constructed map being constructed from key frames determined before the current image frame.
A sixth aspect of the embodiments of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method described in the fourth aspect.
According to the pose prediction method, map construction method, movable platform, and storage medium provided by the embodiments of the present invention, a first time period is acquired according to the timestamp of the current image frame and the timestamp of the previous image frame; the measurement data of the first encoder and the measurement data of the second encoder within the first time period are acquired; a prediction transformation matrix of the current image frame relative to the previous image frame is obtained according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period; and the predicted pose of the current image frame is obtained according to the prediction transformation matrix and the pose information of the previous image frame. By applying encoders to visual SLAM and predicting the pose of the current image frame from the encoder measurement data, the embodiments of the present invention can improve the accuracy of pose acquisition and the stability and robustness of visual SLAM, such that visual SLAM can still run stably during a visual failure.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a map construction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a pose prediction method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a map construction method provided by another embodiment of the present invention;
Fig. 4 is a flowchart of a map construction method provided by another embodiment of the present invention;
Fig. 5 is a flowchart of a map construction method provided by another embodiment of the present invention;
Fig. 6 is a flowchart of a map construction method provided by another embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a movable platform provided by another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that when a component is referred to as being "fixed to" another component, it can be directly on the other component or an intermediate component may exist. When a component is considered to be "connected to" another component, it can be directly connected to the other component or an intermediate component may exist at the same time.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification of the present invention are only for the purpose of describing specific embodiments and are not intended to limit the present invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the case of no conflict, the following embodiments and the features in the embodiments can be combined with each other.
An embodiment of the present invention provides a map construction method. The map construction method is applied to a movable platform, which may be a robot, an unmanned vehicle, or the like. As shown in Fig. 7, the movable platform 700 specifically includes a first rolling wheel 701, a second rolling wheel 702, a first encoder 703, a second encoder 704, and a camera 705, and may further include a processor 706 and a memory 707. The first rolling wheel 701 and the second rolling wheel 702 are used to displace the movable platform, and their axes coincide; the camera 705 is used to collect image frames; the first encoder 703 is used to detect the speed of the first rolling wheel 701, and the second encoder 704 is used to detect the speed of the second rolling wheel 702.
The camera 705 may be a monocular camera, a binocular (stereo) camera, an RGB-D depth camera, or the like. When the camera 705 is a monocular camera, it is usually tightly coupled with an IMU (Inertial Measurement Unit) to determine the map scale, where the IMU is a measurement unit integrating two types of sensors: a gyroscope measuring three-dimensional rotation and an accelerometer measuring three-dimensional acceleration. Of course, when the camera 705 is a binocular camera or an RGB-D camera, it can also be used together with an IMU to further improve accuracy, but using an IMU places higher requirements on sensor calibration. An encoder is an angular displacement sensor that measures the speed of a wheel by detecting the number of radians the wheel turns through within a certain period of time; encoders are mainly divided into photoelectric, contact, and electromagnetic types.
As shown in Fig. 1, the map construction method according to the embodiment of the present invention can be divided into a tracking thread S11, a local mapping thread S12, and a loop closure detection thread S13, where the tracking thread S11 includes predicting the pose of an image frame. As shown in Fig. 2, predicting the pose of an image frame specifically includes:
Step S101: obtain a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame.
After any image frame is acquired, the image may first be preprocessed. The preprocessing process specifically includes:
For an RGB-D camera, the collected images include an RGB image and a depth image. The RGB image can first be converted into a grayscale image, feature points are extracted from the grayscale image, and the stereo coordinates of the feature points, i.e. the coordinates of the feature points relative to a virtual right camera, are then calculated from the grayscale image and the depth image, so that the image collected by the RGB-D camera is equivalent to an image collected by a binocular camera.
For a binocular camera, the collected images are also converted into grayscale images, feature points are extracted from the grayscale images, and the depth of the feature points is calculated by matching the feature points in the left-camera and right-camera images, giving the stereo coordinates of the feature points.
It should be noted that the feature points extracted in this embodiment may be ORB (Oriented FAST and Rotated BRIEF) feature points. The ORB feature is rotation invariant, has high stability, is insensitive to illumination and moving objects, and is computationally cheap enough to be processed in real time on the CPU of an ordinary laptop. Of course, the feature points may also be Harris corners, SIFT (Scale-Invariant Feature Transform) feature points, SURF (Speeded Up Robust Features) feature points, and so on.
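To make the preprocessing concrete, the following is a minimal sketch (not taken from the patent itself) of the grayscale conversion and ORB extraction step using the standard OpenCV Python API; the function and parameter values are illustrative, and the stereo/depth handling that yields the virtual-right-camera coordinates is only indicated by a comment:

```python
import cv2

def preprocess_frame(bgr_image):
    """Convert a color frame to grayscale and extract ORB feature points.

    A minimal sketch of the preprocessing step; the stereo/depth handling
    (computing each feature's virtual-right-camera coordinate) is omitted
    here because it depends on the sensor used.
    """
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=1000)   # ORB: rotation-invariant, CPU-friendly
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return gray, keypoints, descriptors
```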
The use of the feature points is described in detail below.
Step S102: obtain the measurement data of the first encoder and the measurement data of the second encoder within the first time period.
Preferably, after the measurement data of the first encoder or the second encoder is acquired, whether the measurement data is usable can be determined as follows:
buffer the measurement data in a list, and record the collection time corresponding to the measurement data;
obtain from the buffer list the measurement data whose collection time is closest to the collection time of the current image frame, and obtain the time error between the collection time of that measurement data and the collection time of the current image frame;
if the time error exceeds a preset time threshold, the measurement data is unusable.
In this embodiment, the collection frequency of the encoder measurements is usually high, for example 200 Hz, while the collection frequency of images is usually low, for example 30 Hz. In general it is impossible to find an encoder measurement collected at exactly the same moment as the image, so the encoder measurement closest to the image collection time is looked up, i.e. the time error between the collection time of the encoder measurement and the collection time of the current image frame must not exceed the preset time threshold. When the system is unstable, encoder measurements may be missing, i.e. no encoder data can be found whose time error with respect to the current image frame collection time is within the preset time threshold. In that case the encoder measurements around the current image frame collection time are unreliable and can be discarded, and only pure visual SLAM from the prior art (relying on vision alone for pose prediction and map construction) is used. A similar judgment process can also be applied to IMU measurements.
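A minimal sketch of the buffering and lookup logic described above, assuming encoder readings are buffered as (timestamp, measurement) pairs sorted by time; all names are illustrative:

```python
from bisect import bisect_left

def nearest_measurement(buffer, frame_time, max_error):
    """Return the buffered encoder measurement closest in time to the
    image frame, or None if the time error exceeds the threshold.

    `buffer` is a list of (timestamp, measurement) tuples sorted by
    timestamp, e.g. encoder readings at ~200 Hz against ~30 Hz images.
    """
    times = [t for t, _ in buffer]
    i = bisect_left(times, frame_time)
    # Candidates: the readings just before and just after the frame time.
    candidates = [buffer[j] for j in (i - 1, i) if 0 <= j < len(buffer)]
    if not candidates:
        return None
    t, m = min(candidates, key=lambda tm: abs(tm[0] - frame_time))
    if abs(t - frame_time) > max_error:
        return None  # unusable: fall back to pure visual SLAM
    return m
```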
Step S103: obtain a prediction transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period.
Step S104: obtain the predicted pose of the current image frame according to the prediction transformation matrix and the pose information of the previous image frame.
In this embodiment, the camera pose of the current image frame can be predicted from the encoder data, giving the predicted pose of the current image frame. Specifically, the motion of the movable platform is assumed to be two-dimensional (on a plane or a curved surface), and the ground is assumed to provide sufficient friction. A curved surface can be handled with an idea similar to a Taylor expansion: over a short time it is approximated as a plane, i.e. motion on a curved surface can also be handled when the frame rate is high enough.
The prediction transformation matrix T_10 is obtained from the encoder data, where T_10 contains a 3×3 rotation matrix R and a three-dimensional translation vector t, subscript 1 denotes the camera coordinate system of the current image frame, and subscript 0 denotes the camera coordinate system of the previous image frame. Combined with the pose T_0w of the previous image frame (w being the world coordinate system), the predicted pose of the current image frame is obtained as T_1w = T_10 T_0w.
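As a small illustration of this composition, assuming poses are stored as 4×4 homogeneous matrices (the numeric values below are purely illustrative):

```python
import numpy as np

def compose_pose(T_10, T_0w):
    """Predicted pose of the current frame: T_1w = T_10 @ T_0w.

    Each argument is a 4x4 homogeneous transform [[R, t], [0, 1]], with
    T_0w the previous frame's pose (world to camera 0) and T_10 the
    encoder-predicted relative transform (camera 0 to camera 1).
    """
    return T_10 @ T_0w

# Example usage with placeholder values:
T_0w = np.eye(4)                       # previous pose
T_10 = np.eye(4); T_10[0, 3] = 0.1     # illustrative relative transform
T_1w = compose_pose(T_10, T_0w)        # predicted pose of current frame
```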
In this embodiment, the prediction transformation matrix T_10 can be derived from a planar motion model, for example the motion model of a differential-drive two-wheeled vehicle.
The measured velocity v_EE and the measured angular velocity ω_EE of the current image frame of the movable platform are given by formula (1):
$$v_{EE} = \begin{bmatrix} v \\ 0 \\ 0 \end{bmatrix}, \qquad \omega_{EE} = \begin{bmatrix} 0 \\ 0 \\ \omega \end{bmatrix} \tag{1}$$
where the x-axis component v of the measured velocity v_EE and the z-axis component ω of the measured angular velocity ω_EE satisfy the motion model of a differential-drive two-wheeled vehicle, as shown in formula (2) (where E denotes the encoder coordinate system of the current image frame, with its origin at the center of the wheel track and its x-axis pointing forward):
$$v = \frac{v_{1} + v_{2}}{2}, \qquad \omega = \frac{v_{2} - v_{1}}{b} \tag{2}$$
where v_1 and v_2 are the first rolling wheel speed and the second rolling wheel speed measured by the encoders (all the readings between the encoder data closest to the collection times of the current image frame and the previous image frame), and b is the wheel track, i.e. the distance between the two rolling wheels. It can be understood that, in the process of obtaining the prediction transformation matrix T_10, all of the v_1 and v_2 data within the first time period may be used, or only part of the v_1 and v_2 data may be used.
Substitute the measured velocity v_EE and the measured angular velocity ω_EE of the current image frame into the following differential formula (3) of the motion model:
$$\dot{R}_{E_iE} = R_{E_iE}\,\omega_{EE}^{\wedge}, \qquad \dot{p}_{E_iE} = R_{E_iE}\,v_{EE} \tag{3}$$
where E_i denotes the encoder coordinate system of the previous image frame, ω_EE^∧ denotes the antisymmetric (skew-symmetric) matrix of the measured angular velocity of the current image frame, and $\dot{R}_{E_iE}$ and $\dot{p}_{E_iE}$ denote the first derivatives with respect to time of the rotation matrix and position of the current image frame, respectively.
Integrating the differential formula of the motion model yields the predicted rotation matrix $\breve{R}_{E_jE_i}$ and the predicted translation $\breve{p}_{E_jE_i}$ between the two frames (the breve mark denotes an estimate obtained from the encoder information, under the assumption that the information contains no noise). This process is the pre-integration used to obtain the prediction relationship between the two frames.
Further, the predicted rotation matrix $\breve{R}_{E_jE_i}$ and the predicted translation $\breve{p}_{E_jE_i}$ are converted from the encoder coordinate system to the camera coordinate system, so that the prediction transformation matrix between frames i and j is obtained as shown in formula (4):
$$T_{C_jC_i} = T_{CE}\,\breve{T}_{E_jE_i}\,T_{CE}^{-1} \tag{4}$$
where $\breve{T}_{E_jE_i}$ is the homogeneous transform assembled from $\breve{R}_{E_jE_i}$ and $\breve{p}_{E_jE_i}$.
That is, T_CjCi is the aforementioned prediction transformation matrix T_10. The fixed transformation matrix T_CE of the camera coordinate system relative to the encoder coordinate system used in this conversion needs to be calibrated in advance.
Further, the predicted pose T_1w of the current image frame is obtained through the formula T_1w = T_10 T_0w. By applying encoders to visual SLAM and predicting the pose of the current image frame from the encoder measurement data, the embodiment of the present invention can cope with visual failures such as sparse feature points (no texture) or too much repetition, blur or no overlapping region caused by high-speed motion, and an occasional drop of the camera frame rate, so that the map points and trajectories before and after the failure can be connected, the SLAM system is not interrupted, and the stability and robustness of visual SLAM are improved.
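The following sketch illustrates, under the planar assumptions above, one way the wheel-speed samples could be integrated into a relative transform and converted into the camera frame; it is a simple Euler-integration approximation of the pre-integration described in the text, and all names (and the direction convention of the returned transform) are illustrative:

```python
import numpy as np

def preintegrate_encoder(samples, b):
    """Integrate wheel-speed samples into a planar relative motion.

    `samples` is a list of (dt, v1, v2): sample interval and the two
    wheel speeds from the encoders; `b` is the wheel track. Returns the
    yaw angle and 2D translation accumulated between the two frames.
    """
    theta, x, y = 0.0, 0.0, 0.0
    for dt, v1, v2 in samples:
        v = 0.5 * (v1 + v2)            # formula (2): forward speed
        w = (v2 - v1) / b              # formula (2): yaw rate
        x += v * np.cos(theta) * dt    # formula (3), integrated in the plane
        y += v * np.sin(theta) * dt
        theta += w * dt
    return theta, np.array([x, y])

def encoder_to_camera(theta, p_xy, T_CE):
    """Lift the planar encoder-frame motion to a 4x4 transform and express
    it in the camera frame via the pre-calibrated extrinsic T_CE, as in
    formula (4)."""
    c, s = np.cos(theta), np.sin(theta)
    T_E = np.eye(4)
    T_E[:2, :2] = [[c, -s], [s, c]]
    T_E[:2, 3] = p_xy
    return T_CE @ T_E @ np.linalg.inv(T_CE)
```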
The tracking thread S11 further includes optimizing the predicted pose, specifically:
The predicted pose of the current image frame is optimized through a preset motion-only bundle adjustment (Motion-only BA) to obtain the optimized pose of the current image frame, where the motion-only bundle adjustment includes a reprojection error and an encoder error.
In the embodiment of the present invention, the predicted pose of the current image frame is optimized through motion-only bundle adjustment. BA (bundle adjustment) is a nonlinear optimization method that adjusts the camera parameters (mainly the extrinsic parameters, i.e. the poses of many key frames; possibly also the intrinsic parameters: focal length and principal point pixel coordinates) and the structure (the positions of the map points seen by the cameras) so as to minimize the reprojection error, i.e. the sum of the squared norms (with respect to the information matrix) of the coordinate differences between the points of one key frame projected geometrically onto another key frame and their matching points (an edge connecting two key frames corresponds to a set of matching point pairs). Motion-only BA adjusts only the camera parameters (mainly the poses of many key frames), but its objective function is still a nonlinear optimization based on the reprojection error.
The reprojection error is computed by reprojecting the map points of the local map onto the image using the predicted pose of the current image frame and measuring the error between the reprojected points and the feature points of the current image frame.
The optimization formula of the objective function of the motion-only bundle adjustment can be as shown in formula (5):
$$\hat{x} = \arg\min_{x}\Big( \|e_{0}\|^{2}_{\Sigma_{0}^{-1}} + \sum_{l}\|e_{z_{il}}\|^{2}_{\Sigma_{z_{il}}^{-1}} + \|e_{E_{ij}}\|^{2}_{\Sigma_{E_{ij}}^{-1}} \Big) \tag{5}$$
where $\hat{x}$ is the maximum-likelihood estimate of the pose, and the expression inside min on the right is the objective function. The first term is the prior error, which is generally taken as 0; however, when tightly coupled with an IMU and optimizing the poses of the previous and current frames at the same time, the Hessian matrix from the Motion-only BA optimization of the previous image frame can be used as the information matrix Σ_0^-1 to keep the pose of the previous image frame as unchanged as possible. The second term is the reprojection error term corresponding to the visual observations z_il. The last term is the error term formed by the encoder information E_ij (i.e. the squared norm of the error vector with respect to the covariance matrix).
In this embodiment, the encoder error term can be obtained by the following formula (6):
$$e_{E_{ij}} = \begin{bmatrix} e_{R} \\ e_{p} \end{bmatrix} = \begin{bmatrix} \mathrm{Log}\big(\breve{R}_{E_jE_i}^{\,T}\, R_{E_jE_i}\big) \\ p_{E_jE_i} - \breve{p}_{E_jE_i} \end{bmatrix} \tag{6}$$
where e_R represents the error of the relative rotation matrix (expressed through the logarithmic map, i.e. its Lie algebra, as a three-dimensional vector) and e_p represents the error of the relative translation (a three-dimensional vector). The encoder error $e_{E_{ij}}$ is thus a six-dimensional vector, and the corresponding covariance matrix $\Sigma_{E_{ij}}$ is a 6×6 square matrix; what is used here is its inverse, i.e. the information matrix. e_R, e_p, and the covariance matrix $\Sigma_{E_{ij}}$ are all computed from the rotation matrix $\breve{R}_{E_jE_i}$ and relative translation $\breve{p}_{E_jE_i}$ obtained by the pre-integration described above.
When the encoder error term in the objective function is computed, the pre-integration result and the following formula are used, i.e. the relationship between the relative transformation in the encoder coordinate system and the relative transformation in the camera coordinate system:
$$T_{E_jE_i} = T_{CE}^{-1}\,T_{C_jC_i}\,T_{CE} = T_{CE}^{-1}\,T_{C_jW}\,T_{C_iW}^{-1}\,T_{CE}$$
that is:
$$R_{E_jE_i} = R_{CE}^{T}\,R_{C_jW}\,R_{C_iW}^{T}\,R_{CE}$$
$$p_{E_jE_i} = R_{CE}^{T}\Big(R_{C_jW}R_{C_iW}^{T}\big(p_{CE}-p_{C_iW}\big) + p_{C_jW} - p_{CE}\Big)$$
where R_CiW and p_CiW are the pose of the previous image frame in the world coordinate system, and R_CE and p_CE are the elements of the fixed transformation matrix T_CE involved in formula (4), which must be calibrated in advance.
The aforementioned Motion-only BA uses nonlinear optimization to compute the optimized pose R_CjW and p_CjW of the current image frame that minimizes the objective function.
In addition, the Jacobian matrices used in the nonlinear optimization process in this embodiment can be numerical; of course, analytical solutions derived from the Lie algebra formulas can also be used, and the latter often make the optimization more stable.
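As an illustration of such an optimization, the following sketch builds a reprojection residual plus a simplified encoder residual (the pose difference in a 6-vector parameterization, rather than the exact e_R/e_p terms of formula (6)) and solves it with SciPy's least_squares; all names, shapes, and weights are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def motion_only_ba(pose0, points_w, obs_px, K, pred_pose, w_enc):
    """Motion-only BA sketch: optimize one camera pose against
    reprojection residuals plus an encoder term pulling the pose
    toward the encoder-predicted pose.

    pose0, pred_pose: 6-vectors (rotation vector + translation);
    points_w: Nx3 local-map points; obs_px: Nx2 observed pixels;
    K: 3x3 intrinsics; w_enc: scalar weight standing in for the
    encoder information matrix.
    """
    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        pc = (R @ points_w.T).T + t          # world -> camera
        uv = (K @ pc.T).T
        uv = uv[:, :2] / uv[:, 2:3]          # pinhole projection
        r_proj = (uv - obs_px).ravel()       # reprojection error
        r_enc = w_enc * (x - pred_pose)      # simplified encoder error
        return np.concatenate([r_proj, r_enc])

    sol = least_squares(residuals, pose0, loss="huber")
    return sol.x
```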
In another preferred embodiment, when an IMU (inertial measurement unit) is also used for pose optimization, the motion-only bundle adjustment further includes an IMU error term.
The optimization formula of the objective function of the motion-only bundle adjustment can then be as shown in formula (8):
$$\hat{x} = \arg\min_{x}\Big( \|e_{0}\|^{2}_{\Sigma_{0}^{-1}} + \sum_{l}\|e_{z_{il}}\|^{2}_{\Sigma_{z_{il}}^{-1}} + \|e_{E_{ij}}\|^{2}_{\Sigma_{E_{ij}}^{-1}} + \|e_{I_{ij}}\|^{2}_{\Sigma_{I_{ij}}^{-1}} \Big) \tag{8}$$
where $e_{I_{ij}}$ is the IMU error term.
In the prior art, the pose of the current image frame is usually acquired by predictive brute-force matching: the prediction transformation matrix of a constant-velocity motion model (containing a rotation matrix and a translation vector) is used to project the feature points of the previous image frame onto the current image frame, a small rectangular region is selected for fast brute-force matching, the pose and velocity of the previous image frame are used to estimate the initial pose of the current image frame, and the initial pose is then optimized to obtain the optimized pose. The present invention instead uses the wheel speeds measured by the encoders to estimate the initial pose of the current image frame and then optimizes this initial pose, which is more robust.
Further, on the basis of the above embodiments, as shown in Fig. 3, after the motion-only bundle adjustment of step S201 is performed, the method may further include:
Step S202: determine whether the vision has failed;
if the vision has failed, perform step S203; if the vision has not failed, perform step S204;
Step S203: use the predicted pose of the current image frame as the pose of the current image frame;
Step S204: use the optimized pose of the current image frame as the pose of the current image frame.
In this embodiment, when the motion-only bundle adjustment fails, i.e. when the optimized pose cannot be obtained from the objective function of the motion-only bundle adjustment described above, a visual failure is determined. A visual failure means that the Motion-only BA process has failed, i.e. the final optimized pose is inaccurate and deviates considerably from the true pose, so it is not used; instead, the predicted pose is taken as the pose of the current image frame. If the vision has not failed, i.e. the Motion-only BA process succeeds, the optimized pose can be taken as the pose of the current image frame.
In this embodiment, a visual failure can be caused by the following: sparse feature points in the image (no texture) or too much repetition with the previous image frame, blur or no overlapping region caused by high-speed motion, an occasional drop of the camera frame rate, and so on.
In a preferred embodiment, when no optimization result can be obtained through the motion-only bundle adjustment model, it is determined that a visual failure has occurred for the current image frame relative to the previous image frame.
In another embodiment, when judging whether vision has failed, the optimized pose obtained by Motion-only BA can be used to re-match the feature points of the current image frame with the frames sharing feature points with it (or to match the map points in the map against the feature points of the current image frame). During matching, the shared feature points of those frames are projected into the current image frame; if the distance between a projected feature point and the corresponding feature point of the current image frame is smaller than a threshold, the point is determined to be an inlier, otherwise it is an outlier. Whether the number of inliers in the current image frame exceeds a preset number threshold is then judged: if the number is greater than or equal to the preset number threshold, the Motion-only BA has succeeded, i.e. the vision has not failed; if it is smaller than the preset number threshold, the Motion-only BA has failed, i.e. the vision has failed.
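A minimal sketch of the inlier-counting check described above (the threshold values are illustrative):

```python
import numpy as np

def count_inliers(proj_px, obs_px, px_threshold=5.0):
    """Count matches whose reprojected position lies within a pixel
    threshold of the observed feature point (Nx2 arrays)."""
    d = np.linalg.norm(proj_px - obs_px, axis=1)
    return int(np.sum(d < px_threshold))

def vision_failed(proj_px, obs_px, min_inliers=15):
    """Declare a visual failure when too few inliers survive."""
    return count_inliers(proj_px, obs_px) < min_inliers
```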
In this embodiment, in the tracking thread S11 of visual SLAM, once the pose information of the current image frame relative to the previous image frame is obtained, the motion trajectory of the movable platform can be tracked according to the relative poses.
In this embodiment, when the camera collects the image of the current image frame, the camera pose of the current image frame can be predicted from the camera pose of the previous image frame and the first and second rolling wheel speeds detected by the encoders, giving the predicted pose of the current image frame; feature points are then matched according to the predicted pose of the current image frame, the image of the current image frame, and the key frames sharing feature points with the current image frame, so as to optimize the predicted pose and obtain the optimized pose of the current image frame.
When vision fails briefly because of special problems such as sparse feature points in the image (no texture) or too much repetition with the previous image frame, blur or no overlapping region caused by high-speed motion, or an occasional drop of the camera frame rate, i.e. when the pose of the current image frame cannot be optimized by matching feature points between the current image frame and the key frames sharing feature points with it, the predicted pose of the current image frame can be used as its pose information, thereby guaranteeing that the subsequent pipeline continues and avoiding the situation where work cannot continue because the map points and trajectories before and after the visual failure cannot be connected. Since the predicted pose of the current image frame is obtained from the encoders, and encoders do not suffer from zero drift, the accumulated error over a short time is guaranteed to stay small, which handles the transition from a brief visual failure to visual recovery and thus improves the stability and robustness of visual SLAM.
In this embodiment, after the pose information of the current image frame is obtained, the local mapping thread S12 can be run on the image frames and the corresponding pose information. Specifically, whether the current image frame is a key frame can be determined first; if it is a key frame, new map points are added at the back end through matching of the feature points of the key frames, realizing the construction of the local map and the optimization of the global map.
The visual SLAM method provided by this embodiment acquires camera images and encoder data, acquires the pose information of the camera according to the image frames and the encoder measurements, and constructs the map according to the image frames and the pose information. Applying encoders to visual SLAM improves the stability and robustness of visual SLAM and avoids the situation where, during a brief visual failure, the map points and trajectories before and after the failure cannot be connected and work cannot continue.
In the local mapping thread S12, as shown in Fig. 4, after the pose of the current image frame is obtained, the method may further include:
Step S301: determine whether the current image frame is a key frame.
If the current image frame is a key frame, it is added to the local mapping thread S12 for map construction. Specifically, this may include:
Step S302: when the current image frame is determined to be a key frame, update the constructed map according to the current image frame on the basis of the constructed map;
wherein the constructed map is constructed from the key frames determined before the current image frame.
In this embodiment, in order to reduce the computational resource consumption of the map construction process, only key frames that can provide enough map points are added to the mapping thread; for an RGB-D camera or a binocular camera, nearby map points can also be directly inserted into the map as new map points.
In the prior art, it is usually judged whether the distance between the current image frame and the previous key frame is within a preset distance range, i.e. the current image frame is taken as a new key frame only when it is neither too far from nor too close to the previous key frame; alternatively, the current image frame is also taken as a new key frame when its number of feature point matches exceeds a preset matching number threshold and it provides enough new map points, where the number of feature point matches refers to the number of matches between the feature points of the current image frame and the map points in the map, or between the feature points of the current image frame and those of the key frames sharing feature points with it. For a monocular camera tightly coupled with an IMU, to prevent zero drift the time interval between the current image frame and the previous key frame must also not be too long, i.e. when this time interval exceeds a preset time interval threshold, the current image frame is always taken as a new key frame.
In this embodiment, when vision has not failed, the above prior-art key frame decision strategy can be used. When vision has failed, since the predicted pose of the current image frame can be obtained from the encoder data, two frames are allowed to be connected through the encoders (i.e. pure encoder edges are allowed): it is no longer required that the distance between the current image frame and the previous key frame exceed the minimum of the preset distance range, nor that the number of feature point matches of the current image frame exceed the preset matching number threshold, so these two restrictions can be removed. However, for an RGB-D camera or a binocular camera, a minimum number of map points created by the current image frame needs to be added, i.e. the number of map points created by the current image frame must exceed a preset map point number threshold, to prevent it from contributing so few map points that adding it is meaningless. For a monocular camera tightly coupled with an IMU, a minimum time interval restriction is added so that some map points are tentatively constructed by triangulation in the mapping thread (local mapping) to recover visual tracking, i.e. the time interval between the current image frame and the previous key frame needs to exceed a preset minimum time interval; and since redundant or unimportant key frames and map points are removed in the mapping thread (local mapping), the time criterion adopted for the monocular camera does not cause redundancy problems.
Further, after vision recovers, the key frame decision strategy used during the visual failure can be reverted to the prior-art key frame decision strategy described above.
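A sketch of this two-regime key frame policy; the cfg object and its thresholds are illustrative assumptions:

```python
def should_insert_keyframe(vision_ok, dist_to_last_kf, n_matches,
                           n_new_points, dt_since_last_kf, cfg):
    """Key frame policy sketch mirroring the two regimes described above."""
    if vision_ok:
        # Normal regime: neither too near nor too far from the last key
        # frame, or enough matches contributing enough new map points.
        dist_ok = cfg.d_min <= dist_to_last_kf <= cfg.d_max
        match_ok = (n_matches > cfg.min_matches
                    and n_new_points > cfg.min_new_points)
        return dist_ok or match_ok
    # Visual-failure regime: pure encoder edges are allowed, so the
    # distance and match-count limits are dropped; keep only the
    # minimum-new-map-points rule (RGB-D/stereo) and a minimum time gap
    # (monocular + IMU) to probe for recovery by triangulation.
    return (n_new_points > cfg.min_new_points
            or dt_since_last_kf > cfg.min_dt)
```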
Further, as shown in Fig. 5, updating the constructed map according to the current image frame includes:
Step S401: remove redundant key frames and redundant map points according to a preset strategy, and perform a triangulation operation according to the current image frame to generate new map points;
Step S402: after the new map points are generated, construct the covisibility graph of the current image frame and perform local bundle adjustment, so as to optimize the other key frames covisible with the current image frame, the map points visible to the current image frame, and the map points visible to those other key frames; wherein the local bundle adjustment includes an encoder error term.
In this embodiment, after a new key frame is obtained, redundant map points can first be removed, where redundant map points may include map points not near the current key frame that have been observed too few times, and map points with a small ratio of consecutive observations over the last few frames; feature points in the current image frame that can be fused with existing map points may replace those existing map points through fusion.
After the local BA optimizes the key frame poses and map points, redundant key frames can also be removed, where a redundant key frame is one for which at least one other key frame sees more than a threshold proportion (for example 90%) of the same map points; in this embodiment, the key frames that can see the same map points as the current key frame (the frame nodes in the covisibility graph) are checked for redundancy. Removing redundant map points and redundant key frames in this way avoids having too many map points in the map, reduces the amount of computation, and improves mapping efficiency.
In addition, as shown in Fig. 6, constructing the covisibility graph of the current image frame and performing local bundle adjustment includes:
Step S501: when any key frame is connected to the previous key frame by a pure encoder edge and is also connected to the next key frame by a pure encoder edge, remove that key frame, and obtain a second time period according to the timestamp of the previous key frame and the timestamp of the next key frame; wherein a pure encoder edge indicates that the visual failure occurred between adjacent key frames;
Step S502: obtain all the measurement data of the first encoder and all the measurement data of the second encoder within the second time period;
Step S503: obtain the prediction transformation matrix of the next key frame relative to the previous key frame according to all the measurement data of the first encoder and all the measurement data of the second encoder within the second time period.
Specifically, after a new key frame is obtained, key frames connected by consecutive pure encoder edges can also be removed. A chain of consecutive pure-encoder-edge connections looks like "key frame 1 - encoder edge 1 - key frame 2 - encoder edge 2 - key frame 3": the relative pose of key frame 2 with respect to key frame 1 is a predicted pose obtained from the encoder data (encoder edge 1), and the relative pose of key frame 3 with respect to key frame 2 is likewise a predicted pose obtained from the encoder data (encoder edge 2). Removing consecutive pure encoder edges reduces the amount of computation in subsequent optimization and prevents large errors from triangulation based only on encoder data (owing to encoder precision and friction with the ground, the encoders are less accurate than vision). For the three key frames in the example above, key frame 2, connected by consecutive pure encoder edges, can be deleted, yielding "key frame 1 - encoder edge - key frame 3", where the encoder edge needs to be recomputed, i.e. the predicted pose of key frame 3 relative to key frame 1 is calculated through the pre-integration process of the above embodiments.
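A sketch of collapsing consecutive pure encoder edges, reusing the preintegrate_encoder sketch shown earlier; the edge and key frame data structures (the pure_encoder flag, samples list, and wheel track b) are illustrative assumptions:

```python
def collapse_pure_encoder_edges(keyframes, edges):
    """Remove any key frame whose links to both neighbors are pure
    encoder edges, re-deriving the merged edge by pre-integrating over
    the combined time span (edges[k] connects keyframes[k] and
    keyframes[k+1])."""
    i = 1
    while i + 1 < len(keyframes):
        if edges[i - 1].pure_encoder and edges[i].pure_encoder:
            merged = edges[i - 1].samples + edges[i].samples
            edges[i - 1].samples = merged          # now spans kf[i-1] -> kf[i+1]
            edges[i - 1].relative = preintegrate_encoder(merged, b=edges[i - 1].b)
            del keyframes[i]
            del edges[i]
        else:
            i += 1
    return keyframes, edges
```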
In this embodiment, new map points can be created by triangulation, where triangulation obtains, from two points (the relative poses of the key frames) and the rays passing through them (the pixel coordinates, i.e. directions, of a matching point pair), the position of the ray intersection relative to those two points (the depth or position of the matched pair, i.e. of the 3D point).
Specifically, for a binocular camera, consider a fixed 3D point X in space observed by both the left camera and the right camera, with imaging position x in the left camera and x' in the right camera. According to the pinhole camera model and the camera projection equation, two rays can be computed from x and x' respectively, and both rays theoretically pass through the 3D point X; the intersection of the two rays is therefore the position of X in space, so the position of X can be determined. For a monocular camera, two adjacent frames (with a known relative pose between them) that simultaneously observe a fixed 3D point X can take the place of the left-camera and right-camera images of a binocular camera; the triangulation process is the same as for a binocular camera and is not repeated here. For an RGB-D camera, since the preprocessing has already computed the coordinates of the feature points relative to the virtual right camera, making the image collected by the RGB-D camera equivalent to one collected by a binocular camera, the same triangulation process as for a binocular camera can be used, which is likewise not repeated here.
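A minimal two-view triangulation sketch using OpenCV's triangulatePoints, which takes the 3x4 projection matrices of the two views and the matched pixel coordinates:

```python
import numpy as np
import cv2

def triangulate(P1, P2, pts1, pts2):
    """Recover 3D points from two views.

    P1, P2: 3x4 projection matrices (K [R|t]) of the two views;
    pts1, pts2: 2xN arrays of matched pixel coordinates.
    Returns an Nx3 array of Euclidean 3D points, the (approximate)
    intersections of the back-projected rays.
    """
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous
    return (X_h[:3] / X_h[3]).T                      # Nx3 Euclidean
```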
In this embodiment, key frame poses and map points can be optimized through local BA, where the local BA process is similar to the motion-only BA optimization of the foregoing embodiments, and the local BA objective function likewise includes an encoder error term. In prior-art local BA, the key frames participating in the optimization are usually those in the covisibility graph of the current key frame (a graph whose vertices are frames and map points and whose edges are the reprojection errors between each frame and the map points it can see; the participating key frames are those that observe map points seen by the current key frame, and the map points in the graph are all points visible to those key frames). In this embodiment, the key frames participating in local BA are instead the N consecutive key frames preceding the current image frame (N ≥ 1); the local BA objective function is constructed over these N consecutive key frames to optimize the pose of the current key frame and the map points, which reduces the amount of computation.
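Written out, a local BA objective of the kind described here can be sketched as follows; the notation (poses $T_k$, map points $p_m$, projection $\pi$, robust kernel $\rho$, pre-integrated encoder residual $e^{\mathrm{enc}}_{k-1,k}$) is assumed for illustration and is not the patent's exact formula:

$$
\min_{\{T_k\},\{p_m\}}\;
\sum_{k}\sum_{m\in\mathcal{V}(k)}
\rho\!\left(\bigl\|x_{km}-\pi(T_k,\,p_m)\bigr\\|^{2}_{\Sigma_{km}}\right)
\;+\;
\sum_{k}\bigl\|e^{\mathrm{enc}}_{k-1,k}\bigr\|^{2}_{\Sigma^{\mathrm{enc}}_{k-1,k}},
$$

where $k$ ranges over the current key frame and its $N$ preceding key frames, $\mathcal{V}(k)$ is the set of map points visible to key frame $k$, $x_{km}$ is the observed pixel of point $p_m$ in frame $k$, and the second sum is the encoder error term built from the pre-integrated relative transformations between consecutive key frames.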
Further, after the local map is obtained, the current key frame is sent to the loop detection thread S13 for loop-closure detection, to determine whether the movable platform is at a position it has previously visited.

In prior-art loop-closure detection, the map database is queried, the similarity transformation (SE3 or Sim3) between the current key frame and the loop-closure key frame is computed using the RANSAC (Random Sample Consensus) algorithm, and the map points seen by the current key frame and by the key frames near the loop-closure key frame are merged through loop fusion. The RANSAC algorithm randomly samples a few pairs of matching points to compute a transformation, uses probability and statistics to find the most correct transformation, and distinguishes which matching point pairs are correct (a correct matching pair is called an inlier).
In this embodiment, a bag-of-words (BoW) vector can be obtained for each key frame, and the distance (for example, the Euclidean distance) between the BoW vector of any key frame and that of the current key frame is computed; whether a loop closure exists is judged from the distance between the two BoW vectors. The BoW vector is obtained from the feature points of the key frame and a preset feature-point vocabulary, where the vocabulary is built by feature clustering and contains feature-point entries arranged in a predetermined order. The feature points of a key frame are matched against each entry in the vocabulary: if an entry appears in the key frame, the corresponding element is set to 1, otherwise to 0, yielding the BoW vector of that key frame. The BoW vector of every key frame is stored in the database, and whether the current key frame forms a loop closure can be judged by computing the distance between its BoW vector and the BoW vectors of the key frames in the database; the matched key frame is the loop-closure key frame of the current image frame.
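The binary BoW vector and the distance test described above can be sketched as follows; the vocabulary layout, the matcher `match_fn`, and the threshold are assumptions used only for illustration:

```python
import numpy as np

def bow_vector(frame_descriptors, vocabulary, match_fn):
    """Binary BoW vector: element i is 1 if vocabulary word i is matched
    by some feature descriptor of the key frame, else 0.
    vocabulary: word descriptors in their predetermined order.
    match_fn(word, descriptors) -> bool is an assumed matcher."""
    return np.array([1.0 if match_fn(word, frame_descriptors) else 0.0
                     for word in vocabulary])

def detect_loop(current_vec, database_vecs, distance_threshold):
    """Return the index of the closest stored key frame if the Euclidean
    distance between its BoW vector and the current one is below the
    threshold; otherwise return None (no loop closure)."""
    dists = [np.linalg.norm(current_vec - v) for v in database_vecs]
    best = int(np.argmin(dists))
    return best if dists[best] < distance_threshold else None
```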
Further, in this embodiment, the thresholds for loop detection can be lowered after a visual failure, where these thresholds include the number of repeated detections and the thresholds on the number of matching points and inliers used when computing the similarity transformation. By relaxing the loop-detection conditions, a loop closure can be detected quickly after a visual failure, so as to reduce the error accumulated by pure encoder edges during the failure. After vision recovers, the detection thresholds can be restored to their normal values, avoiding the computational cost of excessively frequent loop closures.

Further, when a loop closure is detected, pose graph optimization can be performed.

Here, the pose graph is a graph whose vertices contain no map points and consist only of all key frames, and whose edges contain only the relative transformations between frames on the spanning tree and the similarity transformations corresponding to all loop-closure edges. The spanning tree is a loop-free graph connecting the key frames of the entire map; in general, two temporally consecutive key frames become parent and child nodes, but when a parent node is deleted, the parent of each child node is reassigned according to the number of commonly observed map points (this corresponds to the key frame removal operation, which also updates the pre-integration of the child nodes). The root node is the initial key frame created when the visual SLAM system started mapping.
In this embodiment, the pose graph optimization optimizes the pose of each key frame. The objective function is the sum of squares of the errors $e_{ij}$ between the edge relative transformations $\Delta T_{ij}$ and the vertex poses, weighted by the covariance matrices $\Sigma_{ij}$ (originally 3×3 identity matrices $I$):

$$
\min_{\{T_k\}} \sum_{(i,j)} e_{ij}^{\top}\,\Sigma_{ij}^{-1}\,e_{ij},
\qquad
e_{ij} = \log\!\left(\Delta T_{ij}\,T_j^{-1}\,T_i\right)^{\vee},
$$

where $T_i$ and $T_j$ are the poses of the key frames joined by the edge.
In this embodiment, when a visual failure occurs between adjacent key frames, a corrected covariance matrix is generated according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between those adjacent key frames. Specifically, the covariance matrix $\Sigma_{\mathrm{pre}}$ obtained from pre-integration can be used to modify the covariance matrix $\Sigma_{ij}$ of the relative transformation $\Delta T_{ij}$ corresponding to a pure encoder edge on the spanning tree, so that its diagonal elements are generally greater than 1. This expresses that a pure encoder edge is less accurate than an edge in which the encoder is tightly coupled with vision, allowing the nonlinear optimization to down-weight the error introduced by pure encoder edges more strongly. Through the above pose graph optimization, the error introduced by pure encoder edges can be distributed over all edges. For example, on the spanning tree "key frame 1 - key frame 2 - key frame 3", when a loop closure occurs between key frame 3 and key frame 1, a loop-closure edge (the similarity transformation SE3 or Sim3 between the current key frame and the loop-closure key frame) is formed between key frame 1 and key frame 3, and the accumulated error between key frame 1 and key frame 3 can be distributed over all the edges of "key frame 1 - key frame 2 - key frame 3", so that the error on each edge is relatively small and within an acceptable range.
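A sketch of the down-weighting described above, assuming planar 3×3 covariances (x, y, heading); the clamping value and function name are illustrative assumptions, not values from the patent:

```python
import numpy as np

def inflate_pure_encoder_covariance(sigma_preint, min_diag=1.0):
    """Replace the default 3x3 identity covariance of a pure encoder
    edge with the pre-integrated covariance, clamping the diagonal so
    each element is at least min_diag; diagonal elements above those of
    the identity make the optimizer trust this edge less than edges
    where the encoder is tightly coupled with vision."""
    sigma = np.array(sigma_preint, dtype=float).copy()
    d = np.arange(3)
    sigma[d, d] = np.maximum(sigma[d, d], min_diag)
    return sigma
```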
Further, after the pose graph optimization is completed, global BA optimization can also be performed. Global BA is similar to the motion-only BA and local BA optimization processes of the foregoing embodiments, and its objective function likewise includes an encoder error term; it differs from local BA in that the key frames participating in global BA are all the key frames and map points in the entire map library. After global BA is completed, the map can be updated: all map points are updated, and all key frames in the map library are updated up to the latest key frame.

On the basis of any of the foregoing embodiments, if the visual SLAM is tightly coupled with an IMU, the IMU needs to be initialized to obtain its biases, including the gyroscope bias and the accelerometer bias; the gravity vector can also be obtained. In addition, for a monocular camera tightly coupled with an IMU, the map scale also needs to be determined through IMU initialization, whereas stereo and RGBD cameras usually do not need IMU initialization to determine the map scale; of course, when IMU initialization is used to determine the map scale for a stereo or RGBD camera, the initialization result is close to 1.

In this embodiment, when the visual SLAM has just started running, a predetermined number of the earliest key frames (for example, 20) can be taken, and the IMU biases, the gravity vector, and the map scale are obtained from these key frames. The map scale is then propagated to all key frames, including the predetermined number of key frames and any new key frames obtained during initialization, after which global BA optimization is performed to update the map scale; thereafter the visual SLAM can run with the updated map scale.
On the basis of any of the foregoing embodiments, this embodiment further provides a lightweight positioning method.

Specifically, in this embodiment, after map construction is completed, a minimal saving scheme can be adopted: only the key frames and map point information necessary for the map are saved, and binary file storage can be used. The main stored content involves the sensor type; for each key frame, its index, pose, bag-of-words vector, feature point sequence and inter-frame sensor information sequence, the indices of the map points it observes, and the indices of the loop-closure key frames connected to it; for each map point, its index, position, reference key frame index, and the indices of the key frames from which it can be observed; and the spanning tree structure (i.e., the parent node indices).

When reading, the inverse of the saving operation is performed: key frames and map points are reconstructed, and related information is updated (operations such as rebuilding the spanning tree structure and the covisibility information, and adding key frames to the map database). When reconstructing key frames and map points, their indices can be re-created and need not be the original indices.
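A compact binary layout along these lines might look as follows; the field names, attribute names, and the use of `pickle` are assumptions used only to illustrate the "save the minimum, rebuild the rest" idea:

```python
import pickle

def save_minimal_map(path, sensor_type, keyframes, map_points):
    """Store only what is needed to rebuild the map; everything else
    (covisibility graph, database entries) is reconstructed on load."""
    payload = {
        "sensor_type": sensor_type,
        "keyframes": [{
            "id": kf.id, "pose": kf.pose, "bow": kf.bow_vector,
            "features": kf.features, "encoder_meas": kf.encoder_meas,
            "observed_points": kf.observed_point_ids,
            "loop_edges": kf.loop_keyframe_ids,
            "parent": kf.parent_id,              # spanning tree structure
        } for kf in keyframes],
        "map_points": [{
            "id": mp.id, "position": mp.position,
            "ref_keyframe": mp.ref_keyframe_id,
            "observers": mp.observer_keyframe_ids,
        } for mp in map_points],
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)
```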
In this embodiment, the map information of a visual SLAM system that has been running for a long time can be stored in a very small storage space, which increases the maximum working duration of the visual SLAM system, although reloading requires more time to reconstruct the map and trajectory. When using the reconstructed map, the visual SLAM can start only the tracking thread S11 for positioning, without starting the local mapping thread S12 and the loop-closure thread, thereby achieving lightweight positioning. It should be noted that, in the case of a monocular camera tightly coupled with an IMU, IMU initialization (obtaining the initial biases and the gravitational acceleration) is still required.
An embodiment of the present invention provides a movable platform. FIG. 7 is a structural diagram of the movable platform provided by an embodiment of the present invention. As shown in FIG. 7, the movable platform 700 includes a first rolling wheel 701, a second rolling wheel 702, a first encoder 703, a second encoder 704, a camera 705, a memory 707, and a processor 706.

The first rolling wheel 701 and the second rolling wheel 702 are used to displace the movable platform, and the axes of the first rolling wheel 701 and the second rolling wheel 702 coincide.

The camera 705 is used to collect image frames.

The first encoder 703 is used to detect the rate of the first rolling wheel 701, and the second encoder 704 is used to detect the rate of the second rolling wheel 702.

The memory 707 is used to store program code.

The processor 706 calls the program code and, when the program code is executed, is configured to perform the following operations:
obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;

acquiring all measurement data of the first encoder 703 and all measurement data of the second encoder 704 within the first time period;

obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to all measurement data measured by the first encoder 703 and all measurement data of the second encoder 704 within the first time period;

obtaining the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame.
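As a hedged illustration of these four operations for a platform with two coaxial wheels, the planar relative pose can be propagated by Euler-integrating standard differential-drive kinematics over the encoder samples in the first time period; the wheel radius, track width, and sample layout below are assumptions, not parameters from the patent:

```python
import numpy as np

def predict_relative_pose(encoder_samples, wheel_radius, track_width):
    """encoder_samples: time-ordered list of (t, w_left, w_right) wheel
    angular rates inside the first time period [t_prev_frame, t_cur_frame].
    Returns the 2D transform (R, t) of the current frame relative to the
    previous frame, by integrating differential-drive kinematics."""
    theta, x, y = 0.0, 0.0, 0.0
    for (t0, wl, wr), (t1, _, _) in zip(encoder_samples, encoder_samples[1:]):
        dt = t1 - t0
        v = wheel_radius * (wl + wr) / 2.0           # linear velocity
        w = wheel_radius * (wr - wl) / track_width   # angular velocity
        x += v * np.cos(theta) * dt                  # integrate position derivative
        y += v * np.sin(theta) * dt
        theta += w * dt                              # integrate rotation derivative
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R, np.array([x, y])

# The predicted pose of the current frame is then the previous pose
# composed with this relative transform:
#   R_cur = R_prev @ R;  t_cur = R_prev @ t + t_prev
```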
On the basis of the foregoing embodiment, when the processor 706 obtains the predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, the processor is configured to:

obtain, according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, a first derivative of the rotation matrix of the current image frame with respect to time and a second derivative of the position of the current image frame with respect to time;

integrate, based on the first time period, the first derivative and the second derivative respectively, to obtain a predicted rotation matrix and a predicted translation matrix of the current image frame relative to the previous image frame;

obtain the predicted transformation matrix of the current image frame relative to the previous image frame according to the predicted rotation matrix and the predicted translation matrix of the current image frame relative to the previous image frame.
On the basis of any of the foregoing embodiments, after the processor 706 obtains the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame, the processor 706 is further configured to:

optimize the predicted pose of the current image frame according to a preset motion-only bundle adjustment model, wherein the motion-only bundle adjustment model includes an encoder error term obtained from the measurement data of the first encoder 703 and the measurement data of the second encoder 704.
On the basis of any of the foregoing embodiments, the processor 706 is further configured such that:

if a visual failure occurs in the current image frame relative to the previous image frame, the pose of the current image frame is the predicted pose; and

if no visual failure occurs in the current image frame relative to the previous image frame, the pose of the current image frame is the optimization result of the motion-only bundle adjustment model.
On the basis of any of the foregoing embodiments, the processor 706 is further configured to:

when an optimization result cannot be obtained through the motion-only bundle adjustment model, determine whether a visual failure has occurred in the current image frame relative to the previous image frame; or,

perform feature matching between the current image frame and the previous image frame according to the optimized pose to obtain the number of inliers; and,

when the number of inliers is less than a preset threshold, determine whether a visual failure has occurred in the current image frame relative to the previous image frame.
On the basis of any of the foregoing embodiments, the processor 706 is further configured to:

when it is determined that the current image frame is a key frame, update the constructed map according to the current image frame on the basis of the constructed map, wherein the constructed map is constructed from the key frames determined before the current image frame.
On the basis of any of the foregoing embodiments, when the processor 706 updates the constructed map according to the current image frame, the processor 706 is configured to:

remove redundant key frames and redundant map points according to a preset strategy, and perform a triangulation operation according to the current image frame to generate new map points; and

after the new map points are generated, construct the covisibility graph of the current image frame and perform local bundle adjustment, so as to optimize the other key frames sharing covisibility with the current image frame, the map points visible to the current image frame, and the map points visible to those other key frames, wherein the local bundle adjustment includes an encoder error term.
On the basis of any of the foregoing embodiments, when the processor 706 constructs the covisibility graph of the current image frame to perform local bundle adjustment, the processor 706 is configured to:

when any key frame is connected to the preceding key frame by a pure encoder edge and the key frame is also connected to the following key frame by a pure encoder edge, remove the key frame, and obtain a second time period according to the timestamp of the preceding key frame and the timestamp of the following key frame, wherein a pure encoder edge means that the visual failure occurred between the adjacent key frames;

acquire all measurement data of the first encoder 703 and all measurement data of the second encoder 704 within the second time period; and

obtain a predicted transformation matrix of the following key frame relative to the preceding key frame according to all measurement data measured by the first encoder 703 and all measurement data of the second encoder 704 within the second time period.
On the basis of any of the foregoing embodiments, when the processor 706 constructs the covisibility graph of the current image frame to perform local bundle adjustment, the processor 706 is further configured to:

acquire N consecutive key frames preset before the current image frame, and construct a local optimization graph according to the current image frame and the map points it can see, and the N consecutive key frames and the map points they can see, where N is greater than or equal to 1;

construct the encoder error term according to the predicted transformation matrix between any two adjacent key frames in the local optimization graph, and perform reprojection according to the poses of the map points on each key frame to obtain reprojection errors; and

optimize, according to the encoder errors and the reprojection errors, the current image frame and the map points it can see and the N consecutive key frames and the map points they can see, and update the poses of all key frames and the positions of all map points in the local optimization graph.
On the basis of any of the foregoing embodiments, the processor 706 is further configured to:

perform loop-closure detection according to the feature points of the current image frame; and

when a loop closure between the current image frame and any key frame is detected, fuse the map points seen by the current image frame and by the key frames near that key frame.
On the basis of any of the foregoing embodiments, after the processor 706 performs loop-closure detection, the processor 706 is further configured to:

construct a pose graph to optimize all key frames so as to update the poses of all key frames, wherein the pose graph includes all key frames and the relative transformations between pairs of key frames.
On the basis of any of the foregoing embodiments, when the processor 706 constructs the pose graph to optimize all key frames, the processor 706 is configured to:

when a visual failure occurs between adjacent key frames, generate a corrected covariance matrix according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between the adjacent key frames.
On the basis of any of the foregoing embodiments, after the processor 706 constructs the pose graph to optimize all key frames, the processor 706 is further configured to:

update all key frames and all map points according to a preset global bundle adjustment.
In another embodiment, the processor 706 calls the program code and, when the program code is executed, is configured to perform the following operations:

obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;

acquiring all measurement data of the first encoder 703 and all measurement data of the second encoder 704 within the first time period;

obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to all measurement data measured by the first encoder 703 and all measurement data of the second encoder 704 within the first time period;

obtaining the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame; and

after obtaining the predicted pose of the current image frame, updating the constructed map according to the current image frame on the basis of the constructed map, wherein the constructed map is constructed from the key frames determined before the current image frame.
The specific principles and implementations of the movable platform provided by the embodiments of the present invention are similar to those of the foregoing embodiments and are not repeated here.

In the movable platform provided by this embodiment, a first time period is obtained according to the timestamp of the current image frame and the timestamp of the previous image frame; the measurement data of the first encoder and the measurement data of the second encoder within the first time period are acquired; a predicted transformation matrix of the current image frame relative to the previous image frame is obtained according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period; and the predicted pose of the current image frame is obtained according to the predicted transformation matrix and the pose information of the previous image frame. By applying encoders to visual SLAM and predicting the pose of the current image frame from the encoder measurement data, this embodiment can improve the accuracy of pose acquisition and the stability and robustness of visual SLAM, so that visual SLAM can still run stably when vision fails.
In addition, this embodiment further provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the pose prediction method and/or the map construction method described in the foregoing embodiments.

In addition, this embodiment further provides a computer program including program code, wherein, when a computer runs the computer program, the program code executes the pose prediction method and/or the map construction method described in the foregoing embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the above functional modules is used as an example for illustration; in practical applications, the above functions may be assigned to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (54)

  1. A pose prediction method, applicable to a movable platform, wherein the movable platform comprises a first rolling wheel, a second rolling wheel, a camera, a first encoder, and a second encoder; the first rolling wheel and the second rolling wheel are used to displace the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide; the camera is used to collect image frames, the first encoder is used to detect the rate of the first rolling wheel, and the second encoder is used to detect the rate of the second rolling wheel; the method comprising:

    obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;

    acquiring the measurement data of the first encoder and the measurement data of the second encoder within the first time period;

    obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period; and

    obtaining the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame.
  2. The method according to claim 1, wherein the obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period comprises:

    obtaining, according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, a first derivative of the rotation matrix of the current image frame with respect to time and a second derivative of the position of the current image frame with respect to time;

    integrating, based on the first time period, the first derivative and the second derivative respectively to obtain a predicted rotation matrix and a predicted translation matrix of the current image frame relative to the previous image frame; and

    obtaining the predicted transformation matrix of the current image frame relative to the previous image frame according to the predicted rotation matrix and the predicted translation matrix of the current image frame relative to the previous image frame.
  3. The method according to claim 1 or 2, wherein after the obtaining the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame, the method further comprises:

    optimizing the predicted pose of the current image frame according to a preset motion-only bundle adjustment model, wherein the motion-only bundle adjustment model includes an encoder error term calculated based on the measurement data of the first encoder and the measurement data of the second encoder.
  4. The method according to claim 3, further comprising:

    if a visual failure occurs in the current image frame relative to the previous image frame, taking the pose of the current image frame to be the predicted pose; and

    if no visual failure occurs in the current image frame relative to the previous image frame, taking the pose of the current image frame to be the optimization result of the motion-only bundle adjustment model.
  5. The method according to claim 4, further comprising:

    when an optimization result cannot be obtained through the motion-only bundle adjustment model, determining that a visual failure occurs in the current image frame relative to the previous image frame; or,

    performing feature matching between the current image frame and the previous image frame according to the optimized pose to obtain the number of inliers; and,

    when the number of inliers is less than a preset threshold, determining that a visual failure occurs in the current image frame relative to the previous image frame.
  6. The method according to any one of claims 1-5, further comprising:

    when it is determined that the current image frame is a key frame, updating the constructed map according to the current image frame on the basis of the constructed map, wherein the constructed map is constructed from the key frames determined before the current image frame.
  7. The method according to claim 6, wherein the updating the constructed map according to the current image frame comprises:

    removing redundant key frames and redundant map points according to a preset strategy, and performing a triangulation operation according to the current image frame to generate new map points; and

    after the new map points are generated, constructing the covisibility graph of the current image frame and performing local bundle adjustment, so as to optimize the other key frames sharing covisibility with the current image frame, the map points visible to the current image frame, and the map points visible to the other key frames, wherein the local bundle adjustment includes an encoder error term.
  8. The method according to claim 7, wherein the constructing the covisibility graph of the current image frame and performing local bundle adjustment comprises:

    when any key frame is connected to the preceding key frame by a pure encoder edge and the key frame is also connected to the following key frame by a pure encoder edge, removing the key frame, and obtaining a second time period according to the timestamp of the preceding key frame and the timestamp of the following key frame, wherein a pure encoder edge means that the visual failure occurred between the adjacent key frames;

    acquiring all measurement data of the first encoder and all measurement data of the second encoder within the second time period; and

    obtaining a predicted transformation matrix of the following key frame relative to the preceding key frame according to all measurement data measured by the first encoder and all measurement data of the second encoder within the second time period.
  9. The method according to claim 7 or 8, wherein the constructing the covisibility graph of the current image frame and performing local bundle adjustment further comprises:

    acquiring N consecutive key frames preset before the current image frame, and constructing a local optimization graph according to the current image frame and the map points it can see, and the N consecutive key frames and the map points they can see, wherein N is greater than or equal to 1;

    constructing the encoder error term according to the predicted transformation matrix between any two adjacent key frames in the local optimization graph, and performing reprojection according to the poses of the map points on each key frame to obtain reprojection errors; and

    optimizing, according to the encoder errors and the reprojection errors, the current image frame and the map points it can see and the N consecutive key frames and the map points they can see, and updating the poses of all key frames and the positions of all map points in the local optimization graph.
  10. The method according to any one of claims 1-9, further comprising:

    performing loop-closure detection according to the feature points of the current image frame; and

    when a loop closure between the current image frame and any key frame is detected, fusing the map points seen by the current image frame and by the key frames near the key frame.
  11. The method according to claim 10, wherein after the loop-closure detection is performed, the method further comprises:

    constructing a pose graph to optimize all key frames so as to update the poses of all key frames, wherein the pose graph includes all key frames and the relative transformations between pairs of key frames.
  12. The method according to claim 11, wherein the constructing a pose graph to optimize all key frames comprises:

    when a visual failure occurs between adjacent key frames, generating a corrected covariance matrix according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between the adjacent key frames.
  13. The method according to claim 11 or 12, wherein after the constructing a pose graph to optimize all key frames, the method further comprises:

    updating all key frames and all map points according to a preset global bundle adjustment.
  14. A movable platform, comprising: a first rolling wheel, a second rolling wheel, a camera, a first encoder, a second encoder, a memory, and a processor;

    wherein the first rolling wheel and the second rolling wheel are used to displace the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide;

    the camera is used to collect image frames;

    the first encoder is used to detect the rate of the first rolling wheel, and the second encoder is used to detect the rate of the second rolling wheel;

    the memory is used to store program code; and

    the processor calls the program code and, when the program code is executed, is configured to perform the following operations:

    obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;

    acquiring the measurement data of the first encoder and the measurement data of the second encoder within the first time period;

    obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period; and

    obtaining the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame.
  15. The movable platform according to claim 14, wherein when the processor obtains the predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, the processor is configured to:

    obtain, according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, a first derivative of the rotation matrix of the current image frame with respect to time and a second derivative of the position of the current image frame with respect to time;

    integrate, based on the first time period, the first derivative and the second derivative respectively to obtain a predicted rotation matrix and a predicted translation matrix of the current image frame relative to the previous image frame; and

    obtain the predicted transformation matrix of the current image frame relative to the previous image frame according to the predicted rotation matrix and the predicted translation matrix of the current image frame relative to the previous image frame.
  16. The movable platform according to claim 14 or 15, wherein after the processor obtains the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame, the processor is further configured to:

    optimize the predicted pose of the current image frame according to a preset motion-only bundle adjustment model, wherein the motion-only bundle adjustment model includes an encoder error term calculated based on the measurement data of the first encoder and the measurement data of the second encoder.
  17. The movable platform according to claim 16, wherein the processor is further configured such that:

    if a visual failure occurs in the current image frame relative to the previous image frame, the pose of the current image frame is the predicted pose; and

    if no visual failure occurs in the current image frame relative to the previous image frame, the pose of the current image frame is the optimization result of the motion-only bundle adjustment model.
  18. The movable platform according to claim 17, wherein the processor is further configured to:

    when an optimization result cannot be obtained through the motion-only bundle adjustment model, determine whether a visual failure has occurred in the current image frame relative to the previous image frame; or,

    perform feature matching between the current image frame and the previous image frame according to the optimized pose to obtain the number of inliers; and,

    when the number of inliers is less than a preset threshold, determine whether a visual failure has occurred in the current image frame relative to the previous image frame.
  19. The movable platform according to any one of claims 14-18, wherein the processor is further configured to:

    when it is determined that the current image frame is a key frame, update the constructed map according to the current image frame on the basis of the constructed map, wherein the constructed map is constructed from the key frames determined before the current image frame.
  20. The movable platform according to claim 19, wherein when the processor updates the constructed map according to the current image frame, the processor is configured to:

    remove redundant key frames and redundant map points according to a preset strategy, and perform a triangulation operation according to the current image frame to generate new map points; and

    after the new map points are generated, construct the covisibility graph of the current image frame and perform local bundle adjustment, so as to optimize the other key frames sharing covisibility with the current image frame, the map points visible to the current image frame, and the map points visible to the other key frames, wherein the local bundle adjustment includes an encoder error term.
  21. The movable platform according to claim 20, wherein when the processor constructs the covisibility graph of the current image frame to perform local bundle adjustment, the processor is configured to:

    when any key frame is connected to the preceding key frame by a pure encoder edge and the key frame is also connected to the following key frame by a pure encoder edge, remove the key frame, and obtain a second time period according to the timestamp of the preceding key frame and the timestamp of the following key frame, wherein a pure encoder edge means that the visual failure occurred between the adjacent key frames;

    acquire all measurement data of the first encoder and all measurement data of the second encoder within the second time period; and

    obtain a predicted transformation matrix of the following key frame relative to the preceding key frame according to all measurement data measured by the first encoder and all measurement data of the second encoder within the second time period.
  22. The movable platform according to claim 20 or 21, wherein when the processor constructs the covisibility graph of the current image frame to perform local bundle adjustment, the processor is further configured to:

    acquire N consecutive key frames preset before the current image frame, and construct a local optimization graph according to the current image frame and the map points it can see, and the N consecutive key frames and the map points they can see, wherein N is greater than or equal to 1;

    construct the encoder error term according to the predicted transformation matrix between any two adjacent key frames in the local optimization graph, and perform reprojection according to the poses of the map points on each key frame to obtain reprojection errors; and

    optimize, according to the encoder errors and the reprojection errors, the current image frame and the map points it can see and the N consecutive key frames and the map points they can see, and update the poses of all key frames and the positions of all map points in the local optimization graph.
  23. The movable platform according to any one of claims 14-22, wherein the processor is further configured to:

    perform loop-closure detection according to the feature points of the current image frame; and

    when a loop closure between the current image frame and any key frame is detected, fuse the map points seen by the current image frame and by the key frames near the key frame.
  24. The movable platform according to claim 23, wherein after the processor performs loop-closure detection, the processor is further configured to:

    construct a pose graph to optimize all key frames so as to update the poses of all key frames, wherein the pose graph includes all key frames and the relative transformations between pairs of key frames.
  25. The movable platform according to claim 24, wherein when the processor constructs the pose graph to optimize all key frames, the processor is configured to:

    when a visual failure occurs between adjacent key frames, generate a corrected covariance matrix according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between the adjacent key frames.
  26. The movable platform according to claim 24 or 25, wherein after the processor constructs the pose graph to optimize all key frames, the processor is further configured to:

    update all key frames and all map points according to a preset global bundle adjustment.
  27. A computer-readable storage medium, on which a computer program is stored, wherein the computer program is executed by a processor to implement the method according to any one of claims 1-13.
  28. A map construction method, applicable to a movable platform, wherein the movable platform comprises a first rolling wheel, a second rolling wheel, a camera, a first encoder, and a second encoder; the first rolling wheel and the second rolling wheel are used to displace the movable platform, and the axes of the first rolling wheel and the second rolling wheel coincide; the camera is used to collect image frames, the first encoder is used to detect the rate of the first rolling wheel, and the second encoder is used to detect the rate of the second rolling wheel; the method comprising:

    obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;

    acquiring the measurement data of the first encoder and the measurement data of the second encoder within the first time period;

    obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period;

    obtaining the predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame; and

    after obtaining the predicted pose of the current image frame, updating the constructed map according to the current image frame on the basis of the constructed map, wherein the constructed map is constructed from the key frames determined before the current image frame.
29. The method according to claim 28, wherein updating the constructed map according to the current image frame on the basis of the constructed map comprises:
    when the current image frame is determined to be a key frame, updating the constructed map according to the current image frame on the basis of the constructed map.
30. The method according to claim 28 or 29, wherein obtaining the predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period comprises:
    acquiring, according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, a first derivative, namely the time derivative of the rotation matrix of the current image frame, and a second derivative, namely the time derivative of the position of the current image frame;
    integrating the first derivative and the second derivative respectively over the first time period to obtain a predicted rotation matrix and a predicted translation matrix of the current image frame relative to the previous image frame;
    obtaining the predicted transformation matrix of the current image frame relative to the previous image frame according to the predicted rotation matrix and the predicted translation matrix.
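For illustration only (not part of the claims): for a two-wheel platform with coincident wheel axes, the encoder speeds v_l and v_r give a body linear velocity v = (v_l + v_r) / 2 and a yaw rate w = (v_r - v_l) / b for wheel separation b, which is one concrete reading of the two time derivatives in claim 30. The sketch below integrates these over the first time period; the sample format, the wheel_base parameter, and the SE(2) convention are assumptions of this sketch.

```python
import numpy as np

def predict_transform(encoder_samples, wheel_base):
    """Integrate differential-drive wheel speeds over the first time period.

    encoder_samples: list of (dt, v_left, v_right) tuples covering the
    interval between the previous and current image frames.
    Returns a 3x3 homogeneous SE(2) matrix: the predicted transformation
    of the current frame relative to the previous frame.
    """
    T = np.eye(3)  # accumulated predicted rotation and translation
    for dt, v_l, v_r in encoder_samples:
        v = 0.5 * (v_l + v_r)             # body linear velocity
        w = (v_r - v_l) / wheel_base      # body yaw rate
        dtheta = w * dt
        c, s = np.cos(dtheta), np.sin(dtheta)
        step = np.array([[c, -s, v * dt],  # first-order SE(2) increment
                         [s,  c, 0.0],
                         [0.0, 0.0, 1.0]])
        T = T @ step
    return T

def predict_pose(T_prev_world, T_rel):
    """Predicted pose of the current frame from the previous frame's pose."""
    return T_prev_world @ T_rel
```

predict_pose then composes the integrated transformation with the previous frame's pose, matching the prediction step of claim 28.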
31. The method according to any one of claims 28-30, further comprising, after the predicted pose of the current image frame is obtained:
    optimizing the predicted pose of the current image frame according to a preset motion-only bundle adjustment model, wherein the motion-only bundle adjustment model comprises an encoder error term, and the encoder error term is calculated based on the measurement data of the first encoder and the measurement data of the second encoder.
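For illustration only (not part of the claims): a motion-only bundle adjustment refines only the current frame's pose while the map points stay fixed. One plausible formulation, assumed here, parameterizes the pose as a rotation vector plus a translation and stacks reprojection errors with an encoder error term that penalizes deviation from the encoder-predicted relative motion; all names and the camera-to-world convention are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def _residuals(x, pts_w, uvs, K, T_prev, T_pred_rel, w_enc):
    R = Rotation.from_rotvec(x[:3]).as_matrix()  # camera-to-world rotation
    t = x[3:]                                    # camera center in world
    p_cam = (pts_w - t) @ R                      # world points -> camera frame
    uv = p_cam @ K.T
    rep = (uv[:, :2] / uv[:, 2:3] - uvs).ravel() # reprojection errors
    T_curr = np.eye(4); T_curr[:3, :3] = R; T_curr[:3, 3] = t
    # encoder error term: optimized relative motion vs. encoder prediction
    E = np.linalg.inv(T_pred_rel) @ np.linalg.inv(T_prev) @ T_curr
    enc = np.concatenate([Rotation.from_matrix(E[:3, :3]).as_rotvec(),
                          E[:3, 3]])
    return np.concatenate([rep, w_enc * enc])

def motion_only_ba(pose_pred, pts_w, uvs, K, T_prev, T_pred_rel, w_enc=1.0):
    """Refine only the current-frame pose; map points stay fixed.

    pose_pred: 4x4 encoder-predicted pose used as the initial guess.
    pts_w / uvs: matched 3D map points and their pixel observations.
    """
    x0 = np.concatenate([Rotation.from_matrix(pose_pred[:3, :3]).as_rotvec(),
                         pose_pred[:3, 3]])
    sol = least_squares(_residuals, x0,
                        args=(pts_w, uvs, K, T_prev, T_pred_rel, w_enc))
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(sol.x[:3]).as_matrix()
    T[:3, 3] = sol.x[3:]
    return T
```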
32. The method according to claim 31, further comprising, after the predicted pose of the current image frame is obtained:
    if a visual failure occurs in the current image frame relative to the previous image frame, taking the predicted pose as the pose of the current image frame;
    if no visual failure occurs in the current image frame relative to the previous image frame, taking the optimization result of the motion-only bundle adjustment model as the pose of the current image frame.
33. The method according to claim 32, further comprising, after the predicted pose of the current image frame is obtained:
    when no optimization result can be obtained through the motion-only bundle adjustment model, determining that a visual failure occurs in the current image frame relative to the previous image frame; or,
    performing feature matching between the current image frame and the previous image frame according to the optimized pose to obtain a number of inliers;
    when the number of inliers is less than a preset threshold, determining that a visual failure occurs in the current image frame relative to the previous image frame.
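For illustration only (not part of the claims): a minimal sketch of the inlier test, assuming matches are counted as inliers when the previous frame's points, transformed by the optimized relative pose, reproject within a pixel threshold of their matches in the current frame. The thresholds and point representation are assumptions of this sketch.

```python
import numpy as np

def count_inliers(K, T_rel, pts_prev_cam, uvs_curr, pix_thresh=2.0):
    """Count matches whose reprojection into the current frame, under the
    optimized relative pose T_rel (previous camera -> current camera),
    lands within pix_thresh pixels of the matched observation."""
    p = pts_prev_cam @ T_rel[:3, :3].T + T_rel[:3, 3]
    uv = p @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    err = np.linalg.norm(uv - uvs_curr, axis=1)
    return int(np.sum(err < pix_thresh))

def visual_failure(optim_ok, n_inliers, min_inliers=30):
    """Visual failure: the optimization failed, or too few inliers remain."""
    return (not optim_ok) or (n_inliers < min_inliers)
```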
34. The method according to any one of claims 28-33, wherein updating the constructed map according to the current image frame comprises:
    removing redundant key frames and redundant map points according to a preset strategy, and performing a triangulation operation according to the current image frame to generate new map points;
    after the new map points are generated, constructing a covisibility graph of the current image frame and performing local bundle adjustment to optimize the other key frames covisible with the current image frame, the map points observable by the current image frame, and the map points observable by the other key frames, wherein the local bundle adjustment comprises an encoder error term.
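For illustration only (not part of the claims): the claim does not fix a triangulation method; a standard two-view linear (DLT) triangulation, shown below, is one way to generate a new map point from a feature matched between the current image frame and a covisible key frame.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one feature match.

    P1, P2: 3x4 projection matrices (K @ [R|t]) of the two frames.
    uv1, uv2: the matched pixel observations in each frame.
    Returns the 3D map point in world coordinates.
    """
    A = np.stack([uv1[0] * P1[2] - P1[0],
                  uv1[1] * P1[2] - P1[1],
                  uv2[0] * P2[2] - P2[0],
                  uv2[1] * P2[2] - P2[1]])
    _, _, vt = np.linalg.svd(A)   # null space of A holds the solution
    X = vt[-1]
    return X[:3] / X[3]           # dehomogenize
```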
35. The method according to claim 34, wherein constructing the covisibility graph of the current image frame and performing the local bundle adjustment comprises:
    when any key frame is connected to its previous key frame by a pure encoder edge and is also connected to its next key frame by a pure encoder edge, removing the key frame, and obtaining a second time period according to the timestamp of the previous key frame and the timestamp of the next key frame, wherein a pure encoder edge indicates that the visual failure has occurred between the adjacent key frames it connects;
    acquiring all measurement data of the first encoder and all measurement data of the second encoder within the second time period;
    obtaining a predicted transformation matrix of the next key frame relative to the previous key frame according to all the measurement data of the first encoder and all the measurement data of the second encoder within the second time period.
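For illustration only (not part of the claims): a minimal sketch of this key-frame removal, assuming each key frame stores the encoder samples of the edge that ends at it; concatenating the two sample runs lets a single predicted transformation span the second time period. The attribute names and the integrate callback (for example predict_transform from the earlier sketch) are assumptions of this sketch.

```python
def merge_pure_encoder_edges(kf_mid, kf_next, integrate):
    """Drop a key frame connected on both sides by pure encoder edges.

    kf_mid / kf_next: the key frame to remove and the one after it; each
    is assumed to carry an encoder_samples list for the edge ending at it.
    integrate: maps a sample list to a relative transformation matrix,
    e.g. predict_transform() above.
    Returns the predicted transformation of the next key frame relative
    to the previous key frame over the whole second time period.
    """
    samples = kf_mid.encoder_samples + kf_next.encoder_samples
    return integrate(samples)
```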
36. The method according to claim 34 or 35, wherein constructing the covisibility graph of the current image frame and performing the local bundle adjustment further comprises:
    acquiring N consecutive key frames preceding the current image frame, and constructing a local optimization graph according to the current image frame and the map points it can observe and the N consecutive key frames and the map points they can observe, where N is greater than or equal to 1;
    constructing the encoder error term according to the predicted transformation matrix between any two adjacent key frames in the local optimization graph, and obtaining a reprojection error by reprojecting the map points onto each key frame;
    optimizing, according to the encoder error and the reprojection error, the current image frame and the map points it can observe and the N consecutive key frames and the map points they can observe, and updating the poses of all key frames and the positions of all map points in the local optimization graph.
37. The method according to any one of claims 28-36, further comprising, after the predicted pose of the current image frame is obtained:
    performing loop closure detection according to the feature points of the current image frame;
    when a loop closure between the current image frame and any key frame is detected, fusing the map points observed by the current image frame and by the key frames near that key frame.
38. The method according to claim 37, further comprising, after the loop closure detection is performed:
    constructing a pose graph to optimize all key frames so as to update the poses of all key frames, the pose graph comprising all of the key frames and the relative transformations between pairs of key frames.
39. The method according to claim 38, wherein constructing the pose graph to optimize all key frames comprises:
    when a visual failure occurs between adjacent key frames, generating a corrected covariance matrix according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between the adjacent key frames.
40. The method according to claim 38 or 39, further comprising, after the pose graph is constructed to optimize all key frames:
    updating all key frames and all map points according to a preset global bundle adjustment.
41. A movable platform, comprising: a first rolling wheel, a second rolling wheel, a camera, a first encoder, a second encoder, a memory, and a processor;
    the first rolling wheel and the second rolling wheel being used to displace the movable platform, the axes of the first rolling wheel and the second rolling wheel coinciding;
    the camera being used to capture image frames;
    the first encoder being used to detect the speed of the first rolling wheel, and the second encoder being used to detect the speed of the second rolling wheel;
    the memory being used to store program code;
    the processor invoking the program code which, when executed, causes the processor to perform the following operations:
    obtaining a first time period according to the timestamp of the current image frame and the timestamp of the previous image frame;
    acquiring measurement data of the first encoder and measurement data of the second encoder within the first time period;
    obtaining a predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period;
    obtaining a predicted pose of the current image frame according to the predicted transformation matrix and the pose information of the previous image frame;
    after the predicted pose of the current image frame is obtained, updating the constructed map according to the current image frame on the basis of the constructed map, the constructed map being constructed from key frames determined before the current image frame.
42. The movable platform according to claim 41, wherein when the processor updates the constructed map according to the current image frame on the basis of the constructed map, the processor is configured to:
    when the current image frame is determined to be a key frame, update the constructed map according to the current image frame on the basis of the constructed map.
43. The movable platform according to claim 41 or 42, wherein when the processor obtains the predicted transformation matrix of the current image frame relative to the previous image frame according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, the processor is configured to:
    acquire, according to the measurement data of the first encoder and the measurement data of the second encoder within the first time period, a first derivative, namely the time derivative of the rotation matrix of the current image frame, and a second derivative, namely the time derivative of the position of the current image frame;
    integrate the first derivative and the second derivative respectively over the first time period to obtain a predicted rotation matrix and a predicted translation matrix of the current image frame relative to the previous image frame;
    obtain the predicted transformation matrix of the current image frame relative to the previous image frame according to the predicted rotation matrix and the predicted translation matrix.
44. The movable platform according to any one of claims 41-43, wherein after the processor obtains the predicted pose of the current image frame, the processor is further configured to:
    optimize the predicted pose of the current image frame according to a preset motion-only bundle adjustment model, wherein the motion-only bundle adjustment model comprises an encoder error term, and the encoder error term is calculated based on the measurement data of the first encoder and the measurement data of the second encoder.
45. The movable platform according to claim 44, wherein after the processor obtains the predicted pose of the current image frame, the processor is further configured to:
    if a visual failure occurs in the current image frame relative to the previous image frame, take the predicted pose as the pose of the current image frame;
    if no visual failure occurs in the current image frame relative to the previous image frame, take the optimization result of the motion-only bundle adjustment model as the pose of the current image frame.
46. The movable platform according to claim 45, wherein after the processor obtains the predicted pose of the current image frame, the processor is further configured to:
    when no optimization result can be obtained through the motion-only bundle adjustment model, determine that a visual failure occurs in the current image frame relative to the previous image frame; or,
    perform feature matching between the current image frame and the previous image frame according to the optimized pose to obtain a number of inliers;
    when the number of inliers is less than a preset threshold, determine that a visual failure occurs in the current image frame relative to the previous image frame.
47. The movable platform according to any one of claims 41-46, wherein when the processor updates the constructed map according to the current image frame, the processor is configured to:
    remove redundant key frames and redundant map points according to a preset strategy, and perform a triangulation operation according to the current image frame to generate new map points;
    after the new map points are generated, construct a covisibility graph of the current image frame and perform local bundle adjustment to optimize the other key frames covisible with the current image frame, the map points observable by the current image frame, and the map points observable by the other key frames, wherein the local bundle adjustment comprises an encoder error term.
48. The movable platform according to claim 47, wherein when the processor constructs the covisibility graph of the current image frame and performs the local bundle adjustment, the processor is configured to:
    when any key frame is connected to its previous key frame by a pure encoder edge and is also connected to its next key frame by a pure encoder edge, remove the key frame and obtain a second time period according to the timestamp of the previous key frame and the timestamp of the next key frame, wherein a pure encoder edge indicates that the visual failure has occurred between the adjacent key frames it connects;
    acquire all measurement data of the first encoder and all measurement data of the second encoder within the second time period;
    obtain a predicted transformation matrix of the next key frame relative to the previous key frame according to all the measurement data of the first encoder and all the measurement data of the second encoder within the second time period.
49. The movable platform according to claim 47 or 48, wherein when the processor constructs the covisibility graph of the current image frame and performs the local bundle adjustment, the processor is further configured to:
    acquire N consecutive key frames preceding the current image frame, and construct a local optimization graph according to the current image frame and the map points it can observe and the N consecutive key frames and the map points they can observe, where N is greater than or equal to 1;
    construct the encoder error term according to the predicted transformation matrix between any two adjacent key frames in the local optimization graph, and obtain a reprojection error by reprojecting the map points onto each key frame;
    optimize, according to the encoder error and the reprojection error, the current image frame and the map points it can observe and the N consecutive key frames and the map points they can observe, and update the poses of all key frames and the positions of all map points in the local optimization graph.
50. The movable platform according to any one of claims 41-49, wherein after the processor obtains the predicted pose of the current image frame, the processor is further configured to:
    perform loop closure detection according to the feature points of the current image frame;
    when a loop closure between the current image frame and any key frame is detected, fuse the map points observed by the current image frame and by the key frames near that key frame.
51. The movable platform according to claim 50, wherein after the processor performs the loop closure detection, the processor is further configured to:
    construct a pose graph to optimize all key frames so as to update the poses of all key frames, the pose graph comprising all of the key frames and the relative transformations between pairs of key frames.
52. The movable platform according to claim 51, wherein when the processor constructs the pose graph to optimize all key frames, the processor is configured to:
    when a visual failure occurs between adjacent key frames, generate a corrected covariance matrix according to the predicted transformation matrix to modify the covariance matrix of the relative transformation between the adjacent key frames.
53. The movable platform according to claim 51 or 52, wherein after the processor constructs the pose graph to optimize all key frames, the processor is further configured to:
    update all key frames and all map points according to a preset global bundle adjustment.
54. A computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the method according to any one of claims 28-40.
PCT/CN2019/103599 2019-08-30 2019-08-30 Pose prediction method, map construction method, movable platform, and storage medium WO2021035669A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980033455.4A CN112219087A (en) 2019-08-30 2019-08-30 Pose prediction method, map construction method, movable platform and storage medium
PCT/CN2019/103599 WO2021035669A1 (en) 2019-08-30 2019-08-30 Pose prediction method, map construction method, movable platform, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/103599 WO2021035669A1 (en) 2019-08-30 2019-08-30 Pose prediction method, map construction method, movable platform, and storage medium

Publications (1)

Publication Number Publication Date
WO2021035669A1 (en)

Family

ID=74058715

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103599 WO2021035669A1 (en) 2019-08-30 2019-08-30 Pose prediction method, map construction method, movable platform, and storage medium

Country Status (2)

Country Link
CN (1) CN112219087A (en)
WO (1) WO2021035669A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991515B (en) * 2021-02-26 2022-08-19 山东英信计算机技术有限公司 Three-dimensional reconstruction method, device and related equipment
CN113052855B (en) * 2021-02-26 2021-11-02 苏州迈思捷智能科技有限公司 Semantic SLAM method based on visual-IMU-wheel speed meter fusion
CN113223161B (en) * 2021-04-07 2022-04-12 武汉大学 Robust panoramic SLAM system and method based on IMU and wheel speed meter tight coupling
CN113094462B (en) * 2021-04-30 2023-10-24 腾讯科技(深圳)有限公司 Data processing method and device and storage medium


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104374395A (en) * 2014-03-31 2015-02-25 南京邮电大学 Graph-based vision SLAM (simultaneous localization and mapping) method
CN108256574B (en) * 2018-01-16 2020-08-11 广东省智能制造研究所 Robot positioning method and device
CN108921898B (en) * 2018-06-28 2021-08-10 北京旷视科技有限公司 Camera pose determination method and device, electronic equipment and computer readable medium
CN109671120A (en) * 2018-11-08 2019-04-23 南京华捷艾米软件科技有限公司 A kind of monocular SLAM initial method and system based on wheel type encoder
CN110044354B (en) * 2019-03-28 2022-05-20 东南大学 Binocular vision indoor positioning and mapping method and device
CN110706248B (en) * 2019-08-20 2024-03-12 广东工业大学 Visual perception mapping method based on SLAM and mobile robot
CN115131420A (en) * 2022-06-24 2022-09-30 武汉依迅北斗时空技术股份有限公司 Visual SLAM method and device based on key frame optimization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337690A1 (en) * 2016-05-20 2017-11-23 Qualcomm Incorporated Predictor-corrector based pose detection
CN106679648A (en) * 2016-12-08 2017-05-17 东南大学 Vision-inertia integrated SLAM (Simultaneous Localization and Mapping) method based on genetic algorithm
CN106997688A (en) * 2017-06-08 2017-08-01 重庆大学 Parking position detecting method based on multi-sensor information fusion
CN109583290A (en) * 2017-09-28 2019-04-05 百度(美国)有限责任公司 The system and method for improving visual signature detection using motion-dependent data
CN110160527A (en) * 2019-05-06 2019-08-23 安徽红蝠智能科技有限公司 A kind of Mobile Robotics Navigation method and apparatus

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819744A (en) * 2021-02-26 2021-05-18 中国人民解放军93114部队 GNSS and visual SLAM fused track measuring method and device
CN113516692A (en) * 2021-05-18 2021-10-19 上海汽车集团股份有限公司 Multi-sensor fusion SLAM method and device
CN113379803A (en) * 2021-07-07 2021-09-10 上海谦尊升网络科技有限公司 Positioning method based on visual image
CN113379803B (en) * 2021-07-07 2024-02-02 上海谦尊升网络科技有限公司 Positioning method based on visual image
CN113587934A (en) * 2021-07-30 2021-11-02 深圳市普渡科技有限公司 Robot, indoor positioning method and device and readable storage medium
CN113587934B (en) * 2021-07-30 2024-03-19 深圳市普渡科技有限公司 Robot, indoor positioning method and device and readable storage medium
CN113763560A (en) * 2021-08-02 2021-12-07 纵目科技(上海)股份有限公司 Method, system and equipment for generating point cloud data and computer readable storage medium
CN113763560B (en) * 2021-08-02 2024-02-09 纵目科技(上海)股份有限公司 Method, system, equipment and computer readable storage medium for generating point cloud data
CN113763548A (en) * 2021-08-17 2021-12-07 同济大学 Poor texture tunnel modeling method and system based on vision-laser radar coupling
CN113763548B (en) * 2021-08-17 2024-02-27 同济大学 Vision-laser radar coupling-based lean texture tunnel modeling method and system
CN113781563B (en) * 2021-09-14 2023-10-24 中国民航大学 Mobile robot loop detection method based on deep learning
CN113781563A (en) * 2021-09-14 2021-12-10 中国民航大学 Mobile robot loop detection method based on deep learning
CN114088081A (en) * 2021-10-10 2022-02-25 北京工业大学 Map construction method for accurate positioning based on multi-segment joint optimization
CN114279456A (en) * 2021-12-06 2022-04-05 纵目科技(上海)股份有限公司 Picture construction/vehicle positioning method, system, terminal and computer storage medium
WO2024002065A1 (en) * 2022-06-30 2024-01-04 维沃移动通信有限公司 Video encoding method and apparatus, electronic device, and medium
CN115930971B (en) * 2023-02-01 2023-09-19 七腾机器人有限公司 Data fusion processing method for robot positioning and map building
CN115930971A (en) * 2023-02-01 2023-04-07 七腾机器人有限公司 Data fusion processing method for robot positioning and mapping
CN116252581A (en) * 2023-03-15 2023-06-13 吉林大学 System and method for estimating vertical and pitching motion information of vehicle body under straight running working condition
CN116252581B (en) * 2023-03-15 2024-01-16 吉林大学 System and method for estimating vertical and pitching motion information of vehicle body under straight running working condition

Also Published As

Publication number Publication date
CN112219087A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
WO2021035669A1 (en) Pose prediction method, map construction method, movable platform, and storage medium
CN111561923B (en) SLAM (simultaneous localization and mapping) mapping method and system based on multi-sensor fusion
CN109307508B (en) Panoramic inertial navigation SLAM method based on multiple key frames
US11668571B2 (en) Simultaneous localization and mapping (SLAM) using dual event cameras
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN109993113B (en) Pose estimation method based on RGB-D and IMU information fusion
CN112634451B (en) Outdoor large-scene three-dimensional mapping method integrating multiple sensors
KR102427921B1 (en) Apparatus and method for real-time mapping and localization
CN109506642B (en) Robot multi-camera visual inertia real-time positioning method and device
Tanskanen et al. Live metric 3D reconstruction on mobile phones
KR101725060B1 (en) Apparatus for recognizing location mobile robot using key point based on gradient and method thereof
CN112304307A (en) Positioning method and device based on multi-sensor fusion and storage medium
WO2019157925A1 (en) Visual-inertial odometry implementation method and system
US9858640B1 (en) Device and method for merging 3D point clouds from sparsely distributed viewpoints
CN110726406A (en) Improved nonlinear optimization monocular inertial navigation SLAM method
CN112734852A (en) Robot mapping method and device and computing equipment
WO2020221307A1 (en) Method and device for tracking moving object
CN111127524A (en) Method, system and device for tracking trajectory and reconstructing three-dimensional image
CN112802096A (en) Device and method for realizing real-time positioning and mapping
CN111932674A (en) Optimization method of line laser vision inertial system
Zhang et al. Hand-held monocular SLAM based on line segments
CN114013449A (en) Data processing method and device for automatic driving vehicle and automatic driving vehicle
CN112731503A (en) Pose estimation method and system based on front-end tight coupling
CN116468786A (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN112767482B (en) Indoor and outdoor positioning method and system with multi-sensor fusion

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19942755; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19942755; Country of ref document: EP; Kind code of ref document: A1)