WO2020259248A1

WO2020259248A1 - Depth information-based pose determination method and device, medium, and electronic apparatus

Info

Publication number: WO2020259248A1
Application number: PCT/CN2020/094461
Authority: WO
Inventors: 王宇鹭
Original assignee: Oppo广东移动通信有限公司
Priority date: 2019-06-28
Filing date: 2020-06-04
Publication date: 2020-12-30
Also published as: CN110349213A; CN110349213B

Abstract

A depth information-based pose determination method and device, a storage medium, and an electronic apparatus. The method comprises: acquiring, by means of a camera, a current frame image associated with a scene, and acquiring depth information of the current frame image (S210); extracting feature points from the current frame image, determining a feature point having valid depth information to be a three-dimensional feature point, and determining a feature point having invalid depth information to be a two-dimensional feature point (S220); respectively performing matching on the basis of the two-dimensional feature point, the three-dimensional feature point and a local map point to construct a first error function (S230); and determining a pose parameter of the camera on the basis of the current frame by calculating the minimum value of the first error function (S240). The invention can improve tracking precision of SLAM.

Description

Method, device, medium and electronic equipment for determining pose based on depth information

This application claims the priority of a patent application filed with the Chinese Patent Office on June 28, 2019, with the application number CN201910580095.1, and the application titled "Pose determination method, device, medium and electronic equipment based on depth information". The entire content is incorporated into this application by reference.

Technical field

The present disclosure relates to the field of computer vision technology, and in particular to a method for determining a pose based on depth information, a device for determining a pose based on depth information, a computer-readable storage medium, and electronic equipment.

Background technique

SLAM (Simultaneous Localization And Mapping, Simultaneous Localization And Mapping) is a method to move the terminal device in the scene and collect the image of the scene, and at the same time to determine the device’s own pose and model the scene. It is AR (Augmented Reality, enhanced Reality), basic technology in robotics and other fields.

In the existing SLAM methods, most of the visual signal units (such as cameras) and IMU (Inertia Measurement Unit) are aligned to determine the pose of the camera in real time, so as to restore the scene captured by the camera to a scene model. . However, this method is limited by the accuracy and latency of the IMU. More accurate pose information can be obtained in a short time. Long-term use will cause more serious drift, resulting in the inability to accurately track the camera, which is not conducive to scene construction. mold.

It should be noted that the information disclosed in the above background section is only used to strengthen the understanding of the background of the present disclosure, and therefore may include information that does not constitute the prior art known to those of ordinary skill in the art.

Summary of the invention

The present disclosure provides a method for determining a pose based on depth information, a device for determining a pose based on depth information, a computer-readable storage medium, and electronic equipment, thereby improving at least to a certain extent the low tracking accuracy of existing SLAM methods The problem.

Other characteristics and advantages of the present disclosure will become apparent through the following detailed description, or partly learned through the practice of the present disclosure.

According to a first aspect of the present disclosure, a method for determining a pose based on depth information is provided, including: acquiring a current frame image of a scene through a camera, and acquiring depth information of the current frame image; and extracting from the current frame image Feature points, the feature points with valid depth information are determined as three-dimensional feature points, and the feature points with invalid depth information are determined as two-dimensional feature points; the two-dimensional feature points and the three-dimensional feature points are respectively connected to the local map points of the scene Matching is performed to construct a first error function. The first error function includes a two-dimensional error term and a three-dimensional error term. The two-dimensional error term is the error between the successfully matched two-dimensional feature point and the local map point. The three-dimensional error term is the error between the successfully matched three-dimensional feature point and the local map point; by calculating the minimum value of the first error function, the pose parameter of the camera in the current frame is determined.

According to a second aspect of the present disclosure, there is provided a device for determining a pose based on depth information, including: an image acquisition module for acquiring a current frame image of a scene through a camera, and acquiring depth information of the current frame image; The point extraction module is used to extract feature points from the current frame image, determine the feature points with valid depth information as three-dimensional feature points, and determine the feature points with invalid depth information as two-dimensional feature points; the function building module is used to The two-dimensional feature points and the three-dimensional feature points are matched with the local map points respectively to construct a first error function. The first error function includes a two-dimensional error term and a three-dimensional error term, and the two-dimensional error term is a successful match. The three-dimensional error term is the error between the successfully matched three-dimensional feature point and the local map point; the pose determination module is used to calculate the first error function The minimum value of determines the pose parameter of the camera in the current frame.

According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, and the computer program, when executed by a processor, implements the pose determination method of the first aspect and possible implementations thereof.

According to a fourth aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions Perform the pose determination method of the first aspect and possible implementations thereof.

The present disclosure has the following beneficial effects:

Based on the depth information of the current frame image, the feature points extracted in the image are divided into two-dimensional feature points and three-dimensional feature points, and the two types of feature points are matched with local map points to construct a two-dimensional error term and a three-dimensional error term , So as to establish the first error function, and obtain the pose parameters of the camera in the current frame by optimizing the minimum value of the first error function. On the one hand, introducing depth information to establish an error function as a basis for determining pose parameters can improve the accuracy of pose parameters in SLAM and increase tracking accuracy. On the other hand, based on whether the depth information is valid, the feature points are classified, and error terms are constructed respectively. Compared with the method of optimizing all the feature points through one error term, this exemplary embodiment has higher flexibility and Pertinence, and can reduce the influence of invalid or wrong depth information on the results. In addition, it can also improve the stability and robustness of the SLAM method for a long time.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and cannot limit the present disclosure.

Description of the drawings

The drawings herein are incorporated into the specification and constitute a part of the specification, show embodiments consistent with the disclosure, and together with the specification are used to explain the principle of the disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1 shows a schematic diagram of the architecture of a SLAM system in this exemplary embodiment;

Fig. 2 shows a flow chart of a method for determining a pose based on depth information in this exemplary embodiment;

Fig. 3 shows a sub-flow chart of a method for determining a pose based on depth information in this exemplary embodiment;

Fig. 4 shows a flow chart of a SLAM method in this exemplary embodiment;

Fig. 5 shows a structural block diagram of a device for determining a pose based on depth information in this exemplary embodiment;

Fig. 6 shows a computer-readable storage medium for implementing the above method in this exemplary embodiment;

Fig. 7 shows an electronic device for implementing the above-mentioned method in this exemplary embodiment.

Detailed ways

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms, and should not be construed as being limited to the examples set forth herein; on the contrary, the provision of these embodiments makes the present disclosure more comprehensive and complete, and fully conveys the concept of the example embodiments To those skilled in the art. The described features, structures or characteristics may be combined in one or more embodiments in any suitable way.

The exemplary embodiments of the present disclosure first provide a method for determining a pose based on depth information, which is mainly applied in a SLAM scene to determine the pose of a camera. Figure 1 shows the SLAM system architecture diagram of the application environment of this method. As shown in FIG. 1, the SLAM system 100 may include: a scene 101, a movable camera 102, a movable depth sensor 103, and a computing device 104. Among them, the scene 101 is a realistic scene to be modeled, such as an interior, a courtyard, and a street. The camera 102 and the depth sensor 103 can be integrated. For example, the camera 102 is a plane camera, and the depth sensor 103 is a TOF (Time Of Flight) sensor set aside; or the camera 102 and the depth sensor 103 are two cameras, A binocular camera is formed; or the depth sensor 103 is an infrared light device, which and the camera 102 form a structured light camera. The camera 102 and the depth sensor 103 can move within the scene 101 to collect the image of the scene 101 and its depth information. FIG. 1 shows that the camera 102 and the depth sensor 103 are set on a movable robot. In addition, the user can hold a mobile phone or wear it. Smart glasses and the like move within the scene 101, and the mobile phone or smart glasses has a camera 102 and a depth sensor 103 built in. The computing device 104 can be a terminal computer or server, etc., which is connected to the camera 102 and the depth sensor 103 for data interaction. The camera 102 and the depth sensor 103 send the collected images and their depth information to the computing device 104, and the computing device 104 performs Process analysis to achieve positioning and modeling in SLAM.

It should be noted that the SLAM system 100 shown in FIG. 1 is only an example, and there can be several variations. For example, the camera 102, the depth sensor 103, and the computing device 104 can be integrated into one device, such as a built-in camera. 102. The robot of the depth sensor 103 and the computing device 104 can take pictures while moving in the scene 101, and process the photos to realize positioning and modeling; the number of devices is not limited to the situation shown in FIG. 1, as You can set up multiple cameras (such as 3 or 4 cameras on a mobile phone), or set up a computing device cluster composed of multiple servers, and process a large number of scene images through cloud computing, etc.; Figure 1 can also be added Devices not shown in, such as setting up an IMU (Inertia Measurement Unit) matched with the camera 102 to help determine the pose of the camera 102, or setting a projection device to generate a virtual projection in the scene 103, and the camera 102 , The user or the robot interacts.

At the beginning of SLAM, the scene is modeled from scratch, and no scene images are collected at this time; after the SLAM process starts, the camera follows the user or the robot to move in the scene, collecting scene images while moving, forming continuous frames of images Stream, sent to the computing device in real time; after obtaining a certain number of frames of images, the computing device can initialize the map model of the scene, which usually only includes a small part of the scene, or is different from the actual scene; after that, the camera collects a frame of Images and computing devices can update and optimize the map model based on the image (of course, you can also update and optimize the map model when the key frames are filtered out), add map points that are not in the map model, or modify the location of existing map points When updating and optimizing the map model, you need to determine the camera's pose parameters, which is a necessary part in SLAM. Only after the camera's pose parameters are determined, can the images collected by the camera be matched to To update and optimize the map model in the three-dimensional map model.

This exemplary embodiment is an improved method for how to determine the pose parameters of the camera in each frame. Based on the SLAM system 100 of FIG. 1, the execution subject of this exemplary embodiment may be the computing device 104 therein. Fig. 2 shows a process of this exemplary embodiment, which may include the following steps S210 to S240:

Step S210: Acquire a current frame image of the scene through the camera, and acquire depth information of the current frame image.

Among them, each frame of image collected by the camera, the computing device analyzes one frame of image, and the current frame of image is the latest frame of image collected by the camera. Depth information is collected at the same time as the current frame of image. For example, if you use a plane camera and a depth sensor to capture a scene image, you can get the depth information of each pixel in the image, usually the depth value; if you use a binocular camera to capture the scene image, you can The triangulation algorithm obtains the depth information of each pixel in the image; the structured light camera is used to shoot the scene image, and the infrared dot matrix can be used to project the infrared light signal to the scene, and after receiving the reflected signal, it is calculated by the change of the infrared light In-depth information, etc.

Step S220: Extract feature points from the current frame image, determine feature points with valid depth information as three-dimensional feature points, and determine feature points with invalid depth information as two-dimensional feature points.

Among them, the feature points are representative and highly recognizable points or regions in the image, such as corner points, edges, and some blocks in the image. This exemplary embodiment can use the ORB algorithm (Oriented FAST and Rotated Brief). , For FAST (Features From Accelerated Segment Test, features based on accelerated segmentation detection) algorithm and rotating Brief (Binary Robust Independent Elementary Features, Binary Robust Independent Elementary Features)) to extract feature points and describe feature points; FAST can also be used , SIFT (Scale-Invariant Feature Transform, Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features, accelerated robust features) and other algorithms to extract feature points; it can also perform target detection on the current frame of image, and detect the edges of objects Extract certain feature points on the contour, and so on.

The extracted feature points are pixels in the current frame image, which have depth information. In this exemplary embodiment, considering the capability limitations of depth detection components such as depth sensors, binocular cameras, etc., it is impossible to accurately detect the depth information of objects that are too close or too far from the depth sensor, or for black or high-reflective materials. The processing capabilities of objects and scenes with large lighting changes are poor, and the depth information of the current frame image may contain invalid pixel depth information. Therefore, based on whether the depth information is valid, the feature points can be divided into three-dimensional feature points and two-dimensional feature points: feature points with invalid depth information are two-dimensional feature points, and because their depth information is invalid, only their values in the current frame image are retained. Two-dimensional coordinates (that is, plane coordinates); the feature points with effective depth information are three-dimensional feature points. In addition to their two-dimensional coordinates in the current frame image, they also have their third-dimensional coordinates in the depth direction, and their coordinate values are usually The depth value.

When detecting whether the depth information is valid, it is mainly to detect whether the depth information of each feature point can accurately reflect the actual situation of the photographed object. Based on the principle of this idea, for images of different types and different scenes, the detection methods and standards used can be based on Depending on the situation, this disclosure does not limit this, and several specific detection method examples are provided below.

(1) When the depth sensor cannot accurately detect the depth information of the object, it can output the depth value of the corresponding part as an invalid value or an abnormal value. For example, the depth detection range of a TOF sensor is usually 0.5 to 3 meters. If the object reaches the TOF sensor If the distance is outside this range, the TOF (the time difference between the transmitted and received signal) sensed by the TOF sensor exceeds the upper or lower limit, and the depth value of the object can be recorded as an invalid value or upper and lower limit. The depth value is not credible and is invalid information; on the contrary, if the depth value is a normal value within the detection range, it is valid information.

(2) Based on the feature points extracted by target detection, all feature points of each object in the current frame image can be uniformly detected with the object as a unit, and the depth value span of the object is detected (that is, the maximum depth value minus the minimum depth Value), if it is within the normal range, the depth information of all feature points of the object are valid. For example: a chair is detected from the current frame image, and 10 feature points (including corner points, edge points, etc.) are extracted from the contour of the chair, and the maximum depth value of the 10 feature points is subtracted from the minimum depth value to obtain the depth of the chair The value span is considered to be the thickness of the chair in the depth direction; by setting the thickness range of each object in advance, for example, the thickness of the chair is 0.5 to 2 meters, it is judged whether the above depth value span is within this range, if it is, Then the depth information of the 10 feature points are all valid, otherwise all are invalid.

In step S230, the two-dimensional feature points and the three-dimensional feature points are respectively matched with local map points of the scene to construct a first error function.

The local map point refers to the map point of the scene detected before the current frame within a local range centered on the shooting area of the current frame image, and the map point refers to the point added in the map model of the scene. When SLAM collects scene images, a certain number of key frames are usually selected from continuous frame images. This is a representative frame selected to reduce information redundancy in the modeling process. Usually, a certain number of frames can be selected every interval One frame is a key frame, or a key frame is extracted when the image content changes a lot. In this exemplary embodiment, the local map point may be a map point that appeared in the previous key frame and the common view key frame of the previous key frame. Among them, the last key frame is the key frame closest to the current frame; common view means that the content of the two images is similar or has a common field of view (Field Of Vision, FOV), indicating that the two images are taken The regions of with a higher degree of overlap and a common view relationship, where one frame is the common view frame of the other frame. This exemplary embodiment can detect whether the feature points of other key frames and the previous key frame are the same points. If the number of the same feature points exceeds a certain ratio, the other key frames are the common view key frames of the previous key frame. , You can also determine the degree of common view between each other key frame and the previous key frame according to the number of the same feature points, and select a certain number of other key frames from high to low to be the common view key frame of the previous key frame. After determining the previous key frame and its common view key frame, take the union of the map points, and the obtained map points are local map points.

After the local map points are obtained, the two-dimensional feature points and the three-dimensional feature points in the current frame image are matched with the local map points respectively. If it is determined that the feature points and the local map points are the same point in the scene, the matching is successful. The matching method can include several exemplary methods: feature descriptions of feature points and local map points, for example, through ORB, BRIEF and other algorithms, according to the similarity of descriptors to determine whether the feature points and local map points match; for local map points Perform down-sampling to make it the same as the number of feature points in the current frame image, and then perform ICP (Iterative Closest Point) algorithm matching on the point cloud of feature points and the point cloud of local map points; based on target detection from The feature points extracted from the current frame image are matched with the object model in the local map point in the unit of object, and all the feature points in the successfully matched object are matched with the local map point in the corresponding object model.

After the matching, the first error function can be constructed based on the matched feature points and the local map point pairs:

Among them, Loss1 is the first error function, Loss2D is the two-dimensional error term, and Loss3D is the three-dimensional error term; P _2D represents the set of two-dimensional feature points, e _i,k represents the difference between any one of them and the corresponding local map point Error; P _3D represents a collection of three-dimensional feature points, e _{j, k} represents the error between any one of them and the corresponding local map point; k represents the current frame.

In an optional implementation manner, a robust kernel function ρ() can be added to formula (1) to reduce the influence of mismatches on the final result, as shown below:

In an alternative embodiment, the two-dimensional error term may be the reprojection error between the successfully matched two-dimensional feature point and the local map point, and the three-dimensional error term may be the difference between the successfully matched three-dimensional feature point and the local map point. The ICP (Iterated Nearest Neighbor) error between the two has the following relationship:

Among them, w represents the world coordinate system,

Is the pose parameter of the camera in the current frame, which represents the rotation and translation parameters used in the actual scene conversion from the world coordinate system to the plane of the current frame image;

Is the plane coordinate of the two-dimensional feature point i in the current frame image,

Is the world coordinates of the local map point corresponding to the two-dimensional feature point i, and π() represents the projection of the three-dimensional local map point to the image plane (here is the plane of the current frame image). Therefore, formula (3) represents the local map point After being reprojected to the plane of the current frame image, the plane coordinate error with the corresponding two-dimensional feature point. Similarly, in formula (4),

Is the three-dimensional coordinates of the three-dimensional feature point j in the current frame image (including depth information, which is the coordinates in the three-dimensional camera coordinate system),

Is the world coordinate of the local map point corresponding to the three-dimensional feature point j, after

Converted to the camera coordinate system, and

Calculate the coordinate error.

In an optional implementation manner, an information matrix can also be added to the above formula (1) or (2) to measure the observation uncertainty of feature points, as shown below:

among them,

Is the information matrix of the two-dimensional feature point i expressed in the form of a covariance matrix,

The indicated information matrix of the three-dimensional feature point j is related to the noise performance of the camera itself. Here, it is equivalent to performing a weighted calculation on the feature points at different positions, which can improve the accuracy of the first error function.

Step S240: Determine the pose parameter of the camera in the current frame by calculating the minimum value of the first error function.

In SLAM, the error function can also be called an optimization function, a constraint function, etc., which are used to optimize the corresponding variable parameters. In this exemplary embodiment, the first error function is used to optimize the pose parameters of the camera in the current frame. Taking formula (5) as an example, the relationship is as follows:

Through nonlinear optimization of the first error function, the pose parameters are obtained after multiple iterations

The condition for iterative convergence may be that the iteration reaches a certain number of rounds, or the value of the first error function decreased in two consecutive rounds of iterations is lower than a predetermined value, etc.

Based on the above content, in this exemplary embodiment, based on the depth information of the current frame image, the feature points extracted in the image are divided into two-dimensional feature points and three-dimensional feature points, and the two types of feature points are respectively matched with local map points , In order to construct a two-dimensional error term and a three-dimensional error term, thereby establishing the first error function, and by optimizing the minimum value of the first error function, the pose parameters of the camera in the current frame are obtained. On the one hand, introducing depth information to establish an error function as a basis for determining pose parameters can improve the accuracy of pose parameters in SLAM and increase tracking accuracy. On the other hand, based on whether the depth information is valid, the feature points are classified, and error terms are constructed respectively. Compared with the method of optimizing all the feature points through one error term, this exemplary embodiment has higher flexibility and Pertinence, and can reduce the influence of invalid or wrong depth information on the results. In addition, it can also improve the stability and robustness of the SLAM method for a long time.

In an optional implementation, if the IMU and the visual signal unit are aligned (or called alignment, registration, fusion, coupling, etc.) in advance, the first error function may also include an inertial measurement error term, which is The error between the IMU and the visual signal unit. Among them, the visual signal unit refers to a unit for positioning and modeling through visual signals (mainly images), which mainly includes a camera, and may also include a depth sensor, a computer, etc., which are matched with the camera. The first error function can be as follows:

When the robust kernel function ρ() and the information matrix are introduced, the first error function can also be:

In formulas (7) and (8), e _IMU,k is the inertial measurement error term, which represents the error between the current frame IMU and the visual signal unit,

Represents the information matrix of the IMU. The inertial measurement error term is set in the first error function, and the IMU signal can be used as a basis parameter for pose optimization to further improve the accuracy of the pose parameters.

In this exemplary embodiment, the alignment of the IMU and the visual signal unit may include the following steps: obtain the IMU gyroscope by calculating the minimum error between the rotation parameter in the pre-integration of the IMU and the rotation parameter measured by the visual signal unit Bias; by calculating the minimum error between the position parameter in the pre-integration of the IMU and the position parameter measured by the visual signal unit, the gravity acceleration of the IMU is obtained; based on the gyroscope bias and the gravity acceleration of the IMU, the IMU and the visual signal The unit is aligned. The above steps can be performed in the initialization phase of SLAM, that is, the IMU and the visual signal unit are aligned in the initialization phase, and then in the tracking process, the above steps can also be performed to continuously optimize and adjust the alignment state of the IMU and the visual signal unit to further Improve the accuracy of tracking.

The aforementioned tracking and pose determination processes are usually performed by the tracking thread in SLAM. In addition, SLAM may also include a key frame processing thread (or called a map modeling thread, a map reconstruction thread, etc.). In an optional implementation manner, the key frame processing thread may determine the current frame as a new key frame, and update the map model of the scene according to the new key frame and its pose parameters. Through the pose parameters, the new key frame can be converted to the world coordinate system, matched with the existing scene map model, optimized and corrected the map point position in the map model, or added new map points, and deleted abnormal map points Wait.

In an alternative embodiment, not every frame is processed as a key frame. After step S240, the following steps can be performed: when it is determined that the current frame meets the first preset condition, the current frame is determined as the new key frame Frame, the map model of the scene is updated; when it is determined that the current frame does not meet the first preset condition, the current frame is determined as a normal frame, and the next frame is processed. Among them, the first preset condition may include:

The current frame is more than the preset number of frames from the previous key frame, and the preset number of frames can be set according to experience or actual application requirements. For example, if more than 15 frames are exceeded, the current frame is the new key frame.

The disparity between the current frame and the previous key frame exceeds the preset value. The disparity is the opposite concept of the common view, which represents the degree of difference between the areas captured by the two frames. The greater the difference, the lower the common view, and the greater the parallax; The difference can be set according to experience or actual application requirements. For example, it can be set to 15%. When the disparity between the current frame and the previous key frame exceeds 15%, it is a new key frame.

Count the number of two-dimensional feature points whose error with the local map point is less than the first threshold in the current frame image, and the number of three-dimensional feature points whose error with the local map point is less than the second threshold, the sum of the two numbers The number of feature points successfully tracked for the current frame. If it is less than the preset number, the current frame is a new key frame. The first threshold, the second threshold, and the preset number can be set according to experience or actual application requirements.

It should be noted that the above three conditions can also be combined arbitrarily. For example, when the current frame is more than the preset number of frames from the previous key frame, and the disparity between the current frame and the previous key frame exceeds the preset difference, the current frame is the new one The key frame is not limited in this disclosure.

After a new key frame is determined, it can be added to the key frame queue, and the key frame processing thread processes the key frames in the queue sequentially to update the map model of the scene. The following three aspects specifically explain how to update the map model:

In the first aspect, the existing map points can be updated, and the pose parameters of the key frames can be further optimized. Refer to Figure 3, which specifically includes the following steps S310-S340:

Step S310: Obtain a new key frame and other key frames associated with the new key frame to form a key frame set.

Among them, other key frames associated with the new key frame may be: M key frames closest to the new key frame, and N common view key frames of the new key frame, where M and N are preset positive integers, It can be set according to experience or actual application requirements. Of course, there may be repeated frames in M key frames and N common view key frames. The two parts can be unioned to obtain the key frame set, which is recorded as F _key . Or, key frames that have other association relationships with the new key frame can also be formed into a key frame set F _key .

Step S320: Obtain all the map points that have appeared in the key frame set to form a map point set.

In other words, take the union of the map points of all key frames in F _key to form a set of map points, denoted as P _map .

In step S330, a second error function is constructed based on the key frame set, the pose parameters of each key frame, and the map point set.

The second error function includes a reprojection error term, which is the sum of the reprojection error from any map point in the map point set to any key frame in the key frame set, which can be expressed as follows:

e _o,p represents the reprojection error from any map point p in P _map to any key frame o in F _key . Furthermore, a robust kernel function ρ() can also be added to the second error function, then:

In an optional implementation manner, in order to improve the accuracy of the second error function, an inter-frame inertial measurement error term can also be set, which is the IMU between any two adjacent key frames i and i+1 in the key frame set The sum of errors is as follows:

In formula (11), the IMU information matrix between key frames i and i+1 is also added, which can further optimize the second error function.

In step S340, by calculating the minimum value of the second error function, the pose parameters of each key frame in the key frame set and the coordinates of each map point in the map point set are optimized to update the map model.

Among them, the optimization solution can be as follows:

X ^p is the world coordinate of any map point p in P _map ,

It is the pose parameter of any key frame q in the F _key . By optimizing these two parameters, the map point coordinates can be optimized and corrected, and the map model can be updated.

In the second aspect, abnormal map points can be deleted from existing map points. Specifically, based on the key frame set and map point set established above, the map points that meet the second preset condition can be used as abnormal map points, from Deleted from the map model. The second preset condition can include any of the following:

If the average value of the reprojection error of the map point p on each key frame in the key frame set is greater than the preset error threshold, then p is an abnormal map point. The preset error threshold can be set according to experience or actual application requirements. When calculating the reprojection error, you can select all the key frames in the key frame set for calculation, or you can select p key frames with projection for calculation.

If the number of successfully tracked key frames of the map point p in the key frame set is less than the predicted number multiplied by the preset ratio, then p is an abnormal map point. Successfully being tracked means that the reprojection error of the map point p on the key frame is less than a certain value, for example, less than the aforementioned first threshold. Based on the position of p and the pose parameters of each key frame, the key frame processing thread can predict the number of key frames where p is successfully tracked. This number is multiplied by a preset ratio less than or equal to 1, and the result is used to measure whether the tracking is abnormal; Let the ratio represent the allowable degree of deviation, which can be set according to experience or actual application requirements, for example, 0.5. The judgment relationship is as follows:

II() is an indicator function, and the values in () are 1 and 0 respectively when true and false; T1 is the first threshold, R is the preset ratio, and Pre() is the prediction function.

In the third aspect, new map points can be added. Specifically, if there is a feature point in the new key frame that does not match the local map point (or map point in the map point set), it can be considered that the feature point does not yet exist In the map model; then the feature point can be matched with the feature points of other key frames in the key frame set. If the matching is successful, a pair of feature points is obtained, which can be regarded as the projection of the same point in the scene on two different frames; triangulate the point pair to restore its three-dimensional coordinates in the scene To get a new map point, which can be added to the map model.

What needs to be added is that the above method can actually be applied to each key frame. The feature points of each key frame in the key frame set are matched with the map points in the map point set. The unmatched feature points form an unknown point. Set, and then match the feature points in the unknown point set pairwise (using a non-replacement method, if a point pair is matched, the two points will be removed from the set, so there will be no match between one point and two or In the case of multiple points), the matched point pairs correspond to triangulation to generate new map points.

In an optional implementation, if there is a feature point that does not match the local map point in the new key frame, since the feature point has depth information, that is, the feature point has three-dimensional image coordinates. The pose parameter can map the feature point to the world coordinate system, calculate its true position, and add it to the map model as a new map point. Even if the position of the map point is deviated, it can be continuously optimized in the processing of subsequent frames.

In addition to the above-mentioned tracking thread and key frame processing thread, SLAM may also include a loopback detection thread, which is used to perform loopback detection for new keyframes to optimize the map model globally. Specifically, for the new key frame, the feature points in the key frame are converted into a dictionary description through the pre-trained visual bag of words model; then the dictionary similarity between the key frame and the previous key frame is calculated, if the similarity reaches With a certain threshold, the key frame is considered to be a candidate loop frame; geometric verification is performed on the candidate loop frame, that is, the matching points should meet the corresponding geometric relationship, if the geometric verification is passed, it is considered a loop frame, and the map model is globally optimized .

Fig. 4 shows the flow of a SLAM method of this exemplary embodiment, including three parts respectively executed by a tracking thread, a key frame processing thread, and a loopback detection thread, which are specifically as follows:

The tracking thread executes step S411 to collect the current frame image and its depth information; then executes step S412 to extract feature points from the current frame image; then executes step S413 to classify feature points according to the depth information to obtain two-dimensional feature points and three-dimensional feature points ; Then step S414 is performed to match the two-dimensional feature points and the three-dimensional feature points with the local map points respectively to construct a first error function; then step S415 is performed to optimize the pose of the current frame by calculating the minimum value of the first error function Parameters; finally step S416 is performed to determine whether the current frame meets the first preset condition, if not, then enter the processing of the next frame, if it is satisfied, it will be treated as a new key frame and added to the key frame queue. The process of tracking the thread ends.

In response to the current frame as the new key frame, the key frame processing thread executes step S421 to add a new key frame to the key frame queue; then executes step S422 to construct a second error function; then executes step S423 to calculate the second error The minimum value of the function is to optimize the poses of multiple nearby key frames (key frames close to the current frame) and the position of the map point. This is a local optimization of the map model; after the local optimization, step S424 is executed to determine the IMU and vision Whether the signal unit is aligned, if not, perform step S425 for alignment; before alignment, it can also be judged whether there is obvious parallax between a certain number of nearby key frames, and if so, alignment can be performed If it does not exist, it is considered that the alignment is impossible, and the alignment step is skipped; then step S426 is executed to delete the abnormal map point in the map model; then step S427 is executed to add a new map point in the map model. The flow of the key frame processing thread ends.

The loopback detection thread can perform global optimization based on the local optimization of the key frame processing thread. Specifically, it includes: first perform step S431 to perform loopback detection; if it is a loopback candidate frame, perform step S432 to perform geometric verification; if the geometric verification passes , Step S433 is executed to optimize the map model globally.

Exemplary embodiments of the present disclosure also provide a pose determination device based on depth information. As shown in FIG. 5, the pose determination device 500 may include: an image acquisition module 510 for acquiring current information about the scene through a camera. Frame image and obtain the depth information of the current frame image; the feature point extraction module 520 is used to extract feature points from the current frame image, determine feature points with valid depth information as three-dimensional feature points, and determine feature points with invalid depth information as Two-dimensional feature points; the function construction module 530 is used to match two-dimensional feature points and three-dimensional feature points with local map points respectively to construct a first error function, the first error function includes a two-dimensional error term and a three-dimensional error term, The two-dimensional error term is the error between the successfully matched two-dimensional feature point and the local map point, and the three-dimensional error term is the error between the successfully matched three-dimensional feature point and the local map point; the pose determination module 540 is used to calculate The minimum value of the first error function determines the pose parameters of the camera in the current frame.

In an optional implementation, the error between the successfully matched two-dimensional feature point and the local map point may be a reprojection error, and the error between the successfully matched three-dimensional feature point and the local map point may be the iterative nearest neighbor error.

In an optional implementation manner, the local map points may include: the previous key frame and the map points that appeared in the common view key frame of the previous key frame; wherein, the previous key frame is the key closest to the current frame frame.

In an alternative embodiment, if the pose determination device 500 aligns the IMU and the visual signal unit in advance, the first error function may also include an inertial measurement error term, which is the error between the IMU and the visual signal unit. , The visual signal unit includes a camera.

In an optional embodiment, the pose determination device 500 may further include: an IMU alignment module, configured to calculate the minimum error between the rotation parameter in the pre-integration of the IMU and the rotation parameter measured by the visual signal unit , Get the gyroscope bias of the IMU, calculate the minimum error between the position parameter in the pre-integration of the IMU and the position parameter measured by the visual signal unit to get the gravity acceleration of the IMU, and based on the gyroscope bias and gravity of the IMU Acceleration, align the IMU and the visual signal unit.

In an optional implementation manner, the pose determining apparatus 500 may further include: a map update module, configured to determine the current frame as a new key frame, and update the map model of the scene according to the new key frame and the pose parameters.

In an optional implementation manner, the map update module may also be used to determine the current frame as a new key frame when it is determined that the current frame meets the first preset condition, and when it is determined that the current frame does not meet the first preset condition When, the current frame is determined as a normal frame, and the next frame is processed; the first preset condition can include any one or a combination of the following: the current frame is more than the preset number of frames from the previous key frame, and the previous key frame It is the key frame closest to the current frame; the disparity between the current frame and the previous key frame exceeds the preset difference; the number of feature points successfully tracked in the current frame is less than the preset number, and the number of feature points successfully tracked is determined by the following method: Statistics The number of two-dimensional feature points whose error with the local map point is less than the first threshold, and the number of three-dimensional feature points whose error with the local map point is less than the second threshold, the sum of the two numbers is the feature point with successful tracking Quantity.

In an alternative embodiment, the map update module may include: a key frame acquisition unit for acquiring a new key frame and other key frames associated with the new key frame to form a key frame set; a map point acquiring unit, It is used to obtain all the map points that have appeared in the key frame set to form a map point set; the second function construction unit is used to construct a second function based on the key frame set and the pose parameters of each key frame and the map point set. Error function. The second error function includes a reprojection error term. The reprojection error term is the sum of reprojection errors from any map point in the map point set to any key frame in the key frame set; an optimization processing unit for By calculating the minimum value of the second error function, the pose parameters of each key frame in the key frame set and the coordinates of each map point in the map point set are optimized to update the map model.

In an optional implementation manner, the second error function may also include an inter-frame inertial measurement error term, which is the sum of errors between any two adjacent key frames of the IMU in the key frame set.

In an optional implementation manner, other key frames associated with the new key frame may include: M key frames closest to the new key frame, and N common view key frames of the new key frame; wherein, M and N are preset positive integers.

In an optional implementation manner, the map update module may further include: a map point deletion unit, configured to delete map points in the map point set that meet the second preset condition from the map model; wherein, the second preset Conditions may include: the number of key frames where the map point is successfully tracked in the key frame set is less than the predicted number multiplied by the preset ratio, which is less than or equal to 1; or the reprojection of the map point on each key frame in the key frame set The mean value of the error is greater than the preset error threshold.

In an optional implementation manner, the map update module may further include: a map point adding unit configured to, if there is a feature point in the new key frame that does not match the local map point, collect the feature point with the key frame The feature points of other key frames are matched, and triangulation calculation is performed according to the matching results to obtain new map points to add to the map model.

In an optional implementation manner, the pose determination device 500 may further include: a loop detection module, configured to perform loop detection for the new key frame, so as to globally optimize the map model.

The specific details of the modules/units of the above-mentioned device have been described in detail in the implementation of the method part, and the details of the undisclosed solution can be referred to the content of the implementation in the method part, and thus will not be repeated.

Those skilled in the art can understand that various aspects of the present disclosure can be implemented as systems, methods, or program products. Therefore, various aspects of the present disclosure can be specifically implemented in the following forms, namely: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software, which may be collectively referred to herein as "Circuit", "Module" or "System".

Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which is stored a program product capable of implementing the above-mentioned method of this specification. In some possible implementation manners, various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code. When the program product runs on a terminal device, the program code is used to make the terminal device execute the above-mentioned instructions in this specification. The steps according to various exemplary embodiments of the present disclosure are described in the "Exemplary Methods" section.

With reference to FIG. 6, a program product 600 for implementing the above method according to an exemplary embodiment of the present disclosure is described. It can adopt a portable compact disk read-only memory (CD-ROM) and include program codes, and can be used in a terminal Running on equipment, such as a personal computer. However, the program product of the present disclosure is not limited thereto. In this document, the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.

The program product can adopt any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Type programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.

The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.

The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.

The program code for performing the operations of the present disclosure can be written in any combination of one or more programming languages. The programming languages include object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural programming. Language-such as "C" language or similar programming language. The program code can be executed entirely on the user's computing device, partly on the user's device, executed as an independent software package, partly on the user's computing device and partly executed on the remote computing device, or entirely on the remote computing device or server Executed on. In the case of a remote computing device, the remote computing device can be connected to a user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, using Internet service providers) Business to connect via the Internet).

Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method. The electronic device 700 according to this exemplary embodiment of the present disclosure is described below with reference to FIG. 7. The electronic device 700 shown in FIG. 7 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 7, the electronic device 700 may be in the form of a general-purpose computing device. The components of the electronic device 700 may include, but are not limited to: the aforementioned at least one processing unit 710, the aforementioned at least one storage unit 720, a bus 730 connecting different system components (including the storage unit 720 and the processing unit 710), and a display unit 740.

The storage unit 720 stores program codes, and the program codes can be executed by the processing unit 710 so that the processing unit 710 executes the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification. For example, the processing unit 710 may execute the method steps shown in FIG. 2, FIG. 3, or FIG. 4, etc.

The storage unit 720 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 721 and/or a cache storage unit 722, and may further include a read-only storage unit (ROM) 723.

The storage unit 720 may also include a program/utility tool 724 having a set of (at least one) program module 725. Such program module 725 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples or some combination may include the implementation of a network environment.

The bus 730 may represent one or more of several types of bus structures, including a storage unit bus or a storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local area using any bus structure among multiple bus structures. bus.

The electronic device 700 may also communicate with one or more external devices 800 (such as keyboards, pointing devices, Bluetooth devices, etc.), and may also communicate with one or more devices that enable a user to interact with the electronic device 700, and/or communicate with Any device (such as a router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 750. In addition, the electronic device 700 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 760. As shown in the figure, the network adapter 760 communicates with other modules of the electronic device 700 through the bus 730. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.

Through the description of the foregoing embodiments, those skilled in the art can easily understand that the exemplary embodiments described herein can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the exemplary embodiment of the present disclosure.

In addition, the above-mentioned drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiment of the present disclosure, and are not intended for limitation. It is easy to understand that the processing shown in the above drawings does not indicate or limit the time sequence of these processings. In addition, it is easy to understand that these processes can be executed synchronously or asynchronously in multiple modules, for example.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.

Those skilled in the art will easily think of other embodiments of the present disclosure after considering the description and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptive changes of the present disclosure, which follow the general principles of the present disclosure and include common knowledge or conventional technical means in the technical field not disclosed in the present disclosure . The description and the embodiments are only regarded as exemplary, and the true scope and spirit of the present disclosure are pointed out by the claims.

It should be understood that the present disclosure is not limited to the precise structure that has been described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is only limited by the appended claims.

Claims

A method for determining a pose based on depth information, which is characterized in that it includes:

Acquiring a current frame image about the scene through a camera, and acquiring depth information of the current frame image;

Extracting feature points from the current frame image, determining feature points with valid depth information as three-dimensional feature points, and determining feature points with invalid depth information as two-dimensional feature points;

The two-dimensional feature points and the three-dimensional feature points are matched with local map points of the scene respectively to construct a first error function. The first error function includes a two-dimensional error term and a three-dimensional error term. The error term is the error between the successfully matched two-dimensional feature point and the local map point, and the three-dimensional error term is the error between the successfully matched three-dimensional feature point and the local map point;

By calculating the minimum value of the first error function, the pose parameters of the camera in the current frame are determined.
The method according to claim 1, wherein if the inertial measurement unit and the visual signal unit are aligned in advance, the first error function further includes an inertial measurement error term, which is the inertial measurement unit and the visual signal unit. Errors between visual signal units, the visual signal unit including the camera.
The method according to claim 2, wherein the aligning the inertial measurement unit and the visual signal unit comprises:

Obtaining the gyroscope bias of the inertial measurement unit by calculating the minimum error between the rotation parameter in the pre-integration of the inertial measurement unit and the rotation parameter measured by the visual signal unit;

Obtaining the gravity acceleration of the inertial measurement unit by calculating the minimum error between the position parameter in the pre-integration of the inertial measurement unit and the position parameter measured by the visual signal unit;

Based on the gyroscope bias and gravitational acceleration of the inertial measurement unit, the inertial measurement unit and the visual signal unit are aligned.
The method according to any one of claims 1-3, wherein the method further comprises:

The current frame is determined as a new key frame, and the map model of the scene is updated according to the new key frame and the pose parameters.
The method according to claim 4, wherein after determining the pose parameters of the camera in the current frame, the method further comprises:

When it is determined that the current frame meets the first preset condition, determining the current frame as a new key frame;

When it is determined that the current frame does not meet the first preset condition, determine the current frame as a normal frame, and enter the processing of the next frame;

The first preset condition includes any one or a combination of the following:

The current frame is more than a preset number of frames from the previous key frame, and the previous key frame is the key frame closest to the current frame;

The disparity between the current frame and the previous key frame exceeds a preset difference;

The number of feature points successfully tracked in the current frame is less than a preset number, and the number of feature points successfully tracked is determined by the following method:

The number of the two-dimensional feature points whose error between statistics and local map points is less than the first threshold, and the number of three-dimensional feature points whose error with the local map points is less than the second threshold, the sum of the two numbers is The number of feature points with successful tracking.
The method according to claim 4, wherein the updating the map model of the scene according to the new key frame and the pose parameter comprises:

Acquiring the new key frame and other key frames associated with the new key frame to form a key frame set;

Acquiring all map points that have appeared in the key frame set to form a map point set;

Based on the key frame set and the pose parameters of each of the key frames, and the map point set, a second error function is constructed. The second error function includes a reprojection error term. The sum of reprojection errors from any map point in the map point set to any key frame in the key frame set;

By calculating the minimum value of the second error function, the pose parameters of each key frame in the key frame set and the coordinates of each map point in the map point set are optimized to update the map model.
The method according to claim 6, wherein the second error function further includes an inter-frame inertial measurement error term, which is an error between any two adjacent key frames of the inertial measurement unit in the key frame set. with.
The method according to claim 6, wherein the other key frames associated with the new key frame comprise: M key frames closest to the new key frame, and the new key frame N common-view key frames of, where M and N are preset positive integers.
The method according to claim 6, wherein when the map model is updated, the map points satisfying a second preset condition in the map point set are also deleted from the map model; wherein, the The second preset conditions include:

The number of key frames for which the map point is successfully tracked in the key frame set is less than the predicted number multiplied by a preset ratio, and the preset ratio is less than or equal to 1; or

The average value of the reprojection error of the map point on each key frame in the key frame set is greater than a preset error threshold.
The method according to claim 6, wherein when the map model is updated, if there is a feature point that does not match the local map point in the new key frame, then the feature point and the The feature points of other key frames in the key frame set are matched, and triangulation calculation is performed according to the matching result to obtain new map points to be added to the map model.
The method according to claim 4, wherein the method further comprises:

Perform loopback detection on the new key frame to optimize the map model globally.
A device for determining a pose based on depth information, characterized in that it comprises:

An image acquisition module for acquiring a current frame image about the scene through a camera, and acquiring depth information of the current frame image;

The feature point extraction module is configured to extract feature points from the current frame image, determine feature points with valid depth information as three-dimensional feature points, and determine feature points with invalid depth information as two-dimensional feature points;

The function construction module is used to match the two-dimensional feature points and the three-dimensional feature points with local map points respectively to construct a first error function. The first error function includes a two-dimensional error term and a three-dimensional error term. The two-dimensional error term is the error between the successfully matched two-dimensional feature point and the local map point, and the three-dimensional error term is the error between the successfully matched three-dimensional feature point and the local map point;

The pose determination module is configured to determine the pose parameters of the camera in the current frame by calculating the minimum value of the first error function.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program implements the method of any one of claims 1-11 when the computer program is executed by a processor.
An electronic device, characterized in that it comprises:

Processor; and

A memory for storing executable instructions of the processor;

Wherein, the processor is configured to execute the method according to any one of claims 1-11 by executing the executable instructions.