WO2022174711A1 - Visual inertial system initialization method and apparatus, medium, and electronic device - Google Patents


Info

Publication number
WO2022174711A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame image
current frame
pose
image
poses
Prior art date
Application number
PCT/CN2022/072711
Other languages
French (fr)
Chinese (zh)
Inventor
尹赫 (Yin He)
Original Assignee
OPPO Guangdong Mobile Telecommunications Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OPPO Guangdong Mobile Telecommunications Co., Ltd.
Publication of WO2022174711A1 publication Critical patent/WO2022174711A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present disclosure relates to the technical field of visual positioning, and in particular, to a visual inertial system initialization method, a visual inertial system initialization device, a computer-readable medium, and an electronic device.
  • indoor positioning technology is an essential requirement for mobile devices such as mobile phones, AR glasses, and indoor robots.
  • in indoor environments, a mobile device cannot determine its own position through a global positioning technology such as GPS (Global Positioning System), and can only rely on its own sensors to achieve positioning.
  • the most direct and readily available data are visual sensor (e.g., camera) data and inertial measurement unit (IMU) data, both of which can be combined with algorithms to achieve positioning.
  • positioning technology using only vision sensors initially developed rapidly, but as the technology matured, the inherent defects of vision sensors were exposed, and cameras alone can no longer break through the bottleneck faced by current positioning technology. Techniques that use only the IMU for localization face the same bottleneck.
  • VIO (Visual-Inertial Odometry): odometry that fuses visual and inertial navigation data.
  • a method for initializing a visual inertial system, comprising: in the process of receiving images, performing frame-by-frame calculation on the images until a first preset number of poses are obtained; and determining the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and deviation rate; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and the feature points and depth information of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive a new current frame image.
  • an apparatus for initializing a visual inertial system, comprising: a pose determination module configured to perform frame-by-frame calculation on images during the process of receiving images, until a first preset number of poses are obtained;
  • an initialization module configured to determine the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and deviation rate; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
  • a computer-readable medium on which a computer program is stored, which, when executed by a processor, implements the above-mentioned method.
  • an electronic device comprising: a processor; and a memory for storing one or more programs which, when executed by one or more processors, enable the one or more processors to implement the above-mentioned method.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
  • FIG. 2 shows a schematic diagram of an electronic device to which an embodiment of the present disclosure can be applied
  • FIG. 3 schematically shows a flowchart of a method for initializing a visual-inertial system in an exemplary embodiment of the present disclosure
  • FIG. 4 schematically shows a flow chart of frame-by-frame calculation in an exemplary embodiment of the present disclosure
  • FIG. 5 schematically shows a schematic diagram of the principle of determining the pose corresponding to the current frame image by using the current frame image and the previous frame image in an exemplary embodiment of the present disclosure
  • FIG. 6 schematically shows a schematic diagram of a frame-by-frame calculation process in an exemplary embodiment of the present disclosure
  • FIG. 7 schematically shows a schematic diagram of another frame-by-frame calculation process in an exemplary embodiment of the present disclosure
  • FIG. 8 schematically shows a schematic diagram of still another frame-by-frame calculation process in an exemplary embodiment of the present disclosure
  • FIG. 9 schematically shows a schematic diagram of a map point recovery process in an exemplary embodiment of the present disclosure.
  • FIG. 10 schematically shows a flowchart of another method for initializing a visual-inertial system in an exemplary embodiment of the present disclosure
  • FIG. 11 schematically shows a schematic diagram of the composition of a visual-inertial system initialization apparatus in an exemplary embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • FIG. 1 shows a schematic diagram of a system architecture of an exemplary application environment to which a visual-inertial system initialization method and apparatus according to embodiments of the present disclosure can be applied.
  • the system architecture 100 may include one or more of terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the terminal devices 101, 102, and 103 may be various terminal devices having a visual inertial system, including but not limited to desktop computers, portable computers, smart phones, and tablet computers, and so on. It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the server 105 may be a server cluster composed of multiple servers, or the like.
  • the visual inertial system initialization method provided by the embodiment of the present disclosure is generally performed by the terminal devices 101 , 102 , and 103 , and accordingly, the visual inertial system initialization apparatus is generally set in the terminal devices 101 , 102 , and 103 .
  • the method for initializing the visual inertial system provided by the embodiment of the present disclosure can also be executed by the server 105 , and correspondingly, the visual inertial system initialization device can also be set in the server 105 .
  • this is not specially limited in this exemplary embodiment.
  • for example, the user may collect images through the visual sensors of the terminal devices 101, 102, 103 and send the images to the server 105, so that the server 105 performs the pose calculation and sends the calculation results back to the terminal devices 101, 102, and 103; the terminal devices 101, 102, and 103 then determine the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit from the poses sent by the server 105, and then initialize the visual inertial system.
  • Exemplary embodiments of the present disclosure provide an electronic device for implementing a visual inertial system initialization method, which may be the terminal devices 101 , 102 , 103 or the server 105 in FIG. 1 .
  • the electronic device includes at least a processor and a memory, the memory is used to store executable instructions of the processor, and the processor is configured to execute the visual inertial system initialization method by executing the executable instructions.
  • the following takes the mobile terminal 200 in FIG. 2 as an example to illustrate the structure of the electronic device. It will be understood by those skilled in the art that the configuration in FIG. 2 can also be applied to stationary devices, apart from the components specifically intended for mobile purposes.
  • the mobile terminal 200 may include more or fewer components than shown, or combine some components, or separate some components, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the interface connection relationship between the components is only schematically shown, and does not constitute a structural limitation of the mobile terminal 200 .
  • the mobile terminal 200 may also adopt an interface connection manner different from that in FIG. 2 , or a combination of multiple interface connection manners.
  • the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, a subscriber identity module (SIM) card interface 295, and the like.
  • the sensor module 280 may include a depth sensor 2801, an inertial sensor 2802, a gyroscope sensor 2803, and the like.
  • the processor 210 may include one or more processing units, for example, the processor 210 may include an application processor (Application Processor, AP), a modem processor, a graphics processor (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), controller, video codec, digital signal processor (Digital Signal Processor, DSP), baseband processor and/or Neural-Network Processing Unit (NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • a memory is provided in the processor 210 .
  • the memory can store instructions for implementing six modular functions: detection instructions, connection instructions, information management instructions, analysis instructions, data transmission instructions, and notification instructions, and the execution is controlled by the processor 210 .
  • the mobile terminal 200 may implement a shooting function through an ISP, a camera module 291, a video codec, a GPU, a display screen 290, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera module 291;
  • the camera module 291 is used to capture still images or videos;
  • the digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals;
  • the codec is used to compress or decompress the digital video, and the mobile terminal 200 may also support one or more video codecs.
  • the above-mentioned camera module 291 may be used as a visual sensor in a visual inertial system, and image acquisition is performed by the camera module 291 .
  • the depth sensor 2801 is used to acquire depth information of the scene.
  • a depth sensor may be disposed in the camera module 291 for capturing depth information corresponding to the image while capturing the image.
  • the inertial sensor 2802, also known as an inertial measurement unit, may be used to detect and measure acceleration and rotational motion.
  • the gyro sensor 2803 may be used to determine the motion attitude of the mobile terminal 200 .
  • sensors with other functions can also be provided in the sensor module 280 according to actual needs, such as an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
  • ARCore uses a monocular camera and IMU sensor; it detects visually distinct feature points in the captured camera image and uses these points to calculate its position change. This visual information is combined with inertial measurements from the device's IMU to estimate the camera's pose relative to the surrounding world over time. ARCore provides good pose information and environmental information for applications such as AR and indoor navigation on Android phones.
  • the first related-art initialization scheme is applied to the VINS system.
  • the specific process is as follows: accumulate 10 frames of images, select two frames L and R with sufficient parallax from among them, and use the epipolar geometric constraint to solve the pose between the two frames. Then use the pose to triangulate and recover map points co-visible between the two frames. Project these map points onto frames other than L and R, calculate each frame's pose by minimizing the reprojection error, then triangulate between that frame and frames L and R to recover still more map points. By repeating this process, the poses of the 10 frames and the map points corresponding to the 10 frames of images can be solved.
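The triangulation step in this related-art scheme can be sketched with the standard linear (DLT) method. The following is an illustrative NumPy implementation, not part of the patent text; the projection matrices are assumed to be known 3x4 camera matrices.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen in two views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D observations.
    Returns the 3D point in the common (world) frame.
    """
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point X: x * (P[2] @ X) - (P[0] @ X) = 0, etc.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point is the null vector of A (smallest singular value).
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With more than two co-visible frames, additional rows are stacked into `A` in the same way.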
  • then, rotation constraints and translation constraints are used to align the previously determined poses of the 10 frames of images with the IMU; those poses are treated as accurate and used to constrain the variables to be solved for the IMU, and finally SVD decomposition is used to solve a system of non-homogeneous equations and determine all unknowns.
  • the second scheme is applied to the RGBD-VINS system; the specific process is as follows: after 10 frames of images have been accumulated in the system, two images L and R with sufficient parallax are selected from them. Since the RGBD-VINS system includes a depth camera, the depth of each frame is known, so it is no longer necessary to use the 2D-to-2D epipolar geometric constraint to solve the pose between the two frames; instead, the reprojection error from 3D points to 2D points is minimized, so the solved pose directly has a deterministic scale. In the recovery of map points, triangulation is no longer used; instead, for each co-visible feature point, the corresponding depth value from the depth camera is used directly to restore its three-dimensional coordinates, i.e., the map points corresponding to each frame of image.
  • the first method mainly has the following shortcomings: 1. Using a monocular camera for system initialization requires that the images used for initialization have sufficient parallax and that there are enough matching points between the two frames to ensure successful initialization. However, detecting parallax and finding enough matching points takes extra time, and if the requirements are not met, initialization may fail and restart, resulting in a low initialization success rate. 2. Ten consecutive frames must be accumulated before the initialization process can start, so no pose can be output when the system begins running. 3. Using a monocular camera for visual initialization causes scale uncertainty in the pose.
  • the second method has the following disadvantages: 1. Ten frames must be accumulated before initialization can be performed; before the accumulation is complete, the system does no work and wastes considerable time. Meanwhile, parallax must still be detected across the 10 frames, so initialization will still fail in scenes with small parallax. 2. No pose can be output when the system starts running. 3. In the recovery of map points, the depth camera information is trusted completely. When the depth camera is noisy, or a large number of map points lie outside the depth camera's range, the number of map points available for pose calculation may be severely reduced, causing large errors in the calculation results or even failure to obtain a result.
  • this example embodiment provides a method for initializing a visual-inertial system.
  • the method for initializing the visual inertial system may be applied to the foregoing server 105, and may also be applied to one or more of the foregoing terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment.
  • the visual inertial system initialization method may include the following steps S310 and S320:
  • step S310 in the process of receiving the image, frame-by-frame calculation is performed on the image until a first preset number of poses are obtained.
  • the above frame-by-frame calculation is shown in FIG. 4 and may include the following steps S410 to S430:
  • step S410 when a frame of the current frame image is received, the feature points of the current frame image and the depth information corresponding to the feature points are extracted.
  • in general, the time interval between image frames is about 100 ms. Therefore, in the related art, if 10 frames of images need to be accumulated, the accumulation takes about 1000 ms, and initialization can begin only after that. To avoid wasting time while images accumulate, each frame can be processed as soon as it is received. Specifically, when a current frame image is received, the feature points in it are extracted first, and at the same time the depth information corresponding to those feature points is extracted from the depth data collected by the depth sensor.
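The per-frame control flow described above can be sketched as follows. This is an illustrative skeleton, not the patent's implementation: the `estimate_pose` callable stands in for the feature-matching and depth-based PnP step, and the class and parameter names are placeholders.

```python
import numpy as np

class FrameByFrameInitializer:
    """Sketch of the frame-by-frame pose accumulation (steps S410-S430).

    `estimate_pose(prev_frame, frame)` is injected so the control flow
    can be shown without a full vision pipeline.
    """

    def __init__(self, estimate_pose, first_preset_number=10):
        self.estimate_pose = estimate_pose
        self.first_preset_number = first_preset_number
        self.poses = []
        self.prev_frame = None

    def process(self, frame):
        """Handle one incoming frame; return True once enough poses exist."""
        if self.prev_frame is None:
            # The first frame's pose is fixed to the preset (identity) pose
            # and counts toward the first preset number.
            self.poses.append(np.eye(4))
        else:
            pose = self.estimate_pose(self.prev_frame, frame)
            self.poses.append(pose)
        self.prev_frame = frame  # the current frame becomes the previous frame
        return len(self.poses) >= self.first_preset_number
```

Because each frame is processed on arrival, pose accumulation overlaps with image reception instead of waiting for a 10-frame buffer to fill.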
  • step S420 the pose corresponding to the current frame image is determined based on the feature points and depth information of the current frame image and the feature points and depth information of the previous frame image.
  • the feature points and depth information of the current frame image can be compared with those of the previous frame image (i.e., the frame received immediately before the current frame image) to determine the pose corresponding to the current frame image.
  • when determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image, the feature points of the current frame image are first matched against the feature points of the previous frame image to obtain matching feature points; since the pose of the previous frame image has already been determined, the depth information corresponding to the previous frame's feature points can be used to determine the map points of the matching feature points in the three-dimensional coordinate system; these map points are then projected into the current frame image, and the pose corresponding to the current frame image is determined.
  • specifically, optical flow matching can be performed between the feature points of the current frame image and those of the previous frame image; the map points corresponding to the matching feature points are then determined from the depth information of the previous frame's feature points and reprojected into the current frame image; finally, the pose corresponding to the current frame image is calculated by minimizing the reprojection error via PnP (Perspective-n-Point).
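The quantity minimized by the PnP step is the reprojection error. The following illustrative NumPy sketch (not from the patent) shows the residuals for a candidate pose under a pinhole camera model; the intrinsic matrix `K` is an assumed example.

```python
import numpy as np

def reprojection_residuals(R, t, K, map_points, observations):
    """Residuals minimized by PnP: project each 3D map point with a
    candidate pose (R, t) and compare with the matched 2D feature.

    R: 3x3 rotation, t: 3-vector (world -> current camera frame),
    K: 3x3 pinhole intrinsics, map_points: Nx3, observations: Nx2.
    """
    cam = map_points @ R.T + t          # world points into the camera frame
    proj = cam @ K.T                    # apply pinhole intrinsics
    pix = proj[:, :2] / proj[:, 2:3]    # perspective division
    return pix - observations           # per-point 2D reprojection error
```

A PnP solver searches for the (R, t) that drives these residuals toward zero, typically with RANSAC over the matches to reject outliers.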
  • the above three-dimensional coordinate system refers to the world coordinate system, which can usually be determined from the camera coordinate system corresponding to the first frame image input to the visual inertial system; all poses determined from the second frame onward are relative poses with respect to this world coordinate system. Specifically, when the received current frame image is the first frame image, its camera coordinate system can be obtained directly and set as the world coordinate system.
  • since the depth information of the current frame image is collected by the depth camera, it is not necessary to use the epipolar constraint to solve the pose between the two frame images; even if the parallax between the two images is very small, map points can be projected using the depth information and the relative pose solved by minimizing the reprojection error. Therefore, it is unnecessary to judge in advance whether the parallax between the current frame image and the previous frame image is sufficient.
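Lifting a matched feature to a 3D map point with measured depth needs no triangulation, only back-projection through the pinhole model. A minimal sketch (illustrative, with assumed intrinsic parameters, not from the patent text):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a measured depth to a 3D point in the
    camera frame using the pinhole model (fx, fy: focal lengths in
    pixels; cx, cy: principal point)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])
```

This is why small parallax is acceptable: the point's distance comes from the depth sensor rather than from the geometry between the two views.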
  • step S430 the current frame image is taken as the previous frame image, and a new current frame image is continuously received.
  • since a sufficient number of poses are required to initialize the IMU, before determining the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit, the poses corresponding to multiple current frame images must be determined repeatedly, until the number of determined poses equals the first preset number required for IMU initialization. Specifically, after a current frame image is received and its corresponding pose determined, that image is taken as the previous frame image, so that the pose of the newly received current frame image can be determined using the most recently received frame as the previous frame image.
  • for example, when the received current frame image is the second frame image, the feature points and corresponding depth information in the second frame image, together with those in the previous frame image (the first frame image), can be used to determine the pose corresponding to the second frame image; the second frame image is then taken as the previous frame image. When the received current frame image is the third frame image, since the previous frame image has changed from the first frame image to the second frame image, the pose corresponding to the third frame image can be determined directly from the feature points and depth information of the third frame image and those of the previous frame image (the second frame image). This process is repeated, frame by frame, to determine the pose of each successively input current frame image, until the pose corresponding to the i-th frame is determined and a first preset number of poses have been obtained.
  • the pose corresponding to the first frame image can be directly determined as the preset pose.
  • the preset pose may be an identity matrix. It should be noted that after the pose of the first frame image is set to the preset pose, the preset pose is also counted toward the number of poses. That is, assuming the first preset number is 10, nine more poses need to be determined through the repeated process above in addition to the preset pose; the preset pose plus the nine determined poses together equal the first preset number of 10. At that point, the repeated pose determination can stop and subsequent processing can be performed.
  • the received current frame image may contain interference such as noise, and the pose calculation may fail because of such interference; for example, noise may cause the PnP solution that minimizes the reprojection error to fail.
  • in this case, the current frame image can be discarded and not taken as the previous frame image in step S430; instead, the previous previous frame image is retained, a new current frame image is received, and the pose corresponding to the new current frame image is calculated from the new current frame image and the retained previous frame image.
  • for example, suppose the current frame image is a 6th frame image a and the previous frame image is the 5th frame image; if the pose of image a cannot be determined, image a is discarded, the next acceptable current frame image is used as a new 6th frame image b to continue the calculation, and the pose corresponding to the new 6th frame image b is determined. Then, taking the new 6th frame image b as the previous frame image, the frame-by-frame calculation continues.
  • if the pose corresponding to the current frame image is not successfully determined, the number of such current frame images can be counted; when the number of current frame images whose pose was not successfully determined equals a second preset number, all previously determined poses can be cleared, i.e., the pose count is reset to 0. Then new current frame images are received again, the above frame-by-frame calculation is performed again, and poses are accumulated anew until their number equals the first preset number.
  • there are two situations in which the count equals the above-mentioned second preset number: one is that the cumulative number of current frame images whose pose was not successfully determined equals the second preset number; the other is that a second preset number of consecutive current frame images fail to have their corresponding poses successfully determined.
  • in the second case, to satisfy the condition that the corresponding pose is not successfully determined for a second preset number of consecutive current frame images, counting starts after the first current frame image whose pose is not successfully determined appears; before the count reaches the second preset number, if any current frame image successfully determines its corresponding pose, the previously accumulated count of current frame images with undetermined poses can be reset.
  • for example, assuming the second preset number is 3, when the previous frame image is the fifth frame image and the current frame image (a first 6th frame image d) fails to determine its corresponding pose, the failure count n is incremented; if a subsequent current frame image successfully determines its pose before n reaches 3, n can be reset to 0.
  • the third sixth frame image can then be used as the previous frame image, a new current frame image can be obtained, and the pose corresponding to the new current frame image can be determined from the new current frame image and the previous frame image (the third sixth frame image).
  • this mechanism can prevent pose calculation failures from causing the initialization to fail, improving the success rate of initialization and making it more robust.
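As a runnable illustration, the failure-counting and reset behaviour described above can be sketched as follows. The function name, the boolean frame stream, and the default preset numbers are assumptions for illustration only; in the real system each boolean would be the outcome of the pose calculation for one received frame.

```python
# Illustrative sketch of the consecutive-failure handling described above.
# A pose is accumulated for each frame whose pose is determined; n_failures
# counts consecutive failures and is reset by any success; reaching the
# second preset number clears all accumulated poses.

def run_frame_stream(results, first_preset=10, second_preset=3):
    """results: iterable of booleans, True = pose determined for that frame.
    Returns the list of accumulated pose indices once first_preset poses
    are collected, or None if the stream ends first."""
    poses = []          # accumulated poses (here just frame indices)
    n_failures = 0      # consecutive frames whose pose was not determined
    for idx, ok in enumerate(results):
        if ok:
            poses.append(idx)
            n_failures = 0          # any success resets the failure count
            if len(poses) == first_preset:
                return poses
        else:
            n_failures += 1
            if n_failures == second_preset:
                poses.clear()       # clear all poses determined so far
                n_failures = 0
    return None
```

Any success resets the consecutive-failure count, matching the second situation described above, while three consecutive failures discard every pose accumulated so far.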
  • in step S320, the motion speed, gravity vector, and bias corresponding to the inertial measurement unit are determined according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias.
  • specifically, the first preset number of poses may be aligned with the inertial measurement unit, using rotation and translation constraints to calculate the motion velocity, gravity vector, and bias of the inertial measurement unit.
  • the scale of the pose does not need to be estimated during alignment, since the depth information has already fixed the scale of the pose.
  • in addition, a more accurate intrinsic and extrinsic calibration can be performed on the inertial measurement unit, yielding an accurate accelerometer bias, gyroscope bias, and extrinsic transformation between the inertial measurement unit and the camera, so that the joint visual-inertial initialization converges faster and the accuracy is further improved.
  • in an exemplary embodiment, map points may be recovered for the target images corresponding to the first preset number of poses to obtain the map points corresponding to each target image; a local bundle adjustment is then constructed from the recovered map points to optimize the first preset number of poses; finally, the motion velocity, gravity vector, and bias corresponding to the inertial measurement unit are determined according to the optimized poses.
  • the target images may include the current frame images for which the first preset number of poses were successfully determined. For example, if a total of 12 current frame images are received in order to determine the first preset number of poses, 2 of which fail to determine their poses, then the remaining 10 current frame images whose corresponding poses were successfully determined are the target images.
  • the number of target images is also the first preset number.
  • target images with a co-visibility matching relationship may be searched for among the first preset number of target images to obtain at least one target image pair. Then, for each target image pair, the depth information of the target images in the pair is used for reprojection, and the reprojection error is calculated.
  • when the reprojection error is small, that is, less than or equal to the preset threshold, the error of the depth information can be considered small, so the depth information can be used for map point recovery.
  • specifically, the feature points can be back-projected using the depth information to recover the map points of the target image pair; conversely, when the reprojection error is large, that is, greater than the preset threshold, the error of the depth information can be considered large, so it is not suitable to recover map points from the depth information.
  • in that case, triangulation can be used to recover the map points of the target image pair. All the recovered map points are then used as a set of map points to construct a local bundle adjustment to optimize the poses.
  • selecting either depth information or triangulation to recover map points in different situations increases the number of map points while ensuring their accuracy; at the same time, even if the depth information received by the VIO system is of poor quality, accurate initialization can still be achieved.
  • when performing map point recovery, in addition to deciding via the reprojection error whether to use depth information, a probability model can also be used to model the uncertainty of the depth information, and whether to use the depth information for map point recovery can then be decided according to the probability of that uncertainty. Using a probability model to judge the uncertainty can reduce the noise introduced by the depth information and realize a more accurate initialization process.
  • the preset threshold may be set according to the actual application scenario, environment, and so on. For example, when the depth information is more reliable, the preset threshold may be set to a larger value; conversely, when the depth information is less reliable, a smaller value may be chosen as the preset threshold.
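A minimal sketch of this decision rule, under the assumption of a pinhole camera with example intrinsics and, for simplicity, an identity relative pose between the images of the pair (a real implementation would reproject into the other image of the target image pair under its estimated relative pose). The helper names are illustrative; the default threshold follows the 1/460 value used in the worked example later in the text.

```python
import math

# Assumed example pinhole intrinsics (fx, fy, cx, cy) for illustration.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

def back_project(u, v, depth):
    """Back-project a pixel (u, v) with its depth into a 3D camera-frame point."""
    x = (u - CX) / FX * depth
    y = (v - CY) / FY * depth
    return (x, y, depth)

def project(p):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = p
    return (FX * x / z + CX, FY * y / z + CY)

def recover_map_point(u, v, depth, u_obs, v_obs, threshold=1.0 / 460.0):
    """Recover one map point for a matched feature.
    (u, v, depth): feature in the reference image; (u_obs, v_obs): its match.
    If the reprojection error of the depth-based point is within the
    threshold, the depth back-projection is kept; otherwise triangulation
    would be used (represented here by a placeholder)."""
    p = back_project(u, v, depth)
    u_hat, v_hat = project(p)            # identity pose used for simplicity
    err = math.hypot(u_hat - u_obs, v_hat - v_obs)
    if err <= threshold:
        return "depth", p                # back-projection via depth
    return "triangulation", None         # fall back to triangulating the match
```

The design choice the text describes is exactly this branch: a cheap depth-consistency check gates the accurate-but-depth-dependent back-projection, and triangulation serves as the fallback when the depth reading disagrees with the match.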
  • in addition, after the gravity vector is determined, a gravity-direction adjustment can be performed on all previously determined poses according to the determined gravity vector, so as to output the adjusted poses.
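The gravity-direction adjustment can be sketched as computing the single rotation that maps the estimated gravity vector onto the world vertical and applying it to every stored pose and map point. The axis-angle (Rodrigues) construction below is a standard technique; the example gravity estimate is an assumed value.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def rotation_between(a, b):
    """3x3 rotation matrix sending unit vector a to unit vector b (Rodrigues)."""
    a, b = normalize(a), normalize(b)
    v = [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]
    c = sum(x * y for x, y in zip(a, b))       # cosine of the angle
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    # R = I + [v]x + [v]x^2 / (1 + c); assumes a and b are not opposite
    return [[(1.0 if i == j else 0.0) + K[i][j] + K2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]

def rotate(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

# Align an assumed gravity estimate with the world "down" direction (0, 0, -1);
# the same rotation would then be applied to every stored pose and map point.
g_est = [0.1, -0.2, -9.78]
R_align = rotation_between(g_est, [0.0, 0.0, -1.0])
g_aligned = rotate(R_align, normalize(g_est))   # now points straight down
```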
  • hereinafter, the technical solutions of the embodiments of the present disclosure will be described in detail taking the first preset number as 10, the second preset number as 3, and the preset threshold as 1/460 as an example.
  • Step S1001 receiving the current frame image
  • Step S1003 performing feature extraction on the current frame image, and extracting depth information corresponding to each feature point;
  • Step S1005 determine whether the current frame image is the first frame image input to the system;
  • Step S1007 when the current frame image is not the first frame image, perform optical flow matching between the current frame image and the previous frame image, and determine the pose corresponding to the current frame image through PnP by minimizing the reprojection error;
  • Step S1009 judging whether the current frame image successfully determines the corresponding pose
  • Step S1011 when the current frame image fails to determine the corresponding pose, determine whether the corresponding pose has not been successfully determined for three consecutive received current frame images;
  • Step S1013 when the corresponding poses are not successfully determined for three consecutive received current frame images, clear all the poses determined before;
  • Step S1015 when only one or two consecutively received current frame images fail to successfully determine the corresponding pose, discard the current frame image
  • Step S1017 when the current frame image successfully determines the corresponding pose, the current frame image is used as the previous frame image;
  • Step S1019 judging whether the number m of the accumulated determined poses is equal to the first preset number 10;
  • Step S1021 when m is equal to 10, search for a common view relationship in 10 target images whose poses are successfully determined, and determine the target image pair;
  • Step S1023 reproject the depth information of the target image in the target image pair, and calculate the reprojection error
  • Step S1025 when the calculated re-projection error is less than or equal to 1/460, use the depth information to back-project the feature points to restore map points;
  • Step S1027 when the calculated reprojection error is greater than 1/460, use the triangulation method to restore map points;
  • Step S1029 performing local BA adjustment on the poses corresponding to the 10 target images through the restored map points
  • Step S1031 initialize the IMU through the optimized 10 poses, and calculate the initial velocity, gravity vector, and bias corresponding to the IMU;
  • Step S1033 adjusting the gravity direction of all poses and map points according to the gravity vector
  • Step S1035 output the pose.
  • it should be noted that before step S1035, that is, before the IMU is initialized, the pose corresponding to the current frame image determined from the current frame image and the previous frame image may be output directly; alternatively, after the IMU is initialized, the pose may be gravity-adjusted and then output.
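Under the assumption that each step is available as a callable, the control flow of steps S1001 to S1035 can be condensed into the runnable skeleton below. All helper functions are hypothetical placeholders for the operations the steps name, and the preset numbers follow the worked example (10 poses, 3 consecutive failures).

```python
# Control-flow skeleton of steps S1001-S1035. The helpers (get_frame,
# estimate_pose, recover_map_points, local_ba, init_imu, align_gravity)
# are hypothetical placeholders for the operations named in the steps.

FIRST_PRESET, SECOND_PRESET = 10, 3

def initialize(get_frame, estimate_pose, recover_map_points,
               local_ba, init_imu, align_gravity):
    poses, prev, failures = [], None, 0
    while len(poses) < FIRST_PRESET:                    # S1019
        frame = get_frame()                             # S1001/S1003
        if prev is None:                                # S1005: first frame
            prev = frame
            poses.append("identity")                    # preset pose
            continue
        pose = estimate_pose(prev, frame)               # S1007: flow + PnP
        if pose is None:                                # S1009/S1011
            failures += 1
            if failures == SECOND_PRESET:               # S1013: clear all
                poses, prev, failures = [], None, 0
            continue                                    # S1015: drop frame
        failures = 0
        poses.append(pose)                              # S1017
        prev = frame
    pts = recover_map_points(poses)                     # S1021-S1027
    poses = local_ba(poses, pts)                        # S1029
    velocity, gravity, bias = init_imu(poses)           # S1031
    return align_gravity(poses, gravity)                # S1033/S1035
```

Note that after a clear (S1013) the sketch treats the next received frame as a fresh first frame, mirroring the restart of the frame-by-frame calculation described above.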
  • the public data set A ⁇ B ⁇ C (hereinafter referred to as the STU data set) released by ShanghaiTech University (STU) is used for experimental verification, and the results shown in Table 1 can be obtained. Based on Table 1, the following conclusions can be drawn:
  • the present embodiment achieves a significant improvement in the time consumed by the initialization process.
  • compared with VINS-MONO, the average speed is increased by 6 to 7 times; compared with VINS-RGBD, the average speed is increased by 3 to 4 times.
  • the time at which this embodiment outputs the camera pose is much earlier than that of VINS-MONO and VINS-RGBD; the pose is output at least 30 frames earlier.
  • the pose can be output as soon as the system starts running.
  • in terms of accuracy, the final overall trajectory estimation accuracy of this embodiment is basically the same as that of the VINS-RGBD scheme and is better than VINS-MONO in most cases. When the depth quality of the received images is poor, this embodiment combines triangulation with depth information to recover map points, which recovers more three-dimensional map points while ensuring accuracy and brings a significant accuracy improvement over the VINS-RGBD scheme.
  • VINS-MONO has two sets of trajectories that failed to be initialized successfully and failed to track
  • VINS-RGBD has three sets of trajectories that failed to initialize successfully and failed to track.
  • in contrast, this embodiment successfully initialized and tracked on all 15 sets of the data set; that is, this embodiment has better robustness.
  • in most cases, the accuracy of this technical solution is also better than that of the other methods; that is, the present embodiment achieves more accurate results.
  • X represents initialization failure.
  • this exemplary embodiment has the following beneficial effects:
  • the depth information and the triangulation method are used to restore the map points in different situations, which improves the success rate of initialization, and further improves the accuracy and robustness of the final pose.
  • this exemplary embodiment also provides a visual inertial system initialization apparatus 1100, including a pose determination module 1110 and an initialization module 1120, wherein:
  • the pose determination module 1110 may be configured to perform frame-by-frame calculation on the images during the process of receiving images, until the first preset number of poses are obtained.
  • the initialization module 1120 may be configured to determine the motion speed, gravity vector, and bias corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias.
  • the above frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
  • the pose determination module 1110 may be configured to discard the current frame image and retain the previous frame image when the pose corresponding to the current frame image is not successfully determined, and, when a new current frame image is received, to calculate the pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
  • the pose determination module 1110 may be configured to count the number of current frame images whose poses are unsuccessfully determined when the pose corresponding to the current frame image is unsuccessfully determined, and, when that number equals the second preset number, to clear the determined poses and continue to receive new current frame images.
  • the pose determination module 1110 may be configured to reset the number of current frame images whose poses are not successfully determined when the pose corresponding to the current frame image is successfully determined.
  • the pose determination module 1110 may be configured to perform feature matching between the feature points of the current frame image and those of the previous frame image to obtain matching feature points; determine the map points matching the feature points according to the depth information of the previous frame image; and project the map points to the current frame image to determine the pose corresponding to the current frame image.
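The matching-and-projection step this module performs can be illustrated by the residual that PnP minimizes: feature points of the previous frame are lifted to map points using their depth, then projected into the current frame under a candidate pose (R, t). The intrinsics and test geometry below are assumed values for illustration.

```python
# Sketch of the residual that PnP minimizes: map points from the previous
# frame's depth, reprojected into the current frame under a candidate pose.
# Intrinsics below are assumed example values.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

def lift(u, v, depth):
    """Previous-frame pixel + depth -> map point in the previous camera frame."""
    return ((u - CX) / FX * depth, (v - CY) / FY * depth, depth)

def project(R, t, p):
    """Map point -> pixel in the current frame under pose (R, t)."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (FX * x / z + CX, FY * y / z + CY)

def residuals(R, t, matches):
    """matches: list of ((u_prev, v_prev, depth), (u_cur, v_cur)) pairs.
    Returns the per-match reprojection residuals that PnP would minimize."""
    out = []
    for (u, v, d), (uc, vc) in matches:
        uh, vh = project(R, t, lift(u, v, d))
        out.append((uh - uc, vh - vc))
    return out

# Tiny demo: observations generated under an assumed true pose (identity
# rotation, 0.1 m translation along x) give zero residuals; a wrong pose
# does not.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_true = [0.1, 0.0, 0.0]
matches = []
for u, v, d in [(300.0, 200.0, 2.0), (340.0, 260.0, 3.0)]:
    matches.append(((u, v, d), project(I3, t_true, lift(u, v, d))))
res_true = residuals(I3, t_true, matches)
res_bad = residuals(I3, [0.0, 0.0, 0.0], matches)
```

A PnP solver (with RANSAC for outlier rejection, as is common practice) would iterate on (R, t) to drive these residuals toward zero.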
  • the visual-inertial system initialization apparatus 1100 may further include a pose optimization module, configured to perform map point recovery on the target images corresponding to the first preset number of poses so as to obtain the recovered map points; a local bundle adjustment is constructed according to the map points to optimize the poses and obtain the optimized poses.
  • the pose optimization module may be configured to find a co-visibility matching relationship between the target images corresponding to the first preset number of poses to obtain at least one target image pair; for each target image pair, to reproject using the depth information of the target images in the pair and calculate the reprojection error; and, when the reprojection error is less than or equal to the preset threshold, to back-project the feature points using the depth information to recover the map points of the target images.
  • the pose optimization module may be configured to perform map point recovery on the target image through triangulation when the reprojection error is greater than a preset threshold.
  • the pose determination module 1110 may be configured to determine the pose corresponding to the first frame of image as a preset pose.
  • the initialization module 1120 may be configured to adjust the gravitational direction of the pose according to the gravitational vector.
  • aspects of the present disclosure may be implemented as a system, method, or program product. Therefore, various aspects of the present disclosure can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present disclosure, for example, any one or more of the steps in FIG. 3, FIG. 4, and FIG. 10.
  • the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user computing device, partly on the user device as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

Abstract

A visual inertial system initialization method, a visual inertial system initialization apparatus, a computer-readable storage medium, and an electronic device, which relate to the field of visual positioning technology. The method comprises: when receiving images, performing frame-by-frame calculation on the images until a first preset number of poses are obtained (S310); and determining a motion speed, a gravity vector, and a bias corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, the gravity vector, and the bias (S320). In the present method, a pose can be output before the visual inertial system has completed initialization; moreover, the time interval between received images is fully used, shortening the time it takes to obtain the first preset number of poses and thereby accelerating the initialization of the visual inertial system.

Description

Visual-inertial system initialization method and device, medium and electronic device
Cross reference
This disclosure claims priority to Chinese patent application No. 202110190368.9, filed on February 18, 2021 and titled "Visual Inertial System Initialization Method and Device, Medium and Electronic Equipment", the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of visual positioning, and in particular, to a visual inertial system initialization method, a visual inertial system initialization device, a computer-readable medium, and an electronic device.
Background
At present, indoor positioning technology is a rigid requirement for mobile devices such as mobile phones, AR glasses, and indoor robots. In an indoor environment, a mobile device cannot determine its own position through a global positioning technology such as GPS (Global Positioning System), and can only rely on the device's own sensors to achieve positioning. On a mobile phone or AR glasses, the most direct and easily obtained data are visual sensor (camera, etc.) data and inertial sensor (IMU, Inertial Measurement Unit) data, both of which can be combined with algorithms to achieve positioning. Before 2017, positioning technology using only vision sensors developed rapidly, but as the technology advanced, the inherent defects of vision sensors were also exposed, and cameras alone can no longer break through the bottleneck faced by current positioning technology. Likewise, the same bottleneck occurs with techniques that use only the IMU for localization.
Therefore, in recent years the industry has developed VIO (Visual-IMU Odometry, i.e., visual-inertial fusion odometry) technology, that is, technology that uses both visual sensors and an IMU for fused positioning. This technology is widely applied in indoor navigation, augmented reality, robotics, and even autonomous driving.
Summary
According to a first aspect of the present disclosure, there is provided a visual inertial system initialization method, comprising: in the process of receiving images, performing frame-by-frame calculation on the images until a first preset number of poses are obtained; and determining the motion speed, gravity vector, and bias corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
According to a second aspect of the present disclosure, there is provided a visual inertial system initialization apparatus, comprising: a pose determination module, configured to perform frame-by-frame calculation on images during the process of receiving images until a first preset number of poses are obtained; and an initialization module, configured to determine the motion speed, gravity vector, and bias corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
According to a third aspect of the present disclosure, there is provided a computer-readable medium on which a computer program is stored, and when the computer program is executed by a processor, the above-mentioned method is implemented.
According to a fourth aspect of the present disclosure, there is provided an electronic device, comprising:
a processor; and
a memory for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement the above-mentioned method.
Brief description of the drawings
FIG. 1 shows a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 3 schematically shows a flowchart of a visual inertial system initialization method in an exemplary embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of the frame-by-frame calculation in an exemplary embodiment of the present disclosure;
FIG. 5 schematically shows the principle of determining the pose corresponding to the current frame image from the current frame image and the previous frame image in an exemplary embodiment of the present disclosure;
FIG. 6 schematically shows a frame-by-frame calculation process in an exemplary embodiment of the present disclosure;
FIG. 7 schematically shows another frame-by-frame calculation process in an exemplary embodiment of the present disclosure;
FIG. 8 schematically shows still another frame-by-frame calculation process in an exemplary embodiment of the present disclosure;
FIG. 9 schematically shows a map point recovery process in an exemplary embodiment of the present disclosure;
FIG. 10 schematically shows a flowchart of another visual inertial system initialization method in an exemplary embodiment of the present disclosure;
FIG. 11 schematically shows the composition of a visual inertial system initialization apparatus in an exemplary embodiment of the present disclosure.
Detailed description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated descriptions of them will be omitted. Some of the block diagrams shown in the figures are functional entities that do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 shows a schematic diagram of the system architecture of an exemplary application environment to which the visual inertial system initialization method and apparatus of embodiments of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber optic cables. The terminal devices 101, 102, and 103 may be various terminal devices having a visual inertial system, including but not limited to desktop computers, portable computers, smart phones, and tablet computers. It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs. For example, the server 105 may be a server cluster composed of multiple servers.
The visual inertial system initialization method provided by the embodiments of the present disclosure is generally performed by the terminal devices 101, 102, and 103, and accordingly the visual inertial system initialization apparatus is generally provided in the terminal devices 101, 102, and 103. However, those skilled in the art will readily understand that the method may also be executed by the server 105, in which case the apparatus may also be provided in the server 105; this exemplary embodiment imposes no special limitation on this. For example, in an exemplary embodiment, a user may collect images through the visual sensors of the terminal devices 101, 102, 103 and send the images to the server 105, so that the server 105 performs pose calculation and sends the results back to the terminal devices 101, 102, 103; the terminal devices then determine the motion speed, gravity vector, and bias corresponding to the inertial measurement unit from the poses sent by the server 105, and thereby initialize the visual inertial system.
Exemplary embodiments of the present disclosure provide an electronic device for implementing the visual-inertial system initialization method, which may be a terminal device 101, 102, or 103, or the server 105, in FIG. 1. The electronic device includes at least a processor and a memory; the memory stores executable instructions of the processor, and the processor is configured to perform the visual-inertial system initialization method by executing those instructions.
The structure of the electronic device is illustrated below by taking the mobile terminal 200 in FIG. 2 as an example. Those skilled in the art will understand that, apart from the components intended specifically for mobile use, the configuration in FIG. 2 can also be applied to stationary devices. In other embodiments, the mobile terminal 200 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of the two. The interface connections between the components are shown only schematically and do not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also adopt interface connections different from those in FIG. 2, or a combination of multiple interface connection types.
As shown in FIG. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, a subscriber identification module (SIM) card interface 295, and the like. The sensor module 280 may include a depth sensor 2801, an inertial sensor 2802, a gyroscope sensor 2803, and so on.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated in one or more processors.
A memory is provided in the processor 210. The memory may store instructions for implementing six modular functions: detection, connection, information management, analysis, data transmission, and notification, with their execution controlled by the processor 210.
The mobile terminal 200 may implement a shooting function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like. The ISP processes the data fed back by the camera module 291; the camera module 291 captures still images or videos; the digital signal processor processes digital signals, including digital image signals as well as other digital signals; and the video codec compresses or decompresses digital video, where the mobile terminal 200 may support one or more video codecs. In some embodiments, the camera module 291 may serve as the visual sensor of the visual-inertial system and be used for image acquisition.
The depth sensor 2801 is used to acquire depth information of the scene. In some embodiments, the depth sensor may be disposed in the camera module 291 and acquire the depth information corresponding to an image at the same time the image is captured.
The inertial sensor 2802, also called an inertial measurement unit (IMU), may be used to detect and measure acceleration and rotational motion.
The gyroscope sensor 2803 may be used to determine the motion attitude of the mobile terminal 200. In addition, sensors with other functions may be disposed in the sensor module 280 according to actual needs, such as a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, and a bone conduction sensor.
In the field of visual positioning, researchers have made many attempts. For example, on the technical side, Apple launched the augmented reality development kit ARKit at the WWDC 2017 conference, with three main functions: pose estimation, environment understanding, and light estimation. The core is the pose estimation function, which uses VIO technology to provide positioning for mobile AR applications by fusing the device's camera images with information from its motion sensors. As another example, also in 2017, Google announced ARCore, an augmented reality SDK positioned against ARKit, which likewise includes the three main functions of pose estimation, environment understanding, and light estimation. For pose estimation, ARCore uses a monocular camera and an IMU sensor: it detects visually distinct feature points in the captured camera images and uses these points to compute changes in position. This visual information is then combined with the inertial measurements of the device IMU to estimate the camera's pose relative to the surrounding world over time. ARCore provides good pose and environment information for AR, indoor navigation, and other applications on Android phones.
Before positioning can be performed with a VIO system, a reference world coordinate system and an initial set of map points must be determined, so that subsequent positioning can be performed in that coordinate system. At the same time, visual positioning alone cannot determine the true scale of the pose. Once the IMU is fused in, the characteristics of the IMU sensor itself require determining the IMU's initial velocity, the IMU's deviation rate, the gravity vector, and the scale information, so that the true scale of the visually determined poses can be confirmed or adjusted. This process, the initialization process, is a necessary step for running any SLAM or VIO system.
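The quantities this initialization process must recover can be summarized in a small container. The sketch below is purely illustrative; the class and field names are assumptions made for exposition, not part of the disclosed method:

```python
from dataclasses import dataclass

@dataclass
class InitState:
    # Quantities VIO initialization must recover (illustrative names only).
    velocity: list      # initial IMU velocity in the world frame, m/s
    gravity: list       # gravity vector in the world frame, m/s^2
    gyro_bias: list     # gyroscope deviation rate (bias), rad/s
    accel_bias: list    # accelerometer deviation rate (bias), m/s^2
    scale: float = 1.0  # metric scale; already known (1.0) when depth is available

state = InitState(velocity=[0.0, 0.0, 0.0], gravity=[0.0, 0.0, -9.81],
                  gyro_bias=[0.0, 0.0, 0.0], accel_bias=[0.0, 0.0, 0.0])
print(state.scale)  # 1.0
```

With a depth camera, `scale` stays at its known default, which is exactly the simplification the method below exploits.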
In the related art, there are generally the following two initialization methods:
The first is applied in the VINS system. The specific process is as follows. Accumulate 10 frames of images and select from them two frames L and R with sufficient parallax, then use the epipolar geometry constraint to solve for the relative pose between these two frames. Using this pose, triangulate to recover some of the map points co-visible between the two frames. Project these map points onto frames other than frames L and R, compute each such frame's pose by minimizing the reprojection error, then triangulate between that frame and frames L and R to recover more map points. Repeating this process yields the poses of the 10 frames and the map points corresponding to those 10 frames. Finally, rotation constraints and translation constraints are used to align the previously determined poses of the 10 frames with the IMU: these poses are taken as accurate and used to constrain the unknown IMU variables, and all unknowns are finally determined by solving a non-homogeneous system of equations via SVD decomposition.
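As a concrete illustration of the triangulation step in this related-art pipeline, the following sketch recovers one 3D map point from its projections in two views by the standard linear (DLT) method. The camera matrices and the point are synthetic example values, not data from the patent:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)     # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]             # homogeneous -> Euclidean

# Two views: identity pose and a 1 m baseline along x (arbitrary example).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true, atol=1e-6))  # True
```

Note that this solve, repeated for every map point, is exactly the per-point cost that drawback 4 below attributes to the monocular pipeline.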
The second is applied in the RGBD-VINS system. The specific process is as follows. After the system has accumulated 10 frames of images, two frames L and R with sufficient parallax are selected from them. Because an RGBD-VINS system includes a depth camera, the depth of every frame is known, so it is no longer necessary to solve the relative pose of two frames from 2D-to-2D epipolar geometry constraints; the pose can be solved directly by minimizing the 3D-to-2D reprojection error, and the resulting pose is directly scale-determinate. In recovering map points, triangulation is likewise no longer used; the depth camera information is used directly, and when a co-visibility relationship exists between two frames, the corresponding depth values recover the three-dimensional coordinates of the co-visible points, i.e., the map points corresponding to each frame. Finally, when the IMU is initialized, the depth information has already fixed the scale of the poses, so the IMU is no longer needed to estimate scale; the scale is set as a known quantity rather than an unknown, and only the IMU's initial velocity, gravity vector, and deviation rate are solved for.
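The depth-based map-point recovery described here amounts to back-projecting a pixel through the pinhole camera model using its measured depth. A minimal sketch with made-up intrinsics:

```python
import numpy as np

def backproject(u, v, depth, K):
    """Recover the 3D camera-frame point for pixel (u, v) with metric depth,
    by inverting the pinhole projection x = K X / Z."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P = backproject(420.0, 300.0, 2.0, K)  # pixel 100 px right, 60 px below center
print(P)  # [0.4  0.24 2.  ]
```

Because `depth` is metric, the recovered point, and hence any pose solved from such points, carries true scale with no triangulation required.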
However, both of the above initialization methods have technical limitations.
The first method has mainly the following drawbacks. 1. Initializing the system with a monocular camera requires that the images used for initialization have a certain parallax between them and that there are enough matching points between the two frames; only then can initialization succeed. Detecting parallax and finding more matching points consumes more time, and if the requirements are not met the initialization may fail and restart, resulting in a low initialization success rate. 2. The initialization process only begins after 10 consecutive frames have been accumulated, so the system cannot output a pose as soon as it starts running. 3. Visual initialization with a monocular camera leaves the pose scale undetermined; even though fusing in the IMU information later can recover the true scale, the large IMU noise on terminal devices makes it difficult to compute the scale accurately and also increases the computational burden. 4. In a monocular system, all map points needed during initialization are recovered by the triangulation algorithm; when there are many map points, this consumes a great deal of computation time, making the initialization process slow.
The second method has the following drawbacks. 1. Ten frames must still be accumulated before initialization can begin; until the 10 frames are full, the system does no work, wasting a great deal of time. Parallax must also be detected across the 10 frames, so in low-parallax scenes initialization can still fail. 2. The system cannot output a pose as soon as it starts running. 3. Map-point recovery relies entirely on the depth camera information. When the depth camera is noisy, or many map points lie outside the depth camera's measurement range, the number of map points usable for pose computation may drop sharply, leading to large errors in the computed results or even failure to compute them at all.
To address one or more of the above problems, this example embodiment provides a visual-inertial system initialization method. The method may be applied to the server 105 described above, or to one or more of the terminal devices 101, 102, and 103; this exemplary embodiment imposes no particular limitation in this regard. Referring to FIG. 3, the visual-inertial system initialization method may include the following steps S310 and S320:
In step S310, while images are being received, frame-by-frame computation is performed on the images until a first preset number of poses is obtained.
Referring to FIG. 4, the above frame-by-frame computation may include the following steps S410 to S430:
In step S410, when a current frame image is received, the feature points of the current frame image and the depth information corresponding to those feature points are extracted.
In an exemplary embodiment, since the incoming image frame rate of the VIO system is around 10 Hz, the interval between frames is about 100 ms. In the related art, therefore, accumulating 10 frames takes about 1000 ms, and initialization can only begin after that. To avoid wasting time on image accumulation, each frame can be processed as soon as it is received. Specifically, when a current frame image is received, the feature points in the current frame are extracted first, and at the same time the depth information corresponding to those feature points is extracted from the depth information collected by the depth sensor.
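This per-frame step can be sketched as sampling the depth map at the detected feature locations and discarding features with no valid depth reading. The detector itself is abstracted away here and the data are synthetic:

```python
def extract_features_with_depth(keypoints, depth_map):
    """Pair each detected feature (u, v) with the depth measured at that pixel,
    dropping features where the depth sensor returned no valid reading (0)."""
    out = []
    for (u, v) in keypoints:
        d = depth_map[v][u]
        if d > 0:                       # 0 marks an invalid depth reading
            out.append(((u, v), d))
    return out

# Toy 4x4 depth map (meters) with one invalid pixel at (1, 1).
depth_map = [[2.0, 2.0, 2.1, 2.1],
             [2.0, 0.0, 2.1, 2.2],
             [1.9, 2.0, 2.0, 2.1],
             [1.9, 1.9, 2.0, 2.0]]
feats = extract_features_with_depth([(1, 1), (2, 3)], depth_map)
print(feats)  # [((2, 3), 2.0)] -- the (1, 1) feature is dropped
```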
In step S420, the pose corresponding to the current frame image is determined based on the feature points and depth information of the current frame image and the feature points and depth information of the previous frame image.
In an exemplary embodiment, after the feature points of the current frame image and the depth information of each feature point have been extracted, the pose corresponding to the current frame image can be determined based on the feature points and depth information of the current frame image together with the feature points of the previous frame image, i.e., the frame received immediately before the current frame image, and the depth information corresponding to each of its feature points.
In an exemplary embodiment, when determining the pose corresponding to the current frame image from the feature points and depth information of the current frame image and those of the previous frame image, the feature points of the current frame image may first be matched against the feature points of the previous frame image to obtain matched feature points. At this point, since the pose of the previous frame image has already been determined, the depth information corresponding to the previous frame's feature points can be used to determine the map points of the matched feature points in the three-dimensional coordinate system; these map points are then projected into the current frame image to determine the pose corresponding to the current frame image.
For example, referring to FIG. 5, optical flow matching may be performed between the feature points of the current frame image and those of the previous frame image. The map points corresponding to the matched feature points are then determined from the depth information of the previous frame's feature points and reprojected onto the current frame image. Finally, the pose corresponding to the current frame image is computed by PnP (Perspective-n-Point), i.e., by minimizing the reprojection error.
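The quantity that PnP minimizes can be written out directly: each map point is projected into the current frame under a candidate pose and the pixel residual is measured. The sketch below only evaluates this error on synthetic data; it is not a full PnP solver:

```python
import numpy as np

def reprojection_error(R, t, K, points_3d, observations):
    """Mean pixel error of map points projected into the current frame
    under candidate pose (R, t): x ~ K (R X + t)."""
    total = 0.0
    for X, obs in zip(points_3d, observations):
        p = K @ (R @ X + t)
        uv = p[:2] / p[2]
        total += np.linalg.norm(uv - obs)
    return total / len(points_3d)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R_true, t_true = np.eye(3), np.array([0.1, 0.0, 0.0])  # small sideways motion
points = [np.array([0.2, -0.1, 3.0]), np.array([-0.3, 0.2, 4.0])]
obs = [(lambda p: p[:2] / p[2])(K @ (R_true @ X + t_true)) for X in points]
# The error vanishes at the true pose and is positive at a wrong one,
# which is what the PnP optimization exploits.
print(reprojection_error(R_true, t_true, K, points, obs))           # 0.0
print(reprojection_error(R_true, np.zeros(3), K, points, obs) > 1)  # True
```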
The above three-dimensional coordinate system refers to the world coordinate system, which can generally be determined from the camera coordinate system corresponding to the first frame image input to the visual-inertial system; all poses determined from the second frame onward are relative poses with respect to this world coordinate system. Specifically, when the received current frame image is the first frame image, the camera coordinate system corresponding to the first frame can be obtained directly and set as the world coordinate system.
It should be noted that when the current frame image is received, its depth information is available from the depth camera, so the epipolar constraint is not needed to solve the pose between the two frames. Even if the parallax between the two images is very small, map points can be projected using the depth information and the relative pose solved by minimizing the reprojection error. It is therefore unnecessary to check in advance whether the parallax between the current frame image and the previous frame image is sufficient.
In step S430, the current frame image is taken as the previous frame image, and a new current frame image continues to be received.
In an exemplary embodiment, since a sufficient number of poses is required to initialize the IMU, before the motion velocity, gravity vector, and deviation rate corresponding to the inertial measurement unit are determined, the pose determination must be repeated for multiple current frame images until the number of determined poses equals the first preset number required for IMU initialization. Specifically, during this repetition, after a current frame image is received and its corresponding pose determined, that current frame image is taken as the previous frame image, so that the pose of the next received current frame image can be determined against the most recently received frame.
Referring to FIG. 6, when the received current frame image is the 2nd frame, its pose can be determined from the feature points and corresponding depth information of the 2nd frame together with those of the previous frame image (the 1st frame). The 2nd frame is then used as the previous frame image, so that when the received current frame image is the 3rd frame, the previous frame image has already changed from the 1st frame to the 2nd frame, and the pose of the 3rd frame can be determined directly from the feature points and depth information of the 3rd frame and of the previous frame image (the 2nd frame). This process is repeated, determining the pose of each successively input current frame in a frame-by-frame manner, until, after the pose of the i-th frame is determined, a total of the first preset number of poses has been obtained, at which point the process stops.
In addition, when the received current frame image is the first frame image, there is no previous frame image to serve as a reference, so the pose corresponding to the first frame image can be set directly to a preset pose, which may be the identity matrix. It should be noted that after the pose of the first frame image is set to the preset pose, the preset pose is also counted toward the number of poses. That is, assuming the first preset number is 10, besides the preset pose, 9 more poses need to be determined through the repeated determination described above; the preset pose plus the 9 determined poses together equal the first preset number of 10, at which point the repeated pose determination can stop and other subsequent processing can proceed.
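Steps S410 to S430 together form a simple accumulation loop. The sketch below mocks the per-frame solver to show only the control flow; the function names and the stand-in pose values are assumptions:

```python
IDENTITY = "I"  # stands in for the identity-matrix preset pose of frame 1

def accumulate_poses(frames, first_preset_number, solve_pose):
    """Frame-by-frame pose accumulation (S410-S430), ignoring solver failures."""
    poses, prev = [], None
    for frame in frames:
        if prev is None:
            poses.append(IDENTITY)       # first frame: preset pose, also counted
        else:
            poses.append(solve_pose(prev, frame))
        prev = frame                     # S430: current frame becomes previous
        if len(poses) == first_preset_number:
            break                        # enough poses for IMU initialization
    return poses

# Mock solver: the "pose" is just the frame pair, enough to show the flow.
poses = accumulate_poses(range(1, 15), 10, lambda prev, cur: (prev, cur))
print(len(poses), poses[0], poses[1])  # 10 I (1, 2)
```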
In an exemplary embodiment, the received current frame image may contain noise or other disturbances, under which the pose computation may fail; for example, noise may cause the PnP minimization of the reprojection error to fail. If, while the number of determined poses is less than the first preset number, a current frame image appears for which the pose could not be successfully determined, that current frame image can be discarded without performing the step S430 operation of taking it as the previous frame image. That is, the existing previous frame image is retained, a new current frame image is received, and the pose corresponding to the new current frame image is computed from the new current frame image and the retained previous frame image.
For example, referring to FIG. 7, when the current frame image is the 6th frame a, the previous frame image is the 5th frame. If determining the pose of the 6th frame a from the 6th frame a and the 5th frame fails, the noise contained in the 6th frame may be relatively large; the next received current frame image is therefore taken as a new 6th frame b and the computation continues, determining the pose corresponding to the new 6th frame b. The new 6th frame b then serves as the previous frame image, and the frame-by-frame computation continues.
Further, in an exemplary embodiment, noise interference may also cause multiple consecutive received frames to fail pose determination. In this case, when the pose corresponding to a current frame image cannot be successfully determined, the number of current frame images that failed pose determination can be counted. When the number of such frames equals a second preset number, all poses determined so far can be cleared, i.e., the pose count is reset to 0. A new current frame image is then received, the frame-by-frame computation described above is restarted, and poses are accumulated anew until their number equals the first preset number.
It should be noted that, in an exemplary embodiment, there are two ways the count can equal the second preset number: either the cumulative number of current frame images that failed pose determination equals the second preset number, or a consecutive run of the second preset number of current frame images fails to determine a corresponding pose. In the second approach, to satisfy the condition that the failures be consecutive, counting begins after the first current frame image for which pose determination fails; if, before the count reaches the second preset number, any current frame image succeeds in determining its corresponding pose, the previously accumulated count of failed current frame images can be reset.
For example, referring to FIG. 8, suppose the second preset number is 3 and the previous frame image is the 5th frame. If the current frame image (the first 6th frame d) fails pose determination, the count of failed current frame images is n = 1; if a newly received current frame image (the second 6th frame e) again fails, then n = 1 + 1 = 2; if yet another new current frame image (the third 6th frame f) still fails, then n = 2 + 1 = 3, and n equals the second preset number, so all poses determined from the previous 5 frames must be deleted. Afterward, the first current frame image received is used as a new first frame image, the frame-by-frame computation described above restarts, and determined poses are accumulated anew until, after the pose of the i-th frame image is determined, the accumulated number of poses equals the first preset number.
Furthermore, in the above example, if before n reaches 3 any current frame image succeeds in computing its corresponding pose, n can be reset to 0. Specifically, for example, when n = 2 and a new current frame image (a third 6th frame) is received and successfully determines its corresponding pose, n = 2 is reset to n = 0. That third 6th frame can then serve as the previous frame image: a new current frame image is obtained, and its pose is determined from the new current frame image and the previous frame image (the third 6th frame).
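The failure-handling policy of FIGS. 7 and 8 can be condensed into a small state machine: a failed frame is dropped while the previous frame is kept, a success resets the consecutive-failure counter, and reaching the second preset number of consecutive failures clears all accumulated poses. A sketch under assumed names:

```python
class PoseAccumulator:
    """Consecutive-failure handling for frame-by-frame pose computation."""
    def __init__(self, max_consecutive_failures):
        self.max_fail = max_consecutive_failures  # the second preset number
        self.poses = []
        self.failures = 0

    def on_frame(self, pose_or_none):
        if pose_or_none is None:          # solver failed: drop this frame,
            self.failures += 1            # keep the retained previous frame
            if self.failures == self.max_fail:
                self.poses.clear()        # clear all accumulated poses, restart
                self.failures = 0
        else:                             # success resets the failure counter
            self.failures = 0
            self.poses.append(pose_or_none)

acc = PoseAccumulator(max_consecutive_failures=3)
for result in ["p1", "p2", None, None, "p3", None, None, None]:
    acc.on_frame(result)
print(acc.poses)  # [] -- three consecutive failures cleared p1..p3
```

The mid-run success ("p3") resets the counter, so only the final run of three failures triggers the clear, matching the FIG. 8 example.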
通过设置合理的位姿计算失败处理机制，能够对位姿计算失败的情况进行预防和处理，提高了初始化的成功率，使得初始化更加鲁棒。By providing a reasonable mechanism for handling pose-calculation failures, such failure cases can be prevented and handled, which improves the success rate of initialization and makes the initialization more robust.
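The counting-and-reset behavior described above can be sketched as a small state machine. The following is a minimal illustration only; the class and all names are hypothetical, and the patent specifies only the counting rules (increment on failure, clear everything at the second preset number, reset the counter on any success):

```python
MAX_FAILURES = 3  # the "second preset number" in the example above


class PoseTracker:
    def __init__(self):
        self.poses = []    # accumulated per-frame poses
        self.failures = 0  # consecutive frames whose pose estimation failed

    def on_frame(self, pose_or_none):
        """Feed the pose result of one received frame (None = failure)."""
        if pose_or_none is None:
            self.failures += 1
            if self.failures >= MAX_FAILURES:
                self.poses.clear()   # delete all previously determined poses
                self.failures = 0    # restart accumulation from scratch
        else:
            self.failures = 0        # any success resets the counter
            self.poses.append(pose_or_none)
        return len(self.poses)
```

For instance, two failures followed by one success leave the accumulated poses intact, while three consecutive failures clear them, matching the n=3 example above.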
在步骤S320中，根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率，以根据运动速度、重力向量和偏离率对视觉惯性系统进行初始化。In step S320, the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit are determined according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector and deviation rate.
在一示例性实施例中，在根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率时，可以将第一预设数量的位姿与惯性测量单元进行对齐，利用旋转约束和平移约束计算惯性测量单元的运动速度、重力向量和偏离率。此外，在IMU初始化时，由于深度信息已经确定了位姿的尺度，因此不需要确定位姿的尺度。In an exemplary embodiment, when determining the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, the first preset number of poses may be aligned with the inertial measurement unit, and rotation constraints and translation constraints are used to calculate the motion speed, gravity vector and deviation rate of the inertial measurement unit. In addition, during IMU initialization, the scale of the poses does not need to be determined, because the depth information has already fixed the scale of the poses.
需要说明的是，在将第一预设数量的位姿与惯性测量单元进行对齐之前，可以对惯性测量单元进行更加精准的内外参标定，以获取准确的惯性测量单元的加速度偏离率、角速度偏离率以及惯性测量单元与相机之间的外参变换，进而使得视觉与惯性测量单元联合初始化过程更快收敛，精度也会有进一步的提升。It should be noted that, before the first preset number of poses are aligned with the inertial measurement unit, a more accurate intrinsic and extrinsic calibration of the inertial measurement unit may be performed to obtain an accurate acceleration deviation rate and angular velocity deviation rate of the inertial measurement unit, as well as the extrinsic transformation between the inertial measurement unit and the camera, so that the joint visual-inertial initialization process converges faster and the accuracy is further improved.
在一示例性实施例中，为了能够得到更准确的运动速度、重力向量和偏离率进行初始化，需要通过更加准确的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率。具体的，可以在根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率之前，先对第一预设数量的位姿对应的目标图像进行地图点恢复，获取各个目标图像对应的地图点，然后根据恢复的地图点构建局部集束调整，以对第一预设数量的位姿进行优化得到优化后的位姿，最后根据优化后的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率。In an exemplary embodiment, in order to obtain a more accurate motion speed, gravity vector and deviation rate for initialization, the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit need to be determined from more accurate poses. Specifically, before the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit are determined according to the first preset number of poses, map-point recovery may first be performed on the target images corresponding to the first preset number of poses to obtain the map points corresponding to each target image; then a local bundle adjustment is constructed from the recovered map points to optimize the first preset number of poses and obtain optimized poses; finally, the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit are determined according to the optimized poses.
其中，目标图像可以包括成功确定第一预设数量位姿的当前帧图像。例如，假设为了确定第一预设数量的位姿共接收12个当前帧图像，其中有2个未成功确定位姿的当前帧图像，则其余的10个成功确定对应位姿的当前帧图像即为目标图像。The target images may include the current frame images for which the first preset number of poses were successfully determined. For example, assume that a total of 12 current frame images are received in order to determine the first preset number of poses, and 2 of them fail to determine their poses; then the remaining 10 current frame images whose corresponding poses were successfully determined are the target images.
在一示例性实施例中，由于一个当前帧图像确定一个位姿，因此目标图像的数量也为第一预设数量。此时，参照图9所示，可以先在第一预设数量i个目标图像之间查找存在共视匹配关系的目标图像，以获取至少一对目标图像对。然后针对每对目标图像对，利用目标图像对中的目标图像的深度信息进行重投影，并计算重投影误差。在重投影误差较小，即小于等于预设阈值时，可以认为深度信息的误差较小，因此可以利用深度信息进行地图点恢复。此时，可以利用深度信息对特征点进行反投影，以对目标图像对进行地图点恢复；相反的，在重投影误差较大，即大于预设阈值时，可以认为深度信息的误差较大，因此不宜通过深度信息进行地图点恢复，此时可以采用三角化法对目标图像对进行地图点恢复。然后将恢复的所有地图点作为地图点集合，用于构建局部集束调整，以对位姿进行优化。In an exemplary embodiment, since each current frame image determines one pose, the number of target images is also the first preset number. At this time, referring to FIG. 9, target images having a co-visibility matching relationship may first be searched among the first preset number i of target images to obtain at least one target image pair. Then, for each target image pair, reprojection is performed using the depth information of the target images in the pair, and the reprojection error is calculated. When the reprojection error is small, i.e., less than or equal to a preset threshold, the error of the depth information can be considered small, so the depth information can be used for map-point recovery; in this case, the feature points are back-projected using the depth information to recover map points for the target image pair. Conversely, when the reprojection error is large, i.e., greater than the preset threshold, the error of the depth information can be considered large, so it is not suitable to recover map points from the depth information; in this case, a triangulation method can be used to recover map points for the target image pair. All recovered map points are then used as a map-point set to construct a local bundle adjustment for optimizing the poses.
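The selection rule above can be sketched concretely for a single matched feature. This is an illustrative sketch only, assuming a pinhole camera with intrinsic matrix K and a known relative pose (R, t) from the first to the second image of the pair; all function names are hypothetical:

```python
import numpy as np


def backproject(u, v, depth, K):
    """Lift pixel (u, v) with measured depth to a 3D point in camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])


def reproj_error(point_cam, u, v, K):
    """Reprojection error of a camera-frame 3D point, on the normalized plane."""
    x, y, z = point_cam
    proj = np.array([x / z, y / z])
    obs = np.array([(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1]])
    return float(np.linalg.norm(proj - obs))


def triangulate(uv1, uv2, R, t, K):
    """Linear (DLT) triangulation of one match; result in frame-1 coordinates."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t.reshape(3, 1)])
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]


def recover_map_point(uv1, d1, uv2, R, t, K, thresh=1.0 / 460.0):
    """Use the measured depth when its cross-frame reprojection error is small,
    fall back to triangulation otherwise. R, t map frame-1 points into frame 2."""
    p1 = backproject(uv1[0], uv1[1], d1, K)   # candidate from the depth map
    err = reproj_error(R @ p1 + t, uv2[0], uv2[1], K)
    if err <= thresh:
        return p1                             # depth is trustworthy here
    return triangulate(uv1, uv2, R, t, K)     # depth unreliable: triangulate
```

An accurate depth reading passes the gate and is kept directly, while a corrupted depth reading exceeds the threshold and the match is triangulated from the two views instead.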
根据不同的条件选择使用深度信息或三角化法进行地图点恢复，在保证地图点精度的同时增加了地图点的数量；同时，即使VIO系统接收的深度信息质量较差，也能做到精准初始化。Depending on the conditions, either the depth information or the triangulation method is selected for map-point recovery, which increases the number of map points while ensuring their accuracy; meanwhile, even if the depth information received by the VIO system is of poor quality, accurate initialization can still be achieved.
在一示例性实施例中，在进行地图点恢复时，除了上述通过重投影误差确定是否使用深度信息进行地图点恢复之外，还可以利用概率模型对深度信息的不确定性进行建模，然后通过深度信息的不确定性的概率大小确定是否采用深度信息进行地图点恢复。通过概率模型进行不确定性判断，可以减小深度信息带来的噪声，实现更加精准的初始化过程。In an exemplary embodiment, when performing map-point recovery, in addition to deciding through the reprojection error whether to use the depth information as described above, a probability model may also be used to model the uncertainty of the depth information, and whether to use the depth information for map-point recovery is then decided according to the probability of that uncertainty. Judging the uncertainty through a probability model can reduce the noise introduced by the depth information and achieve a more accurate initialization process.
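One minimal way to realize such a probabilistic gate is to model each depth measurement as a Gaussian whose standard deviation grows with range and accept it only below a bound. The quadratic noise model and all constants below are assumptions chosen for illustration, not values taken from the disclosure:

```python
def depth_is_reliable(depth_m, sigma_base=0.01, sigma_slope=0.0025,
                      max_sigma=0.05):
    """Gaussian uncertainty gate for one depth measurement (in meters).

    Models the depth noise standard deviation as growing quadratically with
    range (typical of structured-light depth sensors) and accepts the reading
    only while that modeled sigma stays below a bound.
    """
    sigma = sigma_base + sigma_slope * depth_m ** 2
    return sigma <= max_sigma
```

Under these example constants, near-range readings pass the gate, while far-range readings are rejected and map-point recovery falls back to triangulation.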
其中,预设阈值可以根据实际应用场景、应用环境等进行设定。例如,在深度信息更加可靠时,可以将该预设阈值确定为较大的值;反之在深度信息可靠性较小时,可以选择较小的值作为预设阈值。The preset threshold may be set according to actual application scenarios, application environments, and the like. For example, when the depth information is more reliable, the preset threshold may be determined as a larger value; on the contrary, when the depth information is less reliable, a smaller value may be selected as the preset threshold.
此外,在初始化结束后,还可以根据确定的重力向量对之前确定的所有位姿进行重力方向调整,以输出调整后的位姿。In addition, after the initialization is completed, the gravity direction adjustment can also be performed on all the previously determined poses according to the determined gravity vector, so as to output the adjusted poses.
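The gravity-direction adjustment above can be sketched as building the rotation that maps the estimated gravity vector onto the world -z axis and applying it to every pose position. This assumes the -z gravity convention (an assumption, since the disclosure does not fix one) and uses the standard rotation-between-two-unit-vectors formula; the exactly antiparallel case is omitted for brevity:

```python
import numpy as np


def gravity_alignment(gravity):
    """Rotation that maps the estimated gravity direction onto world -z."""
    g = gravity / np.linalg.norm(gravity)
    target = np.array([0.0, 0.0, -1.0])
    v = np.cross(g, target)
    c = float(np.dot(g, target))
    if np.isclose(c, 1.0):                 # already aligned with -z
        return np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues-style closed form for the rotation between two unit vectors
    return np.eye(3) + vx + vx @ vx / (1.0 + c)


def adjust_positions(positions, gravity):
    """Apply the same world rotation to every pose position (and map point)."""
    R = gravity_alignment(gravity)
    return [R @ np.asarray(p) for p in positions]
```

After this adjustment, the estimated gravity points straight down in the output frame, so the output poses are expressed in a gravity-aligned world frame.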
以下参照图10所示,以第一预设数量为10,第二预设数量为3,预设阈值为1/460为例,对本公开实施例的技术方案进行详细阐述。Referring to FIG. 10 , the technical solutions of the embodiments of the present disclosure will be described in detail by taking the first preset number as 10, the second preset number as 3, and the preset threshold as 1/460 as an example.
步骤S1001,接收当前帧图像;Step S1001, receiving the current frame image;
步骤S1003,对当前帧图像进行特征提取,提取各个特征点对应的深度信息;Step S1003, performing feature extraction on the current frame image, and extracting depth information corresponding to each feature point;
步骤S1005,判断当前帧图像是否为输入系统的第一帧图像;Step S1005, determine whether the current frame image is the first frame image of the input system;
步骤S1007，在当前帧图像不是第一帧图像时，通过将当前帧图像与上一帧图像进行光流匹配和PnP最小化重投影误差的方法，确定当前帧图像对应的位姿；Step S1007, when the current frame image is not the first frame image, determine the pose corresponding to the current frame image by performing optical flow matching between the current frame image and the previous frame image and minimizing the reprojection error via PnP;
步骤S1009,判断当前帧图像是否成功确定对应的位姿;Step S1009, judging whether the current frame image successfully determines the corresponding pose;
步骤S1011,在当前帧图像未成功确定对应的位姿时,判断是否连续3个接收到的当前帧图像均未成功确定对应的位姿;Step S1011, when the current frame image fails to determine the corresponding pose, determine whether the corresponding pose has not been successfully determined for three consecutive received current frame images;
步骤S1013,在连续3个接收到的当前帧图像均未成功确定对应的位姿时,清空之前确定的所有位姿;Step S1013, when the corresponding poses are not successfully determined for three consecutive received current frame images, clear all the poses determined before;
步骤S1015,在仅有1个或连续2个接收到的当前帧图像未成功确定对应的位姿时,丢弃当前帧图像;Step S1015, when only one or two consecutively received current frame images fail to successfully determine the corresponding pose, discard the current frame image;
步骤S1017,在当前帧图像成功确定对应的位姿时,将当前帧图像作为上一帧图像;Step S1017, when the current frame image successfully determines the corresponding pose, the current frame image is used as the previous frame image;
步骤S1019,判断累积确定位姿的数量m是否等于第一预设数量10;Step S1019, judging whether the number m of the accumulated determined poses is equal to the first preset number 10;
步骤S1021,在m等于10时,在10帧成功确定位姿的目标图像中查找共视关系,并确定目标图像对;Step S1021, when m is equal to 10, search for a common view relationship in 10 target images whose poses are successfully determined, and determine the target image pair;
步骤S1023,利用目标图像对中目标图像的深度信息进行重投影,并计算重投影误差;Step S1023, reproject the depth information of the target image in the target image pair, and calculate the reprojection error;
步骤S1025,在计算的重投影误差小于等于1/460时,利用深度信息对特征点进行反投影进行地图点恢复;Step S1025, when the calculated re-projection error is less than or equal to 1/460, use the depth information to back-project the feature points to restore map points;
步骤S1027，在计算的重投影误差大于1/460时，利用三角化法进行地图点恢复；Step S1027, when the calculated reprojection error is greater than 1/460, use the triangulation method to restore map points;
步骤S1029,通过恢复的地图点对10个目标图像对应的位姿进行局部BA调整;Step S1029, performing local BA adjustment on the poses corresponding to the 10 target images through the restored map points;
步骤S1031,通过优化后的10个位姿进行IMU初始化,计算IMU对应的初始速度、重力向量和偏离率;Step S1031, initialize the IMU through the optimized 10 poses, and calculate the initial speed, gravity vector and deviation rate corresponding to the IMU;
步骤S1033,根据重力向量对所有的位姿和地图点进行重力方向调整;Step S1033, adjusting the gravity direction of all poses and map points according to the gravity vector;
步骤S1035,输出位姿。Step S1035, output the pose.
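The flow of steps S1001 to S1035 above can be summarized as the following control-flow sketch. The per-step operations (feature extraction with depth, optical-flow/PnP pose estimation, map-point recovery, local BA, IMU initialization and gravity alignment) are injected as callables, because the flow chart specifies only their ordering; every name here is illustrative, not an API from the disclosure:

```python
IDENTITY_POSE = "identity"  # placeholder for the preset first-frame pose


def run_initialization(frames, estimate_pose, finalize,
                       first_preset=10, second_preset=3):
    """Drive the S1001-S1035 loop until enough poses have accumulated."""
    poses, prev, failures = [], None, 0
    for frame in frames:                       # S1001/S1003: receive + features
        if prev is None:                       # S1005: first input frame
            pose = IDENTITY_POSE
        else:                                  # S1007: flow matching + PnP
            pose = estimate_pose(frame, prev)
        if pose is None:                       # S1009: pose failed
            failures += 1
            if failures >= second_preset:      # S1011/S1013: clear everything
                poses, prev, failures = [], None, 0
            continue                           # S1015: drop this frame only
        failures = 0
        prev = frame                           # S1017: becomes previous frame
        poses.append(pose)
        if len(poses) == first_preset:         # S1019: enough poses
            # S1021-S1033: map points, local BA, IMU init, gravity alignment
            return finalize(poses)
    return None                                # stream ended before enough
```

Note how isolated failures only skip a frame while the previous frame is retained, and three consecutive failures restart accumulation with the next frame treated as a new first frame, exactly as in steps S1011 to S1015.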
需要说明的是，由于不同应用场景的需求不同，在步骤S1035中，可以在IMU初始化之前，直接将根据当前帧图像和上一帧图像确定的当前帧图像对应的位姿输出；此外，还可以在IMU初始化之后，对位姿进行重力调整后再输出。It should be noted that, since different application scenarios have different requirements, in step S1035 the pose corresponding to the current frame image, determined from the current frame image and the previous frame image, may be output directly before IMU initialization; alternatively, after IMU initialization, the pose may be output after the gravity-direction adjustment is applied.
根据上述实施例中的方法,利用上海科技大学(STU)发布的公开数据集A\B\C(后简称为STU数据集)进行实验验证,可以得到如表1所示的结果。基于表1可以得出以下结论:According to the method in the above embodiment, the public data set A\B\C (hereinafter referred to as the STU data set) released by ShanghaiTech University (STU) is used for experimental verification, and the results shown in Table 1 can be obtained. Based on Table 1, the following conclusions can be drawn:
(1)在初始化过程的耗时上，本实施例相比于VINS-MONO、VINS-RGBD方案均有大幅度的提升。相比于VINS-MONO，平均速度提升6到7倍；相比于VINS-RGBD，平均速度提升3到4倍。(1) In terms of the time consumed by the initialization process, this embodiment is substantially faster than both the VINS-MONO and VINS-RGBD schemes: on average 6 to 7 times faster than VINS-MONO, and 3 to 4 times faster than VINS-RGBD.
(2)在首次输出位姿的时间上，本实施例输出相机位姿的时间要远远提前于VINS-MONO和VINS-RGBD，至少提前30帧输出位姿，即通常情况下，本实施例可以在系统一开始运行就输出位姿。(2) In terms of the time of the first pose output, this embodiment outputs the camera pose far earlier than VINS-MONO and VINS-RGBD, at least 30 frames earlier; that is, in the usual case, this embodiment can output the pose as soon as the system starts running.
(3)在整体轨迹精度上，由于测试使用的STU数据集拥有质量较好的深度图数据，因此本实施例最后的整体轨迹估计精度与VINS-RGBD方案基本持平，在多数情况下优于VINS-MONO；而在接收到的图像的深度信息质量较差时，本实施例采用了三角化法和深度信息结合的方式来恢复地图点，在保证准确的同时可以恢复出更多的三维地图点，相比于VINS-RGBD方案，会带来精度上的明显提升。(3) In terms of overall trajectory accuracy, since the STU dataset used in the test has good-quality depth map data, the final overall trajectory estimation accuracy of this embodiment is basically on par with the VINS-RGBD scheme and better than VINS-MONO in most cases; when the quality of the depth information of the received images is poor, this embodiment combines the triangulation method with the depth information to recover map points, which recovers more 3D map points while ensuring accuracy and therefore brings a noticeable improvement in accuracy over the VINS-RGBD scheme.
表1本实施例与VINS-MONO和VINS-RGBD的性能对比Table 1 Performance comparison of this embodiment with VINS-MONO and VINS-RGBD
Figure PCTCN2022072711-appb-000001
为了验证本实施例的鲁棒性，我们在弗吉尼亚联邦大学公开的VCU-RVI数据集（15组数据）上进行了测试，计算VINS-MONO、VINS-RGBD和本实施例的轨迹均方根误差（RMSE），测试结果如表2所示。In order to verify the robustness of this embodiment, we tested it on the VCU-RVI dataset (15 sequences) published by Virginia Commonwealth University and calculated the trajectory root mean square error (RMSE) of VINS-MONO, VINS-RGBD and this embodiment; the test results are shown in Table 2.
从测试结果中可以看出，VINS-MONO、VINS-RGBD都未成功跟踪所有的15条轨迹；VINS-MONO有两组轨迹由于无法初始化成功而跟踪失败；而VINS-RGBD则有三组轨迹由于无法初始化成功而跟踪失败。反观本实施例，本实施例在全部15组数据集上都做到了成功初始化和跟踪，即本实施例具有更好的鲁棒性。同时，在15组数据集的跟踪轨迹精度上，本技术方案的精度在大多数情况下也都优于其他方法，即本实施例可以得到更加准确的结果。It can be seen from the test results that neither VINS-MONO nor VINS-RGBD successfully tracked all 15 trajectories: VINS-MONO failed to track two sequences because initialization failed, and VINS-RGBD failed to track three sequences for the same reason. By contrast, this embodiment achieved successful initialization and tracking on all 15 sequences, i.e., this embodiment is more robust. Meanwhile, in terms of tracking-trajectory accuracy on the 15 sequences, the accuracy of this technical solution is also better than the other methods in most cases, i.e., this embodiment can obtain more accurate results.
表2轨迹均方根误差Table 2 Trajectory root mean square error
Figure PCTCN2022072711-appb-000002
其中,X代表初始化失败。Among them, X represents initialization failure.
综上,本示例性实施方式中,具有以下有益效果:To sum up, this exemplary embodiment has the following beneficial effects:
(1)不再需要像VINS-MONO和VINS-RGBD那样等待累积满10帧图像并且IMU初始化成功之后才能输出位姿，提前了输出位姿的时间。即在系统开始运行时，接收到第一个当前帧图像便可以开始输出可利用的位姿信息。在用户体验上，用户不再需要等待视觉IMU联合初始化完成才可以开始使用相关应用，而是打开应用后立刻便可以开始使用。(1) Unlike VINS-MONO and VINS-RGBD, it is no longer necessary to wait until 10 frames of images have accumulated and the IMU has been successfully initialized before outputting the pose, which advances the time at which the pose is output. That is, as soon as the system starts running and the first current frame image is received, usable pose information can be output. In terms of user experience, users no longer need to wait for the joint visual-IMU initialization to complete before using the related application; they can start immediately after opening it.
(2)由于原来的VIO系统需要等待系统累积满10帧图像才开始进行操作，在未满10帧的时候不进行任何操作，而图像的传入帧率为10Hz左右，每帧图像之间的时间间隔约为100ms，不进行任何操作导致了这每帧100ms的时间间隔完全被浪费，而本申请则充分利用了这些时间间隔，使用frame-by-frame的方式计算出每两帧图像之间的相对位姿。相对于相关技术中积累10帧图像之后再进行位姿求解、然后再进行地图点恢复的方案，本技术方案在10帧图像传入的时候，只需计算最后一帧的位姿，大幅降低了初始化过程的耗时。(2) The original VIO system waits until 10 frames of images have accumulated before starting any operation, and does nothing while fewer than 10 frames are available. Since the incoming image frame rate is about 10 Hz, the time interval between frames is about 100 ms, and performing no operation means that these 100 ms per-frame intervals are completely wasted. The present application makes full use of these intervals by computing the relative pose between every two frames in a frame-by-frame manner. Compared with the related-art scheme of accumulating 10 frames before solving the poses and then recovering map points, this technical solution only needs to compute the pose of the last frame when the 10th frame arrives, which greatly reduces the time consumed by the initialization process.
(3)在出现位姿确定失败时，可以合理地对失败情况进行不同的处理，同时在不同情况下分别采用深度信息和三角化法进行地图点恢复，提高了初始化的成功率，进而提高了最终确定位姿的精度和鲁棒性。(3) When pose determination fails, the failure cases can be handled differently in a reasonable manner; meanwhile, depth information and the triangulation method are used for map-point recovery in different cases, which improves the success rate of initialization and in turn improves the accuracy and robustness of the finally determined poses.
需要注意的是,上述附图仅是根据本公开示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。It should be noted that the above-mentioned drawings are only schematic illustrations of the processes included in the method according to the exemplary embodiment of the present disclosure, and are not intended to be limiting. It is easy to understand that the processes shown in the above figures do not indicate or limit the chronological order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, in multiple modules.
进一步的，参考图11所示，本示例的实施方式中还提供一种视觉惯性系统初始化装置1100，包括位姿确定模块1110和初始化模块1120。其中：Further, referring to FIG. 11, the embodiment of this example further provides a visual inertial system initialization apparatus 1100, including a pose determination module 1110 and an initialization module 1120. Therein:
位姿确定模块1110可以用于在接收图像的过程中,针对图像进行逐帧计算,直至得到第一预设数量的位姿。The pose determination module 1110 may be configured to perform frame-by-frame calculation on the image during the process of receiving the image, until the pose of the first preset number is obtained.
初始化模块1120可以用于根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率，以根据运动速度、重力向量和偏离率对视觉惯性系统进行初始化。The initialization module 1120 may be configured to determine the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector and deviation rate.
在一示例性实施例中,上述逐帧计算包括:在接收到一帧当前帧图像时,提取当前帧图像的特征点和特征点对应的深度信息;基于当前帧图像的特征点和深度信息与上一帧图像的特征点和深度信息确定当前帧图像对应的位姿;将当前帧图像作为上一帧图像,并继续接收新的当前帧图像。In an exemplary embodiment, the above-mentioned frame-by-frame calculation includes: when a frame of the current frame image is received, extracting the feature points of the current frame image and depth information corresponding to the feature points; based on the feature points and depth information of the current frame image and The feature points and depth information of the previous frame image determine the pose corresponding to the current frame image; take the current frame image as the previous frame image, and continue to receive new current frame images.
在一示例性实施例中,位姿确定模块1110可以用于在未成功确定当前帧图像对应的位姿时,丢弃当前帧图像,并保留上一帧图像;在接收到新的当前帧图像时,基于新的当前帧图像和保留的上一帧图像计算新的当前帧图像对应的位姿。In an exemplary embodiment, the pose determination module 1110 may be configured to discard the current frame image and retain the previous frame image when the pose corresponding to the current frame image is not successfully determined; when receiving a new current frame image , and calculate the pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
在一示例性实施例中,位姿确定模块1110可以用于在未成功确定当前帧图像对应的位姿时,统计未成功确定位姿的当前帧图像的数量;在未成功确定位姿的当前帧图像的数量等于第二预设数量时,清空已经确定的位姿,并继续接收新的当前帧图像。In an exemplary embodiment, the pose determination module 1110 may be configured to count the number of current frame images whose poses are unsuccessfully determined when the pose corresponding to the current frame image is unsuccessfully determined; When the number of frame images is equal to the second preset number, the determined pose is cleared, and new current frame images are continued to be received.
在一示例性实施例中,位姿确定模块1110可以用于在成功确定当前帧图像对应的位姿时,重置未成功确定位姿的当前帧图像的数量。In an exemplary embodiment, the pose determination module 1110 may be configured to reset the number of current frame images whose poses are not successfully determined when the pose corresponding to the current frame image is successfully determined.
在一示例性实施例中,位姿确定模块1110可以用于将当前帧图像的特征点与上一帧图像的特征点进行特征匹配,以获取匹配特征点;基于上一帧图像的特征点对应的深度信息确定匹配特征点的地图点,并将地图点投影至当前帧图像,以确定当前帧图像对应的位姿。In an exemplary embodiment, the pose determination module 1110 may be configured to perform feature matching between the feature points of the current frame image and the feature points of the previous frame image to obtain matching feature points; The depth information of , determines the map points that match the feature points, and projects the map points to the current frame image to determine the pose corresponding to the current frame image.
在一示例性实施例中,视觉惯性系统初始化装置1100还可以包括位姿优化模块,用于对第一预设数量的位姿对应的目标图像进行地图点恢复,以获取恢复后的地图点;根据地图点构建局部集束调整,以对位姿进行优化得到优化后的位姿。In an exemplary embodiment, the visual-inertial system initialization apparatus 1100 may further include a pose optimization module, configured to perform map point recovery on the target images corresponding to the first preset number of poses, so as to obtain the recovered map points; A local bundle adjustment is constructed according to the map points to optimize the pose to obtain the optimized pose.
在一示例性实施例中,位姿优化模块可以用于在第一预设数量的位姿对应的目标图像之间查找共视匹配关系,以获取至少一对目标图像对;针对每对目标图像对,利用目标图像对中目标图像的深度信息进行重投影,并计算重投影误差;在重投影误差小于等于预设阈值时,利用深度信息对特征点进行反投影,以对目标图像进行地图点恢复。In an exemplary embodiment, the pose optimization module may be used to find a common-view matching relationship between target images corresponding to a first preset number of poses to obtain at least a pair of target image pairs; for each pair of target images Yes, use the depth information of the target image in the target image pair to reproject, and calculate the reprojection error; when the reprojection error is less than or equal to the preset threshold, use the depth information to backproject the feature points to map the target image. recover.
在一示例性实施例中,位姿优化模块可以用于在重投影误差大于预设阈值时,通过三角化法对目标图像进行地图点恢复。In an exemplary embodiment, the pose optimization module may be configured to perform map point recovery on the target image through triangulation when the reprojection error is greater than a preset threshold.
在一示例性实施例中,位姿确定模块1110可以用于将第一帧图像对应的位姿确定为预设位姿。In an exemplary embodiment, the pose determination module 1110 may be configured to determine the pose corresponding to the first frame of image as a preset pose.
在一示例性实施例中,初始化模块1120可以用于根据重力向量对位姿进行重力方向调整。In an exemplary embodiment, the initialization module 1120 may be configured to adjust the gravitational direction of the pose according to the gravitational vector.
上述装置中各模块的具体细节在方法部分实施方式中已经详细说明,未披露的细节内容可以参见方法部分的实施方式内容,因而不再赘述。The specific details of each module in the above-mentioned apparatus have been described in detail in the method part of the implementation manner, and the undisclosed details can refer to the method part of the implementation manner, and thus will not be repeated.
所属技术领域的技术人员能够理解,本公开的各个方面可以实现为系统、方法或程序产品。因此,本公开的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。As will be appreciated by one skilled in the art, various aspects of the present disclosure may be implemented as a system, method or program product. Therefore, various aspects of the present disclosure can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software aspects, which may be collectively referred to herein as implementations "circuit", "module" or "system".
本公开的示例性实施方式还提供了一种计算机可读存储介质，其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中，本公开的各个方面还可以实现为一种程序产品的形式，其包括程序代码，当程序产品在终端设备上运行时，程序代码用于使终端设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤，例如可以执行图3、图4和图10中任意一个或多个步骤。Exemplary embodiments of the present disclosure further provide a computer-readable storage medium on which a program product capable of implementing the above methods of this specification is stored. In some possible implementations, various aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps, described in the "Exemplary Methods" section above, according to various exemplary embodiments of the present disclosure, for example, any one or more of the steps in FIG. 3, FIG. 4 and FIG. 10.
需要说明的是,本公开所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:无线、电线、光缆、RF等等,或者上述的任意合适的组合。In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device . Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
此外,可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。Furthermore, program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, etc., as well as conventional procedural Programming Language - such as the "C" language or similar programming language. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server execute on. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (eg, using an Internet service provider business via an Internet connection).
本领域技术人员在考虑说明书及实践这里公开的发明后，将容易想到本公开的其他实施例。本申请旨在涵盖本公开的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本公开的真正范围和精神由权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.
应当理解的是，本公开并不局限于上面已经描述并在附图中示出的精确结构，并且可以在不脱离其范围的情况下进行各种修改和改变。本公开的范围仅由所附的权利要求来限定。It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

  1. 一种视觉惯性系统初始化方法,其中,包括:A visual-inertial system initialization method, including:
    在接收图像的过程中,针对所述图像进行逐帧计算,直至得到第一预设数量的位姿;In the process of receiving an image, frame-by-frame calculation is performed on the image until a first preset number of poses are obtained;
    根据所述第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率，以根据所述运动速度、所述重力向量和所述偏离率对视觉惯性系统进行初始化；Determine the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, the gravity vector and the deviation rate;
    其中,所述逐帧计算包括:Wherein, the frame-by-frame calculation includes:
    在接收到一帧当前帧图像时,提取所述当前帧图像的特征点和所述特征点对应的深度信息;When receiving a frame of the current frame image, extract the feature points of the current frame image and the depth information corresponding to the feature points;
    基于所述当前帧图像的所述特征点和所述深度信息与上一帧图像的所述特征点和所述深度信息确定所述当前帧图像对应的位姿;Determine the pose corresponding to the current frame image based on the feature point and the depth information of the current frame image and the feature point and the depth information of the previous frame image;
    将所述当前帧图像作为所述上一帧图像,并继续接收新的当前帧图像。Taking the current frame image as the previous frame image, and continuing to receive a new current frame image.
  2. The method according to claim 1, wherein before taking the current frame image as the previous frame image, the method further comprises:
    when the pose corresponding to the current frame image is not successfully determined, discarding the current frame image and retaining the previous frame image; and
    when a new current frame image is received, calculating a pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
  3. The method according to claim 2, wherein the method further comprises:
    when the pose corresponding to the current frame image is not successfully determined, counting the number of current frame images for which a pose is not successfully determined; and
    when the number of current frame images for which a pose is not successfully determined equals a second preset number, clearing the poses that have been determined, and continuing to receive new current frame images.
  4. The method according to claim 3, wherein after counting the number of current frame images for which a pose is not successfully determined, the method further comprises:
    when the pose corresponding to the current frame image is successfully determined, resetting the number of current frame images for which a pose is not successfully determined.
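Claims 2 to 4 together amount to a failure counter over the incoming frame stream. The sketch below is a hedged illustration: `PoseCollector` and `SECOND_PRESET` are hypothetical names, and the claims do not state whether the failure count restarts after the pose list is cleared, so that behavior is an assumption made here.

```python
# Hedged sketch of the bookkeeping in claims 2 to 4. PoseCollector and
# SECOND_PRESET are hypothetical names; restarting the failure count after
# clearing the pose list is an assumption not stated in the claims.

SECOND_PRESET = 3  # "second preset number" of failed frames (assumed value)

class PoseCollector:
    def __init__(self):
        self.poses = []
        self.failures = 0

    def on_frame(self, pose):
        """Feed one frame's estimation result; pose is None on failure."""
        if pose is None:
            # Claim 3: count the frames whose pose was not determined.
            self.failures += 1
            if self.failures == SECOND_PRESET:
                # Claim 3: clear the determined poses, keep receiving frames.
                self.poses.clear()
                self.failures = 0  # assumption: restart the count after clearing
        else:
            # Claim 4: a success resets the failure count.
            self.failures = 0
            self.poses.append(pose)
```

A successful frame thus tolerates isolated tracking failures, while a run of consecutive failures discards the partial initialization and starts over.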
  5. The method according to claim 1, wherein determining the pose corresponding to the current frame image based on the feature points and the depth information of the current frame image and the feature points and the depth information of the previous frame image comprises:
    performing feature matching between the feature points of the current frame image and the feature points of the previous frame image to obtain matching feature points; and
    determining map points of the matching feature points based on the depth information corresponding to the feature points of the previous frame image, and projecting the map points onto the current frame image to determine the pose corresponding to the current frame image.
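The geometry behind claim 5 (back-projecting a matched feature of the previous frame into a map point using its depth, then projecting that map point into the current frame) can be written out as two helpers. `K` (pinhole intrinsics) and the pose `(R, t)` are assumed inputs; the claimed method solves for the pose from such projections, which a typical system would do with a PnP solver and which is not shown here.

```python
import numpy as np

# Sketch of the back-projection / projection geometry underlying claim 5.
# K, R, t are assumed inputs; solving for (R, t) from the correspondences
# (e.g. via a PnP solver) is outside this sketch.

def backproject(uv, depth, K):
    """Pixel (u, v) plus depth -> 3D map point in the previous camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = uv
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def project(X, R, t, K):
    """3D map point -> pixel in the current frame under pose (R, t)."""
    Xc = R @ X + t          # transform into the current camera frame
    x = K @ Xc              # apply intrinsics
    return x[:2] / x[2]     # perspective division
```

For example, a pixel at the principal point with depth 2 back-projects to the point (0, 0, 2) on the optical axis and, under an identity pose, projects back to the same pixel.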
  6. The method according to claim 1, wherein before determining the motion velocity, the gravity vector, and the deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, the method further comprises:
    performing map point recovery on target images corresponding to the first preset number of poses to obtain recovered map points; and
    constructing a local bundle adjustment according to the map points, so as to optimize the poses and obtain optimized poses.
  7. The method according to claim 6, wherein performing map point recovery on the target images corresponding to the first preset number of poses comprises:
    searching for co-visibility matching relationships among the target images corresponding to the first preset number of poses to obtain at least one target image pair;
    for each target image pair, performing reprojection using the depth information of the target images in the target image pair, and calculating a reprojection error; and
    when the reprojection error is less than or equal to a preset threshold, back-projecting the feature points using the depth information to perform map point recovery on the target images.
  8. The method according to claim 7, wherein the method further comprises:
    when the reprojection error is greater than the preset threshold, performing map point recovery on the target images by triangulation.
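Claims 7 and 8 gate map point recovery on the reprojection error: within the threshold, the measured depth is trusted and the point is back-projected; otherwise the method falls back to triangulation. A minimal sketch, in which the threshold value and all names are assumptions and `triangulate` stands in for a two-view triangulation routine:

```python
import numpy as np

# Sketch of the branch in claims 7 and 8. REPROJ_THRESHOLD and all names are
# assumptions; triangulate stands in for a two-view triangulation routine.

REPROJ_THRESHOLD = 2.0  # pixels (assumed value)

def recover_map_point(reproj_error, uv, depth, K, triangulate):
    """Choose the map point recovery path from the reprojection error."""
    if reproj_error <= REPROJ_THRESHOLD:
        # Claim 7: the depth measurement is trusted; back-project with it.
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        u, v = uv
        return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    # Claim 8: depth is unreliable here; triangulate from the image pair instead.
    return triangulate(uv)
```

This keeps the cheap depth-based recovery as the default and reserves triangulation for features whose measured depth disagrees with the geometry.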
  9. The method according to claim 1, wherein when the received current frame image is a first frame image, the method comprises:
    determining the pose corresponding to the first frame image as a preset pose.
  10. The method according to claim 1, wherein the method further comprises:
    adjusting the gravity direction of the poses according to the gravity vector.
  11. A visual-inertial system initialization apparatus, comprising:
    a pose determination module, configured to perform, in a process of receiving images, a frame-by-frame calculation on the images until a first preset number of poses are obtained; and
    an initialization module, configured to determine a motion velocity, a gravity vector, and a deviation rate corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual-inertial system according to the motion velocity, the gravity vector, and the deviation rate;
    wherein the frame-by-frame calculation comprises:
    upon receiving a current frame image, extracting feature points of the current frame image and depth information corresponding to the feature points;
    determining a pose corresponding to the current frame image based on the feature points and the depth information of the current frame image and the feature points and the depth information of a previous frame image; and
    taking the current frame image as the previous frame image, and continuing to receive a new current frame image.
  12. The apparatus according to claim 11, wherein the pose determination module is further configured to: when the pose corresponding to the current frame image is not successfully determined, discard the current frame image and retain the previous frame image; and when a new current frame image is received, calculate a pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
  13. The apparatus according to claim 12, wherein the pose determination module is further configured to: when the pose corresponding to the current frame image is not successfully determined, count the number of current frame images for which a pose is not successfully determined; and when the number of current frame images for which a pose is not successfully determined equals a second preset number, clear the poses that have been determined and continue to receive new current frame images.
  14. The apparatus according to claim 13, wherein the pose determination module is further configured to: when the pose corresponding to the current frame image is successfully determined, reset the number of current frame images for which a pose is not successfully determined.
  15. The apparatus according to claim 11, wherein the pose determination module is further configured to: perform feature matching between the feature points of the current frame image and the feature points of the previous frame image to obtain matching feature points; and determine map points of the matching feature points based on the depth information corresponding to the feature points of the previous frame image, and project the map points onto the current frame image to determine the pose corresponding to the current frame image.
  16. The apparatus according to claim 11, wherein the pose determination module is further configured to: perform map point recovery on target images corresponding to the first preset number of poses to obtain recovered map points; and construct a local bundle adjustment according to the map points, so as to optimize the poses and obtain optimized poses.
  17. The apparatus according to claim 16, wherein the pose determination module is further configured to: search for co-visibility matching relationships among the target images corresponding to the first preset number of poses to obtain at least one target image pair; for each target image pair, perform reprojection using the depth information of the target images in the target image pair and calculate a reprojection error; and when the reprojection error is less than or equal to a preset threshold, back-project the feature points using the depth information to perform map point recovery on the target images.
  18. The apparatus according to claim 17, wherein the pose determination module is further configured to: when the reprojection error is greater than the preset threshold, perform map point recovery on the target images by triangulation.
  19. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 10.
  20. An electronic device, comprising:
    a processor; and
    a memory for storing executable instructions of the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 10 by executing the executable instructions.
PCT/CN2022/072711 2021-02-18 2022-01-19 Visual inertial system initialization method and apparatus, medium, and electronic device WO2022174711A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110190368.9 2021-02-18
CN202110190368.9A CN112819860B (en) 2021-02-18 2021-02-18 Visual inertial system initialization method and device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2022174711A1 true WO2022174711A1 (en) 2022-08-25

Family

ID=75864182

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/072711 WO2022174711A1 (en) 2021-02-18 2022-01-19 Visual inertial system initialization method and apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112819860B (en)
WO (1) WO2022174711A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819860B (en) * 2021-02-18 2023-12-22 Oppo广东移动通信有限公司 Visual inertial system initialization method and device, medium and electronic equipment
CN115601419A (en) * 2021-07-07 2023-01-13 北京字跳网络技术有限公司(Cn) Synchronous positioning and mapping back-end optimization method, device and storage medium
CN113610918A (en) * 2021-07-29 2021-11-05 Oppo广东移动通信有限公司 Pose calculation method and device, electronic equipment and readable storage medium
CN113899364B (en) * 2021-09-29 2022-12-27 深圳市慧鲤科技有限公司 Positioning method and device, equipment and storage medium
CN117346650A (en) * 2022-06-28 2024-01-05 中兴通讯股份有限公司 Pose determination method and device for visual positioning and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107747941A (en) * 2017-09-29 2018-03-02 歌尔股份有限公司 A kind of binocular visual positioning method, apparatus and system
CN110057352A (en) * 2018-01-19 2019-07-26 北京图森未来科技有限公司 A kind of camera attitude angle determines method and device
CN110335316A (en) * 2019-06-28 2019-10-15 Oppo广东移动通信有限公司 Method, apparatus, medium and electronic equipment are determined based on the pose of depth information
CN112284381A (en) * 2020-10-19 2021-01-29 北京华捷艾米科技有限公司 Visual inertia real-time initialization alignment method and system
CN112819860A (en) * 2021-02-18 2021-05-18 Oppo广东移动通信有限公司 Visual inertial system initialization method and device, medium and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3058534B1 (en) * 2016-11-09 2019-02-01 Stereolabs INDIVIDUAL VISUAL IMMERSION DEVICE FOR MOVING PERSON WITH OBSTACLE MANAGEMENT
CN108489482B (en) * 2018-02-13 2019-02-26 视辰信息科技(上海)有限公司 The realization method and system of vision inertia odometer
CN110322500B (en) * 2019-06-28 2023-08-15 Oppo广东移动通信有限公司 Optimization method and device for instant positioning and map construction, medium and electronic equipment


Also Published As

Publication number Publication date
CN112819860B (en) 2023-12-22
CN112819860A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
WO2022174711A1 (en) Visual inertial system initialization method and apparatus, medium, and electronic device
WO2020259248A1 (en) Depth information-based pose determination method and device, medium, and electronic apparatus
CN110322500B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
CN107888828B (en) Space positioning method and device, electronic device, and storage medium
US11270460B2 (en) Method and apparatus for determining pose of image capturing device, and storage medium
CN109087359B (en) Pose determination method, pose determination apparatus, medium, and computing device
CN109727288B (en) System and method for monocular simultaneous localization and mapping
US11199414B2 (en) Method for simultaneous localization and mapping
CN110310326B (en) Visual positioning data processing method and device, terminal and computer readable storage medium
CN108805917B (en) Method, medium, apparatus and computing device for spatial localization
Chen et al. Rise of the indoor crowd: Reconstruction of building interior view via mobile crowdsourcing
CN109461208B (en) Three-dimensional map processing method, device, medium and computing equipment
WO2019170166A1 (en) Depth camera calibration method and apparatus, electronic device, and storage medium
WO2020228643A1 (en) Interactive control method and apparatus, electronic device and storage medium
CN110660098B (en) Positioning method and device based on monocular vision
CN111127524A (en) Method, system and device for tracking trajectory and reconstructing three-dimensional image
JP7150917B2 (en) Computer-implemented methods and apparatus, electronic devices, storage media and computer programs for mapping
CN110349212B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
US11195297B2 (en) Method and system for visual localization based on dual dome cameras
CN111784776B (en) Visual positioning method and device, computer readable medium and electronic equipment
CN112907620A (en) Camera pose estimation method and device, readable storage medium and electronic equipment
CN111609868A (en) Visual inertial odometer method based on improved optical flow method
CN115699096B (en) Tracking augmented reality devices
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
CN112731503B (en) Pose estimation method and system based on front end tight coupling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22755484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22755484

Country of ref document: EP

Kind code of ref document: A1