WO2022174711A1 - Visual inertial system initialization method and apparatus, medium, and electronic device - Google Patents


Info

Publication number
WO2022174711A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame image
current frame
pose
image
poses
Prior art date
Application number
PCT/CN2022/072711
Other languages
French (fr)
Chinese (zh)
Inventor
尹赫 (Yin He)
Original Assignee
OPPO Guangdong Mobile Telecommunications Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OPPO Guangdong Mobile Telecommunications Co., Ltd.
Publication of WO2022174711A1 publication Critical patent/WO2022174711A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present disclosure relates to the technical field of visual positioning, and in particular, to a visual inertial system initialization method, a visual inertial system initialization device, a computer-readable medium, and an electronic device.
  • indoor positioning technology is an essential requirement for mobile devices such as mobile phones, AR glasses, and indoor robots.
  • in indoor environments, a mobile device cannot determine its own position through a global positioning technology such as GPS (Global Positioning System), and can only rely on its own sensors to achieve positioning.
  • the most direct and readily available data are visual sensor (e.g., camera) data and inertial measurement unit (IMU) data, both of which can be combined with algorithms to achieve positioning.
  • positioning technology using only vision sensors initially developed rapidly, but as the technology matured, the inherent defects of vision sensors were exposed, and cameras alone can no longer break through the bottleneck faced by current positioning technology. Techniques that use only the IMU for localization face the same bottleneck.
  • VIO (Visual-Inertial Odometry): odometry that fuses visual and inertial navigation data.
  • a method for initializing a visual inertial system, comprising: in the process of receiving images, performing frame-by-frame calculation on the images until a first preset number of poses are obtained; and determining the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and deviation rate; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and the feature points and depth information of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive a new current frame image.
  • an apparatus for initializing a visual inertial system, comprising: a pose determination module configured to perform frame-by-frame calculation on images during the process of receiving images, until a first preset number of poses are obtained;
  • an initialization module configured to determine the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and deviation rate; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
  • a computer-readable medium on which a computer program is stored, which, when executed by a processor, implements the above-mentioned method.
  • an electronic device comprising: a processor; and a memory for storing one or more programs which, when executed by one or more processors, enable the one or more processors to implement the above-mentioned method.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
  • FIG. 2 shows a schematic diagram of an electronic device to which an embodiment of the present disclosure can be applied
  • FIG. 3 schematically shows a flowchart of a method for initializing a visual-inertial system in an exemplary embodiment of the present disclosure
  • FIG. 4 schematically shows a flow chart of frame-by-frame calculation in an exemplary embodiment of the present disclosure
  • FIG. 5 schematically shows a schematic diagram of the principle of determining the pose corresponding to the current frame image by using the current frame image and the previous frame image in an exemplary embodiment of the present disclosure
  • FIG. 6 schematically shows a schematic diagram of a frame-by-frame calculation process in an exemplary embodiment of the present disclosure
  • FIG. 7 schematically shows a schematic diagram of another frame-by-frame calculation process in an exemplary embodiment of the present disclosure
  • FIG. 8 schematically shows a schematic diagram of still another frame-by-frame calculation process in an exemplary embodiment of the present disclosure
  • FIG. 9 schematically shows a schematic diagram of a map point recovery process in an exemplary embodiment of the present disclosure.
  • FIG. 10 schematically shows a flowchart of another method for initializing a visual-inertial system in an exemplary embodiment of the present disclosure
  • FIG. 11 schematically shows a schematic diagram of the composition of a visual-inertial system initialization apparatus in an exemplary embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • FIG. 1 shows a schematic diagram of a system architecture of an exemplary application environment to which a visual-inertial system initialization method and apparatus according to embodiments of the present disclosure can be applied.
  • the system architecture 100 may include one or more of terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the terminal devices 101, 102, and 103 may be various terminal devices having a visual inertial system, including but not limited to desktop computers, portable computers, smart phones, and tablet computers, and so on. It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the server 105 may be a server cluster composed of multiple servers, or the like.
  • the visual inertial system initialization method provided by the embodiment of the present disclosure is generally performed by the terminal devices 101 , 102 , and 103 , and accordingly, the visual inertial system initialization apparatus is generally set in the terminal devices 101 , 102 , and 103 .
  • the method for initializing the visual inertial system provided by the embodiment of the present disclosure can also be executed by the server 105 , and correspondingly, the visual inertial system initialization device can also be set in the server 105 .
  • this is not specially limited in this exemplary embodiment.
  • for example, the user may collect images through the visual sensors of the terminal devices 101, 102, 103 and send the images to the server 105, so that the server 105 performs the pose calculation and sends the calculation results back to the terminal devices 101, 102, and 103; the terminal devices 101, 102, and 103 then determine the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit from the poses sent by the server 105, and then initialize the visual inertial system.
  • Exemplary embodiments of the present disclosure provide an electronic device for implementing a visual inertial system initialization method, which may be the terminal devices 101 , 102 , 103 or the server 105 in FIG. 1 .
  • the electronic device includes at least a processor and a memory, the memory is used to store executable instructions of the processor, and the processor is configured to execute the visual inertial system initialization method by executing the executable instructions.
  • the following takes the mobile terminal 200 in FIG. 2 as an example to illustrate the structure of the electronic device. It will be understood by those skilled in the art that the configuration in FIG. 2 can also be applied to stationary devices, apart from the components specifically intended for mobile purposes.
  • the mobile terminal 200 may include more or fewer components than shown, or combine some components, or separate some components, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the interface connection relationship between the components is only schematically shown, and does not constitute a structural limitation of the mobile terminal 200 .
  • the mobile terminal 200 may also adopt an interface connection manner different from that in FIG. 2 , or a combination of multiple interface connection manners.
  • the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, a subscriber identity module (SIM) card interface 295, and the like.
  • the sensor module 280 may include a depth sensor 2801, an inertial sensor 2802, a gyroscope sensor 2803, and the like.
  • the processor 210 may include one or more processing units, for example, the processor 210 may include an application processor (Application Processor, AP), a modem processor, a graphics processor (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), controller, video codec, digital signal processor (Digital Signal Processor, DSP), baseband processor and/or Neural-Network Processing Unit (NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • a memory is provided in the processor 210 .
  • the memory can store instructions for implementing six modular functions: detection instructions, connection instructions, information management instructions, analysis instructions, data transmission instructions, and notification instructions, and the execution is controlled by the processor 210 .
  • the mobile terminal 200 may implement a shooting function through an ISP, a camera module 291, a video codec, a GPU, a display screen 290, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera module 291;
  • the camera module 291 is used to capture still images or videos;
  • the digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals;
  • the codec is used to compress or decompress the digital video, and the mobile terminal 200 may also support one or more video codecs.
  • the above-mentioned camera module 291 may be used as a visual sensor in a visual inertial system, and image acquisition is performed by the camera module 291 .
  • the depth sensor 2801 is used to acquire depth information of the scene.
  • a depth sensor may be disposed in the camera module 291 for capturing depth information corresponding to the image while capturing the image.
  • the inertial sensor 2802, also known as an inertial measurement unit, may be used to detect and measure acceleration and rotational motion.
  • the gyro sensor 2803 may be used to determine the motion attitude of the mobile terminal 200 .
  • sensors with other functions can also be provided in the sensor module 280 according to actual needs, such as an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
  • ARCore uses a monocular camera and IMU sensor; it detects visually distinct feature points in the captured camera image and uses these points to calculate its position change. This visual information is combined with inertial measurements from the device's IMU to estimate the camera's pose relative to the surrounding world over time. ARCore provides good pose information and environmental information for applications such as AR and indoor navigation on Android phones.
  • the first related-art initialization scheme is applied to the VINS system.
  • the specific process is as follows: accumulate 10 frames of images, select two frames L and R with sufficient parallax from among them, and use the epipolar geometric constraint to solve the pose between the two frames. Then use the pose to triangulate and recover map points co-visible between the two frames. Project these map points onto frames other than L and R, calculate each frame's pose by minimizing the reprojection error, then triangulate between that frame and frames L and R to recover still more map points. By repeating this process, the poses of the 10 frames and the map points corresponding to the 10 frames of images can be solved.
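The triangulation step in this related-art scheme can be sketched with the standard linear (DLT) method. The following is an illustrative NumPy implementation, not part of the patent text; the projection matrices are assumed to be known 3x4 camera matrices.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen in two views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D observations.
    Returns the 3D point in the common (world) frame.
    """
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point X: x * (P[2] @ X) - (P[0] @ X) = 0, etc.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point is the null vector of A (smallest singular value).
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With more than two co-visible frames, additional rows are stacked into `A` in the same way.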
  • then, rotation constraints and translation constraints are used to align the previously determined poses of the 10 frames of images with the IMU; those poses are treated as accurate and used to constrain the variables to be solved for the IMU, and finally SVD decomposition is used to solve a system of non-homogeneous equations and determine all unknowns.
  • the second scheme is applied to the RGBD-VINS system; the specific process is as follows: after 10 frames of images have been accumulated in the system, two images L and R with sufficient parallax are selected from them. Since the RGBD-VINS system includes a depth camera, the depth of each frame is known, so it is no longer necessary to use the 2D-to-2D epipolar geometric constraint to solve the pose between the two frames; instead, the reprojection error from 3D points to 2D points is minimized, so the solved pose directly has a deterministic scale. In the recovery of map points, triangulation is no longer used; instead, for each co-visible feature point, the corresponding depth value from the depth camera is used directly to restore its three-dimensional coordinates, i.e., the map points corresponding to each frame of image.
  • the first method mainly has the following shortcomings: 1. Using a monocular camera for system initialization requires that the images used for initialization have sufficient parallax and that there are enough matching points between the two frames to ensure successful initialization. However, detecting parallax and finding enough matching points takes extra time, and if the requirements are not met, initialization may fail and restart, resulting in a low initialization success rate. 2. Ten consecutive frames must be accumulated before the initialization process can start, so no pose can be output when the system begins running. 3. Using a monocular camera for visual initialization causes scale uncertainty in the pose.
  • the second method has the following disadvantages: 1. Ten frames must be accumulated before initialization can be performed; before the accumulation is complete, the system does no work and wastes considerable time. Meanwhile, parallax must still be detected across the 10 frames, so initialization will still fail in scenes with small parallax. 2. No pose can be output when the system starts running. 3. In the recovery of map points, the depth camera information is trusted completely. When the depth camera is noisy, or a large number of map points lie outside the depth camera's range, the number of map points available for pose calculation may be severely reduced, causing large errors in the calculation results or even failure to obtain a result.
  • this example embodiment provides a method for initializing a visual-inertial system.
  • the method for initializing the visual inertial system may be applied to the foregoing server 105, and may also be applied to one or more of the foregoing terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment.
  • the visual inertial system initialization method may include the following steps S310 and S320:
  • step S310 in the process of receiving the image, frame-by-frame calculation is performed on the image until a first preset number of poses are obtained.
  • the above frame-by-frame calculation is shown in FIG. 4 and may include the following steps S410 to S430:
  • step S410 when a frame of the current frame image is received, the feature points of the current frame image and the depth information corresponding to the feature points are extracted.
  • in general, the time interval between image frames is about 100 ms. Therefore, in the related art, if 10 frames of images need to be accumulated, the accumulation takes about 1000 ms, and initialization can begin only after that. To avoid wasting time while images accumulate, each frame can be processed as soon as it is received. Specifically, when a current frame image is received, the feature points in it are extracted first, and at the same time the depth information corresponding to those feature points is extracted from the depth data collected by the depth sensor.
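The per-frame control flow described above can be sketched as follows. This is an illustrative skeleton, not the patent's implementation: the `estimate_pose` callable stands in for the feature-matching and depth-based PnP step, and the class and parameter names are placeholders.

```python
import numpy as np

class FrameByFrameInitializer:
    """Sketch of the frame-by-frame pose accumulation (steps S410-S430).

    `estimate_pose(prev_frame, frame)` is injected so the control flow
    can be shown without a full vision pipeline.
    """

    def __init__(self, estimate_pose, first_preset_number=10):
        self.estimate_pose = estimate_pose
        self.first_preset_number = first_preset_number
        self.poses = []
        self.prev_frame = None

    def process(self, frame):
        """Handle one incoming frame; return True once enough poses exist."""
        if self.prev_frame is None:
            # The first frame's pose is fixed to the preset (identity) pose
            # and counts toward the first preset number.
            self.poses.append(np.eye(4))
        else:
            pose = self.estimate_pose(self.prev_frame, frame)
            self.poses.append(pose)
        self.prev_frame = frame  # the current frame becomes the previous frame
        return len(self.poses) >= self.first_preset_number
```

Because each frame is processed on arrival, pose accumulation overlaps with image reception instead of waiting for a 10-frame buffer to fill.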
  • step S420 the pose corresponding to the current frame image is determined based on the feature points and depth information of the current frame image and the feature points and depth information of the previous frame image.
  • the feature points and depth information of the current frame image can be compared with those of the previous frame image (i.e., the frame received immediately before the current frame image) to determine the pose corresponding to the current frame image.
  • when determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image, the feature points of the current frame image are first matched against the feature points of the previous frame image to obtain matching feature points; since the pose of the previous frame image has already been determined, the depth information corresponding to the previous frame's feature points can be used to determine the map points of the matching feature points in the three-dimensional coordinate system; these map points are then projected into the current frame image, and the pose corresponding to the current frame image is determined.
  • specifically, optical flow matching can be performed between the feature points of the current frame image and those of the previous frame image; the map points corresponding to the matching feature points are then determined from the depth information of the previous frame's feature points and reprojected into the current frame image; finally, the pose corresponding to the current frame image is calculated by minimizing the reprojection error via PnP (Perspective-n-Point).
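The quantity minimized by the PnP step is the reprojection error. The following illustrative NumPy sketch (not from the patent) shows the residuals for a candidate pose under a pinhole camera model; the intrinsic matrix `K` is an assumed example.

```python
import numpy as np

def reprojection_residuals(R, t, K, map_points, observations):
    """Residuals minimized by PnP: project each 3D map point with a
    candidate pose (R, t) and compare with the matched 2D feature.

    R: 3x3 rotation, t: 3-vector (world -> current camera frame),
    K: 3x3 pinhole intrinsics, map_points: Nx3, observations: Nx2.
    """
    cam = map_points @ R.T + t          # world points into the camera frame
    proj = cam @ K.T                    # apply pinhole intrinsics
    pix = proj[:, :2] / proj[:, 2:3]    # perspective division
    return pix - observations           # per-point 2D reprojection error
```

A PnP solver searches for the (R, t) that drives these residuals toward zero, typically with RANSAC over the matches to reject outliers.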
  • the above three-dimensional coordinate system refers to the world coordinate system, which can usually be determined from the camera coordinate system corresponding to the first frame image input to the visual inertial system; all poses determined from the second frame onward are relative poses with respect to this world coordinate system. Specifically, when the received current frame image is the first frame image, its camera coordinate system can be obtained directly and set as the world coordinate system.
  • since the depth information of the current frame image is collected by the depth camera, it is not necessary to use the epipolar constraint to solve the pose between the two frame images; even if the parallax between the two images is very small, map points can be projected using the depth information and the relative pose solved by minimizing the reprojection error. Therefore, it is unnecessary to judge in advance whether the parallax between the current frame image and the previous frame image is sufficient.
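Lifting a matched feature to a 3D map point with measured depth needs no triangulation, only back-projection through the pinhole model. A minimal sketch (illustrative, with assumed intrinsic parameters, not from the patent text):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a measured depth to a 3D point in the
    camera frame using the pinhole model (fx, fy: focal lengths in
    pixels; cx, cy: principal point)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])
```

This is why small parallax is acceptable: the point's distance comes from the depth sensor rather than from the geometry between the two views.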
  • step S430 the current frame image is taken as the previous frame image, and a new current frame image is continuously received.
  • since a sufficient number of poses are required to initialize the IMU, before determining the motion speed, gravity vector, and deviation rate corresponding to the inertial measurement unit, the poses corresponding to multiple current frame images must be determined repeatedly, until the number of determined poses equals the first preset number required for IMU initialization. Specifically, after a current frame image is received and its corresponding pose determined, that image is taken as the previous frame image, so that the pose of the newly received current frame image can be determined using the most recently received frame as the previous frame image.
  • for example, when the received current frame image is the second frame image, the feature points and corresponding depth information in the second frame image, together with those in the previous frame image (the first frame image), can be used to determine the pose corresponding to the second frame image; the second frame image is then taken as the previous frame image. When the received current frame image is the third frame image, since the previous frame image has changed from the first frame image to the second frame image, the pose corresponding to the third frame image can be determined directly from the feature points and depth information of the third frame image and those of the previous frame image (the second frame image). This process is repeated, frame by frame, to determine the pose of each successively input current frame image, until the pose corresponding to the i-th frame is determined and a first preset number of poses have been obtained.
  • the pose corresponding to the first frame image can be directly determined as the preset pose.
  • the preset pose may be an identity matrix. It should be noted that after the pose of the first frame image is set to the preset pose, the preset pose is also counted toward the number of poses. That is, assuming the first preset number is 10, nine more poses need to be determined through the repeated process above in addition to the preset pose; the preset pose plus the nine determined poses together equal the first preset number of 10. At that point, the repeated pose determination can stop and subsequent processing can be performed.
  • the received current frame image may contain interference such as noise, and the pose calculation may fail because of such interference; for example, noise may cause the PnP solution that minimizes the reprojection error to fail.
  • in this case, the current frame image can be discarded and not taken as the previous frame image in step S430; instead, the previous previous frame image is retained, a new current frame image is received, and the pose corresponding to the new current frame image is calculated from the new current frame image and the retained previous frame image.
  • for example, suppose the current frame image is a 6th frame image a and the previous frame image is the 5th frame image; if the pose of image a cannot be determined, image a is discarded, the next acceptable current frame image is used as a new 6th frame image b to continue the calculation, and the pose corresponding to the new 6th frame image b is determined. Then, taking the new 6th frame image b as the previous frame image, the frame-by-frame calculation continues.
  • if the pose corresponding to the current frame image is not successfully determined, the number of such current frame images can be counted; when the number of current frame images whose pose was not successfully determined equals a second preset number, all previously determined poses can be cleared, i.e., the pose count is reset to 0. Then new current frame images are received again, the above frame-by-frame calculation is performed again, and poses are accumulated anew until their number equals the first preset number.
  • there are two situations in which the count equals the above-mentioned second preset number: one is that the cumulative number of current frame images whose pose was not successfully determined equals the second preset number; the other is that a second preset number of consecutive current frame images fail to have their corresponding poses successfully determined.
  • in the second case, to satisfy the condition that the corresponding pose is not successfully determined for a second preset number of consecutive current frame images, counting starts after the first current frame image whose pose is not successfully determined appears; before the count reaches the second preset number, if any current frame image successfully determines its corresponding pose, the previously accumulated count of current frame images with undetermined poses can be reset.
  • for example, assuming the second preset number is 3, when the previous frame image is the fifth frame image and the current frame image (a first 6th frame image d) fails to determine its corresponding pose, the failure count n is incremented; if a subsequent current frame image successfully determines its pose before n reaches 3, n can be reset to 0.
  • the third sixth frame image can then be used as the previous frame image, a new current frame image can be obtained, and the pose corresponding to the new current frame image can be determined from the new current frame image and the previous frame image (the third sixth frame image).
  • this mechanism can prevent pose calculation failures from causing the initialization to fail, improving the success rate of initialization and making it more robust.
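As a runnable illustration, the failure-counting and reset behaviour described above can be sketched as follows. The function name, the boolean frame stream, and the default preset numbers are assumptions for illustration only; in the real system each boolean would be the outcome of the pose calculation for one received frame.

```python
# Illustrative sketch of the consecutive-failure handling described above.
# A pose is accumulated for each frame whose pose is determined; n_failures
# counts consecutive failures and is reset by any success; reaching the
# second preset number clears all accumulated poses.

def run_frame_stream(results, first_preset=10, second_preset=3):
    """results: iterable of booleans, True = pose determined for that frame.
    Returns the list of accumulated pose indices once first_preset poses
    are collected, or None if the stream ends first."""
    poses = []          # accumulated poses (here just frame indices)
    n_failures = 0      # consecutive frames whose pose was not determined
    for idx, ok in enumerate(results):
        if ok:
            poses.append(idx)
            n_failures = 0          # any success resets the failure count
            if len(poses) == first_preset:
                return poses
        else:
            n_failures += 1
            if n_failures == second_preset:
                poses.clear()       # clear all poses determined so far
                n_failures = 0
    return None
```

Any success resets the consecutive-failure count, matching the second situation described above, while three consecutive failures discard every pose accumulated so far.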
  • in step S320, the motion speed, gravity vector, and bias corresponding to the inertial measurement unit are determined according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias.
  • specifically, the first preset number of poses may be aligned with the inertial measurement unit, using rotation and translation constraints to calculate the motion velocity, gravity vector, and bias of the inertial measurement unit.
  • the scale of the pose does not need to be estimated during alignment, since the depth information has already fixed the scale of the pose.
  • in addition, a more accurate intrinsic and extrinsic calibration can be performed on the inertial measurement unit, yielding an accurate accelerometer bias, gyroscope bias, and extrinsic transformation between the inertial measurement unit and the camera, so that the joint visual-inertial initialization converges faster and the accuracy is further improved.
  • in an exemplary embodiment, map points may be recovered for the target images corresponding to the first preset number of poses to obtain the map points corresponding to each target image; a local bundle adjustment is then constructed from the recovered map points to optimize the first preset number of poses; finally, the motion velocity, gravity vector, and bias corresponding to the inertial measurement unit are determined according to the optimized poses.
  • the target images may include the current frame images for which the first preset number of poses were successfully determined. For example, if a total of 12 current frame images are received in order to determine the first preset number of poses, 2 of which fail to determine their poses, then the remaining 10 current frame images whose corresponding poses were successfully determined are the target images.
  • the number of target images is also the first preset number.
  • target images with a co-visibility matching relationship may be searched for among the first preset number of target images to obtain at least one target image pair. Then, for each target image pair, the depth information of the target images in the pair is used for reprojection, and the reprojection error is calculated.
  • when the reprojection error is small, that is, less than or equal to the preset threshold, the error of the depth information can be considered small, so the depth information can be used for map point recovery.
  • specifically, the feature points can be back-projected using the depth information to recover the map points of the target image pair; conversely, when the reprojection error is large, that is, greater than the preset threshold, the error of the depth information can be considered large, so it is not suitable to recover map points from the depth information.
  • in that case, triangulation can be used to recover the map points of the target image pair. All the recovered map points are then used as a set of map points to construct a local bundle adjustment to optimize the poses.
  • selecting either depth information or triangulation to recover map points in different situations increases the number of map points while ensuring their accuracy; at the same time, even if the depth information received by the VIO system is of poor quality, accurate initialization can still be achieved.
  • when performing map point recovery, in addition to deciding via the reprojection error whether to use depth information, a probability model can also be used to model the uncertainty of the depth information, and whether to use the depth information for map point recovery can then be decided according to the probability of that uncertainty. Using a probability model to judge the uncertainty can reduce the noise introduced by the depth information and realize a more accurate initialization process.
  • the preset threshold may be set according to the actual application scenario, environment, and so on. For example, when the depth information is more reliable, the preset threshold may be set to a larger value; conversely, when the depth information is less reliable, a smaller value may be chosen as the preset threshold.
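A minimal sketch of this decision rule, under the assumption of a pinhole camera with example intrinsics and, for simplicity, an identity relative pose between the images of the pair (a real implementation would reproject into the other image of the target image pair under its estimated relative pose). The helper names are illustrative; the default threshold follows the 1/460 value used in the worked example later in the text.

```python
import math

# Assumed example pinhole intrinsics (fx, fy, cx, cy) for illustration.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

def back_project(u, v, depth):
    """Back-project a pixel (u, v) with its depth into a 3D camera-frame point."""
    x = (u - CX) / FX * depth
    y = (v - CY) / FY * depth
    return (x, y, depth)

def project(p):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = p
    return (FX * x / z + CX, FY * y / z + CY)

def recover_map_point(u, v, depth, u_obs, v_obs, threshold=1.0 / 460.0):
    """Recover one map point for a matched feature.
    (u, v, depth): feature in the reference image; (u_obs, v_obs): its match.
    If the reprojection error of the depth-based point is within the
    threshold, the depth back-projection is kept; otherwise triangulation
    would be used (represented here by a placeholder)."""
    p = back_project(u, v, depth)
    u_hat, v_hat = project(p)            # identity pose used for simplicity
    err = math.hypot(u_hat - u_obs, v_hat - v_obs)
    if err <= threshold:
        return "depth", p                # back-projection via depth
    return "triangulation", None         # fall back to triangulating the match
```

The design choice the text describes is exactly this branch: a cheap depth-consistency check gates the accurate-but-depth-dependent back-projection, and triangulation serves as the fallback when the depth reading disagrees with the match.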
  • in addition, after the gravity vector is determined, a gravity-direction adjustment can be performed on all previously determined poses according to the determined gravity vector, so as to output the adjusted poses.
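The gravity-direction adjustment can be sketched as computing the single rotation that maps the estimated gravity vector onto the world vertical and applying it to every stored pose and map point. The axis-angle (Rodrigues) construction below is a standard technique; the example gravity estimate is an assumed value.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def rotation_between(a, b):
    """3x3 rotation matrix sending unit vector a to unit vector b (Rodrigues)."""
    a, b = normalize(a), normalize(b)
    v = [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]
    c = sum(x * y for x, y in zip(a, b))       # cosine of the angle
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    # R = I + [v]x + [v]x^2 / (1 + c); assumes a and b are not opposite
    return [[(1.0 if i == j else 0.0) + K[i][j] + K2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]

def rotate(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

# Align an assumed gravity estimate with the world "down" direction (0, 0, -1);
# the same rotation would then be applied to every stored pose and map point.
g_est = [0.1, -0.2, -9.78]
R_align = rotation_between(g_est, [0.0, 0.0, -1.0])
g_aligned = rotate(R_align, normalize(g_est))   # now points straight down
```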
  • hereinafter, the technical solutions of the embodiments of the present disclosure will be described in detail taking the first preset number as 10, the second preset number as 3, and the preset threshold as 1/460 as an example.
  • Step S1001 receiving the current frame image
  • Step S1003 performing feature extraction on the current frame image, and extracting depth information corresponding to each feature point;
  • Step S1005 determine whether the current frame image is the first frame image input to the system;
  • Step S1007 when the current frame image is not the first frame image, perform optical flow matching between the current frame image and the previous frame image, and determine the pose corresponding to the current frame image through PnP by minimizing the reprojection error;
  • Step S1009 judging whether the current frame image successfully determines the corresponding pose
  • Step S1011 when the current frame image fails to determine the corresponding pose, determine whether the corresponding pose has not been successfully determined for three consecutive received current frame images;
  • Step S1013 when the corresponding poses are not successfully determined for three consecutive received current frame images, clear all the poses determined before;
  • Step S1015 when only one or two consecutively received current frame images fail to successfully determine the corresponding pose, discard the current frame image
  • Step S1017 when the current frame image successfully determines the corresponding pose, the current frame image is used as the previous frame image;
  • Step S1019 judging whether the number m of the accumulated determined poses is equal to the first preset number 10;
  • Step S1021 when m is equal to 10, search for a common view relationship in 10 target images whose poses are successfully determined, and determine the target image pair;
  • Step S1023 reproject the depth information of the target image in the target image pair, and calculate the reprojection error
  • Step S1025 when the calculated re-projection error is less than or equal to 1/460, use the depth information to back-project the feature points to restore map points;
  • Step S1027 when the calculated reprojection error is greater than 1/460, use the triangulation method to restore map points;
  • Step S1029 performing local BA adjustment on the poses corresponding to the 10 target images through the restored map points
  • Step S1031 initialize the IMU through the optimized 10 poses, and calculate the initial velocity, gravity vector, and bias corresponding to the IMU;
  • Step S1033 adjusting the gravity direction of all poses and map points according to the gravity vector
  • Step S1035 output the pose.
  • it should be noted that before step S1035, that is, before the IMU is initialized, the pose corresponding to the current frame image determined from the current frame image and the previous frame image may be output directly; alternatively, after the IMU is initialized, the pose may be gravity-adjusted and then output.
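Under the assumption that each step is available as a callable, the control flow of steps S1001 to S1035 can be condensed into the runnable skeleton below. All helper functions are hypothetical placeholders for the operations the steps name, and the preset numbers follow the worked example (10 poses, 3 consecutive failures).

```python
# Control-flow skeleton of steps S1001-S1035. The helpers (get_frame,
# estimate_pose, recover_map_points, local_ba, init_imu, align_gravity)
# are hypothetical placeholders for the operations named in the steps.

FIRST_PRESET, SECOND_PRESET = 10, 3

def initialize(get_frame, estimate_pose, recover_map_points,
               local_ba, init_imu, align_gravity):
    poses, prev, failures = [], None, 0
    while len(poses) < FIRST_PRESET:                    # S1019
        frame = get_frame()                             # S1001/S1003
        if prev is None:                                # S1005: first frame
            prev = frame
            poses.append("identity")                    # preset pose
            continue
        pose = estimate_pose(prev, frame)               # S1007: flow + PnP
        if pose is None:                                # S1009/S1011
            failures += 1
            if failures == SECOND_PRESET:               # S1013: clear all
                poses, prev, failures = [], None, 0
            continue                                    # S1015: drop frame
        failures = 0
        poses.append(pose)                              # S1017
        prev = frame
    pts = recover_map_points(poses)                     # S1021-S1027
    poses = local_ba(poses, pts)                        # S1029
    velocity, gravity, bias = init_imu(poses)           # S1031
    return align_gravity(poses, gravity)                # S1033/S1035
```

Note that after a clear (S1013) the sketch treats the next received frame as a fresh first frame, mirroring the restart of the frame-by-frame calculation described above.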
  • the public data set A ⁇ B ⁇ C (hereinafter referred to as the STU data set) released by ShanghaiTech University (STU) is used for experimental verification, and the results shown in Table 1 can be obtained. Based on Table 1, the following conclusions can be drawn:
  • the present embodiment achieves a significant improvement in the time consumed by the initialization process.
  • compared with VINS-MONO, the average speed is increased by 6 to 7 times; compared with VINS-RGBD, the average speed is increased by 3 to 4 times.
  • the time at which this embodiment outputs the camera pose is much earlier than that of VINS-MONO and VINS-RGBD; the pose is output at least 30 frames earlier.
  • the pose can be output as soon as the system starts running.
  • in terms of accuracy, the final overall trajectory estimation accuracy of this embodiment is basically the same as that of the VINS-RGBD scheme and is better than VINS-MONO in most cases. When the depth quality of the received images is poor, this embodiment combines triangulation with depth information to recover map points, which recovers more three-dimensional map points while ensuring accuracy and brings a significant accuracy improvement over the VINS-RGBD scheme.
  • VINS-MONO has two sets of trajectories that failed to be initialized successfully and failed to track
  • VINS-RGBD has three sets of trajectories that failed to initialize successfully and failed to track.
  • in contrast, this embodiment successfully initialized and tracked on all 15 sets of the data set; that is, this embodiment has better robustness.
  • in most cases, the accuracy of this technical solution is also better than that of the other methods; that is, the present embodiment achieves more accurate results.
  • X represents initialization failure.
  • this exemplary embodiment has the following beneficial effects:
  • the depth information and the triangulation method are used to restore the map points in different situations, which improves the success rate of initialization, and further improves the accuracy and robustness of the final pose.
  • this exemplary embodiment also provides a visual inertial system initialization apparatus 1100, including a pose determination module 1110 and an initialization module 1120, wherein:
  • the pose determination module 1110 may be configured to perform frame-by-frame calculation on the images during the process of receiving images, until the first preset number of poses are obtained.
  • the initialization module 1120 may be configured to determine the motion speed, gravity vector, and bias corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias.
  • the above frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
  • the pose determination module 1110 may be configured to discard the current frame image and retain the previous frame image when the pose corresponding to the current frame image is not successfully determined, and, when a new current frame image is received, to calculate the pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
  • the pose determination module 1110 may be configured to count the number of current frame images whose poses are unsuccessfully determined when the pose corresponding to the current frame image is unsuccessfully determined, and, when that number equals the second preset number, to clear the determined poses and continue to receive new current frame images.
  • the pose determination module 1110 may be configured to reset the number of current frame images whose poses are not successfully determined when the pose corresponding to the current frame image is successfully determined.
  • the pose determination module 1110 may be configured to perform feature matching between the feature points of the current frame image and those of the previous frame image to obtain matching feature points; determine the map points matching the feature points according to the depth information of the previous frame image; and project the map points to the current frame image to determine the pose corresponding to the current frame image.
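The matching-and-projection step this module performs can be illustrated by the residual that PnP minimizes: feature points of the previous frame are lifted to map points using their depth, then projected into the current frame under a candidate pose (R, t). The intrinsics and test geometry below are assumed values for illustration.

```python
# Sketch of the residual that PnP minimizes: map points from the previous
# frame's depth, reprojected into the current frame under a candidate pose.
# Intrinsics below are assumed example values.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

def lift(u, v, depth):
    """Previous-frame pixel + depth -> map point in the previous camera frame."""
    return ((u - CX) / FX * depth, (v - CY) / FY * depth, depth)

def project(R, t, p):
    """Map point -> pixel in the current frame under pose (R, t)."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (FX * x / z + CX, FY * y / z + CY)

def residuals(R, t, matches):
    """matches: list of ((u_prev, v_prev, depth), (u_cur, v_cur)) pairs.
    Returns the per-match reprojection residuals that PnP would minimize."""
    out = []
    for (u, v, d), (uc, vc) in matches:
        uh, vh = project(R, t, lift(u, v, d))
        out.append((uh - uc, vh - vc))
    return out

# Tiny demo: observations generated under an assumed true pose (identity
# rotation, 0.1 m translation along x) give zero residuals; a wrong pose
# does not.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_true = [0.1, 0.0, 0.0]
matches = []
for u, v, d in [(300.0, 200.0, 2.0), (340.0, 260.0, 3.0)]:
    matches.append(((u, v, d), project(I3, t_true, lift(u, v, d))))
res_true = residuals(I3, t_true, matches)
res_bad = residuals(I3, [0.0, 0.0, 0.0], matches)
```

A PnP solver (with RANSAC for outlier rejection, as is common practice) would iterate on (R, t) to drive these residuals toward zero.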
  • the visual-inertial system initialization apparatus 1100 may further include a pose optimization module, configured to perform map point recovery on the target images corresponding to the first preset number of poses so as to obtain the recovered map points; a local bundle adjustment is constructed according to the map points to optimize the poses and obtain the optimized poses.
  • the pose optimization module may be configured to find a co-visibility matching relationship between the target images corresponding to the first preset number of poses to obtain at least one target image pair; for each target image pair, to reproject using the depth information of the target images in the pair and calculate the reprojection error; and, when the reprojection error is less than or equal to the preset threshold, to back-project the feature points using the depth information to recover the map points of the target images.
  • the pose optimization module may be configured to perform map point recovery on the target image through triangulation when the reprojection error is greater than a preset threshold.
  • the pose determination module 1110 may be configured to determine the pose corresponding to the first frame of image as a preset pose.
  • the initialization module 1120 may be configured to adjust the gravitational direction of the pose according to the gravitational vector.
  • aspects of the present disclosure may be implemented as a system, method, or program product. Therefore, various aspects of the present disclosure can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present disclosure, for example, any one or more of the steps in FIG. 3, FIG. 4, and FIG. 10.
  • the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user computing device, partly on the user device as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

Abstract

A visual inertial system initialization method, a visual inertial system initialization apparatus, a computer-readable storage medium, and an electronic device, which relate to the field of visual positioning technology. The method comprises: when receiving images, performing frame-by-frame calculation on the images until a first preset number of poses are obtained (S310); and determining a motion speed, a gravity vector, and a bias corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, the gravity vector, and the bias (S320). In the present method, a pose can be output before the visual inertial system has completed initialization; moreover, the time interval between received images is fully used, shortening the time it takes to obtain the first preset number of poses and thereby accelerating the initialization of the visual inertial system.

Description

Visual-inertial system initialization method and device, medium and electronic device
Cross reference
This disclosure claims priority to Chinese patent application No. 202110190368.9, filed on February 18, 2021 and titled "Visual Inertial System Initialization Method and Device, Medium and Electronic Equipment", the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of visual positioning, and in particular, to a visual inertial system initialization method, a visual inertial system initialization device, a computer-readable medium, and an electronic device.
Background
At present, indoor positioning technology is a rigid requirement for mobile devices such as mobile phones, AR glasses, and indoor robots. In an indoor environment, a mobile device cannot determine its own position through a global positioning technology such as GPS (Global Positioning System), and can only rely on the device's own sensors to achieve positioning. On a mobile phone or AR glasses, the most direct and easily obtained data are visual sensor (camera, etc.) data and inertial sensor (IMU, Inertial Measurement Unit) data, both of which can be combined with algorithms to achieve positioning. Before 2017, positioning technology using only vision sensors developed rapidly, but as the technology advanced, the inherent defects of vision sensors were also exposed, and cameras alone can no longer break through the bottleneck faced by current positioning technology. Likewise, the same bottleneck occurs with techniques that use only the IMU for localization.
Therefore, in recent years the industry has developed VIO (Visual-IMU Odometry, i.e., visual-inertial fusion odometry) technology, that is, technology that uses both visual sensors and an IMU for fused positioning. This technology is widely applied in indoor navigation, augmented reality, robotics, and even autonomous driving.
Summary
According to a first aspect of the present disclosure, there is provided a visual inertial system initialization method, comprising: in the process of receiving images, performing frame-by-frame calculation on the images until a first preset number of poses are obtained; and determining the motion speed, gravity vector, and bias corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
According to a second aspect of the present disclosure, there is provided a visual inertial system initialization apparatus, comprising: a pose determination module, configured to perform frame-by-frame calculation on images during the process of receiving images until a first preset number of poses are obtained; and an initialization module, configured to determine the motion speed, gravity vector, and bias corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector, and bias; wherein the frame-by-frame calculation includes: when a current frame image is received, extracting the feature points of the current frame image and the depth information corresponding to the feature points; determining the pose corresponding to the current frame image based on the feature points and depth information of the current frame image and those of the previous frame image; and taking the current frame image as the previous frame image and continuing to receive new current frame images.
According to a third aspect of the present disclosure, there is provided a computer-readable medium on which a computer program is stored, and when the computer program is executed by a processor, the above-mentioned method is implemented.
According to a fourth aspect of the present disclosure, there is provided an electronic device, comprising:
a processor; and
a memory for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement the above-mentioned method.
Brief description of the drawings
FIG. 1 shows a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 3 schematically shows a flowchart of a visual inertial system initialization method in an exemplary embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of the frame-by-frame calculation in an exemplary embodiment of the present disclosure;
FIG. 5 schematically shows the principle of determining the pose corresponding to the current frame image from the current frame image and the previous frame image in an exemplary embodiment of the present disclosure;
FIG. 6 schematically shows a frame-by-frame calculation process in an exemplary embodiment of the present disclosure;
FIG. 7 schematically shows another frame-by-frame calculation process in an exemplary embodiment of the present disclosure;
FIG. 8 schematically shows still another frame-by-frame calculation process in an exemplary embodiment of the present disclosure;
FIG. 9 schematically shows a map point recovery process in an exemplary embodiment of the present disclosure;
FIG. 10 schematically shows a flowchart of another visual inertial system initialization method in an exemplary embodiment of the present disclosure;
FIG. 11 schematically shows the composition of a visual inertial system initialization apparatus in an exemplary embodiment of the present disclosure.
Detailed description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated descriptions of them will be omitted. Some of the block diagrams shown in the figures are functional entities that do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 shows a schematic diagram of the system architecture of an exemplary application environment to which the visual inertial system initialization method and apparatus of embodiments of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber optic cables. The terminal devices 101, 102, and 103 may be various terminal devices having a visual inertial system, including but not limited to desktop computers, portable computers, smart phones, and tablet computers. It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs. For example, the server 105 may be a server cluster composed of multiple servers.
The visual inertial system initialization method provided by the embodiments of the present disclosure is generally performed by the terminal devices 101, 102, and 103, and accordingly the visual inertial system initialization apparatus is generally provided in the terminal devices 101, 102, and 103. However, those skilled in the art will readily understand that the method may also be executed by the server 105, in which case the apparatus may also be provided in the server 105; this exemplary embodiment imposes no special limitation on this. For example, in an exemplary embodiment, a user may collect images through the visual sensors of the terminal devices 101, 102, 103 and send the images to the server 105, so that the server 105 performs pose calculation and sends the results back to the terminal devices 101, 102, 103; the terminal devices then determine the motion speed, gravity vector, and bias corresponding to the inertial measurement unit from the poses sent by the server 105, and thereby initialize the visual inertial system.
Exemplary embodiments of the present disclosure provide an electronic device for implementing the visual-inertial system initialization method, which may be a terminal device 101, 102, or 103, or the server 105, in FIG. 1. The electronic device includes at least a processor and a memory; the memory stores executable instructions of the processor, and the processor is configured to perform the visual-inertial system initialization method by executing those instructions.
The structure of the electronic device is illustrated below by taking the mobile terminal 200 in FIG. 2 as an example. Those skilled in the art will understand that, apart from the components intended specifically for mobile use, the configuration in FIG. 2 can also be applied to stationary devices. In other embodiments, the mobile terminal 200 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of the two. The interface connections between the components are shown only schematically and do not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also adopt interface connections different from those in FIG. 2, or a combination of multiple interface connection types.
As shown in FIG. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, a subscriber identification module (SIM) card interface 295, and the like. The sensor module 280 may include a depth sensor 2801, an inertial sensor 2802, a gyroscope sensor 2803, and so on.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated in one or more processors.
A memory is provided in the processor 210. The memory may store instructions for implementing six modular functions: detection, connection, information management, analysis, data transmission, and notification, with their execution controlled by the processor 210.
The mobile terminal 200 may implement a shooting function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like. The ISP processes the data fed back by the camera module 291; the camera module 291 captures still images or videos; the digital signal processor processes digital signals, including digital image signals as well as other digital signals; and the video codec compresses or decompresses digital video, where the mobile terminal 200 may support one or more video codecs. In some embodiments, the camera module 291 may serve as the visual sensor of the visual-inertial system and be used for image acquisition.
The depth sensor 2801 is used to acquire depth information of the scene. In some embodiments, the depth sensor may be disposed in the camera module 291 and acquire the depth information corresponding to an image at the same time the image is captured.
The inertial sensor 2802, also called an inertial measurement unit (IMU), may be used to detect and measure acceleration and rotational motion.
The gyroscope sensor 2803 may be used to determine the motion attitude of the mobile terminal 200. In addition, sensors with other functions may be disposed in the sensor module 280 according to actual needs, such as a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, and a bone conduction sensor.
In the field of visual positioning, researchers have made many attempts. For example, on the technical side, Apple launched the augmented reality development kit ARKit at the WWDC 2017 conference, with three main functions: pose estimation, environment understanding, and light estimation. The core is the pose estimation function, which uses VIO technology to provide positioning for mobile AR applications by fusing the device's camera images with information from its motion sensors. As another example, also in 2017, Google announced ARCore, an augmented reality SDK positioned against ARKit, which likewise includes the three main functions of pose estimation, environment understanding, and light estimation. For pose estimation, ARCore uses a monocular camera and an IMU sensor: it detects visually distinct feature points in the captured camera images and uses these points to compute changes in position. This visual information is then combined with the inertial measurements of the device IMU to estimate the camera's pose relative to the surrounding world over time. ARCore provides good pose and environment information for AR, indoor navigation, and other applications on Android phones.
Before positioning can be performed with a VIO system, a reference world coordinate system and an initial set of map points must be determined, so that subsequent positioning can be performed in that coordinate system. At the same time, visual positioning alone cannot determine the true scale of the pose. Once the IMU is fused in, the characteristics of the IMU sensor itself require determining the IMU's initial velocity, the IMU's deviation rate, the gravity vector, and the scale information, so that the true scale of the visually determined poses can be confirmed or adjusted. This process, the initialization process, is a necessary step for running any SLAM or VIO system.
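The quantities this initialization process must recover can be summarized in a small container. The sketch below is purely illustrative; the class and field names are assumptions made for exposition, not part of the disclosed method:

```python
from dataclasses import dataclass

@dataclass
class InitState:
    # Quantities VIO initialization must recover (illustrative names only).
    velocity: list      # initial IMU velocity in the world frame, m/s
    gravity: list       # gravity vector in the world frame, m/s^2
    gyro_bias: list     # gyroscope deviation rate (bias), rad/s
    accel_bias: list    # accelerometer deviation rate (bias), m/s^2
    scale: float = 1.0  # metric scale; already known (1.0) when depth is available

state = InitState(velocity=[0.0, 0.0, 0.0], gravity=[0.0, 0.0, -9.81],
                  gyro_bias=[0.0, 0.0, 0.0], accel_bias=[0.0, 0.0, 0.0])
print(state.scale)  # 1.0
```

With a depth camera, `scale` stays at its known default, which is exactly the simplification the method below exploits.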
In the related art, there are generally the following two initialization methods:
The first is applied in the VINS system. The specific process is as follows. Accumulate 10 frames of images and select from them two frames L and R with sufficient parallax, then use the epipolar geometry constraint to solve for the relative pose between these two frames. Using this pose, triangulate to recover some of the map points co-visible between the two frames. Project these map points onto frames other than frames L and R, compute each such frame's pose by minimizing the reprojection error, then triangulate between that frame and frames L and R to recover more map points. Repeating this process yields the poses of the 10 frames and the map points corresponding to those 10 frames. Finally, rotation constraints and translation constraints are used to align the previously determined poses of the 10 frames with the IMU: these poses are taken as accurate and used to constrain the unknown IMU variables, and all unknowns are finally determined by solving a non-homogeneous system of equations via SVD decomposition.
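As a concrete illustration of the triangulation step in this related-art pipeline, the following sketch recovers one 3D map point from its projections in two views by the standard linear (DLT) method. The camera matrices and the point are synthetic example values, not data from the patent:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)     # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]             # homogeneous -> Euclidean

# Two views: identity pose and a 1 m baseline along x (arbitrary example).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true, atol=1e-6))  # True
```

Note that this solve, repeated for every map point, is exactly the per-point cost that drawback 4 below attributes to the monocular pipeline.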
The second is applied in the RGBD-VINS system. The specific process is as follows. After the system has accumulated 10 frames of images, two frames L and R with sufficient parallax are selected from them. Because an RGBD-VINS system includes a depth camera, the depth of every frame is known, so it is no longer necessary to solve the relative pose of two frames from 2D-to-2D epipolar geometry constraints; the pose can be solved directly by minimizing the 3D-to-2D reprojection error, and the resulting pose is directly scale-determinate. In recovering map points, triangulation is likewise no longer used; the depth camera information is used directly, and when a co-visibility relationship exists between two frames, the corresponding depth values recover the three-dimensional coordinates of the co-visible points, i.e., the map points corresponding to each frame. Finally, when the IMU is initialized, the depth information has already fixed the scale of the poses, so the IMU is no longer needed to estimate scale; the scale is set as a known quantity rather than an unknown, and only the IMU's initial velocity, gravity vector, and deviation rate are solved for.
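The depth-based map-point recovery described here amounts to back-projecting a pixel through the pinhole camera model using its measured depth. A minimal sketch with made-up intrinsics:

```python
import numpy as np

def backproject(u, v, depth, K):
    """Recover the 3D camera-frame point for pixel (u, v) with metric depth,
    by inverting the pinhole projection x = K X / Z."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P = backproject(420.0, 300.0, 2.0, K)  # pixel 100 px right, 60 px below center
print(P)  # [0.4  0.24 2.  ]
```

Because `depth` is metric, the recovered point, and hence any pose solved from such points, carries true scale with no triangulation required.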
However, both of the above initialization methods have technical limitations.
The first method has mainly the following drawbacks. 1. Initializing the system with a monocular camera requires that the images used for initialization have a certain parallax between them and that there are enough matching points between the two frames; only then can initialization succeed. Detecting parallax and finding more matching points consumes more time, and if the requirements are not met the initialization may fail and restart, resulting in a low initialization success rate. 2. The initialization process only begins after 10 consecutive frames have been accumulated, so the system cannot output a pose as soon as it starts running. 3. Visual initialization with a monocular camera leaves the pose scale undetermined; even though fusing in the IMU information later can recover the true scale, the large IMU noise on terminal devices makes it difficult to compute the scale accurately and also increases the computational burden. 4. In a monocular system, all map points needed during initialization are recovered by the triangulation algorithm; when there are many map points, this consumes a great deal of computation time, making the initialization process slow.
The second method has the following drawbacks. 1. Ten frames must still be accumulated before initialization can begin; until the 10 frames are full, the system does no work, wasting a great deal of time. Parallax must also be detected across the 10 frames, so in low-parallax scenes initialization can still fail. 2. The system cannot output a pose as soon as it starts running. 3. Map-point recovery relies entirely on the depth camera information. When the depth camera is noisy, or many map points lie outside the depth camera's measurement range, the number of map points usable for pose computation may drop sharply, leading to large errors in the computed results or even failure to compute them at all.
To address one or more of the above problems, this example embodiment provides a visual-inertial system initialization method. The method may be applied to the server 105 described above, or to one or more of the terminal devices 101, 102, and 103; this exemplary embodiment imposes no particular limitation in this regard. Referring to FIG. 3, the visual-inertial system initialization method may include the following steps S310 and S320:
In step S310, while images are being received, frame-by-frame computation is performed on the images until a first preset number of poses is obtained.
Referring to FIG. 4, the above frame-by-frame computation may include the following steps S410 to S430:
In step S410, when a current frame image is received, the feature points of the current frame image and the depth information corresponding to those feature points are extracted.
In an exemplary embodiment, since the incoming image frame rate of the VIO system is around 10 Hz, the interval between frames is about 100 ms. In the related art, therefore, accumulating 10 frames takes about 1000 ms, and initialization can only begin after that. To avoid wasting time on image accumulation, each frame can be processed as soon as it is received. Specifically, when a current frame image is received, the feature points in the current frame are extracted first, and at the same time the depth information corresponding to those feature points is extracted from the depth information collected by the depth sensor.
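This per-frame step can be sketched as sampling the depth map at the detected feature locations and discarding features with no valid depth reading. The detector itself is abstracted away here and the data are synthetic:

```python
def extract_features_with_depth(keypoints, depth_map):
    """Pair each detected feature (u, v) with the depth measured at that pixel,
    dropping features where the depth sensor returned no valid reading (0)."""
    out = []
    for (u, v) in keypoints:
        d = depth_map[v][u]
        if d > 0:                       # 0 marks an invalid depth reading
            out.append(((u, v), d))
    return out

# Toy 4x4 depth map (meters) with one invalid pixel at (1, 1).
depth_map = [[2.0, 2.0, 2.1, 2.1],
             [2.0, 0.0, 2.1, 2.2],
             [1.9, 2.0, 2.0, 2.1],
             [1.9, 1.9, 2.0, 2.0]]
feats = extract_features_with_depth([(1, 1), (2, 3)], depth_map)
print(feats)  # [((2, 3), 2.0)] -- the (1, 1) feature is dropped
```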
In step S420, the pose corresponding to the current frame image is determined based on the feature points and depth information of the current frame image and the feature points and depth information of the previous frame image.
In an exemplary embodiment, after the feature points of the current frame image and the depth information of each feature point have been extracted, the pose corresponding to the current frame image can be determined based on the feature points and depth information of the current frame image together with the feature points of the previous frame image, i.e., the frame received immediately before the current frame image, and the depth information corresponding to each of its feature points.
In an exemplary embodiment, when determining the pose corresponding to the current frame image from the feature points and depth information of the current frame image and those of the previous frame image, the feature points of the current frame image may first be matched against the feature points of the previous frame image to obtain matched feature points. At this point, since the pose of the previous frame image has already been determined, the depth information corresponding to the previous frame's feature points can be used to determine the map points of the matched feature points in the three-dimensional coordinate system; these map points are then projected into the current frame image to determine the pose corresponding to the current frame image.
For example, referring to FIG. 5, optical flow matching may be performed between the feature points of the current frame image and those of the previous frame image. The map points corresponding to the matched feature points are then determined from the depth information of the previous frame's feature points and reprojected onto the current frame image. Finally, the pose corresponding to the current frame image is computed by PnP (Perspective-n-Point), i.e., by minimizing the reprojection error.
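The quantity that PnP minimizes can be written out directly: each map point is projected into the current frame under a candidate pose and the pixel residual is measured. The sketch below only evaluates this error on synthetic data; it is not a full PnP solver:

```python
import numpy as np

def reprojection_error(R, t, K, points_3d, observations):
    """Mean pixel error of map points projected into the current frame
    under candidate pose (R, t): x ~ K (R X + t)."""
    total = 0.0
    for X, obs in zip(points_3d, observations):
        p = K @ (R @ X + t)
        uv = p[:2] / p[2]
        total += np.linalg.norm(uv - obs)
    return total / len(points_3d)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R_true, t_true = np.eye(3), np.array([0.1, 0.0, 0.0])  # small sideways motion
points = [np.array([0.2, -0.1, 3.0]), np.array([-0.3, 0.2, 4.0])]
obs = [(lambda p: p[:2] / p[2])(K @ (R_true @ X + t_true)) for X in points]
# The error vanishes at the true pose and is positive at a wrong one,
# which is what the PnP optimization exploits.
print(reprojection_error(R_true, t_true, K, points, obs))           # 0.0
print(reprojection_error(R_true, np.zeros(3), K, points, obs) > 1)  # True
```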
The above three-dimensional coordinate system refers to the world coordinate system, which can generally be determined from the camera coordinate system corresponding to the first frame image input to the visual-inertial system; all poses determined from the second frame onward are relative poses with respect to this world coordinate system. Specifically, when the received current frame image is the first frame image, the camera coordinate system corresponding to the first frame can be obtained directly and set as the world coordinate system.
It should be noted that when the current frame image is received, its depth information is available from the depth camera, so the epipolar constraint is not needed to solve the pose between the two frames. Even if the parallax between the two images is very small, map points can be projected using the depth information and the relative pose solved by minimizing the reprojection error. It is therefore unnecessary to check in advance whether the parallax between the current frame image and the previous frame image is sufficient.
In step S430, the current frame image is taken as the previous frame image, and a new current frame image continues to be received.
In an exemplary embodiment, since a sufficient number of poses is required to initialize the IMU, before the motion velocity, gravity vector, and deviation rate corresponding to the inertial measurement unit are determined, the pose determination must be repeated for multiple current frame images until the number of determined poses equals the first preset number required for IMU initialization. Specifically, during this repetition, after a current frame image is received and its corresponding pose determined, that current frame image is taken as the previous frame image, so that the pose of the next received current frame image can be determined against the most recently received frame.
Referring to FIG. 6, when the received current frame image is the 2nd frame, its pose can be determined from the feature points and corresponding depth information of the 2nd frame together with those of the previous frame image (the 1st frame). The 2nd frame is then used as the previous frame image, so that when the received current frame image is the 3rd frame, the previous frame image has already changed from the 1st frame to the 2nd frame, and the pose of the 3rd frame can be determined directly from the feature points and depth information of the 3rd frame and of the previous frame image (the 2nd frame). This process is repeated, determining the pose of each successively input current frame in a frame-by-frame manner, until, after the pose of the i-th frame is determined, a total of the first preset number of poses has been obtained, at which point the process stops.
In addition, when the received current frame image is the first frame image, there is no previous frame image to serve as a reference, so the pose corresponding to the first frame image can be set directly to a preset pose, which may be the identity matrix. It should be noted that after the pose of the first frame image is set to the preset pose, the preset pose is also counted toward the number of poses. That is, assuming the first preset number is 10, besides the preset pose, 9 more poses need to be determined through the repeated determination described above; the preset pose plus the 9 determined poses together equal the first preset number of 10, at which point the repeated pose determination can stop and other subsequent processing can proceed.
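Steps S410 to S430 together form a simple accumulation loop. The sketch below mocks the per-frame solver to show only the control flow; the function names and the stand-in pose values are assumptions:

```python
IDENTITY = "I"  # stands in for the identity-matrix preset pose of frame 1

def accumulate_poses(frames, first_preset_number, solve_pose):
    """Frame-by-frame pose accumulation (S410-S430), ignoring solver failures."""
    poses, prev = [], None
    for frame in frames:
        if prev is None:
            poses.append(IDENTITY)       # first frame: preset pose, also counted
        else:
            poses.append(solve_pose(prev, frame))
        prev = frame                     # S430: current frame becomes previous
        if len(poses) == first_preset_number:
            break                        # enough poses for IMU initialization
    return poses

# Mock solver: the "pose" is just the frame pair, enough to show the flow.
poses = accumulate_poses(range(1, 15), 10, lambda prev, cur: (prev, cur))
print(len(poses), poses[0], poses[1])  # 10 I (1, 2)
```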
In an exemplary embodiment, the received current frame image may contain noise or other disturbances, under which the pose computation may fail; for example, noise may cause the PnP minimization of the reprojection error to fail. If, while the number of determined poses is less than the first preset number, a current frame image appears for which the pose could not be successfully determined, that current frame image can be discarded without performing the step S430 operation of taking it as the previous frame image. That is, the existing previous frame image is retained, a new current frame image is received, and the pose corresponding to the new current frame image is computed from the new current frame image and the retained previous frame image.
For example, referring to FIG. 7, when the current frame image is the 6th frame a, the previous frame image is the 5th frame. If determining the pose of the 6th frame a from the 6th frame a and the 5th frame fails, the noise contained in the 6th frame may be relatively large; the next received current frame image is therefore taken as a new 6th frame b and the computation continues, determining the pose corresponding to the new 6th frame b. The new 6th frame b then serves as the previous frame image, and the frame-by-frame computation continues.
Further, in an exemplary embodiment, noise interference may also cause multiple consecutive received frames to fail pose determination. In this case, when the pose corresponding to a current frame image cannot be successfully determined, the number of current frame images that failed pose determination can be counted. When the number of such frames equals a second preset number, all poses determined so far can be cleared, i.e., the pose count is reset to 0. A new current frame image is then received, the frame-by-frame computation described above is restarted, and poses are accumulated anew until their number equals the first preset number.
It should be noted that, in an exemplary embodiment, there are two ways the count can equal the second preset number: either the cumulative number of current frame images that failed pose determination equals the second preset number, or a consecutive run of the second preset number of current frame images fails to determine a corresponding pose. In the second approach, to satisfy the condition that the failures be consecutive, counting begins after the first current frame image for which pose determination fails; if, before the count reaches the second preset number, any current frame image succeeds in determining its corresponding pose, the previously accumulated count of failed current frame images can be reset.
For example, referring to FIG. 8, suppose the second preset number is 3 and the previous frame image is the 5th frame. If the current frame image (the first 6th frame d) fails pose determination, the count of failed current frame images is n = 1; if a newly received current frame image (the second 6th frame e) again fails, then n = 1 + 1 = 2; if yet another new current frame image (the third 6th frame f) still fails, then n = 2 + 1 = 3, and n equals the second preset number, so all poses determined from the previous 5 frames must be deleted. Afterward, the first current frame image received is used as a new first frame image, the frame-by-frame computation described above restarts, and determined poses are accumulated anew until, after the pose of the i-th frame image is determined, the accumulated number of poses equals the first preset number.
Furthermore, in the above example, if before n reaches 3 any current frame image succeeds in computing its corresponding pose, n can be reset to 0. Specifically, for example, when n = 2 and a new current frame image (a third 6th frame) is received and successfully determines its corresponding pose, n = 2 is reset to n = 0. That third 6th frame can then serve as the previous frame image: a new current frame image is obtained, and its pose is determined from the new current frame image and the previous frame image (the third 6th frame).
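The failure-handling policy of FIGS. 7 and 8 can be condensed into a small state machine: a failed frame is dropped while the previous frame is kept, a success resets the consecutive-failure counter, and reaching the second preset number of consecutive failures clears all accumulated poses. A sketch under assumed names:

```python
class PoseAccumulator:
    """Consecutive-failure handling for frame-by-frame pose computation."""
    def __init__(self, max_consecutive_failures):
        self.max_fail = max_consecutive_failures  # the second preset number
        self.poses = []
        self.failures = 0

    def on_frame(self, pose_or_none):
        if pose_or_none is None:          # solver failed: drop this frame,
            self.failures += 1            # keep the retained previous frame
            if self.failures == self.max_fail:
                self.poses.clear()        # clear all accumulated poses, restart
                self.failures = 0
        else:                             # success resets the failure counter
            self.failures = 0
            self.poses.append(pose_or_none)

acc = PoseAccumulator(max_consecutive_failures=3)
for result in ["p1", "p2", None, None, "p3", None, None, None]:
    acc.on_frame(result)
print(acc.poses)  # [] -- three consecutive failures cleared p1..p3
```

The mid-run success ("p3") resets the counter, so only the final run of three failures triggers the clear, matching the FIG. 8 example.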
通过设置合理的位姿计算失败处理机制，能够对位姿计算失败的情况进行预防和处理，提高了初始化的成功率，使得初始化更加鲁棒。By providing a reasonable mechanism for handling pose-calculation failures, such failure cases can be prevented and handled, which improves the success rate of initialization and makes the initialization more robust.
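The counting-and-reset behavior described above can be sketched as a small state machine. The following is a minimal illustration only; the class and all names are hypothetical, and the patent specifies only the counting rules (increment on failure, clear everything at the second preset number, reset the counter on any success):

```python
MAX_FAILURES = 3  # the "second preset number" in the example above


class PoseTracker:
    def __init__(self):
        self.poses = []    # accumulated per-frame poses
        self.failures = 0  # consecutive frames whose pose estimation failed

    def on_frame(self, pose_or_none):
        """Feed the pose result of one received frame (None = failure)."""
        if pose_or_none is None:
            self.failures += 1
            if self.failures >= MAX_FAILURES:
                self.poses.clear()   # delete all previously determined poses
                self.failures = 0    # restart accumulation from scratch
        else:
            self.failures = 0        # any success resets the counter
            self.poses.append(pose_or_none)
        return len(self.poses)
```

For instance, two failures followed by one success leave the accumulated poses intact, while three consecutive failures clear them, matching the n=3 example above.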
在步骤S320中，根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率，以根据运动速度、重力向量和偏离率对视觉惯性系统进行初始化。In step S320, the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit are determined according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector and deviation rate.
在一示例性实施例中，在根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率时，可以将第一预设数量的位姿与惯性测量单元进行对齐，利用旋转约束和平移约束计算惯性测量单元的运动速度、重力向量和偏离率。此外，在IMU初始化时，由于深度信息已经确定了位姿的尺度，因此不需要确定位姿的尺度。In an exemplary embodiment, when determining the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, the first preset number of poses may be aligned with the inertial measurement unit, and rotation constraints and translation constraints are used to calculate the motion speed, gravity vector and deviation rate of the inertial measurement unit. In addition, during IMU initialization, the scale of the poses does not need to be determined, because the depth information has already fixed the scale of the poses.
需要说明的是，在将第一预设数量的位姿与惯性测量单元进行对齐之前，可以对惯性测量单元进行更加精准的内外参标定，以获取准确的惯性测量单元的加速度偏离率、角速度偏离率以及惯性测量单元与相机之间的外参变换，进而使得视觉与惯性测量单元联合初始化过程更快收敛，精度也会有进一步的提升。It should be noted that, before the first preset number of poses are aligned with the inertial measurement unit, a more accurate intrinsic and extrinsic calibration of the inertial measurement unit may be performed to obtain an accurate acceleration deviation rate and angular velocity deviation rate of the inertial measurement unit, as well as the extrinsic transformation between the inertial measurement unit and the camera, so that the joint visual-inertial initialization process converges faster and the accuracy is further improved.
在一示例性实施例中，为了能够得到更准确的运动速度、重力向量和偏离率进行初始化，需要通过更加准确的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率。具体的，可以在根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率之前，先对第一预设数量的位姿对应的目标图像进行地图点恢复，获取各个目标图像对应的地图点，然后根据恢复的地图点构建局部集束调整，以对第一预设数量的位姿进行优化得到优化后的位姿，最后根据优化后的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率。In an exemplary embodiment, in order to obtain a more accurate motion speed, gravity vector and deviation rate for initialization, the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit need to be determined from more accurate poses. Specifically, before the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit are determined according to the first preset number of poses, map-point recovery may first be performed on the target images corresponding to the first preset number of poses to obtain the map points corresponding to each target image; then a local bundle adjustment is constructed from the recovered map points to optimize the first preset number of poses and obtain optimized poses; finally, the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit are determined according to the optimized poses.
其中，目标图像可以包括成功确定第一预设数量位姿的当前帧图像。例如，假设为了确定第一预设数量的位姿共接收12个当前帧图像，其中有2个未成功确定位姿的当前帧图像，则其余的10个成功确定对应位姿的当前帧图像即为目标图像。The target images may include the current frame images for which the first preset number of poses were successfully determined. For example, assume that a total of 12 current frame images are received in order to determine the first preset number of poses, and 2 of them fail to determine their poses; then the remaining 10 current frame images whose corresponding poses were successfully determined are the target images.
在一示例性实施例中，由于一个当前帧图像确定一个位姿，因此目标图像的数量也为第一预设数量。此时，参照图9所示，可以先在第一预设数量i个目标图像之间查找存在共视匹配关系的目标图像，以获取至少一对目标图像对。然后针对每对目标图像对，利用目标图像对中的目标图像的深度信息进行重投影，并计算重投影误差。在重投影误差较小，即小于等于预设阈值时，可以认为深度信息的误差较小，因此可以利用深度信息进行地图点恢复。此时，可以利用深度信息对特征点进行反投影，以对目标图像对进行地图点恢复；相反的，在重投影误差较大，即大于预设阈值时，可以认为深度信息的误差较大，因此不宜通过深度信息进行地图点恢复，此时可以采用三角化法对目标图像对进行地图点恢复。然后将恢复的所有地图点作为地图点集合，用于构建局部集束调整，以对位姿进行优化。In an exemplary embodiment, since each current frame image determines one pose, the number of target images is also the first preset number. At this time, referring to FIG. 9, target images having a co-visibility matching relationship may first be searched among the first preset number i of target images to obtain at least one target image pair. Then, for each target image pair, reprojection is performed using the depth information of the target images in the pair, and the reprojection error is calculated. When the reprojection error is small, i.e., less than or equal to a preset threshold, the error of the depth information can be considered small, so the depth information can be used for map-point recovery; in this case, the feature points are back-projected using the depth information to recover map points for the target image pair. Conversely, when the reprojection error is large, i.e., greater than the preset threshold, the error of the depth information can be considered large, so it is not suitable to recover map points from the depth information; in this case, a triangulation method can be used to recover map points for the target image pair. All recovered map points are then used as a map-point set to construct a local bundle adjustment for optimizing the poses.
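The selection rule above can be sketched concretely for a single matched feature. This is an illustrative sketch only, assuming a pinhole camera with intrinsic matrix K and a known relative pose (R, t) from the first to the second image of the pair; all function names are hypothetical:

```python
import numpy as np


def backproject(u, v, depth, K):
    """Lift pixel (u, v) with measured depth to a 3D point in camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])


def reproj_error(point_cam, u, v, K):
    """Reprojection error of a camera-frame 3D point, on the normalized plane."""
    x, y, z = point_cam
    proj = np.array([x / z, y / z])
    obs = np.array([(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1]])
    return float(np.linalg.norm(proj - obs))


def triangulate(uv1, uv2, R, t, K):
    """Linear (DLT) triangulation of one match; result in frame-1 coordinates."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t.reshape(3, 1)])
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]


def recover_map_point(uv1, d1, uv2, R, t, K, thresh=1.0 / 460.0):
    """Use the measured depth when its cross-frame reprojection error is small,
    fall back to triangulation otherwise. R, t map frame-1 points into frame 2."""
    p1 = backproject(uv1[0], uv1[1], d1, K)   # candidate from the depth map
    err = reproj_error(R @ p1 + t, uv2[0], uv2[1], K)
    if err <= thresh:
        return p1                             # depth is trustworthy here
    return triangulate(uv1, uv2, R, t, K)     # depth unreliable: triangulate
```

An accurate depth reading passes the gate and is kept directly, while a corrupted depth reading exceeds the threshold and the match is triangulated from the two views instead.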
根据不同的条件选择使用深度信息或三角化法进行地图点恢复，在保证地图点精度的同时增加了地图点的数量；同时，即使VIO系统接收的深度信息质量较差，也能做到精准初始化。Depending on the conditions, either the depth information or the triangulation method is selected for map-point recovery, which increases the number of map points while ensuring their accuracy; meanwhile, even if the depth information received by the VIO system is of poor quality, accurate initialization can still be achieved.
在一示例性实施例中，在进行地图点恢复时，除了上述通过重投影误差确定是否使用深度信息进行地图点恢复之外，还可以利用概率模型对深度信息的不确定性进行建模，然后通过深度信息的不确定性的概率大小确定是否采用深度信息进行地图点恢复。通过概率模型进行不确定性判断，可以减小深度信息带来的噪声，实现更加精准的初始化过程。In an exemplary embodiment, when performing map-point recovery, in addition to deciding through the reprojection error whether to use the depth information as described above, a probability model may also be used to model the uncertainty of the depth information, and whether to use the depth information for map-point recovery is then decided according to the probability of that uncertainty. Judging the uncertainty through a probability model can reduce the noise introduced by the depth information and achieve a more accurate initialization process.
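One minimal way to realize such a probabilistic gate is to model each depth measurement as a Gaussian whose standard deviation grows with range and accept it only below a bound. The quadratic noise model and all constants below are assumptions chosen for illustration, not values taken from the disclosure:

```python
def depth_is_reliable(depth_m, sigma_base=0.01, sigma_slope=0.0025,
                      max_sigma=0.05):
    """Gaussian uncertainty gate for one depth measurement (in meters).

    Models the depth noise standard deviation as growing quadratically with
    range (typical of structured-light depth sensors) and accepts the reading
    only while that modeled sigma stays below a bound.
    """
    sigma = sigma_base + sigma_slope * depth_m ** 2
    return sigma <= max_sigma
```

Under these example constants, near-range readings pass the gate, while far-range readings are rejected and map-point recovery falls back to triangulation.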
其中,预设阈值可以根据实际应用场景、应用环境等进行设定。例如,在深度信息更加可靠时,可以将该预设阈值确定为较大的值;反之在深度信息可靠性较小时,可以选择较小的值作为预设阈值。The preset threshold may be set according to actual application scenarios, application environments, and the like. For example, when the depth information is more reliable, the preset threshold may be determined as a larger value; on the contrary, when the depth information is less reliable, a smaller value may be selected as the preset threshold.
此外,在初始化结束后,还可以根据确定的重力向量对之前确定的所有位姿进行重力方向调整,以输出调整后的位姿。In addition, after the initialization is completed, the gravity direction adjustment can also be performed on all the previously determined poses according to the determined gravity vector, so as to output the adjusted poses.
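The gravity-direction adjustment above can be sketched as building the rotation that maps the estimated gravity vector onto the world -z axis and applying it to every pose position. This assumes the -z gravity convention (an assumption, since the disclosure does not fix one) and uses the standard rotation-between-two-unit-vectors formula; the exactly antiparallel case is omitted for brevity:

```python
import numpy as np


def gravity_alignment(gravity):
    """Rotation that maps the estimated gravity direction onto world -z."""
    g = gravity / np.linalg.norm(gravity)
    target = np.array([0.0, 0.0, -1.0])
    v = np.cross(g, target)
    c = float(np.dot(g, target))
    if np.isclose(c, 1.0):                 # already aligned with -z
        return np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues-style closed form for the rotation between two unit vectors
    return np.eye(3) + vx + vx @ vx / (1.0 + c)


def adjust_positions(positions, gravity):
    """Apply the same world rotation to every pose position (and map point)."""
    R = gravity_alignment(gravity)
    return [R @ np.asarray(p) for p in positions]
```

After this adjustment, the estimated gravity points straight down in the output frame, so the output poses are expressed in a gravity-aligned world frame.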
以下参照图10所示,以第一预设数量为10,第二预设数量为3,预设阈值为1/460为例,对本公开实施例的技术方案进行详细阐述。Referring to FIG. 10 , the technical solutions of the embodiments of the present disclosure will be described in detail by taking the first preset number as 10, the second preset number as 3, and the preset threshold as 1/460 as an example.
步骤S1001,接收当前帧图像;Step S1001, receiving the current frame image;
步骤S1003,对当前帧图像进行特征提取,提取各个特征点对应的深度信息;Step S1003, performing feature extraction on the current frame image, and extracting depth information corresponding to each feature point;
步骤S1005,判断当前帧图像是否为输入系统的第一帧图像;Step S1005, determine whether the current frame image is the first frame image of the input system;
步骤S1007，在当前帧图像不是第一帧图像时，通过将当前帧图像与上一帧图像进行光流匹配和PnP最小化重投影误差的方法，确定当前帧图像对应的位姿；Step S1007, when the current frame image is not the first frame image, determine the pose corresponding to the current frame image by performing optical flow matching between the current frame image and the previous frame image and minimizing the reprojection error via PnP;
步骤S1009,判断当前帧图像是否成功确定对应的位姿;Step S1009, judging whether the current frame image successfully determines the corresponding pose;
步骤S1011,在当前帧图像未成功确定对应的位姿时,判断是否连续3个接收到的当前帧图像均未成功确定对应的位姿;Step S1011, when the current frame image fails to determine the corresponding pose, determine whether the corresponding pose has not been successfully determined for three consecutive received current frame images;
步骤S1013,在连续3个接收到的当前帧图像均未成功确定对应的位姿时,清空之前确定的所有位姿;Step S1013, when the corresponding poses are not successfully determined for three consecutive received current frame images, clear all the poses determined before;
步骤S1015,在仅有1个或连续2个接收到的当前帧图像未成功确定对应的位姿时,丢弃当前帧图像;Step S1015, when only one or two consecutively received current frame images fail to successfully determine the corresponding pose, discard the current frame image;
步骤S1017,在当前帧图像成功确定对应的位姿时,将当前帧图像作为上一帧图像;Step S1017, when the current frame image successfully determines the corresponding pose, the current frame image is used as the previous frame image;
步骤S1019,判断累积确定位姿的数量m是否等于第一预设数量10;Step S1019, judging whether the number m of the accumulated determined poses is equal to the first preset number 10;
步骤S1021,在m等于10时,在10帧成功确定位姿的目标图像中查找共视关系,并确定目标图像对;Step S1021, when m is equal to 10, search for a common view relationship in 10 target images whose poses are successfully determined, and determine the target image pair;
步骤S1023,利用目标图像对中目标图像的深度信息进行重投影,并计算重投影误差;Step S1023, reproject the depth information of the target image in the target image pair, and calculate the reprojection error;
步骤S1025,在计算的重投影误差小于等于1/460时,利用深度信息对特征点进行反投影进行地图点恢复;Step S1025, when the calculated re-projection error is less than or equal to 1/460, use the depth information to back-project the feature points to restore map points;
步骤S1027，在计算的重投影误差大于1/460时，利用三角化法进行地图点恢复；Step S1027, when the calculated reprojection error is greater than 1/460, use the triangulation method to restore map points;
步骤S1029,通过恢复的地图点对10个目标图像对应的位姿进行局部BA调整;Step S1029, performing local BA adjustment on the poses corresponding to the 10 target images through the restored map points;
步骤S1031,通过优化后的10个位姿进行IMU初始化,计算IMU对应的初始速度、重力向量和偏离率;Step S1031, initialize the IMU through the optimized 10 poses, and calculate the initial speed, gravity vector and deviation rate corresponding to the IMU;
步骤S1033,根据重力向量对所有的位姿和地图点进行重力方向调整;Step S1033, adjusting the gravity direction of all poses and map points according to the gravity vector;
步骤S1035,输出位姿。Step S1035, output the pose.
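The flow of steps S1001 to S1035 above can be summarized as the following control-flow sketch. The per-step operations (feature extraction with depth, optical-flow/PnP pose estimation, map-point recovery, local BA, IMU initialization and gravity alignment) are injected as callables, because the flow chart specifies only their ordering; every name here is illustrative, not an API from the disclosure:

```python
IDENTITY_POSE = "identity"  # placeholder for the preset first-frame pose


def run_initialization(frames, estimate_pose, finalize,
                       first_preset=10, second_preset=3):
    """Drive the S1001-S1035 loop until enough poses have accumulated."""
    poses, prev, failures = [], None, 0
    for frame in frames:                       # S1001/S1003: receive + features
        if prev is None:                       # S1005: first input frame
            pose = IDENTITY_POSE
        else:                                  # S1007: flow matching + PnP
            pose = estimate_pose(frame, prev)
        if pose is None:                       # S1009: pose failed
            failures += 1
            if failures >= second_preset:      # S1011/S1013: clear everything
                poses, prev, failures = [], None, 0
            continue                           # S1015: drop this frame only
        failures = 0
        prev = frame                           # S1017: becomes previous frame
        poses.append(pose)
        if len(poses) == first_preset:         # S1019: enough poses
            # S1021-S1033: map points, local BA, IMU init, gravity alignment
            return finalize(poses)
    return None                                # stream ended before enough
```

Note how isolated failures only skip a frame while the previous frame is retained, and three consecutive failures restart accumulation with the next frame treated as a new first frame, exactly as in steps S1011 to S1015.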
需要说明的是，由于不同应用场景的需求不同，在步骤S1035中，可以在IMU初始化之前，直接将根据当前帧图像和上一帧图像确定的当前帧图像对应的位姿输出；此外，还可以在IMU初始化之后，对位姿进行重力调整后再输出。It should be noted that, since different application scenarios have different requirements, in step S1035 the pose corresponding to the current frame image, determined from the current frame image and the previous frame image, may be output directly before IMU initialization; alternatively, after IMU initialization, the pose may be output after the gravity-direction adjustment is applied.
根据上述实施例中的方法,利用上海科技大学(STU)发布的公开数据集A\B\C(后简称为STU数据集)进行实验验证,可以得到如表1所示的结果。基于表1可以得出以下结论:According to the method in the above embodiment, the public data set A\B\C (hereinafter referred to as the STU data set) released by ShanghaiTech University (STU) is used for experimental verification, and the results shown in Table 1 can be obtained. Based on Table 1, the following conclusions can be drawn:
(1)在初始化过程的耗时上，本实施例相比于VINS-MONO、VINS-RGBD方案均有大幅度的提升。相比于VINS-MONO，平均速度提升6到7倍；相比于VINS-RGBD，平均速度提升3到4倍。(1) In terms of the time consumed by the initialization process, this embodiment is substantially faster than both the VINS-MONO and VINS-RGBD schemes: on average 6 to 7 times faster than VINS-MONO, and 3 to 4 times faster than VINS-RGBD.
(2)在首次输出位姿的时间上，本实施例输出相机位姿的时间要远远提前于VINS-MONO和VINS-RGBD，至少提前30帧输出位姿，即通常情况下，本实施例可以在系统一开始运行就输出位姿。(2) In terms of the time of the first pose output, this embodiment outputs the camera pose far earlier than VINS-MONO and VINS-RGBD, at least 30 frames earlier; that is, in the usual case, this embodiment can output the pose as soon as the system starts running.
(3)在整体轨迹精度上，由于测试使用的STU数据集拥有质量较好的深度图数据，因此本实施例最后的整体轨迹估计精度与VINS-RGBD方案基本持平，在多数情况下优于VINS-MONO；而在接收到的图像的深度信息质量较差时，本实施例采用了三角化法和深度信息结合的方式来恢复地图点，在保证准确的同时可以恢复出更多的三维地图点，相比于VINS-RGBD方案，会带来精度上的明显提升。(3) In terms of overall trajectory accuracy, since the STU dataset used in the test has good-quality depth map data, the final overall trajectory estimation accuracy of this embodiment is basically on par with the VINS-RGBD scheme and better than VINS-MONO in most cases; when the quality of the depth information of the received images is poor, this embodiment combines the triangulation method with the depth information to recover map points, which recovers more 3D map points while ensuring accuracy and therefore brings a noticeable improvement in accuracy over the VINS-RGBD scheme.
表1本实施例与VINS-MONO和VINS-RGBD的性能对比Table 1 Performance comparison of this embodiment with VINS-MONO and VINS-RGBD
Figure PCTCN2022072711-appb-000001
为了验证本实施例的鲁棒性，我们在弗吉尼亚联邦大学公开的VCU-RVI数据集（15组数据）上进行了测试，计算VINS-MONO、VINS-RGBD和本实施例的轨迹均方根误差（RMSE），测试结果如表2所示。In order to verify the robustness of this embodiment, we tested it on the VCU-RVI dataset (15 sequences) published by Virginia Commonwealth University and calculated the trajectory root mean square error (RMSE) of VINS-MONO, VINS-RGBD and this embodiment; the test results are shown in Table 2.
从测试结果中可以看出，VINS-MONO、VINS-RGBD都未成功跟踪所有的15条轨迹；VINS-MONO有两组轨迹由于无法初始化成功而跟踪失败；而VINS-RGBD则有三组轨迹由于无法初始化成功而跟踪失败。反观本实施例，本实施例在全部15组数据集上都做到了成功初始化和跟踪，即本实施例具有更好的鲁棒性。同时，在15组数据集的跟踪轨迹精度上，本技术方案的精度在大多数情况下也都优于其他方法，即本实施例可以得到更加准确的结果。It can be seen from the test results that neither VINS-MONO nor VINS-RGBD successfully tracked all 15 trajectories: VINS-MONO failed to track two sequences because initialization failed, and VINS-RGBD failed to track three sequences for the same reason. By contrast, this embodiment achieved successful initialization and tracking on all 15 sequences, i.e., this embodiment is more robust. Meanwhile, in terms of tracking-trajectory accuracy on the 15 sequences, the accuracy of this technical solution is also better than the other methods in most cases, i.e., this embodiment can obtain more accurate results.
表2轨迹均方根误差Table 2 Trajectory root mean square error
Figure PCTCN2022072711-appb-000002
其中,X代表初始化失败。Among them, X represents initialization failure.
综上,本示例性实施方式中,具有以下有益效果:To sum up, this exemplary embodiment has the following beneficial effects:
(1)不再需要像VINS-MONO和VINS-RGBD那样等待累积满10帧图像并且IMU初始化成功之后才能输出位姿，提前了输出位姿的时间。即在系统开始运行时，接收到第一个当前帧图像便可以开始输出可利用的位姿信息。在用户体验上，用户不再需要等待视觉IMU联合初始化完成才可以开始使用相关应用，而是打开应用后立刻便可以开始使用。(1) Unlike VINS-MONO and VINS-RGBD, it is no longer necessary to wait until 10 frames of images have accumulated and the IMU has been successfully initialized before outputting the pose, which advances the time at which the pose is output. That is, as soon as the system starts running and the first current frame image is received, usable pose information can be output. In terms of user experience, users no longer need to wait for the joint visual-IMU initialization to complete before using the related application; they can start immediately after opening it.
(2)由于原来的VIO系统需要等待系统累积满10帧图像才开始进行操作，在未满10帧的时候不进行任何操作，而图像的传入帧率为10Hz左右，每帧图像之间的时间间隔约为100ms，不进行任何操作导致了这每帧100ms的时间间隔完全被浪费，而本申请则充分利用了这些时间间隔，使用frame-by-frame的方式计算出每两帧图像之间的相对位姿。相对于相关技术中积累10帧图像之后再进行位姿求解、然后再进行地图点恢复的方案，本技术方案在10帧图像传入的时候，只需计算最后一帧的位姿，大幅降低了初始化过程的耗时。(2) The original VIO system waits until 10 frames of images have accumulated before starting any operation, and does nothing while fewer than 10 frames are available. Since the incoming image frame rate is about 10 Hz, the time interval between frames is about 100 ms, and performing no operation means that these 100 ms per-frame intervals are completely wasted. The present application makes full use of these intervals by computing the relative pose between every two frames in a frame-by-frame manner. Compared with the related-art scheme of accumulating 10 frames before solving the poses and then recovering map points, this technical solution only needs to compute the pose of the last frame when the 10th frame arrives, which greatly reduces the time consumed by the initialization process.
(3)在出现位姿确定失败时，可以合理地对失败情况进行不同的处理，同时在不同情况下分别采用深度信息和三角化法进行地图点恢复，提高了初始化的成功率，进而提高了最终确定位姿的精度和鲁棒性。(3) When pose determination fails, the failure cases can be handled differently in a reasonable manner; meanwhile, depth information and the triangulation method are used for map-point recovery in different cases, which improves the success rate of initialization and in turn improves the accuracy and robustness of the finally determined poses.
需要注意的是,上述附图仅是根据本公开示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。It should be noted that the above-mentioned drawings are only schematic illustrations of the processes included in the method according to the exemplary embodiment of the present disclosure, and are not intended to be limiting. It is easy to understand that the processes shown in the above figures do not indicate or limit the chronological order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, in multiple modules.
进一步的，参考图11所示，本示例的实施方式中还提供一种视觉惯性系统初始化装置1100，包括位姿确定模块1110和初始化模块1120。其中：Further, referring to FIG. 11, the embodiment of this example further provides a visual inertial system initialization apparatus 1100, including a pose determination module 1110 and an initialization module 1120. Therein:
位姿确定模块1110可以用于在接收图像的过程中,针对图像进行逐帧计算,直至得到第一预设数量的位姿。The pose determination module 1110 may be configured to perform frame-by-frame calculation on the image during the process of receiving the image, until the pose of the first preset number is obtained.
初始化模块1120可以用于根据第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率，以根据运动速度、重力向量和偏离率对视觉惯性系统进行初始化。The initialization module 1120 may be configured to determine the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, gravity vector and deviation rate.
在一示例性实施例中,上述逐帧计算包括:在接收到一帧当前帧图像时,提取当前帧图像的特征点和特征点对应的深度信息;基于当前帧图像的特征点和深度信息与上一帧图像的特征点和深度信息确定当前帧图像对应的位姿;将当前帧图像作为上一帧图像,并继续接收新的当前帧图像。In an exemplary embodiment, the above-mentioned frame-by-frame calculation includes: when a frame of the current frame image is received, extracting the feature points of the current frame image and depth information corresponding to the feature points; based on the feature points and depth information of the current frame image and The feature points and depth information of the previous frame image determine the pose corresponding to the current frame image; take the current frame image as the previous frame image, and continue to receive new current frame images.
在一示例性实施例中,位姿确定模块1110可以用于在未成功确定当前帧图像对应的位姿时,丢弃当前帧图像,并保留上一帧图像;在接收到新的当前帧图像时,基于新的当前帧图像和保留的上一帧图像计算新的当前帧图像对应的位姿。In an exemplary embodiment, the pose determination module 1110 may be configured to discard the current frame image and retain the previous frame image when the pose corresponding to the current frame image is not successfully determined; when receiving a new current frame image , and calculate the pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
在一示例性实施例中,位姿确定模块1110可以用于在未成功确定当前帧图像对应的位姿时,统计未成功确定位姿的当前帧图像的数量;在未成功确定位姿的当前帧图像的数量等于第二预设数量时,清空已经确定的位姿,并继续接收新的当前帧图像。In an exemplary embodiment, the pose determination module 1110 may be configured to count the number of current frame images whose poses are unsuccessfully determined when the pose corresponding to the current frame image is unsuccessfully determined; When the number of frame images is equal to the second preset number, the determined pose is cleared, and new current frame images are continued to be received.
在一示例性实施例中,位姿确定模块1110可以用于在成功确定当前帧图像对应的位姿时,重置未成功确定位姿的当前帧图像的数量。In an exemplary embodiment, the pose determination module 1110 may be configured to reset the number of current frame images whose poses are not successfully determined when the pose corresponding to the current frame image is successfully determined.
在一示例性实施例中,位姿确定模块1110可以用于将当前帧图像的特征点与上一帧图像的特征点进行特征匹配,以获取匹配特征点;基于上一帧图像的特征点对应的深度信息确定匹配特征点的地图点,并将地图点投影至当前帧图像,以确定当前帧图像对应的位姿。In an exemplary embodiment, the pose determination module 1110 may be configured to perform feature matching between the feature points of the current frame image and the feature points of the previous frame image to obtain matching feature points; The depth information of , determines the map points that match the feature points, and projects the map points to the current frame image to determine the pose corresponding to the current frame image.
在一示例性实施例中,视觉惯性系统初始化装置1100还可以包括位姿优化模块,用于对第一预设数量的位姿对应的目标图像进行地图点恢复,以获取恢复后的地图点;根据地图点构建局部集束调整,以对位姿进行优化得到优化后的位姿。In an exemplary embodiment, the visual-inertial system initialization apparatus 1100 may further include a pose optimization module, configured to perform map point recovery on the target images corresponding to the first preset number of poses, so as to obtain the recovered map points; A local bundle adjustment is constructed according to the map points to optimize the pose to obtain the optimized pose.
在一示例性实施例中,位姿优化模块可以用于在第一预设数量的位姿对应的目标图像之间查找共视匹配关系,以获取至少一对目标图像对;针对每对目标图像对,利用目标图像对中目标图像的深度信息进行重投影,并计算重投影误差;在重投影误差小于等于预设阈值时,利用深度信息对特征点进行反投影,以对目标图像进行地图点恢复。In an exemplary embodiment, the pose optimization module may be used to find a common-view matching relationship between target images corresponding to a first preset number of poses to obtain at least a pair of target image pairs; for each pair of target images Yes, use the depth information of the target image in the target image pair to reproject, and calculate the reprojection error; when the reprojection error is less than or equal to the preset threshold, use the depth information to backproject the feature points to map the target image. recover.
在一示例性实施例中,位姿优化模块可以用于在重投影误差大于预设阈值时,通过三角化法对目标图像进行地图点恢复。In an exemplary embodiment, the pose optimization module may be configured to perform map point recovery on the target image through triangulation when the reprojection error is greater than a preset threshold.
在一示例性实施例中,位姿确定模块1110可以用于将第一帧图像对应的位姿确定为预设位姿。In an exemplary embodiment, the pose determination module 1110 may be configured to determine the pose corresponding to the first frame of image as a preset pose.
在一示例性实施例中,初始化模块1120可以用于根据重力向量对位姿进行重力方向调整。In an exemplary embodiment, the initialization module 1120 may be configured to adjust the gravitational direction of the pose according to the gravitational vector.
上述装置中各模块的具体细节在方法部分实施方式中已经详细说明,未披露的细节内容可以参见方法部分的实施方式内容,因而不再赘述。The specific details of each module in the above-mentioned apparatus have been described in detail in the method part of the implementation manner, and the undisclosed details can refer to the method part of the implementation manner, and thus will not be repeated.
所属技术领域的技术人员能够理解,本公开的各个方面可以实现为系统、方法或程序产品。因此,本公开的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。As will be appreciated by one skilled in the art, various aspects of the present disclosure may be implemented as a system, method or program product. Therefore, various aspects of the present disclosure can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software aspects, which may be collectively referred to herein as implementations "circuit", "module" or "system".
本公开的示例性实施方式还提供了一种计算机可读存储介质，其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中，本公开的各个方面还可以实现为一种程序产品的形式，其包括程序代码，当程序产品在终端设备上运行时，程序代码用于使终端设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤，例如可以执行图3、图4和图10中任意一个或多个步骤。Exemplary embodiments of the present disclosure further provide a computer-readable storage medium on which a program product capable of implementing the above methods of this specification is stored. In some possible implementations, various aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps, described in the "Exemplary Methods" section above, according to various exemplary embodiments of the present disclosure, for example, any one or more of the steps in FIG. 3, FIG. 4 and FIG. 10.
需要说明的是,本公开所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:无线、电线、光缆、RF等等,或者上述的任意合适的组合。In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device . Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
此外,可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。Furthermore, program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, etc., as well as conventional procedural Programming Language - such as the "C" language or similar programming language. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server execute on. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (eg, using an Internet service provider business via an Internet connection).
本领域技术人员在考虑说明书及实践这里公开的发明后，将容易想到本公开的其他实施例。本申请旨在涵盖本公开的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本公开的真正范围和精神由权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.
应当理解的是，本公开并不局限于上面已经描述并在附图中示出的精确结构，并且可以在不脱离其范围的情况下进行各种修改和改变。本公开的范围仅由所附的权利要求来限定。It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

  1. 一种视觉惯性系统初始化方法,其中,包括:A visual-inertial system initialization method, including:
    在接收图像的过程中,针对所述图像进行逐帧计算,直至得到第一预设数量的位姿;In the process of receiving an image, frame-by-frame calculation is performed on the image until a first preset number of poses are obtained;
    根据所述第一预设数量的位姿确定惯性测量单元对应的运动速度、重力向量和偏离率，以根据所述运动速度、所述重力向量和所述偏离率对视觉惯性系统进行初始化；Determine the motion speed, gravity vector and deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, so as to initialize the visual inertial system according to the motion speed, the gravity vector and the deviation rate;
    其中,所述逐帧计算包括:Wherein, the frame-by-frame calculation includes:
    在接收到一帧当前帧图像时,提取所述当前帧图像的特征点和所述特征点对应的深度信息;When receiving a frame of the current frame image, extract the feature points of the current frame image and the depth information corresponding to the feature points;
    基于所述当前帧图像的所述特征点和所述深度信息与上一帧图像的所述特征点和所述深度信息确定所述当前帧图像对应的位姿;Determine the pose corresponding to the current frame image based on the feature point and the depth information of the current frame image and the feature point and the depth information of the previous frame image;
    将所述当前帧图像作为所述上一帧图像,并继续接收新的当前帧图像。Taking the current frame image as the previous frame image, and continuing to receive a new current frame image.
  2. The method according to claim 1, wherein before taking the current frame image as the previous frame image, the method further comprises:
    when the pose corresponding to the current frame image is not successfully determined, discarding the current frame image and retaining the previous frame image; and
    when a new current frame image is received, calculating a pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
  3. The method according to claim 2, wherein the method further comprises:
    when the pose corresponding to the current frame image is not successfully determined, counting the number of current frame images for which a pose is not successfully determined; and
    when the number of current frame images for which a pose is not successfully determined equals a second preset number, clearing the poses that have been determined, and continuing to receive new current frame images.
  4. The method according to claim 3, wherein after counting the number of current frame images for which a pose is not successfully determined, the method further comprises:
    when the pose corresponding to the current frame image is successfully determined, resetting the number of current frame images for which a pose is not successfully determined.
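Claims 2 to 4 together amount to a failure counter over the incoming frame stream. The sketch below is a hedged illustration: `PoseCollector` and `SECOND_PRESET` are hypothetical names, and the claims do not state whether the failure count restarts after the pose list is cleared, so that behavior is an assumption made here.

```python
# Hedged sketch of the bookkeeping in claims 2 to 4. PoseCollector and
# SECOND_PRESET are hypothetical names; restarting the failure count after
# clearing the pose list is an assumption not stated in the claims.

SECOND_PRESET = 3  # "second preset number" of failed frames (assumed value)

class PoseCollector:
    def __init__(self):
        self.poses = []
        self.failures = 0

    def on_frame(self, pose):
        """Feed one frame's estimation result; pose is None on failure."""
        if pose is None:
            # Claim 3: count the frames whose pose was not determined.
            self.failures += 1
            if self.failures == SECOND_PRESET:
                # Claim 3: clear the determined poses, keep receiving frames.
                self.poses.clear()
                self.failures = 0  # assumption: restart the count after clearing
        else:
            # Claim 4: a success resets the failure count.
            self.failures = 0
            self.poses.append(pose)
```

A successful frame thus tolerates isolated tracking failures, while a run of consecutive failures discards the partial initialization and starts over.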
  5. The method according to claim 1, wherein determining the pose corresponding to the current frame image based on the feature points and the depth information of the current frame image and the feature points and the depth information of the previous frame image comprises:
    performing feature matching between the feature points of the current frame image and the feature points of the previous frame image to obtain matching feature points; and
    determining map points of the matching feature points based on the depth information corresponding to the feature points of the previous frame image, and projecting the map points onto the current frame image to determine the pose corresponding to the current frame image.
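The geometry behind claim 5 (back-projecting a matched feature of the previous frame into a map point using its depth, then projecting that map point into the current frame) can be written out as two helpers. `K` (pinhole intrinsics) and the pose `(R, t)` are assumed inputs; the claimed method solves for the pose from such projections, which a typical system would do with a PnP solver and which is not shown here.

```python
import numpy as np

# Sketch of the back-projection / projection geometry underlying claim 5.
# K, R, t are assumed inputs; solving for (R, t) from the correspondences
# (e.g. via a PnP solver) is outside this sketch.

def backproject(uv, depth, K):
    """Pixel (u, v) plus depth -> 3D map point in the previous camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = uv
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def project(X, R, t, K):
    """3D map point -> pixel in the current frame under pose (R, t)."""
    Xc = R @ X + t          # transform into the current camera frame
    x = K @ Xc              # apply intrinsics
    return x[:2] / x[2]     # perspective division
```

For example, a pixel at the principal point with depth 2 back-projects to the point (0, 0, 2) on the optical axis and, under an identity pose, projects back to the same pixel.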
  6. The method according to claim 1, wherein before determining the motion velocity, the gravity vector, and the deviation rate corresponding to the inertial measurement unit according to the first preset number of poses, the method further comprises:
    performing map point recovery on target images corresponding to the first preset number of poses to obtain recovered map points; and
    constructing a local bundle adjustment according to the map points, so as to optimize the poses and obtain optimized poses.
  7. The method according to claim 6, wherein performing map point recovery on the target images corresponding to the first preset number of poses comprises:
    searching for co-visibility matching relationships among the target images corresponding to the first preset number of poses to obtain at least one target image pair;
    for each target image pair, performing reprojection using the depth information of the target images in the target image pair, and calculating a reprojection error; and
    when the reprojection error is less than or equal to a preset threshold, back-projecting the feature points using the depth information to perform map point recovery on the target images.
  8. The method according to claim 7, wherein the method further comprises:
    when the reprojection error is greater than the preset threshold, performing map point recovery on the target images by triangulation.
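Claims 7 and 8 gate map point recovery on the reprojection error: within the threshold, the measured depth is trusted and the point is back-projected; otherwise the method falls back to triangulation. A minimal sketch, in which the threshold value and all names are assumptions and `triangulate` stands in for a two-view triangulation routine:

```python
import numpy as np

# Sketch of the branch in claims 7 and 8. REPROJ_THRESHOLD and all names are
# assumptions; triangulate stands in for a two-view triangulation routine.

REPROJ_THRESHOLD = 2.0  # pixels (assumed value)

def recover_map_point(reproj_error, uv, depth, K, triangulate):
    """Choose the map point recovery path from the reprojection error."""
    if reproj_error <= REPROJ_THRESHOLD:
        # Claim 7: the depth measurement is trusted; back-project with it.
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        u, v = uv
        return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    # Claim 8: depth is unreliable here; triangulate from the image pair instead.
    return triangulate(uv)
```

This keeps the cheap depth-based recovery as the default and reserves triangulation for features whose measured depth disagrees with the geometry.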
  9. The method according to claim 1, wherein when the received current frame image is a first frame image, the method comprises:
    determining the pose corresponding to the first frame image as a preset pose.
  10. The method according to claim 1, wherein the method further comprises:
    adjusting the gravity direction of the poses according to the gravity vector.
  11. A visual-inertial system initialization apparatus, comprising:
    a pose determination module, configured to perform, in a process of receiving images, a frame-by-frame calculation on the images until a first preset number of poses are obtained; and
    an initialization module, configured to determine a motion velocity, a gravity vector, and a deviation rate corresponding to an inertial measurement unit according to the first preset number of poses, so as to initialize the visual-inertial system according to the motion velocity, the gravity vector, and the deviation rate;
    wherein the frame-by-frame calculation comprises:
    upon receiving a current frame image, extracting feature points of the current frame image and depth information corresponding to the feature points;
    determining a pose corresponding to the current frame image based on the feature points and the depth information of the current frame image and the feature points and the depth information of a previous frame image; and
    taking the current frame image as the previous frame image, and continuing to receive a new current frame image.
  12. The apparatus according to claim 11, wherein the pose determination module is further configured to: when the pose corresponding to the current frame image is not successfully determined, discard the current frame image and retain the previous frame image; and when a new current frame image is received, calculate a pose corresponding to the new current frame image based on the new current frame image and the retained previous frame image.
  13. The apparatus according to claim 12, wherein the pose determination module is further configured to: when the pose corresponding to the current frame image is not successfully determined, count the number of current frame images for which a pose is not successfully determined; and when the number of current frame images for which a pose is not successfully determined equals a second preset number, clear the poses that have been determined and continue to receive new current frame images.
  14. The apparatus according to claim 13, wherein the pose determination module is further configured to: when the pose corresponding to the current frame image is successfully determined, reset the number of current frame images for which a pose is not successfully determined.
  15. The apparatus according to claim 11, wherein the pose determination module is further configured to: perform feature matching between the feature points of the current frame image and the feature points of the previous frame image to obtain matching feature points; and determine map points of the matching feature points based on the depth information corresponding to the feature points of the previous frame image, and project the map points onto the current frame image to determine the pose corresponding to the current frame image.
  16. The apparatus according to claim 11, wherein the pose determination module is further configured to: perform map point recovery on target images corresponding to the first preset number of poses to obtain recovered map points; and construct a local bundle adjustment according to the map points, so as to optimize the poses and obtain optimized poses.
  17. The apparatus according to claim 16, wherein the pose determination module is further configured to: search for co-visibility matching relationships among the target images corresponding to the first preset number of poses to obtain at least one target image pair; for each target image pair, perform reprojection using the depth information of the target images in the target image pair and calculate a reprojection error; and when the reprojection error is less than or equal to a preset threshold, back-project the feature points using the depth information to perform map point recovery on the target images.
  18. The apparatus according to claim 17, wherein the pose determination module is further configured to: when the reprojection error is greater than the preset threshold, perform map point recovery on the target images by triangulation.
  19. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 10.
  20. An electronic device, comprising:
    a processor; and
    a memory for storing executable instructions of the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 10 by executing the executable instructions.
PCT/CN2022/072711 2021-02-18 2022-01-19 Visual inertial system initialization method and apparatus, medium, and electronic device WO2022174711A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110190368.9 2021-02-18
CN202110190368.9A CN112819860B (en) 2021-02-18 2021-02-18 Visual inertial system initialization method and device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2022174711A1 true WO2022174711A1 (en) 2022-08-25

Family

ID=75864182

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/072711 WO2022174711A1 (en) 2021-02-18 2022-01-19 Visual inertial system initialization method and apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112819860B (en)
WO (1) WO2022174711A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819860B (en) * 2021-02-18 2023-12-22 Oppo广东移动通信有限公司 Visual inertial system initialization method and device, medium and electronic equipment
CN115601419A (en) * 2021-07-07 2023-01-13 北京字跳网络技术有限公司(Cn) Synchronous positioning and mapping back-end optimization method, device and storage medium
CN113610918A (en) * 2021-07-29 2021-11-05 Oppo广东移动通信有限公司 Pose calculation method and device, electronic equipment and readable storage medium
CN113899364B (en) * 2021-09-29 2022-12-27 深圳市慧鲤科技有限公司 Positioning method and device, equipment and storage medium
CN117346650A (en) * 2022-06-28 2024-01-05 中兴通讯股份有限公司 Pose determination method and device for visual positioning and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107747941A (en) * 2017-09-29 2018-03-02 歌尔股份有限公司 A kind of binocular visual positioning method, apparatus and system
CN110057352A (en) * 2018-01-19 2019-07-26 北京图森未来科技有限公司 A kind of camera attitude angle determines method and device
CN110335316A (en) * 2019-06-28 2019-10-15 Oppo广东移动通信有限公司 Method, apparatus, medium and electronic equipment are determined based on the pose of depth information
CN112284381A (en) * 2020-10-19 2021-01-29 北京华捷艾米科技有限公司 Visual inertia real-time initialization alignment method and system
CN112819860A (en) * 2021-02-18 2021-05-18 Oppo广东移动通信有限公司 Visual inertial system initialization method and device, medium and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3058534B1 (en) * 2016-11-09 2019-02-01 Stereolabs INDIVIDUAL VISUAL IMMERSION DEVICE FOR MOVING PERSON WITH OBSTACLE MANAGEMENT
CN108489482B (en) * 2018-02-13 2019-02-26 视辰信息科技(上海)有限公司 The realization method and system of vision inertia odometer
CN110322500B (en) * 2019-06-28 2023-08-15 Oppo广东移动通信有限公司 Optimization method and device for instant positioning and map construction, medium and electronic equipment


Also Published As

Publication number Publication date
CN112819860B (en) 2023-12-22
CN112819860A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
WO2022174711A1 (en) Visual inertial system initialization method and apparatus, medium, and electronic device
WO2020259248A1 (en) Depth information-based pose determination method and device, medium, and electronic apparatus
CN110322500B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
CN107888828B (en) Space positioning method and device, electronic device, and storage medium
US11270460B2 (en) Method and apparatus for determining pose of image capturing device, and storage medium
CN109087359B (en) Pose determination method, pose determination apparatus, medium, and computing device
CN109727288B (en) System and method for monocular simultaneous localization and mapping
US11199414B2 (en) Method for simultaneous localization and mapping
CN110310326B (en) Visual positioning data processing method and device, terminal and computer readable storage medium
CN108805917B (en) Method, medium, apparatus and computing device for spatial localization
Chen et al. Rise of the indoor crowd: Reconstruction of building interior view via mobile crowdsourcing
CN109461208B (en) Three-dimensional map processing method, device, medium and computing equipment
WO2019170166A1 (en) Depth camera calibration method and apparatus, electronic device, and storage medium
WO2020228643A1 (en) Interactive control method and apparatus, electronic device and storage medium
CN110660098B (en) Positioning method and device based on monocular vision
CN111127524A (en) Method, system and device for tracking trajectory and reconstructing three-dimensional image
JP7150917B2 (en) Computer-implemented methods and apparatus, electronic devices, storage media and computer programs for mapping
CN110349212B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
US11195297B2 (en) Method and system for visual localization based on dual dome cameras
CN111784776B (en) Visual positioning method and device, computer readable medium and electronic equipment
CN112907620A (en) Camera pose estimation method and device, readable storage medium and electronic equipment
CN111609868A (en) Visual inertial odometer method based on improved optical flow method
CN115699096B (en) Tracking augmented reality devices
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
CN112731503B (en) Pose estimation method and system based on front end tight coupling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22755484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22755484

Country of ref document: EP

Kind code of ref document: A1