WO2021160182A1 - Method and apparatus for estimating pose of image capturing device - Google Patents

Method and apparatus for estimating pose of image capturing device

Info

Publication number
WO2021160182A1
WO2021160182A1 (PCT/CN2021/076640)
Authority
WO
WIPO (PCT)
Prior art keywords
images
image
pose
ref
capturing device
Prior art date
Application number
PCT/CN2021/076640
Other languages
French (fr)
Inventor
Jaechoon CHON
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180014779.0A (published as CN115210533A)
Publication of WO2021160182A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/005 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 with correlation of navigation data from several sources, e.g. map or contour matching

Definitions

  • the present disclosure in some embodiments thereof, relates to computer vision, and more specifically, but not exclusively, to a method and an apparatus for estimating a pose of an image capturing device.
  • the key point is determining the mobile system’s spatial localization in real time, for example, estimating a pose of an image capturing device (e.g., a camera) provided on the mobile system, where the pose may include its position and rotation.
  • SLAM: simultaneous localization and mapping
  • accumulated error may increase over time.
  • a current image pose should be re-corrected based on landmarks when the camera re-visits the same area.
  • the present disclosure provides a method and an apparatus for estimating a pose of an image capturing device.
  • a method for estimating a pose of an image capturing device including: acquiring a series of images of multiple landmarks from at least one pose; determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • an apparatus for estimating a pose of an image capturing device comprising: an acquiring module, configured to acquire a series of images of multiple landmarks from at least one pose; a determining module, configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and an outputting module, configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • an image capturing device including: a processor and a memory.
  • the memory is configured to store a computer program
  • the processor is configured to call and run the computer program stored in the memory, thereby implementing the method according to the foregoing first aspect or any embodiment thereof.
  • a chip configured to implement the method according to the foregoing first aspect or any embodiment thereof.
  • the chip includes a processor, configured to call and run a computer program from a memory, thereby causing an apparatus provided with the chip to implement the method according to the foregoing first aspect or any embodiment thereof.
  • a computer readable storage medium being used for storing a computer program, wherein the computer program causes a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
  • a computer program product including computer program instructions that cause a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
  • a computer program which, when running on a computer, causes the computer to implement the method according to the foregoing first aspect or any embodiment thereof.
  • FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment between current 3D points and landmarks.
  • FIG. 3 illustrates a sequence diagram of an optional flow of operations for estimating a pose of an image capturing device according to some embodiments of the disclosure.
  • FIG. 4 illustrates a sequence diagram of an optional flow of operations 400 for constructing the mapping image set according to some embodiments of the disclosure.
  • FIG. 5 illustrates a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure.
  • FIG. 6 illustrates a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure.
  • FIG. 7 illustrates an exemplary application scenario where the operations 300 are implemented according to some embodiments of the disclosure.
  • FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application.
  • FIG. 9 is a block diagram of an image capturing device according to some embodiments of the application.
  • FIG. 10 is a block diagram of a chip according to some embodiments of the application.
  • the meaning of “a plurality” , “multiple” or “several” is at least two, for example, two, three, etc., unless specifically defined otherwise.
  • "And/or” describing the association relationship of the associated objects, indicates that there may be three relationships, such as A and/or B, which may indicate that there are three cases of single A, single B and both A and B.
  • the symbol “/” generally indicates that the contextual object is an "or" relationship.
  • the term “camera” is used herein to refer to an image capturing device such as one or more image sensors, an independent camera, an integrated camera, and/or any sensor adapted to document objects visually.
  • image capturing systems including an image capturing device, for example a camera, where there is a need to estimate a pose of the camera in a coordinate system.
  • coordinate systems include a world coordinate system and a coordinate system calibrated with a camera pose of a camera when capturing images.
  • a camera pose is a combination of position and orientation of a camera relative to a coordinate system.
  • a camera pose x may be expressed as a pair (R, t) , where R is a rotation matrix representing an orientation with respect to the coordinate system, and t is a translation vector representing the camera's position with respect to the coordinate system.
  • R is a rotation matrix representing an orientation with respect to the coordinate system
  • t is a translation vector representing the camera's position with respect to the coordinate system.
  • Other possible representations of orientation are double-angle representations and tensors.
  • Examples of such image capturing systems are medical systems including a camera inserted into a patient's body, for example by swallowing the camera, systems including autonomously moving devices, for example vehicles and robots, navigation applications and augmented reality systems.
  • 3D: three-dimensional
  • a scene has image features, also known as landmarks.
  • An image of the scene captured by a camera sometimes referred to as a camera view, has observed image features representing the scene's landmarks.
  • camera pose estimation is solved using bundle adjustment (BA) optimization.
  • Bundle adjustment is a common approach to recover (by estimation) camera poses and 3D scene reconstruction given a sequence of images and possible measurements from additional sensors.
  • a bundle adjustment optimization calculation aims to minimize re-projection errors across all available images and for all landmarks identified in the available images.
  • a re-projection error for a landmark in an image is the difference between the location in the image of the landmark's observed image feature and the predicted location in the image of the landmark's observed image feature for a certain camera pose estimate.
  • the terms “visiting an area” , “mapping an area” and “observing an area” all mean capturing images of the area by a camera, and are used interchangeably.
  • the pose graph method has been widely adopted for its light computational cost.
  • the pose graph concept aligns the current camera track and the old camera track when the camera revisits the same area. This pose graph method tries to increase the similarity among the tracks even though the tracks may have quite different patterns. Even when it takes 3D landmarks into consideration, the distance scale error between the 3D landmarks and the current 3D points built by a single camera makes things worse.
  • ICP: iterative closest point
  • PnP: Perspective-n-Point
  • the methods as described above only focus on the current (latest) frame pose correction, so the re-corrected position may suddenly jump by a large translation. Also, they do not consider the distance scale error between the current estimated 3D points and the 3D landmarks.
  • FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment between current 3D points and landmarks.
  • the current frame pose 101 is corrected based on the perspective images of landmarks #1, #2 and #3.
  • the re-corrected position of the current frame pose jumps leftward by a large translation, that is, it substantially does not match the predicted pose or the camera track.
  • the distance scale error caused by the change of the current frame pose is not considered in the re-correction, thereby generating additional re-correction error.
  • some embodiments of the disclosure propose a pose estimation scheme which tightly couples the current image frame to its previous image frame. Because the distance between the current track and each camera in a tightly coupled body is considered an error, the sudden-jump translation issue can be suppressed.
  • some embodiments of the disclosure also propose a pose estimation scheme which considers the distance scale error between the 3D landmarks and the current 3D points when minimizing those errors and estimating the 6 DoF (Degrees of Freedom) of a tightly coupled body.
  • One example of the pose estimation scheme minimizes the sum {3D_error_term + body_pose_error_term + 3D_project_error_term} , where:
  • 3D_error_term = 3D landmark - scale * current 3D point;
  • body_pose_error_term = scale * body pose - current body pose; and
  • 3D_project_error_term = 3D landmark projection on image - current image point.
  • FIG. 3, showing a sequence diagram of an optional flow of operations 300 according to some embodiments of the disclosure.
  • Such embodiments include at least one hardware processor and a single camera.
  • the at least one hardware processor acquires 301 a series of images of multiple landmarks from at least one pose.
  • the at least one pose is unknown. There may be a need to compute a set of estimations for the at least one pose.
  • the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto.
  • the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • the mapping image set is acquired during the operations for estimating the pose of the image capturing device.
  • the image capturing device continuously captures images of the multiple landmarks and extracts feature information for each captured image to construct the mapping image set, which is used for estimating the pose for the recently acquired N images in an on-board way.
  • the mapping image set is acquired separately prior to the operations for estimating the pose of the image capturing device and, thus, used for estimating the pose for the recently acquired N images in an off-board way.
  • Such embodiments include at least one hardware processor, a single camera and a VIO (visual inertial odometry) unit.
  • the at least one hardware processor controls the single camera to capture 401 a plurality of images of multiple landmarks at different poses of the image capturing device; and controls the VIO unit to extract 403 feature information for each image in the plurality of images.
  • the feature information includes 6DoF information of each image and depth information of the multiple landmarks in each image.
  • the VIO unit includes a visual odometry unit and an inertial measurement unit (IMU) .
  • IMU: inertial measurement unit
  • An exemplary algorithm of the visual odometry unit is as follows. A new frame image is acquired first, ORB (Oriented FAST and Rotated BRIEF) feature points are extracted from the image, and the corresponding BRIEF descriptors of the feature points are calculated. Then, matching is performed between the feature points of the recently acquired image and the feature points of previous image frames. At the same time, matched feature points are filtered using the RANSAC algorithm. Finally, the rotation and translation between the current and previous image frames are obtained by minimizing the re-projection error, so as to obtain the current pose of the image capturing device.
  • ORB: Oriented FAST and Rotated BRIEF
  • the IMU is configured to obtain the acceleration and angular speed of the image capturing device by using a gyroscope and an accelerometer and, then, calculate the current pose of the image capturing device through an integration operation. Details thereof are omitted here for brevity.
  • FIG. 5, showing a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure.
  • Such embodiments include at least one hardware processor and a single camera.
  • the at least one hardware processor extracts 501 a plurality of observed image features of the multiple landmarks from a plurality of images captured by the single camera from at least one pose.
  • the at least one pose is unknown.
  • the observed image features may be expressed in a camera coordinate system.
  • the at least one hardware processor extracts 503 the plurality of observed image features by applying image matching algorithms to the images.
  • the image matching algorithms may include feature scale detection algorithms that produce scale information. Examples of image matching algorithms are SIFT and RANSAC.
  • the at least one hardware processor may identify 505 among the extracted plurality of observed image features at least one common observed image feature documented in at least some of the images.
  • the image matching algorithms are used for identifying the at least one common feature.
  • a corner feature matching and tracking method, for example DBoW2, which is a bag-of-words place recognition approach, can be utilized to implement the operations 500.
  • DBoW2 can return loop-closure candidates after temporal and geometrical consistency checks. All BRIEF descriptors may be kept for feature retrieval, but the raw images can be discarded to reduce memory consumption.
  • the connection between the recently acquired N images and the N matched images is established by retrieving feature correspondences, which are found by BRIEF descriptor matching.
  • FIG. 6, showing a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure.
  • Such embodiments include at least one hardware processor.
  • the at least one hardware processor calculates 601 a pose (R, T) based on the following equations:
  • P_i = R (p_i - p_ref) + T;
  • (P_i - L_i) · (P_j - L_j) - |P_i - L_i| |P_j - L_j| (ray_i · ray_j) = 0;
  • (R * ray_i) · (P_i - L_i) - |P_i - L_i| = 0,
  • where R is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system;
  • T is a translation vector representing a position of the image capturing device with respect to the world coordinate system;
  • p_ref is a reference position in a body system when considering the recently acquired N images as a whole;
  • p_i is a pose translation determined for the i-th image of the N matched images;
  • P_i is a position of the i-th image of the recently acquired N images;
  • L_i is a position of the i-th landmark of the multiple landmarks;
  • “·” represents an inner product operation; and
  • ray_i represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image.
  • the at least one hardware processor then outputs 603 a pose (r_ref, p_ref) for the most recently acquired image as r_ref = R * R_ref and p_ref = T, where R_ref is a rotation matrix determined for the most recently acquired image.
  • the at least one hardware processor acquires 301 a series of images of landmarks L_1, L_2, L_3 from a series of poses 0, 1, 2, ..., 12.
  • the pose 12 is unknown.
  • the at least one hardware processor determines 303, for recently acquired 3 images, 3 matched images by matching each of the 3 images with a mapping image set.
  • the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto.
  • as shown in FIG. 7, feature correspondences are found between the three images at poses 10, 11, 12 and three matched images at poses 1, 3, 4, respectively.
  • the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired 3 images to known pose information of the 3 matched images.
  • FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application. As shown in FIG. 8, the apparatus 800 includes an acquiring module 801, a determining module 803 and an outputting module 805.
  • the acquiring module 801 is configured to acquire a series of images of multiple landmarks from at least one pose;
  • the determining module 803 is configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2;
  • the outputting module 805 is configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • the apparatus 800 further includes a constructing module (not shown) .
  • the constructing module is configured to acquire the mapping image set by capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and extracting feature information for each image in the plurality of images.
  • the constructing module is specifically configured to: calculate 6 degrees-of-freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
  • 6DoF: six degrees of freedom
  • the constructing module is specifically configured to: calculate, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
  • VIO: visual inertial odometry
  • IMU: inertial measurement unit
  • the determining module 803 is specifically configured to: perform matching of corner features between each of the N images and each corresponding one of the N matched images.
  • the outputting module 805 is specifically configured to: calculate a pose (R, T) based on the following equations:
  • P_i = R (p_i - p_ref) + T;
  • (P_i - L_i) · (P_j - L_j) - |P_i - L_i| |P_j - L_j| (ray_i · ray_j) = 0;
  • (R * ray_i) · (P_i - L_i) - |P_i - L_i| = 0,
  • where R is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system;
  • T is a translation vector representing a position of the image capturing device with respect to the world coordinate system;
  • p_ref is a reference position in a body system when considering the recently acquired N images as a whole;
  • p_i is a pose translation determined for the i-th image of the N matched images;
  • P_i is a position of the i-th image of the recently acquired N images;
  • L_i is a position of the i-th landmark of the multiple landmarks;
  • “·” represents an inner product operation;
  • ray_i represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
  • R_ref is a rotation matrix determined for the most recently acquired image, so that the output pose is r_ref = R * R_ref and p_ref = T.
  • FIG. 9 is a block diagram illustrating an image capturing device 900 according to some embodiments of the application.
  • the image capturing device 900 shown in FIG. 9 includes a processor 910, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
  • the image capturing device 900 may further include a memory 920.
  • the processor 910 may call and run the computer program from the memory 920 to implement the method according to the embodiments of the application.
  • the memory 920 may be a separate device independent of the processor 910, or may be integrated in the processor 910.
  • the image capturing device 900 may further include a transceiver 930, and the processor 910 may control the transceiver 930 to communicate with other devices. Specifically, it may send information or data to other devices, or receive information or data sent by other devices.
  • the transceiver 930 may include a transmitter and a receiver.
  • the transceiver 930 may further include antennas, and the number of antennas may be one or more.
  • FIG. 10 is a block diagram illustrating a chip according to some embodiments of the application.
  • the chip 1000 shown in FIG. 10 includes a processor 1010, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
  • the chip 1000 may further include a memory 1020.
  • the processor 1010 may call and run the computer program from the memory 1020 to implement the method according to the embodiments of the application.
  • the memory 1020 may be a separate device independent of the processor 1010, or may be integrated in the processor 1010.
  • the chip 1000 may further include an input interface 1030.
  • the processor 1010 may control the input interface 1030 to communicate with other devices or chips. Specifically, the processor 1010 may acquire information or data sent by other devices or chips.
  • the chip 1000 may further include an output interface 1040.
  • the processor 1010 may control the output interface 1040 to communicate with other devices or chips. Specifically, the processor 1010 may output information or data to the other devices or chips.
  • the chip can be applied to the image capturing device according to the embodiments of the application, and the chip can implement the corresponding process implemented by the image capturing device in the method according to the embodiments of the application.
  • the chip mentioned in some embodiments of the application may also be referred to as a system-level chip, a system chip, a chip system or a system-on-chip.
  • the processor in the embodiments of the disclosure may be an integrated circuit chip with signal processing capability.
  • the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or instructions in the form of software.
  • the processor mentioned in some embodiments of the application may be a general-purpose processor, a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field programmable gate array (FPGA) , or other programming logic devices, discrete gate or transistor logic devices, discrete hardware components, which can achieve or implement the methods, steps and block diagrams disclosed in embodiments of the disclosure.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory mentioned in some embodiments of the application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM) , programmable read-only memory (PROM) , erasable programmable read-only memory (erasable PROM, EPROM) , electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • the volatile memory may be a random access memory (RAM) , which is used as an external cache.
  • the memory in the embodiments of the disclosure may also be static random access memory (static RAM, SRAM) , dynamic random access memory (dynamic RAM, DRAM) , synchronous dynamic random access memory (synchronous DRAM, SDRAM) , double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM) , enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM) , synchlink dynamic random access memory (synchlink DRAM, SLDRAM) , direct Rambus random access memory (direct Rambus RAM, DR RAM) , and the like.
  • Embodiments of the disclosure further provide a computer readable storage medium, which is configured to store a computer program.
  • the computer readable storage medium may be applied to the network device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the network device in each method in some embodiments of the application.
  • the computer readable storage medium may be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application.
  • a computer program product is also provided in some embodiments of the application, including computer program instructions.
  • the computer program product can be applied to the network device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the network device in each method in some embodiments of the application.
  • the computer program product can be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application.
  • a computer program is also provided in some embodiments of the application.
  • the computer program may be applied to the network device in some embodiments of the application.
  • When the computer program is run on a computer, the computer is caused to execute the corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
  • the computer program may be applied to the mobile terminal/terminal device in some embodiments of the application.
  • When the computer program is run on a computer, the computer is caused to execute the corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments as described above are only exemplary.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not carried out.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium.
  • a computer-readable storage medium including several instructions used for causing a computer device (which may be a personal computer, a server, or a network device, and the like) to perform all or part of the steps of the method described in some embodiments of the application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

A method for estimating a pose of an image capturing device, including: acquiring a series of images of multiple landmarks from at least one pose (301); determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2 (303); and outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images (305).

Description

METHOD AND APPARATUS FOR ESTIMATING POSE OF IMAGE CAPTURING DEVICE
RELATED APPLICATION
This application claims the benefit of priority of U.S. Provisional Patent Application No. 62/976,537 filed on Feb. 14, 2020, the contents of which are incorporated herein by reference in their entirety.
TECHNICAL FIELD
The present disclosure, in some embodiments thereof, relates to computer vision, and more specifically, but not exclusively, to a method and an apparatus for estimating a pose of an image capturing device.
BACKGROUND
When realizing VR (virtual reality) and AR (augmented reality) , the key point is determining the mobile system’s spatial localization in real time, for example, estimating a pose of an image capturing device (e.g., a camera) provided on the mobile system, where the pose may include its position and rotation. Such localization may be referred to as SLAM (simultaneous localization and mapping) .
During the process of localization based on a camera and an IMU (inertial measurement unit) , accumulated error may increase over time. To suppress the accumulated error, the current image pose should be re-corrected based on landmarks when the camera re-visits the same area.
SUMMARY
The present disclosure provides a method and an apparatus for estimating a pose of an image capturing device.
According to a first aspect, there is provided a method for estimating a pose of an image capturing device, including: acquiring a series of images of multiple landmarks from at least one pose; determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
According to a second aspect, there is provided an apparatus for estimating a pose of an image capturing device, comprising: an acquiring module, configured to acquire a series of images of multiple landmarks from at least one pose; a determining module, configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and an outputting module, configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
According to a third aspect, there is provided an image capturing device, including: a processor and a memory. The memory is configured to store a computer program, and the processor is configured to call and run the computer program stored in the memory, thereby implementing the method according to the foregoing first aspect or any embodiment thereof.
According to a fourth aspect, there is provided a chip configured to implement the method according to the foregoing first aspect or any embodiment thereof.
Specifically, the chip includes a processor, configured to call and run a computer program from a memory, thereby causing an apparatus provided with the chip to implement the method according to the foregoing first aspect or any embodiment thereof.
According to a fifth aspect, there is provided a computer readable storage medium, being used for storing a computer program, wherein the computer program causes a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
According to a sixth aspect, there is provided a computer program product, including computer program instructions that cause a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
According to a seventh aspect, there is provided a computer program which, when running on a computer, causes the computer to implement the method according to the foregoing first aspect or any embodiment thereof.
Based on the method provided by the embodiments of the disclosure, it is assumed that the recently acquired N images are acquired by N cameras mounted on a tight and rigid body system, from which the 6DoF and scale are estimated. Accordingly, even if pose error accumulates significantly, the accumulated error of the tightly coupled N images is small relative to the whole accumulated pose.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment among current 3D point and landmarks.
FIG. 3 illustrates a sequence diagram of an optional flow of operations for estimating a pose of an image capturing device according to some embodiments of the disclosure.
FIG. 4 illustrates a sequence diagram of an optional flow of operations 400 for constructing the mapping image set according to some embodiments of the disclosure.
FIG. 5 illustrates a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure.
FIG. 6 illustrates a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure.
FIG. 7 illustrates an exemplary application scenario where the operations 300 are implemented according to some embodiments of the disclosure.
FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application.
FIG. 9 is a block diagram of an image capturing device according to some embodiments of the application.
FIG. 10 is a block diagram of a chip according to some embodiments of the application.
DETAILED DESCRIPTION
Exemplary embodiments of the disclosure will now be described more fully with reference to the accompanying drawings, in which exemplary embodiments are shown. Exemplary embodiments of the disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of exemplary embodiments to those skilled in the art. In the drawings, the separate layers and regions are exaggerated for clarity. Like reference numerals in the drawings denote like elements, and thus their description will be omitted.
The described features, structures, or/and characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are disclosed to provide a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosure may be practiced without one or more of the specific details, or with other methods, components and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
In the present disclosure, terms such as "connected" and the like should be understood broadly, and may be directly connected or indirectly connected through an intermediate medium, unless otherwise specified. The specific meanings of the above terms in the present disclosure can be understood by those skilled in the art on a case-by-case basis.
Further, in the description of the present disclosure, the meaning of "a plurality" , "multiple" or "several" is at least two, for example, two, three, etc., unless specifically defined otherwise. "And/or" , describing the association relationship of the associated objects, indicates that there may be three relationships, such as A and/or B, which may indicate that there are three cases of single A, single B and both A and B. The symbol "/" generally indicates that the contextual object is an "or" relationship.
For brevity, the term “camera” is used herein to refer to an image capturing device such as one or more image sensors, an independent camera, an integrated camera, and/or any sensor adapted to document objects visually.
There exist image capturing systems including an image capturing device, for example a camera, where there is a need to estimate a pose of the camera in a coordinate system. Examples of coordinate systems include a world coordinate system and a coordinate system calibrated with a camera pose of a camera when  capturing images. A camera pose is a combination of position and orientation of a camera relative to a coordinate system. For example, a camera pose x may be expressed as a pair (R, t) , where R is a rotation matrix representing an orientation with respect to the coordinate system, and t is a translation vector representing the camera's position with respect to the coordinate system. Other possible representations of orientation are double-angle representations and tensors. Examples of such image capturing systems are medical systems including a camera inserted into a patient's body, for example by swallowing the camera, systems including autonomously moving devices, for example vehicles and robots, navigation applications and augmented reality systems. When a camera operates in unknown environments, without further information or sensors, estimation of a camera pose may involve three-dimensional (3D) reconstruction of a scene. This problem is known as “simultaneous localization and mapping” (SLAM) in computer vision and robotics communities.
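For concreteness, the following minimal Python sketch (not part of the disclosure; the function names and the single-axis rotation are illustrative assumptions) builds a toy (R, t) pose as described above and uses it to express a world-frame landmark in the camera frame.

```python
import numpy as np

def make_pose(yaw_rad, position):
    """Toy camera pose (R, t): R encodes orientation, t the camera position."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])  # rotation about the z-axis only
    return R, np.asarray(position, dtype=float)

def world_to_camera(R, t, point_world):
    """Express a world-frame 3D point in the camera frame."""
    return R.T @ (np.asarray(point_world, dtype=float) - t)

R, t = make_pose(np.pi / 4.0, [1.0, 2.0, 0.0])
print(world_to_camera(R, t, [3.0, 2.0, 1.0]))  # landmark seen from the camera
```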
A scene has image features, also known as landmarks. An image of the scene captured by a camera, sometimes referred to as a camera view, has observed image features representing the scene's landmarks. Typically, camera pose estimation is solved using bundle adjustment (BA) optimization. Bundle adjustment is a common approach to recover (by estimation) camera poses and 3D scene reconstruction given a sequence of images and possible measurements from additional sensors. A bundle adjustment optimization calculation aims to minimize re-projection errors across all available images and for all landmarks identified in the available images. A re-projection error for a landmark in an image is the difference between the location in the image of the landmark's observed image feature and the predicted location in the image of the landmark's observed image feature for a certain camera pose estimate.
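The re-projection error defined above can be written down directly for a pinhole camera model; the sketch below assumes an illustrative intrinsic matrix K and the (R, t) convention from the previous example, and is not the disclosure's implementation.

```python
import numpy as np

def reprojection_error(K, R, t, landmark_world, observed_px):
    """Pixel distance between an observed feature and the landmark's
    predicted projection under the camera pose estimate (R, t)."""
    p_cam = R.T @ (np.asarray(landmark_world, dtype=float) - t)
    uvw = K @ p_cam                      # pinhole projection
    predicted_px = uvw[:2] / uvw[2]      # perspective divide
    return np.linalg.norm(np.asarray(observed_px, dtype=float) - predicted_px)

K = np.array([[500.0, 0.0, 320.0],      # illustrative intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
err = reprojection_error(K, np.eye(3), np.zeros(3), [0.1, -0.2, 4.0], [332.0, 215.0])
print(err)  # bundle adjustment minimizes the sum of such errors over all images
```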
Hereinafter, the terms “visiting an area” , “mapping an area” and “observing an area” all mean capturing images of the area by a camera, and are used interchangeably.
The following description refers to a moving camera but applies to stationary cameras as well. Estimating the motion of a camera is also known as ego-motion estimation.
Maintaining a high level of estimation accuracy over long periods of time is challenging in environments deprived of a Global Positioning System (GPS) . Accuracy of estimations of camera poses and 3D structure for images captured over a period of time typically deteriorates over time due to accumulation of estimation errors. Estimation drift is the change in the accumulated estimation error over time. Estimation errors occur both when a camera re-observes previously mapped areas and when a camera continuously explores new areas. When a camera continuously observes new areas, bundle adjustment reduces to fixed-lag bundle adjustment, which typically results in rapid trajectory drift; trajectory drift is the change in the estimation of the camera's motion. Re-observation of an area is known as a loop-closure. When a camera re-observes previously mapped areas, estimation errors are typically reduced, but are still inevitable, even in the case of a loop-closure, in large-scale environments.
To suppress the localization’s accumulated error, the pose graph method has been widely adopted for its light computational cost. The pose graph concept aligns the current camera track and the old camera track when the camera revisits the same area. This pose graph method tries to increase the similarity among the tracks even though the tracks may have quite different patterns. Even when it takes 3D landmarks into consideration, the distance scale error between the 3D landmarks and the current 3D points built by a single camera makes things worse.
Iterative closest point (ICP) , used for more accurate alignment, is an algorithm employed to minimize the difference between two point clouds. In general, ICP builds normal vectors for each point and then aligns the clouds using matched point pairs. However, there are barriers to directly applying the ICP method because its time cost is quite high.
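For reference, the alignment step that ICP repeats after each re-matching round has a closed form; the sketch below is a simplified point-to-point variant (no normal vectors, illustrative names), not the specific ICP variant discussed above.

```python
import numpy as np

def align_matched_points(src, dst):
    """Closed-form rigid alignment (Kabsch/SVD) of matched point pairs:
    returns (R, t) minimizing sum ||R @ src_i + t - dst_i||^2."""
    src, dst = np.asarray(src, dtype=float), np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # guard against reflections
    t = dst_c - R @ src_c
    return R, t
```

A full ICP loop alternates this solve with nearest-neighbour re-matching, which is where the high time cost noted above comes from.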
Another approach is PnP (Perspective-n-Point) . When the camera revisits the same area, PnP re-projects the already built 3D landmarks onto the current camera image to correct the current camera pose. When the distance between the camera and a 3D landmark increases, the camera pose estimation error also increases.
Because the methods as described above only focus on the current (latest) frame pose correction, the re-corrected position may suddenly jump by a large translation. Also, they do not consider the distance scale error between the current estimated 3D points and the 3D landmarks.
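PnP itself is available off the shelf; a hedged OpenCV sketch of correcting the current pose from re-projected landmarks might look as follows (the data arrays are placeholders, not real measurements).

```python
import cv2
import numpy as np

# Placeholder correspondences: mapped 3D landmarks and their pixels in
# the current image (a real system obtains these from feature matching).
landmarks_3d = (np.random.rand(20, 3) * 5.0).astype(np.float32)
pixels_2d = (np.random.rand(20, 2) * 480.0).astype(np.float32)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec, inliers = cv2.solvePnPRansac(landmarks_3d, pixels_2d, K, None)
if ok:
    R_cur, _ = cv2.Rodrigues(rvec)  # corrected rotation of the current frame
    # rvec/tvec give a single-frame pose; as noted in the text, its error
    # grows with the camera-to-landmark distance.
```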
For example, FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment between current 3D points and landmarks. As shown in FIG. 1, the current frame pose 101 is corrected based on the perspective images of landmarks #1, #2 and #3. As indicated by the originating point of the three arrows in FIG. 2, the re-corrected position of the current frame pose jumps leftward by a large translation, that is, it substantially does not match the predicted pose or the camera track. Moreover, as indicated by the diamonds at the end of two arrows in FIG. 2, the distance scale error caused by the change of the current frame pose is not considered in the re-correction, thereby generating additional re-correction error.
To overcome the deficiencies of the methods described above, some embodiments of the disclosure propose a pose estimation scheme which tightly couples the current image frame to its previous image frame. Because the distance between the current track and each camera in a tightly coupled body is considered an error, the sudden-jump translation issue can be suppressed.
Some embodiments of the disclosure also propose a pose estimation scheme which considers the distance scale error between the 3D landmarks and the current 3D points when minimizing those errors and estimating the 6 DoF (Degrees of Freedom) of a tightly coupled body. One example of the pose estimation scheme is performed through the following minimization:
minimize {3D_error_term + body_pose_error_term + 3D_project_error_term} ,
where 3D_error_term = 3D landmark - scale * current 3D point,
body_pose_error_term = scale * body pose - current body pose, and
3D_project_error_term = 3D landmark projection on image - current image point.
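As a sketch of how the three terms could be stacked for a non-linear least-squares solver (argument names and array shapes are assumptions of this illustration, not the disclosure's implementation):

```python
import numpy as np

def residuals(scale, landmarks_3d, current_points_3d,
              body_poses, current_body_poses,
              landmark_projections_px, current_points_px):
    """Stack the three error terms of the minimization above into one
    residual vector for a least-squares solver."""
    e_3d = landmarks_3d - scale * current_points_3d       # 3D_error_term
    e_body = scale * body_poses - current_body_poses      # body_pose_error_term
    e_proj = landmark_projections_px - current_points_px  # 3D_project_error_term
    return np.concatenate([e_3d.ravel(), e_body.ravel(), e_proj.ravel()])
```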
With reference to the accompanying drawings, a method and an apparatus for estimating a pose of an image capturing device provided by the embodiments of the disclosure will be specifically described below.
Reference is now made to FIG. 3, showing a sequence diagram of an optional flow of operations 300 according to some embodiments of the disclosure. Such embodiments include at least one hardware processor and a single camera. In such embodiments, the at least one hardware processor acquires 301 a series of images of multiple landmarks from at least one pose. In such embodiments the at least one pose is unknown. There may be a need to compute a set of estimations for the at least one pose.
In some embodiments the at least one hardware processor determines 303, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2. For example, N = 3. In other words, the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto.
In some embodiments the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
The conventional solution to recover the camera’s 6 DoF (Degrees of Freedom) and scale around a loop-closure area uses a single-frame-based PnP method when calculating an absolute pose for the pose graph. When the landmark depth accuracy from SLAM is poor and the overlapping area between the current image frame and the re-projected 3D landmarks is less than 60%, the single-frame-based PnP produces jittering motion (wrong 6DoF) .
Based on the method provided by the embodiments of the disclosure, it is assumed that the recently acquired N images are acquired by N cameras mounted on a tight and rigid body system, from which the 6DoF and scale are estimated. Accordingly, even if pose error accumulates significantly, the accumulated error of the tightly coupled N images is small relative to the whole accumulated pose.
In some embodiments, the mapping image set is acquired during the operations for estimating the pose of the image capturing device. In other words, the image capturing device continuously captures images of the multiple landmarks and extracts feature information for each captured image to construct the mapping image set, which is used for estimating the pose for the recently acquired N images in an on-board way. In some alternative embodiments, the mapping image set is acquired separately prior to the operations for estimating the pose of the image capturing device and, thus, used for estimating the pose for the recently acquired N images in an off-board way.
Reference is now made to FIG. 4, showing a sequence diagram of an optional flow of operations 400 for constructing the mapping image set according to some embodiments of the disclosure. Such embodiments include at least one hardware processor, a single camera and a VIO (visual inertial odometry) unit. In such embodiments, the at least one hardware processor controls the single camera to capture 401 a plurality of images of multiple landmarks at different poses of the image capturing device; and controls the VIO unit to extract 403 feature information for each image in the plurality of images. In some embodiments, the feature information includes 6DoF information of each image and depth information of the multiple landmarks in each image.
In some embodiments, the VIO unit includes a visual odometry unit and an inertial measurement unit (IMU) .
An exemplary algorithm of the visual odometry unit is as follows. A new frame image is acquired first, ORB (Oriented FAST and Rotated BRIEF) feature points are extracted from the image, and the corresponding BRIEF descriptors of the feature points are calculated. Then, matching is performed between the feature points of the recently acquired image and the feature points of previous image frames. At the same time, matched feature points are filtered using the RANSAC algorithm. Finally, the rotation and translation between the current and previous image frames are obtained by minimizing the re-projection error, so as to obtain the current pose of the image capturing device.
In some embodiments, the IMU is configured to obtain the acceleration and angular speed of the image capturing device by using a gyroscope and an accelerometer and, then, calculate the current pose of the image capturing device through an integration operation. Details thereof are omitted here for brevity.
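A hedged OpenCV sketch of this visual-odometry front end follows; the parameter values are illustrative, and a production pipeline would add keyframe selection and scale handling.

```python
import cv2
import numpy as np

def relative_pose(img_prev, img_cur, K):
    """ORB detection, descriptor matching, RANSAC filtering, and pose
    recovery between two frames, mirroring the steps described above."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_cur, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC outlier filtering happens inside findEssentialMat
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # rotation and unit-scale translation between the frames
```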
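For illustration only, a naive IMU propagation step (no bias, noise, or pre-integration handling, which a real VIO unit would require) could look like this:

```python
import numpy as np

def integrate_imu(R, p, v, gyro, accel, dt):
    """Propagate orientation R, position p and velocity v over one IMU
    sample: gyro in rad/s (body frame), accel in m/s^2 (specific force)."""
    g = np.array([0.0, 0.0, -9.81])        # world-frame gravity
    wx, wy, wz = gyro * dt
    dR = np.array([[1.0, -wz,  wy],        # first-order rotation update
                   [ wz, 1.0, -wx],
                   [-wy,  wx, 1.0]])
    R_new = R @ dR
    a_world = R @ accel + g                # gravity-compensated acceleration
    v_new = v + a_world * dt
    p_new = p + v * dt + 0.5 * a_world * dt * dt
    return R_new, p_new, v_new
```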
Reference is now made to FIG. 5, showing a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure. Such embodiments include at least one hardware processor and a single camera. In such embodiments, the at least one hardware processor extracts 501 a plurality of observed image features of the multiple landmarks from a plurality of images captured by the single camera from at least one pose. In such embodiments the at least one pose is unknown. There may be a need to compute a set of estimations for the at least one pose. Optionally, the observed image features may be expressed in a camera coordinate system.
In some embodiments the at least one hardware processor extracts 503 the plurality of observed image features by applying image matching algorithms to the images. The image matching algorithms may include feature scale detection algorithms that produce scale information. Examples of image matching algorithms are SIFT and RANSAC. The at least one hardware processor may identify 505 among the extracted plurality of observed image features at least one common observed image feature documented in at least some of the images. Optionally, the image matching algorithms are used for identifying the at least one common feature.
In some embodiments, a corner feature matching and tracking method, for example DBoW2, which is a bag-of-words place recognition approach, can be utilized to implement the operations 500. Based on the operations 500, a plurality of corner features are detected and described by BRIEF descriptors, which are treated as visual words to query the mapping image set. DBoW2 can return loop-closure candidates after temporal and geometrical consistency checks. All BRIEF descriptors may be kept for feature retrieval, but the raw images can be discarded to reduce memory consumption. When a loop is detected, the connection between the recently acquired N images and the N matched images is established by retrieving feature correspondences, which are found by BRIEF descriptor matching.
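DBoW2 itself is a C++ library; as a rough illustration of the bag-of-words idea only (a toy vocabulary and L1 scoring, not DBoW2's actual API or its consistency checks), loop-closure candidates can be ranked like this:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize binary descriptors (uint8 rows) against a small visual
    vocabulary by Hamming distance and return a normalized histogram."""
    xor = descriptors[:, None, :] ^ vocabulary[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)   # Hamming distances
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def loop_candidates(query_hist, database_hists, top_k=3):
    """Rank mapping images by bag-of-words similarity (higher is better)."""
    scores = [1.0 - 0.5 * np.abs(query_hist - h).sum() for h in database_hists]
    return np.argsort(scores)[::-1][:top_k]
```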
Reference is now made to FIG. 6, showing a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure. Such embodiments include at least one hardware processor. In such embodiments, the at least one hardware processor calculates 601 a pose (R, T) based on the following equations:
P_i = R (p_i - p_ref) + T;
(P_i - L_i) · (P_j - L_j) - |P_i - L_i| |P_j - L_j| (ray_i · ray_j) = 0;
(R * ray_i) · (P_i - L_i) - |P_i - L_i| = 0,
where R is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; T is a translation vector representing a position of the image capturing device with respect to the world coordinate system; p_ref is a reference position in a body system when considering the recently acquired N images as a whole; p_i is a pose translation determined for the i-th image of the N matched images; P_i is a position of the i-th image of the recently acquired N images; L_i is a position of the i-th landmark of the multiple landmarks; “·” represents an inner product operation; and ray_i represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image.
In some embodiments, the at least one hardware processor outputs 603 a pose (r_ref, p_ref) for the most recently acquired image as:
r_ref = R * R_ref; and
p_ref = T,
where R_ref is a rotation matrix determined for the most recently acquired image.
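A sketch of solving these constraints numerically with non-linear least squares follows; the pairing of (i, j) as consecutive indices, the data layout, and the rotation-vector parameterization of R are assumptions of this illustration, not the disclosure's stated solver.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def relocalize(p_body, p_ref_in, L, rays, R_ref):
    """Estimate (R, T) from the constraints above, then return the
    re-localized pose (r_ref, p_ref) = (R @ R_ref, T)."""
    p_body, L, rays = (np.asarray(a, dtype=float) for a in (p_body, L, rays))

    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        T = x[3:]
        P = (R @ (p_body - p_ref_in).T).T + T        # P_i = R (p_i - p_ref) + T
        v = P - L                                    # P_i - L_i
        n = np.linalg.norm(v, axis=1)
        res = [v[i] @ v[i + 1] - n[i] * n[i + 1] * (rays[i] @ rays[i + 1])
               for i in range(len(P) - 1)]           # pairwise ray-angle constraint
        res += [(R @ rays[i]) @ v[i] - n[i]
                for i in range(len(P))]              # per-image ray alignment
        return np.array(res)

    sol = least_squares(residuals, x0=np.zeros(6))
    R = Rotation.from_rotvec(sol.x[:3]).as_matrix()
    return R @ R_ref, sol.x[3:]                      # (r_ref, p_ref)
```

Because all N images constrain one shared (R, T), a single bad frame cannot pull the solution the way a single-frame PnP can, which is the sudden-jump suppression described above.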
Based on the method provided by the embodiments of the disclosure, it is assumed that the recently acquired N images are acquired by N cameras mounted on a tight and rigid body system, from which the 6DoF and scale are estimated. Accordingly, even if pose error accumulates significantly, the accumulated error of the tightly coupled N images is small relative to the whole accumulated pose.
Reference is now made to FIG. 7, showing an example application scenario in which the operations 300 are implemented according to some embodiments of the disclosure. In such embodiments, the at least one hardware processor acquires 301 a series of images of landmarks $L_1$, $L_2$, $L_3$ from a series of poses 0, 1, 2, …, 12. In such embodiments the pose 12 is unknown.
In some embodiments the at least one hardware processor determines 303, for recently acquired 3 images, 3 matched images by matching each of the 3 images with a mapping image set. In other words, the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto. As shown in FIG. 7, feature correspondences are found between the three images at poses 10, 11, 12 and three matched images at poses 1, 3, 4, respectively.
In some embodiments the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired 3 images to known pose information of the 3 matched images.
Based on the method provided by the embodiments of the disclosure, the recently acquired 3 images are assumed to be acquired by 3 cameras mounted on a rigid body system, from which the 6DoF pose and scale are estimated. Accordingly, even when pose error accumulates considerably over the whole trajectory, the error accumulated within the tightly coupled 3 images remains small relative to the overall accumulated pose error.
The method embodiments of the application have been described in detail above with reference to FIG. 3 to FIG. 7. The apparatus/device embodiments of the application will be described in detail below with reference to FIG. 8 and FIG. 9. It should be understood that the apparatus/device embodiments correspond to the method embodiments, and similar descriptions may refer to the method embodiments.
FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application. As shown in FIG. 8, the apparatus 800 includes an acquiring module 801, a determining module 803 and an outputting module 805.
The acquiring module 801 is configured to acquire a series of images of multiple landmarks from at least one pose;
The determining module 803 is configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and
The outputting module 805 is configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
Optionally, in some embodiments, the apparatus 800 further includes a constructing module (not shown). The constructing module is configured to acquire the mapping image set by capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and extracting feature information for each image in the plurality of images.
Optionally, in some embodiments, the constructing module is specifically configured to: calculate 6 degrees of freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
Optionally, in some embodiments, the constructing module is specifically configured to: calculate, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
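By way of a non-limiting illustration, the corner feature matching and tracking ingredient of such a VIO front end may be sketched in Python with OpenCV as follows. The frames prev_img and next_img are assumed consecutive grayscale images; IMU fusion and depth computation, which a full VIO pipeline would add, are omitted here.

```python
import cv2
import numpy as np

# Detect strong corners in the previous frame (Shi-Tomasi criterion).
corners = cv2.goodFeaturesToTrack(prev_img, maxCorners=300,
                                  qualityLevel=0.01, minDistance=10)

# Track the corners into the next frame with pyramidal Lucas-Kanade optical flow.
next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, next_img, corners, None)

# Keep only successfully tracked corner pairs.
tracked_prev = corners[status.ravel() == 1]
tracked_next = next_pts[status.ravel() == 1]
# These correspondences, fused with IMU measurements, feed the 6DoF and
# depth estimation of the VIO back end.
```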
Optionally, in some embodiments, the determining module 803 is specifically configured to: perform matching of corner features between each of the N images and each corresponding one of the N matched images.
Optionally, in some embodiments, the outputting module 805 is specifically configured to: calculate a pose (R, T) based on the following equations:
$P_i = R\,(p_i - p_{ref}) + T$;
$(P_i - L_i) \cdot (P_j - L_j) - |P_i - L_i|\,|P_j - L_j|\,(ray_i \cdot ray_j) = 0$;
$R\,ray_i \cdot (P_i - L_i) - |P_i - L_i| = 0$,
where $R$ is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; $T$ is a translation vector representing a position of the image capturing device with respect to the world coordinate system; $p_{ref}$ is a reference position in a body system when considering the recently acquired N images as a whole; $p_i$ is a pose translation determined for the i-th image of the N matched images; $P_i$ is a position of the i-th image of the recently acquired N images; $L_i$ is a position of the i-th landmark of the multiple landmarks; the operator $\cdot$ represents an inner product; and $ray_i$ represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
output a pose $(r_{ref}, p_{ref})$ for the most recently acquired image as:
$r_{ref} = R \cdot R_{ref}$; and
$p_{ref} = T$,
where $R_{ref}$ is a rotation matrix determined for the most recently acquired image.
Based on the apparatus provided by the embodiments of the disclosure, the recently acquired N images are assumed to be acquired by N cameras mounted on a rigid body system, from which the 6DoF pose and scale are estimated. Accordingly, even when pose error accumulates considerably over the whole trajectory, the error accumulated within the tightly coupled N images remains small relative to the overall accumulated pose error.
FIG. 9 is a block diagram illustrating an image capturing device 900 according to some embodiments of the application. The image capturing device 900 shown in FIG. 9 includes a processor 910, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
Optionally, as shown in FIG. 9, the image capturing device 900 may further include a memory 920. The processor 910 may call and run the computer program from the memory 920 to implement the method according to the embodiments of the application.
The memory 920 may be a separate device independent of the processor 910, or may be integrated in the processor 910.
Optionally, as shown in FIG. 9, the image capturing device 900 may further include a transceiver 930, and the processor 910 may control the transceiver 930 to communicate with other devices. Specifically, the transceiver 930 may send information or data to other devices, or receive information or data sent by other devices.
The transceiver 930 may include a transmitter and a receiver. The transceiver 930 may further include antennas, and the number of antennas may be one or more.
FIG. 10 is a block diagram illustrating a chip according to some embodiments of the application. The chip 1000 shown in FIG. 10 includes a processor 1010, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
Optionally, as shown in FIG. 10, the chip 1000 may further include a memory 1020. The processor 1010 may call and run the computer program from the memory 1020 to implement the method according to the embodiments of the application.
The memory 1020 may be a separate device independent of the processor 1010, or may be integrated in the processor 1010.
Optionally, the chip 1000 may further include an input interface 1030. The processor 1010 may control the input interface 1030 to communicate with other devices or chips. Specifically, the processor 1010 may acquire information or data sent by other devices or chips.
Optionally, the chip 1000 may further include an output interface 1040. The processor 1010 may control the output interface 1040 to communicate with other devices or chips. Specifically, the processor 1010 may output information or data to the other devices or chips.
Optionally, the chip can be applied to the image capturing device according to the embodiments of the application, and the chip can implement the corresponding process implemented by the image capturing device in the method according to the embodiments of the application. For brevity, details are not described herein.
It should be understood that the chip mentioned in some embodiments of the application may also be referred to as a system-level chip, a system chip, a chip system or a system-on-chip.
It should be understood that the processor in the embodiments of the disclosure may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or by instructions in the form of software. The processor mentioned in some embodiments of the application may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can achieve or implement the methods, steps and block diagrams disclosed in embodiments of the disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in the embodiments of the disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The memory mentioned in some embodiments of the application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. In some embodiments, the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of exemplary but not restrictive illustration, many forms of RAM are available, for example, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), direct Rambus random access memory (DR RAM), and so on. It should be noted that the memories in the systems and methods described herein are intended to include, but are not limited to, these and any other suitable types of memories.
It should be understood that the foregoing memories are exemplary rather than restrictive; the memory in the embodiments of the disclosure is intended to include, but is not limited to, the above and any other suitable types of memory.
Embodiments of the disclosure further provide a computer readable storage medium, which is configured to store a computer program.
Optionally, the computer readable storage medium may be applied to the network device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Optionally, the computer readable storage medium may be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the mobile terminal/terminal device in each method in some  embodiments of the application. For the sake of brevity, details will not be repeated here.
A computer program product is also provided in some embodiments of the application, including computer program instructions.
Optionally, the computer program product can be applied to the network device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Optionally, the computer program product can be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
A computer program is also provided in some embodiments of the application.
Optionally, the computer program may be applied to the network device in some embodiments of the application. When the computer program is run on a computer, the computer is caused to execute a corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Optionally, the computer program may be applied to the mobile terminal/terminal device in some embodiments of the application. When the computer program is run on a computer, the computer is caused to execute a corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Those of ordinary skill in the art may realize that the units and algorithm steps of each example described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those of ordinary skill in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working processes of the systems, devices, and units described above can refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments as described above are only exemplary. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation. For example, multiple  units or components can be combined or integrated into another system, or some features can be ignored or not carried out. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of the disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the essential part of the technical solution of this application, or the part that contributes to the existing technology, or other parts of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in some embodiments of the application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above content is only a specific implementation of the embodiments of the application, without limiting the protection scope of the embodiments of the application. Any modification or replacement conceived by those skilled in the art within the technical scope disclosed in some embodiments of the application should be covered within the protection scope of the embodiments of the application. Therefore, the protection scope of the embodiments of the application shall be subject to the protection scope of the claims.

Claims (17)

  1. A method for estimating a pose of an image capturing device, comprising:
    acquiring a series of images of multiple landmarks from at least one pose;
    determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and
    outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  2. The method of claim 1, further comprising acquiring the mapping image set by:
    capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and
    extracting feature information for each image in the plurality of images.
  3. The method of claim 2, wherein the extracting feature information for each image in the plurality of images comprises:
    calculating 6 degrees of freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
  4. The method of claim 3, wherein the calculating step comprises:
    calculating, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
  5. The method of any one of claims 1-4, wherein the matching each of the N images with the mapping image set comprises:
    performing matching of corner features between each of the N images and each corresponding one of the N matched images.
  6. The method of any one of claims 1-4, wherein the outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images comprises:
    calculating a pose (R, T) based on the following equations:
    $P_i = R\,(p_i - p_{ref}) + T$;
    $(P_i - L_i) \cdot (P_j - L_j) - |P_i - L_i|\,|P_j - L_j|\,(ray_i \cdot ray_j) = 0$;
    $R\,ray_i \cdot (P_i - L_i) - |P_i - L_i| = 0$,
    where $R$ is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; $T$ is a translation vector representing a position of the image capturing device with respect to the world coordinate system; $p_{ref}$ is a reference position in a body system when considering the recently acquired N images as a whole; $p_i$ is a pose translation determined for the i-th image of the N matched images; $P_i$ is a position of the i-th image of the recently acquired N images; $L_i$ is a position of the i-th landmark of the multiple landmarks; the operator $\cdot$ represents an inner product; and $ray_i$ represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
    outputting a pose $(r_{ref}, p_{ref})$ for the most recently acquired image as:
    $r_{ref} = R \cdot R_{ref}$; and
    $p_{ref} = T$,
    where $R_{ref}$ is a rotation matrix determined for the most recently acquired image.
  7. An apparatus for estimating a pose of an image capturing device, comprising:
    an acquiring module, configured to acquire a series of images of multiple landmarks from at least one pose;
    a determining module, configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and
    an outputting module, configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  8. The apparatus of claim 7, further comprising a constructing module, configured to acquire the mapping image set by:
    capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and
    extracting feature information for each image in the plurality of images.
  9. The apparatus of claim 8, wherein the constructing module is specifically configured to:
    calculate 6 degrees of freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
  10. The apparatus of claim 9, wherein the constructing module is specifically configured to:
    calculate, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
  11. The apparatus of any one of claims 7-10, wherein the determining module is specifically configured to:
    perform matching of corner features between each of the N images and each corresponding one of the N matched images.
  12. The apparatus of any one of claims 7-10, wherein the outputting module is specifically configured to:
    calculate a pose (R, T) based on the following equations:
    $P_i = R\,(p_i - p_{ref}) + T$;
    $(P_i - L_i) \cdot (P_j - L_j) - |P_i - L_i|\,|P_j - L_j|\,(ray_i \cdot ray_j) = 0$;
    $R\,ray_i \cdot (P_i - L_i) - |P_i - L_i| = 0$,
    where $R$ is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; $T$ is a translation vector representing a position of the image capturing device with respect to the world coordinate system; $p_{ref}$ is a reference position in a body system when considering the recently acquired N images as a whole; $p_i$ is a pose translation determined for the i-th image of the N matched images; $P_i$ is a position of the i-th image of the recently acquired N images; $L_i$ is a position of the i-th landmark of the multiple landmarks; the operator $\cdot$ represents an inner product; and $ray_i$ represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
    output a pose $(r_{ref}, p_{ref})$ for the most recently acquired image as:
    $r_{ref} = R \cdot R_{ref}$; and
    $p_{ref} = T$,
    where $R_{ref}$ is a rotation matrix determined for the most recently acquired image.
  13. An image capturing device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to call and run the computer program stored in the memory, thereby implementing the method according to any one of claims 1 to 6.
  14. A chip, comprising a processor, wherein the processor is configured to call and run a computer program from a memory, thereby causing an apparatus provided with the chip to implement the method according to any one of claims 1 to 6.
  15. A computer readable storage medium, being used for storing a computer program, wherein the computer program causes a computer to implement the method according to any one of claims 1 to 6.
  16. A computer program product, comprising computer program instructions that cause a computer to implement the method according to any one of claims 1 to 6.
  17. A computer program, causing a computer to implement the method according to any one of claims 1 to 6.