WO2021160182A1 - Method and apparatus for estimating pose of image capturing device - Google Patents

Method and apparatus for estimating pose of image capturing device

Info

Publication number
WO2021160182A1
WO2021160182A1 (PCT/CN2021/076640)
Authority
WO
WIPO (PCT)
Prior art keywords
images
image
pose
ref
capturing device
Prior art date
Application number
PCT/CN2021/076640
Other languages
French (fr)
Inventor
Jaechoon CHON
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180014779.0A (published as CN115210533A)
Publication of WO2021160182A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/005 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 with correlation of navigation data from several sources, e.g. map or contour matching

Definitions

  • the present disclosure in some embodiments thereof, relates to computer vision, and more specifically, but not exclusively, to a method and an apparatus for estimating a pose of an image capturing device.
  • the key point is determining the mobile system’s spatial localization in real time, for example, estimating a pose of an image capturing device (e.g., a camera) provided on the mobile system, where the pose may include its position and rotation.
  • SLAM: simultaneous localization and mapping
  • accumulated error may increase over time.
  • a current image pose should be re-corrected based on landmarks when the camera re-visits the same area.
  • the present disclosure provides a method and an apparatus for estimating a pose of an image capturing device.
  • a method for estimating a pose of an image capturing device including: acquiring a series of images of multiple landmarks from at least one pose; determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • an apparatus for estimating a pose of an image capturing device comprising: an acquiring module, configured to acquire a series of images of multiple landmarks from at least one pose; a determining module, configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and an outputting module, configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • an image capturing device including: a processor and a memory.
  • the memory is configured to store a computer program
  • the processor is configured to call and run the computer program stored in the memory, thereby implementing the method according to the foregoing first aspect or any embodiment thereof.
  • a chip configured to implement the method according to the foregoing first aspect or any embodiment thereof.
  • the chip includes a processor, configured to call and run a computer program from a memory, thereby causing an apparatus provided with the chip to implement the method according to the foregoing first aspect or any embodiment thereof.
  • a computer readable storage medium being used for storing a computer program, wherein the computer program causes a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
  • a computer program product including computer program instructions that cause a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
  • a computer program which, when running on a computer, causes the computer to implement the method according to the foregoing first aspect or any embodiment thereof.
  • FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment between current 3D points and landmarks.
  • FIG. 3 illustrates a sequence diagram of an optional flow of operations for estimating a pose of an image capturing device according to some embodiments of the disclosure.
  • FIG. 4 illustrates a sequence diagram of an optional flow of operations 400 for constructing the mapping image set according to some embodiments of the disclosure.
  • FIG. 5 illustrates a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure.
  • FIG. 6 illustrates a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure.
  • FIG. 7 illustrates an exemplary application scenario where the operations 300 are implemented according to some embodiments of the disclosure.
  • FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application.
  • FIG. 9 is a block diagram of an image capturing device according to some embodiments of the application.
  • FIG. 10 is a block diagram of a chip according to some embodiments of the application.
  • the meaning of “a plurality” , “multiple” or “several” is at least two, for example, two, three, etc., unless specifically defined otherwise.
  • "And/or” describing the association relationship of the associated objects, indicates that there may be three relationships, such as A and/or B, which may indicate that there are three cases of single A, single B and both A and B.
  • the symbol “/” generally indicates that the contextual object is an "or" relationship.
  • the term “camera” is used herein to refer to an image capturing device such as one or more image sensors, an independent camera, an integrated camera, and/or any sensor adapted to document objects visually.
  • image capturing systems including an image capturing device, for example a camera, where there is a need to estimate a pose of the camera in a coordinate system.
  • coordinate systems include a world coordinate system and a coordinate system calibrated with a camera pose of a camera when capturing images.
  • a camera pose is a combination of position and orientation of a camera relative to a coordinate system.
  • a camera pose x may be expressed as a pair (R, t) , where R is a rotation matrix representing an orientation with respect to the coordinate system, and t is a translation vector representing the camera's position with respect to the coordinate system.
  • R is a rotation matrix representing an orientation with respect to the coordinate system
  • t is a translation vector representing the camera's position with respect to the coordinate system.
  • Other possible representations of orientation are double-angle representations and tensors.
  • Examples of such image capturing systems are medical systems including a camera inserted into a patient's body, for example by swallowing the camera, systems including autonomously moving devices, for example vehicles and robots, navigation applications and augmented reality systems.
  • 3D: three-dimensional
  • a scene has image features, also known as landmarks.
  • An image of the scene captured by a camera sometimes referred to as a camera view, has observed image features representing the scene's landmarks.
  • camera pose estimation is solved using bundle adjustment (BA) optimization.
  • Bundle adjustment is a common approach to recover (by estimation) camera poses and 3D scene reconstruction given a sequence of images and possible measurements from additional sensors.
  • a bundle adjustment optimization calculation aims to minimize re-projection errors across all available images and for all landmarks identified in the available images.
  • a re-projection error for a landmark in an image is the difference between the location in the image of the landmark's observed image feature and the predicted location in the image of the landmark's observed image feature for a certain camera pose estimate.
  • the terms “visiting an area” , “mapping an area” and “observing an area” all mean capturing images of the area by a camera, and are used interchangeably.
  • the pose graph method has been widely adopted for its light computational cost.
  • the pose graph concept aligns the current camera track and the old camera track when the camera revisits the same area. This pose graph method tries to increase the similarity among the tracks even though the tracks may have quite different patterns. Even when it takes 3D landmarks into consideration, the distance scale error between the 3D landmarks and the current 3D points built by a single camera makes things worse.
  • ICP: iterative closest point
  • PnP: Perspective-n-Point
  • the methods as described above only focus on the current (latest) frame pose correction, so the re-corrected position may suddenly jump by a large translation. Also, they do not consider the distance scale error between the current estimated 3D points and the 3D landmarks.
  • FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment between current 3D points and landmarks.
  • the current frame pose 101 is corrected based on the perspective images of landmarks #1, #2 and #3.
  • the re-corrected position of the current frame pose jumps leftward by a large translation, that is, it substantially does not match the predicted pose or the camera track.
  • the distance scale error caused by the change of the current frame pose is not considered in the re-correction, thereby generating additional re-correction error.
  • some embodiments of the disclosure propose a pose estimation scheme which tightly couples the current image frame to its previous image frame. Because the distance between the current track and each camera in a tightly coupled body is considered an error, the sudden-jump translation issue can be suppressed.
  • some embodiments of the disclosure also propose a pose estimation scheme which considers the distance scale error between the 3D landmarks and the current 3D points when minimizing those errors and estimating the 6 DoF (Degrees of Freedom) of a tightly coupled body.
  • One example of the pose estimation scheme minimizes the sum {3D_error_term + body_pose_error_term + 3D_project_error_term} , where:
  • 3D_error_term = 3D landmark - scale * current 3D point;
  • body_pose_error_term = scale * body pose - current body pose; and
  • 3D_project_error_term = 3D landmark projection on image - current image point.
  • FIG. 3, showing a sequence diagram of an optional flow of operations 300 according to some embodiments of the disclosure.
  • Such embodiments include at least one hardware processor and a single camera.
  • the at least one hardware processor acquires 301 a series of images of multiple landmarks from at least one pose.
  • the at least one pose is unknown. There may be a need to compute a set of estimations for the at least one pose.
  • the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto.
  • the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • the mapping image set is acquired during the operations for estimating the pose of the image capturing device.
  • the image capturing device continuously captures images of the multiple landmarks and extracts feature information for each captured image to construct the mapping image set, which is used for estimating the pose for the recently acquired N images in an on-board way.
  • the mapping image set is acquired separately prior to the operations for estimating the pose of the image capturing device and, thus, used for estimating the pose for the recently acquired N images in an off-board way.
  • Such embodiments include at least one hardware processor, a single camera and a VIO (visual inertial odometry) unit.
  • the at least one hardware processor controls the single camera to capture 401 a plurality of images of multiple landmarks at different poses of the image capturing device; and controls the VIO unit to extract 403 feature information for each image in the plurality of images.
  • the feature information includes 6DoF information of each image and depth information of the multiple landmarks in each image.
  • the VIO unit includes a visual odometry unit and an inertial measurement unit (IMU) .
  • IMU: inertial measurement unit
  • An exemplary algorithm of the visual odometry unit is as follows. A new frame image is acquired first, ORB (Oriented FAST and Rotated BRIEF) feature points are extracted from the image, and the corresponding BRIEF descriptors of the feature points are calculated. Then, matching is performed between the feature points of the recently acquired image and the feature points of previous image frames. At the same time, matched feature points are filtered using the RANSAC algorithm. Finally, the rotation and translation between the current and previous image frames are obtained by minimizing the re-projection error, so as to obtain the current pose of the image capturing device.
  • ORB: Oriented FAST and Rotated BRIEF
  • the IMU is configured to obtain the acceleration and angular speed of the image capturing device by using a gyroscope and an accelerometer and, then, calculate the current pose of the image capturing device through an integration operation. Details thereof are omitted here for brevity.
  • FIG. 5, showing a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure.
  • Such embodiments include at least one hardware processor and a single camera.
  • the at least one hardware processor extracts 501 a plurality of observed image features of the multiple landmarks from a plurality of images captured by the single camera from at least one pose.
  • the at least one pose is unknown.
  • the observed image features may be expressed in a camera coordinate system.
  • the at least one hardware processor extracts 503 the plurality of observed image features by applying image matching algorithms to the images.
  • the image matching algorithms may include feature scale detection algorithms that produce scale information. Examples of image matching algorithms are SIFT and RANSAC.
  • the at least one hardware processor may identify 505 among the extracted plurality of observed image features at least one common observed image feature documented in at least some of the images.
  • the image matching algorithms are used for identifying the at least one common feature.
  • a corner feature matching and tracking method, for example DBoW2, which is a bag-of-words place recognition approach, can be utilized to implement the operations 500.
  • DBoW2 can return loop-closure candidates after temporal and geometrical consistency checks. All BRIEF descriptors may be kept for feature retrieval, but the raw images can be discarded to reduce memory consumption.
  • the connection between the recently acquired N images and the N matched images is established by retrieving feature correspondences, which are found by BRIEF descriptor matching.
  • FIG. 6, showing a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure.
  • Such embodiments include at least one hardware processor.
  • the at least one hardware processor calculates 601 a pose (R, T) based on the following equations:
  • P_i = R (p_i - p_ref) + T;
  • (P_i - L_i) · (P_j - L_j) - |P_i - L_i| |P_j - L_j| (ray_i · ray_j) = 0;
  • (R * ray_i) · (P_i - L_i) - |P_i - L_i| = 0,
  • where R is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system;
  • T is a translation vector representing a position of the image capturing device with respect to the world coordinate system;
  • p_ref is a reference position in a body system when considering the recently acquired N images as a whole;
  • p_i is a pose translation determined for the i-th image of the N matched images;
  • P_i is a position of the i-th image of the recently acquired N images;
  • L_i is a position of the i-th landmark of the multiple landmarks;
  • “·” represents an inner product operation; and
  • ray_i represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image.
  • the at least one hardware processor then outputs 603 a pose (r_ref, p_ref) for the most recently acquired image as r_ref = R * R_ref and p_ref = T, where R_ref is a rotation matrix determined for the most recently acquired image.
  • the at least one hardware processor acquires 301 a series of images of landmarks L_1, L_2, L_3 from a series of poses 0, 1, 2, ..., 12.
  • the pose 12 is unknown.
  • the at least one hardware processor determines 303, for recently acquired 3 images, 3 matched images by matching each of the 3 images with a mapping image set.
  • the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto.
  • as shown in FIG. 7, feature correspondences are found between the three images at poses 10, 11, 12 and three matched images at poses 1, 3, 4, respectively.
  • the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired 3 images to known pose information of the 3 matched images.
  • FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application. As shown in FIG. 8, the apparatus 800 includes an acquiring module 801, a determining module 803 and an outputting module 805.
  • the acquiring module 801 is configured to acquire a series of images of multiple landmarks from at least one pose;
  • the determining module 803 is configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2;
  • the outputting module 805 is configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  • the apparatus 800 further includes a constructing module (not shown) .
  • the constructing module is configured to acquire the mapping image set by capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and extracting feature information for each image in the plurality of images.
  • the constructing module is specifically configured to: calculate 6 degrees-of-freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
  • 6DoF: six degrees of freedom
  • the constructing module is specifically configured to: calculate, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
  • VIO: visual inertial odometry
  • IMU: inertial measurement unit
  • the determining module 803 is specifically configured to: perform matching of corner features between each of the N images and each corresponding one of the N matched images.
  • the outputting module 805 is specifically configured to: calculate a pose (R, T) based on the following equations:
  • P_i = R (p_i - p_ref) + T;
  • (P_i - L_i) · (P_j - L_j) - |P_i - L_i| |P_j - L_j| (ray_i · ray_j) = 0;
  • (R * ray_i) · (P_i - L_i) - |P_i - L_i| = 0,
  • where R is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system;
  • T is a translation vector representing a position of the image capturing device with respect to the world coordinate system;
  • p_ref is a reference position in a body system when considering the recently acquired N images as a whole;
  • p_i is a pose translation determined for the i-th image of the N matched images;
  • P_i is a position of the i-th image of the recently acquired N images;
  • L_i is a position of the i-th landmark of the multiple landmarks;
  • “·” represents an inner product operation;
  • ray_i represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
  • R_ref is a rotation matrix determined for the most recently acquired image, so that the output pose is r_ref = R * R_ref and p_ref = T.
  • FIG. 9 is a block diagram illustrating an image capturing device 900 according to some embodiments of the application.
  • the image capturing device 900 shown in FIG. 9 includes a processor 910, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
  • the image capturing device 900 may further include a memory 920.
  • the processor 910 may call and run the computer program from the memory 920 to implement the method according to the embodiments of the application.
  • the memory 920 may be a separate device independent of the processor 910, or may be integrated in the processor 910.
  • the image capturing device 900 may further include a transceiver 930, and the processor 910 may control the transceiver 930 to communicate with other devices. Specifically, it may send information or data to other devices, or receive information or data sent by other devices.
  • the transceiver 930 may include a transmitter and a receiver.
  • the transceiver 930 may further include antennas, and the number of antennas may be one or more.
  • FIG. 10 is a block diagram illustrating a chip according to some embodiments of the application.
  • the chip 1000 shown in FIG. 10 includes a processor 1010, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
  • the chip 1000 may further include a memory 1020.
  • the processor 1010 may call and run the computer program from the memory 1020 to implement the method according to the embodiments of the application.
  • the memory 1020 may be a separate device independent of the processor 1010, or may be integrated in the processor 1010.
  • the chip 1000 may further include an input interface 1030.
  • the processor 1010 may control the input interface 1030 to communicate with other devices or chips. Specifically, the processor 1010 may acquire information or data sent by other devices or chips.
  • the chip 1000 may further include an output interface 1040.
  • the processor 1010 may control the output interface 1040 to communicate with other devices or chips. Specifically, the processor 1010 may output information or data to the other devices or chips.
  • the chip can be applied to the image capturing device according to the embodiments of the application, and the chip can implement the corresponding process implemented by the image capturing device in the method according to the embodiments of the application.
  • the chip mentioned in some embodiments of the application may also be referred to as a system-level chip, a system chip, a chip system or a system-on-chip.
  • the processor in the embodiments of the disclosure may be an integrated circuit chip with signal processing capability.
  • the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or instructions in the form of software.
  • the processor mentioned in some embodiments of the application may be a general-purpose processor, a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field programmable gate array (FPGA) , or other programming logic devices, discrete gate or transistor logic devices, discrete hardware components, which can achieve or implement the methods, steps and block diagrams disclosed in embodiments of the disclosure.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory mentioned in some embodiments of the application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM) , programmable read-only memory (PROM) , erasable programmable read-only memory (erasable PROM, EPROM) , electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • the volatile memory may be a random access memory (RAM) , which is used as an external cache.
  • the memory in the embodiments of the disclosure may also be static random access memory (static RAM, SRAM) , dynamic random access memory (dynamic RAM, DRAM) , synchronous dynamic random access memory (synchronous DRAM, SDRAM) , double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM) , enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM) , synchlink dynamic random access memory (synchlink DRAM, SLDRAM) , direct Rambus random access memory (direct Rambus RAM, DR RAM) , and the like.
  • Embodiments of the disclosure further provide a computer readable storage medium, which is configured to store a computer program.
  • the computer readable storage medium may be applied to the network device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the network device in each method in some embodiments of the application.
  • the computer readable storage medium may be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application.
  • a computer program product is also provided in some embodiments of the application, including computer program instructions.
  • the computer program product can be applied to the network device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the network device in each method in some embodiments of the application.
  • the computer program product can be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application.
  • a computer program is also provided in some embodiments of the application.
  • the computer program may be applied to the network device in some embodiments of the application.
  • When the computer program is run on a computer, the computer is caused to execute the corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
  • the computer program may be applied to the mobile terminal/terminal device in some embodiments of the application.
  • When the computer program is run on a computer, the computer is caused to execute the corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments as described above are only exemplary.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not carried out.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium.
  • a computer-readable storage medium including several instructions used for causing a computer device (which may be a personal computer, a server, or a network device, and the like) to perform all or part of the steps of the method described in some embodiments of the application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

A method for estimating a pose of an image capturing device, including: acquiring a series of images of multiple landmarks from at least one pose (301); determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2 (303); and outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images (305).

Description

METHOD AND APPARATUS FOR ESTIMATING POSE OF IMAGE CAPTURING DEVICE
RELATED APPLICATION
This application claims the benefit of priority of U.S. Provisional Patent Application No. 62/976,537 filed on Feb. 14, 2020, the contents of which are incorporated herein by reference in their entirety.
TECHNICAL FIELD
The present disclosure, in some embodiments thereof, relates to computer vision, and more specifically, but not exclusively, to a method and an apparatus for estimating a pose of an image capturing device.
BACKGROUND
When realizing VR (virtual reality) and AR (augmented reality) , the key point is determining the mobile system’s spatial localization in real time, for example, estimating a pose of an image capturing device (e.g., a camera) provided on the mobile system, where the pose may include its position and rotation. Such localization may be referred to as SLAM (simultaneous localization and mapping) .
During the process of localization based on a camera and an IMU (inertial measurement unit) , accumulated error may increase over time. To suppress the accumulated error, the current image pose should be re-corrected based on landmarks when the camera re-visits the same area.
SUMMARY
The present disclosure provides a method and an apparatus for estimating a pose of an image capturing device.
According to a first aspect, there is provided a method for estimating a pose of an image capturing device, including: acquiring a series of images of multiple landmarks from at least one pose; determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
According to a second aspect, there is provided an apparatus for estimating a pose of an image capturing device, comprising: an acquiring module, configured to acquire a series of images of multiple landmarks from at least one pose; a determining module, configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and an outputting module, configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
According to a third aspect, there is provided an image capturing device, including: a processor and a memory. The memory is configured to store a computer program, and the processor is configured to call and run the computer program stored in the memory, thereby implementing the method according to the foregoing first aspect or any embodiment thereof.
According to a fourth aspect, there is provided a chip configured to implement the method according to the foregoing first aspect or any embodiment thereof.
Specifically, the chip includes a processor, configured to call and run a computer program from a memory, thereby causing an apparatus provided with the chip to implement the method according to the foregoing first aspect or any embodiment thereof.
According to a fifth aspect, there is provided a computer readable storage medium, being used for storing a computer program, wherein the computer program causes a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
According to a sixth aspect, there is provided a computer program product, including computer program instructions that cause a computer to implement the method according to the foregoing first aspect or any embodiment thereof.
According to a seventh aspect, there is provided a computer program which, when running on a computer, causes the computer to implement the method according to the foregoing first aspect or any embodiment thereof.
Based on the method provided by the embodiments of the disclosure, it is assumed that the recently acquired N images are acquired by N cameras mounted on a tight and rigid body system, from which the 6DoF and scale are estimated. Accordingly, even if pose error accumulates significantly, the accumulated error of the tightly coupled N images is small relative to the whole accumulated pose.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment among current 3D point and landmarks.
FIG. 3 illustrates a sequence diagram of an optional flow of operations for estimating a pose of an image capturing device according to some embodiments of the disclosure.
FIG. 4 illustrates a sequence diagram of an optional flow of operations 400 for constructing the mapping image set according to some embodiments of the disclosure.
FIG. 5 illustrates a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure.
FIG. 6 illustrates a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure.
FIG. 7 illustrates an exemplary application scenario where the operations 300 are implemented according to some embodiments of the disclosure.
FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application.
FIG. 9 is a block diagram of an image capturing device according to some embodiments of the application.
FIG. 10 is a block diagram of a chip according to some embodiments of the application.
DETAILED DESCRIPTION
Exemplary embodiments of the disclosure will now be described more fully with reference to the accompanying drawings, in which exemplary embodiments are shown. Exemplary embodiments of the disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of exemplary embodiments to those skilled in the art. In the drawings, the separate layers and regions are exaggerated for clarity. Like reference numerals in the drawings denote like elements, and thus their description will be omitted.
The described features, structures, or/and characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are disclosed to provide a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosure may be practiced without one or more of the specific details, or with other methods, components and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
In the present disclosure, terms such as "connected" and the like should be understood broadly, and may be directly connected or indirectly connected through an intermediate medium, unless otherwise specified. The specific meanings of the above terms in the present disclosure can be understood by those skilled in the art on a case-by-case basis.
Further, in the description of the present disclosure, the meaning of "a plurality" , "multiple" or "several" is at least two, for example, two, three, etc., unless specifically defined otherwise. "And/or" , describing the association relationship of the associated objects, indicates that there may be three relationships, such as A and/or B, which may indicate that there are three cases of single A, single B and both A and B. The symbol "/" generally indicates that the contextual object is an "or" relationship.
For brevity, the term “camera” is used herein to refer to an image capturing device such as one or more image sensors, an independent camera, an integrated camera, and/or any sensor adapted to document objects visually.
There exist image capturing systems including an image capturing device, for example a camera, where there is a need to estimate a pose of the camera in a coordinate system. Examples of coordinate systems include a world coordinate system and a coordinate system calibrated with a camera pose of a camera when  capturing images. A camera pose is a combination of position and orientation of a camera relative to a coordinate system. For example, a camera pose x may be expressed as a pair (R, t) , where R is a rotation matrix representing an orientation with respect to the coordinate system, and t is a translation vector representing the camera's position with respect to the coordinate system. Other possible representations of orientation are double-angle representations and tensors. Examples of such image capturing systems are medical systems including a camera inserted into a patient's body, for example by swallowing the camera, systems including autonomously moving devices, for example vehicles and robots, navigation applications and augmented reality systems. When a camera operates in unknown environments, without further information or sensors, estimation of a camera pose may involve three-dimensional (3D) reconstruction of a scene. This problem is known as “simultaneous localization and mapping” (SLAM) in computer vision and robotics communities.
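For concreteness, the following minimal Python sketch (not part of the disclosure; the function names and the single-axis rotation are illustrative assumptions) builds a toy (R, t) pose as described above and uses it to express a world-frame landmark in the camera frame.

```python
import numpy as np

def make_pose(yaw_rad, position):
    """Toy camera pose (R, t): R encodes orientation, t the camera position."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])  # rotation about the z-axis only
    return R, np.asarray(position, dtype=float)

def world_to_camera(R, t, point_world):
    """Express a world-frame 3D point in the camera frame."""
    return R.T @ (np.asarray(point_world, dtype=float) - t)

R, t = make_pose(np.pi / 4.0, [1.0, 2.0, 0.0])
print(world_to_camera(R, t, [3.0, 2.0, 1.0]))  # landmark seen from the camera
```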
A scene has image features, also known as landmarks. An image of the scene captured by a camera, sometimes referred to as a camera view, has observed image features representing the scene's landmarks. Typically, camera pose estimation is solved using bundle adjustment (BA) optimization. Bundle adjustment is a common approach to recover (by estimation) camera poses and 3D scene reconstruction given a sequence of images and possible measurements from additional sensors. A bundle adjustment optimization calculation aims to minimize re-projection errors across all available images and for all landmarks identified in the available images. A re-projection error for a landmark in an image is the difference between the location in the image of the landmark's observed image feature and the predicted location in the image of the landmark's observed image feature for a certain camera pose estimate.
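The re-projection error defined above can be written down directly for a pinhole camera model; the sketch below assumes an illustrative intrinsic matrix K and the (R, t) convention from the previous example, and is not the disclosure's implementation.

```python
import numpy as np

def reprojection_error(K, R, t, landmark_world, observed_px):
    """Pixel distance between an observed feature and the landmark's
    predicted projection under the camera pose estimate (R, t)."""
    p_cam = R.T @ (np.asarray(landmark_world, dtype=float) - t)
    uvw = K @ p_cam                      # pinhole projection
    predicted_px = uvw[:2] / uvw[2]      # perspective divide
    return np.linalg.norm(np.asarray(observed_px, dtype=float) - predicted_px)

K = np.array([[500.0, 0.0, 320.0],      # illustrative intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
err = reprojection_error(K, np.eye(3), np.zeros(3), [0.1, -0.2, 4.0], [332.0, 215.0])
print(err)  # bundle adjustment minimizes the sum of such errors over all images
```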
Hereinafter, the terms “visiting an area” , “mapping an area” and “observing an area” all mean capturing images of the area by a camera, and are used interchangeably.
The following description refers to a moving camera but applies to stationary cameras as well. Estimating the motion of a camera is also known as ego-motion estimation.
Maintaining a high level of estimation accuracy over long periods of time is challenging in environments deprived of a Global Positioning System (GPS) . Accuracy of estimations of camera poses and 3D structure for images captured over a period of time typically deteriorates over time due to accumulation of estimation errors. Estimation drift is the change in the accumulated estimation error over time. Estimation errors occur both when a camera re-observes previously mapped areas and when a camera continuously explores new areas. When a camera continuously observes new areas, bundle adjustment reduces to fixed-lag bundle adjustment, which typically results in rapid trajectory drift; trajectory drift is the change in the estimation of the camera's motion. Re-observation of an area is known as a loop-closure. When a camera re-observes previously mapped areas, estimation errors are typically reduced, but are still inevitable, even in the case of a loop-closure, in large-scale environments.
To suppress the localization’s accumulated error, the pose graph method has been widely adopted for its light computational cost. The pose graph concept aligns the current camera track and the old camera track when the camera revisits the same area. This pose graph method tries to increase the similarity among the tracks even though the tracks may have quite different patterns. Even when it takes 3D landmarks into consideration, the distance scale error between the 3D landmarks and the current 3D points built by a single camera makes things worse.
Iterative closest point (ICP) , used for more accurate alignment, is an algorithm employed to minimize the difference between two point clouds. In general, ICP builds normal vectors for each point and then aligns the clouds using matched point pairs. However, there are barriers to directly applying the ICP method because its time cost is quite high.
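For reference, the alignment step that ICP repeats after each re-matching round has a closed form; the sketch below is a simplified point-to-point variant (no normal vectors, illustrative names), not the specific ICP variant discussed above.

```python
import numpy as np

def align_matched_points(src, dst):
    """Closed-form rigid alignment (Kabsch/SVD) of matched point pairs:
    returns (R, t) minimizing sum ||R @ src_i + t - dst_i||^2."""
    src, dst = np.asarray(src, dtype=float), np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # guard against reflections
    t = dst_c - R @ src_c
    return R, t
```

A full ICP loop alternates this solve with nearest-neighbour re-matching, which is where the high time cost noted above comes from.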
Another approach is PnP (Perspective-n-Point) . When the camera revisits the same area, PnP re-projects the already built 3D landmarks onto the current camera image to correct the current camera pose. When the distance between the camera and a 3D landmark increases, the camera pose estimation error also increases.
Because the methods as described above only focus on the current (latest) frame pose correction, the re-corrected position may suddenly jump by a large translation. Also, they do not consider the distance scale error between the current estimated 3D points and the 3D landmarks.
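PnP itself is available off the shelf; a hedged OpenCV sketch of correcting the current pose from re-projected landmarks might look as follows (the data arrays are placeholders, not real measurements).

```python
import cv2
import numpy as np

# Placeholder correspondences: mapped 3D landmarks and their pixels in
# the current image (a real system obtains these from feature matching).
landmarks_3d = (np.random.rand(20, 3) * 5.0).astype(np.float32)
pixels_2d = (np.random.rand(20, 2) * 480.0).astype(np.float32)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec, inliers = cv2.solvePnPRansac(landmarks_3d, pixels_2d, K, None)
if ok:
    R_cur, _ = cv2.Rodrigues(rvec)  # corrected rotation of the current frame
    # rvec/tvec give a single-frame pose; as noted in the text, its error
    # grows with the camera-to-landmark distance.
```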
For example, FIG. 1 and FIG. 2 exemplarily illustrate conventional alignment between current 3D points and landmarks. As shown in FIG. 1, the current frame pose 101 is corrected based on the perspective images of landmarks #1, #2 and #3. As indicated by the originating point of the three arrows in FIG. 2, the re-corrected position of the current frame pose jumps leftward by a large translation, that is, it substantially does not match the predicted pose or the camera track. Moreover, as indicated by the diamonds at the end of two arrows in FIG. 2, the distance scale error caused by the change of the current frame pose is not considered in the re-correction, thereby generating additional re-correction error.
To overcome the deficiencies of the methods described above, some embodiments of the disclosure propose a pose estimation scheme which tightly couples the current image frame to its previous image frame. Because the distance between the current track and each camera in a tightly coupled body is considered an error, the sudden-jump translation issue can be suppressed.
Some embodiments of the disclosure also propose a pose estimation scheme which considers the distance scale error between the 3D landmarks and the current 3D points when minimizing those errors and estimating the 6 DoF (Degrees of Freedom) of a tightly coupled body. One example of the pose estimation scheme is performed through the following minimization:
minimize {3D_error_term + body_pose_error_term + 3D_project_error_term} ,
where 3D_error_term = 3D landmark - scale * current 3D point,
body_pose_error_term = scale * body pose - current body pose, and
3D_project_error_term = 3D landmark projection on image - current image point.
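As a sketch of how the three terms could be stacked for a non-linear least-squares solver (argument names and array shapes are assumptions of this illustration, not the disclosure's implementation):

```python
import numpy as np

def residuals(scale, landmarks_3d, current_points_3d,
              body_poses, current_body_poses,
              landmark_projections_px, current_points_px):
    """Stack the three error terms of the minimization above into one
    residual vector for a least-squares solver."""
    e_3d = landmarks_3d - scale * current_points_3d       # 3D_error_term
    e_body = scale * body_poses - current_body_poses      # body_pose_error_term
    e_proj = landmark_projections_px - current_points_px  # 3D_project_error_term
    return np.concatenate([e_3d.ravel(), e_body.ravel(), e_proj.ravel()])
```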
With reference to the accompanying drawings, a method and an apparatus for estimating a pose of an image capturing device provided by the embodiments of the disclosure will be specifically described below.
Reference is now made to FIG. 3, showing a sequence diagram of an optional flow of operations 300 according to some embodiments of the disclosure. Such embodiments include at least one hardware processor and a single camera. In such embodiments, the at least one hardware processor acquires 301 a series of images of multiple landmarks from at least one pose. In such embodiments the at least one pose is unknown. There may be a need to compute a set of estimations for the at least one pose.
In some embodiments the at least one hardware processor determines 303, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2. For example, N = 3. In other words, the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto.
In some embodiments the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
The conventional solution to recover the camera’s 6 DoF (Degrees of Freedom) and scale around a loop-closure area uses a single-frame-based PnP method when calculating an absolute pose for the pose graph. When the landmark depth accuracy from SLAM is poor and the overlapping area between the current image frame and the re-projected 3D landmarks is less than 60%, the single-frame-based PnP produces jittering motion (wrong 6DoF) .
Based on the method provided by the embodiments of the disclosure, it is assumed that the recently acquired N images are acquired by N cameras mounted on a tight and rigid body system, from which the 6DoF and scale are estimated. Accordingly, even if pose error accumulates significantly, the accumulated error of the tightly coupled N images is small relative to the whole accumulated pose.
In some embodiments, the mapping image set is acquired during the operations for estimating the pose of the image capturing device. In other words, the image capturing device continuously captures images of the multiple landmarks and extracts feature information for each captured image to construct the mapping image set, which is used for estimating the pose for the recently acquired N images in an on-board way. In some alternative embodiments, the mapping image set is acquired separately prior to the operations for estimating the pose of the image capturing device and, thus, used for estimating the pose for the recently acquired N images in an off-board way.
Reference is now made to FIG. 4, showing a sequence diagram of an optional flow of operations 400 for constructing the mapping image set according to some embodiments of the disclosure. Such embodiments include at least one hardware processor, a single camera and a VIO (visual inertial odometry) unit. In such embodiments, the at least one hardware processor controls the single camera to capture 401 a plurality of images of multiple landmarks at different poses of the image capturing device; and controls the VIO unit to extract 403 feature information for each image in the plurality of images. In some embodiments, the feature information includes 6DoF information of each image and depth information of the multiple landmarks in each image.
In some embodiments, the VIO unit includes a visual odometry unit and an inertial measurement unit (IMU) .
An exemplary algorithm of the visual odometry unit is as follows. A new frame image is acquired first, ORB (Oriented FAST and Rotated BRIEF) feature points are extracted from the image, and the corresponding BRIEF descriptors of the feature points are calculated. Then, matching is performed between the feature points of the recently acquired image and the feature points of previous image frames. At the same time, matched feature points are filtered using the RANSAC algorithm. Finally, the rotation and translation between the current and previous image frames are obtained by minimizing the re-projection error, so as to obtain the current pose of the image capturing device.
In some embodiments, the IMU is configured to obtain the acceleration and angular speed of the image capturing device by using a gyroscope and an accelerometer and, then, calculate the current pose of the image capturing device through an integration operation. Details thereof are omitted here for brevity.
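A hedged OpenCV sketch of this visual-odometry front end follows; the parameter values are illustrative, and a production pipeline would add keyframe selection and scale handling.

```python
import cv2
import numpy as np

def relative_pose(img_prev, img_cur, K):
    """ORB detection, descriptor matching, RANSAC filtering, and pose
    recovery between two frames, mirroring the steps described above."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_cur, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC outlier filtering happens inside findEssentialMat
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # rotation and unit-scale translation between the frames
```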
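For illustration only, a naive IMU propagation step (no bias, noise, or pre-integration handling, which a real VIO unit would require) could look like this:

```python
import numpy as np

def integrate_imu(R, p, v, gyro, accel, dt):
    """Propagate orientation R, position p and velocity v over one IMU
    sample: gyro in rad/s (body frame), accel in m/s^2 (specific force)."""
    g = np.array([0.0, 0.0, -9.81])        # world-frame gravity
    wx, wy, wz = gyro * dt
    dR = np.array([[1.0, -wz,  wy],        # first-order rotation update
                   [ wz, 1.0, -wx],
                   [-wy,  wx, 1.0]])
    R_new = R @ dR
    a_world = R @ accel + g                # gravity-compensated acceleration
    v_new = v + a_world * dt
    p_new = p + v * dt + 0.5 * a_world * dt * dt
    return R_new, p_new, v_new
```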
Reference is now made to FIG. 5, showing a sequence diagram of an optional flow of operations 500 for matching each image with the mapping image set according to some embodiments of the disclosure. Such embodiments include at least one hardware processor and a single camera. In such embodiments, the at least one hardware processor extracts 501 a plurality of observed image features of the multiple landmarks from a plurality of images captured by the single camera from at least one pose. In such embodiments the at least one pose is unknown. There may be a need to compute a set of estimations for the at least one pose. Optionally, the observed image features may be expressed in a camera coordinate system.
In some embodiments the at least one hardware processor extracts 503 the plurality of observed image features by applying image matching algorithms to the images. The image matching algorithms may include feature scale detection algorithms that produce scale information. Examples of image matching algorithms are SIFT and RANSAC. The at least one hardware processor may identify 505 among the extracted plurality of observed image features at least one common observed image feature documented in at least some of the images. Optionally, the image matching algorithms are used for identifying the at least one common feature.
In some embodiments, a corner feature matching and tracking method, for example DBoW2, which is a bag-of-words place recognition approach, can be utilized to implement the operations 500. Based on the operations 500, a plurality of corner features are detected and described by BRIEF descriptors, which are treated as visual words to query the mapping image set. DBoW2 can return loop-closure candidates after temporal and geometrical consistency checks. All BRIEF descriptors may be kept for feature retrieval, but the raw images can be discarded to reduce memory consumption. When a loop is detected, the connection between the recently acquired N images and the N matched images is established by retrieving feature correspondences, which are found by BRIEF descriptor matching.
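DBoW2 itself is a C++ library; as a rough illustration of the bag-of-words idea only (a toy vocabulary and L1 scoring, not DBoW2's actual API or its consistency checks), loop-closure candidates can be ranked like this:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize binary descriptors (uint8 rows) against a small visual
    vocabulary by Hamming distance and return a normalized histogram."""
    xor = descriptors[:, None, :] ^ vocabulary[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)   # Hamming distances
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def loop_candidates(query_hist, database_hists, top_k=3):
    """Rank mapping images by bag-of-words similarity (higher is better)."""
    scores = [1.0 - 0.5 * np.abs(query_hist - h).sum() for h in database_hists]
    return np.argsort(scores)[::-1][:top_k]
```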
Reference is now made to FIG. 6, showing a sequence diagram of an optional flow of operations 600 for re-localizing a pose of the image capturing device according to some embodiments of the disclosure. Such embodiments include at least one hardware processor. In such embodiments, the at least one hardware processor calculates 601 a pose (R, T) based on the following equations:
P_i = R (p_i - p_ref) + T;
(P_i - L_i) · (P_j - L_j) - |P_i - L_i| |P_j - L_j| (ray_i · ray_j) = 0;
(R * ray_i) · (P_i - L_i) - |P_i - L_i| = 0,
where R is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; T is a translation vector representing a position of the image capturing device with respect to the world coordinate system; p_ref is a reference position in a body system when considering the recently acquired N images as a whole; p_i is a pose translation determined for the i-th image of the N matched images; P_i is a position of the i-th image of the recently acquired N images; L_i is a position of the i-th landmark of the multiple landmarks; “·” represents an inner product operation; and ray_i represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image.
In some embodiments, the at least one hardware processor outputs 603 a pose (r_ref, p_ref) for the most recently acquired image as:
r_ref = R * R_ref; and
p_ref = T,
where R_ref is a rotation matrix determined for the most recently acquired image.
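A sketch of solving these constraints numerically with non-linear least squares follows; the pairing of (i, j) as consecutive indices, the data layout, and the rotation-vector parameterization of R are assumptions of this illustration, not the disclosure's stated solver.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def relocalize(p_body, p_ref_in, L, rays, R_ref):
    """Estimate (R, T) from the constraints above, then return the
    re-localized pose (r_ref, p_ref) = (R @ R_ref, T)."""
    p_body, L, rays = (np.asarray(a, dtype=float) for a in (p_body, L, rays))

    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        T = x[3:]
        P = (R @ (p_body - p_ref_in).T).T + T        # P_i = R (p_i - p_ref) + T
        v = P - L                                    # P_i - L_i
        n = np.linalg.norm(v, axis=1)
        res = [v[i] @ v[i + 1] - n[i] * n[i + 1] * (rays[i] @ rays[i + 1])
               for i in range(len(P) - 1)]           # pairwise ray-angle constraint
        res += [(R @ rays[i]) @ v[i] - n[i]
                for i in range(len(P))]              # per-image ray alignment
        return np.array(res)

    sol = least_squares(residuals, x0=np.zeros(6))
    R = Rotation.from_rotvec(sol.x[:3]).as_matrix()
    return R @ R_ref, sol.x[3:]                      # (r_ref, p_ref)
```

Because all N images constrain one shared (R, T), a single bad frame cannot pull the solution the way a single-frame PnP can, which is the sudden-jump suppression described above.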
Based on the method provided by the embodiments of the disclosure, it is assumed that the recently acquired N images are acquired by N cameras mounted on a tight and rigid body system, from which the 6DoF and scale are estimated. Accordingly, even if pose error accumulates significantly, the accumulated error of the tightly coupled N images is small relative to the whole accumulated pose.
Reference is now made to FIG. 7, showing an example application scenario in which the operations 300 are implemented according to some embodiments of the disclosure. In such embodiments, the at least one hardware processor acquires 301 a series of images of landmarks $L_1$, $L_2$, $L_3$ from a series of poses 0, 1, 2, …, 12. In such embodiments the pose 12 is unknown.
In some embodiments the at least one hardware processor determines 303, for recently acquired 3 images, 3 matched images by matching each of the 3 images with a mapping image set. In other words, the recently acquired three images are each matched with the mapping image set to determine three matched images respectively corresponding thereto. As shown in FIG. 7, feature correspondences are found between the three images at poses 10, 11, 12 and three matched images at poses 1, 3, 4, respectively.
In some embodiments the at least one hardware processor outputs 305 a pose of the image capturing device by aligning the recently acquired 3 images to known pose information of the 3 matched images.
Based on the method provided by the embodiments of the disclosure, the recently acquired 3 images are assumed to be acquired by 3 cameras mounted on a rigid body system, from which the 6DoF pose and scale are estimated. Accordingly, even when pose error accumulates considerably over the whole trajectory, the error accumulated within the tightly coupled 3 images remains small relative to the overall accumulated pose error.
The method embodiments of the application have been described in detail above with reference to FIG. 3 to FIG. 7. The apparatus/device embodiments of the application will be described in detail below with reference to FIG. 8 and FIG. 9. It should be understood that the apparatus/device embodiments correspond to the method embodiments, and similar descriptions may refer to the method embodiments.
FIG. 8 is a block diagram of an apparatus for estimating a pose of an image capturing device according to some embodiments of the application. As shown in FIG. 8, the apparatus 800 includes an acquiring module 801, a determining module 803 and an outputting module 805.
The acquiring module 801 is configured to acquire a series of images of multiple landmarks from at least one pose;
The determining module 803 is configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and
The outputting module 805 is configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
Optionally, in some embodiments, the apparatus 800 further includes a constructing module (not shown). The constructing module is configured to acquire the mapping image set by capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and extracting feature information for each image in the plurality of images.
Optionally, in some embodiments, the constructing module is specifically configured to: calculate 6 degrees of freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
Optionally, in some embodiments, the constructing module is specifically configured to: calculate, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
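By way of a non-limiting illustration, the corner feature matching and tracking ingredient of such a VIO front end may be sketched in Python with OpenCV as follows. The frames prev_img and next_img are assumed consecutive grayscale images; IMU fusion and depth computation, which a full VIO pipeline would add, are omitted here.

```python
import cv2
import numpy as np

# Detect strong corners in the previous frame (Shi-Tomasi criterion).
corners = cv2.goodFeaturesToTrack(prev_img, maxCorners=300,
                                  qualityLevel=0.01, minDistance=10)

# Track the corners into the next frame with pyramidal Lucas-Kanade optical flow.
next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, next_img, corners, None)

# Keep only successfully tracked corner pairs.
tracked_prev = corners[status.ravel() == 1]
tracked_next = next_pts[status.ravel() == 1]
# These correspondences, fused with IMU measurements, feed the 6DoF and
# depth estimation of the VIO back end.
```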
Optionally, in some embodiments, the determining module 803 is specifically configured to: perform matching of corner features between each of the N images and each corresponding one of the N matched images.
Optionally, in some embodiments, the outputting module 805 is specifically configured to: calculate a pose (R, T) based on the following equations:
$P_i = R\,(p_i - p_{ref}) + T$;
$(P_i - L_i) \cdot (P_j - L_j) - |P_i - L_i|\,|P_j - L_j|\,(ray_i \cdot ray_j) = 0$;
$R\,ray_i \cdot (P_i - L_i) - |P_i - L_i| = 0$,
where $R$ is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; $T$ is a translation vector representing a position of the image capturing device with respect to the world coordinate system; $p_{ref}$ is a reference position in a body system when considering the recently acquired N images as a whole; $p_i$ is a pose translation determined for the i-th image of the N matched images; $P_i$ is a position of the i-th image of the recently acquired N images; $L_i$ is a position of the i-th landmark of the multiple landmarks; the operator $\cdot$ represents an inner product; and $ray_i$ represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
output a pose $(r_{ref}, p_{ref})$ for the most recently acquired image as:
$r_{ref} = R \cdot R_{ref}$; and
$p_{ref} = T$,
where $R_{ref}$ is a rotation matrix determined for the most recently acquired image.
Based on the apparatus provided by the embodiments of the disclosure, the recently acquired N images are assumed to be acquired by N cameras mounted on a rigid body system, from which the 6DoF pose and scale are estimated. Accordingly, even when pose error accumulates considerably over the whole trajectory, the error accumulated within the tightly coupled N images remains small relative to the overall accumulated pose error.
FIG. 9 is a block diagram illustrating an image capturing device 900 according to some embodiments of the application. The image capturing device 900 shown in FIG. 9 includes a processor 910, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
Optionally, as shown in FIG. 9, the image capturing device 900 may further include a memory 920. The processor 910 may call and run the computer program from the memory 920 to implement the method according to the embodiments of the application.
The memory 920 may be a separate device independent of the processor 910, or may be integrated in the processor 910.
Optionally, as shown in FIG. 9, the image capturing device 900 may further include a transceiver 930, and the processor 910 may control the transceiver 930 to communicate with other devices. Specifically, the transceiver 930 may send information or data to other devices, or receive information or data sent by other devices.
The transceiver 930 may include a transmitter and a receiver. The transceiver 930 may further include antennas, and the number of antennas may be one or more.
FIG. 10 is a block diagram illustrating a chip according to some embodiments of the application. The chip 1000 shown in FIG. 10 includes a processor 1010, which can call and run a computer program from a memory to implement the method according to the embodiments of the application.
Optionally, as shown in FIG. 10, the chip 1000 may further include a memory 1020. The processor 1010 may call and run the computer program from the memory 1020 to implement the method according to the embodiments of the application.
The memory 1020 may be a separate device independent of the processor 1010, or may be integrated in the processor 1010.
Optionally, the chip 1000 may further include an input interface 1030. The processor 1010 may control the input interface 1030 to communicate with other devices or chips. Specifically, the processor 1010 may acquire information or data sent by other devices or chips.
Optionally, the chip 1000 may further include an output interface 1040. The processor 1010 may control the output interface 1040 to communicate with other devices or chips. Specifically, the processor 1010 may output information or data to the other devices or chips.
Optionally, the chip can be applied to the image capturing device according to the embodiments of the application, and the chip can implement the corresponding process implemented by the image capturing device in the method according to the embodiments of the application. For brevity, details are not described herein.
It should be understood that the chip mentioned in some embodiments of the application may also be referred to as a system-level chip, a system chip, a chip system or a system-on-chip.
It should be understood that the processor in the embodiments of the disclosure may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or by instructions in the form of software. The processor mentioned in some embodiments of the application may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can achieve or implement the methods, steps and block diagrams disclosed in embodiments of the disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in the embodiments of the disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The memory mentioned in some embodiments of the application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. In some embodiments, the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of exemplary but not restrictive illustration, many forms of RAM are available, for example, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), direct Rambus random access memory (DR RAM), and so on. It should be noted that the memories in the systems and methods described herein are intended to include, but are not limited to, these and any other suitable types of memories.
It should be understood that the foregoing memories are exemplary rather than restrictive; the memory in the embodiments of the disclosure is intended to include, but is not limited to, the above and any other suitable types of memory.
Embodiments of the disclosure further provide a computer readable storage medium, which is configured to store a computer program.
Optionally, the computer readable storage medium may be applied to the network device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Optionally, the computer readable storage medium may be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program causes the computer to execute the corresponding process implemented by the mobile terminal/terminal device in each method in some  embodiments of the application. For the sake of brevity, details will not be repeated here.
A computer program product is also provided in some embodiments of the application, including computer program instructions.
Optionally, the computer program product can be applied to the network device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Optionally, the computer program product can be applied to the mobile terminal/terminal device in some embodiments of the application, and the computer program instruction causes the computer to execute a corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
A computer program is also provided in some embodiments of the application.
Optionally, the computer program may be applied to the network device in some embodiments of the application. When the computer program is run on a computer, the computer is caused to execute a corresponding process implemented by the network device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Optionally, the computer program may be applied to the mobile terminal/terminal device in some embodiments of the application. When the computer program is run on a computer, the computer is caused to execute a corresponding process implemented by the mobile terminal/terminal device in each method in some embodiments of the application. For the sake of brevity, details will not be repeated here.
Those of ordinary skill in the art may realize that the units and algorithm steps of each example described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those of ordinary skill in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working processes of the systems, devices, and units described above can refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments as described above are only exemplary. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation. For example, multiple  units or components can be combined or integrated into another system, or some features can be ignored or not carried out. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of the disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the essential part of the technical solution of this application, or the part that contributes to the existing technology, or other parts of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in some embodiments of the application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above content is only a specific implementation of the embodiments of the application, without limiting the protection scope of the embodiments of the application. Any modification or replacement conceived by those skilled in the art within the technical scope disclosed in some embodiments of the application should be covered within the protection scope of the embodiments of the application. Therefore, the protection scope of the embodiments of the application shall be subject to the protection scope of the claims.

Claims (17)

  1. A method for estimating a pose of an image capturing device, comprising:
    acquiring a series of images of multiple landmarks from at least one pose;
    determining, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and
    outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  2. The method of claim 1, further comprising acquiring the mapping image set by:
    capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and
    extracting feature information for each image in the plurality of images.
  3. The method of claim 2, wherein the extracting feature information for each image in the plurality of images comprises:
    calculating 6 degrees of freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
  4. The method of claim 3, wherein the calculating step comprises:
    calculating, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
  5. The method of any one of claims 1-4, wherein the matching each of the N images with the mapping image set comprises:
    performing matching of corner features between each of the N images and each corresponding one of the N matched images.
  6. The method of any one of claims 1-4, wherein the outputting a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images comprises:
    calculating a pose (R, T) based on the following equations:
    $P_i = R\,(p_i - p_{ref}) + T$;
    $(P_i - L_i) \cdot (P_j - L_j) - |P_i - L_i|\,|P_j - L_j|\,(ray_i \cdot ray_j) = 0$;
    $R\,ray_i \cdot (P_i - L_i) - |P_i - L_i| = 0$,
    where $R$ is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; $T$ is a translation vector representing a position of the image capturing device with respect to the world coordinate system; $p_{ref}$ is a reference position in a body system when considering the recently acquired N images as a whole; $p_i$ is a pose translation determined for the i-th image of the N matched images; $P_i$ is a position of the i-th image of the recently acquired N images; $L_i$ is a position of the i-th landmark of the multiple landmarks; the operator $\cdot$ represents an inner product; and $ray_i$ represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
    outputting a pose $(r_{ref}, p_{ref})$ for the most recently acquired image as:
    $r_{ref} = R \cdot R_{ref}$; and
    $p_{ref} = T$,
    where $R_{ref}$ is a rotation matrix determined for the most recently acquired image.
  7. An apparatus for estimating a pose of an image capturing device, comprising:
    an acquiring module, configured to acquire a series of images of multiple landmarks from at least one pose;
    a determining module, configured to determine, for recently acquired N images, N matched images by matching each of the N images with a mapping image set, where N is an integer greater than 2; and
    an outputting module, configured to output a pose of the image capturing device by aligning the recently acquired N images to known pose information of the N matched images.
  8. The apparatus of claim 7, further comprising a constructing module, configured to acquire the mapping image set by:
    capturing a plurality of images of multiple landmarks at different poses of the image capturing device; and
    extracting feature information for each image in the plurality of images.
  9. The apparatus of claim 8, wherein the constructing module is specifically configured to:
    calculate 6 degrees of freedom (6DoF) information for each image and depth information of the multiple landmarks in each image.
  10. The apparatus of claim 9, wherein the constructing module is specifically configured to:
    calculate, based on a visual inertial odometry (VIO) algorithm, the 6DoF information and the depth information by using an inertial measurement unit (IMU) and camera image corner feature matching and tracking.
  11. The apparatus of any one of claims 7-10, wherein the determining module is specifically configured to:
    perform matching of corner features between each of the N images and each corresponding one of the N matched images.
  12. The apparatus of any one of claims 7-10, wherein the outputting module is specifically configured to:
    calculate a pose (R, T) based on the following equations:
    $P_i = R\,(p_i - p_{ref}) + T$;
    $(P_i - L_i) \cdot (P_j - L_j) - |P_i - L_i|\,|P_j - L_j|\,(ray_i \cdot ray_j) = 0$;
    $R\,ray_i \cdot (P_i - L_i) - |P_i - L_i| = 0$,
    where $R$ is a rotation matrix representing an orientation of the image capturing device with respect to a world coordinate system; $T$ is a translation vector representing a position of the image capturing device with respect to the world coordinate system; $p_{ref}$ is a reference position in a body system when considering the recently acquired N images as a whole; $p_i$ is a pose translation determined for the i-th image of the N matched images; $P_i$ is a position of the i-th image of the recently acquired N images; $L_i$ is a position of the i-th landmark of the multiple landmarks; the operator $\cdot$ represents an inner product; and $ray_i$ represents a ray from the camera focal point to the image pixel corresponding to a feature point on the i-th image; and
    output a pose $(r_{ref}, p_{ref})$ for the most recently acquired image as:
    $r_{ref} = R \cdot R_{ref}$; and
    $p_{ref} = T$,
    where $R_{ref}$ is a rotation matrix determined for the most recently acquired image.
  13. An image capturing device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to call and run the computer program stored in the memory, thereby implementing the method according to any one of claims 1 to 6.
  14. A chip, comprising a processor, wherein the processor is configured to call and run a computer program from a memory, thereby causing an apparatus provided with the chip to implement the method according to any one of claims 1 to 6.
  15. A computer readable storage medium, being used for storing a computer program, wherein the computer program causes a computer to implement the method according to any one of claims 1 to 6.
  16. A computer program product, comprising computer program instructions that cause a computer to implement the method according to any one of claims 1 to 6.
  17. A computer program, causing a computer to implement the method according to any one of claims 1 to 6.