NZ743271B2 - Relocalization systems and methods - Google Patents
- Publication number: NZ743271B2
- Authority: NZ (New Zealand)
- Prior art keywords: image, data structure, pose, images, dimensional
Classifications
- G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5838: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using colour
- G06K9/00664
- G06K9/20
- G06K9/52
- G06K9/60
- G06K9/6273
- G06K9/66
- G06K9/70
- G06N3/0454
- G06N3/08: Learning methods
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30244: Camera pose
- G06T7/70: Determining position or orientation of objects or cameras
- G06T7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
- G06T7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
Abstract
The present disclosure relates to the localisation of pose-sensitive systems. One disclosed embodiment relates to a method of determining a pose of an image capture device. The method includes capturing an image using an image capture device. The method also includes generating a data structure corresponding to the captured image. The method further includes comparing the data structure with a plurality of known data structures to identify a most similar known data structure. The method further includes reading metadata corresponding to the most similar known data structure to determine a pose of the image capture device. The method further includes training a neural network by mapping a plurality of known images to the plurality of known data structures. The data structure is an N dimensional vector. The step of generating the data structure corresponding to the captured image comprises using a neural network to map the captured image to the N dimensional vector. Each known image of the plurality has respective metadata including pose data. The step of training the neural network comprises accessing a database of the known images annotated with the respective metadata, decreasing a first Euclidean distance between first and second known N dimensional vectors respectively corresponding to matching first and second known images in an N dimensional space, and increasing a second Euclidean distance between first and third known N dimensional vectors respectively corresponding to non-matching first and third known images in the N dimensional space.
Description
RELOCALIZATION SYSTEMS AND METHODS
Cross-Reference to Related Application
This application claims priority to U.S. Provisional Application Serial
Number 62/263,529 filed on December 4, 2015 entitled “RELOCALIZATION
SYSTEMS AND METHODS,” under attorney docket number 9512-30061.00 US.
The present application includes subject matter similar to that described in U.S.
Utility Patent Application Serial No. ,042 filed on May 9, 2016 entitled
“DEVICES, METHODS AND SYSTEMS FOR BIOMETRIC USER RECOGNITION
UTILIZING NEURAL NETWORKS,” under attorney docket number ML.20028.00.
The contents of the aforementioned patent application are hereby expressly and fully
incorporated by reference in their entirety, as though set forth in full.
The subject matter herein may be employed and/or utilized with various
systems, such as those wearable computing systems and components thereof
designed by organizations such as Magic Leap, Inc. of Fort Lauderdale, Florida. The
following documents are hereby expressly and fully incorporated by reference in their
entirety, as though set forth in full: U.S. Patent Application Serial Number
14/641,376; U.S. Patent Application Serial Number 14/555,585; U.S. Patent
Application Serial Number 14/205,126; U.S. Patent Application Serial Number
14/212,961; U.S. Patent Application Serial Number 14/690,401; U.S. Patent
Application Serial Number 13/663,466; and U.S. Patent Application Serial Number
13/684,489.
Field of the Invention
The present disclosure relates to devices, methods and systems for
localization of pose sensitive systems. In particular, the present disclosure relates to
devices, methods and systems for relocalization of pose sensitive systems that have
either lost and/or have yet to establish a system pose.
Background
An increasing number of systems require pose information for the
systems to function optimally. Examples of systems that require pose information for
optimal performance include, but are not limited to, robotic and mixed reality (MR)
systems (i.e., virtual reality (VR) and/or augmented reality (AR) systems). Such
systems can be collectively referred to as “pose sensitive” systems. One example of
pose information is spatial information along six degrees of freedom that locates and
orients the pose sensitive system in three-dimensional space.
Pose sensitive systems may become “lost” (i.e., lose track of the
system pose) after various events. Some of these events include: 1. Rapid camera
motion (e.g., in an AR system worn by a sports participant); 2. Occlusion (e.g., by a
person walking into a field of view); 3. Motion-blur (e.g., with rapid head rotation by
an AR system user); 4. Poor lighting (e.g., blinking lights); 5. Sporadic system
failures (e.g., power failures); and 6. Featureless environments (e.g., rooms with
plain walls). Any of these events and many others can drastically affect feature-based
tracking such as that employed by current simultaneous localization and mapping
(“SLAM”) systems with robust tracking front-ends, thereby causing these systems to
become lost.
Accordingly, relocalization (i.e., finding a system’s pose in a map when
the system is “lost” in a space that has been mapped) is a challenging and key
aspect of real-time visual tracking. Tracking failure is a critical problem in SLAM
systems and a system’s ability to recover (or relocalize) relies upon its ability to
accurately recognize a location, which it has previously visited.
The problem of image based localization in robotics is commonly
referred to as the Lost Robot problem (or the Kidnapped Robot problem). The Lost
Robot problem is also related to both the Wake-up Robot problem and Loop Closure
detection. The Wake-up Robot problem involves a system being turned on for the
first time. Loop Closure detection involves a system that is tracking successfully,
revisiting a previously visited location. In Loop Closure detection, the image
localization system must recognize that the system has visited the location before.
Such Loop Closure detections help prevent localization drift and are important when
building 3D maps of large environments. Accordingly, pose sensitive system
localization is useful in situations other than lost system scenarios.
MR systems (e.g., AR systems) have even higher localization
requirements than typical robotic systems. The devices, methods and systems for
localizing pose sensitive systems described and claimed herein can facilitate optimal
function of all pose sensitive systems.
Summary
In one embodiment directed to a method of determining a pose of an
image capture device, the method includes capturing an image using an image
capture device. The method also includes generating a data structure corresponding
to the captured image. The method further includes comparing the data structure
with a plurality of known data structures to identify a most similar known data
structure. Moreover, the method includes reading metadata corresponding to the
most similar known data structure to determine a pose of the image capture device.
In one or more embodiments, the data structure is a compact
representation of the captured image. The data structure may be an N dimensional
vector. The data structure may be a 128 dimensional vector.
In one or more embodiments, generating the data structure
corresponding to the captured image includes using a neural network to map the
captured image to the N dimensional vector. The neural network may be a
convolutional neural network.
In one or more embodiments, each of the plurality of known data
structures is a respective known N dimensional vector in an N dimensional space.
Each of the plurality of known data structures may be a respective known 128
dimensional vector in a 128 dimensional space.
The data structure may be an N dimensional vector. Comparing the
data structure with the plurality of known data structures to identify the most similar
known data structure may include determining respective Euclidean distances
between the N dimensional vector and each respective known N dimensional vector.
Comparing the data structure with the plurality of known data structures to identify
the most similar known data structure may also include identifying a known N
dimensional vector having a smallest distance to the N dimensional vector as the
most similar known data structure.
In one or more embodiments, the method also includes training a
neural network by mapping a plurality of known images to the plurality of known data
structures. The neural network may be a convolutional neural network. Training the
neural network may include modifying the neural network based on comparing a pair
of known images of the plurality.
In one or more embodiments, training the neural network comprises
modifying the neural network based on comparing a triplet of known images of the
plurality. Each known image of the plurality may have respective metadata,
including pose data. Training the neural network may include accessing a database
of the known images annotated with the respective metadata. The pose data may
encode a translation and a rotation of a camera corresponding to a known image.
In one or more embodiments, each known data structure of the plurality
is a respective known N dimensional vector in an N dimensional space. A first
known image of the triplet may be a matching image for a second known image of
the triplet. A third known image of the triplet may be a non-matching image for the
first known image of the triplet. A first Euclidean distance between respective first
and second pose data corresponding to the matching first and second known images
may be less than a predefined threshold. A second Euclidean distance between
respective first and third pose data corresponding to the non-matching first and third
known images may be more than the predefined threshold.
In one or more embodiments, training the neural network includes
decreasing a first Euclidean distance between first and second known N dimensional
vectors respectively corresponding to the matching first and second known images in
an N dimensional space. Training the neural network may also include increasing a
second Euclidean distance between first and third known N dimensional vectors
respectively corresponding to the non-matching first and third known images in the N
dimensional space.
In one or more embodiments, the method also includes comparing the
data structure with the plurality of known data structures to identify the most similar
known data structure in real time. The metadata corresponding to the most similar
known data structure may include pose data corresponding to the most similar
known data structure. The method may also include determining a pose of the
image capture device from the pose data in the metadata of the most similar known
data structure.
In another embodiment, there is provided a method of determining a
pose of an image capture device. The method comprises capturing an image using
an image capture device, generating a data structure corresponding to the captured
image, comparing the data structure with a plurality of known data structures to
identify a most similar known data structure, reading metadata corresponding to the
most similar known data structure to determine a pose of the image capture device,
and training a neural network by mapping a plurality of known images to the plurality
of known data structures. The data structure is an N dimensional vector. The step of
generating the data structure corresponding to the captured image comprises using
a neural network to map the captured image to the N dimensional vector. Each
known image of the plurality has respective metadata including pose data. The step
of training the neural network comprises accessing a database of the known images
annotated with the respective metadata, decreasing a first Euclidean distance
between first and second known N dimensional vectors respectively corresponding
to matching first and second known images in an N dimensional space, and
increasing a second Euclidean distance between first and third known N dimensional
vectors respectively corresponding to non-matching first and third known images in
the N dimensional space.
Brief Description of the Drawings
The drawings illustrate the design and utility of various embodiments of
the present invention. It should be noted that the figures are not drawn to scale and
that elements of similar structures or functions are represented by like reference
numerals throughout the figures. In order to better appreciate how to obtain the
above-recited and other advantages and objects of various embodiments of the
invention, a more detailed description of the present inventions briefly described
above will be rendered by reference to specific embodiments thereof, which are
illustrated in the accompanying drawings. Understanding that these drawings depict
only typical embodiments of the invention and are not therefore to be considered
limiting of its scope, the invention will be described and explained with additional
specificity and detail through the use of the accompanying drawings in which:
Figure 1 is a schematic view of a query image and six known images
for a localization/relocalization system, according to one embodiment;
Figure 2 is a schematic view of an embedding of an image to a data
structure, according to one embodiment;
Figure 3 is a schematic view of a method for training a neural network,
according to one embodiment;
Figure 4 is a schematic view of data flow in a method for
localizing/relocalizing a pose sensitive system, according to one embodiment;
Figure 5 is a flow chart depicting a method for localizing/relocalizing a
pose sensitive system, according to one embodiment.
Detailed Description
Various embodiments of the invention are directed to methods,
systems, and articles of manufacture for localizing or relocalizing a pose sensitive
system (e.g., an augmented reality (AR) system) in a single embodiment or in
multiple embodiments. Other objects, features, and advantages of the invention are
described in the detailed description, figures, and claims.
Various embodiments will now be described in detail with reference to
the drawings, which are provided as illustrative examples of the invention so as to
enable those skilled in the art to practice the invention. Notably, the figures and the
examples below are not meant to limit the scope of the present invention. Where
certain elements of the present invention may be partially or fully implemented using
known components (or methods or processes), only those portions of such known
components (or methods or processes) that are necessary for an understanding of
the present invention will be described, and the detailed descriptions of other
portions of such known components (or methods or processes) will be omitted so as
not to obscure the invention. Further, various embodiments encompass present and
future known equivalents to the components referred to herein by way of illustration.
Localization/Relocalization Systems and Methods
Various embodiments of augmented reality display systems have been
discussed in co-owned U.S. Utility Patent Application Serial Number 14/555,585 filed
on November 27, 2014 under attorney docket number MLUS and entitled
“VIRTUAL AND AUGMENTED REALITY SYSTEMS AND METHODS,” and co-
owned U.S. Prov. Patent Application Serial Number 62/005,834 filed on May 30,
2014 under attorney docket number ML 30017.00 and entitled “METHODS AND
SYSTEM FOR CREATING FOCAL PLANES IN VIRTUAL AND AUGMENTED
REALITY,” the contents of the aforementioned U.S. patent applications are hereby
expressly and fully incorporated herein by reference as though set forth in full.
Localization/relocalization systems may be implemented independently of AR
systems, but many embodiments below are described in relation to AR systems for
illustrative purposes only.
Disclosed are devices, methods and systems for localizing/relocalizing
pose sensitive systems. In one embodiment, the pose sensitive system may be a
head-mounted AR display system. In other embodiments, the pose sensitive system
may be a robot. Various embodiments will be described below with respect to
localization/relocalization of a head-mounted AR system, but it should be
appreciated that the embodiments disclosed herein may be used independently of
any existing and/or known AR system.
For instance, when an AR system “loses” its pose tracking after it
experiences one of the disruptive events described above (e.g., rapid camera
motion, occlusion, motion-blur, poor lighting, sporadic system failures, featureless
environments, and the like), the AR system performs a relocalization procedure
according to one embodiment to reestablish the pose of the system, which is needed
for optimal system performance. The AR system begins the relocalization procedure
by capturing one or more images using one or more cameras coupled thereto. Next,
the AR system compares a captured image with a plurality of known images to
identify a known image that is the closest match to the captured image. Then, the
AR system accesses metadata for the closest match known image including pose
data, and reestablishes the pose of the system using the pose data of the closest
match known image.
Figure 1 depicts a query image 110, which represents an image
captured by the lost AR system. Figure 1 also depicts a plurality (e.g., six) of known
images 112a-112f, against which the query image 110 is compared. The known
images 112a-112f may have been recently captured by the lost AR system. In the
embodiment depicted in Figure 1, known image 112a is the closest match known
image to the query image 110. Accordingly, the AR system will reestablish its pose
using the pose data associated with known image 112a. The pose data may encode
a translation and a rotation of a camera corresponding to the closest match known
image 112a.
However, comparing a large number (e.g., more than 10,000) of image
pairs on a pixel-by-pixel basis is computationally intensive. This limitation renders
a pixel-by-pixel comparison prohibitively inefficient for real time (e.g., 60 or more
frames per second) pose sensitive system relocalization. Accordingly, Figure 1 only
schematically depicts the image comparison for system relocalization.
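To make the cost argument concrete, the following back-of-the-envelope sketch counts element-wise operations for a pixel-by-pixel comparison and for a comparison of 128 dimensional vectors. The 10,000 known images, the 60 frames per second target, the 120 x 160 image size and the 128 dimensions are figures quoted in this document; treating one pixel or one vector component as one unit of work, and assuming a single-channel image, are simplifying assumptions made only for this illustration.

```python
# Rough operation counts for relocalization queries (illustrative only).
KNOWN_IMAGES = 10_000          # "more than 10,000" image pairs per query
FRAMES_PER_SECOND = 60         # real-time target quoted in the text
PIXELS_PER_IMAGE = 120 * 160   # assumed single-channel 120 x 160 image
VECTOR_DIMENSIONS = 128        # embedding size used in the embodiments


def comparisons_per_second(elements_per_item: int) -> int:
    """Element-wise operations needed per second to compare every query
    frame against all known items."""
    return elements_per_item * KNOWN_IMAGES * FRAMES_PER_SECOND


pixel_cost = comparisons_per_second(PIXELS_PER_IMAGE)
vector_cost = comparisons_per_second(VECTOR_DIMENSIONS)
ratio = pixel_cost // vector_cost

print(f"pixel-by-pixel: {pixel_cost:,} operations per second")
print(f"128-d vectors:  {vector_cost:,} operations per second")
print(f"reduction:      {ratio}x")
```

Under these assumptions the vector comparison needs 150 times fewer element-wise operations. The count ignores the cost of computing the query embedding itself, which is one network evaluation per captured image.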
According to one embodiment, the query image (e.g., query image
110) and the plurality of known images (e.g., known images 112a-112f) are
transformed into data structures that are both easier to process and compare, and
easier to store and organize. In particular, each image is “embedded” by projecting
the image into a lower dimensional manifold where triangle inequality is preserved.
Triangle inequality is the geometric property wherein for any three points not on a
line, the sum of the lengths of any two sides of the triangle they form is greater
than the length of the third side.
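For reference, the property can be written in terms of the Euclidean distance between embedded points. The symbols x, y, z (three embedded points) and d (the distance function) are introduced here for illustration and do not appear in the original text.

```latex
d(\mathbf{x}, \mathbf{z}) \le d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z}),
\qquad
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}
```

Equality can hold only when the three points lie on a line, which is why the strict form applies to three points not on a line.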
In one embodiment, the lower dimensional manifold is a data structure
in the form of an N dimensional vector. In particular, the N dimensional vector may
be a 128 dimensional vector. Such a 128 dimensional vector strikes an effective
balance between size of the data structure and ability to analyze images represented
by the data structure. Varying the number of dimensions of N dimensional vectors
for an image based localization/relocalization method can affect the speed of
similarity metric computation and end-to-end training (described below). All other
factors being equal, the lowest dimensional representation is preferred. Using 128
dimensional vectors results in a lean, yet robust embedding for image based
localization/relocalization methods. Such vectors can be used with convolutional
neural networks, rendering the localization/relocalization system improvable with
new data, and efficiently functional on new data sets.
Figure 2 schematically depicts the embedding of an image 210 through
a series of analysis/simplification/reduction steps 212. The image 210 may be a 120
pixel x 160 pixel image. The result of the operations in step 212 on the image 210 is
an N dimensional vector 214 (e.g., 128 dimensional) representing the image 210.
While the embodiments described herein utilize a 128 dimensional vector as a data
structure, any other data structure, including vectors with a different number of
dimensions, can represent the images to be analyzed in localization/relocalization
systems according to the embodiments herein.
For localization/relocalization, this compact representation of an image
(i.e., an embedding) may be used to compare the similarity of one location to another
by comparing the Euclidean distance between the N dimensional vectors. A network
of known N dimensional vectors corresponding to known training images, trained
with both indoor and outdoor location based datasets (described below), may be
configured to learn visual similarity (positive images) and dissimilarity (negative
images). Based upon this learning process, the embedding is able to successfully
encode a large degree of appearance change for a specific location or area in a
relatively small data structure, making it an efficient representation of locality in a
localization/relocalization system.
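As a minimal sketch of this comparison, the snippet below computes the Euclidean distance between embeddings and checks which of two candidate locations is closer to a query. The function name and the 4 dimensional toy vectors are illustrative assumptions; the embodiments use 128 dimensional vectors produced by the trained network.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 4 dimensional embeddings (hypothetical values, not real network output).
query = [0.1, 0.9, 0.0, 0.4]
same_place = [0.2, 0.8, 0.1, 0.4]
other_place = [0.9, 0.1, 0.7, 0.0]

d_same = euclidean_distance(query, same_place)
d_other = euclidean_distance(query, other_place)

# The smaller distance indicates the more similar location.
print("closer to same_place" if d_same < d_other else "closer to other_place")
```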
Network Training
Networks must be trained before they can be used to efficiently embed
images into data structures. Figure 3 schematically depicts a method for training a
network 300 using image triplets 310, 312, 314, according to one embodiment. The
network 300 may be a convolutional neural network 300. The network training
system uses a query image 312, a positive (matching) image 310, and a negative
(non-matching) image 314 for one cycle of training. The query and positive images
312, 310 in Figure 3 each depict the same object (i.e., a person), perhaps from
different points of view. The query and negative images 312, 314 in Figure 3 depict
different objects (i.e., people). The same network 300 learns all of the images 310,
312, 314, but is trained to make the scores 320, 322 of the two matching images
310, 312 as close as possible and the score 324 of the non-matching image 314 as
different as possible from the scores 320, 322 of the two matching images 310, 312.
This training process is repeated with a large set of images.
When training is complete, the network 300 maps different views of the
same image close together and different images far apart. This network can then be
used to encode images into a nearest neighbor space. When a newly captured
image is analyzed (as described above), it is encoded (e.g., into an N dimensional
vector). Then the localization/relocalization system can determine the distance to
the captured image’s nearest other encoded images. If it is near to some encoded
image(s), it is considered to be a match for that image(s). If it is far from some
encoded image, it is considered to be a non-match for that image. As used in this
application, “near” and “far” include, but are not limited to, relative Euclidean
distances between two poses and/or N dimensional vectors.
Learning the weights of the neural network (i.e., the training algorithm)
includes comparing a triplet of known data structures of a plurality of known data
structures. The triplet consists of a query image, positive image, and negative
image. A first Euclidean distance between respective first and second pose data
corresponding to the query and positive images is less than a predefined threshold,
and a second Euclidean distance between respective first and third pose data
corresponding to the query and negative images is more than the predefined
threshold. The network produces a 128 dimensional vector for each image in the
triplet, and an error term is non-zero if the negative image is closer (in terms of
Euclidean distance) to the query image than the positive. The error is propagated
through the network using a neural network backpropagation algorithm. The network
can be trained by decreasing a first Euclidean distance between first and second 128
dimensional vectors corresponding to the query and positive images in an N
dimensional space, and increasing a second Euclidean distance between first and
third 128 dimensional vectors respectively corresponding to the query and negative
images in the N dimensional space. The final configuration of the network is
achieved after passing a large number of triplets through the network.
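The error term described above can be sketched as a hinge on the two distances. With `margin` set to zero, the function below is non-zero exactly when the negative image is closer to the query than the positive, as stated in the text. The optional positive `margin` is an assumption borrowed from common triplet-loss formulations and is not specified in this document; the function name `triplet_error` is hypothetical.

```python
import math
from typing import Sequence


def triplet_error(
    query: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.0,
) -> float:
    """Hinge-style error for one (query, positive, negative) triplet.

    Returns 0.0 when the positive embedding is closer to the query than the
    negative embedding by at least `margin`; otherwise returns the shortfall.
    """
    d_positive = math.dist(query, positive)  # distance to be decreased
    d_negative = math.dist(query, negative)  # distance to be increased
    return max(0.0, d_positive - d_negative + margin)


# Well-ordered triplet: the positive is nearer than the negative, so no error.
print(triplet_error([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]))

# Violating triplet: the negative is nearer than the positive, so error > 0.
print(triplet_error([0.0, 0.0], [3.0, 0.0], [1.0, 0.0]))
```

In a training loop, this value would be averaged over many triplets and its gradient propagated back through the network weights; that step requires an automatic differentiation framework and is outside the scope of this sketch.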
It is desirable for an appearance based relocalization system generally
to be invariant to changes in viewpoint, illumination, and scale. The deep metric
learning network described above is suited to solving the problem of appearance-
invariant relocalization. In one embodiment, the triplet convolutional neural network
model embeds an image into a lower dimensional space where the system can
measure meaningful distances between images. Through the careful selection of
triplets, consisting of three images that form an anchor-positive pair of similar images
and an anchor-negative pair of dissimilar images, the convolutional neural network
can be trained for a variety of locations, including changing locations.
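One way to picture triplet selection from a pose-annotated database is sketched below. It applies the criterion stated earlier in this document: a positive image has pose data within a predefined threshold of the query's pose data, and a negative image has pose data beyond that threshold. Representing pose data as a translation-only tuple, the image identifiers, and the exhaustive search over all orderings are assumptions made for brevity; the document does not specify how rotation enters the pose distance or how triplets are sampled in practice.

```python
import math
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def select_triplets(
    poses: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str, str]]:
    """Return (query, positive, negative) image ids chosen by pose distance."""
    triplets = []
    for query, positive, negative in permutations(poses, 3):
        d_positive = math.dist(poses[query], poses[positive])
        d_negative = math.dist(poses[query], poses[negative])
        # Positive: pose closer than the threshold. Negative: pose farther.
        if d_positive < threshold < d_negative:
            triplets.append((query, positive, negative))
    return triplets


# Hypothetical camera positions in metres for three known images.
known_poses = {
    "img_a": (0.0, 0.0, 0.0),
    "img_b": (0.5, 0.0, 0.0),   # near img_a, so a matching view
    "img_c": (10.0, 0.0, 0.0),  # far from both, so a non-matching view
}

triplets = select_triplets(known_poses, threshold=1.0)
print(triplets)
```

The exhaustive loop grows cubically with the number of images, so a real database would need sampling or spatial indexing instead.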
While the training embodiment described above uses triplets of
images, network training according to other embodiments may utilize other
pluralities of images (e.g., pairs and quadruplets). For image pair training, a query
image may be sequentially paired with positive and negative images. For image
quadruplet training, a quadruplet should include at least a query image, a positive
image, and a negative image. The remaining image may be an additional positive or
negative image based on the intended application for which the network is being
trained. For localization/relocalization, which typically involves more non-matches
than matches, the fourth image in quadruplet training may be a negative image.
While the training embodiment described above uses a single
convolutional neural network, other training embodiments may utilize multiple
operatively coupled networks. In still other embodiments, the network(s) may be
other types of neural networks with backpropagation.
Exemplary Network Architecture
An exemplary neural network for use with localization/relocalization
systems according to one embodiment has 3x3 convolutions and a single fully
connected layer. This architecture allows the system to take age of emerging
hardware acceleration for popular architectures and the ability to initialize from
ImageNet pre-trained weights. This 3x3 convolutions architecture is sufficient for
solving a wide array of problems with the same network architecture.
This exemplary neural network architecture includes 8 convolutional
layers and 4 max pooling layers, followed by a single fully connected layer of size
128. A max pooling layer is disposed after every two convolutional blocks, ReLU is
used for the non-linearity, and BatchNorm layers are disposed after every
convolution. The final fully connected layer maps a blob of size [8x10x128] to a
128x1 vector, and a custom differentiable L2-normalization provides the final
embedding.
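The layer sizes above can be checked with simple shape arithmetic. The sketch below assumes padded 3x3 convolutions that preserve spatial size, 2x2 max pooling with stride 2, and a 128x160 input image; none of these values is stated in the text, and they are chosen only because they reproduce the [8x10x128] blob. The final step assumes the normalization is an L2 normalization.

```python
import math

def pooled_size(height, width, num_pools, pool=2):
    """Spatial size after `num_pools` non-overlapping max pooling layers."""
    for _ in range(num_pools):
        height //= pool
        width //= pool
    return height, width

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (assumed final step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Assumed 128x160 input; four pooling layers give an 8x10 feature map.
h, w = pooled_size(128, 160, num_pools=4)

# The fully connected layer maps the flattened [8x10x128] blob to 128 values.
fc_inputs = h * w * 128
fc_weights = fc_inputs * 128

# A toy 2-D vector stands in for the 128x1 output.
embedding = l2_normalize([3.0, 4.0])
```

With unit-length embeddings, Euclidean distances between any two embeddings are bounded, which keeps distances comparable across images.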
Localization/Relocalization Systems and Methods
[0045] Now that the training of the convolutional neural network according to
one embodiment has been described, Figures 4 and 5 depict two similar methods
400, 500 of localizing/relocalizing a pose sensitive system according to two
embodiments.
Figure 4 schematically depicts a query image 410, which is processed
by a neural network 412 into a corresponding query data structure 414. The query
image 410 may have been acquired by a lost pose sensitive system for use in
relocalization. The neural network 412 may be a trained convolutional neural
network (see 300 in Figure 3). The query data structure 414 may be a 128
dimensional vector.
[0047] The query data structure 414 corresponding to the query image 410 is
compared to a database 416 of known data structures 418a-418e. Each known data
structure 418a-418e is associated in the database 416 with corresponding
metadata 420a-420e, which includes pose data for the system which captured the
known image corresponding to the known data structure 418. The result of the
comparison is identification of the nearest neighbor (i.e., best match) to the query
data structure 414 corresponding to the query image 410. The nearest neighbor is
the known data structure (e.g., the known data structure 418b) having the shortest
relative Euclidean distance to the query data structure 414.
After the nearest neighbor known data structure, 418b in this
embodiment, has been identified, the associated metadata 420b is transferred to the
system. The system can then use the pose data in the metadata 420b to
localize/relocalize the previously lost pose sensitive system.
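The database comparison described above can be sketched as a linear nearest neighbor search over stored vectors, each carrying its pose metadata. The `KnownEntry` type, the dictionary layout of the metadata, and the 3-dimensional toy vectors (in place of 128-dimensional ones) are assumptions for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class KnownEntry:
    vector: list   # known data structure (embedding of a known image)
    metadata: dict # pose data of the system that captured the known image

def nearest_neighbor(query_vector, database):
    """Return the entry with the shortest Euclidean distance to the query."""
    return min(database, key=lambda entry: math.dist(query_vector, entry.vector))

# Hypothetical database with translation/rotation pose metadata.
database = [
    KnownEntry([1.0, 0.0, 0.0], {"translation": (0.0, 0.0, 0.0), "rotation": (0.0, 0.0, 0.0)}),
    KnownEntry([0.0, 1.0, 0.0], {"translation": (2.5, 0.0, 1.0), "rotation": (0.0, 90.0, 0.0)}),
    KnownEntry([0.0, 0.0, 1.0], {"translation": (4.0, 1.0, 0.0), "rotation": (0.0, 180.0, 0.0)}),
]

query_vector = [0.1, 0.9, 0.0]
best = nearest_neighbor(query_vector, database)
recovered_pose = best.metadata  # transferred to the lost system
```

A linear scan is used here for clarity; a larger database would likely call for an indexed search, which the text does not discuss.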
Figure 5 is a flow chart depicting a method 500 of image based
localization/relocalization. At step 502, a pose sensitive system without pose
information captures an image. At step 504, the system compares the captured
image with a plurality of known images. At step 506, the system identifies the known
image that is the closest match to the captured image. At step 508, the system
accesses pose metadata for the closest match known image. At step 510, the
system generates pose information for itself from the pose metadata for the closest
match known image.
Relocalization using a triplet convolutional neural network outperforms
current relocalization methods in both accuracy and efficiency.
Image Based Mapping
When a localizing/relocalizing system is used to form a map based on
encoded images, the system obtains images of a location, encodes those pictures
using the triplet network, and locates the system on the map based on a location
corresponding to the closest image(s) to the obtained images.
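Locating the system from the closest encoded image(s) can be sketched as follows. The text only says that the location is based on the closest image(s); averaging the map positions of the k closest entries is an assumption made here for illustration, as are the names and the 2-D toy data.

```python
import math

def locate_on_map(query_vector, encoded_map, k=1):
    """Estimate a map position from the k closest encoded images.

    `encoded_map` is a list of (embedding, (x, y)) pairs. The positions
    of the k nearest embeddings are averaged (an assumed strategy).
    """
    if not encoded_map:
        raise ValueError("encoded_map must contain at least one entry")
    ranked = sorted(encoded_map, key=lambda item: math.dist(query_vector, item[0]))
    closest = ranked[:k]
    x = sum(position[0] for _, position in closest) / len(closest)
    y = sum(position[1] for _, position in closest) / len(closest)
    return (x, y)

# Hypothetical map: embeddings of known images with their map positions.
encoded_map = [
    ([0.0, 0.0], (0.0, 0.0)),
    ([0.1, 0.0], (2.0, 0.0)),
    ([5.0, 5.0], (40.0, 40.0)),
]

query_vector = [0.02, 0.0]
single_estimate = locate_on_map(query_vector, encoded_map, k=1)
averaged_estimate = locate_on_map(query_vector, encoded_map, k=2)
```

With k=1 this reduces to adopting the position of the single closest image.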
Various exemplary embodiments of the invention are described herein.
Reference is made to these examples in a non-limiting sense. They are provided to
illustrate more broadly applicable aspects of the invention. Various changes may be
made to the invention described and equivalents may be substituted without
departing from the true spirit and scope of the invention. In addition, many
modifications may be made to adapt a particular situation, material, composition of
matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the
present invention. Further, as will be appreciated by those with skill in the art,
each of the individual variations described and illustrated herein has discrete
components and features which may be readily separated from or combined with the
features of any of the other several embodiments without departing from the scope
or spirit of the present inventions. All such modifications are intended to be within
the scope of claims associated with this disclosure.
The invention includes methods that may be performed using the
subject devices. The methods may comprise the act of providing such a suitable
device. Such provision may be performed by the end user. In other words, the
“providing” act merely requires the end user obtain, access, approach, position, set-
up, activate, power-up or otherwise act to provide the requisite device in the subject
method. Methods recited herein may be carried out in any order of the recited
events which is logically possible, as well as in the recited order of events.
Exemplary aspects of the invention, together with details regarding
material selection and manufacture have been set forth above. As for other details
of the present invention, these may be appreciated in connection with the above-referenced
patents and publications as well as generally known or appreciated by
those with skill in the art. The same may hold true with respect to method-based
aspects of the invention in terms of additional acts as commonly or logically
employed.
[0055] In addition, though the invention has been described in reference to
several examples optionally incorporating various features, the invention is not to be
limited to that which is described or indicated as contemplated with respect to each
variation of the invention. Various changes may be made to the invention described
and equivalents (whether recited herein or not included for the sake of some brevity)
may be substituted without departing from the true spirit and scope of the invention.
In addition, where a range of values is provided, it is understood that every
intervening value, between the upper and lower limit of that range and any other
stated or intervening value in that stated range, is encompassed within the invention.
Also, it is contemplated that any optional feature of the inventive
variations described may be set forth and claimed independently, or in combination
with any one or more of the features described herein. Reference to a singular item
includes the possibility that there are plural of the same items present. More
specifically, as used herein and in claims associated hereto, the singular forms “a,”
“an,” “said,” and “the” include plural referents unless specifically stated otherwise.
In other words, use of the articles allows for “at least one” of the subject item in the
description above as well as claims associated with this disclosure. It is further
noted that such claims may be drafted to exclude any optional element. As such,
this statement is intended to serve as antecedent basis for use of such exclusive
terminology as “solely,” “only” and the like in connection with the recitation of claim
elements, or use of a “negative” limitation.
Without the use of such exclusive terminology, the term “comprising” in
claims associated with this disclosure shall allow for the inclusion of any additional
element--irrespective of whether a given number of elements are enumerated in
such claims, or the addition of a feature could be regarded as transforming the
nature of an element set forth in such claims. Except as specifically defined herein,
all technical and scientific terms used herein are to be given as broad a commonly
understood meaning as possible while maintaining claim validity.
The breadth of the present invention is not to be limited to the
examples provided and/or the subject specification, but rather only by the scope of
claim language associated with this disclosure.
In the foregoing specification, the invention has been described with
reference to specific embodiments thereof. It will, however, be evident that various
modifications and changes may be made thereto without departing from the broader
spirit and scope of the invention. For example, the above-described process flows
are described with reference to a particular ordering of process actions. However,
the ordering of many of the described process actions may be changed without
affecting the scope or operation of the invention. The specification and drawings
are, accordingly, to be regarded in an illustrative rather than restrictive sense.
Claims (12)
1. A method of determining a pose of an image capture device, comprising: capturing an image using an image capture device; generating a data structure corresponding to the captured image; comparing the data structure with a plurality of known data structures to identify a most similar known data structure; reading metadata corresponding to the most similar known data structure to determine a pose of the image capture device; and training a neural network by mapping a plurality of known images to the plurality of known data structures, wherein the data structure is an N dimensional vector, wherein generating the data structure corresponding to the captured image comprises using a neural network to map the captured image to the N dimensional vector, wherein each known image of the plurality has respective metadata including pose data, and wherein training the neural network comprises: accessing a database of the known images annotated with the respective metadata; decreasing a first Euclidean distance between first and second known N dimensional vectors respectively corresponding to matching first and second known images in an N dimensional space; and increasing a second Euclidean distance between first and third known N dimensional vectors respectively corresponding to non-matching first and third known images in the N dimensional space.
2. The method of claim 1, wherein the data structure is a compact representation of the captured image.
3. The method of claim 1 or 2, wherein the neural network is a convolutional neural network.
4. The method of any one of claims 1 to 3, wherein the data structure is a 128 dimensional vector.
5. The method of claim 1, wherein each of the plurality of known data structures is a respective known N dimensional vector in an N dimensional space.
6. The method of claim 5, wherein comparing the data structure with the plurality of known data structures to identify the most similar known data structure comprises: determining respective Euclidean distances between the N dimensional vector and each respective known N dimensional vector, and identifying a known N dimensional vector having a smallest distance to the N dimensional vector as the most similar known data structure.
7. The method of claim 1, wherein each of the plurality of known data structures is a respective known 128 dimensional vector in a 128 dimensional space.
8. The method of claim 1, wherein training the neural network comprises modifying the neural network based on comparing a triplet of known images of the plurality.
9. The method of claim 8, wherein each known data structure of the plurality is a respective known N dimensional vector in an N dimensional space, wherein a first known image of the triplet is a matching image for a second known image of the triplet, and wherein a third known image of the triplet is a non-matching image for the first known image of the triplet.
10. The method of any one of claims 1 to 9, wherein the pose data includes a translation and a rotation of a camera corresponding to a known image.
11. The method of claim 1, wherein a first Euclidean distance between respective first and second pose data corresponding to the matching first and second known images is less than a predefined threshold, and wherein a second Euclidean distance between respective first and third pose data corresponding to the non-matching first and third known images is more than the predefined threshold.
12. The method of any one of claims 1 to 11, further comprising comparing the data structure with the plurality of known data structures to identify the most similar known data structure in real time.
FIG. 5 (flow chart): Capture Image with System (502); Compare Captured Image with Known Images (504); Identify Known Image Closest Match to Captured Image; Access Pose Metadata for Closest Match Known Image; Generate System Pose from Pose Metadata.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562263529P | 2015-12-04 | 2015-12-04 | |
US62/263,529 | 2015-12-04 | ||
PCT/US2016/065001 WO2017096396A1 (en) | 2015-12-04 | 2016-12-05 | Relocalization systems and methods |
Publications (2)
Publication Number | Publication Date |
---|---|
NZ743271A (en) | 2021-06-25 |
NZ743271B2 (en) | 2021-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11288832B2 (en) | Relocalization systems and methods | |
US11798175B2 (en) | Objects and features neural network | |
US10803365B2 (en) | System and method for relocalization and scene recognition | |
Workman et al. | Deepfocal: A method for direct focal length estimation | |
Bodla et al. | Deep heterogeneous feature fusion for template-based face recognition | |
Holte et al. | A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points | |
Tian et al. | Robust 3-d human detection in complex environments with a depth camera | |
Mocanu et al. | Deep-see face: A mobile face recognition system dedicated to visually impaired people | |
CN110969089B (en) | Lightweight face recognition system and recognition method in noise environment | |
Zhou et al. | Graph correspondence transfer for person re-identification | |
Liang et al. | Resolving ambiguous hand pose predictions by exploiting part correlations | |
US11861806B2 (en) | End-to-end camera calibration for broadcast video | |
Singh et al. | A proficient approach for face detection and recognition using machine learning and high‐performance computing | |
Li et al. | Sparse-to-local-dense matching for geometry-guided correspondence estimation | |
Wang et al. | Handling occlusion and large displacement through improved RGB-D scene flow estimation | |
NZ743271B2 (en) | Relocalization systems and methods | |
Essmaeel et al. | A new 3D descriptor for human classification: Application for human detection in a multi-kinect system | |
Tulyakov et al. | Facecept3d: real time 3d face tracking and analysis | |
Thomas et al. | Multi-frame approaches to improve face recognition | |
Anshu et al. | View invariant gait feature extraction using temporal pyramid pooling with 3D convolutional neural network | |
Fernandez-Labrador | Indoor scene understanding using non-conventional cameras | |
Raskin et al. | Using gaussian processes for human tracking and action classification | |
Zhang et al. | An Efficient Feature Extraction Scheme for Mobile Anti-Shake in Augmented Reality | |
Raskin et al. | Tracking and classifying of human motions with gaussian process annealed particle filter | |
Xompero et al. | Cross-camera view-overlap recognition |