WO2021179905A1 - Motion blur robust image feature descriptor


Info

Publication number: WO2021179905A1
Authority: WIPO (PCT)
Prior art keywords: motion, keypoint, image, descriptor, blurred images
Application number: PCT/CN2021/077477
Other languages: French (fr)
Inventor: Xun Xu
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Application filed by: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority: CN202180020889.8A (published as CN115362481A)
Publication: WO2021179905A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30241 Trajectory

Definitions

  • Augmented Reality superimposes virtual content over a user’s view of the real world.
  • SDK: AR software development kit
  • An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability.
  • a user can scan the environment using a smartphone’s camera, and the smartphone performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
  • a keypoint (or “interest point” ) is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) .
  • a keypoint descriptor may be defined as a multi-element vector that describes (typically, in scale space) the neighborhood of a keypoint in an image. Examples of keypoint descriptor frameworks include Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , and Binary Robust Invariant Scalable Keypoints (BRISK) .
  • An image feature may be defined as a keypoint and a corresponding keypoint descriptor.
  • Matching features across different images is an important component of many image processing applications. Images which are captured while the image sensor is moving may have significant motion blur. Because most of the widely used characterizations of image features are very sensitive to motion blur, feature matching is less likely to succeed when the features are extracted from images having different motion blurs. Therefore, there is a need in the art for improved methods of performing feature matching.
  • the present invention relates generally to methods and systems related to image processing. More particularly, embodiments of the present invention provide methods and systems for performing feature matching in augmented reality applications. Embodiments of the present invention are applicable to a variety of applications in augmented reality and computer-based display systems.
  • a method of generating a keypoint descriptor that is robust to motion blur according to a general configuration comprises selecting a plurality of keypoints in an image; for each of a plurality of motion blurs that are different from each other, applying each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, training an artificial neural network (ANN) to generate a keypoint descriptor.
  • a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
  • a computer system includes one or more processors; and one or more memories configured to store computer-readable instructions that, upon execution by the one or more processors, configure the computer system to select a plurality of keypoints in an image; for each of a plurality of motion blurs that are different from each other, apply each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, train an artificial neural network (ANN) to generate a keypoint descriptor.
  • a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
  • One or more non-transitory computer-storage media store instructions that, upon execution on a computer system, cause the computer system to perform operations including selecting a plurality of keypoints in an image; for each of a plurality of motion blurs that are different from each other, applying each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, training an artificial neural network (ANN) to generate a keypoint descriptor.
  • a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
  • embodiments of the present disclosure involve methods and systems that utilize a deep learning network to generate a keypoint descriptor that is robust to motion blur, to address the challenge of image feature matching when different motion blurs are present in camera frames.
  • Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications.
  • embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions. As a result, SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation.
  • embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions.
  • FIG. 1 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
  • FIG. 2A shows an example of an image according to an embodiment of the present invention.
  • FIG. 2B shows examples of keypoints in the image illustrated in FIG. 2A according to an embodiment of the present invention.
  • FIG. 3 shows an illustration of six degrees of freedom (6DOF) .
  • FIG. 4 shows a simplified flowchart illustrating a method of generating training data for training of a network according to an embodiment of the present invention.
  • FIG. 5 shows a simplified flowchart illustrating an example of a task of training an artificial neural network (ANN) to generate a keypoint descriptor according to an embodiment of the present invention.
  • FIG. 6 shows a simplified flowchart illustrating an operation of an ANN-based keypoint descriptor comparing module according to an embodiment of the present invention.
  • FIG. 7 shows a simplified flowchart illustrating a method of training an ANN of a keypoint descriptor comparing module according to an embodiment of the present invention.
  • FIGS. 8A and 8B show examples of training criteria according to an embodiment of the present invention.
  • FIG. 9 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 10 shows a simplified schematic diagram of a keypoint descriptor converter according to an embodiment of the present invention.
  • FIG. 11 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
  • FIG. 12 shows a simplified flowchart of a method of performing image processing according to an embodiment of the present invention.
  • FIG. 13 shows a simplified flowchart illustrating a method of training an ANN according to an embodiment of the present invention.
  • FIG. 14 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 15 shows a block diagram of a computer system according to an embodiment of the present invention.
  • Matching features across different images is used in a variety of applications, including image alignment (e.g., image stitching, image registration, panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, motion tracking, object recognition, and many others.
  • a basic requirement in many augmented reality or virtual reality (AR/VR) applications is to determine a device’s position and orientation in 3D space.
  • Such applications may use a simultaneous localization and mapping (SLAM) algorithm to determine a device’s real-time position and orientation and to infer a structure of the environment (or “scene” ) within which the device is operating.
  • frames of a video sequence from the device’s camera are input to a module that executes a SLAM algorithm.
  • Features are extracted from the frames and matched across different frames, and the SLAM algorithm searches for matched features that correspond to the same spot in the scene being captured.
  • the SLAM module can determine the image sensor’s motion within the scene and infer the major structure of the scene.
  • keypoint detection is performed on each of a plurality of images (e.g., on each frame of a video sequence) , and a corresponding keypoint descriptor is calculated for each of the detected keypoints from its neighborhood (typically in scale space) .
  • the number of keypoints detected for each image is typically at least a few dozen and may be as high as five hundred or more, and the neighborhood from which a keypoint descriptor is calculated typically has a radius of about fifteen pixels around the keypoint.
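  • As a concrete (non-patent) illustration of this per-frame extraction step, the sketch below uses OpenCV’s conventional SIFT implementation to detect keypoints and compute their descriptors for a single frame; the file name is hypothetical.

```python
import cv2

def extract_features(frame_gray, max_keypoints=500):
    """Detect keypoints and compute 128-D SIFT descriptors for one frame."""
    sift = cv2.SIFT_create(nfeatures=max_keypoints)
    # detectAndCompute returns the keypoints and an (N, 128) descriptor array.
    keypoints, descriptors = sift.detectAndCompute(frame_gray, None)
    return keypoints, descriptors

# Example: load one frame as grayscale and extract its features.
frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
kps, descs = extract_features(frame)
print(f"{len(kps)} keypoints, descriptor shape: {None if descs is None else descs.shape}")
```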
  • To compute a matching score, a keypoint descriptor f0 from a first image I0 (e.g., a frame of the video sequence) is compared against a keypoint descriptor f1 from a second image I1 (e.g., a different frame of the video sequence, such as a consecutive frame in the sequence) . The score metric is usually a distance d (f0, f1) between the keypoint descriptors f0 and f1 in the descriptor space, such as a distance according to any of the example distance metrics below.
  • This score computation is repeated for different pairs of keypoint descriptors from the two images, and the resulting scores may be thresholded to identify matching features: e.g., to determine whether a particular pair of keypoint descriptors from the two images (and thus, whether the corresponding features from the two images) match.
  • a pair of matched features corresponds to a single point in the physical environment, and this correspondence leads to a mathematical constraint.
  • a later stage of the SLAM computation may derive camera motion and an environmental model as an optimum solution that satisfies multiple constraints, including constraints generated by matching feature pairs.
  • An example distance metric is the Minkowski distance (also called the generalized Euclidean distance), d(f0, f1) = (sum_i |f0[i] - f1[i]|^p)^(1/p), which reduces to the Euclidean distance when p = 2.
  • Thresholding of the resulting scores to determine whether the corresponding features match may be performed according to a procedure such as the following, where T denotes a threshold value: if d (f0, f1) < T, the features are considered to match; otherwise, they are considered not to match.
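  • The body of the thresholding procedure is not reproduced above; the following is a minimal sketch of how such a score-and-threshold test is commonly written, assuming a Minkowski distance as the score metric and an arbitrary example value for T.

```python
import numpy as np

def match_score(f0: np.ndarray, f1: np.ndarray, p: float = 2.0) -> float:
    """Minkowski distance between two keypoint descriptors (p=2 gives Euclidean)."""
    return float(np.sum(np.abs(f0 - f1) ** p) ** (1.0 / p))

def is_match(f0: np.ndarray, f1: np.ndarray, T: float = 0.7) -> bool:
    """Declare a match when the descriptor distance falls below threshold T.

    T = 0.7 is an arbitrary illustrative value; in practice it is tuned per descriptor.
    """
    return match_score(f0, f1) < T
```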
  • a process of keypoint descriptor matching as described above is repeated, for multiple different pairs (f0, f1) of keypoint descriptors, over consecutive pairs of frames of a video sequence (e.g., on each consecutive pair of frames) .
  • the process of keypoint descriptor matching is repeated for each pair that comprises the descriptor of the keypoint in the first image and the descriptor of a keypoint that is within a threshold distance (e.g., twenty pixels) of the same location in the second image.
  • Keypoint descriptor matching is not limited to any particular size or format of the source images (i.e., first image I0 and second image I1) , but examples from typical feature matching applications are now provided.
  • Most current AR/VR devices are configured to capture video in VGA format (i.e., having a frame size of 640 x 480 pixels) , with each pixel having a red, green, and blue component.
  • the largest frame format typically seen in such devices is 1280 x 720 pixels, such that a maximum size of each of the first and second images in a typical application is about one thousand by two thousand pixels.
  • the minimum size of each of the first and second images in a typical application is about one-quarter VGA (i.e., 320 x 240 pixels) , as a smaller image size would likely not be enough to support an algorithm such as SLAM.
  • Calculation of a keypoint descriptor is typically implemented using an existing keypoint descriptor framework, such as Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , Binary Robust Invariant Scalable Keypoints (BRISK) , etc.
  • a task may comprise calculating an orientation of the keypoint, which may include determining how, or in what direction, a pixel neighborhood (also called an “image patch” ) that surrounds the keypoint is oriented.
  • Calculating an orientation of the keypoint, which may include detecting the most dominant orientation of the gradient angles in the patch, is typically performed on the patch at different scales of a scale space.
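  • A simplified sketch of such a dominant-orientation computation (a gradient-magnitude-weighted histogram of gradient angles, without the Gaussian weighting used by SIFT) might look like this:

```python
import numpy as np

def dominant_orientation(patch: np.ndarray, num_bins: int = 36) -> float:
    """Return the dominant gradient orientation (radians) of an image patch.

    Builds a histogram of gradient angles weighted by gradient magnitude and
    returns the center of the peak bin (simplified orientation assignment).
    """
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)  # range (-pi, pi]
    hist, edges = np.histogram(angle, bins=num_bins, range=(-np.pi, np.pi),
                               weights=magnitude)
    peak = np.argmax(hist)
    return 0.5 * (edges[peak] + edges[peak + 1])  # center of the peak bin
```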
  • the SIFT framework assigns a 128-dimensional feature vector to each keypoint based on the gradient orientations of pixels in sixteen local neighborhoods of the keypoint.
  • Some keypoint descriptor frameworks (e.g., SIFT and SURF) include both keypoint detection and keypoint descriptor calculation. Other keypoint descriptor frameworks (e.g., Binary Robust Independent Elementary Features (BRIEF) ) include only the keypoint descriptor calculation.
  • Examples of AR/VR devices include mobile phones and head-mounted devices (e.g., AR or “smart” glasses) . When the image sensor (e.g., a video camera) of such a device moves during capture, the captured frames may have significant motion blur. If the image sensor’s motion during capture of image I0 is the same as the image sensor’s motion during capture of image I1, and each of the descriptors f0 and f1 correspond to the same keypoint in the two images, then the values of the descriptors f0 and f1 tend to be similar such that the computed distance between them is small.
  • the image sensor would typically experience different motions when capturing each image, so that the descriptors f0 and f1 could be distorted by different motion blurs.
  • Because widely used image feature descriptors (e.g., SIFT, SURF, BRISK) are very sensitive to any significant motion blur (e.g., a blur of five pixels or more) , the values of the descriptors f0 and f1 may be very different, even if the descriptors correspond to the same keypoint.
  • Motion direction and magnitude can be estimated from input of one or more motion sensors of the device, for example, and/or can be calculated from two temporal neighbor frames in the video sequence.
  • Motion sensors (which may include one or more gyroscopes, accelerometers, and/or magnetometers) may indicate a displacement and/or change in orientation of the device and may be implemented within an inertial measurement unit (IMU) .
  • Examples of techniques for coping with motion blur may include the following:
  • Motion Blur Robust Image Feature Matching: In many feature matching applications (e.g., in most SLAM applications) , the motion blur can be quantified. Therefore, a deep learning based motion-blur converter may be designed to simulate the descriptor distortion caused by motion blur. Before feature matching, at least one of the descriptors to be matched is converted, thereby making sure that the input descriptors include the same motion blur influences. In another example, a deep learning based motion-blur-aware descriptor comparing module may be designed to determine whether input features match or not, given the known image motions.
  • Shortcomings of the above approaches may include the following:
  • Don’t Match: To prevent false matchings of features in blurred images from degrading the estimations in SLAM, one may choose not to do feature matching at all when significant image motion has been detected. For example, one may choose to perform SLAM using only motion sensor output at these times. However, this approach may cause the image sensor output at such moments to be completely wasted, and the SLAM calculation may in turn become less accurate and less stable.
  • Image deblurring usually involves significant computation. Since the SLAM computations are commonly carried out on a mobile platform, the additional computation required for deblurring processing may not be always available or affordable. Moreover, the deblurring operation tends to add new artifacts to the original image, which in turn are likely to negatively impact the image feature matching accuracy.
  • embodiments described herein implemented using appropriate systems, methods, apparatus, devices, and the like, as disclosed herein, may support increased accuracy of feature matching operations in applications that are prone to motion blur.
  • the embodiments described herein can be implemented in any of a variety of applications that use feature matching, including image alignment (e.g., image stitching, image registration, panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, endoscopic imaging, motion tracking, object tracking, object recognition, automated navigation, SLAM, etc.
  • a deep learning network is trained and used to generate a keypoint descriptor that is robust to motion blur, to address the challenge of image feature matching when different motion blurs are present in camera frames.
  • Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications.
  • embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions.
  • SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation.
  • embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions.
  • FIG. 1 shows a simplified flowchart of a method of generating a keypoint descriptor that is robust to motion blur according to an embodiment of the present invention.
  • The method 100 of generating a keypoint descriptor that is robust to motion blur, as illustrated in FIG. 1, includes tasks 110, 120, and 130.
  • Task 110 selects a corresponding plurality of keypoints in an image.
  • task 120 applies each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images.
  • task 130 trains an artificial neural network (ANN) to generate a keypoint descriptor, wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same feature in different ones of the plurality of blurred images.
  • FIG. 1 provides a particular method of generating a keypoint descriptor that is robust to motion blur according to an embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 1 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • a keypoint is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) .
  • FIG. 2A shows an example of an image (e.g., a frame of a video sequence) according to an embodiment of the present invention.
  • FIG. 2B shows examples of keypoints in the image illustrated in FIG. 2A according to an embodiment of the present invention. Referring to FIG. 2B, the circles indicate the locations of a few examples of keypoints 210-220 in the image.
  • the number of keypoints detected in each image in a typical feature matching application is at least a dozen, twenty-five, or fifty and may range up to one hundred, two hundred, or five hundred or more.
  • task 110 selects a corresponding plurality of keypoints in an image.
  • keypoint detectors that may be used to implement task 110 include corner detectors (e.g., Harris corner detector, Features from Accelerated Segment Test (FAST) ) and blob detectors (e.g., Laplacian of Gaussian (LoG) , Difference of Gaussians (DoG) , Determinant of Hessian (DoH) ) .
  • Such a keypoint detector may be configured to blur (e.g., Gaussian blur) and resample the image with different blur widths and at different resolutions to create a scale space, and to detect corners and/or blobs at different scales.
  • the method may include downsampling the original images to create versions at different resolutions.
  • If the original resolution of an image is 640 x 480 pixels, for example, it may be desired to downsample to obtain the same image at resolutions of 320 x 240 pixels, 160 x 120 pixels, etc.
  • a lower resolution allows coverage of a larger scene area for the same size of neighborhood window.
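  • A minimal sketch of this downsampling step, using OpenCV’s pyrDown to halve the resolution at each level (640 x 480 to 320 x 240 to 160 x 120), is shown below; the number of levels is an illustrative choice.

```python
import cv2

def build_pyramid(image, num_levels=3):
    """Downsample the image by factors of two to obtain coarser-resolution versions."""
    levels = [image]
    for _ in range(num_levels - 1):
        levels.append(cv2.pyrDown(levels[-1]))  # Gaussian blur + 2x downsample
    return levels
```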
  • task 120 applies each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images. It may be desired for the training data to be adequate for the trained ANN to encapsulate complex logic and computation. Image motion blur can be simulated by image processing operations (such as directional filtering, etc. ) , and one or more such operations may be used to produce a large amount of synthetic training data.
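  • One simple way to synthesize such training data is a straight-line (directional filtering) blur, as in the sketch below; the kernel length and angle are illustrative parameters, and a real implementation could instead render the trajectory reported by the motion sensors.

```python
import numpy as np
import cv2

def linear_motion_blur(image, length_px=9, angle_deg=0.0):
    """Simulate a straight-line motion blur by directional filtering.

    Builds a normalized line kernel of the given length and orientation and
    convolves it with the image.
    """
    kernel = np.zeros((length_px, length_px), dtype=np.float32)
    kernel[length_px // 2, :] = 1.0                    # horizontal line
    center = (length_px / 2 - 0.5, length_px / 2 - 0.5)
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length_px, length_px))
    kernel /= kernel.sum()                             # preserve overall brightness
    return cv2.filter2D(image, -1, kernel)
```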
  • Each of the plurality of motion blurs may be implemented as a motion descriptor that describes a trajectory or path in a coordinate space of one, two, or three spatial dimensions.
  • a trajectory may be described as a sequence of positions in the two-dimensional image plane, and each position may be expressed as a motion vector relative to the previous sampled position (e.g., taking the position of the keypoint to be the starting position) .
  • such a trajectory may be described as a sequence of one or more positions of the image sensor that captures the image, sampled at uniform intervals during the corresponding capture period, and each sampled position may be expressed as a motion vector relative to the previous sampled position (e.g., taking the position of the image sensor at the start of the capture period to be the origin of the coordinate space) .
  • the motion blurs may be calculated from motion sensor (e.g., IMU) outputs and/or from neighboring frames in a video sequence (e.g., from the frame captured just before and the frame captured just after the frame for which the motion descriptor is being calculated) .
  • the capture period for each frame in a video sequence is typically the reciprocal of the frame rate, although it is possible for the capture period to be shorter.
  • the frame rate for a typical video sequence (e.g., as captured by an Android phone) is thirty frames per second (fps) .
  • the frame rate for an iPhone or head-mounted device can be as high as 120 fps.
  • Each motion descriptor may be further implemented to describe a motion in six degrees of freedom (6DOF) .
  • a motion may include a rotation about each of one or more of the axes of these dimensions.
  • As shown in FIG. 3, which provides an illustration of six degrees of freedom (6DOF) , these rotations may be labeled as tilt, pitch, and yaw.
  • the motion descriptor may include, for each sampled position of the image sensor, a corresponding orientation of a reference direction of the image sensor (e.g., the look direction) relative to the orientation at the previous sampled position (e.g., taking the reference orientation to be the orientation at the start of the capture period) .
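  • The patent does not fix a concrete data format for such a motion descriptor; one possible representation of the sampled 6DOF trajectory, with assumed field names, is sketched below.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MotionDescriptor:
    """6DOF trajectory sampled at uniform intervals over one capture period.

    Each sample holds a translation (dx, dy, dz) relative to the previous sample
    and an orientation change (tilt, pitch, yaw) of the sensor's look direction.
    Field names are illustrative assumptions, not taken from the patent.
    """
    translations: List[Tuple[float, float, float]] = field(default_factory=list)
    rotations: List[Tuple[float, float, float]] = field(default_factory=list)

    def add_sample(self, dx, dy, dz, tilt, pitch, yaw):
        self.translations.append((dx, dy, dz))
        self.rotations.append((tilt, pitch, yaw))
```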
  • task 130 trains an artificial neural network (ANN) to generate a keypoint descriptor.
  • the training performed in task 130 is based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales.
  • the neighborhoods of the keypoints may be obtained from the same created scale space.
  • task 130 may include blurring (e.g., Gaussian blurring) and resampling the image with different blur widths and at different resolutions to create a scale space.
  • the input to the ANN is the pixel values of the neighborhood window of a keypoint at different resolutions.
  • the actual window size at each resolution may be chosen with the considerations of matching and computational performance.
  • a larger window typically covers a larger neighborhood area, but also makes the descriptor generating network larger. As a result, more computation may be required to generate the corresponding descriptor.
  • the output of the ANN is a multi-element vector, which is the generated descriptor.
  • the actual length of the descriptor may be determined according to considerations of matching and computational performance, as well as memory usage and memory access overhead when saving and retrieving a large number of descriptors. While a longer descriptor typically encapsulates more neighborhood information, it may also make the comparing network in keypoint descriptor comparing module 600 (as described below) larger. As a result, more computation may be consumed in matching the descriptors. Additionally, more memory and/or more data bandwidth may be required to save and retrieve longer descriptors. When the number of descriptors is large, such an increase in memory and/or bandwidth overhead may be significant.
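  • The patent does not specify a network architecture; the following sketch shows one plausible shape for such a descriptor-generating ANN, with assumed window size (32 x 32), number of scales, and descriptor length.

```python
import torch
import torch.nn as nn

class DescriptorNet(nn.Module):
    """Sketch of a descriptor-generating ANN (all sizes are illustrative assumptions).

    Input: a stack of S neighborhood windows (one per scale) around a keypoint,
    each 32x32 pixels. Output: an L2-normalized descriptor vector of length 64.
    """
    def __init__(self, num_scales: int = 3, descriptor_len: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_scales, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16x16 -> 8x8
        )
        self.head = nn.Linear(32 * 8 * 8, descriptor_len)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_scales, 32, 32)
        x = self.features(patches).flatten(1)
        d = self.head(x)
        return nn.functional.normalize(d, dim=1)  # unit-length descriptor

net = DescriptorNet()
dummy = torch.randn(8, 3, 32, 32)                 # 8 keypoints, 3 scales each
print(net(dummy).shape)                           # torch.Size([8, 64])
```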
  • FIG. 4 shows a simplified flowchart illustrating a method of generating training data for task 130 according to an embodiment of the present invention.
  • a plurality of keypoints are detected (422) in the training image (e.g., as described herein with reference to task 110) .
  • a plurality of motion blurs M1 to Mn are also applied (424-1 to 424-n) to the training image (420) to produce a corresponding plurality of blurred images B1 to Bn.
  • a neighborhood is extracted for each of the detected keypoints and at each of the plurality of different scales to produce a corresponding set of neighborhoods S1 to Sn. It is possible for the number of neighborhoods to differ from one set to another: for example, a neighborhood of a keypoint may be omitted from a set if the corresponding motion blur caused the neighborhood to include an area beyond the boundaries of the image.
  • a criterion of the training performed in task 130 is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images. Examples of a distance measure that may be used in task 130 include Euclidean distance, chi-squared distance, etc.
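  • A minimal sketch of this training criterion, using the squared Euclidean distance between descriptors generated for the same keypoints in two differently blurred images, is shown below; in a full training loop this loss would be evaluated over pairs of blurred images Bj and Bk and backpropagated through the descriptor-generating network.

```python
import torch

def same_keypoint_loss(desc_a: torch.Tensor, desc_b: torch.Tensor) -> torch.Tensor:
    """Mean squared Euclidean distance between descriptors of the same keypoints.

    desc_a, desc_b: (num_keypoints, descriptor_len) tensors, where row i of each
    tensor is the descriptor generated for keypoint i in one blurred image.
    """
    return ((desc_a - desc_b) ** 2).sum(dim=1).mean()
```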
  • FIG. 4 provides a particular method of generating training data according to an embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 5 shows a simplified flowchart illustrating an example of task 130 of training an artificial neural network (ANN) 500 to generate a keypoint descriptor according to an embodiment of the present invention.
  • the ANN 500 being trained receives input, from the training data, of a neighborhood of a keypoint i from a blurred image Bj (i.e., from set Sj) and generates a descriptor Dij as output.
  • the ANN 500 being trained receives input, from the training data, of a neighborhood of the keypoint i from a different blurred image Bk (i.e., from set Sk) and generates a descriptor Dik as output.
  • a loss function selector 510 compares the motion blurs Mj and Mk that correspond to the two generated descriptors Dij and Dik. If selector 510 determines that the motion blurs are similar (e.g., that a distance between the motion blurs does not exceed a threshold value) , then a first loss function 512 is selected. In one example, the first loss function 512 is to minimize a distance between the generated descriptors (e.g., Euclidean distance, chi-squared distance, etc. ) .
  • Otherwise, a second loss function 514 is selected.
  • the second loss function 514 is to maximize the matching probability as indicated by an ANN-based keypoint descriptor comparing module 600.
  • Although it is possible to omit the first loss function 512 and use only the second loss function 514, using a distance-based loss function 512 for cases in which the keypoint descriptors f0 and f1 come from images having similar motion blurs (e.g., when camera motion is slow) may be expected to reduce computational requirements, because inferring by a deep learning network usually involves much more computation than calculating a distance between two keypoint descriptors.
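  • The loss-selection logic of FIG. 5 might be sketched as follows; the blur-distance threshold and the comparing-network call signature are assumptions for illustration.

```python
import torch

def training_loss(desc_j, desc_k, blur_j, blur_k, comparing_net, blur_threshold=5.0):
    """Select between the two loss functions based on motion-blur similarity.

    If the two motion blurs are similar, use a plain descriptor-distance loss
    (first loss function); otherwise maximize the matching probability reported
    by the ANN-based comparing module (second loss function).
    """
    blur_distance = torch.linalg.vector_norm(blur_j - blur_k)
    if blur_distance <= blur_threshold:
        return ((desc_j - desc_k) ** 2).sum()             # first loss function 512
    match_prob = comparing_net(desc_j, desc_k, blur_j, blur_k)
    return -torch.log(match_prob + 1e-8)                  # second loss function 514
```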
  • FIG. 5 provides a particular method of training an ANN to generate a keypoint descriptor according to an embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 5 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 6 shows a simplified flowchart illustrating an operation of ANN-based keypoint descriptor comparing module 600 according to an embodiment of the present invention.
  • the keypoint descriptor comparing module 600 is used to determine whether two keypoint descriptors match, even if the features belong to two images with different image motions.
  • keypoint descriptor comparing module 600 receives four inputs, which include the two keypoint descriptors being compared (denoted as f0 and f1) and the corresponding motion blurs (denoted as M0 and M1) . It may be desired to normalize the values of the motion blurs to occupy the same range as the values of the generated keypoint descriptors before input to the ANN.
  • the output is a binary decision 610 (i.e., whether or not the keypoint descriptors f0 and f1 match) as well as a value P (e.g., a probability value) 620 denoting the confidence of the output binary decision.
  • a deep learning network is trained to produce the match indication and confidence value outputs of the keypoint descriptor comparing module 600.
  • the network may be implemented as a classifier network as known in the field of deep learning. Given adequate training data, a classifier usually produces a good output. Training of such a network (e.g., a CNN) may be performed using training data obtained as described above (e.g., with reference to FIG. 4) . It may be desired to augment the synthetic training data with descriptors calculated from a relatively small amount of images that have real motion blur and are manually annotated by human annotators.
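  • A sketch of such a classifier, together with one binary cross-entropy training step, is shown below; the layer sizes, descriptor length, and motion-descriptor length are assumptions, and the random tensors stand in for real training batches.

```python
import torch
import torch.nn as nn

class ComparingNet(nn.Module):
    """Sketch of an ANN-based keypoint descriptor comparing module (sizes assumed).

    Inputs: two descriptors f0, f1 and the two (normalized) motion descriptors
    M0, M1. Output: probability P that f0 and f1 describe the same keypoint;
    the binary match decision is P > 0.5.
    """
    def __init__(self, descriptor_len=64, motion_len=16):
        super().__init__()
        in_dim = 2 * descriptor_len + 2 * motion_len
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, f0, f1, m0, m1):
        x = torch.cat([f0, f1, m0, m1], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)

# One binary cross-entropy training step (ground truth: 1 if same keypoint, else 0).
net = ComparingNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
f0, f1 = torch.randn(32, 64), torch.randn(32, 64)
m0, m1 = torch.randn(32, 16), torch.randn(32, 16)
labels = torch.randint(0, 2, (32,)).float()
loss = nn.functional.binary_cross_entropy(net(f0, f1, m0, m1), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```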
  • FIG. 6 provides for particular operation of an ANN-based keypoint descriptor comparing module according to an embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 6 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 7 shows a simplified flowchart illustrating a method of training an ANN 700 of keypoint descriptor comparing module 600 according to an embodiment of the present invention.
  • ANN 700 may be implemented as a binary classifier, such that the output of the ANN indicates a probability that the keypoint descriptor inputs represent the same feature in the corresponding images.
  • the output layer of the ANN may be configured to apply a sigmoid or softmax function, for example, and the loss function may be implemented as a binary cross-entropy function, with an indication of whether the training inputs represent the same feature providing the ground truth for the loss function during training of the ANN.
  • the ANN is implemented such that the input layer and the output layer are each arrays of size 32 x 32.
  • The parameters in the two networks may be cross-optimized, and training of the two networks can be combined. In one example, the training process is started using one or more traditional keypoint descriptor frameworks (e.g., SIFT, SURF, BRISK) .
  • In addition to the matching-accuracy-based training criteria described above for generating matching descriptors that correspond to the same keypoint and have different motion blurs (i.e., with reference to loss functions 512 and 514) , it may be desired to train ANN 700 to generate distinctive descriptors for different keypoints. For example, it may be desired to implement task 130 to include, for cases in which the generated descriptors correspond to different keypoints, a corresponding loss function to approximate a reference distance between descriptors calculated from those keypoints in the corresponding (unblurred) training image. The reference distance may be calculated, for example, using an existing keypoint descriptor framework, such as SIFT, SURF, or BRISK.
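  • One way such a distinctiveness criterion might be written is sketched below: the distance between the generated descriptors of two different keypoints is pushed toward a reference distance computed (e.g., by SIFT) from the unblurred training image; treating the two descriptor spaces as directly comparable is a simplifying assumption.

```python
import torch

def distinctiveness_loss(gen_desc_i, gen_desc_j, ref_desc_i, ref_desc_j):
    """Loss term for descriptors of *different* keypoints i and j.

    Pushes the distance between the generated descriptors toward the reference
    distance between descriptors calculated from the unblurred training image
    (all inputs are 1-D tensors; relative scaling of the two spaces is assumed).
    """
    gen_dist = torch.linalg.vector_norm(gen_desc_i - gen_desc_j)
    ref_dist = torch.linalg.vector_norm(ref_desc_i - ref_desc_j)
    return (gen_dist - ref_dist) ** 2
```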
  • FIG. 7 provides a particular method of training an ANN of a keypoint descriptor comparing module according to an embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 7 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 8A shows one such example of three different training criteria to be applied for matching neighborhoods that, on one hand, are blurred by similar or different motion blurs and, on the other hand, are neighborhoods of the same keypoint or of different keypoints.
  • As shown in FIG. 8B, it may be desired to further modify task 130 to train ANN 700 only on training data from images having significant motion blur.
  • a generated descriptor resulting from such training criteria may be used for feature matching in combination with a calculated descriptor framework (e.g., SIFT, SURF, or BRISK) , for example, such that the generated descriptor is used in the presence of significant motion blur and the calculated descriptor is used otherwise.
  • the trained ANN 500 may be used to generate keypoint descriptors (e.g., instead of a conventional calculated descriptor framework, such as SIFT, SURF, BRISK, etc. ) for feature matching as described herein.
  • a copy of the trained ANN 500 is stored (e.g., during manufacture and/or provisioning) to each of a run of devices having the same model of video camera (and, possibly, the same model of IMU) .
  • FIG. 9 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
  • the apparatus 900 illustrated in FIG. 9 can be utilized for keypoint descriptor generation on a mobile device (e.g., a cellular telephone, such as a smartphone, or a head-mounted device) or other computing device or system according to a general configuration that includes keypoint selector 910, motion blur applier 920, and ANN trainer 930.
  • Keypoint selector 910 is configured to select a corresponding plurality of keypoints in an image (e.g., as described herein with reference to task 110) .
  • Motion blur applier 920 is configured to apply each of a plurality of motion blurs that are different from each other to the image to generate a plurality of blurred images (e.g., as described herein with reference to task 120) .
  • ANN trainer 930 is configured to train an artificial neural network (ANN) to generate a keypoint descriptor, based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales (e.g., as described herein with reference to task 130) , wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same feature in different ones of the plurality of blurred images.
  • apparatus 900 is implemented within a device such as a mobile phone, which typically has a video camera configured to produce a sequence of frames that include the first and second images.
  • the device may also include one or more motion sensors, which may be configured to determine 6DOF motion of the device in space.
  • apparatus 900 is implemented within a head-mounted device, such as a set of AR glasses, which may also have motion sensors and one or more cameras. Additionally or alternatively, such a device may be used to obtain the training images (and possibly the motion blur data) and/or to apply the generated descriptor for feature matching.
  • a deep learning network may be used to function as a keypoint descriptor motion blur converter, or as a keypoint descriptor comparing module, to address the challenge of image feature matching when different motion blurs are present in camera frames.
  • a descriptor generated according to a method as described herein may be implemented together with the descriptor converting and/or comparing methods described in U.S. Provisional Patent Application No. 62/978,462 (Attorney Docket No. 105184-1166208-002300US) to further increase feature matching accuracy when features are from camera frames that are captured with different image motions.
  • the proposed solution allows high matching accuracy to be obtained using a matching scoring process that is significantly simplified, and in turn greatly reduces computation, as compared with cases that have significant camera motion.
  • FIG. 10 shows a simplified schematic diagram of a keypoint descriptor converter 1000 according to an embodiment of the present invention.
  • the conversion is performed on a keypoint descriptor f0 that is generated by trained ANN 500 from the neighborhood of a keypoint in image I0.
  • the descriptor converter takes three inputs: the keypoint descriptor f0; a motion descriptor M0 of the image motion when image I0 was captured; and a motion descriptor M1 of the image motion when image I1 was captured.
  • the output of the converter is a converted keypoint descriptor f1′.
  • the design goal is for the converted keypoint descriptor f1′ to be similar to a keypoint descriptor f1 that is generated by trained ANN 500 from the neighborhood of a keypoint in image I1, if the descriptors f0 and f1 correspond to the same keypoint, and for the converted descriptor f1′ and the descriptor f1 to be very different if the descriptors f0 and f1 refer to different keypoints.
  • the first and second motion descriptors describe a motion of the image sensor during capture of the first and second images, respectively, and may be calculated from motion sensor (e.g., IMU) outputs and/or from neighboring frames in a video sequence (e.g., from the frame captured just before and the frame captured just after the frame for which the motion descriptor is being calculated) .
  • FIG. 11 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
  • the method of image processing 1100 illustrated in FIG. 11 includes tasks 1110, 1120, 1130, and 1140.
  • Task 1110 uses trained ANN 500 to generate a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period.
  • Task 1120 uses trained ANN 500 to generate a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period.
  • Task 1130 uses a trained artificial neural network (ANN) to convert the generated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor.
  • Task 1140 compares the converted descriptor to the generated descriptor for the keypoint in the second image. For example, task 1140 may include calculating a distance between the converted descriptor and the generated descriptor in the descriptor space (e.g., Euclidean distance, chi-squared distance, etc. ) and comparing the calculated distance to a threshold value.
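  • The overall generate-convert-compare flow of method 1100 might be sketched as follows; the threshold value and the converter call signature are assumptions.

```python
import numpy as np

def match_with_conversion(desc0, desc1, m0, m1, converter_net, T=0.7):
    """Sketch of method 1100: convert the first descriptor, then compare.

    desc0/desc1 are descriptors produced by the trained generator (tasks 1110/1120);
    converter_net converts desc0 into the motion-blur conditions of the second
    image (task 1130); the result is compared to desc1 by a Euclidean distance
    threshold (task 1140). Returns (is_match, distance).
    """
    converted = converter_net(desc0, m0, m1)   # trained converter ANN of FIG. 10
    distance = float(np.linalg.norm(converted - desc1))
    return distance < T, distance
```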
  • FIG. 11 provides a particular method of performing image processing according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 11 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 12 shows a simplified flowchart of a method of performing image processing according to another embodiment of the present invention. As illustrated in FIG. 12, elements utilized in method 1100 are also utilized in method 1200 illustrated in FIG. 12, as well as additional elements as described below. Accordingly, the description provided in relation to FIG. 11 is applicable to FIG. 12 as appropriate.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • task 1210 selects the keypoint in the first image and task 1220 selects the keypoint in the second image.
  • keypoint detectors that may be used to implement tasks 1210 and 1220 include corner detectors (e.g., Harris corner detector, Features from Accelerated Segment Test (FAST) ) and blob detectors (e.g., Laplacian of Gaussian (LoG) , Difference of Gaussians (DoG) , Determinant of Hessian (DoH) ) .
  • Such a keypoint detector is typically configured to blur and resample the image with different blur widths and sampling rates to create a scale space, and to detect corners and/or blobs at different scales.
  • task 1130 uses a trained artificial neural network (ANN) to convert the generated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor.
  • ANNs that can be trained to perform such a complex conversion between multi-element vectors include convolutional neural networks (CNNs) and auto-encoders. It may be desired to implement the ANN to be rather small and fast: for example, to include less than ten thousand parameters, or less than five thousand parameters, and/or for the trained ANN to occupy less than five megabytes of storage.
  • the ANN is implemented such that the input layer and the output layer are each arrays of size 32 x 32.
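  • A sketch of such a deliberately small converter network is shown below; the layer sizes are assumptions chosen only to illustrate how easily the network can be kept under the stated parameter budget.

```python
import torch
import torch.nn as nn

class DescriptorConverter(nn.Module):
    """Sketch of the descriptor converter of FIG. 10 (layer sizes are assumptions).

    Maps (f0, M0, M1) to a converted descriptor f1'. Kept deliberately small,
    in line with the goal of a few thousand parameters at most.
    """
    def __init__(self, descriptor_len=32, motion_len=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(descriptor_len + 2 * motion_len, 48), nn.ReLU(),
            nn.Linear(48, descriptor_len),
        )

    def forward(self, f0, m0, m1):
        return self.net(torch.cat([f0, m0, m1], dim=-1))

converter = DescriptorConverter()
num_params = sum(p.numel() for p in converter.parameters())
print(num_params)   # 3920 parameters with these sizes, well under ten thousand
```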
  • a copy of the trained ANN is stored (e.g., during manufacture and/or provisioning) to each of a run of devices having the same model of video camera (and, possibly, the same model of IMU) . It may be desired to normalize the values of the motion descriptors to occupy the same range as the values of the calculated keypoint descriptor before input to the trained ANN.
  • FIG. 12 provides a particular method of performing image processing according to another embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 12 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 13 shows a simplified flowchart illustrating a method of training an ANN for keypoint conversion according to an embodiment of the present invention.
  • Keypoint descriptors for training of this ANN may be generated by trained ANN 500 from the training data (images, motion blurs, and keypoint neighborhoods) as described with reference to FIG. 4 above. As shown in FIG. 13, the corresponding generated keypoint descriptors provide ground truth for the loss function during training of the ANN. It may be desired to augment the synthetic training data with descriptors calculated from a relatively small amount of images that have real motion blur and are manually annotated by human annotators.
  • Method 1100 or 1200 may be implemented, for example, to include a task that calculates a distance between the first motion descriptor and the second motion descriptor and compares the distance to a threshold value.
  • Such an implementation of method 1100 or 1200 may be configured to use a score metric (e.g., a distance as described above) , rather than the trained network, to determine whether keypoint descriptors from the first image match keypoint descriptors from the second image, in response to an indication by the motion descriptor comparison task that the motion blur of the first image is similar to the motion blur of the second image.
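  • The resulting gating logic might be sketched as follows; the motion-distance threshold is an assumption.

```python
import numpy as np

def choose_matcher(m0, m1, motion_threshold=5.0):
    """Choose the matching strategy from the similarity of the two motion descriptors.

    When the motion descriptors of the two images are close, a plain descriptor
    distance score is sufficient; only when they differ significantly is the
    trained (and more expensive) network-based matching used.
    """
    if np.linalg.norm(np.asarray(m0) - np.asarray(m1)) <= motion_threshold:
        return "distance_score"      # traditional score metric
    return "trained_network"         # converter / comparing module
```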
  • FIG. 14 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
  • the apparatus 1400 illustrated in FIG. 14 can be utilized for image processing on a mobile device (e.g., a cellular telephone, such as a smartphone, or a head-mounted device) according to a general configuration that includes keypoint descriptor generator 1410, keypoint descriptor converter 1420, and keypoint descriptor comparer 1430.
  • Keypoint descriptor generator 1410 includes an instance of trained ANN 500 and is configured to generate a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period and to generate a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period (e.g., as described herein with reference to tasks 1110 and 1120, respectively) .
  • Keypoint descriptor converter 1420 is configured to use a trained ANN to convert the generated descriptor for the keypoint in the first image to a converted descriptor, based on a first motion descriptor that describes motion of the image sensor during the first time period and a second motion descriptor that describes motion of the image sensor during the second time period (e.g., as described herein with reference to task 1130) .
  • Keypoint descriptor comparer 1430 is configured to compare the converted descriptor to the generated descriptor for the keypoint in the second image (e.g., as described herein with reference to task 1140) .
  • apparatus 1400 is implemented within a device such as a mobile phone, which typically has a video camera configured to produce a sequence of frames that include the first and second images.
  • the device may also include one or more motion sensors, which may be configured to determine 6DOF motion of the device in space.
  • apparatus 1400 is implemented within a head-mounted device, such as a set of AR glasses, which may also have motion sensors and one or more cameras.
  • trained keypoint descriptor comparing module 600 is used instead of keypoint descriptor converter 1420 to perform feature matching.
  • keypoint descriptor comparing module 600 is used to determine whether two keypoint descriptors match, even if the features belong to two images with different image motions.
  • Using keypoint descriptor comparing module 600 is a more holistic way of addressing failure of feature matching caused by motion blur.
  • keypoint descriptor comparing module 600 receives four inputs, which include the two keypoint descriptors, f0 and f1, and the motion descriptors M0 and M1 that correspond to the motion blurs of the source images of f0 and f1.
  • the output is a binary decision 610 (i.e., whether or not f0 and f1 match) as well as a value P 620 denoting the confidence of the output binary decision.
  • keypoint descriptor comparing module 600 encapsulates the score metric and tends to have higher matching accuracy. On the other hand, comparing module 600 takes more inputs than converter 1420 and typically includes a larger network. As a result, this solution tends to have a larger memory footprint and to consume more computational resources. As noted above, it may be desired to use a traditional scoring metric instead when the keypoint descriptors f0 and f1 come from images having similar image motions, in order to save some computation.
  • the embodiments discussed herein may be implemented in a variety of fields that may include feature matching, such as image alignment (e.g., panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, etc.
  • the training images are not limited to images produced by a visible-light camera (e.g., in RGB or another color space) , but may also be images produced by a camera that is sensitive to non-visible light (e.g., infrared (IR) , ultraviolet (UV) ) , images produced by a structured light camera, and/or images produced by an image sensor other than a camera (e.g., imaging using RADAR, LIDAR, SONAR, etc. ) .
  • the embodiments described herein may also be extended beyond motion blur to cover other factors that may distort keypoint descriptors, such as illumination change, etc.
  • FIG. 15 illustrates examples of components of a computer system 1500 that may be configured to perform an implementation of a method as described herein (e.g., method 100, 1100, and/or 1200) . Although these components are illustrated as belonging to a same computer system 1500 (e.g., a smartphone or head-mounted device) , computer system 1500 may also be implemented such that the components are distributed (e.g., among different servers, among a smartphone and one or more network entities, etc. ) .
  • the computer system 1500 includes at least a processor 1502, a memory 1504, a storage device 1506, input/output peripherals (I/O) 1508, communication peripherals 1510, and an interface bus 1512.
  • the interface bus 1512 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 1500.
  • the memory 1504 and/or the storage device 1506 may be configured to store the training images (e.g., to store frames of a video sequence) and may include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM) , hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage (for example, flash memory) , and other tangible storage media.
  • Any of such computer readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure.
  • the memory 1504 and the storage device 1506 also include computer readable signal media.
  • a computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof.
  • a computer readable signal medium includes any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 1500.
  • the memory 1504 includes an operating system, programs, and applications.
  • the processor 1502 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors.
  • the memory 1504 and/or the processor 1502 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center.
  • the I/O peripherals 1508 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices (e.g., an image sensor configured to capture the images to be indexed) , and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals.
  • the I/O peripherals 1508 are connected to the processor 1502 through any of the ports coupled to the interface bus 1512.
  • the communication peripherals 1510 are configured to facilitate communication between the computer system 1500 and other computing devices (e.g., cloud computing entities configured to perform portions of indexing and/or query searching methods as described herein) over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
  • a computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs.
  • Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
  • Embodiments of the methods disclosed herein may be performed in the operation of such computing devices.
  • the order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
  • an implementation of an apparatus or system as disclosed herein may be embodied in any combination of hardware with software and/or with firmware that is deemed suitable for the intended application.
  • such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset.
  • One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays.
  • Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) .
  • Such an apparatus may also be implemented to include a memory configured to store the training images and/or the sets of neighborhoods.
  • a processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset.
  • One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays.
  • Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) .
  • arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs (digital signal processors) , FPGAs (field-programmable gate arrays) , ASSPs (application-specific standard products) , and ASICs (application-specific integrated circuits) .
  • a processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors.
  • a processor as described herein can be used to perform tasks or execute other sets of instructions that are not directly related to a procedure of an implementation of method 100, 1100, or 1200 (or another method as disclosed with reference to operation of an apparatus or system described herein) , such as a task relating to another operation of a device or system in which the processor is embedded (e.g., a voice communications device, such as a smartphone, or a smart speaker) . It is also possible for part of a method as disclosed herein to be performed under the control of one or more other processors.
  • Each of the tasks of the methods disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two.
  • an array of logic elements (e.g., logic gates) may be configured to perform one, more than one, or even all of the various tasks of such a method.
  • One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions) embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, or semiconductor memory chips) that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine) .
  • the tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine.
  • the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability.
  • a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP) .
  • such a device may include RF circuitry configured to receive and/or transmit encoded frames.
  • computer-readable media includes both computer-readable storage media and communication (e.g., transmission) media.
  • computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM) , or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices.
  • Such storage media may store information in the form of instructions or data structures that can be accessed by a computer.
  • Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium.
  • Disk and disc, as used herein, include compact disc (CD) , laser disc, optical disc, digital versatile disc (DVD) , floppy disk, and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif. ) , where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • a non-transitory computer-readable storage medium comprises code which, when executed by at least one processor, causes the at least one processor to perform a method of generating a keypoint descriptor that is robust to motion blur as described herein (e.g., method 100 or 1100 or 1200) .
  • a storage medium include a medium further comprising code which, when executed by the at least one processor, causes the at least one processor to perform a method of generating a keypoint descriptor that is robust to motion blur as described herein.
  • the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium.
  • the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing.
  • the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values.
  • the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device) , and/or retrieving (e.g., from an array of storage elements) .
  • the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more.
  • the term “determining” is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating.
  • the term “based on” is used to indicate any of its ordinary meanings, including the cases (i) “derived from” (e.g., “B is a precursor of A” ) , (ii) “based on at least” (e.g., “A is based on at least B” ) and, if appropriate in the particular context, (iii) “equal to” (e.g., “A is equal to B” ) .
  • the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least. ”
  • the terms “at least one of A, B, and C, ” “one or more of A, B, and C, ” “at least one among A, B, and C, ” and “one or more among A, B, and C” indicate “A and/or B and/or C. ”
  • the terms “each of A, B, and C” and “each among A, B, and C” indicate “A and B and C. ”
  • any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa)
  • any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa)
  • the term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context.
  • the terms “method, ” “process, ” “procedure, ” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context.
  • a “task” having multiple subtasks is also a method.
  • an ordinal term (e.g., “first, ” “second, ” “third, ” etc. ) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term) .
  • each of the terms “plurality” and “set” is used herein to indicate an integer quantity that is greater than one.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Methods, computer systems, and computer-storage media for image processing are disclosed. In one example, keypoints are selected in an image, and different motion blurs are applied to the image to generate a plurality of blurred images. Based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, an artificial neural network is trained to generate a keypoint descriptor.

Description

MOTION BLUR ROBUST IMAGE FEATURE DESCRIPTOR BACKGROUND OF THE INVENTION
Augmented Reality (AR) superimposes virtual content over a user’s view of the real world. With the development of AR software development kits (SDK) , the mobile industry has brought smartphone AR to the mainstream. An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability. A user can scan the environment using a smartphone’s camera, and the smartphone performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
A keypoint (or “interest point” ) is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) . A keypoint descriptor may be defined as a multi-element vector that describes (typically, in scale space) the neighborhood of a keypoint in an image. Examples of keypoint descriptor frameworks include Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , and Binary Robust Invariant Scalable Keypoints (BRISK) . An image feature may be defined as a keypoint and a corresponding keypoint descriptor.
Matching features across different images (e.g., across different frames of a video sequence) is an important component of many image processing applications. Images which are captured while the image sensor is moving may have significant motion blur. Because most of the widely used characterizations of image features are very sensitive to motion blur, feature matching is less likely to succeed when the features are extracted from images having different motion blurs. Therefore, there is a need in the art for improved methods of performing feature matching.
SUMMARY OF THE INVENTION
The present invention relates generally to methods and systems related to image processing. More particularly, embodiments of the present invention provide methods and systems for performing feature matching in augmented reality applications. Embodiments of the present invention are applicable to a variety of applications in augmented reality and computer-based display systems.
A method of generating a keypoint descriptor that is robust to motion blur according to a general configuration comprises selecting a plurality of keypoints in an image; for each of a plurality of motion blurs that are different from each other, applying each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, training an artificial neural network (ANN) to generate a keypoint descriptor. In this method, a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
A computer system according to another general configuration includes one or more processors; and one or more memories configured to store computer-readable instructions that, upon execution by the one or more processors, configure the computer system to select a plurality of keypoints in an image; for each of a plurality of motion blurs that are different from each other, apply each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, train an artificial neural network (ANN) to generate a keypoint descriptor. In this system, a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
One or more non-transitory computer-storage media according to a further general configuration store instructions that, upon execution on a computer system, cause the computer system to perform operations  including selecting a plurality of keypoints in an image; for each of a plurality of motion blurs that are different from each other, applying each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, training an artificial neural network (ANN) to generate a keypoint descriptor. A criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
Numerous benefits are achieved by way of the present invention over conventional techniques. For example, embodiments of the present disclosure involve methods and systems that utilize a deep learning network to generate a keypoint descriptor that is robust to motion blur, to address the challenge of image feature matching when different motion blurs are present in camera frames. Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications. Moreover, embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions. As a result, SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation. Thus, embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions. These and other embodiments of the invention along with many of its advantages and features are described in more detail in conjunction with the text below and attached figures.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
FIG. 2A shows an example of an image according to an embodiment of the present invention.
FIG. 2B shows examples of keypoints in the image illustrated in FIG. 2A according to an embodiment of the present invention.
FIG. 3 shows an illustration of six degrees of freedom (6DOF) .
FIG. 4 shows a simplified flowchart illustrating a method of generating training data for training of a network according to an embodiment of the present invention.
FIG. 5 shows a simplified flowchart illustrating an example of a task of training an artificial neural network (ANN) to generate a keypoint descriptor according to an embodiment of the present invention.
FIG. 6 shows a simplified flowchart illustrating an operation of an ANN-based keypoint descriptor comparing module according to an embodiment of the present invention.
FIG. 7 shows a simplified flowchart illustrating a method of training an ANN of a keypoint descriptor comparing module according to an embodiment of the present invention.
FIGS. 8A and 8B show examples of training criteria according to an embodiment of the present invention.
FIG. 9 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
FIG. 10 shows a simplified schematic diagram of a keypoint descriptor converter according to an embodiment of the present invention.
FIG. 11 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
FIG. 12 shows a simplified flowchart of a method of performing image processing according to an embodiment of the present invention.
FIG. 13 shows a simplified flowchart illustrating a method of training an ANN according to an embodiment of the present invention.
FIG. 14 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
FIG. 15 shows a block diagram of a computer system according to an embodiment of the present invention.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
Many applications depend heavily on the performance of image feature matching. Such applications may include image alignment (e.g., image stitching, image registration, panoramic mosaics) , three-dimensional (3D) reconstruction (e.g., stereoscopy) , indexing and content retrieval, motion tracking, object recognition, and many others.
A basic requirement in many augmented reality or virtual reality (AR/VR) applications is to determine a device’s position and orientation in 3D space. Such applications may use a simultaneous localization and mapping (SLAM) algorithm to determine a device’s real-time position and orientation and to infer a structure of the environment (or “scene” ) within which the device is operating. In one example of a SLAM application, frames of a video sequence from the device’s camera are input to a module that executes a SLAM algorithm. Features are extracted from the frames and matched across different frames, and the SLAM algorithm searches for matched features that correspond to the same spot in the scene being captured. By tracking the positions of the features across different frames, the SLAM module can determine the image sensor’s motion within the scene and infer the major structure of the scene.
In traditional feature matching, keypoint detection is performed on each of a plurality of images (e.g., on each frame of a video sequence) , and a corresponding keypoint descriptor is calculated for each of the detected keypoints from its neighborhood (typically in scale space) . The number of keypoints detected for each image is typically at least a few dozen and may be as high as five hundred or more, and the neighborhood from which a keypoint descriptor is calculated typically has a radius of about fifteen pixels around the keypoint. A keypoint descriptor f0 from a first image I0 (e.g., a frame of the video sequence) and a keypoint descriptor f1 from a second image I1 (e.g., a different frame of the video sequence, such as a consecutive frame in the sequence) are used to compute a matching score. The score metric is usually a distance d (f0, f1) between the keypoint descriptors f0 and f1 in the descriptor space, such as a distance according to any of the example distance metrics below. This score computation is repeated for different pairs of keypoint descriptors from the two images, and the resulting scores may be thresholded to identify matching features: e.g., to determine whether a particular pair of keypoint descriptors from the two images (and thus, whether the corresponding features from the two images) match. In a typical SLAM application, a pair of matched features corresponds to a single point in the physical environment, and this correspondence leads to a math constraint. A later stage of the SLAM computation may derive camera motion and an environmental model as an optimum solution that satisfies multiple constraints, including constraints generated by matching feature pairs.
Examples of distance metrics that may be used for matching-score computation include the Euclidean, city-block, chi-squared, cosine, and Minkowski distances. Assuming that f0 and f1 are n-dimensional vectors such that f0 = (x_{0,1}, x_{0,2}, x_{0,3}, ..., x_{0,n}) and f1 = (x_{1,1}, x_{1,2}, x_{1,3}, ..., x_{1,n}) , the distance d (f0, f1) between them may be described according to these distance metrics as follows:

Euclidean distance: $d(f_0, f_1) = \sqrt{\sum_{i=1}^{n} (x_{0,i} - x_{1,i})^2}$

City-block distance: $d(f_0, f_1) = \sum_{i=1}^{n} |x_{0,i} - x_{1,i}|$

Cosine distance: $d(f_0, f_1) = 1 - \dfrac{\sum_{i=1}^{n} x_{0,i}\, x_{1,i}}{\|f_0\|\,\|f_1\|}$

where $\|f_0\| = \sqrt{\sum_{i=1}^{n} x_{0,i}^2}$ and $\|f_1\| = \sqrt{\sum_{i=1}^{n} x_{1,i}^2}$

Chi-squared distance (assuming that the values of all elements of f0 and f1 are larger than zero) :

$d(f_0, f_1) = \sum_{i=1}^{n} \dfrac{(x_{0,i} - x_{1,i})^2}{x_{0,i} + x_{1,i}}$

Minkowski distance (also called generalized Euclidean distance) :

$d(f_0, f_1) = \left( \sum_{i=1}^{n} |x_{0,i} - x_{1,i}|^p \right)^{1/p}$

where p is the order of the distance (p = 1 gives the city-block distance and p = 2 gives the Euclidean distance) .
Thresholding of the resulting scores to determine whether the corresponding features match may be performed according to a procedure such as the following:
$d(f_0, f_1) < T \;\Rightarrow\; f_0 \text{ and } f_1 \text{ match}; \qquad d(f_0, f_1) \geq T \;\Rightarrow\; f_0 \text{ and } f_1 \text{ do not match},$
where T denotes a threshold value.
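For illustration only, the following sketch (Python with NumPy) computes the matching score according to the distance metrics above and applies the threshold test just described. The function names and the example threshold value are illustrative assumptions, not part of any described embodiment.

```python
# Minimal sketch of matching-score computation and thresholding between two descriptors.
import numpy as np

def euclidean(f0, f1):
    return np.sqrt(np.sum((f0 - f1) ** 2))

def city_block(f0, f1):
    return np.sum(np.abs(f0 - f1))

def cosine(f0, f1):
    return 1.0 - np.dot(f0, f1) / (np.linalg.norm(f0) * np.linalg.norm(f1))

def chi_squared(f0, f1):
    # assumes all elements of f0 and f1 are larger than zero
    return np.sum((f0 - f1) ** 2 / (f0 + f1))

def minkowski(f0, f1, p=3):
    return np.sum(np.abs(f0 - f1) ** p) ** (1.0 / p)

def is_match(f0, f1, distance=euclidean, T=0.7):
    """The pair is declared a match if the distance is below the threshold T."""
    return distance(f0, f1) < T
```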
In a typical application of feature matching, a process of keypoint descriptor matching as described above is repeated, for multiple different pairs (f0, f1) of keypoint descriptors, over consecutive pairs of frames of a video sequence (e.g., on each consecutive pair of frames) . In one example, for each of a plurality of keypoints detected in the first image (each having a corresponding location in the first image) , the process of keypoint descriptor matching is repeated for each pair that comprises the descriptor of the keypoint in the first image and the descriptor of a keypoint that is within a threshold distance (e.g., twenty pixels) of the same location in the second image.
Keypoint descriptor matching is not limited to any particular size or format of the source images (i.e., first image I0 and second image I1) , but examples from typical feature matching applications are now provided. Most current AR/VR devices are configured to capture video in VGA format (i.e., having a frame size of 640 x 480 pixels) , with each pixel having a red, green, and blue component. The largest frame format typically seen in such devices is 1280 x 720 pixels, such that a maximum size of each of the first and second images in a typical application is about one thousand by two thousand pixels. The minimum size of each of the first and second images in a typical application is about one-quarter VGA (i.e., 320 x 240 pixels) , as a smaller image size would likely not be enough to support an algorithm such as SLAM.
Calculation of a keypoint descriptor is typically implemented using an existing keypoint descriptor framework, such as Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , Binary Robust Invariant Scalable Keypoints (BRISK) , etc. Such a task may comprise calculating an orientation of the keypoint, which may include determining how, or in what direction, a pixel neighborhood (also called an “image patch” ) that surrounds the keypoint is oriented (e.g., by detecting the most dominant orientation of the gradient angles in the patch) ; this calculation is typically performed on the patch at different scales of a scale space. The SIFT framework, for example, assigns a 128-dimensional feature vector to each keypoint based on the gradient orientations of pixels in sixteen local neighborhoods of the keypoint. Some keypoint descriptor frameworks (e.g., SIFT and SURF) include both keypoint detection and keypoint descriptor calculation. Other keypoint descriptor frameworks (e.g., Binary Robust Independent Elementary Features (BRIEF) ) include keypoint descriptor calculation, but not keypoint detection.
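As a non-limiting illustration, keypoint detection and descriptor calculation with an existing framework such as SIFT may be performed as in the following sketch, which assumes the OpenCV implementation of SIFT; the file name and the keypoint cap are placeholders.

```python
# Sketch: keypoint detection and descriptor calculation with an existing framework (SIFT).
import cv2

image = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
sift = cv2.SIFT_create(nfeatures=500)                         # cap the keypoints per frame
keypoints, descriptors = sift.detectAndCompute(image, None)
# 'descriptors' is an N x 128 array: one 128-dimensional vector per detected keypoint,
# built from gradient orientations in local neighborhoods of the keypoint.
print(len(keypoints), None if descriptors is None else descriptors.shape)
```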
Examples of AR/VR devices include mobile phones and head-mounted devices (e.g., AR or “smart” glasses) . Given the nature of AR/VR devices, many video frames are captured when the image sensor (e.g., a video camera) is moving. As a result, the captured frames may have significant motion blur. If the image sensor’s motion during capture of image I0 is the same as the image sensor’s motion during capture of image I1, and each of the descriptors f0 and f1 correspond to the same keypoint in the two images, then the values of the descriptors f0 and f1 tend to be similar such that the computed distance between them is small. However, in practical applications of AR/VR, the image sensor would typically experience different motions when capturing each image, so that the descriptors f0 and f1 could be distorted by different motion blurs. Unfortunately, almost all widely used image features (e.g., SIFT, SURF, BRISK) are very sensitive to motion blur, such that any significant motion blur (e.g., a blur of five pixels or more) is likely to cause the keypoint descriptors to be distorted. As a result, the values of the descriptors f0 and f1 may be very different, even if the descriptors correspond to the same keypoint. When such features are extracted from images with different motion blurs and a scoring metric as described above is used, feature matching is likely to fail.
For many applications in which the image sensor may be in motion (e.g., AR/VR applications) , it is possible to quantify the image motion during each capture interval. Motion direction and magnitude can be estimated from input of one or more motion sensors of the device, for example, and/or can be calculated from two temporal neighbor frames in the video sequence. Motion sensors (which may include one or more gyroscopes, accelerometers, and/or magnetometers) may indicate a displacement and/or change in orientation of the device and may be implemented within an inertial measurement unit (IMU) .
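For illustration, the following sketch outlines two simplified ways of quantifying image motion during a capture interval: from two temporal neighbor frames (using phase correlation) and from gyroscope samples under a rough small-angle approximation. The function names, the scaling, and the IMU handling are illustrative assumptions, not a prescribed estimation method.

```python
# Sketch: quantifying image motion from neighbor frames or from motion-sensor output.
import cv2
import numpy as np

def motion_from_neighbor_frames(prev_frame, next_frame):
    """Estimate a global 2D image translation between two grayscale neighbor frames."""
    a = np.float32(prev_frame)
    b = np.float32(next_frame)
    (dx, dy), _response = cv2.phaseCorrelate(a, b)
    return np.array([dx, dy])

def motion_from_gyro(gyro_samples, exposure_s, focal_length_px):
    """Very rough small-angle approximation: mean angular rate (rad/s) about x and y,
    times the exposure time, scaled by the focal length, gives a blur vector in pixels."""
    wx, wy = np.mean(np.asarray(gyro_samples), axis=0)[:2]
    return np.array([wy * exposure_s * focal_length_px,
                     wx * exposure_s * focal_length_px])
```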
Examples of techniques for coping with motion blur may include the following:
1) Don’t Match: Since the matching of image features in blurred images becomes so unreliable, one potential solution is to give up image feature matching completely, at least between pairs of images that have significant and different motion blurs.
2) Deblur Image First: Before the image is used for feature extraction, a deblurring operation is performed to remove the motion blur from the image.
3) Compensate for Motion Blur When Calculating Keypoint Descriptors: An estimate of the image motion is used to compensate for the motion blur’s impact when calculating the keypoint descriptor. This method differs from “Deblur Image First” in that the motion blur removal or compensation is performed on the neighborhood of the keypoint rather than on the whole image.
4) Extract Blur-Invariant Feature: When calculating keypoint descriptors from the neighborhoods of the keypoints, one only uses components that are blur-invariant and ignores those that are sensitive to motion blur. Therefore, the keypoint descriptor will remain roughly the same even when the image is motion-blurred.
5) Motion Blur Robust Image Feature Matching: In many feature matching applications (e.g., in most SLAM applications) , the motion blur can be quantified. Therefore, a deep learning based motion-blur converter may be designed to simulate the descriptor distortion caused by motion blur. Before feature matching, at least one of the descriptors to be matched is converted, thereby making sure that the input descriptors include the same motion blur influences. In another example, a deep learning based motion-blur-aware descriptor comparing module may be designed to determine whether input features match or not, given the known image motions.
Shortcomings of the above approaches may include the following:
1) Don’t Match: To prevent the false matchings of features in blurred images from degrading the estimations in SLAM, one may choose not to do feature matching at all when significant image motion has been detected. For example, one may choose to perform SLAM using only motion sensor output at these times. However, this approach may cause the image sensor output at such moments to be completely wasted, and the SLAM calculation may in turn become less accurate and less stable.
2) Deblur Image First: Image deblurring usually involves significant computation. Since the SLAM computations are commonly carried out on a mobile platform, the additional computation required for deblurring processing may not be always available or affordable. Moreover, the deblurring operation tends to add new artifacts to the original image, which in turn are likely to negatively impact the image feature matching accuracy.
3) Compensate for Motion Blur When Calculating Keypoint Descriptor: Compensating for motion blur when calculating a keypoint descriptor typically requires less extra computation than deblurring the entire image, since the computation only involves the neighborhood of the keypoint. However, the drawback of introducing new artifacts into the image still exists.
4) Extract Blur-Invariant Feature: Since this method ignores components that are sensitive to motion blur, less information is available to perform feature matching. In other words, this approach may increase matching accuracy and stability for cases of large camera motion at the cost of reducing matching performance for other cases (e.g., cases in which motion is not obvious) .
5) Motion Blur Robust Image Feature Matching: These proposed solutions introduce improvements at the stage of keypoint descriptor matching. However, since traditional keypoint descriptors are designed without considering the motion the camera is undergoing, they are readily distorted by motion blur. As a result, the matching performance is still limited. Such an approach takes image motion into consideration at the stage of match scoring in order to solve the matching failure discussed above. To reach the full potential of an improved scoring scheme, however, it may be desired to design a new descriptor that is less sensitive to motion blur than traditional ones.
It may be desirable to increase a robustness of keypoint description to motion blur. Accordingly, embodiments described herein, implemented using appropriate systems, methods, apparatus, devices, and the like, as disclosed herein, may support increased accuracy of feature matching operations in applications that are prone to motion blur. The embodiments described herein can be implemented in any of a variety of applications that use feature matching, including image alignment (e.g., image stitching, image registration, panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, endoscopic imaging, motion tracking, object tracking, object recognition, automated navigation, SLAM, etc.
According to embodiments of the present invention, a deep learning network is trained and used to generate a keypoint descriptor that is robust to motion blur, to address the challenge of image feature matching when different motion blurs are present in camera frames. Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications. Moreover, embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions. As a result, SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation. Thus, embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions.
FIG. 1 shows a simplified flowchart of a method of generating a keypoint descriptor that is robust to motion blur according to an embodiment of the present invention. The method of generating a keypoint descriptor that is robust to motion blur 100 illustrated in FIG. 1 includes  tasks  110, 120, and 130. Task 110 selects a corresponding plurality of keypoints in an image. For each of a plurality of motion blurs that are different from each other, task 120 applies each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images. Based on neighborhoods of the keypoints in each of the plurality of  blurred images and at each of a plurality of different scales, task 130 trains an artificial neural network (ANN) to generate a keypoint descriptor, wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same feature in different ones of the plurality of blurred images.
It should be appreciated that the specific steps illustrated in FIG. 1 provide a particular method of generating a keypoint descriptor that is robust to motion blur according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 1 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
A keypoint is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) .
FIG. 2A shows an example of an image (e.g., a frame of a video sequence) according to an embodiment of the present invention. FIG. 2B shows examples of keypoints in the image illustrated in FIG. 2A according to an embodiment of the present invention. Referring to FIG. 2B, the circles shown in FIG. 2B indicate the locations of a few examples of keypoints 210 -220 in the image. The number of keypoints detected in each image in a typical feature matching application is at least a dozen, twenty-five, or fifty and may range up to one hundred, two hundred, or five hundred or more.
Referring to FIG. 1, task 110 selects a corresponding plurality of keypoints in an image. Examples of keypoint detectors that may be used to implement task 110 include corner detectors (e.g., Harris corner detector, Features from Accelerated Segment Test (FAST) ) and blob detectors (e.g., Laplacian of Gaussian (LoG) , Difference of Gaussians (DoG) , Determinant of Hessian (DoH) ) . Such a keypoint detector may be configured to blur (e.g., Gaussian blur) and resample the image with different blur widths and at different resolutions to create a scale space, and to detect corners and/or blobs at different scales. For example, task 110 may include downsampling the original images to create versions at different resolutions. For a case in which the original resolution of an image is 640 x 480 pixels, for example, it may be desired to downsample to obtain the same image at resolutions of 320 x 240 pixels and 160 x 120 pixels, etc. A lower resolution allows coverage of a larger scene area for the same size of neighborhood window.
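For illustration, such downsampling may be sketched as follows, assuming OpenCV; the number of pyramid levels is an example only.

```python
# Sketch: multi-resolution versions of an image (e.g., 640x480 -> 320x240 -> 160x120),
# so that the same neighborhood window covers a larger scene area at lower resolutions.
import cv2

def build_pyramid(image, levels=3):
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))   # blur and halve each dimension
    return pyramid
```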
Referring again to FIG. 1, for each of a plurality of motion blurs that are different from each other, task 120 applies each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images. It may be desired for the training data to be adequate for the trained ANN to encapsulate complex logic and computation. Image motion blur can be simulated by image processing operations (such as directional filtering, etc. ) , and one or more such operations may be used to produce a large amount of synthetic training data.
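As one illustrative way of producing such synthetic data, a linear motion blur may be simulated by directional filtering, as in the following sketch; the kernel length and angle values are examples only.

```python
# Sketch: simulating linear motion blur by directional filtering.
import cv2
import numpy as np

def linear_motion_blur_kernel(length=9, angle_deg=30.0):
    """Build a normalized one-pixel-wide line kernel of the given length and direction."""
    k = np.zeros((length, length), dtype=np.float32)
    k[length // 2, :] = 1.0                                   # horizontal line through center
    center = ((length - 1) / 2.0, (length - 1) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    k = cv2.warpAffine(k, rot, (length, length))
    return k / max(k.sum(), 1e-6)

def apply_motion_blur(image, length, angle_deg):
    return cv2.filter2D(image, -1, linear_motion_blur_kernel(length, angle_deg))

# Example: several blurred copies of one training image
# blurred = [apply_motion_blur(img, L, a) for (L, a) in [(5, 0), (9, 45), (15, 90)]]
```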
Each of the plurality of motion blurs may be implemented as a motion descriptor that describes a trajectory or path in a coordinate space of one, two, or three spatial dimensions. In one example, such a trajectory may be described as a sequence of positions in the two-dimensional image plane, and each position may be expressed as a motion vector relative to the previous sampled position (e.g., taking the position of the keypoint to be the starting position) . In another example, such a trajectory may be described as a sequence of one or more positions of the image sensor that captures the image, sampled at uniform intervals during the corresponding capture period, and each sampled position may be expressed as a motion vector relative to the previous sampled position (e.g., taking the position of the image sensor at the start of the capture period to be the origin of the coordinate space) .
It may be desired to calculate the motion blurs based on data collected during an actual image capture: for example, data collected during an actual instance of a feature matching application for which the  descriptor generated by method 100 may be applied. For example, the motion blurs may be calculated from motion sensor (e.g., IMU) outputs and/or from neighboring frames in a video sequence (e.g., from the frame captured just before and the frame captured just after the frame for which the motion descriptor is being calculated) . The capture period for each frame in a video sequence is typically the reciprocal of the frame rate, although it is possible for the capture period to be shorter. The frame rate for a typical video sequence (e.g., as captured by an Android phone) is thirty frames per second (fps) . The frame rate for an iPhone or head-mounted device can be as high as 120 fps.
Each motion descriptor may be further implemented to describe a motion in six degrees of freedom (6DOF) . In addition to three spatial dimensions, such a motion may include a rotation about each of one or more of the axes of these dimensions. As shown in FIG. 3, which shows an illustration of six degrees of freedom (6DOF) , these rotations may be labeled as tilt, pitch, and yaw. For example, the motion descriptor may include, for each sampled position of the image sensor, a corresponding orientation of a reference direction of the image sensor (e.g., the look direction) relative to the orientation at the previous sampled position (e.g., taking the reference orientation to be the orientation at the start of the capture period) .
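For illustration only, one possible (non-limiting) encoding of such a motion descriptor is sketched below as a fixed number of relative 6DOF samples over the capture period, flattened into a vector so it can be fed to a network; the sample count of eight is an assumption.

```python
# Sketch: a motion descriptor as a sequence of relative 6DOF samples.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MotionSample:
    dx: float          # translation relative to the previous sample
    dy: float
    dz: float
    d_tilt: float      # rotation relative to the previous sample
    d_pitch: float
    d_yaw: float

@dataclass
class MotionDescriptor:
    samples: List[MotionSample]   # e.g., 8 samples per capture period (assumption)

    def to_vector(self):
        return np.array([[s.dx, s.dy, s.dz, s.d_tilt, s.d_pitch, s.d_yaw]
                         for s in self.samples], dtype=np.float32).ravel()
```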
Referring once again to FIG. 1, task 130 trains an artificial neural network (ANN) to generate a keypoint descriptor. The training performed in task 130 is based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales. For a case in which task 110 includes blurring and resampling the image with different blur widths and at different resolutions to create a scale space, the neighborhoods of the keypoints may be obtained from the same created scale space. Alternatively (or in case a different scale space is desired) , task 130 may include blurring (e.g., Gaussian blurring) and resampling the image with different blur widths and at different resolutions to create a scale space.
In one example, the input to the ANN is the pixel values of the neighborhood window of a keypoint at different resolutions. The actual window size at each resolution may be chosen with the considerations of matching and computational performance. A larger window typically covers a larger neighborhood area, but also makes the descriptor generating network larger. As a result, more computation may be required to generate the corresponding descriptor.
The output of the ANN is a multi-element vector, which is the generated descriptor. The actual length of the descriptor may be determined according to considerations of matching and computational performance, as well as memory usage and memory access overhead when saving and retrieving a large number of descriptors. While a longer descriptor typically encapsulates more neighborhood information, it may also make the comparing network in keypoint descriptor comparing module 600 (as described below) larger. As a result, more computation may be consumed in matching the descriptors. Additionally, more memory and/or more data bandwidth may be required to save and retrieve longer descriptors. When the number of descriptors is large, such an increase in memory and/or bandwidth overhead may be significant.
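For illustration, a descriptor-generating network of this kind may be sketched as follows (in PyTorch). The layer sizes, window size, number of scales, and descriptor length are examples only and not a prescribed architecture.

```python
# Sketch: a network that maps a keypoint's multi-scale neighborhood to a descriptor vector.
import torch
import torch.nn as nn

class DescriptorNet(nn.Module):
    def __init__(self, num_scales=3, window=32, descriptor_len=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_scales, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 16 -> 8
        )
        self.head = nn.Linear(64 * (window // 4) ** 2, descriptor_len)

    def forward(self, patches):                    # patches: (batch, num_scales, window, window)
        x = self.features(patches)
        return self.head(x.flatten(1))             # the generated multi-element descriptor

# descriptor = DescriptorNet()(torch.randn(1, 3, 32, 32))   # -> shape (1, 128)
```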
FIG. 4 shows a simplified flowchart illustrating a method of generating training data for task 130 according to an embodiment of the present invention. For each of a plurality of training images 420, a plurality of keypoints are detected 422 in the training image (e.g., as described herein with reference to task 110) . A plurality of motion blurs M1 to Mn are also applied 424-1 to 424-n to the training image 420 to produce a corresponding plurality of blurred images B1 to Bn. From each of the plurality of blurred images B1 to Bn, a neighborhood is extracted for each of the detected keypoints and at each of the plurality of different scales to produce a corresponding set of neighborhoods S1 to Sn. It is possible for the number of neighborhoods to differ from one set to another: for example, a neighborhood of a keypoint may be omitted from a set if the corresponding motion blur caused the neighborhood to include an area beyond the boundaries of the image.
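For illustration, this training-data generation may be sketched as follows, assuming OpenCV and reusing the apply_motion_blur sketch above; the detector, scale factors, and window size are examples only.

```python
# Sketch: FIG. 4 data generation - detect keypoints, apply each blur, extract multi-scale patches.
import cv2
import numpy as np

def extract_patch(image, x, y, window=32, scales=(1.0, 0.5, 0.25)):
    """Stack the window x window neighborhood of (x, y) at several resolutions;
    returns None if the neighborhood falls outside the image at any scale."""
    channels = []
    for s in scales:
        scaled = cv2.resize(image, None, fx=s, fy=s) if s != 1.0 else image
        cx, cy, half = int(x * s), int(y * s), window // 2
        if cx - half < 0 or cy - half < 0 or cx + half > scaled.shape[1] or cy + half > scaled.shape[0]:
            return None
        channels.append(scaled[cy - half:cy + half, cx - half:cx + half])
    return np.stack(channels)

def build_training_set(training_images, motion_blurs):
    fast = cv2.FastFeatureDetector_create()
    samples = []
    for image in training_images:
        keypoints = [(kp.pt[0], kp.pt[1]) for kp in fast.detect(image, None)]
        for j, (length, angle) in enumerate(motion_blurs):
            blurred = apply_motion_blur(image, length, angle)   # B_j (sketched earlier)
            for i, (x, y) in enumerate(keypoints):
                patch = extract_patch(blurred, x, y)
                if patch is not None:                           # skip out-of-bounds neighborhoods
                    samples.append({"keypoint": i, "blur": j, "patch": patch})
    return samples
```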
A criterion of the training performed in task 130 is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images. Examples of a distance measure that may be used in task 130 include Euclidean distance, chi-squared distance, etc.
It should be appreciated that the specific steps illustrated in FIG. 4 provide a particular method of generating training data according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 5 shows a simplified flowchart illustrating an example of task 130 of training an artificial neural network (ANN) 500 to generate a keypoint descriptor according to an embodiment of the present invention. At a first time, the ANN 500 being trained receives input, from the training data, of a neighborhood of a keypoint i from a blurred image Bj (i.e., from set Sj) and generates a descriptor Dij as output. At a second time, the ANN 500 being trained receives input, from the training data, of a neighborhood of the keypoint i from a different blurred image Bk (i.e., from set Sk) and generates a descriptor Dik as output.
Loss function selector 510 compares the motion blurs Mj and Mk that correspond to the two generated descriptors Dij and Dik. If selector 510 determines that the motion blurs are similar (e.g., that a distance between the motion blurs does not exceed a threshold value) , then a first loss function 512 is selected. In one example, the first loss function 512 is to minimize a distance between the generated descriptors (e.g., Euclidean distance, chi-squared distance, etc. ) .
If selector 510 determines instead that the motion blurs are different (e.g., that the distance between the motion blurs exceeds a threshold value) , then a second loss function 514 is selected. In one example, the second loss function 514 is to maximize the matching probability as indicated by an ANN-based keypoint descriptor comparing module 600. Although it is possible to omit the first loss function 512 and use only the second loss function 514, using a distance-based loss function 512 for cases in which the keypoint descriptors f0 and f1 come from images having similar motion blurs (e.g., when camera motion is slow) may be expected to reduce computational requirements. Inferring by a deep learning network usually involves much more computation than calculating a distance between two keypoint descriptors.
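For illustration, the loss selection of FIG. 5 may be sketched as follows; the similarity threshold, the form of the matching-probability loss, and the module interfaces are assumptions for illustration only.

```python
# Sketch: choose between the distance-based loss (512) and the matching-probability loss (514)
# for a pair of patches of the same keypoint i taken from blurred images B_j and B_k.
import torch

def training_loss(descriptor_net, comparing_net, patch_j, patch_k, m_j, m_k, blur_threshold=2.0):
    d_ij = descriptor_net(patch_j)                         # descriptor D_ij
    d_ik = descriptor_net(patch_k)                         # descriptor D_ik
    if torch.norm(m_j - m_k) <= blur_threshold:
        # first loss function: minimize the distance between the two descriptors
        return torch.norm(d_ij - d_ik)
    # second loss function: maximize the matching probability from the comparing module
    p_match = comparing_net(d_ij, d_ik, m_j, m_k)
    return -torch.log(p_match + 1e-8)
```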
It should be appreciated that the specific steps illustrated in FIG. 5 provide a particular method of training an ANN to generate a keypoint descriptor according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 5 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 6 shows a simplified flowchart illustrating an operation of ANN-based keypoint descriptor comparing module 600 according to an embodiment of the present invention. As illustrated in FIG. 6, the keypoint descriptor comparing module 600 is used to determine whether two keypoint descriptors match, even if the features belong to two images with different image motions. As shown in FIG. 6, keypoint descriptor comparing module 600 receives four inputs, which include the two keypoint descriptors being compared (denoted as f0 and f1) and the corresponding motion blurs (denoted as M0 and M1) . It may be desired to normalize the values of the motion blurs to occupy the same range as the values of the generated keypoint descriptors before input to the ANN. The output is a binary decision 610 (i.e., whether or not the keypoint descriptors f0 and f1 match) as well as a value P (e.g., a probability value) 620 denoting the confidence of the output binary decision.
For reasons similar to those stated above, a deep learning network is trained to produce the match indication and confidence value outputs of the keypoint descriptor comparing module 600. In this case, the network may be implemented as a classifier network as known in the field of deep learning. Given adequate training data, a classifier usually produces a good output. Training of such a network (e.g., a CNN) may be performed using training data obtained as described above (e.g., with reference to FIG. 4) . It may be desired to augment the synthetic training data with descriptors calculated from a relatively small number of images that have real motion blur and are manually annotated by human annotators.
It should be appreciated that the specific steps illustrated in FIG. 6 provide for particular operation of an ANN-based keypoint descriptor comparing module according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 6 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 7 shows a simplified flowchart illustrating a method of training an ANN 700 of keypoint descriptor comparing module 600 according to an embodiment of the present invention. ANN 700 may be implemented as a binary classifier, such that the output of the ANN indicates a probability that the keypoint descriptor inputs represent the same feature in the corresponding images. The output layer of the ANN may be configured to apply a sigmoid or softmax function, for example, and the loss function may be implemented as a binary cross-entropy function, with an indication of whether the training inputs represent the same feature providing the ground truth for the loss function during training of the ANN. In one example, the ANN is implemented such that the input layer and the output layer are each arrays of size 32 x 32. As both of ANN 500 and ANN 700 are deep-learning based, the parameters in the two networks may be cross-optimized. For example, training of the two networks can be combined. In one such example, the training process is started using one or more traditional keypoint descriptor frameworks (e.g., SIFT, SURF, BRISK) .
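For illustration, such a binary classifier may be sketched as follows (in PyTorch). The layer sizes and the motion-descriptor length are examples only (a length of 48 would match the eight-sample 6DOF encoding sketched earlier), and the sigmoid output with binary cross-entropy follows the options described above.

```python
# Sketch: the comparing module's classifier (ANN 700) - descriptors f0, f1 and normalized
# motion descriptors m0, m1 are concatenated and mapped to a single match probability.
import torch
import torch.nn as nn

class ComparingNet(nn.Module):
    def __init__(self, descriptor_len=128, motion_len=48, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * descriptor_len + 2 * motion_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, f0, f1, m0, m1):
        return self.net(torch.cat([f0, f1, m0, m1], dim=-1)).squeeze(-1)   # confidence P

# Training step (ground truth y = 1.0 if f0 and f1 describe the same feature, else 0.0):
# loss = nn.BCELoss()(ComparingNet()(f0, f1, m0, m1), y)
```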
In addition to the matching-accuracy-based training criteria described above for generating matching descriptors that correspond to the same keypoint and have different motion blurs (i.e., with reference to loss functions 512 and 514) , it may be desired to train ANN 700 to generate distinctive descriptors for different keypoints. For example, it may be desired to implement task 130 to include, for cases in which the generated descriptors correspond to different keypoints, a corresponding loss function to approximate a reference distance between descriptors calculated from those keypoints in the corresponding (unblurred) training image. The reference distance may be calculated, for example, using an existing keypoint descriptor framework, such as SIFT, SURF, or BRISK.
It should be appreciated that the specific steps illustrated in FIG. 7 provide a particular method of training an ANN of a keypoint descriptor comparing module according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 7 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 8A shows one such example of three different training criteria to be applied for matching neighborhoods that, on one hand, are blurred by similar or different motion blurs and, on the other hand, are neighborhoods of the same keypoint or of different keypoints. As shown in FIG. 8B, it may be desired to further modify task 130 to train ANN 700 only on training data from images having significant motion blur. A generated descriptor resulting from such training criteria may be used for feature matching in combination with a calculated descriptor framework (e.g., SIFT, SURF, or BRISK) , for example, such that the generated descriptor is used in the presence of significant motion blur and the calculated descriptor is used otherwise.
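A hedged sketch of such a hybrid scheme follows; the blur-magnitude measure, the threshold value, and the function ann_describe (standing in for trained ANN 500) are assumptions, and SIFT is used only as one example of a calculated descriptor framework.

```python
# Use the ANN-generated descriptor only when motion blur is significant; otherwise
# fall back to a calculated descriptor. Threshold and blur measure are assumptions.
import cv2
import numpy as np

BLUR_THRESHOLD = 5.0  # assumed units, e.g. pixels of image motion during exposure

def describe_keypoints(image, keypoints, motion_descriptor, ann_describe):
    blur_magnitude = np.linalg.norm(motion_descriptor)
    if blur_magnitude >= BLUR_THRESHOLD:
        # Significant motion blur: use the descriptor generated by the trained ANN.
        return ann_describe(image, keypoints)
    # Little or no blur: a conventional calculated descriptor suffices.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(image, keypoints)
    return descriptors
```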
The trained ANN 500 may be used to generate keypoint descriptors (e.g., instead of a conventional calculated descriptor framework, such as SIFT, SURF, BRISK, etc. ) for feature matching as described herein. In a typical production environment, a copy of the trained ANN 500 is stored (e.g., during manufacture and/or  provisioning) to each of a run of devices having the same model of video camera (and, possibly, the same model of IMU) .
FIG. 9 shows a simplified block diagram of an apparatus according to an embodiment of the present invention. As an example, the apparatus 900 illustrated in FIG. 9 can be utilized for keypoint descriptor generation on a mobile device (e.g., a cellular telephone, such as a smartphone, or a head-mounted device) or other computing device or system according to a general configuration that includes keypoint selector 910, motion blur applier 920, and ANN trainer 930.
Keypoint selector 910 is configured to select a corresponding plurality of keypoints in an image (e.g., as described herein with reference to task 110) . Motion blur applier 920 is configured to apply each of a plurality of motion blurs that are different from each other to the image to generate a plurality of blurred images (e.g., as described herein with reference to task 120) . ANN trainer 930 is configured to train an artificial neural network (ANN) to generate a keypoint descriptor, based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales (e.g., as described herein with reference to task 130) , wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same feature in different ones of the plurality of blurred images.
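The following sketch illustrates, under stated assumptions, the kind of processing performed by keypoint selector 910 and motion blur applier 920: corner detection on the sharp image followed by convolution with several synthetic motion-blur kernels. The straight-line kernels and the image path are illustrative; the disclosure also contemplates more general two-dimensional and 6DOF motion trajectories.

```python
# Generate blurred training copies of an image and select keypoints in the sharp original.
import cv2
import numpy as np

def linear_motion_kernel(length, angle_deg):
    # A normalized straight-line blur kernel, rotated to the given angle.
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0 / length
    rot = cv2.getRotationMatrix2D((length / 2, length / 2), angle_deg, 1.0)
    return cv2.warpAffine(kernel, rot, (length, length))

image = cv2.imread("training_image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path

# Keypoint selector 910: e.g., FAST corners detected on the unblurred image.
detector = cv2.FastFeatureDetector_create()
keypoints = detector.detect(image)

# Motion blur applier 920: one blurred copy of the image per synthetic motion blur.
blurred_images = [
    cv2.filter2D(image, -1, linear_motion_kernel(length, angle))
    for length, angle in [(9, 0), (15, 30), (21, 75), (31, 120)]
]
```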
In one example, apparatus 900 is implemented within a device such as a mobile phone, which typically has a video camera configured to produce a sequence of frames that include the first and second images. The device may also include one or more motion sensors, which may be configured to determine 6DOF motion of the device in space. In another example, apparatus 900 is implemented within a head-mounted device, such as a set of AR glasses, which may also have motion sensors and one or more cameras. Additionally or alternatively, such a device may be used to obtain the training images (and possibly the motion blur data) and/or to apply the generated descriptor for feature matching.
According to further embodiments of the present invention, a deep learning network may be used to function as a keypoint descriptor motion blur converter, or as a keypoint descriptor comparing module, to address the challenge of image feature matching when different motion blurs are present in camera frames. For example, a descriptor generated according to a method as described herein may be implemented together with the descriptor converting and/or comparing methods described in U.S. Provisional Patent Application No. 62/978,462 (Attorney Docket No. 105184-1166208-002300US) to further increase feature matching accuracy when features are from camera frames that are captured with different image motions. When the camera is undergoing little or no motion, the proposed solution allows high matching accuracy to be obtained using a matching scoring process that is significantly simplified, and in turn greatly reduces computation, as compared with cases that have significant camera motion.
FIG. 10 shows a simplified schematic diagram of a keypoint descriptor converter 1000 according to an embodiment of the present invention. As illustrated in FIG. 10, the conversion is performed on a keypoint descriptor f0 that is generated by trained ANN 500 from the neighborhood of a keypoint in image I0. The descriptor converter takes three inputs: the keypoint descriptor f0; a motion descriptor M0 of the image motion when image I0 was captured; and a motion descriptor M1 of the image motion when image I1 was captured. The output of the converter is a converted keypoint descriptor f1′. The design goal is for the converted keypoint descriptor f1′ to be similar to a keypoint descriptor f1 that is generated by trained ANN 500 from the neighborhood of a keypoint in image I1, if the descriptors f0 and f1 correspond to the same keypoint, and for the converted descriptor f1′ and the descriptor f1 to be very different if the descriptors f0 and f1 refer to different keypoints. As described above, the first and second motion descriptors describe a motion of the image sensor during capture of the first and second images, respectively, and may be calculated from motion sensor (e.g., IMU) outputs and/or from neighboring frames in a video sequence (e.g., from the frame captured just before and the frame captured just after the frame for which the motion descriptor is being calculated).
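A minimal sketch of such a converter, assuming PyTorch, is shown below; the descriptor and motion-descriptor dimensions and the layer widths are assumptions for illustration only.

```python
# A small network mapping (f0, M0, M1) to a converted descriptor f1'.
import torch
import torch.nn as nn

class DescriptorConverter(nn.Module):
    def __init__(self, desc_dim=128, motion_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(desc_dim + 2 * motion_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, desc_dim),
        )

    def forward(self, f0, m0, m1):
        # The converted descriptor f1' should approximate the descriptor that
        # trained ANN 500 would generate for the same keypoint under motion M1.
        return self.net(torch.cat([f0, m0, m1], dim=-1))
```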
FIG. 11 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention. The method of image processing 1100 illustrated in FIG. 11 includes  tasks  1110, 1120, 1130, and 1140. Task 1110 uses trained ANN 500 to generate a descriptor for a keypoint in a first  image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period. Task 1120 uses trained ANN 500 to generate a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period. Task 1130 uses a trained artificial neural network (ANN) to convert the generated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor. Task 1140 compares the converted descriptor to the generated descriptor for the keypoint in the second image. For example, task 1140 may include calculating a distance between the converted descriptor and the generated descriptor in the descriptor space (e.g., Euclidean distance, chi-squared distance, etc. ) and comparing the calculated distance to a threshold value.
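Task 1140 may be illustrated with the following NumPy sketch; the threshold value and the epsilon term in the chi-squared distance are assumptions.

```python
# Compare the converted descriptor to the generated descriptor in descriptor space.
import numpy as np

def chi_squared_distance(a, b, eps=1e-8):
    # Assumes non-negative descriptor values (e.g., histogram-like descriptors).
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def descriptors_match(converted, generated, threshold=0.5, metric="euclidean"):
    if metric == "euclidean":
        distance = np.linalg.norm(converted - generated)
    else:
        distance = chi_squared_distance(converted, generated)
    # The keypoints are declared a match when the distance falls below the threshold.
    return distance < threshold
```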
It should be appreciated that the specific steps illustrated in FIG. 11 provide a particular method of performing image processing according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 11 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 12 shows a simplified flowchart of a method of performing image processing according to another embodiment of the present invention. As illustrated in FIG. 12, elements utilized in method 1100 are also utilized in method 1200 illustrated in FIG. 12, as well as additional elements as described below. Accordingly, the description provided in relation to FIG. 11 is applicable to FIG. 12 as appropriate. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
Referring to FIG. 12, task 1210 selects the keypoint in the first image and task 1220 selects the keypoint in the second image. Examples of keypoint detectors that may be used to implement  tasks  1210 and 1220 include corner detectors (e.g., Harris corner detector, Features from Accelerated Segment Test (FAST) ) and blob detectors (e.g., Laplacian of Gaussian (LoG) , Difference of Gaussians (DoG) , Determinant of Hessian (DoH) ) . Such a keypoint detector is typically configured to blur and resample the image with different blur widths and sampling rates to create a scale space, and to detect corners and/or blobs at different scales.
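For illustration, keypoint detection for tasks 1210 and 1220 might be performed with OpenCV as sketched below; the parameter values and the image path are assumptions, and SIFT's detector stage is used here only as one example of a DoG-based blob detector.

```python
# Corner- and blob-style keypoint detection with OpenCV.
import cv2

image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path

# Blob-style detection across scales (difference of Gaussians, via SIFT's detector stage,
# which builds the scale space internally).
sift = cv2.SIFT_create(nfeatures=500)
dog_keypoints = sift.detect(image, None)

# Corner-style detection (FAST) on the full-resolution image.
fast = cv2.FastFeatureDetector_create(threshold=25)
fast_keypoints = fast.detect(image, None)
```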
Referring once again to FIG. 12, task 1130 uses a trained artificial neural network (ANN) to convert the generated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor. Examples of ANNs that can be trained to perform such a complex conversion between multi-element vectors include convolutional neural networks (CNNs) and auto-encoders. It may be desired to implement the ANN to be rather small and fast: for example, to include less than ten thousand parameters, or less than five thousand parameters, and/or for the trained ANN to occupy less than five megabytes of storage. In one example, the ANN is implemented such that the input layer and the output layer are each arrays of size 32 x 32. In a typical production environment, a copy of the trained ANN is stored (e.g., during manufacture and/or provisioning) to each of a run of devices having the same model of video camera (and, possibly, the same model of IMU) . It may be desired to normalize the values of the motion descriptors to occupy the same range as the values of the calculated keypoint descriptor before input to the trained ANN.
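Two of these points, normalizing the motion descriptors to the range of the keypoint descriptor and checking the parameter budget of the converter, are sketched below; the min-max normalization scheme is an assumption.

```python
# Normalize a motion descriptor to the value range of the keypoint descriptor,
# and count the parameters of a (PyTorch) converter module.
import numpy as np

def normalize_to_range(motion_descriptor, keypoint_descriptor):
    lo, hi = keypoint_descriptor.min(), keypoint_descriptor.max()
    m_lo, m_hi = motion_descriptor.min(), motion_descriptor.max()
    scaled = (motion_descriptor - m_lo) / (m_hi - m_lo + 1e-8)
    return lo + scaled * (hi - lo)

def parameter_count(model):
    # The paragraph above suggests fewer than ten thousand (or five thousand)
    # parameters for a small, fast converter network.
    return sum(p.numel() for p in model.parameters())
```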
It should be appreciated that the specific steps illustrated in FIG. 12 provide a particular method of performing image processing according to another embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 12 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 13 shows a simplified flowchart illustrating a method of training an ANN for keypoint conversion according to an embodiment of the present invention. Keypoint descriptors for training of this ANN may be generated by trained ANN 500 from the training data (images, motion blurs, and keypoint neighborhoods) as described with reference to FIG. 4 above. As shown in FIG. 13, the corresponding generated keypoint descriptors provide ground truth for the loss function during training of the ANN. It may be desired to augment the synthetic training data with descriptors calculated from a relatively small amount of images that have real motion blur and are manually annotated by human annotators.
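A minimal training step consistent with FIG. 13 is sketched below (PyTorch assumed); the random tensors stand in for descriptors that would, in practice, be generated by trained ANN 500 from the blurred training images and for the corresponding motion descriptors, and the network dimensions are assumptions.

```python
# One training step for the converter: the descriptor that ANN 500 generates for the
# same keypoint under the second motion provides the ground truth for the loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

desc_dim, motion_dim = 128, 16  # assumed sizes
converter = nn.Sequential(
    nn.Linear(desc_dim + 2 * motion_dim, 256),
    nn.ReLU(),
    nn.Linear(256, desc_dim),
)
optimizer = torch.optim.Adam(converter.parameters(), lr=1e-4)

# Stand-in batch; in practice f0 and f1_ground_truth come from trained ANN 500.
f0 = torch.randn(64, desc_dim)
m0 = torch.randn(64, motion_dim)
m1 = torch.randn(64, motion_dim)
f1_ground_truth = torch.randn(64, desc_dim)

optimizer.zero_grad()
f1_pred = converter(torch.cat([f0, m0, m1], dim=-1))
loss = F.mse_loss(f1_pred, f1_ground_truth)  # L2 loss against ANN 500's descriptor
loss.backward()
optimizer.step()
```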
Inference with a deep learning network usually involves much more computation than calculating a distance between two keypoint descriptors. Therefore, to save computation, it may be desired to implement method 1100 or 1200 to compare the value of the first motion descriptor to the value of the second motion descriptor, and to avoid using the trained network (e.g., to use a traditional scoring metric instead) for cases in which the comparison indicates that the first and second motion descriptors have similar values. Method 1100 or 1200 may be implemented, for example, to include a task that calculates a distance between the first motion descriptor and the second motion descriptor and compares the distance to a threshold value. Such an implementation of method 1100 or 1200 may be configured to use a score metric (e.g., a distance as described above), rather than the trained network, to determine whether keypoint descriptors from the first image match keypoint descriptors from the second image, in response to an indication by the motion descriptor comparison task that the motion blur of the first image is similar to the motion blur of the second image.
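Such a gate might be implemented as in the following sketch; the similarity threshold and the callable comparing_network (standing in for the trained network) are assumptions.

```python
# Skip the trained network when the two motion descriptors are similar.
import numpy as np

MOTION_SIMILARITY_THRESHOLD = 0.1  # assumed value

def match_score(f0, f1, m0, m1, comparing_network):
    if np.linalg.norm(m0 - m1) < MOTION_SIMILARITY_THRESHOLD:
        # Similar image motions: a plain descriptor distance is sufficient.
        return float(np.linalg.norm(f0 - f1))
    # Dissimilar motions: use the trained comparing module or converter.
    return comparing_network(f0, f1, m0, m1)
```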
FIG. 14 shows a simplified block diagram of an apparatus according to an embodiment of the present invention. As an example, the apparatus 1400 illustrated in FIG. 14 can be utilized for image processing on a mobile device (e.g., a cellular telephone, such as a smartphone, or a head-mounted device) according to a general configuration that includes keypoint descriptor generator 1410, keypoint descriptor converter 1420, and keypoint descriptor comparer 1430.
Keypoint descriptor generator 1410 includes an instance of trained ANN 500 and is configured to generate a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period and to generate a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period (e.g., as described herein with reference to  tasks  1110 and 1120, respectively) . Keypoint descriptor converter 1420 is configured to use a trained ANN to convert the generated descriptor for the keypoint in the first image to a converted descriptor, based on a first motion descriptor that describes motion of the image sensor during the first time period and a second motion descriptor that describes motion of the image sensor during the second time period (e.g., as described herein with reference to task 1130) . Keypoint descriptor comparer 1430 is configured to compare the converted descriptor to the generated descriptor for the keypoint in the second image (e.g., as described herein with reference to task 1140) .
In one example, apparatus 1400 is implemented within a device such as a mobile phone, which typically has a video camera configured to produce a sequence of frames that include the first and second images. The device may also include one or more motion sensors, which may be configured to determine 6DOF motion of the device in space. In another example, apparatus 1400 is implemented within a head-mounted device, such as a set of AR glasses, which may also have motion sensors and one or more cameras.
In a further embodiment of the present invention, trained keypoint descriptor comparing module 600 is used instead of keypoint descriptor converter 1420 to perform feature matching. As described above, keypoint descriptor comparing module 600 is used to determine whether two keypoint descriptors match, even if the features belong to two images with different image motions. Using keypoint descriptor comparing module 600 is a more holistic way of addressing failure of feature matching caused by motion blur. As shown in FIG. 6, keypoint descriptor comparing module 600 receives four inputs, which include the two keypoint descriptors, f0 and f1, and the motion descriptors M0 and M1 that correspond to the motion blurs of the source images of f0 and f1. The output is a binary decision 610 (i.e., whether or not f0 and f1 match) as well as a value P 620 denoting the confidence of the output binary decision.
As compared with keypoint descriptor converter 1420, keypoint descriptor comparing module 600 encapsulates the score metric and tends to have higher matching accuracy. On the other hand, comparing module 600 takes more inputs than converter 1420 and typically includes a larger network. As a result, this solution tends to have a larger memory footprint and to consume more computational resources. As noted above, it may be desired to use a traditional scoring metric instead when the keypoint descriptors f0 and f1 come from images having similar image motions, in order to save computation.
The embodiments discussed herein may be implemented in a variety of fields that may include feature matching, such as image alignment (e.g., panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, etc. The training images are not limited to images produced by a visible-light camera (e.g., in RGB or another color space) , but may also be images produced by a camera that is sensitive to non-visible light (e.g., infrared (IR) , ultraviolet (UV) ) , images produced by a structured light camera, and/or images produced by an image sensor other than a camera (e.g., imaging using RADAR, LIDAR, SONAR, etc. ) . Moreover, the embodiments described herein may also be extended beyond motion blur to cover other factors that may distort keypoint descriptors, such as illumination change, etc.
FIG. 15 illustrates examples of components of a computer system 1500 that may be configured to perform an implementation of a method as described herein (e.g.,  method  100, 1100, and/or 1200) . Although these components are illustrated as belonging to a same computer system 1500 (e.g., a smartphone or head-mounted device) , computer system 1500 may also be implemented such that the components are distributed (e.g., among different servers, among a smartphone and one or more network entities, etc. ) .
The computer system 1500 includes at least a processor 1502, a memory 1504, a storage device 1506, input/output peripherals (I/O) 1508, communication peripherals 1510, and an interface bus 1512. The interface bus 1512 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 1500. The memory 1504 and/or the storage device 1506 may be configured to store the training images (e.g., to store frames of a video sequence) and may include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM), hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage, for example flash memory, and other tangible storage media. Any of such computer readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure. The memory 1504 and the storage device 1506 also include computer readable signal media. A computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof. A computer readable signal medium includes any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 1500.
Further, the memory 1504 includes an operating system, programs, and applications. The processor 1502 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors. The memory 1504 and/or the processor 1502 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center. The I/O peripherals 1508 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices (e.g., an image sensor configured to capture the images to be indexed) , and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. The I/O peripherals 1508 are connected to the processor 1502 through any of the ports coupled to the interface bus 1512. The communication peripherals 1510 are configured to facilitate communication between the computer system 1500 and other computing devices (e.g., cloud computing entities configured to perform portions of indexing and/or query searching methods as described herein) over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it  should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the present disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the present disclosure.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing, ” “computing, ” “calculating, ” “determining, ” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied -for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The terms “comprising, ” “including, ” “having, ” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed examples.
The various elements of an implementation of an apparatus or system as disclosed herein (e.g., apparatus 900 or 1400, system 1500) may be embodied in any combination of hardware with software and/or with firmware that is deemed suitable for the intended application. For example, such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as  transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) . Such an apparatus may also be implemented to include a memory configured to store the training images and/or the sets of neighborhoods.
A processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) . Examples of such arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs (digital signal processors) , FPGAs (field-programmable gate arrays) , ASSPs (application-specific standard products) , and ASICs (application-specific integrated circuits) . A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors. It is possible for a processor as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to a procedure of an implementation of  method  100, 1100, or 1200 (or another method as disclosed with reference to operation of an apparatus or system described herein) , such as a task relating to another operation of a device or system in which the processor is embedded (e.g., a voice communications device, such as a smartphone, or a smart speaker) . It is also possible for part of a method as disclosed herein to be performed under the control of one or more other processors.
Each of the tasks of the methods disclosed herein (e.g.,  methods  100, 1100, 1200) may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions) , embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc. ) , that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine) . The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP) . For example, such a device may include RF circuitry configured to receive and/or transmit encoded frames.
In one or more exemplary embodiments, the operations described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, such operations may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term “computer-readable media” includes both computer-readable storage media and communication (e.g., transmission) media. By way of example, and not limitation, computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM) , or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices. Such storage media may store information in the form of instructions or data structures that can be accessed by a computer. Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL) , or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic  cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave are included in the definition of medium. Disk and disc, as used herein, includes compact disc (CD) , laser disc, optical disc, digital versatile disc (DVD) , floppy disk and Blu-ray Disc TM (Blu-Ray Disc Association, Universal City, Calif. ) , where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
In one example, a non-transitory computer-readable storage medium comprises code which, when executed by at least one processor, causes the at least one processor to perform a method of generating a keypoint descriptor that is robust to motion blur as described herein (e.g.,  method  100 or 1100 or 1200) . Further examples of such a storage medium include a medium further comprising code which, when executed by the at least one processor, causes the at least one processor to perform a method of generating a keypoint descriptor that is robust to motion blur as described herein.
Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device) , and/or retrieving (e.g., from an array of storage elements) . Unless expressly limited by its context, the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more. Unless expressly limited by its context, the term “determining” is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B” ) is used to indicate any of its ordinary meanings, including the cases (i) “derived from” (e.g., “B is a precursor of A” ) , (ii) “based on at least” (e.g., “A is based on at least B” ) and, if appropriate in the particular context, (iii) “equal to” (e.g., “A is equal to B” ) . Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least. ” Unless otherwise indicated, the terms “at least one of A, B, and C, ” “one or more of A, B, and C, ” “at least one among A, B, and C, ” and “one or more among A, B, and C” indicate “A and/or B and/or C. ” Unless otherwise indicated, the terms “each of A, B, and C” and “each among A, B, and C” indicate “A and B and C. ”
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa) , and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa) . The term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context. The terms “method, ” “process, ” “procedure, ” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. A “task” having multiple subtasks is also a method. The terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” are typically used to indicate a portion of a greater configuration. Unless expressly limited by its context, the term “system” is used herein to indicate any of its ordinary meanings, including “a group of elements that interact to serve a common purpose. ”
Unless initially introduced by a definite article, an ordinal term (e.g., “first, ” “second, ” “third, ” etc. ) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term) . Unless expressly limited by its context, each of the terms “plurality” and “set” is used herein to indicate an integer quantity that is greater than one.
The previous description is provided to enable a person skilled in the art to make or use the disclosed implementations. Various modifications to these implementations will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other implementations without departing  from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the implementations shown herein, but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.

Claims (20)

  1. A method of generating a keypoint descriptor that is robust to motion blur, the method comprising:
    selecting a plurality of keypoints in an image;
    for each of a plurality of motion blurs that are different from each other, applying each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and
    based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, training an artificial neural network (ANN) to generate a keypoint descriptor,
    wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
  2. The method of claim 1 wherein each of the plurality of motion blurs describes a corresponding trajectory having at least two dimensions in space.
  3. The method of claim 1 wherein each of the plurality of motion blurs describes a motion in six degrees of freedom.
  4. The method of claim 1 further comprising extracting, from each of the plurality of blurred images and at each of the plurality of different scales, the neighborhoods of the keypoints.
  5. The method of claim 4 wherein extracting includes applying at least one Gaussian blur to each of the plurality of blurred images.
  6. The method of claim 4 further comprising downsampling each of the plurality of blurred images.
  7. The method of claim 1 further comprising determining that a distance between two of the plurality of motion blurs is not less than a threshold value, wherein using a second trained artificial neural network to calculate a loss function for the training is contingent on the determining.
  8. A computer system including:
    one or more processors; and
    one or more memories configured to store computer-readable instructions that, upon execution by the one or more processors, configure the computer system to:
    select a plurality of keypoints in an image;
    for each of a plurality of motion blurs that are different from each other, apply each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and
    based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, train an artificial neural network (ANN) to generate a keypoint descriptor,
    wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
  9. The computer system of claim 8 wherein each of the plurality of motion blurs describes a corresponding trajectory having at least two dimensions in space.
  10. The computer system of claim 8 wherein each of the plurality of motion blurs describes a motion in six degrees of freedom.
  11. The computer system of claim 8 wherein the computer-readable instructions are further operable to configure the computer system to extract, from each of the plurality of blurred images and at each of the plurality of different scales, the neighborhoods of the keypoints.
  12. The computer system of claim 11 wherein extracting includes applying at least one Gaussian blur to each of the plurality of blurred images.
  13. The computer system of claim 8 wherein the computer-readable instructions are further operable to configure the computer system to downsample each of the plurality of blurred images.
  14. The computer system of claim 8 wherein the computer-readable instructions are further operable to configure the computer system to determine that a distance between two of the plurality of motion blurs is not less than a threshold value and, in response, to use a second trained artificial neural network to calculate a loss function for the training.
  15. One or more non-transitory computer-storage media storing instructions that, upon execution on a computer system, cause the computer system to perform operations including:
    selecting a plurality of keypoints in an image;
    for each of a plurality of motion blurs that are different from each other, applying each motion blur of the plurality of motion blurs to the image to generate a plurality of blurred images; and
    based on neighborhoods of the keypoints in each of the plurality of blurred images and at each of a plurality of different scales, training an artificial neural network (ANN) to generate a keypoint descriptor,
    wherein a criterion of the training is to minimize a distance measure between instances of the generated keypoint descriptor that correspond to the same keypoint in different ones of the plurality of blurred images.
  16. The one or more non-transitory computer-storage media of claim 15 wherein each of the plurality of motion blurs describes a corresponding trajectory having at least two dimensions in space.
  17. The one or more non-transitory computer-storage media of claim 15 wherein each of the plurality of motion blurs describes a motion in six degrees of freedom.
  18. The one or more non-transitory computer-storage media of claim 15 wherein the instructions further cause the computer system to perform operations including extracting, from each of the plurality of blurred images and at each of the plurality of different scales, the neighborhoods of the keypoints.
  19. The one or more non-transitory computer-storage media of claim 18 wherein extracting includes applying at least one Gaussian blur to each of the plurality of blurred images.
  20. The one or more non-transitory computer-storage media of claim 18 wherein the instructions further cause the computer system to perform operations including determining that a distance between two of the plurality of motion blurs is not less than a threshold value, wherein using a second trained artificial neural network to calculate a loss function for the training is contingent on the determining.
PCT/CN2021/077477 2020-03-13 2021-02-23 Motion blur robust image feature descriptor WO2021179905A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180020889.8A CN115362481A (en) 2020-03-13 2021-02-23 Motion blur robust image feature descriptors

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202062989343P 2020-03-13 2020-03-13
US62/989,343 2020-03-13

Publications (1)

Publication Number Publication Date
WO2021179905A1 true WO2021179905A1 (en) 2021-09-16

Family

ID=77671212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077477 WO2021179905A1 (en) 2020-03-13 2021-02-23 Motion blur robust image feature descriptor

Country Status (2)

Country Link
CN (1) CN115362481A (en)
WO (1) WO2021179905A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973176A (en) * 2022-05-30 2022-08-30 华中科技大学 Hardware acceleration method and system based on scale invariant feature transformation algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013020143A1 (en) * 2011-08-04 2013-02-07 University Of Southern California Image-based crack quantification
CN107767358A (en) * 2016-08-23 2018-03-06 阿里巴巴集团控股有限公司 A kind of objects in images fuzziness determines method and apparatus
CN108537787A (en) * 2018-03-30 2018-09-14 中国科学院半导体研究所 A kind of quality judging method of facial image
CN108921117A (en) * 2018-07-11 2018-11-30 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN109583408A (en) * 2018-12-07 2019-04-05 高新兴科技集团股份有限公司 A kind of vehicle key point alignment schemes based on deep learning

Also Published As

Publication number Publication date
CN115362481A (en) 2022-11-18

Similar Documents

Publication Publication Date Title
Bertasius et al. Object detection in video with spatiotemporal sampling networks
JP7221089B2 (en) Stable simultaneous execution of location estimation and map generation by removing dynamic traffic participants
US11557085B2 (en) Neural network processing for multi-object 3D modeling
Hannuna et al. DS-KCF: a real-time tracker for RGB-D data
US10636152B2 (en) System and method of hybrid tracking for match moving
US10769496B2 (en) Logo detection
US9811733B2 (en) Method, apparatus and system for selecting a frame
KR20230013243A (en) Maintain a fixed size for the target object in the frame
US10217221B2 (en) Place recognition algorithm
US9008366B1 (en) Bio-inspired method of ground object cueing in airborne motion imagery
US10204423B2 (en) Visual odometry using object priors
WO2016034059A1 (en) Target object tracking method based on color-structure features
US9721387B2 (en) Systems and methods for implementing augmented reality
US20080240497A1 (en) Method for tracking objects in videos using forward and backward tracking
US9639943B1 (en) Scanning of a handheld object for 3-dimensional reconstruction
WO2023010758A1 (en) Action detection method and apparatus, and terminal device and storage medium
US9794588B2 (en) Image processing system with optical flow recovery mechanism and method of operation thereof
US20110311100A1 (en) Method, Apparatus and Computer Program Product for Providing Object Tracking Using Template Switching and Feature Adaptation
Fooladgar et al. Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
WO2021179905A1 (en) Motion blur robust image feature descriptor
WO2021164615A1 (en) Motion blur robust image feature matching
US11238309B2 (en) Selecting keypoints in images using descriptor scores
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
Prakas et al. Fast and economical object tracking using Raspberry pi 3.0

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768110

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21768110

Country of ref document: EP

Kind code of ref document: A1