WO2021164615A1 - Motion blur robust image feature matching - Google Patents

Info

Publication number: WO2021164615A1
Application number: PCT/CN2021/076042
Authority: WIPO (PCT)
Prior art keywords: descriptor, image, motion, keypoint, describes
Other languages: French (fr)
Inventor: Xun Xu
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180015780.5A (published as CN115210758A)
Publication of WO2021164615A1

Classifications

    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F18/24133: Pattern recognition; classification techniques based on distances to training or reference patterns; distances to prototypes
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T2207/20084: Indexing scheme for image analysis or image enhancement; artificial neural networks [ANN]
    • G06T2207/30241: Indexing scheme for image analysis or image enhancement; subject or context of image processing; trajectory

Definitions

  • Augmented Reality superimposes virtual content over a user's view of the real world.
  • An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability.
  • a user can scan the environment using a smartphone's camera, and the smartphone performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
  • a keypoint (or "interest point” ) is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) .
  • a keypoint descriptor may be defined as a multi-element vector that describes (typically, in scale space) the neighborhood of a keypoint in an image. Examples of keypoint descriptor frameworks include Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , and Binary Robust Invariant Scalable Keypoints (BRISK) .
  • An image feature may be defined as a keypoint and a corresponding keypoint descriptor.
  • Matching features across different images is an important component of many image processing applications. Images which are captured while the image sensor is moving may have significant motion blur. Because most of the widely used characterizations of image features are very sensitive to motion blur, feature matching is less likely to succeed when the features are extracted from images having different motion blurs. Therefore, there is a need in the art for improved methods of performing feature matching.
  • the present invention relates generally to methods and systems related to image processing. More particularly, embodiments of the present invention provide methods and systems for performing feature matching in augmented reality applications. Embodiments of the present invention are applicable to a variety of applications in augmented reality and computer-based display systems.
  • a method of image processing comprises calculating a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period; calculating a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period; using a trained artificial neural network to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and comparing the converted descriptor to the calculated descriptor for the keypoint in the second image.
  • a computer system includes one or more memories configured to store a first image captured by an image sensor during a first time period, and a second image captured by the image sensor during a second time period different than the first time period.
  • This computer system also includes one or more processors, and the one or more memories are further configured to store computer-readable instructions that, upon execution by the one or more processors, configure the computer system to calculate a descriptor for a keypoint in the first image, wherein a first motion descriptor describes motion of the image sensor during the first time period; calculate a descriptor for a keypoint in the second image, wherein a second motion descriptor describes motion of the image sensor during the second time period; use a trained artificial neural network to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and compare the converted descriptor to the calculated descriptor for the keypoint in the second image.
  • One or more non-transitory computer-storage media store instructions that, upon execution on a computer system, cause the computer system to perform operations including selecting a keypoint in a first image captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period; calculating a corresponding descriptor for the selected keypoint in the first image; selecting a keypoint in a second image captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period; calculating a corresponding descriptor for the selected keypoint in the second image; using a trained artificial neural network to convert the calculated descriptor for the selected keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and comparing the converted descriptor to the calculated descriptor for the selected keypoint in the second image.
  • embodiments of the present disclosure involve methods and systems that utilize a deep learning network to function as a keypoint descriptor motion blur converter, or as a keypoint descriptor comparing module, to address the challenge of image feature matching when different motion blurs are present in camera frames.
  • Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications.
  • embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions. As a result, SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation.
  • embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions.
  • FIG. 1 shows a simplified schematic diagram of a keypoint descriptor converter according to an embodiment of the present invention.
  • FIG. 2 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
  • FIG. 3A shows an example of an image according to an embodiment of the present invention.
  • FIG. 3B shows examples of keypoints in the image illustrated in FIG. 3A according to an embodiment of the present invention.
  • FIG. 4 shows a simplified flowchart of a method of performing image processing according to an embodiment of the present invention.
  • FIG. 5 shows an illustration of six degrees of freedom (6DOF) .
  • FIG. 6 shows a simplified flowchart illustrating a method of generating training data for training of a network according to an embodiment of the present invention.
  • FIG. 7 shows a simplified flowchart illustrating a method of training an ANN according to an embodiment of the present invention.
  • FIG. 8 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 9 shows a simplified flowchart illustrating a method of performing feature matching using a keypoint descriptor comparing module according to an embodiment of the present invention.
  • FIG. 10 shows a block diagram of a computer system according to an embodiment of the present invention.
  • Such applications may include image alignment (e.g., image stitching, image registration, panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, motion tracking, object recognition, and many others.
  • a basic requirement in many augmented reality or virtual reality (AR/VR) applications is to determine a device's position and orientation in 3D space.
  • Such applications may use a simultaneous localization and mapping (SLAM) algorithm to determine a device's real-time position and orientation and to infer a structure of the environment (or "scene" ) within which the device is operating.
  • frames of a video sequence from the device's camera are input to a module that executes a SLAM algorithm.
  • Features are extracted from the frames and matched across different frames, and the SLAM algorithm searches for matched features that correspond to the same spot in the scene being captured.
  • the SLAM module can determine the image sensor's motion within the scene and infer the major structure of the scene.
  • keypoint detection is performed on each of a plurality of images (e.g., on each frame of a video sequence) , and a corresponding keypoint descriptor is calculated for each of the detected keypoints from its neighborhood (typically in scale space) .
  • The number of keypoints detected for each image is typically at least a few dozen and may be as high as five hundred or more, and the neighborhood from which a keypoint descriptor is calculated typically has a radius of about fifteen pixels around the keypoint. As illustrated in FIG. 1, a keypoint descriptor f0 from a first image I0 (e.g., a frame of the video sequence) and a keypoint descriptor f1 from a second image I1 (e.g., a different frame of the video sequence, such as a consecutive frame in the sequence) are used to compute a matching score.
  • the score metric is usually a distance d (f0, f1) between the keypoint descriptors f0 and f1 in the descriptor space, such as a distance according to any of the example distance metrics below.
  • This score computation is repeated for different pairs of keypoint descriptors from the two images, and the resulting scores are thresholded to identify matching features: e.g., to determine whether the current pair of keypoint descriptors (and thus the corresponding features from the two images) match.
  • a pair of matched features corresponds to a single point in the physical environment, and this correspondence leads to a math constraint.
  • a later stage of the SLAM computation may derive camera motion and an environmental model as an optimum solution that satisfies multiple constraints, including constraints generated by matching feature pairs.
  • One example of such a distance metric is the Minkowski distance (also called the generalized Euclidean distance) .
  • Thresholding of the resulting scores to determine whether the corresponding features match may be performed according to a procedure such as the following: the pair of keypoint descriptors is declared a match if the distance between them is less than T, and is declared not a match otherwise, where T denotes a threshold value.
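
For illustration, the following sketch computes a Minkowski distance between two keypoint descriptors and applies the thresholding rule described above. The descriptor length, the order p, and the threshold value are placeholder choices, not values specified by this disclosure.

```python
import numpy as np

def minkowski_distance(f0: np.ndarray, f1: np.ndarray, p: float = 2.0) -> float:
    """Minkowski (generalized Euclidean) distance between two descriptors."""
    return float(np.sum(np.abs(f0 - f1) ** p) ** (1.0 / p))

def is_match(f0: np.ndarray, f1: np.ndarray, threshold: float = 0.5, p: float = 2.0) -> bool:
    """Declare a match when the descriptor distance falls below the threshold T."""
    return minkowski_distance(f0, f1, p) < threshold

# Example with two 128-dimensional descriptors (SIFT-sized); f1 is a slightly perturbed copy of f0.
rng = np.random.default_rng(0)
f0 = rng.random(128).astype(np.float32)
f1 = f0 + rng.normal(scale=0.01, size=128).astype(np.float32)
print(minkowski_distance(f0, f1), is_match(f0, f1))
```
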
  • Examples of AR/VR devices include mobile phones and head-mounted devices (e.g., AR or "smart" glasses) .
  • When the device's image sensor (e.g., a video camera) is moving quickly during capture, the captured frames may have significant motion blur. If the image sensor's motion during capture of image I0 is the same as the image sensor's motion during capture of image I1, and each of the descriptors f0 and f1 corresponds to the same keypoint in the two images, then the values of the descriptors f0 and f1 tend to be similar, such that the computed distance between them is small.
  • In practice, however, the image sensor would typically experience different motions when capturing each image, so that the descriptors f0 and f1 could be distorted by different motion blurs.
  • Because the widely used characterizations of image features (e.g., SIFT, SURF, BRISK) are very sensitive to any significant motion blur (e.g., a blur of five pixels or more) , the values of the descriptors f0 and f1 may be very different, even if the descriptors correspond to the same keypoint.
  • Motion direction and magnitude can be estimated from input of one or more motion sensors of the device, for example, and/or can be calculated from two temporal neighbor frames in the video sequence.
  • Motion sensors (which may include one or more gyroscopes, accelerometers, and/or magnetometers) may indicate a displacement and/or change in orientation of the device and may be implemented within an inertial measurement unit (IMU) .
  • Examples of techniques for coping with motion blur, and shortcomings of these approaches, may include the following:
  • Don't match: To prevent false matches of features in blurred images from degrading the estimations in SLAM, one may choose not to do feature matching at all when significant image motion has been detected. For example, one may choose to perform SLAM using only motion sensor output at these times. However, this approach may cause the image sensor output at such moments to be completely wasted, and the SLAM calculation may in turn become less accurate and less stable.
  • Deblur the image: Image deblurring usually involves significant computation. Since the SLAM computations are commonly carried out on a mobile platform, the additional computation required for deblurring processing may not always be available or affordable. Moreover, the deblurring operation tends to add new artifacts to the original image, which in turn are likely to negatively impact the image feature matching accuracy.
  • The embodiments described herein, implemented using appropriate systems, methods, apparatus, devices, and the like as disclosed herein, may support increased accuracy of feature matching operations in applications that are prone to motion blur.
  • the embodiments described herein can be implemented in any of a variety of applications that use feature matching, including image alignment (e.g., image stitching, image registration, panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, endoscopic imaging, motion tracking, object tracking, object recognition, automated navigation, SLAM, etc.
  • a deep learning network is used to function as a keypoint descriptor motion blur converter, or as a keypoint descriptor comparing module, to address the challenge of image feature matching when different motion blurs are present in camera frames.
  • Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications.
  • embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions.
  • SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation.
  • embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions.
  • FIG. 1 shows a simplified schematic diagram of a keypoint descriptor converter 100 according to an embodiment of the present invention.
  • the conversion is performed on a keypoint descriptor f0 that is extracted from image I0.
  • the descriptor converter takes three inputs: the keypoint descriptor f0; a motion descriptor M0 of the image motion when image I0 was captured; and a motion descriptor M1 of the image motion when image I1 was captured.
  • the output of the converter is a converted keypoint descriptor f1′.
  • the design goal is for the converted keypoint descriptor f1′ to be similar to keypoint descriptor f1 in image I1, if the descriptors f0 and f1 correspond to the same keypoint, and for the converted descriptor f1′ and the descriptor f1 to be very different if the descriptors f0 and f1 refer to different keypoints.
  • FIG. 2 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
  • The method of image processing 200 illustrated in FIG. 2 includes tasks 210, 220, 230, and 240.
  • Task 210 calculates a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period.
  • Task 220 calculates a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period.
  • Task 230 uses a trained artificial neural network (ANN) to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor.
  • Task 240 compares the converted descriptor to the calculated descriptor for the keypoint in the second image.
  • task 240 may include calculating a distance between the converted descriptor and the calculated descriptor in the descriptor space (e.g., according to a distance metric as disclosed herein, such as Euclidean distance, chi-squared distance, etc. ) and comparing the calculated distance to a threshold value (e.g., according to a procedure as described above) .
  • FIG. 2 provides a particular method of performing image processing according to an embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 2 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • a keypoint is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) .
  • FIG. 3A shows an example of an image (e.g., a frame of a video sequence) according to an embodiment of the present invention.
  • FIG. 3B shows examples of keypoints in the image illustrated in FIG. 3A according to an embodiment of the present invention. Referring to FIG. 3B, the circles shown in FIG. 3B indicate the locations of a few examples of keypoints 310 -322 in the image.
  • the number of keypoints detected in each image in a typical feature matching application is at least a dozen, twenty-five, or fifty and may range up to one hundred, two hundred, or five hundred or more.
  • method 200 illustrated in FIG. 2 is repeated, for multiple different pairs (f0, f1) of keypoint descriptors, over consecutive pairs of frames of a video sequence (e.g., on each consecutive pair of frames) .
  • method 200 is repeated for each pair that comprises the descriptor of the keypoint in the first image and the descriptor of a keypoint that is within a threshold distance (e.g., twenty pixels) of the same location in the second image.
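
The following sketch illustrates one way such candidate pairs could be selected by spatial proximity. The twenty-pixel radius follows the example above, while the keypoint-location data structures are assumptions made for the example.

```python
import numpy as np

def candidate_pairs(kps0, kps1, radius: float = 20.0):
    """Return index pairs (i, j) where keypoint j in the second image lies within
    `radius` pixels of the location of keypoint i from the first image."""
    pts0 = np.asarray(kps0, dtype=np.float32)   # (N, 2) keypoint locations in image 0
    pts1 = np.asarray(kps1, dtype=np.float32)   # (M, 2) keypoint locations in image 1
    d = np.linalg.norm(pts0[:, None, :] - pts1[None, :, :], axis=-1)  # pairwise distances
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(d < radius))]

# Only nearby keypoints are considered as candidate matches for the descriptor comparison.
pairs = candidate_pairs([(100.0, 120.0), (300.0, 50.0)], [(105.0, 118.0), (400.0, 60.0)])
print(pairs)  # [(0, 0)]
```
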
  • Method 200 is not limited to any particular size or format of the first and second images, but examples from typical feature matching applications are now provided.
  • Most current AR/VR devices are configured to capture video in VGA format (i.e., having a frame size of 640 x 480 pixels) , with each pixel having a red, green, and blue component.
  • the largest frame format typically seen in such devices is 1280 x 720 pixels, such that a maximum size of each of the first and second images in a typical application is about one thousand by two thousand pixels.
  • the minimum size of each of the first and second images in a typical application is about one-quarter VGA (i.e., 320 x 240 pixels) , as a smaller image size would likely not be enough to support an algorithm such as SLAM.
  • Keypoint descriptor calculation tasks 210 and 220 may be implemented using an existing keypoint descriptor framework, such as Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , Binary Robust Invariant Scalable Keypoints (BRISK) , etc.
  • Such a task may comprise calculating an orientation of the keypoint, which may include determining how, or in what direction, a pixel neighborhood (also called an "image patch" ) that surrounds the keypoint is oriented.
  • Calculating an orientation of the keypoint, which may include detecting the most dominant orientation of the gradient angles in the patch, is typically performed on the patch at different scales of a scale space.
  • For example, the SIFT framework assigns a 128-dimensional feature vector to each keypoint based on the gradient orientations of pixels in sixteen local neighborhoods around the keypoint.
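
As an illustration of keypoint descriptor calculation with an existing framework, the sketch below extracts SIFT keypoints and 128-dimensional descriptors using OpenCV. It assumes an OpenCV build (4.4 or later) that includes SIFT, and it falls back to a synthetic image since no particular image or library is prescribed here.

```python
import cv2
import numpy as np

# Hypothetical frame from a video sequence; fall back to synthetic noise so the sketch runs standalone.
image = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
if image is None:
    image = (np.random.default_rng(0).random((480, 640)) * 255).astype(np.uint8)

sift = cv2.SIFT_create(nfeatures=500)            # cap the keypoint count, as in typical SLAM front ends
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), None if descriptors is None else descriptors.shape)  # (N, 128): one vector per keypoint
```
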
  • Some keypoint descriptor frameworks include both keypoint detection and keypoint descriptor calculation.
  • Other keypoint descriptor frameworks (e.g., Binary Robust Independent Elementary Features (BRIEF) ) include only the keypoint descriptor calculation.
  • a threshold value e.g., a blur of twenty pixels
  • FIG. 4 shows a simplified flowchart of a method of performing image processing according to another embodiment of the present invention.
  • elements utilized in method 200 are also utilized in method 400 illustrated in FIG. 4, as well as additional elements as described below. Accordingly, the description provided in relation to FIG. 2 is applicable to FIG. 4 as appropriate.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • task 410 selects the keypoint in the first image and task 420 selects the keypoint in the second image.
  • keypoint detectors that may be used to implement tasks 410 and 420 include corner detectors (e.g., Harris corner detector, Features from Accelerated Segment Test (FAST) ) and blob detectors (e.g., Laplacian of Gaussian (LoG) , Difference of Gaussians (DoG) , Determinant of Hessian (DoH) ) .
  • Such a keypoint detector is typically configured to blur and resample the image with different blur widths and sampling rates to create a scale space, and to detect corners and/or blobs at different scales.
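
A minimal sketch of the keypoint selection step (e.g., tasks 410 and 420) using one of the detectors named above; FAST is shown, and the synthetic test image and detector threshold are illustrative choices only.

```python
import cv2
import numpy as np

# Synthetic test image containing a bright rectangle, so corners exist without a real frame on disk.
image = np.zeros((480, 640), dtype=np.uint8)
cv2.rectangle(image, (200, 150), (400, 330), color=255, thickness=-1)

fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(image, None)
print(len(keypoints), [kp.pt for kp in keypoints[:4]])  # (x, y) locations of detected corners
```
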
  • the first and second motion descriptors describe a motion of the image sensor during capture of the first and second images, respectively, and may be calculated from motion sensor (e.g., IMU) outputs and/or from neighboring frames in a video sequence (e.g., from the frame captured just before and the frame captured just after the frame for which the motion descriptor is being calculated) .
  • the capture period for each frame in a video sequence is typically the reciprocal of the frame rate, although it is possible for the capture period to be shorter.
  • the frame rate for a typical video sequence (e.g., as captured by an Android phone) is thirty frames per second (fps) .
  • the frame rate for an iPhone or head-mounted device can be as high as 120 fps.
  • Each motion descriptor may be implemented to describe a trajectory or path in a coordinate space of one, two, or three spatial dimensions. Such a trajectory may be described as a sequence of one or more positions of the image sensor sampled at uniform intervals during the corresponding capture period, and each sampled position may be expressed as a motion vector relative to the previous sampled position (e.g., taking the position at the start of the capture period to be the origin of the coordinate space) .
  • the first motion descriptor describes a first trajectory having three dimensions in space
  • the second motion descriptor describes a second trajectory having three dimensions in space that is different than the first trajectory.
  • Each motion descriptor may be further implemented to describe a motion in six degrees of freedom (6DOF) .
  • a motion may include a rotation about each of one or more of the axes of these dimensions.
  • these rotations may be labeled as tilt, pitch, and yaw.
  • the motion descriptor may include, for each sampled position of the image sensor, a corresponding orientation of a reference direction of the image sensor (e.g., the look direction) relative to the orientation at the previous sampled position (e.g., taking the reference orientation to be the orientation at the start of the capture period) .
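
The disclosure does not fix a concrete encoding for the motion descriptor; the sketch below shows one plausible encoding consistent with the description above, in which positions and orientations sampled at uniform intervals during the capture period are expressed relative to the previous sample and flattened into a fixed-length vector. The sample count and the use of Euler angles are assumptions.

```python
import numpy as np

def motion_descriptor(positions: np.ndarray, orientations: np.ndarray) -> np.ndarray:
    """Encode image-sensor motion during one capture period as a flat vector.

    positions:    (K, 3) sensor positions sampled at uniform intervals during the exposure.
    orientations: (K, 3) sensor orientations (e.g., Euler angles in radians) at the same samples.
    Each sample is expressed relative to the previous one, with the start of the
    capture period serving as the origin and reference orientation.
    """
    rel_pos = np.diff(positions, axis=0, prepend=positions[:1])        # per-sample motion vectors
    rel_rot = np.diff(orientations, axis=0, prepend=orientations[:1])  # per-sample orientation changes
    return np.concatenate([rel_pos.ravel(), rel_rot.ravel()]).astype(np.float32)

# Example: 8 pose samples during one frame's exposure (values are made up).
K = 8
pos = np.cumsum(np.full((K, 3), 0.002), axis=0)   # steady translation of 2 mm per sample
rot = np.cumsum(np.full((K, 3), 0.001), axis=0)   # steady rotation of 1 mrad per sample
M0 = motion_descriptor(pos, rot)
print(M0.shape)  # (48,) = 8 samples x (3 translation + 3 rotation) components
```
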
  • task 230 uses a trained artificial neural network (ANN) to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor.
  • ANNs that can be trained to perform such a complex conversion between multi-element vectors include convolutional neural networks (CNNs) and auto-encoders. It may be desired to implement the ANN to be rather small and fast: for example, to include less than ten thousand parameters, or less than five thousand parameters, and/or for the trained ANN to occupy less than five megabytes of storage.
  • the ANN is implemented such that the input layer and the output layer are each arrays of size 32 x 32.
  • a copy of the trained ANN is stored (e.g., during manufacture and/or provisioning) to each of a run of devices having the same model of video camera (and, possibly, the same model of IMU) . It may be desired to normalize the values of the motion descriptors to occupy the same range as the values of the calculated keypoint descriptor before input to the trained ANN.
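
The converter architecture is left open above (a CNN or auto-encoder is suggested); the following is a minimal fully connected sketch in PyTorch sized to stay within the parameter budget mentioned there. The 128-element descriptor, the 48-element motion descriptors, and the use of flat vectors rather than 32 x 32 arrays are assumptions made for the example.

```python
import torch
import torch.nn as nn

class DescriptorConverter(nn.Module):
    """Maps (f0, M0, M1) to an estimate f1' of how the keypoint descriptor would
    appear under the second image's motion blur."""

    def __init__(self, desc_dim: int = 128, motion_dim: int = 48, hidden: int = 24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(desc_dim + 2 * motion_dim, hidden),   # f0, M0, M1 concatenated
            nn.ReLU(),
            nn.Linear(hidden, desc_dim),                    # converted descriptor f1'
        )

    def forward(self, f0, m0, m1):
        return self.net(torch.cat([f0, m0, m1], dim=-1))

model = DescriptorConverter()
print(sum(p.numel() for p in model.parameters()))   # 8600 parameters, under the ~10,000 target above

f0 = torch.rand(1, 128)                              # calculated descriptor from the first image
m0, m1 = torch.rand(1, 48), torch.rand(1, 48)        # motion descriptors for the two capture periods
print(model(f0, m0, m1).shape)                       # torch.Size([1, 128]): the converted descriptor
```
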
  • FIG. 4 provides a particular method of performing image processing according to another embodiment of the present invention.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present invention may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • Image motion blur can be simulated by image processing operations (such as directional filtering, etc. ) , and one or more such operations may be used to produce a large amount of synthetic training data.
  • FIG. 6 shows a simplified flowchart illustrating a method of generating training data for training of a network according to an embodiment of the present invention.
  • a number of keypoints 622 may be detected; a number of different motion blurs (each being described by a corresponding one of a set of motion descriptors 610a -610n) may then be applied to the image to generate a corresponding number of blurred images B1-Bn; and keypoint descriptors for the detected keypoints (and corresponding motion descriptors) may be calculated from each of the blurred images.
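
A sketch of the synthetic-data idea of FIG. 6: a directional (linear motion) blur kernel is applied to a sharp image, and descriptors are recomputed at the same keypoint locations under each blur. The kernel construction and the use of SIFT are illustrative assumptions.

```python
import cv2
import numpy as np

def linear_motion_kernel(length: int, angle_deg: float) -> np.ndarray:
    """Directional averaging filter approximating a straight-line motion blur."""
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0                                    # horizontal line of ones...
    center = (length / 2 - 0.5, length / 2 - 0.5)
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))          # ...rotated to the blur direction
    return kernel / max(kernel.sum(), 1e-6)

sharp = cv2.imread("sharp_source_image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical sharp source image
if sharp is None:
    sharp = (np.random.default_rng(1).random((480, 640)) * 255).astype(np.uint8)

sift = cv2.SIFT_create()
keypoints = sift.detect(sharp, None)         # keypoints detected once, on the sharp image

# Each synthetic blur would correspond to one motion descriptor (610a..610n); only the blur is modelled here.
for length, angle in [(9, 0.0), (15, 30.0), (21, 75.0)]:
    blurred = cv2.filter2D(sharp, -1, linear_motion_kernel(length, angle))
    _, descriptors = sift.compute(blurred, keypoints)  # descriptors of the same keypoints under this blur
    print(length, angle, None if descriptors is None else descriptors.shape)
```
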
  • FIG. 7 shows a simplified flowchart illustrating a method of training an ANN according to an embodiment of the present invention.
  • the corresponding calculated keypoint descriptors provide ground truth for the loss function during training of the ANN. It may be desired to augment the synthetic training data with descriptors calculated from a relatively small amount of images that have real motion blur and are manually annotated by human annotators.
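
A minimal training sketch under this scheme: the descriptor computed from the copy of the image blurred under the second motion serves as ground truth, and a mean-squared-error loss is used. The loss, optimizer, tensor shapes, and random stand-in data are assumptions, not details given by this disclosure.

```python
import torch
import torch.nn as nn

# Assumed training tensors, e.g. produced by a pipeline like the FIG. 6 sketch above:
#   f0     - descriptors computed from images blurred under motion M0
#   m0, m1 - the corresponding motion descriptors
#   f1_gt  - ground-truth descriptors of the same keypoints blurred under motion M1
N = 1024
f0, f1_gt = torch.rand(N, 128), torch.rand(N, 128)
m0, m1 = torch.rand(N, 48), torch.rand(N, 48)

# Small converter mirroring the earlier sketch (input: f0, M0, M1; output: converted descriptor).
model = nn.Sequential(nn.Linear(128 + 48 + 48, 24), nn.ReLU(), nn.Linear(24, 128))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):                        # a real run would use many epochs of real, annotated data
    optimizer.zero_grad()
    predicted = model(torch.cat([f0, m0, m1], dim=-1))
    loss = loss_fn(predicted, f1_gt)          # converted descriptor vs. ground-truth descriptor
    loss.backward()
    optimizer.step()
    print(epoch, float(loss))
```
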
  • Method 200 or 400 may be implemented, for example, to include a task that calculates a distance between the first motion descriptor and the second motion descriptor and compares the distance to a threshold value.
  • Such an implementation of method 200 or 400 may be configured to use a score metric (e.g., a distance as described above) , rather than the trained network, to determine whether keypoint descriptors from the first image match keypoint descriptors from the second image, in response to an indication by the motion descriptor comparison task that the motion blur of the first image is similar to the motion blur of the second image.
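
One way such a motion-similarity gate might be realized is sketched below: when the two motion descriptors are close, the converter is bypassed and the plain descriptor distance is used; otherwise, f0 is routed through the trained converter first. The threshold values and the `convert` callable are placeholders.

```python
import numpy as np

def match_distance(f0, f1, m0, m1, convert, motion_threshold: float = 0.05, p: float = 2.0) -> float:
    """Descriptor distance used for matching, with a motion-similarity gate.

    convert: callable (f0, m0, m1) -> converted descriptor, e.g. the trained ANN.
    When the two motion descriptors are close (similar blur), the converter is
    bypassed and f0 is compared to f1 directly.
    """
    motion_distance = float(np.linalg.norm(np.asarray(m0) - np.asarray(m1)))
    query = np.asarray(f0) if motion_distance < motion_threshold else np.asarray(convert(f0, m0, m1))
    return float(np.sum(np.abs(query - np.asarray(f1)) ** p) ** (1.0 / p))

# Example with an identity stand-in for the trained converter.
rng = np.random.default_rng(2)
f0, f1 = rng.random(128), rng.random(128)
m0, m1 = rng.random(48), rng.random(48)
identity = lambda f, a, b: f
print(match_distance(f0, f1, m0, m0, identity))  # identical motion: plain comparison path
print(match_distance(f0, f1, m0, m1, identity))  # different motion: converter path
```
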
  • FIG. 8 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
  • the apparatus 800 illustrated in FIG. 8 can be utilized for image processing on a mobile device (e.g., a cellular telephone, such as a smartphone, or a head-mounted device) according to a general configuration that includes keypoint descriptor calculator 810, keypoint descriptor converter 820, and keypoint descriptor comparer 830.
  • Keypoint descriptor calculator 810 is configured to calculate a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period and to calculate a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period (e.g., as described herein with reference to tasks 210 and 220, respectively) .
  • Keypoint descriptor converter 820 is configured to use a trained ANN to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on a first motion descriptor that describes motion of the image sensor during the first time period and a second motion descriptor that describes motion of the image sensor during the second time period (e.g., as described herein with reference to task 230) .
  • Keypoint descriptor comparer 830 is configured to compare the converted descriptor to the calculated descriptor for the keypoint in the second image (e.g., as described herein with reference to task 240) .
  • apparatus 800 is implemented within a device such as a mobile phone, which typically has a video camera configured to produce a sequence of frames that include the first and second images.
  • the device may also include one or more motion sensors, which may be configured to determine 6DOF motion of the device in space.
  • apparatus 800 is implemented within a head-mounted device, such as a set of AR glasses, which may also have motion sensors and one or more cameras.
  • FIG. 9 shows a simplified flowchart illustrating a method of performing feature matching using a keypoint descriptor comparing module 900 according to an embodiment of the present invention.
  • the keypoint descriptor comparing module 900 is used to determine whether two keypoint descriptors match, even if the features belong to two images with different image motions.
  • Using keypoint descriptor comparing module 900 is a more holistic way of addressing failure of feature matching caused by motion blur.
  • keypoint descriptor comparing module 900 receives four inputs, which include the two keypoint descriptors, f0 and f1, and the motion descriptors, M0 and M1, that correspond to the motion blurs of the source images of f0 and f1.
  • the output is a binary decision 910 (i.e., whether or not f0 and f1 match) as well as a value P 912 denoting the confidence of the output binary decision.
  • a deep learning network is trained to produce the match indication and confidence value outputs of the keypoint descriptor comparing module 900.
  • the network may be implemented as a classifier network as known in the field of deep learning. Given adequate training data, a classifier usually produces a good output. Training of such a network (e.g., a CNN) may be performed using training data obtained as described above (e.g., with reference to FIG. 6) .
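
A minimal sketch of such a classifier-style comparing module: it takes f0, f1, M0, and M1 and outputs a match probability, which can be thresholded into the binary decision 910 while the probability itself serves as the confidence value P 912. The layer sizes and input dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DescriptorComparer(nn.Module):
    """Classifier over (f0, f1, M0, M1): outputs the probability that the two descriptors match."""

    def __init__(self, desc_dim: int = 128, motion_dim: int = 48, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * desc_dim + 2 * motion_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                # single logit for the binary match decision
        )

    def forward(self, f0, f1, m0, m1):
        return torch.sigmoid(self.net(torch.cat([f0, f1, m0, m1], dim=-1)))

comparer = DescriptorComparer()
f0, f1 = torch.rand(1, 128), torch.rand(1, 128)
m0, m1 = torch.rand(1, 48), torch.rand(1, 48)
confidence = comparer(f0, f1, m0, m1).item()     # confidence value P (912)
decision = confidence > 0.5                      # binary match decision (910)
print(decision, confidence)
```
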
  • keypoint descriptor comparing module 900 encapsulates the score metric and tends to have higher matching accuracy.
  • Comparing module 900 takes more inputs than converter 820 and typically includes a larger network. As a result, this solution tends to have a larger memory footprint and to consume more computational resources.
  • the embodiments discussed herein may be implemented in a variety of fields that may include feature matching, such as image alignment (e.g., panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, etc.
  • the first and second images are not limited to images produced by a visible-light camera (e.g., in RGB or another color space) ; for example, they may be images produced by a camera that is sensitive to non-visible light (e.g., infrared (IR) , ultraviolet (UV) ) , images produced by a structured light camera, and/or images produced by an image sensor other than a camera (e.g., imaging using RADAR, LIDAR, SONAR, etc. ) .
  • the embodiments described herein may also be extended beyond motion blur to cover other factors that may distort keypoint descriptors, such as illumination change, etc.
  • FIG. 10 shows a block diagram of a computer system 1000 according to an embodiment of the present invention.
  • computer system 1000 and components thereof may be configured to perform an implementation of a method as described herein. Although these components are illustrated as belonging to a same computer system 1000 (e.g., a smartphone or head-mounted device) , computer system 1000 may also be implemented such that the components are distributed (e.g., among different servers, among a smartphone and one or more network entities, etc. ) .
  • the computer system 1000 includes at least a processor 1002, a memory 1004, a storage device 1006, input/output peripherals (I/O) 1008, communication peripherals 1010, and an interface bus 1012.
  • the interface bus 1012 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 1000.
  • the memory 1004 and/or the storage device 1006 may be configured to store the first and second images (e.g., to store frames of a video sequence) and may include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM) , hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage, for example memory, and other tangible storage media.
  • Any of such computer readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure.
  • the memory 1004 and the storage device 1006 also include computer readable signal media.
  • a computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof.
  • a computer readable signal medium includes any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 1000.
  • the memory 1004 includes an operating system, programs, and applications.
  • the processor 1002 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors.
  • the memory 1004 and/or the processor 1002 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center.
  • the I/O peripherals 1008 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices (e.g., an image sensor configured to capture the images to be indexed) , and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals.
  • the I/O peripherals 1008 are connected to the processor 1002 through any of the ports coupled to the interface bus 1012.
  • the communication peripherals 1010 are configured to facilitate communication between the computer system 1000 and other computing devices (e.g., cloud computing entities configured to perform portions of indexing and/or query searching methods as described herein) over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
  • a computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs.
  • Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
  • Embodiments of the methods disclosed herein may be performed in the operation of such computing devices.
  • the order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
  • an implementation of an apparatus or system as disclosed herein may be embodied in any combination of hardware with software and/or with firmware that is deemed suitable for the intended application.
  • such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset.
  • One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays.
  • Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) .
  • Such an apparatus may also be implemented to include a memory configured to store the first and second images.
  • a processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset.
  • One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays.
  • Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) .
  • arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs (digital signal processors) , FPGAs (field-programmable gate arrays) , ASSPs (application-specific standard products) , and ASICs (application-specific integrated circuits) .
  • a processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors.
  • a processor as described herein can be used to perform tasks or execute other sets of instructions that are not directly related to a procedure of an implementation of method 200 or 400 (or another method as disclosed with reference to operation of an apparatus or system described herein) , such as a task relating to another operation of a device or system in which the processor is embedded (e.g., a voice communications device, such as a smartphone, or a smart speaker) . It is also possible for part of a method as disclosed herein to be performed under the control of one or more other processors.
  • Each of the tasks of the methods disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two.
  • an array of logic elements e.g., logic gates
  • One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions) , embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc. ) , that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine) .
  • the tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine.
  • the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability.
  • a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP) .
  • such a device may include RF circuitry configured to receive and/or transmit encoded frames.
  • computer-readable media includes both computer-readable storage media and communication (e.g., transmission) media.
  • computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM) , or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices.
  • Such storage media may store information in the form of instructions or data structures that can be accessed by a computer.
  • Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium.
  • As used herein, disk and disc include compact disc (CD) , laser disc, optical disc, digital versatile disc (DVD) , floppy disk, and Blu-ray Disc™ (Blu-ray Disc Association, Universal City, Calif. ) , where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • a non-transitory computer-readable storage medium comprises code which, when executed by at least one processor, causes the at least one processor to perform a method of image processing as described herein (e.g., method 200 or 400) .
  • a storage medium include a medium further comprising code which, when executed by the at least one processor, causes the at least one processor to perform a method of image processing as described herein.
  • the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium.
  • the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing.
  • the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values.
  • the term "obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device) , and/or retrieving (e.g., from an array of storage elements) .
  • the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more.
  • the term “determining” is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating.
  • the terms "at least one of A, B, and C, " "one or more of A, B, and C, " "at least one among A, B, and C, " and "one or more among A, B, and C" indicate "A and/or B and/or C. " Unless otherwise indicated, the terms "each of A, B, and C" and "each among A, B, and C" indicate "A and B and C. "
  • any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa)
  • any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa)
  • the term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context.
  • the terms "method, " "process, " "procedure, " and "technique" are used generically and interchangeably unless otherwise indicated by the particular context.
  • a “task” having multiple subtasks is also a method.
  • the terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context.
  • an ordinal term (e.g., "first, " "second, " "third, " etc. ) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term) .
  • each of the terms “plurality” and “set” is used herein to indicate an integer quantity that is greater than one.

Abstract

Methods, computer systems, and computer-storage media for image processing are disclosed. In one example, descriptors are calculated for keypoints in images that were captured during different time periods. A trained artificial neural network is used to convert a calculated descriptor to a converted descriptor, based on motion descriptors that describe motion of an image sensor during the time periods.

Description

MOTION BLUR ROBUST IMAGE FEATURE MATCHING

BACKGROUND OF THE INVENTION
Augmented Reality (AR) superimposes virtual content over a user's view of the real world. With the development of AR software development kits (SDK) , the mobile industry has brought smartphone AR to the mainstream. An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability. A user can scan the environment using a smartphone's camera, and the smartphone performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
A keypoint (or "interest point" ) is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) . A keypoint descriptor may be defined as a multi-element vector that describes (typically, in scale space) the neighborhood of a keypoint in an image. Examples of keypoint descriptor frameworks include Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , and Binary Robust Invariant Scalable Keypoints (BRISK) . An image feature may be defined as a keypoint and a corresponding keypoint descriptor.
Matching features across different images (e.g., across different frames of a video sequence) is an important component of many image processing applications. Images which are captured while the image sensor is moving may have significant motion blur. Because most of the widely used characterizations of image features are very sensitive to motion blur, feature matching is less likely to succeed when the features are extracted from images having different motion blurs. Therefore, there is a need in the art for improved methods of performing feature matching.
SUMMARY OF THE INVENTION
The present invention relates generally to methods and systems related to image processing. More particularly, embodiments of the present invention provide methods and systems for performing feature matching in augmented reality applications. Embodiments of the present invention are applicable to a variety of applications in augmented reality and computer-based display systems.
A method of image processing according to a general configuration comprises calculating a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period; calculating a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period; using a trained artificial neural network to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and comparing the converted descriptor to the calculated descriptor for the keypoint in the second image.
A computer system according to another general configuration includes one or more memories configured to store a first image captured by an image sensor during a first time period, and a second image captured by the image sensor during a second time period different than the first time period. This computer system also includes one or more processors, and the one or more memories are further configured to store computer-readable instructions that, upon execution by the one or more processors, configure the computer system to calculate a descriptor for a keypoint in the first image, wherein a first motion descriptor describes motion of the image sensor during the first time period; calculate a descriptor for a keypoint in the second image, wherein a second motion descriptor describes motion of the image sensor during the second time period; use a trained artificial neural network to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and compare the  converted descriptor to the calculated descriptor for the keypoint in the second image.
One or more non-transitory computer-storage media according to a further general configuration store instructions that, upon execution on a computer system, cause the computer system to perform operations including selecting a keypoint in a first image captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period; calculating a corresponding descriptor for the selected keypoint in the first image; selecting a keypoint in a second image captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period; calculating a corresponding descriptor for the selected keypoint in the second image; using a trained artificial neural network to convert the calculated descriptor for the selected keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and comparing the converted descriptor to the calculated descriptor for the selected keypoint in the second image.
Numerous benefits are achieved by way of the present invention over conventional techniques. For example, embodiments of the present disclosure involve methods and systems that utilize a deep learning network to function as a keypoint descriptor motion blur converter, or as a keypoint descriptor comparing module, to address the challenge of image feature matching when different motion blurs are present in camera frames. Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications. Moreover, embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions. As a result, SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation. Thus, embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions. These and other embodiments of the invention along with many of its advantages and features are described in more detail in conjunction with the text below and attached figures.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a simplified schematic diagram of a keypoint descriptor converter according to an embodiment of the present invention.
FIG. 2 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention.
FIG. 3A shows an example of an image according to an embodiment of the present invention.
FIG. 3B shows examples of keypoints in the image illustrated in FIG. 3A according to an embodiment of the present invention.
FIG. 4 shows a simplified flowchart of a method of performing image processing according to an embodiment of the present invention.
FIG. 5 shows an illustration of six degrees of freedom (6DOF) .
FIG. 6 shows a simplified flowchart illustrating a method of generating training data for training a network according to an embodiment of the present invention.
FIG. 7 shows a simplified flowchart illustrating a method of training an ANN according to an embodiment of the present invention.
FIG. 8 shows a simplified block diagram of an apparatus according to an embodiment of the present invention.
FIG. 9 shows a simplified flowchart illustrating a method of performing feature matching using a  keypoint descriptor comparing module according to an embodiment of the present invention.
FIG. 10 shows a block diagram of a computer system according to an embodiment of the present invention.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
Many applications depend heavily on the performance of image feature matching. Such applications may include image alignment (e.g., image stitching, image registration, panoramic mosaics) , three-dimensional (3D) reconstruction (e.g., stereoscopy) , indexing and content retrieval, motion tracking, object recognition, and many others.
A basic requirement in many augmented reality or virtual reality (AR/VR) applications is to determine a device's position and orientation in 3D space. Such applications may use a simultaneous localization and mapping (SLAM) algorithm to determine a device's real-time position and orientation and to infer a structure of the environment (or "scene" ) within which the device is operating. In one example of a SLAM application, frames of a video sequence from the device's camera are input to a module that executes a SLAM algorithm. Features are extracted from the frames and matched across different frames, and the SLAM algorithm searches for matched features that correspond to the same spot in the scene being captured. By tracking the positions of the features across different frames, the SLAM module can determine the image sensor's motion within the scene and infer the major structure of the scene.
In traditional feature matching, keypoint detection is performed on each of a plurality of images (e.g., on each frame of a video sequence) , and a corresponding keypoint descriptor is calculated for each of the detected keypoints from its neighborhood (typically in scale space) . The number of keypoints detected for each image is typically at least a few dozen and may be as high as five hundred or more, and the neighborhood from which a keypoint descriptor is calculated typically has a radius of about fifteen pixels around the keypoint. As illustrated in FIG. 1, a keypoint descriptor f0 from a first image I0 (e.g., a frame of the video sequence) and a keypoint descriptor f1 from a second image I1 (e.g., a different frame of the video sequence, such as a consecutive frame in the sequence) are used to compute a matching score. The score metric is usually a distance d (f0, f1) between the keypoint descriptors f0 and f1 in the descriptor space, such as a distance according to any of the example distance metrics below. This score computation is repeated for different pairs of keypoint descriptors from the two images, and the resulting scores are thresholded to identify matching features: e.g., to determine whether the current pair of keypoint descriptors (and thus the corresponding features from the two images) match. In a typical SLAM application, a pair of matched features corresponds to a single point in the physical environment, and this correspondence leads to a math constraint. A later stage of the SLAM computation may derive camera motion and an environmental model as an optimum solution that satisfies multiple constraints, including constraints generated by matching feature pairs.
Examples of distance metrics that may be used for matching-score computation include the Euclidean, city-block, chi-squared, cosine, and Minkowski distances. Assuming that f0 and f1 are n-dimensional vectors such that $f_0 = (x_{0,1}, x_{0,2}, x_{0,3}, \ldots, x_{0,n})$ and $f_1 = (x_{1,1}, x_{1,2}, x_{1,3}, \ldots, x_{1,n})$, the distance d (f0, f1) between them may be described according to these distance metrics as follows:

Euclidean distance:

$$d(f_0, f_1) = \sqrt{\sum_{i=1}^{n} \left( x_{0,i} - x_{1,i} \right)^2}$$

City-block distance:

$$d(f_0, f_1) = \sum_{i=1}^{n} \left| x_{0,i} - x_{1,i} \right|$$

Cosine distance:

$$d(f_0, f_1) = 1 - \frac{f_0 \cdot f_1}{\lVert f_0 \rVert \, \lVert f_1 \rVert}$$

where

$$f_0 \cdot f_1 = \sum_{i=1}^{n} x_{0,i}\, x_{1,i}, \qquad \lVert f \rVert = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Chi-squared distance (assuming that the values of all elements of f0 and f1 are larger than zero):

$$d(f_0, f_1) = \frac{1}{2} \sum_{i=1}^{n} \frac{\left( x_{0,i} - x_{1,i} \right)^2}{x_{0,i} + x_{1,i}}$$

Minkowski distance (also called generalized Euclidean distance):

$$d(f_0, f_1) = \left( \sum_{i=1}^{n} \left| x_{0,i} - x_{1,i} \right|^p \right)^{1/p}, \quad p \geq 1$$
Thresholding of the resulting scores to determine whether the corresponding features match may be performed according to a procedure such as the following:

$$\text{match}(f_0, f_1) = \begin{cases} \text{matched}, & d(f_0, f_1) < T \\ \text{not matched}, & \text{otherwise} \end{cases}$$

where T denotes a threshold value.
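For illustration only, the following NumPy sketch expresses the distance metrics and thresholding procedure above; the descriptor dimensionality, the example threshold value T, and the choice of metric are placeholders for whatever an implementation actually uses.

```python
import numpy as np

def euclidean(f0, f1):
    return np.sqrt(np.sum((f0 - f1) ** 2))

def city_block(f0, f1):
    return np.sum(np.abs(f0 - f1))

def cosine(f0, f1):
    return 1.0 - np.dot(f0, f1) / (np.linalg.norm(f0) * np.linalg.norm(f1))

def chi_squared(f0, f1):
    # assumes all elements of f0 and f1 are larger than zero
    return 0.5 * np.sum((f0 - f1) ** 2 / (f0 + f1))

def minkowski(f0, f1, p=3):
    return np.sum(np.abs(f0 - f1) ** p) ** (1.0 / p)

def is_match(f0, f1, T=0.4, dist=euclidean):
    # threshold the matching score: a small distance suggests the same feature
    return dist(f0, f1) < T

# example with two 128-dimensional descriptors (e.g., SIFT-like vectors)
f0 = np.random.rand(128)
f1 = f0 + 0.01 * np.random.rand(128)
print(is_match(f0, f1))
```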
Examples of AR/VR devices include mobile phones and head-mounted devices (e.g., AR or "smart" glasses). Given the nature of AR/VR devices, many video frames are captured when the image sensor (e.g., a video camera) is moving. As a result, the captured frames may have significant motion blur. If the image sensor's motion during capture of image I0 is the same as the image sensor's motion during capture of image I1, and each of the descriptors f0 and f1 corresponds to the same keypoint in the two images, then the values of the descriptors f0 and f1 tend to be similar such that the computed distance between them is small. However, in practical applications of AR/VR, the image sensor would typically experience different motions when capturing each image, so the descriptors f0 and f1 could be distorted by different motion blurs. Unfortunately, almost all widely used image features (e.g., SIFT, SURF, BRISK) are very sensitive to motion blur, such that any significant motion blur (e.g., a blur of five pixels or more) is likely to cause the keypoint descriptors to be distorted. As a result, the values of the descriptors f0 and f1 may be very different, even if the descriptors correspond to the same keypoint. When such features are extracted from images with different motion blurs and a scoring metric as described above is used, feature matching is likely to fail.
For many applications in which the image sensor may be in motion (e.g., AR/VR applications) , it is possible to quantify the image motion during each capture interval. Motion direction and magnitude can be estimated from input of one or more motion sensors of the device, for example, and/or can be calculated from two temporal neighbor frames in the video sequence. Motion sensors (which may include one or more gyroscopes, accelerometers, and/or magnetometers) may indicate a displacement and/or change in orientation of the device and may be implemented within an inertial measurement unit (IMU) .
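As a non-limiting illustration of such an estimate, the sketch below integrates gyroscope samples over one capture interval to approximate the blur magnitude and direction in the image plane; the focal length, sample rate, small-rotation approximation, and the simplified axis-to-shift mapping are assumptions of this example rather than part of the disclosure.

```python
import numpy as np

def blur_from_gyro(gyro_samples, dt, focal_length_px):
    """Approximate image-plane blur from gyroscope samples taken during one exposure.

    gyro_samples: (N, 3) angular velocities in rad/s
    dt: time between samples in seconds
    focal_length_px: camera focal length in pixels
    """
    # accumulated rotation about each axis during the exposure
    rotation = np.sum(gyro_samples, axis=0) * dt
    # small-angle approximation: a rotation of `angle` radians moves the image
    # by roughly focal_length * angle pixels (axis mapping simplified here)
    shift = focal_length_px * rotation[:2]
    magnitude = np.linalg.norm(shift)            # blur length in pixels
    direction = np.arctan2(shift[1], shift[0])   # blur direction in radians
    return magnitude, direction

# example: 10 gyro samples over a 1/120 s exposure sampled at 1200 Hz
samples = np.tile(np.array([[0.0, 0.5, 0.0]]), (10, 1))
print(blur_from_gyro(samples, dt=1.0 / 1200, focal_length_px=500.0))
```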
Examples of techniques for coping with motion blur may include the following:
1) Don't Match: Since the matching of image features in blurred images becomes so unreliable, one potential solution is to give up image feature matching completely, at least between pairs of images that have significant and different motion blurs.
2) Deblur Image First: Before the image is used for feature extraction, a deblurring operation is performed to remove the motion blur from the image.
3) Compensate for Motion Blur When Calculating Keypoint Descriptors: An estimate of the image  motion is used to compensate for the motion blur's impact when calculating the keypoint descriptor. This method differs from "Deblur Image First" in that the motion blur removal or compensation is performed on the neighborhood of the keypoint rather than on the whole image.
4) Extract Blur-Invariant Feature: When calculating keypoint descriptors from the neighborhoods of the keypoints, one only uses components that are blur-invariant and ignores those that are sensitive to motion blur. Therefore, the keypoint descriptor will remain roughly the same even when the image is motion-blurred.
Shortcomings of the above approaches may include the following:
1) Don't Match: To prevent false matches of features in blurred images from degrading the estimates in SLAM, one may choose not to do feature matching at all when significant image motion has been detected. For example, one may choose to perform SLAM using only motion sensor output at these times. However, this approach may cause the image sensor output at such moments to be completely wasted, and the SLAM calculation may in turn become less accurate and less stable.
2) Deblur Image First: Image deblurring usually involves significant computation. Since the SLAM computations are commonly carried out on a mobile platform, the additional computation required for deblurring processing may not be always available or affordable. Moreover, the deblurring operation tends to add new artifacts to the original image, which in turn are likely to negatively impact the image feature matching accuracy.
3) Compensate for Motion Blur When Calculating Keypoint Descriptor: Compensating for motion blur when calculating a keypoint descriptor typically requires less extra computation than deblurring the entire image, since the computation only involves the neighborhood of the keypoint. However, the drawback of introducing new artifacts into the image still exists.
4) Extract Blur-Invariant Feature: Since this method ignores components that are sensitive to motion blur, less information is available to perform feature matching. In other words, this approach may increase matching accuracy and stability for cases of large camera motion at the cost of reducing matching performance for other cases (e.g., cases in which motion is not obvious) .
It may be desirable to increase a robustness of a keypoint descriptor framework to motion blur. Accordingly, embodiments described herein, implemented using appropriate systems, methods, apparatus, devices, and the like, as disclosed herein, may support increased accuracy of feature matching operations in applications that are prone to motion blur. The embodiments described herein can be implemented in any of a variety of applications that use feature matching, including image alignment (e.g., image stitching, image registration, panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, endoscopic imaging, motion tracking, object tracking, object recognition, automated navigation, SLAM, etc.
According to embodiments of the present invention, a deep learning network is used to function as a keypoint descriptor motion blur converter, or as a keypoint descriptor comparing module, to address the challenge of image feature matching when different motion blurs are present in camera frames. Embodiments of the present invention may be applied, for example, to improve the performance of positioning and mapping in AR/VR applications. Moreover, embodiments of the present invention may increase feature matching accuracy when features are extracted from camera frames that are captured with different image motions. As a result, SLAM calculation may be more accurate and more stable when the device is undergoing quick motion, because one or more previously wasted blurred camera frames may now be effectively used for positioning and mapping estimation. Thus, embodiments of the present invention avoid shortcomings of existing approaches, such as lower matching accuracy when there is little motion or when the images are blurred by similar motions.
FIG. 1 shows a simplified schematic diagram of a keypoint descriptor converter 100 according to an embodiment of the present invention. As illustrated in FIG. 1, the conversion is performed on a keypoint descriptor f0 that is extracted from image I0. The descriptor converter takes three inputs: the keypoint descriptor f0; a motion descriptor M0 of the image motion when image I0 was captured; and a motion descriptor M1 of the image motion when image I1 was captured. The output of the converter is a converted keypoint descriptor f1′. The design goal is for the converted keypoint descriptor f1′ to be similar to keypoint descriptor f1 in image I1, if the descriptors f0 and f1 correspond to the same keypoint, and for the converted descriptor f1′ and the descriptor f1 to be very different if the descriptors f0 and f1 refer to different keypoints.
FIG. 2 shows a simplified flowchart of a method of image processing according to an embodiment of the present invention. The method 200 of image processing illustrated in FIG. 2 includes tasks 210, 220, 230, and 240. Task 210 calculates a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period. Task 220 calculates a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period. Task 230 uses a trained artificial neural network (ANN) to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor. Task 240 compares the converted descriptor to the calculated descriptor for the keypoint in the second image. For example, task 240 may include calculating a distance between the converted descriptor and the calculated descriptor in the descriptor space (e.g., according to a distance metric as disclosed herein, such as Euclidean distance, chi-squared distance, etc.) and comparing the calculated distance to a threshold value (e.g., according to a procedure as described above).
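A minimal sketch of how tasks 210 through 240 might be chained is shown below; compute_descriptor and converter_ann are hypothetical stand-ins for the chosen descriptor framework and the trained ANN, and the threshold is an arbitrary example value.

```python
import numpy as np

def match_across_blur(patch0, patch1, m0, m1, compute_descriptor,
                      converter_ann, threshold=0.4):
    """Chain tasks 210-240: describe, convert, compare."""
    f0 = compute_descriptor(patch0)           # task 210: descriptor in first image
    f1 = compute_descriptor(patch1)           # task 220: descriptor in second image
    f1_prime = converter_ann(f0, m0, m1)      # task 230: ANN-converted descriptor
    distance = np.linalg.norm(f1_prime - f1)  # task 240: distance in descriptor space
    return distance < threshold               # thresholded matching decision
```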
It should be appreciated that the specific steps illustrated in FIG. 2 provide a particular method of performing image processing according to an embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 2 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
A keypoint is a point of an image that is distinctive from other points in the image, has a well-defined spatial position or is otherwise localized within the image, and is stable under local and global variations (e.g., changes in scale, changes in illumination, etc. ) .
FIG. 3A shows an example of an image (e.g., a frame of a video sequence) according to an embodiment of the present invention. FIG. 3B shows examples of keypoints in the image illustrated in FIG. 3A according to an embodiment of the present invention. Referring to FIG. 3B, the circles indicate the locations of a few examples of keypoints 310-322 in the image. The number of keypoints detected in each image in a typical feature matching application is at least a dozen, twenty-five, or fifty and may range up to one hundred, two hundred, or five hundred or more.
In a typical application of feature matching, method 200 illustrated in FIG. 2 is repeated, for multiple different pairs (f0, f1) of keypoint descriptors, over consecutive pairs of frames of a video sequence (e.g., on each consecutive pair of frames) . In one example, for each of a plurality of keypoints detected in the first image (each having a corresponding location in the first image) , method 200 is repeated for each pair that comprises the descriptor of the keypoint in the first image and the descriptor of a keypoint that is within a threshold distance (e.g., twenty pixels) of the same location in the second image.
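A possible way to organize this repetition is sketched below; the 20-pixel search radius follows the example above, and the keypoint locations are placeholder arrays.

```python
import numpy as np

def candidate_pairs(locs0, locs1, radius=20.0):
    """Yield index pairs (i, j) where keypoint j in the second image lies
    within `radius` pixels of the location of keypoint i in the first image."""
    for i, p0 in enumerate(locs0):
        dists = np.linalg.norm(locs1 - p0, axis=1)
        for j in np.nonzero(dists < radius)[0]:
            yield i, int(j)

# example: a few keypoint locations (x, y) in each image
locs0 = np.array([[100.0, 50.0], [300.0, 200.0]])
locs1 = np.array([[105.0, 48.0], [400.0, 220.0]])
print(list(candidate_pairs(locs0, locs1)))
```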
Method 200 is not limited to any particular size or format of the first and second images, but examples from typical feature matching applications are now provided. Most current AR/VR devices are configured to capture video in VGA format (i.e., having a frame size of 640 x 480 pixels) , with each pixel having a red, green, and blue component. The largest frame format typically seen in such devices is 1280 x 720 pixels, such that a maximum size of each of the first and second images in a typical application is about one thousand by two thousand pixels. The minimum size of each of the first and second images in a typical application is about one-quarter VGA (i.e., 320 x 240 pixels) , as a smaller image size would likely not be enough to support an algorithm such as SLAM.
Keypoint  descriptor calculation tasks  210 and 220 may be implemented using an existing keypoint descriptor framework, such as Scale-Invariant Feature Transform (SIFT) , Speeded-Up Robust Features (SURF) , Binary Robust Invariant Scalable Keypoints (BRISK) , etc. Such a task may comprise calculating an orientation of the keypoint, which may include determining how, or in what direction, a pixel neighborhood (also called an "image patch" ) that surrounds the keypoint is oriented. Calculating an orientation of the keypoint, which may include detecting the most dominant orientation of the gradient angles in the patch, is typically performed on the patch at different scales of a scale space. The SIFT framework, for example, assigns a 128-dimensional feature vector to each keypoint based on the gradient orientations of pixels in sixteen local neighborhoods of the keypoint.
Some keypoint descriptor frameworks (e.g., SIFT and SURF) include both keypoint detection and keypoint descriptor calculation. Other keypoint descriptor frameworks (e.g., Binary Robust Independent Elementary Features (BRIEF) ) include keypoint descriptor calculation but not keypoint detection. For a case in which the latter type of framework is used to implement  tasks  210 and 220, it may be desired to perform a keypoint detection operation on the first image to select the corresponding keypoint before performing task 210, and to perform the keypoint detection operation on the second image to select the corresponding keypoint before performing task 220. In either case, it may be desired to implement the methods described herein to skip an image pair in response to a determination that either of the first and second motion descriptors has a value that exceeds a threshold value (e.g., a blur of twenty pixels) , as keypoint detection is unlikely to succeed for an image having such an extensive blur.
FIG. 4 shows a simplified flowchart of a method of performing image processing according to another embodiment of the present invention. As illustrated in FIG. 4, elements utilized in method 200 are also utilized in method 400 illustrated in FIG. 4, as well as additional elements as described below. Accordingly, the description provided in relation to FIG. 2 is applicable to FIG. 4 as appropriate. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
Referring to FIG. 4, task 410 selects the keypoint in the first image and task 420 selects the keypoint in the second image. Examples of keypoint detectors that may be used to implement  tasks  410 and 420 include corner detectors (e.g., Harris corner detector, Features from Accelerated Segment Test (FAST) ) and blob detectors (e.g., Laplacian of Gaussian (LoG) , Difference of Gaussians (DoG) , Determinant of Hessian (DoH) ) . Such a keypoint detector is typically configured to blur and resample the image with different blur widths and sampling rates to create a scale space, and to detect corners and/or blobs at different scales.
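For concreteness, the sketch below selects keypoints with the FAST corner detector and then calculates BRISK descriptors using OpenCV; the detector threshold and the synthetic test image are arbitrary choices of the example, and any of the detectors and descriptor frameworks named herein could be substituted.

```python
import cv2
import numpy as np

# synthetic grayscale frame standing in for a captured image
image = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

# keypoint selection (cf. tasks 410 and 420) with the FAST corner detector
detector = cv2.FastFeatureDetector_create(threshold=25)
keypoints = detector.detect(image, None)

# keypoint descriptor calculation (cf. tasks 210 and 220) with the BRISK framework
brisk = cv2.BRISK_create()
keypoints, descriptors = brisk.compute(image, keypoints)
print(len(keypoints), None if descriptors is None else descriptors.shape)
```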
The first and second motion descriptors describe a motion of the image sensor during capture of the first and second images, respectively, and may be calculated from motion sensor (e.g., IMU) outputs and/or from neighboring frames in a video sequence (e.g., from the frame captured just before and the frame captured just after the frame for which the motion descriptor is being calculated) . The capture period for each frame in a video sequence is typically the reciprocal of the frame rate, although it is possible for the capture period to be shorter. The frame rate for a typical video sequence (e.g., as captured by an Android phone) is thirty frames per second (fps) . The frame rate for an iPhone or head-mounted device can be as high as 120 fps.
Each motion descriptor may be implemented to describe a trajectory or path in a coordinate space of one, two, or three spatial dimensions. Such a trajectory may be described as a sequence of one or more positions of the image sensor sampled at uniform intervals during the corresponding capture period, and each sampled position may be expressed as a motion vector relative to the previous sampled position (e.g., taking the position at the start of the capture period to be the origin of the coordinate space) . In one such example, the first motion descriptor describes a first trajectory having three dimensions in space, and the second motion descriptor describes a second trajectory having three dimensions in space that is different than the first trajectory.
Each motion descriptor may be further implemented to describe a motion in six degrees of freedom (6DOF) . In addition to three spatial dimensions, such a motion may include a rotation about each of one or more of the axes of these dimensions. As shown in FIG. 5, which shows an illustration of six degrees of freedom (6DOF) , these rotations may be labeled as tilt, pitch, and yaw. For example, the motion descriptor  may include, for each sampled position of the image sensor, a corresponding orientation of a reference direction of the image sensor (e.g., the look direction) relative to the orientation at the previous sampled position (e.g., taking the reference orientation to be the orientation at the start of the capture period) .
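One possible encoding of such a motion descriptor is sketched below: positions and orientations sampled at uniform intervals during the capture period are expressed relative to the previous sample and flattened into a fixed-length vector. The sample count and the axis-angle orientation encoding are choices of this example, not requirements of the embodiments.

```python
import numpy as np

def motion_descriptor(positions, orientations):
    """Build a 6DOF motion descriptor from samples taken during one capture period.

    positions: (N, 3) sensor positions, with the first sample taken as the origin
    orientations: (N, 3) axis-angle orientations of the reference (look) direction
    Returns a flat vector of per-step motion (delta position, delta orientation).
    """
    dpos = np.diff(positions, axis=0)     # motion vector relative to previous sample
    dori = np.diff(orientations, axis=0)  # orientation change relative to previous sample
    return np.concatenate([dpos, dori], axis=1).ravel()

# example: 5 samples during one capture period -> 4 relative steps, 24 values
pos = np.cumsum(np.full((5, 3), 0.001), axis=0)
ori = np.zeros((5, 3))
print(motion_descriptor(pos, ori).shape)
```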
Referring once again to FIG. 4, task 230 uses a trained artificial neural network (ANN) to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor. Examples of ANNs that can be trained to perform such a complex conversion between multi-element vectors include convolutional neural networks (CNNs) and auto-encoders. It may be desired to implement the ANN to be rather small and fast: for example, to include less than ten thousand parameters, or less than five thousand parameters, and/or for the trained ANN to occupy less than five megabytes of storage. In one example, the ANN is implemented such that the input layer and the output layer are each arrays of size 32 x 32. In a typical production environment, a copy of the trained ANN is stored (e.g., during manufacture and/or provisioning) to each of a run of devices having the same model of video camera (and, possibly, the same model of IMU) . It may be desired to normalize the values of the motion descriptors to occupy the same range as the values of the calculated keypoint descriptor before input to the trained ANN.
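As one illustration of a converter network satisfying the size guidelines above, the PyTorch sketch below concatenates the keypoint descriptor with the two motion descriptors and maps the result to a converted descriptor; the descriptor and motion dimensions, layer widths, and the simple fully connected structure are assumptions of the example.

```python
import torch
import torch.nn as nn

class DescriptorConverter(nn.Module):
    """Maps (f0, M0, M1) to a converted descriptor f1' (cf. FIG. 1)."""
    def __init__(self, desc_dim=64, motion_dim=24, hidden=48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(desc_dim + 2 * motion_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, desc_dim),
        )

    def forward(self, f0, m0, m1):
        # the motion descriptors are assumed to be normalized to the same
        # value range as the keypoint descriptor before this call
        return self.net(torch.cat([f0, m0, m1], dim=-1))

model = DescriptorConverter()
print(sum(p.numel() for p in model.parameters()))  # well under ten thousand parameters
f1_prime = model(torch.rand(1, 64), torch.rand(1, 24), torch.rand(1, 24))
```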
It should be appreciated that the specific steps illustrated in FIG. 4 provide a particular method of performing image processing according to another embodiment of the present invention. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
It may be desired for the training data to be adequate for the trained ANN to encapsulate complex logic and computation. Image motion blur can be simulated by image processing operations (such as directional filtering, etc. ) , and one or more such operations may be used to produce a large amount of synthetic training data.
FIG. 6 shows a simplified flowchart illustrating a method of generating training data for training a network according to an embodiment of the present invention. As shown in FIG. 6, for example, for each of a set of training images 620, a number of keypoints 622 may be detected; a number of different motion blurs (each being described by a corresponding one of a set of motion descriptors 610a-610n) may then be applied to the image to generate a corresponding number of blurred images B1-Bn; and keypoint descriptors for the detected keypoints (and corresponding motion descriptors) may be calculated from each of the blurred images.
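The directional filtering mentioned above could be realized, for example, with a simple linear motion-blur kernel as in the following sketch; the kernel construction, blur lengths, and angles are illustrative only.

```python
import cv2
import numpy as np

def motion_blur_kernel(length, angle_deg):
    """Linear motion-blur kernel of a given length (pixels) and direction."""
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0 / length  # horizontal line of motion
    center = (length / 2 - 0.5, length / 2 - 0.5)
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))
    s = kernel.sum()
    return kernel / s if s > 0 else kernel

# apply several different synthetic blurs to one training image (cf. FIG. 6)
image = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
blurred = [cv2.filter2D(image, -1, motion_blur_kernel(l, a))
           for l, a in [(5, 0), (9, 45), (15, 90)]]
```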
FIG. 7 shows a simplified flowchart illustrating a method of training an ANN according to an embodiment of the present invention. As shown in FIG. 7, the corresponding calculated keypoint descriptors provide ground truth for the loss function during training of the ANN. It may be desired to augment the synthetic training data with descriptors calculated from a relatively small number of images that have real motion blur and are manually annotated by human annotators.
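Training against this ground truth might then resemble the sketch below, in which the converted descriptor is driven toward the descriptor actually calculated from the copy of the image blurred by the second motion; the stand-in network dimensions, the L2 loss, and the optimizer settings are example choices.

```python
import torch
import torch.nn as nn

# stands in for the converter network sketched earlier (hypothetical dimensions)
model = nn.Sequential(nn.Linear(112, 48), nn.ReLU(), nn.Linear(48, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(f0, m0, m1, f1_ground_truth):
    """One step: the converted descriptor approaches the ground-truth descriptor
    computed from the image blurred by the second motion."""
    optimizer.zero_grad()
    f1_prime = model(torch.cat([f0, m0, m1], dim=-1))
    loss = loss_fn(f1_prime, f1_ground_truth)
    loss.backward()
    optimizer.step()
    return loss.item()

# example step with random stand-in tensors (batch of 8 training tuples)
print(training_step(torch.rand(8, 64), torch.rand(8, 24),
                    torch.rand(8, 24), torch.rand(8, 64)))
```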
Inference with a deep learning network usually involves much more computation than calculating a distance between two keypoint descriptors. Therefore, to save computation, it may be desired to implement method 200 or 400 to compare the value of the first motion descriptor to the value of the second motion descriptor, and to avoid using the trained network (e.g., to use a traditional scoring metric instead) for cases in which the comparison indicates that the first and second motion descriptors have similar values. Method 200 or 400 may be implemented, for example, to include a task that calculates a distance between the first motion descriptor and the second motion descriptor and compares the distance to a threshold value. Such an implementation of method 200 or 400 may be configured to use a score metric (e.g., a distance as described above), rather than the trained network, to determine whether keypoint descriptors from the first image match keypoint descriptors from the second image, in response to an indication by the motion descriptor comparison task that the motion blur of the first image is similar to the motion blur of the second image.
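A sketch of that computation-saving dispatch is given below; converter_ann is a hypothetical stand-in for the trained network, and both thresholds are arbitrary example values.

```python
import numpy as np

def descriptors_match(f0, f1, m0, m1, converter_ann,
                      motion_threshold=1.0, score_threshold=0.4):
    """Use the cheap score metric when the two capture motions are similar;
    fall back to the trained converter only when they differ significantly."""
    if np.linalg.norm(m0 - m1) < motion_threshold:
        return np.linalg.norm(f0 - f1) < score_threshold   # traditional metric
    f1_prime = converter_ann(f0, m0, m1)                   # trained-network path
    return np.linalg.norm(f1_prime - f1) < score_threshold
```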
FIG. 8 shows a simplified block diagram of an apparatus according to an embodiment of the present invention. As an example, the apparatus 800 illustrated in FIG. 8 can be utilized for image processing on a mobile device (e.g., a cellular telephone, such as a smartphone, or a head-mounted device) according to a general configuration that includes keypoint descriptor calculator 810, keypoint descriptor converter 820, and keypoint descriptor comparer 830.
Keypoint descriptor calculator 810 is configured to calculate a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period and to calculate a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period (e.g., as described herein with reference to  tasks  210 and 220, respectively) . Keypoint descriptor converter 820 is configured to use a trained ANN to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on a first motion descriptor that describes motion of the image sensor during the first time period and a second motion descriptor that describes motion of the image sensor during the second time period (e.g., as described herein with reference to task 230) . Keypoint descriptor comparer 830 is configured to compare the converted descriptor to the calculated descriptor for the keypoint in the second image (e.g., as described herein with reference to task 240) .
In one example, apparatus 800 is implemented within a device such as a mobile phone, which typically has a video camera configured to produce a sequence of frames that include the first and second images. The device may also include one or more motion sensors, which may be configured to determine 6DOF motion of the device in space. In another example, apparatus 800 is implemented within a head-mounted device, such as a set of AR glasses, which may also have motion sensors and one or more cameras.
FIG. 9 shows a simplified flowchart illustrating a method of performing feature matching using a keypoint descriptor comparing module 900 according to an embodiment of the present invention. As illustrated in FIG. 9, the keypoint descriptor comparing module 900 is used to determine whether two keypoint descriptors match, even if the features belong to two images with different image motions. Using keypoint descriptor comparing module 900 is a more holistic way of addressing failure of feature matching caused by motion blur. As shown in FIG. 9, keypoint descriptor comparing module 900 receives four inputs, which include the two keypoint descriptors, f0 and f1, and the motion descriptors, M0 and M1, that correspond to the motion blurs of the source images of f0 and f1. The output is a binary decision 910 (i.e., whether or not f0 and f1 match) as well as a value P 912 denoting the confidence of the output binary decision.
For reasons similar to those stated above, a deep learning network is trained to produce the match indication and confidence value outputs of the keypoint descriptor comparing module 900. In this case, the network may be implemented as a classifier network as known in the field of deep learning. Given adequate training data, such a classifier usually produces a good output. Training of such a network (e.g., a CNN) may be performed using training data obtained as described above (e.g., with reference to FIG. 6).
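One possible shape for such a classifier is sketched below; it consumes the two keypoint descriptors and the two motion descriptors and emits the confidence value P, from which the binary decision is obtained by thresholding. The input dimensions and layer widths are assumptions of the example.

```python
import torch
import torch.nn as nn

class DescriptorComparer(nn.Module):
    """Classifier form of the comparing module: inputs f0, f1, M0, M1;
    outputs a binary match decision and a confidence value P."""
    def __init__(self, desc_dim=64, motion_dim=24, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * desc_dim + 2 * motion_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f0, f1, m0, m1):
        p = torch.sigmoid(self.net(torch.cat([f0, f1, m0, m1], dim=-1)))
        return p > 0.5, p  # binary decision 910 and confidence value P 912

comparer = DescriptorComparer()
decision, confidence = comparer(torch.rand(1, 64), torch.rand(1, 64),
                                torch.rand(1, 24), torch.rand(1, 24))
```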
As compared with keypoint descriptor converter 820, keypoint descriptor comparing module 900 encapsulates the score metric and tends to have higher matching accuracy. On the other hand, comparing module 900 takes more inputs than converter 820 and typically includes a larger network. As a result, this solution tends to have a larger memory footprint and to consume more computational resources. As noted above, it may be desired to use a traditional scoring metric instead when the keypoint descriptors f0 and f1 come from images having similar image motions, in order to save some computation.
The embodiments discussed herein may be implemented in a variety of fields that may include feature matching, such as image alignment (e.g., panoramic mosaics) , 3D reconstruction (e.g., stereoscopy) , indexing and content retrieval, etc. The first and second images are not limited to images produced by a visible-light camera (e.g., in RGB or another color space) ; for example, they may be images produced by a camera that is sensitive to non-visible light (e.g., infrared (IR) , ultraviolet (UV) ) , images produced by a structured light camera, and/or images produced by an image sensor other than a camera (e.g., imaging using RADAR, LIDAR, SONAR, etc. ) . Moreover, the embodiments described herein may also be extended beyond motion blur to cover other factors that may distort keypoint descriptors, such as illumination change, etc.
FIG. 10 shows a block diagram of a computer system 1000 according to an embodiment of the  present invention. As described herein, computer system 1000 and components thereof may be configured to perform an implementation of a method as described herein. Although these components are illustrated as belonging to a same computer system 1000 (e.g., a smartphone or head-mounted device) , computer system 1000 may also be implemented such that the components are distributed (e.g., among different servers, among a smartphone and one or more network entities, etc. ) .
The computer system 1000 includes at least a processor 1002, a memory 1004, a storage device 1006, input/output peripherals (I/O) 1008, communication peripherals 1010, and an interface bus 1012. The interface bus 1012 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 1000. The memory 1004 and/or the storage device 1006 may be configured to store the first and second images (e.g., to store frames of a video sequence) and may include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM), hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage (for example, flash memory), and other tangible storage media. Any of such computer-readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure. The memory 1004 and the storage device 1006 also include computer-readable signal media. A computer-readable signal medium includes a propagated data signal with computer-readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof. A computer-readable signal medium includes any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 1000.
Further, the memory 1004 includes an operating system, programs, and applications. The processor 1002 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors. The memory 1004 and/or the processor 1002 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center. The I/O peripherals 1008 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices (e.g., an image sensor configured to capture the images to be indexed) , and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. The I/O peripherals 1008 are connected to the processor 1002 through any of the ports coupled to the interface bus 1012. The communication peripherals 1010 are configured to facilitate communication between the computer system 1000 and other computing devices (e.g., cloud computing entities configured to perform portions of indexing and/or query searching methods as described herein) over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the present disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the present disclosure.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as "processing, " "computing, " "calculating, " "determining, " and "identifying" or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or  configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied -for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The terms "comprising, " "including, " "having, " and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term "or" is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term "or" means one, some, or all of the elements in the list. The use of "adapted to" or "configured to" herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed examples.
The various elements of an implementation of an apparatus or system as disclosed herein (e.g., apparatus 800) may be embodied in any combination of hardware with software and/or with firmware that is deemed suitable for the intended application. For example, such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips) . Such an apparatus may also be implemented to include a memory configured to store the first and second images.
A processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips). Examples of such arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs (digital signal processors), FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors. It is possible for a processor as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to a procedure of an implementation of method 200 or 400 (or another method as disclosed with reference to operation of an apparatus or system described herein), such as a task relating to another operation of a device or system in which the processor is embedded (e.g., a voice communications device, such as a smartphone, or a smart speaker). It is also possible for part of a method as disclosed herein to be performed under the control of one or more other processors.
Each of the tasks of the methods disclosed herein (e.g., methods 200 and/or 400) may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions) , embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc. ) , that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine) . The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP) . For example, such a device may include RF circuitry configured to receive and/or transmit encoded frames.
In one or more exemplary embodiments, the operations described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, such operations may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term "computer-readable media" includes both computer-readable storage media and communication (e.g., transmission) media. By way of example, and not limitation, computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM) , or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices. Such storage media may store information in the form of instructions or data structures that can be accessed by a computer. Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL) , or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave are included in the definition of medium. Disk and disc, as used herein, includes compact disc (CD) , laser disc, optical disc, digital versatile disc (DVD) , floppy disk and Blu-ray DiscTM (Blu-Ray Disc Association, Universal City, Calif. ) , where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
In one example, a non-transitory computer-readable storage medium comprises code which, when executed by at least one processor, causes the at least one processor to perform a method of image processing as described herein (e.g., method 200 or 400) . Further examples of such a storage medium include a medium further comprising code which, when executed by the at least one processor, causes the at least one processor to perform a method of image processing as described herein.
Unless expressly limited by its context, the term "signal" is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term "generating" is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term "calculating" is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values. Unless expressly limited by its context, the term "obtaining" is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device) , and/or retrieving (e.g., from an array of storage elements) . Unless  expressly limited by its context, the term "selecting" is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more. Unless expressly limited by its context, the term "determining" is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating. Where the term "comprising" is used in the present description and claims, it does not exclude other elements or operations. The term "based on" (as in "A is based on B" ) is used to indicate any of its ordinary meanings, including the cases (i) "derived from" (e.g., "B is a precursor of A" ) , (ii) "based on at least" (e.g., "A is based on at least B" ) and, if appropriate in the particular context, (iii) "equal to" (e.g., "A is equal to B" ) . Similarly, the term "in response to" is used to indicate any of its ordinary meanings, including "in response to at least. " Unless otherwise indicated, the terms "at least one of A, B, and C, " "one or more of A, B, and C, " "at least one among A, B, and C, " and "one or more among A, B, and C" indicate "A  and/or B and/or C. " Unless otherwise indicated, the terms "each of A, B, and C" and "each among A, B, and C" indicate "A and B and C. "
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa) , and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa) . The term "configuration" may be used in reference to a method, apparatus, and/or system as indicated by its particular context. The terms "method, " "process, " "procedure, " and "technique" are used generically and interchangeably unless otherwise indicated by the particular context. A "task" having multiple subtasks is also a method. The terms "apparatus" and "device" are also used generically and interchangeably unless otherwise indicated by the particular context. The terms "element" and "module" are typically used to indicate a portion of a greater configuration. Unless expressly limited by its context, the term "system" is used herein to indicate any of its ordinary meanings, including "a group of elements that interact to serve a common purpose. "
Unless initially introduced by a definite article, an ordinal term (e.g., "first, " "second, " "third, " etc. ) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term) . Unless expressly limited by its context, each of the terms "plurality" and "set" is used herein to indicate an integer quantity that is greater than one.
The previous description is provided to enable a person skilled in the art to make or use the disclosed implementations. Various modifications to these implementations will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other implementations without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the implementations shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.

Claims (20)

  1. A method of image processing, the method comprising:
    calculating a descriptor for a keypoint in a first image that was captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period;
    calculating a descriptor for a keypoint in a second image that was captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period;
    using a trained artificial neural network to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and
    comparing the converted descriptor to the calculated descriptor for the keypoint in the second image.
  2. The method of claim 1 wherein:
    the first motion descriptor describes a first trajectory having three dimensions in space; and
    the second motion descriptor describes a second trajectory having three dimensions in space that is different than the first trajectory.
  3. The method of claim 1 wherein each of the first motion descriptor and the second motion descriptor describe a motion in six degrees of freedom.
  4. The method of claim 1 wherein:
    the calculated descriptor for the keypoint in the first image describes a neighborhood of the keypoint in the first image; and
    the calculated descriptor for the keypoint in the second image describes a neighborhood of the keypoint in the second image.
  5. The method of claim 4 wherein the motion described by the first motion descriptor is based on information from an image captured by the image sensor that is not the first image.
  6. The method of claim 4 further comprising determining that a distance between the first motion descriptor and the second motion descriptor is not less than a threshold value, wherein using the trained artificial neural network is contingent on the determining.
  7. The method of claim 1 wherein the method further comprises selecting the keypoint in the first image and selecting the keypoint in the second image.
  8. A computer system including:
    one or more memories configured to store:
    a first image captured by an image sensor during a first time period, and
    a second image captured by the image sensor during a second time period different than the first time period; and
    one or more processors; wherein:
    the one or more memories are further configured to store computer-readable instructions that, upon execution by the one or more processors, configure the computer system to:
    calculate a descriptor for a keypoint in the first image, wherein a first motion descriptor describes motion of the image sensor during the first time period;
    calculate a descriptor for a keypoint in the second image, wherein a second motion descriptor describes motion of the image sensor during the second time period;
    use a trained artificial neural network to convert the calculated descriptor for the keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and
    compare the converted descriptor to the calculated descriptor for the keypoint in the second image.
  9. The computer system of claim 8 wherein:
    the first motion descriptor describes a first trajectory having three dimensions in space; and
    the second motion descriptor describes a second trajectory having three dimensions in space that is different than the first trajectory.
  10. The computer system of claim 8 wherein each of the first motion descriptor and the second motion descriptor describe a motion in six degrees of freedom.
  11. The computer system of claim 8 wherein:
    the calculated descriptor for the keypoint in the first image describes a neighborhood of the keypoint in the first image, and
    the calculated descriptor for the keypoint in the second image describes a neighborhood of the keypoint in the second image.
  12. The computer system of claim 8 wherein the motion described by the first motion descriptor is based on information from an image captured by the image sensor that is not the first image.
  13. The computer system of claim 8 wherein the computer-readable instructions are further operable to configure the computer system to determine that a distance between the first motion descriptor and the second motion descriptor is not less than a threshold value, wherein using the trained artificial neural network is contingent on the determining.
  14. The computer system of claim 8 further comprising selecting the keypoint in the first image and selecting the keypoint in the second image.
  15. One or more non-transitory computer-storage media storing instructions that, upon execution on a computer system, cause the computer system to perform operations including:
    selecting a keypoint in a first image captured by an image sensor during a first time period, wherein a first motion descriptor describes motion of the image sensor during the first time period;
    calculating a corresponding descriptor for the selected keypoint in the first image;
    selecting a keypoint in a second image captured by the image sensor during a second time period that is different than the first time period, wherein a second motion descriptor describes motion of the image sensor during the second time period;
    calculating a corresponding descriptor for the selected keypoint in the second image;
    using a trained artificial neural network to convert the calculated descriptor for the selected keypoint in the first image to a converted descriptor, based on the first motion descriptor and the second motion descriptor; and
    comparing the converted descriptor to the calculated descriptor for the selected keypoint in the second image.
  16. The one or more non-transitory computer-storage media of claim 15 wherein:
    the first motion descriptor describes a first trajectory having three dimensions in space; and
    the second motion descriptor describes a second trajectory having three dimensions in space that is different than the first trajectory.
  17. The one or more non-transitory computer-storage media of claim 15 wherein each of the first motion descriptor and the second motion descriptor describes a motion in six degrees of freedom.
  18. The one or more non-transitory computer-storage media of claim 15 wherein:
    the calculated descriptor for the keypoint in the first image describes a neighborhood of the keypoint in the first image; and
    the calculated descriptor for the keypoint in the second image describes a neighborhood of the keypoint in the second image.
  19. The one or more non-transitory computer-storage media of claim 18 wherein the motion described by the first motion descriptor is based on information from an image captured by the image sensor that is not the first image.
  20. The one or more non-transitory computer-storage media of claim 18 wherein the instructions further cause the computer system to perform operations including determining that a distance between the first motion descriptor and the second motion descriptor is not less than a threshold value, wherein using the trained artificial neural network is contingent on the determining.
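For readers who want a concrete picture of how the claimed matching flow fits together, the following Python sketch is a minimal, non-authoritative illustration of the steps recited in independent claims 1, 8 and 15 and of the gating recited in claims 6, 13 and 20. Every name in it (motion_descriptor, convert_descriptor, match_keypoints, the single-hidden-layer network used as a stand-in for the trained artificial neural network, and the two thresholds) is a hypothetical choice made for illustration; the claims do not specify a descriptor type, network architecture, motion encoding or threshold values.

```python
import numpy as np


def motion_descriptor(poses):
    """Summarize camera motion during one exposure window.

    `poses` is an (N, 6) array of 6-DoF samples (tx, ty, tz, rx, ry, rz)
    recorded while the frame was captured.  Using the net per-axis
    displacement is just one plausible fixed-length encoding of the
    trajectory described in claims 9, 10, 16 and 17.
    """
    poses = np.asarray(poses, dtype=np.float32)
    return poses[-1] - poses[0]


def convert_descriptor(weights, descriptor, motion_a, motion_b):
    """Toy stand-in for the trained artificial neural network: a single
    hidden-layer MLP mapping (descriptor, motion_a, motion_b) to a
    converted descriptor with the same length as `descriptor`."""
    x = np.concatenate([descriptor, motion_a, motion_b])
    h = np.tanh(weights["w1"] @ x + weights["b1"])
    return weights["w2"] @ h + weights["b2"]


def match_keypoints(desc_a, desc_b, motion_a, motion_b, weights,
                    motion_gate=0.5, match_threshold=0.8):
    """Compare one keypoint descriptor from each frame.

    When the two motion descriptors are close (similar blur in both
    frames) the raw descriptors are compared directly; otherwise the
    first descriptor is converted before comparison, mirroring the
    gating recited in claims 6, 13 and 20.
    """
    if np.linalg.norm(motion_a - motion_b) < motion_gate:
        query = desc_a                       # similar blur: no conversion needed
    else:
        query = convert_descriptor(weights, desc_a, motion_a, motion_b)
    distance = np.linalg.norm(query - desc_b)
    return distance < match_threshold, distance


# Hypothetical usage with 128-dimensional keypoint descriptors and
# randomly initialized (i.e., untrained) stand-in network weights.
rng = np.random.default_rng(0)
weights = {
    "w1": rng.normal(size=(64, 128 + 6 + 6)).astype(np.float32),
    "b1": np.zeros(64, dtype=np.float32),
    "w2": rng.normal(size=(128, 64)).astype(np.float32),
    "b2": np.zeros(128, dtype=np.float32),
}
desc_a = rng.normal(size=128).astype(np.float32)
desc_b = rng.normal(size=128).astype(np.float32)
motion_a = motion_descriptor(rng.normal(size=(10, 6)))
motion_b = motion_descriptor(rng.normal(size=(10, 6)))
matched, distance = match_keypoints(desc_a, desc_b, motion_a, motion_b, weights)
```

In a working system the stand-in MLP would be replaced by the actual trained network and the descriptors would come from a keypoint detection and description stage rather than random vectors; the sketch only shows how the pieces recited in the claims relate to one another.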
PCT/CN2021/076042 (WO2021164615A1, en) — Motion blur robust image feature matching — Priority date: 2020-02-19; Filing date: 2021-02-08

Priority Applications (1)

Application Number: CN202180015780.5A | Priority Date: 2020-02-19 | Filing Date: 2021-02-08 | Title: Motion blur robust image feature matching

Applications Claiming Priority (2)

Application Number: US202062978462P | Priority Date: 2020-02-19 | Filing Date: 2020-02-19
Application Number: US62/978,462 | Priority Date: 2020-02-19

Publications (1)

Publication Number: WO2021164615A1 (en) | Publication Date: 2021-08-26

Family

ID=77392077

Family Applications (1)

Application Number: PCT/CN2021/076042 (WO2021164615A1, en) | Priority Date: 2020-02-19 | Filing Date: 2021-02-08 | Title: Motion blur robust image feature matching

Country Status (2)

Country Link
CN (1) CN115210758A (en)
WO (1) WO2021164615A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952223A (en) * 2017-03-17 2017-07-14 北京邮电大学 Method for registering images and device
US20190005069A1 (en) * 2017-06-28 2019-01-03 Google Inc. Image Retrieval with Deep Local Feature Descriptors and Attention-Based Keypoint Descriptors
CN107968916A (en) * 2017-12-04 2018-04-27 国网山东省电力公司电力科学研究院 A kind of fast video digital image stabilization method suitable for on-fixed scene
US20190318502A1 (en) * 2018-04-12 2019-10-17 Honda Motor Co., Ltd. Feature descriptor matching
CN110516731A (en) * 2019-08-20 2019-11-29 北京影谱科技股份有限公司 A kind of visual odometry feature point detecting method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUSTANIEMI JANNE; KANNALA JUHO; SARKKA SIMO; MATAS JIRI; HEIKKILA JANNE: "Fast Motion Deblurring for Feature Detection and Matching Using Inertial Measurements", 2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 20 August 2018 (2018-08-20), pages 3068 - 3073, XP033457346, DOI: 10.1109/ICPR.2018.8546041 *
WANG, JIASHUN: "Research on Key Technology of Mobile Augmented Reality", MASTER THESIS, no. 02, 15 February 2019 (2019-02-15), CN, pages 1 - 63, XP009529864 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2618526A (en) * 2022-05-03 2023-11-15 Oxa Autonomy Ltd Generating a descriptor associated with data of a first modality

Also Published As

Publication Number: CN115210758A (en) | Publication Date: 2022-10-18

Similar Documents

Publication Publication Date Title
US11557085B2 (en) Neural network processing for multi-object 3D modeling
Hannuna et al. Ds-kcf: a real-time tracker for rgb-d data
DeTone et al. Deep image homography estimation
US10769496B2 (en) Logo detection
US9811733B2 (en) Method, apparatus and system for selecting a frame
US10719727B2 (en) Method and system for determining at least one property related to at least part of a real environment
US9875545B2 (en) Camera pose estimation apparatus and method
US10217221B2 (en) Place recognition algorithm
US9721387B2 (en) Systems and methods for implementing augmented reality
US10204423B2 (en) Visual odometry using object priors
US9008366B1 (en) Bio-inspired method of ground object cueing in airborne motion imagery
US11049270B2 (en) Method and apparatus for calculating depth map based on reliability
US9367762B2 (en) Image processing device and method, and computer readable medium
US9639943B1 (en) Scanning of a handheld object for 3-dimensional reconstruction
US9437011B2 (en) Method and apparatus for estimating a pose of a head for a person
WO2019157922A1 (en) Image processing method and device and ar apparatus
KR20210116953A (en) Method and apparatus for tracking target
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
WO2021164615A1 (en) Motion blur robust image feature matching
US11238309B2 (en) Selecting keypoints in images using descriptor scores
WO2021179905A1 (en) Motion blur robust image feature descriptor
WO2023273227A1 (en) Fingernail recognition method and apparatus, device, and storage medium
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
Prakas et al. Fast and economical object tracking using Raspberry pi 3.0
CN111819567A (en) Method and apparatus for matching images using semantic features

Legal Events

Code | Description
121 | EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21756572; Country of ref document: EP; Kind code of ref document: A1)
NENP | Non-entry into the national phase (Ref country code: DE)
122 | EP: PCT application non-entry in European phase (Ref document number: 21756572; Country of ref document: EP; Kind code of ref document: A1)