CN113361542A - Local feature extraction method based on deep learning - Google Patents

Local feature extraction method based on deep learning Download PDF

Info

Publication number
CN113361542A
CN113361542A
Authority
CN
China
Prior art keywords
network
local feature
homography
image
descriptor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110611600.1A
Other languages
Chinese (zh)
Other versions
CN113361542B (en)
Inventor
刘晓平
蔡有城
李琳
王冬
黄鑫涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110611600.1A priority Critical patent/CN113361542B/en
Publication of CN113361542A publication Critical patent/CN113361542A/en
Application granted granted Critical
Publication of CN113361542B publication Critical patent/CN113361542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a local feature extraction method based on deep learning, which comprises the following steps. First, network training is carried out: a pre-constructed network is trained on the MS-COCO image data set, which is divided into a training set and a validation set containing 82,783 and 40,504 images respectively. Image matching is then performed: in the experiments, the performance of the local feature extraction method is evaluated with a standard local feature pipeline that extracts and matches features from any given pair of images. This is followed by computing the repeatability score (Repeatability), then the matching score (M-Score), and finally by evaluating the homography estimation effect. By postponing the detection step until after description, the method achieves a more flexible feature search process than traditional non-machine-learning approaches, obtains a large number of key points, and improves feature extraction accuracy.

Description

Local feature extraction method based on deep learning
Technical Field
The invention relates to the technical field of deep-learning-based local feature extraction frameworks, and in particular to a local feature extraction method based on deep learning.
Background
In many areas of computer vision, learning-based methods have emerged and begun to outperform traditional methods. Intuitively, the feature extraction process requires only a network of several convolutional layers to model the behaviour of traditional detectors and descriptors by learning the appropriate parameters. Some existing learning-based methods focus on training detectors or descriptors individually, while others succeed in building an end-to-end feature detection and description pipeline. For the former, when the individually optimized detectors or descriptors are integrated into the complete pipeline, the performance gain of these individual components may disappear. For the latter, jointly training detectors and descriptors is more desirable, since it allows them to be optimized synergistically.
However, it is challenging to achieve two different optimization goals by training a single network, because the optimization goal of the detector is repeatability, while the optimization goal of the descriptor is distinctiveness. There is not yet a good set of solutions for unifying and combining the two, and the prior art cannot balance these two optimization goals well.
Disclosure of Invention
The invention aims to provide a local feature extraction method based on deep learning, so as to solve the problems of the prior art set out in the technical background above.
In order to achieve the purpose, the invention provides the following technical scheme: a local feature extraction method based on deep learning comprises the following steps:
S1, network training is performed first
The network is trained on the MS-COCO image data set, which is split into a training set and a validation set containing 82,783 and 40,504 images respectively;
S2, image matching is then performed
In the experiments, the performance of the local feature extraction method is evaluated using a standard local feature pipeline, which extracts and matches features from any given pair of images;
S3, the repeatability score (Repeatability) is computed
The repeatability score is used to evaluate the performance of the detector in the local feature extraction method. More specifically, let ε denote the correct-distance threshold for obtaining correct key-point correspondences between the two detected images in an experiment; the repeatability score is defined as the number of correctly corresponding key points divided by the total number of key points in the image pair;
S4, the matching score (M-Score) is then computed
The matching score is used to evaluate the combined performance of the detector and the descriptor in the local feature extraction method; it is the ratio of the number of correct matches obtained by the matching strategy of the standard local feature pipeline to the total number of matches;
S5, finally, the homography estimation effect is evaluated
The homography estimation effect is used to evaluate the ability of the local feature extraction method to estimate a homography matrix; the homography estimation itself is computed with RANSAC;
the homography estimation effect adopts an indirect comparison in order to accommodate homography matrices of different scales: it measures the average distance between the four image corners transformed by the homography matrix estimated with RANSAC and by the ground-truth homography matrix (a sketch of these evaluation metrics is given below).
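For concreteness, the quantities evaluated in steps S3 and S5 can be sketched as follows. This is an illustrative sketch only: the function names, the use of OpenCV/NumPy and the ε = 3 default are assumptions rather than the patent's reference implementation, and visibility/border handling is omitted.

```python
import numpy as np
import cv2

def repeatability(kps1, kps2, H_gt, eps=3.0):
    """Repeatability score of step S3: correctly corresponding key points divided by
    the total number of key points in the pair. kps1, kps2: (N, 2) arrays of (x, y);
    H_gt: 3x3 ground-truth homography mapping image 1 onto image 2."""
    warped = cv2.perspectiveTransform(
        kps1.reshape(-1, 1, 2).astype(np.float32), H_gt).reshape(-1, 2)
    d = np.linalg.norm(warped[:, None, :] - kps2[None, :, :], axis=-1)
    correct = (d.min(axis=1) < eps).sum() + (d.min(axis=0) < eps).sum()
    return correct / (len(kps1) + len(kps2))

def homography_corner_error(H_est, H_gt, h, w):
    """Indirect comparison of step S5: mean distance between the four image corners
    warped by the RANSAC-estimated and by the ground-truth homography."""
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    c_est = cv2.perspectiveTransform(corners, H_est)
    c_gt = cv2.perspectiveTransform(corners, H_gt)
    return float(np.linalg.norm(c_est - c_gt, axis=-1).mean())
```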
Preferably, the local feature extraction method includes a descriptor, a detector and a loss function, wherein:
the descriptor comprises a Homography Convolutional Network (HCN) and a feature description step; the descriptor operates on the original image and finally obtains dense descriptors with the same resolution as the original image;
the detector comprises a detector CNN network and a key point extraction step; the detector operates on the tensor F obtained by the HCN and finally obtains sparse key point positions;
the loss function:
In order to jointly optimize the detector and the descriptor, the loss function is composed of two intermediate losses, namely a detection loss function and a description loss function. The detection loss function drives the network to produce repeatable key-point positions that are covariant with viewpoint or illumination changes, while the description loss function drives the network to output highly distinctive descriptors that yield reliable matches; jointly optimizing the two losses improves the effect and performance of the detector and the descriptor simultaneously.
Preferably, the Homography Convolutional Network (HCN):
The HCN receives input original image data and uses its homography estimation module to predict different transformations of the original image; the transformed images are provided to the full convolution network instead of forcing the full convolution network to learn the extra geometric changes, so that the network can learn more information of the original image, and the tensor F is obtained;
The feature description is as follows:
① The tensor F ∈ R^(H′×W′×D) computed by the HCN is taken as input, and a tensor O ∈ R^(H×W×D) is output by bicubic interpolation;
② A normalized descriptor vector d is obtained by L2 normalization:
d_ij = o_ij / ‖o_ij‖_2
where i = 1, …, H, j = 1, …, W, H′ = H/4, W′ = W/4, H and W are the height and width of the original image respectively, and D = 256. These descriptor vectors can easily be matched between images by Euclidean distance, thereby obtaining reliable correspondences;
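A minimal sketch of this feature-description step, assuming a PyTorch implementation with a (B, D, H′, W′) tensor layout; the function name and layout are assumptions.

```python
import torch.nn.functional as F_nn

def describe(F_tensor, image_hw):
    """F_tensor: (B, D, H/4, W/4) dense features from the HCN; image_hw: (H, W)."""
    # bicubic upsampling back to the original image resolution
    O = F_nn.interpolate(F_tensor, size=image_hw, mode="bicubic", align_corners=False)
    # per-pixel L2 normalization: d_ij = o_ij / ||o_ij||_2
    return F_nn.normalize(O, p=2, dim=1)      # (B, D, H, W) unit-norm descriptors
```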
the detector CNN network:
The detector CNN network aims to output a pixel-level detection score, where the detection score represents the probability that a position is a key point. The tensor F is input to the detector CNN network to obtain the detection score of every pixel of the original image data. The detector CNN network consists of one convolution layer and two up-convolution layers; the spatial resolution is gradually increased while the number of channels is gradually reduced, and the final result is obtained through a sigmoid activation function;
and (3) extracting the key points:
The key point extraction aims to output sparse key point positions; it takes the detection scores produced by the detector CNN network as input and uses non-maximum suppression (NMS) and a Top-K operation to obtain a specified number of feature points.
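An illustrative sketch of the key-point extraction (NMS followed by Top-K) on the detection score map; the NMS window radius and the default K are assumptions.

```python
import torch
import torch.nn.functional as F_nn

def extract_keypoints(score, k=1000, nms_radius=4):
    """score: (B, 1, H, W) detection scores in [0, 1]; returns a list of (K, 2) (y, x) tensors."""
    pooled = F_nn.max_pool2d(score, kernel_size=2 * nms_radius + 1,
                             stride=1, padding=nms_radius)
    score = torch.where(score == pooled, score, torch.zeros_like(score))  # non-maximum suppression
    keypoints = []
    for s in score:                              # per image in the batch
        W = s.shape[-1]
        _, idx = s.flatten().topk(k)             # Top-K strongest responses (assumes k <= H*W)
        ys = torch.div(idx, W, rounding_mode="floor")
        xs = idx % W
        keypoints.append(torch.stack([ys, xs], dim=1))
    return keypoints
```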
Preferably, the homography estimation module consists of a convolution layer and a linear layer; after passing through the network layers of the homography estimation module, 6 × N_h parameters are predicted from the original image data and are used to obtain the homography transformation matrices;
wherein 1 × N_h parameters are used to compute the scale transformation, 2 × N_h parameters are used to compute the rotation transformation, and 3 × N_h parameters are used to compute the perspective transformation;
the scale can be derived from one parameter:
λ(α) = exp(tanh(α));
the rotation can be computed from two parameters by the following formula:
θ(α, β) = arctan2(tanh(α), tanh(β));
for the perspective transformation matrix A, three parameters processed by the tanh activation function give its representation (a_1, a_2, a_3). Thus, N_h homography transformation matrices can be obtained from the 6 × N_h parameters, where N_h is a hyper-parameter; considering the efficiency and effectiveness of the network, N_h = 4 is set;
Specifically, four corners of the image are set as initial points
x=[(-1,-1),(1,-1),(1,1),(-1,1)],
Four corresponding points are then predicted using the homography estimation module, where the corresponding initial point transform can be expressed as:
[the expression for the transformed corner points x′ is given as a formula image in the original and is not reproduced here]
the homography transformation matrix H is computed from these 4 pairs of corresponding points x and x' in a differentiable manner using the Tensor direct linear transformation (Tensor DLT) as follows:
x′ = Hx.
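The parameter activations and the differentiable DLT can be sketched as follows. How the six activated parameters displace the four corner points is the module's learned mapping and is only represented by a stand-in here; all names are assumptions.

```python
import torch

def activate_params(p):
    """p: (B, 6) raw network outputs -> scale, rotation angle and perspective terms."""
    lam = torch.exp(torch.tanh(p[:, 0]))                           # lambda(alpha) = exp(tanh(alpha))
    theta = torch.atan2(torch.tanh(p[:, 1]), torch.tanh(p[:, 2]))  # theta(alpha, beta)
    persp = torch.tanh(p[:, 3:6])                                  # (a1, a2, a3)
    return lam, theta, persp

def dlt_homography(x, x_prime):
    """Differentiable DLT: x, x_prime are (B, 4, 2) corresponding points; returns (B, 3, 3) with x' = Hx."""
    B = x.shape[0]
    rows = []
    for i in range(4):
        u, v = x[:, i, 0], x[:, i, 1]
        up, vp = x_prime[:, i, 0], x_prime[:, i, 1]
        zero, one = torch.zeros_like(u), torch.ones_like(u)
        rows.append(torch.stack([-u, -v, -one, zero, zero, zero, u * up, v * up, up], dim=1))
        rows.append(torch.stack([zero, zero, zero, -u, -v, -one, u * vp, v * vp, vp], dim=1))
    A = torch.stack(rows, dim=1)                 # (B, 8, 9) linear system A h = 0
    _, _, Vh = torch.linalg.svd(A)               # null space = last right singular vector
    H = Vh[:, -1, :].reshape(B, 3, 3)
    return H / H[:, 2:3, 2:3]                    # normalize so that H[2, 2] = 1
```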
preferably, the detector performs inverse gradient update using a detection loss function;
the detection loss function calculation process is as follows:
Given a pair of real images I_1 and I_2 and a ground-truth correspondence denoted w(·), such that I_1 = w(I_2), i.e. through this w(·) all the pixels of the first image I_1 can be found in the second image I_2. The image pair I_1 and I_2 is input to the network to obtain the detection scores S_1 and S_2. Defining G_1 and G_2 as the ground-truth key-point labels, the detection loss function L_det is defined by a cross-entropy loss:
L_det = L_s(S_1, G_1) + L_s(S_2, G_2)
[the per-pixel cross-entropy expression for L_s, indexed by the coordinate positions (i, j), is given as a formula image in the original]
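Since the explicit expression of L_s is provided only as an image, the sketch below assumes the common per-pixel binary cross-entropy reading of the description above; the averaging over pixels is an assumption.

```python
import torch

def detection_loss(S1, G1, S2, G2, eps=1e-8):
    """S1, S2: (B, 1, H, W) sigmoid score maps; G1, G2: binary ground-truth key-point labels."""
    def Ls(S, G):   # per-pixel cross-entropy between scores and labels
        return -(G * torch.log(S + eps) + (1 - G) * torch.log(1 - S + eps)).mean()
    return Ls(S1, G1) + Ls(S2, G2)               # L_det = L_s(S1, G1) + L_s(S2, G2)
```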
Preferably, the descriptor performs inverse gradient update by using a description loss function;
the description loss function is calculated as follows:
The description loss function is based on a modified hardest-contrastive loss, which is modified with a stricter negative distance: it minimizes the distance between positive (corresponding) examples and maximizes the distance to the nearest negative (non-corresponding) example. It is denoted by L_des:
[the explicit expression of L_des is given as a formula image in the original and is not reproduced here]
Here, d_k^1 and d_k^2 denote the k-th pair of corresponding descriptors of the image pair, and K denotes the number of all corresponding descriptors. The positive distance is therefore expressed as:
p(k) = ‖d_k^1 − d_k^2‖_2
where ‖·‖_2 denotes the Euclidean distance. The negative distance is defined as:
[the explicit expression of the negative distance n(k) is given as a formula image in the original and is not reproduced here]
where n(i, j, k) denotes the minimum distance between the descriptor d_k^1 in image I_1 and all non-corresponding descriptors in image I_2, so that the corresponding term picks out the non-corresponding descriptor with the smallest distance to d_k^1. The threshold C is a safety radius set to exclude feature points that are spatially too close to the correct correspondence. Notably, the description loss function takes into account both the negative distance between the pair of images and the negative distance within an image;
Finally, combining the description loss function L_des and the detection loss function L_det, the final loss function is obtained:
L = L_des + L_det
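Because the explicit L_des and n(k) expressions are provided only as images, the following is a hedged sketch of a hardest-contrastive style description loss built from the ingredients named in the text: the positive distance p(k), the hardest non-corresponding (negative) distance, and the safety radius C on pixel positions. The margins m_p and m_n, the element-wise pooling of the two hardest negatives, and mining negatives only among the sampled correspondences are assumptions.

```python
import torch

def description_loss(d1, d2, pos1, pos2, C=8.0, m_p=0.2, m_n=1.0):
    """d1, d2: (K, D) corresponding descriptors; pos1, pos2: (K, 2) their key-point
    positions expressed in a common image frame (e.g. image 2 warped into image 1)."""
    p = (d1 - d2).norm(dim=1)                                  # positive distances p(k)
    dist = torch.cdist(d1, d2)                                 # all pairwise descriptor distances
    too_close = torch.cdist(pos1.float(), pos2.float()) < C    # inside the safety radius C
    dist = dist.masked_fill(too_close, float("inf"))           # exclude near-correct negatives
    n12 = dist.min(dim=1).values                               # hardest negative for each d1_k
    n21 = dist.min(dim=0).values                               # hardest negative for each d2_k
    n = torch.minimum(n12, n21)                                # stricter negative distance n(k)
    return (torch.clamp(p - m_p, min=0) ** 2
            + torch.clamp(m_n - n, min=0) ** 2).mean()
```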
preferably, during network training, the MS-COCO dataset is processed, the resolutions of all images are adjusted to 320 × 240, then the images are converted into gray scales, in order to generate pixel correspondence, a suitable homography transformation matrix is randomly generated for each training sample, the homography transformed images and the images are simultaneously input into the network for training, and simultaneously the positions of the group-route key points are transformed to generate correspondingly transformed group-route key point labels.
Preferably, in the network test, evaluation is performed on the HPatches data set, which contains 116 image sequences, of which 57 sequences exhibit illumination changes and 59 sequences exhibit viewpoint changes. For each sequence, the first image is taken as the reference image and matched with all subsequent images, resulting in 580 image pairs; the HPatches images are processed at a resolution of 240 × 320 and N = 1000 feature points are extracted. The same mutual nearest neighbor (MNN) matching strategy is employed, which is based on nearest-neighbor search: a match is accepted only when the two descriptors are mutual nearest neighbors. In order to emphasize matching accuracy, a threshold ε (ε = 3) on the corresponding pixels is set, i.e. a match with a reprojection error below this threshold is considered a correct match.
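The mutual nearest neighbor matching and the correctness test can be sketched as follows; the brute-force distance computation and the default ε = 3 follow the description above, while the function names are assumptions.

```python
import numpy as np
import cv2

def mnn_match(desc1, desc2):
    """Mutual nearest neighbor matching: keep (i, j) only if i and j are each other's nearest neighbor."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn12, nn21 = d.argmin(axis=1), d.argmin(axis=0)
    idx1 = np.arange(len(desc1))
    mutual = nn21[nn12[idx1]] == idx1
    return np.stack([idx1[mutual], nn12[idx1[mutual]]], axis=1)

def count_correct(matches, kps1, kps2, H_gt, eps=3.0):
    """A match is correct if its reprojection error under the ground-truth homography is below eps."""
    pts = cv2.perspectiveTransform(
        kps1[matches[:, 0]].reshape(-1, 1, 2).astype(np.float32), H_gt).reshape(-1, 2)
    err = np.linalg.norm(pts - kps2[matches[:, 1]], axis=1)
    return int((err < eps).sum())
```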
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with existing methods, which mostly adopt a detect-then-describe or simultaneous detect-and-describe feature matching scheme, the present application first provides more distinctive descriptors and then performs detection, which greatly improves the effectiveness of feature matching. The homography transformation operations in the existing methods aim only at generating more key points and are unrelated to the generation of descriptors, whereas the HCN uses the homography transformation operation to generate more distinctive descriptors; the present application therefore cannot be derived in an obvious way from the prior art. Moreover, the homography transformations of the HCN are obtained through learning, so that the obtained transformations better fit the characteristics of the descriptors and can produce more distinctive descriptors, whereas the transformations in the existing methods are obtained by sampling with a non-learning method, which cannot be applied to feature description and cannot produce distinctive descriptors.
2. The method adopts a CNN network as the detector network to detect key points and combines it with a self-supervised training strategy so that the obtained key points are more repeatable. It adopts a describe-then-detect strategy: by postponing the detection step until after description, more stable key points are obtained.
3. Two new loss functions are designed to further improve the performance of the descriptor and the detector. A similarity loss is proposed to further improve the repeatability of key-point detection, and a hardest-contrastive loss with a stricter negative-distance constraint is adopted to avoid ambiguous regions and achieve more advanced performance. In determining the loss functions, the feature-description and feature-extraction losses act on the network jointly, so the method considers both the description process that yields more distinctive descriptors and the acquisition of more repeatable key points. The extraction performed by the descriptor HCN is associated with the loss of the subsequent feature extraction, and the whole network runs end to end, which saves time; the strategy of superimposing the two losses gives the network good robustness and makes the feature description and feature extraction of a picture after the HCN better correlated. On the one hand, this drives the HCN to generate distinctive descriptors accurately and quickly; on the other hand, it promotes the use of the distinctive descriptors in the key-point detection process, realizing more accurate key-point detection. The feature matching experiments demonstrate the superiority of the method.
Drawings
Fig. 1 is a flow chart of the RDFeat of the present invention.
Fig. 2 is a diagram of the RDFeat network architecture of the present invention.
Fig. 3 is a diagram of the RDFeat training architecture of the present invention.
Fig. 4 is a diagram of a homography estimation module network architecture of the present invention.
Fig. 5 is a diagram of a homographic transformation matrix based on scale, rotation and symmetric perspective estimation in accordance with the present invention.
FIG. 6 is a diagram of the positive and negative distances of the descriptors of the present invention, with double-arrowed lines representing Euclidean distances.
FIG. 7 is a graph of the qualitative results of the HPatches data set of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A local feature extraction method based on deep learning is provided, referred to in full as Repeatable and Distinctive Detection and Description for Learning Local Features (RDFeat), and is used for obtaining reliable matching correspondences between images. Unlike the classic detect-then-describe framework, it adopts a describe-then-detect strategy: by postponing the detection step until after description, more stable key points are obtained. This work focuses on obtaining repeatable key points and distinguishable descriptors. First, a Homography Convolutional Network (HCN) is provided as the descriptor network to estimate dense descriptors and obtain highly distinctive descriptors. Second, a CNN network is used as the detector network to detect key points, combined with a self-supervised training strategy so that the obtained key points are more repeatable. Finally, two new loss functions are designed to further improve the performance of the descriptor and the detector. RDFeat is trained on the MS-COCO image data set and then evaluated on several benchmark data sets, and the experimental results show that its performance is superior to the latest methods.
As mentioned above, most existing feature extraction methods are detect-then-describe or simultaneous detect-and-describe feature matching methods. The purpose of the homography transformation operations in these methods is only to allow more key points to be generated, independent of the generation of descriptors; therefore they cannot generate distinctive descriptors, and it is difficult for them to improve the effectiveness of feature matching. Our HCN, by contrast, uses a homography transformation operation to generate more distinctive descriptors, so the present application cannot be derived from the prior art. Furthermore, the homography transformations of the HCN are obtained through learning, so that the obtained transformations are more consistent with the characteristics of the descriptors and can generate more distinctive descriptors, whereas the transformations in the compared documents are obtained by sampling with a non-learning method, which cannot be applied to feature description and cannot generate distinctive descriptors. A specific scheme of the present application is introduced below.
Referring to fig. 1-7, the present invention provides a technical solution:
A local feature extraction method based on deep learning, also called RDFeat, comprises the following steps:
S1, network training is performed first
The network is trained on the MS-COCO image data set, which is split into a training set and a validation set containing 82,783 and 40,504 images respectively;
S2, image matching is then performed
In the experiments, the performance of the local feature extraction method is evaluated using a standard local feature pipeline, which extracts and matches features from any given pair of images;
S3, the repeatability score (Repeatability) is computed
The repeatability score is used to evaluate the performance of the detector in the local feature extraction method. More specifically, let ε denote the correct-distance threshold for obtaining correct key-point correspondences between the two detected images in an experiment; the repeatability score is defined as the number of correctly corresponding key points divided by the total number of key points in the image pair;
S4, the matching score (M-Score) is then computed
The matching score is used to evaluate the combined performance of the detector and the descriptor in the local feature extraction method; it is the ratio of the number of correct matches obtained by the matching strategy of the standard local feature pipeline to the total number of matches;
S5, finally, the homography estimation effect is evaluated
The homography estimation effect is used to evaluate the ability of the local feature extraction method to estimate a homography matrix; the homography estimation itself is computed with RANSAC;
the homography estimation effect adopts an indirect comparison in order to accommodate homography matrices of different scales: it measures the average distance between the four image corners transformed by the homography matrix estimated with RANSAC and by the ground-truth homography matrix.
In this embodiment, the local feature extraction method includes a descriptor, a detector, and a loss function, where:
the descriptor comprises a Homography Convolutional Network (HCN) and a feature description step; the descriptor operates on the original image and finally obtains dense descriptors with the same resolution as the original image;
the detector comprises a detector CNN network and a key point extraction step; the detector operates on the tensor F obtained by the HCN and finally obtains sparse key point positions;
The method adopts a describe-then-detect strategy, postponing the detection step until after description so as to obtain more stable key points; after more distinctive descriptors are obtained, key points with high repeatability are obtained in the key-point detection process of the picture by means of a self-supervised mode.
The loss function:
In order to jointly optimize the detector and the descriptor, the loss function is composed of two intermediate losses, namely a detection loss function and a description loss function. The detection loss function drives the network to produce repeatable key-point positions that are covariant with viewpoint or illumination changes, while the description loss function drives the network to output highly distinctive descriptors that yield reliable matches; jointly optimizing the two losses improves the effect and performance of the detector and the descriptor simultaneously.
In this embodiment, the Homography Convolutional Network (HCN):
The HCN receives input original image data and uses its homography estimation module to predict different image transformations; the transformed images are provided to the CNN network instead of forcing the CNN network to learn the extra geometric changes, so that the CNN network can learn more image information and the tensor F is obtained;
As shown in fig. 2, the HCN takes an original image as input and then uses the homography estimation module to obtain N_h homography matrices, which transform the image I into a set of transformed images, where H(I) represents the image I transformed by the homography matrix H. We then apply a full convolution network Q as the descriptor extraction network to extract a dense descriptor f from each transformed image, defined as:
f = Q(H(I))
finally, the different dense feature maps are inversely transformed back and fused into one dense feature map by averaging:
F = (1/N_h) · Σ_{i=1}^{N_h} H_i^{-1}( Q( H_i(I) ) )
This is done for two reasons: first, such a method allows the deep network to learn more of the geometric information of the image, and second, it improves the distinctiveness of the descriptors under different geometric variations. That is, for descriptors of corresponding positions (matches) the vector representations will be sufficiently similar (the Euclidean distance is sufficiently small), and for descriptors of non-corresponding positions (non-matches) the Euclidean distance is sufficiently large, thereby improving the accuracy of image matching. In practice, the full convolution network Q uses a VGG-style encoder consisting of convolution layers, pooling layers and activation functions; note that our encoder uses two max-pooling layers to reduce the resolution to 1/4, and all convolution layers are zero-padded to produce the same output size. We define H × W as the resolution of the input image, where H′ = H/4 and W′ = W/4, and the output tensor is defined as
F ∈ R^(H′×W′×D)
Wherein D is the number of channels;
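A minimal sketch of the HCN forward pass just described, assuming kornia's warp_perspective is available for the differentiable warps; Q and estimate_homographies are stand-ins for the descriptor extraction network and the homography estimation module, and rescaling the homography to feature-map resolution is an implementation assumption.

```python
import torch
from kornia.geometry.transform import warp_perspective

def hcn_forward(image, Q, estimate_homographies):
    """image: (B, 1, H, W); Q: full convolution descriptor network (downsamples by 4);
    estimate_homographies: module returning (B, N_h, 3, 3) homographies."""
    B, _, H, W = image.shape
    Hs = estimate_homographies(image)
    scale = torch.tensor([[0.25, 0, 0], [0, 0.25, 0], [0, 0, 1.0]], device=image.device)
    feats = []
    for i in range(Hs.shape[1]):
        Hi = Hs[:, i]
        warped = warp_perspective(image, Hi, dsize=(H, W))              # H_i(I)
        f = Q(warped)                                                   # f_i = Q(H_i(I)), at 1/4 resolution
        Hi_inv_feat = scale @ torch.inverse(Hi) @ torch.inverse(scale)  # inverse warp at feature scale
        f = warp_perspective(f, Hi_inv_feat, dsize=tuple(f.shape[-2:]))
        feats.append(f)
    return torch.stack(feats, dim=0).mean(dim=0)                        # fuse by averaging -> tensor F
```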
The descriptors obtained by the Homography Convolutional Network (HCN) have scale, rotation and affine invariance. Although a CNN descriptor can also show a certain degree of scale invariance after training, scale invariance is not an inherent property of a CNN; when the scale change is large or the viewing angle changes, the matching effect of CNN descriptors is greatly affected. To handle this limitation, D2-Net uses an image pyramid model to make the descriptors more robust to scale changes, but ignores other geometric changes. Furthermore, LF-Net learns different scales and orientations of feature points and then uses differentiable image patches to compute robust descriptors. In addition, ASLFeat uses a deformable convolutional network (DCN) to predict and apply dense spatial transformations, thereby obtaining the capability of handling geometric changes. In our work, we input images under different transformations into the full convolution network instead of forcing the full convolution network to learn the extra geometric changes, so that the full convolution network can learn more image information; more distinctive descriptors can thus be obtained, and the image matching effect is improved;
A homography describes the mapping of an object's position between the pixel coordinate systems of an image pair, so camera motion with rotation and translation can easily be modeled with a homography. In addition, a homography can easily be estimated from a pair of images, and it is a good model for relating the same physical position of an object; for this reason, homographies are used in our method to model geometric changes;
The feature description is as follows:
① The tensor F ∈ R^(H′×W′×D) computed by the HCN is taken as input, and a tensor O ∈ R^(H×W×D) is output by bicubic interpolation;
② A normalized descriptor vector d is obtained by L2 normalization:
d_ij = o_ij / ‖o_ij‖_2
where i = 1, …, H, j = 1, …, W, H′ = H/4, W′ = W/4, and H and W are the height and width of the original image respectively; D = 256. These descriptor vectors can easily be matched between images by Euclidean distance, thereby obtaining reliable correspondences;
the detector CNN network:
The detector CNN network aims to output a pixel-level detection score, where the detection score represents the probability that a position is a key point. The tensor F is input to the detector CNN network to obtain the detection score of every pixel of the original image data. The detector CNN network consists of one convolution layer and two up-convolution layers; the spatial resolution is gradually increased while the number of channels is gradually reduced, and the final result is obtained through a sigmoid activation function;
A descriptor extraction module is developed by adopting a structure similar to U-Net; although this method introduces additional learned weights, more stable and accurate key points can be obtained, which is reflected in the repeatability of the key points. Meanwhile, different loss functions are proposed to further improve the network performance.
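An illustrative PyTorch module matching the detector description above (one convolution followed by two up-convolutions, ending in a sigmoid score map); the channel widths are assumptions.

```python
import torch.nn as nn

class DetectorCNN(nn.Module):
    """Maps the HCN tensor F (B, 256, H/4, W/4) to a (B, 1, H, W) detection score map."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),                      # per-pixel key-point probability
        )

    def forward(self, F_tensor):
        return self.net(F_tensor)
```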
And (3) extracting the key points:
The key point extraction aims to output sparse key point positions; it takes the detection scores produced by the detector CNN network as input and uses non-maximum suppression (NMS) and a Top-K operation to obtain a specified number of feature points.
In this embodiment, the homography estimation module consists of a convolution layer and a linear layer; after passing through the network layers of the homography estimation module, 6 × N_h parameters are predicted from the original image data and are used to obtain the homography transformation matrices;
wherein 1 × N_h parameters are used to compute the scale transformation, 2 × N_h parameters are used to compute the rotation transformation, and 3 × N_h parameters are used to compute the perspective transformation;
the scale can be derived from one parameter:
λ(α) = exp(tanh(α));
the rotation can be computed from two parameters by the following formula:
θ(α, β) = arctan2(tanh(α), tanh(β));
for the perspective transformation matrix A, three parameters processed by the tanh activation function give its representation (a_1, a_2, a_3). Thus, N_h homography transformation matrices can be obtained from the 6 × N_h parameters, where N_h is a hyper-parameter; considering the efficiency and effectiveness of the network, N_h = 4 is set;
Specifically, four corners of the image are set as initial points
x = [(-1,-1), (1,-1), (1,1), (-1,1)],
Four corresponding points are then predicted using the homography estimation module, where the corresponding initial point transform can be expressed as:
[the expression for the transformed corner points x′ is given as a formula image in the original and is not reproduced here]
the homography transformation matrix H is computed from these 4 pairs of corresponding points x and x′ in a differentiable manner using the Tensor direct linear transformation (Tensor DLT) as follows:
x′ = Hx.
The main idea of the homography estimation module is to use transformation matrices to convert the original picture into 4 pictures different from the original (subjected to scale, rotation and symmetric perspective changes). Each transformation matrix is controlled by 6 parameters, of which 1 parameter controls the scale, 2 parameters control the rotation and 3 parameters control the symmetric perspective, thereby obtaining a homography transformation matrix H;
and (3) coordinate corresponding process of the homography estimation module:
The difficulty of current methods lies in that a suitable matrix cannot be found to directly model the transformation. By restricting the transformation to these 3 types, the present method provides a matrix with 6 parameters, which is not recorded in the prior art. From the implementation point of view, the inverse matrix H′ of the transformation matrix H of the method can easily be found: the transformation of the original image is realized with the transformation matrix H, and the result can then be transformed back to the original image through the inverse matrix H′. Specifically, the method finds the corresponding position coordinates in the transformed image by multiplying the coordinate points of the original image with the obtained H matrix. Descriptors are then extracted from the transformed image, and the position coordinates of the descriptors extracted from the transformed image can be inversely transformed back to the original image using H′; that is, the descriptors extracted from the 4 transformed images can be transformed back to their original positions, thereby enhancing the descriptor information;
in this embodiment, the detector performs inverse gradient update using a detection loss function;
the detection loss function calculation process is as follows:
Given a pair of real images I_1 and I_2 and a ground-truth correspondence denoted w(·), such that I_1 = w(I_2), i.e. through this w(·) the pixels of image I_1 can be found in image I_2. The image pair I_1 and I_2 is input to the network to obtain the detection scores S_1 and S_2. Defining G_1 and G_2 as the ground-truth key-point labels, the detection loss function L_det is defined by a cross-entropy loss:
L_det = L_s(S_1, G_1) + L_s(S_2, G_2)
[the per-pixel cross-entropy expression for L_s is given as a formula image in the original]
Because it is difficult to determine the position of a ground-truth key point, conventional supervised training cannot solve the feature detection problem; as observed in previous work, there is no strict standard defining which positions are key points. We therefore solve the problem with the self-supervised strategy proposed in SuperPoint and supervise the network with the ground-truth key points generated by MagicPoint. MagicPoint is trained on the Synthetic Shapes data set and then extended to real images using the homographic adaptation technique; MagicPoint shows excellent performance in key-point detection, as demonstrated by quantitative indicators such as mean average precision (mAP) and repeatability.
The descriptor updates the inverse gradient by adopting a description loss function;
the description loss function is calculated as follows:
The description loss function is based on a modified hardest-contrastive loss, which is modified with a stricter negative distance: it minimizes the distance between positive (corresponding) examples and maximizes the distance to the nearest negative (non-corresponding) example. It is denoted by L_des:
[the explicit expression of L_des is given as a formula image in the original and is not reproduced here]
Here, d_k^1 and d_k^2 denote the k-th pair of corresponding descriptors of the image pair, and K denotes the number of all corresponding descriptors. The positive distance is therefore expressed as:
p(k) = ‖d_k^1 − d_k^2‖_2
where ‖·‖_2 denotes the Euclidean distance. The negative distance is defined as:
[the explicit expression of the negative distance n(k) is given as a formula image in the original and is not reproduced here]
where n(i, j, k) denotes the minimum distance between the descriptor d_k^1 in image I_1 and all non-corresponding descriptors in image I_2, so that the corresponding term picks out the non-corresponding descriptor with the smallest distance to d_k^1. The threshold C is a safety radius set to exclude feature points that are spatially too close to the correct correspondence. Notably, the description loss function takes into account both the negative distance between the pair of images and the negative distance within an image;
Finally, combining the description loss function L_des and the detection loss function L_det, the final loss function is obtained:
L = L_des + L_det
Final loss function: to jointly optimize the detector and the descriptor, we propose a final loss function consisting of two intermediate losses, the detection loss and the description loss. For detection, we want the network to produce repeatable key-point locations that are covariant with viewpoint or illumination; for description, we want the network to output highly distinctive descriptors that can obtain reliable matches. We therefore jointly optimize the two losses while improving the effect and performance of the detector and the descriptor.
Through the joint action of the feature description and the feature extraction in our network, the method considers both the description process that yields more distinctive descriptors and the acquisition of more repeatable key points. Associating the extraction of the descriptor HCN with the loss of the subsequent feature extraction makes the whole network operate end to end, which saves time, and the strategy of superimposing the two losses gives the network good robustness, so that the feature description and feature extraction of a picture after the HCN are better correlated. On the one hand, this drives the HCN to generate distinctive descriptors accurately and quickly; on the other hand, it promotes the use of the distinctive descriptors in the key-point detection process, realizing more accurate key-point detection. The feature matching experiments demonstrate the superiority of the method.
In this embodiment, during network training on the MS-COCO data set, the resolution of all images is adjusted to 320 × 240 and the images are then converted to gray scale. In order to generate pixel correspondences, a suitable homography transformation matrix is randomly generated for each training sample; the homography-transformed image and the image itself are input into the network simultaneously for training, while the ground-truth key-point positions are transformed to generate the correspondingly transformed ground-truth key-point labels. It should be noted that the random generation of the homography transformation matrix is limited to a reasonable range (generally determined using the settings of SuperPoint) to simulate real-world camera transformations and avoid extreme situations.
In this embodiment, during network testing, evaluation is performed on the HPatches data set, which contains 116 image sequences, 57 of which exhibit illumination changes and 59 of which exhibit viewpoint changes. For each sequence, the first image serves as the reference image and is matched with all subsequent images, giving 580 image pairs. The same mutual nearest neighbor (MNN) matching strategy is employed; the MNN matching strategy is based on nearest-neighbor search, that is, two descriptors are accepted as a match only when they are mutual nearest neighbors. In order to emphasize matching accuracy, a threshold ε (ε = 3) on the corresponding pixels is set, i.e. a match with a reprojection error below this threshold is considered a correct match;
for a fair comparison, all methods are evaluated at a resolution of 240 × 320 with N = 1000 extracted feature points.
TABLE 1 Evaluation results on HPatches (the table itself is provided as an image in the original and is not reproduced here)
As shown in Table 1, our RDFeat is superior to all other methods on almost all indicators. SIFT has the best homography estimation capability at a low error threshold (ε = 1) due to its higher sub-pixel accuracy; when the threshold is larger, RDFeat estimates the homography matrix better. It should be noted that RDFeat and SuperPoint are trained on the same data set, but RDFeat achieves better repeatability and matching scores, proving its superiority;
FIG. 7 shows qualitative results on the HPatches data set; a large shaded area indicates more correct matches and a small shaded area indicates fewer correct matches. Compared with SuperPoint, SIFT and ORB, RDFeat produces the most correct matches, covering the whole image even under extreme rotation and affine changes. Although ORB performs as well as RDFeat in repeatability, its detections tend to form sparse clusters, so it performs poorly on the homography estimation task.
The processing results show that, according to the above scheme, the invention can jointly learn the feature detector and the descriptor in a novel deep network architecture that follows the describe-then-detect approach and combines the learned feature detector and descriptor. Three innovations are provided in feature description, feature detection and the loss functions, which remarkably improve the distinctiveness of the descriptors and the repeatability of the key points. Specifically, we provide a novel HCN to extract dense descriptors, which can collect more geometric image information under different transformations and realize scale-, rotation- and affine-invariant descriptors. In addition, we develop a detector CNN network based on a self-supervised training strategy, realizing effective detection of stable key points. Moreover, considering the different optimization goals of the detector and the descriptor, we design two loss functions to improve the feature performance. Finally, we perform a comprehensive evaluation on several benchmark data sets, and the experimental results show that RDFeat achieves impressive performance. The feature-description and feature-extraction losses act jointly in our network, so the method considers both the description process of more distinctive descriptors and the acquisition of more repeatable key points; the extraction of the descriptor HCN is associated with the loss of the subsequent feature extraction, the whole network operates end to end, time is saved, and the strategy of superimposing the two losses gives the network good robustness, so that the feature description and feature extraction of a picture after the HCN are better correlated. On the one hand, this drives the HCN to generate distinctive descriptors accurately and quickly; on the other hand, it promotes the use of the distinctive descriptors in the key-point detection process, realizing more accurate key-point detection, and the feature matching experiments demonstrate the superiority of the method.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A local feature extraction method based on deep learning is characterized in that: the local feature extraction method comprises the following steps:
S1, network training is performed first
a pre-constructed network is trained on the MS-COCO image data set, which is divided into a training set and a validation set containing 82,783 and 40,504 images respectively;
S2, image matching is then performed
in the experiments, the performance of the local feature extraction method is evaluated using a standard local feature pipeline, which extracts and matches features from any given pair of images;
S3, the repeatability score is computed
the repeatability score is used to evaluate the performance of the detector in the local feature extraction method; more specifically, let ε denote the correct-distance threshold for obtaining correct key-point correspondences between the two detected images in an experiment, and the repeatability score is defined as the number of correctly corresponding key points divided by the total number of key points in the image pair;
S4, the matching score M-Score is then computed
the matching score is used to evaluate the combined performance of the detector and the descriptor in the local feature extraction method, and is the ratio of the number of correct matches obtained by the matching strategy of the standard local feature pipeline to the total number of matches;
S5, finally, the homography estimation effect is evaluated
the homography estimation effect is used to evaluate the ability of the local feature extraction method to estimate a homography matrix, the homography estimation itself being computed with RANSAC;
the homography estimation effect evaluation adopts an indirect comparison in order to accommodate homography matrices of different scales, measuring the average distance between the four image corners transformed by the homography matrix estimated with RANSAC and by the ground-truth homography matrix.
2. The local feature extraction method based on deep learning of claim 1, wherein:
the local feature extraction method comprises a descriptor, a detector and a loss function, wherein:
the descriptor comprises a homography convolutional network HCN and a feature description step; the descriptor operates on the original image and finally obtains dense descriptors with the same resolution as the original image;
the detector comprises a detector CNN network and key point extraction, and the detector operates tensor F obtained by the HCN to finally obtain sparse key point positions;
the loss function:
In order to jointly optimize the detector and the descriptor, the loss function is composed of two intermediate losses, namely a detection loss function and a description loss function. The detection loss function drives the network to produce repeatable key-point positions that are covariant with viewpoint or illumination changes, while the description loss function drives the network to output highly distinctive descriptors that yield reliable matches; jointly optimizing the two losses improves the effect and performance of the detector and the descriptor simultaneously.
3. The local feature extraction method based on deep learning according to claim 2, wherein:
the homography convolutional network HCN:
The HCN receives input original image data and uses its homography estimation module to predict different transformations of the original image; the transformed images are provided to the full convolution network instead of forcing the full convolution network to learn the extra geometric changes, so that the network can learn more information of the original image, and the tensor F is obtained;
The feature description is as follows:
① The tensor F ∈ R^(H′×W′×D) computed by the HCN is taken as input, and a tensor O ∈ R^(H×W×D) is output by bicubic interpolation;
② A normalized descriptor vector d is obtained by L2 normalization:
d_ij = o_ij / ‖o_ij‖_2
where i = 1, …, H, j = 1, …, W, H′ = H/4, W′ = W/4, and H and W are the height and width of the original image respectively; D = 256. These descriptor vectors can easily be matched between images by Euclidean distance, thereby obtaining reliable correspondences;
the detector CNN network:
The detector CNN network aims to output a pixel-level detection score, where the detection score represents the probability that a position is a key point. The tensor F is input to the detector CNN network to obtain the detection score of every pixel of the original image data. The detector CNN network consists of one convolution layer and two up-convolution layers; the spatial resolution is gradually increased while the number of channels is gradually reduced, and the final result is obtained through a sigmoid activation function;
and (3) extracting the key points:
The key point extraction aims to output sparse key point positions; it takes the detection scores produced by the detector CNN network as input and uses non-maximum suppression (NMS) and a Top-K operation to obtain a specified number of feature points.
4. The method of claim 3, wherein the local feature extraction method based on deep learning is characterized in that:
the homography estimation module consists of a convolution layer and a linear layer, and original image data is predicted to be 6 multiplied by N after passing through a network layer of the homography estimation modulehA parameter for obtaining a homography transformation matrix;
wherein, 1 XNhOne parameter for calculating the scale transformation, 2 XNhOne parameter for calculating the rotation transformation, 3 XNhThe parameters are used for calculating perspective transformation;
the scale can be derived from one parameter:
λ(α)=exp(tanh(α));
for rotation, it can be calculated from two parameters by the following formula:
θ(α,β)=arctan2(tanh(α),tanh(β));
for the perspective transformation matrix A, three parameters can be processed by tanh activation function for representation (a)1,a2,a3) Thus, 6 XNhN can be obtained from one parameterhA homographic transformation matrix, NhIs a hyper-parameter, and sets N in consideration of the efficiency and effectiveness of the networkh=4;
Specifically, four corners of the image are set as initial points
x=[(-1,-1),(1,-1),(1,1),(-1,1)],
Four corresponding points are then predicted using the homography estimation module, where the corresponding initial point transform can be expressed as:
[the expression for the transformed corner points x′ is given as a formula image in the original and is not reproduced here]
the homography transformation matrix H is computed from these 4 pairs of corresponding points x and x′ in a differentiable manner using the Tensor direct linear transformation (Tensor DLT) as follows:
x′ = Hx.
5. the local feature extraction method based on deep learning according to claim 2, wherein:
the detector adopts a detection loss function to update the reverse gradient;
the detection loss function calculation process is as follows:
Given a pair of real images I_1 and I_2 and a ground-truth correspondence denoted w(·), such that I_1 = w(I_2), i.e. through the function w(·), the pixels of image I_1 can be found in image I_2. The image pair I_1 and I_2 is input to the network to obtain the detection scores S_1 and S_2. Defining G_1 and G_2 as the corresponding ground-truth key-point labels, the detection loss function L_det is defined by a cross-entropy loss:
L_det = L_s(S_1, G_1) + L_s(S_2, G_2)
[the per-pixel cross-entropy expression for L_s, indexed by the coordinate positions (i, j), is given as a formula image in the original]
6. The method of claim 5, wherein the local feature extraction method based on deep learning is characterized in that:
the descriptor updates the inverse gradient by adopting a description loss function;
the description loss function is calculated as follows:
The description loss function is based on a modified hardest-contrastive loss, which is modified with a stricter negative distance: it minimizes the distance between positive (corresponding) examples and maximizes the distance to the nearest negative (non-corresponding) example. It is denoted by L_des:
[the explicit expression of L_des is given as a formula image in the original and is not reproduced here]
Here, d_k^1 and d_k^2 denote the k-th pair of corresponding descriptors of the image pair, and K denotes the number of all corresponding descriptors. The positive distance is therefore expressed as:
p(k) = ‖d_k^1 − d_k^2‖_2
where ‖·‖_2 denotes the Euclidean distance. The negative distance is defined as:
[the explicit expression of the negative distance n(k) is given as a formula image in the original and is not reproduced here]
where n(i, j, k) denotes the minimum distance between the descriptor d_k^i in image I_i and all non-corresponding descriptors in image I_j, so that the corresponding term picks out the non-corresponding descriptor with the smallest distance to d_k^i. The threshold C is a safety radius set to exclude feature points that are spatially too close to the correct correspondence. Notably, the description loss function takes into account both the negative distance between the pair of images and the negative distance within an image;
finally, combining the description loss function L_des and the detection loss function L_det, the final loss function is obtained:
L = L_des + L_det
7. The local feature extraction method based on deep learning according to any one of claims 1-6, characterized in that: during network training, the resolution of all images of the MS-COCO data set is adjusted to 320 × 240 and the images are then converted to gray scale. In order to generate pixel correspondences, a suitable homography transformation matrix is randomly generated for each training sample; the homography-transformed image and the image itself are input into the network simultaneously for training, while the ground-truth key-point positions are transformed accordingly to generate the correspondingly transformed ground-truth key-point labels.
8. The local feature extraction method based on deep learning according to any one of claims 1-6, wherein: in the network test, evaluation is performed on the HPatches data set, which contains 116 image sequences, of which 57 sequences exhibit illumination changes and 59 sequences exhibit viewing angle changes; for each sequence, the first image is taken as the reference image and matched against all subsequent images, yielding 580 image pairs; evaluation is carried out at a resolution of 240 × 320 with N = 1000 extracted feature points, and the mutual nearest neighbor (MNN) matching strategy is employed, which is based on nearest neighbor search, i.e. a match is accepted only when the two descriptors are each other's nearest neighbors; to emphasize matching accuracy, a threshold ε (ε = 3) is set for the corresponding pixels, i.e. a match whose reprojection error is below this threshold is considered a correct match.
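For illustration only, a NumPy sketch of the mutual nearest neighbor matching and the reprojection-error check described above (function names are assumptions):

```python
import numpy as np

def mutual_nearest_matches(desc1, desc2):
    """Accept a match (i, j) only if desc1[i] and desc2[j] are each other's nearest neighbors."""
    dist = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)   # (N1, N2)
    nn12 = dist.argmin(axis=1)   # nearest neighbor in image 2 for each descriptor of image 1
    nn21 = dist.argmin(axis=0)   # nearest neighbor in image 1 for each descriptor of image 2
    ids1 = np.arange(len(desc1))
    mutual = nn21[nn12] == ids1  # mutual consistency check
    return np.stack([ids1[mutual], nn12[mutual]], axis=1)

def count_correct(matches, kp1, kp2, H_gt, eps=3.0):
    """Count matches whose reprojection error under the ground-truth homography is below eps pixels."""
    pts = np.concatenate([kp1[matches[:, 0]], np.ones((len(matches), 1))], axis=1)  # homogeneous coords
    proj = (H_gt @ pts.T).T
    proj = proj[:, :2] / proj[:, 2:3]          # back to inhomogeneous coordinates
    err = np.linalg.norm(proj - kp2[matches[:, 1]], axis=1)
    return int((err < eps).sum())
```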
CN202110611600.1A 2021-06-02 2021-06-02 Local feature extraction method based on deep learning Active CN113361542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110611600.1A CN113361542B (en) 2021-06-02 2021-06-02 Local feature extraction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110611600.1A CN113361542B (en) 2021-06-02 2021-06-02 Local feature extraction method based on deep learning

Publications (2)

Publication Number Publication Date
CN113361542A true CN113361542A (en) 2021-09-07
CN113361542B CN113361542B (en) 2022-08-30

Family

ID=77531111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110611600.1A Active CN113361542B (en) 2021-06-02 2021-06-02 Local feature extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN113361542B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337470A1 (en) * 2016-05-20 2017-11-23 Magic Leap, Inc. Method and system for performing convolutional image transformation estimation
US20190147341A1 (en) * 2017-11-14 2019-05-16 Magic Leap, Inc. Fully convolutional interest point detection and description via homographic adaptation
CN108629301A (en) * 2018-04-24 2018-10-09 重庆大学 A kind of human motion recognition method based on moving boundaries dense sampling and movement gradient histogram
CN108846861A (en) * 2018-06-12 2018-11-20 广州视源电子科技股份有限公司 Image homography matrix calculation method and device, mobile terminal and storage medium
WO2020187705A1 (en) * 2019-03-15 2020-09-24 Retinai Medical Ag Feature point detection
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
CN111652240A (en) * 2019-12-18 2020-09-11 南京航空航天大学 Image local feature detection and description method based on CNN
CN111401384A (en) * 2020-03-12 2020-07-10 安徽南瑞继远电网技术有限公司 Transformer equipment defect image matching method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANIEL DETONE ET AL: "SuperPoint: Self-Supervised Interest Point Detection and Description", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW)》 *
JEROME REVAUD ET AL: "R2D2: Repeatable and Reliable Detector and Descriptor", 《ARXIV》 *
PETER HVIID CHRISTIANSEN ET AL: "UNSUPERPOINT: END-TO-END UNSUPERVISED INTEREST POINT DETECTOR AND DESCRIPTOR", 《ARXIV》 *
贾迪 (JIA Di) et al: "A Review of Image Matching Methods", 《Journal of Image and Graphics》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067100A (en) * 2021-10-29 2022-02-18 厦门大学 Feature point matching method for simultaneously generating detector and descriptor under difficult condition
CN114663594A (en) * 2022-03-25 2022-06-24 中国电信股份有限公司 Image feature point detection method, device, medium, and apparatus
CN115170893A (en) * 2022-08-29 2022-10-11 荣耀终端有限公司 Training method of common-view gear classification network, image sorting method and related equipment
CN115170893B (en) * 2022-08-29 2023-01-31 荣耀终端有限公司 Training method of common-view gear classification network, image sorting method and related equipment
CN115860091A (en) * 2023-02-15 2023-03-28 武汉图科智能科技有限公司 Depth feature descriptor learning method based on orthogonal constraint
CN115860091B (en) * 2023-02-15 2023-04-28 武汉图科智能科技有限公司 Depth feature descriptor learning method based on orthogonal constraint
CN116774154A (en) * 2023-08-23 2023-09-19 吉林大学 Radar signal sorting method
CN116774154B (en) * 2023-08-23 2023-10-31 吉林大学 Radar signal sorting method
CN116881430A (en) * 2023-09-07 2023-10-13 北京上奇数字科技有限公司 Industrial chain identification method and device, electronic equipment and readable storage medium
CN116881430B (en) * 2023-09-07 2023-12-12 北京上奇数字科技有限公司 Industrial chain identification method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113361542B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN113361542B (en) Local feature extraction method based on deep learning
WO2022002150A1 (en) Method and device for constructing visual point cloud map
CN109684924B (en) Face living body detection method and device
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN111401384B (en) Transformer equipment defect image matching method
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN108776975A (en) Visual tracking method based on semi-supervised feature and filter joint learning
CN111709980A (en) Multi-scale image registration method and device based on deep learning
CN108898269A (en) Electric power image-context impact evaluation method based on measurement
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111898566B (en) Attitude estimation method, attitude estimation device, electronic equipment and storage medium
CN111523586B (en) Noise-aware-based full-network supervision target detection method
CN109993116B (en) Pedestrian re-identification method based on mutual learning of human bones
CN113011359B (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN117237858B (en) Loop detection method
Amiri et al. RASIM: a novel rotation and scale invariant matching of local image interest points
CN114120013A (en) Infrared and RGB cross-modal feature point matching method
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN107291813B (en) Example searching method based on semantic segmentation scene
CN117351194A (en) Graffiti type weak supervision significance target detection method based on complementary graph inference network
CN113158870B (en) Antagonistic training method, system and medium of 2D multi-person gesture estimation network
CN114842506A (en) Human body posture estimation method and system
CN109146861A (en) A kind of improved ORB feature matching method
CN110503061B (en) Multi-feature-fused multi-factor video occlusion area detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant