CN115984587A - Image matching method combining mixed-scale feature descriptors and neighbor consistency - Google Patents

Image matching method combining mixed-scale feature descriptors and neighbor consistency

Info

Publication number
CN115984587A
Authority
CN
China
Prior art keywords
feature
matching
attention
descriptor
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211500472.4A
Other languages
Chinese (zh)
Inventor
Du Songlin (杜松林)
Li Dongyue (李东岳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute Of Southeast University
Southeast University
Original Assignee
Shenzhen Institute Of Southeast University
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute Of Southeast University, Southeast University filed Critical Shenzhen Institute Of Southeast University
Priority to CN202211500472.4A priority Critical patent/CN115984587A/en
Publication of CN115984587A publication Critical patent/CN115984587A/en
Pending legal-status Critical Current

Abstract

The invention discloses an image matching method combining mixed-scale feature descriptors and neighbor consistency. Features are passed through feature description networks on separate branches, and the resulting single-scale and multi-scale feature descriptors are concatenated along the feature dimension to produce mixed-scale feature descriptors; the mixed-scale descriptors are fed into an optimal-transport matching layer to obtain an initial assignment matrix. The initial matching point pairs are then passed through a shared-weight graph neural network that refines the assignment matrix to yield the final matches. By fusing single-scale and multi-scale descriptors, the mixed descriptors remain robust to a variety of geometric deformations while staying highly distinctive; at the same time, a geometric prior is used to remove incorrect matching point pairs, finally achieving highly accurate matching.

Description

Image matching method combining mixed-scale feature descriptors and neighbor consistency
Technical Field
The invention belongs to the technical field of computer vision based on deep learning, and mainly relates to an image matching method combining mixed-scale feature descriptors and neighbor consistency.
Background
Image feature matching establishes point-to-point correspondences between two-dimensional views of the same three-dimensional scene, and is the foundation of many downstream three-dimensional computer vision tasks, including three-dimensional reconstruction, visual localization, structure from motion (SfM), and simultaneous localization and mapping (SLAM). Given a pair of images, the conventional feature matching pipeline consists of (1) feature detection, (2) feature description, (3) feature matching, and (4) outlier rejection.
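For context only, this conventional pipeline can be sketched in a few lines of OpenCV, using SIFT for detection and description, a ratio test for matching, and RANSAC on the fundamental matrix for outlier rejection; file names and thresholds are illustrative, and this classical baseline is not part of the claimed method.

```python
# Minimal sketch of the classical pipeline: (1) detection, (2) description,
# (3) matching, (4) outlier rejection. Paths and thresholds are illustrative.
import cv2
import numpy as np

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # (1) detection + (2) description
kp2, des2 = sift.detectAndCompute(img2, None)

knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)    # (3) matching
good = [m for m, n in knn if m.distance < 0.75 * n.distance]  # Lowe ratio test

src = np.float32([kp1[m.queryIdx].pt for m in good])
dst = np.float32([kp2[m.trainIdx].pt for m in good])
F, inliers = cv2.findFundamentalMat(src, dst, cv2.FM_RANSAC, 3.0)  # (4) RANSAC
```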
Early feature matching methods relied on hand-designed feature point extractors and descriptors, with some success. In recent years, deep learning methods have adopted a data-driven strategy and obtain descriptors that are more robust to changes in illumination and viewpoint; convolutional neural networks were first adopted as the tool for detecting and describing feature points. More recently, in order to enlarge the receptive field and aggregate wider context, Transformers have been widely used in feature matching. Detector-free methods tend first to establish dense matches between views and then to refine the extracted reliable matches. However, fine-grained detail is lost when the features extracted by a convolutional neural network undergo multiple layers of downsampling, so correct matches cannot be established on small objects in the scene. How to learn descriptors that retain rich fine-grained detail while remaining robust to a variety of geometric deformations has become an urgent problem for those skilled in the art.
Disclosure of Invention
Aiming at the above defects of detector-free methods in the prior art, the invention provides an image matching method combining mixed-scale feature descriptors and neighbor consistency. By fusing single-scale and multi-scale feature descriptors, the invention avoids the loss of detail caused by the downsampling operations in a convolutional neural network; at the same time, neighbor consistency is taken into account to guarantee the geometric consistency of the matches, finally achieving highly accurate matching.
In order to achieve this purpose, the invention adopts the following technical scheme: an image matching method combining mixed-scale feature descriptors and neighbor consistency, in which the images pass sequentially through networks based on mixed convolution-attention and on enhanced self-attention, the feature descriptors of different scales are concatenated along the feature dimension to obtain an initial assignment matrix, and the initial matching point pairs are passed through a shared-weight graph neural network that corrects the assignment matrix to realize image matching.
As an improvement of the invention, the method comprises the following steps:
s1, feature extraction: extracting features at different resolutions, through an FPN network, from input original pictures of the same scene shot from different perspectives; the resulting feature maps carry different spatial resolutions and semantic information, and the feature maps at 1/2 and 1/8 of the original resolution are used for the feature description of the next steps (a minimal sketch of this extraction follows the step list);
s2, single-scale feature description: after position encoding, the 1/8-size feature map obtained in step S1 is input into a neural network based on mixed convolution and attention to obtain single-scale feature descriptors; a convolution branch is added to the hybrid self-attention layer of this network while the cross-attention layer is kept unchanged; the convolution branch of the hybrid self-attention layer restores the local geometric structure of the original image, the attention branch performs information interaction within each feature, and the cross-attention layer realizes information interaction between different features and updates the features at each layer;
s3, multi-scale feature description: the original pictures shot from different viewpoints in step S1 are input into a network based on enhanced self-attention, which outputs multi-scale feature descriptors; in the enhanced self-attention, the key matrix (K) and the value matrix (V) are downsampled at different ratios in different self-attention heads, and each self-attention head performs information transfer of features at a different scale, generating the multi-scale feature descriptors;
s4, fusing features of different scales: concatenating the single-scale feature descriptors obtained in step S2 and the multi-scale feature descriptors obtained in step S3 along the feature dimension;
s5, inputting the mixed-scale descriptors obtained in step S4 into an optimal matching layer to obtain an initial assignment matrix, and selecting initial matching point pairs based on a set threshold;
s6, filtering outliers by neighbor consistency: modeling the initial matching point pairs obtained in step S5 as a graph structure, inputting it into a shared-weight graph neural network, and using the output of the graph neural network to correct the initial assignment matrix and obtain new matching point pairs;
s7, match refinement: inputting the 1/2-size feature map obtained in step S1 and the mixed descriptors obtained in step S4 into a fully-connected neural network to obtain an enhanced 1/2-size feature map; inputting the resulting feature map and the pixel-accurate new matching point pairs obtained in step S6 into a matching refinement network and outputting the final matches at sub-pixel accuracy, whereby a complete image matching model is constructed and image matching is realized.
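As referenced in step S1, a minimal PyTorch sketch of the two-resolution extraction is given below. The backbone depth, channel widths, and grayscale input are illustrative assumptions; the patent specifies only that an FPN produces 1/2- and 1/8-resolution maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Illustrative FPN returning 1/2- and 1/8-resolution feature maps."""
    def __init__(self, dims=(64, 128, 192, 256)):
        super().__init__()
        d1, d2, d3, d4 = dims
        self.c1 = nn.Sequential(nn.Conv2d(1, d1, 3, 2, 1), nn.ReLU())   # 1/2
        self.c2 = nn.Sequential(nn.Conv2d(d1, d2, 3, 2, 1), nn.ReLU())  # 1/4
        self.c3 = nn.Sequential(nn.Conv2d(d2, d3, 3, 2, 1), nn.ReLU())  # 1/8
        self.lat1 = nn.Conv2d(d1, d4, 1)   # lateral 1x1 projections
        self.lat3 = nn.Conv2d(d3, d4, 1)

    def forward(self, x):
        f2 = self.c1(x)
        f8 = self.c3(self.c2(f2))
        coarse = self.lat3(f8)                       # 1/8 map, used in S2/S3
        up = F.interpolate(coarse, scale_factor=4.0,
                           mode="bilinear", align_corners=False)
        fine = self.lat1(f2) + up                    # 1/2 map, used in S7
        return fine, coarse

fine, coarse = TinyFPN()(torch.randn(1, 1, 840, 840))  # (1,256,420,420), (1,256,105,105)
```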
As an improvement of the present invention, in step S2, the 1/8-size feature maps are position-encoded and rearranged into one-dimensional tensors, and the single-scale feature descriptors are obtained via the hybrid self-attention layers (fusing convolution and self-attention) and the cross-attention layers.
As an improvement of the present invention, the training process of the neural network based on mixed convolution and attention in step S2 is specifically as follows:
the hybrid self-attention mechanism and the cross-attention mechanism are used alternately at different layers of the network; when the hybrid self-attention mechanism is used, the similarity between pixels is learned within one feature map; when the cross-attention mechanism is used, the similarity of pixels between the two feature maps is learned; finally, the information passed between network layers is obtained through one layer of a fully-connected neural network.
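A minimal PyTorch sketch of this alternation is given below. The patent does not specify how the convolution branch and the attention branch are fused inside the hybrid self-attention layer; concatenation followed by a small feed-forward network is an assumption, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class HybridSelfAttention(nn.Module):
    """Attention branch (intra-feature interaction) plus a depthwise-conv
    branch (local geometric structure); fusion by concat + FFN is assumed."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, x, hw):                       # x: (B, H*W, C)
        h, w = hw
        a, _ = self.attn(x, x, x)                   # attention branch
        c = self.conv(x.transpose(1, 2).reshape(x.size(0), -1, h, w))
        c = c.flatten(2).transpose(1, 2)            # conv branch, back to (B, H*W, C)
        return x + self.ffn(torch.cat([a, c], dim=-1))

class CrossAttention(nn.Module):
    """Queries from one image, keys/values from the other."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, y):
        out, _ = self.attn(x, y, y)
        return x + out

# Alternation over layers, as described in the text:
#   xA = hybrid(xA, hw); xB = hybrid(xB, hw)   # similarity within a feature map
#   xA = cross(xA, xB);  xB = cross(xB, xA)    # similarity between feature maps
```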
As another improvement of the present invention, the step S3 further includes:
s31: the key matrix (K) and the value matrix (V) are downsampled at different ratios in different self-attention heads,
K_i = MTA(XW_i^K, r_i),
V_i = MTA(XW_i^V, r_i),
V_i = V_i + LE(V_i),
where X denotes the input feature, W_i^K and W_i^V denote linear mapping matrices, r_i denotes the downsampling ratio of the i-th feature head, MTA(·) denotes the multi-scale aggregation operation, and LE(·) is a convolutional neural network;
s32: information transfer is performed with the query matrix (Q) and the key matrix (K) and value matrix (V) obtained in step S31,
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_h) V_i,
where d_h denotes the feature dimension of each feature head.
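A sketch of one enhanced self-attention head follows. The internals of MTA(·) are not given in the text, so the sketch approximates it with strided average pooling; LE(·) is realized as a depthwise convolution, and all sizes are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHead(nn.Module):
    """One head: K and V downsampled by this head's ratio r_i, V locally
    enhanced by LE; MTA is approximated here by average pooling."""
    def __init__(self, dim=128, d_h=16, r_i=2):
        super().__init__()
        self.q = nn.Linear(dim, d_h)
        self.k = nn.Linear(dim, d_h)
        self.v = nn.Linear(dim, d_h)
        self.le = nn.Conv2d(d_h, d_h, 3, padding=1, groups=d_h)  # LE(.)
        self.r, self.d_h = r_i, d_h

    def forward(self, x, hw):                        # x: (B, H*W, dim)
        h, w = hw
        B = x.size(0)
        def down(t):                                 # stand-in for MTA(., r_i)
            t = t.transpose(1, 2).reshape(B, self.d_h, h, w)
            return F.avg_pool2d(t, self.r, self.r)
        Q = self.q(x)                                # (B, HW, d_h)
        K = down(self.k(x)).flatten(2).transpose(1, 2)
        Vm = down(self.v(x))
        Vm = Vm + self.le(Vm)                        # V_i = V_i + LE(V_i)
        V = Vm.flatten(2).transpose(1, 2)
        attn = torch.softmax(Q @ K.transpose(1, 2) / math.sqrt(self.d_h), -1)
        return attn @ V                              # softmax(Q K^T / sqrt(d_h)) V
```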
As another improvement of the present invention, the concatenation along the feature dimension in step S4 is specifically: the 256-dimensional single-scale feature descriptors and the 128-dimensional multi-scale feature descriptors are concatenated along the feature dimension to obtain 384-dimensional feature descriptors.
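In code this fusion is a single concatenation; the batch and point counts below are illustrative.

```python
import torch

desc_single = torch.randn(1, 4800, 256)  # single-scale descriptors from S2
desc_multi = torch.randn(1, 4800, 128)   # multi-scale descriptors from S3
desc_mixed = torch.cat([desc_single, desc_multi], dim=-1)  # (1, 4800, 384)
```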
As another improvement of the present invention, step S5 is specifically: first, the similarity matrix between the two mixed descriptors F^A and F^B is computed,
S(i, j) = ⟨F^A_i, F^B_j⟩ / τ,
where τ is a constant and ⟨·,·⟩ denotes the inner product; the similarity matrix serves as the cost matrix of a partial assignment problem, and solving this partial assignment problem yields the optimal confidence assignment matrix, from which the initial matches are obtained.
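A minimal sketch of the optimal matching layer follows. Iterated row/column normalization (Sinkhorn-style) is one standard way to approximate the assignment solution; a full implementation of a *partial* assignment would also add a dustbin row and column, which is omitted here, and τ, the iteration count, and the confidence threshold are illustrative.

```python
import torch

def assignment_matrix(descA, descB, tau=0.1, iters=20):
    """S(i,j) = <f_i, f_j> / tau, then Sinkhorn-style normalization to
    approximate the optimal confidence assignment matrix (sketch only)."""
    logP = torch.einsum("nd,md->nm", descA, descB) / tau
    for _ in range(iters):                 # alternate row/column normalization
        logP = logP - torch.logsumexp(logP, dim=1, keepdim=True)
        logP = logP - torch.logsumexp(logP, dim=0, keepdim=True)
    return logP.exp()

# Initial matches: mutual nearest neighbours above a set threshold.
P = assignment_matrix(torch.randn(500, 384), torch.randn(600, 384))
mutual = (P == P.max(1, keepdim=True).values) & (P == P.max(0, keepdim=True).values)
matches = ((P > 0.2) & mutual).nonzero()   # (k, 2) index pairs
```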
As a further improvement of the present invention, in step S6, the sparse descriptors of the corresponding point pairs are extracted and a sparse similarity matrix P is computed by inner products; the correspondence of the point sets between the images can be regarded as the correspondence of nodes in a graph structure, from which the node matrices R_A, R_B and the edge matrices E_A, E_B are constructed, where each node retains only the edges to the two nodes most similar to it; via a graph neural network with shared parameters,
d_A = Ψ(R_A, E_A),
d_B = Ψ(R_B, E_B),
where Ψ is the graph neural network; the difference between d_A and d_B can be used to correct the initial assignment matrix, obtaining new pixel-accurate matches that satisfy neighbor consistency.
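The architecture of Ψ is not detailed in the text; the sketch below uses one round of message passing over each node's two most similar neighbors, with shared weights applied to both images' graphs. Dimensions and the final consistency score are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MatchGNN(nn.Module):
    """Shared-weight graph network Psi; R holds node (sparse descriptor)
    features, E the indices of each node's two most similar neighbours."""
    def __init__(self, dim=384):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, R, E):
        nbr = R[E]                                            # (N, 2, dim)
        own = R.unsqueeze(1).expand_as(nbr)
        return R + self.msg(torch.cat([own, nbr], -1)).mean(1)

def two_nn_edges(R):
    """Each node keeps only edges to its two most similar other nodes."""
    sim = R @ R.t()
    sim.fill_diagonal_(float("-inf"))
    return sim.topk(2, dim=1).indices                         # (N, 2)

RA = torch.randn(300, 384)   # sparse descriptors of matched points, image A
RB = torch.randn(300, 384)   # corresponding points in image B (same row order)
psi = MatchGNN()
dA = psi(RA, two_nn_edges(RA))            # d_A = Psi(R_A, E_A)
dB = psi(RB, two_nn_edges(RB))            # d_B = Psi(R_B, E_B)
consistency = (dA - dB).norm(dim=-1)      # large values flag inconsistent pairs
```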
As a further improvement of the present invention, in step S7, on the enhanced 1/2-original-size feature map a 5 × 5 local window is cropped around each matching point; after the windows are serialized, local fine-grained descriptors are obtained through the single-scale feature description network of step S2, and the peak response of the descriptor at each matching point over the local fine-grained descriptor in the other image is computed, giving the final matching result at sub-pixel accuracy.
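The refinement network itself is not spelled out; the sketch below implements one common reading of the "peak response" step, correlating the center descriptor of each window in one image against the corresponding 5 × 5 window in the other and taking the soft-argmax (expectation over the response) as the sub-pixel correction. It assumes matches lie at least half a window from the image border and omits the fine-grained description network of step S2.

```python
import torch

def refine_matches(featA, featB, ptsA, ptsB, win=5):
    """featA/featB: (C, H, W) enhanced 1/2-resolution maps; ptsA/ptsB: (K, 2)
    integer (x, y) match coordinates. Returns sub-pixel points in image B."""
    half = win // 2
    refined = []
    for (xa, ya), (xb, yb) in zip(ptsA.tolist(), ptsB.tolist()):
        wa = featA[:, ya - half:ya + half + 1, xa - half:xa + half + 1]
        wb = featB[:, yb - half:yb + half + 1, xb - half:xb + half + 1]
        center = wa[:, half, half]                       # query descriptor
        resp = torch.einsum("c,chw->hw", center, wb)     # response map
        prob = resp.flatten().softmax(0).view(win, win)
        ys, xs = torch.meshgrid(torch.arange(win), torch.arange(win),
                                indexing="ij")
        dy = (prob * (ys - half)).sum()                  # expectation over the
        dx = (prob * (xs - half)).sum()                  # response ("soft" peak)
        refined.append((xb + dx.item(), yb + dy.item()))
    return refined
```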
Compared with the prior art, the invention has the following beneficial effects:
1. the invention improves the detector-free feature matching method and supplements the single-scale feature descriptors with local geometric structure information.
2. The invention combines the mixed scale feature descriptors, thereby not only enhancing the significance of the feature descriptors, but also keeping the robustness of the descriptors to illumination and visual angle transformation.
3. The invention designs a novel outlier filtering method, which is used for detecting whether the obtained initial matching has neighbor consistency or not, enhancing the reliability of the matching result and having wide application prospects in the fields of three-dimensional reconstruction, visual positioning, navigation and the like.
Drawings
FIG. 1 is a flow chart of the steps of the method of the present invention;
fig. 2 is a schematic diagram of the picture matching obtained with the method of the present invention in embodiment 2.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
Example 1
An image matching method combining mixed-scale feature descriptors and neighbor consistency, as shown in fig. 1, comprises the following steps:
s1, feature extraction: extracting features at different resolutions, through an FPN network, from input pictures of the same scene shot from different perspectives; the extracted feature maps carry different spatial resolutions and semantic information, and the feature maps at 1/2 and 1/8 of the original resolution are used for the feature description of the next steps;
s2, single-scale feature description: after position encoding, the 1/8-size feature map obtained in step S1 is input into a neural network based on mixed convolution and attention to obtain single-scale feature descriptors; a convolution branch is added to the hybrid self-attention layer of this network while the cross-attention layer is kept unchanged; the convolution branch of the hybrid self-attention layer restores the local geometric structure of the original image, the attention branch performs information interaction within each feature, and the cross-attention layer realizes information interaction between different features and updates the features at each layer;
s3, multi-scale feature description: the original pictures shot from different viewpoints in step S1 are input into a network based on enhanced self-attention, which outputs multi-scale feature descriptors; in the enhanced self-attention, the key matrix (K) and the value matrix (V) are downsampled at different ratios in different self-attention heads, and each self-attention head performs information transfer of features at a different scale, generating the multi-scale feature descriptors;
s4, fusing features of different scales: concatenating the single-scale feature descriptors obtained in step S2 and the multi-scale feature descriptors obtained in step S3 along the feature dimension;
s5, inputting the mixed-scale descriptors obtained in step S4 into an optimal matching layer to obtain an initial assignment matrix, and selecting initial matching point pairs based on a set threshold;
s6, neighbor consistency filtering of outliers: the sparse descriptors of the corresponding point pairs are extracted and a sparse similarity matrix P is computed by inner products; regarding the correspondence of the point sets between the images as the correspondence of nodes in a graph structure, the node matrices R_A, R_B and the edge matrices E_A, E_B are constructed, where each node retains only the edges to the two nodes most similar to it; via a graph neural network with shared parameters,
d_A = Ψ(R_A, E_A),
d_B = Ψ(R_B, E_B),
where Ψ is the graph neural network; the difference between d_A and d_B can be used to correct the initial assignment matrix, obtaining new pixel-accurate matches that satisfy neighbor consistency.
s7, inputting the 1/2-size feature map obtained in step S1 and the mixed descriptors obtained in step S4 into a fully-connected neural network to obtain an enhanced 1/2-size feature map; inputting the resulting feature map and the pixel-accurate new matching point pairs obtained in step S6 into a matching refinement network and outputting the final matches at sub-pixel accuracy.
Example 2
An image matching method combining mixed-scale feature descriptors and neighbor consistency comprises the following steps:
s1: and performing feature extraction on the input image pair, and obtaining feature maps with different resolution sizes by the input image pair through a feature extraction network.
The experimental dataset is MegaDepth, which consists of one million internet images of 196 different outdoor scenes.
Each picture is first cropped to 840 × 840 and converted to grayscale format as input.
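The preprocessing of this embodiment can be reproduced as below; a centre crop is one interpretation of "cropped to 840 × 840", and the helper name is illustrative.

```python
import cv2

def load_input(path, size=840):
    """Grayscale conversion and centre crop to size x size (one reading of
    the embodiment's preprocessing; assumes the image is at least that big)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]
```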
S2: based on the neural network with mixed convolution and attention, the position-encoded 1/8-original-size feature map obtained by the feature extraction of step S1 is used as input; the network outputs single-scale feature descriptors of dimension 256.
S3: the network based on enhanced self-attention is trained, using the original images of step S1 as input; the output multi-scale descriptors have dimension 128;
S4: the single-scale feature descriptors obtained in step S2 and the multi-scale feature descriptors obtained in step S3 are concatenated along the feature dimension.
S5: the mixed-scale descriptors obtained in step S4 are input into the optimal matching layer to obtain an initial assignment matrix, and initial matching point pairs are selected based on a set threshold.
First, the similarity matrix between the two mixed descriptors F^A and F^B is computed,
S(i, j) = ⟨F^A_i, F^B_j⟩ / τ,
where τ is a constant and ⟨·,·⟩ denotes the inner product; the similarity matrix serves as the cost matrix of a partial assignment problem, and solving this partial assignment problem yields the optimal confidence assignment matrix, from which the initial matches are obtained.
S6: the sparse descriptors of the corresponding point pairs are extracted and a sparse similarity matrix P is computed by inner products; regarding the correspondence of the point sets between the images as the correspondence of nodes in a graph structure, the node matrices R_A, R_B and the edge matrices E_A, E_B are constructed, where each node retains only the edges to the two nodes most similar to it; via a graph neural network with shared parameters,
d_A = Ψ(R_A, E_A),
d_B = Ψ(R_B, E_B),
where Ψ is the graph neural network; the difference between d_A and d_B can be used to correct the initial assignment matrix, obtaining new pixel-accurate matches that satisfy neighbor consistency.
S7: the 1/2-size feature map obtained in step S1 and the mixed descriptors obtained in step S4 are input into a fully-connected neural network to obtain an enhanced 1/2-size feature map; the resulting feature map and the pixel-accurate new matching point pairs obtained in step S6 are input into the matching refinement network, which outputs the final matches at sub-pixel accuracy. FIG. 2 shows all the matching results obtained with the method of the invention; the method accurately matches small-scale objects in the image and is strongly robust to scale and viewpoint changes.
In summary, an image matching method combining mixed-scale feature descriptors and neighbor consistency is provided; the image matching model trained with this method can run on a computer or other device and outputs a sub-pixel-accurate set of corresponding points from an input pair of original images, and it is widely applicable in fields such as three-dimensional reconstruction, visual localization and navigation, and multi-object tracking.
It should be noted that the above contents only illustrate the technical idea of the present invention and do not thereby limit its scope of protection; it will be obvious to those skilled in the art that several modifications and refinements can be made without departing from the principle of the present invention, and these modifications and refinements fall within the protection scope of the claims of the present invention.

Claims (9)

1. An image matching method combining mixed-scale feature descriptors and neighbor consistency, characterized in that: the method passes sequentially through networks based on mixed convolution-attention and on enhanced self-attention, concatenates the feature descriptors of different scales along the feature dimension to obtain an initial assignment matrix, and, after the initial matching point pairs pass through a shared-weight graph neural network, corrects the assignment matrix to realize image matching.
2. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 1, characterized by comprising the following steps:
s1, feature extraction: extracting features at different resolutions, through an FPN network, from input original pictures of the same scene shot from different perspectives, the resulting feature maps carrying different spatial resolutions and semantic information, wherein the feature maps at 1/2 and 1/8 of the original resolution are used for the feature description of the next steps;
s2, single-scale feature description: after position encoding, the 1/8-size feature map obtained in step S1 is input into a neural network based on mixed convolution and attention to obtain single-scale feature descriptors; a convolution branch is added to the hybrid self-attention layer of this network while the cross-attention layer is kept unchanged; the convolution branch of the hybrid self-attention layer restores the local geometric structure of the original image, the attention branch performs information interaction within each feature, and the cross-attention layer realizes information interaction between different features and updates the features at each layer;
s3, multi-scale feature description: the original pictures shot from different viewpoints in step S1 are input into a network based on enhanced self-attention, which outputs multi-scale feature descriptors; in the enhanced self-attention, the key matrix (K) and the value matrix (V) are downsampled at different ratios in different self-attention heads, and each self-attention head performs information transfer of features at a different scale, generating the multi-scale feature descriptors;
s4, fusing features of different scales: concatenating the single-scale feature descriptors obtained in step S2 and the multi-scale feature descriptors obtained in step S3 along the feature dimension;
s5, inputting the mixed-scale descriptors obtained in step S4 into an optimal matching layer to obtain an initial assignment matrix, and selecting initial matching point pairs based on a set threshold;
s6, filtering outliers by neighbor consistency: modeling the initial matching point pairs obtained in step S5 as a graph structure, inputting it into a shared-weight graph neural network, and using the output of the graph neural network to correct the initial assignment matrix and obtain new matching point pairs;
s7, match refinement: inputting the 1/2-size feature map obtained in step S1 and the mixed descriptors obtained in step S4 into a fully-connected neural network to obtain an enhanced 1/2-size feature map; inputting the resulting feature map and the pixel-accurate new matching point pairs obtained in step S6 into a matching refinement network and outputting the final matches at sub-pixel accuracy, whereby a complete image matching model is constructed and image matching is realized.
3. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 2, characterized in that: in step S2, the 1/8-size feature maps are position-encoded and rearranged into one-dimensional tensors, and the single-scale feature descriptors are obtained via the hybrid self-attention layers (fusing convolution and self-attention) and the cross-attention layers.
4. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 3, characterized in that the training process of the neural network based on mixed convolution and attention in step S2 is specifically as follows: the hybrid self-attention mechanism and the cross-attention mechanism are used alternately at different layers of the network; when the hybrid self-attention mechanism is used, the similarity between pixels is learned within one feature map; when the cross-attention mechanism is used, the similarity of pixels between the two feature maps is learned; finally, the information passed between network layers is obtained through one layer of a fully-connected neural network.
5. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 3, characterized in that step S3 further comprises:
s31: the key matrix (K) and the value matrix (V) are downsampled at different ratios in different self-attention heads,
K_i = MTA(XW_i^K, r_i),
V_i = MTA(XW_i^V, r_i),
V_i = V_i + LE(V_i),
where X denotes the input feature, W_i^K and W_i^V denote linear mapping matrices, r_i denotes the downsampling ratio of the i-th feature head, MTA(·) denotes the multi-scale aggregation operation, and LE(·) is a convolutional neural network;
s32: information transfer is performed with the query matrix (Q) and the key matrix (K) and value matrix (V) obtained in step S31,
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_h) V_i,
where d_h denotes the feature dimension of each feature head.
6. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 4 or 5, characterized in that the concatenation along the feature dimension in step S4 is specifically: the 256-dimensional single-scale feature descriptors and the 128-dimensional multi-scale feature descriptors are concatenated along the feature dimension to obtain 384-dimensional feature descriptors.
7. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 6, characterized in that step S5 is specifically: first, the similarity matrix between the two mixed descriptors F^A and F^B is computed,
S(i, j) = ⟨F^A_i, F^B_j⟩ / τ,
where τ is a constant and ⟨·,·⟩ denotes the inner product; the similarity matrix serves as the cost matrix of a partial assignment problem, and solving this partial assignment problem yields the optimal confidence assignment matrix, from which the initial matches are obtained.
8. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 7, characterized in that: in step S6, the sparse descriptors of the corresponding point pairs are extracted and a sparse similarity matrix P is computed by inner products; the correspondence of the point sets between the images can be regarded as the correspondence of nodes in a graph structure, from which the node matrices R_A, R_B and the edge matrices E_A, E_B are constructed, where each node retains only the edges to the two nodes most similar to it; via a graph neural network with shared parameters,
d_A = Ψ(R_A, E_A),
d_B = Ψ(R_B, E_B),
where Ψ is the graph neural network; the difference between d_A and d_B can be used to correct the initial assignment matrix, obtaining new pixel-accurate matches that satisfy neighbor consistency.
9. The image matching method combining mixed-scale feature descriptors and neighbor consistency according to claim 8, characterized in that: in step S7, on the enhanced 1/2-original-size feature map a 5 × 5 local window is cropped around each matching point; after the windows are serialized, local fine-grained descriptors are obtained through the single-scale feature description network of step S2, and the peak response of the descriptor at each matching point over the local fine-grained descriptor in the other image is computed, giving the final matching result at sub-pixel accuracy.
CN202211500472.4A 2022-11-28 2022-11-28 Image matching method for combining consistency of mixed scale feature descriptors and neighbors Pending CN115984587A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211500472.4A CN115984587A (en) 2022-11-28 2022-11-28 Image matching method for combining consistency of mixed scale feature descriptors and neighbors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211500472.4A CN115984587A (en) 2022-11-28 2022-11-28 Image matching method for combining consistency of mixed scale feature descriptors and neighbors

Publications (1)

Publication Number Publication Date
CN115984587A (en) 2023-04-18

Family

ID=85965489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211500472.4A Pending CN115984587A (en) 2022-11-28 2022-11-28 Image matching method for combining consistency of mixed scale feature descriptors and neighbors

Country Status (1)

Country Link
CN (1) CN115984587A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129228A (en) * 2023-04-19 2023-05-16 中国科学技术大学 Training method of image matching model, image matching method and device thereof


Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN111339903B (en) Multi-person human body posture estimation method
CN111968121B (en) Three-dimensional point cloud scene segmentation method based on instance embedding and semantic fusion
CN114969405B (en) Cross-modal image-text mutual detection method
CN111259945B (en) Binocular parallax estimation method introducing attention map
CN111860651B (en) Monocular vision-based semi-dense map construction method for mobile robot
CN113283525B (en) Image matching method based on deep learning
CN113345082B (en) Characteristic pyramid multi-view three-dimensional reconstruction method and system
CN112560865B (en) Semantic segmentation method for point cloud under outdoor large scene
CN111127538A (en) Multi-view image three-dimensional reconstruction method based on convolution cyclic coding-decoding structure
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN115984587A (en) Image matching method for combining consistency of mixed scale feature descriptors and neighbors
CN110348299B (en) Method for recognizing three-dimensional object
CN115482268A (en) High-precision three-dimensional shape measurement method and system based on speckle matching network
Hughes et al. A semi-supervised approach to SAR-optical image matching
CN112906675B (en) Method and system for detecting non-supervision human body key points in fixed scene
CN110751271A (en) Image traceability feature characterization method based on deep neural network
CN112489186A (en) Automatic driving binocular data perception algorithm
CN111950476A (en) Deep learning-based automatic river channel ship identification method in complex environment
Chen et al. Monocular image depth prediction without depth sensors: An unsupervised learning method
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
CN115564888A (en) Visible light multi-view image three-dimensional reconstruction method based on deep learning
CN112419387A (en) Unsupervised depth estimation method for tomato plant image in sunlight greenhouse
CN114612734B (en) Remote sensing image feature matching method and device, storage medium and computer equipment
Jung et al. Single image depth estimation with integration of parametric learning and non-parametric sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination