CN116486101B - Image feature matching method based on window attention - Google Patents

Image feature matching method based on window attention

Info

Publication number
CN116486101B
Authority
CN
China
Prior art keywords
window
attention
windows
features
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310268014.0A
Other languages
Chinese (zh)
Other versions
CN116486101A (en)
Inventor
廖赟
段清
邸一得
刘俊晖
周豪
朱开军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Lanyi Network Technology Co ltd
Yunnan University YNU
Original Assignee
Yunnan Lanyi Network Technology Co ltd
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Lanyi Network Technology Co ltd, Yunnan University YNU filed Critical Yunnan Lanyi Network Technology Co ltd
Priority to CN202310268014.0A priority Critical patent/CN116486101B/en
Publication of CN116486101A publication Critical patent/CN116486101A/en
Application granted granted Critical
Publication of CN116486101B publication Critical patent/CN116486101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image feature matching method based on window attention, which uses an MBConv module to perform preliminary extraction and downsampling on the features of a group of images; window partitioning is carried out on the features by using a window attention module; different windows of the group of images are compared, and the z windows closest to a target window are found; the pixel level features of the similar windows and the window level features of the remaining windows are combined, and final feature extraction is carried out; the attention features are processed using a bidirectional softmax, the model is trained, and feature matching is achieved.

Description

Image feature matching method based on window attention
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a window attention-based image feature matching method.
Background
Local feature matching is central to many basic computer vision tasks, including visual localization, structure from motion (SfM), simultaneous localization and mapping (SLAM), and the like. Many recent efforts have used convolutional neural network (CNN) based descriptors, such as VGG, ResNet, DenseNet and EfficientNet. For two images to be matched, most existing matching methods are divided into three independent stages: feature detection, feature description and feature matching. In the detection stage, salient points such as corner points are first detected as keypoints in each image, and local descriptors are then extracted in the neighborhood of these keypoints. The feature detection and description stages produce two sets of keypoints with descriptors, and the point-to-point correspondences between the two sets are then found by nearest-neighbor search or a more complex matching algorithm; such models are not efficient.
With the advent and great success of attention modules in the field of natural language processing, researchers have also pursued their application and study in the field of computer vision. Since the introduction of ViT, a large number of such methods have been applied to various computer vision tasks, such as image classification, object detection, feature matching and stereo matching. These methods have contributed significantly to many areas of computer vision, but still have drawbacks in performance and efficiency: methods using linear attention perform poorly, and methods using dot-product attention are inefficient. Constructing a model with both excellent performance and high efficiency is therefore a great challenge.
Disclosure of Invention
The embodiment of the invention aims to provide an image feature matching method based on window attention, which extends the window attention module, reduces the model computation, and improves the efficiency and performance of the window attention module, so as to solve the problem of image feature matching.
In order to solve the technical problems, the technical scheme adopted by the invention is an image feature matching method based on window attention, which comprises the following specific steps:
s1: using an MBConv module to perform preliminary extraction and downsampling on the features of a group of images;
s2: window partitioning is carried out on the features by using a window attention module;
s3: comparing different windows of a group of images, and searching z windows closest to a target window;
s4: combining the pixel level features of the similar windows and the window level features of the remaining windows, and carrying out final feature extraction;
s5: the attention features are processed using a bi-directional softmax, the matching probabilities are obtained, the model is trained, and feature matching is achieved.
Specifically, the MBConv module in S1 mainly comprises a 1×1 common convolution layer, a 1×1 depth separable convolution layer, an SE module, a 1×1 common convolution layer and a Dropout layer; the SE module consists of a global average pooling layer and two fully connected layers; the number of nodes of the first fully connected layer is a fraction of the number of channels of the feature matrix input to MBConv, and an activation function is used; the number of nodes of the second fully connected layer is equal to the number of channels of the feature matrix output by the depth separable convolution layer, and an activation function is also used.
Specifically, the first full connection layer uses a Swish activation function, and the second full connection layer uses a Sigmoid activation function.
Further, the specific steps of window partitioning the features by using the window attention module in S2 are as follows: x_1 and x_2 are the two images to be matched; x_1 and x_2 are input into the window attention module, and information is transmitted between the window attention modules; in the case of self-attention, x_1 and x_2 are the same; in the case of cross-attention, x_1 and x_2 come from different pictures; the query vector q, the key vector k and the value vector v are generated using the following formula;
wherein mapping(·) is a function that maps features onto vectors, h, w and c are the height, width and number of channels of the image, respectively, and R represents the real numbers;
according to the set window size, the images x_1 and x_2 are divided into n windows and the features within the windows are rearranged; after window partitioning, q_w, k_w and v_w are generated;
where window_partition(·) is a function that divides the image into windows of side length s, n is the number of windows, n = h×w/s²; q_w, k_w and v_w are the features rearranged after window partitioning.
Further, the step S3 of comparing different windows of a group of images and searching for the z windows closest to the target window is specifically as follows:
for x_1 and x_2, the features of the pixels within each window are averaged; the window-averaged feature vectors of the query vector q, the key vector k and the value vector v can be calculated as:
a similarity matrix SM is then designed for the window-averaged feature vectors; SM represents the similarity relation of the window-averaged feature vectors, and the higher the similarity between a window and the target window in x_1, the higher the similarity score in x_2; SM may be generated by:
the first z windows of x_1 with the highest similarity to the target window of x_2 are selected; R represents the real numbers, the window-averaged feature vectors are as defined above, T_z represents the first z windows to be extracted, and top_z_index represents the indices of the z windows with the highest similarity:
top_z_index = get_top_z_index(SM, T_z) ∈ R^(n×z)
Further, the step S4 of combining the pixel level features of the similar windows and the window level features of the remaining windows and performing the final feature extraction is specifically as follows:
the fine features of each pixel in the first z windows T_z are extracted; the fine key feature vector k_fine and the fine value feature vector v_fine are obtained from the following formula:
the pixel level features of the z windows are combined with the window level average features of all windows, where concat(·) is the concatenation (merge) function, as follows:
the final Top K window attention is then generated, with the following formula:
O = attention(q_w, K, V)
where the final query vector q_w, the final key vector K and the final value vector V are combined; O is the final Top K window attention, and attention(·) is the attention function of the Transformer.
Further, in the step S5, the attention features are processed by using a bidirectional softmax, and the matching probabilities are obtained; the matching probability P can be defined as:
P(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i
in the above formula, softmax(·) is the normalized exponential function, which expresses a multi-class result in the form of probabilities; softmax(S(i, ·))_j means that the softmax operation is carried out on all elements of the i-th row, giving a row vector that sums to 1 and forms a probability distribution; softmax(S(·, j))_i means that the softmax operation is carried out on all elements of the j-th column, giving a column vector that sums to 1 and forms a probability distribution; multiplying the two results gives the confidence matrix.
Further, in S5, the model is trained using the loss function L as follows:
in the above formula, N represents the number of samples, the summation runs over the m samples, L_m represents the probability prediction function for the m-th sample, GT_{i,j} is a label sample, and P(i, j) represents the probability that the match is correct.
The beneficial effects of the invention are as follows: the invention provides a new feature matching method that designs a window attention scheme to improve the attention mechanism in the window attention module. The improved model only needs to extract window level features, which significantly reduces the required computation and improves the efficiency of the window attention module. The method solves the problem of window attention based image feature matching, has excellent matching capability and matching accuracy, generalizes very well to a variety of data, and has high practical value. In addition, when the model is used for feature matching, the matching can be performed fully automatically simply by inputting the dataset to be matched into the trained window attention based deep learning network.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of a window attention based image feature matching method according to an embodiment of the present invention;
FIG. 2 is a general architecture diagram of a window attention based image feature matching method of an embodiment of the present invention;
FIG. 3 is a network block diagram of MBConv in accordance with an embodiment of the present invention;
FIG. 4 is a view of an image feature window partition; wherein (a) is a vector extraction diagram of window attention according to an embodiment of the present invention, (b) is a window partition diagram of window attention according to an embodiment of the present invention, (c) is a window selection diagram of window attention according to an embodiment of the present invention, and (d) is a window attention extraction diagram according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, this embodiment discloses an image feature matching method based on window attention, which realizes feature matching under various image data, and includes the following steps:
s1: the MBConv module is used to perform preliminary extraction and downsampling of features of a set of images.
As shown in fig. 2, the main model framework of the invention consists of a plurality of TKwinBlocks, each of which includes a self-attention layer and a cross-attention layer. Each attention layer in turn comprises an MBConv module and a window attention module. Compared with the conventional approach, the use of MBConv improves the activation and normalization between convolutions. The design effectively combines the MBConv and window attention modules: MBConv is placed in front of the window attention module, and the responsibility for downsampling is delegated to the depthwise convolution of MBConv so that better downsampling can be learned. After the TKwinBlocks are repeated, the features are input into the dual softmax to generate a confidence matrix, and the model is then trained.
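The composition just described can be sketched as follows. This is only a minimal structural sketch under stated assumptions: MBConv and the window attention module are passed in as placeholder sub-modules, and sharing one layer between the two images is an assumption, not something fixed by the patent text.

```python
import torch.nn as nn

class AttentionLayer(nn.Module):
    """One attention layer: MBConv placed in front of the window attention module."""
    def __init__(self, mbconv: nn.Module, window_attn: nn.Module):
        super().__init__()
        self.mbconv, self.window_attn = mbconv, window_attn

    def forward(self, x_q, x_kv):
        # MBConv performs feature extraction/downsampling before attention.
        return self.window_attn(self.mbconv(x_q), self.mbconv(x_kv))

class TKwinBlock(nn.Module):
    """One TKwinBlock: a self-attention layer followed by a cross-attention layer."""
    def __init__(self, self_layer: AttentionLayer, cross_layer: AttentionLayer):
        super().__init__()
        self.self_layer, self.cross_layer = self_layer, cross_layer

    def forward(self, x1, x2):
        # Self-attention: query and key/value come from the same image.
        x1, x2 = self.self_layer(x1, x1), self.self_layer(x2, x2)
        # Cross-attention: query and key/value come from different images.
        return self.cross_layer(x1, x2), self.cross_layer(x2, x1)
```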
As shown in fig. 3, the MBConv module is composed, in order, of a 1x1 normal convolution layer (for dimension increase, including batch normalization and a Swish activation function), a depth separable convolution layer, an SE module, a 1x1 normal convolution layer (for dimension reduction, including batch normalization), and a Dropout layer.
The depth separable convolution is formed by combining a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution. In the channel-by-channel convolution, one convolution kernel is responsible for one channel, and each channel is convolved by only one kernel, so the number of channels of the feature map produced by this step is identical to the number of input channels. The point-by-point convolution is very similar to the conventional convolution operation; its kernel size is 1×1×M, where M is the number of channels of the previous layer. This convolution weights and combines the maps of the previous step along the depth direction to generate new feature maps, and the number of output feature maps equals the number of convolution kernels. Extracting feature maps with the depth separable convolution therefore has a much lower parameter count and computational cost than the conventional convolution operation.
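A minimal PyTorch sketch of this channel-by-channel plus point-by-point decomposition; the layer sizes in the usage lines are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (channel-by-channel) conv followed by pointwise (1x1) conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch: one kernel per input channel, each channel convolved alone,
        # so the intermediate feature map keeps the input channel count.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # 1x1xM kernels (M = in_ch) recombine the maps along the depth direction;
        # the number of output maps equals the number of 1x1 kernels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 56, 56)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])
```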
The SE module consists of a global average pooling layer and two fully connected layers. The number of nodes of the first fully connected layer is a fraction of the number of channels of the feature matrix input to MBConv, and a Swish activation function is used. The number of nodes of the second fully connected layer is equal to the number of channels of the feature matrix output by the depth separable convolution layer, and a Sigmoid activation function is used.
The SE module is not a complete network but a substructure that can be embedded in other classification or detection models. Its core idea is to let the network learn feature weights from the loss, so that effective feature maps receive larger weights and ineffective or weakly effective feature maps receive smaller weights, training the model to achieve better results.
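A minimal sketch of the SE substructure described above. The reduction ratio r = 4 is an assumption; the text above only states that the first fully connected layer uses a fraction of the input MBConv channels.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE substructure: global average pooling + two fully connected layers."""
    def __init__(self, channels, r=4):  # r is an assumed reduction ratio
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc1 = nn.Linear(channels, channels // r)  # first FC, Swish activation
        self.fc2 = nn.Linear(channels // r, channels)  # second FC, Sigmoid activation
        self.swish = nn.SiLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)
        w = self.sigmoid(self.fc2(self.swish(self.fc1(w)))).view(b, c, 1, 1)
        return x * w  # larger weights for effective feature maps, smaller for weak ones

print(SqueezeExcite(64)(torch.randn(2, 64, 28, 28)).shape)  # torch.Size([2, 64, 28, 28])
```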
S2: window partitioning features using a window attention module
As shown in FIG. 4(a), the inputs to the window attention module are the two images x_1 and x_2, and information is passed between the window attention modules. In the case of self-attention, x_1 and x_2 are identical. In the case of cross-attention, x_1 and x_2 come from different pictures. The query vector q, the key vector k and the value vector v are generated using the following formulas;
wherein mapping (·) is a function mapping features onto vectors, h, w, and c are the height, width, and number of channels of the image, respectively, R represents a real number;
as shown in fig. 4 (b), the image x is set according to the set window size 1 And x 2 Divided into n windows and features within the windows are rearranged. After partitioning the window, q is generated w ,k w And v w
Where window_partition (·) is a function of dividing the image into windows of side length s, n is the number of windows, n=h×w/s 2 。q w ,k w And v w Is the rearrangement of features after window partitioning.
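The partitioning step can be sketched as below; the linear projections standing in for mapping(·), and the choice of feeding q from x_1 and k, v from x_2, are illustrative assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, s):
    """Split (B, H, W, C) features into non-overlapping s x s windows.
    Returns (B, n, s*s, C) with n = H*W / s**2."""
    B, H, W, C = x.shape
    x = x.view(B, H // s, s, W // s, s, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // s) * (W // s), s * s, C)
    return x

B, H, W, C, s = 1, 32, 32, 128, 8
proj_q, proj_k, proj_v = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
x1 = torch.randn(B, H, W, C)      # features of image x_1
x2 = torch.randn(B, H, W, C)      # features of x_2 (equal to x_1 for self-attention)

q, k, v = proj_q(x1), proj_k(x2), proj_v(x2)
q_w, k_w, v_w = (window_partition(t, s) for t in (q, k, v))
print(q_w.shape)                  # torch.Size([1, 16, 64, 128]); n = 32*32/8**2 = 16
```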
S3: and comparing different windows of a group of images, and searching for z windows closest to the target window.
As shown in FIG. 4(c), the invention first averages, for x_1 and x_2, the features of the pixels within each window; the window-averaged feature vectors of the query vector q, the key vector k and the value vector v can be calculated as:
a similarity matrix SM is then designed for the window-averaged feature vectors. SM denotes the similarity of the window-averaged feature vectors: the higher the similarity between a window and the target window in x_1, the higher the similarity score in x_2. SM may be generated by:
next, we select x 1 Intermediate and x 2 The first z windows with the highest target window similarity.
top_z_index=get_top_k_index(SM,T_z)∈R n×z
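A minimal sketch of this window-level comparison; using the dot product of the averaged vectors as the similarity measure is an assumption.

```python
import torch

def top_z_windows(q_w, k_w, z):
    """For every window of x_1, select the z windows of x_2 whose averaged
    features are most similar. q_w, k_w: (B, n, s*s, C)."""
    q_bar = q_w.mean(dim=2)                            # (B, n, C) averaged queries
    k_bar = k_w.mean(dim=2)                            # (B, n, C) averaged keys
    SM = torch.einsum('bic,bjc->bij', q_bar, k_bar)    # (B, n, n) similarity matrix
    top_z_index = SM.topk(z, dim=-1).indices           # (B, n, z)
    return SM, top_z_index

q_w, k_w = torch.randn(1, 16, 64, 128), torch.randn(1, 16, 64, 128)
SM, idx = top_z_windows(q_w, k_w, z=4)
print(SM.shape, idx.shape)   # torch.Size([1, 16, 16]) torch.Size([1, 16, 4])
```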
S4: and combining the pixel level features of the similar windows and the window level features of the remaining windows, and carrying out final feature extraction.
As shown in fig. 4(d), the first z windows have a high similarity to the target window, so fine features at the pixel level are extracted for them. The other windows have low similarity to the target window, so only window level features need to be extracted. This significantly reduces the amount of computation required and increases the efficiency of the window attention module.
The fine features of each pixel in the first z windows T_z are extracted; the fine key feature vector k_fine and the fine value feature vector v_fine are obtained from the following formula:
The pixel level features of the z windows are combined with the window level average features of all windows, where concat(·) is the concatenation (merge) function, as follows:
Finally, the final Top K window attention is generated, with the following formula:
O = attention(q_w, K, V)
The final query vector q_w, the final key vector K and the final value vector V are combined; O is the final Top K window attention, and attention(·) is the attention function of the Transformer.
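A minimal sketch of this merged Top K window attention: pixel-level (fine) keys and values from the z most similar windows are concatenated with window-level (coarse) averages of all windows, after which standard scaled dot-product attention is applied. The tensor layout and gathering details are assumptions.

```python
import torch

def topk_window_attention(q_w, k_w, v_w, top_z_index):
    """q_w, k_w, v_w: (B, n, s2, C) windowed features; top_z_index: (B, n, z)."""
    B, n, s2, C = k_w.shape
    z = top_z_index.shape[-1]

    # Pixel-level (fine) keys/values gathered from the top-z windows per query window.
    idx = top_z_index[..., None, None].expand(-1, -1, -1, s2, C)          # (B, n, z, s2, C)
    k_fine = torch.gather(k_w.unsqueeze(1).expand(-1, n, -1, -1, -1), 2, idx).reshape(B, n, z * s2, C)
    v_fine = torch.gather(v_w.unsqueeze(1).expand(-1, n, -1, -1, -1), 2, idx).reshape(B, n, z * s2, C)

    # Window-level (coarse) averages of all windows, shared by every query window.
    k_coarse = k_w.mean(dim=2).unsqueeze(1).expand(-1, n, -1, -1)         # (B, n, n, C)
    v_coarse = v_w.mean(dim=2).unsqueeze(1).expand(-1, n, -1, -1)

    # concat(.) merges fine and coarse keys/values.
    K = torch.cat([k_fine, k_coarse], dim=2)                              # (B, n, z*s2 + n, C)
    V = torch.cat([v_fine, v_coarse], dim=2)

    # O = attention(q_w, K, V): scaled dot-product attention within each query window.
    attn = torch.softmax(q_w @ K.transpose(-2, -1) / C ** 0.5, dim=-1)
    return attn @ V                                                       # (B, n, s2, C)

q_w = k_w = v_w = torch.randn(1, 16, 64, 128)
idx = torch.randint(0, 16, (1, 16, 4))    # stand-in for top_z_index
print(topk_window_attention(q_w, k_w, v_w, idx).shape)  # torch.Size([1, 16, 64, 128])
```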
S5: the attention features are processed using a bi-directional softmax, the matching probabilities are obtained, the model is trained, and feature matching is achieved.
The bi-directional softmax function applies the softmax algorithm along both dimensions to obtain the probability of a nearest-neighbor match; the match probability P can be defined as:
P(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i
In the above formula, softmax(·) is the normalized exponential function, which expresses a multi-class result in the form of probabilities; softmax(S(i, ·))_j means that the softmax operation is carried out on all elements of the i-th row, giving a row vector that sums to 1 and forms a probability distribution; softmax(S(·, j))_i means that the softmax operation is carried out on all elements of the j-th column, giving a column vector that sums to 1 and forms a probability distribution. Multiplying the two results gives the probability matrix, i.e. the confidence matrix.
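A minimal sketch of this bidirectional softmax; the confidence threshold in the usage lines is illustrative.

```python
import torch

def dual_softmax(S):
    """P(i, j) = softmax(S(i, .))_j * softmax(S(., j))_i for a score matrix S (M x N)."""
    return torch.softmax(S, dim=1) * torch.softmax(S, dim=0)   # confidence matrix

S = torch.randn(6, 8)          # e.g. similarity scores between features of x_1 and x_2
P = dual_softmax(S)
matches = (P > 0.2).nonzero()  # keep pairs whose confidence exceeds a threshold
print(P.shape, matches.shape)
```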
The model is trained using the loss function L as follows:
In the above formula, N represents the number of samples, the summation runs over the m samples, L_m represents the probability prediction function for the m-th sample, GT_{i,j} is a label sample (a correctly matching sample in the dataset), and P(i, j) represents the probability of a correct match.
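As a hedged sketch only, and not necessarily the patent's exact expression, a standard loss consistent with the symbols above is the negative log-likelihood of the ground-truth matches averaged over the samples:

```latex
L = \frac{1}{N}\sum_{m=1}^{N} L_m, \qquad
L_m = -\frac{1}{\lvert GT^{(m)} \rvert} \sum_{(i,j)\in GT^{(m)}} \log P^{(m)}(i,j)
```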
When the model is used for feature matching, the matching can be performed fully automatically by inputting the dataset to be matched into the trained window attention based deep learning network.
Example 2
Relative pose estimation (Relative Pose Estimation) experiment
Data set: the validity of pose estimation is verified on the MegaDepth dataset. The MegaDepth dataset contains 1M internet images covering 196 different outdoor scenes. The invention selects 1500 pairs from the scenes 'Sacre Coeur' and 'St. Peter's Square' for comparison. For training and verification, the images are resized to 840×840.
Evaluation index: the pose error (the maximum of the angular errors in rotation and translation) of each matching method is computed. The fundamental matrix of the predicted matches is solved with the RANSAC algorithm to recover the camera pose. Table 1 compares the AUC of the pose error at three different thresholds (5°, 10°, 20°) and the matching accuracy of the method of the present application with the LoFTR method.
Table 1 evaluation of pose estimation on MegaDepth dataset
Analysis of results: as shown in Table 1, for the pose estimation AUC at the three thresholds and for matching accuracy, the performance of the invention is superior to all competitors (DRC-Net, SuperGlue and LoFTR), demonstrating the effectiveness of the design. In addition, for a more comprehensive comparison, this example further compares the invention with LoFTR when trained on different percentages of the dataset: 10%, 30%, 50% and 70%. The invention outperforms LoFTR on all of these subsets, which shows strong robustness when training data are limited.
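The evaluation protocol of this example (RANSAC on the predicted matches, camera pose recovery, maximum angular error in rotation and translation) can be sketched with standard OpenCV calls; the essential-matrix route and the shared intrinsic matrix below are simplifying assumptions.

```python
import cv2
import numpy as np

def pose_error(pts1, pts2, K, R_gt, t_gt):
    """Recover the relative pose from predicted matches with RANSAC and return the
    maximum of the rotation and translation angular errors in degrees.
    pts1, pts2: (N, 2) matched pixel coordinates; K: 3x3 intrinsics (assumed shared)."""
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Rotation error: angle of R_gt^T R.
    cos_r = np.clip((np.trace(R_gt.T @ R) - 1.0) / 2.0, -1.0, 1.0)
    err_R = np.degrees(np.arccos(cos_r))
    # Translation error: angle between unit translation directions (up to sign).
    cos_t = abs(float(t_gt.ravel() @ t.ravel())) / (
        np.linalg.norm(t_gt) * np.linalg.norm(t) + 1e-8)
    err_t = np.degrees(np.arccos(np.clip(cos_t, 0.0, 1.0)))
    return max(err_R, err_t)

# The AUC at 5/10/20 degrees is then the area under the cumulative curve of these errors.
```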
Example 3
Homography estimation (Homography Estimation) experiment
Data set: the present example evaluates the present invention and other methods on HPatches datasets; the HPatches dataset includes 52 sequences with significant illumination variation and 56 sequences with large viewpoint variation.
Evaluation index: this embodiment uses OpenCV for homography estimation and RANSAC as the robust estimator. In each test sequence, one reference image is paired with the other five images. The area under the cumulative error curve is reported at corner-error thresholds of 3, 5 and 10 pixels, respectively.
TABLE 2 homography estimation on HPatches dataset
Analysis of results: as shown in Table 2, the homography estimation performance of the invention on the HPatches benchmark is superior to the other methods. At 1/3/5 pixel error, the method reaches the best level under illumination variation, with accuracies of (0.78, 0.98, 0.99). The invention also achieves the highest number of matches (4.7K). To evaluate robustness with less training data, the invention is compared with LoFTR at different dataset percentages. Under the same experimental conditions, the invention performs markedly better in the homography experiments, is relatively unaffected by limited training data, and shows better generalization capability.
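The corner-error computation used in this protocol can be sketched with OpenCV as follows; the RANSAC reprojection threshold and the mean over the four corners are assumptions.

```python
import cv2
import numpy as np

def corner_error(pts_ref, pts_tgt, H_gt, img_w, img_h):
    """Estimate a homography from predicted matches with RANSAC and return the
    mean corner error in pixels against the ground-truth homography H_gt."""
    H_est, _ = cv2.findHomography(pts_ref, pts_tgt, cv2.RANSAC, 3.0)
    corners = np.array([[0, 0], [img_w, 0], [img_w, img_h], [0, img_h]],
                       dtype=np.float64).reshape(-1, 1, 2)
    warped_gt = cv2.perspectiveTransform(corners, np.asarray(H_gt, dtype=np.float64))
    warped_est = cv2.perspectiveTransform(corners, np.asarray(H_est, dtype=np.float64))
    return float(np.linalg.norm(warped_gt - warped_est, axis=-1).mean())

# Accuracy at a threshold t (e.g. 3, 5 or 10 pixels) is the fraction of image
# pairs whose corner error falls below t.
```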
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (6)

1. An image feature matching method based on window attention is characterized by comprising the following steps:
s1: using an MBConv module to perform preliminary extraction and downsampling on the features of a group of images;
s2: window partitioning is carried out on the features by using a window attention module;
s3: comparing different windows of a group of images, and searching z windows closest to a target window;
s4: combining the pixel level features of the similar windows and the window level features of the remaining windows, and carrying out final feature extraction;
s5: processing attention features by using a bidirectional softmax, obtaining matching probability, training a model, and realizing feature matching;
and S2, window partitioning is carried out on the features by using a window attention module, wherein the window partitioning is specifically as follows:
x_1 and x_2 are the two images to be matched; x_1 and x_2 are input into the window attention module, and information is transmitted between the window attention modules; in the case of self-attention, x_1 and x_2 are the same; in the case of cross-attention, x_1 and x_2 come from different pictures; the query vector q, the key vector k and the value vector v are generated using the following formula;
wherein mapping(·) is a function that maps features onto vectors, h, w and c are the height, width and number of channels of the image, respectively; R represents the real numbers;
according to the set window size, the images x_1 and x_2 are divided into n windows and the features within the windows are rearranged; after window partitioning, q_w, k_w and v_w are generated;
where window_partition(·) is a function that divides the image into windows of side length s, n is the number of windows, n = h×w/s²; q_w, k_w and v_w are the features rearranged after window partitioning;
s3, comparing different windows of a group of images, and searching z windows closest to a target window, wherein the z windows are specifically as follows:
for x_1 and x_2, the features of the pixels within each window are averaged; the window-averaged feature vectors of the query vector q, the key vector k and the value vector v can be calculated as:
a similarity matrix SM is then designed for the window-averaged feature vectors; SM represents the similarity relation of the window-averaged feature vectors, and the higher the similarity between a window and the target window in x_1, the higher the similarity score in x_2; SM may be generated by:
the first z windows of x_1 with the highest similarity to the target window of x_2 are selected; R represents the real numbers, the window-averaged feature vectors are as defined above, T_z represents the first z windows to be extracted, and top_z_index represents the indices of the z windows with the highest similarity: top_z_index = get_top_z_index(SM, T_z) ∈ R^(n×z)
2. The window attention-based image feature matching method according to claim 1, wherein the MBConv module in S1 is structured by a 1×1 normal convolution layer, a 1×1 depth separable convolution layer, a SE module, a 1×1 normal convolution layer, and a Dropout layer in this order;
the SE module consists of a global average pooling layer and two fully connected layers; the number of nodes of the first fully connected layer is a fraction of the number of channels of the feature matrix input to MBConv, and an activation function is used; the number of nodes of the second fully connected layer is equal to the number of channels of the feature matrix output by the depth separable convolution layer, and an activation function is used.
3. The window attention based image feature matching method of claim 2 wherein a first full connection layer uses a Swish activation function and a second full connection layer uses a Sigmoid activation function.
4. The method for matching image features based on window attention according to claim 1, wherein S4 combines the pixel level features of the similar windows and the window level features of the remaining windows and performs final feature extraction, specifically as follows:
the fine features of each pixel in the first z windows T_z are extracted; the fine key feature vector k_fine and the fine value feature vector v_fine are obtained from the following formula:
the pixel level features of the z windows are combined with the window level average features of all windows, where concat(·) is the concatenation (merge) function, as follows:
the final Top K window attention is then generated, with the following formula:
O = attention(q_w, K, V)
the final query vector q_w, the final key vector K and the final value vector V are combined; O is the final Top K window attention, and attention(·) is the attention function of the Transformer.
5. The window attention-based image feature matching method of claim 1, wherein the attention features are processed using a bidirectional softmax in S5 to obtain the matching probabilities; wherein the matching probability P can be defined as:
P(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i
in the above formula, softmax(·) is the normalized exponential function, which expresses a multi-class result in the form of probabilities; softmax(S(i, ·))_j means that the softmax operation is carried out on all elements of the i-th row, giving a row vector that sums to 1 and forms a probability distribution; softmax(S(·, j))_i means that the softmax operation is carried out on all elements of the j-th column, giving a column vector that sums to 1 and forms a probability distribution; multiplying the two results gives the confidence matrix.
6. The window attention based image feature matching method of claim 1, wherein in S5, the model is trained using a loss function L as follows:
in the above formula, N represents the number of samples, the summation runs over the m samples, L_m represents the probability prediction function for the m-th sample, GT_{i,j} is a label sample, and P(i, j) represents the probability that the match is correct.
CN202310268014.0A 2023-03-20 2023-03-20 Image feature matching method based on window attention Active CN116486101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310268014.0A CN116486101B (en) 2023-03-20 2023-03-20 Image feature matching method based on window attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310268014.0A CN116486101B (en) 2023-03-20 2023-03-20 Image feature matching method based on window attention

Publications (2)

Publication Number Publication Date
CN116486101A CN116486101A (en) 2023-07-25
CN116486101B true CN116486101B (en) 2024-02-23

Family

ID=87225857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310268014.0A Active CN116486101B (en) 2023-03-20 2023-03-20 Image feature matching method based on window attention

Country Status (1)

Country Link
CN (1) CN116486101B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743017A (en) * 2022-04-15 2022-07-12 北京化工大学 Target detection method based on Transformer global and local attention interaction
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN115713546A (en) * 2022-11-13 2023-02-24 复旦大学 Lightweight target tracking algorithm for mobile terminal equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114743017A (en) * 2022-04-15 2022-07-12 北京化工大学 Target detection method based on Transformer global and local attention interaction
CN115713546A (en) * 2022-11-13 2023-02-24 复旦大学 Lightweight target tracking algorithm for mobile terminal equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CoAtNet: Marrying Convolution and Attention for All Data Sizes; Zihang Dai et al.; arXiv; 2021-09-15; pp. 1-18 *
Combining convolutional neural networks and self-attention for fundus diseases identification; Keya Wang et al.; Scientific Reports; 2023-01-02; vol. 13, no. 76; pp. 1-15 *
MatchFormer: Interleaving Attention in Transformers for Feature Matching; Qing Wang et al.; arXiv; 2022-09-23; pp. 1-24 *
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models; Chenglin Yang et al.; arXiv; 2023-01-30; pp. 1-23 *

Also Published As

Publication number Publication date
CN116486101A (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN112926396B (en) Action identification method based on double-current convolution attention
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
CN111582044B (en) Face recognition method based on convolutional neural network and attention model
CN114220124A (en) Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN112288011B (en) Image matching method based on self-attention deep neural network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Lv et al. Application of face recognition method under deep learning algorithm in embedded systems
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN115222998B (en) Image classification method
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN116704611A (en) Cross-visual-angle gait recognition method based on motion feature mixing and fine-granularity multi-stage feature extraction
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN114998995A (en) Cross-view-angle gait recognition method based on metric learning and space-time double-flow network
Qu et al. PMA-Net: A parallelly mixed attention network for person re-identification
CN116486101B (en) Image feature matching method based on window attention
CN111144469A (en) End-to-end multi-sequence text recognition method based on multi-dimensional correlation time sequence classification neural network
CN116030495A (en) Low-resolution pedestrian re-identification algorithm based on multiplying power learning
CN116311345A (en) Transformer-based pedestrian shielding re-recognition method
CN112396089B (en) Image matching method based on LFGC network and compression excitation module
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
Jun et al. Two-view correspondence learning via complex information extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant