CN116486101A - Image feature matching method based on window attention - Google Patents


Info

Publication number
CN116486101A
Authority
CN
China
Prior art keywords
window
attention
windows
features
vector
Prior art date
Legal status
Granted
Application number
CN202310268014.0A
Other languages
Chinese (zh)
Other versions
CN116486101B (en)
Inventor
廖赟
段清
邸一得
刘俊晖
周豪
朱开军
Current Assignee
Yunnan Lanyi Network Technology Co ltd
Yunnan University YNU
Original Assignee
Yunnan Lanyi Network Technology Co ltd
Yunnan University YNU
Priority date
Filing date
Publication date
Application filed by Yunnan Lanyi Network Technology Co ltd and Yunnan University YNU
Priority to CN202310268014.0A
Publication of CN116486101A
Application granted
Publication of CN116486101B
Legal status: Active
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image feature matching method based on window attention. An MBConv module performs preliminary extraction and downsampling of the features of a group of images; a window attention module partitions the features into windows; the different windows of the group of images are compared to find the z windows closest to a target window; the pixel-level features of these similar windows are merged with the window-level features of the remaining windows for the final feature extraction; the attention features are then processed with a bidirectional softmax, the model is trained, and feature matching is achieved.

Description

Image feature matching method based on window attention
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a window attention-based image feature matching method.
Background
Local feature matching is central to many basic computer vision tasks, including visual localization, structure from motion (SfM), simultaneous localization and mapping (SLAM), and the like. Many recent works use descriptors based on convolutional neural networks (CNNs), such as VGG, ResNet, DenseNet and EfficientNet. For two images to be matched, most existing matching methods are divided into three independent stages: feature detection, feature description and feature matching. In the detection stage, salient points such as corner points are first detected as keypoints in each image, and local descriptors are then extracted in the neighborhood of these keypoints. The feature detection and description stages produce two sets of keypoints with descriptors; the point-to-point correspondences between the two sets of keypoints are then found through nearest-neighbor search or a more complex matching algorithm. Such models are not efficient.
With the advent and great success of Transformer attention modules in natural language processing, researchers have also explored their application in computer vision. Since the introduction of ViT, a large number of such methods have been applied to various fields of computer vision, such as image classification, object detection, feature matching and stereo matching. These methods have contributed significantly to many areas of computer vision, but still have drawbacks in performance and efficiency: methods using linear attention perform poorly, and methods using dot-product attention are inefficient. Constructing a model that is both high-performing and efficient therefore remains a great challenge.
Disclosure of Invention
The embodiments of the invention aim to provide an image feature matching method based on window attention, which extends the window attention module, reduces the computational cost of the model, and improves the efficiency and performance of the window attention module, so as to solve the problem of image feature matching.
In order to solve the technical problems, the technical scheme adopted by the invention is an image feature matching method based on window attention, which comprises the following specific steps:
s1: using an MBConv module to perform preliminary extraction and downsampling on the features of a group of images;
s2: window partitioning is carried out on the features by using a window attention module;
s3: comparing different windows of a group of images, and searching z windows closest to a target window;
s4: combining the pixel level features of the similar windows and the window level features of the remaining windows, and carrying out final feature extraction;
s5: the attention features are processed using a bi-directional softmax, the matching probabilities are obtained, the model is trained, and feature matching is achieved.
Specifically, the MBConv module in S1 mainly comprises a 1×1 ordinary convolution layer, a 1×1 depth separable convolution layer, an SE module, a 1×1 ordinary convolution layer and a Dropout layer; the SE module consists of a global average pooling layer and two fully connected layers; the number of nodes of the first fully connected layer is a set fraction of the number of channels of the feature matrix input to MBConv, and an activation function is applied; the number of nodes of the second fully connected layer equals the number of channels of the feature matrix output by the depth separable convolution layer, and an activation function is applied.
Specifically, the first fully connected layer uses a Swish activation function and the second fully connected layer uses a Sigmoid activation function.
Further, the specific steps of using the window attention module to perform window partitioning on the features in S2 are as follows: let x_1 and x_2 be the two images to be matched; x_1 and x_2 are fed into the window attention module, and information is passed between the window attention modules. In the case of self-attention, x_1 and x_2 are the same; in the case of cross-attention, x_1 and x_2 come from different images. A query vector q, a key vector k and a value vector v are generated using the following formula,
where mapping(·) is a function that maps features onto vectors, h, w and c are the height, width and number of channels of the image, respectively, and R denotes the real numbers.
According to the set window size, the images x_1 and x_2 are divided into n windows and the features within each window are rearranged; after window partitioning, q_w, k_w and v_w are generated,
where window_partition(·) is a function that divides the image into windows of side length s, n is the number of windows, n = h×w/s^2, and q_w, k_w and v_w are the rearranged features after window partitioning.
Further, in S3, different windows of a group of images are compared and the z windows closest to the target window are found; the specific steps are as follows:
The features of the pixels within each window of x_1 and x_2 are averaged; the window-average feature vectors of the query vector q, the key vector k and the value vector v can be calculated as:
A similarity matrix SM is then designed for the window-average feature vectors of the two images; SM represents the similarity between window-average feature vectors: the more similar a window is to the target window in x_1, the higher its similarity score in x_2. SM can be generated by:
The first z windows in x_1 with the highest similarity to the target window in x_2 are selected; R denotes the real numbers, T_z denotes the first z windows to be extracted, and top_z_index denotes the indices of the z windows with the highest similarity,
top_z_index = get_top_z_index(SM, T_z) ∈ R^(n×z)
Further, in S4, the pixel-level features of the similar windows are merged with the window-level features of the remaining windows and the final feature extraction is carried out; the specific steps are as follows:
The fine features of each pixel in the first z windows T_z are extracted; the fine key feature vector k_fine and the fine value feature vector v_fine are obtained from the following formula:
The pixel-level features of the z windows are combined with the window-level average features of all windows, where concat() is a channel-wise concatenation function, as follows:
Finally, the final Top-K window attention is generated by the formula
O = attention(q_w, K, V)
where the final query vector q_w, the final key vector K and the final value vector V are combined, O is the final Top-K window attention, and attention() is the attention function of the Transformer.
Further, in S5, the attention features are processed using a bidirectional softmax and the matching probabilities are obtained, where the matching probability P can be defined as
P(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i
In the above formula, softmax() is the normalized exponential function, which expresses the multi-class result in the form of probabilities; softmax(S(i, ·))_j means that the softmax operation is applied to all elements of the i-th row, giving a row vector that sums to 1 with a probability distribution over its entries; softmax(S(·, j))_i means that the softmax operation is applied to all elements of the j-th column, giving a column vector that sums to 1 with a probability distribution over its entries; multiplying the two results yields the confidence matrix.
Further, in S5, the model is trained using the loss function L as follows:
In the above formula, N denotes the number of samples, Σ_m denotes the summation over the m samples, L_m denotes the probability prediction for the m-th sample, GT_(i,j) is a label sample, and P(i, j) denotes the probability that the match is correct.
The beneficial effects of the invention are as follows: the invention provides a new feature matching method that designs a window attention scheme to improve the attention mechanism within the window attention module. The improved model only needs to extract window-level features for most windows, which significantly reduces the required computation and improves the efficiency of the window attention module. The method solves the problem of window-attention-based image feature matching, has excellent matching capability and matching accuracy, generalizes well to a variety of data, and has high practical value. In addition, when the model is used for feature matching, matching can be carried out fully automatically simply by feeding the dataset to be matched into the trained window-attention-based deep learning network.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of a window attention based image feature matching method according to an embodiment of the present invention;
FIG. 2 is a general architecture diagram of a window attention based image feature matching method of an embodiment of the present invention;
FIG. 3 is a network block diagram of MBConv in accordance with an embodiment of the present invention;
FIG. 4 is a view of an image feature window partition; wherein (a) is a vector extraction diagram of window attention according to an embodiment of the present invention, (b) is a window partition diagram of window attention according to an embodiment of the present invention, (c) is a window selection diagram of window attention according to an embodiment of the present invention, and (d) is a window attention extraction diagram according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, this embodiment discloses an image feature matching method based on window attention, which realizes feature matching under various image data, and includes the following steps:
s1: the MBConv module is used to perform preliminary extraction and downsampling of features of a set of images.
As shown in fig. 2, the overall model framework of the invention comprises a plurality of TKwinBlocks, each of which includes a self-attention layer and a cross-attention layer. Each attention layer further comprises an MBConv module and a window attention module. Compared with the conventional approach, the use of MBConv improves the activation and normalization between convolutions. The design effectively combines the MBConv and window attention modules: MBConv is placed in front of the window attention module, and the responsibility for downsampling is delegated to the depthwise convolution of MBConv so that better downsampling is learned. After the TKwinBlocks are repeated, the features are fed into the dual softmax to generate a confidence matrix, and the model is then trained.
As shown in fig. 3, the MBConv module consists, in order, of a 1×1 ordinary convolution layer (for raising the dimension, including batch normalization and a Swish activation function), a depth separable convolution layer, an SE module, a 1×1 ordinary convolution layer (for lowering the dimension, including batch normalization), and a Dropout layer.
The depth separable convolution combines a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution. In the depthwise convolution, each convolution kernel is responsible for one channel and each channel is convolved by only one kernel, so the feature map produced by this stage has the same number of channels as the input. The pointwise convolution is very similar to an ordinary convolution; its kernel size is 1×1×M, where M is the number of channels of the previous layer. This convolution combines the maps of the previous step with weights along the depth direction to generate new feature maps, and the number of output feature maps equals the number of convolution kernels. Extracting feature maps with the depth separable convolution requires fewer parameters and less computation than an ordinary convolution.
The SE module consists of a global average pooling layer and two fully connected layers. The number of nodes of the first fully connected layer is a set fraction of the number of channels of the feature matrix input to MBConv, and a Swish activation function is used. The number of nodes of the second fully connected layer equals the number of channels of the feature matrix output by the depth separable convolution layer, and a Sigmoid activation function is used.
The SE module is not a complete network but a substructure, and can be embedded in other classification or detection models. Its core idea is to learn feature weights from the loss through the network, so that effective feature maps receive larger weights and feature maps that are invalid or contribute little receive smaller weights, which trains the model toward better results.
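For illustration, a minimal PyTorch-style sketch of the MBConv block and SE module described above is given below; the expansion ratio, depthwise kernel size and SE reduction ratio are assumptions (the exact node-count ratio of the first fully connected layer is given by a formula omitted here), and the class names are illustrative.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    # SE module: global average pooling followed by two fully connected layers
    # (realised as 1x1 convolutions); Swish after the first, Sigmoid after the second.
    def __init__(self, channels, reduced_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(channels, reduced_channels, 1)
        self.fc2 = nn.Conv2d(reduced_channels, channels, 1)

    def forward(self, x):
        w = torch.sigmoid(self.fc2(nn.functional.silu(self.fc1(self.pool(x)))))
        return x * w  # reweight the feature maps channel by channel

class MBConv(nn.Module):
    # 1x1 expand conv -> depthwise conv (downsampling via its stride) -> SE
    # -> 1x1 project conv -> Dropout, matching the order described in the text.
    def __init__(self, in_ch, out_ch, expand=4, kernel=3, stride=1, drop=0.1):
        super().__init__()
        mid = in_ch * expand
        self.expand = nn.Sequential(nn.Conv2d(in_ch, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.SiLU())
        self.depthwise = nn.Sequential(nn.Conv2d(mid, mid, kernel, stride, kernel // 2,
                                                 groups=mid, bias=False),
                                       nn.BatchNorm2d(mid), nn.SiLU())
        self.se = SqueezeExcite(mid, max(1, in_ch // 4))  # reduction ratio assumed
        self.project = nn.Sequential(nn.Conv2d(mid, out_ch, 1, bias=False),
                                     nn.BatchNorm2d(out_ch))
        self.dropout = nn.Dropout(drop)

    def forward(self, x):
        return self.dropout(self.project(self.se(self.depthwise(self.expand(x)))))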
S2: window partitioning features using a window attention module
As shown in FIG. 4 (a), the inputs to the window attention module are the two images x_1 and x_2, and information is passed between the window attention modules. In the case of self-attention, x_1 and x_2 are identical; in the case of cross-attention, x_1 and x_2 come from different images and are therefore different. They generate a query vector q, a key vector k and a value vector v using the following formula,
where mapping(·) is a function that maps features onto vectors, h, w and c are the height, width and number of channels of the image, respectively, and R denotes the real numbers.
as shown in fig. 4 (b), the image x is set according to the set window size 1 And x 2 Divided into n windows and features within the windows are rearranged. After partitioning the window, q is generated w ,k w And v w
Where window_partition (·) is a function of dividing the image into windows of side length s, n is the number of windows, n=h×w/s 2 。q w ,k w And v w Is the rearrangement of features after window partitioning.
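A minimal sketch of the window-partition step is shown below, assuming the mapped features have shape (h*w, c) and a square window of side length s; the tensor layout is an assumption, and only the name window_partition comes from the text.

import torch

def window_partition(x, h, w, s):
    # Split a flattened feature map of shape (h*w, c) into n = h*w / s**2 windows,
    # each holding the s*s pixel features of one window: output shape (n, s*s, c).
    c = x.shape[-1]
    x = x.view(h // s, s, w // s, s, c)        # (h/s, s, w/s, s, c)
    x = x.permute(0, 2, 1, 3, 4).contiguous()  # group pixels by window
    return x.view(-1, s * s, c)

# Example: q is the query feature of one image after mapping.
h, w, c, s = 32, 32, 64, 8
q = torch.randn(h * w, c)
q_w = window_partition(q, h, w, s)             # (16, 64, 64): n windows of s*s features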
S3: comparing different windows of a group of images and searching for the z windows closest to the target window.
As shown in FIG. 4 (c), the features of the pixels within each window of x_1 and x_2 are first averaged; the window-average feature vectors of the query vector q, the key vector k and the value vector v can be calculated as:
A similarity matrix SM is then designed for the window-average feature vectors of the two images. SM represents the similarity between window-average feature vectors: the more similar a window is to the target window in x_1, the higher its similarity score in x_2. SM can be generated by:
next, we select x 1 Intermediate and x 2 The first z windows with the highest target window similarity.
top_z_index=get_top_k_index(SM,T_z)∈R n×z
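A sketch of this selection step under the same assumptions: the windowed features are averaged, a similarity matrix SM between the window-average queries of x_1 and the window-average keys of x_2 is built, and the indices of the z most similar windows are kept. The dot-product similarity and the function name top_z_windows are assumptions; the exact form of SM is given by the omitted formula.

import torch

def top_z_windows(q_w1, k_w2, z):
    # q_w1, k_w2: (n, s*s, c) windowed features of the two images.
    q_bar = q_w1.mean(dim=1)                  # (n, c) window-average query vectors
    k_bar = k_w2.mean(dim=1)                  # (n, c) window-average key vectors
    SM = q_bar @ k_bar.transpose(0, 1)        # (n, n) similarity matrix (dot product assumed)
    top_z_index = SM.topk(z, dim=-1).indices  # (n, z) indices of the most similar windows
    return top_z_index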
S4: combining the pixel-level features of the similar windows with the window-level features of the remaining windows, and carrying out the final feature extraction.
As shown in fig. 4 (d), the first z windows have high similarity to the target window, so fine pixel-level features are extracted from them. The other windows have low similarity to the target window, so only window-level features need to be extracted. This significantly reduces the amount of computation required and increases the efficiency of the window attention module.
The fine features of each pixel in the first z windows T_z are extracted; the fine key feature vector k_fine and the fine value feature vector v_fine are obtained from the following formula:
The pixel-level features of the z windows are combined with the window-level average features of all windows, where concat() is a channel-wise concatenation function, as follows:
Finally, the final Top-K window attention is generated by the formula
O = attention(q_w, K, V)
where the final query vector q_w, the final key vector K and the final value vector V are combined, O is the final Top-K window attention, and attention() is the attention function of the Transformer.
S5: the attention features are processed using a bi-directional softmax, the matching probabilities are obtained, the model is trained, and feature matching is achieved.
The bidirectional softmax applies the softmax algorithm along both dimensions to obtain the probability of a nearest-neighbor match; the matching probability P can be defined as
P(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i
In the above formula, softmax() is the normalized exponential function, which expresses the multi-class result in the form of probabilities; softmax(S(i, ·))_j means that the softmax operation is applied to all elements of the i-th row, giving a row vector that sums to 1 with a probability distribution over its entries; softmax(S(·, j))_i means that the softmax operation is applied to all elements of the j-th column, giving a column vector that sums to 1 with a probability distribution over its entries; multiplying the two results yields the probability matrix, i.e. the confidence matrix.
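A sketch of the bidirectional (dual) softmax, assuming a score matrix S between the flattened features of the two images; the function name is illustrative.

import torch

def dual_softmax(S):
    # S: (M, N) score matrix between the features of image 1 and image 2.
    # P(i, j) = softmax(S(i, .))_j * softmax(S(., j))_i, the confidence matrix.
    return torch.softmax(S, dim=1) * torch.softmax(S, dim=0)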
The model is trained using the loss function L as follows:
In the above formula, N denotes the number of samples, Σ_m denotes the summation over the m samples, L_m denotes the probability prediction for the m-th sample, GT_(i,j) is a label sample (a correctly matched sample in the dataset), and P(i, j) denotes the probability that the match is correct.
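The loss formula itself appears only as an image in the source. A common choice that is consistent with the description above, namely averaging the negative log-probability over the ground-truth matches GT, is sketched below as an assumption rather than the exact loss of the invention.

import torch

def matching_loss(P, gt_mask, eps=1e-6):
    # P: (M, N) confidence matrix; gt_mask: (M, N) boolean matrix of
    # ground-truth matches GT.  Averages -log P(i, j) over the GT matches.
    return -torch.log(P[gt_mask] + eps).mean()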
When the model is used for feature matching, matching can be carried out fully automatically by feeding the dataset to be matched into the trained window-attention-based deep learning network.
Example 2
Relative pose estimation experiment
Dataset: the MegaDepth dataset is used to verify the effectiveness of pose estimation. The MegaDepth dataset contains 1M internet images covering 196 different outdoor scenes. For comparison, 1500 image pairs from the 'Sacre Coeur' and 'St. Peter's Square' scenes are selected. For training and validation, the images are resized to 840×840.
Evaluation metric: the pose errors (the maximum angular error in rotation and translation) of the different matching methods are computed. The fundamental matrix is estimated from the predicted matches with the RANSAC algorithm to recover the camera pose. The AUC of the pose error at three thresholds (5°, 10°, 20°) and the matching accuracy of the method of the present application are compared with the LoFTR method in Table 1.
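Such an evaluation is commonly carried out with OpenCV as sketched below; estimating the essential matrix directly from known camera intrinsics and the listed RANSAC parameters are assumptions, not details fixed by the embodiment.

import cv2

def relative_pose(pts1, pts2, K):
    # pts1, pts2: (N, 2) arrays of predicted matches; K: (3, 3) camera intrinsics.
    # Robustly estimate the essential matrix with RANSAC and recover (R, t).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t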
Table 1 evaluation of pose estimation on MegaDepth dataset
Analysis of results: as shown in Table 1, the invention outperforms all competitors (DRC-Net, SuperGlue and LoFTR) in pose-estimation AUC at the three thresholds and in matching accuracy, demonstrating the effectiveness of the design. In addition, for a more comprehensive comparison, this example further compares the invention and LoFTR on different percentages of the dataset: 10%, 30%, 50% and 70%. The invention outperforms LoFTR on all of these subsets, which shows that it remains robust when less training data is available.
Example 3
Homography estimation experiment
Data set: the present example evaluates the present invention and other methods on HPatches datasets; the HPatches dataset includes 52 sequences with significant illumination variation and 56 sequences with large viewpoint variation.
Evaluation metric: this embodiment uses OpenCV for the homography estimation and RANSAC as the robust estimator. In each test sequence, one reference image is paired with the other five images. The accuracy, measured as the area under the cumulative curve of the corner error up to thresholds of 3, 5 and 10 pixels, is reported.
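A sketch of this homography evaluation with OpenCV and RANSAC; the reprojection threshold and the corner-error computation over the four image corners are assumptions consistent with the description above.

import cv2
import numpy as np

def homography_corner_error(pts1, pts2, H_gt, h, w):
    # Estimate a homography from the predicted matches with RANSAC and return the
    # mean corner error (in pixels) against the ground-truth homography H_gt.
    H_est, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, ransacReprojThreshold=3.0)
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    warped_est = cv2.perspectiveTransform(corners, H_est)
    warped_gt = cv2.perspectiveTransform(corners, H_gt)
    return float(np.linalg.norm(warped_est - warped_gt, axis=-1).mean())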
TABLE 2 homography estimation on HPatches dataset
Analysis of results: as shown in Table 2, the homography estimation performance of the invention on the HPatches benchmark is better than that of the other methods. At 1/3/5-pixel error, the method reaches the best level under illumination variation, with accuracies of (0.78, 0.98, 0.99). The invention also achieves the highest number of matches (4.7K). To evaluate robustness with less training data, the robustness of the invention is compared with LoFTR at different dataset percentages. Under the same experimental conditions, the method performs markedly better in the homography experiments, is comparatively unaffected by the limited training data, and has better generalization ability.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. An image feature matching method based on window attention is characterized by comprising the following steps:
s1: using an MBConv module to perform preliminary extraction and downsampling on the features of a group of images;
s2: window partitioning is carried out on the features by using a window attention module;
s3: comparing different windows of a group of images, and searching z windows closest to a target window;
s4: combining the pixel level features of the similar windows and the window level features of the remaining windows, and carrying out final feature extraction;
s5: the attention features are processed using a bi-directional softmax, the matching probabilities are obtained, the model is trained, and feature matching is achieved.
2. The window attention-based image feature matching method according to claim 1, wherein the MBConv module in S1 consists, in order, of a 1×1 ordinary convolution layer, a 1×1 depth separable convolution layer, an SE module, a 1×1 ordinary convolution layer, and a Dropout layer;
the SE module consists of a global average pooling layer and two fully connected layers; the number of nodes of the first fully connected layer is a set fraction of the number of channels of the feature matrix input to MBConv, and an activation function is applied; the number of nodes of the second fully connected layer equals the number of channels of the feature matrix output by the depth separable convolution layer, and an activation function is applied.
3. The window attention based image feature matching method of claim 2 wherein a first full connection layer uses a Swish activation function and a second full connection layer uses a Sigmoid activation function.
4. The window attention-based image feature matching method according to claim 1, wherein in S2 the window attention module is used to perform window partitioning on the features, specifically as follows:
x_1 and x_2 are the two images to be matched; x_1 and x_2 are fed into the window attention module, and information is passed between the window attention modules; in the case of self-attention, x_1 and x_2 are the same; in the case of cross-attention, x_1 and x_2 come from different images; a query vector q, a key vector k and a value vector v are generated using the following formula,
where mapping(·) is a function that maps features onto vectors, h, w and c are the height, width and number of channels of the image, respectively, and R denotes the real numbers;
according to the set window size, the images x_1 and x_2 are divided into n windows and the features within each window are rearranged; after window partitioning, q_w, k_w and v_w are generated,
where window_partition(·) is a function that divides the image into windows of side length s, n is the number of windows, n = h×w/s^2, and q_w, k_w and v_w are the rearranged features after window partitioning.
5. The window attention-based image feature matching method according to claim 1, wherein in S3 different windows of a group of images are compared and the z windows closest to the target window are found, specifically as follows:
the features of the pixels within each window of x_1 and x_2 are averaged; the window-average feature vectors of the query vector q, the key vector k and the value vector v can be calculated as:
a similarity matrix SM is then designed for the window-average feature vectors of the two images, where SM represents the similarity between window-average feature vectors: the more similar a window is to the target window in x_1, the higher its similarity score in x_2; SM can be generated by:
the z windows in x_1 with the highest similarity to the target window in x_2 are selected; R denotes the real numbers, T_z denotes the first z windows to be extracted, and top_z_index denotes the indices of the z windows with the highest similarity,
top_z_index = get_top_z_index(SM, T_z) ∈ R^(n×z)
6. The window attention-based image feature matching method according to claim 1, wherein in S4 the pixel-level features of the similar windows are merged with the window-level features of the remaining windows and the final feature extraction is carried out, specifically as follows:
the fine features of each pixel in the first z windows T_z are extracted; the fine key feature vector k_fine and the fine value feature vector v_fine are obtained from the following formula:
the pixel-level features of the z windows are combined with the window-level average features of all windows, where concat() is a channel-wise concatenation function, as follows:
finally, the final Top-K window attention is generated by the formula
O = attention(q_w, K, V)
where the final query vector q_w, the final key vector K and the final value vector V are combined, O is the final Top-K window attention, and attention() is the attention function of the Transformer.
7. The window attention-based image feature matching method according to claim 1, wherein in S5 the attention features are processed using a bidirectional softmax to obtain the matching probabilities, wherein the matching probability P can be defined as
P(i, j) = softmax(S(i, ·))_j · softmax(S(·, j))_i
in the above formula, softmax() is the normalized exponential function, which expresses the multi-class result in the form of probabilities; softmax(S(i, ·))_j means that the softmax operation is applied to all elements of the i-th row, giving a row vector that sums to 1 with a probability distribution over its entries; softmax(S(·, j))_i means that the softmax operation is applied to all elements of the j-th column, giving a column vector that sums to 1 with a probability distribution over its entries; multiplying the two results yields the confidence matrix.
8. The window attention based image feature matching method of claim 1, wherein in S5, the model is trained using a loss function L as follows:
in the above formula, N denotes the number of samples, Σ_m denotes the summation over the m samples, L_m denotes the probability prediction for the m-th sample, GT_(i,j) is a label sample, and P(i, j) denotes the probability that the match is correct.
CN202310268014.0A 2023-03-20 2023-03-20 Image feature matching method based on window attention Active CN116486101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310268014.0A CN116486101B (en) 2023-03-20 2023-03-20 Image feature matching method based on window attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310268014.0A CN116486101B (en) 2023-03-20 2023-03-20 Image feature matching method based on window attention

Publications (2)

Publication Number Publication Date
CN116486101A true CN116486101A (en) 2023-07-25
CN116486101B CN116486101B (en) 2024-02-23

Family

ID=87225857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310268014.0A Active CN116486101B (en) 2023-03-20 2023-03-20 Image feature matching method based on window attention

Country Status (1)

Country Link
CN (1) CN116486101B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743017A (en) * 2022-04-15 2022-07-12 北京化工大学 Target detection method based on Transformer global and local attention interaction
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN115713546A (en) * 2022-11-13 2023-02-24 复旦大学 Lightweight target tracking algorithm for mobile terminal equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114743017A (en) * 2022-04-15 2022-07-12 北京化工大学 Target detection method based on Transformer global and local attention interaction
CN115713546A (en) * 2022-11-13 2023-02-24 复旦大学 Lightweight target tracking algorithm for mobile terminal equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENGLIN YANG等: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", 《ARXIV》, 30 January 2023 (2023-01-30), pages 1 - 23 *
KEYA WANG等: "Combining convolutional neural networks and self-attention for fundus diseases identification", 《SCIENTIFIC REPORTS》, vol. 13, no. 76, 2 January 2023 (2023-01-02), pages 1 - 15 *
QING WANG等: "MatchFormer: Interleaving Attention in Transformers for Feature Matching", 《ARXIV》, 23 September 2022 (2022-09-23), pages 1 - 24 *
ZIHANG DAI等: "CoAtNet: Marrying Convolution and Attention for All Data Sizes", 《ARXIV》, 15 September 2021 (2021-09-15), pages 1 - 18 *

Also Published As

Publication number Publication date
CN116486101B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN112926396B (en) Action identification method based on double-current convolution attention
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111582044B (en) Face recognition method based on convolutional neural network and attention model
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
CN114220124A (en) Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Lv et al. Application of face recognition method under deep learning algorithm in embedded systems
CN116704611A (en) Cross-visual-angle gait recognition method based on motion feature mixing and fine-granularity multi-stage feature extraction
CN117058437B (en) Flower classification method, system, equipment and medium based on knowledge distillation
CN116030495A (en) Low-resolution pedestrian re-identification algorithm based on multiplying power learning
CN114998995A (en) Cross-view-angle gait recognition method based on metric learning and space-time double-flow network
CN117237858B (en) Loop detection method
Qu et al. PMA-Net: A parallelly mixed attention network for person re-identification
CN111144469B (en) End-to-end multi-sequence text recognition method based on multi-dimensional associated time sequence classification neural network
CN116486101B (en) Image feature matching method based on window attention
CN116311345A (en) Transformer-based pedestrian shielding re-recognition method
Wang et al. Feature extraction method of face image texture spectrum based on a deep learning algorithm
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
Jun et al. Two-view correspondence learning via complex information extraction
Gong et al. TripleFormer: improving transformer-based image classification method using multiple self-attention inputs
Shelare et al. StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness Extraction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant