CN115620036A - Image feature matching method based on content perception - Google Patents


Info

Publication number: CN115620036A
Application number: CN202211232715.0A
Authority: CN (China)
Prior art keywords: feature, image, content, stage, map
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 李佐勇, 王伟策, 许惠亮, 刘伟霞, 赖桃桃
Current Assignee: Minjiang University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Minjiang University
Priority date: 2022-10-10 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-10-10
Publication date: 2023-01-17
Application filed by Minjiang University
Priority to CN202211232715.0A
Publication of CN115620036A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757: Matching configurations of points or features
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention relates to an image feature matching method based on content perception. First, an improved two-stage feature matching method is provided: a state-of-the-art model fitting method is used to pre-align the image pair in the first stage, and the pre-aligned image is used as the input of the second stage. Second, a block consisting of a fully convolutional network and a mask predictor is placed in front of the feature extractor to weight the features of the input image and enhance the extraction of locally effective features. The method improves matching accuracy.

Description

Image feature matching method based on content perception
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image feature matching method based on content perception.
Background
Feature matching refers to finding correct correspondences between two images, as shown in fig. 1. It is also the basis for higher-level tasks in the field of computer vision (e.g. three-dimensional reconstruction, image stitching, SLAM and lane line detection), and raising the probability of correct matches allows these higher-level tasks to perform better.
Classical feature matching methods typically comprise three steps: feature detection, feature description, and feature matching. Before the advent of deep learning methods, most approaches were based on this pipeline, and they generally improved performance by improving a single step in the pipeline. For example, the Saddle detector [1], the brightness-comparison-based FAST detector [2] and its extended version FAST-ER [3] improve the performance of the feature detection step. Other methods [4,5] focus on improving the feature description step. In addition, well-known traditional methods such as SIFT [6], SURF [7], KAZE [8] and AKAZE [9] improve both of the first two steps. For the last step, classical model fitting methods such as RANSAC [10] and its improved variant DSAC [11] improve matching accuracy by estimating geometric transformations (e.g., epipolar geometry and homographies).
With the advent of deep learning, the feature matching methods proposed in recent years widely use neural networks to improve performance. Some of these learning-based methods [12,13] follow the classical pipeline, while others [14,15] are end-to-end approaches. SuperPoint [12] jointly detects keypoints and computes the associated descriptors. SuperGlue [13] improves descriptors using a graph neural network with cross- and self-attention. In different scenarios, improving only one or two steps of the pipeline is not always the best choice, which motivated end-to-end methods. D2-Net [14] uses a pre-trained VGG-16 as a feature extractor to obtain features. DFM [15] uses a pre-trained VGG-19 as a feature extractor to obtain deep features, while pre-aligning the input images before matching to improve performance. However, the number of feature points extracted by D2-Net and DFM is insufficient, resulting in few correct matches on non-planar images. In addition, the geometric estimation algorithm used for pre-alignment in DFM is not effective enough, which also affects the final matching accuracy.
Disclosure of Invention
The invention aims to provide an image feature matching method based on content perception, which improves the matching accuracy.
In order to achieve the above purpose, the technical scheme of the invention is as follows: an image feature matching method based on content perception, in which, first, an improved two-stage feature matching method is provided, a state-of-the-art model fitting method is used to pre-align the image pair in the first stage, and the pre-aligned image is used as the input of the second stage; second, a block consisting of a fully convolutional network and a mask predictor is placed before the feature extractor to weight the features of the input image.
Compared with the prior art, the invention has the following beneficial effects: in order to improve the accuracy of end-to-end feature matching, especially when applied to challenging scenes such as non-planar images, repetitive images, or images with strong illumination changes, the invention provides an image feature matching method based on content perception. The method first uses a state-of-the-art model fitting method to pre-align the input images, improving the quality of feature extraction, and takes the aligned images as the input of the second stage. Second, a content-aware block is added to the feature extractor to predict a probability map, which highlights the effective parts of the image, guides feature extraction, and allows a larger number of effective features to be extracted. Experiments show that the accuracy of the method on the HPatches dataset exceeds that of the current best traditional and deep learning methods.
Drawings
Fig. 1 is an example of image feature matching.
FIG. 2 is a flow chart of the method of the present invention.
FIG. 3 is a comparison of the image alignment effects of two homography estimation algorithms: (a) and (b) are the input images, where (a) is the target image; (c) is the alignment result using MAGSAC++; (d) is the alignment result using RANSAC.
Fig. 4 is a content aware block diagram.
Fig. 5 shows the MMA evaluation results of 9 feature matching methods on the HPatches dataset at different ratios, covering three scenarios: illumination change (Illumination), viewpoint change (Viewpoint), and all sequences (Overall).
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The invention relates to an image feature matching method based on content perception, which comprises the following steps: first, an improved two-stage feature matching method is provided, using a state-of-the-art model fitting method to pre-align the image pair in the first stage and taking the pre-aligned image as the input of the second stage; second, a block consisting of a fully convolutional network and a mask predictor is placed before the feature extractor to weight the features of the input image.
The following is a specific implementation process of the present invention.
The invention relates to an image feature matching method based on content perception, as shown in figure 2. The first stage comprises three steps. First, a pre-trained feature extractor (VGG-19) is used to extract features from the input images I_A and I_B; second, Dense Nearest Neighbor Search (DNNS) is used to perform initial matching on the feature maps of the last layer; finally, these initial matches are used for homography matrix estimation for pre-alignment. The second stage comprises two steps. First, the pre-alignment result is input into a feature extractor, which consists of a content-aware block and VGG-19, to extract features; second, DNNS is used for feature matching. The first highlight of the method is the use of a more robust model fitting algorithm to obtain a more accurate homography matrix for image alignment. The second highlight is the use of a content-aware block to predict a probability map that guides feature extraction. Compared with existing feature extraction methods, this effectively enhances the extraction of locally effective features.
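To make the flow concrete, the following Python sketch outlines the two stages. It is a minimal, assumption-laden outline: the callables vgg19_features, dnns_match, estimate_homography_magsacpp and content_aware_extractor are illustrative placeholders for the components described in this specification, not functions of any published implementation.

```python
import cv2

def match_two_stage(img_a, img_b, vgg19_features, dnns_match,
                    estimate_homography_magsacpp, content_aware_extractor):
    """Illustrative outline of the two-stage pipeline; all callables are placeholders."""
    # Stage 1: initial matching on the deepest VGG-19 feature maps via DNNS.
    pts_a, pts_b = dnns_match(vgg19_features(img_a)[-1],
                              vgg19_features(img_b)[-1])

    # Pre-alignment: robust homography H_BA (MAGSAC++), then warp I_B to I_Bwarped.
    H_ba = estimate_homography_magsacpp(pts_b, pts_a)
    h, w = img_a.shape[:2]
    img_b_warped = cv2.warpPerspective(img_b, H_ba, (w, h))

    # Stage 2: content-aware feature extraction (content-aware block followed by
    # VGG-19) on the pre-aligned pair, then DNNS matching again.
    return dnns_match(content_aware_extractor(img_a),
                      content_aware_extractor(img_b_warped))
```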
1. Image pre-alignment based on model fitting algorithm
Unlike existing methods, this method uses the recently proposed robust model fitting method MAGSAC++ to estimate the homography matrix H_BA, which is used to warp image I_B to obtain the image I_Bwarped (as shown in fig. 2). Using MAGSAC++ allows the two images to be aligned more accurately, so that more correct matches can be found afterwards. At this stage, as shown in fig. 3(c), the MAGSAC++ algorithm achieves better results than RANSAC in the homography estimation task.
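A minimal sketch of this pre-alignment step is given below, using the MAGSAC++ backend exposed by OpenCV (cv2.USAC_MAGSAC, available since OpenCV 4.5). The variable names and the 3-pixel reprojection threshold are illustrative assumptions, not values taken from the patent.

```python
import cv2
import numpy as np

def pre_align(img_a, img_b, pts_a, pts_b, reproj_thresh=3.0):
    """Estimate H_BA with MAGSAC++ and warp I_B into the frame of I_A."""
    # pts_a / pts_b: N x 2 arrays of initially matched coordinates from stage one.
    H_ba, inlier_mask = cv2.findHomography(
        pts_b.astype(np.float32), pts_a.astype(np.float32),
        cv2.USAC_MAGSAC, reproj_thresh)
    h, w = img_a.shape[:2]
    img_b_warped = cv2.warpPerspective(img_b, H_ba, (w, h))  # I_Bwarped
    return img_b_warped, H_ba, inlier_mask
```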
2. Content aware blocks
Feature-based matching methods are generally able to achieve satisfactory performance on popular datasets, but they depend heavily on the quantity and quality of features. When the image contains a challenging scene, such as a non-planar scene, repetitive content, or varying illumination, performance may degrade because of an insufficient number of effective features. Therefore, more effective features are required for feature matching.
In order to solve the above problem, the method adds a content-aware block before VGG-19. The content-aware block consists of a feature extractor and a mask predictor, which improves the quantity and quality of useful features. The feature extractor preliminarily extracts a feature map of the input image. The mask predictor predicts a probability map, i.e. locations with more significant content are assigned higher probability; the probability map is then used to weight the feature map of the input image. The content-aware block is shown in fig. 4.
A feature extractor: in order to enable the network to learn the deep features of an image pair autonomously, we use a fully convolutional network to form the feature extractor f(·), whose structural details are shown in Table 1. It accepts an input of size H × W × 1 and generates a feature map of size H × W × C. For the input images I_A and I_B, the feature extractor shares weights and generates the feature maps F_A and F_B, i.e.
F_i = f(I_i), i ∈ {A, B}
A mask predictor: the method builds a network to automatically learn the locations of effective features; its detailed structure is shown in Table 2. The network m(·) generates an inlier probability map, highlighting the locations in the feature map that contribute more. The probability map has the same size as the feature maps F_A and F_B, which are further weighted by it; the two weighted feature maps G_A and G_B are then input into VGG-19, i.e.
M_i = m(I_i), G_i = F_i M_i, i ∈ {A, B}
TABLE 1 feature extractor architecture
(The structure of Table 1 is provided only as an image in the original publication and is not reproduced here.)
TABLE 2 mask predictor Structure
(The structure of Table 2 is provided only as an image in the original publication and is not reproduced here.)
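Since Tables 1 and 2 are reproduced only as images, the exact layer configurations are not available here; the following PyTorch sketch therefore uses illustrative channel counts and depths and should be read as an assumption-laden outline of the content-aware block rather than the patented architecture.

```python
import torch
import torch.nn as nn

class ContentAwareBlock(nn.Module):
    """Weights a preliminary feature map with a predicted inlier probability map."""
    def __init__(self, in_ch=1, feat_ch=8):
        super().__init__()
        # f(.): fully convolutional feature extractor, H x W x 1 -> H x W x C.
        # Layer sizes are illustrative; the patented configuration is in Table 1.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # m(.): mask predictor producing a one-channel probability map in [0, 1].
        # Layer sizes are illustrative; the patented configuration is in Table 2.
        self.mask_predictor = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image):                 # image: (B, 1, H, W)
        f_i = self.feature_extractor(image)   # F_i, shape (B, C, H, W)
        m_i = self.mask_predictor(image)      # M_i, shape (B, 1, H, W)
        return f_i * m_i                      # G_i = F_i weighted by M_i

# The weighted feature map G_i is then passed to VGG-19 for deep feature extraction.
```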
The feature matching task is typically evaluated on image sequences with illumination and viewpoint changes. We examined the method of the invention on the HPatches [16] dataset, which comprises 116 groups of images; each group contains 6 images of the same scene taken from different viewpoints or under different lighting conditions, covering planar and non-planar scenes. Each group also includes homography matrices as labels. In the experiments, we compared against the classical algorithms SIFT [6], SURF [7], ORB [17], KAZE [8] and AKAZE [9], and the deep-learning-based algorithms SuperPoint [12], Patch2Pix [18] and DFM [15]. We also removed the content-aware block from the method (without CA) to verify the impact of this module. The performance of each method was measured using mean matching accuracy (MMA), which is the percentage of correct matches (i.e., inliers) averaged over the entire dataset. A match is considered an inlier if its reprojection error (computed from the label homography matrix and the match) is less than a given threshold.
We performed two experiments to measure the effectiveness of the proposed method. (1) All compared methods used mutual nearest neighbor search and a bidirectional ratio test to find correct matches, and the MMA was measured at ratios from 0.1 to 1.0 in steps of 0.1. (2) For each method, the ratio at which the best performance was obtained was fixed, and the MMA of all methods at pixel thresholds of 1, 3, 5 and 10 was compared.
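A hedged sketch of this evaluation protocol is given below: mutual nearest neighbor search with a bidirectional ratio test, followed by the reprojection-error check against the label homography used to count inliers for MMA. All names and the default ratio are illustrative assumptions, and the brute-force distance computation is only suitable for small descriptor sets.

```python
import numpy as np

def mutual_nn_ratio_matches(desc_a, desc_b, ratio=0.9):
    """Mutual nearest neighbor matches that pass a bidirectional ratio test."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)                       # best match in B for each A
    nn_ba = d.argmin(axis=0)                       # best match in A for each B
    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] != i:                          # keep only mutual matches
            continue
        second_ab = np.partition(d[i], 1)[1]       # second-best distance, A -> B
        second_ba = np.partition(d[:, j], 1)[1]    # second-best distance, B -> A
        if d[i, j] <= ratio * second_ab and d[i, j] <= ratio * second_ba:
            matches.append((i, j))
    return matches

def mma(pts_a, pts_b, H_gt, threshold):
    """Fraction of matches whose reprojection error under H_gt is below threshold."""
    pts_h = np.hstack([pts_a, np.ones((len(pts_a), 1))]) @ H_gt.T
    proj = pts_h[:, :2] / pts_h[:, 2:3]
    err = np.linalg.norm(proj - pts_b, axis=1)
    return float(np.mean(err < threshold))
```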
Fig. 5 shows MMA on HPatches data set for 9 feature matching methods at different ratios. It can be observed from FIG. 5 that the curves for all comparative methods change significantly as the ratio changes. This indicates that the change in ratio has a more significant effect on the other methods, whereas the method of the present invention is less affected.
Table 3 lists the MMA of each method at different pixel thresholds, from which it can be seen that the method of the present invention is very competitive. Compared with the other methods (SIFT, SURF, ORB, KAZE, AKAZE, SuperPoint, Patch2Pix, and DFM), the method has the highest accuracy at every pixel threshold. At a pixel threshold of 1, the MMA of the method is significantly higher than that of the suboptimal method. When the threshold is set to 5, the MMA of the method is equal to that obtained by Patch2Pix. When the thresholds are set to 1, 3, 5 and 10, the MMA of the method exceeds that of the end-to-end method DFM by 0.19, 0.06, 0.03 and 0.01, respectively. At a threshold of 1, SIFT reaches the suboptimal performance (0.60). At a threshold of 3, the suboptimal method is Patch2Pix (0.88). At a threshold of 10, the suboptimal methods are Patch2Pix and DFM (0.96).
As shown in Table 3, even without the content-aware block the method shows superiority: its MMA is 0.07 higher than that of the next best method (SIFT) at a pixel threshold of 1. When the pixel thresholds are set to 3 and 5, the MMA of the method is on par with the suboptimal method. The advantage of the method is further extended after introducing the content-aware block. Thus, the content-aware block effectively improves the performance of the method.
TABLE 3 optimal MMA for different image matching algorithms (optimal values are indicated by bold font)
(The contents of Table 3 are provided only as an image in the original publication and are not reproduced here.)
References:
[1] Aldana-Iuit J, Mishkin D, Chum O, et al. In the Saddle: Chasing fast and repeatable features[C]. Proceedings of the IEEE International Conference on Pattern Recognition, 2016: 675-680.
[2] Trajković M, Hedley M. Fast corner detection[J]. Image and Vision Computing, 1998, 16(2): 75-87.
[3] Rosten E, Porter R, Drummond T. Faster and better: A machine learning approach to corner detection[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2008, 32(1): 105-119.
[4] Gong Y, Kumar S, Rowley H A, et al. Learning binary codes for high-dimensional data using bilinear projections[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013: 484-491.
[5] Trzcinski T, Lepetit V. Efficient discriminative projections for compact binary descriptors[C]. European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2012: 228-242.
[6] Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[7] Bay H, Tuytelaars T, Gool L V. SURF: Speeded up robust features[C]. European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2006: 404-417.
[8] Alcantarilla P F, Bartoli A, Davison A J. KAZE features[C]. European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2012: 214-227.
[9] Alcantarilla P F, et al. Fast explicit diffusion for accelerated features in nonlinear scale spaces[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2011, 34(7): 1281-1298.
[10] Fischler M A, Bolles R C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography[J]. Communications of the ACM, 1981, 24(6): 381-395.
[11] Brachmann E, Krull A, Nowozin S, et al. DSAC - Differentiable RANSAC for camera localization[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6684-6692.
[12] DeTone D, Malisiewicz T, Rabinovich A. SuperPoint: Self-supervised interest point detection and description[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018: 224-236.
[13] Sarlin P E, DeTone D, Malisiewicz T, et al. SuperGlue: Learning feature matching with graph neural networks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 4938-4947.
[14] Dusmanu M, Rocco I, Pajdla T, et al. D2-Net: A trainable CNN for joint description and detection of local features[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 8092-8101.
[15] Efe U, Ince K G, Alatan A. DFM: A performance baseline for deep feature matching[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 4284-4293.
[16] Balntas V, Lenc K, Vedaldi A, et al. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 5173-5182.
[17] Rublee E, Rabaud V, Konolige K, et al. ORB: An efficient alternative to SIFT or SURF[C]. International Conference on Computer Vision. IEEE, 2011: 2564-2571.
[18] Zhou Q, Sattler T, Leal-Taixé L. Patch2Pix: Epipolar-guided pixel-level correspondences[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 4669-4678.
the above are preferred embodiments of the present invention, and all changes made according to the technical solutions of the present invention that produce functional effects do not exceed the scope of the technical solutions of the present invention belong to the protection scope of the present invention.

Claims (7)

1. An image feature matching method based on content perception, characterized in that, first, an improved two-stage feature matching method is provided, in which a state-of-the-art model fitting method is used to pre-align an image pair in a first stage, and the pre-aligned image is used as the input of a second stage; second, a block consisting of a fully convolutional network and a mask predictor is placed before the feature extractor to weight the features of the input image.
2. The image feature matching method based on content perception according to claim 1, wherein the first stage is implemented as follows: first, a pre-trained feature extractor VGG-19 is used to extract features from the input images I_A and I_B; second, dense nearest neighbor search (DNNS) is used to perform initial matching on the feature maps of the last layer; finally, the initial matches are used for homography matrix estimation for pre-alignment.
3. The image feature matching method based on content perception according to claim 2, wherein the second stage is implemented as follows: first, the pre-alignment result is input into a feature extractor consisting of a content-aware block and VGG-19 to extract features; second, feature matching is performed using dense nearest neighbor search (DNNS).
4. The image feature matching method based on content perception according to claim 2, wherein the robust model fitting method MAGSAC++ is used to estimate the homography matrix H_BA, and H_BA is used to warp image I_B to obtain the image I_Bwarped.
5. The method as claimed in claim 3, wherein the content-aware block is composed of a second feature extractor for preliminarily extracting a feature map of the input image and a mask predictor for predicting a probability map in which locations with more effective content have higher probability; the probability map is then used to weight the feature map of the input image.
6. The image feature matching method based on content perception according to claim 5, wherein the second feature extractor is formed using a fully convolutional network that accepts an input of size H × W × 1 and generates a feature map of size H × W × C; for the input images I_A and I_B, the second feature extractor shares weights and generates the feature maps F_A and F_B, namely:
F_i = f(I_i), i ∈ {A, B}
where f(·) represents the second feature extractor.
7. The method of claim 6, wherein the mask predictor learns the locations of effective features automatically by building a network m(·) that generates an inlier probability map highlighting the locations in the feature map that contribute more; the probability map has the same size as the feature maps F_A and F_B, which are further weighted by the probability map, and the two weighted feature maps G_A and G_B are then input into VGG-19, namely:
M_i = m(I_i), G_i = F_i M_i, i ∈ {A, B}.
CN202211232715.0A, filed 2022-10-10 (priority date 2022-10-10): Image feature matching method based on content perception. Status: Pending. Published as CN115620036A.

Priority Applications (1)

Application Number: CN202211232715.0A (published as CN115620036A)
Priority Date: 2022-10-10
Filing Date: 2022-10-10
Title: Image feature matching method based on content perception

Applications Claiming Priority (1)

Application Number: CN202211232715.0A (published as CN115620036A)
Priority Date: 2022-10-10
Filing Date: 2022-10-10
Title: Image feature matching method based on content perception

Publications (1)

Publication Number: CN115620036A
Publication Date: 2023-01-17

Family

ID=84860846

Family Applications (1)

Application Number: CN202211232715.0A (published as CN115620036A, Pending)
Priority Date: 2022-10-10
Filing Date: 2022-10-10
Title: Image feature matching method based on content perception

Country Status (1)

Country: CN
Publication: CN115620036A

Similar Documents

Publication Publication Date Title
Li et al. Contour knowledge transfer for salient object detection
Bi et al. Fast copy-move forgery detection using local bidirectional coherency error refinement
CN108960211B (en) Multi-target human body posture detection method and system
CN113361542B (en) Local feature extraction method based on deep learning
CN111709980A (en) Multi-scale image registration method and device based on deep learning
CN108537832B (en) Image registration method and image processing system based on local invariant gray feature
CN107862680A (en) A kind of target following optimization method based on correlation filter
KR101753360B1 (en) A feature matching method which is robust to the viewpoint change
CN111310690B (en) Forest fire recognition method and device based on CN and three-channel capsule network
CN112364881B (en) Advanced sampling consistency image matching method
Lecca et al. Comprehensive evaluation of image enhancement for unsupervised image description and matching
Huang et al. Robust simultaneous localization and mapping in low‐light environment
CN115294371B (en) Complementary feature reliable description and matching method based on deep learning
CN115620036A (en) Image feature matching method based on content perception
CN116188535A (en) Video tracking method, device, equipment and storage medium based on optical flow estimation
CN111612800B (en) Ship image retrieval method, computer-readable storage medium and equipment
CN110222217B (en) Shoe print image retrieval method based on segmented weighting
He et al. DarkFeat: noise-robust feature detector and descriptor for extremely low-light RAW images
Yang et al. Exposing photographic splicing by detecting the inconsistencies in shadows
CN110070110B (en) Adaptive threshold image matching method
CN113610016A (en) Training method, system, equipment and storage medium of video frame feature extraction model
Shen et al. A detector-oblivious multi-arm network for keypoint matching
Qiu et al. Adaptive threshold based SIFT image registration algorithm
CN117541764B (en) Image stitching method, electronic equipment and storage medium
Wu et al. Feature rectification and enhancement for no-reference image quality assessment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination