Background
Image matching is an important research area in computer vision. It is widely used as a preprocessing step in many fields, such as three-dimensional reconstruction, simultaneous localization and mapping, panoramic stitching, and stereo matching. It essentially consists of two steps: constructing matching pairs and removing mismatches.
Many image matching methods currently exist. They can be classified into parametric methods, nonparametric methods, and learning-based methods. The parametric approach is a popular strategy for solving the matching problem; examples include RANSAC and its variants PROSAC and USAC. Specifically, such a method first samples a random minimal subset and generates a homography matrix or fundamental matrix, then verifies the matrix (checking whether the sampled subset yields the fewest possible outliers), and loops over these two steps. However, parametric methods have two basic disadvantages: 1) when the ratio of correct matches to total matches is low, they do not work efficiently; 2) they cannot express complex models. Non-parametric methods mine local information for correspondence selection. It is assumed that, under viewing-angle variations or non-rigid deformations, the spatial neighborhood relationships between feature points of an image pair of the same scene or object are similar. Based on this assumption, researchers use spatial neighbor relations to remove false matches. Other researchers use superpixels to obtain the feature appearance for the feature matching problem and build the adjacency matrix of a graph, in which the nodes represent potential correspondences and the weights on the links represent pairwise agreements between potential matches. These methods exploit compatibility information between matches, but they do not mine local information from the compatible correspondences.
Methods based on deep learning have enjoyed tremendous success in a variety of computer vision tasks, and many researchers have attempted to solve the matching task with learning-based approaches. These can be broadly divided into two categories: methods that construct sparse point correspondences from image pairs of the same or similar scenes using a deep learning architecture, and methods built on a PointNet-like architecture. Although learning-based approaches have proven superior to parametric and non-parametric approaches, the network model of Choy et al. still produces a large number of false matches among its generated hypothesis matches. The network model of Moo Yi et al. captures global context information through context normalization and embeds that context information in the nodes, but its context normalization is easily affected by other matching pairs. While learning-based approaches have achieved good results on various data sets, batch normalization in the network layers is often limited by batch size, and applying the same normalizer to different convolution layers results in poor performance, so switching normalizers flexibly remains challenging.
To deal effectively with these difficulties in the matching process, an end-to-end network is provided. Given correspondences of feature points in two views, existing deep learning-based methods express the feature matching problem as a binary classification problem. In these methods, normalization plays an important role in network performance; however, they employ the same normalizer in every normalization layer of the network, which degrades performance. To solve this problem, the present invention proposes a two-step switchable normalization block that combines the advantages of switchable normalization's adaptive normalizers for different convolution layers and context normalization's robust global context information. The invention can therefore avoid, to a certain extent, the influence of the difficulties mentioned above and ultimately improve matching precision. Experimental results show that the invention achieves state-of-the-art performance on a benchmark data set.
Disclosure of Invention
In view of this, the present invention provides an image matching method based on a deep neural network with two-step switchable normalization, which can improve matching accuracy.
The invention is realized by adopting the following scheme: an image matching method based on a deep neural network with two-step switchable normalization, characterized in that:
the method comprises the following steps:
step S1: data set processing: given an image pair (I, I'), feature points kp_i and kp'_i are extracted from each image separately using a detector based on the Hessian mapping; the feature point set extracted from image I is KP = {kp_i}, i ∈ N; the feature point set obtained from image I' is KP' = {kp'_i}, i ∈ N; each correspondence (kp_i, kp'_i) generates one 4D datum:

D = [d_1; d_2; d_3; ...; d_N], d_i = [x_i, y_i, x'_i, y'_i]

where D represents the matched set of the image pair, i.e. the input data, d_i represents one matching pair, and (x_i, y_i), (x'_i, y'_i) represent the coordinates of the two feature points in the match;
step S2: feature enhancement: a convolution layer with kernel size 1 × 1 is used to map the 4D data processed in step S1 into 32-dimensional feature vectors, i.e. D^(1×N×4) → D^(1×N×32), in order to reduce the information loss caused by network feature learning, where N is the number of feature points extracted from one picture;
step S3: features are extracted from the enhanced features, i.e. the mapped feature vectors, using a residual network in which batch normalization is replaced with two-step sparse switchable normalization, so as to extract the global features of the enhanced data more robustly and output a preliminary prediction result;
step S4: in the testing phase, the output of the residual network is taken as the preliminary prediction result and processed with the activation functions tanh and relu, i.e. f_x = relu(tanh(x_out)), to obtain a final result with predicted values in {0, 1}, where 0 represents a false match and 1 represents a correct match; in the training of the whole network, a cross entropy loss function is adopted to guide the learning of the network, as shown in the formula:

L = −(1/N) Σ_i [y_i log(y'_i) + (1 − y_i) log(1 − y'_i)]

where y_i denotes the label and y'_i denotes the predicted value.
Further, the step S3 specifically includes the following steps:
the two-step sparse switchable normalization is divided into two layers: the first layer is context normalization, and the second layer is switchable normalization; context normalization embeds global context information into each datum; given input data x_i at layer l, context normalization is defined as follows:

CN(x_i^l) = (x_i^l − u^l) / o^l

where CN(x_i^l) represents the output of context normalization, and u^l and o^l represent the mean and standard deviation, respectively, of the data of that network layer;

context normalization embeds global information into each feature point datum; in the second normalization layer, a differentiable feed-forward sparse learning algorithm is used to select the most appropriate normalization from batch normalization, instance normalization and layer normalization, so as to reduce the influence of a fixed normalization on the final result; the switchable normalization is defined as follows:

SN(x) = λ · (x − Σ_j r_j u_j) / sqrt(Σ_j r'_j o_j² + ε) + β

where SN(x) represents the output of the second normalization layer; λ and β represent the scale and shift parameters, respectively; u_j and o_j² represent the mean and variance of the corresponding network layer data; j = 1, 2, 3 is the position index in the three normalizations {LN, BN, IN} (i.e. Layer Normalization, Batch Normalization, Instance Normalization); r_j and r'_j represent the weighting parameters of the mean and variance, respectively; and ε is a small constant for numerical stability.
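The differentiable feed-forward sparse learning algorithm referred to above is named sparsestmax later in this document; the closely related sparsemax projection illustrates the key idea, namely that projecting logits onto the probability simplex yields importance weights that can be exactly zero, so unneeded normalizers are switched off entirely. The following NumPy sketch is illustrative only and is not the exact sparsestmax of the invention:

```python
import numpy as np

def sparsemax(z):
    # Sparsemax: Euclidean projection of the logit vector z onto the
    # probability simplex. Unlike softmax, it can return exact zeros.
    z_sorted = np.sort(z)[::-1]             # sort logits in descending order
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv       # indices inside the support
    k_star = k[support][-1]
    tau = (cssv[support][-1] - 1.0) / k_star
    return np.maximum(z - tau, 0.0)

# Three hypothetical logits, one per candidate normalizer {LN, BN, IN}
p = sparsemax(np.array([2.0, 0.5, -1.0]))
```

Here the two weaker candidates receive weight exactly 0, while softmax would give every normalizer a nonzero share.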
Compared with the prior art, the invention has the following beneficial effects:
the invention proposes a two-step switchable normalization block that combines the advantages of adaptive normalizers and context-normalized robust global context information for different convolution layers of switchable normalization. Therefore, the invention can finally improve the matching precision. Experimental results show that the invention achieves the most advanced performance on the basis of a data set.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides an image matching method based on a deep neural network with two-step switchable normalization, which includes first performing data set processing on the original data, then performing feature enhancement on the processed data, then extracting features from the enhanced features, and finally outputting the result in the testing stage.
The method specifically comprises the following steps:
step S1: data set processing: given an image pair (I, I'), feature points kp_i and kp'_i are extracted from each image separately using a detector based on the Hessian mapping; the feature point set extracted from image I is KP = {kp_i}, i ∈ N; the feature point set obtained from image I' is KP' = {kp'_i}, i ∈ N; each correspondence (kp_i, kp'_i) generates one 4D datum:

D = [d_1; d_2; d_3; ...; d_N], d_i = [x_i, y_i, x'_i, y'_i]

where D represents the matched set of the image pair, i.e. the input data, d_i represents one matching pair, and (x_i, y_i), (x'_i, y'_i) represent the coordinates of the two feature points in the match;
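As a minimal sketch (NumPy, with hypothetical coordinate values), step S1's assembly of the input data D from the two keypoint sets can be written as:

```python
import numpy as np

# Hypothetical matched keypoints: row i of kp is (x_i, y_i) in image I,
# row i of kp2 is (x'_i, y'_i) in image I'
kp = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
kp2 = np.array([[12.0, 21.0], [33.0, 39.0], [48.0, 62.0]])

# Each correspondence (kp_i, kp'_i) yields one 4D row d_i = [x_i, y_i, x'_i, y'_i]
D = np.concatenate([kp, kp2], axis=1)   # shape (N, 4)
```

In practice the keypoints would come from the Hessian-based detector rather than being hard-coded.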
step S2: feature enhancement: a convolution layer with kernel size 1 × 1 is used to map the 4D data processed in step S1 into 32-dimensional feature vectors, i.e. D^(1×N×4) → D^(1×N×32), in order to reduce the information loss caused by network feature learning, where N is the number of feature points extracted from one picture;
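Since the kernel size is 1 × 1, the feature-enhancement convolution of step S2 amounts to applying one shared 4 → 32 linear map to each matching pair independently; a NumPy sketch with randomly initialized (hypothetical) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
D = rng.standard_normal((1, N, 4))      # processed 4D data, shape (1, N, 4)

W = rng.standard_normal((4, 32))        # 1x1 conv kernel == shared linear map
b = np.zeros(32)

# Applied pointwise: each 4D row d_i becomes a 32-dimensional feature vector
F = D @ W + b                           # shape (1, N, 32)
```

In the actual network, W and b would be trainable convolution parameters rather than fixed random values.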
step S3: features are extracted from the enhanced features, i.e. the mapped feature vectors, using a residual network in which batch normalization is replaced with two-step sparse switchable normalization, so as to extract the global features of the enhanced data more robustly and output a preliminary prediction result;
step S4: in the testing phase, the output of the residual network is taken as the preliminary prediction result and processed with the activation functions tanh and relu, i.e. f_x = relu(tanh(x_out)), to obtain a final result with predicted values in {0, 1}, where 0 represents a false match and 1 represents a correct match; in the training of the whole network, a cross entropy loss function is adopted to guide the learning of the network, as shown in the formula:

L = −(1/N) Σ_i [y_i log(y'_i) + (1 − y_i) log(1 − y'_i)]

where y_i denotes the label and y'_i denotes the predicted value.
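A minimal NumPy sketch of the step S4 post-processing f_x = relu(tanh(x_out)), together with a binary cross entropy loss consistent with the labels y_i and predictions y'_i above (the logit values are hypothetical):

```python
import numpy as np

def postprocess(x_out):
    # f_x = relu(tanh(x_out)): negative logits are clamped to 0 (false match),
    # positive logits map into (0, 1) and can be rounded up toward 1
    return np.maximum(np.tanh(x_out), 0.0)

def cross_entropy(y, y_pred, eps=1e-7):
    # Binary cross entropy between labels y and predicted values y_pred
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

x_out = np.array([-2.0, 0.1, 3.0])      # hypothetical residual-network outputs
y = np.array([0.0, 0.0, 1.0])           # ground-truth labels
f_x = postprocess(x_out)
loss = cross_entropy(y, f_x)
```

The clamping behaviour is what allows the network to output a hard 0 for false matches while keeping the loss differentiable for correct ones.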
As shown in fig. 2, in this embodiment, the step S3 specifically includes the following steps:
the two-step sparse switchable normalization is divided into two layers: the first layer is context normalization, and the second layer is switchable normalization; context normalization embeds global context information into each datum; given input data x_i at layer l, context normalization is defined as follows:

CN(x_i^l) = (x_i^l − u^l) / o^l

where CN(x_i^l) represents the output of context normalization, and u^l and o^l represent the mean and standard deviation, respectively, of the data of that network layer;

context normalization embeds global information into each feature point datum. It should be noted that conventional post-processing after context normalization is susceptible to interference from other data, because the context information of different data is mixed by the subsequent batch normalization operation; a switching strategy is therefore adopted in the second step. Specifically, a differentiable feed-forward sparse learning algorithm (i.e. sparsestmax) is used in the second normalization layer to select the most appropriate normalization from batch normalization, instance normalization and layer normalization, so as to reduce the influence of a fixed normalization on the final result; the switchable normalization is defined as follows:

SN(x) = λ · (x − Σ_j r_j u_j) / sqrt(Σ_j r'_j o_j² + ε) + β

where SN(x) represents the output of the second normalization layer; λ and β represent the scale and shift parameters, respectively; u_j and o_j² represent the mean and variance of the corresponding network layer data; j = 1, 2, 3 is the position index in the three normalizations {LN, BN, IN} (i.e. Layer Normalization, Batch Normalization, Instance Normalization); r_j and r'_j represent the weighting parameters of the mean and variance, respectively; and ε is a small constant for numerical stability.
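The two layers above can be sketched in NumPy for point-style data of shape (B, N, C) (batch, correspondences, channels). The fixed weights r and r' below stand in for the learned sparsestmax weights, and the axis choices for LN/BN/IN are this sketch's assumption for point data, by analogy with their image-domain definitions:

```python
import numpy as np

def context_norm(x, eps=1e-5):
    # Normalize each sample over its N correspondences (per channel),
    # embedding global context into every feature point datum
    u = x.mean(axis=1, keepdims=True)
    o = x.std(axis=1, keepdims=True)
    return (x - u) / (o + eps)

def switchable_norm(x, r, r_prime, lam=1.0, beta=0.0, eps=1e-5):
    # x: (B, N, C); r and r_prime weight the {LN, BN, IN} statistics
    stats = []
    for axes in [(1, 2), (0, 1), (1,)]:   # LN, BN, IN reduction axes
        u = x.mean(axis=axes, keepdims=True)
        v = x.var(axis=axes, keepdims=True)
        stats.append((u, v))
    u_w = sum(w * u for w, (u, v) in zip(r, stats))
    v_w = sum(w * v for w, (u, v) in zip(r_prime, stats))
    return lam * (x - u_w) / np.sqrt(v_w + eps) + beta

x = np.random.default_rng(1).standard_normal((2, 100, 32))
y = switchable_norm(context_norm(x), r=[0.2, 0.5, 0.3], r_prime=[0.2, 0.5, 0.3])
```

In the actual network, r and r' would be produced by sparsestmax from trainable logits, and λ and β would be learned per channel.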
Preferably, in this embodiment, sparse switchable normalization (SSN) is introduced to learn different combinations of normalizers for the different convolution layers of the deep learning network, so as to solve the feature matching problem. At the same time, a two-step switchable normalization block (TSSN block) is established, which combines the advantages of SSN's adaptive normalizers for different convolution layers and context normalization's (CN's) robust global context information.
Preferably, the present embodiment adaptively outputs matched pairs by analyzing the input features to be matched and then training a novel deep neural network. Specifically, given the correspondences of feature points in two views, i.e. the input data after data processing, the image feature matching problem is expressed as a binary classification problem: the network treats each matching datum as belonging to one of two classes, with 1 representing a correct match and 0 representing a false match.
An end-to-end neural network framework is then constructed, i.e. the input data can directly yield classified output data (0 or 1) through the network of this embodiment without any other steps. The network diagram of the embodiment is shown in figure 2. The advantages of sparse switchable normalization's adaptive normalizers for different convolution layers and context normalization's robust global context information are combined, and a two-step switchable normalization block is designed to improve network performance. The image matching method based on the deep neural network mainly comprises the following steps: preparing the data set, feature enhancement, feature learning, and testing.
Quantitative and qualitative comparisons between the method of this embodiment and current state-of-the-art matching methods were performed on a public data set (COLMAP), and the results show that the method of this embodiment is significantly superior to the other algorithms.
Preferably, Table 1 shows a quantitative comparison of the F-measure, precision and recall of this embodiment with several other matching algorithms on the COLMAP data set; the compared methods include Ransac, LPM, Point-Net, Point-Net++, and LCG-Net.
TABLE 1

Method       | F-measure | Precision | Recall
Ransac       | 0.1914    | 0.2222    | 0.1879
LPM          | 0.2213    | 0.2415    | 0.2579
Point-Net    | 0.1683    | 0.1205    | 0.3847
Point-Net++  | 0.3298    | 0.2545    | 0.5668
LCG-Net      | 0.3953    | 0.3063    | 0.6839
TSSN-Net     | 0.4357    | 0.3733    | 0.5518
The above description is only a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall be covered by the present invention.