Remote sensing image feature matching method based on self-supervision and self-learning feature points
Technical Field
The invention relates to the field of remote sensing image feature matching, and in particular to a remote sensing image feature matching method based on self-supervision and self-learning feature points.
Background
Image matching aims to identify, align, and match content or structures that share the same or similar attributes in two images at the pixel level. The images to be matched are usually taken of the same or a similar scene or object, or are other types of image pairs sharing the same shape or semantic information, so that a certain degree of matchability exists. Because deep learning methods have excellent capability for learning and expressing the deep features of images, preliminary results on the image matching problem have already been obtained. The main application of deep learning in image matching is to learn a pixel-level matching relationship directly from image pairs containing the same or similar structural content, chiefly in two forms: (1) directly designing an end-to-end matching network that learns to detect a more accurate and reliable feature point set from an image, together with a main direction or main scale for each feature point and a feature descriptor with stronger discriminative and matching capability; (2) using a deep network to acquire deep features of image blocks and measuring the similarity between those features to establish correspondences, an approach commonly used for extracting good feature points, constructing descriptors, image retrieval, image registration, and the like.
However, current deep-learning-based image matching depends heavily on a large number of manually labeled true feature points for training. For remote sensing images, the large volume of image data produced by multi-temporal, multi-sensor acquisition, together with varying illumination angles and shooting conditions, not only introduces errors into the image matching process but also makes the manual labeling of true feature points more difficult and greatly increases the labeling cost.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a remote sensing image feature matching method based on self-supervision and self-learning feature points. Registered remote sensing image pairs are selected from a three-dimensional database; a feature point extraction network carries out feature point extraction training to obtain a feature point model m1; repeated training on the data is then carried out through the feature point extraction network, starting from the feature point model m1, to obtain a feature point model mn and a feature point labeln; and training through a feature matching network finally yields a remote sensing image matching model s1.
The purpose of the invention is realized by the following technical scheme:
a remote sensing image feature matching method based on self-supervision and self-learning feature points comprises the following steps:
A. acquiring a three-dimensional database, wherein the three-dimensional data in the three-dimensional database is remote sensing image data; selecting registered three-dimensional data from the three-dimensional database to form remote sensing image pairs, wherein each remote sensing image pair comprises two items of three-dimensional data, a plurality of remote sensing image pairs are formed, and the remote sensing image pairs are assigned ID numbers; and performing data preprocessing on all remote sensing image pairs, the preprocessing comprising simultaneous cropping of both images of a pair, mirror rotation, adjustment of image sharpness and contrast, and image Gaussian blurring;
B. dividing all remote sensing image pairs into a training set and a test set, wherein the ratio of the number of remote sensing image pairs in the training set to that in the test set is 8-9:1; constructing a feature point extraction network, performing feature point extraction training on the three-dimensional data in the training set, and obtaining a feature point model m1 after the training is finished;
C. randomly selecting two remote sensing image pairs from the training set, and performing feature point extraction on the first of the two randomly selected remote sensing image pairs with the feature point extraction network based on the feature point model m1 to obtain a feature point label1 for that pair; constructing a twin feature point extraction network, and performing feature point training through the twin feature point extraction network with the two randomly selected remote sensing image pairs as training data and the feature point label1 as the true value, to obtain a feature point model m2;
D. randomly selecting two remote sensing image pairs from the training set, and performing feature point extraction on the first of the two randomly selected remote sensing image pairs with the feature point extraction network based on the feature point model m2 to obtain a feature point label2 for that pair; performing feature point training through the twin feature point extraction network with the two randomly selected remote sensing image pairs as training data and the feature point label2 as the true value, to obtain a feature point model m3; ...; randomly selecting two remote sensing image pairs from the training set, and performing feature point extraction on the first of the two randomly selected remote sensing image pairs with the feature point extraction network based on the feature point model mk to obtain a feature point labelk for that pair; and performing feature point training through the twin feature point extraction network with the two randomly selected remote sensing image pairs as training data and the feature point labelk as the true value, to obtain a feature point model mn, where n = k + 1;
E. performing feature point extraction on any remote sensing image pair in the test set or the training set with the feature point extraction network based on the feature point model mn, to obtain a feature point labeln for that pair; constructing a feature matching network, performing feature point matching training on the remote sensing image pairs in the test set and/or the training set through the feature matching network with the feature point labeln as the true value, constraining the feature point descriptors with the matching relationship during matching, and generating a remote sensing image matching model s1 when the training is finished.
The invention further comprises the following step:
F. performing a feature point matching test on the remote sensing image pairs in the test set based on the generated remote sensing image matching model s1.
Preferably, the feature point extraction network in step B of the invention is an encoder-decoder structure based on semantic segmentation, comprising an encoder portion and a decoder portion; the encoder portion adopts a VGG-type fully convolutional network and comprises eight convolution layers and four max-pooling layers, and the decoder portion comprises a softmax feature point function sampling model and a reshape feature point sampling model.
Preferably, the twin feature point extraction network in step D of the invention is a twin encoder-decoder structure based on semantic segmentation, comprising a twin encoder portion and a merging decoder portion; the twin encoder portion comprises two weight-sharing encoder units, each encoder unit adopting a VGG-type fully convolutional network with eight convolution layers and four max-pooling layers; and the merging decoder portion, which merges the data of the two encoder units, comprises a softmax feature point function sampling model and a reshape feature point sampling model.
Preferably, in step E of the invention, the feature matching network adopts an encoder-decoder network matching structure comprising two encoder units and two decoder units in one-to-one correspondence; each encoder unit adopts a VGG-type fully convolutional network with eight convolution layers and four max-pooling layers; and each decoder unit comprises a softmax feature point function sampling model and a reshape feature point sampling model and is provided with a descriptor generation network.
Preferably, the remote sensing image pairs selected from the three-dimensional database in step A of the invention are required to have a data registration rate above 90%, the data registration indices comprising the number of feature points and the positions of the feature points.
Preferably, the three-dimensional data in the three-dimensional database in step A of the invention is derived from remote sensing imaging equipment; when the three-dimensional data is acquired, the remote sensing image is cropped so that the length and width of the cropped image are multiples of 8, and the cropped remote sensing image is stored in the three-dimensional database.
Preferably, in step A of the invention, the remote sensing image data comprises geometric structures, the geometric structures comprising points, lines, planes, and cubes.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) In the method, registered remote sensing image pairs are selected from a three-dimensional database; feature point extraction training is carried out through a feature point extraction network to obtain a feature point model m1; repeated training on the data is carried out through the feature point extraction network based on the feature point model m1 to obtain a feature point model mn and a feature point labeln; and training through a feature matching network then yields a remote sensing image matching model s1.
(2) The invention extracts true feature points from the remote sensing image pairs by a self-supervised feature point learning method, uses the registration matching relationship of the true feature points between the remote sensing image pairs to constrain the feature point descriptors, and realizes remote sensing image matching with an improved SuperPoint-style feature matching network, thereby improving feature matching efficiency and accuracy.
Drawings
Fig. 1 is a schematic diagram of the feature point extraction network according to this embodiment;
Fig. 2 is a schematic diagram of the twin feature point extraction network according to this embodiment;
Fig. 3 is a schematic diagram of the feature matching network according to this embodiment;
Fig. 4 is a schematic diagram of the feature point extraction network and the descriptor learning network in Fig. 3.
Detailed Description
The present invention will be described in further detail with reference to the following examples:
Example
As shown in Figs. 1 to 4, a remote sensing image feature matching method based on self-supervision and self-learning feature points includes the following steps:
A. A three-dimensional database is acquired, in which the three-dimensional data is remote sensing image data (the three-dimensional data is also called remote sensing image data; it may come from various remote sensing imaging devices or may be virtual three-dimensional data, and it broadly records remote sensing images of three-dimensional buildings and the like). Registered three-dimensional data is selected from the three-dimensional database to form remote sensing image pairs, each remote sensing image pair comprising two items of three-dimensional data; a plurality of remote sensing image pairs is formed, and the pairs are assigned ID numbers. Data preprocessing is performed on all remote sensing image pairs, comprising simultaneous cropping of both images of a pair, mirror rotation, adjustment of image sharpness and contrast, image Gaussian blurring, and the like.
Preferably, in this embodiment, the remote sensing image pairs selected from the three-dimensional database in step A (which may be pairs of the same area at different time phases and under different illumination) are required to have a data registration rate above 90%, the data registration indices mainly comprising the number of feature points and the positions of the feature points. The remote sensing image pairs selected in this embodiment are subject to strict registration requirements, and the selected pairs (especially those in the training set) are strictly registered (data registration above 90%).
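The embodiment does not fix a formula for the registration rate. A minimal sketch of one plausible check, assuming registration is measured as the fraction of corresponding feature points whose positions agree within a pixel tolerance (the helper name and tolerance value are illustrative, not part of the invention):

```python
import numpy as np

def registration_rate(pts_a: np.ndarray, pts_b: np.ndarray, tol: float = 3.0) -> float:
    """Fraction of corresponding feature points whose positions agree.

    pts_a, pts_b: (N, 2) arrays of corresponding (x, y) coordinates, one row
    per feature point correspondence between the two images of a pair.
    tol: illustrative pixel tolerance; the embodiment does not specify one.
    """
    dists = np.linalg.norm(pts_a - pts_b, axis=1)
    return float(np.mean(dists <= tol))

# A pair would qualify for the training set only if the rate exceeds 90%:
# accept = registration_rate(pts_a, pts_b) > 0.90
```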
Preferably, the three-dimensional data in the three-dimensional database of this embodiment is derived from remote sensing imaging equipment; to acquire the three-dimensional data, the remote sensing image is cropped so that its length and width are multiples of 8, and the cropped remote sensing image is stored in the three-dimensional database. The multiple used for the length and width is set according to the remote sensing imaging equipment from which the image comes; considering differences between hardware devices, the length and width of an image pair need only be multiples of 8 when cropping, and the specific crop size can be adjusted according to the hardware conditions.
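A minimal preprocessing sketch for the operations of step A, assuming images handled with Pillow; the crop floors each dimension to a multiple of 8, and each augmentation is applied identically to both images of a pair so their registration relationship is preserved (function names and parameter values are illustrative):

```python
from PIL import Image, ImageEnhance, ImageFilter

def crop_to_multiple_of_8(img: Image.Image) -> Image.Image:
    # Floor the width and height to the nearest multiple of 8.
    w, h = img.size
    return img.crop((0, 0, w - w % 8, h - h % 8))

def preprocess_pair(img1: Image.Image, img2: Image.Image):
    # Both images of a pair receive identical operations.
    out = []
    for img in (img1, img2):
        img = crop_to_multiple_of_8(img)
        img = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)  # mirror
        img = img.rotate(90, expand=True)                     # rotation
        img = ImageEnhance.Contrast(img).enhance(1.2)         # contrast
        img = ImageEnhance.Sharpness(img).enhance(1.2)        # sharpness
        img = img.filter(ImageFilter.GaussianBlur(radius=1))  # Gaussian blur
        out.append(img)
    return tuple(out)
```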
B. All remote sensing image pairs are divided into a training set and a test set, the ratio of the number of pairs in the training set to that in the test set being 8-9:1. A feature point extraction network is constructed, feature point extraction training is performed on the three-dimensional data in the training set, and the feature point model m1 is obtained when the training is finished. The principle of the feature point extraction network adopted in this embodiment is shown in Fig. 1, where input denotes the input remote sensing image (the remote sensing image is data and is referred to simply as an image; the remote sensing image data is also called three-dimensional data); output denotes the output image with feature points; H denotes the height of the input image; W denotes the width of the input image; H/8 means the height is one eighth of the original image; and W/8 means the width is one eighth of the original image. Encoder denotes the network encoding structure of the feature point extraction process; this embodiment mainly adopts a VGG-type network consisting of eight convolution layers and four max-pooling layers. Decoder denotes the network decoding structure of the feature point extraction process, mainly comprising a convolution, a softmax feature point function sampling model (the exponential softmax function expresses, as a probability, whether each pixel of the image is a feature point), and a reshape feature point sampling model (reshape denotes the image upsampling process, by which the feature point image at one eighth of the original width and height is upsampled to the original image size). conv denotes the convolution process.
Preferably, the feature point extraction network in step B of this embodiment is an encoder-decoder structure based on semantic segmentation, comprising an encoder portion and a decoder portion; the encoder portion adopts a VGG-type fully convolutional network and comprises eight convolution layers and four max-pooling layers, and the decoder portion comprises a softmax feature point function sampling model and a reshape feature point sampling model.
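A minimal PyTorch sketch of this encoder-decoder, assuming SuperPoint-style channel widths and a 65-channel decoder head (64 positions per 8x8 cell plus a "no feature point" bin); the embodiment fixes only the eight convolution layers, four max-pooling layers, softmax, and reshape upsampling, so all channel widths here are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeaturePointNet(nn.Module):
    """VGG-type encoder (8 conv + 4 max-pool) with a softmax/reshape decoder."""

    def __init__(self):
        super().__init__()
        def conv(c_in, c_out):
            return [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(
            *conv(1, 64), *conv(64, 64), nn.MaxPool2d(2),
            *conv(64, 64), *conv(64, 64), nn.MaxPool2d(2),
            *conv(64, 128), *conv(128, 128), nn.MaxPool2d(2),
            *conv(128, 128), *conv(128, 128), nn.MaxPool2d(2),
        )  # eight convolution layers, four max-pooling layers -> (B, 128, H/8, W/8)
        self.head = nn.Conv2d(128, 65, 1)  # decoder convolution: 64 cell bins + dustbin

    def forward(self, x):  # x: (B, 1, H, W) with H, W multiples of 8
        scores = self.head(self.encoder(x))       # (B, 65, H/8, W/8)
        probs = F.softmax(scores, dim=1)[:, :-1]  # per-pixel feature point probability
        return F.pixel_shuffle(probs, 8)          # "reshape" back to (B, 1, H, W)
```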
C. Two remote sensing image pairs are randomly selected from the training set, and feature point extraction is performed on the first of the two randomly selected pairs with the feature point extraction network based on the feature point model m1, to obtain a feature point label1 for that pair. A twin feature point extraction network is constructed, and feature point training is performed through the twin feature point extraction network with the two randomly selected remote sensing image pairs as training data and the feature point label1 as the true value, to obtain a feature point model m2.
D. Two remote sensing image pairs are randomly selected from the training set, and feature point extraction is performed on the first of the two randomly selected pairs with the feature point extraction network based on the feature point model m2, to obtain a feature point label2 for that pair; feature point training is performed through the twin feature point extraction network with the two randomly selected remote sensing image pairs as training data and the feature point label2 as the true value, to obtain a feature point model m3; and so on, training in sequence according to the method of this step to obtain label3, label4, ..., and the feature point models m4, m5, .... Taking label3 and the feature point model m4 as an example: two remote sensing image pairs are randomly selected from the training set, feature point extraction is performed on the first of the two randomly selected pairs with the feature point extraction network based on the feature point model m3 to obtain the feature point label3 of that pair, and feature point training is performed through the twin feature point extraction network with the two randomly selected pairs as training data and the feature point label3 as the true value, to obtain the feature point model m4. In the final round, two remote sensing image pairs are randomly selected from the training set, feature point extraction is performed on the first of the two randomly selected pairs with the feature point extraction network based on the feature point model mk to obtain the feature point labelk of that pair; and feature point training is performed through the twin feature point extraction network with the two randomly selected remote sensing image pairs as training data and the feature point labelk as the true value, to obtain the feature point model mn, where n = k + 1.
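A schematic Python sketch of this self-learning loop; `train_initial`, `predict`, and `train_twin` are hypothetical placeholders standing in for the feature point extraction training, pseudo-labeling, and twin network training described above:

```python
import random

def self_learning_loop(train_set, extract_net, twin_net, rounds):
    """Bootstraps feature point models m1 -> m2 -> ... -> mn (n = rounds + 1)."""
    model = extract_net.train_initial(train_set)      # step B: model m1
    for k in range(1, rounds + 1):
        pair_a, pair_b = random.sample(train_set, 2)  # two random image pairs
        # Pseudo-label the first pair with the current model mk (-> labelk).
        label_k = extract_net.predict(model, pair_a)
        # Twin network training on both pairs with labelk as true value -> m(k+1).
        model = twin_net.train_twin([pair_a, pair_b], label_k)
    return model                                      # model mn, n = k + 1
```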
Preferably, the twin feature point extraction network in step D of this embodiment is a twin encoder-decoder structure based on semantic segmentation, comprising a twin encoder portion and a merging decoder portion; the twin encoder portion comprises two weight-sharing encoder units, each encoder unit adopting a VGG-type fully convolutional network with eight convolution layers and four max-pooling layers; and the merging decoder portion, which merges the data of the two encoder units, comprises a softmax feature point function sampling model and a reshape feature point sampling model.
The principle of the twin feature point extraction network adopted in this embodiment is shown in Fig. 2, where input1 and input2 denote the input remote sensing images; output denotes the output image with feature points; H denotes the height of the input image; W denotes the width of the input image; H/8 means the height is one eighth of the original image; and W/8 means the width is one eighth of the original image. Encoder denotes the network encoding structure of the feature point extraction process; this embodiment mainly adopts a VGG-type network consisting of eight convolution layers and four max-pooling layers, and the encoding structure comprises two encoder units. Decoder denotes the network decoding structure, mainly comprising a convolution, a softmax feature point function sampling model (the exponential softmax function expresses, as a probability, whether each pixel of the image is a feature point), and a reshape feature point sampling model (reshape denotes the image upsampling process, by which the feature point image at one eighth of the original width and height is upsampled to the original image size). conv denotes the convolution process; add denotes the channel-wise addition of the two input feature maps.
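A minimal sketch of the weight-sharing arrangement, reusing the `FeaturePointNet` sketched above: a single encoder module is applied to both inputs (so the two encoder units share weights), their outputs are merged by the channel-wise add, and the merged map passes through the shared decoder head. Placing the merge before the decoder head is an assumption consistent with Fig. 2:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwinFeaturePointNet(nn.Module):
    """Twin encoder-decoder: two weight-sharing encoders, one merging decoder."""

    def __init__(self, base: FeaturePointNet):  # FeaturePointNet as sketched above
        super().__init__()
        self.encoder = base.encoder  # applied to both inputs: shared weights
        self.head = base.head

    def forward(self, x1, x2):
        merged = self.encoder(x1) + self.encoder(x2)        # channel-wise "add"
        probs = F.softmax(self.head(merged), dim=1)[:, :-1]
        return F.pixel_shuffle(probs, 8)                    # upsample to (B, 1, H, W)
```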
E. Feature point extraction is performed on any remote sensing image pair in the test set with the feature point extraction network based on the feature point model mn, to obtain the feature point labeln of that pair (usable both for feature point extraction and for feature descriptor generation). A feature matching network is constructed, feature point matching training is performed through the feature matching network on the remote sensing image pairs in the test set and/or the training set with the feature point labeln as the true value, the feature point descriptors are constrained by the matching relationship during matching, and the remote sensing image matching model s1 is generated when the training is finished.
Preferably, in step E of this embodiment, the feature matching network adopts an encoder-decoder network matching structure comprising two encoder units and two decoder units in one-to-one correspondence; each encoder unit adopts a VGG-type fully convolutional network with eight convolution layers and four max-pooling layers; and each decoder unit comprises a softmax feature point function sampling model and a reshape feature point sampling model and is provided with a descriptor generation network.
The structural principle of the feature matching network adopted in this embodiment is shown in Figs. 3 and 4, where input1 and input2 denote the input remote sensing images; output1 and output2 denote the output images with feature points; H denotes the height of the input image; W denotes the width of the input image; H/8 means the height is one eighth of the original image; and W/8 means the width is one eighth of the original image. Encoder denotes the network encoding structure of the feature point extraction process; this embodiment mainly adopts a VGG-type network consisting of eight convolution layers and four max-pooling layers, and the encoding structure comprises two encoder units. Decoder denotes the network decoding structure, mainly comprising a convolution, a softmax feature point function sampling model (the exponential softmax function expresses, as a probability, whether each pixel of the image is a feature point), and a reshape feature point sampling model (reshape denotes the image upsampling process, by which the feature point image at one eighth of the original width and height is upsampled to the original image size). conv denotes the convolution process; add denotes the channel-wise addition of the two input feature maps. Interest Points Network denotes the feature point extraction network; Descriptors Network denotes the descriptor generation network; Bi-Cubic Interpolate denotes the bicubic interpolation process; and L2-Norm denotes the L2 norm.
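A minimal PyTorch sketch of the descriptor branch named in Figs. 3 and 4, assuming a SuperPoint-style head: a 1x1 convolution produces a coarse descriptor map at H/8 x W/8, bicubic interpolation restores full resolution, and L2 normalization yields unit-length descriptors; the descriptor dimension of 256 is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class DescriptorHead(nn.Module):
    """Descriptors Network: convolution, Bi-Cubic Interpolate, L2-Norm."""

    def __init__(self, c_in: int = 128, dim: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(c_in, dim, 1)  # coarse per-cell descriptors

    def forward(self, feat, out_hw):  # feat: (B, c_in, H/8, W/8); out_hw: (H, W)
        desc = self.conv(feat)                            # (B, dim, H/8, W/8)
        desc = F.interpolate(desc, size=out_hw, mode="bicubic",
                             align_corners=False)         # Bi-Cubic Interpolate
        return F.normalize(desc, p=2, dim=1)              # L2-Norm per pixel
```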
F. A feature point matching test is performed on the remote sensing image pairs in the test set based on the generated remote sensing image matching model s1.
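The embodiment does not spell out the matching test procedure. A minimal sketch of one common evaluation step, assuming mutual-nearest-neighbour matching of the L2-normalized descriptors produced by the model (the matching rule itself is an assumption, not the patent's prescribed test):

```python
import torch

def match_descriptors(desc1: torch.Tensor, desc2: torch.Tensor) -> torch.Tensor:
    """Mutual-nearest-neighbour matching of L2-normalized descriptors.

    desc1: (N, D) descriptors from image 1; desc2: (M, D) from image 2.
    Returns a (K, 2) tensor of index pairs (i, j) that are mutual best matches.
    """
    sim = desc1 @ desc2.t()            # cosine similarity for unit vectors
    nn12 = sim.argmax(dim=1)           # best j in image 2 for every i
    nn21 = sim.argmax(dim=0)           # best i in image 1 for every j
    idx = torch.arange(desc1.shape[0])
    mutual = nn21[nn12] == idx         # keep only mutual nearest neighbours
    return torch.stack([idx[mutual], nn12[mutual]], dim=1)
```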
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.