CN113111937A - Image matching method based on deep learning - Google Patents

Image matching method based on deep learning

Info

Publication number
CN113111937A
CN113111937A
Authority
CN
China
Prior art keywords
layer
image
data
network
feature extraction
Prior art date
Legal status
Pending
Application number
CN202110384410.0A
Other languages
Chinese (zh)
Inventor
贺迅
方敏
郭龙飞
李海翔
杜辉
刘友江
曹韬
Current Assignee
Xidian University
Institute of Electronic Engineering of CAEP
Original Assignee
Xidian University
Institute of Electronic Engineering of CAEP
Priority date
Filing date
Publication date
Application filed by Xidian University, Institute of Electronic Engineering of CAEP filed Critical Xidian University
Priority to CN202110384410.0A priority Critical patent/CN113111937A/en
Publication of CN113111937A publication Critical patent/CN113111937A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image matching method based on deep learning, which mainly addresses the weak matching performance of the prior art. The scheme is as follows: construct a feature extraction network and an image matching network; crop the original image to obtain a sub-image data set, and construct training-set and test-set samples; feed the training samples through the feature extraction network and the image matching network in sequence to obtain predicted coordinates for the training samples; compute the mean square error loss from the real and predicted coordinates of the training samples; train the feature extraction network and the image matching network by minimizing the mean square error loss; feed the test samples through the trained feature extraction network and the trained image matching network to obtain predicted coordinates for the test samples; and locate, from the predicted coordinates, the sub-image in the original image that is identical to each test sample, thereby realizing image matching. The invention reduces the loss of feature information from the original image, reduces interference from noisy data, enhances the matching effect, and can be used for target recognition.

Description

Image matching method based on deep learning
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image matching method which can be used for target identification.
Background
In the field of image recognition, it is often necessary to match two images. Image matching is a method for finding similar image regions in different images; its goal is to identify and align, at the pixel level, content or structures that have the same or similar attributes in two images. Generally, the images to be matched are taken from the same or similar scenes or objects, or are other image pairs sharing shape or semantic information, so that a meaningful match exists; image matching is widely applied in object recognition and computer vision. Existing image matching methods can be classified into region-based, grayscale-based, and feature-based methods. The region-based method is the most direct, but it is extremely sensitive to imaging conditions, image deformation, and noise; in particular, it requires a very high degree of overlap between the image pair and has high computational complexity, which limits its applicability. The grayscale-based method completes the matching using the gray-level information of the images. Although it requires neither segmenting the images nor extracting image features, and thus keeps the precision loss caused by preprocessing to a minimum, it is slow, relies too heavily on individual pixel values, is sensitive to noise, and is easily affected by changes in gray level, angle, and size. For an image, features are highly important information: they are abstract descriptions of local image content, and they greatly reduce the data volume while retaining the key information of the image. Feature-based image matching methods are therefore widely used.
An image matching method is proposed in the patent document "Image matching method" (application No. 201110411204.0, application publication No. CN 102682275 A) filed by Suzhou Korei Core Electronics, Ltd. The method first obtains the two V-channel images, in HSV format, of the images to be matched; it then establishes a two-dimensional coordinate system for each image so that every pixel obtains a coordinate value, and further establishes a three-dimensional coordinate system for each image in which the abscissa, ordinate, and gray value of every pixel are plotted. A surface-fitting technique is used to solve a fitting function for each image, and the similarity of the two images is judged from the similarity of their fitting functions. The method has the following drawbacks: during data processing, a large number of coordinate-system mappings are used, making the processing complicated; and because of the large number of affine computations, the influence of noise, distortion, and other factors on matching performance is amplified, adding many computations that are unnecessary for matching.
Harbin Engineering University proposed an image feature extraction and matching method in the patent document "An image feature extraction and matching method" (application No. 202010204462.0, application publication No. CN 111444948 A). The method first converts the original color image to grayscale and performs a first screening of feature points; the candidate corner points obtained from the first screening are screened a second time using the gradients along the horizontal and vertical axes; an autocorrelation matrix is computed for each candidate corner point that survives the second screening, and sub-pixel corner coordinates are obtained from the pixel-level corners by iteratively refining the Harris positions. A local region is then selected for the descriptor: a circular region of radius 12 pixels centered on a feature point, divided into three layers by circles of radius 4, 8, and 12 pixels around the feature point. The innermost circle forms one sub-region, the middle ring is evenly divided into 4 sub-regions, and the outermost ring is evenly divided into 8 sub-regions, giving 13 sub-regions in total; an 8-direction gradient vector is extracted from each sub-region, and the resulting 104-dimensional feature vector is used as the descriptor. A rotation-invariant fast-change descriptor is then computed, and finally feature extraction and feature matching are performed. The method has the following drawback: the feature extraction cannot fully mine the rich feature information in the image, misses key data points, and thereby reduces matching efficiency.
Disclosure of Invention
The present invention aims to provide an image matching method based on deep learning that simplifies the preprocessing of the original image, reduces the amount of preprocessing computation, reduces the loss of feature information in the original image, reduces the influence of noise data on matching performance, and enhances the matching effect.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) constructing a feature extraction network composed of 6 convolutional layers and 5 pooling layers in cascade, and setting the input data size standard of the feature extraction network;
(2) constructing an image matching network formed by cascading an input layer, two parallel fully-connected layers and two parallel output layers;
(3) randomly cropping the original image data according to the input data size of the feature extraction network to obtain a sub-image data set;
(4) taking 60% of the sub-image data set as a training set and 40% as a test set;
(5) inputting the training set into the feature extraction network and extracting the training-set sample data features to obtain the training-set sample data feature vectors;
(6) inputting the training-set sample data feature vectors into the image matching network, passing them through the two fully-connected layers, and having the network output layers respectively output the abscissa prediction vector and the ordinate prediction vector of the training-set sample data;
(7) taking the element values of the abscissa prediction vector and the ordinate prediction vector as the abscissa prediction value x' and the ordinate prediction value y', respectively;
(8) calculating the mean square error loss MSE between the real coordinates and the predicted coordinates from the real coordinates (x, y) and the predicted coordinates (x', y') of the training-set samples;
(9) looping steps (5)-(7), and completing the training of the feature extraction network and the image matching network by minimizing the mean square error loss;
(10) inputting the test set into the trained feature extraction network to obtain the test-set sample data feature vectors, and inputting these feature vectors into the trained image matching network to obtain the predicted coordinates of the test-set sample data;
(11) finding, in the original image, the sub-image identical to each test-set sample according to the obtained predicted coordinates of the test-set sample data, thereby realizing image matching.
Compared with the prior art, the method has the following advantages:
First, because the original image is only cropped, the invention reduces the loss of feature information in the original image, reduces the influence of added noise data on matching performance, simplifies the processing of the original image, reduces the computation spent on other processing of the original data, and enhances the matching effect.
Second, the cropped data are passed in sequence through the feature extraction network and the image matching network, and the coordinate prediction result is output directly; the entire learning process, from data input to completion of the matching task, takes place inside the model, with no human intervention needed in between, thereby realizing end-to-end image matching.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of a feature extraction network architecture constructed in accordance with the present invention;
FIG. 3 is a diagram of the image matching network constructed in the present invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the accompanying drawings.
Referring to FIG. 1, the implementation steps of the invention are as follows:
step 1, constructing a feature extraction network, and setting a data size standard input by the network.
As shown in FIG. 2, the feature extraction network constructed in this example is a convolutional neural network composed of 6 convolutional layers and 5 pooling layers, with the following structural relationship: 1st convolutional layer -> 1st pooling layer -> 2nd convolutional layer -> 2nd pooling layer -> 3rd convolutional layer -> 3rd pooling layer -> 4th convolutional layer -> 4th pooling layer -> 5th convolutional layer -> 5th pooling layer -> 6th convolutional layer;
the activation function used by each convolutional layer is set to the ReLU activation function, and the convolution kernel size is set to 3 × 3;
each pooling layer uses max pooling with a stride of 2 and a kernel size of 2.
According to the preset input parameters of the 1st convolutional layer of the feature extraction network in this example, the input size standard of the network is set to 300 × 500 pixels.
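For illustration only, a minimal sketch of such a backbone is given below, assuming PyTorch as the framework; the channel widths are hypothetical, since the patent fixes only the number and type of layers, the 3 × 3 kernels, the ReLU activations, and the 2 × 2 max pooling.
# Hypothetical sketch of the feature extraction network: 6 convolutional layers (3x3, ReLU)
# interleaved with 5 max-pooling layers (kernel 2, stride 2); channel widths are assumed.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        widths = [16, 32, 64, 128, 256, 256]        # assumed channel counts
        layers, prev = [], in_channels
        for i, w in enumerate(widths):
            layers.append(nn.Conv2d(prev, w, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            if i < 5:                               # only the first 5 convolutional layers are followed by pooling
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            prev = w
        self.body = nn.Sequential(*layers)

    def forward(self, x):                           # x: (N, 3, 300, 500) cropped sub-images
        f = self.body(x)                            # (N, 256, 9, 15) with the assumed widths
        return torch.flatten(f, start_dim=1)        # feature vector [s1, ..., sn] per sample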
Step 2, constructing an image matching network.
As shown in FIG. 3, the image matching network constructed in this example is composed of an input layer, two parallel fully-connected layers and two parallel output layers in cascade. Each fully-connected layer is composed of a 1st hidden layer and a 2nd hidden layer connected in sequence; the input layer and each hidden layer use the ReLU activation function, and each output layer uses the softmax activation function.
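A corresponding sketch of the image matching network follows, again assuming PyTorch; the hidden width and the output widths are hypothetical, since the patent does not state how long the softmax prediction vectors [h] and [z] are (here one entry per candidate abscissa or ordinate value is assumed).
# Hypothetical sketch of the image matching network: an input layer feeding two parallel
# fully-connected layers (each with a 1st and 2nd hidden layer) and two softmax output layers.
import torch.nn as nn

class MatchingNet(nn.Module):
    def __init__(self, feat_dim, hidden=512, x_out=500, y_out=300):   # widths are assumed
        super().__init__()
        self.input_layer = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True))

        def branch(out_dim):        # 1st hidden layer -> 2nd hidden layer -> softmax output layer
            return nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, out_dim), nn.Softmax(dim=1))

        self.x_branch = branch(x_out)               # abscissa prediction vector [h]
        self.y_branch = branch(y_out)               # ordinate prediction vector [z]

    def forward(self, feats):
        shared = self.input_layer(feats)
        return self.x_branch(shared), self.y_branch(shared)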
Step 3, randomly cropping the original image data according to the data input size standard of the feature extraction network.
According to the data input size standard of the feature extraction network, the original image data are cropped into sub-images of 300 × 500 pixels using a standard Python package, yielding a data set composed of these sub-images that can later be fed directly into the feature extraction network.
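As a hedged illustration of this step, the cropping could be done with Pillow and the standard random module, roughly as follows; the directory names, file format, number of crops per image, and the convention of using the top-left corner of the crop as the ground-truth coordinate are all assumptions, not details stated in the patent.
# Hypothetical sketch of step 3: randomly crop 300 x 500-pixel sub-images from the original
# images (assumed larger than the crop) and record each crop position as the true coordinate.
import random
from pathlib import Path
from PIL import Image

CROP_H, CROP_W = 300, 500                  # input size standard of the feature extraction network

def make_subimages(src_dir="original_images", dst_dir="subimages", crops_per_image=20):
    dst = Path(dst_dir)
    dst.mkdir(exist_ok=True)
    samples = []                           # (sub-image path, true x, true y)
    for img_path in Path(src_dir).glob("*.png"):
        img = Image.open(img_path)
        w, h = img.size
        for k in range(crops_per_image):
            x = random.randint(0, w - CROP_W)      # true abscissa of the crop
            y = random.randint(0, h - CROP_H)      # true ordinate of the crop
            sub = img.crop((x, y, x + CROP_W, y + CROP_H))
            out = dst / f"{img_path.stem}_{k}.png"
            sub.save(out)
            samples.append((str(out), x, y))
    return samples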
Step 4, constructing training set samples and test set samples.
60% of the sub-images in the sub-image data set are used as training set samples, and the remaining 40% of the sub-images are used as test set samples.
Step 5, extracting the training-set sample data feature vectors to obtain the predicted coordinates of the training-set sample data.
5.1) inputting the training-set sample data into the feature extraction network and extracting the training-set sample data features to obtain the training-set sample data feature vector [s1, s2, ..., si, ..., sn], where si represents the element value of the i-th dimension of the training-set sample data feature vector and n represents the total number of dimensions of the feature vector;
5.2) inputting the training-set sample data feature vector [s1, s2, ..., si, ..., sn] into the image matching network; after passing through the two fully-connected layers, the network output layers respectively output the abscissa prediction vector [h] and the ordinate prediction vector [z] of the training-set sample data, where h represents the abscissa prediction vector element value and z represents the ordinate prediction vector element value;
5.3) taking the element values of the abscissa prediction vector [h] and the ordinate prediction vector [z] as the abscissa prediction value x' and the ordinate prediction value y', respectively, to obtain the training-set sample data predicted coordinates (x', y');
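Under the same assumptions as the sketches above, sub-steps 5.1)-5.3) correspond to one forward pass. The conversion from the softmax prediction vectors [h] and [z] to the scalar values x' and y' is done here as a softmax-weighted expectation over the coordinate indices (a soft arg-max), which keeps the step differentiable for the training in steps 6-7; this reading of 'taking element values' is an assumption.
# Hypothetical forward pass for sub-steps 5.1)-5.3): images -> feature vectors ->
# prediction vectors [h], [z] -> predicted coordinates (x', y').
import torch

def predict_coordinates(feature_net, matching_net, images):
    feats = feature_net(images)                        # 5.1) feature vectors [s1, ..., sn]
    h_vec, z_vec = matching_net(feats)                 # 5.2) softmax prediction vectors [h] and [z]
    x_idx = torch.arange(h_vec.size(1), dtype=h_vec.dtype, device=h_vec.device)
    y_idx = torch.arange(z_vec.size(1), dtype=z_vec.dtype, device=z_vec.device)
    x_pred = (h_vec * x_idx).sum(dim=1)                # 5.3) x' as an expectation over indices (assumed)
    y_pred = (z_vec * y_idx).sum(dim=1)                #      y' likewise
    return x_pred, y_pred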
and 6, calculating to obtain mean square error loss (MSE) by using the real coordinates and the predicted coordinates of the training set samples.
Calculating the mean square error loss MSE between the real coordinates and the predicted coordinates in the training set samples by using the real coordinates (x, y) of the training set samples and the predicted coordinates (x ', y') of the training set samples obtained in 5.3):
MSE = \frac{1}{n}\sum_{i=1}^{n}\left[(x_i - x_i')^2 + (y_i - y_i')^2\right]
where (x_i, y_i) represents the real coordinates of the i-th training-set data sample, (x_i', y_i') represents the predicted coordinates of the i-th training-set data sample, i = 1, 2, 3, ..., n indexes the input training-set data samples, and n represents the total number of training-set samples input into the feature extraction network.
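In code, and under the reconstruction of the formula given above, the loss can be written in one line; applying torch.nn.MSELoss to the stacked coordinates would differ only by a constant factor.
# Mean square error between real coordinates (x_i, y_i) and predicted coordinates (x_i', y_i'),
# averaged over the n training samples (normalisation as reconstructed above, which is assumed).
import torch

def mse_coordinate_loss(x_pred, y_pred, x_true, y_true):
    return torch.mean((x_true - x_pred) ** 2 + (y_true - y_pred) ** 2)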
Step 7, training the feature extraction network and the image matching network.
7.1) looping step 5 to step 6, and back-propagating the obtained mean square error loss MSE to the feature extraction network and the image matching network;
7.2) updating the parameters of the feature extraction network and the image matching network so as to reduce the mean square error loss MSE, and stopping the parameter updates when the MSE reaches its minimum, to obtain the trained feature extraction network and the trained image matching network.
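Sub-steps 7.1)-7.2) amount to an ordinary gradient-descent training loop, sketched below using the helpers defined above; the optimiser (Adam), learning rate, epoch count, and the DataLoader yielding (sub-image, true x, true y) batches are assumptions, as the patent does not specify them.
# Hypothetical training loop for step 7: back-propagate the MSE loss and update the
# parameters of both networks until the loss stops decreasing.
import torch

def train(feature_net, matching_net, train_loader, epochs=50, lr=1e-4):
    params = list(feature_net.parameters()) + list(matching_net.parameters())
    optimiser = torch.optim.Adam(params, lr=lr)        # optimiser choice is assumed
    for epoch in range(epochs):
        for images, x_true, y_true in train_loader:    # batches of sub-images and true crop coordinates
            x_pred, y_pred = predict_coordinates(feature_net, matching_net, images)
            loss = mse_coordinate_loss(x_pred, y_pred, x_true.float(), y_true.float())
            optimiser.zero_grad()
            loss.backward()                            # 7.1) back-propagate MSE to both networks
            optimiser.step()                           # 7.2) update parameters to reduce MSE
    return feature_net, matching_net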
Step 8, extracting the test-set sample data feature vectors, obtaining the predicted coordinates of the test-set sample data, and realizing image matching.
8.1) inputting the test-set data samples into the trained feature extraction network to obtain the test-set sample data feature vector [s1', s2', ..., si', ..., sn'], where si' represents the element value of the i-th dimension of the test-set sample data feature vector and n represents the total number of dimensions of the feature vector;
8.2) inputting the test-set sample data feature vector [s1', s2', ..., si', ..., sn'] obtained in 8.1) into the trained image matching network; after passing through the two fully-connected layers, the network output layers respectively output the abscissa prediction vector [h'] and the ordinate prediction vector [z'] of the test-set sample data, where h' represents the abscissa prediction vector element value and z' represents the ordinate prediction vector element value;
8.3) taking the element values of the abscissa prediction vector [h'] and the ordinate prediction vector [z'] of the test-set sample data as the abscissa prediction value x' and the ordinate prediction value y' of the test-set sample data, respectively, to obtain the test-set sample data predicted coordinates (x', y');
8.4) finding, in the original image, the sub-image identical to the test-set sample at the position corresponding to the test-set sample data predicted coordinates (x', y'), thereby realizing image matching.
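Step 8 then reuses the trained networks on a test sub-image and reads the matching sub-image back out of the original image at the predicted position. A minimal sketch, reusing the assumed crop size and Pillow-based I/O from step 3:
# Hypothetical sketch of step 8: predict (x', y') for a test sub-image and extract the
# sub-image at that position in the original image to complete the match.
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

CROP_H, CROP_W = 300, 500

@torch.no_grad()
def match(feature_net, matching_net, test_subimage_path, original_image_path):
    feature_net.eval()
    matching_net.eval()
    sub = to_tensor(Image.open(test_subimage_path)).unsqueeze(0)          # 8.1) test sample tensor
    x_pred, y_pred = predict_coordinates(feature_net, matching_net, sub)  # 8.2)-8.3) predicted (x', y')
    x, y = int(round(x_pred.item())), int(round(y_pred.item()))
    original = Image.open(original_image_path)
    matched = original.crop((x, y, x + CROP_W, y + CROP_H))               # 8.4) sub-image at (x', y')
    return (x, y), matched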
The foregoing description is only an example of the present invention and is not intended to limit the invention; it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made without departing from the spirit and scope of the invention.

Claims (4)

1. An image matching method based on deep learning is characterized by comprising the following steps:
(1) constructing a feature extraction network composed of 6 convolutional layers and 5 pooling layers in cascade, and setting the input data size standard of the feature extraction network;
(2) constructing an image matching network formed by cascading an input layer, two parallel fully-connected layers and two parallel output layers;
(3) randomly cropping the original image data according to the input data size of the feature extraction network to obtain a sub-image data set;
(4) taking 60% of the sub-image data set as a training set and 40% as a test set;
(5) inputting the training set into the feature extraction network and extracting the training-set sample data features to obtain the training-set sample data feature vectors;
(6) inputting the training-set sample data feature vectors into the image matching network, passing them through the two fully-connected layers, and having the network output layers respectively output the abscissa prediction vector and the ordinate prediction vector of the training-set sample data;
(7) taking the element values of the abscissa prediction vector and the ordinate prediction vector as the abscissa prediction value x' and the ordinate prediction value y', respectively;
(8) calculating the mean square error loss MSE between the real coordinates and the predicted coordinates from the real coordinates (x, y) and the predicted coordinates (x', y') of the training-set samples;
(9) looping steps (5)-(7), and completing the training of the feature extraction network and the image matching network by minimizing the mean square error loss;
(10) inputting the test set into the trained feature extraction network to obtain the test-set sample data feature vectors, and inputting these feature vectors into the trained image matching network to obtain the predicted coordinates of the test-set sample data;
(11) finding, in the original image, the sub-image identical to each test-set sample according to the obtained predicted coordinates of the test-set sample data, thereby realizing image matching.
2. The method of claim 1, wherein the feature extraction network constructed in (1) has the following structural relationship: 1st convolutional layer -> 1st pooling layer -> 2nd convolutional layer -> 2nd pooling layer -> 3rd convolutional layer -> 3rd pooling layer -> 4th convolutional layer -> 4th pooling layer -> 5th convolutional layer -> 5th pooling layer -> 6th convolutional layer;
the activation function used by each convolutional layer is the ReLU activation function, and the convolution kernel size is 3 × 3;
each pooling layer is a max pooling layer with a stride of 2 and a kernel size of 2.
3. The method of claim 1, wherein in (2) the input layer and the two parallel fully-connected layers use the ReLU activation function, the two parallel output layers use the softmax activation function, and each fully-connected layer is composed of a 1st hidden layer and a 2nd hidden layer connected in sequence.
4. The method of claim 1, wherein in (8) the mean square error loss MSE between the real coordinates and the predicted coordinates is calculated by the following formula:
MSE = \frac{1}{n}\sum_{i=1}^{n}\left[(x_i - x_i')^2 + (y_i - y_i')^2\right]
where (x_i, y_i) represents the real coordinates of the i-th training-set data sample, (x_i', y_i') represents the predicted coordinates of the i-th training-set data sample, i = 1, 2, 3, ..., n indexes the input training-set data samples, and n represents the total number of training-set samples input into the feature extraction network.
CN202110384410.0A 2021-04-09 2021-04-09 Image matching method based on deep learning Pending CN113111937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110384410.0A CN113111937A (en) 2021-04-09 2021-04-09 Image matching method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110384410.0A CN113111937A (en) 2021-04-09 2021-04-09 Image matching method based on deep learning

Publications (1)

Publication Number Publication Date
CN113111937A (en) 2021-07-13

Family

ID=76715461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110384410.0A Pending CN113111937A (en) 2021-04-09 2021-04-09 Image matching method based on deep learning

Country Status (1)

Country Link
CN (1) CN113111937A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874840A (en) * 2016-12-30 2017-06-20 东软集团股份有限公司 Vehicle information recognition method and device
US20180211401A1 (en) * 2017-01-26 2018-07-26 Samsung Electronics Co., Ltd. Stereo matching method and apparatus, image processing apparatus, and training method therefor
WO2019162204A1 (en) * 2018-02-23 2019-08-29 Asml Netherlands B.V. Deep learning for semantic segmentation of pattern
CN111666998A (en) * 2020-06-03 2020-09-15 电子科技大学 Endoscope intelligent intubation decision-making method based on target point detection
CN112084977A (en) * 2020-09-14 2020-12-15 太原理工大学 Image and time characteristic fused apple phenological period automatic identification method
CN112562255A (en) * 2020-12-03 2021-03-26 国家电网有限公司 Intelligent image detection method for cable channel smoke and fire condition in low-light-level environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YINYANG LIU: "Image Feature Matching Based on Deep Learning", 《2018 IEEE 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATIONS (ICCC)》, 1 August 2019 (2019-08-01) *
杜琳等: "基于深度学习的多模态地貌识别算法研究", 《测绘与空间地理信息》, 31 August 2020 (2020-08-31), pages 1 - 3 *
王若静: "基于特征残差学习和图像转换的异源图像块匹配方法研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》, 15 February 2020 (2020-02-15) *

Similar Documents

Publication Publication Date Title
CN109271960B (en) People counting method based on convolutional neural network
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN107844795B (en) Convolutional neural network feature extraction method based on principal component analysis
CN110321967B (en) Image classification improvement method based on convolutional neural network
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
CN107506765B (en) License plate inclination correction method based on neural network
CN108985217A (en) A kind of traffic sign recognition method and system based on deep space network
CN111583263A (en) Point cloud segmentation method based on joint dynamic graph convolution
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN109146944A (en) A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN111126401B (en) License plate character recognition method based on context information
CN109766823A (en) A kind of high-definition remote sensing ship detecting method based on deep layer convolutional neural networks
CN112101467A (en) Hyperspectral image classification method based on deep learning
CN112150359B (en) Unmanned aerial vehicle image fast splicing method based on machine learning and feature point identification
CN113743417A (en) Semantic segmentation method and semantic segmentation device
CN113505634A (en) Double-flow decoding cross-task interaction network optical remote sensing image salient target detection method
CN115937552A (en) Image matching method based on fusion of manual features and depth features
Zheng et al. Feature enhancement for multi-scale object detection
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN110188646B (en) Human ear identification method based on fusion of gradient direction histogram and local binary pattern
CN115661754A (en) Pedestrian re-identification method based on dimension fusion attention
CN110135435B (en) Saliency detection method and device based on breadth learning system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210713)