CN110245678B - Image matching method based on heterogeneous twin region selection network - Google Patents

Image matching method based on heterogeneous twin region selection network

Info

Publication number
CN110245678B
CN110245678B (application CN201910376172.1A)
Authority
CN
China
Prior art keywords
network
matched
template
convolution
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910376172.1A
Other languages
Chinese (zh)
Other versions
CN110245678A (en)
Inventor
杨卫东
蒋哲兴
王祯瑞
姜昊
王公炎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201910376172.1A
Publication of CN110245678A
Application granted
Publication of CN110245678B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/24: Pattern recognition; analysing; classification techniques
    • G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/40: Image or video recognition or understanding; extraction of image or video features

Abstract

The invention discloses an image matching method based on a heterogeneous twin region selection network, belonging to the field of computer vision. The network comprises a heterogeneous twin network and a region matching network connected in series. The heterogeneous twin network extracts a feature map of the template map and a feature map of the map to be matched; the region matching network obtains a region matching result from these two feature maps. The heterogeneous twin network comprises two parallel sub-networks A and B, each consisting of a feature extraction module, a feature fusion module and a maximum pooling module connected in series; the two sub-networks are identical except for the convolution kernels of the first convolution layer of the feature extraction module. Applying the heterogeneous twin region selection network to image matching allows templates and maps to be matched of non-fixed scale as input and makes full use of the multi-layer features of the image, effectively improving the performance of the matching method and increasing both the success rate and the speed of matching.

Description

Image matching method based on heterogeneous twin region selection network
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an image matching method based on a heterogeneous twin region selection network.
Background
Image matching refers to the process of finding, in an image (or a collection of images), images or image regions (sub-images) similar to a given scene-area image. The known scene-area image is generally called the template image, and the sub-image of the searched image that may correspond to it is called the scene-area image to be matched against the template. Image matching establishes correspondence between two or more images of the same scene taken at different times or from different viewpoints; specific applications include object or scene recognition, recovering 3D structure from multiple images, stereo correspondence, and motion tracking.
Most current image matching algorithms use only shallow hand-crafted features, such as gray-level and gradient features. Owing to changes in shooting time, shooting angle and natural environment, as well as sensor defects and noise, captured images suffer gray-level and geometric distortion, and scenes vary considerably between images; the template image and the image to be matched therefore differ to some degree, and shallow features often fail. As a result, template preparation currently demands a great deal of manual effort, and the procedure is complex and inefficient. Moreover, most deep neural networks require large amounts of data and frequently cannot recognize classes that are scarce in the training set, i.e., few-sample and single-sample cases.
Disclosure of Invention
Aiming at the defects of the prior art, the invention solves the technical problems that prior-art image matching algorithms can hardly adapt to changes of imaging viewpoint and scale, have weak scene adaptability and anti-interference capability, and achieve a low matching success rate.
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides a heterogeneous twin region selection network, which comprises a heterogeneous twin network and a region matching network connected in series;
the heterogeneous twin network is used for extracting a feature map of the template map and a feature map of the map to be matched;
the region matching network is used for obtaining a region matching result according to the feature map of the template map and the feature map of the map to be matched;
the heterogeneous twin network comprises a sub-network A and a sub-network B connected in parallel, each sub-network comprising a feature extraction module, a feature fusion module and a maximum pooling module connected in series; the two sub-networks differ only in the convolution kernels of the first convolution layer of the feature extraction module and are otherwise identical.
Specifically, the feature extraction module is used for extracting a feature map of the input image, the feature fusion module is used for fusing the convolution features of the last three layers of the feature extraction module, and the maximum pooling module normalizes the scale of the fused features.
Specifically, the feature extraction module replaces the second layer of ResNet18 with a convolution.
Specifically, the feature extraction module replaces the last layer of ResNet18 with a convolutional layer.
Specifically, the region matching network comprises a feature division module, a classification module and a position regression module; the classification module and the position regression module are connected in parallel, process in parallel, and are connected in series after the feature division module;
the feature division module comprises a first convolution, a second convolution, a third convolution and a fourth convolution;
the first convolution is used for extracting a template classification feature map from the feature map of the template map;
the second convolution is used for extracting a template position feature map from the feature map of the template map;
the third convolution is used for extracting a classification feature map of the map to be matched from the feature map of the map to be matched;
the fourth convolution is used for extracting a position feature map of the map to be matched from the feature map of the map to be matched;
the classification module is used for convolving the template classification feature map, as a convolution kernel, with the classification feature map of the map to be matched and outputting the matched class;
and the position regression module is used for convolving the template position feature map, as a convolution kernel, with the position feature map of the map to be matched and outputting the matched position.
In a second aspect, an embodiment of the present invention provides an image matching method based on the heterogeneous twin region selection network of the first aspect, comprising the following steps:
S1, training the heterogeneous twin region selection network with training samples, wherein each training sample is a template map / map-to-be-matched pair whose label is the position information of the region corresponding to the template map in the map to be matched;
and S2, inputting the sample to be tested into the trained heterogeneous twin region selection network and outputting its matching result.
Specifically, the total loss function during training is L = HCE + HSL, where HCE is the classification loss function and HSL is the position regression loss; both are defined in the detailed description below (their defining equations appear only as images in the original publication). Here p denotes the predicted probability that a sample is positive and p* the corresponding label; N denotes the total number of samples; t denotes the sample position output by the network, t = (x_o, y_o, w_o, h_o), and t* the corresponding label, i.e., the actual sample position, t* = (x_G, y_G, w_G, h_G); x and y denote the abscissa and ordinate, and w and h the width and length. The position deviation is

d = |t − t*| = |(x_o − x_G) + (y_o − y_G) + (w_o − w_G) + (h_o − h_G)|
Specifically, if the intersection-over-union IOU of the output position and the actual label exceeds 0.7, then p* = 1; otherwise p* = 0.
Specifically, after step S1 and before step S2, the heterogeneous twin region selection network may be further optimized using the samples to be tested, as follows:
input the template maps and the maps to be matched of the test sample set into the network, compute the intersection-over-union of the output results and the labels, measure the matching success probability by this IOU, evaluate the performance of the image matching network, and decide whether to continue training the network.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for image matching based on a heterogeneous twin region selection network according to the second aspect is implemented.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
1. The invention applies a heterogeneous twin region selection network to image matching and improves the network model according to problems arising in practical application. The algorithm accepts templates and maps to be matched of non-fixed scale and makes full use of the multi-layer feature information of the image, which effectively improves the anti-interference capability of the matching method, adapts it to changes of imaging viewpoint and scale, increases the success rate and speed of matching, and lowers the quality requirements on the template. The method suits matching under few-sample and single-sample conditions and can greatly reduce labor cost in practical applications.
2. For network training, the invention proposes a novel loss function: a balanced loss function comprising a balanced cross-entropy loss HCE and a balanced regression loss HSL. These two loss terms markedly improve the network's matching success rate and effectively speed up network convergence.
Drawings
FIG. 1 is a schematic diagram of a training sample provided in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a test sample provided in an embodiment of the present invention;
fig. 3 is a schematic diagram of a heterogeneous twin region selection network structure according to an embodiment of the present invention;
fig. 4 is a flowchart of an image matching method based on a heterogeneous twin region selection network according to an embodiment of the present invention;
FIG. 5(a) is a diagram of a template to be tested according to an embodiment of the present invention;
fig. 5(b) shows a matching result of the sample to be tested according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Sample generation
(1) Prepare n different scenes {P_1, ..., P_i, ..., P_n}; each scene P_i contains multiple visible-light images {P_i1, ..., P_ij, ..., P_iM}, where P_ij is the j-th image of scene P_i.
(2) For each image of the same scene, manually select regions {P_ij1, ..., P_ijk, ..., P_ijs}, where P_ijk denotes the k-th region of the j-th image of scene P_i. Across different images of the same scene, the size, brightness, angle, etc. of a region differ, and the region shows a certain deformation.
(3) From the multiple visible-light images {P_i1, ..., P_ij, ..., P_iM} of scene P_i, randomly select one image and crop the selected region k from it as the template map; randomly select another image as the map to be matched, and mark the position of the region corresponding to k in the map to be matched as the label (ground truth). The label (x_G, y_G, w_G, h_G) corresponds to region k of the j-th image of scene P_i: x_G denotes the abscissa of the center of the corresponding region in image P_ij, y_G the ordinate of that center, w_G the width of the corresponding region, and h_G its length. All of the above coordinates are pixel coordinates.
(4) The template map and its corresponding map to be matched are fed to the network as an image pair. This operation is repeated to traverse all scenes {P_1, P_2, P_3, ..., P_n}; part of the scenes are selected as the training sample set, and the remaining scenes serve as the test sample set. To verify the network's adaptability to few-sample and single-sample cases at test time, the sample classes of the training set and the test set do not overlap: the training set contains none of the test-set classes. For example, if the test set contains airplanes, ships and the like, the training set contains neither airplanes nor ships. Some training samples of the training set are shown in fig. 1 and include the categories car owner, car, and pedestrian. Test samples of the test set are shown in fig. 2 and include the categories building, rabbit, and airplane.
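The pair-generation procedure above can be summarized in code. The following is a minimal sketch, assuming scenes are stored as lists of image arrays with per-image region boxes in (center x, center y, width, length) pixel form; all names and the data layout are illustrative assumptions, not the patent's actual implementation.

    import random

    def make_pair(scene_images, scene_regions, k):
        # scene_images: list of H x W x 3 arrays, all views of one scene P_i.
        # scene_regions[j][k]: (x, y, w, h) box of region k in image j,
        # center coordinates in pixels.
        j1, j2 = random.sample(range(len(scene_images)), 2)
        x, y, w, h = scene_regions[j1][k]
        # Crop region k from one randomly chosen image as the template map.
        template = scene_images[j1][int(y - h / 2):int(y + h / 2),
                                    int(x - w / 2):int(x + w / 2)]
        # Another image of the same scene is the map to be matched; the
        # label is region k's box (x_G, y_G, w_G, h_G) in that image.
        return template, scene_images[j2], scene_regions[j2][k]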
Heterogeneous twin region selection network
As shown in fig. 3, the heterogeneous twin region selection network comprises a heterogeneous twin network and a region matching network connected in series: the heterogeneous twin network extracts the feature map of the template map and the feature map of the map to be matched, and the region matching network obtains the image matching result from these two feature maps. The heterogeneous twin region selection network is only suitable for sensor images of the same spectral band, such as a visible-light template map with a visible-light map to be matched; it is not suitable for matching images of different spectral bands, such as a SAR template map with a visible-light map to be matched.
The heterogeneous twin network comprises a sub-network A and a sub-network B connected in parallel; each sub-network comprises a feature extraction module, a feature fusion module and a maximum pooling module connected in series. Because the imaging viewpoints and image scales of the template map and the map to be matched differ, their features differ considerably, and extracting both with identical convolution kernels gives only mediocre results; the convolution kernels of the first convolution layer of the two feature extraction modules therefore differ, while the remaining modules are identical.
The feature extraction module is used for extracting the feature map of the input image, the feature fusion module is used for fusing the convolution features of the last three layers of the feature extraction module, and the maximum pooling module is used for normalizing the scale of the fused features.
In the embodiment of the invention, the feature extraction module adopts a residual network structure, preferably ResNet18. Further, the second layer of ResNet18 is changed from a maximum pooling layer (down-sampling) to a convolution, so that the network automatically learns a suitable sampling kernel; the last layer of ResNet18 is changed from a fully connected layer to a convolutional layer followed by the maximum pooling module, so that the network accepts image inputs of different sizes; and the parameters of the last three layers of ResNet18 are modified so that the feature sizes of the last three layers are identical. The modified network is called ResNet18v2.
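As a hedged illustration of the ResNet18v2 changes, the following PyTorch sketch swaps the max-pooling layer for a learned stride-2 convolution and ends the trunk with a convolution instead of the fully connected layer, keeping it fully convolutional; layer names follow torchvision's ResNet18, and the output channel count is an assumption.

    import torch.nn as nn
    from torchvision.models import resnet18

    def make_resnet18v2():
        r = resnet18(weights=None)
        # Second layer: replace the fixed max-pool down-sampling with a
        # stride-2 convolution so the sampling kernel is learned.
        learned_pool = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        # Keep only the convolutional trunk (dropping avgpool and fc) and
        # end with a convolution in place of the fully connected layer,
        # so inputs of any size produce a spatial feature map.
        return nn.Sequential(r.conv1, r.bn1, r.relu, learned_pool,
                             r.layer1, r.layer2, r.layer3, r.layer4,
                             nn.Conv2d(512, 256, kernel_size=1))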
Fusing multi-layer convolution feature maps in the feature fusion module gives a better effect on multi-scale region matching. Let the last three feature layers of network A be Conv_A1, Conv_A2, Conv_A3, and the last three feature layers of network B be Conv_B1, Conv_B2, Conv_B3; the fused features Conv_A and Conv_B are then

Conv_A = w1·Conv_A1 + w2·Conv_A2 + w3·Conv_A3
Conv_B = w1·Conv_B1 + w2·Conv_B2 + w3·Conv_B3

where w1 = 2, w2 = 4, w3 = 6.
Block maximum pooling is applied to the feature maps so that the network accepts template maps and maps to be matched of arbitrary size, and to improve processing speed. Block maximum pooling means that feature maps of different sizes are reduced to a feature map of fixed size by maximum pooling over blocks.
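A minimal sketch of the fusion and block maximum pooling steps, assuming the last three feature layers already share one shape (per the parameter changes above); the 6 x 6 pooled size is an illustrative assumption. adaptive_max_pool2d plays the role of block maximum pooling: it tiles a feature map of any size into a fixed grid and takes the maximum of each block.

    import torch.nn.functional as F

    def fuse_and_pool(c1, c2, c3, out_size=6, w=(2, 4, 6)):
        # Weighted fusion of the last three convolution features,
        # with w1 = 2, w2 = 4, w3 = 6 as given above.
        fused = w[0] * c1 + w[1] * c2 + w[2] * c3
        # Block maximum pooling: any input size -> fixed out_size grid.
        return F.adaptive_max_pool2d(fused, out_size)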
The region matching network comprises a feature division module, a classification module and a position regression module; the classification module and the position regression module are connected in parallel, process in parallel, and are connected in series after the feature division module.
The feature division module comprises a first convolution, a second convolution, a third convolution and a fourth convolution: the first convolution extracts a template classification feature map from the feature map of the template map; the second convolution extracts a template position feature map from the feature map of the template map; the third convolution extracts a classification feature map of the map to be matched; and the fourth convolution extracts a position feature map of the map to be matched.
The classification module convolves the template classification feature map, used as a convolution kernel, with the classification feature map of the map to be matched and outputs the matched class (matched or unmatched).
The position regression module convolves the template position feature map, used as a convolution kernel, with the position feature map of the map to be matched and outputs the matched position.
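The template-as-kernel operation in both branches is ordinary cross-correlation, sketched below for the classification branch (the position branch works the same way on the position features); shapes and names are illustrative.

    import torch.nn.functional as F

    def branch_response(template_feat, search_feat):
        # template_feat: (C, th, tw) template classification features;
        # search_feat: (1, C, H, W) features of the map to be matched.
        # The template features act as the convolution kernel and are slid
        # over the search features; the (1, 1, H-th+1, W-tw+1) response
        # peaks where the two feature maps match.
        return F.conv2d(search_feat, template_feat.unsqueeze(0))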
Balanced loss function (HCE + HSL)
Classification loss function HCE
Consider the simple two-class cross-entropy loss function (binary cross-entropy loss):

CE(p, p*) = −p*·log(p) − (1 − p*)·log(1 − p)
p = sigmoid(x) = 1/(1 + e^(−x))

where x is the initial class output of the HES-RPN network, p is the positive-class probability output by the network with value range [0, 1], and p* is the corresponding label with value 0 or 1. If the intersection ratio IOU of the output position and the actual label is greater than 0.7, then p* = 1; otherwise p* = 0.
Its gradient (derivative) with respect to x is

∂CE/∂x = p − 1, if p* = 1
∂CE/∂x = p, if p* = 0

i.e., ∂CE/∂x = p − p*.
A gradient mode length can then be defined as

g_i = |p_i − p_i*|

where i denotes sample i. Specifically, the value range of the gradient mode length is divided into a number of unit regions of length ε. For a sample whose gradient mode length is g, its density is defined as the number of samples in the unit region where g lies, divided by the length ε of that region:

GD(g) = (1/ε) · Σ_{k=1}^{N} δ_ε(g_k, g)

where δ_ε(g_k, g) = 1 if g_k falls in the same unit region as g, and 0 otherwise.
The initial value of α is 1 and gradually rises to 2 during training; k denotes sample k, N denotes the total number of samples, and ε is a very small number, typically 0.01. The reciprocal of the gradient density is the weight by which each sample's loss is multiplied after computation, so the new classification loss is

HCE = Σ_{i=1}^{N} CE(p_i, p_i*) / GD(g_i)^α
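The following sketch implements the HCE as reconstructed above: a per-sample gradient mode length g = |p − p*|, a histogram-based density over bins of width ε, and each sample's cross-entropy divided by its density raised to α. Since the original equations survive only as images, the exact form (in particular where α enters) is an assumption.

    import torch

    def hce_loss(p, target, eps=0.01, alpha=1.0):
        # p, target: 1-D tensors of predicted probabilities and 0/1 labels.
        # Gradient mode length of each sample.
        g = (p - target).abs()
        # Count samples per unit region of length eps; the density is
        # count / eps for the region each sample falls in.
        n_bins = int(1 / eps)
        bins = (g / eps).long().clamp(max=n_bins - 1)
        counts = torch.bincount(bins, minlength=n_bins).float()
        density = counts[bins] / eps
        ce = -(target * torch.log(p + 1e-12)
               + (1 - target) * torch.log(1 - p + 1e-12))
        # Reciprocal gradient density (to the power alpha) weights the loss.
        return (ce / density.pow(alpha)).mean()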
position regression loss HSL
The conventionally used regression-branch loss function Smooth L1 is a piecewise function:

smooth_L1(x) = 0.5·x², if |x| < 1
smooth_L1(x) = |x| − 0.5, otherwise

When the distance deviation of a sample from its label is large, i.e. |x| ≥ 1, the derivative of the smooth_L1 function is constantly 1, so all such samples influence the parameter update identically and the specific difficulty of a sample cannot be distinguished. To solve this problem, the DSL loss is introduced. [Its two defining equations are given as images in the original publication and are not reproduced here.] It is based on the deviation

d = |t − t*| = |(x_o − x_G) + (y_o − y_G) + (w_o − w_G) + (h_o − h_G)|
The subscript r marks the position regression loss, to distinguish it from the classification loss. The new position regression loss HSL is then expressed as

[The HSL equation and its auxiliary definition are given as images in the original publication and are not reproduced here.]

The initial value of β is 1 and gradually rises to 1.5 during training; u is a small decimal, typically 0.02. t denotes the sample position output by the network, t = (x_o, y_o, w_o, h_o): x_o denotes the abscissa of the center of the region output by the heterogeneous twin region selection network in the map to be matched P_NM, y_o the ordinate of that center, w_o the width of the output region, and h_o the length of the output region. t* is the corresponding label, i.e., the actual sample position, t* = (x_G, y_G, w_G, h_G).
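For concreteness, the position deviation d defined above can be computed as follows (center coordinates, width and length in pixels); this mirrors the reconstructed formula exactly.

    def position_deviation(t, t_star):
        # t = (x_o, y_o, w_o, h_o): box output by the network;
        # t_star = (x_G, y_G, w_G, h_G): ground-truth box.
        (xo, yo, wo, ho), (xg, yg, wg, hg) = t, t_star
        return abs((xo - xg) + (yo - yg) + (wo - wg) + (ho - hg))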
As shown in fig. 4, an image matching method based on the heterogeneous twin region selection network comprises the following steps:
S1, training the heterogeneous twin region selection network with training samples, wherein each training sample is a template map / map-to-be-matched pair whose label is the position of the region corresponding to the template map in the map to be matched;
and S2, inputting the sample to be tested into the trained heterogeneous twin region selection network and outputting its matching result.
Image matching is performed with the network: select any two images, of unrestricted category, as the template and the map to be matched, input them into the trained heterogeneous twin region selection network as an image pair, and obtain the matching result.
FIG. 5(a) is a template map to be tested, and FIG. 5(b) is a matching result of the sample to be tested.
The heterogeneous twin region selection network of the invention and the prior-art methods SiameseFC, SiamRPN++, gray-level cross-correlation and HOG matching were each used to match the same data set; the matching accuracy and running time are shown in Table 1.
TABLE 1
[Table 1, comparing the matching accuracy and running time of the five methods, is provided as an image in the original publication.]
The running time refers to the time required to complete one match when the template map is 127 × 127 and the map to be matched is 512 × 512.
After step S1 and before step S2, the heterogeneous twin region selection network may be further optimized using the samples to be tested:
input the template maps and the maps to be matched of the test sample set into the network, compute the intersection-over-union of the output results and the labels, measure the matching success probability by this IOU, evaluate the performance of the image matching network, and decide whether to continue training the network.
The intersection-over-union IOU of the network output result (x_o, y_o, w_o, h_o) and the actual label (x_G, y_G, w_G, h_G) is computed as follows:

w_s = max(0, min(x_o + w_o/2, x_G + w_G/2) − max(x_o − w_o/2, x_G − w_G/2))
h_s = max(0, min(y_o + h_o/2, y_G + h_G/2) − max(y_o − h_o/2, y_G − h_G/2))
sarea = w_s * h_s
oarea = w_o * h_o
garea = w_G * h_G
IOU = sarea / (oarea + garea − sarea)
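A direct transcription of the reconstructed IOU formulas, for boxes in (center x, center y, width, length) form:

    def iou(o, g):
        xo, yo, wo, ho = o
        xg, yg, wg, hg = g
        # Width and height of the intersection rectangle (0 if disjoint).
        ws = max(0.0, min(xo + wo / 2, xg + wg / 2) - max(xo - wo / 2, xg - wg / 2))
        hs = max(0.0, min(yo + ho / 2, yg + hg / 2) - max(yo - ho / 2, yg - hg / 2))
        sarea = ws * hs
        return sarea / (wo * ho + wg * hg - sarea)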
Images with intersection ratio IOU greater than 0.3 are regarded as successfully matched; the ratio of successfully matched images to all images is the matching success rate, and the higher the success rate, the better the matching performance.
Suppose the test-set scenes contain T images in total, of which A images have IOU greater than 0.3; then the matching success rate SR (success rate) is

SR = A / T

As training proceeds, if SR no longer rises, the network performance is no longer improving and network training is stopped; otherwise training continues.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. An image matching method based on a heterogeneous twin region selection network, characterized by comprising the following steps:
S1, training a heterogeneous twin region selection network by using training samples, wherein each training sample is a template map / map-to-be-matched pair whose label is the position information of the region corresponding to the template map in the map to be matched, and the heterogeneous twin region selection network comprises a heterogeneous twin network and a region matching network connected in series; the heterogeneous twin network is used for extracting a feature map of the template map and a feature map of the map to be matched; the region matching network is used for obtaining a region matching result according to the feature map of the template map and the feature map of the map to be matched; the heterogeneous twin network comprises a sub-network A and a sub-network B connected in parallel, each sub-network comprising a feature extraction module, a feature fusion module and a maximum pooling module connected in series, the two sub-networks differing only in the convolution kernels of the first convolution layer of the feature extraction module and being otherwise identical;
and S2, inputting the sample to be tested into the trained heterogeneous twin region selection network, and outputting the matching result of the sample to be tested.
2. The image matching method of claim 1, wherein the feature extraction module is configured to extract a feature map of the input image, the feature fusion module is configured to fuse convolution features of the last three layers of the feature extraction module, and the maximum pooling module normalizes a scale of the fused features.
3. The image matching method of claim 1, wherein the feature extraction module replaces the ResNet18 second layer with a convolution.
4. The image matching method of claim 1, wherein the feature extraction module replaces the last layer of the ResNet18 with a convolutional layer.
5. The image matching method of claim 1, wherein the area matching network comprises: the system comprises a feature division module, a classification module and a position regression module; the classification module and the position regression module are connected in parallel, processed in parallel and connected in series behind the characteristic division module; the feature division module includes: a first convolution, a second convolution, a third convolution and a fourth convolution; the first convolution is used for extracting a template classification feature map from the feature map of the template map; the second convolution is used for extracting a template position feature map from the feature map of the template map; the third convolution is used for extracting a classification characteristic diagram of the image to be matched from the characteristic diagram of the image to be matched; the fourth convolution is used for extracting the position characteristic graph of the graph to be matched from the characteristic graph of the graph to be matched; the classification module is used for convolving the template classification characteristic graph with the classification characteristic graph of the graph to be matched by using the template classification characteristic graph as a convolution kernel and outputting a matched class; and the position regression module is used for convolving the template position characteristic graph serving as a convolution kernel with the position characteristic graph of the graph to be matched and outputting a matched position.
6. The image matching method of any one of claims 1 to 5, wherein if the intersection ratio IOU of the output position and the actual label is greater than 0.7, the sample label p* is 1; otherwise the sample label p* is 0.
7. The image matching method of any one of claims 1 to 5, wherein after step S1 and before step S2, the heterogeneous twin region selection network is further optimized by using the sample to be tested, specifically as follows:
inputting the template graph and the graph to be matched in the test sample set into the network, calculating the cross-over ratio of the output result and the label, measuring the matching success probability by the cross-over ratio, evaluating the performance of the image matching network and determining whether to continue training the network.
8. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, implements the image matching method based on a heterogeneous twin region selection network according to any one of claims 1 to 7.
CN201910376172.1A 2019-05-07 2019-05-07 Image matching method based on heterogeneous twin region selection network Active CN110245678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910376172.1A CN110245678B (en) 2019-05-07 2019-05-07 Image matching method based on heterogeneous twin region selection network

Publications (2)

Publication Number Publication Date
CN110245678A CN110245678A (en) 2019-09-17
CN110245678B (en) 2021-10-08

Family

ID=67883642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910376172.1A Active CN110245678B (en) 2019-05-07 2019-05-07 Image matching method based on heterogeneous twin region selection network

Country Status (1)

Country Link
CN (1) CN110245678B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807793B (en) * 2019-09-29 2022-04-22 南京大学 Target tracking method based on twin network
CN110705479A (en) * 2019-09-30 2020-01-17 北京猎户星空科技有限公司 Model training method, target recognition method, device, equipment and medium
US11625834B2 (en) * 2019-11-08 2023-04-11 Sony Group Corporation Surgical scene assessment based on computer vision
CN111428875A (en) * 2020-03-11 2020-07-17 北京三快在线科技有限公司 Image recognition method and device and corresponding model training method and device
CN111401384B (en) * 2020-03-12 2021-02-02 安徽南瑞继远电网技术有限公司 Transformer equipment defect image matching method
CN111489361B (en) * 2020-03-30 2023-10-27 中南大学 Real-time visual target tracking method based on deep feature aggregation of twin network
CN111784644A (en) * 2020-06-11 2020-10-16 上海布眼人工智能科技有限公司 Printing defect detection method and system based on deep learning
CN112150467A (en) * 2020-11-26 2020-12-29 支付宝(杭州)信息技术有限公司 Method, system and device for determining quantity of goods
CN112785371A (en) * 2021-01-11 2021-05-11 上海钧正网络科技有限公司 Shared device position prediction method, device and storage medium
CN113705731A (en) * 2021-09-23 2021-11-26 中国人民解放军国防科技大学 End-to-end image template matching method based on twin network
CN115330876B (en) * 2022-09-15 2023-04-07 中国人民解放军国防科技大学 Target template graph matching and positioning method based on twin network and central position estimation
CN115861595A (en) * 2022-11-18 2023-03-28 华中科技大学 Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9953217B2 (en) * 2015-11-30 2018-04-24 International Business Machines Corporation System and method for pose-aware feature learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bo Li et al.; "High Performance Visual Tracking with Siamese Region Proposal Network"; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-17; pp. 8971-8980 *
Liu Qinglin; "Research on Target Matching Based on Few-Shot Learning"; China Masters' Theses Full-text Database, Information Science and Technology; 2019-02-15; I138-1627 *

Also Published As

Publication number Publication date
CN110245678A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110245678B (en) Image matching method based on heterogeneous twin region selection network
CN109271856B (en) Optical remote sensing image target detection method based on expansion residual convolution
CN108229468B (en) Vehicle appearance feature recognition and vehicle retrieval method and device, storage medium and electronic equipment
CN106504248B (en) Vehicle damage judging method based on computer vision
CN108960211B (en) Multi-target human body posture detection method and system
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN110163213B (en) Remote sensing image segmentation method based on disparity map and multi-scale depth network model
CN111222395A (en) Target detection method and device and electronic equipment
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN110969166A (en) Small target identification method and system in inspection scene
CN108171249B (en) RGBD data-based local descriptor learning method
CN110309843B (en) Automatic identification method for multiple types of components in power equipment image
CN113159043A (en) Feature point matching method and system based on semantic information
CN113610905A (en) Deep learning remote sensing image registration method based on subimage matching and application
CN113095371A (en) Feature point matching method and system for three-dimensional reconstruction
CN117218201A (en) Unmanned aerial vehicle image positioning precision improving method and system under GNSS refusing condition
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
Cui et al. Global propagation of affine invariant features for robust matching
CN113011438A (en) Node classification and sparse graph learning-based bimodal image saliency detection method
CN110458234B (en) Vehicle searching method with map based on deep learning
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
CN115588178B (en) Automatic extraction method for high-precision map elements
CN115049842B (en) Method for detecting damage of aircraft skin image and positioning 2D-3D
Yan et al. Depth-only object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant