CN115330876A - Target template graph matching and positioning method based on twin network and central position estimation


Info

Publication number: CN115330876A (application CN202211131672.7A; granted as CN115330876B)
Authority: CN (China)
Priority/filing date: 2022-09-15
Publication date: 2022-11-11 (CN115330876A); grant date: 2023-04-07 (CN115330876B)
Prior art keywords: graph, template, network, real, target template
Other languages: Chinese (zh)
Inventors: 郑永斌, 任强, 徐婉莹, 白圣建, 孙鹏, 朱笛, 杨东旭
Original and current assignee: National University of Defense Technology
Legal status: Granted, Active

Classifications

    • G06T7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06V10/751: Image or video recognition or understanding using pattern recognition or machine learning; comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06T2207/20081: Indexing scheme for image analysis or enhancement; training; learning
    • G06T2207/20084: Indexing scheme for image analysis or enhancement; artificial neural networks [ANN]
    • Y02T10/40: Climate change mitigation technologies related to transportation; engine management systems


Abstract

The invention belongs to the technical fields of image processing and deep learning, and particularly relates to a target template graph matching and positioning method based on a twin (Siamese) network and center position estimation, comprising the following steps: S1, constructing a target template graph matching and positioning network; S2, training the target template graph matching and positioning network; and S3, applying the trained network model to match and locate the target template graph. Compared with traditional template matching methods, the proposed method fully exploits the powerful feature extraction and characterization capability of the deep twin network and the high-precision localization capability of the center position estimation network; by training on an image set covering large differences in source, scale, rotation, viewing angle, etc., it obtains a matching and positioning network model robust to such complex differences.

Description

Target template graph matching and positioning method based on twin network and central position estimation
Technical Field
The invention belongs to the technical fields of image processing and deep learning, and particularly relates to a target template graph matching and positioning method based on a twin (Siamese) network and center position estimation.
Background
Target template graph matching and positioning means that, given the template graph of a target in advance, the position corresponding to the center of the template graph is accurately located in a real-time image acquired by an imaging device, through the steps of feature extraction, similarity measurement, and search for the most similar position. It is a fundamental technique in the fields of computer vision and target recognition, and is widely applied in tasks such as remote sensing, medical image processing, video surveillance, and imaging guidance. In practical applications the real-time image and the template graph are acquired by different devices, at different times, viewing angles, and illumination conditions, so they often differ greatly in source, rotation, viewing angle, noise, etc., which poses a great challenge to the accurate positioning of the target template graph.
The survey "Image Registration Methods: A Survey" by Barbara Zitová and Jan Flusser (Image and Vision Computing, 2003, 21(11): 977-1000) divides the template graph matching and localization task into four elements: feature extraction, similarity measure, search space, and search method. Traditional target template matching and positioning methods extract hand-crafted features and adopt simple similarity measures, so their feature extraction and similarity measurement capabilities are weak and cannot meet the challenges above. In addition, the search space of traditional methods couples dimensions such as translation, scale, and rotation, and the searched matching position easily falls into a local optimum, producing inaccurate or even wrong localization of the target template graph. The strong feature extraction and exploitation capability of deep learning provides a new technical approach for improving target template graph matching and localization performance. The paper "A Robust and Accurate End-to-End Template Matching Method Based on the Siamese Network" by Qiang Ren et al. (IEEE Geoscience and Remote Sensing Letters, 2022, 19: 1-5) proposes an end-to-end template matching method based on a Siamese network, which treats the template matching task as template classification plus position regression, improving the robustness of template matching and localization to large differences in source, rotation, viewing angle, noise, etc. However, when localizing the template graph, that method densely predicts a rectangular bounding box, i.e., the center position of the template graph is localized indirectly by predicting a template bounding box, so its localization accuracy and robustness are still affected by factors such as source, scale, and viewing-angle differences.
Disclosure of Invention
Aiming at the problems of existing target template graph matching and positioning methods, the invention provides a target template graph matching and positioning method based on a depth twin network and center position estimation.
In order to achieve the above object, the invention provides the following solution: a target template graph matching and positioning method based on a depth twin network and center position estimation, comprising the following steps:
S1, constructing a target template graph matching and positioning network
The target template graph matching and positioning network is formed by cascading, in sequence, a feature extraction twin network, a depth correlation convolution network, and a center position estimation network. Its inputs are a template graph T and a real-time graph S, of sizes m×m and n×n respectively, where m and n are positive integers and n > m; its output is a single-channel heatmap P_hm of size m_h × m_h, where m_h is a positive integer. The larger the heatmap value at a coordinate, the greater the likelihood that this coordinate is the position of the template graph center on the real-time graph. The specific steps are as follows:
S1.1, constructing a feature extraction twin network to extract feature information of the input template graph and real-time graph
The feature extraction twin network is formed by cascading two convolutional neural networks with shared parameters and identical structure; it takes the template graph T and the real-time graph S as input and outputs a template graph feature map f(T) and a real-time graph feature map f(S), where f(T) has size m_1 × m_1 × d and f(S) has size n_1 × n_1 × d; m_1 denotes the length and width of f(T), n_1 the length and width of f(S), d the number of channels, and m_1, n_1, and d are positive integers.
The convolutional neural network is obtained by modifying a standard ResNet network (He K., Zhang X., Ren S., Sun J. Deep Residual Learning for Image Recognition [C] // IEEE Conference on Computer Vision & Pattern Recognition. IEEE Computer Society, 2016), with the following specific modifications:
(1) A 3×3 convolution is added after the third, fourth, and fifth layers of the standard ResNet network to reduce the feature dimensionality; the resulting feature maps are denoted here as F_3, F_4, and F_5, respectively.
(2) A 3×3 deconvolution is applied to F_5, the resulting feature map is concatenated onto F_4, and a 3×3 convolution is applied to the concatenated map, yielding a feature map F_45.
(3) A 3×3 deconvolution is applied to F_45 and the resulting feature map is concatenated onto F_3, giving the final outputs: the template graph feature map f(T) and the real-time graph feature map f(S).
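For concreteness, the following PyTorch sketch shows one way the modified ResNet18 branch described above could be realized; the same instance is applied to both T and S with shared weights, which is what makes the pair a twin network. The stride-2 deconvolutions and the final 3×3 projection back to d channels are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractor(nn.Module):
    """Sketch of one branch of the feature extraction twin network (S1.1).

    Stage-3/4/5 outputs are reduced to d channels by 3x3 convolutions
    (F_3, F_4, F_5), then fused top-down with 3x3 deconvolutions and
    concatenation. The final 3x3 projection back to d channels is an
    assumption made so the output matches the stated m1 x m1 x d size.
    """

    def __init__(self, d: int = 128):
        super().__init__()
        trunk = resnet18(weights=None)
        self.stem = nn.Sequential(trunk.conv1, trunk.bn1, trunk.relu, trunk.maxpool)
        self.layer1, self.layer2 = trunk.layer1, trunk.layer2
        self.layer3, self.layer4 = trunk.layer3, trunk.layer4
        # 3x3 dimension-reduction convolutions added at stages 3, 4, 5
        self.red3 = nn.Conv2d(128, d, 3, padding=1)
        self.red4 = nn.Conv2d(256, d, 3, padding=1)
        self.red5 = nn.Conv2d(512, d, 3, padding=1)
        # 3x3 deconvolutions (stride 2, assumed) that upsample for fusion
        self.up5 = nn.ConvTranspose2d(d, d, 3, stride=2, padding=1, output_padding=1)
        self.up45 = nn.ConvTranspose2d(d, d, 3, stride=2, padding=1, output_padding=1)
        self.fuse45 = nn.Conv2d(2 * d, d, 3, padding=1)  # conv after concat -> F_45
        self.proj = nn.Conv2d(2 * d, d, 3, padding=1)    # assumed output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        c3 = self.layer2(self.layer1(x))   # stage 3 (16x16 for a 127x127 input)
        c4 = self.layer3(c3)               # stage 4
        c5 = self.layer4(c4)               # stage 5
        f3, f4, f5 = self.red3(c3), self.red4(c4), self.red5(c5)
        f45 = self.fuse45(torch.cat([f4, self.up5(f5)], dim=1))   # step (2)
        return self.proj(torch.cat([f3, self.up45(f45)], dim=1))  # step (3)
```

With a 127×127 template graph and a 255×255 real-time graph this yields 16×16×d and 32×32×d feature maps, consistent with the embodiment.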
S1.2, fusing the extracted template graph feature map f(T) and real-time graph feature map f(S) with a depth correlation convolution network
The depth correlation convolution network takes the template graph feature map f(T) and the real-time graph feature map f(S) extracted in S1.1 as input, performs a depthwise correlation convolution over f(S) with f(T) as the convolution kernel, and outputs the fused correlation feature map f_fusion, of size (m_1+1) × (m_1+1) × d;
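The depthwise correlation can be written with a grouped convolution, the trick commonly used in Siamese trackers; the batch handling below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def depthwise_correlation(ft: torch.Tensor, fs: torch.Tensor) -> torch.Tensor:
    """Depthwise correlation of S1.2: f(T) slides over f(S) as a kernel.

    ft: template features, shape (B, d, m1, m1).
    fs: real-time features, shape (B, d, n1, n1).
    Returns a (B, d, n1-m1+1, n1-m1+1) map; with m1=16, n1=32 this is
    (B, d, 17, 17), matching the (m1+1) x (m1+1) x d size in the text.
    """
    b, d, m1, _ = ft.shape
    # Fold the batch into the channel axis so each sample's template
    # correlates only with its own real-time features (groups = b * d).
    fs = fs.reshape(1, b * d, fs.size(2), fs.size(3))
    kernel = ft.reshape(b * d, 1, m1, m1)
    out = F.conv2d(fs, kernel, groups=b * d)
    return out.reshape(b, d, out.size(2), out.size(3))
```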
S1.3, constructing the center position estimation network and computing the single-channel heatmap
The center position estimation network is formed by cascading three 3×3 deconvolution layers and one 3×3 convolution layer, where each 3×3 deconvolution layer has d channels and stride s, s being a positive integer, and the 3×3 convolution layer has d channels and stride 1.
The center position estimation network takes the fused correlation feature map f_fusion from S1.2 as input and outputs the single-channel heatmap P_hm, of size m_h × m_h with m_h = m_1 · s^3. Let p_{x,y} denote the heat value at position (x, y) of P_hm, where 1 ≤ x, y ≤ m_h; then p_{x,y} takes values in [0, 1].
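A minimal sketch of this head follows; the 1-channel output projection and the sigmoid (which keep p_{x,y} in [0, 1], as the text requires), together with the batch-norm/ReLU between deconvolutions, are assumptions consistent with the described single-channel heatmap.

```python
import torch
import torch.nn as nn

class CenterHead(nn.Module):
    """Center position estimation network of S1.3 (assumed details).

    Three 3x3 deconvolutions with stride s=2 take the 17x17 fused map to
    17 -> 33 -> 65 -> 129, the heatmap size of the embodiment. The
    1-channel projection and sigmoid are assumptions made so that the
    output is a single-channel heatmap with values in [0, 1].
    """

    def __init__(self, d: int = 128, s: int = 2):
        super().__init__()
        def deconv() -> nn.Module:
            return nn.Sequential(
                nn.ConvTranspose2d(d, d, 3, stride=s, padding=1),
                nn.BatchNorm2d(d),
                nn.ReLU(inplace=True),
            )
        self.up = nn.Sequential(deconv(), deconv(), deconv())
        self.out = nn.Conv2d(d, 1, 3, stride=1, padding=1)

    def forward(self, f_fusion: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.out(self.up(f_fusion)))
```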
S2, training the target template graph matching and positioning network
S2.1 making a training image set
S2.1.1, for various targets such as houses, roads, bridges, vehicles, ships, and airplanes, images are captured with a visible light camera and an infrared camera at different times of day, from different distances, viewing angles, and positions, yielding a large number of images;
S2.1.2, n_train image pairs, each consisting of a template graph and a real-time graph, are made from the captured images, with n_train ≥ 40000. The specific procedure is: an image block containing a given target is cut from one image, scaled to size m×m, and taken as the template graph, where m is a positive integer; image blocks containing the same target are cut from other images, scaled to n×n, and taken as real-time graphs, where n is a positive integer.
S2.1.3, the n_train image pairs so produced are taken as the training image set.
As can be seen from the above production process, the template graph and the real-time graph differ significantly in source, scale, rotation, viewing angle, etc.
S2.2 calibrating a training image set
When calibrating an image pair consisting of a template graph and a real-time graph in the training image set, the coordinate c_ref = (x_ref, y_ref) of the template graph center on the real-time graph is calibrated first, and is then mapped to the coordinate (x_hm, y_hm) on the heatmap, i.e., the position corresponding to the template graph center on the heatmap. The mapping scales c_ref down to heatmap coordinates and applies ⌊·⌋, the rounding-down (floor) operation; the exact formula is not reproduced here.
After the corresponding coordinates of the template graph center on the heatmap are obtained, the heatmap label P̃ corresponding to this pair of training samples is generated. Unlike calibration methods that directly record positive samples as "1" and negative samples as "0", this step calibrates the heatmap by Gaussian kernel weighting, in order to control the proportion of negative samples in the loss function and reduce the influence of positive/negative sample imbalance. The specific calibration is

$$\tilde{P}_{x,y} = \exp\left(-\frac{(x - x_{hm})^2 + (y - y_{hm})^2}{2\sigma_p^2}\right)$$

where P̃_{x,y} denotes the calibrated heat value at position (x, y) of the heatmap label P̃, x and y range over [1, m_h], and σ_p is a hyper-parameter related to the size of the template graph (the invention's formula for σ_p is not reproduced here). Computing the heat value at all positions (x, y) yields the heatmap label P̃ calibrated for this training sample.
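A sketch of this Gaussian-kernel calibration, assuming the standard unnormalized Gaussian centered at the mapped template-center coordinate; σ_p is left as a parameter because the patent's formula for it is not reproduced.

```python
import numpy as np

def gaussian_heatmap_label(m_h: int, x_hm: int, y_hm: int, sigma_p: float) -> np.ndarray:
    """Gaussian-kernel heatmap label of S2.2 (assumed standard form).

    m_h:         heatmap side length (129 in the embodiment).
    x_hm, y_hm:  template center mapped onto the heatmap, 1-based.
    sigma_p:     template-size-dependent hyper-parameter (its defining
                 formula is not reproduced in the text).
    """
    xs, ys = np.meshgrid(np.arange(1, m_h + 1), np.arange(1, m_h + 1))
    label = np.exp(-((xs - x_hm) ** 2 + (ys - y_hm) ** 2) / (2.0 * sigma_p ** 2))
    # exp(0) = 1 at the center itself, so the peak is the positive sample
    return label.astype(np.float32)
```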
S2.3 design loss function
The loss function used for training is designed as

$$L = -\sum_{x,y}\begin{cases}(1 - p_{x,y})^{\alpha}\log p_{x,y}, & \tilde{P}_{x,y} = 1\\(1 - \tilde{P}_{x,y})^{\beta}\,p_{x,y}^{\alpha}\log(1 - p_{x,y}), & \text{otherwise}\end{cases}$$

where p_{x,y} denotes the heat value (confidence) that the template graph center lies at position (x, y) of the real-time graph, as computed by the target template graph matching and positioning network of S1; P̃_{x,y} denotes the heat value at position (x, y) of the heatmap label calibrated for the training sample in S2.2; and α and β are adjustable hyper-parameters, taken as α = 2 and β = 4 in the invention.
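Assuming the loss is the penalty-reduced focal loss that standardly accompanies Gaussian-weighted heatmap labels with α = 2 and β = 4 (the patent's equation is rendered as an image), a sketch:

```python
import torch

def center_focal_loss(pred: torch.Tensor, label: torch.Tensor,
                      alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    """Penalty-reduced focal loss over the heatmap (assumed form).

    pred:  predicted heatmap p_{x,y} in (0, 1), shape (B, 1, H, W).
    label: Gaussian-calibrated label, same shape, with 1 at the center.
    """
    eps = 1e-6
    pred = pred.clamp(eps, 1.0 - eps)
    pos = label.eq(1.0)
    pos_loss = ((1.0 - pred) ** alpha * torch.log(pred))[pos]
    neg_loss = ((1.0 - label) ** beta * pred ** alpha * torch.log(1.0 - pred))[~pos]
    # Normalizing by the number of positives is a common convention.
    num_pos = pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```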
S2.4, using the training image set produced in S2.1 and calibrated in S2.2, network training is performed with the stochastic gradient descent (SGD) method (LeCun Y., Boser B., Denker J. S., et al. Backpropagation applied to handwritten zip code recognition [J]. Neural Computation, 1989, 1(4): 541-551), i.e., the loss function designed in S2.3 is minimized, yielding the trained target template graph matching and positioning network model.
S3, applying the trained target template graph matching and positioning network model to match and locate the target template graph
The specific process is as follows:
S3.1, the template graph T (of size m×m) and the real-time graph S (of size n×n) to be matched and located are input into the target template graph matching and positioning network model trained in S2.4;
S3.2, the heatmap P_hm is computed and output by the target template graph matching and positioning network model;
S3.3, the maximum value on the heatmap P_hm is found, and the coordinate of the maximum point is recorded as (x_max, y_max);
S3.4, (x_max, y_max) is substituted into the inverse of the coordinate mapping of S2.2 to locate the position (u, v) of the target template graph center on the real-time graph; the exact formula is not reproduced here.
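Putting S3.1 to S3.4 together, a hedged inference sketch; the model signature and the plain scale factor n/m_h used to map the heatmap peak back to real-time-graph coordinates are assumptions standing in for the patent's exact (unreproduced) formula.

```python
import torch

@torch.no_grad()
def locate(model: torch.nn.Module, template: torch.Tensor,
           live: torch.Tensor) -> tuple[float, float]:
    """S3.1-S3.4: run the trained network and decode the center position.

    template: (1, C, m, m) tensor; live: (1, C, n, n) tensor.
    model(template, live) is assumed to return the (1, 1, m_h, m_h) heatmap.
    """
    heatmap = model(template, live)
    m_h = heatmap.size(-1)
    idx = int(heatmap.flatten().argmax())
    y_max, x_max = divmod(idx, m_h)        # row and column of the peak
    scale = live.size(-1) / m_h            # assumed inverse of the S2.2 mapping
    return x_max * scale, y_max * scale    # (u, v) on the real-time graph
```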
Compared with traditional template matching methods, the target template graph matching and positioning method based on the twin network and center position estimation can fully exploit the powerful feature extraction and characterization capability of the deep twin network and the high-precision localization capability of the center position estimation network; by training on an image set covering large differences in source, scale, rotation, viewing angle, etc., it obtains a matching and positioning network model robust to such complex differences.
Drawings
FIG. 1 is a schematic diagram of the network structure of the target template graph matching and positioning method based on the twin network and center position estimation according to the invention;
FIG. 2 is a schematic diagram of the modified ResNet18-based feature extraction network structure according to the invention;
FIG. 3 shows examples of template graphs and real-time graphs in the training image set according to the invention;
FIG. 4 shows some of the template matching results produced by the method of the invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
The invention provides a target template graph matching and positioning method based on a twin network and center position estimation, comprising the following steps:
s1, constructing a target template graph matching positioning network
The target template graph matching and positioning network is formed by cascading, in sequence, a feature extraction twin network, a depth correlation convolution network, and a center position estimation network. Fig. 1 shows the specific structure of the entire network. In the embodiment, the network inputs are a template graph T of size 127×127 and a real-time graph S of size 255×255; the output is a single-channel heatmap of size 129×129.
S1.1, constructing a feature extraction twin network, and extracting feature information of an input template graph and a real-time graph
The feature extraction twin network is formed by cascading two convolutional neural networks with shared parameters and identical structure; it takes the template graph T and the real-time graph S as input and outputs the template graph feature map f(T) and the real-time graph feature map f(S), respectively. In the embodiment, m_1 = 16, n_1 = 32, and d = 128; that is, f(T) has size 16×16×128 and f(S) has size 32×32×128.
As shown in fig. 2, the convolutional neural network is obtained by modifying a standard ResNet network, with the following specific modifications:
(1) A 3×3 convolution is added after the third, fourth, and fifth layers of the standard ResNet network to reduce the feature dimensionality; the resulting feature maps are denoted here as F_3, F_4, and F_5, respectively.
(2) A 3×3 deconvolution is applied to F_5, the resulting feature map is concatenated onto F_4, and a 3×3 convolution is applied to the concatenated map, yielding a feature map F_45.
(3) A 3×3 deconvolution is applied to F_45 and the resulting feature map is concatenated onto F_3, giving the final outputs: the template graph feature map f(T) and the real-time graph feature map f(S).
In the embodiment the ResNet18 network is selected; each 3×3 convolution has 128 channels and stride 1, and each 3×3 deconvolution has 128 channels and stride 2.
S1.2, fusing the extracted template graph feature map f(T) and real-time graph feature map f(S) with the depth correlation convolution network
The inputs of the depth correlation convolution operation are f(T) and f(S); a depthwise correlation convolution is performed over f(S) with f(T) as the kernel, and the output is the fused correlation feature map f_fusion. In the embodiment, f_fusion has size 17×17×128.
S1.3, constructing the center position estimation network and computing the heatmap
The center position estimation network is formed by cascading three 3×3 deconvolution layers and one 3×3 convolution layer; its input is f_fusion and its output is the single-channel heatmap P_hm. In the embodiment, each 3×3 deconvolution layer has 128 channels and stride 2, the 3×3 convolution layer has 128 channels and stride 1, and the output P_hm has size 129×129.
S2, training the target template graph matching and positioning network
S2.1 making a training image set
In the present embodiment, a DJI M300 unmanned aerial vehicle carrying a Zenmuse H20 pan-tilt camera is used to take visible light and infrared pictures of the ground from the air, and 40000 pairs of template graphs and real-time graphs are made as the training image set according to the method of step S2.1, with template graph and real-time graph sizes of 127×127 and 255×255 pixels, respectively.
S2.2 calibrating a training image set
S2.2.1, for each pair of training samples, the coordinate c_ref = (x_ref, y_ref) of the template graph center on the real-time graph is calibrated;
S2.2.2, the position corresponding to the template graph center on the heatmap is computed by scaling c_ref down to heatmap coordinates and applying ⌊·⌋, the rounding-down (floor) operation (the exact formula is not reproduced here);
S2.2.3, after the corresponding coordinate of the template graph center on the heatmap is obtained, the heatmap label P̃ for this pair of training samples is generated. In the embodiment, the calibrated heat value of P̃ at each position (x, y) is computed as

$$\tilde{P}_{x,y} = \exp\left(-\frac{(x - x_{hm})^2 + (y - y_{hm})^2}{2\sigma_p^2}\right)$$

where 1 ≤ x, y ≤ 129 and σ_p is a hyper-parameter related to the size of the template graph.
S2.3 design loss function
The loss function used for training is designed as

$$L = -\sum_{x,y}\begin{cases}(1 - p_{x,y})^{\alpha}\log p_{x,y}, & \tilde{P}_{x,y} = 1\\(1 - \tilde{P}_{x,y})^{\beta}\,p_{x,y}^{\alpha}\log(1 - p_{x,y}), & \text{otherwise}\end{cases}$$

where p_{x,y} denotes the heat value (confidence) that the template graph center lies at position (x, y) of the real-time graph, as computed by the target template graph matching and positioning network of S1; P̃_{x,y} denotes the heat value at position (x, y) of the heatmap label calibrated for the training sample in S2.2; and α and β are adjustable hyper-parameters, taken as α = 2 and β = 4 in this embodiment.
S2.4, using the collected training image set and the calibrated data, network training is performed with the stochastic gradient descent (SGD) method, i.e., the loss function designed in S2.3 is minimized to obtain the trained target template graph matching and positioning network model. In the embodiment, batch_size is set to 128 during training (4 GPUs, with 32 image pairs loaded per GPU), and the parameters momentum and weight_decay are set to 0.9 and 0.001, respectively. The model is trained for 20 epochs in total: over the first 5 epochs the learning rate is increased at equal intervals from 0.001 to 0.005, and over the last 15 epochs it is decayed at equal logarithmic intervals from 0.005 to 0.0005.
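This schedule could be realized as below; interpreting "equal interval" as linear warm-up and "equal logarithmic interval" as log-uniform decay is an assumption.

```python
import numpy as np

def learning_rate(epoch: int) -> float:
    """Per-epoch learning rate for the 20-epoch schedule (assumed spacing).

    Epochs 0-4:   linear warm-up from 0.001 to 0.005.
    Epochs 5-19:  log-uniform decay from 0.005 to 0.0005.
    """
    if epoch < 5:
        return float(np.linspace(0.001, 0.005, 5)[epoch])
    return float(np.logspace(np.log10(0.005), np.log10(0.0005), 15)[epoch - 5])

# Usage sketch: optimizer = torch.optim.SGD(params, lr=learning_rate(0),
# momentum=0.9, weight_decay=0.001); before each epoch, set every param
# group's lr to learning_rate(epoch).
```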
S3, applying the trained target template graph matching and positioning network model to match and locate the target template graph
The specific process is as follows:
S3.1, the template graph T (of size 127×127) and the real-time graph S (of size 255×255) to be matched and located are input into the target template graph matching and positioning network model trained in S2.4;
S3.2, the heatmap P_hm is computed and output by the target template graph matching and positioning network model;
S3.3, the maximum value on the heatmap P_hm is found, and the coordinate of the maximum point is recorded as (x_max, y_max);
S3.4, (x_max, y_max) is substituted into the inverse of the coordinate mapping of S2.2 to locate the position (u, v) of the target template graph center on the real-time graph; the exact formula is not reproduced here.
in order to qualitatively evaluate the template matching method provided by the invention, in the embodiment, a Dajiang M300 unmanned aerial vehicle is used to carry a Zen Si H20 pan-tilt camera, visible light photos and infrared photos of the ground are taken from the air, 350 pairs of image pairs consisting of template images and real-time images are made, and a test data set is constructed and recorded as Hard350. The template graph and the real-time graph in the test data set have great differences of rotation, visual angle, shielding, heterogeneities (visible light and infrared) and the like, and do not appear in the training set. In the present embodiment, the average central error (MCE) defined based on the central error and the matching Success Rate (SR) are used as evaluation indexes, where SR2 represents the matching success rate obtained when the central error is smaller than 2 pixels and the matching is successful.
Table 1 compares the method provided by the invention with several typical existing template matching methods on the test data sets; the representative algorithms include normalized cross-correlation (NCC), normalized mutual information (NMI), a SIFT-based image matching algorithm, and a HOG-based image matching algorithm, and "Ours" in the table denotes the method provided by the invention. The comparison in Table 1 shows that, compared with traditional template matching methods, the proposed method greatly improves the accuracy and robustness of template matching in complex environments.
TABLE 1. Test results of the different methods on the Easy150 and Hard350 data sets (the table is rendered as an image in the original and its values are not reproduced here).
FIG. 4 shows some target template graph matching and positioning results obtained with the method of the invention under interference from source, viewing-angle, rotation, and scale differences. As can be seen from the figure, the proposed method still performs well under these complex challenge conditions.
In conclusion, the target template graph matching and positioning method based on the twin network and center position estimation provided by the invention achieves good matching and positioning accuracy and robustness under complex challenge conditions.

Claims (5)

1. A target template graph matching and positioning method based on a depth twin network and center position estimation, characterized by comprising the following steps:
s1, constructing a target template graph matching positioning network
The target template graph matching and positioning network is formed by cascading, in sequence, a feature extraction twin network, a depth correlation convolution network, and a center position estimation network; its inputs are a template graph T and a real-time graph S, of sizes m×m and n×n respectively, where m and n are positive integers and n > m; its output is a single-channel heatmap P_hm of size m_h × m_h, where m_h is a positive integer; specifically comprising:
s1.1, constructing a feature extraction twin network, and extracting feature information of an input template graph and a real-time graph
The feature extraction twin network is formed by cascading two convolutional neural networks with shared parameters and identical structure; it takes the template graph T and the real-time graph S as input and outputs a template graph feature map f(T) and a real-time graph feature map f(S), where f(T) has size m_1 × m_1 × d and f(S) has size n_1 × n_1 × d; m_1 denotes the length and width of f(T), n_1 the length and width of f(S), d the number of channels, and m_1, n_1, and d are positive integers;
the convolutional neural network is obtained by modifying on the basis of a standard ResNet network, and the specific modification is as follows:
(1) 3 x 3 convolution is added at the third, fourth and fifth layers of the standard ResNet network to realize feature dimension reduction, and the obtained feature maps are respectively marked as
Figure FDA0003848241680000011
And
Figure FDA0003848241680000012
(2) For characteristic diagram
Figure FDA0003848241680000013
Carrying out 3 multiplied by 3 deconvolution to obtain a characteristic diagram which is spliced on the characteristic diagram
Figure FDA0003848241680000014
Then, carrying out 3 x 3 convolution on the spliced feature map to obtain the feature map
Figure FDA0003848241680000015
(3) For characteristic diagram
Figure FDA0003848241680000016
Performing 3 × 3 deconvolution to obtain a feature map, and splicing the feature map
Figure FDA0003848241680000017
After that, the final output is obtained: a template graph feature graph f (T) and a real-time graph feature graph f (S);
s1.2, fusing the extracted template graph feature graph f (T) and the real-time graph feature graph f (S) by using a depth-dependent convolution network
The depth correlation convolution network takes the template graph feature map f(T) and the real-time graph feature map f(S) extracted in S1.1 as input, performs a depthwise correlation convolution over f(S) with f(T) as the convolution kernel, and outputs the fused correlation feature map f_fusion, of size (m_1+1) × (m_1+1) × d;
S1.3, constructing the center position estimation network and computing the single-channel heatmap
The center position estimation network is formed by cascading three 3×3 deconvolution layers and one 3×3 convolution layer, wherein each 3×3 deconvolution layer has d channels and stride s, s being a positive integer, and the 3×3 convolution layer has d channels and stride 1;
the center position estimation network takes the fused correlation feature map f_fusion from S1.2 as input and outputs a single-channel heatmap P_hm of size m_h × m_h, with m_h = m_1 · s^3; let p_{x,y} denote the heat value at position (x, y) of P_hm, where 1 ≤ x, y ≤ m_h; then p_{x,y} takes values in [0, 1];
S2, training the target template graph matching and positioning network
S2.1 making a training image set
S2.1.1, for various targets including houses, roads, bridges, vehicles, ships, and airplanes, images are captured with a visible light camera and an infrared camera at different times of day, from different distances, viewing angles, and positions, yielding a large number of images;
S2.1.2, n_train image pairs, each consisting of a template graph and a real-time graph, are made from the acquired images;
S2.1.3, the n_train image pairs so made are taken as the training image set;
s2.2 calibrating a training image set
when calibrating an image pair consisting of a template graph and a real-time graph in the training image set, the coordinate c_ref = (x_ref, y_ref) of the template graph center on the real-time graph is calibrated first and then mapped to the coordinate (x_hm, y_hm) on the heatmap, i.e., the position corresponding to the template graph center on the heatmap, by scaling down and applying ⌊·⌋, the rounding-down (floor) operation (the exact formula is not reproduced here);
after the corresponding coordinate of the template graph center on the heatmap is obtained, the heatmap label P̃ corresponding to this pair of training samples is generated; in this step the heatmap is calibrated by Gaussian kernel weighting, specifically

$$\tilde{P}_{x,y} = \exp\left(-\frac{(x - x_{hm})^2 + (y - y_{hm})^2}{2\sigma_p^2}\right)$$

where P̃_{x,y} denotes the calibrated heat value at position (x, y) of the heatmap label P̃, x and y range over [1, m_h], and σ_p is a hyper-parameter related to the size of the template graph; computing the heat value at all positions (x, y) yields the heatmap label P̃ calibrated for the training sample;
S2.3 design loss function
The loss function used for training is designed as

$$L = -\sum_{x,y}\begin{cases}(1 - p_{x,y})^{\alpha}\log p_{x,y}, & \tilde{P}_{x,y} = 1\\(1 - \tilde{P}_{x,y})^{\beta}\,p_{x,y}^{\alpha}\log(1 - p_{x,y}), & \text{otherwise}\end{cases}$$

where p_{x,y} denotes the heat value that the template graph center lies at position (x, y) of the real-time graph, as computed by the target template graph matching and positioning network of S1; P̃_{x,y} denotes the heat value at position (x, y) of the heatmap label calibrated for the training sample in S2.2; and α and β are adjustable hyper-parameters;
S2.4, network training is performed with the stochastic gradient descent method using the training image set acquired in S2.1 and calibrated in S2.2, i.e., the loss function designed in S2.3 is minimized, yielding the trained target template graph matching and positioning network model;
s3, matching and positioning the target template picture by applying the trained target template picture matching and positioning network model
The specific process is as follows:
S3.1, the template graph T of size m×m and the real-time graph S of size n×n to be matched and located are input into the target template graph matching and positioning network model trained in S2.4;
S3.2, the heatmap P_hm is computed and output by the target template graph matching and positioning network model;
S3.3, the maximum value on the heatmap P_hm is found, and the coordinate of the maximum point is recorded as (x_max, y_max);
S3.4, (x_max, y_max) is substituted into the inverse of the coordinate mapping of S2.2 to locate the position (u, v) of the target template graph center on the real-time graph; the exact formula is not reproduced here.
2. The target template graph matching and positioning method based on a depth twin network and center position estimation according to claim 1, characterized in that: in S2.1.2, the number n_train of image pairs consisting of template graphs and real-time graphs satisfies n_train ≥ 40000.
3. The target template graph matching and positioning method based on a depth twin network and center position estimation according to claim 1, characterized in that: in S2.1.2, the method for making the n_train image pairs consisting of template graphs and real-time graphs is: an image block containing a given target is cut from one image, scaled to size m×m, and taken as the template graph, where m is a positive integer; image blocks containing the same target are cut from other images, scaled to n×n, and taken as real-time graphs, where n is a positive integer.
4. The target template graph matching and positioning method based on a depth twin network and center position estimation according to claim 1, characterized in that: in S2.2, the hyper-parameter σ_p related to the size of the template graph is set by a template-size-dependent formula, which is not reproduced here.
5. The target template graph matching and positioning method based on a depth twin network and center position estimation according to claim 1, characterized in that: in S2.3, the adjustable hyper-parameters are taken as α = 2 and β = 4.
CN202211131672.7A 2022-09-15 2022-09-15 Target template graph matching and positioning method based on twin network and central position estimation Active CN115330876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211131672.7A CN115330876B (en) 2022-09-15 2022-09-15 Target template graph matching and positioning method based on twin network and central position estimation


Publications (2)

Publication Number Publication Date
CN115330876A 2022-11-11
CN115330876B 2023-04-07

Family

ID=83929989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211131672.7A Active CN115330876B (en) 2022-09-15 2022-09-15 Target template graph matching and positioning method based on twin network and central position estimation

Country Status (1)

Country Link
CN (1) CN115330876B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190332935A1 (en) * 2018-04-27 2019-10-31 Qualcomm Incorporated System and method for siamese instance search tracker with a recurrent neural network
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN110245678A (en) * 2019-05-07 2019-09-17 华中科技大学 A kind of isomery twinned region selection network and the image matching method based on the network
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features
CN113705731A (en) * 2021-09-23 2021-11-26 中国人民解放军国防科技大学 End-to-end image template matching method based on twin network
CN114022729A (en) * 2021-10-27 2022-02-08 华中科技大学 Heterogeneous image matching positioning method and system based on twin network and supervised training
CN114581678A (en) * 2022-03-15 2022-06-03 中国电子科技集团公司第五十八研究所 Automatic tracking and re-identifying method for template feature matching

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KE LIANG ET AL.: "An Adaptive Kalman-Correlation Based Siamese Network Tracker for Visual Object Tracking"
QIANG REN ET AL.: "A Robust and Accurate End-to-End Template Matching Method Based on the Siamese Network"
史璐璐 et al.: "Target tracking based on a Tiny Darknet fully convolutional Siamese network" (in Chinese)
陈云芳 et al.: "A survey of target tracking algorithms based on Siamese network structures" (in Chinese)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861595A (en) * 2022-11-18 2023-03-28 华中科技大学 Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN115861595B (en) * 2022-11-18 2024-05-24 华中科技大学 Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN116260765A (en) * 2023-05-11 2023-06-13 中国人民解放军国防科技大学 Digital twin modeling method for large-scale dynamic routing network
CN116260765B (en) * 2023-05-11 2023-07-18 中国人民解放军国防科技大学 Digital twin modeling method for large-scale dynamic routing network

Also Published As

Publication number Publication date
CN115330876B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant