CN112215872B - Multi-full convolution fusion single-target tracking method based on twin network - Google Patents

Multi-full convolution fusion single-target tracking method based on twin network

Info

Publication number
CN112215872B
CN112215872B (application CN202011213160.6A)
Authority
CN
China
Prior art keywords
layer
convolution
response
template
features
Prior art date
Legal status
Active
Application number
CN202011213160.6A
Other languages
Chinese (zh)
Other versions
CN112215872A (en)
Inventor
鄢展锋 (Yan Zhanfeng)
姚敏 (Yao Min)
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202011213160.6A
Publication of CN112215872A
Application granted
Publication of CN112215872B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a single-target tracking method based on multi-full-convolution fusion of a twin network, which comprises the steps of: preprocessing a target image; acquiring the convolution feature maps of the preprocessed target image, using a five-layer Alexnet as the backbone network to extract the fourth-layer and fifth-layer convolution features of the template branch and of the search branch, respectively; performing the cross-correlation operation on the extracted features layer by layer as f(z, x) = φ(z) ⋆ φ(x) + b1, wherein φ(z) and φ(x) represent the feature maps obtained by applying the same convolution operation to the template region z and the search region x, ⋆ represents the inner product that yields the response map, and b1 represents the bias; superimposing the two response maps in the channel direction; finding, for the superimposed response maps, the weights occupied by their channels and spatial positions; and determining the maximum response value point on the score map. Compared with the traditional response map obtained by cross-correlating only the last layer of features, the method of the invention marks the center position more accurately even when the target changes.

Description

Multi-full convolution fusion single-target tracking method based on twin network
Technical Field
The invention relates to the technical field of computer vision digital image processing, in particular to a multi-full convolution fusion single-target tracking method based on a twin network.
Background
A twin network (Siamese Network) consists of two neural networks that share weights. In general, a twin network takes two inputs, and its role is to measure the similarity of those two inputs. The specific process is as follows: the two inputs are fed into the two weight-sharing neural networks, mapped into a new feature space, and the degree of similarity between the two inputs is finally compared through a loss function.
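As a minimal illustration of the weight-sharing idea, consider the following numpy sketch (a toy example, not the network of the invention; the embedding matrix `W` and the use of cosine similarity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))  # one weight matrix, shared by both branches

def embed(x):
    # Both inputs pass through the SAME parameters W: this sharing is
    # what makes the two branches a "twin" (Siamese) network.
    return np.tanh(x @ W)

def similarity(a, b):
    # Compare the two embeddings with cosine similarity.
    ea, eb = embed(a), embed(b)
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-12))

x = rng.standard_normal(8)
y = rng.standard_normal(8)
s_same, s_diff = similarity(x, x), similarity(x, y)
```

Because the weights are shared, identical inputs always map to identical embeddings, so `s_same` equals 1 up to floating-point error, and the similarity is symmetric in its two arguments.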
The role of the channel attention module is to decide which features are meaningful. The feature map is compressed along the spatial dimensions: average pooling and maximum pooling each yield a one-dimensional vector, the two vectors are fed into the same multi-layer perceptron, and the output features are summed element by element to generate the channel attention map. The channel attention mechanism can be expressed as:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool (F) and MaxPool (F) represent average pooling and maximum pooling, respectively, of the spatial dimensions, MLP represents a multi-layer perceptron, and σ represents a Sigmoid activation function.
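A hedged numpy sketch of this channel attention computation follows (the ReLU hidden layer and the reduction ratio of 2 are illustrative assumptions; the patent does not specify the MLP's internals):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W1, W2):
    # F: feature map of shape (C, H, W).
    avg = F.mean(axis=(1, 2))   # AvgPool over the spatial dimensions -> (C,)
    mx = F.max(axis=(1, 2))     # MaxPool over the spatial dimensions -> (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)  # the SAME MLP for both vectors
    return sigmoid(mlp(avg) + mlp(mx))            # M_c(F): one weight per channel

rng = np.random.default_rng(1)
F = rng.standard_normal((4, 6, 6))
W1 = rng.standard_normal((2, 4))  # reduction layer, C -> C/2 (assumed ratio)
W2 = rng.standard_normal((4, 2))  # expansion layer, C/2 -> C
Mc = channel_attention(F, W1, W2)
```

The output is one scalar weight in (0, 1) per channel, which is later broadcast-multiplied over the feature map.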
The spatial attention module compresses channels, performs average pooling and maximum pooling in the channel direction respectively, then superimposes the extracted features according to the channel direction to obtain a two-channel feature map, and finally obtains final features through convolution operation and an activation function. The spatial attention mechanism can be expressed as:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function.
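The spatial module can be sketched the same way in numpy (zero padding keeps the spatial size through the 7×7 convolution; the single random kernel is an illustrative assumption):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def spatial_attention(F, kernel):
    # F: (C, H, W); kernel: (2, 7, 7), producing one output channel.
    avg = F.mean(axis=0)          # AvgPool along the channel axis -> (H, W)
    mx = F.max(axis=0)            # MaxPool along the channel axis -> (H, W)
    two = np.stack([avg, mx])     # two-channel feature map, shape (2, H, W)
    padded = np.pad(two, ((0, 0), (3, 3), (3, 3)))   # preserve H, W after 7x7 conv
    win = sliding_window_view(padded, (2, 7, 7))[0]  # (H, W, 2, 7, 7)
    conv = np.einsum('abcij,cij->ab', win, kernel)   # 7x7 convolution
    return 1.0 / (1.0 + np.exp(-conv))               # M_s(F): one weight per position

rng = np.random.default_rng(2)
F = rng.standard_normal((4, 9, 9))
kernel = rng.standard_normal((2, 7, 7))
Ms = spatial_attention(F, kernel)
```

The result is a per-position weight map of the same height and width as the input feature map.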
The features extracted by the twin network contain information from the template and the search region; the target position within them changes continuously, so the extracted features differ slightly from frame to frame. Based on the extracted features, the similarity between the template and the search region is computed, and the point of maximum value on the score map is taken as the center of the current target. A response map obtained by cross-correlating only the last layer of features can locate the target center only approximately, and the marked center position may be inaccurate when the target changes.
Disclosure of Invention
The invention aims to provide a multi-full-convolution-fusion single-target tracking method based on a twin network, to solve the following problem: the features extracted by a twin network contain information from the template and the search region, the target position within them changes continuously, the extracted features differ slightly from frame to frame, and as a result the marked target center position may be inaccurate.
In order to solve the technical problems, the technical scheme of the invention is as follows: the single target tracking method based on multi-full convolution fusion of the twin network comprises the following steps:
step one, preprocessing a target image;
step two, acquiring the convolution feature maps of the preprocessed target image, using the five-layer Alexnet as the backbone network to extract the fourth-layer and fifth-layer convolution features of the template branch and of the search branch, respectively;
step three, performing the cross-correlation operation on the extracted features layer by layer, with the formula:
f(z, x) = φ(z) ⋆ φ(x) + b1
wherein φ(z) and φ(x) represent the feature maps obtained by applying the same convolution operation to the template region z and the search region x, ⋆ represents the inner product that yields the response map, and b1 represents the bias;
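The layer-wise cross-correlation can be sketched in numpy: the template feature block is slid over the search feature map and an inner product is taken at every position (a minimal sketch; treating the bias b1 as a scalar is an assumption):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(phi_z, phi_x, b=0.0):
    # phi_z: template features (C, h, w); phi_x: search features (C, H, W).
    C, h, w = phi_z.shape
    win = sliding_window_view(phi_x, (C, h, w))[0]   # (H-h+1, W-w+1, C, h, w)
    # Inner product of the template with every aligned window, plus the bias.
    return np.einsum('abchw,chw->ab', win, phi_z) + b

rng = np.random.default_rng(3)
phi_z = rng.standard_normal((2, 3, 3))   # stand-in for the 6x6x128 template features
phi_x = rng.standard_normal((2, 5, 5))   # stand-in for the 22x22x128 search features
resp = cross_correlate(phi_z, phi_x)
```

With the patent's fifth-layer sizes (6×6×128 template, 22×22×128 search) the same function yields the 17×17 response map.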
step four, superposing the two response graphs in a channel mode;
step five, expression of the channel attention mechanism:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling over the spatial dimensions, respectively, MLP represents a multi-layer perceptron, and σ represents the Sigmoid activation function; expression of the spatial attention mechanism:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function; the total attention process is:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein ⊗ represents element-by-element multiplication, F is the superimposed response map, F' is the score map output after channel attention, and F'' is the finally output score map.
step six, determining the maximum response value point on the score map.
Further, preprocessing the target image includes: determining the side lengths of the template and the search area; taking the picture block cropped with the template side length, centered on the target in the first frame image, as the template area; and cropping each frame image with the search-area side length as the search area.
Further, where the template area extends beyond the image, the edges are filled with the mean value of the picture.
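A minimal numpy sketch of this crop-with-mean-fill preprocessing (the function name and the centre/side conventions are illustrative assumptions):

```python
import numpy as np

def crop_with_mean_fill(image, cx, cy, side):
    # image: (H, W, 3) array; (cx, cy): target centre; side: crop side length.
    H, W = image.shape[:2]
    mean = image.mean(axis=(0, 1))
    out = np.tile(mean, (side, side, 1))   # start from a mean-filled canvas
    x0, y0 = cx - side // 2, cy - side // 2
    xs0, ys0 = max(x0, 0), max(y0, 0)
    xs1, ys1 = min(x0 + side, W), min(y0 + side, H)
    # Copy the in-bounds part of the crop; out-of-bounds pixels keep the mean.
    out[ys0 - y0:ys1 - y0, xs0 - x0:xs1 - x0] = image[ys0:ys1, xs0:xs1]
    return out

img = np.arange(10 * 10 * 3, dtype=float).reshape(10, 10, 3)
inside = crop_with_mean_fill(img, 5, 5, 4)   # crop fully inside the image
corner = crop_with_mean_fill(img, 0, 0, 4)   # crop hanging over the top-left corner
```

A crop that lies fully inside the image is a plain slice; a crop that hangs over the border keeps the mean colour in the missing region, as the paragraph above describes.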
Further, the Alexnet five-layer convolution is selected as the backbone network; the model parameters through which the two inputs pass are identical, and the fourth-layer feature maps of sizes 8×8×192 (template) and 24×24×192 (search) and the fifth-layer feature maps of sizes 6×6×128 and 22×22×128 are selected respectively.
Further, the two response maps, each of size 17×17×1, are superimposed in the channel direction to give a response map of size 17×17×2.
Further, the final response map is passed through a 1×1 convolution layer to obtain a score map of size 17×17×1; bicubic interpolation is then applied to the 17×17×1 score map to generate a 272×272 image, and the point with the maximum response value is the center of the object.
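The localisation step can be sketched as follows (nearest-neighbour upsampling via `np.kron` stands in for the patent's bicubic interpolation, to keep the sketch dependency-free; 272 = 16 × 17, so each score-map cell covers a 16×16 block):

```python
import numpy as np

def locate_peak(score_map, out_size=272):
    # score_map: (17, 17) score map produced after the 1x1 convolution.
    k = out_size // score_map.shape[0]          # 272 // 17 = 16
    up = np.kron(score_map, np.ones((k, k)))    # upsample to (272, 272)
    iy, ix = np.unravel_index(np.argmax(up), up.shape)
    return iy, ix                               # peak position in the 272x272 image

score = np.zeros((17, 17))
score[5, 9] = 1.0        # pretend the maximum response falls in cell (5, 9)
peak = locate_peak(score)
```

With nearest-neighbour upsampling the peak lands at the top-left corner of the enlarged cell; bicubic interpolation would instead place it smoothly within the block, which is why the patent prefers it.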
According to the multi-full-convolution-fusion single-target tracking method based on a twin network, features are extracted from the preprocessed picture by a simple five-layer neural network; the similarity between the target calibrated in a subsequent frame and that in the first frame is then judged by the cross-correlation operation; the channel attention and spatial attention modules focus on the more important features and suppress unnecessary ones; and finally the maximum value on the score map is determined, which is the center of the target to be tracked. Compared with the traditional response map obtained by cross-correlating only the last layer of features, the marked center position is more accurate even when the target changes.
Drawings
The invention is further described below with reference to the accompanying drawings:
fig. 1 is a schematic flow chart of steps of a multi-full convolution fusion single target tracking method based on a twin network according to an embodiment of the present invention.
Detailed Description
The following describes the twin-network-based multi-full-convolution-fusion single-target tracking method in detail with reference to the accompanying drawings and specific embodiments. The advantages and features of the invention will become more apparent from the following description and the claims. It is noted that the drawings are in a greatly simplified form and are not drawn to precise scale; they are intended only to facilitate a convenient and clear description of the embodiments of the invention.
The multi-full-convolution-fusion single-target tracking method based on a twin network extracts features from the preprocessed picture through a simple five-layer neural network, judges the similarity between the target calibrated in a subsequent frame and that in the first frame through the cross-correlation operation, focuses on the more important features and suppresses unnecessary ones through the channel attention and spatial attention modules, and finally determines the maximum value on the score map, which is the center of the target to be tracked. Compared with the traditional response map obtained by cross-correlating only the last layer of features, the marked center position is more accurate even when the target changes.
According to the technical scheme, the invention provides a multi-full convolution fusion single-target tracking method based on a twin network, and fig. 1 is a flow chart of steps of the multi-full convolution fusion single-target tracking method based on the twin network. Referring to fig. 1, the multi-full convolution fusion single target tracking method based on the twin network comprises the following steps:
s11: preprocessing a target image;
s12: acquiring a convolution feature map of a preprocessing target image, and respectively extracting convolution features of a fourth layer and a fifth layer of a template and searching the convolution features of the fourth layer and the fifth layer of a branch by taking an Alexnet five-layer network as a main network;
the Alexnet five-layer convolution is selected as a backbone network, model parameters which are passed by two inputs are identical, and feature maps of a fourth layer 8x8x192, a 24x24x192 and a fifth layer 6x6x128 and a 22x22x128 are respectively selected.
S13: performing cross-correlation operation on the extracted features according to layers, and respectively constructing a matching mechanism by using a fourth layer and a fifth layer convolution feature diagrams of a template and a search area, wherein the sizes of response diagrams obtained by matching are equal, and the formula is as follows:
wherein (1)>And->The characteristic mapping obtained by the same convolution operation of the template region z and the search region x is represented, the inner product of the response diagram is represented, and b1 represents deviation;
s14: overlapping the two response graphs in a channel mode;
the two response maps are 17x17x1, and are superimposed in the direction of the channel to 17x17x2.
S15: the use of channel and spatial information focuses on more important features and suppresses unnecessary features. For the superimposed response graphs, the weighting of their respective channels and spaces should be found, using a mechanism of interest. Expression of channel attention mechanism:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling over the spatial dimensions, respectively, MLP represents a multi-layer perceptron, and σ represents the Sigmoid activation function; expression of the spatial attention mechanism:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function; the total attention process is:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein ⊗ represents element-by-element multiplication, F is the superimposed response map, F' is the score map output after channel attention, and F'' is the finally output score map.
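The two weighting steps amount to simple broadcast multiplications; a numpy sketch, under the assumption that the channel weights have shape (C,) and the spatial weights shape (H, W):

```python
import numpy as np

def apply_attention(F, Mc, Ms):
    # F: superimposed response map (C, H, W); Mc: (C,); Ms: (H, W).
    F1 = Mc[:, None, None] * F    # F'  = M_c(F) (x) F, broadcast over H and W
    F2 = Ms[None, :, :] * F1      # F'' = M_s(F') (x) F', broadcast over C
    return F2

F = np.ones((2, 17, 17))          # two stacked 17x17 response maps
Mc = np.array([0.5, 1.0])         # per-channel weights
Ms = np.full((17, 17), 0.25)      # per-position weights
out = apply_attention(F, Mc, Ms)
```

Each channel is first scaled by its channel weight and every position is then scaled by its spatial weight, so unimportant channels and positions are suppressed without changing the map's shape.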
S16: and determining the maximum response value point on the score map.
The resulting response map is passed through a 1×1 convolution layer to yield a score map of size 17×17×1. Bicubic interpolation is performed on the 17×17×1 score map to generate a 272×272 image; the point with the maximum response value is the center of the object.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (2)

1. The multi-full convolution fusion single target tracking method based on the twin network is characterized by comprising the following steps of:
step one, preprocessing the target image: determining the side lengths of the template and the search area; taking the picture block cropped with the template side length, centered on the target in the first frame image, as the template area; cropping each frame image with the search-area side length as the search area; and, where the template area extends beyond the image, filling the edges with the mean value of the picture;
step two, acquiring the convolution feature maps of the preprocessed target image, using the five-layer Alexnet as the backbone network to extract the fourth-layer and fifth-layer convolution features of the template branch and of the search branch, respectively;
step three, performing the cross-correlation operation on the extracted features layer by layer, with the formula:
f(z, x) = φ(z) ⋆ φ(x) + b1
wherein φ(z) and φ(x) represent the feature maps obtained by applying the same convolution operation to the template region z and the search region x, ⋆ represents the inner product that yields the response map, and b1 represents the bias;
step four, superimposing the two response maps in the channel direction: the two response maps are each of size 17×17×1 and are superimposed in the channel direction to a size of 17×17×2;
step five, expression of the channel attention mechanism:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling over the spatial dimensions, respectively, MLP represents a multi-layer perceptron, and σ represents the Sigmoid activation function; expression of the spatial attention mechanism:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function; the total attention process is:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein ⊗ represents element-by-element multiplication, F is the superimposed response map, F' is the score map output after channel attention, and F'' is the finally output score map;
step six, determining the maximum response value point on the score map: the resulting response map is passed through a 1×1 convolution layer to obtain a score map of size 17×17×1; bicubic interpolation is then applied to the 17×17×1 score map to generate a 272×272 image, in which the point with the maximum response value is the center of the object.
2. The twin-network-based multi-full-convolution-fusion single-target tracking method according to claim 1, wherein the Alexnet five-layer convolution is selected as the backbone network, the model parameters through which the two inputs pass are identical, and the fourth-layer feature maps of sizes 8×8×192 and 24×24×192 and the fifth-layer feature maps of sizes 6×6×128 and 22×22×128 are selected respectively.
CN202011213160.6A 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network Active CN112215872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213160.6A CN112215872B (en) 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011213160.6A CN112215872B (en) 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network

Publications (2)

Publication Number Publication Date
CN112215872A CN112215872A (en) 2021-01-12
CN112215872B true CN112215872B (en) 2024-03-22

Family

ID=74058112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213160.6A Active CN112215872B (en) 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network

Country Status (1)

Country Link
CN (1) CN112215872B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740651B2 (en) * 2016-10-27 2020-08-11 General Electric Company Methods of systems of generating virtual multi-dimensional models using image analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Target tracking based on a Tiny Darknet fully-convolutional Siamese network; Shi Lulu; Zhang Suofei; Wu Xiaofu; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), (04); full text *

Also Published As

Publication number Publication date
CN112215872A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112308092B (en) Light-weight license plate detection and identification method based on multi-scale attention mechanism
CN109685842B (en) Sparse depth densification method based on multi-scale network
CN102859535B (en) Daisy descriptor is produced from precalculated metric space
CN114708585A (en) Three-dimensional target detection method based on attention mechanism and integrating millimeter wave radar with vision
CN106952225B (en) Panoramic splicing method for forest fire prevention
CN112560774A (en) Obstacle position detection method, device, equipment and storage medium
Rastogi et al. Automatic building footprint extraction from very high-resolution imagery using deep learning techniques
CN114821390B (en) Method and system for tracking twin network target based on attention and relation detection
CN112365523A (en) Target tracking method and device based on anchor-free twin network key point detection
CN112507862A (en) Vehicle orientation detection method and system based on multitask convolutional neural network
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN109141432B (en) Indoor positioning navigation method based on image space and panoramic assistance
CN110956119A (en) Accurate and rapid target detection method in image
CN112163995A (en) Splicing generation method and device for oversized aerial photographing strip images
CN115578616A (en) Training method, segmentation method and device of multi-scale object instance segmentation model
CN115761258A (en) Image direction prediction method based on multi-scale fusion and attention mechanism
CN114429459A (en) Training method of target detection model and corresponding detection method
CN116222577A (en) Closed loop detection method, training method, system, electronic equipment and storage medium
CN112215872B (en) Multi-full convolution fusion single-target tracking method based on twin network
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN113223065B (en) Automatic matching method for SAR satellite image and optical image
CN114067142A (en) Method for realizing scene structure prediction, target detection and lane level positioning
CN116258758A (en) Binocular depth estimation method and system based on attention mechanism and multistage cost body
WO2021103027A1 (en) Base station positioning based on convolutional neural networks
CN111753766A (en) Image processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant