CN112215872B - Multi-full convolution fusion single-target tracking method based on twin network - Google Patents

Multi-full convolution fusion single-target tracking method based on twin network

Info

Publication number
CN112215872B
CN112215872B (application CN202011213160.6A)
Authority
CN
China
Prior art keywords
layer
convolution
response
template
features
Prior art date
Legal status
Active
Application number
CN202011213160.6A
Other languages
Chinese (zh)
Other versions
CN112215872A (en)
Inventor
鄢展锋 (Yan Zhanfeng)
姚敏 (Yao Min)
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202011213160.6A
Publication of CN112215872A
Application granted
Publication of CN112215872B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a single-target tracking method based on multi-full-convolution fusion of a twin network, which comprises the steps of: preprocessing a target image; acquiring the convolution feature maps of the preprocessed target image, using a five-layer Alexnet as the backbone network to extract the fourth-layer and fifth-layer convolution features of the template branch and of the search branch, respectively; performing the cross-correlation operation on the extracted features layer by layer as f(z, x) = φ(z) ⋆ φ(x) + b1, wherein φ(z) and φ(x) represent the feature maps obtained by applying the same convolution operation to the template region z and the search region x, ⋆ represents the inner product that yields the response map, and b1 represents the bias; superimposing the two response maps in the channel direction; finding, for the superimposed response maps, the weights occupied by their channels and spatial positions; and determining the maximum response value point on the score map. Compared with the traditional response map obtained by cross-correlating only the last layer of features, the method of the invention marks the center position more accurately even when the target changes.

Description

Multi-full convolution fusion single-target tracking method based on twin network
Technical Field
The invention relates to the technical field of computer vision digital image processing, in particular to a multi-full convolution fusion single-target tracking method based on a twin network.
Background
A twin network (Siamese Network) consists of two neural networks that share weights. In general, a twin network takes two inputs, and its role is to measure the similarity of those two inputs. The specific process is as follows: the two inputs are fed into the two weight-sharing neural networks, mapped into a new feature space, and the degree of similarity between the two inputs is finally compared through a loss function.
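As a minimal illustration of the weight-sharing idea, consider the following numpy sketch (a toy example, not the network of the invention; the embedding matrix `W` and the use of cosine similarity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))  # one weight matrix, shared by both branches

def embed(x):
    # Both inputs pass through the SAME parameters W: this sharing is
    # what makes the two branches a "twin" (Siamese) network.
    return np.tanh(x @ W)

def similarity(a, b):
    # Compare the two embeddings with cosine similarity.
    ea, eb = embed(a), embed(b)
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-12))

x = rng.standard_normal(8)
y = rng.standard_normal(8)
s_same, s_diff = similarity(x, x), similarity(x, y)
```

Because the weights are shared, identical inputs always map to identical embeddings, so `s_same` equals 1 up to floating-point error, and the similarity is symmetric in its two arguments.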
The role of the channel attention module is to decide which features are meaningful. The feature map is compressed along the spatial dimensions: average pooling and maximum pooling each yield a one-dimensional vector, the two vectors are fed into the same multi-layer perceptron, and the output features are summed element by element to generate the channel attention map. The channel attention mechanism can be expressed as:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool (F) and MaxPool (F) represent average pooling and maximum pooling, respectively, of the spatial dimensions, MLP represents a multi-layer perceptron, and σ represents a Sigmoid activation function.
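A hedged numpy sketch of this channel attention computation follows (the ReLU hidden layer and the reduction ratio of 2 are illustrative assumptions; the patent does not specify the MLP's internals):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W1, W2):
    # F: feature map of shape (C, H, W).
    avg = F.mean(axis=(1, 2))   # AvgPool over the spatial dimensions -> (C,)
    mx = F.max(axis=(1, 2))     # MaxPool over the spatial dimensions -> (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)  # the SAME MLP for both vectors
    return sigmoid(mlp(avg) + mlp(mx))            # M_c(F): one weight per channel

rng = np.random.default_rng(1)
F = rng.standard_normal((4, 6, 6))
W1 = rng.standard_normal((2, 4))  # reduction layer, C -> C/2 (assumed ratio)
W2 = rng.standard_normal((4, 2))  # expansion layer, C/2 -> C
Mc = channel_attention(F, W1, W2)
```

The output is one scalar weight in (0, 1) per channel, which is later broadcast-multiplied over the feature map.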
The spatial attention module compresses channels, performs average pooling and maximum pooling in the channel direction respectively, then superimposes the extracted features according to the channel direction to obtain a two-channel feature map, and finally obtains final features through convolution operation and an activation function. The spatial attention mechanism can be expressed as:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function.
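The spatial module can be sketched the same way in numpy (zero padding keeps the spatial size through the 7×7 convolution; the single random kernel is an illustrative assumption):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def spatial_attention(F, kernel):
    # F: (C, H, W); kernel: (2, 7, 7), producing one output channel.
    avg = F.mean(axis=0)          # AvgPool along the channel axis -> (H, W)
    mx = F.max(axis=0)            # MaxPool along the channel axis -> (H, W)
    two = np.stack([avg, mx])     # two-channel feature map, shape (2, H, W)
    padded = np.pad(two, ((0, 0), (3, 3), (3, 3)))   # preserve H, W after 7x7 conv
    win = sliding_window_view(padded, (2, 7, 7))[0]  # (H, W, 2, 7, 7)
    conv = np.einsum('abcij,cij->ab', win, kernel)   # 7x7 convolution
    return 1.0 / (1.0 + np.exp(-conv))               # M_s(F): one weight per position

rng = np.random.default_rng(2)
F = rng.standard_normal((4, 9, 9))
kernel = rng.standard_normal((2, 7, 7))
Ms = spatial_attention(F, kernel)
```

The result is a per-position weight map of the same height and width as the input feature map.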
The features extracted by the twin network contain information from the template and the search region; the target position within them changes continuously, so the extracted features differ slightly from frame to frame. Based on the extracted features, the similarity between the template and the search region is computed, and the point of maximum value on the score map is taken as the center of the current target. A response map obtained by cross-correlating only the last layer of features can locate the target center only approximately, and the marked center position may be inaccurate when the target changes.
Disclosure of Invention
The invention aims to provide a multi-full-convolution-fusion single-target tracking method based on a twin network, to solve the following problem: the features extracted by a twin network contain information from the template and the search region, the target position within them changes continuously, the extracted features differ slightly from frame to frame, and as a result the marked target center position may be inaccurate.
In order to solve the technical problems, the technical scheme of the invention is as follows: the single target tracking method based on multi-full convolution fusion of the twin network comprises the following steps:
step one, preprocessing a target image;
step two, acquiring the convolution feature maps of the preprocessed target image, using the five-layer Alexnet as the backbone network to extract the fourth-layer and fifth-layer convolution features of the template branch and of the search branch, respectively;
step three, performing the cross-correlation operation on the extracted features layer by layer, with the formula:
f(z, x) = φ(z) ⋆ φ(x) + b1
wherein φ(z) and φ(x) represent the feature maps obtained by applying the same convolution operation to the template region z and the search region x, ⋆ represents the inner product that yields the response map, and b1 represents the bias;
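The layer-wise cross-correlation can be sketched in numpy: the template feature block is slid over the search feature map and an inner product is taken at every position (a minimal sketch; treating the bias b1 as a scalar is an assumption):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(phi_z, phi_x, b=0.0):
    # phi_z: template features (C, h, w); phi_x: search features (C, H, W).
    C, h, w = phi_z.shape
    win = sliding_window_view(phi_x, (C, h, w))[0]   # (H-h+1, W-w+1, C, h, w)
    # Inner product of the template with every aligned window, plus the bias.
    return np.einsum('abchw,chw->ab', win, phi_z) + b

rng = np.random.default_rng(3)
phi_z = rng.standard_normal((2, 3, 3))   # stand-in for the 6x6x128 template features
phi_x = rng.standard_normal((2, 5, 5))   # stand-in for the 22x22x128 search features
resp = cross_correlate(phi_z, phi_x)
```

With the patent's fifth-layer sizes (6×6×128 template, 22×22×128 search) the same function yields the 17×17 response map.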
step four, superposing the two response graphs in a channel mode;
step five, expression of the channel attention mechanism:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling over the spatial dimensions, respectively, MLP represents a multi-layer perceptron, and σ represents the Sigmoid activation function; expression of the spatial attention mechanism:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function; the total attention process is:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein ⊗ represents element-by-element multiplication, F is the superimposed response map, F' is the score map output after channel attention, and F'' is the finally output score map.
step six, determining the maximum response value point on the score map.
Further, preprocessing the target image includes: determining the side lengths of the template and the search area; taking the picture block cropped with the template side length, centered on the target in the first frame image, as the template area; and cropping each frame image with the search-area side length as the search area.
Further, where the template area extends beyond the image, the edges are filled with the mean value of the picture.
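A minimal numpy sketch of this crop-with-mean-fill preprocessing (the function name and the centre/side conventions are illustrative assumptions):

```python
import numpy as np

def crop_with_mean_fill(image, cx, cy, side):
    # image: (H, W, 3) array; (cx, cy): target centre; side: crop side length.
    H, W = image.shape[:2]
    mean = image.mean(axis=(0, 1))
    out = np.tile(mean, (side, side, 1))   # start from a mean-filled canvas
    x0, y0 = cx - side // 2, cy - side // 2
    xs0, ys0 = max(x0, 0), max(y0, 0)
    xs1, ys1 = min(x0 + side, W), min(y0 + side, H)
    # Copy the in-bounds part of the crop; out-of-bounds pixels keep the mean.
    out[ys0 - y0:ys1 - y0, xs0 - x0:xs1 - x0] = image[ys0:ys1, xs0:xs1]
    return out

img = np.arange(10 * 10 * 3, dtype=float).reshape(10, 10, 3)
inside = crop_with_mean_fill(img, 5, 5, 4)   # crop fully inside the image
corner = crop_with_mean_fill(img, 0, 0, 4)   # crop hanging over the top-left corner
```

A crop that lies fully inside the image is a plain slice; a crop that hangs over the border keeps the mean colour in the missing region, as the paragraph above describes.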
Further, the Alexnet five-layer convolution is selected as the backbone network; the model parameters through which the two inputs pass are identical, and the fourth-layer feature maps of sizes 8×8×192 (template) and 24×24×192 (search) and the fifth-layer feature maps of sizes 6×6×128 and 22×22×128 are selected respectively.
Further, the two response maps, each of size 17×17×1, are superimposed in the channel direction to give a response map of size 17×17×2.
Further, the final response map is passed through a 1×1 convolution layer to obtain a score map of size 17×17×1; bicubic interpolation is then applied to the 17×17×1 score map to generate a 272×272 image, and the point with the maximum response value is the center of the object.
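The localisation step can be sketched as follows (nearest-neighbour upsampling via `np.kron` stands in for the patent's bicubic interpolation, to keep the sketch dependency-free; 272 = 16 × 17, so each score-map cell covers a 16×16 block):

```python
import numpy as np

def locate_peak(score_map, out_size=272):
    # score_map: (17, 17) score map produced after the 1x1 convolution.
    k = out_size // score_map.shape[0]          # 272 // 17 = 16
    up = np.kron(score_map, np.ones((k, k)))    # upsample to (272, 272)
    iy, ix = np.unravel_index(np.argmax(up), up.shape)
    return iy, ix                               # peak position in the 272x272 image

score = np.zeros((17, 17))
score[5, 9] = 1.0        # pretend the maximum response falls in cell (5, 9)
peak = locate_peak(score)
```

With nearest-neighbour upsampling the peak lands at the top-left corner of the enlarged cell; bicubic interpolation would instead place it smoothly within the block, which is why the patent prefers it.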
According to the multi-full-convolution-fusion single-target tracking method based on a twin network, features are extracted from the preprocessed picture by a simple five-layer neural network; the similarity between the target calibrated in a subsequent frame and that in the first frame is then judged by the cross-correlation operation; the channel attention and spatial attention modules focus on the more important features and suppress unnecessary ones; and finally the maximum value on the score map is determined, which is the center of the target to be tracked. Compared with the traditional response map obtained by cross-correlating only the last layer of features, the marked center position is more accurate even when the target changes.
Drawings
The invention is further described below with reference to the accompanying drawings:
fig. 1 is a schematic flow chart of steps of a multi-full convolution fusion single target tracking method based on a twin network according to an embodiment of the present invention.
Detailed Description
The following describes the twin-network-based multi-full-convolution-fusion single-target tracking method in detail with reference to the accompanying drawings and specific embodiments. The advantages and features of the invention will become more apparent from the following description and the claims. It is noted that the drawings are in a greatly simplified form and are not drawn to precise scale; they are intended only to facilitate a convenient and clear description of the embodiments of the invention.
The multi-full-convolution-fusion single-target tracking method based on a twin network extracts features from the preprocessed picture through a simple five-layer neural network, judges the similarity between the target calibrated in a subsequent frame and that in the first frame through the cross-correlation operation, focuses on the more important features and suppresses unnecessary ones through the channel attention and spatial attention modules, and finally determines the maximum value on the score map, which is the center of the target to be tracked. Compared with the traditional response map obtained by cross-correlating only the last layer of features, the marked center position is more accurate even when the target changes.
According to the technical scheme, the invention provides a multi-full convolution fusion single-target tracking method based on a twin network, and fig. 1 is a flow chart of steps of the multi-full convolution fusion single-target tracking method based on the twin network. Referring to fig. 1, the multi-full convolution fusion single target tracking method based on the twin network comprises the following steps:
s11: preprocessing a target image;
s12: acquiring a convolution feature map of a preprocessing target image, and respectively extracting convolution features of a fourth layer and a fifth layer of a template and searching the convolution features of the fourth layer and the fifth layer of a branch by taking an Alexnet five-layer network as a main network;
the Alexnet five-layer convolution is selected as a backbone network, model parameters which are passed by two inputs are identical, and feature maps of a fourth layer 8x8x192, a 24x24x192 and a fifth layer 6x6x128 and a 22x22x128 are respectively selected.
S13: performing cross-correlation operation on the extracted features according to layers, and respectively constructing a matching mechanism by using a fourth layer and a fifth layer convolution feature diagrams of a template and a search area, wherein the sizes of response diagrams obtained by matching are equal, and the formula is as follows:
wherein (1)>And->The characteristic mapping obtained by the same convolution operation of the template region z and the search region x is represented, the inner product of the response diagram is represented, and b1 represents deviation;
s14: overlapping the two response graphs in a channel mode;
the two response maps are 17x17x1, and are superimposed in the direction of the channel to 17x17x2.
S15: the use of channel and spatial information focuses on more important features and suppresses unnecessary features. For the superimposed response graphs, the weighting of their respective channels and spaces should be found, using a mechanism of interest. Expression of channel attention mechanism:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling over the spatial dimensions, respectively, MLP represents a multi-layer perceptron, and σ represents the Sigmoid activation function; expression of the spatial attention mechanism:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function; the total attention process is:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein ⊗ represents element-by-element multiplication, F is the superimposed response map, F' is the score map output after channel attention, and F'' is the finally output score map.
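The two weighting steps amount to simple broadcast multiplications; a numpy sketch, under the assumption that the channel weights have shape (C,) and the spatial weights shape (H, W):

```python
import numpy as np

def apply_attention(F, Mc, Ms):
    # F: superimposed response map (C, H, W); Mc: (C,); Ms: (H, W).
    F1 = Mc[:, None, None] * F    # F'  = M_c(F) (x) F, broadcast over H and W
    F2 = Ms[None, :, :] * F1      # F'' = M_s(F') (x) F', broadcast over C
    return F2

F = np.ones((2, 17, 17))          # two stacked 17x17 response maps
Mc = np.array([0.5, 1.0])         # per-channel weights
Ms = np.full((17, 17), 0.25)      # per-position weights
out = apply_attention(F, Mc, Ms)
```

Each channel is first scaled by its channel weight and every position is then scaled by its spatial weight, so unimportant channels and positions are suppressed without changing the map's shape.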
S16: and determining the maximum response value point on the score map.
The resulting response map is passed through a 1×1 convolution layer to yield a score map of size 17×17×1. Bicubic interpolation is performed on the 17×17×1 score map to generate a 272×272 image; the point with the maximum response value is the center of the object.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (2)

1. The multi-full convolution fusion single target tracking method based on the twin network is characterized by comprising the following steps of:
step one, preprocessing the target image: determining the side lengths of the template and the search area; taking the picture block cropped with the template side length, centered on the target in the first frame image, as the template area; cropping each frame image with the search-area side length as the search area; and, where the template area extends beyond the image, filling the edges with the mean value of the picture;
step two, acquiring the convolution feature maps of the preprocessed target image, using the five-layer Alexnet as the backbone network to extract the fourth-layer and fifth-layer convolution features of the template branch and of the search branch, respectively;
step three, performing the cross-correlation operation on the extracted features layer by layer, with the formula:
f(z, x) = φ(z) ⋆ φ(x) + b1
wherein φ(z) and φ(x) represent the feature maps obtained by applying the same convolution operation to the template region z and the search region x, ⋆ represents the inner product that yields the response map, and b1 represents the bias;
step four, superimposing the two response maps in the channel direction: the two response maps are each of size 17×17×1 and are superimposed in the channel direction to a size of 17×17×2;
step five, expression of the channel attention mechanism:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling over the spatial dimensions, respectively, MLP represents a multi-layer perceptron, and σ represents the Sigmoid activation function; expression of the spatial attention mechanism:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))
wherein AvgPool(F) and MaxPool(F) represent average pooling and maximum pooling, respectively, along the channel axis, f^{7×7} represents a convolution operation with a 7×7 kernel, and σ represents the Sigmoid activation function; the total attention process is:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein ⊗ represents element-by-element multiplication, F is the superimposed response map, F' is the score map output after channel attention, and F'' is the finally output score map;
step six, determining the maximum response value point on the score map: the resulting response map is passed through a 1×1 convolution layer to obtain a score map of size 17×17×1; bicubic interpolation is then applied to the 17×17×1 score map to generate a 272×272 image, in which the point with the maximum response value is the center of the object.
2. The twin-network-based multi-full-convolution-fusion single-target tracking method according to claim 1, wherein the Alexnet five-layer convolution is selected as the backbone network, the model parameters through which the two inputs pass are identical, and the fourth-layer feature maps of sizes 8×8×192 and 24×24×192 and the fifth-layer feature maps of sizes 6×6×128 and 22×22×128 are selected respectively.
CN202011213160.6A 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network Active CN112215872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213160.6A CN112215872B (en) 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011213160.6A CN112215872B (en) 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network

Publications (2)

Publication Number Publication Date
CN112215872A CN112215872A (en) 2021-01-12
CN112215872B true CN112215872B (en) 2024-03-22

Family

ID=74058112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213160.6A Active CN112215872B (en) 2020-11-04 2020-11-04 Multi-full convolution fusion single-target tracking method based on twin network

Country Status (1)

Country Link
CN (1) CN112215872B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740651B2 (en) * 2016-10-27 2020-08-11 General Electric Company Methods of systems of generating virtual multi-dimensional models using image analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Target tracking based on a Tiny Darknet fully-convolutional Siamese network; Shi Lulu; Zhang Suofei; Wu Xiaofu; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), (04); full text *

Also Published As

Publication number Publication date
CN112215872A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112308092B (en) Light-weight license plate detection and identification method based on multi-scale attention mechanism
CN109685842B (en) Sparse depth densification method based on multi-scale network
CN102859535B (en) Daisy descriptor is produced from precalculated metric space
CN114708585A (en) Three-dimensional target detection method based on attention mechanism and integrating millimeter wave radar with vision
CN106952225B (en) Panoramic splicing method for forest fire prevention
CN112560774A (en) Obstacle position detection method, device, equipment and storage medium
Rastogi et al. Automatic building footprint extraction from very high-resolution imagery using deep learning techniques
CN114821390B (en) Method and system for tracking twin network target based on attention and relation detection
CN112365523A (en) Target tracking method and device based on anchor-free twin network key point detection
CN112507862A (en) Vehicle orientation detection method and system based on multitask convolutional neural network
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN109141432B (en) Indoor positioning navigation method based on image space and panoramic assistance
CN110956119A (en) Accurate and rapid target detection method in image
CN112163995A (en) Splicing generation method and device for oversized aerial photographing strip images
CN115578616A (en) Training method, segmentation method and device of multi-scale object instance segmentation model
CN115761258A (en) Image direction prediction method based on multi-scale fusion and attention mechanism
CN114429459A (en) Training method of target detection model and corresponding detection method
CN116222577A (en) Closed loop detection method, training method, system, electronic equipment and storage medium
CN112215872B (en) Multi-full convolution fusion single-target tracking method based on twin network
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN113223065B (en) Automatic matching method for SAR satellite image and optical image
CN114067142A (en) Method for realizing scene structure prediction, target detection and lane level positioning
CN116258758A (en) Binocular depth estimation method and system based on attention mechanism and multistage cost body
WO2021103027A1 (en) Base station positioning based on convolutional neural networks
CN111753766A (en) Image processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant