CN110807794A - Single target tracking method based on multiple features - Google Patents
- Publication number
- CN110807794A (application CN201910939321.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- frame
- response
- convolution
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
Abstract
The invention discloses a multi-feature-based single-target tracking method. A correlation filter tracking approach performs correlation operations separately on convolution features and on difference-image features; the resulting response maps are fused, and the fusion result serves as the basis for dynamically correcting the target coordinates during tracking. The method addresses the low tracking accuracy that conventional methods suffer when the target's surroundings change, the target deforms, or the target is occluded, and effectively improves the accuracy of target tracking.
Description
Technical Field
The invention belongs to the field of computer vision, relates to infrared target tracking and deep convolutional neural network algorithms, and can be applied to infrared target tracking scenarios.
Background
In recent years, discriminative methods have become the mainstream in target tracking; commonly used examples include support vector machine tracking algorithms and correlation filter tracking algorithms. Avidan et al. first introduced the support vector machine to target tracking, and Hare et al. proposed a structured-output target tracking algorithm (Structured Output Tracking with Kernels, Struck). Struck uses a kernelized structured-output support vector machine and achieves tracking by explicitly introducing an output space, which avoids an intermediate classification step.
In 2015, Danelljan et al. proposed the SRDCF (Spatially Regularized Discriminative Correlation Filters) algorithm; however, it has weak resistance to interference, loses the target easily, and its tracking accuracy is low.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a single-target tracking method based on multiple features.
The multi-feature-based single-target tracking method of the invention tracks each frame image of a video to be tracked in temporal order, with all frame images in the same coordinate system, and comprises the following steps:
Step one: marking the target coordinates of the first frame image, initializing a correlation filter with the first frame image, and obtaining an initial convolution-feature correlation filter model;
Step two: extracting the multi-scale convolution features of the second frame image, using the first-frame target coordinates as the temporary target coordinates of the second frame image during extraction;
performing a correlation operation on the multi-scale convolution features of the second frame image with the initial convolution-feature correlation filter model to obtain the convolution-feature response map of the second frame, wherein the difference between the peak coordinate of the response map and the center coordinate of the response map, added to the target coordinate of the first frame image, gives the target coordinate of the second frame;
obtaining the difference image of the second and first frame images, initializing a correlation filter with the difference image to obtain the initial difference-image convolution-feature correlation filter model, the target coordinate of the second frame serving as the target coordinate of the current difference image during initialization;
updating the initial convolution-feature correlation filter model with the second frame image;
Step three: performing steps (1) to (5) in a loop to sequentially track the target in the third and subsequent frames:
(1) extracting the multi-scale convolution features of the nth frame, where n ≥ 3, using the target coordinates of frame n-1 as the temporary target coordinates of the current frame during extraction; obtaining the difference image of the nth and (n-1)th frames and extracting its multi-scale convolution features;
(2) performing a correlation operation on the multi-scale convolution features of the nth frame with the current convolution-feature correlation filter model to obtain a first response map; performing a correlation operation on the multi-scale convolution features of the current difference image with the current difference-image convolution-feature correlation filter model to obtain a second response map;
(3) fusing the first response map and the second response map by formula (I) to obtain a fused response map (or fused response-map matrix) R;
In formula (I):
m = 1 or 2, where m = 1 denotes the convolution feature or the first response map, and m = 2 denotes the convolution feature of the difference image or the second response map;
f_m denotes a weight, with f_1 = 1.1 and f_2 = 1;
R_m is the corresponding feature response map (or corresponding feature response-map matrix);
max(R_1, R_2) denotes the larger of the maximum response values of the two response maps;
PSR_m denotes the peak-to-sidelobe ratio of the corresponding feature response map, PSR_m = (max(R_m) - μ_m) / σ_m, where max(R_m) is the maximum response value of the corresponding response map, μ_m is the mean of the response values of the sidelobe pixels in the corresponding response map, and σ_m is the standard deviation of the response values of the sidelobe pixels in the corresponding response map;
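The quantities defined for formula (I) can be made concrete in code. The sketch below computes the peak-to-sidelobe ratio in the sense of document 2 and fuses two response maps by keeping the one with the larger weighted PSR; since the image of formula (I) itself does not survive in this text, this selection rule, the sidelobe window size, and the names `psr` and `fuse` are assumptions for illustration only.

```python
import numpy as np

def psr(response, peak, sidelobe_margin=5):
    """Peak-to-sidelobe ratio (Bolme et al., 2010): the sidelobe is the
    response map with a small window around the peak excluded."""
    mask = np.ones(response.shape, dtype=bool)
    py, px = peak
    mask[max(0, py - sidelobe_margin):py + sidelobe_margin + 1,
         max(0, px - sidelobe_margin):px + sidelobe_margin + 1] = False
    sidelobe = response[mask]
    return (response[py, px] - sidelobe.mean()) / (sidelobe.std() + 1e-12)

def fuse(r1, r2, f1=1.1, f2=1.0):
    """Keep the response map with the larger weighted PSR (assumed rule);
    the weights f1 = 1.1 and f2 = 1 follow the patent."""
    scores = []
    for r, f in ((r1, f1), (r2, f2)):
        peak = np.unravel_index(np.argmax(r), r.shape)
        scores.append(f * psr(r, peak))
    return r1 if scores[0] >= scores[1] else r2
```

With the patent's weights f_1 = 1.1 and f_2 = 1, the convolution-feature response map is slightly favoured when the two PSRs are close.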
(4) the difference between the peak coordinate of the fused response map and the center coordinate of the fused response map, added to the target coordinate of frame n-1, gives the target coordinate of frame n;
(5) updating the current convolution-feature correlation filter model with the current frame image, and updating the current difference-image convolution-feature correlation filter model with the current difference image.
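The coordinate update used in step (4), where the target coordinate of frame n equals the target coordinate of frame n-1 plus the offset of the fused response peak from the map center, can be sketched as follows (function names are illustrative, not from the patent):

```python
import numpy as np

def peak_offset(response):
    """Displacement of the response-map peak from the map center."""
    peak = np.unravel_index(np.argmax(response), response.shape)
    center = (response.shape[0] // 2, response.shape[1] // 2)
    return peak[0] - center[0], peak[1] - center[1]

def track_step(prev_coord, fused_response):
    """Step (4): new target coordinate = previous target coordinate
    plus the peak's offset from the fused response-map center."""
    dy, dx = peak_offset(fused_response)
    return prev_coord[0] + dy, prev_coord[1] + dx
```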
Preferably, the convolution features of the invention are extracted with a VGG convolutional neural network, the convolution feature being the conv3-3 feature.
Preferably, the difference image undergoes an erosion operation followed by a dilation operation before multi-scale convolution feature extraction.
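The preferred preprocessing of the difference image, erosion followed by dilation (a morphological opening), can be sketched in plain NumPy; a practical implementation would typically call a library such as OpenCV, and the 3×3 structuring element used here is an assumed example:

```python
import numpy as np

def erode(img, k=3):
    """Grey-scale erosion with a k×k flat structuring element (minimum filter)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def dilate(img, k=3):
    """Grey-scale dilation: the maximum filter, dual of erosion."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def clean_difference(frame_n, frame_prev, k=3):
    """Absolute frame difference followed by erosion then dilation,
    suppressing isolated noise pixels before feature extraction."""
    diff = np.abs(frame_n.astype(np.float64) - frame_prev.astype(np.float64))
    return dilate(erode(diff, k), k)
```

Opening removes isolated noise pixels in the difference image while approximately preserving larger moving regions.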
Compared with the prior art, the invention has the following advantages:
(1) By dynamically fusing the correlation-filter response maps of different features, the invention exploits the strengths of each feature and improves infrared tracking accuracy; the dynamically fused tracker is markedly better than trackers that use the difference feature or the convolution feature alone.
(2) The invention selects a well-suited convolution feature for correlation-filter tracking and obtains the best tracking accuracy.
Drawings
FIG. 1 shows the first frame image labeled with target coordinates (a) and its target response map (b) in the embodiment;
FIG. 2 is the second frame image in the embodiment;
FIG. 3 is the difference image of the second and first frames in the embodiment;
FIG. 4 is the convolution-feature response map for the third frame of the embodiment;
FIG. 5 is the difference-feature response map for the third and second frames of the embodiment;
FIG. 6 is the fused response map of the embodiment;
FIG. 7 compares the accuracy (a) and success rate (b) of the algorithm of the invention with other trackers.
Detailed Description
The target response map acquisition (or generation) method, the multi-scale images, multi-scale convolution feature extraction, and the initialization, update and correlation operations described in the present invention are all methods known in the art; specifically, the scheme disclosed in document 1 (Danelljan, Martin, et al. "ECO: Efficient Convolution Operators for Tracking." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2017) may be adopted. It should be further explained that multi-scale convolution features are the result of performing convolution feature extraction on each of the multi-scale images. Initialization and update may use the same operation; following document 1, the specific procedure can be understood as: generating a target response map of the image from the target coordinates of the image, extracting a multi-scale image at the target coordinates of the image, sequentially extracting the convolution features of the multi-scale images with the corresponding neural network, and then performing a time-domain convolution operation on the correlation filter with the target response map (the expected value) of the image and the multi-scale convolution features of the image to obtain the initial or updated convolution-feature correlation filter model.
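To make the initialize/update/respond cycle above concrete, the following is a minimal single-channel MOSSE-style correlation filter in the frequency domain (the formulation of document 2), used as a simplified stand-in for the ECO-style filter of document 1; the class name, regularization constant and learning rate are illustrative assumptions:

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired target response: a 2-D Gaussian peaked at the target center."""
    ys, xs = np.ogrid[:shape[0], :shape[1]]
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))

class CorrelationFilter:
    """Single-channel MOSSE-style correlation filter (Bolme et al., 2010)."""
    def __init__(self, lam=1e-3, lr=0.125):
        self.lam, self.lr = lam, lr  # regularization and learning rate (assumed values)
        self.A = self.B = None       # running numerator / denominator in the Fourier domain

    def _terms(self, feature, response):
        F = np.fft.fft2(feature)
        G = np.fft.fft2(response)
        return G * np.conj(F), F * np.conj(F) + self.lam

    def init(self, feature, response):
        """Initialization: set the filter from one (feature, desired-response) pair."""
        self.A, self.B = self._terms(feature, response)

    def update(self, feature, response):
        """Update: running average of numerator and denominator."""
        A, B = self._terms(feature, response)
        self.A = (1 - self.lr) * self.A + self.lr * A
        self.B = (1 - self.lr) * self.B + self.lr * B

    def respond(self, feature):
        """Correlation operation: apply the filter to a new feature map."""
        H = self.A / self.B
        return np.real(np.fft.ifft2(H * np.fft.fft2(feature)))
```

A target that shifts between frames moves the peak of the returned response map by the same shift, which is exactly what the coordinate-update step exploits.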
The invention preferably extracts features using the last convolution layers of the first to fifth groups of the VGG-16 network with the fully connected layers and the softmax layer removed, taking the conv3-3 convolution feature as the convolution feature used for tracking.
It should also be explained that the meaning of the peak-to-sidelobe ratio (PSR) PSR_m in formula (I), of the sidelobe, and of the related terms of the corresponding response maps is disclosed in document 2 (Bolme, David S., et al. "Visual Object Tracking Using Adaptive Correlation Filters." The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010. IEEE, 2010).
Example:
The example applies the method of the invention to track a target on the Linköping Thermal InfraRed dataset (LTIR); each frame image to be tracked in the data lies in the same coordinate system, specifically through coordinate normalization. The specific tracking method comprises the following steps:
Step one: marking the target coordinates of the first frame image and initializing a correlation filter with the first frame image to obtain the initial convolution-feature correlation filter model. Specifically, a target response map of the first frame image is generated from the target coordinates of the first frame image, as shown in FIG. 1; the multi-scale convolution features at the target coordinates of the first frame image are extracted; and a correlation filter is initialized with the target response map and the multi-scale convolution features of the first frame image to obtain the initial convolution-feature correlation filter model.
Step two: taking the target coordinates of the first frame as the temporary target coordinates of the second frame image (shown in FIG. 2) and extracting the multi-scale convolution features of the second frame;
performing a correlation operation on the multi-scale convolution features of the second frame with the initial convolution-feature correlation filter to obtain the convolution-feature response map of the second frame; the difference between the peak coordinate of the response map and the center coordinate of the response map, added to the target coordinate of the first frame, gives the target coordinate of the second frame;
obtaining the difference image of the second and first frames (as shown in FIG. 3), taking the target coordinates of the second frame as the target coordinates of the current difference image, and initializing a correlation filter with the current difference image to obtain the initial difference-image convolution-feature correlation filter model;
updating the initial convolution-feature correlation filter model with the second frame image;
Step three: performing steps (1) to (5) in a loop to track the third and subsequent frames:
(1) taking the target coordinates of the previous frame as the temporary target coordinates of the current frame and extracting the multi-scale convolution features of the nth frame image (i.e., the current frame); obtaining the difference image of the nth and (n-1)th frames and extracting its multi-scale convolution features;
(2) performing a correlation operation on the multi-scale convolution features of the nth frame with the current convolution-feature correlation filter to obtain a first response map (the first response map of the third frame is shown in FIG. 4); performing a correlation operation on the multi-scale convolution features of the current difference image with the current difference-feature correlation filter to obtain a second response map (the second response map of the third frame is shown in FIG. 5);
(3) fusing the first and second response maps by formula (I) to obtain a fused response map R (the fused response map of the second and third frames is shown in FIG. 6);
(4) adding the difference between the peak coordinate of the fused response map and the center coordinate of the response map to the coordinate of the previous frame image to obtain the coordinate of the current frame;
(5) updating the current convolution-feature correlation filter model with the current frame image, and updating the current difference-image convolution-feature correlation filter model with the current difference image.
Preferably, in the embodiment, when extracting the relevant convolution features (including the convolution features of each frame image at each scale and the convolution features of each frame's difference image), the conv3-3 convolution features are extracted with a VGG convolutional neural network; the multi-scale ratios for each frame image are: 1, 1.02, 1.04, 1.06, 0.98, 0.96, 0.94.
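The multi-scale extraction with the ratios above can be sketched as follows; the base patch size, the nearest-neighbour resizing, and the function names are illustrative assumptions (a full implementation would extract conv3-3 features from each resized crop):

```python
import numpy as np

SCALES = [1.0, 1.02, 1.04, 1.06, 0.98, 0.96, 0.94]  # ratios from the embodiment

def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize (a stand-in for the interpolation an
    actual implementation would use)."""
    h, w = img.shape
    rows = (np.arange(out_h) * h / out_h).astype(int)
    cols = (np.arange(out_w) * w / out_w).astype(int)
    return img[rows][:, cols]

def multiscale_patches(img, center, base=32):
    """Crop a patch at each scale around `center`, resized back to the base
    size so every scale yields a feature map of identical shape."""
    cy, cx = center
    patches = []
    for s in SCALES:
        half = int(round(base * s / 2))
        y0, y1 = max(0, cy - half), min(img.shape[0], cy + half)
        x0, x1 = max(0, cx - half), min(img.shape[1], cx + half)
        patches.append(resize_nn(img[y0:y1, x0:x1], base, base))
    return np.stack(patches)
```

Resampling every scaled crop back to a common size is what lets a single correlation filter be applied across all seven scales.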
On the LTIR infrared dataset, the method is compared with four tracking algorithms (KCF, DSST, CSK and SiamFC); the comparison results, shown in FIG. 7, indicate that the method achieves better accuracy on the image sequences.
Claims (3)
1. A multi-feature-based single-target tracking method, which tracks each frame image of a video to be tracked in temporal order, each frame image lying in the same coordinate system, the method comprising the following steps:
Step one: marking the target coordinates of the first frame image, initializing a correlation filter with the first frame image, and obtaining an initial convolution-feature correlation filter model;
Step two: extracting the multi-scale convolution features of the second frame image, using the first-frame target coordinates as the temporary target coordinates of the second frame image during extraction;
performing a correlation operation on the multi-scale convolution features of the second frame image with the initial convolution-feature correlation filter model to obtain the convolution-feature response map of the second frame, wherein the difference between the peak coordinate of the response map and the center coordinate of the response map, added to the target coordinate of the first frame image, gives the target coordinate of the second frame;
obtaining the difference image of the second and first frame images, initializing a correlation filter with the difference image to obtain the initial difference-image convolution-feature correlation filter model, the target coordinate of the second frame serving as the target coordinate of the current difference image during initialization;
updating the initial convolution-feature correlation filter model with the second frame image;
Step three: performing steps (1) to (5) in a loop to sequentially track the target in the third and subsequent frames:
(1) extracting the multi-scale convolution features of the nth frame, where n ≥ 3, using the target coordinates of frame n-1 as the temporary target coordinates of the current frame during extraction; obtaining the difference image of the nth and (n-1)th frames and extracting its multi-scale convolution features;
(2) performing a correlation operation on the multi-scale convolution features of the nth frame with the current convolution-feature correlation filter model to obtain a first response map; performing a correlation operation on the multi-scale convolution features of the current difference image with the current difference-image convolution-feature correlation filter model to obtain a second response map;
(3) fusing the first response map and the second response map by formula (I) to obtain a fused response map R;
In formula (I):
m = 1 or 2, where m = 1 denotes the convolution feature or the first response map, and m = 2 denotes the convolution feature of the difference image or the second response map;
f_m denotes a weight, with f_1 = 1.1 and f_2 = 1;
R_m is the corresponding feature response map;
max(R_1, R_2) denotes the larger of the maximum response values of the two response maps;
PSR_m denotes the peak-to-sidelobe ratio of the corresponding feature response map, PSR_m = (max(R_m) - μ_m) / σ_m, where max(R_m) is the maximum response value of the corresponding response map, μ_m is the mean of the response values of the sidelobe pixels, and σ_m is the standard deviation of the response values of the sidelobe pixels;
(4) the difference between the peak coordinate of the fused response map and the center coordinate of the fused response map, added to the target coordinate of frame n-1, gives the target coordinate of frame n;
(5) updating the current convolution-feature correlation filter model with the current frame image, and updating the current difference-image convolution-feature correlation filter model with the current difference image.
2. The multi-feature-based single-target tracking method according to claim 1, wherein the convolution features are extracted with a VGG convolutional neural network, the convolution feature being the conv3-3 feature.
3. The multi-feature-based single-target tracking method according to claim 1, wherein the difference image is subjected to erosion and dilation operations in sequence before multi-scale convolution feature extraction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910939321.0A CN110807794A (en) | 2019-09-30 | 2019-09-30 | Single target tracking method based on multiple features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110807794A true CN110807794A (en) | 2020-02-18 |
Family
ID=69488022
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910939321.0A Pending CN110807794A (en) | 2019-09-30 | 2019-09-30 | Single target tracking method based on multiple features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110807794A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7362885B2 (en) * | 2004-04-20 | 2008-04-22 | Delphi Technologies, Inc. | Object tracking and eye state identification method |
CN102081801A (en) * | 2011-01-26 | 2011-06-01 | 上海交通大学 | Multi-feature adaptive fused ship tracking and track detecting method |
CN103455797A (en) * | 2013-09-07 | 2013-12-18 | 西安电子科技大学 | Detection and tracking method of moving small target in aerial shot video |
CN109242883A (en) * | 2018-08-14 | 2019-01-18 | 西安电子科技大学 | Optical remote sensing video target tracking method based on depth S R-KCF filtering |
Non-Patent Citations (1)
Title |
---|
Song Jianfeng et al.: "Multi-feature fusion correlation filtering algorithm for infrared single-target tracking" (多特征融合的相关滤波红外单目标跟踪算法), Journal of Xidian University (《西安电子科技大学学报》) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Afifi et al. | What else can fool deep learning? Addressing color constancy errors on deep neural network performance | |
CN107578423B (en) | Multi-feature hierarchical fusion related filtering robust tracking method | |
CN112347861B (en) | Human body posture estimation method based on motion feature constraint | |
CN109903331B (en) | Convolutional neural network target detection method based on RGB-D camera | |
CN110782477A (en) | Moving target rapid detection method based on sequence image and computer vision system | |
CN107609571B (en) | Adaptive target tracking method based on LARK features | |
CN108257155B (en) | Extended target stable tracking point extraction method based on local and global coupling | |
CN112634325A (en) | Unmanned aerial vehicle video multi-target tracking method | |
CN110246154B (en) | Visual target tracking method based on ICA-R multi-feature fusion and self-adaptive updating | |
CN104091145A (en) | Human palm vein feature image acquisition method | |
CN114782298B (en) | Infrared and visible light image fusion method with regional attention | |
CN113449658A (en) | Night video sequence significance detection method based on spatial domain, frequency domain and time domain | |
CN116189019A (en) | Unmanned aerial vehicle ground target tracking method based on improved twin neural network | |
CN114529584A (en) | Single-target vehicle tracking method based on unmanned aerial vehicle aerial photography | |
CN116052025A (en) | Unmanned aerial vehicle video image small target tracking method based on twin network | |
CN110751271B (en) | Image traceability feature characterization method based on deep neural network | |
CN111429485A (en) | Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating | |
CN108876776B (en) | Classification model generation method, fundus image classification method and device | |
CN112766102B (en) | Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion | |
CN110807794A (en) | Single target tracking method based on multiple features | |
CN115984439A (en) | Three-dimensional countertexture generation method and device for disguised target | |
CN113610888B (en) | Twin network target tracking method based on Gaussian smoothing | |
CN113160050B (en) | Small target identification method and system based on space-time neural network | |
CN114913337A (en) | Camouflage target frame detection method based on ternary cascade perception | |
CN108010051A (en) | Multisource video subject fusion tracking based on AdaBoost algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200218 |