CN112991385B

CN112991385B - Twin network target tracking method based on different measurement criteria

Info

Publication number: CN112991385B
Application number: CN202110171718.7A
Authority: CN
Inventors: 刘龙; 付志豪; 史思琦
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2023-04-28
Anticipated expiration: 2041-02-08
Also published as: CN112991385A

Abstract

The invention discloses a twin network target tracking method based on different measurement criteria, which comprises the following specific steps: step 1, selecting a feature extraction network; and 2, acquiring a tracking video, and manually selecting an area where the target is located on a first frame of the video. Obtaining depth characteristics of the template; step 3, entering a subsequent frame, and obtaining depth characteristics of a current frame searching area by utilizing the tracking target coordinate position and the width and height of the previous frame; step 4, similarity measurement is carried out on the depth features of the template and the depth features of the current frame searching area by adopting cosine similarity, and a response chart is obtained; step 5, similarity measurement is carried out on the template depth features and the depth features of the current frame search area by using Euclidean distance, and a response chart obtained by using the Euclidean distance measurement mode is obtained; and 6, carrying out weighted fusion on the two response graphs, and determining the position of the target according to the maximum value on the response graphs. The method solves the problems that a target tracking method based on a twin network is easy to be interfered by a similar object and is not robust to the appearance of the target.

Description

Twin network target tracking method based on different measurement criteria

Technical Field

The invention belongs to the technical field of video single-target tracking, and relates to a twin network target tracking method based on different measurement criteria.

Background

In the field of computer vision, object tracking has been an important topic and research direction at present. The target tracking work is to estimate the position, shape or occupied area of the tracked target in a continuous video image sequence and determine the motion information such as the motion speed, direction and track of the target. The target tracking has important research significance and wide application prospect, and is mainly applied to aspects of video monitoring, man-machine interaction, intelligent transportation, autonomous navigation and the like.

The object tracking method based on the twin network is the main stream of the current object tracking method. The main idea of the twin network architecture is to find a function that can map the input picture to a high-dimensional space such that the simple distance in the target space approximates the "semantic" distance of the input space. More precisely the structure tries to find a set of parameters such that the similarity measure is small when belonging to the same category and large when belonging to different categories. The network is mainly used for metric learning in the past and is used for calculating the similarity of information such as images, sounds, texts and the like. Especially in the field of face recognition. In target tracking, the twin network typically uses the first frame target region as a template, and continuously makes similarity measurements with the template in subsequent frames to obtain the target position and size. The existing target tracking method based on the twin network generally adopts cosine similarity as measurement, the measurement mode is too single, and the situation that the appearance of the target changes greatly or similar target interference exists cannot be well dealt with.

Disclosure of Invention

The invention aims to provide a twin network target tracking method based on different measurement criteria, which solves the problem that the existing twin network target tracking method is easy to be interfered by similar targets or is not robust to the change of the appearance of the targets so as to cause tracking failure.

The technical scheme adopted by the invention is that the twin network target tracking method based on different measurement criteria comprises the following steps:

step 1, selecting a feature extraction network

Step 2, acquiring a tracking video, manually selecting a region where a target is located on a first frame of the video, and inputting a target region Z of the first frame as a template into the feature extraction network selected in the step 1

In, obtain depth features of templates +.>

Step 3, entering a subsequent frame image of the tracking video, and utilizing the tracking target coordinate position (x _t-1 ,y _t-1 ) And width and height (w) _t-1 ,h _t-1 ) Search area S for obtaining current frame image _t Search area S of current frame image _t Input to the feature extraction network selected in step 1

Depth feature of search area of current frame image is obtained>

Step 4, adopting cosine similarity to obtain the depth characteristics of the template obtained in the step 2

And (3) searching the depth feature of the area of the current frame obtained in the step (3)>

Performing similarity measurement to obtain a response graph h obtained by using a cosine similarity measurement mode _c (Z,S _t )；

Step 5, using Euclidean distance to obtain the depth characteristics of the template obtained in the step 2

Performing similarity measurement to obtain a response graph h obtained by using Euclidean distance measurement mode _d (Z,S _t )。

Step 6, for the response graph h obtained in the step 4 _c (Z,S _t ) And the response diagram h obtained in the step 5 _d (Z,S _t ) Weighting and fusing to obtain final response diagram h (Z, S) _t ) And the fused response diagram h (Z, S _t ) To a fixed size, the response map h (Z, S _t ) The maximum point is the position of the tracking target, and then the width and height of the target are updated in a linear interpolation mode, so that the tracking of the target of the current frame is realized.

The invention is also characterized in that:

the specific process of the step 1 is as follows:

selecting AlexNet network pre-trained on ImageNet dataset as feature extraction network of twin network

The specific process of the step 2 is as follows: the specific process of the step 2 is as follows:

step 2.1, acquiring a tracking video, manually selecting an area where a target is located on a first frame of the video, enabling (x, y) to be the center point coordinate of the target in the first frame, m and n to be the width and height of the target area respectively, taking the center point (x, y) of the target in the first frame as the center, and intercepting a square area with the side length of z_sz, wherein the calculation formula of z_sz is as follows:

wherein p= (m+n)/4 represents the filling amount;

step 2.2, if the square region with the side length of z_sz exceeds the first frame image of the tracking video, filling the excess with the average value of the first frame image of the tracking video, wherein the average value of the first frame image

The calculation is carried out by adopting the following formula (2):

wherein ,

representing pixel values of the ith channel, jth row, and kth column in the first frame target region；

Step 2.3, scaling the square region with the side length of z_sz to b×b to obtain a target region Z of the first frame image, and inputting the target region Z of the first frame image into the feature extraction network

Depth characteristics w×h×c are obtained for the width, height, and channel number>

The specific process of the step 3 is as follows:

step 3.1, entering a subsequent frame image of the tracking video, and utilizing the tracking target coordinate position (x _t-1 ,y _t-1 ) And width and height (m _t-1 ,n _t-1 ) A square area with the side length of x_sx is cut off on the current t frame image, and the side length of x_sx is calculated according to the following formula:

wherein p_t-1 ＝(w _t-1 +n _t-1 ) 4, representing the filling amount;

step 3.2, if the square area intercepted in the step 3.1 exceeds the current t frame image, filling the excess part with the mean value of the t frame image, wherein the mean value of the current t frame image is calculated by adopting the following formula:

wherein ,

pixel values representing an ith channel, a jth row, and a kth column in a subsequent frame target region;

step 3.3, scaling the square area with the side length of x_sx to a size of a×a to obtain a search area S of the current t-th frame image _t Will be the current t frameSearch area S of image _t Feature extraction network input to step 1

Depth feature of current frame search region of W×H×C with width, height and channel number obtained>

The specific process of the step 4 is as follows:

first, template frame depth features

Search area +.>

Sliding operation is performed, and every time sliding operation is performed, the current frame search area is +.>

There will always be one and template frame depth feature +.>

Areas of equal size->

Wherein i represents +.>

At->

Subscript of upper horizontal movement, j represents +.>

At->

Subscript of vertical movement is given by

At every time at +.>

The up shift s is a step of size, then i and j will take values within the interval:

wherein i and j are integers;

due to

and />

The dimensions of (2) are w.times.h.times.C, will now be +.>

and />

One-dimensional vector flattened to (wXhXC) x 1>

and />

Measuring the similarity of the two vectors by cosine similarity, solving for +.>

And

the cosine similarity of (2) is as follows:

response diagram h finally obtained by cosine similarity measurement mode _c (Z,S _t ) Is that

The specific steps of the step 5 are as follows:

first template frame depth feature

Search area +.>

Sliding operation is performed, and during the sliding operation, euclidean distance is used for measuring the depth characteristic of the template frame +.>

And the current frame search area->

Neutron block->

Is measured as follows:

finally, a response diagram h is obtained through Euclidean distance measurement mode _d (Z,S _t ) Is that

The specific process of the step 6 is as follows:

step 6.1, for response graph h obtained by cosine similarity _d (Z,S _t ) And Euclidean distance obtained response graph h _d (Z,S _t ) The following weighted fusion is carried out to obtain a fused response diagram h(Z,S _t )：

h(Z,S _t )＝λh _c (Z,S _t )+(1-λ)h _d (Z,S _t ) (6)；

Step 6.2, the fused response diagram h (Z, S) _t ) Interpolation into response maps H (Z, S) of size lambda x lambda _t ) Response map H (Z, S _t ) The maximum point on the map is the position of the target, and then according to the response map H (Z, S _t ) Maximum value and response diagram H (Z, S _t ) The deviation (Deltax, deltay) of the center position corrects the target position (x) of the previous frame _t-1 ,y _t-1 ) Obtain the target position (x _t ,y _t ) The specific calculation mode is as follows:

step 6.3, update the width and height (w _t ,h _t ) Firstly, obtaining a scale of the target width-height change by adopting a linear interpolation mode, wherein the calculation mode is as follows:

wherein r is the update rate;

step 6.4, updating the width and height (w _t ,h _t ) The specific formula is as follows:

and 6.5, finishing the tracking process of the target image of the current frame, taking the next frame as the current frame, and jumping to the step 3 to track the next frame.

The beneficial effects of the invention are as follows:

1. the invention additionally introduces the Euclidean distance measurement mode, so that the network can better cope with the condition of similar target change, and effectively solves the problem of tracking failure caused by similar target occurrence

2. The advantages of the two measurement modes are fully utilized by fusing the response graph obtained by Euclidean distance measurement and the response graph obtained by cosine similarity measurement, so that the network is more robust to the change of the appearance of the target, and the tracking drift problem caused by the change of the appearance of the target is effectively solved.

Drawings

FIG. 1 is a block diagram of a network architecture of the twin network target tracking method of the present invention based on different metric criteria;

FIG. 2 is a schematic diagram of similarity measurement in the twin network target tracking method based on different measurement criteria of the present invention;

FIG. 3 is a process schematic diagram of embodiment 1 of the twin network target tracking method of the present invention based on different metric criteria.

Detailed Description

The invention will be described in detail below with reference to the drawings and the detailed description.

The invention discloses a twin network target tracking method based on different measurement criteria, which is shown in figure 1 and comprises the following specific steps:

step 1, selecting a feature extraction network of a twin network

Step 1 is specifically selecting AlexNet network pre-trained on ImageNet dataset as feature extraction network of twin network

In, obtain depth features of templates +.>

wherein p= (m+n)/4 represents the filling amount;

The calculation is carried out by adopting the following formula (2):

wherein ,

pixel values representing an ith channel, a jth row, and a kth column in a first frame target region;

Depth feature of search area of current frame image is obtained>

The specific process of the step 3 is as follows:

wherein p_t-1 ＝(w _t-1 +n _t-1 ) 4, representing the filling amount;

wherein ,

step 3.3, scaling the square area with the side length of x_sx to a size of a×a to obtain a search area S of the current t-th frame image _t Search area S of current t-th frame image _t Input to the feature extraction network selected in step 1

Step 4, adopting cosine similarity to form template depth characteristics

And the current frame search region depth feature +.>

And (5) similarity measurement is carried out. First of all the depth features of the template frame->

Search area +.>

A sliding operation as shown in fig. 2 is performed. Every time a sliding operation is made, the current frame searches for an area +.>

There will always be one and template frame depth feature +.>

Areas of the same size. Define the area as +.>

Wherein i represents +.>

At->

Subscript of upper horizontal movement, j represents +.>

At->

And vertically moving subscripts. Let->

At every time at +.>

due to

and />

The dimensions of (2) are w.times.h.times.C, will now be +.>

and />

One-dimensional vector flattened to (wXhXC) x 1>

and />

The degree of similarity of the two vectors is measured by cosine similarity. Solving->

And

the cosine similarity of (2) is as follows:

Is a set of (3). h is a _c (Z,S _t ) The expression of (c) can be written in the following form:

* Representing a cross-correlation metric operation.

Step 5, using Euclidean distance to make template frame depth feature

And the current frame search region depth feature +.>

And (4) performing similarity measurement, and adopting a method similar to the cosine similarity measurement in the step (4). First template frame depth feature->

Search area +.>

A sliding operation as shown in fig. 2 is performed. During the sliding operation, euclidean distance is used to measure the depth characteristic of template frame>

And the current frame search area->

Neutron block->

Is measured as follows:

Is a set of (3). h is a _d (Z,S _t ) The expression of (c) can be written in the following form:

the letter indicates Euclidean distance metric operation.

Step 6, for response graph h obtained by cosine similarity _d (Z,S _t ) And Euclidean distance obtained response graph h _d (Z,S _t ) The following weighted fusion is carried out to obtain a fused response diagram:

h(Z,S _t )＝λh _c (Z,S _t )+(1-λ)h _d (Z,S _t ) (10)；

and then the fused response diagram h (Z, S) _t ) Interpolation into response maps H (Z, S) of size lambda x lambda _t ). Response diagram H (Z, S _t ) The maximum point on the map is the position of the target, and then according to the response map H (Z, S _t ) Maximum value and response diagram H (Z, S _t ) The deviation (Deltax, deltay) of the center position corrects the target position (x) of the previous frame _t-1 ,y _t-1 ) Obtaining the destination of the current frameTarget position (x) _t ,y _t ) The specific calculation mode is as follows:

second update the width and height (w) _t ,h _t ) Firstly, obtaining a scale of the target width-height change by adopting a linear interpolation mode, wherein the calculation mode is as follows:

where r is the update rate. And updating the width and height (w) of the current frame target by multiplying the variable scale _t ,h _t ) The specific formula is as follows:

and (3) finishing the tracking process of the current frame target, taking the next frame as the current frame, and jumping to the step (3) to track the next frame.

Example 1

Step 1, selecting an AlexNet network pre-trained on an ImageNet data set as a feature extraction network of a twin network

TABLE 1 feature extraction network parameter table

/>

Feature extraction network

The parameters of (2) are shown in table 1 and consist of a total of 5 convolutional layers and 2 pooling layers. The first two convolutional layers are followed by two maximum pooling layers. The first 4 convolution layers are all followed by adding a random inactivation layer and a RELU nonlinear activation function

And 2, acquiring a tracking video, and manually selecting an area where the target is located on a first frame of the video. Let (x, y) be the center point coordinates of the object in the first frame, m and n be the width and height of the object region, respectively. Taking the center point (x, y) of the target in the first frame as the center, intercepting a square area with the side length of z_sz, wherein the calculation formula of z_sz is as follows:

where p= (m+n)/4 represents the filling amount. If the square area exceeds the image, the excess is filled with the image mean. The square region of z_sz is then scaled to a size of 127 x 127, resulting in the target region Z of the first frame. Finally, inputting the target area Z of the first frame into the feature extraction network

Depth features of 6×6×256 are obtained>

Step 3, entering the subsequent frame, and utilizing the tracking target coordinate position (x _t-1 ,y _t-1 ) And width and height (m _t-1 ,n _t-1 ) A square region with a side length of x_sx is intercepted, and the calculation formula of x_sx is as follows:

wherein p_t-1 ＝(w _t-1 +n _t-1 ) And/4, the filling amount. If the square area exceeds the image, the excess is filled with the image mean. A kind of electronic deviceThen scaling the square region of x_sx to 255×255 to obtain the search region S of the current frame _t And input into a feature extraction network to obtain depth features of the current frame search region

The size is 22 multiplied by 256;

step 4, adopting cosine similarity to form template depth characteristics

And the current frame search region depth feature +.>

Search area +.>

There will always be one and template frame depth feature +.>

Areas of equal size->

Wherein i represents +.>

At->

Subscript of upper horizontal movement, j represents +.>

At->

And vertically moving subscripts. Let->

At every time at +.>

Up-shift by a step of s=1, i and j will take values within the interval:

i∈[1,2,...,17]j∈[1,2,...,17]

due to

and />

The dimensions of (2) are 6X 256, and +.>

and />

One-dimensional vector flattened to (6X 256) X1>

and />

The degree of similarity of the two vectors is measured by cosine similarity. Solving for

and />

The cosine similarity of (2) is as follows: />

* Representing a cross-correlation metric operation.

Step 5, using Euclidean distance to make template frame depth feature

And the current frame search region depth feature +.>

Search area +.>

And the current frame search area->

Neutron block->

Is measured as follows:

the letter indicates Euclidean distance metric operation.

Step 6, response diagram h obtained by cosine similarity _d (Z,S _t ) And the size of the response plots obtained from euclidean distances are 17×17×1. For h _d (Z,S _t) and h_d (Z,S _t ) The weighted fusion is carried out by the weight lambda=0.5 as follows, and a fused response chart is obtained:

and then the fused response diagram h (Z, S) _t ) Interpolated into response plots H (Z, S) of size 272×272 _t ). Response diagram H (Z, S _t ) The maximum point on the map is the position of the target, and then according to the response map H (Z, S _t ) Maximum value and response diagram H (Z, S _t ) The deviation (Deltax, deltay) of the center position corrects the target position (x) of the previous frame _t-1 ,y _t-1 ) Obtain the target position (x _t ,y _t ) The specific calculation mode is as follows:

second update the width and height (w) _t ,h _t ). Firstly, obtaining a scale of the target width and height change by adopting a linear interpolation mode, wherein the calculation mode is as follows:

let the update rate r=0.59. The width and height (w) of the current frame target are updated as follows _t ,h _t )，

And (3) finishing the tracking process of the current frame target, taking the next frame as the current frame, and jumping to the step (3) to track the next frame. As shown in fig. 3, the target can be positioned and the target size can be determined in the current frame by means of combining the maximum value of the response diagram with the wide-high updating mode.

Claims

1. A twin network target tracking method based on different measurement criteria is characterized in that: the method specifically comprises the following steps:

step 1, selecting a feature extraction network

In, obtain depth features of templates +.>

Step 3, entering a follow-up frame image of the tracking video, and utilizing the follow-up frame imageTracking target coordinate position (x of one frame image _t-1 ,y _t-1 ) And width and height (w) _t-1 ,h _t-1 ) Search area S for obtaining current frame image _t Search area S of current frame image _t Input to the feature extraction network selected in step 1

Depth feature of search area of current frame image is obtained>

The specific process of the step 4 is as follows:

first, template frame depth features

Search area +.>

There will always be one and template frame depth feature +.>

Areas of equal size->

Wherein i represents +.>

At->

Subscript of upper horizontal movement, j represents +.>

At->

Upper vertically shifted subscript, assume +.>

At every time at +.>

wherein i and j are integers; w, H the feature extraction networks selected for step 1 respectively

The width and height obtained in (a); w and h are depth features respectively>

Is the width and height of (2);

due to

and />

The dimensions of (2) are w.times.h.times.C, will now be +.>

and />

One-dimensional vector flattened to (wXhXC) x 1>

and />

and />

The cosine similarity of (2) is as follows:

finally, a response diagram h obtained by cosine similarity measurement _c (Z,S _t ) Is that

Performing similarity measurement to obtain a response graph h obtained by using Euclidean distance measurement mode _d (Z,S _t )；

The specific steps of the step 5 are as follows:

first template frame depth feature

Search area +.>

And the current frame search area->

Neutron block->

Is measured as follows: />

Step 6, for the response graph h obtained in the step 4 _c (Z,S _t ) And the response diagram h obtained in the step 5 _d (Z,S _t ) Weighting and fusing to obtain final response diagram h (Z, S) _t ) And fusingResponse plot h (Z, S) _t ) To a fixed size, the response map h (Z, S _t ) The maximum point is the position of the tracking target, and then the width and height of the target are updated in a linear interpolation mode, so that the tracking of the target of the current frame is realized.

2. The twin network target tracking method based on different metric criteria of claim 1, wherein: the specific process of the step 1 is as follows:

3. A twin network target tracking method based on different metric criteria as defined in claim 2 wherein: the specific process of the step 2 is as follows: the specific process of the step 2 is as follows:

wherein p= (m+n)/4 represents the filling amount;

The calculation is carried out by adopting the following formula (2):

wherein ,

4. A twin network target tracking method based on different metric criteria as defined in claim 3, wherein: the specific process of the step 3 is as follows:

wherein p_t-1 ＝(w _t-1 +n _t-1 ) 4, representing the filling amount;

wherein ,

5. The twin network target tracking method based on different metric criteria of claim 1, wherein: the specific process of the step 6 is as follows:

step 6.1, for response graph h obtained by cosine similarity _d (Z,S _t ) And Euclidean distance obtained response graph h _d (Z,S _t ) The following weighted fusion is performed to obtain a fused response diagram h (Z, S) _t )：

h(Z,S _t )＝λh _c (Z,S _t )+(1-λ)h _d (Z,S _t ) (6)；

wherein r is the update rate;