CN112991385A

CN112991385A - Twin network target tracking method based on different measurement criteria

Info

Publication number: CN112991385A
Application number: CN202110171718.7A
Authority: CN
Inventors: 刘龙; 付志豪; 史思琦
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2021-06-18
Anticipated expiration: 2041-02-08
Also published as: CN112991385B

Abstract

The invention discloses a twin network target tracking method based on different measurement criteria, which comprises the following specific steps: step 1, selecting a feature extraction network; and 2, acquiring a tracking video, and manually selecting an area where the target is located on the first frame of the video. Obtaining the depth characteristic of the template; step 3, entering a subsequent frame, and obtaining the depth characteristic of the current frame search area by utilizing the coordinate position and the width and the height of the tracking target of the previous frame; step 4, performing similarity measurement on the template depth feature and the current frame search area depth feature by adopting cosine similarity to obtain a response map; step 5, performing similarity measurement on the template depth characteristic and the current frame search area depth characteristic by adopting the Euclidean distance to obtain a response map obtained by using an Euclidean distance measurement mode; and 6, performing weighted fusion on the two response graphs, and determining the position of the target according to the maximum value on the response graphs. The method solves the problems that a target tracking method based on a twin network is easily interfered by similar objects and is not robust to the appearance of a target.

Description

Twin network target tracking method based on different measurement criteria

Technical Field

The invention belongs to the technical field of video single-target tracking, and relates to a twin network target tracking method based on different measurement criteria.

Background

In the field of computer vision, target tracking has been an important topic and research direction at present. The target tracking work is to estimate the position, shape or occupied area of a tracked target in a continuous video image sequence and determine the motion information of the target such as the motion speed, direction, track and the like. The target tracking has important research significance and wide application prospect, and is mainly applied to the aspects of video monitoring, man-machine interaction, intelligent transportation, autonomous navigation and the like.

The target tracking method based on the twin network is the mainstream of the current target tracking method. The main idea of the twin network structure is to find a function that can map the input picture to a high dimensional space, so that the simple distance in the target space approximates the "semantic" distance of the input space. More precisely, the structure tries to find a set of parameters such that the similarity measure is small when belonging to the same category and large when belonging to different categories. The network was mainly used for metric learning to calculate the similarity of information such as images, sounds, texts, etc. Especially in the field of face recognition. In target tracking, the twin network usually adopts the target area of the first frame as a template, and continuously performs similarity measurement with the template in subsequent frames to obtain the target position and size. The existing twin network-based target tracking method generally adopts cosine similarity as measurement, and the measurement mode is too single and cannot well deal with the situation that the appearance of a target is greatly changed or similar target interference exists.

Disclosure of Invention

The invention aims to provide a twin network target tracking method based on different measurement criteria, and solves the problem that the conventional twin network target tracking method is easily interfered by similar targets or is not robust to target appearance change to cause tracking failure.

The invention adopts the technical scheme that a twin network target tracking method based on different measurement criteria specifically comprises the following steps:

step 1, selecting a feature extraction network

Step 2, acquiring a tracking video, manually selecting a region where a target is located on a first frame of the video, and inputting a target region Z of the first frame as a template into the feature extraction network selected in the step 1

In the method, depth characteristics of the template are obtained

Step 3, entering the subsequent frame image of the tracking video, and utilizing the tracking target coordinate position (x) of the previous frame image_t-1,y_t-1) And width and height (w)_t-1,h_t-1) Obtaining a search region S of the current frame image_tSearching area S of the current frame image_tInput to the feature extraction network selected in step 1

Obtaining the depth characteristic of the current frame image search area

Step 4, the depth characteristics of the template obtained in the step 2 are subjected to cosine similarity

And the depth characteristic of the current frame search area obtained in the step 3

Similarity measurement is carried out to obtain a response graph h obtained by using a cosine similarity measurement mode_c(Z,S_t)；

Step 5, the template depth characteristics obtained in the step 2 are subjected to Euclidean distance

And the current frame search area obtained in step 3Depth feature

Performing similarity measurement to obtain a response graph h obtained by using an Euclidean distance measurement mode_d(Z,S_t)。

Step 6, response graph h obtained in the step 4_c(Z,S_t) And the response graph h obtained in the step 5_d(Z,S_t) Performing weighted fusion to obtain the final response graph h (Z, S)_t) And the fused response graph h (Z, S)_t) Interpolation to fixed size, response plot h (Z, S)_t) The maximum value point is the position of the tracking target, and the width and the height of the target are updated by adopting a linear interpolation mode, so that the tracking of the current frame target is realized.

The invention is also characterized in that:

the specific process of the step 1 is as follows:

selecting an AlexNet network pre-trained on the ImageNet data set as a feature extraction network of the twin network

The specific process of the step 2 is as follows: the specific process of the step 2 is as follows:

step 2.1, acquiring a tracking video, manually selecting an area where a target is located on a first frame of the video, setting (x, y) as coordinates of a central point of the target in the first frame, setting m and n as the width and the height of the target area respectively, taking the central point (x, y) of the target in the first frame as a center, and intercepting a square area with the side length of z _ sz, wherein a calculation formula of z _ sz is as follows:

wherein p ═ m + n)/4, represents the fill level;

step 2.2, if the square area with the side length of z _ sz exceeds the first frame image of the tracking video, filling the exceeding part with the mean value of the first frame image of the tracking video, wherein the mean value of the first frame image

Calculated using the following equation (2):

wherein ,

representing the pixel values of the ith channel, the jth row and the kth column in the target area of the first frame;

step 2.3, the square area with the side length of Z _ sz is scaled to b multiplied by b to obtain the target area Z of the first frame image, and the target area Z of the first frame image is input into the feature extraction network

Obtaining the depth characteristics of width, height and channel number of w multiplied by h multiplied by C

The specific process of the step 3 is as follows:

step 3.1, entering the subsequent frame image of the tracking video, and utilizing the tracking target coordinate position (x) of the t-1 frame image_t-1,y_t-1) And width and height (m)_t-1,n_t-1) A square area with the side length of x _ sx is cut from the current t frame image, and the calculation formula of the side length x _ sx is as follows:

wherein p_t-1＝(w_t-1+n_t-1) (4) represents the filling amount;

step 3.2, if the square area intercepted in the step 3.1 exceeds the current t frame image, the exceeding part is filled with the mean value of the t frame image, and the mean value of the current t frame image is calculated by adopting the following formula:

wherein ,

representing the pixel values of the ith channel, the jth row and the kth column in the target area of the subsequent frame;

step 3.3, the square area with the side length of x _ sx is scaled to a size of a multiplied by a to obtain a search area S of the current t frame image_tSearching area S of current t frame image_tFeature extraction network input to step 1

Obtaining the depth characteristics of the current frame search area with width, height and channel number of W multiplied by H multiplied by C

The specific process of the step 4 is as follows:

first depth features of template frames

Searching for a region in a current frame

Performing sliding operation, and searching the region in the current frame every time sliding operation is performed

There will always be one sum template frame depth feature

Areas of the same size

Wherein i represents

In that

Subscript of upper horizontal shift, j represents

In that

Up a vertically moving subscript, assuming

Each time at

The up shift s is a step size, then i and j will take values within the following interval:

wherein i and j are integers;

due to the fact that

And

all of which are w x h x C, will now be

And

one-dimensional vector flattened to (w × h × C) × 1

And

using cosine similarity measure of the two vectorsDegree of similarity, solving

And

the cosine similarity of (c) is as follows:

finally, a response graph h obtained by a cosine similarity measurement mode_c(Z,S_t) Is composed of

The specific steps of the step 5 are as follows:

first template frame depth features

Searching for a region in a current frame

Performing sliding operation, and measuring the depth characteristic of the template frame by using Euclidean distance in the sliding operation process

And a current frame search area

Chinese medicine block

The measure of similarity is as follows:

finally obtaining a response graph h through an Euclidean distance measurement mode_d(Z,S_t) Is composed of

The specific process of the step 6 is as follows:

step 6.1, response graph h obtained by cosine similarity_d(Z,S_t) Response graph h obtained from Euclidean distance_d(Z,S_t) Weighted fusion is performed as shown below to obtain a fused response graph h (Z, S)_t)：

h(Z,S_t)＝λh_c(Z,S_t)+(1-λ)h_d(Z,S_t) (6)；

Step 6.2, the fused response graph h (Z, S) is processed by a bicubic interpolation mode_t) Interpolated to a response plot H (Z, S) of size λ ×_t) Response graph H (Z, S)_t) The maximum point is the position of the target, and then the response diagram H (Z, S) is used_t) Maximum and response plot H (Z, S) of_t) The deviation (Deltax, Deltay) of the center position corrects the target position (x) of the previous frame_t-1,y_t-1) Obtaining the target position (x) of the current frame_t,y_t) The specific calculation method is as follows:

step 6.3, updating the width and height (w) of the current frame target_t,h_t) Firstly, a linear interpolation mode is adopted to obtain a scale with target width and high variation, and the calculation mode is as follows:

wherein r is the update rate;

step 6.4, updating the width and height (w) of the current t frame target in a mode of multiplying the changed scale_t,h_t) The concrete formula is as follows:

and 6.5, finishing the tracking process of the target image of the current frame, taking the next frame as the current frame and skipping to the step 3 to track the subsequent frame.

The invention has the following beneficial effects:

1. the invention leads the network to better cope with the situation of similar target change by additionally introducing the Euclidean distance measurement mode, and effectively solves the problem of tracking failure caused by the appearance of similar targets

2. The response graph obtained by Euclidean distance measurement and the response graph obtained by cosine similarity measurement are fused, and the advantages of the two measurement modes are fully utilized, so that the network is more robust to the appearance change of the target, and the tracking drift problem caused by the appearance change of the target is effectively solved.

Drawings

FIG. 1 is a network structure diagram of a twin network target tracking method based on different measurement criteria according to the present invention;

FIG. 2 is a schematic diagram of similarity measurement performed in a twin network target tracking method based on different measurement criteria according to the present invention;

FIG. 3 is a process diagram of an embodiment 1 of the twin network target tracking method based on different metric criteria.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The twin network target tracking method based on different measurement criteria, as shown in fig. 1, comprises the following specific steps:

step 1, selecting a feature extraction network of a twin network

Step 1 specifically comprises the step of selecting an AlexNet network pre-trained on an ImageNet data set as a feature extraction network of a twin network

In the method, depth characteristics of the template are obtained

wherein p ═ m + n)/4, represents the fill level;

Calculated using the following equation (2):

wherein ,

step 2.3, mixingThe square area with the side length of Z _ sz is scaled to b multiplied by b to obtain the target area Z of the first frame image, and the target area Z of the first frame image is input into the feature extraction network

Obtaining the depth characteristic of the current frame image search area

The specific process of the step 3 is as follows:

wherein p_t-1＝(w_t-1+n_t-1) (4) represents the filling amount;

wherein ,

step 3.3, the square area with the side length of x _ sx is scaled to a size of a multiplied by a to obtain a search area S of the current t frame image_tSearching area S of current t frame image_tInput to the feature extraction network selected in step 1

Step 4, adopting cosine similarity to match the depth characteristic of the template

And current frame search region depth features

And (5) performing similarity measurement. First depth features of template frames

Searching for a region in a current frame

A sliding operation is performed as shown in fig. 2. Searching the area of the current frame every time sliding operation is performed

There will always be one sum template frame depth feature

Areas of the same size. Defining the region as

Wherein i represents

In that

Subscript of upper horizontal shift, j represents

In that

Up the vertically shifted subscript. Suppose that

Each time at

due to the fact that

And

all of which are w x h x C, will now be

And

is flattened into (w)One-dimensional vector of x h x C) x 1

And

the cosine similarity measures the degree of similarity of the two vectors. Solving for

And

the cosine similarity of (c) is as follows:

A collection of (a). h is_c(Z,S_t) The expression of (c) can be written as follows:

denotes the cross-correlation metric operation.

Step 5, the depth characteristics of the template frame are subjected to Euclidean distance

And current frame search region depth features

And (4) performing similarity measurement, and adopting a method similar to the cosine similarity measurement in the step 4. First template frame depth features

Searching for a region in a current frame

A sliding operation is performed as shown in fig. 2. In the process of sliding operation, the Euclidean distance is used for measuring the depth characteristic of the template frame

And a current frame search area

Chinese medicine block

The measure of similarity is as follows:

finally obtaining a response graph h through an Euclidean distance measurement mode_d(Z,S_t) Is that

A collection of (a). h is_d(Z,S_t) The expression of (c) can be written as follows:

an "-" indicates an Oldham distance metric operation.

Step 6, response graph h obtained by cosine similarity_d(Z,S_t) Response graph h obtained from Euclidean distance_d(Z,S_t) Performing weighted fusion as shown below to obtain a fused response graph:

h(Z,S_t)＝λh_c(Z,S_t)+(1-λ)h_d(Z,S_t) (10)；

then the fused response graph h (Z, S) is processed by a bicubic interpolation mode_t) Interpolated to a response plot H (Z, S) of size λ ×_t). Response graph H (Z, S)_t) The maximum point is the position of the target, and then the response diagram H (Z, S) is used_t) Maximum and response plot H (Z, S) of_t) The deviation (Deltax, Deltay) of the center position corrects the target position (x) of the previous frame_t-1,y_t-1) Obtaining the target position (x) of the current frame_t,y_t) The specific calculation method is as follows:

secondly, the width and height (w) of the current frame target is updated_t,h_t) Firstly, a linear interpolation mode is adopted to obtain a scale with target width and high variation, and the calculation mode is as follows:

where r is the update rate. And then updating the width and height (w) of the target of the current frame in a mode of multiplying the changed scale_t,h_t) The concrete formula is as follows:

and (3) ending the tracking process of the target of the current frame, taking the next frame as the current frame and skipping to the step (3) to track the subsequent frame.

Example 1

Step 1, selecting an AlexNet network pre-trained on an ImageNet data set as a feature extraction network of a twin network

Table 1 feature extraction network parameter table

Feature extraction network

As shown in table 1, consists of a total of 5 convolutional layers and 2 pooling layers. The first two convolutional layers are followed by two max pooling layers. Random deactivation layers and RELU nonlinear activation functions are added after the first 4 convolutional layers

And 2, acquiring a tracking video, and manually selecting an area where the target is located on the first frame of the video. Let (x, y) be the coordinates of the center point of the object in the first frame, and m and n be the width and height of the object region, respectively. Taking the center point (x, y) of the target in the first frame as the center, cutting out a square area with the side length of z _ sz, wherein the calculation formula of z _ sz is as follows:

wherein p ═ m + n)/4 represents the filling amount. If the square area is beyond the image, the excess portion is filled with the image mean. The square area of Z _ sz is then scaled to 127 x 127, resulting in the target area Z of the first frame. Finally, inputting the target area Z of the first frame into the feature extraction network

Resulting in a depth feature with dimensions of 6 x 256

Step 3, entering the subsequent frame, and utilizing the tracking target coordinate position (x) of the previous frame_t-1,y_t-1) And width and height (m)_t-1,n_t-1) And (3) cutting a square area with the side length of x _ sx, wherein the calculation formula of x _ sx is as follows:

wherein p_t-1＝(w_t-1+n_t-1) And/4, the filling amount. If the square area is beyond the image, the excess portion is filled with the image mean. Then, the square area of x _ sx is scaled to 255 × 255 size to obtain the search area S of the current frame_tAnd inputting the depth feature of the current frame search area into a feature extraction network

The size is 22 × 22 × 256;

And current frame search region depth features

Searching for a region in a current frame

There will always be one sum template frame depth feature

Areas of the same size

Wherein i represents

In that

Subscript of upper horizontal shift, j represents

In that

Up the vertically shifted subscript. Suppose that

Each time at

If the up shift s is 1 step, i and j will take values within the following interval:

i∈[1,2,...,17]j∈[1,2,...,17]

due to the fact that

And

are all 6X 256, will now be

And

one-dimensional vector flattened to (6 × 6 × 256) × 1

And

And

the cosine similarity of (c) is as follows:

denotes the cross-correlation metric operation.

And current frame search region depth features

Searching for a region in a current frame

And a current frame search area

Chinese medicine block

The measure of similarity is as follows:

an "-" indicates an Oldham distance metric operation.

Step 6, obtaining a response graph h by cosine similarity_d(Z,S_t) The response plots obtained with the euclidean distance were all 17 × 17 × 1 in size. To h_d(Z,S_t) and h_d(Z,S_t) And performing weighted fusion as shown in the following by taking the weight lambda as 0.5 to obtain a fused response graph:

then the fused response graph h (Z, S) is processed by a bicubic interpolation mode_t) Interpolated to a response plot H (Z, S) of size 272X 272_t). Response graph H (Z, S)_t) The maximum point is the position of the target, and then the response diagram H (Z, S) is used_t) Maximum and response plot H (Z, S) of_t) The deviation (Deltax, Deltay) of the center position corrects the target position (x) of the previous frame_t-1,y_t-1) Obtaining the target position (x) of the current frame_t,y_t) The specific calculation method is as follows:

secondly, the width and height (w) of the current frame target is updated_t,h_t). Firstly, obtaining a target width and height variable scale by adopting a linear interpolation mode, wherein the calculation mode is as follows:

the update rate r is made 0.59. The width and height (w) of the current frame target are updated as follows_t,h_t)，

And (3) ending the tracking process of the target of the current frame, taking the next frame as the current frame and skipping to the step (3) to track the subsequent frame. As shown in fig. 3, by using the maximum value of the fused response map and the width and height updating method, the target can be located and the size of the target can be determined in the current frame.

Claims

1. A twin network target tracking method based on different measurement criteria is characterized in that: the method specifically comprises the following steps:

step 1, selecting a feature extraction network

In the method, depth characteristics of the template are obtained

Obtaining the depth characteristic of the current frame image search area

Step 6, response graph h obtained in the step 4_c(Z,S_t) And the response chart obtained in the step 5h_d(Z,S_t) Performing weighted fusion to obtain the final response graph h (Z, S)_t) And the fused response graph h (Z, S)_t) Interpolation to fixed size, response plot h (Z, S)_t) The maximum value point is the position of the tracking target, and the width and the height of the target are updated by adopting a linear interpolation mode, so that the tracking of the current frame target is realized.

2. The twin network target tracking method based on different metric criteria as claimed in claim 1, wherein: the specific process of the step 1 is as follows:

3. The twin network target tracking method based on different metric criteria as claimed in claim 2, wherein: the specific process of the step 2 is as follows: the specific process of the step 2 is as follows:

wherein p ═ m + n)/4, represents the fill level;

Calculated using the following equation (2):

wherein ,

4. The twin network target tracking method based on different metric criteria as claimed in claim 3, wherein: the specific process of the step 3 is as follows:

wherein p_t-1＝(w_t-1+n_t-1) (4) represents the filling amount;

wherein ,

5. The twin network target tracking method based on different metric criteria as claimed in claim 4, wherein: the specific process of the step 4 is as follows:

first depth features of template frames

Searching for a region in a current frame

There will always be one sum template frameDepth feature

Areas of the same size

Wherein i represents

In that

Subscript of upper horizontal shift, j represents

In that

Up a vertically moving subscript, assuming

Each time at

wherein i and j are integers;

due to the fact that

And

all of which are w x h x C, will now be

And

one-dimensional vector flattened to (w × h × C) × 1

And

measuring the similarity of the two vectors by cosine similarity, and solving

And

the cosine similarity of (c) is as follows:

6. The twin network target tracking method based on different metric criteria as claimed in claim 5, wherein: the specific steps of the step 5 are as follows:

first template frame depth features

Searching for a region in a current frame

And a current frame search area

Chinese medicine block

The measure of similarity is as follows:

7. The twin network target tracking method based on different metric criteria as claimed in claim 6, wherein: the specific process of the step 6 is as follows:

h(Z,S_t)＝λh_c(Z,S_t)+(1-λ)h_d(Z,S_t) (6)；

wherein r is the update rate;