CN112991385B - Twin network target tracking method based on different measurement criteria - Google Patents

Twin network target tracking method based on different measurement criteria

Info

Publication number
CN112991385B
CN112991385B (application CN202110171718.7A)
Authority
CN
China
Prior art keywords
target
frame
tracking
frame image
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110171718.7A
Other languages
Chinese (zh)
Other versions
CN112991385A (en)
Inventor
刘龙
付志豪
史思琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110171718.7A priority Critical patent/CN112991385B/en
Publication of CN112991385A publication Critical patent/CN112991385A/en
Application granted granted Critical
Publication of CN112991385B publication Critical patent/CN112991385B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention discloses a twin network target tracking method based on different measurement criteria, which comprises the following specific steps: step 1, selecting a feature extraction network; step 2, acquiring a tracking video, manually selecting the area where the target is located on the first frame of the video, and obtaining the depth features of the template; step 3, entering the subsequent frames, and obtaining the depth features of the current frame search area by using the coordinate position and the width and height of the tracking target in the previous frame; step 4, performing similarity measurement between the template depth features and the depth features of the current frame search area with cosine similarity to obtain a response map; step 5, performing similarity measurement between the template depth features and the depth features of the current frame search area with the Euclidean distance to obtain the response map given by the Euclidean distance measure; and step 6, carrying out weighted fusion of the two response maps and determining the position of the target according to the maximum value on the fused response map. The method solves the problems that target tracking methods based on a twin network are easily disturbed by similar objects and are not robust to changes in the appearance of the target.

Description

Twin network target tracking method based on different measurement criteria
Technical Field
The invention belongs to the technical field of video single-target tracking, and relates to a twin network target tracking method based on different measurement criteria.
Background
In the field of computer vision, object tracking has long been an important topic and research direction. The task of target tracking is to estimate the position, shape or occupied area of the tracked target in a continuous video image sequence and to determine motion information such as the speed, direction and trajectory of the target. Target tracking has important research significance and broad application prospects, and is mainly applied to video surveillance, human-computer interaction, intelligent transportation, autonomous navigation and the like.
Target tracking methods based on the twin network are currently the mainstream of target tracking. The main idea of the twin network architecture is to find a function that maps the input picture into a high-dimensional space such that a simple distance in the target space approximates the "semantic" distance of the input space. More precisely, the structure tries to find a set of parameters such that the metric is small when two inputs belong to the same category and large when they belong to different categories. In the past this kind of network was mainly used for metric learning, to compute the similarity of information such as images, sounds and texts, especially in the field of face recognition. In target tracking, the twin network typically uses the first-frame target region as a template and continuously performs similarity measurements against this template in subsequent frames to obtain the target position and size. Existing twin-network-based target tracking methods generally adopt cosine similarity as the only metric; this single measurement mode cannot cope well with large changes in the appearance of the target or with interference from similar targets.
Disclosure of Invention
The invention aims to provide a twin network target tracking method based on different measurement criteria, which solves the problem that existing twin network target tracking methods are easily disturbed by similar targets, or are not robust to changes in the appearance of the target, and therefore fail to track.
The technical scheme adopted by the invention is that the twin network target tracking method based on different measurement criteria comprises the following steps:
step 1, selecting a feature extraction network φ;
step 2, acquiring a tracking video, manually selecting the region where the target is located on the first frame of the video, and inputting the target region Z of the first frame as a template into the feature extraction network φ selected in step 1 to obtain the depth feature φ(Z) of the template;
step 3, entering a subsequent frame image of the tracking video, obtaining the search area S_t of the current frame image by using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame, and inputting the search area S_t of the current frame image into the feature extraction network φ selected in step 1 to obtain the depth feature φ(S_t) of the search area of the current frame image;
step 4, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the depth feature φ(S_t) of the current frame search area obtained in step 3 by using cosine similarity, obtaining the response map h_c(Z,S_t) given by the cosine similarity measure;
step 5, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the depth feature φ(S_t) of the current frame search area obtained in step 3 by using the Euclidean distance, obtaining the response map h_d(Z,S_t) given by the Euclidean distance measure;
step 6, carrying out weighted fusion of the response map h_c(Z,S_t) obtained in step 4 and the response map h_d(Z,S_t) obtained in step 5 to obtain the final response map h(Z,S_t), interpolating the fused response map h(Z,S_t) to a fixed size, taking the maximum point of the response map as the position of the tracking target, and then updating the width and height of the target in a linear interpolation manner, thereby realizing the tracking of the target in the current frame.
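The overall flow of steps 1 to 6 can be summarised by the following minimal Python sketch. It only illustrates the control flow; the helper names (crop_template, crop_search, cosine_response, euclidean_response, locate_and_update) are hypothetical placeholders for the operations detailed below and are not names used in the patent.

# Structural sketch of the tracking loop of steps 1-6 (helpers are passed in as callables).
def track(frames, first_box, phi, crop_template, crop_search,
          cosine_response, euclidean_response, locate_and_update, lam=0.5):
    box = first_box                                   # step 2: manually selected target in frame 1
    feat_z = phi(crop_template(frames[0], box))       # template depth feature phi(Z)
    boxes = [box]
    for frame in frames[1:]:                          # step 3: subsequent frames
        feat_s = phi(crop_search(frame, box))         # search-region depth feature phi(S_t)
        h_c = cosine_response(feat_z, feat_s)         # step 4: cosine-similarity response map
        h_d = euclidean_response(feat_z, feat_s)      # step 5: Euclidean-distance response map
        h = lam * h_c + (1 - lam) * h_d               # step 6: weighted fusion
        box = locate_and_update(h, box)               # peak -> position, then width/height update
        boxes.append(box)
    return boxes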
The invention is also characterized in that:
the specific process of the step 1 is as follows:
An AlexNet network pre-trained on the ImageNet dataset is selected as the feature extraction network φ of the twin network.
The specific process of the step 2 is as follows:
Step 2.1, acquiring a tracking video and manually selecting the area where the target is located on the first frame of the video; let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target area, respectively; taking the center point (x, y) of the target in the first frame as the center, a square area with side length z_sz is cut out, where z_sz is calculated as:
z_sz = sqrt((m + 2p)(n + 2p))  (1)
wherein p = (m + n)/4 represents the filling amount;
Step 2.2, if the square region with side length z_sz exceeds the first frame image of the tracking video, the excess part is filled with the mean value of the first frame image of the tracking video; the mean value Ī of the first frame image is calculated with the following formula (2):
Ī = (1/(3·M·N)) · Σ_{i=1..3} Σ_{j=1..M} Σ_{k=1..N} I_ijk  (2)
wherein I_ijk represents the pixel value of the i-th channel, j-th row and k-th column of the first frame image, and M and N are the numbers of rows and columns of the image;
Step 2.3, the square region with side length z_sz is scaled to b×b to obtain the target region Z of the first frame image, and the target region Z of the first frame image is input into the feature extraction network φ to obtain the template depth feature φ(Z), whose width, height and number of channels are w×h×C.
The specific process of the step 3 is as follows:
Step 3.1, entering a subsequent frame image of the tracking video; using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame, a square area with side length x_sz is cut out on the current t-th frame image, where x_sz is calculated as:
x_sz = sqrt((w_{t-1} + 2·p_{t-1})(h_{t-1} + 2·p_{t-1}))
wherein p_{t-1} = (w_{t-1} + h_{t-1})/4 represents the filling amount;
Step 3.2, if the square area cut out in step 3.1 exceeds the current t-th frame image, the excess part is filled with the mean value of the t-th frame image; the mean value of the current t-th frame image is calculated with the same formula as (2), where I_ijk now represents the pixel value of the i-th channel, j-th row and k-th column of the current frame image;
Step 3.3, the square area with side length x_sz is scaled to a×a to obtain the search area S_t of the current t-th frame image, and the search area S_t of the current t-th frame image is input to the feature extraction network φ selected in step 1 to obtain the depth feature φ(S_t) of the current frame search area, whose width, height and number of channels are W×H×C.
The specific process of the step 4 is as follows:
First, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t). At each sliding position there is a region of the current frame search area feature φ(S_t) that has the same size as the template frame depth feature φ(Z); this region is denoted φ(S_t)^{ij}, where i is the index of the horizontal displacement of φ(Z) on φ(S_t) and j is the index of the vertical displacement. Let φ(Z) move on φ(S_t) with a step size of s at a time; then i and j take values in the intervals
i ∈ [1, (W − w)/s + 1],  j ∈ [1, (H − h)/s + 1]
wherein i and j are integers;
since φ(Z) and φ(S_t)^{ij} both have size w×h×C, φ(Z) and φ(S_t)^{ij} are flattened into one-dimensional vectors of size (w×h×C)×1, and the similarity of the two vectors is measured with the cosine similarity; the cosine similarity of φ(Z) and φ(S_t)^{ij} is solved as follows:
h_c^{ij}(Z,S_t) = <φ(Z), φ(S_t)^{ij}> / (‖φ(Z)‖·‖φ(S_t)^{ij}‖)
where <·,·> denotes the inner product of the flattened vectors and ‖·‖ their Euclidean norm;
the response map h_c(Z,S_t) finally obtained with the cosine similarity measure is the set of all h_c^{ij}(Z,S_t).
The specific steps of the step 5 are as follows:
First, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t); during the sliding operation, the Euclidean distance is used to measure the similarity between the template frame depth feature φ(Z) and the sub-block φ(S_t)^{ij} of the current frame search area φ(S_t), as follows:
h_d^{ij}(Z,S_t) = ‖φ(Z) − φ(S_t)^{ij}‖_2
computed over the flattened vectors;
finally, the response map h_d(Z,S_t) obtained with the Euclidean distance measure is the set of all h_d^{ij}(Z,S_t).
The specific process of the step 6 is as follows:
Step 6.1, the response map h_c(Z,S_t) obtained with the cosine similarity and the response map h_d(Z,S_t) obtained with the Euclidean distance are weighted and fused as follows to obtain the fused response map h(Z,S_t):
h(Z,S_t) = λ·h_c(Z,S_t) + (1 − λ)·h_d(Z,S_t)  (6);
Step 6.2, the fused response map h(Z,S_t) is interpolated into a response map H(Z,S_t) of a fixed size; the maximum point on the response map H(Z,S_t) is the position of the target, and then, according to the deviation (Δx, Δy) between the maximum point of H(Z,S_t) and the center of H(Z,S_t), the target position (x_{t-1}, y_{t-1}) of the previous frame is corrected to obtain the target position (x_t, y_t) of the current frame, which is specifically calculated as:
x_t = x_{t-1} + Δx,  y_t = y_{t-1} + Δy
Step 6.3, the width and height (w_t, h_t) of the current frame target are updated: first, the scale of the change of the target width and height is obtained in a linear interpolation manner, wherein r is the update rate;
Step 6.4, the width and height (w_t, h_t) of the current frame target are then updated by multiplying the width and height of the previous frame by this scale:
w_t = scale_t · w_{t-1},  h_t = scale_t · h_{t-1}
wherein scale_t is the interpolated scale obtained in step 6.3;
Step 6.5, the tracking process of the current frame target image is finished; the next frame is taken as the current frame, and the method jumps to step 3 to track the next frame.
The beneficial effects of the invention are as follows:
1. The invention additionally introduces the Euclidean distance measure, so that the network can better cope with interference from similar targets, effectively alleviating the tracking failures caused by the appearance of similar objects.
2. By fusing the response map obtained with the Euclidean distance measure and the response map obtained with the cosine similarity measure, the advantages of the two measurement modes are fully utilized, so that the network is more robust to changes in the appearance of the target, effectively alleviating the tracking drift caused by such appearance changes.
Drawings
FIG. 1 is a block diagram of a network architecture of the twin network target tracking method of the present invention based on different metric criteria;
FIG. 2 is a schematic diagram of similarity measurement in the twin network target tracking method based on different measurement criteria of the present invention;
FIG. 3 is a process schematic diagram of embodiment 1 of the twin network target tracking method of the present invention based on different metric criteria.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a twin network target tracking method based on different measurement criteria, which is shown in figure 1 and comprises the following specific steps:
Step 1, selecting the feature extraction network φ of the twin network.
Step 1 specifically selects an AlexNet network pre-trained on the ImageNet dataset as the feature extraction network φ of the twin network.
Step 2, acquiring a tracking video, manually selecting the region where the target is located on the first frame of the video, and inputting the target region Z of the first frame as a template into the feature extraction network φ selected in step 1 to obtain the depth feature φ(Z) of the template.
Step 2.1, acquiring a tracking video and manually selecting the area where the target is located on the first frame of the video; let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target area, respectively; taking the center point (x, y) of the target in the first frame as the center, a square area with side length z_sz is cut out, where z_sz is calculated as:
z_sz = sqrt((m + 2p)(n + 2p))  (1)
wherein p = (m + n)/4 represents the filling amount;
Step 2.2, if the square region with side length z_sz exceeds the first frame image of the tracking video, the excess part is filled with the mean value of the first frame image of the tracking video; the mean value Ī of the first frame image is calculated with formula (2):
Ī = (1/(3·M·N)) · Σ_{i=1..3} Σ_{j=1..M} Σ_{k=1..N} I_ijk  (2)
wherein I_ijk represents the pixel value of the i-th channel, j-th row and k-th column of the first frame image, and M and N are the numbers of rows and columns of the image;
Step 2.3, the square region with side length z_sz is scaled to b×b to obtain the target region Z of the first frame image, and the target region Z of the first frame image is input into the feature extraction network φ to obtain the template depth feature φ(Z), whose width, height and number of channels are w×h×C.
Step 3, entering a subsequent frame image of the tracking video, obtaining the search area S_t of the current frame image by using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame, and inputting the search area S_t of the current frame image into the feature extraction network φ selected in step 1 to obtain the depth feature φ(S_t) of the search area of the current frame image.
The specific process of the step 3 is as follows:
Step 3.1, entering a subsequent frame image of the tracking video; using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame, a square area with side length x_sz is cut out on the current t-th frame image, where x_sz is calculated as:
x_sz = sqrt((w_{t-1} + 2·p_{t-1})(h_{t-1} + 2·p_{t-1}))
wherein p_{t-1} = (w_{t-1} + h_{t-1})/4 represents the filling amount;
Step 3.2, if the square area cut out in step 3.1 exceeds the current t-th frame image, the excess part is filled with the mean value of the t-th frame image; the mean value of the current t-th frame image is calculated with the same formula as (2), where I_ijk now represents the pixel value of the i-th channel, j-th row and k-th column of the current frame image;
Step 3.3, the square area with side length x_sz is scaled to a×a to obtain the search area S_t of the current t-th frame image, and the search area S_t of the current t-th frame image is input to the feature extraction network φ selected in step 1 to obtain the depth feature φ(S_t) of the current frame search area, whose width, height and number of channels are W×H×C.
Step 4, performing similarity measurement between the template depth feature φ(Z) and the current frame search region depth feature φ(S_t) with the cosine similarity. First, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t), as shown in fig. 2. At each sliding position there is a region of the current frame search area feature φ(S_t) that has the same size as the template frame depth feature φ(Z). This region is defined as φ(S_t)^{ij}, where i is the index of the horizontal displacement of φ(Z) on φ(S_t) and j is the index of the vertical displacement. Let φ(Z) move on φ(S_t) with a step size of s at a time; then i and j take values in the intervals
i ∈ [1, (W − w)/s + 1],  j ∈ [1, (H − h)/s + 1]
Since φ(Z) and φ(S_t)^{ij} both have size w×h×C, φ(Z) and φ(S_t)^{ij} are flattened into one-dimensional vectors of size (w×h×C)×1, and the degree of similarity of the two vectors is measured with the cosine similarity. The cosine similarity of φ(Z) and φ(S_t)^{ij} is solved as follows:
h_c^{ij}(Z,S_t) = <φ(Z), φ(S_t)^{ij}> / (‖φ(Z)‖·‖φ(S_t)^{ij}‖)
The response map h_c(Z,S_t) finally obtained with the cosine similarity measure is the set of all h_c^{ij}(Z,S_t). The expression of h_c(Z,S_t) can be written in the following form:
h_c(Z,S_t) = φ(Z) * φ(S_t)
where * denotes the cross-correlation metric operation, each correlation being the cosine similarity defined above.
Step 5, performing similarity measurement between the template frame depth feature φ(Z) and the current frame search region depth feature φ(S_t) with the Euclidean distance, using a procedure analogous to the cosine similarity measurement of step 4. First, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t), as shown in fig. 2. During the sliding operation, the Euclidean distance is used to measure the similarity between the template frame depth feature φ(Z) and the sub-block φ(S_t)^{ij} of the current frame search area φ(S_t), as follows:
h_d^{ij}(Z,S_t) = ‖φ(Z) − φ(S_t)^{ij}‖_2
computed over the flattened vectors. The response map h_d(Z,S_t) finally obtained with the Euclidean distance measure is the set of all h_d^{ij}(Z,S_t), i.e. the result of sliding the Euclidean distance metric operation over the search area feature.
Step 6, the response map h_c(Z,S_t) obtained with the cosine similarity and the response map h_d(Z,S_t) obtained with the Euclidean distance are weighted and fused as follows to obtain the fused response map:
h(Z,S_t) = λ·h_c(Z,S_t) + (1 − λ)·h_d(Z,S_t)  (10);
The fused response map h(Z,S_t) is then interpolated into a response map H(Z,S_t) of a fixed size. The maximum point on the response map H(Z,S_t) is the position of the target; then, according to the deviation (Δx, Δy) between the maximum point of H(Z,S_t) and the center of H(Z,S_t), the target position (x_{t-1}, y_{t-1}) of the previous frame is corrected to obtain the target position (x_t, y_t) of the current frame, which is specifically calculated as:
x_t = x_{t-1} + Δx,  y_t = y_{t-1} + Δy
Secondly, the width and height (w_t, h_t) of the current frame target are updated. First, the scale of the change of the target width and height is obtained in a linear interpolation manner, where r is the update rate. The width and height (w_t, h_t) of the current frame target are then updated by multiplying the previous width and height by this scale:
w_t = scale_t · w_{t-1},  h_t = scale_t · h_{t-1}
The tracking process of the current frame target is then finished; the next frame is taken as the current frame, and the method jumps to step 3 to track the next frame.
Example 1
Step 1, selecting an AlexNet network pre-trained on the ImageNet dataset as the feature extraction network φ of the twin network.
TABLE 1 Feature extraction network parameter table
(Table 1 lists the layer-by-layer parameters of the feature extraction network; in the published document it is reproduced as an image.)
The parameters of the feature extraction network φ are shown in Table 1; the network consists of a total of 5 convolutional layers and 2 pooling layers. Each of the first two convolutional layers is followed by a max pooling layer, and each of the first 4 convolutional layers is followed by a random inactivation (dropout) layer and a ReLU nonlinear activation function.
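For illustration, the following PyTorch sketch builds an AlexNet-style feature extraction network φ consistent with the sizes quoted in this embodiment (a 127×127 template yields a 6×6×256 feature and a 255×255 search region yields a 22×22×256 feature). The exact kernel sizes, strides, channel numbers and dropout rate of Table 1 are not recoverable from the published image, so the values below are assumptions in the style of the standard SiamFC AlexNet backbone.

# Sketch of the feature extraction network phi (assumed SiamFC-style AlexNet parameters).
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),    # conv1
            nn.Dropout2d(dropout), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # pool1
            nn.Conv2d(96, 256, kernel_size=5, stride=1),    # conv2
            nn.Dropout2d(dropout), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # pool2
            nn.Conv2d(256, 384, kernel_size=3, stride=1),   # conv3
            nn.Dropout2d(dropout), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1),   # conv4
            nn.Dropout2d(dropout), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1),   # conv5 (no activation)
        )

    def forward(self, x):
        return self.net(x)

phi = FeatureExtractor()
print(phi(torch.zeros(1, 3, 127, 127)).shape)  # torch.Size([1, 256, 6, 6])
print(phi(torch.zeros(1, 3, 255, 255)).shape)  # torch.Size([1, 256, 22, 22])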
Step 2, acquiring a tracking video and manually selecting the area where the target is located on the first frame of the video. Let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target region, respectively. Taking the center point (x, y) of the target in the first frame as the center, a square area with side length z_sz is cut out, where z_sz is calculated as:
z_sz = sqrt((m + 2p)(n + 2p))
where p = (m + n)/4 represents the filling amount. If the square area exceeds the image, the excess is filled with the image mean. The square region of side z_sz is then scaled to a size of 127 × 127, resulting in the target region Z of the first frame. Finally, the target region Z of the first frame is input into the feature extraction network φ, and the depth feature φ(Z) of size 6 × 6 × 256 is obtained.
Step 3, entering the subsequent frame, a square region with side length x_sz is cut out by using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame, where x_sz is calculated as:
x_sz = sqrt((w_{t-1} + 2·p_{t-1})(h_{t-1} + 2·p_{t-1}))
wherein p_{t-1} = (w_{t-1} + h_{t-1})/4 is the filling amount. If the square area exceeds the image, the excess is filled with the image mean. The square region of side x_sz is then scaled to 255 × 255 to obtain the search region S_t of the current frame, which is input into the feature extraction network to obtain the depth feature φ(S_t) of the current frame search region, of size 22 × 22 × 256.
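A NumPy/OpenCV sketch of the cropping used in steps 2 and 3 is given below: a square of side z_sz (or x_sz) centred on the target is cut out, pixels falling outside the image are filled with the frame mean, and the patch is resized to 127×127 for the template or 255×255 for the search region. The helper names and the rounding details are assumptions, not taken verbatim from the patent.

# Sketch of the mean-padded square crop of steps 2 and 3 (illustrative helpers).
import numpy as np
import cv2

def context_side(w, h):
    p = (w + h) / 4.0                                  # filling amount p
    return np.sqrt((w + 2 * p) * (h + 2 * p))          # z_sz / x_sz

def crop_square(frame, cx, cy, side, out_size):
    side = int(round(side))
    mean = frame.mean(axis=(0, 1))                     # per-channel frame mean
    half = side // 2
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half
    patch = np.empty((side, side, 3), dtype=frame.dtype)
    patch[:] = mean                                    # fill everything with the mean first
    fx0, fy0 = max(x0, 0), max(y0, 0)                  # intersection of the crop with the frame
    fx1, fy1 = min(x0 + side, frame.shape[1]), min(y0 + side, frame.shape[0])
    patch[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
    return cv2.resize(patch, (out_size, out_size))

# usage: template from the first frame, search region from the current frame
# z   = crop_square(frame0, x, y, context_side(m, n), 127)
# s_t = crop_square(frame_t, x_prev, y_prev, context_side(w_prev, h_prev), 255)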
step 4, adopting cosine similarity to form template depth characteristics
Figure BDA0002939110140000132
And the current frame search region depth feature +.>
Figure BDA0002939110140000133
And (5) similarity measurement is carried out. First of all the depth features of the template frame->
Figure BDA0002939110140000134
Search area +.>
Figure BDA0002939110140000135
A sliding operation as shown in fig. 2 is performed. Every time a sliding operation is made, the current frame searches for an area +.>
Figure BDA0002939110140000136
There will always be one and template frame depth feature +.>
Figure BDA0002939110140000137
Areas of equal size->
Figure BDA0002939110140000138
Wherein i represents +.>
Figure BDA0002939110140000139
At->
Figure BDA00029391101400001310
Subscript of upper horizontal movement, j represents +.>
Figure BDA00029391101400001311
At->
Figure BDA00029391101400001312
And vertically moving subscripts. Let->
Figure BDA00029391101400001313
At every time at +.>
Figure BDA00029391101400001314
Up-shift by a step of s=1, i and j will take values within the interval:
i∈[1,2,...,17]j∈[1,2,...,17]
due to
Figure BDA00029391101400001315
and />
Figure BDA00029391101400001316
The dimensions of (2) are 6X 256, and +.>
Figure BDA00029391101400001317
and />
Figure BDA00029391101400001318
One-dimensional vector flattened to (6X 256) X1>
Figure BDA00029391101400001319
and />
Figure BDA00029391101400001320
The degree of similarity of the two vectors is measured by cosine similarity. Solving for
Figure BDA00029391101400001321
and />
Figure BDA00029391101400001322
The cosine similarity of (2) is as follows: />
Figure BDA00029391101400001323
Response diagram h finally obtained by cosine similarity measurement mode c (Z,S t ) Is that
Figure BDA00029391101400001324
Is a set of (3). h is a c (Z,S t ) The expression of (c) can be written in the following form:
Figure BDA00029391101400001325
* Representing a cross-correlation metric operation.
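The sliding cosine-similarity measurement of step 4 can be sketched in NumPy as follows; with a 6×6×256 template feature, a 22×22×256 search feature and stride s = 1 it produces the 17×17 response map h_c described above. The function name cosine_response is illustrative only.

# Sketch of the cosine-similarity response map of step 4.
import numpy as np

def cosine_response(feat_z, feat_s, stride=1):
    h, w, _ = feat_z.shape                 # template feature, e.g. (6, 6, 256)
    H, W, _ = feat_s.shape                 # search feature, e.g. (22, 22, 256)
    z = feat_z.ravel()
    z_norm = np.linalg.norm(z) + 1e-12
    rows = (H - h) // stride + 1           # 17 in this embodiment
    cols = (W - w) // stride + 1           # 17 in this embodiment
    resp = np.zeros((rows, cols))
    for j in range(rows):                  # vertical displacement index j
        for i in range(cols):              # horizontal displacement index i
            sub = feat_s[j*stride:j*stride+h, i*stride:i*stride+w].ravel()
            resp[j, i] = z @ sub / (z_norm * (np.linalg.norm(sub) + 1e-12))
    return resp

# h_c = cosine_response(phi_z, phi_s)      # phi_z: (6,6,256), phi_s: (22,22,256)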
Step 5, performing similarity measurement between the template frame depth feature φ(Z) and the current frame search region depth feature φ(S_t) with the Euclidean distance, using a procedure analogous to the cosine similarity measurement of step 4. First, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t), as shown in fig. 2. During the sliding operation, the Euclidean distance is used to measure the similarity between the template frame depth feature φ(Z) and the sub-block φ(S_t)^{ij} of the current frame search area φ(S_t), as follows:
h_d^{ij}(Z,S_t) = ‖φ(Z) − φ(S_t)^{ij}‖_2
computed over the flattened vectors. The response map h_d(Z,S_t) finally obtained with the Euclidean distance measure is the set of all h_d^{ij}(Z,S_t), i.e. the result of sliding the Euclidean distance metric operation over the search area feature.
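The corresponding sketch for step 5 replaces the cosine similarity by the Euclidean distance between the flattened vectors. Whether the patent further negates or rescales the distances before fusion is not visible in the published text, so the sketch below simply returns the raw distances.

# Sketch of the Euclidean-distance response map of step 5 (raw distances).
import numpy as np

def euclidean_response(feat_z, feat_s, stride=1):
    h, w, _ = feat_z.shape
    H, W, _ = feat_s.shape
    z = feat_z.ravel()
    rows = (H - h) // stride + 1
    cols = (W - w) // stride + 1
    resp = np.zeros((rows, cols))
    for j in range(rows):
        for i in range(cols):
            sub = feat_s[j*stride:j*stride+h, i*stride:i*stride+w].ravel()
            resp[j, i] = np.linalg.norm(z - sub)   # Euclidean distance of the flattened vectors
    return resp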
Step 6, the response map h_c(Z,S_t) obtained with the cosine similarity and the response map h_d(Z,S_t) obtained with the Euclidean distance both have size 17 × 17 × 1. h_c(Z,S_t) and h_d(Z,S_t) are weighted and fused with the weight λ = 0.5 as follows, obtaining the fused response map:
h(Z,S_t) = λ·h_c(Z,S_t) + (1 − λ)·h_d(Z,S_t)
The fused response map h(Z,S_t) is then interpolated into a response map H(Z,S_t) of size 272 × 272. The maximum point on the response map H(Z,S_t) is the position of the target; then, according to the deviation (Δx, Δy) between the maximum point of H(Z,S_t) and the center of H(Z,S_t), the target position (x_{t-1}, y_{t-1}) of the previous frame is corrected to obtain the target position (x_t, y_t) of the current frame, which is specifically calculated as:
x_t = x_{t-1} + Δx,  y_t = y_{t-1} + Δy
Secondly, the width and height (w_t, h_t) of the current frame target are updated. First, the scale of the change of the target width and height is obtained in a linear interpolation manner, with the update rate r = 0.59. The width and height (w_t, h_t) of the current frame target are then updated by multiplying the previous width and height by this scale:
w_t = scale_t · w_{t-1},  h_t = scale_t · h_{t-1}
The tracking process of the current frame target is then finished; the next frame is taken as the current frame, and the method jumps to step 3 to track the next frame. As shown in fig. 3, combining the maximum value of the response map with the width-height update allows the target to be located and its size to be determined in the current frame.
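Step 6 of this embodiment can be sketched as follows, using the values quoted above (λ = 0.5, interpolation to 272×272, update rate r = 0.59). How the peak offset is mapped back to image coordinates and how the per-frame scale change is estimated are not fully recoverable from the text, so the direct pixel offset and an externally supplied scale_change are used here as assumptions.

# Sketch of step 6: fusion, peak localisation and width/height update (assumed details noted above).
import numpy as np
import cv2

def fuse_and_update(h_c, h_d, prev_box, scale_change, lam=0.5, r=0.59, up_size=272):
    x_prev, y_prev, w_prev, h_prev = prev_box
    fused = lam * h_c + (1 - lam) * h_d                         # weighted fusion of the two maps
    big = cv2.resize(fused, (up_size, up_size), interpolation=cv2.INTER_CUBIC)
    peak = np.unravel_index(np.argmax(big), big.shape)          # (row, col) of the maximum
    dy = peak[0] - (up_size - 1) / 2                            # deviation from the map centre
    dx = peak[1] - (up_size - 1) / 2
    x_t, y_t = x_prev + dx, y_prev + dy                         # corrected target position
    scale = (1 - r) * 1.0 + r * scale_change                    # linear interpolation with rate r
    return x_t, y_t, w_prev * scale, h_prev * scale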

Claims (5)

1. A twin network target tracking method based on different measurement criteria is characterized in that: the method specifically comprises the following steps:
step 1, selecting a feature extraction network φ;
step 2, acquiring a tracking video, manually selecting the region where the target is located on the first frame of the video, and inputting the target region Z of the first frame as a template into the feature extraction network φ selected in step 1 to obtain the depth feature φ(Z) of the template;
step 3, entering a subsequent frame image of the tracking video, obtaining the search area S_t of the current frame image by using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame image, and inputting the search area S_t of the current frame image into the feature extraction network φ selected in step 1 to obtain the depth feature φ(S_t) of the search area of the current frame image;
step 4, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the depth feature φ(S_t) of the current frame search area obtained in step 3 by using cosine similarity, obtaining the response map h_c(Z,S_t) given by the cosine similarity measure;
the specific process of the step 4 is as follows:
first, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t); at each sliding position there is a region φ(S_t)^{ij} of the current frame search area feature that has the same size as the template frame depth feature φ(Z), where i is the index of the horizontal displacement of φ(Z) on φ(S_t) and j is the index of the vertical displacement; assuming that φ(Z) moves on φ(S_t) with a step size of s at a time, i and j take values in the intervals
i ∈ [1, (W − w)/s + 1],  j ∈ [1, (H − h)/s + 1]
wherein i and j are integers; W and H are respectively the width and height of the feature obtained with the feature extraction network φ selected in step 1 for the search area; w and h are respectively the width and height of the depth feature φ(Z);
since φ(Z) and φ(S_t)^{ij} both have size w×h×C, φ(Z) and φ(S_t)^{ij} are flattened into one-dimensional vectors of size (w×h×C)×1, and the similarity of the two vectors is measured with the cosine similarity; the cosine similarity of φ(Z) and φ(S_t)^{ij} is solved as follows:
h_c^{ij}(Z,S_t) = <φ(Z), φ(S_t)^{ij}> / (‖φ(Z)‖·‖φ(S_t)^{ij}‖)
finally, the response map h_c(Z,S_t) obtained with the cosine similarity measure is the set of all h_c^{ij}(Z,S_t);
step 5, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the depth feature φ(S_t) of the current frame search area obtained in step 3 by using the Euclidean distance, obtaining the response map h_d(Z,S_t) given by the Euclidean distance measure;
the specific steps of the step 5 are as follows:
first, the template frame depth feature φ(Z) is slid over the current frame search area feature φ(S_t); during the sliding operation, the Euclidean distance is used to measure the similarity between the template frame depth feature φ(Z) and the sub-block φ(S_t)^{ij} of the current frame search area φ(S_t), as follows:
h_d^{ij}(Z,S_t) = ‖φ(Z) − φ(S_t)^{ij}‖_2
computed over the flattened vectors; finally, the response map h_d(Z,S_t) obtained with the Euclidean distance measure is the set of all h_d^{ij}(Z,S_t);
step 6, carrying out weighted fusion of the response map h_c(Z,S_t) obtained in step 4 and the response map h_d(Z,S_t) obtained in step 5 to obtain the final response map h(Z,S_t), interpolating the fused response map h(Z,S_t) to a fixed size, taking the maximum point of the response map as the position of the tracking target, and then updating the width and height of the target in a linear interpolation manner, thereby realizing the tracking of the target in the current frame.
2. The twin network target tracking method based on different metric criteria of claim 1, wherein: the specific process of the step 1 is as follows:
an AlexNet network pre-trained on the ImageNet dataset is selected as the feature extraction network φ of the twin network.
3. The twin network target tracking method based on different metric criteria of claim 2, wherein: the specific process of the step 2 is as follows:
step 2.1, acquiring a tracking video and manually selecting the area where the target is located on the first frame of the video; let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target area, respectively; taking the center point (x, y) of the target in the first frame as the center, a square area with side length z_sz is cut out, where z_sz is calculated as:
z_sz = sqrt((m + 2p)(n + 2p))  (1)
wherein p = (m + n)/4 represents the filling amount;
step 2.2, if the square region with side length z_sz exceeds the first frame image of the tracking video, the excess part is filled with the mean value of the first frame image of the tracking video; the mean value Ī of the first frame image is calculated with the following formula (2):
Ī = (1/(3·M·N)) · Σ_{i=1..3} Σ_{j=1..M} Σ_{k=1..N} I_ijk  (2)
wherein I_ijk represents the pixel value of the i-th channel, j-th row and k-th column of the first frame image, and M and N are the numbers of rows and columns of the image;
step 2.3, the square region with side length z_sz is scaled to b×b to obtain the target region Z of the first frame image, and the target region Z of the first frame image is input into the feature extraction network φ to obtain the template depth feature φ(Z), whose width, height and number of channels are w×h×C.
4. The twin network target tracking method based on different metric criteria of claim 3, wherein: the specific process of the step 3 is as follows:
step 3.1, entering a subsequent frame image of the tracking video; using the coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the tracking target in the previous frame, a square area with side length x_sz is cut out on the current t-th frame image, where x_sz is calculated as:
x_sz = sqrt((w_{t-1} + 2·p_{t-1})(h_{t-1} + 2·p_{t-1}))
wherein p_{t-1} = (w_{t-1} + h_{t-1})/4 represents the filling amount;
step 3.2, if the square area cut out in step 3.1 exceeds the current t-th frame image, the excess part is filled with the mean value of the t-th frame image; the mean value of the current t-th frame image is calculated with the same formula as (2), where I_ijk now represents the pixel value of the i-th channel, j-th row and k-th column of the current frame image;
step 3.3, the square area with side length x_sz is scaled to a×a to obtain the search area S_t of the current t-th frame image, and the search area S_t of the current t-th frame image is input to the feature extraction network φ selected in step 1 to obtain the depth feature φ(S_t) of the current frame search area, whose width, height and number of channels are W×H×C.
5. The twin network target tracking method based on different metric criteria of claim 1, wherein: the specific process of the step 6 is as follows:
step 6.1, the response map h_c(Z,S_t) obtained with the cosine similarity and the response map h_d(Z,S_t) obtained with the Euclidean distance are weighted and fused as follows to obtain the fused response map h(Z,S_t):
h(Z,S_t) = λ·h_c(Z,S_t) + (1 − λ)·h_d(Z,S_t)  (6);
step 6.2, the fused response map h(Z,S_t) is interpolated into a response map H(Z,S_t) of a fixed size; the maximum point on the response map H(Z,S_t) is the position of the target, and then, according to the deviation (Δx, Δy) between the maximum point of H(Z,S_t) and the center of H(Z,S_t), the target position (x_{t-1}, y_{t-1}) of the previous frame is corrected to obtain the target position (x_t, y_t) of the current frame, which is specifically calculated as:
x_t = x_{t-1} + Δx,  y_t = y_{t-1} + Δy
step 6.3, the width and height (w_t, h_t) of the current frame target are updated: first, the scale of the change of the target width and height is obtained in a linear interpolation manner, wherein r is the update rate;
step 6.4, the width and height (w_t, h_t) of the current frame target are then updated by multiplying the width and height of the previous frame by this scale:
w_t = scale_t · w_{t-1},  h_t = scale_t · h_{t-1}
step 6.5, the tracking process of the current frame target image is finished; the next frame is taken as the current frame, and the method jumps to step 3 to track the next frame.
CN202110171718.7A 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria Active CN112991385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110171718.7A CN112991385B (en) 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110171718.7A CN112991385B (en) 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria

Publications (2)

Publication Number Publication Date
CN112991385A CN112991385A (en) 2021-06-18
CN112991385B true CN112991385B (en) 2023-04-28

Family

ID=76347410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110171718.7A Active CN112991385B (en) 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria

Country Status (1)

Country Link
CN (1) CN112991385B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379806B (en) * 2021-08-13 2021-11-09 南昌工程学院 Target tracking method and system based on learnable sparse conversion attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128254A1 (en) * 2017-12-26 2019-07-04 浙江宇视科技有限公司 Image analysis method and apparatus, and electronic device and readable storage medium
CN111161317A (en) * 2019-12-30 2020-05-15 北京工业大学 Single-target tracking method based on multiple networks
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111639551A (en) * 2020-05-12 2020-09-08 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN111951304A (en) * 2020-09-03 2020-11-17 湖南人文科技学院 Target tracking method, device and equipment based on mutual supervision twin network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128254A1 (en) * 2017-12-26 2019-07-04 浙江宇视科技有限公司 Image analysis method and apparatus, and electronic device and readable storage medium
CN111161317A (en) * 2019-12-30 2020-05-15 北京工业大学 Single-target tracking method based on multiple networks
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111639551A (en) * 2020-05-12 2020-09-08 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN111951304A (en) * 2020-09-03 2020-11-17 湖南人文科技学院 Target tracking method, device and equipment based on mutual supervision twin network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Visual Tracking Based on Siamese Network of Fused Score Map";L. Xu等;《IEEE Access》;20191016;第7卷;全文 *
"基于孪生网络和多距离融合的行人再识别";秦晓飞等;《光学仪器》;20200229;第42卷(第01期);全文 *

Also Published As

Publication number Publication date
CN112991385A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN110276785B (en) Anti-shielding infrared target tracking method
CN107818571A (en) Ship automatic tracking method and system based on deep learning network and average drifting
CN107563494A (en) A kind of the first visual angle Fingertip Detection based on convolutional neural networks and thermal map
CN111781608B (en) Moving target detection method and system based on FMCW laser radar
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN111275740B (en) Satellite video target tracking method based on high-resolution twin network
CN112183675B (en) Tracking method for low-resolution target based on twin network
CN112991385B (en) Twin network target tracking method based on different measurement criteria
CN106408596A (en) Edge-based local stereo matching method
CN107945207A (en) A kind of real-time object tracking method based on video interframe low-rank related information uniformity
CN111998862A (en) Dense binocular SLAM method based on BNN
CN111027586A (en) Target tracking method based on novel response map fusion
CN111123953B (en) Particle-based mobile robot group under artificial intelligence big data and control method thereof
CN115908539A (en) Target volume automatic measurement method and device and storage medium
CN113487631B (en) LEGO-LOAM-based adjustable large-angle detection sensing and control method
CN111127510B (en) Target object position prediction method and device
CN112446353B (en) Video image trace line detection method based on depth convolution neural network
CN107038710B (en) It is a kind of using paper as the Vision Tracking of target
CN113064422A (en) Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN116777956A (en) Moving target screening method based on multi-scale track management
CN116659500A (en) Mobile robot positioning method and system based on laser radar scanning information
CN106408600A (en) Image registration method applied to solar high-resolution image
CN116469001A (en) Remote sensing image-oriented construction method of rotating frame target detection model
CN114612518A (en) Twin network target tracking method based on historical track information and fine-grained matching
CN116429116A (en) Robot positioning method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant