CN112991385A - Twin network target tracking method based on different measurement criteria - Google Patents

Twin network target tracking method based on different measurement criteria

Info

Publication number
CN112991385A
Authority
CN
China
Prior art keywords
target
frame
tracking
frame image
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110171718.7A
Other languages
Chinese (zh)
Other versions
CN112991385B (en)
Inventor
刘龙
付志豪
史思琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110171718.7A priority Critical patent/CN112991385B/en
Publication of CN112991385A publication Critical patent/CN112991385A/en
Application granted granted Critical
Publication of CN112991385B publication Critical patent/CN112991385B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention discloses a twin network target tracking method based on different measurement criteria, which comprises the following specific steps: step 1, selecting a feature extraction network; step 2, acquiring a tracking video, manually selecting the area where the target is located on the first frame of the video, and obtaining the depth feature of the template; step 3, entering a subsequent frame and obtaining the depth feature of the current-frame search area by using the coordinate position and the width and height of the tracking target in the previous frame; step 4, performing similarity measurement between the template depth feature and the current-frame search-area depth feature using cosine similarity to obtain a response map; step 5, performing similarity measurement between the template depth feature and the current-frame search-area depth feature using the Euclidean distance to obtain a response map based on the Euclidean-distance measurement; and step 6, performing weighted fusion of the two response maps and determining the position of the target from the maximum value on the fused response map. The method addresses the problems that target tracking methods based on a twin network are easily disturbed by similar objects and are not robust to changes in the target's appearance.

Description

Twin network target tracking method based on different measurement criteria
Technical Field
The invention belongs to the technical field of video single-target tracking, and relates to a twin network target tracking method based on different measurement criteria.
Background
In the field of computer vision, target tracking has long been an important topic and research direction. The task of target tracking is to estimate the position, shape or occupied area of a tracked target in a continuous video image sequence and to determine motion information of the target such as its speed, direction and trajectory. Target tracking has important research significance and broad application prospects, and is mainly applied to video monitoring, human-computer interaction, intelligent transportation, autonomous navigation and the like.
The target tracking method based on the twin network is currently the mainstream approach to target tracking. The main idea of the twin network structure is to find a function that maps the input picture to a high-dimensional space, so that a simple distance in the target space approximates the "semantic" distance of the input space. More precisely, the structure tries to find a set of parameters such that the similarity measure is small when two inputs belong to the same category and large when they belong to different categories. Such networks were originally used mainly for metric learning, to calculate the similarity of information such as images, sounds and texts, especially in the field of face recognition. In target tracking, the twin network usually adopts the target area of the first frame as a template and continuously performs similarity measurement against this template in subsequent frames to obtain the target position and size. Existing twin-network-based target tracking methods generally adopt only the cosine similarity as the measurement; this single measurement mode cannot cope well with large changes in the target's appearance or with interference from similar targets.
Disclosure of Invention
The invention aims to provide a twin network target tracking method based on different measurement criteria, which solves the problem that conventional twin network target tracking methods are easily disturbed by similar targets and are not robust to changes in target appearance, leading to tracking failure.
The invention adopts the technical scheme that a twin network target tracking method based on different measurement criteria specifically comprises the following steps:
step 1, selecting a feature extraction network φ(·);
step 2, acquiring a tracking video, manually selecting the region where the target is located on the first frame of the video, and inputting the first-frame target region Z as a template into the feature extraction network φ(·) selected in step 1 to obtain the template depth feature φ(Z);
step 3, entering a subsequent frame image of the tracking video, obtaining the search region S_t of the current frame image by using the tracking target coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the previous frame image, and inputting the search region S_t of the current frame image into the feature extraction network φ(·) selected in step 1 to obtain the current-frame search-region depth feature φ(S_t);
step 4, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the current-frame search-region depth feature φ(S_t) obtained in step 3 by using the cosine similarity, to obtain the response map h_c(Z, S_t) produced by the cosine-similarity measurement;
step 5, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the current-frame search-region depth feature φ(S_t) obtained in step 3 by using the Euclidean distance, to obtain the response map h_d(Z, S_t) produced by the Euclidean-distance measurement;
step 6, performing weighted fusion of the response map h_c(Z, S_t) obtained in step 4 and the response map h_d(Z, S_t) obtained in step 5 to obtain the fused response map h(Z, S_t), and interpolating the fused response map h(Z, S_t) to a fixed size; the maximum-value point of the interpolated response map is the position of the tracking target, and the width and height of the target are updated by linear interpolation, thereby realizing tracking of the current-frame target.
The invention is also characterized in that:
the specific process of step 1 is as follows:
selecting an AlexNet network pre-trained on the ImageNet data set as the feature extraction network φ(·) of the twin network.
The specific process of step 2 is as follows:
step 2.1, acquiring a tracking video and manually selecting the region where the target is located on the first frame of the video; let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target region, respectively; taking the center point (x, y) of the target in the first frame as the center, a square region with side length z_sz is cut out, where z_sz is calculated by formula (1):
z_sz = √((m + 2p)(n + 2p))  (1)
where p = (m + n)/4 represents the padding amount;
step 2.2, if the square region with side length z_sz exceeds the first-frame image of the tracking video, the exceeding part is filled with the mean value of the first-frame image of the tracking video; the mean value x̄ of the first-frame image is calculated by formula (2) by averaging the pixel values x_{ijk} over all channels i, rows j and columns k;
step 2.3, the square region with side length z_sz is scaled to size b × b to obtain the target region Z of the first-frame image, and the target region Z of the first-frame image is input into the feature extraction network φ(·) to obtain the template depth feature φ(Z) with width, height and channel number w × h × C.
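For illustration, a minimal Python sketch of the square-region cropping used in steps 2 and 3 is given below; the helper names crop_square_region and exemplar_side, the use of NumPy and OpenCV, and the explicit form of the side-length formula are assumptions made for the sketch, not part of the claimed method.

import numpy as np
import cv2

def crop_square_region(img, center, side, out_size):
    # img: H x W x 3 frame; center: (x, y) target center; side: square side length in pixels.
    # Pixels that fall outside the frame are filled with the per-channel mean of the
    # whole frame, as described in steps 2.2 and 3.2.
    mean_val = img.mean(axis=(0, 1))
    x, y = center
    half = side / 2.0
    x1, y1 = int(round(x - half)), int(round(y - half))
    x2, y2 = int(round(x + half)), int(round(y + half))
    h, w = img.shape[:2]
    pad_l, pad_t = max(0, -x1), max(0, -y1)
    pad_r, pad_b = max(0, x2 - w), max(0, y2 - h)
    padded = cv2.copyMakeBorder(img, pad_t, pad_b, pad_l, pad_r,
                                cv2.BORDER_CONSTANT, value=mean_val.tolist())
    patch = padded[y1 + pad_t:y2 + pad_t, x1 + pad_l:x2 + pad_l]
    return cv2.resize(patch, (out_size, out_size))

def exemplar_side(m, n):
    # Template crop side length (step 2.1): padding p = (m + n) / 4, assumed
    # SiamFC-style context formula z_sz = sqrt((m + 2p)(n + 2p)).
    p = (m + n) / 4.0
    return np.sqrt((m + 2 * p) * (n + 2 * p))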
The specific process of step 3 is as follows:
step 3.1, entering a subsequent frame image of the tracking video; using the tracking target coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the frame t-1 image, a square region with side length x_sx is cut out of the current frame t image, where x_sx is calculated by formula (3) from (w_{t-1}, h_{t-1}) and the padding amount p_{t-1} = (w_{t-1} + h_{t-1})/4;
step 3.2, if the square region cut out in step 3.1 exceeds the current frame t image, the exceeding part is filled with the mean value of the frame t image, which is calculated in the same way as in step 2.2, i.e. by averaging the pixel values x_{ijk} of the frame t image over all channels i, rows j and columns k;
step 3.3, the square region with side length x_sx is scaled to size a × a to obtain the search region S_t of the current frame t image, and the search region S_t of the current frame t image is input into the feature extraction network φ(·) selected in step 1 to obtain the current-frame search-region depth feature φ(S_t) with width, height and channel number W × H × C.
The specific process of step 4 is as follows:
first, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t); at every sliding position there is a region φ(S_t)_{ij} of the current-frame search-region feature with the same size as the template-frame depth feature φ(Z), where i denotes the index of the horizontal shift of φ(Z) on φ(S_t) and j denotes the index of the vertical shift of φ(Z) on φ(S_t); assuming that φ(Z) moves over φ(S_t) with step size s each time, i and j take values in the intervals
i ∈ [1, (W − w)/s + 1], j ∈ [1, (H − h)/s + 1],
where i and j are integers;
since φ(Z) and φ(S_t)_{ij} both have size w × h × C, they are flattened into one-dimensional vectors ẑ and ŝ_{ij} of size (w × h × C) × 1, and the cosine similarity is used to measure the degree of similarity of the two vectors; the cosine similarity of ẑ and ŝ_{ij} is solved as follows:
cos(ẑ, ŝ_{ij}) = (ẑ · ŝ_{ij}) / (‖ẑ‖ ‖ŝ_{ij}‖);
finally, the response map h_c(Z, S_t) obtained by the cosine-similarity measurement is the collection of cos(ẑ, ŝ_{ij}) over all (i, j).
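A minimal NumPy sketch of the sliding cosine-similarity measurement of step 4 is given below; the function name cosine_response_map and the (rows, columns, channels) feature layout are illustrative, and a practical implementation would replace the explicit Python loops with a batched cross-correlation.

import numpy as np

def cosine_response_map(feat_z, feat_s, stride=1):
    # feat_z: template feature (h, w, C); feat_s: search feature (H, W, C).
    # Slides feat_z over feat_s and stores, for every offset, the cosine similarity
    # of the flattened template vector and the flattened search block.
    h, w, C = feat_z.shape
    H, W, _ = feat_s.shape
    z_vec = feat_z.reshape(-1)
    z_norm = np.linalg.norm(z_vec) + 1e-12
    out_h = (H - h) // stride + 1
    out_w = (W - w) // stride + 1
    resp = np.zeros((out_h, out_w), dtype=np.float32)
    for r in range(out_h):                      # vertical shift index
        for c in range(out_w):                  # horizontal shift index
            s_vec = feat_s[r * stride:r * stride + h,
                           c * stride:c * stride + w, :].reshape(-1)
            resp[r, c] = z_vec @ s_vec / (z_norm * (np.linalg.norm(s_vec) + 1e-12))
    return resp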
The specific steps of step 5 are as follows:
first, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t) in the same way as in step 4; during the sliding operation, the Euclidean distance is used to measure the similarity between the template-frame depth feature φ(Z) and each block φ(S_t)_{ij} of the current-frame search region, i.e. the Euclidean distance ‖ẑ − ŝ_{ij}‖ between the corresponding flattened vectors is computed;
finally, the response map h_d(Z, S_t) obtained by the Euclidean-distance measurement is the collection of these measurements over all (i, j).
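Under the same sliding scheme, a sketch of the Euclidean-distance measurement of step 5 might look as follows; since the exact mapping from the raw distance to a response value is not reproduced here, the sketch simply stores the distance itself.

import numpy as np

def euclidean_response_map(feat_z, feat_s, stride=1):
    # Same sliding scheme as in step 4, but each entry stores the Euclidean distance
    # between the flattened template feature and the current search-region block.
    h, w, C = feat_z.shape
    H, W, _ = feat_s.shape
    z_vec = feat_z.reshape(-1)
    out_h = (H - h) // stride + 1
    out_w = (W - w) // stride + 1
    resp = np.zeros((out_h, out_w), dtype=np.float32)
    for r in range(out_h):
        for c in range(out_w):
            s_vec = feat_s[r * stride:r * stride + h,
                           c * stride:c * stride + w, :].reshape(-1)
            resp[r, c] = np.linalg.norm(z_vec - s_vec)
    return resp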
The specific process of step 6 is as follows:
step 6.1, the response map h_c(Z, S_t) obtained with the cosine similarity and the response map h_d(Z, S_t) obtained with the Euclidean distance are weighted and fused as shown below to obtain the fused response map h(Z, S_t):
h(Z, S_t) = λh_c(Z, S_t) + (1 − λ)h_d(Z, S_t)  (6);
step 6.2, the fused response map h(Z, S_t) is interpolated by bicubic interpolation to a response map H(Z, S_t) of a fixed larger size; the maximum-value point of the response map H(Z, S_t) is the position of the target, and the deviation (Δx, Δy) between the maximum-value point of H(Z, S_t) and the center of H(Z, S_t) is then used to correct the previous-frame target position (x_{t-1}, y_{t-1}) and obtain the current-frame target position (x_t, y_t);
step 6.3, the width and height (w_t, h_t) of the current-frame target are updated: first, the scale by which the target width and height change is obtained by linear interpolation, where r is the update rate;
step 6.4, the width and height (w_t, h_t) of the current frame t target are updated by multiplying by the changed scale;
step 6.5, the tracking process for the current-frame target image is finished, the next frame is taken as the current frame, and the procedure jumps to step 3 to track the subsequent frames.
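A sketch of the fusion and update of step 6 is given below; the 272 × 272 up-sampling size is taken from embodiment 1, while the 1:1 mapping from response-map pixels to image pixels and the scale-update rule are assumptions of the sketch rather than the patent's reference formulas.

import numpy as np
import cv2

def fuse_and_locate(resp_c, resp_d, prev_pos, prev_sz, lam=0.5, up_size=272,
                    rate=0.59, detected_scale=1.0):
    # Weighted fusion of the two response maps (formula (6)), bicubic up-sampling,
    # and position update from the offset between the peak and the map center.
    # detected_scale stands for the measured change of target size; how it is
    # estimated (e.g. a multi-scale search) is outside this sketch.
    resp = np.float32(lam * resp_c + (1.0 - lam) * resp_d)
    resp = cv2.resize(resp, (up_size, up_size), interpolation=cv2.INTER_CUBIC)
    row, col = np.unravel_index(np.argmax(resp), resp.shape)
    dy, dx = row - up_size / 2.0, col - up_size / 2.0
    x_t, y_t = prev_pos[0] + dx, prev_pos[1] + dy        # assumed 1:1 pixel mapping
    smooth = (1.0 - rate) + rate * detected_scale        # linear interpolation of the scale
    w_t, h_t = prev_sz[0] * smooth, prev_sz[1] * smooth
    return (x_t, y_t), (w_t, h_t)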
The invention has the following beneficial effects:
1. By additionally introducing the Euclidean-distance measurement, the invention enables the network to better cope with the presence of similar targets, effectively alleviating the tracking failures caused by the appearance of similar objects.
2. The response map obtained with the Euclidean-distance measurement and the response map obtained with the cosine-similarity measurement are fused, making full use of the advantages of the two measurement modes, so that the network is more robust to changes in the target's appearance and the tracking drift caused by such appearance changes is effectively alleviated.
Drawings
FIG. 1 is a network structure diagram of a twin network target tracking method based on different measurement criteria according to the present invention;
FIG. 2 is a schematic diagram of similarity measurement performed in a twin network target tracking method based on different measurement criteria according to the present invention;
FIG. 3 is a process diagram of embodiment 1 of the twin network target tracking method based on different measurement criteria according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The twin network target tracking method based on different measurement criteria, as shown in fig. 1, comprises the following specific steps:
Step 1, selecting the feature extraction network φ(·) of the twin network.
Step 1 specifically comprises selecting an AlexNet network pre-trained on the ImageNet data set as the feature extraction network φ(·) of the twin network.
Step 2, acquiring a tracking video, manually selecting the region where the target is located on the first frame of the video, and inputting the first-frame target region Z as a template into the feature extraction network φ(·) selected in step 1 to obtain the template depth feature φ(Z).
Step 2.1, acquiring a tracking video and manually selecting the region where the target is located on the first frame of the video; let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target region, respectively; taking the center point (x, y) of the target in the first frame as the center, a square region with side length z_sz is cut out, where z_sz is calculated by formula (1):
z_sz = √((m + 2p)(n + 2p))  (1)
where p = (m + n)/4 represents the padding amount;
Step 2.2, if the square region with side length z_sz exceeds the first-frame image of the tracking video, the exceeding part is filled with the mean value of the first-frame image of the tracking video; the mean value x̄ of the first-frame image is calculated by formula (2) by averaging the pixel values x_{ijk} over all channels i, rows j and columns k;
Step 2.3, the square region with side length z_sz is scaled to size b × b to obtain the target region Z of the first-frame image, and the target region Z of the first-frame image is input into the feature extraction network φ(·) to obtain the template depth feature φ(Z) with width, height and channel number w × h × C.
Step 3, entering a subsequent frame image of the tracking video, obtaining the search region S_t of the current frame image by using the tracking target coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the previous frame image, and inputting the search region S_t into the feature extraction network φ(·) selected in step 1 to obtain the current-frame search-region depth feature φ(S_t).
The specific process of step 3 is as follows:
Step 3.1, entering a subsequent frame image of the tracking video; using the tracking target coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the frame t-1 image, a square region with side length x_sx is cut out of the current frame t image, where x_sx is calculated by formula (3) from (w_{t-1}, h_{t-1}) and the padding amount p_{t-1} = (w_{t-1} + h_{t-1})/4;
Step 3.2, if the square region cut out in step 3.1 exceeds the current frame t image, the exceeding part is filled with the mean value of the frame t image, calculated by averaging the pixel values x_{ijk} of the frame t image over all channels i, rows j and columns k;
Step 3.3, the square region with side length x_sx is scaled to size a × a to obtain the search region S_t of the current frame t image, and the search region S_t is input into the feature extraction network φ(·) selected in step 1 to obtain the current-frame search-region depth feature φ(S_t) with width, height and channel number W × H × C.
Step 4, the cosine similarity is used to measure the similarity between the template depth feature φ(Z) and the current-frame search-region depth feature φ(S_t). First, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t), as shown in fig. 2. At every sliding position there is a region of the current-frame search-region feature with the same size as the template-frame depth feature φ(Z); this region is defined as φ(S_t)_{ij}, where i denotes the index of the horizontal shift of φ(Z) on φ(S_t) and j denotes the index of the vertical shift. Assuming that φ(Z) moves over φ(S_t) with step size s each time, i and j take values in the intervals i ∈ [1, (W − w)/s + 1] and j ∈ [1, (H − h)/s + 1].
Since φ(Z) and φ(S_t)_{ij} both have size w × h × C, they are flattened into one-dimensional vectors ẑ and ŝ_{ij} of size (w × h × C) × 1, and the cosine similarity measures the degree of similarity of the two vectors. The cosine similarity of ẑ and ŝ_{ij} is solved as follows:
cos(ẑ, ŝ_{ij}) = (ẑ · ŝ_{ij}) / (‖ẑ‖ ‖ŝ_{ij}‖).
Finally, the response map h_c(Z, S_t) obtained by the cosine-similarity measurement is the collection of cos(ẑ, ŝ_{ij}) over all (i, j). The expression of h_c(Z, S_t) can be written as
h_c(Z, S_t) = φ(Z) ★ φ(S_t),
where ★ denotes the cross-correlation metric operation.
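The expression h_c(Z, S_t) = φ(Z) ★ φ(S_t) can be evaluated without an explicit loop: a single 2-D cross-correlation yields all sliding dot products at once, and dividing by the two L2 norms recovers the cosine values. A PyTorch sketch, given for illustration only, is:

import torch
import torch.nn.functional as F

def cosine_xcorr(feat_z, feat_s, eps=1e-12):
    # feat_z: (1, C, h, w) template feature; feat_s: (1, C, H, W) search feature.
    # F.conv2d with the template as kernel performs the sliding dot product;
    # dividing by the two L2 norms turns it into the cosine similarity of step 4.
    dot = F.conv2d(feat_s, feat_z)                    # (1, 1, H - h + 1, W - w + 1)
    z_norm = feat_z.flatten().norm()
    ones = torch.ones_like(feat_z)
    s_sq = F.conv2d(feat_s * feat_s, ones)            # sliding sum of squares of the search feature
    return dot / (z_norm * s_sq.clamp_min(eps).sqrt())

With the 6 × 6 × 256 and 22 × 22 × 256 features of embodiment 1, this produces the same 17 × 17 map as the sliding-window description.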
Step 5, the Euclidean distance is used to measure the similarity between the template-frame depth feature φ(Z) and the current-frame search-region depth feature φ(S_t), with a procedure similar to the cosine-similarity measurement of step 4. First, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t), as shown in fig. 2. During the sliding operation, the Euclidean distance ‖ẑ − ŝ_{ij}‖ is used to measure the similarity between the template-frame depth feature φ(Z) and each block φ(S_t)_{ij} of the current-frame search region.
Finally, the response map h_d(Z, S_t) obtained by the Euclidean-distance measurement is the collection of these measurements over all (i, j); its expression can be written in the same form as that of h_c(Z, S_t), with the sliding operation using the Euclidean-distance metric instead of the cross-correlation.
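The Euclidean-distance map of step 5 can likewise be computed without an explicit loop by expanding ‖ẑ − ŝ_{ij}‖² = ‖ẑ‖² + ‖ŝ_{ij}‖² − 2 ẑ·ŝ_{ij}; a PyTorch sketch, given for illustration only, is:

import torch
import torch.nn.functional as F

def euclidean_xcorr(feat_z, feat_s):
    # Sliding Euclidean distance between the template feature and every
    # same-sized block of the search feature, via two convolutions.
    ones = torch.ones_like(feat_z)
    z_sq = (feat_z * feat_z).sum()
    s_sq = F.conv2d(feat_s * feat_s, ones)    # sliding sum of squares of the search feature
    dot = F.conv2d(feat_s, feat_z)            # sliding dot product with the template
    dist_sq = (z_sq + s_sq - 2.0 * dot).clamp_min(0.0)
    return dist_sq.sqrt()                     # (1, 1, H - h + 1, W - w + 1) distance map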
Step 6, the response map h_c(Z, S_t) obtained with the cosine similarity and the response map h_d(Z, S_t) obtained with the Euclidean distance are weighted and fused as shown below to obtain the fused response map:
h(Z, S_t) = λh_c(Z, S_t) + (1 − λ)h_d(Z, S_t)  (10);
Then the fused response map h(Z, S_t) is interpolated by bicubic interpolation to a response map H(Z, S_t) of a fixed larger size. The maximum-value point of H(Z, S_t) is the position of the target; the deviation (Δx, Δy) between the maximum-value point of H(Z, S_t) and the center of H(Z, S_t) is then used to correct the previous-frame target position (x_{t-1}, y_{t-1}) to obtain the current-frame target position (x_t, y_t).
Next, the width and height (w_t, h_t) of the current-frame target are updated. First, the scale by which the target width and height change is obtained by linear interpolation, where r is the update rate; the width and height (w_t, h_t) of the current-frame target are then updated by multiplying by the changed scale.
The tracking process for the current-frame target ends; the next frame is taken as the current frame and the procedure jumps to step 3 to track the subsequent frames.
Example 1
Step 1, an AlexNet network pre-trained on the ImageNet data set is selected as the feature extraction network φ(·) of the twin network.
Table 1 feature extraction network parameter table
As shown in table 1, the feature extraction network φ(·) consists of a total of 5 convolutional layers and 2 pooling layers. Each of the first two convolutional layers is followed by a max-pooling layer. Dropout layers and ReLU nonlinear activation functions are added after the first 4 convolutional layers.
Step 2, a tracking video is acquired and the area where the target is located is manually selected on the first frame of the video. Let (x, y) be the coordinates of the center point of the target in the first frame, and let m and n be the width and height of the target region, respectively. Taking the center point (x, y) of the target in the first frame as the center, a square area with side length z_sz is cut out, where z_sz is calculated by formula (1) with padding amount p = (m + n)/4. If the square area extends beyond the image, the exceeding part is filled with the image mean. The square area of side z_sz is then scaled to 127 × 127, resulting in the target region Z of the first frame. Finally, the target region Z of the first frame is input into the feature extraction network φ(·), resulting in a template depth feature φ(Z) with dimensions 6 × 6 × 256.
Step 3, entering the subsequent frame, a square area with side length x_sx is cut out using the previous-frame tracking target coordinate position (x_{t-1}, y_{t-1}) and width and height (w_{t-1}, h_{t-1}), where x_sx is calculated by formula (3) with padding amount p_{t-1} = (w_{t-1} + h_{t-1})/4. If the square area extends beyond the image, the exceeding part is filled with the image mean. The square area of side x_sx is then scaled to 255 × 255 to obtain the search region S_t of the current frame, which is input into the feature extraction network φ(·) to obtain the current-frame search-region depth feature φ(S_t) of size 22 × 22 × 256;
step 4, adopting cosine similarity to match the depth characteristic of the template
Figure BDA0002939110140000132
And current frame search region depth features
Figure BDA0002939110140000133
And (5) performing similarity measurement. First depth features of template frames
Figure BDA0002939110140000134
Searching for a region in a current frame
Figure BDA0002939110140000135
A sliding operation is performed as shown in fig. 2. Searching the area of the current frame every time sliding operation is performed
Figure BDA0002939110140000136
There will always be one sum template frame depth feature
Figure BDA0002939110140000137
Areas of the same size
Figure BDA0002939110140000138
Wherein i represents
Figure BDA0002939110140000139
In that
Figure BDA00029391101400001310
Subscript of upper horizontal shift, j represents
Figure BDA00029391101400001311
In that
Figure BDA00029391101400001312
Up the vertically shifted subscript. Suppose that
Figure BDA00029391101400001313
Each time at
Figure BDA00029391101400001314
If the up shift s is 1 step, i and j will take values within the following interval:
i∈[1,2,...,17]j∈[1,2,...,17]
due to the fact that
Figure BDA00029391101400001315
And
Figure BDA00029391101400001316
are all 6X 256, will now be
Figure BDA00029391101400001317
And
Figure BDA00029391101400001318
one-dimensional vector flattened to (6 × 6 × 256) × 1
Figure BDA00029391101400001319
And
Figure BDA00029391101400001320
the cosine similarity measures the degree of similarity of the two vectors. Solving for
Figure BDA00029391101400001321
And
Figure BDA00029391101400001322
the cosine similarity of (c) is as follows:
Figure BDA00029391101400001323
finally, a response graph h obtained by a cosine similarity measurement modec(Z,St) Is composed of
Figure BDA00029391101400001324
A collection of (a). h isc(Z,St) The expression of (c) can be written as follows:
Figure BDA00029391101400001325
denotes the cross-correlation metric operation.
Step 5, the Euclidean distance is used to measure the similarity between the template-frame depth feature φ(Z) and the current-frame search-region depth feature φ(S_t), with a procedure similar to the cosine-similarity measurement of step 4. First, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t), as shown in fig. 2. During the sliding operation, the Euclidean distance ‖ẑ − ŝ_{ij}‖ is used to measure the similarity between the template-frame depth feature φ(Z) and each block φ(S_t)_{ij} of the current-frame search region.
Finally, the response map h_d(Z, S_t) obtained by the Euclidean-distance measurement is the collection of these measurements over all (i, j); its expression can be written in the same form as that of h_c(Z, S_t), with the sliding operation using the Euclidean-distance metric instead of the cross-correlation.
Step 6, the response map h_c(Z, S_t) obtained with the cosine similarity and the response map h_d(Z, S_t) obtained with the Euclidean distance are both of size 17 × 17 × 1. They are weighted and fused with weight λ = 0.5, as shown below, to obtain the fused response map:
h(Z, S_t) = 0.5 h_c(Z, S_t) + 0.5 h_d(Z, S_t).
Then the fused response map h(Z, S_t) is interpolated by bicubic interpolation to a response map H(Z, S_t) of size 272 × 272. The maximum-value point of H(Z, S_t) is the position of the target; the deviation (Δx, Δy) between the maximum-value point of H(Z, S_t) and the center of H(Z, S_t) is then used to correct the previous-frame target position (x_{t-1}, y_{t-1}) to obtain the current-frame target position (x_t, y_t).
Next, the width and height (w_t, h_t) of the current-frame target are updated. First, the scale by which the target width and height change is obtained by linear interpolation, with the update rate r set to 0.59; the width and height (w_t, h_t) of the current-frame target are then updated by multiplying by the changed scale.
The tracking process for the current-frame target ends; the next frame is taken as the current frame and the procedure jumps to step 3 to track the subsequent frames. As shown in fig. 3, by using the maximum value of the fused response map together with the width and height updating method, the target can be located and its size determined in the current frame.
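For illustration, the sketch helpers introduced above can be combined into a tracking loop as follows; crop_square_region, exemplar_side, cosine_response_map, euclidean_response_map and fuse_and_locate are the illustrative helpers defined earlier, not functions of the patent, and the search-region side length (scaled by 255/127) and the omitted tensor conversions are assumptions of the sketch.

import numpy as np

def track(video_frames, init_box, model, lam=0.5, rate=0.59):
    # model: callable mapping an image crop to its (h, w, C) depth feature;
    # conversion between image arrays and network tensors is omitted here.
    x, y, m, n = init_box                                 # first-frame center and size
    z_crop = crop_square_region(video_frames[0], (x, y), exemplar_side(m, n), 127)
    feat_z = model(z_crop)                                # template depth feature (step 2)
    pos, sz = (x, y), (m, n)
    results = [init_box]
    for frame in video_frames[1:]:
        p = (sz[0] + sz[1]) / 4.0                         # padding of the previous-frame box
        x_side = np.sqrt((sz[0] + 2 * p) * (sz[1] + 2 * p)) * 255.0 / 127.0   # assumed search side
        s_crop = crop_square_region(frame, pos, x_side, 255)
        feat_s = model(s_crop)                            # search-region depth feature (step 3)
        resp_c = cosine_response_map(feat_z, feat_s)      # step 4
        resp_d = euclidean_response_map(feat_z, feat_s)   # step 5
        pos, sz = fuse_and_locate(resp_c, resp_d, pos, sz, lam=lam, rate=rate)  # step 6
        results.append((pos[0], pos[1], sz[0], sz[1]))
    return results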

Claims (7)

1. A twin network target tracking method based on different measurement criteria, characterized in that the method specifically comprises the following steps:
step 1, selecting a feature extraction network φ(·);
step 2, acquiring a tracking video, manually selecting the region where the target is located on the first frame of the video, and inputting the first-frame target region Z as a template into the feature extraction network φ(·) selected in step 1 to obtain the template depth feature φ(Z);
step 3, entering a subsequent frame image of the tracking video, obtaining the search region S_t of the current frame image by using the tracking target coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the previous frame image, and inputting the search region S_t of the current frame image into the feature extraction network φ(·) selected in step 1 to obtain the current-frame search-region depth feature φ(S_t);
step 4, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the current-frame search-region depth feature φ(S_t) obtained in step 3 by using the cosine similarity, to obtain the response map h_c(Z, S_t) produced by the cosine-similarity measurement;
step 5, performing similarity measurement between the template depth feature φ(Z) obtained in step 2 and the current-frame search-region depth feature φ(S_t) obtained in step 3 by using the Euclidean distance, to obtain the response map h_d(Z, S_t) produced by the Euclidean-distance measurement;
step 6, performing weighted fusion of the response map h_c(Z, S_t) obtained in step 4 and the response map h_d(Z, S_t) obtained in step 5 to obtain the fused response map h(Z, S_t), and interpolating the fused response map h(Z, S_t) to a fixed size; the maximum-value point of the interpolated response map is the position of the tracking target, and the width and height of the target are updated by linear interpolation, thereby realizing tracking of the current-frame target.
2. The twin network target tracking method based on different measurement criteria as claimed in claim 1, wherein the specific process of step 1 is as follows:
selecting an AlexNet network pre-trained on the ImageNet data set as the feature extraction network φ(·) of the twin network.
3. The twin network target tracking method based on different measurement criteria as claimed in claim 2, wherein the specific process of step 2 is as follows:
step 2.1, acquiring a tracking video and manually selecting the region where the target is located on the first frame of the video; letting (x, y) be the coordinates of the center point of the target in the first frame and m and n be the width and height of the target region, respectively; taking the center point (x, y) of the target in the first frame as the center, cutting out a square region with side length z_sz, where z_sz is calculated by formula (1):
z_sz = √((m + 2p)(n + 2p))  (1)
where p = (m + n)/4 represents the padding amount;
step 2.2, if the square region with side length z_sz exceeds the first-frame image of the tracking video, filling the exceeding part with the mean value of the first-frame image of the tracking video, the mean value x̄ of the first-frame image being calculated by formula (2) by averaging the pixel values x_{ijk} over all channels i, rows j and columns k;
step 2.3, scaling the square region with side length z_sz to size b × b to obtain the target region Z of the first-frame image, and inputting the target region Z of the first-frame image into the feature extraction network φ(·) to obtain the template depth feature φ(Z) with width, height and channel number w × h × C.
4. The twin network target tracking method based on different measurement criteria as claimed in claim 3, wherein the specific process of step 3 is as follows:
step 3.1, entering a subsequent frame image of the tracking video; using the tracking target coordinate position (x_{t-1}, y_{t-1}) and the width and height (w_{t-1}, h_{t-1}) of the frame t-1 image, cutting a square region with side length x_sx out of the current frame t image, where x_sx is calculated by formula (3) from (w_{t-1}, h_{t-1}) and the padding amount p_{t-1} = (w_{t-1} + h_{t-1})/4;
step 3.2, if the square region cut out in step 3.1 exceeds the current frame t image, filling the exceeding part with the mean value of the frame t image, calculated by averaging the pixel values x_{ijk} of the frame t image over all channels i, rows j and columns k;
step 3.3, scaling the square region with side length x_sx to size a × a to obtain the search region S_t of the current frame t image, and inputting the search region S_t into the feature extraction network φ(·) selected in step 1 to obtain the current-frame search-region depth feature φ(S_t) with width, height and channel number W × H × C.
5. The twin network target tracking method based on different measurement criteria as claimed in claim 4, wherein the specific process of step 4 is as follows:
first, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t); at every sliding position there is a region φ(S_t)_{ij} of the current-frame search-region feature with the same size as the template-frame depth feature φ(Z), where i denotes the index of the horizontal shift of φ(Z) on φ(S_t) and j denotes the index of the vertical shift of φ(Z) on φ(S_t); assuming that φ(Z) moves over φ(S_t) with step size s each time, i and j take values in the intervals
i ∈ [1, (W − w)/s + 1], j ∈ [1, (H − h)/s + 1],
where i and j are integers;
since φ(Z) and φ(S_t)_{ij} both have size w × h × C, they are flattened into one-dimensional vectors ẑ and ŝ_{ij} of size (w × h × C) × 1, and the cosine similarity is used to measure the similarity of the two vectors; the cosine similarity of ẑ and ŝ_{ij} is solved as follows:
cos(ẑ, ŝ_{ij}) = (ẑ · ŝ_{ij}) / (‖ẑ‖ ‖ŝ_{ij}‖);
finally, the response map h_c(Z, S_t) obtained by the cosine-similarity measurement is the collection of cos(ẑ, ŝ_{ij}) over all (i, j).
6. The twin network target tracking method based on different measurement criteria as claimed in claim 5, wherein the specific steps of step 5 are as follows:
first, the template-frame depth feature φ(Z) is slid over the current-frame search-region depth feature φ(S_t) in the same way as in step 4; during the sliding operation, the Euclidean distance is used to measure the similarity between the template-frame depth feature φ(Z) and each block φ(S_t)_{ij} of the current-frame search region, i.e. the Euclidean distance ‖ẑ − ŝ_{ij}‖ between the corresponding flattened vectors is computed;
finally, the response map h_d(Z, S_t) obtained by the Euclidean-distance measurement is the collection of these measurements over all (i, j).
7. The twin network target tracking method based on different measurement criteria as claimed in claim 6, wherein the specific process of step 6 is as follows:
step 6.1, the response map h_c(Z, S_t) obtained with the cosine similarity and the response map h_d(Z, S_t) obtained with the Euclidean distance are weighted and fused as shown below to obtain the fused response map h(Z, S_t):
h(Z, S_t) = λh_c(Z, S_t) + (1 − λ)h_d(Z, S_t)  (6);
step 6.2, the fused response map h(Z, S_t) is interpolated by bicubic interpolation to a response map H(Z, S_t) of a fixed larger size; the maximum-value point of the response map H(Z, S_t) is the position of the target, and the deviation (Δx, Δy) between the maximum-value point of H(Z, S_t) and the center of H(Z, S_t) is then used to correct the previous-frame target position (x_{t-1}, y_{t-1}) and obtain the current-frame target position (x_t, y_t);
step 6.3, the width and height (w_t, h_t) of the current-frame target are updated: first, the scale by which the target width and height change is obtained by linear interpolation, where r is the update rate;
step 6.4, the width and height (w_t, h_t) of the current frame t target are updated by multiplying by the changed scale;
step 6.5, the tracking process for the current-frame target image is finished, the next frame is taken as the current frame, and the procedure jumps to step 3 to track the subsequent frames.
CN202110171718.7A 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria Active CN112991385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110171718.7A CN112991385B (en) 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110171718.7A CN112991385B (en) 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria

Publications (2)

Publication Number Publication Date
CN112991385A true CN112991385A (en) 2021-06-18
CN112991385B CN112991385B (en) 2023-04-28

Family

ID=76347410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110171718.7A Active CN112991385B (en) 2021-02-08 2021-02-08 Twin network target tracking method based on different measurement criteria

Country Status (1)

Country Link
CN (1) CN112991385B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379806A (en) * 2021-08-13 2021-09-10 南昌工程学院 Target tracking method and system based on learnable sparse conversion attention mechanism


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128254A1 (en) * 2017-12-26 2019-07-04 浙江宇视科技有限公司 Image analysis method and apparatus, and electronic device and readable storage medium
US20200051250A1 (en) * 2018-08-08 2020-02-13 Beihang University Target tracking method and device oriented to airborne-based monitoring scenarios
CN111161317A (en) * 2019-12-30 2020-05-15 北京工业大学 Single-target tracking method based on multiple networks
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111639551A (en) * 2020-05-12 2020-09-08 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN111951304A (en) * 2020-09-03 2020-11-17 湖南人文科技学院 Target tracking method, device and equipment based on mutual supervision twin network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
L. XU et al.: "Visual Tracking Based on Siamese Network of Fused Score Map", IEEE Access *
QIN Xiaofei et al.: "Person Re-identification Based on Siamese Network and Multi-distance Fusion", Optical Instruments *


Also Published As

Publication number Publication date
CN112991385B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN109949375B (en) Mobile robot target tracking method based on depth map region of interest
US20220366576A1 (en) Method for target tracking, electronic device, and storage medium
CN108550162B (en) Object detection method based on deep reinforcement learning
CN111781608B (en) Moving target detection method and system based on FMCW laser radar
CN107578430B (en) Stereo matching method based on self-adaptive weight and local entropy
CN107169994B (en) Correlation filtering tracking method based on multi-feature fusion
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN106408596B (en) Sectional perspective matching process based on edge
CN106780631A (en) A kind of robot closed loop detection method based on deep learning
CN109708658B (en) Visual odometer method based on convolutional neural network
CN110111370B (en) Visual object tracking method based on TLD and depth multi-scale space-time features
CN111260661A (en) Visual semantic SLAM system and method based on neural network technology
CN111292369B (en) False point cloud data generation method of laser radar
CN108537825B (en) Target tracking method based on transfer learning regression network
CN107945207A (en) A kind of real-time object tracking method based on video interframe low-rank related information uniformity
WO2023169337A1 (en) Target object speed estimation method and apparatus, vehicle, and storage medium
CN111998862A (en) Dense binocular SLAM method based on BNN
CN112508851A (en) Mud rock lithology recognition system based on CNN classification algorithm
CN112991385A (en) Twin network target tracking method based on different measurement criteria
CN112802199A (en) High-precision mapping point cloud data processing method and system based on artificial intelligence
CN113487631B (en) LEGO-LOAM-based adjustable large-angle detection sensing and control method
CN115908539A (en) Target volume automatic measurement method and device and storage medium
CN100378752C (en) Segmentation method of natural image in robustness
CN113643329B (en) Twin attention network-based online update target tracking method and system
CN112446353B (en) Video image trace line detection method based on depth convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant