CN114757970A - Multi-level regression target tracking method and system based on sample balance - Google Patents

Multi-level regression target tracking method and system based on sample balance

Info

Publication number
CN114757970A
Authority
CN
China
Prior art keywords
iou
candidate
image
search image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210394687.6A
Other languages
Chinese (zh)
Other versions
CN114757970B (en)
Inventor
吴晶晶
楚喻棋
刘学亮
洪日昌
蒋建国
齐美彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210394687.6A priority Critical patent/CN114757970B/en
Publication of CN114757970A publication Critical patent/CN114757970A/en
Application granted granted Critical
Publication of CN114757970B publication Critical patent/CN114757970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a multi-level regression target tracking method and system based on sample balance. Fusion features between the candidate frames in a search image and the target frame in a reference image are acquired, and the candidate frames in the search image are then optimized by a plurality of cascaded optimization stages, in which the IoU thresholds are raised stage by stage so that positioning precision improves progressively while the samples remain balanced. The method overcomes the drawback of existing methods, in which a single threshold makes it difficult to balance sample sampling against sample error.

Description

Multi-level regression target tracking method and system based on sample balance
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a multi-level regression target tracking method and a multi-level regression target tracking system based on sample balance.
Background
Given the location of an object of interest in the first frame of a video, visual object tracking aims to continuously locate that object in the subsequent frames. The task has high practical value in security systems, so it has received wide attention in the computer vision field. Although deep learning techniques have been applied to this task with considerable success, it remains challenging due to factors such as shape changes, scale changes, occlusion of the object, and background clutter.
Existing deep-learning-based target trackers mostly adopt a Siamese two-stream network structure for the offline network, and realize regression of candidate positions by integrating the appearance information of a given template with that of the candidate positions; see the following documents:
[1] Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8971-8980.
[2] Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4282-4291.
[3] Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 101-117.
[4] He A, Luo C, Tian X, et al. Towards a better match in siamese network based visual object tracker[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[5] Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4591-4600.
In the target tracking task, the most commonly used regression operation is bounding box regression, which directly learns the deviation between candidate positions and the real position of the target so that the candidate positions can be corrected to lie closer to the real position. However, bounding box regression suffers from a sample imbalance problem: only the positive-sample candidate boxes whose Intersection over Union (IoU) with the ground truth exceeds a set threshold are regressed. When the IoU threshold is set higher, fewer positive samples remain, so the likelihood of overfitting grows; when the threshold is set lower, the error grows, because a lower threshold admits more background into the positive samples. How to set a reasonable IoU threshold that balances the samples while improving tracking and positioning accuracy is therefore a crucial issue in this task, yet existing offline tracking network designs ignore it.
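To make the trade-off concrete, the following is a small illustrative sketch (not taken from the patent itself) that computes the IoU between jittered candidate boxes and a ground-truth box and counts how many candidates survive different thresholds; the box coordinates and jitter magnitude are arbitrary assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

gt = np.array([50.0, 50.0, 150.0, 150.0])                  # hypothetical ground-truth box
rng = np.random.default_rng(0)
candidates = gt + rng.normal(0.0, 15.0, size=(1000, 4))    # jittered candidate boxes
ious = np.array([iou(c, gt) for c in candidates])
for thr in (0.5, 0.6, 0.7):
    # higher threshold -> fewer (but cleaner) positive samples
    print(f"IoU > {thr}: {int((ious > thr).sum())} positive samples")
```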
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention provides a multi-level regression target tracking method based on sample balance and a corresponding tracking system. In this target tracking method, the IoU threshold is raised gradually across multiple cascaded positioning stages, so that positioning precision improves progressively while the samples remain balanced; the method overcomes the drawback of existing methods, in which a single threshold makes it difficult to balance sample sampling against sample error.
The technical scheme is as follows: the invention discloses a multi-level regression target tracking method based on sample balance, which comprises the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of a reference image; according to R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the target-frame region in the reference image using a PrPool layer;
S2, extracting the shallow feature S1 and the deep feature S2 of a search image; obtaining an initial target frame in the search image, perturbing the initial target frame, and generating a plurality of candidate frames B0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image;
S3, according to S1 and S2, obtaining the shallow and deep features within each candidate frame of the search image using a PrPool layer; the shallow feature within the i-th candidate frame B0i is denoted b1i and the deep feature is denoted b2i;
multiplying a1 and b1i channel-wise, and multiplying a2 and b2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the first fused feature fi corresponding to candidate frame B0i;
S4, performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit;
S5, performing second-stage optimization on the candidate frames obtained from the first-stage optimization: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer;
multiplying a1 and b'1i channel-wise, and multiplying a2 and b'2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the second fused feature gi corresponding to the optimized candidate frame B1i; inputting the second fused feature gi into the second head network to obtain the second encoded fused feature g'i;
inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
S6, after the N candidate frames of the search image have passed through steps S4 and S5, obtaining a plurality of optimized candidate frames, selecting the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
Further, in step S2, an ATOM-based online classifier is used to obtain an initial target frame in the search image.
Further, the method also includes:
S7, taking the current search image as the reference image, taking the next frame of the video as the new search image, and re-executing steps S1 to S6 to realize target tracking in the video.
Further, the IoU threshold U1 of the first IoU prediction unit is 0.5, and the IoU threshold U2 of the second IoU prediction unit is 0.7.
Further, in steps S1 and S2, a shallow feature extractor composed of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence is used to extract the shallow features of the reference image and the search image.
Further, in steps S1 and S2, a deep feature extractor composed of Block2-Block4 of ResNet-50 and two convolutional layers connected in sequence is used to extract the deep features of the reference image and the search image.
Further, the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained as follows:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the real bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then performing first-stage optimization: inputting the first encoded fused feature output by the first head network into the first IoU prediction unit and a real-IoU calculation module in parallel; the real-IoU calculation module calculates the IoU value IoUgt1 between the candidate frame and the real bounding box of the target in the search image; if IoUgt1 > U1, inputting the candidate frame into the first bounding box regression unit for optimization to obtain the optimized candidate frame BB1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after the first-stage optimization of the N candidate frames in the search image;
then performing second-stage optimization: obtaining the second fused feature according to BB1n and inputting it into the second head network; inputting the second encoded fused feature output by the second head network into the second IoU prediction unit and the real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt2 between BB1n and the real bounding box of the target; if IoUgt2 > U2, inputting the candidate frame BB1n into the second bounding box regression unit for optimization to obtain the optimized candidate frame BB2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after the second-stage optimization of the N1 candidate frames from the first stage;
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU2n denotes the second predicted IoU value corresponding to the n-th candidate frame after the first-stage optimization of the search image;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
On the other hand, the invention also discloses a system for realizing the above multi-level regression target tracking method based on sample balance, which comprises:
a reference image shallow feature extractor 1 for extracting the shallow feature R of the reference image1
A reference image deep feature extractor 2 for extracting deep features R of the reference image2
Reference image superficial Prpool layer 3 for the image according to R1Obtaining shallow layer characteristic a in a target frame in a reference image1
Reference image deep Prpool layer 4 for the layer based on R2Obtaining deep layer characteristics a in a target frame in a reference image2
A candidate frame generating module 5, configured to obtain an initial target frame in the search image, and perturb the initial target frame in the search image to generate multiple candidate frames B0i
A search image shallow feature extractor 6 for extracting shallow features S of the search image1
A search image deep feature extractor 7 for extracting deep features S of the search image2
Search for image shallow Prpool layer 8 for the basis of S1Obtaining a search image candidate frame B0iInner shallow feature b1i
Search for image deep Prpool layer 9 for the basis of S 2Obtaining a search image candidate frame B0iCharacteristic of inner deep layer b2i
A first fused feature obtaining module 10 for obtaining a1And b1iMultiplying the channels by a2And b2iMultiplying channels; adjusting two multiplied results of the channels to be the same size and then cascading to obtain a candidate frame B0iCorresponding first fusion feature fi
A first optimization module 11, configured to perform a first-stage optimization on candidate frames in the search image: fusing the first fusion feature fiInputting the first code into the first head network to obtain a first code fusion characteristic fi'; will f isi' input the first IoU prediction Unit to get the candidate Box B0iFirst prediction IoU value ui(ii) a If u isi>U1For candidate frame B0iOptimizing by adopting a first bounding box regression unit to obtain an optimized candidate frame B1i;U1IoU threshold for the first IoU prediction unit;
a second optimization module 12 forAnd performing second-stage optimization on the candidate frame after the first-stage optimization: respectively according to S1And S2Obtaining an optimization candidate frame B in a search image by adopting a PrPool layer1iShallow layer feature b'1iAnd deep layer characteristic b'2i
A is to1And b'1iMultiplying the channels by a2And b'2iMultiplying the channels; adjusting two multiplied results of the channels to be the same size and then cascading to obtain an optimized candidate frame B 1iCorresponding second fusion feature gi(ii) a Merging the second fusion characteristic giInputting the second coded fusion feature g 'into a second head network to obtain a second coded fusion feature g'i
G'iInputting a second IoU prediction unit to obtain B1iSecond prediction IoU value vi(ii) a If v isi>U2To B, for1iOptimizing by adopting a second bounding box regression unit to obtain an optimized candidate frame B2i;U2IoU threshold of the unit is predicted for the second IoU, and U2>U1
And a final target frame obtaining module 13, configured to select M candidate frames with the largest second prediction IoU value from the multiple optimized candidate frames obtained by processing the N candidate frames in the search image through the first optimization module 11 and the second optimization module 12, and take the average of the M candidate frames as a final target frame of the search image.
Further, the target tracking system further comprises a loss function calculation module 14 for calculating a loss function value when training parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, IoU2n denotes the second predicted IoU value of the n-th candidate frame after the first-stage optimization of the search image, and IoUgt1 and IoUgt2 denote the true IoU values of the candidate frame in the first-stage and second-stage optimizations, respectively;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
Beneficial effects: the multi-level regression target tracking method and system based on sample balance design a multi-level regression network in which the IoU threshold applied to candidate frames is raised stage by stage across a cascade of two positioning stages. The first optimization stage sets a smaller IoU threshold to increase the number of positive samples (candidate frames whose IoU exceeds the threshold are labeled positive), thereby balancing the training samples. After the first positioning-regression stage, the quality of the candidate frames is improved, so the IoU threshold is raised in the second optimization stage; a large number of positive samples can still be retained, and the candidate frames undergo a further positioning regression, which improves the regression accuracy. In summary, by setting different IoU thresholds at different stages, the invention alleviates the sample balance problem and improves positioning precision through stage-by-stage localization.
Drawings
FIG. 1 is a flow chart of a multi-level regression target tracking method based on sample balancing according to the present disclosure;
FIG. 2 is a schematic diagram of a multi-level regression target tracking system based on sample balancing;
FIG. 3 is a schematic diagram of the components of a two-stage optimization module;
FIG. 4 is a process flow diagram of a two-stage optimization phase in the training process.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
The invention discloses a sample balance-based multi-level regression target tracking method, a flow chart of which is shown in figure 1, and figure 2 is a composition schematic diagram of a tracking system for realizing the target tracking method. The target tracking method comprises the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of the reference image; according to R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the target-frame region in the reference image using a PrPool layer.
S2, extracting the shallow feature S1 and the deep feature S2 of the search image; obtaining an initial target frame in the search image, perturbing it, and generating a plurality of candidate frames B0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image.
In step S2, an ATOM-based online classifier is used to obtain the initial target frame in the search image; the online classifier is described in detail in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669. This classifier obtains the approximate location of the target. The candidate frame generation module 5 then perturbs the initial target frame in the search image to generate a plurality of candidate frames B0i.
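As an illustration of this perturbation step, the sketch below jitters an initial (cx, cy, w, h) box into N candidates by adding Gaussian noise to its centre and log-scale; the noise magnitudes and the box parameterization are assumptions, not values stated in the patent.

```python
import numpy as np

def generate_candidates(init_box, n=64, centre_sigma=0.1, scale_sigma=0.2, seed=0):
    """Perturb an initial (cx, cy, w, h) target box into n candidate boxes."""
    rng = np.random.default_rng(seed)
    cx, cy, w, h = init_box
    boxes = np.empty((n, 4))
    boxes[:, 0] = cx + rng.normal(0.0, centre_sigma * w, n)      # jitter centre x
    boxes[:, 1] = cy + rng.normal(0.0, centre_sigma * h, n)      # jitter centre y
    boxes[:, 2] = w * np.exp(rng.normal(0.0, scale_sigma, n))    # jitter width
    boxes[:, 3] = h * np.exp(rng.normal(0.0, scale_sigma, n))    # jitter height
    return boxes

candidates = generate_candidates((100.0, 80.0, 60.0, 40.0))      # candidate frames B0i, i = 1..64
```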
In steps S1 and S2 of this embodiment, a shallow extractor and a deep extractor with shared parameters are used to obtain the two-scale backbone features of the reference image and the search image. Specifically, the reference image shallow feature extractor 1 and the search image shallow feature extractor 6 both adopt a shallow feature extractor formed by sequentially connecting the initial convolutional layer of ResNet-50, Block1, and two convolutional layers; the reference image deep feature extractor 2 and the search image deep feature extractor 7 both adopt a deep feature extractor formed by sequentially connecting Block2-Block4 of ResNet-50 and two convolutional layers. ResNet-50 is described in: [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
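A minimal PyTorch sketch of the two extractors described above is given below; it assumes the deep branch consumes the output of the shallow branch, and the channel widths of the two extra convolutions are illustrative choices rather than values stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()   # weights omitted in this sketch

# Shallow extractor: initial convolutional layer (stem) + Block1 + two extra convs.
shallow_extractor = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1,
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)

# Deep extractor: Block2-Block4 + two extra convs.
deep_extractor = nn.Sequential(
    backbone.layer2, backbone.layer3, backbone.layer4,
    nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)

x = torch.randn(1, 3, 288, 288)          # reference or search image
s1 = shallow_extractor(x)                # shallow feature (R1 or S1), stride 4
s2 = deep_extractor(s1)                  # deep feature (R2 or S2), stride 32
```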
S3, according to S1 and S2, obtaining the shallow and deep features within each candidate frame of the search image using a PrPool layer; the shallow feature within the i-th candidate frame B0i is denoted b1i and the deep feature is denoted b2i.
The PrPool layers used in the reference image shallow PrPool layer 3, the reference image deep PrPool layer 4, the search image shallow PrPool layer 8, and the search image deep PrPool layer 9 are described in detail in document [6] (Danelljan M, Bhat G, Khan F S, et al.).
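PrPool (precise RoI pooling) has no implementation in the standard torchvision package, so the sketch below substitutes torchvision's roi_align purely to illustrate the ROI-feature-extraction step; the output size, feature stride, and box values are assumptions.

```python
import torch
from torchvision.ops import roi_align

s1 = torch.randn(1, 256, 72, 72)                      # shallow feature map of a 288x288 image
candidate_boxes = torch.tensor([[10.0, 15.0, 120.0, 140.0],
                                [30.0, 40.0, 150.0, 160.0]])   # (x1, y1, x2, y2) in image coords
rois = torch.cat([torch.zeros(len(candidate_boxes), 1), candidate_boxes], dim=1)  # prepend batch index
b1 = roi_align(s1, rois, output_size=(8, 8), spatial_scale=72 / 288)   # -> (2, 256, 8, 8)
```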
The shallow feature a1 of the target-frame region in the reference image is channel-wise multiplied with the candidate frame's shallow feature b1i, and the deep feature a2 of the target-frame region in the reference image is channel-wise multiplied with the candidate frame's deep feature b2i; the two channel-multiplied results are adjusted to the same size and then concatenated to obtain the first fused feature fi corresponding to candidate frame B0i. This functionality is implemented by the first fused feature acquisition module 10; in FIG. 2, one symbol (shown as an image in the original filing) denotes channel-wise multiplication, another denotes the size adjustment, and a third denotes concatenation.
S4, performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit.
At this stage the IoU threshold U1 is set to 0.5. That is, candidate positions whose first predicted IoU value exceeds 0.5 are regarded as preliminarily screened candidate frames, and the first bounding box regression unit is used to optimize these preliminarily screened candidate frames B0i. Since the IoU threshold at this stage is low, more screening results are retained; the number of candidate frames remaining after the first-stage optimization of the N candidate frames in the search image is less than N. The candidate frame obtained by this optimization is denoted B1i.
S5, performing second-stage optimization on the candidate frames obtained from the first-stage optimization: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer;
multiplying a1 and b'1i channel-wise, and multiplying a2 and b'2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the second fused feature gi corresponding to the optimized candidate frame B1i; inputting the second fused feature gi into the second head network to obtain the second encoded fused feature g'i;
inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1.
At this stage the IoU threshold U2 is set to 0.7. The higher threshold means that the screened candidate frames are of higher quality, and the candidate positions B2i obtained by optimizing them are correspondingly of higher quality, so the quality of the candidate frames improves step by step.
The first head network and the second head network are both small networks located after the backbone network. In the invention, each of them comprises several sequentially cascaded convolutional layers followed by a fully connected layer, and outputs a feature of fixed size and dimension.
S6, after the N candidate frames of the search image have passed through steps S4 and S5, a plurality of optimized candidate frames are obtained; the M candidate frames with the largest second predicted IoU values are selected and averaged to serve as the final target frame of the search image.
the steps S4 and S5 are performed by the first optimization module 11 and the second optimization module 12, respectively, and the structures thereof are shown in (a) and (b) of fig. 3. The final target frame obtaining module 13 selects M candidate frames with the largest second prediction IoU value, and averages the M candidate frames to obtain a final target frame of the search image.
S7, when tracking the target through the video, the current search image is taken as the reference image, the next frame of the video is taken as the new search image, and steps S1 to S6 are executed again, so that target tracking in the video is realized.
In this embodiment, the structures of the first IoU prediction unit and the second IoU prediction unit are similar to the IoU predictor in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669.
The parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained using the following steps:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the real bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then adopting a first-stage optimization process similar to S4: inputting the first encoded fused feature output by the first head network into the first IoU prediction unit and a real-IoU calculation module in parallel; the real-IoU calculation module at this stage calculates the IoU value IoUgt1 between the candidate frame and the real bounding box of the target in the search image; if IoUgt1 > U1, inputting the candidate frame into the first bounding box regression unit for optimization to obtain the optimized candidate frame BB1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after the first-stage optimization of the N candidate frames in the search image;
A second-stage optimization process similar to S5 is then performed: obtaining the second fused feature according to BB1n and inputting it into the second head network; inputting the second encoded fused feature output by the second head network into the second IoU prediction unit and the real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt2 between BB1n and the real bounding box of the target; if IoUgt2 > U2, inputting the candidate frame BB1n into the second bounding box regression unit for optimization to obtain the optimized candidate frame BB2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after the second-stage optimization of the N1 candidate frames from the first stage;
the processing flows of the first-stage optimization and the second-stage optimization during training are shown as (a) and (b) in fig. 4, respectively. It differs from S4 and S5 in that: during training, a first coding fusion characteristic output by the first head network is input into a first IoU prediction unit and a real IoU calculation module in parallel, and a second coding fusion characteristic output by the second head network is input into a second IoU prediction unit and a real IoU calculation module in parallel; and judging whether the first bounding box regression unit and the second bounding box regression unit are input for regression optimization according to the candidate box real IoU value calculated by the real IoU calculation module. That is, the first IoU predictor and the second IoU predictor are trained to learn the IoU scores of the prediction candidate frames, and the predicted IoU is used in a case of testing (at this time, the candidate frame IoU cannot be acquired) by making the output results of the first IoU predictor and the second IoU predictor as close as possible to the true IoU value of the candidate frame calculated by the true IoU calculation module. A smaller IoU threshold is adopted in the first-stage optimization, more positive samples can be obtained, and the first bounding box regression unit optimizes the positive samples, so that the balance of training samples is guaranteed; the larger IoU threshold is used in the second stage optimization, which again improves the quality of the positive samples, enabling the second bounding box regression unit to obtain higher quality candidate boxes.
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU2n denotes the second predicted IoU value corresponding to the n-th candidate frame after the first-stage optimization of the search image;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
In the last term of the loss function, the weight of the second-stage regression error and the average optimization error of the first bounding box regression unit (both shown as images in the original filing) are inversely proportional. Thus, early in training, the first-stage optimization has a large influence on the loss function; once the first-stage bounding box regression unit is well trained, its optimization error becomes smaller and smaller, and the influence of the second-stage optimization on the loss function gradually increases. As training proceeds, the quality of the candidate positions keeps improving and the number of positive samples keeps growing, so the weight of the second stage increases while the samples remain balanced. By cascading multiple regression networks, the IoU threshold is raised step by step across the target-localization stages: the smaller IoU threshold in the earlier stage increases the number of positive samples and thereby balances the training samples, and the first positioning regression improves the quality of the candidate frames. The IoU threshold can therefore be raised in the second stage without changing the number of positive samples too much, and the candidate frames undergo a further positioning regression, which improves the regression accuracy. In this way the positioning precision is improved step by step while the samples stay balanced.
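Since the exact formula appears only as images in the original filing, the sketch below encodes one plausible reading of this weighting as an assumption: the second-stage regression loss is scaled by a factor inversely proportional to the running mean of the first-stage regression error, so its influence grows as the first stage converges.

```python
class SampleBalancedLoss:
    """Hedged sketch of the epoch-dependent loss weighting; not the patent's exact formula."""
    def __init__(self, eps=1e-6):
        self.stage1_errors = []       # first-stage regression errors recorded per epoch
        self.eps = eps

    def record_epoch(self, stage1_reg_error):
        self.stage1_errors.append(float(stage1_reg_error))

    def total(self, l_iou1, l_iou2, l_reg1, l_reg2):
        mean_err = sum(self.stage1_errors) / max(len(self.stage1_errors), 1)
        weight = 1.0 / (mean_err + self.eps)     # grows as the first stage trains well
        return l_iou1 + l_iou2 + l_reg1 + weight * l_reg2
```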

Claims (10)

1. A multi-level regression target tracking method based on sample balance, characterized by comprising the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of a reference image; according to R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the target-frame region in the reference image using a PrPool layer;
S2, extracting the shallow feature S1 and the deep feature S2 of a search image; obtaining an initial target frame in the search image, perturbing the initial target frame, and generating a plurality of candidate frames B0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image;
S3, according to S1 and S2, obtaining the shallow and deep features within each candidate frame of the search image using a PrPool layer; the shallow feature within the i-th candidate frame B0i is denoted b1i and the deep feature is denoted b2i;
multiplying a1 and b1i channel-wise, and multiplying a2 and b2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the first fused feature fi corresponding to candidate frame B0i;
S4, performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit;
S5, performing second-stage optimization on the candidate frames obtained from the first-stage optimization: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer;
multiplying a1 and b'1i channel-wise, and multiplying a2 and b'2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the second fused feature gi corresponding to the optimized candidate frame B1i; inputting the second fused feature gi into the second head network to obtain the second encoded fused feature g'i;
inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
S6, after the N candidate frames of the search image have passed through steps S4 and S5, obtaining a plurality of optimized candidate frames, selecting the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
2. The multi-level regression target tracking method according to claim 1, wherein in step S2 an ATOM-based online classifier is used to obtain the initial target frame in the search image.
3. The multi-level regression target tracking method according to claim 1, further comprising:
S7, taking the current search image as the reference image, taking the next frame of the video as the new search image, and re-executing steps S1 to S6, so that target tracking in the video is realized.
4. The multi-level regression target tracking method according to claim 1, wherein the IoU threshold U1 of the first IoU prediction unit is 0.5 and the IoU threshold U2 of the second IoU prediction unit is 0.7.
5. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2 a shallow feature extractor consisting of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence is used to extract the shallow features of the reference image and the search image.
6. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2 a deep feature extractor consisting of Block2-Block4 of ResNet-50 and two convolutional layers connected in sequence is used to extract the deep features of the reference image and the search image.
7. The multi-level regression target tracking method according to claim 1, wherein the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained by:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the real bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then performing first-stage optimization: inputting the first encoded fused feature output by the first head network into the first IoU prediction unit and a real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt1 between the candidate frame and the real bounding box of the target in the search image; if IoUgt1 > U1, inputting the candidate frame into the first bounding box regression unit for optimization to obtain the optimized candidate frame BB1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after the first-stage optimization of the N candidate frames in the search image;
then performing second-stage optimization: obtaining the second fused feature according to BB1n and inputting it into the second head network; inputting the second encoded fused feature output by the second head network into the second IoU prediction unit and the real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt2 between BB1n and the real bounding box of the target; if IoUgt2 > U2, inputting the candidate frame BB1n into the second bounding box regression unit for optimization to obtain the optimized candidate frame BB2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after the second-stage optimization of the N1 candidate frames from the first stage;
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU2n denotes the second predicted IoU value corresponding to the n-th candidate frame after the first-stage optimization of the search image;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
8. A multi-level regression target tracking system based on sample balance, comprising:
a reference image shallow feature extractor (1) for extracting the shallow feature R1 of a reference image;
a reference image deep feature extractor (2) for extracting the deep feature R2 of the reference image;
a reference image shallow PrPool layer (3) for obtaining, according to R1, the shallow feature a1 within the target frame of the reference image;
a reference image deep PrPool layer (4) for obtaining, according to R2, the deep feature a2 within the target frame of the reference image;
a candidate frame generation module (5) for obtaining an initial target frame in a search image and perturbing it to generate a plurality of candidate frames B0i;
a search image shallow feature extractor (6) for extracting the shallow feature S1 of the search image;
a search image deep feature extractor (7) for extracting the deep feature S2 of the search image;
a search image shallow PrPool layer (8) for obtaining, according to S1, the shallow feature b1i within candidate frame B0i of the search image;
a search image deep PrPool layer (9) for obtaining, according to S2, the deep feature b2i within candidate frame B0i of the search image;
a first fused feature acquisition module (10) for multiplying a1 and b1i channel-wise and multiplying a2 and b2i channel-wise, adjusting the two channel-multiplied results to the same size, and concatenating them to obtain the first fused feature fi corresponding to candidate frame B0i;
a first optimization module (11) for performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit;
a second optimization module (12) for performing second-stage optimization on the candidate frames obtained from the first stage: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer; multiplying a1 and b'1i channel-wise and multiplying a2 and b'2i channel-wise, adjusting the two channel-multiplied results to the same size, and concatenating them to obtain the second fused feature gi corresponding to B1i; inputting gi into the second head network to obtain the second encoded fused feature g'i; inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
and a final target frame acquisition module (13) for selecting, from the optimized candidate frames obtained by processing the N candidate frames of the search image through the first optimization module (11) and the second optimization module (12), the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
9. The multi-level regression target tracking system according to claim 8, wherein the reference image shallow feature extractor (1) and the search image shallow feature extractor (6) are each composed of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence.
10. The multi-level regression target tracking system according to claim 8, further comprising a loss function calculation module (14) for calculating the loss function value when training the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, IoU2n denotes the second predicted IoU value of the n-th candidate frame after the first-stage optimization of the search image, and IoUgt1 and IoUgt2 denote the true IoU values of the candidate frame in the first-stage and second-stage optimizations, respectively;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
CN202210394687.6A 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system Active CN114757970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394687.6A CN114757970B (en) 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system


Publications (2)

Publication Number Publication Date
CN114757970A true CN114757970A (en) 2022-07-15
CN114757970B CN114757970B (en) 2024-03-08

Family

ID=82330152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394687.6A Active CN114757970B (en) 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system

Country Status (1)

Country Link
CN (1) CN114757970B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer
WO2021208502A1 (en) * 2020-04-16 2021-10-21 中国科学院深圳先进技术研究院 Remote-sensing image target detection method based on smooth bounding box regression function
CN112215080A (en) * 2020-09-16 2021-01-12 电子科技大学 Target tracking method using time sequence information
CN112215079A (en) * 2020-09-16 2021-01-12 电子科技大学 Global multistage target tracking method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiong Changzhen; Li Yan: "A survey of tracking algorithms based on Siamese networks" (基于孪生网络的跟踪算法综述), Industrial Control Computer (工业控制计算机), no. 03, 25 March 2020 (2020-03-25) *
Shi Guoqiang; Zhao Xia: "Target tracking algorithm based on a strongly coupled Siamese region proposal network with joint optimization" (基于联合优化的强耦合孪生区域推荐网络的目标跟踪算法), Journal of Computer Applications (计算机应用), no. 10, 10 October 2020 (2020-10-10) *

Also Published As

Publication number Publication date
CN114757970B (en) 2024-03-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant