CN114757970A - Multi-level regression target tracking method and system based on sample balance - Google Patents
- Publication number: CN114757970A (application CN202210394687.6A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06T7/248 — Analysis of motion using feature-based methods (e.g. tracking of corners or segments) involving reference images or patches
- G06F18/24 — Pattern recognition; classification techniques
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06T2207/10016 — Video; image sequence
- G06T2207/20021 — Dividing image into blocks, subimages or windows
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The invention discloses a sample-balance-based multi-level regression target tracking method and system. Fusion features between the candidate boxes in a search image and the target box in a reference image are acquired, and the candidate boxes are refined through several cascaded optimization stages whose IoU thresholds are raised stage by stage, so that localization precision improves progressively while the training samples remain balanced. The method overcomes the difficulty, inherent in existing single-threshold methods, of balancing sample quantity against sample error.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a multi-level regression target tracking method and a multi-level regression target tracking system based on sample balance.
Background
Given the location of an object of interest in the first frame of a video, visual object tracking aims to continuously locate that object in subsequent frames. Because of its practical value in areas such as security systems, the task has attracted wide attention in the field of computer vision. Although deep learning techniques have been applied to it with significant success, the task remains challenging owing to factors such as shape changes, scale changes, object occlusion, and background clutter.
Most existing deep-learning-based target trackers adopt a two-stream Siamese structure for the offline network and regress candidate positions by combining the appearance information of a given template with that of the candidate positions. See the following documents:
[1] Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8971-8980.
[2] Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4282-4291.
[3] Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 101-117.
[4] He A, Luo C, Tian X, et al. Towards a better match in siamese network based visual object tracker[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[5] Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4591-4600.
In the target tracking task, the most commonly used regression operation is bounding box regression, which learns the offset between each candidate position and the target's true position and uses it to correct the candidates toward the ground truth. It suffers, however, from a sample-imbalance problem: only candidate boxes whose Intersection over Union (IoU) with the ground truth exceeds a set threshold are treated as positive samples and regressed. When the IoU threshold is set high, few positive samples remain and overfitting becomes more likely; when it is set low, the positive samples contain more background and thus more error. Setting a reasonable IoU threshold that balances the samples while improving tracking and localization accuracy is therefore a crucial issue in this task, yet existing offline tracking network designs ignore it.
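The threshold trade-off can be made concrete with a small, self-contained sketch (plain Python; the box coordinates are illustrative, not from the patent): a higher IoU threshold shrinks the positive-sample pool.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 50, 50)
candidates = [(12, 12, 52, 52), (5, 5, 55, 55), (30, 30, 70, 70)]
for thr in (0.5, 0.7):
    positives = [c for c in candidates if iou(c, gt) > thr]
    print(f"threshold {thr}: {len(positives)} positive sample(s)")
```

With these boxes, raising the threshold from 0.5 to 0.7 halves the positive set, illustrating why a single fixed threshold must trade sample quantity against sample purity.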
Disclosure of Invention
The invention aims to address the problems of the prior art by providing a sample-balance-based multi-level regression target tracking method and a corresponding tracking system. In this target tracking method, the IoU threshold is raised stage by stage across several cascaded localization stages, so that localization precision improves progressively while the samples remain balanced; this overcomes the difficulty, in existing single-threshold methods, of balancing sample quantity against sample error.
The technical scheme is as follows: the invention discloses a multi-level regression target tracking method based on sample balance, which comprises the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of the reference image; from R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the region inside the target box of the reference image using a PrPool layer;
S2, extracting the shallow feature S1 and the deep feature S2 of the search image; obtaining an initial target box in the search image and perturbing it to generate multiple candidate boxes B0i, i = 1, 2, …, N, where N is the number of candidate boxes in the search image;
S3, from S1 and S2, obtaining the shallow and deep features inside each candidate box of the search image using a PrPool layer; the shallow feature inside the i-th candidate box B0i is denoted b1i and the deep feature b2i;
Multiplying a1 and b1i channel-wise, and a2 and b2i channel-wise; resizing the two channel-multiplied results to the same size and concatenating them to obtain the first fusion feature fi corresponding to candidate box B0i;
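A minimal sketch of this fusion step follows (NumPy; the feature shapes and the nearest-neighbour resize are illustrative assumptions — the patent does not specify the PrPool output sizes or the resize operator):

```python
import numpy as np

def fuse(a1, b1, a2, b2):
    """Channel-wise multiply the shallow pair and the deep pair of (C, H, W)
    feature maps, resize the deep product to the shallow product's spatial
    size, then concatenate along the channel axis."""
    shallow = a1 * b1          # channel-wise (elementwise) product
    deep = a2 * b2
    # nearest-neighbour resize of the deep product to the shallow spatial size
    h, w = shallow.shape[1:]
    ys = np.arange(h) * deep.shape[1] // h
    xs = np.arange(w) * deep.shape[2] // w
    deep_resized = deep[:, ys][:, :, xs]
    return np.concatenate([shallow, deep_resized], axis=0)

a1 = np.random.rand(64, 8, 8); b1 = np.random.rand(64, 8, 8)
a2 = np.random.rand(128, 4, 4); b2 = np.random.rand(128, 4, 4)
f = fuse(a1, b1, a2, b2)
print(f.shape)  # (192, 8, 8)
```

The fused tensor carries both scales of template/candidate interaction, which is what the head networks consume in the following stages.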
S4, performing first-stage optimization on the candidate boxes in the search image: inputting the first fusion feature fi into the first head network to obtain the first encoded fusion feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate box B0i; if ui > U1, optimizing candidate box B0i with the first bounding box regression unit to obtain the optimized candidate box B1i, where U1 is the IoU threshold of the first IoU prediction unit;
S5, performing second-stage optimization on the candidate boxes produced by the first stage: from S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate box B1i in the search image using a PrPool layer;
Multiplying a1 and b'1i channel-wise, and a2 and b'2i channel-wise; resizing the two channel-multiplied results to the same size and concatenating them to obtain the second fusion feature gi corresponding to the optimized candidate box B1i; inputting gi into the second head network to obtain the second encoded fusion feature g'i;
Inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate box B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
S6, after the N candidate boxes in the search image have passed through steps S4 and S5, selecting, among the resulting optimized candidate boxes, the M boxes with the largest second predicted IoU values and averaging them to obtain the final target box of the search image.
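The two-stage cascade of S4–S6 can be sketched as follows (plain Python; the `score` and `refine` callables are oracle stand-ins for the learned IoU prediction units and bounding box regression units, which the sketch does not implement):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def cascade(boxes, predict_iou, refine, thresholds=(0.5, 0.7), top_m=2):
    """Each stage keeps the boxes whose score exceeds its (rising) threshold
    and regresses the survivors; the top-M boxes of the final stage are
    averaged into one output box (steps S4-S6)."""
    scores = []
    for thr in thresholds:
        kept = [b for b in boxes if predict_iou(b) > thr]
        boxes = [refine(b) for b in kept]
        scores = [predict_iou(b) for b in boxes]
    top = sorted(zip(scores, boxes), reverse=True)[:top_m]
    return tuple(sum(b[k] for _, b in top) / len(top) for k in range(4))

gt = (10, 10, 50, 50)
score = lambda b: iou(b, gt)                                   # stand-in predictor
refine = lambda b: tuple((x + g) / 2 for x, g in zip(b, gt))   # stand-in regressor
result = cascade([(12, 12, 52, 52), (5, 5, 55, 55), (30, 30, 70, 70)], score, refine)
print(result)
```

Because refinement raises each survivor's IoU, the second, stricter threshold still retains enough positives — the core idea behind raising U1 to U2 between stages.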
Further, in step S2, an ATOM-based online classifier is used to obtain the initial target box in the search image.
Further, the method also includes:
S7, taking the current search image as the new reference image and the next video frame as the new search image, and re-executing steps S1 to S6 to achieve target tracking throughout the video.
Further, the IoU threshold U1 of the first IoU prediction unit is 0.5, and the IoU threshold U2 of the second IoU prediction unit is 0.7.
Further, in steps S1 and S2, a shallow feature extractor consisting of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence is used to extract the shallow features of the reference image and the search image.
Further, in steps S1 and S2, a deep feature extractor consisting of Block2–Block4 of ResNet-50 and two convolutional layers connected in sequence is used to extract the deep features of the reference image and the search image.
Further, the parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained as follows:
S11, constructing a sample set, where each sample comprises: a reference image, a search image, the target box in the reference image, and the ground-truth bounding box of the target in the search image;
S12, processing the reference image and the search image of each sample according to steps S1 to S3, then performing the first-stage optimization: the first encoded fusion feature output by the first head network is fed in parallel into the first IoU prediction unit and a true-IoU calculation module; the true-IoU calculation module computes the IoU value IoUgt1 between each candidate box and the ground-truth bounding box of the target in the search image; if IoUgt1 > U1, the candidate box is fed into the first bounding box regression unit for optimization, yielding the optimized candidate box BB1n, n = 1, 2, …, N1, where N1 is the number of candidate boxes remaining after the first-stage optimization of the N candidate boxes in the search image;
Then performing the second-stage optimization: the second fusion feature is obtained from BB1n and fed into the second head network; the second encoded fusion feature output by the second head network is fed in parallel into the second IoU prediction unit and the true-IoU calculation module, which at this stage computes the IoU value IoUgt2 between BB1n and the ground-truth bounding box of the target; if IoUgt2 > U2, candidate box BB1n is fed into the second bounding box regression unit for optimization, yielding the optimized candidate box BB2m, m = 1, 2, …, N2, where N2 is the number of candidate boxes remaining after the second-stage optimization of the N1 first-stage boxes;
S13, optimizing the parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function.
Here t denotes the current training generation (epoch). The IoU loss of the first stage and that of the second stage at generation t-1 are computed from the predicted and true IoU values, where IoU1i denotes the first predicted IoU value of the i-th candidate box of the search image in the sample, and IoU2n the second predicted IoU value of the n-th candidate box after the first-stage optimization of the search image.
The optimization errors of the first and of the second bounding box regression unit at generation t-1 are computed from the regressed boxes, where BB1n denotes the n-th candidate box of the search image after first-stage optimization, BB2m the m-th candidate box after second-stage optimization, and BBgt the ground-truth bounding box of the target in the search image; the mean of the first bounding box regression unit's optimization errors over training generations 1 to t-1 is also used.
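The patent's loss formulas appear only as figures, but the surrounding description (per-stage IoU-prediction losses plus per-stage regression errors, with a running mean of past first-stage errors used for balancing) suggests a form like the following sketch. Every concrete choice here — squared IoU error, absolute coordinate error, and the ratio-based balancing weight — is an assumption, not the patent's exact definition:

```python
def iou_loss(pred_ious, true_ious):
    """Mean squared error between predicted and true IoU values (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred_ious, true_ious)) / len(pred_ious)

def reg_loss(boxes, gt):
    """Mean absolute coordinate error of regressed boxes vs. ground truth (assumed form)."""
    return sum(sum(abs(x - g) for x, g in zip(b, gt)) for b in boxes) / len(boxes)

def total_loss(l_iou1, l_iou2, l_reg1, l_reg2, reg1_history):
    """Combine the four stage losses; the second-stage regression term is
    scaled by the current first-stage error relative to the running mean of
    past first-stage errors (balancing scheme assumed, not from the patent)."""
    mean_reg1 = sum(reg1_history) / len(reg1_history) if reg1_history else 1.0
    w = l_reg1 / mean_reg1 if mean_reg1 else 1.0
    return l_iou1 + l_iou2 + l_reg1 + w * l_reg2
```

The point of the sketch is only the structure: two IoU-prediction terms, two regression terms, and a history-dependent weight, matching the quantities named in S13.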
On the other hand, the invention also discloses a system implementing the above sample-balance-based multi-level regression target tracking method, comprising:
a reference image shallow feature extractor 1 for extracting the shallow feature R1 of the reference image;
a reference image deep feature extractor 2 for extracting the deep feature R2 of the reference image;
a reference image shallow PrPool layer 3 for obtaining, from R1, the shallow feature a1 inside the target box of the reference image;
a reference image deep PrPool layer 4 for obtaining, from R2, the deep feature a2 inside the target box of the reference image;
a candidate box generation module 5 for obtaining the initial target box in the search image and perturbing it to generate multiple candidate boxes B0i;
a search image shallow feature extractor 6 for extracting the shallow feature S1 of the search image;
a search image deep feature extractor 7 for extracting the deep feature S2 of the search image;
a search image shallow PrPool layer 8 for obtaining, from S1, the shallow feature b1i inside candidate box B0i of the search image;
a search image deep PrPool layer 9 for obtaining, from S2, the deep feature b2i inside candidate box B0i of the search image;
a first fusion feature acquisition module 10 for multiplying a1 and b1i channel-wise and a2 and b2i channel-wise, resizing the two channel-multiplied results to the same size, and concatenating them to obtain the first fusion feature fi corresponding to candidate box B0i;
a first optimization module 11 for performing the first-stage optimization of the candidate boxes in the search image: inputting the first fusion feature fi into the first head network to obtain the first encoded fusion feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate box B0i; if ui > U1, optimizing candidate box B0i with the first bounding box regression unit to obtain the optimized candidate box B1i, where U1 is the IoU threshold of the first IoU prediction unit;
a second optimization module 12 for performing the second-stage optimization of the first-stage boxes: from S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate box B1i in the search image using a PrPool layer; multiplying a1 and b'1i channel-wise, and a2 and b'2i channel-wise; resizing the two channel-multiplied results to the same size and concatenating them to obtain the second fusion feature gi corresponding to the optimized candidate box B1i; inputting gi into the second head network to obtain the second encoded fusion feature g'i; inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate box B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
and a final target box acquisition module 13 for selecting, among the optimized candidate boxes obtained by passing the N candidate boxes of the search image through the first optimization module 11 and the second optimization module 12, the M boxes with the largest second predicted IoU values, and averaging them as the final target box of the search image.
Further, the target tracking system also comprises a loss function calculation module 14 for calculating the loss value when training the parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit.
Here t denotes the current training generation; the first-stage and second-stage IoU losses at generation t-1 are computed from the predicted and true IoU values, where IoU1i denotes the first predicted IoU value of the i-th candidate box of the search image in the sample, IoU2n the second predicted IoU value of the n-th candidate box after first-stage optimization, and IoUgt1 and IoUgt2 the true IoU values of the candidate boxes in the first-stage and second-stage optimization, respectively.
The optimization errors of the first and of the second bounding box regression unit at generation t-1 are computed from the regressed boxes, where BB1n denotes the n-th candidate box of the search image after first-stage optimization, BB2m the m-th candidate box after second-stage optimization, and BBgt the ground-truth bounding box of the target in the search image; the mean of the first bounding box regression unit's optimization errors over training generations 1 to t-1 is also used.
Beneficial effects: the sample-balance-based multi-level regression target tracking method and system design a multi-level regression network whose cascaded two-stage localization raises the candidate boxes' IoU threshold stage by stage. The first optimization stage sets a lower IoU threshold to increase the number of positive samples (candidate boxes whose IoU exceeds the threshold are labeled positive), achieving a balance of training samples. After the first localization-regression stage, the quality of the candidate boxes has improved; the second stage therefore raises the IoU threshold while still retaining a large number of positive samples and performs a further localization regression, improving regression accuracy. In summary, by setting different IoU thresholds at different stages, the invention alleviates the sample-balance problem and improves localization precision through stage-by-stage localization.
Drawings
FIG. 1 is a flow chart of a multi-level regression target tracking method based on sample balancing according to the present disclosure;
FIG. 2 is a schematic diagram of a multi-level regression target tracking system based on sample balancing;
FIG. 3 is a schematic diagram of the components of a two-stage optimization module;
FIG. 4 is a process flow diagram of a two-stage optimization phase in the training process.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
The invention discloses a sample-balance-based multi-level regression target tracking method, whose flow chart is shown in FIG. 1; FIG. 2 is a schematic diagram of a tracking system implementing the method. The target tracking method comprises the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of the reference image; from R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the region inside the target box of the reference image using a PrPool layer;
S2, extracting the shallow feature S1 and the deep feature S2 of the search image; obtaining an initial target box in the search image and perturbing it to generate multiple candidate boxes B0i, i = 1, 2, …, N, where N is the number of candidate boxes in the search image;
In step S2, an ATOM-based online classifier is used to obtain the initial target box in the search image; the online classifier is described in: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669. This classifier yields the approximate location of the target. The candidate box generation module 5 then perturbs the initial target box in the search image to generate multiple candidate boxes B0i.
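A minimal sketch of the perturbation step (plain Python using the standard `random` module; the Gaussian jitter magnitudes are illustrative assumptions — the patent does not specify the perturbation distribution):

```python
import random

def perturb(box, n, pos_std=5.0, size_std=0.1, seed=0):
    """Generate n candidate boxes from an initial (x1, y1, x2, y2) box by
    Gaussian jitter of the centre and Gaussian scaling of width/height."""
    rng = random.Random(seed)
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    out = []
    for _ in range(n):
        ncx = cx + rng.gauss(0, pos_std)
        ncy = cy + rng.gauss(0, pos_std)
        nw = w * max(0.1, 1 + rng.gauss(0, size_std))   # keep sizes positive
        nh = h * max(0.1, 1 + rng.gauss(0, size_std))
        out.append((ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2))
    return out

candidates = perturb((10, 10, 50, 50), n=8)
print(len(candidates))  # 8
```

The resulting boxes play the role of B0i and are what the two optimization stages score and refine.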
In steps S1 and S2 of this embodiment, shallow and deep extractors with shared parameters are used to obtain features at two scales from the reference image and the search image. Specifically, the reference image shallow feature extractor 1 and the search image shallow feature extractor 6 both consist of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence; the reference image deep feature extractor 2 and the search image deep feature extractor 7 both consist of Block2–Block4 of ResNet-50 and two convolutional layers connected in sequence. For ResNet-50 see: [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
S3, from S1 and S2, obtaining the shallow and deep features inside each candidate box of the search image using a PrPool layer; the shallow feature inside the i-th candidate box B0i is denoted b1i and the deep feature b2i;
The PrPool layers in the reference image shallow PrPool layer 3, the reference image deep PrPool layer 4, the search image shallow PrPool layer 8, and the search image deep PrPool layer 9 are described in detail in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669.
The shallow feature a1 of the region inside the target box of the reference image is channel-wise multiplied with the candidate box's shallow feature b1i, and the deep feature a2 of the region inside the target box with the candidate box's deep feature b2i; the two channel-multiplied results are resized to the same size and concatenated to obtain the first fusion feature fi corresponding to candidate box B0i. This functionality is implemented by the first fusion feature acquisition module 10, as shown in FIG. 2, whose symbols denote channel-wise multiplication, resizing, and concatenation, respectively.
S4, performing first-stage optimization on the candidate boxes in the search image: inputting the first fusion feature fi into the first head network to obtain the first encoded fusion feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate box B0i; if ui > U1, optimizing candidate box B0i with the first bounding box regression unit to obtain the optimized candidate box B1i, where U1 is the IoU threshold of the first IoU prediction unit;
At this stage the IoU threshold U1 is set to 0.5; that is, candidate positions whose first predicted IoU value exceeds 0.5 are taken as the preliminarily screened candidate boxes, and the first bounding box regression unit optimizes these preliminarily screened boxes. Since the IoU threshold at this stage is low, more boxes survive the screening; the number of candidate boxes remaining after the first-stage optimization of the N candidate boxes in the search image is less than N. The boxes obtained by this optimization are denoted B1i.
S5, performing second-stage optimization on the candidate boxes produced by the first stage: from S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate box B1i in the search image using a PrPool layer;
Multiplying a1 and b'1i channel-wise, and a2 and b'2i channel-wise; resizing the two channel-multiplied results to the same size and concatenating them to obtain the second fusion feature gi corresponding to the optimized candidate box B1i; inputting gi into the second head network to obtain the second encoded fusion feature g'i;
Inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate box B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
At this stage the IoU threshold U2 is set to 0.7. The higher threshold means the screened candidate boxes are of higher quality, and correspondingly the optimized candidate positions B2i are of higher quality, so the quality of the candidate boxes improves stage by stage.
The first head network and the second head network are both small networks placed after the backbone; in the invention each comprises several sequentially cascaded convolutional layers and a fully connected layer, and outputs features of fixed size and dimension.
S6, after the N candidate boxes in the search image have passed through steps S4 and S5, selecting, among the resulting optimized candidate boxes, the M boxes with the largest second predicted IoU values and averaging them as the final target box of the search image.
Steps S4 and S5 are performed by the first optimization module 11 and the second optimization module 12, respectively, whose structures are shown in FIG. 3 (a) and (b). The final target box acquisition module 13 selects the M candidate boxes with the largest second predicted IoU values and averages them to obtain the final target box of the search image.
S7, when tracking a target through a video, the current search image is taken as the new reference image and the next video frame as the new search image, and steps S1 to S6 are re-executed to achieve target tracking throughout the video.
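The frame-to-frame update of S7 can be sketched as a simple loop (plain Python; `locate` is an illustrative placeholder standing in for the whole S1–S6 pipeline):

```python
def track(frames, init_box, locate):
    """Run the tracker over a sequence: each frame is located against the
    previous frame, which serves as the reference image (step S7)."""
    boxes = [init_box]
    reference, ref_box = frames[0], init_box
    for frame in frames[1:]:
        box = locate(reference, ref_box, frame)   # S1-S6 on (reference, search)
        boxes.append(box)
        reference, ref_box = frame, box           # search image becomes reference
    return boxes

# toy stand-in: the "pipeline" just shifts the box by the frame's stored motion
frames = [{"dx": 0}, {"dx": 2}, {"dx": 3}]
locate = lambda ref, box, frame: tuple(c + frame["dx"] for c in box)
print(track(frames, (0, 0, 10, 10), locate))  # [(0, 0, 10, 10), (2, 2, 12, 12), (5, 5, 15, 15)]
```

Using the previous frame (rather than only the first frame) as the reference keeps the template close to the target's current appearance, which is the rationale for S7.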
In this embodiment, the structures of the first IoU prediction unit and the second IoU prediction unit are similar to the IoU predictor in: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669.
The parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained using the following steps:
S11, constructing a sample set, where each sample comprises: a reference image, a search image, the target box in the reference image, and the ground-truth bounding box of the target in the search image;
s12, processing the reference image and the search image in the sample according to the steps S1 to S3, and then adopting a first-stage optimization process similar to S4: inputting the first coding fusion characteristics output by the first head network into a first IoU prediction unit and a real IoU calculation module in parallel; the stage real IoU calculation module is used for calculating IoU values IoU of candidate frames and real surrounding frames of targets in the selected search image gt1(ii) a If IoUgt1>U1Inputting the candidate frame into the first bounding box regression unit for optimization to obtain an optimized candidate frame BB1nN is 1,2, …, N1, and N1 are the number of candidate frames obtained after the first-stage optimization processing is performed on N candidate frames in the search image;
A second-stage optimization similar to S5 is then performed: the second fusion feature is obtained according to BB_1n and input into the second head network; the second encoded fusion feature output by the second head network is input in parallel into the second IoU prediction unit and the true IoU calculation module; at this stage the true IoU calculation module calculates the IoU value IoU_gt2 between BB_1n and the ground-truth bounding box of the target; if IoU_gt2 > U_2, candidate frame BB_1n is input into the second bounding box regression unit for optimization, yielding optimized candidate frames BB_2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after second-stage optimization of the N1 candidate frames from the first stage;
The processing flows of the first-stage and second-stage optimization during training are shown in (a) and (b) of fig. 4, respectively. They differ from S4 and S5 in that, during training, the first encoded fusion feature output by the first head network is input in parallel into the first IoU prediction unit and the true IoU calculation module, and the second encoded fusion feature output by the second head network is input in parallel into the second IoU prediction unit and the true IoU calculation module; whether a candidate frame is passed to the first or second bounding box regression unit for regression optimization is decided by the true IoU value computed by the true IoU calculation module. In other words, the first and second IoU prediction units are trained to predict the IoU scores of the candidate frames: by driving their outputs as close as possible to the true IoU values computed by the true IoU calculation module, the predicted IoU can be used at test time, when the true candidate-frame IoU is unavailable. The first-stage optimization adopts a smaller IoU threshold, so more positive samples are obtained and optimized by the first bounding box regression unit, which guarantees the balance of training samples; the second-stage optimization adopts a larger IoU threshold, which further improves the quality of the positive samples and enables the second bounding box regression unit to obtain higher-quality candidate frames.
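The two-stage screening by true IoU during training can be sketched as follows (a minimal illustration in plain Python: the IoU computation is standard, the thresholds 0.5 and 0.7 follow the embodiment, and `refine` is a stand-in for the learned bounding box regression units):

```python
def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def cascade_filter(candidates, gt, refine, u1=0.5, u2=0.7):
    """Two-stage sample selection: the loose threshold u1 keeps many
    positives; the stricter u2 is applied only after the first refinement."""
    stage1 = [refine(c) for c in candidates if iou(c, gt) > u1]   # first-stage positives
    stage2 = [refine(c) for c in stage1 if iou(c, gt) > u2]       # second-stage positives
    return stage1, stage2
```

With an identity `refine`, a candidate at IoU 0.68 survives the first stage but not the second, illustrating how the second threshold only sees candidates the first regression has already had a chance to improve.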
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
where t denotes the current training epoch, and the first two terms denote the first-stage IoU loss and the second-stage IoU loss at the (t-1)-th training epoch:
where IoU_1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU_2n denotes the second predicted IoU value corresponding to the n-th candidate frame of the search image after first-stage optimization;
the two regression terms respectively denote the optimization error of the first bounding box regression unit and that of the second bounding box regression unit at the (t-1)-th training epoch,
where BB_1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB_2m denotes the m-th candidate frame of the search image after second-stage optimization, and BB_gt denotes the ground-truth bounding box of the target in the search image;
and the remaining term denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
In the last term of the loss function, the weight of the second-stage regression error is inversely proportional to the average optimization error of the first stage. Thus, early in training the first-stage optimization dominates the loss; once the first-stage bounding box regression unit is well trained, its optimization error shrinks and the influence of the second-stage optimization on the loss gradually grows. As training proceeds, the quality of the candidate positions improves, the number of positive samples increases, and the weight of the second stage rises while the samples remain balanced. By cascading multiple regression networks, the IoU threshold is gradually raised across the stages of target localization: a smaller IoU threshold in the earlier stage increases the number of positive samples, realizing the balance of training samples, and the first localization regression improves the quality of the candidate frames. The IoU threshold can therefore be raised in the second stage without greatly changing the number of positive samples, and the further localization regression of the candidate frames improves regression accuracy. In this way, positioning accuracy is improved step by step while the samples stay balanced.
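Since the formula images are not reproduced in this text, the weighting scheme can only be sketched in assumed form. The sketch below shows the stated behavior (second-stage weight inversely proportional to the average first-stage regression error); the exact combination of terms and all names are assumptions, not the patent's formula:

```python
def total_loss(iou_loss1, iou_loss2, reg_err1, reg_err2,
               avg_reg_err1_history, eps=1e-6):
    """Combine the two-stage losses. The second-stage regression term is
    weighted inversely to the average first-stage regression error over
    past epochs, so its influence grows as the first stage converges.
    This is an assumed form, not the patent's exact formula."""
    w2 = 1.0 / (avg_reg_err1_history + eps)   # inversely proportional weight
    return iou_loss1 + iou_loss2 + reg_err1 + w2 * reg_err2
```

As the running average `avg_reg_err1_history` falls from, say, 1.0 to 0.5, the weight on the second-stage error doubles, shifting the optimization pressure onto the second stage exactly as the paragraph above describes.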
Claims (10)
1. A multi-level regression target tracking method based on sample balance is characterized by comprising the following steps:
s1, extracting the shallow feature R_1 and the deep feature R_2 of a reference image; according to R_1 and R_2 respectively, obtaining the shallow feature a_1 and the deep feature a_2 of the region inside the target frame in the reference image by using a PrPool layer;
S2, extracting the shallow feature S_1 and the deep feature S_2 of a search image; obtaining an initial target frame in the search image, perturbing the initial target frame, and generating a plurality of candidate frames B_0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image;
s3, according to S_1 and S_2, obtaining the shallow and deep features inside each candidate frame of the search image by using a PrPool layer; the shallow feature inside the i-th candidate frame B_0i is denoted b_1i, and the deep feature is denoted b_2i;
multiplying a_1 and b_1i channel-wise, and multiplying a_2 and b_2i channel-wise; adjusting the two channel-wise products to the same size and concatenating them to obtain the first fusion feature f_i corresponding to candidate frame B_0i;
S4, carrying out first-stage optimization on the candidate frames in the search image: inputting the first fusion feature f_i into the first head network to obtain the first encoded fusion feature f_i'; inputting f_i' into the first IoU prediction unit to obtain the first predicted IoU value u_i of candidate frame B_0i; if u_i > U_1, optimizing candidate frame B_0i with the first bounding box regression unit to obtain the optimized candidate frame B_1i, where U_1 is the IoU threshold of the first IoU prediction unit;
s5, performing second-stage optimization on the candidate frames after first-stage optimization: according to S_1 and S_2 respectively, obtaining the shallow feature b'_1i and the deep feature b'_2i of the optimized candidate frame B_1i in the search image by using a PrPool layer;
multiplying a_1 and b'_1i channel-wise, and multiplying a_2 and b'_2i channel-wise; adjusting the two channel-wise products to the same size and concatenating them to obtain the second fusion feature g_i corresponding to the optimized candidate frame B_1i; inputting the second fusion feature g_i into the second head network to obtain the second encoded fusion feature g'_i;
inputting g'_i into the second IoU prediction unit to obtain the second predicted IoU value v_i of B_1i; if v_i > U_2, optimizing B_1i with the second bounding box regression unit to obtain the optimized candidate frame B_2i, where U_2 is the IoU threshold of the second IoU prediction unit and U_2 > U_1;
S6, after the N candidate frames in the search image undergo steps S4 and S5, obtaining a plurality of optimized candidate frames; selecting the M candidate frames with the largest second predicted IoU values and averaging them to obtain the final target frame of the search image.
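The channel-wise modulation and concatenation in steps S1 to S3 of claim 1 can be sketched as follows. This is a minimal illustration with assumed tensor shapes: the reference features a_1, a_2 act as per-channel modulation vectors, and the two products are assumed to already share a spatial size (the claim adjusts them to the same size before concatenation):

```python
import numpy as np

def first_fusion(a1, b1, a2, b2):
    """Sketch of the first fusion feature f_i.

    a1, a2: reference modulation vectors, shapes (C1, 1, 1) and (C2, 1, 1)
    b1, b2: shallow/deep candidate-frame features, shapes (C1, H, W) and (C2, H, W)
    (assumed to share H x W; the claim resizes the products to a common size)
    """
    shallow = a1 * b1                                 # channel-wise product, (C1, H, W)
    deep = a2 * b2                                    # channel-wise product, (C2, H, W)
    return np.concatenate([shallow, deep], axis=0)    # concatenated: (C1 + C2, H, W)

f = first_fusion(np.ones((4, 1, 1)), np.ones((4, 5, 5)),
                 2 * np.ones((8, 1, 1)), np.ones((8, 5, 5)))
print(f.shape)   # (12, 5, 5)
```

The modulation ties the search-image candidate features to the appearance of the target in the reference frame, and the concatenation keeps shallow (localization-oriented) and deep (semantics-oriented) information in separate channel groups of f_i.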
2. The multi-level regression target tracking method according to claim 1, wherein in step S2 an ATOM-based online classifier is used to obtain the initial target frame in the search image.
3. The multi-level regression target tracking method according to claim 1, further comprising:
S7, the search image is used as the new reference image, the next frame image of the search image is used as the new search image, and steps S1 to S6 are executed again, so that target tracking in the video is realized.
4. The multi-level regression target tracking method of claim 1, wherein the IoU threshold U_1 of the first IoU prediction unit is 0.5, and the IoU threshold U_2 of the second IoU prediction unit is 0.7.
5. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2, shallow feature extractors consisting of the initial convolution layer of Resnet-50, Block1, and two convolution layers connected in sequence are used to extract the shallow features of the reference image and the search image.
6. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2, deep feature extractors consisting of Block2-Block4 of Resnet-50 and two convolution layers connected in sequence are used to extract the deep features of the reference image and the search image.
7. The multi-level regression target tracking method according to claim 1, wherein the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained by:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the ground-truth bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then performing first-stage optimization processing: the first encoded fusion feature output by the first head network is input in parallel into the first IoU prediction unit and the true IoU calculation module; at this stage the true IoU calculation module calculates the true IoU value IoU_gt1 between each candidate frame and the ground-truth bounding box of the target in the search image; if IoU_gt1 > U_1, the candidate frame is input into the first bounding box regression unit for optimization, yielding optimized candidate frames BB_1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after first-stage optimization of the N candidate frames in the search image;
and performing second-stage optimization processing: the second fusion feature is obtained according to BB_1n and input into the second head network; the second encoded fusion feature output by the second head network is input in parallel into the second IoU prediction unit and the true IoU calculation module; at this stage the true IoU calculation module calculates the IoU value IoU_gt2 between BB_1n and the ground-truth bounding box of the target; if IoU_gt2 > U_2, candidate frame BB_1n is input into the second bounding box regression unit for optimization, yielding optimized candidate frames BB_2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after second-stage optimization of the N1 candidate frames from the first stage;
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
where t denotes the current training epoch, and the first two terms denote the first-stage IoU loss and the second-stage IoU loss at the (t-1)-th training epoch:
where IoU_1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU_2n denotes the second predicted IoU value corresponding to the n-th candidate frame of the search image after first-stage optimization;
the two regression terms respectively denote the optimization error of the first bounding box regression unit and that of the second bounding box regression unit at the (t-1)-th training epoch,
where BB_1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB_2m denotes the m-th candidate frame of the search image after second-stage optimization, and BB_gt denotes the ground-truth bounding box of the target in the search image;
8. A multi-level regression target tracking system based on sample balancing, comprising:
a reference image shallow feature extractor (1) for extracting the shallow feature R_1 of the reference image;
a reference image deep feature extractor (2) for extracting the deep feature R_2 of the reference image;
a reference image shallow PrPool layer (3) for obtaining, according to R_1, the shallow feature a_1 inside the target frame in the reference image;
a reference image deep PrPool layer (4) for obtaining, according to R_2, the deep feature a_2 inside the target frame in the reference image;
a candidate frame generation module (5) for obtaining an initial target frame in the search image, perturbing the initial target frame, and generating a plurality of candidate frames B_0i;
a search image shallow feature extractor (6) for extracting the shallow feature S_1 of the search image;
a search image deep feature extractor (7) for extracting the deep feature S_2 of the search image;
a search image shallow PrPool layer (8) for obtaining, according to S_1, the shallow feature b_1i inside candidate frame B_0i of the search image;
a search image deep PrPool layer (9) for obtaining, according to S_2, the deep feature b_2i inside candidate frame B_0i of the search image;
a first fusion feature acquisition module (10) for multiplying a_1 and b_1i channel-wise and multiplying a_2 and b_2i channel-wise, adjusting the two channel-wise products to the same size, and concatenating them to obtain the first fusion feature f_i corresponding to candidate frame B_0i;
a first optimization module (11) for performing first-stage optimization on the candidate frames in the search image: inputting the first fusion feature f_i into the first head network to obtain the first encoded fusion feature f_i'; inputting f_i' into the first IoU prediction unit to obtain the first predicted IoU value u_i of candidate frame B_0i; if u_i > U_1, optimizing candidate frame B_0i with the first bounding box regression unit to obtain the optimized candidate frame B_1i, where U_1 is the IoU threshold of the first IoU prediction unit;
a second optimization module (12) for performing second-stage optimization on the candidate frames after first-stage optimization: according to S_1 and S_2 respectively, obtaining the shallow feature b'_1i and the deep feature b'_2i of the optimized candidate frame B_1i in the search image by using a PrPool layer;
multiplying a_1 and b'_1i channel-wise, and multiplying a_2 and b'_2i channel-wise; adjusting the two channel-wise products to the same size and concatenating them to obtain the second fusion feature g_i corresponding to the optimized candidate frame B_1i; inputting the second fusion feature g_i into the second head network to obtain the second encoded fusion feature g'_i;
inputting g'_i into the second IoU prediction unit to obtain the second predicted IoU value v_i of B_1i; if v_i > U_2, optimizing B_1i with the second bounding box regression unit to obtain the optimized candidate frame B_2i, where U_2 is the IoU threshold of the second IoU prediction unit and U_2 > U_1;
and a final target frame acquisition module (13) for selecting, from the plurality of optimized candidate frames obtained by processing the N candidate frames in the search image through the first optimization module (11) and the second optimization module (12), the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
9. The multi-level regression target tracking system according to claim 8, wherein the reference image shallow feature extractor (1) and the search image shallow feature extractor (6) are respectively composed of an initial convolution layer of Resnet-50, a Block1 and two convolution layers which are connected in sequence.
10. The multi-level regression target tracking system of claim 8, further comprising a loss function calculation module (14) for calculating loss function values when training the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit;
where t denotes the current training epoch, and the first two terms denote the first-stage IoU loss and the second-stage IoU loss at the (t-1)-th training epoch:
where IoU_1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, IoU_2n denotes the second predicted IoU value corresponding to the n-th candidate frame of the search image after first-stage optimization, and IoU_gt1 and IoU_gt2 denote the true IoU values of the candidate frames in the first-stage and second-stage optimization, respectively; the two regression terms respectively denote the optimization error of the first bounding box regression unit and that of the second bounding box regression unit at the (t-1)-th training epoch,
where BB_1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB_2m denotes the m-th candidate frame of the search image after second-stage optimization, and BB_gt denotes the ground-truth bounding box of the target in the search image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210394687.6A CN114757970B (en) | 2022-04-15 | 2022-04-15 | Sample balance-based multi-level regression target tracking method and tracking system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210394687.6A CN114757970B (en) | 2022-04-15 | 2022-04-15 | Sample balance-based multi-level regression target tracking method and tracking system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114757970A true CN114757970A (en) | 2022-07-15 |
CN114757970B CN114757970B (en) | 2024-03-08 |
Family
ID=82330152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210394687.6A Active CN114757970B (en) | 2022-04-15 | 2022-04-15 | Sample balance-based multi-level regression target tracking method and tracking system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114757970B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110533691A (en) * | 2019-08-15 | 2019-12-03 | 合肥工业大学 | Method for tracking target, equipment and storage medium based on multi-categorizer |
WO2020051776A1 (en) * | 2018-09-11 | 2020-03-19 | Intel Corporation | Method and system of deep supervision object detection for reducing resource usage |
CN112215080A (en) * | 2020-09-16 | 2021-01-12 | 电子科技大学 | Target tracking method using time sequence information |
CN112215079A (en) * | 2020-09-16 | 2021-01-12 | 电子科技大学 | Global multistage target tracking method |
WO2021208502A1 (en) * | 2020-04-16 | 2021-10-21 | 中国科学院深圳先进技术研究院 | Remote-sensing image target detection method based on smooth bounding box regression function |
- 2022-04-15: application CN202210394687.6A granted as patent CN114757970B (active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020051776A1 (en) * | 2018-09-11 | 2020-03-19 | Intel Corporation | Method and system of deep supervision object detection for reducing resource usage |
CN110533691A (en) * | 2019-08-15 | 2019-12-03 | 合肥工业大学 | Method for tracking target, equipment and storage medium based on multi-categorizer |
WO2021208502A1 (en) * | 2020-04-16 | 2021-10-21 | 中国科学院深圳先进技术研究院 | Remote-sensing image target detection method based on smooth bounding box regression function |
CN112215080A (en) * | 2020-09-16 | 2021-01-12 | 电子科技大学 | Target tracking method using time sequence information |
CN112215079A (en) * | 2020-09-16 | 2021-01-12 | 电子科技大学 | Global multistage target tracking method |
Non-Patent Citations (2)
Title |
---|
Xiong Changzhen; Li Yan: "A Survey of Siamese-Network-Based Tracking Algorithms", Industrial Control Computer, no. 03, 25 March 2020 (2020-03-25) *
Shi Guoqiang; Zhao Xia: "Target tracking algorithm based on a jointly optimized, strongly coupled Siamese region proposal network", Journal of Computer Applications, no. 10, 10 October 2020 (2020-10-10) *
Also Published As
Publication number | Publication date |
---|---|
CN114757970B (en) | 2024-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yang et al. | Step: Spatio-temporal progressive learning for video action detection | |
CN112669325B (en) | Video semantic segmentation method based on active learning | |
CN107066973B (en) | Video content description method using space-time attention model | |
CN106570464B (en) | Face recognition method and device for rapidly processing face shielding | |
CN110263666B (en) | Action detection method based on asymmetric multi-stream | |
CN110033473B (en) | Moving target tracking method based on template matching and depth classification network | |
CN110688927B (en) | Video action detection method based on time sequence convolution modeling | |
CN110110648B (en) | Action nomination method based on visual perception and artificial intelligence | |
CN108875610A (en) | A method of positioning for actuation time axis in video based on border searching | |
CN112132856A (en) | Twin network tracking method based on self-adaptive template updating | |
CN112215079B (en) | Global multistage target tracking method | |
CN116188528B (en) | RGBT unmanned aerial vehicle target tracking method and system based on multi-stage attention mechanism | |
CN111696136A (en) | Target tracking method based on coding and decoding structure | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN111241987B (en) | Multi-target model visual tracking method based on cost-sensitive three-branch decision | |
CN109145685B (en) | Fruit and vegetable hyperspectral quality detection method based on ensemble learning | |
CN116757986A (en) | Infrared and visible light image fusion method and device | |
CN109190505A (en) | The image-recognizing method that view-based access control model understands | |
CN114757970A (en) | Multi-level regression target tracking method and system based on sample balance | |
Xiang et al. | Transformer-based person search model with symmetric online instance matching | |
CN109165586A (en) | intelligent image processing method for AI chip | |
CN110991565A (en) | Target tracking optimization algorithm based on KCF | |
CN113449601B (en) | Pedestrian re-recognition model training and recognition method and device based on progressive smooth loss | |
CN109684954B (en) | On-line training method for realizing target detection on unmanned equipment | |
CN110059584B (en) | Event naming method combining boundary distribution and correction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |