CN114757970A - Multi-level regression target tracking method and system based on sample balance - Google Patents

Multi-level regression target tracking method and system based on sample balance

Info

Publication number
CN114757970A
Authority
CN
China
Prior art keywords
iou
candidate
image
search image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210394687.6A
Other languages
Chinese (zh)
Other versions
CN114757970B (en)
Inventor
吴晶晶
楚喻棋
刘学亮
洪日昌
蒋建国
齐美彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210394687.6A priority Critical patent/CN114757970B/en
Publication of CN114757970A publication Critical patent/CN114757970A/en
Application granted granted Critical
Publication of CN114757970B publication Critical patent/CN114757970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a multi-level regression target tracking method and system based on sample balance. Fusion features between the candidate frames in a search image and the target frame in a reference image are acquired, and the candidate frames in the search image are then optimized by a plurality of cascaded optimization stages, in which the IoU thresholds are raised stage by stage so that positioning precision improves progressively while the samples remain balanced. The method overcomes the drawback of existing methods, in which a single threshold makes it difficult to balance sample sampling against sample error.

Description

Multi-level regression target tracking method and system based on sample balance
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a multi-level regression target tracking method and a multi-level regression target tracking system based on sample balance.
Background
Given the location of an object of interest in the first frame of a video, visual object tracking aims to continuously locate that object in the subsequent frames. The task has high practical value in security systems, so it has received wide attention in the computer vision field. Although deep learning techniques have been applied to this task with considerable success, it remains challenging due to factors such as shape changes, scale changes, occlusion of the object, and background clutter.
Existing deep-learning-based target trackers mostly adopt a Siamese two-stream network structure for the offline network, and realize regression of candidate positions by integrating the appearance information of a given template with that of the candidate positions; see the following documents:
[1] Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8971-8980.
[2] Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4282-4291.
[3] Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 101-117.
[4] He A, Luo C, Tian X, et al. Towards a better match in siamese network based visual object tracker[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[5] Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4591-4600.
In the target tracking task, the most commonly used regression operation is bounding box regression, which directly learns the deviation between candidate positions and the real position of the target so that the candidate positions can be corrected to lie closer to the real position. However, bounding box regression suffers from a sample imbalance problem: only the positive-sample candidate boxes whose Intersection over Union (IoU) with the ground truth exceeds a set threshold are regressed. When the IoU threshold is set higher, fewer positive samples remain, so the likelihood of overfitting grows; when the threshold is set lower, the error grows, because a lower threshold admits more background into the positive samples. How to set a reasonable IoU threshold that balances the samples while improving tracking and positioning accuracy is therefore a crucial issue in this task, yet existing offline tracking network designs ignore it.
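To make the trade-off concrete, the following is a small illustrative sketch (not taken from the patent itself) that computes the IoU between jittered candidate boxes and a ground-truth box and counts how many candidates survive different thresholds; the box coordinates and jitter magnitude are arbitrary assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

gt = np.array([50.0, 50.0, 150.0, 150.0])                  # hypothetical ground-truth box
rng = np.random.default_rng(0)
candidates = gt + rng.normal(0.0, 15.0, size=(1000, 4))    # jittered candidate boxes
ious = np.array([iou(c, gt) for c in candidates])
for thr in (0.5, 0.6, 0.7):
    # higher threshold -> fewer (but cleaner) positive samples
    print(f"IoU > {thr}: {int((ious > thr).sum())} positive samples")
```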
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention provides a multi-level regression target tracking method based on sample balance and a corresponding tracking system. In this target tracking method, the IoU threshold is raised gradually across multiple cascaded positioning stages, so that positioning precision improves progressively while the samples remain balanced; the method overcomes the drawback of existing methods, in which a single threshold makes it difficult to balance sample sampling against sample error.
The technical scheme is as follows: the invention discloses a multi-level regression target tracking method based on sample balance, which comprises the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of a reference image; according to R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the target-frame region in the reference image using a PrPool layer;
S2, extracting the shallow feature S1 and the deep feature S2 of a search image; obtaining an initial target frame in the search image, perturbing the initial target frame, and generating a plurality of candidate frames B0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image;
S3, according to S1 and S2, obtaining the shallow and deep features within each candidate frame of the search image using a PrPool layer; the shallow feature within the i-th candidate frame B0i is denoted b1i and the deep feature is denoted b2i;
multiplying a1 and b1i channel-wise, and multiplying a2 and b2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the first fused feature fi corresponding to candidate frame B0i;
S4, performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit;
S5, performing second-stage optimization on the candidate frames obtained from the first-stage optimization: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer;
multiplying a1 and b'1i channel-wise, and multiplying a2 and b'2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the second fused feature gi corresponding to the optimized candidate frame B1i; inputting the second fused feature gi into the second head network to obtain the second encoded fused feature g'i;
inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
S6, after the N candidate frames of the search image have passed through steps S4 and S5, obtaining a plurality of optimized candidate frames, selecting the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
Further, in step S2, an ATOM-based online classifier is used to obtain an initial target frame in the search image.
Further, the method also includes:
S7, taking the current search image as the reference image, taking the next frame of the video as the new search image, and re-executing steps S1 to S6 to realize target tracking in the video.
Further, the IoU threshold U1 of the first IoU prediction unit is 0.5, and the IoU threshold U2 of the second IoU prediction unit is 0.7.
Further, in steps S1 and S2, a shallow feature extractor composed of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence is used to extract the shallow features of the reference image and the search image.
Further, in steps S1 and S2, a deep feature extractor composed of Block2-Block4 of ResNet-50 and two convolutional layers connected in sequence is used to extract the deep features of the reference image and the search image.
Further, the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained as follows:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the real bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then performing first-stage optimization: inputting the first encoded fused feature output by the first head network into the first IoU prediction unit and a real-IoU calculation module in parallel; the real-IoU calculation module calculates the IoU value IoUgt1 between the candidate frame and the real bounding box of the target in the search image; if IoUgt1 > U1, inputting the candidate frame into the first bounding box regression unit for optimization to obtain the optimized candidate frame BB1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after the first-stage optimization of the N candidate frames in the search image;
then performing second-stage optimization: obtaining the second fused feature according to BB1n and inputting it into the second head network; inputting the second encoded fused feature output by the second head network into the second IoU prediction unit and the real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt2 between BB1n and the real bounding box of the target; if IoUgt2 > U2, inputting the candidate frame BB1n into the second bounding box regression unit for optimization to obtain the optimized candidate frame BB2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after the second-stage optimization of the N1 candidate frames from the first stage;
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU2n denotes the second predicted IoU value corresponding to the n-th candidate frame after the first-stage optimization of the search image;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
On the other hand, the invention also discloses a system for realizing the above multi-level regression target tracking method based on sample balance, which comprises:
a reference image shallow feature extractor 1 for extracting the shallow feature R of the reference image1
A reference image deep feature extractor 2 for extracting deep features R of the reference image2
Reference image superficial Prpool layer 3 for the image according to R1Obtaining shallow layer characteristic a in a target frame in a reference image1
Reference image deep Prpool layer 4 for the layer based on R2Obtaining deep layer characteristics a in a target frame in a reference image2
A candidate frame generating module 5, configured to obtain an initial target frame in the search image, and perturb the initial target frame in the search image to generate multiple candidate frames B0i
A search image shallow feature extractor 6 for extracting shallow features S of the search image1
A search image deep feature extractor 7 for extracting deep features S of the search image2
Search for image shallow Prpool layer 8 for the basis of S1Obtaining a search image candidate frame B0iInner shallow feature b1i
Search for image deep Prpool layer 9 for the basis of S 2Obtaining a search image candidate frame B0iCharacteristic of inner deep layer b2i
A first fused feature obtaining module 10 for obtaining a1And b1iMultiplying the channels by a2And b2iMultiplying channels; adjusting two multiplied results of the channels to be the same size and then cascading to obtain a candidate frame B0iCorresponding first fusion feature fi
A first optimization module 11, configured to perform a first-stage optimization on candidate frames in the search image: fusing the first fusion feature fiInputting the first code into the first head network to obtain a first code fusion characteristic fi'; will f isi' input the first IoU prediction Unit to get the candidate Box B0iFirst prediction IoU value ui(ii) a If u isi>U1For candidate frame B0iOptimizing by adopting a first bounding box regression unit to obtain an optimized candidate frame B1i;U1IoU threshold for the first IoU prediction unit;
a second optimization module 12 forAnd performing second-stage optimization on the candidate frame after the first-stage optimization: respectively according to S1And S2Obtaining an optimization candidate frame B in a search image by adopting a PrPool layer1iShallow layer feature b'1iAnd deep layer characteristic b'2i
A is to1And b'1iMultiplying the channels by a2And b'2iMultiplying the channels; adjusting two multiplied results of the channels to be the same size and then cascading to obtain an optimized candidate frame B 1iCorresponding second fusion feature gi(ii) a Merging the second fusion characteristic giInputting the second coded fusion feature g 'into a second head network to obtain a second coded fusion feature g'i
G'iInputting a second IoU prediction unit to obtain B1iSecond prediction IoU value vi(ii) a If v isi>U2To B, for1iOptimizing by adopting a second bounding box regression unit to obtain an optimized candidate frame B2i;U2IoU threshold of the unit is predicted for the second IoU, and U2>U1
And a final target frame obtaining module 13, configured to select M candidate frames with the largest second prediction IoU value from the multiple optimized candidate frames obtained by processing the N candidate frames in the search image through the first optimization module 11 and the second optimization module 12, and take the average of the M candidate frames as a final target frame of the search image.
Further, the target tracking system further comprises a loss function calculation module 14 for calculating a loss function value when training parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, IoU2n denotes the second predicted IoU value of the n-th candidate frame after the first-stage optimization of the search image, and IoUgt1 and IoUgt2 denote the true IoU values of the candidate frame in the first-stage and second-stage optimizations, respectively;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
Beneficial effects: the multi-level regression target tracking method and system based on sample balance design a multi-level regression network in which the IoU threshold applied to candidate frames is raised stage by stage across a cascade of two positioning stages. The first optimization stage sets a smaller IoU threshold to increase the number of positive samples (candidate frames whose IoU exceeds the threshold are labeled positive), thereby balancing the training samples. After the first positioning-regression stage, the quality of the candidate frames is improved, so the IoU threshold is raised in the second optimization stage; a large number of positive samples can still be retained, and the candidate frames undergo a further positioning regression, which improves the regression accuracy. In summary, by setting different IoU thresholds at different stages, the invention alleviates the sample balance problem and improves positioning precision through stage-by-stage localization.
Drawings
FIG. 1 is a flow chart of a multi-level regression target tracking method based on sample balancing according to the present disclosure;
FIG. 2 is a schematic diagram of a multi-level regression target tracking system based on sample balancing;
FIG. 3 is a schematic diagram of the components of a two-stage optimization module;
FIG. 4 is a process flow diagram of a two-stage optimization phase in the training process.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
The invention discloses a sample balance-based multi-level regression target tracking method, a flow chart of which is shown in figure 1, and figure 2 is a composition schematic diagram of a tracking system for realizing the target tracking method. The target tracking method comprises the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of the reference image; according to R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the target-frame region in the reference image using a PrPool layer.
S2, extracting the shallow feature S1 and the deep feature S2 of the search image; obtaining an initial target frame in the search image, perturbing it, and generating a plurality of candidate frames B0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image.
In step S2, an ATOM-based online classifier is used to obtain the initial target frame in the search image; the online classifier is described in detail in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669. This classifier obtains the approximate location of the target. The candidate frame generation module 5 then perturbs the initial target frame in the search image to generate a plurality of candidate frames B0i.
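As an illustration of this perturbation step, the sketch below jitters an initial (cx, cy, w, h) box into N candidates by adding Gaussian noise to its centre and log-scale; the noise magnitudes and the box parameterization are assumptions, not values stated in the patent.

```python
import numpy as np

def generate_candidates(init_box, n=64, centre_sigma=0.1, scale_sigma=0.2, seed=0):
    """Perturb an initial (cx, cy, w, h) target box into n candidate boxes."""
    rng = np.random.default_rng(seed)
    cx, cy, w, h = init_box
    boxes = np.empty((n, 4))
    boxes[:, 0] = cx + rng.normal(0.0, centre_sigma * w, n)      # jitter centre x
    boxes[:, 1] = cy + rng.normal(0.0, centre_sigma * h, n)      # jitter centre y
    boxes[:, 2] = w * np.exp(rng.normal(0.0, scale_sigma, n))    # jitter width
    boxes[:, 3] = h * np.exp(rng.normal(0.0, scale_sigma, n))    # jitter height
    return boxes

candidates = generate_candidates((100.0, 80.0, 60.0, 40.0))      # candidate frames B0i, i = 1..64
```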
In steps S1 and S2 of this embodiment, a shallow extractor and a deep extractor with shared parameters are used to obtain the two-scale backbone features of the reference image and the search image. Specifically, the reference image shallow feature extractor 1 and the search image shallow feature extractor 6 both adopt a shallow feature extractor formed by sequentially connecting the initial convolutional layer of ResNet-50, Block1, and two convolutional layers; the reference image deep feature extractor 2 and the search image deep feature extractor 7 both adopt a deep feature extractor formed by sequentially connecting Block2-Block4 of ResNet-50 and two convolutional layers. ResNet-50 is described in: [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
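A minimal PyTorch sketch of the two extractors described above is given below; it assumes the deep branch consumes the output of the shallow branch, and the channel widths of the two extra convolutions are illustrative choices rather than values stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()   # weights omitted in this sketch

# Shallow extractor: initial convolutional layer (stem) + Block1 + two extra convs.
shallow_extractor = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1,
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)

# Deep extractor: Block2-Block4 + two extra convs.
deep_extractor = nn.Sequential(
    backbone.layer2, backbone.layer3, backbone.layer4,
    nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)

x = torch.randn(1, 3, 288, 288)          # reference or search image
s1 = shallow_extractor(x)                # shallow feature (R1 or S1), stride 4
s2 = deep_extractor(s1)                  # deep feature (R2 or S2), stride 32
```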
S3, according to S1 and S2, obtaining the shallow and deep features within each candidate frame of the search image using a PrPool layer; the shallow feature within the i-th candidate frame B0i is denoted b1i and the deep feature is denoted b2i.
The PrPool layers used in the reference image shallow PrPool layer 3, the reference image deep PrPool layer 4, the search image shallow PrPool layer 8, and the search image deep PrPool layer 9 are described in detail in document [6] (Danelljan M, Bhat G, Khan F S, et al.).
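PrPool (precise RoI pooling) has no implementation in the standard torchvision package, so the sketch below substitutes torchvision's roi_align purely to illustrate the ROI-feature-extraction step; the output size, feature stride, and box values are assumptions.

```python
import torch
from torchvision.ops import roi_align

s1 = torch.randn(1, 256, 72, 72)                      # shallow feature map of a 288x288 image
candidate_boxes = torch.tensor([[10.0, 15.0, 120.0, 140.0],
                                [30.0, 40.0, 150.0, 160.0]])   # (x1, y1, x2, y2) in image coords
rois = torch.cat([torch.zeros(len(candidate_boxes), 1), candidate_boxes], dim=1)  # prepend batch index
b1 = roi_align(s1, rois, output_size=(8, 8), spatial_scale=72 / 288)   # -> (2, 256, 8, 8)
```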
The shallow feature a1 of the target-frame region in the reference image is channel-wise multiplied with the candidate frame's shallow feature b1i, and the deep feature a2 of the target-frame region in the reference image is channel-wise multiplied with the candidate frame's deep feature b2i; the two channel-multiplied results are adjusted to the same size and then concatenated to obtain the first fused feature fi corresponding to candidate frame B0i. This functionality is implemented by the first fused feature acquisition module 10; in FIG. 2, one symbol (shown as an image in the original filing) denotes channel-wise multiplication, another denotes the size adjustment, and a third denotes concatenation.
S4, performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit.
At this stage the IoU threshold U1 is set to 0.5. That is, candidate positions whose first predicted IoU value exceeds 0.5 are regarded as preliminarily screened candidate frames, and the first bounding box regression unit is used to optimize these preliminarily screened candidate frames B0i. Since the IoU threshold at this stage is low, more screening results are retained; the number of candidate frames remaining after the first-stage optimization of the N candidate frames in the search image is less than N. The candidate frame obtained by this optimization is denoted B1i.
S5, performing second-stage optimization on the candidate frames obtained from the first-stage optimization: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer;
multiplying a1 and b'1i channel-wise, and multiplying a2 and b'2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the second fused feature gi corresponding to the optimized candidate frame B1i; inputting the second fused feature gi into the second head network to obtain the second encoded fused feature g'i;
inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1.
At this stage the IoU threshold U2 is set to 0.7. The higher threshold means that the screened candidate frames are of higher quality, and the candidate positions B2i obtained by optimizing them are correspondingly of higher quality, so the quality of the candidate frames improves step by step.
The first head network and the second head network are both small networks located after the backbone network. In the invention, each of them comprises several sequentially cascaded convolutional layers followed by a fully connected layer, and outputs a feature of fixed size and dimension.
S6, after the N candidate frames of the search image have passed through steps S4 and S5, a plurality of optimized candidate frames are obtained; the M candidate frames with the largest second predicted IoU values are selected and averaged to serve as the final target frame of the search image.
the steps S4 and S5 are performed by the first optimization module 11 and the second optimization module 12, respectively, and the structures thereof are shown in (a) and (b) of fig. 3. The final target frame obtaining module 13 selects M candidate frames with the largest second prediction IoU value, and averages the M candidate frames to obtain a final target frame of the search image.
S7, when tracking the target through the video, the current search image is taken as the reference image, the next frame of the video is taken as the new search image, and steps S1 to S6 are executed again, so that target tracking in the video is realized.
In this embodiment, the structures of the first IoU prediction unit and the second IoU prediction unit are similar to the IoU predictor in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669.
The parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained using the following steps:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the real bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then adopting a first-stage optimization process similar to S4: inputting the first encoded fused feature output by the first head network into the first IoU prediction unit and a real-IoU calculation module in parallel; the real-IoU calculation module at this stage calculates the IoU value IoUgt1 between the candidate frame and the real bounding box of the target in the search image; if IoUgt1 > U1, inputting the candidate frame into the first bounding box regression unit for optimization to obtain the optimized candidate frame BB1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after the first-stage optimization of the N candidate frames in the search image;
A second-stage optimization process similar to S5 is then performed: obtaining the second fused feature according to BB1n and inputting it into the second head network; inputting the second encoded fused feature output by the second head network into the second IoU prediction unit and the real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt2 between BB1n and the real bounding box of the target; if IoUgt2 > U2, inputting the candidate frame BB1n into the second bounding box regression unit for optimization to obtain the optimized candidate frame BB2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after the second-stage optimization of the N1 candidate frames from the first stage;
the processing flows of the first-stage optimization and the second-stage optimization during training are shown as (a) and (b) in fig. 4, respectively. It differs from S4 and S5 in that: during training, a first coding fusion characteristic output by the first head network is input into a first IoU prediction unit and a real IoU calculation module in parallel, and a second coding fusion characteristic output by the second head network is input into a second IoU prediction unit and a real IoU calculation module in parallel; and judging whether the first bounding box regression unit and the second bounding box regression unit are input for regression optimization according to the candidate box real IoU value calculated by the real IoU calculation module. That is, the first IoU predictor and the second IoU predictor are trained to learn the IoU scores of the prediction candidate frames, and the predicted IoU is used in a case of testing (at this time, the candidate frame IoU cannot be acquired) by making the output results of the first IoU predictor and the second IoU predictor as close as possible to the true IoU value of the candidate frame calculated by the true IoU calculation module. A smaller IoU threshold is adopted in the first-stage optimization, more positive samples can be obtained, and the first bounding box regression unit optimizes the positive samples, so that the balance of training samples is guaranteed; the larger IoU threshold is used in the second stage optimization, which again improves the quality of the positive samples, enabling the second bounding box regression unit to obtain higher quality candidate boxes.
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU2n denotes the second predicted IoU value corresponding to the n-th candidate frame after the first-stage optimization of the search image;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
In the last term of the loss function, the weight of the second-stage regression error and the average optimization error of the first bounding box regression unit (both shown as images in the original filing) are inversely proportional. Thus, early in training, the first-stage optimization has a large influence on the loss function; once the first-stage bounding box regression unit is well trained, its optimization error becomes smaller and smaller, and the influence of the second-stage optimization on the loss function gradually increases. As training proceeds, the quality of the candidate positions keeps improving and the number of positive samples keeps growing, so the weight of the second stage increases while the samples remain balanced. By cascading multiple regression networks, the IoU threshold is raised step by step across the target-localization stages: the smaller IoU threshold in the earlier stage increases the number of positive samples and thereby balances the training samples, and the first positioning regression improves the quality of the candidate frames. The IoU threshold can therefore be raised in the second stage without changing the number of positive samples too much, and the candidate frames undergo a further positioning regression, which improves the regression accuracy. In this way the positioning precision is improved step by step while the samples stay balanced.
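Since the exact formula appears only as images in the original filing, the sketch below encodes one plausible reading of this weighting as an assumption: the second-stage regression loss is scaled by a factor inversely proportional to the running mean of the first-stage regression error, so its influence grows as the first stage converges.

```python
class SampleBalancedLoss:
    """Hedged sketch of the epoch-dependent loss weighting; not the patent's exact formula."""
    def __init__(self, eps=1e-6):
        self.stage1_errors = []       # first-stage regression errors recorded per epoch
        self.eps = eps

    def record_epoch(self, stage1_reg_error):
        self.stage1_errors.append(float(stage1_reg_error))

    def total(self, l_iou1, l_iou2, l_reg1, l_reg2):
        mean_err = sum(self.stage1_errors) / max(len(self.stage1_errors), 1)
        weight = 1.0 / (mean_err + self.eps)     # grows as the first stage trains well
        return l_iou1 + l_iou2 + l_reg1 + weight * l_reg2
```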

Claims (10)

1. A multi-level regression target tracking method based on sample balance, characterized by comprising the following steps:
S1, extracting the shallow feature R1 and the deep feature R2 of a reference image; according to R1 and R2 respectively, obtaining the shallow feature a1 and the deep feature a2 of the target-frame region in the reference image using a PrPool layer;
S2, extracting the shallow feature S1 and the deep feature S2 of a search image; obtaining an initial target frame in the search image, perturbing the initial target frame, and generating a plurality of candidate frames B0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image;
S3, according to S1 and S2, obtaining the shallow and deep features within each candidate frame of the search image using a PrPool layer; the shallow feature within the i-th candidate frame B0i is denoted b1i and the deep feature is denoted b2i;
multiplying a1 and b1i channel-wise, and multiplying a2 and b2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the first fused feature fi corresponding to candidate frame B0i;
S4, performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit;
S5, performing second-stage optimization on the candidate frames obtained from the first-stage optimization: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer;
multiplying a1 and b'1i channel-wise, and multiplying a2 and b'2i channel-wise; adjusting the two channel-multiplied results to the same size and then concatenating them to obtain the second fused feature gi corresponding to the optimized candidate frame B1i; inputting the second fused feature gi into the second head network to obtain the second encoded fused feature g'i;
inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
S6, after the N candidate frames of the search image have passed through steps S4 and S5, obtaining a plurality of optimized candidate frames, selecting the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
2. The multi-level regression target tracking method according to claim 1, wherein in step S2 an ATOM-based online classifier is used to obtain the initial target frame in the search image.
3. The multi-level regression target tracking method according to claim 1, further comprising:
S7, taking the current search image as the reference image, taking the next frame of the video as the new search image, and re-executing steps S1 to S6, so that target tracking in the video is realized.
4. The multi-level regression target tracking method according to claim 1, wherein the IoU threshold U1 of the first IoU prediction unit is 0.5 and the IoU threshold U2 of the second IoU prediction unit is 0.7.
5. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2 a shallow feature extractor consisting of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence is used to extract the shallow features of the reference image and the search image.
6. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2 a deep feature extractor consisting of Block2-Block4 of ResNet-50 and two convolutional layers connected in sequence is used to extract the deep features of the reference image and the search image.
7. The multi-level regression target tracking method according to claim 1, wherein the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit are trained by:
S11, constructing a sample set, wherein each sample comprises: a reference image, a search image, the target frame in the reference image, and the real bounding box of the target in the search image;
S12, processing the reference image and the search image in the sample according to steps S1 to S3, and then performing first-stage optimization: inputting the first encoded fused feature output by the first head network into the first IoU prediction unit and a real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt1 between the candidate frame and the real bounding box of the target in the search image; if IoUgt1 > U1, inputting the candidate frame into the first bounding box regression unit for optimization to obtain the optimized candidate frame BB1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained after the first-stage optimization of the N candidate frames in the search image;
then performing second-stage optimization: obtaining the second fused feature according to BB1n and inputting it into the second head network; inputting the second encoded fused feature output by the second head network into the second IoU prediction unit and the real-IoU calculation module in parallel, where the real-IoU calculation module at this stage calculates the IoU value IoUgt2 between BB1n and the real bounding box of the target; if IoUgt2 > U2, inputting the candidate frame BB1n into the second bounding box regression unit for optimization to obtain the optimized candidate frame BB2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained after the second-stage optimization of the N1 candidate frames from the first stage;
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit by minimizing a loss function;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, and IoU2n denotes the second predicted IoU value corresponding to the n-th candidate frame after the first-stage optimization of the search image;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
8. A multi-level regression target tracking system based on sample balance, comprising:
a reference image shallow feature extractor (1) for extracting the shallow feature R1 of a reference image;
a reference image deep feature extractor (2) for extracting the deep feature R2 of the reference image;
a reference image shallow PrPool layer (3) for obtaining, according to R1, the shallow feature a1 within the target frame of the reference image;
a reference image deep PrPool layer (4) for obtaining, according to R2, the deep feature a2 within the target frame of the reference image;
a candidate frame generation module (5) for obtaining an initial target frame in a search image and perturbing it to generate a plurality of candidate frames B0i;
a search image shallow feature extractor (6) for extracting the shallow feature S1 of the search image;
a search image deep feature extractor (7) for extracting the deep feature S2 of the search image;
a search image shallow PrPool layer (8) for obtaining, according to S1, the shallow feature b1i within candidate frame B0i of the search image;
a search image deep PrPool layer (9) for obtaining, according to S2, the deep feature b2i within candidate frame B0i of the search image;
a first fused feature acquisition module (10) for multiplying a1 and b1i channel-wise and multiplying a2 and b2i channel-wise, adjusting the two channel-multiplied results to the same size, and concatenating them to obtain the first fused feature fi corresponding to candidate frame B0i;
a first optimization module (11) for performing first-stage optimization on the candidate frames in the search image: inputting the first fused feature fi into the first head network to obtain the first encoded fused feature fi'; inputting fi' into the first IoU prediction unit to obtain the first predicted IoU value ui of candidate frame B0i; if ui > U1, optimizing candidate frame B0i with the first bounding box regression unit to obtain the optimized candidate frame B1i, where U1 is the IoU threshold of the first IoU prediction unit;
a second optimization module (12) for performing second-stage optimization on the candidate frames obtained from the first stage: according to S1 and S2 respectively, obtaining the shallow feature b'1i and the deep feature b'2i of the optimized candidate frame B1i in the search image using a PrPool layer; multiplying a1 and b'1i channel-wise and multiplying a2 and b'2i channel-wise, adjusting the two channel-multiplied results to the same size, and concatenating them to obtain the second fused feature gi corresponding to B1i; inputting gi into the second head network to obtain the second encoded fused feature g'i; inputting g'i into the second IoU prediction unit to obtain the second predicted IoU value vi of B1i; if vi > U2, optimizing B1i with the second bounding box regression unit to obtain the optimized candidate frame B2i, where U2 is the IoU threshold of the second IoU prediction unit and U2 > U1;
and a final target frame acquisition module (13) for selecting, from the optimized candidate frames obtained by processing the N candidate frames of the search image through the first optimization module (11) and the second optimization module (12), the M candidate frames with the largest second predicted IoU values, and averaging them to obtain the final target frame of the search image.
9. The multi-level regression target tracking system according to claim 8, wherein the reference image shallow feature extractor (1) and the search image shallow feature extractor (6) are each composed of the initial convolutional layer of ResNet-50, Block1, and two convolutional layers connected in sequence.
10. The multi-level regression target tracking system according to claim 8, further comprising a loss function calculation module (14) for calculating the loss function value when training the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit, and the second bounding box regression unit;
the loss function is:
[equation shown as an image in the original filing]
where t denotes the current training epoch; [symbol image] and [symbol image] denote the IoU loss of the first stage and the IoU loss of the second stage at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where IoU1i denotes the first predicted IoU value corresponding to the i-th candidate frame of the search image in the sample, IoU2n denotes the second predicted IoU value of the n-th candidate frame after the first-stage optimization of the search image, and IoUgt1 and IoUgt2 denote the true IoU values of the candidate frame in the first-stage and second-stage optimizations, respectively;
[symbol image] and [symbol image] denote the optimization error of the first bounding box regression unit and the optimization error of the second bounding box regression unit at the (t-1)-th training epoch:
[equation shown as an image in the original filing]
where BB1n denotes the n-th candidate frame of the search image in the sample after first-stage optimization, BB2m denotes the m-th candidate frame after second-stage optimization, and BBgt denotes the real bounding box of the target in the search image;
[symbol image] denotes the average optimization error of the first bounding box regression unit over training epochs 1 to t-1.
CN202210394687.6A 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system Active CN114757970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394687.6A CN114757970B (en) 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system


Publications (2)

Publication Number Publication Date
CN114757970A true CN114757970A (en) 2022-07-15
CN114757970B CN114757970B (en) 2024-03-08

Family

ID=82330152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394687.6A Active CN114757970B (en) 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system

Country Status (1)

Country Link
CN (1) CN114757970B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer
WO2021208502A1 (en) * 2020-04-16 2021-10-21 中国科学院深圳先进技术研究院 Remote-sensing image target detection method based on smooth bounding box regression function
CN112215080A (en) * 2020-09-16 2021-01-12 电子科技大学 Target tracking method using time sequence information
CN112215079A (en) * 2020-09-16 2021-01-12 电子科技大学 Global multistage target tracking method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiong Changzhen; Li Yan: "A survey of tracking algorithms based on Siamese networks" (基于孪生网络的跟踪算法综述), Industrial Control Computer (工业控制计算机), no. 03, 25 March 2020 (2020-03-25) *
Shi Guoqiang; Zhao Xia: "Target tracking algorithm based on a strongly coupled Siamese region proposal network with joint optimization" (基于联合优化的强耦合孪生区域推荐网络的目标跟踪算法), Journal of Computer Applications (计算机应用), no. 10, 10 October 2020 (2020-10-10) *

Also Published As

Publication number Publication date
CN114757970B (en) 2024-03-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant