CN114757970B - Sample balance-based multi-level regression target tracking method and tracking system - Google Patents


Info

Publication number
CN114757970B
Authority
CN
China
Prior art keywords
iou
search image
candidate
frame
image
Prior art date
Legal status
Active
Application number
CN202210394687.6A
Other languages
Chinese (zh)
Other versions
CN114757970A (en)
Inventor
吴晶晶
楚喻棋
刘学亮
洪日昌
蒋建国
齐美彬
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210394687.6A
Publication of CN114757970A
Application granted
Publication of CN114757970B

Classifications

    • G06T 7/248 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06F 18/24 — Pattern recognition: classification techniques
    • G06F 18/253 — Pattern recognition: fusion techniques of extracted features
    • G06N 3/045 — Neural networks: combinations of networks
    • G06N 3/08 — Neural networks: learning methods
    • G06T 7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10016 — Video; image sequence
    • G06T 2207/20021 — Dividing image into blocks, subimages or windows
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]


Abstract

The invention discloses a sample-balance-based multi-level regression target tracking method and tracking system. Fusion features between the candidate boxes in a search image and the target box in a reference image are obtained, and the candidate boxes in the search image are refined through several cascaded optimization stages, whose IoU thresholds are raised stage by stage so that localization accuracy improves progressively while the training samples remain balanced. The method overcomes the difficulty, in existing single-threshold methods, of balancing sample availability against sample error.

Description

Sample balance-based multi-level regression target tracking method and tracking system
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a multi-level regression target tracking method and system based on sample balance.
Background
Given the location of an object of interest in the first frame of a video, the visual object tracking task aims to continuously locate that object in subsequent frames. Because this task has high practical value in security systems, it has received wide attention in the field of computer vision. Although deep learning techniques have been successfully applied to it and significant progress has been made, the task remains challenging due to shape changes, scale changes, object occlusion, background clutter and the like.
In existing deep-learning-based target trackers, the offline network mostly adopts a Siamese two-stream structure, and the regression of candidate positions is realized by integrating the appearance information of a given template with that of the candidate positions. See the following documents:
[1] Li B, Yan J, Wu W, et al. High performance visual tracking with Siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8971-8980.
[2] Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of Siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4282-4291.
[3] Zhu Z, Wang Q, Li B, et al. Distractor-aware Siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 101-117.
[4] He A, Luo C, Tian X, et al. Towards a better match in Siamese network based visual object tracker[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[5] Zhang Z, Peng H. Deeper and wider Siamese networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4591-4600.
In the target tracking task, the most commonly used regression operation is bounding box regression, which directly learns the deviation between a candidate position and the true target position in order to correct the candidate so that it lies closer to the truth. Bounding box regression, however, suffers from a sample-imbalance problem: only candidates whose Intersection over Union (IoU) with the ground truth exceeds a set threshold are treated as positive samples and regressed. When the IoU threshold is set high, few positive samples remain and overfitting becomes more likely; when it is set low, more background is included in the positive samples and the error grows. How to set a reasonable IoU threshold that both balances the samples and improves tracking and localization accuracy is therefore a critical issue in this task, yet it is ignored in existing offline tracking network designs.
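The threshold trade-off can be illustrated with a short sketch (the boxes and the perturbation pattern are hypothetical; the `iou` helper is the standard definition, not code from the patent): raising the threshold from 0.5 to 0.7 sharply reduces the number of candidates counted as positive samples.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_positives(candidates, gt, threshold):
    """Number of candidates treated as positive samples at a given IoU threshold."""
    return sum(1 for c in candidates if iou(c, gt) > threshold)

gt = (10.0, 10.0, 30.0, 30.0)
# candidate boxes shifted away from the ground truth by increasing amounts
candidates = [(10 + d, 10 + d, 30 + d, 30 + d) for d in range(0, 10)]
# a higher threshold keeps fewer positives, hence a greater overfitting risk
assert count_positives(candidates, gt, 0.5) >= count_positives(candidates, gt, 0.7)
```

With these hypothetical candidates, four boxes survive a 0.5 threshold but only two survive 0.7, which is exactly the imbalance the paragraph above describes.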
Disclosure of Invention
The invention aims to: address the problems in the prior art by providing a sample-balance-based multi-level regression target tracking method and a corresponding tracking system. In this target tracking method, the IoU threshold is raised gradually across several cascaded localization stages, so that localization accuracy improves progressively while the samples remain balanced; the method overcomes the difficulty, in existing single-threshold methods, of balancing sample availability against sample error.
The technical scheme is as follows: the invention discloses a sample-balance-based multi-level regression target tracking method, which comprises the following steps:
S1, extract the shallow features R_1 and deep features R_2 of a reference image; from R_1 and R_2 respectively, use a PrPool layer to obtain the shallow features a_1 and deep features a_2 of the region inside the target box in the reference image.
S2, extract the shallow features S_1 and deep features S_2 of the search image; obtain an initial target box in the search image, and perturb it to generate a number of candidate boxes B_0i, i = 1, 2, …, N, where N is the number of candidate boxes in the search image.
S3, from S_1 and S_2 respectively, use the PrPool layer to obtain the shallow and deep features inside each candidate box in the search image; the shallow features inside the i-th candidate box B_0i are denoted b_1i and the deep features b_2i.
Multiply a_1 and b_1i channel-wise, and multiply a_2 and b_2i channel-wise; adjust the two products to the same size and concatenate them to obtain the first fusion feature f_i corresponding to candidate box B_0i.
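A minimal sketch of the fusion in step S3, assuming NumPy arrays in (channels, height, width) layout; the array sizes and the nearest-neighbour resize are illustrative stand-ins for the network's learned features and interpolation:

```python
import numpy as np

def fuse(a1, b1, a2, b2, size=(5, 5)):
    """Channel-wise multiply template and candidate features at two scales,
    resize both products to a common spatial size, then concatenate along
    the channel axis (the first fusion feature f_i of step S3)."""
    m1 = a1 * b1  # shallow-scale product, shape (C1, H1, W1)
    m2 = a2 * b2  # deep-scale product, shape (C2, H2, W2)

    def resize(x, hw):  # nearest-neighbour resize as a stand-in for interpolation
        c, h, w = x.shape
        ys = np.arange(hw[0]) * h // hw[0]
        xs = np.arange(hw[1]) * w // hw[1]
        return x[:, ys][:, :, xs]

    return np.concatenate([resize(m1, size), resize(m2, size)], axis=0)

# toy features: shallow scale 4 channels at 8x8, deep scale 8 channels at 4x4
f = fuse(np.ones((4, 8, 8)), np.ones((4, 8, 8)),
         np.ones((8, 4, 4)), 2 * np.ones((8, 4, 4)))
```

The result concatenates both resized products, so its channel count is the sum of the two inputs' channel counts at the common spatial size.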
S4, perform the first-stage optimization of the candidate boxes in the search image: input the first fusion feature f_i into a first head network to obtain the first encoded fusion feature f_i'; input f_i' into a first IoU prediction unit to obtain the first predicted IoU value u_i of candidate box B_0i; if u_i > U_1, optimize candidate box B_0i with a first bounding box regression unit to obtain an optimized candidate box B_1i, where U_1 is the IoU threshold of the first IoU prediction unit.
S5, perform the second-stage optimization of the candidate boxes optimized in the first stage: from S_1 and S_2 respectively, use the PrPool layer to obtain the shallow features b'_1i and deep features b'_2i of the optimized candidate box B_1i in the search image.
Multiply a_1 and b'_1i channel-wise, and multiply a_2 and b'_2i channel-wise; adjust the two products to the same size and concatenate them to obtain the second fusion feature g_i corresponding to the optimized candidate box B_1i; input g_i into a second head network to obtain the second encoded fusion feature g'_i.
Input g'_i into the second IoU prediction unit to obtain the second predicted IoU value v_i of B_1i; if v_i > U_2, optimize B_1i with a second bounding box regression unit to obtain an optimized candidate box B_2i, where U_2 is the IoU threshold of the second IoU prediction unit and U_2 > U_1.
S6, after the N candidate boxes in the search image have passed through steps S4 and S5, a number of optimized candidate boxes are obtained; select the M candidate boxes with the largest second predicted IoU values and average them to form the final target box of the search image.
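The two-stage refinement of steps S4 to S6 can be sketched as follows; `predict_iou` and `regress` are hypothetical callables standing in for the learned IoU prediction and bounding box regression units, and the thresholds (0.5, 0.7) follow the preferred embodiment described later:

```python
def cascade_refine(candidates, predict_iou, regress, thresholds=(0.5, 0.7)):
    """Each stage keeps the boxes whose predicted IoU exceeds the stage
    threshold and regresses them; the threshold rises stage to stage
    (U_2 > U_1), so later stages see fewer but better boxes."""
    boxes = list(candidates)
    for u in thresholds:
        boxes = [regress(b) for b in boxes if predict_iou(b) > u]
    return boxes

def final_box(boxes, predict_iou, m=3):
    """Average the m surviving boxes with the highest predicted IoU (step S6)."""
    top = sorted(boxes, key=predict_iou, reverse=True)[:m]
    return tuple(sum(b[k] for b in top) / len(top) for k in range(4))
```

In practice `predict_iou` would be the stage's IoU prediction unit applied to the fused features, and `regress` the stage's bounding box regression unit.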
Further, in step S2, an ATOM-based online classifier is used to obtain the initial target box in the search image.
Further, the method further comprises the following steps:
s7, taking the search image as a reference image, taking the next frame image of the search image as a new search image, and re-executing the steps S1 to S6 to realize target tracking in the video.
Further, the IoU threshold U_1 of the first IoU prediction unit is 0.5, and the IoU threshold U_2 of the second IoU prediction unit is 0.7.
Further, in steps S1 and S2, the shallow features of the reference image and the search image are extracted by a shallow feature extractor formed by sequentially connecting the initial convolution layer and Block1 of ResNet-50 with two convolution layers.
Further, in steps S1 and S2, the deep features of the reference image and the search image are extracted by a deep feature extractor formed by sequentially connecting Block2-Block4 of ResNet-50 with two convolution layers.
Further, the parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit and the second bounding box regression unit are trained by:
s11, constructing a sample set, wherein each sample in the sample set comprises: a reference image, a search image, a target frame in the reference image, and a real bounding box of a target in the search image;
s12, processing the reference image and the search image in the sample according to steps S1 to S3, and then performing first-stage optimization processing: inputting the first coding fusion characteristic output by the first head network into a first IoU prediction unit and a real IoU calculation module in parallel; the true IoU computing module at this stage is used for computing IoU values IoU of candidate frames and true bounding frames of targets in the selected search image gt1 The method comprises the steps of carrying out a first treatment on the surface of the If IoU gt1 >U 1 Inputting the candidate frame into a first boundary frame regression unit for optimization to obtain an optimized candidate frame BB 1n N=1, 2, …, N1 are N in the search imageThe number of candidate frames obtained after the candidate frames are subjected to the first-stage optimization treatment;
Then perform the second-stage optimization: obtain the second fusion feature from BB_1n and input it into the second head network; input the second encoded fusion feature output by the second head network, in parallel, into the second IoU prediction unit and the true-IoU calculation module, which at this stage computes IoU_gt2, the IoU between BB_1n and the true bounding box of the target; if IoU_gt2 > U_2, input candidate box BB_1n into the second bounding box regression unit for optimization to obtain an optimized candidate box BB_2m, m = 1, 2, …, N2, where N2 is the number of candidate boxes remaining after the N1 first-stage-optimized candidate boxes undergo the second-stage optimization;
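During training (step S12) the gate is the true IoU with the ground-truth box rather than the predicted IoU. A minimal sketch, with `iou` and `regress` as hypothetical stand-ins for the true-IoU calculation module and the bounding box regression units:

```python
def training_stage(boxes, gt_box, threshold, iou, regress):
    """Regress only the candidates whose TRUE IoU with the ground truth
    exceeds the stage threshold (training counterpart of steps S4/S5)."""
    return [regress(b) for b in boxes if iou(b, gt_box) > threshold]

def two_stage_training_pass(boxes, gt_box, iou, regress, u1=0.5, u2=0.7):
    """One pass over a sample: N candidates -> N1 first-stage boxes BB_1n
    -> N2 second-stage boxes BB_2m, with a rising threshold."""
    bb1 = training_stage(boxes, gt_box, u1, iou, regress)   # N1 boxes
    bb2 = training_stage(bb1, gt_box, u2, iou, regress)     # N2 boxes
    return bb1, bb2
```

Because the first regression improves box quality, a box can fail the 0.7 gate initially yet pass it after the first-stage refinement, which is the mechanism the patent relies on to keep the second stage supplied with positive samples.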
s13, optimizing parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit and the second bounding box regression unit by minimizing a loss function;
the loss function is:
where t represents the current number of training epochs,and->Representing IoU loss in the first stage and IoU loss in the second stage, respectively, for the t-1 generation training:
therein IoU 1i First predicted IoU value corresponding to ith candidate box of search image in sample, ioU 2n A second predicted IoU value corresponding to the nth candidate box representing the search image after the first-stage optimization;
and->Respectively representing the optimization error of the first boundary box regression unit and the optimization error of the second boundary box regression unit during t-1 generation training, +.>
Wherein BB is 1n Representing an nth candidate box, BB, of a search image in a sample after the search image is optimized in a first stage 2m Representing an mth candidate frame of the search image after the second-stage optimization; BB (BB) gt Representing a true bounding box of the object in the search image;mean values of the optimization errors of the first bounding box regression units obtained by training of the 1 to t-1 generation are shown.
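A sketch of per-stage loss terms consistent with the description above; the squared-error IoU loss and the L1 regression error are assumed forms, since the exact formulas are not given in this text:

```python
def iou_prediction_loss(pred_ious, true_ious):
    """Mean squared error between predicted and true IoU values
    (an assumed form of L_IoU for one stage)."""
    return sum((p - t) ** 2 for p, t in zip(pred_ious, true_ious)) / len(pred_ious)

def regression_error(boxes, gt_box):
    """Mean L1 distance between refined boxes and the true bounding box
    (an assumed form of L_reg for one stage)."""
    return sum(sum(abs(b[k] - gt_box[k]) for k in range(4))
               for b in boxes) / len(boxes)
```

The total loss would weight the two stages' `iou_prediction_loss` and `regression_error` values using the previous epoch's statistics, as described above.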
On the other hand, the invention also discloses a system for realizing the multi-level regression target tracking method based on sample balance, which comprises the following steps:
a reference image shallow feature extractor 1 for extracting shallow features R of a reference image 1
A reference image deep feature extractor 2 for extracting deep features R of the reference image 2
A reference image shallow PrPool layer 3 for R-dependent 1 Acquiring shallow layer characteristics a in target frame in reference image 1
A reference image deep PrPool layer 4 for R-dependent 2 Obtaining deep layer characteristic a in target frame in reference image 2
A candidate frame generation module 5 for acquiring an initial target frame in the search image, and disturbing the initial target frame in the search image to generate a plurality of candidate frames B 0i
A search image shallow feature extractor 6 for extracting shallow features S of the search image 1
A search image deep feature extractor 7 for extracting deep features S of the search image 2
Searching the image shallow PrPool layer 8 for the image according to S 1 Obtaining search image candidate frame B 0i Internal shallow features b 1i
Search image deep PrPool layer 9 for according to S 2 Obtaining search image candidate frame B 0i Deep-inside features b 2i
A first fusion feature acquisition module 10 for acquiring a 1 And b 1i Channel multiplication is performed to multiply a 2 And b 2i Performing channel multiplication; the two results of channel multiplication are regulated to be the same size and then are cascaded, so that a candidate frame B is obtained 0i Corresponding first fusion feature f i
A first optimizing module 11, configured to perform a first-stage optimization on candidate frames in the search image: incorporating the first fusion feature f i Inputting the first code fusion characteristic f into a first head network i 'A'; will f i ' input first IoU prediction unit, get candidate frame B 0i First predicted IoU value u i The method comprises the steps of carrying out a first treatment on the surface of the If u is i >U 1 For candidate frame B 0i Optimizing by adopting a first bounding box regression unit to obtain an optimized candidate frame B 1i ;U 1 IoU threshold for the first IoU prediction unit;
a second optimizing module 12, configured to perform a second-stage optimization on the candidate frame after the first-stage optimization: according to S respectively 1 And S is 2 Obtaining optimized candidate frame B in search image by PrPool layer 1i Shallow features b' 1i And deep features b' 2i
Will a 1 And b' 1i Channel multiplication is performed to multiply a 2 And b' 2i Performing channel multiplication; the two results of channel multiplication are regulated to be the same size and then are cascaded, so that an optimized candidate frame B is obtained 1i Corresponding second fusion feature g i The method comprises the steps of carrying out a first treatment on the surface of the Fusing the second fusion feature g i Inputting the second code fusion characteristic g 'into a second head network' i
Will g' i Inputting the second IoU prediction unit to obtain B 1i A second predicted IoU value v of (2) i The method comprises the steps of carrying out a first treatment on the surface of the If v i >U 2 For B 1i Optimizing by adopting a second bounding box regression unit to obtain an optimized candidate frame B 2i ;U 2 IoU threshold for the second IoU prediction unit, and U 2 >U 1
The final target frame obtaining module 13 is configured to select M candidate frames with the largest second prediction IoU value from the multiple optimized candidate frames obtained by processing the N candidate frames in the search image by the first optimizing module 11 and the second optimizing module 12, and average the M candidate frames to be used as the final target frame of the search image.
Further, the target tracking system also includes a loss function calculation module 14 for calculating the loss function value used to train the parameters in the first IoU prediction unit, the second IoU prediction unit, the first bounding box regression unit and the second bounding box regression unit;
the loss function is:
where t represents the current number of training epochs,and->Representing IoU loss in the first stage and IoU loss in the second stage, respectively, for the t-1 generation training:
therein IoU 1i First predicted IoU value corresponding to nth candidate box of search image in sample, ioU 2n A second predicted IoU value corresponding to the nth candidate box representing the search image after the first-stage optimization IoU gt1 And IoU gt2 Respectively, search imagesTrue IoU values of candidate boxes in the first and second stage optimizations;
and->Respectively representing the optimization error of the first boundary box regression unit and the optimization error of the second boundary box regression unit during t-1 generation training, +.>
Wherein BB is 1n Representing an nth candidate box, BB, of a search image in a sample after the search image is optimized in a first stage 2m Representing an mth candidate frame of the search image after the second-stage optimization; BB (BB) gt Representing a true bounding box of the object in the search image;
mean values of the optimization errors of the first bounding box regression units obtained by training of the 1 to t-1 generation are shown.
The beneficial effects are that: the disclosed sample-balance-based multi-level regression target tracking method and tracking system design a multi-level regression network that raises the IoU threshold of the candidate boxes stage by stage through a two-stage cascaded localization process. The first optimization stage sets a smaller IoU threshold to increase the number of positive samples (candidate boxes whose IoU exceeds the threshold are marked as positive samples), thereby balancing the training samples. After the first positional-regression stage the quality of the candidate boxes is improved, so the second optimization stage can raise the IoU threshold while still retaining a large number of positive samples, and the further localization regression of the candidate boxes improves their regression accuracy. In summary, by setting different IoU thresholds at different stages, the invention alleviates the sample-balance problem and improves localization accuracy through stage-by-stage localization.
Drawings
FIG. 1 is a flow chart of a sample balance-based multi-level regression target tracking method disclosed by the invention;
FIG. 2 is a schematic diagram of the composition of a sample balance based multi-level regression target tracking system;
FIG. 3 is a schematic diagram of the composition of a two-stage optimization module;
FIG. 4 is a process flow diagram of a two-stage optimization stage in the training process.
Detailed Description
The invention is further elucidated below in connection with the drawings and the detailed description.
The invention discloses a sample-balance-based multi-level regression target tracking method; a flow chart of the method is shown in FIG. 1, and FIG. 2 is a schematic diagram of the composition of a tracking system for realizing the target tracking method. The target tracking method comprises the following steps:
S1, extract the shallow features R_1 and deep features R_2 of a reference image; from R_1 and R_2 respectively, use a PrPool layer to obtain the shallow features a_1 and deep features a_2 of the region inside the target box in the reference image.
S2, extract the shallow features S_1 and deep features S_2 of the search image; obtain an initial target box in the search image, and perturb it to generate a number of candidate boxes B_0i, i = 1, 2, …, N, where N is the number of candidate boxes in the search image.
In step S2, an ATOM-based online classifier is used to obtain the initial target box in the search image. The online classifier is described in detail in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669. This classifier can obtain the approximate location of the target. The candidate box generation module 5 then perturbs the initial target box in the search image to generate a number of candidate boxes B_0i.
In steps S1 and S2 of this embodiment, a shallow extractor and a deep extractor with shared parameters are used to obtain the two-scale trunk features of the reference image and the search image. Specifically, the reference image shallow feature extractor 1 and the search image shallow feature extractor 6 are both shallow feature extractors formed by sequentially connecting the initial convolution layer and Block1 of ResNet-50 with two convolution layers; the reference image deep feature extractor 2 and the search image deep feature extractor 7 are deep feature extractors formed by sequentially connecting Block2-Block4 of ResNet-50 with two convolution layers. ResNet-50 is described in detail in: [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770-778.
S3, from S_1 and S_2 respectively, use the PrPool layer to obtain the shallow and deep features inside each candidate box in the search image; the shallow features inside the i-th candidate box B_0i are denoted b_1i and the deep features b_2i.
The PrPool layers used in the reference image shallow PrPool layer 3, the reference image deep PrPool layer 4, the search image shallow PrPool layer 8 and the search image deep PrPool layer 9 are described in detail in document [6]: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669.
Multiply the shallow features a_1 of the region inside the target box of the reference image channel-wise with the shallow features b_1i of the candidate box, and multiply the deep features a_2 of the region inside the target box of the reference image channel-wise with the deep features b_2i of the candidate box; adjust the two products to the same size and concatenate them to obtain the first fusion feature f_i corresponding to candidate box B_0i. This function is implemented by the first fusion feature acquisition module 10; in FIG. 2 the corresponding symbols denote channel multiplication, size adjustment and concatenation respectively.
S4, perform the first-stage optimization of the candidate boxes in the search image: input the first fusion feature f_i into a first head network to obtain the first encoded fusion feature f_i'; input f_i' into a first IoU prediction unit to obtain the first predicted IoU value u_i of candidate box B_0i; if u_i > U_1, optimize candidate box B_0i with a first bounding box regression unit to obtain an optimized candidate box B_1i, where U_1 is the IoU threshold of the first IoU prediction unit.
At this stage the IoU threshold U_1 is set to 0.5; that is, candidate positions whose first predicted IoU value is greater than 0.5 are marked as preliminarily screened candidate boxes, and the first bounding box regression unit is used to optimize the positions of these preliminarily screened candidate boxes. Because the IoU threshold at this stage is low, more screening results are retained; the number of boxes remaining after the first-stage optimization of the N candidate boxes in the search image is no greater than N. The candidate boxes obtained by this optimization are denoted B_1i.
S5, perform the second-stage optimization of the candidate boxes optimized in the first stage: from S_1 and S_2 respectively, use the PrPool layer to obtain the shallow features b'_1i and deep features b'_2i of the optimized candidate box B_1i in the search image.
Multiply a_1 and b'_1i channel-wise, and multiply a_2 and b'_2i channel-wise; adjust the two products to the same size and concatenate them to obtain the second fusion feature g_i corresponding to the optimized candidate box B_1i; input g_i into a second head network to obtain the second encoded fusion feature g'_i.
Input g'_i into the second IoU prediction unit to obtain the second predicted IoU value v_i of B_1i; if v_i > U_2, optimize B_1i with a second bounding box regression unit to obtain an optimized candidate box B_2i, where U_2 is the IoU threshold of the second IoU prediction unit and U_2 > U_1.
At this stage, the IoU threshold U_2 is set to 0.7. The higher threshold means the screened candidate frames are of higher quality, and the candidate frames B_2i optimized from them are of higher quality still; the quality of the candidate frames is thus improved progressively.
The first head network and the second head network are each a small network placed after the backbone network. In the invention, both head networks consist of several convolution layers and a fully connected layer cascaded in sequence, and output features of fixed size and dimension.
S6, after the N candidate frames in the search image pass through steps S4 and S5, a number of optimized candidate frames are obtained; select the M candidate frames with the largest second predicted IoU values and take their average as the final target frame of the search image.
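The S6 selection can be sketched as follows. The (x1, y1, x2, y2) box format is an assumption for illustration; the selection logic itself (top-M by predicted IoU, element-wise average) is as described above.

```python
import numpy as np

def final_target_frame(boxes, iou_scores, M=3):
    """Average the M candidate frames with the largest second predicted IoU."""
    top = np.argsort(iou_scores)[-M:]   # indices of the M largest scores
    return boxes[top].mean(axis=0)      # element-wise average of those boxes

boxes = np.array([[0, 0, 10, 10],
                  [2, 2, 12, 12],
                  [1, 1, 11, 11],
                  [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.85, 0.1])
final = final_target_frame(boxes, scores)   # averages the three high-IoU boxes
```

Averaging several high-confidence candidates instead of taking the single best one smooths out individual regression errors.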
Steps S4 and S5 are carried out by the first optimization module 11 and the second optimization module 12, respectively; their structures are shown in Fig. 3 (a) and (b). The final target frame acquisition module 13 selects the M candidate frames with the largest second predicted IoU values and averages them to obtain the final target frame of the search image.
S7, when tracking a target through a video, take the current search image as the reference image and the next frame as the new search image, then re-execute steps S1 to S6; repeating this process realizes target tracking in the video.
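The S7 frame-to-frame loop can be sketched as below. Here `locate` stands in for the entire S1–S6 pipeline (a placeholder assumption): given a reference image, its target frame, and a search image, it returns the final target frame of the search image.

```python
def track(frames, initial_box, locate):
    """Run the tracker over a list of frames, rolling the reference forward."""
    ref, ref_box = frames[0], initial_box
    results = [initial_box]
    for search in frames[1:]:
        box = locate(ref, ref_box, search)  # steps S1-S6 on this frame pair
        ref, ref_box = search, box          # S7: search image becomes reference
        results.append(box)
    return results
```

The key property is that each frame's result becomes the next frame's reference target frame, so the tracker adapts to appearance changes frame by frame.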
In this embodiment, the first IoU prediction unit and the second IoU prediction unit have the same structure as the IoU predictor in: Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4660-4669.
The parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding-box regression unit, and the second bounding-box regression unit are trained with the following steps:
S11, construct a sample set in which each sample comprises: a reference image, a search image, the target frame in the reference image, and the ground-truth bounding box of the target in the search image;
S12, process the reference image and the search image of the sample according to steps S1 to S3, then apply a first-stage optimization similar to S4: feed the first encoded fusion feature output by the first head network in parallel into the first IoU prediction unit and the true IoU calculation module; at this stage, the true IoU calculation module computes the IoU value IoU_gt1 between each candidate frame and the ground-truth bounding box of the target in the search image; if IoU_gt1 > U_1, the candidate frame is input into the first bounding-box regression unit for optimization, yielding the optimized candidate frames BB_1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained by applying the first-stage optimization to the N candidate frames in the search image;
then apply a second-stage optimization similar to S5: from BB_1n obtain the second fusion feature and input it into the second head network; feed the second encoded fusion feature output by the second head network in parallel into the second IoU prediction unit and the true IoU calculation module, which at this stage computes the IoU value IoU_gt2 between BB_1n and the ground-truth bounding box of the target; if IoU_gt2 > U_2, candidate frame BB_1n is input into the second bounding-box regression unit for optimization, yielding the optimized candidate frames BB_2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained by applying the second-stage optimization to the N1 first-stage candidates;
The processing flows of the first-stage and second-stage optimization during training are shown in Fig. 4 (a) and (b), respectively. They differ from S4 and S5 as follows: during training, the first encoded fusion feature output by the first head network is fed in parallel into the first IoU prediction unit and the true IoU calculation module, and the second encoded fusion feature output by the second head network is fed in parallel into the second IoU prediction unit and the true IoU calculation module; whether a candidate frame is passed to the first or second bounding-box regression unit for regression optimization is decided by the true IoU value computed by the true IoU calculation module. In other words, the first and second IoU prediction units are trained to predict the IoU score of a candidate frame, making their outputs as close as possible to the true IoU values computed by the true IoU calculation module, so that at test time, when the true candidate-frame IoU is unavailable, the predicted IoU is used instead. The first-stage optimization uses a smaller IoU threshold, which yields more positive samples for the first bounding-box regression unit to optimize and thereby keeps the training samples balanced; the second-stage optimization uses a larger IoU threshold, which raises the quality of the positive samples again so that the second bounding-box regression unit obtains higher-quality candidate frames.
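The true IoU calculation module used for gating during training is plain axis-aligned intersection-over-union. A sketch, with boxes given as (x1, y1, x2, y2) — a format assumption for illustration:

```python
def box_iou(a, b):
    """IoU between two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

During training, a candidate is forwarded to a regression unit only when `box_iou(candidate, ground_truth)` exceeds the stage threshold U_1 or U_2; the IoU prediction units are trained to approximate this value so that it can be replaced by the prediction at test time.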
S13, optimize the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding-box regression unit, and the second bounding-box regression unit by minimizing a loss function;
the loss function is:
where t denotes the current training epoch, and L_IoU^1(t-1) and L_IoU^2(t-1) denote the first-stage IoU loss and the second-stage IoU loss of the (t-1)-th training epoch, respectively. In these losses, IoU_1i is the first predicted IoU value of the i-th candidate frame of the search image in the sample, and IoU_2n is the second predicted IoU value of the n-th candidate frame of the search image after the first-stage optimization.
L_reg^1(t-1) and L_reg^2(t-1) denote the optimization errors of the first and the second bounding-box regression unit in the (t-1)-th training epoch, where BB_1n is the n-th candidate frame of the search image after the first-stage optimization, BB_2m is the m-th candidate frame after the second-stage optimization, and BB_gt is the ground-truth bounding box of the target in the search image.
The mean of the optimization errors of the first bounding-box regression unit over training epochs 1 to t-1 also enters the loss.
In the last term of the loss function, the coefficients of L_reg^1 and L_reg^2 are inversely related. Early in training, the first-stage optimization therefore has the larger influence on the loss; as the first-stage bounding-box regression unit trains and its optimization error shrinks, the influence of the second-stage optimization on the loss gradually grows. Later in training, the candidate positions are of better quality and the positive samples more numerous, so the weight of the second stage increases while sample balance is maintained. By cascading multiple regression networks, the IoU threshold is raised step by step across the localization stages: the earlier stage uses a smaller IoU threshold to increase the number of positive samples, realizing training-sample balance, and the first localization regression improves the quality of the candidate frames. Raising the IoU threshold in the second stage can therefore keep the number of positive samples essentially unchanged while the candidates undergo a further localization regression, improving regression accuracy. Localization accuracy is thus improved progressively while the samples stay balanced.
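A hedged sketch of this stage weighting: the weight on the second-stage regression loss grows as the running mean of the first-stage regression errors (epochs 1 to t-1) shrinks. The patent gives the exact coefficient formula only as an image, so this particular normalization is an assumption that merely reproduces the described behavior.

```python
def stage_weights(reg1_errors):
    """Weights for the two regression-loss terms from the running mean of
    first-stage errors over past epochs (an assumed normalization)."""
    mean_err = sum(reg1_errors) / len(reg1_errors)
    w1 = mean_err / (1.0 + mean_err)  # shrinks as the first-stage error shrinks
    w2 = 1.0 - w1                     # the second stage takes over correspondingly
    return w1, w2

early = stage_weights([0.9, 0.8])    # large first-stage errors early in training
late = stage_weights([0.1, 0.05])    # small errors once stage one has converged
```

The weights sum to one, so the schedule only redistributes emphasis between the two stages rather than rescaling the total loss.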

Claims (10)

1. A multi-level regression target tracking method based on sample balance, characterized by comprising the following steps:
S1, extract the shallow features R_1 and the deep features R_2 of a reference image; from R_1 and R_2 respectively, use a PrPool layer to obtain the shallow features a_1 and the deep features a_2 of the region inside the target frame in the reference image;
S2, extract the shallow features S_1 and the deep features S_2 of a search image; obtain an initial target frame in the search image and perturb it to generate a plurality of candidate frames B_0i, i = 1, 2, …, N, where N is the number of candidate frames in the search image;
S3, from S_1 and S_2 respectively, use the PrPool layer to obtain the shallow and deep features inside each candidate frame in the search image, the shallow features inside the i-th candidate frame B_0i being denoted b_1i and the deep features b_2i;
multiply a_1 with b_1i channel-wise and a_2 with b_2i channel-wise; resize the two products to the same size and concatenate them to obtain the first fusion feature f_i corresponding to candidate frame B_0i;
S4, first-stage optimization of the candidate frames in the search image: input the first fusion feature f_i into the first head network to obtain the first encoded fusion feature f'_i; input f'_i into the first IoU prediction unit to obtain the first predicted IoU value u_i of candidate frame B_0i; if u_i > U_1, optimize candidate frame B_0i with the first bounding-box regression unit to obtain the optimized candidate frame B_1i, where U_1 is the IoU threshold of the first IoU prediction unit;
S5, second-stage optimization of the candidate frames produced by the first stage: from S_1 and S_2 respectively, use the PrPool layer to obtain the shallow features b'_1i and the deep features b'_2i of the optimized candidate frame B_1i in the search image;
multiply a_1 with b'_1i channel-wise and a_2 with b'_2i channel-wise; resize the two products to the same size and concatenate them to obtain the second fusion feature g_i corresponding to the optimized candidate frame B_1i; input g_i into the second head network to obtain the second encoded fusion feature g'_i;
input g'_i into the second IoU prediction unit to obtain the second predicted IoU value v_i of B_1i; if v_i > U_2, optimize B_1i with the second bounding-box regression unit to obtain the optimized candidate frame B_2i, where U_2 is the IoU threshold of the second IoU prediction unit and U_2 > U_1;
S6, after the N candidate frames in the search image pass through steps S4 and S5, a number of optimized candidate frames are obtained; select the M candidate frames with the largest second predicted IoU values and take their average as the final target frame of the search image.
2. The multi-level regression target tracking method according to claim 1, wherein in step S2 the initial target frame in the search image is obtained by an ATOM-based online classifier.
3. The multi-level regression target tracking method according to claim 1, further comprising:
S7, take the search image as the reference image and the next frame of the search image as the new search image, and re-execute steps S1 to S6 to realize target tracking in a video.
4. The multi-level regression target tracking method according to claim 1, wherein the IoU threshold U_1 of the first IoU prediction unit is 0.5 and the IoU threshold U_2 of the second IoU prediction unit is 0.7.
5. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2 the shallow features of the reference image and the search image are extracted by a shallow feature extractor consisting of the initial convolution layer of ResNet-50, Block1, and two convolution layers connected in sequence.
6. The multi-level regression target tracking method according to claim 1, wherein in steps S1 and S2 the deep features of the reference image and the search image are extracted by a deep feature extractor consisting of Block2 to Block4 of ResNet-50 and two convolution layers connected in sequence.
7. The multi-level regression target tracking method according to claim 1, wherein the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding-box regression unit, and the second bounding-box regression unit are trained by:
S11, constructing a sample set in which each sample comprises: a reference image, a search image, the target frame in the reference image, and the ground-truth bounding box of the target in the search image;
S12, processing the reference image and the search image of the sample according to steps S1 to S3, then performing a first-stage optimization: the first encoded fusion feature output by the first head network is fed in parallel into the first IoU prediction unit and the true IoU calculation module; at this stage, the true IoU calculation module computes the IoU value IoU_gt1 between each candidate frame and the ground-truth bounding box of the target in the search image; if IoU_gt1 > U_1, the candidate frame is input into the first bounding-box regression unit for optimization, yielding the optimized candidate frames BB_1n, n = 1, 2, …, N1, where N1 is the number of candidate frames obtained by applying the first-stage optimization to the N candidate frames in the search image;
performing a second-stage optimization: from BB_1n the second fusion feature is obtained and input into the second head network; the second encoded fusion feature output by the second head network is fed in parallel into the second IoU prediction unit and the true IoU calculation module, which at this stage computes the IoU value IoU_gt2 between BB_1n and the ground-truth bounding box of the target; if IoU_gt2 > U_2, candidate frame BB_1n is input into the second bounding-box regression unit for optimization, yielding the optimized candidate frames BB_2m, m = 1, 2, …, N2, where N2 is the number of candidate frames obtained by applying the second-stage optimization to the N1 first-stage candidates;
S13, optimizing the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding-box regression unit, and the second bounding-box regression unit by minimizing a loss function;
the loss function is:
where t denotes the current training epoch, and L_IoU^1(t-1) and L_IoU^2(t-1) denote the first-stage IoU loss and the second-stage IoU loss of the (t-1)-th training epoch, respectively; IoU_1i is the first predicted IoU value of the i-th candidate frame of the search image in the sample, and IoU_2n is the second predicted IoU value of the n-th candidate frame of the search image after the first-stage optimization;
L_reg^1(t-1) and L_reg^2(t-1) denote the optimization errors of the first and the second bounding-box regression unit in the (t-1)-th training epoch, where BB_1n is the n-th candidate frame of the search image after the first-stage optimization, BB_2m is the m-th candidate frame after the second-stage optimization, and BB_gt is the ground-truth bounding box of the target in the search image;
the mean of the optimization errors of the first bounding-box regression unit over training epochs 1 to t-1 also enters the loss.
8. A multi-level regression target tracking system based on sample balance, characterized by comprising:
a reference image shallow feature extractor (1) for extracting the shallow features R_1 of a reference image;
a reference image deep feature extractor (2) for extracting the deep features R_2 of the reference image;
a reference image shallow PrPool layer (3) for obtaining, from R_1, the shallow features a_1 inside the target frame in the reference image;
a reference image deep PrPool layer (4) for obtaining, from R_2, the deep features a_2 inside the target frame in the reference image;
a candidate frame generation module (5) for obtaining an initial target frame in a search image and perturbing it to generate a plurality of candidate frames B_0i;
a search image shallow feature extractor (6) for extracting the shallow features S_1 of the search image;
a search image deep feature extractor (7) for extracting the deep features S_2 of the search image;
a search image shallow PrPool layer (8) for obtaining, from S_1, the shallow features b_1i inside candidate frame B_0i of the search image;
a search image deep PrPool layer (9) for obtaining, from S_2, the deep features b_2i inside candidate frame B_0i of the search image;
a first fusion feature acquisition module (10) for multiplying a_1 with b_1i channel-wise and a_2 with b_2i channel-wise, resizing the two products to the same size, and concatenating them to obtain the first fusion feature f_i corresponding to candidate frame B_0i;
a first optimization module (11) for first-stage optimization of the candidate frames in the search image: the first fusion feature f_i is input into the first head network to obtain the first encoded fusion feature f'_i; f'_i is input into the first IoU prediction unit to obtain the first predicted IoU value u_i of candidate frame B_0i; if u_i > U_1, candidate frame B_0i is optimized with the first bounding-box regression unit to obtain the optimized candidate frame B_1i, where U_1 is the IoU threshold of the first IoU prediction unit;
a second optimization module (12) for second-stage optimization of the candidate frames produced by the first stage: from S_1 and S_2 respectively, the PrPool layer obtains the shallow features b'_1i and the deep features b'_2i of the optimized candidate frame B_1i in the search image;
a_1 is multiplied with b'_1i channel-wise and a_2 with b'_2i channel-wise; the two products are resized to the same size and concatenated to obtain the second fusion feature g_i corresponding to the optimized candidate frame B_1i; g_i is input into the second head network to obtain the second encoded fusion feature g'_i;
g'_i is input into the second IoU prediction unit to obtain the second predicted IoU value v_i of B_1i; if v_i > U_2, B_1i is optimized with the second bounding-box regression unit to obtain the optimized candidate frame B_2i, where U_2 is the IoU threshold of the second IoU prediction unit and U_2 > U_1;
and a final target frame acquisition module (13) for selecting, among the optimized candidate frames obtained by processing the N candidate frames of the search image through the first optimization module (11) and the second optimization module (12), the M candidate frames with the largest second predicted IoU values and averaging them into the final target frame of the search image.
9. The multi-level regression target tracking system according to claim 8, wherein the reference image shallow feature extractor (1) and the search image shallow feature extractor (6) each consist of the initial convolution layer of ResNet-50, Block1, and two convolution layers connected in sequence.
10. The multi-level regression target tracking system according to claim 8, further comprising a loss function calculation module (14) for calculating the loss function value used to train the parameters of the first IoU prediction unit, the second IoU prediction unit, the first bounding-box regression unit, and the second bounding-box regression unit;
the loss function is:
where t denotes the current training epoch, and L_IoU^1(t-1) and L_IoU^2(t-1) denote the first-stage IoU loss and the second-stage IoU loss of the (t-1)-th training epoch, respectively; IoU_1i is the first predicted IoU value of the i-th candidate frame of the search image in the sample, IoU_2n is the second predicted IoU value of the n-th candidate frame of the search image after the first-stage optimization, and IoU_gt1 and IoU_gt2 are the true IoU values of the candidate frames in the first-stage and second-stage optimization of the search image, respectively;
L_reg^1(t-1) and L_reg^2(t-1) denote the optimization errors of the first and the second bounding-box regression unit in the (t-1)-th training epoch, where BB_1n is the n-th candidate frame of the search image after the first-stage optimization, BB_2m is the m-th candidate frame after the second-stage optimization, and BB_gt is the ground-truth bounding box of the target in the search image;
the mean of the optimization errors of the first bounding-box regression unit over training epochs 1 to t-1 also enters the loss.
CN202210394687.6A 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system Active CN114757970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394687.6A CN114757970B (en) 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system


Publications (2)

Publication Number Publication Date
CN114757970A CN114757970A (en) 2022-07-15
CN114757970B true CN114757970B (en) 2024-03-08

Family

ID=82330152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394687.6A Active CN114757970B (en) 2022-04-15 2022-04-15 Sample balance-based multi-level regression target tracking method and tracking system

Country Status (1)

Country Link
CN (1) CN114757970B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
CN112215080A (en) * 2020-09-16 2021-01-12 电子科技大学 Target tracking method using time sequence information
CN112215079A (en) * 2020-09-16 2021-01-12 电子科技大学 Global multistage target tracking method
WO2021208502A1 (en) * 2020-04-16 2021-10-21 中国科学院深圳先进技术研究院 Remote-sensing image target detection method based on smooth bounding box regression function


Non-Patent Citations (2)

Title
A survey of tracking algorithms based on Siamese networks; Xiong Changzhen; Li Yan; Industrial Control Computer; 2020-03-25 (No. 03); full text *
Strongly coupled Siamese region proposal network target tracking algorithm based on joint optimization; Shi Guoqiang; Zhao Xia; Journal of Computer Applications; 2020-10-10 (No. 10); full text *


Similar Documents

Publication Publication Date Title
CN113220919B (en) Dam defect image text cross-modal retrieval method and model
CN112507996B (en) Face detection method of main sample attention mechanism
CN110033473B (en) Moving target tracking method based on template matching and depth classification network
CN110263666B (en) Action detection method based on asymmetric multi-stream
CN112132856B (en) Twin network tracking method based on self-adaptive template updating
CN112329760A (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN112085765B (en) Video target tracking method combining particle filtering and metric learning
CN110728694A (en) Long-term visual target tracking method based on continuous learning
CN112446900B (en) Twin neural network target tracking method and system
CN116188528B (en) RGBT unmanned aerial vehicle target tracking method and system based on multi-stage attention mechanism
CN112215080A (en) Target tracking method using time sequence information
CN114757970B (en) Sample balance-based multi-level regression target tracking method and tracking system
CN114267060A (en) Face age identification method and system based on uncertain suppression network model
CN116935411A (en) Radical-level ancient character recognition method based on character decomposition and reconstruction
Xiang et al. Transformer-based person search model with symmetric online instance matching
CN110991565A (en) Target tracking optimization algorithm based on KCF
CN116168060A (en) Deep twin network target tracking algorithm combining element learning
CN113538507B (en) Single-target tracking method based on full convolution network online training
CN109165587A (en) intelligent image information extraction method
CN115359335A (en) Training method of visual target detection network model
CN113808170B (en) Anti-unmanned aerial vehicle tracking method based on deep learning
CN113901846B (en) Video guidance machine translation method based on space-time attention
CN113223507B (en) Abnormal speech recognition method based on double-input mutual interference convolutional neural network
CN110059584B (en) Event naming method combining boundary distribution and correction
CN112949671B (en) Signal classification method and system based on unsupervised feature optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant