CN112150510A - Stepping target tracking method based on double-depth enhanced network - Google Patents

Stepping target tracking method based on double-depth enhanced network

Info

Publication number
CN112150510A
CN112150510A CN202011057357.5A CN202011057357A
Authority
CN
China
Prior art keywords
tnet
target
tracking
network
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011057357.5A
Other languages
Chinese (zh)
Other versions
CN112150510B (en)
Inventor
陆永安
赵柯
王暐
张波
刘传玲
周铁军
张华
付飞亚
李嘉
计宇
张乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pla 63875 Unit
Original Assignee
Pla 63875 Unit
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pla 63875 Unit filed Critical Pla 63875 Unit
Priority to CN202011057357.5A priority Critical patent/CN112150510B/en
Publication of CN112150510A publication Critical patent/CN112150510A/en
Application granted granted Critical
Publication of CN112150510B publication Critical patent/CN112150510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a stepping target tracking method based on a double-depth enhanced network. A target tracking network (TNet) is designed to extract deep convolutional features of the target, and the TNet is trained offline, including supervised pre-training and reinforcement-learning training. A tracking-result evaluation network (ENet) is designed and trained to output an online sampling behavior during tracking and control the updating process of the TNet. During tracking, the TNet is used to locate the target, different training samples are drawn according to the ENet's evaluation of the current tracking result, the TNet is adjusted and updated online, and tracking then proceeds to the next frame; the tracking frame is adjusted step by step until it becomes the minimum circumscribed rectangle of the target. The method adapts better to target deformation and enhances the robustness and stability of tracking.

Description

Stepping target tracking method based on double-depth enhanced network
Technical Field
The invention relates to the field of visual tracking, in particular to accurate target tracking based on deep reinforcement learning.
Background
The task of a visual target tracking algorithm is to predict the new state of the target in subsequent frames, given the position, size and other state of the target object in the first frame of a video. With the great success of deep learning, especially convolutional neural networks (CNN), in image classification and object detection, most existing target tracking algorithms use a pre-trained CNN to extract image features. The article "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning" by Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun and Jin Young Choi discloses the ADNet tracking algorithm, which was the first to embed tracking into a reinforcement learning framework while exploiting the target-expression capability of a CNN. Its characteristic is that the tracking frame is controlled by a sequence of behaviors obtained through deep reinforcement learning; the decision process consumes little time, accurate target tracking is achieved by adjusting the position of the tracking frame, and the algorithm obtained the best results of its year on the OTB100 data set.
However, in the ADNet tracking process, the aspect ratio of the target is fixed, so when the target undergoes large deformations such as rotation it cannot be tracked; moreover, the online training of the network model uses a fixed time interval and a fixed number of training iterations, so when occlusion or interference occurs the ADNet network is updated by mistake, the model degrades, and tracking drifts or even fails.
In view of the above problems in the related art, no effective solution has been found.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a stepping target tracking method based on a double-depth enhanced network. On the basis of the existing ADNet network, it improves the structure of the network output layer and adds actions for independent scaling in the length and width directions of the target window, so as to better adapt to target deformation; it also introduces a new tracking-state evaluation network based on a deep enhancement network to guide the online updating process of the tracking network.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1, designing a target tracking network TNet for extracting deep convolutional features of the target; performing offline training on the TNet, wherein the offline training comprises supervised pre-training and reinforcement-learning training;
step 2, designing and training a tracking-result evaluation network ENet, which outputs an online sampling behavior in the tracking process to control the updating process of the TNet;
and step 3, in the tracking process, positioning the target by using the TNet, sampling different training samples according to the evaluation of the ENet on the current tracking result, adjusting and updating the TNet online, then entering the next frame for tracking, and repeating this step, gradually adjusting the tracking frame to become the minimum circumscribed rectangle of the target.
In step 1, the TNet input is an image block; deep convolutional features of the image are extracted through three or more convolutional layers and then passed through two or more fully connected layers to a behavior output layer and a target-confidence output layer. The behavior output layer outputs displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a termination action {stop}; the target-confidence output layer outputs the confidence of the corresponding behavior.
The scales in the height direction and the width direction in the 4 scale changes are independently changed.
In step 1, the TNet is pre-trained on a public object detection data set using a supervised learning method, and the objective function is defined as a multi-task cross-entropy loss
L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*), where L_cross-entropy denotes the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the target-confidence output layer and the behavior output layer of the network, conf* and act* are the corresponding ground-truth values, and λ1 represents the weight distribution between the two losses.
The value range of λ1 is [0.55, 0.73].
In step 1, the network parameters of the pre-trained TNet convolutional layers are fixed, and the fully connected layers of the TNet are trained by reinforcement learning on multi-frame image sequences; in each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth.
In step 2, the layers of the ENet other than the output layer have the same structure as those of the TNet, and all convolutional-layer parameters are shared; the input is the image block inside the tracking frame of the TNet in the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer. The output of the ENet is a sampling behavior {sample_suf, sample_neg, sample_none} that determines how the TNet training data are sampled at the current frame; by carrying out different sampling behaviors, the online fine-tuning samples of the TNet change accordingly.
In step 2, the ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the parameters of the other layers are initialized from the TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, training simulates the tracking process, and the final reward function is
(given as an equation image in the original), where BB_TNet+ENet is the tracking result of the TNet after using the ENet evaluation, BB_TNet is the tracking result of the TNet alone in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes.
In step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn from a circular area whose distance from the target center is less than a set threshold, negative samples are drawn from an annular area whose distance from the target center is greater than the set threshold, and the TNet is fine-tuned accordingly. From the second frame of the video onward, the target area obtained by TNet tracking is input into the first convolutional layer of the ENet, and the output of the TNet in the last iteration of each frame is concatenated with the output of the fc5 layer of the ENet as the input of the fc6 layer to obtain the predicted sampling behavior; the training sample set of the TNet is adjusted according to this sampling behavior.
In step 3, the state of the target is adjusted by the TNet according to the following formula
(the state-adjustment formula is given as an equation image in the original), where i is the number of iterations of the stepwise adjustment, {c_x, c_y, h, w} are the center coordinates and the height and width of the target, and a_i is the behavior predicted by the TNet in the current iteration; the state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not changed;
the training sample set of the TNet is adjusted according to the sampling behavior in the way of
Figure BDA0002711202100000033
Wherein P ist、NtRespectively positive and negative sample sets sampled at the current frame, U is set merging operation, PostAnd NegtOn-line fine-tuning training samples for the t-th frame TNet, { samplesuf,sampleneg,samplenoneThree output behaviors for ENet。
The invention has the beneficial effects that:
Firstly, the method fully considers the arbitrariness of the tracked object and designs a complete set of tracking-frame adjustment behaviors: pre-training gives the tracking network TNet target-expression capability, simulating the tracking process with reinforcement learning gives the TNet the ability to adjust the tracking frame iteratively, and online fine-tuning lets the TNet adaptively track a specific target object, which solves the problem in the prior art that tracking cannot remain accurate when the target shape changes severely.
Secondly, a new deep enhancement network ENet is introduced to control the online fine-tuning of the TNet and guide the online updating process of the tracking network, which avoids model degradation caused by wrong or improper sampling for the TNet, solves the problem in the prior art of tracking failure when the target is occluded or interfered with, especially under long-time occlusion, and enhances tracking robustness and stability.
Drawings
Fig. 1 is a flowchart of an implementation of a step-by-step target tracking algorithm based on a dual-depth enhanced network on a frame image.
Fig. 2 is a schematic diagram of a step-by-step target tracking algorithm based on a dual-depth enhanced network.
FIG. 3 is a comparison between the method of the present invention and ADNet on OTB100, wherein (a) is the comparison of the precision curves and (b) is the comparison of the success-rate curves.
Detailed Description
The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.
Aiming at the problems of real-time changes in target appearance and long-time occlusion faced by tracking algorithms, the invention provides a stepping target tracking algorithm based on a double-depth enhanced network.
The invention provides a stepping target tracking method based on a double-depth enhanced network, which comprises the following steps:
Step 1, designing a target tracking network (TNet). The network can extract deep convolutional features of the target, give a sequence of movement behaviors of the tracking frame according to the extracted features, and gradually adjust the tracking frame to the minimum circumscribed rectangle of the target. For this purpose, the TNet is trained offline, including supervised pre-training and reinforcement-learning training.
Step 2, designing and training a tracking-result evaluation network (ENet), which outputs an online sampling behavior during tracking and controls the updating process of the TNet.
Step 3, during tracking, using the TNet to accurately locate the target step by step, then sampling different training samples according to the evaluation of the ENet on the current tracking result, fine-tuning and updating the TNet online, then entering the next frame of tracking, and repeating this step.
The TNet input in step 1 is an image block; deep convolutional features of the image are fully extracted through three or more convolutional layers and then passed through two or more fully connected layers to a behavior output layer and a target-confidence output layer. The behavior output layer outputs displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a termination action {stop}; the confidence output layer outputs the confidence of the corresponding behavior. The height and width scales in the behavior output can change independently.
In step 1, in order to give the convolutional layers of the TNet stronger image-feature extraction capability, the network is pre-trained on a public object detection data set using a traditional supervised learning method, with the objective function defined as a multi-task cross-entropy loss,
L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*)
where L_cross-entropy represents the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the target-confidence output layer and the behavior output layer of the network, conf* and act* are the corresponding ground-truth values, and λ1 represents the weight distribution between the two losses, with value range [0.55, 0.73]; the pre-training stage focuses more on target confidence, so λ1 takes values greater than 0.5.
The pre-trained TNet convolutional layers already have good target-expression capability, so their network parameters are fixed, and the fully connected layers (except the target-confidence layer) are trained by reinforcement learning on multi-frame image sequences. In each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth.
In step 2, the layers of the ENet other than the output layer have the same structure as those of the TNet, and all convolutional-layer parameters are shared; the input is the image block inside the tracking frame of the TNet in the current frame, and at the same time the historical tracking results of previous frames are concatenated after the output of the penultimate layer to improve the rationality of the evaluation. The output of the ENet is a sampling behavior {sample_suf, sample_neg, sample_none} that determines the degree and manner in which TNet training data are sampled at the current frame, reflecting the ENet's evaluation of the current tracking result. By carrying out different sampling behaviors, the online fine-tuning samples of the TNet change accordingly, and in this way the ENet evaluates and controls the TNet tracking result.
In step 2, the ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the parameters of the other layers are initialized from the TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, and training simulates the tracking process. The final reward function is
(given as an equation image in the original), where BB_TNet+ENet is the tracking result of the TNet after using the ENet evaluation, BB_TNet is the tracking result of the TNet alone in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes. The formula means that the more the tracking performance is improved after using the ENet, the larger the reward, and the network is penalized once the performance decreases. Finally, the fully connected layer parameters are optimized with stochastic gradient descent.
In step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn near the target center and negative samples are drawn in an annular area away from the center, and the TNet is fine-tuned online so that it better adapts to the target to be tracked.
The state adjustment of the target by the TNet in step 3 is realized by the following formula (given as an equation image in the original), where i is the number of iterations of the stepwise adjustment, {c_x, c_y, h, w} are the center coordinates and the height and width of the target, and a_i is the behavior predicted by the TNet in the current iteration. The state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not changed.
In step 3, the ENet parameters are fixed. Starting from the second frame of the video, the target area obtained by TNet tracking is input into the first convolutional layer conv1 of the ENet, and the output of the TNet in the last iteration of each frame is concatenated with the output of the fc5 layer of the ENet as the input of the fc6 layer to obtain the predicted sampling behavior. The training sample set of the TNet is adjusted according to the sampling behavior as follows
(given as an equation image in the original), where P_t and N_t are respectively the positive and negative sample sets drawn at the current frame, ∪ is the set-union operation, Pos_t and Neg_t are the online fine-tuning training samples of the TNet for the t-th frame, and {sample_suf, sample_neg, sample_none} are the three output behaviors of the ENet.
Referring to fig. 1 and 2, an embodiment of the present invention comprises the following steps:
Step 1, firstly designing and training the target tracking network (TNet) offline.
(1a) The network structure of the TNet is shown in Table 1. The network input is an image block, which passes through three convolutional layers conv1-conv3 and two fully connected layers fc4 and fc5, and then reaches two output layers: the behavior output layer fc6 and the target-confidence output layer fc7. The 3 convolutional layers fully extract deep convolutional features of the image. The fc6 output is the probability of 9 behaviors in one-hot form, comprising displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a stop action {stop} that terminates the iteration once the current tracking frame is accurately positioned; the fc7 output is the confidence of the corresponding behavior.
Table 1 Specific configuration of the target tracking network TNet (the table is given as an image in the original).
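Because Table 1 is only available as an image, the following PyTorch-style Python sketch merely illustrates a network with the topology described in the text (convolutional layers conv1-conv3, fully connected layers fc4-fc5, and the two heads fc6 and fc7); all channel widths and kernel sizes are assumptions, and the 2-dimensional confidence head follows the (9+2) history dimension mentioned in step 2.

import torch.nn as nn

class TNetSketch(nn.Module):
    """Illustrative only: layer sizes are assumed, not taken from Table 1."""
    def __init__(self, num_behaviors=9):
        super().__init__()
        self.conv = nn.Sequential(                          # conv1-conv3: deep convolutional features
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(                            # fc4-fc5
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.fc6_behavior = nn.Linear(512, num_behaviors)   # behavior output layer
        self.fc7_confidence = nn.Linear(512, 2)             # target-confidence output layer

    def forward(self, x):
        feat = self.fc(self.conv(x))
        return self.fc6_behavior(feat), self.fc7_confidence(feat)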
(1b) In order to give the first three layers of the TNet stronger image-feature extraction capability, the network is pre-trained on a public object detection data set using a traditional supervised learning method, with the objective function defined as a multi-task cross-entropy loss,
L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*)    (1)
where L_cross-entropy represents the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the fc7 and fc6 layers of the network, conf* and act* are the corresponding ground-truth values, and λ1 = 0.65 represents the weight distribution between the two losses; the pre-training stage focuses more on target confidence, so λ1 takes a larger value.
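As a minimal sketch of equation (1), assuming integer class labels for the one-hot ground truths and the λ1 = 0.65 weighting quoted above (the tensor shapes are only illustrative):

import torch.nn.functional as F

def tnet_pretrain_loss(conf_logits, act_logits, conf_gt, act_gt, lam1=0.65):
    """Multi-task cross-entropy loss of equation (1).
    conf_logits: (N, 2) fc7 outputs; act_logits: (N, 9) fc6 outputs;
    conf_gt / act_gt: class indices of the one-hot ground truth."""
    loss_conf = F.cross_entropy(conf_logits, conf_gt)   # L_cross-entropy(conf, conf*)
    loss_act = F.cross_entropy(act_logits, act_gt)      # L_cross-entropy(act, act*)
    return lam1 * loss_conf + (1.0 - lam1) * loss_act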
The pre-trained TNet convolutional layers already have good target-expression capability, so the network parameters of layers conv1 to conv3 are fixed, and layers fc4-fc6 are trained by reinforcement learning on multi-frame image sequences. In each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth; the reward function is set to
(equation (2), given as an image in the original), where BB_TNet is the tracking result of the TNet in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes. Finally, the fc4, fc5 and fc6 layer parameters are optimized with stochastic gradient descent.
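The reward of equation (2) is only available as an image. The sketch below assumes a simple form that is common in this family of trackers (+1 when the final-frame overlap with the ground truth exceeds a threshold, otherwise −1, the 0.7 threshold being an assumption) and a REINFORCE-style update of the fc4-fc6 parameters, consistent with the stochastic-gradient optimization described above.

import torch

def reinforce_update(log_probs, final_iou, optimizer, iou_threshold=0.7):
    """Policy-gradient update over one simulated tracking sequence.
    log_probs: log-probabilities of the behaviors chosen at each step (with grad).
    final_iou: IoU(BB_TNet, GT) in the last frame. Reward form and threshold are assumptions."""
    reward = 1.0 if final_iou > iou_threshold else -1.0
    loss = -reward * torch.stack(log_probs).sum()   # maximizes the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()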
Step 2, designing and training the tracking evaluation network (ENet) offline.
(2a) The structure of the ENet is similar to that of the TNet, as shown in Table 2: the two networks have identical conv1-fc5 layers, and the conv1-conv3 layers share parameters with the TNet. The input is the image block inside the tracking frame of the TNet in the current t-th frame, and the historical tracking results of previous frames are concatenated after the output of the fc5 layer. Specifically, this history contains: a) the behavior prediction and confidence output by the TNet at the first frame of the video, since the target state of the first frame is known and has reference value; and b) the behavior predictions and confidences finally output by the TNet in the previous m frames (m = 15). The dimension of these two items together is (9+2) + (9+2) × m. The output of the ENet is one of 3 different sampling behaviors for the current frame, {sample_suf, sample_neg, sample_none}, so the output layer fc6 is a fully connected layer of structure (512+176) × 3. These 3 sampling behaviors reflect the ENet's evaluation of the current tracking result, with the corresponding confidence in the current tracking result decreasing in turn, and the obtained samples are used to update the TNet network online. sample_suf means the sampling of the current frame is complete, with both positive and negative samples, indicating that the ENet judges the confidence of the current tracking result to be high; sample_neg means only the area around the tracking result is sampled as negative samples, indicating that the target in the current frame is partially occluded or greatly deformed; sample_none means the target is completely occluded or tracking has failed, and no samples are drawn. Through different sampling behaviors, the online training samples of the TNet differ, achieving the best network parameters and tracking accuracy.
Table 2 Specific configuration of the tracking evaluation network ENet (the table is given as an image in the original).
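The layer table is an image, but the dimensions quoted in (2a) are self-consistent and can be checked: each history item is a 9-dimensional behavior prediction plus a 2-dimensional confidence, so the concatenated history is (9+2) + (9+2) × m = 11 + 11 × 15 = 176 dimensions, which matches the extra input of the fc6 layer of structure (512+176) × 3. A small Python sketch of assembling that history vector (function and variable names are illustrative):

import numpy as np

def build_enet_history(first_frame_pred, recent_preds, m=15):
    """first_frame_pred: (behavior_probs[9], confidence[2]) output at the first frame.
    recent_preds: the same pairs from the previous m frames, most recent last.
    Missing entries are zero-filled, so the result is always 11 + 11*m = 176-dimensional."""
    def flat(pair):
        return np.concatenate([np.asarray(pair[0]), np.asarray(pair[1])])  # 9 + 2 = 11
    padded = list(recent_preds)[-m:]
    while len(padded) < m:                       # fewer than m tracked frames: pad with zeros
        padded.insert(0, (np.zeros(9), np.zeros(2)))
    items = [flat(first_frame_pred)] + [flat(p) for p in padded]
    return np.concatenate(items)                 # shape (176,) when m = 15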
(2b) The conv1 to fc5 layer parameters of the ENet are initialized from the corresponding layers of the TNet, the fc6 layer is randomly initialized, and the conv1 to conv3 layer parameters are fixed; the training data are video sequences, and training simulates the tracking process. The final reward function is
(equation (3), given as an image in the original), where BB_TNet+ENet is the tracking result of the TNet assisted by the ENet evaluation, and the other symbols are consistent with equation (2). Equation (3) means that the more the tracking performance is improved after using the ENet, the larger the reward, and the network is penalized once the performance decreases. Finally, the fc4, fc5 and fc6 layer parameters are optimized with stochastic gradient descent.
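Equation (3) is only available as an image. According to the description it grows with the tracking improvement obtained by using the ENet and becomes a penalty when performance drops; one plausible form consistent with that description (an assumption, not the exact formula) is the difference of the last-frame overlap ratios:

def iou(box_a, box_b):
    """Overlap ratio IoU(., .) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def enet_reward(bb_tnet_enet, bb_tnet, gt):
    """Assumed form of equation (3): improvement of the last-frame overlap from using the ENet."""
    return iou(bb_tnet_enet, gt) - iou(bb_tnet, gt)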
Step 3, sampling positive and negative samples online in the first-frame image of the video and fine-tuning the TNet.
The state of the target to be tracked is given in the first frame manually or by an interactive algorithm as s1 = {c_x^1, c_y^1, h^1, w^1}, where {c_x^1, c_y^1} are the position coordinates of the target center in the height and width directions, {h^1, w^1} are the target height and width, and the superscript "1" denotes the first frame. Sufficient positive and negative samples are drawn around the given target. The positive-sample rule is that the sample center is randomly sampled in the area designated by s1, the sample height and width are randomly scaled on the basis of {h^1, w^1} within the ranges [0.85, 1.15] × h^1 and [0.85, 1.15] × w^1, and a positive sample is required to have an overlap ratio with s1 greater than 0.75; the resulting positive sample set Pos1 contains 400 samples. The negative-sample rule is that the sample center lies inside an annular area around the target center rather than at s1, the random variation rule of the sample height and width is the same as that of the positive samples, and a negative sample is required to have an overlap ratio with s1 of less than 0.5; the resulting negative sample set Neg1 contains 400 samples.
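A minimal Python sketch of this first-frame sampling rule, following the [0.85, 1.15] scale range, the 0.75 / 0.5 overlap thresholds and the 400 samples per set quoted above; the size of the center-jitter regions and the use of uniform jitter are assumptions, and iou_fn is a hypothetical helper computing the overlap ratio of two (cx, cy, h, w) boxes:

import random

def sample_first_frame(s1, iou_fn, n_samples=400):
    """s1 = (cx, cy, h, w): given first-frame target state. Returns (Pos1, Neg1)."""
    cx, cy, h, w = s1

    def jitter(max_shift):
        ncx = cx + random.uniform(-max_shift, max_shift) * w
        ncy = cy + random.uniform(-max_shift, max_shift) * h
        return (ncx, ncy, h * random.uniform(0.85, 1.15), w * random.uniform(0.85, 1.15))

    pos, neg = [], []
    while len(pos) < n_samples:                 # positive samples: near the center, IoU > 0.75
        cand = jitter(max_shift=0.05)
        if iou_fn(cand, s1) > 0.75:
            pos.append(cand)
    while len(neg) < n_samples:                 # negative samples: farther away, IoU < 0.5
        cand = jitter(max_shift=1.0)
        if iou_fn(cand, s1) < 0.5:
            neg.append(cand)
    return pos, neg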
Step 4, using the TNet to perform step-by-step target positioning in a new frame.
When a new t-th frame is tracked, Pos_{t-1} and Neg_{t-1} are first used as training samples to fine-tune the TNet online in the same supervised fashion as in step (1b). Then the target state of the previous frame is taken as the initial target state of the current frame, s_t^1 = s_{t-1}, where s_t^1 is the 1st state of the t-th frame and s_{t-1} is the final target state obtained in the previous frame. The image block represented by s_t^1 is extracted from the t-th frame image as the candidate target.
(4a) The candidate target is input into the TNet to obtain the corresponding behavior probabilities and confidence. The behavior with the highest probability is selected as the predicted behavior a_t^i, where i is the number of TNet iterations in the current frame and a_t^i is the behavior given by the i-th iteration.
If a_t^i is not the stop action, the state is adjusted according to the behavior (the adjustment formula is given as an equation image in the original); the state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not adjusted. According to the new state s_t^{i+1}, an image block is extracted from the t-th frame image as the new candidate target, and step (4a) is repeated, as shown in the upper portion of the schematic diagram of the embodiment in fig. 2.
If a_t^i is the stop action, it indicates that the current tracking frame is accurately positioned, and the iteration stops. The target state after the step-by-step adjustment is taken as the current-frame tracking result s_t.
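Putting step (4a) and the stop test together, the per-frame step-by-step localization can be sketched as follows; tnet_predict, crop_patch and apply_behavior are hypothetical helpers standing in for the TNet forward pass, the image-block extraction and the state adjustment described above, and the iteration cap is an assumption added to guard against oscillation:

def locate_in_frame(frame, prev_state, tnet_predict, crop_patch, apply_behavior,
                    max_iters=20):
    """Step-by-step target localization in the t-th frame (helper functions are hypothetical)."""
    state = prev_state                            # initial state of the frame = s_{t-1}
    for _ in range(max_iters):
        patch = crop_patch(frame, state)          # candidate target for the current state
        behavior_probs, confidence = tnet_predict(patch)
        best = max(behavior_probs, key=behavior_probs.get)   # behavior with highest probability
        if best == "stop":                        # tracking frame accurately positioned
            break
        state = apply_behavior(state, best)       # adjust the state and iterate again
    return state                                  # current-frame tracking result s_t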
and 5, evaluating a current frame positioning result by utilizing the ENet and determining a sampling behavior.
During tracking, the ENet parameters are fixed. Starting from the second frame of the video, the target area tracked by the TNet is input into the first convolutional layer conv1 of the ENet, and the behavior prediction and confidence output by the TNet in the last iteration of each frame are concatenated with the fc5-layer output of the ENet as the fc6-layer input; when the number of tracked frames is less than m, the missing history items are filled with 0 to keep the dimensions formally consistent. The sampling behavior with the maximum probability in the 3-dimensional sampling-behavior probability output by fc6 is selected as the current prediction. The training sample set of the TNet is then adjusted according to this sampling behavior
(the adjustment formula is given as an equation image in the original), where P_t and N_t are respectively the positive and negative sample sets drawn at the current frame (sampled in the same way as in step 3), ∪ is the set-union operation, and Pos_t and Neg_t are the online fine-tuning training samples of the TNet for the t-th frame.
Finally, it is judged whether the current frame is the last frame of the video; if not, the process returns to step 4 to continue tracking the next frame; if so, the video tracking process ends.
The tracking method of the invention was evaluated on OTB100, using the precision curve and the success-rate curve as evaluation criteria. OTB100 contains 100 video sequences covering many challenging factors such as object motion blur, background clutter, partial occlusion and complete occlusion. The tracking method designed by the invention is compared with the classical ADNet algorithm; Fig. 3 shows the quantitative comparison, from which it can be seen that the method of the invention has good tracking precision and robustness and outperforms the comparison algorithm.
The algorithm of the present invention is not limited to the above embodiments, and any technical solutions obtained by equivalent substitution methods fall within the scope of the present invention.

Claims (10)

1. A stepping target tracking method based on a double-depth enhanced network is characterized by comprising the following steps:
step 1, designing a target tracking network TNet for extracting deep convolutional features of the target; performing offline training on the TNet, wherein the offline training comprises supervised pre-training and reinforcement-learning training;
step 2, designing and training a tracking-result evaluation network ENet, which outputs an online sampling behavior in the tracking process to control the updating process of the TNet;
and step 3, in the tracking process, positioning the target by using the TNet, sampling different training samples according to the evaluation of the ENet on the current tracking result, adjusting and updating the TNet online, then entering the next frame for tracking, and repeating this step, gradually adjusting the tracking frame to become the minimum circumscribed rectangle of the target.
2. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the TNet input is an image block; deep convolutional features of the image are extracted through three or more convolutional layers and then passed through two or more fully connected layers to a behavior output layer and a target-confidence output layer; the behavior output layer outputs displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a termination action {stop}; and the target-confidence output layer outputs the confidence of the corresponding behavior.
3. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 2, wherein: the scales in the height direction and the width direction in the 4 scale changes are independently changed.
4. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the TNet is pre-trained on a public object detection data set using a supervised learning method, and the objective function is defined as a multi-task cross-entropy loss L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*), where L_cross-entropy represents the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the target-confidence output layer and the behavior output layer of the network, conf* and act* are the corresponding ground-truth values, and λ1 represents the weight distribution between the two losses.
5. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 4, wherein: the value range of λ1 is [0.55, 0.73].
6. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the network parameters of the pre-trained TNet convolutional layers are fixed, and the fully connected layers of the TNet are trained by reinforcement learning on multi-frame image sequences; in each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth.
7. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 2, the layers of the ENet other than the output layer have the same structure as those of the TNet, and all convolutional-layer parameters are shared; the input is the image block inside the tracking frame of the TNet in the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer; the output of the ENet is a sampling behavior {sample_suf, sample_neg, sample_none} that determines how the TNet training data are sampled at the current frame, and by carrying out different sampling behaviors the online fine-tuning samples of the TNet change accordingly.
8. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 2, the ENet is trained directly by reinforcement learning, the output layer is randomly initialized, the parameters of the other layers are initialized from the TNet, the convolutional-layer parameters are fixed, the training data are video sequences, training simulates the tracking process, and the final reward function is
(given as an equation image in the original), where BB_TNet+ENet is the tracking result of the TNet after using the ENet evaluation, BB_TNet is the tracking result of the TNet alone in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes.
9. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm, positive samples are drawn from a circular area whose distance from the target center is less than a set threshold, negative samples are drawn from an annular area whose distance from the target center is greater than the set threshold, and the TNet is fine-tuned accordingly; from the second frame of the video onward, the target area obtained by TNet tracking is input into the first convolutional layer of the ENet, and the output of the TNet in the last iteration of each frame is concatenated with the output of the fc5 layer of the ENet as the input of the fc6 layer to obtain the predicted sampling behavior; and the training sample set of the TNet is adjusted according to the sampling behavior.
10. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 3, the state of the target is adjusted by the TNet according to the following formula
(given as an equation image in the original), where i is the number of iterations of the stepwise adjustment, {c_x, c_y, h, w} are the center coordinates and the height and width of the target, and a_i is the behavior predicted by the TNet in the current iteration; state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not changed;
the training sample set of the TNet is adjusted according to the sampling behavior as
(given as an equation image in the original), where P_t and N_t are respectively the positive and negative sample sets drawn at the current frame, ∪ is the set-union operation, Pos_t and Neg_t are the online fine-tuning training samples of the TNet for the t-th frame, and {sample_suf, sample_neg, sample_none} are the three output behaviors of the ENet.
CN202011057357.5A 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network Active CN112150510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011057357.5A CN112150510B (en) 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011057357.5A CN112150510B (en) 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network

Publications (2)

Publication Number Publication Date
CN112150510A true CN112150510A (en) 2020-12-29
CN112150510B CN112150510B (en) 2024-03-26

Family

ID=73895941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011057357.5A Active CN112150510B (en) 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network

Country Status (1)

Country Link
CN (1) CN112150510B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991346A (en) * 2021-05-13 2021-06-18 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN115099372A (en) * 2022-08-25 2022-09-23 深圳比特微电子科技有限公司 Classification identification method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN106960446A (en) * 2017-04-01 2017-07-18 广东华中科技大学工业技术研究院 A kind of waterborne target detecting and tracking integral method applied towards unmanned boat
CN109801310A (en) * 2018-11-23 2019-05-24 南京信息工程大学 A kind of method for tracking target in orientation and scale differentiation depth network
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
US20200126241A1 (en) * 2018-10-18 2020-04-23 Deepnorth Inc. Multi-Object Tracking using Online Metric Learning with Long Short-Term Memory
CN111666631A (en) * 2020-06-03 2020-09-15 南京航空航天大学 Unmanned aerial vehicle maneuvering decision method combining hesitation fuzzy and dynamic deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN106960446A (en) * 2017-04-01 2017-07-18 广东华中科技大学工业技术研究院 A kind of waterborne target detecting and tracking integral method applied towards unmanned boat
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
US20200126241A1 (en) * 2018-10-18 2020-04-23 Deepnorth Inc. Multi-Object Tracking using Online Metric Learning with Long Short-Term Memory
CN109801310A (en) * 2018-11-23 2019-05-24 南京信息工程大学 A kind of method for tracking target in orientation and scale differentiation depth network
CN111666631A (en) * 2020-06-03 2020-09-15 南京航空航天大学 Unmanned aerial vehicle maneuvering decision method combining hesitation fuzzy and dynamic deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, Jin Young Choi: "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning", 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 1349-1356 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991346A (en) * 2021-05-13 2021-06-18 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN112991346B (en) * 2021-05-13 2022-04-26 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN115099372A (en) * 2022-08-25 2022-09-23 深圳比特微电子科技有限公司 Classification identification method and device

Also Published As

Publication number Publication date
CN112150510B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
Labach et al. Survey of dropout methods for deep neural networks
CN112163516B (en) Rope skipping counting method and device and computer storage medium
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN108764006B (en) SAR image target detection method based on deep reinforcement learning
CN110120064A (en) A kind of depth related objective track algorithm based on mutual reinforcing with the study of more attention mechanisms
CN112150510A (en) Stepping target tracking method based on double-depth enhanced network
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN111476814B (en) Target tracking method, device, equipment and storage medium
CN110096202B (en) Automatic lightweight image clipping system and method based on deep reinforcement learning
CN110991621A (en) Method for searching convolutional neural network based on channel number
CN116524062B (en) Diffusion model-based 2D human body posture estimation method
CN112802061A (en) Robust target tracking method and system based on hierarchical decision network
CN111105442B (en) Switching type target tracking method
CN112614163A (en) Target tracking method and system fusing Bayesian trajectory inference
CN110378932B (en) Correlation filtering visual tracking method based on spatial regularization correction
CN114973071A (en) Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics
US10643092B2 (en) Segmenting irregular shapes in images using deep region growing with an image pyramid
CN110428447B (en) Target tracking method and system based on strategy gradient
CN117237893A (en) Automatic driving multi-target detection method based on instance self-adaptive dynamic neural network
Firouznia et al. Adaptive chaotic sampling particle filter to handle occlusion and fast motion in visual object tracking
CN116342624A (en) Brain tumor image segmentation method combining feature fusion and attention mechanism
US10776923B2 (en) Segmenting irregular shapes in images using deep region growing
CN116258877A (en) Land utilization scene similarity change detection method, device, medium and equipment
WO2019243910A1 (en) Segmenting irregular shapes in images using deep region growing
CN111539989B (en) Computer vision single target tracking method based on optimized variance reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant