CN112150510A - Stepping target tracking method based on double-depth enhanced network - Google Patents

Stepping target tracking method based on double-depth enhanced network

Info

Publication number
CN112150510A
CN112150510A CN202011057357.5A CN202011057357A
Authority
CN
China
Prior art keywords
tnet
target
tracking
network
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011057357.5A
Other languages
Chinese (zh)
Other versions
CN112150510B (en)
Inventor
陆永安
赵柯
王暐
张波
刘传玲
周铁军
张华
付飞亚
李嘉
计宇
张乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pla 63875 Unit
Original Assignee
Pla 63875 Unit
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pla 63875 Unit filed Critical Pla 63875 Unit
Priority to CN202011057357.5A priority Critical patent/CN112150510B/en
Publication of CN112150510A publication Critical patent/CN112150510A/en
Application granted granted Critical
Publication of CN112150510B publication Critical patent/CN112150510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a stepping target tracking method based on a double-depth enhanced network. A target tracking network (TNet) is designed to extract deep convolutional features of the target, and the TNet is trained offline, including supervised pre-training and reinforcement-learning training. A tracking-result evaluation network (ENet) is designed and trained to output an online sampling behavior during tracking and control the updating process of the TNet. During tracking, the TNet is used to locate the target, different training samples are drawn according to the ENet's evaluation of the current tracking result, the TNet is adjusted and updated online, and tracking then proceeds to the next frame; the tracking frame is adjusted step by step until it becomes the minimum circumscribed rectangle of the target. The method adapts better to target deformation and enhances the robustness and stability of tracking.

Description

Stepping target tracking method based on double-depth enhanced network
Technical Field
The invention relates to the field of visual tracking, in particular to accurate target tracking based on deep reinforcement learning.
Background
The task of a visual target tracking algorithm is to predict the new state of the target in subsequent frames, given the position, size and other state of the target object in the first frame of a video. With the great success of deep learning, especially convolutional neural networks (CNN), in image classification and object detection, most existing target tracking algorithms use a pre-trained CNN to extract image features. The article "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning" by Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun and Jin Young Choi discloses the ADNet tracking algorithm, which was the first to embed tracking into a reinforcement learning framework while exploiting the target-expression capability of a CNN. Its characteristic is that the tracking frame is controlled by a sequence of behaviors obtained through deep reinforcement learning; the decision process consumes little time, accurate target tracking is achieved by adjusting the position of the tracking frame, and the algorithm obtained the best results of its year on the OTB100 data set.
However, in the ADNet tracking process, the aspect ratio of the target is fixed, so when the target undergoes large deformations such as rotation it cannot be tracked; moreover, the online training of the network model uses a fixed time interval and a fixed number of training iterations, so when occlusion or interference occurs the ADNet network is updated by mistake, the model degrades, and tracking drifts or even fails.
In view of the above problems in the related art, no effective solution has been found.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a stepping target tracking method based on a double-depth enhanced network. On the basis of the existing ADNet network, it improves the structure of the network output layer and adds actions for independent scaling in the length and width directions of the target window, so as to better adapt to target deformation; it also introduces a new tracking-state evaluation network based on a deep enhancement network to guide the online updating process of the tracking network.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1, designing a target tracking network TNet for extracting deep convolutional features of the target; performing offline training on the TNet, wherein the offline training comprises supervised pre-training and reinforcement-learning training;
step 2, designing and training a tracking-result evaluation network ENet, which outputs an online sampling behavior in the tracking process to control the updating process of the TNet;
and step 3, in the tracking process, positioning the target by using the TNet, sampling different training samples according to the evaluation of the ENet on the current tracking result, adjusting and updating the TNet online, then entering the next frame for tracking, and repeating this step, gradually adjusting the tracking frame to become the minimum circumscribed rectangle of the target.
In step 1, the TNet input is an image block; deep convolutional features of the image are extracted through three or more convolutional layers and then passed through two or more fully connected layers to a behavior output layer and a target-confidence output layer. The behavior output layer outputs displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a termination action {stop}; the target-confidence output layer outputs the confidence of the corresponding behavior.
The scales in the height direction and the width direction in the 4 scale changes are independently changed.
In step 1, the TNet is pre-trained on a public object detection data set using a supervised learning method, and the objective function is defined as a multi-task cross-entropy loss
L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*), where L_cross-entropy denotes the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the target-confidence output layer and the behavior output layer of the network, conf* and act* are the corresponding ground-truth values, and λ1 represents the weight distribution between the two losses.
The value range of λ1 is [0.55, 0.73].
In step 1, the network parameters of the pre-trained TNet convolutional layers are fixed, and the fully connected layers of the TNet are trained by reinforcement learning on multi-frame image sequences; in each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth.
In step 2, the layers of the ENet other than the output layer have the same structure as those of the TNet, and all convolutional-layer parameters are shared; the input is the image block inside the tracking frame of the TNet in the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer. The output of the ENet is a sampling behavior {sample_suf, sample_neg, sample_none} that determines how the TNet training data are sampled at the current frame; by carrying out different sampling behaviors, the online fine-tuning samples of the TNet change accordingly.
In step 2, the ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the parameters of the other layers are initialized from the TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, training simulates the tracking process, and the final reward function is
(given as an equation image in the original), where BB_TNet+ENet is the tracking result of the TNet after using the ENet evaluation, BB_TNet is the tracking result of the TNet alone in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes.
In step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn from a circular area whose distance from the target center is less than a set threshold, negative samples are drawn from an annular area whose distance from the target center is greater than the set threshold, and the TNet is fine-tuned accordingly. From the second frame of the video onward, the target area obtained by TNet tracking is input into the first convolutional layer of the ENet, and the output of the TNet in the last iteration of each frame is concatenated with the output of the fc5 layer of the ENet as the input of the fc6 layer to obtain the predicted sampling behavior; the training sample set of the TNet is adjusted according to this sampling behavior.
In step 3, the state of the target is adjusted by the TNet according to the following formula
(the state-adjustment formula is given as an equation image in the original), where i is the number of iterations of the stepwise adjustment, {c_x, c_y, h, w} are the center coordinates and the height and width of the target, and a_i is the behavior predicted by the TNet in the current iteration; the state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not changed;
the training sample set of the TNet is adjusted according to the sampling behavior in the way of
Figure BDA0002711202100000033
Wherein P ist、NtRespectively positive and negative sample sets sampled at the current frame, U is set merging operation, PostAnd NegtOn-line fine-tuning training samples for the t-th frame TNet, { samplesuf,sampleneg,samplenoneThree output behaviors for ENet。
The invention has the beneficial effects that:
Firstly, the method fully considers the arbitrariness of the tracked object and designs a complete set of tracking-frame adjustment behaviors: pre-training gives the tracking network TNet target-expression capability, simulating the tracking process with reinforcement learning gives the TNet the ability to adjust the tracking frame iteratively, and online fine-tuning lets the TNet adaptively track a specific target object, which solves the problem in the prior art that tracking cannot remain accurate when the target shape changes severely.
Secondly, a new deep enhancement network ENet is introduced to control the online fine-tuning of the TNet and guide the online updating process of the tracking network, which avoids model degradation caused by wrong or improper sampling for the TNet, solves the problem in the prior art of tracking failure when the target is occluded or interfered with, especially under long-time occlusion, and enhances tracking robustness and stability.
Drawings
Fig. 1 is a flowchart of an implementation of a step-by-step target tracking algorithm based on a dual-depth enhanced network on a frame image.
Fig. 2 is a schematic diagram of a step-by-step target tracking algorithm based on a dual-depth enhanced network.
FIG. 3 is a comparison between the method of the present invention and ADNet on OTB100, wherein (a) is the comparison of the precision curves and (b) is the comparison of the success-rate curves.
Detailed Description
The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.
Aiming at the problems of real-time changes in target appearance and long-time occlusion faced by tracking algorithms, the invention provides a stepping target tracking algorithm based on a double-depth enhanced network.
The invention provides a stepping target tracking method based on a double-depth enhanced network, which comprises the following steps:
Step 1, designing a target tracking network (TNet). The network can extract deep convolutional features of the target, give a sequence of movement behaviors of the tracking frame according to the extracted features, and gradually adjust the tracking frame to the minimum circumscribed rectangle of the target. For this purpose, the TNet is trained offline, including supervised pre-training and reinforcement-learning training.
Step 2, designing and training a tracking-result evaluation network (ENet), which outputs an online sampling behavior during tracking and controls the updating process of the TNet.
Step 3, during tracking, using the TNet to accurately locate the target step by step, then sampling different training samples according to the evaluation of the ENet on the current tracking result, fine-tuning and updating the TNet online, then entering the next frame of tracking, and repeating this step.
The TNet input in step 1 is an image block; deep convolutional features of the image are fully extracted through three or more convolutional layers and then passed through two or more fully connected layers to a behavior output layer and a target-confidence output layer. The behavior output layer outputs displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a termination action {stop}; the confidence output layer outputs the confidence of the corresponding behavior. The height and width scales in the behavior output can change independently.
In step 1, in order to give the convolutional layers of the TNet stronger image-feature extraction capability, the network is pre-trained on a public object detection data set using a traditional supervised learning method, with the objective function defined as a multi-task cross-entropy loss,
L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*)
where L_cross-entropy represents the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the target-confidence output layer and the behavior output layer of the network, conf* and act* are the corresponding ground-truth values, and λ1 represents the weight distribution between the two losses, with value range [0.55, 0.73]; the pre-training stage focuses more on target confidence, so λ1 takes values greater than 0.5.
The pre-trained TNet convolutional layers already have good target-expression capability, so their network parameters are fixed, and the fully connected layers (except the target-confidence layer) are trained by reinforcement learning on multi-frame image sequences. In each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth.
In step 2, the layers of the ENet other than the output layer have the same structure as those of the TNet, and all convolutional-layer parameters are shared; the input is the image block inside the tracking frame of the TNet in the current frame, and at the same time the historical tracking results of previous frames are concatenated after the output of the penultimate layer to improve the rationality of the evaluation. The output of the ENet is a sampling behavior {sample_suf, sample_neg, sample_none} that determines the degree and manner in which TNet training data are sampled at the current frame, reflecting the ENet's evaluation of the current tracking result. By carrying out different sampling behaviors, the online fine-tuning samples of the TNet change accordingly, and in this way the ENet evaluates and controls the TNet tracking result.
In step 2, the ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the parameters of the other layers are initialized from the TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, and training simulates the tracking process. The final reward function is
(given as an equation image in the original), where BB_TNet+ENet is the tracking result of the TNet after using the ENet evaluation, BB_TNet is the tracking result of the TNet alone in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes. The formula means that the more the tracking performance is improved after using the ENet, the larger the reward, and the network is penalized once the performance decreases. Finally, the fully connected layer parameters are optimized with stochastic gradient descent.
In step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn near the target center and negative samples are drawn in an annular area away from the center, and the TNet is fine-tuned online so that it better adapts to the target to be tracked.
The state adjustment of the target by the TNet in step 3 is realized by the following formula (given as an equation image in the original), where i is the number of iterations of the stepwise adjustment, {c_x, c_y, h, w} are the center coordinates and the height and width of the target, and a_i is the behavior predicted by the TNet in the current iteration. The state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not changed.
In step 3, the ENet parameters are fixed. Starting from the second frame of the video, the target area obtained by TNet tracking is input into the first convolutional layer conv1 of the ENet, and the output of the TNet in the last iteration of each frame is concatenated with the output of the fc5 layer of the ENet as the input of the fc6 layer to obtain the predicted sampling behavior. The training sample set of the TNet is adjusted according to the sampling behavior as follows
(given as an equation image in the original), where P_t and N_t are respectively the positive and negative sample sets drawn at the current frame, ∪ is the set-union operation, Pos_t and Neg_t are the online fine-tuning training samples of the TNet for the t-th frame, and {sample_suf, sample_neg, sample_none} are the three output behaviors of the ENet.
Referring to fig. 1 and 2, an embodiment of the present invention comprises the following steps:
Step 1, firstly designing and training the target tracking network (TNet) offline.
(1a) The network structure of the TNet is shown in Table 1. The network input is an image block, which passes through three convolutional layers conv1-conv3 and two fully connected layers fc4 and fc5, and then reaches two output layers: the behavior output layer fc6 and the target-confidence output layer fc7. The 3 convolutional layers fully extract deep convolutional features of the image. The fc6 output is the probability of 9 behaviors in one-hot form, comprising displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a stop action {stop} that terminates the iteration once the current tracking frame is accurately positioned; the fc7 output is the confidence of the corresponding behavior.
Table 1 Specific configuration of the target tracking network TNet (the table is given as an image in the original).
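Because Table 1 is only available as an image, the following PyTorch-style Python sketch merely illustrates a network with the topology described in the text (convolutional layers conv1-conv3, fully connected layers fc4-fc5, and the two heads fc6 and fc7); all channel widths and kernel sizes are assumptions, and the 2-dimensional confidence head follows the (9+2) history dimension mentioned in step 2.

import torch.nn as nn

class TNetSketch(nn.Module):
    """Illustrative only: layer sizes are assumed, not taken from Table 1."""
    def __init__(self, num_behaviors=9):
        super().__init__()
        self.conv = nn.Sequential(                          # conv1-conv3: deep convolutional features
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(                            # fc4-fc5
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.fc6_behavior = nn.Linear(512, num_behaviors)   # behavior output layer
        self.fc7_confidence = nn.Linear(512, 2)             # target-confidence output layer

    def forward(self, x):
        feat = self.fc(self.conv(x))
        return self.fc6_behavior(feat), self.fc7_confidence(feat)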
(1b) In order to give the first three layers of the TNet stronger image-feature extraction capability, the network is pre-trained on a public object detection data set using a traditional supervised learning method, with the objective function defined as a multi-task cross-entropy loss,
L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*)    (1)
where L_cross-entropy represents the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the fc7 and fc6 layers of the network, conf* and act* are the corresponding ground-truth values, and λ1 = 0.65 represents the weight distribution between the two losses; the pre-training stage focuses more on target confidence, so λ1 takes a larger value.
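As a minimal sketch of equation (1), assuming integer class labels for the one-hot ground truths and the λ1 = 0.65 weighting quoted above (the tensor shapes are only illustrative):

import torch.nn.functional as F

def tnet_pretrain_loss(conf_logits, act_logits, conf_gt, act_gt, lam1=0.65):
    """Multi-task cross-entropy loss of equation (1).
    conf_logits: (N, 2) fc7 outputs; act_logits: (N, 9) fc6 outputs;
    conf_gt / act_gt: class indices of the one-hot ground truth."""
    loss_conf = F.cross_entropy(conf_logits, conf_gt)   # L_cross-entropy(conf, conf*)
    loss_act = F.cross_entropy(act_logits, act_gt)      # L_cross-entropy(act, act*)
    return lam1 * loss_conf + (1.0 - lam1) * loss_act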
The pre-trained TNet convolutional layers already have good target-expression capability, so the network parameters of layers conv1 to conv3 are fixed, and layers fc4-fc6 are trained by reinforcement learning on multi-frame image sequences. In each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth; the reward function is set to
(equation (2), given as an image in the original), where BB_TNet is the tracking result of the TNet in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes. Finally, the fc4, fc5 and fc6 layer parameters are optimized with stochastic gradient descent.
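The reward of equation (2) is only available as an image. The sketch below assumes a simple form that is common in this family of trackers (+1 when the final-frame overlap with the ground truth exceeds a threshold, otherwise −1, the 0.7 threshold being an assumption) and a REINFORCE-style update of the fc4-fc6 parameters, consistent with the stochastic-gradient optimization described above.

import torch

def reinforce_update(log_probs, final_iou, optimizer, iou_threshold=0.7):
    """Policy-gradient update over one simulated tracking sequence.
    log_probs: log-probabilities of the behaviors chosen at each step (with grad).
    final_iou: IoU(BB_TNet, GT) in the last frame. Reward form and threshold are assumptions."""
    reward = 1.0 if final_iou > iou_threshold else -1.0
    loss = -reward * torch.stack(log_probs).sum()   # maximizes the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()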
Step 2, designing and training the tracking evaluation network (ENet) offline.
(2a) The structure of the ENet is similar to that of the TNet, as shown in Table 2: the two networks have identical conv1-fc5 layers, and the conv1-conv3 layers share parameters with the TNet. The input is the image block inside the tracking frame of the TNet in the current t-th frame, and the historical tracking results of previous frames are concatenated after the output of the fc5 layer. Specifically, this history contains: a) the behavior prediction and confidence output by the TNet at the first frame of the video, since the target state of the first frame is known and has reference value; and b) the behavior predictions and confidences finally output by the TNet in the previous m frames (m = 15). The dimension of these two items together is (9+2) + (9+2) × m. The output of the ENet is one of 3 different sampling behaviors for the current frame, {sample_suf, sample_neg, sample_none}, so the output layer fc6 is a fully connected layer of structure (512+176) × 3. These 3 sampling behaviors reflect the ENet's evaluation of the current tracking result, with the corresponding confidence in the current tracking result decreasing in turn, and the obtained samples are used to update the TNet network online. sample_suf means the sampling of the current frame is complete, with both positive and negative samples, indicating that the ENet judges the confidence of the current tracking result to be high; sample_neg means only the area around the tracking result is sampled as negative samples, indicating that the target in the current frame is partially occluded or greatly deformed; sample_none means the target is completely occluded or tracking has failed, and no samples are drawn. Through different sampling behaviors, the online training samples of the TNet differ, achieving the best network parameters and tracking accuracy.
Table 2 Specific configuration of the tracking evaluation network ENet (the table is given as an image in the original).
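The layer table is an image, but the dimensions quoted in (2a) are self-consistent and can be checked: each history item is a 9-dimensional behavior prediction plus a 2-dimensional confidence, so the concatenated history is (9+2) + (9+2) × m = 11 + 11 × 15 = 176 dimensions, which matches the extra input of the fc6 layer of structure (512+176) × 3. A small Python sketch of assembling that history vector (function and variable names are illustrative):

import numpy as np

def build_enet_history(first_frame_pred, recent_preds, m=15):
    """first_frame_pred: (behavior_probs[9], confidence[2]) output at the first frame.
    recent_preds: the same pairs from the previous m frames, most recent last.
    Missing entries are zero-filled, so the result is always 11 + 11*m = 176-dimensional."""
    def flat(pair):
        return np.concatenate([np.asarray(pair[0]), np.asarray(pair[1])])  # 9 + 2 = 11
    padded = list(recent_preds)[-m:]
    while len(padded) < m:                       # fewer than m tracked frames: pad with zeros
        padded.insert(0, (np.zeros(9), np.zeros(2)))
    items = [flat(first_frame_pred)] + [flat(p) for p in padded]
    return np.concatenate(items)                 # shape (176,) when m = 15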
(2b) The conv1 to fc5 layer parameters of the ENet are initialized from the corresponding layers of the TNet, the fc6 layer is randomly initialized, and the conv1 to conv3 layer parameters are fixed; the training data are video sequences, and training simulates the tracking process. The final reward function is
(equation (3), given as an image in the original), where BB_TNet+ENet is the tracking result of the TNet assisted by the ENet evaluation, and the other symbols are consistent with equation (2). Equation (3) means that the more the tracking performance is improved after using the ENet, the larger the reward, and the network is penalized once the performance decreases. Finally, the fc4, fc5 and fc6 layer parameters are optimized with stochastic gradient descent.
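Equation (3) is only available as an image. According to the description it grows with the tracking improvement obtained by using the ENet and becomes a penalty when performance drops; one plausible form consistent with that description (an assumption, not the exact formula) is the difference of the last-frame overlap ratios:

def iou(box_a, box_b):
    """Overlap ratio IoU(., .) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def enet_reward(bb_tnet_enet, bb_tnet, gt):
    """Assumed form of equation (3): improvement of the last-frame overlap from using the ENet."""
    return iou(bb_tnet_enet, gt) - iou(bb_tnet, gt)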
Step 3, sampling positive and negative samples online in the first-frame image of the video and fine-tuning the TNet.
The state of the target to be tracked is given in the first frame manually or by an interactive algorithm as s1 = {c_x^1, c_y^1, h^1, w^1}, where {c_x^1, c_y^1} are the position coordinates of the target center in the height and width directions, {h^1, w^1} are the target height and width, and the superscript "1" denotes the first frame. Sufficient positive and negative samples are drawn around the given target. The positive-sample rule is that the sample center is randomly sampled in the area designated by s1, the sample height and width are randomly scaled on the basis of {h^1, w^1} within the ranges [0.85, 1.15] × h^1 and [0.85, 1.15] × w^1, and a positive sample is required to have an overlap ratio with s1 greater than 0.75; the resulting positive sample set Pos1 contains 400 samples. The negative-sample rule is that the sample center lies inside an annular area around the target center rather than at s1, the random variation rule of the sample height and width is the same as that of the positive samples, and a negative sample is required to have an overlap ratio with s1 of less than 0.5; the resulting negative sample set Neg1 contains 400 samples.
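A minimal Python sketch of this first-frame sampling rule, following the [0.85, 1.15] scale range, the 0.75 / 0.5 overlap thresholds and the 400 samples per set quoted above; the size of the center-jitter regions and the use of uniform jitter are assumptions, and iou_fn is a hypothetical helper computing the overlap ratio of two (cx, cy, h, w) boxes:

import random

def sample_first_frame(s1, iou_fn, n_samples=400):
    """s1 = (cx, cy, h, w): given first-frame target state. Returns (Pos1, Neg1)."""
    cx, cy, h, w = s1

    def jitter(max_shift):
        ncx = cx + random.uniform(-max_shift, max_shift) * w
        ncy = cy + random.uniform(-max_shift, max_shift) * h
        return (ncx, ncy, h * random.uniform(0.85, 1.15), w * random.uniform(0.85, 1.15))

    pos, neg = [], []
    while len(pos) < n_samples:                 # positive samples: near the center, IoU > 0.75
        cand = jitter(max_shift=0.05)
        if iou_fn(cand, s1) > 0.75:
            pos.append(cand)
    while len(neg) < n_samples:                 # negative samples: farther away, IoU < 0.5
        cand = jitter(max_shift=1.0)
        if iou_fn(cand, s1) < 0.5:
            neg.append(cand)
    return pos, neg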
Step 4, using the TNet to perform step-by-step target positioning in a new frame.
When a new t-th frame is tracked, Pos_{t-1} and Neg_{t-1} are first used as training samples to fine-tune the TNet online in the same supervised fashion as in step (1b). Then the target state of the previous frame is taken as the initial target state of the current frame, s_t^1 = s_{t-1}, where s_t^1 is the 1st state of the t-th frame and s_{t-1} is the final target state obtained in the previous frame. The image block represented by s_t^1 is extracted from the t-th frame image as the candidate target.
(4a) The candidate target is input into the TNet to obtain the corresponding behavior probabilities and confidence. The behavior with the highest probability is selected as the predicted behavior a_t^i, where i is the number of TNet iterations in the current frame and a_t^i is the behavior given by the i-th iteration.
If a_t^i is not the stop action, the state is adjusted according to the behavior (the adjustment formula is given as an equation image in the original); the state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not adjusted. According to the new state s_t^{i+1}, an image block is extracted from the t-th frame image as the new candidate target, and step (4a) is repeated, as shown in the upper portion of the schematic diagram of the embodiment in fig. 2.
If a_t^i is the stop action, it indicates that the current tracking frame is accurately positioned, and the iteration stops. The target state after the step-by-step adjustment is taken as the current-frame tracking result s_t.
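Putting step (4a) and the stop test together, the per-frame step-by-step localization can be sketched as follows; tnet_predict, crop_patch and apply_behavior are hypothetical helpers standing in for the TNet forward pass, the image-block extraction and the state adjustment described above, and the iteration cap is an assumption added to guard against oscillation:

def locate_in_frame(frame, prev_state, tnet_predict, crop_patch, apply_behavior,
                    max_iters=20):
    """Step-by-step target localization in the t-th frame (helper functions are hypothetical)."""
    state = prev_state                            # initial state of the frame = s_{t-1}
    for _ in range(max_iters):
        patch = crop_patch(frame, state)          # candidate target for the current state
        behavior_probs, confidence = tnet_predict(patch)
        best = max(behavior_probs, key=behavior_probs.get)   # behavior with highest probability
        if best == "stop":                        # tracking frame accurately positioned
            break
        state = apply_behavior(state, best)       # adjust the state and iterate again
    return state                                  # current-frame tracking result s_t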
and 5, evaluating a current frame positioning result by utilizing the ENet and determining a sampling behavior.
During tracking, the ENet parameters are fixed. Starting from the second frame of the video, the target area tracked by the TNet is input into the first convolutional layer conv1 of the ENet, and the behavior prediction and confidence output by the TNet in the last iteration of each frame are concatenated with the fc5-layer output of the ENet as the fc6-layer input; when the number of tracked frames is less than m, the missing history items are filled with 0 to keep the dimensions formally consistent. The sampling behavior with the maximum probability in the 3-dimensional sampling-behavior probability output by fc6 is selected as the current prediction. The training sample set of the TNet is then adjusted according to this sampling behavior
(the adjustment formula is given as an equation image in the original), where P_t and N_t are respectively the positive and negative sample sets drawn at the current frame (sampled in the same way as in step 3), ∪ is the set-union operation, and Pos_t and Neg_t are the online fine-tuning training samples of the TNet for the t-th frame.
Finally, it is judged whether the current frame is the last frame of the video; if not, the process returns to step 4 to continue tracking the next frame; if so, the video tracking process ends.
The tracking method of the invention was evaluated on OTB100, using the precision curve and the success-rate curve as evaluation criteria. OTB100 contains 100 video sequences covering many challenging factors such as object motion blur, background clutter, partial occlusion and complete occlusion. The tracking method designed by the invention is compared with the classical ADNet algorithm; Fig. 3 shows the quantitative comparison, from which it can be seen that the method of the invention has good tracking precision and robustness and outperforms the comparison algorithm.
The algorithm of the present invention is not limited to the above embodiments, and any technical solutions obtained by equivalent substitution methods fall within the scope of the present invention.

Claims (10)

1. A stepping target tracking method based on a double-depth enhanced network is characterized by comprising the following steps:
step 1, designing a target tracking network TNet for extracting deep convolutional features of the target; performing offline training on the TNet, wherein the offline training comprises supervised pre-training and reinforcement-learning training;
step 2, designing and training a tracking-result evaluation network ENet, which outputs an online sampling behavior in the tracking process to control the updating process of the TNet;
and step 3, in the tracking process, positioning the target by using the TNet, sampling different training samples according to the evaluation of the ENet on the current tracking result, adjusting and updating the TNet online, then entering the next frame for tracking, and repeating this step, gradually adjusting the tracking frame to become the minimum circumscribed rectangle of the target.
2. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the TNet input is an image block; deep convolutional features of the image are extracted through three or more convolutional layers and then passed through two or more fully connected layers to a behavior output layer and a target-confidence output layer; the behavior output layer outputs displacements in 4 directions {T_left, T_right, T_up, T_down} to accurately position the target center, 4 scale changes {H_expand, H_shrink, W_expand, W_shrink} to deal with inconsistent deformation in the length and width directions of the target, and a termination action {stop}; and the target-confidence output layer outputs the confidence of the corresponding behavior.
3. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 2, wherein: the scales in the height direction and the width direction in the 4 scale changes are independently changed.
4. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the TNet is pre-trained on a public object detection data set using a supervised learning method, and the objective function is defined as a multi-task cross-entropy loss L_TNet = λ1 × L_cross-entropy(conf, conf*) + (1 − λ1) × L_cross-entropy(act, act*), where L_cross-entropy represents the cross-entropy loss in one-hot form, conf and act are respectively the outputs of the target-confidence output layer and the behavior output layer of the network, conf* and act* are the corresponding ground-truth values, and λ1 represents the weight distribution between the two losses.
5. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 4, wherein: the value range of λ1 is [0.55, 0.73].
6. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the network parameters of the pre-trained TNet convolutional layers are fixed, and the fully connected layers of the TNet are trained by reinforcement learning on multi-frame image sequences; in each frame of a sequence, the target is located using the pre-trained TNet, until the result of the last frame is compared with the ground truth.
7. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 2, the layers of the ENet other than the output layer have the same structure as those of the TNet, and all convolutional-layer parameters are shared; the input is the image block inside the tracking frame of the TNet in the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer; the output of the ENet is a sampling behavior {sample_suf, sample_neg, sample_none} that determines how the TNet training data are sampled at the current frame, and by carrying out different sampling behaviors the online fine-tuning samples of the TNet change accordingly.
8. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 2, the ENet is trained directly by reinforcement learning, the output layer is randomly initialized, the parameters of the other layers are initialized from the TNet, the convolutional-layer parameters are fixed, the training data are video sequences, training simulates the tracking process, and the final reward function is
(given as an equation image in the original), where BB_TNet+ENet is the tracking result of the TNet after using the ENet evaluation, BB_TNet is the tracking result of the TNet alone in the last frame, GT is the ground-truth target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes.
9. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm, positive samples are drawn from a circular area whose distance from the target center is less than a set threshold, negative samples are drawn from an annular area whose distance from the target center is greater than the set threshold, and the TNet is fine-tuned accordingly; from the second frame of the video onward, the target area obtained by TNet tracking is input into the first convolutional layer of the ENet, and the output of the TNet in the last iteration of each frame is concatenated with the output of the fc5 layer of the ENet as the input of the fc6 layer to obtain the predicted sampling behavior; and the training sample set of the TNet is adjusted according to the sampling behavior.
10. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 3, the state of the target is adjusted by the TNet according to the following formula
(given as an equation image in the original), where i is the number of iterations of the stepwise adjustment, {c_x, c_y, h, w} are the center coordinates and the height and width of the target, and a_i is the behavior predicted by the TNet in the current iteration; state items not listed in the case corresponding to each behavior remain consistent with the last iteration and are not changed;
the training sample set of the TNet is adjusted according to the sampling behavior as
(given as an equation image in the original), where P_t and N_t are respectively the positive and negative sample sets drawn at the current frame, ∪ is the set-union operation, Pos_t and Neg_t are the online fine-tuning training samples of the TNet for the t-th frame, and {sample_suf, sample_neg, sample_none} are the three output behaviors of the ENet.
CN202011057357.5A 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network Active CN112150510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011057357.5A CN112150510B (en) 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011057357.5A CN112150510B (en) 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network

Publications (2)

Publication Number Publication Date
CN112150510A true CN112150510A (en) 2020-12-29
CN112150510B CN112150510B (en) 2024-03-26

Family

ID=73895941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011057357.5A Active CN112150510B (en) 2020-09-29 2020-09-29 Stepping target tracking method based on dual-depth enhancement network

Country Status (1)

Country Link
CN (1) CN112150510B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991346A (en) * 2021-05-13 2021-06-18 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN115099372A (en) * 2022-08-25 2022-09-23 深圳比特微电子科技有限公司 Classification identification method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN106960446A (en) * 2017-04-01 2017-07-18 广东华中科技大学工业技术研究院 A kind of waterborne target detecting and tracking integral method applied towards unmanned boat
CN109801310A (en) * 2018-11-23 2019-05-24 南京信息工程大学 A kind of method for tracking target in orientation and scale differentiation depth network
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
US20200126241A1 (en) * 2018-10-18 2020-04-23 Deepnorth Inc. Multi-Object Tracking using Online Metric Learning with Long Short-Term Memory
CN111666631A (en) * 2020-06-03 2020-09-15 南京航空航天大学 Unmanned aerial vehicle maneuvering decision method combining hesitation fuzzy and dynamic deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN106960446A (en) * 2017-04-01 2017-07-18 广东华中科技大学工业技术研究院 A kind of waterborne target detecting and tracking integral method applied towards unmanned boat
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
US20200126241A1 (en) * 2018-10-18 2020-04-23 Deepnorth Inc. Multi-Object Tracking using Online Metric Learning with Long Short-Term Memory
CN109801310A (en) * 2018-11-23 2019-05-24 南京信息工程大学 A kind of method for tracking target in orientation and scale differentiation depth network
CN111666631A (en) * 2020-06-03 2020-09-15 南京航空航天大学 Unmanned aerial vehicle maneuvering decision method combining hesitation fuzzy and dynamic deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, Jin Young Choi: "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning", 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 1349-1356 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991346A (en) * 2021-05-13 2021-06-18 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN112991346B (en) * 2021-05-13 2022-04-26 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN115099372A (en) * 2022-08-25 2022-09-23 深圳比特微电子科技有限公司 Classification identification method and device

Also Published As

Publication number Publication date
CN112150510B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
Labach et al. Survey of dropout methods for deep neural networks
CN112163516B (en) Rope skipping counting method and device and computer storage medium
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN108764006B (en) SAR image target detection method based on deep reinforcement learning
CN110120064A (en) A kind of depth related objective track algorithm based on mutual reinforcing with the study of more attention mechanisms
CN112150510A (en) Stepping target tracking method based on double-depth enhanced network
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN111476814B (en) Target tracking method, device, equipment and storage medium
CN110096202B (en) Automatic lightweight image clipping system and method based on deep reinforcement learning
CN110991621A (en) Method for searching convolutional neural network based on channel number
CN116524062B (en) Diffusion model-based 2D human body posture estimation method
CN112802061A (en) Robust target tracking method and system based on hierarchical decision network
CN111105442B (en) Switching type target tracking method
CN112614163A (en) Target tracking method and system fusing Bayesian trajectory inference
CN110378932B (en) Correlation filtering visual tracking method based on spatial regularization correction
CN114973071A (en) Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics
US10643092B2 (en) Segmenting irregular shapes in images using deep region growing with an image pyramid
CN110428447B (en) Target tracking method and system based on strategy gradient
CN117237893A (en) Automatic driving multi-target detection method based on instance self-adaptive dynamic neural network
Firouznia et al. Adaptive chaotic sampling particle filter to handle occlusion and fast motion in visual object tracking
CN116342624A (en) Brain tumor image segmentation method combining feature fusion and attention mechanism
US10776923B2 (en) Segmenting irregular shapes in images using deep region growing
CN116258877A (en) Land utilization scene similarity change detection method, device, medium and equipment
WO2019243910A1 (en) Segmenting irregular shapes in images using deep region growing
CN111539989B (en) Computer vision single target tracking method based on optimized variance reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant