CN112150510A - Step-by-step target tracking method based on a dual-depth enhanced network
Info
- Publication number: CN112150510A
- Application number: CN202011057357.5A
- Authority: CN (China)
- Prior art keywords: tnet, target, tracking, network, frame
- Prior art date: 2020-09-29
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- Y02T10/40—Engine management systems
Abstract
The invention designs a target tracking network (TNet) that extracts deep convolutional features of a target, and trains it offline with supervised pre-training followed by reinforcement learning; it further designs and trains a tracking-result evaluation network (ENet) that outputs an online sampling behavior during tracking to control the update process of TNet. During tracking, TNet localizes the target; according to ENet's evaluation of the current tracking result, different training samples are drawn and TNet is fine-tuned online before tracking proceeds to the next frame, and the tracking box is adjusted step by step to the minimum bounding rectangle of the target. The method adapts better to target deformation and enhances the robustness and stability of tracking.
Description
Technical Field
The invention relates to the field of visual tracking, and in particular to accurate target tracking based on deep reinforcement learning.
Background
The task of a visual target tracking algorithm is to predict the new target state in subsequent frames, given the position, size and other state of the target object in the first frame of a video. With the great success of deep learning, particularly convolutional neural networks (CNNs), in image classification and object detection, most existing tracking algorithms use a pre-trained CNN to extract image features. The paper "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning" by Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun and Jin Young Choi introduced the ADNet tracking algorithm, which was the first to embed the target-representation capability of a CNN into a reinforcement learning framework: the tracking box is controlled by a sequence of actions learned with deep reinforcement learning, the decision process is fast, and accurate tracking is achieved by adjusting the position of the tracking box. ADNet obtained the best results on the OTB100 dataset in its year of publication.
However, because ADNet keeps the aspect ratio of the target fixed, it cannot follow the target under large deformations such as rotation. Moreover, its online training updates the network model at fixed time intervals for a fixed number of iterations, so when occlusion or interference occurs the ADNet network is updated on erroneous data, the model degrades, and tracking drifts or even fails.
In view of the above problems in the related art, no effective solution has been found.
Disclosure of Invention
To overcome these shortcomings of the prior art, the invention provides a step-by-step target tracking method based on a dual-depth enhanced network. It improves the structure of the network output layer over the existing ADNet by adding actions that scale the target window independently along its height and width, so as to better adapt to target deformation, and it introduces a new tracking-state evaluation network, itself based on a deep reinforcement network, to guide the online update of the tracking network.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: design a target tracking network TNet that extracts deep convolutional features of the target; train TNet offline, with supervised pre-training followed by reinforcement learning training.
Step 2: design and train a tracking-result evaluation network ENet, which outputs an online sampling behavior during tracking to control the update process of TNet.
Step 3: during tracking, localize the target with TNet; according to ENet's evaluation of the current tracking result, draw different training samples and fine-tune TNet online, then proceed to the next frame and repeat, so that the tracking box is adjusted step by step to the minimum bounding rectangle of the target.
In step 1, the input to TNet is an image block; deep convolutional features are extracted by three or more convolutional layers and passed through two or more fully connected layers to a behavior output layer and a target confidence output layer. The behavior output layer outputs displacements in four directions, {T_left, T_right, T_up, T_down}, to accurately position the target center, four scale changes, {H_expand, H_shrink, W_expand, W_shrink}, to handle inconsistent deformation along the height and width of the target, and a termination operation {stop}; the target confidence output layer outputs the confidence of the corresponding behavior.
The scales in the height direction and the width direction in the 4 scale changes are independently changed.
In step 1, TNet is pre-trained on a public object detection dataset with a supervised learning method; the objective function is defined as the multi-task cross-entropy loss
L_TNet = λ1 × L_cross-entropy(conf, conf~) + (1 − λ1) × L_cross-entropy(act, act~),
where L_cross-entropy denotes the cross-entropy loss over one-hot targets, act and conf are the outputs of the behavior output layer and the target confidence output layer respectively, act~ and conf~ are the corresponding ground truths, and λ1 represents the weight distribution of the two losses.
The value of λ1 ranges over [0.55, 0.73].
In step 1, the convolutional-layer parameters of the pre-trained TNet are fixed, and its fully connected layers are trained by reinforcement learning on multi-frame image sequences; in each frame of a sequence the target is localized with the pre-trained TNet until the last frame, whose result is compared against the ground truth.
In step 2, all layers of ENet except the output layer have the same structure as TNet, and all convolutional-layer parameters are shared; the input is the image block in the tracking box of TNet at the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer. The output of ENet is the sampling behavior {sample_suf, sample_neg, sample_none} used to sample TNet training data at the current frame; carrying out different sampling behaviors changes the online fine-tuning samples of TNet accordingly.
In step 2, ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the other layers are initialized from TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, and training simulates the tracking process. The final reward grows with the amount by which IoU(BB_TNet+ENet, GT) exceeds IoU(BB_TNet, GT) and becomes a penalty when it falls below, where BB_TNet+ENet is the tracking result of TNet assisted by ENet's evaluation, BB_TNet is the tracking result of TNet alone in the last frame, GT is the true target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes.
In step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn from a circular region whose distance to the target center is below a set threshold, negative samples from an annular region whose distance exceeds the threshold, and TNet is fine-tuned on them. From the second frame of the video, the target region obtained by TNet tracking is input to the first convolutional layer of ENet, and the output of TNet at the last iteration of each frame is concatenated with the output of ENet's fc5 layer as the input of the fc6 layer, yielding the predicted sampling behavior; the training sample set of TNet is then adjusted according to that behavior.
In step 3, TNet adjusts the state {c_x, c_y, h, w} of the target according to the predicted behavior at each iteration, where i is the iteration index of the step-by-step adjustment, {c_x, c_y} are the center coordinates and {h, w} the height and width of the target, and a_i is the behavior predicted by TNet at the current iteration; state entries not listed in the case corresponding to a behavior keep the values of the previous iteration.
the training sample set of the TNet is adjusted according to the sampling behavior in the way of
Wherein P ist、NtRespectively positive and negative sample sets sampled at the current frame, U is set merging operation, PostAnd NegtOn-line fine-tuning training samples for the t-th frame TNet, { samplesuf,sampleneg,samplenoneThree output behaviors for ENet。
The invention has the beneficial effects that:
First, the method fully accounts for the arbitrariness of the tracked object and designs a complete set of tracking-box adjustment behaviors: pre-training gives the tracking network TNet target-representation capability, reinforcement learning over a simulated tracking process gives it the ability to adjust the tracking box iteratively, and online fine-tuning lets it adapt to the specific target object. This solves the prior-art problem of inaccurate tracking when the target's shape changes severely.
Second, a new deep reinforcement network, ENet, is introduced to control the online fine-tuning of TNet and guide the online update of the tracking network. This avoids model degradation caused by erroneous or ill-timed sampling, solves the prior-art problem of tracking failure when the target is occluded or interfered with, especially under long-term occlusion, and enhances the robustness and stability of tracking.
Drawings
Fig. 1 is a flowchart of an implementation of a step-by-step target tracking algorithm based on a dual-depth enhanced network on a frame image.
Fig. 2 is a schematic diagram of a step-by-step target tracking algorithm based on a dual-depth enhanced network.
Fig. 3 compares the method of the present invention with ADNet on OTB100, where (a) shows the precision curves and (b) the success-rate curves.
Detailed Description
The present invention is further described below with reference to the drawings and embodiments; the invention includes, but is not limited to, the following embodiments.
Addressing the real-time changes in target appearance and the long-term occlusion that tracking algorithms face, the invention provides a step-by-step target tracking method based on a dual-depth enhanced network, comprising the following steps:
Step 1: design a target tracking network (TNet) that extracts deep convolutional features of the target, issues a sequence of movement behaviors for the tracking box based on those features, and adjusts the tracking box step by step to the minimum bounding rectangle of the target. To this end, TNet is trained offline, with supervised pre-training followed by reinforcement learning.
Step 2: design and train a tracking-result evaluation network (ENet); during tracking this network outputs an online sampling behavior that controls the update process of TNet.
Step 3: during tracking, localize the target accurately step by step with TNet, then draw different training samples according to ENet's evaluation of the current tracking result, fine-tune TNet online, and proceed to the next frame, repeating this step.
The input to TNet in step 1 is an image block; deep convolutional features are fully extracted by three or more convolutional layers and passed through two or more fully connected layers to a behavior output layer and a target confidence output layer. The behavior output layer outputs displacements in four directions, {T_left, T_right, T_up, T_down}, to accurately position the target center, four scale changes, {H_expand, H_shrink, W_expand, W_shrink}, to handle inconsistent deformation along the height and width of the target, and a termination operation {stop}; the confidence output layer outputs the confidence of the corresponding behavior. The height and width scales in the behavior output change independently.
In step 1, to give the convolutional layers of the TNet network strong image feature extraction capability, the network is pre-trained on a public object detection dataset with conventional supervised learning; the objective function is defined as the multi-task cross-entropy loss
L_TNet = λ1 × L_cross-entropy(conf, conf~) + (1 − λ1) × L_cross-entropy(act, act~),
where L_cross-entropy denotes the cross-entropy loss over one-hot targets, act and conf are the outputs of the behavior output layer and the target confidence output layer respectively, act~ and conf~ are the corresponding ground truths, and λ1 weights the two losses over the range [0.55, 0.73]; the pre-training stage focuses more on target confidence, so λ1 is set above 0.5.
The pre-trained convolutional layers of TNet already represent targets well, so their parameters are fixed and the fully connected layers (except the target confidence layer) are trained by reinforcement learning on multi-frame image sequences. In each frame of a sequence the target is localized with the pre-trained TNet until the last frame, whose result is compared against the ground truth.
In step 2, all layers of ENet except the output layer have the same structure as TNet, and all convolutional-layer parameters are shared; the input is the image block in the tracking box of TNet at the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer to make the evaluation better informed. The output of ENet is the sampling behavior {sample_suf, sample_neg, sample_none}, whose choice determines how much and in what way TNet training data are sampled at the current frame and thereby reflects ENet's evaluation of the current tracking result. By carrying out different sampling behaviors, the online fine-tuning samples of TNet change accordingly; in this way ENet evaluates and controls the tracking result of TNet.
In step 2, ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the other layers are initialized from TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, and training simulates the tracking process. The final reward compares IoU(BB_TNet+ENet, GT) with IoU(BB_TNet, GT), where BB_TNet+ENet is the tracking result of TNet assisted by ENet's evaluation, BB_TNet is the tracking result of TNet alone in the last frame, GT is the true target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes: the more the tracking performance improves after using ENet, the larger the reward, and the network is penalized once performance drops. Finally, the fully-connected-layer parameters are optimized by stochastic gradient descent.
In step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn near the target center and negative samples from an annular region away from the center, and TNet is fine-tuned online on them to adapt better to the target being tracked.
The state adjustment performed by TNet in step 3 updates the target state {c_x, c_y, h, w} according to the predicted behavior, where i is the iteration index of the step-by-step adjustment, {c_x, c_y} are the center coordinates and {h, w} the height and width of the target, and a_i is the behavior predicted by TNet at the current iteration. State entries not listed in the case corresponding to a behavior keep the values of the previous iteration.
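For illustration, the following Python sketch applies one predicted behavior to the state {c_x, c_y, h, w}; since the patent's adjustment formula is not reproduced in this text, the per-step displacement and scaling ratio ALPHA is an assumption (ADNet uses a comparable relative step), not a value taken from the patent.

```python
# Sketch of the step-wise state adjustment. The per-step ratio ALPHA is an
# assumption; the patent's own adjustment formula is not reproduced here.
ALPHA = 0.03

def apply_action(state, action):
    """state: (cx, cy, h, w); action: one of TNet's 9 behaviors."""
    cx, cy, h, w = state
    dx, dy = ALPHA * w, ALPHA * h              # translation step sizes
    if action == "T_left":
        cx -= dx
    elif action == "T_right":
        cx += dx
    elif action == "T_up":
        cy -= dy
    elif action == "T_down":
        cy += dy
    elif action == "H_expand":                 # height scales independently
        h *= 1 + ALPHA
    elif action == "H_shrink":
        h *= 1 - ALPHA
    elif action == "W_expand":                 # width scales independently
        w *= 1 + ALPHA
    elif action == "W_shrink":
        w *= 1 - ALPHA
    # "stop" (and unlisted state entries) leave the state unchanged
    return (cx, cy, h, w)
```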
In step 3 the ENet parameters are fixed. Starting from the second frame of the video, the target region obtained by TNet tracking is input to the first convolutional layer conv1 of ENet, and the output of TNet at the last iteration of each frame is concatenated with the output of ENet's fc5 layer as the input of the fc6 layer, yielding the predicted sampling behavior. The training sample set of TNet is adjusted according to that behavior:
Pos_t = Pos_{t−1} ∪ P_t and Neg_t = Neg_{t−1} ∪ N_t for sample_suf;
Pos_t = Pos_{t−1} and Neg_t = Neg_{t−1} ∪ N_t for sample_neg;
Pos_t = Pos_{t−1} and Neg_t = Neg_{t−1} for sample_none;
where P_t and N_t are the positive and negative sample sets drawn at the current frame, ∪ is set union, and Pos_t and Neg_t are the online fine-tuning training samples of TNet for frame t.
Referring to fig. 1 and 2, an embodiment of the present invention comprises the following steps:
Step 1: design the target tracking network (TNet) and train it offline.
(1a) The network structure of TNet is shown in Table 1. The network input is an image block, which passes through three convolutional layers conv1-conv3 and two fully connected layers fc4 and fc5 before reaching two output layers: the behavior output layer fc6 and the target confidence output layer fc7. The three convolutional layers fully extract deep convolutional features of the image; fc6 outputs the probabilities of 9 behaviors in one-hot form, comprising displacements in 4 directions, {T_left, T_right, T_up, T_down}, for accurately positioning the target center, 4 scale changes, {H_expand, H_shrink, W_expand, W_shrink}, to handle inconsistent deformation along the height and width of the target, and a stop operation {stop} that terminates the iteration once the current tracking box is accurately positioned; fc7 outputs the confidence of the corresponding behavior.
Table 1. Configuration of the target tracking network TNet.
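Since the contents of Table 1 are not reproduced here, the following PyTorch sketch only illustrates the topology described in (1a); every layer size below (channels, kernels, strides, the 512-dimensional fully connected layers) is an assumption loosely patterned on ADNet's VGG-M-style backbone.

```python
# Topology sketch of TNet as described in (1a); all layer sizes are assumed.
import torch.nn as nn

class TNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(            # conv1-conv3: deep conv features
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(),
            nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(),
            nn.Conv2d(256, 512, 3, stride=1), nn.ReLU(),
        )
        self.fc4 = nn.Sequential(nn.LazyLinear(512), nn.ReLU())
        self.fc5 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.fc6 = nn.Linear(512, 9)           # behaviors: 4 moves, 4 scales, stop
        self.fc7 = nn.Linear(512, 2)           # target confidence

    def forward(self, x):
        feat = self.fc5(self.fc4(self.convs(x).flatten(1)))
        return self.fc6(feat), self.fc7(feat)  # (act, conf) logits
```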
(1b) To give the first three layers of the TNet network strong image feature extraction capability, the network is pre-trained on a public object detection dataset with conventional supervised learning; the objective function is defined as the multi-task cross-entropy loss
L_TNet = λ1 × L_cross-entropy(conf, conf~) + (1 − λ1) × L_cross-entropy(act, act~)   (1)
where L_cross-entropy denotes the cross-entropy loss over one-hot targets, conf and act are the outputs of the fc7 and fc6 layers respectively, conf~ and act~ are the corresponding ground truths, and λ1 = 0.65 weights the two losses; the pre-training stage focuses more on target confidence, so λ1 is set relatively large.
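A minimal sketch of loss (1) with λ1 = 0.65, assuming the one-hot targets of the text are supplied as class indices:

```python
# Sketch of the multi-task cross-entropy loss of equation (1).
import torch.nn.functional as F

LAMBDA1 = 0.65

def tnet_loss(act_logits, conf_logits, act_gt, conf_gt):
    loss_conf = F.cross_entropy(conf_logits, conf_gt)  # fc7 vs. conf~
    loss_act = F.cross_entropy(act_logits, act_gt)     # fc6 vs. act~
    return LAMBDA1 * loss_conf + (1 - LAMBDA1) * loss_act
```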
The pre-trained convolutional layers of TNet already represent targets well, so the parameters of conv1 to conv3 are fixed and fc4-fc6 are trained by reinforcement learning on multi-frame image sequences. In each frame of a sequence the target is localized with the pre-trained TNet until the last frame, whose result is compared against the ground truth; the reward function (2) is defined from the overlap IoU(BB_TNet, GT), where BB_TNet is the tracking result of TNet in the last frame, GT is the true target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes. Finally, the fc4, fc5 and fc6 layer parameters are optimized by stochastic gradient descent.
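The exact form of reward (2) is not reproduced in this text, so the sketch below pairs a standard IoU helper with an assumed ADNet-style thresholded terminal reward:

```python
# IoU helper plus a sketch of reward (2); the thresholded form is an assumption.
def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2); returns intersection-over-union."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tnet_reward(bb_tnet, gt, thresh=0.7):      # thresh is an assumption
    return 1.0 if iou(bb_tnet, gt) > thresh else -1.0
```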
Step 2: design and train the tracking evaluation network (ENet) offline.
(2a) The structure of ENet is similar to that of TNet. As shown in Table 2, the two networks have the same conv1-fc5 layers, with the conv1-conv3 parameters shared with TNet. The input is the image block in the tracking box of TNet at the current frame t, and the tracking results of previous frames are concatenated after the output of the fc5 layer, specifically: a) the behavior prediction and confidence output by TNet at the first frame of the video, since the target state of the first frame is known and serves as a reference; and b) the final behavior predictions and confidences output by TNet in the previous m frames (m = 15). The dimension of these two items is (9+2) + (9+2) × m. The output of ENet is one of 3 sampling behaviors for the current frame, {sample_suf, sample_neg, sample_none}, so the output layer fc6 is a fully connected layer of size (512+176) × 3. These 3 sampling behaviors reflect ENet's evaluation of the current tracking result, in decreasing order of confidence, and the resulting samples are used to update the TNet network online. sample_suf performs full positive and negative sampling around the current result, indicating that ENet judges the current tracking result highly reliable; sample_neg samples only the area around the tracking result as negatives, indicating that the target in the current frame is partially occluded or greatly deformed; sample_none indicates that the target is completely occluded or tracking has failed, and no samples are drawn. Through these different sampling behaviors, the online training samples of TNet differ, achieving better network parameters and tracking accuracy.
Table 2. Configuration of the tracking evaluation network ENet.
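The following sketch assembles the (9+2) + (9+2) × m = 176-dimensional history input described in (2a), zero-filling when fewer than m frames have been tracked; the tensor layout is an assumption.

```python
# Sketch of ENet's history input: first-frame (act, conf) pair plus the final
# pairs of the previous m frames, zero-filled to a fixed 176-d vector.
import torch

M = 15
PAIR_DIM = 9 + 2  # 9 behavior probabilities + 2 confidence values

def enet_history(first_pair, recent_pairs):
    """first_pair: 11-d tensor; recent_pairs: up-to-M list of 11-d tensors."""
    pads = [torch.zeros(PAIR_DIM) for _ in range(M - len(recent_pairs))]
    return torch.cat([first_pair] + pads + list(recent_pairs))  # 176-d
```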
(2b) ENet's conv1 to fc5 layer parameters are initialized from the corresponding layers of TNet and its fc6 layer is initialized randomly; the conv1 to conv3 parameters are fixed. The training data are video sequences, and training simulates the tracking process. The final reward function (3) compares IoU(BB_TNet+ENet, GT) with IoU(BB_TNet, GT), where BB_TNet+ENet is the tracking result of TNet assisted by ENet's evaluation and the remaining symbols are as in equation (2). Equation (3) expresses that the larger the performance gain from using ENet, the larger the reward, while the network is penalized once performance drops. Finally, the fc4, fc5 and fc6 layer parameters are optimized by stochastic gradient descent.
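Again, the exact form of reward (3) is not reproduced here; the signed IoU difference below is one assumption consistent with the description ("larger gain, larger reward; penalize any drop"), reusing the iou helper sketched after equation (2).

```python
# Sketch of reward (3) as the signed change in overlap due to ENet's guidance.
def enet_reward(bb_with_enet, bb_tnet_only, gt):
    return iou(bb_with_enet, gt) - iou(bb_tnet_only, gt)
```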
Step 3: sample positive and negative examples online in the first frame of the video and fine-tune TNet.
The state of the target to be tracked in the first frame, s^1 = {c_x^1, c_y^1, h^1, w^1}, is given manually or by an interactive algorithm, where {c_x^1, c_y^1} are the coordinates of the target center and {h^1, w^1} the target height and width; the superscript 1 denotes the first frame. Sufficient positive and negative samples are then drawn around the given target. Positive sampling rule: sample centers are drawn randomly within a designated region around the center of s^1; sample heights and widths are randomly scaled from {h^1, w^1} over the ranges [0.85, 1.15] × h^1 and [0.85, 1.15] × w^1; and each positive sample must overlap s^1 with a ratio of at least 0.75. The resulting positive set Pos^1 contains 400 samples. Negative sampling rule: sample centers lie in an annular region around, but away from, the center of s^1; heights and widths vary randomly as for positive samples; and each negative sample must overlap s^1 with a ratio below 0.5. The resulting negative set Neg^1 contains 400 samples.
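A sketch of these sampling rules follows; the scale range, overlap thresholds and sample counts are as stated above, while the center-jitter radius is an assumption (iou as in the earlier sketch).

```python
# Sketch of first-frame positive/negative sampling by rejection.
import random

def to_box(s):
    cx, cy, h, w = s
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def sample_set(target, n, keep):
    cx, cy, h, w = target
    out = []
    while len(out) < n:                        # rejection sampling
        cand = (cx + random.uniform(-0.5, 0.5) * w,   # assumed jitter radius
                cy + random.uniform(-0.5, 0.5) * h,
                h * random.uniform(0.85, 1.15),
                w * random.uniform(0.85, 1.15))
        if keep(iou(to_box(cand), to_box(target))):
            out.append(cand)
    return out

def first_frame_samples(s1):
    pos1 = sample_set(s1, 400, lambda o: o >= 0.75)   # Pos^1
    neg1 = sample_set(s1, 400, lambda o: o < 0.5)     # Neg^1
    return pos1, neg1
```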
Step 4: use TNet to localize the target step by step in a new frame.
When a new frame t is tracked, Pos_{t−1} and Neg_{t−1} are first used as training samples to fine-tune TNet online in the same supervised fashion as step (1b). The final target state of the previous frame, s_{t−1}, is then taken as the initial state of the current frame, s_t^1, and the image block represented by s_t^1 is extracted from frame t as the candidate target.
(4a) The candidate target is input to TNet to obtain the corresponding behavior probabilities and confidence, and the behavior with the highest probability is selected as the prediction; the target state is adjusted accordingly, and state entries not listed in the case corresponding to the behavior keep the values of the previous iteration. The image block of the new state is then extracted from frame t as the new candidate target and step (4a) is repeated, as shown in the upper portion of the schematic of Fig. 2.
If the predicted behavior is {stop}, the current tracking box is accurately positioned and the iteration stops; the target state after the step-by-step adjustment is the tracking result of the current frame.
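Putting step (4a) and the stop test together, a sketch of the step-wise localization loop follows (apply_action as sketched in the summary above; the iteration cap is an assumption, since none is reproduced here).

```python
# Sketch of the step-wise localization loop: pick the most probable behavior
# and adjust the state until "stop" or an assumed iteration cap.
ACTIONS = ["T_left", "T_right", "T_up", "T_down",
           "H_expand", "H_shrink", "W_expand", "W_shrink", "stop"]

def localize(frame, state, tnet, crop, max_iters=20):
    """crop(frame, state) -> image block; tnet(block) -> (act_logits, conf)."""
    for _ in range(max_iters):
        act_logits, _ = tnet(crop(frame, state))
        action = ACTIONS[int(act_logits.argmax())]
        if action == "stop":                   # box judged accurate: terminate
            break
        state = apply_action(state, action)    # otherwise adjust and repeat
    return state
```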
and 5, evaluating a current frame positioning result by utilizing the ENet and determining a sampling behavior.
During tracking, the ENet parameters are fixed. Starting from the second frame of the video, the target region tracked by TNet is input to the first convolutional layer conv1 of ENet, and the outputs of TNet at the last iteration of each frame are concatenated with the output of ENet's fc5 layer as the input of the fc6 layer; while fewer than m frames have been tracked, the missing history entries are filled with 0 to keep the dimensions consistent. The sampling behavior with the maximum probability output by fc6 is selected as the current prediction.
The training sample set of TNet is adjusted according to the sampling behavior given by the 3-dimensional probability output of fc6:
Pos_t = Pos_{t−1} ∪ P_t and Neg_t = Neg_{t−1} ∪ N_t for sample_suf;
Pos_t = Pos_{t−1} and Neg_t = Neg_{t−1} ∪ N_t for sample_neg;
Pos_t = Pos_{t−1} and Neg_t = Neg_{t−1} for sample_none;
where P_t and N_t are the positive and negative sample sets drawn at the current frame (sampled as in step 3), ∪ is set union, and Pos_t and Neg_t are the online fine-tuning training samples of TNet for frame t.
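A sketch of this update, treating the sample sets as Python sets and following the three cases written out above:

```python
# Sketch of the ENet-driven sample-set update.
def update_samples(behavior, pos_prev, neg_prev, p_t, n_t):
    if behavior == "sample_suf":       # result reliable: add both sets
        return pos_prev | p_t, neg_prev | n_t
    if behavior == "sample_neg":       # occlusion/deformation: negatives only
        return pos_prev, neg_prev | n_t
    return pos_prev, neg_prev          # sample_none: no update
```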
If the current frame is not the last frame of the video, return to step 4 to track the next frame; otherwise, the tracking process for the video ends.
The tracking method of the invention was evaluated on OTB100, using precision and success-rate curves as the evaluation criteria. OTB100 contains 100 video sequences covering many challenging factors such as target motion blur, background clutter, partial occlusion and complete occlusion. The proposed method was compared with the classical ADNet algorithm; Fig. 3 shows the quantitative comparison, from which it can be seen that the method achieves good tracking precision and robustness and outperforms the comparison algorithm.
The algorithm of the present invention is not limited to the above embodiments; any technical solution obtained by equivalent substitution falls within the scope of the present invention.
Claims (10)
1. A step-by-step target tracking method based on a dual-depth enhanced network, characterized by comprising the following steps:
Step 1: design a target tracking network TNet that extracts deep convolutional features of the target; train TNet offline, with supervised pre-training followed by reinforcement learning training;
Step 2: design and train a tracking-result evaluation network ENet, which outputs an online sampling behavior during tracking to control the update process of TNet;
Step 3: during tracking, localize the target with TNet; according to ENet's evaluation of the current tracking result, draw different training samples and fine-tune TNet online, then proceed to the next frame and repeat, so that the tracking box is adjusted step by step to the minimum bounding rectangle of the target.
2. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the input to TNet is an image block; deep convolutional features are extracted by three or more convolutional layers and passed through two or more fully connected layers to a behavior output layer and a target confidence output layer; the behavior output layer outputs displacements in four directions, {T_left, T_right, T_up, T_down}, to accurately position the target center, four scale changes, {H_expand, H_shrink, W_expand, W_shrink}, to handle inconsistent deformation along the height and width of the target, and a termination operation {stop}; and the target confidence output layer outputs the confidence of the corresponding behavior.
3. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 2, wherein: the scales in the height direction and the width direction in the 4 scale changes are changed independently.
4. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, TNet is pre-trained on a public object detection dataset with a supervised learning method, and the objective function is defined as the multi-task cross-entropy loss L_TNet = λ1 × L_cross-entropy(conf, conf~) + (1 − λ1) × L_cross-entropy(act, act~), where L_cross-entropy denotes the cross-entropy loss over one-hot targets, act and conf are the outputs of the behavior output layer and the target confidence output layer respectively, act~ and conf~ are the corresponding ground truths, and λ1 represents the weight distribution of the two losses.
5. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 4, wherein: the value of λ1 ranges over [0.55, 0.73].
6. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 1, the convolutional-layer parameters of the pre-trained TNet are fixed, and its fully connected layers are trained by reinforcement learning on multi-frame image sequences; in each frame of a sequence the target is localized with the pre-trained TNet until the last frame, whose result is compared against the ground truth.
7. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 2, all layers of ENet except the output layer have the same structure as TNet, and all convolutional-layer parameters are shared; the input is the image block in the tracking box of TNet at the current frame, and the historical tracking results of previous frames are concatenated after the output of the penultimate layer; the output of ENet is the sampling behavior {sample_suf, sample_neg, sample_none} used to sample TNet training data at the current frame, and carrying out different sampling behaviors changes the online fine-tuning samples of TNet accordingly.
8. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 2, ENet is trained directly by reinforcement learning: the output layer is randomly initialized, the other layers are initialized from TNet, and the convolutional-layer parameters are fixed; the training data are video sequences, and training simulates the tracking process; the final reward compares IoU(BB_TNet+ENet, GT) with IoU(BB_TNet, GT), where BB_TNet+ENet is the tracking result of TNet assisted by ENet's evaluation, BB_TNet is the tracking result of TNet alone in the last frame, GT is the true target state of the last frame, and IoU(·) computes the overlap ratio of two rectangular boxes.
9. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 3, the state of the target to be tracked in the first frame of the video is given manually or by an interactive algorithm; positive samples are drawn from a circular region whose distance to the target center is below a set threshold, negative samples from an annular region whose distance exceeds the threshold, and TNet is fine-tuned on them; from the second frame of the video, the target region obtained by TNet tracking is input to the first convolutional layer of ENet, and the output of TNet at the last iteration of each frame is concatenated with the output of ENet's fc5 layer as the input of the fc6 layer to obtain the predicted sampling behavior; the training sample set of TNet is adjusted according to that sampling behavior.
10. The dual-depth-enhanced-network-based step-by-step target tracking method according to claim 1, wherein: in step 3, TNet adjusts the state {c_x, c_y, h, w} of the target according to the predicted behavior at each iteration, where i is the iteration index of the step-by-step adjustment, {c_x, c_y} are the center coordinates and {h, w} the height and width of the target, and a_i is the behavior predicted by TNet at the current iteration; state entries not listed in the case corresponding to a behavior keep the values of the previous iteration;
the training sample set of the TNet is adjusted according to the sampling behavior in the way of
Wherein P ist、NtRespectively positive and negative sample sets sampled at the current frame, U is set merging operation, PostAnd NegtOn-line fine-tuning training samples for the t-th frame TNet, { samplesuf,sampleneg,samplenoneIs the three output behaviors of ENet.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011057357.5A (granted as CN112150510B) | 2020-09-29 | 2020-09-29 | Step-by-step target tracking method based on a dual-depth enhanced network
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011057357.5A (granted as CN112150510B) | 2020-09-29 | 2020-09-29 | Step-by-step target tracking method based on a dual-depth enhanced network
Publications (2)
Publication Number | Publication Date |
---|---|
CN112150510A (en) | 2020-12-29
CN112150510B (en) | 2024-03-26
Family ID: 73895941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011057357.5A (granted as CN112150510B, Active) | Step-by-step target tracking method based on a dual-depth enhanced network | 2020-09-29 | 2020-09-29
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112150510B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112991346A (en) * | 2021-05-13 | 2021-06-18 | 深圳科亚医疗科技有限公司 | Training method and training system for learning network for medical image analysis |
CN115099372A (en) * | 2022-08-25 | 2022-09-23 | 深圳比特微电子科技有限公司 | Classification identification method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106920248A (en) * | 2017-01-19 | 2017-07-04 | 博康智能信息技术有限公司上海分公司 | A kind of method for tracking target and device |
CN106960446A (en) * | 2017-04-01 | 2017-07-18 | 广东华中科技大学工业技术研究院 | A kind of waterborne target detecting and tracking integral method applied towards unmanned boat |
CN109801310A (en) * | 2018-11-23 | 2019-05-24 | 南京信息工程大学 | A kind of method for tracking target in orientation and scale differentiation depth network |
US20200065976A1 (en) * | 2018-08-23 | 2020-02-27 | Seoul National University R&Db Foundation | Method and system for real-time target tracking based on deep learning |
US20200126241A1 (en) * | 2018-10-18 | 2020-04-23 | Deepnorth Inc. | Multi-Object Tracking using Online Metric Learning with Long Short-Term Memory |
CN111666631A (en) * | 2020-06-03 | 2020-09-15 | 南京航空航天大学 | Unmanned aerial vehicle maneuvering decision method combining hesitation fuzzy and dynamic deep reinforcement learning |
- 2020-09-29: application CN202011057357.5A filed in China; granted as patent CN112150510B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106920248A (en) * | 2017-01-19 | 2017-07-04 | 博康智能信息技术有限公司上海分公司 | A kind of method for tracking target and device |
CN106960446A (en) * | 2017-04-01 | 2017-07-18 | 广东华中科技大学工业技术研究院 | A kind of waterborne target detecting and tracking integral method applied towards unmanned boat |
US20200065976A1 (en) * | 2018-08-23 | 2020-02-27 | Seoul National University R&Db Foundation | Method and system for real-time target tracking based on deep learning |
US20200126241A1 (en) * | 2018-10-18 | 2020-04-23 | Deepnorth Inc. | Multi-Object Tracking using Online Metric Learning with Long Short-Term Memory |
CN109801310A (en) * | 2018-11-23 | 2019-05-24 | 南京信息工程大学 | A kind of method for tracking target in orientation and scale differentiation depth network |
CN111666631A (en) * | 2020-06-03 | 2020-09-15 | 南京航空航天大学 | Unmanned aerial vehicle maneuvering decision method combining hesitation fuzzy and dynamic deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
SANGDOO YUN,JONGWON CHOI,YOUNGJOON YOO,KIMIN YUN,JIN YOUNG CHOI: "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, pages 1349 - 1356 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112991346A (en) * | 2021-05-13 | 2021-06-18 | 深圳科亚医疗科技有限公司 | Training method and training system for learning network for medical image analysis |
CN112991346B (en) * | 2021-05-13 | 2022-04-26 | 深圳科亚医疗科技有限公司 | Training method and training system for learning network for medical image analysis |
US12094188B2 (en) | 2021-05-13 | 2024-09-17 | Shenzhen Keya Medical Technology Corporation | Methods and systems for training learning network for medical image analysis |
CN115099372A (en) * | 2022-08-25 | 2022-09-23 | 深圳比特微电子科技有限公司 | Classification identification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN112150510B (en) | 2024-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Labach et al. | Survey of dropout methods for deep neural networks | |
CN112163516B (en) | Rope skipping counting method and device and computer storage medium | |
CN110335290A (en) | Twin candidate region based on attention mechanism generates network target tracking method | |
CN107369166B (en) | Target tracking method and system based on multi-resolution neural network | |
CN108764006B (en) | SAR image target detection method based on deep reinforcement learning | |
CN112651998B (en) | Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network | |
CN110120064A (en) | A kind of depth related objective track algorithm based on mutual reinforcing with the study of more attention mechanisms | |
CN112150510A (en) | Stepping target tracking method based on double-depth enhanced network | |
CN109767036A (en) | Support vector machines failure prediction method based on the optimization of adaptive ant lion | |
CN111476814B (en) | Target tracking method, device, equipment and storage medium | |
CN110991621A (en) | Method for searching convolutional neural network based on channel number | |
CN110096202B (en) | Automatic lightweight image clipping system and method based on deep reinforcement learning | |
CN110942142A (en) | Neural network training and face detection method, device, equipment and storage medium | |
CN112802061A (en) | Robust target tracking method and system based on hierarchical decision network | |
CN116342624A (en) | Brain tumor image segmentation method combining feature fusion and attention mechanism | |
CN111105442B (en) | Switching type target tracking method | |
CN112614163A (en) | Target tracking method and system fusing Bayesian trajectory inference | |
CN110544267B (en) | Correlation filtering tracking method for self-adaptive selection characteristics | |
CN110378932B (en) | Correlation filtering visual tracking method based on spatial regularization correction | |
CN116258877A (en) | Land utilization scene similarity change detection method, device, medium and equipment | |
CN118742925A (en) | Tracking query points in video using neural networks | |
CN114973071A (en) | Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics | |
US10643092B2 (en) | Segmenting irregular shapes in images using deep region growing with an image pyramid | |
CN110428447B (en) | Target tracking method and system based on strategy gradient | |
CN117237893A (en) | Automatic driving multi-target detection method based on instance self-adaptive dynamic neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |