CN110728694A - Long-term visual target tracking method based on continuous learning - Google Patents
Long-term visual target tracking method based on continuous learning
- Publication number
- CN110728694A (application CN201910956780.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- sample
- tracking
- sample set
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a long-term visual target tracking method based on continuous learning. A deep neural network structure is designed for long-term visual target tracking, an initialized network model is obtained through model initialization, then online tracking is carried out by using the initialized network model, and long-term or short-term model updating is carried out by using a continuous learning method in the tracking process so as to adapt to various changes of a target in the tracking process. The method converts the on-line updating process of the traditional visual target tracking model into the continuous learning process, establishes the complete appearance description of the target from all historical data of the video, and effectively improves the robustness of long-term visual tracking. The method can provide an effective solution for long-term visual target tracking for application requirements such as intelligent video monitoring, man-machine interaction, visual navigation and the like.
Description
Technical Field
The invention belongs to the field of computer vision and image video processing, and particularly relates to a long-term visual target tracking method based on continuous learning.
Background
Visual target tracking is a basic problem in computer vision and image and video processing, and is widely applied in automatic analysis of surveillance video, human-computer interaction, visual navigation and similar fields. Tracking methods can be roughly divided into two categories according to the length of the video sequence: short-term target tracking and long-term target tracking. Generally, tracking is called long-term when the video sequence is longer than 1000 frames. Current short-term tracking algorithms perform well on relatively short videos, but when they are applied directly to real long video sequences, their accuracy and robustness do not meet the requirements of practical scenes.
In the long-term tracking task, besides the challenges common to short-term scenes, such as target scale change, illumination change and target deformation, the tracker must also robustly re-lock onto targets that frequently reappear after disappearing. Long-term tracking is therefore more challenging than traditional short-term tracking and closer to the actual requirements of application scenes. However, tracking techniques for such long videos are relatively scarce, and the performance of existing methods is very limited. One existing long-term tracking approach combines traditional tracking with traditional target detection to handle target deformation, partial occlusion and similar problems. The 'significant feature points' of the tracking module, the target model of the detection module and related parameters are continuously updated through an online learning mechanism, which makes tracking more robust and reliable. Other methods use keypoint matching tracking and robust estimation techniques, can integrate long-term memory, and can provide additional information for output control. These methods can search for the target in the whole frame, but their performance is not ideal because they use only simple hand-crafted features. Recently, tracking methods based on correlation filtering and deep learning have been proposed. Although some include re-detection schemes for long-term tracking, they search only within a local range of the image, so a target that has left the field of view cannot be recaptured, and the requirements of long-term tracking are not met.
Given the current state of the art, visual target tracking based on deep convolutional neural network image classification has great potential for distinguishing the target from a cluttered background, and tracking methods built on this framework have broad development prospects. However, tracking models trained only offline often struggle to adapt to online changes in the video, while simply updating the model frequently with new data accelerates tracking drift, so such models tend to fail on long-term tracking problems. The invention balances the historical memory and the online updating of the model through continuous learning, and on this basis provides a long-term visual target tracking method based on continuous learning.
Disclosure of Invention
The method uses continuous learning theory to turn the online model update of visual target tracking into a continuous learning process, learning an effective abstraction and representation of the time-sequence images over the whole video sequence and establishing a complete description of the target. The method thereby adapts to target deformation, background interference, occlusion, illumination change and similar conditions during tracking, improves the adaptability and reliability of existing tracking methods during online updating, reduces the sensitivity of the model to noise such as target deformation and occlusion, and achieves stable long-term tracking of the target.
The invention is realized by the following technical means: a long-term visual target tracking method based on continuous learning, which mainly comprises four parts: network model design, model initialization, online tracking and model updating.
Network model design: first, a deep neural network structure is designed according to the overall flow shown in FIG. 1; the feature maps of each network stage are then adjusted to suitable sizes.
Model initialization: mainly comprises 3 steps: acquiring the initial-frame segmentation image; generating the model initialization training sample library; model initialization training and model acquisition. The model initialization training and model acquisition stage includes the selection of a loss function and a gradient descent method.
Online tracking: mainly comprises 3 steps: generating candidate samples; obtaining the best candidate sample; locating the target region using target box regression.
Model updating: mainly comprises 3 steps: selecting the update mode; generating and updating the model update sample library; continuous-learning model training and model acquisition. Generating the sample library includes acquiring the online sample set and the memory perception sample set; updating the sample library includes updating the online sample set and the memory perception sample set; the continuous-learning model training and model acquisition stage includes the selection of a loss function and a gradient descent method.
The network model design comprises the following specific steps:
(1) The deep neural network structure designed by the invention: as shown in FIG. 2, the network consists of a shared layer and a classification layer. The shared layer comprises 3 convolutional layers, 2 max pooling layers, 2 fully connected layers and 5 nonlinear ReLU activation layers. The convolutional layers are identical to the corresponding parts of the generic VGG-M network. The two fully connected layers that follow each have 512 output units and are combined with ReLU and Dropout modules. The classification layer is a binary classification layer with a Dropout module and a softmax loss, and is responsible for distinguishing the target from the background.
In CNN image processing, convolutional layers are connected through convolution filters. A convolution filter is defined as N × C × W × H, where N is the number of filter types, C is the number of channels being filtered, and W and H are the width and height of the filtering range.
(2) In the continuous learning long-term target tracking process, the input and output feature maps of each layer change as follows:
During tracking, images of different sizes are resized to 3 × 107 × 107 and then input into the network. In the first convolutional layer, the input passes through 96 convolution kernels of size 7 × 7, giving 96 output channels, then through a nonlinear ReLU activation layer and a local response normalization layer, and finally through a max pooling layer to give a 96 × 25 × 25 feature map. In the second convolutional layer, the 96 × 25 × 25 feature map passes through 256 convolution kernels of size 5 × 5, then through a ReLU activation layer and a local response normalization layer, giving 256 output channels, and finally through a max pooling layer to give a 256 × 5 × 5 feature map. In the third convolutional layer, the 256 × 5 × 5 feature map passes through 512 convolution kernels of size 3 × 3 and then through a ReLU activation layer to give a 512 × 3 × 3 feature map. In the fourth layer, a fully connected layer, the 512 × 3 × 3 feature map passes through 512 neural units and then a ReLU activation layer to give a 512-dimensional feature vector. In the fifth layer, also fully connected, the 512-dimensional feature vector passes through 512 neural units, then a Dropout layer, and finally a ReLU activation layer to give a 512-dimensional feature vector. In the classification layer, the 512-dimensional feature vector passes through a Dropout layer and is input to the binary classification layer with softmax loss, which outputs a 2-dimensional classification score.
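The feature-map sizes listed above can be checked with the usual size formula for unpadded convolution and pooling. The sketch below is only an illustration: the strides (2 for the first two convolutions, 1 for the third) and the 3 × 3, stride-2 pooling windows are assumptions borrowed from the common VGG-M configuration, since the text gives only kernel sizes and output shapes. The names `out_size` and `shared_layer_shapes` are hypothetical.

```python
def out_size(n, kernel, stride):
    """Spatial size after an unpadded convolution or pooling window."""
    return (n - kernel) // stride + 1


def shared_layer_shapes(input_size=107):
    # (name, output channels, kernel, stride); None keeps the channel count.
    # Strides and pooling windows are assumed values, not stated in the text.
    stages = [
        ("conv1", 96, 7, 2), ("pool1", None, 3, 2),
        ("conv2", 256, 5, 2), ("pool2", None, 3, 2),
        ("conv3", 512, 3, 1),
    ]
    channels, size, shapes = 3, input_size, {}
    for name, out_ch, kernel, stride in stages:
        size = out_size(size, kernel, stride)
        channels = out_ch if out_ch is not None else channels
        shapes[name] = (channels, size, size)
    return shapes


shapes = shared_layer_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumed strides the pooled outputs come out as 96 × 25 × 25 and 256 × 5 × 5, and the third convolution gives 512 × 3 × 3, matching the sizes stated in the text.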
The model initialization comprises the following specific steps:
(1) obtaining an initial frame segmentation image: the quality of the initial frame template has a significant impact on the current tracking results. To increase the detailed representation of the tracked target, superpixel level segmentation is applied by a Simple Linear Iterative Clustering (SLIC) superpixel segmentation method, so that the segmented image not only coincides in color and texture with the target, but also retains the structural information of the target, as shown in fig. 3.
(2) Generation of the training sample library: N_1 samples are randomly drawn around the initial target position in the first-frame original image and in the segmented image, respectively. The samples are labeled as positive samples (score between 0.7 and 1.0) or negative samples (score between 0 and 0.5) according to their intersection-over-union (IoU) score with the ground-truth bounding box.
(3) Model initialization training and model acquisition: in the initial frame of the tracking sequence, the loss of the classification score finally output by the network is computed with the binary cross-entropy loss function, and the parameters of the fully connected layers of the network are then updated by gradient descent. The fully connected layers are trained for H_1 (50) iterations; the learning rate of the fully connected layers FC4-5 is set to 0.0005 and that of the classification layer FC6 to 0.005; momentum and weight decay are set to 0.9 and 0.0005, respectively; each mini-batch consists of M^+ (32) positive samples and 96 hard negative samples selected from M^- (1024) negative samples. Training stops after H_1 (50) iterations, giving the network initialization model.
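The labeling rule in step (2) can be sketched as follows. This is a minimal illustration, assuming boxes are given as (x, y, width, height); the function names `iou` and `label_sample` are hypothetical, and samples whose score falls between the two ranges are assumed to be discarded.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_sample(box, ground_truth, pos_range=(0.7, 1.0), neg_range=(0.0, 0.5)):
    """Return 1 (positive), 0 (negative) or None (assumed discarded)."""
    score = iou(box, ground_truth)
    if pos_range[0] <= score <= pos_range[1]:
        return 1
    if neg_range[0] <= score <= neg_range[1]:
        return 0
    return None
```

For example, a box shifted by one tenth of the target width has an IoU of about 0.82 and is labeled positive, while a box with no overlap is labeled negative.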
The online tracking comprises the following specific steps:
(1) Target candidate sample generation: for each frame of the video sequence, N_2 candidate samples are first drawn around the predicted position of the target in the previous frame.
(2) Obtaining the best candidate sample: the N_2 candidate samples obtained in step (1) are fed into the current network model to compute classification scores, and the candidate sample with the highest classification score is taken as the estimated target position.
(3) Target box regression: after the estimated target position is obtained in step (2), the target region is located with a target box regression method to obtain the tracking result.
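The selection in step (2) can be sketched as below. This is an illustration only: `score_fn` is a hypothetical stand-in for the network that returns a (target score, background score) pair per candidate, and the foreground test (target score larger than background score) is an assumed reading of "the model classifies the estimated target position as background". Target box regression is left out.

```python
def track_frame(candidates, score_fn):
    """Pick the candidate with the highest target score.

    score_fn is a hypothetical stand-in for the network: it maps a candidate
    box to a (target_score, background_score) pair.
    """
    scored = [(score_fn(box), box) for box in candidates]
    (target_score, background_score), best_box = max(
        scored, key=lambda item: item[0][0]
    )
    # Assumed success test: the best candidate is classified as foreground.
    is_foreground = target_score > background_score
    return best_box, is_foreground
```

The returned flag is the kind of signal that could drive the choice between collecting new samples (tracking succeeded) and triggering a short-term update (estimated position classified as background).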
The model updating comprises the following specific steps:
(1) Selecting the update mode: two complementary aspects of target tracking, robustness and adaptivity, are considered together, and two model update modes are adopted: long-term updating and short-term updating. During tracking, a long-term update is performed every f (8-10) frames, and a short-term update is performed when the model classifies the estimated target position as background.
(2) Generating and updating the model update sample library: the model update sample library comprises two parts, an online sample set and a memory perception sample set, where f_l (80-100) and f_s (20-30) denote the set number of frames for long-term sample collection and for short-term sample collection, respectively. The online sample set consists of an online positive sample set and an online negative sample set, and the memory perception sample set consists of a memory perception positive sample set and a memory perception negative sample set. Initially, the online positive and negative sample sets contain 500 positive samples and 5000 negative samples generated by random sampling at the initial-frame target location. For each frame t of the online tracking video sequence, when the model classifies the estimated target position as foreground, tracking is considered successful; 50 positive samples and 200 negative samples are then randomly sampled around the estimated target position and added to the online positive and negative sample sets, respectively. For the online positive sample set, once tracking has succeeded for more than f_l (80-100) frames, the positive samples collected in the earliest frame are deleted and added to the memory perception positive sample set, so that the online positive sample set holds only the samples of the latest f_l (80-100) successfully tracked frames. For the online negative sample set, once tracking has succeeded for more than f_s (20-30) frames, the negative samples collected in the earliest frame are deleted and added to the memory perception negative sample set, so that the online negative sample set holds only the samples of the latest f_s (20-30) successfully tracked frames.
For the memory perception positive sample set, once it has collected more than f_l (80-100) frames, the samples are clustered into N_C (10-15) classes with the K-means clustering algorithm. When new samples arrive, the Euclidean distances between their feature mean vector and each of the N_C cluster centers are computed, the new samples are added to the class with the smallest Euclidean distance, and the same number of the earliest samples in that class are deleted, so that the total number of samples in the memory perception positive sample set is the same before and after the update. For the memory perception negative sample set, once it has collected more than f_s (20-30) frames, the samples collected in the earliest frame are deleted, so that the memory perception negative sample set holds only the latest f_s (20-30) frames of samples.
(3) Model training and model acquisition in continuous learning mode: continuous-learning model training comprises two stages, warm-up training and joint optimization training. Warm-up training lets the model adapt to the current target changes, while joint optimization training lets the model remember historical target changes, so that a complete description of the target is built up during long-term tracking. When the tracked target reappears after leaving the field of view, it can be found again quickly using the model's historical memory, giving stable long-term tracking. When a long-term or short-term update is performed and the memory perception sample set has not yet collected any samples, the model is trained on the online sample set collected in step (2), and the classification loss of the classification score finally output by the network is computed with the binary cross-entropy loss function. The parameters of the fully connected layers are then updated by gradient descent according to the current classification loss, training the fully connected layers for H_2 (15) iterations.
When the memory perception sample set contains samples, the model is first warm-up trained on the online sample set collected in step (2): the classification loss is computed with the binary cross-entropy loss function and the parameters of the fully connected layers are updated by gradient descent, training the fully connected layers for H_3 (10) iterations. After warm-up training, the model is jointly optimized on the online sample set and the memory perception sample set collected in step (2): the classification loss on the online sample set is computed with the binary cross-entropy loss function, the knowledge distillation loss on the memory perception sample set is computed with the knowledge distillation loss function, and the final total loss is the classification loss plus λ times the knowledge distillation loss. After the total loss is computed, the parameters of the fully connected layers are updated by gradient descent, training the fully connected layers for H_4 (15) iterations. In every training stage, the learning rate of the fully connected layers FC4-5 is set to 0.001 and that of the classification layer FC6 to 0.01, momentum and weight decay are set to 0.9 and 0.0005, respectively, and each mini-batch consists of M^+ (32) positive samples and 96 hard negative samples selected from M^- (1024) negative samples.
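The two-stage schedule and the combined loss of step (3) can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the text does not give the form of the knowledge distillation loss, so a soft cross-entropy between the old model's and the updated model's softmax outputs is assumed here; the value of λ is not given either, so `lam=1.0` is only a placeholder; and the output order (background, target) is an assumed convention. All function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Classification loss for one sample; probs = (p_background, p_target)."""
    return -math.log(max(probs[label], 1e-12))


def distillation(old_probs, new_probs):
    """Assumed form: soft cross-entropy between old and new model outputs."""
    return -sum(p_old * math.log(max(p_new, 1e-12))
                for p_old, p_new in zip(old_probs, new_probs))


def total_loss(online_batch, memory_batch, lam=1.0):
    """Classification loss plus lam times the knowledge distillation loss.

    online_batch: list of (new_probs, label) from the online sample set.
    memory_batch: list of (old_probs, new_probs) from the memory perception set.
    lam is a placeholder value; the text does not state it.
    """
    cls = sum(cross_entropy(p, y) for p, y in online_batch) / len(online_batch)
    kd = sum(distillation(o, n) for o, n in memory_batch) / len(memory_batch)
    return cls + lam * kd


def update_schedule(memory_has_samples):
    """Training phases as (name, iterations), using the counts H_2, H_3, H_4."""
    if not memory_has_samples:
        return [("online_only", 15)]
    return [("warm_up", 10), ("joint", 15)]
```

Keeping the distillation term on the memory perception samples only is what lets the classification term follow the newest appearance while the old responses on historical samples are preserved.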
The invention has the characteristics that:
The invention provides a long-term visual target tracking method based on continuous learning. The method turns the online update of a traditional visual target tracking model into a continuous learning process. By combining a dynamically constructed online sample set with a memory perception sample set, it learns changes in the target's occlusion, shape, scale, illumination and so on over a long time span, so that the time-series data of the whole video sequence are effectively abstracted and represented and a complete description of the target is established. After the target has been occluded or out of view for a long time, the method can quickly retrieve it when it reappears, using the continuously learned historical model. Compared with existing visual target tracking techniques, the method balances the historical memory and online updating of the model through continuous learning, overcomes the 'catastrophic forgetting' caused in the prior art by frequent updating with new data, builds a complete description of the target from all historical data of the video, obtains a target model insensitive to noise, and improves the robustness of visual tracking, thereby achieving long-term tracking. The method can provide an effective solution for long-term visual target tracking in applications such as intelligent video surveillance, human-computer interaction and visual navigation.
Description of the drawings:
FIG. 1 is an overall flow chart
FIG. 2 is a network architecture
FIG. 3 shows an initial frame segmentation image
The specific implementation mode is as follows:
The following is a detailed description of embodiments of the invention in conjunction with the accompanying drawings:
The overall flow of the long-term target tracking method based on continuous learning is shown in FIG. 1. The algorithm is divided into a model initialization part, an online tracking part and a model updating part. Model initialization part: the initial frame is processed. A superpixel segmentation method is first used to obtain an initial-frame segmentation image containing only the foreground; the initial-frame original image and the segmentation image are then input to extract convolutional-layer features separately; the two sets of features are fused by adding them; classification scores are obtained through the fully connected layers and the classification layer and the classification loss is computed; the optimal initialization model is then obtained by back-propagating the gradient of the loss. Online tracking part: for each subsequent frame, candidate samples are first generated from the predicted position of the target in the previous frame; each candidate sample is input into the network to compute its classification score; the candidate sample with the highest classification score is selected; finally, target box regression is used to locate the target region and obtain the tracking result. Model updating part: during tracking, every 10 frames, or whenever the model classifies the estimated target as background, a long-term or short-term model update is performed with the continuous learning method to adapt to the changes of the target during tracking.
The model initialization part comprises the following specific steps:
(1) Obtaining the initial-frame segmentation image: the initial frame is composed of a superpixel set {O_i}, i = 1, ..., N, where N is the number of superpixels in the image and O_i denotes the pixel value of the ith superpixel in the set. Superpixels that lie completely outside the bounding box are considered background, and the remaining superpixels are unknown (background or foreground). Each superpixel is modeled by P pixel values sampled randomly from it, m = {x_v}, v = 1, ..., P, where P is the number of randomly sampled pixel values and x_v denotes the vth pixel value in superpixel model m. This can be seen as an empirical histogram of the color distribution of the superpixel. For any known (background) superpixel model m_b, if the similarity score S(m_a, m_b) passes the threshold η (η = 0.5), the superpixel corresponding to the unknown superpixel model m_a is labeled as background. The similarity score S(m_a, m_b) is computed from the per-pixel scores score(x_k, m_b), where x_k is the kth pixel value of the unknown superpixel model m_a. The score score(x_k, m_b) is defined with respect to the pixel values x_j of the known superpixel model m_b, x_j being its jth pixel value, and the parameter R, set to 0.5, which controls the radius of the sphere centered at each model pixel and allows for slight errors. FIG. 3 shows the segmentation results.
(2) Generation of the model initialization training sample library: 500 positive samples are randomly sampled around the initial target position in the first-frame original image and in the segmented image, respectively, and 5000 negative samples are extracted from the first-frame original image only. Each sample is scored by its intersection-over-union (IoU) with the ground-truth bounding box; samples with a score in [0.7, 1] are marked as positive and samples with a score in [0, 0.5] as negative.
(3) Model initialization training and model acquisition: the loss of the classification score finally output by the network is computed with the binary cross-entropy loss function. Here X_n and Y_n denote the training samples and training sample labels of the initial training sample library, and N_n is a batch of samples drawn from X_n; the loss compares the label of the ith sample in N_n with the softmax output for that sample. The optimized network parameters are then solved by stochastic gradient descent. In the initial frame of the test sequence, the fully connected layers are trained for 50 iterations; the learning rate of the fully connected layers FC4-5 is set to 0.0005 and that of the classification layer FC6 to 0.005; momentum and weight decay are set to 0.9 and 0.0005, respectively; each mini-batch consists of M^+ = 32 positive samples and 96 hard negative samples selected from M^- = 1024 negative samples.
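The mini-batch composition in step (3) can be sketched as below. The text does not say how the hard negatives are chosen; this sketch assumes the common hard-negative-mining rule of keeping the negatives to which the current model assigns the highest target score. `build_minibatch` and `target_score` are hypothetical names, and `target_score` stands in for a forward pass of the network.

```python
import random


def build_minibatch(positives, negatives, target_score,
                    n_pos=32, n_neg_pool=1024, n_hard=96, rng=random):
    """Draw n_pos positives and the n_hard hardest negatives from a pool.

    target_score is a hypothetical callable giving the model's target score
    for a sample; a high score on a negative sample marks it as hard
    (assumed selection rule).
    """
    batch_pos = rng.sample(positives, min(n_pos, len(positives)))
    pool = rng.sample(negatives, min(n_neg_pool, len(negatives)))
    hard = sorted(pool, key=target_score, reverse=True)[:n_hard]
    return batch_pos, hard
```

With the default arguments each mini-batch holds 32 positives and 96 hard negatives picked from a pool of 1024 negatives, which matches the counts given in the text.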
The online tracking part comprises the following specific steps:
(1) for each frame in the on-line tracking, 256 candidate samples are generated using a Gaussian distribution based on the estimated target position of the previous framexuRepresenting the u-th one of the candidate samples. The mean of the Gaussian distribution is r and the covariance is the diagonal matrix diag (0.09 r)2,0.09r20.25) where r is the average of the width and height of the target position estimated from the previous frame.
(2) The output of the network function is a two-dimensional vector giving the target score and the background score of the input candidate sample. The candidate sample with the highest classification score is selected as the estimated target position:
$$x^* = \arg\max_{x^u} f^+(x^u) \quad (4)$$
where $u$ is the candidate sample index, $f^+(\cdot)$ denotes the current network function, and $x^*$ is the candidate sample with the highest classification score computed by the network, i.e., the estimated target position.
(3) Finally, bounding-box regression is applied to the estimated target position to localize the target region. The bounding-box regression uses ridge regression, with the ridge parameter α set to 1000.
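The selection in step (2) can be sketched as a plain argmax over the target scores. The function name, the `(target, background)` score tuples, and the rule that tracking counts as successful when the target score exceeds the background score are assumptions for illustration; the ridge regression refinement of step (3) is not shown.

```python
def select_target(candidates, scores):
    """Return (best_candidate, best_target_score, tracking_success).

    scores[u] is the (target, background) score pair for candidates[u].
    """
    best_u = max(range(len(candidates)), key=lambda u: scores[u][0])
    target_score, background_score = scores[best_u]
    # Assumed success test: the best candidate is classified as foreground.
    success = target_score > background_score
    return candidates[best_u], target_score, success
```

The returned success flag is what a caller would use to decide whether to collect new samples or to trigger a short-term update.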
The model updating part comprises the following specific steps:
(1) Update mode selection: two model update modes are used, long-term update and short-term update. During tracking, a long-term update is performed every f = 10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
(2) Generation and updating of the model update sample library: the model update sample library comprises two parts, an online sample set and a memory-aware sample set. Here $f_l$ and $f_s$ denote the number of frames set for long-term and short-term sample collection, respectively. The online sample set consists of an online positive sample set and an online negative sample set, and the memory-aware sample set consists of a memory-aware positive sample set and a memory-aware negative sample set. The first entries of the online positive and negative sample sets are the positive and negative samples generated by random sampling around the initial-frame target position. For each frame in online tracking, when the model classifies the estimated target position as foreground, tracking is considered successful; 50 positive samples and 200 negative samples are then collected by random sampling around the estimated target position and added to the online positive and negative sample sets, respectively. For the online positive sample set, once more than 100 frames have been tracked successfully, the positive samples collected in the earliest frame are deleted and added to the memory-aware positive sample set; that is, the online positive sample set keeps only the samples of the latest 100 successfully tracked frames. For the online negative sample set, once more than 30 frames have been tracked successfully, the negative samples collected in the earliest frame are deleted and added to the memory-aware negative sample set; that is, the online negative sample set keeps only the samples of the latest 30 successfully tracked frames.
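The sliding-window behaviour of the online sample sets can be sketched with a per-frame queue. This is an illustrative sketch only: the class and function names are hypothetical, and the memory-aware sets are represented here as plain lists whose own maintenance is not shown.

```python
from collections import deque


class OnlineSampleSet:
    """Keeps per-frame samples for the latest max_frames successful frames."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        self.frames = deque()          # each entry holds the samples of one frame

    def add_frame(self, samples):
        """Add one frame's samples; return the evicted frame's samples (or [])."""
        self.frames.append(list(samples))
        if len(self.frames) > self.max_frames:
            return self.frames.popleft()
        return []


def on_tracking_success(online_pos, online_neg, memory_pos, memory_neg, pos, neg):
    """Store new samples and hand evicted ones to the memory-aware sets."""
    memory_pos.extend(online_pos.add_frame(pos))
    memory_neg.extend(online_neg.add_frame(neg))
```

With the values in the text, the positive set would be created as `OnlineSampleSet(100)` and the negative set as `OnlineSampleSet(30)`.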
For the memory-aware positive sample set, once the collected samples exceed the long-term collection setting of 100 frames, the samples are clustered into 10 classes with the K-means clustering algorithm, where $\tau$ denotes the cluster index. The clustering operates on feature vectors computed by the feature function:
$$\phi(x) = W \otimes x + b \quad (6)$$
where $W$ and $b$ denote the network weights and biases before the fully connected layer FC5, $x$ denotes the input sample, and $\otimes$ denotes the convolution operation. When new memory-aware samples arrive, the Euclidean distances between the feature mean vector of the new samples and the 10 cluster centers are computed:
$$d_\tau(\mu_{new} - \mu_\tau) = \lVert \mu_{new} - \mu_\tau \rVert, \quad \tau = 1, \ldots, 10 \quad (7)$$
where $\mu_{new}$ is the feature mean vector of the new samples and $\mu_\tau$ is the feature mean vector of the $\tau$-th of the 10 clusters. The cluster label of the new samples is determined by the nearest mean vector:
$$\tau^* = \arg\min_{\tau} d_\tau(\mu_{new} - \mu_\tau) \quad (8)$$
The new samples are assigned to the corresponding cluster, and the same number of the earliest samples in that cluster are deleted at the same time, so that the total number of samples in the memory-aware positive sample set is unchanged. For the memory-aware negative sample set, once samples from more than 30 frames have been collected, the samples collected in the earliest frame are deleted; that is, the memory-aware negative sample set keeps only the samples of the latest 30 frames.
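The nearest-cluster assignment built on formula (7) can be sketched as below. The function names and the representation of clusters as lists ordered from oldest to newest are assumptions; the sketch also leaves the cluster centers unchanged after insertion, since the text does not say whether they are recomputed.

```python
import math


def mean_vector(features):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def nearest_cluster(mu_new, centers):
    """Index of the center with the smallest Euclidean distance to mu_new."""
    return min(range(len(centers)), key=lambda t: math.dist(mu_new, centers[t]))


def insert_into_cluster(clusters, centers, new_features):
    """Append new samples to the nearest cluster and drop as many old ones."""
    tau = nearest_cluster(mean_vector(new_features), centers)
    cluster = clusters[tau]
    cluster.extend(new_features)
    del cluster[:len(new_features)]    # oldest samples sit at the front
    return tau
```

Removing as many samples as were added keeps the size of each cluster, and therefore of the whole memory-aware positive set, constant.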
(3) Training and model acquisition in continual learning mode: when a long-term or short-term update is triggered and the memory-aware sample set contains no samples, the model is trained on the online sample set collected in step (2), and the classification loss of the classification score finally output by the network is computed with the binary cross-entropy loss function of formula (3). The parameters of the fully connected layers are then updated by gradient descent according to the current classification loss:
$$\theta_n \leftarrow \theta_n - \eta \frac{\partial\, l(\theta_n)}{\partial \theta_n} \quad (9)$$
where $\theta_n$ denotes the network parameters, $\eta$ is the learning rate, and $l(\cdot)$ is the loss function. The fully connected layers are trained for 15 iterations. When the memory-aware sample set contains samples, the model is first warm-up trained on the online sample set collected in step (2): the classification loss is computed with the binary cross-entropy loss function of formula (3), the parameters of the fully connected layers are updated with the gradient descent formula (9), and the fully connected layers are trained for 10 iterations. After warm-up training, the model is trained by joint optimization on the online sample set collected in step (2) and the memory-aware sample set: the classification loss $L_C$ of the online sample set is computed with the binary cross-entropy loss function of formula (3), and the distillation loss $L_D$ of the memory-aware sample set is computed with the knowledge distillation loss function:
$$L_D = -\frac{1}{|N_m|}\sum_{x_i \in N_m}\left[\tilde{y}_i \log \hat{y}_i + (1 - \tilde{y}_i)\log\left(1 - \hat{y}_i\right)\right] \quad (10)$$
where $X_m$ and $Y_m$ denote the training samples and sample labels of the memory-aware sample set; unlike formula (3), the labels $Y_m$ are soft labels output by the old network. $N_m$ is a batch of samples drawn from $X_m$, $\tilde{y}_i$ is the label of the $i$-th sample in $N_m$, and $\hat{y}_i$ is the softmax output for the $i$-th sample. The overall loss function is:
$$L_{sum} = L_C + \lambda \cdot L_D \quad (11)$$
where the parameter λ is set to 0.7. After the total loss is computed, the parameters of the fully connected layers are updated with the gradient descent formula (9), and the fully connected layers are trained for 15 iterations. In every training stage, the learning rate of the fully connected layers FC4-5 is set to 0.001 and the learning rate of the classification layer FC6 is set to 0.01; momentum and weight decay are set to 0.9 and 0.0005, respectively; and each mini-batch consists of 32 positive samples and 96 hard negative samples selected from 1024 negative samples.
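The staged update and the combined loss of formula (11) can be sketched as follows. This is a sketch under stated assumptions: the distillation term is written as a cross-entropy against the old model's soft outputs, the probabilities are scalar target probabilities, and all function and stage names are hypothetical.

```python
import math


def cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft (in [0, 1])."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def total_loss(online_probs, online_labels, memory_probs, old_model_probs, lam=0.7):
    """L_sum = L_C + lam * L_D, with soft labels from the old model for L_D."""
    loss_c = cross_entropy(online_probs, online_labels)
    loss_d = cross_entropy(memory_probs, old_model_probs)
    return loss_c + lam * loss_d


def training_stages(memory_has_samples):
    """Stages as (name, iterations), following the iteration counts in the text."""
    if not memory_has_samples:
        return [("online_only", 15)]
    return [("warm_up", 10), ("joint", 15)]
```

The distillation term pulls the updated model's outputs on remembered samples toward those of the old model, which is how the joint stage limits forgetting while the classification term adapts to recent frames.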
Claims (5)
1. A long-term visual target tracking method based on continuous learning is characterized in that: the method comprises four parts of network model design, model initialization, online tracking and model updating;
designing a network model: a deep neural network structure designed for long-term visual target tracking;
model initialization: comprises 3 steps: acquiring an initial frame segmentation image; generating a model initialization training sample library; model initialization training and model acquisition; the model initialization training and model obtaining stage comprises selection of a loss function and a gradient descent method;
online tracking: comprises 3 steps: generating a candidate sample; obtaining the best candidate sample; using the target frame to regress and locate the target area;
model updating: comprises 3 steps: selecting an update mode; generating and updating the model update sample library; and model training and model acquisition in continual learning mode; wherein generating the sample library comprises acquiring the online sample set and the memory-aware sample set; updating the sample library comprises updating the online sample set and the memory-aware sample set; and the continual-learning model training and model acquisition stage comprises selecting a loss function and a gradient descent method.
2. The method of claim 1, wherein the network model is designed by the following specific steps:
the deep neural network structure designed for long-term visual target tracking is as follows: the network consists of shared layers and a classification layer; the shared layers comprise 3 convolutional layers, 2 max-pooling layers, 2 fully connected layers and 5 nonlinear ReLU activation layers; the convolutional layers are the same as the corresponding part of the general VGG-M network; the following two fully connected layers each have 512 output units and are combined with ReLU and Dropout modules; the classification layer is a binary classification layer comprising a Dropout module and a softmax loss, and is responsible for distinguishing the target from the background;
in the image processing of the convolutional neural network (CNN), the convolutional layers are connected through convolution filters, and a convolution filter is defined as N×C×W×H, where N denotes the number of filter types, C denotes the number of channels being filtered, and W and H denote the width and height of the filter window, respectively.
3. The method of claim 1, wherein the model is initialized by the following specific steps:
(1) obtaining the initial-frame segmented image: the quality of the initial-frame template has an important influence on the current tracking result; to enrich the detailed representation of the tracked target, superpixel-level segmentation is used so that the segmented image is consistent with the target in color and texture and retains the structural information of the target;
(2) generating the training sample library: $N_1$ samples are randomly sampled around the initial target position in each of the first-frame original image and the segmented image; the samples are labeled as positive samples and negative samples according to their intersection-over-union scores with the ground-truth bounding box;
(3) model initialization training and model acquisition: in the initial frame of the tracking sequence, the loss of the classification score finally output by the network is computed with the binary cross-entropy loss as the loss function, and the parameters of the fully connected layers are then updated by gradient descent; the fully connected layers are trained for $H_1$ iterations, the learning rate of the fully connected layers FC4-5 is set to 0.0005, and the learning rate of the classification layer FC6 is set to 0.005; momentum and weight decay are set to 0.9 and 0.0005, respectively; finally, training stops when $H_1$ iterations, i.e., 50 or more, have been reached, giving the network initialization model.
4. The method according to claim 1, wherein the online tracking comprises the following specific steps:
(1) target candidate sample generation: for each frame in the video sequence, $N_2$ candidate samples are first drawn around the predicted position of the target in the previous frame;
(2) obtaining the best candidate sample: the $N_2$ candidate samples obtained in step (1) are fed into the current network model to compute classification scores, and the candidate sample with the highest classification score is taken as the estimated target position;
(3) bounding-box regression: after the estimated target position is obtained in step (2), the target region is localized by bounding-box regression to obtain the tracking result.
5. The method of claim 1, wherein the model is updated by the following steps:
(1) update mode selection: two complementary aspects of target tracking, robustness and adaptivity, are considered together; two model update modes are adopted, long-term update and short-term update; during tracking, a long-term update is performed every f = 8 to 10 frames, and a short-term update is performed when the model classifies the estimated target position as background;
(2) generating and updating the model update sample library: the model update sample library comprises two parts, an online sample set and a memory-aware sample set, where $f_l$ = 80 to 100 and $f_s$ = 20 to 30 denote the number of frames set for long-term and short-term sample collection, respectively; the online sample set comprises an online positive sample set and an online negative sample set, and the memory-aware sample set comprises a memory-aware positive sample set and a memory-aware negative sample set;
(3) for each frame in online tracking, when the model classifies the estimated target position as foreground, tracking is successful; positive samples and negative samples are collected by random sampling around the estimated target position and added to the online positive sample set and the online negative sample set, respectively, where t denotes the t-th frame of the online tracking video sequence; for the online positive sample set, when tracking has succeeded for more than $f_l$ frames, the positive samples collected in the earliest frame are deleted and added to the memory-aware positive sample set, i.e., the online positive sample set keeps only the samples of the latest $f_l$ successfully tracked frames; for the online negative sample set, when tracking has succeeded for more than $f_s$ frames, the negative samples collected in the earliest frame are deleted and added to the memory-aware negative sample set, i.e., the online negative sample set keeps only the samples of the latest $f_s$ successfully tracked frames; for the memory-aware positive sample set, when it has collected more than $f_l$ frames, the samples are clustered into $N_C$ classes with the K-means clustering algorithm; when new samples arrive, the Euclidean distances between the feature mean vector of the new samples and the $N_C$ cluster centers are computed, the new samples are added to the class with the smallest Euclidean distance, and the same number of the earliest samples in that class are deleted at the same time, so that the total number of samples in the memory-aware positive sample set is unchanged; for the memory-aware negative sample set, when it has collected more than $f_s$ frames, the samples collected in the earliest frame are deleted, i.e., the memory-aware negative sample set keeps only the samples of the latest $f_s$ frames;
(4) model training and model acquisition in continual learning mode: the continual-learning model training comprises two stages, warm-up training and joint optimization training;
when a long-term or short-term update is performed and the memory-aware sample set has not collected any samples, the model is trained on the online sample set collected in step (2), and the classification loss of the classification score finally output by the network is computed with the binary cross-entropy loss function; the parameters of the fully connected layers are then updated by gradient descent according to the current classification loss, and the fully connected layers are trained for $H_2$ = 15 iterations; when the memory-aware sample set contains samples, the model is first warm-up trained on the online sample set collected in step (2), the classification loss is computed with the binary cross-entropy loss function, the parameters of the fully connected layers are updated by gradient descent, and the fully connected layers are trained for $H_3$ = 10 iterations; after warm-up training, the model is trained by joint optimization on the online sample set collected in step (2) and the memory-aware sample set, the classification loss of the online sample set is computed with the binary cross-entropy loss function, the knowledge distillation loss of the memory-aware sample set is computed with the knowledge distillation loss function, and the total loss is the classification loss plus 0.7 times the knowledge distillation loss; after the total loss is computed, the parameters of the fully connected layers are updated by gradient descent, and the fully connected layers are trained for $H_4$ = 15 iterations; in every training stage, the learning rate of the fully connected layers FC4-5 is set to 0.001, the learning rate of the classification layer FC6 is set to 0.01, and momentum and weight decay are set to 0.9 and 0.0005, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910956780.XA CN110728694B (en) | 2019-10-10 | 2019-10-10 | Long-time visual target tracking method based on continuous learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910956780.XA CN110728694B (en) | 2019-10-10 | 2019-10-10 | Long-time visual target tracking method based on continuous learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110728694A (en) | 2020-01-24 |
CN110728694B (en) | 2023-11-24 |
Family
ID=69219832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910956780.XA Active CN110728694B (en) | 2019-10-10 | 2019-10-10 | Long-time visual target tracking method based on continuous learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110728694B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111539989A (en) * | 2020-04-20 | 2020-08-14 | 北京交通大学 | Computer vision single-target tracking method based on optimization variance reduction |
CN112037269A (en) * | 2020-08-24 | 2020-12-04 | 大连理工大学 | Visual moving target tracking method based on multi-domain collaborative feature expression |
CN112330719A (en) * | 2020-12-02 | 2021-02-05 | 东北大学 | Deep learning target tracking method based on feature map segmentation and adaptive fusion |
CN112698933A (en) * | 2021-03-24 | 2021-04-23 | 中国科学院自动化研究所 | Method and device for continuous learning in multitask data stream |
CN112767450A (en) * | 2021-01-25 | 2021-05-07 | 开放智能机器(上海)有限公司 | Multi-loss learning-based related filtering target tracking method and system |
CN113343280A (en) * | 2021-07-07 | 2021-09-03 | 时代云英(深圳)科技有限公司 | Joint learning-based private cloud algorithm model generation method |
CN113837296A (en) * | 2021-09-28 | 2021-12-24 | 安徽大学 | RGBT visual tracking method and system based on two-stage fusion structure search |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709936A (en) * | 2016-12-14 | 2017-05-24 | 北京工业大学 | Single target tracking method based on convolution neural network |
CN108062764A (en) * | 2017-11-30 | 2018-05-22 | 极翼机器人(上海)有限公司 | A kind of object tracking methods of view-based access control model |
CN110211157A (en) * | 2019-06-04 | 2019-09-06 | 重庆邮电大学 | A kind of target long time-tracking method based on correlation filtering |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
Non-Patent Citations (2)
Title |
---|
ABHINAV MOUDGIL 等: "Long-Term Visual Object Tracking Benchmark", 《SPRINGER NATURE SWITZERLAND AG 2019》 * |
刘威;赵文杰;李成;: "时空上下文学习长时目标跟踪", 光学学报 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111539989A (en) * | 2020-04-20 | 2020-08-14 | 北京交通大学 | Computer vision single-target tracking method based on optimization variance reduction |
CN111539989B (en) * | 2020-04-20 | 2023-09-22 | 北京交通大学 | Computer vision single target tracking method based on optimized variance reduction |
CN112037269A (en) * | 2020-08-24 | 2020-12-04 | 大连理工大学 | Visual moving target tracking method based on multi-domain collaborative feature expression |
CN112330719A (en) * | 2020-12-02 | 2021-02-05 | 东北大学 | Deep learning target tracking method based on feature map segmentation and adaptive fusion |
CN112330719B (en) * | 2020-12-02 | 2024-02-27 | 东北大学 | Deep learning target tracking method based on feature map segmentation and self-adaptive fusion |
CN112767450A (en) * | 2021-01-25 | 2021-05-07 | 开放智能机器(上海)有限公司 | Multi-loss learning-based related filtering target tracking method and system |
CN112698933A (en) * | 2021-03-24 | 2021-04-23 | 中国科学院自动化研究所 | Method and device for continuous learning in multitask data stream |
CN113343280A (en) * | 2021-07-07 | 2021-09-03 | 时代云英(深圳)科技有限公司 | Joint learning-based private cloud algorithm model generation method |
CN113343280B (en) * | 2021-07-07 | 2024-08-23 | 时代云英(深圳)科技有限公司 | Private cloud algorithm model generation method based on joint learning |
CN113837296A (en) * | 2021-09-28 | 2021-12-24 | 安徽大学 | RGBT visual tracking method and system based on two-stage fusion structure search |
CN113837296B (en) * | 2021-09-28 | 2024-05-31 | 安徽大学 | RGBT visual tracking method and system based on two-stage fusion structure search |
Also Published As
Publication number | Publication date |
---|---|
CN110728694B (en) | 2023-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN110728694B (en) | Long-time visual target tracking method based on continuous learning | |
CN108960140B (en) | Pedestrian re-identification method based on multi-region feature extraction and fusion | |
CN110070074B (en) | Method for constructing pedestrian detection model | |
CN111709311B (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
CN109858406B (en) | Key frame extraction method based on joint point information | |
CN114972418B (en) | Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection | |
CN108629288B (en) | Gesture recognition model training method, gesture recognition method and system | |
CN110781262B (en) | Semantic map construction method based on visual SLAM | |
CN110738247B (en) | Fine-grained image classification method based on selective sparse sampling | |
CN111079847B (en) | Remote sensing image automatic labeling method based on deep learning | |
CN105654139B (en) | A kind of real-time online multi-object tracking method using time dynamic apparent model | |
CN115995063A (en) | Work vehicle detection and tracking method and system | |
CN108596203B (en) | Optimization method of parallel pooling layer for pantograph carbon slide plate surface abrasion detection model | |
CN110175615B (en) | Model training method, domain-adaptive visual position identification method and device | |
CN111161315B (en) | Multi-target tracking method and system based on graph neural network | |
CN111476817A (en) | Multi-target pedestrian detection tracking method based on yolov3 | |
CN109871875B (en) | Building change detection method based on deep learning | |
CN112560656A (en) | Pedestrian multi-target tracking method combining attention machine system and end-to-end training | |
CN110033473A (en) | Motion target tracking method based on template matching and depth sorting network | |
CN107169117B (en) | Hand-drawn human motion retrieval method based on automatic encoder and DTW | |
CN109033978B (en) | Error correction strategy-based CNN-SVM hybrid model gesture recognition method | |
CN112489081B (en) | Visual target tracking method and device | |
CN112085765B (en) | Video target tracking method combining particle filtering and metric learning | |
CN111462173B (en) | Visual tracking method based on twin network discrimination feature learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||