CN110728694A - Long-term visual target tracking method based on continuous learning - Google Patents

Long-term visual target tracking method based on continuous learning

Info

Publication number
CN110728694A
CN110728694A
Authority
CN
China
Prior art keywords
model
sample
tracking
sample set
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910956780.XA
Other languages
Chinese (zh)
Other versions
CN110728694B (en)
Inventor
张辉
朱牧
张菁
卓力
齐天卉
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910956780.XA priority Critical patent/CN110728694B/en
Publication of CN110728694A publication Critical patent/CN110728694A/en
Application granted granted Critical
Publication of CN110728694B publication Critical patent/CN110728694B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/20 — Analysis of motion
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 — Special algorithmic details
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/20084 — Artificial neural networks [ANN]


Abstract

The invention relates to a long-term visual target tracking method based on continuous learning. A deep neural network structure is designed for long-term visual target tracking; an initialized network model is obtained through model initialization, online tracking is then performed with this model, and long-term or short-term model updates are carried out during tracking with a continuous-learning method so as to adapt to the various changes the target undergoes. The method recasts the online model update of conventional visual target tracking as a continuous-learning process and builds a complete appearance description of the target from all historical data of the video, effectively improving the robustness of long-term visual tracking. It offers an effective long-term visual target tracking solution for applications such as intelligent video surveillance, human-computer interaction, and visual navigation.

Description

Long-term visual target tracking method based on continuous learning
Technical Field
The invention belongs to the field of computer vision and image/video processing, and particularly relates to a long-term visual target tracking method based on continuous learning.
Background
Visual target tracking is a basic problem in computer vision and image/video processing, with wide application in automatic surveillance video analysis, human-computer interaction, visual navigation and other fields. Tracking methods can be roughly divided into two categories by the length of the video sequence: short-term target tracking and long-term target tracking. Generally, tracking on a video sequence longer than 1000 frames is called long-term target tracking. Current short-term tracking algorithms perform well on relatively short video data, but when applied directly to real long video sequences, their accuracy and robustness cannot meet the requirements of practical scenarios.
In the long-term tracking task, besides the challenges common to short-term scenarios, such as target scale changes, illumination changes and target deformation, the problem of robustly re-locking targets that frequently 'reappear after disappearing' must also be solved. Long-term tracking is therefore more challenging than conventional short-term tracking and better matches the actual requirements of many application scenarios. However, tracking techniques for such long-duration data are relatively scarce, and the performance of existing methods is very limited. One existing idea for long-term tracking is to combine traditional tracking with traditional target detection to handle target deformation, partial occlusion and similar problems, while continuously updating the 'salient feature points' of the tracking module and the target model and related parameters of the detection module through an online learning mechanism, making tracking more robust and reliable. There are also methods that use keypoint-matching tracking and robust estimation techniques, integrate long-term memory, and provide additional information for output control. Such methods can search for the target over the whole image frame, but their performance is unsatisfactory because they rely only on simple hand-crafted features. More recently, tracking methods based on correlation filtering and deep learning have been proposed; although some include re-detection schemes for long-term tracking, they search only in a local region of the image, so a target that leaves the field of view cannot be recaptured, which fails the requirements of long-term tracking tasks.
From the current state of the art, visual target tracking methods built on deep convolutional image classification networks have great potential for effectively distinguishing the target from a cluttered background, and tracking methods based on this framework have broad development prospects. However, tracking models trained only offline are often hard to adapt to online changes in the video, while simply updating the model frequently with new data accelerates tracking drift and makes failure likely on long-term tracking problems. The invention balances the model's historical memory against its online updates through a continuous-learning method, and provides a long-term visual target tracking method based on continuous learning.
Disclosure of Invention
The method uses continuous-learning theory to recast the online model update of visual target tracking as a continuous-learning process, learns an effective abstraction and characterization of the time-series images over the whole video sequence, and builds a complete portrait of the target. The method thereby adapts to target deformation, background interference, occlusion, illumination changes and similar conditions during tracking, improves the adaptability and reliability of existing tracking methods under online updating, reduces the model's sensitivity to noise such as target deformation and occlusion, and achieves stable long-term tracking of the target.
The invention is realized by the following technical means: a long-term visual target tracking method based on continuous learning, comprising four parts: network model design, model initialization, online tracking and model updating.
Designing the network model: first, a deep neural network structure is designed according to the overall flow shown in FIG. 1; the feature map sizes of the network stages are then set adaptively.
Model initialization: comprises 3 main steps: acquiring the initial frame segmentation image; generating the model initialization training sample library; model initialization training and model acquisition. The training and model acquisition stage involves the choice of a loss function and a gradient-descent method.
Online tracking: comprises 3 main steps: generating candidate samples; obtaining the best candidate sample; localizing the target region with target box regression.
Model updating: comprises 3 main steps: selecting the update mode; generating and updating the model-update sample library; continuous-learning model training and model acquisition. Sample library generation comprises the acquisition of an online sample set and a memory-aware sample set; sample library updating comprises updating both the online and memory-aware sample sets; the continuous-learning training and model acquisition stage involves the choice of a loss function and a gradient-descent method.
The network model design comprises the following specific steps:
(1) The deep neural network structure designed by the invention: as shown in FIG. 2, the network consists of shared layers and a classification layer. The shared layers comprise 3 convolutional layers, 2 max-pooling layers, 2 fully connected layers and 5 nonlinear ReLU activation layers. The convolutional layers are identical to the corresponding parts of the generic VGG-M network. The two fully connected layers that follow each have 512 output units and incorporate ReLU and Dropout modules. The classification layer is a binary classification layer with a Dropout module and a softmax loss, responsible for distinguishing the target from the background.
In CNN image processing, convolutional layers are connected through convolution filters defined as N×C×W×H, where N denotes the number of filter types, C the number of channels being filtered, and W, H the width and height of the filtering range, respectively.
(2) In the continuous-learning long-term target tracking process, the input and output feature maps of each convolutional layer change as follows:
during tracking, images of different sizes are first resized to a uniform 3 × 107 × 107 input. In the first convolutional layer, the input passes through 96 convolution kernels of size 7 × 7, giving 96 output channels, then through a nonlinear ReLU activation layer and a local response normalization layer, and finally through a max-pooling layer, yielding a feature map of size 96 × 25 × 25. The second convolutional layer takes this 96 × 25 × 25 feature map, passes it through 256 convolution kernels of size 5 × 5, then through a ReLU activation layer and a local response normalization layer giving 256 channels, and finally through a max-pooling layer, yielding a 256 × 5 × 5 feature map. The third convolutional layer takes the 256 × 5 × 5 feature map through 512 convolution kernels of size 3 × 3 followed by a ReLU activation layer, yielding a 512 × 3 × 3 feature map. The fourth layer, fully connected, takes the 512 × 3 × 3 feature map through 512 neural units followed by a ReLU activation layer, yielding a 512-dimensional feature vector. The fifth layer, also fully connected, takes the 512-dimensional vector through 512 neural units, then a Dropout layer, and finally a ReLU activation layer, again yielding a 512-dimensional feature vector. In the classification layer, the 512-dimensional vector passes through a Dropout layer and is fed to the binary classification layer with softmax loss, which outputs a 2-dimensional classification score.
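For concreteness, the layer stack described above can be assembled as in the following PyTorch sketch (a minimal sketch: the patent fixes the kernel counts, kernel sizes and feature-map sizes, while the strides, pooling windows and LRN settings here are assumptions chosen so the stated sizes work out):

```python
import torch.nn as nn

class TrackingNet(nn.Module):
    """Shared layers (conv1-3, fc4-5) plus binary classification layer (fc6)."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # conv1: 96 kernels of 7x7
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),        # -> 96 x 25 x 25
            nn.Conv2d(96, 256, kernel_size=5, stride=2),  # conv2: 256 kernels of 5x5
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),        # -> 256 x 5 x 5
            nn.Conv2d(256, 512, kernel_size=3, stride=1), # conv3: 512 kernels of 3x3
            nn.ReLU(inplace=True),                        # -> 512 x 3 x 3
        )
        self.fc = nn.Sequential(
            nn.Linear(512 * 3 * 3, 512),                  # fc4: 512 units
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, 512),                          # fc5: 512 units
            nn.Dropout(0.5),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(512, 2)               # fc6: target vs. background

    def forward(self, x):                                 # x: (B, 3, 107, 107)
        feat = self.shared(x).flatten(1)
        return self.classifier(self.fc(feat))             # (B, 2) classification scores
```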
The model initialization comprises the following specific steps:
(1) Obtaining the initial frame segmentation image: the quality of the initial frame template has a significant impact on subsequent tracking results. To enrich the detailed representation of the tracked target, superpixel-level segmentation is applied with the Simple Linear Iterative Clustering (SLIC) method, so that the segmented image agrees with the target in color and texture while retaining the target's structural information, as shown in FIG. 3.
(2) Generation of the training sample library: N1 samples are randomly drawn around the initial target position in the first-frame original image and the segmented image, respectively. Each sample is labeled positive or negative according to its intersection-over-union (IoU) with the ground-truth box: positive for IoU between 0.7 and 1.0, negative for IoU between 0 and 0.5 (see the sketch following step (3)).
(3) Model initialization training and model acquisition: in the initial frame of the tracking sequence, the binary cross-entropy loss is used as the loss function on the network's final classification scores, and the parameters of the fully connected layers are updated by gradient descent. The fully connected layers are trained for H1 (50) iterations, with the learning rate set to 0.0005 for the fully connected FC4-5 layers and 0.005 for the classification FC6 layer; momentum and weight decay are set to 0.9 and 0.0005, respectively. Each mini-batch consists of M+ (32) positive samples and (96) hard negative samples selected from M- (1024) negative samples. Training stops after H1 (50) iterations, yielding the network initialization model.
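The sampling-and-labeling of step (2) can be sketched as follows (a minimal sketch; the Gaussian perturbation used to draw boxes around the target and its scales are assumptions, as the text fixes only the sample counts and the IoU thresholds):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def draw_labeled_samples(gt_box, n_samples, rng=np.random.default_rng(0)):
    """Perturb the ground-truth box and label each draw by its IoU."""
    pos, neg = [], []
    x, y, w, h = gt_box
    while len(pos) + len(neg) < n_samples:
        s = (x + rng.normal(0, 0.3) * w, y + rng.normal(0, 0.3) * h,
             w * np.exp(rng.normal(0, 0.5)), h * np.exp(rng.normal(0, 0.5)))
        o = iou(s, gt_box)
        if o >= 0.7:            # positive: IoU in [0.7, 1.0]
            pos.append(s)
        elif o <= 0.5:          # negative: IoU in [0, 0.5]
            neg.append(s)      # draws with IoU in (0.5, 0.7) are discarded
    return pos, neg
```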
The online tracking comprises the following specific steps:
(1) Target candidate sample generation: for each frame of the video sequence, N2 candidate samples are first drawn around the target position predicted in the previous frame.
(2) Obtaining the best candidate sample: the N2 candidate samples from step (1) are fed into the current network model to compute classification scores, and the candidate with the highest score is taken as the estimated target position.
(3) Target box regression: after the estimated target position is obtained in step (2), the target region is localized with a target box regression method to produce the tracking result.
The model updating comprises the following specific steps:
(1) Selection of the update mode: two complementary aspects of target tracking, robustness and adaptivity, are considered jointly, and two model update modes are adopted: long-term updates and short-term updates. During tracking, a long-term update is performed every f (8-10) frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
(2) Generation and updating of the model-update sample library: the sample library consists of two parts, an online sample set S_o and a memory-aware sample set S_m, where f_l (80-100) and f_s (20-30) denote the set number of frames over which samples are collected long-term and short-term, respectively. S_o+ and S_o- denote the online positive and negative sample sets within the online sample set, and S_m+ and S_m- the memory-aware positive and negative sample sets within the memory-aware sample set. Initially, the online positive and negative sets hold the (500) positive and (5000) negative samples generated by random sampling at the initial-frame target location. For each frame during online tracking, when the model classifies the estimated target position as foreground, tracking is successful: (50) positive and (200) negative samples are randomly drawn around the estimated target position and added to the S_o+ and S_o- sample sets, indexed by t, the t-th frame of the online-tracked video sequence. For the online positive set S_o+: when tracking has succeeded for more than f_l (80-100) frames, the positive samples collected in the earliest frame are deleted and appended to the memory-aware positive set S_m+; i.e., the online positive set keeps only the samples of the most recent f_l (80-100) successfully tracked frames. For the online negative set S_o-: when tracking has succeeded for more than f_s (20-30) frames, the negative samples collected in the earliest frame are deleted and appended to the memory-aware negative set S_m-; i.e., the online negative set keeps only the samples of the most recent f_s (20-30) successfully tracked frames. For the memory-aware positive set S_m+: when it has collected more than f_l (80-100) frames of samples, the samples are clustered into N_C (10-15) classes with the K-means algorithm; when new samples arrive, the Euclidean distance between the feature mean vector of the new samples and each of the N_C cluster centers is computed, the new samples are added to the class with the smallest distance, and the oldest samples of that class, equal in number to the new samples, are deleted, so that the total number of samples in S_m+ stays constant before and after. For the memory-aware negative set S_m-: when it has collected more than f_s (20-30) frames of samples, the samples collected in the earliest frame are deleted; i.e., the memory-aware negative set keeps only the most recent f_s (20-30) frames of samples.
(3) Continuous-learning model training and model acquisition: continuous-learning training consists of two stages, warm-up training and joint optimization training. Warm-up training lets the model learn to adapt to the current target changes; joint optimization training lets the model remember historical target changes, so that a complete description of the target is built up over long-term tracking, and a target that reappears after leaving the field of view can be quickly recovered from the model's historical memory, enabling stable long-term tracking. At a long-term or short-term update, if the memory-aware sample set has not yet collected any samples, the model is trained with the online sample set S_o collected in step (2), computing the classification loss of the network's final classification scores with the binary cross-entropy loss; the parameters of the fully connected layers are then updated by gradient descent, training the fully connected layers for H2 (15) iterations. If the memory-aware sample set does contain samples, the model is first warm-up trained on the online sample set S_o collected in step (2), with the classification loss again computed by binary cross-entropy and the fully connected layers updated by gradient descent for H3 (10) iterations. After warm-up training, the model is jointly optimized on the online sample set S_o and the memory-aware sample set S_m collected in step (2): the classification loss of the online set is computed with the binary cross-entropy loss, the knowledge-distillation loss of the memory-aware set is computed with a knowledge-distillation loss function, and the total loss is the classification loss plus λ times the knowledge-distillation loss. After the total loss is computed, the fully connected layers are updated by gradient descent for H4 (15) iterations. In every training stage, the learning rate is set to 0.001 for the fully connected FC4-5 layers and 0.01 for the classification FC6 layer, momentum and weight decay are set to 0.9 and 0.0005, respectively, and each mini-batch consists of M+ (32) positive samples and (96) hard negative samples selected from M- (1024) negative samples.
The invention has the following features:
The invention provides a long-term visual target tracking method based on continuous learning. The method recasts the online update of a conventional visual target tracking model as a continuous-learning process and, by combining a dynamically built online sample set with a memory-aware sample set, learns the target's occlusion, shape, scale and illumination changes over the long-term time dimension, so that the time-series data of the whole video sequence are effectively abstracted and represented and a complete portrait of the target is built. After the target has been occluded or out of view for a long time, the method can quickly recover a target reappearing in the field of view from the continuously learned historical model. Compared with existing visual target tracking techniques, the method balances the model's historical memory against its online updates through continuous learning, overcomes the 'catastrophic forgetting' caused in the prior art by frequent updates on new data, builds a complete portrait description of the target from all historical data of the video, obtains a target model insensitive to noise, improves the robustness of visual tracking, and achieves the goal of long-term tracking. It offers an effective long-term visual target tracking solution for applications such as intelligent video surveillance, human-computer interaction, and visual navigation.
Description of the drawings:
FIG. 1 is an overall flow chart
FIG. 2 is a network architecture
FIG. 3 shows an initial frame segmentation image
The specific embodiments are as follows:
the following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings:
A long-term target tracking method based on continuous learning; the overall flow is shown in FIG. 1. The algorithm is divided into a model initialization part, an online tracking part and a model updating part. Model initialization part: the initial frame is processed by first obtaining a foreground-only segmentation image with a superpixel segmentation method; the initial-frame original image and the segmentation image are then input to extract convolutional features separately, the two sets of features are fused by addition, classification scores are obtained through the fully connected and classification layers, the classification loss is computed, and the optimal initialization model is solved by back-propagating the loss gradients. Online tracking part: for each subsequent frame, candidate samples are first generated from the target position predicted in the previous frame, each candidate is fed into the network to compute its classification score, the candidate with the highest score is selected, and finally target box regression localizes the target region to give the tracking result. Model updating part: during tracking, every 10 frames, or whenever the model classifies the estimated target as background, a long-term or short-term model update is performed with the continuous-learning method to adapt to the various changes of the target.
The model initialization part comprises the following specific steps:
(1) Obtaining the initial frame segmentation image: the initial frame is composed of a superpixel set O = {O_i}, i = 1, …, N, where N is the number of superpixels in the image and O_i denotes the i-th superpixel in the set. Superpixels lying completely outside the bounding box are regarded as background; the remaining superpixels are unknown (background or foreground). Each superpixel is modeled by P randomly sampled pixel values x_v as m = {x_v}, v = 1, …, P, where P is the number of randomly sampled pixels and x_v is the pixel value of the v-th sample in the superpixel model m. This can be seen as an empirical histogram of the superpixel's color distribution. An unknown superpixel model m_a is labeled as background if there is a known background superpixel model m_b whose similarity score satisfies S(m_a, m_b) > η, with η = 0.5, where:

S(m_a, m_b) = (1/P) Σ_{k=1..P} score(x_k, m_b)   (1)

where x_k is the pixel value of the k-th sample of the unknown superpixel model m_a, and score(x_k, m_b) is defined as:

score(x_k, m_b) = 1 if min_j ||x_k − x_j|| ≤ R, and 0 otherwise   (2)

where x_j is the pixel value of the j-th sample of the known superpixel model m_b. The parameter R is set to 0.5; it controls the radius of the sphere centered at each model pixel, allowing for slight errors. FIG. 3 shows the segmentation results.
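A small sketch of this background-labeling test follows (a minimal sketch under the reconstruction of equations (1)-(2) above; the mean-of-scores form of S(m_a, m_b) is an assumption, and pixel values are assumed normalized to [0, 1] so that R = 0.5 is meaningful):

```python
import numpy as np

def score(x_k, m_b, R=0.5):
    """Eq. (2): 1 if any sampled pixel of model m_b lies within radius R of x_k."""
    d = np.linalg.norm(m_b - x_k, axis=1)       # m_b: (P, 3) sampled pixel values
    return float(d.min() <= R)

def similarity(m_a, m_b, R=0.5):
    """Eq. (1): mean per-pixel score of unknown model m_a against known model m_b."""
    return np.mean([score(x_k, m_b, R) for x_k in m_a])

def label_as_background(m_a, background_models, eta=0.5):
    """Unknown superpixel is background if it matches any known background model."""
    return any(similarity(m_a, m_b) > eta for m_b in background_models)
```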
(2) Generation of the model initialization training sample library: 500 positive samples are randomly sampled around the initial target position in the first-frame original image and the segmented image, respectively, and 5000 negative samples are drawn from the first-frame original image only. Samples are scored by their intersection-over-union with the ground-truth box; samples with scores in [0.7, 1] are labeled positive and samples with scores in [0, 0.5] negative.
(3) Model initialization training and model acquisition: the binary cross-entropy loss is used as the loss function on the network's final classification scores:

L_C = −(1/|N_n|) Σ_{i∈N_n} y_i · log p_i   (3)

where X_n/Y_n denote the training samples and labels of the initial training sample library, N_n is a batch drawn from X_n, y_i is the (one-hot) label of the i-th sample in N_n, and p_i is the softmax output of the network for the i-th sample x_i. The optimized network parameters are then solved by stochastic gradient descent: in the initial frame of the test sequence, the fully connected layers are trained for 50 iterations, with the learning rate set to 0.0005 for the fully connected FC4-5 layers and 0.005 for the classification FC6 layer; momentum and weight decay are set to 0.9 and 0.0005, respectively; each mini-batch consists of M+ = 32 positive samples and 96 hard negative samples selected from M- = 1024 negative samples.
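The hard-negative selection inside each mini-batch can be sketched as follows (a minimal sketch; scoring all 1024 negatives with the current model and keeping the top 96 follows common practice for this kind of training and is an assumption where the text does not pin the procedure down):

```python
import torch

def mine_hard_negatives(model, neg_batch, k=96):
    """Pick the k negatives the current model most confuses with the target.

    neg_batch: (1024, 3, 107, 107) tensor of negative sample patches.
    Returns the k patches with the highest target-class score.
    """
    model.eval()
    with torch.no_grad():
        scores = model(neg_batch)[:, 1]    # target-class score of each negative
    hard_idx = scores.topk(k).indices      # hardest negatives score highest
    return neg_batch[hard_idx]
```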
The online tracking part comprises the following specific steps:
(1) For each frame during online tracking, 256 candidate samples {x_u}, u = 1, …, 256, are generated with a Gaussian distribution from the target position estimated in the previous frame, x_u denoting the u-th candidate. The Gaussian is centered on the previous target state, with covariance given by the diagonal matrix diag(0.09r^2, 0.09r^2, 0.25), where r is the mean of the width and height of the target position estimated in the previous frame.
(2) The network output is a two-dimensional vector giving the target and background scores of the input candidate sample. The candidate with the highest classification score is selected as the estimated target position:

x* = argmax_u f+(x_u)   (4)

where u is the candidate index, f+(·) denotes the target score under the current network, and x* is the candidate with the highest score computed by the network, i.e., the estimated target position.
(3) Finally, target box regression is applied to the obtained target position to localize the target region. Ridge regression is used, with the ridge parameter α set to 1000.
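A condensed sketch of one tracking step under these settings is given below (candidate states are (x, y, scale) perturbations; the patch cropping and the scale-step base are assumptions, and `model` is the network of FIG. 2):

```python
import numpy as np
import torch

def generate_candidates(prev_box, n=256, rng=np.random.default_rng(0)):
    """Sample (x, y, scale) candidate states around the previous target box."""
    x, y, w, h = prev_box
    r = (w + h) / 2.0
    cov = np.diag([0.09 * r**2, 0.09 * r**2, 0.25])        # diag(0.09r^2, 0.09r^2, 0.25)
    dx, dy, ds = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    scale = 1.05 ** ds                                     # scale-step base: assumption
    return [(x + dx[i], y + dy[i], w * scale[i], h * scale[i]) for i in range(n)]

def select_best(model, patches):
    """Score candidate patches (n, 3, 107, 107) and return the best index, eq. (4)."""
    model.eval()
    with torch.no_grad():
        scores = model(patches)[:, 1]                      # f+(x_u) per candidate
    return int(scores.argmax())                            # x* = argmax_u f+(x_u)
```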
The model updating part comprises the following specific steps:
(1) Selection of the update mode: two model update modes are adopted, long-term and short-term. During tracking, a long-term update is performed every f = 10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
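This trigger logic reduces to a few lines (a minimal sketch; the caller runs the continuous-learning update of step (3) whenever it returns True):

```python
def should_update(frame_idx, target_classified_as_background, f=10):
    """Long-term update every f frames; short-term update on tracking failure."""
    long_term = (frame_idx % f == 0)           # periodic long-term update
    short_term = target_classified_as_background
    return long_term or short_term
```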
(2) Generation and updating of the model-update sample library: the sample library consists of the online sample set S_o and the memory-aware sample set S_m, the subscripts f_l and f_s denoting the long-term and short-term sample-collection frame settings, respectively. S_o+ and S_o- denote the online positive and negative sample sets within the online sample set, and S_m+ and S_m- the memory-aware positive and negative sample sets within the memory-aware sample set. Initially, the online positive and negative sets hold the 500 positive and 5000 negative samples generated by random sampling at the initial-frame target location. For each frame during online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, 50 positive and 200 negative samples are randomly drawn around the estimated target position and added to the S_o+ and S_o- sample sets. For the online positive set S_o+: when tracking has succeeded for more than 100 frames, the positive samples collected in the earliest frame are deleted and appended to the memory-aware positive set S_m+; in other words, the online positive set keeps only the samples of the 100 most recent successfully tracked frames. For the online negative set S_o-: when tracking has succeeded for more than 30 frames, the negative samples collected in the earliest frame are deleted and appended to the memory-aware negative set S_m-; in other words, the online negative set keeps only the samples of the 30 most recent successfully tracked frames.
For the memory-aware positive set S_m+: when the collected samples exceed the long-term collection setting of 100 frames, they are clustered into 10 classes with the K-means algorithm:

{C_1, …, C_10} = K-means({φ(x) : x ∈ S_m+})   (5)

where τ = 1, …, 10 indexes the clusters, C_τ denotes the τ-th cluster, and φ(·) is the feature vector calculation function:

φ(x) = W ⊛ x + b   (6)

where W and b denote the weights and biases of the network layers up to the fully connected FC5 layer (so φ(x) is the FC5-level feature of sample x), x denotes the input sample, and ⊛ denotes the convolution operation. When a new memory-aware sample x_new arrives, the Euclidean distance between the feature mean vector of the new samples and each of the 10 cluster centers is computed:

d_τ(μ_new, μ_τ) = ||μ_new − μ_τ||, τ = 1, …, 10   (7)

where μ_new denotes the feature mean vector of the new samples and μ_τ the feature mean vector of the τ-th of the 10 clusters. The cluster label of the new samples is determined by the nearest mean vector:

τ* = argmin_τ d_τ(μ_new, μ_τ)   (8)

and the new samples are drawn into the corresponding cluster:

C_τ* ← C_τ* ∪ {x_new}

while the oldest samples of that class, equal in number to the new samples, are deleted at the same time, so that the total number of samples in the memory-aware positive set S_m+ stays constant before and after. For the memory-aware negative set S_m-: when more than 30 frames of samples have been collected, the oldest collected samples are deleted; i.e., the memory-aware negative set keeps only the most recent 30 frames of samples.
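The bookkeeping for the memory-aware positive set can be sketched as follows (a minimal sketch; `features` stands for the FC5-level features φ(x) of equation (6), and the per-cluster FIFO used to find the 'oldest' samples is an assumption about bookkeeping the text leaves open):

```python
import numpy as np
from collections import deque
from sklearn.cluster import KMeans

class MemoryAwarePositiveSet:
    """K-means clustered memory: new samples join the nearest cluster,
    displacing that cluster's oldest members so the total size stays constant."""

    def __init__(self, features, n_clusters=10):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
        self.centers = km.cluster_centers_                 # mu_tau in eq. (7)
        self.clusters = [deque() for _ in range(n_clusters)]
        for f, lbl in zip(features, km.labels_):
            self.clusters[lbl].append(f)                   # oldest samples sit at the left

    def add(self, new_features):
        mu_new = new_features.mean(axis=0)                 # feature mean of new samples
        d = np.linalg.norm(self.centers - mu_new, axis=1)  # eq. (7)
        tau = int(d.argmin())                              # eq. (8): nearest cluster
        for f in new_features:
            self.clusters[tau].append(f)                   # insert new sample
            self.clusters[tau].popleft()                   # evict oldest; size unchanged
```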
(3) Continuous-learning training and model acquisition: at a long-term or short-term update, if the memory-aware sample set holds no samples, the model is trained with the online sample set S_o collected in step (2), computing the classification loss of the network's final classification scores with the binary cross-entropy loss of equation (3). The parameters of the fully connected layers are then updated by gradient descent:

θ_{n+1} = θ_n − η ∇_{θ_n} l(θ_n)   (9)

where θ_n denotes the network parameters, η the learning rate, and l(·) the loss function; the fully connected layers are trained for 15 iterations. If the memory-aware sample set does contain samples, the model is first warm-up trained on the online sample set S_o, with the classification loss computed by the binary cross-entropy of equation (3) and the fully connected layers updated by the gradient descent of equation (9) for 10 iterations. After warm-up training, the model is jointly optimized on the online sample set S_o and the memory-aware sample set S_m: the classification loss L_C of the online set is computed with the binary cross-entropy of equation (3), and the distillation loss L_D of the memory-aware set with the knowledge-distillation loss function:

L_D = −(1/|N_m|) Σ_{i∈N_m} ŷ_i · log p_i   (10)

where X_m/Y_m denote the memory-aware training samples and sample labels; unlike equation (3), ŷ_i is the soft label output by the old network for the i-th sample of the batch N_m drawn from X_m, and p_i is the corresponding softmax output of the current network. The overall loss function is then:

L_sum = L_C + λ · L_D   (11)

where the parameter λ is set to 0.7. After the total loss is computed, the fully connected layers are updated by the gradient descent of equation (9) for 15 iterations. In every training stage, the learning rate is set to 0.001 for the fully connected FC4-5 layers and 0.01 for the classification FC6 layer, momentum and weight decay are set to 0.9 and 0.0005, respectively, and each mini-batch consists of 32 positive samples and 96 hard negative samples selected from 1024 negative samples.
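One joint-optimization step under equations (3), (10) and (11) might look like this (a minimal PyTorch sketch; `old_model` is a frozen copy of the network from before the update, and mini-batch assembly with hard-negative mining is elided):

```python
import torch
import torch.nn.functional as F

def joint_optimization_step(model, old_model, optimizer,
                            online_x, online_y, memory_x, lam=0.7):
    """One update: cross-entropy on online samples + distillation on memory samples."""
    # Classification loss L_C on the online sample set, eq. (3)
    logits = model(online_x)
    loss_c = F.cross_entropy(logits, online_y)

    # Knowledge-distillation loss L_D on the memory-aware set, eq. (10):
    # soft labels come from the old (pre-update) network
    with torch.no_grad():
        soft_labels = F.softmax(old_model(memory_x), dim=1)
    log_p = F.log_softmax(model(memory_x), dim=1)
    loss_d = -(soft_labels * log_p).sum(dim=1).mean()

    # Total loss, eq. (11): L_sum = L_C + lambda * L_D, with lambda = 0.7
    loss = loss_c + lam * loss_d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```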

Claims (5)

1. A long-term visual target tracking method based on continuous learning, characterized in that: the method comprises four parts: network model design, model initialization, online tracking and model updating;
designing the network model: a deep neural network structure designed for long-term visual target tracking;
model initialization: comprises 3 steps: acquiring the initial frame segmentation image; generating the model initialization training sample library; model initialization training and model acquisition; the model initialization training and model acquisition stage comprises the selection of a loss function and a gradient-descent method;
online tracking: comprises 3 steps: generating candidate samples; obtaining the best candidate sample; localizing the target region by target box regression;
model updating: comprises 3 steps: selecting the update mode; generating and updating the model-update sample library; continuous-learning model training and model acquisition; sample library generation comprises the acquisition of an online sample set and a memory-aware sample set; sample library updating comprises updating of the online sample set and the memory-aware sample set; the continuous-learning model training and model acquisition stage comprises the selection of a loss function and a gradient-descent method.
2. The method of claim 1, wherein the network model is designed by the following specific steps:
the deep neural network structure designed for long-term visual target tracking is as follows: the network consists of shared layers and a classification layer; the shared layers comprise 3 convolutional layers, 2 max-pooling layers, 2 fully connected layers and 5 nonlinear ReLU activation layers; the convolutional layers are identical to the corresponding parts of the generic VGG-M network; the two fully connected layers that follow each have 512 output units and incorporate ReLU and Dropout modules; the classification layer is a binary classification layer with a Dropout module and a softmax loss, responsible for distinguishing the target from the background;
in CNN image processing, convolutional layers are connected through convolution filters defined as N×C×W×H, where N denotes the number of filter types, C the number of channels being filtered, and W, H the width and height of the filtering range, respectively.
3. The method of claim 1, wherein the model is initialized by the following specific steps:
(1) obtaining the initial frame segmentation image: the quality of the initial frame template has a significant impact on subsequent tracking results; to enrich the detailed representation of the tracked target, superpixel-level segmentation is applied so that the segmented image agrees with the target in color and texture while retaining the target's structural information;
(2) generation of the training sample library: N1 samples are randomly drawn around the initial target position in the first-frame original image and the segmented image, respectively; the samples are labeled positive or negative according to their intersection-over-union score with the ground-truth box;
(3) model initialization training and model acquisition: in the initial frame of the tracking sequence, the binary cross-entropy loss is used as the loss function on the network's final classification scores, and the parameters of the fully connected layers are updated by gradient descent; the fully connected layers are trained for H1 iterations, with the learning rate set to 0.0005 for the fully connected FC4-5 layers and 0.005 for the classification FC6 layer; momentum and weight decay are set to 0.9 and 0.0005, respectively; finally, training stops when H1, namely 50, iterations are reached, yielding the network initialization model.
4. The method according to claim 1, wherein the online tracking comprises the following specific steps:
(1) target candidate sample generation: for each frame of the video sequence, N2 candidate samples are first drawn around the target position predicted in the previous frame;
(2) obtaining the best candidate sample: the N2 candidate samples from step (1) are fed into the current network model to compute classification scores, and the candidate with the highest score is taken as the estimated target position;
(3) target box regression: after the estimated target position is obtained in step (2), the target region is localized with a target box regression method to produce the tracking result.
5. The method of claim 1, wherein the model is updated by the following steps:
(1) selection of the update mode: two complementary aspects of target tracking, robustness and adaptivity, are considered jointly; two model update modes, long-term and short-term, are adopted; during tracking, a long-term update is performed every f = 8-10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background;
(2) generation and updating of the model-update sample library: the sample library consists of two parts, an online sample set S_o and a memory-aware sample set S_m, where f_l = 80-100 and f_s = 20-30 denote the long-term and short-term sample-collection frame settings, respectively; S_o+ and S_o- denote the online positive and negative sample sets within the online sample set, and S_m+ and S_m- the memory-aware positive and negative sample sets within the memory-aware sample set;
(3) for each frame during online tracking, when the model classifies the estimated target position as foreground, tracking is successful; random samples are drawn around the estimated target position, and a set number of positive samples and a set number of negative samples are collected and added to the S_o+ and S_o- sample sets, where t denotes the t-th frame of the online-tracked video sequence; for the online positive set S_o+: when tracking has succeeded for more than f_l frames, the positive samples collected in the earliest frame are deleted and appended to the memory-aware positive set S_m+, i.e. the online positive set keeps only the samples of the most recent f_l successfully tracked frames; for the online negative set S_o-: when tracking has succeeded for more than f_s frames, the negative samples collected in the earliest frame are deleted and appended to the memory-aware negative set S_m-, i.e. the online negative set keeps only the samples of the most recent f_s successfully tracked frames; for the memory-aware positive set S_m+: when more than f_l frames of samples have been collected, the samples are clustered into N_C classes with the K-means algorithm; when new samples arrive, the Euclidean distance between the feature mean vector of the new samples and each of the N_C cluster centers is computed, the new samples are added to the class with the smallest distance, and the oldest samples of that class, equal in number to the new samples, are deleted, so that the total number of samples in S_m+ stays constant before and after; for the memory-aware negative set S_m-: when more than f_s frames of samples have been collected, the samples collected in the earliest frame are deleted, i.e. the memory-aware negative set keeps only the most recent f_s frames of samples;
(4) continuous-learning model training and model acquisition: continuous-learning training consists of two stages, warm-up training and joint optimization training;
at a long-term or short-term update, if the memory-aware sample set has collected no samples, the model is trained with the online sample set S_o collected in step (2), computing the classification loss of the network's final classification scores with a binary cross-entropy loss function; the parameters of the fully connected layers are then updated by gradient descent, training the fully connected layers for H2 = 15 iterations; if the memory-aware sample set contains samples, the model is first warm-up trained on the online sample set S_o collected in step (2), computing its classification loss with the binary cross-entropy loss function and updating the fully connected layers by gradient descent for H3 = 10 iterations; after warm-up training, the model is jointly optimized on the online sample set S_o and the memory-aware sample set S_m collected in step (2), computing the classification loss of the online set with the binary cross-entropy loss function and the knowledge-distillation loss of the memory-aware set with a knowledge-distillation loss function, the total loss being the classification loss plus 0.7 times the knowledge-distillation loss; after the total loss is computed, the fully connected layers are updated by gradient descent for H4 = 15 iterations; in every training stage, the learning rate is set to 0.001 for the fully connected FC4-5 layers and 0.01 for the classification FC6 layer, and momentum and weight decay are set to 0.9 and 0.0005, respectively.
CN201910956780.XA 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning Active CN110728694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910956780.XA CN110728694B (en) 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning


Publications (2)

Publication Number Publication Date
CN110728694A true CN110728694A (en) 2020-01-24
CN110728694B CN110728694B (en) 2023-11-24

Family

ID=69219832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910956780.XA Active CN110728694B (en) 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning

Country Status (1)

Country Link
CN (1) CN110728694B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709936A (en) * 2016-12-14 2017-05-24 北京工业大学 Single target tracking method based on convolution neural network
CN108062764A (en) * 2017-11-30 2018-05-22 极翼机器人(上海)有限公司 A kind of object tracking methods of view-based access control model
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110211157A (en) * 2019-06-04 2019-09-06 重庆邮电大学 A kind of target long time-tracking method based on correlation filtering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABHINAV MOUDGIL et al.: "Long-Term Visual Object Tracking Benchmark", Springer Nature Switzerland AG, 2019 *
LIU Wei; ZHAO Wenjie; LI Cheng: "Long-term target tracking via spatio-temporal context learning", Acta Optica Sinica *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539989A (en) * 2020-04-20 2020-08-14 北京交通大学 Computer vision single-target tracking method based on optimization variance reduction
CN111539989B (en) * 2020-04-20 2023-09-22 北京交通大学 Computer vision single target tracking method based on optimized variance reduction
CN112037269A (en) * 2020-08-24 2020-12-04 大连理工大学 Visual moving target tracking method based on multi-domain collaborative feature expression
CN112330719A (en) * 2020-12-02 2021-02-05 东北大学 Deep learning target tracking method based on feature map segmentation and adaptive fusion
CN112330719B (en) * 2020-12-02 2024-02-27 东北大学 Deep learning target tracking method based on feature map segmentation and self-adaptive fusion
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN112698933A (en) * 2021-03-24 2021-04-23 中国科学院自动化研究所 Method and device for continuous learning in multitask data stream
CN113837296A (en) * 2021-09-28 2021-12-24 安徽大学 RGBT visual tracking method and system based on two-stage fusion structure search
CN113837296B (en) * 2021-09-28 2024-05-31 安徽大学 RGBT visual tracking method and system based on two-stage fusion structure search

Also Published As

Publication number Publication date
CN110728694B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN110728694B (en) Long-time visual target tracking method based on continuous learning
CN110070074B (en) Method for constructing pedestrian detection model
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN114972418B (en) Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection
CN110738247B (en) Fine-grained image classification method based on selective sparse sampling
CN110781262B (en) Semantic map construction method based on visual SLAM
CN108629288B (en) Gesture recognition model training method, gesture recognition method and system
CN108596203B (en) Optimization method of parallel pooling layer for pantograph carbon slide plate surface abrasion detection model
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN111476817A (en) Multi-target pedestrian detection tracking method based on yolov3
CN112560656A (en) Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN109871875B (en) Building change detection method based on deep learning
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN115995063A (en) Work vehicle detection and tracking method and system
CN111079847B (en) Remote sensing image automatic labeling method based on deep learning
CN105654139A (en) Real-time online multi-target tracking method adopting temporal dynamic appearance model
CN111160407A (en) Deep learning target detection method and system
CN112489081A (en) Visual target tracking method and device
CN110490915B (en) Point cloud registration method based on convolution-limited Boltzmann machine
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN111462173B (en) Visual tracking method based on twin network discrimination feature learning
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
CN104778699A (en) Adaptive object feature tracking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant