CN110728694B - Long-time visual target tracking method based on continuous learning - Google Patents

Long-time visual target tracking method based on continuous learning

Info

Publication number
CN110728694B
CN110728694B
Authority
CN
China
Prior art keywords
model
sample set
tracking
sample
training
Prior art date
Legal status
Active
Application number
CN201910956780.XA
Other languages
Chinese (zh)
Other versions
CN110728694A (en)
Inventor
张辉
朱牧
张菁
卓力
齐天卉
张磊
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201910956780.XA
Publication of CN110728694A
Application granted
Publication of CN110728694B

Links

Classifications

    • G06T7/20: Image data processing; Image analysis; Analysis of motion
    • G06N3/045: Computing arrangements based on biological models; Neural networks; Combinations of networks
    • G06N3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T2207/20081: Indexing scheme for image analysis; Training; Learning
    • G06T2207/20084: Indexing scheme for image analysis; Artificial neural networks [ANN]


Abstract

The invention relates to a long-term visual target tracking method based on continuous learning. A deep neural network structure is designed for long-term visual target tracking; an initialized network model is obtained through model initialization, and this model is then used for online tracking. During tracking, long-term or short-term model updates are carried out with a continuous learning method, so that the tracker adapts to the various changes the target undergoes. The invention converts the online update of a conventional visual tracking model into a continuous learning process and builds a complete appearance description of the target from the entire history of the video, effectively improving the robustness of long-term visual tracking. The method offers an effective long-term visual target tracking solution for applications such as intelligent video surveillance, human-computer interaction, and visual navigation.

Description

Long-time visual target tracking method based on continuous learning
Technical Field
The invention belongs to the field of computer vision and image/video processing, and particularly relates to a long-term visual target tracking method based on continuous learning.
Background
Visual target tracking is a fundamental problem in computer vision and image/video processing, with wide application in automatic surveillance-video analysis, human-computer interaction, visual navigation, and related fields. Tracking methods can be broadly divided into two categories according to the length of the video sequence: short-term target tracking and long-term target tracking. Generally, tracking is considered long-term when the tracked sequence exceeds 1000 frames. Current short-term tracking algorithms perform well on relatively short videos, but when applied directly to real long video sequences, their accuracy and robustness still fail to meet the requirements of practical scenarios.
In long-term tracking tasks, besides the challenges common to short-term scenarios such as target scale change, illumination change, and target deformation, the tracker must robustly re-lock targets that frequently disappear and reappear. Long-term tracking is therefore more challenging than conventional short-term tracking and better matches the practical demands of many application scenarios. However, tracking techniques for such long sequences are relatively scarce, and the performance of existing methods is very limited. One existing long-term tracking idea combines conventional tracking with conventional target detection to handle deformation, partial occlusion, and similar problems; an online learning mechanism continuously updates the salient feature points of the tracking module and the target model and related parameters of the detection module, making tracking more robust and reliable. Other methods use keypoint-matching tracking and robust estimation techniques so that long-term memory can be integrated and can provide additional information for output control. These methods can search the whole frame for the target, but their reliance on simple hand-crafted features leaves performance unsatisfactory. Recently, tracking methods based on correlation filtering and deep learning have been proposed; although some include re-detection schemes for long-term tracking, they search only a local region of the image, so a target that leaves the field of view cannot be recaptured, and they are not adequate for long-term tracking.
Given the current state of the art, visual target tracking methods built on deep convolutional neural network image classification have great potential for distinguishing targets from cluttered backgrounds, and trackers based on this framework have broad development prospects. However, tracking models trained only offline are usually unable to adapt to online changes in a video, and simply updating the model frequently with new data accelerates tracking drift, so such models fail on long-term tracking problems. The invention balances the historical memory and the online update of the model through continuous learning, and provides a long-term visual target tracking method based on continuous learning.
Disclosure of Invention
The invention uses continuous learning theory to convert the online model update of a visual target tracking method into a continuous learning process, learns an effective abstraction and representation of the image sequence over the whole video, and establishes a complete portrait of the target. The method thus adapts to target deformation, background interference, occlusion, illumination change, and similar conditions during tracking; it improves the adaptability and reliability of existing tracking methods under online updating, reduces the model's sensitivity to noise such as target deformation and occlusion, and achieves stable long-term tracking of the target.
The invention is realized by the following technical means: a long-term visual target tracking method based on continuous learning mainly comprises four parts: network model design, model initialization, online tracking, and model updating.
Designing a network model: firstly, a deep neural network structure is designed according to the overall flow shown in FIG. 1; the feature maps at each network stage are then scaled to an adaptive size.
Model initialization: comprises 3 steps: obtaining an initial frame segmentation image; generating a model initialization training sample library; model initialization training and model acquisition. The model initialization training and model acquisition stage includes the choice of loss function and gradient descent method.
Online tracking: comprises 3 steps: generating candidate samples; obtaining the best candidate sample; locating the target region using target-box regression.
Model updating: comprises 3 steps: selecting an update mode; generating and updating the model update sample library; model training and model acquisition in the continuous learning mode. Sample library generation includes the acquisition of an online sample set and a memory-aware sample set; sample library updating includes updating the online sample set and the memory-aware sample set; the model training and model acquisition stage of the continuous learning mode includes the choice of loss function and gradient descent method.
The network model design comprises the following specific steps:
(1) The deep neural network structure designed by the invention: as shown in FIG. 2, the network structure consists of a shared layer and a classification layer. The shared layer comprises 3 convolutional layers, 2 max pooling layers, 2 fully connected layers, and 5 nonlinear ReLU activation layers. The convolutional layers are identical to the corresponding part of the generic VGG-M network. The two subsequent fully connected layers each have 512 output units and incorporate ReLU and Dropout modules. The classification layer is a binary classification layer containing a Dropout module with a softmax loss, responsible for distinguishing target from background.
In the image processing of the convolutional neural network (CNN), the convolutional layers are connected through convolutional filters defined as N×C×W×H, where N is the number of filter types, C is the number of filtered channels, and W and H are the width and height of the filter window, respectively.
(2) In the continuous-learning long-term target tracking process, the input and output feature maps of each convolutional layer change as follows:
During tracking, images of different sizes are first unified to 107×107×3 and input to the network. The first convolutional layer applies 96 convolution kernels of size 7×7; after the nonlinear ReLU activation layer and a local response normalization layer the output has 96 channels, and a max pooling layer finally yields a 96×25×25 feature map. The second convolutional layer takes the 96×25×25 feature map, applies 256 convolution kernels of size 5×5 followed by the ReLU activation and local response normalization layers (256 output channels), and a max pooling layer yields a 256×5×5 feature map. The third convolutional layer takes the 256×5×5 feature map, applies 512 convolution kernels of size 3×3, and the ReLU activation yields a 512×3×3 feature map. The fourth (fully connected) layer takes the 512×3×3 feature map and produces a 512-dimensional feature vector through 512 units followed by a ReLU activation. The fifth (fully connected) layer passes the 512-dimensional vector through 512 units, then a Dropout layer, and finally a ReLU activation, producing a 512-dimensional feature vector. The classification layer passes the 512-dimensional feature vector through a Dropout layer into the binary classification layer with softmax loss, and finally outputs a 2-dimensional classification score.
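For illustration, the layer configuration described above can be written as the following minimal PyTorch sketch. The kernel strides and pooling parameters are assumptions carried over from the standard VGG-M configuration (they reproduce the stated feature-map sizes), and the class and variable names are choices of this sketch rather than part of the patent:

```python
import torch
import torch.nn as nn

class TrackerNet(nn.Module):
    """Shared layers (conv1-3, fc4-5) plus a binary classification layer (fc6)."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # 107x107x3 -> 51x51x96
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),        # -> 25x25x96
            nn.Conv2d(96, 256, kernel_size=5, stride=2),  # -> 11x11x256
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),        # -> 5x5x256
            nn.Conv2d(256, 512, kernel_size=3, stride=1), # -> 3x3x512
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(512 * 3 * 3, 512),                  # fc4: 512 units + ReLU
            nn.ReLU(inplace=True),
            nn.Linear(512, 512),                          # fc5: 512 units + Dropout + ReLU
            nn.Dropout(0.5),
            nn.ReLU(inplace=True),
        )
        self.fc6 = nn.Sequential(nn.Dropout(0.5), nn.Linear(512, 2))  # target/background scores

    def forward(self, x):
        return self.fc6(self.shared(x))

scores = TrackerNet()(torch.randn(1, 3, 107, 107))  # -> shape (1, 2)
```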
The model initialization comprises the following specific steps:
(1) Initial frame segmentation image acquisition: the quality of the initial frame template has an important impact on the tracking result. To enrich the detailed representation of the tracked object, superpixel-level segmentation is applied with the Simple Linear Iterative Clustering (SLIC) method, so that the segmented image is consistent with the object not only in color and texture but also retains the object's structural information, as shown in FIG. 3.
(2) Generating a training sample library: $N_1$ samples are randomly drawn around the initial target positions of the first-frame original image and the segmentation image, respectively. These samples are labeled positive (IoU between 0.7 and 1.0) or negative (IoU between 0 and 0.5) according to their intersection-over-union scores with the ground-truth box.
(3) Model initialization training and model acquisition: in the initial frame of the tracking sequence, the binary cross-entropy loss is adopted as the loss function on the network's final classification scores, and the parameters of the network's fully connected layers are then updated by gradient descent. The fully connected layers are trained for $H_1$ (50) iterations, with the learning rate of the fully connected layers FC4-5 set to 0.0005 and that of the classification layer FC6 set to 0.005; momentum and weight decay are set to 0.9 and 0.0005, respectively. Each mini-batch consists of $M^+$ (32) positive samples and $M_h^-$ (96) hard negative samples selected from $M^-$ (1024) negative samples. Training stops after $H_1$ (50) iterations, yielding the network initialization model.
The specific steps of the online tracking are as follows:
(1) Generating target candidate samples: for each frame of the video sequence, $N_2$ candidate samples are first drawn around the target position predicted in the previous frame.
(2) Obtaining the best candidate sample: the $N_2$ candidate samples obtained in step (1) are fed into the current network model to compute classification scores, and the candidate with the highest score is taken as the estimated target position.
(3) Target-box regression: after the estimated target position is obtained, the target region is located with a target-box regression method to produce the tracking result.
The model updating comprises the following specific steps:
(1) Update mode selection: comprehensively considering two complementary aspects of target tracking, robustness and adaptivity, long-term and short-term update modes are adopted. During tracking, a long-term update is performed every f (8-10) frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
(2) Generating and updating a model update sample library: the model update sample library comprises two parts, an online sample set $\mathcal{A}_o = \mathcal{A}_o^+ \cup \mathcal{A}_o^-$ and a memory-aware sample set $\mathcal{A}_m = \mathcal{A}_m^+ \cup \mathcal{A}_m^-$, where $f_l$ (80-100) and $f_s$ (20-30) denote the numbers of frames over which the long-term and short-term sample sets are collected, respectively. $\mathcal{A}_o^+$ and $\mathcal{A}_o^-$ denote the online positive and negative sample sets, and $\mathcal{A}_m^+$ and $\mathcal{A}_m^-$ the memory-aware positive and negative sample sets. Initially, $\mathcal{A}_o^+$ and $\mathcal{A}_o^-$ consist of 500 positive and 5000 negative samples randomly sampled at the target position of the initial frame. For each frame of online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, 50 positive and 200 negative samples are randomly sampled around the estimated position and added to $\mathcal{A}_{o,t}^+$ and $\mathcal{A}_{o,t}^-$, where t denotes the t-th frame of the online tracking video sequence. For the online positive sample set $\mathcal{A}_o^+$: when tracking has succeeded for more than $f_l$ (80-100) frames, the positive samples collected in the earliest frame are deleted and added to the memory-aware positive sample set $\mathcal{A}_m^+$; i.e., the online positive set keeps only the samples of the most recent $f_l$ (80-100) successfully tracked frames. For the online negative sample set $\mathcal{A}_o^-$: when tracking has succeeded for more than $f_s$ (20-30) frames, the negative samples collected in the earliest frame are deleted and added to the memory-aware negative sample set $\mathcal{A}_m^-$; i.e., the online negative set keeps only the samples of the most recent $f_s$ (20-30) successfully tracked frames. For the memory-aware positive sample set $\mathcal{A}_m^+$: when more than $f_l$ (80-100) frames of samples have been collected, the samples are clustered into $N_C$ (10-15) classes with the K-means algorithm; when new samples arrive, the Euclidean distances between the feature mean vector of the new samples and the $N_C$ cluster centers are computed, the new samples are added to the class with the smallest distance, and the earliest samples in that class, equal in number to the new samples, are deleted, so that the total size of $\mathcal{A}_m^+$ is unchanged. For the memory-aware negative sample set $\mathcal{A}_m^-$: when more than $f_s$ (20-30) frames of samples have been collected, the samples collected in the earliest frame are deleted; i.e., the memory-aware negative set keeps only the latest $f_s$ (20-30) frames of samples.
(3) Model training and model acquisition in the continuous learning mode: the continuous-learning model training comprises two stages, warm-up training and joint optimization training. The goal of warm-up training is to let the model adapt to the current target changes; the goal of joint optimization training is to let the model remember historical target changes, establishing a complete characterization of the target over the long-term tracking process, so that when the tracked target reappears after leaving the field of view it can be quickly recovered from the model's historical memory, realizing long-term, stable tracking. At a long-term or short-term update, if the memory-aware sample set has not yet collected samples, the model is trained with the online sample set $\mathcal{A}_o$ collected in step (2): the classification loss of the network's final classification scores is computed with the binary cross-entropy loss function, the parameters of the network's fully connected layers are updated by gradient descent according to the current classification loss, and the fully connected layers are trained for $H_2$ (15) iterations. When the memory-aware sample set contains samples, the model is first warm-up trained with the online sample set $\mathcal{A}_o$ collected in step (2): the classification loss is computed with the binary cross-entropy loss function, the fully connected parameters are updated by gradient descent, and the fully connected layers are trained for $H_3$ (10) iterations. After warm-up training, the model undergoes joint optimization training with the online sample set $\mathcal{A}_o$ and the memory-aware sample set $\mathcal{A}_m$ collected in step (2): the classification loss of the online sample set is computed with the binary cross-entropy loss function, the knowledge distillation loss of the memory-aware sample set is computed with the knowledge distillation loss function, and the final total loss is the classification loss plus λ times the distillation loss. After the total loss is computed, the fully connected parameters are updated by gradient descent and the fully connected layers are trained for $H_4$ (15) iterations. In each training stage, the learning rate of the fully connected layers FC4-5 is 0.001 and that of the classification layer FC6 is 0.01; momentum and weight decay are 0.9 and 0.0005, respectively; and each mini-batch consists of $M^+$ (32) positive samples and $M_h^-$ (96) hard negative samples selected from $M^-$ (1024) negative samples.
The invention is characterized in that:
The invention provides a long-term visual target tracking method based on continuous learning. The method converts the online update of a conventional visual tracking model into a continuous learning process and, by combining a dynamically constructed online sample set and a memory-aware sample set, learns the target's occlusion, appearance, scale, and illumination changes over the long time dimension, effectively abstracting and representing the sequential data across the whole video and establishing a complete portrait of the target. After the target has been occluded or out of view for a long time, a target reappearing in the field of view can be quickly recovered from the continuously learned historical model. Compared with existing visual tracking techniques, the method balances the model's historical memory against its online update through continuous learning, alleviates the "catastrophic forgetting" caused by frequent updates on new data, builds a complete portrait of the target from all historical video data, obtains a target model insensitive to noise, improves tracking robustness, and achieves long-term tracking. It offers an effective long-term visual target tracking solution for applications such as intelligent video surveillance, human-computer interaction, and visual navigation.
Description of the drawings:
FIG. 1 Overall flow chart
FIG. 2 Network structure
FIG. 3 Initial frame segmentation image
The specific embodiment is as follows:
the following detailed description of embodiments of the invention refers to the accompanying drawings, which illustrate in detail:
A long-term target tracking method based on continuous learning is shown in FIG. 1; the algorithm is divided into a model initialization part, an online tracking part, and a model update part. Model initialization part: for the initial frame, a foreground-only segmentation image is first obtained with a superpixel segmentation method; the initial-frame original image and the segmentation image are then each passed through the convolutional layers to extract features, the two feature maps are merged by element-wise addition, a classification score is obtained through the fully connected layers and the classification layer, the classification loss is computed, and the optimal initialization model is solved by back-propagating the loss gradients. Online tracking part: for each subsequent frame, candidate samples are generated around the target position predicted in the previous frame, each candidate is fed to the network to compute its classification score, the candidate with the highest score is selected, and finally target-box regression is used to locate the target region and obtain the tracking result. Model update part: during tracking, a long-term or short-term model update is performed with the continuous learning method every 10 frames, or whenever the model classifies the estimated target as background, to adapt to the various changes of the target during tracking.
The model initializing part comprises the following specific steps:
(1) Initial frame segmentation image acquisition: the initial frame is composed of the superpixel set $O=\{O_i\}_{i=1}^{N}$, where N is the number of superpixels in the image and $O_i$ is the i-th superpixel. Superpixels located entirely outside the bounding box are considered background, and the remaining superpixels are unknown (background or foreground). Each superpixel is modeled from P randomly sampled pixel values as $m=\{x_v\}_{v=1}^{P}$, where $x_v$ is the v-th sampled pixel value of the superpixel model m; this can be viewed as an empirical histogram of the superpixel's color distribution. An unknown superpixel model $m_a$ is marked as background if, for any known background superpixel model $m_b$, the similarity score satisfies $S(m_a, m_b) > \eta$ with η = 0.5, where:

$$S(m_a, m_b) = \frac{1}{P}\sum_{k=1}^{P} \mathrm{score}(x_k, m_b) \qquad (1)$$

where $x_k$ is the pixel value of the k-th sample in the unknown superpixel model $m_a$, and $\mathrm{score}(x_k, m_b)$ is defined as:

$$\mathrm{score}(x_k, m_b) = \begin{cases} 1, & \min_{j}\lVert x_k - x_j \rVert \le R \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $x_j$ is the pixel value of the j-th sample in the known superpixel model $m_b$. The parameter R, set to 0.5, controls the radius of the sphere centered on each model pixel, allowing for slight errors. FIG. 3 shows the segmentation result.
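The background-labeling rule of formulas (1)-(2) can be sketched as follows. This sketch assumes pixel values are normalized so that the radius R = 0.5 is meaningful, and the function names are illustrative:

```python
import numpy as np

R, ETA = 0.5, 0.5  # sphere radius and similarity threshold from the description

def score(x_k, m_b, radius=R):
    """1 if pixel value x_k lies within `radius` of some pixel of model m_b, else 0 (formula (2))."""
    return float(np.min(np.linalg.norm(m_b - x_k, axis=1)) <= radius)

def similarity(m_a, m_b):
    """Average per-pixel score of unknown model m_a against known model m_b (formula (1))."""
    return np.mean([score(x_k, m_b) for x_k in m_a])

def label_unknown(unknown_models, background_models, eta=ETA):
    """Mark an unknown superpixel as background if it matches any known background model."""
    return [any(similarity(m_a, m_b) > eta for m_b in background_models)
            for m_a in unknown_models]
```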
(2) Model initialization training sample library generation: 500 positive samples are randomly sampled around the initial target position in both the first-frame original image and the segmentation image, and 5000 negative samples are sampled only in the first-frame original image. Samples whose intersection-over-union (IoU) scores with the ground-truth box lie in [0.7, 1] are labeled positive, and those in [0, 0.5] are labeled negative.
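The sampling and IoU-labeling step can be sketched as follows, with boxes in (x, y, w, h) format; the Gaussian perturbation used to draw sample boxes is an assumption of this sketch:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def draw_samples(gt_box, n, scale=0.1, rng=np.random.default_rng()):
    """Randomly perturb the ground-truth box to generate sample boxes."""
    x, y, w, h = gt_box
    return np.stack([x + rng.normal(0, scale * w, n),
                     y + rng.normal(0, scale * h, n),
                     w * np.exp(rng.normal(0, scale, n)),
                     h * np.exp(rng.normal(0, scale, n))], axis=1)

gt = (50, 40, 80, 60)
samples = draw_samples(gt, 5500)
positives = [b for b in samples if iou(b, gt) >= 0.7]   # IoU in [0.7, 1]
negatives = [b for b in samples if iou(b, gt) <= 0.5]   # IoU in [0, 0.5]
```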
(3) Model initialization training and model acquisition: for the classification scores finally output by the network, the binary cross-entropy loss is adopted as the loss function:

$$L_C = -\frac{1}{|N_n|}\sum_{i=1}^{|N_n|} \left[ y_i^n \log p_i^n + (1 - y_i^n)\log(1 - p_i^n) \right] \qquad (3)$$

where $X_n$/$Y_n$ are the training samples and labels of the initialization training sample library, $N_n$ is a mini-batch drawn from $X_n$, $y_i^n$ is the label of the i-th sample in $N_n$, and $p_i^n$ is the softmax output corresponding to the i-th sample $x_i^n$. The optimized network parameters are then solved by stochastic gradient descent. In the initial frame of the test sequence, the fully connected layers are trained for 50 iterations, with the learning rate of the fully connected layers FC4-5 set to 0.0005 and that of the classification layer FC6 set to 0.005; momentum and weight decay are set to 0.9 and 0.0005, respectively; each mini-batch consists of $M^+ = 32$ positive samples and $M_h^- = 96$ hard negative samples selected from $M^- = 1024$ negative samples.
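A sketch of this initialization training loop, assuming the TrackerNet sketch given earlier and sample patches already cropped to 107×107 tensors; batching and feature caching are simplified:

```python
import torch
import torch.nn.functional as F

# net: TrackerNet from the sketch above; pos, neg: tensors of shape (N, 3, 107, 107)
optimizer = torch.optim.SGD([
    {"params": net.shared[-5:].parameters(), "lr": 5e-4},  # fc4-5 learning rate 0.0005
    {"params": net.fc6.parameters(), "lr": 5e-3},          # fc6 learning rate 0.005
], lr=5e-4, momentum=0.9, weight_decay=5e-4)

for _ in range(50):                                        # H1 = 50 iterations
    batch_pos = pos[torch.randint(len(pos), (32,))]        # M+ = 32 positives
    cand_neg = neg[torch.randint(len(neg), (1024,))]       # M- = 1024 negatives
    with torch.no_grad():                                  # hard negative mining: keep
        target_scores = net(cand_neg)[:, 1]                # the 96 negatives the model
    batch_neg = cand_neg[target_scores.topk(96).indices]   # scores highest as "target"
    x = torch.cat([batch_pos, batch_neg])
    y = torch.cat([torch.ones(32, dtype=torch.long),
                   torch.zeros(96, dtype=torch.long)])
    loss = F.cross_entropy(net(x), y)                      # two-class cross entropy, formula (3)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```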
The on-line tracking part comprises the following specific steps:
(1) For each frame of online tracking, 256 candidate samples $\{x_u\}_{u=1}^{256}$ are generated from a Gaussian distribution whose mean is the target state (position and scale) estimated in the previous frame, where $x_u$ is the u-th candidate sample. The covariance is the diagonal matrix diag(0.09r², 0.09r², 0.25), where r is the mean of the width and height of the target location estimated in the previous frame.
(2) The network output is a two-dimensional vector holding the input candidate's scores for target and background, respectively. The candidate sample with the highest classification score is selected as the estimated target position:

$$x^{*} = \arg\max_{u} f^{+}(x_u) \qquad (4)$$

where u is the candidate sample index, $f^{+}(\cdot)$ is the target score under the current network, and $x^{*}$ is the candidate with the highest classification score, i.e., the estimated target position.
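A sketch of candidate generation and selection (formula (4)); `crop_and_resize` is a hypothetical helper that extracts a candidate region and resizes it to 107×107, and the scale step 1.05 is an assumption of this sketch:

```python
import numpy as np
import torch

def sample_candidates(prev_box, n=256, rng=np.random.default_rng()):
    """Draw (x, y, scale) candidates from a Gaussian centered on the previous target state."""
    x, y, w, h = prev_box
    r = (w + h) / 2.0
    dx, dy, ds = rng.multivariate_normal(
        mean=[0.0, 0.0, 0.0],
        cov=np.diag([0.09 * r**2, 0.09 * r**2, 0.25]),
        size=n).T
    scale = 1.05 ** ds                      # assumed scale step per unit of ds
    return np.stack([x + dx, y + dy, w * scale, h * scale], axis=1)

def best_candidate(net, frame, boxes):
    """Score all candidates and return the one with the highest target score (formula (4))."""
    patches = torch.stack([crop_and_resize(frame, b) for b in boxes])  # hypothetical helper
    with torch.no_grad():
        scores = net(patches)[:, 1]
    return boxes[int(scores.argmax())]
```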
(3) Finally, target-box regression is applied to the obtained target position. Ridge regression is adopted, with the regression parameter α set to 1000.
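The target-box regression can be sketched as closed-form ridge regression with α = 1000 mapping sample features to box offsets; training such a regressor on first-frame samples is an assumption carried over from related trackers, not stated in this section:

```python
import numpy as np

def fit_ridge(features, targets, alpha=1000.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^(-1) X^T Y."""
    X, Y = np.asarray(features), np.asarray(targets)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def refine_box(w, feature, box):
    """Apply predicted (dx, dy, dw, dh) offsets to the estimated box."""
    dx, dy, dw, dh = feature @ w
    x, y, bw, bh = box
    return (x + dx * bw, y + dy * bh, bw * np.exp(dw), bh * np.exp(dh))
```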
The model updating part comprises the following specific steps:
(1) Update mode selection: long-term and short-term update modes are adopted. During tracking, a long-term update is performed every f = 10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
(2) Generating and updating the model update sample library: the model update sample library comprises an online sample set $\mathcal{A}_o$ and a memory-aware sample set $\mathcal{A}_m$, where the subscripts $f_l$ and $f_s$ denote the numbers of frames over which the long-term and short-term sample sets are collected. $\mathcal{A}_o^+$ and $\mathcal{A}_o^-$ denote the online positive and negative sample sets, and $\mathcal{A}_m^+$ and $\mathcal{A}_m^-$ the memory-aware positive and negative sample sets. Initially, $\mathcal{A}_o^+$ and $\mathcal{A}_o^-$ consist of positive and negative samples randomly sampled at the target position of the initial frame. For each frame of online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, 50 positive and 200 negative samples are randomly sampled around the estimated position and added to $\mathcal{A}_{o,t}^+$ and $\mathcal{A}_{o,t}^-$. For the online positive sample set $\mathcal{A}_o^+$: when tracking has succeeded for more than 100 frames, the positive samples collected in the earliest frame are deleted and added to the memory-aware positive sample set $\mathcal{A}_m^+$; i.e., the online positive set keeps only the samples of the 100 most recent successfully tracked frames. For the online negative sample set $\mathcal{A}_o^-$: when tracking has succeeded for more than 30 frames, the negative samples collected in the earliest frame are deleted and added to the memory-aware negative sample set $\mathcal{A}_m^-$; i.e., the online negative set keeps only the samples of the 30 most recent successfully tracked frames.
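The bookkeeping for these sample sets can be sketched with per-frame deques as below (class and field names are choices of this sketch); the cluster-based management of the memory-aware positive set is sketched separately after formulas (5)-(8):

```python
from collections import deque

F_L, F_S = 100, 30   # long- and short-term collection lengths

class SampleLibrary:
    def __init__(self):
        self.online_pos = deque()            # per-frame lists of positives (<= F_L frames)
        self.online_neg = deque()            # per-frame lists of negatives (<= F_S frames)
        self.memory_pos = []                 # memory-aware positives (cluster-managed)
        self.memory_neg = deque(maxlen=F_S)  # keeps only the latest F_S spilled frames

    def add_frame(self, pos_samples, neg_samples):
        """Called on each successfully tracked frame (target classified as foreground)."""
        self.online_pos.append(pos_samples)
        self.online_neg.append(neg_samples)
        if len(self.online_pos) > F_L:                   # oldest positives spill into memory
            self.memory_pos.extend(self.online_pos.popleft())
        if len(self.online_neg) > F_S:                   # oldest negatives spill into memory
            self.memory_neg.append(self.online_neg.popleft())
```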
For the memory-aware positive sample set $\mathcal{A}_m^+$: when the collected samples exceed the long-term collection length of 100 frames, they are clustered into 10 classes using the K-means algorithm:

$$\{\mathcal{C}_\tau\}_{\tau=1}^{10} = \mathrm{KMeans}\big(\{\phi(x) \mid x \in \mathcal{A}_m^+\}\big) \qquad (5)$$

where τ is the cluster label index and $\mathcal{C}_\tau$ is the clustering result; $\phi(\cdot)$ is the feature vector computation function:

$$\phi(x) = W \circledast x + b \qquad (6)$$

where W and b are the network weights and biases before the fully connected layer FC5, x is the input sample, and ⊛ denotes the convolution operation. When there are new memory-aware samples, the Euclidean distances between the feature mean vector of the new samples and the 10 cluster centers are computed:

$$d_\tau(\mu_{new}, \mu_\tau) = \lVert \mu_{new} - \mu_\tau \rVert, \quad \tau = 1, \ldots, 10 \qquad (7)$$

where $\mu_{new}$ is the feature mean vector of the new samples and $\mu_\tau$ is the feature mean vector of the τ-th of the 10 clusters. The cluster label of the new samples is determined by the closest mean vector:

$$\tau^{*} = \arg\min_{\tau} d_\tau(\mu_{new}, \mu_\tau) \qquad (8)$$

and the new samples are grouped into the corresponding cluster $\mathcal{C}_{\tau^{*}}$. At the same time, the earliest samples in that class, equal in number to the new samples, are deleted, ensuring that the total size of the memory-aware positive sample set $\mathcal{A}_m^+$ is unchanged. For the memory-aware negative sample set $\mathcal{A}_m^-$: the earliest collected samples are deleted once more than 30 frames of samples have been gathered, i.e., the memory-aware negative set keeps only the latest 30 frames of samples.
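A sketch of the memory-aware positive-set management of formulas (5)-(8), using scikit-learn's KMeans; here `features` stands for the φ(x) feature vectors of formula (6), and the cluster bookkeeping (oldest-first lists) is an assumption of this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(features, n_clusters=10):
    """Formula (5): partition the memory-aware positive features into 10 clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return km.cluster_centers_, km.labels_

def insert_new_samples(clusters, centers, new_feats):
    """Formulas (7)-(8): assign new samples to the nearest cluster center and delete
    the same number of oldest samples from that cluster (constant total size)."""
    mu_new = new_feats.mean(axis=0)                  # feature mean vector of new samples
    d = np.linalg.norm(centers - mu_new, axis=1)     # Euclidean distances d_tau, formula (7)
    tau = int(np.argmin(d))                          # nearest cluster, formula (8)
    # drop the oldest len(new_feats) samples, then append the new ones (oldest-first order)
    clusters[tau] = clusters[tau][len(new_feats):] + list(new_feats)
    return tau
```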
(3) Training and model acquisition in the continuous learning mode: at a long-term or short-term update, if the memory-aware sample set has no samples, the model is trained with the online sample set $\mathcal{A}_o$ collected in step (2), and the classification loss of the network's final classification scores is computed with the binary cross-entropy loss of formula (3). The parameters of the network's fully connected layers are then updated by gradient descent according to the current classification loss:

$$\theta_{n+1} = \theta_n - \eta \nabla_{\theta_n} l(\theta_n) \qquad (9)$$

where $\theta_n$ are the network parameters, η is the learning rate, and $l(\cdot)$ is the loss function. The fully connected layers are trained for 15 iterations. When the memory-aware sample set contains samples, the model is first warm-up trained with the online sample set $\mathcal{A}_o$ collected in step (2): the classification loss is computed with the binary cross-entropy loss of formula (3), the fully connected parameters are updated with the gradient descent formula (9), and the fully connected layers are trained for 10 iterations. After warm-up training, the model undergoes joint optimization training with the online sample set $\mathcal{A}_o$ and the memory-aware sample set $\mathcal{A}_m$ collected in step (2): the classification loss $L_C$ of the online sample set is computed with the binary cross-entropy loss of formula (3), and the distillation loss $L_D$ of the memory-aware sample set is computed with the knowledge distillation loss function:

$$L_D = -\frac{1}{|N_m|}\sum_{i=1}^{|N_m|} \left[ \hat{y}_i^m \log p_i^m + (1 - \hat{y}_i^m)\log(1 - p_i^m) \right] \qquad (10)$$

where $X_m$/$Y_m$ are the training samples and sample labels of the memory-aware sample set; unlike formula (3), $\hat{y}_i^m$ is the soft label output by the old network for the i-th sample $x_i^m$ of the mini-batch $N_m$ drawn from $X_m$, and $p_i^m$ is the corresponding softmax output. Finally, the total loss function is:

$$L_{sum} = L_C + \lambda \cdot L_D \qquad (11)$$

where the parameter λ is set to 0.7. After the total loss is computed, the fully connected parameters are updated with the gradient descent formula (9), and the fully connected layers are trained for 15 iterations. In each training stage, the learning rate of the fully connected layers FC4-5 is set to 0.001 and that of the classification layer FC6 to 0.01; momentum and weight decay are set to 0.9 and 0.0005, respectively; and during training each mini-batch consists of 32 positive samples and 96 hard negative samples selected from 1024 negative samples.
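One joint-optimization iteration (formulas (3) and (9)-(11)) can be sketched as follows; `old_net` is a frozen copy of the model saved before the update, whose softmax outputs provide the soft labels ŷ (assumed to be in eval mode with gradients disabled):

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.7

def joint_step(net, old_net, optimizer, online_x, online_y, memory_x):
    """One iteration of joint optimization: binary cross-entropy on the online set
    plus knowledge distillation against the old model on the memory-aware set."""
    loss_c = F.cross_entropy(net(online_x), online_y)    # classification loss L_C, formula (3)
    with torch.no_grad():
        soft = F.softmax(old_net(memory_x), dim=1)       # soft labels from the old network
    log_p = F.log_softmax(net(memory_x), dim=1)
    loss_d = -(soft * log_p).sum(dim=1).mean()           # distillation loss L_D, formula (10)
    loss = loss_c + LAMBDA * loss_d                      # total loss L_sum, formula (11)
    optimizer.zero_grad(); loss.backward(); optimizer.step()  # gradient step, formula (9)
    return float(loss)
```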

Claims (4)

1. A long-term visual target tracking method based on continuous learning is characterized by comprising the following steps of: the method comprises four parts of network model design, model initialization, online tracking and model updating;
designing a network model: a deep neural network structure designed for long-term visual target tracking;
model initialization: comprises 3 steps: obtaining an initial frame segmentation image; model initialization training sample library generation; model initialization training and model acquisition; the model initialization training and model obtaining stage comprises the selection of a loss function and a gradient descent method;
on-line tracking: comprises 3 steps: generating a candidate sample; obtaining an optimal candidate sample; positioning a target area by using target frame regression;
model updating: comprises 3 steps: selecting an update mode; generating and updating the model update sample library; model training and model acquisition in the continuous learning mode; sample library generation includes the acquisition of an online sample set and a memory-aware sample set; sample library updating includes updating the online sample set and the memory-aware sample set; the model training and model acquisition stage of the continuous learning mode includes the choice of loss function and gradient descent method;
the model updating comprises the following specific steps:
(1) Update mode selection: comprehensively considering two complementary aspects of target tracking, robustness and adaptivity, two model update modes, long-term and short-term, are adopted; during tracking, a long-term update is performed every f = 8-10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background;
(2) Generating and updating a model update sample library: the model update sample library comprises two parts, an online sample set $\mathcal{A}_o$ and a memory-aware sample set $\mathcal{A}_m$, where $f_l$ = 80-100 and $f_s$ = 20-30 denote the numbers of frames over which the long-term and short-term sample sets are collected, respectively; $\mathcal{A}_o^+$ and $\mathcal{A}_o^-$ denote the online positive and negative sample sets, and $\mathcal{A}_m^+$ and $\mathcal{A}_m^-$ denote the memory-aware positive and negative sample sets;
(3) For each frame in the online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, random sampling is carried out around the estimated target position, and the collected positive and negative samples are added to $\mathcal{A}_{o,t}^+$ and $\mathcal{A}_{o,t}^-$ respectively, where t denotes the t-th frame of the online tracking video sequence; for the online positive sample set $\mathcal{A}_o^+$, when tracking has succeeded for more than $f_l$ frames, the positive samples collected in the earliest frame are deleted and added to the memory-aware positive sample set $\mathcal{A}_m^+$, i.e., the online positive sample set keeps only the samples of the most recent $f_l$ successfully tracked frames; for the online negative sample set $\mathcal{A}_o^-$, when tracking has succeeded for more than $f_s$ frames, the negative samples collected in the earliest frame are deleted and added to the memory-aware negative sample set $\mathcal{A}_m^-$, i.e., the online negative sample set keeps only the samples of the most recent $f_s$ successfully tracked frames; for the memory-aware positive sample set $\mathcal{A}_m^+$, when more than $f_l$ frames of samples have been collected, the samples are clustered into $N_C$ classes with the K-means algorithm; when new samples arrive, the Euclidean distances between the feature mean vector of the new samples and the $N_C$ cluster centers are computed, the new samples are added to the class with the smallest distance, and the earliest samples in that class, equal in number to the new samples, are deleted, ensuring that the total size of $\mathcal{A}_m^+$ is unchanged; for the memory-aware negative sample set $\mathcal{A}_m^-$, when more than $f_s$ frames of samples have been collected, the samples collected in the earliest frame are deleted, i.e., the memory-aware negative sample set keeps only the latest $f_s$ frames of samples;
(4) Model training and model acquisition in a continuous learning mode: the continuous learning mode model training comprises two stages, warm-up training and joint optimization training;
at a long-term or short-term update, if the memory-aware sample set has not yet collected samples, the model is trained with the online sample set $\mathcal{A}_o$ collected in step (2), the classification loss of the network's final classification scores is computed with the binary cross-entropy loss function, the parameters of the network's fully connected layers are updated by gradient descent according to the current classification loss, and the fully connected layers are trained for $H_2$ = 15 iterations; when the memory-aware sample set contains samples, the model is first warm-up trained with the online sample set $\mathcal{A}_o$ collected in step (2), the classification loss is computed with the binary cross-entropy loss function, the fully connected parameters are updated by gradient descent, and the fully connected layers are trained for $H_3$ = 10 iterations; after warm-up training, the model undergoes joint optimization training with the online sample set $\mathcal{A}_o$ and the memory-aware sample set $\mathcal{A}_m$ collected in step (2): the classification loss of the online sample set is computed with the binary cross-entropy loss function, the knowledge distillation loss of the memory-aware sample set is computed with the knowledge distillation loss function, and the total loss is the classification loss plus λ = 0.7 times the distillation loss; after the total loss is computed, the fully connected parameters are updated by gradient descent, and the fully connected layers are trained for $H_4$ = 15 iterations; in each training stage, the learning rate of the fully connected layers FC4-5 is set to 0.001 and that of the classification layer FC6 to 0.01, and momentum and weight decay are set to 0.9 and 0.0005, respectively.
2. The method of claim 1, wherein the network model design comprises the following specific steps:
the deep neural network structure designed for long-term visual target tracking: the network structure consists of a shared layer and a classification layer; the shared layer comprises 3 convolutional layers, 2 max pooling layers, 2 fully connected layers, and 5 nonlinear ReLU activation layers; the convolutional layers are identical to the corresponding part of the generic VGG-M network; the two subsequent fully connected layers each have 512 output units and incorporate ReLU and Dropout modules; the classification layer is a binary classification layer containing a Dropout module with a softmax loss and is responsible for distinguishing target from background;
in the image processing of the convolutional neural network (CNN), the convolutional layers are connected through convolutional filters defined as N×C×W×H, where N is the number of filter types, C is the number of filtered channels, and W and H are the width and height of the filter window, respectively.
3. The method according to claim 1, wherein the model initialization comprises the following specific steps:
(1) Initial frame segmentation image acquisition: the quality of the initial frame template has an important influence on the tracking result; to enrich the detailed representation of the tracked target, superpixel-level segmentation is applied so that the segmented image is consistent with the target in color and texture and the structural information of the target is retained;
(2) Training sample library generation: $N_1$ samples are randomly drawn around the initial target positions of the first-frame original image and the segmentation image, respectively; the samples are labeled positive or negative according to their intersection-over-union scores with the ground-truth box;
(3) Model initialization training and model acquisition: in the initial frame of the tracking sequence, the binary cross-entropy loss is adopted as the loss function on the network's final classification scores, and the parameters of the network's fully connected layers are then updated by gradient descent; the fully connected layers are trained for $H_1$ iterations, with the learning rate of the fully connected layers FC4-5 set to 0.0005 and that of the classification layer FC6 set to 0.005; momentum and weight decay are set to 0.9 and 0.0005, respectively; finally, training stops when $H_1$ = 50 iterations are reached, and the network initialization model is obtained.
4. The method according to claim 1, wherein the online tracking comprises the following specific steps:
(1) Generating target candidate samples: for each frame of the video sequence, $N_2$ candidate samples are first drawn around the target position predicted in the previous frame;
(2) Obtaining the best candidate sample: the $N_2$ candidate samples obtained in step (1) are fed into the current network model to compute classification scores, and the candidate with the highest score is taken as the estimated target position;
(3) Target-box regression: after the estimated target position is obtained, the target region is located with a target-box regression method to produce the tracking result.
CN201910956780.XA 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning Active CN110728694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910956780.XA CN110728694B (en) 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning


Publications (2)

Publication Number Publication Date
CN110728694A CN110728694A (en) 2020-01-24
CN110728694B true CN110728694B (en) 2023-11-24

Family

ID=69219832




Also Published As

Publication number Publication date
CN110728694A (en) 2020-01-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant