CN110728694A - Long-term visual target tracking method based on continuous learning - Google Patents

Long-term visual target tracking method based on continuous learning

Info

Publication number
CN110728694A
CN110728694A
Authority
CN
China
Prior art keywords
model
sample
tracking
sample set
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910956780.XA
Other languages
Chinese (zh)
Other versions
CN110728694B (en)
Inventor
张辉
朱牧
张菁
卓力
齐天卉
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910956780.XA priority Critical patent/CN110728694B/en
Publication of CN110728694A publication Critical patent/CN110728694A/en
Application granted granted Critical
Publication of CN110728694B publication Critical patent/CN110728694B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/20 — Analysis of motion
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 — Special algorithmic details
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/20084 — Artificial neural networks [ANN]


Abstract

The invention relates to a long-term visual target tracking method based on continuous learning. A deep neural network structure is designed for long-term visual target tracking; an initialized network model is obtained through model initialization, online tracking is then performed with this model, and long-term or short-term model updates are carried out during tracking with a continuous-learning method so as to adapt to the various changes the target undergoes. The method recasts the online model update of conventional visual target tracking as a continuous-learning process and builds a complete appearance description of the target from all historical data of the video, effectively improving the robustness of long-term visual tracking. It offers an effective long-term visual target tracking solution for applications such as intelligent video surveillance, human-computer interaction, and visual navigation.

Description

Long-term visual target tracking method based on continuous learning
Technical Field
The invention belongs to the field of computer vision and image/video processing, and particularly relates to a long-term visual target tracking method based on continuous learning.
Background
Visual target tracking is a basic problem in computer vision and image/video processing, with wide application in automatic surveillance video analysis, human-computer interaction, visual navigation and other fields. Tracking methods can be roughly divided into two categories by the length of the video sequence: short-term target tracking and long-term target tracking. Generally, tracking on a video sequence longer than 1000 frames is called long-term target tracking. Current short-term tracking algorithms perform well on relatively short video data, but when applied directly to real long video sequences, their accuracy and robustness cannot meet the requirements of practical scenarios.
In the long-term tracking task, besides the challenges common to short-term scenarios, such as target scale changes, illumination changes and target deformation, the problem of robustly re-locking targets that frequently 'reappear after disappearing' must also be solved. Long-term tracking is therefore more challenging than conventional short-term tracking and better matches the actual requirements of many application scenarios. However, tracking techniques for such long-duration data are relatively scarce, and the performance of existing methods is very limited. One existing idea for long-term tracking is to combine traditional tracking with traditional target detection to handle target deformation, partial occlusion and similar problems, while continuously updating the 'salient feature points' of the tracking module and the target model and related parameters of the detection module through an online learning mechanism, making tracking more robust and reliable. There are also methods that use keypoint-matching tracking and robust estimation techniques, integrate long-term memory, and provide additional information for output control. Such methods can search for the target over the whole image frame, but their performance is unsatisfactory because they rely only on simple hand-crafted features. More recently, tracking methods based on correlation filtering and deep learning have been proposed; although some include re-detection schemes for long-term tracking, they search only in a local region of the image, so a target that leaves the field of view cannot be recaptured, which fails the requirements of long-term tracking tasks.
From the current state of the art, visual target tracking methods built on deep convolutional image classification networks have great potential for effectively distinguishing the target from a cluttered background, and tracking methods based on this framework have broad development prospects. However, tracking models trained only offline are often hard to adapt to online changes in the video, while simply updating the model frequently with new data accelerates tracking drift and makes failure likely on long-term tracking problems. The invention balances the model's historical memory against its online updates through a continuous-learning method, and provides a long-term visual target tracking method based on continuous learning.
Disclosure of Invention
The method uses continuous-learning theory to recast the online model update of visual target tracking as a continuous-learning process, learns an effective abstraction and characterization of the time-series images over the whole video sequence, and builds a complete portrait of the target. The method thereby adapts to target deformation, background interference, occlusion, illumination changes and similar conditions during tracking, improves the adaptability and reliability of existing tracking methods under online updating, reduces the model's sensitivity to noise such as target deformation and occlusion, and achieves stable long-term tracking of the target.
The invention is realized by the following technical means: a long-term visual target tracking method based on continuous learning, comprising four parts: network model design, model initialization, online tracking and model updating.
Designing the network model: first, a deep neural network structure is designed according to the overall flow shown in FIG. 1; the feature map sizes of the network stages are then set adaptively.
Model initialization: comprises 3 main steps: acquiring the initial frame segmentation image; generating the model initialization training sample library; model initialization training and model acquisition. The training and model acquisition stage involves the choice of a loss function and a gradient-descent method.
Online tracking: comprises 3 main steps: generating candidate samples; obtaining the best candidate sample; localizing the target region with target box regression.
Model updating: comprises 3 main steps: selecting the update mode; generating and updating the model-update sample library; continuous-learning model training and model acquisition. Sample library generation comprises the acquisition of an online sample set and a memory-aware sample set; sample library updating comprises updating both the online and memory-aware sample sets; the continuous-learning training and model acquisition stage involves the choice of a loss function and a gradient-descent method.
The network model design comprises the following specific steps:
(1) The deep neural network structure designed by the invention: as shown in FIG. 2, the network consists of shared layers and a classification layer. The shared layers comprise 3 convolutional layers, 2 max-pooling layers, 2 fully connected layers and 5 nonlinear ReLU activation layers. The convolutional layers are identical to the corresponding parts of the generic VGG-M network. The two fully connected layers that follow each have 512 output units and incorporate ReLU and Dropout modules. The classification layer is a binary classification layer with a Dropout module and a softmax loss, responsible for distinguishing the target from the background.
In CNN image processing, convolutional layers are connected through convolution filters defined as N×C×W×H, where N denotes the number of filter types, C the number of channels being filtered, and W, H the width and height of the filtering range, respectively.
(2) In the continuous-learning long-term target tracking process, the input and output feature maps of each convolutional layer change as follows:
during tracking, images of different sizes are first resized to a uniform 3 × 107 × 107 input. In the first convolutional layer, the input passes through 96 convolution kernels of size 7 × 7, giving 96 output channels, then through a nonlinear ReLU activation layer and a local response normalization layer, and finally through a max-pooling layer, yielding a feature map of size 96 × 25 × 25. The second convolutional layer takes this 96 × 25 × 25 feature map, passes it through 256 convolution kernels of size 5 × 5, then through a ReLU activation layer and a local response normalization layer giving 256 channels, and finally through a max-pooling layer, yielding a 256 × 5 × 5 feature map. The third convolutional layer takes the 256 × 5 × 5 feature map through 512 convolution kernels of size 3 × 3 followed by a ReLU activation layer, yielding a 512 × 3 × 3 feature map. The fourth layer, fully connected, takes the 512 × 3 × 3 feature map through 512 neural units followed by a ReLU activation layer, yielding a 512-dimensional feature vector. The fifth layer, also fully connected, takes the 512-dimensional vector through 512 neural units, then a Dropout layer, and finally a ReLU activation layer, again yielding a 512-dimensional feature vector. In the classification layer, the 512-dimensional vector passes through a Dropout layer and is fed to the binary classification layer with softmax loss, which outputs a 2-dimensional classification score.
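For concreteness, the layer stack described above can be assembled as in the following PyTorch sketch (a minimal sketch: the patent fixes the kernel counts, kernel sizes and feature-map sizes, while the strides, pooling windows and LRN settings here are assumptions chosen so the stated sizes work out):

```python
import torch.nn as nn

class TrackingNet(nn.Module):
    """Shared layers (conv1-3, fc4-5) plus binary classification layer (fc6)."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # conv1: 96 kernels of 7x7
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),        # -> 96 x 25 x 25
            nn.Conv2d(96, 256, kernel_size=5, stride=2),  # conv2: 256 kernels of 5x5
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(2),
            nn.MaxPool2d(kernel_size=3, stride=2),        # -> 256 x 5 x 5
            nn.Conv2d(256, 512, kernel_size=3, stride=1), # conv3: 512 kernels of 3x3
            nn.ReLU(inplace=True),                        # -> 512 x 3 x 3
        )
        self.fc = nn.Sequential(
            nn.Linear(512 * 3 * 3, 512),                  # fc4: 512 units
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, 512),                          # fc5: 512 units
            nn.Dropout(0.5),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(512, 2)               # fc6: target vs. background

    def forward(self, x):                                 # x: (B, 3, 107, 107)
        feat = self.shared(x).flatten(1)
        return self.classifier(self.fc(feat))             # (B, 2) classification scores
```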
The model initialization comprises the following specific steps:
(1) Obtaining the initial frame segmentation image: the quality of the initial frame template has a significant impact on subsequent tracking results. To enrich the detailed representation of the tracked target, superpixel-level segmentation is applied with the Simple Linear Iterative Clustering (SLIC) method, so that the segmented image agrees with the target in color and texture while retaining the target's structural information, as shown in FIG. 3.
(2) Generation of the training sample library: N1 samples are randomly drawn around the initial target position in the first-frame original image and the segmented image, respectively. Each sample is labeled positive or negative according to its intersection-over-union (IoU) with the ground-truth box: positive for IoU between 0.7 and 1.0, negative for IoU between 0 and 0.5 (see the sketch following step (3)).
(3) Model initialization training and model acquisition: in the initial frame of the tracking sequence, the binary cross-entropy loss is used as the loss function on the network's final classification scores, and the parameters of the fully connected layers are updated by gradient descent. The fully connected layers are trained for H1 (50) iterations, with the learning rate set to 0.0005 for the fully connected FC4-5 layers and 0.005 for the classification FC6 layer; momentum and weight decay are set to 0.9 and 0.0005, respectively. Each mini-batch consists of M+ (32) positive samples and (96) hard negative samples selected from M- (1024) negative samples. Training stops after H1 (50) iterations, yielding the network initialization model.
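The sampling-and-labeling of step (2) can be sketched as follows (a minimal sketch; the Gaussian perturbation used to draw boxes around the target and its scales are assumptions, as the text fixes only the sample counts and the IoU thresholds):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def draw_labeled_samples(gt_box, n_samples, rng=np.random.default_rng(0)):
    """Perturb the ground-truth box and label each draw by its IoU."""
    pos, neg = [], []
    x, y, w, h = gt_box
    while len(pos) + len(neg) < n_samples:
        s = (x + rng.normal(0, 0.3) * w, y + rng.normal(0, 0.3) * h,
             w * np.exp(rng.normal(0, 0.5)), h * np.exp(rng.normal(0, 0.5)))
        o = iou(s, gt_box)
        if o >= 0.7:            # positive: IoU in [0.7, 1.0]
            pos.append(s)
        elif o <= 0.5:          # negative: IoU in [0, 0.5]
            neg.append(s)      # draws with IoU in (0.5, 0.7) are discarded
    return pos, neg
```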
The online tracking comprises the following specific steps:
(1) Target candidate sample generation: for each frame of the video sequence, N2 candidate samples are first drawn around the target position predicted in the previous frame.
(2) Obtaining the best candidate sample: the N2 candidate samples from step (1) are fed into the current network model to compute classification scores, and the candidate with the highest score is taken as the estimated target position.
(3) Target box regression: after the estimated target position is obtained in step (2), the target region is localized with a target box regression method to produce the tracking result.
The model updating comprises the following specific steps:
(1) Selection of the update mode: two complementary aspects of target tracking, robustness and adaptivity, are considered jointly, and two model update modes are adopted: long-term updates and short-term updates. During tracking, a long-term update is performed every f (8-10) frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
(2) Generation and updating of the model-update sample library: the sample library consists of two parts, an online sample set S_o and a memory-aware sample set S_m, where f_l (80-100) and f_s (20-30) denote the set number of frames over which samples are collected long-term and short-term, respectively. S_o+ and S_o- denote the online positive and negative sample sets within the online sample set, and S_m+ and S_m- the memory-aware positive and negative sample sets within the memory-aware sample set. Initially, the online positive and negative sets hold the (500) positive and (5000) negative samples generated by random sampling at the initial-frame target location. For each frame during online tracking, when the model classifies the estimated target position as foreground, tracking is successful: (50) positive and (200) negative samples are randomly drawn around the estimated target position and added to the S_o+ and S_o- sample sets, indexed by t, the t-th frame of the online-tracked video sequence. For the online positive set S_o+: when tracking has succeeded for more than f_l (80-100) frames, the positive samples collected in the earliest frame are deleted and appended to the memory-aware positive set S_m+; i.e., the online positive set keeps only the samples of the most recent f_l (80-100) successfully tracked frames. For the online negative set S_o-: when tracking has succeeded for more than f_s (20-30) frames, the negative samples collected in the earliest frame are deleted and appended to the memory-aware negative set S_m-; i.e., the online negative set keeps only the samples of the most recent f_s (20-30) successfully tracked frames. For the memory-aware positive set S_m+: when it has collected more than f_l (80-100) frames of samples, the samples are clustered into N_C (10-15) classes with the K-means algorithm; when new samples arrive, the Euclidean distance between the feature mean vector of the new samples and each of the N_C cluster centers is computed, the new samples are added to the class with the smallest distance, and the oldest samples of that class, equal in number to the new samples, are deleted, so that the total number of samples in S_m+ stays constant before and after. For the memory-aware negative set S_m-: when it has collected more than f_s (20-30) frames of samples, the samples collected in the earliest frame are deleted; i.e., the memory-aware negative set keeps only the most recent f_s (20-30) frames of samples.
(3) Continuous-learning model training and model acquisition: continuous-learning training consists of two stages, warm-up training and joint optimization training. Warm-up training lets the model learn to adapt to the current target changes; joint optimization training lets the model remember historical target changes, so that a complete description of the target is built up over long-term tracking, and a target that reappears after leaving the field of view can be quickly recovered from the model's historical memory, enabling stable long-term tracking. At a long-term or short-term update, if the memory-aware sample set has not yet collected any samples, the model is trained with the online sample set S_o collected in step (2), computing the classification loss of the network's final classification scores with the binary cross-entropy loss; the parameters of the fully connected layers are then updated by gradient descent, training the fully connected layers for H2 (15) iterations. If the memory-aware sample set does contain samples, the model is first warm-up trained on the online sample set S_o collected in step (2), with the classification loss again computed by binary cross-entropy and the fully connected layers updated by gradient descent for H3 (10) iterations. After warm-up training, the model is jointly optimized on the online sample set S_o and the memory-aware sample set S_m collected in step (2): the classification loss of the online set is computed with the binary cross-entropy loss, the knowledge-distillation loss of the memory-aware set is computed with a knowledge-distillation loss function, and the total loss is the classification loss plus λ times the knowledge-distillation loss. After the total loss is computed, the fully connected layers are updated by gradient descent for H4 (15) iterations. In every training stage, the learning rate is set to 0.001 for the fully connected FC4-5 layers and 0.01 for the classification FC6 layer, momentum and weight decay are set to 0.9 and 0.0005, respectively, and each mini-batch consists of M+ (32) positive samples and (96) hard negative samples selected from M- (1024) negative samples.
The invention has the following features:
The invention provides a long-term visual target tracking method based on continuous learning. The method recasts the online update of a conventional visual target tracking model as a continuous-learning process and, by combining a dynamically built online sample set with a memory-aware sample set, learns the target's occlusion, shape, scale and illumination changes over the long-term time dimension, so that the time-series data of the whole video sequence are effectively abstracted and represented and a complete portrait of the target is built. After the target has been occluded or out of view for a long time, the method can quickly recover a target reappearing in the field of view from the continuously learned historical model. Compared with existing visual target tracking techniques, the method balances the model's historical memory against its online updates through continuous learning, overcomes the 'catastrophic forgetting' caused in the prior art by frequent updates on new data, builds a complete portrait description of the target from all historical data of the video, obtains a target model insensitive to noise, improves the robustness of visual tracking, and achieves the goal of long-term tracking. It offers an effective long-term visual target tracking solution for applications such as intelligent video surveillance, human-computer interaction, and visual navigation.
Description of the drawings:
FIG. 1 is an overall flow chart
FIG. 2 is a network architecture
FIG. 3 shows an initial frame segmentation image
The specific embodiments are as follows:
the following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings:
A long-term target tracking method based on continuous learning; the overall flow is shown in FIG. 1. The algorithm is divided into a model initialization part, an online tracking part and a model updating part. Model initialization part: the initial frame is processed by first obtaining a foreground-only segmentation image with a superpixel segmentation method; the initial-frame original image and the segmentation image are then input to extract convolutional features separately, the two sets of features are fused by addition, classification scores are obtained through the fully connected and classification layers, the classification loss is computed, and the optimal initialization model is solved by back-propagating the loss gradients. Online tracking part: for each subsequent frame, candidate samples are first generated from the target position predicted in the previous frame, each candidate is fed into the network to compute its classification score, the candidate with the highest score is selected, and finally target box regression localizes the target region to give the tracking result. Model updating part: during tracking, every 10 frames, or whenever the model classifies the estimated target as background, a long-term or short-term model update is performed with the continuous-learning method to adapt to the various changes of the target.
The model initialization part comprises the following specific steps:
(1) Obtaining the initial frame segmentation image: the initial frame is composed of a superpixel set O = {O_i}, i = 1, …, N, where N is the number of superpixels in the image and O_i denotes the i-th superpixel in the set. Superpixels lying completely outside the bounding box are regarded as background; the remaining superpixels are unknown (background or foreground). Each superpixel is modeled by P randomly sampled pixel values x_v as m = {x_v}, v = 1, …, P, where P is the number of randomly sampled pixels and x_v is the pixel value of the v-th sample in the superpixel model m. This can be seen as an empirical histogram of the superpixel's color distribution. An unknown superpixel model m_a is labeled as background if there is a known background superpixel model m_b whose similarity score satisfies S(m_a, m_b) > η, with η = 0.5, where:

S(m_a, m_b) = (1/P) Σ_{k=1..P} score(x_k, m_b)   (1)

where x_k is the pixel value of the k-th sample of the unknown superpixel model m_a, and score(x_k, m_b) is defined as:

score(x_k, m_b) = 1 if min_j ||x_k − x_j|| ≤ R, and 0 otherwise   (2)

where x_j is the pixel value of the j-th sample of the known superpixel model m_b. The parameter R is set to 0.5; it controls the radius of the sphere centered at each model pixel, allowing for slight errors. FIG. 3 shows the segmentation results.
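A small sketch of this background-labeling test follows (a minimal sketch under the reconstruction of equations (1)-(2) above; the mean-of-scores form of S(m_a, m_b) is an assumption, and pixel values are assumed normalized to [0, 1] so that R = 0.5 is meaningful):

```python
import numpy as np

def score(x_k, m_b, R=0.5):
    """Eq. (2): 1 if any sampled pixel of model m_b lies within radius R of x_k."""
    d = np.linalg.norm(m_b - x_k, axis=1)       # m_b: (P, 3) sampled pixel values
    return float(d.min() <= R)

def similarity(m_a, m_b, R=0.5):
    """Eq. (1): mean per-pixel score of unknown model m_a against known model m_b."""
    return np.mean([score(x_k, m_b, R) for x_k in m_a])

def label_as_background(m_a, background_models, eta=0.5):
    """Unknown superpixel is background if it matches any known background model."""
    return any(similarity(m_a, m_b) > eta for m_b in background_models)
```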
(2) Generation of the model initialization training sample library: 500 positive samples are randomly sampled around the initial target position in the first-frame original image and the segmented image, respectively, and 5000 negative samples are drawn from the first-frame original image only. Samples are scored by their intersection-over-union with the ground-truth box; samples with scores in [0.7, 1] are labeled positive and samples with scores in [0, 0.5] negative.
(3) Model initialization training and model acquisition: the binary cross-entropy loss is used as the loss function on the network's final classification scores:

L_C = −(1/|N_n|) Σ_{i∈N_n} y_i · log p_i   (3)

where X_n/Y_n denote the training samples and labels of the initial training sample library, N_n is a batch drawn from X_n, y_i is the (one-hot) label of the i-th sample in N_n, and p_i is the softmax output of the network for the i-th sample x_i. The optimized network parameters are then solved by stochastic gradient descent: in the initial frame of the test sequence, the fully connected layers are trained for 50 iterations, with the learning rate set to 0.0005 for the fully connected FC4-5 layers and 0.005 for the classification FC6 layer; momentum and weight decay are set to 0.9 and 0.0005, respectively; each mini-batch consists of M+ = 32 positive samples and 96 hard negative samples selected from M- = 1024 negative samples.
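The hard-negative selection inside each mini-batch can be sketched as follows (a minimal sketch; scoring all 1024 negatives with the current model and keeping the top 96 follows common practice for this kind of training and is an assumption where the text does not pin the procedure down):

```python
import torch

def mine_hard_negatives(model, neg_batch, k=96):
    """Pick the k negatives the current model most confuses with the target.

    neg_batch: (1024, 3, 107, 107) tensor of negative sample patches.
    Returns the k patches with the highest target-class score.
    """
    model.eval()
    with torch.no_grad():
        scores = model(neg_batch)[:, 1]    # target-class score of each negative
    hard_idx = scores.topk(k).indices      # hardest negatives score highest
    return neg_batch[hard_idx]
```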
The online tracking part comprises the following specific steps:
(1) For each frame during online tracking, 256 candidate samples {x_u}, u = 1, …, 256, are generated with a Gaussian distribution from the target position estimated in the previous frame, x_u denoting the u-th candidate. The Gaussian is centered on the previous target state, with covariance given by the diagonal matrix diag(0.09r^2, 0.09r^2, 0.25), where r is the mean of the width and height of the target position estimated in the previous frame.
(2) The network output is a two-dimensional vector giving the target and background scores of the input candidate sample. The candidate with the highest classification score is selected as the estimated target position:

x* = argmax_u f+(x_u)   (4)

where u is the candidate index, f+(·) denotes the target score under the current network, and x* is the candidate with the highest score computed by the network, i.e., the estimated target position.
(3) Finally, target box regression is applied to the obtained target position to localize the target region. Ridge regression is used, with the ridge parameter α set to 1000.
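A condensed sketch of one tracking step under these settings is given below (candidate states are (x, y, scale) perturbations; the patch cropping and the scale-step base are assumptions, and `model` is the network of FIG. 2):

```python
import numpy as np
import torch

def generate_candidates(prev_box, n=256, rng=np.random.default_rng(0)):
    """Sample (x, y, scale) candidate states around the previous target box."""
    x, y, w, h = prev_box
    r = (w + h) / 2.0
    cov = np.diag([0.09 * r**2, 0.09 * r**2, 0.25])        # diag(0.09r^2, 0.09r^2, 0.25)
    dx, dy, ds = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    scale = 1.05 ** ds                                     # scale-step base: assumption
    return [(x + dx[i], y + dy[i], w * scale[i], h * scale[i]) for i in range(n)]

def select_best(model, patches):
    """Score candidate patches (n, 3, 107, 107) and return the best index, eq. (4)."""
    model.eval()
    with torch.no_grad():
        scores = model(patches)[:, 1]                      # f+(x_u) per candidate
    return int(scores.argmax())                            # x* = argmax_u f+(x_u)
```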
The model updating part comprises the following specific steps:
(1) Selection of the update mode: two model update modes are adopted, long-term and short-term. During tracking, a long-term update is performed every f = 10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background.
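This trigger logic reduces to a few lines (a minimal sketch; the caller runs the continuous-learning update of step (3) whenever it returns True):

```python
def should_update(frame_idx, target_classified_as_background, f=10):
    """Long-term update every f frames; short-term update on tracking failure."""
    long_term = (frame_idx % f == 0)           # periodic long-term update
    short_term = target_classified_as_background
    return long_term or short_term
```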
(2) Generation and updating of the model-update sample library: the sample library consists of the online sample set S_o and the memory-aware sample set S_m, the subscripts f_l and f_s denoting the long-term and short-term sample-collection frame settings, respectively. S_o+ and S_o- denote the online positive and negative sample sets within the online sample set, and S_m+ and S_m- the memory-aware positive and negative sample sets within the memory-aware sample set. Initially, the online positive and negative sets hold the 500 positive and 5000 negative samples generated by random sampling at the initial-frame target location. For each frame during online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, 50 positive and 200 negative samples are randomly drawn around the estimated target position and added to the S_o+ and S_o- sample sets. For the online positive set S_o+: when tracking has succeeded for more than 100 frames, the positive samples collected in the earliest frame are deleted and appended to the memory-aware positive set S_m+; in other words, the online positive set keeps only the samples of the 100 most recent successfully tracked frames. For the online negative set S_o-: when tracking has succeeded for more than 30 frames, the negative samples collected in the earliest frame are deleted and appended to the memory-aware negative set S_m-; in other words, the online negative set keeps only the samples of the 30 most recent successfully tracked frames.
For the memory-aware positive set S_m+: when the collected samples exceed the long-term collection setting of 100 frames, they are clustered into 10 classes with the K-means algorithm:

{C_1, …, C_10} = K-means({φ(x) : x ∈ S_m+})   (5)

where τ = 1, …, 10 indexes the clusters, C_τ denotes the τ-th cluster, and φ(·) is the feature vector calculation function:

φ(x) = W ⊛ x + b   (6)

where W and b denote the weights and biases of the network layers up to the fully connected FC5 layer (so φ(x) is the FC5-level feature of sample x), x denotes the input sample, and ⊛ denotes the convolution operation. When a new memory-aware sample x_new arrives, the Euclidean distance between the feature mean vector of the new samples and each of the 10 cluster centers is computed:

d_τ(μ_new, μ_τ) = ||μ_new − μ_τ||, τ = 1, …, 10   (7)

where μ_new denotes the feature mean vector of the new samples and μ_τ the feature mean vector of the τ-th of the 10 clusters. The cluster label of the new samples is determined by the nearest mean vector:

τ* = argmin_τ d_τ(μ_new, μ_τ)   (8)

and the new samples are drawn into the corresponding cluster:

C_τ* ← C_τ* ∪ {x_new}

while the oldest samples of that class, equal in number to the new samples, are deleted at the same time, so that the total number of samples in the memory-aware positive set S_m+ stays constant before and after. For the memory-aware negative set S_m-: when more than 30 frames of samples have been collected, the oldest collected samples are deleted; i.e., the memory-aware negative set keeps only the most recent 30 frames of samples.
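The bookkeeping for the memory-aware positive set can be sketched as follows (a minimal sketch; `features` stands for the FC5-level features φ(x) of equation (6), and the per-cluster FIFO used to find the 'oldest' samples is an assumption about bookkeeping the text leaves open):

```python
import numpy as np
from collections import deque
from sklearn.cluster import KMeans

class MemoryAwarePositiveSet:
    """K-means clustered memory: new samples join the nearest cluster,
    displacing that cluster's oldest members so the total size stays constant."""

    def __init__(self, features, n_clusters=10):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
        self.centers = km.cluster_centers_                 # mu_tau in eq. (7)
        self.clusters = [deque() for _ in range(n_clusters)]
        for f, lbl in zip(features, km.labels_):
            self.clusters[lbl].append(f)                   # oldest samples sit at the left

    def add(self, new_features):
        mu_new = new_features.mean(axis=0)                 # feature mean of new samples
        d = np.linalg.norm(self.centers - mu_new, axis=1)  # eq. (7)
        tau = int(d.argmin())                              # eq. (8): nearest cluster
        for f in new_features:
            self.clusters[tau].append(f)                   # insert new sample
            self.clusters[tau].popleft()                   # evict oldest; size unchanged
```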
(3) Continuous-learning training and model acquisition: at a long-term or short-term update, if the memory-aware sample set holds no samples, the model is trained with the online sample set S_o collected in step (2), computing the classification loss of the network's final classification scores with the binary cross-entropy loss of equation (3). The parameters of the fully connected layers are then updated by gradient descent:

θ_{n+1} = θ_n − η ∇_{θ_n} l(θ_n)   (9)

where θ_n denotes the network parameters, η the learning rate, and l(·) the loss function; the fully connected layers are trained for 15 iterations. If the memory-aware sample set does contain samples, the model is first warm-up trained on the online sample set S_o, with the classification loss computed by the binary cross-entropy of equation (3) and the fully connected layers updated by the gradient descent of equation (9) for 10 iterations. After warm-up training, the model is jointly optimized on the online sample set S_o and the memory-aware sample set S_m: the classification loss L_C of the online set is computed with the binary cross-entropy of equation (3), and the distillation loss L_D of the memory-aware set with the knowledge-distillation loss function:

L_D = −(1/|N_m|) Σ_{i∈N_m} ŷ_i · log p_i   (10)

where X_m/Y_m denote the memory-aware training samples and sample labels; unlike equation (3), ŷ_i is the soft label output by the old network for the i-th sample of the batch N_m drawn from X_m, and p_i is the corresponding softmax output of the current network. The overall loss function is then:

L_sum = L_C + λ · L_D   (11)

where the parameter λ is set to 0.7. After the total loss is computed, the fully connected layers are updated by the gradient descent of equation (9) for 15 iterations. In every training stage, the learning rate is set to 0.001 for the fully connected FC4-5 layers and 0.01 for the classification FC6 layer, momentum and weight decay are set to 0.9 and 0.0005, respectively, and each mini-batch consists of 32 positive samples and 96 hard negative samples selected from 1024 negative samples.
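One joint-optimization step under equations (3), (10) and (11) might look like this (a minimal PyTorch sketch; `old_model` is a frozen copy of the network from before the update, and mini-batch assembly with hard-negative mining is elided):

```python
import torch
import torch.nn.functional as F

def joint_optimization_step(model, old_model, optimizer,
                            online_x, online_y, memory_x, lam=0.7):
    """One update: cross-entropy on online samples + distillation on memory samples."""
    # Classification loss L_C on the online sample set, eq. (3)
    logits = model(online_x)
    loss_c = F.cross_entropy(logits, online_y)

    # Knowledge-distillation loss L_D on the memory-aware set, eq. (10):
    # soft labels come from the old (pre-update) network
    with torch.no_grad():
        soft_labels = F.softmax(old_model(memory_x), dim=1)
    log_p = F.log_softmax(model(memory_x), dim=1)
    loss_d = -(soft_labels * log_p).sum(dim=1).mean()

    # Total loss, eq. (11): L_sum = L_C + lambda * L_D, with lambda = 0.7
    loss = loss_c + lam * loss_d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```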

Claims (5)

1. A long-term visual target tracking method based on continuous learning, characterized in that: the method comprises four parts: network model design, model initialization, online tracking and model updating;
designing the network model: a deep neural network structure designed for long-term visual target tracking;
model initialization: comprises 3 steps: acquiring the initial frame segmentation image; generating the model initialization training sample library; model initialization training and model acquisition; the model initialization training and model acquisition stage comprises the selection of a loss function and a gradient-descent method;
online tracking: comprises 3 steps: generating candidate samples; obtaining the best candidate sample; localizing the target region by target box regression;
model updating: comprises 3 steps: selecting the update mode; generating and updating the model-update sample library; continuous-learning model training and model acquisition; sample library generation comprises the acquisition of an online sample set and a memory-aware sample set; sample library updating comprises updating of the online sample set and the memory-aware sample set; the continuous-learning model training and model acquisition stage comprises the selection of a loss function and a gradient-descent method.
2. The method of claim 1, wherein the network model is designed by the following specific steps:
the deep neural network structure designed for long-term visual target tracking is as follows: the network consists of shared layers and a classification layer; the shared layers comprise 3 convolutional layers, 2 max-pooling layers, 2 fully connected layers and 5 nonlinear ReLU activation layers; the convolutional layers are identical to the corresponding parts of the generic VGG-M network; the two fully connected layers that follow each have 512 output units and incorporate ReLU and Dropout modules; the classification layer is a binary classification layer with a Dropout module and a softmax loss, responsible for distinguishing the target from the background;
in CNN image processing, convolutional layers are connected through convolution filters defined as N×C×W×H, where N denotes the number of filter types, C the number of channels being filtered, and W, H the width and height of the filtering range, respectively.
3. The method of claim 1, wherein the model is initialized by the following specific steps:
(1) obtaining the initial frame segmentation image: the quality of the initial frame template has a significant impact on subsequent tracking results; to enrich the detailed representation of the tracked target, superpixel-level segmentation is applied so that the segmented image agrees with the target in color and texture while retaining the target's structural information;
(2) generation of the training sample library: N1 samples are randomly drawn around the initial target position in the first-frame original image and the segmented image, respectively; the samples are labeled positive or negative according to their intersection-over-union score with the ground-truth box;
(3) model initialization training and model acquisition: in the initial frame of the tracking sequence, the binary cross-entropy loss is used as the loss function on the network's final classification scores, and the parameters of the fully connected layers are updated by gradient descent; the fully connected layers are trained for H1 iterations, with the learning rate set to 0.0005 for the fully connected FC4-5 layers and 0.005 for the classification FC6 layer; momentum and weight decay are set to 0.9 and 0.0005, respectively; finally, training stops when H1, namely 50, iterations are reached, yielding the network initialization model.
4. The method according to claim 1, wherein the online tracking comprises the following specific steps:
(1) target candidate sample generation: for each frame of the video sequence, N2 candidate samples are first drawn around the target position predicted in the previous frame;
(2) obtaining the best candidate sample: the N2 candidate samples from step (1) are fed into the current network model to compute classification scores, and the candidate with the highest score is taken as the estimated target position;
(3) target box regression: after the estimated target position is obtained in step (2), the target region is localized with a target box regression method to produce the tracking result.
5. The method of claim 1, wherein the model is updated by the following steps:
(1) selection of the update mode: two complementary aspects of target tracking, robustness and adaptivity, are considered jointly; two model update modes, long-term and short-term, are adopted; during tracking, a long-term update is performed every f = 8-10 frames, and a short-term update is performed whenever the model classifies the estimated target position as background;
(2) generation and updating of the model-update sample library: the sample library consists of two parts, an online sample set S_o and a memory-aware sample set S_m, where f_l = 80-100 and f_s = 20-30 denote the long-term and short-term sample-collection frame settings, respectively; S_o+ and S_o- denote the online positive and negative sample sets within the online sample set, and S_m+ and S_m- the memory-aware positive and negative sample sets within the memory-aware sample set;
(3) for each frame during online tracking, when the model classifies the estimated target position as foreground, tracking is successful; random samples are drawn around the estimated target position, and a set number of positive samples and a set number of negative samples are collected and added to the S_o+ and S_o- sample sets, where t denotes the t-th frame of the online-tracked video sequence; for the online positive set S_o+: when tracking has succeeded for more than f_l frames, the positive samples collected in the earliest frame are deleted and appended to the memory-aware positive set S_m+, i.e. the online positive set keeps only the samples of the most recent f_l successfully tracked frames; for the online negative set S_o-: when tracking has succeeded for more than f_s frames, the negative samples collected in the earliest frame are deleted and appended to the memory-aware negative set S_m-, i.e. the online negative set keeps only the samples of the most recent f_s successfully tracked frames; for the memory-aware positive set S_m+: when more than f_l frames of samples have been collected, the samples are clustered into N_C classes with the K-means algorithm; when new samples arrive, the Euclidean distance between the feature mean vector of the new samples and each of the N_C cluster centers is computed, the new samples are added to the class with the smallest distance, and the oldest samples of that class, equal in number to the new samples, are deleted, so that the total number of samples in S_m+ stays constant before and after; for the memory-aware negative set S_m-: when more than f_s frames of samples have been collected, the samples collected in the earliest frame are deleted, i.e. the memory-aware negative set keeps only the most recent f_s frames of samples;
(4) continuous-learning model training and model acquisition: continuous-learning training consists of two stages, warm-up training and joint optimization training;
at a long-term or short-term update, if the memory-aware sample set has collected no samples, the model is trained with the online sample set S_o collected in step (2), computing the classification loss of the network's final classification scores with a binary cross-entropy loss function; the parameters of the fully connected layers are then updated by gradient descent, training the fully connected layers for H2 = 15 iterations; if the memory-aware sample set contains samples, the model is first warm-up trained on the online sample set S_o collected in step (2), computing its classification loss with the binary cross-entropy loss function and updating the fully connected layers by gradient descent for H3 = 10 iterations; after warm-up training, the model is jointly optimized on the online sample set S_o and the memory-aware sample set S_m collected in step (2), computing the classification loss of the online set with the binary cross-entropy loss function and the knowledge-distillation loss of the memory-aware set with a knowledge-distillation loss function, the total loss being the classification loss plus 0.7 times the knowledge-distillation loss; after the total loss is computed, the fully connected layers are updated by gradient descent for H4 = 15 iterations; in every training stage, the learning rate is set to 0.001 for the fully connected FC4-5 layers and 0.01 for the classification FC6 layer, and momentum and weight decay are set to 0.9 and 0.0005, respectively.
CN201910956780.XA 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning Active CN110728694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910956780.XA CN110728694B (en) 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning


Publications (2)

Publication Number Publication Date
CN110728694A true CN110728694A (en) 2020-01-24
CN110728694B CN110728694B (en) 2023-11-24

Family

ID=69219832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910956780.XA Active CN110728694B (en) 2019-10-10 2019-10-10 Long-time visual target tracking method based on continuous learning

Country Status (1)

Country Link
CN (1) CN110728694B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709936A (en) * 2016-12-14 2017-05-24 北京工业大学 Single target tracking method based on convolution neural network
CN108062764A (en) * 2017-11-30 2018-05-22 极翼机器人(上海)有限公司 A kind of object tracking methods of view-based access control model
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110211157A (en) * 2019-06-04 2019-09-06 重庆邮电大学 A kind of target long time-tracking method based on correlation filtering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABHINAV MOUDGIL et al.: "Long-Term Visual Object Tracking Benchmark", Springer Nature Switzerland AG, 2019 *
LIU Wei; ZHAO Wenjie; LI Cheng: "Long-term target tracking via spatio-temporal context learning", Acta Optica Sinica *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539989A (en) * 2020-04-20 2020-08-14 北京交通大学 Computer vision single-target tracking method based on optimization variance reduction
CN111539989B (en) * 2020-04-20 2023-09-22 北京交通大学 Computer vision single target tracking method based on optimized variance reduction
CN112037269A (en) * 2020-08-24 2020-12-04 大连理工大学 Visual moving target tracking method based on multi-domain collaborative feature expression
CN112330719A (en) * 2020-12-02 2021-02-05 东北大学 Deep learning target tracking method based on feature map segmentation and adaptive fusion
CN112330719B (en) * 2020-12-02 2024-02-27 东北大学 Deep learning target tracking method based on feature map segmentation and self-adaptive fusion
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN112698933A (en) * 2021-03-24 2021-04-23 中国科学院自动化研究所 Method and device for continuous learning in multitask data stream
CN113837296A (en) * 2021-09-28 2021-12-24 安徽大学 RGBT visual tracking method and system based on two-stage fusion structure search
CN113837296B (en) * 2021-09-28 2024-05-31 安徽大学 RGBT visual tracking method and system based on two-stage fusion structure search

Also Published As

Publication number Publication date
CN110728694B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN110728694B (en) Long-time visual target tracking method based on continuous learning
CN110070074B (en) Method for constructing pedestrian detection model
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN114972418B (en) Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection
CN110738247B (en) Fine-grained image classification method based on selective sparse sampling
CN110781262B (en) Semantic map construction method based on visual SLAM
CN108629288B (en) Gesture recognition model training method, gesture recognition method and system
CN108596203B (en) Optimization method of parallel pooling layer for pantograph carbon slide plate surface abrasion detection model
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN111476817A (en) Multi-target pedestrian detection tracking method based on yolov3
CN112560656A (en) Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN109871875B (en) Building change detection method based on deep learning
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN115995063A (en) Work vehicle detection and tracking method and system
CN111079847B (en) Remote sensing image automatic labeling method based on deep learning
CN105654139A (en) Real-time online multi-target tracking method adopting temporal dynamic appearance model
CN111160407A (en) Deep learning target detection method and system
CN112489081A (en) Visual target tracking method and device
CN110490915B (en) Point cloud registration method based on convolution-limited Boltzmann machine
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN111462173B (en) Visual tracking method based on twin network discrimination feature learning
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
CN104778699A (en) Adaptive object feature tracking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant