CN114581486A - Template updating target tracking algorithm based on full convolution twin network multilayer characteristics - Google Patents

Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Info

Publication number
CN114581486A
Authority
CN
China
Prior art keywords
template
frame
target
tracking
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210213267.3A
Other languages
Chinese (zh)
Inventor
鲁晓锋
李小鹏
王轩
王正洋
柏晓飞
李思训
姬文江
黑新宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210213267.3A priority Critical patent/CN114581486A/en
Publication of CN114581486A publication Critical patent/CN114581486A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/246 (Physics; Computing; Image data processing or generation, in general): Image analysis; Analysis of motion; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/045 (Physics; Computing; Computing arrangements based on specific computational models): Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/08 (Physics; Computing; Computing arrangements based on specific computational models): Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T 2207/10016 (Indexing scheme for image analysis or image enhancement): Image acquisition modality; Video; Image sequence
    • G06T 2207/20081 (Indexing scheme for image analysis or image enhancement): Special algorithmic details; Training; Learning
    • G06T 2207/20084 (Indexing scheme for image analysis or image enhancement): Special algorithmic details; Artificial neural networks [ANN]


Abstract

The invention discloses a template updating target tracking algorithm based on the multilayer characteristics of a full convolution twin network, which specifically comprises the following steps: constructing an overall network and training it; using the trained network to perform initial tracking setup on the video image sequence to be tracked, obtaining the initial target templates and the initial position information of the target; entering the normal tracking flow to obtain the tracking result response map of the current frame; judging whether the current tracking result is reliable by means of a template updating condition judgment method based on standard mutual information, updating the template if it is reliable and not updating it otherwise, and, once two reliable tracking results are retained, replacing the oldest result with the newest one; using the latest template to continue normal tracking of the video image sequence following the currently tracked video frame; and repeating steps 3 to 5 until the whole video image sequence has been tracked, which yields the position of the target in each frame of the video and ends the tracking task.

Description

Template updating target tracking algorithm based on full convolution twin network multilayer characteristics
Technical Field
The invention belongs to the technical field of target tracking of videos, and relates to a template updating target tracking algorithm based on full convolution twin network multilayer characteristics.
Background
Target tracking is an important subject in the field of computer vision, has extremely profound research significance, and is widely applied to the fields of intelligent video monitoring, unmanned driving, human-computer interaction and the like.
The single-target tracking task is the process of locating a target in the subsequent frames of a video image sequence with a target tracking algorithm, given the size and position information of the target in the first frame of the video. With the maturing of deep learning technology, researchers have begun to apply deep learning to target tracking, and deep-learning-based target tracking algorithms built on twin (Siamese) neural networks have gradually become a mainstream research direction; their achievements play an important role both in scientific research and in everyday applications.
In recent years, deep learning algorithms have developed rapidly, and the combination of deep learning with target tracking algorithms has attracted increasing attention. Among these, algorithms based on the twin neural network structure are a mainstream direction: a template is generated from the target image given in the first frame, a cross-correlation operation is performed with subsequent images, and the position of the maximum value in the resulting response map is mapped back to the position in the original image where the target is most likely to be located. The target template used by twin-network-based target tracking algorithms is usually kept unchanged, and many current template updating methods lack a good template updating judgment condition and easily pollute the template. On the other hand, these algorithms generally use only the highest-level features extracted by the twin network and do not exploit the features of each layer.
Disclosure of Invention
The invention aims to provide a template updating target tracking algorithm based on the multilayer characteristics of a full convolution twin network, which solves the prior-art problems of poor robustness to appearance deformation of the tracked object and of template pollution caused by template updating.
The technical scheme adopted by the invention is that a template updating target tracking algorithm based on the multilayer characteristics of the full convolution twin network is implemented according to the following steps:
step 1, constructing an integral network and carrying out end-to-end training on the integral network structure;
step 2, initializing tracking setting is carried out on the video image sequence to be tracked by using the network trained in the step 1, and initial target templates and initial position information of targets of the tracking task are obtained;
step 3, entering a normal tracking flow, calculating the position of a target in an image for each frame of the video image sequence, and displaying the position at the corresponding position in the image to obtain a tracking result response graph of the current frame;
step 4, after the tracking result response graph of the step 3 is obtained, judging whether the current tracking result is reliable or not by using a template updating condition judgment method based on standard mutual information, if so, updating the template, if not, not updating the template, and if the reliable tracking results reserved in the step 3 reach 2, replacing the oldest result with the newest result;
step 5, using the latest template obtained in step 4 to continue normal tracking of step 3 on the video image sequence subsequent to the currently tracked video frame;
and step 6, repeating steps 3 to 5 until the whole video image sequence has been tracked, so that the position of the target in each frame of the video is obtained and the tracking task is finished.
The present invention is also characterized in that,
in step 1, the whole network structure is divided into three parts: the first part is a twin neural network used for depth feature extraction, the second part is a 3D convolutional neural network used for template updating, namely a 3D template updating module, the first part and the second part form a feature extraction network, and the third part comprises classification branches and regression branches;
the twin neural network is divided into four layers: the first two layers are composed of a convolution layer, a maximum pooling layer and an activation function layer; the last two layers each comprise a convolution layer and an activation function layer; the 3D template updating module is composed of a layer of 3D convolution layer.
In step 1, 10 picture pairs are selected from each video, each picture pair comprising four video frames: the first video frame is the first frame of the video, and the following 3 video frames are randomly selected from the video such that the distance between the second and third video frames is no more than 15 frames and the distance between the third and fourth video frames is no more than 10 frames; the first three video frames serve as target images for synthesizing the tracking template, and the last video frame serves as the search image; when the search image is processed, the three images fed into the 3D convolution updating module are identical, namely the last video frame of the picture pair; training is carried out 50 times, and the loss function adopts the same Logistic loss function as the SiamFC algorithm.
Generating a picture pair in the step 1, and performing data enhancement on the selected picture, wherein the data enhancement is specifically implemented according to the following steps:
step 1.1, a random stretch (RandomStretch) operation is first applied to the samples selected from the training set: the stretched size factor is set to 0.095-1.005, and the parts that need to be filled after enlargement are filled using a linear interpolation method; then a centre crop (CenterCrop) operation is carried out, in which a region of size 263 × 263 is cropped from the centre of the training picture pair, followed by a random crop (RandomCrop) operation, in which a region of size 255 × 255 is cropped at a random position in the training picture pair; finally, a coordinate conversion (Crop transform) is carried out: the BOX of the original GOT-10K data set picture, taken as the target position frame, is given in the form (left, top, width, height), namely the distances of the target position frame from the left and top borders of the picture and the width and height of the target position frame, and through the coordinate conversion operation the coordinate form of the target position frame is converted into (n, m, h, w), namely the centre-point coordinates of the target position frame and the height and width of the target position frame;
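As a non-limiting illustration of step 1.1, the BOX conversion and the cropping operations can be sketched in Python as follows; the helper names and the default stretch range are assumptions of this sketch, not values prescribed by the invention:

import numpy as np
from PIL import Image

def box_ltwh_to_center(box):
    # (left, top, width, height) -> (n, m, h, w): centre coordinates plus height and width
    left, top, w, h = box
    return left + w / 2.0, top + h / 2.0, h, w

def random_stretch(img, lo=0.95, hi=1.05):
    # randomly rescale the sample; the range here is illustrative (see step 1.1 for the stated range)
    scale = np.random.uniform(lo, hi)
    w, h = img.size
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR), scale

def center_crop(img, size=263):
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

def random_crop(img, size=255):
    w, h = img.size
    left = np.random.randint(0, max(w - size, 0) + 1)
    top = np.random.randint(0, max(h - size, 0) + 1)
    return img.crop((left, top, left + size, top + size))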
step 1.2, calculation of LOSS
The loss function of the classification branch in the training process uses focal loss, the loss function of the regression branch uses IoU loss, and the calculation formula of the total loss L is as follows:
L = Σ_(x,y) [ L_cls(p̂_(x,y), p_(x,y)) + 1{p_(x,y) > 0} · ( L_quality(q̂_(x,y), q_(x,y)) + λ · L_reg(t̂_(x,y), t_(x,y)) ) ]   (1)
in the formula (1), 1{·} is the indicator function, which takes the value 1 if the condition in the subscript is satisfied and 0 otherwise; L_cls represents the focal loss of the classification result; L_quality represents the binary cross-entropy loss used for quality assessment; L_reg represents the IoU loss of the bounding-box regression result; p_(x,y), q_(x,y) and t_(x,y) respectively represent the label of the classification branch, the label of the quality evaluation and the label of the regression branch; p̂_(x,y), q̂_(x,y) and t̂_(x,y) respectively represent the classification branch prediction result, the quality evaluation result and the regression branch prediction result; λ is a constant;
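A minimal sketch of how the total loss of equation (1) could be assembled is given below, assuming PyTorch and a sigmoid-based focal loss; the focal-loss hyper-parameters are assumptions of this sketch:

import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    # element-wise focal loss on the classification map (alpha/gamma values are assumptions)
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, labels, reduction='none')
    p_t = prob * labels + (1 - prob) * (1 - labels)
    a_t = alpha * labels + (1 - alpha) * (1 - labels)
    return a_t * (1 - p_t) ** gamma * ce

def iou_loss(pred, target, eps=1e-6):
    # pred/target are (l*, t*, r*, b*) offsets of the same point; -log IoU of the induced boxes
    pl, pt, pr, pb = pred.unbind(dim=-1)
    tl, tt, tr, tb = target.unbind(dim=-1)
    inter = (torch.min(pl, tl) + torch.min(pr, tr)) * (torch.min(pt, tt) + torch.min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return -torch.log((inter + eps) / (union + eps))

def total_loss(p_hat, q_hat, t_hat, p, q, t, lam=1.0):
    # equation (1): focal loss over all points, quality BCE and IoU loss only on positive points
    pos = p > 0
    l_cls = focal_loss(p_hat, p).sum()
    l_quality = F.binary_cross_entropy_with_logits(q_hat[pos], q[pos], reduction='sum')
    l_reg = iou_loss(t_hat[pos], t[pos]).sum()
    return l_cls + l_quality + lam * l_reg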
step 1.3, performing parameter optimization by using a gradient descent method, wherein a calculation formula of a random gradient descent method SGD is as follows:
arg min_θ E_(z,x,y) L(y, f(z, x; θ))   (2)
in the formula (2), θ is the optimal parameter to be obtained; z is the input target picture; x is the search image; y is the label; f(z, x; θ) is the prediction result;
and after 50 times of training, the final total loss L of the network is stabilized below 0.1, and the training process is finished.
The step 2 is implemented according to the following steps:
step 2.1, the position of the target is designated on the first frame image of the video image sequence; the target is cropped from the image and scaled to obtain a target picture with the size of 127 × 127 × 3, which is then fed into the twin neural network of the overall network to obtain four levels of features; the last-level features are fed into the regression branch as high-level features and serve as the regression branch initial template, and the first-level features are fed into the classification branch as low-level features and serve as the classification branch initial template; the sizes of the regression branch initial template and the classification branch initial template are both 6 × 6 × 256 (in pixels), and the calculation formulas of the regression branch initial template and the classification branch initial template are as follows:
φ_z(cls), φ_z(reg) = φ(z)   (3)
in equation (3), z is the input target picture, the function φ(·) represents the feature extraction network, φ_z(cls) represents the target template of the classification branch output by the feature extraction network, and φ_z(reg) represents the target template of the regression branch output by the feature extraction network;
step 2.2, initializing parameters:
in the first frame of the video image sequence, the target position information given by manual annotation is called the BOX; the BOX contains four pieces of information, namely the abscissa, the ordinate, the width and the height of the target, so the first frame does not need to be tracked: the initial centre coordinates and the initial width and height of the target are simply set to the values in the manually annotated BOX, which completes the initialization of the target and yields the initial position information of the target.
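The initialization of step 2 can be sketched as follows; the OpenCV-based cropping, the to_tensor helper and the backbone argument (standing for the trained twin network) are illustrative assumptions:

import cv2
import torch

def to_tensor(img):
    # HWC uint8 image -> normalised NCHW float tensor
    return torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0

def initialize(first_frame, box, backbone):
    x, y, w, h = box                                   # manually annotated BOX of the first frame
    target = first_frame[int(y):int(y + h), int(x):int(x + w)]
    target = cv2.resize(target, (127, 127))            # 127 x 127 x 3 target picture
    p2, p3, p4, p5 = backbone(to_tensor(target))       # four feature levels of the twin network
    cls_template, reg_template = p2, p5                # low-level -> classification, high-level -> regression
    state = {"center": (x + w / 2.0, y + h / 2.0), "size": (w, h)}
    return cls_template, reg_template, state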
Step 3 is implemented specifically according to the following steps:
step 3.1, object search
An anchor-free target search strategy is adopted: taking the target coordinates in the previous frame's tracking result of the video image sequence as the centre, a search area is intercepted and cropped into a patch picture, giving a search image of size 255 × 255; the patch picture is fed into the feature extraction network to extract the multilayer depth features of the search area, and the formula is as follows:
φ_x(cls), φ_x(reg) = φ(x)   (4)
in the formula (4), x is the search image; the function φ(·) represents the feature extraction network, φ_x(cls) represents the search features of the classification branch output by the feature extraction network, and φ_x(reg) represents the search features of the regression branch output by the feature extraction network;
step 3.2, target position prediction based on the classification branch and the regression branch;
step 3.2.1, calculating a regression branch result:
for the regression branch, the target template φ_z(reg) and the search feature φ_x(reg) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
g(z, x) = φ_z(reg) ⋆ φ_x(reg) + b   (5)
in the formula (5), ⋆ denotes the cross-correlation operation and b represents an offset;
if a point (m, n) on the feature map g(z, x) corresponds to the point (s/2 + m·s, s/2 + n·s) on the original image, the regression branch will output at this point (m, n) the predicted value of the ground-truth position GT, expressed as a 4-dimensional vector t = (l*, t*, r*, b*); the calculation corresponding to each GT component is:
l* = (s/2 + m·s) - x0,   t* = (s/2 + n·s) - y0
r* = x1 - (s/2 + m·s),   b* = y1 - (s/2 + n·s)   (6)
in the formula (6), (x0, y0) and (x1, y1) respectively represent the corner points of the upper left corner and the lower right corner of the Ground Truth (GT); s is the stride of AlexNet, and s = 8; l*, t*, r*, b* respectively represent the distances from the position on the original image corresponding to the point (m, n) to the left, upper, right and lower borders of the GT;
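A small sketch of the coordinate mapping and the (l*, t*, r*, b*) targets of equation (6), under the reconstruction given above:

def map_to_image(m, n, stride=8):
    # image position corresponding to feature-map point (m, n)
    return stride // 2 + m * stride, stride // 2 + n * stride

def regression_targets(m, n, gt, stride=8):
    # gt = (x0, y0, x1, y1): top-left and bottom-right corners of the ground-truth box
    px, py = map_to_image(m, n, stride)
    x0, y0, x1, y1 = gt
    l, t = px - x0, py - y0
    r, b = x1 - px, y1 - py
    return l, t, r, b      # negative values indicate the point lies outside the ground-truth box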
step 3.2.2, calculate the Classification Branch result
For the classification branch, the target template φ_z(cls) and the search feature φ_x(cls) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
f(z, x) = φ_z(cls) ⋆ φ_x(cls) + b   (7)
the points (m, n) on the obtained feature map f(z, x) are divided into positive sample points and negative sample points according to the ground truth of the search image: if the position on the patch picture corresponding to a point (m, n) on the feature map f(z, x), namely (s/2 + m·s, s/2 + n·s), falls within the ground truth, it is regarded as a positive sample and its classification score is recorded as 1; the rest are negative samples and their classification scores are recorded as 0;
to better balance the relationship between a point (m, n) and the target position, a quality score PSS* is introduced; the predicted PSS* is multiplied by the corresponding classification score to compute the final score, which is taken as the result of the classification branch, and the quality score calculation formula is as follows:
PSS* = sqrt( ( min(l*, r*) / max(l*, r*) ) × ( min(t*, b*) / max(t*, b*) ) )   (8)
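A sketch of the quality score of equation (8) and of how it weights the classification score in step 3.2.2; the small eps term is an assumption added to avoid division by zero:

import math

def pss(l, t, r, b, eps=1e-6):
    # PSS*: points near the box centre score close to 1, points near a border score close to 0
    return math.sqrt((min(l, r) / (max(l, r) + eps)) * (min(t, b) / (max(t, b) + eps)))

def final_classification_score(cls_score, l, t, r, b):
    # predicted quality score multiplied by the classification score gives the branch output
    return pss(l, t, r, b) * cls_score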
step 3.2.3, adding the classification branch result and the regression branch result to obtain the tracking result response graph of the current frame.
Step 4 is specifically implemented according to the following steps:
step 4.1, template updating condition judgment based on mutual information
In the tracking process, a first Frame of a video is used as a Template Frame, and simultaneously the first Frame is used as a Detection Frame and is input into a network to obtain a heat map of a classification branch and is marked as X, a heat map of a classification branch of a t-th Frame and is marked as Y, and then the X and the Y are used as two variables to calculate mutual information values of the X and the Y;
the mutual information calculation formula is as follows:
I(X; Y) = Σ_(x∈X) Σ_(y∈Y) p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )   (9)
in equation (9), X and Y represent the classification branch heat map of the first frame and the classification branch heat map of the t-th frame, respectively, p(x) and p(y) are the marginal distributions of X and Y, respectively, and p(x, y) is the joint distribution of X and Y;
and carrying out standardized conversion on the obtained mutual information value, wherein the formula is as follows:
NMI(X, Y) = 2 · I(X; Y) / ( H(X) + H(Y) )   (10)
in the formula (10), H (X), H (Y) are the entropies of X and Y respectively;
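A sketch of the standard mutual information of equation (9) and its normalised form of equation (10), computed between two classification heat maps via a joint histogram; the histogram binning is an assumption of this sketch:

import numpy as np

def normalized_mutual_information(x_map, y_map, bins=32):
    joint, _, _ = np.histogram2d(x_map.ravel(), y_map.ravel(), bins=bins)
    pxy = joint / joint.sum()                          # joint distribution p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)          # marginal distributions p(x), p(y)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))     # equation (9)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return 2.0 * mi / (hx + hy)                        # equation (10)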
if the obtained mutual information is larger than the set threshold value V_threshold, the target area image of the current frame can be used to update the template; otherwise the template updating mechanism is not entered, and after the result of the current frame is obtained, the template updating judgment of the next frame is started directly;
the threshold adopts a dynamic threshold, the dynamic threshold is set as a local maximum, and the dynamic updating formula of the threshold is as follows:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)), I′(t) = 0 and I″(t) < 0   (11)
in the formula (11), t represents the t-th frame, and I(t) represents the mutual information value between the classification branch heat map of the t-th frame and the classification branch heat map of the first frame; mean(I(t-1), I(t-2)) represents the average of the mutual information over the recent period, and I(t) exceeding it reflects that the matching degree of the t-th frame is better; I′(t) = 0 together with I″(t) < 0 denotes a mutual information local maximum point; since the mutual information values between the classification branch heat map of each search image and the classification branch heat map of the first frame form a discrete sequence, equation (11) can be expressed as:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)); otherwise V_threshold keeps its previous value   (12)
since the mutual information values of 3 consecutive frames of search images are required, but the 1st-frame and 2nd-frame search images do not meet the conditions required by the formula during searching, the thresholds of the 1st and 2nd frames are set separately: the V_threshold of the 1st-frame and 2nd-frame search images is set to a fixed value of 0.75;
step 4.2, updating the template based on 3D convolution:
the template updating follows a queue (first-in, first-out) discipline: when a new template enters, the oldest template is eliminated, so the number of templates is always three; the three templates are respectively denoted the initial target template, the historical template and the current template, and the feature maps obtained by passing the three templates through the feature extraction network are fused by a 3 × 3 convolution to obtain the fused, latest template.
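A minimal sketch of the first-in-first-out template store of step 4.2, assuming (as step 4 describes) that the first-frame template is kept fixed while the two retained reliable results rotate; TemplateUpdater3D refers to the illustrative module sketched earlier:

from collections import deque
import torch

class TemplateQueue:
    def __init__(self, init_feat, updater):
        self.init_feat = init_feat                             # initial target template (first frame)
        self.recent = deque([init_feat, init_feat], maxlen=2)  # historical and current template slots
        self.updater = updater                                 # 3D-convolution fusion module

    def push(self, new_feat):
        # newest reliable result in, oldest out (deque with maxlen=2 drops the left element)
        self.recent.append(new_feat)

    def fused_template(self):
        hist, curr = self.recent
        with torch.no_grad():
            return self.updater(self.init_feat, hist, curr)    # fused, latest template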
The specific process of step 5 is as follows: after the latest template is obtained, it is used continuously until the next template update; the specific tracking flow is the same as in step 3; during tracking, the depth features of reliable tracking results are continuously stored, and once a new depth feature is obtained, the stored depth feature that has existed longest is deleted and the template is updated, operating according to step 4.
The beneficial effects of the invention are as follows:
(1) the template updating target tracking algorithm based on the full convolution twin network multilayer characteristics uses SiamFC+FPN as the backbone to obtain features of different levels, and the classification branch and the regression branch use features of different levels to finally predict the target position, which exploits the characteristics of the different-level features extracted by the neural network and greatly improves the performance and robustness of the classification network and the regression network;
(2) the template updating target tracking algorithm based on the multilayer characteristics of the full convolution twin network filters most harmful template updates by using a template updating condition judgment method based on mutual information, and effectively solves the problem of template pollution caused by the template update;
(3) the template updating target tracking algorithm based on the full convolution twin network multilayer characteristics uses a 3D convolution updating module to fuse the two latest and most reliable tracking results retained in the history with the target information manually annotated at the start of the tracking task into an updated template, so that the new template captures the recent appearance information of the target while retaining the most accurate target appearance information from the first frame; this improves the robustness of the template to target appearance deformation, improves the performance and tracking speed of the target tracking algorithm, and also improves the accuracy.
Drawings
FIG. 1 is a schematic diagram of an overall framework of a method for updating a target tracking method based on a template with multilayer characteristics of a full convolution twin network according to the present invention;
FIG. 2 is a schematic diagram of network training of the template update target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention;
FIG. 3 is a schematic diagram of the SiamFC+FPN network model of the template update target tracking method based on the multi-layer characteristics of the fully-convolutional twin network;
FIG. 4 is a schematic diagram of a tracking initialization stage of the template update target tracking method based on the multilayer characteristics of the full convolution twin network according to the present invention;
FIG. 5 is a schematic diagram illustrating a standard mutual information template updating condition judgment of the template updating target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention;
FIG. 6 is a schematic diagram of template update of the template update target tracking method based on the multi-layer characteristics of the fully-convolutional twin network according to the present invention;
FIG. 7 is a graph of tracking accuracy of the template update target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention;
FIG. 8 is a graph of the tracking success rate of the template update target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a template updating target tracking algorithm based on a full convolution twin network multilayer characteristic, which is specifically implemented according to the following steps as shown in figure 1:
step 1, constructing an integral network and carrying out end-to-end training on the integral network structure;
the whole network structure is divided into three parts: the first part is a twin neural network used for depth feature extraction, the second part is a 3D convolutional neural network used for template updating, namely a 3D template updating module, the first part and the second part form a feature extraction network, and the third part comprises classification branches and regression branches;
the twin neural network is divided into four layers (P2, P3, P4, P5): the first two layers are composed of a convolution layer, a maximum pooling layer and an activation function layer; the last two layers each comprise a convolution layer and an activation function layer; the 3D template updating module consists of a layer of 3D convolution layer; the twin neural network extracts the features of the three pictures, and then the three pictures are combined into one picture through the 3D template updating module, namely the tracking template; the classification branch and the regression branch are used to predict the outcome.
10 picture pairs are selected from each video, each picture pair comprising four video frames: the first video frame is the first frame of the video, and the following 3 video frames are randomly selected from the video such that the distance between the second and third video frames is no more than 15 frames and the distance between the third and fourth video frames is no more than 10 frames; the first three video frames serve as target images for synthesizing the tracking template, and the last video frame serves as the search image; when the search image is processed, the three images fed into the 3D convolution updating module are identical, namely the last video frame of the picture pair; training is carried out 50 times, and the loss function adopts the same Logistic loss function as the SiamFC algorithm, as shown in FIG. 2;
generating a picture pair, and performing data enhancement on the selected picture, wherein the data enhancement is specifically implemented according to the following steps:
step 1.1, a random stretch (RandomStretch) operation is first applied to the samples selected from the training set (GOT-10K data set): the stretched size factor is set to 0.095-1.005, and the parts that need to be filled after enlargement are filled using a linear interpolation method; then a centre crop (CenterCrop) operation is carried out, in which a region of size 263 × 263 is cropped from the centre of the training picture pair, followed by a random crop (RandomCrop) operation, in which a region of size 255 × 255 is cropped at a random position in the training picture pair; finally, a coordinate conversion (Crop transform) is carried out: the BOX of the original GOT-10K data set picture, taken as the target position frame, is given in the form (left, top, width, height), namely the distances of the target position frame from the left and top borders of the picture and the width and height of the target position frame, and through the coordinate conversion operation the coordinate form of the target position frame is converted into (n, m, h, w), namely the centre-point coordinates of the target position frame and the height and width of the target position frame;
step 1.2, calculation of LOSS
The loss function of the classification branch in the training process uses focal loss, the loss function of the regression branch uses IoU loss, and the calculation formula of the total loss L is as follows:
L = Σ_(x,y) [ L_cls(p̂_(x,y), p_(x,y)) + 1{p_(x,y) > 0} · ( L_quality(q̂_(x,y), q_(x,y)) + λ · L_reg(t̂_(x,y), t_(x,y)) ) ]   (1)
in the formula (1), 1{·} is the indicator function, which takes the value 1 if the condition in the subscript is satisfied and 0 otherwise; L_cls represents the focal loss of the classification result; L_quality represents the binary cross-entropy loss used for quality assessment; L_reg represents the IoU loss of the bounding-box regression result; p_(x,y), q_(x,y) and t_(x,y) respectively represent the label of the classification branch, the label of the quality evaluation and the label of the regression branch; p̂_(x,y), q̂_(x,y) and t̂_(x,y) respectively represent the classification branch prediction result, the quality evaluation result and the regression branch prediction result; λ is a constant;
step 1.3, performing parameter optimization by using a gradient descent method, wherein a calculation formula of a random gradient descent method SGD is as follows:
arg min_θ E_(z,x,y) L(y, f(z, x; θ))   (2)
in the formula (2), θ is the optimal parameter to be obtained; z is the input target picture; x is the search image; y is the label; f(z, x; θ) is the prediction result;
after 50 times of training, the final total loss L of the network is stabilized below 0.1, and the training process is finished;
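As a non-limiting illustration of this training procedure, a plain SGD loop for the optimization of equation (2) could look as follows; the model signature, learning rate and momentum are assumptions, and total_loss refers to the loss sketch given for equation (1):

import torch

def train(model, loader, epochs=50, lr=1e-2, momentum=0.9):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for epoch in range(epochs):                        # 50 training rounds
        for z1, z2, z3, x, labels in loader:           # three template frames + one search frame per pair
            p_hat, q_hat, t_hat = model(z1, z2, z3, x)
            loss = total_loss(p_hat, q_hat, t_hat, *labels)   # equation (1)
            opt.zero_grad()
            loss.backward()
            opt.step()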
step 2, initializing tracking setting is carried out on the video image sequence to be tracked by using the network trained in the step 1, and initial target templates and initial position information of targets of the tracking task are obtained;
step 2.1, the position of the target is designated on the first frame image of the video image sequence; the target is cropped from the image and scaled to obtain a target picture with the size of 127 × 127 × 3, as shown in FIG. 3, which is then fed into the twin neural network of the overall network to obtain four levels of features; the last-level (P5) features are fed into the regression branch as high-level features and serve as the regression branch initial template, and the first-level (P2) features are fed into the classification branch as low-level features and serve as the classification branch initial template; the sizes of the regression branch initial template and the classification branch initial template are both 6 × 6 × 256 (in pixels), and the calculation formulas are as follows:
φ_z(cls), φ_z(reg) = φ(z)   (3)
in equation (3), z is the input target picture, the function φ(·) represents the feature extraction network, φ_z(cls) represents the target template of the classification branch output by the feature extraction network, and φ_z(reg) represents the target template of the regression branch output by the feature extraction network;
step 2.2, initializing parameters:
as shown in FIG. 4, in the first frame of the video image sequence, the target position information given by manual annotation is called the BOX; the BOX contains four pieces of information, namely the abscissa, the ordinate, the width and the height of the target, so the first frame does not need to be tracked: the initial centre coordinates and the initial width and height of the target are simply set to the values in the manually annotated BOX, which completes the initialization of the target and yields the initial position information of the target;
step 3, entering a normal tracking flow, calculating the position of a target in an image by each frame of the video image sequence, and displaying the position at the corresponding position in the image;
step 3.1, object search
An anchor-free target search strategy is adopted: taking the target coordinates in the previous frame's tracking result of the video image sequence as the centre, a search area is intercepted and cropped into a patch picture, giving a search image of size 255 × 255; the patch picture is fed into the feature extraction network to extract the multilayer depth features of the search area, and the formula is as follows:
φx(cls),φx(reg)=φ(x) (4)
in the formula (4), x is a search graph; the function phi () represents the feature extraction network, phix(cls)Search features, phi, representing classification branches of the feature extraction network outputx(reg)Search features representing regression branches output by the feature extraction network;
step 3.2, target location prediction based on classification and regression branches
Step 3.2.1, calculating a regression branch result:
for the regression branch, the target template φ_z(reg) and the search feature φ_x(reg) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
g(z, x) = φ_z(reg) ⋆ φ_x(reg) + b   (5)
in the formula (5), ⋆ denotes the cross-correlation operation and b represents an offset;
if a point (m, n) on the feature map g(z, x) corresponds to the point (s/2 + m·s, s/2 + n·s) on the original image, the regression branch will output at this point (m, n) the predicted value of the ground-truth position GT, expressed as a 4-dimensional vector t = (l*, t*, r*, b*); the calculation corresponding to each GT component is:
l* = (s/2 + m·s) - x0,   t* = (s/2 + n·s) - y0
r* = x1 - (s/2 + m·s),   b* = y1 - (s/2 + n·s)   (6)
in the formula (6), (x0, y0) and (x1, y1) respectively represent the corner points of the upper left corner and the lower right corner of the Ground Truth (GT); s is the stride of AlexNet, and s = 8; l*, t*, r*, b* respectively represent the distances from the position on the original image corresponding to the point (m, n) to the left, upper, right and lower borders of the GT;
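The feature-space mapping of equation (5) (and likewise equation (7) below) amounts to a cross-correlation in which the template feature acts as the kernel; a minimal sketch, assuming a single-channel output and batch size 1, is:

import torch.nn.functional as F

def cross_correlation(template_feat, search_feat, bias=0.0):
    # template_feat: (1, C, 6, 6) used as the correlation kernel; search_feat: (1, C, H, W)
    # returns a (1, 1, H-5, W-5) response map corresponding to g(z, x) or f(z, x)
    return F.conv2d(search_feat, template_feat) + bias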
step 3.2.2, calculate the Classification Branch result
For the classification branch, the target template φ_z(cls) and the search feature φ_x(cls) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
f(z, x) = φ_z(cls) ⋆ φ_x(cls) + b   (7)
the points (m, n) on the obtained feature map f(z, x) are divided into positive sample points and negative sample points according to the ground truth of the search image: if the position on the patch picture corresponding to a point (m, n) on the feature map f(z, x), namely (s/2 + m·s, s/2 + n·s), falls within the ground truth, it is regarded as a positive sample and its classification score is recorded as 1; the rest are negative samples and their classification scores are recorded as 0;
to better balance the relationship between a point (m, n) and the target position, a quality score PSS* is introduced; the predicted PSS* is multiplied by the corresponding classification score to compute the final score, which is taken as the result of the classification branch, and the quality score calculation formula is as follows:
PSS* = sqrt( ( min(l*, r*) / max(l*, r*) ) × ( min(t*, b*) / max(t*, b*) ) )   (8)
step 3.2.3, adding the result of the classification branch and the result of the regression branch to obtain a tracking result response graph of the current frame;
step 4, after the tracking result response graph of the step 3 is obtained, judging whether the current tracking result is reliable or not by using a template updating condition judgment method based on standard mutual information, if so, updating the template, if not, not updating the template, and if the reliable tracking results reserved in the step 3 reach 2, replacing the oldest result with the newest result;
step 4.1, template updating condition judgment based on mutual information
As shown in fig. 5, in the tracking process, a first Frame of the video is used as a Template Frame, and simultaneously, the first Frame is used as a Detection Frame and is input into the network, so that a heat map of a classification branch is obtained and recorded as X, a heat map of a classification branch of a t-th Frame is recorded as Y, and then X and Y are used as two variables to calculate mutual information values of the two variables;
the mutual information calculation formula is as follows:
I(X; Y) = Σ_(x∈X) Σ_(y∈Y) p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )   (9)
in equation (9), X and Y represent the classification branch heat map of the first frame and the classification branch heat map of the t-th frame, respectively, p(x) and p(y) are the marginal distributions of X and Y, respectively, and p(x, y) is the joint distribution of X and Y;
and carrying out standardized conversion on the obtained mutual information value, wherein the formula is as follows:
NMI(X, Y) = 2 · I(X; Y) / ( H(X) + H(Y) )   (10)
in the formula (10), H (X), H (Y) are the entropies of X and Y, respectively;
if the obtained mutual information is larger than the threshold value V_threshold set herein, the target area image of the current frame can be used to update the template; otherwise the template updating mechanism is not entered, and after the result of the current frame is obtained, the template updating judgment of the next frame is started directly;
in order to make the mutual information judgment more accurate, a dynamic threshold is used here; since a larger mutual information value indicates a better match, the dynamic threshold is set at a local maximum of the mutual information, and the threshold dynamic update formula is as follows:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)), I′(t) = 0 and I″(t) < 0   (11)
in the formula (11), t represents the t-th frame, and I(t) represents the mutual information value between the classification branch heat map of the t-th frame and the classification branch heat map of the first frame; mean(I(t-1), I(t-2)) represents the average of the mutual information over the recent period, and I(t) exceeding it reflects that the matching degree of the t-th frame is better; I′(t) = 0 together with I″(t) < 0 denotes a mutual information local maximum point; since the mutual information values between the classification branch heat map of each search image and the classification branch heat map of the first frame form a discrete sequence, equation (11) can be expressed as:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)); otherwise V_threshold keeps its previous value   (12)
because the mutual information values of 3 consecutive frames of search images are required here, but the 1st-frame and 2nd-frame search images do not meet the conditions required by the formula during searching, the thresholds of the 1st and 2nd frames are set separately: the target area obtained from the 1st-frame search image generally differs little from the template image of the first video frame and can be used for a direct update, while in a few videos the target in the 2nd-frame search image may already be occluded, so the V_threshold of the 1st-frame and 2nd-frame search images is set to a fixed value of 0.75;
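A sketch of the threshold handling of step 4.1, under the reading of equation (12) adopted above (an interpretation of this sketch, not the only possible one):

def update_threshold(mi_values, v_threshold):
    # mi_values holds I(1)..I(t); the 1st and 2nd search frames use the fixed threshold 0.75
    t = len(mi_values)
    if t <= 2:
        return 0.75
    i_t, i_t1, i_t2 = mi_values[-1], mi_values[-2], mi_values[-3]
    if i_t > (i_t1 + i_t2) / 2.0:      # current mutual information above the recent average
        return i_t                     # threshold moves to the new (local-maximum) value
    return v_threshold                 # otherwise keep the previous threshold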
step 4.2, updating the template based on 3D convolution:
as shown in FIG. 6, the template updating follows a queue (first-in, first-out) discipline: when a new template enters, the oldest template is eliminated, so the number of templates is always three; the three templates are respectively denoted the initial target template, the historical template and the current template, and the feature maps obtained by passing the three templates through the feature extraction network are fused by a 3 × 3 convolution to obtain the fused, latest template;
step 5, using the fused latest template obtained in the step 4.2 to continue normal tracking of the step 3 on the video image sequence subsequent to the currently tracked video frame;
the step 5 is as follows:
after the latest template is obtained, it is used continuously until the next template update; the specific tracking flow is the same as in step 3; during tracking, the depth features of reliable tracking results are continuously stored, and once a new depth feature is obtained, the stored depth feature that has existed longest is deleted and the template is updated, operating according to step 4.
Step 6, repeating steps 3 to 5 until the whole video image sequence has been tracked, so that the position of the target in each frame of the video is obtained and the tracking task is finished.
The step 6 is as follows:
for a tracking task, after initialization is completed the whole process follows steps 3 to 5, alternating continuously between template updating and tracking calculation; in this process the target position in each frame of the video sequence is calculated and represented by a BOX, and over the whole video this yields the motion trajectory of the target; when the target positions of all pictures of the whole video image sequence have been obtained, the tracking task is finished. The accuracy and success rate of the method on the test set are shown in FIG. 7 and FIG. 8.
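Pulling steps 2 to 6 together, the overall flow can be sketched as below; model.first_frame_heatmap, model.track_step and locate_target are hypothetical helpers standing in for the branch computations of step 3, and the remaining names reuse the earlier illustrative sketches:

def track(frames, init_box, model):
    cls_t, reg_t, state = initialize(frames[0], init_box, model.backbone)        # step 2
    queue = TemplateQueue(reg_t, model.updater)
    ref_heatmap = model.first_frame_heatmap(frames[0], cls_t)   # first frame as template and detection frame
    mi_values, v_threshold, boxes = [], 0.75, [init_box]
    for frame in frames[1:]:
        response, heatmap, feat = model.track_step(frame, cls_t, queue.fused_template(), state)   # step 3
        box, state = locate_target(response, state)             # peak of the response map -> new BOX
        boxes.append(box)
        mi = normalized_mutual_information(ref_heatmap, heatmap)   # step 4.1 reliability check
        mi_values.append(mi)
        if mi > v_threshold:
            queue.push(feat)                                     # reliable result: newest replaces oldest
        v_threshold = update_threshold(mi_values, v_threshold)
    return boxes                                                 # target position in every frame (step 6)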
The invention innovatively uses SiamFC+FPN as the backbone to obtain features of different levels, and the classification branch and the regression branch use features of different levels to finally predict the target position, which exploits the characteristics of the different-level features extracted by the neural network and greatly improves the performance and robustness of the classification network and the regression network. A template updating condition judgment method based on mutual information is then used to filter out most harmful template updates, effectively solving the problem of template pollution caused by template updating. Finally, a 3D convolution updating module is used to fuse the two latest and most reliable tracking results retained in the history with the target information manually annotated at the start of the tracking task into an updated template, so that the new template captures the recent appearance information of the target while retaining the most accurate target appearance information from the first frame, which improves the robustness of the template to target appearance deformation.

Claims (8)

1. The template updating target tracking algorithm based on the multilayer characteristics of the full convolution twin network is characterized by being implemented according to the following steps:
step 1, constructing an integral network and carrying out end-to-end training on the integral network structure;
step 2, initializing tracking setting is carried out on the video image sequence to be tracked by using the network trained in the step 1, and initial target templates and initial position information of targets of the tracking task are obtained;
step 3, entering a normal tracking flow, calculating the position of a target in an image for each frame of the video image sequence, and displaying the position at the corresponding position in the image to obtain a tracking result response graph of the current frame;
step 4, after the tracking result response graph of the step 3 is obtained, judging whether the current tracking result is reliable or not by using a template updating condition judgment method based on standard mutual information, if so, updating the template, if not, not updating the template, and if the reliable tracking results reserved in the step 3 reach 2, replacing the oldest result with the newest result;
step 5, using the latest template obtained in step 4 to continue normal tracking of step 3 on the video image sequence subsequent to the currently tracked video frame;
and step 6, repeating steps 3 to 5 until the whole video image sequence has been tracked, so that the position of the target in each frame of the video is obtained, and the tracking task is finished.
2. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein in step 1, the whole network structure is divided into three parts: the first part is a twin neural network used for depth feature extraction, the second part is a 3D convolutional neural network used for template updating, namely a 3D template updating module, the first part and the second part form a feature extraction network, and the third part comprises classification branches and regression branches;
the twin neural network is divided into four layers: the first two layers are composed of a convolution layer, a maximum pooling layer and an activation function layer; the last two layers each comprise a convolution layer and an activation function layer; the 3D template updating module is composed of a layer of 3D convolution layer.
3. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristics as claimed in claim 2, wherein in step 1, 10 picture pairs are selected from each video, each picture pair comprising four video frames: the first video frame is the first frame of the video, and the following 3 video frames are randomly selected from the video such that the distance between the second and third video frames is no more than 15 frames and the distance between the third and fourth video frames is no more than 10 frames; the first three video frames serve as target images for synthesizing the tracking template, and the last video frame serves as the search image; when the search image is processed, the three images fed into the 3D convolution updating module are identical, namely the last video frame of the picture pair; training is carried out 50 times, and the loss function adopts the same Logistic loss function as the SiamFC algorithm.
4. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristics as claimed in claim 3, wherein the picture pair is generated in step 1, data enhancement needs to be performed on the selected picture, and the data enhancement is specifically performed according to the following steps:
step 1.1, a random stretch (RandomStretch) operation is first applied to the samples selected from the training set: the stretched size factor is set to 0.095-1.005, and the parts that need to be filled after enlargement are filled using a linear interpolation method; then a centre crop (CenterCrop) operation is carried out, in which a region of size 263 × 263 is cropped from the centre of the training picture pair, followed by a random crop (RandomCrop) operation, in which a region of size 255 × 255 is cropped at a random position in the training picture pair; finally, a coordinate conversion (Crop transform) is carried out: the BOX of the original GOT-10K data set picture, taken as the target position frame, is given in the form (left, top, width, height), namely the distances of the target position frame from the left and top borders of the picture and the width and height of the target position frame, and through the coordinate conversion operation the coordinate form of the target position frame is converted into (n, m, h, w), namely the centre-point coordinates of the target position frame and the height and width of the target position frame;
step 1.2, calculation of LOSS
The loss function of the classification branch in the training process uses focal loss, the loss function of the regression branch uses IoU loss, and the calculation formula of the total loss L is as follows:
L = Σ_(x,y) [ L_cls(p̂_(x,y), p_(x,y)) + 1{p_(x,y) > 0} · ( L_quality(q̂_(x,y), q_(x,y)) + λ · L_reg(t̂_(x,y), t_(x,y)) ) ]   (1)
in the formula (1), 1{·} is the indicator function, which takes the value 1 if the condition in the subscript is satisfied and 0 otherwise; L_cls represents the focal loss of the classification result; L_quality represents the binary cross-entropy loss used for quality assessment; L_reg represents the IoU loss of the bounding-box regression result; p_(x,y), q_(x,y) and t_(x,y) respectively represent the label of the classification branch, the label of the quality evaluation and the label of the regression branch; p̂_(x,y), q̂_(x,y) and t̂_(x,y) respectively represent the classification branch prediction result, the quality evaluation result and the regression branch prediction result; λ is a constant;
step 1.3, performing parameter optimization by using a gradient descent method, wherein a calculation formula of a random gradient descent method SGD is as follows:
arg min_θ E_(z,x,y) L(y, f(z, x; θ))   (2)
in the formula (2), θ is the optimal parameter to be obtained; z is the input target picture; x is the search image; y is the label; f(z, x; θ) is the prediction result;
and after 50 times of training, the final total loss L of the network is stabilized below 0.1, and the training process is finished.
5. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the step 2 is implemented according to the following steps:
step 2.1, the position of the target is designated on the first frame image of the video image sequence; the target is cropped from the image and scaled to obtain a target picture with the size of 127 × 127 × 3, which is then fed into the twin neural network of the overall network to obtain four levels of features; the last-level features are fed into the regression branch as high-level features and serve as the regression branch initial template, and the first-level features are fed into the classification branch as low-level features and serve as the classification branch initial template; the sizes of the regression branch initial template and the classification branch initial template are both 6 × 6 × 256 (in pixels), and the calculation formulas of the regression branch initial template and the classification branch initial template are as follows:
φ_z(cls), φ_z(reg) = φ(z)   (3)
in equation (3), z is the input target picture, the function φ(·) represents the feature extraction network, φ_z(cls) represents the target template of the classification branch output by the feature extraction network, and φ_z(reg) represents the target template of the regression branch output by the feature extraction network;
step 2.2, initializing parameters:
in the first frame of the video image sequence, the target position information given by manual annotation is called the BOX; the BOX contains four pieces of information, namely the abscissa, the ordinate, the width and the height of the target, so the first frame does not need to be tracked: the initial centre coordinates and the initial width and height of the target are simply set to the values in the manually annotated BOX, which completes the initialization of the target and yields the initial position information of the target.
6. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the step 3 is implemented according to the following steps:
step 3.1, object search
An anchor-free target search strategy is adopted: taking the target coordinates in the previous frame's tracking result of the video image sequence as the centre, a search area is intercepted and cropped into a patch picture, giving a search image of size 255 × 255; the patch picture is fed into the feature extraction network to extract the multilayer depth features of the search area, and the formula is as follows:
φx(cls),φx(reg)=φ(x) (4)
in the formula (4), x is a search graph; the function phi () represents the feature extraction network, phix(cls)Search features, phi, representing classification branches of the feature extraction network outputx(reg)Search features representing regression branches output by the feature extraction network;
step 3.2, target position prediction based on the classification branch and the regression branch;
step 3.2.1, calculating a regression branch result:
for the regression branch, the target template φ_z(reg) and the search feature φ_x(reg) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
g(z, x) = φ_z(reg) ⋆ φ_x(reg) + b   (5)
in the formula (5), ⋆ denotes the cross-correlation operation and b represents an offset;
if a point (m, n) on the feature map g(z, x) corresponds to the point (s/2 + m·s, s/2 + n·s) on the original image, the regression branch will output at this point (m, n) the predicted value of the ground-truth position GT, expressed as a 4-dimensional vector t = (l*, t*, r*, b*); the calculation corresponding to each GT component is:
l* = (s/2 + m·s) - x0,   t* = (s/2 + n·s) - y0
r* = x1 - (s/2 + m·s),   b* = y1 - (s/2 + n·s)   (6)
in the formula (6), (x0, y0) and (x1, y1) respectively represent the corner points of the upper left corner and the lower right corner of the Ground Truth (GT); s is the stride of AlexNet, and s = 8; l*, t*, r*, b* respectively represent the distances from the position on the original image corresponding to the point (m, n) to the left, upper, right and lower borders of the GT;
step 3.2.2, calculate the Classification Branch result
For the classification branch, the target template φ_z(cls) and the search feature φ_x(cls) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
f(z, x) = φ_z(cls) ⋆ φ_x(cls) + b   (7)
the points (m, n) on the obtained feature map f(z, x) are divided into positive sample points and negative sample points according to the ground truth of the search image: if the position on the patch picture corresponding to a point (m, n) on the feature map f(z, x), namely (s/2 + m·s, s/2 + n·s), falls within the ground truth, it is regarded as a positive sample and its classification score is recorded as 1; the rest are negative samples and their classification scores are recorded as 0;
to better balance the relationship between a point (m, n) and the target position, a quality score PSS* is introduced; the predicted PSS* is multiplied by the corresponding classification score to compute the final score, which is taken as the result of the classification branch, and the quality score calculation formula is as follows:
PSS* = sqrt( ( min(l*, r*) / max(l*, r*) ) × ( min(t*, b*) / max(t*, b*) ) )   (8)
step 3.2.3, adding the classification branch result and the regression branch result to obtain the tracking result response graph of the current frame.
7. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the step 4 is implemented according to the following steps:
step 4.1, template updating condition judgment based on mutual information
In the tracking process, a first Frame of a video is used as a Template Frame, and simultaneously the first Frame is used as a Detection Frame and is input into a network to obtain a heat map of a classification branch and is marked as X, a heat map of a classification branch of a t-th Frame and is marked as Y, and then the X and the Y are used as two variables to calculate mutual information values of the X and the Y;
the mutual information calculation formula is as follows:
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log( p(x, y) / (p(x) p(y)) )    (9)
in formula (9), X and Y represent the classification branch heat map of the first frame and the classification branch heat map of the t-th frame, respectively, where p(x) and p(y) are the marginal distributions of X and Y, respectively, and p(x, y) is their joint distribution;
the obtained mutual information value is then normalized, the formula being as follows:
NMI(X; Y) = 2 · I(X; Y) / (H(X) + H(Y))    (10)
in formula (10), H(X) and H(Y) are the entropies of X and Y, respectively;
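A hedged sketch of formulas (9) and (10): the heat-map values are discretised with histograms to estimate p(x), p(y) and p(x, y), and the normalisation is taken as 2·I/(H(X)+H(Y)); both the binning and the exact normalisation form are assumptions of this sketch.

```python
import numpy as np

def normalized_mutual_information(X, Y, bins=16):
    """Estimate I(X; Y) between two heat maps and normalise it by their entropies."""
    joint, _, _ = np.histogram2d(X.ravel(), Y.ravel(), bins=bins)
    p_xy = joint / joint.sum()        # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1)            # marginal distribution p(x)
    p_y = p_xy.sum(axis=0)            # marginal distribution p(y)

    # Formula (9): I(X; Y) = sum_x sum_y p(x, y) log( p(x, y) / (p(x) p(y)) )
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))

    # Entropies H(X), H(Y) for the normalisation of formula (10).
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))
    return 2.0 * mi / (h_x + h_y + 1e-12)
```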
if the obtained mutual information value is larger than the set threshold V_threshold, the target region image of the current frame can be used to update the template; otherwise the template updating mechanism is not entered, and after the result of the current frame is obtained, the template update judgment of the next frame is carried out directly;
a dynamic threshold is adopted and is set to a local maximum of the mutual information; the dynamic update formula of the threshold is as follows:
[formula (11): definition of the dynamic threshold V_threshold in terms of I(t), mean(I(t-1), I(t-2)), I'(t) and I''(t)]
in formula (11), t represents the t-th frame, and I(t) represents the mutual information value between the classification branch heat map of the t-th frame and the classification branch heat map of the first frame; mean(I(t-1), I(t-2)) represents the average of the mutual information over the preceding frames, and I(t) exceeding it reflects that the matching degree of the t-th frame is better; I'(t) = 0 together with I''(t) < 0 denotes a local maximum point of the mutual information; since the mutual information values between the classification branch heat map of each search image and that of the first frame are discrete, formula (11) can be expressed as:
[formula (12): discrete form of formula (11), expressing the local-maximum condition through the mutual information values of three consecutive frames]
since the mutual information values of three consecutive search frames are needed, and the search images of the 1st and 2nd frames cannot satisfy the conditions required by the formula, the thresholds of the 1st and 2nd frames are set separately: V_threshold for the 1st and 2nd frame search images is set to a fixed value of 0.75;
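The decision logic of step 4.1 can be sketched as below; because formulas (11) and (12) are not reproduced in full, the discrete local-maximum test here is only one plausible reading and should be treated as an assumption, apart from the fixed 0.75 threshold for the first two frames, which is stated in the claim.

```python
def dynamic_threshold(mi_history, default=0.75):
    """Dynamic V_threshold: the most recent discrete local maximum of I(t), or 0.75
    when fewer than three mutual-information values are available (1st and 2nd frames)."""
    if len(mi_history) < 3:
        return default
    threshold = default
    for k in range(1, len(mi_history) - 1):
        # I(k) is treated as a discrete local maximum if it dominates both neighbours.
        if mi_history[k] >= mi_history[k - 1] and mi_history[k] >= mi_history[k + 1]:
            threshold = mi_history[k]
    return threshold

def should_update_template(mi_history):
    """Enter the template-update mechanism only when the current I(t) exceeds V_threshold."""
    return mi_history[-1] > dynamic_threshold(mi_history[:-1])
```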
step 4.2, template updating based on 3D convolution:
the templates are maintained as a queue and updated in a first-in-first-out manner: when a new template enters, the oldest template is eliminated, so that the number of templates is always three; the three templates are respectively denoted the initial target template, the historical template and the current template, and the feature maps obtained after the three templates pass through the feature extraction network are fused by a 3D convolution to obtain the fused latest template.
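A minimal PyTorch sketch of the 3D-convolution fusion of step 4.2: the three template feature maps are stacked along a depth axis and collapsed back into a single template feature; the channel count and the 3×3×3 kernel are assumptions of the sketch.

```python
import torch
import torch.nn as nn
from collections import deque

class TemplateFusion3D(nn.Module):
    """Fuse the initial, historical and current template features with a 3D convolution."""

    def __init__(self, channels=256):
        super().__init__()
        # Depth of 3 collapses to 1; padding keeps the spatial size unchanged.
        self.fuse = nn.Conv3d(channels, channels, kernel_size=(3, 3, 3), padding=(0, 1, 1))

    def forward(self, initial, historical, current):
        # Each input: (C, H, W). Stack along a new depth axis -> (1, C, 3, H, W).
        stacked = torch.stack([initial, historical, current], dim=1).unsqueeze(0)
        fused = self.fuse(stacked)             # (1, C, 1, H, W)
        return fused.squeeze(0).squeeze(1)     # (C, H, W): the fused latest template

# First-in-first-out queue: appending a new template automatically drops the oldest one.
templates = deque(maxlen=3)
```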
8. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the specific process of step 5 is as follows: after the latest template is obtained, it is used continuously until the next template update; the specific tracking flow is the same as that of step 3; in the tracking process, the depth features obtained from reliable tracking results are continuously stored, and once a new depth feature is obtained, the depth feature that has existed the longest is deleted and the template is updated by carrying out the operations of step 4.
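Tying the sketches together, the per-frame flow of step 5 might look as follows; track_one_frame, extract_feature and the other callables are hypothetical stand-ins for the step-3 tracker, the feature extraction network and the step-4 checks sketched earlier.

```python
from collections import deque

def run_tracking(frames, first_frame_heatmap, initial_template_feat,
                 track_one_frame, extract_feature, fusion, mutual_info, should_update):
    """Sketch of step 5: track every frame with the current fused template and refresh
    the template queue only when the mutual-information condition of step 4.1 holds."""
    templates = deque([initial_template_feat] * 3, maxlen=3)   # always exactly three templates
    mi_history = []

    for frame in frames:
        fused_template = fusion(*templates)                    # fused latest template (step 4.2)
        target_box, heatmap = track_one_frame(frame, fused_template)    # step-3 tracking
        mi_history.append(mutual_info(first_frame_heatmap, heatmap))    # I(t) of step 4.1

        if should_update(mi_history):
            # Reliable result: store the new deep feature; the oldest one is dropped (FIFO).
            templates.append(extract_feature(frame, target_box))
        yield target_box
```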
CN202210213267.3A 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics Pending CN114581486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210213267.3A CN114581486A (en) 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210213267.3A CN114581486A (en) 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Publications (1)

Publication Number Publication Date
CN114581486A true CN114581486A (en) 2022-06-03

Family

ID=81779260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210213267.3A Pending CN114581486A (en) 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Country Status (1)

Country Link
CN (1) CN114581486A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049745A (en) * 2022-08-16 2022-09-13 江苏魔视智能科技有限公司 Calibration method, device, equipment and medium for roadside sensor
CN115049745B (en) * 2022-08-16 2022-12-20 江苏魔视智能科技有限公司 Calibration method, device, equipment and medium for roadside sensor
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116486203B (en) * 2023-04-24 2024-02-02 燕山大学 Single-target tracking method based on twin network and online template updating
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination