CN114581486A - Template updating target tracking algorithm based on full convolution twin network multilayer characteristics - Google Patents

Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Info

Publication number
CN114581486A
Authority
CN
China
Prior art keywords
template
frame
target
tracking
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210213267.3A
Other languages
Chinese (zh)
Inventor
鲁晓锋
李小鹏
王轩
王正洋
柏晓飞
李思训
姬文江
黑新宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210213267.3A priority Critical patent/CN114581486A/en
Publication of CN114581486A publication Critical patent/CN114581486A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/246 (Physics; Computing; Image data processing or generation, in general): Image analysis; Analysis of motion; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/045 (Physics; Computing; Computing arrangements based on specific computational models): Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/08 (Physics; Computing; Computing arrangements based on specific computational models): Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T 2207/10016 (Indexing scheme for image analysis or image enhancement): Image acquisition modality; Video; Image sequence
    • G06T 2207/20081 (Indexing scheme for image analysis or image enhancement): Special algorithmic details; Training; Learning
    • G06T 2207/20084 (Indexing scheme for image analysis or image enhancement): Special algorithmic details; Artificial neural networks [ANN]


Abstract

The invention discloses a template updating target tracking algorithm based on the multilayer characteristics of a full convolution twin network, which specifically comprises the following steps: constructing an overall network and training it; using the trained network to perform initial tracking setup on the video image sequence to be tracked, obtaining the initial target templates and the initial position information of the target; entering the normal tracking flow to obtain the tracking result response map of the current frame; judging whether the current tracking result is reliable by means of a template updating condition judgment method based on standard mutual information, updating the template if it is reliable and not updating it otherwise, and, once two reliable tracking results are retained, replacing the oldest result with the newest one; using the latest template to continue normal tracking of the video image sequence following the currently tracked video frame; and repeating steps 3 to 5 until the whole video image sequence has been tracked, which yields the position of the target in each frame of the video and ends the tracking task.

Description

Template updating target tracking algorithm based on full convolution twin network multilayer characteristics
Technical Field
The invention belongs to the technical field of target tracking of videos, and relates to a template updating target tracking algorithm based on full convolution twin network multilayer characteristics.
Background
Target tracking is an important subject in the field of computer vision, has extremely profound research significance, and is widely applied to the fields of intelligent video monitoring, unmanned driving, human-computer interaction and the like.
The single-target tracking task is the process of locating a target in the subsequent frames of a video image sequence with a target tracking algorithm, given the size and position information of the target in the first frame of the video. With the maturing of deep learning technology, researchers have begun to apply deep learning to target tracking, and deep-learning-based target tracking algorithms built on twin (Siamese) neural networks have gradually become a mainstream research direction; their achievements play an important role both in scientific research and in everyday applications.
In recent years, deep learning algorithms have developed rapidly, and the combination of deep learning with target tracking algorithms has attracted increasing attention. Among these, algorithms based on the twin neural network structure are a mainstream direction: a template is generated from the target image given in the first frame, a cross-correlation operation is performed with subsequent images, and the position of the maximum value in the resulting response map is mapped back to the position in the original image where the target is most likely to be located. The target template used by twin-network-based target tracking algorithms is usually kept unchanged, and many current template updating methods lack a good template updating judgment condition and easily pollute the template. On the other hand, these algorithms generally use only the highest-level features extracted by the twin network and do not exploit the features of each layer.
Disclosure of Invention
The invention aims to provide a template updating target tracking algorithm based on the multilayer characteristics of a full convolution twin network, which solves the prior-art problems of poor robustness to appearance deformation of the tracked object and of template pollution caused by template updating.
The technical scheme adopted by the invention is that a template updating target tracking algorithm based on the multilayer characteristics of the full convolution twin network is implemented according to the following steps:
step 1, constructing an integral network and carrying out end-to-end training on the integral network structure;
step 2, initializing tracking setting is carried out on the video image sequence to be tracked by using the network trained in the step 1, and initial target templates and initial position information of targets of the tracking task are obtained;
step 3, entering a normal tracking flow, calculating the position of a target in an image for each frame of the video image sequence, and displaying the position at the corresponding position in the image to obtain a tracking result response graph of the current frame;
step 4, after the tracking result response graph of the step 3 is obtained, judging whether the current tracking result is reliable or not by using a template updating condition judgment method based on standard mutual information, if so, updating the template, if not, not updating the template, and if the reliable tracking results reserved in the step 3 reach 2, replacing the oldest result with the newest result;
step 5, using the latest template obtained in step 4 to continue normal tracking of step 3 on the video image sequence subsequent to the currently tracked video frame;
and step 6, repeating steps 3 to 5 until the whole video image sequence has been tracked, so that the position of the target in each frame of the video is obtained and the tracking task is finished.
The present invention is also characterized in that,
in step 1, the whole network structure is divided into three parts: the first part is a twin neural network used for depth feature extraction, the second part is a 3D convolutional neural network used for template updating, namely a 3D template updating module, the first part and the second part form a feature extraction network, and the third part comprises classification branches and regression branches;
the twin neural network is divided into four layers: the first two layers are composed of a convolution layer, a maximum pooling layer and an activation function layer; the last two layers each comprise a convolution layer and an activation function layer; the 3D template updating module is composed of a layer of 3D convolution layer.
In step 1, 10 picture pairs are selected from each video, each picture pair comprising four video frames: the first video frame is the first frame of the video, and the following 3 video frames are randomly selected from the video such that the distance between the second and third video frames is no more than 15 frames and the distance between the third and fourth video frames is no more than 10 frames; the first three video frames serve as target images for synthesizing the tracking template, and the last video frame serves as the search image; when the search image is processed, the three images fed into the 3D convolution updating module are identical, namely the last video frame of the picture pair; training is carried out 50 times, and the loss function adopts the same Logistic loss function as the SiamFC algorithm.
Generating a picture pair in the step 1, and performing data enhancement on the selected picture, wherein the data enhancement is specifically implemented according to the following steps:
step 1.1, a random stretch (RandomStretch) operation is first applied to the samples selected from the training set: the stretched size factor is set to 0.095-1.005, and the parts that need to be filled after enlargement are filled using a linear interpolation method; then a centre crop (CenterCrop) operation is carried out, in which a region of size 263 × 263 is cropped from the centre of the training picture pair, followed by a random crop (RandomCrop) operation, in which a region of size 255 × 255 is cropped at a random position in the training picture pair; finally, a coordinate conversion (Crop transform) is carried out: the BOX of the original GOT-10K data set picture, taken as the target position frame, is given in the form (left, top, width, height), namely the distances of the target position frame from the left and top borders of the picture and the width and height of the target position frame, and through the coordinate conversion operation the coordinate form of the target position frame is converted into (n, m, h, w), namely the centre-point coordinates of the target position frame and the height and width of the target position frame;
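As a non-limiting illustration of step 1.1, the BOX conversion and the cropping operations can be sketched in Python as follows; the helper names and the default stretch range are assumptions of this sketch, not values prescribed by the invention:

import numpy as np
from PIL import Image

def box_ltwh_to_center(box):
    # (left, top, width, height) -> (n, m, h, w): centre coordinates plus height and width
    left, top, w, h = box
    return left + w / 2.0, top + h / 2.0, h, w

def random_stretch(img, lo=0.95, hi=1.05):
    # randomly rescale the sample; the range here is illustrative (see step 1.1 for the stated range)
    scale = np.random.uniform(lo, hi)
    w, h = img.size
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR), scale

def center_crop(img, size=263):
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

def random_crop(img, size=255):
    w, h = img.size
    left = np.random.randint(0, max(w - size, 0) + 1)
    top = np.random.randint(0, max(h - size, 0) + 1)
    return img.crop((left, top, left + size, top + size))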
step 1.2, calculation of LOSS
The loss function of the classification branch in the training process uses focal loss, the loss function of the regression branch uses IoU loss, and the calculation formula of the total loss L is as follows:
L = Σ_(x,y) [ L_cls(p̂_(x,y), p_(x,y)) + 1{p_(x,y) > 0} · ( L_quality(q̂_(x,y), q_(x,y)) + λ · L_reg(t̂_(x,y), t_(x,y)) ) ]   (1)
in the formula (1), 1{·} is the indicator function, which takes the value 1 if the condition in the subscript is satisfied and 0 otherwise; L_cls represents the focal loss of the classification result; L_quality represents the binary cross-entropy loss used for quality assessment; L_reg represents the IoU loss of the bounding-box regression result; p_(x,y), q_(x,y) and t_(x,y) respectively represent the label of the classification branch, the label of the quality evaluation and the label of the regression branch; p̂_(x,y), q̂_(x,y) and t̂_(x,y) respectively represent the classification branch prediction result, the quality evaluation result and the regression branch prediction result; λ is a constant;
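A minimal sketch of how the total loss of equation (1) could be assembled is given below, assuming PyTorch and a sigmoid-based focal loss; the focal-loss hyper-parameters are assumptions of this sketch:

import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    # element-wise focal loss on the classification map (alpha/gamma values are assumptions)
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, labels, reduction='none')
    p_t = prob * labels + (1 - prob) * (1 - labels)
    a_t = alpha * labels + (1 - alpha) * (1 - labels)
    return a_t * (1 - p_t) ** gamma * ce

def iou_loss(pred, target, eps=1e-6):
    # pred/target are (l*, t*, r*, b*) offsets of the same point; -log IoU of the induced boxes
    pl, pt, pr, pb = pred.unbind(dim=-1)
    tl, tt, tr, tb = target.unbind(dim=-1)
    inter = (torch.min(pl, tl) + torch.min(pr, tr)) * (torch.min(pt, tt) + torch.min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return -torch.log((inter + eps) / (union + eps))

def total_loss(p_hat, q_hat, t_hat, p, q, t, lam=1.0):
    # equation (1): focal loss over all points, quality BCE and IoU loss only on positive points
    pos = p > 0
    l_cls = focal_loss(p_hat, p).sum()
    l_quality = F.binary_cross_entropy_with_logits(q_hat[pos], q[pos], reduction='sum')
    l_reg = iou_loss(t_hat[pos], t[pos]).sum()
    return l_cls + l_quality + lam * l_reg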
step 1.3, performing parameter optimization by using a gradient descent method, wherein a calculation formula of a random gradient descent method SGD is as follows:
arg min_θ E_(z,x,y) L(y, f(z, x; θ))   (2)
in the formula (2), θ is the optimal parameter to be obtained; z is the input target picture; x is the search image; y is the label; f(z, x; θ) is the prediction result;
and after 50 times of training, the final total loss L of the network is stabilized below 0.1, and the training process is finished.
The step 2 is implemented according to the following steps:
step 2.1, the position of the target is designated on the first frame image of the video image sequence; the target is cropped from the image and scaled to obtain a target picture with the size of 127 × 127 × 3, which is then fed into the twin neural network of the overall network to obtain four levels of features; the last-level features are fed into the regression branch as high-level features and serve as the regression branch initial template, and the first-level features are fed into the classification branch as low-level features and serve as the classification branch initial template; the sizes of the regression branch initial template and the classification branch initial template are both 6 × 6 × 256 (in pixels), and the calculation formulas of the regression branch initial template and the classification branch initial template are as follows:
φ_z(cls), φ_z(reg) = φ(z)   (3)
in equation (3), z is the input target picture, the function φ(·) represents the feature extraction network, φ_z(cls) represents the target template of the classification branch output by the feature extraction network, and φ_z(reg) represents the target template of the regression branch output by the feature extraction network;
step 2.2, initializing parameters:
in the first frame of the video image sequence, the target position information given by manual annotation is called the BOX; the BOX contains four pieces of information, namely the abscissa, the ordinate, the width and the height of the target, so the first frame does not need to be tracked: the initial centre coordinates and the initial width and height of the target are simply set to the values in the manually annotated BOX, which completes the initialization of the target and yields the initial position information of the target.
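The initialization of step 2 can be sketched as follows; the OpenCV-based cropping, the to_tensor helper and the backbone argument (standing for the trained twin network) are illustrative assumptions:

import cv2
import torch

def to_tensor(img):
    # HWC uint8 image -> normalised NCHW float tensor
    return torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0

def initialize(first_frame, box, backbone):
    x, y, w, h = box                                   # manually annotated BOX of the first frame
    target = first_frame[int(y):int(y + h), int(x):int(x + w)]
    target = cv2.resize(target, (127, 127))            # 127 x 127 x 3 target picture
    p2, p3, p4, p5 = backbone(to_tensor(target))       # four feature levels of the twin network
    cls_template, reg_template = p2, p5                # low-level -> classification, high-level -> regression
    state = {"center": (x + w / 2.0, y + h / 2.0), "size": (w, h)}
    return cls_template, reg_template, state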
Step 3 is implemented specifically according to the following steps:
step 3.1, object search
An anchor-free target search strategy is adopted: taking the target coordinates in the previous frame's tracking result of the video image sequence as the centre, a search area is intercepted and cropped into a patch picture, giving a search image of size 255 × 255; the patch picture is fed into the feature extraction network to extract the multilayer depth features of the search area, and the formula is as follows:
φ_x(cls), φ_x(reg) = φ(x)   (4)
in the formula (4), x is the search image; the function φ(·) represents the feature extraction network, φ_x(cls) represents the search features of the classification branch output by the feature extraction network, and φ_x(reg) represents the search features of the regression branch output by the feature extraction network;
step 3.2, target position prediction based on the classification branch and the regression branch;
step 3.2.1, calculating a regression branch result:
for the regression branch, the target template φ_z(reg) and the search feature φ_x(reg) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
g(z, x) = φ_z(reg) ⋆ φ_x(reg) + b   (5)
in the formula (5), ⋆ denotes the cross-correlation operation and b represents an offset;
if a point (m, n) on the feature map g(z, x) corresponds to the point (s/2 + m·s, s/2 + n·s) on the original image, the regression branch will output at this point (m, n) the predicted value of the ground-truth position GT, expressed as a 4-dimensional vector t = (l*, t*, r*, b*); the calculation corresponding to each GT component is:
l* = (s/2 + m·s) - x0,   t* = (s/2 + n·s) - y0
r* = x1 - (s/2 + m·s),   b* = y1 - (s/2 + n·s)   (6)
in the formula (6), (x0, y0) and (x1, y1) respectively represent the corner points of the upper left corner and the lower right corner of the Ground Truth (GT); s is the stride of AlexNet, and s = 8; l*, t*, r*, b* respectively represent the distances from the position on the original image corresponding to the point (m, n) to the left, upper, right and lower borders of the GT;
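A small sketch of the coordinate mapping and the (l*, t*, r*, b*) targets of equation (6), under the reconstruction given above:

def map_to_image(m, n, stride=8):
    # image position corresponding to feature-map point (m, n)
    return stride // 2 + m * stride, stride // 2 + n * stride

def regression_targets(m, n, gt, stride=8):
    # gt = (x0, y0, x1, y1): top-left and bottom-right corners of the ground-truth box
    px, py = map_to_image(m, n, stride)
    x0, y0, x1, y1 = gt
    l, t = px - x0, py - y0
    r, b = x1 - px, y1 - py
    return l, t, r, b      # negative values indicate the point lies outside the ground-truth box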
step 3.2.2, calculate the Classification Branch result
For the classification branch, the target template φ_z(cls) and the search feature φ_x(cls) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
f(z, x) = φ_z(cls) ⋆ φ_x(cls) + b   (7)
the points (m, n) on the obtained feature map f(z, x) are divided into positive sample points and negative sample points according to the ground truth of the search image: if the position on the patch picture corresponding to a point (m, n) on the feature map f(z, x), namely (s/2 + m·s, s/2 + n·s), falls within the ground truth, it is regarded as a positive sample and its classification score is recorded as 1; the rest are negative samples and their classification scores are recorded as 0;
to better balance the relationship between a point (m, n) and the target position, a quality score PSS* is introduced; the predicted PSS* is multiplied by the corresponding classification score to compute the final score, which is taken as the result of the classification branch, and the quality score calculation formula is as follows:
PSS* = sqrt( ( min(l*, r*) / max(l*, r*) ) × ( min(t*, b*) / max(t*, b*) ) )   (8)
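A sketch of the quality score of equation (8) and of how it weights the classification score in step 3.2.2; the small eps term is an assumption added to avoid division by zero:

import math

def pss(l, t, r, b, eps=1e-6):
    # PSS*: points near the box centre score close to 1, points near a border score close to 0
    return math.sqrt((min(l, r) / (max(l, r) + eps)) * (min(t, b) / (max(t, b) + eps)))

def final_classification_score(cls_score, l, t, r, b):
    # predicted quality score multiplied by the classification score gives the branch output
    return pss(l, t, r, b) * cls_score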
step 3.2.3, adding the classification branch result and the regression branch result to obtain the tracking result response graph of the current frame.
Step 4 is specifically implemented according to the following steps:
step 4.1, template updating condition judgment based on mutual information
In the tracking process, a first Frame of a video is used as a Template Frame, and simultaneously the first Frame is used as a Detection Frame and is input into a network to obtain a heat map of a classification branch and is marked as X, a heat map of a classification branch of a t-th Frame and is marked as Y, and then the X and the Y are used as two variables to calculate mutual information values of the X and the Y;
the mutual information calculation formula is as follows:
I(X; Y) = Σ_(x∈X) Σ_(y∈Y) p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )   (9)
in equation (9), X and Y represent the classification branch heat map of the first frame and the classification branch heat map of the t-th frame, respectively, p(x) and p(y) are the marginal distributions of X and Y, respectively, and p(x, y) is the joint distribution of X and Y;
and carrying out standardized conversion on the obtained mutual information value, wherein the formula is as follows:
NMI(X, Y) = 2 · I(X; Y) / ( H(X) + H(Y) )   (10)
in the formula (10), H (X), H (Y) are the entropies of X and Y respectively;
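A sketch of the standard mutual information of equation (9) and its normalised form of equation (10), computed between two classification heat maps via a joint histogram; the histogram binning is an assumption of this sketch:

import numpy as np

def normalized_mutual_information(x_map, y_map, bins=32):
    joint, _, _ = np.histogram2d(x_map.ravel(), y_map.ravel(), bins=bins)
    pxy = joint / joint.sum()                          # joint distribution p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)          # marginal distributions p(x), p(y)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))     # equation (9)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return 2.0 * mi / (hx + hy)                        # equation (10)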
if the obtained mutual information is larger than the set threshold value V_threshold, the target area image of the current frame can be used to update the template; otherwise the template updating mechanism is not entered, and after the result of the current frame is obtained, the template updating judgment of the next frame is started directly;
the threshold adopts a dynamic threshold, the dynamic threshold is set as a local maximum, and the dynamic updating formula of the threshold is as follows:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)), I′(t) = 0 and I″(t) < 0   (11)
in the formula (11), t represents the t-th frame, and I(t) represents the mutual information value between the classification branch heat map of the t-th frame and the classification branch heat map of the first frame; mean(I(t-1), I(t-2)) represents the average of the mutual information over the recent period, and I(t) exceeding it reflects that the matching degree of the t-th frame is better; I′(t) = 0 together with I″(t) < 0 denotes a mutual information local maximum point; since the mutual information values between the classification branch heat map of each search image and the classification branch heat map of the first frame form a discrete sequence, equation (11) can be expressed as:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)); otherwise V_threshold keeps its previous value   (12)
since the mutual information values of 3 consecutive frames of search images are required, but the 1st-frame and 2nd-frame search images do not meet the conditions required by the formula during searching, the thresholds of the 1st and 2nd frames are set separately: the V_threshold of the 1st-frame and 2nd-frame search images is set to a fixed value of 0.75;
step 4.2, updating the template based on 3D convolution:
the template updating follows a queue (first-in, first-out) discipline: when a new template enters, the oldest template is eliminated, so the number of templates is always three; the three templates are respectively denoted the initial target template, the historical template and the current template, and the feature maps obtained by passing the three templates through the feature extraction network are fused by a 3 × 3 convolution to obtain the fused, latest template.
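A minimal sketch of the first-in-first-out template store of step 4.2, assuming (as step 4 describes) that the first-frame template is kept fixed while the two retained reliable results rotate; TemplateUpdater3D refers to the illustrative module sketched earlier:

from collections import deque
import torch

class TemplateQueue:
    def __init__(self, init_feat, updater):
        self.init_feat = init_feat                             # initial target template (first frame)
        self.recent = deque([init_feat, init_feat], maxlen=2)  # historical and current template slots
        self.updater = updater                                 # 3D-convolution fusion module

    def push(self, new_feat):
        # newest reliable result in, oldest out (deque with maxlen=2 drops the left element)
        self.recent.append(new_feat)

    def fused_template(self):
        hist, curr = self.recent
        with torch.no_grad():
            return self.updater(self.init_feat, hist, curr)    # fused, latest template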
The specific process of step 5 is as follows: after the latest template is obtained, it is used continuously until the next template update; the specific tracking flow is the same as in step 3; during tracking, the depth features of reliable tracking results are continuously stored, and once a new depth feature is obtained, the stored depth feature that has existed longest is deleted and the template is updated, operating according to step 4.
The beneficial effects of the invention are as follows:
(1) the template updating target tracking algorithm based on the full convolution twin network multilayer characteristics uses SiamFC+FPN as the backbone to obtain features of different levels, and the classification branch and the regression branch use features of different levels to finally predict the target position, which exploits the characteristics of the different-level features extracted by the neural network and greatly improves the performance and robustness of the classification network and the regression network;
(2) the template updating target tracking algorithm based on the multilayer characteristics of the full convolution twin network filters most harmful template updates by using a template updating condition judgment method based on mutual information, and effectively solves the problem of template pollution caused by the template update;
(3) the template updating target tracking algorithm based on the full convolution twin network multilayer characteristics uses a 3D convolution updating module to fuse the two latest and most reliable tracking results retained in the history with the target information manually annotated at the start of the tracking task into an updated template, so that the new template captures the recent appearance information of the target while retaining the most accurate target appearance information from the first frame; this improves the robustness of the template to target appearance deformation, improves the performance and tracking speed of the target tracking algorithm, and also improves the accuracy.
Drawings
FIG. 1 is a schematic diagram of an overall framework of a method for updating a target tracking method based on a template with multilayer characteristics of a full convolution twin network according to the present invention;
FIG. 2 is a schematic diagram of network training of the template update target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention;
FIG. 3 is a schematic diagram of the SiamFC+FPN network model of the template update target tracking method based on the multi-layer characteristics of the fully-convolutional twin network;
FIG. 4 is a schematic diagram of a tracking initialization stage of the template update target tracking method based on the multilayer characteristics of the full convolution twin network according to the present invention;
FIG. 5 is a schematic diagram illustrating a standard mutual information template updating condition judgment of the template updating target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention;
FIG. 6 is a schematic diagram of template update of the template update target tracking method based on the multi-layer characteristics of the fully-convolutional twin network according to the present invention;
FIG. 7 is a graph of tracking accuracy of the template update target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention;
FIG. 8 is a graph of the tracking success rate of the template update target tracking method based on the full convolution twin network multi-layer characteristics according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a template updating target tracking algorithm based on a full convolution twin network multilayer characteristic, which is specifically implemented according to the following steps as shown in figure 1:
step 1, constructing an integral network and carrying out end-to-end training on the integral network structure;
the whole network structure is divided into three parts: the first part is a twin neural network used for depth feature extraction, the second part is a 3D convolutional neural network used for template updating, namely a 3D template updating module, the first part and the second part form a feature extraction network, and the third part comprises classification branches and regression branches;
the twin neural network is divided into four layers (P2, P3, P4, P5): the first two layers are composed of a convolution layer, a maximum pooling layer and an activation function layer; the last two layers each comprise a convolution layer and an activation function layer; the 3D template updating module consists of a layer of 3D convolution layer; the twin neural network extracts the features of the three pictures, and then the three pictures are combined into one picture through the 3D template updating module, namely the tracking template; the classification branch and the regression branch are used to predict the outcome.
10 picture pairs are selected from each video, each picture pair comprising four video frames: the first video frame is the first frame of the video, and the following 3 video frames are randomly selected from the video such that the distance between the second and third video frames is no more than 15 frames and the distance between the third and fourth video frames is no more than 10 frames; the first three video frames serve as target images for synthesizing the tracking template, and the last video frame serves as the search image; when the search image is processed, the three images fed into the 3D convolution updating module are identical, namely the last video frame of the picture pair; training is carried out 50 times, and the loss function adopts the same Logistic loss function as the SiamFC algorithm, as shown in FIG. 2;
generating a picture pair, and performing data enhancement on the selected picture, wherein the data enhancement is specifically implemented according to the following steps:
step 1.1, a random stretch (RandomStretch) operation is first applied to the samples selected from the training set (GOT-10K data set): the stretched size factor is set to 0.095-1.005, and the parts that need to be filled after enlargement are filled using a linear interpolation method; then a centre crop (CenterCrop) operation is carried out, in which a region of size 263 × 263 is cropped from the centre of the training picture pair, followed by a random crop (RandomCrop) operation, in which a region of size 255 × 255 is cropped at a random position in the training picture pair; finally, a coordinate conversion (Crop transform) is carried out: the BOX of the original GOT-10K data set picture, taken as the target position frame, is given in the form (left, top, width, height), namely the distances of the target position frame from the left and top borders of the picture and the width and height of the target position frame, and through the coordinate conversion operation the coordinate form of the target position frame is converted into (n, m, h, w), namely the centre-point coordinates of the target position frame and the height and width of the target position frame;
step 1.2, calculation of LOSS
The loss function of the classification branch in the training process uses focal loss, the loss function of the regression branch uses IoU loss, and the calculation formula of the total loss L is as follows:
L = Σ_(x,y) [ L_cls(p̂_(x,y), p_(x,y)) + 1{p_(x,y) > 0} · ( L_quality(q̂_(x,y), q_(x,y)) + λ · L_reg(t̂_(x,y), t_(x,y)) ) ]   (1)
in the formula (1), 1{·} is the indicator function, which takes the value 1 if the condition in the subscript is satisfied and 0 otherwise; L_cls represents the focal loss of the classification result; L_quality represents the binary cross-entropy loss used for quality assessment; L_reg represents the IoU loss of the bounding-box regression result; p_(x,y), q_(x,y) and t_(x,y) respectively represent the label of the classification branch, the label of the quality evaluation and the label of the regression branch; p̂_(x,y), q̂_(x,y) and t̂_(x,y) respectively represent the classification branch prediction result, the quality evaluation result and the regression branch prediction result; λ is a constant;
step 1.3, performing parameter optimization by using a gradient descent method, wherein a calculation formula of a random gradient descent method SGD is as follows:
arg min_θ E_(z,x,y) L(y, f(z, x; θ))   (2)
in the formula (2), θ is the optimal parameter to be obtained; z is the input target picture; x is the search image; y is the label; f(z, x; θ) is the prediction result;
after 50 times of training, the final total loss L of the network is stabilized below 0.1, and the training process is finished;
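As a non-limiting illustration of this training procedure, a plain SGD loop for the optimization of equation (2) could look as follows; the model signature, learning rate and momentum are assumptions, and total_loss refers to the loss sketch given for equation (1):

import torch

def train(model, loader, epochs=50, lr=1e-2, momentum=0.9):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for epoch in range(epochs):                        # 50 training rounds
        for z1, z2, z3, x, labels in loader:           # three template frames + one search frame per pair
            p_hat, q_hat, t_hat = model(z1, z2, z3, x)
            loss = total_loss(p_hat, q_hat, t_hat, *labels)   # equation (1)
            opt.zero_grad()
            loss.backward()
            opt.step()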
step 2, initializing tracking setting is carried out on the video image sequence to be tracked by using the network trained in the step 1, and initial target templates and initial position information of targets of the tracking task are obtained;
step 2.1, the position of the target is designated on the first frame image of the video image sequence; the target is cropped from the image and scaled to obtain a target picture with the size of 127 × 127 × 3, as shown in FIG. 3, which is then fed into the twin neural network of the overall network to obtain four levels of features; the last-level (P5) features are fed into the regression branch as high-level features and serve as the regression branch initial template, and the first-level (P2) features are fed into the classification branch as low-level features and serve as the classification branch initial template; the sizes of the regression branch initial template and the classification branch initial template are both 6 × 6 × 256 (in pixels), and the calculation formulas are as follows:
φ_z(cls), φ_z(reg) = φ(z)   (3)
in equation (3), z is the input target picture, the function φ(·) represents the feature extraction network, φ_z(cls) represents the target template of the classification branch output by the feature extraction network, and φ_z(reg) represents the target template of the regression branch output by the feature extraction network;
step 2.2, initializing parameters:
as shown in FIG. 4, in the first frame of the video image sequence, the target position information given by manual annotation is called the BOX; the BOX contains four pieces of information, namely the abscissa, the ordinate, the width and the height of the target, so the first frame does not need to be tracked: the initial centre coordinates and the initial width and height of the target are simply set to the values in the manually annotated BOX, which completes the initialization of the target and yields the initial position information of the target;
step 3, entering a normal tracking flow, calculating the position of a target in an image by each frame of the video image sequence, and displaying the position at the corresponding position in the image;
step 3.1, object search
An anchor-free target search strategy is adopted: taking the target coordinates in the previous frame's tracking result of the video image sequence as the centre, a search area is intercepted and cropped into a patch picture, giving a search image of size 255 × 255; the patch picture is fed into the feature extraction network to extract the multilayer depth features of the search area, and the formula is as follows:
φx(cls),φx(reg)=φ(x) (4)
in the formula (4), x is a search graph; the function phi () represents the feature extraction network, phix(cls)Search features, phi, representing classification branches of the feature extraction network outputx(reg)Search features representing regression branches output by the feature extraction network;
step 3.2, target location prediction based on classification and regression branches
Step 3.2.1, calculating a regression branch result:
for the regression branch, the target template φ_z(reg) and the search feature φ_x(reg) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
g(z, x) = φ_z(reg) ⋆ φ_x(reg) + b   (5)
in the formula (5), ⋆ denotes the cross-correlation operation and b represents an offset;
if a point (m, n) on the feature map g(z, x) corresponds to the point (s/2 + m·s, s/2 + n·s) on the original image, the regression branch will output at this point (m, n) the predicted value of the ground-truth position GT, expressed as a 4-dimensional vector t = (l*, t*, r*, b*); the calculation corresponding to each GT component is:
l* = (s/2 + m·s) - x0,   t* = (s/2 + n·s) - y0
r* = x1 - (s/2 + m·s),   b* = y1 - (s/2 + n·s)   (6)
in the formula (6), (x0, y0) and (x1, y1) respectively represent the corner points of the upper left corner and the lower right corner of the Ground Truth (GT); s is the stride of AlexNet, and s = 8; l*, t*, r*, b* respectively represent the distances from the position on the original image corresponding to the point (m, n) to the left, upper, right and lower borders of the GT;
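The feature-space mapping of equation (5) (and likewise equation (7) below) amounts to a cross-correlation in which the template feature acts as the kernel; a minimal sketch, assuming a single-channel output and batch size 1, is:

import torch.nn.functional as F

def cross_correlation(template_feat, search_feat, bias=0.0):
    # template_feat: (1, C, 6, 6) used as the correlation kernel; search_feat: (1, C, H, W)
    # returns a (1, 1, H-5, W-5) response map corresponding to g(z, x) or f(z, x)
    return F.conv2d(search_feat, template_feat) + bias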
step 3.2.2, calculate the Classification Branch result
For the classification branch, the target template φ_z(cls) and the search feature φ_x(cls) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
f(z, x) = φ_z(cls) ⋆ φ_x(cls) + b   (7)
the points (m, n) on the obtained feature map f(z, x) are divided into positive sample points and negative sample points according to the ground truth of the search image: if the position on the patch picture corresponding to a point (m, n) on the feature map f(z, x), namely (s/2 + m·s, s/2 + n·s), falls within the ground truth, it is regarded as a positive sample and its classification score is recorded as 1; the rest are negative samples and their classification scores are recorded as 0;
to better balance the relationship between a point (m, n) and the target position, a quality score PSS* is introduced; the predicted PSS* is multiplied by the corresponding classification score to compute the final score, which is taken as the result of the classification branch, and the quality score calculation formula is as follows:
PSS* = sqrt( ( min(l*, r*) / max(l*, r*) ) × ( min(t*, b*) / max(t*, b*) ) )   (8)
step 3.2.3, adding the result of the classification branch and the result of the regression branch to obtain a tracking result response graph of the current frame;
step 4, after the tracking result response graph of the step 3 is obtained, judging whether the current tracking result is reliable or not by using a template updating condition judgment method based on standard mutual information, if so, updating the template, if not, not updating the template, and if the reliable tracking results reserved in the step 3 reach 2, replacing the oldest result with the newest result;
step 4.1, template updating condition judgment based on mutual information
As shown in fig. 5, in the tracking process, a first Frame of the video is used as a Template Frame, and simultaneously, the first Frame is used as a Detection Frame and is input into the network, so that a heat map of a classification branch is obtained and recorded as X, a heat map of a classification branch of a t-th Frame is recorded as Y, and then X and Y are used as two variables to calculate mutual information values of the two variables;
the mutual information calculation formula is as follows:
I(X; Y) = Σ_(x∈X) Σ_(y∈Y) p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )   (9)
in equation (9), X and Y represent the classification branch heat map of the first frame and the classification branch heat map of the t-th frame, respectively, p(x) and p(y) are the marginal distributions of X and Y, respectively, and p(x, y) is the joint distribution of X and Y;
and carrying out standardized conversion on the obtained mutual information value, wherein the formula is as follows:
NMI(X, Y) = 2 · I(X; Y) / ( H(X) + H(Y) )   (10)
in the formula (10), H (X), H (Y) are the entropies of X and Y, respectively;
if the obtained mutual information is larger than the threshold value V_threshold set herein, the target area image of the current frame can be used to update the template; otherwise the template updating mechanism is not entered, and after the result of the current frame is obtained, the template updating judgment of the next frame is started directly;
in order to make the mutual information judgment more accurate, a dynamic threshold is used here; since a larger mutual information value indicates a better match, the dynamic threshold is set at a local maximum of the mutual information, and the threshold dynamic update formula is as follows:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)), I′(t) = 0 and I″(t) < 0   (11)
in the formula (11), t represents the t-th frame, and I(t) represents the mutual information value between the classification branch heat map of the t-th frame and the classification branch heat map of the first frame; mean(I(t-1), I(t-2)) represents the average of the mutual information over the recent period, and I(t) exceeding it reflects that the matching degree of the t-th frame is better; I′(t) = 0 together with I″(t) < 0 denotes a mutual information local maximum point; since the mutual information values between the classification branch heat map of each search image and the classification branch heat map of the first frame form a discrete sequence, equation (11) can be expressed as:
V_threshold = I(t),  if I(t) > mean(I(t-1), I(t-2)); otherwise V_threshold keeps its previous value   (12)
because the mutual information values of 3 consecutive frames of search images are required here, but the 1st-frame and 2nd-frame search images do not meet the conditions required by the formula during searching, the thresholds of the 1st and 2nd frames are set separately: the target area obtained from the 1st-frame search image generally differs little from the template image of the first video frame and can be used for a direct update, while in a few videos the target in the 2nd-frame search image may already be occluded, so the V_threshold of the 1st-frame and 2nd-frame search images is set to a fixed value of 0.75;
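A sketch of the threshold handling of step 4.1, under the reading of equation (12) adopted above (an interpretation of this sketch, not the only possible one):

def update_threshold(mi_values, v_threshold):
    # mi_values holds I(1)..I(t); the 1st and 2nd search frames use the fixed threshold 0.75
    t = len(mi_values)
    if t <= 2:
        return 0.75
    i_t, i_t1, i_t2 = mi_values[-1], mi_values[-2], mi_values[-3]
    if i_t > (i_t1 + i_t2) / 2.0:      # current mutual information above the recent average
        return i_t                     # threshold moves to the new (local-maximum) value
    return v_threshold                 # otherwise keep the previous threshold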
step 4.2, updating the template based on 3D convolution:
as shown in FIG. 6, the template updating follows a queue (first-in, first-out) discipline: when a new template enters, the oldest template is eliminated, so the number of templates is always three; the three templates are respectively denoted the initial target template, the historical template and the current template, and the feature maps obtained by passing the three templates through the feature extraction network are fused by a 3 × 3 convolution to obtain the fused, latest template;
step 5, using the fused latest template obtained in the step 4.2 to continue normal tracking of the step 3 on the video image sequence subsequent to the currently tracked video frame;
the step 5 is as follows:
after the latest template is obtained, it is used continuously until the next template update; the specific tracking flow is the same as in step 3; during tracking, the depth features of reliable tracking results are continuously stored, and once a new depth feature is obtained, the stored depth feature that has existed longest is deleted and the template is updated, operating according to step 4.
Step 6, repeating steps 3 to 5 until the whole video image sequence has been tracked, so that the position of the target in each frame of the video is obtained and the tracking task is finished.
The step 6 is as follows:
for a tracking task, after initialization is completed the whole process follows steps 3 to 5, alternating continuously between template updating and tracking calculation; in this process the target position in each frame of the video sequence is calculated and represented by a BOX, and over the whole video this yields the motion trajectory of the target; when the target positions of all pictures of the whole video image sequence have been obtained, the tracking task is finished. The accuracy and success rate of the method on the test set are shown in FIG. 7 and FIG. 8.
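Pulling steps 2 to 6 together, the overall flow can be sketched as below; model.first_frame_heatmap, model.track_step and locate_target are hypothetical helpers standing in for the branch computations of step 3, and the remaining names reuse the earlier illustrative sketches:

def track(frames, init_box, model):
    cls_t, reg_t, state = initialize(frames[0], init_box, model.backbone)        # step 2
    queue = TemplateQueue(reg_t, model.updater)
    ref_heatmap = model.first_frame_heatmap(frames[0], cls_t)   # first frame as template and detection frame
    mi_values, v_threshold, boxes = [], 0.75, [init_box]
    for frame in frames[1:]:
        response, heatmap, feat = model.track_step(frame, cls_t, queue.fused_template(), state)   # step 3
        box, state = locate_target(response, state)             # peak of the response map -> new BOX
        boxes.append(box)
        mi = normalized_mutual_information(ref_heatmap, heatmap)   # step 4.1 reliability check
        mi_values.append(mi)
        if mi > v_threshold:
            queue.push(feat)                                     # reliable result: newest replaces oldest
        v_threshold = update_threshold(mi_values, v_threshold)
    return boxes                                                 # target position in every frame (step 6)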
The invention innovatively uses SiamFC+FPN as the backbone to obtain features of different levels, and the classification branch and the regression branch use features of different levels to finally predict the target position, which exploits the characteristics of the different-level features extracted by the neural network and greatly improves the performance and robustness of the classification network and the regression network. A template updating condition judgment method based on mutual information is then used to filter out most harmful template updates, effectively solving the problem of template pollution caused by template updating. Finally, a 3D convolution updating module is used to fuse the two latest and most reliable tracking results retained in the history with the target information manually annotated at the start of the tracking task into an updated template, so that the new template captures the recent appearance information of the target while retaining the most accurate target appearance information from the first frame, which improves the robustness of the template to target appearance deformation.

Claims (8)

1. The template updating target tracking algorithm based on the multilayer characteristics of the full convolution twin network is characterized by being implemented according to the following steps:
step 1, constructing an integral network and carrying out end-to-end training on the integral network structure;
step 2, initializing tracking setting is carried out on the video image sequence to be tracked by using the network trained in the step 1, and initial target templates and initial position information of targets of the tracking task are obtained;
step 3, entering a normal tracking flow, calculating the position of a target in an image for each frame of the video image sequence, and displaying the position at the corresponding position in the image to obtain a tracking result response graph of the current frame;
step 4, after the tracking result response graph of the step 3 is obtained, judging whether the current tracking result is reliable or not by using a template updating condition judgment method based on standard mutual information, if so, updating the template, if not, not updating the template, and if the reliable tracking results reserved in the step 3 reach 2, replacing the oldest result with the newest result;
step 5, using the latest template obtained in step 4 to continue normal tracking of step 3 on the video image sequence subsequent to the currently tracked video frame;
and step 6, repeating steps 3 to 5 until the whole video image sequence has been tracked, so that the position of the target in each frame of the video is obtained, and the tracking task is finished.
2. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein in step 1, the whole network structure is divided into three parts: the first part is a twin neural network used for depth feature extraction, the second part is a 3D convolutional neural network used for template updating, namely a 3D template updating module, the first part and the second part form a feature extraction network, and the third part comprises classification branches and regression branches;
the twin neural network is divided into four layers: the first two layers are composed of a convolution layer, a maximum pooling layer and an activation function layer; the last two layers each comprise a convolution layer and an activation function layer; the 3D template updating module is composed of a layer of 3D convolution layer.
3. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristics as claimed in claim 2, wherein in step 1, 10 picture pairs are selected from each video, each picture pair comprising four video frames: the first video frame is the first frame of the video, and the following 3 video frames are randomly selected from the video such that the distance between the second and third video frames is no more than 15 frames and the distance between the third and fourth video frames is no more than 10 frames; the first three video frames serve as target images for synthesizing the tracking template, and the last video frame serves as the search image; when the search image is processed, the three images fed into the 3D convolution updating module are identical, namely the last video frame of the picture pair; training is carried out 50 times, and the loss function adopts the same Logistic loss function as the SiamFC algorithm.
4. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristics as claimed in claim 3, wherein the picture pair is generated in step 1, data enhancement needs to be performed on the selected picture, and the data enhancement is specifically performed according to the following steps:
step 1.1, a random stretch (RandomStretch) operation is first applied to the samples selected from the training set: the stretched size factor is set to 0.095-1.005, and the parts that need to be filled after enlargement are filled using a linear interpolation method; then a centre crop (CenterCrop) operation is carried out, in which a region of size 263 × 263 is cropped from the centre of the training picture pair, followed by a random crop (RandomCrop) operation, in which a region of size 255 × 255 is cropped at a random position in the training picture pair; finally, a coordinate conversion (Crop transform) is carried out: the BOX of the original GOT-10K data set picture, taken as the target position frame, is given in the form (left, top, width, height), namely the distances of the target position frame from the left and top borders of the picture and the width and height of the target position frame, and through the coordinate conversion operation the coordinate form of the target position frame is converted into (n, m, h, w), namely the centre-point coordinates of the target position frame and the height and width of the target position frame;
step 1.2, calculation of LOSS
The loss function of the classification branch in the training process uses focal loss, the loss function of the regression branch uses IoU loss, and the calculation formula of the total loss L is as follows:
L = Σ_(x,y) [ L_cls(p̂_(x,y), p_(x,y)) + 1{p_(x,y) > 0} · ( L_quality(q̂_(x,y), q_(x,y)) + λ · L_reg(t̂_(x,y), t_(x,y)) ) ]   (1)
in the formula (1), 1{·} is the indicator function, which takes the value 1 if the condition in the subscript is satisfied and 0 otherwise; L_cls represents the focal loss of the classification result; L_quality represents the binary cross-entropy loss used for quality assessment; L_reg represents the IoU loss of the bounding-box regression result; p_(x,y), q_(x,y) and t_(x,y) respectively represent the label of the classification branch, the label of the quality evaluation and the label of the regression branch; p̂_(x,y), q̂_(x,y) and t̂_(x,y) respectively represent the classification branch prediction result, the quality evaluation result and the regression branch prediction result; λ is a constant;
step 1.3, performing parameter optimization by using a gradient descent method, wherein a calculation formula of a random gradient descent method SGD is as follows:
arg min_θ E_(z,x,y) L(y, f(z, x; θ))   (2)
in the formula (2), θ is the optimal parameter to be obtained; z is the input target picture; x is the search image; y is the label; f(z, x; θ) is the prediction result;
and after 50 times of training, the final total loss L of the network is stabilized below 0.1, and the training process is finished.
5. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the step 2 is implemented according to the following steps:
step 2.1, the position of the target is designated on the first frame image of the video image sequence; the target is cropped from the image and scaled to obtain a target picture with the size of 127 × 127 × 3, which is then fed into the twin neural network of the overall network to obtain four levels of features; the last-level features are fed into the regression branch as high-level features and serve as the regression branch initial template, and the first-level features are fed into the classification branch as low-level features and serve as the classification branch initial template; the sizes of the regression branch initial template and the classification branch initial template are both 6 × 6 × 256 (in pixels), and the calculation formulas of the regression branch initial template and the classification branch initial template are as follows:
φ_z(cls), φ_z(reg) = φ(z)   (3)
in equation (3), z is the input target picture, the function φ(·) represents the feature extraction network, φ_z(cls) represents the target template of the classification branch output by the feature extraction network, and φ_z(reg) represents the target template of the regression branch output by the feature extraction network;
step 2.2, initializing parameters:
in the first frame of the video image sequence, the target position information given by manual annotation is called the BOX; the BOX contains four pieces of information, namely the abscissa, the ordinate, the width and the height of the target, so the first frame does not need to be tracked: the initial centre coordinates and the initial width and height of the target are simply set to the values in the manually annotated BOX, which completes the initialization of the target and yields the initial position information of the target.
6. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the step 3 is implemented according to the following steps:
step 3.1, object search
An anchor-free target search strategy is adopted: taking the target coordinates in the previous frame's tracking result of the video image sequence as the centre, a search area is intercepted and cropped into a patch picture, giving a search image of size 255 × 255; the patch picture is fed into the feature extraction network to extract the multilayer depth features of the search area, and the formula is as follows:
φx(cls),φx(reg)=φ(x) (4)
in the formula (4), x is a search graph; the function phi () represents the feature extraction network, phix(cls)Search features, phi, representing classification branches of the feature extraction network outputx(reg)Search features representing regression branches output by the feature extraction network;
step 3.2, target position prediction based on the classification branch and the regression branch;
step 3.2.1, calculating a regression branch result:
for the regression branch, the target template φ_z(reg) and the search feature φ_x(reg) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
g(z, x) = φ_z(reg) ⋆ φ_x(reg) + b   (5)
in the formula (5), ⋆ denotes the cross-correlation operation and b represents an offset;
if a point (m, n) on the feature map g(z, x) corresponds to the point (s/2 + m·s, s/2 + n·s) on the original image, the regression branch will output at this point (m, n) the predicted value of the ground-truth position GT, expressed as a 4-dimensional vector t = (l*, t*, r*, b*); the calculation corresponding to each GT component is:
l* = (s/2 + m·s) - x0,   t* = (s/2 + n·s) - y0
r* = x1 - (s/2 + m·s),   b* = y1 - (s/2 + n·s)   (6)
in the formula (6), (x0, y0) and (x1, y1) respectively represent the corner points of the upper left corner and the lower right corner of the Ground Truth (GT); s is the stride of AlexNet, and s = 8; l*, t*, r*, b* respectively represent the distances from the position on the original image corresponding to the point (m, n) to the left, upper, right and lower borders of the GT;
step 3.2.2, calculate the Classification Branch result
For the classification branch, the target template φ_z(cls) and the search feature φ_x(cls) extracted by the feature extraction network are first mapped into the same feature space, and the calculation formula is as follows:
f(z, x) = φ_z(cls) ⋆ φ_x(cls) + b   (7)
the points (m, n) on the obtained feature map f(z, x) are divided into positive sample points and negative sample points according to the ground truth of the search image: if the position on the patch picture corresponding to a point (m, n) on the feature map f(z, x), namely (s/2 + m·s, s/2 + n·s), falls within the ground truth, it is regarded as a positive sample and its classification score is recorded as 1; the rest are negative samples and their classification scores are recorded as 0;
to better balance the relationship between a point (m, n) and the target position, a quality score PSS* is introduced; the predicted PSS* is multiplied by the corresponding classification score to compute the final score, which is taken as the result of the classification branch, and the quality score calculation formula is as follows:
PSS* = sqrt( ( min(l*, r*) / max(l*, r*) ) × ( min(t*, b*) / max(t*, b*) ) )   (8)
step 3.2.3, adding the classification branch result and the regression branch result to obtain the tracking result response graph of the current frame.
7. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the step 4 is implemented according to the following steps:
step 4.1, template updating condition judgment based on mutual information
In the tracking process, a first Frame of a video is used as a Template Frame, and simultaneously the first Frame is used as a Detection Frame and is input into a network to obtain a heat map of a classification branch and is marked as X, a heat map of a classification branch of a t-th Frame and is marked as Y, and then the X and the Y are used as two variables to calculate mutual information values of the X and the Y;
the mutual information calculation formula is as follows:
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log( p(x, y) / (p(x) p(y)) )    (9)
in formula (9), X and Y represent the classification branch heat map of the first frame and the classification branch heat map of the t-th frame, respectively, where p(x) and p(y) are the marginal distributions of X and Y, respectively, and p(x, y) is their joint distribution;
the obtained mutual information value is then normalized, the formula being as follows:
NMI(X; Y) = 2 · I(X; Y) / (H(X) + H(Y))    (10)
in formula (10), H(X) and H(Y) are the entropies of X and Y, respectively;
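A hedged sketch of formulas (9) and (10): the heat-map values are discretised with histograms to estimate p(x), p(y) and p(x, y), and the normalisation is taken as 2·I/(H(X)+H(Y)); both the binning and the exact normalisation form are assumptions of this sketch.

```python
import numpy as np

def normalized_mutual_information(X, Y, bins=16):
    """Estimate I(X; Y) between two heat maps and normalise it by their entropies."""
    joint, _, _ = np.histogram2d(X.ravel(), Y.ravel(), bins=bins)
    p_xy = joint / joint.sum()        # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1)            # marginal distribution p(x)
    p_y = p_xy.sum(axis=0)            # marginal distribution p(y)

    # Formula (9): I(X; Y) = sum_x sum_y p(x, y) log( p(x, y) / (p(x) p(y)) )
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))

    # Entropies H(X), H(Y) for the normalisation of formula (10).
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))
    return 2.0 * mi / (h_x + h_y + 1e-12)
```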
if the obtained mutual information value is larger than the set threshold V_threshold, the target region image of the current frame can be used to update the template; otherwise the template updating mechanism is not entered, and after the result of the current frame is obtained, the template update judgment of the next frame is carried out directly;
a dynamic threshold is adopted and is set to a local maximum of the mutual information; the dynamic update formula of the threshold is as follows:
[formula (11): definition of the dynamic threshold V_threshold in terms of I(t), mean(I(t-1), I(t-2)), I'(t) and I''(t)]
in formula (11), t represents the t-th frame, and I(t) represents the mutual information value between the classification branch heat map of the t-th frame and the classification branch heat map of the first frame; mean(I(t-1), I(t-2)) represents the average of the mutual information over the preceding frames, and I(t) exceeding it reflects that the matching degree of the t-th frame is better; I'(t) = 0 together with I''(t) < 0 denotes a local maximum point of the mutual information; since the mutual information values between the classification branch heat map of each search image and that of the first frame are discrete, formula (11) can be expressed as:
[formula (12): discrete form of formula (11), expressing the local-maximum condition through the mutual information values of three consecutive frames]
since the mutual information values of three consecutive search frames are needed, and the search images of the 1st and 2nd frames cannot satisfy the conditions required by the formula, the thresholds of the 1st and 2nd frames are set separately: V_threshold for the 1st and 2nd frame search images is set to a fixed value of 0.75;
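The decision logic of step 4.1 can be sketched as below; because formulas (11) and (12) are not reproduced in full, the discrete local-maximum test here is only one plausible reading and should be treated as an assumption, apart from the fixed 0.75 threshold for the first two frames, which is stated in the claim.

```python
def dynamic_threshold(mi_history, default=0.75):
    """Dynamic V_threshold: the most recent discrete local maximum of I(t), or 0.75
    when fewer than three mutual-information values are available (1st and 2nd frames)."""
    if len(mi_history) < 3:
        return default
    threshold = default
    for k in range(1, len(mi_history) - 1):
        # I(k) is treated as a discrete local maximum if it dominates both neighbours.
        if mi_history[k] >= mi_history[k - 1] and mi_history[k] >= mi_history[k + 1]:
            threshold = mi_history[k]
    return threshold

def should_update_template(mi_history):
    """Enter the template-update mechanism only when the current I(t) exceeds V_threshold."""
    return mi_history[-1] > dynamic_threshold(mi_history[:-1])
```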
step 4.2, template updating based on 3D convolution:
the templates are maintained as a queue and updated in a first-in-first-out manner: when a new template enters, the oldest template is eliminated, so that the number of templates is always three; the three templates are respectively denoted the initial target template, the historical template and the current template, and the feature maps obtained after the three templates pass through the feature extraction network are fused by a 3D convolution to obtain the fused latest template.
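A minimal PyTorch sketch of the 3D-convolution fusion of step 4.2: the three template feature maps are stacked along a depth axis and collapsed back into a single template feature; the channel count and the 3×3×3 kernel are assumptions of the sketch.

```python
import torch
import torch.nn as nn
from collections import deque

class TemplateFusion3D(nn.Module):
    """Fuse the initial, historical and current template features with a 3D convolution."""

    def __init__(self, channels=256):
        super().__init__()
        # Depth of 3 collapses to 1; padding keeps the spatial size unchanged.
        self.fuse = nn.Conv3d(channels, channels, kernel_size=(3, 3, 3), padding=(0, 1, 1))

    def forward(self, initial, historical, current):
        # Each input: (C, H, W). Stack along a new depth axis -> (1, C, 3, H, W).
        stacked = torch.stack([initial, historical, current], dim=1).unsqueeze(0)
        fused = self.fuse(stacked)             # (1, C, 1, H, W)
        return fused.squeeze(0).squeeze(1)     # (C, H, W): the fused latest template

# First-in-first-out queue: appending a new template automatically drops the oldest one.
templates = deque(maxlen=3)
```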
8. The template updating target tracking algorithm based on the full convolution twin network multilayer characteristic as claimed in claim 1, wherein the specific process of step 5 is as follows: after the latest template is obtained, it is used continuously until the next template update; the specific tracking flow is the same as that of step 3; in the tracking process, the depth features obtained from reliable tracking results are continuously stored, and once a new depth feature is obtained, the depth feature that has existed the longest is deleted and the template is updated by carrying out the operations of step 4.
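Tying the sketches together, the per-frame flow of step 5 might look as follows; track_one_frame, extract_feature and the other callables are hypothetical stand-ins for the step-3 tracker, the feature extraction network and the step-4 checks sketched earlier.

```python
from collections import deque

def run_tracking(frames, first_frame_heatmap, initial_template_feat,
                 track_one_frame, extract_feature, fusion, mutual_info, should_update):
    """Sketch of step 5: track every frame with the current fused template and refresh
    the template queue only when the mutual-information condition of step 4.1 holds."""
    templates = deque([initial_template_feat] * 3, maxlen=3)   # always exactly three templates
    mi_history = []

    for frame in frames:
        fused_template = fusion(*templates)                    # fused latest template (step 4.2)
        target_box, heatmap = track_one_frame(frame, fused_template)    # step-3 tracking
        mi_history.append(mutual_info(first_frame_heatmap, heatmap))    # I(t) of step 4.1

        if should_update(mi_history):
            # Reliable result: store the new deep feature; the oldest one is dropped (FIFO).
            templates.append(extract_feature(frame, target_box))
        yield target_box
```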
CN202210213267.3A 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics Pending CN114581486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210213267.3A CN114581486A (en) 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210213267.3A CN114581486A (en) 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Publications (1)

Publication Number Publication Date
CN114581486A true CN114581486A (en) 2022-06-03

Family

ID=81779260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210213267.3A Pending CN114581486A (en) 2022-03-04 2022-03-04 Template updating target tracking algorithm based on full convolution twin network multilayer characteristics

Country Status (1)

Country Link
CN (1) CN114581486A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049745A (en) * 2022-08-16 2022-09-13 江苏魔视智能科技有限公司 Calibration method, device, equipment and medium for roadside sensor
CN115049745B (en) * 2022-08-16 2022-12-20 江苏魔视智能科技有限公司 Calibration method, device, equipment and medium for roadside sensor
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116486203B (en) * 2023-04-24 2024-02-02 燕山大学 Single-target tracking method based on twin network and online template updating
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination