CN113963032A - Twin network structure target tracking method fusing target re-identification - Google Patents

Twin network structure target tracking method fusing target re-identification

Info

Publication number
CN113963032A
CN113963032A
Authority
CN
China
Prior art keywords
target
network
branch
identification
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111451845.9A
Other languages
Chinese (zh)
Inventor
毛姣莉
崔滢
郑河荣
郭东岩
朱鹏飞
王晓航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111451845.9A priority Critical patent/CN113963032A/en
Publication of CN113963032A publication Critical patent/CN113963032A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a twin network structure target tracking method fusing target re-identification, which trains a target tracking network model comprising a classification regression branch and a target re-identification branch. The classification regression branch comprises a full convolution twin network module and a classification regression module; the backbone network of the full convolution twin network module is consistent with the backbone network of the target re-identification module, and the two backbone networks share parameters and weights. For a video sequence to be tracked, the first frame of the video sequence is used as a template frame, the tracking target is selected in the first frame, and the template frame and each frame after the first frame are respectively input as an image pair into the trained target tracking network model to determine the position of the tracking target in the video frame and realize target tracking. The invention can better improve the ability to distinguish interference from similar targets.

Description

Twin network structure target tracking method fusing target re-identification
Technical Field
The application belongs to the technical field of target tracking, and particularly relates to a twin network structure target tracking method fusing target re-identification.
Background
Video target tracking is a research topic spanning multiple fields such as pattern recognition, computer vision and artificial intelligence, and is a research hotspot because of its wide application value in fields such as intelligent video surveillance, security and education. However, in real scenes, due to factors such as target posture change, video jitter, occlusion and similar-target interference, it is difficult for a tracking algorithm to achieve both accurate results and real-time operation, and this problem is particularly obvious in long-term tracking tasks. Therefore, how to implement an accurate and real-time target tracking method remains a difficult research problem.
Conventional visual target tracking methods can generally be divided into two categories: target tracking based on generative models and target tracking based on discriminative models. A generative method models the target in the current frame and searches the next frame for the region most similar to the model, taking that region as the predicted position. Its main idea is to compute the joint probability of target and sample and find the sample closest to the target as the estimate of the current target. Common generative tracking algorithms include Kalman filtering, particle filtering and mean-shift. However, generative methods model only the target and do not exploit the difference between foreground and background information; when the background resembles the target in appearance during tracking, the interference is severe. Discriminative methods make full use of the foreground and background information of the template frame and focus on separating the foreground from the background. A discriminative method treats the tracking task as a binary classification process: the region where the target object lies in the template frame is taken as a positive sample, the remaining background information as negative samples, and the target is then separated from the whole picture in subsequent frames. The biggest difference from generative methods is that the classifier is obtained by machine learning and uses background information during training, so it can concentrate on distinguishing foreground from background. Its main idea is to compute the conditional probability and directly judge whether a region is the target. Classical discriminative methods include Struck and TLD.
The basic idea of correlation filtering tracking is to design a filter template and perform a correlation operation between the template and the target candidate region; the position of the maximum response is the target position in the current frame. When correlating the input image with the filter template, applying the fast Fourier transform to both converts the convolution into an element-wise product, greatly reducing the amount of computation. Compared with traditional tracking methods, correlation filtering greatly accelerates tracking. Classical correlation filtering methods include MOSSE, KCF/DCF, SAMF and SRDCF.
Deep learning methods, with their strong feature expression capability, are highly favored for target tracking; in particular, since 2016 the Siamese-series twin-network-based tracking algorithms have markedly improved the balance of accuracy and speed. A twin tracker formulates the visual target tracking problem as learning a generic similarity map by cross-correlating the learned feature representations of the target template and the search area. For example, SiamFC, proposed by Luca Bertinetto et al., performs similarity matching on the search area image at multiple scales and determines the target position from multiple similarity maps; although this multi-scale matching improves tracking accuracy, it compromises speed. DSiam proposes a dynamic twin network that effectively exploits the appearance changes and background suppression learned online from previous frames; StructSiam concentrates on learning the local structure of the target, improving the discriminative power of tracking; SiamRPN adds a region proposal network after the twin network, further improving tracking speed. However, the above methods all use the classical, relatively shallow AlexNet as the backbone network and cannot obtain deeper features. SiamRPN++ adopts a very deep neural network while still running at high real-time speed; SiamCAR proposes a fully convolutional classification and regression twin network structure and achieves good accuracy and speed with a simpler structure, but it adopts ResNet-50 as the backbone and performs only a simple fusion of the last three layers of features to obtain a multi-scale effect, which is not ideal. Although these trackers have certain advantages, they still struggle with targets that undergo large shape deformation and appearance change during long-term tracking.
Disclosure of Invention
Aiming at the defects of existing methods, the application provides a twin network structure target tracking method fusing target re-identification, and realizes a simple and effective general single-target tracking method by enhancing the robustness of target expression in the twin network tracking model.
In order to achieve the purpose, the technical scheme of the application is as follows:
a twin network structure target tracking method fusing target re-identification comprises the following steps:
acquiring a general visual target tracking data set, preprocessing the general visual target tracking data set, and generating a training sample set;
training a target tracking network model, wherein the target tracking network model comprises a classification regression branch and a target re-identification branch, the classification regression branch comprises a full convolution twin network module and a classification regression module, a backbone network of the full convolution twin network module is consistent with a backbone network of the target re-identification module, the classification regression branch and the target re-identification branch are alternately trained when the target tracking network model is trained, and the backbone network of the full convolution twin network module and the backbone network of the target re-identification module share parameters and weights;
for a video sequence to be tracked, a first frame of the video sequence is used as a template frame, a tracking target is selected in the first frame, the template frame and each frame behind the first frame are respectively used as an image pair to be input into a trained target tracking network model, the position of the tracking target in the video frame is determined, and target tracking is achieved.
Further, the backbone network of the full convolution twin network module and the backbone network of the target re-identification module adopt GoogLeNet, and a convolutional layer composed of two sequential 3 × 3 convolution kernels is adopted to replace the single 5 × 5 convolution layer in GoogLeNet.
Further, a Transformer-Head network is further included behind the backbone network of the target re-identification module; after the Transformer-Head network performs a pixel-by-pixel flattening operation on the feature matrix obtained by the backbone network, sinusoidal position coding is performed on each position of the feature sequence and then added to the feature of the corresponding position, and the resulting new feature sequence serves as the input sequence of the Transformer-Head network, which then passes sequentially through an encoder network and a decoder network.
Furthermore, the encoder network is formed by stacking a plurality of encoding layers, each encoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the decoder network is formed by stacking a plurality of decoding layers, each decoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the input of the decoder network comprises the output of the encoder network and target query features obtained by performing a linear transformation on the Transformer-Head input sequence through a 1 × 1 convolution.
Further, the classification regression module comprises a classification branch network and a regression branch network, wherein the classification branch network comprises a classification branch and a centrality branch.
Further, when the target tracking network model is trained, alternately training a classification regression branch and a target re-recognition branch includes:
when the number of training rounds is M, each round of training is performed once on the classification regression branch and the target re-identification branch.
Further, when the target tracking network model is trained, alternately training a classification regression branch and a target re-recognition branch includes:
when the number of training rounds is M, the parameters of the backbone network are frozen in the first M1 rounds, the classification regression branch and the target re-identification branch are trained, the backbone network is opened in the last M2 rounds, the classification regression branch and the target re-identification branch are trained, and the sum of M1 and M2 is equal to M.
The twin network structure target tracking method fusing target re-identification can enhance the target expression of the twin network by combining the network-correction effect of the target re-identification task, and can better improve the ability to distinguish interference from similar targets.
Drawings
FIG. 1 is a flowchart of a twin network structure target tracking method for merging target re-identification according to the present application;
FIG. 2 is a schematic diagram of a target tracking network model according to the present application;
FIG. 3 is a Reid branch network diagram;
fig. 4 is a schematic view of target tracking.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a twin network structure target tracking method fusing target re-identification, including:
and step S1, acquiring a general visual target tracking data set, preprocessing the general visual target tracking data set, and generating a training sample set.
According to the method, a general visual target tracking data set is used, a target template image (template frame) and a search area image (search frame) are cut out from an original training set according to the position of a target in an image, and the cut-out image set forms a training data set used by the method.
The training data use the common single-target tracking data sets COCO, DET, VID2015 and YouTube-BB; the LaSOT and TrackingNet data sets are added in follow-up experiments. In the training process, the inputs are an image pair consisting of the template frame and the search frame, plus one additional image of the same size as the template frame; the image pair is used as the input of the classification regression network, and the additional image is used as the input of the Reid network. The image set is cropped prior to input.
The target template image is cropped as follows: a square crop region is determined with the target center as the crop center, and the side length of the square is set to p, where p may exceed the size of the original picture; any part that exceeds the picture is filled with the picture color mean, and the crop is then scaled to 127 × 127. Here p = 2(w + h), where w denotes the width of the target box in pixels and h denotes its height.
The target search image is cropped as follows: one frame is selected within the range of T frames before or after the template frame; with the target center as the crop center, a square crop region with side length C is determined, where C may exceed the size of the original picture; the exceeding part is filled with the picture color mean, and the image is then scaled to 127 × 127. In the data sets COCO, DET, VID2015 and YouTube-BB, T is 1, 1, 100 and 3, respectively.
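As an illustration of the cropping scheme above, the following is a minimal Python sketch; the function name crop_square, the use of OpenCV, and the handling of boundary rounding are assumptions for illustration and are not part of the patent:

    import cv2
    import numpy as np

    def crop_square(image, cx, cy, side, out_size=127):
        # Crop a square of length `side` centered at (cx, cy); any part falling
        # outside the image is padded with the per-channel color mean, and the
        # patch is then resized to out_size x out_size, as described above.
        mean_color = image.mean(axis=(0, 1))
        half = side / 2.0
        x1, y1 = int(round(cx - half)), int(round(cy - half))
        x2, y2 = int(round(cx + half)), int(round(cy + half))
        h, w = image.shape[:2]
        pad_l, pad_t = max(0, -x1), max(0, -y1)
        pad_r, pad_b = max(0, x2 - w), max(0, y2 - h)
        padded = cv2.copyMakeBorder(image, pad_t, pad_b, pad_l, pad_r,
                                    cv2.BORDER_CONSTANT, value=mean_color.tolist())
        patch = padded[y1 + pad_t:y2 + pad_t, x1 + pad_l:x2 + pad_l]
        return cv2.resize(patch, (out_size, out_size))

The template crop would call crop_square(frame, cx, cy, side=p) and the search crop crop_square(frame, cx, cy, side=C), with the side lengths p and C defined above.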
In the Reid network (Re-identification, target re-identification), the VID2015, COCO and YouTube-BB data sets are used, with the following picture distribution: 90,000 from VID2015, 90,000 from COCO and 100,000 from YouTube-BB, 260,000 in total; the batch of images per iteration is 4 × 8. First, 8 videos are randomly selected from the data set by category; one picture is selected from each video, and then 3 different pictures are randomly selected within the T frames before and after that picture. Each video sequence carries a label representing the category information of the target: instances selected from the same video share the same target label, and instances from different videos have different labels. Each picture is then cropped to 127 × 127 according to the template-sample cropping scheme. In the data sets COCO, VID2015 and YouTube-BB, T is 20, 30 and 20, respectively.
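A minimal sketch of this sampling strategy is given below, assuming a dataset object with a videos list whose elements expose a frames list; these attribute names, and the helper name sample_reid_batch, are illustrative assumptions:

    import random

    def sample_reid_batch(dataset, num_videos=8, samples_per_video=4, max_gap=30):
        # Pick num_videos videos, and from each video pick one anchor frame plus
        # (samples_per_video - 1) frames within +/- max_gap of the anchor.
        # All instances from the same video share one identity label.
        batch_frames, batch_labels = [], []
        for label, video in enumerate(random.sample(dataset.videos, num_videos)):
            anchor = random.randrange(len(video.frames))
            lo = max(0, anchor - max_gap)
            hi = min(len(video.frames) - 1, anchor + max_gap)
            neighbors = random.sample(range(lo, hi + 1), samples_per_video - 1)
            for idx in [anchor] + neighbors:
                batch_frames.append(video.frames[idx])   # later cropped to 127 x 127
                batch_labels.append(label)
        return batch_frames, batch_labels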
After cropping, data enhancement preprocessing operations, including contrast enhancement and noise addition, are finally applied to the image set to increase the training difficulty.
Step S2, training a target tracking network model, wherein the target tracking network model comprises a classification regression branch and a target re-identification branch, the classification regression branch comprises a full convolution twin network module and a classification regression module, a backbone network of the full convolution twin network module is consistent with a backbone network of the target re-identification module, the classification regression branch and the target re-identification branch are alternately trained when the target tracking network model is trained, and the backbone network of the full convolution twin network module and the backbone network of the target re-identification module share parameters and weights.
The target tracking network model constructed by the present application is shown in fig. 2, wherein (a) is a full convolution twin network (full convolution twin network module) for extracting image features, (b) is a Reid network (target re-identification module) for enhancing the target expression capability of the backbone network, and (c) is a classification regression module for predicting classification and bounding box information. The Reid network is independently used as a target re-identification branch, and the full convolution twin network module and the classification regression module form a classification regression branch.
The full convolution twin network module comprises a neural network architecture of two identical sub-networks, which have the same configuration, i.e. the same parameters and weights, and parameter updates are applied to both sub-networks together. The application uses GoogLeNet as the backbone network of the twin network (CNN in fig. 2) and replaces the single 5 × 5 convolutional layer in GoogLeNet with a convolutional layer composed of two sequential 3 × 3 convolution kernels, reducing the parameter count while maintaining the receptive field. In addition, GoogLeNet uses convolution kernels of different sizes, increasing diversity. With GoogLeNet as the backbone network, the output of the last layer is directly taken as the feature matrix of the image, which fuses the semantic information of the image well and realizes multi-scale fusion of the target. The input template image and search image extract similar feature information through the twin network.
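A minimal PyTorch sketch of the 5 × 5 substitution described above is given below; the helper name double_3x3 and the use of BatchNorm/ReLU between the two convolutions are assumptions, not the patent's exact layer configuration:

    import torch.nn as nn

    def double_3x3(in_ch, mid_ch, out_ch):
        # Two stacked 3x3 convolutions cover the same 5x5 receptive field as a
        # single 5x5 convolution while using fewer parameters
        # (2 * 3 * 3 = 18 vs 5 * 5 = 25 weights per input/output channel pair).
        return nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )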
Considering that, in long-term video target tracking, tracking errors easily occur when a target highly similar to the tracked target appears in the search area, the application designs a target re-identification branch that extracts global features of the target and learns to classify similar and dissimilar targets, so as to enhance the target expression of the twin network and further improve the model's ability to distinguish similar targets.
Fig. 2 (b) shows the network structure of the target re-identification branch (Reid branch). The backbone network of the Reid branch used to extract image features is consistent with the twin network, together forming a triplet network, and shares parameters and weights with it. An input picture passes through the backbone network to obtain a 5 × 5 × n feature matrix, which is then fed to the Transformer-Head to obtain a global feature vector of size 1 × 2048 that focuses more on global information. The loss is then calculated using the triplet loss, increasing the distance between different classes and decreasing the distance within the same class, to enhance the network's ability to distinguish similar objects. Here n is 256.
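For illustration, the triplet loss on the global feature vectors can be computed with PyTorch's built-in TripletMarginLoss as sketched below; the margin value and the random tensors are assumptions, since the patent does not specify them:

    import torch
    import torch.nn as nn

    triplet_loss = nn.TripletMarginLoss(margin=0.3)   # margin value is assumed

    # anchor / positive: two instances of the same target from one video;
    # negative: an instance of a different target (from a different video).
    anchor   = torch.randn(8, 2048)   # global feature vectors from the Reid branch
    positive = torch.randn(8, 2048)
    negative = torch.randn(8, 2048)
    loss_tri = triplet_loss(anchor, positive, negative)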
The tracked target may undergo significant deformation during long-term tracking, such as rotation, occlusion and distortion; therefore, how to better learn the global information of the target is a major challenge in this field. The method incorporates a Transformer network, whose structure is shown in fig. 3; through the global attention characteristic of the Transformer when processing images, the model can better learn various invariant features, thereby reducing errors caused by target deformation during tracking.
After the Transformer-Head network performs a pixel-by-pixel flattening (flatten) operation on the feature matrix obtained from the backbone network, a feature sequence is obtained. Because the original Transformer is permutation-invariant, sinusoidal position coding is applied to each position of the feature sequence and added to the feature at the corresponding position, yielding a new feature sequence that serves as the input sequence of the Transformer-Head. Fig. 3 shows the network structure of the Transformer-Head in this scheme, which mainly consists of an encoder network and a decoder network. The encoder part is stacked from N encoding layers. Each encoding layer includes a multi-head self-attention module and a fully-connected feed-forward network; that is, each layer is made up of two parts: a multi-head self-attention mechanism, which disperses the attention calculation into different subspaces to learn attention from multiple aspects, and a simple fully-connected feed-forward network composed of two fully-connected layers with a ReLU activation function. In this embodiment, N is 6, set with reference to the graphics card of the machine, the speed and other considerations. The encoder module extracts the feature dependencies of each part of the image and enhances the original features with global context information, so that the model can learn distinctive features and enhance the target expression capability of the network.
The decoder module takes the enhanced features from the encoder module and target query features (Target Query) as input; the target query features are obtained by applying a 1 × 1 convolution as a linear transformation to the input sequence of the Transformer-Head. The decoder part is likewise stacked from N decoding layers, and each decoding layer comprises a multi-head self-attention module and a fully-connected feed-forward network. Finally, a decoded sequence of the same size as the input sequence is obtained, and after dimensionality reduction and averaging a global feature vector of size 1 × C is obtained, where C is 1024.
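The Transformer-Head described above can be sketched roughly as follows, a simplified sketch under the stated settings N = 6 and C = 1024 that uses PyTorch's standard encoder/decoder layers; the module name, the projection layers and the head count are assumptions, and the patent's exact layer details may differ:

    import math
    import torch
    import torch.nn as nn

    class TransformerHead(nn.Module):
        # Flatten the 5x5xn backbone feature map into a sequence, add sinusoidal
        # position coding, run the encoder/decoder stack, and average the decoded
        # sequence into a single 1xC global feature vector.
        def __init__(self, in_channels=256, d_model=1024, n_layers=6, n_heads=8):
            super().__init__()
            self.proj = nn.Conv2d(in_channels, d_model, kernel_size=1)
            self.query_proj = nn.Conv2d(in_channels, d_model, kernel_size=1)  # 1x1 conv -> target query
            enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
            self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

        @staticmethod
        def sinusoidal_pos(length, d_model):
            pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                            * (-math.log(10000.0) / d_model))
            pe = torch.zeros(length, d_model)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return pe

        def forward(self, feat):                                    # feat: (B, n, 5, 5)
            seq = self.proj(feat).flatten(2).transpose(1, 2)        # (B, 25, d_model)
            query = self.query_proj(feat).flatten(2).transpose(1, 2)
            seq = seq + self.sinusoidal_pos(seq.size(1), seq.size(2)).to(seq.device)
            memory = self.encoder(seq)                              # encoder: global context
            decoded = self.decoder(query, memory)                   # decoder: enhanced features + query
            return decoded.mean(dim=1)                              # (B, C) global feature vector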
On the basis of the twin network structure, a parallel target re-identification network branch that can be trained alternately is added; it adopts the same backbone structure and parameters as the twin network, and the backbone network and the classification regression network are trained alternately. The loss function of the target re-identification branch adopts the triplet loss. By combining the network-correction effect of the target re-identification task, the target expression of the twin network can be enhanced, and the ability to distinguish interference from similar targets can be better improved.
In the full convolution twin network module, after the template frame picture X and the search frame picture Z pass through the backbone network for feature extraction, two feature maps with shapes 5 × 5 × n and 13 × 13 × n are obtained, denoted φ(X) and φ(Z) respectively, where φ(X) is the feature map of the target template image and φ(Z) is the feature map of the search area image. Because image X and image Z pass through the same convolutional network, they learn similar feature expressions. A depth-wise cross-correlation operation is then performed on φ(X) and φ(Z) to obtain a response map R of shape 25 × 25 × n, where n is 256.
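The depth-wise cross-correlation can be implemented as a grouped convolution, as is common in Siamese trackers; the sketch below is one such implementation and is an assumption rather than the patent's exact code. The output spatial size depends on the feature-map sizes and any padding used (the text above states a 25 × 25 × n response map).

    import torch.nn.functional as F

    def depthwise_xcorr(search_feat, template_feat):
        # The template feature map is used as a per-channel convolution kernel
        # sliding over the search feature map, producing one response channel
        # per feature channel (depth-wise cross-correlation).
        b, c, h, w = search_feat.shape
        search = search_feat.reshape(1, b * c, h, w)
        kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
        resp = F.conv2d(search, kernel, groups=b * c)
        return resp.reshape(b, c, resp.size(2), resp.size(3))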
The classification regression module contains two branches: the upper part of fig. 2(c) is the classification branch network and the lower part is the regression branch network. The classification branch network comprises two parallel branches, a classification branch and a centrality branch: the classification branch predicts the foreground and background of the image, and the centrality branch calculates the centrality score of each pixel point p(i, j). The regression branch calculates the distances from the pixel point p(i, j) to the four boundaries of the target box.
The response map R is input into the classification branch network and the regression branch network; each first passes through a Refine model, a convolutional network with 4 convolution layers. The two convolutional networks have the same structure but different parameters, and every convolution layer has the same structure. The convolutional network is followed by the parallel classification branch and regression branch. The classification branch performs a classification task and a centrality task, and both tasks connect a 1 × 1 convolution layer after the Refine model. The number of output channels of the classification task is 2, giving a foreground-background classification map A_cls of shape 25 × 25 × 2; each pixel point indicates whether that point is target foreground or background, 1 meaning the point falls inside the target and 0 meaning it falls outside. The output channel of the centrality task is 1, giving a centrality score map A_cen of shape 25 × 25 × 1. The output channel of the regression branch network is 4, giving a target box regression map A_reg of shape 25 × 25 × 4, where each point is a 4-dimensional vector t(i,j) = (l*, t*, r*, b*); here l*, t*, r*, b* respectively represent the distances from the point to the left, top, right and bottom boundaries of the ground-truth target box.
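A rough PyTorch sketch of this head structure follows; the channel width, the 3 × 3 kernel inside the Refine model, and the class/module names are assumptions, while the output channel counts (2, 1, 4) follow the text above:

    import torch.nn as nn

    def refine_model(channels=256, num_layers=4):
        # The "Refine model": a small stack of convolution layers with identical structure.
        layers = []
        for _ in range(num_layers):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    class ClsRegHead(nn.Module):
        def __init__(self, channels=256):
            super().__init__()
            self.cls_refine = refine_model(channels)   # shared by classification and centrality tasks
            self.reg_refine = refine_model(channels)   # same structure, separate parameters
            self.cls_head = nn.Conv2d(channels, 2, kernel_size=1)   # foreground/background map A_cls
            self.cen_head = nn.Conv2d(channels, 1, kernel_size=1)   # centrality score map A_cen
            self.reg_head = nn.Conv2d(channels, 4, kernel_size=1)   # (l*, t*, r*, b*) regression map A_reg

        def forward(self, response):                   # response map R: (B, channels, 25, 25)
            c = self.cls_refine(response)
            r = self.reg_refine(response)
            return self.cls_head(c), self.cen_head(c), self.reg_head(r)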
An object on the picture is defined as B = (x0, y0, x1, y1, c) ∈ R^4 × {1, 2, …, C}, where (x0, y0) and (x1, y1) represent the top-left and bottom-right coordinates of the object respectively, and C represents the number of categories; in the COCO dataset, C is 80. Each point (i, j) in the feature map A_reg is mapped to the position (x, y) = (⌊s/2⌋ + i·s, ⌊s/2⌋ + j·s) in the original image, where s is the scaling factor from the feature map size to the original image size. The regression target t(i,j) = (l*, t*, r*, b*) is therefore calculated as:

l* = x − x0, t* = y − y0
r* = x1 − x, b* = y1 − y
the output channel of the central degree branch is 1, and a central degree score chart with the shape of 25 multiplied by 1 is obtained
Figure BDA00033863874700000814
Response graph
Figure BDA00033863874700000815
Each value C (x, y) of (a) represents a score of whether the location is at the center of the target. C (x, y) is defined as:
Figure BDA00033863874700000816
wherein l*,t*,b*,r*And predicting the result of each pixel point according to the regression branch. And calculating to obtain an enclosing frame of the target according to the product fusion of the classification branch and the central degree branch and the edge distance information of the regression branch, and completing target positioning.
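For a single feature-map location, the regression targets and the centrality score defined above can be computed as in the following sketch; the function names and argument layout are illustrative assumptions:

    import math

    def regression_targets(i, j, box, stride):
        # Map feature-map point (i, j) back to image coordinates and compute the
        # distances (l*, t*, r*, b*) to the ground-truth box (x0, y0, x1, y1).
        x = stride // 2 + i * stride
        y = stride // 2 + j * stride
        x0, y0, x1, y1 = box
        return x - x0, y - y0, x1 - x, y1 - y          # l*, t*, r*, b*

    def centrality(l, t, r, b):
        # Centrality score C(x, y) as defined above.
        return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))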
The loss function of the target tracking network model in this embodiment is:

L = L_cls + α·L_cen + β·L_reg + γ·L_tri

where L_cls is the classification loss, using cross-entropy loss; L_cen is the centrality loss, using binary cross-entropy loss; L_reg is the regression loss, using the intersection-over-union (IoU) loss; and L_tri is the triplet loss (TripletLoss). The values of α, β and γ are 1, 3 and 1, respectively.
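The combination of the four terms, following the weighting reconstructed above, can be sketched as:

    def total_loss(loss_cls, loss_cen, loss_reg, loss_tri, alpha=1.0, beta=3.0, gamma=1.0):
        # Combine classification, centrality, regression and triplet losses with
        # the weights alpha, beta, gamma = 1, 3, 1 stated above.
        return loss_cls + alpha * loss_cen + beta * loss_reg + gamma * loss_tri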
In the target tracking network model, the two branch networks can better learn from each other's training results, so during training the two branches are trained alternately. The Reid branch network and the classification regression branch network use the same backbone network to extract image features and share network parameters.
In a specific embodiment, when the number of training rounds is M, each training round is used for training the classification regression branch and the target re-identification branch once.
In another specific embodiment, when the number of training rounds is M, the first M1 rounds of freezing parameters of the backbone network, training the classification regression branch and the target re-identification branch, and the last M2 rounds of opening the backbone network, training the classification regression branch and the target re-identification branch, wherein the sum of M1 and M2 is equal to M.
For example, the number of training rounds is 20; each round trains the Reid branch network and the classification regression branch network once, and the initial learning rate of stochastic gradient descent is 0.001. The parameters of the backbone network are frozen in the first 10 rounds and only the Reid branch network and the classification regression network are trained; the backbone network is opened in the last 10 rounds and is trained together with each branch network, so that the backbone parameters can be optimized during the training of each branch while the parameters of the other branch remain unchanged. For the classification regression branch network, the batch size of each input is 76 and the total number of training samples is 6,000,000; for the Reid branch network, the batch size of each input is 32 and the total number of training samples is 260,000.
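The alternating schedule described above can be sketched as a training loop like the one below; the attribute names model.backbone, model.reid_loss and model.track_loss, and the use of a single optimizer, are illustrative assumptions:

    def train_alternating(model, reid_loader, track_loader, optimizer,
                          epochs=20, freeze_epochs=10):
        # Each epoch trains the Reid branch and the classification-regression branch
        # once each; the shared backbone is frozen for the first freeze_epochs epochs.
        for epoch in range(epochs):
            frozen = epoch < freeze_epochs
            for p in model.backbone.parameters():
                p.requires_grad = not frozen

            for batch in reid_loader:                  # Reid branch (triplet loss)
                optimizer.zero_grad()
                model.reid_loss(batch).backward()
                optimizer.step()

            for batch in track_loader:                 # classification-regression branch
                optimizer.zero_grad()
                model.track_loss(batch).backward()
                optimizer.step()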
And step S3, regarding the video sequence to be tracked, selecting a tracking target in the first frame by taking the first frame of the video sequence as a template frame, inputting the template frame and each frame after the first frame into a trained target tracking network model as an image pair respectively, determining the position of the tracking target in the video frame, and realizing target tracking.
Fig. 4 is a schematic diagram of the tracking flow of the present invention, and the specific tracking procedure is described with reference to fig. 4. A video sequence is input; the first frame of the video is used as the template frame X, the tracking target is selected in the first frame, and then each subsequent frame Z(t), t = 2, 3, 4, …, is input into the trained network together with the template frame. The model input is first initialized, which mainly involves selecting the template frame and the search frame region: the template frame is cropped to 127 × 127 around the center of the target box, and the search is performed within a 256 × 256 region. The network extracts and embeds the feature information of the two pictures to obtain a correlated response map; through the classification regression network, the classification map cls and the centrality score map cen shown in fig. 4 are obtained, together with the regression feature map reg, where (l, t, r, b) is the distance from the current pixel point P(x, y) to the four sides of the target box. For a position P(x, y), the network generates a 6-dimensional vector T_{x,y}. The final predicted target box, obtained by averaging the regression boxes of the highest-scoring pixel and its neighborhood pixels, is shown on the right of fig. 4.
During tracking, the position and size of the target box change only slightly between adjacent frames. After the target box output by the model is obtained, a size penalty P_{x,y}, a cosine window H and a balance weight factor μ_{x,y} are introduced to calculate a new centrality-weighted classification score map, and the score q is calculated over the pixel points as:

q = argmax_{x,y} { (1 − μ_{x,y}) · cls_{x,y} × P_{x,y} + μ_{x,y} · H }
If only the bounding box at the single point q is used as the target box, jitter may occur between adjacent frames in actual tracking. It is observed in experiments that pixel points near q may also belong to the target, so the top k pixel points in the n-neighborhood of q are selected according to cls_{x,y} × P_{x,y}, and the final prediction result is the weighted average of the selected k regression boxes. The tracking result is most stable when n is 1 and k is 5. Here x and y are pixel coordinates.
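The score fusion and top-k averaging described above can be sketched as follows; the balance weight value, and the assumption that the size penalty map and cosine window are precomputed as in common Siamese trackers, are illustrative and not specified by the patent:

    import numpy as np

    def select_box(cls_map, penalty, cosine_window, reg_map, mu=0.3, n=1, k=5):
        # Fuse the classification score with the size penalty and cosine window,
        # pick the best location q, then take a weighted average of the regression
        # boxes of the top-k points in the n-neighborhood of q.
        score = (1 - mu) * cls_map * penalty + mu * cosine_window
        qy, qx = np.unravel_index(np.argmax(score), score.shape)

        y0, y1 = max(0, qy - n), min(cls_map.shape[0], qy + n + 1)
        x0, x1 = max(0, qx - n), min(cls_map.shape[1], qx + n + 1)
        local = (cls_map * penalty)[y0:y1, x0:x1]
        order = np.argsort(local.ravel())[::-1][:k]    # rank neighborhood points by cls x penalty
        ys, xs = np.unravel_index(order, local.shape)
        weights = local.ravel()[order]
        weights = weights / weights.sum()
        boxes = reg_map[:, ys + y0, xs + x0]           # (4, k) regression vectors
        return (boxes * weights).sum(axis=1)           # weighted-average predicted box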
Through the above steps, the position of the tracking target in the current frame can be determined.
In summary, the application discloses a twin network structure target tracking method fusing the Reid method, which addresses target deformation, occlusion and similar-target interference in long-term video tracking. Its main feature is that it improves the target expression capability of the network and learns to better distinguish similar targets, thereby reducing similar-target confusion and assisting video monitoring.
The above embodiments express only several implementations of the present application; their description is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A twin network structure target tracking method for merging target re-identification is characterized by comprising the following steps:
acquiring a general visual target tracking data set, preprocessing the general visual target tracking data set, and generating a training sample set;
training a target tracking network model, wherein the target tracking network model comprises a classification regression branch and a target re-identification branch, the classification regression branch comprises a full convolution twin network module and a classification regression module, a backbone network of the full convolution twin network module is consistent with a backbone network of the target re-identification module, the classification regression branch and the target re-identification branch are alternately trained when the target tracking network model is trained, and the backbone network of the full convolution twin network module and the backbone network of the target re-identification module share parameters and weights;
for a video sequence to be tracked, a first frame of the video sequence is used as a template frame, a tracking target is selected in the first frame, the template frame and each frame behind the first frame are respectively used as an image pair to be input into a trained target tracking network model, the position of the tracking target in the video frame is determined, and target tracking is achieved.
2. The twin network structure target tracking method fusing target re-identification according to claim 1, wherein the backbone network of the full convolution twin network module and the backbone network of the target re-identification module adopt GoogLeNet, and a convolution layer composed of two sequential 3 × 3 convolution kernels is adopted to replace the single 5 × 5 convolution layer in GoogLeNet.
3. The twin network structure target tracking method fusing target re-identification as claimed in claim 1, wherein a Transformer-Head network is further included after the backbone network of the target re-identification module, and after the Transformer-Head network performs pixel-by-pixel flattening operation on the feature matrix obtained by the backbone network, sinusoidal position coding is performed on each position of the feature sequence, and then the sinusoidal position coding is added to the feature of the corresponding position, so as to obtain a new feature sequence as an input sequence of the Transformer-Head network, and then the new feature sequence sequentially passes through an encoder network and a decoder network.
4. The twin network structure target tracking method fusing target re-identification according to claim 3, wherein the encoder network is formed by stacking a plurality of encoding layers, each encoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the decoder network is formed by stacking a plurality of decoding layers, each decoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the input of the decoder network comprises the output of the encoder network and target query features obtained by performing a linear transformation on the Transformer-Head input sequence through a 1 × 1 convolution.
5. The twin network structure target tracking method fusing target re-identification according to claim 3, wherein the classification regression module comprises a classification branch network and a regression branch network, and the classification branch network comprises a classification branch and a centrality branch.
6. The twin network structure target tracking method fusing target re-recognition according to claim 1, wherein the alternately training classification regression branch and target re-recognition branch when training the target tracking network model comprises:
when the number of training rounds is M, each round of training is performed once on the classification regression branch and the target re-identification branch.
7. The twin network structure target tracking method fusing target re-recognition according to claim 1, wherein the alternately training classification regression branch and target re-recognition branch when training the target tracking network model comprises:
when the number of training rounds is M, the parameters of the backbone network are frozen in the first M1 rounds, the classification regression branch and the target re-identification branch are trained, the backbone network is opened in the last M2 rounds, the classification regression branch and the target re-identification branch are trained, and the sum of M1 and M2 is equal to M.
CN202111451845.9A 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification Pending CN113963032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111451845.9A CN113963032A (en) 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111451845.9A CN113963032A (en) 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification

Publications (1)

Publication Number Publication Date
CN113963032A true CN113963032A (en) 2022-01-21

Family

ID=79472706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111451845.9A Pending CN113963032A (en) 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification

Country Status (1)

Country Link
CN (1) CN113963032A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898472A (en) * 2022-04-26 2022-08-12 华南理工大学 Signature identification method and system based on twin vision Transformer network
CN115061574A (en) * 2022-07-06 2022-09-16 陈伟 Human-computer interaction system based on visual core algorithm
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
WO2023216572A1 (en) * 2022-05-07 2023-11-16 深圳先进技术研究院 Cross-video target tracking method and system, and electronic device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898472A (en) * 2022-04-26 2022-08-12 华南理工大学 Signature identification method and system based on twin vision Transformer network
CN114898472B (en) * 2022-04-26 2024-04-05 华南理工大学 Signature identification method and system based on twin vision transducer network
WO2023216572A1 (en) * 2022-05-07 2023-11-16 深圳先进技术研究院 Cross-video target tracking method and system, and electronic device and storage medium
CN115061574A (en) * 2022-07-06 2022-09-16 陈伟 Human-computer interaction system based on visual core algorithm
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116486203B (en) * 2023-04-24 2024-02-02 燕山大学 Single-target tracking method based on twin network and online template updating
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
CN116703980B (en) * 2023-08-04 2023-10-24 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination