CN113963032A - Twin network structure target tracking method fusing target re-identification - Google Patents

Twin network structure target tracking method fusing target re-identification

Info

Publication number
CN113963032A
CN113963032A
Authority
CN
China
Prior art keywords
target
network
branch
identification
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111451845.9A
Other languages
Chinese (zh)
Inventor
毛姣莉
崔滢
郑河荣
郭东岩
朱鹏飞
王晓航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111451845.9A priority Critical patent/CN113963032A/en
Publication of CN113963032A publication Critical patent/CN113963032A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a twin network structure target tracking method fusing target re-identification, which trains a target tracking network model comprising a classification regression branch and a target re-identification branch. The classification regression branch comprises a full convolution twin network module and a classification regression module; the backbone network of the full convolution twin network module is consistent with the backbone network of the target re-identification module, and the two backbone networks share parameters and weights. For a video sequence to be tracked, the first frame of the video sequence is used as a template frame, the tracking target is selected in the first frame, and the template frame and each frame after the first frame are respectively input as an image pair into the trained target tracking network model to determine the position of the tracking target in the video frame and realize target tracking. The invention can better improve the ability to distinguish interference from similar targets.

Description

Twin network structure target tracking method fusing target re-identification
Technical Field
The application belongs to the technical field of target tracking, and particularly relates to a twin network structure target tracking method fusing target re-identification.
Background
Video target tracking is a research topic spanning multiple fields such as pattern recognition, computer vision and artificial intelligence, and is a research hotspot because of its wide application value in fields such as intelligent video surveillance, security and education. However, in real scenes, due to factors such as target posture change, video jitter, occlusion and similar-target interference, it is difficult for a tracking algorithm to achieve both accurate results and real-time operation, and this problem is particularly obvious in long-term tracking tasks. Therefore, how to implement an accurate and real-time target tracking method remains a difficult research problem.
Conventional visual target tracking methods can generally be divided into two categories: target tracking based on generative models and target tracking based on discriminative models. A generative method models the target in the current frame and searches the next frame for the region most similar to the model, taking that region as the predicted position. Its main idea is to compute the joint probability of target and sample and find the sample closest to the target as the estimate of the current target. Common generative tracking algorithms include Kalman filtering, particle filtering and mean-shift. However, generative methods model only the target and do not exploit the difference between foreground and background information; when the background resembles the target in appearance during tracking, the interference is severe. Discriminative methods make full use of the foreground and background information of the template frame and focus on separating the foreground from the background. A discriminative method treats the tracking task as a binary classification process: the region where the target object lies in the template frame is taken as a positive sample, the remaining background information as negative samples, and the target is then separated from the whole picture in subsequent frames. The biggest difference from generative methods is that the classifier is obtained by machine learning and uses background information during training, so it can concentrate on distinguishing foreground from background. Its main idea is to compute the conditional probability and directly judge whether a region is the target. Classical discriminative methods include Struck and TLD.
The basic idea of correlation filtering tracking is to design a filter template and perform a correlation operation between the template and the target candidate region; the position of the maximum response is the target position in the current frame. When correlating the input image with the filter template, applying the fast Fourier transform to both converts the convolution into an element-wise product, greatly reducing the amount of computation. Compared with traditional tracking methods, correlation filtering greatly accelerates tracking. Classical correlation filtering methods include MOSSE, KCF/DCF, SAMF and SRDCF.
Deep learning methods, with their strong feature expression capability, are highly favored for target tracking; in particular, since 2016 the Siamese-series twin-network-based tracking algorithms have markedly improved the balance of accuracy and speed. A twin tracker formulates the visual target tracking problem as learning a generic similarity map by cross-correlating the learned feature representations of the target template and the search area. For example, SiamFC, proposed by Luca Bertinetto et al., performs similarity matching on the search area image at multiple scales and determines the target position from multiple similarity maps; although this multi-scale matching improves tracking accuracy, it compromises speed. DSiam proposes a dynamic twin network that effectively exploits the appearance changes and background suppression learned online from previous frames; StructSiam concentrates on learning the local structure of the target, improving the discriminative power of tracking; SiamRPN adds a region proposal network after the twin network, further improving tracking speed. However, the above methods all use the classical, relatively shallow AlexNet as the backbone network and cannot obtain deeper features. SiamRPN++ adopts a very deep neural network while still running at high real-time speed; SiamCAR proposes a fully convolutional classification and regression twin network structure and achieves good accuracy and speed with a simpler structure, but it adopts ResNet-50 as the backbone and performs only a simple fusion of the last three layers of features to obtain a multi-scale effect, which is not ideal. Although these trackers have certain advantages, they still struggle with targets that undergo large shape deformation and appearance change during long-term tracking.
Disclosure of Invention
Aiming at the defects of existing methods, the application provides a twin network structure target tracking method fusing target re-identification, and realizes a simple and effective general single-target tracking method by enhancing the robustness of target expression in the twin network tracking model.
In order to achieve the purpose, the technical scheme of the application is as follows:
a twin network structure target tracking method fusing target re-identification comprises the following steps:
acquiring a general visual target tracking data set, preprocessing the general visual target tracking data set, and generating a training sample set;
training a target tracking network model, wherein the target tracking network model comprises a classification regression branch and a target re-identification branch, the classification regression branch comprises a full convolution twin network module and a classification regression module, a backbone network of the full convolution twin network module is consistent with a backbone network of the target re-identification module, the classification regression branch and the target re-identification branch are alternately trained when the target tracking network model is trained, and the backbone network of the full convolution twin network module and the backbone network of the target re-identification module share parameters and weights;
for a video sequence to be tracked, a first frame of the video sequence is used as a template frame, a tracking target is selected in the first frame, the template frame and each frame behind the first frame are respectively used as an image pair to be input into a trained target tracking network model, the position of the tracking target in the video frame is determined, and target tracking is achieved.
Further, the backbone network of the full convolution twin network module and the backbone network of the target re-identification module adopt GoogLeNet, and a convolutional layer composed of two sequential 3 × 3 convolution kernels is adopted to replace the single 5 × 5 convolution layer in GoogLeNet.
Further, a Transformer-Head network is further included behind the backbone network of the target re-identification module; after the Transformer-Head network performs a pixel-by-pixel flattening operation on the feature matrix obtained by the backbone network, sinusoidal position coding is performed on each position of the feature sequence and then added to the feature of the corresponding position, and the resulting new feature sequence serves as the input sequence of the Transformer-Head network, which then passes sequentially through an encoder network and a decoder network.
Furthermore, the encoder network is formed by stacking a plurality of encoding layers, each encoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the decoder network is formed by stacking a plurality of decoding layers, each decoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the input of the decoder network comprises the output of the encoder network and target query features obtained by performing a linear transformation on the Transformer-Head input sequence through a 1 × 1 convolution.
Further, the classification regression module comprises a classification branch network and a regression branch network, wherein the classification branch network comprises a classification branch and a centrality branch.
Further, when the target tracking network model is trained, alternately training a classification regression branch and a target re-recognition branch includes:
when the number of training rounds is M, each round of training is performed once on the classification regression branch and the target re-identification branch.
Further, when the target tracking network model is trained, alternately training a classification regression branch and a target re-recognition branch includes:
when the number of training rounds is M, the parameters of the backbone network are frozen in the first M1 rounds, the classification regression branch and the target re-identification branch are trained, the backbone network is opened in the last M2 rounds, the classification regression branch and the target re-identification branch are trained, and the sum of M1 and M2 is equal to M.
The twin network structure target tracking method fusing target re-identification can enhance the target expression of the twin network by combining the network-correction effect of the target re-identification task, and can better improve the ability to distinguish interference from similar targets.
Drawings
FIG. 1 is a flowchart of a twin network structure target tracking method for merging target re-identification according to the present application;
FIG. 2 is a schematic diagram of a target tracking network model according to the present application;
FIG. 3 is a Reid branch network diagram;
fig. 4 is a schematic view of target tracking.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a twin network structure target tracking method fusing target re-identification, including:
and step S1, acquiring a general visual target tracking data set, preprocessing the general visual target tracking data set, and generating a training sample set.
According to the method, a general visual target tracking data set is used, a target template image (template frame) and a search area image (search frame) are cut out from an original training set according to the position of a target in an image, and the cut-out image set forms a training data set used by the method.
The training data use the common single-target tracking data sets COCO, DET, VID2015 and YouTube-BB; the LaSOT and TrackingNet data sets are added in follow-up experiments. In the training process, the inputs are an image pair consisting of the template frame and the search frame, plus one additional image of the same size as the template frame; the image pair is used as the input of the classification regression network, and the additional image is used as the input of the Reid network. The image set is cropped prior to input.
The target template image is cropped as follows: a square crop region is determined with the target center as the crop center, and the side length of the square is set to p, where p may exceed the size of the original picture; any part that exceeds the picture is filled with the picture color mean, and the crop is then scaled to 127 × 127. Here p = 2(w + h), where w denotes the width of the target box in pixels and h denotes its height.
The target search image is cropped as follows: one frame is selected within the range of T frames before or after the template frame; with the target center as the crop center, a square crop region with side length C is determined, where C may exceed the size of the original picture; the exceeding part is filled with the picture color mean, and the image is then scaled to 127 × 127. In the data sets COCO, DET, VID2015 and YouTube-BB, T is 1, 1, 100 and 3, respectively.
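As an illustration of the cropping scheme above, the following is a minimal Python sketch; the function name crop_square, the use of OpenCV, and the handling of boundary rounding are assumptions for illustration and are not part of the patent:

    import cv2
    import numpy as np

    def crop_square(image, cx, cy, side, out_size=127):
        # Crop a square of length `side` centered at (cx, cy); any part falling
        # outside the image is padded with the per-channel color mean, and the
        # patch is then resized to out_size x out_size, as described above.
        mean_color = image.mean(axis=(0, 1))
        half = side / 2.0
        x1, y1 = int(round(cx - half)), int(round(cy - half))
        x2, y2 = int(round(cx + half)), int(round(cy + half))
        h, w = image.shape[:2]
        pad_l, pad_t = max(0, -x1), max(0, -y1)
        pad_r, pad_b = max(0, x2 - w), max(0, y2 - h)
        padded = cv2.copyMakeBorder(image, pad_t, pad_b, pad_l, pad_r,
                                    cv2.BORDER_CONSTANT, value=mean_color.tolist())
        patch = padded[y1 + pad_t:y2 + pad_t, x1 + pad_l:x2 + pad_l]
        return cv2.resize(patch, (out_size, out_size))

The template crop would call crop_square(frame, cx, cy, side=p) and the search crop crop_square(frame, cx, cy, side=C), with the side lengths p and C defined above.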
In the Reid network (Re-identification, target re-identification), the VID2015, COCO and YouTube-BB data sets are used, with the following picture distribution: 90,000 from VID2015, 90,000 from COCO and 100,000 from YouTube-BB, 260,000 in total; the batch of images per iteration is 4 × 8. First, 8 videos are randomly selected from the data set by category; one picture is selected from each video, and then 3 different pictures are randomly selected within the T frames before and after that picture. Each video sequence carries a label representing the category information of the target: instances selected from the same video share the same target label, and instances from different videos have different labels. Each picture is then cropped to 127 × 127 according to the template-sample cropping scheme. In the data sets COCO, VID2015 and YouTube-BB, T is 20, 30 and 20, respectively.
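A minimal sketch of this sampling strategy is given below, assuming a dataset object with a videos list whose elements expose a frames list; these attribute names, and the helper name sample_reid_batch, are illustrative assumptions:

    import random

    def sample_reid_batch(dataset, num_videos=8, samples_per_video=4, max_gap=30):
        # Pick num_videos videos, and from each video pick one anchor frame plus
        # (samples_per_video - 1) frames within +/- max_gap of the anchor.
        # All instances from the same video share one identity label.
        batch_frames, batch_labels = [], []
        for label, video in enumerate(random.sample(dataset.videos, num_videos)):
            anchor = random.randrange(len(video.frames))
            lo = max(0, anchor - max_gap)
            hi = min(len(video.frames) - 1, anchor + max_gap)
            neighbors = random.sample(range(lo, hi + 1), samples_per_video - 1)
            for idx in [anchor] + neighbors:
                batch_frames.append(video.frames[idx])   # later cropped to 127 x 127
                batch_labels.append(label)
        return batch_frames, batch_labels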
After cropping, data enhancement preprocessing operations, including contrast enhancement and noise addition, are finally applied to the image set to increase the training difficulty.
Step S2, training a target tracking network model, wherein the target tracking network model comprises a classification regression branch and a target re-identification branch, the classification regression branch comprises a full convolution twin network module and a classification regression module, a backbone network of the full convolution twin network module is consistent with a backbone network of the target re-identification module, the classification regression branch and the target re-identification branch are alternately trained when the target tracking network model is trained, and the backbone network of the full convolution twin network module and the backbone network of the target re-identification module share parameters and weights.
The target tracking network model constructed by the present application is shown in fig. 2, wherein (a) is a full convolution twin network (full convolution twin network module) for extracting image features, (b) is a Reid network (target re-identification module) for enhancing the target expression capability of the backbone network, and (c) is a classification regression module for predicting classification and bounding box information. The Reid network is independently used as a target re-identification branch, and the full convolution twin network module and the classification regression module form a classification regression branch.
The full convolution twin network module comprises a neural network architecture of two identical sub-networks, which have the same configuration, i.e. the same parameters and weights, and parameter updates are applied to both sub-networks together. The application uses GoogLeNet as the backbone network of the twin network (CNN in fig. 2) and replaces the single 5 × 5 convolutional layer in GoogLeNet with a convolutional layer composed of two sequential 3 × 3 convolution kernels, reducing the parameter count while maintaining the receptive field. In addition, GoogLeNet uses convolution kernels of different sizes, increasing diversity. With GoogLeNet as the backbone network, the output of the last layer is directly taken as the feature matrix of the image, which fuses the semantic information of the image well and realizes multi-scale fusion of the target. The input template image and search image extract similar feature information through the twin network.
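A minimal PyTorch sketch of the 5 × 5 substitution described above is given below; the helper name double_3x3 and the use of BatchNorm/ReLU between the two convolutions are assumptions, not the patent's exact layer configuration:

    import torch.nn as nn

    def double_3x3(in_ch, mid_ch, out_ch):
        # Two stacked 3x3 convolutions cover the same 5x5 receptive field as a
        # single 5x5 convolution while using fewer parameters
        # (2 * 3 * 3 = 18 vs 5 * 5 = 25 weights per input/output channel pair).
        return nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )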
Considering that, in long-term video target tracking, tracking errors easily occur when a target highly similar to the tracked target appears in the search area, the application designs a target re-identification branch that extracts global features of the target and learns to classify similar and dissimilar targets, so as to enhance the target expression of the twin network and further improve the model's ability to distinguish similar targets.
Fig. 2 (b) shows the network structure of the target re-identification branch (Reid branch). The backbone network of the Reid branch used to extract image features is consistent with the twin network, together forming a triplet network, and shares parameters and weights with it. An input picture passes through the backbone network to obtain a 5 × 5 × n feature matrix, which is then fed to the Transformer-Head to obtain a global feature vector of size 1 × 2048 that focuses more on global information. The loss is then calculated using the triplet loss, increasing the distance between different classes and decreasing the distance within the same class, to enhance the network's ability to distinguish similar objects. Here n is 256.
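For illustration, the triplet loss on the global feature vectors can be computed with PyTorch's built-in TripletMarginLoss as sketched below; the margin value and the random tensors are assumptions, since the patent does not specify them:

    import torch
    import torch.nn as nn

    triplet_loss = nn.TripletMarginLoss(margin=0.3)   # margin value is assumed

    # anchor / positive: two instances of the same target from one video;
    # negative: an instance of a different target (from a different video).
    anchor   = torch.randn(8, 2048)   # global feature vectors from the Reid branch
    positive = torch.randn(8, 2048)
    negative = torch.randn(8, 2048)
    loss_tri = triplet_loss(anchor, positive, negative)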
The tracked target may undergo significant deformation during long-term tracking, such as rotation, occlusion and distortion; therefore, how to better learn the global information of the target is a major challenge in this field. The method incorporates a Transformer network, whose structure is shown in fig. 3; through the global attention characteristic of the Transformer when processing images, the model can better learn various invariant features, thereby reducing errors caused by target deformation during tracking.
After the Transformer-Head network performs a pixel-by-pixel flattening (flatten) operation on the feature matrix obtained from the backbone network, a feature sequence is obtained. Because the original Transformer is permutation-invariant, sinusoidal position coding is applied to each position of the feature sequence and added to the feature at the corresponding position, yielding a new feature sequence that serves as the input sequence of the Transformer-Head. Fig. 3 shows the network structure of the Transformer-Head in this scheme, which mainly consists of an encoder network and a decoder network. The encoder part is stacked from N encoding layers. Each encoding layer includes a multi-head self-attention module and a fully-connected feed-forward network; that is, each layer is made up of two parts: a multi-head self-attention mechanism, which disperses the attention calculation into different subspaces to learn attention from multiple aspects, and a simple fully-connected feed-forward network composed of two fully-connected layers with a ReLU activation function. In this embodiment, N is 6, set with reference to the graphics card of the machine, the speed and other considerations. The encoder module extracts the feature dependencies of each part of the image and enhances the original features with global context information, so that the model can learn distinctive features and enhance the target expression capability of the network.
The decoder module takes the enhanced features from the encoder module and target query features (Target Query) as input; the target query features are obtained by applying a 1 × 1 convolution as a linear transformation to the input sequence of the Transformer-Head. The decoder part is likewise stacked from N decoding layers, and each decoding layer comprises a multi-head self-attention module and a fully-connected feed-forward network. Finally, a decoded sequence of the same size as the input sequence is obtained, and after dimensionality reduction and averaging a global feature vector of size 1 × C is obtained, where C is 1024.
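The Transformer-Head described above can be sketched roughly as follows, a simplified sketch under the stated settings N = 6 and C = 1024 that uses PyTorch's standard encoder/decoder layers; the module name, the projection layers and the head count are assumptions, and the patent's exact layer details may differ:

    import math
    import torch
    import torch.nn as nn

    class TransformerHead(nn.Module):
        # Flatten the 5x5xn backbone feature map into a sequence, add sinusoidal
        # position coding, run the encoder/decoder stack, and average the decoded
        # sequence into a single 1xC global feature vector.
        def __init__(self, in_channels=256, d_model=1024, n_layers=6, n_heads=8):
            super().__init__()
            self.proj = nn.Conv2d(in_channels, d_model, kernel_size=1)
            self.query_proj = nn.Conv2d(in_channels, d_model, kernel_size=1)  # 1x1 conv -> target query
            enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
            self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

        @staticmethod
        def sinusoidal_pos(length, d_model):
            pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                            * (-math.log(10000.0) / d_model))
            pe = torch.zeros(length, d_model)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return pe

        def forward(self, feat):                                    # feat: (B, n, 5, 5)
            seq = self.proj(feat).flatten(2).transpose(1, 2)        # (B, 25, d_model)
            query = self.query_proj(feat).flatten(2).transpose(1, 2)
            seq = seq + self.sinusoidal_pos(seq.size(1), seq.size(2)).to(seq.device)
            memory = self.encoder(seq)                              # encoder: global context
            decoded = self.decoder(query, memory)                   # decoder: enhanced features + query
            return decoded.mean(dim=1)                              # (B, C) global feature vector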
On the basis of the twin network structure, a parallel target re-identification network branch that can be trained alternately is added; it adopts the same backbone structure and parameters as the twin network, and the backbone network and the classification regression network are trained alternately. The loss function of the target re-identification branch adopts the triplet loss. By combining the network-correction effect of the target re-identification task, the target expression of the twin network can be enhanced, and the ability to distinguish interference from similar targets can be better improved.
In the full convolution twin network module, after the template frame picture X and the search frame picture Z pass through the backbone network for feature extraction, two feature maps with shapes 5 × 5 × n and 13 × 13 × n are obtained, denoted φ(X) and φ(Z) respectively, where φ(X) is the feature map of the target template image and φ(Z) is the feature map of the search area image. Because image X and image Z pass through the same convolutional network, they learn similar feature expressions. A depth-wise cross-correlation operation is then performed on φ(X) and φ(Z) to obtain a response map R of shape 25 × 25 × n, where n is 256.
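The depth-wise cross-correlation can be implemented as a grouped convolution, as is common in Siamese trackers; the sketch below is one such implementation and is an assumption rather than the patent's exact code. The output spatial size depends on the feature-map sizes and any padding used (the text above states a 25 × 25 × n response map).

    import torch.nn.functional as F

    def depthwise_xcorr(search_feat, template_feat):
        # The template feature map is used as a per-channel convolution kernel
        # sliding over the search feature map, producing one response channel
        # per feature channel (depth-wise cross-correlation).
        b, c, h, w = search_feat.shape
        search = search_feat.reshape(1, b * c, h, w)
        kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
        resp = F.conv2d(search, kernel, groups=b * c)
        return resp.reshape(b, c, resp.size(2), resp.size(3))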
The classification regression module contains two branches: the upper part of fig. 2(c) is the classification branch network and the lower part is the regression branch network. The classification branch network comprises two parallel branches, a classification branch and a centrality branch: the classification branch predicts the foreground and background of the image, and the centrality branch calculates the centrality score of each pixel point p(i, j). The regression branch calculates the distances from the pixel point p(i, j) to the four boundaries of the target box.
The response map R is input into the classification branch network and the regression branch network; each first passes through a Refine model, a convolutional network with 4 convolution layers. The two convolutional networks have the same structure but different parameters, and every convolution layer has the same structure. The convolutional network is followed by the parallel classification branch and regression branch. The classification branch performs a classification task and a centrality task, and both tasks connect a 1 × 1 convolution layer after the Refine model. The number of output channels of the classification task is 2, giving a foreground-background classification map A_cls of shape 25 × 25 × 2; each pixel point indicates whether that point is target foreground or background, 1 meaning the point falls inside the target and 0 meaning it falls outside. The output channel of the centrality task is 1, giving a centrality score map A_cen of shape 25 × 25 × 1. The output channel of the regression branch network is 4, giving a target box regression map A_reg of shape 25 × 25 × 4, where each point is a 4-dimensional vector t(i,j) = (l*, t*, r*, b*); here l*, t*, r*, b* respectively represent the distances from the point to the left, top, right and bottom boundaries of the ground-truth target box.
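A rough PyTorch sketch of this head structure follows; the channel width, the 3 × 3 kernel inside the Refine model, and the class/module names are assumptions, while the output channel counts (2, 1, 4) follow the text above:

    import torch.nn as nn

    def refine_model(channels=256, num_layers=4):
        # The "Refine model": a small stack of convolution layers with identical structure.
        layers = []
        for _ in range(num_layers):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    class ClsRegHead(nn.Module):
        def __init__(self, channels=256):
            super().__init__()
            self.cls_refine = refine_model(channels)   # shared by classification and centrality tasks
            self.reg_refine = refine_model(channels)   # same structure, separate parameters
            self.cls_head = nn.Conv2d(channels, 2, kernel_size=1)   # foreground/background map A_cls
            self.cen_head = nn.Conv2d(channels, 1, kernel_size=1)   # centrality score map A_cen
            self.reg_head = nn.Conv2d(channels, 4, kernel_size=1)   # (l*, t*, r*, b*) regression map A_reg

        def forward(self, response):                   # response map R: (B, channels, 25, 25)
            c = self.cls_refine(response)
            r = self.reg_refine(response)
            return self.cls_head(c), self.cen_head(c), self.reg_head(r)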
An object on the picture is defined as B = (x0, y0, x1, y1, c) ∈ R^4 × {1, 2, …, C}, where (x0, y0) and (x1, y1) represent the top-left and bottom-right coordinates of the object respectively, and C represents the number of categories; in the COCO dataset, C is 80. Each point (i, j) in the feature map A_reg is mapped to the position (x, y) = (⌊s/2⌋ + i·s, ⌊s/2⌋ + j·s) in the original image, where s is the scaling factor from the feature map size to the original image size. The regression target t(i,j) = (l*, t*, r*, b*) is therefore calculated as:

l* = x − x0, t* = y − y0
r* = x1 − x, b* = y1 − y
the output channel of the central degree branch is 1, and a central degree score chart with the shape of 25 multiplied by 1 is obtained
Figure BDA00033863874700000814
Response graph
Figure BDA00033863874700000815
Each value C (x, y) of (a) represents a score of whether the location is at the center of the target. C (x, y) is defined as:
Figure BDA00033863874700000816
wherein l*,t*,b*,r*And predicting the result of each pixel point according to the regression branch. And calculating to obtain an enclosing frame of the target according to the product fusion of the classification branch and the central degree branch and the edge distance information of the regression branch, and completing target positioning.
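For a single feature-map location, the regression targets and the centrality score defined above can be computed as in the following sketch; the function names and argument layout are illustrative assumptions:

    import math

    def regression_targets(i, j, box, stride):
        # Map feature-map point (i, j) back to image coordinates and compute the
        # distances (l*, t*, r*, b*) to the ground-truth box (x0, y0, x1, y1).
        x = stride // 2 + i * stride
        y = stride // 2 + j * stride
        x0, y0, x1, y1 = box
        return x - x0, y - y0, x1 - x, y1 - y          # l*, t*, r*, b*

    def centrality(l, t, r, b):
        # Centrality score C(x, y) as defined above.
        return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))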
The loss function of the target tracking network model in this embodiment is:

L = L_cls + α·L_cen + β·L_reg + γ·L_tri

where L_cls is the classification loss, using cross-entropy loss; L_cen is the centrality loss, using binary cross-entropy loss; L_reg is the regression loss, using the intersection-over-union (IoU) loss; and L_tri is the triplet loss (TripletLoss). The values of α, β and γ are 1, 3 and 1, respectively.
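The combination of the four terms, following the weighting reconstructed above, can be sketched as:

    def total_loss(loss_cls, loss_cen, loss_reg, loss_tri, alpha=1.0, beta=3.0, gamma=1.0):
        # Combine classification, centrality, regression and triplet losses with
        # the weights alpha, beta, gamma = 1, 3, 1 stated above.
        return loss_cls + alpha * loss_cen + beta * loss_reg + gamma * loss_tri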
In the target tracking network model, the two branch networks can better learn from each other's training results, so during training the two branches are trained alternately. The Reid branch network and the classification regression branch network use the same backbone network to extract image features and share network parameters.
In a specific embodiment, when the number of training rounds is M, each training round is used for training the classification regression branch and the target re-identification branch once.
In another specific embodiment, when the number of training rounds is M, the first M1 rounds of freezing parameters of the backbone network, training the classification regression branch and the target re-identification branch, and the last M2 rounds of opening the backbone network, training the classification regression branch and the target re-identification branch, wherein the sum of M1 and M2 is equal to M.
For example, the number of training rounds is 20; each round trains the Reid branch network and the classification regression branch network once, and the initial learning rate of stochastic gradient descent is 0.001. The parameters of the backbone network are frozen in the first 10 rounds and only the Reid branch network and the classification regression network are trained; the backbone network is opened in the last 10 rounds and is trained together with each branch network, so that the backbone parameters can be optimized during the training of each branch while the parameters of the other branch remain unchanged. For the classification regression branch network, the batch size of each input is 76 and the total number of training samples is 6,000,000; for the Reid branch network, the batch size of each input is 32 and the total number of training samples is 260,000.
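The alternating schedule described above can be sketched as a training loop like the one below; the attribute names model.backbone, model.reid_loss and model.track_loss, and the use of a single optimizer, are illustrative assumptions:

    def train_alternating(model, reid_loader, track_loader, optimizer,
                          epochs=20, freeze_epochs=10):
        # Each epoch trains the Reid branch and the classification-regression branch
        # once each; the shared backbone is frozen for the first freeze_epochs epochs.
        for epoch in range(epochs):
            frozen = epoch < freeze_epochs
            for p in model.backbone.parameters():
                p.requires_grad = not frozen

            for batch in reid_loader:                  # Reid branch (triplet loss)
                optimizer.zero_grad()
                model.reid_loss(batch).backward()
                optimizer.step()

            for batch in track_loader:                 # classification-regression branch
                optimizer.zero_grad()
                model.track_loss(batch).backward()
                optimizer.step()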
And step S3, regarding the video sequence to be tracked, selecting a tracking target in the first frame by taking the first frame of the video sequence as a template frame, inputting the template frame and each frame after the first frame into a trained target tracking network model as an image pair respectively, determining the position of the tracking target in the video frame, and realizing target tracking.
Fig. 4 is a schematic diagram of the tracking flow of the present invention, and the specific tracking procedure is described with reference to fig. 4. A video sequence is input; the first frame of the video is used as the template frame X, the tracking target is selected in the first frame, and then each subsequent frame Z(t), t = 2, 3, 4, …, is input into the trained network together with the template frame. The model input is first initialized, which mainly involves selecting the template frame and the search frame region: the template frame is cropped to 127 × 127 around the center of the target box, and the search is performed within a 256 × 256 region. The network extracts and embeds the feature information of the two pictures to obtain a correlated response map; through the classification regression network, the classification map cls and the centrality score map cen shown in fig. 4 are obtained, together with the regression feature map reg, where (l, t, r, b) is the distance from the current pixel point P(x, y) to the four sides of the target box. For a position P(x, y), the network generates a 6-dimensional vector T_{x,y}. The final predicted target box, obtained by averaging the regression boxes of the highest-scoring pixel and its neighborhood pixels, is shown on the right of fig. 4.
During tracking, the position and size of the target box change only slightly between adjacent frames. After the target box output by the model is obtained, a size penalty P_{x,y}, a cosine window H and a balance weight factor μ_{x,y} are introduced to calculate a new centrality-weighted classification score map, and the score q is calculated over the pixel points as:

q = argmax_{x,y} { (1 − μ_{x,y}) · cls_{x,y} × P_{x,y} + μ_{x,y} · H }
If only the bounding box at the single point q is used as the target box, jitter may occur between adjacent frames in actual tracking. It is observed in experiments that pixel points near q may also belong to the target, so the top k pixel points in the n-neighborhood of q are selected according to cls_{x,y} × P_{x,y}, and the final prediction result is the weighted average of the selected k regression boxes. The tracking result is most stable when n is 1 and k is 5. Here x and y are pixel coordinates.
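The score fusion and top-k averaging described above can be sketched as follows; the balance weight value, and the assumption that the size penalty map and cosine window are precomputed as in common Siamese trackers, are illustrative and not specified by the patent:

    import numpy as np

    def select_box(cls_map, penalty, cosine_window, reg_map, mu=0.3, n=1, k=5):
        # Fuse the classification score with the size penalty and cosine window,
        # pick the best location q, then take a weighted average of the regression
        # boxes of the top-k points in the n-neighborhood of q.
        score = (1 - mu) * cls_map * penalty + mu * cosine_window
        qy, qx = np.unravel_index(np.argmax(score), score.shape)

        y0, y1 = max(0, qy - n), min(cls_map.shape[0], qy + n + 1)
        x0, x1 = max(0, qx - n), min(cls_map.shape[1], qx + n + 1)
        local = (cls_map * penalty)[y0:y1, x0:x1]
        order = np.argsort(local.ravel())[::-1][:k]    # rank neighborhood points by cls x penalty
        ys, xs = np.unravel_index(order, local.shape)
        weights = local.ravel()[order]
        weights = weights / weights.sum()
        boxes = reg_map[:, ys + y0, xs + x0]           # (4, k) regression vectors
        return (boxes * weights).sum(axis=1)           # weighted-average predicted box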
Through the above steps, the position of the tracking target in the current frame can be determined.
In summary, the application discloses a twin network structure target tracking method fusing the Reid method, which addresses target deformation, occlusion and similar-target interference in long-term video tracking. Its main feature is that it improves the target expression capability of the network and learns to better distinguish similar targets, thereby reducing similar-target confusion and assisting video monitoring.
The above embodiments express only several implementations of the present application; their description is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A twin network structure target tracking method for merging target re-identification is characterized by comprising the following steps:
acquiring a general visual target tracking data set, preprocessing the general visual target tracking data set, and generating a training sample set;
training a target tracking network model, wherein the target tracking network model comprises a classification regression branch and a target re-identification branch, the classification regression branch comprises a full convolution twin network module and a classification regression module, a backbone network of the full convolution twin network module is consistent with a backbone network of the target re-identification module, the classification regression branch and the target re-identification branch are alternately trained when the target tracking network model is trained, and the backbone network of the full convolution twin network module and the backbone network of the target re-identification module share parameters and weights;
for a video sequence to be tracked, a first frame of the video sequence is used as a template frame, a tracking target is selected in the first frame, the template frame and each frame behind the first frame are respectively used as an image pair to be input into a trained target tracking network model, the position of the tracking target in the video frame is determined, and target tracking is achieved.
2. The twin network structure target tracking method fusing target re-identification according to claim 1, wherein the backbone network of the full convolution twin network module and the backbone network of the target re-identification module adopt GoogLeNet, and a convolution layer composed of two sequential 3 × 3 convolution kernels is adopted to replace the single 5 × 5 convolution layer in GoogLeNet.
3. The twin network structure target tracking method fusing target re-identification as claimed in claim 1, wherein a Transformer-Head network is further included after the backbone network of the target re-identification module, and after the Transformer-Head network performs pixel-by-pixel flattening operation on the feature matrix obtained by the backbone network, sinusoidal position coding is performed on each position of the feature sequence, and then the sinusoidal position coding is added to the feature of the corresponding position, so as to obtain a new feature sequence as an input sequence of the Transformer-Head network, and then the new feature sequence sequentially passes through an encoder network and a decoder network.
4. The twin network structure target tracking method fusing target re-identification according to claim 3, wherein the encoder network is formed by stacking a plurality of encoding layers, each encoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the decoder network is formed by stacking a plurality of decoding layers, each decoding layer comprising a multi-head self-attention module and a fully-connected feedforward network; the input of the decoder network comprises the output of the encoder network and target query features obtained by performing a linear transformation on the Transformer-Head input sequence through a 1 × 1 convolution.
5. The twin network structure target tracking method fusing target re-identification according to claim 3, wherein the classification regression module comprises a classification branch network and a regression branch network, and the classification branch network comprises a classification branch and a centrality branch.
6. The twin network structure target tracking method fusing target re-recognition according to claim 1, wherein the alternately training classification regression branch and target re-recognition branch when training the target tracking network model comprises:
when the number of training rounds is M, each round of training is performed once on the classification regression branch and the target re-identification branch.
7. The twin network structure target tracking method fusing target re-recognition according to claim 1, wherein the alternately training classification regression branch and target re-recognition branch when training the target tracking network model comprises:
when the number of training rounds is M, the parameters of the backbone network are frozen in the first M1 rounds, the classification regression branch and the target re-identification branch are trained, the backbone network is opened in the last M2 rounds, the classification regression branch and the target re-identification branch are trained, and the sum of M1 and M2 is equal to M.
CN202111451845.9A 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification Pending CN113963032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111451845.9A CN113963032A (en) 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111451845.9A CN113963032A (en) 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification

Publications (1)

Publication Number Publication Date
CN113963032A true CN113963032A (en) 2022-01-21

Family

ID=79472706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111451845.9A Pending CN113963032A (en) 2021-12-01 2021-12-01 Twin network structure target tracking method fusing target re-identification

Country Status (1)

Country Link
CN (1) CN113963032A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898472A (en) * 2022-04-26 2022-08-12 华南理工大学 Signature identification method and system based on twin vision Transformer network
CN115061574A (en) * 2022-07-06 2022-09-16 陈伟 Human-computer interaction system based on visual core algorithm
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
WO2023216572A1 (en) * 2022-05-07 2023-11-16 深圳先进技术研究院 Cross-video target tracking method and system, and electronic device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898472A (en) * 2022-04-26 2022-08-12 华南理工大学 Signature identification method and system based on twin vision Transformer network
CN114898472B (en) * 2022-04-26 2024-04-05 华南理工大学 Signature identification method and system based on twin vision transducer network
WO2023216572A1 (en) * 2022-05-07 2023-11-16 深圳先进技术研究院 Cross-video target tracking method and system, and electronic device and storage medium
CN115061574A (en) * 2022-07-06 2022-09-16 陈伟 Human-computer interaction system based on visual core algorithm
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116486203B (en) * 2023-04-24 2024-02-02 燕山大学 Single-target tracking method based on twin network and online template updating
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
CN116703980B (en) * 2023-08-04 2023-10-24 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination