CN112541468A - Target tracking method based on dual-template response fusion - Google Patents
- Publication number: CN112541468A (application CN202011524190.9A)
- Authority: CN (China)
- Prior art keywords: module, response, fusion, frame, target
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Pattern recognition: classification techniques
- G06F18/253 — Pattern recognition: fusion techniques of extracted features
- G06N3/045 — Neural networks: combinations of networks
- G06N3/084 — Neural network learning methods: backpropagation, e.g. using gradient descent
- G06V2201/07 — Image or video recognition: target detection
Abstract
The invention discloses a target tracking method based on dual-template response fusion, aiming to solve two problems: a fixed template loses accuracy as the target's appearance changes, while a dynamically updated template loses robustness when tracking drifts. First, a target tracking system consisting of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module is constructed. Next, a training set is selected and the target tracking system is trained. Finally, the trained system performs target tracking on each frame of a video sequence, comprising feature extraction, cross-correlation response calculation, cross-correlation response fusion, and prediction of the target's position and size, to obtain the tracking result. During tracking, the twin (Siamese) tracking network uses both the template initialized from the first video frame and a template dynamically updated from the tracking result of each subsequent frame, so that the strengths and weaknesses of the two templates complement each other, improving tracking precision while preserving robustness and real-time performance.
Description
Technical Field
The invention belongs to the field of computer-vision target tracking, and particularly relates to a target tracking method based on dual-template response fusion within a twin (Siamese) network structure.
Background
Target tracking is an important task in computer vision and can be divided, by the number of targets, into single-target tracking and multi-target tracking, of which single-target tracking has the wider range of application. Specifically, for a video sequence, single-target tracking initializes the tracker with a given target bounding box in the first frame, then predicts the location of the target in each subsequent frame and draws a box around it. Target tracking plays an important role in military, agricultural, security and other technical fields; with the rapid development of artificial-intelligence technology and the growing demands of practical applications, ever-higher performance is required of tracking algorithms, so research on target tracking is very necessary.
Single-target tracking algorithms fall into two main categories: generative tracking and discriminative tracking. Generative tracking focuses on characterizing the intrinsic distribution of the target's appearance data, without discriminating the target from the background. Discriminative tracking treats tracking as a classification problem, separating the target from the background by learning a classifier. Because discriminative algorithms have strong foreground/background discrimination ability, their accuracy and robustness are higher than those of generative algorithms.
The two most popular discriminative tracking approaches at present are correlation-filter-based tracking and deep-learning-based tracking. Deep-learning-based tracking extracts deep features of the target; these features carry richer semantic information and express the target's appearance more strongly, so deep-learning-based methods currently achieve better accuracy and robustness. Among deep-learning-based methods, the twin (Siamese) network structure is the most widely adopted. A twin network is a pair of convolutional neural networks with identical parameters. A twin-network tracker uses the twin network for feature extraction: one branch serves as the template branch, takes the GT (ground-truth target bounding box) in the first video frame as input, and extracts the target's features as the template; the other branch serves as the detection branch, which first crops a detection area from each frame after the first and then outputs that area's features. A subsequent classification-and-regression network extracts candidate boxes from the detection area, matches them against the template features, selects the candidate with the highest matching score as the target box, and regresses corrections to its length and width.
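The core matching operation of a twin tracker, sliding the template feature map over the search-area feature map as a convolution kernel, can be sketched in a few lines of NumPy. The shapes and names below are illustrative only (the patent's own kernels are learned features), but the output-size arithmetic is the general rule for valid cross-correlation:

```python
import numpy as np

def cross_correlate(template, search):
    """Slide a template feature map over a search feature map and return the
    similarity (cross-correlation) response.  Shape convention: (H, W, C)."""
    th, tw, c = template.shape
    sh, sw, sc = search.shape
    assert c == sc, "channel counts must match"
    oh, ow = sh - th + 1, sw - tw + 1          # valid cross-correlation size
    response = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = search[y:y + th, x:x + tw, :]
            response[y, x] = np.sum(patch * template)  # inner product = similarity
    return response

# Toy example: a 6x6x256 template over a 26x26x256 search feature map
# yields a (26 - 6 + 1) x (26 - 6 + 1) = 21x21 response map.
rng = np.random.default_rng(0)
z = rng.standard_normal((6, 6, 256))
x = rng.standard_normal((26, 26, 256))
r = cross_correlate(z, x)
print(r.shape)  # (21, 21)
```

In practice this loop is implemented as a single convolution call with the template as the kernel, which is exactly how the cross-correlation response module below uses its integrated template features.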
Currently, in twin network based tracking methods, the template is either fixed or dynamically updated.
Template fixing initializes the template with the GT of the first video frame and keeps it unchanged throughout the subsequent tracking. Since the template is initialized with the GT, it is absolutely correct at the start; but in subsequent video frames the target may change in size, shape, etc., causing its features to change. A fixed template therefore cannot capture the target's latest features, fails to match them correctly, and tracking becomes inaccurate.
Dynamic template updating initializes the template in the first frame and then updates it with the predicted target box of each frame during subsequent tracking. Because the template is updated with each frame's tracking result, it always contains the target's latest features, which can improve tracking accuracy; however, if tracking is lost or drifts, the updated template becomes contaminated by background features, is no longer robust, and tracking goes wrong.
Therefore, how to balance the accuracy and robustness of the template in twin-network-based target tracking, so that the template always acquires the target's latest features during tracking without being damaged by tracking drift, thereby improving the overall accuracy and robustness of the tracking method, is a hotspot problem studied by researchers in this field.
Disclosure of Invention
The invention aims to solve the technical problem that the template fixing mode in the twin network tracking method may reduce the accuracy of the template, and the template dynamic updating mode may reduce the robustness of the template.
On the basis of existing twin-network tracking methods, the invention provides a dual-template target tracking method: an initial template (initialized from the first video frame) and an updated template (dynamically updated from the tracking result of each subsequent frame) are used simultaneously in the twin tracking network, so that the strengths and weaknesses of the two templates complement each other and tracking precision improves.
In order to solve the above technical problems, the technical scheme of the invention is as follows. First, a target tracking system consisting of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module is constructed. Then ImageNet VID and DET (see "Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252"), COCO, Youtube-BoundingBoxes (see "Real E, Shlens J, Mazzocchi S, et al. YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video [C]// CVPR 2017", the paper by Esteban Real: a large, high-precision, manually labeled dataset for object detection in video) and GOT-10k (see "Huang L, Zhao X, Huang K. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019", the paper by Lianghua Huang: a large-scale, high-diversity benchmark for generic object tracking in the wild) are selected as the training set for training the target tracking system. Finally, the trained system performs feature extraction, cross-correlation response calculation, cross-correlation response fusion and target position and size prediction on each frame of a video sequence, obtaining the target tracking result.
The specific technical scheme of the invention is as follows:
the method comprises the following steps of firstly, building a target tracking system, wherein the target tracking system is composed of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module.
The feature extraction module is connected with the cross-correlation response module and consists of a convolutional-neural-network sub-module and a linear fusion sub-module. The convolutional-neural-network sub-module extracts features from input images: it receives from outside the template image Z_0 of the first video frame and the target search-area image of each subsequent frame (that of the i-th frame is denoted X_i), extracts features from Z_0 and X_i respectively, sends the resulting initial template feature z_0 and search-area feature x_i together to the cross-correlation response module, and simultaneously sends z_0 to the linear fusion sub-module. The convolutional-neural-network sub-module is a modified AlexNet (see "Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]// Advances in Neural Information Processing Systems. 2012: 1097-1105", the paper by Alex Krizhevsky: ImageNet classification using a deep convolutional neural network). The modified AlexNet has 16 layers in total: 5 convolutional layers, 2 max-pooling layers, 5 BatchNorm layers and 4 ReLU activation layers; the convolutional layers are layers 1, 5, 9, 12 and 15, the pooling layers are layers 3 and 7, the BatchNorm layers are layers 2, 6, 10, 13 and 16, and the remaining layers are ReLU activation layers. Z_0 has size 127 × 127 × 3 and its content is the target to be tracked; the feature extraction module extracts from Z_0 the initial template feature z_0 of size 6 × 6 × 256. X_i has size 287 × 287 × 3, and the tracking task is to find in X_i the target most similar to Z_0; the feature extraction module extracts from X_i the search-area feature x_i of size 26 × 26 × 256.
The meaning of the dimensions is: the first two values are the length and width of the image, respectively, and the third value represents the number of channels, e.g., 6 × 6 × 256 represents 256 channels, each channel having a length and width of 6.
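As a check on the stated feature sizes, the spatial arithmetic of the backbone can be traced layer by layer. The kernel sizes and strides below are the standard SiamRPN-style AlexNet settings, which are an assumption (the patent lists only the layer order), but they reproduce exactly the 127 → 6 and 287 → 26 mappings given above:

```python
def conv_out(n, k, s=1):
    """Spatial size after a valid (no padding) conv/pool with kernel k, stride s."""
    return (n - k) // s + 1

def alexnet_spatial(n):
    # Layer order from the text: conv1, pool1, conv2, pool2, conv3, conv4, conv5
    # (BatchNorm and ReLU layers do not change spatial size).
    # Kernel sizes/strides are assumed SiamRPN-style AlexNet settings.
    n = conv_out(n, 11, 2)  # conv1: 11x11, stride 2
    n = conv_out(n, 3, 2)   # maxpool1: 3x3, stride 2
    n = conv_out(n, 5, 1)   # conv2: 5x5
    n = conv_out(n, 3, 2)   # maxpool2: 3x3, stride 2
    n = conv_out(n, 3, 1)   # conv3: 3x3
    n = conv_out(n, 3, 1)   # conv4: 3x3
    n = conv_out(n, 3, 1)   # conv5: 3x3
    return n

print(alexnet_spatial(127), alexnet_spatial(287))  # 6 26
```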
The linear fusion sub-module takes as input the initial template feature z_0, the fused template feature of the previous frame (i.e. frame i-1), denoted z̃_{i-1}, and the tracked target feature z_{i-1} of frame i-1, where z_0 is the output of the convolutional-neural-network sub-module, z̃_{i-1} is the output of the linear fusion sub-module during tracking of frame i-1, and z_{i-1} is the feature of the tracking-result image of frame i-1. The linear fusion sub-module fuses the three features z_0, z̃_{i-1} and z_{i-1} by linear weighting to obtain the fused template feature of the current frame (i.e. frame i), denoted z̃_i, and sends z̃_i to the cross-correlation response module. In the first frame, the fused template feature is the initial template feature z_0; thereafter, the fused template feature of frame i is used in the target tracking task of frame i+1 to obtain the tracking result of frame i+1.
The cross-correlation response module is connected with the feature extraction module and the response fusion module. It consists of two parallel branches, a first classification branch and a first regression branch; both branches are convolutional neural networks with identical network structures but different parameters. The first classification branch generates the classification cross-correlation responses and consists of two convolution sub-modules with the same structure, a classification kernel sub-module and a classification search sub-module, each containing 1 convolutional layer, 1 BatchNorm layer and 1 ReLU activation layer. The classification kernel sub-module first receives z_0 from the convolutional-neural-network sub-module and generates the integrated initial template feature z0_cls; it then receives z̃_i from the linear fusion sub-module and generates the further-integrated fused template feature zf_cls, where cls is an abbreviation for classification. Both z0_cls and zf_cls have size 4 × 4 × 256. The classification search sub-module receives x_i from the convolutional-neural-network sub-module and integrates it into the integrated search-area feature x_cls of size 24 × 24 × 256. The first classification branch then first uses z0_cls as a convolution kernel, with x_cls as the convolved region, to obtain the classification cross-correlation response of the initial template, R0_cls; it then uses zf_cls as a convolution kernel, with x_cls as the convolved region, to obtain the classification cross-correlation response of the fused template, Rf_cls. Both R0_cls and Rf_cls have size 22 × 22 × 256. Finally, the first classification branch outputs R0_cls and Rf_cls to the response fusion module.
The first regression branch generates the regression cross-correlation responses. Like the first classification branch, it contains two sub-modules, a regression kernel sub-module and a regression search sub-module, whose network structures are the same as those of the first classification branch's sub-modules. The regression kernel sub-module first receives z_0 from the convolutional-neural-network sub-module and generates the integrated initial template feature z0_reg; it then receives z̃_i from the linear fusion sub-module and generates the integrated fused template feature zf_reg, where reg is an abbreviation for regression. Both z0_reg and zf_reg have size 4 × 4 × 256. The regression search sub-module receives x_i from the convolutional-neural-network sub-module and integrates it into the integrated search-area feature x_reg of size 24 × 24 × 256. The first regression branch then first uses z0_reg as a convolution kernel, with x_reg as the convolved region, to obtain the regression cross-correlation response of the initial template, R0_reg; it then uses zf_reg as a convolution kernel, with x_reg as the convolved region, to obtain the regression cross-correlation response of the fused template, Rf_reg. Both R0_reg and Rf_reg have size 22 × 22 × 256. Finally, the first regression branch outputs R0_reg and Rf_reg to the response fusion module.
The response fusion module is a convolutional neural network connected with the cross-correlation response module and the target frame output module. It consists of two parallel neural-network branches: a second classification branch and a second regression branch. Each branch comprises 5 layers: layers 1 and 3 are convolutional layers, layers 2 and 4 are BatchNorm layers, and the last layer is a ReLU activation layer. The second classification branch receives the two classification cross-correlation responses R0_cls and Rf_cls from the first classification branch, stacks them along the channel dimension into a classification stacked response of size 22 × 22 × 512, then performs classification fusion on the stacked response to obtain the classification fusion response Rfuse_cls of size 22 × 22 × 256, and sends Rfuse_cls to the target frame output module. The network structure of the second regression branch is the same as that of the second classification branch. It receives the two regression cross-correlation responses R0_reg and Rf_reg from the first regression branch, stacks them along the channel dimension into a regression stacked response of size 22 × 22 × 512, performs regression fusion to obtain the regression fusion response Rfuse_reg of size 22 × 22 × 256, and sends Rfuse_reg to the target frame output module.
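A minimal sketch of the stack-then-fuse operation of the response fusion module, with the learned 5-layer branch reduced to a single channel-mixing step plus ReLU, and the residual connection described later in step 4.8 included. The weight matrix `w` and the reduction to one layer are illustrative assumptions, not the patent's exact architecture:

```python
import numpy as np

def fuse_responses(r_init, r_fused, w):
    """Channel-stack two 22x22x256 responses, apply a 1x1 'fusion' convolution
    (expressed as a channel-mixing matmul), then add the fused-template
    response back as a residual connection."""
    stacked = np.concatenate([r_init, r_fused], axis=-1)  # 22x22x512
    mixed = stacked @ w                                   # 1x1 conv as matmul -> 22x22x256
    mixed = np.maximum(mixed, 0)                          # ReLU
    return mixed + r_fused                                # residual connection

rng = np.random.default_rng(1)
r0 = rng.standard_normal((22, 22, 256))   # initial-template response
rf = rng.standard_normal((22, 22, 256))   # fused-template response
w = rng.standard_normal((512, 256)) * 0.01
out = fuse_responses(r0, rf, w)
print(out.shape)  # (22, 22, 256)
```

With zero fusion weights the output reduces to the fused-template response alone, which is the behavior a residual connection guarantees at initialization.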
The target frame output module is connected with the response fusion module and consists of two parallel neural-network branches: a third classification branch and a third regression branch. Each branch comprises 4 layers: layers 1 and 4 are convolutional layers (kernel size 1 × 1), layer 2 is a BatchNorm layer, and layer 3 is a ReLU activation layer. The third classification branch receives the classification fusion response Rfuse_cls from the second classification branch and performs response classification on it; the classification result has dimension 22 × 22 × 2k, where k is the number of anchor boxes in the Region Proposal Network (RPN, see "Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [C]// Advances in Neural Information Processing Systems. 2015: 91-99", the paper by Shaoqing Ren: Faster R-CNN achieves real-time target detection through a region proposal network). 2k means there are k anchor boxes and each anchor box corresponds to 2 classification values, which respectively represent the probabilities that the image in the anchor box is the target and is not the target. The network structure of the third regression branch is the same as that of the third classification branch; it receives the regression fusion response Rfuse_reg from the second regression branch and performs response regression on it, giving a regression result of dimension 22 × 22 × 4k. Here k is the same anchor-box count as in the third classification branch, and 4k means there are k anchor boxes and each anchor box corresponds to 4 regression values dx, dy, dw, dh, which respectively represent the corrections to the x and y coordinates and to the length and width of the corresponding original anchor box.
The target frame output module selects the anchor box with the highest target probability in the classification result as the target prediction box, and takes the four correction values in the regression result corresponding to that anchor box to correct the anchor box's position and size; the corrected anchor box is the target's tracking box.
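The selection-and-correction rule above can be sketched as follows. The (dx, dy, dw, dh) parametrization shown (shift the center by dx·w and dy·h, rescale width and height exponentially) is the standard RPN decoding and is assumed rather than quoted from the patent:

```python
import numpy as np

def decode_best_box(cls_scores, reg_deltas, anchors):
    """Pick the anchor with the highest target probability and apply its
    (dx, dy, dw, dh) corrections.  Anchors are (x, y, w, h) center-format
    boxes; the correction formula is the standard RPN parametrization."""
    best = int(np.argmax(cls_scores))            # most target-like anchor
    x, y, w, h = anchors[best]
    dx, dy, dw, dh = reg_deltas[best]
    return (float(x + dx * w),                   # shift center x
            float(y + dy * h),                   # shift center y
            float(w * np.exp(dw)),               # rescale width
            float(h * np.exp(dh)))               # rescale height

anchors = np.array([[100., 100., 50., 50.], [120., 80., 40., 80.]])
scores = np.array([0.2, 0.9])                    # anchor 1 is most target-like
deltas = np.array([[0., 0., 0., 0.], [0.1, -0.05, 0.0, 0.0]])
box = decode_best_box(scores, deltas, anchors)
print(box)  # (124.0, 76.0, 40.0, 80.0)
```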
Secondly, preparing a training data set of the target tracking system, wherein the method comprises the following steps:
the training data set of the system is divided into two parts: first training data set T1And a second training data set T2,T1For training feature extraction module, cross-correlation response module and target box output module, T2For training the response fusion module.
2.1 Select 100000 positive sample pairs from VID and YTB (i.e. Youtube-BoundingBox) as follows: sample each video sequence of VID and YTB; from one video sequence, randomly select a frame as the template image, and randomly select, within a range of no more than one hundred frames after the template image, another frame as the search-area image; the two images so selected form 1 positive sample pair, and 100000 positive sample pairs are generated in this sampling manner. VID and YTB are video data sets in which each video contains a specific target, and the makers of the data sets have annotated a target box for every video frame; a target box is annotated by the coordinates of the rectangular box's top-left corner and the box's length and width, and the rectangular box frames the target's position.
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting one frame from one video sequence of VID and YTB as a template image, and randomly selecting one frame from the other video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair. 100000 negative sample pairs are generated in this sampling manner.
2.3 Select 100000 positive sample pairs from DET and COCO as follows: randomly select from DET and COCO two different images of the same object as 1 positive sample pair, and generate 100000 positive sample pairs in this sampling manner. One image of each positive pair serves as the template image and the other as the search-area image. DET and COCO are target detection data sets containing target-box annotations.
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 select two images of the same kind but not the same object from DET and COCO, one as template image and the other as search area image, to get 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
2.4.2 select two different objects of different classes, one as a template image and the other as a search area image, from DET and COCO to obtain 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
100000 negative sample pairs are finally obtained through 2.4.1 and 2.4.2.
2.5 Scale the template images in all positive and negative sample pairs to 127 × 127 × 3 and all search-area images to 287 × 287 × 3. All the positive and negative sample pairs finally obtained form the first training data set T1.
2.6 Choose the training set of GOT-10k as the second training data set T2.
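The sampling rules of steps 2.1 and 2.2 can be sketched as follows; function names and the frame-index convention are illustrative:

```python
import random

def sample_positive_pair(video_len, max_gap=100, rng=random):
    """Step 2.1: template frame and a search frame at most `max_gap` frames
    after it, drawn from the same video."""
    t = rng.randrange(video_len - 1)                 # template frame index
    hi = min(video_len - 1, t + max_gap)             # stay within the video
    s = rng.randrange(t + 1, hi + 1)                 # search frame after t
    return t, s

def sample_negative_pair(num_videos, video_len, rng=random):
    """Step 2.2: template and search frames drawn from two different videos."""
    v1, v2 = rng.sample(range(num_videos), 2)        # two distinct videos
    return (v1, rng.randrange(video_len)), (v2, rng.randrange(video_len))

random.seed(0)
t, s = sample_positive_pair(500)
(va, fa), (vb, fb) = sample_negative_pair(10, 500)
print(t, s, va, vb)
```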
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method.
3.2 use of T1Training a feature extraction module, a cross-correlation response module and a target frame output module, wherein the method comprises the following steps:
3.2.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size batchsize = 128; initialize the learning rate lr = 0.01, set lr = 0.0005 in the last round with exponential decay of the learning rate during training, and initialize the hyper-parameter λ = 1.2. Define the number of samples in T1 as Len(T1).
3.2.2 Using T1Training a feature extraction module, a cross-correlation response module and a target frame output module, wherein the method comprises the following steps:
3.2.2.1 initialization variable d is 1.
3.2.2.2 Take the d-th through (d + batchsize)-th pictures of T1 as training data and input them into the feature extraction module; the direction of the data flow is feature extraction module → cross-correlation response module → target frame output module. The stochastic gradient descent (SGD) method (see "LeCun Y, et al. Backpropagation applied to handwritten zip code recognition [J]. Neural Computation, 1989", the paper by Yann LeCun et al.: backpropagation applied to handwritten zip code recognition) is used to train the feature extraction module, the cross-correlation response module and the target frame output module, minimizing the loss function to update the network parameters of the convolutional-neural-network sub-module, the cross-correlation response module and the target frame output module. The loss function combines the classification loss L_cls and the regression loss L_reg in the form:
L = L_cls + λ·L_reg
where L is the total loss function; L_cls is the classification loss, obtained by computing the cross-entropy loss between the true target box and the predicted boxes in the search area; and L_reg is the regression loss, obtained by computing the SmoothL1 loss between each predicted box and the true target box.
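A NumPy sketch of the combined loss L = L_cls + λ·L_reg, with the cross-entropy reduced to the binary per-anchor case and the standard SmoothL1 definition; both reductions are assumptions, as the patent does not spell out the exact per-anchor formulation:

```python
import numpy as np

def smooth_l1(pred, target):
    """SmoothL1: 0.5*d^2 for |d| < 1, |d| - 0.5 otherwise (mean over elements)."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < 1, 0.5 * d ** 2, d - 0.5))

def cross_entropy(probs, labels, eps=1e-12):
    """Binary cross-entropy over per-anchor target probabilities."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def total_loss(probs, labels, reg_pred, reg_target, lam=1.2):
    """L = L_cls + lambda * L_reg with the stated hyper-parameter lambda = 1.2."""
    return cross_entropy(probs, labels) + lam * smooth_l1(reg_pred, reg_target)

probs = np.array([0.9, 0.1])          # predicted target probabilities
labels = np.array([1.0, 0.0])         # ground-truth labels
reg_pred = np.array([0.2, 0.0, 0.0, 0.0])
reg_target = np.zeros(4)
loss = total_loss(probs, labels, reg_pred, reg_target)
print(round(loss, 4))  # 0.1114
```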
3.2.2.3 Let d = d + batchsize. If d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2.
3.2.2.4 If epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6.
3.2.2.5 Set all 16 layers of network parameters of the convolutional-neural-network sub-module to fixed (i.e. untrained); go to 3.2.2.7.
3.2.2.6 Set the first 11 layers of parameters of the convolutional-neural-network sub-module to fixed and the last 5 layers to trainable; go to 3.2.2.7.
3.2.2.7 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, and go to 3.2.2.1; if epoch = 50, go to 3.2.2.8.
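The schedule of steps 3.2.2.4 through 3.2.2.7 (backbone fully frozen for the first 10 epochs, last 5 layers unfrozen afterwards, learning rate decaying exponentially from 0.01 to 0.0005 over 50 rounds as stated in 3.2.1) can be sketched as follows. The geometric-decay formula is an assumption chosen to match the stated endpoints:

```python
def lr_schedule(epoch, total=50, lr0=0.01, lr_end=0.0005):
    """Exponential (geometric) decay from lr0 at epoch 1 to lr_end at the
    final epoch, matching the stated endpoints 0.01 -> 0.0005."""
    return lr0 * (lr_end / lr0) ** ((epoch - 1) / (total - 1))

def backbone_trainable_layers(epoch):
    """First 10 epochs: all 16 backbone layers frozen; afterwards the last
    5 layers are unfrozen while the first 11 stay fixed."""
    return 0 if epoch <= 10 else 5

print(round(lr_schedule(1), 6), round(lr_schedule(50), 6))  # 0.01 0.0005
```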
3.2.2.8 taking the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module after training and updating as the parameters in the convolutional neural network sub-module, the cross-correlation response module and the target frame output module network.
3.3 Input all video frames of T2 into the convolutional-neural-network sub-module and the cross-correlation response module, and store for each video frame of T2 six outputs: from the classification branch, the initial-template response R0_cls, the fused-template response Rf_cls and the GT response RGT_cls (the response between the search area and the target template corresponding to the GroundTruth); and from the regression branch, the initial-template response R0_reg, the fused-template response Rf_reg and the GT response RGT_reg. Take T2 together with the six classification and regression responses corresponding to each video frame as the third training data set T2′.
3.4 use of T2' training a response fusion module, the method comprises the following steps:
3.4.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size batchsize = 64; initialize the learning rate lr = 10e-6 and set lr = 10e-9 in the last round. Define the number of samples in T2′ as Len(T2′).
3.4.2 use of T2' training a response fusion module, comprising the following specific steps:
3.4.2.1 initializes a variable d to 1.
3.4.2.2 Take the d-th through (d + batchsize)-th pictures of T2′ as training data and use the SGD algorithm to train the response fusion module and optimize its parameters. The loss function of the fusion module is the Euclidean distance loss, of the form:
L_fuse = || Rfuse − RGT ||₂
where Rfuse is the fusion response and RGT is the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fused response as similar as possible to the GT-box response.
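A one-line sketch of the Euclidean response loss above (shapes illustrative):

```python
import numpy as np

def euclidean_response_loss(r_fuse, r_gt):
    """Euclidean (L2) distance between the fused response and the GT response,
    driving the fusion module to imitate the response of the ground-truth box."""
    return np.linalg.norm(r_fuse - r_gt)

rng = np.random.default_rng(2)
r_gt = rng.standard_normal((22, 22, 256))
print(euclidean_response_loss(r_gt, r_gt))  # 0.0
```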
3.4.2.3 Let d = d + batchsize. If d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2.
3.4.2.4 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, and go to 3.4.2.1; if epoch = 50, go to 3.4.2.5.
3.4.2.5, using the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
Fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 Acquire in real time from a camera the video stream I_0, …, I_i, …, I_N; the target tracking system processes each frame in turn, where I_i is the i-th frame of the video and N is the total number of video frames. Initialize the variable i = 1.
4.2 The feature extraction module obtains from the first frame I_0 a target image Z_0 of size 127 × 127 × 3, and extracts from Z_0 the initial template feature z_0 of size 6 × 6 × 256.
4.4 Before tracking frame i, the target tracking system holds the tracking result Z_{i-1} obtained while tracking frame i-1 with the fused template feature z̃_{i-1}. The feature extraction module extracts features from Z_{i-1} to obtain the tracking-result feature z_{i-1} of frame i-1.
4.5 the linear fusion submodule of the feature extraction module fuses the initial template z0, the fused template ẑ_{i-1} used in the (i−1)-th frame tracking, and the (i−1)-th frame tracking result z_{i-1} to generate the fused template feature ẑ_i for the i-th frame tracking. The fusion formula is:

ẑ_i = λ1·ẑ_{i-1} + λ2·z_{i-1}, with ẑ_1 = z0

where λ1 = 0.99 and λ2 = 0.01 are preset parameters.
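A minimal sketch of this linear template fusion, assuming the weighted-sum reading above with the first fused template initialised to z0 (the original formula image is missing, so this reading is an assumption):

```python
import numpy as np

LAMBDA1, LAMBDA2 = 0.99, 0.01  # preset parameters from the text

def fuse_template(prev_fused, z_prev):
    """Linear weighting of the previous fused template feature and the
    latest tracking-result feature (assumed reading of the formula)."""
    return LAMBDA1 * prev_fused + LAMBDA2 * z_prev

z0 = np.zeros((6, 6, 256))     # initial template feature (frame 1)
fused = z0.copy()              # first fused template equals z0
z_prev = np.ones((6, 6, 256))  # tracking-result feature of frame i-1
fused = fuse_template(fused, z_prev)
```

With λ1 = 0.99 the previous fused template (and hence the initial template) dominates, which matches the stated goal of keeping the template pure while slowly absorbing new tracking results.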
4.6 the feature extraction module takes the center coordinate of Z_{i-1} in I_{i-1} as the center and selects in I_i an image area of size 287 × 287 × 3 as the target search area X_i, then performs feature extraction on X_i to obtain the search area feature x_i of size 26 × 26 × 256.
4.7 the cross-correlation response module receives z0, the fused template feature ẑ_i and x_i from the feature extraction module. The first classification branch first processes z0 and x_i to obtain the initial template classification response, then processes ẑ_i and x_i to obtain the fused template classification response. The first regression branch first processes z0 and x_i to obtain the initial template regression response, then processes ẑ_i and x_i to obtain the fused template regression response. All four responses have size 22 × 22 × 256.
4.8 the response fusion module receives the four responses. The second classification branch fuses the initial template classification response with the fused template classification response to obtain the classification fusion response; the second regression branch fuses the initial template regression response with the fused template regression response to obtain the regression fusion response. Both fusions use a residual-connection scheme: the two responses are stacked in the channel dimension and input into the second classification (or regression) branch for fusion, and a residual connection is then applied to the fusion output to generate the final classification (or regression) fusion response.
4.9 the target frame output module receives the classification fusion response and the regression fusion response. The third classification branch processes the classification fusion response to obtain the classification result of dimension 22 × 22 × 2k, where k is the number of anchor boxes: there are k anchor boxes at each of the 22 × 22 points, each anchor box with 2 classification values giving the probability that the anchor box is, respectively, the target and not the target. The third regression branch processes the regression fusion response to obtain the regression result of dimension 22 × 22 × 4k: there are k anchor boxes at each of the 22 × 22 points, each with 4 regression values dx, dy, dw and dh, which are the corrections of the anchor box's center coordinates x and y and of its length and width towards the actual target box.
4.10 the third classification branch searches the classification result for the anchor box (x, y, w, h) with the maximum target probability, where x and y are the coordinates of the anchor box's center point in the original image and w and h are its length and width.
4.11 the third regression branch finds in the regression result the correction values (dx, dy, dw, dh) corresponding to the anchor box (x, y, w, h) obtained in 4.10 and uses them to correct the anchor box. The correction formula, in the standard RPN convention, is:

x* = x + dx·w, y* = y + dy·h, w* = w·e^dw, h* = h·e^dh

The obtained (x*, y*, w*, h*) is the target frame: (x*, y*) are the coordinates of its center and (w*, h*) its length and width. The target frame yields the target image Z_i of the i-th frame, and Z_i is the tracking result of the i-th frame.
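The anchor correction of step 4.11 can be sketched as below; since the exact formula image is absent, the size-scaled offsets and exponential width/height scaling follow the usual RPN convention and are assumptions.

```python
import math

def refine_anchor(x, y, w, h, dx, dy, dw, dh):
    """Correct an anchor box (x, y, w, h) with regression values
    (dx, dy, dw, dh): center offsets are scaled by the anchor size,
    length and width are scaled exponentially."""
    return (x + dx * w, y + dy * h, w * math.exp(dw), h * math.exp(dh))
```

Zero regression values leave the anchor unchanged, while e.g. dw = ln 2 doubles its width.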
4.12 if i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13.
4.13 the tracking results Z0, Z1, …, ZN of all frames of the video sequence have been obtained; the process ends.
The invention can achieve the following technical effects:
1. the invention adds the fusion template into the tracking system on the basis of the twin network tracking system of the single template. The fusion template can continuously acquire the latest tracking result of the system, and the template is ensured to be more accurate, so that the aim of enhancing the template matching effect is fulfilled, and the tracking precision is improved.
2. The invention adopts a double-template strategy and simultaneously uses an initial template and a fusion template in target tracking. By the method, when the tracking system drifts or loses in the tracking process, the initial template can ensure the template purity of the whole target tracking system, so that the robustness of the whole target tracking system is ensured.
3. The invention improves tracking precision while meeting the real-time requirement: on an NVIDIA GTX 1050Ti, the tracking speed of the method is 68.3 FPS.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 is a logical block diagram of a target tracking system constructed in accordance with the present invention.
Detailed Description
Fig. 1 is a general flow chart of the present invention, and as shown in fig. 1, the present invention comprises the following steps:
firstly, a target tracking system is built, and as shown in fig. 2, the system is composed of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module.
The feature extraction module is connected to the cross-correlation response module and consists of a convolutional neural network submodule and a linear fusion submodule. The convolutional neural network submodule extracts features from the input images: it receives from outside the template image Z0 of the first video frame and the target search area image of each subsequent frame (the target search area image of the i-th frame is denoted X_i), performs feature extraction on Z0 and X_i respectively, sends the resulting initial template feature z0 and search area feature x_i together to the cross-correlation response module, and at the same time sends the initial template feature z0 to the linear fusion submodule. The convolutional neural network submodule is a modified AlexNet with 16 layers in total: 5 convolutional layers (layers 1, 5, 9, 12 and 15), 2 max pooling layers (layers 3 and 7), 5 BatchNorm layers (layers 2, 6, 10, 13 and 16), and 4 ReLU activation layers (the remaining layers). Z0 has size 127 × 127 × 3 and its image content is the target to be tracked; the initial template feature z0 obtained by feature extraction on Z0 has size 6 × 6 × 256. X_i has size 287 × 287 × 3, and the tracking task is to find in X_i the target most similar to Z0; the search area feature x_i obtained by feature extraction on X_i has size 26 × 26 × 256.
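The stated sizes (127 → 6, 287 → 26) can be traced through a SiamFC-style modified AlexNet. The patent fixes only the layer order and counts; the kernel sizes and strides below are assumptions chosen to be consistent with those input/output sizes.

```python
def out_size(n, k, s=1):
    """Spatial size after a 'valid' convolution/pooling with kernel k, stride s."""
    return (n - k) // s + 1

def backbone_size(n):
    """Trace the spatial size through an assumed layer configuration:
    conv 11x11/2 -> maxpool 3x3/2 -> conv 5x5 -> maxpool 3x3/2
    -> conv 3x3 -> conv 3x3 -> conv 3x3 (BatchNorm/ReLU layers do
    not change the spatial size)."""
    n = out_size(n, 11, 2)  # conv1
    n = out_size(n, 3, 2)   # maxpool1
    n = out_size(n, 5)      # conv2
    n = out_size(n, 3, 2)   # maxpool2
    n = out_size(n, 3)      # conv3
    n = out_size(n, 3)      # conv4
    n = out_size(n, 3)      # conv5
    return n
```

Under these assumptions a 127 × 127 template maps to a 6 × 6 feature and a 287 × 287 search area to 26 × 26, matching the sizes quoted in the text.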
The linear fusion submodule takes as input the initial template feature z0, the fused template feature ẑ_{i-1} of the previous frame (i.e. frame i−1), and the tracked target feature z_{i-1} of frame i−1, where z0 is the output of the convolutional neural network submodule, ẑ_{i-1} is the output of the linear fusion submodule during the (i−1)-th frame tracking, and z_{i-1} is the tracking result image feature of frame i−1. The linear fusion submodule fuses the three features by linear weighting to obtain the fused template feature ẑ_i of the current frame (i.e. frame i) and sends ẑ_i to the cross-correlation response module. For the first frame, the fused template feature is the initial template feature z0; afterwards, the fused template feature of frame i is used in the target tracking task of frame i+1 to obtain the target tracking result of frame i+1.
The cross-correlation response module is connected to the feature extraction module and the response fusion module. It consists of two parallel branches, a first classification branch and a first regression branch; both are convolutional neural networks with identical network structures but different parameters. The first classification branch generates classification cross-correlation responses and consists of two convolution submodules with the same structure, a classification kernel submodule and a classification search submodule, each containing 1 convolutional layer, 1 BatchNorm layer and 1 ReLU activation layer. The classification kernel submodule first receives z0 from the convolutional neural network submodule and generates the integrated initial template feature, then receives ẑ_i from the linear fusion submodule and generates the further integrated fused template feature; both have size 4 × 4 × 256. The classification search submodule receives x_i from the convolutional neural network submodule and integrates it into the integrated search area feature of size 24 × 24 × 256. The first classification branch then performs a convolution with the integrated initial template feature as the convolution kernel and the integrated search area feature as the convolved region, yielding the classification cross-correlation response of the initial template, and a second convolution with the integrated fused template feature as the kernel and the same search area feature as the convolved region, yielding the classification cross-correlation response of the fused template; both responses have size 22 × 22 × 256. Finally the first classification branch outputs the two responses to the response fusion module.
The first regression branch generates regression cross-correlation responses. Like the first classification branch, it contains two submodules, a regression kernel submodule and a regression search submodule, whose network structures are the same as those of the first classification branch's submodules. The regression kernel submodule first receives z0 from the convolutional neural network submodule and generates the integrated initial template feature, then receives ẑ_i from the linear fusion submodule and generates the integrated fused template feature; both have size 4 × 4 × 256. The regression search submodule receives x_i from the convolutional neural network submodule and integrates it into the integrated search area feature of size 24 × 24 × 256. The first regression branch then performs a convolution with the integrated initial template feature as the kernel and the integrated search area feature as the convolved region, yielding the regression cross-correlation response of the initial template, and a second convolution with the integrated fused template feature as the kernel, yielding the regression cross-correlation response of the fused template; both responses have size 22 × 22 × 256. Finally the first regression branch outputs the two responses to the response fusion module.
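The core template-as-kernel operation of these branches is a cross-correlation. A single-channel sketch is given below; note that with the quoted sizes (4 × 4 kernel over a 24 × 24 region) a plain "valid" correlation would give 21 × 21, so the stated 22 × 22 output implies implementation details (e.g. padding) not reproduced in the text.

```python
import numpy as np

def xcorr(template, search):
    """Valid cross-correlation: slide the template over the search
    feature map and sum elementwise products at each position."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out
```

The response peaks where the search feature locally matches the template, which is what the later classification and regression stages exploit.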
The response fusion module is a convolutional neural network connected to the cross-correlation response module and the target frame output module. It consists of two parallel neural network branches, the second classification branch and the second regression branch. Each branch has 5 layers: layers 1 and 3 are convolutional layers, layers 2 and 4 are BatchNorm layers, and the last layer is a ReLU activation layer. The second classification branch receives the two classification cross-correlation responses from the first classification branch, stacks them in the channel dimension into a classification stacking response of size 22 × 22 × 512, then fuses the stacking response into the 22 × 22 × 256 classification fusion response and sends it to the target frame output module. The second regression branch has the same network structure as the second classification branch: it receives the two regression cross-correlation responses from the first regression branch, stacks them in the channel dimension into a regression stacking response of size 22 × 22 × 512, then fuses it into the 22 × 22 × 256 regression fusion response and sends it to the target frame output module.
The target frame output module is connected to the response fusion module. It consists of two parallel neural network branches, a third classification branch and a third regression branch. Each branch has 4 layers: layers 1 and 4 are convolutional layers (kernel size 1 × 1), layer 2 is a BatchNorm layer, and layer 3 is a ReLU activation layer. The third classification branch receives the classification fusion response from the second classification branch and performs response classification on it; the classification result has dimension 22 × 22 × 2k, where k is the number of anchor boxes in the region proposal network (RPN), and 2k means there are k anchor boxes, each with 2 classification values giving the probability that the image inside the anchor box is, respectively, the target and not the target. The third regression branch has the same network structure as the third classification branch; it receives the regression fusion response from the second regression branch and performs response regression on it; the regression result has dimension 22 × 22 × 4k, where k is the same as in the third classification branch, and 4k means there are k anchor boxes, each with 4 regression values dx, dy, dw and dh, which are the corrections of the corresponding anchor box's x and y coordinates and of its length and width.
The target frame output module selects the anchor box with the maximum target probability in the classification result as the target prediction box, takes the four correction values in the regression result corresponding to that anchor box to correct the anchor box's position and size, and the corrected anchor box is the tracking box of the target.
Secondly, preparing a training data set of the target tracking system, wherein the method comprises the following steps:
the training data set of the system is divided into two parts: a first training data set T1 and a second training data set T2. T1 is used for training the feature extraction module, the cross-correlation response module and the target frame output module; T2 is used for training the response fusion module.
2.1 select 100000 positive sample pairs from VID and YTB as follows: for each video sequence of VID and YTB, randomly select a frame from the sequence as the template image and randomly select a frame within 100 frames after the template image as the search area image; the two images so selected form 1 positive sample pair, and 100000 positive sample pairs are generated in this way. VID and YTB are video data sets; each video contains a specific target, and the data set makers annotate a target box for every video frame, given by the coordinates of the upper-left corner of a rectangular box together with the box's length and width; the rectangular box frames the target position.
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting one frame from one video sequence of VID and YTB as a template image, and randomly selecting one frame from the other video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair. 100000 negative sample pairs are generated in this sampling manner.
2.3 select 100000 positive sample pairs from DET and COCO as follows: randomly select two different images of the same object from DET and COCO as 1 positive sample pair, and generate 100000 positive sample pairs in this way. One image of each pair serves as the template image and the other as the search area image. DET and COCO are target detection data sets containing target box annotations.
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 select two images of the same kind but not the same object from DET and COCO, one as template image and the other as search area image, to get 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
2.4.2 select two different objects of different classes, one as a template image and the other as a search area image, from DET and COCO to obtain 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
100000 negative sample pairs are finally obtained through 2.4.1 and 2.4.2.
2.5 scale the template images in all positive and negative sample pairs to 127 × 127 × 3 and all search area images to 287 × 287 × 3. All finally obtained positive and negative sample pairs form the first training data set T1.
2.6 choose the training set of GOT-10k as the second training data set T2.
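The positive-pair sampling rule of step 2.1 (a template frame plus a search frame at most 100 frames later in the same video) can be sketched as follows; the video length and frame indices are placeholders for real VID/YTB frames.

```python
import random

def sample_positive_pair(video_len, max_gap=100):
    """Pick a template frame index t and a search frame index s from one
    video, with s no more than max_gap frames after t."""
    t = random.randrange(video_len)
    s = random.randrange(t, min(t + max_gap + 1, video_len))
    return t, s
```

Negative pairs (steps 2.2 and 2.4) would instead draw the two frames from different videos or different objects.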
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method.
3.2 use of T1Training a feature extraction module, a cross-correlation response module and a target frame output module, wherein the method comprises the following steps:
3.2.1 set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch size to 128; initialize the learning rate lr = 0.01 and set it to 0.0005 in the last round, with exponential decay during training; initialize the hyper-parameter λ = 1.2; define Len(T1) as the number of samples in T1.
3.2.2 Using T1Training a feature extraction module, a cross-correlation response module and a target frame output module, wherein the method comprises the following steps:
3.2.2.1 initialization variable d is 1.
3.2.2.2 take the d-th to the (d+batchsize)-th pictures of T1 as training data and input them into the feature extraction module; the data flow is feature extraction module → cross-correlation response module → target frame output module. Train the feature extraction module, the cross-correlation response module and the target frame output module with stochastic gradient descent (SGD) to minimize the loss function, updating the network parameters of the convolutional neural network submodule, the cross-correlation response module and the target frame output module. The loss function combines the classification loss Lcls and the regression loss Lreg in the form:
L = Lcls + λ·Lreg
where L is the total loss function; Lcls is the classification loss, computed as the cross-entropy loss between the true target box and the predicted boxes in the search area; and Lreg is the regression loss, computed as the SmoothL1 loss between each predicted box and the true target box.
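This combined objective can be sketched numerically as below; the per-anchor binary formulation and the reduction by mean are assumptions (the text fixes only the loss types and λ = 1.2).

```python
import numpy as np

def cross_entropy(p_pred, y):
    """Binary cross-entropy for the classification branch (per anchor)."""
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def smooth_l1(pred, target):
    """SmoothL1 for the regression branch: quadratic below 1, linear above."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < 1, 0.5 * d ** 2, d - 0.5)))

def total_loss(p_pred, y, box_pred, box_gt, lam=1.2):
    """L = Lcls + lambda * Lreg with the hyper-parameter lambda = 1.2."""
    return cross_entropy(p_pred, y) + lam * smooth_l1(box_pred, box_gt)
```

SmoothL1 keeps the regression gradient bounded for large box errors while staying smooth near zero, which is why it is preferred over plain L2 here.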
3.2.2.3 let d = d + batchsize. If d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2.
3.2.2.4 turning to 3.2.2.5 if the epoch is equal to or less than 10; if epoch >10, go to 3.2.2.6.
3.2.2.5 set all 16 layers of network parameters of the convolutional neural network submodule to fixed (i.e., untrained), go to 3.2.2.7.
3.2.2.6 sets the first 11 level parameters of the convolutional neural network sub-modules to fixed and the last 5 level parameters to trainable, go to 3.2.2.7.
3.2.2.7 if epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8.
3.2.2.8 taking the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module after training and updating as the parameters in the convolutional neural network sub-module, the cross-correlation response module and the target frame output module network.
3.3 input all video frames of T2 into the convolutional neural network submodule and the cross-correlation response module, and store the outputs for every video frame of T2: the classification branch's initial template response, fused template response and GT response (the response between the target template corresponding to the Ground Truth and the search area), and the regression branch's initial template response, fused template response and GT response. T2 together with the six classification and regression responses of each video frame forms the third training data set T2′.
3.4 use of T2' training a response fusion module, the method comprises the following steps:
3.4.1 set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch size to 64; initialize the learning rate lr = 10e-6 and set it to 10e-9 in the last round; define Len(T2′) as the number of samples in T2′.
3.4.2 use of T2' training a response fusion module, comprising the following specific steps:
3.4.2.1 initializes a variable d to 1.
3.4.2.2 take the d-th to the (d+batchsize)-th samples of T2′ as training data and use the SGD algorithm to train the response fusion module and optimize its parameters. The loss function of the fusion module is the Euclidean distance loss:

L_fuse = ||R_fuse − R_GT||₂

where R_fuse is the fused response and R_GT is the GT response, i.e. the response between the GT box and the search area. Using the Euclidean distance drives the fused response to be as similar as possible to the GT-box response.
3.4.2.3 let d = d + batchsize. If d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2.
3.4.2.4 if epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5.
3.4.2.5 use the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
Fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 acquire the video stream I0, …, Ii, …, IN from the camera in real time; the target tracking system processes each frame in turn, where Ii is the i-th frame of the video and N is the total number of video frames. Initialize the variable i = 1.
4.2 the feature extraction module crops a target image Z0 of size 127 × 127 × 3 from the first frame I0 and performs feature extraction on Z0 to obtain the initial template feature z0 of size 6 × 6 × 256.
4.4 before tracking the i-th frame, the target tracking system has the fused template feature used in the (i−1)-th frame tracking and the tracking result Z_{i-1} of the (i−1)-th frame. The feature extraction module performs feature extraction on Z_{i-1} to obtain the (i−1)-th frame tracking result feature z_{i-1}.
4.5 the linear fusion submodule of the feature extraction module fuses the initial template z0, the fused template ẑ_{i-1} used in the (i−1)-th frame tracking, and the (i−1)-th frame tracking result z_{i-1} to generate the fused template feature ẑ_i for the i-th frame tracking. The fusion formula is:

ẑ_i = λ1·ẑ_{i-1} + λ2·z_{i-1}, with ẑ_1 = z0

where λ1 = 0.99 and λ2 = 0.01 are preset parameters.
4.6 the feature extraction module takes the center coordinate of Z_{i-1} in I_{i-1} as the center and selects in I_i an image area of size 287 × 287 × 3 as the target search area X_i, then performs feature extraction on X_i to obtain the search area feature x_i of size 26 × 26 × 256.
4.7 the cross-correlation response module receives z0, the fused template feature ẑ_i and x_i from the feature extraction module. The first classification branch first processes z0 and x_i to obtain the initial template classification response, then processes ẑ_i and x_i to obtain the fused template classification response. The first regression branch first processes z0 and x_i to obtain the initial template regression response, then processes ẑ_i and x_i to obtain the fused template regression response. All four responses have size 22 × 22 × 256.
4.8 the response fusion module receives the four responses. The second classification branch fuses the initial template classification response with the fused template classification response to obtain the classification fusion response; the second regression branch fuses the initial template regression response with the fused template regression response to obtain the regression fusion response. Both fusions use a residual-connection scheme: the two responses are stacked in the channel dimension and input into the second classification (or regression) branch for fusion, and a residual connection is then applied to the fusion output to generate the final classification (or regression) fusion response.
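The stack-then-fuse scheme of step 4.8 can be sketched as below; `branch` stands in for the trained second classification/regression branch, and taking the residual from the fused-template response is an assumption, since the fusion formula image is not reproduced in the text.

```python
import numpy as np

def residual_fuse(r_init, r_fuse, branch):
    """Stack the two cross-correlation responses on the channel axis,
    pass them through the fusion branch, then add a residual connection
    (assumed here to come from the fused-template response)."""
    stacked = np.concatenate([r_init, r_fuse], axis=-1)  # channels: C -> 2C
    return branch(stacked) + r_fuse                      # back to C channels

# Toy stand-in branch: average the two halves of the channel dimension.
toy_branch = lambda s: 0.5 * (s[..., : s.shape[-1] // 2] + s[..., s.shape[-1] // 2:])
```

The residual path lets the module default to the fused-template response and learn only a correction on top of it.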
4.9 the target frame output module receives the classification fusion response and the regression fusion response. The third classification branch processes the classification fusion response to obtain the classification result of dimension 22 × 22 × 2k, where k is the number of anchor boxes: there are k anchor boxes at each of the 22 × 22 points, each anchor box with 2 classification values giving the probability that the anchor box is, respectively, the target and not the target. The third regression branch processes the regression fusion response to obtain the regression result of dimension 22 × 22 × 4k: there are k anchor boxes at each of the 22 × 22 points, each with 4 regression values dx, dy, dw and dh, which are the corrections of the anchor box's center coordinates x and y and of its length and width towards the actual target box.
4.10 the third classification branch searches the classification result for the anchor box (x, y, w, h) with the maximum target probability, where x and y are the coordinates of the anchor box's center point in the original image and w and h are its length and width.
4.11 the third regression branch finds in the regression result the correction values (dx, dy, dw, dh) corresponding to the anchor box (x, y, w, h) obtained in 4.10 and uses them to correct the anchor box. The correction formula, in the standard RPN convention, is:

x* = x + dx·w, y* = y + dy·h, w* = w·e^dw, h* = h·e^dh

The obtained (x*, y*, w*, h*) is the target frame: (x*, y*) are the coordinates of its center and (w*, h*) its length and width. The target frame yields the target image Z_i of the i-th frame, and Z_i is the tracking result of the i-th frame.
4.12 if i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13.
4.13 the tracking results Z0, Z1, …, ZN of all frames of the video sequence have been obtained; the process ends.
The target tracking field uses success rate (SUC) and precision (PRE) to characterize tracking accuracy. SUC denotes the overlap fraction between the predicted target box and the GT box, and PRE denotes the percentage of frames whose center location error is below a threshold out of the total number of frames. Higher SUC and PRE indicate better tracking performance. Tracking speed is measured in FPS (frames per second), the number of frames processed per second; a larger FPS means faster tracking.
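These two metrics can be sketched as follows; the (x1, y1, x2, y2) box convention and the 20-pixel default threshold are assumptions for illustration.

```python
import numpy as np

def iou(a, b):
    """Overlap (intersection over union) between two boxes (x1, y1, x2, y2),
    the per-frame quantity underlying SUC."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision(pred_centers, gt_centers, thresh=20.0):
    """PRE: fraction of frames whose center location error is below thresh."""
    err = np.linalg.norm(np.asarray(pred_centers, float) - np.asarray(gt_centers, float), axis=1)
    return float(np.mean(err < thresh))
```

A SUC curve would sweep an overlap threshold over the per-frame IoU values in the same way PRE sweeps the center-error threshold.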
Table 1 shows the results of comparing the present invention with eight other high-performance target tracking methods on an OTB-100 dataset.
TABLE 1 comparison of test indexes of the present invention on OTB-100 data set with other eight high-performance target tracking methods
The first row of Table 1 lists the abbreviations of the eight tracking algorithms compared, the second row gives the SUC values these algorithms achieve on the OTB-100 data set, and the third row gives their PRE values; bold font marks the optimal value. Table 1 shows that the present invention outperforms all eight high-performance algorithms on both SUC and PRE, improving SUC by 2.8% and PRE by 1.4% over DaSiamRPN, the best of the eight. The invention thus improves target tracking precision.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Any modification which does not depart from the functional and structural principles of the present invention is intended to be included within the scope of the claims.
Claims (11)
1. A target tracking method based on dual-template response fusion is characterized by comprising the following steps:
firstly, a target tracking system is set up, and the target tracking system consists of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module;
the feature extraction module is connected to the cross-correlation response module and consists of a convolutional neural network submodule and a linear fusion submodule; the convolutional neural network submodule extracts features from the input images: it receives from outside the template image Z0 of the first video frame and then the target search area image of each frame, performs feature extraction on Z0 and on the target search area image Xi of the i-th frame respectively, sends the resulting initial template feature z0 and search area feature xi together to the cross-correlation response module, and at the same time sends the initial template feature z0 to the linear fusion submodule; the image content of Z0 is the target to be tracked, and the feature extraction module obtains the initial template feature z0 by feature extraction on Z0; the tracking task is to find in Xi the target most similar to Z0, and the feature extraction module obtains the search area feature xi by feature extraction on Xi;
the linear fusion submodule takes as input the initial template feature z0, the fused template feature of the previous frame, i.e. frame i−1, and the tracked target feature z_{i-1} of frame i−1, where z0 is the output of the convolutional neural network submodule, the fused template feature of frame i−1 is the output of the linear fusion submodule during the (i−1)-th frame tracking, and z_{i-1} is the tracking result image feature of frame i−1; the linear fusion submodule fuses the three features by linear weighting to obtain the fused template feature of the current frame, i.e. frame i, and sends it to the cross-correlation response module; for the first frame, the fused template feature is the initial template feature z0; afterwards, the fused template feature of frame i is used in the target tracking task of frame i+1 to obtain the target tracking result of frame i+1;
the cross-correlation response module is connected with the feature extraction module and the response fusion module; the module consists of two parallel branches, a first classification branch and a first regression branch, both convolutional neural networks with identical network structures but different parameters; the first classification branch generates classification cross-correlation responses and consists of two convolution submodules of the same structure, a classification kernel submodule and a classification search submodule; the classification kernel submodule first receives z0 from the convolutional neural network submodule and integrates it into an integrated initial template feature, then receives the fusion template feature z̃i from the linear fusion submodule and integrates it into an integrated fusion template feature; the classification search submodule receives xi from the convolutional neural network submodule and integrates it into an integrated search-region feature; the first classification branch then performs a convolution with the integrated initial template feature as the convolution kernel and the integrated search-region feature as the convolved region to obtain the classification cross-correlation response of the initial template, Rcls_0, and a second convolution with the integrated fusion template feature as the convolution kernel and the integrated search-region feature as the convolved region to obtain the classification cross-correlation response of the fusion template, Rcls_f; finally the first classification branch outputs Rcls_0 and Rcls_f to the response fusion module;
the first regression branch generates regression cross-correlation responses; it contains two submodules, a regression kernel submodule and a regression search submodule; the regression kernel submodule first receives z0 from the convolutional neural network submodule and generates an integrated initial template feature, then receives the fusion template feature z̃i from the linear fusion submodule and generates an integrated fusion template feature; the regression search submodule receives xi from the convolutional neural network submodule and integrates it into an integrated search-region feature; the first regression branch then performs a convolution with the integrated initial template feature as the convolution kernel and the integrated search-region feature as the convolved region to obtain the regression cross-correlation response of the initial template, Rreg_0, and a second convolution with the integrated fusion template feature as the convolution kernel and the integrated search-region feature as the convolved region to obtain the regression cross-correlation response of the fusion template, Rreg_f; finally the first regression branch outputs Rreg_0 and Rreg_f to the response fusion module;
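The template-as-kernel convolution used by both branches can be sketched in numpy; the sizes below are toy values for illustration, and the real features would come from the kernel/search submodules:

```python
import numpy as np

def cross_correlate(kernel, search):
    """Valid cross-correlation of a template feature map (used as the
    convolution kernel) over a search-region feature map, summing over
    channels. kernel: (kh, kw, c), search: (sh, sw, c)
    -> response map of shape (sh - kh + 1, sw - kw + 1)."""
    kh, kw, c = kernel.shape
    sh, sw, _ = search.shape
    out = np.empty((sh - kh + 1, sw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Dot product of the template with one window of the search region.
            out[y, x] = np.sum(search[y:y + kh, x:x + kw, :] * kernel)
    return out

# Toy sizes: a 4x4 kernel slid over a 24x24 region gives a 21x21 map.
kernel = np.random.rand(4, 4, 8)
search = np.random.rand(24, 24, 8)
resp = cross_correlate(kernel, search)
print(resp.shape)  # (21, 21)
```

The response peaks where the search window most resembles the template, which is what the later anchor-selection step exploits.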
the response fusion module is a convolutional neural network connected with the cross-correlation response module and the target frame output module; the module consists of two parallel neural network branches, a second classification branch and a second regression branch; the second classification branch receives the two classification cross-correlation responses Rcls_0 and Rcls_f from the first classification branch, stacks them along the channel dimension to generate a classification stacking response, performs classification fusion on the stacking response to obtain the classification fusion response Rcls, and sends Rcls to the target frame output module; the second regression branch receives the two regression cross-correlation responses Rreg_0 and Rreg_f from the first regression branch, stacks them along the channel dimension to generate a regression stacking response, performs regression fusion to obtain the regression fusion response Rreg, and sends Rreg to the target frame output module;
the target frame output module is connected with the response fusion module and consists of two parallel neural network branches, a third classification branch and a third regression branch; the third classification branch receives the classification fusion response Rcls from the second classification branch and performs response classification on it; the classification result has size 22 × 22 × 2k, where k is the number of anchor frames in the region proposal network (RPN), and 2k means that each of the k anchor frames corresponds to 2 classification values, representing the probabilities that the image in the anchor frame is the target and is not the target, respectively; the third regression branch receives the regression fusion response Rreg from the second regression branch and performs response regression on it; the regression result has size 22 × 22 × 4k, where 4k means that each of the k anchor frames corresponds to 4 regression values dx, dy, dw and dh, representing corrections to the x and y coordinates and to the length and width of the corresponding original anchor frame; the target frame output module selects the anchor frame with the highest target probability in the classification result as the target prediction frame, takes the four correction values in the regression result corresponding to that anchor frame, and uses them to correct the position and size of the anchor frame; the corrected anchor frame is the tracking frame of the target;
secondly, prepare the training data set of the target tracking system, which is divided into two parts, a first training data set T1 and a second training data set T2, as follows:
2.1 Select 100000 positive sample pairs from VID and YTB as follows: sample each video sequence of VID and YTB, randomly selecting one frame of a video sequence as the template image and one frame within the hundred frames following the template image as the search-region image; the two images so selected form 1 positive sample pair, and 100000 positive sample pairs are generated in this way; VID and YTB are video data sets in which each video contains a specific target and every video frame is annotated with a target frame, recorded as the coordinates of the upper-left corner of a rectangular frame together with the length and width of the rectangular frame; the rectangular frame encloses the target position;
2.2 Select 100000 negative sample pairs from VID and YTB as follows: randomly select one frame from one video sequence of VID or YTB as the template image and one frame from a different video sequence as the search-region image; the two images so selected form 1 negative sample pair, and 100000 negative sample pairs are generated in this way;
2.3 Select 100000 positive sample pairs from DET and COCO as follows: randomly select two different images of the same object from DET and COCO as 1 positive sample pair, one serving as the template image and the other as the search-region image, and generate 100000 positive sample pairs in this way; DET and COCO are object detection data sets that include target frame annotations;
2.4 Select 100000 negative sample pairs from DET and COCO as follows:
2.4.1 Select from DET and COCO two images of the same category but of different objects, one serving as the template image and the other as the search-region image, to obtain 1 negative sample pair; 50000 negative sample pairs are generated in this way;
2.4.2 Select from DET and COCO two images of different objects of different categories, one serving as the template image and the other as the search-region image, to obtain 1 negative sample pair; 50000 negative sample pairs are generated in this way;
2.5 Scale the template images in all positive and negative sample pairs to size 127 × 127 × 3 and all search-region images to size 287 × 287 × 3; take all scaled positive and negative sample pairs as T1;
2.6 choosing the training set of GOT-10k as T2;
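The pair-sampling rules of steps 2.1 and 2.2 can be sketched as follows, with hypothetical frame identifiers standing in for actual VID/YTB frames (data loading omitted):

```python
import random

def sample_positive_pair(video_frames, max_gap=100):
    """Step 2.1: template frame and a search frame at most `max_gap`
    frames later, drawn from the same video sequence."""
    t = random.randrange(len(video_frames))
    s = random.randrange(t, min(t + max_gap + 1, len(video_frames)))
    return video_frames[t], video_frames[s]

def sample_negative_pair(videos):
    """Step 2.2: template and search frames drawn from two different
    video sequences."""
    a, b = random.sample(range(len(videos)), 2)
    return random.choice(videos[a]), random.choice(videos[b])

# Hypothetical corpus: 5 videos of 200 frames each, named "v<video>_f<frame>".
videos = [[f"v{v}_f{f}" for f in range(200)] for v in range(5)]
z, x = sample_positive_pair(videos[0])
print(z, x)
```

Repeating these draws 100000 times each would produce the positive and negative pair sets described above.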
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method;
3.2 Use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module, as follows:
3.2.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size to 128; initialize the learning rate lr to 0.01, set it to 0.0005 in the last round, and decay it exponentially during training; initialize the hyper-parameter λ to 1.2; define the number of samples in T1 as Len(T1);
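An exponential decay taking the learning rate from 0.01 to 0.0005 over 50 rounds can be computed as follows; the per-round decay factor is derived from the two endpoints:

```python
def lr_schedule(lr_start=0.01, lr_end=0.0005, epochs=50):
    """Exponentially decayed learning rate that starts at lr_start and
    reaches lr_end exactly at the last epoch."""
    gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))  # per-epoch factor
    return [lr_start * gamma ** e for e in range(epochs)]

lrs = lr_schedule()
print(round(lrs[0], 6), round(lrs[-1], 6))  # 0.01 0.0005
```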
3.2.2 Use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module, and take the parameters of the convolutional neural network submodule, the cross-correlation response module and the target frame output module updated by training as the parameters of those networks;
3.3 Input all video frames of T2 into the convolutional neural network submodule and the cross-correlation response module, and store six outputs for each video frame of T2: the initial template response, the fusion template response and the GT (ground truth) response of the classification branch, and the initial template response, the fusion template response and the GT response of the regression branch; take T2 together with the six classification and regression responses corresponding to each video frame as the third training data set T2′;
3.4 Use T2′ to train the response fusion module, and take the response fusion module parameters obtained by training as the network parameters of the final response fusion module;
fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 Acquire the video stream I0, …, Ii, …, IN in real time from a camera; the target tracking system processes each frame in sequence, where Ii is the i-th frame of the video and N is the total number of frames; initialize the variable i = 1;
4.2 The feature extraction module obtains the target image Z0 from the first frame I0 and performs feature extraction on Z0 to obtain the initial template feature z0;
4.4 Use the feature extraction module to perform feature extraction on Zi-1 to obtain the tracking result feature z(i-1) of the (i-1)-th frame;
4.5 The linear fusion submodule of the feature extraction module fuses the initial template z0, the fusion template z̃(i-1) used in tracking of the (i-1)-th frame, and the tracking result feature z(i-1) of the (i-1)-th frame to generate the fusion template feature z̃i for tracking of the i-th frame;
4.6 The feature extraction module selects the target search area Xi on Ii, centered at the center coordinates of Zi-1 in Ii-1, and performs feature extraction on Xi to obtain the search-region feature xi;
4.7 The cross-correlation response module receives z0, z̃i and xi from the feature extraction module; the first classification branch first processes z0 and xi to obtain the initial template classification response Rcls_0, then processes z̃i and xi to obtain the fusion template classification response Rcls_f; the first regression branch first processes z0 and xi to obtain the initial template regression response Rreg_0, then processes z̃i and xi to obtain the fusion template regression response Rreg_f;
4.8 The response fusion module receives Rcls_0, Rcls_f, Rreg_0 and Rreg_f; the second classification branch fuses Rcls_0 and Rcls_f to obtain the classification fusion response Rcls, and the second regression branch fuses Rreg_0 and Rreg_f to obtain the regression fusion response Rreg;
4.9 The target frame output module receives Rcls and Rreg; the third classification branch processes Rcls to obtain the classification result, of dimension 22 × 22 × 2k, where k is the number of anchor frames: there are k anchor frames at each point of the 22 × 22 grid, and each anchor frame corresponds to 2 classification values, representing the probabilities that the anchor frame is the target and is not the target, respectively; the third regression branch processes Rreg to obtain the regression result, of dimension 22 × 22 × 4k: each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which represent the corrections of the anchor frame's center coordinates x and y toward the actual target frame and the corrections of its length and width;
4.10 The third classification branch finds in the classification result the anchor frame (x, y, w, h) with the highest target probability, where x and y are the coordinates of the anchor frame's center in the original image and w and h are its length and width;
4.11 The third regression branch finds in the regression result the correction values (dx, dy, dw, dh) corresponding to the anchor frame (x, y, w, h) and corrects the anchor frame with them, following the standard RPN parameterisation:
x* = x + dx × w, y* = y + dy × h, w* = w × e^dw, h* = h × e^dh
The resulting (x*, y*, w*, h*) is the target frame: (x*, y*) are the coordinates of its center, and w* and h* are its length and width; the target image Zi of the i-th frame is obtained from this target frame, and Zi is the tracking result of the i-th frame;
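Steps 4.10 and 4.11 can be sketched in numpy; the box decoding below assumes the standard RPN parameterisation (centre offsets scaled by anchor size, exponential width/height scaling), and the anchors and offsets are illustrative values:

```python
import numpy as np

def select_and_correct(cls_probs, reg_offsets, anchors):
    """Pick the anchor with the highest target probability (step 4.10)
    and refine it with its regression offsets (step 4.11).
    cls_probs: (N,) target probabilities; reg_offsets: (N, 4) as
    (dx, dy, dw, dh); anchors: (N, 4) as (x, y, w, h) with centre coords."""
    i = int(np.argmax(cls_probs))
    x, y, w, h = anchors[i]
    dx, dy, dw, dh = reg_offsets[i]
    # Standard RPN decoding (an assumption here): shift the centre by the
    # offsets scaled by anchor size, scale width/height exponentially.
    return (x + dx * w, y + dy * h, w * np.exp(dw), h * np.exp(dh))

anchors = np.array([[50.0, 60.0, 32.0, 32.0], [80.0, 90.0, 64.0, 64.0]])
cls = np.array([0.2, 0.9])                     # second anchor wins
reg = np.array([[0.0, 0.0, 0.0, 0.0], [0.1, -0.1, 0.0, 0.0]])
box = select_and_correct(cls, reg, anchors)
print(tuple(round(float(v), 2) for v in box))  # (86.4, 83.6, 64.0, 64.0)
```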
4.12 If i < N, set i = i + 1 and go to 4.4; if i = N, go to 4.13;
4.13 The tracking results Z0, Z1, …, ZN of all frames of the video sequence have been obtained; end.
2. The target tracking method based on dual-template response fusion of claim 1, wherein the convolutional neural network submodule is a modified AlexNet, the modified AlexNet comprises 5 convolutional layers, 2 maximum pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, and comprises 16 layers, wherein the convolutional layers are respectively the 1 st, 5 th, 9 th, 12 th and 15 th layers, the pooling layers are respectively the 3 rd and 7 th layers, the BatchNorm layers are respectively the 2 nd, 6 th, 10 th, 13 th and 16 th layers, and the rest layers are all ReLU activation function layers.
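The layer arrangement in claim 2 can be sanity-checked with a few lines of Python; the layer-type names are purely illustrative labels:

```python
# Layout of the modified AlexNet described above (positions are 1-indexed):
# conv at layers 1, 5, 9, 12, 15; max-pool at 3, 7;
# BatchNorm at 2, 6, 10, 13, 16; every remaining layer is a ReLU.
layout = ["?"] * 16
for i in (1, 5, 9, 12, 15):
    layout[i - 1] = "conv"
for i in (3, 7):
    layout[i - 1] = "maxpool"
for i in (2, 6, 10, 13, 16):
    layout[i - 1] = "batchnorm"
layout = [layer if layer != "?" else "relu" for layer in layout]
print(layout.count("relu"))  # 4
```

The remaining 4 slots are exactly the 4 ReLU activation layers the claim states, so the listed positions are mutually consistent.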
3. The target tracking method based on dual-template response fusion of claim 1, wherein Z0 has a size of 127 × 127 × 3 and the initial template feature z0 has a size of 6 × 6 × 256; Xi has a size of 287 × 287 × 3 and the search-region feature xi has a size of 26 × 26 × 256; in these sizes, the first two values are the length and width of the image and the third value is the number of channels.
4. The method of claim 1, wherein the classification kernel module and the classification search submodule of the first classification branch each comprise 1 convolution layer, 1 BatchNorm layer, and 1 ReLU activation function layer; the network structures of the regression kernel module and the regression search submodule are respectively the same as the network structure of the classification kernel module.
5. The method of claim 1, wherein the second classification branch comprises 5 layers, wherein the 1 st and 3 rd layers are convolutional layers, the 2 nd and 4 th layers are BatchNorm layers, and the last layer is a ReLU activation function layer; the network structure of the second regression branch is the same as the second classification branch.
6. The target tracking method based on dual-template response fusion of claim 1, wherein the third classification branch comprises 4 layers, wherein the 1 st and 4 th layers are convolution layers (convolution kernel size 1 x 1), the 2 nd layer is a BatchNorm layer, and the 3 rd layer is a ReLU activation function layer; the network structure of the third regression branch is the same as that of the third classification branch.
7. The target tracking method based on dual-template response fusion of claim 1, wherein the integrated initial template feature and the integrated fusion template feature of the first classification branch both have size 4 × 4 × 256, its integrated search-region feature has size 24 × 24 × 256, and the two classification cross-correlation responses both have size 22 × 22 × 256; the integrated initial template feature and the integrated fusion template feature of the first regression branch both have size 4 × 4 × 256, its integrated search-region feature has size 24 × 24 × 256, and the two regression cross-correlation responses both have size 22 × 22 × 256; the classification stacking response has size 22 × 22 × 512 and the classification fusion response has size 22 × 22 × 256; the regression stacking response has size 22 × 22 × 512 and the regression fusion response has size 22 × 22 × 256.
8. The target tracking method based on dual-template response fusion of claim 1, wherein the method of step 3.2.2 for training the feature extraction module, the cross-correlation response module and the target frame output module with T1 is:
3.2.2.1 Initialize the variable d = 1;
3.2.2.2 Take pictures d through d + batchsize of T1 as training data and input them to the feature extraction module; the data flow is feature extraction module → cross-correlation response module → target frame output module; train the feature extraction module, the cross-correlation response module and the target frame output module with the stochastic gradient descent method, minimizing the loss function so as to update the network parameters of the convolutional neural network submodule, the cross-correlation response module and the target frame output module; the loss function is a combination of the classification loss Lcls and the regression loss Lreg of the form:
L = Lcls + λLreg
where L is the total loss function; Lcls is the classification loss, obtained by computing a cross-entropy loss between the true target frame and the predicted frames in the search region; Lreg is the regression loss, obtained by computing the SmoothL1 loss between each predicted frame and the true target frame;
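A numpy sketch of this combined loss, with λ = 1.2 as in step 3.2.1; the elementwise forms are standard binary cross-entropy and SmoothL1, and the mean reduction over anchors is an assumption:

```python
import numpy as np

def smooth_l1(pred, target):
    """SmoothL1 (Huber with delta = 1): quadratic near zero, linear beyond."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def cross_entropy(probs, labels, eps=1e-12):
    """Binary cross-entropy over per-anchor target probabilities."""
    probs = np.clip(probs, eps, 1 - eps)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()

def total_loss(probs, labels, reg_pred, reg_target, lam=1.2):
    """L = L_cls + lambda * L_reg, with the hyper-parameter lambda = 1.2."""
    return cross_entropy(probs, labels) + lam * smooth_l1(reg_pred, reg_target)
```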
3.2.2.3 Set d = d + batchsize; if d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2;
3.2.2.4 If epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6;
3.2.2.5 Set all 16 layers of network parameters of the convolutional neural network submodule as fixed, and go to 3.2.2.7;
3.2.2.6 Set the first 11 layers of parameters of the convolutional neural network submodule as fixed and the last 5 layers as trainable, and go to 3.2.2.7;
3.2.2.7 If epoch < 50, set epoch = epoch + 1 and lr = 0.5 × lr, and go to 3.2.2.1; if epoch = 50, go to 3.2.2.8;
3.2.2.8 taking the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module after training and updating as the parameters in the convolutional neural network sub-module, the cross-correlation response module and the target frame output module network.
9. The target tracking method based on dual-template response fusion of claim 1, wherein the method of step 3.4 for training the response fusion module with T2′ is:
3.4.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size to 64; initialize the learning rate lr to 10e-6 and set it to 10e-9 in the last round; define the number of samples in T2′ as Len(T2′);
3.4.2 use of T2' training a response fusion module, comprising the following specific steps:
3.4.2.1 initializing variable d ═ 1;
3.4.2.2 Take pictures d through d + batchsize of T2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters; the loss function of the fusion module is the Euclidean distance loss L = ‖R̂ − RGT‖2,
where R̂ is the fusion response and RGT is the response of the GT frame over the search region; the Euclidean distance is used so that the fusion response becomes as similar as possible to the GT frame response;
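A minimal numpy version of this Euclidean response loss, using the 22 × 22 × 256 response size from claim 7:

```python
import numpy as np

def response_loss(fused, gt):
    """Euclidean distance between the fused response and the GT-frame
    response; minimizing it drives the fusion output toward the
    ground-truth response (step 3.4.2.2)."""
    return np.sqrt(np.sum((fused - gt) ** 2))

a = np.zeros((22, 22, 256))
b = np.ones((22, 22, 256))
print(round(float(response_loss(a, b)), 3))  # 352.0, i.e. sqrt(22*22*256)
```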
3.4.2.3 Set d = d + batchsize; if d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2;
3.4.2.4 If epoch < 50, set epoch = epoch + 1 and lr = 0.5 × lr, and go to 3.4.2.1; if epoch = 50, go to 3.4.2.5;
3.4.2.5, using the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
10. The target tracking method based on dual-template response fusion of claim 1, wherein in step 4.5 the linear fusion submodule of the feature extraction module generates the fusion template feature z̃i for tracking of the i-th frame by linearly fusing the initial template z0, the fusion template z̃(i-1) used in tracking of the (i-1)-th frame, and the tracking result z(i-1) of the (i-1)-th frame, with preset weights λ1 = 0.99 and λ2 = 0.01.
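The linear fusion can be sketched as follows; since the exact weighting equation is not reproduced in this text, the running-average form below, initialised with z0 in the first frame, is only one plausible reading of the preset weights λ1 = 0.99 and λ2 = 0.01:

```python
import numpy as np

def fuse_template(z_fused_prev, z_prev, lam1=0.99, lam2=0.01):
    """Linear weighting of the previous fused template and the previous
    tracking-result feature. The initial template z0 enters through the
    first-frame initialisation (fused template = z0); this recursive form
    is an assumption, not the patent's verbatim formula."""
    return lam1 * z_fused_prev + lam2 * z_prev

z0 = np.full((6, 6, 256), 1.0)      # initial template feature
z_prev = np.full((6, 6, 256), 3.0)  # previous frame's tracking-result feature
z_f = z0                            # first frame: fused template is z0
z_f = fuse_template(z_f, z_prev)
print(round(float(z_f[0, 0, 0]), 6))  # 1.02
```

With λ1 close to 1, the template drifts only slowly toward recent appearances, which matches the stability motivation of keeping the initial template dominant.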
11. The target tracking method based on dual-template response fusion of claim 1, wherein the fusion of Rcls_0 and Rcls_f into the classification fusion response Rcls by the second classification branch in step 4.8, and the fusion of Rreg_0 and Rreg_f into the regression fusion response Rreg by the second regression branch, both adopt a residual-connection fusion scheme:
the two classification cross-correlation responses are stacked along the channel dimension and input to the second classification branch for fusion, and the branch output is connected residually to an input response to generate the classification fusion response Rcls; likewise, the two regression cross-correlation responses are stacked along the channel dimension and input to the second regression branch for fusion, and the branch output is connected residually to an input response to generate the regression fusion response Rreg.
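A shape-level numpy sketch of this residual fusion; the matrix `weights` is a hypothetical 1 × 1 channel-mixing stand-in for the learned second-branch network, and adding the residual to the fusion-template response (rather than the initial-template one) is an assumption:

```python
import numpy as np

def fuse_responses(r_init, r_fused, weights):
    """Stack the two cross-correlation responses on the channel dimension,
    mix channels with a 1x1-convolution stand-in, then add the
    fusion-template response as a residual connection."""
    stacked = np.concatenate([r_init, r_fused], axis=-1)  # (H, W, 2C)
    mixed = stacked @ weights                             # (H, W, C)
    return mixed + r_fused                                # residual connection

H, W, C = 22, 22, 256
r_init = np.random.rand(H, W, C)
r_fused = np.random.rand(H, W, C)
weights = np.random.rand(2 * C, C) * 0.01
out = fuse_responses(r_init, r_fused, weights)
print(out.shape)  # (22, 22, 256)
```

The residual path means the branch only needs to learn a correction on top of the fusion-template response, which typically eases training.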
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011524190.9A CN112541468B (en) | 2020-12-22 | 2020-12-22 | Target tracking method based on dual-template response fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112541468A true CN112541468A (en) | 2021-03-23 |
CN112541468B CN112541468B (en) | 2022-09-06 |
Family
ID=75019483
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113361329A (en) * | 2021-05-11 | 2021-09-07 | 浙江大学 | Robust single-target tracking method based on example feature perception |
CN113592906A (en) * | 2021-07-12 | 2021-11-02 | 华中科技大学 | Long video target tracking method and system based on annotation frame feature fusion |
CN113628246A (en) * | 2021-07-28 | 2021-11-09 | 西安理工大学 | Twin network target tracking method based on 3D convolution template updating |
CN113658224A (en) * | 2021-08-18 | 2021-11-16 | 中国人民解放军陆军炮兵防空兵学院 | Target contour tracking method and system based on correlated filtering and Deep Snake |
CN113808166A (en) * | 2021-09-15 | 2021-12-17 | 西安电子科技大学 | Single-target tracking method based on clustering difference and depth twin convolutional neural network |
CN114332151A (en) * | 2021-11-05 | 2022-04-12 | 电子科技大学 | Method for tracking interested target in shadow Video-SAR (synthetic aperture radar) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108830170A (en) * | 2018-05-24 | 2018-11-16 | 杭州电子科技大学 | A kind of end-to-end method for tracking target indicated based on layered characteristic |
WO2019085377A1 (en) * | 2017-11-03 | 2019-05-09 | 北京深鉴智能科技有限公司 | Target tracking hardware implementation system and method |
CN110827327A (en) * | 2018-08-13 | 2020-02-21 | 中国科学院长春光学精密机械与物理研究所 | Long-term target tracking method based on fusion |
CN111401178A (en) * | 2020-03-09 | 2020-07-10 | 蔡晓刚 | Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering |
CN112069896A (en) * | 2020-08-04 | 2020-12-11 | 河南科技大学 | Video target tracking method based on twin network fusion multi-template features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||