CN113538507A - Single-target tracking method based on full convolution network online training - Google Patents

Single-target tracking method based on full convolution network online training

Info

Publication number
CN113538507A
CN113538507A (application number CN202010293393.5A)
Authority
CN
China
Prior art keywords
layer
convolution
frame
convolution kernel
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010293393.5A
Other languages
Chinese (zh)
Other versions
CN113538507B (en)
Inventor
Limin Wang (王利民)
Yutao Cui (崔玉涛)
Cheng Jiang (蒋承)
Gangshan Wu (武港山)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010293393.5A priority Critical patent/CN113538507B/en
Publication of CN113538507A publication Critical patent/CN113538507A/en
Application granted granted Critical
Publication of CN113538507B publication Critical patent/CN113538507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method based on online training of a full convolution network, comprising the following stages: 1) a training sample generation stage; 2) a network configuration stage; 3) an offline training stage; 4) an online tracking stage. Through a full convolution network trained completely end to end, the invention generates target classification and target regression templates to guide the classification and regression tasks and updates these templates online, thereby accomplishing the target tracking task. By combining a simple full convolution network structure with online optimization of the classification and regression templates, a single-target tracking method with strong robustness and high accuracy is obtained.

Description

Single-target tracking method based on full convolution network online training
Technical Field
The invention belongs to the technical field of computer software, relates to a single-target tracking technology, and particularly relates to a single-target tracking method based on full convolution network online training.
Background
As a basic task in computer vision, visual object tracking aims to estimate, for an arbitrary object in a video, the spatial position at which it appears in each frame and to mark out its bounding box. In general, current visual object tracking can be divided into two subtasks: object classification and bounding box regression.
According to how the object classification subtask is solved, current tracking methods can be divided into generative methods and discriminative methods. Generative methods are based on the idea of template matching, typically a series of methods that use a Siamese (twin) network to learn the similarity between objects in previous and subsequent frames. The SiamFC method proposed by Bertinetto first introduced the Siamese network into visual object tracking to learn the similarity between the known tracking target and search regions in subsequent video frames. Li et al. further proposed SiamRPN, which introduces a Region Proposal Network (RPN) into the Siamese network and solves the tracking problem in a local region from the perspective of single-stage object detection. Discriminative methods, in contrast, learn an adaptive filter that captures the tracked object by maximizing the classification response gap between the tracked object and the background. The correlation-filter-based tracking method proposed by Henriques and the classifier-based tracking method proposed by Hyeonseob Nam are two typical discriminative methods. Compared with generative methods, discriminative methods distinguish the tracked object from the background better by updating the filter online. However, the complex online update mechanisms used in discriminative methods are often difficult to integrate into an end-to-end training framework. Recently, DiMP, proposed by Bhat, introduced a meta-learning framework into discriminative tracking and designed a target model predictor for online training, greatly improving tracking performance.
Bounding box regression is another important subtask of visual object tracking. Earlier tracking methods, such as DCF and SiamFC, estimate the object bounding box by exhaustively testing multiple scales. RPN-based trackers, exemplified by SiamRPN, follow the single-stage object detection paradigm and use a set of predefined anchor boxes to determine the object size and regress the bounding box. ATOM and DiMP iteratively refine a number of manually generated initial candidate boxes with an existing IoU-Net model and then select the final box. For the bounding box regression subtask, existing tracking methods therefore usually depend on hand-designed box generation rules and cannot be trained or updated online.
Inspired by the success of discriminative methods in introducing online training into the object classification subtask, the present method introduces an online training mechanism into the bounding box regression subtask, so that it cooperates better with existing discriminative methods and learns object information online during tracking, with the goal of predicting the object bounding box more stably and accurately. The FCOT (Fully Convolutional Online Tracking) work provides an accurate and efficient online-training tracking framework that directly handles the two subtasks of object classification and bounding box regression. The designed RMG (Regression Model Generator) module introduces online updating into bounding box regression and keeps box prediction highly accurate when the appearance of the object changes.
Disclosure of Invention
The problem the invention aims to solve is as follows. Target tracking can generally be divided into a classification task, which distinguishes the target from the background to roughly locate it, and a regression task, which generates a precise target bounding box. For the classification task, online training can be adopted to maximize the response gap between the target and similar objects in the background, so as to learn an online-adjustable filter. For the regression branch, however, existing methods generally perform relatively complex regression based on a number of preset boxes, which on the one hand introduces many additional parameters and on the other hand uses only a single fixed target template to guide regression, so that shape changes such as deformation or rotation of the object in subsequent frames cannot be handled well. Therefore, the main problem to be solved by the present invention is: how to design a simple target regression branch that avoids complex additional parameters while allowing the templates of both the target classification and target regression tasks to be updated online.
The technical scheme of the invention is as follows: a single-target tracking method based on online training of a full convolution network. The initial network parameters are trained first; then, during tracking, some of the tracked video frames are selected as training samples to train and update the network parameters online, improving tracking accuracy. The tracking method comprises a training sample generation stage, a network configuration stage, an offline training stage and an online tracking stage:
1) Training sample generation stage: training samples are generated for the offline training process. Target-region jittering is first applied to every frame of every video in the offline training data set, and the jittered target search region is cropped out. Three frames are drawn from the first half of each video frame sequence as training frames and one frame from the second half as a test frame; the test frame with its annotated target box serves as the verification frame. For each verification frame a Gaussian label map centered on the target box center is generated as the label of the verification set, and the distances from the center of the target box in each verification frame to its four boundaries are recorded as the labels of the regression branch during offline training;
2) Network configuration stage: the classification feature maps and regression feature maps of the test frame and the training frames are extracted. An adaptable convolution kernel f_cls of the classification branch is generated from the classification feature map of the training frames, and an adaptable convolution kernel f_reg of the regression branch is generated from the regression feature map of the training frames. The adaptable kernel f_cls is applied to the classification feature map of the test frame as the kernel of the classification convolution, and the convolution operation produces the final classification score confidence map M_cls. The adaptable kernel f_reg is applied to the regression feature map of the test frame as the kernel of the regression convolution, producing the final regression map M_reg of center-point-to-boundary distances, which represents the distances from the target center point to the four boundaries of the object. The point with the highest score is found in the confidence map M_cls, and the four offset distances at that point are read from M_reg, giving the final target box output on the test frame;
3) Offline training stage: for offline training of the classification branch, the hinge-like LBHinge loss proposed in DiMP is used as the loss function, and the IoU loss is used for the regression branch. Combined with the labels obtained from the ground-truth verification frames, the whole network is updated by the back-propagation algorithm with an SGD optimizer, and step 2) is repeated until the set number of iterations is reached;
4) Online tracking stage: the target box search region in the first frame of the video to be tracked is first cropped out as a template, and the template frame is then expanded into an online training data set containing 30 images, used as the training frames F_train. The frame to be tracked is taken as the test frame F_test and, together with F_train, is input into the network of step 2) to obtain the target box on F_test. During tracking, from every 25 tracked frames the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
Compared with the prior art, the invention has the following advantages.
The invention provides a fully convolutional online tracking method (FCOT). The method adopts a simple tracking framework that directly performs object classification and bounding box regression, improving tracking accuracy while maintaining tracking efficiency.
The invention designs a Regression Model Generator (RMG), which introduces an online training mechanism into bounding box regression and brings object classification and bounding box regression into a unified framework for online training and tracking. Compared with the hand-designed rules relied on by existing methods, the online-trained bounding box regression adapts better to object deformation during tracking.
The invention achieves good accuracy on the visual object tracking task and addresses the box misalignment caused by interference from similar background content during bounding box regression. Compared with existing methods, the proposed FCOT tracker obtains good tracking success rates and localization accuracy on multiple visual tracking benchmark data sets.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
Fig. 2 is a schematic diagram of a frame extraction process of a video.
Fig. 3 is a schematic diagram of a multivariate information fusion module provided by the present invention.
FIG. 4 is a schematic diagram of multi-scale inter-frame difference feature extraction and fusion proposed by the present invention.
Fig. 5 is a schematic diagram of a probability map solving process proposed by the present invention.
Fig. 6 is a schematic diagram of a single-frame feature sequence feature extraction process.
Detailed Description
The invention provides a single-target tracking method based on online training of a full convolution network, which first requires offline training and then, during tracking, updates the parameters of some modules through online training. Offline training is performed on four training data sets, TrackingNet-Train, LaSOT-Train, COCO-Train and GOT-10k-Train; tests on the six test sets OTB100, NFS, UAV123, GOT-10k-Test, LaSOT-Test and TrackingNet-Test achieve high accuracy and tracking success rates. The method is implemented with the Python 3 programming language and the PyTorch 1.1 deep learning framework.
FIG. 1 is a diagram of the system framework used by the present invention: a designed, fully end-to-end-trained full convolution network generates target classification and target regression templates to guide the classification and regression tasks, and the templates are updated online to accomplish the target tracking task. The whole method comprises a training sample generation stage, a network configuration stage, an offline training stage and an online tracking stage, implemented as follows:
1) Data preparation, i.e. the training sample generation stage. Training samples are generated during the offline training process: target-region jittering is first applied to every frame of every video in the offline training data set, and the jittered target search region is cropped out. Three frames are drawn from the first half of each video frame sequence as training frames and one frame from the second half as a test frame; the test frame with its annotated target box serves as the verification frame. For each verification frame a Gaussian label map centered on the target box center is generated as the label of the verification set, and the distances from the center of the target box in each verification frame to its four boundaries are recorded as the labels of the regression branch during offline training.
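As an illustrative sketch (not the patented code) of this stage, the two kinds of labels can be generated as below: a Gaussian classification label centered on the target box and the distances from each score-map cell to the four box boundaries as regression labels. The function name, the score-map stride and the value of sigma are assumptions made only for this example.

```python
import torch

def make_labels(box_xyxy, map_size=72, stride=4, sigma=4.0):
    """box_xyxy: annotated target box (x1, y1, x2, y2) in search-region pixels."""
    x1, y1, x2, y2 = [float(v) for v in box_xyxy]
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # coordinates of the score-map cell centers in image space
    coords = torch.arange(map_size, dtype=torch.float32) * stride + stride / 2.0
    xs = coords.view(1, -1).expand(map_size, -1)   # (H, W) x-coordinates
    ys = coords.view(-1, 1).expand(-1, map_size)   # (H, W) y-coordinates
    # Gaussian classification label centered on the target box center
    cls_label = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    # regression label: distances from every cell center to the four box boundaries
    reg_label = torch.stack([xs - x1, ys - y1, x2 - xs, y2 - ys], dim=0)  # (4, H, W)
    return cls_label, reg_label

# example usage with a hypothetical box:
# cls_label, reg_label = make_labels((96.0, 80.0, 192.0, 208.0))
```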
2) The configuration phase of the model, i.e. the network configuration phase, is as follows.
2.1) Extracting the coding features of the test frame: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the test frame F_test ∈ R^(B×3×288×288), giving the coding features X_test^e2, X_test^e3 and X_test^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript test denotes the test frame, and B denotes the batch size. The encoder includes convolution layers and pooling layers; the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
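A hedged sketch of this encoder follows, using the torchvision ResNet-50. The mapping of Block-1 to Block-4 onto torchvision's layer1 to layer4 and the use of ImageNet pre-trained weights follow the embodiment described later, but remain assumptions about the exact patented network.

```python
import torch
import torchvision

class Encoder(torch.nn.Module):
    """First four blocks of ResNet-50 used as the coding layers of step 2.1)."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)   # ImageNet pre-trained weights
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.block1 = resnet.layer1   # 256-channel output
        self.block2 = resnet.layer2   # 512-channel output
        self.block3 = resnet.layer3   # 1024-channel output
        self.block4 = resnet.layer4   # 2048-channel output

    def forward(self, frames):        # frames: (B, 3, 288, 288) cropped search regions
        x = self.stem(frames)
        e1 = self.block1(x)
        e2 = self.block2(e1)
        e3 = self.block3(e2)
        e4 = self.block4(e3)
        return e1, e2, e3, e4         # multi-level coding features fed to the decoder
```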
2.2) Extracting the decoding features of the test frame: the 1024-channel coding feature obtained in step 2.1) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_test^1 of size 256 × 18 × 18. The 512-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels, so that its channel number becomes 256; the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_test^2 of size 256 × 36 × 36. The 256-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels; the second-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_test^3 of size 256 × 72 × 72.
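A minimal sketch of this decoder is given below, assuming the three lateral inputs are the 1024-, 512- and 256-channel coding features from step 2.1); the module names and the align_corners choice are assumptions of this example.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Produces decoding features of sizes 256x18x18, 256x36x36 and 256x72x72."""
    def __init__(self):
        super().__init__()
        self.lateral1 = nn.Conv2d(1024, 256, kernel_size=1)   # -> first-layer decoding feature
        self.lateral2 = nn.Conv2d(512, 256, kernel_size=1)
        self.lateral3 = nn.Conv2d(256, 256, kernel_size=1)
        self.post2 = nn.Conv2d(256, 256, kernel_size=1)       # 1x1 convolution after the addition
        self.post3 = nn.Conv2d(256, 256, kernel_size=1)

    def forward(self, feat256, feat512, feat1024):
        d1 = self.lateral1(feat1024)                                            # 256 x 18 x 18
        up1 = F.interpolate(d1, scale_factor=2, mode="bilinear", align_corners=False)
        d2 = self.post2(self.lateral2(feat512) + up1)                           # 256 x 36 x 36
        up2 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        d3 = self.post3(self.lateral3(feat256) + up2)                           # 256 x 72 x 72
        return d1, d2, d3
```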
2.3) Extracting the classification features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; a Group Normalization layer with group size 32 and a ReLU activation layer are placed between the two deformable convolutions. This gives the classification feature map X_test^cls of the test frame, of size 256 × 72 × 72;
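The classification head of step 2.3) (and the analogous regression head of step 2.4)) can be sketched as follows. torchvision's DeformConv2d requires a newer torchvision than the PyTorch 1.1 setup mentioned in the embodiment and needs an explicit offset tensor, so a small offset-predicting convolution is added here as an implementation assumption that the text does not spell out.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution whose offsets are predicted from the input."""
    def __init__(self, channels=256):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)  # (dx, dy) per kernel tap
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))

class ClsHead(nn.Module):
    """Conv + GroupNorm + ReLU followed by two deformable convolutions, as in step 2.3)."""
    def __init__(self, channels=256):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                                 nn.GroupNorm(32, channels), nn.ReLU(inplace=True))
        self.deform1 = DeformBlock(channels)
        self.mid = nn.Sequential(nn.GroupNorm(32, channels), nn.ReLU(inplace=True))
        self.deform2 = DeformBlock(channels)

    def forward(self, d3):               # d3: third-layer decoding feature, 256 x 72 x 72
        return self.deform2(self.mid(self.deform1(self.pre(d3))))
```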
2.4) Extracting the regression features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer is added after the second deformable convolution. At the same time, the decoding feature D_test^2 obtained in step 2.2) is passed through the same network and the resulting feature is upsampled by a factor of 2. The feature derived from D_test^3 and the upsampled feature are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_test^reg of the test frame, of size 256 × 72 × 72;
2.5) Extracting the coding features of the training frames: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the training frames F_train ∈ R^(B×3×3×288×288), giving the coding features X_train^e2, X_train^e3 and X_train^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript train denotes the training frames, and B denotes the batch size. The encoder includes convolution layers and pooling layers; the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
2.6) Extracting the decoding features of the training frames: the 1024-channel coding feature obtained in step 2.5) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_train^1 of size 256 × 18 × 18. The 512-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels, so that its channel number becomes 256; the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_train^2 of size 256 × 36 × 36. The 256-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels; the second-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_train^3 of size 256 × 72 × 72.
2.7) Extracting the classification features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer with group size 32 and a ReLU activation layer. This gives the classification feature map X_train^cls of the training frames, of size 256 × 72 × 72;
2.8) Extracting the regression features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer is added after the second deformable convolution. At the same time, the decoding feature D_train^2 obtained in step 2.6) is passed through the same network and the resulting feature is upsampled by a factor of 2. The two feature maps obtained above are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_train^reg of the training frames, of size 256 × 72 × 72;
2.9) Generating the adaptable convolution kernel of the classification branch: the classification feature map X_train^cls obtained in step 2.7) is first input into a convolution layer and a region-of-interest pooling layer (ROI Pooling); the convolution layer has 3 × 3 convolution kernels, stride 1, and 256 input and output channels, and the ROI Pooling layer has size 4 × 4 and stride 16. This gives the initial adaptable convolution kernel f_cls^0 of size 256 × 4 × 4. Then f_cls^0 is optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_cls of the classification branch.
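An approximate sketch of this step follows: torchvision's roi_pool extracts a 256 × 4 × 4 initial kernel from the training-frame classification features inside the target box, and a few plain gradient steps on a squared-error objective stand in for the Gauss-Newton optimization, so the learning rate, step count and loss below are only placeholders for the patented optimizer.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_pool

def initial_cls_kernel(train_cls_feat, rois, stride=4):
    """train_cls_feat: (N, 256, 72, 72); rois: (N, 5) rows of (batch_index, x1, y1, x2, y2)."""
    pooled = roi_pool(train_cls_feat, rois, output_size=(4, 4), spatial_scale=1.0 / stride)
    return pooled.mean(dim=0, keepdim=True)               # initial kernel, shape (1, 256, 4, 4)

def refine_cls_kernel(f_init, train_cls_feat, gauss_labels, steps=5, lr=1e-2):
    """gauss_labels: (N, 1, 72, 72) Gaussian maps from the training-sample stage."""
    f = f_init.clone().requires_grad_(True)
    for _ in range(steps):
        score = F.conv2d(train_cls_feat, f, padding=f.shape[-1] // 2)
        # trim the extra row/column produced by the even-sized kernel
        score = score[..., : gauss_labels.shape[-2], : gauss_labels.shape[-1]]
        loss = F.mse_loss(score, gauss_labels)
        (grad,) = torch.autograd.grad(loss, f)
        f = (f - lr * grad).detach().requires_grad_(True)  # simple descent step
    return f.detach()                                      # adaptable kernel f_cls
```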
2.10) Generating the adaptable convolution kernel of the regression branch: the regression feature map X_train^reg obtained in step 2.8) is first input into a region-of-interest pooling layer (ROI Pooling) of size 3 × 3 and stride 16, giving the initial adaptable convolution kernel f_reg^0 of size 256 × 4. The initial convolution kernel f_reg^0 is then optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_reg of the regression branch.
2.11) Obtaining the classification confidence map: the adaptable convolution kernel f_cls obtained in step 2.9) is used as the kernel of the classification convolution and applied to the classification feature map X_test^cls of the test frame obtained in step 2.3); the convolution operation produces the final classification score confidence map M_cls of size 72 × 72, in which a higher score at a point indicates higher confidence that the point is the target center.
The classification confidence map M_cls is calculated as follows. Denote the final classification convolution by Conv1, whose convolution kernel is the adaptable kernel f_cls obtained in step 2.9); then
M_cls = Conv1(X_test^cls; f_cls), M_cls ∈ R^(1×72×72).
2.12) Obtaining the regression offset distances: the adaptable convolution kernel f_reg obtained in step 2.10) is used as the kernel of the regression convolution and applied to the regression feature map X_test^reg of the test frame obtained in step 2.4); the convolution operation produces the final regression map M_reg of center-point-to-boundary distances, of size 4 × 72 × 72, representing the distances from the target center point to the four boundaries of the object. Denote the final regression convolution by Conv2, whose convolution kernel is the adaptable kernel f_reg obtained in step 2.10); then
M_reg = Conv2(X_test^reg; f_reg), M_reg ∈ R^(4×72×72).
The point with the highest score is found in the confidence map M_cls obtained in step 2.11), and the four offset distances at that point are read from M_reg, giving the final target box.
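Steps 2.11) and 2.12) can be condensed into the following hedged sketch, in which the adaptable kernels are applied with F.conv2d and the box is decoded from the arg-max of M_cls; the kernel sizes, the padding/cropping choice and the score-map stride are assumptions used only to make the example self-contained.

```python
import torch
import torch.nn.functional as F

def track_frame(cls_feat, reg_feat, f_cls, f_reg, map_size=72, stride=4):
    """cls_feat, reg_feat: (1, 256, 72, 72); f_cls: (1, 256, k, k); f_reg: (4, 256, k, k)."""
    m_cls = F.conv2d(cls_feat, f_cls, padding=f_cls.shape[-1] // 2)[..., :map_size, :map_size]
    m_reg = F.conv2d(reg_feat, f_reg, padding=f_reg.shape[-1] // 2)[..., :map_size, :map_size]
    # highest-confidence location of M_cls is taken as the target center
    idx = int(m_cls.view(-1).argmax())
    cy, cx = idx // map_size, idx % map_size
    left, top, right, bottom = m_reg[0, :, cy, cx]
    # map the score-map cell back to search-region pixels and apply the four offsets
    px, py = cx * stride + stride / 2.0, cy * stride + stride / 2.0
    box = torch.stack([px - left, py - top, px + right, py + bottom])   # (x1, y1, x2, y2)
    return box, m_cls
```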
The network configuration stage is described in detail below for one embodiment. The first four blocks of ResNet-50 are used as the network structure of the encoding layer, loaded with the parameters of an ImageNet pre-trained model. The training frames and the test frame are then decoded separately. Specifically, a convolution layer with a 1 × 1 kernel first produces the first-layer decoding feature, with a feature map size of 256 × 18 × 18. Next, a 1 × 1 convolution layer with 512 input channels and 256 output channels reduces the channel number to 256; the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a 1 × 1 convolution is applied, giving the second-layer decoding feature, with a feature map size of 256 × 36 × 36. The feature map is then upsampled by a factor of 2 again to obtain the third-layer decoding feature, with a size of 256 × 72 × 72.
Then, classification features and regression features are extracted separately from the training frames and the test frame. As shown in FIG. 2, the features obtained from the decoder are input into a convolution layer followed by two deformable convolution operations, with a group normalization layer and a ReLU activation layer between them, giving the classification features of the training frames; the classification feature map size is 256 × 72 × 72. As also shown in FIGS. 3, 4 and 5, the decoder features of the training frames and the test frame pass through the convolution layers, deformable convolution layers, group normalization layers and activation function layers shown in the figures. The regression feature map of the training frames has size 1024 × 72 × 72, and the classification and regression feature maps of the test frame both have size 256 × 72 × 72.
The classification and regression optimizable models are generated from the training frames. As shown in FIG. 6, the classification and regression features of the training frames are input into an initial optimizable-model generator, which produces initial coarse models (in fact convolution kernels), sized 256 × 4 × 4 for classification and 4 × 256 × 3 × 3 for regression. The initial models are then optimized by the steepest descent method to obtain the final classification and regression optimizable models applied to the test frame, namely the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
Finally, convolution operations are performed on the classification features and regression features of the test frame, with the optimized classification and regression models obtained above as convolution kernels, giving a 72 × 72 classification score map and a 4 × 72 × 72 regression offset distance map. The point with the highest score in the classification score map is taken as the target center point, and the four values at that point in the regression offset distance map are taken as the distances from the center point to the four boundaries of the target, giving the final target box.
3) In the offline training stage, the hinge-like LBHinge loss proposed in DiMP is used as the loss function for offline training of the classification branch, and the IoU loss is used for the regression branch. An SGD optimizer is used with a batch size of 40; the total number of training epochs is 100, the learning rate is divided by 10 at epochs 25 and 45, and the decay rate is set to 0.2. Training is performed on 8 RTX 2080 Ti GPUs; the whole network is updated by the back-propagation algorithm, and steps 2.1) to 2.12) are repeated until the set number of iterations is reached.
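For the regression branch, the IoU loss named above can be written for (left, top, right, bottom) offsets as in the following sketch; the reduction and the epsilon are assumptions of this example, and the hinge-like classification loss of DiMP is not reproduced here.

```python
import torch

def iou_loss(pred_ltrb, target_ltrb, eps=1e-6):
    """pred_ltrb, target_ltrb: (N, 4) distances from the same points to the box boundaries."""
    pl, pt, pr, pb = pred_ltrb.unbind(dim=1)
    tl, tt, tr, tb = target_ltrb.unbind(dim=1)
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    # overlap of two boxes that share the same reference point
    inter_w = (torch.min(pl, tl) + torch.min(pr, tr)).clamp(min=0)
    inter_h = (torch.min(pt, tt) + torch.min(pb, tb)).clamp(min=0)
    inter = inter_w * inter_h
    union = pred_area + target_area - inter
    iou = inter / (union + eps)
    return (1.0 - iou).mean()
```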
4) In the online tracking stage, the training frame sequence and the test frame are input into the network, and the final target box is obtained on the basis of the initial parameters.
First, the target box search region in the first frame of the video to be tracked is cropped out as a template, and the template frame is then expanded into an online training data set containing 30 images, used as the training frames F_train. The frame to be tracked is taken as the test frame F_test and, together with F_train, is input into the network of step 2) to obtain the target box on F_test. During tracking, from every 25 tracked frames the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
In the invention, online training means that the convolution kernels of the two branches, the classification branch and the regression branch, can be trained online to update their parameters and adapt to the current tracking video; the initial parameters of the full convolution network are still obtained by offline training.
The first frame of each video in the test set is processed in the same way as the training set: a region 5 times the size of the target is cropped out and scaled to 288 × 288, and the first frame is augmented by rotation, translation, added noise and similar transformations to obtain 30 training samples, with which online training is performed. During tracking, online training is carried out every 25 frames, and the frame with the highest classification score is selected and added to the online training set. On the OTB100, NFS, UAV123, GOT-10k, LaSOT and TrackingNet test data sets, the tracking speed is 53 fps. In terms of tracking accuracy, AUC reaches 69.2% and precision (Pre) reaches 90.6% on OTB100; on NFS, AUC reaches 62.2% and Pre reaches 74.5%; on UAV123, AUC reaches 65.7% and Pre reaches 87.6%; on GOT-10k, AO reaches 62.7%, SR_0.75 reaches 51.7% and SR_0.5 reaches 75.3%; on TrackingNet, Suc reaches 74.5% and Pre reaches 71.4%; on LaSOT, Suc reaches 69.2% and Pre reaches 90.6%. The indices on the 6 test data sets exceed those of DiMP.
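The online stage described above amounts to the bookkeeping sketched below, under the assumption that feature extraction and kernel optimization (for example a routine like refine_cls_kernel sketched earlier) are available elsewhere; the class name and the confidence handling are illustrative only.

```python
import collections

class OnlineUpdater:
    """Keeps a 30-sample online training set and triggers re-optimization every 25 frames."""
    def __init__(self, update_interval=25, memory_size=30):
        self.update_interval = update_interval
        self.memory = collections.deque(maxlen=memory_size)   # (sample, label) pairs
        self.frame_count = 0
        self.best_conf = float("-inf")
        self.best_sample = None

    def init_first_frame(self, augmented_samples):
        # 30 rotated / translated / noise-augmented crops of the first frame
        self.memory.extend(augmented_samples)

    def step(self, sample, confidence):
        """Call once per tracked frame; returns True when f_cls and f_reg should be re-trained."""
        self.frame_count += 1
        if confidence > self.best_conf:
            self.best_conf, self.best_sample = confidence, sample
        if self.frame_count % self.update_interval == 0:
            if self.best_sample is not None:
                self.memory.append(self.best_sample)           # highest-scoring frame of the interval
            self.best_conf, self.best_sample = float("-inf"), None
            return True
        return False
```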

Claims (2)

1. A single-target tracking method based on full convolution network online training, characterized in that initial network parameters are first trained, and then, during tracking, some of the tracked video frames are selected as training samples to train and update the network parameters online, improving tracking accuracy; the tracking method comprises a training sample generation stage, a network configuration stage, an offline training stage and an online tracking stage:
1) the training sample generation stage, in which training samples are generated for the offline training process: target-region jittering is first applied to every frame of every video in the offline training data set, and the jittered target search region is cropped out; three frames are drawn from the first half of each video frame sequence as training frames and one frame from the second half as a test frame; the test frame with its annotated target box serves as the verification frame, a Gaussian label map centered on the target box center is generated for each verification frame as the label of the verification set, and the distances from the center of the target box in each verification frame to its four boundaries are recorded as the labels of the regression branch during offline training;
2) the network configuration stage, in which the classification feature maps and regression feature maps of the test frame and the training frames are extracted; an adaptable convolution kernel f_cls of the classification branch is generated from the classification feature map of the training frames, and an adaptable convolution kernel f_reg of the regression branch is generated from the regression feature map of the training frames; the adaptable kernel f_cls is applied to the classification feature map of the test frame as the kernel of the classification convolution, and the convolution operation produces the final classification score confidence map M_cls; the adaptable kernel f_reg is applied to the regression feature map of the test frame as the kernel of the regression convolution, producing the final regression map M_reg of center-point-to-boundary distances, which represents the distances from the target center point to the four boundaries of the object; the point with the highest score is found in the confidence map M_cls, and the four offset distances at that point are read from M_reg, so that the final target box on the test frame is output;
3) the offline training stage, in which, for offline training of the classification branch, the hinge-like LBHinge loss proposed in DiMP is used as the loss function and the IoU loss is used for the regression branch; combined with the labels obtained from the ground-truth verification frames, the whole network is updated by the back-propagation algorithm with an SGD optimizer, and step 2) is repeated until the set number of iterations is reached;
4) the online tracking stage, in which the target box search region in the first frame of the video to be tracked is cropped out as a template, and the template frame is then expanded into an online training data set containing 30 images, used as the training frames F_train; the frame to be tracked in the video is taken as the test frame F_test and, together with F_train, is input into the network of step 2) to obtain the target box on F_test, thereby realizing tracking; during tracking, from every 25 tracked frames the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
2. The single-target tracking method based on the full convolution network online training as claimed in claim 1, wherein the network configuration stage specifically comprises:
2.1) extracting the coding features of the test frame: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the test frame F_test ∈ R^(B×3×288×288), giving the coding features X_test^e2, X_test^e3 and X_test^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript test denotes the test frame, and B denotes the batch size; the encoder includes convolution layers and pooling layers, the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
2.2) extracting the decoding features of the test frame: the 1024-channel coding feature obtained in step 2.1) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_test^1 of size 256 × 18 × 18; the 512-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels so that its channel number becomes 256, the first-layer decoding feature D_test^1 is upsampled by a factor of 2 with a bilinear interpolation layer, the two results are added and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_test^2 of size 256 × 36 × 36; the 256-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels, the second-layer decoding feature D_test^2 is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_test^3 of size 256 × 72 × 72;
2.3) extracting the classification features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, with a Group Normalization layer of group size 32 and a ReLU activation layer between the two deformable convolutions, giving the classification feature map X_test^cls of the test frame, of size 256 × 72 × 72;
2.4) extracting the regression features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, the first deformable convolution being followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer being added after the second deformable convolution; at the same time, the decoding feature D_test^2 obtained in step 2.2) is passed through the same network as D_test^3 and the resulting feature is upsampled by a factor of 2; the feature derived from D_test^3 and the upsampled feature are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_test^reg of the test frame, of size 256 × 72 × 72;
2.5) extracting the coding features of the training frames: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the training frames F_train ∈ R^(B×3×3×288×288), giving the coding features X_train^e2, X_train^e3 and X_train^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript train denotes the training frames, and B denotes the batch size; the encoder includes convolution layers and pooling layers, the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
2.6) extracting the decoding features of the training frames: the 1024-channel coding feature obtained in step 2.5) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_train^1 of size 256 × 18 × 18; the 512-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels so that its channel number becomes 256, the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_train^2 of size 256 × 36 × 36; the 256-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels, the second-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_train^3 of size 256 × 72 × 72;
2.7) extracting the classification features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, the first deformable convolution being followed by a Group Normalization layer with group size 32 and a ReLU activation layer, giving the classification feature map X_train^cls of the training frames, of size 256 × 72 × 72;
2.8) extracting the regression features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, the first deformable convolution being followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer being added after the second deformable convolution; at the same time, the decoding feature D_train^2 obtained in step 2.6) is passed through the same network and the resulting feature is upsampled by a factor of 2; the two feature maps obtained above are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_train^reg of the training frames, of size 256 × 72 × 72;
2.9) generating the adaptable convolution kernel of the classification branch: the classification feature map X_train^cls obtained in step 2.7) is first input into a convolution layer and a region-of-interest pooling layer (ROI Pooling); the convolution layer has 3 × 3 convolution kernels, stride 1, and 256 input and output channels, and the ROI Pooling layer has size 4 × 4 and stride 16, giving the initial adaptable convolution kernel f_cls^0 of size 256 × 4 × 4; f_cls^0 is then optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_cls of the classification branch;
2.10) generating the adaptable convolution kernel of the regression branch: the regression feature map X_train^reg obtained in step 2.8) is first input into a region-of-interest pooling layer (ROI Pooling) of size 3 × 3 and stride 16, giving the initial adaptable convolution kernel f_reg^0 of size 256 × 4; the initial convolution kernel f_reg^0 is then optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_reg of the regression branch;
2.11) obtaining the classification confidence map: the adaptable convolution kernel f_cls obtained in step 2.9) is used as the kernel of the classification convolution and applied to the classification feature map X_test^cls of the test frame obtained in step 2.3); the convolution operation produces the final classification score confidence map M_cls of size 72 × 72, in which a higher score at a point indicates higher confidence that the point is the target center;
the classification confidence map M_cls is calculated as follows: denote the final classification convolution by Conv1, whose convolution kernel is the adaptable kernel f_cls obtained in step 2.9); then
M_cls = Conv1(X_test^cls; f_cls), M_cls ∈ R^(1×72×72);
2.12) obtaining the regression offset distances: taking the adaptable convolution kernel f_reg obtained in step 2.10) as the kernel of the regression-branch convolution and applying it to the regression feature map of the test frame obtained in step 2.4); after this convolution operation, the final regression map M_reg of distances from the center point to the target boundaries is generated; the regression map has a size of 4 × 72 × 72, and its four channels respectively represent the distances from the target center point to the four boundaries of the object; denote the final regression convolution by Conv_2, whose convolution kernel is the adaptable convolution kernel f_reg obtained in step 2.10); then M_reg is the output of Conv_2 applied to the regression feature map of the test frame, with M_reg ∈ R^(4×72×72);
according to the confidence map M_cls obtained in step 2.11), the point with the highest score is found; the four offset distances corresponding to this point are then read from M_reg to obtain the final target bounding box.
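Reading off the final box combines the two maps. The sketch below assumes the regression map stores (left, top, right, bottom) distances in feature-map units and that the feature map has a stride of 16 relative to the image; neither unit convention is fixed by the claim, and decode_box is a name introduced here for illustration.

```python
import torch


def decode_box(m_cls, m_reg, stride=16):
    """m_cls: 72 x 72 confidence map; m_reg: 4 x 72 x 72 map whose channels hold
    the (left, top, right, bottom) distances from each location to the box sides.
    Returns the final target box (x1, y1, x2, y2) in image coordinates."""
    cy, cx = divmod(int(m_cls.argmax()), m_cls.shape[-1])  # highest-scoring location
    l, t, r, b = m_reg[:, cy, cx].tolist()                 # four offsets at that point
    x1 = (cx - l) * stride
    y1 = (cy - t) * stride
    x2 = (cx + r) * stride
    y2 = (cy + b) * stride
    return x1, y1, x2, y2
```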
CN202010293393.5A 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training Active CN113538507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010293393.5A CN113538507B (en) 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010293393.5A CN113538507B (en) 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training

Publications (2)

Publication Number Publication Date
CN113538507A true CN113538507A (en) 2021-10-22
CN113538507B CN113538507B (en) 2023-11-17

Family

ID=78088144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010293393.5A Active CN113538507B (en) 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training

Country Status (1)

Country Link
CN (1) CN113538507B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012243A1 (en) * 2022-07-15 2024-01-18 Mediatek Inc. Unified cross-component model derivation

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086607A1 (en) * 2016-11-11 2018-05-17 纳恩博(北京)科技有限公司 Target tracking method, electronic device, and storage medium
US20180285692A1 (en) * 2017-03-28 2018-10-04 Ulsee Inc. Target Tracking with Inter-Supervised Convolutional Networks
CN107945210A (en) * 2017-11-30 2018-04-20 天津大学 Target tracking algorism based on deep learning and environment self-adaption
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks
CN108960086A (en) * 2018-06-20 2018-12-07 电子科技大学 Based on the multi-pose human body target tracking method for generating confrontation network positive sample enhancing
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
US20200051250A1 (en) * 2018-08-08 2020-02-13 Beihang University Target tracking method and device oriented to airborne-based monitoring scenarios
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DONG-HYUN LEE: "Fully Convolutional Single-Crop Siamese Networks for Real-Time Visual Object Tracking", 《ELECTRONICS》 *
LIJUN WANG ET AL.: "Visual Tracking with Fully Convolutional Networks", 《PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
LUCA BERTINETTO ET AL.: "Fully-Convolutional Siamese Networks for Object Tracking", 《COMPUTER VISION AND PATTERN RECOGNITION (CS.CV)》 *
YANGLIU KUAI ET AL.: "Hyper-Feature Based Tracking with the Fully-Convolutional Siamese Network", 《2017 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA)》 *
YANGLIU KUAI ET AL.: "Learning Fully Convolutional Network for Visual Tracking With Multi-Layer Feature Fusion", 《IEEE ACCESS》 *
SHI LULU: "Research on Deep Learning and Its Application in Video Object Tracking", 《China Master's Theses Full-text Database》 *
XU DUO: "Research on Object Tracking Algorithms Based on Structured Processing with CNN and RNN", 《China Master's Theses Full-text Database》 *
XU ZHENG: "Research on Object Tracking Methods Based on Deep Learning", 《China Master's Theses Full-text Database》 *

Also Published As

Publication number Publication date
CN113538507B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
US20230186056A1 (en) Grabbing detection method based on rp-resnet
US11636570B2 (en) Generating digital images utilizing high-resolution sparse attention and semantic layout manipulation neural networks
Peyrard et al. ICDAR2015 competition on text image super-resolution
CN113361636B (en) Image classification method, system, medium and electronic device
CN108595558B (en) Image annotation method based on data equalization strategy and multi-feature fusion
Yang et al. Diffusion model as representation learner
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN112861915A (en) Anchor-frame-free non-cooperative target detection method based on high-level semantic features
CN114359603A (en) Self-adaptive unsupervised matching method in multi-mode remote sensing image field
CN116129289A (en) Attention edge interaction optical remote sensing image saliency target detection method
CN114529574A (en) Image matting method and device based on image segmentation, computer equipment and medium
CN115049921A (en) Method for detecting salient target of optical remote sensing image based on Transformer boundary sensing
CN115908806A (en) Small sample image segmentation method based on lightweight multi-scale feature enhancement network
CN113538507A (en) Single-target tracking method based on full convolution network online training
EP4237997A1 (en) Segmentation models having improved strong mask generalization
CN115115667A (en) Accurate target tracking method based on target transformation regression network
Goud et al. Text localization and recognition from natural scene images using ai
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN112732943B (en) Chinese character library automatic generation method and system based on reinforcement learning
CN113743497B (en) Fine granularity identification method and system based on attention mechanism and multi-scale features
CN115393491A (en) Ink video generation method and device based on instance segmentation and reference frame
CN112529081A (en) Real-time semantic segmentation method based on efficient attention calibration
US20240095929A1 (en) Methods and systems of real-time hierarchical image matting
CN117593755B (en) Method and system for recognizing gold text image based on skeleton model pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant