CN113538507A - Single-target tracking method based on full convolution network online training - Google Patents
- Publication number
- CN113538507A (application number CN202010293393.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/251 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
- G06N3/045 — Combinations of networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06T2207/10016 — Video; Image sequence
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- Y02T10/40 — Engine management systems
Abstract
The invention provides a target tracking method based on online training of a fully convolutional network, comprising the following stages: 1) a training-sample generation stage; 2) a network configuration stage; 3) an offline training stage; 4) an online tracking stage. Through a designed, completely end-to-end trained fully convolutional network, the invention generates target classification and target regression templates to guide the classification and regression tasks, and updates these templates online to carry out target tracking. By combining a simple fully convolutional network structure with online optimization of the classification and regression templates, a single-target tracking method with strong robustness and high precision is obtained.
Description
Technical Field
The invention belongs to the technical field of computer software, relates to a single-target tracking technology, and particularly relates to a single-target tracking method based on full convolution network online training.
Background
As a basic task in computer vision, visual object tracking aims to estimate, for an arbitrary object in a video, the spatial position at which it appears in each frame and to mark its bounding box. Current visual object tracking is generally divided into two subtasks: object classification and bounding box regression.
According to how they solve the object classification subtask, current tracking methods can be divided into generative and discriminative methods. Generative methods are based on the idea of template matching; typical examples are the family of methods that use a Siamese network to learn the similarity between objects in preceding and subsequent frames. The SiamFC method proposed by Bertinetto et al. first introduced a Siamese network into visual object tracking to learn the similarity between the known tracked target and search regions in subsequent video frames. Li et al. further proposed the SiamRPN method, which introduces a Region Proposal Network (RPN) into the Siamese network and solves the tracking problem within a local region from the perspective of single-stage object detection. Discriminative methods, by contrast, learn an adaptive filter that captures the tracked object by maximizing the classification response gap between the tracked object and the background. The correlation-filter-based tracking method proposed by Henriques et al. and the classifier-based tracking method proposed by Hyeonseob Nam are two typical current discriminative methods. Compared with generative methods, discriminative methods can better distinguish the tracked object from the background by updating the filter online. At the same time, the complex online updating mechanisms of discriminative methods are often hard to integrate into an end-to-end training framework. Recently, DiMP, proposed by Bhat et al., introduced a meta-learning framework into discriminative tracking and designed a target model predictor for online training, greatly improving tracking performance.
Bounding box regression is the other important subtask of visual object tracking. Earlier tracking methods, exemplified by DCF and SiamFC, estimated the bounding box of the object by exhaustively testing multiple scales. RPN-based tracking methods, exemplified by SiamRPN, follow the single-stage object detection idea and use a series of predefined anchor boxes to determine the object size and regress the bounding box. The ATOM and DiMP tracking methods iteratively adjust several manually generated initial candidate boxes with an existing IoU-Net model and finally select the best box. For the bounding box regression subtask, existing tracking methods thus usually depend on pre-designed bounding-box generation rules and cannot be trained or updated online.
Inspired by the success of discriminative methods in introducing online training into the object classification subtask, the present invention introduces an online training mechanism into the bounding box regression subtask, so as to better match existing discriminative methods and to learn object information online during tracking, thereby predicting the object's bounding box more stably and accurately. The FCOT (Fully Convolutional Online Tracking) work provides an accurate and effective online-training tracking framework that directly handles the two different subtasks of object classification and bounding box regression. The designed Regression Model Generator (RMG) module successfully introduces online updating into bounding box regression and keeps box prediction accurate even when the appearance of the object changes.
Disclosure of Invention
The problem the invention aims to solve is the following. Target tracking can generally be divided into a classification task, which distinguishes the target from the background to locate it roughly, and a regression task, which generates a precise target bounding box. For the classification task, online training can be used to maximize the response gap between the target and similar objects in the background, so as to learn an online-adjustable filter. For the regression branch, however, existing methods generally perform relatively complex regression through a set of preset boxes, which on the one hand introduces many additional parameters, and on the other hand can only use one fixed target template to guide regression, so that shape changes such as deformation or rotation of the object in subsequent frames cannot be handled well. The main problem to be solved by the present invention is therefore: how to design a simple target regression branch that avoids complex additional parameters while allowing the templates of both the target classification and target regression tasks to be updated online.
The technical scheme of the invention is as follows. In a single-target tracking method based on online training of a fully convolutional network, the initial network parameters are trained first; then, during tracking, some of the already-tracked video frames are selected as training samples for online training that updates the network parameters and improves tracking accuracy. The tracking method comprises a training-sample generation stage, a network configuration stage, an offline training stage and an online tracking stage:
1) the training-sample generation stage, which generates the training samples used in offline training: first, target-region jittering is applied to every frame of every video in the offline training dataset, and the jittered target search region is cropped out; three frames are sampled from the first half of each video's frame sequence as training frames, and one frame is sampled from the second half as a test frame; the test frame annotated with its target box serves as a verification frame; for each verification frame, a Gaussian label map centred on the centre of the target box is generated as the classification label of the verification set, and the distances from the centre of the target box to its four borders are recorded as the labels of the regression branch during offline training;
2) the network configuration stage: the classification feature maps and regression feature maps of the test frame and the training frames are extracted; an adaptable convolution kernel f_cls of the classification branch is generated from the classification feature maps of the training frames, and an adaptable convolution kernel f_reg of the regression branch is generated from the regression feature maps of the training frames; f_cls is applied to the classification feature map of the test frame as the kernel of the classification convolution, producing after the convolution operation the final classification confidence map M_cls; f_reg is applied to the regression feature map of the test frame as the kernel of the regression convolution, producing the final regression map M_reg of centre-to-border distances, which represents the distances from the centre point of the object to its four borders; the point with the highest score is found in the confidence map M_cls, and the four offset distances at that point are read from M_reg, which yields the final target box on the test frame;
3) the offline training stage: for offline training of the classification branch, the hinge-like loss LBHinge proposed by DiMP is used as the loss function; for the regression branch, the IoU loss function is used. Combined with the labels obtained from the verification frames, the parameters of the whole network are updated by back-propagation using the SGD optimizer, and step 2) is repeated until the number of iterations is reached;
4) the online tracking stage: first, the target search region in the first frame of the video to be tracked is cropped out as a template, and the template frame is expanded into an online training set of 30 images used as the training frames F_train; each frame to be tracked is taken as the test frame F_test and fed, together with F_train, into the network of step 2) to obtain the target box on F_test. During tracking, from every 25 frames of the tracked sequence, the frame with the highest classification score, together with its tracked target box as label, is added to the online training set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
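The update schedule of step 4) can be illustrated with a short sketch. Everything here is a hypothetical stand-in for the network described above (`track_fn` plays the role of the full forward pass); only the bookkeeping — a 30-frame online set refreshed every 25 frames with the highest-scoring tracked frame — follows the text:

```python
from collections import deque

TRAIN_SET_SIZE = 30   # online training set holds 30 frames (per step 4)
UPDATE_INTERVAL = 25  # add one new sample every 25 tracked frames

def run_online_tracking(frames, template, track_fn):
    """track_fn(frame) -> (score, box): stand-in for the tracker's forward pass."""
    train_set = deque(maxlen=TRAIN_SET_SIZE)
    train_set.append(template)          # first frame seeds the online set
    best = None                         # (score, frame, box) in current interval
    boxes = []
    for i, frame in enumerate(frames):
        score, box = track_fn(frame)
        boxes.append(box)
        if best is None or score > best[0]:
            best = (score, frame, box)
        if (i + 1) % UPDATE_INTERVAL == 0:
            # highest-scoring frame of the interval joins the online set,
            # with its tracked box serving as the label
            train_set.append((best[1], best[2]))
            best = None
            # ...here f_cls and f_reg would be re-optimized on train_set
    return boxes, train_set
```

The `deque` with `maxlen` evicts the oldest sample automatically once the 30-frame budget is reached.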
Compared with the prior art, the invention has the following advantages.
The invention provides a fully convolutional online tracking method (FCOT). The method adopts a simple tracking framework that performs object classification and bounding box regression directly, improving tracking accuracy while preserving tracking efficiency.
The invention designs a Regression Model Generator (RMG), which introduces an online training mechanism into bounding box regression and brings object classification and bounding box regression into a unified framework for online training and tracking. Compared with the hand-designed rules that existing methods rely on, online-trained bounding box regression adapts better to object deformation during tracking.
The invention achieves good accuracy on the visual object tracking task and alleviates the box misalignment caused by interference from similar background content during bounding box regression. Compared with existing methods, the proposed FCOT tracker attains good tracking success rate and localization accuracy on several visual tracking benchmark datasets.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
Fig. 2 is a schematic diagram of a frame extraction process of a video.
Fig. 3 is a schematic diagram of a multivariate information fusion module provided by the present invention.
FIG. 4 is a schematic diagram of multi-scale inter-frame difference feature extraction and fusion proposed by the present invention.
Fig. 5 is a schematic diagram of a probability map solving process proposed by the present invention.
Fig. 6 is a schematic diagram of a single-frame feature sequence feature extraction process.
Detailed Description
The invention provides a single-target tracking method based on online training of a fully convolutional network, which first requires offline training and then updates the parameters of some modules through online training during tracking. Offline training is carried out on the four training sets TrackingNet-Train, LaSOT-Train, COCO-Train and GOT-10k-Train; tests on the six test sets OTB100, NFS, UAV123, GOT-10k-Test, LaSOT-Test and TrackingNet-Test achieve high accuracy and tracking success rate. The method is implemented in the Python 3 programming language with the PyTorch 1.1 deep learning framework.
FIG. 1 is the system framework diagram used by the present invention: through a designed, completely end-to-end trained fully convolutional network, target classification and target regression templates are generated to guide the classification and regression tasks, and these templates are updated online to carry out target tracking. The whole method comprises a training-sample generation stage, a network configuration stage, an offline training stage and an online tracking stage; the specific implementation steps are as follows:
1) The data preparation stage, i.e. the training-sample generation stage. Training samples are generated during offline training: first, target-region jittering is applied to every frame of every video in the offline training dataset, and the jittered target search region is cropped out; three frames are sampled from the first half of each video's frame sequence as training frames, and one frame from the second half as a test frame; the test frame annotated with its target box serves as a verification frame; for each verification frame, a Gaussian label map centred on the centre of the target box is generated as the classification label of the verification set, and the distances from the centre of the target box to its four borders are recorded as the labels of the regression branch during offline training.
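The two labels described above — a Gaussian map centred on the target, and the four centre-to-border distances — can be generated as follows (a sketch; the map size and Gaussian width are illustrative values, not taken from the patent):

```python
import numpy as np

def make_labels(cx, cy, x1, y1, x2, y2, size=72, sigma=4.0):
    """Gaussian classification label centred on the target centre (cx, cy),
    plus the regression label: distances from the centre to the four borders
    of the target box (x1, y1, x2, y2). `size` and `sigma` are illustrative."""
    ys, xs = np.mgrid[0:size, 0:size]
    gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # distances from the centre to the left / top / right / bottom borders
    offsets = np.array([cx - x1, cy - y1, x2 - cx, y2 - cy], dtype=float)
    return gauss, offsets
```

The Gaussian map peaks at the target centre, which is exactly what the classification branch is trained to reproduce.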
2) The configuration phase of the model, i.e. the network configuration phase, is as follows.
2.1) Extracting the encoding features of the test frame: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder, and features are extracted from the test frame F_test ∈ R^(B×3×288×288), yielding F_test^e2, F_test^e3 and F_test^e4, where the superscript e2 denotes the feature extracted by encoding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript test denotes the test frame, and B denotes the batch size. The encoder comprises convolution layers and pooling layers; the additional convolution layers use two kernel sizes, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the kernel of each convolution layer is initialized randomly;
2.2) Extracting the decoding features of the test frame: a convolution layer with a 1 × 1 kernel, 1024 input channels and 256 output channels is applied to F_test^e4 obtained in step 2.1), giving the first-layer decoding feature F_test^d1 of size 256 × 18 × 18. F_test^e3 obtained in step 2.1) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels, so that its channel number becomes 256; a bilinear interpolation layer then upsamples the first-layer decoding feature by a factor of 2; the two feature maps are added and passed through a further 1 × 1 convolution, giving the second-layer decoding feature F_test^d2 of size 256 × 36 × 36. F_test^e2 obtained in step 2.1) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels; a bilinear interpolation layer upsamples the second-layer decoding feature by a factor of 2; the two feature maps are added and passed through a further 1 × 1 convolution, giving the third-layer decoding feature F_test^d3 of size 256 × 72 × 72.
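The top-down decoder of step 2.2) can be sketched in a few lines. The sketch below is a minimal illustration, not the patented implementation: it uses nearest-neighbour upsampling in place of the bilinear interpolation layer, treats each 1 × 1 convolution as per-pixel channel mixing, and omits biases; the weight shapes follow the channel counts stated above.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution = per-pixel channel mixing; w has shape (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def upsample2x(x):
    """2x upsampling (nearest-neighbour for brevity; the patent uses bilinear)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decode(e4, e3, e2, w4, w3, w2, w_out3, w_out2):
    """Top-down decoder: lateral 1x1 convs reduce encoder features to 256
    channels, each level is added to the 2x-upsampled coarser level, then
    mixed by another 1x1 conv."""
    d1 = conv1x1(e4, w4)                                     # 256 x 18 x 18
    d2 = conv1x1(conv1x1(e3, w3) + upsample2x(d1), w_out3)   # 256 x 36 x 36
    d3 = conv1x1(conv1x1(e2, w2) + upsample2x(d2), w_out2)   # 256 x 72 x 72
    return d1, d2, d3
```

The spatial sizes double at each level (18 → 36 → 72), matching the decoding-feature sizes given in the text.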
2.3) Extracting the classification features of the test frame: the decoding feature F_test^d3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 kernels, stride 1, 256 input channels and 256 output channels, followed by a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting feature then passes through two deformable convolutions (Deformable Convolution) with 3 × 3 kernels, stride 1, and 256 input and output channels, with a Group Normalization layer (32 groups) and a ReLU activation layer between the two, yielding the classification feature map F_test^cls of the test frame, of size 256 × 72 × 72;
2.4) Extracting the regression features of the test frame: the decoding feature F_test^d3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 kernels, stride 1, 256 input channels and 256 output channels, then a Group Normalization layer with 32 groups, then a ReLU activation layer. The resulting feature passes through two deformable convolutions (Deformable Convolution) with 3 × 3 kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer (32 groups) and a ReLU activation layer, and another ReLU activation layer is added after the second. At the same time, the decoding feature F_test^d2 obtained in step 2.2) is passed through the same network and the resulting feature is upsampled by a factor of 2. The two feature maps obtained above are concatenated and fed into a 1 × 1 convolution layer with 512 input channels and 256 output channels, yielding the regression feature map F_test^reg of the test frame, of size 256 × 72 × 72;
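Both heads in steps 2.3) and 2.4) interleave convolutions with Group Normalization using 32 groups. A minimal numpy sketch of that normalization, omitting the learned per-channel scale and shift of the real layer:

```python
import numpy as np

def group_norm(x, groups=32, eps=1e-5):
    """Group Normalization over a (C, H, W) feature map: channels are split
    into `groups` groups and each group is normalised to zero mean and unit
    variance over its channels and spatial positions."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)
```

With 256 channels and 32 groups, each group covers 8 channels; unlike batch normalization, the statistics are independent of the batch, which suits online training with very few samples.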
2.5) Extracting the encoding features of the training frames: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder, and features are extracted from the training frames F_train ∈ R^(B×3×3×288×288), yielding F_train^e2, F_train^e3 and F_train^e4, where the superscript e2 denotes the feature extracted by encoding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript train denotes the training frames, and B denotes the batch size. The encoder comprises convolution layers and pooling layers; the additional convolution layers use two kernel sizes, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the kernel of each convolution layer is initialized randomly;
2.6) Extracting the decoding features of the training frames: a convolution layer with a 1 × 1 kernel, 1024 input channels and 256 output channels is applied to F_train^e4 obtained in step 2.5), giving the first-layer decoding feature F_train^d1 of size 256 × 18 × 18. F_train^e3 obtained in step 2.5) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels, so that its channel number becomes 256; a bilinear interpolation layer then upsamples the first-layer decoding feature by a factor of 2; the two feature maps are added and passed through a further 1 × 1 convolution, giving the second-layer decoding feature F_train^d2 of size 256 × 36 × 36. F_train^e2 obtained in step 2.5) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels; a bilinear interpolation layer upsamples the second-layer decoding feature by a factor of 2; the two feature maps are added and passed through a further 1 × 1 convolution, giving the third-layer decoding feature F_train^d3 of size 256 × 72 × 72.
2.7) Extracting the classification features of the training frames: the decoding feature F_train^d3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 kernels, stride 1, 256 input channels and 256 output channels, then a Group Normalization layer with 32 groups, then a ReLU activation layer. The resulting feature passes through two deformable convolutions (Deformable Convolution) with 3 × 3 kernels, stride 1, and 256 input and output channels, the first followed by a Group Normalization layer (32 groups) and a ReLU activation layer, yielding the classification feature map F_train^cls of the training frames, of size 256 × 72 × 72;
2.8) Extracting the regression features of the training frames: the decoding feature F_train^d3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 kernels, stride 1, 256 input channels and 256 output channels, then a Group Normalization layer with 32 groups, then a ReLU activation layer. The resulting feature passes through two deformable convolutions (Deformable Convolution) with 3 × 3 kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer (32 groups) and a ReLU activation layer, and another ReLU activation layer is added after the second. At the same time, the decoding feature F_train^d2 obtained in step 2.6) is passed through the same network and the resulting feature is upsampled by a factor of 2. The two feature maps are concatenated and fed into a 1 × 1 convolution layer with 512 input channels and 256 output channels, yielding the regression feature map F_train^reg of the training frames, of size 256 × 72 × 72;
2.9) Generating the adaptable convolution kernel of the classification branch: the classification feature map F_train^cls obtained in step 2.7) is first fed into a convolution layer and a region-of-interest pooling layer (ROI Pooling); the convolution layer has 3 × 3 kernels, stride 1, and 256 input and output channels, and the ROI Pooling layer has output size 4 × 4 and stride 16, giving an initial adaptable convolution kernel f_cls^0 of size 256 × 4 × 4. f_cls^0 is then optimized with the Gauss-Newton method to obtain the final adaptable convolution kernel f_cls of the classification branch;
2.10) Generating the adaptable convolution kernel of the regression branch: the regression feature map F_train^reg obtained in step 2.8) is first fed into a region-of-interest pooling layer (ROI Pooling) with output size 3 × 3 and stride 16, giving an initial adaptable convolution kernel f_reg^0 of size 256 × 3 × 3. The initial kernel f_reg^0 is then optimized with the Gauss-Newton method to obtain the final adaptable convolution kernel f_reg of the regression branch;
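Steps 2.9) and 2.10) refine the ROI-pooled initial kernels with the Gauss-Newton method. As an illustration only (the patent does not spell out this exact formulation), consider a purely linear correlation score s = A f with a squared loss: the residual r(f) = A f − y has Jacobian A, so one damped Gauss-Newton step has the closed form below, with the kernel treated as a flat vector:

```python
import numpy as np

def gauss_newton_step(A, f, y, damping=1e-2):
    """One damped Gauss-Newton step for the residual r(f) = A f - y.
    The Jacobian of r is A, so the step solves
    (A^T A + damping * I) delta = A^T r."""
    r = A @ f - y
    lhs = A.T @ A + damping * np.eye(A.shape[1])
    delta = np.linalg.solve(lhs, A.T @ r)
    return f - delta

def refine_kernel(A, f0, y, steps=5):
    """Iteratively refine an initial (ROI-pooled) kernel f0 so that the
    correlation scores A f match the label y."""
    f = f0
    for _ in range(steps):
        f = gauss_newton_step(A, f, y)
    return f
```

For a linear residual this converges in very few steps, which is why Gauss-Newton-style updates are attractive for the online setting where only a handful of iterations are affordable.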
2.11) Obtaining the classification confidence map: the adaptable convolution kernel f_cls obtained in step 2.9) is used as the kernel of the classification convolution and applied to the classification feature map F_test^cls of the test frame obtained in step 2.3); the convolution operation produces the final classification confidence map M_cls of size 72 × 72, where a point with a higher score indicates higher confidence that this point is the centre of the target;
The classification confidence map M_cls is calculated as follows: denote the final classification convolution by Conv_1, whose kernel is the adaptable convolution kernel f_cls obtained in step 2.9); then M_cls = Conv_1(F_test^cls).
2.12) Obtaining the regression offset distances: the adaptable convolution kernel f_reg obtained in step 2.10) is used as the kernel of the regression convolution and applied to the regression feature map F_test^reg of the test frame obtained in step 2.4); the convolution operation produces the final regression map M_reg of centre-to-border distances, of size 4 × 72 × 72, representing the distances from the centre point of the target to the four borders of the object. Denote the final regression convolution by Conv_2, whose kernel is the adaptable convolution kernel f_reg obtained in step 2.10); then M_reg = Conv_2(F_test^reg).
The point with the highest score is found in the confidence map M_cls obtained in step 2.11), and the four offset distances at that point are read from M_reg, giving the final target box.
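Reading off the final box from M_cls and M_reg thus reduces to an argmax and an offset lookup. In the sketch below, the stride of 4 is inferred from the 288 × 288 search region and the 72 × 72 map size stated above, and the offsets are assumed to be in search-region pixels:

```python
import numpy as np

STRIDE = 288 // 72  # 4: ratio of search-region size to score-map size (inferred)

def decode_box(m_cls, m_reg):
    """m_cls: (72, 72) classification confidence map.
    m_reg: (4, 72, 72) distances from the centre to the left/top/right/bottom
    borders, in search-region pixels.
    Returns (x1, y1, x2, y2) in search-region coordinates."""
    y, x = np.unravel_index(np.argmax(m_cls), m_cls.shape)   # highest-scoring point
    l, t, r, b = m_reg[:, y, x]                              # four offset distances
    cx, cy = x * STRIDE, y * STRIDE                          # map cell -> pixels
    return (cx - l, cy - t, cx + r, cy + b)
```

Because classification and regression share the same map resolution, no anchor boxes or candidate proposals are needed at this stage.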
The network configuration phase is described in detail below through one embodiment. The first four blocks of ResNet-50 are used as the encoder, with parameters loaded from an ImageNet pre-trained model; the training frame and the test frame are then decoded separately. Specifically, the Block-4 features first pass through a convolutional layer with a 1 × 1 kernel to give the first layer of decoding features, with feature-map size 256 × 18 × 18. The Block-3 features pass through a 1 × 1 convolutional layer with 512 input channels and 256 output channels, reducing the channel count to 256; a bilinear interpolation layer then upsamples the first-layer decoding features by a factor of 2, the two feature maps are added, and a 1 × 1 convolution is applied, giving the second layer of decoding features with size 256 × 36 × 36. The second-layer features are upsampled by a factor of 2 in the same way and fused to obtain the third layer of decoding features, with size 256 × 72 × 72.
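A shape-level NumPy sketch of one decoder step described above (1 × 1 lateral convolution, 2× bilinear upsampling, element-wise addition, 1 × 1 fusion convolution). The random weights stand in for learned parameters and only the tensor shapes are meaningful here.

```python
import numpy as np

def upsample2x(x):
    """2x bilinear upsampling of a (C, H, W) map (align_corners-style)."""
    C, H, W = x.shape
    ys = np.linspace(0, H - 1, 2 * H)
    xs = np.linspace(0, W - 1, 2 * W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1); wy = ys - y0
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1); wx = xs - x0
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy)[None, :, None] + bot * wy[None, :, None]

def conv1x1(x, w):
    """A 1x1 convolution is just channel mixing: w has shape (C_out, C_in)."""
    C, H, W = x.shape
    return (w @ x.reshape(C, -1)).reshape(w.shape[0], H, W)

rng = np.random.default_rng(0)
deep = rng.standard_normal((256, 18, 18))   # first-layer decoding features
skip = rng.standard_normal((512, 36, 36))   # Block-3 encoder features
lateral = conv1x1(skip, 0.01 * rng.standard_normal((256, 512)))
fused = conv1x1(lateral + upsample2x(deep), 0.01 * rng.standard_normal((256, 256)))
```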
Next, classification and regression features are extracted from the training frames and the test frame respectively. As shown in Fig. 2, the features obtained from the decoder are input into a convolutional layer and then pass through two deformable convolution operations, with a group normalization layer and a ReLU activation layer between the convolutions; this yields the classification features of the training frame, with feature-map size 256 × 72 × 72. As shown in Figs. 3, 4 and 5, the decoder features of the training frame and the test frame likewise pass through the convolutional layers, deformable convolutional layers, group normalization layers and activation layers shown in the figures. The regression feature map of the training frame has size 1024 × 72 × 72, and the classification and regression feature maps of the test frame both have size 256 × 72 × 72.
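The group normalization used between those convolutions can be sketched for a single (C, H, W) map as below (32 groups, as in the text; the learnable per-channel scale and shift are omitted for brevity).

```python
import numpy as np

def group_norm(x, groups=32, eps=1e-5):
    """Group Normalization over a (C, H, W) map: zero mean and unit variance
    computed within each group of C/groups channels."""
    C, H, W = x.shape
    g = x.reshape(groups, (C // groups) * H * W)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(C, H, W)

feat = np.random.rand(256, 72, 72)   # stand-in feature map
out = group_norm(feat)
```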
The classification and regression optimizable models are generated from the training frames. As shown in Fig. 6, the classification and regression features of the training frames are input into an initial optimizable-model generator, which produces initial coarse models (in fact convolution kernels) of size 256 × 4 × 4 for classification and 4 × 256 × 3 × 3 for regression. The initial models are then optimized by the steepest descent method to obtain the final classification and regression models used on the test frame, namely the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
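The steepest-descent refinement can be illustrated on a least-squares toy problem with the exact line-search step length. This is a stand-in: the actual tracker optimizes a discriminative tracking loss over the filter, and the dimensions below are arbitrary.

```python
import numpy as np

def steepest_descent(A, y, f0, steps=100):
    """Refine a filter f by steepest descent on the least-squares loss ||A f - y||^2,
    using the exact step length a = (g.g) / (Ag.Ag) along the negative gradient."""
    f = f0.copy()
    for _ in range(steps):
        r = A @ f - y                     # residual
        g = A.T @ r                       # gradient direction (factor 2 absorbed in a)
        Ag = A @ g
        a = (g @ g) / (Ag @ Ag + 1e-12)   # optimal step length for this direction
        f = f - a * g
    return f

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 32))        # stand-in "training-frame features"
f_true = rng.standard_normal(32)
y = A @ f_true                            # stand-in "label"
f = steepest_descent(A, y, np.zeros(32))
```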
Finally, convolution operations are executed on the classification and regression features of the test frame, using the optimized classification and regression models obtained above as kernels, yielding a 72 × 72 classification score map and a 4 × 72 × 72 regression offset-distance map. The point with the highest score in the classification score map is taken as the center point of the target, and the four values at that point in the regression offset-distance map give the distances from the center point to the four boundaries of the target, thereby producing the final target frame.
3) Off-line training stage: the hinge-like loss LBHinge proposed by DiMP is used as the loss function for off-line training of the classification branch, and the IoU loss is used for the regression branch. An SGD optimizer is used, the batch size is set to 40, the total number of training rounds is 100, the learning rate is divided by 10 at the 25th and 45th rounds with the attenuation rate set to 0.2, and training is performed on eight RTX 2080Ti GPUs. The parameters of the whole network are updated through the back-propagation algorithm, and steps 2.1) to 2.12) are repeated until the number of iterations is reached.
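The two losses named above can be sketched as follows: a DiMP-style one-sided hinge for classification and 1 − IoU for regression. The 0.05 hinge threshold is an assumed value, not taken from the patent.

```python
import numpy as np

def lbhinge(scores, labels, threshold=0.05):
    """DiMP-style hinge loss: squared error near the target, one-sided hinge on
    background so confidently negative scores incur no penalty.
    The 0.05 threshold is an assumed value."""
    pos = labels > threshold
    err = np.where(pos, scores - labels, np.maximum(scores, 0.0))
    return float((err ** 2).mean())

def iou_loss(pred, gt):
    """1 - IoU for axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix1, iy1 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - inter / (area(pred) + area(gt) - inter)
```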
4) On-line tracking stage: a sequence of training frames and a test frame are input into the network, and the final target frame is obtained on the basis of the initial parameters.
First, the target-frame search area is cropped from the first frame image of the video to be tracked as a template; the template frame is then expanded into an on-line training data set containing 30 frames of images, used as the training frames F_train. Each frame to be tracked in the video is taken as the test frame F_test and input into the network of step 2) together with F_train to obtain the target frame on F_test. During tracking, for every 25 frames in the tracked frame sequence, the frame with the highest classification score together with its tracked target frame is added as a label to the on-line training data set, which is used to update the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
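The 25-frame on-line update rule can be sketched as a small bookkeeping class. The eviction policy when the 30-sample set is full is an assumption (oldest sample dropped); the patent only specifies what is added.

```python
from collections import deque

class OnlineTrainingSet:
    """Keep the 30-frame on-line set: seeded from the first frame, and every
    25th tracked frame the highest-scoring (frame, box) pair since the last
    update is appended; the oldest sample is evicted when the set is full."""
    def __init__(self, first_frame_samples, capacity=30, interval=25):
        self.samples = deque(first_frame_samples, maxlen=capacity)
        self.interval = interval
        self.buffer = []          # (score, frame, box) seen since the last update

    def observe(self, frame_idx, frame, box, score):
        self.buffer.append((score, frame, box))
        if (frame_idx + 1) % self.interval == 0:
            best = max(self.buffer, key=lambda t: t[0])
            self.samples.append((best[1], best[2]))   # pseudo-labelled sample
            self.buffer.clear()
            return True           # caller re-optimizes f_cls / f_reg now
        return False

s = OnlineTrainingSet([("f0", (0, 0, 1, 1))] * 30)
flags = [s.observe(i, f"frame{i}", (0, 0, 1, 1), score=i % 7) for i in range(25)]
```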
In the invention, on-line training means that the convolution kernels of the two branches, classification and regression, can be trained on-line to update their parameters and adapt to the current tracking video, while the initial parameters of the full convolution network are still obtained by off-line training.
For the first frame of each video in the test set, the same operation as for the training set is performed: an area 5 times the size of the target is cropped and scaled to 288 × 288, and the first frame is augmented by rotation, translation, added noise, etc. to obtain 30 training samples for on-line training. During tracking, on-line training is carried out every 25 frames, and the frame with the highest classification score is added to the on-line training set. On the OTB100, NFS, UAV123, GOT-10k, LaSOT and TrackingNet test data sets, the tracking speed is 53 fps. In tracking accuracy: on OTB100, AUC reaches 69.2% and precision (Pre) reaches 90.6%; on NFS, AUC reaches 62.2% and Pre 74.5%; on UAV123, AUC reaches 65.7% and Pre 87.6%; on GOT-10k, AO reaches 62.7%, SR_0.75 reaches 51.7% and SR_0.5 reaches 75.3%; on TrackingNet, success (Suc) reaches 74.5% and Pre 71.4%; on LaSOT, Suc reaches 69.2% and Pre 90.6%. The indices on all six test data sets exceed those of DiMP.
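The first-frame augmentation that builds the 30-sample on-line training set can be sketched as follows. Flips, small integer translations and additive Gaussian noise stand in for the rotation/translation/noise augmentation described; single-channel patches are used for brevity, and all magnitudes are illustrative assumptions.

```python
import numpy as np

def augment_first_frame(patch, n=30, rng=None):
    """Expand one 288x288 first-frame crop into n training samples using
    horizontal flips, small translations and additive Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    out = [patch]
    while len(out) < n:
        aug = patch.copy()
        if rng.random() < 0.5:
            aug = aug[:, ::-1]                        # horizontal flip
        dy, dx = rng.integers(-8, 9, size=2)          # small translation
        aug = np.roll(aug, (dy, dx), axis=(0, 1))
        aug = aug + rng.normal(0.0, 0.02, aug.shape)  # additive noise
        out.append(aug)
    return np.stack(out)

crop = np.zeros((288, 288), dtype=np.float32)   # stand-in first-frame crop
samples = augment_first_frame(crop)
```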
Claims (2)
1. A single-target tracking method based on on-line training of a full convolution network, characterized in that initial network parameters are first trained, and then, during tracking, part of the tracked video frames are selected as training samples to train and update the network parameters on-line, improving tracking accuracy; the tracking method comprises a training-sample generation stage, a network configuration stage, an off-line training stage and an on-line tracking stage:
1) Training-sample generation stage: training samples are generated for the off-line training process. Target-area jittering is first applied to each frame image of each video in the off-line training data set, and the jittered target search area is cropped out. Three frames are extracted from the first half of each video frame sequence as training frames, and one frame is extracted from the second half as a test frame; the target frame marked on the test frame serves as the verification frame. For each verification frame, a Gaussian label map centered on the center of the target frame is generated as the label of the verification set, and the distances from the center of the target frame to its four boundaries are recorded as the labels of the regression branch in the off-line training process;
2) Network configuration stage: the classification feature maps and regression feature maps of the test frame and the training frames are extracted; the adaptable convolution kernel f_cls of the classification branch is generated from the classification feature map of the training frames, and the adaptable convolution kernel f_reg of the regression branch is generated from the regression feature map of the training frames. The adaptable convolution kernel f_cls is applied, as the kernel of the classification convolution, to the classification feature map of the test frame, and after the convolution operation the final classification score confidence map M_cls is generated. The adaptable convolution kernel f_reg is applied, as the kernel of the regression-branch convolution, to the regression feature map of the test frame to generate the final regression map M_reg of the distances from the center point to the target boundaries, representing the distances from the target's center point to its four boundaries. According to the confidence map M_cls, the point with the highest score is found, and the four offset distances corresponding to that point are then read from M_reg, so that the final target frame is output on the test frame;
3) Off-line training stage: for off-line training of the classification branch, the hinge-like loss LBHinge proposed by DiMP is used as the loss function; for the regression branch, the IoU loss is used, combined with the labels obtained from the verification frame. An SGD optimizer is used, the parameters of the whole network are updated through the back-propagation algorithm, and step 2) is repeated until the number of iterations is reached;
4) On-line tracking stage: the target-frame search area is cropped from the first frame image of the video to be tracked as a template; the template frame is then expanded into an on-line training data set containing 30 frames of images, used as the training frames F_train. Each frame to be tracked in the video is taken as the test frame F_test and input into the network of step 2) together with F_train to obtain the target frame on F_test, realizing tracking. During tracking, for every 25 frames in the tracked frame sequence, the frame with the highest classification score together with its tracked target frame is added as a label to the on-line training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
2. The single-target tracking method based on on-line training of a full convolution network according to claim 1, characterized in that the network configuration stage specifically comprises:
2.1) Extracting the coding features of the test frame: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the test frame F_test ∈ R^(B×3×288×288), yielding X_test^e2, X_test^e3 and X_test^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript test denotes the test frame, and B denotes the batch size. The encoder comprises convolutional layers and pooling layers; the convolutional layers use two convolution kernels of sizes 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the kernel of each convolutional layer is initialized by random initialization;
2.2) Extracting the decoding features of the test frame: a convolutional layer with a 1 × 1 kernel (1024 input channels, 256 output channels) is applied to X_test^e4 obtained in step 2.1) to obtain the first-layer decoding features D_test^1, with feature-map size 256 × 18 × 18. X_test^e3 obtained in step 2.1) is passed through a 1 × 1 convolutional layer (512 input channels, 256 output channels) so that the channel number becomes 256; a bilinear interpolation layer then upsamples the first-layer decoding features D_test^1 by a factor of 2, the result is added to the transformed X_test^e3, and a convolution with a 1 × 1 kernel is applied, yielding the second-layer decoding features D_test^2 with size 256 × 36 × 36. X_test^e2 obtained in step 2.1) is passed through a 1 × 1 convolutional layer (256 input channels, 256 output channels); a bilinear interpolation layer upsamples the second-layer decoding features D_test^2 by a factor of 2, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, yielding the third-layer decoding features D_test^3 with size 256 × 72 × 72;
2.3) Extracting the classification features of the test frame: the decoding features D_test^3 obtained in step 2.2) pass through a convolutional layer with kernel size 3, step length 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups, and then through a ReLU activation layer. The resulting features pass through two deformable convolutions with kernel size 3, step length 1, and 256 input and output channels each; a Group Normalization layer (group size 32) and a ReLU activation layer are also placed between the two deformable convolutions. This yields the classification feature map X_test^cls of the test frame, with feature-map size 256 × 72 × 72;
2.4) Extracting the regression features of the test frame: the decoding features D_test^3 obtained in step 2.2) pass through a convolutional layer with kernel size 3, step length 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups, and then through a ReLU activation layer. The resulting features pass through two deformable convolutions with kernel size 3, step length 1, and 256 input and output channels each; a Group Normalization layer (group size 32) and a ReLU activation layer are placed between the two deformable convolutions, and another ReLU activation layer follows the second deformable convolution. At the same time, the decoding features D_test^2 obtained in step 2.2) are passed through the same network, and the resulting features are upsampled by a factor of 2. The two feature maps are concatenated and input into a 1 × 1 convolutional layer with 512 input channels and 256 output channels, yielding the regression feature map X_test^reg of the test frame, with feature-map size 256 × 72 × 72;
2.5) Extracting the coding features of the training frames: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the training frames F_train ∈ R^(B×3×3×288×288), yielding X_train^e2, X_train^e3 and X_train^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript train denotes the training frames, and B denotes the batch size. The encoder comprises convolutional layers and pooling layers; the convolutional layers use two convolution kernels of sizes 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the kernel of each convolutional layer is initialized by random initialization;
2.6) Extracting the decoding features of the training frames: a convolutional layer with a 1 × 1 kernel (1024 input channels, 256 output channels) is applied to X_train^e4 obtained in step 2.5) to obtain the first-layer decoding features D_train^1, with feature-map size 256 × 18 × 18. X_train^e3 obtained in step 2.5) is passed through a 1 × 1 convolutional layer (512 input channels, 256 output channels) so that the channel number becomes 256; a bilinear interpolation layer then upsamples the first-layer decoding features by a factor of 2, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, yielding the second-layer decoding features D_train^2 with size 256 × 36 × 36. X_train^e2 obtained in step 2.5) is passed through a 1 × 1 convolutional layer (256 input channels, 256 output channels); a bilinear interpolation layer upsamples the second-layer decoding features by a factor of 2, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, yielding the third-layer decoding features D_train^3 with size 256 × 72 × 72;
2.7) Extracting the classification features of the training frames: the decoding features D_train^3 obtained in step 2.6) pass through a convolutional layer with kernel size 3, step length 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups, and then through a ReLU activation layer. The resulting features pass through two deformable convolutions with kernel size 3, step length 1, and 256 input and output channels each, with a Group Normalization layer (group size 32) and a ReLU activation layer after the first deformable convolution. This yields the classification feature map X_train^cls of the training frames, with feature-map size 256 × 72 × 72;
2.8) Extracting the regression features of the training frames: the decoding features D_train^3 obtained in step 2.6) pass through a convolutional layer with kernel size 3, step length 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups, and then through a ReLU activation layer. The resulting features pass through two deformable convolutions with kernel size 3, step length 1, and 256 input and output channels each; a Group Normalization layer (group size 32) and a ReLU activation layer follow the first deformable convolution, and another ReLU activation layer follows the second deformable convolution. At the same time, the decoding features D_train^2 obtained in step 2.6) are passed through the same network, and the resulting features are upsampled by a factor of 2. The two feature maps are concatenated and input into a 1 × 1 convolutional layer with 512 input channels and 256 output channels, yielding the regression feature map X_train^reg of the training frames, with feature-map size 256 × 72 × 72;
2.9) Generating the adaptable convolution kernel of the classification branch: the classification feature map X_train^cls obtained in step 2.7) is first input into a convolutional layer and a region-of-interest pooling layer (ROI Pooling); the convolutional layer has kernel size 3 × 3, step length 1, and 256 input and output channels, and the ROI Pooling layer has size 4 × 4 and step length 16, producing an initial adaptable convolution kernel of size 256 × 4 × 4. This initial kernel is then optimized with the Gauss-Newton method to obtain the final adaptable convolution kernel f_cls of the classification branch;
2.10) Generating the adaptable convolution kernel of the regression branch: the regression feature map X_train^reg obtained in step 2.8) is first input into a region-of-interest pooling layer (ROI Pooling) with size 3 × 3 and step length 16 to obtain an initial adaptable convolution kernel of size 4 × 256 × 3 × 3. This initial kernel is then optimized with the Gauss-Newton method to obtain the final adaptable convolution kernel f_reg of the regression branch;
2.11) Obtaining the classification confidence map: the adaptable convolution kernel f_cls obtained in step 2.9) is used as the kernel of the classification convolution and applied to the classification feature map of the test frame obtained in step 2.3); after the convolution operation, the final classification score confidence map M_cls is generated. The confidence map is 72 × 72 in size, and a point with a higher score indicates higher confidence that the point is the center of the target;
The classification confidence map M_cls is computed as follows: denote the final classification convolution by Conv_1, whose convolution kernel is the adaptable convolution kernel f_cls obtained in step 2.9); then M_cls = Conv_1(X_test^cls) = f_cls * X_test^cls, where X_test^cls is the classification feature map of the test frame obtained in step 2.3) and * denotes the convolution operation;
2.12) Obtaining the regression offset distances: the adaptable convolution kernel f_reg obtained in step 2.10) is used as the kernel of the regression-branch convolution and applied to the regression feature map of the test frame obtained in step 2.4); after the convolution operation, the final regression map M_reg of the distances from the center point to the target boundaries is generated. The regression map has size 4 × 72 × 72, its four channels representing the distances from the target's center point to the four boundaries of the object. Denote the final regression convolution by Conv_2, whose kernel is the adaptable convolution kernel f_reg obtained in step 2.10); then M_reg = Conv_2(X_test^reg) = f_reg * X_test^reg, where X_test^reg is the regression feature map of the test frame obtained in step 2.4).
According to the confidence map M_cls obtained in step 2.11), the point with the highest score is found; the four offset distances corresponding to that point are then read from M_reg, giving the final target frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010293393.5A CN113538507B (en) | 2020-04-15 | 2020-04-15 | Single-target tracking method based on full convolution network online training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113538507A true CN113538507A (en) | 2021-10-22 |
CN113538507B CN113538507B (en) | 2023-11-17 |
Family
ID=78088144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010293393.5A Active CN113538507B (en) | 2020-04-15 | 2020-04-15 | Single-target tracking method based on full convolution network online training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113538507B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024012243A1 (en) * | 2022-07-15 | 2024-01-18 | Mediatek Inc. | Unified cross-component model derivation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107945210A (en) * | 2017-11-30 | 2018-04-20 | 天津大学 | Target tracking algorism based on deep learning and environment self-adaption |
WO2018086607A1 (en) * | 2016-11-11 | 2018-05-17 | 纳恩博(北京)科技有限公司 | Target tracking method, electronic device, and storage medium |
CN108171112A (en) * | 2017-12-01 | 2018-06-15 | 西安电子科技大学 | Vehicle identification and tracking based on convolutional neural networks |
US20180285692A1 (en) * | 2017-03-28 | 2018-10-04 | Ulsee Inc. | Target Tracking with Inter-Supervised Convolutional Networks |
CN108960086A (en) * | 2018-06-20 | 2018-12-07 | 电子科技大学 | Based on the multi-pose human body target tracking method for generating confrontation network positive sample enhancing |
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN109978921A (en) * | 2019-04-01 | 2019-07-05 | 南京信息工程大学 | A kind of real-time video target tracking algorithm based on multilayer attention mechanism |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | A kind of visual target tracking method based on adaptive main body sensitivity |
CN110533691A (en) * | 2019-08-15 | 2019-12-03 | 合肥工业大学 | Method for tracking target, equipment and storage medium based on multi-categorizer |
US20200051250A1 (en) * | 2018-08-08 | 2020-02-13 | Beihang University | Target tracking method and device oriented to airborne-based monitoring scenarios |
Non-Patent Citations (8)
Title |
---|
DONG-HYUN LEE: "Fully Convolutional Single-Crop Siamese Networks for Real-Time Visual Object Tracking", Electronics *
LIJUN WANG ET AL.: "Visual Tracking with Fully Convolutional Networks", Proceedings of the IEEE International Conference on Computer Vision (ICCV) *
LUCA BERTINETTO ET AL.: "Fully-Convolutional Siamese Networks for Object Tracking", Computer Vision and Pattern Recognition (cs.CV) *
YANGLIU KUAI ET AL.: "Hyper-Feature Based Tracking with the Fully-Convolutional Siamese Network", 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA) *
YANGLIU KUAI ET AL.: "Learning Fully Convolutional Network for Visual Tracking With Multi-Layer Feature Fusion", IEEE Access *
SHI LULU: "Research on Deep Learning and Its Application in Video Object Tracking", China Master's Theses Full-text Database *
XU DUO: "Research on Object Tracking Algorithms Based on Structured Processing with CNN and RNN", China Master's Theses Full-text Database *
XU ZHENG: "Research on Deep-Learning-Based Object Tracking Methods", China Master's Theses Full-text Database *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||