CN113538507A - Single-target tracking method based on full convolution network online training - Google Patents

Single-target tracking method based on full convolution network online training

Info

Publication number
CN113538507A
CN113538507A (application number CN202010293393.5A)
Authority
CN
China
Prior art keywords
layer
convolution
frame
convolution kernel
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010293393.5A
Other languages
Chinese (zh)
Other versions
CN113538507B (en)
Inventor
Limin Wang (王利民)
Yutao Cui (崔玉涛)
Cheng Jiang (蒋承)
Gangshan Wu (武港山)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010293393.5A priority Critical patent/CN113538507B/en
Publication of CN113538507A publication Critical patent/CN113538507A/en
Application granted granted Critical
Publication of CN113538507B publication Critical patent/CN113538507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method based on online training of a full convolution network, comprising the following stages: 1) a training sample generation stage; 2) a network configuration stage; 3) an offline training stage; 4) an online tracking stage. Through a full convolution network trained completely end to end, the invention generates target classification and target regression templates to guide the classification and regression tasks and updates these templates online, thereby accomplishing the target tracking task. By combining a simple full convolution network structure with online optimization of the classification and regression templates, a single-target tracking method with strong robustness and high accuracy is obtained.

Description

Single-target tracking method based on full convolution network online training
Technical Field
The invention belongs to the technical field of computer software, relates to a single-target tracking technology, and particularly relates to a single-target tracking method based on full convolution network online training.
Background
As a basic task in computer vision, visual object tracking aims to estimate, for an arbitrary object in a video, the spatial position at which it appears in each frame and to mark out its bounding box. In general, current visual object tracking can be divided into two subtasks: object classification and bounding box regression.
According to how the object classification subtask is solved, current tracking methods can be divided into generative methods and discriminative methods. Generative methods are based on the idea of template matching, typically a series of methods that use a Siamese (twin) network to learn the similarity between objects in previous and subsequent frames. The SiamFC method proposed by Bertinetto first introduced the Siamese network into visual object tracking to learn the similarity between the known tracking target and search regions in subsequent video frames. Li et al. further proposed SiamRPN, which introduces a Region Proposal Network (RPN) into the Siamese network and solves the tracking problem in a local region from the perspective of single-stage object detection. Discriminative methods, in contrast, learn an adaptive filter that captures the tracked object by maximizing the classification response gap between the tracked object and the background. The correlation-filter-based tracking method proposed by Henriques and the classifier-based tracking method proposed by Hyeonseob Nam are two typical discriminative methods. Compared with generative methods, discriminative methods distinguish the tracked object from the background better by updating the filter online. However, the complex online update mechanisms used in discriminative methods are often difficult to integrate into an end-to-end training framework. Recently, DiMP, proposed by Bhat, introduced a meta-learning framework into discriminative tracking and designed a target model predictor for online training, greatly improving tracking performance.
Bounding box regression is another important subtask of visual object tracking. Earlier tracking methods, such as DCF and SiamFC, estimate the object bounding box by exhaustively testing multiple scales. RPN-based trackers, exemplified by SiamRPN, follow the single-stage object detection paradigm and use a set of predefined anchor boxes to determine the object size and regress the bounding box. ATOM and DiMP iteratively refine a number of manually generated initial candidate boxes with an existing IoU-Net model and then select the final box. For the bounding box regression subtask, existing tracking methods therefore usually depend on hand-designed box generation rules and cannot be trained or updated online.
Inspired by the success of discriminative methods in introducing online training into the object classification subtask, the present method introduces an online training mechanism into the bounding box regression subtask, so that it cooperates better with existing discriminative methods and learns object information online during tracking, with the goal of predicting the object bounding box more stably and accurately. The FCOT (Fully Convolutional Online Tracking) work provides an accurate and efficient online-training tracking framework that directly handles the two subtasks of object classification and bounding box regression. The designed RMG (Regression Model Generator) module introduces online updating into bounding box regression and keeps box prediction highly accurate when the appearance of the object changes.
Disclosure of Invention
The problem the invention aims to solve is as follows. Target tracking can generally be divided into a classification task, which distinguishes the target from the background to roughly locate it, and a regression task, which generates a precise target bounding box. For the classification task, online training can be adopted to maximize the response gap between the target and similar objects in the background, so as to learn an online-adjustable filter. For the regression branch, however, existing methods generally perform relatively complex regression based on a number of preset boxes, which on the one hand introduces many additional parameters and on the other hand uses only a single fixed target template to guide regression, so that shape changes such as deformation or rotation of the object in subsequent frames cannot be handled well. Therefore, the main problem to be solved by the present invention is: how to design a simple target regression branch that avoids complex additional parameters while allowing the templates of both the target classification and target regression tasks to be updated online.
The technical scheme of the invention is as follows: a single-target tracking method based on online training of a full convolution network. The initial network parameters are trained first; then, during tracking, some of the tracked video frames are selected as training samples to train and update the network parameters online, improving tracking accuracy. The tracking method comprises a training sample generation stage, a network configuration stage, an offline training stage and an online tracking stage:
1) Training sample generation stage: training samples are generated for the offline training process. Target-region jittering is first applied to every frame of every video in the offline training data set, and the jittered target search region is cropped out. Three frames are drawn from the first half of each video frame sequence as training frames and one frame from the second half as a test frame; the test frame with its annotated target box serves as the verification frame. For each verification frame a Gaussian label map centered on the target box center is generated as the label of the verification set, and the distances from the center of the target box in each verification frame to its four boundaries are recorded as the labels of the regression branch during offline training;
2) Network configuration stage: the classification feature maps and regression feature maps of the test frame and the training frames are extracted. An adaptable convolution kernel f_cls of the classification branch is generated from the classification feature map of the training frames, and an adaptable convolution kernel f_reg of the regression branch is generated from the regression feature map of the training frames. The adaptable kernel f_cls is applied to the classification feature map of the test frame as the kernel of the classification convolution, and the convolution operation produces the final classification score confidence map M_cls. The adaptable kernel f_reg is applied to the regression feature map of the test frame as the kernel of the regression convolution, producing the final regression map M_reg of center-point-to-boundary distances, which represents the distances from the target center point to the four boundaries of the object. The point with the highest score is found in the confidence map M_cls, and the four offset distances at that point are read from M_reg, giving the final target box output on the test frame;
3) Offline training stage: for offline training of the classification branch, the hinge-like LBHinge loss proposed in DiMP is used as the loss function, and the IoU loss is used for the regression branch. Combined with the labels obtained from the ground-truth verification frames, the whole network is updated by the back-propagation algorithm with an SGD optimizer, and step 2) is repeated until the set number of iterations is reached;
4) Online tracking stage: the target box search region in the first frame of the video to be tracked is first cropped out as a template, and the template frame is then expanded into an online training data set containing 30 images, used as the training frames F_train. The frame to be tracked is taken as the test frame F_test and, together with F_train, is input into the network of step 2) to obtain the target box on F_test. During tracking, from every 25 tracked frames the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
Compared with the prior art, the invention has the following advantages.
The invention provides a fully convolutional online tracking method (FCOT). The method adopts a simple tracking framework that directly performs object classification and bounding box regression, improving tracking accuracy while maintaining tracking efficiency.
The invention designs a Regression Model Generator (RMG), which introduces an online training mechanism into bounding box regression and brings object classification and bounding box regression into a unified framework for online training and tracking. Compared with the hand-designed rules relied on by existing methods, the online-trained bounding box regression adapts better to object deformation during tracking.
The invention achieves good accuracy on the visual object tracking task and addresses the box misalignment caused by interference from similar background content during bounding box regression. Compared with existing methods, the proposed FCOT tracker obtains good tracking success rates and localization accuracy on multiple visual tracking benchmark data sets.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
Fig. 2 is a schematic diagram of a frame extraction process of a video.
Fig. 3 is a schematic diagram of a multivariate information fusion module provided by the present invention.
FIG. 4 is a schematic diagram of multi-scale inter-frame difference feature extraction and fusion proposed by the present invention.
Fig. 5 is a schematic diagram of a probability map solving process proposed by the present invention.
Fig. 6 is a schematic diagram of a single-frame feature sequence feature extraction process.
Detailed Description
The invention provides a single-target tracking method based on online training of a full convolution network, which first requires offline training and then, during tracking, updates the parameters of some modules through online training. Offline training is performed on four training data sets, TrackingNet-Train, LaSOT-Train, COCO-Train and GOT-10k-Train; tests on the six test sets OTB100, NFS, UAV123, GOT-10k-Test, LaSOT-Test and TrackingNet-Test achieve high accuracy and tracking success rates. The method is implemented with the Python 3 programming language and the PyTorch 1.1 deep learning framework.
FIG. 1 is a diagram of the system framework used by the present invention: a designed, fully end-to-end-trained full convolution network generates target classification and target regression templates to guide the classification and regression tasks, and the templates are updated online to accomplish the target tracking task. The whole method comprises a training sample generation stage, a network configuration stage, an offline training stage and an online tracking stage, implemented as follows:
1) Data preparation, i.e. the training sample generation stage. Training samples are generated during the offline training process: target-region jittering is first applied to every frame of every video in the offline training data set, and the jittered target search region is cropped out. Three frames are drawn from the first half of each video frame sequence as training frames and one frame from the second half as a test frame; the test frame with its annotated target box serves as the verification frame. For each verification frame a Gaussian label map centered on the target box center is generated as the label of the verification set, and the distances from the center of the target box in each verification frame to its four boundaries are recorded as the labels of the regression branch during offline training.
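As an illustrative sketch (not the patented code) of this stage, the two kinds of labels can be generated as below: a Gaussian classification label centered on the target box and the distances from each score-map cell to the four box boundaries as regression labels. The function name, the score-map stride and the value of sigma are assumptions made only for this example.

```python
import torch

def make_labels(box_xyxy, map_size=72, stride=4, sigma=4.0):
    """box_xyxy: annotated target box (x1, y1, x2, y2) in search-region pixels."""
    x1, y1, x2, y2 = [float(v) for v in box_xyxy]
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # coordinates of the score-map cell centers in image space
    coords = torch.arange(map_size, dtype=torch.float32) * stride + stride / 2.0
    xs = coords.view(1, -1).expand(map_size, -1)   # (H, W) x-coordinates
    ys = coords.view(-1, 1).expand(-1, map_size)   # (H, W) y-coordinates
    # Gaussian classification label centered on the target box center
    cls_label = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    # regression label: distances from every cell center to the four box boundaries
    reg_label = torch.stack([xs - x1, ys - y1, x2 - xs, y2 - ys], dim=0)  # (4, H, W)
    return cls_label, reg_label

# example usage with a hypothetical box:
# cls_label, reg_label = make_labels((96.0, 80.0, 192.0, 208.0))
```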
2) The configuration phase of the model, i.e. the network configuration phase, is as follows.
2.1) Extracting the coding features of the test frame: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the test frame F_test ∈ R^(B×3×288×288), giving the coding features X_test^e2, X_test^e3 and X_test^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript test denotes the test frame, and B denotes the batch size. The encoder includes convolution layers and pooling layers; the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
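A hedged sketch of this encoder follows, using the torchvision ResNet-50. The mapping of Block-1 to Block-4 onto torchvision's layer1 to layer4 and the use of ImageNet pre-trained weights follow the embodiment described later, but remain assumptions about the exact patented network.

```python
import torch
import torchvision

class Encoder(torch.nn.Module):
    """First four blocks of ResNet-50 used as the coding layers of step 2.1)."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)   # ImageNet pre-trained weights
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.block1 = resnet.layer1   # 256-channel output
        self.block2 = resnet.layer2   # 512-channel output
        self.block3 = resnet.layer3   # 1024-channel output
        self.block4 = resnet.layer4   # 2048-channel output

    def forward(self, frames):        # frames: (B, 3, 288, 288) cropped search regions
        x = self.stem(frames)
        e1 = self.block1(x)
        e2 = self.block2(e1)
        e3 = self.block3(e2)
        e4 = self.block4(e3)
        return e1, e2, e3, e4         # multi-level coding features fed to the decoder
```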
2.2) Extracting the decoding features of the test frame: the 1024-channel coding feature obtained in step 2.1) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_test^1 of size 256 × 18 × 18. The 512-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels, so that its channel number becomes 256; the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_test^2 of size 256 × 36 × 36. The 256-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels; the second-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_test^3 of size 256 × 72 × 72.
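A minimal sketch of this decoder is given below, assuming the three lateral inputs are the 1024-, 512- and 256-channel coding features from step 2.1); the module names and the align_corners choice are assumptions of this example.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Produces decoding features of sizes 256x18x18, 256x36x36 and 256x72x72."""
    def __init__(self):
        super().__init__()
        self.lateral1 = nn.Conv2d(1024, 256, kernel_size=1)   # -> first-layer decoding feature
        self.lateral2 = nn.Conv2d(512, 256, kernel_size=1)
        self.lateral3 = nn.Conv2d(256, 256, kernel_size=1)
        self.post2 = nn.Conv2d(256, 256, kernel_size=1)       # 1x1 convolution after the addition
        self.post3 = nn.Conv2d(256, 256, kernel_size=1)

    def forward(self, feat256, feat512, feat1024):
        d1 = self.lateral1(feat1024)                                            # 256 x 18 x 18
        up1 = F.interpolate(d1, scale_factor=2, mode="bilinear", align_corners=False)
        d2 = self.post2(self.lateral2(feat512) + up1)                           # 256 x 36 x 36
        up2 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        d3 = self.post3(self.lateral3(feat256) + up2)                           # 256 x 72 x 72
        return d1, d2, d3
```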
2.3) Extracting the classification features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; a Group Normalization layer with group size 32 and a ReLU activation layer are placed between the two deformable convolutions. This gives the classification feature map X_test^cls of the test frame, of size 256 × 72 × 72;
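The classification head of step 2.3) (and the analogous regression head of step 2.4)) can be sketched as follows. torchvision's DeformConv2d requires a newer torchvision than the PyTorch 1.1 setup mentioned in the embodiment and needs an explicit offset tensor, so a small offset-predicting convolution is added here as an implementation assumption that the text does not spell out.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution whose offsets are predicted from the input."""
    def __init__(self, channels=256):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)  # (dx, dy) per kernel tap
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))

class ClsHead(nn.Module):
    """Conv + GroupNorm + ReLU followed by two deformable convolutions, as in step 2.3)."""
    def __init__(self, channels=256):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                                 nn.GroupNorm(32, channels), nn.ReLU(inplace=True))
        self.deform1 = DeformBlock(channels)
        self.mid = nn.Sequential(nn.GroupNorm(32, channels), nn.ReLU(inplace=True))
        self.deform2 = DeformBlock(channels)

    def forward(self, d3):               # d3: third-layer decoding feature, 256 x 72 x 72
        return self.deform2(self.mid(self.deform1(self.pre(d3))))
```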
2.4) Extracting the regression features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer is added after the second deformable convolution. At the same time, the decoding feature D_test^2 obtained in step 2.2) is passed through the same network and the resulting feature is upsampled by a factor of 2. The feature derived from D_test^3 and the upsampled feature are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_test^reg of the test frame, of size 256 × 72 × 72;
2.5) Extracting the coding features of the training frames: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the training frames F_train ∈ R^(B×3×3×288×288), giving the coding features X_train^e2, X_train^e3 and X_train^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript train denotes the training frames, and B denotes the batch size. The encoder includes convolution layers and pooling layers; the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
2.6) Extracting the decoding features of the training frames: the 1024-channel coding feature obtained in step 2.5) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_train^1 of size 256 × 18 × 18. The 512-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels, so that its channel number becomes 256; the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_train^2 of size 256 × 36 × 36. The 256-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels; the second-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_train^3 of size 256 × 72 × 72.
2.7) Extracting the classification features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer with group size 32 and a ReLU activation layer. This gives the classification feature map X_train^cls of the training frames, of size 256 × 72 × 72;
2.8) Extracting the regression features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer. The resulting features are passed through two deformable convolutions (Deformable Convolution) with 3 × 3 convolution kernels, stride 1, and 256 input and output channels; the first deformable convolution is followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer is added after the second deformable convolution. At the same time, the decoding feature D_train^2 obtained in step 2.6) is passed through the same network and the resulting feature is upsampled by a factor of 2. The two feature maps obtained above are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_train^reg of the training frames, of size 256 × 72 × 72;
2.9) Generating the adaptable convolution kernel of the classification branch: the classification feature map X_train^cls obtained in step 2.7) is first input into a convolution layer and a region-of-interest pooling layer (ROI Pooling); the convolution layer has 3 × 3 convolution kernels, stride 1, and 256 input and output channels, and the ROI Pooling layer has size 4 × 4 and stride 16. This gives the initial adaptable convolution kernel f_cls^0 of size 256 × 4 × 4. Then f_cls^0 is optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_cls of the classification branch.
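An approximate sketch of this step follows: torchvision's roi_pool extracts a 256 × 4 × 4 initial kernel from the training-frame classification features inside the target box, and a few plain gradient steps on a squared-error objective stand in for the Gauss-Newton optimization, so the learning rate, step count and loss below are only placeholders for the patented optimizer.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_pool

def initial_cls_kernel(train_cls_feat, rois, stride=4):
    """train_cls_feat: (N, 256, 72, 72); rois: (N, 5) rows of (batch_index, x1, y1, x2, y2)."""
    pooled = roi_pool(train_cls_feat, rois, output_size=(4, 4), spatial_scale=1.0 / stride)
    return pooled.mean(dim=0, keepdim=True)               # initial kernel, shape (1, 256, 4, 4)

def refine_cls_kernel(f_init, train_cls_feat, gauss_labels, steps=5, lr=1e-2):
    """gauss_labels: (N, 1, 72, 72) Gaussian maps from the training-sample stage."""
    f = f_init.clone().requires_grad_(True)
    for _ in range(steps):
        score = F.conv2d(train_cls_feat, f, padding=f.shape[-1] // 2)
        # trim the extra row/column produced by the even-sized kernel
        score = score[..., : gauss_labels.shape[-2], : gauss_labels.shape[-1]]
        loss = F.mse_loss(score, gauss_labels)
        (grad,) = torch.autograd.grad(loss, f)
        f = (f - lr * grad).detach().requires_grad_(True)  # simple descent step
    return f.detach()                                      # adaptable kernel f_cls
```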
2.10) Generating the adaptable convolution kernel of the regression branch: the regression feature map X_train^reg obtained in step 2.8) is first input into a region-of-interest pooling layer (ROI Pooling) of size 3 × 3 and stride 16, giving the initial adaptable convolution kernel f_reg^0 of size 256 × 4. The initial convolution kernel f_reg^0 is then optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_reg of the regression branch.
2.11) Obtaining the classification confidence map: the adaptable convolution kernel f_cls obtained in step 2.9) is used as the kernel of the classification convolution and applied to the classification feature map X_test^cls of the test frame obtained in step 2.3); the convolution operation produces the final classification score confidence map M_cls of size 72 × 72, in which a higher score at a point indicates higher confidence that the point is the target center.
The classification confidence map M_cls is calculated as follows. Denote the final classification convolution by Conv1, whose convolution kernel is the adaptable kernel f_cls obtained in step 2.9); then
M_cls = Conv1(X_test^cls; f_cls), M_cls ∈ R^(1×72×72).
2.12) Obtaining the regression offset distances: the adaptable convolution kernel f_reg obtained in step 2.10) is used as the kernel of the regression convolution and applied to the regression feature map X_test^reg of the test frame obtained in step 2.4); the convolution operation produces the final regression map M_reg of center-point-to-boundary distances, of size 4 × 72 × 72, representing the distances from the target center point to the four boundaries of the object. Denote the final regression convolution by Conv2, whose convolution kernel is the adaptable kernel f_reg obtained in step 2.10); then
M_reg = Conv2(X_test^reg; f_reg), M_reg ∈ R^(4×72×72).
The point with the highest score is found in the confidence map M_cls obtained in step 2.11), and the four offset distances at that point are read from M_reg, giving the final target box.
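Steps 2.11) and 2.12) can be condensed into the following hedged sketch, in which the adaptable kernels are applied with F.conv2d and the box is decoded from the arg-max of M_cls; the kernel sizes, the padding/cropping choice and the score-map stride are assumptions used only to make the example self-contained.

```python
import torch
import torch.nn.functional as F

def track_frame(cls_feat, reg_feat, f_cls, f_reg, map_size=72, stride=4):
    """cls_feat, reg_feat: (1, 256, 72, 72); f_cls: (1, 256, k, k); f_reg: (4, 256, k, k)."""
    m_cls = F.conv2d(cls_feat, f_cls, padding=f_cls.shape[-1] // 2)[..., :map_size, :map_size]
    m_reg = F.conv2d(reg_feat, f_reg, padding=f_reg.shape[-1] // 2)[..., :map_size, :map_size]
    # highest-confidence location of M_cls is taken as the target center
    idx = int(m_cls.view(-1).argmax())
    cy, cx = idx // map_size, idx % map_size
    left, top, right, bottom = m_reg[0, :, cy, cx]
    # map the score-map cell back to search-region pixels and apply the four offsets
    px, py = cx * stride + stride / 2.0, cy * stride + stride / 2.0
    box = torch.stack([px - left, py - top, px + right, py + bottom])   # (x1, y1, x2, y2)
    return box, m_cls
```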
The network configuration stage is described in detail below for one embodiment. The first four blocks of ResNet-50 are used as the network structure of the encoding layer, loaded with the parameters of an ImageNet pre-trained model. The training frames and the test frame are then decoded separately. Specifically, a convolution layer with a 1 × 1 kernel first produces the first-layer decoding feature, with a feature map size of 256 × 18 × 18. Next, a 1 × 1 convolution layer with 512 input channels and 256 output channels reduces the channel number to 256; the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added, and a 1 × 1 convolution is applied, giving the second-layer decoding feature, with a feature map size of 256 × 36 × 36. The feature map is then upsampled by a factor of 2 again to obtain the third-layer decoding feature, with a size of 256 × 72 × 72.
Then, classification features and regression features are extracted separately from the training frames and the test frame. As shown in FIG. 2, the features obtained from the decoder are input into a convolution layer followed by two deformable convolution operations, with a group normalization layer and a ReLU activation layer between them, giving the classification features of the training frames; the classification feature map size is 256 × 72 × 72. As also shown in FIGS. 3, 4 and 5, the decoder features of the training frames and the test frame pass through the convolution layers, deformable convolution layers, group normalization layers and activation function layers shown in the figures. The regression feature map of the training frames has size 1024 × 72 × 72, and the classification and regression feature maps of the test frame both have size 256 × 72 × 72.
The classification and regression optimizable models are generated from the training frames. As shown in FIG. 6, the classification and regression features of the training frames are input into an initial optimizable-model generator, which produces initial coarse models (in fact convolution kernels), sized 256 × 4 × 4 for classification and 4 × 256 × 3 × 3 for regression. The initial models are then optimized by the steepest descent method to obtain the final classification and regression optimizable models applied to the test frame, namely the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
Finally, convolution operations are performed on the classification features and regression features of the test frame, with the optimized classification and regression models obtained above as convolution kernels, giving a 72 × 72 classification score map and a 4 × 72 × 72 regression offset distance map. The point with the highest score in the classification score map is taken as the target center point, and the four values at that point in the regression offset distance map are taken as the distances from the center point to the four boundaries of the target, giving the final target box.
3) In the offline training stage, the hinge-like LBHinge loss proposed in DiMP is used as the loss function for offline training of the classification branch, and the IoU loss is used for the regression branch. An SGD optimizer is used with a batch size of 40; the total number of training epochs is 100, the learning rate is divided by 10 at epochs 25 and 45, and the decay rate is set to 0.2. Training is performed on 8 RTX 2080 Ti GPUs; the whole network is updated by the back-propagation algorithm, and steps 2.1) to 2.12) are repeated until the set number of iterations is reached.
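For the regression branch, the IoU loss named above can be written for (left, top, right, bottom) offsets as in the following sketch; the reduction and the epsilon are assumptions of this example, and the hinge-like classification loss of DiMP is not reproduced here.

```python
import torch

def iou_loss(pred_ltrb, target_ltrb, eps=1e-6):
    """pred_ltrb, target_ltrb: (N, 4) distances from the same points to the box boundaries."""
    pl, pt, pr, pb = pred_ltrb.unbind(dim=1)
    tl, tt, tr, tb = target_ltrb.unbind(dim=1)
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    # overlap of two boxes that share the same reference point
    inter_w = (torch.min(pl, tl) + torch.min(pr, tr)).clamp(min=0)
    inter_h = (torch.min(pt, tt) + torch.min(pb, tb)).clamp(min=0)
    inter = inter_w * inter_h
    union = pred_area + target_area - inter
    iou = inter / (union + eps)
    return (1.0 - iou).mean()
```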
4) In the online tracking stage, the training frame sequence and the test frame are input into the network, and the final target box is obtained on the basis of the initial parameters.
First, the target box search region in the first frame of the video to be tracked is cropped out as a template, and the template frame is then expanded into an online training data set containing 30 images, used as the training frames F_train. The frame to be tracked is taken as the test frame F_test and, together with F_train, is input into the network of step 2) to obtain the target box on F_test. During tracking, from every 25 tracked frames the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
In the invention, online training means that the convolution kernels of the two branches, the classification branch and the regression branch, can be trained online to update their parameters and adapt to the current tracking video; the initial parameters of the full convolution network are still obtained by offline training.
The first frame of each video in the test set is processed in the same way as the training set: a region 5 times the size of the target is cropped out and scaled to 288 × 288, and the first frame is augmented by rotation, translation, added noise and similar transformations to obtain 30 training samples, with which online training is performed. During tracking, online training is carried out every 25 frames, and the frame with the highest classification score is selected and added to the online training set. On the OTB100, NFS, UAV123, GOT-10k, LaSOT and TrackingNet test data sets, the tracking speed is 53 fps. In terms of tracking accuracy, AUC reaches 69.2% and precision (Pre) reaches 90.6% on OTB100; on NFS, AUC reaches 62.2% and Pre reaches 74.5%; on UAV123, AUC reaches 65.7% and Pre reaches 87.6%; on GOT-10k, AO reaches 62.7%, SR_0.75 reaches 51.7% and SR_0.5 reaches 75.3%; on TrackingNet, Suc reaches 74.5% and Pre reaches 71.4%; on LaSOT, Suc reaches 69.2% and Pre reaches 90.6%. The indices on the 6 test data sets exceed those of DiMP.
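The online stage described above amounts to the bookkeeping sketched below, under the assumption that feature extraction and kernel optimization (for example a routine like refine_cls_kernel sketched earlier) are available elsewhere; the class name and the confidence handling are illustrative only.

```python
import collections

class OnlineUpdater:
    """Keeps a 30-sample online training set and triggers re-optimization every 25 frames."""
    def __init__(self, update_interval=25, memory_size=30):
        self.update_interval = update_interval
        self.memory = collections.deque(maxlen=memory_size)   # (sample, label) pairs
        self.frame_count = 0
        self.best_conf = float("-inf")
        self.best_sample = None

    def init_first_frame(self, augmented_samples):
        # 30 rotated / translated / noise-augmented crops of the first frame
        self.memory.extend(augmented_samples)

    def step(self, sample, confidence):
        """Call once per tracked frame; returns True when f_cls and f_reg should be re-trained."""
        self.frame_count += 1
        if confidence > self.best_conf:
            self.best_conf, self.best_sample = confidence, sample
        if self.frame_count % self.update_interval == 0:
            if self.best_sample is not None:
                self.memory.append(self.best_sample)           # highest-scoring frame of the interval
            self.best_conf, self.best_sample = float("-inf"), None
            return True
        return False
```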

Claims (2)

1. A single-target tracking method based on full convolution network online training, characterized in that initial network parameters are first trained, and then, during tracking, some of the tracked video frames are selected as training samples to train and update the network parameters online, improving tracking accuracy; the tracking method comprises a training sample generation stage, a network configuration stage, an offline training stage and an online tracking stage:
1) the training sample generation stage, in which training samples are generated for the offline training process: target-region jittering is first applied to every frame of every video in the offline training data set, and the jittered target search region is cropped out; three frames are drawn from the first half of each video frame sequence as training frames and one frame from the second half as a test frame; the test frame with its annotated target box serves as the verification frame, a Gaussian label map centered on the target box center is generated for each verification frame as the label of the verification set, and the distances from the center of the target box in each verification frame to its four boundaries are recorded as the labels of the regression branch during offline training;
2) the network configuration stage, in which the classification feature maps and regression feature maps of the test frame and the training frames are extracted; an adaptable convolution kernel f_cls of the classification branch is generated from the classification feature map of the training frames, and an adaptable convolution kernel f_reg of the regression branch is generated from the regression feature map of the training frames; the adaptable kernel f_cls is applied to the classification feature map of the test frame as the kernel of the classification convolution, and the convolution operation produces the final classification score confidence map M_cls; the adaptable kernel f_reg is applied to the regression feature map of the test frame as the kernel of the regression convolution, producing the final regression map M_reg of center-point-to-boundary distances, which represents the distances from the target center point to the four boundaries of the object; the point with the highest score is found in the confidence map M_cls, and the four offset distances at that point are read from M_reg, so that the final target box on the test frame is output;
3) the offline training stage, in which, for offline training of the classification branch, the hinge-like LBHinge loss proposed in DiMP is used as the loss function and the IoU loss is used for the regression branch; combined with the labels obtained from the ground-truth verification frames, the whole network is updated by the back-propagation algorithm with an SGD optimizer, and step 2) is repeated until the set number of iterations is reached;
4) the online tracking stage, in which the target box search region in the first frame of the video to be tracked is cropped out as a template, and the template frame is then expanded into an online training data set containing 30 images, used as the training frames F_train; the frame to be tracked in the video is taken as the test frame F_test and, together with F_train, is input into the network of step 2) to obtain the target box on F_test, thereby realizing tracking; during tracking, from every 25 tracked frames the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the adaptable convolution kernel f_cls of the classification branch and the adaptable convolution kernel f_reg of the regression branch.
2. The single-target tracking method based on the full convolution network online training as claimed in claim 1, wherein the network configuration stage specifically comprises:
2.1) extracting the coding features of the test frame: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the test frame F_test ∈ R^(B×3×288×288), giving the coding features X_test^e2, X_test^e3 and X_test^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript test denotes the test frame, and B denotes the batch size; the encoder includes convolution layers and pooling layers, the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
2.2) extracting the decoding features of the test frame: the 1024-channel coding feature obtained in step 2.1) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_test^1 of size 256 × 18 × 18; the 512-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels so that its channel number becomes 256, the first-layer decoding feature D_test^1 is upsampled by a factor of 2 with a bilinear interpolation layer, the two results are added and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_test^2 of size 256 × 36 × 36; the 256-channel coding feature obtained in step 2.1) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels, the second-layer decoding feature D_test^2 is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_test^3 of size 256 × 72 × 72;
2.3) extracting the classification features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, with a Group Normalization layer of group size 32 and a ReLU activation layer between the two deformable convolutions, giving the classification feature map X_test^cls of the test frame, of size 256 × 72 × 72;
2.4) extracting the regression features of the test frame: the decoding feature D_test^3 obtained in step 2.2) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, the first deformable convolution being followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer being added after the second deformable convolution; at the same time, the decoding feature D_test^2 obtained in step 2.2) is passed through the same network as D_test^3 and the resulting feature is upsampled by a factor of 2; the feature derived from D_test^3 and the upsampled feature are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_test^reg of the test frame, of size 256 × 72 × 72;
2.5) extracting the coding features of the training frames: Block-1, Block-2, Block-3 and Block-4 of ResNet-50 are used as the encoder to extract features from the training frames F_train ∈ R^(B×3×3×288×288), giving the coding features X_train^e2, X_train^e3 and X_train^e4, where the superscript e2 denotes the feature extracted by coding layer Block-2, e3 the feature extracted by Block-3, e4 the feature extracted by Block-4, the subscript train denotes the training frames, and B denotes the batch size; the encoder includes convolution layers and pooling layers, the convolution layers use two kinds of convolution kernels, 3 × 3 and 1 × 1, used respectively to extract higher-dimensional features and to transform the feature dimension, and the convolution kernel of each convolution layer is initialized randomly;
2.6) extracting the decoding features of the training frames: the 1024-channel coding feature obtained in step 2.5) is passed through a convolution layer with a 1 × 1 convolution kernel, 1024 input channels and 256 output channels, giving the first-layer decoding feature D_train^1 of size 256 × 18 × 18; the 512-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 512 input channels and 256 output channels so that its channel number becomes 256, the first-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added and a convolution with a 1 × 1 kernel is applied, giving the second-layer decoding feature D_train^2 of size 256 × 36 × 36; the 256-channel coding feature obtained in step 2.5) is passed through a 1 × 1 convolution layer with 256 input channels and 256 output channels, the second-layer decoding feature is upsampled by a factor of 2 with a bilinear interpolation layer, the two feature maps are added and a convolution with a 1 × 1 kernel is applied, giving the third-layer decoding feature D_train^3 of size 256 × 72 × 72;
2.7) extracting the classification features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, the first deformable convolution being followed by a Group Normalization layer with group size 32 and a ReLU activation layer, giving the classification feature map X_train^cls of the training frames, of size 256 × 72 × 72;
2.8) extracting the regression features of the training frames: the decoding feature D_train^3 obtained in step 2.6) is passed through a convolution layer with 3 × 3 convolution kernels, stride 1, 256 input channels and 256 output channels, then through a Group Normalization layer with 32 groups and a ReLU activation layer; the resulting features are passed through two deformable convolutions with 3 × 3 convolution kernels, stride 1, and 256 input and output channels, the first deformable convolution being followed by a Group Normalization layer with group size 32 and a ReLU activation layer, and another ReLU activation layer being added after the second deformable convolution; at the same time, the decoding feature D_train^2 obtained in step 2.6) is passed through the same network and the resulting feature is upsampled by a factor of 2; the two feature maps obtained above are concatenated and input into a 1 × 1 convolution layer with 512 input channels and 256 output channels, giving the regression feature map X_train^reg of the training frames, of size 256 × 72 × 72;
2.9) generating the adaptable convolution kernel of the classification branch: the classification feature map X_train^cls obtained in step 2.7) is first input into a convolution layer and a region-of-interest pooling layer (ROI Pooling); the convolution layer has 3 × 3 convolution kernels, stride 1, and 256 input and output channels, and the ROI Pooling layer has size 4 × 4 and stride 16, giving the initial adaptable convolution kernel f_cls^0 of size 256 × 4 × 4; f_cls^0 is then optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_cls of the classification branch;
2.10) generating the adaptable convolution kernel of the regression branch: the regression feature map X_train^reg obtained in step 2.8) is first input into a region-of-interest pooling layer (ROI Pooling) of size 3 × 3 and stride 16, giving the initial adaptable convolution kernel f_reg^0 of size 256 × 4; the initial convolution kernel f_reg^0 is then optimized with the Gauss-Newton method, giving the final adaptable convolution kernel f_reg of the regression branch;
2.11) obtaining the classification confidence map: the adaptable convolution kernel f_cls obtained in step 2.9) is used as the kernel of the classification convolution and applied to the classification feature map X_test^cls of the test frame obtained in step 2.3); the convolution operation produces the final classification score confidence map M_cls of size 72 × 72, in which a higher score at a point indicates higher confidence that the point is the target center;
the classification confidence map M_cls is calculated as follows: denote the final classification convolution by Conv1, whose convolution kernel is the adaptable kernel f_cls obtained in step 2.9); then
M_cls = Conv1(X_test^cls; f_cls), M_cls ∈ R^(1×72×72);
2.12) obtaining the regression offset distances: taking the adaptable convolution kernel f_reg obtained in step 2.10) as the kernel of the regression-branch convolution and applying it to the regression feature map of the test frame obtained in step 2.4); after this convolution operation, the final regression map M_reg of distances from the center point to the target boundaries is generated; the regression map has a size of 4 × 72 × 72, and its four channels respectively represent the distances from the target center point to the four boundaries of the object; denote the final regression convolution by Conv_2, whose convolution kernel is the adaptable convolution kernel f_reg obtained in step 2.10); then M_reg is the output of Conv_2 applied to the regression feature map of the test frame, with M_reg ∈ R^(4×72×72);
according to the confidence map M_cls obtained in step 2.11), the point with the highest score is found; the four offset distances corresponding to this point are then read from M_reg to obtain the final target bounding box.
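Reading off the final box combines the two maps. The sketch below assumes the regression map stores (left, top, right, bottom) distances in feature-map units and that the feature map has a stride of 16 relative to the image; neither unit convention is fixed by the claim, and decode_box is a name introduced here for illustration.

```python
import torch


def decode_box(m_cls, m_reg, stride=16):
    """m_cls: 72 x 72 confidence map; m_reg: 4 x 72 x 72 map whose channels hold
    the (left, top, right, bottom) distances from each location to the box sides.
    Returns the final target box (x1, y1, x2, y2) in image coordinates."""
    cy, cx = divmod(int(m_cls.argmax()), m_cls.shape[-1])  # highest-scoring location
    l, t, r, b = m_reg[:, cy, cx].tolist()                 # four offsets at that point
    x1 = (cx - l) * stride
    y1 = (cy - t) * stride
    x2 = (cx + r) * stride
    y2 = (cy + b) * stride
    return x1, y1, x2, y2
```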
CN202010293393.5A 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training Active CN113538507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010293393.5A CN113538507B (en) 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010293393.5A CN113538507B (en) 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training

Publications (2)

Publication Number Publication Date
CN113538507A true CN113538507A (en) 2021-10-22
CN113538507B CN113538507B (en) 2023-11-17

Family

ID=78088144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010293393.5A Active CN113538507B (en) 2020-04-15 2020-04-15 Single-target tracking method based on full convolution network online training

Country Status (1)

Country Link
CN (1) CN113538507B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012243A1 (en) * 2022-07-15 2024-01-18 Mediatek Inc. Unified cross-component model derivation

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086607A1 (en) * 2016-11-11 2018-05-17 纳恩博(北京)科技有限公司 Target tracking method, electronic device, and storage medium
US20180285692A1 (en) * 2017-03-28 2018-10-04 Ulsee Inc. Target Tracking with Inter-Supervised Convolutional Networks
CN107945210A (en) * 2017-11-30 2018-04-20 天津大学 Target tracking algorism based on deep learning and environment self-adaption
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks
CN108960086A (en) * 2018-06-20 2018-12-07 电子科技大学 Based on the multi-pose human body target tracking method for generating confrontation network positive sample enhancing
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
US20200051250A1 (en) * 2018-08-08 2020-02-13 Beihang University Target tracking method and device oriented to airborne-based monitoring scenarios
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DONG-HYUN LEE: "Fully Convolutional Single-Crop Siamese Networks for Real-Time Visual Object Tracking", 《ELECTRONICS》 *
LIJUN WANG ET AL.: "Visual Tracking with Fully Convolutional Networks", 《PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
LUCA BERTINETTO ET AL.: "Fully-Convolutional Siamese Networks for Object Tracking", 《COMPUTER VISION AND PATTERN RECOGNITION (CS.CV)》 *
YANGLIU KUAI ET AL.: "Hyper-Feature Based Tracking with the Fully-Convolutional Siamese Network", 《2017 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA)》 *
YANGLIU KUAI ET AL.: "Learning Fully Convolutional Network for Visual Tracking With Multi-Layer Feature Fusion", 《IEEE ACCESS》 *
SHI LULU: "Research on Deep Learning and Its Application in Video Object Tracking", 《China Master's Theses Full-text Database》 *
XU DUO: "Research on Object Tracking Algorithms Based on Structured Processing with CNN and RNN", 《China Master's Theses Full-text Database》 *
XU ZHENG: "Research on Object Tracking Methods Based on Deep Learning", 《China Master's Theses Full-text Database》 *

Also Published As

Publication number Publication date
CN113538507B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
US20230186056A1 (en) Grabbing detection method based on rp-resnet
US11636570B2 (en) Generating digital images utilizing high-resolution sparse attention and semantic layout manipulation neural networks
Peyrard et al. ICDAR2015 competition on text image super-resolution
CN113361636B (en) Image classification method, system, medium and electronic device
CN108595558B (en) Image annotation method based on data equalization strategy and multi-feature fusion
Yang et al. Diffusion model as representation learner
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN112861915A (en) Anchor-frame-free non-cooperative target detection method based on high-level semantic features
CN114359603A (en) Self-adaptive unsupervised matching method in multi-mode remote sensing image field
CN116129289A (en) Attention edge interaction optical remote sensing image saliency target detection method
CN114529574A (en) Image matting method and device based on image segmentation, computer equipment and medium
CN115049921A (en) Method for detecting salient target of optical remote sensing image based on Transformer boundary sensing
CN115908806A (en) Small sample image segmentation method based on lightweight multi-scale feature enhancement network
CN113538507A (en) Single-target tracking method based on full convolution network online training
EP4237997A1 (en) Segmentation models having improved strong mask generalization
CN115115667A (en) Accurate target tracking method based on target transformation regression network
Goud et al. Text localization and recognition from natural scene images using ai
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN112732943B (en) Chinese character library automatic generation method and system based on reinforcement learning
CN113743497B (en) Fine granularity identification method and system based on attention mechanism and multi-scale features
CN115393491A (en) Ink video generation method and device based on instance segmentation and reference frame
CN112529081A (en) Real-time semantic segmentation method based on efficient attention calibration
US20240095929A1 (en) Methods and systems of real-time hierarchical image matting
CN117593755B (en) Method and system for recognizing gold text image based on skeleton model pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant