WO2023159558A1 - Real-time target tracking method, device, and storage medium - Google Patents

Real-time target tracking method, device, and storage medium Download PDF

Info

Publication number
WO2023159558A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
feature
feature extraction
training
frame
Prior art date
Application number
PCT/CN2022/078255
Other languages
French (fr)
Chinese (zh)
Inventor
胡金星
李东昊
王浩
陈卫华
罗亚林
Original Assignee
中国科学院深圳先进技术研究院
中广核工程有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院, 中广核工程有限公司 filed Critical 中国科学院深圳先进技术研究院
Priority to PCT/CN2022/078255 priority Critical patent/WO2023159558A1/en
Publication of WO2023159558A1 publication Critical patent/WO2023159558A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation

Definitions

  • the invention relates to the field of artificial intelligence, in particular to a real-time target tracking method, device and storage medium.
  • Object tracking has broad application prospects in many fields such as human-machine interface, intelligent monitoring, virtual reality, and motion analysis, and has important research value in science and engineering. Due to the establishment of new benchmark object tracking datasets and the provision of standardized benchmarking platforms since 2013, object tracking has developed rapidly in the past decade, and many effective tracking algorithms have been proposed successively.
  • Bolme et al. pioneered the introduction of the convolution theorem from signal processing into visual tracking and converted the object template matching problem into a correlation operation in the frequency domain. In this way, not only is the running speed of the Correlation Filters (CF) tracker improved, but the accuracy can also be improved by using appropriate features. Since then, correlation filters became a research hotspot in the tracking field, and many related target tracking methods have been proposed, such as combining multi-resolution feature maps to reduce the influence of periodic boundaries and improving tracking performance by optimizing the loss.
  • The early Siamese trackers are represented by the SINT algorithm, which is based on the idea of similarity learning, divides the network into a query branch and a search branch, and uses a matching function to find suitable candidate regions, but its tracking speed is too slow, only 2 fps.
  • the GOTURN algorithm can achieve 100 frames per second on a single GPU through a deep regression network, but its robustness is poor.
  • FCNT and CREST, like the above algorithms, focus on exploring the tracking ability of Siamese networks. Bertinetto et al. proposed SiamFC, a lightweight Siamese network structure that extracts target features and search-region features separately and then correlates them; the target bounding box is determined from the position of the maximum of the response map.
  • Training uses videos provided by the ILSVRC dataset; after offline training, the parameters are not updated during tracking, and the algorithm achieves good results in both accuracy and speed.
  • The subsequently proposed correlation filter network CFNet embeds correlation filters into the network branches, treats the filter as a neural network layer, and derives the forward and backward propagation formulas to achieve end-to-end training, while the speed remains real-time on a GPU.
  • Bo Li proposed adding the RPN network from target detection to the tracking network, adopting target localization techniques similar to detection algorithms and using coordinate regression to make the tracking results more accurate while avoiding the multi-scale search problem.
  • Bo Li later proposed SiamRPN++, which introduced a deep backbone network into the tracking network and greatly improved detection accuracy.
  • Many subsequent algorithms improved on SiamRPN++, including adding a mask branch to obtain masks, using internally cropped residual units together with wider networks, and exploiting the effect of deeper networks through multi-level fusion.
  • Although Siamese tracking networks have made great progress, they still face severe memory constraints in real-world applications. For example, the memory footprint of SiamRPN++ reaches 206 MB, which makes it difficult to deploy on mobile devices such as drones. How to reduce its parameter count without losing much accuracy has become an urgent problem to be solved.
  • Embodiments of the present invention provide a real-time target tracking method, device, and storage medium, which implement real-time tracking of targets in videos based on twin networks, so that fast target tracking can be achieved without losing too much accuracy.
  • a real-time target tracking method comprising the following steps:
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • configure the feature extraction network and the RPN network, input the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and input the two feature maps into the RPN network, which regresses the bounding box;
  • each frame of the video stream is used as a search image frame and input into the specialized network; the bounding box is regressed to complete target tracking.
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame. Specifically:
  • the micro-ImageNet dataset is composed of several pictures with a preset resolution, each marked with a single category, and is used for pre-training the backbone network;
  • the COCO2017 data set is a target detection data set.
  • the target detection data set is composed of several pictures with preset resolutions. Each picture is marked with multiple categories and the location of the bounding box;
  • data enhancement includes random flipping, blurring, and shifting operations on the image;
  • configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box, specifically comprises:
  • the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • the RPN network regresses the bounding boxes to obtain accurate position estimates.
  • the multiplexing module includes:
  • the channel multiplexing module is used to share the information of the channel to the space through expansion and reduction operations, and disperse the information of the space to the channel, which promotes the flow of information, and uses group convolution to reduce the amount of parameters;
  • the spatial multiplexing module is used to reshape the channels back through convolution and directly copy the remaining channels.
  • the RPN network includes two parts, namely the classification branch and the regression branch;
  • the classification branch is used to distinguish the target from the background;
  • the regression branch is used to fine-tune the candidate regions.
  • a real-time target tracking device including:
  • the data selection module selects the image data required for training and enhances the search image frame to prevent network overfitting.
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • the network configuration module configures the feature extraction network and the RPN network, inputs the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputs the two feature maps into the RPN network, which regresses the bounding box;
  • the RPN network is removed and the first-stage training of the backbone network is performed on the classification image dataset; after training on the classification dataset converges, the RPN network is added after the backbone network and the second-stage training is performed on the target detection dataset;
  • the target tracking module, for an unknown video stream, marks the target to be tracked in the first frame of the video; the backbone network uses it as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • the network configuration module includes:
  • the first feature output unit is used to input paired target template frames and search image frames into the feature extraction network and to output feature maps T2 and S2 through the separable-convolution feature extraction network;
  • the second feature output unit is used to input feature maps T2 and S2 into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
  • a feature input unit is used to input feature maps T3 and S3 into the RPN network for regression and classification respectively; wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • the feature positioning unit, in which the RPN network regresses the bounding box to obtain an accurate position estimate.
  • the multiplexing module includes:
  • the channel multiplexing module is used to share the information of the channel to the space through the expansion and reduction operation, and disperse the information of the space to the channel, which promotes the circulation of information, and uses group convolution at the same time to reduce the amount of parameters;
  • the spatial multiplexing module is used to reshape the channels back through convolution and directly copy the remaining channels.
  • a computer-readable medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in any one of the above real-time target tracking methods.
  • the real-time target tracking method, device, and storage medium in the embodiments of the present invention include: selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame; configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box; removing the RPN network and performing the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, adding the RPN network after the backbone network and performing the second-stage training on the target detection dataset; for an unknown video stream, marking the target to be tracked in the first frame of the video, which the backbone network uses as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • through the present invention, tracking of targets in video is realized based on a Siamese network; and, addressing the large memory usage and slow inference speed common to current Siamese networks, the network trained by the present invention achieves fast target tracking without losing too much accuracy.
  • Fig. 1 is a flowchart of the real-time object tracking method of the present invention.
  • Fig. 2 is a network structure diagram of the present invention.
  • Fig. 3 is a spatial multiplexing module diagram of the present invention.
  • Fig. 4 is a channel multiplexing module diagram of the present invention.
  • Fig. 5 is the structural diagram of RPN network of the present invention.
  • Fig. 6 is a schematic diagram of the real-time target tracking device of the present invention.
  • a real-time target tracking method comprising the following steps:
  • S100 Select the image data required for training and enhance the search image frame to prevent network overfitting; the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • S200 Configure the feature extraction network and the RPN network, input the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and input the two feature maps into the RPN network, which regresses the bounding box;
  • S300 Remove the RPN network and perform the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, add the RPN network after the backbone network and perform the second-stage training on the target detection dataset;
  • S400 For an unknown video stream, mark the target to be tracked in the first frame of the video; the backbone network uses it as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • the problem solved by the present invention is to realize tracking of targets in video based on a Siamese network; and, addressing the large memory usage and slow inference speed common to current Siamese networks, a lightweight network is designed so that fast object tracking is achieved without losing too much accuracy.
  • the present invention trains and fine-tunes the backbone network from scratch, and finally applies it to the Siamese network framework to achieve fast target tracking.
  • the inventive method comprises four stages:
  • Data preparation stage: training is divided into a backbone network pre-training stage and a regression sub-network training stage. Two kinds of datasets are prepared, used respectively for the classification training of the backbone network and the localization training of the regression sub-network.
  • For localization training, unlike the usual target detection task, the target region must be cropped out as the target template frame and paired with the search image as input to the network.
  • Data augmentation such as flipping, rotation, and shifting is performed on the search image to prevent the network from overfitting.
  • Network configuration stage: the network can be divided into two parts, the feature extraction network and the RPN network.
  • the aforementioned multiplexing module is included in the feature extraction network.
  • the input is pairs of target template frames and search image frames, sharing the same feature extraction network.
  • two feature maps are output as the input of the RPN network.
  • the RPN network is used to regress the precise bounding box: the feature maps obtained from the target template frame and the search image frame are convolved once to obtain a response heat map, where positions with large response values are the places where the target is most likely to appear.
  • Network training stage: training uses the sum of the regression loss and the classification loss as the loss function; the classification loss uses cross-entropy loss and the regression loss uses Smooth L1 loss.
  • First remove the RPN network and train the backbone network on the classification dataset; after convergence, add the RPN network after the backbone network and conduct the second stage of training on the target detection dataset, using SGD as the gradient descent algorithm in both stages.
  • Online tracking stage: online tracking adopts a one-shot training method. For an unknown video, the target to be tracked is marked in the first frame, and the network uses it as the target template frame for training, yielding a specialized network. Afterwards, each frame of the video stream is used as a search image frame, input into the network, and the bounding box is regressed, completing the tracking.
  • By adopting the lightweight modules and the backbone network formed from them, the invention improves the inference speed of the target tracking network and reduces its memory occupation while maintaining accuracy. More details are given in the following description.
  • Step S100, selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame, specifically includes:
  • S101 Select two image datasets, namely the micro-ImageNet dataset and the COCO2017 dataset; the micro-ImageNet dataset is composed of several pictures with a preset resolution, each marked with a single category, and is used for pre-training the backbone network; the COCO2017 dataset is a target detection dataset.
  • the target detection dataset is composed of several pictures with preset resolutions. Each picture is marked with multiple categories and the location of the bounding box;
  • S102 Perform data enhancement on the image data set, the data enhancement includes random flipping, blurring, and shifting operations on the image;
  • S103 Generate an Anchor to perform target positioning on the search image frame.
  • Step S100 is the data preparation stage, specifically including:
  • Step 1 data selection: the training is divided into two stages, namely backbone network pre-training and regression sub-network training.
  • the specific training configuration is introduced in detail in step 3.
  • Tiny ImageNet is a sub-dataset of ImageNet. 10,000 pictures with a resolution of 224*224 are selected, and each picture is marked with a single category for pre-training the backbone network.
  • COCO2017 is a target detection data set, and 5000 pictures are selected, and each picture is marked with multiple categories and the location of the bounding box.
  • the labeled bounding box needs to be cropped out separately as the target template frame, and the entire image is used as the search image frame; the two are input into the network in pairs.
  • Step 2 data enhancement: Before entering the training, the data needs to be expanded, that is, data enhancement, which requires random flipping, blurring, and shifting operations on the image. It should be noted that in Siamese neural network, data augmentation only needs to be performed on the search image, not on the target template.
  • Step 3 anchor generation: anchors are needed to locate the target on the search image.
  • The specific method of generating anchors is: first obtain the base anchors, whose array has dimension [5, 4]; each row records a vector [x-coordinate of the center point, y-coordinate of the center point, width, height], and the 5 rows correspond to the anchor expansion ratios, set to [0.5, 0.66, 1, 1.5, 2].
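For illustration, a minimal NumPy sketch of this kind of anchor generation is given below. The base size of 8, the stride of 8, and tiling over a 17*17 response map are assumptions (the 17*17 size is taken from the RPN description later in this text); only the [5, 4] base-anchor layout and the ratio list follow the paragraph above.

```python
import numpy as np

def generate_anchors(base_size=8, ratios=(0.5, 0.66, 1, 1.5, 2),
                     response_size=17, stride=8):
    """Hedged sketch: build k base anchors [cx, cy, w, h] (one per ratio),
    then tile them over every position of the response map."""
    k = len(ratios)
    base = np.zeros((k, 4), dtype=np.float32)          # shape [5, 4]
    area = base_size * base_size
    for i, r in enumerate(ratios):
        w = np.sqrt(area / r)
        h = w * r
        base[i] = [0, 0, w, h]                          # centred at the origin

    # Offsets of every response-map cell, expressed in search-image pixels
    # and centred so the middle cell sits on the image centre.
    shift = (np.arange(response_size) - response_size // 2) * stride
    xs, ys = np.meshgrid(shift, shift)

    anchors = np.zeros((k, response_size, response_size, 4), dtype=np.float32)
    anchors[..., 0] = xs                                # centre x
    anchors[..., 1] = ys                                # centre y
    anchors[..., 2] = base[:, 2, None, None]            # width per ratio
    anchors[..., 3] = base[:, 3, None, None]            # height per ratio
    return anchors                                      # [k, 17, 17, 4]

anchors = generate_anchors()
print(anchors.shape)                                    # (5, 17, 17, 4)
```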
  • Step S200, configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box, specifically includes:
  • S201 Input the paired target template frame and search image frame into the feature extraction network, and output feature maps T2 and S2 through the separable-convolution feature extraction network;
  • S202 Input feature maps T2 and S2 into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
  • S203 Input the feature maps T3 and S3 into the RPN network for regression and classification respectively; wherein, the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • S204 The RPN network performs regression on the bounding box to obtain accurate position estimation.
  • Step S200 is the network model configuration stage, which specifically includes:
  • Step 1 feature extraction in the early stage: the overall structure of the network is shown in Figure 2.
  • the input is a target template frame of 98*98*3 and a search image frame of 354*354*3.
  • The first convolution kernel Conv_1 produces the feature maps T1 and S1 of sizes 96*96*28 and 352*352*28. Three depthwise-separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise-separable convolution layer is divided into a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution: the depthwise convolution kernel is set to 3*3*28, with one kernel responsible for only one channel, and the pointwise convolution kernel is set to 1*1*28 for information fusion in the channel direction. The stride of DwConv_1 and DwConv_2 is set to 1 with "same" padding, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding 1, so the output feature maps T2 and S2 are half the size of T1 and S1, namely 48*48*28 and 176*176*28.
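For illustration, a minimal PyTorch sketch of this stem (Conv_1 followed by DwConv_1..DwConv_3) is given below. The kernel sizes, strides, and channel counts follow the text; the BatchNorm and ReLU layers are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One depthwise-separable layer: a 3x3 per-channel (depthwise) convolution
    followed by a 1x1 pointwise convolution fusing the 28 channels."""
    def __init__(self, channels=28, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   stride=stride, padding=padding,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)   # normalisation is an assumption
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class Stem(nn.Module):
    """Conv_1 followed by DwConv_1..DwConv_3 with the strides stated above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 28, kernel_size=3, stride=1, padding=0)  # 98 -> 96
        self.dw1 = DepthwiseSeparableConv(28, stride=1, padding=1)          # "same" size
        self.dw2 = DepthwiseSeparableConv(28, stride=1, padding=1)
        self.dw3 = DepthwiseSeparableConv(28, stride=2, padding=1)          # halves size

    def forward(self, x):
        return self.dw3(self.dw2(self.dw1(self.conv1(x))))

stem = Stem()
template = torch.randn(1, 3, 98, 98)
print(stem(template).shape)    # torch.Size([1, 28, 48, 48]); a 354x354 input gives 176x176
```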
  • Step 2 multiplexing feature extraction: After obtaining the feature maps T2 and S2, input them to the multiplexing feature extraction module.
  • the multiplexing feature extraction module is configured with three layers, respectively recorded as MPConv_1, MPConv_2, and MPConv_3; each layer of multiplexing feature extraction module is composed of three inverted residual modules (the three inverted residual modules are respectively recorded as InvResidual_1, InvResidual_2, InvResidual_3) and two multiplexing modules (the two multiplexing modules are respectively marked as Multiplexing Block1 and Multiplexing Block12).
  • The inverted residual module consists of two point-by-point convolutions and one channel-by-channel convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes 1*1, 1*1, and 3*3 respectively; the output feature map size is unchanged, and the result is input into the multiplexing module (a sketch of this block is given below). The following two parts introduce the multiplexing modules.
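A minimal PyTorch sketch of such an inverted residual block is shown below. The Pw-Cw-Pw ordering follows the usual inverted-residual design, and the expansion factor of 2, the BatchNorm/ReLU6 layers, and the identity shortcut are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """PwConv_1 (1x1) -> CwConv_1 (3x3 depthwise) -> PwConv_2 (1x1),
    with a residual connection; the output size is unchanged."""
    def __init__(self, channels, expand=2):
        super().__init__()
        hidden = channels * expand          # expansion factor is an assumption
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),             # PwConv_1
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden,  # CwConv_1
                      bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),             # PwConv_2
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)    # identity shortcut keeps the size unchanged

x = torch.randn(1, 28, 48, 48)
print(InvertedResidual(28)(x).shape)   # torch.Size([1, 28, 48, 48])
```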
  • The reduction operation changes the channels of a C*H*W feature map to r^2 times the original and changes its width and height to 1/r of the original. The expansion operation is, in principle, the inverse of the reduction operation: it changes the channels to 1/r^2 times the original and the width and height to r times the original. After the reduction, each group of new feature maps passes through its own group convolution, and the inverse operation then restores them to the original feature-map shape for concatenation.
  • Channel multiplexing uses the expansion and reduction operations to share channel information into space and to disperse spatial information back into the channels, promoting the flow of information; group convolution is used to reduce the number of parameters.
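One way to realise this channel multiplexing idea is sketched below: the reduction is modelled as space-to-depth, the expansion as depth-to-space, with a grouped convolution in between. The ratio r=2, the group count, and the residual connection back to the input are assumptions, not details given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMultiplexing(nn.Module):
    """Reduction (space-to-depth: C*H*W -> r^2*C x H/r x W/r), cheap grouped
    convolution, then expansion (depth-to-space) back to the original shape."""
    def __init__(self, channels, r=2, groups=4):
        super().__init__()
        self.r = r
        self.group_conv = nn.Conv2d(channels * r * r, channels * r * r,
                                    kernel_size=3, padding=1,
                                    groups=groups, bias=False)

    def forward(self, x):
        y = F.pixel_unshuffle(x, self.r)   # reduction: channels x r^2, H,W / r
        y = self.group_conv(y)             # grouped conv keeps the parameter count low
        y = F.pixel_shuffle(y, self.r)     # expansion: back to C x H x W
        return x + y                       # combine with the original features (assumed)

x = torch.randn(1, 28, 48, 48)
print(ChannelMultiplexing(28)(x).shape)    # torch.Size([1, 28, 48, 48])
```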
  • In the spatial multiplexing module, L of the C channels are selected for computation each time and passed through a 1*1 convolution that reshapes the channel count; the module then restores the channels through another 1*1 convolution, while the remaining C-L channels are copied directly, which avoids computation on those C-L channels. To give every channel a chance to be computed, the channels are then rearranged, i.e., channel shuffle (Channel Shuffle).
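A minimal sketch of one way to realise this partial-computation idea (compute on L channels, copy the remaining C-L channels, then channel shuffle) is given below; the split ratio, the shuffle group count, and the hidden channel width are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Rearrange channels so every channel gets a chance to be selected later."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2)
    return x.reshape(n, c, h, w)

class SpatialMultiplexing(nn.Module):
    """Only the first L of the C channels pass through 1x1 convolutions; the
    remaining C-L channels are copied unchanged, then the channels are shuffled."""
    def __init__(self, channels, ratio=0.5, groups=4):
        super().__init__()
        self.L = int(channels * ratio)       # split ratio is an assumption
        self.reduce = nn.Conv2d(self.L, self.L, kernel_size=1, bias=False)
        self.restore = nn.Conv2d(self.L, self.L, kernel_size=1, bias=False)
        self.groups = groups

    def forward(self, x):
        selected, passthrough = x[:, :self.L], x[:, self.L:]
        selected = self.restore(self.reduce(selected))   # compute on L channels only
        out = torch.cat([selected, passthrough], dim=1)  # copy the remaining C-L channels
        return channel_shuffle(out, self.groups)         # let every channel be selected next time

x = torch.randn(1, 28, 48, 48)
print(SpatialMultiplexing(28)(x).shape)   # torch.Size([1, 28, 48, 48])
```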
  • Multiplexing Block1 is configured so that the output size remains unchanged, while Multiplexing Block2 is configured so that the output size is halved and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; and 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are respectively input into the RPN network for regression and classification.
  • the technical scheme of the invention proposes two lightweight modules, a space multiplexing module and a channel multiplexing module, and combines them into a new backbone network.
  • the backbone network is trained and fine-tuned from scratch, and finally applied to the Siamese network framework to achieve fast target tracking.
  • Step 3 RPN network: The structure of the RPN network is shown in FIG. 5 .
  • the role of the RPN network is to regress the bounding box to obtain an accurate position estimate.
  • the RPN network consists of two parts, one is the classification branch, which is used to distinguish the target from the background, and the other is the regression branch, which fine-tunes the candidate area.
  • The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3 respectively. The target template features are then passed through 3*3 convolution kernels to generate 4*4*(2k*224) and 4*4*(4k*224) features. Going from a 6*6 feature to a 4*4 feature through a 3*3 convolution kernel is straightforward; what needs attention is the increase of the channel count from 224 to 2k*224 and 4k*224. The channel count increases by a factor of 2k because k anchors are generated at every point of the feature map and each anchor is classified as foreground or background, so the classification branch grows by a factor of 2k; and, as described in the anchor generation step above, each anchor is described by four values, so the regression branch grows by a factor of 4k.
  • the search image also obtains two features through a 3*3 convolution kernel, and the number of feature channels remains unchanged here.
  • The 2k 4*4*224 template feature blocks obtained for the anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image to generate the classification-branch response map; similarly, for the regression branch, the generated response map is 17*17*4k, where each point represents a vector of size 4k, denoted dx, dy, dw, dh; these four values measure the deviation between the anchor and the ground-truth bounding box.
  • the formula for calculating the response map is as follows:
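The formula itself does not survive in this text. A reconstruction consistent with the description above (template features acting as the correlation kernel over the search features, in the style of SiamRPN; an assumption, not the original equation) would be:

$$A^{cls}_{17\times 17\times 2k} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls}, \qquad A^{reg}_{17\times 17\times 4k} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg}$$

where \(\varphi(z)\) and \(\varphi(x)\) are the template and search features and \(\star\) denotes the correlation (convolution) operation. A minimal PyTorch sketch of such a head under the sizes stated above (k anchors, 224 channels) follows; the layer names and the grouped-convolution realisation of the correlation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """3x3 convs turn the 6x6x224 template features into correlation kernels
    with 2k*224 / 4k*224 channels, the 22x22x224 search features keep 224
    channels, and a batched convolution realises the correlation that
    produces the 17x17 response maps."""
    def __init__(self, channels=224, k=5):
        super().__init__()
        self.k = k
        self.t_cls = nn.Conv2d(channels, 2 * k * channels, 3)   # 6x6 -> 4x4
        self.t_reg = nn.Conv2d(channels, 4 * k * channels, 3)
        self.s_cls = nn.Conv2d(channels, channels, 3)            # 22x22 -> 20x20
        self.s_reg = nn.Conv2d(channels, channels, 3)

    def correlate(self, search, kernel, out_mult):
        n, c, h, w = kernel.shape
        kernel = kernel.view(n * out_mult, c // out_mult, h, w)  # per-anchor kernels
        out = F.conv2d(search.view(1, -1, *search.shape[2:]), kernel,
                       groups=n)                       # batch-wise correlation
        return out.view(n, out_mult, *out.shape[2:])   # N x out_mult x 17 x 17

    def forward(self, template_feat, search_feat):
        cls = self.correlate(self.s_cls(search_feat), self.t_cls(template_feat), 2 * self.k)
        reg = self.correlate(self.s_reg(search_feat), self.t_reg(template_feat), 4 * self.k)
        return cls, reg

head = RPNHead()
t3 = torch.randn(1, 224, 6, 6)
s3 = torch.randn(1, 224, 22, 22)
cls, reg = head(t3, s3)
print(cls.shape, reg.shape)    # torch.Size([1, 10, 17, 17]) torch.Size([1, 20, 17, 17])
```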
  • step S300 is an offline training phase, which specifically includes:
  • Step 1 Selection of loss function: The loss function is divided into two parts, which are classification loss and regression loss, as follows:
  • the classification loss uses cross-entropy loss, as follows:
  • v is a single response value output by the network
  • y is the label
  • D is the generated response map
  • u is any position in the response map.
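The cross-entropy formula is not reproduced in this text. One plausible reconstruction consistent with the symbols defined above (a logistic cross-entropy averaged over the response map, as used in SiamFC-style trackers; an assumption, not the original equation):

$$\ell(y, v) = \log\left(1 + e^{-y\,v}\right), \qquad L_{cls} = \frac{1}{|D|} \sum_{u \in D} \ell\big(y[u],\, v[u]\big)$$

where \(y[u] \in \{+1, -1\}\) is the label at position \(u\) and \(v[u]\) is the corresponding response value output by the network.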
  • the regression loss uses Smooth L1 loss, and first standardizes the coordinates of the Anchor:
  • x, y, w, and h represent the coordinates of the center of the matrix and the width and height of the matrix
  • T and A represent the prediction frame and Anchor, respectively.
  • λ is an adjustable hyperparameter used to balance the two losses.
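The normalization and Smooth L1 formulas are likewise not reproduced in this text. The standard SiamRPN-style forms consistent with the symbols above (an assumed reconstruction, not the original equations) are:

$$\delta[0] = \frac{T_x - A_x}{A_w}, \quad \delta[1] = \frac{T_y - A_y}{A_h}, \quad \delta[2] = \ln\frac{T_w}{A_w}, \quad \delta[3] = \ln\frac{T_h}{A_h}$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}, \qquad L_{reg} = \sum_{i=0}^{3} \mathrm{smooth}_{L1}\big(\delta[i]\big), \qquad L = L_{cls} + \lambda\, L_{reg}$$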
  • Step 2 training settings: first train the network backbone on micro ImageNet (20 categories). At this stage the stride of the first convolutional layer Conv_1 is changed to 2 and the RPN network is removed; the input is a 224*224*3 image and the output is a 7*7*224 feature map. This feature map is connected to a three-layer fully connected neural network whose final output is a 20-dimensional vector representing the 20 categories. Training for 80-100 rounds on the 10,000 images gives good convergence. SGD is used as the gradient descent algorithm, and the learning rate drops by 0.001 every 5 rounds.
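For illustration, a hedged sketch of this stage-one classification pre-training is shown below. The hidden sizes of the fully connected head, the momentum, and the exact learning-rate schedule are assumptions; `backbone` stands for the Conv_1 + DwConv + MPConv stack described above (with Conv_1 stride 2 and the RPN removed), and `loader` is an assumed Tiny-ImageNet data loader.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Three-layer fully connected head mapping the 7x7x224 feature map to 20 logits."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 224, 1024), nn.ReLU(inplace=True),   # hidden sizes assumed
            nn.Linear(1024, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, feat):
        return self.fc(feat)

def pretrain(backbone, head, loader, epochs=100, lr=0.01):
    params = list(backbone.parameters()) + list(head.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    # "learning rate drops every 5 rounds": modelled here with StepLR (assumed form)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:            # 224x224x3 classification images
            logits = head(backbone(images))
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```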
  • step 400 is an online target tracking stage, which specifically includes:
  • One-shot tracking takes the tracking task as a one-shot training detection task. That is, a neural network is first learned, and after the learning is completed, the convolution kernel parameters of the convolution operation are obtained through the initial frame learning in the tracking phase, so as to obtain a specialized network, and then the subsequent frames are tracked.
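For illustration, a hedged sketch of this one-shot tracking loop is given below. `backbone`, `rpn_head`, and `select_best_box` are assumed helpers corresponding to the feature extractor, the RPN head, and the step that picks the highest-scoring anchor and applies its regression offsets; none of these names come from the original text.

```python
import torch

def track(video_frames, first_frame_template, backbone, rpn_head, select_best_box):
    """One-shot step: compute the template features once from the target marked
    in the first frame (the 'specialized' part of the network), then treat every
    subsequent frame as a search image and regress a bounding box for it."""
    with torch.no_grad():
        template_feat = backbone(first_frame_template)   # e.g. 1x224x6x6

        boxes = []
        for frame in video_frames:                        # each frame = search image
            search_feat = backbone(frame)                 # e.g. 1x224x22x22
            cls_map, reg_map = rpn_head(template_feat, search_feat)
            boxes.append(select_best_box(cls_map, reg_map))
        return boxes
```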
  • a real-time target tracking device including:
  • the data selection module 100 selects the image data required for training, and enhances the search image frame to prevent network overfitting.
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • the network configuration module 200 configures the feature extraction network and the RPN network, inputs the paired search images and the target template frame into the feature extraction network, which outputs two feature maps, and inputs the two feature maps into the RPN network, which regresses the bounding box;
  • the network training module 300 removes the RPN network and performs the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, the RPN network is added after the backbone network and the second-stage training is performed on the target detection dataset;
  • the target tracking module 400, for an unknown video stream, marks the target to be tracked in the first frame of the video; the backbone network uses it as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • the problem solved by the present invention is to realize tracking of targets in video based on a Siamese network; and, addressing the large memory usage and slow inference speed common to current Siamese networks, a lightweight network is designed so that fast object tracking is achieved without losing too much accuracy.
  • the network configuration module 200 includes:
  • the first feature output unit is used to input paired target template frames and search image frames into the feature extraction network and to output feature maps T2 and S2 through the separable-convolution feature extraction network;
  • the second feature output unit is used to input feature maps T2 and S2 into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
  • a feature input unit is used to input feature maps T3 and S3 into the RPN network for regression and classification respectively; wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • the feature positioning unit, in which the RPN network regresses the bounding box to obtain an accurate position estimate.
  • the network configuration module 200 handles the network model configuration stage, which specifically includes:
  • Step 1 feature extraction in the early stage: the overall structure of the network is shown in Figure 2.
  • the input is a target template frame of 98*98*3 and a search image frame of 354*354*3.
  • The first convolution kernel Conv_1 produces the feature maps T1 and S1 of sizes 96*96*28 and 352*352*28. Three depthwise-separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise-separable convolution layer is divided into a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution: the depthwise convolution kernel is set to 3*3*28, with one kernel responsible for only one channel, and the pointwise convolution kernel is set to 1*1*28 for information fusion in the channel direction. The stride of DwConv_1 and DwConv_2 is set to 1 with "same" padding, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding 1, so the output feature maps T2 and S2 are half the size of T1 and S1, namely 48*48*28 and 176*176*28.
  • Step 2 multiplexing feature extraction: After obtaining the feature maps T2 and S2, input them to the multiplexing feature extraction module.
  • The multiplexing feature extraction module is configured with three layers, denoted MPConv_1, MPConv_2, and MPConv_3; each multiplexing feature extraction layer consists of three inverted residual modules (denoted InvResidual_1, InvResidual_2, InvResidual_3) and two multiplexing modules (denoted Multiplexing Block1 and Multiplexing Block2).
  • The inverted residual module consists of two point-by-point convolutions and one channel-by-channel convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes 1*1, 1*1, and 3*3 respectively; the output feature map size is unchanged, and the result is input into the multiplexing module. The following two parts introduce the multiplexing modules.
  • The reduction operation changes the channels of a C*H*W feature map to r^2 times the original and changes its width and height to 1/r of the original. The expansion operation is, in principle, the inverse of the reduction operation: it changes the channels to 1/r^2 times the original and the width and height to r times the original. After the reduction, each group of new feature maps passes through its own group convolution, and the inverse operation then restores them to the original feature-map shape for concatenation.
  • Channel multiplexing uses the expansion and reduction operations to share channel information into space and to disperse spatial information back into the channels, promoting the flow of information; group convolution is used to reduce the number of parameters.
  • In the spatial multiplexing module, L of the C channels are selected for computation each time and passed through a 1*1 convolution that reshapes the channel count; the module then restores the channels through another 1*1 convolution, while the remaining C-L channels are copied directly, which avoids computation on those C-L channels. To give every channel a chance to be computed, the channels are then rearranged, i.e., channel shuffle (Channel Shuffle).
  • Multiplexing Block1 is configured so that the output size remains unchanged, while Multiplexing Block2 is configured so that the output size is halved and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; and 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are respectively input into the RPN network for regression and classification.
  • the technical scheme of the invention proposes two lightweight modules, a space multiplexing module and a channel multiplexing module, and combines them into a new backbone network.
  • the backbone network is trained and fine-tuned from scratch, and finally applied to the Siamese network framework to achieve fast target tracking.
  • Step 3 RPN network: The structure of the RPN network is shown in FIG. 5 .
  • the role of the RPN network is to regress the bounding box to obtain an accurate position estimate.
  • the RPN network consists of two parts, one is the classification branch, which is used to distinguish the target from the background, and the other is the regression branch, which fine-tunes the candidate area.
  • The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3 respectively. The target template features are then passed through 3*3 convolution kernels to generate 4*4*(2k*224) and 4*4*(4k*224) features. Going from a 6*6 feature to a 4*4 feature through a 3*3 convolution kernel is straightforward; what needs attention is the increase of the channel count from 224 to 2k*224 and 4k*224. The channel count increases by a factor of 2k because k anchors are generated at every point of the feature map and each anchor is classified as foreground or background, so the classification branch grows by a factor of 2k; and, as described in the anchor generation step (the third step of the data preparation stage S100), each anchor is described by four values, so the regression branch grows by a factor of 4k.
  • the search image also obtains two features through a 3*3 convolution kernel, and the number of feature channels remains unchanged here.
  • The 2k 4*4*224 template feature blocks obtained for the anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image to generate the classification-branch response map; similarly, for the regression branch, the generated response map is 17*17*4k, where each point represents a vector of size 4k, denoted dx, dy, dw, and dh.
  • The above four values measure the deviation between the anchor and the ground-truth bounding box.
  • the formula for calculating the response map is the same as given above.
  • the multiplexing module includes:
  • the channel multiplexing module is used to share the information of the channel to the space through expansion and reduction operations, and disperse the information of the space to the channel, which promotes the flow of information, and uses group convolution to reduce the amount of parameters;
  • the spatial multiplexing module is used to reshape the channels back through convolution and directly copy the remaining channels.
  • the technical scheme of the invention proposes two lightweight modules, a space multiplexing module and a channel multiplexing module, and combines them into a new backbone network.
  • the backbone network is trained and fine-tuned from scratch, and finally applied to the Siamese network framework to achieve fast target tracking.
  • the spatial multiplexing module is used to fuse the feature map through two operations of expansion and reduction, which reduces the parameters while ensuring the accuracy.
  • the channel multiplexing module is adopted, and the calculation efficiency is improved through the operation of channel shuffling and partial selection.
  • the present invention can realize fast single target tracking by introducing two self-developed modules, channel multiplexing and space multiplexing.
  • Results on the VOT2018 dataset show that, compared with the Siamese network based on ResNet50, the memory footprint of the present invention is 43MB, about one-fifth of the former; the inference speed on an RTX 2080Ti graphics card is 83 FPS, 3.3 times the former's 25 FPS; and the accuracy of the method of the present invention drops by only 3% relative to the former, which is negligible in actual use.
  • This embodiment provides a computer-readable storage medium; the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the real-time target tracking method of the above embodiment.

Abstract

The present invention relates to the field of artificial intelligence, and specifically relates to a real-time target tracking method, a device, and a storage medium. The method comprises: selecting image data required for training, and enhancing a search image frame; configuring a feature extraction network and an RPN network, and performing regression on a bounding box by the RPN network; removing the RPN network, carrying out stage one training of a backbone network, adding an RPN network posterior to the backbone network, and carrying out stage two training; for an unknown video stream, marking a target to be tracked, and performing training to obtain a specialized network; taking each frame of the video stream as a search image frame, inputting the search image frames into the specialized network, performing regression to output bounding boxes, and completing target tracking. The present invention implements tracking of a target in a video on the basis of a siamese network; also, with respect to the problems of large amounts of memory being used and low inference speed that are pervasive in existing siamese networks, rapid target tracking is achieved without an excessive loss of precision by means of a network trained in the present invention.

Description

A real-time target tracking method, device and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a real-time target tracking method, device and storage medium.
Background Art
Object tracking has broad application prospects in many fields such as human-machine interfaces, intelligent monitoring, virtual reality, and motion analysis, and has important research value in science and engineering. Owing to the establishment of new benchmark object tracking datasets and the provision of standardized benchmarking platforms since 2013, object tracking has developed rapidly in the past decade, and many effective tracking algorithms have been proposed. Bolme et al. pioneered the introduction of the convolution theorem from signal processing into visual tracking and converted the object template matching problem into a correlation operation in the frequency domain. In this way, not only is the running speed of the Correlation Filters (CF) tracker improved, but the accuracy can also be improved by using appropriate features. Since then, correlation filters became a research hotspot in the tracking field, and many related target tracking methods have been proposed, such as combining multi-resolution feature maps to reduce the influence of periodic boundaries and improving tracking performance by optimizing the loss.
With the rise of deep learning in computer vision, the tracking field is currently adopting data-driven learning methods. Nine of the ten top-performing trackers on VOT17 rely on deep features and outperform previous state-of-the-art trackers. Among them, the Siamese neural network tracks through a similarity-comparison strategy, and its simple architecture can achieve very fast running speeds. Bertinetto et al. adopted an architecture that is fully convolutional with respect to the search image (Fully-Convolutional Siamese Networks, SiamFC) to estimate the feature similarity between regions of two frames, achieving the best performance on multiple test datasets.
The early Siamese trackers are represented by the SINT algorithm, which is based on the idea of similarity learning, divides the network into a query branch and a search branch, and uses a matching function to find suitable candidate regions, but its tracking speed is too slow, only 2 fps. The GOTURN algorithm can reach 100 frames per second on a single GPU through a deep regression network, but its robustness is poor. FCNT and CREST, like the above algorithms, focus on exploring the tracking ability of Siamese networks. Bertinetto et al. proposed SiamFC, a lightweight Siamese network structure that extracts target features and search-region features separately and then correlates them; the target bounding box is determined from the position of the maximum of the response map. Training uses videos provided by the ILSVRC dataset; after offline training, the parameters are not updated during tracking, and the algorithm achieves good results in both accuracy and speed. The subsequently proposed correlation filter network CFNet embeds correlation filters into the network branches, treats the filter as a neural network layer, and derives the forward and backward propagation formulas to achieve end-to-end training, while the speed remains real-time on a GPU. Using this as the baseline network, a large number of Siamese network algorithms have been proposed by researchers, such as SiamRPN, SiamRPN++, SiamMask, and SiamAttn. Bo Li proposed adding the RPN network from target detection to the tracking network, adopting target localization techniques similar to detection algorithms and using coordinate regression to make the tracking results more accurate while avoiding multi-scale search. Later, Bo Li proposed SiamRPN++, which introduced a deep backbone network into the tracking network and greatly improved detection accuracy. Many subsequent algorithms improved on SiamRPN++, including adding a mask branch to obtain masks, using internally cropped residual units together with wider networks, and exploiting deeper networks through multi-level fusion.
Although Siamese tracking networks have made great progress, they still face severe memory constraints in real-world applications. For example, the memory footprint of SiamRPN++ reaches 206 MB, which makes it difficult to deploy on mobile devices such as drones. How to reduce its parameter count without losing much accuracy has become an urgent problem to be solved.
Summary of the Invention
Embodiments of the present invention provide a real-time target tracking method, device, and storage medium, which realize real-time tracking of targets in videos based on a Siamese network, so that fast target tracking can be achieved without losing too much accuracy.
According to an embodiment of the present invention, a real-time target tracking method is provided, comprising the following steps:
selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame;
configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box;
removing the RPN network and performing the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, adding the RPN network after the backbone network and performing the second-stage training on the target detection dataset;
for an unknown video stream, marking the target to be tracked in the first frame of the video, which the backbone network uses as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
Further, selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame, specifically comprises:
selecting two image datasets, namely the micro-ImageNet dataset and the COCO2017 dataset; the micro-ImageNet dataset consists of several pictures of a preset resolution, each annotated with a single category, and is used for pre-training the backbone network; the COCO2017 dataset is a target detection dataset consisting of several pictures of a preset resolution, each annotated with multiple categories and bounding-box locations;
performing data augmentation on the image datasets, including random flipping, blurring, and shifting of the images;
generating anchors for target localization on the search image frame.
Further, configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box, specifically comprises:
inputting the paired target template frame and search image frame into the feature extraction network, and outputting feature maps T2 and S2 through the separable-convolution feature extraction network;
obtaining feature maps T2 and S2 and inputting them into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
inputting feature maps T3 and S3 into the RPN network for regression and classification respectively; the multiplexing feature extraction module consists of three inverted residual modules and two multiplexing modules;
the RPN network regresses the bounding boxes to obtain accurate position estimates.
Further, the multiplexing modules include:
a channel multiplexing module, which shares channel information into space and disperses spatial information back into the channels through expansion and reduction operations, promoting the flow of information, while using group convolution to reduce the number of parameters;
a spatial multiplexing module, which reshapes the channels back through convolution and directly copies the remaining channels.
Further, the RPN network includes two parts, a classification branch and a regression branch;
the classification branch is used to distinguish the target from the background;
the regression branch is used to fine-tune the candidate regions.
根据本发明的另一实施例,提供了一种实时目标跟踪装置,包括:According to another embodiment of the present invention, a real-time target tracking device is provided, including:
数据选取模块,选取训练所需的图像数据,对搜索图像帧进行增强,以来防止网络过拟合,搜索图像帧为标注有包围框的图像数据,包围框单独截取部 分为目标模板帧;The data selection module selects the image data required for training and enhances the search image frame to prevent network overfitting. The search image frame is the image data marked with a bounding box, and the separately intercepted part of the bounding box is the target template frame;
网络配置模块,配置特征提取网络及RPN网络,将成对的搜索图像及目标模板帧输入特征提取网络,特征提取网络输出两张特征图,将两张特征图输入RPN网络,RPN网络对包围框进行回归;Network configuration module, configure the feature extraction network and RPN network, input the paired search image and target template frame into the feature extraction network, the feature extraction network outputs two feature maps, input the two feature maps into the RPN network, and the RPN network performs the bounding box return;
网络训练模块,去掉RPN网络,在分类图像数据集上进行骨干网络的第一阶段训练,分类数据图像数据集收敛后,在骨干网络后加上RPN网络,在目标检测数据集上进行第二阶段训练;In the network training module, the RPN network is removed, and the first stage training of the backbone network is performed on the classified image dataset. After the classification data image dataset converges, the RPN network is added after the backbone network, and the second stage is performed on the target detection dataset. train;
目标跟踪模块,对未知的视频流,在视频的第一帧图像时,将需要跟踪的目标标出,骨干网络会将其作为目标模板帧进行一次训练,以得到一个特化网络;将视频流的每一帧作为搜索图像帧,输入到特化网络中,回归出包围框,完成目标跟踪。The target tracking module, for an unknown video stream, marks the target to be tracked in the first frame of the video, and the backbone network will use it as a target template frame for training to obtain a specialized network; the video stream Each frame of is used as a search image frame, which is input into the specialized network, and the bounding box is regressed to complete the target tracking.
Further, the network configuration module includes:
a first feature output unit, configured to input paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
a second feature output unit, configured to obtain the feature maps T2 and S2 and input them into the multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
a feature input unit, configured to input the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
a feature positioning unit, in which the RPN network performs regression on the bounding box to obtain an accurate position estimate.
Further, the multiplexing module includes:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
According to another embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in any one of the above real-time target tracking methods.
In the real-time target tracking method, device, and storage medium of the embodiments of the present invention, the method includes: selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame; configuring a feature extraction network and an RPN network, inputting paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box; removing the RPN network, performing first-stage training of the backbone network on a classification image dataset, and, after training on the classification image dataset converges, appending the RPN network to the backbone network and performing second-stage training on an object detection dataset; and, for an unknown video stream, marking the target to be tracked in the first frame of the video, the backbone network using this frame as the target template frame for a single round of training to obtain a specialized network, then feeding each frame of the video stream into the specialized network as a search image frame and regressing the bounding box to complete target tracking. The present invention tracks targets in video based on a Siamese network and, addressing the large memory footprint and slow inference speed common to current Siamese networks, the network trained by the present invention achieves fast target tracking without losing too much accuracy.
Description of the drawings
FIG. 1 is a flowchart of the real-time target tracking method of the present invention;
FIG. 2 is a diagram of the network structure of the present invention;
FIG. 3 is a diagram of the spatial multiplexing module of the present invention;
FIG. 4 is a diagram of the channel multiplexing module of the present invention;
FIG. 5 is a structural diagram of the RPN network of the present invention;
FIG. 6 is a schematic diagram of the real-time target tracking device of the present invention.
Detailed description
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Referring to FIG. 1, according to an embodiment of the present invention, a real-time target tracking method is provided, including the following steps:
S100: selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame;
S200: configuring a feature extraction network and an RPN network, inputting paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps; the two feature maps are input into the RPN network, and the RPN network performs regression on the bounding box;
S300: removing the RPN network, performing first-stage training of the backbone network on a classification image dataset, and, after training on the classification image dataset converges, appending the RPN network to the backbone network and performing second-stage training on an object detection dataset;
S400: for an unknown video stream, marking the target to be tracked in the first frame of the video; the backbone network uses this frame as the target template frame for a single round of training to obtain a specialized network; each frame of the video stream is then fed into the specialized network as a search image frame, and the bounding box is regressed to complete target tracking.
The problem solved by the present invention is to track targets in video based on a Siamese network and, addressing the large memory footprint and slow inference speed common to current Siamese networks, to design a lightweight network that achieves fast target tracking without losing too much accuracy.
Specifically, the present invention trains and fine-tunes the backbone network from scratch and finally applies it within the Siamese network framework to achieve fast target tracking. The method of the present invention comprises four stages:
Data preparation stage: training is divided into a backbone network pre-training stage and a regression sub-network training stage. Two datasets are prepared, used respectively for the classification-task training of the backbone network and the localization-task training of the regression sub-network. For localization training, unlike an ordinary object detection task, the target crop is used as the target template frame and is fed into the network in a pair with the search image. During training, the search images are augmented with operations such as flipping, rotation, and shifting to prevent the network from overfitting.
Network configuration stage: the network can be divided into two parts, the feature extraction network and the RPN network. The feature extraction network contains the aforementioned multiplexing modules. The inputs are paired target template frames and search image frames, which share the same feature extraction network. Two feature maps are finally output and serve as the input of the RPN network. The RPN network is used to regress a precise bounding box: convolving the feature map obtained from the target template frame with the feature map obtained from the search image frame yields a heat response map, in which the locations with large response values are where the target is most likely to appear.
Network training stage: training uses the sum of the regression loss and the classification loss as the loss function; the classification loss uses the cross-entropy loss, and the regression loss uses the Smooth L1 loss. First, the RPN network is removed and the backbone network is trained on the classification dataset. After convergence, the RPN network is appended to the backbone network and the second stage of training is performed on the object detection dataset. SGD is used as the gradient descent algorithm in both stages.
Online tracking stage: online tracking adopts a one-shot training approach. For an unknown video, the target to be tracked is marked in the first frame, and the network is trained once with this frame as the target template frame, yielding a specialized network. Afterwards, each frame of the video stream is fed into the network as a search image frame, and the bounding box is regressed, completing the tracking.
By adopting lightweight modules and the backbone network composed of them, the present invention improves the inference speed of the target tracking network and reduces its memory footprint while maintaining accuracy. More details are given in the following description.
In an embodiment, step S100, selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame, is specifically:
S101: selecting two image datasets, a mini ImageNet dataset and the COCO2017 dataset; the mini ImageNet dataset consists of a number of images at a preset resolution, each annotated with a single category, and is used to pre-train the backbone network; the COCO2017 dataset is an object detection dataset consisting of a number of images at a preset resolution, each annotated with multiple categories and bounding box positions;
S102: performing data augmentation on the image datasets, the data augmentation including random flipping, blurring, and shifting of the images;
S103: generating anchors to localize the target on the search image frames.
Step S100 is the data preparation stage and specifically includes:
Step 1, data selection: training is divided into two stages, backbone network pre-training and regression sub-network training. The specific training configuration is described in detail in step S300. Two datasets are used in the two training stages, mini ImageNet (20 classes) and COCO2017. Mini ImageNet is a subset of ImageNet; 10,000 images at 224*224 resolution are selected, each annotated with a single category, and used to pre-train the backbone network. COCO2017 is an object detection dataset; 5,000 images are selected, each annotated with multiple categories and bounding box positions. When training the Siamese network, the annotated bounding box is cropped separately as the target template frame, the whole image is used as the search image frame, and the two are fed into the network as a pair.
Step 2, data augmentation: before training, the data needs to be expanded, i.e., augmented, by random flipping, blurring, and shifting of the images. Note that in the Siamese neural network only the search images need to be augmented; the target templates do not.
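As a concrete illustration, the following is a minimal sketch of the search-frame augmentation described above, assuming NumPy image arrays; the helper name, the shift range, and the blur kernel are illustrative assumptions rather than values given in this disclosure.

```python
# A minimal augmentation sketch for search image frames, assuming (H, W, 3) numpy arrays.
import numpy as np

def augment_search_frame(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly flip, blur, and shift a search image frame.

    Only the search frame is augmented; the target template frame is left untouched,
    as described in the data preparation stage.
    """
    out = img.astype(np.float32)
    if rng.random() < 0.5:                      # random horizontal flip
        out = out[:, ::-1, :]
    if rng.random() < 0.5:                      # crude 3x3 box blur (illustrative)
        out = (out
               + np.roll(out, 1, axis=0) + np.roll(out, -1, axis=0)
               + np.roll(out, 1, axis=1) + np.roll(out, -1, axis=1)) / 5.0
    if rng.random() < 0.5:                      # random shift of up to +/-8 pixels (assumed range)
        dy, dx = rng.integers(-8, 9, size=2)
        out = np.roll(out, (dy, dx), axis=(0, 1))
    return out.astype(img.dtype)

rng = np.random.default_rng(0)
search = np.zeros((354, 354, 3), dtype=np.uint8)   # search frame size used in this disclosure
augmented = augment_search_frame(search, rng)
```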
Step 3, anchor generation: locating the target on the search image relies on anchors. The anchors are generated as follows: base anchors are obtained first, stored in an array of dimensions [5, 4], where 4 is the length of the per-anchor vector [center x, center y, width, height] and 5 is the number of anchor expansion scales, set to [0.5, 0.66, 1, 1.5, 2].
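A minimal sketch of base-anchor generation and tiling with the five expansion scales listed above is given below; the base anchor size, the tiling stride, and the function names are illustrative assumptions, since only the scale list and the [5, 4] layout are specified here.

```python
# A minimal anchor-generation sketch: a square base anchor scaled by the five expansion
# scales, then tiled over every position of the response map.
import numpy as np

def generate_base_anchors(base_size: float = 64.0,
                          scales=(0.5, 0.66, 1.0, 1.5, 2.0)) -> np.ndarray:
    """Return a [5, 4] array of base anchors [cx, cy, w, h], one row per scale."""
    anchors = np.zeros((len(scales), 4), dtype=np.float32)
    for i, s in enumerate(scales):
        anchors[i] = [0.0, 0.0, base_size * s, base_size * s]
    return anchors

def tile_anchors(base: np.ndarray, feat_size: int, stride: int) -> np.ndarray:
    """Tile the base anchors over every position of a feat_size x feat_size response map."""
    offsets = (np.arange(feat_size) - feat_size // 2) * stride
    grid_y, grid_x = np.meshgrid(offsets, offsets, indexing="ij")
    tiled = np.zeros((feat_size, feat_size, len(base), 4), dtype=np.float32)
    tiled[..., 0] = grid_x[..., None] + base[:, 0]      # anchor center x per position
    tiled[..., 1] = grid_y[..., None] + base[:, 1]      # anchor center y per position
    tiled[..., 2] = base[:, 2]                          # width per scale
    tiled[..., 3] = base[:, 3]                          # height per scale
    return tiled

base = generate_base_anchors()
all_anchors = tile_anchors(base, feat_size=17, stride=16)   # 17x17 response map as in the RPN section
print(base.shape, all_anchors.shape)                         # (5, 4) (17, 17, 5, 4)
```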
In an embodiment, step S200, configuring the feature extraction network and the RPN network, inputting paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box, is specifically:
S201: inputting paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
S202: obtaining the feature maps T2 and S2 and inputting them into the multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
S203: inputting the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
S204: the RPN network performs regression on the bounding box to obtain an accurate position estimate.
Step S200 is the network model configuration stage and specifically includes:
Step 1, early feature extraction: the overall structure of the network is shown in FIG. 2. The inputs are a 98*98*3 target template frame and a 354*354*3 search image frame. They first pass through a shared 3*3*28 convolution Conv_1, yielding the 96*96*28 and 352*352*28 feature maps T1 and S1. Three depthwise separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise separable convolution layer consists of a channel-wise convolution and a point-wise convolution: the channel-wise convolution kernel is set to 3*3*28, with one kernel responsible for one channel, and the point-wise convolution kernel is set to 1*1*28 and fuses information along the channel dimension. The stride of DwConv_1 and DwConv_2 is set to 1 with padding set to same, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding set to 1, and its output feature maps T2 and S2 are half the size of T1 and S1, i.e., 48*48*28 and 176*176*28.
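The following PyTorch sketch reproduces Conv_1 and the three depthwise separable layers with the strides, paddings, and output sizes stated above; the activation function and other implementation details are assumptions, and the sketch is not the code of this disclosure.

```python
# A minimal sketch of the early feature-extraction stage (Conv_1 plus DwConv_1..3).
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, channels: int, stride: int, padding):
        super().__init__()
        # channel-wise (depthwise) 3x3 convolution: one kernel per channel
        self.depthwise = nn.Conv2d(channels, channels, 3, stride=stride,
                                   padding=padding, groups=channels)
        # point-wise 1x1 convolution: fuses information across channels
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.ReLU(inplace=True)            # activation choice is an assumption

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class EarlyFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 28, 3)                        # shared 3*3*28 convolution
        self.dwconv_1 = DepthwiseSeparable(28, stride=1, padding="same")
        self.dwconv_2 = DepthwiseSeparable(28, stride=1, padding="same")
        self.dwconv_3 = DepthwiseSeparable(28, stride=2, padding=1)

    def forward(self, x):
        x = self.conv_1(x)
        x = self.dwconv_1(x)
        x = self.dwconv_2(x)
        return self.dwconv_3(x)

net = EarlyFeatureExtractor()
template = torch.zeros(1, 3, 98, 98)
search = torch.zeros(1, 3, 354, 354)
print(net(template).shape, net(search).shape)   # [1, 28, 48, 48] and [1, 28, 176, 176]
```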
Step 2, multiplexed feature extraction: after the feature maps T2 and S2 are obtained, they are input into the multiplexing feature extraction module. The multiplexing feature extraction module is configured with three layers, denoted MPConv_1, MPConv_2, and MPConv_3. Each multiplexing feature extraction layer is in turn composed of three inverted residual modules (denoted InvResidual_1, InvResidual_2, and InvResidual_3) and two multiplexing modules (denoted Multiplexing Block1 and Multiplexing Block2). An inverted residual module consists of two point-wise convolutions and one channel-wise convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes of 1*1, 1*1, and 3*3, respectively; the size of the output feature map is unchanged, and the result is input into the multiplexing module. The multiplexing module is introduced in two parts below.
A Multiplexing Block consists of two parts, called the channel multiplexing module and the spatial multiplexing module. The spatial multiplexing module is shown in FIG. 3: a feature map with C channels is split into groups of C1, C2, and C3 channels, where C = C1 + C2 + C3. Two operations are defined, a reduction operation and an expansion operation, together with an operation factor r. The reduction operation multiplies the number of channels of a C*H*W feature map by r^2 and divides its width and height by r; the expansion operation is in principle the inverse of the reduction operation, dividing the number of channels by r^2 and multiplying the width and height by r. Each group then passes through its own group convolution to obtain a new feature map, after which the inverse operation is applied to restore the original shape and the groups are concatenated. Through the expansion and reduction operations, channel information is shared into the spatial dimension and spatial information is redistributed back into the channels, promoting the flow of information, while the use of group convolution reduces the parameter count.
For the channel multiplexing operation, as shown in FIG. 4, for a given feature map the invention selects L channels at a time for computation: they pass through a 1*1 convolution that reshapes the number of channels, then through the spatial multiplexing module, and then through another 1*1 convolution that restores the channels, while the remaining C-L channels are copied directly, so the computation on those C-L channels is avoided. To give every channel a chance of being computed, the channels are then rearranged, i.e., channel shuffle is applied. In this embodiment, this operation is repeated once and combined with the preceding spatial multiplexing module to form a new module, the multiplexed convolution module.
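A minimal sketch of this channel-multiplexing idea is given below: only L of the C channels are processed, the rest are copied, and a channel shuffle rearranges the channels so that all of them eventually take a turn. The pass through the spatial multiplexing module between the two 1*1 convolutions is omitted here, and the channel split and group count are illustrative assumptions.

```python
# A minimal sketch of channel multiplexing with channel shuffle.
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)     # split channels into groups
    x = x.transpose(1, 2).contiguous()           # interleave the groups
    return x.view(n, c, h, w)

class ChannelMultiplexing(nn.Module):
    def __init__(self, channels: int, selected: int, groups: int = 2):
        super().__init__()
        assert selected <= channels and channels % groups == 0
        self.selected = selected
        self.groups = groups
        self.reshape_in = nn.Conv2d(selected, selected, 1)    # 1x1 conv on the selected channels
        self.reshape_out = nn.Conv2d(selected, selected, 1)   # 1x1 conv restoring the channels

    def forward(self, x):
        active, passive = x[:, :self.selected], x[:, self.selected:]
        active = self.reshape_out(self.reshape_in(active))    # only L channels are computed
        out = torch.cat([active, passive], dim=1)             # the remaining C-L channels are copied
        return channel_shuffle(out, self.groups)              # rearrange so all channels take turns

block = ChannelMultiplexing(channels=56, selected=28)
y = block(torch.zeros(1, 56, 88, 88))
print(y.shape)   # torch.Size([1, 56, 88, 88])
```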
Multiplexing Block1 is configured so that the output size is unchanged; Multiplexing Block2 is configured so that the output size becomes half of the input and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are input into the RPN network for regression and classification, respectively.
The technical solution of the present invention proposes two lightweight modules, the spatial multiplexing module and the channel multiplexing module, and combines them into a new backbone network. The backbone network is trained and fine-tuned from scratch and is finally applied within the Siamese network framework to achieve fast target tracking.
Step 3, RPN network: the structure of the RPN network is shown in FIG. 5. The role of the RPN network is to perform regression on the bounding box and obtain an accurate position estimate. The RPN network consists of two parts: a classification branch, used to distinguish the target from the background, and a regression branch, which fine-tunes the candidate regions.
The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3, respectively. The target template features then pass through 3*3 convolution kernels to produce 4*4*(2k*224) and 4*4*(4k*224) features; obtaining a 4*4 feature size from the 6*6 input through a 3*3 kernel is straightforward, but note that the number of channels rises from 224 to 2k*224 and 4k*224. The channel count rises by a factor of 2k because k anchors are generated at each point of the feature map and each anchor can be classified as foreground or background, so the classification branch rises by a factor of 2k; as noted in the anchor generation step (step 3 of the data preparation stage), each anchor is described by four values, so the regression branch rises by a factor of 4k. The search image likewise passes through 3*3 convolution kernels to obtain two features, with the number of feature channels unchanged. For the classification branch, the 4*4*224 features of the 2k template anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image, producing the classification-branch response map; the regression branch is handled similarly, producing a 17*17*4k response map in which each point represents a vector of size 4k, denoted dx, dy, dw, dh; these four values measure the deviation of the anchor from the ground-truth bounding box. The response maps are computed as follows:
A_cls(17*17*2k) = [S3]_cls ⋆ [T3]_cls
A_reg(17*17*4k) = [S3]_reg ⋆ [T3]_reg
where [T3]_cls and [T3]_reg denote the features derived from the target template feature T3 and used as the convolution kernels, [S3]_cls and [S3]_reg denote the features derived from the search image feature S3, and ⋆ denotes the convolution operation.
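The cross-correlation step can be sketched as follows: the template features act as convolution kernels applied to the search features, producing the classification-branch response map; reshaping the template features into per-anchor kernels is an implementation assumption.

```python
# A minimal sketch of the RPN response-map computation.
import torch
import torch.nn.functional as F

def rpn_response(template_feat: torch.Tensor, search_feat: torch.Tensor, k: int) -> torch.Tensor:
    """template_feat: (1, 2k*224, 4, 4) kernel features from T3 (classification branch).
    search_feat: (1, 224, 20, 20) features from S3.
    Returns the (1, 2k, 17, 17) classification response map."""
    kernels = template_feat.view(2 * k, 224, 4, 4)        # one 224-channel kernel per output map
    return F.conv2d(search_feat, kernels)                  # 20 - 4 + 1 = 17

k = 5
template_cls = torch.zeros(1, 2 * k * 224, 4, 4)
search_cls = torch.zeros(1, 224, 20, 20)
print(rpn_response(template_cls, search_cls, k).shape)     # torch.Size([1, 10, 17, 17])
```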
In an embodiment, step S300 is the offline training stage and specifically includes:
Step 1, selection of the loss function: the loss function consists of two parts, the classification loss and the regression loss:
Loss = L_cls + λ·L_reg
The classification loss uses the cross-entropy loss:
μ(y, v) = log(1 + exp(-y·v))
L_cls = (1/|D|) · Σ_{u∈D} μ(y[u], v[u])
where v is a single response value output by the network, y is the label, D is the generated response map, and u is any position in the response map. The regression loss uses the Smooth L1 loss; the anchor coordinates are first normalized:
δ[0] = (T_x - A_x) / A_w,  δ[1] = (T_y - A_y) / A_h
δ[2] = ln(T_w / A_w),  δ[3] = ln(T_h / A_h)
where x, y, w, and h denote the center coordinates, width, and height of a box, and T and A denote the predicted box and the anchor, respectively.
The Smooth L1 loss function is:
smooth_L1(x, σ) = 0.5·σ²·x² if |x| < 1/σ², and |x| - 1/(2σ²) otherwise
L_reg = Σ_{i=0}^{3} smooth_L1(δ[i], σ)
where λ is an adjustable hyperparameter used to balance the two losses.
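A minimal sketch of the combined loss defined above follows; the σ value, the λ default, and the tensor shapes are illustrative assumptions.

```python
# A minimal sketch of the classification + regression loss.
import torch

def classification_loss(v: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the response map D: mean of log(1 + exp(-y*v)), with y in {-1, +1}."""
    return torch.log1p(torch.exp(-y * v)).mean()

def smooth_l1(x: torch.Tensor, sigma: float) -> torch.Tensor:
    cond = x.abs() < 1.0 / sigma ** 2
    return torch.where(cond, 0.5 * sigma ** 2 * x ** 2, x.abs() - 0.5 / sigma ** 2)

def regression_loss(delta: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """Smooth L1 summed over the four offsets and averaged over anchors; delta is the
    deviation between the predicted offsets and the normalized anchor targets."""
    return smooth_l1(delta, sigma).sum(dim=-1).mean()

def total_loss(v, y, delta, lam: float = 1.0) -> torch.Tensor:
    return classification_loss(v, y) + lam * regression_loss(delta)

v = torch.zeros(17 * 17 * 10)        # classification responses
y = torch.ones_like(v)               # labels in {-1, +1}
delta = torch.zeros(17 * 17 * 5, 4)  # offset deviations dx, dy, dw, dh
print(total_loss(v, y, delta))
```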
Step 2, training settings: first, the network backbone is trained on mini ImageNet (20 classes). At this point the stride of the first convolution layer Conv_1 is changed to 2 and the RPN network is removed. The input is then a 224*224*3 image and the output is a 7*7*224 feature map, which is connected to a three-layer fully connected neural network; the final output is a 20-dimensional vector representing the 20 classes. Training for 80-100 epochs on the 10,000 images gives good convergence, using SGD as the gradient descent algorithm with the learning rate decreased by 0.001 every 5 epochs. Next, the stride of Conv_1 is changed back to 1, the fully connected network is removed, and the regression and classification branches are attached; training is performed on COCO2017 with inputs of 98*98*3 and 354*354*3, using the loss function described in step 1 above. The network converges well after about 50 epochs of training, using SGD as the gradient descent algorithm with the learning rate decreased nonlinearly from 0.01 to 0.00001.
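The two learning-rate schedules mentioned above can be sketched as follows; the stage-one initial learning rate and the exact form of the nonlinear decay are illustrative assumptions, since only the decay step and the end points are stated.

```python
# A minimal sketch of the two learning-rate schedules.
import math

def stage1_lr(epoch: int, base_lr: float = 0.05) -> float:
    """Backbone pre-training: the learning rate drops by 0.001 every 5 epochs."""
    return base_lr - 0.001 * (epoch // 5)

def stage2_lr(epoch: int, total_epochs: int = 50,
              start: float = 0.01, end: float = 0.00001) -> float:
    """Siamese training: nonlinear (here exponential) decay from 0.01 to 0.00001."""
    t = epoch / max(total_epochs - 1, 1)
    return start * math.exp(t * math.log(end / start))

print([round(stage1_lr(e), 4) for e in (0, 5, 10)])    # [0.05, 0.049, 0.048]
print(stage2_lr(0), stage2_lr(49))                      # 0.01 ... 1e-05
```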
In an embodiment, step S400 is the online target tracking stage and specifically includes:
One-shot tracking: the invention treats the tracking task as a one-shot detection task. That is, a neural network is first learned; after learning is complete, the convolution kernel parameters of the convolution operation are obtained in the tracking stage by learning from the initial frame, yielding a specialized network, which is then used to track the subsequent frames.
In actual use, for an unknown video, the first frame is first annotated manually, and this frame is then used as the target template for one specialization training pass of the network. Afterwards, for each frame, the search image is passed through the network, the similarity measure is computed, and the position of the target is regressed.
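The one-shot tracking loop described above can be sketched as follows; the tracker interface names (crop_template, specialize, infer) are hypothetical and only illustrate the control flow.

```python
# A minimal sketch of the one-shot tracking loop: specialize on the first, manually
# annotated frame, then run every later frame through the specialized network.
from typing import Iterable, Tuple
import numpy as np

Box = Tuple[float, float, float, float]   # (cx, cy, w, h)

def track_video(frames: Iterable[np.ndarray], first_box: Box, tracker) -> list:
    """Specialize on the first frame, then regress a bounding box for every later frame."""
    frames = iter(frames)
    first_frame = next(frames)
    template = tracker.crop_template(first_frame, first_box)   # crop the annotated target
    tracker.specialize(template)                                # one-shot training pass
    boxes = [first_box]
    for frame in frames:
        boxes.append(tracker.infer(frame))                      # regress the bounding box
    return boxes
```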
Referring to FIG. 6, according to an embodiment of the present invention, a real-time target tracking device is provided, including:
a data selection module 100, configured to select the image data required for training and augment the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame;
a network configuration module 200, configured to configure a feature extraction network and an RPN network, input paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps; the two feature maps are input into the RPN network, and the RPN network performs regression on the bounding box;
a network training module 300, configured to remove the RPN network and perform first-stage training of the backbone network on a classification image dataset; after training on the classification image dataset converges, the RPN network is appended to the backbone network and second-stage training is performed on an object detection dataset; and
a target tracking module 400, configured to, for an unknown video stream, mark the target to be tracked in the first frame of the video; the backbone network uses this frame as the target template frame for a single round of training to obtain a specialized network; each frame of the video stream is then fed into the specialized network as a search image frame, and the bounding box is regressed to complete target tracking.
The problem solved by the present invention is to track targets in video based on a Siamese network and, addressing the large memory footprint and slow inference speed common to current Siamese networks, to design a lightweight network that achieves fast target tracking without losing too much accuracy.
In an embodiment, the network configuration module 200 includes:
a first feature output unit, configured to input paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
a second feature output unit, configured to obtain the feature maps T2 and S2 and input them into the multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
a feature input unit, configured to input the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
a feature positioning unit, in which the RPN network performs regression on the bounding box to obtain an accurate position estimate.
The network configuration module 200 handles the network model configuration stage, which specifically includes:
Step 1, early feature extraction: the overall structure of the network is shown in FIG. 2. The inputs are a 98*98*3 target template frame and a 354*354*3 search image frame. They first pass through a shared 3*3*28 convolution Conv_1, yielding the 96*96*28 and 352*352*28 feature maps T1 and S1. Three depthwise separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise separable convolution layer consists of a channel-wise convolution and a point-wise convolution: the channel-wise convolution kernel is set to 3*3*28, with one kernel responsible for one channel, and the point-wise convolution kernel is set to 1*1*28 and fuses information along the channel dimension. The stride of DwConv_1 and DwConv_2 is set to 1 with padding set to same, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding set to 1, and its output feature maps T2 and S2 are half the size of T1 and S1, i.e., 48*48*28 and 176*176*28.
Step 2, multiplexed feature extraction: after the feature maps T2 and S2 are obtained, they are input into the multiplexing feature extraction module. The multiplexing feature extraction module is configured with three layers, denoted MPConv_1, MPConv_2, and MPConv_3. Each multiplexing feature extraction layer is in turn composed of three inverted residual modules (denoted InvResidual_1, InvResidual_2, and InvResidual_3) and two multiplexing modules (denoted Multiplexing Block1 and Multiplexing Block2). An inverted residual module consists of two point-wise convolutions and one channel-wise convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes of 1*1, 1*1, and 3*3, respectively; the size of the output feature map is unchanged, and the result is input into the multiplexing module. The multiplexing module is introduced in two parts below.
A Multiplexing Block consists of two parts, called the channel multiplexing module and the spatial multiplexing module. The spatial multiplexing module is shown in FIG. 3: a feature map with C channels is split into groups of C1, C2, and C3 channels, where C = C1 + C2 + C3. Two operations are defined, a reduction operation and an expansion operation, together with an operation factor r. The reduction operation multiplies the number of channels of a C*H*W feature map by r^2 and divides its width and height by r; the expansion operation is in principle the inverse of the reduction operation, dividing the number of channels by r^2 and multiplying the width and height by r. Each group then passes through its own group convolution to obtain a new feature map, after which the inverse operation is applied to restore the original shape and the groups are concatenated. Through the expansion and reduction operations, channel information is shared into the spatial dimension and spatial information is redistributed back into the channels, promoting the flow of information, while the use of group convolution reduces the parameter count.
For the channel multiplexing operation, as shown in FIG. 4, for a given feature map the invention selects L channels at a time for computation: they pass through a 1*1 convolution that reshapes the number of channels, then through the spatial multiplexing module, and then through another 1*1 convolution that restores the channels, while the remaining C-L channels are copied directly, so the computation on those C-L channels is avoided. To give every channel a chance of being computed, the channels are then rearranged, i.e., channel shuffle is applied. In this embodiment, this operation is repeated once and combined with the preceding spatial multiplexing module to form a new module, the multiplexed convolution module.
Multiplexing Block1 is configured so that the output size is unchanged; Multiplexing Block2 is configured so that the output size becomes half of the input and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are input into the RPN network for regression and classification, respectively.
The technical solution of the present invention proposes two lightweight modules, the spatial multiplexing module and the channel multiplexing module, and combines them into a new backbone network. The backbone network is trained and fine-tuned from scratch and is finally applied within the Siamese network framework to achieve fast target tracking.
Step 3, RPN network: the structure of the RPN network is shown in FIG. 5. The role of the RPN network is to perform regression on the bounding box and obtain an accurate position estimate. The RPN network consists of two parts: a classification branch, used to distinguish the target from the background, and a regression branch, which fine-tunes the candidate regions.
The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3, respectively. The target template features then pass through 3*3 convolution kernels to produce 4*4*(2k*224) and 4*4*(4k*224) features; obtaining a 4*4 feature size from the 6*6 input through a 3*3 kernel is straightforward, but note that the number of channels rises from 224 to 2k*224 and 4k*224. The channel count rises by a factor of 2k because k anchors are generated at each point of the feature map and each anchor can be classified as foreground or background, so the classification branch rises by a factor of 2k; as noted in step 3 of the data preparation stage of step S100, each anchor is described by four values, so the regression branch rises by a factor of 4k. The search image likewise passes through 3*3 convolution kernels to obtain two features, with the number of feature channels unchanged. For the classification branch, the 4*4*224 features of the 2k template anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image, producing the classification-branch response map; the regression branch is handled similarly, producing a 17*17*4k response map in which each point represents a vector of size 4k, denoted dx, dy, dw, dh; these four values measure the deviation of the anchor from the ground-truth bounding box. The response maps are computed as follows:
A_cls(17*17*2k) = [S3]_cls ⋆ [T3]_cls
A_reg(17*17*4k) = [S3]_reg ⋆ [T3]_reg
where [T3]_cls and [T3]_reg denote the features derived from the target template feature T3 and used as the convolution kernels, [S3]_cls and [S3]_reg denote the features derived from the search image feature S3, and ⋆ denotes the convolution operation.
In an embodiment, the multiplexing module includes:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
The technical solution of the present invention proposes two lightweight modules, the spatial multiplexing module and the channel multiplexing module, and combines them into a new backbone network. The backbone network is trained and fine-tuned from scratch and is finally applied within the Siamese network framework to achieve fast target tracking.
The beneficial effects of the Siamese-network-based target tracking method of the present application are:
1. A spatial multiplexing module is adopted, which fuses information in the feature maps through the expansion and reduction operations, reducing the parameter count while maintaining accuracy.
2. A channel multiplexing module is adopted, which improves computational efficiency through the channel shuffle and partial-selection operations.
3. Combining the above two modules, a new backbone network is devised; the network is trained from scratch until convergence, has relatively high inference efficiency, and can be applied to real-time single-target tracking tasks.
By introducing the two self-developed modules, channel multiplexing and spatial multiplexing, the present invention can achieve fast single-target tracking.
In an embodiment of the present invention, results on the VOT2018 dataset show that, compared with a ResNet50-based Siamese network, the memory footprint of the present invention is 43 MB, one-fifth of the former; the inference speed on an RTX 2080Ti graphics card is 83 FPS, a 3.3-fold improvement over the former's 25 FPS; and the accuracy of the method of the present invention is only 3% lower than that of the former, which is negligible in practical use.
Based on the above real-time target tracking method, this embodiment provides a computer-readable storage medium. The computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the real-time target tracking method of the above embodiments.
In addition, the specific process in which the processor loads and executes the multiple instructions in the above storage medium has been described in detail in the above method and is not repeated here.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A real-time target tracking method, characterized by comprising the following steps:
selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box, and the region cropped from the bounding box serves as a target template frame;
configuring a feature extraction network and an RPN network, inputting the paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box;
removing the RPN network, performing first-stage training of a backbone network on a classification image dataset, and, after training on the classification image dataset converges, appending the RPN network to the backbone network and performing second-stage training on an object detection dataset; and
for an unknown video stream, marking the target to be tracked in the first frame of the video, the backbone network using it as the target template frame to perform one round of training to obtain a specialized network; and inputting each frame of the video stream into the specialized network as a search image frame, and regressing the bounding box to complete target tracking.
2. The real-time target tracking method according to claim 1, characterized in that selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame, is specifically:
selecting two image datasets, a mini ImageNet dataset and the COCO2017 dataset, wherein the mini ImageNet dataset consists of a number of images at a preset resolution, each annotated with a single category, and is used to pre-train the backbone network; the COCO2017 dataset is the object detection dataset, which consists of a number of images at a preset resolution, each annotated with multiple categories and bounding box positions;
performing data augmentation on the image datasets, the data augmentation comprising random flipping, blurring, and shifting of the images; and
generating anchors to localize the target on the search image frames.
3. The real-time target tracking method according to claim 1, characterized in that configuring the feature extraction network and the RPN network, inputting the paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box is specifically:
inputting the paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
obtaining the feature maps T2 and S2 and inputting them into a multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
inputting the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
the RPN network performing regression on the bounding box to obtain an accurate position estimate.
4. The real-time target tracking method according to claim 3, characterized in that the multiplexing module comprises:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
5. The real-time target tracking method according to claim 1, characterized in that the RPN network comprises two parts, a classification branch and a regression branch;
the classification branch is used to distinguish the target from the background; and
the regression branch is used to fine-tune candidate regions.
6. A real-time target tracking device, characterized by comprising:
a data selection module, configured to select the image data required for training and augment the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as a target template frame;
a network configuration module, configured to configure a feature extraction network and an RPN network, input the paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, input the two feature maps into the RPN network, and the RPN network performing regression on the bounding box;
a network training module, configured to remove the RPN network, perform first-stage training of a backbone network on a classification image dataset, and, after training on the classification image dataset converges, append the RPN network to the backbone network and perform second-stage training on an object detection dataset; and
a target tracking module, configured to, for an unknown video stream, mark the target to be tracked in the first frame of the video, the backbone network using it as the target template frame to perform one round of training to obtain a specialized network; and to input each frame of the video stream into the specialized network as a search image frame and regress the bounding box to complete target tracking.
7. The real-time target tracking device according to claim 6, characterized in that the network configuration module comprises:
a first feature output unit, configured to input the paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
a second feature output unit, configured to obtain the feature maps T2 and S2 and input them into a multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
a feature input unit, configured to input the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
a feature positioning unit, in which the RPN network performs regression on the bounding box to obtain an accurate position estimate.
8. The real-time target tracking device according to claim 7, characterized in that the multiplexing module comprises:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the real-time target tracking method according to any one of claims 1-5.
PCT/CN2022/078255 2022-02-28 2022-02-28 Real-time target tracking method, device, and storage medium WO2023159558A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/078255 WO2023159558A1 (en) 2022-02-28 2022-02-28 Real-time target tracking method, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023159558A1 (en)

Family

ID=87764401

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078255 WO2023159558A1 (en) 2022-02-28 2022-02-28 Real-time target tracking method, device, and storage medium

Country Status (1)

Country Link
WO (1) WO2023159558A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268256A1 (en) * 2017-03-16 2018-09-20 Aquifi, Inc. Systems and methods for keypoint detection with convolutional neural networks
CN110033478A (en) * 2019-04-12 2019-07-19 北京影谱科技股份有限公司 Visual target tracking method and device based on depth dual training
CN110570458A (en) * 2019-08-12 2019-12-13 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN111161316A (en) * 2019-12-18 2020-05-15 深圳云天励飞技术有限公司 Target object tracking method and device and terminal equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117292306A (en) * 2023-11-27 2023-12-26 四川迪晟新达类脑智能技术有限公司 Edge equipment-oriented vehicle target detection optimization method and device
CN117576164A (en) * 2023-12-14 2024-02-20 中国人民解放军海军航空大学 Remote sensing video sea-land movement target tracking method based on feature joint learning
CN117576164B (en) * 2023-12-14 2024-05-03 中国人民解放军海军航空大学 Remote sensing video sea-land movement target tracking method based on feature joint learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 22927840; Country of ref document: EP; Kind code of ref document: A1