WO2023159558A1 - Real-time target tracking method, device, and storage medium - Google Patents

Real-time target tracking method, device, and storage medium Download PDF

Info

Publication number
WO2023159558A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
feature
feature extraction
training
frame
Prior art date
Application number
PCT/CN2022/078255
Other languages
French (fr)
Chinese (zh)
Inventor
胡金星
李东昊
王浩
陈卫华
罗亚林
Original Assignee
中国科学院深圳先进技术研究院
中广核工程有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院, 中广核工程有限公司 filed Critical 中国科学院深圳先进技术研究院
Priority to PCT/CN2022/078255 priority Critical patent/WO2023159558A1/en
Publication of WO2023159558A1 publication Critical patent/WO2023159558A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation

Definitions

  • the invention relates to the field of artificial intelligence, in particular to a real-time target tracking method, device and storage medium.
  • Object tracking has broad application prospects in many fields such as human-machine interface, intelligent monitoring, virtual reality, and motion analysis, and has important research value in science and engineering. Due to the establishment of new benchmark object tracking datasets and the provision of standardized benchmarking platforms since 2013, object tracking has developed rapidly in the past decade, and many effective tracking algorithms have been proposed successively.
  • Bolme et al. pioneered the introduction of the convolution theorem from signal processing into visual tracking and converted the object template matching problem into a correlation operation in the frequency domain. In this way, not only is the running speed of the Correlation Filters (CF) tracker improved, but the accuracy can also be improved by using appropriate features. Since then, correlation filters became a research hotspot in the tracking field, and many related target tracking methods have been proposed, such as combining multi-resolution feature maps to reduce the influence of periodic boundaries and improving tracking performance by optimizing the loss.
  • The early Siamese trackers are represented by the SINT algorithm, which is based on the idea of similarity learning, divides the network into a query branch and a search branch, and uses a matching function to find suitable candidate regions, but its tracking speed is too slow, only 2 fps.
  • the GOTURN algorithm can achieve 100 frames per second on a single GPU through a deep regression network, but its robustness is poor.
  • FCNT and CREST, like the above algorithms, focus on exploring the tracking ability of Siamese networks. Bertinetto et al. proposed SiamFC, a lightweight Siamese network structure that extracts target features and search-region features separately and then correlates them; the target bounding box is determined from the position of the maximum of the response map.
  • Training uses videos provided by the ILSVRC dataset; after offline training, the parameters are not updated during tracking, and the algorithm achieves good results in both accuracy and speed.
  • The subsequently proposed correlation filter network CFNet embeds correlation filters into the network branches, treats the filter as a neural network layer, and derives the forward and backward propagation formulas to achieve end-to-end training, while the speed remains real-time on a GPU.
  • Bo Li proposed adding the RPN network from target detection to the tracking network, adopting target localization techniques similar to detection algorithms and using coordinate regression to make the tracking results more accurate while avoiding the multi-scale search problem.
  • Bo Li later proposed SiamRPN++, which introduced a deep backbone network into the tracking network and greatly improved detection accuracy.
  • Many subsequent algorithms improved on SiamRPN++, including adding a mask branch to obtain masks, using internally cropped residual units together with wider networks, and exploiting the effect of deeper networks through multi-level fusion.
  • Although Siamese tracking networks have made great progress, they still face severe memory constraints in real-world applications. For example, the memory footprint of SiamRPN++ reaches 206 MB, which makes it difficult to deploy on mobile devices such as drones. How to reduce its parameter count without losing much accuracy has become an urgent problem to be solved.
  • Embodiments of the present invention provide a real-time target tracking method, device, and storage medium, which implement real-time tracking of targets in videos based on twin networks, so that fast target tracking can be achieved without losing too much accuracy.
  • a real-time target tracking method comprising the following steps:
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • configure the feature extraction network and the RPN network, input the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and input the two feature maps into the RPN network, which regresses the bounding box;
  • each frame of the video stream is used as a search image frame and input into the specialized network; the bounding box is regressed to complete target tracking.
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame. Specifically:
  • the micro-ImageNet dataset is composed of several pictures with a preset resolution, each marked with a single category, and is used for pre-training the backbone network;
  • the COCO2017 data set is a target detection data set.
  • the target detection data set is composed of several pictures with preset resolutions. Each picture is marked with multiple categories and the location of the bounding box;
  • data enhancement includes random flipping, blurring, and shifting operations on the image;
  • configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box, specifically comprises:
  • the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • the RPN network regresses the bounding boxes to obtain accurate position estimates.
  • the multiplexing module includes:
  • the channel multiplexing module is used to share the information of the channel to the space through expansion and reduction operations, and disperse the information of the space to the channel, which promotes the flow of information, and uses group convolution to reduce the amount of parameters;
  • the spatial multiplexing module is used to reshape the channels back through convolution and directly copy the remaining channels.
  • the RPN network includes two parts, namely the classification branch and the regression branch;
  • the classification branch is used to distinguish the target from the background;
  • the regression branch is used to fine-tune the candidate regions.
  • a real-time target tracking device including:
  • the data selection module selects the image data required for training and enhances the search image frame to prevent network overfitting.
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • the network configuration module configures the feature extraction network and the RPN network, inputs the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputs the two feature maps into the RPN network, which regresses the bounding box;
  • the RPN network is removed and the first-stage training of the backbone network is performed on the classification image dataset; after training on the classification dataset converges, the RPN network is added after the backbone network and the second-stage training is performed on the target detection dataset;
  • the target tracking module, for an unknown video stream, marks the target to be tracked in the first frame of the video; the backbone network uses it as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • the network configuration module includes:
  • the first feature output unit is used to input paired target template frames and search image frames into the feature extraction network and to output feature maps T2 and S2 through the separable-convolution feature extraction network;
  • the second feature output unit is used to input feature maps T2 and S2 into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
  • a feature input unit is used to input feature maps T3 and S3 into the RPN network for regression and classification respectively; wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • the feature positioning unit, in which the RPN network regresses the bounding box to obtain an accurate position estimate.
  • the multiplexing module includes:
  • the channel multiplexing module is used to share the information of the channel to the space through the expansion and reduction operation, and disperse the information of the space to the channel, which promotes the circulation of information, and uses group convolution at the same time to reduce the amount of parameters;
  • the spatial multiplexing module is used to reshape the channels back through convolution and directly copy the remaining channels.
  • a computer-readable medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in any one of the above real-time target tracking methods.
  • the real-time target tracking method, device, and storage medium in the embodiments of the present invention include: selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame; configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box; removing the RPN network and performing the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, adding the RPN network after the backbone network and performing the second-stage training on the target detection dataset; for an unknown video stream, marking the target to be tracked in the first frame of the video, which the backbone network uses as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • through the present invention, tracking of targets in video is realized based on a Siamese network; and, addressing the large memory usage and slow inference speed common to current Siamese networks, the network trained by the present invention achieves fast target tracking without losing too much accuracy.
  • Fig. 1 is a flowchart of the real-time object tracking method of the present invention.
  • Fig. 2 is a network structure diagram of the present invention.
  • Fig. 3 is a spatial multiplexing module diagram of the present invention.
  • Fig. 4 is a channel multiplexing module diagram of the present invention.
  • Fig. 5 is the structural diagram of RPN network of the present invention.
  • Fig. 6 is a schematic diagram of the real-time target tracking device of the present invention.
  • a real-time target tracking method comprising the following steps:
  • S100 Select the image data required for training and enhance the search image frame to prevent network overfitting; the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • S200 Configure the feature extraction network and the RPN network, input the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and input the two feature maps into the RPN network, which regresses the bounding box;
  • S300 Remove the RPN network and perform the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, add the RPN network after the backbone network and perform the second-stage training on the target detection dataset;
  • S400 For an unknown video stream, mark the target to be tracked in the first frame of the video; the backbone network uses it as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • the problem solved by the present invention is to realize tracking of targets in video based on a Siamese network; and, addressing the large memory usage and slow inference speed common to current Siamese networks, a lightweight network is designed so that fast object tracking is achieved without losing too much accuracy.
  • the present invention trains and fine-tunes the backbone network from scratch, and finally applies it to the Siamese network framework to achieve fast target tracking.
  • the inventive method comprises four stages:
  • Data preparation stage: training is divided into a backbone network pre-training stage and a regression sub-network training stage. Two kinds of datasets are prepared, used respectively for the classification training of the backbone network and the localization training of the regression sub-network.
  • For localization training, unlike the usual target detection task, the target region must be cropped out as the target template frame and paired with the search image as input to the network.
  • Data augmentation such as flipping, rotation, and shifting is performed on the search image to prevent the network from overfitting.
  • Network configuration stage: the network can be divided into two parts, the feature extraction network and the RPN network.
  • the aforementioned multiplexing module is included in the feature extraction network.
  • the input is pairs of target template frames and search image frames, sharing the same feature extraction network.
  • two feature maps are output as the input of the RPN network.
  • the RPN network is used to regress the precise bounding box: the feature maps obtained from the target template frame and the search image frame are convolved once to obtain a response heat map, where positions with large response values are the places where the target is most likely to appear.
  • Network training stage: training uses the sum of the regression loss and the classification loss as the loss function; the classification loss uses cross-entropy loss and the regression loss uses Smooth L1 loss.
  • First remove the RPN network and train the backbone network on the classification dataset; after convergence, add the RPN network after the backbone network and conduct the second stage of training on the target detection dataset, using SGD as the gradient descent algorithm in both stages.
  • Online tracking stage: online tracking adopts a one-shot training method. For an unknown video, the target to be tracked is marked in the first frame, and the network uses it as the target template frame for training, yielding a specialized network. Afterwards, each frame of the video stream is used as a search image frame, input into the network, and the bounding box is regressed, completing the tracking.
  • By adopting the lightweight modules and the backbone network formed from them, the invention improves the inference speed of the target tracking network and reduces its memory occupation while maintaining accuracy. More details are given in the following description.
  • Step S100, selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame, specifically includes:
  • S101 Select two image datasets, namely the micro-ImageNet dataset and the COCO2017 dataset; the micro-ImageNet dataset is composed of several pictures with a preset resolution, each marked with a single category, and is used for pre-training the backbone network; the COCO2017 dataset is a target detection dataset.
  • the target detection dataset is composed of several pictures with preset resolutions. Each picture is marked with multiple categories and the location of the bounding box;
  • S102 Perform data enhancement on the image data set, the data enhancement includes random flipping, blurring, and shifting operations on the image;
  • S103 Generate an Anchor to perform target positioning on the search image frame.
  • Step S100 is the data preparation stage, specifically including:
  • Step 1 data selection: the training is divided into two stages, namely backbone network pre-training and regression sub-network training.
  • the specific training configuration is introduced in detail in step 3.
  • Tiny ImageNet is a sub-dataset of ImageNet. 10,000 pictures with a resolution of 224*224 are selected, and each picture is marked with a single category for pre-training the backbone network.
  • COCO2017 is a target detection data set, and 5000 pictures are selected, and each picture is marked with multiple categories and the location of the bounding box.
  • the labeled bounding box needs to be cropped out separately as the target template frame, and the entire image is used as the search image frame; the two are input into the network in pairs.
  • Step 2 data enhancement: Before entering the training, the data needs to be expanded, that is, data enhancement, which requires random flipping, blurring, and shifting operations on the image. It should be noted that in Siamese neural network, data augmentation only needs to be performed on the search image, not on the target template.
  • Step 3 anchor generation: anchors are needed to locate the target on the search image.
  • The specific method of generating anchors is: first obtain the base anchors, whose array has dimension [5, 4]; each row records a vector [x-coordinate of the center point, y-coordinate of the center point, width, height], and the 5 rows correspond to the anchor expansion ratios, set to [0.5, 0.66, 1, 1.5, 2].
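For illustration, a minimal NumPy sketch of this kind of anchor generation is given below. The base size of 8, the stride of 8, and tiling over a 17*17 response map are assumptions (the 17*17 size is taken from the RPN description later in this text); only the [5, 4] base-anchor layout and the ratio list follow the paragraph above.

```python
import numpy as np

def generate_anchors(base_size=8, ratios=(0.5, 0.66, 1, 1.5, 2),
                     response_size=17, stride=8):
    """Hedged sketch: build k base anchors [cx, cy, w, h] (one per ratio),
    then tile them over every position of the response map."""
    k = len(ratios)
    base = np.zeros((k, 4), dtype=np.float32)          # shape [5, 4]
    area = base_size * base_size
    for i, r in enumerate(ratios):
        w = np.sqrt(area / r)
        h = w * r
        base[i] = [0, 0, w, h]                          # centred at the origin

    # Offsets of every response-map cell, expressed in search-image pixels
    # and centred so the middle cell sits on the image centre.
    shift = (np.arange(response_size) - response_size // 2) * stride
    xs, ys = np.meshgrid(shift, shift)

    anchors = np.zeros((k, response_size, response_size, 4), dtype=np.float32)
    anchors[..., 0] = xs                                # centre x
    anchors[..., 1] = ys                                # centre y
    anchors[..., 2] = base[:, 2, None, None]            # width per ratio
    anchors[..., 3] = base[:, 3, None, None]            # height per ratio
    return anchors                                      # [k, 17, 17, 4]

anchors = generate_anchors()
print(anchors.shape)                                    # (5, 17, 17, 4)
```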
  • Step S200, configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box, specifically includes:
  • S201 Input the paired target template frame and search image frame into the feature extraction network, and output feature maps T2 and S2 through the separable-convolution feature extraction network;
  • S202 Input feature maps T2 and S2 into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
  • S203 Input the feature maps T3 and S3 into the RPN network for regression and classification respectively; wherein, the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • S204 The RPN network performs regression on the bounding box to obtain accurate position estimation.
  • Step S200 is the network model configuration stage, which specifically includes:
  • Step 1 feature extraction in the early stage: the overall structure of the network is shown in Figure 2.
  • the input is a target template frame of 98*98*3 and a search image frame of 354*354*3.
  • The first convolution kernel Conv_1 produces the feature maps T1 and S1 of sizes 96*96*28 and 352*352*28. Three depthwise-separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise-separable convolution layer is divided into a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution: the depthwise convolution kernel is set to 3*3*28, with one kernel responsible for only one channel, and the pointwise convolution kernel is set to 1*1*28 for information fusion in the channel direction. The stride of DwConv_1 and DwConv_2 is set to 1 with "same" padding, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding 1, so the output feature maps T2 and S2 are half the size of T1 and S1, namely 48*48*28 and 176*176*28.
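For illustration, a minimal PyTorch sketch of this stem (Conv_1 followed by DwConv_1..DwConv_3) is given below. The kernel sizes, strides, and channel counts follow the text; the BatchNorm and ReLU layers are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One depthwise-separable layer: a 3x3 per-channel (depthwise) convolution
    followed by a 1x1 pointwise convolution fusing the 28 channels."""
    def __init__(self, channels=28, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   stride=stride, padding=padding,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)   # normalisation is an assumption
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class Stem(nn.Module):
    """Conv_1 followed by DwConv_1..DwConv_3 with the strides stated above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 28, kernel_size=3, stride=1, padding=0)  # 98 -> 96
        self.dw1 = DepthwiseSeparableConv(28, stride=1, padding=1)          # "same" size
        self.dw2 = DepthwiseSeparableConv(28, stride=1, padding=1)
        self.dw3 = DepthwiseSeparableConv(28, stride=2, padding=1)          # halves size

    def forward(self, x):
        return self.dw3(self.dw2(self.dw1(self.conv1(x))))

stem = Stem()
template = torch.randn(1, 3, 98, 98)
print(stem(template).shape)    # torch.Size([1, 28, 48, 48]); a 354x354 input gives 176x176
```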
  • Step 2 multiplexing feature extraction: After obtaining the feature maps T2 and S2, input them to the multiplexing feature extraction module.
  • the multiplexing feature extraction module is configured with three layers, respectively recorded as MPConv_1, MPConv_2, and MPConv_3; each layer of multiplexing feature extraction module is composed of three inverted residual modules (the three inverted residual modules are respectively recorded as InvResidual_1, InvResidual_2, InvResidual_3) and two multiplexing modules (the two multiplexing modules are respectively marked as Multiplexing Block1 and Multiplexing Block12).
  • The inverted residual module consists of two point-by-point convolutions and one channel-by-channel convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes 1*1, 1*1, and 3*3 respectively; the output feature map size is unchanged, and the result is input into the multiplexing module (a sketch of this block is given below). The following two parts introduce the multiplexing modules.
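A minimal PyTorch sketch of such an inverted residual block is shown below. The Pw-Cw-Pw ordering follows the usual inverted-residual design, and the expansion factor of 2, the BatchNorm/ReLU6 layers, and the identity shortcut are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """PwConv_1 (1x1) -> CwConv_1 (3x3 depthwise) -> PwConv_2 (1x1),
    with a residual connection; the output size is unchanged."""
    def __init__(self, channels, expand=2):
        super().__init__()
        hidden = channels * expand          # expansion factor is an assumption
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),             # PwConv_1
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden,  # CwConv_1
                      bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),             # PwConv_2
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)    # identity shortcut keeps the size unchanged

x = torch.randn(1, 28, 48, 48)
print(InvertedResidual(28)(x).shape)   # torch.Size([1, 28, 48, 48])
```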
  • The reduction operation changes the channels of a C*H*W feature map to r^2 times the original and changes its width and height to 1/r of the original. The expansion operation is, in principle, the inverse of the reduction operation: it changes the channels to 1/r^2 times the original and the width and height to r times the original. After the reduction, each group of new feature maps passes through its own group convolution, and the inverse operation then restores them to the original feature-map shape for concatenation.
  • Channel multiplexing uses the expansion and reduction operations to share channel information into space and to disperse spatial information back into the channels, promoting the flow of information; group convolution is used to reduce the number of parameters.
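One way to realise this channel multiplexing idea is sketched below: the reduction is modelled as space-to-depth, the expansion as depth-to-space, with a grouped convolution in between. The ratio r=2, the group count, and the residual connection back to the input are assumptions, not details given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMultiplexing(nn.Module):
    """Reduction (space-to-depth: C*H*W -> r^2*C x H/r x W/r), cheap grouped
    convolution, then expansion (depth-to-space) back to the original shape."""
    def __init__(self, channels, r=2, groups=4):
        super().__init__()
        self.r = r
        self.group_conv = nn.Conv2d(channels * r * r, channels * r * r,
                                    kernel_size=3, padding=1,
                                    groups=groups, bias=False)

    def forward(self, x):
        y = F.pixel_unshuffle(x, self.r)   # reduction: channels x r^2, H,W / r
        y = self.group_conv(y)             # grouped conv keeps the parameter count low
        y = F.pixel_shuffle(y, self.r)     # expansion: back to C x H x W
        return x + y                       # combine with the original features (assumed)

x = torch.randn(1, 28, 48, 48)
print(ChannelMultiplexing(28)(x).shape)    # torch.Size([1, 28, 48, 48])
```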
  • In the spatial multiplexing module, L of the C channels are selected for computation each time and passed through a 1*1 convolution that reshapes the channel count; the module then restores the channels through another 1*1 convolution, while the remaining C-L channels are copied directly, which avoids computation on those C-L channels. To give every channel a chance to be computed, the channels are then rearranged, i.e., channel shuffle (Channel Shuffle).
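A minimal sketch of one way to realise this partial-computation idea (compute on L channels, copy the remaining C-L channels, then channel shuffle) is given below; the split ratio, the shuffle group count, and the hidden channel width are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Rearrange channels so every channel gets a chance to be selected later."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2)
    return x.reshape(n, c, h, w)

class SpatialMultiplexing(nn.Module):
    """Only the first L of the C channels pass through 1x1 convolutions; the
    remaining C-L channels are copied unchanged, then the channels are shuffled."""
    def __init__(self, channels, ratio=0.5, groups=4):
        super().__init__()
        self.L = int(channels * ratio)       # split ratio is an assumption
        self.reduce = nn.Conv2d(self.L, self.L, kernel_size=1, bias=False)
        self.restore = nn.Conv2d(self.L, self.L, kernel_size=1, bias=False)
        self.groups = groups

    def forward(self, x):
        selected, passthrough = x[:, :self.L], x[:, self.L:]
        selected = self.restore(self.reduce(selected))   # compute on L channels only
        out = torch.cat([selected, passthrough], dim=1)  # copy the remaining C-L channels
        return channel_shuffle(out, self.groups)         # let every channel be selected next time

x = torch.randn(1, 28, 48, 48)
print(SpatialMultiplexing(28)(x).shape)   # torch.Size([1, 28, 48, 48])
```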
  • Multiplexing Block1 is configured so that the output size remains unchanged, while Multiplexing Block2 is configured so that the output size is halved and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; and 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are respectively input into the RPN network for regression and classification.
  • the technical scheme of the invention proposes two lightweight modules, a space multiplexing module and a channel multiplexing module, and combines them into a new backbone network.
  • the backbone network is trained and fine-tuned from scratch, and finally applied to the Siamese network framework to achieve fast target tracking.
  • Step 3 RPN network: The structure of the RPN network is shown in FIG. 5 .
  • the role of the RPN network is to regress the bounding box to obtain an accurate position estimate.
  • the RPN network consists of two parts, one is the classification branch, which is used to distinguish the target from the background, and the other is the regression branch, which fine-tunes the candidate area.
  • The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3 respectively. The target template features are then passed through 3*3 convolution kernels to generate 4*4*(2k*224) and 4*4*(4k*224) features. Going from a 6*6 feature to a 4*4 feature through a 3*3 convolution kernel is straightforward; what needs attention is the increase of the channel count from 224 to 2k*224 and 4k*224. The channel count increases by a factor of 2k because k anchors are generated at every point of the feature map and each anchor is classified as foreground or background, so the classification branch grows by a factor of 2k; and, as described in the anchor generation step above, each anchor is described by four values, so the regression branch grows by a factor of 4k.
  • the search image also obtains two features through a 3*3 convolution kernel, and the number of feature channels remains unchanged here.
  • The 2k 4*4*224 template feature blocks obtained for the anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image to generate the classification-branch response map; similarly, for the regression branch, the generated response map is 17*17*4k, where each point represents a vector of size 4k, denoted dx, dy, dw, dh; these four values measure the deviation between the anchor and the ground-truth bounding box.
  • the formula for calculating the response map is as follows:
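The formula itself does not survive in this text. A reconstruction consistent with the description above (template features acting as the correlation kernel over the search features, in the style of SiamRPN; an assumption, not the original equation) would be:

$$A^{cls}_{17\times 17\times 2k} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls}, \qquad A^{reg}_{17\times 17\times 4k} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg}$$

where \(\varphi(z)\) and \(\varphi(x)\) are the template and search features and \(\star\) denotes the correlation (convolution) operation. A minimal PyTorch sketch of such a head under the sizes stated above (k anchors, 224 channels) follows; the layer names and the grouped-convolution realisation of the correlation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """3x3 convs turn the 6x6x224 template features into correlation kernels
    with 2k*224 / 4k*224 channels, the 22x22x224 search features keep 224
    channels, and a batched convolution realises the correlation that
    produces the 17x17 response maps."""
    def __init__(self, channels=224, k=5):
        super().__init__()
        self.k = k
        self.t_cls = nn.Conv2d(channels, 2 * k * channels, 3)   # 6x6 -> 4x4
        self.t_reg = nn.Conv2d(channels, 4 * k * channels, 3)
        self.s_cls = nn.Conv2d(channels, channels, 3)            # 22x22 -> 20x20
        self.s_reg = nn.Conv2d(channels, channels, 3)

    def correlate(self, search, kernel, out_mult):
        n, c, h, w = kernel.shape
        kernel = kernel.view(n * out_mult, c // out_mult, h, w)  # per-anchor kernels
        out = F.conv2d(search.view(1, -1, *search.shape[2:]), kernel,
                       groups=n)                       # batch-wise correlation
        return out.view(n, out_mult, *out.shape[2:])   # N x out_mult x 17 x 17

    def forward(self, template_feat, search_feat):
        cls = self.correlate(self.s_cls(search_feat), self.t_cls(template_feat), 2 * self.k)
        reg = self.correlate(self.s_reg(search_feat), self.t_reg(template_feat), 4 * self.k)
        return cls, reg

head = RPNHead()
t3 = torch.randn(1, 224, 6, 6)
s3 = torch.randn(1, 224, 22, 22)
cls, reg = head(t3, s3)
print(cls.shape, reg.shape)    # torch.Size([1, 10, 17, 17]) torch.Size([1, 20, 17, 17])
```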
  • step S300 is an offline training phase, which specifically includes:
  • Step 1 Selection of loss function: The loss function is divided into two parts, which are classification loss and regression loss, as follows:
  • the classification loss uses cross-entropy loss, as follows:
  • v is a single response value output by the network
  • y is the label
  • D is the generated response map
  • u is any position in the response map.
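The cross-entropy formula is not reproduced in this text. One plausible reconstruction consistent with the symbols defined above (a logistic cross-entropy averaged over the response map, as used in SiamFC-style trackers; an assumption, not the original equation):

$$\ell(y, v) = \log\left(1 + e^{-y\,v}\right), \qquad L_{cls} = \frac{1}{|D|} \sum_{u \in D} \ell\big(y[u],\, v[u]\big)$$

where \(y[u] \in \{+1, -1\}\) is the label at position \(u\) and \(v[u]\) is the corresponding response value output by the network.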
  • the regression loss uses Smooth L1 loss, and first standardizes the coordinates of the Anchor:
  • x, y, w, and h represent the coordinates of the center of the matrix and the width and height of the matrix
  • T and A represent the prediction frame and Anchor, respectively.
  • λ is an adjustable hyperparameter used to balance the two losses.
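The normalization and Smooth L1 formulas are likewise not reproduced in this text. The standard SiamRPN-style forms consistent with the symbols above (an assumed reconstruction, not the original equations) are:

$$\delta[0] = \frac{T_x - A_x}{A_w}, \quad \delta[1] = \frac{T_y - A_y}{A_h}, \quad \delta[2] = \ln\frac{T_w}{A_w}, \quad \delta[3] = \ln\frac{T_h}{A_h}$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}, \qquad L_{reg} = \sum_{i=0}^{3} \mathrm{smooth}_{L1}\big(\delta[i]\big), \qquad L = L_{cls} + \lambda\, L_{reg}$$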
  • Step 2 training settings: first train the network backbone on micro ImageNet (20 categories). At this stage the stride of the first convolutional layer Conv_1 is changed to 2 and the RPN network is removed; the input is a 224*224*3 image and the output is a 7*7*224 feature map. This feature map is connected to a three-layer fully connected neural network whose final output is a 20-dimensional vector representing the 20 categories. Training for 80-100 rounds on the 10,000 images gives good convergence. SGD is used as the gradient descent algorithm, and the learning rate drops by 0.001 every 5 rounds.
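For illustration, a hedged sketch of this stage-one classification pre-training is shown below. The hidden sizes of the fully connected head, the momentum, and the exact learning-rate schedule are assumptions; `backbone` stands for the Conv_1 + DwConv + MPConv stack described above (with Conv_1 stride 2 and the RPN removed), and `loader` is an assumed Tiny-ImageNet data loader.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Three-layer fully connected head mapping the 7x7x224 feature map to 20 logits."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 224, 1024), nn.ReLU(inplace=True),   # hidden sizes assumed
            nn.Linear(1024, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, feat):
        return self.fc(feat)

def pretrain(backbone, head, loader, epochs=100, lr=0.01):
    params = list(backbone.parameters()) + list(head.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    # "learning rate drops every 5 rounds": modelled here with StepLR (assumed form)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:            # 224x224x3 classification images
            logits = head(backbone(images))
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```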
  • step 400 is an online target tracking stage, which specifically includes:
  • One-shot tracking takes the tracking task as a one-shot training detection task. That is, a neural network is first learned, and after the learning is completed, the convolution kernel parameters of the convolution operation are obtained through the initial frame learning in the tracking phase, so as to obtain a specialized network, and then the subsequent frames are tracked.
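For illustration, a hedged sketch of this one-shot tracking loop is given below. `backbone`, `rpn_head`, and `select_best_box` are assumed helpers corresponding to the feature extractor, the RPN head, and the step that picks the highest-scoring anchor and applies its regression offsets; none of these names come from the original text.

```python
import torch

def track(video_frames, first_frame_template, backbone, rpn_head, select_best_box):
    """One-shot step: compute the template features once from the target marked
    in the first frame (the 'specialized' part of the network), then treat every
    subsequent frame as a search image and regress a bounding box for it."""
    with torch.no_grad():
        template_feat = backbone(first_frame_template)   # e.g. 1x224x6x6

        boxes = []
        for frame in video_frames:                        # each frame = search image
            search_feat = backbone(frame)                 # e.g. 1x224x22x22
            cls_map, reg_map = rpn_head(template_feat, search_feat)
            boxes.append(select_best_box(cls_map, reg_map))
        return boxes
```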
  • a real-time target tracking device including:
  • the data selection module 100 selects the image data required for training, and enhances the search image frame to prevent network overfitting.
  • the search image frame is image data annotated with a bounding box, and the region cropped from the bounding box is the target template frame;
  • the network configuration module 200 configures the feature extraction network and the RPN network, inputs the paired search images and the target template frame into the feature extraction network, which outputs two feature maps, and inputs the two feature maps into the RPN network, which regresses the bounding box;
  • the network training module 300 removes the RPN network and performs the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, the RPN network is added after the backbone network and the second-stage training is performed on the target detection dataset;
  • the target tracking module 400, for an unknown video stream, marks the target to be tracked in the first frame of the video; the backbone network uses it as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
  • the problem solved by the present invention is to realize tracking of targets in video based on a Siamese network; and, addressing the large memory usage and slow inference speed common to current Siamese networks, a lightweight network is designed so that fast object tracking is achieved without losing too much accuracy.
  • the network configuration module 200 includes:
  • the first feature output unit is used to input paired target template frames and search image frames into the feature extraction network and to output feature maps T2 and S2 through the separable-convolution feature extraction network;
  • the second feature output unit is used to input feature maps T2 and S2 into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
  • a feature input unit is used to input feature maps T3 and S3 into the RPN network for regression and classification respectively; wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
  • the feature positioning unit, in which the RPN network regresses the bounding box to obtain an accurate position estimate.
  • the network configuration module 200 handles the network model configuration stage, which specifically includes:
  • Step 1 feature extraction in the early stage: the overall structure of the network is shown in Figure 2.
  • the input is a target template frame of 98*98*3 and a search image frame of 354*354*3.
  • The first convolution kernel Conv_1 produces the feature maps T1 and S1 of sizes 96*96*28 and 352*352*28. Three depthwise-separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise-separable convolution layer is divided into a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution: the depthwise convolution kernel is set to 3*3*28, with one kernel responsible for only one channel, and the pointwise convolution kernel is set to 1*1*28 for information fusion in the channel direction. The stride of DwConv_1 and DwConv_2 is set to 1 with "same" padding, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding 1, so the output feature maps T2 and S2 are half the size of T1 and S1, namely 48*48*28 and 176*176*28.
  • Step 2 multiplexing feature extraction: After obtaining the feature maps T2 and S2, input them to the multiplexing feature extraction module.
  • The multiplexing feature extraction module is configured with three layers, denoted MPConv_1, MPConv_2, and MPConv_3; each multiplexing feature extraction layer consists of three inverted residual modules (denoted InvResidual_1, InvResidual_2, InvResidual_3) and two multiplexing modules (denoted Multiplexing Block1 and Multiplexing Block2).
  • The inverted residual module consists of two point-by-point convolutions and one channel-by-channel convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes 1*1, 1*1, and 3*3 respectively; the output feature map size is unchanged, and the result is input into the multiplexing module. The following two parts introduce the multiplexing modules.
  • The reduction operation changes the channels of a C*H*W feature map to r^2 times the original and changes its width and height to 1/r of the original. The expansion operation is, in principle, the inverse of the reduction operation: it changes the channels to 1/r^2 times the original and the width and height to r times the original. After the reduction, each group of new feature maps passes through its own group convolution, and the inverse operation then restores them to the original feature-map shape for concatenation.
  • Channel multiplexing uses the expansion and reduction operations to share channel information into space and to disperse spatial information back into the channels, promoting the flow of information; group convolution is used to reduce the number of parameters.
  • In the spatial multiplexing module, L of the C channels are selected for computation each time and passed through a 1*1 convolution that reshapes the channel count; the module then restores the channels through another 1*1 convolution, while the remaining C-L channels are copied directly, which avoids computation on those C-L channels. To give every channel a chance to be computed, the channels are then rearranged, i.e., channel shuffle (Channel Shuffle).
  • Multiplexing Block1 is configured so that the output size remains unchanged, while Multiplexing Block2 is configured so that the output size is halved and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; and 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are respectively input into the RPN network for regression and classification.
  • the technical scheme of the invention proposes two lightweight modules, a space multiplexing module and a channel multiplexing module, and combines them into a new backbone network.
  • the backbone network is trained and fine-tuned from scratch, and finally applied to the Siamese network framework to achieve fast target tracking.
  • Step 3 RPN network: The structure of the RPN network is shown in FIG. 5 .
  • the role of the RPN network is to regress the bounding box to obtain an accurate position estimate.
  • the RPN network consists of two parts, one is the classification branch, which is used to distinguish the target from the background, and the other is the regression branch, which fine-tunes the candidate area.
  • The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3 respectively. The target template features are then passed through 3*3 convolution kernels to generate 4*4*(2k*224) and 4*4*(4k*224) features. Going from a 6*6 feature to a 4*4 feature through a 3*3 convolution kernel is straightforward; what needs attention is the increase of the channel count from 224 to 2k*224 and 4k*224. The channel count increases by a factor of 2k because k anchors are generated at every point of the feature map and each anchor is classified as foreground or background, so the classification branch grows by a factor of 2k; and, as described in the anchor generation step (the third step of the data preparation stage S100), each anchor is described by four values, so the regression branch grows by a factor of 4k.
  • the search image also obtains two features through a 3*3 convolution kernel, and the number of feature channels remains unchanged here.
  • The 2k 4*4*224 template feature blocks obtained for the anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image to generate the classification-branch response map; similarly, for the regression branch, the generated response map is 17*17*4k, where each point represents a vector of size 4k, denoted dx, dy, dw, and dh.
  • The above four values measure the deviation between the anchor and the ground-truth bounding box.
  • the formula for calculating the response map is the same as given above.
  • the multiplexing module includes:
  • the channel multiplexing module is used to share the information of the channel to the space through expansion and reduction operations, and disperse the information of the space to the channel, which promotes the flow of information, and uses group convolution to reduce the amount of parameters;
  • the spatial multiplexing module is used to reshape the channels back through convolution and directly copy the remaining channels.
  • the technical scheme of the invention proposes two lightweight modules, a space multiplexing module and a channel multiplexing module, and combines them into a new backbone network.
  • the backbone network is trained and fine-tuned from scratch, and finally applied to the Siamese network framework to achieve fast target tracking.
  • the spatial multiplexing module is used to fuse the feature map through two operations of expansion and reduction, which reduces the parameters while ensuring the accuracy.
  • the channel multiplexing module is adopted, and the calculation efficiency is improved through the operation of channel shuffling and partial selection.
  • the present invention can realize fast single target tracking by introducing two self-developed modules, channel multiplexing and space multiplexing.
  • Results on the VOT2018 dataset show that, compared with the Siamese network based on ResNet50, the memory footprint of the present invention is 43MB, about one-fifth of the former; the inference speed on an RTX 2080Ti graphics card is 83 FPS, 3.3 times the former's 25 FPS; and the accuracy of the method of the present invention drops by only 3% relative to the former, which is negligible in actual use.
  • This embodiment provides a computer-readable storage medium; the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the real-time target tracking method of the above embodiment.

Abstract

The present invention relates to the field of artificial intelligence, and specifically relates to a real-time target tracking method, a device, and a storage medium. The method comprises: selecting image data required for training, and enhancing a search image frame; configuring a feature extraction network and an RPN network, and performing regression on a bounding box by the RPN network; removing the RPN network, carrying out stage one training of a backbone network, adding an RPN network posterior to the backbone network, and carrying out stage two training; for an unknown video stream, marking a target to be tracked, and performing training to obtain a specialized network; taking each frame of the video stream as a search image frame, inputting the search image frames into the specialized network, performing regression to output bounding boxes, and completing target tracking. The present invention implements tracking of a target in a video on the basis of a siamese network; also, with respect to the problems of large amounts of memory being used and low inference speed that are pervasive in existing siamese networks, rapid target tracking is achieved without an excessive loss of precision by means of a network trained in the present invention.

Description

A real-time target tracking method, device and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a real-time target tracking method, device and storage medium.
Background Art
Object tracking has broad application prospects in many fields such as human-machine interfaces, intelligent monitoring, virtual reality, and motion analysis, and has important research value in science and engineering. Owing to the establishment of new benchmark object tracking datasets and the provision of standardized benchmarking platforms since 2013, object tracking has developed rapidly in the past decade, and many effective tracking algorithms have been proposed. Bolme et al. pioneered the introduction of the convolution theorem from signal processing into visual tracking and converted the object template matching problem into a correlation operation in the frequency domain. In this way, not only is the running speed of the Correlation Filters (CF) tracker improved, but the accuracy can also be improved by using appropriate features. Since then, correlation filters became a research hotspot in the tracking field, and many related target tracking methods have been proposed, such as combining multi-resolution feature maps to reduce the influence of periodic boundaries and improving tracking performance by optimizing the loss.
With the rise of deep learning in computer vision, the tracking field is currently adopting data-driven learning methods. Nine of the ten top-performing trackers on VOT17 rely on deep features and outperform previous state-of-the-art trackers. Among them, the Siamese neural network tracks through a similarity-comparison strategy, and its simple architecture can achieve very fast running speeds. Bertinetto et al. adopted an architecture that is fully convolutional with respect to the search image (Fully-Convolutional Siamese Networks, SiamFC) to estimate the feature similarity between regions of two frames, achieving the best performance on multiple test datasets.
The early Siamese trackers are represented by the SINT algorithm, which is based on the idea of similarity learning, divides the network into a query branch and a search branch, and uses a matching function to find suitable candidate regions, but its tracking speed is too slow, only 2 fps. The GOTURN algorithm can reach 100 frames per second on a single GPU through a deep regression network, but its robustness is poor. FCNT and CREST, like the above algorithms, focus on exploring the tracking ability of Siamese networks. Bertinetto et al. proposed SiamFC, a lightweight Siamese network structure that extracts target features and search-region features separately and then correlates them; the target bounding box is determined from the position of the maximum of the response map. Training uses videos provided by the ILSVRC dataset; after offline training, the parameters are not updated during tracking, and the algorithm achieves good results in both accuracy and speed. The subsequently proposed correlation filter network CFNet embeds correlation filters into the network branches, treats the filter as a neural network layer, and derives the forward and backward propagation formulas to achieve end-to-end training, while the speed remains real-time on a GPU. Using this as the baseline network, a large number of Siamese network algorithms have been proposed by researchers, such as SiamRPN, SiamRPN++, SiamMask, and SiamAttn. Bo Li proposed adding the RPN network from target detection to the tracking network, adopting target localization techniques similar to detection algorithms and using coordinate regression to make the tracking results more accurate while avoiding multi-scale search. Later, Bo Li proposed SiamRPN++, which introduced a deep backbone network into the tracking network and greatly improved detection accuracy. Many subsequent algorithms improved on SiamRPN++, including adding a mask branch to obtain masks, using internally cropped residual units together with wider networks, and exploiting deeper networks through multi-level fusion.
Although Siamese tracking networks have made great progress, they still face severe memory constraints in real-world applications. For example, the memory footprint of SiamRPN++ reaches 206 MB, which makes it difficult to deploy on mobile devices such as drones. How to reduce its parameter count without losing much accuracy has become an urgent problem to be solved.
Summary of the Invention
Embodiments of the present invention provide a real-time target tracking method, device, and storage medium, which realize real-time tracking of targets in videos based on a Siamese network, so that fast target tracking can be achieved without losing too much accuracy.
According to an embodiment of the present invention, a real-time target tracking method is provided, comprising the following steps:
selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame;
configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box;
removing the RPN network and performing the first-stage training of the backbone network on the classification image dataset; after training on the classification dataset converges, adding the RPN network after the backbone network and performing the second-stage training on the target detection dataset;
for an unknown video stream, marking the target to be tracked in the first frame of the video, which the backbone network uses as the target template frame for a one-shot training to obtain a specialized network; each frame of the video stream is then used as a search image frame, input into the specialized network, and the bounding box is regressed to complete target tracking.
Further, selecting the image data required for training and enhancing the search image frame to prevent network overfitting, where the search image frame is image data annotated with a bounding box and the region cropped from the bounding box is the target template frame, specifically comprises:
selecting two image datasets, namely the micro-ImageNet dataset and the COCO2017 dataset; the micro-ImageNet dataset consists of several pictures of a preset resolution, each annotated with a single category, and is used for pre-training the backbone network; the COCO2017 dataset is a target detection dataset consisting of several pictures of a preset resolution, each annotated with multiple categories and bounding-box locations;
performing data augmentation on the image datasets, including random flipping, blurring, and shifting of the images;
generating anchors for target localization on the search image frame.
Further, configuring the feature extraction network and the RPN network, inputting the paired search image and target template frame into the feature extraction network, which outputs two feature maps, and inputting the two feature maps into the RPN network, which regresses the bounding box, specifically comprises:
inputting the paired target template frame and search image frame into the feature extraction network, and outputting feature maps T2 and S2 through the separable-convolution feature extraction network;
obtaining feature maps T2 and S2 and inputting them into the multiplexing feature extraction module, which outputs feature maps T3 and S3;
inputting feature maps T3 and S3 into the RPN network for regression and classification respectively; the multiplexing feature extraction module consists of three inverted residual modules and two multiplexing modules;
the RPN network regresses the bounding boxes to obtain accurate position estimates.
Further, the multiplexing modules include:
a channel multiplexing module, which shares channel information into space and disperses spatial information back into the channels through expansion and reduction operations, promoting the flow of information, while using group convolution to reduce the number of parameters;
a spatial multiplexing module, which reshapes the channels back through convolution and directly copies the remaining channels.
Further, the RPN network includes two parts, a classification branch and a regression branch;
the classification branch is used to distinguish the target from the background;
the regression branch is used to fine-tune the candidate regions.
根据本发明的另一实施例,提供了一种实时目标跟踪装置,包括:According to another embodiment of the present invention, a real-time target tracking device is provided, including:
数据选取模块,选取训练所需的图像数据,对搜索图像帧进行增强,以来防止网络过拟合,搜索图像帧为标注有包围框的图像数据,包围框单独截取部 分为目标模板帧;The data selection module selects the image data required for training and enhances the search image frame to prevent network overfitting. The search image frame is the image data marked with a bounding box, and the separately intercepted part of the bounding box is the target template frame;
网络配置模块,配置特征提取网络及RPN网络,将成对的搜索图像及目标模板帧输入特征提取网络,特征提取网络输出两张特征图,将两张特征图输入RPN网络,RPN网络对包围框进行回归;Network configuration module, configure the feature extraction network and RPN network, input the paired search image and target template frame into the feature extraction network, the feature extraction network outputs two feature maps, input the two feature maps into the RPN network, and the RPN network performs the bounding box return;
网络训练模块,去掉RPN网络,在分类图像数据集上进行骨干网络的第一阶段训练,分类数据图像数据集收敛后,在骨干网络后加上RPN网络,在目标检测数据集上进行第二阶段训练;In the network training module, the RPN network is removed, and the first stage training of the backbone network is performed on the classified image dataset. After the classification data image dataset converges, the RPN network is added after the backbone network, and the second stage is performed on the target detection dataset. train;
目标跟踪模块,对未知的视频流,在视频的第一帧图像时,将需要跟踪的目标标出,骨干网络会将其作为目标模板帧进行一次训练,以得到一个特化网络;将视频流的每一帧作为搜索图像帧,输入到特化网络中,回归出包围框,完成目标跟踪。The target tracking module, for an unknown video stream, marks the target to be tracked in the first frame of the video, and the backbone network will use it as a target template frame for training to obtain a specialized network; the video stream Each frame of is used as a search image frame, which is input into the specialized network, and the bounding box is regressed to complete the target tracking.
Further, the network configuration module includes:
a first feature output unit, configured to input paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
a second feature output unit, configured to obtain the feature maps T2 and S2 and input them into the multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
a feature input unit, configured to input the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
a feature positioning unit, in which the RPN network performs regression on the bounding box to obtain an accurate position estimate.
Further, the multiplexing module includes:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
According to another embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in any one of the above real-time target tracking methods.
In the real-time target tracking method, device, and storage medium of the embodiments of the present invention, the method includes: selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame; configuring a feature extraction network and an RPN network, inputting paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box; removing the RPN network, performing first-stage training of the backbone network on a classification image dataset, and, after training on the classification image dataset converges, appending the RPN network to the backbone network and performing second-stage training on an object detection dataset; and, for an unknown video stream, marking the target to be tracked in the first frame of the video, the backbone network using this frame as the target template frame for a single round of training to obtain a specialized network, then feeding each frame of the video stream into the specialized network as a search image frame and regressing the bounding box to complete target tracking. The present invention tracks targets in video based on a Siamese network and, addressing the large memory footprint and slow inference speed common to current Siamese networks, the network trained by the present invention achieves fast target tracking without losing too much accuracy.
Description of the drawings
FIG. 1 is a flowchart of the real-time target tracking method of the present invention;
FIG. 2 is a diagram of the network structure of the present invention;
FIG. 3 is a diagram of the spatial multiplexing module of the present invention;
FIG. 4 is a diagram of the channel multiplexing module of the present invention;
FIG. 5 is a structural diagram of the RPN network of the present invention;
FIG. 6 is a schematic diagram of the real-time target tracking device of the present invention.
Detailed description
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Referring to FIG. 1, according to an embodiment of the present invention, a real-time target tracking method is provided, including the following steps:
S100: selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame;
S200: configuring a feature extraction network and an RPN network, inputting paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps; the two feature maps are input into the RPN network, and the RPN network performs regression on the bounding box;
S300: removing the RPN network, performing first-stage training of the backbone network on a classification image dataset, and, after training on the classification image dataset converges, appending the RPN network to the backbone network and performing second-stage training on an object detection dataset;
S400: for an unknown video stream, marking the target to be tracked in the first frame of the video; the backbone network uses this frame as the target template frame for a single round of training to obtain a specialized network; each frame of the video stream is then fed into the specialized network as a search image frame, and the bounding box is regressed to complete target tracking.
The problem solved by the present invention is to track targets in video based on a Siamese network and, addressing the large memory footprint and slow inference speed common to current Siamese networks, to design a lightweight network that achieves fast target tracking without losing too much accuracy.
Specifically, the present invention trains and fine-tunes the backbone network from scratch and finally applies it within the Siamese network framework to achieve fast target tracking. The method of the present invention comprises four stages:
Data preparation stage: training is divided into a backbone network pre-training stage and a regression sub-network training stage. Two datasets are prepared, used respectively for the classification-task training of the backbone network and the localization-task training of the regression sub-network. For localization training, unlike an ordinary object detection task, the target crop is used as the target template frame and is fed into the network in a pair with the search image. During training, the search images are augmented with operations such as flipping, rotation, and shifting to prevent the network from overfitting.
Network configuration stage: the network can be divided into two parts, the feature extraction network and the RPN network. The feature extraction network contains the aforementioned multiplexing modules. The inputs are paired target template frames and search image frames, which share the same feature extraction network. Two feature maps are finally output and serve as the input of the RPN network. The RPN network is used to regress a precise bounding box: convolving the feature map obtained from the target template frame with the feature map obtained from the search image frame yields a heat response map, in which the locations with large response values are where the target is most likely to appear.
Network training stage: training uses the sum of the regression loss and the classification loss as the loss function; the classification loss uses the cross-entropy loss, and the regression loss uses the Smooth L1 loss. First, the RPN network is removed and the backbone network is trained on the classification dataset. After convergence, the RPN network is appended to the backbone network and the second stage of training is performed on the object detection dataset. SGD is used as the gradient descent algorithm in both stages.
Online tracking stage: online tracking adopts a one-shot training approach. For an unknown video, the target to be tracked is marked in the first frame, and the network is trained once with this frame as the target template frame, yielding a specialized network. Afterwards, each frame of the video stream is fed into the network as a search image frame, and the bounding box is regressed, completing the tracking.
By adopting lightweight modules and the backbone network composed of them, the present invention improves the inference speed of the target tracking network and reduces its memory footprint while maintaining accuracy. More details are given in the following description.
In an embodiment, step S100, selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame, is specifically:
S101: selecting two image datasets, a mini ImageNet dataset and the COCO2017 dataset; the mini ImageNet dataset consists of a number of images at a preset resolution, each annotated with a single category, and is used to pre-train the backbone network; the COCO2017 dataset is an object detection dataset consisting of a number of images at a preset resolution, each annotated with multiple categories and bounding box positions;
S102: performing data augmentation on the image datasets, the data augmentation including random flipping, blurring, and shifting of the images;
S103: generating anchors to localize the target on the search image frames.
Step S100 is the data preparation stage and specifically includes:
Step 1, data selection: training is divided into two stages, backbone network pre-training and regression sub-network training. The specific training configuration is described in detail in step S300. Two datasets are used in the two training stages, mini ImageNet (20 classes) and COCO2017. Mini ImageNet is a subset of ImageNet; 10,000 images at 224*224 resolution are selected, each annotated with a single category, and used to pre-train the backbone network. COCO2017 is an object detection dataset; 5,000 images are selected, each annotated with multiple categories and bounding box positions. When training the Siamese network, the annotated bounding box is cropped separately as the target template frame, the whole image is used as the search image frame, and the two are fed into the network as a pair.
Step 2, data augmentation: before training, the data needs to be expanded, i.e., augmented, by random flipping, blurring, and shifting of the images. Note that in the Siamese neural network only the search images need to be augmented; the target templates do not.
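As a concrete illustration, the following is a minimal sketch of the search-frame augmentation described above, assuming NumPy image arrays; the helper name, the shift range, and the blur kernel are illustrative assumptions rather than values given in this disclosure.

```python
# A minimal augmentation sketch for search image frames, assuming (H, W, 3) numpy arrays.
import numpy as np

def augment_search_frame(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly flip, blur, and shift a search image frame.

    Only the search frame is augmented; the target template frame is left untouched,
    as described in the data preparation stage.
    """
    out = img.astype(np.float32)
    if rng.random() < 0.5:                      # random horizontal flip
        out = out[:, ::-1, :]
    if rng.random() < 0.5:                      # crude 3x3 box blur (illustrative)
        out = (out
               + np.roll(out, 1, axis=0) + np.roll(out, -1, axis=0)
               + np.roll(out, 1, axis=1) + np.roll(out, -1, axis=1)) / 5.0
    if rng.random() < 0.5:                      # random shift of up to +/-8 pixels (assumed range)
        dy, dx = rng.integers(-8, 9, size=2)
        out = np.roll(out, (dy, dx), axis=(0, 1))
    return out.astype(img.dtype)

rng = np.random.default_rng(0)
search = np.zeros((354, 354, 3), dtype=np.uint8)   # search frame size used in this disclosure
augmented = augment_search_frame(search, rng)
```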
Step 3, anchor generation: locating the target on the search image relies on anchors. The anchors are generated as follows: base anchors are obtained first, stored in an array of dimensions [5, 4], where 4 is the length of the per-anchor vector [center x, center y, width, height] and 5 is the number of anchor expansion scales, set to [0.5, 0.66, 1, 1.5, 2].
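A minimal sketch of base-anchor generation and tiling with the five expansion scales listed above is given below; the base anchor size, the tiling stride, and the function names are illustrative assumptions, since only the scale list and the [5, 4] layout are specified here.

```python
# A minimal anchor-generation sketch: a square base anchor scaled by the five expansion
# scales, then tiled over every position of the response map.
import numpy as np

def generate_base_anchors(base_size: float = 64.0,
                          scales=(0.5, 0.66, 1.0, 1.5, 2.0)) -> np.ndarray:
    """Return a [5, 4] array of base anchors [cx, cy, w, h], one row per scale."""
    anchors = np.zeros((len(scales), 4), dtype=np.float32)
    for i, s in enumerate(scales):
        anchors[i] = [0.0, 0.0, base_size * s, base_size * s]
    return anchors

def tile_anchors(base: np.ndarray, feat_size: int, stride: int) -> np.ndarray:
    """Tile the base anchors over every position of a feat_size x feat_size response map."""
    offsets = (np.arange(feat_size) - feat_size // 2) * stride
    grid_y, grid_x = np.meshgrid(offsets, offsets, indexing="ij")
    tiled = np.zeros((feat_size, feat_size, len(base), 4), dtype=np.float32)
    tiled[..., 0] = grid_x[..., None] + base[:, 0]      # anchor center x per position
    tiled[..., 1] = grid_y[..., None] + base[:, 1]      # anchor center y per position
    tiled[..., 2] = base[:, 2]                          # width per scale
    tiled[..., 3] = base[:, 3]                          # height per scale
    return tiled

base = generate_base_anchors()
all_anchors = tile_anchors(base, feat_size=17, stride=16)   # 17x17 response map as in the RPN section
print(base.shape, all_anchors.shape)                         # (5, 4) (17, 17, 5, 4)
```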
In an embodiment, step S200, configuring the feature extraction network and the RPN network, inputting paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box, is specifically:
S201: inputting paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
S202: obtaining the feature maps T2 and S2 and inputting them into the multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
S203: inputting the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules;
S204: the RPN network performs regression on the bounding box to obtain an accurate position estimate.
Step S200 is the network model configuration stage and specifically includes:
Step 1, early feature extraction: the overall structure of the network is shown in FIG. 2. The inputs are a 98*98*3 target template frame and a 354*354*3 search image frame. They first pass through a shared 3*3*28 convolution Conv_1, yielding the 96*96*28 and 352*352*28 feature maps T1 and S1. Three depthwise separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise separable convolution layer consists of a channel-wise convolution and a point-wise convolution: the channel-wise convolution kernel is set to 3*3*28, with one kernel responsible for one channel, and the point-wise convolution kernel is set to 1*1*28 and fuses information along the channel dimension. The stride of DwConv_1 and DwConv_2 is set to 1 with padding set to same, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding set to 1, and its output feature maps T2 and S2 are half the size of T1 and S1, i.e., 48*48*28 and 176*176*28.
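The following PyTorch sketch reproduces Conv_1 and the three depthwise separable layers with the strides, paddings, and output sizes stated above; the activation function and other implementation details are assumptions, and the sketch is not the code of this disclosure.

```python
# A minimal sketch of the early feature-extraction stage (Conv_1 plus DwConv_1..3).
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, channels: int, stride: int, padding):
        super().__init__()
        # channel-wise (depthwise) 3x3 convolution: one kernel per channel
        self.depthwise = nn.Conv2d(channels, channels, 3, stride=stride,
                                   padding=padding, groups=channels)
        # point-wise 1x1 convolution: fuses information across channels
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.ReLU(inplace=True)            # activation choice is an assumption

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class EarlyFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 28, 3)                        # shared 3*3*28 convolution
        self.dwconv_1 = DepthwiseSeparable(28, stride=1, padding="same")
        self.dwconv_2 = DepthwiseSeparable(28, stride=1, padding="same")
        self.dwconv_3 = DepthwiseSeparable(28, stride=2, padding=1)

    def forward(self, x):
        x = self.conv_1(x)
        x = self.dwconv_1(x)
        x = self.dwconv_2(x)
        return self.dwconv_3(x)

net = EarlyFeatureExtractor()
template = torch.zeros(1, 3, 98, 98)
search = torch.zeros(1, 3, 354, 354)
print(net(template).shape, net(search).shape)   # [1, 28, 48, 48] and [1, 28, 176, 176]
```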
Step 2, multiplexed feature extraction: after the feature maps T2 and S2 are obtained, they are input into the multiplexing feature extraction module. The multiplexing feature extraction module is configured with three layers, denoted MPConv_1, MPConv_2, and MPConv_3. Each multiplexing feature extraction layer is in turn composed of three inverted residual modules (denoted InvResidual_1, InvResidual_2, and InvResidual_3) and two multiplexing modules (denoted Multiplexing Block1 and Multiplexing Block2). An inverted residual module consists of two point-wise convolutions and one channel-wise convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes of 1*1, 1*1, and 3*3, respectively; the size of the output feature map is unchanged, and the result is input into the multiplexing module. The multiplexing module is introduced in two parts below.
A Multiplexing Block consists of two parts, called the channel multiplexing module and the spatial multiplexing module. The spatial multiplexing module is shown in FIG. 3: a feature map with C channels is split into groups of C1, C2, and C3 channels, where C = C1 + C2 + C3. Two operations are defined, a reduction operation and an expansion operation, together with an operation factor r. The reduction operation multiplies the number of channels of a C*H*W feature map by r^2 and divides its width and height by r; the expansion operation is in principle the inverse of the reduction operation, dividing the number of channels by r^2 and multiplying the width and height by r. Each group then passes through its own group convolution to obtain a new feature map, after which the inverse operation is applied to restore the original shape and the groups are concatenated. Through the expansion and reduction operations, channel information is shared into the spatial dimension and spatial information is redistributed back into the channels, promoting the flow of information, while the use of group convolution reduces the parameter count.
For the channel multiplexing operation, as shown in FIG. 4, for a given feature map the invention selects L channels at a time for computation: they pass through a 1*1 convolution that reshapes the number of channels, then through the spatial multiplexing module, and then through another 1*1 convolution that restores the channels, while the remaining C-L channels are copied directly, so the computation on those C-L channels is avoided. To give every channel a chance of being computed, the channels are then rearranged, i.e., channel shuffle is applied. In this embodiment, this operation is repeated once and combined with the preceding spatial multiplexing module to form a new module, the multiplexed convolution module.
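A minimal sketch of this channel-multiplexing idea is given below: only L of the C channels are processed, the rest are copied, and a channel shuffle rearranges the channels so that all of them eventually take a turn. The pass through the spatial multiplexing module between the two 1*1 convolutions is omitted here, and the channel split and group count are illustrative assumptions.

```python
# A minimal sketch of channel multiplexing with channel shuffle.
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)     # split channels into groups
    x = x.transpose(1, 2).contiguous()           # interleave the groups
    return x.view(n, c, h, w)

class ChannelMultiplexing(nn.Module):
    def __init__(self, channels: int, selected: int, groups: int = 2):
        super().__init__()
        assert selected <= channels and channels % groups == 0
        self.selected = selected
        self.groups = groups
        self.reshape_in = nn.Conv2d(selected, selected, 1)    # 1x1 conv on the selected channels
        self.reshape_out = nn.Conv2d(selected, selected, 1)   # 1x1 conv restoring the channels

    def forward(self, x):
        active, passive = x[:, :self.selected], x[:, self.selected:]
        active = self.reshape_out(self.reshape_in(active))    # only L channels are computed
        out = torch.cat([active, passive], dim=1)             # the remaining C-L channels are copied
        return channel_shuffle(out, self.groups)              # rearrange so all channels take turns

block = ChannelMultiplexing(channels=56, selected=28)
y = block(torch.zeros(1, 56, 88, 88))
print(y.shape)   # torch.Size([1, 56, 88, 88])
```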
Multiplexing Block1 is configured so that the output size is unchanged; Multiplexing Block2 is configured so that the output size becomes half of the input and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are input into the RPN network for regression and classification, respectively.
The technical solution of the present invention proposes two lightweight modules, the spatial multiplexing module and the channel multiplexing module, and combines them into a new backbone network. The backbone network is trained and fine-tuned from scratch and is finally applied within the Siamese network framework to achieve fast target tracking.
Step 3, RPN network: the structure of the RPN network is shown in FIG. 5. The role of the RPN network is to perform regression on the bounding box and obtain an accurate position estimate. The RPN network consists of two parts: a classification branch, used to distinguish the target from the background, and a regression branch, which fine-tunes the candidate regions.
The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3, respectively. The target template features then pass through 3*3 convolution kernels to produce 4*4*(2k*224) and 4*4*(4k*224) features; obtaining a 4*4 feature size from the 6*6 input through a 3*3 kernel is straightforward, but note that the number of channels rises from 224 to 2k*224 and 4k*224. The channel count rises by a factor of 2k because k anchors are generated at each point of the feature map and each anchor can be classified as foreground or background, so the classification branch rises by a factor of 2k; as noted in the anchor generation step (step 3 of the data preparation stage), each anchor is described by four values, so the regression branch rises by a factor of 4k. The search image likewise passes through 3*3 convolution kernels to obtain two features, with the number of feature channels unchanged. For the classification branch, the 4*4*224 features of the 2k template anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image, producing the classification-branch response map; the regression branch is handled similarly, producing a 17*17*4k response map in which each point represents a vector of size 4k, denoted dx, dy, dw, dh; these four values measure the deviation of the anchor from the ground-truth bounding box. The response maps are computed as follows:
A_cls(17*17*2k) = [S3]_cls ⋆ [T3]_cls
A_reg(17*17*4k) = [S3]_reg ⋆ [T3]_reg
where [T3]_cls and [T3]_reg denote the features derived from the target template feature T3 and used as the convolution kernels, [S3]_cls and [S3]_reg denote the features derived from the search image feature S3, and ⋆ denotes the convolution operation.
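The cross-correlation step can be sketched as follows: the template features act as convolution kernels applied to the search features, producing the classification-branch response map; reshaping the template features into per-anchor kernels is an implementation assumption.

```python
# A minimal sketch of the RPN response-map computation.
import torch
import torch.nn.functional as F

def rpn_response(template_feat: torch.Tensor, search_feat: torch.Tensor, k: int) -> torch.Tensor:
    """template_feat: (1, 2k*224, 4, 4) kernel features from T3 (classification branch).
    search_feat: (1, 224, 20, 20) features from S3.
    Returns the (1, 2k, 17, 17) classification response map."""
    kernels = template_feat.view(2 * k, 224, 4, 4)        # one 224-channel kernel per output map
    return F.conv2d(search_feat, kernels)                  # 20 - 4 + 1 = 17

k = 5
template_cls = torch.zeros(1, 2 * k * 224, 4, 4)
search_cls = torch.zeros(1, 224, 20, 20)
print(rpn_response(template_cls, search_cls, k).shape)     # torch.Size([1, 10, 17, 17])
```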
In an embodiment, step S300 is the offline training stage and specifically includes:
Step 1, selection of the loss function: the loss function consists of two parts, the classification loss and the regression loss:
Loss = L_cls + λ·L_reg
The classification loss uses the cross-entropy loss:
μ(y, v) = log(1 + exp(-y·v))
L_cls = (1/|D|) · Σ_{u∈D} μ(y[u], v[u])
where v is a single response value output by the network, y is the label, D is the generated response map, and u is any position in the response map. The regression loss uses the Smooth L1 loss; the anchor coordinates are first normalized:
δ[0] = (T_x - A_x) / A_w,  δ[1] = (T_y - A_y) / A_h
δ[2] = ln(T_w / A_w),  δ[3] = ln(T_h / A_h)
where x, y, w, and h denote the center coordinates, width, and height of a box, and T and A denote the predicted box and the anchor, respectively.
The Smooth L1 loss function is:
smooth_L1(x, σ) = 0.5·σ²·x² if |x| < 1/σ², and |x| - 1/(2σ²) otherwise
L_reg = Σ_{i=0}^{3} smooth_L1(δ[i], σ)
where λ is an adjustable hyperparameter used to balance the two losses.
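A minimal sketch of the combined loss defined above follows; the σ value, the λ default, and the tensor shapes are illustrative assumptions.

```python
# A minimal sketch of the classification + regression loss.
import torch

def classification_loss(v: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the response map D: mean of log(1 + exp(-y*v)), with y in {-1, +1}."""
    return torch.log1p(torch.exp(-y * v)).mean()

def smooth_l1(x: torch.Tensor, sigma: float) -> torch.Tensor:
    cond = x.abs() < 1.0 / sigma ** 2
    return torch.where(cond, 0.5 * sigma ** 2 * x ** 2, x.abs() - 0.5 / sigma ** 2)

def regression_loss(delta: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """Smooth L1 summed over the four offsets and averaged over anchors; delta is the
    deviation between the predicted offsets and the normalized anchor targets."""
    return smooth_l1(delta, sigma).sum(dim=-1).mean()

def total_loss(v, y, delta, lam: float = 1.0) -> torch.Tensor:
    return classification_loss(v, y) + lam * regression_loss(delta)

v = torch.zeros(17 * 17 * 10)        # classification responses
y = torch.ones_like(v)               # labels in {-1, +1}
delta = torch.zeros(17 * 17 * 5, 4)  # offset deviations dx, dy, dw, dh
print(total_loss(v, y, delta))
```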
Step 2, training settings: first, the network backbone is trained on mini ImageNet (20 classes). At this point the stride of the first convolution layer Conv_1 is changed to 2 and the RPN network is removed. The input is then a 224*224*3 image and the output is a 7*7*224 feature map, which is connected to a three-layer fully connected neural network; the final output is a 20-dimensional vector representing the 20 classes. Training for 80-100 epochs on the 10,000 images gives good convergence, using SGD as the gradient descent algorithm with the learning rate decreased by 0.001 every 5 epochs. Next, the stride of Conv_1 is changed back to 1, the fully connected network is removed, and the regression and classification branches are attached; training is performed on COCO2017 with inputs of 98*98*3 and 354*354*3, using the loss function described in step 1 above. The network converges well after about 50 epochs of training, using SGD as the gradient descent algorithm with the learning rate decreased nonlinearly from 0.01 to 0.00001.
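The two learning-rate schedules mentioned above can be sketched as follows; the stage-one initial learning rate and the exact form of the nonlinear decay are illustrative assumptions, since only the decay step and the end points are stated.

```python
# A minimal sketch of the two learning-rate schedules.
import math

def stage1_lr(epoch: int, base_lr: float = 0.05) -> float:
    """Backbone pre-training: the learning rate drops by 0.001 every 5 epochs."""
    return base_lr - 0.001 * (epoch // 5)

def stage2_lr(epoch: int, total_epochs: int = 50,
              start: float = 0.01, end: float = 0.00001) -> float:
    """Siamese training: nonlinear (here exponential) decay from 0.01 to 0.00001."""
    t = epoch / max(total_epochs - 1, 1)
    return start * math.exp(t * math.log(end / start))

print([round(stage1_lr(e), 4) for e in (0, 5, 10)])    # [0.05, 0.049, 0.048]
print(stage2_lr(0), stage2_lr(49))                      # 0.01 ... 1e-05
```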
In an embodiment, step S400 is the online target tracking stage and specifically includes:
One-shot tracking: the invention treats the tracking task as a one-shot detection task. That is, a neural network is first learned; after learning is complete, the convolution kernel parameters of the convolution operation are obtained in the tracking stage by learning from the initial frame, yielding a specialized network, which is then used to track the subsequent frames.
In actual use, for an unknown video, the first frame is first annotated manually, and this frame is then used as the target template for one specialization training pass of the network. Afterwards, for each frame, the search image is passed through the network, the similarity measure is computed, and the position of the target is regressed.
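The one-shot tracking loop described above can be sketched as follows; the tracker interface names (crop_template, specialize, infer) are hypothetical and only illustrate the control flow.

```python
# A minimal sketch of the one-shot tracking loop: specialize on the first, manually
# annotated frame, then run every later frame through the specialized network.
from typing import Iterable, Tuple
import numpy as np

Box = Tuple[float, float, float, float]   # (cx, cy, w, h)

def track_video(frames: Iterable[np.ndarray], first_box: Box, tracker) -> list:
    """Specialize on the first frame, then regress a bounding box for every later frame."""
    frames = iter(frames)
    first_frame = next(frames)
    template = tracker.crop_template(first_frame, first_box)   # crop the annotated target
    tracker.specialize(template)                                # one-shot training pass
    boxes = [first_box]
    for frame in frames:
        boxes.append(tracker.infer(frame))                      # regress the bounding box
    return boxes
```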
Referring to FIG. 6, according to an embodiment of the present invention, a real-time target tracking device is provided, including:
a data selection module 100, configured to select the image data required for training and augment the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame;
a network configuration module 200, configured to configure a feature extraction network and an RPN network, input paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps; the two feature maps are input into the RPN network, and the RPN network performs regression on the bounding box;
a network training module 300, configured to remove the RPN network and perform first-stage training of the backbone network on a classification image dataset; after training on the classification image dataset converges, the RPN network is appended to the backbone network and second-stage training is performed on an object detection dataset; and
a target tracking module 400, configured to, for an unknown video stream, mark the target to be tracked in the first frame of the video; the backbone network uses this frame as the target template frame for a single round of training to obtain a specialized network; each frame of the video stream is then fed into the specialized network as a search image frame, and the bounding box is regressed to complete target tracking.
The problem solved by the present invention is to track targets in video based on a Siamese network and, addressing the large memory footprint and slow inference speed common to current Siamese networks, to design a lightweight network that achieves fast target tracking without losing too much accuracy.
In an embodiment, the network configuration module 200 includes:
a first feature output unit, configured to input paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
a second feature output unit, configured to obtain the feature maps T2 and S2 and input them into the multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
a feature input unit, configured to input the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
a feature positioning unit, in which the RPN network performs regression on the bounding box to obtain an accurate position estimate.
The network configuration module 200 handles the network model configuration stage, which specifically includes:
Step 1, early feature extraction: the overall structure of the network is shown in FIG. 2. The inputs are a 98*98*3 target template frame and a 354*354*3 search image frame. They first pass through a shared 3*3*28 convolution Conv_1, yielding the 96*96*28 and 352*352*28 feature maps T1 and S1. Three depthwise separable convolution layers are then configured, denoted DwConv_1, DwConv_2, and DwConv_3. A depthwise separable convolution layer consists of a channel-wise convolution and a point-wise convolution: the channel-wise convolution kernel is set to 3*3*28, with one kernel responsible for one channel, and the point-wise convolution kernel is set to 1*1*28 and fuses information along the channel dimension. The stride of DwConv_1 and DwConv_2 is set to 1 with padding set to same, so their output feature maps have the same size as the inputs T1 and S1; the stride of DwConv_3 is set to 2 with padding set to 1, and its output feature maps T2 and S2 are half the size of T1 and S1, i.e., 48*48*28 and 176*176*28.
Step 2, multiplexed feature extraction: after the feature maps T2 and S2 are obtained, they are input into the multiplexing feature extraction module. The multiplexing feature extraction module is configured with three layers, denoted MPConv_1, MPConv_2, and MPConv_3. Each multiplexing feature extraction layer is in turn composed of three inverted residual modules (denoted InvResidual_1, InvResidual_2, and InvResidual_3) and two multiplexing modules (denoted Multiplexing Block1 and Multiplexing Block2). An inverted residual module consists of two point-wise convolutions and one channel-wise convolution, denoted PwConv_1, PwConv_2, and CwConv_1, with kernel sizes of 1*1, 1*1, and 3*3, respectively; the size of the output feature map is unchanged, and the result is input into the multiplexing module. The multiplexing module is introduced in two parts below.
A Multiplexing Block consists of two parts, called the channel multiplexing module and the spatial multiplexing module. The spatial multiplexing module is shown in FIG. 3: a feature map with C channels is split into groups of C1, C2, and C3 channels, where C = C1 + C2 + C3. Two operations are defined, a reduction operation and an expansion operation, together with an operation factor r. The reduction operation multiplies the number of channels of a C*H*W feature map by r^2 and divides its width and height by r; the expansion operation is in principle the inverse of the reduction operation, dividing the number of channels by r^2 and multiplying the width and height by r. Each group then passes through its own group convolution to obtain a new feature map, after which the inverse operation is applied to restore the original shape and the groups are concatenated. Through the expansion and reduction operations, channel information is shared into the spatial dimension and spatial information is redistributed back into the channels, promoting the flow of information, while the use of group convolution reduces the parameter count.
For the channel multiplexing operation, as shown in FIG. 4, for a given feature map the invention selects L channels at a time for computation: they pass through a 1*1 convolution that reshapes the number of channels, then through the spatial multiplexing module, and then through another 1*1 convolution that restores the channels, while the remaining C-L channels are copied directly, so the computation on those C-L channels is avoided. To give every channel a chance of being computed, the channels are then rearranged, i.e., channel shuffle is applied. In this embodiment, this operation is repeated once and combined with the preceding spatial multiplexing module to form a new module, the multiplexed convolution module.
Multiplexing Block1 is configured so that the output size is unchanged; Multiplexing Block2 is configured so that the output size becomes half of the input and the number of channels is doubled. Each time the feature maps T2 and S2 pass through an MPConv layer, their sizes become, in turn, 24*24*56 and 88*88*56; 12*12*112 and 44*44*112; 6*6*224 and 22*22*224. The final feature maps are denoted T3 and S3 and are input into the RPN network for regression and classification, respectively.
The technical solution of the present invention proposes two lightweight modules, the spatial multiplexing module and the channel multiplexing module, and combines them into a new backbone network. The backbone network is trained and fine-tuned from scratch and is finally applied within the Siamese network framework to achieve fast target tracking.
Step 3, RPN network: the structure of the RPN network is shown in FIG. 5. The role of the RPN network is to perform regression on the bounding box and obtain an accurate position estimate. The RPN network consists of two parts: a classification branch, used to distinguish the target from the background, and a regression branch, which fine-tunes the candidate regions.
The target template and the search image pass through the aforementioned feature extraction network to obtain the 6*6*224 and 22*22*224 feature maps T3 and S3, respectively. The target template features then pass through 3*3 convolution kernels to produce 4*4*(2k*224) and 4*4*(4k*224) features; obtaining a 4*4 feature size from the 6*6 input through a 3*3 kernel is straightforward, but note that the number of channels rises from 224 to 2k*224 and 4k*224. The channel count rises by a factor of 2k because k anchors are generated at each point of the feature map and each anchor can be classified as foreground or background, so the classification branch rises by a factor of 2k; as noted in step 3 of the data preparation stage of step S100, each anchor is described by four values, so the regression branch rises by a factor of 4k. The search image likewise passes through 3*3 convolution kernels to obtain two features, with the number of feature channels unchanged. For the classification branch, the 4*4*224 features of the 2k template anchors are used as convolution kernels and convolved with the 20*20*224 features of the search image, producing the classification-branch response map; the regression branch is handled similarly, producing a 17*17*4k response map in which each point represents a vector of size 4k, denoted dx, dy, dw, dh; these four values measure the deviation of the anchor from the ground-truth bounding box. The response maps are computed as follows:
A_cls(17*17*2k) = [S3]_cls ⋆ [T3]_cls
A_reg(17*17*4k) = [S3]_reg ⋆ [T3]_reg
where [T3]_cls and [T3]_reg denote the features derived from the target template feature T3 and used as the convolution kernels, [S3]_cls and [S3]_reg denote the features derived from the search image feature S3, and ⋆ denotes the convolution operation.
In an embodiment, the multiplexing module includes:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
The technical solution of the present invention proposes two lightweight modules, the spatial multiplexing module and the channel multiplexing module, and combines them into a new backbone network. The backbone network is trained and fine-tuned from scratch and is finally applied within the Siamese network framework to achieve fast target tracking.
The beneficial effects of the Siamese-network-based target tracking method of the present application are:
1. A spatial multiplexing module is adopted, which fuses information in the feature maps through the expansion and reduction operations, reducing the parameter count while maintaining accuracy.
2. A channel multiplexing module is adopted, which improves computational efficiency through the channel shuffle and partial-selection operations.
3. Combining the above two modules, a new backbone network is devised; the network is trained from scratch until convergence, has relatively high inference efficiency, and can be applied to real-time single-target tracking tasks.
By introducing the two self-developed modules, channel multiplexing and spatial multiplexing, the present invention can achieve fast single-target tracking.
In an embodiment of the present invention, results on the VOT2018 dataset show that, compared with a ResNet50-based Siamese network, the memory footprint of the present invention is 43 MB, one-fifth of the former; the inference speed on an RTX 2080Ti graphics card is 83 FPS, a 3.3-fold improvement over the former's 25 FPS; and the accuracy of the method of the present invention is only 3% lower than that of the former, which is negligible in practical use.
Based on the above real-time target tracking method, this embodiment provides a computer-readable storage medium. The computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the real-time target tracking method of the above embodiments.
In addition, the specific process in which the processor loads and executes the multiple instructions in the above storage medium has been described in detail in the above method and is not repeated here.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A real-time target tracking method, characterized by comprising the following steps:
selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box, and the region cropped from the bounding box serves as a target template frame;
configuring a feature extraction network and an RPN network, inputting the paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box;
removing the RPN network, performing first-stage training of a backbone network on a classification image dataset, and, after training on the classification image dataset converges, appending the RPN network to the backbone network and performing second-stage training on an object detection dataset; and
for an unknown video stream, marking the target to be tracked in the first frame of the video, the backbone network using it as the target template frame to perform one round of training to obtain a specialized network; and inputting each frame of the video stream into the specialized network as a search image frame, and regressing the bounding box to complete target tracking.
2. The real-time target tracking method according to claim 1, characterized in that selecting the image data required for training and augmenting the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as the target template frame, is specifically:
selecting two image datasets, a mini ImageNet dataset and the COCO2017 dataset, wherein the mini ImageNet dataset consists of a number of images at a preset resolution, each annotated with a single category, and is used to pre-train the backbone network; the COCO2017 dataset is the object detection dataset, which consists of a number of images at a preset resolution, each annotated with multiple categories and bounding box positions;
performing data augmentation on the image datasets, the data augmentation comprising random flipping, blurring, and shifting of the images; and
generating anchors to localize the target on the search image frames.
3. The real-time target tracking method according to claim 1, characterized in that configuring the feature extraction network and the RPN network, inputting the paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, inputting the two feature maps into the RPN network, and the RPN network performing regression on the bounding box is specifically:
inputting the paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
obtaining the feature maps T2 and S2 and inputting them into a multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
inputting the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
the RPN network performing regression on the bounding box to obtain an accurate position estimate.
4. The real-time target tracking method according to claim 3, characterized in that the multiplexing module comprises:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
5. The real-time target tracking method according to claim 1, characterized in that the RPN network comprises two parts, a classification branch and a regression branch;
the classification branch is used to distinguish the target from the background; and
the regression branch is used to fine-tune candidate regions.
6. A real-time target tracking device, characterized by comprising:
a data selection module, configured to select the image data required for training and augment the search image frames to prevent network overfitting, wherein a search image frame is image data annotated with a bounding box and the region cropped from the bounding box serves as a target template frame;
a network configuration module, configured to configure a feature extraction network and an RPN network, input the paired search image frames and target template frames into the feature extraction network, the feature extraction network outputting two feature maps, input the two feature maps into the RPN network, and the RPN network performing regression on the bounding box;
a network training module, configured to remove the RPN network, perform first-stage training of a backbone network on a classification image dataset, and, after training on the classification image dataset converges, append the RPN network to the backbone network and perform second-stage training on an object detection dataset; and
a target tracking module, configured to, for an unknown video stream, mark the target to be tracked in the first frame of the video, the backbone network using it as the target template frame to perform one round of training to obtain a specialized network; and to input each frame of the video stream into the specialized network as a search image frame and regress the bounding box to complete target tracking.
7. The real-time target tracking device according to claim 6, characterized in that the network configuration module comprises:
a first feature output unit, configured to input the paired target template frames and search image frames into the feature extraction network, the feature extraction network outputting feature maps T2 and S2 through separable convolutions;
a second feature output unit, configured to obtain the feature maps T2 and S2 and input them into a multiplexing feature extraction module, the multiplexing feature extraction module outputting feature maps T3 and S3;
a feature input unit, configured to input the feature maps T3 and S3 into the RPN network for regression and classification, respectively, wherein the multiplexing feature extraction module is composed of three inverted residual modules and two multiplexing modules; and
a feature positioning unit, in which the RPN network performs regression on the bounding box to obtain an accurate position estimate.
8. The real-time target tracking device according to claim 7, characterized in that the multiplexing module comprises:
a channel multiplexing module, configured to share channel information into the spatial dimension and redistribute spatial information back into the channels through expansion and reduction operations, promoting the flow of information while using group convolution to reduce the parameter count; and
a spatial multiplexing module, configured to restore the channels through convolution and directly copy the remaining channels.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the real-time target tracking method according to any one of claims 1-5.
PCT/CN2022/078255 2022-02-28 2022-02-28 Real-time target tracking method, device, and storage medium WO2023159558A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/078255 WO2023159558A1 (en) 2022-02-28 2022-02-28 Real-time target tracking method, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023159558A1 (en)

Family

ID=87764401

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078255 WO2023159558A1 (en) 2022-02-28 2022-02-28 Real-time target tracking method, device, and storage medium

Country Status (1)

Country Link
WO (1) WO2023159558A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268256A1 (en) * 2017-03-16 2018-09-20 Aquifi, Inc. Systems and methods for keypoint detection with convolutional neural networks
CN110033478A (en) * 2019-04-12 2019-07-19 北京影谱科技股份有限公司 Visual target tracking method and device based on depth dual training
CN110570458A (en) * 2019-08-12 2019-12-13 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN111161316A (en) * 2019-12-18 2020-05-15 深圳云天励飞技术有限公司 Target object tracking method and device and terminal equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117292306A (en) * 2023-11-27 2023-12-26 四川迪晟新达类脑智能技术有限公司 Edge equipment-oriented vehicle target detection optimization method and device
CN117576164A (en) * 2023-12-14 2024-02-20 中国人民解放军海军航空大学 Remote sensing video sea-land movement target tracking method based on feature joint learning
CN117576164B (en) * 2023-12-14 2024-05-03 中国人民解放军海军航空大学 Remote sensing video sea-land movement target tracking method based on feature joint learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 22927840; Country of ref document: EP; Kind code of ref document: A1