WO2023216572A1 - Cross-video target tracking method, system, electronic device and storage medium - Google Patents

Cross-video target tracking method, system, electronic device and storage medium

Info

Publication number
WO2023216572A1
WO2023216572A1 PCT/CN2022/137022 CN2022137022W WO2023216572A1 WO 2023216572 A1 WO2023216572 A1 WO 2023216572A1 CN 2022137022 W CN2022137022 W CN 2022137022W WO 2023216572 A1 WO2023216572 A1 WO 2023216572A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
tracking
video
network
image
Prior art date
Application number
PCT/CN2022/137022
Other languages
English (en)
French (fr)
Inventor
胡金星
李东昊
尚佩晗
贾亚伟
何兵
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2023216572A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • the present application belongs to the technical field of visual target tracking, and particularly relates to a cross-video target tracking method, system, electronic device and storage medium.
  • Video security monitoring is gradually moving from digitization and grid to intelligence.
  • Intelligent surveillance systems are often used in social security places, large entertainment venues, and various road traffic places.
  • visual target tracking is a basic requirement, and it is also the basic task underlying subsequent high-level visual processing such as posture recognition and behavior analysis. For example, important areas such as government agencies or banks can be monitored automatically, suspicious behavior can be identified, and warnings can be issued when abnormal behavior is detected; vehicles can also be tracked and analyzed in real time, and traffic data can be used to achieve intelligent traffic management.
  • the problem to be solved by visual target tracking can be stated as follows: in a video sequence, given the position and size of the tracking target in the first frame, the position and size of the tracking target must be predicted in subsequent video frames.
  • according to the number of targets, visual target tracking technology can be divided into multi-target tracking and single-target tracking.
  • visual target tracking has a wide range of applications in real life, such as human-computer interaction (Human-Computer Interaction) and autonomous driving (Autonomous Driving).
  • intelligent surveillance systems can automatically track targets through a visual tracking model from a single camera perspective without relying on humans.
  • most current research on tracking methods for intelligent surveillance systems focuses on tracking moving targets across multi-camera collaborative surveillance ranges without overlapping fields of view, for example tracking based on target re-identification algorithms.
  • scene changes caused by switching between multiple videos will cause problems such as scale changes, appearance changes, illumination changes, occlusion and disappearance of the tracking target, making target tracking unstable and cross-video tracking difficult to realize.
  • secondly, cross-video target tracking lasts longer, and scale and appearance changes are more likely to occur than for short-term tracking targets.
  • the tracking model will produce tracking drift due to the accumulation of tracking errors, leading to tracking failure.
  • cross-video tracking involves joint analysis of multiple videos, which requires high real-time inference speed of the tracking algorithm, and complex tracking algorithm models cannot be directly applied. Therefore, how to optimize and improve the target tracking model, solve various problems existing in cross-video tracking scenarios, and achieve real-time continuous tracking of the same tracking target across video ranges under overlapping fields of view is one of the problems that urgently need to be solved in intelligent surveillance systems. It has important theoretical and applied significance.
  • This application provides a cross-video target tracking method, system, electronic device, and storage medium, aiming to solve one of the above technical problems in the prior art at least to a certain extent.
  • a cross-video object tracking method including:
  • the video image is mapped into a unified geographic coordinate space to obtain the global geographic coordinates of the video image;
  • the deep twin network tracking model includes a skeleton network, a self-attention network, a target estimation network and a dynamic template network,
  • the input of the skeleton network is the target template image and the target search image, and the skeleton network uses a depthwise separable convolution module, a multiplexed convolution module and an inverted residual module to output one-dimensional feature maps of the target template image and the target search image;
  • the self-attention network adopts an encoder-decoder architecture, its input is the output of the skeleton network, and the output is a two-dimensional feature map;
  • the target estimation network includes three network heads: an offset regression head, a scale prediction head and a target classification head; its input is the output of the self-attention network, and the outputs of the three network heads are respectively an offset regression map, a scale prediction map and a target classification map; the position of the tracking target in the target search image is obtained from the offset regression map, the scale prediction map and the target classification map, and the target prediction image is output;
  • the dynamic template network includes a three-layer feedforward neural network, the input of which is the output of the self-attention network, and the output is a Boolean value representing whether to update the target template image.
  • the skeleton network using the depthwise separable convolution module, the multiplexed convolution module and the inverted residual module to output the one-dimensional feature maps of the target template image and the target search image specifically includes:
  • three depthwise separable convolution layers DwConv_1, DwConv_2 and DwConv_3 are configured, and each depthwise separable convolution layer includes a channel-wise convolution and a point-wise convolution;
  • the inputs of DwConv_1, DwConv_2 and DwConv_3 are the feature maps T1 and S1; the feature maps output by DwConv_1 and DwConv_2 have the same size as T1 and S1, and DwConv_3 outputs the final first feature maps T2 and S2, whose sizes are half of T1 and S1 respectively;
  • the first feature maps are input into the multiplexed convolution module.
  • the multiplexed convolution module includes three convolutional layers, each of which includes three inverted residual modules and two multiplexing modules; the input of the multiplexed convolution module is the first feature maps T2, S2, and its output is the second feature maps T3, S3.
  • the self-attention network includes an encoder and a decoder
  • the encoder includes a first multi-head self-attention module, a first feedforward network, and a first residual normalization module
  • the input of the encoder is the feature map Z ∈ R^(h×w×d) of the target template image, where h and w are the width and height of the feature map Z respectively and d is the number of channels; the spatial dimensions of Z are compressed to one dimension, giving a sequence Z_0 ∈ R^(hw×d);
  • the decoder includes a second feedforward network, a second residual normalization module, a second multi-head self-attention module and a multi-head cross-attention module.
  • the input of the decoder is the feature map X ∈ R^(H×W×d) of the target search image, where H and W are the width and height of the feature map X respectively, with H > h and W > w; the decoder compresses the feature map X into a one-dimensional sequence X_0 ∈ R^(HW×d).
  • the target estimation network includes an offset regression head, a scale prediction head and a target classification head.
  • the offset regression head, scale prediction head and target classification head are respectively connected to the output of the self-attention network, and each of them includes three 1x1 convolutional layers and a Sigmoid function.
  • the offset regression head and the scale prediction head are used for target box regression and scale regression respectively, and their outputs are an offset regression map and a scale prediction map.
  • the target classification head is used for target classification, and its output is a target classification map whose values represent the occurrence probability of the tracking target under low-resolution discretization.
  • the dynamic template network includes a score network
  • the score network is a three-layer fully connected network followed by a Sigmoid activation function.
  • the input of the score network is the output of the self-attention network, flattened into one dimension; the output of the score network is a score.
  • the geographical mapping model is a geographical mapping model based on a thin plate spline function
  • the specific steps of using the trained geographical mapping model to map all video images to a unified geographical coordinate space are:
  • use ArcGIS to select a set number of geographic control points in the video image, put the geographic control points into one-to-one correspondence with pixels in the video image to find N matching points, apply the thin plate spline function to deform the N matching points to their corresponding positions, and compute N corresponding deformation functions; different video images are mapped into the unified geographic coordinate space through these deformation functions.
  • the cross-video target tracking using the tracking target handover algorithm is specifically:
  • the overlapping area between two video images is calculated through the polygon cropping algorithm, and the set of overlapping areas between all video images is recorded as A[n];
  • the point set P_i contains the predicted center-point pixel coordinates of the tracking target obtained by running the target tracking algorithm on all video images in the video set z[m]; if a center point exists in a video image, the center point is added to the point set P_i; if the video image does not have a center point, the video image is removed from the video set z[m];
  • the pixel coordinates of the point set Pi are converted into the geographical coordinate point set Pg through the thin plate spline function, and the ray method is used to determine whether the tracking target enters the overlapping area, and the overlapping area where the tracking target exists is recorded as A k ;
  • the target tracking algorithm is performed on all video images to which A k belongs.
  • a cross-video target tracking system including:
  • Target prediction module: used to determine the video image to be tracked and the initial target template image, input the video image and the initial target template image into the trained deep Siamese network tracking model, and output, through the deep Siamese network tracking model, the target prediction image of the tracking target in the video image;
  • Geographic coordinate mapping module: used to map the video image into a unified geographic coordinate space using a trained geographic mapping model, to obtain the global geographic coordinates of the video image;
  • Cross-video target handover module: used to calculate, on the basis of the mapped video images, the overlapping area between every two video images through the polygon clipping method, determine whether a target prediction image exists in the overlapping area, and, if the target prediction image exists, perform cross-video target tracking with the tracking target handover algorithm in the video images corresponding to that overlapping area.
  • an electronic device, the electronic device including a processor and a memory coupled to the processor, wherein:
  • the memory stores program instructions for implementing the cross-video target tracking method
  • the processor is configured to execute the program instructions stored in the memory to control cross-video object tracking.
  • a storage medium that stores program instructions executable by a processor, and the program instructions are used to execute the cross-video target tracking method.
  • the beneficial effects produced by the embodiments of the present application are: the cross-video target tracking method, system, electronic device and storage medium of the embodiments of the present application build a deep Siamese network tracking model based on a self-attention network and a dynamic template network.
  • the tracking model can adapt to long-term tracking, and its tracking results are largely unaffected by changes in the scale and appearance of the tracked target.
  • the tracking performance is stable in cross-video multi-view target tracking scenarios, which solves the problem that targets cannot be effectively tracked in cross-video tracking because of scale and appearance changes.
  • this application uses a pixel-to-unified geographical coordinate mapping model based on thin-plate spline functions, combines the polygon clipping method to calculate the overlapping area between videos, and uses the ray method to determine whether the tracking target enters the overlapping area based on geographical coordinates. This determines whether to switch videos to achieve continuous tracking of the tracking target across videos.
  • the cross-video tracking technology of the embodiment of the present application has a wide field of view, can continuously and stably track the target for a long time, and identifies the movement trajectory of the tracking target in a wide range, effectively assisting subsequent decision-making tasks. It saves manpower and material resources, ensures real-time tracking, and has high tracking accuracy.
  • Figure 1 is a schematic diagram of the Siamese network target tracking model according to the embodiment of the present application.
  • Figure 2 is a flow chart of the cross-video target tracking method according to the embodiment of the present application.
  • Figure 3 is a schematic structural diagram of the self-attention network according to the embodiment of the present application.
  • Figure 4 is a schematic structural diagram of the self-attention module according to the embodiment of the present application.
  • Figure 5 is a schematic diagram of geographical coordinate mapping according to the embodiment of the present application, in which (a) is a schematic diagram of control point selection, (b) is a schematic diagram of unified geographical coordinate mapping;
  • Figure 6 is a flow chart of the cross-video target handover algorithm according to the embodiment of the present application.
  • Figure 7 is a schematic structural diagram of the cross-video target tracking system according to the embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the cross-video target tracking method in the embodiment of the present application constructs a robust Siamese network target tracking model and a cross-video target handover algorithm, thereby achieving real-time continuous tracking of the same tracking target across video ranges under overlapping areas.
  • the Siamese network target tracking model in the embodiment of this application includes four parts: a skeleton network, a self-attention network (Transformer), a target estimation network and a dynamic template network.
  • the skeleton network is a lightweight network whose inputs are the initial target template image and the target search image; it uses a depthwise separable convolution module, a multiplexed convolution module and an inverted residual module to output one-dimensional feature maps of the target template image and the target search image.
  • the self-attention network uses an encoder-decoder architecture, its input is the output of the skeleton network, and the output is a two-dimensional feature map.
  • the target estimation network includes three network heads: an offset regression head, a scale prediction head and a target classification head. Its input is the output of the self-attention network.
  • the outputs of the three network heads are three kinds of response maps; the position of the tracking target in the target search image is obtained from the response maps, and the target prediction image is output.
  • the dynamic template network includes a three-layer feedforward neural network, whose input is the output of the self-attention network, and the output is a Boolean value representing whether to update the target template image.
  • the cross-video target handover algorithm is specifically as follows: pixel coordinates and unified geographic coordinates are mapped to each other through a geographic mapping model, so that the pixel coordinates of multiple different videos are brought into a unified geographic coordinate system to precisely locate the tracking target;
  • then the polygon clipping method is used to determine the overlapping areas of different videos, the ray method is used to determine whether the tracking target enters an overlapping area so as to determine the set of videos in which the tracking target exists, and the target tracking algorithm is started on that video set to perform cross-video online tracking of the tracking target.
  • the Siamese network target tracking model in the embodiment of this application is trained in three stages, namely: a backbone network training stage, a regression network training stage and a dynamic template network training stage.
  • the mini-ImageNet data set (20 categories) is used, and 20,000 images of different types are selected as the training set.
  • the skeleton network is trained, the self-attention network, target estimation network and dynamic template network are removed, and the stride of the first convolutional layer is changed to 1.
  • the output feature map is 7*7*224.
  • the feature map is input into a three-layer fully connected neural network, and a 20-dimensional vector is finally output, representing 20 categories.
  • the stochastic gradient descent method is used as the gradient update algorithm, the learning rate is halved every 5 rounds starting from 0.0001, and the loss function uses the classification loss cross-entropy loss function.
  • subsets of COCO2017 and Youtube-BB are used as training sets, and 4500 images are selected in each round to generate 4500 input pairs of target template images and target search images.
  • the output is the response map of the three network heads of the target estimation network, which are: bias regression map, scale prediction map and target classification map.
  • the ADAM optimization algorithm is used, the weight attenuation rate is 0.0001, the starting learning rate is set to 0.0001, and it is halved every 20 rounds.
  • the loss function is a joint loss function for regression and classification.
  • subsets of COCO2017 and Youtube-BB are used as training sets, and 4500 images are selected in each round to generate 4500 input pairs of target template images and target search images.
  • the dynamic template network is added, the parameters of the regression network are frozen, and only the classification network and the three-layer feedforward neural network in the dynamic template network are trained.
  • the stochastic gradient descent method was used as the optimization algorithm, the learning rate was set to 0.00001, and the learning rate was halved every 5 rounds.
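The three training stages above only state optimizers, learning rates and decay schedules; the sketch below shows one way those settings could be wired up in PyTorch. The module arguments (backbone, tracker, score_head) and the choice of `StepLR` are assumptions for illustration, not part of the original text.

```python
import torch.nn as nn
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import StepLR

def make_stage_optimizers(backbone: nn.Module, tracker: nn.Module, score_head: nn.Module):
    """Optimizer/scheduler settings matching the three stages described above
    (numerical values taken from the text; the module objects are placeholders)."""
    # Stage 1: backbone classification pre-training, SGD, LR 1e-4 halved every 5 epochs.
    opt1 = SGD(backbone.parameters(), lr=1e-4)
    sched1 = StepLR(opt1, step_size=5, gamma=0.5)

    # Stage 2: regression-network training, ADAM, LR 1e-4, weight decay 1e-4, halved every 20 epochs.
    opt2 = Adam(tracker.parameters(), lr=1e-4, weight_decay=1e-4)
    sched2 = StepLR(opt2, step_size=20, gamma=0.5)

    # Stage 3: dynamic-template training, SGD, LR 1e-5 halved every 5 epochs
    # (regression-branch parameters are expected to be frozen by the caller).
    opt3 = SGD(score_head.parameters(), lr=1e-5)
    sched3 = StepLR(opt3, step_size=5, gamma=0.5)
    return (opt1, sched1), (opt2, sched2), (opt3, sched3)
```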
  • FIG. 2 is a flow chart of the cross-video target tracking method according to the embodiment of the present application.
  • the cross-video target tracking method in the embodiment of this application includes the following steps:
  • S200 Input the video image and the initial target template image into the trained Siamese network target tracking model, and output the target prediction image of the tracking target in the video image through the Siamese network target tracking model;
  • the training process of the Siamese network target tracking model specifically includes:
  • S201 Obtain the target template image and the target search image, input the target template image and the target search image into the skeleton network, and output the first feature map of the target template image and the target search image through the skeleton network;
  • the skeleton network is the MPSiam skeleton network part in Figure 1, and its input is the 98*98*3 target template image and the 354*354*3 target search image.
  • the target template image and the target search image are passed through a shared 3*3*28 convolution kernel Conv_1, and feature maps of 96*96*28 and 352*352*28 are obtained respectively, recorded as T1 and S1.
  • Each depth-separable convolution layer includes channel-by-channel convolution and point-by-point convolution.
  • the convolution kernel in the channel-by-channel convolution is set to 3*3* 28.
  • One convolution kernel is only responsible for one channel; the convolution kernel in point-wise convolution is set to 1*1*28, which is used for information fusion in the channel direction.
  • the stride of DwConv_1 and DwConv_2 is set to 1, padding is set to same, the stride of DwConv_3 is set to 2, and padding is set to 1.
  • the inputs of DwConv_1, DwConv_2 and DwConv_3 are the feature maps T1 and S1; the feature maps output by DwConv_1 and DwConv_2 have the same size as T1 and S1, and DwConv_3 outputs the final first feature maps, recorded as T2 and S2, whose sizes are half of T1 and S1 respectively, i.e. the sizes of T2 and S2 are 48*48*28 and 176*176*28.
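A minimal PyTorch sketch of the depthwise separable layers DwConv_1/2/3 described above. Normalization and activation choices are not specified in the text, so they are omitted here; kernel counts and strides follow the values stated above.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Channel-wise 3x3 convolution followed by a point-wise 1x1 convolution,
    as used by DwConv_1/2/3 in the skeleton network (a sketch)."""
    def __init__(self, channels: int = 28, stride: int = 1):
        super().__init__()
        # groups=channels -> one 3x3 kernel per channel (channel-by-channel convolution)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   stride=stride, padding=1, groups=channels)
        # 1x1 convolution fuses information along the channel direction (point-by-point convolution)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# DwConv_1/2 keep the spatial size (stride 1); DwConv_3 halves it (stride 2):
t1 = torch.randn(1, 28, 96, 96)                                            # feature map T1
t2 = DepthwiseSeparableConv(stride=2)(DepthwiseSeparableConv()(DepthwiseSeparableConv()(t1)))
print(t2.shape)                                                            # -> torch.Size([1, 28, 48, 48]), matching T2
```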
  • S202 Input the first feature map into the multiplexed convolution module, and output the second feature map through the multiplexed convolution module;
  • the multiplexed convolution module includes three convolutional layers, namely MPConv_1, MPConv_2 and MPConv_3.
  • Each convolutional layer includes three inverted residual modules InvResidual_1, InvResidual_2, InvResidual_3 and two multiplexing modules MultiplexingBlock1 and MultiplexingBlock2.
  • Each inverted residual module consists of two point-wise convolutions and one channel-wise convolution, denoted as PwConv_1, CwConv_1 and PwConv_2.
  • the sizes of PwConv_1, CwConv_1 and PwConv_2 are 1*1, 3*3 and 1*1 respectively.
  • the inputs of the three inverted residual modules are the first feature maps T2 and S2 output by the skeleton network respectively, and the feature maps output by them are the same size as T2 and S2.
  • Multiplexing Block1 is set so that the output size remains unchanged
  • Multiplexing Block2 is set so that the output size is half the input size and doubles the number of channels.
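For the inverted residual modules, the text specifies the 1*1 / 3*3 / 1*1 structure (PwConv_1, CwConv_1, PwConv_2) and that the output size equals the input size; the sketch below follows that. The expansion ratio, activations and the internals of the multiplexing blocks are not detailed in the text, so they are assumptions here.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual module: point-wise expand -> channel-wise 3x3 -> point-wise project,
    with a skip connection (expansion ratio and ReLU6 are assumptions)."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),                            # PwConv_1 (1x1)
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),    # CwConv_1 (3x3, channel-wise)
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1),                            # PwConv_2 (1x1)
        )

    def forward(self, x):
        # Output size equals input size, as stated for InvResidual_1/2/3.
        return x + self.block(x)

x = torch.randn(1, 28, 48, 48)        # first feature map T2
print(InvertedResidual(28)(x).shape)  # -> torch.Size([1, 28, 48, 48])
```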
  • S203 Input the second feature map into the self-attention network, and the self-attention network restores the second feature map to a two-dimensional feature map through the multi-head self-attention mechanism;
  • the self-attention network is the Transformer structure in Figure 1.
  • the self-attention network is a feature extraction architecture based on the encoder-decoder (Encoder-Decoder), which can fully extract timing information.
  • the self-attention network includes two parts: the encoder and the decoder.
  • the encoder includes a first multi-head self-attention module (Multi-head Self-Attention), a first feedforward network (FFN) and first residual normalization modules (Add&Norm), where the number of first residual normalization modules is two.
  • let the input of the encoder be the feature map Z ∈ R^(h×w×d) of the target template image, where h and w are the width and height of the feature map Z respectively and d is the number of channels; the spatial dimensions of Z are compressed to one dimension, giving a sequence Z_0 ∈ R^(hw×d).
  • to satisfy permutation invariance, positional encoding is added to the self-attention network; the positional encoding uses sine and cosine transforms to model the affine transformation between two positions, so that the model can exploit the order information of the sequence.
  • the decoder includes a second feedforward network, a second residual normalization module, a second multi-head self-attention module and a multi-head cross-attention module (Multi-head Cross-Attention).
  • H and W are the width and height of the feature map X respectively, and H>h, W>w.
  • similar to the encoder, the feature map X is compressed into a one-dimensional sequence X_0 ∈ R^(HW×d); the final output of the decoder has the same size as the decoder input and is restored to R^(H×W×d) for the subsequent regression and classification.
  • the core module of the self-attention network in the embodiment of the present application is the multi-head attention module (Multi-head Attention), and the multi-head attention module is composed of multiple self-attention modules (Self-Attention).
  • the structure of the self-attention module is shown in Figure 4.
  • the self-attention module takes the above X_0 and Z_0 as input, performs linear projections through the three weighting matrices W_q, W_k and W_v, and obtains the key matrix (K), value matrix (V) and query matrix (Q) through Equation (1).
  • in the first multi-head self-attention module of the encoder, N_q = N_kv = hw and X_q = X_kv, i.e. its input is the feature map Z_0 of the target template image; in the second multi-head self-attention module of the decoder, N_q = N_kv = HW and X_q = X_kv, i.e. its input is the feature map X_0 of the target search image; in the multi-head cross-attention module of the decoder, N_q = HW, N_kv = hw, X_kv is the output of the encoder, and X_q is the output of the first half of the decoder.
  • P_q and P_k represent the position codes of X_q and X_kv; their unified calculation formula is the sine-cosine positional encoding, where pos represents the corresponding position on the feature map and 2i, 2i+1 index the even and odd dimensions.
  • the attention of a single self-attention module is Attn(X_q, X_kv, W) = Softmax(QK^T / sqrt(d_m)) V   (3), where W denotes the three weighting matrices W_q, W_k and W_v and d_m is the per-head channel dimension.
  • the multi-head attention module is composed of M self-attention modules; the input is passed to M different self-attention modules, and M output matrices A′ are obtained.
  • the channel dimension of each self-attention module is d/M; the M output matrices are then concatenated along the channel dimension, the dimension of the concatenated feature map returns to d, a linear transformation is performed through the matrix W′, and the resulting one-dimensional feature map is finally restored to a two-dimensional feature map, recorded as R^(H×W×d). The specific calculation is as follows:
  • MultiHeadAttn(X_q, X_kv) = [Attn(X_q, X_kv, W_1) … Attn(X_q, X_kv, W_M)] W′   (4)
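A PyTorch sketch of the multi-head (cross-)attention in the spirit of Equations (3)-(4): queries come from X_q, keys/values from X_kv, and the position codes P_q/P_k are added before the Q/K projections. The head count, bias terms and exact placement of the position codes are assumptions not fixed by the text.

```python
import math
import torch
import torch.nn as nn

class MultiHeadCrossAttention(nn.Module):
    """Sketch of multi-head attention: per-head scaled dot-product attention (Eq. (3)),
    heads concatenated and mixed by W' (Eq. (4))."""
    def __init__(self, d: int = 224, heads: int = 8):
        super().__init__()
        assert d % heads == 0
        self.m, self.dm = heads, d // heads            # M heads, per-head channel dimension d/M
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.wo = nn.Linear(d, d, bias=False)          # the matrix W' in Equation (4)

    def forward(self, xq, xkv, pq, pk):                # shapes: (Nq,d), (Nkv,d), (Nq,d), (Nkv,d)
        q = self.wq(xq + pq).view(-1, self.m, self.dm).transpose(0, 1)   # (M, Nq, dm)
        k = self.wk(xkv + pk).view(-1, self.m, self.dm).transpose(0, 1)  # (M, Nkv, dm)
        v = self.wv(xkv).view(-1, self.m, self.dm).transpose(0, 1)       # (M, Nkv, dm)
        a = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.dm), dim=-1)  # Eq. (3)
        out = (a @ v).transpose(0, 1).reshape(-1, self.m * self.dm)      # concatenate the M heads
        return self.wo(out)                                              # Eq. (4)

# Cross-attention between a 22*22 search sequence and a 6*6 template sequence:
xq, xkv = torch.randn(22 * 22, 224), torch.randn(6 * 6, 224)
pq, pk = torch.zeros_like(xq), torch.zeros_like(xkv)   # sinusoidal position codes omitted for brevity
print(MultiHeadCrossAttention()(xq, xkv, pq, pk).shape)  # -> torch.Size([484, 224])
```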
  • S204 Input the two-dimensional feature map into the target estimation network, and output the target prediction image through the target estimation network;
  • the present invention uses a target estimation network based on center-point prediction. Its structure is shown in Figure 1. It includes three independent network heads, namely the offset regression head, the scale prediction head and the target classification head; the three heads are connected to the output of the self-attention network respectively, and each network head contains three 1x1 convolutional layers and a Sigmoid function. The offset regression head and the scale prediction head are used for target box regression and scale regression respectively, and their outputs are the offset regression map and the scale prediction map; the target classification head is used for target classification, and its output is the target classification map.
  • the target classification head outputs a target classification map Y ∈ [0,1]^(Scale×1).
  • the formula of Scale is: Scale = Floor(H/s) × Floor(W/s)   (5)
  • H and W respectively represent the width and height of the target search image, s represents the scaling coefficient, and s = 16.
  • Floor represents the floor function, which ensures that the output feature map size is 22.
  • the value of the target classification map Y represents the occurrence probability of the tracking target under low-resolution discretization.
  • to recover the error caused by discretization, a local offset feature map O ∈ [0,1]^(Scale×2) is predicted; O represents the two response maps of the x-coordinate offset and the y-coordinate offset.
  • the predicted center-point position of the tracking target in the target search image is (x_c, y_c) = s(Argmax(Y′) + O(Argmax(Y′)))   (6), where Y′ represents the result of weighted cosine-window processing of the target classification map Y, used to suppress large outliers in Y.
  • the Argmax function returns the two-dimensional position of the peak of the target classification map Y; that is, the center-point position of the tracking target in the target search image is obtained by adding the corresponding offset value to the peak position in Y and then multiplying by the scaling coefficient.
  • a scale regression feature map S ∈ [0,1]^(Scale×2) is also generated, after which the size of the bounding box of the tracking target in the target search image is calculated as (w_bb, h_bb) = (W, H) * S(Argmax(Y′))   (7).
  • the * operation represents the Hadamard product.
  • the corner values of the bounding box are then calculated from the center value and the width and height values: (x_1, y_1) = (x_c − w_bb/2, y_c − h_bb/2) and (x_2, y_2) = (x_c + w_bb/2, y_c + h_bb/2).
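A NumPy sketch of the decoding step defined by Equations (6)-(7) and the corner computation above. The (row, column) axis convention, the 22×22 map size and the assumption that the cosine-window weighting has already produced Y′ are illustration choices, not fixed by the text.

```python
import numpy as np

def decode_box(Y, O, S, search_w=354, search_h=354, s=16):
    """Decode the tracking-target box from the three head outputs:
    Y (22x22) classification map Y', O (22x22x2) offsets, S (22x22x2) normalized width/height."""
    idx = np.unravel_index(np.argmax(Y), Y.shape)   # Argmax(Y') -> 2-D peak position (row, col)
    ox, oy = O[idx]                                 # sub-cell offset at the peak
    xc = s * (idx[1] + ox)                          # Eq. (6): centre x in search-image pixels
    yc = s * (idx[0] + oy)                          # Eq. (6): centre y
    w_bb = search_w * S[idx][0]                     # Eq. (7): box width
    h_bb = search_h * S[idx][1]                     # Eq. (7): box height
    x1, y1 = xc - w_bb / 2, yc - h_bb / 2           # top-left corner
    x2, y2 = xc + w_bb / 2, yc + h_bb / 2           # bottom-right corner
    return x1, y1, x2, y2
```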
  • S205 Input the target prediction image into the dynamic template network, and determine whether the target template image needs to be updated through the dynamic template network;
  • the present invention uses a dynamic template network to adapt to long-term target tracking. Its structure is shown in the upper part of Figure 1.
  • let the initial target template image be z and the initial target search image be x.
  • during model training and inference, the predicted target is cropped out of the final target prediction image and recorded as the dual template t.
  • t and z are input together into the skeleton network, and the feature maps obtained after the skeleton network are recorded as F_z and F_t.
  • the feature map F′_t after fusing the target template image z and the dual template t is calculated through a simple linear interpolation: F′_t = (1 − w)F_t + wF_z   (10)
  • w is a preset hyperparameter; preferably, w can be set to 0.7-0.8.
  • in some cases the target template image should not be updated, for example when the tracking target is occluded or moves out of view, or when the tracking model drifts; in the embodiment of the present application, if the target search image contains the tracking target, the target template image is updated through the dynamic template network.
  • the dynamic template network includes a score network (Scorehead), which is a three-layer fully connected network and is connected to a Sigmoid activation function.
  • the input of the score network is the output of the self-attention network, flattened into one dimension; the output of the score network is a score.
  • when the score is greater than the set threshold τ and the number of interval frames reaches F_u or more, the dynamic template network is triggered to update the target template; preferably, τ = 0.5 and F_u = 200.
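A small sketch of the update rule just described, combining the score threshold, the frame-interval condition and the linear-interpolation fusion of Equation (10). The function signature and array types are illustration choices.

```python
def maybe_update_template(score: float, frames_since_update: int,
                          f_t, f_z, tau: float = 0.5, f_u: int = 200, w: float = 0.75):
    """Update the working template feature only when the score head is confident
    (score > tau) and at least f_u frames have passed, then fuse the dual-template
    feature F_t with the initial template feature F_z: F'_t = (1 - w)*F_t + w*F_z."""
    if score > tau and frames_since_update >= f_u:
        fused = (1.0 - w) * f_t + w * f_z   # Eq. (10)
        return True, fused                  # template updated
    return False, f_t                       # keep the current template feature
```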
  • the training is divided into two stages. In the first stage, the dynamic template network is removed and the entire backbone network is trained. In the second stage, the parameters of the backbone network and the two regression branches are frozen, only the parameters of the classification branch are retained, and the dynamic template network is added, and then the dynamic template network and the classification branch are jointly trained.
  • the dynamic template network uses a cross-entropy loss for training: L_t = y_i log(P_i) + (1 − y_i) log(1 − P_i)   (11)
  • y_i represents the ground-truth label, and P_i represents the confidence on the classification-branch response map.
  • S300 Use the trained geographical mapping model to map all video images into a unified geographical coordinate space to obtain the global geographical coordinates of the video images;
  • the present invention adopts a geographical mapping model based on thin plate spline function (TPS).
  • the geographical mapping model maps pixel point coordinates and unified geographical coordinates, and can map the pixel coordinates of multiple different video images to a unified geographical coordinate system, so as to Achieve precise positioning of tracking targets.
  • the specific mapping method is: in two video images, select 15-20 geographic control points through ArcGIS, put the geographic control points into one-to-one correspondence with pixels in the video images, find N matching points, and apply the thin plate spline function to deform the N matching points to their corresponding positions while giving the deformation function of the entire space.
  • for N video images, N corresponding deformation functions can be calculated.
  • after the deformation functions are applied, different video images can be mapped into a unified geographic coordinate space, and once the video images have geographic attributes, pixel coordinates can be converted into global geographic coordinates.
  • (a) is a schematic diagram of control point selection
  • (b) is a schematic diagram of unified geographical coordinate mapping.
  • the total loss function is: ε = ε_f + α ε_d   (12)
  • α is a weight coefficient used to control the degree of non-rigid deformation; ε_f is the fitting term, which measures the distance between the origin points after passing through the deformation function and the tracking target points, and ε_d (Equation (14)) is the energy function measuring the degree of surface distortion; N is the number of control points. After minimizing the loss function, a closed-form solution can be obtained: f(p) = m_0 + M·p + Σ_{i=1}^{N} ω_i U(‖p_i − p‖)   (15)
  • U is the radial basis function, U(x) = x² log x, which represents the degree to which the deformation of a point on the surface is affected by the other control points.
  • Equation (15) can be understood as using the two parameters M and m_0 to fit a plane, and using the radial basis functions to fit the degree of bending on top of that plane. There are 3+N parameters in total; the more control points N are selected, the better the fitting effect.
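A NumPy sketch of fitting and applying the thin plate spline mapping in the form of Equation (15), assuming the unregularized case (α = 0) and the reconstructed equation numbering above; control-point non-degeneracy is assumed.

```python
import numpy as np

def fit_tps(px, geo):
    """Fit a 2-D thin-plate-spline mapping from N pixel control points `px` (N,2)
    to geographic control points `geo` (N,2): f(p) = m0 + M.p + sum_i w_i * U(|p_i - p|),
    with U(x) = x^2 log x, by solving the (N+3)-equation linear system."""
    n = len(px)
    d = np.linalg.norm(px[:, None, :] - px[None, :, :], axis=-1)
    K = np.where(d > 0, d**2 * np.log(d + 1e-12), 0.0)        # U(|p_i - p_j|)
    P = np.hstack([np.ones((n, 1)), px])                      # affine part [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.vstack([geo, np.zeros((3, 2))])
    return np.linalg.solve(A, b)                              # (w_1..w_N, m0, m1, m2) per output coordinate

def tps_map(params, px_ctrl, pts):
    """Map arbitrary pixel points `pts` (M,2) into the unified geographic space."""
    d = np.linalg.norm(pts[:, None, :] - px_ctrl[None, :, :], axis=-1)
    U = np.where(d > 0, d**2 * np.log(d + 1e-12), 0.0)
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ params[:len(px_ctrl)] + P @ params[len(px_ctrl):]
```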
  • S400 Calculate the overlapping area between two video images through the polygon clipping method, use the ray method to determine whether a target prediction image exists in the overlapping area, and start the tracking target handover algorithm in the video images corresponding to the overlapping area where the target prediction image exists, so as to perform cross-video tracking of the tracking target;
  • each geographically mapped video image is regarded as a convex polygon, and the problem of solving the overlapping area is transformed into the graphics problem of solving the overlapping area between two convex polygons.
  • the polygon clipping algorithm (Sutherland-Hodgman) adopts a divide-and-process, edge-by-edge clipping approach; its input is the vertex arrays of two convex polygons, and its output is the vertex array of the clipped convex polygon: Sutherland-Hodgman(PolyA[…], PolyB[…]) = PolyC[…]   (18)
  • PolyC represents the vertex array of the overlapping area between the two video images. Assuming that pairwise overlapping areas are calculated for seven video images, the resulting array of overlapping areas is recorded as [A_1, A_2, … A_n].
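A minimal Python sketch of Sutherland-Hodgman clipping of two convex polygons, matching the edge-by-edge scheme of Equation (18). Counter-clockwise vertex order is assumed; degenerate inputs are not handled.

```python
def sutherland_hodgman(subject, clip):
    """Clip convex polygon `subject` by convex polygon `clip` (lists of (x, y) vertices,
    counter-clockwise); returns the vertex array of the overlap region (PolyC)."""
    def inside(p, a, b):       # p is on the left of (or on) the directed clip edge a->b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p1, p2, a, b):   # intersection of segment p1->p2 with the line through a, b
        dx1, dy1 = p2[0] - p1[0], p2[1] - p1[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p1[0]) * dy2 - (a[1] - p1[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p1[0] + t * dx1, p1[1] + t * dy1)

    output = list(subject)
    for a, b in zip(clip, clip[1:] + clip[:1]):          # clip edge by edge
        poly, output = output, []
        for p1, p2 in zip(poly, poly[1:] + poly[:1]):
            if inside(p2, a, b):
                if not inside(p1, a, b):
                    output.append(intersect(p1, p2, a, b))
                output.append(p2)
            elif inside(p1, a, b):
                output.append(intersect(p1, p2, a, b))
        if not output:                                   # no overlap left
            break
    return output
```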
  • S402 Manually mark a target template to be tracked on any one video image, add that video image to the initial video set z[m], and input the target template image of the tracking target, the video set z[m] and the overlapping-area set A[n] as parameters into the target handover algorithm;
  • S403 Start the target handover algorithm and first obtain a point set P_i, which contains the predicted center-point pixel coordinates of the tracking target obtained by running the target tracking algorithm on all video images in the video set z[m]; if the tracking target can be tracked in a certain video image, a center point exists in that video image and is added to the point set P_i; if no center point exists in the video image, the tracking target is considered to have left that video image, and the video image is removed from the video set z[m].
  • S404 Determine whether the point set P_i is empty; if it is empty, execute S405, otherwise execute S406. The point set P_i being empty means that the tracking target cannot be tracked in any video image of the video set z[m]; in this case P_i is an empty point set.
  • S405 End the target handover algorithm, use the last appearing tracking target as the new target template image, and rerun the target tracking algorithm on all video images;
  • the embodiment of the present application sets the number of times to re-run the target tracking algorithm to five. If the tracking target cannot be found after five runs, the tracking target is considered to have left the field of view and the algorithm ends. If the tracking target is tracked again within these five runs, the re-tracked target is set as the new target template image, the video in which the tracking target last appeared is set as the initial video, and the target tracking algorithm is executed again.
  • S406 Convert the pixel coordinates of the point set Pi into the geographical coordinate point set Pg through the thin plate spline function, and use the ray method to determine whether the tracking target enters the overlapping area, and record the overlapping area with the tracking target as A k ;
  • in theory, the geographic coordinates in P_g should all be the same, but in practice there may be slight errors; therefore the geographic coordinates in P_g are averaged or their centroid is taken, and it is then determined from the resulting geographic coordinate whether P_g has entered an overlapping area in A[n], i.e. whether a point lies within a polygon area.
  • the idea of the ray method is: (1) if the point is inside the polygon, a ray emitted from the point will first cross the polygon boundary from inside to outside; (2) if the point is outside the polygon, a ray emitted from the point will first cross the boundary from outside into the polygon.
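A standard ray-casting (even-odd) point-in-polygon sketch corresponding to the ray method above; the horizontal ray direction is an implementation choice.

```python
def point_in_polygon(pt, poly):
    """Cast a horizontal ray from `pt` to the right and count crossings of the edges of
    `poly` (list of (x, y) vertices); an odd count means the point is inside."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):                              # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)   # x of the crossing point
            if x_cross > x:                                   # crossing lies to the right of pt
                inside = not inside
    return inside
```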
  • the embodiment of the present application transfers tracking targets between different video images through the judgment of overlapping areas, realizes cross-video target transfer, and ultimately achieves the purpose of cross-video target tracking.
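The sketch below strings the pieces of steps S402-S406 together into one loop: run the single-video tracker on the active video set, drop videos that lose the target, retry with the last template when the point set is empty, and expand the active set when the averaged geographic position falls inside an overlap region. The callables `tracker`, `to_geo` (TPS mapping) and `in_polygon` (ray test), the data layout of `overlaps`, and the retry bookkeeping are assumptions; this is a schematic outline, not the patented procedure verbatim.

```python
def cross_video_handover(videos, template, overlaps, tracker, to_geo, in_polygon,
                         max_retries=5):
    """Skeleton of the cross-video target handover loop (a sketch)."""
    active = list(videos)                             # current video set z[m]
    retries = 0
    while active and retries < max_retries:
        centers = {}                                  # point set P_i
        for v in list(active):
            c = tracker(v, template)                  # predicted centre pixel, or None
            if c is None:
                active.remove(v)                      # target left this video
            else:
                centers[v] = c
        if not centers:                               # P_i empty -> re-run on all videos
            active = list(videos)
            retries += 1
            continue
        retries = 0
        geo = [to_geo(v, c) for v, c in centers.items()]     # point set P_g
        g = tuple(sum(x) / len(x) for x in zip(*geo))        # average out small mapping errors
        for region, members in overlaps:              # A[n]: (polygon, videos sharing it)
            if in_polygon(g, region):                 # ray-method test -> region A_k found
                active = list(members)                # hand the target over to those videos
                break
    return active
```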
  • the cross-video target tracking algorithm of the embodiment of the present invention can reach a running rate of 25 frames per second, has good tracking robustness, and has high accuracy.
  • the cross-video target tracking method of the embodiment of the present application, by constructing a deep Siamese network tracking model based on a self-attention network and a dynamic template network, can adapt to long-term tracking; its tracking results are largely unaffected by changes in the scale and appearance of the tracked target, the tracking performance is stable in cross-video multi-view target tracking scenarios, and the problem of being unable to effectively track targets due to scale and appearance changes in cross-video tracking is solved.
  • this application uses a pixel-to-unified geographical coordinate mapping model based on thin-plate spline functions, combines the polygon clipping method to calculate the overlapping area between videos, and uses the ray method to determine whether the tracking target enters the overlapping area based on geographical coordinates.
  • the cross-video tracking technology of the embodiment of the present application has a wide field of view, can continuously and stably track the target for a long time, and identifies the movement trajectory of the tracking target in a wide range, effectively assisting subsequent decision-making tasks. It saves manpower and material resources, ensures real-time tracking, and has high tracking accuracy.
  • FIG 7 is a schematic structural diagram of a cross-video target tracking system according to an embodiment of the present application.
  • the cross-video target tracking system 40 in this embodiment of the present application includes:
  • Target prediction module 41: used to determine the video image to be tracked and the initial target template image, input the video image and the initial target template image into the trained deep Siamese network tracking model, and output, through the deep Siamese network tracking model, the target prediction image of the tracking target in the video image;
  • Geographic coordinate mapping module 42: used to map the video image into a unified geographic coordinate space using the trained geographic mapping model, to obtain the global geographic coordinates of the video image;
  • Cross-video target handover module 43: used to calculate, on the basis of the mapped video images, the overlapping area between every two video images through the polygon clipping method, determine whether a target prediction image exists in the overlapping area, and perform cross-video target tracking with the tracking target handover algorithm in the video images corresponding to the overlapping area where the target prediction image exists.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • the memory 52 stores program instructions for implementing the above cross-video object tracking method.
  • the processor 51 is used to execute program instructions stored in the memory 52 to control cross-video target tracking.
  • the processor 51 can also be called a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip with signal processing capabilities.
  • the processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • FIG. 9 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the storage medium of the embodiment of the present application stores a program file 61 that can implement all the above methods.
  • the program file 61 can be stored in the above storage medium in the form of a software product and includes a number of instructions for making a computer device (which may be a personal computer, a server, a network device, etc.) or a processor execute all or part of the steps of the various embodiments of the present invention.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk and other media that can store program code, or electronic equipment such as computers, servers, mobile phones and tablets.

Abstract

The present application relates to a cross-video target tracking method, system, electronic device and storage medium. The method includes: determining a video image to be tracked and an initial target template image; inputting the video image and the initial target template image into a trained deep Siamese network tracking model, and outputting, through the deep Siamese network tracking model, a target prediction image of the tracking target in the video image; mapping the video image into a unified geographic coordinate space using a trained geographic mapping model to obtain the global geographic coordinates of the video image; on the basis of the mapped video images, calculating the overlapping area between every two video images by means of a polygon clipping method, determining whether a target prediction image exists in the overlapping area, and, in the video images corresponding to the overlapping area where the target prediction image exists, performing cross-video target tracking using a tracking target handover algorithm. The present application can track a target continuously and stably over a long period of time, guarantees real-time tracking, and achieves high tracking accuracy.

Description

Cross-video target tracking method, system, electronic device and storage medium
Technical Field
The present application belongs to the technical field of visual target tracking, and in particular relates to a cross-video target tracking method, system, electronic device and storage medium.
Background Art
With the development of science and technology, people's quality of life keeps improving and the demand for safety is becoming increasingly urgent. The video security monitoring industry has developed rapidly, different applications have raised different requirements for monitoring technology, and video security monitoring is gradually moving from digitization and grid-based management toward intelligent systems. Intelligent surveillance systems are widely used in public security venues, large entertainment venues and various road traffic scenarios. In an intelligent surveillance system, visual target tracking is a basic requirement and is also the basic task underlying subsequent high-level visual processing such as posture recognition and behavior analysis. For example, important areas such as government agencies or banks can be monitored automatically, suspicious behavior can be identified, and warnings can be issued when abnormal behavior is detected; vehicles can also be tracked and analyzed in real time, and traffic data can be used to realize intelligent traffic management. The problem to be solved by visual target tracking can be stated as follows: in a video sequence, given the position and size of the tracking target in the first frame, the position and size of the tracking target must be predicted in subsequent video frames. According to the number of targets, visual target tracking technology can be divided into multi-target tracking and single-target tracking. Visual target tracking has a wide range of applications in real life, such as human-computer interaction (Human-Computer Interaction) and autonomous driving (Autonomous Driving).
At present, intelligent surveillance systems can automatically track targets through a visual tracking model from a single-camera viewpoint without human intervention. Most current research on tracking methods for intelligent surveillance systems focuses on tracking moving targets across multi-camera collaborative surveillance ranges without overlapping fields of view, for example tracking based on target re-identification algorithms. However, in multi-camera collaborative cross-video tracking scenarios, scene changes caused by switching between videos lead to problems such as scale changes, appearance changes, illumination changes, occlusion and target disappearance, making target tracking unstable and cross-video tracking difficult to achieve. Secondly, cross-video target tracking lasts longer, scale and appearance changes are more likely to occur than in short-term tracking, and the tracking model will drift due to the accumulation of tracking errors, leading to tracking failure. In addition, cross-video tracking involves the joint analysis of multiple videos, which places high demands on the real-time inference speed of the tracking algorithm, so complex tracking models cannot be applied directly. Therefore, how to optimize and improve the target tracking model, solve the various problems existing in cross-video tracking scenarios, and achieve real-time continuous tracking of the same target across video ranges under overlapping fields of view is one of the problems that urgently need to be solved in intelligent surveillance systems; it has important theoretical and application significance.
Summary of the Invention
The present application provides a cross-video target tracking method, system, electronic device and storage medium, aiming to solve at least one of the above technical problems in the prior art to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
A cross-video target tracking method, including:
determining a video image to be tracked and an initial target template image;
inputting the video image and the initial target template image into a trained deep Siamese network tracking model, and outputting, through the deep Siamese network tracking model, a target prediction image of the tracking target in the video image;
mapping the video image into a unified geographic coordinate space using a trained geographic mapping model to obtain the global geographic coordinates of the video image;
on the basis of the mapped video images, calculating the overlapping area between every two video images by the polygon clipping method, determining whether a target prediction image exists in the overlapping area, and performing cross-video target tracking with the tracking target handover algorithm in the video images corresponding to the overlapping area where the target prediction image exists.
The technical solution adopted in the embodiments of the present application further includes: the deep Siamese network tracking model includes a skeleton network, a self-attention network, a target estimation network and a dynamic template network,
the input of the skeleton network is a target template image and a target search image, and the skeleton network uses a depthwise separable convolution module, a multiplexed convolution module and an inverted residual module to output one-dimensional feature maps of the target template image and the target search image;
the self-attention network adopts an encoder-decoder architecture, its input is the output of the skeleton network, and its output is a two-dimensional feature map;
the target estimation network includes three network heads, namely an offset regression head, a scale prediction head and a target classification head; its input is the output of the self-attention network, and the outputs of the three network heads are respectively an offset regression map, a scale prediction map and a target classification map; the position of the tracking target in the target search image is obtained from the offset regression map, the scale prediction map and the target classification map, and the target prediction image is output;
the dynamic template network includes a three-layer feedforward neural network, its input is the output of the self-attention network, and its output is a Boolean value indicating whether to update the target template image.
The technical solution adopted in the embodiments of the present application further includes: the skeleton network using the depthwise separable convolution module, the multiplexed convolution module and the inverted residual module to output the one-dimensional feature maps of the target template image and the target search image specifically includes:
passing the target template image and the target search image through a shared convolution kernel Conv_1 to obtain feature maps T1 and S1 of the target template image and the target search image respectively;
configuring three depthwise separable convolution layers DwConv_1, DwConv_2 and DwConv_3, each of which includes a channel-wise convolution and a point-wise convolution; the inputs of DwConv_1, DwConv_2 and DwConv_3 are the feature maps T1 and S1, the feature maps output by DwConv_1 and DwConv_2 have the same size as T1 and S1, and DwConv_3 outputs the final first feature maps T2 and S2, whose sizes are half of T1 and S1 respectively;
inputting the first feature maps into the multiplexed convolution module, where the multiplexed convolution module includes three convolutional layers, each of which includes three inverted residual modules and two multiplexing modules; the input of the multiplexed convolution module is the first feature maps T2 and S2, and its output is the second feature maps T3 and S3.
The technical solution adopted in the embodiments of the present application further includes: the self-attention network includes an encoder and a decoder; the encoder includes a first multi-head self-attention module, a first feedforward network and first residual normalization modules; the input of the encoder is the feature map Z ∈ R^(h×w×d) of the target template image, where h and w are the width and height of the feature map Z respectively and d is the number of channels; the spatial dimensions of Z are compressed to one dimension, giving a sequence Z_0 ∈ R^(hw×d);
the decoder includes a second feedforward network, a second residual normalization module, a second multi-head self-attention module and a multi-head cross-attention module; the input of the decoder is the feature map X ∈ R^(H×W×d) of the target search image, where H and W are the width and height of the feature map X respectively, with H > h and W > w; the decoder compresses the feature map X into a one-dimensional sequence X_0 ∈ R^(HW×d).
The technical solution adopted in the embodiments of the present application further includes: the target estimation network includes an offset regression head, a scale prediction head and a target classification head, which are respectively connected to the output of the self-attention network; each of the offset regression head, the scale prediction head and the target classification head contains three 1x1 convolutional layers and a Sigmoid function; the offset regression head and the scale prediction head are used for target box regression and scale regression respectively, and their outputs are an offset regression map and a scale prediction map; the target classification head is used for target classification, and its output is a target classification map whose values represent the occurrence probability of the tracking target under low-resolution discretization.
The technical solution adopted in the embodiments of the present application further includes: the dynamic template network includes a score network, which is a three-layer fully connected network followed by a Sigmoid activation function; the input of the score network is the output of the self-attention network, flattened into one dimension; the output of the score network is a score, and when the score is greater than a set threshold τ and the number of interval frames reaches F_u or more, the dynamic template network is triggered to update the target template image.
The technical solution adopted in the embodiments of the present application further includes: the geographic mapping model is a geographic mapping model based on a thin plate spline function, and mapping all video images into the unified geographic coordinate space using the trained geographic mapping model specifically includes:
selecting a set number of geographic control points in the video image through ArcGIS, putting the geographic control points into one-to-one correspondence with pixels in the video image to find N matching points, applying the thin plate spline function to deform the N matching points to their corresponding positions while computing N corresponding deformation functions, and mapping different video images into the unified geographic coordinate space through the deformation functions.
The technical solution adopted in the embodiments of the present application further includes: performing cross-video target tracking with the tracking target handover algorithm specifically includes:
calculating the overlapping area between every two video images through the polygon clipping algorithm, and recording the set of overlapping areas between all video images as A[n];
adding the video image to an initial video set z[m], manually marking a target template to be tracked on any one video image, and inputting the target template image of the tracking target, the video set z[m] and the overlapping-area set A[n] into the target handover algorithm;
starting the target handover algorithm and obtaining a point set P_i, which contains the predicted center-point pixel coordinates of the tracking target obtained by running the target tracking algorithm on all video images in the video set z[m]; if a center point exists in a video image, adding the center point to the point set P_i; if no center point exists in the video image, removing the video image from the video set z[m];
determining whether the point set P_i is empty; if the point set P_i is empty, ending the target handover algorithm, taking the last-appearing tracking target as a new target template image, and re-running the target tracking algorithm on all video images; if the point set P_i is not empty,
converting the pixel coordinates of the point set P_i into a geographic coordinate point set P_g through the thin plate spline function, using the ray method to determine whether the tracking target enters an overlapping area, and recording the overlapping area where the tracking target exists as A_k;
executing the target tracking algorithm on all video images to which A_k belongs.
Another technical solution adopted in the embodiments of the present application is: a cross-video target tracking system, including:
a target prediction module: used to determine a video image to be tracked and an initial target template image, input the video image and the initial target template image into a trained deep Siamese network tracking model, and output, through the deep Siamese network tracking model, a target prediction image of the tracking target in the video image;
a geographic coordinate mapping module: used to map the video image into a unified geographic coordinate space using a trained geographic mapping model, to obtain the global geographic coordinates of the video image;
a cross-video target handover module: used to calculate, on the basis of the mapped video images, the overlapping area between every two video images by the polygon clipping method, determine whether a target prediction image exists in the overlapping area, and perform cross-video target tracking with the tracking target handover algorithm in the video images corresponding to the overlapping area where the target prediction image exists.
Yet another technical solution adopted in the embodiments of the present application is: an electronic device, the electronic device including a processor and a memory coupled to the processor, wherein:
the memory stores program instructions for implementing the cross-video target tracking method;
the processor is configured to execute the program instructions stored in the memory to control cross-video target tracking.
Yet another technical solution adopted in the embodiments of the present application is: a storage medium storing program instructions executable by a processor, the program instructions being used to execute the cross-video target tracking method.
Compared with the prior art, the beneficial effects produced by the embodiments of the present application are as follows: the cross-video target tracking method, system, electronic device and storage medium of the embodiments of the present application build a deep Siamese network tracking model based on a self-attention network and a dynamic template network; the model can adapt to long-term tracking, its tracking results are largely unaffected by changes in the scale and appearance of the tracking target, the tracking performance is stable in cross-video multi-view target tracking scenarios, and the problem that targets cannot be effectively tracked in cross-video tracking due to scale and appearance changes is solved. Secondly, the present application uses a pixel-to-unified-geographic-coordinate mapping model based on thin plate spline functions, calculates the overlapping areas between videos with the polygon clipping method, and uses the ray method to determine, from the geographic coordinates, whether the tracking target enters an overlapping area, thereby deciding whether to switch videos and realizing continuous cross-video tracking of the tracking target. Compared with the prior art, the cross-video tracking technology of the embodiments of the present application has a wide field of view, can track the target continuously and stably for a long time, and identifies the movement trajectory of the tracking target over a wide range, effectively assisting subsequent decision-making tasks; it saves manpower and material resources, guarantees real-time tracking, and achieves high tracking accuracy.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the Siamese network target tracking model according to an embodiment of the present application;
Figure 2 is a flow chart of the cross-video target tracking method according to an embodiment of the present application;
Figure 3 is a schematic structural diagram of the self-attention network according to an embodiment of the present application;
Figure 4 is a schematic structural diagram of the self-attention module according to an embodiment of the present application;
Figure 5 is a schematic diagram of geographic coordinate mapping according to an embodiment of the present application, in which (a) is a schematic diagram of control point selection and (b) is a schematic diagram of unified geographic coordinate mapping;
Figure 6 is a flow chart of the cross-video target handover algorithm according to an embodiment of the present application;
Figure 7 is a schematic structural diagram of the cross-video target tracking system according to an embodiment of the present application;
Figure 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Figure 9 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description of the Embodiments
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
To remedy the deficiencies of the prior art, the cross-video target tracking method of the embodiments of the present application builds a robust Siamese network target tracking model and constructs a cross-video target handover algorithm, thereby realizing real-time continuous tracking of the same tracking target across video ranges under overlapping areas. As shown in Figure 1, the Siamese network target tracking model of the embodiments of the present application includes four parts: a skeleton network, a self-attention network (Transformer), a target estimation network and a dynamic template network. The skeleton network is a lightweight network whose inputs are the initial target template image and the target search image; it uses a depthwise separable convolution module, a multiplexed convolution module and an inverted residual module to output one-dimensional feature maps of the target template image and the target search image. The self-attention network adopts an encoder-decoder architecture; its input is the output of the skeleton network, and its output is a two-dimensional feature map. The target estimation network includes three network heads, namely an offset regression head, a scale prediction head and a target classification head; its input is the output of the self-attention network, the outputs of the three heads are three kinds of response maps, the position of the tracking target in the target search image is obtained from the response maps, and the target prediction image is output. The dynamic template network includes a three-layer feedforward neural network; its input is the output of the self-attention network, and its output is a Boolean value indicating whether to update the target template image. The cross-video target handover algorithm is specifically as follows: pixel coordinates and unified geographic coordinates are mapped to each other through a geographic mapping model, so that the pixel coordinates of multiple different videos are brought into a unified geographic coordinate system to precisely locate the tracking target; then the polygon clipping method is used to determine the overlapping areas of different videos, the ray method is used to determine whether the tracking target enters an overlapping area so as to determine the set of videos in which the tracking target exists, and the target tracking algorithm is started on that video set to perform cross-video online tracking of the tracking target.
Further, the Siamese network target tracking model of the embodiments of the present application is trained in three stages: a backbone network training stage, a regression network training stage and a dynamic template network training stage.
In the backbone network training stage, the mini-ImageNet dataset (20 classes) is used, and 20,000 images of different classes are selected as the training set. In this stage only the skeleton network is trained: the self-attention network, the target estimation network and the dynamic template network are removed, and the stride of the first convolutional layer is changed to 1. The output feature map is then 7*7*224; it is fed into a three-layer fully connected neural network, which finally outputs a 20-dimensional vector representing the 20 classes. During training, stochastic gradient descent is used as the gradient update algorithm, the learning rate starts at 0.0001 and is halved every 5 epochs, and the loss function is the cross-entropy classification loss.
In the regression network training stage, subsets of COCO2017 and Youtube-BB are used as the training set, and 4,500 images are selected in each epoch to generate 4,500 input pairs of target template images and target search images. The fully connected network is removed, the self-attention network and the target estimation network are added, and the outputs are the response maps of the three heads of the target estimation network: the offset regression map, the scale prediction map and the target classification map. During model training, the ADAM optimization algorithm is used with a weight decay of 0.0001; the initial learning rate is set to 0.0001 and is halved every 20 epochs. The loss function is a joint regression-and-classification loss.
In the dynamic template network training stage, subsets of COCO2017 and Youtube-BB are used as the training set, and 4,500 images are selected in each epoch to generate 4,500 input pairs of target template images and target search images. The dynamic template network is added, the parameters of the regression network are frozen, and only the classification network and the three-layer feedforward neural network in the dynamic template network are trained. During network training, stochastic gradient descent is used as the optimization algorithm, the learning rate is set to 0.00001, and it is halved every 5 epochs.
Specifically, please refer to Figure 2, which is a flow chart of the cross-video target tracking method of an embodiment of the present application. The cross-video target tracking method of the embodiment of the present application includes the following steps:
S100: acquiring a video image to be tracked and an initial target template image;
S200: inputting the video image and the initial target template image into the trained Siamese network target tracking model, and outputting, through the Siamese network target tracking model, a target prediction image of the tracking target in the video image;
In this step, the training process of the Siamese network target tracking model specifically includes:
S201: acquiring a target template image and a target search image, inputting the target template image and the target search image into the skeleton network, and outputting first feature maps of the target template image and the target search image through the skeleton network;
In this step, the skeleton network is the MPSiam skeleton network part in Figure 1, whose inputs are a 98*98*3 target template image and a 354*354*3 target search image. First, the target template image and the target search image are passed through a shared 3*3*28 convolution kernel Conv_1, yielding feature maps of 96*96*28 and 352*352*28 respectively, recorded as T1 and S1. Then three depthwise separable convolution layers DwConv_1, DwConv_2 and DwConv_3 are configured; each depthwise separable convolution layer includes a channel-wise convolution and a point-wise convolution. The convolution kernel of the channel-wise convolution is set to 3*3*28, with one kernel responsible for only one channel; the convolution kernel of the point-wise convolution is set to 1*1*28 and is used for information fusion along the channel direction. The stride of DwConv_1 and DwConv_2 is set to 1 with padding set to 'same'; the stride of DwConv_3 is set to 2 with padding set to 1. The inputs of DwConv_1, DwConv_2 and DwConv_3 are the feature maps T1 and S1; the feature maps output by DwConv_1 and DwConv_2 have the same size as T1 and S1, and DwConv_3 outputs the final first feature maps, recorded as T2 and S2, whose sizes are half of T1 and S1 respectively, i.e. the sizes of the first feature maps T2 and S2 are 48*48*28 and 176*176*28.
S202: inputting the first feature maps into the multiplexed convolution module, and outputting second feature maps through the multiplexed convolution module;
In this step, as shown in Figure 1, the multiplexed convolution module includes three convolutional layers, MPConv_1, MPConv_2 and MPConv_3. Each convolutional layer includes three inverted residual modules InvResidual_1, InvResidual_2 and InvResidual_3, and two multiplexing modules MultiplexingBlock1 and MultiplexingBlock2. Each inverted residual module consists of two point-wise convolutions and one channel-wise convolution, recorded as PwConv_1, CwConv_1 and PwConv_2, whose sizes are 1*1, 3*3 and 1*1 respectively. The inputs of the three inverted residual modules are the first feature maps T2 and S2 output by the skeleton network, and the feature maps they output have the same size as T2 and S2.
MultiplexingBlock1 is configured so that the output size remains unchanged, and MultiplexingBlock2 is configured so that the output size is half of the input size with the number of channels doubled. After the first feature maps T2 and S2 pass through the three convolutional layers MPConv_1, MPConv_2 and MPConv_3 in sequence, their sizes become 24*24*56 and 88*88*56, then 12*12*112 and 44*44*112, and finally 6*6*224 and 22*22*224; the second feature maps finally output are recorded as T3 and S3, with sizes 6*6*224 and 22*22*224 respectively.
S203: inputting the second feature maps into the self-attention network, which restores the second feature maps to a two-dimensional feature map through the multi-head self-attention mechanism;
In this step, as shown in Figure 1, the self-attention network is the Transformer structure in Figure 1; it is a feature extraction architecture based on an encoder-decoder design and can fully extract temporal information.
Please also refer to Figure 3, a schematic structural diagram of the self-attention network of an embodiment of the present application. The self-attention network includes two parts, an encoder and a decoder. The encoder includes a first multi-head self-attention module (Multi-head Self-Attention), a first feedforward network (FFN) and first residual normalization modules (Add&Norm), where the number of first residual normalization modules is two. Let the input of the encoder be the feature map Z ∈ R^(h×w×d) of the target template image, where h and w are the width and height of the feature map Z respectively and d is the number of channels. The spatial dimensions of Z are compressed to one dimension, giving a sequence Z_0 ∈ R^(hw×d). To satisfy permutation invariance (Permutation-invariant), positional encoding (Positional Encoding) must be added to the self-attention network; the positional encoding uses sine and cosine transforms to model the affine transformation between two positions, so that the model can exploit the order information of the sequence.
The decoder includes a second feedforward network, a second residual normalization module, a second multi-head self-attention module and a multi-head cross-attention module (Multi-head Cross-Attention). Let the input of the decoder be the feature map X ∈ R^(H×W×d) of the target search image, where H and W are the width and height of the feature map X respectively, with H > h and W > w. Similar to the encoder, the feature map X is compressed into a one-dimensional sequence X_0 ∈ R^(HW×d); the final output of the decoder has the same size as the decoder input and is restored to R^(H×W×d) for subsequent regression and classification.
Based on the above, the core module of the self-attention network in the embodiments of the present application is the multi-head attention module (Multi-head Attention), which is composed of multiple self-attention modules (Self-Attention); the structure of the self-attention module is shown in Figure 4. The self-attention module takes the above X_0 and Z_0 as input, performs linear projections through the three weighting matrices W_q, W_k and W_v, and obtains the key matrix (K), value matrix (V) and query matrix (Q) through the following formula:
Q = (X_q + P_q)W_q,  K = (X_kv + P_k)W_k,  V = X_kv W_v   (1)
where X_q and X_kv denote the query-side and key/value-side input sequences, of lengths N_q and N_kv respectively. In the first multi-head self-attention module of the encoder, N_q = N_kv = hw and X_q = X_kv, indicating that its input is the feature map Z_0 of the target template image. In the second multi-head self-attention module of the decoder, N_q = N_kv = HW and X_q = X_kv, indicating that its input is the feature map X_0 of the target search image. In the multi-head cross-attention module of the decoder, N_q = HW and N_kv = hw, X_kv is the output of the encoder, and X_q is the output of the first half of the decoder. P_q and P_k denote the positional encodings of X_q and X_kv, with the unified calculation formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))   (2)
where pos denotes the corresponding position on the feature map and 2i, 2i+1 index the even and odd dimensions.
As shown in Figure 4, the inner product of Q and K, i.e. QK^T, is computed first. To prevent the inner product from becoming too large, it is divided by the square root of the per-head channel dimension d_m. Then, taking matrix rows as units, the Softmax function computes the coefficients of each row with respect to the other rows, and the output two-dimensional feature map is recorded as A. The inner product of A and V then gives the final attention matrix:
Attn(X_q, X_kv, W) = Softmax(QK^T / sqrt(d_m)) V   (3)
where W denotes the three weighting matrices W_q, W_k and W_v.
Since the multi-head attention module is assembled from M self-attention modules, the input is passed to M different self-attention modules to obtain M output matrices A′, the channel dimension of each self-attention module being d_m = d/M. The M output matrices are then concatenated along the channel dimension, the dimension of the concatenated feature map returns to d, a linear transformation is performed through the matrix W′, and the resulting one-dimensional feature map is finally restored to a two-dimensional feature map, recorded as R^(H×W×d). The specific calculation process is as follows:
MultiHeadAttn(X_q, X_kv) = [Attn(X_q, X_kv, W_1) … Attn(X_q, X_kv, W_M)] W′   (4)
S204:将二维特征图输入目标估计网络,通过目标估计网络输出目标预测图像;
本步骤中,本发明使用基于中心点预测的目标估计网络,其结构如图1所示,包括三个独立的网络头,分别为偏置回归头、尺度预测头和目标分类 头,三个网络头分别连接到自注意力网络的输出,每个网络头分别包含三个1x1卷积层和一个Sigmoid函数,偏置回归头和尺度预测头分别用于目标框回归和尺度回归,输出分别为偏置回归图和尺度预测图,目标分类头用于目标分类,其输出为目标分类图。
具体的,目标分类头输出一张目标分类图Y∈[0,1] Scale×1,Scale的公式如下:
Figure PCTCN2022137022-appb-000010
其中,H,W分别代表目标搜索图像的宽和高,s代表缩放系数,优选地,本申请实施例中,s=16。Floor代表取地板函数,可以保证输出的特征图尺寸为22。目标分类图Y的值代表了跟踪目标在低分辨率离散化情况下的出现概率。
Since the tracked target ultimately has to be located on the original target search image, the discretization introduces an offset error when positions are mapped back to the original image. To recover this error, a local offset map O ∈ [0,1]^{Scale×2} is predicted, where O consists of two response maps for the x-coordinate and y-coordinate offsets. The center position of the tracked target in the target search image is then predicted as:

(x_c, y_c) = s·(Argmax(Y′) + O(Argmax(Y′)))    (6)

where Y′ is the result of applying a weighted cosine window to the target classification map Y, which suppresses large outliers in Y, and the Argmax function returns the two-dimensional position of the peak of the target classification map. That is, the center position of the tracked target in the target search image is obtained by adding the corresponding offset to the peak position of Y and multiplying by the scaling factor.
For the scale prediction head, a scale regression map S ∈ [0,1]^{Scale×2} is likewise generated, and the size of the bounding box of the tracked target in the target search image is then computed as:

(w_bb, h_bb) = (W, H) * S(Argmax(Y′))    (7)

where the * operation denotes the Hadamard (element-wise) product.
The corner coordinates of the bounding box are then computed from the center position and the width and height:

(x_1, y_1) = (x_c − w_bb/2, y_c − h_bb/2)    (8)

(x_2, y_2) = (x_c + w_bb/2, y_c + h_bb/2)    (9)
S205: input the target prediction image into the dynamic template network, and determine, through the dynamic template network, whether the target template image needs to be updated;
In this step, the present invention uses a dynamic template network to adapt to long-term tracking of the tracked target. Its structure is indicated in the upper part of Figure 1. Denote the initial target template image by z and the initial target search image by x. During model training and inference, the predicted target is cropped from the latest target prediction image and denoted the dual template t. Both t and z are fed into the backbone network, and the resulting feature maps are denoted F_z and F_t. The fused feature map F′_t of the target template image z and the dual template t is then computed by simple linear interpolation:

F′_t = (1 − w)·F_t + w·F_z    (10)

where w is a preset hyper-parameter; preferably, w may be set to 0.7-0.8.
During target tracking there are situations in which the target template image should not be updated, for example when the tracked target is occluded or moves out of view, or when the tracking model drifts. In this embodiment of the present application, the target template image is updated through the dynamic template network only if the target search image contains the tracked target. As shown in Figure 1, the dynamic template network comprises a score network (Scorehead), which is a three-layer fully connected network followed by a Sigmoid activation function. The input of the score network is the output of the self-attention network, flattened to one dimension; its output is a score. Only when this score exceeds a set threshold τ and at least F_u frames have elapsed since the last update is the dynamic template network triggered to update the target template. Preferably, τ = 0.5 and F_u = 200.
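A minimal sketch of this template-update logic: the fusion of equation (10) plus the score/frame-interval gate described above. The run-time values (score, elapsed frames) are placeholders for illustration only.

```python
import numpy as np

def fuse_template(F_t, F_z, w=0.75):
    """Equation (10): F't = (1 - w) * F_t + w * F_z, with w typically 0.7-0.8."""
    return (1 - w) * F_t + w * F_z

def should_update(score, frames_since_update, tau=0.5, F_u=200):
    """Score-network gate: update only on a confident score after at least F_u frames."""
    return score > tau and frames_since_update >= F_u

F_z = np.random.randn(28, 48, 48)    # backbone features of the initial template z
F_t = np.random.randn(28, 48, 48)    # backbone features of the dual template t
if should_update(score=0.9, frames_since_update=250):
    fused = fuse_template(F_t, F_z)  # used as the template features for later frames
    print(fused.shape)
```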
When training the dynamic template network, training it together with the main network may lead to convergence to a sub-optimal solution. The training is therefore split into two stages. In the first stage, the dynamic template network is removed and the whole main network is trained. In the second stage, the parameters of the main network and of the two regression branches are frozen, only the parameters of the classification branch are kept, the dynamic template network is added, and the dynamic template network and the classification branch are trained jointly. In this embodiment of the present application, the dynamic template network is trained with the cross-entropy loss:

L_t = −[y_i·log(P_i) + (1 − y_i)·log(1 − P_i)]    (11)

where y_i denotes the ground-truth label and P_i denotes the confidence on the response map of the classification branch.
S300: map all the video images into a unified geographic coordinate space using the trained geographic mapping model, obtaining global geographic coordinates of the video images;
In this step, in order to hand over the target between different video views, all videos need to be mapped into one unified geographic coordinate space. The present invention adopts a geographic mapping model based on the thin-plate spline (TPS), which maps pixel coordinates to unified geographic coordinates, so that the pixel coordinates of multiple different video images can be placed in a unified geographic coordinate system and the tracked target can be located precisely. The mapping is established as follows: in the video images, 15-20 geographic control points are selected through ArcGIS and matched one-to-one with pixel points in the video image, giving N matched points; the thin-plate spline deforms the N matched points to their corresponding positions and at the same time yields a deformation function for the whole space. For N video images, N corresponding deformation functions are computed, and applying these deformation functions maps the different video images into the unified geographic coordinate space. Once the video images carry geographic attributes, pixel coordinates can be converted to global geographic coordinates. This is illustrated in Figure 5, where (a) shows the selection of control points and (b) shows the mapping to unified geographic coordinates.
Specifically, let p denote an original point on a video image and q the corresponding point after displacement. Once several control points have been displaced in this way, the whole image plane is inevitably warped; the purpose of the thin-plate spline is to fit a function that gives the displacement of every point on the surface. To describe this interpolation, two terms are defined. The first is the fitting term ε_f, which measures the distance between the deformed original points and the target points. The second is the bending term ε_d, which measures the degree of warping of the surface. The total loss function is therefore:

ε = ε_f + α·ε_d    (12)
where α is a weight coefficient that controls the degree of non-rigid deformation. The two terms expand as follows:

ε_f = Σ_{i=1..N} ||q_i − f(p_i)||²    (13)

ε_d = ∫∫ [ (∂²f/∂x²)² + 2(∂²f/∂x∂y)² + (∂²f/∂y²)² ] dx dy    (14)
where N is the number of control points and ||q_i − f(p_i)||² is the distance between a control point transformed by the deformation function f and the corresponding target point; equation (14) is the energy functional describing the degree of warping of the surface. Minimizing the loss function yields a closed-form solution:

f(p) = m_0 + M·p + Σ_{i=1..N} ω_i·U(|p_i − p|)    (15)
where p denotes an arbitrary point on the surface, p_i denotes a selected control point and M = (m_1, m_2). U is the radial basis function, which expresses how strongly the deformation at a point on the surface is influenced by the other control points, and is defined as:

U(x) = x²·log x    (16)

ω_i is the weight of the corresponding radial basis function. Equation (15) can be understood as fitting a plane with the parameters M and m_0 and then fitting the bending on top of that plane with the radial basis functions. There are 3 + N parameters in total; the more control points N are selected, the better the fit.
Given the closed-form solution, solving equation (17) below, which amounts to solving a linear system of N + 3 equations, yields the parameter set (ω_i, m_0, m_1, m_2). With these parameters, any point on any video can be mapped into the unified geographic coordinate space through the function f:

K·ω + P·m = q,  P^T·ω = 0    (17)

where K_ij = U(|p_i − p_j|), the i-th row of P is (1, p_i), ω = (ω_1, …, ω_N)^T and m = (m_0, m_1, m_2)^T.
S400: compute the overlap regions between every pair of video images by polygon clipping, use the ray-casting method to determine whether a target prediction image exists in the overlap regions, start the tracked-target hand-off algorithm in the video images corresponding to the overlap regions in which a target prediction image exists, and track the target across videos;
In this step, as shown in Figure 6, which is a flowchart of the cross-video target hand-off algorithm of an embodiment of the present application, the algorithm specifically comprises the following steps:
S401: compute the overlap regions between every pair of video images with the polygon clipping algorithm, denote all video images to be predicted as [Z_1, Z_2, …, Z_7], and denote the set of overlap regions between the video images as A[n];
The overlap regions between pairs of video images are shown in Figure 7. They are computed as follows: each geographically mapped video image is regarded as a convex polygon, so that finding the overlap region reduces to the computational-geometry problem of finding the overlap of two convex polygons. The polygon clipping algorithm (Sutherland-Hodgman) proceeds by splitting the problem and clipping edge by edge; its input is the vertex arrays of the two convex polygons and its output is the vertex array of the clipped convex polygon:

Sutherland-Hodgman(PolyA[…], PolyB[…]) = PolyC[…]    (18)

where PolyC denotes the vertex array of the overlap region between the two video images. Assuming pairwise overlap regions are computed for seven video images, the resulting overlap regions are recorded as the array [A_1, A_2, …, A_n].
S402: manually mark a target template to be tracked on any one of the video images, add that video image to the initial video set z[m], and pass the target template image of the tracked target, the video set z[m] and the overlap-region set A[n] as parameters into the target hand-off algorithm;
S403: start the target hand-off algorithm and first obtain a point set P_i, which contains the predicted center-point pixel coordinates obtained by running the target tracking algorithm on all video images in the video set z[m]. If the tracked target can be tracked in a video image, that image has a center point, which is added to the point set P_i; if a video image has no center point, the tracked target is considered to have left that image, and the image is removed from the video set z[m].
S404: determine whether the point set P_i is empty; if the point set P_i is empty, execute S405, otherwise execute S406;
An empty point set P_i means that the tracked target cannot be tracked in any video image of the video set z[m]; in this case P_i is an empty point set.
S405: end the target hand-off algorithm, take the most recent appearance of the tracked target as the new target template image, and rerun the target tracking algorithm on all video images;
Preferably, this embodiment of the present application sets the number of reruns of the target tracking algorithm to five. If the tracked target is not found after five runs, it is considered to have left the field of view and the algorithm ends. If the tracked target is re-acquired within the five runs, the re-acquired target is set as the new target template image, the video in which the target last appeared is set as the initial video, and the target tracking algorithm is executed again.
S406: convert the pixel coordinates of the point set P_i into a set of geographic coordinate points P_g through the thin-plate spline, use the ray-casting method to determine whether the tracked target has entered an overlap region, and denote the overlap region containing the tracked target as A_k;
Since the conversion is into the unified geographic space, the geographic coordinates in P_g should all coincide; in practice there may be small errors, so the geographic coordinates in P_g are averaged or their centroid is taken, and it is then determined from the geographic coordinates whether P_g has entered one of the overlap regions in A[n], i.e. whether a point lies inside a polygonal region.
Specifically, this embodiment of the present application makes the decision with the ray-casting method from computational geometry. The idea of the ray-casting method is: (1) if the point is inside the polygon, a ray leaving the point must first cross the boundary going out of the polygon; (2) if the point is outside the polygon, a ray leaving the point must first cross the boundary going into the polygon.
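A sketch of the ray-casting test used to decide whether the (averaged) geographic point of the target lies inside an overlap polygon: cast a horizontal ray to the right and count edge crossings; an odd count means the point is inside.

```python
def point_in_polygon(pt, poly):
    x, y = pt
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        straddles = (y1 > y) != (y2 > y)                    # edge crosses the ray's y level
        if straddles and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside                             # toggle on each crossing
    return inside

overlap = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(point_in_polygon((1, 1), overlap), point_in_polygon((5, 1), overlap))  # True False
```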
S407: execute the target tracking algorithm on all video images to which A_k belongs;
Specifically, first find, among the video images to which A_k belongs, the one that is not yet in the video set z[m], denote it Z_k, and add Z_k to the video set z[m]. In this way, in the first step of the next round of the cross-video target hand-off algorithm, Z_k is traversed and target tracking is started on it. By checking the overlap regions, this embodiment hands the tracked target over between different video images, achieving cross-video target hand-off and, ultimately, cross-video target tracking. The cross-video target tracking algorithm of this embodiment of the present invention runs at 25 frames per second, with good tracking robustness and high accuracy.
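A high-level sketch of one round of steps S403-S407; run_tracker, to_geo and in_poly are placeholder callables standing in for the Siamese tracker, the thin-plate-spline mapping and the ray-casting test described above.

```python
def handoff_step(frame_by_view, active, template, overlaps, run_tracker, to_geo, in_poly):
    """One round over the current frames of the active views (point set P_i -> P_g -> A_k)."""
    centers = {}
    for v in list(active):
        c = run_tracker(frame_by_view[v], template)    # pixel center, or None if lost
        if c is None:
            active.discard(v)                          # target left this view
        else:
            centers[v] = c
    if not centers:
        return active, None                            # caller falls back to S405 re-detection
    geo = [to_geo(v, c) for v, c in centers.items()]
    g = tuple(sum(x) / len(x) for x in zip(*geo))      # averaged geographic position
    for polygon, view_ids in overlaps:
        if in_poly(g, polygon):                        # target entered overlap region A_k
            active |= set(view_ids)                    # start tracking on the new view(s)
    return active, g

# toy usage with stand-in callables
active, g = handoff_step(
    frame_by_view={0: "frame0", 1: "frame1"},
    active={0},
    template="tmpl",
    overlaps=[([(0, 0), (10, 0), (10, 10), (0, 10)], [0, 1])],
    run_tracker=lambda frame, t: (5.0, 5.0),
    to_geo=lambda v, c: c,
    in_poly=lambda p, poly: True,
)
print(active, g)   # {0, 1} (5.0, 5.0)
```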
In summary, the cross-video target tracking method of the embodiments of the present application builds a deep Siamese tracking model based on a self-attention network and a dynamic template network. It can adapt to long-term tracking; its results are essentially unaffected by changes in the scale and appearance of the tracked target; and its tracking performance is stable in cross-video, multi-view tracking scenarios, solving the problem that targets cannot be tracked effectively in cross-video tracking due to scale and appearance changes. Furthermore, the present application uses a thin-plate-spline-based model that maps pixel points to unified geographic coordinates, computes the overlap regions between videos with polygon clipping, and uses the ray-casting method to decide, from the geographic coordinates, whether the tracked target has entered an overlap region and hence whether to switch videos, achieving continuous cross-video tracking of the target. Compared with the prior art, the cross-video tracking technique of the embodiments of the present application covers a large field of view, can track a target stably over a long time, and marks out the target's trajectory over a wide area, effectively assisting subsequent decision-making tasks, saving manpower and material resources, guaranteeing real-time performance, and achieving high tracking accuracy.
Please refer to Figure 7, which is a schematic structural diagram of the cross-video target tracking system of an embodiment of the present application. The cross-video target tracking system 40 of the embodiment of the present application comprises:
Target prediction module 41: used to determine the video images to be tracked and the initial target template image, input the video images and the initial target template image into the trained deep Siamese tracking model, and output, through the deep Siamese tracking model, a target prediction image of the tracked target in the video images;
Geographic coordinate mapping module 42: used to map the video images into a unified geographic coordinate space using the trained geographic mapping model, obtaining global geographic coordinates of the video images;
Cross-video target hand-off module 43: used to compute, based on the mapped video images, the overlap regions between every pair of video images by polygon clipping, determine whether a target prediction image exists in the overlap regions, and perform cross-video target tracking with the tracked-target hand-off algorithm in the video images corresponding to the overlap regions in which a target prediction image exists.
Please refer to Figure 8, which is a schematic structural diagram of the electronic device of an embodiment of the present application. The electronic device 50 comprises a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the above cross-video target tracking method.
The processor 51 is configured to execute the program instructions stored in the memory 52 to control cross-video target tracking.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Please refer to Figure 9, which is a schematic structural diagram of the storage medium of an embodiment of the present application. The storage medium of the embodiment of the present application stores a program file 61 capable of implementing all of the above methods. The program file 61 may be stored in the storage medium in the form of a software product and comprises a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, or an electronic device such as a computer, server, mobile phone or tablet.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A cross-video target tracking method, comprising:
    determining video images to be tracked and an initial target template image;
    inputting the video images and the initial target template image into a trained deep Siamese tracking model, and outputting, through the deep Siamese tracking model, a target prediction image of a tracked target in the video images;
    mapping the video images into a unified geographic coordinate space using a trained geographic mapping model, to obtain global geographic coordinates of the video images;
    based on the mapped video images, computing overlap regions between every pair of video images by polygon clipping, determining whether a target prediction image exists in the overlap regions, and performing cross-video target tracking with a tracked-target hand-off algorithm in the video images corresponding to the overlap regions in which the target prediction image exists.
  2. The cross-video target tracking method according to claim 1, wherein the deep Siamese tracking model comprises a backbone network, a self-attention network, a target estimation network and a dynamic template network,
    the backbone network takes a target template image and a target search image as input and outputs one-dimensional feature maps of the target template image and the target search image using depthwise separable convolution modules, a multiplexing convolution module and inverted residual modules;
    the self-attention network adopts an encoder-decoder architecture, takes the output of the backbone network as input, and outputs a two-dimensional feature map;
    the target estimation network comprises three network heads, namely an offset regression head, a scale prediction head and a target classification head, and takes the output of the self-attention network as input; the three heads output an offset regression map, a scale prediction map and a target classification map respectively, the position of the tracked target in the target search image is obtained from the offset regression map, the scale prediction map and the target classification map, and the target prediction image is output;
    the dynamic template network comprises a three-layer feed-forward neural network, takes the output of the self-attention network as input, and outputs a Boolean value indicating whether to update the target template image.
  3. The cross-video target tracking method according to claim 2, wherein outputting, by the backbone network, the one-dimensional feature maps of the target template image and the target search image using the depthwise separable convolution modules, the multiplexing convolution module and the inverted residual modules specifically comprises:
    passing the target template image and the target search image through a shared convolution kernel Conv_1 to obtain feature maps T1 and S1 of the target template image and the target search image respectively;
    configuring three depthwise separable convolution layers DwConv_1, DwConv_2 and DwConv_3, each comprising a channel-wise convolution and a point-wise convolution, wherein DwConv_1, DwConv_2 and DwConv_3 take the feature maps T1 and S1 as input, DwConv_1 and DwConv_2 output feature maps of the same size as T1 and S1, and DwConv_3 outputs final first feature maps T2 and S2, the sizes of T2 and S2 being half those of T1 and S1 respectively;
    inputting the first feature maps into the multiplexing convolution module, the multiplexing convolution module comprising three convolution layers, each comprising three inverted residual modules and two multiplexing modules, wherein the multiplexing convolution module takes the first feature maps T2 and S2 as input and outputs second feature maps T3 and S3.
  4. The cross-video target tracking method according to claim 3, wherein the self-attention network comprises an encoder and a decoder; the encoder comprises a first multi-head self-attention module, a first feed-forward network and a first residual normalization module, the input of the encoder is the feature map Z ∈ R^{h×w×d} of the target template image, where h and w are the height and width of the feature map Z and d is the number of channels, and the spatial dimensions of Z are flattened to one dimension to form a sequence Z_0 ∈ R^{hw×d};
    the decoder comprises a second feed-forward network, a second residual normalization module, a second multi-head self-attention module and a multi-head cross-attention module, the input of the decoder is the feature map X ∈ R^{H×W×d} of the target search image, where H and W are the height and width of the feature map X, with H > h and W > w, and the decoder flattens the feature map X into a one-dimensional sequence X_0 ∈ R^{HW×d}.
  5. The cross-video target tracking method according to claim 4, wherein the target estimation network comprises an offset regression head, a scale prediction head and a target classification head, which are each connected to the output of the self-attention network and each comprise three 1x1 convolution layers and a Sigmoid function; the offset regression head and the scale prediction head are used for bounding-box offset regression and scale regression and output an offset regression map and a scale prediction map respectively; the target classification head is used for target classification and outputs a target classification map, the values of which represent the probability of the tracked target appearing under low-resolution discretization.
  6. The cross-video target tracking method according to claim 5, wherein the dynamic template network comprises a score network, the score network being a three-layer fully connected network followed by a Sigmoid activation function; the input of the score network is the output of the self-attention network, flattened to one dimension, and the output of the score network is a score; when the score exceeds a set threshold τ and the number of elapsed frames reaches at least F_u, the dynamic template network is triggered to update the target template image.
  7. The cross-video target tracking method according to any one of claims 1 to 6, wherein the geographic mapping model is a geographic mapping model based on the thin-plate spline, and mapping all the video images into the unified geographic coordinate space using the trained geographic mapping model specifically comprises:
    selecting a set number of geographic control points in the video images through ArcGIS, matching the geographic control points one-to-one with pixel points in the video images to find N matched points, deforming the N matched points to their corresponding positions with the thin-plate spline while computing N corresponding deformation functions, and mapping the different video images into the unified geographic coordinate space through the deformation functions.
  8. The cross-video target tracking method according to claim 7, wherein performing cross-video target tracking with the tracked-target hand-off algorithm specifically comprises:
    computing the overlap regions between every pair of video images with the polygon clipping algorithm, and denoting the set of overlap regions between all video images as A[n];
    adding the video images to an initial video set z[m], manually marking a target template to be tracked on any one of the video images, and inputting the target template image of the tracked target, the video set z[m] and the overlap-region set A[n] into the target hand-off algorithm;
    starting the target hand-off algorithm and obtaining a point set P_i containing the predicted center-point pixel coordinates obtained by running the target tracking algorithm on all video images in the video set z[m]; if a center point exists in a video image, adding the center point to the point set P_i; if no center point exists in a video image, removing that video image from the video set z[m];
    determining whether the point set P_i is empty; if the point set P_i is empty, ending the target hand-off algorithm, taking the most recent appearance of the tracked target as a new target template image, and rerunning the target tracking algorithm on all video images; if the point set P_i is not empty,
    determining with the ray-casting method whether the tracked target has entered an overlap region, and denoting the overlap region containing the tracked target as A_k;
    executing the target tracking algorithm on all video images to which A_k belongs.
  9. A cross-video target tracking system, comprising:
    a target prediction module, configured to determine video images to be tracked and an initial target template image, input the video images and the initial target template image into a trained deep Siamese tracking model, and output, through the deep Siamese tracking model, a target prediction image of a tracked target in the video images;
    a geographic coordinate mapping module, configured to map the video images into a unified geographic coordinate space using a trained geographic mapping model, to obtain global geographic coordinates of the video images;
    a cross-video target hand-off module, configured to compute, based on the mapped video images, overlap regions between every pair of video images by polygon clipping, determine whether a target prediction image exists in the overlap regions, and perform cross-video target tracking with a tracked-target hand-off algorithm in the video images corresponding to the overlap regions in which the target prediction image exists.
  10. An electronic device, comprising a processor and a memory coupled to the processor, wherein
    the memory stores program instructions for implementing the cross-video target tracking method according to any one of claims 1 to 8;
    the processor is configured to execute the program instructions stored in the memory to control cross-video target tracking.
PCT/CN2022/137022 2022-05-07 2022-12-06 Cross-video target tracking method and system, electronic device, and storage medium WO2023216572A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210490914.5 2022-05-07
CN202210490914.5A CN114842028A (zh) Cross-video target tracking method and system, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023216572A1 true WO2023216572A1 (zh) 2023-11-16

Family

ID=82567470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137022 WO2023216572A1 (zh) 2022-05-07 2022-12-06 一种跨视频目标跟踪方法、系统、电子设备以及存储介质

Country Status (2)

Country Link
CN (1) CN114842028A (zh)
WO (1) WO2023216572A1 (zh)


Also Published As

Publication number Publication date
CN114842028A (zh) 2022-08-02

