CN117649427A - Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration - Google Patents

Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration

Info

Publication number
CN117649427A
Authority
CN
China
Prior art keywords
frame
sample
video
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311595400.7A
Other languages
Chinese (zh)
Inventor
冯平
陈旭
向丽
刘敏
蒋合领
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University of Finance and Economics
Original Assignee
Guizhou University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University of Finance and Economics filed Critical Guizhou University of Finance and Economics
Priority to CN202311595400.7A
Publication of CN117649427A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

A video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration relate to the field of computer vision, and in particular to single-target tracking. The method addresses two shortcomings of existing single-target tracking techniques: temporal changes of the target are either ignored or handled by models whose computation is costly. The model construction method comprises the following steps: Step S1, generating sample frames based on initial samples of an initial video in which the target position and scale information are annotated, wherein the target frame is a rectangular frame used to mark the initial target position and scale; Step S2, generating category labels for the initial samples; Step S3, generating space-time knowledge information for the initial samples; Step S4, constructing a convolutional neural network model for classification, the model comprising a first partial network and a second partial network; Step S5, training the convolutional neural network model. The method has application value in the field of single-target tracking.

Description

Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration
Technical Field
The invention relates to the field of computer vision, in particular to the field of single-target tracking.
Background
When tracking a target in a complex scene, interference factors such as illumination, scale change, pose change and occlusion can strongly affect the appearance of the target, making continuous and accurate tracking difficult. A video is a sequence of frame images; the possible positions and area of the target in each frame can be estimated according to a suitable strategy to generate a number of candidate target regions, and an algorithm is then used to select, under given conditions, the best candidate region as the tracking result.
The prior art mainly uses convolutional neural networks to extract features from local regions of video frame images and performs target tracking by similarity matching or classification. Because the features are extracted directly from local image blocks, even when their spatial distribution is considered it remains confined to those blocks. As for the temporal variation of the features, it is either ignored, so that deformation, occlusion and disappearance of the target cannot be handled, or it is modelled with RNN, LSTM or Transformer models, whose training and application are complex. RNN and LSTM models process sequences step by step and must compute each time step sequentially, which lowers computational efficiency; when processing long sequences the computation time grows significantly, which is a challenge for real-time applications and large-scale datasets. The Transformer model captures dependencies in a sequence through self-attention, but its attention is global: every position interacts with all other positions in the sequence. This leads to excessive consumption of computational and memory resources on long sequences and limits the scalability of the Transformer model.
Disclosure of Invention
In order to solve the problems in the existing single-target tracking technology of either neglecting temporal changes or handling them at a large computational cost, the invention provides the following scheme:
A method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration comprises the following steps:
Step S1, generating sample frames based on initial samples of an initial video in which the target position and scale information are annotated, wherein the target frame is a rectangular frame used to mark the initial target position and scale;
Step S2, generating category labels for the initial samples;
Step S3, generating space-time knowledge information for the initial samples;
Step S4, constructing a convolutional neural network model for classification, the convolutional neural network model comprising a first partial network and a second partial network;
Step S5, training the convolutional neural network model; after training, the video single-target tracking model based on space-time knowledge fusion and model dynamic integration is obtained.
Further, the step S2 specifically comprises:
Step S201, computing, for each initial sample, the area overlap ratio IoU between its sample frame and the target frame:
IoU = (As ∩ At) / (As ∪ At)
where As denotes the area of the sample frame, At denotes the area of the target frame, the numerator is the area of the overlapping part of the sample frame and the target frame, and the denominator is the area of their union; the image block corresponding to a sample frame is called a sample;
Step S202, comparing every area overlap ratio IoU in turn with preset thresholds: if the IoU of the sample frame with the target frame is greater than 0.7, the sample label is set to 1, indicating a foreground (positive) sample; if the IoU is less than 0.5, the sample label is set to 0, indicating a background (negative) sample.
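For illustration, a minimal sketch of the labelling rule of steps S201-S202, assuming axis-aligned boxes given as (x, y, w, h): the 0.7 and 0.5 thresholds come from the text above, while the function names and the handling of samples that fall between the two thresholds (here simply discarded) are assumptions.

```python
def iou(box_a, box_b):
    """Area overlap ratio IoU of two axis-aligned boxes given as (x, y, w, h)."""
    xa1, ya1, xa2, ya2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb1, yb1, xb2, yb2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h                                   # overlap area (numerator)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter   # union area (denominator)
    return inter / union if union > 0 else 0.0

def label_sample(sample_box, target_box, pos_thr=0.7, neg_thr=0.5):
    """Return 1 (foreground positive), 0 (background negative), or None."""
    overlap = iou(sample_box, target_box)
    if overlap > pos_thr:
        return 1
    if overlap < neg_thr:
        return 0
    return None  # samples between the two thresholds are not assigned a label here (assumption)
```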
Further, the step S3 specifically includes:
Step S301, generating sample spatial information (x, y, w, h, BIoU1, BIoU2), where (x, y) is the coordinate position of the sample frame, (w, h) is the width and height of the sample frame, and BIoU1 and BIoU2 describe the relative spatial position relation between the sample frame in the current frame and the target frame in the previous frame,
where A_r denotes the area of the minimum enclosing rectangle containing both the sample frame and the target frame;
Step S302, generating sample temporal information (ia, p_t1, p_t2, p_t3): the initial video is divided into video segments of p1 frames each, and p_t1 indicates that the sample lies in the p_t1-th frame of its video segment;
the initial video is divided into video segments of p2 frames each, and p_t2 indicates that the sample lies in the p_t2-th frame of its video segment;
the initial video is divided into video segments of p3 frames each, and p_t3 indicates that the sample lies in the p_t3-th frame of its video segment;
ia denotes the sequence number of the video segment in which the sample lies when the initial video is divided into segments of p1 frames;
Step S303, generating sample space-time knowledge information: p1 frames are supplemented before the first frame of the initial video by copying the first frame, some positive samples of the first frame are defined as the targets of the supplemented frames, and the ia values in their temporal information are all set to 0; starting from the original first frame, the target spatial and temporal information of each of the p1-1 frames preceding the frame containing the sample is spliced with the spatial and temporal information of the sample itself to generate the space-time knowledge information of the sample,
[(x, y, w, h, BIoU1, BIoU2, ia, p_t1, p_t2, p_t3)_n1, ..., (x, y, w, h, BIoU1, BIoU2, ia, p_t1, p_t2, p_t3)_n2].
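A sketch, under assumptions, of how the space-time knowledge vector of step S303 could be assembled: the target descriptors of the p1-1 preceding frames (including the copied first frames at the start of the history) are spliced with the sample's own descriptor. The history container and function names are illustrative; BIoU1/BIoU2 and the temporal fields are taken as already computed in steps S301-S302.

```python
def sample_descriptor(x, y, w, h, biou1, biou2, ia, pt1, pt2, pt3):
    # One (x, y, w, h, BIoU1, BIoU2, ia, p_t1, p_t2, p_t3) tuple from steps S301/S302.
    return [x, y, w, h, biou1, biou2, ia, pt1, pt2, pt3]

def spatiotemporal_knowledge(target_history, current_sample, p1):
    """target_history: one target descriptor per earlier frame, with the first frame
    duplicated p1 times at the start and ia set to 0, as described in step S303.
    current_sample: descriptor of the sample in the current frame.
    Returns the flat knowledge vector obtained by splicing the p1-1 preceding
    target descriptors with the sample's own descriptor."""
    preceding = target_history[-(p1 - 1):]          # targets of the p1-1 previous frames
    knowledge = [v for d in preceding for v in d]   # flatten in temporal order
    knowledge += current_sample                     # append the sample's own information
    return knowledge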
Further, the step S4 specifically comprises:
Step S401, constructing a first partial network for feature extraction; this partial network is shared by all video sequences and extracts features from the whole image of every video frame in all video sequences, and it uses a RoIAlign layer to crop, from the whole output feature map, the local region features of the frames generated by the particle filter;
Step S402, constructing a second partial network for classifying samples; this partial network is constructed separately for each video sequence and comprises three fully connected layers and two dropout layers; the output of the last layer of the first partial network is flattened to one dimension, concatenated and fused with the corresponding space-time knowledge information, and used as the input of the first fully connected layer; the last fully connected layer outputs the probabilities of the two categories, and the binary classification function BCELoss, or an improved variant, is used as the loss function.
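A minimal PyTorch sketch of the two-part model of steps S401-S402. The RoIAlign layer, the three fully connected layers with two dropout layers, the concatenation with the space-time knowledge vector, and the BCELoss follow the text above; the convolutional backbone, layer sizes, dropout rate and knowledge-vector length (assumed here as 10 values per frame x 10 frames) are illustrative assumptions made only to keep the sketch runnable.

```python
import torch
import torch.nn as nn
from torchvision.ops import RoIAlign

class FirstPartNet(nn.Module):
    """Shared feature extractor: whole-frame convolutional features plus RoIAlign
    crops of the particle-filter boxes (step S401). The backbone is illustrative."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # spatial_scale compensates the 4x downsampling of the backbone above
        self.roi_align = RoIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=2)

    def forward(self, frames, rois):
        # rois: Tensor[K, 5] in the format [batch_index, x1, y1, x2, y2]
        feat = self.backbone(frames)        # features of the whole frame image
        return self.roi_align(feat, rois)   # per-box local region features

class SecondPartNet(nn.Module):
    """Per-video classifier: three fully connected layers and two dropout layers
    (step S402); region features are flattened and concatenated with the
    space-time knowledge vector before the first FC layer."""
    def __init__(self, feat_dim=64 * 7 * 7, knowledge_dim=100, hidden=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + knowledge_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, 2), nn.Sigmoid(),   # probabilities of the two categories
        )

    def forward(self, region_feat, knowledge):
        x = torch.cat([region_feat.flatten(1), knowledge], dim=1)
        return self.fc(x)

criterion = nn.BCELoss()   # binary classification loss named in step S402
```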
A video single-target tracking method based on space-time knowledge fusion and model dynamic integration comprises the following steps:
Step S6, adapting the network parameters of the video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to claim 1, using detection samples of the tracking video obtained in real time, and setting the initial value of a variable nf to 1, where nf denotes the sequence number of the current frame and a detection sample is a video frame used to test the model during video single-target tracking based on space-time knowledge fusion and model dynamic integration;
Step S7, using the adapted model to track the target in the new video and obtaining the classification prediction probabilities of the current frame image of the tracked video, each prediction consisting of a foreground probability value and a background probability value; the mean of the positions, widths and heights of the 5 candidate target regions with the largest foreground probability values is taken as the target prediction result;
Step S8, judging whether the current frame is the last frame; if so, tracking ends; if not, the subsequent steps continue to track the target;
Step S9, collecting detection samples and extracting and storing features: positive and negative sample frame regions are generated by the particle filter according to the target prediction result in the current frame, and the whole frame image is input into the first partial network; if the maximum foreground probability value of the classification prediction probabilities is greater than 0.6, the RoIAlign layer is used to separate out the positive and negative sample region features for storage; the positive and negative region features obtained by the RoIAlign mapping are masked to form masked global features, which are stored, and the space-time knowledge information of the samples is generated and stored;
Step S10, updating the network model: if the maximum foreground probability value of the classification prediction probabilities is not greater than 0.6, an auxiliary decision network model with the same structure as the convolutional neural network model used for classification in step S4 is created (if it already exists, its parameters are reinitialized); a part of the masked global positive and negative sample features stored during tracking is selected from each of the most recent 5 frames up to the current frame, and the auxiliary decision network model is trained with these global sample features and the corresponding space-time knowledge information;
Step S11, completing the new target prediction: without using the auxiliary decision network model, it is judged whether the maximum foreground probability value of the classification prediction probabilities is greater than 0.6; if so, classification is completed with the adapted model alone, the mean of the positions, widths and heights of the 5 candidate target regions with the largest foreground probability values is taken as the target prediction result, the maximum foreground probability value is computed, 1 is added to the current frame sequence number nf, and execution continues from step S8; if not, classification is completed with the adapted model together with the auxiliary decision model of step S10, the probabilities output by the two models are weighted, the mean of the positions, widths and heights of the 5 candidate target regions with the largest weighted foreground probability values is taken as the target prediction result, the maximum foreground probability is computed, 1 is added to the variable nf for the current frame, and execution continues from step S8.
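The branching of steps S9-S11 can be summarised in a short sketch. The 0.6 confidence threshold and the top-5 averaging are taken from the steps above; the equal weighting of the two models' outputs and all function names are assumptions, since the exact weights are not specified here.

```python
import torch

CONF_THR = 0.6            # foreground-probability threshold used in steps S9-S11
ENSEMBLE_W = (0.5, 0.5)   # weights for the two models' outputs (assumed, not specified above)

def top5_prediction(cand_boxes, fg_probs):
    """Mean position/size of the 5 candidates with the largest foreground probability,
    together with the maximum of those probabilities."""
    top5 = torch.topk(fg_probs, k=5).indices
    return cand_boxes[top5].mean(dim=0), fg_probs[top5].max()

def decide_frame(cand_boxes, main_fg_probs, aux_fg_probs=None):
    # Step S11: if the adapted model alone is confident enough, use it directly ...
    if aux_fg_probs is None or main_fg_probs.max() > CONF_THR:
        return top5_prediction(cand_boxes, main_fg_probs)
    # ... otherwise weight its output with the auxiliary decision model trained in step S10.
    fused = ENSEMBLE_W[0] * main_fg_probs + ENSEMBLE_W[1] * aux_fg_probs
    return top5_prediction(cand_boxes, fused)
```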
Further, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to any one of claims 1-4, or the method for video single-target tracking based on space-time knowledge fusion and model dynamic integration according to claim 5.
Further, a non-transitory computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to any one of claims 1-4, or the method for video single-target tracking based on space-time knowledge fusion and model dynamic integration according to claim 5.
The technical scheme provided by the invention has the following advantages:
1. The video single-target tracking method based on space-time knowledge fusion and model dynamic integration makes full use of spatial and temporal information: the spatial and temporal information of the target is fused inside the convolutional neural network, and the space-time knowledge fusion of the target's position, scale and appearance features better reflects and exploits the spatio-temporal variation of those features. Compared with prior art that uses only local features or performs temporal modelling with complex models, the proposed method is simpler and more effective.
2. The video single-target tracking method based on space-time knowledge fusion and model dynamic integration uses global features for auxiliary classification: a mask is applied over the target region to form masked global features, which are used to train an auxiliary decision model for classification. Compared with prior art that classifies using only local features, the target can be distinguished from the background more accurately.
3. The video single-target tracking method based on space-time knowledge fusion and model dynamic integration performs ensemble learning simply and effectively: ensemble learning is realised by constructing an auxiliary decision model trained on the global features of the video frame image with the target mask, giving more robust target tracking. Compared with complex ensemble learning methods in the prior art, the method is simpler and easier to implement.
Drawings
Fig. 1 is a flowchart of the video single-target tracking method based on space-time knowledge fusion and model dynamic integration in the fifth embodiment.
Detailed Description
In an embodiment, a method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration comprises the following steps:
Step S1, generating sample frames based on initial samples of an initial video in which the target position and scale information are annotated, wherein the target frame is a rectangular frame used to mark the initial target position and scale;
Step S2, generating category labels for the initial samples;
Step S3, generating space-time knowledge information for the initial samples;
Step S4, constructing a convolutional neural network model for classification, the convolutional neural network model comprising a first partial network and a second partial network;
Step S5, training the convolutional neural network model; after training, the video single-target tracking model based on space-time knowledge fusion and model dynamic integration is obtained.
Specifically, in step S1, a certain number of frame images are randomly selected from each video annotated with the target's position and scale information, and a particle filter algorithm is used to generate a number of sample frames of different positions and scales near the target and over a larger surrounding range;
In step S5, the sample data of each video sequence are fed into the constructed network model batch by batch, in turn, and the network parameters are iteratively trained until the model gradually converges.
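A minimal sketch of this per-video-sequence batch training, assuming the two-part modules sketched under step S4 and one dataloader per training video; the optimiser, learning rate and epoch count are illustrative choices, not values prescribed by the invention.

```python
import torch

def train(first_part, second_parts, video_loaders, epochs=50, lr=1e-4):
    """second_parts: one SecondPartNet per training video sequence (step S402);
    first_part is shared across all sequences (step S401)."""
    criterion = torch.nn.BCELoss()
    optims = [torch.optim.Adam(list(first_part.parameters()) + list(sp.parameters()), lr=lr)
              for sp in second_parts]
    for _ in range(epochs):
        for vid, loader in enumerate(video_loaders):            # video sequences in turn
            for frames, rois, knowledge, labels in loader:      # batches of samples
                probs = second_parts[vid](first_part(frames, rois), knowledge)
                loss = criterion(probs, labels)                 # labels: one-hot {bg, fg} floats
                optims[vid].zero_grad()
                loss.backward()
                optims[vid].step()
```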
Embodiment two: the present embodiment is further defined on the video single-target tracking method based on space-time knowledge fusion and model dynamic integration in the first embodiment, where the step S2 specifically includes:
step S201, obtaining the area overlapping ratio IoU between the sample frame and the target frame of each initial sample, respectively:
in the formula, as represents the area of a sample frame, at represents the area of a target frame, a numerator part represents the area of an overlapping part of the sample frame and the target frame, a denominator part represents the space area of the sample frame and the target frame which are assembled and operated, and an image block corresponding to the sample frame is called a sample;
in step S202, all area overlapping ratios IoU are sequentially compared with a preset threshold, and if IoU of the sample frame and the target frame is greater than 0.7, the sample flag is set to 1, which indicates that the sample is a foreground type positive sample, and if IoU is less than 0.5, the sample flag is set to 0, which indicates that the sample is a background type negative sample.
Specifically, the area overlap ratio IoU is a commonly used index for measuring the degree of overlap of two frames. By calculating the ratio of the area of the overlapping portion of the sample frame and the target frame to the area of the union of the two, a value between 0 and 1 can be obtained, representing the degree of overlap of the two frames. In step S201, the area overlapping ratio IoU between the sample frame and the target frame is calculated, and by comparing the calculated IoU value with a preset threshold value, it can be determined whether the overlapping degree of the sample frame and the target frame meets a certain requirement. In step S202, if IoU calculated by the sample frame and the target frame is greater than 0.7, the sample tag is set to 1, which indicates that the sample is a foreground positive sample, that is, overlaps more with the target frame; if the value of IoU is less than 0.5, the sample tag is set to 0, indicating that the sample is a background-like negative sample, i.e., has less overlap with the target box. In this way, corresponding class labels can be generated for the samples according to the overlapping degree of the sample frames and the target frames, so that classification information is provided for subsequent target tracking.
Embodiment III: the present embodiment is further defined on the video single-target tracking method based on space-time knowledge fusion and model dynamic integration in the first embodiment, where the step S3 specifically includes:
step S301, generating sample space information (x, y, w, h, BIoU1, BIoU 2), wherein (x, y) is the coordinate position of the sample frame, (w, h) is the width and height of the sample frame, BIoU1 and BIoU2 are the space relative position relation information of the sample frame of the current frame and the target frame of the previous frame,
wherein A is r Representing a minimum circumscribed rectangular box area containing a sample box and a target box;
step S302, generating sample time information (ia, p) t1 ,p t2 ,p t3 ) Dividing the initial video into video segments, P, in units of P1 frames t1 Representing the p-th video segment where the sample is located t1 A frame;
dividing the initial video into video segments, P, in units of P2 frames t2 Representing the p-th video segment where the sample is located t2 A frame;
dividing the initial video into video segments, P, in units of P3 frames t3 Representing the p-th video segment where the sample is located t3 A frame;
ia represents the sequence number of the video segment where the sample is located when the sample divides the initial video into video segments in units of P1 frames;
step S303, generating sample space-time knowledge information, supplementing a p1 frame by copying the first frame before the first frame of the initial video, and defining partial positive samples of the first frame as targets of the supplemented frame, wherein ia values in the time information are all set to 0; starting from the original first frame, splicing the target space and time information of each frame p1-1 before the frame where the sample is positioned with the space and time information of the sample itself to generate the space-time knowledge information of the sample,
[(x,y,w,h,BIoU1,BIoU2,ia,p t1 ,p t2 ,p t3 ) n1 ,...,(x,y,w,h,BIoU1,BIoU2,ia,p t1 ,p t2 ,p t3 ) n2 ]。
Specifically, in step S302, p_t1, p_t2 and p_t3 are obtained by taking the remainder of the current frame number with respect to the different periods p: if the sample lies in frame nf, then p_t1 = nf % p1, and ia is the quotient of the current frame number nf divided by p1, plus 1.
For example, assume the initial video has 100 frames, the sample lies in frame 15, p1 is 10 frames, p2 is 20 frames, and p3 is 25 frames;
with p1 as the period the video is divided into 10 segments, the sample lies in the second segment, i.e. ia = 2, and p_t1 is the remainder of 15 divided by 10, i.e. 5;
with p2 as the period the video is divided into 5 segments, the sample lies in the first segment, and p_t2 is the remainder of 15 divided by 20, i.e. 15;
with p3 as the period the video is divided into 4 segments, the sample lies in the first segment, and p_t3 is the remainder of 15 divided by 25, i.e. 15;
so the temporal information of the frame-15 sample is (2, 5, 15, 15).
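The rule just stated can be checked directly in code; a short sketch that reproduces the worked example (the function name is illustrative):

```python
def temporal_info(nf, p1, p2, p3):
    """(ia, p_t1, p_t2, p_t3) for a sample in frame nf, per the rule of step S302:
    p_ti = nf % pi, and ia = nf // p1 + 1."""
    ia = nf // p1 + 1
    return ia, nf % p1, nf % p2, nf % p3

print(temporal_info(15, 10, 20, 25))   # -> (2, 5, 15, 15), matching the example above
```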
By generating the space-time knowledge information of the sample, more comprehensive historical information can be provided for target tracking, so that the accuracy and the robustness of target tracking are improved.
Embodiment four: the present embodiment is further defined on the video single-target tracking method based on space-time knowledge fusion and model dynamic integration in the first embodiment, where the step S4 specifically includes:
step S401, constructing a first partial network for extracting features, wherein the partial network is shared by all video sequences and is used for extracting features for the whole image of each video frame in all video sequences, and the first partial network intercepts local area features of a frame generated by a particle filter in the whole output feature map through a RoIAlign layer;
step S402, constructing a second partial network for classifying samples, the partial network being separately constructed for each video sequence, the second partial network comprising: the three full-connection layers and the two dropout layers flatten the output of the last layer of the first partial network into one dimension, splice and fuse the output with the corresponding space-time knowledge information as the input of the first full-connection layer, and the last full-connection layer contains probability output of two categories and uses a binary classification function BCELoss or an improved function as a loss function.
Specifically, by constructing such a convolutional neural network model, feature extraction and classification of video frames can be achieved. The first part of network is responsible for extracting features, the second part of network is responsible for fusing the extracted features with the space-time knowledge information and outputting the classification probability of the target. Such a design may take full advantage of spatiotemporal knowledge information to assist in classification tasks.
In a fifth embodiment, referring to fig. 1, a video single-target tracking method based on space-time knowledge fusion and model dynamic integration comprises the following steps:
Step S6, adapting the network parameters of the video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to claim 1, using detection samples of the tracking video obtained in real time, and setting the initial value of a variable nf to 1, where nf denotes the sequence number of the current frame and a detection sample is a video frame used to test the model during video single-target tracking based on space-time knowledge fusion and model dynamic integration;
Step S7, using the adapted model to track the target in the new video and obtaining the classification prediction probabilities of the current frame image of the tracked video, each prediction consisting of a foreground probability value and a background probability value; the mean of the positions, widths and heights of the 5 candidate target regions with the largest foreground probability values is taken as the target prediction result;
Step S8, judging whether the current frame is the last frame; if so, tracking ends; if not, the subsequent steps continue to track the target;
Step S9, collecting detection samples and extracting and storing features: positive and negative sample frame regions are generated by the particle filter according to the target prediction result in the current frame, and the whole frame image is input into the first partial network; if the maximum foreground probability value of the classification prediction probabilities is greater than 0.6, the RoIAlign layer is used to separate out the positive and negative sample region features for storage; the positive and negative region features obtained by the RoIAlign mapping are masked to form masked global features, which are stored, and the space-time knowledge information of the samples is generated and stored;
Step S10, updating the network model: if the maximum foreground probability value of the classification prediction probabilities is not greater than 0.6, an auxiliary decision network model with the same structure as the convolutional neural network model used for classification in step S4 is created (if it already exists, its parameters are reinitialized); a part of the masked global positive and negative sample features stored during tracking is selected from each of the most recent 5 frames up to the current frame, and the auxiliary decision network model is trained with these global sample features and the corresponding space-time knowledge information;
Step S11, completing the new target prediction: without using the auxiliary decision network model, it is judged whether the maximum foreground probability value of the classification prediction probabilities is greater than 0.6; if so, classification is completed with the adapted model alone, the mean of the positions, widths and heights of the 5 candidate target regions with the largest foreground probability values is taken as the target prediction result, the maximum foreground probability value is computed, 1 is added to the current frame sequence number nf, and execution continues from step S8; if not, classification is completed with the adapted model together with the auxiliary decision model of step S10, the probabilities output by the two models are weighted, the mean of the positions, widths and heights of the 5 candidate target regions with the largest weighted foreground probability values is taken as the target prediction result, the maximum foreground probability is computed, 1 is added to the variable nf for the current frame, and execution continues from step S8.
After the network model has been trained, in step S6, in order to track the target in a new video, the network parameters of the feature-extracting first part are fixed and a completely new second-part classification network of the same structure is constructed for the new video. Samples are collected on the initial frame with the steps described above, frames are supplemented forward and space-time knowledge information is generated for the samples, and the parameters of the second partial network are iteratively trained with these sample data. The variable nf is set to 1, denoting frame 1;
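A sketch of this adaptation, assuming the two-part modules sketched earlier: the shared first part is frozen and a fresh second-part classifier is trained on the samples and space-time knowledge drawn from the initial frame. The factory argument, number of steps and learning rate are assumptions.

```python
import torch

def adapt_to_new_video(first_part, second_part_factory, init_loader, steps=30, lr=1e-3):
    """Step S6 sketch: fix the shared first-part parameters and train a brand-new
    second-part classifier (same structure, fresh weights) on the initial-frame samples."""
    for p in first_part.parameters():
        p.requires_grad = False                    # feature extractor stays fixed
    second_part = second_part_factory()            # e.g. lambda: SecondPartNet()
    optimizer = torch.optim.Adam(second_part.parameters(), lr=lr)
    criterion = torch.nn.BCELoss()
    for _ in range(steps):
        for frames, rois, knowledge, labels in init_loader:
            with torch.no_grad():
                feats = first_part(frames, rois)   # frozen shared features
            loss = criterion(second_part(feats, knowledge), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return second_part
```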
In step S7, once the parameters of the network model have been adapted, starting from the frame after the initial frame the particle filter generates, according to the initial target position and size, regions of predicted frames representing candidate targets on the new video frame; the whole video frame image is input into the first partial network, the features of the candidate target regions are extracted by the RoIAlign layer, the space-time knowledge information of these regions is generated, and both are input into the second partial network to obtain the classification prediction probabilities of all candidate target regions. The mean of the positions and sizes of the 5 candidate target regions with the largest foreground probability values is computed and taken as the target prediction result, the maximum of those 5 foreground probability values is computed and denoted Ptd, the value of the variable nf is increased by 1, and the subsequent steps are then executed in a loop;
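A sketch of this per-frame prediction, again assuming the earlier two-part modules: candidate regions proposed by the particle filter are classified, and the 5 most confident ones are averaged to give the target estimate and Ptd. The box formats, the choice of output column as the foreground class, and all names are assumptions.

```python
import torch

def classify_candidates(first_part, second_part, frame, cand_boxes_xyxy, knowledge):
    """Run every particle-filter candidate region of one frame through the two-part
    model and return its foreground probability (column 1 is assumed to be foreground)."""
    batch_idx = torch.zeros(len(cand_boxes_xyxy), 1)
    rois = torch.cat([batch_idx, cand_boxes_xyxy], dim=1)   # RoIAlign format: [b, x1, y1, x2, y2]
    with torch.no_grad():
        feats = first_part(frame.unsqueeze(0), rois)
        probs = second_part(feats, knowledge)
    return probs[:, 1]

def frame_prediction(cand_boxes_xywh, fg_probs):
    """Step S7 estimate: mean (x, y, w, h) of the 5 most confident candidates,
    plus Ptd, the largest of those 5 foreground probabilities."""
    top5 = torch.topk(fg_probs, k=5).indices
    return cand_boxes_xywh[top5].mean(dim=0), fg_probs[top5].max()
```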
in order to achieve the foregoing embodiments, the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements a method for constructing a video single-target tracking model based on spatio-temporal knowledge fusion and model dynamic integration or a method for video single-target tracking based on spatio-temporal knowledge fusion and model dynamic integration according to the foregoing embodiments when executing the computer program.
In order to achieve the above embodiments, the present invention further proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for constructing a video single-target tracking model based on spatio-temporal knowledge fusion and model dynamic integration or a method for video single-target tracking based on spatio-temporal knowledge fusion and model dynamic integration as described in the foregoing embodiments.

Claims (7)

1. A method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration, characterized by comprising the following steps:
Step S1, generating sample frames based on initial samples of an initial video in which the target position and scale information are annotated, wherein the target frame is a rectangular frame used to mark the initial target position and scale;
Step S2, generating category labels for the initial samples;
Step S3, generating space-time knowledge information for the initial samples;
Step S4, constructing a convolutional neural network model for classification, the convolutional neural network model comprising a first partial network and a second partial network;
Step S5, training the convolutional neural network model; after training, the video single-target tracking model based on space-time knowledge fusion and model dynamic integration is obtained.
2. The method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to claim 1, wherein the step S2 specifically comprises:
Step S201, computing, for each initial sample, the area overlap ratio IoU between its sample frame and the target frame:
IoU = (As ∩ At) / (As ∪ At)
where As denotes the area of the sample frame, At denotes the area of the target frame, the numerator is the area of the overlapping part of the sample frame and the target frame, and the denominator is the area of their union; the image block corresponding to a sample frame is called a sample;
Step S202, comparing every area overlap ratio IoU in turn with preset thresholds: if the IoU of the sample frame with the target frame is greater than 0.7, the sample label is set to 1, indicating a foreground (positive) sample; if the IoU is less than 0.5, the sample label is set to 0, indicating a background (negative) sample.
3. The method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to claim 1, wherein the step S3 specifically comprises:
Step S301, generating sample spatial information (x, y, w, h, BIoU1, BIoU2), where (x, y) is the coordinate position of the sample frame, (w, h) is the width and height of the sample frame, and BIoU1 and BIoU2 describe the relative spatial position relation between the sample frame in the current frame and the target frame in the previous frame,
where A_r denotes the area of the minimum enclosing rectangle containing both the sample frame and the target frame;
Step S302, generating sample temporal information (ia, p_t1, p_t2, p_t3): the initial video is divided into video segments of p1 frames each, and p_t1 indicates that the sample lies in the p_t1-th frame of its video segment;
the initial video is divided into video segments of p2 frames each, and p_t2 indicates that the sample lies in the p_t2-th frame of its video segment;
the initial video is divided into video segments of p3 frames each, and p_t3 indicates that the sample lies in the p_t3-th frame of its video segment;
ia denotes the sequence number of the video segment in which the sample lies when the initial video is divided into segments of p1 frames;
Step S303, generating sample space-time knowledge information: p1 frames are supplemented before the first frame of the initial video by copying the first frame, some positive samples of the first frame are defined as the targets of the supplemented frames, and the ia values in their temporal information are all set to 0; starting from the original first frame, the target spatial and temporal information of each of the p1-1 frames preceding the frame containing the sample is spliced with the spatial and temporal information of the sample itself to generate the space-time knowledge information of the sample,
[(x, y, w, h, BIoU1, BIoU2, ia, p_t1, p_t2, p_t3)_n1, ..., (x, y, w, h, BIoU1, BIoU2, ia, p_t1, p_t2, p_t3)_n2].
4. The method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to claim 1, wherein the step S4 specifically comprises:
Step S401, constructing a first partial network for feature extraction; this partial network is shared by all video sequences and extracts features from the whole image of every video frame in all video sequences, and it uses a RoIAlign layer to crop, from the whole output feature map, the local region features of the frames generated by the particle filter;
Step S402, constructing a second partial network for classifying samples; this partial network is constructed separately for each video sequence and comprises three fully connected layers and two dropout layers; the output of the last layer of the first partial network is flattened to one dimension, concatenated and fused with the corresponding space-time knowledge information, and used as the input of the first fully connected layer; the last fully connected layer outputs the probabilities of the two categories, and the binary classification function BCELoss, or an improved variant, is used as the loss function.
5. A video single-target tracking method based on space-time knowledge fusion and model dynamic integration, characterized by comprising the following steps:
Step S6, adapting the network parameters of the video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to claim 1, using detection samples of the tracking video obtained in real time, and setting the initial value of a variable nf to 1, where nf denotes the sequence number of the current frame and a detection sample is a video frame used to test the model during video single-target tracking based on space-time knowledge fusion and model dynamic integration;
Step S7, using the adapted model to track the target in the new video and obtaining the classification prediction probabilities of the current frame image of the tracked video, each prediction consisting of a foreground probability value and a background probability value; the mean of the positions, widths and heights of the 5 candidate target regions with the largest foreground probability values is taken as the target prediction result;
Step S8, judging whether the current frame is the last frame; if so, tracking ends; if not, the subsequent steps continue to track the target;
Step S9, collecting detection samples and extracting and storing features: positive and negative sample frame regions are generated by the particle filter according to the target prediction result in the current frame, and the whole frame image is input into the first partial network; if the maximum foreground probability value of the classification prediction probabilities is greater than 0.6, the RoIAlign layer is used to separate out the positive and negative sample region features for storage; the positive and negative region features obtained by the RoIAlign mapping are masked to form masked global features, which are stored, and the space-time knowledge information of the samples is generated and stored;
Step S10, updating the network model: if the maximum foreground probability value of the classification prediction probabilities is not greater than 0.6, an auxiliary decision network model with the same structure as the convolutional neural network model used for classification in step S4 is created (if it already exists, its parameters are reinitialized); a part of the masked global positive and negative sample features stored during tracking is selected from each of the most recent 5 frames up to the current frame, and the auxiliary decision network model is trained with these global sample features and the corresponding space-time knowledge information;
Step S11, completing the new target prediction: without using the auxiliary decision network model, it is judged whether the maximum foreground probability value of the classification prediction probabilities is greater than 0.6; if so, classification is completed with the adapted model alone, the mean of the positions, widths and heights of the 5 candidate target regions with the largest foreground probability values is taken as the target prediction result, the maximum foreground probability value is computed, 1 is added to the current frame sequence number nf, and execution continues from step S8; if not, classification is completed with the adapted model together with the auxiliary decision model of step S10, the probabilities output by the two models are weighted, the mean of the positions, widths and heights of the 5 candidate target regions with the largest weighted foreground probability values is taken as the target prediction result, the maximum foreground probability is computed, 1 is added to the variable nf for the current frame, and execution continues from step S8.
6. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to any one of claims 1 to 4, or the method for video single-target tracking based on space-time knowledge fusion and model dynamic integration according to claim 5.
7. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for constructing a video single-target tracking model based on space-time knowledge fusion and model dynamic integration according to any one of claims 1 to 4, or the method for video single-target tracking based on space-time knowledge fusion and model dynamic integration according to claim 5.
CN202311595400.7A 2023-11-27 2023-11-27 Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration Pending CN117649427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311595400.7A CN117649427A (en) 2023-11-27 2023-11-27 Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311595400.7A CN117649427A (en) 2023-11-27 2023-11-27 Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration

Publications (1)

Publication Number Publication Date
CN117649427A true CN117649427A (en) 2024-03-05

Family

ID=90042752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311595400.7A Pending CN117649427A (en) 2023-11-27 2023-11-27 Video single-target tracking method and system based on space-time knowledge fusion and model dynamic integration

Country Status (1)

Country Link
CN (1) CN117649427A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination