CN117392180B - Interactive video character tracking method and system based on self-supervision optical flow learning - Google Patents

Interactive video character tracking method and system based on self-supervision optical flow learning

Info

Publication number
CN117392180B
CN117392180B (application CN202311694258.1A)
Authority
CN
China
Prior art keywords
optical flow
self
image
frame image
frame
Prior art date
Legal status
Active
Application number
CN202311694258.1A
Other languages
Chinese (zh)
Other versions
CN117392180A (en)
Inventor
王少华
秦者云
刘兴波
庞瑞英
聂秀山
尹义龙
Current Assignee
Shandong Guozi Software Co ltd
Shandong Jianzhu University
Original Assignee
Shandong Guozi Software Co ltd
Shandong Jianzhu University
Priority date
Filing date
Publication date
Application filed by Shandong Guozi Software Co ltd, Shandong Jianzhu University
Priority to CN202311694258.1A
Publication of CN117392180A
Application granted
Publication of CN117392180B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Abstract

The invention discloses an interactive video character tracking method and system based on self-supervision optical flow learning, relating to the technical field of video character tracking. The method comprises the following steps: acquiring video data comprising a continuous frame image sequence, determining the initial position of a target person in the first frame image of the sequence through a user click operation, and generating an initial mask of the target person accordingly; inputting the initial mask of the target person and the video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, and predicting the mask and position of the target person in the next frame image according to the predicted optical flow vectors and the initial mask; and feeding the predicted target character mask of the next frame image back into the self-supervision optical flow learning model for continuous iterative tracking prediction until the video ends, outputting the moving position and trajectory of the target character throughout the video in real time, so as to realize more accurate target character tracking.

Description

Interactive video character tracking method and system based on self-supervision optical flow learning
Technical Field
The invention relates to the technical field of video character tracking, in particular to an interactive video character tracking method and system based on self-supervision optical flow learning.
Background
The statements in this section merely provide background of the present disclosure and may not necessarily constitute prior art.
With the continuous advancement of technology, video person tracking technology has found wide application in many fields, including security monitoring, social media, human-computer interaction, entertainment games, and health sports. When the video person tracking technology is applied to the security monitoring field or system, specific target persons can be tracked and monitored in real time, so that the accuracy and efficiency of the monitoring system can be improved, and the security is enhanced. Therefore, developing more accurate video person tracking technology is one of the current focus research directions.
Existing person tracking techniques can be categorized into offline processing and online processing. Offline methods can continuously correct errors from previous frames during processing and therefore have higher fault tolerance, but their processing speed is very low and tracking results cannot be obtained in real time. Online methods can produce real-time tracking results, but they handle occlusion, crossing between persons, and long-term disappearance of persons poorly; once the judgment on a certain frame image is wrong, the frames after it cannot be corrected. In addition, existing online person tracking techniques require complex computation for each tracking prediction, which is inefficient and consumes considerable computing resources.
Disclosure of Invention
In order to overcome the above deficiencies in the prior art, the invention provides an interactive video character tracking method and system based on self-supervision optical flow learning. A self-supervision optical flow learning model learns the motion information of the target person in the video to provide accurate real-time tracking prediction, and is combined with interactive selection and correction of the target person to achieve more accurate real-time tracking. This avoids erroneous subsequent tracking predictions caused by a wrong judgment on a previous frame image; the calculation process is simple, improving tracking efficiency and saving time and computing resources.
In a first aspect, the present invention provides an interactive video person tracking method based on self-supervised optical flow learning.
An interactive video character tracking method based on self-supervision optical flow learning, comprising:
acquiring video data comprising a continuous frame image sequence, and determining the initial position of a target person in a first frame image of the continuous frame image sequence through clicking operation of a user;
generating an initial target character mask based on the initial target character position, inputting the initial target character mask and video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, predicting the target character mask in the next frame image according to the predicted optical flow vectors and the initial target character mask, and outputting the target character position in the next frame image;
and inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished.
In a second aspect, the present invention provides an interactive video person tracking system based on self-supervised optical flow learning.
An interactive video character tracking system based on self-supervised optical flow learning, comprising:
the data acquisition module is used for acquiring video data comprising a continuous frame image sequence;
the target person initial position determining module is used for determining the initial position of the target person in the first frame image of the continuous frame image sequence through the click operation of the user based on the video data;
the target person tracking module is used for generating a target person initial mask based on the target person initial position, inputting the target person initial mask and video data into the self-supervision optical flow learning model which is pre-trained, predicting optical flow vectors between adjacent frame images in the video data, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the target person initial mask, and outputting the target person position in the next frame image; and inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished.
One or more of the above technical solutions have the following beneficial effects:
1. The invention provides an interactive video character tracking method and system based on self-supervision optical flow learning. By learning the motion information of the target person in the video through a self-supervision optical flow learning model and combining interactive selection and correction of the target person, accurate real-time tracking prediction is provided, more accurate real-time tracking of the target person is achieved, and erroneous subsequent tracking predictions caused by a wrong judgment on a previous frame image are avoided.
2. The invention provides a user-friendly interactive experience: the user can select and correct the target character mask through simple click operations, without a complex labeling process. This interactive selection and correction is more intuitive and natural, making it easier for the user to designate the person of interest.
3. The invention continuously updates the position information of the target person through iterative updates of the self-supervision optical flow learning model, so as to adapt to changes in the continuous frame image sequence of the video. This adaptive tracking capability maintains high real-time tracking accuracy even for complex scenes and long videos.
4. The self-supervision optical flow learning based method can track the target person in a single inference prediction, without multiple iterations or a complex calculation process, which improves tracking efficiency and saves time and computing resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of an interactive video person tracking method based on self-supervised optical flow learning according to an embodiment of the present invention;
FIG. 2 is a flow chart of self-supervised optical flow learning model training in an embodiment of the present invention;
FIG. 3 is a flowchart of a self-supervised optical flow learning model pre-training in an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, it is to be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The embodiment provides an interactive video character tracking method based on self-supervision optical flow learning, as shown in fig. 1, which comprises the following steps:
s1, acquiring video data comprising a continuous frame image sequence, and determining an initial position of a target person in a first frame image of the continuous frame image sequence through clicking operation of a user;
s2, generating an initial target character mask based on the initial target character position, inputting the initial target character mask and video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, predicting the initial target character mask in the next frame image according to the predicted optical flow vectors and the initial target character mask, and outputting the target character position in the next frame image;
and S3, inputting the target character mask in the predicted next frame of image into a self-supervision optical flow learning model, and carrying out continuous iterative tracking prediction until the video is finished, and outputting the moving position and track of the target character in the whole video in real time.
Further, in predicting and outputting the target person position in the next frame image using the self-supervising optical flow learning model, the mask of the target person in the current frame image is corrected and updated by the user click operation, and the accurate target person position in the next frame image is predicted and output based on the corrected and updated target person mask.
The interactive video person tracking method proposed by the present embodiment will be described in more detail below.
In step S1, video data comprising a sequence of continuous frame images is acquired, and the initial position of the target person in the first frame image of the sequence is determined through a user click operation. Specifically, for the acquired video data, a person of interest is selected as the target person through the user's click operation, and the frame image containing the target person is taken as the first frame image; at this point, the initial position of the target person in the first frame image of the video is determined.
In the step S2, a self-supervision optical flow learning model is constructed and pre-trained; based on the initial position of the target person determined in the step S1, a target person initial mask is generated, the target person initial mask and video data are input into a self-supervision optical flow learning model which is already pre-trained, the target person mask in the first frame image is transferred into the next frame image, the target person mask in the next frame image is predicted, and then the target person position in the next frame image is predicted and output.
Based on the constructed and trained self-supervision optical flow learning model, the motion information in the continuous frame image sequence is learned, and accurate target person tracking prediction can be provided. The training process of the self-supervision optical flow learning model, as shown in Fig. 2, includes:
and S2.1, constructing a training data set by using video data which comprises a continuous frame image sequence and is marked with a target person frame in each frame image. In this embodiment, a training dataset is constructed from video data comprising a sequence of successive frames of images, and for each frame of images in the training dataset, a labeling frame of the target person is provided or labeled, which can be labeled manually by the user or automatically by other target detection models.
Pre-training the self-supervision optical flow learning model on this training data set lays the foundation for using the pre-trained model to learn optical flow information and person motion across multiple continuous frame images.
And step S2.2, pre-training a self-supervision optical flow learning model based on the training data set.
In this embodiment, the self-supervised optical flow learning model employs an encoder-decoder architecture. The input video frame image sequence V = {I_1, I_2, ..., I_T} comprises T frame images, each frame image having a resolution of H × W, where H and W denote the height and width, and each frame image is a 3-channel RGB image. Each frame image is divided into N non-overlapping image blocks of fixed size, each block being of size P_h × P_w, where P_h and P_w denote the height and width of an image block. In this way each frame image is converted into a sequence, denoted X = {x_1, x_2, ..., x_N}. Feature extraction is performed on the non-overlapping image blocks of each frame image by the encoder, yielding a feature f_i for each image block, where f_i denotes the feature extracted from the i-th image block. The encoder consists of L attention layers, and each attention layer comprises a multi-head self-attention layer (MSA) and a multi-head cross-attention layer (MCA). The multi-head self-attention layer captures the global pixel dependencies within the same frame image, while the multi-head cross-attention layer transfers information between different frame images; through the alternating action of the two attention layers, spatially global and temporally correlated modeling is achieved. The output of the last attention layer is the set of image-block features modeled with spatial globality and temporal correlation. These image-block feature maps are re-stitched according to the inverse of the image block division to obtain the complete frame-level feature map F_t of each frame. Encoding video frame images by dividing each frame image into image blocks reduces the amount of computation. Based on this dividing, stitching, and feature extraction procedure, the feature map sequence {F_1, F_2, ..., F_T} of the entire continuous frame image sequence of the video is obtained, where D denotes the dimension of the feature vectors (i.e., feature maps).
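For concreteness, the following is a minimal PyTorch-style sketch of the patch division and of one encoder layer that alternates multi-head self-attention (within a frame) with multi-head cross-attention (across adjacent frames), as described above. The module names, feature dimension, patch size, and number of heads are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

def frames_to_patches(frames, patch=16):
    # frames: (T, 3, H, W) -> (T, N, 3*patch*patch), N = (H/patch)*(W/patch)
    T, C, H, W = frames.shape
    x = frames.unfold(2, patch, patch).unfold(3, patch, patch)       # (T, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch * patch)
    return x

class SpatioTemporalLayer(nn.Module):
    """One encoder layer: self-attention inside a frame, then cross-attention to the adjacent frame."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)   # spatial (within-frame)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)   # temporal (cross-frame)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, cur, ref):
        # cur, ref: (B, N, dim) token sequences of the current and adjacent frame
        h = cur + self.msa(self.norm1(cur), self.norm1(cur), self.norm1(cur))[0]  # global pixel dependencies in one frame
        out = h + self.mca(self.norm2(h), ref, ref)[0]                            # information transfer between frames
        return out

# usage sketch: embed the patches of two adjacent frames, then apply one of the L layers
patches = frames_to_patches(torch.randn(2, 3, 224, 224))
embed = nn.Linear(patches.shape[-1], 256)
cur, ref = embed(patches[0:1]), embed(patches[1:2])
feat = SpatioTemporalLayer()(cur, ref)        # (1, N, 256) frame-level tokens, to be re-stitched
```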
The decoder has a structure similar to that of the encoder and is mainly used in the pre-training process. It also comprises L attention layers, each consisting of self-attention layers and cross-attention layers, and is used to decode each feature map and reconstruct the features.
Finally, cross-entropy loss between the original images in the input video frame image sequence and the reconstructed images is used as the loss function, and the self-supervision optical flow learning model is optimized based on this loss, yielding the pre-trained model. In fact, this pre-training refers to training the feature extraction module of the self-supervision optical flow learning model; through pre-training, the model can subsequently perform feature extraction on input images.
Step S2.3, inputting each frame image in the continuous frame image sequence into a self-supervision optical flow learning model, converting a first frame image in the continuous frame image sequence into a binary mask image, and generating a target character initial mask according to the initial position of a target character, wherein a target character frame area marked in the first frame image is a foreground, and a non-target character frame area is a background.
Because the target person frame has already been marked in the first frame image (determined by the user's click operation), a binary mask is applied to the first frame image: the pixel values of the marked target person frame region are set to 1 and the pixel values of the remaining regions are set to 0. Multiplying the mask with the corresponding pixel positions of the first frame image leaves only the marked target person frame region visible (the foreground), while the remaining pixels are 0 (the background). Through this data processing, the self-supervision optical flow learning model focuses only on the features of the target person region.
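As an illustration of the data processing described above, the sketch below builds the initial binary mask from the user-annotated target person box and multiplies it with the first frame image; the box format (x1, y1, x2, y2) and the helper name are assumptions made for the example.

```python
import numpy as np

def make_initial_mask(first_frame, person_box):
    """first_frame: (H, W, 3) RGB image; person_box: (x1, y1, x2, y2) from the user's click/annotation.
    Returns the binary mask and the masked frame in which only the target person region is visible."""
    H, W = first_frame.shape[:2]
    mask = np.zeros((H, W), dtype=np.uint8)
    x1, y1, x2, y2 = person_box
    mask[y1:y2, x1:x2] = 1                         # target person frame region = foreground (1)
    masked_frame = first_frame * mask[..., None]   # background pixel values become 0
    return mask, masked_frame
```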
And S2.4, extracting a feature map of each frame of image based on the pre-trained self-supervision optical flow learning model, carrying out feature alignment on the feature maps of the adjacent frame of images to generate an optical flow map, and extracting optical flow vectors between each pair of adjacent frame of images.
In this embodiment, the feature map of each frame image is extracted by using the optical flow estimation network model (i.e., the self-supervised optical flow learning model) obtained after the pre-training is completed, and feature alignment is performed on the feature maps of the adjacent frame images to generate an optical flow map, so as to implement prediction of optical flow vectors between the adjacent frame images.
Feature alignment to generate the optical flow graph comprises the following steps:
firstly, calculating a similarity matrix between feature images of two adjacent frames, and comparing feature similarity between the two adjacent frames according to the similarity matrix so as to find out pixel points of matching features. The similarity matrix is calculated by multiplying the feature values of the pixel point positions corresponding to the feature images of two adjacent frames, and the higher the product value is, the stronger the similarity is indicated.
Then, the similarity matrix is normalized to obtain a correlation score s(p, q) between each pixel position of the feature maps of the two adjacent frames, where s(p, q) is a scalar between 0 and 1 representing the degree of similarity between pixel position p in the current frame image and pixel position q in the next frame image.
Next, for each pixel position p, the best-matching pixel position in the adjacent frame image is found according to the correlation score, i.e., the position with the highest correlation score, denoted q*(p).
Further, forward and backward consistency checks are performed using the correlation scores to reduce mismatching. If the Euclidean distance between the position obtained by forward matching and the position obtained by backward matching is smaller than a set threshold, the two positions are considered consistent; otherwise, if the Euclidean distance is greater than the threshold, they are considered inconsistent, which produces a larger value in the loss function, and the inconsistency is reduced by optimizing the model through back-propagation. This forward-backward consistency check ensures the consistency of feature matching.
Finally, based on the matched pixel pairs determined in the adjacent frame images, the final optical flow graph is generated through pixel-level displacement estimation. Assuming that the pixel at position p in the current frame image matches position q*(p) in the next frame image, with the Euclidean distance between them quantifying the displacement, the corresponding displacement vector can be expressed as d(p) = q*(p) − p.
By calculating the displacement vector of every pixel between adjacent frame images, the optical flow graph of the entire current frame image is obtained, and the optical flow vectors between each pair of adjacent frame images are extracted in this way.
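The following sketch illustrates, under stated assumptions, the correlation-based matching, the forward-backward consistency check, and the pixel-level displacement estimation d(p) = q*(p) − p described above; the softmax normalization and the threshold value are illustrative choices, not fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def flow_from_features(feat_cur, feat_nxt, fb_thresh=1.5):
    """feat_cur, feat_nxt: (H, W, D) frame-level feature maps of two adjacent frames.
    Returns a dense displacement field (H, W, 2) and a forward-backward consistency mask."""
    H, W, D = feat_cur.shape
    a = feat_cur.reshape(-1, D)                      # (HW, D)
    b = feat_nxt.reshape(-1, D)
    sim = a @ b.t()                                  # similarity matrix from feature products
    score = F.softmax(sim, dim=1)                    # normalized correlation scores in [0, 1]

    # best-matching position in the next frame for every pixel of the current frame (forward),
    # and the reverse direction (backward) for the consistency check
    fwd = score.argmax(dim=1)
    bwd = F.softmax(sim.t(), dim=1).argmax(dim=1)

    grid_y, grid_x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([grid_x.reshape(-1), grid_y.reshape(-1)], dim=1).float()   # (HW, 2) pixel coords

    matched = pos[fwd]                               # q*(p): coordinates of the best match
    disp = matched - pos                             # displacement vector d(p) = q*(p) - p

    # forward-backward check: mapping p -> q*(p) -> back again should land near p
    round_trip = pos[bwd[fwd]]
    consistent = (round_trip - pos).norm(dim=1) < fb_thresh

    return disp.reshape(H, W, 2), consistent.reshape(H, W)
```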
Further, the self-supervision training is performed on the self-supervision optical flow learning model. Based on the extracted optical flow vector between each pair of adjacent frame images, a forward-backward consistency loss function and an optical flow loss function are constructed, and training of the model is completed through self-supervision training until the loss function is minimum.
In order to ensure the consistency of feature matching during training, the forward-backward consistency loss function is set as L_fb = Σ_p ‖ d_fw(p) + d_bw(p) ‖², where d_fw denotes the displacement vector obtained by forward optical flow estimation and d_bw denotes the displacement vector obtained by reverse (backward) optical flow estimation.
In addition to the forward-backward consistency loss function, the self-supervised training process also uses the squared optical flow error as the optical flow loss function. Specifically, on the basis of the learned optical flow vectors between each pair of adjacent frame images, the optical flow loss is constructed by comparison with reference optical flow vectors between the adjacent frames: L_flow = Σ_p ‖ d(p) − d_HS(p) ‖², where d_HS denotes the optical flow information extracted with the Horn-Schunck method, which serves as the self-supervision signal. The dense optical flow obtained by the Horn-Schunck method requires computing an optical flow value for every pixel and is therefore computationally expensive; through the above approach of this embodiment, optical flow information can be obtained with a smaller amount of computation.
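A minimal sketch of the two training losses described above, assuming a simple mean reduction over pixels and an assumed weighting factor between the two terms:

```python
import torch

def forward_backward_loss(d_fw, d_bw):
    # d_fw, d_bw: (H, W, 2) forward and backward displacement fields;
    # if the matching is consistent, the two displacements should cancel out
    return ((d_fw + d_bw) ** 2).sum(dim=-1).mean()

def optical_flow_loss(d_pred, d_hs):
    # d_pred: predicted flow; d_hs: Horn-Schunck flow used as the self-supervision signal
    return ((d_pred - d_hs) ** 2).sum(dim=-1).mean()

def total_loss(d_fw, d_bw, d_hs, lam=1.0):
    # lam is an illustrative weighting factor, not specified in the embodiment
    return forward_backward_loss(d_fw, d_bw) + lam * optical_flow_loss(d_fw, d_hs)
```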
Step S2.5, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the current target person mask, and outputting the target person position in the next frame image.
In step S3, the target character mask in the predicted next frame image is input into the self-supervision optical flow learning model, continuous iterative tracking prediction is performed until the video is finished, and the moving position and track of the target character in the whole video are output in real time.
As another embodiment, to better train the self-supervised optical flow estimation network, the self-supervised pre-training of the self-supervised optical flow learning model is further included before training the self-supervised optical flow learning model. As shown in fig. 3, the self-supervised pre-training process includes:
firstly, constructing a pre-training data set by video data comprising a continuous frame image sequence;
secondly, each frame image in the pre-training data set is input into the self-supervision optical flow learning model for training; the feature map of each frame image is extracted through the encoder, and, according to the similarity of each pixel between the feature maps of adjacent frames, the pixel feature regions whose scores exceed a threshold are selected and masked out, i.e., all feature values of those regions are set to zero, so that the network pays more attention to important regions and is guided to learn to reconstruct the masked regions. Preferably, the similarity values are sorted and the pixel regions corresponding to the top 50% of similarity values are selected, where the sorting can be approximated with a soft sorting (soft-sort) method;
then, decoding and reconstruction are carried out based on the masked pixel feature regions to restore the original image features as far as possible, the reconstruction objective being to minimize the pixel differences in the masked regions;
finally, the Euclidean distance loss (or another reconstruction loss) between the extracted features and the recovered features is used as the loss function, and iterative training is continued until the loss function reaches its minimum, completing the pre-training of the self-supervision optical flow learning model. The network parameters are optimized by minimizing the reconstruction loss so that the network can recover the original image from the corrupted input. Through this self-supervised pre-training, the network model learns useful representations of the image and generalizes well to reconstruction tasks.
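The masked pre-training step described above can be sketched roughly as follows; the use of cosine similarity as the per-patch similarity measure and the decoder interface are assumptions, while the top-50% masking ratio follows the embodiment.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_step(feat_cur, feat_nxt, decoder, mask_ratio=0.5):
    """One self-supervised pre-training step on a pair of adjacent-frame patch features.
    feat_cur, feat_nxt: (N, D) patch features of two adjacent frames;
    decoder: any module mapping (N, D) -> (N, D), e.g. the attention-based decoder."""
    # per-patch similarity with the adjacent frame (cosine similarity as an assumed measure)
    sim = F.cosine_similarity(feat_cur, feat_nxt, dim=-1)           # (N,)

    # select the top-ranked (highest-similarity, i.e. most important) patches and zero them out
    k = int(mask_ratio * sim.numel())
    masked_idx = sim.topk(k).indices
    corrupted = feat_cur.clone()
    corrupted[masked_idx] = 0.0

    # decode / reconstruct and penalize only the masked regions with a Euclidean (L2) loss
    recon = decoder(corrupted)
    loss = ((recon[masked_idx] - feat_cur[masked_idx]) ** 2).sum(dim=-1).sqrt().mean()
    return loss
```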
The trained self-supervision optical flow learning model is obtained through the above pre-training and training. The initial target person mask and the video data are input into the trained model; the encoder extracts the feature map of each input frame image, the optical flow vectors between adjacent frame images in the video data are predicted based on the feature maps, and according to the predicted optical flow vectors the target person mask in the first frame image is transferred into the next frame image, predicting the target person mask in the next frame image; the decoder reconstructs the image, and the predicted target person position in the next frame image is output.
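To illustrate the inference-time iterative tracking, the sketch below propagates the current target person mask into the next frame with the predicted displacement field and records the person position per frame; `predict_flow` stands in for the trained self-supervision optical flow learning model, and the centroid-based position and helper names are illustrative assumptions.

```python
import numpy as np

def propagate_mask(mask, flow):
    """Transfer the target person mask into the next frame using the predicted flow.
    mask: (H, W) binary mask of the current frame; flow: (H, W, 2) displacement (dx, dy)."""
    H, W = mask.shape
    next_mask = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)                        # foreground pixels of the current frame
    dx = flow[ys, xs, 0].round().astype(int)
    dy = flow[ys, xs, 1].round().astype(int)
    nx = np.clip(xs + dx, 0, W - 1)
    ny = np.clip(ys + dy, 0, H - 1)
    next_mask[ny, nx] = 1                            # foreground moved by its displacement vectors
    return next_mask

def track(frames, init_mask, predict_flow):
    """Iterative tracking loop: predict_flow(frame_t, frame_t1) stands in for the trained model."""
    mask, trajectory = init_mask, []
    for t in range(len(frames) - 1):
        flow = predict_flow(frames[t], frames[t + 1])
        mask = propagate_mask(mask, flow)
        ys, xs = np.nonzero(mask)
        trajectory.append((xs.mean(), ys.mean()) if len(xs) else None)   # person position per frame
    return trajectory
```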
Further, the present embodiment also proposes an interactive tracking update method, that is, in the process of predicting and outputting the target person position in the next frame image by using the self-supervision optical flow learning model, the mask of the target person in the current frame image is corrected and updated by the user clicking operation, and the accurate target person position in the next frame image is predicted and output based on the corrected and updated target person mask. By the interactive tracking and updating mode, updated target person position and mask information are obtained, and accuracy of target tracking is guaranteed.
According to this embodiment, through interactive updating of the optical flow prediction and the person mask, the target person and its position can be continuously adjusted across the continuous frame images, achieving high-precision interactive video person tracking. The output is the position and mask sequence of the final target person, providing the position and trajectory information of the target person throughout the video.
The interactive video character tracking method based on self-supervision optical flow learning provided by the embodiment greatly improves the accuracy of target character tracking. Simulation comparison experiment results are obtained through simulation of the target person tracking model, and are shown in the following table 1.
Table 1 comparison of simulation results of the scheme described in this example with other algorithms
By evaluating the accuracy, overlap rate, and average overlap rate error of each model, it is evident that the scheme of this embodiment achieves higher accuracy, higher overlap rate, and lower average overlap rate error, and is therefore superior. The accuracy is the proportion of persons correctly identified by the model; higher accuracy indicates better person identification. The overlap rate is the degree of overlap between the bounding box predicted by the model and the true bounding box; a higher overlap rate indicates that the prediction is closer to the true bounding box. The average overlap rate error is the average error between the predicted overlap rate and the true overlap rate; a lower error value indicates that the model's prediction is closer to the true value.
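Assuming the overlap rate is computed as intersection-over-union (IoU) between the predicted and true bounding boxes, it can be evaluated as in the following sketch:

```python
def overlap_rate(box_a, box_b):
    # boxes as (x1, y1, x2, y2); overlap rate = intersection area / union area (IoU)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```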
Example two
The embodiment provides an interactive video character tracking system based on self-supervision optical flow learning, which comprises the following components:
the data acquisition module is used for acquiring video data comprising a continuous frame image sequence;
the target person initial position determining module is used for determining the initial position of the target person in the first frame image of the continuous frame image sequence through the click operation of the user based on the video data;
the target person tracking module is used for generating a target person initial mask based on the target person initial position, inputting the target person initial mask and video data into the self-supervision optical flow learning model which is pre-trained, predicting optical flow vectors between adjacent frame images in the video data, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the target person initial mask, and outputting the target person position in the next frame image; and inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished.
The system according to this embodiment further includes:
and the target person tracking interaction module is used for correcting and updating the mask of the target person in the current frame image through a user clicking operation in the process of predicting and outputting the position of the target person in the next frame image by utilizing the self-supervision optical flow learning model, and predicting and outputting the accurate position of the target person in the next frame image based on the corrected and updated target person mask.
The steps involved in the second embodiment correspond to those of the first embodiment of the method, and the detailed description of the second embodiment can be found in the related description section of the first embodiment.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computing means; alternatively, they may be implemented by program code executable by computing means, so that they may be stored in storage means for execution by the computing means, or they may each be made into individual integrated circuit modules, or a plurality of the modules or steps may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (6)

1. An interactive video character tracking method based on self-supervision optical flow learning is characterized by comprising the following steps:
acquiring video data comprising a continuous frame image sequence, and determining the initial position of a target person in a first frame image of the continuous frame image sequence through clicking operation of a user;
generating an initial target character mask based on the initial target character position, inputting the initial target character mask and video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, predicting the target character mask in the next frame image according to the predicted optical flow vectors and the initial target character mask, and outputting the target character position in the next frame image;
inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished;
the self-supervision optical flow learning model adopts an encoder-decoder structure; the encoder comprises a plurality of attention layers, wherein each attention layer comprises a multi-head self-attention layer and a multi-head cross-attention layer and is used for extracting image features and generating a feature map; the decoder also includes a plurality of attention layers, each including a multi-headed self-attention layer and a multi-headed cross-attention layer for decoding and feature reconstruction of the feature map;
the training process of the self-supervision optical flow learning model comprises the following steps:
constructing a training data set by using video data which comprises a continuous frame image sequence and each frame image is marked with a target character frame;
pre-training a self-supervision optical flow learning model based on the training data set;
inputting each frame image in the continuous frame image sequence into a self-supervision optical flow learning model, converting a first frame image in the continuous frame image sequence into a binary mask image, and generating a target character initial mask according to the initial position of the target character;
extracting a feature map of each frame of image based on the self-supervision optical flow learning model which is already pre-trained, carrying out feature alignment on the feature maps of the adjacent frame of images, generating an optical flow graph, and extracting optical flow vectors between each pair of adjacent frame of images; the extracting the feature map of each frame of image comprises the following steps:
dividing each frame of image into a plurality of non-overlapping image blocks with fixed sizes, extracting features of the non-overlapping image blocks of each frame of image by using an encoder, and extracting feature images of each image block subjected to space globalization and time relevance modeling; re-splicing the feature images of each image block according to the inverse process of image block division to obtain a complete frame-level feature image of each frame image;
based on the extracted optical flow vector between each pair of adjacent frame images, a forward-backward consistency loss function and an optical flow loss function are constructed, and training of the model is completed through self-supervision training until the loss function is minimum.
2. The interactive video person tracking method based on self-supervision optical flow learning according to claim 1, wherein in predicting and outputting the target person position in the next frame image using the self-supervision optical flow learning model, the mask of the target person in the current frame image is corrected and updated by the user click operation, and the accurate target person position in the next frame image is predicted and output based on the corrected and updated target person mask.
3. The interactive video character tracking method based on self-supervised optical flow learning of claim 1, further comprising a self-supervised pre-training process for the self-supervised optical flow learning model prior to training the self-supervised optical flow learning model, comprising:
constructing a pre-training data set with video data comprising a sequence of successive frame images;
inputting each frame of image in the pre-training data set into a self-supervision optical flow learning model for training, extracting a characteristic image of each frame of image by encoding, selecting a pixel characteristic region with a score exceeding a threshold value according to the similarity of pixels in the characteristic images of adjacent frames, shielding the pixel characteristic region, and then carrying out decoding reconstruction based on the shielded pixel characteristic region to recover the original image characteristics;
and taking Euclidean distance loss between the extracted features and the recovered features as a loss function, and performing continuous iterative training until the loss function is minimum, thereby completing the pre-training of the self-supervision optical flow learning model.
4. The interactive video person tracking method based on self-supervised optical flow learning of claim 1, wherein feature alignment of feature maps of adjacent frame images to generate an optical flow map comprises:
calculating a similarity matrix between two adjacent frame image feature images;
normalizing the similarity matrix to obtain a correlation score between each pixel point of the two adjacent frame image feature images;
for each pixel point, searching the position of the pixel point which is most matched with the pixel point in the adjacent frame image according to the correlation score, and carrying out forward and backward consistency check by utilizing the correlation score;
and generating a final light flow graph through displacement estimation at a pixel level based on the matched pixel point pairs in the determined adjacent frame images.
5. An interactive video character tracking system based on self-supervised optical flow learning, comprising:
the data acquisition module is used for acquiring video data comprising a continuous frame image sequence;
the target person initial position determining module is used for determining the initial position of the target person in the first frame image of the continuous frame image sequence through the click operation of the user based on the video data;
the target person tracking module is used for generating a target person initial mask based on the target person initial position, inputting the target person initial mask and video data into the self-supervision optical flow learning model which is pre-trained, predicting optical flow vectors between adjacent frame images in the video data, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the target person initial mask, and outputting the target person position in the next frame image; inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished;
the self-supervision optical flow learning model adopts an encoder-decoder structure; the encoder comprises a plurality of attention layers, wherein each attention layer comprises a multi-head self-attention layer and a multi-head cross-attention layer and is used for extracting image features and generating a feature map; the decoder also includes a plurality of attention layers, each including a multi-headed self-attention layer and a multi-headed cross-attention layer for decoding and feature reconstruction of the feature map;
the training process of the self-supervision optical flow learning model comprises the following steps:
constructing a training data set by using video data which comprises a continuous frame image sequence and each frame image is marked with a target character frame;
pre-training a self-supervision optical flow learning model based on the training data set;
inputting each frame image in the continuous frame image sequence into a self-supervision optical flow learning model, converting a first frame image in the continuous frame image sequence into a binary mask image, and generating a target character initial mask according to the initial position of the target character;
extracting a feature map of each frame of image based on the self-supervision optical flow learning model which is already pre-trained, carrying out feature alignment on the feature maps of the adjacent frame of images, generating an optical flow graph, and extracting optical flow vectors between each pair of adjacent frame of images; the extracting the feature map of each frame of image comprises the following steps:
dividing each frame of image into a plurality of non-overlapping image blocks with fixed sizes, extracting features of the non-overlapping image blocks of each frame of image by using an encoder, and extracting feature images of each image block subjected to space globalization and time relevance modeling; re-splicing the feature images of each image block according to the inverse process of image block division to obtain a complete frame-level feature image of each frame image;
based on the extracted optical flow vector between each pair of adjacent frame images, a forward-backward consistency loss function and an optical flow loss function are constructed, and training of the model is completed through self-supervision training until the loss function is minimum.
6. The self-supervised optical flow learning based interactive video character tracking system as recited in claim 5, further comprising:
and the target person tracking interaction module is used for correcting and updating the mask of the target person in the current frame image through a user clicking operation in the process of predicting and outputting the position of the target person in the next frame image by utilizing the self-supervision optical flow learning model, and predicting and outputting the accurate position of the target person in the next frame image based on the corrected and updated target person mask.
CN202311694258.1A 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning Active CN117392180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311694258.1A CN117392180B (en) 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311694258.1A CN117392180B (en) 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning

Publications (2)

Publication Number Publication Date
CN117392180A CN117392180A (en) 2024-01-12
CN117392180B true CN117392180B (en) 2024-03-26

Family

ID=89467012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311694258.1A Active CN117392180B (en) 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning

Country Status (1)

Country Link
CN (1) CN117392180B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
CN115147457A (en) * 2022-07-08 2022-10-04 河南大学 Memory enhanced self-supervision tracking method and device based on space-time perception
CN115375732A (en) * 2022-08-18 2022-11-22 南京邮电大学 Unsupervised target tracking method and system based on module migration
CN115393396A (en) * 2022-08-18 2022-11-25 西安电子科技大学 Unmanned aerial vehicle target tracking method based on mask pre-training
WO2023284341A1 (en) * 2021-07-15 2023-01-19 北京小蝇科技有限责任公司 Deep learning-based context-sensitive detection method for urine formed element
CN116310971A (en) * 2023-03-03 2023-06-23 长春理工大学 Unsupervised target tracking method based on sparse attention updating template features
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739078B (en) * 2020-06-15 2022-11-18 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
WO2023284341A1 (en) * 2021-07-15 2023-01-19 北京小蝇科技有限责任公司 Deep learning-based context-sensitive detection method for urine formed element
CN115147457A (en) * 2022-07-08 2022-10-04 河南大学 Memory enhanced self-supervision tracking method and device based on space-time perception
CN115375732A (en) * 2022-08-18 2022-11-22 南京邮电大学 Unsupervised target tracking method and system based on module migration
CN115393396A (en) * 2022-08-18 2022-11-25 西安电子科技大学 Unmanned aerial vehicle target tracking method based on mask pre-training
CN116310971A (en) * 2023-03-03 2023-06-23 长春理工大学 Unsupervised target tracking method based on sparse attention updating template features
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Kejing; Sun Fengmei. A target tracking algorithm based on deep belief networks. Electronic Design Engineering, 2018, No. 11 (full text). *

Also Published As

Publication number Publication date
CN117392180A (en) 2024-01-12

Similar Documents

Publication Publication Date Title
Wang et al. Multi-view stereo in the deep learning era: A comprehensive review
CN112767554B (en) Point cloud completion method, device, equipment and storage medium
Kim et al. Recurrent temporal aggregation framework for deep video inpainting
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113592913B (en) Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
Oh et al. Space-time memory networks for video object segmentation with user guidance
Xue et al. ECANet: Explicit cyclic attention-based network for video saliency prediction
Zhang et al. ReX-Net: A reflectance-guided underwater image enhancement network for extreme scenarios
CN113066034A (en) Face image restoration method and device, restoration model, medium and equipment
CN113808047A (en) Human motion capture data denoising method
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
Zhou et al. Transformer-based multi-scale feature integration network for video saliency prediction
WO2023015414A1 (en) Method for eliminating uncertainty in self-supervised three-dimensional reconstruction
CN114240811A (en) Method for generating new image based on multiple images
Zhou et al. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model
CN111738092B (en) Method for recovering occluded human body posture sequence based on deep learning
Zhang et al. Multi-scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction
CN117392180B (en) Interactive video character tracking method and system based on self-supervision optical flow learning
Su et al. Physical model and image translation fused network for single-image dehazing
Wang et al. Temporal consistent portrait video segmentation
CN116978057A (en) Human body posture migration method and device in image, computer equipment and storage medium
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
Li et al. PFONet: A Progressive Feedback Optimization Network for Lightweight Single Image Dehazing
Wang et al. Camera Parameters Aware Motion Segmentation Network with Compensated Optical Flow
Cao et al. Video object detection algorithm based on dynamic combination of sparse feature propagation and dense feature aggregation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant