CN117392180B - Interactive video character tracking method and system based on self-supervision optical flow learning - Google Patents

Interactive video character tracking method and system based on self-supervision optical flow learning

Info

Publication number
CN117392180B
CN117392180B (application CN202311694258.1A)
Authority
CN
China
Prior art keywords
optical flow
self
image
frame image
frame
Prior art date
Legal status
Active
Application number
CN202311694258.1A
Other languages
Chinese (zh)
Other versions
CN117392180A (en)
Inventor
王少华
秦者云
刘兴波
庞瑞英
聂秀山
尹义龙
Current Assignee
Shandong Guozi Software Co ltd
Shandong Jianzhu University
Original Assignee
Shandong Guozi Software Co ltd
Shandong Jianzhu University
Priority date
Filing date
Publication date
Application filed by Shandong Guozi Software Co ltd, Shandong Jianzhu University
Priority to CN202311694258.1A
Publication of CN117392180A
Application granted
Publication of CN117392180B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Abstract

The invention discloses an interactive video character tracking method and system based on self-supervision optical flow learning, relating to the technical field of video character tracking. The method comprises the following steps: acquiring video data comprising a continuous frame image sequence, determining the initial position of a target person in the first frame image of the sequence through a user click operation, and generating an initial mask of the target person accordingly; inputting the initial mask of the target person and the video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, and predicting the mask and position of the target person in the next frame image according to the predicted optical flow vectors and the initial mask; and feeding the predicted target character mask of the next frame image back into the self-supervision optical flow learning model for continuous iterative tracking prediction until the video ends, outputting the moving position and trajectory of the target character throughout the video in real time, so as to realize more accurate target character tracking.

Description

Interactive video character tracking method and system based on self-supervision optical flow learning
Technical Field
The invention relates to the technical field of video character tracking, in particular to an interactive video character tracking method and system based on self-supervision optical flow learning.
Background
The statements in this section merely provide background of the present disclosure and may not necessarily constitute prior art.
With the continuous advancement of technology, video person tracking technology has found wide application in many fields, including security monitoring, social media, human-computer interaction, entertainment games, and health sports. When the video person tracking technology is applied to the security monitoring field or system, specific target persons can be tracked and monitored in real time, so that the accuracy and efficiency of the monitoring system can be improved, and the security is enhanced. Therefore, developing more accurate video person tracking technology is one of the current focus research directions.
Existing person tracking techniques can be categorized into offline processing and online processing. Offline methods can continuously correct errors from previous frames during processing and therefore have higher fault tolerance, but their processing speed is very low and tracking results cannot be obtained in real time. Online methods can produce real-time tracking results, but they handle occlusion, crossing between persons, and long-term disappearance of persons poorly; once the judgment on a certain frame image is wrong, the frames after it cannot be corrected. In addition, existing online person tracking techniques require complex computation for each tracking prediction, which is inefficient and consumes considerable computing resources.
Disclosure of Invention
In order to overcome the above deficiencies in the prior art, the invention provides an interactive video character tracking method and system based on self-supervision optical flow learning. A self-supervision optical flow learning model learns the motion information of the target person in the video to provide accurate real-time tracking prediction, and is combined with interactive selection and correction of the target person to achieve more accurate real-time tracking. This avoids erroneous subsequent tracking predictions caused by a wrong judgment on a previous frame image; the calculation process is simple, improving tracking efficiency and saving time and computing resources.
In a first aspect, the present invention provides an interactive video person tracking method based on self-supervised optical flow learning.
An interactive video character tracking method based on self-supervision optical flow learning, comprising:
acquiring video data comprising a continuous frame image sequence, and determining the initial position of a target person in a first frame image of the continuous frame image sequence through clicking operation of a user;
generating an initial target character mask based on the initial target character position, inputting the initial target character mask and video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, predicting the target character mask in the next frame image according to the predicted optical flow vectors and the initial target character mask, and outputting the target character position in the next frame image;
and inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished.
In a second aspect, the present invention provides an interactive video person tracking system based on self-supervised optical flow learning.
An interactive video character tracking system based on self-supervised optical flow learning, comprising:
the data acquisition module is used for acquiring video data comprising a continuous frame image sequence;
the target person initial position determining module is used for determining the initial position of the target person in the first frame image of the continuous frame image sequence through the click operation of the user based on the video data;
the target person tracking module is used for generating a target person initial mask based on the target person initial position, inputting the target person initial mask and video data into the self-supervision optical flow learning model which is pre-trained, predicting optical flow vectors between adjacent frame images in the video data, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the target person initial mask, and outputting the target person position in the next frame image; and inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished.
One or more of the above technical solutions have the following beneficial effects:
1. The invention provides an interactive video character tracking method and system based on self-supervision optical flow learning. By learning the motion information of the target person in the video through a self-supervision optical flow learning model and combining interactive selection and correction of the target person, accurate real-time tracking prediction is provided, more accurate real-time tracking of the target person is achieved, and erroneous subsequent tracking predictions caused by a wrong judgment on a previous frame image are avoided.
2. The invention provides a user-friendly interactive experience: the user can select and correct the target character mask through simple click operations, without a complex labeling process. This interactive selection and correction is more intuitive and natural, making it easier for the user to designate the person of interest.
3. The invention continuously updates the position information of the target person through iterative updates of the self-supervision optical flow learning model, so as to adapt to changes in the continuous frame image sequence of the video. This adaptive tracking capability maintains high real-time tracking accuracy even for complex scenes and long videos.
4. The self-supervision optical flow learning based method can track the target person in a single inference prediction, without multiple iterations or a complex calculation process, which improves tracking efficiency and saves time and computing resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of an interactive video person tracking method based on self-supervised optical flow learning according to an embodiment of the present invention;
FIG. 2 is a flow chart of self-supervised optical flow learning model training in an embodiment of the present invention;
FIG. 3 is a flowchart of a self-supervised optical flow learning model pre-training in an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, it is to be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The embodiment provides an interactive video character tracking method based on self-supervision optical flow learning, as shown in fig. 1, which comprises the following steps:
s1, acquiring video data comprising a continuous frame image sequence, and determining an initial position of a target person in a first frame image of the continuous frame image sequence through clicking operation of a user;
s2, generating an initial target character mask based on the initial target character position, inputting the initial target character mask and video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, predicting the initial target character mask in the next frame image according to the predicted optical flow vectors and the initial target character mask, and outputting the target character position in the next frame image;
and S3, inputting the target character mask in the predicted next frame of image into a self-supervision optical flow learning model, and carrying out continuous iterative tracking prediction until the video is finished, and outputting the moving position and track of the target character in the whole video in real time.
Further, in predicting and outputting the target person position in the next frame image using the self-supervising optical flow learning model, the mask of the target person in the current frame image is corrected and updated by the user click operation, and the accurate target person position in the next frame image is predicted and output based on the corrected and updated target person mask.
The interactive video person tracking method proposed by the present embodiment will be described in more detail below.
In step S1, video data comprising a sequence of continuous frame images is acquired, and the initial position of the target person in the first frame image of the sequence is determined through a user click operation. Specifically, for the acquired video data, a person of interest is selected as the target person through the user's click operation, and the frame image containing the target person is taken as the first frame image; at this point, the initial position of the target person in the first frame image of the video is determined.
In the step S2, a self-supervision optical flow learning model is constructed and pre-trained; based on the initial position of the target person determined in the step S1, a target person initial mask is generated, the target person initial mask and video data are input into a self-supervision optical flow learning model which is already pre-trained, the target person mask in the first frame image is transferred into the next frame image, the target person mask in the next frame image is predicted, and then the target person position in the next frame image is predicted and output.
Based on the constructed and trained self-supervision optical flow learning model, the motion information in the continuous frame image sequence is learned, and accurate target person tracking prediction can be provided. The training process of the self-supervision optical flow learning model, as shown in Fig. 2, includes:
and S2.1, constructing a training data set by using video data which comprises a continuous frame image sequence and is marked with a target person frame in each frame image. In this embodiment, a training dataset is constructed from video data comprising a sequence of successive frames of images, and for each frame of images in the training dataset, a labeling frame of the target person is provided or labeled, which can be labeled manually by the user or automatically by other target detection models.
Pre-training the self-supervision optical flow learning model on this training data set lays the foundation for using the pre-trained model to learn optical flow information and person motion across multiple continuous frame images.
And step S2.2, pre-training a self-supervision optical flow learning model based on the training data set.
In this embodiment, the self-supervised optical flow learning model employs an encoder-decoder architecture. The input video frame image sequence V = {I_1, I_2, ..., I_T} comprises T frame images, each frame image having a resolution of H × W, where H and W denote the height and width, and each frame image is a 3-channel RGB image. Each frame image is divided into N non-overlapping image blocks of fixed size, each block being of size P_h × P_w, where P_h and P_w denote the height and width of an image block. In this way each frame image is converted into a sequence, denoted X = {x_1, x_2, ..., x_N}. Feature extraction is performed on the non-overlapping image blocks of each frame image by the encoder, yielding a feature f_i for each image block, where f_i denotes the feature extracted from the i-th image block. The encoder consists of L attention layers, and each attention layer comprises a multi-head self-attention layer (MSA) and a multi-head cross-attention layer (MCA). The multi-head self-attention layer captures the global pixel dependencies within the same frame image, while the multi-head cross-attention layer transfers information between different frame images; through the alternating action of the two attention layers, spatially global and temporally correlated modeling is achieved. The output of the last attention layer is the set of image-block features modeled with spatial globality and temporal correlation. These image-block feature maps are re-stitched according to the inverse of the image block division to obtain the complete frame-level feature map F_t of each frame. Encoding video frame images by dividing each frame image into image blocks reduces the amount of computation. Based on this dividing, stitching, and feature extraction procedure, the feature map sequence {F_1, F_2, ..., F_T} of the entire continuous frame image sequence of the video is obtained, where D denotes the dimension of the feature vectors (i.e., feature maps).
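For concreteness, the following is a minimal PyTorch-style sketch of the patch division and of one encoder layer that alternates multi-head self-attention (within a frame) with multi-head cross-attention (across adjacent frames), as described above. The module names, feature dimension, patch size, and number of heads are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

def frames_to_patches(frames, patch=16):
    # frames: (T, 3, H, W) -> (T, N, 3*patch*patch), N = (H/patch)*(W/patch)
    T, C, H, W = frames.shape
    x = frames.unfold(2, patch, patch).unfold(3, patch, patch)       # (T, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch * patch)
    return x

class SpatioTemporalLayer(nn.Module):
    """One encoder layer: self-attention inside a frame, then cross-attention to the adjacent frame."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)   # spatial (within-frame)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)   # temporal (cross-frame)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, cur, ref):
        # cur, ref: (B, N, dim) token sequences of the current and adjacent frame
        h = cur + self.msa(self.norm1(cur), self.norm1(cur), self.norm1(cur))[0]  # global pixel dependencies in one frame
        out = h + self.mca(self.norm2(h), ref, ref)[0]                            # information transfer between frames
        return out

# usage sketch: embed the patches of two adjacent frames, then apply one of the L layers
patches = frames_to_patches(torch.randn(2, 3, 224, 224))
embed = nn.Linear(patches.shape[-1], 256)
cur, ref = embed(patches[0:1]), embed(patches[1:2])
feat = SpatioTemporalLayer()(cur, ref)        # (1, N, 256) frame-level tokens, to be re-stitched
```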
The decoder has a structure similar to that of the encoder and is mainly used in the pre-training process. It also comprises L attention layers, each consisting of self-attention layers and cross-attention layers, and is used to decode each feature map and reconstruct the features.
Finally, cross-entropy loss between the original images in the input video frame image sequence and the reconstructed images is used as the loss function, and the self-supervision optical flow learning model is optimized based on this loss, yielding the pre-trained model. In fact, this pre-training refers to training the feature extraction module of the self-supervision optical flow learning model; through pre-training, the model can subsequently perform feature extraction on input images.
Step S2.3, inputting each frame image in the continuous frame image sequence into a self-supervision optical flow learning model, converting a first frame image in the continuous frame image sequence into a binary mask image, and generating a target character initial mask according to the initial position of a target character, wherein a target character frame area marked in the first frame image is a foreground, and a non-target character frame area is a background.
Because the target person frame has already been marked in the first frame image (determined by the user's click operation), a binary mask is applied to the first frame image: the pixel values of the marked target person frame region are set to 1 and the pixel values of the remaining regions are set to 0. Multiplying the mask with the corresponding pixel positions of the first frame image leaves only the marked target person frame region visible (the foreground), while the remaining pixels are 0 (the background). Through this data processing, the self-supervision optical flow learning model focuses only on the features of the target person region.
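As an illustration of the data processing described above, the sketch below builds the initial binary mask from the user-annotated target person box and multiplies it with the first frame image; the box format (x1, y1, x2, y2) and the helper name are assumptions made for the example.

```python
import numpy as np

def make_initial_mask(first_frame, person_box):
    """first_frame: (H, W, 3) RGB image; person_box: (x1, y1, x2, y2) from the user's click/annotation.
    Returns the binary mask and the masked frame in which only the target person region is visible."""
    H, W = first_frame.shape[:2]
    mask = np.zeros((H, W), dtype=np.uint8)
    x1, y1, x2, y2 = person_box
    mask[y1:y2, x1:x2] = 1                         # target person frame region = foreground (1)
    masked_frame = first_frame * mask[..., None]   # background pixel values become 0
    return mask, masked_frame
```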
And S2.4, extracting a feature map of each frame of image based on the pre-trained self-supervision optical flow learning model, carrying out feature alignment on the feature maps of the adjacent frame of images to generate an optical flow map, and extracting optical flow vectors between each pair of adjacent frame of images.
In this embodiment, the feature map of each frame image is extracted by using the optical flow estimation network model (i.e., the self-supervised optical flow learning model) obtained after the pre-training is completed, and feature alignment is performed on the feature maps of the adjacent frame images to generate an optical flow map, so as to implement prediction of optical flow vectors between the adjacent frame images.
Feature alignment to generate the optical flow graph comprises the following steps:
firstly, calculating a similarity matrix between feature images of two adjacent frames, and comparing feature similarity between the two adjacent frames according to the similarity matrix so as to find out pixel points of matching features. The similarity matrix is calculated by multiplying the feature values of the pixel point positions corresponding to the feature images of two adjacent frames, and the higher the product value is, the stronger the similarity is indicated.
Then, the similarity matrix is normalized to obtain a correlation score s(p, q) between each pixel position of the feature maps of the two adjacent frames, where s(p, q) is a scalar between 0 and 1 representing the degree of similarity between pixel position p in the current frame image and pixel position q in the next frame image.
Next, for each pixel position p, the best-matching pixel position in the adjacent frame image is found according to the correlation score, i.e., the position with the highest correlation score, denoted q*(p).
Further, forward and backward consistency checks are performed using the correlation scores to reduce mismatching. If the Euclidean distance between the position obtained by forward matching and the position obtained by backward matching is smaller than a set threshold, the two positions are considered consistent; otherwise, if the Euclidean distance is greater than the threshold, they are considered inconsistent, which produces a larger value in the loss function, and the inconsistency is reduced by optimizing the model through back-propagation. This forward-backward consistency check ensures the consistency of feature matching.
Finally, based on the matched pixel pairs determined in the adjacent frame images, the final optical flow graph is generated through pixel-level displacement estimation. Assuming that the pixel at position p in the current frame image matches position q*(p) in the next frame image, with the Euclidean distance between them quantifying the displacement, the corresponding displacement vector can be expressed as d(p) = q*(p) − p.
By calculating the displacement vector of every pixel between adjacent frame images, the optical flow graph of the entire current frame image is obtained, and the optical flow vectors between each pair of adjacent frame images are extracted in this way.
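The following sketch illustrates, under stated assumptions, the correlation-based matching, the forward-backward consistency check, and the pixel-level displacement estimation d(p) = q*(p) − p described above; the softmax normalization and the threshold value are illustrative choices, not fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def flow_from_features(feat_cur, feat_nxt, fb_thresh=1.5):
    """feat_cur, feat_nxt: (H, W, D) frame-level feature maps of two adjacent frames.
    Returns a dense displacement field (H, W, 2) and a forward-backward consistency mask."""
    H, W, D = feat_cur.shape
    a = feat_cur.reshape(-1, D)                      # (HW, D)
    b = feat_nxt.reshape(-1, D)
    sim = a @ b.t()                                  # similarity matrix from feature products
    score = F.softmax(sim, dim=1)                    # normalized correlation scores in [0, 1]

    # best-matching position in the next frame for every pixel of the current frame (forward),
    # and the reverse direction (backward) for the consistency check
    fwd = score.argmax(dim=1)
    bwd = F.softmax(sim.t(), dim=1).argmax(dim=1)

    grid_y, grid_x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([grid_x.reshape(-1), grid_y.reshape(-1)], dim=1).float()   # (HW, 2) pixel coords

    matched = pos[fwd]                               # q*(p): coordinates of the best match
    disp = matched - pos                             # displacement vector d(p) = q*(p) - p

    # forward-backward check: mapping p -> q*(p) -> back again should land near p
    round_trip = pos[bwd[fwd]]
    consistent = (round_trip - pos).norm(dim=1) < fb_thresh

    return disp.reshape(H, W, 2), consistent.reshape(H, W)
```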
Further, the self-supervision training is performed on the self-supervision optical flow learning model. Based on the extracted optical flow vector between each pair of adjacent frame images, a forward-backward consistency loss function and an optical flow loss function are constructed, and training of the model is completed through self-supervision training until the loss function is minimum.
In order to ensure the consistency of feature matching during training, the forward-backward consistency loss function is set as L_fb = Σ_p ‖ d_fw(p) + d_bw(p) ‖², where d_fw denotes the displacement vector obtained by forward optical flow estimation and d_bw denotes the displacement vector obtained by reverse (backward) optical flow estimation.
In addition to the forward-backward consistency loss function, the self-supervised training process also uses the squared optical flow error as the optical flow loss function. Specifically, on the basis of the learned optical flow vectors between each pair of adjacent frame images, the optical flow loss is constructed by comparison with reference optical flow vectors between the adjacent frames: L_flow = Σ_p ‖ d(p) − d_HS(p) ‖², where d_HS denotes the optical flow information extracted with the Horn-Schunck method, which serves as the self-supervision signal. The dense optical flow obtained by the Horn-Schunck method requires computing an optical flow value for every pixel and is therefore computationally expensive; through the above approach of this embodiment, optical flow information can be obtained with a smaller amount of computation.
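A minimal sketch of the two training losses described above, assuming a simple mean reduction over pixels and an assumed weighting factor between the two terms:

```python
import torch

def forward_backward_loss(d_fw, d_bw):
    # d_fw, d_bw: (H, W, 2) forward and backward displacement fields;
    # if the matching is consistent, the two displacements should cancel out
    return ((d_fw + d_bw) ** 2).sum(dim=-1).mean()

def optical_flow_loss(d_pred, d_hs):
    # d_pred: predicted flow; d_hs: Horn-Schunck flow used as the self-supervision signal
    return ((d_pred - d_hs) ** 2).sum(dim=-1).mean()

def total_loss(d_fw, d_bw, d_hs, lam=1.0):
    # lam is an illustrative weighting factor, not specified in the embodiment
    return forward_backward_loss(d_fw, d_bw) + lam * optical_flow_loss(d_fw, d_hs)
```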
Step S2.5, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the current target person mask, and outputting the target person position in the next frame image.
In step S3, the target character mask in the predicted next frame image is input into the self-supervision optical flow learning model, continuous iterative tracking prediction is performed until the video is finished, and the moving position and track of the target character in the whole video are output in real time.
As another embodiment, to better train the self-supervised optical flow estimation network, the self-supervised pre-training of the self-supervised optical flow learning model is further included before training the self-supervised optical flow learning model. As shown in fig. 3, the self-supervised pre-training process includes:
firstly, constructing a pre-training data set by video data comprising a continuous frame image sequence;
secondly, each frame image in the pre-training data set is input into the self-supervision optical flow learning model for training; the feature map of each frame image is extracted through the encoder, and, according to the similarity of each pixel between the feature maps of adjacent frames, the pixel feature regions whose scores exceed a threshold are selected and masked out, i.e., all feature values of those regions are set to zero, so that the network pays more attention to important regions and is guided to learn to reconstruct the masked regions. Preferably, the similarity values are sorted and the pixel regions corresponding to the top 50% of similarity values are selected, where the sorting can be approximated with a soft sorting (soft-sort) method;
then, decoding and reconstruction are carried out based on the masked pixel feature regions to restore the original image features as far as possible, the reconstruction objective being to minimize the pixel differences in the masked regions;
finally, the Euclidean distance loss (or another reconstruction loss) between the extracted features and the recovered features is used as the loss function, and iterative training is continued until the loss function reaches its minimum, completing the pre-training of the self-supervision optical flow learning model. The network parameters are optimized by minimizing the reconstruction loss so that the network can recover the original image from the corrupted input. Through this self-supervised pre-training, the network model learns useful representations of the image and generalizes well to reconstruction tasks.
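The masked pre-training step described above can be sketched roughly as follows; the use of cosine similarity as the per-patch similarity measure and the decoder interface are assumptions, while the top-50% masking ratio follows the embodiment.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_step(feat_cur, feat_nxt, decoder, mask_ratio=0.5):
    """One self-supervised pre-training step on a pair of adjacent-frame patch features.
    feat_cur, feat_nxt: (N, D) patch features of two adjacent frames;
    decoder: any module mapping (N, D) -> (N, D), e.g. the attention-based decoder."""
    # per-patch similarity with the adjacent frame (cosine similarity as an assumed measure)
    sim = F.cosine_similarity(feat_cur, feat_nxt, dim=-1)           # (N,)

    # select the top-ranked (highest-similarity, i.e. most important) patches and zero them out
    k = int(mask_ratio * sim.numel())
    masked_idx = sim.topk(k).indices
    corrupted = feat_cur.clone()
    corrupted[masked_idx] = 0.0

    # decode / reconstruct and penalize only the masked regions with a Euclidean (L2) loss
    recon = decoder(corrupted)
    loss = ((recon[masked_idx] - feat_cur[masked_idx]) ** 2).sum(dim=-1).sqrt().mean()
    return loss
```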
The trained self-supervision optical flow learning model is obtained through the above pre-training and training. The initial target person mask and the video data are input into the trained model; the encoder extracts the feature map of each input frame image, the optical flow vectors between adjacent frame images in the video data are predicted based on the feature maps, and according to the predicted optical flow vectors the target person mask in the first frame image is transferred into the next frame image, predicting the target person mask in the next frame image; the decoder reconstructs the image, and the predicted target person position in the next frame image is output.
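To illustrate the inference-time iterative tracking, the sketch below propagates the current target person mask into the next frame with the predicted displacement field and records the person position per frame; `predict_flow` stands in for the trained self-supervision optical flow learning model, and the centroid-based position and helper names are illustrative assumptions.

```python
import numpy as np

def propagate_mask(mask, flow):
    """Transfer the target person mask into the next frame using the predicted flow.
    mask: (H, W) binary mask of the current frame; flow: (H, W, 2) displacement (dx, dy)."""
    H, W = mask.shape
    next_mask = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)                        # foreground pixels of the current frame
    dx = flow[ys, xs, 0].round().astype(int)
    dy = flow[ys, xs, 1].round().astype(int)
    nx = np.clip(xs + dx, 0, W - 1)
    ny = np.clip(ys + dy, 0, H - 1)
    next_mask[ny, nx] = 1                            # foreground moved by its displacement vectors
    return next_mask

def track(frames, init_mask, predict_flow):
    """Iterative tracking loop: predict_flow(frame_t, frame_t1) stands in for the trained model."""
    mask, trajectory = init_mask, []
    for t in range(len(frames) - 1):
        flow = predict_flow(frames[t], frames[t + 1])
        mask = propagate_mask(mask, flow)
        ys, xs = np.nonzero(mask)
        trajectory.append((xs.mean(), ys.mean()) if len(xs) else None)   # person position per frame
    return trajectory
```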
Further, the present embodiment also proposes an interactive tracking update method, that is, in the process of predicting and outputting the target person position in the next frame image by using the self-supervision optical flow learning model, the mask of the target person in the current frame image is corrected and updated by the user clicking operation, and the accurate target person position in the next frame image is predicted and output based on the corrected and updated target person mask. By the interactive tracking and updating mode, updated target person position and mask information are obtained, and accuracy of target tracking is guaranteed.
According to this embodiment, through interactive updating of the optical flow prediction and the person mask, the target person and its position can be continuously adjusted across the continuous frame images, achieving high-precision interactive video person tracking. The output is the position and mask sequence of the final target person, providing the position and trajectory information of the target person throughout the video.
The interactive video character tracking method based on self-supervision optical flow learning provided by the embodiment greatly improves the accuracy of target character tracking. Simulation comparison experiment results are obtained through simulation of the target person tracking model, and are shown in the following table 1.
Table 1 comparison of simulation results of the scheme described in this example with other algorithms
By evaluating the accuracy, overlap rate, and average overlap rate error of each model, it is evident that the scheme of this embodiment achieves higher accuracy, higher overlap rate, and lower average overlap rate error, and is therefore superior. The accuracy is the proportion of persons correctly identified by the model; higher accuracy indicates better person identification. The overlap rate is the degree of overlap between the bounding box predicted by the model and the true bounding box; a higher overlap rate indicates that the prediction is closer to the true bounding box. The average overlap rate error is the average error between the predicted overlap rate and the true overlap rate; a lower error value indicates that the model's prediction is closer to the true value.
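Assuming the overlap rate is computed as intersection-over-union (IoU) between the predicted and true bounding boxes, it can be evaluated as in the following sketch:

```python
def overlap_rate(box_a, box_b):
    # boxes as (x1, y1, x2, y2); overlap rate = intersection area / union area (IoU)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```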
Example two
The embodiment provides an interactive video character tracking system based on self-supervision optical flow learning, which comprises the following components:
the data acquisition module is used for acquiring video data comprising a continuous frame image sequence;
the target person initial position determining module is used for determining the initial position of the target person in the first frame image of the continuous frame image sequence through the click operation of the user based on the video data;
the target person tracking module is used for generating a target person initial mask based on the target person initial position, inputting the target person initial mask and video data into the self-supervision optical flow learning model which is pre-trained, predicting optical flow vectors between adjacent frame images in the video data, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the target person initial mask, and outputting the target person position in the next frame image; and inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished.
The system according to this embodiment further includes:
and the target person tracking interaction module is used for correcting and updating the mask of the target person in the current frame image through a user clicking operation in the process of predicting and outputting the position of the target person in the next frame image by utilizing the self-supervision optical flow learning model, and predicting and outputting the accurate position of the target person in the next frame image based on the corrected and updated target person mask.
The steps involved in the second embodiment correspond to those of the first embodiment of the method, and the detailed description of the second embodiment can be found in the related description section of the first embodiment.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computing means; alternatively, they may be implemented by program code executable by computing means, so that they may be stored in storage means for execution by the computing means, or they may each be made into individual integrated circuit modules, or a plurality of the modules or steps may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (6)

1. An interactive video character tracking method based on self-supervision optical flow learning is characterized by comprising the following steps:
acquiring video data comprising a continuous frame image sequence, and determining the initial position of a target person in a first frame image of the continuous frame image sequence through clicking operation of a user;
generating an initial target character mask based on the initial target character position, inputting the initial target character mask and video data into a pre-trained self-supervision optical flow learning model, predicting optical flow vectors between adjacent frame images in the video data, predicting the target character mask in the next frame image according to the predicted optical flow vectors and the initial target character mask, and outputting the target character position in the next frame image;
inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished;
the self-supervision optical flow learning model adopts an encoder-decoder structure; the encoder comprises a plurality of attention layers, wherein each attention layer comprises a multi-head self-attention layer and a multi-head cross-attention layer and is used for extracting image features and generating a feature map; the decoder also includes a plurality of attention layers, each including a multi-headed self-attention layer and a multi-headed cross-attention layer for decoding and feature reconstruction of the feature map;
the training process of the self-supervision optical flow learning model comprises the following steps:
constructing a training data set by using video data which comprises a continuous frame image sequence and each frame image is marked with a target character frame;
pre-training a self-supervision optical flow learning model based on the training data set;
inputting each frame image in the continuous frame image sequence into a self-supervision optical flow learning model, converting a first frame image in the continuous frame image sequence into a binary mask image, and generating a target character initial mask according to the initial position of the target character;
extracting a feature map of each frame of image based on the self-supervision optical flow learning model which is already pre-trained, carrying out feature alignment on the feature maps of the adjacent frame of images, generating an optical flow graph, and extracting optical flow vectors between each pair of adjacent frame of images; the extracting the feature map of each frame of image comprises the following steps:
dividing each frame of image into a plurality of non-overlapping image blocks with fixed sizes, extracting features of the non-overlapping image blocks of each frame of image by using an encoder, and extracting feature images of each image block subjected to space globalization and time relevance modeling; re-splicing the feature images of each image block according to the inverse process of image block division to obtain a complete frame-level feature image of each frame image;
based on the extracted optical flow vector between each pair of adjacent frame images, a forward-backward consistency loss function and an optical flow loss function are constructed, and training of the model is completed through self-supervision training until the loss function is minimum.
2. The interactive video person tracking method based on self-supervision optical flow learning according to claim 1, wherein in predicting and outputting the target person position in the next frame image using the self-supervision optical flow learning model, the mask of the target person in the current frame image is corrected and updated by the user click operation, and the accurate target person position in the next frame image is predicted and output based on the corrected and updated target person mask.
3. The interactive video character tracking method based on self-supervised optical flow learning of claim 1, further comprising a self-supervised pre-training process for the self-supervised optical flow learning model prior to training the self-supervised optical flow learning model, comprising:
constructing a pre-training data set with video data comprising a sequence of successive frame images;
inputting each frame of image in the pre-training data set into a self-supervision optical flow learning model for training, extracting a characteristic image of each frame of image by encoding, selecting a pixel characteristic region with a score exceeding a threshold value according to the similarity of pixels in the characteristic images of adjacent frames, shielding the pixel characteristic region, and then carrying out decoding reconstruction based on the shielded pixel characteristic region to recover the original image characteristics;
and taking Euclidean distance loss between the extracted features and the recovered features as a loss function, and performing continuous iterative training until the loss function is minimum, thereby completing the pre-training of the self-supervision optical flow learning model.
4. The interactive video person tracking method based on self-supervised optical flow learning of claim 1, wherein feature alignment of feature maps of adjacent frame images to generate an optical flow map comprises:
calculating a similarity matrix between two adjacent frame image feature images;
normalizing the similarity matrix to obtain a correlation score between each pixel point of the two adjacent frame image feature images;
for each pixel point, searching the position of the pixel point which is most matched with the pixel point in the adjacent frame image according to the correlation score, and carrying out forward and backward consistency check by utilizing the correlation score;
and generating a final light flow graph through displacement estimation at a pixel level based on the matched pixel point pairs in the determined adjacent frame images.
5. An interactive video character tracking system based on self-supervised optical flow learning, comprising:
the data acquisition module is used for acquiring video data comprising a continuous frame image sequence;
the target person initial position determining module is used for determining the initial position of the target person in the first frame image of the continuous frame image sequence through the click operation of the user based on the video data;
the target person tracking module is used for generating a target person initial mask based on the target person initial position, inputting the target person initial mask and video data into the self-supervision optical flow learning model which is pre-trained, predicting optical flow vectors between adjacent frame images in the video data, predicting the target person mask in the next frame image according to the predicted optical flow vectors and the target person initial mask, and outputting the target person position in the next frame image; inputting the target character mask in the predicted next frame image into a self-supervision optical flow learning model, carrying out continuous iterative tracking prediction, and outputting the moving position and track of the target character in the whole video in real time until the video is finished;
the self-supervision optical flow learning model adopts an encoder-decoder structure; the encoder comprises a plurality of attention layers, wherein each attention layer comprises a multi-head self-attention layer and a multi-head cross-attention layer and is used for extracting image features and generating a feature map; the decoder also includes a plurality of attention layers, each including a multi-headed self-attention layer and a multi-headed cross-attention layer for decoding and feature reconstruction of the feature map;
the training process of the self-supervision optical flow learning model comprises the following steps:
constructing a training data set by using video data which comprises a continuous frame image sequence and each frame image is marked with a target character frame;
pre-training a self-supervision optical flow learning model based on the training data set;
inputting each frame image in the continuous frame image sequence into a self-supervision optical flow learning model, converting a first frame image in the continuous frame image sequence into a binary mask image, and generating a target character initial mask according to the initial position of the target character;
extracting a feature map of each frame of image based on the self-supervision optical flow learning model which is already pre-trained, carrying out feature alignment on the feature maps of the adjacent frame of images, generating an optical flow graph, and extracting optical flow vectors between each pair of adjacent frame of images; the extracting the feature map of each frame of image comprises the following steps:
dividing each frame of image into a plurality of non-overlapping image blocks with fixed sizes, extracting features of the non-overlapping image blocks of each frame of image by using an encoder, and extracting feature images of each image block subjected to space globalization and time relevance modeling; re-splicing the feature images of each image block according to the inverse process of image block division to obtain a complete frame-level feature image of each frame image;
based on the extracted optical flow vector between each pair of adjacent frame images, a forward-backward consistency loss function and an optical flow loss function are constructed, and training of the model is completed through self-supervision training until the loss function is minimum.
6. The self-supervised optical flow learning based interactive video character tracking system as recited in claim 5, further comprising:
and the target person tracking interaction module is used for correcting and updating the mask of the target person in the current frame image through a user clicking operation in the process of predicting and outputting the position of the target person in the next frame image by utilizing the self-supervision optical flow learning model, and predicting and outputting the accurate position of the target person in the next frame image based on the corrected and updated target person mask.
CN202311694258.1A 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning Active CN117392180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311694258.1A CN117392180B (en) 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311694258.1A CN117392180B (en) 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning

Publications (2)

Publication Number Publication Date
CN117392180A CN117392180A (en) 2024-01-12
CN117392180B true CN117392180B (en) 2024-03-26

Family

ID=89467012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311694258.1A Active CN117392180B (en) 2023-12-12 2023-12-12 Interactive video character tracking method and system based on self-supervision optical flow learning

Country Status (1)

Country Link
CN (1) CN117392180B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
CN115147457A (en) * 2022-07-08 2022-10-04 河南大学 Memory enhanced self-supervision tracking method and device based on space-time perception
CN115375732A (en) * 2022-08-18 2022-11-22 南京邮电大学 Unsupervised target tracking method and system based on module migration
CN115393396A (en) * 2022-08-18 2022-11-25 西安电子科技大学 Unmanned aerial vehicle target tracking method based on mask pre-training
WO2023284341A1 (en) * 2021-07-15 2023-01-19 北京小蝇科技有限责任公司 Deep learning-based context-sensitive detection method for urine formed element
CN116310971A (en) * 2023-03-03 2023-06-23 长春理工大学 Unsupervised target tracking method based on sparse attention updating template features
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739078B (en) * 2020-06-15 2022-11-18 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
WO2023284341A1 (en) * 2021-07-15 2023-01-19 北京小蝇科技有限责任公司 Deep learning-based context-sensitive detection method for urine formed element
CN115147457A (en) * 2022-07-08 2022-10-04 河南大学 Memory enhanced self-supervision tracking method and device based on space-time perception
CN115375732A (en) * 2022-08-18 2022-11-22 南京邮电大学 Unsupervised target tracking method and system based on module migration
CN115393396A (en) * 2022-08-18 2022-11-25 西安电子科技大学 Unmanned aerial vehicle target tracking method based on mask pre-training
CN116310971A (en) * 2023-03-03 2023-06-23 长春理工大学 Unsupervised target tracking method based on sparse attention updating template features
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Kejing; Sun Fengmei. A target tracking algorithm based on deep belief networks. Electronic Design Engineering, 2018, No. 11 (full text). *

Also Published As

Publication number Publication date
CN117392180A (en) 2024-01-12

Similar Documents

Publication Publication Date Title
Wang et al. Multi-view stereo in the deep learning era: A comprehensive review
CN112767554B (en) Point cloud completion method, device, equipment and storage medium
Kim et al. Recurrent temporal aggregation framework for deep video inpainting
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113592913B (en) Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
Oh et al. Space-time memory networks for video object segmentation with user guidance
Xue et al. ECANet: Explicit cyclic attention-based network for video saliency prediction
Zhang et al. ReX-Net: A reflectance-guided underwater image enhancement network for extreme scenarios
CN113066034A (en) Face image restoration method and device, restoration model, medium and equipment
CN113808047A (en) Human motion capture data denoising method
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
Zhou et al. Transformer-based multi-scale feature integration network for video saliency prediction
WO2023015414A1 (en) Method for eliminating uncertainty in self-supervised three-dimensional reconstruction
CN114240811A (en) Method for generating new image based on multiple images
Zhou et al. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model
CN111738092B (en) Method for recovering occluded human body posture sequence based on deep learning
Zhang et al. Multi-scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction
CN117392180B (en) Interactive video character tracking method and system based on self-supervision optical flow learning
Su et al. Physical model and image translation fused network for single-image dehazing
Wang et al. Temporal consistent portrait video segmentation
CN116978057A (en) Human body posture migration method and device in image, computer equipment and storage medium
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
Li et al. PFONet: A Progressive Feedback Optimization Network for Lightweight Single Image Dehazing
Wang et al. Camera Parameters Aware Motion Segmentation Network with Compensated Optical Flow
Cao et al. Video object detection algorithm based on dynamic combination of sparse feature propagation and dense feature aggregation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant