CN114596312B - Video processing method and device - Google Patents

Video processing method and device

Info

Publication number
CN114596312B
Authority
CN
China
Prior art keywords
image
encoder
video
images
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210491668.5A
Other languages
Chinese (zh)
Other versions
CN114596312A (en)
Inventor
乔宇
何军军
宋迪平
邹静
周蔚
李英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Union Shenzhen Hospital of Huazhong University of Science and Technology
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Union Shenzhen Hospital of Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS, Union Shenzhen Hospital of Huazhong University of Science and Technology filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210491668.5A priority Critical patent/CN114596312B/en
Publication of CN114596312A publication Critical patent/CN114596312A/en
Application granted granted Critical
Publication of CN114596312B publication Critical patent/CN114596312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/003 Navigation within 3D models or images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Remote Sensing (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video processing method and a video processing device. The method comprises the following steps: constructing a label-free data set from three types of images, namely medical images, endoscopic surgery videos and natural images; pre-training a transfer learning model with minimization of a set loss function as the target, wherein the transfer learning model comprises an encoder and a decoder, the encoder takes serialized transformed images from the label-free data set as input and learns a common knowledge representation of the three types of images, and the decoder obtains a reconstructed image from the output features of the encoder; and migrating the pre-trained encoder to a video understanding model to detect and segment the objects in a target endoscopic surgery video. The invention can handle a variety of complex tasks, such as blind video fidelity enhancement and video understanding, and can be migrated and applied to a variety of scenarios.

Description

Video processing method and device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a video processing method and apparatus.
Background
Video enhancement technology can be applied in many fields, such as security monitoring, traffic and medical image processing. For example, in the clinical setting of endoscopic surgery, precise identification of key objects in the surgical field, such as surgical instruments, diseased regions and tissues, is critical to the physician's judgment and operation. In endoscopic surgery, however, owing to the limited access to and visibility of the target tissue and the complex structure of the abdominal cavity, partially hidden structures are often hard to predict and to detect in time, such as thermal damage to the retroperitoneal ureters. At present, such judgments usually rely on the physician's personal experience, and when a physician encounters a complex situation that cannot be handled during the operation, corresponding intraoperative guidance and support cannot be obtained in time.
Medical imaging is widely used in clinical diagnosis and surgery. For example, fundus color photography is commonly used to screen for fundus diseases, while three-dimensional CT (computed tomography) and MRI (magnetic resonance imaging) are commonly used to observe the position, texture and morphology of lesions, to accurately delineate lesions, tumors, organs and tissues, and for quantitative lesion analysis, precise surgical navigation, radiotherapy planning, and the like. Conventional medical image detection and segmentation research has focused on still images such as CT, MRI and X-ray. With the development of video analysis technology, dynamic surgical video analysis has attracted increasing attention; for example, three-dimensional structures can be reconstructed from moving gastroscopic image sequences.
In the prior art, there is relatively little research on video analysis and understanding for endoscopic surgery, owing to the lack of high-quality data sets and annotations. Moreover, given the particularity and complexity of endoscopic surgery and the requirement of real-time analysis, existing video processing and understanding methods have difficulty achieving satisfactory results in the surgical scene.
In summary, in endoscopic surgery the complex dynamic changes of the key objects pose great challenges to fine-grained visual understanding. On the one hand, traditional techniques depend heavily on the scale of the training data and impose high labeling requirements, whereas in the endoscopic surgery scene the sample size is small and pixel-level annotation of key objects, such as surgical instruments, lesion areas and key organs, is time-consuming. On the other hand, because of differences among patients and instruments, the surgical field scene is complex and changeable, which further increases the difficulty of video understanding.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a video processing method and apparatus.
According to a first aspect of the invention, a video processing method is provided. The method comprises the following steps:
constructing a label-free data set by using three types of images including medical images, endoscopic surgery videos and natural images;
pre-training a transfer learning model with minimization of a set loss function as the target, wherein the transfer learning model comprises an encoder and a decoder, the encoder takes serialized transformed images from the label-free data set as input and learns a common knowledge representation of the three types of images, and the decoder obtains a reconstructed image from the output features of the encoder;
and migrating the pre-trained encoder to a video understanding model to detect and segment the object in the target endoscopic surgery video.
According to a second aspect of the present invention, there is provided a video processing apparatus. The device includes:
a data acquisition unit: used for constructing a label-free data set from three types of images, namely medical images, endoscopic surgery videos and natural images;
a pre-training unit: used for pre-training a transfer learning model with minimization of a set loss function as the target, wherein the transfer learning model comprises an encoder and a decoder, the encoder takes serialized transformed images from the label-free data set as input and learns a common knowledge representation of the three types of images, and the decoder obtains a reconstructed image from the output features of the encoder;
a transfer learning unit: used for migrating the pre-trained encoder to a video understanding model to detect and segment the objects in the target endoscopic surgery video.
Compared with the prior art, the invention provides an efficient video understanding and processing method that can perform video enhancement in many fields, such as endoscopic surgery navigation, and realizes recognition and understanding of video content and events, so that key objects and processes during surgery can be monitored and flagged.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram of a video processing method according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a self-supervised transfer learning model framework, according to one embodiment of the present invention;
fig. 3 is a schematic diagram of a video detection segmentation network according to one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the following, enhancement of endoscopic surgery video is described as an example; the aim is to achieve accurate detection and segmentation of key surgical objects under small-sample, weak-annotation conditions, so as to recognize and understand the video more reliably and comprehensively. Referring to fig. 1, the provided video processing method includes the following steps.
Step S110: construct a label-free database from multi-source data.
A deep learning model with good performance and strong generalization usually depends on a large number of labeled training samples, and annotation for endoscopic surgery currently requires experienced physicians, so the labeling cost is high and large amounts of high-quality labeled data are difficult to obtain.
In one embodiment of the invention, to enable the subsequent self-supervised training, a label-free database is constructed from multi-source data. The multi-source data comprise natural images, medical images, endoscopic surgery videos and the like: endoscopic surgery video is closer to natural images in modality, while the information it carries is more related to medical images. Considering that large numbers of natural images can be collected from the internet and that abundant public medical image data are available, the cost of collecting samples is reduced. In this way, millions of multi-source unannotated samples can be collected.
Step S120: pre-train a self-supervised transfer learning model on the label-free data to obtain a trained encoder.
Because natural images and endoscopic surgery video share modal similarity, while medical images and endoscopic surgery video share structural similarity, in the pre-training stage self-supervised learning on natural images can learn image priors, and self-supervised learning on medical images can learn prior knowledge of medicine and human anatomy. Furthermore, in the pre-training stage, a variety of transformation proxy tasks (or simply, proxy tasks) are designed to learn the intrinsic characteristics of the images and the relations of human anatomy in a self-supervised manner.
Specifically, referring to fig. 2, the provided self-supervised transfer learning model framework based on restoration learning includes: a random sampling and cropping process, self-supervised transformation proxy tasks, and an image restoration encoder and decoder.
First, in the random sampling and cropping process, samples from the constructed label-free database are randomly sampled and cropped to obtain an original input image, denoted X.
Then, the self-supervised transformation proxy task applies a variety of transformations to the original input image X to obtain a transformed image. For example, X is transformed with a certain probability by three families of transformations: distribution-based transformations (e.g., non-linear enhancement, local scrambling), painting-based transformations (e.g., inpainting, outpainting), and mask-based transformations (e.g., masked autoencoding, MAE). The transformed image $\tilde{X}$ can be described as:

$$\tilde{X} = \begin{cases} \mathrm{transform}(X), & p < \mathrm{threshold} \\ X, & \text{otherwise} \end{cases} \qquad (1)$$

where p is a floating-point number drawn at random from $[0, 1]$ and threshold is the probability threshold of the random data transformation, which can be set according to the required training precision and efficiency. In the self-supervised proxy task, the three transformation proxy tasks are randomly combined with a certain probability; combining different image transformations constructs a challenging self-supervised learning proxy task for the encoder-decoder network. The restoration network reconstructs the original input image from the transformed image, guiding the network to learn effective features and priors over images and their content.
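As a concrete illustration, the following is a minimal sketch of the probabilistic proxy transformation of equation (1). The patent names the three transformation families but not their implementations, so the function bodies below are assumptions; only the thresholded random selection follows equation (1) directly.

```python
import random
import numpy as np

def distribution_transform(x: np.ndarray) -> np.ndarray:
    """Distribution-based transform: a random non-linear intensity mapping."""
    return np.clip(x, 0.0, 1.0) ** random.uniform(0.5, 2.0)

def painting_transform(x: np.ndarray) -> np.ndarray:
    """Painting-based transform: blank a random inner patch (an inpainting target)."""
    h, w = x.shape[:2]
    y0, x0 = random.randrange(h // 2), random.randrange(w // 2)
    out = x.copy()
    out[y0:y0 + h // 4, x0:x0 + w // 4] = 0.0
    return out

def mask_transform(x: np.ndarray, patch: int = 16, ratio: float = 0.75) -> np.ndarray:
    """Mask-based transform (MAE-style): randomly mask a high ratio of patches."""
    out = x.copy()
    h, w = x.shape[:2]
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            if random.random() < ratio:
                out[i:i + patch, j:j + patch] = 0.0
    return out

TRANSFORMS = [distribution_transform, painting_transform, mask_transform]

def proxy_transform(x: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Eq. (1): each transform fires only when a random p in [0, 1) is below threshold."""
    out = x
    for transform in TRANSFORMS:
        p = random.random()
        if p < threshold:
            out = transform(out)
    return out
```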
Next, the transformed image $\tilde{X}$ is fed to the image restoration encoder-decoder, which reconstructs the image and outputs the reconstructed image $\hat{X}$. For example, a Transformer-based image restoration encoder-decoder or another type of encoder-decoder may be employed. For clarity, the following description takes a Transformer-based image restoration encoder-decoder as an example.
In the pre-training of the self-supervised transfer learning model, the loss function reflecting the reconstruction quality is, for example, the MSE (Mean Squared Error) loss, expressed as:

$$L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \hat{X}_i \right)^2 \qquad (2)$$

where n denotes the number of samples, $X_i$ the i-th original input image, and $\hat{X}_i$ the corresponding reconstructed image.
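A minimal sketch of this objective, assuming `encoder` and `decoder` are torch.nn.Module instances (their architecture is not fixed here):

```python
import torch.nn.functional as F

def reconstruction_loss(encoder, decoder, x, x_transformed):
    features = encoder(x_transformed)   # common knowledge representation
    x_hat = decoder(features)           # reconstructed image
    return F.mse_loss(x_hat, x)         # mean squared error over all samples, Eq. (2)
```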
The embodiment of fig. 2 adopts self-supervised learning based on restoration learning; preferably, self-supervised learning based on the contrastive learning paradigm can also be combined with it. The core of contrastive learning lies in the construction of positive sample pairs. Based on the endoscopic surgery recognition task and the collected label-free multi-source data set, positive sample pairs can be constructed in three forms: samples of the same modality as a positive pair; samples with the same characteristics (content, modality) as a positive pair; and samples obtained by applying different data augmentations (such as flipping, noise perturbation and stretching) to the same picture as a positive pair.
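The patent does not fix the contrastive objective itself; as a sketch under that assumption, the following applies a simplified InfoNCE (NT-Xent) loss to the third form of positive pair, two differently augmented views of the same images:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: [B, D] embeddings of two differently augmented views of the same images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                            # [B, B] cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal = positive pairs
    return F.cross_entropy(logits, targets)
```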
Through the pre-training process, an encoder with general representation capability is obtained, having learned image priors together with prior knowledge of medicine and human anatomy.
Step S130: migrate the pre-trained encoder to the video understanding model to detect and segment the key objects.
In this step, the pre-trained encoder parameters are migrated to the downstream tasks of endoscopic surgery video understanding, including detection and segmentation of key surgical objects and recognition of surgical procedures and events. Thanks to the massive unlabeled multi-source data and the multiple complex self-supervised proxy tasks, the pre-trained encoder learns a general knowledge representation; for a specific downstream task, a task-specific structure is designed around the pre-trained encoder, for example introducing more image priors and multi-scale features, and replacing global attention with local attention to improve computational efficiency. By migrating the general knowledge, the accuracy and robustness of the model on small-sample data are improved.
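A minimal sketch of this parameter migration, assuming the pre-trained encoder and the downstream model's encoder share an architecture; the toy Encoder below is a placeholder, not the patent's Transformer encoder:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

pretrained_encoder = Encoder()
# ... self-supervised pre-training on the label-free data would happen here ...

downstream_encoder = Encoder()
downstream_encoder.load_state_dict(pretrained_encoder.state_dict())  # migrate weights

# Optionally freeze the migrated weights while task-specific heads are trained:
for p in downstream_encoder.parameters():
    p.requires_grad = False
```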
In view of the varied and complex key targets set against complex backgrounds in endoscopic surgery, in one embodiment a video understanding model (also referred to as a detection-segmentation model) that considers temporally multi-scale and spatially multi-scale features is designed. As shown in fig. 3, the model mainly comprises a Transformer-based encoder, a multi-scale feature adapter, a spatio-temporal multi-scale attention module, a pixel decoder, a cross-scale attention decoder and a multi-layer perceptron.
The Transformer-based encoder is the encoder migrated from step S120; it effectively extracts spatial multi-scale information of the target object and merges features of the current time with those of historical times.
Specifically, for the current time T, the historical information of the previous m times is sampled at a certain interval to form an image sequence containing m+1 video frames. To better capture spatial multi-scale information, the encoder takes the image sequence as input, and the features extracted at different encoding stages are fed into the multi-scale feature adapter to obtain a feature pyramid for each moment; the pyramid features at the different resolutions are then flattened and concatenated to obtain the spatial multi-scale features of each moment.
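A minimal sketch of the temporal sampling and pyramid flattening; the sampling interval s, history length m, and tensor shapes are illustrative assumptions:

```python
import torch

def sample_frame_indices(t: int, m: int, s: int) -> list:
    """Indices of the current frame T and the previous m frames, at interval s."""
    return [max(t - i * s, 0) for i in range(m, -1, -1)]  # m+1 indices, oldest first

def flatten_pyramid(pyramid: list) -> torch.Tensor:
    """Flatten each [C, H_l, W_l] pyramid level to [H_l*W_l, C] tokens and concatenate."""
    tokens = [level.flatten(1).transpose(0, 1) for level in pyramid]
    return torch.cat(tokens, dim=0)  # [sum_l H_l*W_l, C] spatial multi-scale features
```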
To strengthen the interaction among spatio-temporal features, the multi-scale image features of the multiple video frames are fused by concatenation and used as the input of the spatio-temporal multi-scale attention module; through a global and local adaptive spatio-temporal attention mechanism, important spatio-temporal information is mined and the motion of the target object is modeled, yielding the spatio-temporal fusion features at time T. The spatio-temporal fusion features are then input to the pixel decoder, which decodes a feature pyramid fusing the spatio-temporal information.
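A minimal sketch of the concatenation-and-attention fusion; a plain global nn.MultiheadAttention stands in for the global and local adaptive mechanism, whose exact form the patent leaves open:

```python
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens: list) -> torch.Tensor:
        # frame_tokens: m+1 tensors of shape [B, L, C], one per sampled frame.
        x = torch.cat(frame_tokens, dim=1)  # concatenate along tokens: [B, (m+1)*L, C]
        fused, _ = self.attn(x, x, x)       # self-attention mines spatio-temporal links
        return fused                        # spatio-temporal fusion features at time T
```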
The cross-scale attention decoder takes the spatio-temporal feature pyramid and a learnable global embedding as input and predicts N object features (N may be 1 or more). These are fed into the multi-layer perceptron, which predicts N mask embeddings together with N instance bounding boxes and categories; a convolution of the highest-resolution feature of the feature pyramid with the mask embeddings then yields the detection and segmentation result at time T. Since the scale of the lesion region and key organs changes as the endoscope moves, scale information such as the diameter and area of the lesion region is predicted from the detection-segmentation result, using a surgical instrument as the scale reference object.
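A minimal sketch of the final mask prediction step: the N mask embeddings are combined with the highest-resolution feature map by a per-pixel dot product (equivalent to a 1x1 convolution with each embedding), as in query-based segmentation heads. All tensor shapes are assumptions.

```python
import torch

def predict_masks(mask_embed: torch.Tensor, hires_feat: torch.Tensor) -> torch.Tensor:
    """mask_embed: [N, C]; hires_feat: [C, H, W] -> N soft instance masks [N, H, W]."""
    return torch.einsum("nc,chw->nhw", mask_embed, hires_feat).sigmoid()
```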
In summary, by collecting million-scale multi-source unlabeled data, the invention uses self-supervised learning to build a foundation model (namely the encoder) with a general knowledge representation; by way of transfer learning, this self-supervised pre-trained foundation model carries the general knowledge over to fine-grained understanding of endoscopic surgery, achieving rapid construction of a small-sample, high-precision detection and segmentation model.
Accordingly, the present invention also provides a video processing apparatus for implementing one or more aspects of the above method. For example, the apparatus includes: a data acquisition unit, used for constructing a label-free data set from three types of images, namely medical images, endoscopic surgery videos and natural images; a pre-training unit, used for pre-training a transfer learning model with minimization of a set loss function as the target, wherein the transfer learning model comprises an encoder and a decoder, the encoder takes serialized transformed images from the label-free data set as input and learns a common knowledge representation of the three types of images, and the decoder obtains a reconstructed image from the output features of the encoder; and a transfer learning unit, used for migrating the pre-trained encoder to the video understanding model to detect and segment the objects in the target endoscopic surgery video. The units of the apparatus can be realized by a processor, an FPGA or other dedicated hardware.
The model training process of the invention can be carried out offline in a server or in the cloud; the trained model (such as the encoder) is embedded in an electronic device and combined into a video understanding model designed for the actual task, so that the enhanced video can be displayed in real time. The electronic device may be a terminal device or a server. Terminal devices include mobile phones, tablet computers, personal digital assistants (PDA), point-of-sale (POS) terminals, smart wearable devices (smart watches, virtual reality glasses, virtual reality helmets, etc.) and similar devices. The server includes but is not limited to an application server or a Web server, and may be a stand-alone server, a cluster server, a cloud server, or the like. For example, in actual model deployment, a video acquisition device captures the target video and transmits it to the electronic device; the video understanding model is then used to display, in real time, video that is enhanced relative to the captured video, so as to assist the physician in completing the surgical procedure smoothly.
In summary, compared with the prior art, the invention has the following advantages:
1) A detection and segmentation network oriented to complex endoscopic surgery scenes is provided: a multi-scale network fuses spatio-temporal context information, a self-supervised detection and segmentation model based on contrastive learning and restoration learning is constructed, and uncertainty is detected and analyzed, so that key objects in surgery can be recognized and understood more reliably and comprehensively. The image-enhancement capability learned by the network is reflected in the quality and fidelity of the reconstructed image and its consistency with the original input image, while the learned general knowledge representation capability is reflected in the efficient, rapid construction of downstream small-sample tasks.
2) Through random combination of multiple transformation proxy tasks, more challenging proxy tasks can be constructed, so that the restoration network learns features with stronger representation power and knowledge with better generality. The encoder of the restoration network is migrated to the few-sample endoscopic surgery model, and with transfer learning on the downstream task, rapid construction of a fine-grained endoscopic surgery understanding model can be completed with only a small amount of precisely annotated endoscopic surgery video.
3) The number of clinical endoscopic surgeries is huge. The invention can intelligently analyze massive clinical endoscopic surgery data, enabling faster and more accurate surgical operation with the assistance of artificial intelligence and allowing experience and knowledge to be shared among different physicians, thereby reducing surgical risks and complications and benefiting patients.
4) Aimed at the actual requirements of gynecological endoscopic surgery, and in contrast to traditional single models that address only a single research task, the invention systematically tackles bottleneck problems such as small samples, generalization and efficiency; it can handle a variety of complex tasks, such as blind video fidelity enhancement and surgical video understanding, and can be migrated and applied to a variety of scenarios such as other endoscopic surgeries.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A video processing method, comprising the steps of:
constructing a label-free data set by using three types of images including medical images, endoscopic surgery videos and natural images;
pre-training a transfer learning model with minimization of a set loss function as the target, wherein the transfer learning model comprises an encoder and a decoder, the encoder takes serialized transformed images from the label-free data set as input and learns a common knowledge representation of the three types of images, and the decoder obtains a reconstructed image from the output features of the encoder;
migrating the pre-trained encoder to a video understanding model to detect and segment the object in the target endoscopic surgery video.
2. The video processing method of claim 1, wherein the transfer learning model is pre-trained according to the following steps:
constructing a label-free data set containing multi-modal image samples from the medical images, the endoscopic surgery videos and the natural images;
randomly sampling and cropping the image samples in the label-free data set to obtain an input image;
designing self-supervised proxy tasks, and obtaining serialized transformed images by applying a variety of transformations to the input image;
taking the serialized transformed image as the input of the encoder, the decoder outputting a reconstructed image.
3. The video processing method of claim 2, wherein the input to the encoder is obtained according to the following steps:
randomly sampling and cropping the image samples in the label-free data set to obtain an input image X;
applying a plurality of transformations to X with a set probability threshold to obtain a transformed image $\tilde{X}$, the transformation process being represented as:
$$\tilde{X} = \begin{cases} \mathrm{transform}(X), & p < \mathrm{threshold} \\ X, & \text{otherwise} \end{cases}$$
where p is a floating-point number drawn at random from $[0, 1]$, threshold is the set probability threshold, and transform denotes the transformation;
inputting the transformed image $\tilde{X}$ to the encoder, and outputting from the decoder a reconstructed image $\hat{X}$.
4. The video processing method of claim 3, wherein the plurality of transforms includes distribution-based transforms, painting-based transforms, and mask-based transforms.
5. The video processing method of claim 1, wherein the video understanding model comprises a migrated encoder, a multi-scale feature adapter, a spatiotemporal multi-scale attention module, a pixel decoder, a cross-scale attention decoder, and a multi-layer perceptron, and performs the following processes:
for the current time T, sampling historical information of previous m times at certain intervals to form an image sequence containing m +1 video frames, taking the image sequence as input by the encoder, extracting features of different encoding stages, inputting the features into the multi-scale feature adapter to obtain a feature pyramid of a plurality of times, flattening a plurality of features of the feature pyramid with different resolutions, and splicing to obtain spatial multi-scale features of each time;
fusing multi-scale image characteristics of a plurality of video frames in a splicing mode, taking the fused multi-scale image characteristics as input of the space-time multi-scale attention module, mining space-time information through a global and local self-adaptive space-time attention mechanism, modeling motion change of a target object, and obtaining space-time fusion characteristics at the time T;
inputting the space-time fusion feature into the pixel decoder, and decoding a feature pyramid fusing space-time information;
and the cross-scale attention decoder takes the feature pyramid fused with the spatio-temporal information and a learnable global embedding as input, predicts N object features, inputs the N object features into the multi-layer perceptron, predicts corresponding mask embeddings and instance bounding boxes and categories, and performs a convolution operation on the highest-resolution feature of the feature pyramid fused with the spatio-temporal information with the mask embeddings, thereby obtaining the detection and segmentation result at time T.
6. The video processing method according to claim 1, wherein the pre-training process of the transfer learning model further comprises self-supervised learning based on a contrastive learning paradigm, and the positive sample pairs for the contrastive learning paradigm comprise: samples of the same modality as a positive sample pair; samples with the same characteristics as a positive sample pair; and samples obtained by applying different data augmentations to the same picture as a positive sample pair.
7. The video processing method of claim 1, wherein the loss function is a mean square error loss function reflecting a loss between the input image and the reconstructed image.
8. The video processing method according to claim 1, wherein the transfer learning model is constructed based on a Transformer.
9. A video processing apparatus comprising:
a data acquisition unit: used for constructing a label-free data set from three types of images, namely medical images, endoscopic surgery videos and natural images;
a pre-training unit: used for pre-training a transfer learning model with minimization of a set loss function as the target, wherein the transfer learning model comprises an encoder and a decoder, the encoder takes serialized transformed images from the label-free data set as input and learns a common knowledge representation of the three types of images, and the decoder obtains a reconstructed image from the output features of the encoder;
a transfer learning unit: the method is used for migrating the pre-trained encoder to a video understanding model so as to detect and segment the object in the target endoscopic surgery video.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program realizes the steps of the method according to any one of claims 1 to 8 when executed by a processor.
CN202210491668.5A 2022-05-07 2022-05-07 Video processing method and device Active CN114596312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210491668.5A CN114596312B (en) 2022-05-07 2022-05-07 Video processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210491668.5A CN114596312B (en) 2022-05-07 2022-05-07 Video processing method and device

Publications (2)

Publication Number Publication Date
CN114596312A (en) 2022-06-07
CN114596312B (en) 2022-08-02

Family

ID=81812466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210491668.5A Active CN114596312B (en) 2022-05-07 2022-05-07 Video processing method and device

Country Status (1)

Country Link
CN (1) CN114596312B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115918B (en) * 2022-06-16 2024-05-31 上海人工智能创新中心 Visual learning method based on multi-knowledge fusion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443899A (en) * 2022-01-28 2022-05-06 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210343014A1 (en) * 2020-04-30 2021-11-04 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for the use of transferable visual words for ai models through self-supervised learning in the absence of manual labeling for the processing of medical imaging
CN112016682B (en) * 2020-08-04 2024-01-26 杰创智能科技股份有限公司 Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN112069921A (en) * 2020-08-18 2020-12-11 浙江大学 Small sample visual target identification method based on self-supervision knowledge migration
CN113380237A (en) * 2021-06-09 2021-09-10 中国科学技术大学 Unsupervised pre-training speech recognition model for enhancing local dependency relationship and training method
CN113539402B (en) * 2021-07-14 2022-04-01 广州柏视医疗科技有限公司 Multi-mode image automatic sketching model migration method
CN113486833B (en) * 2021-07-15 2022-10-04 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN113610021A (en) * 2021-08-12 2021-11-05 上海明略人工智能(集团)有限公司 Video classification method and device, electronic equipment and computer-readable storage medium
CN113469289B (en) * 2021-09-01 2022-01-25 成都考拉悠然科技有限公司 Video self-supervision characterization learning method and device, computer equipment and medium
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning
CN114418069B (en) * 2022-01-19 2024-06-14 腾讯科技(深圳)有限公司 Encoder training method, encoder training device and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443899A (en) * 2022-01-28 2022-05-06 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium

Also Published As

Publication number Publication date
CN114596312A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN110914866B (en) System and method for anatomical segmentation in image analysis
US11929174B2 (en) Machine learning method and apparatus, program, learned model, and discrimination apparatus using multilayer neural network
Srivastav et al. Human pose estimation on privacy-preserving low-resolution depth images
Azagra et al. Endomapper dataset of complete calibrated endoscopy procedures
KR102606734B1 (en) Method and apparatus for spoof detection
CN114596312B (en) Video processing method and device
Solovyev et al. Bayesian feature pyramid networks for automatic multi-label segmentation of chest X-rays and assessment of cardio-thoratic ratio
CN114663575A (en) Method, apparatus and computer-readable storage medium for image processing
EP4309075A1 (en) Prediction of structures in surgical data using machine learning
CN114187296A (en) Capsule endoscope image focus segmentation method, server and system
CN115546231A (en) Self-adaptive brain glioma segmentation method based on semi-supervised deep learning
US20230260652A1 (en) Self-Supervised Machine Learning for Medical Image Analysis
CN111507950A (en) Image segmentation method and device, electronic equipment and computer-readable storage medium
US20240161497A1 (en) Detection of surgical states and instruments
Hao et al. Act-net: anchor-context action detection in surgery videos
CN115564897A (en) Intelligent magnetic resonance holographic imaging method and system
Liu et al. Joint estimation of depth and motion from a monocular endoscopy image sequence using a multi-loss rebalancing network
EP4309142A1 (en) Adaptive visualization of contextual targets in surgical video
Ouyang et al. SwinD-Net: A lightweight segmentation network for laparoscopic liver segmentation
Zhang et al. A spine segmentation method under an arbitrary field of view based on 3d swin transformer
Gong et al. Intensity-mosaic: automatic panorama mosaicking of disordered images with insufficient features
CN115552464A (en) Shape-aware organ segmentation by predicting signed distance maps
CN113409324A (en) Brain segmentation method fusing differential geometric information
Song et al. BDIS-SLAM: a lightweight CPU-based dense stereo SLAM for surgery
CN116991296B (en) Object editing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant