WO2023179385A1 - Video super resolution method, apparatus, device, and storage medium - Google Patents

Video super resolution method, apparatus, device, and storage medium

Info

Publication number
WO2023179385A1
WO2023179385A1 (PCT/CN2023/080945)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
scale
feature interaction
video frame
module
Prior art date
Application number
PCT/CN2023/080945
Other languages
French (fr)
Chinese (zh)
Inventor
谢良彬 (Xie Liangbin)
董超 (Dong Chao)
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Publication of WO2023179385A1 publication Critical patent/WO2023179385A1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • the present application relates to the technical field of video image processing, and in particular to a video super-resolution method, device, equipment and storage medium.
  • VSR: video super-resolution
  • This application provides a video super-resolution method, which addresses the problem that, in the existing technology, video super-resolution relies on a variety of particularly complex alignment modules to better fuse information between different frames, making the implementation process complex, computationally intensive, and demanding on processing equipment.
  • a video super-resolution method is provided, the method including:
  • a pre-trained target multi-scale video super-resolution model is used to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network.
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames.
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • a video super-resolution device is provided, the device including:
  • a data acquisition module used to acquire an original video frame sequence containing at least two original video frames
  • the video super-resolution module is used to apply a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames;
  • the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network;
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames;
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • an electronic device including:
  • the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the video super-resolution method described in any embodiment of the present application.
  • a computer-readable storage medium stores computer instructions which, when executed by a processor, implement the video super-resolution method described in any embodiment of the present application.
  • the technical solution of the embodiments of the present application obtains an original video frame sequence containing at least two original video frames and uses a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network.
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame.
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network, which solves the problem that existing video super-resolution methods rely on video frame alignment to fuse information between different frames, resulting in a complicated and computationally heavy process.
  • the embodiments of this application build a multi-scale video super-resolution model that turns video frames into feature maps of different sizes for information interaction, attending to feature information at different granularities, thereby reducing computation while improving model performance.
  • Figure 1 is a flow chart of a video super-resolution method provided according to Embodiment 1 of the present application;
  • Figure 2 is a schematic structural diagram of the VSTL layer in a video super-resolution method provided according to Embodiment 1 of the present application;
  • Figure 3 is a schematic diagram of the principle of a video super-resolution method provided according to Embodiment 1 of the present application;
  • Figure 4 is a schematic structural diagram of a video super-resolution device provided according to Embodiment 2 of the present application.
  • Figure 5 is a schematic structural diagram of an electronic device that implements the video super-resolution method according to an embodiment of the present application.
  • Figure 1 is a flow chart of a video super-resolution method provided in Embodiment 1 of the present application. This embodiment can be applied to the case of super-resolution processing of video images.
  • the method can be executed by a video super-resolution device.
  • the video super-resolution device can be implemented in the form of hardware and/or software and can be configured in computer equipment. As shown in Figure 1, the method includes:
  • a long video can be cropped to form a video clip containing a certain number of video frames.
  • the set of video frames contained in each video segment before the super-resolution operation can be called an original video frame sequence.
  • Multiple consecutive video frames within an original video frame sequence can be called original video frames.
  • S120. Use the pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence, and output the target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network.
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame.
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • a multi-scale video super-resolution model can be built in advance, and after training with a large amount of training data, the target multi-scale video super-resolution model can be obtained.
  • the original video frame sequence can be fed as input data into the target multi-scale video super-resolution model.
  • the target multi-scale video super-resolution model performs information interaction and feature fusion at different scales on each original video frame in the sequence; after the feature map of each original video frame is reconstructed, the corresponding target video frame is obtained.
  • the target video frames are arranged in order to form the target video frame sequence after super-resolution processing.
  • the multi-scale feature interaction network can be composed of an initial feature interaction module, at least one multi-scale feature interaction module and a terminal feature interaction module in series;
  • the initial feature interaction module can be used to perform feature interaction on each original video frame;
  • the multi-scale feature interaction module can be used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the previous multi-scale feature interaction module;
  • the terminal feature interaction module can be used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  • the number of multi-scale feature interaction modules can be specifically set according to the actual application scenario, and can be 2, 3 or 4.
  • when there are two multi-scale feature interaction modules, the original video frame sequence is input to the target multi-scale video super-resolution model; the initial feature interaction module performs feature interaction across all original video frames in the sequence and, for each original video frame, outputs a feature map that has interacted with the other original video frames; the first multi-scale feature interaction module takes the feature maps output by the initial feature interaction module, performs multi-scale feature interaction and feature fusion on them, and passes the result to the second multi-scale feature interaction module; the second module repeats multi-scale feature interaction and feature fusion and passes its output to the terminal feature interaction module, which performs a final round of feature interaction on all of the feature maps.
  • the multi-scale feature interaction module can include at least two RVSTB units and one feature fusion unit; each feature map input to the module is down-sampled at a preset sampling rate, the down-sampled feature map undergoes feature interaction in an RVSTB unit, and the result is up-sampled back at the same preset rate after the interaction; the feature fusion unit performs feature fusion on all of the output feature maps that correspond to the same input feature map across the RVSTB units.
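  • To make the data flow concrete, the following is a minimal PyTorch sketch of such a module; the RVSTB internals are replaced by a single convolution stand-in, and the use of average pooling and bilinear interpolation for the down/up-sampling steps is an assumption, since the text does not fix those operators:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureInteraction(nn.Module):
    """One branch per preset sampling rate: down-sample, run a stand-in
    RVSTB block, up-sample back, then fuse all branch outputs."""
    def __init__(self, channels=64, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        # Stand-in for the RVSTB unit (the real unit stacks VSTL layers + conv).
        self.rvstb = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in rates]
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)  # feature fusion unit

    def forward(self, x):                                  # x: (B*T, C, H, W)
        outs = []
        for rate, block in zip(self.rates, self.rvstb):
            y = F.avg_pool2d(x, rate) if rate > 1 else x   # down-sample at preset rate
            y = block(y)                                   # feature interaction
            if rate > 1:                                   # up-sample back
                y = F.interpolate(y, size=x.shape[-2:],
                                  mode="bilinear", align_corners=False)
            outs.append(y)
        return self.fuse(torch.cat(outs, dim=1))           # fuse maps for one input
```

  Processing the same content at 1x, 2x and 4x lets the module attend to feature information at different granularities, which is the stated motivation for the multi-scale design.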
  • the RVSTB unit may include at least two VSTL layers and one convolutional layer.
  • the VSTL layer can use the shift window mechanism and the attention mechanism to achieve feature interaction.
  • the RVSTB (Residual Video Swin Transformer Block) unit can model the correlation between different video frames and support continuous information interaction across multiple video frames.
  • the number of RVSTB units in the multi-scale feature interaction module can be specifically set according to the actual application scenario.
  • the number of RVSTB units can match the number of preset sampling rates: with 3 preset sampling rates for the feature maps, the multi-scale feature interaction module contains 3 RVSTB units, and with 4 preset sampling rates it contains 4 RVSTB units.
  • the number of VSTL layers in the RVSTB unit can also be set according to the actual application scenario; the number of VSTL layers reflects the depth of feature interaction.
  • it is understandable that, within a certain range, more VSTL layers yield deeper information interaction between the feature maps.
  • the VSTL layer can be composed of LayerNorm, MSA, MLP and residual connections.
  • the network structure is a modification of the standard multi-head self-attention used in the original Transformer layer; the main differences from the original Transformer network are the local attention mechanism and the shift window mechanism.
  • Figure 2 is a schematic structural diagram of the VSTL layer in a video super-resolution method provided according to Embodiment 1 of the present application.
  • given an input of size T x H x W x C, T represents the number of input frames, H the height of each input picture, W the width, and C the number of channels (3 by default);
  • the Video Swin Transformer can first use one Conv3d layer to divide the input into non-overlapping windows of dimension N x M x M; for each window, the self-attention result can be computed independently.
  • N x M² x C can represent the dimension formed by all point vectors inside a window, where each point has dimension C; since the window contains N x M x M pixel points, the total is N x M² x C. Multiplying these features by three learnable mapping matrices yields the corresponding query (Q), key (K) and value (V) values; the corresponding attention matrix can then be calculated as Attention(Q, K, V) = SoftMax(QK^T / √d) V, where d is the dimension of each key vector.
  • QK^T computes the similarity between the query and the key, i.e., every point in the query is scored against every point in the key; multiplying the result by V yields new features that fuse the other frames with the frame itself.
  • the MLP layer can contain two FC layers and a GELU activation function.
  • a LayerNorm layer is applied before both the MSA and the MLP operations, and each of the MSA and MLP operations is followed by a residual connection; the whole process can be expressed as X = MSA(LN(X)) + X, then X = MLP(LN(X)) + X.
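  • A minimal PyTorch sketch of this layer follows; window partitioning and shifting are omitted, and nn.MultiheadAttention over a flattened token dimension stands in for the windowed MSA (both simplifications are assumptions made for brevity):

```python
import torch
import torch.nn as nn

class VSTL(nn.Module):
    """Pre-norm Transformer layer as described: X = MSA(LN(X)) + X,
    then X = MLP(LN(X)) + X, with a two-FC + GELU MLP."""
    def __init__(self, dim=64, heads=4, mlp_ratio=2):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # two FC layers and a GELU
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        y = self.ln1(x)
        x = self.msa(y, y, y, need_weights=False)[0] + x   # MSA + residual
        x = self.mlp(self.ln2(x)) + x                      # MLP + residual
        return x

# Toy usage: 5 frames of 8x8 spatial tokens with 64 channels, flattened.
tokens = torch.rand(1, 5 * 8 * 8, 64)
out = VSTL()(tokens)
```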
  • the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module; the feature image reconstruction module reconstructs features from the feature map output by the feature interaction network to form a reconstructed image; the interpolated image construction module performs image interpolation on the original video frame to form an interpolated image; the image fusion module fuses the reconstructed image and the interpolated image to form the target video frame.
  • the feature map output by the feature interaction network is reconstructed to form a reconstructed image.
  • an interpolation-based magnification method is used to perform image interpolation on the original video frame to form the interpolated image.
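  • A hedged sketch of such a reconstruction head follows; PixelShuffle for the feature image reconstruction, bicubic resizing for the interpolated image, and element-wise addition as the fusion step are illustrative assumptions, since the text only names the three modules and their outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionNet(nn.Module):
    """Feature image reconstruction + interpolated image construction
    + image fusion, with assumed operator choices."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.up = nn.Sequential(               # feature image reconstruction module
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, feat, frame):            # feat: (B, C, H, W); frame: (B, 3, H, W)
        recon = self.up(feat)                                        # reconstructed image
        interp = F.interpolate(frame, scale_factor=self.scale,
                               mode="bicubic", align_corners=False)  # interpolated image
        return recon + interp                                        # image fusion module
```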
  • FIG. 3 is a schematic diagram of the principle of a video super-resolution method provided according to Embodiment 1 of the present application.
  • the obtained original video frame sequence contains 5 original video frames.
  • the multi-scale feature interaction network in the target multi-scale video super-resolution model consists of 1 initial feature interaction module, 3 multi-scale feature interaction modules and 1 terminal feature interaction module connected in series; the initial feature interaction module includes 1 RVSTB unit, each multi-scale feature interaction module includes 3 RVSTB units and 1 feature fusion unit, and the terminal feature interaction module includes 1 RVSTB unit; the preset sampling rates are 1x, 2x and 4x; each RVSTB unit includes 6 VSTL layers and 1 convolutional layer, and each feature fusion unit includes 1 convolutional layer.
  • the 5 input consecutive frames of original video images first pass through an RVSTB module, and the output feature maps then pass through the 3 multi-scale feature interaction modules; each multi-scale feature interaction module uses its RVSTB units to process 3 feature maps of different sizes, as captured in the configuration sketch below.
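  • For reference, this example configuration can be written down as a simple record; the field names below are hypothetical, chosen only to mirror the numbers above:

```python
# Hypothetical configuration mirroring the example in the text; the key
# names are illustrative assumptions, not identifiers from the patent.
config = dict(
    num_input_frames=5,        # original video frames per sequence
    initial_rvstb_units=1,     # initial feature interaction module
    multiscale_modules=3,      # multi-scale feature interaction modules
    sampling_rates=(1, 2, 4),  # preset sampling rates (one RVSTB branch each)
    vstl_per_rvstb=6,          # VSTL layers in each RVSTB unit
    convs_per_rvstb=1,         # convolutional layer closing each RVSTB unit
    terminal_rvstb_units=1,    # terminal feature interaction module
)
```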
  • the training process of the target multi-scale video super-resolution model in this embodiment may include:
  • A1. Obtain a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences.
  • each set of training data can include a low-resolution video frame sequence and the corresponding standard high-resolution video frame sequence.
  • the low-resolution video frame sequence contains a certain number of low-resolution video frames
  • the standard high-resolution video frame sequence contains a certain number of high-resolution video frames.
  • the low-resolution video frames can be obtained by downsampling the high-resolution video frames.
  • A2. Input the low-resolution video frame sequence into the multi-scale video super-resolution model to be trained, and obtain the output actual high-resolution video frame sequence.
  • the multi-scale video super-resolution model to be trained can be built in advance.
  • the multi-scale video super-resolution model to be trained can be composed of a multi-scale feature interaction network to be trained and an image reconstruction network to be trained;
  • the multi-scale feature interaction network to be trained can be composed of an initial feature interaction module to be trained, at least one multi-scale feature interaction module to be trained, and a terminal feature interaction module to be trained, connected in series;
  • the multi-scale feature interaction module to be trained can include at least two RVSTB units to be trained and one feature fusion unit to be trained; each RVSTB unit to be trained can include at least two VSTL layers and one convolutional layer.
  • the actual high-resolution video frame sequence output by the model can then be obtained.
  • the standard high-resolution video frame sequence is the real high-resolution sequence, whereas the actual high-resolution video frame sequence is calculated and output by a model that has not yet completed training;
  • a certain error must therefore exist between the standard high-resolution video frame sequence and the actual high-resolution video frame sequence, and a fitting loss function can be built from this error to adjust the training parameters of the multi-scale video super-resolution model to be trained.
  • A4. Perform backpropagation on the multi-scale video super-resolution model to be trained by fitting the loss function to obtain the target multi-scale video super-resolution model.
  • the multi-scale video super-resolution model to be trained can be back-propagated through the fitting loss function, with its parameters adjusted continuously, to finally obtain the target multi-scale video super-resolution model.
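  • A hedged sketch of one such training step follows; synthesizing the low-resolution input by bicubic down-sampling and using an L1 distance as the fitting loss are assumptions, since the text only requires a loss built from the error between the actual and standard high-resolution sequences, minimized by backpropagation:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, hr_seq, scale=4):
    """One training step; `model` is any sequence-to-sequence SR network.
    Assumes H and W are divisible by `scale`."""
    b, t, c, h, w = hr_seq.shape
    lr = F.interpolate(hr_seq.view(b * t, c, h, w), scale_factor=1 / scale,
                       mode="bicubic", align_corners=False)  # synthesize low-res frames
    lr_seq = lr.view(b, t, c, h // scale, w // scale)
    sr_seq = model(lr_seq)                     # actual high-resolution sequence
    loss = F.l1_loss(sr_seq, hr_seq)           # fitting loss vs. standard sequence
    optimizer.zero_grad()
    loss.backward()                            # backpropagation
    optimizer.step()
    return loss.item()
```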
  • FIG 4 is a schematic structural diagram of a video super-resolution device provided in Embodiment 2 of the present application. As shown in Figure 4, the device includes:
  • the data acquisition module 210 is used to acquire an original video frame sequence including at least two original video frames.
  • the video super-resolution module 220 is used to apply a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames;
  • the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network;
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames;
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • the multi-scale feature interaction network consists of an initial feature interaction module, at least one multi-scale feature interaction module and a terminal feature interaction module in series;
  • the initial feature interaction module is used to perform feature interaction on each of the original video frames
  • the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the previous multi-scale feature interaction module;
  • the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  • the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
  • the feature map input to the multi-scale feature interaction module is down-sampled at a preset sampling rate;
  • the RVSTB unit is used to perform feature interaction on the down-sampled feature map;
  • after the feature interaction, the feature map is up-sampled back at the same preset sampling rate;
  • the feature fusion unit is used to perform feature fusion on all output feature maps corresponding to the same input feature map of each RVSTB unit.
  • the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  • the VSTL layer uses a shift window mechanism and an attention mechanism to implement feature interaction.
  • the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
  • the feature image reconstruction module is used to perform feature reconstruction on the feature map output by the feature interaction network to form a reconstructed image;
  • the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image;
  • the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
  • the training steps of the target multi-scale video super-resolution model include:
  • the multi-scale video super-resolution model to be trained is back-propagated through the fitting loss function to obtain the target multi-scale video super-resolution model.
  • the video super-resolution device provided by the embodiments of this application can execute the video super-resolution method provided by any embodiment of this application, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (eg, helmets, glasses, watches, etc.), and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the present application as described and/or claimed herein.
  • the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, where the memory stores a computer program executable by the at least one processor.
  • the processor 11 can perform various appropriate actions and processing according to a computer program stored in the read-only memory (ROM) 12 or loaded from a storage unit 18 into the random access memory (RAM) 13.
  • the RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14.
  • multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard or a mouse; an output unit 17 such as various types of displays and speakers; a storage unit 18 such as a magnetic disk or an optical disk; and a communication unit 19 such as a network card, a modem or a wireless communication transceiver.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Processor 11 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 performs various methods and processes described above, such as video super-resolution methods.
  • the video super-resolution method may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18.
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19 .
  • the processor 11 may be configured to perform the video super-resolution method in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs executable and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • a computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be a machine-readable signal medium.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, also known as a cloud computing server or cloud host; it is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability in traditional physical hosts and VPS services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the present application is a video super resolution method. The method comprises: acquiring an original video frame sequence containing at least two original video frames; and performing video super resolution operation on the original video frame sequence by means of a pre-trained target multi-scale video super resolution model, and outputting a target video frame sequence containing a target video frame, the target multi-scale video super resolution model comprising a multi-scale feature interaction network and an image reconstruction network.

Description

A video super-resolution method, apparatus, device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 22, 2022, with application number 202210286954.8 and the invention title "A video super-resolution method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of video image processing, and in particular to a video super-resolution method, apparatus, device and storage medium.
Background Art
Compared with image super-resolution, video super-resolution (VSR) not only needs to exploit the intrinsic characteristics of a single image frame for super-resolution, but also involves aggregating information extracted from multiple highly correlated but unaligned frames in a video sequence. Because of the motion of objects in the video and the movement of the camera, there are obvious displacements between the information in different frames. To make better use of the information from different frames, existing methods design dedicated modules to align the information of different frames, with dedicated comparative experiments to demonstrate the necessity of the proposed alignment module.
Some representative methods currently exist. In the RBPN method, multiple projection modules are used to sequentially aggregate features from multiple frames. The BasicVSR method summarizes the common VSR framework into four parts, namely propagation, alignment, aggregation and upsampling; bidirectional propagation is used to extract information from the entire input video for reconstruction, and optical flow is used for feature warping. The recently proposed BasicVSR++ builds on BasicVSR with a more complex alignment module to further align the features of different frames. The Swin Transformer combines the advantages of CNNs and Transformers and shows great promise in the computer vision field; SwinIR, built from the basic modules proposed in the Swin Transformer, achieves better performance than CNNs on many low-level vision tasks with the same number of parameters.
Technical Problem
The present application provides a video super-resolution method that addresses the problem that, in the existing technology, video super-resolution relies on a variety of particularly complex alignment modules to better fuse information between different frames, making the implementation process complex, computationally intensive, and demanding on processing equipment.
Technical Solution
According to one aspect of the present application, a video super-resolution method is provided, the method including:
obtaining an original video frame sequence containing at least two original video frames;
using a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and outputting a target video frame sequence containing the target video frames; wherein the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
According to another aspect of the present application, a video super-resolution apparatus is provided, the apparatus including:
a data acquisition module, used to acquire an original video frame sequence containing at least two original video frames;
a video super-resolution module, used to apply a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; wherein the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
According to another aspect of the present application, an electronic device is provided, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the video super-resolution method described in any embodiment of the present application.
According to another aspect of the present application, a computer-readable storage medium is provided; the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the video super-resolution method described in any embodiment of the present application.
Beneficial Effects
The technical solution of the embodiments of the present application obtains an original video frame sequence containing at least two original video frames and uses a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network. This solves the problem that existing video super-resolution methods rely on video frame alignment to fuse information between different frames, resulting in a complicated and computationally heavy process. The embodiments of this application build a multi-scale video super-resolution model that turns video frames into feature maps of different sizes for information interaction, attending to feature information at different granularities, thereby reducing computation while improving model performance.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present application, nor is it intended to limit the scope of the present application. Other features of the present application will become readily understood from the following description.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a flow chart of a video super-resolution method provided according to Embodiment 1 of the present application;
Figure 2 is a schematic structural diagram of the VSTL layer in a video super-resolution method provided according to Embodiment 1 of the present application;
Figure 3 is a schematic diagram of the principle of a video super-resolution method provided according to Embodiment 1 of the present application;
Figure 4 is a schematic structural diagram of a video super-resolution apparatus provided according to Embodiment 2 of the present application;
Figure 5 is a schematic structural diagram of an electronic device that implements the video super-resolution method according to an embodiment of the present application.
Embodiments of the Invention
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", "original", "target", etc. in the description and claims of this application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units need not be limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or apparatus.
Embodiment 1
Figure 1 is a flow chart of a video super-resolution method provided in Embodiment 1 of the present application. This embodiment is applicable to super-resolution processing of video images. The method can be executed by a video super-resolution device, which can be implemented in the form of hardware and/or software and can be configured in computer equipment. As shown in Figure 1, the method includes:
S110. Obtain an original video frame sequence containing at least two original video frames.
In practical applications, when a super-resolution operation needs to be performed on video images, a long video can be cropped into video clips containing a certain number of video frames. In this embodiment, the set of video frames contained in each video clip before the super-resolution operation can be called an original video frame sequence, and the multiple consecutive video frames within an original video frame sequence can be called original video frames.
S120. Use the pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence, and output the target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
In this embodiment, a multi-scale video super-resolution model can be built in advance and trained with a large amount of training data to obtain the target multi-scale video super-resolution model. When performing super-resolution processing on the original video frame sequence, the sequence can be fed as input data into the target multi-scale video super-resolution model, which performs information interaction and feature fusion at different scales on each original video frame in the sequence; after the feature map of each original video frame is reconstructed, the corresponding target video frame is obtained, and the target video frames arranged in order form the super-resolved target video frame sequence.
Optionally, the multi-scale feature interaction network can be composed of an initial feature interaction module, at least one multi-scale feature interaction module, and a terminal feature interaction module connected in series; the initial feature interaction module can be used to perform feature interaction on each original video frame; the multi-scale feature interaction module can be used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the previous multi-scale feature interaction module; and the terminal feature interaction module can be used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
In practical applications, the number of multi-scale feature interaction modules can be set according to the actual application scenario, for example 2, 3 or 4. When there are two multi-scale feature interaction modules, the original video frame sequence is input to the target multi-scale video super-resolution model; the initial feature interaction module performs feature interaction across all original video frames in the sequence and, for each original video frame, outputs a feature map that has interacted with the other original video frames; the first multi-scale feature interaction module takes the feature maps output by the initial feature interaction module, performs multi-scale feature interaction and feature fusion on them, and passes the result to the second multi-scale feature interaction module; the second module repeats multi-scale feature interaction and feature fusion and passes its output to the terminal feature interaction module, which performs a final round of feature interaction on all of the feature maps.
Further, the multi-scale feature interaction module can include at least two RVSTB units and one feature fusion unit; each feature map input to the module is down-sampled at a preset sampling rate, the down-sampled feature map undergoes feature interaction in an RVSTB unit, and the result is up-sampled back at the same preset rate after the interaction; the feature fusion unit performs feature fusion on all of the output feature maps that correspond to the same input feature map across the RVSTB units.
In this embodiment, the RVSTB unit may include at least two VSTL layers and one convolutional layer. The VSTL layer can use a shift window mechanism and an attention mechanism to implement feature interaction.
In practical applications, the RVSTB (Residual Video Swin Transformer Block) unit can model the correlation between different video frames and support continuous information interaction across multiple video frames. The number of RVSTB units in the multi-scale feature interaction module can be set according to the actual application scenario and can match the number of preset sampling rates: with 3 preset sampling rates for the feature maps, the module contains 3 RVSTB units; with 4 preset sampling rates, it contains 4 RVSTB units.
Likewise, the number of VSTL layers in an RVSTB unit can be set according to the actual application scenario; it reflects the depth of feature interaction. It can be understood that, within a certain range, the more VSTL layers there are, the deeper the information interaction between feature maps.
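Under those conventions, an RVSTB unit could be sketched as a stack of VSTL layers (the VSTL layer itself is sketched after Figure 2 below) followed by one convolution, with a residual connection over the whole block. This is a plausible reading of "Residual Video Swin Transformer Block", not a specification from the patent.

```python
import torch.nn as nn

class RVSTB(nn.Module):
    """A stack of VSTL layers plus one convolution, wrapped in a residual."""
    def __init__(self, channels=64, depth=6, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            VSTL(channels, num_heads, shift=(i % 2 == 1))  # alternate shifted windows
            for i in range(depth)
        )
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, x):                     # x: (B, C, T, H, W)
        y = x
        for layer in self.layers:
            y = layer(y)                      # deeper stacks -> deeper interaction
        return x + self.conv(y)               # residual over the whole block
```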
Specifically, a VSTL layer may consist of LayerNorm, multi-head self-attention (MSA), an MLP, and residual connections. This structure is adapted from the standard multi-head self-attention used in the original Transformer layer; the main differences from the original Transformer network are the local attention mechanism and the shifted-window mechanism. Figure 2 is a schematic structural diagram of the VSTL layer in the video super-resolution method provided in Embodiment 1 of the present application. As shown in Figure 2, the input has size T x H x W x C, where T is the number of input frames, H the height of each input picture, W its width, and C its number of channels (3 by default). The Video Swin Transformer first uses a Conv3d layer to partition the input into non-overlapping windows of dimension N x M x M, and self-attention is computed independently within each window.
The feature of one window can be written as a set of point vectors of total dimension N x M² x C: each point vector has dimension C, and the window contains N x M x M pixels. Multiplying this feature by three learnable projection matrices yields the corresponding query (Q), key (K) and value (V). The corresponding attention matrix can then be computed as:

Attention(Q, K, V) = SoftMax(QK^T/√d)V,
where QK^T/√d measures the similarity between the query and the key (d being the dimension of the query and key vectors): every point in the query is compared with every point in the key. Multiplying the result by V then yields new features that fuse information from the other frames as well as from the frame itself.
The above attention operation is performed h times in parallel (multi-head attention), and the h results are then concatenated.
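A sketch of this per-window multi-head attention in plain PyTorch; shapes follow the N x M² x C description above, and the relative-position bias used by the original Video Swin Transformer is omitted here for brevity.

```python
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention over the tokens of one window."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5    # the 1/sqrt(d) factor
        self.qkv = nn.Linear(dim, dim * 3)    # the three learnable projections
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (num_windows, N*M*M, C)
        B, L, C = x.shape
        q, k, v = (self.qkv(x)
                   .reshape(B, L, 3, self.num_heads, self.head_dim)
                   .permute(2, 0, 3, 1, 4))   # each: (B, heads, L, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # every point vs. every point
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, C)  # concat the h heads
        return self.proj(out)
```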
The MLP may contain two fully connected (FC) layers with a GELU activation between them. A LayerNorm layer is applied before both the MSA and the MLP, and a residual connection follows each of the MSA and MLP operations. The whole process can be expressed as:
X = MSA(LN(X)) + X,
X = MLP(LN(X)) + X,
The operations described above are all performed inside the locally partitioned windows; to exchange information between different windows, the shifted-window mechanism is used.
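Putting the pieces together, one VSTL layer might look as follows. The window partition assumes T, H and W are divisible by the window size, the shifted-window variant is approximated with torch.roll, and the window size and shift amounts are illustrative rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class VSTL(nn.Module):
    """LN -> window MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim, num_heads=4, shift=False, window=(2, 8, 8)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = WindowAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # two FC layers with a GELU between
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.shift, self.window = shift, window

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = x.permute(0, 2, 3, 4, 1)           # to (B, T, H, W, C) for windowing
        if self.shift:                         # roll so windows straddle old borders
            x = torch.roll(x, shifts=(-1, -4, -4), dims=(1, 2, 3))
        B, T, H, W, C = x.shape
        n, m1, m2 = self.window
        wins = (x.reshape(B, T // n, n, H // m1, m1, W // m2, m2, C)
                 .permute(0, 1, 3, 5, 2, 4, 6, 7)
                 .reshape(-1, n * m1 * m2, C)) # non-overlapping n x m1 x m2 windows
        wins = wins + self.attn(self.norm1(wins))   # X = MSA(LN(X)) + X
        wins = wins + self.mlp(self.norm2(wins))    # X = MLP(LN(X)) + X
        x = (wins.reshape(B, T // n, H // m1, W // m2, n, m1, m2, C)
                 .permute(0, 1, 4, 2, 5, 3, 6, 7)
                 .reshape(B, T, H, W, C))      # undo the window partition
        if self.shift:
            x = torch.roll(x, shifts=(1, 4, 4), dims=(1, 2, 3))
        return x.permute(0, 4, 1, 2, 3)        # back to (B, C, T, H, W)
```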
Optionally, the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module. The feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image; the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
Specifically, the feature maps output by the feature interaction network are reconstructed into a reconstructed image, while interpolation-based upscaling is applied to the original video frame to form an interpolated image; fusing the reconstructed image and the interpolated image corresponding to the same original video frame yields the super-resolved target video frame.
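A sketch of this reconstruction path for a single frame, assuming pixel-shuffle upscaling for the feature branch and bicubic interpolation for the skip branch; both are common choices that the patent does not fix.

```python
import torch.nn as nn
import torch.nn.functional as F

class Reconstruction(nn.Module):
    """Reconstructed image from features + interpolation-upscaled input frame."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.up = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),            # (B, 3*s*s, H, W) -> (B, 3, sH, sW)
        )

    def forward(self, feat, lr_frame):         # feat: (B, C, H, W); lr_frame: (B, 3, H, W)
        recon = self.up(feat)                  # feature image reconstruction
        interp = F.interpolate(lr_frame, scale_factor=self.scale,
                               mode="bicubic", align_corners=False)  # interpolated image
        return recon + interp                  # image fusion -> target video frame
```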
Illustratively, Figure 3 is a schematic diagram of the principle of the video super-resolution method provided in Embodiment 1 of the present application. As shown in Figure 3, the acquired original video frame sequence contains 5 original video frames. The multi-scale feature interaction network in the target multi-scale video super-resolution model consists of 1 initial feature interaction module, 3 multi-scale feature interaction modules and 1 terminal feature interaction module connected in series, where the initial feature interaction module contains 1 RVSTB unit, each multi-scale feature interaction module contains 3 RVSTB units and 1 feature fusion unit, and the terminal feature interaction module contains 1 RVSTB unit. The preset sampling rates are 1x, 2x and 4x; each RVSTB unit contains 6 VSTL layers and 1 convolutional layer, and each feature fusion unit contains 1 convolutional layer. The 5 consecutive input frames first pass through one RVSTB module, and the output feature maps then pass through the 3 multi-scale feature interaction modules. Each multi-scale feature interaction module processes feature maps of 3 different sizes with its RVSTB units; of the 3 processed feature maps, the 2 smaller ones are first resized to the same size as the main feature map, the 3 feature maps are then concatenated, and a Conv3D layer aggregates their information. After the 3 multi-scale feature interaction modules, one more RVSTB module further fuses the information of the different feature maps. Finally, a feature reconstruction module is applied, and its output is added to the interpolation-upscaled input pictures to obtain the final multi-frame super-resolution result.
In the technical solution of the embodiments of this application, an original video frame sequence containing at least two original video frames is acquired, a pre-trained target multi-scale video super-resolution model performs a video super-resolution operation on the sequence, and a target video frame sequence containing target video frames is output. The target multi-scale video super-resolution model contains a multi-scale feature interaction network, which performs multi-scale feature interaction and feature fusion on the original video frames, and an image reconstruction network, which reconstructs images from the feature maps output by the feature interaction network. This addresses the problem that existing video super-resolution methods rely on video frame alignment to fuse information between different frames, which makes the computation complex and expensive. By building a multi-scale video super-resolution model that turns video frames into feature maps of different sizes for information interaction, feature information of different granularities can be attended to, reducing the amount of computation and improving model performance.
On the basis of the above solution, the training process of the target multi-scale video super-resolution model in this embodiment may include:
A1. Obtain a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences.
Specifically, training the model requires a large amount of training data; each set of training data may include a low-resolution video frame sequence and a corresponding standard high-resolution video frame sequence. The low-resolution sequence contains a certain number of low-resolution video frames and the standard high-resolution sequence a certain number of high-resolution video frames; the low-resolution frames can be obtained by down-sampling the high-resolution frames.
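For illustration, one common way to build such a pair is bicubic down-sampling of the high-resolution sequence; this is only one of the down-sampling protocols the text allows.

```python
import torch.nn.functional as F

def make_training_pair(hr_seq, scale=4):
    """Derive the low-resolution input from the standard high-resolution sequence."""
    B, T, C, H, W = hr_seq.shape
    lr = F.interpolate(hr_seq.reshape(B * T, C, H, W),
                       scale_factor=1 / scale, mode="bicubic",
                       align_corners=False)    # down-sample each frame
    return lr.reshape(B, T, C, H // scale, W // scale), hr_seq
```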
A2. Input the low-resolution video frame sequence into the multi-scale video super-resolution model to be trained, and obtain the actual high-resolution video frame sequence it outputs.
In this embodiment, the multi-scale video super-resolution model to be trained can be built in advance, consisting of a multi-scale feature interaction network to be trained and an image reconstruction network to be trained. The multi-scale feature interaction network to be trained may be composed of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series; each multi-scale feature interaction module to be trained may include at least two RVSTB units and one feature fusion unit; and each RVSTB unit to be trained may include at least two VSTL layers and one convolutional layer.
Specifically, taking the low-resolution video frame sequence as input to the constructed multi-scale video super-resolution model to be trained yields the actual high-resolution video frame sequence as output.
A3. Obtain a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence.
Specifically, since the standard high-resolution video frame sequence is an actually existing high-resolution sequence while the actual high-resolution sequence is computed by a model whose training is not yet complete, there is necessarily some error between the two. A fitting loss function can be formed from this error to drive the training and parameter tuning of the multi-scale video super-resolution model to be trained.
A4. Back-propagate through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
Specifically, after the fitting loss function is obtained, it can be used to back-propagate through the multi-scale video super-resolution model to be trained, continuously adjusting the model parameters until the target multi-scale video super-resolution model is finally obtained.
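Steps A1 to A4 amount to an ordinary supervised training loop. A sketch follows, in which MultiScaleVSRModel stands in for the assembled networks above and train_loader yields (low-resolution, high-resolution) pairs from the data set of A1; the L1 fitting loss and the Adam optimizer are illustrative choices not prescribed by the patent.

```python
import torch
import torch.nn as nn

model = MultiScaleVSRModel()                   # hypothetical full model (A2's network)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.L1Loss()                        # one plausible fitting loss

for lr_seq, hr_seq in train_loader:            # A1: training pairs
    sr_seq = model(lr_seq)                     # A2: actual high-resolution output
    loss = criterion(sr_seq, hr_seq)           # A3: fitting loss vs. the standard sequence
    optimizer.zero_grad()
    loss.backward()                            # A4: back-propagation
    optimizer.step()
```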
Embodiment 2
Figure 4 is a schematic structural diagram of a video super-resolution apparatus provided in Embodiment 2 of the present application. As shown in Figure 4, the apparatus includes:
a data acquisition module 210, configured to acquire an original video frame sequence containing at least two original video frames; and
a video super-resolution module 220, configured to perform a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model and output a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
Optionally, the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
the initial feature interaction module is used to perform feature interaction on each of the original video frames;
the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
Optionally, the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
Optionally, the RVSTB unit includes at least two VSTL layers and one convolutional layer.
Optionally, the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
Optionally, the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
the feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image;
the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and
the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
Optionally, the training steps of the target multi-scale video super-resolution model include:
obtaining a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences;
inputting the low-resolution video frame sequence into a multi-scale video super-resolution model to be trained, and obtaining an actual high-resolution video frame sequence as output;
obtaining a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence; and
back-propagating through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
The video super-resolution apparatus provided in the embodiments of this application can execute the video super-resolution method provided in any embodiment of this application, and has the functional modules and beneficial effects corresponding to the executed method.
Embodiment 3
Figure 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smartphones, wearable devices (such as helmets, glasses and watches) and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are only examples and are not intended to limit the implementation of the present application described and/or claimed herein.
As shown in Figure 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, the memory storing a computer program executable by the at least one processor. The processor 11 can perform various appropriate actions and processing according to the computer program stored in the read-only memory (ROM) 12 or a computer program loaded from a storage unit 18 into the random access memory (RAM) 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to one another via a bus 14; an input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard or a mouse; an output unit 17 such as various types of displays and speakers; the storage unit 18 such as a magnetic disk or an optical disc; and a communication unit 19 such as a network card, a modem or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller or microcontroller. The processor 11 executes the methods and processes described above, such as the video super-resolution method.
In some embodiments, the video super-resolution method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the video super-resolution method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the video super-resolution method in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on a chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input apparatus and at least one output apparatus.
Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to the processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that when executed by the processor, the computer programs cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of this application, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by, or in combination with, an instruction execution system, apparatus or device. The computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (such as visual feedback, auditory feedback or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network and the Internet.
A computing system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship is created by computer programs that run on the corresponding computers and have a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the defects of difficult management and weak business scalability found in traditional physical hosts and VPS services.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added or deleted. For example, the steps described in this application may be executed in parallel, sequentially or in a different order, as long as the results desired by the technical solution of this application can be achieved; no limitation is imposed herein.
The specific implementations described above do not limit the scope of protection of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions are possible depending on design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (20)

  1. A video super-resolution method, comprising:
    acquiring an original video frame sequence containing at least two original video frames; and
    performing a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model, and outputting a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  2. The method according to claim 1, wherein
    the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
    the initial feature interaction module is used to perform feature interaction on each of the original video frames;
    the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
    the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  3. The method according to claim 2, wherein
    the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
    the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
    the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
  4. The method according to claim 3, wherein
    the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  5. The method according to claim 4, wherein
    the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
  6. The method according to claim 1, wherein
    the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
    the feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image;
    the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and
    the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
  7. The method according to claim 1, wherein the training steps of the target multi-scale video super-resolution model comprise:
    obtaining a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences;
    inputting the low-resolution video frame sequence into a multi-scale video super-resolution model to be trained, and obtaining an actual high-resolution video frame sequence as output;
    obtaining a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence; and
    back-propagating through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
  8. A video super-resolution apparatus, comprising:
    a data acquisition module, configured to acquire an original video frame sequence containing at least two original video frames; and
    a video super-resolution module, configured to perform a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model and output a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the following steps:
    acquiring an original video frame sequence containing at least two original video frames; and
    performing a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model, and outputting a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  10. The electronic device according to claim 9, wherein
    the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
    the initial feature interaction module is used to perform feature interaction on each of the original video frames;
    the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
    the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  11. The electronic device according to claim 10, wherein
    the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
    the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
    the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
  12. The electronic device according to claim 11, wherein
    the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  13. The electronic device according to claim 12, wherein
    the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
  14. The electronic device according to claim 9, wherein
    the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
    the feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image;
    the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and
    the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
  15. The electronic device according to claim 9, wherein the training steps of the target multi-scale video super-resolution model comprise:
    obtaining a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences;
    inputting the low-resolution video frame sequence into a multi-scale video super-resolution model to be trained, and obtaining an actual high-resolution video frame sequence as output;
    obtaining a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence; and
    back-propagating through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
  16. A computer-readable storage medium storing computer instructions which, when executed by a processor, cause the following steps to be implemented:
    acquiring an original video frame sequence containing at least two original video frames; and
    performing a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model, and outputting a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  17. The computer-readable storage medium according to claim 16, wherein
    the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
    the initial feature interaction module is used to perform feature interaction on each of the original video frames;
    the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
    the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  18. The computer-readable storage medium according to claim 17, wherein
    the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
    the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
    the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
  19. The computer-readable storage medium according to claim 18, wherein
    the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  20. The computer-readable storage medium according to claim 19, wherein
    the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
PCT/CN2023/080945 2022-03-22 2023-03-10 Video super resolution method, apparatus, device, and storage medium WO2023179385A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210286954.8A CN116862762A 2022-03-22 2022-03-22 Video super-resolution method, device, equipment and storage medium
CN202210286954.8 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023179385A1 true WO2023179385A1 (en) 2023-09-28

Family

ID=88099817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/080945 WO2023179385A1 (en) 2022-03-22 2023-03-10 Video super resolution method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN116862762A (en)
WO (1) WO2023179385A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210097648A1 (en) * 2019-09-30 2021-04-01 Tsinghua University Multi-image-based image enhancement method and device
CN112070667A (en) * 2020-08-14 2020-12-11 西安理工大学 Multi-scale feature fusion video super-resolution reconstruction method
CN112419152A (en) * 2020-11-23 2021-02-26 中国科学院深圳先进技术研究院 Image super-resolution method and device, terminal equipment and storage medium
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094999A (en) * 2023-10-19 2023-11-21 南京航空航天大学 Cross-scale defect detection method
CN117094999B (en) * 2023-10-19 2023-12-22 南京航空航天大学 Cross-scale defect detection method

Also Published As

Publication number Publication date
CN116862762A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
JP7331171B2 (en) Methods and apparatus for training image recognition models, methods and apparatus for recognizing images, electronic devices, storage media, and computer programs
TWI756378B (en) System and method for deep learning image super resolution
Sun et al. Shufflemixer: An efficient convnet for image super-resolution
WO2022227886A1 (en) Method for generating super-resolution repair network model, and method and apparatus for image super-resolution repair
US20220207299A1 (en) Method and apparatus for building image enhancement model and for image enhancement
WO2022068451A1 (en) Style image generation method and apparatus, model training method and apparatus, device, and medium
WO2022042124A1 (en) Super-resolution image reconstruction method and apparatus, computer device, and storage medium
WO2023179385A1 (en) Video super resolution method, apparatus, device, and storage medium
CN117237197B (en) Image super-resolution method and device based on cross attention mechanism
WO2023143222A1 (en) Image processing method and apparatus, device, and storage medium
WO2022057868A1 (en) Image super-resolution method and electronic device
CN114494022B (en) Model training method, super-resolution reconstruction method, device, equipment and medium
CN114519667A (en) Image super-resolution reconstruction method and system
CN115209064A (en) Video synthesis method, device, equipment and storage medium
CN112418249A (en) Mask image generation method and device, electronic equipment and computer readable medium
WO2023197805A1 (en) Image processing method and apparatus, and storage medium and electronic device
WO2023125550A1 (en) Video frame repair method and apparatus, and device, storage medium and program product
WO2023179360A1 (en) Video processing method and apparatus, and electronic device and storage medium
WO2022213716A1 (en) Image format conversion method and apparatus, device, storage medium, and program product
CN116485654A (en) Lightweight single-image super-resolution reconstruction method combining convolutional neural network and transducer
WO2020000878A1 (en) Method and apparatus for generating image
WO2021213340A1 (en) Video resolution enhancement method and apparatus, storage medium, and electronic device
CN113240780B (en) Method and device for generating animation
WO2021218414A1 (en) Video enhancement method and apparatus, and electronic device and storage medium
CN117196959B (en) Self-attention-based infrared image super-resolution method, device and readable medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23773633

Country of ref document: EP

Kind code of ref document: A1