WO2023179385A1 - Video super resolution method, apparatus, device, and storage medium - Google Patents

Video super resolution method, apparatus, device, and storage medium

Info

Publication number
WO2023179385A1
WO2023179385A1 (PCT/CN2023/080945)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
scale
feature interaction
video frame
module
Prior art date
Application number
PCT/CN2023/080945
Other languages
French (fr)
Chinese (zh)
Inventor
谢良彬 (Xie Liangbin)
董超 (Dong Chao)
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Publication of WO2023179385A1 publication Critical patent/WO2023179385A1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • the present application relates to the technical field of video image processing, and in particular to a video super-resolution method, device, equipment and storage medium.
  • VSR: video super-resolution
  • This application provides a video super-resolution method, which addresses the problem that, in the existing technology, video super-resolution relies on a variety of particularly complex alignment modules to better fuse information between different frames, making the implementation process complex, computationally intensive, and demanding on processing equipment.
  • a video super-resolution method is provided, the method including:
  • a pre-trained target multi-scale video super-resolution model is used to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network.
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames.
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • a video super-resolution device is provided, the device including:
  • a data acquisition module used to acquire an original video frame sequence containing at least two original video frames
  • the video super-resolution module is used to apply a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames;
  • the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network;
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames;
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • an electronic device including:
  • the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the video super-resolution method described in any embodiment of the present application.
  • a computer-readable storage medium stores computer instructions which, when executed by a processor, implement the video super-resolution method described in any embodiment of the present application.
  • the technical solution of the embodiments of the present application obtains an original video frame sequence containing at least two original video frames and uses a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network.
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame.
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network, which solves the problem that existing video super-resolution methods rely on video frame alignment to fuse information between different frames, resulting in a complicated and computationally heavy process.
  • the embodiments of this application build a multi-scale video super-resolution model that turns video frames into feature maps of different sizes for information interaction, attending to feature information at different granularities, thereby reducing computation while improving model performance.
  • Figure 1 is a flow chart of a video super-resolution method provided according to Embodiment 1 of the present application;
  • Figure 2 is a schematic structural diagram of the VSTL layer in a video super-resolution method provided according to Embodiment 1 of the present application;
  • Figure 3 is a schematic diagram of the principle of a video super-resolution method provided according to Embodiment 1 of the present application;
  • Figure 4 is a schematic structural diagram of a video super-resolution device provided according to Embodiment 2 of the present application.
  • Figure 5 is a schematic structural diagram of an electronic device that implements the video super-resolution method according to an embodiment of the present application.
  • Figure 1 is a flow chart of a video super-resolution method provided in Embodiment 1 of the present application. This embodiment can be applied to the case of super-resolution processing of video images.
  • the method can be executed by a video super-resolution device.
  • the video super-resolution device can be implemented in the form of hardware and/or software and can be configured in computer equipment. As shown in Figure 1, the method includes:
  • a long video can be cropped to form a video clip containing a certain number of video frames.
  • the set of video frames contained in each video segment before the super-resolution operation can be called an original video frame sequence.
  • Multiple consecutive video frames within an original video frame sequence can be called original video frames.
  • S120. Use the pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence, and output the target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network.
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame.
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • a multi-scale video super-resolution model can be built in advance, and after training with a large amount of training data, the target multi-scale video super-resolution model can be obtained.
  • the original video frame sequence can be fed as input data into the target multi-scale video super-resolution model.
  • the target multi-scale video super-resolution model performs information interaction and feature fusion at different scales on each original video frame in the sequence; after the feature map of each original video frame is reconstructed, the corresponding target video frame is obtained.
  • the target video frames are arranged in order to form the target video frame sequence after super-resolution processing.
  • the multi-scale feature interaction network can be composed of an initial feature interaction module, at least one multi-scale feature interaction module and a terminal feature interaction module in series;
  • the initial feature interaction module can be used to perform feature interaction on each original video frame;
  • the multi-scale feature interaction module can be used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the previous multi-scale feature interaction module;
  • the terminal feature interaction module can be used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  • the number of multi-scale feature interaction modules can be specifically set according to the actual application scenario, and can be 2, 3 or 4.
  • when there are two multi-scale feature interaction modules, the original video frame sequence is input to the target multi-scale video super-resolution model; the initial feature interaction module performs feature interaction across all original video frames in the sequence and, for each original video frame, outputs a feature map that has interacted with the other original video frames; the first multi-scale feature interaction module takes the feature maps output by the initial feature interaction module, performs multi-scale feature interaction and feature fusion on them, and passes the result to the second multi-scale feature interaction module; the second module repeats multi-scale feature interaction and feature fusion and passes its output to the terminal feature interaction module, which performs a final round of feature interaction on all of the feature maps.
  • the multi-scale feature interaction module can include at least two RVSTB units and one feature fusion unit; each feature map input to the module is down-sampled at a preset sampling rate, the down-sampled feature map undergoes feature interaction in an RVSTB unit, and the result is up-sampled back at the same preset rate after the interaction; the feature fusion unit performs feature fusion on all of the output feature maps that correspond to the same input feature map across the RVSTB units.
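  • To make the data flow concrete, the following is a minimal PyTorch sketch of such a module; the RVSTB internals are replaced by a single convolution stand-in, and the use of average pooling and bilinear interpolation for the down/up-sampling steps is an assumption, since the text does not fix those operators:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureInteraction(nn.Module):
    """One branch per preset sampling rate: down-sample, run a stand-in
    RVSTB block, up-sample back, then fuse all branch outputs."""
    def __init__(self, channels=64, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        # Stand-in for the RVSTB unit (the real unit stacks VSTL layers + conv).
        self.rvstb = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in rates]
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)  # feature fusion unit

    def forward(self, x):                                  # x: (B*T, C, H, W)
        outs = []
        for rate, block in zip(self.rates, self.rvstb):
            y = F.avg_pool2d(x, rate) if rate > 1 else x   # down-sample at preset rate
            y = block(y)                                   # feature interaction
            if rate > 1:                                   # up-sample back
                y = F.interpolate(y, size=x.shape[-2:],
                                  mode="bilinear", align_corners=False)
            outs.append(y)
        return self.fuse(torch.cat(outs, dim=1))           # fuse maps for one input
```

  Processing the same content at 1x, 2x and 4x lets the module attend to feature information at different granularities, which is the stated motivation for the multi-scale design.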
  • the RVSTB unit may include at least two VSTL layers and one convolutional layer.
  • the VSTL layer can use the shift window mechanism and the attention mechanism to achieve feature interaction.
  • the RVSTB (Residual Video Swin Transformer Block) unit can model the correlation between different video frames and support continuous information interaction across multiple video frames.
  • the number of RVSTB units in the multi-scale feature interaction module can be specifically set according to the actual application scenario.
  • the number of RVSTB units can match the number of preset sampling rates: with 3 preset sampling rates for the feature maps, the multi-scale feature interaction module contains 3 RVSTB units, and with 4 preset sampling rates it contains 4 RVSTB units.
  • the number of VSTL layers in the RVSTB unit can also be set according to the actual application scenario; the number of VSTL layers reflects the depth of feature interaction.
  • it is understandable that, within a certain range, more VSTL layers yield deeper information interaction between the feature maps.
  • the VSTL layer can be composed of LayerNorm, MSA, MLP and residual connections.
  • the network structure is a modification of the standard multi-head self-attention used in the original Transformer layer; the main differences from the original Transformer network are the local attention mechanism and the shift window mechanism.
  • Figure 2 is a schematic structural diagram of the VSTL layer in a video super-resolution method provided according to Embodiment 1 of the present application.
  • given an input of size T x H x W x C, T represents the number of input frames, H the height of each input picture, W the width, and C the number of channels (3 by default);
  • the Video Swin Transformer can first use one Conv3d layer to divide the input into non-overlapping windows of dimension N x M x M; for each window, the self-attention result can be computed independently.
  • N x M² x C can represent the dimension formed by all point vectors inside a window, where each point has dimension C; since the window contains N x M x M pixel points, the total is N x M² x C. Multiplying these features by three learnable mapping matrices yields the corresponding query (Q), key (K) and value (V) values; the corresponding attention matrix can then be calculated as Attention(Q, K, V) = SoftMax(QK^T / √d) V, where d is the dimension of each key vector.
  • QK^T computes the similarity between the query and the key, i.e., every point in the query is scored against every point in the key; multiplying the result by V yields new features that fuse the other frames with the frame itself.
  • the MLP layer can contain two FC layers and a GELU activation function.
  • a LayerNorm layer is applied before both the MSA and the MLP operations, and each of the MSA and MLP operations is followed by a residual connection; the whole process can be expressed as X = MSA(LN(X)) + X, then X = MLP(LN(X)) + X.
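  • A minimal PyTorch sketch of this layer follows; window partitioning and shifting are omitted, and nn.MultiheadAttention over a flattened token dimension stands in for the windowed MSA (both simplifications are assumptions made for brevity):

```python
import torch
import torch.nn as nn

class VSTL(nn.Module):
    """Pre-norm Transformer layer as described: X = MSA(LN(X)) + X,
    then X = MLP(LN(X)) + X, with a two-FC + GELU MLP."""
    def __init__(self, dim=64, heads=4, mlp_ratio=2):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # two FC layers and a GELU
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        y = self.ln1(x)
        x = self.msa(y, y, y, need_weights=False)[0] + x   # MSA + residual
        x = self.mlp(self.ln2(x)) + x                      # MLP + residual
        return x

# Toy usage: 5 frames of 8x8 spatial tokens with 64 channels, flattened.
tokens = torch.rand(1, 5 * 8 * 8, 64)
out = VSTL()(tokens)
```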
  • the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module; the feature image reconstruction module reconstructs features from the feature map output by the feature interaction network to form a reconstructed image; the interpolated image construction module performs image interpolation on the original video frame to form an interpolated image; the image fusion module fuses the reconstructed image and the interpolated image to form the target video frame.
  • the feature map output by the feature interaction network is reconstructed to form a reconstructed image.
  • an interpolation-based magnification method is used to perform image interpolation on the original video frame to form the interpolated image.
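  • A hedged sketch of such a reconstruction head follows; PixelShuffle for the feature image reconstruction, bicubic resizing for the interpolated image, and element-wise addition as the fusion step are illustrative assumptions, since the text only names the three modules and their outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionNet(nn.Module):
    """Feature image reconstruction + interpolated image construction
    + image fusion, with assumed operator choices."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.up = nn.Sequential(               # feature image reconstruction module
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, feat, frame):            # feat: (B, C, H, W); frame: (B, 3, H, W)
        recon = self.up(feat)                                        # reconstructed image
        interp = F.interpolate(frame, scale_factor=self.scale,
                               mode="bicubic", align_corners=False)  # interpolated image
        return recon + interp                                        # image fusion module
```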
  • FIG. 3 is a schematic diagram of the principle of a video super-resolution method provided according to Embodiment 1 of the present application.
  • the obtained original video frame sequence contains 5 original video frames.
  • the multi-scale feature interaction network in the target multi-scale video super-resolution model consists of 1 initial feature interaction module, 3 multi-scale feature interaction modules and 1 terminal feature interaction module connected in series; the initial feature interaction module includes 1 RVSTB unit, each multi-scale feature interaction module includes 3 RVSTB units and 1 feature fusion unit, and the terminal feature interaction module includes 1 RVSTB unit; the preset sampling rates are 1x, 2x and 4x; each RVSTB unit includes 6 VSTL layers and 1 convolutional layer, and each feature fusion unit includes 1 convolutional layer.
  • the 5 input consecutive frames of original video images first pass through an RVSTB module, and the output feature maps then pass through the 3 multi-scale feature interaction modules; each multi-scale feature interaction module uses its RVSTB units to process 3 feature maps of different sizes, as captured in the configuration sketch below.
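  • For reference, this example configuration can be written down as a simple record; the field names below are hypothetical, chosen only to mirror the numbers above:

```python
# Hypothetical configuration mirroring the example in the text; the key
# names are illustrative assumptions, not identifiers from the patent.
config = dict(
    num_input_frames=5,        # original video frames per sequence
    initial_rvstb_units=1,     # initial feature interaction module
    multiscale_modules=3,      # multi-scale feature interaction modules
    sampling_rates=(1, 2, 4),  # preset sampling rates (one RVSTB branch each)
    vstl_per_rvstb=6,          # VSTL layers in each RVSTB unit
    convs_per_rvstb=1,         # convolutional layer closing each RVSTB unit
    terminal_rvstb_units=1,    # terminal feature interaction module
)
```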
  • the training process of the target multi-scale video super-resolution model in this embodiment may include:
  • A1. Obtain a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences.
  • each set of training data can include a low-resolution video frame sequence and the corresponding standard high-resolution video frame sequence.
  • the low-resolution video frame sequence contains a certain number of low-resolution video frames
  • the standard high-resolution video frame sequence contains a certain number of high-resolution video frames.
  • the low-resolution video frames can be obtained by downsampling the high-resolution video frames.
  • A2. Input the low-resolution video frame sequence into the multi-scale video super-resolution model to be trained, and obtain the output actual high-resolution video frame sequence.
  • the multi-scale video super-resolution model to be trained can be built in advance.
  • the multi-scale video super-resolution model to be trained can be composed of a multi-scale feature interaction network to be trained and an image reconstruction network to be trained;
  • the multi-scale feature interaction network to be trained can be composed of an initial feature interaction module to be trained, at least one multi-scale feature interaction module to be trained, and a terminal feature interaction module to be trained, connected in series;
  • the multi-scale feature interaction module to be trained can include at least two RVSTB units to be trained and one feature fusion unit to be trained; each RVSTB unit to be trained can include at least two VSTL layers and one convolutional layer.
  • the actual high-resolution video frame sequence output by the model can then be obtained.
  • the standard high-resolution video frame sequence is the real high-resolution sequence, whereas the actual high-resolution video frame sequence is calculated and output by a model that has not yet completed training;
  • a certain error must therefore exist between the standard high-resolution video frame sequence and the actual high-resolution video frame sequence, and a fitting loss function can be built from this error to adjust the training parameters of the multi-scale video super-resolution model to be trained.
  • A4. Perform backpropagation on the multi-scale video super-resolution model to be trained by fitting the loss function to obtain the target multi-scale video super-resolution model.
  • the multi-scale video super-resolution model to be trained can be back-propagated through the fitting loss function, with its parameters adjusted continuously, to finally obtain the target multi-scale video super-resolution model.
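  • A hedged sketch of one such training step follows; synthesizing the low-resolution input by bicubic down-sampling and using an L1 distance as the fitting loss are assumptions, since the text only requires a loss built from the error between the actual and standard high-resolution sequences, minimized by backpropagation:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, hr_seq, scale=4):
    """One training step; `model` is any sequence-to-sequence SR network.
    Assumes H and W are divisible by `scale`."""
    b, t, c, h, w = hr_seq.shape
    lr = F.interpolate(hr_seq.view(b * t, c, h, w), scale_factor=1 / scale,
                       mode="bicubic", align_corners=False)  # synthesize low-res frames
    lr_seq = lr.view(b, t, c, h // scale, w // scale)
    sr_seq = model(lr_seq)                     # actual high-resolution sequence
    loss = F.l1_loss(sr_seq, hr_seq)           # fitting loss vs. standard sequence
    optimizer.zero_grad()
    loss.backward()                            # backpropagation
    optimizer.step()
    return loss.item()
```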
  • FIG 4 is a schematic structural diagram of a video super-resolution device provided in Embodiment 2 of the present application. As shown in Figure 4, the device includes:
  • the data acquisition module 210 is used to acquire an original video frame sequence including at least two original video frames.
  • the video super-resolution module 220 is used to apply a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames;
  • the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network;
  • the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames;
  • the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
  • the multi-scale feature interaction network consists of an initial feature interaction module, at least one multi-scale feature interaction module and a terminal feature interaction module in series;
  • the initial feature interaction module is used to perform feature interaction on each of the original video frames
  • the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the previous multi-scale feature interaction module;
  • the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  • the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
  • the feature map input to the multi-scale feature interaction module is down-sampled at a preset sampling rate;
  • the RVSTB unit is used to perform feature interaction on the down-sampled feature map;
  • after the feature interaction, the feature map is up-sampled back at the same preset sampling rate;
  • the feature fusion unit is used to perform feature fusion on all output feature maps corresponding to the same input feature map of each RVSTB unit.
  • the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  • the VSTL layer uses a shift window mechanism and an attention mechanism to implement feature interaction.
  • the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
  • the feature image reconstruction module is used to perform feature reconstruction on the feature map output by the feature interaction network to form a reconstructed image;
  • the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image;
  • the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
  • the training steps of the target multi-scale video super-resolution model include:
  • the multi-scale video super-resolution model to be trained is back-propagated through the fitting loss function to obtain the target multi-scale video super-resolution model.
  • the video super-resolution device provided by the embodiments of this application can execute the video super-resolution method provided by any embodiment of this application, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (eg, helmets, glasses, watches, etc.), and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the present application as described and/or claimed herein.
  • the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, where the memory stores a computer program executable by the at least one processor.
  • the processor 11 can perform various appropriate actions and processing according to a computer program stored in the read-only memory (ROM) 12 or loaded from a storage unit 18 into the random access memory (RAM) 13.
  • the RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14.
  • multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard or a mouse; an output unit 17 such as various types of displays and speakers; a storage unit 18 such as a magnetic disk or an optical disk; and a communication unit 19 such as a network card, a modem or a wireless communication transceiver.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Processor 11 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 performs various methods and processes described above, such as video super-resolution methods.
  • the video super-resolution method may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18.
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19 .
  • the processor 11 may be configured to perform the video super-resolution method in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs executable and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • a computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be a machine-readable signal medium.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, also known as a cloud computing server or cloud host; it is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability in traditional physical hosts and VPS services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the present application is a video super resolution method. The method comprises: acquiring an original video frame sequence containing at least two original video frames; and performing video super resolution operation on the original video frame sequence by means of a pre-trained target multi-scale video super resolution model, and outputting a target video frame sequence containing a target video frame, the target multi-scale video super resolution model comprising a multi-scale feature interaction network and an image reconstruction network.

Description

A video super-resolution method, apparatus, device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 22, 2022, with application number 202210286954.8 and the invention title "A video super-resolution method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of video image processing, and in particular to a video super-resolution method, apparatus, device and storage medium.
Background Art
Compared with image super-resolution, video super-resolution (VSR) not only needs to exploit the intrinsic characteristics of a single image frame for super-resolution, but also involves aggregating information extracted from multiple highly correlated but unaligned frames in a video sequence. Because of the motion of objects in the video and the movement of the camera, there are obvious displacements between the information in different frames. To make better use of the information from different frames, existing methods design dedicated modules to align the information of different frames, with dedicated comparative experiments to demonstrate the necessity of the proposed alignment module.
Some representative methods currently exist. In the RBPN method, multiple projection modules are used to sequentially aggregate features from multiple frames. The BasicVSR method summarizes the common VSR framework into four parts, namely propagation, alignment, aggregation and upsampling; bidirectional propagation is used to extract information from the entire input video for reconstruction, and optical flow is used for feature warping. The recently proposed BasicVSR++ builds on BasicVSR with a more complex alignment module to further align the features of different frames. The Swin Transformer combines the advantages of CNNs and Transformers and shows great promise in the computer vision field; SwinIR, built from the basic modules proposed in the Swin Transformer, achieves better performance than CNNs on many low-level vision tasks with the same number of parameters.
Technical Problem
The present application provides a video super-resolution method that addresses the problem that, in the existing technology, video super-resolution relies on a variety of particularly complex alignment modules to better fuse information between different frames, making the implementation process complex, computationally intensive, and demanding on processing equipment.
Technical Solution
According to one aspect of the present application, a video super-resolution method is provided, the method including:
obtaining an original video frame sequence containing at least two original video frames;
using a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and outputting a target video frame sequence containing the target video frames; wherein the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
According to another aspect of the present application, a video super-resolution apparatus is provided, the apparatus including:
a data acquisition module, used to acquire an original video frame sequence containing at least two original video frames;
a video super-resolution module, used to apply a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; wherein the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
According to another aspect of the present application, an electronic device is provided, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the video super-resolution method described in any embodiment of the present application.
According to another aspect of the present application, a computer-readable storage medium is provided; the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the video super-resolution method described in any embodiment of the present application.
Beneficial Effects
The technical solution of the embodiments of the present application obtains an original video frame sequence containing at least two original video frames and uses a pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence and output a target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network. This solves the problem that existing video super-resolution methods rely on video frame alignment to fuse information between different frames, resulting in a complicated and computationally heavy process. The embodiments of this application build a multi-scale video super-resolution model that turns video frames into feature maps of different sizes for information interaction, attending to feature information at different granularities, thereby reducing computation while improving model performance.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present application, nor is it intended to limit the scope of the present application. Other features of the present application will become readily understood from the following description.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a flow chart of a video super-resolution method provided according to Embodiment 1 of the present application;
Figure 2 is a schematic structural diagram of the VSTL layer in a video super-resolution method provided according to Embodiment 1 of the present application;
Figure 3 is a schematic diagram of the principle of a video super-resolution method provided according to Embodiment 1 of the present application;
Figure 4 is a schematic structural diagram of a video super-resolution apparatus provided according to Embodiment 2 of the present application;
Figure 5 is a schematic structural diagram of an electronic device that implements the video super-resolution method according to an embodiment of the present application.
Embodiments of the Invention
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", "original", "target", etc. in the description and claims of this application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units need not be limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or apparatus.
Embodiment 1
Figure 1 is a flow chart of a video super-resolution method provided in Embodiment 1 of the present application. This embodiment is applicable to super-resolution processing of video images. The method can be executed by a video super-resolution device, which can be implemented in the form of hardware and/or software and can be configured in computer equipment. As shown in Figure 1, the method includes:
S110. Obtain an original video frame sequence containing at least two original video frames.
In practical applications, when a super-resolution operation needs to be performed on video images, a long video can be cropped into video clips containing a certain number of video frames. In this embodiment, the set of video frames contained in each video clip before the super-resolution operation can be called an original video frame sequence, and the multiple consecutive video frames within an original video frame sequence can be called original video frames.
S120. Use the pre-trained target multi-scale video super-resolution model to perform a video super-resolution operation on the original video frame sequence, and output the target video frame sequence containing the target video frames; the target multi-scale video super-resolution model includes a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each original video frame, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
In this embodiment, a multi-scale video super-resolution model can be built in advance and trained with a large amount of training data to obtain the target multi-scale video super-resolution model. When performing super-resolution processing on the original video frame sequence, the sequence can be fed as input data into the target multi-scale video super-resolution model, which performs information interaction and feature fusion at different scales on each original video frame in the sequence; after the feature map of each original video frame is reconstructed, the corresponding target video frame is obtained, and the target video frames arranged in order form the super-resolved target video frame sequence.
Optionally, the multi-scale feature interaction network can be composed of an initial feature interaction module, at least one multi-scale feature interaction module, and a terminal feature interaction module connected in series; the initial feature interaction module can be used to perform feature interaction on each original video frame; the multi-scale feature interaction module can be used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the previous multi-scale feature interaction module; and the terminal feature interaction module can be used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
In practical applications, the number of multi-scale feature interaction modules can be set according to the actual application scenario, for example 2, 3 or 4. When there are two multi-scale feature interaction modules, the original video frame sequence is input to the target multi-scale video super-resolution model; the initial feature interaction module performs feature interaction across all original video frames in the sequence and, for each original video frame, outputs a feature map that has interacted with the other original video frames; the first multi-scale feature interaction module takes the feature maps output by the initial feature interaction module, performs multi-scale feature interaction and feature fusion on them, and passes the result to the second multi-scale feature interaction module; the second module repeats multi-scale feature interaction and feature fusion and passes its output to the terminal feature interaction module, which performs a final round of feature interaction on all of the feature maps.
Further, the multi-scale feature interaction module can include at least two RVSTB units and one feature fusion unit; each feature map input to the module is down-sampled at a preset sampling rate, the down-sampled feature map undergoes feature interaction in an RVSTB unit, and the result is up-sampled back at the same preset rate after the interaction; the feature fusion unit performs feature fusion on all of the output feature maps that correspond to the same input feature map across the RVSTB units.
In this embodiment, the RVSTB unit may include at least two VSTL layers and one convolutional layer. The VSTL layer can use a shift window mechanism and an attention mechanism to implement feature interaction.
In practical applications, the RVSTB (Residual Video Swin Transformer Block) unit can model the correlation between different video frames and support continuous information interaction across multiple video frames. The number of RVSTB units in the multi-scale feature interaction module can be set according to the actual application scenario and can match the number of preset sampling rates: with 3 preset sampling rates for the feature maps, the module contains 3 RVSTB units; with 4 preset sampling rates, it contains 4 RVSTB units.
Likewise, the number of VSTL layers in an RVSTB unit can be set according to the actual application scenario; it reflects the depth of feature interaction. It can be understood that, within a certain range, the more VSTL layers there are, the deeper the information interaction between feature maps.
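Under those conventions, an RVSTB unit could be sketched as a stack of VSTL layers (the VSTL layer itself is sketched after Figure 2 below) followed by one convolution, with a residual connection over the whole block. This is a plausible reading of "Residual Video Swin Transformer Block", not a specification from the patent.

```python
import torch.nn as nn

class RVSTB(nn.Module):
    """A stack of VSTL layers plus one convolution, wrapped in a residual."""
    def __init__(self, channels=64, depth=6, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            VSTL(channels, num_heads, shift=(i % 2 == 1))  # alternate shifted windows
            for i in range(depth)
        )
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, x):                     # x: (B, C, T, H, W)
        y = x
        for layer in self.layers:
            y = layer(y)                      # deeper stacks -> deeper interaction
        return x + self.conv(y)               # residual over the whole block
```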
Specifically, a VSTL layer may consist of LayerNorm, multi-head self-attention (MSA), an MLP, and residual connections. This structure is adapted from the standard multi-head self-attention used in the original Transformer layer; the main differences from the original Transformer network are the local attention mechanism and the shifted-window mechanism. Figure 2 is a schematic structural diagram of the VSTL layer in the video super-resolution method provided in Embodiment 1 of the present application. As shown in Figure 2, the input has size T x H x W x C, where T is the number of input frames, H the height of each input picture, W its width, and C its number of channels (3 by default). The Video Swin Transformer first uses a Conv3d layer to partition the input into non-overlapping windows of dimension N x M x M, and self-attention is computed independently within each window.
The feature of one window can be written as a set of point vectors of total dimension N x M² x C: each point vector has dimension C, and the window contains N x M x M pixels. Multiplying this feature by three learnable projection matrices yields the corresponding query (Q), key (K) and value (V). The corresponding attention matrix can then be computed as:

Attention(Q, K, V) = SoftMax(QK^T/√d)V,
where QK^T/√d measures the similarity between the query and the key (d being the dimension of the query and key vectors): every point in the query is compared with every point in the key. Multiplying the result by V then yields new features that fuse information from the other frames as well as from the frame itself.
The above attention operation is performed h times in parallel (multi-head attention), and the h results are then concatenated.
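A sketch of this per-window multi-head attention in plain PyTorch; shapes follow the N x M² x C description above, and the relative-position bias used by the original Video Swin Transformer is omitted here for brevity.

```python
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention over the tokens of one window."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5    # the 1/sqrt(d) factor
        self.qkv = nn.Linear(dim, dim * 3)    # the three learnable projections
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (num_windows, N*M*M, C)
        B, L, C = x.shape
        q, k, v = (self.qkv(x)
                   .reshape(B, L, 3, self.num_heads, self.head_dim)
                   .permute(2, 0, 3, 1, 4))   # each: (B, heads, L, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # every point vs. every point
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, C)  # concat the h heads
        return self.proj(out)
```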
The MLP may contain two fully connected (FC) layers with a GELU activation between them. A LayerNorm layer is applied before both the MSA and the MLP, and a residual connection follows each of the MSA and MLP operations. The whole process can be expressed as:
X = MSA(LN(X)) + X,
X = MLP(LN(X)) + X,
The operations described above are all performed inside the locally partitioned windows; to exchange information between different windows, the shifted-window mechanism is used.
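Putting the pieces together, one VSTL layer might look as follows. The window partition assumes T, H and W are divisible by the window size, the shifted-window variant is approximated with torch.roll, and the window size and shift amounts are illustrative rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class VSTL(nn.Module):
    """LN -> window MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim, num_heads=4, shift=False, window=(2, 8, 8)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = WindowAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # two FC layers with a GELU between
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.shift, self.window = shift, window

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = x.permute(0, 2, 3, 4, 1)           # to (B, T, H, W, C) for windowing
        if self.shift:                         # roll so windows straddle old borders
            x = torch.roll(x, shifts=(-1, -4, -4), dims=(1, 2, 3))
        B, T, H, W, C = x.shape
        n, m1, m2 = self.window
        wins = (x.reshape(B, T // n, n, H // m1, m1, W // m2, m2, C)
                 .permute(0, 1, 3, 5, 2, 4, 6, 7)
                 .reshape(-1, n * m1 * m2, C)) # non-overlapping n x m1 x m2 windows
        wins = wins + self.attn(self.norm1(wins))   # X = MSA(LN(X)) + X
        wins = wins + self.mlp(self.norm2(wins))    # X = MLP(LN(X)) + X
        x = (wins.reshape(B, T // n, H // m1, W // m2, n, m1, m2, C)
                 .permute(0, 1, 4, 2, 5, 3, 6, 7)
                 .reshape(B, T, H, W, C))      # undo the window partition
        if self.shift:
            x = torch.roll(x, shifts=(1, 4, 4), dims=(1, 2, 3))
        return x.permute(0, 4, 1, 2, 3)        # back to (B, C, T, H, W)
```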
Optionally, the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module. The feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image; the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
Specifically, the feature maps output by the feature interaction network are reconstructed into a reconstructed image, while interpolation-based upscaling is applied to the original video frame to form an interpolated image; fusing the reconstructed image and the interpolated image corresponding to the same original video frame yields the super-resolved target video frame.
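A sketch of this reconstruction path for a single frame, assuming pixel-shuffle upscaling for the feature branch and bicubic interpolation for the skip branch; both are common choices that the patent does not fix.

```python
import torch.nn as nn
import torch.nn.functional as F

class Reconstruction(nn.Module):
    """Reconstructed image from features + interpolation-upscaled input frame."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.up = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),            # (B, 3*s*s, H, W) -> (B, 3, sH, sW)
        )

    def forward(self, feat, lr_frame):         # feat: (B, C, H, W); lr_frame: (B, 3, H, W)
        recon = self.up(feat)                  # feature image reconstruction
        interp = F.interpolate(lr_frame, scale_factor=self.scale,
                               mode="bicubic", align_corners=False)  # interpolated image
        return recon + interp                  # image fusion -> target video frame
```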
Illustratively, Figure 3 is a schematic diagram of the principle of the video super-resolution method provided in Embodiment 1 of the present application. As shown in Figure 3, the acquired original video frame sequence contains 5 original video frames. The multi-scale feature interaction network in the target multi-scale video super-resolution model consists of 1 initial feature interaction module, 3 multi-scale feature interaction modules and 1 terminal feature interaction module connected in series, where the initial feature interaction module contains 1 RVSTB unit, each multi-scale feature interaction module contains 3 RVSTB units and 1 feature fusion unit, and the terminal feature interaction module contains 1 RVSTB unit. The preset sampling rates are 1x, 2x and 4x; each RVSTB unit contains 6 VSTL layers and 1 convolutional layer, and each feature fusion unit contains 1 convolutional layer. The 5 consecutive input frames first pass through one RVSTB module, and the output feature maps then pass through the 3 multi-scale feature interaction modules. Each multi-scale feature interaction module processes feature maps of 3 different sizes with its RVSTB units; of the 3 processed feature maps, the 2 smaller ones are first resized to the same size as the main feature map, the 3 feature maps are then concatenated, and a Conv3D layer aggregates their information. After the 3 multi-scale feature interaction modules, one more RVSTB module further fuses the information of the different feature maps. Finally, a feature reconstruction module is applied, and its output is added to the interpolation-upscaled input pictures to obtain the final multi-frame super-resolution result.
In the technical solution of the embodiments of this application, an original video frame sequence containing at least two original video frames is acquired, a pre-trained target multi-scale video super-resolution model performs a video super-resolution operation on the sequence, and a target video frame sequence containing target video frames is output. The target multi-scale video super-resolution model contains a multi-scale feature interaction network, which performs multi-scale feature interaction and feature fusion on the original video frames, and an image reconstruction network, which reconstructs images from the feature maps output by the feature interaction network. This addresses the problem that existing video super-resolution methods rely on video frame alignment to fuse information between different frames, which makes the computation complex and expensive. By building a multi-scale video super-resolution model that turns video frames into feature maps of different sizes for information interaction, feature information of different granularities can be attended to, reducing the amount of computation and improving model performance.
On the basis of the above solution, the training process of the target multi-scale video super-resolution model in this embodiment may include:
A1. Obtain a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences.
Specifically, training the model requires a large amount of training data; each set of training data may include a low-resolution video frame sequence and a corresponding standard high-resolution video frame sequence. The low-resolution sequence contains a certain number of low-resolution video frames and the standard high-resolution sequence a certain number of high-resolution video frames; the low-resolution frames can be obtained by down-sampling the high-resolution frames.
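For illustration, one common way to build such a pair is bicubic down-sampling of the high-resolution sequence; this is only one of the down-sampling protocols the text allows.

```python
import torch.nn.functional as F

def make_training_pair(hr_seq, scale=4):
    """Derive the low-resolution input from the standard high-resolution sequence."""
    B, T, C, H, W = hr_seq.shape
    lr = F.interpolate(hr_seq.reshape(B * T, C, H, W),
                       scale_factor=1 / scale, mode="bicubic",
                       align_corners=False)    # down-sample each frame
    return lr.reshape(B, T, C, H // scale, W // scale), hr_seq
```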
A2. Input the low-resolution video frame sequence into the multi-scale video super-resolution model to be trained, and obtain the actual high-resolution video frame sequence it outputs.
In this embodiment, the multi-scale video super-resolution model to be trained can be built in advance, consisting of a multi-scale feature interaction network to be trained and an image reconstruction network to be trained. The multi-scale feature interaction network to be trained may be composed of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series; each multi-scale feature interaction module to be trained may include at least two RVSTB units and one feature fusion unit; and each RVSTB unit to be trained may include at least two VSTL layers and one convolutional layer.
Specifically, taking the low-resolution video frame sequence as input to the constructed multi-scale video super-resolution model to be trained yields the actual high-resolution video frame sequence as output.
A3. Obtain a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence.
Specifically, since the standard high-resolution video frame sequence is an actually existing high-resolution sequence while the actual high-resolution sequence is computed by a model whose training is not yet complete, there is necessarily some error between the two. A fitting loss function can be formed from this error to drive the training and parameter tuning of the multi-scale video super-resolution model to be trained.
A4. Back-propagate through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
Specifically, after the fitting loss function is obtained, it can be used to back-propagate through the multi-scale video super-resolution model to be trained, continuously adjusting the model parameters until the target multi-scale video super-resolution model is finally obtained.
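Steps A1 to A4 amount to an ordinary supervised training loop. A sketch follows, in which MultiScaleVSRModel stands in for the assembled networks above and train_loader yields (low-resolution, high-resolution) pairs from the data set of A1; the L1 fitting loss and the Adam optimizer are illustrative choices not prescribed by the patent.

```python
import torch
import torch.nn as nn

model = MultiScaleVSRModel()                   # hypothetical full model (A2's network)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.L1Loss()                        # one plausible fitting loss

for lr_seq, hr_seq in train_loader:            # A1: training pairs
    sr_seq = model(lr_seq)                     # A2: actual high-resolution output
    loss = criterion(sr_seq, hr_seq)           # A3: fitting loss vs. the standard sequence
    optimizer.zero_grad()
    loss.backward()                            # A4: back-propagation
    optimizer.step()
```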
Embodiment 2
Figure 4 is a schematic structural diagram of a video super-resolution apparatus provided in Embodiment 2 of the present application. As shown in Figure 4, the apparatus includes:
a data acquisition module 210, configured to acquire an original video frame sequence containing at least two original video frames; and
a video super-resolution module 220, configured to perform a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model and output a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on the feature maps output by the feature interaction network.
Optionally, the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
the initial feature interaction module is used to perform feature interaction on each of the original video frames;
the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
Optionally, the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
Optionally, the RVSTB unit includes at least two VSTL layers and one convolutional layer.
Optionally, the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
Optionally, the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
the feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image;
the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and
the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
Optionally, the training steps of the target multi-scale video super-resolution model include:
obtaining a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences;
inputting the low-resolution video frame sequence into a multi-scale video super-resolution model to be trained, and obtaining an actual high-resolution video frame sequence as output;
obtaining a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence; and
back-propagating through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
The video super-resolution apparatus provided in the embodiments of this application can execute the video super-resolution method provided in any embodiment of this application, and has the functional modules and beneficial effects corresponding to the executed method.
Embodiment 3
Figure 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smartphones, wearable devices (such as helmets, glasses and watches) and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are only examples and are not intended to limit the implementation of the present application described and/or claimed herein.
As shown in Figure 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, the memory storing a computer program executable by the at least one processor. The processor 11 can perform various appropriate actions and processing according to the computer program stored in the read-only memory (ROM) 12 or a computer program loaded from a storage unit 18 into the random access memory (RAM) 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to one another via a bus 14; an input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard or a mouse; an output unit 17 such as various types of displays and speakers; the storage unit 18 such as a magnetic disk or an optical disc; and a communication unit 19 such as a network card, a modem or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller or microcontroller. The processor 11 executes the methods and processes described above, such as the video super-resolution method.
In some embodiments, the video super-resolution method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the video super-resolution method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the video super-resolution method in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on a chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input apparatus and at least one output apparatus.
Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to the processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that when executed by the processor, the computer programs cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of this application, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by, or in combination with, an instruction execution system, apparatus or device. The computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (such as visual feedback, auditory feedback or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network and the Internet.
A computing system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship is created by computer programs that run on the corresponding computers and have a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the defects of difficult management and weak business scalability found in traditional physical hosts and VPS services.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added or deleted. For example, the steps described in this application may be executed in parallel, sequentially or in a different order, as long as the results desired by the technical solution of this application can be achieved; no limitation is imposed herein.
The specific implementations described above do not limit the scope of protection of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions are possible depending on design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (20)

  1. A video super-resolution method, comprising:
    acquiring an original video frame sequence containing at least two original video frames; and
    performing a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model, and outputting a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  2. The method according to claim 1, wherein
    the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
    the initial feature interaction module is used to perform feature interaction on each of the original video frames;
    the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
    the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  3. The method according to claim 2, wherein
    the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
    the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
    the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
  4. The method according to claim 3, wherein
    the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  5. The method according to claim 4, wherein
    the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
  6. The method according to claim 1, wherein
    the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
    the feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image;
    the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and
    the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
  7. The method according to claim 1, wherein the training steps of the target multi-scale video super-resolution model comprise:
    obtaining a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences;
    inputting the low-resolution video frame sequence into a multi-scale video super-resolution model to be trained, and obtaining an actual high-resolution video frame sequence as output;
    obtaining a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence; and
    back-propagating through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
  8. A video super-resolution apparatus, comprising:
    a data acquisition module, configured to acquire an original video frame sequence containing at least two original video frames; and
    a video super-resolution module, configured to perform a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model and output a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the following steps:
    acquiring an original video frame sequence containing at least two original video frames; and
    performing a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model, and outputting a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  10. The electronic device according to claim 9, wherein
    the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
    the initial feature interaction module is used to perform feature interaction on each of the original video frames;
    the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
    the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  11. The electronic device according to claim 10, wherein
    the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
    the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
    the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
  12. The electronic device according to claim 11, wherein
    the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  13. The electronic device according to claim 12, wherein
    the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
  14. The electronic device according to claim 9, wherein
    the image reconstruction network includes a feature image reconstruction module, an interpolated image construction module and an image fusion module;
    the feature image reconstruction module is used to perform feature reconstruction on the feature maps output by the feature interaction network to form a reconstructed image;
    the interpolated image construction module is used to perform image interpolation on the original video frame to form an interpolated image; and
    the image fusion module is used to fuse the reconstructed image and the interpolated image to form the target video frame.
  15. The electronic device according to claim 9, wherein the training steps of the target multi-scale video super-resolution model comprise:
    obtaining a training data set containing low-resolution video frame sequences and corresponding standard high-resolution video frame sequences;
    inputting the low-resolution video frame sequence into a multi-scale video super-resolution model to be trained, and obtaining an actual high-resolution video frame sequence as output;
    obtaining a fitting loss function from the standard high-resolution video frame sequence and the actual high-resolution video frame sequence; and
    back-propagating through the multi-scale video super-resolution model to be trained using the fitting loss function to obtain the target multi-scale video super-resolution model.
  16. A computer-readable storage medium storing computer instructions which, when executed by a processor, cause the following steps to be implemented:
    acquiring an original video frame sequence containing at least two original video frames; and
    performing a video super-resolution operation on the original video frame sequence using a pre-trained target multi-scale video super-resolution model, and outputting a target video frame sequence containing target video frames, wherein the target multi-scale video super-resolution model contains a multi-scale feature interaction network and an image reconstruction network, the multi-scale feature interaction network is used to perform multi-scale feature interaction and feature fusion on each of the original video frames, and the image reconstruction network performs image reconstruction based on feature maps output by the feature interaction network.
  17. The computer-readable storage medium according to claim 16, wherein
    the multi-scale feature interaction network consists of one initial feature interaction module, at least one multi-scale feature interaction module and one terminal feature interaction module connected in series;
    the initial feature interaction module is used to perform feature interaction on each of the original video frames;
    the multi-scale feature interaction module is used to perform multi-scale feature interaction and feature fusion on all feature maps output by the initial feature interaction module or by the preceding multi-scale feature interaction module; and
    the terminal feature interaction module is used to perform feature interaction on all feature maps output by the last multi-scale feature interaction module.
  18. The computer-readable storage medium according to claim 17, wherein
    the multi-scale feature interaction module includes at least two RVSTB units and one feature fusion unit;
    the feature map input to the multi-scale feature interaction module is down-sampled at preset sampling rates, the RVSTB units perform feature interaction on the down-sampled feature maps, and the results are up-sampled back at the preset sampling rates after the feature interaction; and
    the feature fusion unit is used to perform feature fusion on all output feature maps that the RVSTB units produce for the same input feature map.
  19. The computer-readable storage medium according to claim 18, wherein
    the RVSTB unit includes at least two VSTL layers and one convolutional layer.
  20. The computer-readable storage medium according to claim 19, wherein
    the VSTL layer uses a shifted-window mechanism and an attention mechanism to realize feature interaction.
PCT/CN2023/080945 2022-03-22 2023-03-10 Video super resolution method, apparatus, device, and storage medium WO2023179385A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210286954.8A CN116862762A 2022-03-22 2022-03-22 Video super-resolution method, device, equipment and storage medium
CN202210286954.8 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023179385A1 true WO2023179385A1 (en) 2023-09-28

Family

ID=88099817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/080945 WO2023179385A1 (en) 2022-03-22 2023-03-10 Video super resolution method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN116862762A (en)
WO (1) WO2023179385A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210097648A1 (en) * 2019-09-30 2021-04-01 Tsinghua University Multi-image-based image enhancement method and device
CN112070667A (en) * 2020-08-14 2020-12-11 西安理工大学 Multi-scale feature fusion video super-resolution reconstruction method
CN112419152A (en) * 2020-11-23 2021-02-26 中国科学院深圳先进技术研究院 Image super-resolution method and device, terminal equipment and storage medium
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094999A (en) * 2023-10-19 2023-11-21 南京航空航天大学 Cross-scale defect detection method
CN117094999B (en) * 2023-10-19 2023-12-22 南京航空航天大学 Cross-scale defect detection method

Also Published As

Publication number Publication date
CN116862762A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
JP7331171B2 (en) Methods and apparatus for training image recognition models, methods and apparatus for recognizing images, electronic devices, storage media, and computer programs
TWI756378B (en) System and method for deep learning image super resolution
Sun et al. Shufflemixer: An efficient convnet for image super-resolution
WO2022227886A1 (en) Method for generating super-resolution repair network model, and method and apparatus for image super-resolution repair
US20220207299A1 (en) Method and apparatus for building image enhancement model and for image enhancement
WO2022068451A1 (en) Style image generation method and apparatus, model training method and apparatus, device, and medium
WO2022042124A1 (en) Super-resolution image reconstruction method and apparatus, computer device, and storage medium
WO2023179385A1 (en) Video super resolution method, apparatus, device, and storage medium
CN117237197B (en) Image super-resolution method and device based on cross attention mechanism
WO2023143222A1 (en) Image processing method and apparatus, device, and storage medium
WO2022057868A1 (en) Image super-resolution method and electronic device
CN114494022B (en) Model training method, super-resolution reconstruction method, device, equipment and medium
CN114519667A (en) Image super-resolution reconstruction method and system
CN115209064A (en) Video synthesis method, device, equipment and storage medium
CN112418249A (en) Mask image generation method and device, electronic equipment and computer readable medium
WO2023197805A1 (en) Image processing method and apparatus, and storage medium and electronic device
WO2023125550A1 (en) Video frame repair method and apparatus, and device, storage medium and program product
WO2023179360A1 (en) Video processing method and apparatus, and electronic device and storage medium
WO2022213716A1 (en) Image format conversion method and apparatus, device, storage medium, and program product
CN116485654A (en) Lightweight single-image super-resolution reconstruction method combining convolutional neural network and transducer
WO2020000878A1 (en) Method and apparatus for generating image
WO2021213340A1 (en) Video resolution enhancement method and apparatus, storage medium, and electronic device
CN113240780B (en) Method and device for generating animation
WO2021218414A1 (en) Video enhancement method and apparatus, and electronic device and storage medium
CN117196959B (en) Self-attention-based infrared image super-resolution method, device and readable medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23773633

Country of ref document: EP

Kind code of ref document: A1