CN116757970A - Training method of video reconstruction model, video reconstruction method, device and equipment


Info

Publication number
CN116757970A
CN116757970A
Authority
CN
China
Prior art keywords
image
frame
video
denoising
noise
Prior art date
Legal status
Granted
Application number
CN202311046169.6A
Other languages
Chinese (zh)
Other versions
CN116757970B (en)
Inventor
蔡德
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311046169.6A
Publication of CN116757970A
Application granted
Publication of CN116757970B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a training method for a video reconstruction model, a video reconstruction method, an apparatus, and a device, and belongs to the technical field of computer vision. The method comprises the following steps: acquiring a first video and a second video with the same content, the resolution of the second video being higher than that of the first video; performing noise-adding processing on each frame of first image in the first video through a neural network model to obtain prediction noise information, and performing denoising processing on the prediction noise information to obtain a reconstructed image corresponding to that first image, the reconstructed image having the same resolution as the corresponding second image in the second video; and training the neural network model based on the second image and the reconstructed image corresponding to each frame of first image to obtain the video reconstruction model. Because the denoising processing preserves detail information in the image and improves the definition of the reconstructed image, the accuracy of the model is improved.

Description

Training method of video reconstruction model, video reconstruction method, device and equipment
Technical Field
The embodiments of the present application relate to the technical field of computer vision, and in particular to a training method for a video reconstruction model, a video reconstruction method, an apparatus, and a device.
Background
With the development of computer vision technology, the demand for high video resolution in fields such as medical imaging and film and television production keeps growing. In practice, limitations of video capture and production equipment often result in low-resolution video whose pictures lack definition and realism. In such cases, a lower-resolution video can be reconstructed into a higher-resolution video, and how to perform this reconstruction is a problem to be solved.
Disclosure of Invention
The present application provides a training method for a video reconstruction model, a video reconstruction method, an apparatus, and a device, which can train a video reconstruction model with higher accuracy and improve the quality of the reconstructed video.
In a first aspect, a method for training a video reconstruction model is provided, the method comprising: acquiring a first video and a second video with the same content, the resolution of the second video being higher than that of the first video; for any frame of first image in the first video, performing noise-adding processing on the first image through a neural network model to obtain prediction noise information, and performing denoising processing on the prediction noise information to obtain a reconstructed image corresponding to the first image, the reconstructed image having the same resolution as the second image corresponding to the first image in the second video; and training the neural network model based on the second image and the reconstructed image corresponding to each frame of first image to obtain a video reconstruction model, the video reconstruction model being used to reconstruct a video to be reconstructed into a target video whose resolution is higher than that of the video to be reconstructed.
In a second aspect, a video reconstruction method is provided, the method comprising: acquiring a video to be reconstructed; and performing noise-adding processing on the video to be reconstructed through a video reconstruction model to obtain reference noise information, and performing denoising processing on the reference noise information to obtain a target video. The video reconstruction model is trained according to the method of the first aspect, the content of the target video is the same as that of the video to be reconstructed, and the resolution of the target video is higher than that of the video to be reconstructed.
In a third aspect, a training apparatus for a video reconstruction model is provided, the apparatus comprising: an acquisition module, configured to acquire a first video and a second video with the same content, the resolution of the second video being higher than that of the first video; a noise-adding and denoising module, configured to perform noise-adding processing on any frame of first image in the first video through a neural network model to obtain prediction noise information, and to perform denoising processing on the prediction noise information to obtain a reconstructed image corresponding to the first image, the reconstructed image having the same resolution as the second image corresponding to the first image in the second video; and a training module, configured to train the neural network model based on the second image and the reconstructed image corresponding to each frame of first image to obtain a video reconstruction model, the video reconstruction model being used to reconstruct a video to be reconstructed into a target video whose resolution is higher than that of the video to be reconstructed.
In a possible implementation, the noise-adding and denoising module is configured to: determine a reference image of the first image from the first video through the neural network model; determine image features of the first image according to the reference image and the first image through the neural network model; and perform noise-adding processing on the image features of the first image through the neural network model to obtain the prediction noise information.
In a possible implementation, the noise-adding and denoising module is configured to: determine, through the neural network model, an image change feature based on the reference image and the first image, the image change feature characterizing the changes required to turn the reference image into the first image; perform feature extraction on the reference image through the neural network model to obtain image features of the reference image; and determine the image features of the first image based on the image features of the reference image and the image change feature through the neural network model.
In a possible implementation, the noise-adding and denoising module is configured to: perform feature extraction on the first image through the neural network model to obtain a first feature of the first image; perform change processing on the image features of the reference image based on the image change feature through the neural network model to obtain a second feature of the first image; and fuse the first feature and the second feature of the first image through the neural network model to obtain the image features of the first image.
In a possible implementation, the noise-adding processing is performed multiple times. The noise-adding and denoising module is configured to: determine the image features of the first image through the neural network model; perform the first noise-adding processing on the image features of the first image through the neural network model to obtain the feature of the first image after the first noise-adding processing; and, for each noise-adding processing other than the first, perform that noise-adding processing through the neural network model on the feature of the first image obtained after the preceding noise-adding processing, to obtain the feature of the first image after that noise-adding processing, the feature obtained after the last noise-adding processing being the prediction noise information.
In a possible implementation, the denoising processing is performed multiple times. The noise-adding and denoising module is configured to: perform the first denoising processing on the prediction noise information through the neural network model to obtain the feature of the first image after the first denoising processing; for each denoising processing other than the first, perform that denoising processing through the neural network model on the feature of the first image obtained after the preceding denoising processing, to obtain the feature of the first image after that denoising processing; and determine the reconstructed image corresponding to the first image based on the feature obtained after the last denoising processing through the neural network model.
In a possible implementation, the noise-adding and denoising module is configured to: acquire description information of the first image; and perform denoising processing on the prediction noise information based on the description information of the first image to obtain the reconstructed image corresponding to the first image.
In a possible implementation, the training module is configured to: for any frame of first image, determine an image loss corresponding to the first image based on the error between the second image corresponding to the first image and the reconstructed image; and train the neural network model based on the image losses corresponding to all frames of first image to obtain the video reconstruction model.
In a possible implementation, the training module is configured to: acquire labeled noise-adding data of each frame of first image, the labeled noise-adding data of a first image characterizing the noise added in the process of turning that first image into labeled noise information; acquire labeled denoising data of each frame of second image, the labeled denoising data of a second image characterizing the noise removed in the process of turning the labeled noise information into that second image; for any frame of first image, acquire the predicted noise-adding data added by the neural network model in the process of noise-adding the first image into the prediction noise information, and acquire the predicted denoising data removed by the neural network model in the process of denoising the prediction noise information into the reconstructed image; determine a first loss based on the labeled denoising data of each frame of second image, the labeled noise-adding data of each frame of first image, the predicted denoising data, and the predicted noise-adding data; and train the neural network model based on the first loss and the second image and reconstructed image corresponding to each frame of first image, to obtain the video reconstruction model.
In a possible implementation, the training module is configured to: determine a denoising data loss based on the labeled denoising data of each frame of second image and the predicted denoising data corresponding to each frame of first image; determine a noise-adding data loss based on the labeled noise-adding data of each frame of first image and the predicted noise-adding data corresponding to each frame of first image; and determine the first loss based on the denoising data loss and the noise-adding data loss, as in the sketch below.
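The application does not fix the form of these losses or how they are combined. The following is a minimal Python sketch, assuming mean-squared error for each term and an unweighted sum as the combination; all tensor names are hypothetical.

```python
import torch.nn.functional as F

def first_loss(pred_denoise, labeled_denoise, pred_noise_add, labeled_noise_add):
    """Hypothetical assembly of the 'first loss'. MSE per term and an
    unweighted sum are illustrative assumptions, not prescribed choices."""
    denoise_loss = F.mse_loss(pred_denoise, labeled_denoise)        # denoising data loss
    noise_add_loss = F.mse_loss(pred_noise_add, labeled_noise_add)  # noise-adding data loss
    return denoise_loss + noise_add_loss                            # first loss
```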
In a fourth aspect, a video reconstruction apparatus is provided, the apparatus comprising: an acquisition module, configured to acquire a video to be reconstructed; and a noise-adding and denoising module, configured to perform noise-adding processing on the video to be reconstructed through a video reconstruction model to obtain reference noise information, and to perform denoising processing on the reference noise information to obtain a target video. The video reconstruction model is trained according to the method of the first aspect, the content of the target video is the same as that of the video to be reconstructed, and the resolution of the target video is higher than that of the video to be reconstructed.
In a possible implementation, the noise-adding and denoising module is configured to: for any frame of image to be reconstructed in the video to be reconstructed, determine a reference image of the image to be reconstructed from the video to be reconstructed through the video reconstruction model; determine image features of the image to be reconstructed according to the reference image and the image to be reconstructed through the video reconstruction model; and perform noise-adding processing on the image features of the image to be reconstructed through the video reconstruction model to obtain the reference noise information.
In a possible implementation, the noise-adding and denoising module is configured to: for any frame of image to be reconstructed in the video to be reconstructed, acquire description information of the image to be reconstructed; and perform denoising processing on the reference noise information based on the description information of the image to be reconstructed to obtain a target image corresponding to the image to be reconstructed.
In a fifth aspect, an electronic device is provided, comprising a processor and a memory, the memory storing at least one computer program which is loaded and executed by the processor to cause the electronic device to implement the training method of the video reconstruction model described in the first aspect or the video reconstruction method described in the second aspect.
In a sixth aspect, a computer-readable storage medium is provided, storing at least one computer program which is loaded and executed by a processor to cause an electronic device to implement the training method of the video reconstruction model described in the first aspect or the video reconstruction method described in the second aspect.
In a seventh aspect, a computer program is provided, which is loaded and executed by a processor to cause an electronic device to implement the training method of the video reconstruction model described in the first aspect or the video reconstruction method described in the second aspect.
In an eighth aspect, a computer program product is provided, storing at least one computer program which is loaded and executed by a processor to cause an electronic device to implement the training method of the video reconstruction model described in the first aspect or the video reconstruction method described in the second aspect.
The technical solutions provided by the present application bring at least the following beneficial effects.
In the technical solutions provided by the present application, each frame of first image in the first video is subjected to noise-adding processing through the neural network model to obtain prediction noise information, and the prediction noise information is subjected to denoising processing to obtain the reconstructed image corresponding to each frame of first image, so that a lower-resolution first video is reconstructed into a higher-resolution video. Because the reconstructed image is obtained by denoising the prediction noise information, it is not restricted by the type, size, resolution, and the like of the first image, and the denoising processing preserves detail information in the image, so that the reconstructed image has higher definition. On this basis, training the neural network model with the second video and the reconstructed images of each frame optimizes the model in the direction of making the reconstructed video approach the second video, improving the accuracy, generality, and stability of the neural network model, so that the trained video reconstruction model can reconstruct videos of higher resolution with higher definition and quality.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an implementation environment of a training method of a video reconstruction model or a video reconstruction method according to an embodiment of the present application.
Fig. 2 is a flowchart of a training method of a video reconstruction model according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a U-network structure according to an embodiment of the present application.
Fig. 4 is a schematic diagram of processing an image feature according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a noise adding process and a noise removing process according to an embodiment of the present application.
Fig. 6 is a schematic diagram of adding and removing noise to an image feature according to an embodiment of the present application.
Fig. 7 is a flowchart of a video reconstruction method according to an embodiment of the present application.
Fig. 8 is a schematic diagram of an image reconstruction process according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a training device for a video reconstruction model according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a video reconstruction device according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a training method of a video reconstruction model or a video reconstruction method according to an embodiment of the present application, where, as shown in fig. 1, the implementation environment includes a terminal device 101 and a server 102. The training method of the video reconstruction model or the video reconstruction method in the embodiment of the present application may be performed by the terminal device 101, or may be performed by the server 102, or may be performed by the terminal device 101 and the server 102 together.
The terminal device 101 may be a smart phone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart television, a smart in-vehicle device, a smart voice interaction device, a smart home appliance, or the like. The server 102 may be a single server, a server cluster formed by a plurality of servers, a cloud computing platform, or a virtualization center, which is not limited in the embodiments of the present application. The server 102 may be communicatively connected to the terminal device 101 via a wired or wireless network, and may have functions such as data processing, data storage, and data transceiving, which are likewise not limited in the embodiments of the present application. The numbers of terminal devices 101 and servers 102 are not limited and may each be one or more.
Optional embodiments of the present application relate to the field of artificial intelligence (AI) technology. Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, pre-training model technology, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
An embodiment of the present application provides a training method for a video reconstruction model, which can be applied to the above implementation environment; it can train a video reconstruction model with higher accuracy and improve the quality of the reconstructed video. Taking the flowchart of the training method shown in fig. 2 as an example, and referring for convenience to the terminal device 101 or the server 102 that performs the method as an electronic device, the method may be performed by the electronic device. As shown in fig. 2, the method includes the following steps.
In step 201, a first video and a second video with the same content are acquired, and the resolution of the second video is higher than that of the first video.
In the embodiments of the present application, the electronic device may acquire the first video and the second video, where the content of the first video is the same as the content of the second video, but the resolution of the first video is lower than that of the second video. For convenience of description, the resolution of the first video is referred to as the first resolution, and the resolution of the second video is referred to as the second resolution.
The embodiments of the present application do not limit the manner in which the first video and the second video are acquired. For example, the first video may be obtained by photographing a subject with a video capture device of the first resolution, and the second video may be obtained by photographing the same subject with a video capture device of the second resolution. Since the same subject is photographed, the content of the first video is the same as that of the second video; the resolution of each video is that of the device which captured it, namely the first and second resolution respectively. Alternatively, the subject may be photographed with a video capture device of the second resolution to obtain the second video, and the second video may then be compressed to reduce its resolution, yielding the first video. Since the first video is obtained by compressing the second video, the two videos have the same content and the resolution of the first video is lower than that of the second video; a sketch of this compression route follows.
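As an illustration of the compression route, the following is a minimal sketch of constructing one low-resolution/high-resolution training pair from a high-resolution frame. OpenCV, bicubic interpolation, and the 4x scale factor are illustrative assumptions; the application does not prescribe a compression method.

```python
import cv2

def make_training_pair(second_image, scale=4):
    """Build a (first_image, second_image) pair by downscaling the
    high-resolution frame; scale factor and interpolation are assumptions."""
    h, w = second_image.shape[:2]
    first_image = cv2.resize(second_image, (w // scale, h // scale),
                             interpolation=cv2.INTER_CUBIC)
    return first_image, second_image

# Usage: pairs = [make_training_pair(frame) for frame in high_res_frames]
```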
The first video includes multiple frames of first images and the second video includes multiple frames of second images. Since the two videos have the same content, for any frame of first image in the first video there is a second image in the second video that has the same content as, and a higher resolution than, that first image; this second image is referred to as the second image corresponding to the first image.
It will be appreciated that any two frames of first image may be the same or different. For example, if the first video includes 10 frames of first images, the 1st and 10th frames may be the same while any two of the 2nd to 9th frames differ, and the 1st frame differs from each of the 2nd to 9th frames. The resolution of every first image in the first video is the first resolution, and the resolution of every second image in the second video is the second resolution. The resolution of an image indicates the number of pixels along its length and width; for example, 1600 x 1200 = 1,920,000, i.e. about 2 million, characterizes an image that is 1600 pixels long and 1200 pixels wide and contains approximately 2 million pixels.
In step 202, for any frame of first image in the first video, noise-adding processing is performed on the first image through the neural network model to obtain prediction noise information, and denoising processing is performed on the prediction noise information to obtain a reconstructed image corresponding to the first image, the resolution of the reconstructed image being the same as that of the second image corresponding to the first image in the second video.
In the embodiments of the present application, the electronic device may acquire the neural network model. The embodiments of the present application do not limit the structure, size, parameters, and the like of the neural network model. The neural network model includes a noise-adding network and a denoising network, with the denoising network connected in series after the noise-adding network. Optionally, the neural network model includes an encoder, a noise-adding network, a denoising network, and a decoder, where the number of encoders is at least one; it is understood that different numbers of encoders yield different model structures. Optionally, when there is one encoder, it may be connected in series before the noise-adding network; when there are two encoders, one may be connected in series before the noise-adding network and the other before the denoising network. Furthermore, the decoder is connected in series after the denoising network.
It should be noted that the structures, functions, and the like of the above-mentioned encoder, noise-adding network, denoising network, and decoder are described below and are not detailed here.
In the embodiments of the present application, the noise-adding network is used to perform noise-adding processing on any frame of first image to obtain the prediction noise information, and the denoising network is used to perform denoising processing on the prediction noise information to obtain the reconstructed image corresponding to that first image. The noise-adding process and the denoising process are described below in that order.
First, the procedure of performing noise-adding processing on any frame of first image through the noise-adding network is described.
The embodiments of the present application do not limit the structure, size, parameters, and the like of the noise-adding network. Illustratively, the noise-adding network comprises a plurality of first network blocks in series, where one first network block comprises at least one of a convolution layer, a deconvolution layer, an attention layer, a pooling layer, a normalization layer, an activation layer, and the like.
Optionally, one first network block has a U-network (U-Net) structure. Referring to fig. 3, fig. 3 is a schematic diagram of a U-network structure according to an embodiment of the present application; the U-network structure includes a downsampling portion and an upsampling portion connected in series after the downsampling portion.
The downsampling portion includes a plurality of attention layers in series. The input of the U-network structure is the input of the downsampling portion and also the input of the first attention layer in the downsampling portion. The input of any attention layer in the downsampling portion other than the first includes the output of the preceding attention layer. Briefly, the downsampling portion includes M attention layers in series (M being a positive integer), and the input of the m-th attention layer (m being any positive integer greater than 1 and less than or equal to M) includes the output of the (m-1)-th attention layer. The output of the downsampling portion includes the output of each of its attention layers. Each attention layer in the downsampling portion downsamples its input, which is a feature, based on an attention mechanism; downsampling reduces the feature dimension so that the feature concentrates on expressing effective information, improving its expressive capability.
The upsampling portion includes a plurality of attention layers in series, equal in number to those of the downsampling portion. The input of the upsampling portion includes the output of the downsampling portion, and the input of the first attention layer in the upsampling portion includes the output of the last attention layer in the downsampling portion. The input of any attention layer in the upsampling portion other than the first includes the output of the preceding attention layer in the upsampling portion and the output of the corresponding attention layer in the downsampling portion. Briefly, the upsampling portion and the downsampling portion each include M attention layers in series (M being a positive integer), and the input of the m-th attention layer in the upsampling portion (m being any positive integer greater than 1 and less than or equal to M) includes the output of the (m-1)-th attention layer in the upsampling portion and the output of the (M+1-m)-th attention layer in the downsampling portion. The output of the upsampling portion is the output of the U-network structure and includes the output of the last attention layer in the upsampling portion. Each attention layer in the upsampling portion upsamples its input, which is a feature, based on an attention mechanism; upsampling increases the feature dimension and amplifies the feature's effective information, improving its expressive capability.
It can be appreciated that the above U-network structure is only illustrative and can be flexibly adjusted to the application scenario; a code sketch of this wiring follows. For example, the attention layers of the downsampling portion may be replaced with convolution layers and the attention layers of the upsampling portion with deconvolution layers. Alternatively, the attention layers in both portions may be replaced with self-attention layers, multi-head attention layers, dilated convolution layers, or the like.
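The following is a minimal PyTorch sketch of the skip wiring described above, using the convolution/deconvolution substitution that the passage itself permits in place of attention layers; the channel width, depth, and activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UNetBlock(nn.Module):
    """Minimal sketch of the described U-network wiring, with the attention
    layers replaced by (de)convolution layers, a substitution the passage
    permits. Channel widths are illustrative."""

    def __init__(self, channels=64, depth=3):
        super().__init__()
        # Downsampling part: M layers in series, each halving resolution.
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(depth))
        # Upsampling part: M layers in series, each doubling resolution.
        # Layers after the first also receive a skip input, hence 2*channels.
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(channels if m == 0 else 2 * channels,
                               channels, 4, stride=2, padding=1)
            for m in range(depth))

    def forward(self, x):
        skips = []
        for layer in self.down:        # the output of every down layer is kept
            x = torch.relu(layer(x))
            skips.append(x)
        for m, layer in enumerate(self.up):
            if m > 0:                  # m-th up layer also consumes the
                x = torch.cat([x, skips[-(m + 1)]], dim=1)  # (M+1-m)-th down output
            x = torch.relu(layer(x))
        return x
```

For example, `UNetBlock()(torch.randn(1, 64, 32, 32))` returns a tensor of the same shape, with each upsampling layer after the first concatenated with the matching downsampling output, as described above.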
In the embodiments of the present application, the noise-adding network may be used to perform noise-adding processing on any frame of first image to obtain the prediction noise information. It will be appreciated that different noise-adding network structures add noise in different ways.
For example, the noise-adding network includes a feature mapping network and a first network block connected in series after it. The feature mapping network comprises at least one network layer such as a pooling layer, a convolution layer, an activation layer, or a fully connected layer; the features of any frame of first image can be extracted through the feature mapping network to obtain the image features of that first image. The first network block can determine noise-adding data and perform noise-adding processing on the first image by convolving the noise-adding data with the image features of the first image, thereby obtaining the prediction noise information.
As another example, the noise-adding network may be preceded by an encoder, through which the image features of any frame of first image may be determined according to the implementation of steps A1 to A2 below. Then, according to the implementation of steps B2 to B3 below, the image features of the first image are subjected to noise-adding processing through the noise-adding network to obtain the prediction noise information. Steps A1 to A2 and B2 to B3 are described below and are not detailed here.
In a possible implementation A, "performing noise-adding processing on any frame of first image through the neural network model to obtain prediction noise information" in step 202 includes steps A1 to A3 (not shown in the figure). For convenience of description, the implementation of each step is described below taking the i-th frame first image as the first image of any frame.
Step A1: determining a reference image of the first image of any frame from the first video through the neural network model.
In the embodiments of the present application, at least one frame of first image other than the i-th frame first image can be determined from the first video through the neural network model, and each determined frame is used as a reference image of the i-th frame first image.
The embodiments of the present application do not limit which frames serve as reference images of the i-th frame first image. For example, a specified first image in the first video (e.g., the first frame or the last frame) may be used as a reference image. Alternatively, at least one first image adjacent to the i-th frame (e.g., the (i-3)-th to (i-1)-th frames and/or the (i+1)-th to (i+3)-th frames) may be used as reference images, as in the selection sketch below.
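As a concrete illustration of the adjacent-frame option, the following is a minimal sketch of selecting reference frame indices; the symmetric radius of 3 mirrors the example above and is otherwise an illustrative assumption.

```python
def reference_indices(i, num_frames, radius=3):
    """Pick reference frames for the i-th first image: up to `radius`
    neighbours on each side, clipped to the video bounds."""
    return [j for j in range(i - radius, i + radius + 1)
            if j != i and 0 <= j < num_frames]

# Usage: reference_indices(5, 10) -> [2, 3, 4, 6, 7, 8]
```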
Step A2: determining the image features of the first image of any frame according to the reference images and the first image through the neural network model.
In the embodiments of the present application, the neural network model includes an encoder connected in series before the noise-adding network. The embodiments of the present application do not limit the structure, size, parameters, and the like of the encoder; for example, the encoder may be an Auto-Encoder (AE) or a Variational Auto-Encoder (VAE). The image features of the i-th frame first image can be obtained by encoding the i-th frame first image with the encoder according to each reference image of that first image.
It will be appreciated that encoders of different structures encode the first image in different ways, which is not limited in the embodiments of the present application.
Illustratively, the encoder includes a feature mapping network. On the one hand, each reference image is mapped to its corresponding features through the feature mapping network to obtain the image features of each reference image; this process is described in step A22 below and is not repeated here. On the other hand, the i-th frame first image is mapped to its corresponding features through the feature mapping network to obtain the first feature of the i-th frame first image; this process is described in step A23 below and is not repeated here. Then, a weighted calculation is performed on the image features of each reference image and the first feature of the i-th frame first image to obtain the image features of the i-th frame first image. Optionally, the weight of the first feature of the i-th frame first image is greater than the sum of the weights of the image features of the reference images; this ensures that the image features of the i-th frame first image focus on describing the i-th frame first image, as in the fusion sketch below.
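The following is a minimal sketch of such a weighted combination, written so that the weight of the first feature exceeds the sum of the reference weights; the per-reference weight value is an illustrative assumption.

```python
def fuse_weighted(first_feat, ref_feats, ref_weight=0.05):
    """Weighted fusion of the i-th frame's first feature with the image
    features of its reference images. Keeping the total reference weight
    below 0.5 guarantees the first feature's weight dominates, so the
    result stays focused on the i-th frame; 0.05 per reference is an
    illustrative assumption."""
    total_ref = ref_weight * len(ref_feats)
    assert total_ref < 0.5, "first-feature weight must exceed the reference sum"
    fused = (1.0 - total_ref) * first_feat
    for ref in ref_feats:
        fused = fused + ref_weight * ref
    return fused
```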
Alternatively, the image features of each reference image are extracted by the encoder based on an attention mechanism, so that the extracted features focus on the information common to the reference images, which reflects the subject content of the first video. The extracted features are then fused with the first feature of the i-th frame first image through the encoder to obtain the image features of the i-th frame first image, so that these image features represent both the subject content of the first video and the image content of the i-th frame first image, improving the feature representation capability.
Alternatively, the image features of the i-th frame first image may be determined according to steps A21 to A23 shown below, which are not detailed here.
In general, there is an association between the frames of first images in the first video. For example, when the first video is obtained by photographing an object, each frame of first image shows that object. As another example, when the first video is a dance video, limb movement is reflected by consecutive frames of first images. Encoding the i-th frame first image according to its reference images allows the resulting image features to reflect both the image content of the i-th frame first image and the association between the reference images and the i-th frame first image, improving the representation capability of those image features. When a high-resolution image corresponding to the i-th frame first image is later reconstructed from these image features, the high-resolution image is clearer and more realistic, that is, of higher quality.
Optionally, step A2 includes steps A21 to A23 (not shown in the figure).
Step A21: determining, through the neural network model, an image change feature based on the reference image and the first image of any frame, the image change feature characterizing the changes required to turn the reference image into the first image.
In the embodiments of the present application, the encoder includes a first encoding network, which may comprise network layers such as a convolution layer, a feedforward layer, and a normalization layer. For any frame of first image and any of its reference images, the first image can be mapped to its features through the first encoding network, and the reference image can be mapped to its corresponding features; optical flow calculation is then performed on the two sets of features to obtain an optical flow field between the first image and the reference image. Optical flow is the instantaneous velocity, on the imaging plane, of the pixel motion of a moving object in three-dimensional space, and is used to describe the movement of pixels in a video. Optical flow provides pixel displacement vectors between two frames, quantitatively describing the motion and change of objects in the video, while an optical flow field describes the change information of the optical flow.
In short, for any point of the object there is a corresponding pixel in the reference image and a corresponding pixel in the first image; since these two pixels correspond to the same point of the object, they correspond to each other. The optical flow field between the first image and the reference image describes the change information of the instantaneous velocity with which each pixel of the reference image moves to its corresponding pixel in the first image; that is, the optical flow field describes the changes required to turn the reference image into the first image. This optical flow field is the image change feature.
The embodiments of the present application do not limit how the optical flow is calculated. For example, the first encoding network may determine a stacking feature based on the features of the first image and of the reference image, the stacking feature characterizing the three-dimensional image formed by stacking the first image over the reference image, and obtain the optical flow field between the two images by convolving the stacking feature. Alternatively, the first encoding network may determine the features of each pixel in the first image from the features of the first image, and the features of each pixel in the reference image from the features of the reference image; the similarity between the features of a pixel in the first image and those of a pixel in the reference image is calculated, and if the similarity is greater than a threshold, the optical flow between the two pixels is determined based on their features. In this way, the optical flow between many pixels in the first image and in the reference image is determined, yielding the optical flow field between the two images; a crude sketch of this similarity-based variant follows.
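To make the similarity-based variant concrete, the following is a purely illustrative sketch: every pixel of the reference feature map is matched to its most similar pixel in the first image's feature map, and the displacement of matches above a threshold is taken as the flow vector. The exhaustive matching, cosine similarity, and threshold value are all assumptions; a real first encoding network would learn this mapping.

```python
import torch
import torch.nn.functional as F

def naive_flow(feat_first, feat_ref, threshold=0.8):
    """Nearest-match flow between (C, H, W) feature maps of the two frames.
    Returns an (H, W, 2) flow field in (dy, dx), zero where no match clears
    the threshold. Exhaustive matching is O((HW)^2) and illustrative only."""
    c, h, w = feat_ref.shape
    a = F.normalize(feat_ref.reshape(c, -1), dim=0)    # (C, HW), unit columns
    b = F.normalize(feat_first.reshape(c, -1), dim=0)  # (C, HW)
    sim = a.t() @ b                                    # (HW, HW) cosine similarities
    best, idx = sim.max(dim=1)                         # best match per reference pixel
    flow = torch.zeros(h * w, 2)
    src = torch.arange(h * w)
    keep = best > threshold
    flow[keep, 0] = (idx[keep] // w - src[keep] // w).float()  # dy
    flow[keep, 1] = (idx[keep] % w - src[keep] % w).float()    # dx
    return flow.reshape(h, w, 2)
```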
Step A22: performing feature extraction on the reference image through the neural network model to obtain the image features of the reference image.
In the embodiments of the present application, the encoder further includes a second encoding network, whose structure, size, parameters, and the like are not limited. The second encoding network may be a feature mapping network that maps any reference image to its image features. Alternatively, the second encoding network may be an auto-encoder or a variational auto-encoder, which may include a convolution layer; in that case the second encoding network convolves the reference image to obtain its image features. The image features of the reference image may describe its texture, color, content, style, and other information.
Step A23: determining the image features of the first image of any frame based on the image features of the reference image and the image change feature through the neural network model.
In the embodiments of the present application, the image features of the reference image describe the reference image, and the image change feature describes the changes that turn the reference image into the first image. Therefore, the encoder processes the image change feature together with the image features of the reference image, changing the latter according to the former to obtain changed features, and the image features of the first image are determined based on the changed features.
Optionally, the changed features may be used directly as the image features of the first image. Alternatively, step A23 includes: performing feature extraction on the first image through the neural network model to obtain the first feature of the first image; performing change processing on the image features of the reference image based on the image change feature through the neural network model to obtain the second feature of the first image; and fusing the first feature and the second feature of the first image through the neural network model to obtain the image features of the first image.
In the embodiments of the present application, on the one hand, feature extraction may be performed on the first image according to the implementation principle of step A22 to obtain its first feature, which is not repeated here. On the other hand, the encoder changes the image features of the reference image according to the image change feature to obtain the second feature. Then, any of a cross-multiplication calculation, a weighted summation, a weighted averaging, and the like is performed on the first feature and the second feature of the first image, and the result is used as the image features of the first image. In this way, the representation capability of the image features is improved.
It is understood that the first image corresponds to at least one frame of reference image. For each reference image, the image features of that reference image can be changed according to the image change feature between the first image and that reference image, yielding changed features. The first feature of the first image and the changed features corresponding to each reference image are then fused to obtain the image features of the first image.
Referring to fig. 4, fig. 4 is a schematic diagram of processing image features according to an embodiment of the present application. On the one hand, feature extraction is performed on the first image through the first encoding network to obtain the first feature of the first image. On the other hand, the image change feature is determined through the first encoding network based on the first image and the reference image. On yet another hand, feature extraction is performed on the reference image through the second encoding network to obtain the image features of the reference image. The image features of the reference image are then warped based on the image change feature to obtain the second feature of the first image. Finally, a cross-multiplication calculation is performed on the first feature and the second feature of the first image to obtain the image features of the first image; a sketch of this warp-and-fuse step follows.
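The following is a minimal PyTorch sketch of the warp-and-fuse step of Fig. 4. Bilinear `grid_sample` as the warping operator, the backward-warping flow convention (each output pixel stores its displacement into the reference), and element-wise multiplication as the cross-multiplication are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp_and_fuse(first_feat, ref_feat, flow):
    """Warp the reference image's features by the image change feature
    (optical flow) and fuse with the first image's features element-wise.

    first_feat, ref_feat: (1, C, H, W); flow: (1, H, W, 2) in pixels (dy, dx).
    """
    _, _, h, w = ref_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[0, ..., 1]) / (w - 1) * 2 - 1   # normalise to [-1, 1]
    grid_y = (ys + flow[0, ..., 0]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)
    second_feat = F.grid_sample(ref_feat, grid, align_corners=True)  # warped feature
    return first_feat * second_feat                      # element-wise fusion
```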
The image features of the first image describe information such as its texture, color, content, and style, and can be decoded by a decoding network to recover the first image.
Calculating the optical flow field between the first image and the reference image determines the changes required to turn the reference image into the first image, thereby realizing motion compensation for the video segment that starts at the reference image and ends at the first image. This is equivalent to analyzing and editing the reference image and the first image to generate the video segment, which improves the stability of the segment and reduces jitter. Determining the image features of the first image based on the optical flow field allows those features to describe the information of the video segment, improving their representation capability. When the high-resolution image corresponding to the first image is determined based on these image features, it can contain the information of the video segment, which enhances the stability of the high-resolution video, reduces its jitter, and makes it play more smoothly; that is, the high-resolution video has high quality and a good dynamic playing effect.
Step A3: performing noise-adding processing on the image features of the first image of any frame through the neural network model to obtain the prediction noise information.
In the embodiments of the present application, the implementation of step A3 is described in steps B2 to B3 below and is not repeated here. The image features of the first image are determined through the encoder, and multiple rounds of noise-adding processing are performed on them through the noise-adding network to obtain the prediction noise information. Extracting image features before noise-adding allows images of different sizes to be encoded into image features of the same dimension through the encoder, so that the neural network model is not limited by image size, enlarging its application scenarios.
In a possible implementation B, the noise-adding processing is performed a plurality of times. The "noise-adding the first image of any frame through the neural network model to obtain the prediction noise information" in step 202 includes steps B1 to B3 (not shown in the figure). For convenience of description, the implementation of each step is described below taking any frame of first image as the i-th frame of first image as an example.
And B1, determining the image characteristics of the first image of any frame through a neural network model.
In the embodiment of the present application, the implementation manner of the step B1 may be seen from the description of the step A1 to the step A2, which is not repeated here.
And B2, performing first noise adding processing on the image features of the first image of any frame through the neural network model to obtain the features of the first image of any frame obtained after the first noise adding processing.
In an embodiment of the present application, the noise-adding network includes a plurality of first network blocks, and the input of the noise-adding network includes the image features of the first image of the i-th frame. The image features of the first image of the i-th frame can be input into the first of the first network blocks, and the first noise-adding process is performed on them through that first network block, obtaining the features of the first image of the i-th frame after the first noise-adding process.
And step B3, for any noise-adding process other than the first, performing that noise-adding process through the neural network model on the features obtained after the previous noise-adding process of the first image of any frame, to obtain the features of the first image of that frame after that noise-adding process. The features obtained after the last noise-adding process of the first image of any frame are the prediction noise information.
In the embodiment of the application, the second noise-adding process is performed, through the second first network block, on the features obtained after the first noise-adding process of the first image of the i-th frame, obtaining the features after the second noise-adding process. Then, the third noise-adding process is performed, through the third first network block, on the features obtained after the second noise-adding process, obtaining the features after the third noise-adding process. And so on, until the features of the first image of the i-th frame obtained after the last noise-adding process are obtained; these features are the prediction noise information.
The procedure of the noise-adding processing shown in steps B2 to B3 is shown in fig. 5. The image features of the first image may be noted as $x_0$; by sequentially performing T noise-adding processes on $x_0$, features $x_1$ to $x_T$ are obtained. Wherein, $x_t$ characterizes the feature of the first image obtained after the t-th noise-adding process. $x_T$ characterizes the prediction noise information, which is a noise feature of arbitrary noise such as Gaussian noise, salt-and-pepper noise, or Poisson noise.
Optionally, the noise-adding network includes M (M is a positive integer) first network blocks connected in series, and the image features of the first image of the i-th (i is a positive integer) frame may be regarded as the features of the first image of the i-th frame obtained after the 0th noise-adding process. For the m-th (m is a positive integer greater than or equal to 1 and less than or equal to M) first network block, its input is the features obtained after the (m-1)-th noise-adding process of the first image of the i-th frame, and the m-th noise-adding process is performed on those features through the m-th first network block, obtaining the features of the first image of the i-th frame after the m-th noise-adding process. The features obtained after the M-th noise-adding process of the first image of the i-th frame are the prediction noise information.
Alternatively, a process of performing noise addition processing on the image feature of the first image a plurality of times through the noise addition network may be expressed as the following formula (1).
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (1)$$

Wherein, $x_0$ characterizes the feature of the first image obtained after the 0th noise-adding process, i.e., $x_0$ characterizes the image features of the first image; $x_t$ characterizes the feature of the first image obtained after the t-th noise-adding process; $x_{t-1}$ characterizes the feature of the first image obtained after the (t-1)-th noise-adding process; $x_{1:T}$ characterizes the features of the first image obtained after the 1st to the T-th noise-adding processes. $q(\cdot)$ characterizes the function symbol of the noise-adding processing function, with $x$ being a variable. $\prod$ characterizes the cumulative product symbol.

Formula (1) characterizes that after the image features of the first image are subjected to T noise-adding processes, the features of the first image obtained after the 1st to the T-th noise-adding processes are obtained in sequence. $q(x_t \mid x_{t-1})$ characterizes that the feature obtained after the (t-1)-th noise-adding process of the first image is subjected to the t-th noise-adding process, obtaining the feature after the t-th noise-adding process.
Alternatively, the feature obtained after the t-th noise addition processing of the first image satisfies the following formula (2).
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big) \qquad (2)$$

Wherein, $\mathcal{N}(\cdot)$ characterizes the function symbol of the normal distribution function. In general, the normal distribution function is $\mathcal{N}(0, I)$, where $I$ is a parameter of the normal distribution function. $\beta_t$ is a fixed variance parameter, the t-th variance parameter. Optionally, the variance parameters satisfy $0 < \beta_1 < \beta_2 < \cdots < \beta_T < 1$. In the embodiment of the application, formula (2) characterizes that the feature obtained after the t-th noise-adding process of the first image conforms to the normal distribution function $\mathcal{N}\big(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$.
It can be appreciated that the variance parameters corresponding to different times of the noise adding process are different, and the amplitude of the noise added to the feature in the process of the noise adding process can be controlled through the variance parameters.
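For concreteness, the sketch below applies formulas (1) and (2) step by step, drawing each $x_t$ from $\mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$. It is a minimal sketch assuming PyTorch; the linear variance schedule and its endpoints are illustrative assumptions.

```python
import torch

def make_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Variance parameters that increase with the step number t.
    return torch.linspace(beta_start, beta_end, T)

def noise_step(x_prev, beta_t):
    # One noise-adding process: sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    return (1.0 - beta_t).sqrt() * x_prev + beta_t.sqrt() * torch.randn_like(x_prev)

betas = make_beta_schedule(T=1000)
x = torch.randn(1, 4, 32, 32)    # stand-in for the image features of the first image (x_0)
for t in range(len(betas)):      # T successive noise-adding processes
    x = noise_step(x, betas[t])
# x now plays the role of the prediction noise information x_T
```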
In an exemplary embodiment, different first network blocks correspond to different noise-adding step numbers. Therefore, the step number of any noise-adding process can be spliced with the features obtained after the previous noise-adding process of the first image of any frame to obtain splicing information, and the first network block corresponding to that noise-adding process performs the noise-adding process on the splicing information, obtaining the features of the first image of that frame after that noise-adding process.

That is, the step number m is spliced with the features obtained after the (m-1)-th noise-adding process of the first image of the i-th frame to obtain splicing information, and the m-th noise-adding process is performed on the splicing information through the m-th first network block, obtaining the features of the first image of the i-th frame after the m-th noise-adding process.
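One way a first network block could receive the spliced step number is sketched below, assuming PyTorch; broadcasting the step number into an extra feature channel before concatenation, and the layer sizes, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StepConditionedBlock(nn.Module):
    # A network block whose input is the splicing of features and the step number m.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, feat, m):
        b, _, h, w = feat.shape
        # Broadcast the step number into one extra channel and splice it with the features.
        step = torch.full((b, 1, h, w), float(m), device=feat.device)
        spliced = torch.cat([feat, step], dim=1)   # splicing information
        return self.conv(spliced)                  # the m-th processing of the splicing information
```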
By means of steps B1 to B3, the prediction noise information can be determined. And denoising the predicted noise information through a denoising network to obtain a reconstructed image corresponding to the first image.
Next, the contents of denoising the predicted noise information by the denoising network will be described.
The embodiment of the application does not limit the structure, size, parameters, and the like of the denoising network. For example, the denoising network comprises a plurality of second network blocks connected in series, where one second network block comprises at least one network layer among a convolution layer, a deconvolution layer, an attention layer, a pooling layer, a normalization layer, an activation layer, and the like. Optionally, one second network block is a U network structure; the U network structure is shown in fig. 3 and is not described here again.
In the embodiment of the application, the denoising processing can be performed on the predicted noise information through the denoising network, so as to obtain the reconstructed image corresponding to the first image of any frame. It will be appreciated that the structure of the denoising network is different, and the manner of denoising is different.
For example, the denoising network includes a second network block and a feature mapping network connected in series after the second network block. The second network block can determine denoising data, and denoising the prediction noise information is achieved by performing convolution processing on the denoising data and the prediction noise information, so that image characteristics of the reconstructed image are obtained. Image features of the reconstructed image are mapped to the reconstructed image by a feature mapping network.
As another example, an encoder is connected in series before the denoising network, and a decoder is connected in series after the denoising network, and the description characteristic of the first image of any frame can be determined by the encoder according to the implementation manner of step C1 to step C2. Next, according to the implementation manner of steps D1 to D2 mentioned below, the prediction noise information is subjected to denoising processing through a denoising network, so as to obtain image characteristics of the reconstructed image. Thereafter, the image features of the reconstructed image are decoded into the reconstructed image by a decoder according to the implementation of step D3 mentioned below. The implementation manner of step C1 to step C2 and the implementation manner of step D1 to step D3 are correspondingly described below, and are not described in detail herein.
In a possible implementation C, the "denoising the prediction noise information to obtain a reconstructed image corresponding to the first image of any frame" in step 202 includes steps C1 to C2 (not shown in the figure).
And step C1, acquiring the description information of the first image of any frame.
In the embodiment of the application, the description information of the first image comprises at least one item of text in the first image (namely image text), semantic text for describing the semantic represented by the first image, category text for describing the image category of the first image, content text for describing the image content of the first image, style text for describing the image style of the first image and the like. The image text may include bullet screen text, line text, text contained in the subject itself, and the like. It may be appreciated that, since the first image may describe itself, the description information of the first image may include the first image or an image obtained by performing image processing such as cropping, compression, or the like on the first image.
The embodiment of the application does not limit the acquisition mode of the description information of the first image. For example, the electronic device may acquire the description information of the input first image, or the electronic device may call a tool, a program, software, a model, or the like, and analyze the first image to obtain the description information of the first image.
And step C2, denoising the prediction noise information based on the description information to obtain a reconstructed image corresponding to the first image of any frame.
In the embodiment of the application, the description information of the first image can be encoded by an encoder to obtain the description features of the first image. The embodiment of the application does not limit the structure, size, parameters, and the like of the encoder. The encoder may be a feature mapping network, with the description information mapped into the description features by the feature mapping network. Alternatively, the encoder may be an auto-encoder or a variational auto-encoder, or the like, which may include a convolution layer; the description information is convolved by the encoder to obtain the description features.
And then, denoising the prediction noise information for a plurality of times based on the description characteristics of the first image to obtain a reconstructed image corresponding to the first image of any frame. The multiple denoising process may be described in the implementation D, which is not described herein. The denoising processing is guided through the description features of the first image, so that the features obtained after the denoising processing can represent the content of the first image, and the reconstruction of the first image is realized.
In a possible implementation D, the denoising processing is performed a plurality of times. The "denoising the prediction noise information to obtain a reconstructed image corresponding to the first image of any frame" in step 202 includes steps D1 to D3 (not shown in the figure). For convenience of description, the implementation of each step is described below taking any frame of first image as the i-th frame of first image as an example.
And D1, carrying out first denoising processing on the predicted noise information through a neural network model to obtain the characteristics of the first image of any frame obtained after the first denoising processing.
In an embodiment of the present application, the denoising network includes a plurality of second network blocks connected in series. The prediction noise information can be input into the first of the second network blocks, and the first denoising process is performed on the prediction noise information through that second network block, obtaining the features of the first image of the i-th frame after the first denoising process.
And step D2, for any denoising process other than the first, performing that denoising process through the neural network model on the features obtained after the previous denoising process of the first image of any frame, to obtain the features of the first image of that frame after that denoising process.
In the embodiment of the application, the second denoising process is performed, through the second of the second network blocks, on the features obtained after the first denoising process of the first image of the i-th frame, obtaining the features after the second denoising process. Then, the third denoising process is performed, through the third of the second network blocks, on the features obtained after the second denoising process, obtaining the features after the third denoising process. And so on, until the features of the first image of the i-th frame obtained after the last denoising process are obtained. The features obtained after the last denoising process of the first image of the i-th frame are the image features of the reconstructed image corresponding to that first image.
The procedure of the denoising process shown in steps D1 to D2 is shown in fig. 5. The prediction noise information may be noted as $x_T$; by sequentially performing T denoising processes on $x_T$, features $x_{T-1}$ to $x_0$ are obtained. Wherein, $x_t$ characterizes the feature of the first image obtained after $(T-t)$ denoising processes.
Alternatively, the denoising network includes M (M is a positive integer) second network blocks connected in series, and the prediction noise information may be regarded as the features of the first image of the i-th frame obtained after the 0th denoising process. For the m-th (m is a positive integer greater than or equal to 1 and less than or equal to M) second network block, its input is the features obtained after the (m-1)-th denoising process of the first image of the i-th frame, and the m-th denoising process is performed on those features through the m-th second network block, obtaining the features of the first image of the i-th frame after the m-th denoising process. The features obtained after the M-th denoising process of the first image of the i-th frame are the image features of the reconstructed image corresponding to the first image of the i-th frame.
Alternatively, a process of performing denoising processing on the prediction noise information a plurality of times through the denoising network may be expressed as formula (3) shown below.
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \qquad (3)$$

Wherein, $x_0$ characterizes the feature of the first image obtained after the T-th denoising process, i.e., the image features of the reconstructed image corresponding to the first image; $x_{t-1}$ characterizes the feature of the first image obtained after the $(T-(t-1))$-th denoising process; $x_{0:T}$ characterizes the features of the first image obtained after the T-th to the 0th denoising processes. $p_\theta(\cdot)$ characterizes the function symbol of the denoising processing function, with $\theta$ denoting the model parameters. $\prod$ characterizes the cumulative product symbol. $x_T$ characterizes the prediction noise information.

$p_\theta(x_{0:T})$ characterizes that after the prediction noise information is subjected to T denoising processes, the features of the first image obtained after the 1st to the T-th denoising processes, $x_{T-1}$ to $x_0$, are obtained in sequence. $p_\theta(x_{t-1} \mid x_t)$ characterizes that the feature $x_t$ obtained after $(T-t)$ denoising processes of the first image is subjected to the $(T-t+1)$-th denoising process, obtaining the feature $x_{t-1}$ after the $(T-t+1)$-th denoising process. $p(x_T)$ characterizes the denoising of the prediction noise information $x_T$.
Alternatively, the feature obtained after the $(T-(t-1))$-th denoising process of the first image satisfies the following formula (4).

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big) \qquad (4)$$

Wherein, $\mathcal{N}(\cdot)$ characterizes the function symbol of the normal distribution function. In general, the normal distribution function is $\mathcal{N}(0, I)$, where $I$ is a parameter of the normal distribution function. $\mu_\theta(x_t, t)$ is the mean value of the distribution to which $x_{t-1}$ conforms, and $\sigma_t^2$ is the variance value of that distribution, which can be any set data. In the embodiment of the application, formula (4) characterizes that $x_{t-1}$ conforms to the normal distribution function $\mathcal{N}\big(\mu_\theta(x_t, t),\ \sigma_t^2 I\big)$.
In the embodiment of the application, for any denoising process, the denoising network can determine the denoising data corresponding to the denoising process, and denoising the features obtained after the last denoising process of the first image of the ith frame based on the denoising data. It will be appreciated that different numbers of denoising processes correspond to different denoised data.
Optionally,

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

Wherein, $\epsilon_\theta(x_t, t)$ is the denoising data corresponding to the $(T-t+1)$-th denoising process determined by the denoising network, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. When $t > 1$, $z$ satisfies the normal distribution function, i.e., $z \sim \mathcal{N}(0, I)$; when $t \le 1$, $z$ is equal to 0, i.e., $z = 0$. $x_T$ satisfies the normal distribution function, i.e., $x_T \sim \mathcal{N}(0, I)$. $t = T, \ldots, 1$. The meaning of the remaining parameters is described correspondingly herein and will not be detailed again.
Illustratively, the code of the denoising process is as follows.
1: $x_T \sim \mathcal{N}(0, I)$
2: for $t = T, \ldots, 1$ do
3:   $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$
4:   $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$
5: end for
6: return $x_0$
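A Python transcription of the pseudocode above is given for concreteness, assuming PyTorch; `eps_model` stands for the denoising network's prediction $\epsilon_\theta(x_t, t)$, and taking $\sigma_t = \sqrt{\beta_t}$ is one common choice rather than a requirement of the application.

```python
import torch

@torch.no_grad()
def denoise_loop(eps_model, betas, shape):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # x_T ~ N(0, I): the prediction noise information
    for t in range(len(betas), 0, -1):           # t = T, ..., 1
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        eps = eps_model(x, torch.tensor([t]))    # denoising data for this step
        x = (x - (1 - a_t) / (1 - ab_t).sqrt() * eps) / a_t.sqrt() \
            + betas[t - 1].sqrt() * z            # sigma_t = sqrt(beta_t), one common choice
    return x                                     # x_0: image features of the reconstructed image
```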
In an exemplary embodiment, different second network blocks correspond to different denoising step numbers. Therefore, the step number of any denoising process can be spliced with the features obtained after the previous denoising process of the first image of any frame to obtain splicing information, and the second network block corresponding to that denoising process performs the denoising process on the splicing information, obtaining the features of the first image of that frame after that denoising process.

That is, the step number m is spliced with the features obtained after the (m-1)-th denoising process of the first image of the i-th frame to obtain splicing information, and the m-th denoising process is performed on the splicing information through the m-th second network block, obtaining the features of the first image of the i-th frame after the m-th denoising process.
As mentioned above, the prediction noise information may be subjected to a plurality of denoising processes based on the descriptive characteristics of the first image. In the embodiment of the application, optionally, the prediction noise information and the description characteristic of the first image are spliced to obtain the spliced characteristic. Firstly, denoising the spliced features for the first time through a neural network model to obtain features of any frame of first image obtained after the first denoising process. And then, for any denoising process except the first denoising process, performing any denoising process on the characteristics obtained after the last denoising process of the any frame of first image in any denoising process through a neural network model to obtain the characteristics obtained after the any denoising process of the any frame of first image.
Or splicing the predicted noise information and the description characteristic of the first image to obtain a first splicing characteristic. Firstly, denoising the first spliced characteristic for the first time through a neural network model to obtain the characteristic of the first image of any frame obtained after the first denoising process. And then, for any denoising process except the first denoising process, splicing the description characteristic of the first image and the characteristic obtained after the last denoising process of any frame of the first image in any denoising process to obtain a second spliced characteristic, and performing any denoising process on the second spliced characteristic through a neural network model to obtain the characteristic obtained after any denoising process of any frame of the first image.
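The second variant, in which the description feature is re-spliced before every denoising process, might be sketched as follows, assuming PyTorch; `blocks` stands for the serial second network blocks, and channel-wise concatenation is an assumed realization of the splicing.

```python
import torch

def guided_denoise(blocks, noise_info, desc_feat):
    # First denoising process: splice the prediction noise information with the description feature.
    x = blocks[0](torch.cat([noise_info, desc_feat], dim=1))
    # Each subsequent denoising process re-splices the description feature.
    for block in blocks[1:]:
        x = block(torch.cat([x, desc_feat], dim=1))
    return x   # features of the reconstructed image after the last denoising process
```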
And D3, determining a reconstructed image corresponding to the first image of any frame based on the characteristics of the first image of any frame obtained after the last denoising treatment through the neural network model.
In the embodiment of the application, the characteristics obtained after the last denoising treatment of the first image of the ith frame are the image characteristics of the reconstructed image corresponding to the first image of the ith frame, and the content, the color, the texture, the style and other information of the reconstructed image are described through the image characteristics of the reconstructed image. Therefore, the image characteristics of the reconstructed image corresponding to the first image of the ith frame can be decoded by the decoder, so that the reconstructed image corresponding to the first image of the ith frame is obtained. The embodiment of the application does not limit the structure, the size, the parameters and the like of the decoder, and different decoders correspond to different decoding modes and are not described herein.
In general, an embodiment of the present application is shown in FIG. 6. And carrying out noise adding processing on the image characteristics of the low-resolution image through a noise adding network to obtain noise information, and carrying out noise removing processing on the noise information through a noise removing network to obtain the image characteristics of the high-resolution image. Wherein the image features of the low resolution image correspond to the image features of the first image mentioned above, the noise information corresponds to the prediction noise information mentioned above, and the image features of the high resolution image correspond to the image features of the reconstructed image corresponding to the first image mentioned above.
Step 203, training the neural network model based on the second image and the reconstructed image corresponding to the first image of each frame to obtain a video reconstruction model, wherein the video reconstruction model is used for reconstructing the video to be reconstructed to obtain a target video, and the resolution of the target video is higher than that of the video to be reconstructed.
In the embodiment of the application, the loss of the neural network model can be determined based on the second image corresponding to the first image of each frame and the reconstructed image corresponding to the first image of each frame. And training the neural network model through the loss of the neural network model to obtain the trained neural network model. And determining a video reconstruction model based on the trained neural network model.
Optionally, if the trained neural network model meets the training end condition, using the trained neural network model as a video reconstruction model. If the trained neural network model does not meet the training ending condition, the trained neural network model is used as a neural network model for the next training, the neural network model is trained for the next time according to the modes from the step 202 to the step 203 until the trained neural network model meets the training ending condition, and the trained neural network model is used as a video reconstruction model.
The embodiment of the application does not limit the manner in which the trained neural network model is determined to meet the training end condition. Illustratively, the trained neural network model satisfying the training end condition includes, but is not limited to, at least one of the following: the number of training iterations of the trained neural network model reaches a threshold; the model parameters of the trained neural network model are within a set range; the difference or ratio between the model parameters of the neural network model after training and those before training, or the logarithm, exponent, or the like of that difference or ratio, is within a set range.
In one possible implementation, step 203 includes steps 2031 to 2032 (not shown in the figures).
Step 2031, for any frame of the first image, determining an image loss corresponding to any frame of the first image based on an error between the second image corresponding to any frame of the first image and the reconstructed image.
In the embodiment of the application, the reconstructed image corresponding to the first image of any frame is obtained by reconstructing the first image, so that the reconstructed image corresponding to the first image of any frame has the same content as the first image. Since the second image corresponding to the first image of any frame is identical to the first image in content, the reconstructed image corresponding to the first image of any frame is identical to the second image corresponding to the first image of any frame in content. Based on this, an error between the second image corresponding to the first image of any frame and the reconstructed image corresponding to the first image can be calculated, and the accuracy of the neural network model can be measured by the error. The embodiment of the application does not limit the determination mode of the error between the second image and the reconstructed image.
Optionally, the resolution of the reconstructed image corresponding to any frame of first image is the same as the resolution of the second image corresponding to that frame of first image, so that for any pixel point in the second image there exists a corresponding pixel point in the reconstructed image. In brief, the pixel point of the i-th row and the j-th column in the reconstructed image corresponds to the pixel point of the i-th row and the j-th column in the second image. In the embodiment of the application, the difference or the ratio between the pixel value of any pixel point in the second image and that of its corresponding pixel point in the reconstructed image can be calculated to obtain the contrast information of the pixel point. The error between the second image and the reconstructed image is then obtained by calculating the sum, the average, or the variance of the contrast information of each pixel point.
Alternatively, the image features of the reconstructed image and the image features of the second image may be acquired. The manner of determining the image features of the reconstructed image has been described above, and the image features of the second image may be determined according to the manner of determining the image features of the reference image or of the first image, which is not repeated here. Then, the feature distance between the image features of the reconstructed image and those of the second image is calculated according to a distance calculation formula such as the Euclidean distance, cosine distance, or Manhattan distance, and the feature distance is taken as the error between the second image and the reconstructed image.
Next, an error between the second image corresponding to the first image of any frame and the reconstructed image is taken as an image loss corresponding to the first image of the frame. Or, calculating the square, the logarithm, the index or the like of the error between the second image corresponding to the first image and the reconstructed image, and obtaining the image loss corresponding to the first image of the frame.
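Both error measures described above can be written compactly. A minimal sketch assuming PyTorch tensors of equal resolution; using the squared difference as the per-pixel contrast information is one of the options named above.

```python
import torch

def pixel_error(second_img, recon_img):
    # Average of the per-pixel contrast information (squared differences here).
    return ((second_img - recon_img) ** 2).mean()

def feature_error(second_feat, recon_feat):
    # Euclidean feature distance between the two images' features.
    return torch.dist(second_feat.flatten(), recon_feat.flatten(), p=2)
```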
Step 2032, training the neural network model based on the image loss corresponding to the first image of each frame to obtain a video reconstruction model.
In the embodiment of the present application, according to the method of step 2031, the image loss corresponding to the first image of each frame may be calculated. Then, a second loss is determined based on the image loss corresponding to the first image of each frame, and optionally, the sum, the average, or the like of the image losses corresponding to the first image of each frame is used as the second loss. The second loss is taken as the loss of the neural network model, or the loss of the neural network model is determined according to the second loss and the first loss mentioned below. And then training the neural network model through the loss of the neural network model to obtain a video reconstruction model, wherein the training of the neural network model through the loss of the neural network model is described above, and the description is omitted here.
In another possible implementation, step 203 includes steps 2033 to 2037 (not shown in the figures).
Step 2033, obtaining the labeling noise-adding data of each frame of first image, where the labeling noise-adding data of any frame of first image characterizes the noise added in the process of noise-adding that frame of first image into the labeling noise information.

In the embodiment of the application, the image features of the first image of any frame can be subjected to noise-adding processing to obtain the labeling noise information, which is a noise feature. The noise-adding process is performed a plurality of times. The following describes the noise-adding process taking any frame of first image as the i-th frame of first image and any noise-adding process as the m-th noise-adding process as an example.
The image features of the first image of the i-th frame can be regarded as the features obtained after the 0th noise-adding process of the first image of the i-th frame. For the m-th noise-adding process, the labeling noise-adding data corresponding to the m-th noise-adding process is obtained, and the m-th noise-adding process is performed, based on the labeling noise-adding data, on the features obtained after the (m-1)-th noise-adding process of the first image of the i-th frame, obtaining the features of the first image of the i-th frame after the m-th noise-adding process.
The labeling noise-adding data corresponding to the m-th noise-adding process is obtained by sampling based on a statistical distribution function; on this basis, the labeling noise-adding data satisfies the statistical distribution corresponding to the statistical distribution function. The statistical distribution functions corresponding to any two noise-adding processes may be the same or different. Optionally, the statistical distribution function is a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance. The electronic device may obtain the mean and variance corresponding to the m-th noise-adding process (e.g., mean 0 and variance 1) to determine the statistical distribution function corresponding to the m-th noise-adding process, and sample randomly or equidistantly based on that statistical distribution function to obtain the labeling noise-adding data corresponding to the m-th noise-adding process.
It can be understood that, because the image features of the first image of any frame are subjected to multiple times of noise adding processing, the labeling noise adding data of the first image of any frame includes labeling noise adding data corresponding to each time of noise adding processing of the first image of the frame.
Step 2034, obtaining labeling denoising data of the second images of each frame, where the labeling denoising data of the second image of any frame represents noise removed in the process of denoising the labeling noise information into the second image of any frame.
In the embodiment of the application, the labeling noise information can be subjected to denoising processing to obtain the image features of the second image of any frame. The denoising process is performed a plurality of times. The following describes the denoising process taking any frame of second image as the i-th frame of second image and any denoising process as the m-th denoising process as an example.
The labeling noise information can be regarded as the features of the second image of the i-th frame obtained after the 0th denoising process. For the m-th denoising process, the labeling denoising data corresponding to the m-th denoising process is obtained, and the m-th denoising process is performed, based on the labeling denoising data, on the features obtained after the (m-1)-th denoising process of the second image of the i-th frame, obtaining the features of the second image of the i-th frame after the m-th denoising process.
The labeling denoising data corresponding to the m-th denoising process is obtained by sampling based on a statistical distribution function; on this basis, the labeling denoising data satisfies the statistical distribution corresponding to the statistical distribution function. The statistical distribution functions corresponding to any two denoising processes may be the same or different. Optionally, the statistical distribution function is a normal distribution $\mathcal{N}(\mu, \sigma^2)$. The electronic device may obtain the mean and variance corresponding to the m-th denoising process to determine the statistical distribution function corresponding to the m-th denoising process, and sample randomly or equidistantly based on that statistical distribution function to obtain the labeling denoising data corresponding to the m-th denoising process.
It can be understood that, because the labeling noise information is subjected to multiple denoising processes, the labeling denoising data of any frame of second image includes the labeling denoising data corresponding to each denoising process of that frame of second image.
Step 2035, for any frame of the first image, obtaining prediction noise-added data added in the process of obtaining prediction noise information by performing noise-adding processing on any frame of the first image through the neural network model, and obtaining prediction noise-removed data removed in the process of obtaining a reconstructed image by performing noise-removing processing on the prediction noise information through the neural network model.
As already mentioned above, the noise adding network of the neural network model may determine the predicted noise adding data of the mth noise adding process based on the feature of the first image of the ith frame obtained after the mth-1 th noise adding process and the number m of noise adding processes. Based on the prediction noise adding data, corresponding to each noise adding process, of the first image of any frame can be obtained by the electronic equipment.
Based on the same principle, the denoising network of the neural network model can determine prediction denoising data of the mth denoising process based on the features obtained after the mth-1 th denoising process of the first image of the ith frame and the times m of denoising processes. Based on the prediction denoising data, corresponding to each denoising process, of the first image of any frame can be obtained by the electronic equipment.
Step 2036, determining a first loss based on the labeling denoising data of the second image of each frame, the labeling noise-adding data of the first image of each frame, the predicted noise-adding data, and the predicted denoising data.
In the embodiment of the application, for any frame of first image, the noise loss of that frame of first image can be determined based on the labeling noise-adding data and the predicted noise-adding data of that frame of first image, the labeling denoising data of the second image corresponding to that frame of first image, and the predicted denoising data of that frame of first image. The sum, the average, or the like of the noise losses of the first images of each frame is taken as the first loss.
Optionally, step 2036 includes: determining noise adding data loss based on the labeling noise adding data of the first images of each frame and the prediction noise adding data corresponding to the first images of each frame; determining denoising data loss based on labeling denoising data of the second image of each frame and prediction denoising data corresponding to the first image of each frame; the first loss is determined based on the de-noised data loss and the noisy data loss.
First, for any frame of first image, a noise penalty for the frame of first image may be determined based on the annotation noise data for the frame of first image and the prediction noise data for the frame of first image. Optionally, the labeling and noise adding data of the first image includes labeling and noise adding data corresponding to each noise adding process of the first image, and the prediction and noise adding data of the first image includes prediction and noise adding data corresponding to each noise adding process of the first image. The difference value can be obtained by subtracting the predicted noise adding data of the first image of the ith frame in the t-th noise adding process from the marked noise adding data of the first image of the ith frame in the t-th noise adding process according to the following formula (5), and taking the square of the norm of the difference value as the noise adding loss of the first image of the ith frame in the t-th noise adding process.
$$L_t = \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|^2 \qquad (5)$$

Wherein, $L_t$ characterizes the noise-adding loss of the first image of the i-th frame in the t-th noise-adding process. $\epsilon$ characterizes the labeling noise-adding data of the first image of the i-th frame in the t-th noise-adding process. $\epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right)$ characterizes the predicted noise-adding data of the first image of the i-th frame in the t-th noise-adding process, where $\theta$ characterizes the model parameters of the neural network model. $\|\cdot\|^2$ characterizes the square of the norm of a variable. $x_0$ characterizes the image features of the first image of the i-th frame; the determination of $x_0$ is described above and will not be detailed here.
After the noise adding loss of the first image of any frame in any noise adding process is calculated, the sum or average value of the noise adding losses of the first image of any frame in each noise adding process can be used as the noise adding loss of the first image of the frame.
Then, for any frame of first image, the predicted denoising data corresponding to the frame of first image and the labeling denoising data of the second image corresponding to the frame of first image can be used for determining the denoising loss of the frame of first image. Optionally, the predicted denoising data corresponding to the first image includes predicted denoising data corresponding to each denoising process of the first image, the labeling denoising data of the second image includes labeling denoising data corresponding to each denoising process of the second image, and denoising loss of the first image in any denoising process can be determined according to the calculation principle of the formula (5) shown above. Then, the sum, average, or the like of the denoising losses of the first image of any frame in each denoising process is used as the denoising loss of the first image of the frame.
And then, for any frame of first image, carrying out weighted average calculation or weighted summation calculation on the noise adding loss of the frame of first image and the noise removing loss of the frame of first image to obtain the noise loss of the frame of first image. Then, the sum, average, or the like of noise losses of the first images of the frames is taken as a first loss.
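Putting the above together, the first loss can be sketched as follows, assuming the per-step noise-adding and denoising losses have already been computed for each frame; the equal weights are illustrative assumptions.

```python
def first_loss(noisy_losses, denoise_losses, w_add=0.5, w_remove=0.5):
    # noisy_losses / denoise_losses: one list per frame, each holding per-step losses.
    frame_losses = []
    for add_steps, remove_steps in zip(noisy_losses, denoise_losses):
        add_loss = sum(add_steps) / len(add_steps)           # noise-adding loss of this frame
        remove_loss = sum(remove_steps) / len(remove_steps)  # denoising loss of this frame
        frame_losses.append(w_add * add_loss + w_remove * remove_loss)  # noise loss of this frame
    return sum(frame_losses) / len(frame_losses)             # first loss
```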
Step 2037, training the neural network model based on the first loss, the second image corresponding to the first image of each frame, and the reconstructed image to obtain a video reconstructed model.
In the embodiment of the present application, the second loss may be determined according to the manner from step 2031 to step 2032 based on the second image and the reconstructed image corresponding to the first image of each frame. Then, the first loss and the second loss are subjected to weighted summation calculation or weighted averaging calculation, and the obtained calculation result is used as the loss of the neural network model. And then training the neural network model through the loss of the neural network model to obtain a video reconstruction model, wherein the training of the neural network model through the loss of the neural network model is described above, and the description is omitted here.
It will be appreciated that training the neural network model with the first loss and the second loss amounts to gradient backpropagation (Gradient Backpropagation), through the neural network model, of the loss between the noise added to and removed from the first image of each frame and of the loss between the second image and the reconstructed image, optimizing the model parameters of the neural network model. The following code mainly describes optimizing the model parameters through the noise loss; optimizing the model parameters through the loss between the second image and the reconstructed image is similar and is not detailed here.
1: repeat
2:   $x_0 \sim q(x_0)$  // considered as the feature of the first image obtained after the 0th noise-adding process
3:   $t \sim \mathrm{Uniform}(\{1, \ldots, T\})$  // t takes values 1 to T
4:   $\epsilon \sim \mathcal{N}(0, I)$  // the labeling noise-adding data conforms to a normal distribution function
5:   Take gradient descent step on $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|^2$  // the noise loss is gradient back-propagated through the neural network model, optimizing the model parameters
6: until converged  // until the model converges
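A Python transcription of the training loop above, assuming PyTorch; `eps_model` is the noise-prediction part of the neural network model, `alpha_bars` holds $\bar{\alpha}_t$, and the loss follows formula (5). The names and optimizer interface are illustrative.

```python
import torch

def train_step(eps_model, optimizer, x0, alpha_bars, T):
    t = torch.randint(1, T + 1, (1,))                   # t ~ Uniform({1, ..., T})
    eps = torch.randn_like(x0)                          # labeling noise-adding data ~ N(0, I)
    ab_t = alpha_bars[t - 1]
    x_t = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps    # noised features at step t
    loss = ((eps - eps_model(x_t, t)) ** 2).mean()      # formula (5)
    optimizer.zero_grad()
    loss.backward()                                     # gradient backpropagation of the noise loss
    optimizer.step()                                    # optimize the model parameters
    return loss.item()
```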
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant region. For example, the first video, the second video, and the like referred to in the present application are all acquired with sufficient authorization.
In the method, the first images of each frame in the first video are subjected to noise adding processing through the neural network model to obtain the prediction noise information, and the prediction noise information is subjected to noise removing processing to obtain the reconstructed images corresponding to the first images of each frame, so that the reconstruction of the first video with lower resolution is realized to obtain the video with higher resolution. Because the reconstructed image is obtained by denoising the predicted noise information, the reconstructed image is not influenced by the type, the size, the resolution and the like of the first image, and detail information in the image is reserved by denoising, so that the definition of the reconstructed image is higher. On the basis, the neural network model is trained through the second video and the reconstructed images of each frame, so that the neural network model can be optimized towards the direction of enabling the reconstructed video to approach to the second video, the accuracy, the universality and the stability of the neural network model are improved, the video reconstruction model obtained through training can reconstruct the video with higher resolution, and the definition and the quality of the video are higher.
The embodiment of the application provides a video reconstruction method which can be applied to the implementation environment and can obtain high-resolution video with higher quality. Taking the flowchart of the video reconstruction method provided by the embodiment of the present application shown in fig. 7 as an example, for convenience of description, the terminal device 101 or the server 102 that performs the video reconstruction method in the embodiment of the present application is referred to as an electronic device, and the method may be performed by the electronic device. As shown in fig. 7, the method includes the following steps.
Step 701, obtaining a video to be reconstructed.
The embodiment of the application does not limit the acquisition mode of the video to be rebuilt, and the electronic equipment can acquire the input video to be rebuilt, or the electronic equipment can read the video to be rebuilt from the storage equipment, or the electronic equipment can search the network to obtain the video to be rebuilt, or the electronic equipment has the function of a video acquisition device, and the video to be rebuilt is obtained by shooting a shot object.
The video to be reconstructed includes a plurality of frames of images to be reconstructed, the resolution of any frame of images to be reconstructed may be lower than or equal to or higher than the first resolution, and the resolution of the images to be reconstructed is lower than the second resolution.
It will be appreciated that the implementation of step 701 is similar to that of step 201, and that a description of step 201 may be found and will not be repeated here.
Step 702, performing noise adding processing on the video to be reconstructed through the video reconstruction model to obtain reference noise information, and performing denoising processing on the reference noise information to obtain a target video.
The video reconstruction model is obtained by training according to a training method of the video reconstruction model related to fig. 2, the content of the target video is the same as that of the video to be reconstructed, and the resolution of the target video is higher than that of the video to be reconstructed. That is, the target video includes target images corresponding to the images to be reconstructed of the frames, and the target image corresponding to any one of the images to be reconstructed is the same as the content of any one of the images to be reconstructed of the frames and has higher resolution than any one of the images to be reconstructed of the frames.
In the embodiment of the application, the video reconstruction model is obtained by training the neural network model, so that the structure, the function and the like of the video reconstruction model are similar to those of the neural network model, but the parameters of the video reconstruction model are different from those of the neural network model. The content of the video reconstruction model can be found in the description of the neural network model, and is not repeated here.
Similar to the structure of the neural network model, the video reconstruction model also includes a noise adding network and a noise removing network. In the embodiment of the application, the noise adding network is used for carrying out the noise adding processing on any frame of image to be reconstructed to obtain the reference noise information, and the noise removing network is used for carrying out the noise removing processing on the reference noise information to obtain the target image corresponding to the frame of image to be reconstructed. The description of step 202 can be seen in the implementation manner of the noise adding process through the noise adding network and the implementation manner of the noise removing process through the noise removing network, and the implementation principles of the two are similar, so that the description is omitted here.
In one possible implementation manner, in step 702, "noise adding the image to be reconstructed of each frame through the video reconstruction model to obtain the reference noise information", includes: for any frame of image to be reconstructed in the video to be reconstructed, determining a reference image of the image to be reconstructed of any frame from the video to be reconstructed through a video reconstruction model; determining the image characteristics of any frame of image to be reconstructed according to the reference image and any frame of image to be reconstructed through a video reconstruction model; and carrying out noise adding processing on the image characteristics of any frame of image to be reconstructed through the video reconstruction model to obtain reference noise information.
In the embodiment of the application, at least one frame of image to be reconstructed except the ith frame of image to be reconstructed can be determined from the video to be reconstructed through the video reconstruction model, and each determined frame of image to be reconstructed is used as each frame reference image of the ith frame of image to be reconstructed. The implementation manner of this part of the content can be seen from the description of the step A1, and the implementation principles of the two are similar, and are not repeated here.
Similar to the structure of the neural network model, the video reconstruction model also includes an encoder, which is connected in series before the noise adding network. And the image characteristics of the image to be reconstructed of the ith frame can be obtained by encoding the image to be reconstructed of the ith frame through an encoder according to each reference image of the image to be reconstructed of the ith frame. The implementation manner of this part of the content can be seen from the description of the step A2, and the implementation principles of the two are similar, and are not repeated here.
And then, carrying out multiple times of noise adding processing on the image characteristics of the image to be reconstructed of the ith frame through a noise adding network to obtain reference noise information. The reference noise information may be determined by a related description of the prediction noise information, and the implementation principles of the reference noise information and the related description are similar, which is not described herein.
In one possible implementation, the "denoising the reference noise information to obtain the target video" in step 702 includes: for any frame of image to be reconstructed in the video to be reconstructed, acquiring the description information of that frame of image to be reconstructed, where the description information is used to characterize at least one of the text in the image to be reconstructed, the image content of the image to be reconstructed, the semantics expressed by the image to be reconstructed, and the like; and denoising the reference noise information based on the description information of that frame of image to be reconstructed, obtaining the target image corresponding to that frame of image to be reconstructed.
In the embodiment of the application, the electronic equipment can acquire the description information of the image to be reconstructed of the ith frame, and the denoising network performs denoising processing on the reference noise information for a plurality of times according to the description information of the image to be reconstructed of the ith frame to obtain the target image corresponding to the image to be reconstructed of the ith frame. The method for obtaining the description information of the image to be reconstructed is similar to the method for obtaining the description information of the first image, the method for determining the target image is similar to the method for determining the reconstructed image, and descriptions of steps C1 to C2 may be omitted herein.
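Taken end to end, inference with the trained video reconstruction model follows the same encode, add-noise, denoise, and decode path as training. The high-level sketch below uses hypothetical module names (`encode`, `add_noise`, `encode_description`, `denoise`, `decode`), and choosing an adjacent frame as the reference image is also an assumption.

```python
def reconstruct_video(model, frames):
    targets = []
    for i, frame in enumerate(frames):
        # Pick an adjacent frame as the reference image of this frame.
        ref = frames[i - 1] if i > 0 else frames[min(1, len(frames) - 1)]
        feat = model.encode(frame, ref)            # image features of the frame to be reconstructed
        noise = model.add_noise(feat)              # reference noise information
        desc = model.encode_description(frame)     # description features of the frame
        recon_feat = model.denoise(noise, desc)    # features of the target image
        targets.append(model.decode(recon_feat))   # target image with higher resolution
    return targets                                 # target video
```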
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant region. For example, the video to be reconstructed, the reference image, and the like, which are referred to in the present application, are acquired under the condition of sufficient authorization.
In the method, the reference noise information is obtained by carrying out noise adding processing on each frame of to-be-reconstructed image in the to-be-reconstructed video through the video reconstruction model, and the target image corresponding to each frame of to-be-reconstructed image is obtained by carrying out noise removing processing on the reference noise information, so that the target video with higher resolution is obtained by reconstructing the to-be-reconstructed video with lower resolution. The target image is obtained by denoising the reference noise information, so that the target image is not influenced by the type, the size, the resolution and the like of the image to be reconstructed, the detail information in the image is reserved by denoising, the definition of the target image is higher, and the quality of the target video is improved.
The foregoing describes the training method and the video reconstruction method of the video reconstruction model according to the embodiments of the present application from the perspective of method steps, and is described in detail below with reference to the scene. The embodiment of the application can be suitable for any scene which can acquire video, such as an automatic driving scene, a medical scene, a remote sensing satellite scene, a virtual reality scene and the like, and is also suitable for on-line and off-line video playing, live broadcasting and on-demand broadcasting, old film repairing, real-time audio and video (RTC) and the like. In some cases, the resolution of the acquired video is low under the limitation of hardware of video acquisition equipment and the like, a video reconstruction model can be trained according to the method provided by the embodiment of the application, and the video with low resolution is reconstructed through the video reconstruction model to obtain the video with high resolution. In the process of video reconstruction, any frame image in a video with lower resolution can be reconstructed through a video reconstruction model, so that a reconstructed image with the same content as the frame image and higher resolution than the frame image is obtained.
For convenience of description, any frame image in a video having a lower resolution will be referred to as a low resolution image, and any frame image in a video having a higher resolution obtained by reconstruction will be referred to as a high resolution image. It is understood that the low resolution image corresponds to the first image mentioned above and the image to be reconstructed or the like, and the high resolution image corresponds to the second image mentioned above, the reconstructed image, the target image or the like.
Referring to fig. 8, fig. 8 is a schematic diagram of an image reconstruction process according to an embodiment of the application. In the embodiment of the application, a video reconstruction model can be obtained through training according to the content related to fig. 2, and a low-resolution image is reconstructed through the video reconstruction model to obtain a high-resolution image. The video reconstruction model includes an encoder (Encoder, E), a decoder (Decoder, D), a noise-adding network, and a denoising network (not shown in the figure), where the denoising network includes a plurality of U network structures. It can be appreciated that fig. 8 shows two U network structures; in actual application, the number of U network structures can be flexibly set according to the application scenario.
First, the low-resolution image is input into the encoder, and the encoder encodes the low-resolution image to obtain low-resolution image features. The low-resolution image features can be determined in the same manner as the image features of the first image mentioned above; the implementation principles of the two are similar and are not repeated here.
Then, the low-resolution image features are subjected to noise-adding processing a plurality of times through the noise-adding network to obtain noise information. The noise information can be determined in the same manner as the prediction noise information mentioned above; the implementation principles of the two are similar and are not repeated here.
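The embodiment does not disclose a concrete noise schedule or noising formula. The following is a minimal Python sketch assuming the standard DDPM-style forward process, in which the features can be jumped directly to any noising step t in closed form (an iterative per-step implementation yields the same distribution); the schedule parameters, shapes, and function names are illustrative assumptions, not part of the embodiment.

```python
import torch

def make_alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    # Cumulative products of (1 - beta_t) for a linear variance schedule.
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0: torch.Tensor, t: int, alpha_bars: torch.Tensor):
    # Closed-form jump from the clean features z0 to the t-th noising step:
    # z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    eps = torch.randn_like(z0)
    a = alpha_bars[t]
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps
    return zt, eps

# Usage: noising hypothetical low-resolution image features of shape (1, 4, 32, 32).
alpha_bars = make_alpha_bars(num_steps=1000)
z0 = torch.randn(1, 4, 32, 32)
zT, eps = add_noise(z0, t=999, alpha_bars=alpha_bars)
```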
In the embodiment of the application, the input of the denoising network includes the noise information. In addition, the input of the denoising network also includes the description information of the low-resolution image. Optionally, the description information includes text appearing in the low-resolution image (i.e., image text), text describing the semantics characterized by the low-resolution image (i.e., semantic text), and text describing the image content of the low-resolution image (i.e., content text). In addition, since the low-resolution image itself constitutes description information of itself, the description information may further include the low-resolution image.
Optionally, the description information is input into an encoder, the description information is encoded by the encoder to obtain the description feature, and the input of the denoising network includes the description feature. The encoder for encoding the description information and the encoder for encoding the low resolution image may be the same encoder or may be different encoders, and are not limited herein.
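The embodiment does not fix the architecture of the encoder for the description information. Purely for illustration, the following hypothetical Python sketch turns the description texts plus the low-resolution image itself into a single description feature; the class name, the bag-of-tokens text branch, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    # Hypothetical encoder: a bag-of-tokens branch for the image/semantic/content
    # texts plus a projection of the low-resolution image itself.
    def __init__(self, vocab_size: int = 10000, dim: int = 256):
        super().__init__()
        self.text_embed = nn.EmbeddingBag(vocab_size, dim)
        self.image_proj = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, dim))

    def forward(self, token_ids, lr_image):
        # Sum the text feature and the image feature into one description feature.
        return self.text_embed(token_ids) + self.image_proj(lr_image)

# Usage: token ids of the concatenated description texts plus the LR image.
enc = DescriptionEncoder()
desc_feature = enc(torch.randint(0, 10000, (1, 16)), torch.randn(1, 3, 64, 64))  # (1, 256)
```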
The denoising network includes a plurality of U network structures; the functions, structures, compositions, and the like of any two U network structures are the same, while their parameters may be the same or different. As shown in fig. 3, any one U network structure includes a downsampling portion and an upsampling portion, where the downsampling portion includes a plurality of attention layers and the upsampling portion includes the same number of attention layers as the downsampling portion. The following description takes as an example a denoising network that includes two U network structures, where the downsampling portion and the upsampling portion of one U network structure each include two attention layers.
In the embodiment of the application, the noise information and the description feature can be input into the first U network structure of the denoising network, and the denoising feature is output by the first U network structure. It will be appreciated that the denoising feature corresponds to the feature of the first image, mentioned above, obtained after any denoising process. The denoising feature may be determined in either mode 1 or mode 2 described below. The switch shown in fig. 8 is merely a visual representation of the choice between mode 1 and mode 2 and may or may not be present in the actual model.
Mode 1: the noise information and the description feature are spliced to obtain a spliced feature. First, the first attention layer in the downsampling portion of the first U network structure performs a first downsampling process on the spliced feature according to an attention mechanism, obtaining the feature after the first downsampling process. Then, the second attention layer in the downsampling portion performs a second downsampling process on the feature obtained after the first downsampling process according to the attention mechanism, obtaining the feature after the second downsampling process. Next, the first attention layer in the upsampling portion of the first U network structure performs a first upsampling process on the feature obtained after the second downsampling process according to the attention mechanism, obtaining the feature after the first upsampling process. Finally, the second attention layer in the upsampling portion performs a second upsampling process, according to the attention mechanism, on the feature obtained after the first upsampling process together with the feature obtained after the first downsampling process, obtaining the feature after the second upsampling process, which is the denoising feature.
Mode 2: first, the first attention layer in the downsampling portion of the first U network structure performs a first downsampling process, according to an attention mechanism, on the feature obtained by splicing the noise information and the description feature, obtaining the feature after the first downsampling process. Then, the second attention layer in the downsampling portion performs a second downsampling process on the feature obtained by splicing the feature after the first downsampling process with the description feature, obtaining the feature after the second downsampling process. Next, the first attention layer in the upsampling portion of the first U network structure performs a first upsampling process on the feature obtained by splicing the feature after the second downsampling process with the description feature, obtaining the feature after the first upsampling process. Finally, the second attention layer in the upsampling portion performs a second upsampling process, according to the attention mechanism, on the feature obtained after the first upsampling process together with the description feature and the feature obtained after the first downsampling process, obtaining the feature after the second upsampling process, which is the denoising feature.
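The two modes differ only in where the description feature is injected: mode 1 splices it only at the input, while mode 2 re-splices it before every attention layer. The following Python sketch makes that difference concrete; plain strided convolutions stand in for the attention layers, and the class name and all channel counts are illustrative assumptions rather than the embodiment's actual layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUBlock(nn.Module):
    # Two-down / two-up U structure. mode=1 splices the description feature d
    # only at the input; mode=2 re-splices d before every layer.
    def __init__(self, ch: int, mode: int = 1):
        super().__init__()
        self.mode = mode
        extra = ch if mode == 2 else 0               # channels d contributes beyond the input splice
        self.down1 = nn.Conv2d(2 * ch, ch, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(ch + extra, ch, 3, stride=2, padding=1)
        self.up1 = nn.Conv2d(ch + extra, ch, 3, padding=1)
        self.up2 = nn.Conv2d(2 * ch + extra, ch, 3, padding=1)  # also receives the down1 skip

    def _cat(self, x, d):
        # Resize the description feature to x's spatial size before splicing.
        return torch.cat([x, F.interpolate(d, size=x.shape[-2:])], dim=1)

    def forward(self, z, d):
        h1 = self.down1(self._cat(z, d))                                   # 1st downsampling
        h2 = self.down2(self._cat(h1, d) if self.mode == 2 else h1)        # 2nd downsampling
        u1 = F.interpolate(h2, scale_factor=2.0)
        u1 = self.up1(self._cat(u1, d) if self.mode == 2 else u1)          # 1st upsampling
        u2 = F.interpolate(torch.cat([u1, h1], dim=1), scale_factor=2.0)   # skip from down1
        return self.up2(self._cat(u2, d) if self.mode == 2 else u2)        # 2nd upsampling -> denoising feature

# Usage: noise information and a description feature of matching channel count.
block = TinyUBlock(ch=8, mode=2)
out = block(torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32))  # denoising feature, (1, 8, 32, 32)
```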
Then, the denoising feature and the description feature are input into a second U network structure of the denoising network, and the high-resolution image feature is output through the second U network structure. It will be appreciated that the high resolution image features correspond to the features of the first image referred to above that were obtained after the last denoising process. The manner in which the second U-network structure determines the features of the high resolution image is similar to the manner in which the first U-network structure determines the noise removal features, and will not be described in detail herein.
Then, the decoder decodes the high-resolution image features to obtain the high-resolution image. The high-resolution image can be determined in the same manner as the reconstructed image mentioned above; the implementation principles of the two are similar and are not repeated here.
The above process of reconstructing a low-resolution image into a high-resolution image is equivalent to a latent diffusion (Latent Diffusion) model. The latent diffusion model is a generative model based on the diffusion principle and can reconstruct high-dimensional data such as text, images, audio, and video. Unlike conventional generative models, the latent diffusion model does not need to compute the probability density function of the generated data; instead, it generates data through multi-step diffusion of latent variables. That is, the image space is encoded into the feature space (also called the latent space) by the encoder, multi-step diffusion (i.e., multiple noise-adding processes and multiple denoising processes) is performed in the feature space, and the feature space is then restored into the image space by the decoder, thereby obtaining the high-resolution image. The low-resolution image and the high-resolution image may each be a gray-scale image, a color image, a Red-Green-Blue (RGB) image, or the like.
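Put together, the whole reconstruction is a three-stage latent-space pipeline. The sketch below shows only the control flow described above; the encoder, noiser, denoiser, and decoder are passed in as opaque callables, and the toy stand-ins in the usage lines are assumptions made solely so the example runs.

```python
import torch

def reconstruct(lr_image, encoder, noiser, denoiser, decoder, num_steps: int):
    # Image space -> latent space -> multi-step diffusion -> image space.
    z = encoder(lr_image)                  # encode into the feature (latent) space
    z = noiser(z, num_steps)               # multi-step noise addition -> noise information
    for t in reversed(range(num_steps)):   # multi-step denoising in the latent space
        z = denoiser(z, t)
    return decoder(z)                      # decode back into the image space

# Toy stand-ins: identity encoder/decoder, Gaussian noiser, slight-shrink "denoiser".
encode = decode = lambda x: x
noise = lambda z, n: z + torch.randn_like(z)
denoise = lambda z, t: 0.999 * z
hr_image = reconstruct(torch.randn(1, 3, 16, 16), encode, noise, denoise, decode, num_steps=10)
```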
In the embodiment of the application, on one hand, an image acquisition device is required to acquire the low-resolution image, or a video acquisition device is required to acquire the low-resolution video. On the other hand, an electronic device having a central processing unit (Central Processing Unit, CPU) or a graphics processor (Graphics Processing Unit, GPU) is required to reconstruct the low-resolution image into a high-resolution image based on the method of the embodiment of the application, thereby obtaining a high-resolution video. Optionally, the electronic device is provided with a display supporting high resolution, through which the high-resolution video and the low-resolution video are displayed. On yet another hand, a storage resource is also needed for storing the low-resolution video, the high-resolution video, and the data obtained during reconstruction. The hardware can use common configurations without special customization, which reduces hardware cost.
In practical applications, to improve real-time performance, the method of the embodiment of the application can reconstruct video through CPU parallel computing or with a lightweight model. To improve the video reconstruction effect, it can be combined with super-resolution reconstruction techniques, multi-frame super-resolution reconstruction techniques, and the like. To reduce complexity, it can be combined with techniques such as low-rank decomposition and compressed sensing. These variants are not described in detail here.
Fig. 9 is a schematic structural diagram of a training device for a video reconstruction model according to an embodiment of the present application, and as shown in fig. 9, the device includes an acquisition module 901, a noise adding and denoising module 902 and a training module 903.
The acquiring module 901 is configured to acquire a first video and a second video with the same content, where the resolution of the second video is higher than that of the first video.
The noise adding and denoising module 902 is configured to, for any frame of first image in the first video, perform noise-adding processing on the first image through a neural network model to obtain prediction noise information, and perform denoising processing on the prediction noise information to obtain a reconstructed image corresponding to the first image, where the reconstructed image corresponding to the first image has the same resolution as the second image corresponding to the first image in the second video.
The training module 903 is configured to train the neural network model based on the second image and the reconstructed image corresponding to the first image of each frame to obtain a video reconstruction model, where the video reconstruction model is used for reconstructing a video to be reconstructed to obtain a target video, and the resolution of the target video is higher than that of the video to be reconstructed.
In one possible implementation, the denoising module 902 is configured to determine, from the first video, a reference image of a first image of any frame through a neural network model; determining the image characteristics of any frame of first image according to the reference image and any frame of first image through a neural network model; and carrying out noise adding processing on the image characteristics of the first image of any frame through the neural network model to obtain prediction noise information.
In one possible implementation, the denoising module 902 is configured to determine, through the neural network model, an image change feature based on the reference image and the first image of any frame, where the image change feature is used to characterize a change made by changing the reference image to the first image of any frame; extracting features of the reference image through the neural network model to obtain image features of the reference image; image features of the first image of any frame are determined based on the image features and the image change features of the reference image through the neural network model.
In a possible implementation manner, the denoising module 902 is configured to perform feature extraction on the first image of any frame through a neural network model to obtain a first feature of the first image of any frame; performing change processing on the image characteristics of the reference image based on the image change characteristics through a neural network model to obtain second characteristics of the first image of any frame; and fusing the first features and the second features of the first images of any frame through the neural network model to obtain the image features of the first images of any frame.
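The preceding implementations amount to: extract a first feature from the current frame, transform the reference-image features with the image-change feature to obtain a second feature, and fuse the two. The following is a minimal Python sketch of that flow; the module name, the convolutional layers, and the channel counts are illustrative assumptions, since the embodiment does not specify the layers.

```python
import torch
import torch.nn as nn

class FrameFeatureBuilder(nn.Module):
    # Hypothetical module; the embodiment does not specify the actual layers.
    def __init__(self, ch: int = 16):
        super().__init__()
        self.extract = nn.Conv2d(3, ch, 3, padding=1)            # first feature of the current frame
        self.apply_change = nn.Conv2d(2 * ch, ch, 3, padding=1)  # reference features changed by the change feature
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)          # fuse first and second features

    def forward(self, frame, ref_feat, change_feat):
        first = self.extract(frame)                                        # feature extraction on the frame
        second = self.apply_change(torch.cat([ref_feat, change_feat], 1))  # change processing on reference features
        return self.fuse(torch.cat([first, second], 1))                    # image features of the frame

# Usage with illustrative shapes.
builder = FrameFeatureBuilder()
feat = builder(torch.randn(1, 3, 64, 64), torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))
```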
In one possible implementation, the noise-adding processing is performed a plurality of times; the denoising module 902 is configured to determine the image features of any frame of first image through the neural network model; perform first noise-adding processing on the image features of the first image through the neural network model to obtain the features of the first image after the first noise-adding processing; and, for any noise-adding processing other than the first, perform that noise-adding processing, through the neural network model, on the features of the first image obtained after the previous noise-adding processing to obtain the features of the first image after that noise-adding processing, where the features of the first image obtained after the last noise-adding processing are the prediction noise information.
In one possible implementation, the number of denoising processes is multiple times; the denoising module 902 is configured to denoise the predicted noise information for the first time through the neural network model, so as to obtain a feature of the first image of any frame obtained after the first denoising process; for any denoising process except the first denoising process, performing any denoising process on the characteristics obtained after the last denoising process of the first image of any frame by using a neural network model to obtain the characteristics obtained after the any denoising process of the first image of any frame; and determining a reconstructed image corresponding to the first image of any frame based on the characteristics of the first image of any frame obtained after the last denoising treatment through the neural network model.
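The multi-step denoising described above corresponds, step for step, to the reverse process of a diffusion model. The sketch below assumes the standard DDPM reverse update (the embodiment does not name a specific sampler); eps_pred stands for the noise predicted by the denoising network at step t, and the schedule reuses the one from the earlier noising sketch.

```python
import torch

@torch.no_grad()
def denoise_step(zt, t, eps_pred, betas, alpha_bars):
    # One reverse step z_t -> z_{t-1} of the standard DDPM sampler.
    alpha_t = 1.0 - betas[t]
    mean = (zt - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alpha_t.sqrt()
    if t == 0:
        return mean                                    # final step: no noise is re-injected
    return mean + betas[t].sqrt() * torch.randn_like(zt)

# Usage with a linear schedule and a dummy noise prediction.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
z = torch.randn(1, 4, 32, 32)
z_prev = denoise_step(z, t=999, eps_pred=torch.zeros_like(z), betas=betas, alpha_bars=alpha_bars)
```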
In one possible implementation, a denoising module 902 is configured to obtain description information of a first image of any frame; and denoising the prediction noise information based on the description information of the first image of any frame to obtain a reconstructed image corresponding to the first image of any frame.
In one possible implementation, the training module 903 is configured to determine, for the first image of any frame, an image loss corresponding to the first image of any frame based on an error between the second image corresponding to the first image of any frame and the reconstructed image; and training the neural network model based on the image loss corresponding to the first image of each frame to obtain a video reconstruction model.
In one possible implementation, the training module 903 is configured to obtain labeling noise adding data of each frame of first image, where the labeling noise adding data of any frame of first image characterizes the noise added in the process of noising that first image into labeling noise information; obtain labeling denoising data of each frame of second image, where the labeling denoising data of any frame of second image characterizes the noise removed in the process of denoising the labeling noise information into that second image; for any frame of first image, obtain, through the neural network model, the prediction noise adding data added in the process of noising that first image to obtain the prediction noise information, and obtain, through the neural network model, the prediction denoising data removed in the process of denoising the prediction noise information to obtain the reconstructed image; determine a first loss based on the labeling denoising data of each frame of second image, the labeling noise adding data of each frame of first image, the prediction noise adding data, and the prediction denoising data; and train the neural network model based on the first loss and the second image and reconstructed image corresponding to each frame of first image to obtain a video reconstruction model.
In one possible implementation, the training module 903 is configured to determine a denoising data loss based on the labeled denoising data of the second image of each frame and the predicted denoising data corresponding to the first image of each frame; determining noise adding data loss based on the labeling noise adding data of the first images of each frame and the prediction noise adding data corresponding to the first images of each frame; the first loss is determined based on the de-noised data loss and the noisy data loss.
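As a concrete reading of the losses above: the first loss combines a denoising-data term and a noise-adding-data term, and training can further add a per-frame image loss between the second images and the reconstructed images. The sketch below assumes simple addition and mean-squared error for each term; the embodiment does not fix either choice.

```python
import torch
import torch.nn.functional as F

def first_loss(pred_noising, gt_noising, pred_denoising, gt_denoising):
    # Denoising-data term plus noise-adding-data term, here combined by addition.
    return F.mse_loss(pred_denoising, gt_denoising) + F.mse_loss(pred_noising, gt_noising)

def total_loss(first, reconstructed_images, second_images):
    # First loss plus a per-frame image loss between reconstructed and second images.
    return first + F.mse_loss(reconstructed_images, second_images)

# Usage with dummy tensors standing in for the labeled and predicted data.
gt_n, gt_d = torch.randn(4, 8), torch.randn(4, 8)
loss = total_loss(first_loss(gt_n + 0.1, gt_n, gt_d + 0.1, gt_d),
                  torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
```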
In the device, the neural network model performs noise-adding processing on each frame of first image in the first video to obtain prediction noise information, and performs denoising processing on the prediction noise information to obtain the reconstructed image corresponding to each frame of first image, thereby reconstructing the lower-resolution first video into a higher-resolution video. Because the reconstructed image is obtained by denoising the prediction noise information, it is not affected by the type, size, resolution, and the like of the first image, and the denoising retains the detail information in the image, so the reconstructed image has higher definition. On this basis, the neural network model is trained with the second video and each frame of reconstructed image, so that the model is optimized toward making the reconstructed video approach the second video; this improves the accuracy, universality, and stability of the neural network model, enabling the trained video reconstruction model to reconstruct videos with higher resolution, definition, and quality.
It should be understood that, in implementing the functions of the apparatus provided in fig. 9, only the division of the functional modules is illustrated, and in practical application, the functional modules may be allocated to different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 10 is a schematic structural diagram of a video reconstruction device according to an embodiment of the present application, and as shown in fig. 10, the device includes an acquisition module 1001 and a noise adding and removing module 1002.
An obtaining module 1001 is configured to obtain a video to be reconstructed.
The noise adding and denoising module 1002 is configured to perform noise-adding processing on the video to be reconstructed through a video reconstruction model to obtain reference noise information, and perform denoising processing on the reference noise information to obtain a target video.
The video reconstruction model is obtained by training according to a training method of the video reconstruction model related to fig. 2, the content of the target video is the same as that of the video to be reconstructed, and the resolution of the target video is higher than that of the video to be reconstructed.
In one possible implementation manner, the denoising module 1002 is configured to determine, for any frame of an image to be reconstructed in a video to be reconstructed, a reference image of the image to be reconstructed of any frame from the video to be reconstructed through a video reconstruction model; determining the image characteristics of any frame of image to be reconstructed according to the reference image and any frame of image to be reconstructed through a video reconstruction model; and carrying out noise adding processing on the image characteristics of any frame of image to be reconstructed through the video reconstruction model to obtain reference noise information.
In a possible implementation manner, the denoising module 1002 is configured to obtain, for any frame of an image to be reconstructed in a video to be reconstructed, description information of the image to be reconstructed of the any frame; and denoising the reference noise information based on the description information of the image to be reconstructed of any frame to obtain a target image corresponding to the image to be reconstructed of any frame.
In the device, the video reconstruction model performs noise-adding processing on each frame of image to be reconstructed in the video to be reconstructed to obtain reference noise information, and performs denoising processing on the reference noise information to obtain the target image corresponding to each frame of image to be reconstructed, thereby reconstructing the lower-resolution video to be reconstructed into a higher-resolution target video. Because the target image is obtained by denoising the reference noise information, it is not affected by the type, size, resolution, and the like of the image to be reconstructed, and the denoising retains the detail information in the image, so the target image has higher definition and the quality of the target video is improved.
It should be understood that, in implementing the functions of the apparatus provided in fig. 10, only the division of the functional modules is illustrated, and in practical application, the functional modules may be allocated to different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 11 shows a block diagram of a terminal device 1100 according to an exemplary embodiment of the present application. The terminal device 1100 includes: a processor 1101 and a memory 1102.
The processor 1101 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one computer program for execution by processor 1101 to implement the training method or video reconstruction method of the video reconstruction model provided by the method embodiments of the present application.
In some embodiments, the terminal device 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102, and peripheral interface 1103 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1103 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, a display screen 1105, a camera assembly 1106, audio circuitry 1107, and a power supply 1108.
The peripheral interface 1103 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, the memory 1102, and the peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102, and the peripheral interface 1103 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may also include NFC (Near Field Communication) related circuitry, which is not limited in the application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display, it also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 1101 as a control signal for processing. At this time, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1105, disposed on the front panel of the terminal device 1100; in other embodiments, there may be at least two display screens 1105, disposed on different surfaces of the terminal device 1100 or in a folded design; in still other embodiments, the display screen 1105 may be a flexible display disposed on a curved or folded surface of the terminal device 1100. The display screen 1105 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1106 is used to capture images or video. Optionally, the camera assembly 1106 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera can be fused with the depth-of-field camera to realize a background-blurring function, or fused with the wide-angle camera to realize panoramic and Virtual Reality (VR) shooting or other fused shooting functions. In some embodiments, the camera assembly 1106 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.
The audio circuit 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing, or inputting the electric signals to the radio frequency circuit 1104 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal device 1100, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1107 may also include a headphone jack.
A power supply 1108 is used to power the various components in terminal device 1100. The power supply 1108 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 1108 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal device 1100 also includes one or more sensors 1109. The one or more sensors 1109 include, but are not limited to: acceleration sensor 1111, gyroscope sensor 1112, pressure sensor 1113, optical sensor 1114, and proximity sensor 1115.
The acceleration sensor 1111 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established in the terminal apparatus 1100. For example, the acceleration sensor 1111 may be configured to detect components of gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1111. Acceleration sensor 1111 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal device 1100, and the gyro sensor 1112 may collect a 3D motion of the user on the terminal device 1100 in cooperation with the acceleration sensor 1111. The processor 1101 may implement the following functions based on the data collected by the gyro sensor 1112: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 1113 may be disposed at a side frame of the terminal device 1100 and/or at a lower layer of the display screen 1105. When the pressure sensor 1113 is provided at a side frame of the terminal apparatus 1100, a grip signal of the terminal apparatus 1100 by a user can be detected, and the processor 1101 performs left-right hand recognition or quick operation based on the grip signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1114 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the intensity of ambient light collected by the optical sensor 1114. Specifically, when the intensity of the ambient light is high, the display luminance of the display screen 1105 is turned up; when the ambient light intensity is low, the display luminance of the display screen 1105 is turned down. In another embodiment, the processor 1101 may also dynamically adjust the shooting parameters of the camera assembly 1106 based on the intensity of ambient light collected by the optical sensor 1114.
A proximity sensor 1115, also referred to as a distance sensor, is typically provided on the front panel of the terminal device 1100. The proximity sensor 1115 is used to collect the distance between the user and the front surface of the terminal device 1100. In one embodiment, when the proximity sensor 1115 detects that the distance between the user and the front surface of the terminal device 1100 gradually decreases, the processor 1101 controls the display 1105 to switch from the bright screen state to the off screen state; when the proximity sensor 1115 detects that the distance between the user and the front surface of the terminal apparatus 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is not limiting and that terminal device 1100 may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1200 may vary considerably in configuration or performance, and may include one or more processors 1201 (for example, CPUs) and one or more memories 1202, where the one or more memories 1202 store at least one computer program that is loaded and executed by the one or more processors 1201 to implement the training method or the video reconstruction method of the video reconstruction model provided by the foregoing method embodiments. Of course, the server 1200 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for input/output, and may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to cause an electronic device to implement a training method or a video reconstruction method of any of the above video reconstruction models.
Alternatively, the above-mentioned computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Read-Only optical disk (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program, the computer program being loaded and executed by a processor to cause an electronic device to implement any of the above training methods of a video reconstruction model or video reconstruction methods.
In an exemplary embodiment, there is also provided a computer program product having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to cause an electronic device to implement a training method or a video reconstruction method of any of the above-mentioned video reconstruction models.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The above embodiments are merely exemplary embodiments of the present application and are not intended to limit the present application, any modifications, equivalent substitutions, improvements, etc. that fall within the principles of the present application should be included in the scope of the present application.

Claims (17)

1. A method of training a video reconstruction model, the method comprising:
acquiring a first video and a second video with the same content, wherein the resolution of the second video is higher than that of the first video;
for any frame of first image in the first video, performing noise adding processing on the any frame of first image through a neural network model to obtain prediction noise information, and performing noise removing processing on the prediction noise information to obtain a reconstructed image corresponding to the any frame of first image, wherein the reconstructed image corresponding to the any frame of first image is identical to a second image corresponding to the any frame of first image in the second video in resolution;
training the neural network model based on a second image and a reconstructed image corresponding to each frame of the first image to obtain a video reconstruction model, wherein the video reconstruction model is used for reconstructing a video to be reconstructed to obtain a target video, and the resolution of the target video is higher than that of the video to be reconstructed.
2. The method according to claim 1, wherein the noise-adding the first image of any frame by the neural network model to obtain the prediction noise information includes:
determining a reference image of the first image of any frame from the first video through a neural network model;
determining image characteristics of the first image of any frame according to the reference image and the first image of any frame through the neural network model;
and carrying out noise adding processing on the image characteristics of the first image of any frame through the neural network model to obtain prediction noise information.
3. The method of claim 2, wherein determining, by the neural network model, image features of the arbitrary frame first image from the reference image for the arbitrary frame first image, comprises:
determining, by the neural network model, image change features based on the reference image and the arbitrary frame first image, the image change features being used to characterize changes made to change the reference image to the arbitrary frame first image;
extracting features of the reference image through the neural network model to obtain image features of the reference image;
and determining the image characteristics of the first image of any frame based on the image characteristics and the image change characteristics of the reference image through the neural network model.
4. A method according to claim 3, wherein said determining, by the neural network model, image features of the first image of any frame based on image features of the reference image and the image change features, comprises:
extracting features of the first image of any frame through the neural network model to obtain first features of the first image of any frame;
performing change processing on the image features of the reference image based on the image change features through the neural network model to obtain second features of the first image of any frame;
and fusing the first features and the second features of the first image of any frame through the neural network model to obtain the image features of the first image of any frame.
5. The method of claim 1, wherein the number of times the noise adding process is multiple times;
the noise adding processing is performed on the first image of any frame through the neural network model to obtain prediction noise information, and the method comprises the following steps:
determining the image characteristics of the first image of any frame through a neural network model;
performing first noise adding processing on the image features of the first image of any frame through the neural network model to obtain the features of the first image of any frame obtained after the first noise adding processing;
and for any one of the noise adding processes except the first noise adding process, performing the any one of the noise adding processes on the characteristics of the first image of any one frame obtained after the last noise adding process of the any one of the noise adding processes through the neural network model to obtain the characteristics of the first image of any one frame obtained after the any one of the noise adding processes, wherein the characteristics of the first image of any one frame obtained after the last noise adding process are the prediction noise information.
6. The method of claim 1, wherein the number of denoising processes is a plurality of times;
denoising the predicted noise information to obtain a reconstructed image corresponding to the first image of any frame, including:
performing first denoising processing on the predicted noise information through the neural network model to obtain features of the first image of any frame obtained after the first denoising processing;
for any denoising process except the first denoising process, performing the any denoising process on the characteristics of the first image of any frame obtained after the last denoising process of the any denoising process through the neural network model to obtain the characteristics of the first image of any frame obtained after the any denoising process;
and determining a reconstructed image corresponding to the first image of any frame based on the characteristics obtained after the last denoising treatment of the first image of any frame through the neural network model.
7. The method according to claim 1, wherein the denoising the prediction noise information to obtain a reconstructed image corresponding to the first image of the arbitrary frame includes:
acquiring description information of the first image of any frame;
and denoising the prediction noise information based on the description information of the first image of any frame to obtain a reconstructed image corresponding to the first image of any frame.
8. The method according to any one of claims 1 to 7, wherein training the neural network model based on the second image and the reconstructed image corresponding to the first image of each frame to obtain a video reconstruction model includes:
for any frame of first image, determining image loss corresponding to any frame of first image based on error between a second image corresponding to any frame of first image and a reconstructed image;
and training the neural network model based on the image loss corresponding to the first image of each frame to obtain a video reconstruction model.
9. The method according to any one of claims 1 to 7, wherein training the neural network model based on the second image and the reconstructed image corresponding to the first image of each frame to obtain a video reconstruction model includes:
acquiring labeling noise adding data of each frame of first image, wherein the labeling noise adding data of any frame of first image characterizes the noise added in the process of noising the any frame of first image into the labeling noise information;
acquiring labeling denoising data of each frame of second image, wherein the labeling denoising data of any frame of second image represents noise removed in the process of denoising the labeling noise information into any frame of second image;
for any frame of first image, obtaining prediction noise adding data added in the process of adding noise to any frame of first image through a neural network model to obtain prediction noise information, and obtaining prediction noise removing data removed in the process of removing noise to the prediction noise information through the neural network model to obtain a reconstructed image;
determining a first loss based on the labeling denoising data of the second images of each frame, the labeling denoising data of the first images of each frame, the prediction denoising data and the prediction denoising data;
and training the neural network model based on the first loss, the second image corresponding to the first image of each frame and the reconstructed image to obtain a video reconstruction model.
10. The method of claim 9, wherein the determining the first loss based on the labeled denoising data for the second image of each frame, the labeled denoising data for the first image of each frame, the predicted denoising data, and the predicted denoising data comprises:
determining denoising data loss based on the labeling denoising data of the second image of each frame and the prediction denoising data corresponding to the first image of each frame;
determining noise adding data loss based on the labeling noise adding data of the first images of each frame and the prediction noise adding data corresponding to the first images of each frame;
the first loss is determined based on the denoising data loss and the noise adding data loss.
11. A method of video reconstruction, the method comprising:
acquiring a video to be reconstructed;
carrying out noise adding processing on the video to be reconstructed through a video reconstruction model to obtain reference noise information, and carrying out denoising processing on the reference noise information to obtain a target video;
the video reconstruction model is trained according to the method of any one of claims 1 to 10, the content of the target video is the same as that of the video to be reconstructed, and the resolution of the target video is higher than that of the video to be reconstructed.
12. The method of claim 11, wherein the performing the noise adding process on the video to be reconstructed by the video reconstruction model to obtain the reference noise information includes:
for any frame of to-be-reconstructed image in the to-be-reconstructed video, determining a reference image of the to-be-reconstructed image of any frame from the to-be-reconstructed video through a video reconstruction model;
determining image characteristics of any frame of image to be reconstructed according to the reference image and the image to be reconstructed of any frame through the video reconstruction model;
and carrying out noise adding processing on the image characteristics of the image to be reconstructed of any frame through the video reconstruction model to obtain reference noise information.
13. The method of claim 11, wherein denoising the reference noise information to obtain a target video, comprises:
for any frame of image to be reconstructed in the video to be reconstructed, acquiring description information of the image to be reconstructed of any frame;
and denoising the reference noise information based on the description information of the image to be reconstructed of any frame to obtain a target image corresponding to the image to be reconstructed of any frame.
14. A training apparatus for a video reconstruction model, the apparatus comprising:
the acquisition module is used for acquiring a first video and a second video with the same content, and the resolution of the second video is higher than that of the first video;
the noise adding and denoising module is used for adding noise to any frame of first image in the first video through a neural network model to obtain prediction noise information, and denoising the prediction noise information to obtain a reconstructed image corresponding to any frame of first image, wherein the reconstructed image corresponding to any frame of first image is identical to the second image corresponding to any frame of first image in the second video in resolution;
The training module is used for training the neural network model based on the second image and the reconstructed image corresponding to each frame of the first image to obtain a video reconstruction model, the video reconstruction model is used for reconstructing a video to be reconstructed to obtain a target video, and the resolution of the target video is higher than that of the video to be reconstructed.
15. A video reconstruction apparatus, the apparatus comprising:
the acquisition module is used for acquiring the video to be reconstructed;
the noise adding and denoising module is used for adding noise to the video to be reconstructed through a video reconstruction model to obtain reference noise information, and denoising the reference noise information to obtain a target video;
the video reconstruction model is trained according to the method of any one of claims 1 to 10, the content of the target video is the same as that of the video to be reconstructed, and the resolution of the target video is higher than that of the video to be reconstructed.
16. An electronic device comprising a processor and a memory, wherein the memory stores at least one computer program, the at least one computer program being loaded and executed by the processor to cause the electronic device to implement the training method of the video reconstruction model of any one of claims 1 to 10 or the video reconstruction method of any one of claims 11 to 13.
17. A computer readable storage medium, wherein at least one computer program is stored in the computer readable storage medium, and the at least one computer program is loaded and executed by a processor, to cause an electronic device to implement the training method of the video reconstruction model according to any one of claims 1 to 10 or the video reconstruction method according to any one of claims 11 to 13.
CN202311046169.6A 2023-08-18 2023-08-18 Training method of video reconstruction model, video reconstruction method, device and equipment Active CN116757970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311046169.6A CN116757970B (en) 2023-08-18 2023-08-18 Training method of video reconstruction model, video reconstruction method, device and equipment

Publications (2)

Publication Number Publication Date
CN116757970A 2023-09-15
CN116757970B (en) 2023-11-17

Family

ID=87957599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311046169.6A Active CN116757970B (en) 2023-08-18 2023-08-18 Training method of video reconstruction model, video reconstruction method, device and equipment

Country Status (1)

Country Link
CN (1) CN116757970B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705699A (en) * 2019-10-18 2020-01-17 厦门美图之家科技有限公司 Super-resolution reconstruction method and device, electronic equipment and readable storage medium
CN112862684A (en) * 2021-02-09 2021-05-28 左一帆 Data processing method for depth map super-resolution reconstruction and denoising neural network
CN114897688A (en) * 2022-04-24 2022-08-12 北京达佳互联信息技术有限公司 Video processing method, video processing device, computer equipment and medium
CN115082308A (en) * 2022-05-23 2022-09-20 华南理工大学 Video super-resolution reconstruction method and system based on multi-scale local self-attention
CN115424088A (en) * 2022-08-23 2022-12-02 阿里巴巴(中国)有限公司 Image processing model training method and device
CN115797646A (en) * 2022-12-14 2023-03-14 河北经贸大学 Multi-scale feature fusion video denoising method, system, device and storage medium
CN115861131A (en) * 2023-02-03 2023-03-28 北京百度网讯科技有限公司 Training method and device based on image generation video and model and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
H. BRAUN et al.: "Optical flow for compressive sensing video reconstruction", 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2267-2271 *
LENG Jiaxu et al.: "A Survey of Advances in Deep Learning-Based Video Super-Resolution Reconstruction", Computer Science, vol. 49, no. 02, pages 123-133 *

Also Published As

Publication number Publication date
CN116757970B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN110136136B (en) Scene segmentation method and device, computer equipment and storage medium
CN109978989B (en) Three-dimensional face model generation method, three-dimensional face model generation device, computer equipment and storage medium
CN110097019B (en) Character recognition method, character recognition device, computer equipment and storage medium
CN109299315B (en) Multimedia resource classification method and device, computer equipment and storage medium
CN111062981B (en) Image processing method, device and storage medium
CN113205568B (en) Image processing method, device, electronic equipment and storage medium
JP2021526698A (en) Image generation methods and devices, electronic devices, and storage media
CN110147533B (en) Encoding method, apparatus, device and storage medium
CN111932463B (en) Image processing method, device, equipment and storage medium
CN112381707B (en) Image generation method, device, equipment and storage medium
CN111860485A (en) Training method of image recognition model, and image recognition method, device and equipment
CN114820633A (en) Semantic segmentation method, training device and training equipment of semantic segmentation model
CN110807769B (en) Image display control method and device
CN114359225A (en) Image detection method, image detection device, computer equipment and storage medium
CN113570510A (en) Image processing method, device, equipment and storage medium
CN116757970B (en) Training method of video reconstruction model, video reconstruction method, device and equipment
CN113012064B (en) Image processing method, device, equipment and storage medium
CN112528760B (en) Image processing method, device, computer equipment and medium
CN114462580A (en) Training method of text recognition model, text recognition method, device and equipment
CN111310701B (en) Gesture recognition method, device, equipment and storage medium
CN114511082A (en) Training method of feature extraction model, image processing method, device and equipment
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
CN113822955A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN113362260A (en) Image optimization method and device, storage medium and electronic equipment
CN112750449A (en) Echo cancellation method, device, terminal, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40093253
Country of ref document: HK