CN113706414B

CN113706414B - Training method of video optimization model and electronic equipment

Info

Publication number: CN113706414B
Application number: CN202110990080.XA
Authority: CN
Inventors: 卢圣卿; 肖斌; 王宇; 朱聪超
Original assignee: Honor Device Co Ltd
Current assignee: Shanghai Glory Smart Technology Development Co ltd
Priority date: 2021-08-26
Filing date: 2021-08-26
Publication date: 2022-09-09
Anticipated expiration: 2041-08-26
Also published as: CN113706414A

Abstract

The application belongs to the field of image processing, and provides a training method of a video optimization model and electronic equipment, wherein the method comprises the following steps: inputting the blurred video into a preset video optimization model for alignment processing and deblurring processing to obtain a first output image in the output video; determining optical flow according to the label video, and transforming the first output image according to the optical flow to obtain a second output image; and determining the difference between the first output image and the first label image and the difference between the second output image and the second label image, and adjusting the parameters of the video optimization model according to the differences until the training of the video optimization model is completed. The alignment module aligns the video to be optimized, and performs video optimization on the aligned images through the trained video optimization model, so that the deblurred video can be obtained, and meanwhile, the problem of time domain consistency can be effectively optimized, and the optimization efficiency of the blurred video can be effectively improved.

Description

Training method of video optimization model and electronic equipment

Technical Field

The present application relates to the field of image processing, and in particular, to a training method for a video optimization model and an electronic device.

Background

With the improvement of the performance of electronic devices such as mobile phones, flat panels and smart televisions and the development of communication technologies, the application of video services in the electronic devices is also more and more extensive. The high-quality video playing effect is beneficial to the user to watch clearer picture content and is beneficial to improving the watching experience of the user.

When a video is collected, due to camera shake, target motion and the like, relative motion occurs between a camera and an object and between the camera and a background to generate motion blur; alternatively, motion blur occurs due to a change in the focus position of the camera due to a change in the depth of field of the camera. Or, in the shooting process, the video may have a time domain consistency defect due to the influence of shooting factors, post-editing factors or transcoding factors.

In order to improve the video playing quality, the video is usually required to be deblurred and subjected to time domain consistency processing. However, in the current video enhancement processing method, the temporal consistency is usually processed according to the factor of the temporal consistency problem, and the motion blur problem cannot be processed while the temporal consistency problem is solved, which is not beneficial to improving the video optimization efficiency.

Disclosure of Invention

The embodiment of the application discloses a training method of a video optimization model and electronic equipment, and aims to solve the problems that the time domain consistency problem and the motion blur problem cannot be processed simultaneously when video images are optimized at present, and the video optimization efficiency cannot be improved.

In order to solve the technical problem, the application discloses the following technical scheme:

in a first aspect, an embodiment of the present application provides a training method for a video optimization model, where the method includes: the electronic equipment obtains a sample video, carries out alignment processing on a fuzzy video included in the sample video through a video optimization model, inputs an aligned characteristic image into a deblurring network model in the video optimization model to obtain a first output image of the output video, determines an optical flow according to a label video, transforms the first output image according to the determined optical flow to obtain a second output image adjacent to the first output image, adjusts parameters of the video optimization model according to the difference between the first output image and the first label image and the difference between the second output image and the second label image until the calculated difference after the parameters are adjusted meets a preset requirement, and finishes training of the video optimization model.

The first label image and the second label image are images in the label video included in the sample video. The first label image is associated with the content of the first output image and the second label image is associated with the content of the second output image. The content of the label image is related to the content of the output image, and the label image and the output image have the same content, or the label image and the output image are obtained by transformation according to the same blurred image.

The image in the blurred video comprises an image with definition lower than a preset definition requirement and a time domain consistency problem. The image in the label video is the image with the definition higher than the preset definition requirement and without the time domain consistency problem.

After the electronic equipment generates a first output image according to the video optimization model, the first output image is subjected to time domain transformation according to the optical flow determined by the label video, and a second output image adjacent to the first output image is obtained. Since the second output image is obtained by transforming the first output image by the optical flow determined by the standard tag video, the time domain features included in the second output image are consistent with the first output image, and the second difference is determined by comparing the second output image with the second tag image with the same time position, so that the time domain consistency in the output video of the video optimization model can be effectively reflected. The first difference is determined by comparing the first output image with the first label image, and the image quality problem in the output video of the video optimization model can be reflected. And optimizing and adjusting parameters of the video optimization model according to the determined first difference and the second difference until the difference calculated by the video optimization model after the parameters are adjusted meets the preset convergence requirement, thereby completing the training of the model. The training process does not need to limit the reason causing the time domain consistency problem, so that the trained video optimization model can adapt to the optimization processing of the video with the time domain consistency problem caused by different reasons.

The sample video comprises a fuzzy video and a label video. The label video is a video obtained by optimizing a video in a time domain and a space domain, and the fuzzy video is a video with an image comprising time domain noise and space domain noise, namely the image with the definition of the image in the video less than the preset definition requirement.

The sample video comprises a fuzzy video and a label video, when the sample video is obtained, the image quality and time domain consistency of the video to be trained can be detected, the video type of the video to be trained is determined, and a corresponding sample video obtaining method is selected according to the determined video type. Or, the type of the video to be trained may be specified, and the corresponding sample video acquisition method may be determined according to the specified type.

With reference to the first aspect, in a first possible implementation manner of the first aspect, when the video to be trained is a blurred video, time-domain consistency processing may be performed on an image of the blurred video to obtain a time-domain stable video, and then image enhancement processing is performed on the time-domain stable video to obtain a tag video in the sample video. And obtaining a sample video for model training according to the generated label video and the fuzzy video.

With reference to the first aspect, in a second possible implementation manner of the first aspect, when a video to be trained is a video with a stable time domain, degradation processing may be performed on an image of the video to be trained to obtain a blurred video of a sample video; and performing image enhancement processing on the image of the video to be trained to obtain a label video of the sample video. The image quality of the video to be trained is enhanced to obtain a label image, and the time domain consistency of the video to be trained is degraded to obtain a fuzzy image, so that model training can be performed according to the obtained sample video.

When the time domain consistency processing is carried out on the blurred video, parameters in the video can be adjusted according to reasons causing the time domain consistency problem, so that the adjusted parameters meet the time domain consistency requirement.

Therefore, the acquired video to be trained, namely the original video, is subjected to transformation processing, and the fuzzy video and the label video required by the sample video can be generated, so that the label video and the fuzzy video included in the sample video can be effectively generated.

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, when video degradation with temporal consistency is processed into a blurred video, intra-frame filtering may be performed on an image of the video to be trained; and/or performing inter-frame filtering on the image of the video to be trained. The intra-frame filtering includes, for example, gaussian filtering, radial filtering, block filtering, and the like. Inter-frame filtering involves blurring the image between different frames.

The electronic equipment inputs the aligned characteristic image into a preset deblurring network model to obtain a first output image in an output video, and the method comprises the following steps: the electronic equipment inputs the t frame image after alignment processing and the t-1 frame image into a preset deblurring network model, and outputs a residual error between the t frame output image and the t frame image of the blurred video, wherein t is greater than or equal to 2; and the electronic equipment sums the residual error and the t frame image of the blurred video to obtain a t frame output image, wherein the t frame output image is a first output image.

And introducing a residual error block into the deblurring network model, enabling the deblurring network model to output a residual error, and generating a first output image according to the output residual error and the image of the blurred video, so that the deblurring network model can effectively solve the problem of deepening the layer number during optimization training.

Where T is greater than or equal to 2 and less than T, T is the frame number of the sample video, and T can be used to represent the sequence number of the images in the video. Two adjacent frames of images in the blurred video are input into the deblurring network model for one-time calculation, and a first output image can be output. It will be appreciated that the first output image output may vary from one input blurred image to another, and from one parameter to another in the deblurring network model.

With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the inputting, by an electronic device, an image in the blurred video to an image alignment module, and performing alignment processing on adjacent images in the blurred video includes: the electronic equipment inputs the t frame image and the t-1 frame image of the blurred video into a preset image alignment module, and outputs a characteristic image of the t frame image aligned with the t-1 frame image.

And sequentially aligning the blurred images of two adjacent frames, for example, aligning the t frame blurred image with the t-1 frame blurred image, and aligning the t +1 frame blurred image with the t frame blurred image to respectively obtain the t frame aligned image and the t +1 frame aligned image. After the fuzzy video is aligned, the 1 st frame image and the aligned T-1 frame characteristic image can be obtained. The 1 st frame image and the T-1 frame aligned feature image can be input into a deblurring network model for deblurring calculation.

With reference to the first aspect, in a sixth possible implementation manner of the first aspect, the loss between the first output image and the first label image includes a loss of detail and a loss of perception of the first output image and the first label image, and the loss between the second output image and the second label image includes a loss of temporal consistency; adjusting parameters of the video optimization model according to the determined loss, including: determining a weight coefficient alpha for a temporal consistency loss _t Weight coefficient of perceptual loss alpha _p And weight coefficient alpha of detail loss _e Determining a total loss of the video optimization model; and adjusting parameters of the video optimization model according to the total loss.

With reference to the sixth implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the adjusting parameters of the video optimization model according to the total loss includes: a first training process, adjusting the parameters of the video optimization model according to a first weight coefficient combination; a second training process, adjusting the parameters of the video optimization model according to a second weight coefficient combination; wherein, in the first weight coefficient combination, α _t >α _p And, in the second weight coefficient combination, α _t <α _p 。

With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, in the first weight coefficient combination or the second weight coefficient combination, the weight coefficient α of the loss of detail is _e A weighting factor alpha with said perceptual loss _p Is greater than 8:1 and less than 100: 1.

Wherein alpha is _t Weight coefficient, alpha, for temporal coherence loss _p For perceptual loss of weight coefficient, α _e Weight coefficients for loss of detail. When the weight coefficient of the temporal consistency loss is large, for example, the weight coefficient of the temporal consistency loss is more than 1.2 times of the weight coefficient of the perceptual loss, the determined total loss is more than the temporal consistency loss, and the output image has better temporal consistency by performing parameter adjustment according to the total loss. After the first training process, the weight coefficient of the perception loss is improved, so that the trained model has better image quality. By means of a separate emphasis training mode, the video optimization model can be converged more easily, and training efficiency of the video optimization model is improved. And after the time domain consistency parameters are trained preferentially, the training difficulty of the time domain consistency parameters can be reduced, and the system training efficiency is improved.

In a second aspect, an embodiment of the present application provides a video optimization method, including obtaining a video to be optimized; and inputting the image of the video to be optimized into the trained video optimization model according to any one of the first aspect to obtain an image of the optimized output video.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor coupled with a memory, the memory being configured to store instructions, and the processor being configured to execute the instructions in the memory, so that the electronic device performs the method according to any one of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which instructions are stored, and when the instructions are executed, the method according to any one of the first aspect is implemented.

Drawings

FIG. 1 is a schematic diagram of a video with stripes or flicker;

fig. 2 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure;

fig. 3 is a block diagram of a software structure of an electronic device according to an embodiment of the present application;

fig. 4 is a schematic implementation flowchart of a training method for a video optimization model according to an embodiment of the present application;

fig. 5 is a schematic diagram of sample video generation provided by an embodiment of the present application;

fig. 6 is a schematic diagram of another sample video generation provided by an embodiment of the present application;

fig. 7 is a schematic structural diagram of an alignment module according to an embodiment of the present disclosure;

fig. 8 is a schematic diagram of a training structure of a video optimization model according to an embodiment of the present application;

fig. 9 is a schematic diagram of a reference learning process provided in an embodiment of the present application;

fig. 10 is a schematic diagram of a video optimization method according to an embodiment of the present application;

fig. 11 is a schematic diagram of a training apparatus for a video optimization model of a video according to an embodiment of the present application;

fig. 12 is a schematic diagram of a video optimization apparatus according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions in the embodiments of the present application better understood and make the above objects, features and advantages of the embodiments of the present application more obvious and understandable to those skilled in the art, the technical solutions in the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Before describing the technical solutions of the embodiments of the present application, moral refers to the technical scenes and related technical terms of the present application in combination with the drawings.

The technical scheme of the embodiment is applied to the technical field of image processing, and mainly aims at image quality enhancement and time domain consistency processing of images of a series of continuous frames in a video. The image quality enhancement may include an optimized adjustment of intra-frame image blur and inter-frame image blur of the video.

The intra-frame image blur refers to a blur in the spatial distribution of an image due to optical diffraction, spherical non-uniformity, and the like in the imaging process. For example, due to the diffraction effect of light, the size of the small hole through which the light passes is too small to generate image blur. Or, the light passing through the small hole is too large, which causes the light of one pixel point to come from a plurality of object points, and thus the collected image is blurred.

The inter-frame image blur can refer to blur generated by inconsistent inter-frame definition due to relative motion between a camera and a shooting object in a video capture process. For example, in the shooting process, the shooting target is in a stationary state and the camera is in a moving state, or the target is in a moving state and the camera is in a stationary state or a relative moving state. Due to the generation of relative motion, the image in the video taken by the camera generates an artifact blur.

Temporal consistency processing includes stripes (Banding) that occur within the same frame image of the video, and flicker (flickering) that occurs between different frames of the video. When the object in the image in the processed video is not changed, the change which can be perceived by a user is not generated, and the video has no time domain consistency problem.

The reason for generating the stripes (Banding) or the flickers is shown in fig. 1, and when a scene of video shooting is under a light source device such as a fluorescent lamp, the light source can generate light energy with a certain frequency according to the frequency of a power supply. For example, the power supply shown in FIG. 1 has a frequency of 50Hz and the generated light energy has a frequency of 100Hz (since the energy is not directional). The exposure mode of the CMOS light sensor is line-by-line exposure, and exposure is carried out according to a set exposure period. The time points of starting exposure of all the pixels in the same row are the same, and the time length of exposure required by all the pixels in any row is the same. Therefore, the exposure time points of the pixels of different lines in the same frame screen are different. At different time points, the light energy generated by the light source device may be different, so that the brightness of the pixels in different rows in the same frame image is different, and bright and dark stripes (Banding) are formed in the same frame image.

The reason for flicker (Flicking) between different frames may be that the frame rate of video acquisition is greater than the energy frequency of the power supply. For example, the energy frequency of a commercial power supply with a frequency of 50Hz is 100Hz, if the frame rate is large, for example, the frame rate is greater than 100Hz, pixel points located at the same position may have different brightness when acquiring different frames according to a preset exposure period, so that different frames exhibit different brightness display effects, and flicker (Flicking) phenomena with different brightness occur between different frames. Without being limited thereto, when the captured video is compressed and encoded, the video images may have a problem of inconsistent time domain, such as flickering.

Currently, a video may be subjected to a targeted deblurring process according to the cause of image blur. The time domain consistency can be optimized in a targeted manner according to the reason of the time domain consistency problem generated by the video.

For example, under the condition that the blur kernel is unknown, blind deconvolution can be adopted to recover and obtain a clear image, and under the condition that the blur kernel is known, methods such as wiener filtering and Richardson-lucy are adopted to recover and obtain the clear image. And recovering the obtained image according to the size of the fuzzy area, such as global deblurring and local deblurring.

For reasons of time domain consistency, such as shooting factors, post-clipping factors or transcoding factors, corresponding time domain consistency processing algorithms can be adopted to make videos have time domain consistency. However, in the current image deblurring algorithm, when the video is processed with the time domain consistency problem, the deblurring processing of the video cannot be completed at the same time.

In order to optimize the time domain consistency of a video and optimize the fuzzy problem of the video, the embodiment of the application provides a training method of a video optimization model.

According to the training method of the video optimization model, the sample video is obtained and comprises the fuzzy video and the label video after the fuzzy video is optimized. Inputting the blurred video into a video optimization model, wherein the method comprises the steps of inputting two adjacent frames (for example, a t-th frame and a t-1-th frame) of the blurred video into an alignment module, and outputting an aligned feature image through the alignment module. And taking the aligned characteristic images as the input of the deblurring network model to obtain first output images corresponding to the images of two adjacent frames in the blurred video. And determining optical flow of the tag video (such as optical flow from the t-th frame to the t-1-th frame) according to two adjacent frames (such as the t-th frame and the t-1-th frame) of images which are positioned in the same position as the characteristic image of the blurred video and are positioned in the tag video. And generating a second output image adjacent to the first output image by combining the output images obtained by the deblurring network model according to the determined optical flows of the label images (for example, the t-1 frame output image can be calculated and generated as the second output image by combining the t-1 frame output image according to the optical flows from the t-frame label images to the t-1 frame label images). Determining the loss of the first output image and the first label image of the label video, including the detailed loss and the perception loss, and the loss of the second output image and the second label image, namely the loss of the consistency of the time domain, and adjusting the parameters of the video optimization model until the adjusted loss meets the preset requirement. Wherein T is an integer greater than 1 and less than T, T is the frame number of the sample video, and a T-1 group of sample images can be generated from the sample video. The T-1 group of sample images can be called in sequence according to a time sequence to train the video optimization model, and the T-1 group of sample images can also be called randomly to train.

The training method of the video optimization model and the video optimization method provided by the embodiment of the application can be applied to electronic equipment, and the electronic equipment can be a terminal and also can be a chip in the terminal. The terminal may be, for example, an electronic device such as a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the specific type of the electronic device is not limited in this embodiment.

Fig. 2 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 2, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, an audio module 140, a speaker 140A, a microphone 140B, a headphone interface 140C, a sensor module 150, a camera 160, a display screen 170, and the like. The sensor module 150 may include a pressure sensor 150A, a gyroscope sensor 150B, an acceleration sensor 150C, a distance sensor 150D, a fingerprint sensor 150E, a touch sensor 150F, an ambient light sensor 150G, and the like.

It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), and/or a neural-Network Processing Unit (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors.

Wherein the controller may be a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution. The method is used for executing the image transformation network model training method, training and optimizing the image transformation network model, and can be used for optimizing the video according to the image transformation network model after training and optimizing.

A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system. For example, after the training of the image transformation network model is completed, the related data of the image transformation network model may be stored in a cache memory in the processor 110, which facilitates to improve the processing efficiency of the system for video optimization.

In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, and/or a Universal Serial Bus (USB) interface, etc.

The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 150F, the charger, the flash, the camera 160, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 150F through an I2C interface, so that the processor 110 and the touch sensor 150F communicate through an I2C bus interface, thereby implementing a touch function of the electronic device 100, and enabling the electronic device to receive a video shooting instruction, a video optimization instruction, and the like through the touch sensor.

The MIPI interface may be used to connect the processor 110 with peripheral devices such as the display screen 170, the camera 160, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, the processor 110 and the camera 160 communicate through a CSI interface to implement a shooting function of the electronic device 100, and the shot video may be optimized through the optimized image transformation network model. The processor 110 and the display screen 170 communicate via the DSI interface to implement the display function of the electronic device 100, so that the electronic device can view the optimized video picture through the display screen.

The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 160, the display 170, the wireless communication module 160, the audio module 140, the sensor module 150, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.

The USB interface 130 is an interface conforming to the USB standard specification, and may be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device, such as receiving video files transferred from other devices or memories, or transferring video files of the electronic device to other devices or memories through the USB interface.

It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.

The electronic device 100 implements display functions, such as displaying a preview image when capturing a video or displaying a picture when playing a video, through the GPU, the display screen 170, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 170 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

The display screen 170 is used to display images, video, etc., such as to display optimized video. The display screen 170 includes a display panel. The display panel may adopt a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (active-matrix organic light-emitting diode, AMOLED), a flexible light-emitting diode (FLED), a miniature, a Micro-oeld, a quantum dot light-emitting diode (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or M display screens 170, M being a positive integer greater than 1.

The electronic device 100 may implement a photographing function through the ISP, the camera 160, the video codec, the GPU, the display screen 170, the application processor, and the like. The ISP is used to process the data fed back by the camera 160. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 160. The shot video can be optimized through the video optimization method, so that a better video display effect can be obtained in application scenes such as live broadcast and video call.

The camera 160 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or M cameras 160, M being a positive integer greater than 1.

The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 is performing video optimization, the video optimization method according to the embodiment of the present application may be performed by a digital signal processor.

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.

The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: in the embodiment of the application, intelligent learning of the image transformation network model can be realized through the NPU, and parameters in the image transformation network model are optimized.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, a file such as a video after or before optimization may be saved in the external memory card.

The internal memory 121 may be configured to store computer executable program codes, including executable program codes corresponding to the training method of the image transformation network model as described in the embodiment of the present application, executable program codes corresponding to the video optimization method as described in the embodiment of the present application, and the like, where the executable program codes include instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.

The electronic device 100 may implement audio functions and audio acquisition through the audio module 140, the speaker 140A, the microphone 140B, the headphone interface 140C, and the application processor. Such as audio playback in video, recording while video was being taken, etc.

The audio module 140 is used to convert digital audio information into analog audio signal output and also used to convert analog audio input into digital audio signal. The audio module 140 may also be used to encode and decode audio signals. In some embodiments, the audio module 140 may be disposed in the processor 110, or some functional modules of the audio module 140 may be disposed in the processor 110.

The speaker 140A, also called a "horn", is used to convert audio electrical signals into sound signals. The electronic apparatus 100 can listen to audio in video or the like through the speaker 140A.

The microphone 140B, also called "microphone", is used to convert sound signals into electrical signals. When a video is captured, sound information in the scene may be captured by the microphone 140B. In other embodiments, the electronic device 100 may be provided with two microphones 140B to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four or more microphones 140B for collecting sound signals, reducing noise, identifying sound sources, and implementing directional recording functions.

The headphone interface 140C is used to connect a wired headphone. The headset interface 140C may be the USB interface 130, or may be a 3.5mm open mobile electronic device platform (OMTP) standard interface, a cellular telecommunications industry association (cellular telecommunications industry association) standard interface of the USA.

The pressure sensor 150A is used for sensing a pressure signal, and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 150A may be disposed on the display screen 170. The pressure sensor 150A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 150A, the capacitance between the electrodes changes. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 170, the electronic apparatus 100 detects the intensity of the touch operation according to the pressure sensor 150A. The electronic device 100 may also calculate the touched position according to the detection signal of the pressure sensor 150A, so as to achieve the acquisition of different instructions, including, for example, a video shooting instruction. In some embodiments, the touch operations that are applied to the same touch position but different touch operation intensities may correspond to different operation instructions. For example: and when the touch operation with the touch operation intensity smaller than the first pressure threshold value acts on the short message application icon, executing an instruction for viewing the short message. And when the touch operation with the touch operation intensity larger than or equal to the first pressure threshold value acts on the short message application icon, executing an instruction of newly building the short message.

The gyro sensor 150B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 150B. The gyro sensor 150B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyro sensor 150B detects a shake angle of the electronic device 100, calculates a distance to be compensated for the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 through a reverse motion, thereby preventing shake, reducing a shake problem of a captured video, and improving quality of the captured video.

The acceleration sensor 150C may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The method can also be used for recognizing the posture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications. For example, when the optimized video is played, the gesture of the electronic device 100 is detected, and the full-screen playing state and the non-full-screen playing state are automatically switched.

A distance sensor 150D for measuring distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, shooting a scene, the electronic device 100 may utilize the distance sensor 150D to range to achieve fast focus, making the picture of the shot video clearer. For example, in a live broadcast scene or a video call scene, a video can be focused on a human face, and the video interaction experience can be improved.

The ambient light sensor 150G is for sensing ambient light level. The electronic device 100 may adaptively adjust video capture parameters including, for example, sensitivity, shutter time, and exposure according to the perceived ambient light brightness.

The fingerprint sensor 150E is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to unlock the fingerprint, access the application lock, photograph the fingerprint, answer an incoming call with the fingerprint, and so on.

The touch sensor 150F is also referred to as a "touch panel". The touch sensor 150F may be disposed on the display screen 170, and the touch sensor 150F and the display screen 170 form a touch screen, which is also called a "touch screen". The touch sensor 150F is used to detect a touch operation acting thereon or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 170. In other embodiments, the touch sensor 150F can be disposed on a surface of the electronic device 100, different from the position of the display screen 170.

The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the electronic device 100.

Fig. 3 is a block diagram of a software structure of the electronic device 100 according to the embodiment of the present application. The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom. The application layer may include a series of application packages.

As shown in fig. 3, the application package may include a camera, video, etc. application.

The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.

As shown in FIG. 3, the application framework layer may include a content provider, a view system, an explorer, a notification manager, and the like.

Content providers are used to store and retrieve data and make it accessible to applications. The data may include video, images, and the like.

The view system includes visual controls, such as controls to display text, controls to display video, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including video icons may include a view showing text and a view showing pictures.

The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.

The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform video downloads, video optimization completion, message reminders, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, flashing an indicator light, etc.

Android Runtime includes a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.

The core library comprises two parts: one part is a function which needs to be called by java language, and the other part is a core library of android.

The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application layer and the application framework layer as binary files. The virtual machine is used for performing the functions of object life cycle management, stack management, thread management, safety and exception management, garbage collection and the like.

The system library may include a plurality of functional modules. For example: media Libraries (Media Libraries), image processing Libraries, and the like.

The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like.

After the sample video used for training is obtained, the electronic device can input images of the blurred video in the sample video, namely the blurred images, into the video optimization model, perform alignment processing through an alignment module in the video optimization model, and input the feature images after the alignment processing into a deblurring network model in the video optimization model to obtain a first output image. The second output image can be obtained by extracting the optical flow in the label video and combining the first output image output by the deblurring network model. And adjusting parameters of the video optimization model according to the loss between the first output image and the corresponding first label image in the label video, such as the loss of detail and the loss of perception, and the loss between the second output image and the corresponding second label image in the label video, such as the loss of time domain consistency, until the loss, including the loss of detail, the loss of perception and the loss of time domain consistency, meet the preset requirements, thereby completing the training of the video optimization model. And according to the trained image transformation model, combining an alignment module, and performing video optimization processing on any fuzzy video.

The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.

The terminal having the structure shown in fig. 2 and 3 may be used to perform a training method or a video optimization method of an image transformation network model provided in the embodiments of the present application. For convenience of understanding, the following embodiments of the present application will specifically describe an image processing method in a shooting scene provided by the embodiments of the present application, by taking a mobile phone having a structure shown in fig. 2 and fig. 3 as an example, with reference to the accompanying drawings.

Fig. 4 is a schematic implementation flow diagram of a training method of a video optimization model provided in an embodiment of the present application, which is detailed as follows:

in S401, the electronic device acquires a sample video.

The sample video in the embodiment of the present application may be understood as a video used for training a video optimization model. The sample video comprises a plurality of frames of images, and each frame of image in the sample video can be labeled in sequence according to the sequence of the images. For example, the sample video includes T frame images, and the T frame images are sequentially marked as a 1 st frame image and a 2 nd frame image … … th frame image of the sample video according to the sequence of each frame image in the sample video. For convenience of description, an ith (0< i < T) frame image of the sample video may be denoted as an ith frame sample image. The sample video comprises a fuzzy video and a label video. The ith frame image in the blurred video can be represented as the ith frame blurred image, and the ith frame image in the tagged video can be represented as the ith frame tag image.

The blurred video comprises images of temporal noise and blur problems. The temporal noise may include noise generated by differences in display parameters between different frames, such as noise generated by differences in brightness or contrast between two frames. The blur problem may include intra-frame blur or inter-frame blur. The intra-frame blurring comprises blurring presented when the definition of a single-frame image is smaller than a preset definition requirement, and the inter-frame blurring comprises blurring presented due to the difference of the definitions among different frames.

The marked ith frame image in the blurred video and the marked ith frame image in the label video are images with related or same image contents. In a possible understanding mode, the image with the same content or related to the content in the tag video and the blurred video can be understood as performing deblurring and image quality enhancement processing on the image in the blurred video to obtain the tag image in the tag video with the same content or related to the content, or performing blurring processing and image quality reduction on the tag image to obtain the blurred image with related content or the same content. The frame i label image can be obtained by deblurring and image quality enhancement processing according to the frame i blurred image; the frame i blurred image can be obtained by performing blurring processing and image quality reduction on the frame i label image.

In this embodiment, the "obtaining a sample video" may refer to that the electronic device receives a sample video from another device, or the electronic device reads the sample video in a local storage, or in a possible implementation manner, the sample video may be generated according to the received video or according to a locally read video.

The blurred video image comprises time domain noise and space domain noise. The spatial noise may be expressed as the definition of the image not meeting the predetermined requirement. Temporal noise can be generated by changes in brightness, contrast, sharpness of adjacent frames of the video.

For example, for an image in a blurred video, that is, a blurred image, intra-frame blur or inter-frame blur may exist, and a difference between one or more parameters of brightness, contrast, or sharpness of two adjacent frames may exist, which is greater than a preset temporal difference threshold. The intra-frame blur may refer to a blur generated by a low definition of an image of a blurred video, and the inter-frame blur may refer to a blur generated by a non-uniform definition between images in the blurred video.

When a blurred video is played, because parameters such as brightness or contrast of different frames are different, a flickering state is presented. The images in the blurred video are blurred, and artifacts appear on the images which may appear during playing, so that the watching experience of a user is influenced.

When the blurred video is converted into the tag video, image enhancement processing for deblurring a blurred image in the blurred video and time domain consistency processing for the blurred video are required.

The image of the tag video is an image which meets the preset definition requirement and time domain consistency requirement. For example, the image quality of the image of the tag video may be set to be higher than a predetermined requirement, and the image may not have the temporal consistency problem between different frames. The image of the tag video and the image of the blur video correspond to each other in time. Namely: and the content of the ith frame image of the label video is similar to or identical to that of the ith frame image of the fuzzy video. The image of the tag video can be obtained by performing image quality enhancement processing and time domain consistency processing on the blurred video. For example, the content of the ith frame image of the tag video is the same as that of the ith frame image of the blur video, but the parameters of the pictures may be different. In the processing process, the image quality enhancement processing and the time domain consistency processing are carried out on the ith frame image of the blurred video, so that the ith frame image of the label video is obtained.

The evaluation index of the image definition may include one or more of Brenner gradient function, Tenengrad gradient function, Laplacian gradient function, SMD grayscale function, and the like.

In the process of generating the sample video, the sample video required by the embodiment of the present application can be obtained according to different processing modes according to different obtained original videos.

As shown in fig. 5, the acquired original video is a blurred video. In order to obtain a complete sample video, a tag video (Ground route) corresponding to the original video needs to be generated. When the tag video is generated, image quality enhancement processing of improving the definition of each frame of blurred image of the original video, namely the blurred video, can be performed to obtain a clear image corresponding to the blurred image in the blurred video. A video consisting of sharp images, we may refer to as a sharp video. The image quality enhancement process may use an algorithm such as DeblurGAN, SRN-DeblurNet, DeblurGAN-v2, EDVR, or PSS-NSC to convert the blurred image into a sharp image that satisfies the requirements for sharpness in and between frames. However, there are still differences in parameters such as brightness and contrast between different frames of a clear video, and there are flickering problems such as light and dark alternation or contrast change in the time dimension when the clear video is played.

The temporal consistency processing can be further performed on the clear video. The time domain consistency problem of the clear video can be adjusted according to the reason of causing the time domain consistency fault. The method can adjust the clear images in the clear video according to the reasons of shooting equipment, the reasons of shooting scenes, the coding reasons and the like, so that the contrast or brightness of the adjusted images is kept consistent, and the tag video meeting the time domain consistency requirement is obtained.

In a possible implementation manner, after the time domain consistency processing is performed on the original video to obtain the video with the time domain consistency, the definition of the image in the video is further improved to obtain the label image corresponding to the original video.

In a possible implementation, the acquired original video may not be a blurred video. In the sample video generation diagram shown in fig. 6, there is a problem of sharpness of the image included in the acquired original video, and the sharpness of the image in the original video needs to be improved. However, there is no temporal consistency problem between images in the original video. In order to obtain the required label video and blurred video, the original video needs to be subjected to transformation processing.

When generating the blurred video, the original image included in the original video may be subjected to the degradation processing, resulting in images with different brightness or contrast. For example, the brightness and/or contrast of different frame images in the original video may be adjusted according to a random brightness and/or random contrast manner, so as to obtain images with different brightness and/or contrast. Blurred video is generated from images of different brightness and/or contrast.

For example, the variation range of the random brightness may be set to be [ -50,50], the variation range of the random contrast may be [ -0.5, 1.5], and in the set variation range of the random brightness and/or the variation range of the random contrast, the adjustment parameters of the brightness and/or the contrast of different frames are randomly determined, so that the time domain of the video after random adjustment does not have consistency, and there is a problem of time domain consistency. Of course, the adjustment parameters of the brightness and/or contrast of different frames may be determined according to a preset variation form without being limited to random brightness and random contrast.

When the tag video is generated, because the definition of the image of the original video has a defect, in order to meet the definition requirement of the image in the tag video, image quality enhancement (or image enhancement) processing may be performed on the original image, and each frame of image in the original image is converted into an image whose definition meets a preset requirement. Because the original video has time domain consistency, an image meeting the definition requirement is obtained after image quality enhancement transformation, namely the label image.

Of course, it is not limited to the cases shown in fig. 5 and 6. In this case, the original video may be a tagged video, and the degradation processing, including adding noise to the image of the original video, reduces the sharpness of the image of the original video, and then further performs the degradation processing on the image with the reduced sharpness, for example, by performing random contrast processing and/or random brightness processing on the images of different frames, so as to obtain a blurred image with the temporal consistency problem and the image quality problem. Alternatively, the sharpness of the image may be reduced by intra-frame filtering or inter-frame filtering. The intra-frame filtering includes, for example, gaussian filtering, radial filtering, block filtering, and the like. Inter-frame filtering involves blurring the image between different frames.

Alternatively, the original video may have temporal consistency problem, but the image is a clear image. In order to obtain the blurred video in the sample video, noise can be added to the original image of the original video, so that the image quality of the original image is reduced, and the blurred video is generated according to the processed image.

And further performing time domain consistency processing on the original video. The time domain consistency problem of the original video can be adjusted according to the reason causing the time domain consistency fault. For example, the original image in the original video may be adjusted according to the reason of shooting a scene, the coding reason, and the like, so that the contrast or brightness of the adjusted image is kept consistent, thereby obtaining a tag video with time domain consistency.

In S402, the images in the blurred video are input to an image alignment module in the video optimization model, and adjacent images in the blurred video are aligned.

In this embodiment, the image alignment module may include a feature-based alignment method, a deep learning-based alignment method, an optical flow-based alignment algorithm, or a Deformable (Deformable) convolution-based PCD (all called Pyramid, cascade and Deformable, chinese) alignment module.

Fig. 7 is a schematic structural diagram of a PCD alignment module according to an embodiment of the present application. As shown in fig. 7, the PCD alignment module includes a pyramid module and a cascade module.

Wherein in the pyramid module, low resolution features are first aligned by a coarse estimation, and then the offset and aligned features are propagated to a higher resolution to achieve more accurate motion compensation. In addition, an additional deformable convolution is cascaded after the pyramid alignment operation, so that the alignment robustness can be further improved.

As shown in fig. 7, in the cascade module, for any input two frame blurred images, such as the t-th frame blurred image and the t-1-th frame blurred image in fig. 7, feature maps of three sizes of an L1 layer, an L2 layer, and an L3 layer may be generated. Here, the L1 layer feature map is in accordance with the size of the input blurred image. The L2-layer feature map is an image of size 1/2 of the blurred image obtained by performing convolution calculation on the input image, and the L3-layer feature map is an image of size 1/4 of the blurred image obtained by performing convolution calculation on the input image.

At the layer L3, the feature maps of any input two frames of blurred images (such as the t-th frame blurred image and the t-1 th frame blurred image) are connected together at the layer L3, and the offset of the t-th frame relative to the t-1 th frame in the layer L3 is obtained through CNN convolution calculation. And then using the offset for the t-1 frame blurred image deformable convolution (DConv) calculation to obtain the feature image after the L3 layer alignment, performing upsampling on the feature image after the L3 layer alignment and the offset of the L3 layer, and transmitting the upsampled feature image to the L2 layer.

At the layer L2, the feature maps of the two input frames of blurred images (such as the t-th frame blurred image and the t-1 th frame blurred image) at the layer L2 are connected together, and the offset of the t-th frame relative to the t-1 th frame in the layer L2 is obtained through CNN convolution calculation and by combining the offset propagated by the layer L3. And then the offset is used for the deformable convolution (DConv) calculation of the t-1 frame blurred image, and the aligned feature images propagated by the L3 layer are combined to obtain the L2 layer aligned feature images. And the aligned signature of the L2 layer and the offset of the L2 layer are up-sampled and propagated to the L1 layer.

At the layer L1, the feature maps of the two input frames of blurred images (such as the t-th frame blurred image and the t-1 th frame blurred image) at the layer L1 are connected together, and the offset of the t-th frame relative to the t-1 th frame in the layer L1 is obtained through CNN convolution calculation and by combining the offset propagated by the layer L2. And then the offset is used for the deformable convolution (DConv) calculation of the t-1 frame blurred image, and the aligned feature images propagated by the L2 layer are combined to obtain the L1 layer aligned feature images.

In the cascade module, the aligned feature map obtained by the L1 layer is connected to the t frame blurred image to obtain an offset of the t frame from the aligned feature map, and the feature map of the L1 layer is subjected to the deformable convolution calculation according to the offset to obtain the aligned feature image obtained by the cascade module.

According to the PCD pair module, alignment processing can be sequentially carried out on blurred images comprising T frame images to obtain T-1 frame aligned feature images, the T-1 frame aligned feature images comprise 2 frame aligned feature images output by a 1 st frame blurred image and a 2 nd frame blurred image, 3 frame aligned feature images output by a 2 nd frame blurred image and a 3 rd frame blurred image, and the like to obtain T-1 frame aligned feature images output by a T-1 th frame image and a T-th frame image, and the T frame aligned feature images are obtained by combining the 1 st frame blurred image.

In S403, the feature image after the alignment processing is input into a deblurring network model in the video optimization model, so as to obtain a first output image in the output video.

As shown in the schematic diagram of the training structure of the video optimization model shown in fig. 8, the video optimization model in the embodiment of the present application may capture the spatiotemporal correlation in the blurred video by embedding a convolution Long-Short Term Memory artificial neural network (referred to as a convolution Long-Short Term Memory, abbreviated as conv-LSTM). Spatial features are extracted through convolution operation, and the extracted spatial features are loaded into an LSTM network capable of extracting time sequence features, so that the trained video optimization model can effectively enhance the image quality of a video and can solve the problem of time domain consistency of the video.

In the embodiment of the application, a residual block can be included in a deblurring network model in the video optimization model, so that a blurred image (I) in a video is blurred _t ，I _t-1 ) After the calculation by the deblurring network model, the output data is a residual error, and the first output image is not directly output. For example, the first output image may be the t-th frame output image, and the t-th frame output image O _t Can be expressed as: o is _t ＝I _t +F(I _t ) Wherein, I _t For the t-th frame image (t-th frame blurred image) of the blurred video, F (I) _t ) Residual errors output for the deblurred network model. By introducing the residual block, the deblurring network model outputs the difference between the output image of the t th frame and the blurred image of the t th frame, so that the deblurring network model can effectively solve the problem that the layer number is deepened during optimization training. Wherein T is a natural number greater than or equal to 2.

It is understood that the first output image may be the image of the first frame in the output video. The second output image is an image adjacent to the first output image, the second label image is an image adjacent to the first label image, and the content of the second label image is associated with the content of the second output image.

Blurred images of two adjacent framesThe method can be a T frame blurred image and a T-1 frame blurred image, wherein T is a natural number which is greater than 1 and less than T, and T is the frame number of the blurred video. After fuzzy images of two adjacent frames are input into the deblurring network model, residual errors F (I) are obtained through calculation by the computing processing of the deblurring network model _t ) I.e. the difference between the output image of the t-th frame and the blurred image of the t-th frame. According to the output residual (or called residual image), summing the pixels at the corresponding positions with the input t frame blurred image to obtain the t frame output image, namely the first output image. The pixels at the corresponding positions are summed, which can be understood as summing the pixels at the same positions of the blurred image and the residual image. For example, the pixel value of the first pixel in the upper left corner of the blurred image is summed with the pixel value of the first pixel in the upper left corner of the residual image, and the summation result is the pixel value of the first pixel in the upper left corner of the output image.

The residual error output by the deblurring network model is the difference value between the t-th frame output image (i.e. the t-th frame image of the output video) and the t-th frame blurred image (i.e. the t-th frame image of the blurred video). The difference may be a pixel difference between pixels at corresponding positions in the two images.

For example, the pixel value of the pixel point at the position a of the t-th frame output image is (r1, g1, b1), the pixel value of the pixel point at the position a of the t-th frame blurred image is (r2, g2, b2), and the residual corresponding to the pixel point can be represented as (r1-r2, g1-g2, b1-b 2).

In the embodiment of the application, the blurred images of the blurred video input into the deblurring network model, that is, the t-th frame blurred image and the t-1-th frame blurred image, may be any two adjacent frames of images in the blurred video. When training the video optimization model, the parameters in the video optimization model can be input to the video optimization model in sequence or randomly according to the playing sequence of the multi-frame images included in the fuzzy video, so as to complete the training of the parameters in the video optimization model. For example, the blurred images of frame 1 and frame 2 may be sequentially aligned and then input to a deblurring network model for training, the images of frame 2 and frame 3 may be aligned and then input to a deblurring network model for training … …, and finally the images of frame T-1 and frame T may be aligned and then input to a deblurring network model for training. Alternatively, it is not necessarily limited to the playing order of the blurred images, and any group of blurred images included in the blurred images may be input to the deblurring network model for training in a randomly determined order.

Before training, the video optimization model in the embodiment of the application may determine parameters in the video optimization model in a random generation manner, or may use preset values as parameters in the deblurring network model. The parameters to be trained in the video optimization model may include values of convolution kernels of convolution layers in the deblurring network model, values of convolution kernels of deformable convolution in the image alignment module, and the like. The numerical value of each position in the convolution kernel of the convolution layer in the deblurring network model and the numerical value of the convolution kernel of the deformable convolution in the image alignment module can be continuously corrected according to the output result, so that the output result meets the preset requirement.

Typically, the initialized parameters are not the same as the parameters for which training is complete. Therefore, in the training process, there may be differences between the output image generated by the residual error output by the deblurring network model and the expected output image, i.e., the label image, including detail differences and perceptual differences, or also referred to as detail loss and perceptual loss. The difference information of the generated output image and the tag image in the spatial domain can be determined according to the difference between the output image of the t-th frame and the tag image of the t-th frame, the parameters of the deblurring network model are adjusted, and the spatial domain information of the tag image is learned.

In S404, an optical flow is determined from the tag video, and the first output image is subjected to image conversion based on the determined optical flow to obtain a second output image.

After the first output image is determined, in order to ensure the consistency of the first output image in the time domain, the generated first output image needs to be compared with the label images of other times in the time domain, so that the video corresponding to the image output by the video optimization model has no problem of time domain consistency.

However, two adjacent frames of images in the blurred video are input into the video optimization model for calculation to generate the first output image, and whether the time domain consistency problem exists in the time domain of the output video cannot be determined directly according to the first output image. For example, when the first output image is generated, the contrast or brightness of the image may be adjusted, and when the image quality enhancement requirement is satisfied, the image may not be temporally coincident with an adjacent image of the first output image.

Therefore, the embodiment of the application introduces optical flow, transforms the first output image through the optical flow, and compares the transformed second output image with the second label image with the same time point, so as to optimize the time domain transformation parameters of the video optimization model according to the comparison result.

The optical flow refers to the instantaneous speed of the pixel motion of a spatial moving object on the observation imaging plane. Can be determined by the adjacent tag image in the tag video. I.e., any two adjacent frames of tag images in the tag video, an optical flow can be determined. For example, the 1 st frame tag image and the 2 nd frame tag image can determine optical flow, the 2 nd frame tag image and the 3 rd frame tag image can determine optical flow, and the T-1 st frame tag image and the T-th frame tag image can determine optical flow. Where T is the number of frames of the tag image included in the tag video.

As shown in FIG. 8, a tag image G in a video according to tags _t And G _t-1 When calculating the optical flow, a neural optical flow network (FlowNet) may be selected for optical flow calculation. The neuro-optical flow network may include a simple neuro-optical flow network (FlowNetsimple), a correlated neuro-optical flow network (FlowNetcorr), and the like. And inputting the two frames of images needing to be calculated into a neural optical flow network, and outputting optical flow calculation results of the two frames of images.

Therein, with the first output image O _t The adjacent second output image may be an image O 'of a frame preceding the first output image' _t-1 Or may be a frame image subsequent to the first output image. For example, the first output image is the t-th frame output imageThe second output image may be the output image of the t-1 th frame or the output image of the t +1 th frame.

As for the direction of optical flow, the direction from the t frame label image to the t-1 frame image can be expressed as

Or the direction from the t-1 frame label image to the t frame label image. The selection of the direction of the optical flow is associated with the selection of the second label image. And enabling the first output image to be consistent with the second label image in time through the second output image after optical flow transformation. The selected second label image may be determined by the direction of the optical flow, or the direction of the optical flow may be determined from the selected second label image.

For example, the first output image is a t-th frame output image, and the first label image is a t-th frame label image. If the second label image is the t-1 frame label image, the first output image is transformed into the optical flow of the second output image, which can be the optical flow determined from the t-frame label image to the t-1 frame label image, and the t-1 frame output image is obtained by performing transformation calculation on the t-1 optical flow and the t-frame output image.

If the second label image is a t +1 frame label image, the optical flow transformed from the first output image to the second output image may be the optical flow from the t frame label image to the t +1 frame label image. And performing transformation calculation by combining the t-th frame output image through the t-th optical flow to obtain a t + 1-th frame output image.

The optical flows in the embodiment of the present application may be calculated to obtain T-1 optical flows in advance according to the setting of the second tag image and the T-1 group tag images (i.e., the 1 st frame and the 2 nd frame, and the 2 nd frame and the 3 rd frame … …, respectively) in the tag video. When the first output image is obtained through calculation, the optical flow of the same position is searched according to the position of the first output image, and the first output image is transformed according to the searched optical flow to obtain a second output image. For example, if the first output image is the t-th frame image, the t-1 st optical flow can be found, and the first output image is transformed according to the t-1 st optical flow. And calculating the 1 st optical flow by the 1 st frame tag image and the 2 nd frame tag image, and calculating the t-1 st optical flow by the t-1 st frame tag image and the t-1 th frame tag image.

It should be noted that the sequence number of the first output image in the embodiment of the present application is the same as the larger sequence number of the blurred images adjacent to the two frames used for calculating the output image. For example, the t-1 frame blurred image and the t frame blurred image are input into a video optimization model, residual errors are obtained through model calculation, and the t frame output image is obtained through calculation according to the residual errors. Namely, the blurred video comprising the T frame image is calculated through a video optimization model, and then the output video comprising the 2 nd frame to the T frame output images is obtained.

In S405, parameters of the video optimization model are adjusted according to a difference between the output image and the tag image.

In an embodiment of the present application, the output image includes a first output image and a second output image, and the label image includes a first label image and a second label image. The difference between the output image and the label image comprises the difference between the first output image and the first label image, and the difference between the second output image and the second label image. The first label image and the second label image are adjacent images in the label video. The first label image has the same content as the first output image, and the second label image has the same content as the second output image. The first tag image and the first output image, or the second tag image and the second output image, may have a difference caused by the image quality parameter or the temporal consistency parameter. When the first output image generated by the t-1 th frame blurred image and the t-th frame blurred image is the t-th frame output image, the first label image can be the t-th frame label image. And when the second output image is the t-1 frame output image, the second label image is the t-1 frame label image. And when the second output image is the t +1 th frame output image, the second label image is the t +1 th frame label image.

The differences between the first output image and the first label image may include detail differences and perceptual differences. The perception difference between the first output image and the first label image can be determined by calculating the perception loss of the first output image and the first label image through convolutional neural networks such as a VGG network, AlexNet, LeNet and the like.

In a possible implementation, the perceptual loss function may be based on

Calculating the perception loss, wherein N is the total pixel number of a frame of blurred image or output image, T is the total frame number of the blurred video, phi _l (.) represents feature activation for layer l of the network phi,

pixel values representing the output image of the t-th frame,

pixel value, L, representing the tag image of the t-th frame _p Indicating a loss of perception.

The difference in detail between the first output image and the first label image may be represented by a loss of detail. This loss of detail can be expressed as:

in which, indicating a loss of detail,

pixel values representing the output image of the t-1 th frame,

pixel value, L, representing the tag image of the t-1 th frame _e Showing loss of detail, ε represents a regularization term, which is a small value, such as 10 ^-4 ～10 ^-6 And is used for preventing the reverse propagation gradient of 0 when the loss is zero, so as to cause the death of the neuron.

The difference between the second output image and the second label image can be represented by calculating a temporal consistency loss. The temporal consistency loss function can be expressed as:

wherein, the first and the second end of the pipe are connected with each other,

representing the pixel values of the output image of the t-1 th frame,

pixel value, L, representing the tag image of the t-1 th frame _t And representing time domain consistency loss, wherein N is the total number of pixel points of a frame of blurred image or an output image, and T is the total frame number of the blurred video.

In the training of parameter optimization, the total loss can be determined by the temporal consistency loss, the detail loss and the perception loss. Namely, the total loss can be calculated by time-domain consistency loss, perception loss and corresponding weight coefficients. Can be expressed in the form of a formula:

L _Total ＝α _e L _e +α _p L _p +α _t L _t

L _Tota l represents the total loss, α _t Weight coefficient, alpha, representing a loss of temporal coherence _p Weight coefficient, alpha, representing the perceptual loss _e Weight coefficients representing loss of detail. When the weighting coefficient is larger, the influence of the loss associated with the coefficient on the total loss is larger, and the weighting coefficient is smaller, the influence of the loss associated with the coefficient on the total loss is smaller. For example, when α _t Does not change and increases alpha _p And alpha _e The value of (2) represents the influence of the increased perception loss and the detail loss on the total loss, and the parameters of the video optimization model are trained through the total loss corresponding to the weight coefficient, so that the output image generated by the trained video optimization model has better perception performance, namely the image quality of the output video is better.

When alpha is _p And alpha _e Does not change and increases alpha _t The value of (b) represents the influence of increasing time domain consistency loss on the total loss, and the parameters of the video optimization model are trained through the total loss corresponding to the weight coefficient, so that the training is realizedThe output image generated by the video optimization model has better image quality.

When the total loss L of images in the sample video is calculated according to the network framework shown in FIG. 8 _Total The video optimization model can be stably and effectively converged to a certain value, for example, if the total loss value is smaller than a preset convergence threshold value, the training of the video optimization model can be completed.

In the embodiment of the present application, in order to improve training efficiency, a weight coefficient combination may be formed by weight coefficients of different values of perceptual loss and weight coefficients of temporal consistency loss. Two or more weight coefficient combinations can be adopted to train the parameters of the video optimization model in turn.

In a possible implementation, two weight coefficient combinations may be adopted to train parameters of the video optimization model in turn. In the first weight coefficient combination, the weight coefficient of the perception loss is larger than the weight coefficient of the time domain consistency loss, and in the second weight coefficient combination, the weight coefficient of the perception loss is smaller than the weight coefficient of the time domain consistency loss. Wherein, the second weight coefficient combination can be obtained by adjusting the parameter of the first weight coefficient combination. For example, the weight coefficients of the temporal consistency loss in the first weight coefficient combination may be kept unchanged, and the weight coefficients of the perceptual loss and the detail loss may be increased. Alternatively, the weight coefficients of temporal coherence loss in the first weight coefficient combination may be reduced, and the weight coefficients of perceptual loss and detail loss may be increased or maintained.

Weight coefficient alpha of the loss of detail _e A weighting factor alpha with said perceptual loss _p The ratio of (a) to (b) may be a predetermined value. The ratio may be greater than 8:1 and less than 100:1, for example 10: 1, etc.

Fig. 9 is a schematic diagram of a reference learning process provided in the embodiment of the present application, and for simplicity of description, two adjacent frames in a video are taken as an example for illustration. The left image is a blurred video which has a time domain consistency problem and a picture quality problem. The temporal consistency problem is a front-back inconsistency problem expressed in the temporal domain, for example, a blurred video shown in fig. 9, brightness inconsistency between two adjacent frames of images, and the like. The image quality problem is the image quality problem of a single frame image in a blurred video, and includes problems such as image quality blur.

When training is performed by the first weight coefficient combination, since the weight coefficient of the perceptual loss and the detail loss in the first weight coefficient combination is smaller than the weight coefficient of the temporal consistency loss, for example, the weight coefficient α of the temporal consistency loss may be set _t Weight coefficient alpha for perceptual loss _p More than 1.2 times of the time domain, and a weight coefficient alpha of time domain consistency loss _t Weighting factor alpha for loss of detail _e Over 1.2 times of that of the total loss L _Total And the constraint learning of the time domain consistency parameters is more emphasized. And after the first weight coefficient combination training is converged, the constraint learning of the time domain consistency parameters is completed. Through the learned parameters of the video optimization model, the fuzzy video is subjected to image transformation calculation, the time domain consistency of the image of the output video is effectively improved as shown in the middle graph of fig. 9, and the brightness problem existing between two adjacent frames is obviously relieved.

After learning of the constraint parameters of the time domain consistency is completed, training of the parameters can be further performed through the second weight coefficient combination. In the second weight coefficient combination, the weight coefficient alpha of the perceptual loss _p And weight coefficient alpha of detail loss _e Weight coefficient alpha greater than time-domain consistency loss _t For example, a weight coefficient α of the perceptual loss can be set _p Weighting factor alpha for temporal consistency loss _t More than 1.2 times of the weight coefficient alpha of the perception loss _e Weight coefficient alpha for loss of temporal consistency _t More than 1.2 times. Total loss L _Total And the constraint learning of the perception parameters and the detail parameters is emphasized more. After the second weight coefficient combination training converges, the fuzzy video or the video calculated and output by the parameters after the first weight coefficient combination training may be transformed according to the parameters of the video optimization model after training learning, so as to obtain the output video shown in the right diagram of fig. 9. The quality of the output video is improved and the problem of temporal consistency is overcome.

Wherein the weight coefficient alpha of detail loss _e Weight coefficient alpha with perceptual loss _p The ratio of (b) may be a predetermined value. For example, the set value may be greater than 8:1 and less than 100:1, can be 10: 1. 20: 1, and the like.

When the training optimization of the parameters is carried out through the first weight coefficient combination and the second weight coefficient combination, the parameters are optimized and trained in a mode of gradually emphasizing, and compared with a mode of simultaneously carrying out optimization training, the method can enable the training process to be more easily converged, and therefore the training efficiency can be effectively improved.

Fig. 10 is a schematic diagram of a video optimization model trained by the method for training a video optimization model shown in fig. 4 according to an embodiment of the present application, and optimizing a video to be optimized. Fig. 10 is a schematic diagram illustrating two adjacent frames of images. The video to be optimized may be a video with a temporal consistency problem, or may also be a video with a picture quality problem, or may also be a video with a temporal consistency problem (brightness of adjacent frames is significantly different) and a picture quality problem (picture quality is blurred) as shown in fig. 10. The video optimization model trained by the training method of the video optimization model shown in fig. 4 of the present application is subjected to computational transformation, so that the output video shown in the right image of fig. 10 can be output, in the output video, the definition of the image of the output video is higher than that of the video to be optimized, the brightness change of the adjacent frames of the image of the output video is milder or basically consistent, and the time domain consistency problem and the image quality problem are obviously improved.

When the video is optimized, the video collected by the electronic equipment can be optimized, and the received video sent by other electronic equipment can also be optimized by the electronic equipment. The video optimization method can be used before video playing and can also be used for optimizing during video playing.

Fig. 11 is a schematic diagram of a training apparatus for a video optimization model according to an embodiment of the present application, and as shown in fig. 11, the apparatus includes: a sample video acquisition unit 1101, an alignment unit 1102, a first output image acquisition unit 1103, a second output image acquisition unit 1104, and a parameter adjustment unit 1105.

The sample video acquiring unit 1101 is configured to acquire a sample video, where the sample video includes a blur video and a tag video. The label video is the video expected after video optimization. The image quality of the image in the tag video is good, for example, whether the image in the video meets the requirement of the tag video can be determined through image quality evaluation parameters including parameters such as color, definition and the like. The adjacent images in the tag video have temporal consistency, that is, the variation of the contrast or brightness of the adjacent images is smaller than a preset value. The blurred image is opposite to the label video, the image quality problem and the time domain consistency problem exist in the blurred image, and whether the video is the blurred video can be determined through related parameters which are preset.

The alignment unit 1102 is configured to input an image in the blurred video to an image alignment module in the video optimization model, and perform alignment processing on adjacent images in the blurred video. The image alignment module may include a feature-based alignment method, a deep learning-based alignment method, or a deformable convolution-based PCD alignment module.

The first output image obtaining unit 1103 is configured to input two adjacent frames of images aligned in the blurred video into a deblurring network model in a preset video optimization model, so as to obtain a first output image. Wherein, the output of the deblurring network model may be a residual. The first output image may be obtained from a sum of the residual and an image of the input blurred video. Residual errors are output through the deblurring network model, so that the deblurring network model can effectively solve the problem of deepening of the layer number during optimization training.

The second output image acquisition unit 1104 is configured to determine optical flows of adjacent tag images in the tag video according to the adjacent two frame images in the tag video, and transform the first output image according to the determined optical flows to obtain a second output image adjacent to the first output image. The first output image and the second output image are two adjacent frames of images in the output video. If the first output image is the output image of the t-th frame, the second output image can be the output image of the t-1 th frame or the output image of the t +1 th frame.

When transforming the first output image into the second output image, the selected optical flow may be determined according to the position of the first output image. For example, the first output image is the t-th frame output image, and the optical flow may be the optical flow of the t-1 st frame tag image to the t-th frame tag image, or the optical flow of the t +1 th frame to the t-th frame. And determining the second output image as the t +1 th frame output image or the t-1 th frame output image according to the direction of the optical flow.

The parameter adjusting unit 1105 is configured to determine a loss or a difference between the first output image and the first label image, and a loss or a difference between the second output image and the second label image, adjust a parameter of the video optimization model according to the loss or the difference, until the calculated difference after adjusting the parameter meets a preset requirement, and complete training of the video optimization model. The first label image and the second label image are images in the label video, the first label image is associated with the content of the first output image, and the second label image is associated with the content of the second output image. Here, the content association may be understood as that the content is the same, but the image quality parameter and/or the temporal consistency parameter are different, or may be understood as that the first tag image and the first output image are obtained by the same image transformation, for example, the t-th frame tag image as the first tag image is obtained by the t-th frame blurred image transformation; and calculating a t frame output image serving as a first output image according to the t-1 frame blurred image and the t frame blurred image.

The difference between the first output image and the first label image may be represented by a loss of perception and a loss of detail. The difference between the second output image and the second label image may be represented by a temporal consistency loss.

The perceptual loss may be calculated based on a neural network model, or may be calculated by a formula. For example, the perceptual loss may be expressed as:

wherein N is the total number of pixels of a frame of blurred image or output image, T is the total frame number of blurred video, phi _l (.) represents feature activation for layer l of the network phi,

pixel values representing the output image of the t-th frame,

The loss of detail can be expressed as:

in which, to show the loss of detail,

pixel values representing the output image of the t-1 th frame,

pixel value, L, representing the tag image of the t-1 th frame _e Indicating a loss of detail.

The temporal coherence loss may be by a temporal coherence loss function

And (4) showing. Wherein the content of the first and second substances,

pixel values representing the output image of the t-1 th frame,

the pixel value of the tag image of the T-1 th frame represents the time domain consistency loss, N is the total pixel number of a frame of blurred image or output image, T is the total frame number of the blurred video, and L _t Representing a loss of temporal coherence.

Based on the obtained perception loss and the time domain consistency loss, the total loss can be obtained by combining the summation of corresponding weight coefficients, and the parameters in the video optimization model are optimized and adjusted by calculating the convergence of the total loss.

In a possible implementation manner, the parameters in the video optimization model can be gradually trained according to the sequence through two or more groups of weight coefficients. For example, a weight coefficient with a higher time-domain consistency loss relative to a perceptual loss and a higher detail loss may be trained, and a time-domain parameter in the model is constrained, so that a video generated by the trained model has a better time-domain consistency. And then training by adopting a weight coefficient with higher perception loss and detail loss relative to time domain consistency loss, and constraining the spatial domain parameters in the model to ensure that the video generated by the trained model has better image quality and time domain consistency.

The training device of the video optimization model shown in fig. 11 corresponds to the training method of the video optimization model shown in fig. 4.

Fig. 12 is a schematic diagram of a video optimization apparatus according to an embodiment of the present application. As shown in fig. 12, the apparatus includes a video to be optimized acquisition unit 1201 for acquiring an optimized video. The video to be optimized may be a video acquired by the electronic device itself, or may also be a received video transmitted by another electronic device.

The video optimization unit 1102 is configured to perform optimization processing on a video to be optimized according to a deblurring network model obtained by training in the training method of the video optimization model shown in fig. 4, so as to obtain an optimized output video.

The video optimization apparatus shown in fig. 12 corresponds to the video optimization method shown in fig. 10.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.

Moreover, various aspects or features of embodiments of the application may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer-readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips, etc.), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD), etc.), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), card, stick, key drive, etc.). In addition, various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media capable of storing, containing, and/or carrying instruction(s) and/or data.

In the above embodiments, the apparatus in fig. 11 or fig. 12 may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It should be understood that, in various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not limit the implementation processes of the embodiments of the present application.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, which essentially or partly contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or an access network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims

1. A method for training a video optimization model, the method comprising:

the method comprises the steps that electronic equipment obtains a sample video, wherein the sample video comprises a fuzzy video and a label video which is optimized to the fuzzy video, the image of the fuzzy video comprises time domain noise and a fuzzy problem, and the label video is obtained by carrying out deblurring processing and time domain consistency processing on the fuzzy video;

the electronic equipment inputs images in the blurred video into a preset video optimization model to obtain a first image of an output video, wherein the video optimization model comprises an image alignment module and a deblurring network model, the image alignment module performs alignment processing on the input images, and the deblurring network model performs deblurring processing on the aligned images to obtain a first output image of the output video;

the electronic equipment determines optical flow according to the tag video, searches for the optical flow at the same position according to the position of a first output image, determines the optical flow according to two adjacent frame tag images in the tag video, and transforms the first output image according to the searched optical flow to obtain a second output image adjacent to the first output image;

the electronic equipment determines the loss between the first output image and the first label image, the loss between the second output image and the second label image, adjusts the parameters of the video optimization model according to the determined loss until the calculated difference after the parameters are adjusted meets the preset requirements, and completes the training of the video optimization model, wherein the first label image and the second label image are images in the label video, the content of the first label image is associated with that of the first output image, the content of the second label image is associated with that of the second output image, the loss between the first output image and the first label image comprises the detail loss and the perception loss of the first output image and the first label image, the loss between the second output image and the second label image comprises the time-domain consistency loss, and the detail loss is as follows:

wherein the content of the first and second substances,

representing the pixel values of the output image of the t-1 th frame,

pixel value, L, representing the tag image of the t-1 th frame _e And showing detail loss, wherein epsilon shows a regular term, T is the frame number of the blurred video, and N is the total pixel number of a frame of the blurred image.

2. The method of claim 1, wherein when the video to be trained is a blurred video, the electronic device obtains a sample video, and the method comprises:

the electronic equipment carries out time domain consistency processing on the image of the fuzzy video to obtain a video with stable time domain;

and the electronic equipment performs image enhancement processing on the time-domain stable video to obtain a label video in the sample video.

3. The method of claim 1, wherein when the video to be trained is a temporally stable video, the electronic device obtains a sample video, comprising:

the electronic equipment carries out degradation processing on the image of the video to be trained to obtain a fuzzy video in the sample video;

and the electronic equipment performs image enhancement processing on the image of the video to be trained to obtain a label video in the sample video.

4. The method of claim 3, wherein an electronic device performs degradation processing on the image of the video to be trained, comprising:

carrying out intra-frame filtering on the image of the video to be trained;

and/or performing inter-frame filtering on the image of the video to be trained.

5. The method of claim 1, wherein the image alignment module performs alignment processing on the input image, and comprises:

the electronic equipment inputs the t frame image and the t-1 frame image of the blurred video into a preset image alignment module, and outputs a characteristic image of the t frame image aligned with the t-1 frame image.

6. The method of claim 1, wherein the electronic device determines optical flow from tagged video, comprising:

and the electronic equipment inputs the t frame image of the label video and the adjacent image of the t frame image into a pre-trained optical flow computing network, and outputs the optical flows from the t frame image to the adjacent image of the t frame image in the label video, wherein t is greater than or equal to 2.

7. The method of claim 1, wherein adjusting parameters of the video optimization model based on the determined loss comprises:

determining a weight coefficient alpha for a temporal coherence loss _t Weight coefficient of perceptual loss alpha _p And weight coefficient alpha of detail loss _e Determining a total loss of the video optimization model;

and adjusting parameters of the video optimization model according to the total loss.

8. The method of claim 7, wherein adjusting parameters of the video optimization model based on the total loss comprises:

a first training process, adjusting the parameters of the video optimization model according to a first weight coefficient combination;

a second training process, adjusting the parameters of the video optimization model according to a second weight coefficient combination;

wherein, in the first weight coefficient combination, α _t >α _p And, in the second weight coefficient combination, α _t <α _p 。

9. The method according to claim 8, wherein the weight coefficient α of the loss of detail in the first weight coefficient combination or the second weight coefficient combination _e A weighting factor alpha with said perceptual loss _p Is greater than 8:1 and less than 100: 1.

10. A method for video optimization, the method comprising:

acquiring a video to be optimized;

inputting the image of the video to be optimized into the video optimization model obtained by training according to the method of any one of claims 1 to 9, and obtaining the image of the optimized output video.

11. An electronic device comprising a processor coupled with a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions in the memory such that the electronic device performs the method of any of claims 1-10.

12. A computer-readable storage medium having instructions stored thereon, which when executed, implement the method of any one of claims 1-10.