CN113705665B - Training method of image transformation network model and electronic equipment - Google Patents

Training method of image transformation network model and electronic equipment

Info

Publication number
CN113705665B
Authority
CN
China
Prior art keywords
image
video
network model
frame
label
Prior art date
Legal status
Active
Application number
CN202110990081.4A
Other languages
Chinese (zh)
Other versions
CN113705665A (en)
Inventor
卢圣卿
肖斌
王宇
朱聪超
Current Assignee
Shanghai Glory Smart Technology Development Co ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202110990081.4A
Publication of CN113705665A
Application granted
Publication of CN113705665B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G06T5/70 Denoising; Smoothing
    • G06T5/73 Deblurring; Sharpening
    • G06T5/90 Dynamic range modification of images or parts thereof
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20024 Filtering details
    • G06T2207/20032 Median filtering
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Studio Devices (AREA)

Abstract

The application belongs to the field of image processing and provides a training method for an image transformation network model, and an electronic device. The method includes the following steps: the electronic device acquires a sample video; inputs the flicker video into a preset image transformation network model to obtain a first output image in the output video; determines an optical flow according to the label video, and transforms the first output image according to the optical flow to obtain a second output image adjacent to the first output image; and determines the difference between the first output image and a first label image and the difference between the second output image and a second label image, and adjusts the parameters of the image transformation network model according to these differences until training of the image transformation network model is completed. A video optimized by an image transformation network model trained with this method is not limited by the cause of the temporal consistency problem; the model can adapt to videos with temporal consistency problems of different origins, which helps to widen the applicable range of video optimization.

Description

Training method of image transformation network model and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a training method for an image transformation network model and an electronic device.
Background
With the improvement in performance of electronic devices such as mobile phones, tablets, and smart televisions, and with the development of communication technologies, video services are increasingly widely used on electronic devices. A high-quality video playback effect helps the user see clearer picture content and improves the user's viewing experience.
The video images played by an electronic device may have image quality defects or temporal consistency defects caused by shooting factors, post-editing factors, or transcoding factors. To improve the quality of the video images and thus the viewing experience, the electronic device can optimize the video images before playing the video.
At present, video images are usually optimized by first performing temporal consistency processing and then performing image quality enhancement. However, many factors can cause temporal inconsistency; for a specific video the defect may stem from a single factor or from the combined action of multiple factors. Temporal consistency processing therefore has to be targeted at the factor that causes the problem, which makes the processing cumbersome and keeps video optimization efficiency low.
Disclosure of Invention
The embodiments of the application disclose a training method for an image transformation network model and an electronic device, aiming to solve the problems that existing temporal consistency optimization of video images must be targeted at the factors causing the temporal consistency problem, that the processing is cumbersome, and that video optimization efficiency is low.
In order to solve the technical problem, the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a method for training an image transformation network model. The method includes: the electronic device obtains a sample video; images in the flicker video included in the sample video are input into a preset image transformation network model to obtain a first output image of the output video; an optical flow is determined according to the label video, and the first output image is transformed according to the determined optical flow to obtain a second output image adjacent to the first output image; and parameters of the image transformation network model are adjusted according to the difference between the first output image and the first label image and the difference between the second output image and the second label image, until the difference calculated after the parameters are adjusted meets a preset requirement and training of the image transformation network model is completed.
The first label image and the second label image are images in the label video included in the sample video. The first label image is associated with the content of the first output image, and the second label image is associated with the content of the second output image. A label image being associated with the content of an output image means that the two have the same content, or that they are obtained by transformation from the same flicker image.
After the electronic device generates the first output image with the image transformation network model, it performs a temporal transformation of the first output image according to the optical flow determined from the label video, obtaining a second output image adjacent to the first output image. Because the second output image is obtained by transforming the first output image with the optical flow determined from the standard label video, the temporal features contained in the second output image are consistent with those of the first output image; comparing the second output image with the second label image at the same temporal position to determine the second difference therefore effectively reflects the temporal consistency of the output video of the image transformation network model. Comparing the first output image with the first label image to determine the first difference reflects picture quality problems in the output video of the image transformation network model. The parameters of the image transformation network model are optimized according to the determined first difference and second difference until the difference calculated by the model after parameter adjustment meets the preset convergence requirement, completing the training of the model. The training process does not need to restrict the cause of the temporal consistency problem, so the trained image transformation network model can adapt to the optimization of videos whose temporal consistency problems have different causes.
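Written out with explicit notation (the symbols below are introduced here for clarity and are not used in the original text), the two differences can be combined into a single weighted objective, where O_t is the t-th frame output image, G_t the corresponding label image, F_{t→t-1} the optical flow determined from the label video, W(·) the temporal (warping) transformation, d(·,·) a distance such as the perceptual loss, and α_p, α_t the weight coefficients discussed in the seventh and eighth possible implementation manners below:

```latex
% Illustrative formulation; the symbols are defined above, not in the patent text.
\mathcal{L}_{\mathrm{total}}
  = \alpha_p \, d\big(O_t,\; G_t\big)
  + \alpha_t \, d\big(W(O_t,\, F_{t \rightarrow t-1}),\; G_{t-1}\big)
```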
The sample video comprises a flicker video and a label video. The label video is a video obtained by optimizing the video in a time domain and a space domain, and the flicker video is a video with an image comprising time domain noise and space domain noise.
The sample video comprises a flicker video and a label video. When the sample video is obtained, the image quality and temporal consistency of the video to be trained can be detected to determine the video type of the video to be trained, and a corresponding sample video acquisition method is selected according to the determined video type. Alternatively, the type of the video to be trained may be specified, and the corresponding sample video acquisition method determined according to the specified type.
With reference to the first aspect, in a first possible implementation manner of the first aspect, when the video to be trained is a flickering video, time-domain consistency processing may be performed on an image of the flickering video to obtain a time-domain stable video, and then image enhancement processing is performed on the time-domain stable video to obtain a tag video in the sample video. And obtaining a sample video for model training according to the generated label video and the flicker video before image enhancement and time domain consistency processing.
With reference to the first aspect, in a second possible implementation manner of the first aspect, when a video to be trained is a video with a stable time domain, a degradation process may be performed on an image of the video to be trained to obtain a flickering video of a sample video; and performing image enhancement processing on the image of the video to be trained to obtain a label video of the sample video. The image quality of the video to be trained is enhanced to obtain a label image, and the time domain consistency of the video to be trained is degraded to obtain a flicker image, so that model training can be performed according to the obtained sample video.
When temporal consistency processing is performed on the flickering video, a conventional image processing algorithm can be applied according to the cause of the temporal consistency problem, that is, the flicker problem. For example, if the flicker is caused by brightness inconsistency between frames, a temporal filter matching the brightness level of the current video may be constructed; such temporal filters include mean filters and median filters, whose parameters may be determined by the overall brightness level of the video. After a video with inconsistent inter-frame brightness is filtered, it meets the temporal consistency requirement.
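As an illustration of such temporal filtering, the sketch below applies a sliding-window temporal mean or median filter to each frame and blends the result back in; the window radius, blend factor, and function names are assumptions for illustration rather than values taken from this application.

```python
import numpy as np

def temporal_filter(frames: np.ndarray, radius: int = 2, mode: str = "median",
                    blend: float = 0.5) -> np.ndarray:
    """Suppress inter-frame brightness flicker with a sliding temporal filter.

    frames: array of shape (T, H, W, C), float values in [0, 1].
    radius: half window size (window length = 2 * radius + 1), chosen for illustration.
    """
    num_frames = frames.shape[0]
    out = np.empty_like(frames)
    for t in range(num_frames):
        lo, hi = max(0, t - radius), min(num_frames, t + radius + 1)
        window = frames[lo:hi]
        # A mean or median over the temporal window smooths brightness jumps
        # between neighbouring frames while largely keeping the content of frame t.
        ref = np.median(window, axis=0) if mode == "median" else window.mean(axis=0)
        out[t] = blend * frames[t] + (1.0 - blend) * ref  # blend factor is an arbitrary choice
    return out
```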
The flicker video and the label video required by the sample video can thus be generated by transforming the obtained video to be trained, i.e. the original video, so the sample video can be generated effectively.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, when a temporally consistent video is degraded into a flickering video, the brightness and/or contrast of the images in the video to be trained may be perturbed according to a random brightness and/or a random contrast. Alternatively, transformation parameters that reproduce the effect of temporal consistency problems with different causes can be determined and used to degrade the video into a flickering video. Random brightness and/or contrast allows the flickering video to be generated quickly and effectively.
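A minimal sketch of this degradation, assuming per-frame random gain and offset ranges that are not specified in the application, could look as follows:

```python
import numpy as np

def degrade_to_flicker(frames: np.ndarray, rng=None) -> np.ndarray:
    """Turn a temporally stable video (T, H, W, C), values in [0, 1], into a flicker video."""
    rng = rng or np.random.default_rng()
    flicker = np.empty_like(frames)
    for t, frame in enumerate(frames):
        gain = rng.uniform(0.8, 1.2)     # random per-frame contrast (illustrative range)
        offset = rng.uniform(-0.1, 0.1)  # random per-frame brightness (illustrative range)
        mean = frame.mean()
        # Contrast is scaled around the frame mean and a brightness offset is added;
        # because gain and offset differ from frame to frame, playback flickers.
        flicker[t] = np.clip((frame - mean) * gain + mean + offset, 0.0, 1.0)
    return flicker
```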
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the inputting, by the electronic device, of an image of the flickering video into a preset image transformation network model to obtain a first output image of the output video includes: the electronic device inputs the t-th frame image and the (t-1)-th frame image of the flickering video into the preset image transformation network model, which outputs a residual between the t-th frame output image and the t-th frame image of the flickering video, where t is greater than or equal to 2; and the electronic device sums the residual and the t-th frame image of the flickering video to obtain the t-th frame output image, the t-th frame output image being the first output image.
A residual block is introduced into the image transformation network model so that the model outputs a residual, and the first output image is generated from that residual and the image of the flicker video. This effectively mitigates the problems that arise as the network becomes deeper during optimization training: fewer network layers are needed to reach the same effect, the parameters of the network converge faster, and problems such as gradient explosion and gradient vanishing that occur when training deep networks are less likely to appear during training.
Here t is greater than or equal to 2 and represents the sequence number of a frame in the video. One forward pass of the image transformation network model on two adjacent frames of the flicker video yields one first output image. It will be appreciated that the first output image depends on the input flicker images and on the parameters of the image transformation network model.
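A possible PyTorch sketch of this residual formulation is shown below; only the input/output contract (frames t and t-1 in, a residual out, the residual summed with frame t) follows the description, while the depth, channel width, and layer types are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualTransformNet(nn.Module):
    """Takes frame t and frame t-1 of the flicker video and predicts a residual for frame t."""

    def __init__(self, channels: int = 3, width: int = 64, num_blocks: int = 4):
        super().__init__()
        self.head = nn.Conv2d(2 * channels, width, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(inplace=True))
            for _ in range(num_blocks)
        ])
        self.tail = nn.Conv2d(width, channels, kernel_size=3, padding=1)

    def forward(self, frame_t: torch.Tensor, frame_t_minus_1: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame_t, frame_t_minus_1], dim=1)  # condition on the adjacent frame
        residual = self.tail(self.body(torch.relu(self.head(x))))
        # The network outputs a residual; the first output image is frame t plus that residual.
        return frame_t + residual
```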
With reference to the first aspect, in a sixth possible implementation manner of the first aspect, the electronic device may determine the difference between the first output image and the first label image through a perceptual loss function; this difference is the perceptual loss between the first output image and the first label image. The parameters of the image transformation network model can be adjusted and optimized to reduce this perceptual loss, thereby improving the quality of the images in the output video.
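A perceptual loss of this kind is commonly computed as a distance between deep features of a fixed pretrained network such as VGG-16; the application does not prescribe a particular network, so the sketch below is one possible realization (ImageNet input normalization is omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-16 feature maps of the output image and the label image."""

    def __init__(self, layer_index: int = 16):  # features up to relu3_3; an illustrative choice
        super().__init__()
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # the loss network stays fixed; only the transform net is trained

    def forward(self, output_image: torch.Tensor, label_image: torch.Tensor) -> torch.Tensor:
        # Images are assumed to be (N, 3, H, W) tensors.
        return F.l1_loss(self.features(output_image), self.features(label_image))
```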
With reference to the first aspect, in a seventh possible implementation manner of the first aspect, the first label image and the first output image determine a first difference, and the second output image and the second label image determine a second difference. When optimizing the parameters of the image transformation network model, the total loss of the model can be obtained from the perceptual loss determined by the first difference and the temporal consistency loss determined by the second difference, combined with the corresponding weight coefficients; the parameters are then adjusted according to the total loss until the total loss meets the preset convergence requirement.
With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, different weight coefficient combinations may be set, and training may be performed with each combination in turn. For example, the parameter training may include a first training process, in which the parameters of the image transformation network model are adjusted according to a first weight coefficient combination, and a second training process, in which the parameters are adjusted according to a second weight coefficient combination; in the first weight coefficient combination α_t > α_p, and in the second weight coefficient combination α_t < α_p.
Here α_t is the weight coefficient of the temporal consistency loss and α_p is the weight coefficient of the perceptual loss. When the weight coefficient of the temporal consistency loss is large, for example more than 1.2 times the weight coefficient of the perceptual loss, the determined total loss emphasizes the temporal consistency loss, and adjusting the parameters according to this total loss gives the output images better temporal consistency. After the first training process, the weight coefficient of the perceptual loss is increased so that the trained model produces better image quality. Training in separate, differently emphasized stages makes the image transformation network model converge more easily and improves its training efficiency. Training the temporal consistency parameters first also reduces the difficulty of training them and improves overall training efficiency.
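The two-stage training with different weight coefficient combinations could be organized as in the sketch below; the concrete values (1.5 versus 1.0) are illustrative assumptions that merely satisfy α_t > α_p in the first phase and α_t < α_p in the second, and the helper names are hypothetical.

```python
import torch

def total_loss(perceptual_loss: torch.Tensor, temporal_loss: torch.Tensor,
               alpha_p: float, alpha_t: float) -> torch.Tensor:
    # Weighted combination of the perceptual loss (first difference)
    # and the temporal consistency loss (second difference).
    return alpha_p * perceptual_loss + alpha_t * temporal_loss

# First phase emphasises temporal consistency (alpha_t > alpha_p),
# second phase emphasises perceptual quality (alpha_t < alpha_p). Values are assumed.
TRAINING_PHASES = [
    {"alpha_t": 1.5, "alpha_p": 1.0},  # first weight coefficient combination
    {"alpha_t": 1.0, "alpha_p": 1.5},  # second weight coefficient combination
]

def run_phase(model, optimizer, batches, alpha_t, alpha_p, compute_losses):
    """Run one training phase; compute_losses(model, batch) -> (perceptual, temporal)."""
    for batch in batches:
        perceptual, temporal = compute_losses(model, batch)
        loss = total_loss(perceptual, temporal, alpha_p, alpha_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```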
In a second aspect, an embodiment of the present application provides a video optimization method, including obtaining a video to be optimized; and inputting the image of the video to be optimized into the trained image transformation network model according to any one of the first aspect to obtain the image of the optimized output video.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor coupled with a memory, the memory being configured to store instructions, and the processor being configured to execute the instructions in the memory, so that the electronic device performs the method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which instructions are stored, and when executed, implement the method according to any one of the first aspect.
Drawings
FIG. 1 is a schematic diagram of a video with stripes or flicker;
fig. 2 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application;
fig. 3 is a block diagram of a software structure of an electronic device according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart illustrating an implementation process of a training method for an image transformation network model according to an embodiment of the present application;
fig. 5 is a schematic diagram of sample video generation provided by an embodiment of the present application;
fig. 6 is a schematic diagram of another sample video generation provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a training structure of an image transformation network model according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram of a reference learning process provided in an embodiment of the present application;
fig. 9 is a schematic diagram of a video optimization method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an apparatus for training an image transformation network model according to an embodiment of the present disclosure;
fig. 11 is a schematic diagram of a video optimization apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions in the embodiments of the present application better understood and make the above objects, features and advantages of the embodiments of the present application more obvious and understandable, the technical solutions in the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Before describing the technical solutions of the embodiments of the present application, the technical scenarios and related technical terms of the present application are first introduced with reference to the drawings.
The technical solutions of the embodiments are applied to the technical field of image processing and are mainly directed at image quality enhancement and temporal consistency processing of the series of consecutive frames in a video. Image quality enhancement may include optimization of parameters such as the color, brightness, and contrast of an image. Temporal consistency processing addresses banding that appears between different frames of a video and flickering that occurs between different frames. When the objects in the images of the processed video do not change, no change perceivable by the user is produced, and the video has no temporal consistency problem.
The cause of banding or flicker is illustrated in fig. 1. When a video is shot under a light source such as a fluorescent lamp, the light source emits light energy at a frequency determined by the frequency of the power supply. For example, the power supply shown in FIG. 1 has a frequency of 50 Hz, and the emitted light energy has a frequency of 100 Hz (because the light energy does not depend on the direction of the current). A CMOS image sensor exposes line by line according to a set exposure period: all pixels in the same row start exposure at the same time, and every row is exposed for the same duration. As a result, if the top of the N-th frame is bright, the middle of the (N+1)-th frame is bright and the bottom of the (N+2)-th frame is bright, a stripe rolling from top to bottom may be displayed on the screen when frames N to N+2 are played.
Flickering between different frames may occur because the frame rate of video capture is higher than the energy frequency of the power supply. For example, the energy frequency of a 50 Hz mains supply is 100 Hz; if the frame rate is high, for example greater than 100 Hz, pixels at the same position may capture different brightness in different frames under the preset exposure period, so different frames show different brightness and flicker appears between them. In addition, when the captured video is compressed and encoded, the video images may also exhibit temporal inconsistency such as flickering.
At present, a single image can be optimized with existing image optimization algorithms and excellent results can be obtained. However, if such an image optimization algorithm is applied directly to video optimization, temporal inconsistency often appears, for example flicker caused by different brightness in different frames of the video. To improve temporal consistency, most researchers design dedicated algorithms for specific video processing tasks, for example improving temporal consistency by minimizing the distance between the output and the processed video in the gradient domain and the distance between two consecutive output frames. However, this approach relies on the output and the processed video being similar in the gradient domain, which may not hold in practice.
In order to improve time domain consistency after video optimization and have better universal applicability, the embodiment of the application provides a training method of an image transformation network model.
The training method for the image transformation network model provided by the embodiments of the application obtains a sample video comprising a flicker video and the label video obtained by optimizing that flicker video, inputs two adjacent frames of the flicker video (the t-th frame and the (t-1)-th frame) into the image transformation network model, and obtains a t-th frame output image. From the two frames of the label video that correspond to those adjacent flicker frames (the t-th frame and the (t-1)-th frame), the optical flow from the t-th frame to the (t-1)-th frame of the label video is determined. According to this optical flow and the t-th frame output image obtained from the image transformation network model, a (t-1)-th frame output image is generated. The difference between the t-th frame output image and the t-th frame image of the label video and the difference between the (t-1)-th frame output image and the (t-1)-th frame image of the label video are then determined, and the parameters of the image transformation network model are adjusted until the adjusted differences meet the preset requirement. Here T is the number of frames of the sample video, t is an integer greater than 1 and at most T, and T-1 groups of sample images can be generated from the sample video. The T-1 groups of sample images can be used to train the image transformation network model in temporal order or called randomly for training.
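Putting these steps together, one training iteration on a pair of adjacent frames could look like the following sketch. The optical flow of the label video is assumed to be supplied by an off-the-shelf estimator (which the application does not specify), backward warping is done with grid_sample, and all function names and signatures are illustrative.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `image` (N, C, H, W) with optical flow `flow` (N, 2, H, W), in pixels (x, y)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image.device)  # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                             # sampling positions per pixel
    # Normalise coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (N, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

def training_step(model, optimizer, perceptual_loss, flicker_t, flicker_tm1,
                  label_t, label_tm1, flow_t_to_tm1, alpha_p=1.0, alpha_t=1.5):
    out_t = model(flicker_t, flicker_tm1)            # t-th frame output image
    out_tm1 = warp_with_flow(out_t, flow_t_to_tm1)   # (t-1)-th output image via the label-video flow
    first_diff = perceptual_loss(out_t, label_t)     # picture quality term
    second_diff = F.l1_loss(out_tm1, label_tm1)      # temporal consistency term
    loss = alpha_p * first_diff + alpha_t * second_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```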
The training method of the image transformation network model and the video optimization method provided by the embodiment of the application can be applied to electronic equipment, and the electronic equipment can be a terminal and can also be a chip in the terminal. The terminal may be, for example, an electronic device such as a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the present invention does not set any limit to a specific type of the electronic device.
Fig. 2 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 2, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, an audio module 140, a speaker 140A, a microphone 140B, a headphone interface 140C, a sensor module 150, a camera 160, a display screen 170, and the like. The sensor module 150 may include a pressure sensor 150A, a gyroscope sensor 150B, an acceleration sensor 150C, a distance sensor 150D, a fingerprint sensor 150E, a touch sensor 150F, an ambient light sensor 150G, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), and/or a neural-Network Processing Unit (NPU), among others. Wherein, the different processing units may be independent devices or may be integrated in one or more processors.
The controller may be, among other things, the nerve center and command center of the electronic device 100. The controller can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and execution. The controller may be used to execute the training method of the image transformation network model, to train and optimize the image transformation network model, and to optimize videos according to the trained and optimized image transformation network model.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system. For example, after the training of the image transformation network model is completed, the related data of the image transformation network model may be stored in a cache memory in the processor 110, which facilitates to improve the processing efficiency of the system for video optimization.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 150F, the charger, the flash, the camera 160, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 150F through an I2C interface, so that the processor 110 and the touch sensor 150F communicate through an I2C bus interface, thereby implementing a touch function of the electronic device 100, and enabling the electronic device to receive a video shooting instruction, a video optimization instruction, and the like through the touch sensor.
The MIPI interface may be used to connect the processor 110 with peripheral devices such as the display screen 170, the camera 160, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, the processor 110 and the camera 160 communicate through a CSI interface to implement a shooting function of the electronic device 100, and the shot video may be optimized through the optimized image transformation network model. The processor 110 and the display screen 170 communicate through the DSI interface to implement a display function of the electronic device 100, so that the electronic device can view a picture of an optimized video through the display screen.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 160, the display screen 170, the wireless communication module 160, the audio module 140, the sensor module 150, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device, such as receiving a video file transmitted by another device or a memory, or transmitting a video file of the electronic device to another device or a memory through the USB interface.
It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The electronic device 100 implements display functions, such as displaying a preview image when capturing a video or displaying a picture when playing a video, through the GPU, the display screen 170, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 170 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 170 is used to display images, video, and the like, for example to display the optimized video. The display screen 170 includes a display panel. The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or M display screens 170, where M is a positive integer greater than 1.
The electronic device 100 may implement a photographing function through the ISP, the camera 160, the video codec, the GPU, the display screen 170, the application processor, and the like. The ISP is used to process the data fed back by the camera 160. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 160. The shot video can be optimized through the video optimization method, so that a better video display effect can be obtained in application scenes such as live broadcast and video call.
The camera 160 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to be converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or M cameras 160, M being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 performs video optimization, the video optimization method according to the embodiment of the present application may be performed by a digital signal processor.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor, which processes input information quickly by referring to a biological neural network structure, for example, by referring to a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: in the embodiment of the application, intelligent learning of the image transformation network model can be realized through the NPU, and parameters in the image transformation network model are optimized.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, a file such as a video after or before optimization may be saved in the external memory card.
The internal memory 121 may be configured to store computer executable program codes, including executable program codes corresponding to the training method of the image transformation network model as described in the embodiment of the present application, executable program codes corresponding to the video optimization method as described in the embodiment of the present application, and the like, where the executable program codes include instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The electronic device 100 may implement audio functions and audio acquisition through the audio module 140, the speaker 140A, the microphone 140B, the headphone interface 140C, and the application processor. Such as audio playback in video, recording while video was being taken, etc.
The audio module 140 is used to convert digital audio information into an analog audio signal output and also used to convert an analog audio input into a digital audio signal. The audio module 140 may also be used to encode and decode audio signals. In some embodiments, the audio module 140 may be disposed in the processor 110, or some functional modules of the audio module 140 may be disposed in the processor 110.
The speaker 140A, also called "horn", is used to convert electrical audio signals into sound signals. The electronic apparatus 100 can listen to audio in video or the like through the speaker 140A.
The microphone 140B, also called "microphone", is used to convert sound signals into electrical signals. When a video is captured, sound information in the scene may be captured by the microphone 140B. In other embodiments, the electronic device 100 may be provided with two microphones 140B to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four or more microphones 140B to collect sound signals, reduce noise, identify sound sources, perform directional recording, and the like.
The headphone interface 140C is used to connect a wired headphone. The headset interface 140C may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 150A is used for sensing a pressure signal, and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 150A may be disposed on the display screen 170. The pressure sensor 150A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 150A, the capacitance between the electrodes changes. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 170, the electronic apparatus 100 detects the intensity of the touch operation according to the pressure sensor 150A. The electronic device 100 may also calculate the touched position according to the detection signal of the pressure sensor 150A, so as to achieve the acquisition of different instructions, including, for example, a video shooting instruction. In some embodiments, the touch operations that are applied to the same touch position but different touch operation intensities may correspond to different operation instructions. For example: and when the touch operation with the touch operation intensity smaller than the first pressure threshold value acts on the short message application icon, executing an instruction for viewing the short message. And when the touch operation with the touch operation intensity larger than or equal to the first pressure threshold value acts on the short message application icon, executing an instruction of newly building the short message.
The gyro sensor 150B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 150B. The gyro sensor 150B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyro sensor 150B detects a shake angle of the electronic device 100, calculates a distance to be compensated for the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 through a reverse motion, thereby preventing shake, reducing a shake problem of a captured video, and improving quality of the captured video.
The acceleration sensor 150C may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The method can also be used for recognizing the posture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications. For example, when the optimized video is played, the gesture of the electronic device 100 is detected, and the full-screen playing state and the non-full-screen playing state are automatically switched.
A distance sensor 150D for measuring distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, shooting a scene, the electronic device 100 may utilize the distance sensor 150D to range to achieve fast focus, making the picture of the shot video clearer. For example, in a live broadcast scene or a video call scene, a video can be focused on a human face, and the video interaction experience can be improved.
The ambient light sensor 150G is for sensing ambient light level. The electronic device 100 may adaptively adjust video capture parameters, including, for example, sensitivity, shutter time, and exposure, based on the perceived ambient light brightness.
The fingerprint sensor 150E is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to unlock the fingerprint, access the application lock, photograph the fingerprint, answer an incoming call with the fingerprint, and so on.
The touch sensor 150F is also referred to as a "touch panel". The touch sensor 150F may be disposed on the display screen 170, and the touch sensor 150F and the display screen 170 form a touch screen, which is also called a "touch screen". The touch sensor 150F is used to detect a touch operation applied thereto or therearound. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display screen 170. In other embodiments, the touch sensor 150F can be disposed on a surface of the electronic device 100, different from the position of the display screen 170.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the electronic device 100.
Fig. 3 is a block diagram of a software structure of the electronic device 100 according to the embodiment of the present application. The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom. The application layer may include a series of application packages.
As shown in fig. 3, the application package may include a camera, video, etc. application.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 3, the application framework layer may include a content provider, a view system, an explorer, a notification manager, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, and the like.
The view system includes visual controls, such as controls to display text, controls to display video, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including video icons may include a view that displays text and a view that displays pictures.
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a brief dwell, and does not require user interaction. Such as notification managers used to inform video downloads, video optimization completion, message reminders, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, flashing an indicator light, etc.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is a function which needs to be called by java language, and the other part is a core library of android.
The application layer and the application framework layer run in a virtual machine. And executing java files of the application program layer and the application program framework layer into a binary file by the virtual machine. The virtual machine is used for performing the functions of object life cycle management, stack management, thread management, safety and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: media Libraries (Media Libraries), image processing Libraries, and the like.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like.
After the sample video for training is obtained, the electronic device may process the images in the sample video through an image transformation network model in the image processing library to obtain the t-th frame output image corresponding to the t-th frame of the flicker video in the sample video. From the images of the label video, the optical flow from the t-th frame to the (t-1)-th frame of the label video can be extracted. According to the extracted optical flow and the t-th frame output image, the (t-1)-th frame output image can be obtained by transformation. The parameters of the image transformation network model are adjusted according to the difference between the t-th frame output image and the t-th frame image of the label video and the difference between the (t-1)-th frame output image and the (t-1)-th frame image of the label video until the differences meet the preset requirement, completing the training of the image transformation network model. Video optimization can then be performed on any flickering video according to the trained image transformation network model.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The terminal having the structure shown in fig. 2 and 3 may be used to perform a training method or a video optimization method of an image transformation network model provided in the embodiments of the present application. For convenience of understanding, the following embodiments of the present application will specifically describe an image processing method in a shooting scene provided by the embodiments of the present application, by taking a mobile phone having a structure shown in fig. 2 and fig. 3 as an example, with reference to the accompanying drawings.
Fig. 4 is a schematic implementation flow diagram of a training method for an image transformation network model according to an embodiment of the present application, which is detailed as follows:
in S401, the electronic device acquires a sample video.
The sample video in the embodiment of the present application may be understood as a video for training an image transformation network model. The sample video comprises a plurality of frames of images, and each frame of image in the sample video can be labeled in sequence according to the sequence of the images. For example, the sample video includes T frame images, and the T frame images are sequentially marked as a 1 st frame image and a 2 nd frame image … … th frame image of the sample video according to the sequence of each frame image in the sample video. For convenience of description, an ith (0< i < T) frame image of the sample video may be represented as an ith frame sample image.
The sample video comprises a flicker video and a label video. The ith frame image in the marked flashing video and the ith frame image of the marked label video are images with the same image content. Namely: the i-th frame of the flickering image (the image of the flickering video) can be subjected to flicker removal and image quality enhancement processing, and the i-th frame of the label image (the image in the label video) can be obtained.
In this embodiment, the "obtaining a sample video" may refer to that the electronic device receives a sample video from another device, or the electronic device reads the sample video in a local storage, or in a possible implementation manner, the sample video may be generated according to the received video or according to a locally read video.
The sample video may include a blinking video and a label video, among others.
The images of the flicker video contain temporal noise and spatial noise. The spatial noise can be characterized by parameters of a single image such as sharpness, contrast, and saturation. The temporal noise can be characterized by the brightness consistency and contrast consistency of adjacent video frames.
For example, for an image in the flicker video, the sharpness, contrast and/or saturation of the image may be less than a preset spatial parameter threshold, and the difference in brightness and/or contrast between two adjacent images may be greater than a preset temporal difference threshold. When the flicker video is played, because parameters such as brightness or contrast differ between frames, the video appears to flicker. In addition, the single-frame images of the flicker video are blurred and have not undergone image quality enhancement processing such as brightness and color enhancement, so the pictures presented during playback are blurred, which affects the viewing experience of the user.
When the flickering video is converted into the tag video, the image quality of the flickering image in the flickering video needs to be enhanced, and time domain consistency processing needs to be performed on the flickering video.
The images of the label video are images that meet preset image quality requirements and temporal consistency requirements. For example, the images of the label video may be images whose sharpness and brightness have been improved by image quality enhancement processing and which have no temporal consistency problem between different frames. The images of the label video correspond in time to the images of the flicker video. Namely: the content of the i-th frame image of the label video is similar to or identical to that of the i-th frame image of the flicker video. The images of the label video can be obtained by performing image quality enhancement processing and temporal consistency processing on the flicker video. For example, the content of the i-th frame image of the label video is the same as that of the i-th frame image of the flicker video, but the picture parameters may differ. In the processing procedure, image quality enhancement processing and temporal consistency processing are performed on the i-th frame image of the flicker video to obtain the i-th frame image of the label video. For convenience of description, the i-th frame image of the label video may be briefly written as the i-th frame label image, and the i-th frame image of the flicker video may be written as the i-th frame flicker image.
In the process of generating the sample video, the sample video required by the embodiment of the present application can be obtained in different processing manners depending on the original video that is obtained.
As shown in fig. 5, the acquired original video is a flicker video. In order to obtain a complete sample video, a label video (ground truth) corresponding to the original video needs to be generated. When generating the label video, image quality enhancement processing can be performed on each frame of flicker image of the original video, namely the flicker video, to obtain a clear image corresponding to each frame of the flicker video. A video consisting of such clear images may be referred to as a clear video. The image quality enhancement processing may be performed with a convolutional neural network, including noise reduction performed by a network such as DnCNN, FFDNet or CBDNet, image super-resolution processing, and color enhancement performed by a network such as HDR-Net, U-Net, ResNet or AlexNet. The image quality can also be enhanced through operations such as contrast adjustment, color adjustment, filtering, saturation adjustment, detail enhancement and edge processing. After the image quality enhancement, every frame of the original video becomes a clear image. However, there are still differences in parameters such as brightness and contrast between different frames of the clear video, and when the clear video is played there is still flicker in the time dimension caused by alternating brightness or changing contrast.
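As an illustration of the per-frame image quality enhancement step, the following is a minimal Python sketch using classical operations (denoising plus contrast adjustment) as a stand-in for the CNN-based enhancement networks named above; the function names and parameter values are illustrative assumptions, not part of the patent.

```python
# A minimal sketch of per-frame image quality enhancement using classical
# operations; a CNN-based enhancer (DnCNN, HDR-Net, etc.) could replace it.
import cv2
import numpy as np

def enhance_frame(frame_bgr: np.ndarray) -> np.ndarray:
    # Reduce spatial noise in the single frame.
    denoised = cv2.fastNlMeansDenoisingColored(frame_bgr, None, 5, 5, 7, 21)
    # Enhance contrast on the luma channel with CLAHE.
    lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

def enhance_video(frames):
    # Apply the same enhancement to every frame of the flicker (original) video.
    return [enhance_frame(f) for f in frames]
```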
Temporal consistency processing can then be further performed on the clear video. The temporal consistency problem of the clear video can be corrected according to the cause of the temporal inconsistency. The clear images in the clear video can be adjusted according to causes such as the shooting device, the shooting scene or the encoding, so that the contrast or brightness of the adjusted images remains consistent, thereby obtaining a label video that meets the temporal consistency requirement.
In a possible implementation manner, the time-domain consistency processing may be performed on the original video to obtain a video with time-domain consistency, and then the image quality enhancement processing is further performed on the image in the video with time-domain consistency to obtain a tag image corresponding to the original video.
In a possible implementation, the acquired original video may not be a flickering video. As shown in the sample video generation diagram of fig. 6, the image included in the acquired original video has a problem in image quality, and the image in the original video is not subjected to image quality enhancement processing. However, there is no temporal consistency problem between images in the original video. In order to obtain the required tag video and flicker video, the original video needs to be subjected to transformation processing.
When generating a flicker video, an original image included in the original video may be subjected to degradation processing to obtain images with different brightness or contrast. For example, the brightness and/or contrast of different frame images in the original video may be adjusted according to a random brightness and/or random contrast manner, so as to obtain images with different brightness and/or contrast. A flickering video is generated from images of different brightness and/or contrast.
For example, the variation range of the random brightness may be set to [-50, 50] and the variation range of the random contrast to [-0.5, 1.5]; the brightness and/or contrast adjustment parameter of each frame is then randomly chosen within the set range, so that the randomly adjusted video is no longer consistent in the time domain, i.e. it has a temporal consistency problem. Of course, the brightness and/or contrast adjustment parameters of different frames are not limited to random brightness and random contrast and may also be determined according to a preset variation pattern.
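The degradation step can be illustrated with a minimal sketch, assuming the random brightness range [-50, 50] and random contrast range [-0.5, 1.5] mentioned above; the function and parameter names are illustrative assumptions.

```python
# A minimal sketch of the degradation step: each frame gets an independently
# drawn brightness offset and contrast gain, which breaks temporal consistency
# and yields a flicker video.
import numpy as np

def make_flicker_video(frames, brightness_range=(-50, 50),
                       contrast_range=(-0.5, 1.5), seed=None):
    rng = np.random.default_rng(seed)
    flicker = []
    for frame in frames:                       # frame: HxWx3, uint8
        beta = rng.uniform(*brightness_range)  # random brightness offset
        alpha = rng.uniform(*contrast_range)   # random contrast gain
        out = frame.astype(np.float32) * alpha + beta
        flicker.append(np.clip(out, 0, 255).astype(np.uint8))
    return flicker
```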
When generating the label video, since the images of the original video have not undergone image quality enhancement processing, image quality enhancement (or image enhancement) processing may be performed on the original images to convert each frame of the original video into a clear image, so that the requirement that the images of the label video are clear images is satisfied. Because the original video already has temporal consistency, the label video can be generated directly from the enhanced clear images.
Of course, the cases are not limited to those shown in fig. 5 and 6. The original video may itself be a label video; in this case the image quality of the original video is first reduced by degradation processing, including adding noise to the images of the original video, and the images with reduced quality are then further degraded, for example by applying random contrast and/or random brightness processing to different frames, to obtain a flicker video that has both temporal consistency problems and image quality problems.
Or, the acquired original video may have a problem of temporal consistency, but the image is a clear image. In order to obtain the flicker video in the sample video, noise can be added to the original image of the original video, so that the image quality of the original image is reduced, and the flicker video is generated according to the processed image.
Temporal consistency processing is then further performed on the original video. The temporal consistency problem of the original video can be corrected according to the cause of the temporal inconsistency. For example, the original images in the original video may be adjusted according to causes such as the shooting scene or the encoding, so that the contrast or brightness of the adjusted images remains consistent, thereby obtaining a label video with temporal consistency.
In S402, an image in a flicker video included in the sample video is input to the image transformation network model, and a first output image in the output video is obtained.
As shown in the schematic diagram of the training structure of the image transformation network model in fig. 7, the image transformation network model in the embodiment of the present application may capture the spatio-temporal correlation in the flicker video by embedding a convolutional Long Short-Term Memory network (conv-LSTM). Spatial features are extracted through convolution operations, and the extracted spatial features are fed into the LSTM network, which can extract temporal features, so that the trained image transformation network model can effectively enhance the image quality of a video and also resolve the temporal consistency problem of the video.
In the embodiment of the application, the image transformation network model may comprise a residual block, so that for the flicker images (I_t, I_{t-1}) of the flicker video the image transformation network model outputs a residual rather than outputting the first output image directly. For example, the first output image may be the t-th frame output image, and the t-th frame output image O_t can be expressed as: O_t = I_t + F(I_t), where I_t is the t-th frame image of the flicker video (the t-th frame flicker image) and F(I_t) is the residual output by the image transformation network model. By introducing the residual block, the image transformation network model outputs the difference between the t-th frame output image and the t-th frame flicker image, which allows the model to remain easy to optimize during training as the number of layers is deepened. Here t is a natural number greater than or equal to 2.
It is understood that the first output image may be any frame image in the output video. Similarly, the first label image may be any frame label image in the label video. The second output image is an image adjacent to the first output image, the second label image is an image adjacent to the first label image, and the content of the second label image is the same as that of the second output image.
The two adjacent frames of flicker images may be the t-th frame flicker image and the t-1-th frame flicker image. After the two adjacent flicker frames are input into the image transformation network model, the residual F(I_t), i.e. the difference between the t-th frame output image and the t-th frame flicker image, is obtained through the calculation of the model. The t-th frame output image is then obtained by summing the pixels at corresponding positions of the output residual (also called the residual image) and the input t-th frame flicker image. Summing the pixels at corresponding positions can be understood as summing the pixels at the same positions of the flicker image and the residual image. For example, the pixel value of the first pixel in the upper left corner of the flicker image is summed with the pixel value of the first pixel in the upper left corner of the residual image, and the result is the pixel value of the first pixel in the upper left corner of the output image.
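A minimal PyTorch sketch of the residual formulation O_t = I_t + F(I_t) follows; the small convolutional network is a toy stand-in for the conv-LSTM image transformation network model described above and only demonstrates the residual summation.

```python
# A minimal sketch of the residual output: the network sees the t-th and
# (t-1)-th flicker frames and outputs a residual that is added to I_t.
import torch
import torch.nn as nn

class TinyTransformNet(nn.Module):
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        # Takes the t-th and (t-1)-th flicker frames concatenated on channels.
        self.body = nn.Sequential(
            nn.Conv2d(2 * channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, i_t, i_tm1):
        residual = self.body(torch.cat([i_t, i_tm1], dim=1))  # F(I_t)
        return i_t + residual                                  # O_t = I_t + F(I_t)

# Usage: o_t = TinyTransformNet()(i_t, i_tm1) with tensors of shape (B, 3, H, W).
```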
The residual error output by the image transformation network model is the difference value between the output image of the t-th frame (i.e. the image of the t-th frame of the output video) and the flicker image of the t-th frame (i.e. the image of the t-th frame of the flicker video). The difference may be a pixel difference between pixel points at corresponding positions in the two images.
For example, the pixel value of the pixel point at the position a of the t-th frame output image is (r1, g1, b1), the pixel value of the pixel point at the position a of the t-th frame flicker image is (r2, g2, b2), and the residual corresponding to the pixel point can be represented as (r1-r2, g1-g2, b1-b 2).
In the embodiment of the application, the flicker images input into the image transformation network model, namely the t-th frame flicker image and the t-1-th frame flicker image, may be any two adjacent frames of the flicker video. When training the image transformation network model, the adjacent frame pairs of the flicker video may be input to the model either in the playing order of the frames or in a random order, so as to complete the training of the parameters in the image transformation network model. For example, the 1st and 2nd frame flicker images may be input to the model for training, then the 2nd and 3rd frames, and so on, until finally the T-1-th and T-th frame flicker images are input for training. Alternatively, the playing order need not be followed, and any pair of adjacent flicker images may be input to the image transformation network model in a randomly determined order for training.
Before training, the image transformation network model in the embodiment of the application may determine parameters in the image transformation network model in a random generation manner, or may use preset values as parameters in the image transformation network model. The parameters of the image transformation network model may include the size of the convolution kernel of the convolution layer in the image transformation network model, the value of the convolution kernel, the learning rate, the step size of the convolution calculation, the size of the extended edge, and other parameters. For example, the value of each position in the convolution kernel of the convolution layer, and the step size, may be initially set to a predetermined value, such as 1 or 2. The convolution kernel size may be initialized to 3 x 3 size, the extended edge size may be initialized to 0, and the learning rate may be initialized to 0.1.
Typically, the initialized parameters differ from the parameters obtained when training is complete. Therefore, during training, there may be a difference between the output image generated from the residual output by the image transformation network model and the desired output image, i.e. the label image. The difference information between the generated output image and the label image in the spatial domain can be determined from the difference between the t-th frame output image and the t-th frame label image, the parameters of the image transformation network model are adjusted accordingly, and the spatial-domain information of the label image is thereby learned.
In S403, an optical flow is determined from the tag video, and the first output image is subjected to image conversion based on the determined optical flow to obtain a second output image.
After the output image is determined, in order to ensure its consistency in the time domain, the generated output image needs to be compared with label images at other time points, so that the video corresponding to the images output by the image transformation network model has no temporal consistency problem.
However, the first output image is generated by inputting two adjacent frames of the flicker video into the image transformation network model, and whether the output video has a temporal consistency problem cannot be determined directly from the first output image alone. For example, when the first output image is generated, the contrast or brightness of the image may be adjusted; while this satisfies the image quality enhancement requirement, the image may not be temporally consistent with its adjacent images.
For this purpose, the embodiment of the present application introduces optical flow, transforms the first output image through optical flow, and compares the transformed second output image with the second tag image with the same time point, so as to optimize the transformation parameters of the time domain of the image transformation network model according to the comparison result.
The optical flow refers to the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. It can be determined from adjacent label images in the label video; that is, an optical flow can be determined from any two adjacent frames of label images in the label video. For example, an optical flow can be determined from the 1st and 2nd frame label images, from the 2nd and 3rd frame label images, ..., and from the T-1-th and T-th frame label images, where T is the number of frames of label images included in the label video.
As shown in fig. 7, when calculating the optical flow from the label images G_t and G_{t-1} of the label video, a neural optical flow network (FlowNet) may be selected for the optical flow calculation. Neural optical flow networks include the simple neural optical flow network (FlowNetSimple), the correlation neural optical flow network (FlowNetCorr), and the like. The two frames whose flow is to be calculated are input into the neural optical flow network, and the optical flow calculation result of the two frames is output.
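A minimal sketch of computing the optical flow between two adjacent label frames is shown below; since no FlowNet weights are part of this document, a classical dense-flow routine stands in for the neural optical flow network, purely for illustration.

```python
# A minimal sketch of estimating dense optical flow between two adjacent
# label frames (stand-in for FlowNet); returns per-pixel displacements.
import cv2
import numpy as np

def label_optical_flow(g_t: np.ndarray, g_tm1: np.ndarray) -> np.ndarray:
    # Flow from the t-th label frame towards the (t-1)-th label frame.
    src = cv2.cvtColor(g_t, cv2.COLOR_BGR2GRAY)
    dst = cv2.cvtColor(g_tm1, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(src, dst, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    return flow  # shape HxWx2, per-pixel (dx, dy)
```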
The second output image adjacent to the first output image O_t may be the image O'_{t-1} of the frame preceding the first output image, or it may be the image of the frame following the first output image. For example, if the first output image is the t-th frame output image, the second output image may be the t-1-th frame output image or the t+1-th frame output image.
As for the direction of the optical flow, it may be the optical flow from the t-th frame label image to the t-1-th frame label image, which can be denoted F_{t→t-1}, or the optical flow from the t-1-th frame label image to the t-th frame label image. The selection of the direction of the optical flow is associated with the selection of the second label image. Through the second output image obtained after the optical-flow transformation, the first output image is made temporally consistent with the second label image. The second label image may be determined by the direction of the optical flow, or the direction of the optical flow may be determined based on the selected second label image.
For example, the first output image is the t-th frame output image and the first label image is the t-th frame label image. If the second label image is the t-1-th frame label image, the optical flow used to transform the first output image into the second output image may be the optical flow determined from the t-th frame label image to the t-1-th frame label image, and the t-1-th frame output image is obtained by applying this optical flow to the t-th frame output image.
If the second label image is the t+1-th frame label image, the optical flow used to transform the first output image into the second output image may be the optical flow from the t-th frame label image to the t+1-th frame label image, and applying this optical flow to the t-th frame output image yields the t+1-th frame output image.
The optical flows in the embodiment of the present application may be calculated in advance: according to the setting of the second label image, T-1 optical flows are obtained from the T-1 pairs of label images in the label video (i.e. the 1st and 2nd frames, the 2nd and 3rd frames, ..., the T-1-th and T-th frames). When the first output image is obtained through calculation, the optical flow at the same position is looked up according to the position of the first output image, and the first output image is transformed according to the found optical flow to obtain the second output image. For example, if the first output image is the t-th frame image, the t-1-th optical flow can be found and the first output image transformed according to it, where the 1st optical flow is calculated from the 1st and 2nd frame label images and the t-1-th optical flow is calculated from the t-1-th and t-th frame label images.
It should be noted that the sequence number of an output image in the embodiment of the present application is the larger of the sequence numbers of the two adjacent flicker frames used to calculate it. For example, the t-1-th frame flicker image and the t-th frame flicker image are input into the image transformation network model, the residual is obtained through the model calculation, and the t-th frame output image is calculated from the residual. That is, a flicker video comprising T frames is processed by the image transformation network model to obtain an output video comprising the 2nd to T-th frame output images.
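The warping step of S403 can be sketched as follows, assuming the optical flow gives per-pixel displacements from the t-th frame to the t-1-th frame; the function name and flow layout are illustrative assumptions.

```python
# A minimal sketch of warping the t-th frame output image with the optical
# flow derived from the label video to obtain the (t-1)-th frame output image
# used in the temporal consistency comparison.
import torch
import torch.nn.functional as F

def warp_with_flow(o_t: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # o_t: (B, C, H, W); flow: (B, 2, H, W) giving per-pixel (dx, dy).
    b, _, h, w = o_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(o_t.device)  # (2, H, W)
    new_pos = grid.unsqueeze(0) + flow                          # sample positions
    # Normalize positions to [-1, 1] as required by grid_sample.
    x = 2.0 * new_pos[:, 0] / max(w - 1, 1) - 1.0
    y = 2.0 * new_pos[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((x, y), dim=-1)                     # (B, H, W, 2)
    return F.grid_sample(o_t, grid_norm, align_corners=True)
```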
In S404, parameters of the image transformation network model are adjusted according to a difference between the output image and the tag image.
In the embodiment of the present application, the difference between the output image and the label image includes a difference between the first output image and the first label image, and a difference between the second output image and the second label image. The first label image and the second label image are adjacent images in the label video. The first label image and the first output image have the same content, and the second label image and the second output image have the same content, and there may be a difference caused by the image quality parameter or the temporal consistency parameter. When the first output image generated by the t-1 th frame flicker image and the t-th frame flicker image is the t-th frame output image, the first label image may be the t-th frame label image. And when the second output image is the t-1 frame output image, the second label image is the t-1 frame label image. And when the second output image is the t +1 th frame output image, the second label image is the t +1 th frame label image.
The difference between the first output image and the first label image may be a perceptual difference, which can be calculated with part of a convolutional neural network such as VGG, AlexNet or LeNet: for example, the convolutional layers of the network are used to extract features of the first output image and the first label image, and the perceptual loss between the two images is computed from these features.
In a possible implementation, the perceptual loss may be calculated according to a perceptual loss function of the form

L_p = (1/(N·T)) Σ_{t=1}^{T} Σ_{i=1}^{N} ‖φ_l(O_t)_i - φ_l(G_t)_i‖_1

where N is the total number of pixels in one frame of the flicker image or output image, T is the total number of frames of the flicker video, φ_l(.) represents the feature activation of layer l of the network φ, O_t represents the pixel values of the t-th frame output image, G_t represents the pixel values of the t-th frame label image, and L_p represents the perceptual loss.
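A minimal sketch of the perceptual loss is given below, using the early convolutional layers of a pretrained VGG16 as one possible choice of the network φ; the layer index and the L1 distance are illustrative assumptions.

```python
# A minimal sketch of a VGG-feature perceptual loss between the t-th frame
# output image O_t and the t-th frame label image G_t.
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    def __init__(self, layer_index: int = 16):
        super().__init__()
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:layer_index]
        for p in features.parameters():
            p.requires_grad_(False)   # phi is fixed during training
        self.phi_l = features.eval()

    def forward(self, o_t: torch.Tensor, g_t: torch.Tensor) -> torch.Tensor:
        # Inputs assumed to be RGB tensors in [0, 1]; ImageNet normalization
        # is omitted here for brevity.
        return torch.mean(torch.abs(self.phi_l(o_t) - self.phi_l(g_t)))
```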
The difference between the second output image and the second label image can be represented by calculating a temporal consistency loss. The temporal consistency loss function can be expressed as

L_t = (1/(N·T)) Σ_{t=2}^{T} Σ_{i=1}^{N} ‖O'_{t-1,i} - G_{t-1,i}‖_1

where O'_{t-1} represents the pixel values of the t-1-th frame output image (obtained by transforming the t-th frame output image with the optical flow), G_{t-1} represents the pixel values of the t-1-th frame label image, N is the total number of pixels in one frame of the flicker image or output image, T is the total number of frames of the flicker video, and L_t represents the temporal consistency loss.
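A minimal sketch of the temporal consistency loss follows, assuming the warped t-1-th frame output image has already been obtained (for example with the warping sketch above); the L1 distance is an illustrative assumption.

```python
# A minimal sketch of the temporal consistency loss between the warped
# (t-1)-th frame output image and the (t-1)-th frame label image.
import torch

def temporal_consistency_loss(o_tm1_warped: torch.Tensor,
                              g_tm1: torch.Tensor) -> torch.Tensor:
    # Mean absolute difference over all pixels of the two frames.
    return torch.mean(torch.abs(o_tm1_warped - g_tm1))
```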
In the parameter-optimization training, the total loss can be determined from the temporal consistency loss and the perceptual loss; that is, the total loss is calculated from the temporal consistency loss, the perceptual loss and their corresponding weight coefficients. It can be expressed as: L_Total = α_p·L_p + α_t·L_t, where L_Total denotes the total loss, α_t denotes the weight coefficient of the temporal consistency loss, and α_p denotes the weight coefficient of the perceptual loss. The larger a weight coefficient, the greater the influence of the associated loss on the total loss; the smaller the coefficient, the smaller the influence. For example, keeping α_t unchanged and increasing α_p increases the influence of the perceptual loss on the total loss; training the parameters of the image transformation network model with the total loss under these weight coefficients makes the output images generated by the trained model perceptually better, i.e. the image quality of the output video is better. Keeping α_p unchanged and increasing α_t increases the influence of the temporal consistency loss on the total loss; training the parameters with the total loss under these weight coefficients makes the output video generated by the trained model better in temporal consistency.
When the total loss L_Total of the images in the sample video, calculated according to the network framework shown in fig. 7, converges stably and effectively to a certain value, for example when the total loss value is less than a preset convergence threshold, the training of the image transformation network model is completed.
In the embodiment of the present application, in order to improve training efficiency, weight coefficient combinations may be formed from different values of the weight coefficient of the perceptual loss and the weight coefficient of the temporal consistency loss. The parameters of the image transformation network model can then be trained sequentially with two or more weight coefficient combinations.
In a possible implementation manner, two weight coefficient combinations may be adopted to train the parameters of the image transformation network model in sequence. In the first weight coefficient combination, the weight coefficient of the perceptual loss is smaller than the weight coefficient of the temporal consistency loss; in the second weight coefficient combination, the weight coefficient of the perceptual loss is larger than the weight coefficient of the temporal consistency loss. The second weight coefficient combination can be obtained by adjusting the first weight coefficient combination. For example, the weight coefficient of the temporal consistency loss in the first combination may be kept unchanged and the weight coefficient of the perceptual loss increased, or the weight coefficient of the temporal consistency loss may be decreased and the weight coefficient of the perceptual loss increased or kept unchanged.
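A minimal sketch of this two-stage training schedule is given below; the concrete weight values, optimizer and helper functions (perceptual_loss, temporal_loss, warp_with_flow, label_flow) are illustrative assumptions rather than values from the patent.

```python
# A minimal sketch of two-stage training: first emphasize the temporal
# consistency loss (alpha_t > alpha_p), then the perceptual loss (alpha_t < alpha_p).
import torch

def train(model, loader, perceptual_loss, temporal_loss, warp_with_flow,
          label_flow, epochs_per_stage=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    stages = [{"alpha_p": 1.0, "alpha_t": 2.0},   # stage 1: temporal consistency
              {"alpha_p": 2.0, "alpha_t": 1.0}]   # stage 2: perceptual quality
    for weights in stages:
        for _ in range(epochs_per_stage):
            for i_t, i_tm1, g_t, g_tm1 in loader:
                o_t = model(i_t, i_tm1)              # t-th frame output image
                flow = label_flow(g_t, g_tm1)        # flow from adjacent label frames
                o_tm1 = warp_with_flow(o_t, flow)    # (t-1)-th frame output image
                loss = (weights["alpha_p"] * perceptual_loss(o_t, g_t)
                        + weights["alpha_t"] * temporal_loss(o_tm1, g_tm1))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```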
Fig. 8 is a schematic diagram of a reference learning process provided in the embodiment of the present application; for simplicity of description, two adjacent frames of a video are taken as an example. The left image is a flicker video that has both a temporal consistency problem and an image quality problem. The temporal consistency problem is an inconsistency between frames in the time domain, for example, in the flicker video shown in fig. 8, the brightness of two adjacent frames is inconsistent. The image quality problem is a quality problem of a single frame in the flicker video, including problems such as sharpness and color.
When training is performed with the first weight coefficient combination, since the weight coefficient of the perceptual loss in the first combination is smaller than the weight coefficient of the temporal consistency loss (for example, the weight coefficient of the temporal consistency loss may be set to 1.2 times or more the weight coefficient of the perceptual loss), the total loss L_Total places more emphasis on the constraint learning of the temporal consistency parameters. After training with the first weight coefficient combination converges, the constraint learning of the temporal consistency parameters is completed. Performing image transformation calculation on the flicker video with the learned parameters of the image transformation network model effectively improves the temporal consistency of the images of the output video, as shown in the middle diagram of fig. 8, and the brightness problem between two adjacent frames is obviously alleviated.
After the learning of the temporal consistency constraint parameters is completed, the parameters can be further trained with the second weight coefficient combination. In the second weight coefficient combination, the weight coefficient of the perceptual loss is greater than the weight coefficient of the temporal consistency loss (for example, the weight coefficient of the perceptual loss may be set to 1.2 times or more the weight coefficient of the temporal consistency loss), so the total loss L_Total places more emphasis on the constraint learning of the perceptual parameters. After training with the second weight coefficient combination converges, the flicker video, or the video output using the parameters obtained after the first-stage training, can be transformed with the parameters of the trained image transformation network model to obtain the output video shown in the right diagram of fig. 8. The image quality of the output video is improved and the temporal consistency problem is overcome.
When the parameters are optimized with the first weight coefficient combination and then the second weight coefficient combination, the training proceeds in stages, each with its own emphasis. Compared with optimizing both objectives simultaneously, this makes the training process converge more easily and thus effectively improves training efficiency.
Fig. 9 is a schematic diagram illustrating how an image transformation network model trained by the training method shown in fig. 4 optimizes a video to be optimized, illustrated with two adjacent frame images. The video to be optimized may be a video with a temporal consistency problem, a video with an image quality problem, or, as shown in fig. 9, a video with both a temporal consistency problem (the brightness of adjacent frames differs significantly) and an image quality problem (the picture is blurred). After the computational transformation by the image transformation network model trained by the training method shown in fig. 4, the output video shown in the right image of fig. 9 can be produced: compared with the video to be optimized, the images of the output video are sharper, and the brightness change between adjacent frames is milder or essentially consistent, so that both the temporal consistency problem and the image quality problem are clearly improved.
When the video is optimized, the video collected by the electronic equipment can be optimized, and the received video sent by other electronic equipment can also be optimized by the electronic equipment. The video optimization method can be used before video playing and can also be used for optimizing during video playing.
Fig. 10 is a schematic diagram of an apparatus for training an image transformation network model according to an embodiment of the present application, and as shown in fig. 10, the apparatus includes: a sample video acquisition unit 1001, a first output image acquisition unit 1002, a second output image acquisition unit 1003, and a parameter adjustment unit 1004.
The sample video acquiring unit 1001 is configured to acquire a sample video, where the sample video includes a flash video and a tag video. The label video is the video expected after video optimization. The image quality of the image in the tag video is good, for example, whether the image in the video meets the requirement of the tag video can be determined through image quality evaluation parameters including parameters such as color, definition and the like. The adjacent images in the label video have temporal consistency, that is, the variation of the contrast or brightness of the adjacent images is smaller than a preset value. In contrast to the tag video, the image in the flicker video has image quality problems and time domain consistency problems, and whether the video is the flicker video can be determined by preset relevant parameters.
The first output image obtaining unit 1002 is configured to input two adjacent frames of the flicker video into a preset image transformation network model to obtain a first output image. The output of the image transformation network model may be a residual, and the first output image may be obtained by summing the residual and the corresponding image of the input flicker video. Outputting a residual allows the image transformation network model to remain easy to optimize during training as the number of layers is deepened.
The second output image obtaining unit 1003 is configured to determine an optical flow of an adjacent tag image in the tag video according to two adjacent frame images in the tag video, and transform the first output image according to the determined optical flow to obtain a second output image adjacent to the first output image. The first output image and the second output image are two adjacent frames of images in the output video. If the first output image is the output image of the t-th frame, the second output image can be the output image of the t-1 th frame or the output image of the t +1 th frame.
When transforming the first output image into the second output image, the selected optical flow may be determined according to the position of the first output image. For example, the first output image is a t-th frame output image, and the optical flow may be optical flows of a t-1 th frame tag image to a t-th frame tag image, or optical flows of a t +1 th frame to a t-th frame. And determining the second output image as the t +1 th frame output image or the t-1 th frame output image according to the direction of the optical flow.
The parameter adjusting unit 1004 is configured to determine a difference between the first output image and the first label image and a difference between the second output image and the second label image, adjust a parameter of the image transformation network model according to the difference until the calculated difference after the parameter adjustment meets a preset requirement, and complete training of the image transformation network model. The first label image and the second label image are images in the label video, the first label image is associated with the content of the first output image, and the second label image is associated with the content of the second output image. The content association here may be understood as that the content is the same, but the image quality parameter and/or the temporal consistency parameter are different, or may be understood as that the first tag image and the first output image are obtained by the same image transformation, for example, the t-th frame tag image as the first tag image is obtained by the t-th frame flicker image transformation; and calculating the output image of the t frame as the first output image according to the flicker image of the t-1 frame and the flicker image of the t frame.
The difference between the first output image and the first label image may be represented by a perceptual loss. The difference between the second output image and the second label image may be represented by a temporal consistency loss.
The perceptual loss may be calculated based on a neural network model, or may be calculated by a formula. For example, the perceptual loss may be expressed as

L_p = (1/(N·T)) Σ_{t=1}^{T} Σ_{i=1}^{N} ‖φ_l(O_t)_i - φ_l(G_t)_i‖_1

where N is the total number of pixels in one frame of the flicker image or output image, T is the total number of frames of the flicker video, φ_l(.) represents the feature activation of layer l of the network φ, O_t represents the pixel values of the t-th frame output image, G_t represents the pixel values of the t-th frame label image, and L_p represents the perceptual loss.
The temporal consistency loss may be expressed by a temporal consistency loss function

L_t = (1/(N·T)) Σ_{t=2}^{T} Σ_{i=1}^{N} ‖O'_{t-1,i} - G_{t-1,i}‖_1

where O'_{t-1} represents the pixel values of the t-1-th frame output image, G_{t-1} represents the pixel values of the t-1-th frame label image, N is the total number of pixels in one frame of the flicker image or output image, T is the total number of frames of the flicker video, and L_t represents the temporal consistency loss.
Based on the obtained perceptual loss and temporal consistency loss, the total loss can be obtained as their weighted sum using the corresponding weight coefficients, and the parameters in the image transformation network model are optimized and adjusted by driving the total loss to converge.
In a possible implementation manner, the parameters in the image transformation network model may be trained in stages with two or more sets of weight coefficients applied in sequence. For example, training may first use a weight combination in which the temporal consistency loss is weighted more heavily than the perceptual loss, constraining the temporal parameters in the model so that the video generated by the trained model has better temporal consistency. Training then uses a weight combination in which the perceptual loss is weighted more heavily than the temporal consistency loss, constraining the spatial parameters in the model so that the video generated by the trained model has both better image quality and temporal consistency.
The training apparatus of the image conversion network model shown in fig. 10 corresponds to the training method of the image conversion network model shown in fig. 4.
Fig. 11 is a schematic diagram of a video optimization apparatus according to an embodiment of the present application. As shown in fig. 11, the apparatus includes a to-be-optimized video acquisition unit 1101 configured to acquire a video to be optimized. The video to be optimized may be a video captured by the electronic device itself, or a received video transmitted by another electronic device. The video optimization unit 1102 is configured to perform optimization processing on the video to be optimized using the image transformation network model obtained through the training method shown in fig. 4, so as to obtain an optimized output video.
The video optimization apparatus shown in fig. 11 corresponds to the video optimization method shown in fig. 9.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
Moreover, various aspects or features of embodiments of the application may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer-readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips, etc.), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD), etc.), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), card, stick, or key drive, etc.). In addition, various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media capable of storing, containing, and/or carrying instruction(s) and/or data.
In the above embodiments, the apparatuses in fig. 10 or fig. 11 may be wholly or partially implemented by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be understood that, in various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply any order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not limit the implementation processes of the embodiments of the present application.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present application, which are essential or part of the technical solutions contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or an access network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims (12)

1. A method for training an image transformation network model, the method comprising:
the method comprises the steps that an electronic device obtains a sample video, wherein the sample video comprises a flicker video and a tag video, an image of the flicker video is an image comprising time domain noise and space domain noise, and the tag video is a video obtained by performing image quality enhancement processing and time domain consistency processing on the flicker video;
the electronic equipment inputs the image in the flickering video into a preset image transformation network model to obtain a first output image;
the electronic equipment determines an optical flow according to the tag video, transforms the first output image according to the optical flow to obtain a second output image adjacent to the first output image, the optical flow is determined according to two adjacent frame tag images in the tag video, and the optical flow at the same position is searched according to the position of the first output image;
the electronic equipment determines the difference between the first output image and the first label image and the difference between the second output image and the second label image, adjusts the parameters of the image transformation network model according to the difference between the first output image and the first label image and the difference between the second output image and the second label image until the calculated difference meets the preset requirement after the parameters are adjusted, and completes the training of the image transformation network model, wherein the first label image and the second label image are images in a label video, the content of the first label image is related to the content of the first output image, and the content of the second label image is related to the content of the second output image.
2. The method of claim 1, wherein when the video to be trained is a flickering video, the electronic device obtains a sample video, comprising:
the electronic equipment carries out time domain consistency processing on the images of the flickering video to obtain a video with stable time domain;
and the electronic equipment performs image enhancement processing on the video with the stable time domain to obtain a label video in the sample video.
3. The method of claim 1, wherein when the video to be trained is a temporally stable video, the electronic device obtains a sample video, comprising:
the electronic equipment carries out degradation processing on the image of the video to be trained to obtain a flicker video of the sample video;
and the electronic equipment performs image enhancement processing on the image of the video to be trained to obtain a label video of the sample video.
4. The method of claim 3, wherein an electronic device performs degradation processing on the image of the video to be trained, comprising:
and the electronic equipment randomly adjusts the brightness and/or the contrast of the image of the video to be trained according to the random brightness and/or the random contrast mode.
5. The method of claim 1, wherein the electronic device inputs the image in the flickering video into a preset image transformation network model to obtain a first output image in an output video, and the method comprises:
the electronic equipment inputs the t frame image of the flickering video and the t-1 frame image into a preset image transformation network model, and outputs a residual error between the t frame output image and the t frame image of the flickering video, wherein t is greater than or equal to 2;
and the electronic equipment sums the residual error and the t frame image of the flickering video to obtain a t frame output image, wherein the t frame output image is a first output image.
6. The method of claim 1, wherein the electronic device determines optical flow from tagged video, comprising:
the electronic equipment inputs the t frame image of the label video and the adjacent image of the t frame image into a pre-trained optical flow calculation network, and outputs the optical flows of the t frame image in the label video and the adjacent image of the t frame image, wherein t is greater than or equal to 2.
7. The method of claim 1, wherein the electronic device determining the difference between the first output image and the first label image comprises:
the electronic device determines a difference between the first output image and the first label image via a perceptual loss function.
8. The method of claim 1, wherein adjusting parameters of the image transformation network model based on the differences comprises:
determining a temporal consistency loss according to a difference between the second output image and the second label image, and determining a perception loss according to a difference between the first output image and the first label image;
according to the time domain consistency loss, the perception loss, a preset weight coefficient α_t of the time domain consistency loss and a weight coefficient α_p of the perception loss, determining a total loss of the image transformation network model;
and adjusting parameters of the image transformation network model according to the total loss.
9. The method of claim 8, wherein adjusting parameters of the image transformation network model based on the total loss comprises:
a first training process, adjusting the parameters of the image transformation network model according to a first weight coefficient combination;
a second training process, adjusting the parameters of the image transformation network model according to a second weight coefficient combination;
wherein, in the first weight coefficient combination, α_t > α_p, and in the second weight coefficient combination, α_t < α_p.
10. A method for video optimization, the method comprising:
acquiring a video to be optimized;
inputting the image of the video to be optimized into the trained image transformation network model according to any one of claims 1 to 9 to obtain an image of the optimized output video.
11. An electronic device comprising a processor coupled with a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions in the memory such that the electronic device performs the method of any of claims 1-10.
12. A computer-readable storage medium having instructions stored thereon, which when executed, implement the method of any one of claims 1-10.
CN202110990081.4A 2021-08-26 2021-08-26 Training method of image transformation network model and electronic equipment Active CN113705665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110990081.4A CN113705665B (en) 2021-08-26 2021-08-26 Training method of image transformation network model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110990081.4A CN113705665B (en) 2021-08-26 2021-08-26 Training method of image transformation network model and electronic equipment

Publications (2)

Publication Number Publication Date
CN113705665A CN113705665A (en) 2021-11-26
CN113705665B true CN113705665B (en) 2022-09-23

Family

ID=78655412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110990081.4A Active CN113705665B (en) 2021-08-26 2021-08-26 Training method of image transformation network model and electronic equipment

Country Status (1)

Country Link
CN (1) CN113705665B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363659A (en) * 2021-12-15 2022-04-15 深圳万兴软件有限公司 Method, device, equipment and storage medium for reducing video flicker
CN116055710B (en) * 2022-08-10 2023-10-20 荣耀终端有限公司 Video time domain noise evaluation method, device and system
CN116506261B (en) * 2023-06-27 2023-09-08 南昌大学 Visible light communication sensing method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111614965A (en) * 2020-05-07 2020-09-01 武汉大学 Unmanned aerial vehicle video image stabilization method and system based on image grid optical flow filtering
CN112634178A (en) * 2021-01-13 2021-04-09 北京大学 Video rain removing method and device based on bidirectional time domain consistency

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304755B (en) * 2017-03-08 2021-05-18 腾讯科技(深圳)有限公司 Training method and device of neural network model for image processing
CN110490896B (en) * 2018-01-25 2022-11-29 腾讯科技(深圳)有限公司 Video frame image processing method and device
CN109740670B (en) * 2019-01-02 2022-01-11 京东方科技集团股份有限公司 Video classification method and device
US11288818B2 (en) * 2019-02-19 2022-03-29 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN110175951B (en) * 2019-05-16 2022-12-02 西安电子科技大学 Video style migration method based on time domain consistency constraint
CN110475118A (en) * 2019-07-11 2019-11-19 北京工业大学 A kind of old film flicker removal method based on attention mechanism deep-cycle network
CN111986105B (en) * 2020-07-27 2024-03-26 成都考拉悠然科技有限公司 Video time sequence consistency enhancing method based on time domain denoising mask
CN113159019B (en) * 2021-03-08 2022-11-08 北京理工大学 Dim light video enhancement method based on optical flow transformation

Also Published As

Publication number Publication date
CN113705665A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN113705665B (en) Training method of image transformation network model and electronic equipment
CN113706414B (en) Training method of video optimization model and electronic equipment
CN111738122A (en) Image processing method and related device
CN113099146B (en) Video generation method and device and related equipment
WO2021190348A1 (en) Image processing method and electronic device
CN115061770B (en) Method and electronic device for displaying dynamic wallpaper
WO2021078001A1 (en) Image enhancement method and apparatus
WO2021219095A1 (en) Living body detection method, and related device
CN115689963B (en) Image processing method and electronic equipment
CN114140365B (en) Event frame-based feature point matching method and electronic equipment
CN113536866A (en) Character tracking display method and electronic equipment
CN111612723B (en) Image restoration method and device
CN115086567A (en) Time-delay shooting method and device
CN113538227A (en) Image processing method based on semantic segmentation and related equipment
CN115272138A (en) Image processing method and related device
CN113452969B (en) Image processing method and device
CN116916151A (en) Shooting method, electronic device and storage medium
WO2021190097A1 (en) Image processing method and device
CN114399622A (en) Image processing method and related device
CN115525188A (en) Shooting method and electronic equipment
CN114119413A (en) Image processing method and device, readable medium and mobile terminal
CN114793283A (en) Image encoding method, image decoding method, terminal device, and readable storage medium
CN115633262B (en) Image processing method and electronic device
CN112950516B (en) Method and device for enhancing local contrast of image, storage medium and electronic equipment
WO2023124237A1 (en) Image processing method and apparatus based on under-screen image, and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230912

Address after: 201306 building C, No. 888, Huanhu West 2nd Road, Lingang New Area, Pudong New Area, Shanghai

Patentee after: Shanghai Glory Smart Technology Development Co.,Ltd.

Address before: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040

Patentee before: Honor Device Co.,Ltd.