WO2023056730A1 - Video image augmentation method, network training method, electronic device and storage medium - Google Patents

Video image augmentation method, network training method, electronic device and storage medium

Info

Publication number
WO2023056730A1
Authority
WO
WIPO (PCT)
Prior art keywords
video image
short
exposure
frame
network
Prior art date
Application number
PCT/CN2022/081245
Other languages
French (fr)
Chinese (zh)
Inventor
游晶
徐科
孔德辉
艾吉松
刘欣
任聪
Original Assignee
深圳市中兴微电子技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市中兴微电子技术有限公司
Publication of WO2023056730A1 publication Critical patent/WO2023056730A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06T 5/60
    • G06T 5/70
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; Using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Definitions

  • Embodiments of the present disclosure relate to but are not limited to the technical field of video image processing, in particular, to a video image enhancement method, a network training method, electronic equipment, and a computer-readable storage medium.
  • Neural-network-based night-scene video image enhancement is one of the main current research directions.
  • However, existing algorithms mainly focus on a single direction such as denoising, brightness enhancement, or color restoration, and often cannot address denoising and distortion avoidance at the same time, so the enhancement effect is poor.
  • Embodiments of the present disclosure provide a video image enhancement method, a network training method, electronic equipment, and a computer-readable storage medium.
  • an embodiment of the present disclosure provides a video image enhancement method, including: using a first network to extract a first feature of an i-th frame short-exposure video image, where i is a variable that takes the values 0, 1, 2, 3, ... in sequence; using a second network to denoise the (i-N)-th to (i+N)-th frame short-exposure video images to obtain a denoised video image corresponding to the i-th frame short-exposure video image, where N is a constant greater than or equal to 1; and performing first fusion processing on the first feature of the i-th frame short-exposure video image and the denoised video image corresponding to the i-th frame short-exposure video image to obtain an enhanced video image corresponding to the i-th frame short-exposure video image.
  • an embodiment of the present disclosure provides a network training method, including: for each training sample, using any one of the above video image enhancement methods to obtain an enhanced video image corresponding to the training sample, where the training sample includes (2N+1) frames of short-exposure video images; determining a total objective function value according to the enhanced video images and long-exposure video images corresponding to the training samples, and updating training parameter values according to the total objective function value; and, for each training sample, obtaining the enhanced video image corresponding to the training sample with any one of the above video image enhancement methods according to the updated training parameter values, until the total objective function value converges.
  • an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory storing at least one program which, when executed by the at least one processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
  • an embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
  • the video image enhancement method provided by the embodiments of the present disclosure performs feature extraction and denoising on the short-exposure video image through two different branches and then fuses the results of the two branches to obtain the enhanced video image; while denoising is performed, distortion is avoided through feature extraction, improving the enhancement effect.
  • the network training method provided by the embodiments of the present disclosure trains the above video image enhancement method to obtain optimal training parameter values, further improving the enhancement effect.
  • FIG. 1 is a flowchart of a video image enhancement method provided by an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of one realizable structure of the first network provided by an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of another realizable structure of the first network provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of one realizable structure of the second network provided by an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of one realizable structure of the codec sub-module provided by an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a network training method provided by another embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a video image enhancement method provided by an embodiment of the present disclosure.
  • an embodiment of the present disclosure provides a video image enhancement method, including:
  • Step 100: use the first network to extract the first feature of the i-th frame short-exposure video image, where i is a variable that takes the values 0, 1, 2, 3, ... in sequence.
  • the i-th frame short-exposure video image is a video image in a sequence of Bayer-format video images obtained from a CMOS (Complementary Metal Oxide Semiconductor) sensor, and each frame of short-exposure video image corresponds to a frame of long-exposure image.
  • the first network may include: a first subnetwork, a second subnetwork, and a fusion submodule.
  • the fusion sub-module is configured to implement the second fusion process.
  • the first network may further include: a down-sampling sub-module and an up-sampling sub-module.
  • the down-sampling sub-module is configured to realize down-sampling processing
  • the up-sampling sub-module is configured to realize up-sampling processing.
  • using the first network to extract the first feature of the i-th frame short-exposure video image includes: using the first sub-network to extract the second feature of the i-th frame short-exposure video image, and using the second sub-network to extract the third feature of the i-th frame short-exposure video image; and performing the second fusion process on the second feature and the third feature of the i-th frame short-exposure video image to obtain the first feature of the i-th frame short-exposure video image.
  • the second feature may be a local feature
  • the third feature may be a global feature
  • the local feature refers to local information of the short-exposure video image, such as local maxima and local minima, which reflects the local detail of the video image.
  • the global feature refers to global information of the short-exposure video image, such as brightness histogram distribution and color information, at least one of which can be used for brightness adjustment and color correction.
  • when the first network includes, besides the first sub-network, the second sub-network, and the fusion sub-module, a down-sampling sub-module and an up-sampling sub-module, using the first network to extract the first feature of the i-th frame short-exposure video image includes: down-sampling the i-th frame short-exposure video image to obtain a down-sampled video image corresponding to the i-th frame short-exposure video image; using the first sub-network to extract the fourth feature of the down-sampled video image; using the second sub-network to extract the fifth feature of the down-sampled video image; performing the second fusion process on the fourth feature and the fifth feature to obtain the sixth feature of the down-sampled video image; and up-sampling the sixth feature to obtain the first feature of the i-th frame short-exposure video image.
  • the fourth feature may be a local feature
  • the fifth feature may be a global feature
  • the first subnetwork includes: N1 first convolutional layers; wherein, N1 is an integer greater than or equal to 3, and the first convolutional layer is configured to implement a first convolutional operation.
  • the first sub-network may be L-NET.
  • the second sub-network includes: N2 second convolutional layers and 3 fully connected (FC) layers; wherein N2 is an integer greater than or equal to 3, the second convolutional layer is configured to implement the second convolution operation, and the FC layer is configured to implement the FC operation.
  • the second sub-network can be G-NET.
  • the 3 FCs in the second sub-network are located after the N2 second convolutional layers.
  • each neuron in the FC is fully connected to all neurons in the previous layer.
  • an FC layer can learn global features using a receptive field that covers the entire short-exposure video image.
  • the down-sampling sub-module may be implemented by any operation or network that can perform down-sampling, for example a convolution with a stride of 2, a space-to-depth (S2D) operation, or a pooling operation.
  • the down-sampling sub-module includes: N3 third convolutional layers and N4 first pooling layers; wherein N3 and N4 are integers greater than or equal to 1, the third convolutional layer is configured to implement the third convolution operation, and the first pooling layer is configured to implement the first pooling operation.
  • the up-sampling sub-module may be implemented by any operation or network that can perform up-sampling, for example a deconvolution operation, a depth-to-space (D2S) operation, or an interpolation operation.
  • the downsampling multiple of the downsampling process can be set according to actual needs.
  • the upsampling process is the inverse of the downsampling process to restore the original resolution.
  • for example, suppose the i-th frame short-exposure video image has size H×W×1, where H is the height, W is the width, and 1 is the number of channels. The i-th frame short-exposure video image is down-sampled to a video image of size 256×256×1; the first sub-network then outputs a fourth feature of size 256×256×1 and the second sub-network outputs a fifth feature of size 256×256×1; the fusion sub-module performs the second fusion process on the fourth and fifth features; and the result is finally sent to the up-sampling sub-module to restore the first feature of size H×W×1.
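  • As a concrete illustration of this two-branch structure, the following minimal PyTorch sketch assumes an L-NET of three convolutional layers, a G-NET of three convolutional layers plus three fully connected layers, element-wise addition as the second fusion process, and bilinear interpolation for the down-/up-sampling sub-modules; all layer widths, kernel sizes, and the 256×256 working resolution are illustrative assumptions rather than configurations fixed by the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstNetwork(nn.Module):
    """Sketch of the first network: local branch (L-NET), global branch
    (G-NET), fusion, with down-/up-sampling to reduce computation.
    Layer counts and widths are illustrative assumptions."""
    def __init__(self, size=256):
        super().__init__()
        self.size = size
        # L-NET: N1 >= 3 convolutional layers extracting local features.
        self.l_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
        # G-NET: N2 >= 3 convolutional layers followed by 3 fully
        # connected layers whose receptive field spans the whole image.
        self.g_convs = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.g_fcs = nn.Sequential(
            nn.Linear(8 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, size * size),
        )

    def forward(self, x):  # x: (B, 1, H, W) Bayer-domain frame i
        h, w = x.shape[-2:]
        # Down-sample to the fixed working resolution (e.g. 256x256x1).
        ds = F.interpolate(x, size=(self.size, self.size),
                           mode='bilinear', align_corners=False)
        local = self.l_net(ds)                              # fourth feature
        g = self.g_convs(ds).flatten(1)
        glob = self.g_fcs(g).view(-1, 1, self.size, self.size)  # fifth
        fused = local + glob                   # second fusion process
        # Up-sample back to the original H x W x 1 resolution.
        return F.interpolate(fused, size=(h, w),
                             mode='bilinear', align_corners=False)  # first feature
```

  Calling FirstNetwork()(x) on a (B, 1, H, W) tensor returns the H×W×1 first feature described above.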
  • the fusion sub-module may employ any operation or network that implements a fusion function, for example element-wise (corresponding-pixel) addition, concatenation, or convolution.
  • Step 101: use the second network to denoise the (i-N)-th to (i+N)-th frame short-exposure video images to obtain a denoised video image corresponding to the i-th frame short-exposure video image, where N is a constant greater than or equal to 1.
  • when i is less than N, the (i-N)-th to (-1)-th frame short-exposure video images are the same as the 0th frame short-exposure video image; when i is greater than the difference between M and N, the (M+1)-th to (i+N)-th frame short-exposure video images are the same as the last frame short-exposure video image, where M is the total number of frames of short-exposure video images.
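  • A minimal sketch of this boundary rule, assuming the video is held as a Python list of frames: window indices outside the valid range are clamped, so the window repeats the first frame at the start of the sequence and the last frame at the end.

```python
def neighbor_frames(frames, i, n):
    """Return the (2n+1)-frame window centered on frame i.

    Out-of-range indices are clamped, so the window repeats the first
    frame at the start of the video and the last frame at the end,
    matching the boundary rule described above.
    """
    last = len(frames) - 1
    return [frames[min(max(j, 0), last)] for j in range(i - n, i + n + 1)]
```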
  • the value of N may be set according to actual needs. The larger the value of N is, the higher the complexity of the alignment operation is.
  • the second network includes: an alignment submodule, a codec submodule, and an output submodule.
  • the alignment submodule is configured to implement an alignment operation
  • the codec submodule is configured to implement encoding and decoding processing
  • the output submodule is configured to implement output processing.
  • the alignment sub-module may include: N5 fourth convolutional layers and N6 first residual connection layers; wherein N5 and N6 are integers greater than or equal to 1, the fourth convolutional layer is configured to implement the fourth convolution operation, and the first residual connection layer is configured to implement the first residual connection operation.
  • the alignment sub-module may also be implemented by any other operation or network that can align multi-frame video images, such as an optical flow network, a deformable convolutional network, or a motion estimation and motion compensation (MEMC) network.
  • using the second network to denoise the (i-N)-th to (i+N)-th frame short-exposure video images to obtain the denoised video image corresponding to the i-th frame short-exposure video image includes: performing the alignment operation on the (i-N)-th to (i+N)-th frame short-exposure video images to obtain an aligned video image; performing encoding and decoding processing on the aligned video image to obtain a codec-processed video image; and performing output processing on the codec-processed video image to obtain the denoised video image corresponding to the i-th frame short-exposure video image.
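  • The following PyTorch sketch shows this align-codec-output pipeline end to end. The residual convolutional alignment, the channel widths, and the pluggable codec argument are illustrative assumptions; a UNET-style codec in the spirit of FIG. 5 can be assembled from the block sketch given further below.

```python
import torch
import torch.nn as nn

class SecondNetwork(nn.Module):
    """Sketch of the denoising branch: alignment, encoder-decoder
    (codec), and output sub-modules. The alignment sub-module here is a
    plain convolutional stack with a residual connection; the codec is
    any module mapping the aligned stack to a feature map."""
    def __init__(self, n=1, codec=None):
        super().__init__()
        c = 2 * n + 1  # one channel per frame in the (2N+1)-frame window
        self.align = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1),
        )
        # With the default identity codec the output stage reads 2N+1
        # channels; a real UNET-style codec would change this width.
        self.codec = codec if codec is not None else nn.Identity()
        self.output = nn.Sequential(  # N23 >= 3 convolutional layers
            nn.Conv2d(c, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, window):  # window: (B, 2N+1, H, W)
        aligned = window + self.align(window)  # residual alignment
        decoded = self.codec(aligned)          # encoding/decoding
        return self.output(decoded)            # denoised frame i
```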
  • the codec sub-module may include: a first encoding sub-module, a second encoding sub-module, a third encoding sub-module, a fourth encoding sub-module, a second output sub-module, a first F sub-module, a second F sub-module, a third F sub-module, a fourth F sub-module, a first decoding sub-module, a second decoding sub-module, a third decoding sub-module, and a fourth decoding sub-module.
  • the codec sub-module may also be implemented by any encoder-decoder network that can perform image denoising, such as a U-shaped network (UNET) or a convolutional blind denoising network (CBDNet).
  • the first encoding submodule is configured to implement the first encoding process
  • the second encoding submodule is configured to implement the second encoding process
  • the third encoding submodule is configured to implement the third encoding process
  • the fourth encoding submodule is configured to implement the fourth encoding process.
  • the second output submodule is configured to implement a second output process
  • the first F submodule is configured to implement a first F operation
  • the second F submodule is configured to implement a second F operation
  • the third F submodule is configured to implement the third F operation
  • the fourth F submodule is configured to implement the fourth F operation
  • the first decoding submodule is configured to implement the first decoding process
  • the second decoding submodule is configured to implement the second decoding process
  • the third decoding sub-module is configured to implement the third decoding process
  • the fourth decoding sub-module is configured to implement the fourth decoding process.
  • the first encoding sub-module includes: N7 fifth convolutional layers, N8 second residual connection layers, and 1 second pooling layer; wherein N7 and N8 are integers greater than or equal to 1, the fifth convolutional layer is configured to implement the fifth convolution operation, the second residual connection layer is configured to implement the second residual connection operation, and the second pooling layer is configured to implement the second pooling operation.
  • the second encoding sub-module includes: N9 sixth convolutional layers, N10 third residual connection layers, and 1 third pooling layer; wherein N9 and N10 are integers greater than or equal to 1, the sixth convolutional layer is configured to implement the sixth convolution operation, the third residual connection layer is configured to implement the third residual connection operation, and the third pooling layer is configured to implement the third pooling operation.
  • the third encoding sub-module includes: N11 seventh convolutional layers, N12 fourth residual connection layers, and 1 fourth pooling layer; wherein N11 and N12 are integers greater than or equal to 1, the seventh convolutional layer is configured to implement the seventh convolution operation, the fourth residual connection layer is configured to implement the fourth residual connection operation, and the fourth pooling layer is configured to implement the fourth pooling operation.
  • the fourth encoding sub-module includes: N13 eighth convolutional layers, N14 fifth residual connection layers, and 1 fifth pooling layer; wherein N13 and N14 are integers greater than or equal to 1, the eighth convolutional layer is configured to implement the eighth convolution operation, the fifth residual connection layer is configured to implement the fifth residual connection operation, and the fifth pooling layer is configured to implement the fifth pooling operation.
  • the first decoding sub-module includes: N15 ninth convolutional layers, N16 sixth residual connection layers, and 1 first deconvolution layer; wherein N15 and N16 are integers greater than or equal to 1, the ninth convolutional layer is configured to implement the ninth convolution operation, the sixth residual connection layer is configured to implement the sixth residual connection operation, and the first deconvolution layer is configured to implement the first deconvolution operation. The second decoding sub-module includes: N17 tenth convolutional layers, N18 seventh residual connection layers, and 1 second deconvolution layer; wherein N17 and N18 are integers greater than or equal to 1, the tenth convolutional layer is configured to implement the tenth convolution operation, the seventh residual connection layer is configured to implement the seventh residual connection operation, and the second deconvolution layer is configured to implement the second deconvolution operation. The third decoding sub-module includes: N19 eleventh convolutional layers, N20 eighth residual connection layers, and 1 third deconvolution layer; wherein N19 and N20 are integers greater than or equal to 1, the eleventh convolutional layer is configured to implement the eleventh convolution operation, the eighth residual connection layer is configured to implement the eighth residual connection operation, and the third deconvolution layer is configured to implement the third deconvolution operation. The fourth decoding sub-module includes: N21 twelfth convolutional layers, N22 ninth residual connection layers, and 1 fourth deconvolution layer; wherein N21 and N22 are integers greater than or equal to 1, the twelfth convolutional layer is configured to implement the twelfth convolution operation, the ninth residual connection layer is configured to implement the ninth residual connection operation, and the fourth deconvolution layer is configured to implement the fourth deconvolution operation.
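  • To make the encoder/decoder pattern concrete: each encoding sub-module pairs convolutions and a residual connection with a pooling step that halves the spatial resolution, and each decoding sub-module mirrors it with a deconvolution that doubles the resolution before an F sub-module merges in the symmetric encoder feature. A minimal PyTorch sketch, with assumed channel widths and concatenation as the F operation:

```python
import torch
import torch.nn as nn

class EncodeBlock(nn.Module):
    """One encoding sub-module: convolutions plus a residual
    connection, then pooling to halve the spatial resolution."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1),
        )
        self.skip = nn.Conv2d(cin, cout, 1)   # residual connection
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = torch.relu(self.conv(x) + self.skip(x))
        return self.pool(feat), feat          # pooled output + skip feature

class DecodeBlock(nn.Module):
    """One decoding sub-module: deconvolution to double the spatial
    resolution, then an F operation (here, concatenation with the
    symmetric encoder feature) followed by convolutions."""
    def __init__(self, cin, cout):
        super().__init__()
        self.up = nn.ConvTranspose2d(cin, cout, 2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * cout, cout, 3, padding=1), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x, enc_feat):
        x = self.up(x)
        x = torch.cat([x, enc_feat], dim=1)   # F operation: concatenation
        return self.conv(x)
```

  Chaining four EncodeBlocks and four DecodeBlocks around a bottleneck reproduces the halving/doubling resolution schedule described in the surrounding bullets.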
  • the first F sub-module, the second F sub-module, the third F sub-module, and the fourth F sub-module may be implemented by any operation that can realize the interconnection, for example an addition (add) operation, a concatenation (concat) operation, or a convolution operation.
  • performing encoding and decoding processing on the aligned video image includes: performing the first encoding process on the aligned video image to obtain a first-encoded video image, wherein the first encoding process includes the fifth convolution operation, the second residual connection operation, and the second pooling operation; performing the second encoding process on the first-encoded video image to obtain a second-encoded video image, wherein the second encoding process includes the sixth convolution operation, the third residual connection operation, and the third pooling operation; performing the third encoding process on the second-encoded video image to obtain a third-encoded video image, wherein the third encoding process includes the seventh convolution operation, the fourth residual connection operation, and the fourth pooling operation; and performing the fourth encoding process on the third-encoded video image to obtain a fourth-encoded video image, wherein the fourth encoding process includes the eighth convolution operation, the fifth residual connection operation, and the fifth pooling operation.
  • the resolution of the video image after the first encoding process is half the resolution of the video image after the alignment operation; the resolution after the second encoding process is half the resolution after the first encoding process; the resolution after the third encoding process is half the resolution after the second encoding process; and the resolution after the fourth encoding process is half the resolution after the third encoding process.
  • the resolution of the video image after the first decoding process is twice the resolution of the video image after the fourth encoding process; the resolution after the second decoding process is twice the resolution after the first decoding process; the resolution after the third decoding process is twice the resolution after the second decoding process; and the resolution after the fourth decoding process is twice the resolution after the third decoding process.
  • the size of the video image after the alignment operation is H×W×(2N+1), where H is the height, W is the width, and 2N+1 is the number of channels. The size of the video image before the second pooling operation is H×W×32, and the size after the first encoding process is H/2×W/2×128; the size before the third pooling operation is H/2×W/2×64, and the size after the second encoding process is H/4×W/4×256; the size before the fourth pooling operation is H/4×W/4×128, and the size after the third encoding process is H/8×W/8×512; the size before the fifth pooling operation is H/8×W/8×256, and the size after the fourth encoding process is H/16×W/16×1024.
  • the size of the video image before the first deconvolution operation is H/16×W/16×512, the size after the first decoding process is H/8×W/8×256, and the size after the first F operation is H/8×W/8×512; the size before the second deconvolution operation is H/8×W/8×256, the size after the second decoding process is H/4×W/4×128, and the size after the second F operation is H/4×W/4×256; the size before the third deconvolution operation is H/4×W/4×128, the size after the third decoding process is H/2×W/2×64, and the size after the third F operation is H/2×W/2×128; the size before the fourth deconvolution operation is H/2×W/2×64, the size after the fourth decoding process is H×W×32, and the size after the fourth F operation is H×W×64.
  • the output sub-module includes: N23 thirteenth convolutional layers; wherein N23 is an integer greater than or equal to 3, and the thirteenth convolutional layer is configured to implement the thirteenth convolution operation.
  • Step 102: perform the first fusion processing on the first feature of the i-th frame short-exposure video image and the denoised video image corresponding to the i-th frame short-exposure video image to obtain an enhanced video image corresponding to the i-th frame short-exposure video image.
  • the first fusion processing may multiply the first feature of the i-th frame short-exposure video image with the pixel values at the same positions in the denoised video image corresponding to the i-th frame short-exposure video image, to obtain the enhanced video image corresponding to the i-th frame short-exposure video image.
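  • A one-line sketch of this pixel-wise variant of the first fusion processing, assuming both tensors share a (B, 1, H, W) shape:

```python
def first_fusion(first_feature, denoised):
    """Pixel-wise product of the first feature and the denoised frame,
    yielding the enhanced frame. Assumes matching (B, 1, H, W) shapes."""
    return first_feature * denoised
```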
  • the first fusion processing may also be implemented by any operation or network that realizes a fusion function, for example element-wise (corresponding-pixel) addition, concatenation, or convolution.
  • the enhanced video image corresponding to the i-th frame short-exposure video image is sent to a subsequent image signal processing (ISP) module and other data processing modules for corresponding processing.
  • the video image enhancement method provided by the embodiments of the present disclosure performs feature extraction and denoising on the short-exposure video image through two different branches and then fuses the results of the two branches to obtain the enhanced video image; while denoising is performed, distortion is avoided through feature extraction, improving the enhancement effect.
  • FIG. 6 is a flowchart of a network training method provided by another embodiment of the present disclosure.
  • referring to FIG. 6, another embodiment of the present disclosure provides a network training method, including:
  • Step 600: for each training sample, use any one of the above video image enhancement methods to obtain an enhanced video image corresponding to the training sample; wherein the training sample includes (2N+1) frames of short-exposure video images.
  • each frame of short-exposure video image corresponds to a frame of long-exposure video image.
  • the i-th training sample includes: (2N+1) frames of short-exposure video images centered on the i-th frame short-exposure video image.
  • Step 601: determine the total objective function value according to the enhanced video images and the long-exposure video images corresponding to the training samples, and update the training parameter values according to the total objective function value.
  • determining the total objective function value according to the enhanced video images and long-exposure video images corresponding to the training samples includes: for each training sample, determining the objective function value corresponding to the training sample according to the enhanced video image and the long-exposure video image corresponding to the training sample; and determining the total objective function value according to the objective function values corresponding to the training samples.
  • when calculating the objective function value corresponding to each training sample, it should be calculated from the enhanced video image corresponding to that training sample and the long-exposure video image corresponding to that training sample.
  • in the objective function, I_out(i, j) is the pixel value at row i, column j of the enhanced video image corresponding to the training sample, I_GT(i, j) is the pixel value at row i, column j of the long-exposure video image corresponding to the training sample, m is the total number of rows, and n is the total number of columns.
  • the total objective function value is an average value of the objective function values corresponding to all training samples.
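  • The per-sample objective formula itself is not preserved in this text. As a hedged illustration only, the sketch below assumes a mean absolute error over the m×n pixels and averages it across training samples to form the total objective function value:

```python
import torch

def sample_objective(i_out, i_gt):
    """Assumed per-sample objective: mean absolute difference between
    the enhanced image I_out and the long-exposure image I_GT over all
    m x n pixels. The exact formula in the source is not preserved."""
    return (i_out - i_gt).abs().mean()

def total_objective(outs, gts):
    """Total objective: average of the per-sample objective values."""
    vals = [sample_objective(o, g) for o, g in zip(outs, gts)]
    return torch.stack(vals).mean()
```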
  • the training parameter value is updated according to the total objective function value; or, the training parameter value is updated according to the total objective function value and a preset index.
  • the preset index may be at least one of peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and the like.
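  • For reference, one common way to compute PSNR, which is one possible realization of the preset index (the peak value and data range here are assumptions):

```python
import torch

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming pixel values in
    [0, peak]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(peak ** 2 / mse)
```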
  • the training parameters may be parameters that need to be trained in the first sub-network, the second sub-network, and the fusion sub-module.
  • Step 602: for each training sample, according to the updated training parameter values, use any one of the above video image enhancement methods to obtain the enhanced video image corresponding to the training sample, until the total objective function value converges.
  • according to the updated training parameter values, any one of the above video image enhancement methods is used to obtain the enhanced video images corresponding to the training samples until the total objective function value converges; or, according to the updated training parameter values, any one of the above video image enhancement methods is used to obtain the enhanced video images corresponding to the training samples until the total objective function value converges and the preset index reaches its optimum.
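  • Putting the pieces together, a minimal training loop under the assumptions above; it reuses total_objective from the earlier sketch, and the optimizer, learning rate, and convergence test are all illustrative:

```python
import torch

def train(first_net, second_net, samples, gts,
          steps=1000, lr=1e-4, tol=1e-6):
    """Sketch of the training procedure: enhance each (2N+1)-frame
    sample, compute the total objective against the long-exposure
    ground truths, and update the training parameters until the
    objective value stops improving. All hyperparameters are assumed."""
    params = list(first_net.parameters()) + list(second_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    prev = float('inf')
    for _ in range(steps):
        outs = []
        for window in samples:             # window: (B, 2N+1, H, W)
            k = window.shape[1] // 2
            center = window[:, k:k + 1]     # frame i of the window
            feat = first_net(center)        # first feature of frame i
            den = second_net(window)        # denoised frame i
            outs.append(feat * den)         # first fusion processing
        loss = total_objective(outs, gts)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if abs(prev - loss.item()) < tol:   # crude convergence test
            break
        prev = loss.item()
    return loss.item()
```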
  • the network training method provided by the embodiments of the present disclosure trains the above video image enhancement method to obtain optimal training parameter values, further improving the enhancement effect.
  • another embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory storing at least one program which, when executed by the at least one processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
  • the processor is a device with data processing capability, including but not limited to a central processing unit (CPU) and the like.
  • the memory is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH).
  • the processor and the memory are connected to each other through a bus, and further connected to other components of the computing device.
  • another embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any one of the above video image enhancement methods or any one of the above network training methods is implemented.
  • the functional modules/units in the system and the device may be implemented as software, firmware, hardware, or an appropriate combination thereof.
  • the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation.
  • Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit.
  • Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present disclosure provide a video image augmentation method and device, a network training method, an electronic device and a computer-readable storage medium. The video image enhancement method comprises: using a first network to extract a first feature of the i-th frame of a short-exposure video image, where i is a variable sequentially selected from 0, 1, 2, 3...; using a second network to denoise the (i-N)th frame of the short-exposure video image to the (i+N)th frame of the short-exposure video image and obtain a denoised video image corresponding to the i-th frame of the short-exposure video image, where N is a constant greater than or equal to 1; and performing first fusion processing on the first feature of the i-th frame of the short-exposure video image and the denoised video image corresponding to the i-th frame of the short-exposure video image to obtain an augmented video image corresponding to the i-th frame of the short-exposure video image. FIG. 1

Description

Video image enhancement method, network training method, electronic device, storage medium
Cross-Reference to Related Applications
This disclosure claims priority to Chinese patent application No. 202111174652.3, filed with the State Intellectual Property Office on October 9, 2021 and entitled "Video image enhancement method, network training method, electronic device, storage medium", the entire contents of which are incorporated into this disclosure by reference.
Technical Field
Embodiments of the present disclosure relate to, but are not limited to, the technical field of video image processing, and in particular to a video image enhancement method, a network training method, an electronic device, and a computer-readable storage medium.
Background
With the popularization and development of terminal devices such as video surveillance systems and mobile phones, the effect of ambient light on device imaging remains one of the main outstanding problems. A video imaging system can provide clear images with high color fidelity during the day. At night, however, underexposure of the image sensor greatly degrades imaging quality, causing heavy image noise, low brightness, loss of detail, and color distortion. These low-light deficiencies seriously affect subsequent tasks such as object recognition and segmentation. It is therefore necessary to study video image enhancement algorithms for night scenes.
Neural-network-based night-scene video image enhancement is one of the main current research directions. However, existing algorithms mainly focus on a single direction such as denoising, brightness enhancement, or color restoration, and often cannot address denoising and distortion avoidance at the same time, so the enhancement effect is poor.
Summary
Embodiments of the present disclosure provide a video image enhancement method, a network training method, an electronic device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a video image enhancement method, including: using a first network to extract a first feature of an i-th frame short-exposure video image, where i is a variable that takes the values 0, 1, 2, 3, ... in sequence; using a second network to denoise the (i-N)-th to (i+N)-th frame short-exposure video images to obtain a denoised video image corresponding to the i-th frame short-exposure video image, where N is a constant greater than or equal to 1; and performing first fusion processing on the first feature of the i-th frame short-exposure video image and the denoised video image corresponding to the i-th frame short-exposure video image to obtain an enhanced video image corresponding to the i-th frame short-exposure video image.
In a second aspect, an embodiment of the present disclosure provides a network training method, including: for each training sample, using any one of the above video image enhancement methods to obtain an enhanced video image corresponding to the training sample, where the training sample includes (2N+1) frames of short-exposure video images; determining a total objective function value according to the enhanced video images and long-exposure video images corresponding to the training samples, and updating training parameter values according to the total objective function value; and, for each training sample, obtaining the enhanced video image corresponding to the training sample with any one of the above video image enhancement methods according to the updated training parameter values, until the total objective function value converges.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory storing at least one program which, when executed by the at least one processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
The video image enhancement method provided by the embodiments of the present disclosure performs feature extraction and denoising on the short-exposure video image through two different branches and then fuses the results of the two branches to obtain the enhanced video image; while denoising is performed, distortion is avoided through feature extraction, improving the enhancement effect.
The network training method provided by the embodiments of the present disclosure trains the above video image enhancement method to obtain optimal training parameter values, further improving the enhancement effect.
Brief Description of the Drawings
FIG. 1 is a flowchart of a video image enhancement method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of one realizable structure of the first network provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of another realizable structure of the first network provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of one realizable structure of the second network provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of one realizable structure of the codec sub-module provided by an embodiment of the present disclosure;
FIG. 6 is a flowchart of a network training method provided by another embodiment of the present disclosure.
Detailed Description
To enable those skilled in the art to better understand the technical solutions of the present disclosure, the video image enhancement method, network training method, electronic device, and computer-readable storage medium provided by the present disclosure are described in detail below with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but they may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
Where no conflict arises, the embodiments of the present disclosure and the features in the embodiments may be combined with one another.
As used herein, the term "and/or" includes any and all combinations of at least one of the associated listed items.
The terminology used herein is for describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the terms "comprise" and/or "made of", when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of at least one other feature, integer, step, operation, element, component, and/or group thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art and the present disclosure, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
图1为本公开一个实施例提供的视频图像增强方法的流程图。Fig. 1 is a flowchart of a video image enhancement method provided by an embodiment of the present disclosure.
在第一方面中,参照图1,本公开的一个实施例提供一种视频图像增强方法,包括:In the first aspect, referring to FIG. 1 , an embodiment of the present disclosure provides a video image enhancement method, including:
步骤100、采用第一网络提取第i帧短曝光视频图像的第一特征;其中,i为依次取自0,1,2,3,……的变量。 Step 100, using the first network to extract the first feature of the i-th frame of the short-exposure video image; wherein, i is a variable sequentially selected from 0, 1, 2, 3, . . . .
在一些示例性实施例中,第i帧短曝光视频图像为从CMOS(Complementary Metal Oxide Semiconductor)传感器获得的Bayer格式的视频图像序列中的视频图像,每一帧短曝光视频图像都对应有一帧长曝光图像。In some exemplary embodiments, the i-th frame of short-exposure video image is a video image in a sequence of video images in Bayer format obtained from a CMOS (Complementary Metal Oxide Semiconductor) sensor, and each frame of short-exposure video image corresponds to a frame length Expose the image.
在一些示例性实施例中,如图2所示,第一网络可以包括:第一子网络、第二子网络和融合子模块。其中,融合子模块被配置成实现第二融合处理。In some exemplary embodiments, as shown in FIG. 2 , the first network may include: a first subnetwork, a second subnetwork, and a fusion submodule. Wherein, the fusion sub-module is configured to implement the second fusion process.
在另一些示例性实施例中,如图3所示,为了减少计算量,除了第一子网络、第二子网络和融合子模块以外,第一网络还可以包括:下采样子模块和上采样子模块。其中,下采样子模块被配置成实现下采样处理,上采样子模块被配置成实现上采样处理。In some other exemplary embodiments, as shown in FIG. 3 , in order to reduce the amount of calculation, in addition to the first sub-network, the second sub-network and the fusion sub-module, the first network may further include: a down-sampling sub-module and an up-sampling sub-module submodule. Wherein, the down-sampling sub-module is configured to realize down-sampling processing, and the up-sampling sub-module is configured to realize up-sampling processing.
在一些示例性实施例中,在第一网络包括:第一子网络、第二子网络、融合子模块的情况下,采用第一网络提取第i帧短曝光视频图像的第一特征包括:采用第一子网络提取第i帧短曝光视频图像的第二特征,采用第二子网络提取第i帧短曝光视频图像的第三特征;将第i帧短曝光视频图像的第二特征和第i帧短曝光视频图像的第三特征进行第二融合处理得到第i帧短曝光视频图像的第一特征。In some exemplary embodiments, when the first network includes: a first sub-network, a second sub-network, and a fusion sub-module, using the first network to extract the first feature of the i-th frame of short-exposure video image includes: using The first sub-network extracts the second feature of the i-th frame short-exposure video image, and uses the second sub-network to extract the third feature of the i-th frame short-exposure video image; the second feature of the i-th frame short-exposure video image and the ith The second fusion process is performed on the third feature of the frame of short-exposure video image to obtain the first feature of the i-th frame of short-exposure video image.
在一些示例性实施例中,第二特征可以是局部特征,第三特征可以是全局特征。In some exemplary embodiments, the second feature may be a local feature, and the third feature may be a global feature.
在一些示例性实施例中,局部特征是指短曝光视频图像的局部信息,如局部极大值、局部极小值等,可以体现视频图像的局部细节的信息。In some exemplary embodiments, the local feature refers to local information of the short-exposure video image, such as local maximum value, local minimum value, etc., which can reflect information of local details of the video image.
在一些示例性实施例中,全局特征是指短曝光视频图像的全局信息,如亮度直方图分布信息、颜色信息等中的至少一个可以进行亮度调节和颜色校正的信息。In some exemplary embodiments, the global feature refers to global information of the short-exposure video image, such as brightness histogram distribution information, color information, etc., at least one of which can perform brightness adjustment and color correction.
在一些示例性实施例中,在第一网络除了第一子网络、第二子网络和融合子模块以外还包括:下采样子模块和上采样子模块的情况下,采用第一网络提取第i帧短曝光视频图像的第一特征包括:对第i帧短曝光视频图像进行下采样处理得到第i帧短曝光视频图像对应的下采样的视频图像;采用第一子网络提取第i帧短曝光视频图像对应的下采样的视频图像的第四特征;采用第二子网络提取第i帧短曝光视频图像对应的下采样的视频图像的第五特征;将第i帧短曝光视频图像对应的下采样的视频图像的第四特征和第五特征进行第二融合处理得到第i帧短曝光视频图像对应的下采样的视频图像的第六特征;对第i帧短曝光视频图像对应的下采样的视频图像的第六特征进行上采样处理得到第i帧短曝光视频图像的第一特征。In some exemplary embodiments, when the first network includes, besides the first sub-network, the second sub-network and the fusion sub-module: a down-sampling sub-module and an up-sampling sub-module, the first network is used to extract the i-th The first feature of the frame short-exposure video image includes: performing down-sampling processing on the i-th frame short-exposure video image to obtain a down-sampled video image corresponding to the i-th frame short-exposure video image; using the first sub-network to extract the i-th frame short-exposure video image The fourth feature of the down-sampled video image corresponding to the video image; the second sub-network is used to extract the fifth feature of the down-sampled video image corresponding to the i-th frame short-exposure video image; The fourth feature and the fifth feature of the sampled video image are subjected to the second fusion process to obtain the sixth feature of the down-sampled video image corresponding to the i-th frame short-exposure video image; The sixth feature of the video image is up-sampled to obtain the first feature of the short-exposure video image of the i-th frame.
在一些示例性实施例中,第四特征可以是局部特征,第五特征可以是全局特征。In some exemplary embodiments, the fourth feature may be a local feature, and the fifth feature may be a global feature.
在一些示例性实施例中,第一子网络包括:N1个第一卷积层;其中,N1为大于或等于3的整数,第一卷积层被配置成实现第一卷积操作。例如,第一子网络可以是L-NET。In some exemplary embodiments, the first subnetwork includes: N1 first convolutional layers; wherein, N1 is an integer greater than or equal to 3, and the first convolutional layer is configured to implement a first convolutional operation. For example, the first sub-network may be L-NET.
在一些示例性实施例中,第二子网络包括:N2个第二卷积层和3个全连接层(FC,Fully Connected layers);其中,N2为大于或等于3的整数,第二卷积层被配置成实现第二卷积操作,FC被配置成实现FC操作。例如,第二子网络可以是G-NET。In some exemplary embodiments, the second sub-network includes: N2 second convolutional layers and 3 fully connected layers (FC, Fully Connected layers); wherein, N2 is an integer greater than or equal to 3, and the second convolutional layer The layer is configured to implement the second convolution operation, and the FC is configured to implement the FC operation. For example, the second sub-network can be G-NET.
在一些示例性实施例中,第二子网络中的3个FC位于N2个第二卷积层后面。In some exemplary embodiments, the 3 FCs in the second sub-network are located after the N2 second convolutional layers.
在一些示例性实施例中,FC中的每个神经元与前一层的所有神经元进行全连接。In some exemplary embodiments, each neuron in the FC is fully connected to all neurons in the previous layer.
在一些示例性实施例中,FC可以采用覆盖整个短曝光视频图像的感受野学习全局特征。In some exemplary embodiments, FC can use the receptive field covering the entire short-exposure video image to learn global features.
在一些示例性实施例中,下采样子模块可以采用任何可以实现下采样功能的操作或网络实现。例如步长为2的卷积操作、空间维度到深度维度的转换(S2D,Space to Depth)操作、池化操作等中的任意一个操作。又如,下采样子模块包括:N3个第三卷积层和N4个第一池化层;其中,N3,N4为大于或等于1的整数,第三卷积层被配置成实现第三卷积操作,第一池化层被配置成实现第一池化操作。In some exemplary embodiments, the down-sampling sub-module may be implemented by any operation or network that can realize the down-sampling function. For example, any one of the convolution operation with a step size of 2, the conversion (S2D, Space to Depth) operation from the space dimension to the depth dimension, and the pooling operation. As another example, the downsampling submodule includes: N3 third convolutional layers and N4 first pooling layers; wherein, N3 and N4 are integers greater than or equal to 1, and the third convolutional layer is configured to implement the third convolutional layer product operation, the first pooling layer is configured to implement the first pooling operation.
在一些示例性实施例中,上采样子模块可以采用任何可以实现上采样功能的操作或网络实现。例如反卷积操作,深度维度到空间维度的转换(D2S,Depth to Space)操作,插值操作等中的任意一个操作。In some exemplary embodiments, the upsampling submodule can be implemented by any operation or network that can realize the upsampling function. For example, any one of deconvolution operations, depth-to-space conversion (D2S, Depth to Space) operations, interpolation operations, etc.
在一些示例性实施例中,下采样处理的下采样倍数可以根据实际需要自行设定。In some exemplary embodiments, the downsampling multiple of the downsampling process can be set according to actual needs.
In some exemplary embodiments, the up-sampling process is the inverse of the down-sampling process and restores the original resolution. For example, assuming the i-th frame of the short-exposure video image has size H×W×1, where H is the height, W is the width and 1 is the number of channels, the frame is down-sampled to a 256×256×1 video image; the first sub-network then outputs a fourth feature of size 256×256×1 and the second sub-network outputs a fifth feature of size 256×256×1; the fusion sub-module performs the second fusion process on the fourth feature and the fifth feature; and the result is finally sent to the up-sampling sub-module, which restores the first feature to size H×W×1.
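A minimal sketch of this down-sample/extract/fuse/up-sample pipeline follows, assuming placeholder modules l_net, g_net and fuse for the first sub-network, the second sub-network and the fusion sub-module, and using bilinear interpolation as one admissible down-/up-sampling choice rather than the only one.

```python
import torch.nn.functional as F

def first_network(x, l_net, g_net, fuse, size=256):
    """Sketch of the first network with down-/up-sampling sub-modules.
    x: (B, C, H, W) short-exposure frame i."""
    h, w = x.shape[2], x.shape[3]
    xd = F.interpolate(x, size=(size, size), mode='bilinear',
                       align_corners=False)      # down-sample to 256x256
    f4 = l_net(xd)                               # fourth feature (local)
    f5 = g_net(xd)                               # fifth feature (global)
    f6 = fuse(f4, f5)                            # second fusion -> sixth feature
    # up-sample back to the original H x W -> first feature
    return F.interpolate(f6, size=(h, w), mode='bilinear', align_corners=False)
```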
In some exemplary embodiments, the fusion sub-module may be implemented by any operation or network that realizes a fusion function, for example a pixel-wise addition operation, a concatenation operation, or a convolution operation.
Step 101: use the second network to denoise the (i-N)-th to (i+N)-th frames of the short-exposure video images, obtaining a denoised video image corresponding to the i-th frame of the short-exposure video image, where N is a constant greater than or equal to 1.
In some exemplary embodiments, when i is less than N, the (i-N)-th to (-1)-th frames of the short-exposure video images are taken to be the same as the 0-th frame; when i is greater than the difference between M and N, the (M+1)-th to (i+N)-th frames are taken to be the same as the last frame, where M is the total number of frames of the short-exposure video images. In one embodiment, when i = 0, the (-N)-th to (-1)-th frames are the same as the 0-th frame; assuming the total number of frames is 50, then when i = 50 the 51st to (50+N)-th frames are the same as the 50th frame.
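This boundary handling amounts to clamping frame indices, as the following hypothetical helper illustrates; the name window_indices and its arguments are introduced here only for illustration, with m treated as the index of the last available frame.

```python
def window_indices(i, n, m):
    """Indices of the 2N+1 frames fed to the second network for frame i.
    Out-of-range indices are clamped: frames before frame 0 reuse frame 0,
    and frames past the last frame reuse the last frame."""
    return [min(max(j, 0), m) for j in range(i - n, i + n + 1)]

# e.g. with N = 2 and a last frame index of 50:
assert window_indices(0, 2, 50) == [0, 0, 0, 1, 2]
assert window_indices(50, 2, 50) == [48, 49, 50, 50, 50]
```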
In some exemplary embodiments, the value of N can be set according to actual needs; the larger N is, the higher the complexity of the alignment operation.
In some exemplary embodiments, as shown in FIG. 4, the second network includes an alignment sub-module, a codec sub-module and an output sub-module, where the alignment sub-module is configured to implement an alignment operation, the codec sub-module is configured to implement encoding and decoding processing, and the output sub-module is configured to implement output processing.
The embodiments of the present disclosure do not limit the specific implementation of the alignment sub-module. For example, the alignment sub-module may include N5 fourth convolutional layers and N6 first residual connection layers, where N5 and N6 are integers greater than or equal to 1, each fourth convolutional layer is configured to implement a fourth convolution operation, and each first residual connection layer is configured to implement a first residual connection operation. As another example, the alignment sub-module may be implemented by any other operation or network capable of aligning multiple frames of video images, such as an optical flow network, a deformable convolution network, or a motion estimation and motion compensation (MEMC) network.
In some exemplary embodiments, using the second network to denoise the (i-N)-th to (i+N)-th frames of the short-exposure video images to obtain the denoised video image corresponding to the i-th frame includes: performing an alignment operation on the (i-N)-th to (i+N)-th frames to obtain an aligned video image; performing encoding and decoding processing on the aligned video image to obtain a codec-processed video image; and performing output processing on the codec-processed video image to obtain the denoised video image corresponding to the i-th frame of the short-exposure video image.
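A non-limiting top-level sketch of this three-stage denoising branch follows; the three sub-modules are passed in as interchangeable components, since any of the admissible implementations above may be substituted.

```python
import torch.nn as nn

class SecondNetwork(nn.Module):
    """Align the 2N+1 input frames, run the encoder-decoder,
    then produce the denoised image for frame i."""
    def __init__(self, align, codec, output):
        super().__init__()
        self.align, self.codec, self.output = align, codec, output

    def forward(self, frames):            # frames: (B, 2N+1, H, W)
        aligned = self.align(frames)      # alignment operation
        decoded = self.codec(aligned)     # encoding/decoding processing
        return self.output(decoded)      # output processing -> denoised image
```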
The embodiments of the present disclosure do not limit the specific implementation of the codec sub-module. In one embodiment, as shown in FIG. 5, the codec sub-module may include: a first encoding sub-module, a second encoding sub-module, a third encoding sub-module, a fourth encoding sub-module, a second output sub-module, a first F sub-module, a second F sub-module, a third F sub-module, a fourth F sub-module, a first decoding sub-module, a second decoding sub-module, a third decoding sub-module and a fourth decoding sub-module. As another example, the codec sub-module may be implemented by any encoder-decoder network capable of image denoising, such as a U-shaped network (UNET) or a convolutional blind denoising network (CBDNet).
Here, the first to fourth encoding sub-modules are configured to implement the first to fourth encoding processing respectively, the second output sub-module is configured to implement second output processing, the first to fourth F sub-modules are configured to implement the first to fourth F operations respectively, and the first to fourth decoding sub-modules are configured to implement the first to fourth decoding processing respectively.
The first encoding sub-module includes N7 fifth convolutional layers, N8 second residual connection layers and one second pooling layer; the second encoding sub-module includes N9 sixth convolutional layers, N10 third residual connection layers and one third pooling layer; the third encoding sub-module includes N11 seventh convolutional layers, N12 fourth residual connection layers and one fourth pooling layer; and the fourth encoding sub-module includes N13 eighth convolutional layers, N14 fifth residual connection layers and one fifth pooling layer. N7 through N14 are integers greater than or equal to 1; the fifth to eighth convolutional layers are configured to implement the fifth to eighth convolution operations respectively, the second to fifth residual connection layers are configured to implement the second to fifth residual connection operations respectively, and the second to fifth pooling layers are configured to implement the second to fifth pooling operations respectively.
The first decoding sub-module includes N15 ninth convolutional layers, N16 sixth residual connection layers and one first deconvolution layer; the second decoding sub-module includes N17 tenth convolutional layers, N18 seventh residual connection layers and one second deconvolution layer; the third decoding sub-module includes N19 eleventh convolutional layers, N20 eighth residual connection layers and one third deconvolution layer; and the fourth decoding sub-module includes N21 twelfth convolutional layers, N22 ninth residual connection layers and one fourth deconvolution layer. N15 through N22 are integers greater than or equal to 1; the ninth to twelfth convolutional layers are configured to implement the ninth to twelfth convolution operations respectively, the sixth to ninth residual connection layers are configured to implement the sixth to ninth residual connection operations respectively, and the first to fourth deconvolution layers are configured to implement the first to fourth deconvolution operations respectively.
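For illustration, the helpers below sketch one encoding sub-module and one decoding sub-module with the stated ingredients. Taking the layer counts N7 through N22 as 1 and the specific channel widths are simplifying assumptions of this sketch.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """A residual connection layer as used throughout this codec sketch."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def enc_block(in_ch, out_ch):
    """One encoding sub-module: convolution + residual connection +
    pooling, which halves the spatial resolution."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
                         ResBlock(out_ch), nn.MaxPool2d(2))

def dec_block(in_ch, out_ch):
    """One decoding sub-module: convolution + residual connection +
    transposed convolution, which doubles the spatial resolution."""
    return nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                         ResBlock(in_ch),
                         nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2))
```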
The first F sub-module, the second F sub-module, the third F sub-module and the fourth F sub-module may be implemented by any operation that can interconnect feature maps, for example any one of an addition (add) operation, a concatenation (concat) operation, or a convolution operation.
In some exemplary embodiments, performing encoding and decoding processing on the aligned video image to obtain the codec-processed video image includes: performing the first encoding processing on the aligned video image to obtain a first-encoded video image, where the first encoding processing includes the fifth convolution operation, the second residual connection operation and the second pooling operation; performing the second encoding processing on the first-encoded video image to obtain a second-encoded video image, where the second encoding processing includes the sixth convolution operation, the third residual connection operation and the third pooling operation; performing the third encoding processing on the second-encoded video image to obtain a third-encoded video image, where the third encoding processing includes the seventh convolution operation, the fourth residual connection operation and the fourth pooling operation; performing the fourth encoding processing on the third-encoded video image to obtain a fourth-encoded video image, where the fourth encoding processing includes the eighth convolution operation, the fifth residual connection operation and the fifth pooling operation; performing the first decoding processing on the fourth-encoded video image to obtain a first-decoded video image; performing the first F operation on the video image before the fifth pooling operation and the first-decoded video image, and performing the second decoding processing on the result to obtain a second-decoded video image; performing the second F operation on the video image before the fourth pooling operation and the second-decoded video image, and performing the third decoding processing on the result to obtain a third-decoded video image; performing the third F operation on the video image before the third pooling operation and the third-decoded video image, and performing the fourth decoding processing on the result to obtain a fourth-decoded video image; and performing the fourth F operation on the video image before the second pooling operation and the fourth-decoded video image to obtain the codec-processed video image.
In some exemplary embodiments, the resolution of the first-encoded video image is half that of the aligned video image; the resolution of the second-encoded video image is half that of the first-encoded video image; the resolution of the third-encoded video image is half that of the second-encoded video image; and the resolution of the fourth-encoded video image is half that of the third-encoded video image.
In some exemplary embodiments, the resolution of the first-decoded video image is twice that of the fourth-encoded video image; the resolution of the second-decoded video image is twice that of the first-decoded video image; the resolution of the third-decoded video image is twice that of the second-decoded video image; and the resolution of the fourth-decoded video image is twice that of the third-decoded video image.
For example, if the aligned video image has size H×W×(2N+1), where H is the height, W is the width and 2N+1 is the number of channels, then the video image before the second pooling operation has size H×W×32 and the first-encoded video image has size H/2×W/2×128; the video image before the third pooling operation has size H/2×W/2×64 and the second-encoded video image has size H/4×W/4×256; the video image before the fourth pooling operation has size H/4×W/4×128 and the third-encoded video image has size H/8×W/8×512; and the video image before the fifth pooling operation has size H/8×W/8×256 and the fourth-encoded video image has size H/16×W/16×1024.
The video image before the first deconvolution operation has size H/16×W/16×512, the first-decoded video image has size H/8×W/8×256, and the video image after the first F operation has size H/8×W/8×512; the video image before the second deconvolution operation has size H/8×W/8×256, the second-decoded video image has size H/4×W/4×128, and the video image after the second F operation has size H/4×W/4×256; the video image before the third deconvolution operation has size H/4×W/4×128, the third-decoded video image has size H/2×W/2×64, and the video image after the third F operation has size H/2×W/2×128; the video image before the fourth deconvolution operation has size H/2×W/2×64, the fourth-decoded video image has size H×W×32, and the video image after the fourth F operation has size H×W×64.
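Putting the pieces together, the following sketch wires four such encoders and decoders into a U shape, reusing the enc_block/dec_block helpers above and choosing channel-wise concatenation as the F operation (addition or convolution would do equally well). The skip connections here reuse the encoder outputs at matching resolutions as stand-ins for the "image before the k-th pooling operation", and the channel widths are illustrative rather than those of the worked example above.

```python
import torch
import torch.nn as nn

class CodecSketch(nn.Module):
    """U-shaped wiring of four encoders, four decoders and four F
    sub-modules (concatenation), as one possible codec sub-module."""
    def __init__(self, in_ch, w=32):
        super().__init__()
        self.e1 = enc_block(in_ch, w)          # first encoding   -> H/2
        self.e2 = enc_block(w, 2 * w)          # second encoding  -> H/4
        self.e3 = enc_block(2 * w, 4 * w)      # third encoding   -> H/8
        self.e4 = enc_block(4 * w, 8 * w)      # fourth encoding  -> H/16
        self.d1 = dec_block(8 * w, 4 * w)      # first decoding   -> H/8
        self.d2 = dec_block(8 * w, 2 * w)      # second decoding  -> H/4
        self.d3 = dec_block(4 * w, w)          # third decoding   -> H/2
        self.d4 = dec_block(2 * w, w)          # fourth decoding  -> H

    def forward(self, x):
        s1 = self.e1(x); s2 = self.e2(s1); s3 = self.e3(s2); s4 = self.e4(s3)
        y = self.d1(s4)
        y = self.d2(torch.cat([s3, y], dim=1))   # first F operation
        y = self.d3(torch.cat([s2, y], dim=1))   # second F operation
        y = self.d4(torch.cat([s1, y], dim=1))   # third F operation
        return torch.cat([x, y], dim=1)          # fourth F operation
```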
In some exemplary embodiments, the output sub-module includes N23 thirteenth convolutional layers, where N23 is an integer greater than or equal to 3 and each thirteenth convolutional layer is configured to implement a thirteenth convolution operation.
Step 102: perform a first fusion process on the first feature of the i-th frame of the short-exposure video image and the denoised video image corresponding to the i-th frame, obtaining an enhanced video image corresponding to the i-th frame of the short-exposure video image.
In some exemplary embodiments, the first fusion process may multiply the first feature of the i-th frame of the short-exposure video image with the pixel values at the same positions in the corresponding denoised video image, obtaining the enhanced video image corresponding to the i-th frame of the short-exposure video image.
In some exemplary embodiments, the first fusion process may be implemented by any operation or network that realizes a fusion function, for example a pixel-wise addition operation, a concatenation operation, or a convolution operation.
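As a minimal illustration of the multiplication variant described above (the function name is hypothetical):

```python
def first_fusion(first_feature, denoised):
    """Pixel-wise product of the two branch outputs; both tensors share
    the same H x W shape, so this is a plain element-wise multiply."""
    return first_feature * denoised
```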
In some exemplary embodiments, the enhanced video image corresponding to the i-th frame of the short-exposure video image is sent to a subsequent image signal processing (ISP) module or other data processing module for further processing.
The video image enhancement method provided by the embodiments of the present disclosure performs feature extraction and denoising on the short-exposure video image through two separate branches and then fuses the results of the two branches to obtain the enhanced video image; while accounting for denoising, it avoids distortion through feature extraction and thereby improves the enhancement effect.
FIG. 6 is a flowchart of a network training method provided by another embodiment of the present disclosure.
In a second aspect, referring to FIG. 6, another embodiment of the present disclosure provides a network training method, including the following steps.
Step 600: for each training sample, use any one of the above video image enhancement methods to obtain an enhanced video image corresponding to the training sample, where the training sample includes (2N+1) frames of short-exposure video images.
In the embodiments of the present disclosure, each frame of short-exposure video image has a corresponding frame of long-exposure video image.
In the embodiments of the present disclosure, the i-th training sample includes the (2N+1) frames of short-exposure video images centered on the i-th frame of the short-exposure video image.
Step 601: determine a total objective function value according to the enhanced video images and the long-exposure video images corresponding to the training samples, and update the training parameter values according to the total objective function value.
In some exemplary embodiments, determining the total objective function value according to the enhanced video images and the long-exposure video images corresponding to the training samples includes: for each training sample, determining an objective function value corresponding to that training sample according to its enhanced video image and long-exposure video image; and determining the total objective function value according to the objective function values corresponding to the training samples.
In some exemplary embodiments, when calculating the objective function value corresponding to each training sample, the value should be calculated from the enhanced video image corresponding to that training sample and the long-exposure video image corresponding to that same training sample.
In some exemplary embodiments, determining the objective function value corresponding to a training sample according to its enhanced video image and long-exposure video image includes determining it according to the formula L = αL_enh + (1-α)L_color, where L is the objective function value corresponding to the training sample, α is a weight coefficient, L_enh is the L1 norm of the absolute difference between the enhanced video image and the long-exposure video image corresponding to the training sample, and L_color is a color consistency loss function.
In some exemplary embodiments,

    L_enh = Σ_{i=1..m} Σ_{j=1..n} |I_out(i,j) - I_GT(i,j)|

where I_out(i,j) is the pixel value in row i and column j of the enhanced video image corresponding to the training sample, I_GT(i,j) is the pixel value in row i and column j of the corresponding long-exposure video image, m is the total number of rows, and n is the total number of columns.
In some exemplary embodiments, L_color is defined by the color consistency loss formula that appears only as an image in the original filing (formula image PCTCN2022081245-appb-000002).
In some exemplary embodiments, the total objective function value is the average of the objective function values corresponding to all training samples.
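A sketch of this objective follows, assuming the un-normalized L1 form of L_enh given above; since L_color appears only as a formula image in the original filing, it is left here as a caller-supplied function, and alpha = 0.9 is an arbitrary placeholder rather than a value taken from the disclosure.

```python
import torch

def enh_loss(i_out, i_gt):
    # L_enh: L1 norm of the absolute difference between the enhanced
    # image I_out and its long-exposure reference I_GT
    return (i_out - i_gt).abs().sum()

def sample_objective(i_out, i_gt, color_loss, alpha=0.9):
    # L = alpha * L_enh + (1 - alpha) * L_color; color_loss stands in
    # for L_color, whose exact formula is not reproduced in this text
    return alpha * enh_loss(i_out, i_gt) + (1 - alpha) * color_loss(i_out, i_gt)

def total_objective(outs, refs, color_loss, alpha=0.9):
    # total objective function value: average over all training samples
    losses = [sample_objective(o, g, color_loss, alpha) for o, g in zip(outs, refs)]
    return torch.stack(losses).mean()
```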
In some exemplary embodiments, the training parameter values are updated according to the total objective function value; alternatively, the training parameter values are updated according to the total objective function value and a preset index.
The embodiments of the present disclosure do not limit the preset index. For example, the preset index may be at least one of peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and the like.
The embodiments of the present disclosure do not limit the training parameters. For example, the training parameters may be the parameters to be trained in the first sub-network, the second sub-network and the fusion sub-module.
Step 602: for each training sample, according to the updated training parameter values, use any one of the above video image enhancement methods to obtain the enhanced video image corresponding to the training sample, until the total objective function value converges.
In some exemplary embodiments, according to the updated training parameter values, any one of the above video image enhancement methods is used to obtain the enhanced video image corresponding to each training sample until the total objective function value converges; alternatively, until the total objective function value converges and the preset index reaches its best value.
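For illustration, a hypothetical training loop tying the above together; model, samples and refs are assumed stand-ins for the enhancement pipeline (bundling the first network, the second network and the fusion sub-module), the (2N+1)-frame training samples and their long-exposure references, and total_objective is the sketch from above.

```python
import torch

def train(model, samples, refs, color_loss, alpha=0.9, lr=1e-4, tol=1e-6):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev = float('inf')
    while True:
        outs = [model(s) for s in samples]   # enhanced image per sample
        total = total_objective(outs, refs, color_loss, alpha)
        opt.zero_grad()
        total.backward()                     # update the training parameter values
        opt.step()
        if abs(prev - total.item()) < tol:   # stop once the total objective
            return model                     # function value has converged
        prev = total.item()
```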
The network training method provided by the embodiments of the present disclosure trains the above video image enhancement method to obtain optimal training parameter values, further improving the enhancement effect.
In a third aspect, another embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory storing at least one program which, when executed by the at least one processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
Here, the processor is a device with data processing capability, including but not limited to a central processing unit (CPU); the memory is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH).
In some embodiments, the processor and the memory are connected to each other through a bus and are in turn connected to the other components of the computing device.
In a fourth aspect, another embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the above video image enhancement methods or any one of the above network training methods.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and devices, may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless expressly stated otherwise, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (13)

  1. A video image enhancement method, comprising:
    using a first network to extract a first feature of an i-th frame of a short-exposure video image, wherein i is a variable that takes the values 0, 1, 2, 3, ... in sequence;
    using a second network to denoise the (i-N)-th to (i+N)-th frames of the short-exposure video images to obtain a denoised video image corresponding to the i-th frame of the short-exposure video image, wherein N is a constant greater than or equal to 1; and
    performing a first fusion process on the first feature of the i-th frame of the short-exposure video image and the denoised video image corresponding to the i-th frame to obtain an enhanced video image corresponding to the i-th frame of the short-exposure video image.
  2. The video image enhancement method according to claim 1, wherein using the first network to extract the first feature of the i-th frame of the short-exposure video image comprises:
    using a first sub-network to extract a second feature of the i-th frame of the short-exposure video image, and using a second sub-network to extract a third feature of the i-th frame of the short-exposure video image; and
    performing a second fusion process on the second feature and the third feature of the i-th frame of the short-exposure video image to obtain the first feature of the i-th frame of the short-exposure video image.
  3. The video image enhancement method according to claim 1, wherein using the first network to extract the first feature of the i-th frame of the short-exposure video image comprises:
    down-sampling the i-th frame of the short-exposure video image to obtain a down-sampled video image corresponding to the i-th frame;
    using the first sub-network to extract a fourth feature of the down-sampled video image corresponding to the i-th frame;
    using the second sub-network to extract a fifth feature of the down-sampled video image corresponding to the i-th frame;
    performing a second fusion process on the fourth feature and the fifth feature of the down-sampled video image to obtain a sixth feature of the down-sampled video image corresponding to the i-th frame; and
    up-sampling the sixth feature of the down-sampled video image to obtain the first feature of the i-th frame of the short-exposure video image.
  4. The video image enhancement method according to any one of claims 2-3, wherein the first sub-network comprises N1 first convolutional layers, wherein N1 is an integer greater than or equal to 3.
  5. The video image enhancement method according to any one of claims 2-3, wherein the second sub-network comprises N2 second convolutional layers and 3 fully connected (FC) layers, wherein N2 is an integer greater than or equal to 3.
  6. The video image enhancement method according to claim 1, wherein using the second network to denoise the (i-N)-th to (i+N)-th frames of the short-exposure video images to obtain the denoised video image corresponding to the i-th frame comprises:
    performing an alignment operation on the (i-N)-th to (i+N)-th frames of the short-exposure video images to obtain an aligned video image;
    performing encoding and decoding processing on the aligned video image to obtain a codec-processed video image; and
    performing output processing on the codec-processed video image to obtain the denoised video image corresponding to the i-th frame of the short-exposure video image.
  7. The video image enhancement method according to claim 6, wherein performing encoding and decoding processing on the aligned video image to obtain the codec-processed video image comprises:
    performing first encoding processing on the aligned video image to obtain a first-encoded video image, wherein the first encoding processing comprises a fifth convolution operation, a second residual connection operation and a second pooling operation;
    performing second encoding processing on the first-encoded video image to obtain a second-encoded video image, wherein the second encoding processing comprises a sixth convolution operation, a third residual connection operation and a third pooling operation;
    performing third encoding processing on the second-encoded video image to obtain a third-encoded video image, wherein the third encoding processing comprises a seventh convolution operation, a fourth residual connection operation and a fourth pooling operation;
    performing fourth encoding processing on the third-encoded video image to obtain a fourth-encoded video image, wherein the fourth encoding processing comprises an eighth convolution operation, a fifth residual connection operation and a fifth pooling operation;
    performing first decoding processing on the fourth-encoded video image to obtain a first-decoded video image;
    performing a first F operation on the video image before the fifth pooling operation and the first-decoded video image to obtain a video image after the first F operation, and performing second decoding processing on the video image after the first F operation to obtain a second-decoded video image;
    performing a second F operation on the video image before the fourth pooling operation and the second-decoded video image to obtain a video image after the second F operation, and performing third decoding processing on the video image after the second F operation to obtain a third-decoded video image;
    performing a third F operation on the video image before the third pooling operation and the third-decoded video image to obtain a video image after the third F operation, and performing fourth decoding processing on the video image after the third F operation to obtain a fourth-decoded video image; and
    performing a fourth F operation on the video image before the second pooling operation and the fourth-decoded video image to obtain the codec-processed video image.
  8. A network training method, comprising:
    for each training sample, using the video image enhancement method according to any one of claims 1-7 to obtain an enhanced video image corresponding to the training sample, wherein the training sample comprises (2N+1) frames of short-exposure video images;
    determining a total objective function value according to the enhanced video image and a long-exposure video image corresponding to the training sample, and updating training parameter values according to the total objective function value; and
    for each training sample, according to the updated training parameter values, using the video image enhancement method according to any one of claims 1-7 to obtain the enhanced video image corresponding to the training sample, until the total objective function value converges.
  9. The network training method according to claim 8, wherein:
    updating the training parameter values according to the total objective function value comprises updating the training parameter values according to the total objective function value and a preset index; and
    using the video image enhancement method according to any one of claims 1-7 to obtain the enhanced video image corresponding to the training sample according to the updated training parameter values until the total objective function value converges comprises: according to the updated training parameter values, using the video image enhancement method according to any one of claims 1-7 to obtain the enhanced video image corresponding to the training sample until the total objective function value converges and the preset index reaches its best value.
  10. The network training method according to any one of claims 8-9, wherein determining the total objective function value according to the enhanced video image and the long-exposure video image corresponding to the training sample comprises:
    for each training sample, determining an objective function value corresponding to the training sample according to the enhanced video image and the long-exposure video image corresponding to the training sample; and
    determining the total objective function value according to the objective function values corresponding to the training samples.
  11. The network training method according to claim 10, wherein determining the objective function value corresponding to the training sample according to the enhanced video image and the long-exposure video image corresponding to the training sample comprises:
    determining the objective function value corresponding to the training sample according to the formula L = αL_enh + (1-α)L_color,
    wherein L is the objective function value corresponding to the training sample, α is a weight coefficient, L_enh is the L1 norm of the absolute difference between the enhanced video image and the long-exposure video image corresponding to the training sample, and L_color is a color consistency loss function.
  12. An electronic device, comprising:
    at least one processor; and
    a memory storing at least one program which, when executed by the at least one processor, implements the video image enhancement method according to any one of claims 1-7 or the network training method according to any one of claims 8-11.
  13. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the video image enhancement method according to any one of claims 1-7 or the network training method according to any one of claims 8-11.
PCT/CN2022/081245 2021-10-09 2022-03-16 Video image augmentation method, network training method, electronic device and storage medium WO2023056730A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111174652.3A CN115965560A (en) 2021-10-09 2021-10-09 Video image enhancement method, network training method, electronic device and storage medium
CN202111174652.3 2021-10-09

Publications (1)

Publication Number Publication Date
WO2023056730A1 (en)

Family

ID=85803141

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081245 WO2023056730A1 (en) 2021-10-09 2022-03-16 Video image augmentation method, network training method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN115965560A (en)
WO (1) WO2023056730A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140016866A1 (en) * 2012-07-10 2014-01-16 Samsung Electronics Co., Ltd. Method and apparatus for processing image
CN103150714A (en) * 2013-03-12 2013-06-12 华东师范大学 Method and device for real-time interactive enhancement of magnetic resonance image
CN104504652A (en) * 2014-10-10 2015-04-08 中国人民解放军理工大学 Image denoising method capable of quickly and effectively retaining edge and directional characteristics
CN111047532A (en) * 2019-12-06 2020-04-21 广东启迪图卫科技股份有限公司 Low-illumination video enhancement method based on 3D convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU YAN, ZHOU MIN-XIONG, XU LING, LIU WEI, YANG GUANG: "An Edge Enhancing Scheme for Non-Local Means Denoised MR Images", CHINESE JOURNAL OF MAGNETIC RESONANCE, vol. 30, no. 2, 5 June 2013 (2013-06-05), pages 183 - 193, XP093055861 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107A (en) * 2023-06-21 2023-07-25 南昌航空大学 Optical flow calculation method, system, equipment and medium
CN116486107B (en) * 2023-06-21 2023-09-05 南昌航空大学 Optical flow calculation method, system, equipment and medium

Also Published As

Publication number Publication date
CN115965560A (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN111311629B (en) Image processing method, image processing device and equipment
CN111028177B (en) Edge-based deep learning image motion blur removing method
CN109101975B (en) Image semantic segmentation method based on full convolution neural network
US20230074180A1 (en) Method and apparatus for generating super night scene image, and electronic device and storage medium
CN112602088B (en) Method, system and computer readable medium for improving quality of low light images
CN112150400B (en) Image enhancement method and device and electronic equipment
CN113781320A (en) Image processing method and device, terminal equipment and storage medium
WO2023056730A1 (en) Video image augmentation method, network training method, electronic device and storage medium
Lu et al. Progressive joint low-light enhancement and noise removal for raw images
CN115035011A (en) Low-illumination image enhancement method for self-adaptive RetinexNet under fusion strategy
WO2021227915A1 (en) Method and apparatus for training image restoration model, and electronic device and computer-readable storage medium
López-Tapia et al. Deep learning approaches to inverse problems in imaging: Past, present and future
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
Yadav et al. Frequency-domain loss function for deep exposure correction of dark images
CN117011194B (en) Low-light image enhancement method based on multi-scale dual-channel attention network
WO2022096104A1 (en) Permutation invariant high dynamic range imaging
WO2023169582A1 (en) Image enhancement method and apparatus, device, and medium
CN116977208A (en) Low-illumination image enhancement method for double-branch fusion
CN110555805B (en) Image processing method, device, equipment and storage medium
CN116208812A (en) Video frame inserting method and system based on stereo event and intensity camera
CN111292251A (en) Image color cast correction method, device and computer storage medium
CN115841523A (en) Double-branch HDR video reconstruction algorithm based on Raw domain
US20220303557A1 (en) Processing of Chroma-Subsampled Video Using Convolutional Neural Networks
CN112767264B (en) Image deblurring method and system based on graph convolution neural network
CN112203023B (en) Billion pixel video generation method and device, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22877757

Country of ref document: EP

Kind code of ref document: A1