CN114641792A - Image processing method, image processing apparatus, and readable storage medium - Google Patents

Image processing method, image processing apparatus, and readable storage medium

Info

Publication number
CN114641792A
CN114641792A (application CN202080002356.2A)
Authority
CN
China
Prior art keywords
network
image
neural network
trained
layer
Prior art date
Legal status
Pending
Application number
CN202080002356.2A
Other languages
Chinese (zh)
Inventor
陈冠男
段然
高艳
Current Assignee
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd
Publication of CN114641792A

Classifications

    • G06T 5/73 Image enhancement or restoration: Deblurring; Sharpening
    • G06T 5/60 Image enhancement or restoration using machine learning, e.g. neural networks
    • G06T 2207/20081 Special algorithmic details: Training; Learning
    • G06T 2207/20084 Special algorithmic details: Artificial neural networks [ANN]

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed are an image processing method, an image processing apparatus, and a computer-readable storage medium. The image processing method includes: processing an input image by using a trained first neural network to obtain a target output image. The trained first neural network is obtained by training a first neural network to be trained with a first training method, and the first training method includes: alternately training a second neural network to be trained and a discrimination network to be trained to obtain a trained second neural network and a trained discrimination network; providing a first sample image to the trained second neural network and the first neural network to be trained respectively, so that the first neural network to be trained outputs a first output image and the trained second neural network outputs a second output image; providing the first output image to the trained discrimination network so that the discrimination network generates a first discrimination result; and adjusting parameters of the first neural network according to a total loss to obtain an updated first neural network.

Description

Image processing method, image processing apparatus, and readable storage medium

Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, an image processing apparatus, and a readable storage medium.
Background
Due to bandwidth limitations, video needs to be compression-encoded before transmission. Compression introduces various types of compression noise, which degrades the viewing experience of the video on the display terminal.
The rise of deep learning has brought a technical breakthrough to video compression restoration. Training and learning on a large amount of video data can substantially improve the restoration effect. However, deep learning models generally have a large number of parameters, and while a deeper network structure gives a better processing effect, it also entails a larger amount of computation, which may fail to meet the real-time processing requirements of video on the display terminal.
Disclosure of Invention
Aspects of the present disclosure provide an image processing method, an image processing apparatus, and a readable storage medium.
The embodiment of the disclosure provides an image processing method, which includes:
processing the input image by using the trained first neural network to obtain a target output image; the definition of the target output image is greater than that of the input image;
the trained first neural network is obtained by training a first training method on a first neural network to be trained, and the first training method comprises the following steps:
alternately training a second neural network to be trained and a discrimination network to be trained to obtain a trained second neural network and a trained discrimination network; wherein the parameters of the trained second neural network are more than the parameters of the first neural network to be trained; the trained second neural network is configured to transform the received image with a first definition into an image with a second definition, the second definition being greater than the first definition; the first neural network to be trained comprises: a plurality of first feature extraction sub-networks and a first output sub-network located after the plurality of first feature extraction sub-networks, the trained second neural network comprising: a plurality of second feature extraction sub-networks and a second output sub-network located after the plurality of second feature extraction sub-networks, the first feature extraction sub-networks corresponding one-to-one to the second feature extraction sub-networks;
respectively providing the first sample image to the trained second neural network and the first neural network to be trained, so that the first neural network to be trained outputs a first output image, and the trained second neural network outputs a second output image;
providing the first output image to the trained discrimination network so that the trained discrimination network generates a first discrimination result based on the first output image;
adjusting parameters of the first neural network according to the total loss to obtain an updated first neural network; wherein the total loss comprises a first loss, a second loss, and a third loss, the first loss being based on a difference of the first output image and the second output image; the second loss is obtained based on a difference between the first discrimination result and a first target result; the third penalty is based on a difference between an output image of at least one of the first sub-networks of feature extraction and an output image of the corresponding second sub-network of feature extraction.
In some embodiments, the number of channels of the output images of the first feature extraction sub-network is less than the number of channels of the output images of the corresponding second feature extraction sub-network;
the first training method further comprises: providing the output images of the second feature extraction sub-networks to a plurality of dimensionality reduction layers in a one-to-one correspondence, so that each dimensionality reduction layer generates an intermediate image; the number of channels of the intermediate image is the same as the number of channels of the output image of the first feature extraction sub-network;
adjusting a parameter of the first neural network according to a total loss function, including: adjusting parameters of both the first neural network and the dimensionality reduction layer; wherein the third penalty is based on a sum of differences between each of the intermediate images and an output image of the corresponding first feature extraction sub-network.
In some embodiments, the total loss further comprises: a fourth loss based on a perceptual loss of the first output image and the second output image.
In some embodiments, the perceptual loss L_perceptual(y1, y2) of the first output image and the second output image is calculated according to the following formula:
L_perceptual(y1, y2) = (1 / (C·H·W)) · ||φ_j(y1) - φ_j(y2)||_2^2
wherein y1 is the first output image, y2 is the second output image, φ_j is a preset network layer in the trained discrimination network, j is the layer number of the preset network layer in the trained discrimination network, C is the number of channels of an output image of the preset network layer, H is the height of the output image of the preset network layer, and W is the width of the output image of the preset network layer.
In some embodiments, the first loss comprises an L1 loss of the first output image and the second output image.
In some embodiments, the second penalty comprises a cross-entropy penalty of the first discrimination result and the first target result.
In some embodiments, the third loss term comprises a sum of L2 losses for the output image of each first feature extraction sub-network and the corresponding intermediate image.
In some embodiments, providing the first output image to the trained discrimination network to cause the trained discrimination network to generate a first discrimination result based on the first output image includes:
and setting the first output image to have a label with a truth value, and providing the first output image with the label with the truth value to the trained discrimination network so that the discrimination network outputs a first discrimination result.
In some embodiments, the training the discriminant network to be trained in the step of alternately training the second neural network to be trained and the discriminant network to be trained includes:
providing the second sample image to the current second neural network so that the current second neural network generates a first sharpness-enhanced image;
and providing the first definition-improved image and the original sample image corresponding to the second sample image to the current discrimination network, and adjusting the current parameters of the discrimination network according to the loss function of the current discrimination network, so that the output of the discrimination network after parameter adjustment can represent the discrimination result of whether the input of the discrimination network is the output image of the second neural network or the original sample image.
In some embodiments, in the step of training the second neural network to be trained and the discriminant network to be trained alternately, training the second neural network to be trained includes:
providing the third sample image to the current second neural network so that the current second neural network generates a second sharpness-enhanced image;
inputting the second definition-improved image into the discrimination network after parameter adjustment, so that the discrimination network after parameter adjustment generates a second discrimination result based on the second definition-improved image;
adjusting the current parameters of the second neural network based on the current loss function of the second neural network to obtain an updated second neural network; a first term in a current loss function of the second neural network is based on a difference between the second sharpness enhancement image and its corresponding original sample image, and a second term in the current loss function of the second neural network is based on a difference between the second discrimination result and a second target result.
In some embodiments, the first term in the current loss function of the second neural network is λ_1 · LossG1, where λ_1 is a preset weight and LossG1 is the L1 loss between the second sharpness-enhanced image and its corresponding original sample image;
the second term in the current loss function of the second neural network is λ_2 · L_D, where λ_2 is a preset weight and L_D is the cross entropy of the second discrimination result and the second target result;
the third term in the current loss function of the second neural network is λ_3 · (1/(C·H·W)) · ||φ_j(y) - φ_j(ŷ)||_2^2, where λ_3 is a preset weight, y is the original sample image corresponding to the second sharpness-enhanced image, ŷ is the second sharpness-enhanced image, φ_j is a preset network layer in a preset optimization network, j is the layer number of the preset network layer in the preset optimization network, C is the number of channels of an output image of the preset network layer, H is the height of the output image of the preset network layer, and W is the width of the output image of the preset network layer; the preset optimization network adopts a VGG-19 network.
In some embodiments, the first neural network to be trained comprises: a plurality of first upsampling layers, a plurality of first downsampling layers, and a plurality of single-layer convolutional layers, each of the first upsampling layers and each of the first downsampling layers being located between two of the single-layer convolutional layers; input data of the ith-last single-layer convolutional layer comprises a superposition of output data of the ith-last first upsampling layer and output data of the ith-positive single-layer convolutional layer; wherein the number of the single-layer convolutional layers is an even number, and i is greater than 0 and less than half of the number of the single-layer convolutional layers;
the trained second neural network comprises: a plurality of second upsampling layers, a plurality of second downsampling layers and a plurality of residual blocks, wherein the plurality of second upsampling layers correspond to the plurality of first upsampling layers one to one, the plurality of second downsampling layers correspond to the plurality of first downsampling layers one to one, and the plurality of residual blocks correspond to the plurality of single-layer convolutional layers one to one; input data of the ith last residual block is superposition of output data of the ith last second upsampling layer and output data of the ith positive residual block;
the first feature extraction sub-network comprises: the first upsampling layer, or the first downsampling layer, or the single convolutional layer; the first output subnetwork comprises the single-layer convolutional layer; the second feature extraction sub-network comprises: the second upsampling layer, or the second downsampling layer, or the residual block; the second output sub-network comprises the residual block.
The embodiment of the present disclosure further provides an image processing apparatus, which includes a memory and a processor, where the memory stores a computer program, and the computer program implements the image processing method when executed by the processor.
The disclosed embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the image processing method described above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a schematic diagram of a convolutional neural network.
Fig. 2 is a schematic diagram of an image processing method provided in an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a first training method provided in an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a network architecture including a first neural network and a second neural network provided in an embodiment of the present disclosure.
Fig. 5 is an exemplary diagram of a residual block.
Fig. 6 is a schematic structural diagram of a trained discriminant network provided in an embodiment of the present disclosure.
Fig. 7 is a flowchart of an alternative implementation manner of step S21 provided in the embodiment of the present disclosure.
Fig. 8 is a diagram illustrating effects before and after image processing by the image processing method according to the embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Video needs to be compression encoded before transmission due to bandwidth limitations. The compressed video can generate various compression noises, and the visual experience of the video on the display terminal is influenced. The video compression restoration technology based on deep learning can improve the restoration effect on video compression noise. However, the amount of parameters of the deep learning algorithm model is large, which results in an excessive calculation amount of the display terminal.
The main component of the deep learning system is a convolutional neural network, and fig. 1 is a schematic diagram of the convolutional neural network. The convolutional neural network can be used for image processing, which uses images as input and output and replaces scalar weights by filters (i.e., convolution). Only a convolutional neural network having a 3-layer structure is illustrated in fig. 1, and embodiments of the present disclosure are not limited thereto. As shown in fig. 1, the convolutional neural network includes an input layer 101, a hidden layer 102, and an output layer 103. 4 input images are input in the input layer 101, 3 cells exist in the hidden layer 102 in the middle to output 3 output images, and 2 cells exist in the output layer 103 to output 2 output images.
As shown in FIG. 1, the convolutional layers have weights w_ij^k and biases b_i^k. The weights w_ij^k represent convolution kernels, and the biases are scalars superimposed on the outputs of the convolutional layers, where k is a label indicating the layer number, and i and j are labels of the units of the input layer 101 and the units of the hidden layer 102, respectively. For example, the first convolutional layer 201 includes a first set of convolution kernels (w_ij^1 in FIG. 1) and a first set of biases (b_i^1 in FIG. 1). The second convolutional layer 202 includes a second set of convolution kernels (w_ij^2 in FIG. 1) and a second set of biases (b_i^2 in FIG. 1). Typically, each convolutional layer includes tens or hundreds of convolution kernels, and a deep convolutional neural network may include at least five convolutional layers.
As shown in fig. 1, the convolutional neural network further includes a first activation layer 203 and a second activation layer 204. The first activation layer 203 is located after the first convolutional layer 201, and the second activation layer 204 is located after the second convolutional layer 202. The activation layer includes an activation function, which is used to introduce non-linear factors into the convolutional neural network so that the convolutional neural network can better solve more complex problems. The activation function may include a rectified linear unit (ReLU) function, a Sigmoid function, or a hyperbolic tangent function (tanh function), etc. The activation layer may be used alone as a layer of the convolutional neural network, or the activation layer may be included in a convolutional layer.
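As an illustration of the convolutional layers and activation layers described above, the following is a minimal sketch of the 3-layer network of fig. 1 (a sketch only, assuming PyTorch as the framework; the spatial size and the use of ReLU for both activation layers are illustrative choices):

```python
import torch
import torch.nn as nn

# A sketch of the 3-layer convolutional network of fig. 1:
# 4 input images -> 3 hidden feature maps -> 2 output images.
# Each convolution carries weights w_ij^k (kernels) and biases b_i^k,
# and is followed by an activation layer (ReLU here).
class TinyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 3, kernel_size=3, padding=1)  # first convolutional layer (w^1, b^1)
        self.act1 = nn.ReLU()                                    # first activation layer
        self.conv2 = nn.Conv2d(3, 2, kernel_size=3, padding=1)  # second convolutional layer (w^2, b^2)
        self.act2 = nn.ReLU()                                    # second activation layer

    def forward(self, x):
        return self.act2(self.conv2(self.act1(self.conv1(x))))

# Usage: a batch containing one 4-channel "input image".
y = TinyConvNet()(torch.randn(1, 4, 64, 64))  # -> shape (1, 2, 64, 64)
```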
The convolutional neural network of fig. 1 can be used to improve the sharpness of an image, and the trained convolutional neural network improves the sharpness of an input low-sharpness image to obtain a high-sharpness image. The training process of the convolutional neural network is an optimization process of the parameters of the convolutional neural network. Wherein the loss of the convolutional neural network helps to optimize the parameters (weights) of the convolutional neural network, and the goal of the training process is to minimize the loss of the neural network by optimizing the parameters of the neural network. The loss of the neural network is used for measuring the quality of the network model prediction, namely, the difference degree between the prediction result and the actual data is expressed.
An embodiment of the present disclosure provides an image processing method, and fig. 2 is a schematic diagram of the image processing method provided in the embodiment of the present disclosure, and as shown in fig. 2, the image processing method includes: and S10, processing the input image by using the trained first neural network to obtain a target output image. The sharpness of the target output image is higher than the sharpness of the input image.
In the embodiments of the present disclosure, the term "sharpness" refers to, for example, the degree of clarity of each detail and its boundary in an image; the higher the sharpness, the better the perception effect for the human eye. The definition of the target output image being higher than that of the input image means, for example, that the input image is processed by using the image processing method provided by the embodiment of the present disclosure, for example denoised and/or deblurred, so that the target output image obtained after processing is clearer than the input image.
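As an illustration of step S10, the following is a minimal sketch of the inference pass (a sketch only, assuming PyTorch; the tensor layout, value range and clamping step are assumptions rather than requirements of the disclosure):

```python
import torch

def enhance_image(model, input_image):
    """Run step S10: feed the input image through the trained first
    neural network and return the target output image (higher sharpness)."""
    model.eval()
    with torch.no_grad():
        # input_image: tensor of shape (1, C, H, W), values assumed in [0, 1]
        target_output = model(input_image)
    return target_output.clamp(0.0, 1.0)
```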
In this disclosure, the trained first neural network is obtained by performing a first training method on the first neural network to be trained, fig. 3 is a schematic diagram of the first training method provided in this disclosure, and as shown in fig. 3, the first training method includes:
and S21, alternately training the second neural network to be trained and the discrimination network to be trained to obtain the trained second neural network and the trained discrimination network.
In the embodiment of the present disclosure, the parameters of the trained second neural network are more than the parameters of the first neural network to be trained. Fig. 4 is a schematic diagram of a network architecture including a first neural network to be trained and a trained second neural network provided in an embodiment of the present disclosure, where the first neural network 10 to be trained includes: a plurality of first feature extraction sub-networks ML1 and a first output sub-network OL1 located after the plurality of first feature extraction sub-networks ML1, the trained second neural network 20 comprising: a plurality of second feature extraction sub-networks ML2 and a second output sub-network OL2 located after the plurality of second feature extraction sub-networks ML2, the first feature extraction sub-network ML1 corresponding one-to-one to the second feature extraction sub-network ML 2.
In some embodiments, the first neural network 10 to be trained comprises: a plurality of first upsampling layers 13, a plurality of first downsampling layers 12, and a plurality of single-layer convolutional layers 11, each first upsampling layer 13 and each first downsampling layer 12 being located between two single-layer convolutional layers 11; the input data to the i-th last single layer convolutional layer 11 includes the superposition of the output data of the i-th last first up-sampling layer 13 and the output data of the i-th positive single layer convolutional layer 11. Wherein, the number of the single-layer convolution layers 11 is even number, i is larger than 0 and less than half of the number of the single-layer convolution layers. The second neural network 20 includes: a plurality of second upsampling layers 23, a plurality of second downsampling layers 22 and a plurality of residual blocks 21, wherein the plurality of second upsampling layers 23 correspond to the plurality of first upsampling layers 13 one to one, the plurality of second downsampling layers 22 correspond to the plurality of first downsampling layers 12 one to one, and the plurality of residual blocks 21 correspond to the plurality of single-layer convolutional layers 11 one to one; the input data of the i-last residual block 21 comprises a superposition of the output data of the i-last second upsampling layer 23 and the output data of the positive i-th residual block 21.
The single-layer convolutional layer 11, the first upsampling layer 13 and the first downsampling layer 12 all adopt 3 × 3 convolution kernels, and the number of the convolution kernels is 32. The second upsampling layer 23 has the same sampling rate as the first upsampling layer 13, and the second downsampling layer 22 has the same sampling rate as the first downsampling layer 12. Illustratively, the first upsampling layer 13 and the first downsampling layer 12 each use 2-times sampling. The first downsampling layer 12 and the second downsampling layer 22 may include inverse Muxout layers, strided convolution, max pooling layers (Maxpool Layer), or standard per-channel downsamplers (e.g., bicubic interpolation). The first upsampling layer 13 and the second upsampling layer 23 may include Muxout layers, strided transposed convolution, or standard per-channel upsamplers (e.g., bicubic interpolation).
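As an illustration of this structure, the following is a minimal sketch of one possible instantiation of the first neural network 10 with four single-layer convolutional layers, one first downsampling layer and one first upsampling layer (a sketch only, assuming PyTorch; the channel count, the strided convolution for downsampling and the transposed convolution for upsampling are illustrative choices among the options listed above):

```python
import torch.nn as nn

class FirstNetworkSketch(nn.Module):
    """Sketch of the lightweight first neural network: single 3x3
    convolutional layers with down-/up-sampling layers between them,
    and a skip addition from the i-th front convolution to the input
    of the i-th-from-last convolution (here i = 1)."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(3, ch, 3, padding=1)                   # 1st (front) single-layer convolution
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)         # first downsampling layer (strided conv)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)  # first upsampling layer (2x)
        self.conv4 = nn.Conv2d(ch, 3, 3, padding=1)                   # 1st-from-last single-layer convolution

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(self.down(f1))
        f3 = self.conv3(f2)
        u = self.up(f3)
        # input of the 1st-from-last convolution = up-sampled features + output of the 1st convolution
        return self.conv4(u + f1)
```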
Fig. 5 is an exemplary diagram of a residual block, and as shown in fig. 5, each residual block 21 includes three sub-residual blocks 21a connected in sequence. Each sub-residual block 21a employs two convolutional layers with 3 × 3 convolution kernels, between which an activation layer is connected, and in each sub-residual block 21a, the input of the sub-residual block is superimposed on the output of the last convolutional layer, thereby serving as the output of the sub-residual block 21a. The activation layer includes an activation function, which may include a rectified linear unit (ReLU) function, a Sigmoid function, or a hyperbolic tangent function (tanh function), etc. The activation layer may be used alone as a layer of the convolutional neural network, or the activation layer may be included in a convolutional layer. In the first convolutional network 10, the residual blocks 21 of the second neural network 20 are replaced with single-layer convolutional layers 11, thereby reducing the parameters of the first convolutional network 10.
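A minimal sketch of the residual block of fig. 5 follows (a sketch only, assuming PyTorch; ReLU is chosen as the activation function from the options named above, and the channel count is left as a parameter):

```python
import torch.nn as nn

class SubResidualBlock(nn.Module):
    """Sub-residual block 21a: two 3x3 convolutional layers with an
    activation layer between them; the block input is added to the
    output of the last convolutional layer."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ReLU()
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class ResidualBlock(nn.Module):
    """Residual block 21: three sub-residual blocks connected in sequence."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(*[SubResidualBlock(ch) for _ in range(3)])

    def forward(self, x):
        return self.body(x)
```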
The first feature extraction sub-network ML1 includes: the first upsampling layer 13, or the first downsampling layer 12, or the single-layer convolutional layer 11; the first output sub-network OL1 includes the single-layer convolutional layer 11; the second feature extraction sub-network ML2 includes: the second upsampling layer 23, or the second downsampling layer 22, or the residual block 21; the second output sub-network OL2 includes the residual block 21. In addition, the number of channels of the output image of the first feature extraction sub-network ML1 is smaller than the number of channels of the output image of the second feature extraction sub-network ML2. Illustratively, the number of channels of the output image of the first feature extraction sub-network ML1 is 32, and the number of channels of the output image of the second feature extraction sub-network ML2 is 128. It should be noted that in the neural network, the image input to each network layer is represented by a matrix, and the image received by the first layer in the neural network may be R, G, B three-channel image matrices, that is, the image matrix of each channel represents data of the red component, the green component or the blue component of the image. Each network layer is used for extracting features of the image, and after feature extraction, the output data of the network layer includes a plurality of matrices, each of which is a channel representing the image.
In the embodiment of the disclosure, the second neural network to be trained and the discriminant network to be trained are alternately trained, thereby competing with each other and obtaining the optimal model. Specifically, the trained second neural network is configured to transform the received image having a first definition into an image having a second definition, the second definition being greater than the first definition. And configuring the trained discrimination network, and determining the matching degree of the output result of the second neural network and a preset standard image, wherein the matching degree is between 0 and 1. When a second neural network to be trained is trained, the output result of the second neural network after parameter adjustment is input into the current discrimination network by adjusting the parameters of the current second neural network, and the discrimination network outputs a matching degree as close to 1 as possible; when the discrimination network to be trained is trained, the parameters of the current discrimination network are adjusted, so that after the preset standard image is input into the current discrimination network, the output result of the current discrimination network is as close to 1 as possible (namely, the input of the discrimination network is judged to be a 'true' sample), and after the output result of the current second neural network enters the discrimination network, the output result of the discrimination network is as close to 0 as possible (namely, the input of the discrimination network is judged to be a 'false' sample). Through the alternate training of the second neural network and the discrimination network, the discrimination network is continuously optimized to discriminate and distinguish the output result of the second neural network and the preset standard image as much as possible, and the second neural network is continuously optimized to make the output result as close to the preset standard image as possible. This approach allows two neural networks to compete and improve on each training to get better and better network models based on the better and better results of the other network.
Fig. 6 is a schematic structural diagram of the trained discrimination network provided in an embodiment of the present disclosure. As shown in fig. 6, the trained discrimination network 30 includes a plurality of convolutional layers 31 to 34 and a fully-connected layer 35. For example, each of the convolutional layers 31 to 34 is a convolutional layer with 2-times downsampling, and an activation layer is connected after each of the convolutional layers 31 to 34; the activation layer includes an activation function, and the activation function may include a rectified linear unit (ReLU) function, a Sigmoid function, or a hyperbolic tangent function (tanh function). Each of the convolutional layers 31 to 34 employs a 3 × 3 convolution kernel, the number of channels of the image output from the convolutional layer 31 is 32, the number of channels of the image output from the convolutional layer 32 is 64, the number of channels of the image output from the convolutional layer 33 is 128, and the number of channels of the image output from the convolutional layer 34 is 192. The fully-connected layer 35 outputs a 1024 × 1 vector, which then passes through an activation layer (e.g., an activation layer using sigmoid as the activation function) and outputs a value between 0 and 1.
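The following is a minimal sketch of the discrimination network 30 (a sketch only, assuming PyTorch; the 64 × 64 input size used to size the fully-connected layer and the final projection from the 1024-dimensional vector to a single matching degree are assumptions):

```python
import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    """Sketch of the discrimination network 30: four 2x down-sampling
    convolutional layers (32/64/128/192 output channels), a fully
    connected layer producing a 1024-dimensional vector, and an assumed
    final projection to a single matching degree in [0, 1]."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 192]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU()]
        self.features = nn.Sequential(*layers)                        # convolutional layers 31-34
        self.fc = nn.Linear(192 * 4 * 4, 1024)                        # fully-connected layer 35 (for 64x64 inputs)
        self.head = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())   # assumed reduction to one score in [0, 1]

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(self.fc(f))
```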
It should be understood that the structure of the trained discriminative network and the discriminative network to be trained (i.e., the number of convolutional layers, the number of convolutional kernels in the convolutional layer) is the same, except for the different weights in the convolutional layer.
It should be noted that the numbers of network layers of the first neural network 10 and the trained discrimination network 30 shown in fig. 4 and fig. 6 are only exemplary illustrations, and in practical applications, the network structure may be adjusted as needed.
And S22, providing the first sample image to the trained second neural network and the first neural network to be trained respectively, so that the first neural network to be trained outputs a first output image, and the trained second neural network outputs a second output image.
In some examples, the original video may be compressed at a low bit rate (e.g., 1 Mbps) to obtain a compressed video, and each frame of image in the compressed video may serve as a first sample image carrying noise, which may be Gaussian noise.
S23, the first output image is provided to the trained discrimination network so that the trained discrimination network generates a first discrimination result based on the first output image.
And S24, adjusting the parameters of the first neural network according to the total loss to obtain an updated first neural network. Wherein the total loss includes a first loss, a second loss, and a third loss, the first loss being based on a difference of the first output image and the second output image; the second loss is obtained based on a difference between the first discrimination result and the first target result; the third penalty is based on a difference between the output image of at least one first feature extraction sub-network and the output image of the corresponding second feature extraction sub-network.
As described above, the output of the trained discrimination network 30 is a degree of matching between 0 and 1, in which case the first target result is a degree of matching close to 1 or equal to 1.
The phrase "adjusting the parameter of the first neural network in accordance with the total loss" means that the parameter of the first neural network is adjusted so that the value of the total loss tends to decrease as a whole when the first training method is performed a plurality of times. The number of times of execution of the first training method may be preset, or when the total loss is less than a preset value, the first training method is not performed. It should be noted that the first sample image utilized in the first training method at different times may be different.
In the disclosed embodiments, the difference between two images refers to the difference in the low-frequency information of the two images, which may be characterized using the L1 loss value, the mean square error (MSE), the structural similarity (SSIM), and the like.
In some embodiments, the first loss includes the L1 loss of the first output image and the second output image, and may be x_1 · Loss1, where x_1 is a preset weight and Loss1 is the L1 loss of the first output image and the second output image, i.e., Loss1 = ||y1 - y2||_1, where y1 is the first output image and y2 is the second output image.
In some embodiments, the second loss includes the cross-entropy loss of the first discrimination result and the first target result, specifically x_2 · Loss2, where x_2 is a preset weight and Loss2 is the cross-entropy loss between the first discrimination result of the discrimination network and the first target result.
Specifically, Loss2 = -[P·logP' + (1 - P)·log(1 - P')], where P is the first target result and P' is the first discrimination result. In some embodiments, step S23 specifically includes: setting a true-value label for the first output image, and providing the first output image with the true-value label to the trained discrimination network, so that the discrimination network outputs the first discrimination result. The true-value label is used for indicating that the image is a "true" sample, and the first target result is the probability corresponding to the true-value label. For example, the first target result is 1.
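A minimal sketch of the first loss and the second loss follows (a sketch only, assuming PyTorch; the preset weights x_1 and x_2 are left as parameters):

```python
import torch
import torch.nn.functional as F

def first_loss(y1, y2, x1=1.0):
    """First loss: x_1 times the L1 loss between the first output image y1
    (output of the first neural network) and the second output image y2
    (output of the trained second neural network)."""
    return x1 * F.l1_loss(y1, y2)

def second_loss(first_discrimination_result, x2=1.0):
    """Second loss: x_2 times the cross entropy between the first
    discrimination result P' and the first target result P = 1 (the
    "true" label); with P = 1 this reduces to -log P'."""
    target = torch.ones_like(first_discrimination_result)
    return x2 * F.binary_cross_entropy(first_discrimination_result, target)
```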
In some embodiments, the third loss is specifically based on the difference between the output image of the at least one first feature extraction sub-network and a transformed image of the output image of the corresponding second feature extraction sub-network. For example, the network architecture including the first neural network and the second neural network further includes a plurality of dimensionality reduction layers 40. The dimensionality reduction layers 40 are in one-to-one correspondence with the second feature extraction sub-networks ML2, and each dimensionality reduction layer 40 is configured to perform channel dimensionality reduction on the output image of the corresponding second feature extraction sub-network to generate an intermediate image; the number of channels of the intermediate image is the same as the number of channels of the output image of the first feature extraction sub-network.
The first training method further comprises: providing the output images of the second feature extraction sub-networks to a plurality of dimensionality reduction layers in a one-to-one correspondence, so that each dimensionality reduction layer generates an intermediate image; the number of channels of the intermediate image is the same as the number of channels of the output image of the first feature extraction sub-network. In this case, in step S24, the parameters of both the first neural network and the dimensionality reduction layer are adjusted. Wherein the third penalty is based on a sum of differences between each of the intermediate images and an output image of the corresponding first feature extraction sub-network.
In some embodiments, the difference between the intermediate image and the output image of the first feature extraction sub-network is represented by the L2 loss of the two. The third loss is x_3 · Loss3, where x_3 is a preset weight and Loss3 is the sum of the L2 losses of the output image of each first feature extraction sub-network and the corresponding intermediate image. Specifically, Loss3 is calculated according to the following formula:
Loss3 = Σ_{n=1}^{T} ||S_n(z) - f(G_n(z))||_2^2
where T is the number of first feature extraction sub-networks; S_n(z) is the output image of the n-th first feature extraction sub-network in the first neural network; G_n(z) is the output image of the n-th second feature extraction sub-network in the second neural network; and f(G_n(z)) is the intermediate image output by the dimensionality reduction layer corresponding to the n-th second feature extraction sub-network of the second neural network.
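A minimal sketch of the third loss follows (a sketch only, assuming PyTorch; the use of 1 × 1 convolutions for the dimensionality reduction layers 40, the use of MSE as the L2 difference, and the 128-to-32 channel mapping are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

def third_loss(student_feats, teacher_feats, reduce_layers, x3=1.0):
    """Third loss: sum over the T feature-extraction sub-networks of the
    L2 (MSE) difference between the first-network feature map S_n(z) and
    the channel-reduced second-network feature map f(G_n(z)).

    student_feats / teacher_feats: lists of feature maps, one per sub-network.
    reduce_layers: dimensionality reduction layers 40 (assumed 1x1 convolutions)
    mapping second-network channels to first-network channels.
    """
    loss = 0.0
    for s_n, g_n, f_n in zip(student_feats, teacher_feats, reduce_layers):
        loss = loss + F.mse_loss(s_n, f_n(g_n))
    return x3 * loss

# Example dimensionality reduction layers: 128 teacher channels -> 32 student channels.
reduce_layers = nn.ModuleList([nn.Conv2d(128, 32, kernel_size=1) for _ in range(4)])
```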
In the embodiment of the present disclosure, compared with the trained second neural network, the trained first neural network is simplified: it has fewer parameters and a simpler network structure, so it occupies fewer resources (e.g., computing resources, storage resources, etc.) during operation and can therefore be applied to a lightweight terminal. Moreover, in the total loss adopted when training the first neural network to be trained, the first loss is obtained based on the difference between the output result of the first neural network and the output result of the second neural network, the second loss is obtained based on the difference between the discrimination result of the trained discrimination network and the first target result, and the third loss is obtained based on the difference between the output image of at least one first feature extraction sub-network and the output image of the corresponding second feature extraction sub-network, so that the performance of the trained first neural network is as close to that of the second neural network as possible. Therefore, the embodiment of the disclosure can reduce the parameters of the image processing model while ensuring the image processing effect, thereby improving the image processing speed.
In some embodiments, the total loss further includes a fourth loss, the fourth loss being based on the perceptual loss of the first output image and the second output image. The perceptual loss is used to characterize the difference in the high-frequency information (e.g., detail features such as texture and hair in the image) of the two images.
Optionally, the fourth loss is x_4 · L_perceptual(y1, y2), where x_4 is a preset weight and L_perceptual(y1, y2), the perceptual loss of the first output image and the second output image, is calculated according to the following formula:
L_perceptual(y1, y2) = (1 / (C·H·W)) · ||φ_j(y1) - φ_j(y2)||_2^2
where y1 is the first output image, y2 is the second output image, φ_j is a preset network layer in the trained discrimination network, j is the layer number of the preset network layer in the trained discrimination network, C is the number of channels of the output image of the preset network layer, H is the height of the output image of the preset network layer, and W is the width of the output image of the preset network layer. It can be understood that φ_j(y1) is the output image of the preset network layer after the first output image is input into the trained discrimination network, and φ_j(y2) is the output image of the preset network layer after the second output image is input into the trained discrimination network. Optionally, the preset network layer may be the convolutional layer whose output image has 128 channels.
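A minimal sketch of the fourth loss follows (a sketch only, assuming PyTorch; phi_j is assumed to be a callable that returns the output of the preset network layer, e.g., obtained with a forward hook on the trained discrimination network):

```python
import torch.nn.functional as F

def fourth_loss(phi_j, y1, y2, x4=1.0):
    """Fourth loss: x_4 times the perceptual loss
    (1/(C*H*W)) * ||phi_j(y1) - phi_j(y2)||_2^2, where phi_j extracts the
    output of the preset network layer of the trained discrimination
    network. mse_loss with 'mean' reduction realizes the 1/(C*H*W) factor
    (also averaged over the batch)."""
    return x4 * F.mse_loss(phi_j(y1), phi_j(y2))
```

In step S24, the total loss can then be assembled as the sum of the first, second, third and fourth losses sketched above.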
Fig. 7 is a flowchart of an alternative implementation manner of step S21 provided in the embodiment of the present disclosure, and as shown in fig. 7, step S21 specifically includes: step S21a and step S21b are alternately performed until the preset training condition is reached. The preset training conditions are, for example: the number of times of alternation of step S21a and step S21b reaches a preset number of times.
And S21a, providing the second sample image to the current second neural network so that the second neural network generates a first definition improved image. And providing the first definition promoted image and the original sample image corresponding to the second sample image to a current discrimination network, and adjusting parameters of the discrimination network according to a loss function of the current discrimination network, so that the output of the discrimination network after parameter adjustment can represent the discrimination result of whether the input of the discrimination network is the output image of the second neural network or the original sample image.
And S21b, providing the third sample image to the current second neural network so that the second neural network generates a second definition improved image. And inputting the second definition improved image into the discrimination network after parameter adjustment so that the discrimination network after parameter adjustment generates a second discrimination result based on the second definition improved image. And adjusting parameters of the second neural network based on the loss function of the second neural network to obtain an updated second neural network.
It should be noted that the n-th step S21a and the n-th step S21b are taken as one round of the training process. Therefore, in the first round of the training process, the current second neural network is the second neural network to be trained; in each round after the first, the current second neural network is the updated second neural network obtained in step S21b of the previous round. Likewise, in the first round of the training process, the current discrimination network is the discrimination network to be trained; in each round after the first, the current discrimination network is the parameter-adjusted discrimination network obtained in step S21a of the previous round.
Wherein a first term in the loss function of the second neural network is based on a difference between the second sharpness-enhanced image and its corresponding original sample image, and a second term in the loss function of the second neural network is based on a difference between the second discrimination result and a second target result.
In some embodiments, the first term in the loss function LossG of the second neural network is λ_1 · LossG1, where λ_1 is a preset weight and LossG1 is the L1 loss between the second sharpness-enhanced image and its corresponding original sample image. Specifically, LossG1 = ||y - ŷ||_1, where y is the original sample image corresponding to the second sharpness-enhanced image and ŷ is the second sharpness-enhanced image.
The second term in the loss function of the second neural network is λ_2 · L_D, where λ_2 is a preset weight and L_D is the cross entropy of the second discrimination result and the second target result. The second target result is used for representing that the input of the discrimination network is the original sample image corresponding to the second sharpness-enhanced image, that is, that the input of the discrimination network is a "true" sample. For example, the second target result is 1.
The third term in the loss function of the second neural network is λ_3 · (1/(C·H·W)) · ||φ_j(y) - φ_j(ŷ)||_2^2, where λ_3 is a preset weight, φ_j is a preset network layer in a preset optimization network, j is the layer number of the preset network layer in the preset optimization network, C is the number of channels of the output image of the preset network layer, H is the height of the output image of the preset network layer, and W is the width of the output image of the preset network layer; the preset optimization network adopts a VGG-19 network. It should be noted that, in the neural network, the image output by each network layer is not a visually observable image; rather, the image is represented by a matrix, the height of the image can be regarded as the number of rows of the matrix, and the width of the image can be regarded as the number of columns of the matrix.
That is, after the second neural network is trained a plurality of times, the image output by the updated second neural network is as close as possible to the original sample image in terms of both L1 loss and perceptual loss, and, after the image output by the second neural network is supplied to the discrimination network, the result output by the discrimination network is as close to 1 as possible.
Alternatively, in steps S21a and S21b in the same round of training process, the second sample image and the third sample image may be the same. While the second sample image is different and the third sample image is different during different rounds of training.
In each round of training, the training step for discriminating the network may be performed first, or the training step for generating the network may be performed first.
In some examples, the original video may be losslessly compressed to obtain a lossless compressed video, and the image frames in the lossless compressed video image are used as original sample images; and performing compression of a first code rate on the original video to obtain a low-loss compressed video, and taking an image frame in the low-loss compressed video as a second sample image or a third sample image.
In some examples, the training process of step S21 may employ an Adam optimizer with a learning rate of 1 e-4.
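A minimal sketch of one round of the alternating training of step S21 follows (a sketch only, assuming PyTorch; the loss-function callables disc_loss_fn and gen_loss_fn and their signatures are hypothetical placeholders for the loss functions described above, and, as noted above, the second and third sample images are assumed to share one original sample image within a round):

```python
import torch

def train_round_s21(gen, disc, opt_g, opt_d, second_sample, third_sample,
                    original, gen_loss_fn, disc_loss_fn):
    """One round of the alternating training of step S21.

    S21a: update the discrimination network using a first sharpness-enhanced
          image produced by the current second neural network and the original sample image.
    S21b: update the second neural network using its current loss function
          (L1 term + adversarial term + VGG-19 perceptual term, see above).
    """
    # S21a - train the discrimination network
    with torch.no_grad():
        fake_a = gen(second_sample)          # first sharpness-enhanced image
    opt_d.zero_grad()
    disc_loss_fn(disc(fake_a), disc(original)).backward()
    opt_d.step()

    # S21b - train the second neural network
    opt_g.zero_grad()
    fake_b = gen(third_sample)               # second sharpness-enhanced image
    gen_loss_fn(fake_b, original, disc(fake_b)).backward()
    opt_g.step()

# Adam optimizers with learning rate 1e-4, as mentioned above:
# opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
# opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
```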
In the embodiment of the disclosure, the trained first neural network has fewer parameters and a simpler network structure than the second neural network, so that the first neural network occupies fewer resources (such as computing resources, storage resources and the like) when the first neural network runs, and thus, the method can be applied to a lightweight terminal. In addition, the training method of the first neural network to be trained can enable the performance of the trained first neural network to be close to that of the trained second neural network, so that the image processing method of the embodiment of the disclosure can obtain images with high definition and improve the image processing speed.
Fig. 8 is a diagram of effects before and after image processing by the image processing method according to the embodiment of the present disclosure, where the left diagram in fig. 8 is the input image before processing and the right diagram is the target output image after processing. As shown in fig. 8, the sharpness of the image is improved after the image processing. Compared with the second convolutional network, the number of parameters of the first convolutional network is compressed by a factor of more than 50, and the processing speed is improved by about 15 times.
The present disclosure also provides an image processing apparatus comprising a memory and a processor, the memory having stored thereon a computer program that, when executed by the processor, implements the image processing method described above.
The present disclosure also provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the image processing method described above.
The above memory and the computer readable storage medium include, but are not limited to, the following readable media: such as Random Access Memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (eeprom), flash memory, magnetic or optical data storage, registers, magnetic disk or tape, optical storage media such as a Compact Disk (CD) or DVD (digital versatile disk), and other non-transitory media. Examples of processors include, but are not limited to, general purpose processors, Central Processing Units (CPUs), microprocessors, Digital Signal Processors (DSPs), controllers, microcontrollers, state machines, and the like.
It is to be understood that the above embodiments are merely exemplary embodiments that are employed to illustrate the principles of the present disclosure, and that the present disclosure is not limited thereto. It will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the disclosure, and these are to be considered as the scope of the disclosure.

Claims (14)

  1. An image processing method comprising:
    processing the input image by using the trained first neural network to obtain a target output image; the definition of the target output image is greater than that of the input image;
    the trained first neural network is obtained by training a first training method on the first neural network to be trained, and the first training method comprises the following steps:
    alternately training a second neural network to be trained and a discrimination network to be trained to obtain a trained second neural network and a trained discrimination network; wherein the parameters of the trained second neural network are more than the parameters of the first neural network to be trained; the trained second neural network is configured to transform the received image with a first definition into an image with a second definition, the second definition being greater than the first definition; the first neural network to be trained comprises: a plurality of first feature extraction sub-networks and a first output sub-network located after the plurality of first feature extraction sub-networks, the trained second neural network comprising: a plurality of second feature extraction sub-networks and a second output sub-network located after the plurality of second feature extraction sub-networks, the first feature extraction sub-networks corresponding one-to-one to the second feature extraction sub-networks;
    respectively providing the first sample image to the trained second neural network and the first neural network to be trained, so that the first neural network to be trained outputs a first output image, and the trained second neural network outputs a second output image;
    providing the first output image to the trained discrimination network so that the trained discrimination network generates a first discrimination result based on the first output image;
    adjusting parameters of the first neural network according to the total loss to obtain an updated first neural network; wherein the total loss comprises a first loss, a second loss, and a third loss, the first loss being based on a difference of the first output image and the second output image; the second loss is obtained based on a difference between the first discrimination result and a first target result; the third penalty is based on a difference between an output image of at least one of the first sub-networks of feature extraction and an output image of the corresponding second sub-network of feature extraction.
  2. The image processing method of claim 1, wherein the number of channels of the output image of the first feature extraction sub-network is smaller than the number of channels of the output image of the corresponding second feature extraction sub-network;
    the first training method further comprises: providing the output images of the second feature extraction sub-networks to a plurality of dimensionality reduction layers in a one-to-one correspondence, so that each dimensionality reduction layer generates an intermediate image; the number of channels of the intermediate image is the same as the number of channels of the output image of the first feature extraction sub-network;
    adjusting a parameter of the first neural network according to a total loss function, including: adjusting parameters of both the first neural network and the dimensionality reduction layer; wherein the third penalty is based on a sum of differences between each of the intermediate images and an output image of the corresponding first feature extraction sub-network.
  3. The image processing method of claim 1, wherein the total loss further comprises: a fourth loss based on a perceptual loss of the first output image and the second output image.
  4. The image processing method of claim 3, wherein the perceptual loss L_perceptual(y1, y2) of the first output image and the second output image is calculated according to the following formula:
    L_perceptual(y1, y2) = (1 / (C·H·W)) · ||φ_j(y1) - φ_j(y2)||_2^2
    wherein y1 is the first output image, y2 is the second output image, φ_j is a preset network layer in the trained discrimination network, j is the layer number of the preset network layer in the trained discrimination network, C is the number of channels of an output image of the preset network layer, H is the height of the output image of the preset network layer, and W is the width of the output image of the preset network layer.
  5. The image processing method according to any one of claims 1 to 4, wherein the first loss comprises an L1 loss of the first output image and the second output image.
  6. The image processing method according to any one of claims 1 to 4, wherein the second loss comprises a cross-entropy loss of the first discrimination result and a first target result.
  7. The image processing method according to any one of claims 2 to 4, wherein the third loss term comprises a sum of L2 losses of the output image of each first feature extraction sub-network and the corresponding intermediate image.
  8. The image processing method of any of claims 1 to 4, wherein providing the first output image to the trained discrimination network to cause the trained discrimination network to generate a first discrimination result based on the first output image comprises:
    and setting the first output image to have a label with a truth value, and providing the first output image with the label with the truth value to the trained discrimination network so that the discrimination network outputs a first discrimination result.
  9. The image processing method according to any one of claims 1 to 4, wherein, in the step of alternately training the second neural network to be trained and the discrimination network to be trained, training the discrimination network to be trained comprises:
    providing a second sample image to the current second neural network so that the current second neural network generates a first sharpness-enhanced image;
    and providing the first sharpness-enhanced image and an original sample image corresponding to the second sample image to the current discrimination network, and adjusting current parameters of the discrimination network according to a loss function of the current discrimination network, so that the output of the parameter-adjusted discrimination network represents a discrimination result of whether its input is an output image of the second neural network or the original sample image.
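    Illustrative sketch of one discriminator update in the alternating training of claim 9, under the usual GAN recipe in which the second network's output is labelled fake and the original sample image real; the optimizer and the exact labelling are assumptions.

```python
import torch
import torch.nn.functional as F

def train_discrimination_network_step(second_net, disc, disc_optimizer,
                                       second_sample, original_sample):
    with torch.no_grad():
        enhanced = second_net(second_sample)      # first sharpness-enhanced image
    fake_logits = disc(enhanced)
    real_logits = disc(original_sample)
    # The discriminator should call the second network's output fake and the original real.
    loss = (F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
            + F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)))
    disc_optimizer.zero_grad()
    loss.backward()
    disc_optimizer.step()
    return loss.item()
```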
  10. The image processing method according to claim 9, wherein, in the step of alternately training the second neural network to be trained and the discrimination network to be trained, training the second neural network to be trained comprises:
    providing a third sample image to the current second neural network so that the current second neural network generates a second sharpness-enhanced image;
    inputting the second sharpness-enhanced image into the parameter-adjusted discrimination network, so that the parameter-adjusted discrimination network generates a second discrimination result based on the second sharpness-enhanced image;
    and adjusting current parameters of the second neural network based on the current loss function of the second neural network to obtain an updated second neural network; wherein a first term in the current loss function of the second neural network is based on a difference between the second sharpness-enhanced image and its corresponding original sample image, and a second term in the current loss function of the second neural network is based on a difference between the second discrimination result and a second target result.
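    Illustrative sketch of the complementary update of the second neural network against the parameter-adjusted discrimination network; loss_fn is a hypothetical callable implementing the loss spelled out in claim 11.

```python
def train_second_network_step(second_net, disc, second_optimizer,
                              third_sample, original_sample, loss_fn):
    enhanced = second_net(third_sample)     # second sharpness-enhanced image
    discrimination = disc(enhanced)         # second discrimination result
    loss = loss_fn(enhanced, original_sample, discrimination)
    second_optimizer.zero_grad()
    loss.backward()
    second_optimizer.step()
    return loss.item()
```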
  11. The image processing method of claim 10, wherein the first term in the current loss function of the second neural network is λ₁·Loss_G1, where λ₁ is a preset weight and Loss_G1 is the L1 loss between the second sharpness-enhanced image and its corresponding original sample image;
    the second term in the current loss function of the second neural network is λ₂·L_D, where λ₂ is a preset weight and L_D is the cross-entropy between the second discrimination result and the second target result;
    a third term in the current loss function of the second neural network is
    λ₃ · (1 / (C·H·W)) · ‖φ_j(y) − φ_j(ŷ)‖₂²
    where λ₃ is a preset weight, y is the original sample image corresponding to the second sharpness-enhanced image, ŷ is the second sharpness-enhanced image, φ_j denotes the output of a preset network layer in a preset optimization network, j is the layer index of the preset network layer in the preset optimization network, C is the number of channels of the output image of the preset network layer, H is the height of the output image of the preset network layer, and W is the width of the output image of the preset network layer; and the preset optimization network is a VGG-19 network.
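    Illustrative sketch of the three-term loss of claim 11, using a torchvision VGG-19 feature extractor as the preset optimization network; the truncation point (relu4_4) and the weights lam1 to lam3 are assumptions, since the claim leaves the preset layer and the preset weights open.

```python
import torch
import torch.nn.functional as F
import torchvision

# Preset optimization network: VGG-19 features truncated after relu4_4, frozen.
vgg_features = torchvision.models.vgg19(weights="DEFAULT").features[:27].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def second_network_loss(enhanced, original, discrimination,
                        lam1=1.0, lam2=1e-3, lam3=1e-2):
    # First term: L1 loss between the second sharpness-enhanced image and its original sample image.
    loss_g1 = F.l1_loss(enhanced, original)
    # Second term: cross-entropy between the second discrimination result and the second
    # target result (here assumed to be "real", i.e. all ones).
    loss_d = F.binary_cross_entropy_with_logits(discrimination, torch.ones_like(discrimination))
    # Third term: perceptual loss on the preset VGG-19 layer, normalised by C x H x W.
    f_y, f_hat = vgg_features(original), vgg_features(enhanced)
    c, h, w = f_y.shape[-3:]
    loss_per = F.mse_loss(f_y, f_hat, reduction="sum") / (c * h * w)
    return lam1 * loss_g1 + lam2 * loss_d + lam3 * loss_per
```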
  12. The image processing method according to any one of claims 1 to 4, wherein the first neural network to be trained comprises: a plurality of first upsampling layers, a plurality of first downsampling layers, and a plurality of single-layer convolutional layers, each first upsampling layer and each first downsampling layer being located between two of the single-layer convolutional layers; input data of the i-th-from-last single-layer convolutional layer comprises a superposition of output data of the i-th-from-last first upsampling layer and output data of the i-th single-layer convolutional layer counted from the front; wherein the number of single-layer convolutional layers is an even number, and i is greater than 0 and less than half of the number of single-layer convolutional layers;
    the trained second neural network comprises: a plurality of second upsampling layers, a plurality of second downsampling layers, and a plurality of residual blocks, wherein the second upsampling layers correspond to the first upsampling layers one to one, the second downsampling layers correspond to the first downsampling layers one to one, and the residual blocks correspond to the single-layer convolutional layers one to one; input data of the i-th-from-last residual block is a superposition of output data of the i-th-from-last second upsampling layer and output data of the i-th residual block counted from the front;
    the first feature extraction sub-network comprises the first upsampling layer, the first downsampling layer, or the single-layer convolutional layer; the first output sub-network comprises the single-layer convolutional layer; the second feature extraction sub-network comprises the second upsampling layer, the second downsampling layer, or the residual block; and the second output sub-network comprises the residual block.
  13. An image processing apparatus comprising a memory and a processor, the memory having stored thereon a computer program, wherein the computer program, when executed by the processor, implements the image processing method of any one of claims 1 to 12.
  14. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the image processing method of any one of claims 1 to 12.
CN202080002356.2A 2020-10-16 2020-10-16 Image processing method, image processing apparatus, and readable storage medium Pending CN114641792A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/121405 WO2022077417A1 (en) 2020-10-16 2020-10-16 Image processing method, image processing device and readable storage medium

Publications (1)

Publication Number Publication Date
CN114641792A (en) 2022-06-17

Family

ID=81208702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080002356.2A Pending CN114641792A (en) 2020-10-16 2020-10-16 Image processing method, image processing apparatus, and readable storage medium

Country Status (2)

Country Link
CN (1) CN114641792A (en)
WO (1) WO2022077417A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128768A (en) * 2023-04-17 2023-05-16 中国石油大学(华东) Unsupervised image low-illumination enhancement method with denoising module

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958556B (en) * 2023-08-01 2024-03-19 东莞理工学院 Dual-channel complementary spine image segmentation method for vertebral body and intervertebral disc segmentation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11024009B2 (en) * 2016-09-15 2021-06-01 Twitter, Inc. Super resolution using a generative adversarial network
WO2019197712A1 (en) * 2018-04-09 2019-10-17 Nokia Technologies Oy An apparatus, a method and a computer program for video coding and decoding
CN111767979B (en) * 2019-04-02 2024-04-23 京东方科技集团股份有限公司 Training method, image processing method and image processing device for neural network
CN111192206A (en) * 2019-12-03 2020-05-22 河海大学 Method for improving image definition


Also Published As

Publication number Publication date
WO2022077417A1 (en) 2022-04-21

Similar Documents

Publication Title
JP7446997B2 (en) Training methods, image processing methods, devices and storage media for generative adversarial networks
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN107529650B (en) Closed loop detection method and device and computer equipment
US10034005B2 (en) Banding prediction for video encoding
WO2019120110A1 (en) Image reconstruction method and device
EP4109392A1 (en) Image processing method and image processing device
CN108805002B (en) Monitoring video abnormal event detection method based on deep learning and dynamic clustering
CN109919032B (en) Video abnormal behavior detection method based on motion prediction
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN114641792A (en) Image processing method, image processing apparatus, and readable storage medium
CN111489364A (en) Medical image segmentation method based on lightweight full convolution neural network
CN114067007A (en) Image processing method and device and neural network training method and device
CN115410030A (en) Target detection method, target detection device, computer equipment and storage medium
CN112927137A (en) Method, device and storage medium for acquiring blind super-resolution image
CN114627150A (en) Data processing and motion estimation method and device based on event camera
CN109871790B (en) Video decoloring method based on hybrid neural network model
CN115953784A (en) Laser coding character segmentation method based on residual error and feature blocking attention
CN114037893A (en) High-resolution remote sensing image building extraction method based on convolutional neural network
CN117391920A (en) High-capacity steganography method and system based on RGB channel differential plane
CN117036806A (en) Object identification method based on dual multiplexing residual error network
CN116091893A (en) Method and system for deconvolution of seismic image based on U-net network
CN115578325A (en) Image anomaly detection method based on channel attention registration network
CN109741313A (en) The non-reference picture quality appraisement method of independent component analysis and convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination