CN112288658A - Underwater image enhancement method based on multi-residual joint learning - Google Patents
- Publication number
- CN112288658A (application CN202011321187.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- convolution
- residual
- branch
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G06T5/94—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20021—Dividing image into blocks, subimages or windows
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Abstract
The invention relates to an underwater image enhancement method based on multi-residual joint learning, belonging to the technical field of deep learning. The method comprises the following steps: 1) randomly crop the pictures of different resolutions in an underwater image dataset, containing degraded images and their corresponding reference images, to a common resolution, and build a training set for the underwater image enhancement model; 2) process each cropped degraded image in the training set with several preprocessing methods, each preprocessing method yielding one preprocessed image; 3) using the reference image as the label of the degraded image, feed the original degraded image together with its preprocessed versions into a multi-branch convolutional neural network for multi-residual joint learning and train it to obtain the image enhancement model; 4) feed the image to be enhanced into the image enhancement model to obtain the enhanced image.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to an underwater image enhancement method based on multi-residual joint learning.
Background
Underwater image enhancement techniques attract great interest because of their significance to ocean engineering and underwater robotics. Owing to the complex underwater environment, images captured underwater suffer from low contrast, color deviation, and blurred details.
Many factors in the imaging process, such as the poor collimation of auxiliary illumination sources and the uneven distribution of illumination intensity across the scene, cause large brightness differences in the background of a captured underwater image. In addition, water absorbs and scatters light, so light is strongly attenuated as it propagates through water; the visibility of the underwater shooting environment is therefore low, and the resulting underwater images show unclear detail features and low contrast. Although mature underwater imaging technology improves image quality to some extent, underwater images still exhibit non-uniform illumination, poor contrast, and low signal-to-noise ratio, and cannot meet the requirements of practical applications.
Many underwater image enhancement algorithms have been proposed in recent years. They can be classified into three categories: non-physical-model-based methods, physical-model-based methods, and data-driven methods. Non-physical-model-based methods modify image pixel values to improve visual quality while ignoring underwater optical properties.
Physical-model-based approaches treat underwater image enhancement as an inverse problem in which the latent parameters of an image formation model are estimated from the given image. These methods generally follow the same flow: establish a degradation model; estimate its unknown parameters; solve the inverse problem. They rely on assumptions and prior knowledge that do not fully adapt to varied underwater environments.
Disclosure of Invention
The invention aims to provide an underwater image enhancement method based on multi-residual joint learning that adapts well to a variety of underwater environments.
In order to achieve the above object, the underwater image enhancement method based on multi-residual joint learning provided by the invention comprises the following steps:
1) randomly cutting pictures with different resolutions in an underwater image data set containing the degraded images and the corresponding reference images into images with the same resolution, and establishing a training set of an underwater image enhancement model;
2) processing the degraded images cut in the training set by adopting a plurality of preprocessing methods respectively, wherein each preprocessing method correspondingly obtains a preprocessed image;
3) taking a reference image as a label of a degraded image, inputting an original image of the degraded image and the preprocessed degraded image into a multi-branch convolution neural network of multi-residual joint learning for training to obtain an image enhancement model;
4) and inputting the image to be enhanced into the image enhancement model to obtain a processed enhanced image.
In step 2), the preprocessing methods comprise:
carrying out sigmoid correction on the image;
gamma correction is carried out on the image;
carrying out contrast-limiting adaptive histogram equalization processing on the image;
and carrying out white balance processing on the image.
In step 1), the pictures are randomly cropped to images of the same 256 × 256 resolution.
In step 3), the multi-branch convolutional neural network comprises a first branch (a channel attention branch network) and a second branch (a convolution enhancement branch network). The first branch is provided with a first convolution unit and a second convolution unit for down-sampling, followed by a plurality of residual groups, each containing a plurality of residual channel attention blocks;
the second branch is provided with a third convolution unit, a fourth convolution unit and a fifth convolution unit for realizing convolution operation;
and a sixth convolution unit is arranged after the first branch and the second branch are cascaded and combined.
In the first branch, the down-sampled image features serve as the input of the first residual channel attention block in the first residual group; the output of the first residual channel attention block serves as the input of the second, and so on. In each residual group, the output of the last residual channel attention block passes through a convolution unit and is then cascaded with the input of the first residual channel attention block, forming the input of the first residual channel attention block of the next residual group, and so on. After the residual groups, the down-sampled image features undergo a convolution operation and are cascaded with the original down-sampled features; the cascaded features are up-sampled to the input image size, and a final convolution with 3 output channels is applied.
In the second branch, the input image passes through three convolution units; after each convolution, the resulting features are densely cascaded with the input features and used as the input of the next convolution unit. The operation is repeated: after another 3 identical convolution units, each convolved feature is cascaded with the features of the first cascade and used as the input of the next convolution unit, and so on. After the third cascade operation, a convolution reduces the cascaded features to 3 channels and a sigmoid operation, denoted sigmoid 1, is applied.
In the merging network, the outputs of the first branch and the second branch are cascaded; the cascaded image features pass through 3 convolution operations, and a sigmoid operation, denoted sigmoid 2, is applied to the convolved features. Finally, the results of sigmoid 1 and sigmoid 2 are each multiplied element-wise with the output feature matrix of the first branch, and the two products are added element-wise to give the output of the multi-branch convolutional neural network.
Compared with the prior art, the invention has the advantages that:
the invention utilizes the convolutional neural network in the deep learning algorithm to realize the enhancement of the underwater image, uses an end-to-end self-coding network aiming at the difficulty of obtaining the underwater image, does not depend on professional physical prior knowledge and an underwater imaging model, realizes the training of the convolutional neural network through the existing underwater image reference data set, designs a joint loss function, maintains the original structure and texture, simultaneously reduces the problems of color cast, reduced contrast, blurred details and the like of the underwater image with poor quality, improves the quality of the underwater image, obtains the enhanced underwater image, and improves the conditions of serious color cast and low quality of the underwater image.
Drawings
FIG. 1 is a schematic structural diagram of a multi-residual joint learning neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of residual groups of a multi-residual joint learning neural network according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a residual channel attention module in a multi-residual joint learning neural network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an authentication network in embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the following embodiments and accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments without any inventive step, are within the scope of protection of the invention.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The use of the word "comprise" or "comprises", and the like, in the context of this application, is intended to mean that the elements or items listed before that word, in addition to those listed after that word, do not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Example 1
The underwater image enhancement method based on multi-residual joint learning in the embodiment comprises the following steps:
s100, randomly cutting pictures with different resolutions in an underwater image data set containing the degraded images and the corresponding reference images into images with the same resolution, and establishing a training set of an underwater image enhancement model;
s200, processing the degraded images cut in the training set by adopting a plurality of preprocessing methods respectively, wherein each preprocessing method correspondingly obtains a preprocessed image;
s300, taking the reference image as a label of the degraded image, inputting the original image of the degraded image and the preprocessed degraded image into a multi-branch convolutional neural network of multi-residual joint learning for training to obtain an image enhancement model;
and S400, inputting the image to be enhanced into the image enhancement model to obtain the processed enhanced image.
In this embodiment, an Underwater Image Enhancement Benchmark (UIEB) dataset is selected. The UIEB data set contains 950 real-world underwater images covering a wide variety of underwater scenes, different quality degradation characteristics and a wide range of image content. 890 of which have corresponding reference images. In the embodiment, an image enhancement model is trained mainly by using a low-quality underwater image provided by the UIEB data set and a corresponding reference image aiming at the problems of color distortion, contrast reduction, detail blurring and the like.
The dataset contains 890 real underwater images of different resolutions; 800 images are selected as the training set and the remaining 90 as the test set. Owing to computer hardware limitations, each input image pair (reference image and distorted image) is split. Each image is divided into 256 × 256 image blocks, which are convenient to feed into the network for training. Sigmoid correction is applied to the training and test set images; its effect is to adjust the contrast of the image. After scaling each pixel to the [0, 1] range, each pixel of the input image I is corrected according to the equation

O = 1 / (1 + exp(gain × (cutoff − I)))

where gain is a constant multiplier in the exponent of the sigmoid function, with a default value of 10, and cutoff is a cut-off value that shifts the sigmoid characteristic curve horizontally, with a default value of 0.5.
Gamma correction is applied to the training and test set images; its effect is to enhance image contrast and saturation. After scaling each pixel to the [0, 1] range, each pixel of the input image I is corrected according to the equation

O = I^gamma

with the power gamma set to 2.2.
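As a concrete illustration, the two point-wise corrections above can be sketched in NumPy. The function names are my own; the formulas follow the gain/cutoff and gamma definitions given in the text (they match the behavior of scikit-image's `adjust_sigmoid` and `adjust_gamma`):

```python
import numpy as np

def adjust_sigmoid(img, gain=10.0, cutoff=0.5):
    """Sigmoid contrast correction on an image scaled to [0, 1]."""
    return 1.0 / (1.0 + np.exp(gain * (cutoff - img)))

def adjust_gamma(img, gamma=2.2):
    """Power-law (gamma) correction on an image scaled to [0, 1]."""
    return img ** gamma
```

With the default cutoff of 0.5, mid-grey is left at 0.5 while darker and brighter pixels are pushed apart, which is exactly the contrast-stretching effect described.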
The training and test set images are processed with contrast-limited adaptive histogram equalization (CLAHE), which enhances contrast and sharpness. CLAHE clips the histogram at a predefined threshold in order to limit contrast amplification. The detailed steps are as follows: a) pad the image boundary so that the image divides exactly into several sub-blocks, compute a histogram for each sub-block, then clip the histogram and equalize; b) traverse the image blocks and apply inter-block linear interpolation; c) blend the result with the original image.
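A minimal sketch of the histogram-clipping idea at the core of CLAHE, applied globally rather than per sub-block for brevity (the function and its `clip_limit` convention are my own simplification; in practice a library routine such as OpenCV's `createCLAHE` would be used):

```python
import numpy as np

def clipped_hist_equalize(img_u8, clip_limit=0.01, nbins=256):
    """Histogram equalization with a clipped histogram (the clipping step
    that limits contrast amplification in CLAHE), applied globally.
    img_u8: uint8 greyscale image."""
    hist, _ = np.histogram(img_u8, bins=nbins, range=(0, 256))
    limit = max(1, int(clip_limit * img_u8.size))
    excess = np.sum(np.maximum(hist - limit, 0))      # mass above the clip limit
    hist = np.minimum(hist, limit) + excess // nbins  # redistribute uniformly
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min() + 1e-12)
    lut = np.round(cdf * 255).astype(np.uint8)        # equalization lookup table
    return lut[img_u8]
```

Full CLAHE additionally runs this per sub-block and bilinearly interpolates the lookup tables between blocks, as described in steps a) and b) above.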
White balance processing is applied to the training and test set images; its function is to correct the color temperature and restore the colors of the photographed subject. The occurrences of each pixel value are counted separately for the R, G, and B channels. The top 1% and bottom 1% of values are set to 255 and 0 respectively, and the remaining values are mapped linearly into (0, 255), so that the values of each channel are distributed uniformly across RGB and color balance is achieved.
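The per-channel 1% percentile mapping can be sketched as follows (the function name and the use of `np.percentile` are illustrative assumptions about the counting step described above):

```python
import numpy as np

def white_balance(img_u8):
    """Per-channel 1% percentile stretch: the bottom/top 1% of values in
    each RGB channel are mapped to 0/255, the rest linearly in between."""
    out = np.empty_like(img_u8)
    for c in range(3):
        ch = img_u8[..., c].astype(np.float64)
        lo, hi = np.percentile(ch, (1, 99))
        scaled = (ch - lo) / max(hi - lo, 1e-12) * 255.0
        out[..., c] = np.clip(scaled, 0, 255).astype(np.uint8)
    return out
```

Stretching each channel to the full range independently removes a global color cast, which is why this step helps with the blue-green tint of underwater images.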
In step S300, a multi-residual joint learning network model is designed to enhance the underwater image and eliminate its blue-green cast. The multi-residual joint learning network comprises a first branch (a channel attention branch network) and a second branch (a convolution enhancement branch network).
A first branch: the input image is an original image of a cropped size. Starting from the convolution stage, a first convolution unit and a second convolution unit for realizing down sampling are arranged and used for learning the low-frequency information of the image; a plurality of residual error groups containing a plurality of residual error channel attention blocks are connected behind the first convolution unit and the second convolution unit and used for extracting image high-frequency information; the down-sampled image features are used as the input of the first residual channel attention block in the first residual group, the output of the first residual attention block is used as the input of the second residual attention block in the first residual group, and so on. In each residual group, the output of the last residual channel attention block is cascaded with the input of the first residual channel attention block after passing through the convolution unit, and is used as the input of the first residual channel attention block in the next residual group, and so on. And finally, after convolution operation is carried out on the image features after down-sampling through a plurality of residual error groups, cascading the image features with the original down-sampling image features, carrying out up-sampling on the image features after cascading to the size of the input image, and finally carrying out convolution operation, wherein the output channel is 3.
A second branch: the input image is the cascade connection of the original image with the size of 256 multiplied by 256 after cutting and the image which is respectively processed by sigmoid correction, gamma correction, limitation contrast self-adaptive histogram equalization and white balance. And after the input image passes through the three convolution units, cascading each convolved image feature with the input image feature as the input of the next convolution unit, repeating the above operations again, cascading each convolved image feature with the image feature cascaded for the first time through 3 identical convolution units, using the image feature cascaded for the first time as the input of the next convolution unit, and so on. After the third cascading operation, performing convolution operation on the cascaded image features, inputting the image features of 3 channels, and performing sigmoid operation (denoted as sigmoid 1).
Subsequently, the output of the first branch and the output of the second branch are subjected to a cascade operation, and after 3 convolution units, a sigmoid operation (denoted as sigmoid 2) is performed.
Finally, the results of sigmoid 1 and sigmoid 2 are each multiplied element-wise with the features output by the first branch, and the two products are added element-wise to give the output of the multi-branch convolutional neural network.
In detail, the multi-residual joint learning network for enhancing underwater images adopts a two-branch input structure, as shown in fig. 1.
In the first branch, shallow feature extraction, residual group deep feature extraction, an amplification module and a reconstruction part are included. The input image is an RGB underwater image of 256 × 256 size that is originally not subjected to any correction processing. Firstly, a shallow layer feature extraction part adopts a flat convolution layer and two down-sampling convolution layers to extract feature attributes from an input image. The flat convolution layer adopts convolution kernel with size of 3 × 3, convolution step size of 1 and convolution kernel number of 64. The downsampling convolutional layer adopts the convolution kernel size of 3 multiplied by 3, the convolution step size is 2, and the number of convolution kernels is 64. After two downsampled convolutional layers, the image feature matrix becomes 64 × 64 × 64 in size.
In the residual-group deep feature extraction part there are 3 residual groups, as shown in fig. 2, which serve as the basic modules for deepening the network. Each residual group contains 3 residual channel attention blocks, as shown in fig. 3. In the first residual channel attention block, the extracted shallow image features are input and, after convolution, ReLU activation, and convolution, yield an image feature matrix C1 of size 64 × 64 × 64. C1 then undergoes global pooling, convolution, ReLU activation, and convolution in turn, giving a feature matrix C2. A sigmoid operation maps C2 into (0, 1), denoted S1. S1 and C1 are multiplied element-wise, and the product is added to the input features to give the output of the residual channel attention block. The output of the third residual channel attention block undergoes one convolution and is then added element-wise, via a skip connection, to the input of the first residual channel attention block, forming the input of the second residual group. The output of the second residual group is the input of the third; after the output of the third residual group passes through a flat convolution layer, it is added element-wise, via a skip connection, to the input of the first residual group. The flat convolution layer uses 3 × 3 kernels, a convolution stride of 1, and 64 kernels; the image feature matrix at this point has size 64 × 64 × 64.
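Setting the convolutions aside, the gating arithmetic of one residual channel attention block can be sketched as follows (all names are illustrative; `w_gate` stands in for the two convolutions of the attention path, and a ReLU stands in for the conv-ReLU-conv stack producing C1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_channel_attention(x, w_gate):
    """Sketch of one residual channel attention block.
    x: feature map of shape (C, H, W)."""
    c1 = np.maximum(x, 0.0)            # stand-in for the conv-ReLU-conv producing C1
    pooled = c1.mean(axis=(1, 2))      # global average pooling -> (C,)
    s1 = sigmoid(w_gate @ pooled)      # S1: per-channel weights in (0, 1)
    gated = c1 * s1[:, None, None]     # element-wise multiplication of S1 and C1
    return x + gated                   # residual addition with the block input
```

The per-channel weights let the network emphasize informative channels, which is the mechanism the first branch uses to recover high-frequency detail.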
Because the input image is downsampled twice in the shallow feature extraction module, the convolution step length is 2 each time, and the image is reduced by 4 times compared with the input image. In the amplification module, the image subjected to the deep feature extraction of the residual group is amplified to the size of the input image by adopting a deconvolution mode, namely the image is amplified by 4 times. The enlarged image feature matrix size is 256 × 256 × 64.
Finally, in the reconstruction module, the up-sampled image features pass through a flat convolution layer with kernel size 3 × 3, a convolution stride of 1, and 3 convolution kernels. This yields the enhanced image features of the first branch: an output feature matrix of size 256 × 256 × 3.
In the second branch, the network consists of convolutional layers with dense cascades; it extracts deep image features, jointly optimizes the multiple loss terms, and preserves the original structure and texture of the input image. The original uncorrected underwater image is cascaded (concatenated) along the third (channel) dimension with the image feature matrices obtained by gamma correction, sigmoid correction, white balance, and contrast-limited histogram equalization, and fed into the network; the cascaded input feature matrix has size 256 × 256 × 15. Depth feature extraction comes first and consists of 3 dense cascade blocks, each containing three convolutional layers. In the first dense cascade block, the first convolutional layer consists of 16 convolution kernels of size 3 × 3, producing 16 output feature maps with a convolution stride of 1; the subsequent convolutional layers use 3 × 3 × 16 kernels to generate 16 output feature maps, also with stride 1. After the input has been convolved three times, the features from each convolution are cascaded with the input feature matrix and used as the input of the next dense cascade block.
Similarly, in the second dense cascade block, the features from the third convolution are cascaded with the features from the first and second convolutions and with the features output by the first dense cascade block, forming the input of the third dense cascade block. In the third dense cascade block, the features from the third convolution are cascaded along the channel dimension with the features from the first and second convolutions and with the feature matrix output by the second dense cascade block. After the three dense cascade blocks, one further convolution with kernel size 3 × 3 and stride 1 outputs a 3-channel feature map. This yields the image enhancement features of the second branch; finally, a sigmoid operation is applied to the output feature matrix.
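The channel bookkeeping of such a dense cascade block can be sketched as follows (the `conv` callables are stubs standing in for real convolution layers; shapes use channel-first (C, H, W) order):

```python
import numpy as np

def dense_block(x, convs):
    """Dense cascade: each conv sees the concatenation of the block input
    and all previous conv outputs along the channel axis (axis 0 here)."""
    feats = [x]
    for conv in convs:
        out = conv(np.concatenate(feats, axis=0))
        feats.append(out)
    return np.concatenate(feats, axis=0)
```

With a 15-channel input and three convolutions of 16 output channels each, the block's output grows to 15 + 3 × 16 = 63 channels, which is why a final 3-channel convolution is needed before the sigmoid.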
In the trunk branch, the image enhancement features output by the first and second branches are cascaded and passed through a dense cascade block whose cascade pattern, convolution kernel size, and stride match the first dense cascade block of the second branch. One further convolution with kernel size 3 × 3, stride 1, and 3 output channels is applied to the densely cascaded features, followed by a sigmoid operation on the output image features.
Finally, the sigmoid outputs of the second branch and the trunk branch are each multiplied element-wise with the enhanced image feature matrix of the first branch, and the two products are added element-wise to give the final enhanced image feature matrix that is output.
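The final fusion step reduces to two element-wise products and a sum; a minimal sketch (all names are my own):

```python
import numpy as np

def fuse(f1, s_branch2, s_trunk):
    """Final fusion: both sigmoid gates modulate the first branch's
    feature map, and the two gated maps are summed element-wise."""
    return s_branch2 * f1 + s_trunk * f1
```

Since both gates multiply the same first-branch features, the result equals (s_branch2 + s_trunk) × f1: the two branches jointly decide, per pixel and channel, how strongly the first branch's enhancement is expressed.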
The training process of the convolutional neural network is as follows:
to ensure that the neural network can generate images with good visual effects, the present network uses the following loss function. Specifically, in the design, the content perception loss, the MSE loss, the SSIM loss and the gradient descent loss are combined together with a certain weight to form a new loss function. The new loss function is defined as follows:
L = a × Lvgg + b × Lmse + c × Lssim + d × Lgdl

where a = 0.05, b = 1, c = 0.1, and d = 0.01. Lvgg is the content (perceptual) loss term, which drives the generator to produce an enhanced image whose content is similar to the target real image. The perceptual loss is defined on a ReLU activation layer of a pre-trained 19-layer VGG network. Let φj(x) denote the j-th activated convolutional layer of the pre-trained VGG19 network φ, with j set to 5. The content loss is the difference between the feature representations of the enhanced image Ien and the reference image Igt:

Lvgg = (1 / (N · Cj · Hj · Wj)) Σi ‖φj(Ien(i)) − φj(Igt(i))‖

where N is the batch size during training, and Cj, Hj, Wj are the dimensions of the feature map of the j-th convolutional layer in the VGG19 network: the number, height, and width of the feature maps, respectively.

Lmse is the MSE loss between the enhanced image and the reference image.

Lssim is the SSIM loss between the enhanced image and the reference image.

Lgdl is the gradient difference loss; given an enhanced image IE and a reference image IG, it introduces correlation between adjacent pixels and is expressed as:

Lgdl = Σi,j ( | |IE(i, j) − IE(i−1, j)| − |IG(i, j) − IG(i−1, j)| | + | |IE(i, j) − IE(i, j−1)| − |IG(i, j) − IG(i, j−1)| | )
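Under the stated weights, the combined loss can be sketched as follows. The VGG perceptual and SSIM terms require a pre-trained network and an SSIM routine, so they are taken as precomputed inputs here; all function names are illustrative:

```python
import numpy as np

def mse_loss(en, gt):
    return np.mean((en - gt) ** 2)

def gdl_loss(en, gt):
    """Gradient difference loss: compare absolute horizontal/vertical
    gradients of the enhanced and reference images."""
    dx = lambda im: np.abs(np.diff(im, axis=1))
    dy = lambda im: np.abs(np.diff(im, axis=0))
    return np.mean(np.abs(dx(en) - dx(gt))) + np.mean(np.abs(dy(en) - dy(gt)))

def joint_loss(en, gt, l_vgg=0.0, l_ssim=0.0):
    """L = 0.05*Lvgg + 1*Lmse + 0.1*Lssim + 0.01*Lgdl."""
    return (0.05 * l_vgg + 1.0 * mse_loss(en, gt)
            + 0.1 * l_ssim + 0.01 * gdl_loss(en, gt))
```

The MSE term dominates (weight 1), while the gradient term acts as a light regularizer (weight 0.01) that preserves edges without overpowering pixel fidelity.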
in this embodiment, an image pair reference image and a distorted image are input into a convolutional neural network, for image pairs in a data set, a batch of 16 image pairs with a size of 256 × 256 are randomly selected for training each time as the input of the network, the weight of the neural network is trained, and the initial learning rate is 0.001; inputting all training data into the network according to batches for one-time training to define one epoch, setting momentum and weight attenuation values to be 0.9 by utilizing an Adam optimization algorithm in the training process, and enabling the network to achieve preliminary convergence through 200 epoch iterations. The learning rate is reduced to half of the initial value, i.e. the learning rate is set to 0.0005, and 200 epoch iterations are further trained, and the network converges.
Example 2
This embodiment is the same as Embodiment 1 except for the structure of the convolutional neural network; the common parts are not repeated here.
In step S300 of this embodiment, a generative adversarial network model based on multi-residual joint learning is first designed to enhance the underwater image and eliminate its blue-green color cast. The generator of the multi-residual joint learning GAN comprises convolution network units, residual network units and channel attention modules.
First branch: the input image is the cropped 256 × 256 original image. In the convolution stage, a first convolution unit and a second convolution unit perform downsampling and learn the low-frequency information of the image. They are followed by several residual groups, each containing several residual channel attention blocks, which extract the high-frequency information of the image. The downsampled image features serve as the input of the first residual channel attention block in the first residual group; the output of that block serves as the input of the second residual channel attention block in the group, and so on. Within each residual group, the output of the last residual channel attention block passes through a convolution unit and is cascaded with the input of the first residual channel attention block; the result serves as the input of the first residual channel attention block of the next residual group, and so on. Finally, after the downsampled image features have passed through the residual groups and a convolution operation, they are cascaded with the original downsampled image features; the cascaded features are upsampled to the input image size, and a final convolution with 3 output channels is applied.
Second branch: the input image is the cascade (concatenation) of the cropped image with the versions of it processed by sigmoid correction, gamma correction, contrast-limited adaptive histogram equalization and white balance, respectively. The input passes through three convolution units; after each convolution, the resulting image features are cascaded with the input image features and serve as the input of the next convolution unit. This operation is then repeated: through 3 identical convolution units, the features after each convolution are cascaded with the features from the first cascade and serve as the input of the next convolution unit, and so on. After the third cascade operation, a convolution is applied to the cascaded features to output 3-channel image features, followed by a sigmoid operation (denoted sigmoid 1).
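The 15-channel input of the second branch (3 channels of the original image plus 4 preprocessed versions of 3 channels each) can be sketched with numpy. The correction formulas and their parameters (gamma exponent, sigmoid gain/midpoint) are illustrative assumptions, and CLAHE is left as a placeholder since its kernel parameters are not given in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))  # cropped input image, values in [0, 1]

# Hypothetical forms of the four corrections (exact parameters are
# not specified in the text):
gamma = np.clip(img, 0.0, 1.0) ** 0.7                   # gamma correction
sig = 1.0 / (1.0 + np.exp(-10.0 * (img - 0.5)))         # sigmoid correction
clahe = img                                             # placeholder for CLAHE
wb = img / img.mean(axis=(0, 1), keepdims=True) * img.mean()  # naive white balance

# Cascade (concatenate) along the channel axis: 3 + 4*3 = 15 channels.
branch2_input = np.concatenate([img, sig, gamma, clahe, wb], axis=2)
```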
Subsequently, the outputs of the first branch and the second branch are cascaded and, after 3 convolution units, a sigmoid operation (denoted sigmoid 2) is applied.
Finally, the results of sigmoid 1 and sigmoid 2 are each multiplied element by element with the feature vector output by the first branch, the two products are added element by element, and the result is output as the enhanced image features of the multi-branch convolutional neural network.
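The gated fusion just described reduces to two element-wise multiplications and one addition; a minimal numpy sketch (the random arrays stand in for the actual branch outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
branch1 = rng.random((256, 256, 3))   # first-branch enhanced features
sigmoid1 = rng.random((256, 256, 3))  # stands in for sigmoid 1 (second branch)
sigmoid2 = rng.random((256, 256, 3))  # stands in for sigmoid 2 (trunk branch)

# Each sigmoid map gates the first-branch features element by element,
# and the two gated results are summed to give the enhanced output.
enhanced = sigmoid1 * branch1 + sigmoid2 * branch1
```

Note the algebraic consequence of this design: the fusion is equivalent to scaling the first-branch features by the sum of the two gates.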
Referring to fig. 4, the discriminator network uses 5 convolutional layers to downsample the image to be judged and finally outputs a value indicating whether that image is real or generated.
Construction of the generative adversarial network: a generative adversarial network (GAN) comprises 2 networks, a generator network and a discriminator network. The ultimate purpose of a GAN is to learn a high-quality generator G, which is achieved by introducing a discriminator D. During training, G aims to generate pictures so realistic that the discriminator cannot tell whether a picture is real or a generated fake; D aims to distinguish real pictures from fake ones as reliably as possible. G therefore tries to maximize the error rate of D, while D tries to minimize it; these opposing goals drive both networks to improve through competition. In theory this process reaches an equilibrium point, the so-called Nash equilibrium, at which the probability that D judges a picture generated by G to be real data is 0.5; that is, the discriminator can no longer distinguish generated pictures from real ones, and the generator has achieved its purpose of passing fakes off as genuine. In this way, the present embodiment obtains a generator network for enhancing the quality of underwater images.
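The adversarial objective above can be sketched numerically. The binary cross-entropy form and the sample discriminator outputs are illustrative assumptions, not values from the patent; the sketch also verifies the Nash-equilibrium claim that D outputting 0.5 gives a per-sample loss of ln 2:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on probabilities in (0, 1)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

# D wants real -> 1 and fake -> 0; G wants D(fake) -> 1.
d_real = np.array([0.9, 0.8])  # hypothetical D outputs on real images
d_fake = np.array([0.2, 0.1])  # hypothetical D outputs on generated images

d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)  # discriminator objective
g_loss = bce(d_fake, 1.0)                     # generator (non-saturating) objective

# At the Nash equilibrium described above, D outputs 0.5 everywhere.
equilibrium_loss = bce(np.array([0.5]), 1.0)  # = ln 2 per sample
```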
The generator network adopts a two-branch input structure, as shown in fig. 1. The first branch comprises shallow feature extraction, residual-group deep feature extraction, an amplification (upsampling) module and a reconstruction part. The input image is a 256 × 256 RGB underwater image that has not undergone any correction processing. First, the shallow feature extraction part uses one flat (stride-1) convolution layer and two downsampling convolution layers to extract feature attributes from the input image. The flat convolution layer uses a 3 × 3 kernel, a stride of 1 and 64 kernels. Each downsampling convolution layer uses a 3 × 3 kernel, a stride of 2 and 64 kernels. After the two downsampling layers, the feature size becomes 64 × 64 × 64.
The residual-group deep feature extraction part contains 3 residual groups, as shown in fig. 2, which serve as the basic modules for deepening the network. Each residual group contains 3 residual channel attention blocks, as shown in fig. 3. The first residual channel attention block takes the extracted shallow image features as input; after convolution, ReLU activation and a further convolution, an image feature matrix C1 is obtained. C1 then undergoes global pooling, convolution, ReLU activation and convolution in sequence, yielding a feature matrix C2. A sigmoid operation maps C2 into (0, 1); the result is denoted S1. S1 and C1 are then multiplied element by element, and the product is added to the input image features to give the output of the residual channel attention block. The output of the third residual channel attention block undergoes one convolution and is then added element by element, via a skip connection, to the input of the first residual channel attention block; the sum serves as the input of the second residual group. The output of the second residual group serves as the input of the third. The output of the third residual group passes through a flat convolution layer and is added element by element, via a skip connection, to the input of the first residual group. The flat convolution layer uses a 3 × 3 kernel, a stride of 1 and 64 kernels. The feature matrix size at this point is 64 × 64 × 64.
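The data flow of one residual channel attention block can be sketched with numpy. The convolutions are replaced by identity stubs (the text does not give their weights), so only the pooling, gating and skip-connection structure is exercised:

```python
import numpy as np

def conv_stub(x):
    """Stand-in for a 3x3, stride-1, channel-preserving convolution."""
    return x  # identity placeholder; shapes match a real 'same' convolution

def relu(x):
    return np.maximum(x, 0.0)

def residual_channel_attention_block(x):
    """x: (H, W, C) feature map. Follows the steps described above."""
    c1 = conv_stub(relu(conv_stub(x)))            # conv -> ReLU -> conv -> C1
    pooled = c1.mean(axis=(0, 1), keepdims=True)  # global average pooling
    c2 = conv_stub(relu(conv_stub(pooled)))       # channel bottleneck (stubbed) -> C2
    s1 = 1.0 / (1.0 + np.exp(-c2))                # sigmoid maps C2 into (0, 1) -> S1
    return s1 * c1 + x                            # channel weighting + skip connection

rng = np.random.default_rng(2)
feat = rng.random((64, 64, 64))  # matches the 64 x 64 x 64 features in the text
out = residual_channel_attention_block(feat)
```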
Because the input image is downsampled twice in the shallow feature extraction module, each time with a stride of 2, the feature map is 4 times smaller than the input image in each spatial dimension. In the amplification module, the features produced by the residual-group deep feature extraction are therefore enlarged back to the input image size by deconvolution, i.e. by a factor of 4. The enlarged feature matrix size is 256 × 256 × 64.
Finally, in the reconstruction module, the enlarged image features pass through a flat convolution layer with a 3 × 3 kernel, a stride of 1 and 3 kernels. This yields the enhanced image features of the first branch, with an output feature matrix of size 256 × 256 × 3.
The second branch is composed of convolution layers and dense cascades; it extracts deep image features, jointly optimizes the multi-term loss, and preserves the original structure and texture of the input image. The original underwater image, without any correction processing, is cascaded (concatenated) along the third dimension (channels) with the feature matrices of its gamma-corrected, sigmoid-corrected, white-balanced and contrast-limited-adaptive-histogram-equalized versions, and fed into the network; the cascaded input feature matrix has size 256 × 256 × 15. The depth feature extraction consists of 3 dense cascade blocks, each with three convolution layers. In the first dense cascade block, the first convolution layer consists of 16 kernels of size 3 × 3 × 3, producing 16 output feature maps with a stride of 1; the subsequent convolution layers use 3 × 3 × 16 kernels, again producing 16 output feature maps with a stride of 1. After the input has been convolved three times, the features after each convolution are cascaded with the input feature matrix and serve as the input of the next dense cascade block.
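The channel bookkeeping of one dense cascade block follows from the description: the block input plus three convolutions of 16 feature maps each are concatenated, so a 15-channel input grows to 15 + 3 × 16 = 63 channels. A numpy sketch with stubbed convolutions (shape arithmetic only, no learned weights; the cascade pattern follows my reading of the text):

```python
import numpy as np

def dense_cascade_block(x, growth=16):
    """x: (H, W, C). Three stubbed convolutions, each producing `growth`
    feature maps, concatenated together with the block input."""
    h, w, _ = x.shape
    feats = [x]
    for _ in range(3):
        # Stand-in for a 3x3, stride-1 convolution with `growth` kernels.
        conv_out = np.full((h, w, growth), feats[-1].mean())
        feats.append(conv_out)
    return np.concatenate(feats, axis=2)

x = np.zeros((256, 256, 15))   # the 15-channel cascaded input
out = dense_cascade_block(x)   # 15 + 3*16 = 63 channels
```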
Similarly, in the second dense cascade block, the features after the third convolution are cascaded with the features after the first and second convolutions and with the features output by the first dense cascade block, and serve as the input of the third dense cascade block. In the third dense cascade block, the features after the third convolution are cascaded along the channel dimension with the features after the first and second convolutions and with the feature matrix output by the second dense cascade block. After the three dense cascade blocks, one further convolution is applied, with a 3 × 3 kernel, a stride of 1 and a 3-channel output feature map. This yields the image enhancement features of the second branch. Finally, a sigmoid operation is applied to the output feature matrix.
In the trunk branch, the image enhancement features output by branch one and branch two are cascaded and passed through a dense cascade block whose cascade pattern and convolution kernel size and stride are the same as those of the first dense cascade block in branch two. One convolution is then applied to the densely cascaded features, with a 3 × 3 kernel, a stride of 1 and 3 output channels, followed by a sigmoid operation on the output features.
Finally, the sigmoid results of the second branch and the trunk branch are each multiplied element by element with the enhanced image features of the first branch, and the two products are added element by element to give the final enhanced image for output.
The discriminator network uses 5 convolutional layers to learn the image features used to judge authenticity, as shown in fig. 4. The first four convolutional layers have a kernel size of 1 × 1, a stride of 1, and 64, 128, 256 and 512 kernels respectively, each followed by batch normalization and a LeakyReLU activation. The last layer has a 4 × 4 kernel, a stride of 1 and 3 kernels, followed by batch normalization and a sigmoid activation.
Training the generative adversarial network:
Loss function of the generator network: to ensure that the generator produces images with good visual quality, the network uses the following loss function. Specifically, the adversarial loss, the content (perceptual) loss, the MSE loss, the SSIM loss and the gradient difference loss are combined with fixed weights into a new loss function, defined as follows:
Lg=a×Lgan+b×Lvgg+c×Lmse+d×Lssim+e×Lgdl
wherein a is 1, b is 0.05, c is 1, d is 0.1, and e is 0.01. Lgan represents the adversarial loss; the loss function of a conventional GAN is as follows, where G and D are the generator and discriminator, respectively:
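The weighted combination Lg can be sketched directly; the individual loss values fed in below are placeholders, only the weights come from the text:

```python
# Weighted combination of the five loss terms, using the weights a..e
# given above (the individual loss values here are placeholders).
weights = {"gan": 1.0, "vgg": 0.05, "mse": 1.0, "ssim": 0.1, "gdl": 0.01}

def total_loss(losses):
    """losses: dict with keys gan, vgg, mse, ssim, gdl."""
    return sum(weights[k] * losses[k] for k in weights)

example = {"gan": 0.7, "vgg": 2.0, "mse": 0.05, "ssim": 0.3, "gdl": 1.2}
lg = total_loss(example)
```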
Lvgg represents a content loss term that drives the generator to produce an enhanced image whose content is similar to the target real image. The perceptual loss is defined on the ReLU activation layers of a pre-trained 19-layer VGG network. Let φj(x) denote the j-th activated convolutional layer of the pre-trained VGG19 network φ; the value of j is set to 5. The content loss is expressed as the difference between the feature representations of the enhanced image Ien and the reference image Igt:
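The formula itself appears to have been lost in extraction. A standard perceptual-loss form consistent with the surrounding symbols (N, Cj, Hj, Wj, φj) would be:

```latex
L_{vgg} = \frac{1}{N\, C_j H_j W_j} \sum_{i=1}^{N}
          \left\lVert \phi_j\bigl(I_{en}^{(i)}\bigr) - \phi_j\bigl(I_{gt}^{(i)}\bigr) \right\rVert_2^2
```

The choice of the squared L2 norm is an assumption; the exact norm used in the patent is not recoverable from the text.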
where N is the number of images in a training batch; Cj, Hj and Wj denote the dimensions of the feature map of the j-th convolutional layer in the VGG19 network, i.e. the number, height and width of the feature maps, respectively.
Lmse represents the MSE loss between the enhanced image and the reference image.
Lssim represents the SSIM loss between the enhanced image and the reference image.
Lgdl represents a gradient difference loss. Given an enhanced image IE and a reference image IG, it introduces correlation between adjacent pixels and is expressed as:
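The Lgdl formula is not shown in the extracted text. A common form of the gradient difference loss, sketched here under that assumption, penalizes differences between the horizontal and vertical gradients of the enhanced and reference images:

```python
import numpy as np

def gradient_difference_loss(ie, ig, alpha=1):
    """A common gradient-difference-loss form (assumption: the patent's
    exact formula is not shown). ie, ig: 2-D image arrays."""
    dy_e, dx_e = np.abs(np.diff(ie, axis=0)), np.abs(np.diff(ie, axis=1))
    dy_g, dx_g = np.abs(np.diff(ig, axis=0)), np.abs(np.diff(ig, axis=1))
    return float((np.abs(dy_e - dy_g) ** alpha).mean()
                 + (np.abs(dx_e - dx_g) ** alpha).mean())

rng = np.random.default_rng(3)
ie = rng.random((64, 64))   # enhanced image (illustrative)
ig = rng.random((64, 64))   # reference image (illustrative)

loss = gradient_difference_loss(ie, ig)
zero = gradient_difference_loss(ig, ig)  # identical images -> zero loss
```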
In this embodiment, image pairs (reference image and distorted image) are input into the generative adversarial network. For the image pairs in the data set, a batch of 16 pairs of size 256 × 256 is randomly selected as network input for each training step, and the weights of the generator network and the discriminator network are trained separately with a learning rate of 0.001. One epoch is defined as feeding all training data through the network once, batch by batch. The Adam optimization algorithm is used during training, with momentum and weight decay both set to 0.9, and the network converges after 300 epoch iterations.
During training of the generative adversarial network, a distorted underwater image and its preprocessed versions are input into the generator network to obtain an enhanced image. The enhanced image and the reference image are then each input into the discriminator network to obtain discrimination labels; these labels are used to compute the discriminator loss, while the generator loss, comprising the adversarial loss, perceptual loss, gradient difference loss, MSE loss and SSIM loss, is computed at the same time.
The parameters of the generator network and the discriminator network are updated according to the Adam optimization algorithm, and the two networks are trained alternately until convergence.
Claims (7)
1. An underwater image enhancement method based on multi-residual joint learning is characterized by comprising the following steps:
1) randomly cutting pictures with different resolutions in an underwater image data set containing the degraded images and the corresponding reference images into images with the same resolution, and establishing a training set of an underwater image enhancement model;
2) processing the degraded images cut in the training set by adopting a plurality of preprocessing methods respectively, wherein each preprocessing method correspondingly obtains a preprocessed image;
3) taking the reference image as the label of the degraded image, and inputting the original degraded image together with its preprocessed versions into the multi-branch convolutional neural network of multi-residual joint learning for training, to obtain an image enhancement model;
4) inputting the image to be enhanced into the image enhancement model to obtain the processed enhanced image.
2. The underwater image enhancement method based on multi-residual joint learning according to claim 1, wherein in step 1), the preprocessing method comprises:
carrying out sigmoid correction on the image;
gamma correction is carried out on the image;
carrying out contrast-limiting adaptive histogram equalization processing on the image;
and carrying out white balance processing on the image.
3. The underwater image enhancement method based on multi-residual joint learning according to claim 1, wherein in step 1), the image is randomly cropped to the same resolution image with the size of 256 × 256.
4. The underwater image enhancement method based on multi-residual joint learning according to claim 1, wherein in step 2), the multi-branch convolutional neural network comprises a first branch, a channel attention branch network, and a second branch, a convolution enhancement branch network; the first branch is provided with a first convolution unit and a second convolution unit for realizing downsampling, and a plurality of residual groups containing a plurality of residual channel attention blocks are connected after the first convolution unit and the second convolution unit;
the second branch is provided with a third convolution unit, a fourth convolution unit and a fifth convolution unit for realizing convolution operation;
and a sixth convolution unit is arranged after the first branch and the second branch are cascaded and combined.
5. The underwater image enhancement method based on multi-residual joint learning of claim 4, characterized in that in the first branch, the down-sampled image features are used as input of a first residual channel attention block in the first residual group, the output of the first residual attention block is used as input of a second residual attention block in the first residual group, and so on; in each residual group, the output of the last residual channel attention block is cascaded with the input of the first residual channel attention block after passing through a convolution unit, and is used as the input of the first residual channel attention block in the next residual group, and the like; and after convolution operation is carried out on the image features after down sampling through a plurality of residual error groups, cascading the image features with the original down sampling image features, carrying out up sampling on the image features after cascading to the size of the input image, and finally carrying out convolution operation, wherein the output channel is 3.
6. The underwater image enhancement method based on multi-residual joint learning according to claim 4, wherein in the second branch, after the input image passes through three convolution units, each convolved image feature is densely cascaded with the input image features and serves as the input of the next convolution unit; the operations are repeated, and through 3 identical convolution units each convolved image feature is cascaded with the image features from the first cascade and serves as the input of the next convolution unit, and so on; after the third cascade operation, a convolution operation is performed on the cascaded image features to output 3-channel image features, and a sigmoid operation, denoted sigmoid 1, is performed.
7. The underwater image enhancement method based on multi-residual joint learning according to claim 4, wherein in the merging network, a cascade operation is performed on the outputs of the first branch and the second branch, 3 convolution operations are performed on the cascaded image features, and a sigmoid operation, denoted sigmoid 2, is performed on the convolved image features; finally, the results of sigmoid 1 and sigmoid 2 are each multiplied element by element with the output image feature matrix of the first branch, the channel attention branch network, the two products are added element by element, and the output of the multi-branch convolutional neural network is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011321187.7A CN112288658B (en) | 2020-11-23 | 2020-11-23 | Underwater image enhancement method based on multi-residual joint learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112288658A true CN112288658A (en) | 2021-01-29 |
CN112288658B CN112288658B (en) | 2023-11-28 |
Family
ID=74425826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011321187.7A Active CN112288658B (en) | 2020-11-23 | 2020-11-23 | Underwater image enhancement method based on multi-residual joint learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112288658B (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112767279A (en) * | 2021-02-01 | 2021-05-07 | 福州大学 | Underwater image enhancement method for generating countermeasure network based on discrete wavelet integration |
CN113269681A (en) * | 2021-04-20 | 2021-08-17 | 飞马滨(青岛)智能科技有限公司 | Underwater real-time image enhancement method based on multi-algorithm fusion |
CN113313644A (en) * | 2021-05-26 | 2021-08-27 | 西安理工大学 | Underwater image enhancement method based on residual double attention network |
CN113658072A (en) * | 2021-08-16 | 2021-11-16 | 福州大学 | Underwater image enhancement method based on progressive feedback network |
US11403069B2 (en) | 2017-07-24 | 2022-08-02 | Tesla, Inc. | Accelerated mathematical engine |
US11409692B2 (en) | 2017-07-24 | 2022-08-09 | Tesla, Inc. | Vector computational unit |
CN114998605A (en) * | 2022-05-10 | 2022-09-02 | 北京科技大学 | Target detection method for image enhancement guidance under severe imaging condition |
US11487288B2 (en) | 2017-03-23 | 2022-11-01 | Tesla, Inc. | Data synthesis for autonomous control systems |
CN115376275A (en) * | 2022-10-25 | 2022-11-22 | 山东工程职业技术大学 | Construction safety warning method and system based on image processing |
CN115423724A (en) * | 2022-11-03 | 2022-12-02 | 中国石油大学(华东) | Underwater image enhancement method, device and medium for reinforcement learning parameter optimization |
US11537811B2 (en) | 2018-12-04 | 2022-12-27 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11561791B2 (en) | 2018-02-01 | 2023-01-24 | Tesla, Inc. | Vector computational unit receiving data elements in parallel from a last row of a computational array |
US11562231B2 (en) | 2018-09-03 | 2023-01-24 | Tesla, Inc. | Neural networks for embedded devices |
US11567514B2 (en) | 2019-02-11 | 2023-01-31 | Tesla, Inc. | Autonomous and user controlled vehicle summon to a target |
WO2023005699A1 (en) * | 2021-07-29 | 2023-02-02 | 广州安思创信息技术有限公司 | Video enhancement network training method and device, and video enhancement method and device |
US11610117B2 (en) | 2018-12-27 | 2023-03-21 | Tesla, Inc. | System and method for adapting a neural network model on a hardware platform |
US11636333B2 (en) | 2018-07-26 | 2023-04-25 | Tesla, Inc. | Optimizing neural network structures for embedded systems |
US11665108B2 (en) | 2018-10-25 | 2023-05-30 | Tesla, Inc. | QoS manager for system on a chip communications |
US11681649B2 (en) | 2017-07-24 | 2023-06-20 | Tesla, Inc. | Computational array microprocessor system using non-consecutive data formatting |
US11734562B2 (en) | 2018-06-20 | 2023-08-22 | Tesla, Inc. | Data pipeline and deep learning system for autonomous driving |
US11748620B2 (en) | 2019-02-01 | 2023-09-05 | Tesla, Inc. | Generating ground truth for machine learning from time series elements |
CN116757965A (en) * | 2023-08-16 | 2023-09-15 | 小米汽车科技有限公司 | Image enhancement method, device and storage medium |
US11790664B2 (en) | 2019-02-19 | 2023-10-17 | Tesla, Inc. | Estimating object properties using visual image data |
US11816585B2 (en) | 2018-12-03 | 2023-11-14 | Tesla, Inc. | Machine learning models operating at different frequencies for autonomous vehicles |
US11841434B2 (en) | 2018-07-20 | 2023-12-12 | Tesla, Inc. | Annotation cross-labeling for autonomous control systems |
US11893774B2 (en) | 2018-10-11 | 2024-02-06 | Tesla, Inc. | Systems and methods for training machine models with augmented data |
US11893393B2 (en) | 2017-07-24 | 2024-02-06 | Tesla, Inc. | Computational array microprocessor system with hardware arbiter managing memory requests |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111192200A (en) * | 2020-01-02 | 2020-05-22 | 南京邮电大学 | Image super-resolution reconstruction method based on fusion attention mechanism residual error network |
CN111539469A (en) * | 2020-04-20 | 2020-08-14 | 东南大学 | Weak supervision fine-grained image identification method based on vision self-attention mechanism |
CN111581396A (en) * | 2020-05-06 | 2020-08-25 | 西安交通大学 | Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax |
CN111754403A (en) * | 2020-06-15 | 2020-10-09 | 南京邮电大学 | Image super-resolution reconstruction method based on residual learning |
CN111915484A (en) * | 2020-07-06 | 2020-11-10 | 天津大学 | Reference image guiding super-resolution method based on dense matching and self-adaptive fusion |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11487288B2 (en) | 2017-03-23 | 2022-11-01 | Tesla, Inc. | Data synthesis for autonomous control systems |
US11403069B2 (en) | 2017-07-24 | 2022-08-02 | Tesla, Inc. | Accelerated mathematical engine |
US11681649B2 (en) | 2017-07-24 | 2023-06-20 | Tesla, Inc. | Computational array microprocessor system using non-consecutive data formatting |
US11893393B2 (en) | 2017-07-24 | 2024-02-06 | Tesla, Inc. | Computational array microprocessor system with hardware arbiter managing memory requests |
US11409692B2 (en) | 2017-07-24 | 2022-08-09 | Tesla, Inc. | Vector computational unit |
US11561791B2 (en) | 2018-02-01 | 2023-01-24 | Tesla, Inc. | Vector computational unit receiving data elements in parallel from a last row of a computational array |
US11797304B2 (en) | 2018-02-01 | 2023-10-24 | Tesla, Inc. | Instruction set architecture for a vector computational unit |
US11734562B2 (en) | 2018-06-20 | 2023-08-22 | Tesla, Inc. | Data pipeline and deep learning system for autonomous driving |
US11841434B2 (en) | 2018-07-20 | 2023-12-12 | Tesla, Inc. | Annotation cross-labeling for autonomous control systems |
US11636333B2 (en) | 2018-07-26 | 2023-04-25 | Tesla, Inc. | Optimizing neural network structures for embedded systems |
US11562231B2 (en) | 2018-09-03 | 2023-01-24 | Tesla, Inc. | Neural networks for embedded devices |
US11893774B2 (en) | 2018-10-11 | 2024-02-06 | Tesla, Inc. | Systems and methods for training machine models with augmented data |
US11665108B2 (en) | 2018-10-25 | 2023-05-30 | Tesla, Inc. | QoS manager for system on a chip communications |
US11816585B2 (en) | 2018-12-03 | 2023-11-14 | Tesla, Inc. | Machine learning models operating at different frequencies for autonomous vehicles |
US11537811B2 (en) | 2018-12-04 | 2022-12-27 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11908171B2 (en) | 2018-12-04 | 2024-02-20 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11610117B2 (en) | 2018-12-27 | 2023-03-21 | Tesla, Inc. | System and method for adapting a neural network model on a hardware platform |
US11748620B2 (en) | 2019-02-01 | 2023-09-05 | Tesla, Inc. | Generating ground truth for machine learning from time series elements |
US11567514B2 (en) | 2019-02-11 | 2023-01-31 | Tesla, Inc. | Autonomous and user controlled vehicle summon to a target |
US11790664B2 (en) | 2019-02-19 | 2023-10-17 | Tesla, Inc. | Estimating object properties using visual image data |
CN112767279A (en) * | 2021-02-01 | 2021-05-07 | 福州大学 | Underwater image enhancement method for generating countermeasure network based on discrete wavelet integration |
CN112767279B (en) * | 2021-02-01 | 2022-06-14 | 福州大学 | Underwater image enhancement method for generating countermeasure network based on discrete wavelet integration |
CN113269681A (en) * | 2021-04-20 | 2021-08-17 | 飞马滨(青岛)智能科技有限公司 | Underwater real-time image enhancement method based on multi-algorithm fusion |
CN113313644B (en) * | 2021-05-26 | 2024-03-26 | 西安理工大学 | Underwater image enhancement method based on residual double-attention network |
CN113313644A (en) * | 2021-05-26 | 2021-08-27 | 西安理工大学 | Underwater image enhancement method based on residual double attention network |
WO2023005699A1 (en) * | 2021-07-29 | 2023-02-02 | 广州安思创信息技术有限公司 | Video enhancement network training method and device, and video enhancement method and device |
CN113658072A (en) * | 2021-08-16 | 2021-11-16 | 福州大学 | Underwater image enhancement method based on progressive feedback network |
CN113658072B (en) * | 2021-08-16 | 2023-08-08 | 福州大学 | Underwater image enhancement method based on progressive feedback network |
CN114998605A (en) * | 2022-05-10 | 2022-09-02 | 北京科技大学 | Target detection method for image enhancement guidance under severe imaging condition |
CN115376275A (en) * | 2022-10-25 | 2022-11-22 | 山东工程职业技术大学 | Construction safety warning method and system based on image processing |
CN115423724A (en) * | 2022-11-03 | 2022-12-02 | 中国石油大学(华东) | Underwater image enhancement method, device and medium for reinforcement learning parameter optimization |
CN116757965A (en) * | 2023-08-16 | 2023-09-15 | 小米汽车科技有限公司 | Image enhancement method, device and storage medium |
CN116757965B (en) * | 2023-08-16 | 2023-11-21 | 小米汽车科技有限公司 | Image enhancement method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112288658B (en) | 2023-11-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112288658B (en) | | Underwater image enhancement method based on multi-residual joint learning |
Li et al. | | Underwater image enhancement via medium transmission-guided multi-color space embedding |
Ren et al. | | Low-light image enhancement via a deep hybrid network |
Golts et al. | | Unsupervised single image dehazing using dark channel prior loss |
CN112233038B (en) | | True image denoising method based on multi-scale fusion and edge enhancement |
Tan et al. | | DeepDemosaicking: Adaptive image demosaicking via multiple deep fully convolutional networks |
CN108389224B (en) | | Image processing method and device, electronic equipment, and storage medium |
CN111275643B (en) | | Real-noise blind denoising network system and method based on channel and spatial attention |
CN110288550B (en) | | Single-image defogging method based on a prior-knowledge-guided conditional generative adversarial network |
CN109118446B (en) | | Underwater image restoration and denoising method |
CN104217404A (en) | | Method and device for sharpening video images captured in fog and haze |
CN110544213A (en) | | Image defogging method based on global and local feature fusion |
CN110378848B (en) | | Image defogging method based on a derivative-map fusion strategy |
CN112465727A (en) | | Low-illumination image enhancement method without a normal-illumination reference, based on the HSV color space and Retinex theory |
CN113284061B (en) | | Underwater image enhancement method based on gradient network |
CN115223004A (en) | | Image enhancement method based on an improved multi-scale fusion generative adversarial network |
CN114627034A (en) | | Image enhancement method, training method of an image enhancement model, and related equipment |
CN113160286A (en) | | Near-infrared and visible light image fusion method based on convolutional neural network |
CN111768326A (en) | | High-capacity data protection method based on GAN-amplified image foreground objects |
CN103295205A (en) | | Fast low-light image enhancement method and device based on Retinex |
CN114596233A (en) | | Low-illumination image enhancement method based on attention guidance and multi-scale feature fusion |
Saleem et al. | | A non-reference evaluation of underwater image enhancement methods using a new underwater image dataset |
Dwivedi et al. | | Single image dehazing using extended local dark channel prior |
CN116681636B (en) | | Lightweight infrared and visible light image fusion method based on convolutional neural network |
Ajith et al. | | Dark Channel Prior based Single Image Dehazing of Daylight Captures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||