CN114092793A - End-to-end biological target detection method suitable for complex underwater environment

End-to-end biological target detection method suitable for complex underwater environment

Info

Publication number
CN114092793A
CN114092793A (application CN202111342981.4A)
Authority
CN
China
Prior art keywords
underwater
network
image
convolution
output
Prior art date
Legal status
Granted
Application number
CN202111342981.4A
Other languages
Chinese (zh)
Other versions
CN114092793B (en)
Inventor
方笑海
章学挺
潘勉
于海滨
吕帅帅
彭时林
史剑光
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111342981.4A
Publication of CN114092793A
Application granted
Publication of CN114092793B
Status: Active


Classifications

    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate (G: Physics; G06F: Electric digital data processing; G06F18/00: Pattern recognition)
    • G06N3/045: Combinations of networks (G06N: Computing arrangements based on specific computational models; G06N3/02: Neural networks; G06N3/04: Architecture, e.g. interconnection topology)
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06T5/40: Image enhancement or restoration using histogram techniques (G06T: Image data processing or generation, in general)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end biological target detection method suitable for a complex underwater environment. The method comprises the following steps: S1, capture the underwater data set with an underwater robot, divide it into a training set and a test set, unify the size of the underwater images by up-sampling or down-sampling, and then normalize them; S2, select underwater images with poor imaging quality from the existing underwater data set and enhance them by a histogram equalization method to form the data set for the enhancement network; S3, train the underwater image enhancement network, taking the poor underwater images as the input of the enhancement network and the enhanced images as ground truth; S4, extract features of the network-enhanced underwater training images with a fully convolutional network, then perform target recognition and classification on the underwater feature maps with a one-stage detection network to obtain a trained model; S5, feed the processed underwater test set into the trained model for testing.

Description

End-to-end biological target detection method suitable for complex underwater environment
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an end-to-end biological target detection method suitable for a complex underwater environment.
Background
With the development of computer vision and image processing technology, applying image processing methods to improve underwater image quality, so as to meet the requirements of the human visual system and of machine recognition, has gradually become a research hotspot. With the development of artificial intelligence, deep learning methods are gradually being applied to underwater target recognition. However, because underwater illumination is uneven, underwater images suffer from color distortion, underexposure and similar problems, and traditional deep learning target detection methods lack sufficient capacity to handle underwater targets. In underwater environments, enhancement of low-quality images is therefore essential for computer vision.
For underwater image enhancement, conventional image processing methods include color correction and contrast enhancement algorithms; white balance methods, gray-world theory and gray-edge theory are typical color correction approaches. However, the results these methods produce are not satisfactory for underwater vision, and the task of target detection in underwater images remains to be studied further.
Disclosure of Invention
In view of these technical problems, the invention provides an end-to-end biological target detection method suitable for a complex underwater environment: the data are preprocessed to improve the accuracy of the classifier, the underwater image is enhanced by a generative network, features are extracted by a deep neural network, and targets are finally recognized and classified.
In order to solve the technical problems, the invention adopts the following technical scheme:
S1, capturing an underwater data set with an underwater robot and dividing it into a training set and a test set, wherein the underwater data set comprises the underwater targets sea cucumber, sea urchin, scallop and starfish, 20% of the data set is used as the test set and 80% as the training set; the underwater images are unified in size by up-sampling or down-sampling and then normalized;
S2, selecting underwater images with poor imaging quality from the existing underwater data set and enhancing them by a histogram equalization method to form the data set for the enhancement network;
S3, training the underwater image enhancement network, taking the poor underwater images as the input of the enhancement network and the enhanced images as ground truth;
S4, extracting features of the network-enhanced underwater training images with a fully convolutional network, and then performing target recognition and classification on the underwater feature maps with a one-stage detection network to obtain a trained model;
S5, feeding the processed underwater test set into the trained model for testing.
Preferably, S1 further includes:
S101, let $x_i$ be an image pixel value, and let $\min(x_i)$ and $\max(x_i)$ be the minimum and maximum pixel values respectively. The normalized underwater image is:

$$\hat{x}_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$
preferably, S2 further includes:
Ground truth for the poor images, obtained by histogram equalization, is used to train the enhancement network. First, count the number of pixels $n_k$ at each gray level k in the image, with k in the range [0, L-1]; the initial probability density of the image histogram is $p(r_k) = n_k / n$. The transformation function is then:

$$s_k = T(r_k) = \sum_{j=0}^{k} p(r_j) = \sum_{j=0}^{k} \frac{n_j}{n}$$

The equalized probability density $p(s_k)$ is obtained through the transformation function, which is applied to the actual images to obtain paired underwater data sets.
Preferably, S3 further includes:
S301, a generative adversarial network is used for image enhancement. The poor-quality underwater image X is input to the generator network; the convolution module of each layer comprises three processes: convolution, batch normalization and ReLU. The input X passes through N convolution kernels of kernel_size 3 × 3 to obtain the outputs $\{F_i\}_{i=1}^{N}$, where N is the total number of channels and i indexes the i-th channel. The extracted features are:

$$F_i = W_i \otimes X$$

where $W_i$ is the i-th convolution kernel and $\otimes$ denotes the convolution operation. There are 5 convolutional layers in total, and the output of the third convolutional layer is superimposed on the output of the fifth convolutional layer;
S302, the data output by the convolutional layers are processed further. To make the model converge easily and the network training more stable, batch normalization is added after the convolution, computing the mean and variance of the data in each batch. Suppose a mini-batch contains $N_m$ samples; the convolution outputs are $\{F_n\}_{n=1}^{N_m}$, where $F_n$ is the convolution output of the n-th sample. Within each mini-batch, the data in $\{F_n\}_{n=1}^{N_m}$ are batch-normalized to obtain $\{\hat{F}_n\}_{n=1}^{N_m}$, expressed as:

$$\hat{F}_n(k,l) = \alpha_k \frac{F_n(k,l) - E[F(k,\cdot)]}{\sqrt{\mathrm{Var}[F(k,\cdot)] + \epsilon}} + \beta_k$$

where $F_n(k,l)$ is the l-th element of the k-th channel of the convolutional output before batch normalization, $\hat{F}_n(k,l)$ is the batch-normalized value, $\alpha_k$ and $\beta_k$ are trainable parameters for the k-th channel, $\epsilon$ is a very small number that prevents the divisor from being 0, $E(\cdot)$ denotes the averaging operation and $\mathrm{Var}(\cdot)$ the variance operation;
S303, the activation function ReLU then non-linearly activates each element of $\hat{F}_n$ to obtain $\tilde{F}_n$; for an input $\hat{F}_n(k,l)$, the corresponding output after the ReLU is:

$$\tilde{F}_n(k,l) = \max\left(0,\, \hat{F}_n(k,l)\right)$$
S304, the image produced by the generator network is input to the adversarial network to judge whether the generator's output achieves the desired enhancement; the discriminator consists of 3 simple convolutional layers, each again comprising convolution, batch normalization and ReLU;
S305, to ensure the result performs well both visually and quantitatively, the loss function consists of two parts, the adversarial loss $L_1$ and the feature loss $L_2$. The adversarial loss drives the generator to produce better output. Let D denote the discriminator network, and let $x_r$ and $x_f$ be samples from the real distribution and the generated (fake) distribution respectively; the adversarial loss is:

$$L_1 = \mathbb{E}_{x_r}\left[\log D(x_r)\right] + \mathbb{E}_{x_f}\left[\log\left(1 - D(x_f)\right)\right]$$

The feature loss is the Euclidean distance between the features extracted by the convolutional layers of VGG16 from the input image and from the generated image respectively, and it reduces the instability of the generator network. Let $I_L$ denote the color-cast input, $G(I_L)$ the output of the generator network, $\phi_i$ the feature map obtained from the feature extraction network, with i indexing its i-th pooled feature map, and $W_i, H_i$ the dimensions of the extracted feature map. The feature loss is:

$$L_2 = \sum_i \frac{1}{W_i H_i} \left\| \phi_i(I_L) - \phi_i(G(I_L)) \right\|_2^2$$
preferably, S4 further includes:
S401, the feature extraction layer uses a ResNet50 module. The ResNet50 structure first performs a convolution operation on the input and then contains 4 residual blocks, for a total of 50 convolution operations. Each residual block has a skip connection that alleviates gradient vanishing or explosion. If the input of the residual block is X and the output of the residual branch is H(X), the block output is:

Y = H(X) + X
S402, low-level features carry little semantic information but locate targets accurately, while high-level features are semantically rich but locate targets only coarsely. A top-down path is first adopted to propagate the strong high-level semantic features; a bottom-up path is then added to supplement the feature maps and propagate the strong low-level localization features.
Preferably, S5 further includes:
S501, the detection module mainly comprises two sub-networks, a classification sub-network and a bounding-box regression sub-network. For each anchor, the classification sub-network predicts at every spatial position the probability that a target exists and its class probabilities. The sub-network is a simple fully convolutional module composed of four convolutional layers, and its parameters are shared across all feature maps of different scales; classification is finally performed with a sigmoid. Let $p_i$ be the network's predicted probability that the current i-th anchor is a target and $\hat{p}_i$ the probability that the i-th anchor is labeled as a target; the classification loss function is:

$$L_{cls} = -\sum_i \left[\hat{p}_i \log p_i + (1 - \hat{p}_i)\log(1 - p_i)\right]$$
S502, in parallel with the target classification sub-network, another simple fully convolutional network regresses the offset of each anchor box toward the nearby ground truth. The target classification sub-network and the bounding-box regression sub-network have the same structure but do not share parameters. Let $t_i$ be the predicted offset of the box relative to the i-th positive anchor and $\hat{t}_i$ the offset of the ground truth relative to that anchor; the bounding-box regression loss is:

$$L_{reg} = \sum_i \mathrm{smooth}_{L1}\left(t_i - \hat{t}_i\right)$$
drawings
Fig. 1 is a flowchart of the overall algorithm steps of an end-to-end biological target detection method suitable for a complex underwater environment according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flow chart of steps of an end-to-end biological target detection method suitable for a complex underwater environment according to an embodiment of the present invention is shown, which includes the following steps:
S1, capturing an underwater data set with an underwater robot and dividing it into a training set and a test set, wherein the underwater data set comprises the underwater targets sea cucumber, sea urchin, scallop and starfish, 20% of the data set is used as the test set and 80% as the training set; because the underwater images differ in size they cannot be fed directly into a network, so their sizes are unified by up-sampling or down-sampling and they are then normalized.
S2, since paired underwater data sets for training an image enhancement network cannot be obtained in actual engineering, underwater images with poor imaging quality are selected from the existing underwater data set and enhanced by a histogram equalization method to form the data set for the enhancement network;
S3, training the underwater image enhancement network, taking the poor underwater images as the input of the enhancement network and the enhanced images as ground truth;
S4, extracting features of the network-enhanced underwater training images with a fully convolutional network, and then performing target recognition and classification on the underwater feature maps with a one-stage detection network to obtain a trained model;
S5, feeding the processed underwater test set into the trained model for testing.
In a specific application example, S1 further includes:
The images are normalized. The goal of normalization is to find a mapping under which features of different dimensions become numerically comparable, which can greatly improve the accuracy of the classifier. Let $x_i$ be an image pixel value, and let $\min(x_i)$ and $\max(x_i)$ be the minimum and maximum pixel values respectively. The normalized underwater image is:

$$\hat{x}_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$
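As an illustration, a minimal sketch of this preprocessing (resizing to a common size, then min-max normalization); the 512 × 512 target size and OpenCV resizing are assumptions of the example, not values fixed by the method:

```python
import cv2
import numpy as np

def preprocess(img: np.ndarray, size: int = 512) -> np.ndarray:
    """S1: unify image size by up-/down-sampling, then min-max normalize."""
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)
    x = img.astype(np.float32)
    # x_hat = (x - min(x)) / (max(x) - min(x)); the epsilon guards a constant image
    return (x - x.min()) / (x.max() - x.min() + 1e-8)
```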
in a specific application example, S2 further includes:
Paired underwater data sets cannot be obtained in actual engineering, so ground truth produced from the poor images by histogram equalization is used to train the enhancement network. First, count the number of pixels $n_k$ at each gray level k in the image, with k in the range [0, L-1]; the initial probability density of the image histogram is $p(r_k) = n_k / n$. The transformation function is then:

$$s_k = T(r_k) = \sum_{j=0}^{k} p(r_j) = \sum_{j=0}^{k} \frac{n_j}{n}$$

The equalized probability density $p(s_k)$ is obtained through the transformation function, which is applied to the actual images to obtain paired underwater data sets.
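For illustration, a sketch of the transformation above for an 8-bit grayscale image (L = 256); applying it channel-wise to color images is an assumption of the example:

```python
import numpy as np

def equalize_hist(gray: np.ndarray, L: int = 256) -> np.ndarray:
    """Histogram equalization: s_k = T(r_k) = sum_{j<=k} n_j / n."""
    n_k, _ = np.histogram(gray, bins=L, range=(0, L))  # pixels per gray level, n_k
    p_r = n_k / gray.size                              # initial density p(r_k)
    cdf = np.cumsum(p_r)                               # transformation function T(r_k)
    lut = np.round((L - 1) * cdf).astype(np.uint8)     # map old levels to equalized ones
    return lut[gray]                                   # gray must be uint8
```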
In a specific application example, S3 further includes:
S301, image enhancement is performed with a generative adversarial network. The poor-quality underwater image X is input to the generator network; the convolution module of each layer comprises three processes: convolution, batch normalization and ReLU. The input X passes through N convolution kernels of kernel_size 3 × 3 to obtain the outputs $\{F_i\}_{i=1}^{N}$, where N is the total number of channels and i indexes the i-th channel. The extracted features are:

$$F_i = W_i \otimes X$$

where $W_i$ is the i-th convolution kernel and $\otimes$ denotes the convolution operation. There are 5 convolutional layers in total, and the output of the third convolutional layer is superimposed on the output of the fifth convolutional layer;
S302, the data output by the convolutional layers are processed further. To make the model converge easily and the network training more stable, batch normalization is added after the convolution, computing the mean and variance of the data in each batch. Suppose a mini-batch contains $N_m$ samples; the convolution outputs are $\{F_n\}_{n=1}^{N_m}$, where $F_n$ is the convolution output of the n-th sample. Within each mini-batch, the data in $\{F_n\}_{n=1}^{N_m}$ are batch-normalized to obtain $\{\hat{F}_n\}_{n=1}^{N_m}$, expressed as:

$$\hat{F}_n(k,l) = \alpha_k \frac{F_n(k,l) - E[F(k,\cdot)]}{\sqrt{\mathrm{Var}[F(k,\cdot)] + \epsilon}} + \beta_k$$

where $F_n(k,l)$ is the l-th element of the k-th channel of the convolutional output before batch normalization, $\hat{F}_n(k,l)$ is the batch-normalized value, $\alpha_k$ and $\beta_k$ are trainable parameters for the k-th channel, $\epsilon$ is a very small number that prevents the divisor from being 0, $E(\cdot)$ denotes the averaging operation and $\mathrm{Var}(\cdot)$ the variance operation;
S303, the activation function ReLU then non-linearly activates each element of $\hat{F}_n$ to obtain $\tilde{F}_n$; for an input $\hat{F}_n(k,l)$, the corresponding output after the ReLU is:

$$\tilde{F}_n(k,l) = \max\left(0,\, \hat{F}_n(k,l)\right)$$
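A minimal PyTorch sketch of the generator stages in S301 to S303: five Conv-BN-ReLU layers with the stated skip connection adding the third layer's output to the fifth. The channel width of 64 and the final RGB projection layer are assumptions of the sketch, not specified by the method:

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in: int, c_out: int) -> nn.Sequential:
    """One generator stage: 3x3 convolution -> batch normalization -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),  # per-channel trainable alpha_k (weight) and beta_k (bias)
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    """Five Conv-BN-ReLU layers; layer-3 output is superimposed on layer-5 output."""
    def __init__(self, width: int = 64):
        super().__init__()
        self.l1 = conv_bn_relu(3, width)
        self.l2 = conv_bn_relu(width, width)
        self.l3 = conv_bn_relu(width, width)
        self.l4 = conv_bn_relu(width, width)
        self.l5 = conv_bn_relu(width, width)
        self.out = nn.Conv2d(width, 3, kernel_size=3, padding=1)  # back to an RGB image

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f3 = self.l3(self.l2(self.l1(x)))
        f5 = self.l5(self.l4(f3))
        return self.out(f3 + f5)  # skip connection: superimpose layers 3 and 5
```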
S304, the image produced by the generator network is input to the adversarial network to judge whether the generator's output achieves the desired enhancement. The discriminator consists of 3 simple convolutional layers, each again comprising convolution, batch normalization and ReLU.
S305, to ensure the result performs well both visually and quantitatively, the loss function consists of two parts, the adversarial loss $L_1$ and the feature loss $L_2$. The adversarial loss drives the generator to produce better output. Let D denote the discriminator network, and let $x_r$ and $x_f$ be samples from the real distribution and the generated (fake) distribution respectively. The adversarial loss is then:

$$L_1 = \mathbb{E}_{x_r}\left[\log D(x_r)\right] + \mathbb{E}_{x_f}\left[\log\left(1 - D(x_f)\right)\right]$$

The feature loss is the Euclidean distance between the features extracted by the convolutional layers of VGG16 from the input image and from the generated image respectively, and it reduces the instability of the generator network. Let $I_L$ denote the color-cast input, $G(I_L)$ the output of the generator network, $\phi_i$ the feature map obtained from the feature extraction network, with i indexing its i-th pooled feature map, and $W_i, H_i$ the dimensions of the extracted feature map. The feature loss is then:

$$L_2 = \sum_i \frac{1}{W_i H_i} \left\| \phi_i(I_L) - \phi_i(G(I_L)) \right\|_2^2$$
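A sketch of the two-part loss in S305, assuming the standard GAN formulation for $L_1$ (written as binary cross-entropy on the discriminator logits) and a single mid-level feature map from torchvision's pretrained VGG16 for $L_2$; the layer cut-off is an assumption of the example:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG16 feature extractor phi (first conv blocks only; cut-off is illustrative)
phi = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in phi.parameters():
    p.requires_grad_(False)

def adversarial_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator objective: minimizing this BCE is equivalent to maximizing
    E[log D(x_r)] + E[log(1 - D(x_f))] from the L1 term above."""
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def feature_loss(i_l: torch.Tensor, g_out: torch.Tensor) -> torch.Tensor:
    """L2: squared Euclidean distance between VGG16 features of the color-cast
    input I_L and the generated image G(I_L), normalized by W_i * H_i."""
    f_in, f_out = phi(i_l), phi(g_out)
    _, _, h, w = f_out.shape
    return torch.sum((f_in - f_out) ** 2) / (w * h)
```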
in a specific application example, S4 further includes:
S401, the feature extraction layer uses a ResNet50 module. The ResNet50 structure first performs a convolution operation on the input and then contains 4 residual blocks, for a total of 50 convolution operations. Each residual block has a skip connection that alleviates gradient vanishing or explosion. Assuming the input to the residual block is X and the output of the residual branch is H(X), the block output is:

Y = H(X) + X
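A sketch of the skip connection Y = H(X) + X; this simplified block uses two 3 × 3 convolutions for H, whereas ResNet50 proper uses bottleneck blocks (1 × 1, 3 × 3, 1 × 1), so the body shown is an illustrative assumption:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Y = H(X) + X: the identity path lets gradients bypass H, easing vanishing/explosion."""
    def __init__(self, channels: int):
        super().__init__()
        self.h = nn.Sequential(  # residual branch H
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.h(x) + x)
```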
S402, low-level features carry little semantic information but locate targets accurately, while high-level features are semantically rich but locate targets only coarsely. A top-down path is first adopted to propagate the strong high-level semantic features; a bottom-up path is then added to supplement the feature maps and propagate the strong low-level localization features.
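A sketch of the top-down plus bottom-up fusion described in S402, in the style of a feature pyramid with an added bottom-up path; the C3/C4/C5 channel counts (512/1024/2048, typical of ResNet50 stages) and the 256-channel pyramid width are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayPyramid(nn.Module):
    """Top-down path spreads high-level semantics; bottom-up path spreads low-level localization."""
    def __init__(self, in_channels=(512, 1024, 2048), width: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.down = nn.ModuleList(nn.Conv2d(width, width, 3, stride=2, padding=1)
                                  for _ in in_channels[:-1])

    def forward(self, c3, c4, c5):
        # Top-down: upsample higher levels and add them into lower ones.
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # Bottom-up: downsample lower levels and add them back into higher ones.
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return n3, n4, n5
```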
In a specific application example, S5 further includes:
S501, the detection module mainly comprises two sub-networks: a classification sub-network and a bounding-box regression sub-network. For each anchor, the classification sub-network predicts at every spatial position the probability that a target exists and its class probabilities. The sub-network is a simple fully convolutional module composed of four convolutional layers, and its parameters are shared across all feature maps of different scales. Classification is finally performed with a sigmoid. Let $p_i$ be the network's predicted probability that the current i-th anchor is a target and $\hat{p}_i$ the probability that the i-th anchor is labeled as a target; the classification loss function is:

$$L_{cls} = -\sum_i \left[\hat{p}_i \log p_i + (1 - \hat{p}_i)\log(1 - p_i)\right]$$
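A sketch of the classification sub-network and loss: four shared 3 × 3 convolutional layers followed by a sigmoid-activated prediction layer, with binary cross-entropy as written above. The anchor count of 9 per position is an assumption of the sketch; the 4 classes follow the sea cucumber, sea urchin, scallop and starfish targets of S1:

```python
import torch
import torch.nn as nn

def make_cls_subnet(width: int = 256, num_anchors: int = 9, num_classes: int = 4) -> nn.Sequential:
    """Fully convolutional classification head, shared across all pyramid levels."""
    layers = []
    for _ in range(4):  # four fully convolutional layers
        layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(width, num_anchors * num_classes, 3, padding=1))  # logits
    return nn.Sequential(*layers)

# L_cls = -sum_i [p_hat_i log p_i + (1 - p_hat_i) log(1 - p_i)]; the sigmoid is
# folded into BCEWithLogitsLoss for numerical stability.
cls_loss = nn.BCEWithLogitsLoss(reduction="sum")
```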
S502, in parallel with the target classification sub-network, another simple fully convolutional network regresses the offset of each anchor box toward the nearby ground truth. The target classification sub-network and the bounding-box regression sub-network have the same structure but do not share parameters. Let $t_i$ be the predicted offset of the box relative to the i-th positive anchor and $\hat{t}_i$ the offset of the ground truth relative to that anchor; the bounding-box regression loss is:

$$L_{reg} = \sum_i \mathrm{smooth}_{L1}\left(t_i - \hat{t}_i\right)$$
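A sketch of the regression term, assuming the common smooth L1 form, smooth_L1(x) = 0.5x² for |x| < 1 and |x| - 0.5 otherwise, applied only to positive anchors:

```python
import torch
import torch.nn.functional as F

def box_regression_loss(t_pred: torch.Tensor, t_true: torch.Tensor,
                        positive: torch.Tensor) -> torch.Tensor:
    """L_reg = sum_i smooth_L1(t_i - t_hat_i) over positive anchors.

    t_pred, t_true: (num_anchors, 4) offsets; positive: boolean mask of positive anchors.
    """
    return F.smooth_l1_loss(t_pred[positive], t_true[positive], reduction="sum")
```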
it is to be understood that the exemplary embodiments described herein are illustrative and not restrictive. Although one or more embodiments of the present invention have been described with reference to the accompanying drawings, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the following claims.

Claims (6)

1. An end-to-end biological target detection method suitable for a complex underwater environment is characterized by comprising the following steps:
S1, capturing an underwater data set with an underwater robot and dividing it into a training set and a test set, wherein the underwater data set comprises the underwater targets sea cucumber, sea urchin, scallop and starfish, 20% of the data set is used as the test set and 80% as the training set; the underwater images are unified in size by up-sampling or down-sampling and then normalized;
S2, selecting underwater images with poor imaging quality from the existing underwater data set and enhancing them by a histogram equalization method to form the data set for the enhancement network;
S3, training the underwater image enhancement network, taking the poor underwater images as the input of the enhancement network and the enhanced images as ground truth;
S4, extracting features of the network-enhanced underwater training images with a fully convolutional network, and then performing target recognition and classification on the underwater feature maps with a one-stage detection network to obtain a trained model;
S5, feeding the processed underwater test set into the trained model for testing.
2. The method for end-to-end biological target detection applicable to complex underwater environments of claim 1, wherein S1 further comprises:
Let $x_i$ be an image pixel value, and let $\min(x_i)$ and $\max(x_i)$ be the minimum and maximum pixel values respectively; the normalized underwater image is:

$$\hat{x}_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$
3. the method for end-to-end biological target detection applicable to complex underwater environments of claim 1, wherein S2 further comprises:
Counting the number of pixels $n_k$ at each gray level k in the image, with k in the range [0, L-1]; the initial probability density of the image histogram is $p(r_k) = n_k / n$, and the transformation function is:

$$s_k = T(r_k) = \sum_{j=0}^{k} p(r_j) = \sum_{j=0}^{k} \frac{n_j}{n}$$

The equalized probability density $p(s_k)$ is obtained through the transformation function, which is applied to the actual images to obtain paired underwater data sets.
4. The method for end-to-end biological target detection in a complex underwater environment as claimed in claim 1, wherein S3 further comprises:
S301, using a generative adversarial network for image enhancement, the poor-quality underwater image X is input to the generator network; the convolution module of each layer comprises three processes, convolution, batch normalization and ReLU, and the input X passes through N convolution kernels of kernel_size 3 × 3 to obtain the outputs $\{F_i\}_{i=1}^{N}$, where N is the total number of channels and i indexes the i-th channel; the extracted features are:

$$F_i = W_i \otimes X$$

where $W_i$ is the i-th convolution kernel and $\otimes$ denotes the convolution operation; there are 5 convolutional layers in total, and the output of the third convolutional layer is superimposed on the output of the fifth convolutional layer;
S302, further processing the data output by the convolutional layers: to make the model converge easily and the network training more stable, batch normalization is added after the convolution, computing the mean and variance of the data in each batch; supposing a mini-batch contains $N_m$ samples, the convolution outputs are $\{F_n\}_{n=1}^{N_m}$, where $F_n$ is the convolution output of the n-th sample, and within each mini-batch the data in $\{F_n\}_{n=1}^{N_m}$ are batch-normalized to obtain $\{\hat{F}_n\}_{n=1}^{N_m}$, expressed as:

$$\hat{F}_n(k,l) = \alpha_k \frac{F_n(k,l) - E[F(k,\cdot)]}{\sqrt{\mathrm{Var}[F(k,\cdot)] + \epsilon}} + \beta_k$$

where $F_n(k,l)$ is the l-th element of the k-th channel of the convolutional output before batch normalization, $\hat{F}_n(k,l)$ is the batch-normalized value, $\alpha_k$ and $\beta_k$ are trainable parameters for the k-th channel, $\epsilon$ is a very small number that prevents the divisor from being 0, $E(\cdot)$ denotes the averaging operation and $\mathrm{Var}(\cdot)$ the variance operation;
S303, the activation function ReLU then non-linearly activates each element of $\hat{F}_n$ to obtain $\tilde{F}_n$; for an input $\hat{F}_n(k,l)$, the corresponding output after the ReLU is:

$$\tilde{F}_n(k,l) = \max\left(0,\, \hat{F}_n(k,l)\right)$$
S304, inputting the image produced by the generator network into the adversarial network to judge whether the generator's output achieves the desired enhancement, wherein the discriminator consists of 3 simple convolutional layers, each again comprising convolution, batch normalization and ReLU;
S305, to ensure the result performs well both visually and quantitatively, the loss function consists of two parts, the adversarial loss $L_1$ and the feature loss $L_2$; the adversarial loss drives the generator to produce better output; letting D denote the discriminator network and $x_r$ and $x_f$ be samples from the real distribution and the generated (fake) distribution respectively, the adversarial loss is:

$$L_1 = \mathbb{E}_{x_r}\left[\log D(x_r)\right] + \mathbb{E}_{x_f}\left[\log\left(1 - D(x_f)\right)\right]$$

The feature loss is the Euclidean distance between the features extracted by the convolutional layers of VGG16 from the input image and from the generated image respectively; letting $I_L$ denote the color-cast input, $G(I_L)$ the output of the generator network, $\phi_i$ the feature map obtained from the feature extraction network, with i indexing its i-th pooled feature map, and $W_i, H_i$ the dimensions of the extracted feature map, the feature loss is:

$$L_2 = \sum_i \frac{1}{W_i H_i} \left\| \phi_i(I_L) - \phi_i(G(I_L)) \right\|_2^2$$
5. the method for end-to-end biological target detection in a complex underwater environment as claimed in claim 1, wherein S4 further comprises:
S401, the feature extraction layer uses a ResNet50 module; the ResNet50 structure first performs a convolution operation on the input and then contains 4 residual blocks, for a total of 50 convolution operations, each residual block having a skip connection that alleviates gradient vanishing or explosion; if the input of the residual block is X and the output of the residual branch is H(X), the block output is:

Y = H(X) + X
S402, low-level features carry little semantic information but locate targets accurately, while high-level features are semantically rich but locate targets only coarsely; a top-down path is first adopted to propagate the strong high-level semantic features, and a bottom-up path is then added to supplement the feature maps and propagate the strong low-level localization features.
6. The method for end-to-end biological target detection in a complex underwater environment as claimed in claim 1, wherein S5 further comprises:
S501, the detection module mainly comprises two sub-networks, a classification sub-network and a bounding-box regression sub-network; the classification sub-network is a simple fully convolutional module composed of four convolutional layers, its parameters are shared across all feature maps of different scales, and classification is finally performed with a sigmoid; letting $p_i$ be the network's predicted probability that the current i-th anchor is a target and $\hat{p}_i$ the probability that the i-th anchor is labeled as a target, the classification loss function is:

$$L_{cls} = -\sum_i \left[\hat{p}_i \log p_i + (1 - \hat{p}_i)\log(1 - p_i)\right]$$
S502, in parallel with the target classification sub-network, another simple fully convolutional network regresses the offset of each anchor box toward the nearby ground truth; the target classification sub-network and the bounding-box regression sub-network have the same structure but do not share parameters; letting $t_i$ be the predicted offset of the box relative to the i-th positive anchor and $\hat{t}_i$ the offset of the ground truth relative to that anchor, the bounding-box regression loss is:

$$L_{reg} = \sum_i \mathrm{smooth}_{L1}\left(t_i - \hat{t}_i\right)$$
CN202111342981.4A 2021-11-12 2021-11-12 End-to-end biological target detection method suitable for complex underwater environment Active CN114092793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111342981.4A CN114092793B (en) 2021-11-12 2021-11-12 End-to-end biological target detection method suitable for complex underwater environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111342981.4A CN114092793B (en) 2021-11-12 2021-11-12 End-to-end biological target detection method suitable for complex underwater environment

Publications (2)

Publication Number Publication Date
CN114092793A 2022-02-25
CN114092793B CN114092793B (en) 2024-05-17

Family

Family ID: 80300549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111342981.4A Active CN114092793B (en) 2021-11-12 2021-11-12 End-to-end biological target detection method suitable for complex underwater environment

Country Status (1)

Country Link
CN (1) CN114092793B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543585A (en) * 2018-11-16 2019-03-29 西北工业大学 Underwater optics object detection and recognition method based on convolutional neural networks
CN111209952A (en) * 2020-01-03 2020-05-29 西安工业大学 Underwater target detection method based on improved SSD and transfer learning
CN111723823A (en) * 2020-06-24 2020-09-29 河南科技学院 Underwater target detection method based on third-party transfer learning
CN112417980A (en) * 2020-10-27 2021-02-26 南京邮电大学 Single-stage underwater biological target detection method based on feature enhancement and refinement
CN112767279A (en) * 2021-02-01 2021-05-07 福州大学 Underwater image enhancement method for generating countermeasure network based on discrete wavelet integration

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU YAN; SUN MEISHUANG: "Underwater image enhancement method based on convolutional neural networks", Journal of Jilin University (Engineering and Technology Edition), no. 06, 26 March 2018 (2018-03-26) *
JIA ZHENQING; LIU XUEFENG: "Marine animal target detection based on YOLO and image enhancement", Electronic Measurement Technology, no. 14, 23 July 2020 (2020-07-23) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114387190A (en) * 2022-03-23 2022-04-22 山东省计算中心(国家超级计算济南中心) Adaptive image enhancement method and system based on complex environment
CN115880574A (en) * 2023-03-02 2023-03-31 吉林大学 Underwater optical image lightweight target identification method, equipment and medium
CN115984269A (en) * 2023-03-20 2023-04-18 湖南长理尚洋科技有限公司 Non-invasive local water ecological safety detection method and system

Also Published As

Publication number Publication date
CN114092793B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109583342B (en) Human face living body detection method based on transfer learning
CN107133943B (en) A kind of visible detection method of stockbridge damper defects detection
CN114092793A (en) End-to-end biological target detection method suitable for complex underwater environment
CN111832443B (en) Construction method and application of construction violation detection model
CN109376591B (en) Ship target detection method for deep learning feature and visual feature combined training
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN109685765B (en) X-ray film pneumonia result prediction device based on convolutional neural network
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN113361645B (en) Target detection model construction method and system based on meta learning and knowledge memory
CN115035371B (en) Well wall crack identification method based on multi-scale feature fusion neural network
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN116452810A (en) Multi-level semantic segmentation method and device, electronic equipment and storage medium
CN115223032A (en) Aquatic organism identification and matching method based on image processing and neural network fusion
CN113627240B (en) Unmanned aerial vehicle tree species identification method based on improved SSD learning model
CN116844114A (en) Helmet detection method and device based on YOLOv7-WFD model
CN114821356B (en) Optical remote sensing target detection method for accurate positioning
Meng et al. A Novel Steganography Algorithm Based on Instance Segmentation.
CN111950586B (en) Target detection method for introducing bidirectional attention
CN114581769A (en) Method for identifying houses under construction based on unsupervised clustering
CN114463628A (en) Deep learning remote sensing image ship target identification method based on threshold value constraint
CN117809169B (en) Small-sample underwater sonar image classification method and model building method thereof
CN111476129A (en) Soil impurity detection method based on deep learning
Kaur et al. Deep learning with invariant feature based species classification in underwater environments

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant