CN116612378A

CN116612378A - Unbalanced data and underwater small target detection method under complex background based on SSD improvement

Info

Publication number: CN116612378A
Application number: CN202310578589.2A
Authority: CN
Inventors: 于俊洋; 何义茹; 谷航宇; 潘顺杰; 辛致宜; 赵宇曦
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2023-05-22
Filing date: 2023-05-22
Publication date: 2023-08-18

Abstract

The application discloses a method, based on an improved SSD (Single Shot MultiBox Detector), for detecting small underwater targets with unbalanced data and complex backgrounds. The method comprises the following steps. The SSD network is improved: VGG16 is used as the front-end backbone network, the output at Conv3_3 is taken as the first feature layer, and a multidimensional pixel attention network is embedded after Conv3_3; after the multidimensional pixel attention network, dilated (atrous, or "hole") convolution with dilation rate r and ReLU activation is used to sequentially generate the remaining feature layers used for prediction; the feature maps of the generated feature layers are input to a joint weighted knowledge distillation and multi-scale feature distillation module. The pictures to be detected in the original unbalanced underwater image data set are resized and input to the improved SSD network, and small underwater targets are detected with the improved SSD network. The application greatly improves the detection of rare categories and reduces the impact of unbalanced sample distribution on the model's detection capability.

Description

Unbalanced data and underwater small target detection method under complex background based on SSD improvement

Technical Field

The application relates to the technical field of computer vision and image processing, and in particular to a method, based on an improved SSD (Single Shot MultiBox Detector), for detecting small underwater targets with unbalanced data and complex backgrounds.

Background

With the advent of deep convolutional neural networks, researchers have made significant progress in the task of target detection. Underwater target detection aims to locate and identify objects in an underwater scene. This research has been receiving continuous attention for its wide application in the fields of oceanography, underwater navigation, fish farming, etc. However, this is still a challenging task due to the complex underwater environment and lighting conditions.

The traditional SSD (Single Shot MultiBox Detector) algorithm is a single-stage target detection algorithm that predicts from feature maps at multiple scales, which effectively improves its ability to detect targets of different sizes. It achieves good results on many generic benchmark data sets. However, it still has deficiencies in underwater target detection. First, underwater detection data sets are scarce, and the objects in the available underwater data sets and in practical use are typically small. In SSD the resolution of the feature map decreases as the network deepens, so small objects in the image cannot be detected effectively. Second, the images in real underwater data sets are cluttered. In underwater scenes, wavelength-dependent absorption and scattering significantly reduce image quality, causing visibility loss, weak contrast, and color distortion, which makes it difficult for a general-purpose target detection algorithm to reach the required detection accuracy in underwater environments with complex backgrounds.

Disclosure of Invention

To address the problems that underwater images have low visibility, complex backgrounds, targets that occupy a small proportion of the image, and unbalanced samples, so that the detection accuracy of the traditional SSD algorithm can hardly meet requirements in underwater environments with complex backgrounds, the application provides a method, based on an improved SSD, for detecting small underwater targets with unbalanced data and complex backgrounds. For images from most underwater environments, the background is complex and the target occupies a small part of the image; the multidimensional pixel attention network can largely eliminate the influence of the complex background during detection. Finally, for the unbalanced distribution of sample classes in underwater images, knowledge distillation is introduced into the classifier to strengthen training on classes with few samples, which effectively improves detection accuracy.

In order to achieve the above purpose, the present application adopts the following technical scheme:

A method, based on an improved SSD, for detecting small underwater targets with unbalanced data and complex backgrounds comprises the following steps:

Step 1: improving the SSD network: using VGG16 as the front-end backbone network, taking the output at Conv3_3 of VGG16 as the first feature layer used for prediction, and embedding a multidimensional pixel attention network after Conv3_3 of the VGG16 model to eliminate the complex background of the underwater environment; after the multidimensional pixel attention network, using dilated convolution with dilation rate r and ReLU activation to sequentially generate the remaining feature layers used for prediction; inputting the feature maps of the generated feature layers into a joint weighted knowledge distillation and multi-scale feature distillation module so as to improve the detection of small underwater targets belonging to classes with fewer samples;

Step 2: resizing the picture to be detected in the original unbalanced underwater image data set and inputting it to the SSD network improved in Step 1;

Step 3: detecting small underwater targets with the SSD network improved in Step 1.

Further, the feature maps of the plurality of feature layers (including the first feature layer used for prediction and the remaining feature layers used for prediction) have different sizes.

Further, the multi-dimensional pixel attention network is composed of a combination of pixel attention and CBAM attention.

Further, in the multidimensional pixel attention network, the following steps are specifically executed:

in the pixel attention network, the feature map of the first feature layer F_1 passes through an Inception structure with convolution kernels of different ratios, and a two-channel saliency map is then learned through a convolution operation; the two channels of the saliency map represent the scores of the foreground and the background, respectively; a Softmax operation is then performed on the saliency map, and one of the channels is selected and multiplied with the feature map of F_1; finally, a new information feature map A_1 is obtained;

a supervised learning method is adopted: first, a binary map is obtained from the ground-truth labels of the sample and used as the label, and then the cross-entropy loss between the binary map and the saliency map is used as the attention loss; furthermore, CBAM is used as an auxiliary attention network.

Further, in the combined weighted knowledge distillation and multi-scale feature distillation module, the following steps are specifically executed:

based on the feature maps of the generated feature layers, training is continued with instance sampling and cross-entropy loss to obtain a teacher model;

a student model ψ_{θ,ω} is then retrained with an added weighted knowledge distillation loss and multi-scale feature distillation loss; in this process, the multi-scale features and predictions of the trained teacher model are used, leaving learnable room for training a better student model.

Further, the weighted knowledge distillation loss L_RKL is calculated according to the following formula:

where abs() represents the absolute value function, and the weight factor w_i is

where t_i and s_i represent the class prediction probabilities of the teacher model and the student model, respectively, C represents the number of classes, and N_i represents the number of samples of class i.

Further, the multi-scale feature distillation loss L_KF is calculated according to the following formula:

where the two normalized feature terms are obtained by normalizing, respectively, the multi-scale features v_t learned by the teacher model and the features v_s of the student model, and n represents the number of feature maps extracted by the model.

Further, in the joint weighted knowledge distillation and multi-scale feature distillation module, the final classification loss L_JWAFD is:

where L_CE represents the cross-entropy loss between the ground-truth labels and the student model predictions; y = (y_1, y_2, ..., y_C) ∈ R^C is the true label vector of an image data point; C is the number of classes; y_i is the i-th component of y; z_s is the output of the student model, from which the estimated class probabilities of the student model are obtained; the hyperparameters α and β control the respective distillation amounts; δ and γ are the scaling parameters for multi-scale feature distillation and weighted knowledge distillation, respectively; and T is the temperature of the knowledge distillation.

Compared with the prior art, the application has the beneficial effects that:

In the method of the application, the supervised multidimensional pixel attention network effectively eliminates the influence of the complex background of underwater images. To address the unbalanced class distribution in underwater environments, a joint weighted knowledge distillation and multi-scale feature distillation module is provided, so that classes with fewer samples are recognized and detected better; compared with the original SSD algorithm, the detection of rare classes is greatly improved and the impact of unbalanced sample distribution on the model's detection capability is reduced. During training, the resolution of the feature map decreases as the backbone network extracts features from the underwater image; the application adds a layer of dilated convolution with a dilation rate of 3 so that the network has multi-scale convolution kernels, which further enhances the model's ability to detect small objects. Overall, compared with the original SSD target detection model, the design in the application is better suited to underwater image detection with complex backgrounds.

Drawings

FIG. 1 is a diagram of the improved network structure used by the method, based on an improved SSD, for detecting small underwater targets with unbalanced data and complex backgrounds;

FIG. 2 is a diagram of an SSD network structure;

FIG. 3 is a diagram of a supervised multidimensional pixel attention network architecture;

FIG. 4 is a block diagram of a CBAM module;

FIG. 5 is a schematic diagram of the dilated (hole) convolution structure;

FIG. 6 is a schematic diagram of a weighted knowledge distillation and multi-scale feature distillation module architecture.

Detailed Description

The application is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:

The application provides a method, based on an improved SSD (Single Shot MultiBox Detector), for detecting small underwater targets with unbalanced data and complex backgrounds. The network structure of the application is shown in FIG. 1, and the method comprises the following steps:

S101: The SSD network (the original network is shown in FIG. 2) is improved as follows. The front-end feature extraction layer is based on the architecture of the standard VGG16 model (truncated at the Conv3_3 layer), and its output is used as the first feature layer F_1 for prediction. A multidimensional pixel attention network (MDA-Net, shown in FIG. 3), composed of a combination of pixel attention and CBAM attention (shown in FIG. 4), is embedded after Conv3_3 to eliminate the complex background of the underwater environment. After this, dilated convolution with dilation rate r (in this embodiment, r=3) and ReLU activation (shown in FIG. 5) is used to sequentially generate a plurality of feature layers (in this embodiment, 6 feature layers F_2, F_3, F_4, F_5, F_6, F_7). This achieves a larger receptive field without sacrificing feature map resolution (a large receptive field yields strong semantics). The feature maps of the 7 feature layers have different sizes and can be used to effectively predict targets of different sizes. Finally, to address sample imbalance in underwater target detection, a joint weighted knowledge distillation and multi-scale feature distillation module (shown in FIG. 6) is used, which greatly improves performance on classes with fewer samples.

S102: The picture to be detected in the original unbalanced underwater image data set is resized and input to the SSD network improved in S101;

S103: Small underwater targets are detected with the SSD network improved in S101.

For a better understanding, the application is described in detail below:

(1) Supervised multidimensional pixel attention network for eliminating complex background effects

To capture small underwater objects against complex backgrounds more effectively, a supervised multidimensional attention network (MDA-Net) is introduced into the original SSD network. Specifically, in the pixel attention network, the feature layer F_1 passes through an Inception structure with convolution kernels of different ratios, and a two-channel saliency map is then learned through a convolution operation. The two channels of the saliency map represent the scores of the foreground and the background, respectively. A Softmax operation is then performed on the saliency map, and one of the channels is selected and multiplied with the feature map of F_1. Finally, a new information feature map A_1 is obtained. Note that the values of the saliency map after the Softmax function lie in [0, 1]; that is, the operation reduces noise and relatively enhances target information. Because the saliency map is continuous, non-object information is not completely eliminated, which helps preserve some context information and improves robustness. To guide network learning, a supervised learning method is used: first, a binary map is derived from the ground-truth labels of the sample and used as the label, and then the cross-entropy loss between the binary map and the saliency map is used as the attention loss. In addition, CBAM is used as an auxiliary attention network with a reduction ratio of 16.
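The pixel attention step described above can be sketched as follows. This is a minimal pure-Python illustration and not the application's implementation: the function names, the tiny 2×2 example, and the use of a per-pixel binary cross-entropy as the attention loss are assumptions made for illustration, and the Inception structure and the CBAM branch are omitted.

```python
import math

def softmax2(bg, fg):
    """Softmax over the two saliency channels at one pixel; returns (p_bg, p_fg)."""
    m = max(bg, fg)
    e_bg, e_fg = math.exp(bg - m), math.exp(fg - m)
    s = e_bg + e_fg
    return e_bg / s, e_fg / s

def pixel_attention(feature, saliency_bg, saliency_fg):
    """Scale every channel of `feature` by the foreground probability.

    feature: list of channels, each an HxW list of lists.
    saliency_bg / saliency_fg: HxW score maps (the two saliency channels).
    Returns (attended_feature, foreground_probability_map).
    """
    h, w = len(saliency_fg), len(saliency_fg[0])
    p_fg = [[softmax2(saliency_bg[i][j], saliency_fg[i][j])[1]
             for j in range(w)] for i in range(h)]
    attended = [[[ch[i][j] * p_fg[i][j] for j in range(w)]
                 for i in range(h)] for ch in feature]
    return attended, p_fg

def attention_loss(p_fg, mask):
    """Mean binary cross-entropy between foreground probability and a 0/1 mask
    (assumed form of the attention loss)."""
    eps = 1e-12
    total, count = 0.0, 0
    for row_p, row_m in zip(p_fg, mask):
        for p, m in zip(row_p, row_m):
            total += -(m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps))
            count += 1
    return total / count

# Hypothetical 1-channel, 2x2 example.
feature = [[[1.0, 2.0], [3.0, 4.0]]]
sal_bg = [[0.0, 2.0], [0.0, 0.0]]   # background scores
sal_fg = [[0.0, 0.0], [2.0, 0.0]]   # foreground scores
mask = [[0, 0], [1, 0]]             # binary map derived from ground-truth boxes
A1, p_fg = pixel_attention(feature, sal_bg, sal_fg)
loss = attention_loss(p_fg, mask)
```

Because the foreground probability is a soft value rather than a hard 0/1 mask, background responses are attenuated but not zeroed, which matches the remark above about preserving context information.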

(2) Hole convolution

As the backbone network extracts features from the underwater image, the resolution of the feature map becomes smaller. Convolution preserves a small number of key features in the data to reduce learning and training costs. Large convolution kernels favor the detection of large targets; smaller convolution kernels favor the detection of small targets. The model adds a layer of dilated convolution with a dilation rate of 3, so that the network has multi-scale convolution kernels, which further enhances the model's ability to detect objects of different sizes. In addition, dilated convolution with a smaller dilation rate is designed to extract information from low-resolution feature maps more effectively.
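The effect of the dilation rate on the receptive field can be illustrated with a short sketch. A kernel of size k with dilation rate r covers k + (k - 1)(r - 1) input positions, so a 3-tap kernel with r = 3 spans 7 positions while still using only 3 weights. The kernel size of 3 and the 1-D setting are assumptions made for illustration; the text above only fixes the dilation rate.

```python
def effective_kernel_size(k, r):
    """Input span covered by a k-tap kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

def dilated_conv1d(signal, kernel, r):
    """Valid 1-D dilated convolution (cross-correlation form) followed by ReLU."""
    span = effective_kernel_size(len(kernel), r)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are applied r positions apart.
        acc = sum(w * signal[start + i * r] for i, w in enumerate(kernel))
        out.append(max(0.0, acc))
    return out

span = effective_kernel_size(3, 3)          # 7
signal = [float(i) for i in range(10)]
out = dilated_conv1d(signal, [1.0, 1.0, 1.0], 3)
# out[0] sums signal[0], signal[3], signal[6]
```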

(3) Combined weighted knowledge distillation and multi-scale feature distillation module

Unlike in the original SSD, class prediction here must account for the class imbalance in underwater target detection: a classifier trained on the original data is biased towards the classes with many samples, and the remaining classes tend to be misclassified. This patent therefore designs and implements a new two-stage method for training the classifier, called the joint weighted knowledge distillation and multi-scale feature distillation module. The module first trains a teacher model using the class data of the training samples; then, in the prediction stage, the teacher model guides the class prediction process so that a more balanced classifier is learned, which benefits the detection performance of the final model.

The module comprises a weighted knowledge distillation module and a feature distillation module. The weighted knowledge distillation module takes class priors into account and assigns different weights to different classes, so that the model focuses on the classes with fewer samples. The feature distillation module compensates for the insufficient feature representation caused by the weighted knowledge distillation method. The module is suited to data sets with unbalanced class counts and combines the advantages of re-weighting with those of the original knowledge distillation.

The weighted knowledge distillation loss L_RKL is calculated as follows:

where abs() represents the absolute value function, and the weight factor w_i is calculated by the following formula; in this way the weight of classes with fewer samples is effectively increased, and the negative gradient that classes with more samples impose on other classes is reduced.
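The idea of class-weighted distillation can be sketched as below. The formulas for L_RKL and w_i are not reproduced in this text, so the sketch uses assumed forms: inverse-frequency weights normalized to average 1, and a per-class absolute KL term between teacher probabilities t_i and student probabilities s_i. Both forms, the function names, and the sample counts are hypothetical and may differ from the application's formulas.

```python
import math

def class_weights(counts):
    """Assumed inverse-frequency weights w_i, normalized so they average to 1."""
    inv = [1.0 / n for n in counts]
    scale = len(counts) / sum(inv)
    return [v * scale for v in inv]

def weighted_kd_loss(t, s, w):
    """Assumed form: sum_i w_i * abs(t_i * log(t_i / s_i))."""
    eps = 1e-12
    return sum(wi * abs(ti * math.log((ti + eps) / (si + eps)))
               for ti, si, wi in zip(t, s, w))

counts = [100, 400, 2000, 2500]     # hypothetical N_i per class
w = class_weights(counts)
teacher = [0.7, 0.1, 0.1, 0.1]      # t_i
student = [0.4, 0.2, 0.2, 0.2]      # s_i
loss = weighted_kd_loss(teacher, student, w)
```

With these assumed weights, a disagreement on the rarest class contributes far more to the loss than the same disagreement on the most frequent class, which is the behavior the paragraph above describes.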

The multi-scale feature distillation loss L_KF is calculated as follows:

where the two normalized feature terms are obtained by normalizing, respectively, the multi-scale features v_t learned by the teacher model and the features v_s of the student model, and n represents the number of feature maps extracted by the model; in this embodiment, n=5.
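A feature distillation term of this kind can be sketched as follows. The exact formula for L_KF is not reproduced in this text; the sketch assumes L2 normalization of each flattened feature vector and a squared difference averaged over the n feature maps. These choices and the function names are assumptions for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a flattened feature vector to unit L2 norm (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def feature_distill_loss(teacher_feats, student_feats):
    """Assumed form: mean over n feature maps of the squared distance
    between normalized teacher and student features."""
    n = len(teacher_feats)
    total = 0.0
    for v_t, v_s in zip(teacher_feats, student_feats):
        n_t, n_s = l2_normalize(v_t), l2_normalize(v_s)
        total += sum((a - b) ** 2 for a, b in zip(n_t, n_s))
    return total / n

# Same direction, different magnitude: normalization removes the scale.
same_direction = feature_distill_loss([[3.0, 4.0]], [[6.0, 8.0]])
# Orthogonal unit vectors: squared distance is 2.
orthogonal = feature_distill_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Normalizing before comparison means the student is pushed to match the direction of the teacher's features rather than their magnitude.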

Some mispredictions of the teacher model give the student model erroneous guidance. The application therefore also includes the cross-entropy loss when training the student network, so that the student model has the opportunity to learn the true labels of the samples. The student model is finally trained by minimizing the sum of the above three losses; on this basis, the classification loss of the original SSD target detection algorithm is redesigned as:

where L_CE represents the cross-entropy loss between the ground-truth labels and the student model predictions; y = (y_1, y_2, ..., y_C) ∈ R^C is the true label vector of an image data point; C is the number of classes; y_i is the i-th component of y; z_s is the output of the student model, from which the estimated class probabilities of the student model are obtained; the hyperparameters α and β control the respective distillation amounts; δ and γ are the scaling parameters of multi-scale feature distillation and weighted knowledge distillation, respectively; t_i and s_i represent the prediction probabilities of the teacher model and the student model, respectively; and T is the temperature of the knowledge distillation.
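The way the three loss terms combine can be sketched as below. The formula for L_JWAFD is not reproduced in this text, so the sketch only assumes a plain weighted sum, L_CE + α·L_RKL + β·L_KF; the assignment of α and β to the two distillation terms is an assumption, and the scaling parameters δ and γ are not modelled. The temperature T is shown only through a softened softmax, which is the usual way distillation probabilities are produced.

```python
import math

def softmax(z, T=1.0):
    """Softmax with temperature T; larger T gives a flatter distribution."""
    m = max(z)
    e = [math.exp((x - m) / T) for x in z]
    s = sum(e)
    return [x / s for x in e]

def cross_entropy(y, z_s):
    """L_CE between a one-hot label vector y and student logits z_s."""
    p = softmax(z_s)
    return -sum(yi * math.log(pi + 1e-12) for yi, pi in zip(y, p))

def joint_loss(y, z_s, l_rkl, l_kf, alpha, beta):
    """Assumed combination: L_CE + alpha * L_RKL + beta * L_KF."""
    return cross_entropy(y, z_s) + alpha * l_rkl + beta * l_kf

y = [0, 1, 0, 0]                     # true class is index 1
z_s = [1.0, 2.0, 0.5, 0.1]           # hypothetical student logits
sharp = softmax(z_s, T=1.0)
soft = softmax(z_s, T=4.0)           # softened probabilities
total = joint_loss(y, z_s, l_rkl=0.3, l_kf=0.1, alpha=0.5, beta=0.5)
```

Keeping L_CE in the sum is what lets the student correct for teacher mispredictions, as noted above.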

As a specific implementation, the method, based on an improved SSD, for detecting small underwater targets with unbalanced data and complex backgrounds comprises the following steps:

Step 1: The picture to be detected in the original unbalanced underwater image data set is resized to 512×512×3 and input into the network.

Step 2: In the improved-SSD-based underwater small target detection method, VGG-16 (truncated at the Conv3_3 layer) is used as the front-end backbone network, and the feature layer F_1 is taken at its Conv3_3 layer. The feature map of F_1 is 64×64×512; it is processed by the supervised multidimensional pixel attention network to eliminate the complex, noisy background and obtain a clearer feature map A_1.

Step 3: at A ₁ Then adopting a cavity convolution block with the expansion rate of 3 to obtain more effective characteristics, and further carrying out 3X 3 convolution and maximum pooling operation to obtain a characteristic layer F ₂ The feature map size is 32×32×1024, and then similar operation is performed to obtain the remaining 5 prediction feature layers F ₃ (feature map size is 16×16×512), F ₄ (feature map size is 8×8×256), F ₅ (feature map size is 4×4×256), F ₆ (feature map size is 2×2×256), F ₇ (the feature map size is 1×1×256). Finally, 7 features are subjected to category prediction and position deviation prediction to obtain confidence coefficient and coordinates respectively, wherein the features are input into a 3X 3 convolution in the prediction process, F ₁ ,F ₆ ,F ₇ Each pixel point generates 4 prior frames, F ₂ ,F ₃ ,F ₄ ,F ₅ Each pixel point generates 6 prior frames respectively.

Step 4: A teacher model is first trained using the class data of the training samples; then, in the prediction stage, the teacher model guides the class prediction process so that a more balanced classifier is learned, completing the detection of small underwater targets.

Further, the step 4 includes:

Step 4.1: Based on the feature maps of the generated feature layers, training is continued with instance sampling and cross-entropy loss to obtain a teacher model;

Step 4.2: A student model ψ_{θ,ω} is retrained with an added weighted knowledge distillation loss and multi-scale feature distillation loss; in this process, the multi-scale features and predictions of the trained teacher model are fully used, leaving learnable room for training a better student model.

To verify the effect of the application, the following experiments were performed:

(a) Experimental configuration and data set

The experiments used the Ubuntu 20.04 operating system, an NVIDIA Tesla V100 GPU, and 32 GB of memory. Training, validation, and testing were carried out on the URPC2019 and ChinaMM data sets using the PyTorch framework. The batch size is 16, the input picture size is 512×512, and the momentum is 0.9. The initial learning rate is 0.001; after each detector is trained for 80 epochs, the learning rate is reduced to 0.0001 and training continues for another 40 epochs, for a total of 120 epochs.
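The training schedule above can be written down as a small configuration sketch. The dictionary keys and the function name are hypothetical, and epochs are assumed to be counted from 0; only the numeric values come from the text.

```python
# Hyperparameters stated in the experimental configuration.
TRAIN_CONFIG = {
    "batch_size": 16,
    "input_size": (512, 512),
    "momentum": 0.9,
    "initial_lr": 1e-3,
    "reduced_lr": 1e-4,
    "lr_drop_epoch": 80,
    "total_epochs": 120,
}

def learning_rate(epoch, cfg=TRAIN_CONFIG):
    """Step schedule: initial rate for the first 80 epochs, reduced rate after."""
    if not 0 <= epoch < cfg["total_epochs"]:
        raise ValueError("epoch outside the training schedule")
    if epoch < cfg["lr_drop_epoch"]:
        return cfg["initial_lr"]
    return cfg["reduced_lr"]

schedule = [learning_rate(e) for e in range(TRAIN_CONFIG["total_epochs"])]
```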

The URPC data set is an image data set of seafood such as sea cucumbers, sea urchins, and scallops, established by the National Natural Science Foundation of China for underwater robot target grasping. As a public image processing data set, it is mainly used for underwater image detection tasks. The application uses the URPC2019 data set; because its test set is not public, the URPC2019 training set is divided into 3409 training images and 1000 test images, covering four object classes: sea cucumbers, sea urchins, scallops, and starfish. The ChinaMM data set, from an underwater image enhancement competition, contains 2071 training images and 676 validation images.

All data sets have the problem of class imbalance, i.e. scallops and starfish contain much more data than sea cucumbers and sea urchins.

(b) Evaluation index

The application uses three metrics commonly used in target detection tasks: the precision-recall (P-R) curve, average precision (AP), and mean average precision (mAP). They are calculated as follows:

where T_TP represents correct predictions; F_FP represents false predictions, including objects that are not sea cucumbers being detected as sea cucumbers, and missed detections; F_FN represents sea cucumber targets erroneously detected as other classes; P is the precision; and R is the recall. The area enclosed by the P-R curve and the coordinate axes equals the AP value. In particular, to evaluate the imbalance problem in the underwater data sets, AP_r, AP_m, and AP_f are used to evaluate performance on the rare, general, and frequent classes in the data set, respectively. Finally, mAP is obtained by averaging the AP values of all classes and is generally used to evaluate the detection performance of the whole target detection network model.
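These metrics can be sketched in a few lines. The sketch uses the standard definitions P = T_TP / (T_TP + F_FP) and R = T_TP / (T_TP + F_FN) and computes AP as the area under the P-R curve after making precision monotonically non-increasing (the all-point interpolation commonly used for PASCAL VOC style evaluation). The interpolation choice, the function names, and the example detections are assumptions; the application does not state which AP variant it uses.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts of T_TP, F_FP and F_FN."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(detections, num_gt):
    """Area under the P-R curve for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing from left to right (interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=8, fp=2, fn=4)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
m_ap = mean_average_precision([ap, 0.5])
```

AP_r, AP_m, and AP_f would then be obtained by averaging the per-class AP values over the rare, general, and frequent class groups, respectively.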

Finally, based on the experimental setting, compared with an original SSD target detection model, the design in the application is more suitable for detecting the underwater image with the complex background, and the small target detection effect is better.

In summary, the application provides a method, based on an improved SSD, for detecting small underwater targets with unbalanced data and complex backgrounds, in which the supervised multidimensional pixel attention network effectively eliminates the influence of the complex background of underwater images. To address the unbalanced class distribution in underwater environments, a joint weighted knowledge distillation and multi-scale feature distillation module is provided, so that classes with fewer samples are recognized and detected better; compared with the original SSD algorithm, the detection of rare classes is greatly improved and the impact of unbalanced sample distribution on the model's detection capability is reduced. During training, the resolution of the feature map decreases as the backbone network extracts features from the underwater image; the application adds a layer of dilated convolution with a dilation rate of 3 so that the network has multi-scale convolution kernels, which further enhances the model's ability to detect small objects. Overall, compared with the original SSD target detection model, the design in the application is better suited to underwater image detection with complex backgrounds.

The foregoing is merely illustrative of the preferred embodiments of this application, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this application, and it is intended to cover such modifications and changes as fall within the true scope of the application.

Claims

1. An underwater small target detection method based on SSD improved unbalanced data and complex background is characterized by comprising the following steps:

Step 1: improving the SSD network, comprising: using VGG16 as the front-end backbone network, taking the output at Conv3_3 of VGG16 as the first feature layer used for prediction, and embedding a multidimensional pixel attention network after Conv3_3 of the VGG16 model to eliminate the complex background of the underwater environment; after the multidimensional pixel attention network, using dilated convolution with dilation rate r and ReLU activation to sequentially generate the remaining feature layers used for prediction; inputting the feature maps of the generated feature layers into a joint weighted knowledge distillation and multi-scale feature distillation module so as to improve the detection of small underwater targets belonging to classes with fewer samples;

2. The method for detecting underwater small objects based on unbalanced data and complex background of SSD improvement according to claim 1, wherein the feature maps of the plurality of feature layers have different sizes.

3. The method for detecting underwater small objects in an unbalanced data and complex background based on SSD improvement of claim 1, wherein the multidimensional pixel attention network is composed of a combination of pixel attention and CBAM attention.

4. The method for detecting underwater small objects in unbalanced data and complex backgrounds based on SSD improvement according to claim 3, wherein the following steps are specifically executed in the multidimensional pixel attention network:

5. The method for detecting underwater small targets based on unbalanced data and complex background of SSD improvement of claim 1, wherein the combined weighted knowledge distillation and multi-scale feature distillation module specifically performs the following steps:

6. The method for detecting underwater small objects in unbalanced data and complex backgrounds based on SSD improvement as claimed in claim 5, wherein the weighted knowledge distillation loss L_RKL is calculated according to the following formula:

7. The method for detecting underwater small targets in an unbalanced data and complex background based on SSD improvement of claim 6, wherein the multi-scale feature distillation loss L_KF is calculated according to the following formula:

8. The method for detecting underwater small targets in an unbalanced data and complex background based on SSD improvement of claim 7, wherein, in the joint weighted knowledge distillation and multi-scale feature distillation module, the final classification loss L_JWAFD is:

where L_CE represents the cross-entropy loss between the ground-truth labels and the student model predictions; y = (y_1, y_2, ..., y_C) ∈ R^C is the true label vector of an image data point; C is the number of classes; y_i is the i-th component of y; z_s is the output of the student model, from which the estimated class probabilities of the student model are obtained; the hyperparameters α and β control the respective distillation amounts; δ and γ are the scaling parameters for multi-scale feature distillation and weighted knowledge distillation, respectively; and T is the temperature of the knowledge distillation.