CN115527105A - Underwater target detection method based on multi-scale feature learning - Google Patents

Underwater target detection method based on multi-scale feature learning

Info

Publication number
CN115527105A
Authority
CN
China
Prior art keywords
feature
texture
image
gray level
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211190261.5A
Other languages
Chinese (zh)
Inventor
罗俊海
陈瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202211190261.5A priority Critical patent/CN115527105A/en
Publication of CN115527105A publication Critical patent/CN115527105A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/05 Underwater scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an underwater target detection method based on multi-scale feature learning. The method increases the number of prediction heads to 4 so that small targets are detected better, and reduces the resource footprint of the network model so that it is suitable for embedded devices with limited hardware. Compared with other advanced target detectors, it further improves the detection accuracy of underwater targets, detects targets effectively, and offers good detection accuracy and real-time performance.

Description

Underwater target detection method based on multi-scale feature learning
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to an underwater target detection method based on multi-scale feature learning.
Background
Marine environments form a complex system. Because the underwater environment differs greatly from the land environment, remote sensing technologies including acoustics, magnetics and shallow three-dimensional seismic surveying have achieved good performance in the marine field. With the development of computer vision technology, exploring the ocean with computer vision has become a new approach. As a basic task of computer vision, target detection based on optical imaging has become a research hotspot in the marine field, and in recent years many researchers have begun to study underwater target detection based on optical imaging with great success. Underwater target detection currently has many applications in marine environments, including marine ecosystem research, marine population estimation, marine species protection, ocean fisheries, underwater unexploded ordnance detection, underwater archaeology and many other potential applications, providing an effective way to develop marine resources. Although target detectors based on deep convolutional neural networks (DCNN) have performed well on general-purpose datasets in recent years, they cannot be applied directly to underwater scenes because large networks are slow and their models are large. Underwater scenes are more complex than terrestrial scenes, and the image samples obtained by underwater cameras are few and generally of low quality; in particular, these images often suffer from high noise, low visibility, blurred edges, low contrast and color cast.
Furthermore, although object detection techniques achieve good performance on general datasets, image quality is generally poor in complex underwater environments because of low visibility and color deviation. The performance of the commonly used YOLOv5 is therefore not satisfactory on underwater images of complex scenes, and small objects in particular yield little extractable information, which makes satisfactory results difficult to obtain, so the YOLOv5 model still leaves much room for improvement. The original feature extraction network of YOLOv5 has 53 layers, with a huge number of parameters and a large amount of computation; in addition, because of the limitations of its convolutional sampling method, the hand-designed convolutional network used by CSPDarknet53 is not sensitive to targets of different scales. Its ability to handle geometric changes in features is relatively limited, so extensive image training is required to improve the generalization ability of the network. In practical use of YOLOv5, if the network encounters elements not present in the dataset, missed detections and false detections are likely to occur, which degrades the detection results. In underwater target detection research based on deep learning, most work focuses on using deeper networks, so the extracted features are not rich enough and the detection accuracy is not high enough, especially for small targets. Better detection performance is therefore needed before target detection techniques can be applied in marine environments.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an underwater target detection method based on multi-scale feature learning.
The technical scheme of the invention is as follows: an underwater target detection method based on multi-scale feature learning comprises the following specific steps:
Step one, preprocessing the image,
aiming at underwater target detection, a data-independent augmentation method, the Mixup method, is adopted to construct virtual samples, increase sample robustness and enhance the images; during training, the Mosaic method is used to process the dataset: four images are read at random, subjected to operations such as scaling, flipping and cropping, and stitched into one image to serve as training data, which increases the diversity of the data and enriches the background of the detected objects.
Step two, extracting the depth characteristics,
the CSPDarknet53 network structure is improved: the feature extraction network is replaced with ESNet, a dual-backbone network is adopted to extract richer features, and an SE module is added to the ES block.
When stride = 2, depthwise and pointwise convolutions are added to fuse information from different channels; when stride = 1, a Ghost module is introduced into the ES block to improve its performance.
The ES block in ESNet adopts depthwise separable convolution, splitting the convolution into depthwise (DW) and pointwise (PW) parts to extract feature maps, and the Ghost block is applied in ES blocks with a stride of 1.
Step three, extracting the texture characteristics,
aiming at the texture information of the target, a gray-level co-occurrence matrix (GLCM) is used to describe the degree of correlation between adjacent pixels in a local area, reflecting the combined information of the gray-level direction, interval and variation amplitude of the image.
The RGB image is divided into a 3 × 3 grid. For each grid region, the multi-channel image is converted into a gray image, the contrast is adjusted by histogram equalization, and the gray levels of the result are then compressed to reduce later computation. Statistical parameters derived from the GLCM are then used to analyze the texture classification features caused by the repeated appearance of specific gray-level structures, and the contrast, energy and inverse variance are selected to express the texture features.
The contrast expression is as follows:
Con = Σ_i Σ_j (i - j)^2 P(i, j)
where i denotes the i-th row of the gray-level matrix, j denotes the j-th column, P(i, j) denotes the probability that gray level j appears starting from gray level i, and Con denotes the resulting contrast.
The angular second moment (energy) expression is as follows:
Asm = Σ_i Σ_j P(i, j)^2
where P(i, j) denotes the probability that gray level j appears starting from gray level i, and Asm denotes the resulting angular second moment.
The inverse variance expression is as follows:
H(d, θ) = Σ_{i=1}^{N} Σ_{j=1}^{N} P(i, j | d, θ) / (1 + (i - j)^2)
where d denotes the spatial distance, θ denotes the direction, H denotes the inverse variance, P(i, j | d, θ) denotes the probability that gray level j appears starting from gray level i given the spatial distance d and direction θ, and N denotes the maximum number of rows (columns) of the gray-level matrix.
Different texture patterns are examined in the horizontal, vertical and diagonal directions, i.e. the gray-level co-occurrence matrix is computed for the four directions 0°, 45°, 90° and 135°. On the basis of the co-occurrence matrix in each direction, the three informative texture feature parameters contrast, angular second moment and inverse variance are extracted, and the feature mean over all images is calculated and normalized:
x* = (x - x_min) / (x_max - x_min)
where x is the feature value to be normalized, x_max is the largest feature value in the matrix, x_min is the smallest feature value, and x* is the normalized feature value.
The classifier compares texture features between grid regions while attending to the texture features of each local grid, and finally outputs a 1 × 108 feature vector representing the image texture features.
Step four, fusion processing of the depth features and texture features,
the extracted texture features and depth features are fused so that the network obtains richer feature information. After feature extraction, a coordinate attention mechanism is adopted: a new coordinate attention block is added at the last layer of the feature extraction network. Coordinate attention (CA) embeds position information into the channel attention; the coordinate attention block decomposes the channel attention into two one-dimensional feature-encoding processes that aggregate features along different directions, and the generated feature maps are then encoded separately to form a pair of direction-aware and position-sensitive feature maps.
Step five, prediction on the feature maps to obtain the result,
the prediction module transfers and fuses feature information into feature maps by up-sampling and performs prediction on the resulting feature maps to obtain the final result. The prediction module comprises 4 YOLO heads; each lower YOLO head uses the feature information of the YOLO head above it, which is up-sampled and concatenated with feature maps to obtain the feature map of that layer for prediction.
The beneficial effects of the invention are as follows: the invention discloses an underwater target detection method based on multi-scale feature learning. The method increases the number of prediction heads to 4 so that small targets are detected better, and reduces the resource footprint of the network model so that it is suitable for embedded devices with limited hardware. Compared with other advanced target detectors, it further improves the detection accuracy of underwater targets, detects targets effectively, and offers good detection accuracy and real-time performance.
Drawings
Fig. 1 is a flowchart of an underwater target detection method based on multi-scale feature learning according to the present invention.
Fig. 2 is an overall structure diagram of an underwater target detection method based on multi-scale feature learning according to the present invention.
Fig. 3 is a structural diagram of an ESNet according to an embodiment of the present invention.
Fig. 4 is an ES block structure diagram according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a texture feature extraction module according to an embodiment of the present invention.
FIG. 6 is a diagram of a coordinate attention structure according to an embodiment of the present invention.
Detailed Description
The method of the present invention is further described with reference to the accompanying drawings and examples.
As shown in the flowchart of fig. 1, the underwater target detection method based on multi-scale feature learning according to the present invention includes the following specific steps:
as shown in fig. 2, the overall structure of the underwater target detection method based on multi-scale feature learning of the present invention is based on YOLOv5 detection algorithm, and is composed of a multi-scale feature extraction network, an attention mechanism based on coordinate attention block, and a YOLOv5 target detector, wherein a CBL module is composed of convolution operation, batch processing, and an activation function, the multi-scale feature extraction part includes a depth feature extraction part and a texture feature extraction part, and a schematic structural diagram in the depth feature extraction part is shown in fig. 3.
Step one, the image is preprocessed,
the quality of partial images of the data set is low due to the conditions of underwater depth, light and the like of the underwater images, and the problems of target blurring, low contrast or target overlapping and shielding are mainly shown, so that the detection precision is greatly influenced.
Aiming at underwater target detection, a data-independent augmentation method, the Mixup method, is adopted to construct virtual samples, increase sample robustness and enhance the images. During training, the Mosaic method is used to process the dataset: four images are read at random, subjected to operations such as scaling, flipping and cropping, and stitched into one image to serve as training data, which increases the diversity of the data and enriches the background of the detected objects.
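A simplified sketch of the two augmentations described above is given below; the Beta parameter for Mixup, the 416-pixel canvas and the even 2 × 2 split for Mosaic are assumptions, and the adjustment of bounding-box labels for Mosaic (shifting and rescaling the boxes of each tile) is omitted for brevity.

```python
import numpy as np
import cv2

def mixup(img1, img2, boxes1, boxes2, alpha=1.5):
    # Mixup: blend two same-sized images with a Beta-distributed ratio
    # and keep the box labels of both images.
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img1.astype(np.float32)
             + (1.0 - lam) * img2.astype(np.float32)).astype(np.uint8)
    return mixed, boxes1 + boxes2          # boxes are plain Python lists

def mosaic(images, out_size=416):
    # Simplified Mosaic: resize four images and stitch them onto one 2x2 canvas.
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, offsets):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas
```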
Step two, extracting the depth characteristics,
the structure of the CSPDarknet53 network is improved: the feature extraction network is replaced with ESNet, and a dual-backbone network is adopted to extract richer features. The input image size is 416 × 416 × 3, and after feature extraction a single ESNet outputs three feature maps with sizes of 384, 192 and 96; the specific structure is shown in fig. 4. An SE module is added to the ES block, SE being a very effective operation for improving feature expression capability.
The design of the SE module follows MobileNetV3, i.e. its two activation functions are Sigmoid and H-Sigmoid. To address the loss of fused features caused by the channel shuffle that ShuffleNetV2 uses to exchange information between channels, depthwise and pointwise convolutions are added to fuse information from different channels when stride = 2. The Ghost module in GhostNet can generate more features with fewer parameters and improve the learning capability of the model, so when stride = 1 a Ghost module is introduced into the ES block to improve its performance.
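The description names Sigmoid and H-Sigmoid as the two activations of the SE module; the sketch below uses the common MobileNetV3 arrangement of a ReLU squeeze followed by a hard-sigmoid gate, which, together with the reduction ratio of 4, is an assumption.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block in the MobileNetV3 style:
    global average pooling, channel squeeze, then a hard-sigmoid gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(inplace=True),  # H-Sigmoid gate
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))
```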
At present, most convolution operations use point-by-point convolution to reduce dimensionality. The ES block in ESNet adopts depthwise separable convolution, splitting the convolution into depthwise (DW) and pointwise (PW) parts to extract feature maps, and a Ghost block is used in ES blocks with a stride of 1. The Ghost block combines linear operations with ordinary convolution: redundant feature maps are obtained by linear transformation of the feature maps produced by ordinary convolution, yielding similar feature maps and thus the effect of high-dimensional convolution while reducing model parameters and computation. Finally, the image depth features S1, S2 and S3 are output.
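A sketch of the two building blocks named above, depthwise separable convolution (DW + PW) and a Ghost module, is given below; kernel sizes, the 50/50 split between ordinary and "cheap" feature maps and the use of an even channel count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    # Depthwise (per-channel) convolution followed by a pointwise (1x1) convolution.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn2(self.pw(self.act(self.bn1(self.dw(x))))))

class GhostModule(nn.Module):
    # Ghost module: half the output channels come from an ordinary 1x1 convolution,
    # the other half from a cheap depthwise "linear" transform of those maps.
    def __init__(self, in_ch, out_ch):   # assumes out_ch is even
        super().__init__()
        primary = out_ch // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap_op = nn.Sequential(
            nn.Conv2d(primary, primary, 3, 1, 1, groups=primary, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_op(y)], dim=1)
```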
Step three, extracting the texture characteristics,
as shown in fig. 5, the texture feature extraction module extracts the texture information of the target. A gray-level co-occurrence matrix (GLCM) is used to describe the degree of correlation between adjacent pixels in a local area, reflecting the combined information of the gray-level direction, interval and variation amplitude of the image.
The RGB image is divided into a 3 × 3 grid. For each grid region, the multi-channel image is converted into a gray image, the contrast is adjusted by histogram equalization, and the gray levels of the result are then compressed to reduce later computation. Because the feature dimensionality of the raw gray-level co-occurrence statistics is large and a later classifier would have difficulty distinguishing texture information from it directly, statistical parameters derived from the GLCM are used to analyze the texture classification features caused by the repeated appearance of specific gray-level structures, and the following parameters are selected to express the texture features:
Contrast: measures how the matrix values are distributed and how much local variation there is in the image, reflecting the clarity of the image and the depth of its texture grooves. The deeper the texture grooves, the greater the contrast and the clearer the visual effect; conversely, a small contrast value means shallow grooves and a blurred effect. The expression is as follows:
Con = Σ_i Σ_j (i - j)^2 P(i, j)
where i denotes the i-th row of the gray-level matrix, j denotes the j-th column, P(i, j) denotes the probability that gray level j appears starting from gray level i, and Con denotes the resulting contrast.
Angular second moment (energy): the angular second moment (ASM) is an index reflecting the uniformity of the gray-level distribution and the coarseness of the texture. The more uniform the gray-level distribution within a local area, the larger the ASM value. The expression is as follows:
Asm = Σ_i Σ_j P(i, j)^2
where P(i, j) denotes the probability that gray level j appears starting from gray level i, and Asm denotes the angular second moment.
Inverse variance: the inverse variance reflects the magnitude of local changes in the image texture. If the different regions of the texture are uniform and change slowly, the inverse variance is large; otherwise it is small.
H(d, θ) = Σ_{i=1}^{N} Σ_{j=1}^{N} P(i, j | d, θ) / (1 + (i - j)^2)
where d denotes the spatial distance, θ denotes the direction and H denotes the inverse variance; P(i, j | d, θ) denotes the probability that gray level j appears starting from gray level i for a given spatial distance d and direction θ, and N denotes the maximum number of rows (columns) of the gray-level matrix.
Different texture patterns are examined in the horizontal, vertical and diagonal directions, i.e. the gray-level co-occurrence matrix is computed for the four directions 0°, 45°, 90° and 135°. On the basis of the co-occurrence matrix in each direction, the three informative texture feature parameters contrast, angular second moment and inverse variance are extracted, and the feature mean over all images is calculated and normalized:
x* = (x - x_min) / (x_max - x_min)
where x is the feature value to be normalized, x_max is the largest feature value in the matrix, x_min is the smallest feature value, and x* is the normalized feature value.
The classifier compares texture features between grid regions while attending to the texture features of each local grid, and finally outputs a 1 × 108 feature vector T representing the image texture features.
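A sketch of this texture branch using scikit-image is given below (graycomatrix/graycoprops; older scikit-image releases spell these greycomatrix/greycoprops). The pixel distance of 1, the compression to 16 gray levels and the per-image min-max normalization are assumptions standing in for details the text leaves open; the output length is 3 × 3 cells × 3 statistics × 4 directions = 108, matching the 1 × 108 feature T.

```python
import numpy as np
import cv2
from skimage.feature import graycomatrix, graycoprops

def texture_features(rgb, levels=16, grid=3):
    """Return a 108-dimensional texture vector: 3x3 grid x 3 statistics x 4 directions."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    gray = cv2.equalizeHist(gray)                                   # histogram equalization
    gray = (gray.astype(np.float32) * (levels - 1) / 255).astype(np.uint8)  # compress gray levels
    h, w = gray.shape
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]               # 0, 45, 90, 135 degrees
    feats = []
    for r in range(grid):
        for c in range(grid):
            cell = gray[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            glcm = graycomatrix(cell, distances=[1], angles=angles,
                                levels=levels, symmetric=True, normed=True)
            for prop in ("contrast", "ASM", "homogeneity"):         # Con, Asm, inverse variance
                feats.extend(graycoprops(glcm, prop).ravel())        # one value per direction
    feats = np.asarray(feats, dtype=np.float32)
    return (feats - feats.min()) / (feats.max() - feats.min() + 1e-8)  # min-max normalization
```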
Step four, depth feature and texture feature fusion processing,
and respectively extracting image texture information from the images I1 and I2, then fusing the image texture information with the extracted depth features respectively, so that the network obtains richer feature information, and finally extracting network output features C3, C4 and C5 by using the features. Expressed as:
C3=concat(S11+T1,S21+T2)
C4=concat(S12,S22)
C5=concat(S13,S23)
where T1 denotes the texture feature information extracted from the 1st image, T2 denotes the texture feature information extracted from the 2nd image, and Sij (i ∈ {1, 2}, j ∈ {1, 2, 3}) denotes the j-th depth feature extracted from the i-th image.
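The text does not specify how the 1 × 108 texture vector is combined with a spatial depth feature map before concatenation; the sketch below shows one plausible reading in which the texture vector is linearly projected to the channel dimension and broadcast over the spatial grid, so the projection layer and the shared channel width are assumptions.

```python
import torch
import torch.nn as nn

class DepthTextureFusion(nn.Module):
    """Computes C3 = concat(S11 + T1, S21 + T2) for two backbone branches.
    The linear projection of the 108-dim texture vector onto the channel
    dimension is an assumed design choice, not taken from the description."""
    def __init__(self, channels, texture_dim=108):
        super().__init__()
        self.proj = nn.Linear(texture_dim, channels)

    def forward(self, s1, s2, t1, t2):
        b = s1.size(0)
        t1 = self.proj(t1).view(b, -1, 1, 1)   # broadcast over the spatial grid
        t2 = self.proj(t2).view(b, -1, 1, 1)
        return torch.cat([s1 + t1, s2 + t2], dim=1)
```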
After the feature maps have been reduced in size, a coordinate attention mechanism is adopted. Although channel attention can significantly improve model performance, it usually ignores position information, which is very important for generating spatially selective attention. To enable the model to extract more useful features, a new coordinate attention block is added at the last layer of the feature extraction network, as shown in fig. 6. Coordinate attention (CA) embeds position information into the channel attention so that the network can focus on a larger area. Unlike channel attention, which converts the input into a single feature vector through two-dimensional global pooling, the coordinate attention block decomposes the channel attention into two one-dimensional feature-encoding processes that aggregate features along different directions; the resulting feature maps are then encoded separately to form a pair of direction-aware and position-sensitive feature maps, which are applied complementarily to the input feature map to enhance the representation of the object of interest.
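A sketch of a coordinate attention block of the kind described above (pooling along height and width separately, joint encoding, then two direction-wise gates applied to the input) is given below; the reduction ratio and the Hardswish activation follow common practice and are assumptions here.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    # Pool along height and width separately, encode the two 1-D descriptors jointly,
    # then split them into two direction-aware attention maps that re-weight the input.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                                      # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)                  # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # attention along height
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # attention along width
        return x * ah * aw
```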
Step five, prediction on the feature maps to obtain the result,
the prediction module transfers and fuses feature information into feature maps by up-sampling and performs prediction on the resulting feature maps to obtain the final result. The prediction module comprises 4 YOLO heads; each lower YOLO head uses the feature information of the YOLO head above it, which is up-sampled and concatenated with feature maps to obtain the feature map of that layer for prediction. Compared with the original three-layer prediction of YOLOv5, the improvement in this embodiment is that a YOLO head dedicated to detecting small target samples is added to the prediction module, so that the YOLO heads now operate at 96 × 96, 48 × 48, 24 × 24 and 12 × 12, which benefits the detection of small underwater targets.
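How the extra, higher-resolution head can be wired is sketched below: the feature map feeding a deeper head is up-sampled, concatenated with a shallower backbone feature, and passed through a 1 × 1 convolution that produces the additional YOLO head output. The channel widths and the assumed 3-anchor, 4-class output size are illustrative.

```python
import torch
import torch.nn as nn

class ExtraHeadNeck(nn.Module):
    """Adds a higher-resolution detection head: up-sample the deeper feature,
    concatenate it with a shallower one (twice its spatial size), fuse, predict.
    Channel widths and the 3 * (5 + 4) output size are illustrative assumptions."""
    def __init__(self, deep_ch=128, shallow_ch=64, num_outputs=3 * (5 + 4)):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(deep_ch + shallow_ch, shallow_ch, 1)
        self.head = nn.Conv2d(shallow_ch, num_outputs, 1)   # extra YOLO head

    def forward(self, deep_feat, shallow_feat):
        x = torch.cat([self.up(deep_feat), shallow_feat], dim=1)
        return self.head(self.fuse(x))
```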
During data annotation the sizes of the ground-truth boxes differ greatly from the default sizes of the original algorithm, and the prior box sizes affect both detection speed and accuracy. Choosing prior box sizes matched to the experimental model speeds up convergence during training and makes target localization more accurate, so the anchor box sizes need to be redesigned: after the network improvement, the number of prior box sizes is increased from the default 9 to 12, distributed over the 4 detection layers of different scales. The original YOLOv5 uses the K-Means clustering algorithm, but K-Means has the drawback that clustering results can differ considerably depending on the initial centroids chosen; to address this weakness in selecting initial centroids, K-Means++ is adopted instead of K-Means. The algorithm proceeds as follows:
first, one sample is selected at random from the data samples as the initial cluster centroid; then the shortest distance D(x) between every other sample in the dataset and the currently existing cluster centers is computed, and the probability P(x) of each sample being selected as the next cluster center is calculated, where the sample with the largest probability becomes the next cluster center. The formula for P(x) is:
P(x) = D(x)^2 / Σ_{x∈X} D(x)^2
where x represents a sample, X represents the set of all samples, and P(x) represents the probability that sample x is selected as the next cluster center.
The distance and probability calculations are then repeated each time a cluster center is selected, until K cluster centers have been chosen. Because a detection layer of an additional scale is added, the number of cluster centers K is set to 12, and the K-Means++ clustering algorithm then yields 12 groups of prior boxes of different sizes, i.e. the anchor box sizes. The smallest feature map has the largest receptive field and is therefore suited to detecting larger targets, while the largest feature map has a smaller receptive field, so the smallest anchors are applied to detect small targets.
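The seeding procedure can be sketched as follows, clustering the (width, height) pairs of the labelled boxes; plain Euclidean distance is used here for simplicity (anchor clustering is often done with an IoU-based distance instead), and the deterministic argmax step mirrors the description above rather than the randomized selection of canonical K-Means++.

```python
import numpy as np

def kmeans_pp_init(boxes, k=12, seed=0):
    """K-Means++ seeding over (width, height) pairs; boxes has shape (N, 2)."""
    rng = np.random.default_rng(seed)
    centers = [boxes[rng.integers(len(boxes))]]      # first centroid chosen at random
    for _ in range(k - 1):
        # D(x): shortest distance from each sample to the already chosen centers
        d2 = np.min([np.sum((boxes - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                        # P(x) = D(x)^2 / sum_X D(x)^2
        # The description picks the sample with the largest P(x) as the next center;
        # canonical K-Means++ would instead sample proportionally to P(x).
        centers.append(boxes[np.argmax(probs)])
    return np.array(centers)
```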
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, based on the technical solutions and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (1)

1. An underwater target detection method based on multi-scale feature learning comprises the following specific steps:
step one, the image is preprocessed,
aiming at underwater target detection, a data-independent augmentation method, the Mixup method, is adopted to construct virtual samples, increase sample robustness and enhance the images; during training, the Mosaic method is used to process the dataset: four images are read at random, subjected to operations such as scaling, flipping and cropping, and stitched into one image to serve as training data, which increases the diversity of the data and enriches the background of the detected objects;
step two, extracting the depth characteristics,
the CSPDarknet53 network structure is improved: the feature extraction network is replaced with ESNet, a dual-backbone network is adopted to extract richer features, and an SE module is added to the ES block;
to address the loss of fused features caused by the channel shuffle used in ShuffleNet V2 for information exchange between channels, depthwise and pointwise convolutions are added to fuse information from different channels when stride = 2, and a Ghost module is introduced into the ES block when stride = 1 to improve its performance;
the ES block in ESNet adopts depthwise separable convolution, splitting the convolution into depthwise (DW) and pointwise (PW) parts to extract feature maps, and a Ghost block is used in ES blocks with a stride of 1;
step three, extracting the texture characteristics,
aiming at the texture information of the target, a gray-level co-occurrence matrix (GLCM) is used to describe the degree of correlation between adjacent pixels in a local area, reflecting the combined information of the gray-level direction, interval and variation amplitude of the image;
the RGB image is divided into a 3 × 3 grid; for each grid region, the multi-channel image is converted into a gray image, the contrast is adjusted by histogram equalization, and the gray levels of the result are then compressed; statistical parameters derived from the GLCM are then used to analyze the texture classification features caused by the repeated appearance of specific gray-level structures, and contrast, energy and inverse variance are selected to express the texture features;
the contrast expression is as follows:
Con = Σ_i Σ_j (i - j)^2 P(i, j)
wherein i denotes the i-th row of the gray-level matrix, j denotes the j-th column, P(i, j) denotes the probability that gray level j appears starting from gray level i, and Con denotes the resulting contrast;
the angular second moment (energy) expression is as follows:
Asm = Σ_i Σ_j P(i, j)^2
wherein P(i, j) denotes the probability that gray level j appears starting from gray level i, and Asm denotes the resulting angular second moment;
the inverse variance expression is as follows:
H(d, θ) = Σ_{i=1}^{N} Σ_{j=1}^{N} P(i, j | d, θ) / (1 + (i - j)^2)
where d denotes the spatial distance, θ denotes the direction, H denotes the inverse variance, P(i, j | d, θ) denotes the probability that gray level j appears starting from gray level i given the spatial distance d and direction θ, and N denotes the maximum number of rows (columns) of the gray-level matrix;
different texture patterns are examined in the horizontal, vertical and diagonal directions, i.e. the gray-level co-occurrence matrix is computed for the four directions 0°, 45°, 90° and 135°; on the basis of the co-occurrence matrix in each direction, the three informative texture feature parameters contrast, angular second moment and inverse variance are extracted, and the feature mean over all images is calculated and normalized:
x* = (x - x_min) / (x_max - x_min)
wherein x is the feature value to be normalized, x_max is the largest feature value in the matrix, x_min is the smallest feature value, and x* is the normalized feature value;
the classifier compares texture features between grid regions while attending to the texture features of each local grid, and finally outputs a 1 × 108 feature vector representing the image texture features;
step four, fusion processing of the depth features and texture features,
the extracted texture features and depth features are fused; after feature extraction a coordinate attention mechanism is adopted, with a new coordinate attention block added at the last layer of the feature extraction network; coordinate attention (CA) embeds position information into the channel attention, and the coordinate attention block decomposes the channel attention into two one-dimensional feature-encoding processes that aggregate features along different directions, after which the generated feature maps are encoded separately to form a pair of direction-aware and position-sensitive feature maps;
step five, prediction on the feature maps to obtain the result,
the prediction module transfers and fuses feature information into feature maps by up-sampling and performs prediction on the resulting feature maps to obtain the final result; the prediction module comprises 4 YOLO heads, and each lower YOLO head uses the feature information of the YOLO head above it, which is up-sampled and concatenated with feature maps to obtain the feature map of that layer for prediction.
CN202211190261.5A 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning Pending CN115527105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211190261.5A CN115527105A (en) 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211190261.5A CN115527105A (en) 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning

Publications (1)

Publication Number Publication Date
CN115527105A true CN115527105A (en) 2022-12-27

Family

ID=84699071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211190261.5A Pending CN115527105A (en) 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning

Country Status (1)

Country Link
CN (1) CN115527105A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953408A (en) * 2023-03-15 2023-04-11 国网江西省电力有限公司电力科学研究院 YOLOv7-based lightning arrester surface defect detection method
CN115953408B (en) * 2023-03-15 2023-07-04 国网江西省电力有限公司电力科学研究院 YOLOv7-based lightning arrester surface defect detection method

Similar Documents

Publication Publication Date Title
CN110738697B (en) Monocular depth estimation method based on deep learning
CN110020606B (en) Crowd density estimation method based on multi-scale convolutional neural network
CN111723693B (en) Crowd counting method based on small sample learning
CN113011329B (en) Multi-scale feature pyramid network-based and dense crowd counting method
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN111488827A (en) Crowd counting method and system based on multi-scale feature information
CN113139489B (en) Crowd counting method and system based on background extraction and multi-scale fusion network
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116052212A (en) Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning
CN115527105A (en) Underwater target detection method based on multi-scale feature learning
Kavitha et al. Convolutional Neural Networks Based Video Reconstruction and Computation in Digital Twins.
CN113989718A (en) Human body target detection method facing radar signal heat map
CN117292324A (en) Crowd density estimation method and system
CN115953736A (en) Crowd density estimation method based on video monitoring and deep neural network
CN113780305B (en) Significance target detection method based on interaction of two clues
CN116403152A (en) Crowd density estimation method based on spatial context learning network
CN115797684A (en) Infrared small target detection method and system based on context information
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN112488122B (en) Panoramic image visual saliency prediction method based on convolutional neural network
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
CN114565764A (en) Port panorama sensing system based on ship instance segmentation
CN112925932A (en) High-definition underwater laser image processing system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination