CN116310997A - Deep learning-based marine small target detection method - Google Patents
Deep learning-based marine small target detection method
- Publication number: CN116310997A
- Application: CN202310399524.1A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06N3/08—Neural networks; learning methods
- G06V10/764—Image or video recognition using pattern recognition or machine learning; classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/82—Image or video recognition using neural networks
Abstract
The invention discloses a deep learning-based marine small target detection method, which comprises the following steps: step S1: acquiring a video image of an actual sea area, and splitting the video image into frames to obtain a picture data set containing target characteristics; step S2: labeling the picture data set to obtain a characteristic picture test set and a characteristic picture training set; step S3: based on the network framework of YOLOv5, constructing a YOLO-sea network model that adds an attention mechanism module SCAM++ and an enhanced bidirectional feature fusion structure PANet; step S4: adopting DIoU as the loss function of YOLO-sea, and training the YOLO-sea network model with the picture training set to obtain a YOLO-sea optimization model; step S5: performing target detection on the picture test set with the YOLO-sea optimization model. The method solves the problems that tiny targets near the sea-sky line are clustered together and interfered with by the coast and sea spray, so that the detection precision of small targets at sea is low and the real-time performance is poor.
Description
Technical Field
The invention relates to the technical field of offshore target detection, in particular to an offshore small target detection method based on deep learning.
Background
The design of an intelligent ship target detector should satisfy the following two conditions. First, high-precision detection of sea-surface obstacle targets. Second, the detection speed of the algorithm must meet real-time, low-latency requirements. Common CNN-based target detection algorithms can be divided into two classes. The first is the two-stage method based on region proposal generation, such as R-CNN [1] and Faster R-CNN [2]. These methods generate region proposals in a first stage, then classify and regress the content of each region of interest in a second stage; spatial information relating the local target to the whole image is lost, and the detection speed cannot reach real time. The second class is single-stage object detectors such as YOLO [3], RetinaNet [4] and SSD [5]. These algorithms do not generate regions of interest, but instead treat object detection as a regression task over the entire image. The speed of single-stage detectors can therefore mostly meet real-time requirements, but they suffer from low object-positioning accuracy and poor recall. YOLOv5 can infer multiple objects at a time with extremely high detection speed [6]. In addition, it further remedies the low detection precision of YOLOv3 [7] and YOLOv4 [8] through adaptive anchor-frame design and optimization of the network module structure.
In a real offshore navigation scene, marine optical images collected by a shipborne camera contain a large number of small-scale targets. Because of perspective distortion in photogrammetry, remote objects appear smaller, and accurate detection of small objects at sea is a well-known problem [9]-[11]. On a high-definition image with a resolution of 1920×1080, the labeling frame of an object near the ship can differ in size from that of an obstacle target near the sea-sky line by more than 120 times. Notably, very small targets near the sea-sky line are clustered together and disturbed by the coast and sea spray, which greatly increases the difficulty of detecting small targets at sea. Because existing algorithm models cannot meet the accurate, real-time obstacle detection requirements of the offshore automatic driving scene, it is desirable to design a detection method that can efficiently handle offshore targets.
The references to the above techniques are as follows:
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, Jun. 2014, pp. 580–587. doi:10.1109/CVPR.2014.81.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016. doi:10.1109/CVPR.2016.91.
[4] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, Feb. 2020. doi:10.1109/TPAMI.2018.2858826.
[5] W. Liu et al., "SSD: Single Shot MultiBox Detector," in Proceedings of the European Conference on Computer Vision (ECCV), 2016. doi:10.1007/978-3-319-46448-0_2.
[6] Ultralytics, YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 November 2022).
[7] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767.
[8] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv:2004.10934. [Online]. Available: http://arxiv.org/abs/2004.10934.
[9] Feng Hui, Jiang Chengxin, Ding Yihang, "Real-time small-scale target detection algorithm for real sea areas" [J/OL], Journal of Huazhong University of Science and Technology (Natural Science Edition): 1–7 [2023-03-16]. https://doi.org/10.13245/j.hurt.230404.
[10] Zhao Wenjiang, Sun Wei, "A marine target detection and identification method based on S4-YOLO" [J], Optics & Optoelectronic Technology, 2020, 18(4): 38–46.
[11] B. Iancu, V. Soloviev, L. Zelioli, and J. Lilius, "ABOships—An Inshore and Offshore Maritime Vessel Detection Dataset with Precise Annotations," Remote Sensing, vol. 13, no. 5, p. 988, Mar. 2021. doi:10.3390/rs13050988.
Disclosure of Invention
The invention provides a deep learning-based marine small target detection method, which aims to solve the technical problems that small targets near the sea-sky line are clustered together and interfered with by the coast and sea spray, causing low detection precision and poor real-time performance for small targets at sea.
In order to achieve the above object, the technical scheme of the present invention is as follows: a deep learning-based marine small target detection method comprises the following steps:
step S1: collecting video images of an actual sea area, and splitting the video images into frames to obtain a picture data set containing target characteristics; the target features include contours, textures, and colors;
step S2: labeling the picture data set to obtain a characteristic picture test set and a characteristic picture training set;
step S3: based on a network framework of YOLOv5, constructing a YOLO-sea network model added with an attention mechanism module SCAM++ and an enhanced bidirectional feature fusion structure PANet;
step S4: the DIoU is adopted as a loss function of the YOLO-sea, and a picture training set is utilized to train the YOLO-sea network model to obtain a YOLO-sea optimization model;
step S5: and performing target detection on the picture test set by adopting the YOLO-sea optimization model.
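Steps S1 and S2 amount to extracting labeled frames and partitioning them into a training set and a test set. The following is a minimal sketch of that split and of the normalized label format that annotation tools such as LabelImg can export; the 80/20 ratio, the random seed, and the frame ids are illustrative assumptions, not values taken from the patent.

```python
import random

def split_dataset(image_ids, train_ratio=0.8, seed=42):
    """Shuffle labeled frame ids and split them into a training set and a
    test set (steps S1-S2). The 80/20 ratio is an assumption."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

def yolo_label_line(cls_id, cx, cy, w, h):
    """One annotation line in the normalized YOLO txt format
    (class cx cy w h, all coordinates in [0, 1])."""
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

train, test = split_dataset([f"frame_{i:04d}" for i in range(100)])
print(len(train), len(test))  # 80 20
print(yolo_label_line(0, 0.5, 0.5, 0.1, 0.2))
```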
Further, the structure of the YOLO-sea network model in step S3 includes processing the picture test set acquired in step S2 sequentially through a backbone network system, a feature fusion network system and a detection head network system;
the main network system comprises an initial layer convolution Conv module, a middle layer network module, an attention mechanism module SCAM++ and an SPPF module;
the initial layer convolution Conv module performs feature extraction on training pictures in a picture training set to obtain an initial feature map, wherein the target features comprise the outline, texture and color of a target;
the middle layer network module comprises a first feature extraction unit, a second feature extraction unit, a third feature extraction unit and a fourth feature extraction unit which are sequentially connected, wherein each of the first feature extraction unit, the second feature extraction unit, the third feature extraction unit and the fourth feature extraction unit consists of a deformable convolution DCN v2 module and a first C3 module;
the deformable convolution DCN v2 module is used for giving a set weight coefficient to the offset of each sampling point on the characteristic diagram input to the deformable convolution DCN v2 module; the offset of the sampling point comprises the size, angle transformation and the proportion of the target feature relative to the feature map;
The C3 module is used for carrying out residual feature learning and feature fusion on the feature map output by the deformable convolution DCN v2 module;
the first feature extraction unit performs feature extraction and fusion on the initial feature map to obtain a low-level feature map; transmitting the low-level feature map to a second feature extraction unit;
the second feature extraction unit performs feature extraction and fusion on the low-level feature map to obtain a first middle-level feature map; transmitting the first middle-layer feature map to a third feature extraction unit;
the third feature extraction unit performs feature extraction and fusion on the first middle-layer feature map to obtain a second middle-layer feature map; transmitting the second middle-layer feature map to a fourth feature extraction unit;
the fourth feature extraction unit performs feature extraction and fusion on the second middle-layer feature map to obtain a high-layer feature map; transmitting the high-level feature map to the attention mechanism module SCAM++;
the attention mechanism module SCAM++ carries out weighting processing on the high-layer feature map to obtain an optimized feature map, and transmits the optimized feature map to the SPPF module;
and the SPPF module performs feature fusion on the optimized feature map to obtain feature maps with different scales.
Further, the feature fusion network system comprises a first feature fusion unit, a second feature fusion unit, a third feature fusion unit, a fourth feature fusion unit, a fifth feature fusion unit and a sixth feature fusion unit which are sequentially connected;
the first feature fusion unit, the second feature fusion unit and the third feature fusion unit comprise a first convolution Conv module, an upsampling module, a first Concat module and a second C3 module which are sequentially connected;
the fourth feature fusion unit, the fifth feature fusion unit and the sixth feature fusion unit each comprise a second convolution Conv module, a second Concat module and a third C3 module which are sequentially connected;
the first convolution Conv module and the second convolution Conv module are used for reducing the size of an input feature map and extracting features; the dimension of the feature map comprises a feature map width W, a feature map height H and resolution;
the Upsample up-sampling module is used for enlarging the size of the feature map while the number N of feature map channels remains unchanged;
the first Concat module and the second Concat module are used for increasing the number N of channels of the feature map under the condition that the size of the feature map is kept unchanged; the feature map is optimized by combining the strong semantic information of the high-level feature map and the positioning information of the low-level feature map;
The strong semantic information is the coarse-grained and fine-grained information of the image, and the positioning information comprises the texture, color, edge and corner information of the target feature;
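The size and channel behavior of the Conv, Upsample and Concat modules described above can be illustrated with a minimal NumPy sketch, assuming (channels, height, width) arrays and nearest-neighbour interpolation for the Upsample module; the 256-channel, 40×40/20×20 shapes are illustrative assumptions.

```python
import numpy as np

def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x up-sampling: the channel count N is unchanged,
    only the spatial size (H, W) doubles, as the Upsample module does."""
    # fmap has shape (N, H, W); repeat each row and each column twice.
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def concat_channels(a, b):
    """Concat module: stack two feature maps of identical spatial size
    along the channel axis, so N grows while (H, W) stays fixed."""
    return np.concatenate([a, b], axis=0)

low = np.ones((256, 40, 40))   # low-level map: strong positioning cues
high = np.ones((256, 20, 20))  # high-level map: strong semantic cues
fused = concat_channels(low, upsample_nearest_2x(high))
print(fused.shape)  # (512, 40, 40)
```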
the output end of the first feature extraction unit and the output end of the Upsample up-sampling module of the third feature fusion unit are connected to a second C3 module of the third feature fusion unit after channel splicing is performed through a first Concat module of the third feature fusion unit; the second C3 module of the third feature fusion unit is used as an input module of a target detection head with an extremely small scale and is connected with the detection head network system;
the output end of the second feature extraction unit is connected with a first convolution Conv module of a third feature fusion unit, the output end of the first convolution Conv module of the third feature fusion unit is connected with the output end of an Upsample up-sampling module of the second feature fusion unit, and the output end of the Upsample up-sampling module is input to a second C3 module of the second feature fusion unit after channel splicing is performed through a first Concat module of the second feature fusion unit;
the output end of the first convolution Conv module of the third feature fusion unit and the output end of the second convolution Conv module of the fourth feature fusion unit are connected to a third C3 module of the fourth feature fusion unit after being spliced through a second Concat module of the fourth feature fusion unit; the third C3 module of the fourth feature fusion unit is used as an input module of a small-scale target detection head and is connected with the detection head network system;
The output end of the third feature extraction unit is connected with the first convolution Conv module of the second feature fusion unit; the output end of the third feature extraction unit and the output end of the Upsample up-sampling module of the first feature fusion unit are channel-spliced by the first Concat module of the first feature fusion unit and then input to the second C3 module of the first feature fusion unit;
the output end of the first convolution Conv module of the second feature fusion unit and the output end of the second convolution Conv module of the fifth feature fusion unit are connected to a third C3 module of the fifth feature fusion unit after being spliced through a second Concat module of the fifth feature fusion unit; the third C3 module of the fifth feature fusion unit is used as an input module of a mesoscale target detection head and is connected with the detection head network system;
the output end of the first convolution Conv module of the first feature fusion unit and the output end of the second convolution Conv module of the sixth feature fusion unit are connected to the third C3 module of the sixth feature fusion unit after channel splicing is performed through the second Concat module of the sixth feature fusion unit; the third C3 module of the sixth feature fusion unit is used as the input module of the large-scale target detection head and is connected with the detection head network system;
And the detection head network system performs merging processing on the output characteristic diagram of the second C3 module of the third characteristic fusion unit, the output characteristic diagram of the third C3 module of the fourth characteristic fusion unit, the output characteristic diagram of the third C3 module of the fifth characteristic fusion unit and the output characteristic diagram of the third C3 module of the sixth characteristic fusion unit based on an NMS non-maximum suppression algorithm to obtain a final detection picture of the small offshore target.
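The merging step performed by the detection head network system can be illustrated with a minimal pure-Python greedy NMS sketch; the corner-coordinate box format, the 0.5 IoU threshold and the example boxes are assumptions for illustration, not values from the patent.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box and
    drop every remaining box that overlaps it by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — the near-duplicate box 1 is suppressed
```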
Further, each C3 module includes a first branch and a second branch;
the first branch comprises a third convolution Conv module and a plurality of bottlenecks which are sequentially connected in series; the second branch comprises a fourth convolution Conv module; the output ends of the two branches are spliced by a third Concat module and then are connected in series with a fifth convolution Conv module.
Further, in step S3, the attention mechanism module scam++ includes a channel attention module and a spatial attention module;
the channel attention module comprises a first global pooling layer, a first average pooling layer, a full connection layer and a Silu activation function layer;
the first global pooling layer and the first average pooling layer respectively acquire feature graphs output by the eighth layer C3 module, and perform self-adaptive average pooling and self-adaptive maximum pooling operation to acquire global feature vectors;
The features of the global feature vector include color, texture, and contour in the dimension direction;
the adaptive average pooling and adaptive max pooling operations of the first average pooling layer and the first global pooling layer are expressed as follows:
F_A = (1/(H×W)) Σ_{x=1..H} Σ_{y=1..W} a(x, y)
F_B = max_{(x,y)} a(x, y)
wherein: F_A is the global feature vector obtained by adaptive average pooling of the feature map; F_B is the global feature vector obtained by adaptive max pooling of the feature map; H is the height of the feature map; W is the width of the feature map; a(x, y) is the feature value at each pixel position in the feature map;
transmitting the two global feature vectors obtained after pooling to a full connection layer of a multi-layer perceptron MLP, and carrying out convolution classification to obtain an optimized feature vector;
the optimized feature vector adopts a sigmoid activation function to generate the channel attention weight M_C, expressed as:
M_C = sigmoid(MLP(F_A) + MLP(F_B))
The weight distribution of the channel attention weight M_C is improved by adopting the Silu activation function, obtaining an optimal feature vector with optimal weight values; the Silu activation function is expressed as:
Silu(x) = x × sigmoid(x)
The channel attention weight M_C obtained after the improved weight distribution can be expressed as:
M_C = Silu(MLP(F_A) + MLP(F_B))
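The channel attention path described above (adaptive average/max pooling, a shared MLP, and a Silu-weighted sum) can be sketched in NumPy. The random MLP weights, the ReLU hidden activation, and the channel-reduction ratio of 2 are illustrative assumptions not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # Silu(x) = x * sigmoid(x), as in the text
    return x * sigmoid(x)

def channel_attention(fmap, w1, w2):
    """fmap: (C, H, W). Global average and max pooling give two C-vectors
    (F_A, F_B); both pass through the same two-layer MLP (w1, w2) and the
    summed result goes through Silu to give per-channel weights M_C."""
    f_avg = fmap.mean(axis=(1, 2))                 # F_A
    f_max = fmap.max(axis=(1, 2))                  # F_B
    mlp = lambda v: w2 @ np.maximum(0.0, w1 @ v)   # ReLU hidden layer (assumption)
    return silu(mlp(f_avg) + mlp(f_max))           # M_C, shape (C,)

C, H, W = 8, 4, 4
fmap = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 2, C))   # reduction ratio 2 (assumption)
w2 = rng.standard_normal((C, C // 2))
m_c = channel_attention(fmap, w1, w2)
print(m_c.shape)  # (8,)
```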
the spatial attention module comprises a second global pooling layer, a second average pooling layer, a convolution layer and a sigmoid activation function layer;
The feature images output by the first C3 module of the fourth feature extraction unit are compressed in the channel dimension through a second global pooling layer and a second average pooling layer respectively to obtain a multidimensional feature image;
after channel splicing is carried out on the two obtained multidimensional feature graphs, the multidimensional feature graphs are accessed into the convolution layer to carry out dimension reduction treatment on the multidimensional feature graphs, so as to obtain dimension reduction feature graphs;
the dimension-reduced feature map generates the spatial attention weight M_S through a sigmoid activation function, expressed as:
M_S = sigmoid(conv(F_A, F_B))
Adding the channel attention module and the spatial attention module yields the attention mechanism module SCAM++, expressed as:
M_SCAM++ = M_C + M_S
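The spatial attention path can be sketched the same way: channel-wise mean and max pooling, channel splicing, a dimension-reducing convolution, and a sigmoid. For brevity this sketch assumes a 1×1 convolution with hand-picked weights; the text only specifies "a convolution layer", so the kernel size and weights are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(fmap, w):
    """fmap: (C, H, W). Channel-wise mean and max pooling give two (H, W)
    maps; they are stacked into a 2-channel map and reduced to one channel
    by a 1x1 convolution with weights w = (w_mean, w_max), then squashed
    by a sigmoid into the spatial weight map M_S."""
    pooled = np.stack([fmap.mean(axis=0), fmap.max(axis=0)])  # (2, H, W)
    reduced = w[0] * pooled[0] + w[1] * pooled[1]             # 1x1 conv
    return sigmoid(reduced)                                   # (H, W), in (0, 1)

fmap = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
m_s = spatial_attention(fmap, w=(0.5, 0.5))
print(m_s.shape)  # (3, 3)
```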
Further, in step S4, DIoU is adopted as the loss function for target frame regression; DIoU calculates the distance loss between the object prediction frame and the real frame, with the following calculation formula:
L_DIoU = 1 − IoU + ρ²(b, b_gt) / c²
wherein b and b_gt respectively represent the center points of the prediction frame and the real object frame; ρ denotes the Euclidean distance, so ρ(b, b_gt) is the Euclidean distance between the center points of the prediction frame and the real object frame; c represents the diagonal length of the smallest enclosing rectangle of the prediction frame and the real object frame; IoU is the intersection-over-union of the two frames.
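The DIoU loss just described can be written as a minimal pure-Python sketch for axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption for illustration.

```python
def diou_loss(pred, gt):
    """DIoU loss for boxes (x1, y1, x2, y2):
    L = 1 - IoU + rho^2(b, b_gt) / c^2, where rho is the distance between
    the two box centres and c the diagonal of their smallest enclosing box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(pred) + area(gt) - inter)
    # squared centre distance rho^2
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2
    # squared diagonal c^2 of the smallest enclosing rectangle
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return 1.0 - iou + rho2 / c2

print(diou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for a perfect match
```

Unlike plain IoU, the distance term keeps the gradient informative even for non-overlapping boxes, which is what improves regression on distant small targets.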
The beneficial effects are that: the invention provides a deep learning-based marine small target detection method which adopts expanded deformable convolution to enhance modeling capability, increases attention to small-target gathering areas through a parallel spatial-channel attention mechanism, designs an enhanced bidirectional feature fusion module for feature fusion, introduces skip connections and Concat channel splicing to combine the positioning information of low-level feature maps with the strong semantic information of high-level feature maps, adds an extremely-small-target detection layer to improve the positioning precision of long-distance targets near the sea-sky line, and finally improves the regression capability of the model's detection target frame by optimizing the target loss function, thereby comprehensively improving the detection precision of small targets at sea while ensuring real-time performance.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a deep learning-based marine small target detection method of the present invention;
FIG. 2 is a schematic diagram of the overall construction of the YOLO-sea of the deep learning-based marine small target detection method;
FIG. 3 is a schematic view of a SCAM++ module of the deep learning-based marine small target detection method;
FIG. 4 is a schematic diagram of an enhanced two-way feature fusion module PANet of the marine small target detection method based on deep learning;
FIG. 5 is a schematic diagram of a C3 module of the deep learning-based marine small target detection method;
FIG. 6 is a schematic diagram of SPPF module of a deep learning-based marine small target detection method of the present invention;
FIG. 7 is a schematic diagram of a Concat module of the deep learning-based marine small target detection method of the present invention;
FIG. 8 is a schematic diagram of the Upsample up-sampling module of the deep learning-based marine small target detection method of the present invention;
FIG. 9 is a schematic diagram of an enhanced two-way feature fusion module PANet of the marine small target detection method based on deep learning;
fig. 10 is a schematic diagram of a loss function in a deep learning-based marine small target detection method according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment provides a deep learning-based marine small target detection method, as shown in fig. 1, comprising the following steps:
step S1: collecting video images of an actual sea area, and carrying out framing treatment on the video images to obtain a picture data set containing target characteristics; the target features include contours, textures, and colors; the target characteristics further comprise target category information, wherein the target category information is characteristic information of different targets in the image;
Step S2: manually checking and labeling the picture data set with the LabelImg annotation software to obtain a characteristic picture test set and a characteristic picture training set;
step S3: based on a network framework of YOLOv5, constructing a YOLO-sea network model added with an attention mechanism module SCAM++ and an enhanced bidirectional feature fusion structure PANet;
step S4: the DIoU is adopted as a loss function of the YOLO-sea, and a picture training set is utilized to train the YOLO-sea network model to obtain a YOLO-sea optimization model;
step S5: and performing target detection on the picture test set by adopting the YOLO-sea optimization model.
The invention designs a deep learning-based marine small target detection method. First, a single-stage anchor-based target detector generally consists of a backbone network, a feature fusion part (generally called the Neck), and a detection head part for target classification and positioning; in a network framework taking YOLOv5 as the baseline, the backbone network is optimized with deformable convolution DCN v2, and the attention mechanism module SCAM++ is introduced so that the network focuses more on small-target areas. Second, in the feature fusion part, an enhanced bidirectional feature fusion structure PANet is designed, so that strong semantic information is introduced into the low-level feature maps and positioning information is introduced into the high-level feature maps; the fusion efficiency between feature maps of different scales is further enhanced through skip connections and multi-layer feature fusion. Finally, an extremely-small-target detection layer is added, so that prediction of small offshore targets is carried out on the low-level feature map, improving the positioning accuracy of remote, extremely small offshore targets. The YOLO-sea optimization model fully considers the characteristics and position distribution of offshore targets and is suited to small target detection in offshore automatic driving; it can accurately position remote small targets at sea, supports real-time operation, achieves a good compromise between accuracy and real-time performance, and provides a good foundation for the intelligent navigation function module of intelligent ships.
In a specific embodiment, as shown in fig. 2, the YOLO-sea network model in step S3 processes the picture test set acquired in step S2 sequentially through a main network system, a feature fusion network system and a detection head network system 18;
the main network system comprises an initial layer convolution Conv module, a middle layer network module 11, an attention mechanism module SCAM++ and an SPPF module;
the initial layer convolution Conv module performs feature extraction on training pictures in a picture training set to obtain an initial feature map, wherein the features comprise the outline, texture and color of a target;
the intermediate layer network module 11 includes a first feature extraction unit 101, a second feature extraction unit 102, a third feature extraction unit 103, and a fourth feature extraction unit 104 that are sequentially connected, where each of the first feature extraction unit 101, the second feature extraction unit 102, the third feature extraction unit 103, and the fourth feature extraction unit 104 is composed of a deformable convolution DCN v2 module and a first C3 module;
the deformable convolution DCN v2 module is used for giving a set weight coefficient to the offset of each sampling point on the characteristic diagram input to the deformable convolution DCN v2 module; the offset of the sampling point comprises the size, angle transformation and the proportion of the target feature relative to the feature map;
The C3 module is used for carrying out residual feature learning and feature fusion on the feature map output by the deformable convolution DCN v2 module;
the first feature extraction unit 101 performs feature extraction and fusion on the initial feature map to obtain a low-level feature map; and transmits the low-level feature map to the second feature extraction unit 102;
the second feature extraction unit 102 performs feature extraction and fusion on the low-level feature map to obtain a first middle-level feature map; and transmits the first middle-layer feature map to a third feature extraction unit 103;
the third feature extraction unit 103 performs feature extraction and fusion on the first middle-layer feature map to obtain a second middle-layer feature map; and transmits the second middle-level feature map to a fourth feature extraction unit 104;
the fourth feature extraction unit 104 performs feature extraction and fusion on the second middle-layer feature map to obtain a high-layer feature map; transmitting the high-level feature map to the attention mechanism module SCAM++;
the attention mechanism module SCAM++ performs weighting processing on the high-level feature map to obtain an optimized feature map, and transmits the optimized feature map to the SPPF module;
and the SPPF module performs feature fusion on the optimized feature map to obtain feature maps with different scales.
A network framework taking YOLOv5 as a reference is adopted, and deformable convolution is introduced into the model. First, deformable convolution is used to optimize the main network: conventional convolution extracts features by convolving over a regular sampling window, and therefore adapts poorly to the geometric transformations of a target. DCN v1 describes target characteristics with learnable offsets and can better adapt to the geometric transformations of offshore targets through changes of size, proportion and angle. As shown in fig. 4, the deformable convolution DCN v2 module assigns a weight coefficient to the offset of each sampling point on the basis of DCN v1, so as to evaluate whether the introduced area is a region of interest, further improving the accuracy of the extracted information. In addition, the deformable convolution DCN v2 module also stacks more deformable convolution layers, enhancing the network's feature extraction capacity for various geometric deformations. A deformable convolution DCN v2 module is introduced into the YOLOv5 backbone network; in fig. 4, circles represent the sampling points of the convolution. The deformable convolution can cover a more effective area, so that better target features can be extracted from the offshore image.
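As a rough illustration of the modulated sampling idea (a sketch, not the patent's implementation), the following pure-Python fragment computes one output of a DCN v2-style 3×3 convolution: each sampling point is shifted by a learned offset and scaled by a learned modulation weight in [0, 1]. The function names and the single-channel scalar feature map are illustrative assumptions.

```python
import math

def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate fmap (list of lists) at fractional (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = max(int(math.floor(y)), 0), max(int(math.floor(x)), 0)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def dcnv2_response(fmap, center, weights, offsets, modulations):
    """One DCN v2-style output: each of the nine 3x3 taps is shifted by a
    learned offset (oy, ox) and scaled by a learned modulation weight m."""
    cy, cx = center
    taps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for (dy, dx), w, (oy, ox), m in zip(taps, weights, offsets, modulations):
        out += w * m * bilinear_sample(fmap, cy + dy + oy, cx + dx + ox)
    return out
```

With all offsets zero and all modulations one, this reduces to a plain 3×3 convolution; a modulation near zero effectively discards an uninteresting sampling region.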
as shown in fig. 9, further, the feature fusion network system includes a first feature fusion unit 12, a second feature fusion unit 13, a third feature fusion unit 14, a fourth feature fusion unit 15, a fifth feature fusion unit 16, and a sixth feature fusion unit 17 that are sequentially connected;
The first feature fusion unit 12, the second feature fusion unit 13 and the third feature fusion unit 14 all comprise a first convolution Conv module, an upsampled upsampling module, a first Concat module and a second C3 module which are sequentially connected;
the fourth feature fusion unit 15, the fifth feature fusion unit 16 and the sixth feature fusion unit 17 each include a second convolution Conv module, a second Concat module and a third C3 module that are sequentially connected;
the first convolution Conv module and the second convolution Conv module are used for reducing the size of an input feature map and extracting features; the dimension of the feature map comprises a feature map width W, a feature map height H and resolution;
the Upsampled up-sampling module is used for carrying out size amplification on the feature map under the condition that the number N of the feature map channels is unchanged;
the first Concat module and the second Concat module are used for increasing the number N of channels of the feature map under the condition that the size of the feature map is kept unchanged; the feature map is optimized by combining the strong semantic information of the high-level feature map and the positioning information of the low-level feature map;
the strong semantic information is the coarse-grained information of the image, and the positioning information includes the texture, color, edge and corner information of the target features;
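The shape bookkeeping of these three module types can be sketched in a few lines of plain Python (shapes as (N, H, W) tuples; the stride-2, kernel-3, auto-padded Conv halves H and W rounding up — a sketch under those assumptions, not the patent's code):

```python
def conv_shape(n_out, shape):
    """Stride-2 Conv module: new channel count, H and W halved (round up)."""
    n, h, w = shape
    return (n_out, (h + 1) // 2, (w + 1) // 2)

def upsample_shape(shape):
    """2x up-sampling: channel count N unchanged, H and W doubled."""
    n, h, w = shape
    return (n, 2 * h, 2 * w)

def concat_shape(a, b):
    """Concat module: H and W must match; channel counts add up."""
    assert a[1:] == b[1:], "spatial sizes must agree for channel splicing"
    return (a[0] + b[0], a[1], a[2])
```

For example, a 64-channel 80×80 map passed through a 128-filter Conv becomes (128, 40, 40); up-sampled again it regains 80×80, and splicing it with a 64-channel 80×80 map yields 192 channels.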
The output end of the first feature extraction unit 101 and the output end of the Upsample upsampling module of the third feature fusion unit 14 are connected to the second C3 module of the third feature fusion unit 14 after channel splicing is performed by the first Concat module of the third feature fusion unit 14; the second C3 module of the third feature fusion unit 14 is used as an input module of a target detection head with a very small scale, and is connected with the detection head network system;
the output end of the second feature extraction unit 102 is connected with a first convolution Conv module of the third feature fusion unit 14, and the output end of the first convolution Conv module of the third feature fusion unit 14 is connected with the output end of an upsampled up-sampling module of the second feature fusion unit 13, and the output end is input to a second C3 module of the second feature fusion unit 13 after being subjected to channel splicing through a first Concat module of the second feature fusion unit 13;
the output end of the first convolution Conv module of the third feature fusion unit 14 and the output end of the second convolution Conv module of the fourth feature fusion unit 15 are input to the third C3 module of the fourth feature fusion unit 15 after being subjected to channel splicing by the second Concat module of the fourth feature fusion unit 15; the third C3 module of the fourth feature fusion unit 15 is used as an input module of a small-scale target detection head and is connected with the detection head network system;
The output end of the third feature extraction unit 103 is connected with a first convolution Conv module of the second feature fusion unit 13, and the output end of the third feature extraction unit 103 is connected with the output end of an upsampled upsampling module of the first feature fusion unit 12, and is input to a C3 module of the first feature fusion unit 12 after being subjected to channel splicing through a first Concat module of the first feature fusion unit 12;
the output end of the first convolution Conv module of the second feature fusion unit 13 and the output end of the second convolution Conv module of the fifth feature fusion unit 16 are input to the third C3 module of the fifth feature fusion unit 16 after being subjected to channel splicing by the second Concat module of the fifth feature fusion unit 16; the third C3 module of the fifth feature fusion unit 16 is used as an input module of a mesoscale target detection head and is connected with the detection head network system;
the output end of the first convolution Conv module of the first feature fusion unit 12 and the output end of the second convolution Conv module of the sixth feature fusion unit 17 are input to the third C3 module of the sixth feature fusion unit 17 after being subjected to channel splicing by the second Concat module of the sixth feature fusion unit 17; the third C3 module of the sixth feature fusion unit 17 is used as an input module of a large-scale target detection head, and is connected with the detection head network system; the detection head network system comprises third convolution Conv modules respectively connected with the output end of the second C3 module of the third feature fusion unit 14, the output end of the third C3 module of the fourth feature fusion unit 15, the output end of the third C3 module of the fifth feature fusion unit 16 and the output end of the third C3 module of the sixth feature fusion unit 17;
The detection head network system 18 performs merging processing on the output feature map of the second C3 module of the third feature fusion unit 14, the output feature map of the third C3 module of the fourth feature fusion unit 15, the output feature map of the third C3 module of the fifth feature fusion unit 16, and the output feature map of the third C3 module of the sixth feature fusion unit 17 based on the NMS non-maximum suppression algorithm, so as to obtain a final detection picture of the small offshore target.
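The NMS merging step named above is the standard greedy non-maximum suppression algorithm; a minimal sketch for corner-format boxes (the (x1, y1, x2, y2) format and the 0.5 threshold are illustrative choices, not the patent's parameters):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(detections, iou_thresh=0.5):
    """Greedy NMS over a list of (box, score): keep each box only if it
    does not overlap an already-kept, higher-scoring box too much."""
    keep = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        if all(iou(box, k) <= iou_thresh for k, _ in keep):
            keep.append((box, score))
    return keep
```

Applied to the candidate boxes produced by all four detection heads, this collapses duplicate detections of the same offshore target into the single highest-confidence box.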
The enhanced bidirectional feature fusion structure PANet discards the idea of weighted fusion and combines bidirectional skip connections with Concat-module feature fusion; it does not compress the channel dimension, preserving features as much as possible at the cost of a preset amount of memory. As shown in fig. 7, each Concat-module feature fusion can fully utilize the feature channel resources, so that better detection performance is obtained. In addition, a very-small-target detection head is added on the basis of the three detection heads of the original YOLOv5 network, which effectively relieves the negative effects caused by scale variance. The added very-small-target detection head operates on a low-level feature map, which is large and dense and contains richer edge and texture features of targets; small targets can therefore be positioned more accurately on the low-level feature map, improving the detection capability for very small targets near the sea-sky line. Each convolution Conv module encapsulates three functions: a two-dimensional convolution operation Conv2d, batch normalization (Batch Normalization), and the activation function SiLU. The padding of Conv2d in each Conv module is calculated automatically, and the reduction factor of the feature map is determined by the stride; in this example, the stride of every convolution Conv module in the network is 2 and every convolution kernel is 3, so each convolution Conv module halves the width and height of the feature map. The functions of the convolution Conv module are feature extraction and feature map arrangement. Batch Normalization is a normalization layer that normalizes the data of each batch; SiLU is an activation function used to introduce nonlinearity into the feature map.
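The automatic padding and stride-2 halving just described follow the usual convolution output-size formula; a one-line sketch, assuming the conventional 'same'-style auto padding of kernel // 2 used in YOLOv5-family code:

```python
def conv_out_size(size, kernel=3, stride=2):
    """Output spatial size of a conv layer with auto ('same'-style) padding."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1
```

With kernel 3 and stride 2, a 640-pixel input side becomes 320, then 160, and so on down the backbone; with stride 1 the size is preserved.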
As shown in fig. 6, the SPPF module is a spatial pyramid pooling module. By serially connecting a plurality of max pooling layers MaxPool2d, pooling operations with a 5×5 kernel replace the pooling operations with 9×9 and 13×13 kernels, improving the calculation speed of the network while keeping the receptive field the same. Each Concat module is a channel splicing module whose function is to increase the number N of channels while keeping the width W and height H of the feature map unchanged. As shown in fig. 8, the Upsample up-sampling module is used to enlarge the size of the feature map while the number N of channels remains unchanged.
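The receptive-field equivalence behind SPPF — stacked small max-pooling windows reproducing one large window — can be checked with a 1-D, stride-1 sketch (illustrative only, not the module's code):

```python
def maxpool1d(xs, k):
    """Stride-1 max pooling with 'same' padding (pad = k // 2)."""
    p = k // 2
    padded = [float('-inf')] * p + list(xs) + [float('-inf')] * p
    return [max(padded[i:i + k]) for i in range(len(xs))]
```

Two serial size-5 pools cover a window of 9, and three cover a window of 13, so chaining the small pools reproduces the large-kernel results of SPP at lower cost.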
In a specific embodiment, as shown in fig. 5, each C3 module is used for residual feature learning, and each C3 module includes a first branch and a second branch; the first branch comprises a third convolution Conv module and a plurality of bottleneck layers in series; the second branch comprises a fourth convolution Conv module; the output ends of the two branches are spliced by a third Concat module and then connected in series with a fifth convolution Conv module. The functions of these convolution Conv modules and Concat modules are the same as described above.
In a specific embodiment, as shown in fig. 3, the attention mechanism module scam++ in step S3 includes a channel attention module and a spatial attention module;
The channel attention module comprises a first global pooling layer, a first average pooling layer, a full connection layer and a Silu activation function layer;
the first global pooling layer and the first average pooling layer respectively acquire feature graphs output by the eighth layer C3 module, and perform self-adaptive average pooling and self-adaptive maximum pooling operation to acquire global feature vectors;
the features of the global feature vector include color, texture, and contour in the dimension direction;
the pooling operations of the first global pooling layer and the first average pooling layer are expressed by the following formulas:

F_A = (1/(H×W)) Σ_{x=1}^{H} Σ_{y=1}^{W} a(x,y)

F_B = max(a(x,y))

wherein F_A is the global feature vector obtained by adaptive average pooling of the feature map; F_B is the global feature vector obtained by adaptive max pooling of the feature map; H is the height of the feature map; W is the width of the feature map; and a(x,y) is the feature value at each pixel position in the feature map;
transmitting the two global feature vectors obtained after pooling to a full connection layer of a multi-layer perceptron MLP, and carrying out convolution classification to obtain an optimized feature vector;
the optimized feature vectors generate the channel attention weight M_C through a sigmoid activation function, expressed as

M_C = sigmoid(MLP(F_A) + MLP(F_B))

The weight distribution of the channel attention weight M_C is improved by adopting the activation function SiLU, obtaining an optimal feature vector with optimal weight values; the activation function SiLU is expressed as:

SiLU(x) = x × Sigmoid(x)

The channel attention weight M_C obtained after the improved weight distribution can be expressed as:

M_C = SiLU(MLP(F_A) + MLP(F_B))
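The channel-attention computation above can be sketched per channel in plain Python. The shared MLP is replaced here by an identity placeholder, so the numbers only illustrate the pooling → sum → SiLU flow, not trained behavior:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU(x) = x * Sigmoid(x)."""
    return x * sigmoid(x)

def channel_attention(fmap):
    """fmap: list of 2-D channels. Per channel, take the global average
    (F_A) and global max (F_B), pass both through a shared MLP (identity
    placeholder here), sum, and squash with SiLU into a channel weight."""
    mlp = lambda v: v  # placeholder for the shared fully connected layers
    weights = []
    for ch in fmap:
        flat = [v for row in ch for v in row]
        f_avg = sum(flat) / len(flat)   # F_A
        f_max = max(flat)               # F_B
        weights.append(silu(mlp(f_avg) + mlp(f_max)))
    return weights
```

An all-zero channel gets weight SiLU(0) = 0, while channels with strong responses receive weights close to their pooled sum, which is the amplification effect the module relies on.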
the spatial attention module comprises a second global pooling layer, a second average pooling layer, a convolution layer and a sigmoid activation function layer;
the feature images output by the first C3 module of the fourth feature extraction unit are compressed in the channel dimension through a second global pooling layer and a second average pooling layer respectively to obtain a multidimensional feature image;
after channel splicing is carried out on the two obtained multidimensional feature graphs, the multidimensional feature graphs are accessed into the convolution layer to carry out dimension reduction treatment on the multidimensional feature graphs, so as to obtain dimension reduction feature graphs;
the dimension-reduced feature map generates the spatial attention weight M_S through a sigmoid activation function, expressed as

M_S = sigmoid(conv(F_A, F_B))
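Likewise, the spatial attention map can be sketched per pixel; the 1×1 convolution is stood in for by a fixed weighted sum (the weights w_avg and w_max are illustrative assumptions, not learned values):

```python
import math

def spatial_attention(fmap, w_avg=0.5, w_max=0.5):
    """fmap: [channel][row][col]. The channel dimension is compressed
    with per-pixel mean and max pooling; the two maps are combined by a
    1x1-convolution stand-in (fixed weighted sum) and squashed with a
    sigmoid into the spatial attention weight M_S."""
    n = len(fmap)
    h, w = len(fmap[0]), len(fmap[0][0])
    ms = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [fmap[c][y][x] for c in range(n)]
            z = w_avg * (sum(vals) / n) + w_max * max(vals)
            row.append(1.0 / (1.0 + math.exp(-z)))
        ms.append(row)
    return ms
```

A pixel with zero response in every channel gets the neutral weight 0.5, while pixels with strong channel responses are pushed toward 1 and thus emphasized.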
The channel attention module and the spatial attention module are added to obtain the attention mechanism module SCAM++, expressed as

M_SCAM++ = M_C + M_S
In a specific embodiment, as shown in fig. 10, in step S4, DIoU is used as the loss function for target frame regression; DIoU calculates the distance loss between the object prediction frame and the real frame, with the calculation formula:

L_DIoU = 1 − IoU + ρ²(b, b^gt) / c²

wherein b and b^gt respectively represent the center points of the prediction frame and the real object frame; ρ(·) represents the Euclidean distance; ρ(b, b^gt) is the Euclidean distance between the center points of the prediction frame and the real object frame; and c represents the diagonal distance of the smallest enclosing rectangle of the prediction frame and the real object frame. Adopting the DIoU loss function to obtain the distance loss between the object prediction frame and the real frame conforms better to the target frame regression mechanism, and takes into account the distance and overlap rate between the target and the anchor frame, making target frame regression more stable.
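Under these definitions, the DIoU loss for corner-format boxes can be sketched in plain Python (a sketch of the standard formula, not the patent's implementation; the (x1, y1, x2, y2) box format is an assumption):

```python
def diou_loss(pred, gt):
    """DIoU loss for boxes (x1, y1, x2, y2): 1 - IoU + rho^2 / c^2."""
    # IoU term
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(pred) + area(gt) - inter)
    # rho^2: squared distance between the two box centers
    cx = lambda b: (b[0] + b[2]) / 2.0
    cy = lambda b: (b[1] + b[3]) / 2.0
    rho2 = (cx(pred) - cx(gt)) ** 2 + (cy(pred) - cy(gt)) ** 2
    # c^2: squared diagonal of the smallest enclosing rectangle
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return 1.0 - iou + rho2 / c2
```

Identical boxes give loss 0; unlike plain IoU loss, non-overlapping boxes still receive a gradient through the center-distance term, which is why DIoU stabilizes regression for distant small targets.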
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (6)
1. The marine small target detection method based on deep learning is characterized by comprising the following steps of:
step S1: collecting video images of an actual sea area, and carrying out framing treatment on the video images to obtain a picture data set containing target characteristics; the target features include contours, textures, and colors;
Step S2: labeling the picture data set to obtain a characteristic picture test set and a characteristic picture training set;
step S3: based on a network architecture of YOLOv5, constructing a YOLO-sea network model added with an attention mechanism module SCAM++ and an enhanced bidirectional feature fusion structure PANet;
step S4: the DIoU is adopted as a loss function of the YOLO-sea, and a picture training set is utilized to train the YOLO-sea network model to obtain a YOLO-sea optimization model;
step S5: and performing target detection on the picture test set by adopting the YOLO-sea optimization model.
2. The deep learning-based marine small target detection method according to claim 1, wherein the structure of the YOLO-sea network model in the step S3 includes processing a main network system, a feature fusion network system and a detection head network system through which the picture test set acquired in the step S2 sequentially passes;
the main network system comprises an initial layer convolution Conv module, a middle layer network module, an attention mechanism module SCAM++ and an SPPF module;
the initial layer convolution Conv module performs feature extraction on training pictures in a picture training set to obtain an initial feature map, wherein the target features comprise the outline, texture and color of a target;
The middle layer network module comprises a first feature extraction unit, a second feature extraction unit, a third feature extraction unit and a fourth feature extraction unit which are sequentially connected, wherein each of the first feature extraction unit, the second feature extraction unit, the third feature extraction unit and the fourth feature extraction unit consists of a deformable convolution DCN v2 module and a first C3 module;
the deformable convolution DCN v2 module is used for giving a set weight coefficient to the offset of each sampling point on the characteristic diagram input to the deformable convolution DCN v2 module; the offset of the sampling point comprises the size, angle transformation and the proportion of the target feature relative to the feature map;
the C3 module is used for carrying out residual feature learning and feature fusion on the feature map output by the deformable convolution DCN v2 module;
the first feature extraction unit performs feature extraction and fusion on the initial feature map to obtain a low-level feature map; transmitting the low-level feature map to a second feature extraction unit;
the second feature extraction unit performs feature extraction and fusion on the low-level feature map to obtain a first middle-level feature map; transmitting the first middle-layer feature map to a third feature extraction unit;
The third feature extraction unit performs feature extraction and fusion on the first middle-layer feature map to obtain a second middle-layer feature map; transmitting the second middle-layer feature map to a fourth feature extraction unit;
the fourth feature extraction unit performs feature extraction and fusion on the second middle-layer feature map to obtain a high-layer feature map; transmitting the high-level feature map to the attention mechanism module SCAM++;
the attention mechanism module SCAM++ performs weighting processing on the high-level feature map to obtain an optimized feature map, and transmits the optimized feature map to the SPPF module;
and the SPPF module performs feature fusion on the optimized feature map to obtain feature maps with different scales.
3. The deep learning-based marine small target detection method according to claim 2, wherein the feature fusion network system comprises a first feature fusion unit, a second feature fusion unit, a third feature fusion unit, a fourth feature fusion unit, a fifth feature fusion unit and a sixth feature fusion unit which are sequentially connected;
the first feature fusion unit, the second feature fusion unit and the third feature fusion unit comprise a first convolution Conv module, an upsampling module, a first Concat module and a second C3 module which are sequentially connected;
the fourth feature fusion unit, the fifth feature fusion unit and the sixth feature fusion unit each comprise a second convolution Conv module, a second Concat module and a third C3 module which are sequentially connected;
the first convolution Conv module and the second convolution Conv module are used for reducing the size of an input feature map and extracting features; the dimension of the feature map comprises a feature map width W, a feature map height H and resolution;
the Upsampled up-sampling module is used for carrying out size amplification on the feature map under the condition that the number N of the feature map channels is unchanged;
the first Concat module and the second Concat module are used for increasing the number N of channels of the feature map under the condition that the size of the feature map is kept unchanged; the feature map is optimized by combining the strong semantic information of the high-level feature map and the positioning information of the low-level feature map;
the strong semantic information is the coarse-grained information of the image, and the positioning information includes the texture, color, edge and corner information of the target features;
the output end of the first feature extraction unit and the output end of the Upsample up-sampling module of the third feature fusion unit are connected to a second C3 module of the third feature fusion unit after channel splicing is performed through a first Concat module of the third feature fusion unit; the second C3 module of the third feature fusion unit is used as an input module of a target detection head with an extremely small scale and is connected with the detection head network system;
The output end of the second feature extraction unit is connected with a first convolution Conv module of a third feature fusion unit, the output end of the first convolution Conv module of the third feature fusion unit is connected with the output end of an Upsample up-sampling module of the second feature fusion unit, and the output end of the Upsample up-sampling module is input to a second C3 module of the second feature fusion unit after channel splicing is performed through a first Concat module of the second feature fusion unit;
the output end of the first convolution Conv module of the third feature fusion unit and the output end of the second convolution Conv module of the fourth feature fusion unit are connected to a third C3 module of the fourth feature fusion unit after being spliced through a second Concat module of the fourth feature fusion unit; the third C3 module of the fourth feature fusion unit is used as an input module of a small-scale target detection head and is connected with the detection head network system;
the output end of the third feature extraction unit is connected with a first convolution Conv module of the second feature fusion unit, and the output end of the third feature extraction unit is connected with the output end of an Upsample upsampling module of the first feature fusion unit, and the output end of the third feature extraction unit is input to a second C3 module of the first feature fusion unit after channel splicing is performed through a first Concat module of the first feature fusion unit;
The output end of the first convolution Conv module of the second feature fusion unit and the output end of the second convolution Conv module of the fifth feature fusion unit are connected to a third C3 module of the fifth feature fusion unit after being spliced through a second Concat module of the fifth feature fusion unit; the third C3 module of the fifth feature fusion unit is used as an input module of a mesoscale target detection head and is connected with the detection head network system;
the output end of the first convolution Conv module of the first feature fusion unit and the output end of the second convolution Conv module of the sixth feature fusion unit are connected to a third C3 module of the sixth feature fusion unit after channel splicing is performed through the second Concat module of the sixth feature fusion unit; the third C3 module of the sixth feature fusion unit is used as an input module of a large-scale target detection head and is connected with the detection head network system;
and the detection head network system performs merging processing on the output characteristic diagram of the second C3 module of the third characteristic fusion unit, the output characteristic diagram of the third C3 module of the fourth characteristic fusion unit, the output characteristic diagram of the third C3 module of the fifth characteristic fusion unit and the output characteristic diagram of the third C3 module of the sixth characteristic fusion unit based on an NMS non-maximum suppression algorithm to obtain a final detection picture of the small offshore target.
4. The deep learning based marine small target detection method of claim 2, wherein each of the C3 modules includes a first branch and a second branch;
the first branch comprises a third convolution Conv module and a plurality of bottlenecks which are sequentially connected in series; the second branch comprises a fourth convolution Conv module; the output ends of the two branches are spliced by a third Concat module and then are connected in series with a fifth convolution Conv module.
5. The deep learning-based marine small target detection method according to claim 2, wherein the attention mechanism module scam++ in step S3 includes a channel attention module and a spatial attention module;
the channel attention module comprises a first global pooling layer, a first average pooling layer, a full connection layer and a Silu activation function layer;
the first global pooling layer and the first average pooling layer respectively acquire feature graphs output by the eighth layer C3 module, and perform self-adaptive average pooling and self-adaptive maximum pooling operation to acquire global feature vectors;
the features of the global feature vector include color, texture, and contour in the dimension direction;
the pooling operations of the first global pooling layer and the first average pooling layer are expressed by the following formulas:

F_A = (1/(H×W)) Σ_{x=1}^{H} Σ_{y=1}^{W} a(x,y)

F_B = max(a(x,y))

wherein F_A is the global feature vector obtained by adaptive average pooling of the feature map; F_B is the global feature vector obtained by adaptive max pooling of the feature map; H is the height of the feature map; W is the width of the feature map; and a(x,y) is the feature value at each pixel position in the feature map;
transmitting the two global feature vectors obtained after pooling to a full connection layer of a multi-layer perceptron MLP, and carrying out convolution classification to obtain an optimized feature vector;
the optimized feature vectors generate the channel attention weight M_C through a sigmoid activation function, expressed as

M_C = sigmoid(MLP(F_A) + MLP(F_B))

The weight distribution of the channel attention weight M_C is improved by adopting the activation function SiLU, obtaining an optimal feature vector with optimal weight values; the activation function SiLU is expressed as:

SiLU(x) = x × Sigmoid(x)

The channel attention weight M_C obtained after the improved weight distribution can be expressed as:

M_C = SiLU(MLP(F_A) + MLP(F_B))
the spatial attention module comprises a second global pooling layer, a second average pooling layer, a convolution layer and a sigmoid activation function layer;
the feature images output by the first C3 module of the fourth feature extraction unit are compressed in the channel dimension through a second global pooling layer and a second average pooling layer respectively to obtain a multidimensional feature image;
After channel splicing is carried out on the two obtained multidimensional feature graphs, the multidimensional feature graphs are accessed into the convolution layer to carry out dimension reduction treatment on the multidimensional feature graphs, so as to obtain dimension reduction feature graphs;
the dimension-reduced feature map generates the spatial attention weight M_S through a sigmoid activation function, expressed as

M_S = sigmoid(conv(F_A, F_B))

The channel attention module and the spatial attention module are added to obtain the attention mechanism module SCAM++, expressed as

M_SCAM++ = M_C + M_S
6. The deep learning-based marine small target detection method according to claim 1, wherein in step S4, DIoU is adopted as the loss function for calculating target frame regression; DIoU calculates the distance loss between the object prediction frame and the real frame, with the calculation formula:

L_DIoU = 1 − IoU + ρ²(b, b^gt) / c²

wherein b and b^gt respectively represent the center points of the prediction frame and the real object frame; ρ(·) represents the Euclidean distance; ρ(b, b^gt) is the Euclidean distance between the center points of the prediction frame and the real object frame; and c represents the diagonal distance of the smallest enclosing rectangle of the prediction frame and the real object frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310399524.1A CN116310997A (en) | 2023-04-14 | 2023-04-14 | Deep learning-based marine small target detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116310997A true CN116310997A (en) | 2023-06-23 |
Family
ID=86787189
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116629322A (en) * | 2023-07-26 | 2023-08-22 | 南京邮电大学 | Segmentation method of complex morphological target |
CN116935221A (en) * | 2023-07-21 | 2023-10-24 | 山东省计算中心(国家超级计算济南中心) | Plant protection unmanned aerial vehicle weed deep learning detection method based on Internet of things |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||