CN115471746A - Ship target identification detection method based on deep learning

Ship target identification detection method based on deep learning

Info

Publication number
CN115471746A
CN115471746A
Authority
CN
China
Prior art keywords
target
network
deep learning
feature
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211030296.2A
Other languages
Chinese (zh)
Inventor
郭富海
李晨浩
王鸿显
张政
杜鹏
胡春洋
陈秀敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cssc Marine Technology Co ltd
Original Assignee
Cssc Marine Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cssc Marine Technology Co ltd filed Critical Cssc Marine Technology Co ltd
Priority to CN202211030296.2A
Publication of CN115471746A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V10/7515 Shifting the patterns to accommodate for positional errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a ship target identification and detection method based on deep learning, which comprises the following steps: (1) collecting image samples of the water area in which ships sail; (2) preprocessing and labeling the images; (3) performing data enhancement to produce a data set for training; (4) constructing a deep learning network model based on the YOLOv4 network; (5) training the deep learning network using pre-trained parameters as initial weights; (6) inputting the processed picture to be detected into the backbone network for feature extraction, performing feature fusion through the neck network, and performing non-maximum suppression to complete the prediction of the ship target; (7) performing output post-processing, filtering the results using a confidence threshold, and judging in combination with other indexes to obtain the optimal detection result. The invention improves the small-target detection capability and the multi-target classification effect in complex sea areas.

Description

Ship target identification detection method based on deep learning
Technical Field
The invention belongs to the field of target identification, and particularly relates to a ship target identification detection method based on deep learning.
Background
Target identification is an important subject in the field of intelligent transportation. With the continuous development of economic globalization, the demand for marine transportation has increased rapidly, and with the growing number of ships, the safety of marine navigation has attracted increasing attention. In order to improve the efficiency, reliability and safety of ship navigation, the shipping industry is gradually developing towards intelligent, fully automated operation such as automatic ship driving, automatic obstacle avoidance and automatic berthing and unberthing, so intelligent ships have gradually become a new research direction.
As the density of traffic flow on the water increases, the navigation environment becomes more complex, and a ship's target identification capability is directly related to the safety of sea (river) navigation. Before convolutional neural networks were applied to ship target detection, traditional ship target detection algorithms mainly relied on region selection, combined feature extraction, background texture modeling and the like. In recent years, deep learning technology has been widely used in the field of target detection, which has further improved the real-time target identification and detection capability of ships. However, actual sea navigation is often accompanied by variable natural conditions and complex activity scenes, so the target identification and detection effect in complex sea areas is not ideal. For example, clusters of small targets, dense ships of varied types, and targets blurred by sea-surface fog make detection in such sea areas highly complex and make the targets difficult to distinguish and identify accurately, placing higher requirements on ship target identification and detection capability.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a ship target identification and detection method based on deep learning, which can improve the small target detection capability and the multi-target classification effect in a complex sea area.
In order to achieve the above object, the present invention provides a ship target identification and detection method based on deep learning, which comprises the following steps: (1) collecting image samples of the water area in which ships sail; (2) preprocessing and labeling the images; (3) performing data enhancement to produce a data set for training; (4) constructing a deep learning network model based on the YOLOv4 network; (5) training the deep learning network using pre-trained parameters as initial weights; (6) inputting the processed picture to be detected into the backbone network for feature extraction, performing feature fusion through the neck network, and performing non-maximum suppression to complete the prediction of the ship target; (7) performing output post-processing, filtering the results using a confidence threshold, and judging in combination with other indexes to obtain the optimal detection result.
Further, in the step (2), labeling at least the following five sea surface targets on the acquired image by using Labelimage: bulk carriers, container ships, fishing boats, cruise ships, islands.
Further, in the step (3), a data set for training is produced using Mosaic data enhancement: groups of 4 pictures are spliced by random scaling, random cropping and random arrangement to obtain 4 new images, so that the total number of input images remains unchanged, and random occlusion is applied to the new images obtained.
Further, in the step (4), the deep learning network model includes a backbone feature extraction network, an SPP structure, a PANet multipath feature fusion structure and a Head detection structure. The backbone feature extraction network uses an RGB image of size 640 × 640 as input and, after convolution, Batch Normalization and Mish activation functions, passes through residual block structures with output sizes of (320, 320, 64), (160, 160, 128), (80, 80, 256), (40, 40, 512) and (20, 20, 1024) respectively. After feature extraction, the output of the last residual block passes through the SPP structure; after splicing, the result of the CSP and CBL structures together with the outputs of the second-to-last and third-to-last residual blocks of the backbone network are used as the inputs of the PANet structure. The PANet structure performs a series of up-sampling, down-sampling and convolution operations, performs multi-path feature fusion on the three inputs, and feeds the result into the Head. Before decoding, the Head outputs the target coordinate information of the ship, including the target frame abscissa x, ordinate y, width w, height h, the classification confidence and the target existence confidence.
Further, in the step (5), an adaptive anchor frame calculation module is introduced to automatically calculate anchor frames during training: the network outputs prediction frames on the basis of the initial anchor frames, compares them with the real frames to calculate the difference between the two, and then updates the network parameters by back-propagation and iterates.
Further, the backbone network in the step (6) is Darknet-53. A feature attention module FA is embedded into the adjusted residual structure in Darknet-53 to redistribute the feature weights in the feature-channel relationship, and 1 × 1 and 3 × 3 convolutions are added before global average pooling to realize cross-channel information integration and enhance the spatial connectivity of the ship image; the global spatial information of the feature map is then converted into a one-dimensional vector sum through global average pooling to obtain the global information of the feature map.
Further, the global average pooling formula is as follows:
Gc = (1 / (H × W)) × Σ_{i=1..H} Σ_{j=1..W} Uc(i, j)
where Gc is the vector sum after global average pooling of feature maps, H and W are the width and height of the input feature map, and Uc (i, j) is the value of the c-th channel Uc at (i, j).
Further, in the step (6), minimal black borders are adaptively added to the Mosaic-enhanced image, which is normalized, scaled to 640 × 640 and converted into an RGB picture; the normalized picture is input into the trained network to obtain the output of the Head. The output of the Head comprises three feature layers, divided into 20 × 20, 40 × 40 and 80 × 80 grids; each grid point corresponds to three anchors, and each anchor performs a centre shift and length-width scaling within its grid cell. For decoding, the prediction is first scaled according to the original size of the corresponding anchor, the length, width and position of the prediction frame relative to the normalized input image are then calculated from the grid division and the offset from the anchor centre, and finally redundant predictions are filtered according to the gray border added during normalization. After decoding, non-maximum suppression is performed, and the single target with the highest confidence is directly selected as the output.
Further, in the step (7), output region threshold filtering is first performed on the output of the step (6), so that the network is prevented from giving a prediction when no ship target exists and false detections are reduced; the confidence threshold is then used for final result filtering, i.e. ship targets whose confidence is greater than the threshold are output as the final prediction.
Further, the output region threshold comprises a width direction threshold and a height direction threshold, wherein the width direction threshold is a distance between a center coordinate of the ship target and a boundary of the picture where the ship target is located, and the height direction threshold is a ratio of the width to the height of the whole picture.
Further, the other indexes in the step (7) comprise a target boundary box, a positioning confidence coefficient and all category probability maps, and the ship target with the positioning confidence coefficient larger than a threshold value is output as a prediction result; the offset between the target bounding box and the prediction bounding box is smaller than a certain value, and the probability in all the class probability graphs is larger than the detected target.
Compared with the prior art, the invention has the beneficial effects that:
1. YOLOv4 is improved to obtain the deep learning network model Ship-YOLOv4. To address the problem of insufficient training data, Mosaic data enhancement, adaptive picture scaling optimization and K-means-clustering-based adaptive anchor frames are applied at the input end, which improves the generalization ability of the framework and avoids overfitting.
2. A feature attention module is constructed based on an attention mechanism and embedded into Darknet-53 for feature recalibration, which improves the feature extraction capability of the model in complex environments.
3. To address the problems of insufficient semantic information in low-level features and feature vanishing caused by an overly deep network during feature fusion, the PANet multi-path feature fusion structure is optimized and multi-level feature information is fused, which strengthens the association between the receptive field of the network layers and the feature extraction network and improves the small-target detection capability and multi-target classification effect in complex sea areas. Experiments on a custom data set fully verify the superiority of the method in ship target identification and detection.
Drawings
FIG. 1 is a diagram of an algorithm architecture according to one embodiment of the present invention;
FIG. 2 is a flow chart of one embodiment of the present invention;
FIG. 3 is a graph of data set analysis results according to one embodiment of the present invention;
FIG. 4 is a flow diagram of a data enhancement implementation of one embodiment of the present invention;
FIG. 5 is a network architecture diagram of one embodiment of the present invention;
FIG. 6 is a block diagram of an optimized PANET implementation according to one embodiment of the present invention;
FIG. 7 is a diagram illustrating the effectiveness of network training according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the detection effect of one embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the accompanying drawings and the specific embodiments.
As shown in fig. 1 to 8, an embodiment of the ship target identification and detection method based on deep learning of the present invention comprises the following steps: (1) collecting image samples of the water area in which ships sail; (2) preprocessing and labeling the images; (3) performing data enhancement to produce a data set for training; (4) constructing a deep learning network model based on the YOLOv4 network; (5) training the deep learning network using pre-trained parameters as initial weights; (6) inputting the processed picture to be detected into the backbone network for feature extraction, performing feature fusion through the neck network, and performing non-maximum suppression to complete the prediction of the ship target; (7) performing output post-processing, filtering the results using a confidence threshold, and judging in combination with other indexes to obtain the optimal detection result.
As shown in fig. 1, a single-stage detection and identification algorithm framework is adopted, mainly comprising a feature extraction module, a feature fusion module and a detection-classification module. The feature extraction module is constructed on the basis of an attention mechanism; the obtained feature information is fused by the optimized PANet multipath feature fusion structure; finally, the image features are passed to the detection classifier, which predicts the position of the bounding box and the class of the ship inside the box.
As shown in fig. 2, the method can be divided into seven steps: the method comprises the steps of image acquisition, image annotation, data enhancement, model construction, network training, network prediction and output post-processing.
In one embodiment, in the step (2), at least the following five sea surface targets are labeled on the acquired images by using LabelImage: bulk carriers, container ships, fishing boats, cruise ships, islands. Specifically, a numerical label in COCO format is added to each ship target image sample, for example "0" for an image without a ship target, "1" for an image containing a cruise ship, "2" for a container ship, "3" for a bulk carrier, "4" for a fishing boat and "5" for an island or reef. Only with correct labels can the trained model be guaranteed to perform well in actual operation. The normalized data contains five sea surface targets: bulk carriers, container ships, fishing boats, cruise ships, islands. The data analysis results are shown in fig. 3.
In one embodiment, in the step (3), a data set for training is produced using Mosaic data enhancement: groups of 4 pictures are spliced by random scaling, random cropping and random arrangement to obtain new images, so that the total number of input images remains unchanged, and random occlusion is applied to the images obtained. This greatly enriches the detection data set; in particular, random scaling adds many small targets, which makes the network more robust. However, Mosaic data enhancement shrinks the targets and, if overused, degrades the generalization ability of the model, so the number of pictures spliced per Mosaic is chosen to be 4. Based on the collected image information, the training data set is produced with Mosaic data enhancement; the implementation process is shown in fig. 4. The image size required for a single iteration does not need to be large, which is friendlier to single-GPU training. The training set accounts for 80% of the generated data set and the test set for 20%.
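For illustration only, the following Python sketch (using NumPy and OpenCV) shows one way the four-picture Mosaic splicing and random occlusion described above could be implemented; the scale range, occlusion size and padding value are assumptions rather than values disclosed by the patent, and box/label handling is omitted.

```python
import random
import numpy as np
import cv2

def mosaic4(images, out_size=640):
    """Splice 4 pictures into one Mosaic image by random scaling, random
    cropping and random arrangement, then apply random occlusion."""
    assert len(images) == 4
    # Random mosaic centre, kept away from the borders
    cx = int(random.uniform(0.3, 0.7) * out_size)
    cy = int(random.uniform(0.3, 0.7) * out_size)
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    quads = [(0, 0, cx, cy), (cx, 0, out_size, cy),                 # top row
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]   # bottom row
    for img, (x1, y1, x2, y2) in zip(images, quads):
        scale = random.uniform(0.5, 1.5)                            # random scaling
        img = cv2.resize(img, None, fx=scale, fy=scale)
        h, w = img.shape[:2]
        qw, qh = x2 - x1, y2 - y1
        left = random.randint(0, max(w - qw, 0))                    # random crop origin
        top = random.randint(0, max(h - qh, 0))
        crop = img[top:top + qh, left:left + qw]
        canvas[y1:y1 + crop.shape[0], x1:x1 + crop.shape[1]] = crop
    for _ in range(random.randint(0, 3)):                           # random occlusion
        ox = random.randint(0, out_size - 60)
        oy = random.randint(0, out_size - 60)
        canvas[oy:oy + 60, ox:ox + 60] = 114
    return canvas
```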
In one embodiment, in the step (4), the deep learning network model includes a backbone feature extraction network, an SPP structure, a PANet multipath feature fusion structure and a Head detection Head structure; the backbone feature extraction network uses an RGB image with the size of 640 x 640 as input, and passes through residual block structures with the sizes of (320, 320, 64), (160, 160, 128), (80, 80, 256), (40, 40, 512), (20, 20, 1024) after convolution, batch Normalization and Mish activation functions; after feature extraction, the output of the last residual block passes through an SPP structure, and after splicing, the result of the CSP and CBL structure and the output results of the penultimate and penultimate residual blocks of the main network are used as the input of a PANet structure; the PANet structure performs a series of up-sampling, down-sampling and convolution operations, performs multi-path feature fusion processing on three inputs, and inputs a Head; the Head outputs target coordinate information of the ship before decoding, including a target frame abscissa x, a target frame ordinate y, a target frame width w, a target frame height h, a classification confidence and a target existence confidence.
In this embodiment, with reference to the YOLOv4 network structure, the input end, backbone network, neck network and output end are improved, and a deep learning network model Ship-YOLOv4 for ship target identification and detection is constructed. The network structure is shown in fig. 5 and can be divided into three major parts: the Backbone is the feature extraction network CSPDarknet53 with the feature attention module FA introduced, the Neck is composed of an SPP and the optimized PANet multipath feature fusion structure, and the Head is the detection structure. The feature extraction network uses an RGB image of size 640 × 640 as input and, after convolution, Batch Normalization and Mish activation functions, passes through residual block structures with output sizes of (320, 320, 64), (160, 160, 128), (80, 80, 256), (40, 40, 512) and (20, 20, 1024) respectively. After feature extraction, the output of the last residual block passes through the SPP structure; after splicing, the result of the CSP and CBL structures together with the outputs of the second-to-last and third-to-last residual blocks of the backbone network are used as the inputs of the PANet structure.
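As a rough illustration of the basic backbone unit described above (convolution followed by Batch Normalization and the Mish activation), a minimal PyTorch sketch is given below; the class name and default kernel size are assumptions made purely for illustration.

```python
import torch.nn as nn

class ConvBNMish(nn.Module):
    """Conv + Batch Normalization + Mish: the basic unit of the backbone.
    Stacking such units with residual blocks takes a 640x640x3 input through
    feature maps of size (320, 320, 64), (160, 160, 128), (80, 80, 256),
    (40, 40, 512) and (20, 20, 1024), as described above."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```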
The PANet architecture is shown in fig. 6, where parts (a) and (b) are top-down network structures with feature fusion by lateral connection, and part (c) is a bottom-up structure that keeps the lateral connections while aggregating lower-level features. Each top-level feature is generated by fusing features from three different paths, so that in fig. 6 the low-level, intermediate and top-level features are fused with one another. A prediction result is given for each layer of features as the features are passed upwards, and the result is then fed into the Head. Before decoding, the Head outputs the x, y, w and h coordinate information of the ship target, the classification confidence and the target existence confidence.
In one embodiment, in the step (5), an adaptive anchor frame calculation module is introduced to automatically calculate anchor frames during training: the network outputs prediction frames on the basis of the initial anchor frames, compares them with the real frames to calculate the difference between the two, and then updates the network parameters by back-propagation and iterates. When the network is trained, the pre-trained parameters are used as initial weights; the pre-trained weight file is yolov4.conv.137, and the following training parameters are used: learning rate 0.001, batch 64, subdivisions 16, and a training/validation split of the data set of 0.9 and 0.1. The training strategy is to use the pre-trained parameters as initial weights, freeze the weights of the Backbone part and train the remaining parts for 50 epochs, then unfreeze all weights and train for another 50 epochs. The training equipment uses 2 Nvidia RTX 2080Ti GPUs for about 16 hours. After training for 100 epochs, the validation-set loss is 3.1534. The training effect is shown in fig. 7: during training the mAP value becomes larger and larger while the loss value becomes smaller and smaller.
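To illustrate the K-means-based adaptive anchor calculation mentioned above, a minimal NumPy sketch is given below; it clusters the width/height pairs of the labeled boxes using 1 - IoU as the distance, which is a common choice but an assumption here, since the patent does not spell out the distance metric.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster training-box (width, height) pairs into k anchor boxes.

    wh: array of shape (N, 2); returns k anchors sorted by area."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU between every box and every anchor, treating boxes as co-centred
        inter = np.minimum(wh[:, None, :], anchors[None, :, :]).prod(axis=2)
        union = wh.prod(axis=1)[:, None] + anchors.prod(axis=1)[None, :] - inter
        assign = np.argmax(inter / union, axis=1)      # nearest anchor by IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]
```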
In one embodiment, the backbone network in the step (6) is Darknet-53; the feature attention module FA is embedded into the adjusted residual structure in Darknet-53, the feature weights in the feature-channel relationship are redistributed, and 1 × 1 and 3 × 3 convolutions are added before global average pooling to realize cross-channel information integration and enhance the spatial connectivity of the ship image; the global spatial information of the feature map is then converted into a one-dimensional vector sum through global average pooling to obtain the global information of the feature map.
In one embodiment, the global average pooling formula is as follows:
Gc = (1 / (H × W)) × Σ_{i=1..H} Σ_{j=1..W} Uc(i, j)
where Gc is the vector sum after global average pooling of feature maps, H and W are the width and height of the input feature map, and Uc (i, j) is the value of the c-th channel Uc at (i, j).
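A minimal PyTorch sketch of a feature attention module consistent with the description above (1 × 1 and 3 × 3 convolutions for cross-channel integration, global average pooling for Gc, then re-weighting of the channels) is given below; the gating network after pooling is not detailed in the patent, so the SE-style two-layer reduction used here is an assumption.

```python
import torch.nn as nn

class FeatureAttention(nn.Module):
    """FA-style channel attention: mix channels, pool to Gc, re-weight."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mix = nn.Sequential(                      # cross-channel integration
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)             # Gc: mean over H x W
        self.gate = nn.Sequential(                     # assumed SE-style gate
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        u = self.mix(x)
        b, c, _, _ = u.shape
        g = self.gap(u).view(b, c)                     # global information per channel
        w = self.gate(g).view(b, c, 1, 1)              # redistributed feature weights
        return x * w                                   # recalibrated features
```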
In the SPP structure, max pooling with kernel sizes of 5 × 5, 9 × 9 and 13 × 13 is applied while the spatial dimensions are preserved, and the feature maps from the different kernel sizes are concatenated together as the output. Compared with pure k × k max pooling, this effectively increases the receptive field of the backbone features and clearly separates out the most important context features.
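For illustration, a minimal PyTorch sketch of the SPP block described above follows; concatenating the un-pooled input alongside the pooled maps follows the usual YOLOv4 convention and is an assumption rather than something the text states explicitly.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Parallel max pooling with 5x5, 9x9 and 13x13 kernels (stride 1,
    padding k//2 so spatial size is preserved), concatenated channel-wise."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```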
The CBL is the basic component of the PANet multipath feature fusion structure and is composed of three layers: an ordinary convolutional layer Conv, a normalization layer BN and an activation function layer SiLU.
The SiLU function is a variant of the Sigmoid function, with the functional form:
SiLU(x) = x × Sigmoid(x)
Sigmoid(x) = 1 / (1 + e^(-x))
concat concatenates the two tensors, expanding the dimensionality of the two tensors.
The CSP structure replaces the Resunit with an ordinary CBL and is applied to the Neck. The Backbone is a deeper network; adding a residual structure enhances the gradient values propagated backwards between layers, which avoids the vanishing gradient caused by deepening the network and allows finer-grained features to be extracted without worrying about network degradation.
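The CBL unit and the CSP structure described above can be sketched in PyTorch as follows; the number of stacked CBL units inside the CSP branch is an assumption made purely for illustration.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BN + SiLU, the basic unit of the Neck fusion path."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()                 # SiLU(x) = x * Sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPBlock(nn.Module):
    """CSP-style block used in the Neck: the input is split into two branches,
    one passing through stacked CBL units (replacing the residual unit), the
    other acting as a shortcut; both are concatenated and fused by a 1x1 CBL."""
    def __init__(self, channels, n=2):
        super().__init__()
        half = channels // 2
        self.split1 = CBL(channels, half, 1)
        self.split2 = CBL(channels, half, 1)
        self.blocks = nn.Sequential(*[CBL(half, half, 3) for _ in range(n)])
        self.fuse = CBL(channels, channels, 1)

    def forward(self, x):
        y1 = self.blocks(self.split1(x))
        y2 = self.split2(x)
        return self.fuse(torch.cat([y1, y2], dim=1))
```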
In one embodiment, in the step (6), minimal black borders are adaptively added to the Mosaic-enhanced image, which is normalized, scaled to 640 × 640 and converted into an RGB picture; the normalized picture is input into the trained network to obtain the output of the Head. The output of the Head comprises three feature layers, divided into 20 × 20, 40 × 40 and 80 × 80 grids; each grid point corresponds to three anchors, and each anchor performs a centre shift and length-width scaling within its grid cell. For decoding, the prediction is first scaled according to the original size of the corresponding anchor, the length, width and position of the prediction frame relative to the normalized input image are then calculated from the grid division and the offset from the anchor centre, and finally redundant predictions are filtered according to the gray border added during normalization. After decoding, non-maximum suppression is performed, and the single target with the highest confidence is directly selected as the output, which improves the detection speed.
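To make the decoding and suppression step concrete, a minimal PyTorch sketch for one Head feature layer is given below; it implements standard YOLO-style decoding (sigmoid centre offsets plus grid coordinates, exponential width/height scaling by the anchor) and ordinary IoU-based non-maximum suppression from torchvision, which is an assumption where the text simply says the highest-confidence target is selected. Batch size 1 is assumed when flattening.

```python
import torch
import torchvision

def decode_and_nms(head_out, anchors, stride, conf_thres=0.5, iou_thres=0.45):
    """head_out: (1, num_anchors, g, g, 5 + num_classes) raw Head output.
    anchors: (num_anchors, 2) anchor widths/heights in input pixels.
    stride:  input pixels per grid cell, e.g. 640/80, 640/40 or 640/20."""
    _, na, g, _, _ = head_out.shape
    gy, gx = torch.meshgrid(torch.arange(g), torch.arange(g), indexing="ij")
    # Centre: b_x = sigmoid(t_x) + C_x, b_y = sigmoid(t_y) + C_y (grid units -> pixels)
    xy = (head_out[..., :2].sigmoid() + torch.stack((gx, gy), dim=-1)) * stride
    # Size: b_w = p_w * exp(t_w), b_h = p_h * exp(t_h)
    wh = head_out[..., 2:4].exp() * anchors.view(1, na, 1, 1, 2)
    obj = head_out[..., 4].sigmoid()
    cls = head_out[..., 5:].sigmoid()
    score, label = (obj.unsqueeze(-1) * cls).max(dim=-1)
    boxes = torch.cat((xy - wh / 2, xy + wh / 2), dim=-1).reshape(-1, 4)
    score, label = score.reshape(-1), label.reshape(-1)
    keep = score > conf_thres                            # confidence filtering
    boxes, score, label = boxes[keep], score[keep], label[keep]
    idx = torchvision.ops.nms(boxes, score, iou_thres)   # non-maximum suppression
    return boxes[idx], score[idx], label[idx]
```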
In one embodiment, in the step (7), output region threshold filtering is first performed on the output of the step (6), so that the network is prevented from giving a prediction when no ship target exists and false detections are reduced; the confidence threshold is then used for final result filtering, i.e. ship targets whose confidence is greater than the threshold are output as the final prediction.
In one embodiment, the output region threshold includes two parts, namely a width direction threshold and a height direction threshold, wherein the width direction threshold is the distance between the center coordinate of the ship target and the boundary of the picture, and the height direction threshold is the ratio of the width to the height of the whole picture.
In one embodiment, the other indexes in the step (7) comprise an object boundary box, a positioning confidence level and a probability map of all categories, and the ship object with the positioning confidence level larger than a threshold value is output as a prediction result; the offset between the target bounding box and the prediction bounding box is smaller than a certain value, and the probability in all the category probability graphs is larger than the detected target.
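A minimal Python sketch of the output post-processing described in these paragraphs (output-region threshold filtering followed by confidence filtering) is given below; the function name, detection format and border-margin ratios are illustrative assumptions, since the patent does not disclose concrete threshold values.

```python
def filter_predictions(dets, img_w, img_h, conf_thres=0.5,
                       w_margin_ratio=0.02, h_margin_ratio=0.02):
    """dets: iterable of dicts like {"box": (x1, y1, x2, y2), "score": s, "cls": c}.
    Drops detections whose centre lies inside a border margin of the picture
    (output-region threshold), then keeps only detections above conf_thres."""
    w_margin = w_margin_ratio * img_w
    h_margin = h_margin_ratio * img_h
    kept = []
    for det in dets:
        x1, y1, x2, y2 = det["box"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        inside = (w_margin < cx < img_w - w_margin and
                  h_margin < cy < img_h - h_margin)
        if inside and det["score"] > conf_thres:
            kept.append(det)
    return kept
```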
In the embodiment of the invention, Ship-YOLOv4 does not need to generate regions of interest in advance; the network can be trained directly in a regression manner. The bounding boxes of the training samples are clustered with the K-means algorithm, 3 groups of predefined bounding boxes are preset for each of the 3 scales, and subsequent positioning prediction is based on these 9 bounding boxes. First, feature extraction is performed on the original 640 × 640 input image through the feature extraction network, and the feature vectors are then fed into the SPP and PANet structures to generate 3 grid scales of 20 × 20, 40 × 40 and 80 × 80. Three bounding boxes are predicted per grid cell, giving (80 × 80 + 40 × 40 + 20 × 20) × 3 = 25200 bounding boxes, and for each bounding box a vector N is predicted. The composition of the vector N is as follows:
N = (t_x + t_y + t_w + t_h) + N_0 + (N_1 + N_2 + … + N_n)
t_x and t_y denote the horizontal and vertical coordinates of the bounding box, and t_w and t_h denote the width and height of the bounding box. N_0 … N_n represent the probability values for the objects in the prediction box.
The distance from the centre of the final predicted bounding box to the upper left corner of the feature map and the length and width of the predicted bounding box are calculated as follows:
b_x = δ(t_x) + C_x
b_y = δ(t_y) + C_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
where δ denotes the Sigmoid function, C_x and C_y denote the offset of the grid cell to which the bounding box belongs relative to the upper left corner of the picture, p_w and p_h denote the width and height of the predefined bounding box, b_x and b_y denote the distance from the centre of the final predicted bounding box to the upper left corner of the picture, and b_w and b_h denote the width and height of the predicted bounding box.
In training, CIOU_Loss is used as the loss for the target bounding box; it takes the aspect ratio of the prediction box and the target box into account. The CIOU_Loss formula is as follows:
CIOU_Loss = 1 - IoU + ρ²(b, b_gt) / c² + α × v
where ρ²(b, b_gt) is the squared distance between the centre points of the prediction box and the target box, and c is the diagonal length of the smallest box enclosing both.
α is a weight parameter defined as:
α = v / ((1 - IoU) + v)
v is a measure of the similarity of aspect ratios, defined as:
v = (4 / π²) × (arctan(w_gt / h_gt) - arctan(w / h))²
In optimizing CIOU_Loss, the partial derivatives of v with respect to w and h need to be defined, namely:
∂v/∂w = (8 / π²) × (arctan(w_gt / h_gt) - arctan(w / h)) × h / (w² + h²)
∂v/∂h = -(8 / π²) × (arctan(w_gt / h_gt) - arctan(w / h)) × w / (w² + h²)
Since w and h are normalized to [0, 1], w² + h² is usually very small and may lead to gradient explosion; to avoid this problem, the factor 1 / (w² + h²) is replaced by 1 in the implementation.
In summary, CIOU_Loss regresses the target box while taking three important geometric factors into account: the overlap area, the centre-point distance and the aspect ratio.
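For completeness, a PyTorch sketch of CIOU_Loss matching the formulas above is given below for boxes in (x1, y1, x2, y2) format; letting autograd differentiate v directly, instead of replacing the 1/(w² + h²) factor by 1 as described above, is a simplification made for readability.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """pred, target: tensors of shape (N, 4) in (x1, y1, x2, y2) format."""
    # Intersection-over-union of the two boxes
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared centre distance and squared diagonal of the smallest enclosing box
    cpx, cpy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    ctx, cty = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and its weight alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / c2 + alpha * v).mean()
```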
As shown in fig. 8, the method of the embodiment of the present invention can identify the target information of bulk carriers, container ships, fishing boats, cruise ships and islands, and the performance and the identification accuracy are significantly improved compared with YOLOv4.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (11)

1. A ship target identification detection method based on deep learning is characterized by comprising the following steps:
(1) Acquiring an image sample of a ship navigation water area;
(2) Preprocessing and labeling the image;
(3) Performing data enhancement to make a data set for training;
(4) Constructing a deep learning network model based on a YOLOv4 network;
(5) Training a deep learning network by using the pre-trained parameters as initial weights;
(6) Inputting the processed picture to be detected into a backbone network for feature extraction, performing feature fusion through a neck network, and performing non-maximum suppression operation to complete the prediction of the ship target;
(7) Performing output post-processing, performing result filtering by using a confidence threshold, and judging by combining other indexes to obtain an optimal detection result.
2. The vessel target recognition detection method based on deep learning of claim 1, wherein in the step (2), labeling at least the following five sea surface targets is performed on the acquired image by using Labelimage: bulk carriers, container ships, fishing boats, cruise ships, islands.
3. The vessel target recognition detection method based on deep learning according to claim 1, wherein in the step (3), Mosaic data enhancement is used to produce a data set for training: groups of 4 pictures are spliced by random scaling, random cropping and random arrangement to obtain 4 new images, so that the total number of input images remains unchanged, and random occlusion is applied to the new images obtained.
4. The deep learning based ship target identification detection method according to claim 3, wherein in the step (4), the deep learning network model comprises a backbone feature extraction network, an SPP structure, a PANet multipath feature fusion structure and a Head detection structure; the backbone feature extraction network uses an RGB image of size 640 × 640 as input and, after convolution, Batch Normalization and Mish activation functions, passes through residual block structures with output sizes of (320, 320, 64), (160, 160, 128), (80, 80, 256), (40, 40, 512) and (20, 20, 1024) respectively; after feature extraction, the output of the last residual block passes through the SPP structure, and after splicing, the result of the CSP and CBL structures together with the outputs of the second-to-last and third-to-last residual blocks of the backbone network are used as the inputs of the PANet structure; the PANet structure performs a series of up-sampling, down-sampling and convolution operations, performs multi-path feature fusion on the three inputs, and feeds the result into the Head; before decoding, the Head outputs the target coordinate information of the ship, including the target frame abscissa x, ordinate y, width w, height h, the classification confidence and the target existence confidence.
5. The vessel target recognition detection method based on deep learning of claim 1, wherein in the step (5), an adaptive anchor frame calculation module is introduced to automatically calculate anchor frames during training: the network outputs prediction frames on the basis of the initial anchor frames, compares them with the real frames to calculate the difference between the two, and then reversely updates and iterates the network parameters.
6. The vessel target identification detection method based on deep learning of claim 1, wherein the backbone network in step (6) is Darknet-53; a feature attention module FA is embedded in the adjusted residual structure in Darknet-53, the feature weights in the feature-channel relationship are redistributed, and 1 × 1 and 3 × 3 convolutions are added before global average pooling to realize cross-channel information integration and enhance the spatial connectivity of the vessel image; the global spatial information of the feature map is then converted into a one-dimensional vector sum through global average pooling to obtain the global information of the feature map.
7. The deep learning-based ship target identification detection method according to claim 6, wherein the global average pooling formula is as follows:
Gc = (1 / (H × W)) × Σ_{i=1..H} Σ_{j=1..W} Uc(i, j)
where Gc is the vector sum after global average pooling of feature maps, H and W are the width and height of the input feature map, and Uc (i, j) is the value of the c-th channel Uc at (i, j).
8. The vessel target recognition detection method based on deep learning of claim 4, wherein in the step (6), minimal black borders are adaptively added to the Mosaic-enhanced image, which is normalized, scaled to 640 × 640 and converted into an RGB picture; the normalized picture is input into the trained network to obtain the output of the Head; the output of the Head comprises three feature layers, divided into 20 × 20, 40 × 40 and 80 × 80 grids, each grid point corresponding to three anchors, and each anchor performs a centre shift and length-width scaling within its grid cell; for decoding, the prediction is first scaled according to the original size of the corresponding anchor, then the length, width and position of the prediction frame relative to the normalized input image are calculated from the grid division and the offset from the anchor centre, and finally redundant predictions are filtered according to the gray border added during normalization; after decoding, non-maximum suppression is performed, and the single target with the highest confidence is directly selected as the output.
9. The vessel target recognition detection method based on deep learning of claim 4, wherein in the step (7), output region threshold filtering is first performed on the output of the step (6), so that the network is prevented from giving a prediction when no vessel target exists and false detections are reduced; the confidence threshold is then used for final result filtering, i.e. vessel targets whose confidence is greater than the threshold are output as the final prediction.
10. The vessel target recognition detection method based on deep learning of claim 9, wherein the output region threshold includes two parts, namely a width direction threshold and a height direction threshold, wherein the width direction threshold is a distance from a center coordinate of the vessel target to a boundary of a picture where the vessel target is located, and the height direction threshold is a ratio of a width to a height of the whole picture.
11. The vessel target recognition detection method based on deep learning of claim 1, wherein the other indexes in step (7) include a target bounding box, a positioning confidence level, and all class probability maps, and the vessel target with the positioning confidence level greater than a threshold is output as a prediction result; the offset between the target bounding box and the prediction bounding box is smaller than a certain value, and the probability in all the category probability graphs is larger than the detected target.
CN202211030296.2A 2022-08-26 2022-08-26 Ship target identification detection method based on deep learning Pending CN115471746A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211030296.2A CN115471746A (en) 2022-08-26 2022-08-26 Ship target identification detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211030296.2A CN115471746A (en) 2022-08-26 2022-08-26 Ship target identification detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN115471746A true CN115471746A (en) 2022-12-13

Family

ID=84370529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211030296.2A Pending CN115471746A (en) 2022-08-26 2022-08-26 Ship target identification detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN115471746A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152580A (en) * 2023-04-18 2023-05-23 江西师范大学 Data processing detection method and data training method for small targets in complex scene
CN116152580B (en) * 2023-04-18 2023-08-15 江西师范大学 Data training method for small target in complex scene
CN116503737A (en) * 2023-05-10 2023-07-28 中国人民解放军61646部队 Ship detection method and device based on space optical image
CN116503737B (en) * 2023-05-10 2024-01-09 中国人民解放军61646部队 Ship detection method and device based on space optical image
CN117058081A (en) * 2023-08-02 2023-11-14 苏州弗莱威智能科技有限公司 Corner and surface defect detection method for photovoltaic glass

Similar Documents

Publication Publication Date Title
Chen et al. A deep neural network based on an attention mechanism for SAR ship detection in multiscale and complex scenarios
Zhang et al. Balance learning for ship detection from synthetic aperture radar remote sensing imagery
CN110084234B (en) Sonar image target identification method based on example segmentation
CN115471746A (en) Ship target identification detection method based on deep learning
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN111079739B (en) Multi-scale attention feature detection method
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN109359661B (en) Sentinel-1 radar image classification method based on convolutional neural network
CN110991257B (en) Polarized SAR oil spill detection method based on feature fusion and SVM
CN111723632B (en) Ship tracking method and system based on twin network
CN113420759B (en) Anti-occlusion and multi-scale dead fish identification system and method based on deep learning
CN113743322A (en) Offshore ship detection method based on improved YOLOv3 algorithm
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
Zhou et al. YOLO-ship: a visible light ship detection method
CN114565824A (en) Single-stage rotating ship detection method based on full convolution network
Chen et al. Orientation-aware ship detection via a rotation feature decoupling supported deep learning approach
Liu et al. An approach to ship target detection based on combined optimization model of dehazing and detection
CN117036656A (en) Water surface floater identification method under complex scene
Ruan et al. Dual-Path Residual “Shrinkage” Network for Side-Scan Sonar Image Classification
CN116863293A (en) Marine target detection method under visible light based on improved YOLOv7 algorithm
Zhou et al. A real-time scene parsing network for autonomous maritime transportation
CN114255385B (en) Optical remote sensing image ship detection method and system based on sensing vector
Li et al. Research on ROI algorithm of ship image based on improved YOLO
CN115496998A (en) Remote sensing image wharf target detection method
Cai et al. Obstacle Detection of Unmanned Surface Vessel based on Faster RCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination