CN115965862A - SAR ship target detection method based on mask network fusion image characteristics - Google Patents

SAR ship target detection method based on mask network fusion image characteristics

Info

Publication number
CN115965862A
Authority
CN
China
Prior art keywords: module, layer, network, target, csp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211567684.4A
Other languages
Chinese (zh)
Inventor
雷杰
郭怡
杨埂
谢卫莹
李云松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority claimed from CN202211567684.4A
Publication of CN115965862A
Legal status: Pending

Abstract

The invention provides a SAR ship target detection method based on mask-network fusion of image features, which comprises the following steps: generating adaptive SAR image ship semantic segmentation labels from the brightness gradient difference between ship targets and the background; constructing a mask feature fusion sub-network and a mask feature fusion target detection network; iteratively training the target detection network with the constructed loss function; and obtaining the bounding-box coordinates and confidence of the targets in the test samples. By generating adaptive SAR image ship mask labels, the method can combine the target detection and segmentation tasks when the dataset lacks mask labels; the constructed mask feature fusion sub-network highlights ship target features and suppresses background information, improving detection accuracy; and the loss function designed for the network addresses the imbalance between image foreground and background and the instability of the training process.

Description

SAR ship target detection method based on mask network fusion image characteristics
Technical Field
The invention belongs to the technical field of image processing, and further relates to a Synthetic Aperture Radar (SAR) ship target detection method based on mask network fusion in the technical field of SAR image target detection. The method can be applied to detecting the ship target in the SAR image.
Background
Synthetic aperture radar (SAR) is a high-resolution imaging radar. Compared with optical detection means such as visible-light imaging, infrared detection and laser detection, SAR is not limited by natural conditions such as cloud and fog and can remotely detect static and time-sensitive targets such as ships. As the first stage of SAR automatic target recognition, SAR image target detection has received wide attention. The constant false alarm rate (CFAR) method is the most widely and deeply studied traditional SAR target detection method: it models the statistical distribution of background clutter to obtain an adaptive threshold, and then compares the gray value of each pixel with this threshold through a sliding window to obtain the detection result. Choosing a proper clutter statistical model is therefore essential to the detection performance of CFAR, but measured SAR images contain a large amount of complex background clutter, which makes it difficult to select a suitable model and degrades the detection performance. With the development of deep learning, many methods based on convolutional neural networks (CNNs) have been proposed. Thanks to the large amount of labeled training data learned by the network, these methods have made significant progress in target detection. However, because the scenes of measured SAR images are complex, CNN-based SAR target detection methods still produce many false alarms and missed detections, and their detection accuracy urgently needs to be improved.
The patent document "A SAR ship target detection method based on background and scale perception" (application No. CN202111613298.X, publication No. CN114219997A), filed by the 14th Research Institute of China Electronics Technology Group Corporation, proposes a SAR ship target detection method based on background and scale perception. The method effectively reduces the false alarm rate by designing a background-aware ship detection network model, makes the model focus on small targets during training by designing a scale-aware loss function to improve the small-target detection rate, and adopts a multi-scale training strategy that fixes the size of the image fed into the network at each step, so that the time and hardware resources required by each iteration remain consistent. However, the method still has the following shortcomings: the multi-scale training strategy introduces multiple down-sampling operations into the network, and features of small targets are lost at each down-sampling step; although the scale-aware loss function is designed to avoid this loss as much as possible, the small-target information lost during sampling is difficult to recover completely, which reduces the detection accuracy of the network. In addition, attending to scale changes introduces more noise into the network, which easily causes false alarms and ultimately degrades the detection performance.
The patent document "SAR ship target detection method based on the regression loss of balanced samples" (application No. CN202011544100.2, publication No. CN112668440A), filed by Xidian University, proposes a SAR ship target detection method based on a balanced-sample regression loss. The method selects the Faster R-CNN network as the training model, improves its original loss function to form a new total loss function for training, obtains the final trained network model, and improves the detection accuracy for ship targets. However, the method still has the following shortcomings: when extracting features from the input SAR image, it processes the whole scene indiscriminately; when the difference between target and background is not obvious and the scene contains a large amount of complex background clutter, this causes many false alarms and missed detections and reduces the detection accuracy. In addition, the method uses a two-stage network with a large number of parameters and a large storage footprint, which makes the network and model difficult to deploy.
Zhang et al., in the paper "A Mask Attention Interaction and Scale Enhancement Network for SAR Ship Instance Segmentation" (IEEE Geoscience and Remote Sensing Letters, vol. 19, 2021), propose a SAR ship target detection method based on a mask attention interaction and scale enhancement network (MAI-SE-Net), which models long-range spatial dependencies with non-local blocks, generates an additional pyramid bottom layer using a content-aware feature reconstruction block to improve performance on small ships, improves the scale feature description using a feature balancing operation, and refines features using global context blocks. Through these operations, MAI-SE-Net completes the semantic segmentation and target detection tasks and achieves good results. However, the method has the following disadvantages: it requires a dataset with both target detection bounding-box labels and semantic segmentation mask labels, which places high demands on the dataset. In addition, the network used is a two-stage network whose parameter count and model storage space are much larger than those of a one-stage network, making the network and model difficult to deploy.
Disclosure of Invention
The invention aims to provide a SAR target detection method based on mask network fusion to address the shortcomings of the prior art, namely that networks process targets and background indiscriminately and therefore produce many false alarms and missed detections with low detection accuracy when detecting SAR images of complex scenes, and that networks depend heavily on semantic segmentation mask labels.
The technical idea for realizing the purpose of the invention is as follows: a mask network branch is constructed and used to highlight ship target features while suppressing background feature information; after training, the features extracted by the mask network branch are fused with the features extracted by the backbone network before target detection is performed on the SAR image, thereby avoiding the high false alarm rate and low detection accuracy caused in the prior art by indiscriminate processing of the whole scene. The invention provides an adaptive strategy for producing semantic segmentation mask labels, which can adaptively complete mask label production for different SAR datasets, so that the target detection and target segmentation tasks can be combined even when a dataset lacks semantic segmentation labels; this solves the prior-art problem of heavy dependence on semantic segmentation mask labels. The invention designs a cross-entropy loss function with weight coefficients for the mask branch network, so that the attention of the branch is focused on ship targets, solving the problem of foreground-background imbalance. The invention also adds an adaptive weight assignment mechanism to the loss function of the target detection network, so that the network can reasonably and adaptively assign weights to different tasks during training, solving the problem of an unstable training process.
The method comprises the following specific steps:
step 1, generating a sample set:
step 1.1, collecting at least 900 SAR ship image samples with the scale larger than 320 x 320 pixels, wherein each sample at least comprises a ship target;
step 1.2, labeling a target detection label file for each sample, wherein each target detection label file comprises bounding box coordinates of all ship targets in the corresponding sample;
step 1.3, forming a sample set by all SAR ship image samples and corresponding target detection label files;
step 2, generating adaptive SAR image ship semantic segmentation labels by using the brightness gradient difference between the ship and the background:
step 2.1, enlarging the coordinates of each bounding box in the target detection label of each sample in the sample set by a factor of 1.5 to obtain the enlarged coordinates of each bounding box;
step 2.2, cropping out all ship targets contained in each sample at the enlarged coordinate positions to form the target slice images of the sample;
step 2.3, filtering each target slice image with bilateral filtering and adaptive median filtering in sequence; the bilateral filter uses a kernel size of 5 and a filtering range of 10; the adaptive median filter dynamically adjusts its window size according to the gray values of the area covered by the filter window;
step 2.4, adaptively binarizing each filtered target slice with the maximum inter-class variance (OTSU) method to obtain the binary target slice corresponding to each target slice;
step 2.5, sequentially performing erosion-dilation (opening) and dilation-erosion (closing) operations on each binary target slice to obtain the processed binary target slices;
step 2.6, cropping each processed binary target slice according to the original (un-enlarged) bounding box coordinates, and pasting the cropped binary target slice onto an all-black image that has the same size as the sample and all pixel values equal to 0, obtaining a semantic segmentation mask label in which ship target pixels have value 255 and background pixels have value 0;
step 3, generating a training set:
step 3.1, performing self-adaptive SAR image ship semantic segmentation label generation operation on each sample in the training sample set to obtain a training sample set semantic segmentation label;
step 3.2, preprocessing each sample in the training sample set and the corresponding semantic segmentation mask label to obtain a preprocessed training sample, a preprocessed semantic segmentation label and a preprocessed target detection label;
step 3.3, forming a training set by all the training samples, the semantic segmentation labels and the target detection labels after pretreatment;
step 4, constructing a mask feature fusion target detection network consisting of a mask feature fusion sub-network, a feature extraction sub-network, a multi-scale feature fusion sub-network and a detection head thereof:
step 4.1, constructing an 8-layer mask feature fusion sub-network, wherein the sub-network comprises: the device comprises a first GateC module, a first CSP module, a second GateC module, a second CSP module, a convolution layer, a sigmoid layer, a first convolution sampling layer and a second convolution sampling layer; the first GateC module, the first CSP module, the second GateC module, the second CSP module, the convolutional layer and the sigmoid layer are connected in series; the first convolution sampling layer is connected with the first GateC module in series; the second convolution sampling layer is connected with the second GateC module in series; setting input channels of the first CSP module and the second CSP module as 64 and 32 respectively, setting output channels as 64 and 32 respectively, setting the number of BottleNeck modules as 1, and closing a residual error structure; the network structures of the first convolution sampling layer and the second convolution sampling layer are the same and are both formed by connecting a convolution layer and an upper sampling layer in series; setting input channels of convolution layers in the first convolution sampling layer and the second convolution sampling layer to be 128 and 256 respectively, setting output channels to be 1, setting convolution kernel sizes to be 1 multiplied by 1 and setting step length to be 1; the up-sampling layer in the first and second convolution sampling layers is set to be 640 multiplied by 640 pixels;
the first and second CSP modules have the same structure, and each CSP module comprises: the system comprises a first CBS module, a BottleNeck module, a second CBS module, a splicing layer and a third CBS module; the network structure of the CSP module is as follows: the first CBS module, the BottleNeck module, the splicing layer and the third CBS module are connected in series, and the second CBS module is connected with the first CBS module and the BottleNeck module in parallel on the splicing layer;
the first and second GateC modules have the same structure, and each GateC module has the following structure in sequence: the device comprises a first splicing layer, a first batch standardization layer, a first convolution layer, a ReLU activation layer, a second convolution layer, a second batch standardization layer, a sigmoid layer, a multiplication layer, an addition layer and a third convolution layer; setting input channels of first to third convolution layers in a first GateC module as 65, 65 and 64 respectively, and setting output channels as 65,1 and 64 respectively; setting the input channels of the first to the third convolution layers in the second GateC module as 33, 33 and 32 respectively, and setting the output channels as 33,1 and 32 respectively; the sizes of convolution kernels of all convolution layers in the two GateC modules are set to be 1 multiplied by 1, and the step length is set to be 1;
step 4.2, building a feature extraction sub-network whose structure is, in sequence: a first CBS module, a second CBS module, a first CSP module, a third CBS module, a second CSP module, a fourth CBS module, a third CSP module, a fifth CBS module, an ASPPF module and a fourth CSP module; the numbers of input channels of the first to fifth CBS modules are set to 1, 32, 64, 128 and 256, the numbers of output channels to 32, 64, 128, 256 and 512, the strides to 1, 2, 2, 2 and 2, and the convolution kernel sizes are all 3 × 3; the numbers of input channels of the first to fourth CSP modules are set to 64, 128, 256 and 512, the numbers of output channels to 64, 128, 256 and 512, and the numbers of BottleNeck modules to 1, 2, 2 and 2, respectively; the residual structure is enabled in the first to third CSP modules and disabled in the fourth CSP module;
the first to third CSP modules have the same structure as the CSP module in step 4.1; the first to fifth CBS modules have the same structure, and the network of each CBS module is formed by connecting a convolution layer, a batch normalization layer and a SiLU activation layer in series;
the ASPPF module includes: the system comprises a first CBM module, a first maximum pooling layer, a second maximum pooling layer, a third maximum pooling layer, a splicing layer and a second CBM module; wherein the first CBM module, the first maximum pooling layer, the second maximum pooling layer and the third maximum pooling layer are connected in series; the splicing layer is connected with the second CBM module in series; the first CBM module, the first maximum pooling layer, the second maximum pooling layer and the third maximum pooling layer are connected in parallel at the splicing layer; setting the convolution kernel sizes of the first to third maximum pooling layers to be 5 × 5; the first CBM module and the second CBM module have the same structure, and each CBM module is formed by serially connecting a convolution layer, a batch standardization layer and a MetaAcon self-adaptive activation layer; the input channel numbers of the first CBM module and the second CBM module are respectively set to be 512 and 1024, and the output channel numbers are respectively set to be 256 and 512; the sizes of convolution kernels of the convolution layers are all set to be 1 multiplied by 1, and the step lengths are all set to be 1;
step 4.3, constructing a multi-scale feature fusion sub-network whose structure comprises: a first CBS module, a first upsampling layer, a first splicing layer, a first CSP module, a second CBS module, a second upsampling layer, a second splicing layer, a second CSP module, a third CBS module, a third splicing layer, a third CSP module, a fourth CBS module, a fourth splicing layer, a fourth CSP module and a Fusion module; the first CBS module, the first upsampling layer, the first splicing layer, the first CSP module, the second CBS module, the second upsampling layer, the second splicing layer, the second CSP module, the third CBS module, the third splicing layer, the third CSP module, the fourth CBS module, the fourth splicing layer and the fourth CSP module are connected in series in sequence; the Fusion module is connected in series with the second splicing layer; the first CBS module is connected in parallel to the fourth splicing layer; the first CSP module is connected in parallel to the third splicing layer; the first and second upsampling layers are set to 80 × 80 pixels and 160 × 160 pixels, respectively; the numbers of input channels of the first to fourth CBS modules are set to 512, 256, 128 and 256, the numbers of output channels to 256, 128, 128 and 256, the convolution kernel sizes to 1 × 1, 1 × 1, 3 × 3 and 3 × 3, and the strides to 1, 1, 2 and 2, respectively; the numbers of input channels of the first to fourth CSP modules are set to 512, 256, 256 and 256, the numbers of output channels to 256, 128, 256 and 512, the number of BottleNeck modules is set to 1 for each, and the residual structures are disabled;
the Fusion module sequentially comprises a down-sampling layer, a splicing layer, a first convolution layer, a first batch standardization layer, a first ReLU activation layer, a random abandonment layer, a second convolution layer, a second batch standardization layer and a second ReLU activation layer; the Fusion module has two inputs, namely mask feature input and main feature input; setting the downsampling layer to 160 × 160 pixels; setting the input channels of the first convolution layer and the second convolution layer to be 288 and 320 respectively, setting the output channels of the first convolution layer and the second convolution layer to be 320 and 256 respectively, setting the convolution kernel size to be 3 respectively, and setting the step size to be 1 respectively; the discarding rate of the random discarding layer is set to 0.1;
step 4.4, connecting the first CSP module in the feature extraction sub-network with the first GateC module in the mask feature fusion sub-network to serve as gating input of the first GateC module; respectively connecting a first convolution sampling layer and a second convolution sampling layer of the mask feature fusion sub-network with a second CSP module and a third CSP module in the feature extraction sub-network; connecting a fourth CBS module in the feature extraction sub-network with a Fusion module in the multi-scale feature Fusion sub-network, and inputting the fourth CBS module as the main feature of the Fusion module; connecting a fifth CBS module and a fourth CSP module in the feature extraction sub-network with a first splicing layer and a first CBS module in the multi-scale feature fusion sub-network respectively; the outputs of the second CSP module, the third CSP module and the fourth CSP module of the multi-scale feature fusion sub-network are respectively connected with the detection head 1, the detection head 2 and the detection head 3 to obtain a mask feature fusion target detection network;
step 5, generating the target detection loss function Loss_all as follows:

Loss_all = Loss_target/(2σ1²) + Loss_bce/(2σ2²) + log(σ1) + log(σ2)

where Loss_target denotes the loss value of the target detection task, i.e. the loss function of the conventional Yolov5 network; Loss_bce denotes the weighted cross-entropy loss value; σ1 and σ2 denote the weight coefficients automatically assigned to Loss_target and Loss_bce during network training; and log denotes the logarithm with the natural constant e as its base;
the weighted cross-entropy loss function Loss_bce is as follows:

Loss_bce = -(1/N) Σ_{i=1}^{N} [ω_p · y_i · log(p_i) + ω_n · (1 - y_i) · log(1 - p_i)]

ω_p = n_neg/(n_pos + n_neg)

ω_n = n_pos/(n_pos + n_neg)

where N denotes the total number of semantic segmentation mask label images in the current network training iteration, y_i denotes the pixel values of the i-th semantic segmentation mask label, p_i denotes the output of the network model for the i-th semantic segmentation mask label, ω_p and ω_n denote the weights assigned to the ship target and the background, respectively, and n_pos and n_neg denote the total numbers of ship target pixels and background pixels in the semantic segmentation mask label, respectively;
step 6, training a mask feature fusion target detection network:
inputting the pictures in the training set into the mask feature fusion target detection network, and iteratively updating the network weights with the stochastic gradient descent (SGD) optimization algorithm until the loss function converges, obtaining the trained mask feature fusion target detection network;
step 7, detecting the position of the ship target in the SAR image:
resampling the SAR ship target image to be detected to 640 × 640 pixels, inputting it into the trained mask feature fusion target detection network, and outputting the bounding-box coordinates and confidence of each ship target in the image to be detected.
Compared with the prior art, the invention has the following advantages:
firstly, a mask network feature fusion branch is additionally arranged in a Yolov5 network, the attention of the network is focused on a ship target by extracting image features from the Yolov5 network and combining gating convolution and feature fusion operation, the interference brought by background information is weakened, the target and the background are better distinguished by a target detection network, the problem of low false alarm high detection precision caused by the network indiscriminate processing of the target and the background is avoided, and the detection accuracy of the ship target in the SAR image is improved.
Secondly, the invention designs an adaptive strategy for producing semantic segmentation mask labels. The strategy separates target and background by using the brightness gradient difference between them and adaptively extracts semantic segmentation mask labels, so that the semantic segmentation task and the target detection task can be combined even without semantic segmentation labels, further improving the detection accuracy for ship targets in SAR images.
Thirdly, the invention designs a cross entropy loss function with weight coefficients for the mask branch, improves the constraint weight of the branch on the ship target, solves the problem of unbalanced foreground and background, and further optimizes the target detection effect.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall network architecture of the present invention;
FIG. 3 is a schematic diagram of a residual block CSP module of the present invention;
FIG. 4 is a schematic diagram of a gated convolution GateC module of the present invention;
FIG. 5 is a schematic diagram of the ASPPF module of the present invention;
FIG. 6 is a schematic diagram of a Fusion feature Fusion module according to the present invention;
FIG. 7 is a schematic diagram of the visual target detection results of the simulation experiment of the present invention on SSDD data set.
FIG. 8 is a schematic diagram of the visual target detection result of the simulation experiment on the HRSID data set according to the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
The implementation steps of the embodiment of the present invention are described in further detail with reference to fig. 1.
Step 1, generating a training sample set and a testing sample set.
The SAR image samples used in the embodiments of the present invention are derived from the SSDD and HRSID datasets. The SSDD dataset is a multi-resolution, multi-size and multi-sensor SAR ship dataset published by Li et al. (J. Li, C. Qu, and J. Shao, "Ship detection in SAR images based on an improved Faster R-CNN," in 2017 SAR in Big Data Era, 2017). The SSDD dataset contains 1160 images of different scales with a total of 2456 ships, and the scale of the ship targets ranges from 7 × 7 to 211 × 298 pixels. The HRSID dataset is the SAR ship dataset published by Wei et al. (S. Wei, X. Zeng, Q. Qu, M. Wang, H. Su and J. Shi, "HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation," IEEE Access, vol. 8, pp. 120234-120254, 2020). The HRSID dataset contains 5604 images, all of 800 × 800 pixels, with 16954 ships of varying scale.
Each sample of the SSDD and HRSID datasets contains at least one ship target. For each sample, both datasets annotate the bounding-box coordinates of every ship target in the PASCAL VOC labeling format, so the four vertex coordinates of the bounding boxes of all ship targets in each sample can be obtained from the target detection labels.
In the present invention, the 232 images of the SSDD dataset whose file-name indices end in 1 or 9 form the SSDD test sample set, and the remaining 928 images form the SSDD training sample set. The 3642 images of the HRSID dataset selected according to the split given in the authors' paper form the HRSID training sample set, and the remaining 1962 images form the HRSID test sample set.
And 2, generating a self-adaptive SAR image ship semantic segmentation label.
Because the SAR image has the characteristics of bright ship target and dark ocean background, the background and the target have brightness gradient difference at the boundary of the ship target. The target and the background can be separated by utilizing the brightness gradient difference, a semantic segmentation mask label is extracted, and the mask feature fusion target detection network is helped to complete training.
The embodiment of the invention separates the target and the background in each sample of the SSDD data set and the HRSID data set, and extracts the semantic segmentation mask label in the sample, and the steps are as follows:
and 2.1, amplifying the coordinates of each bounding box in the target detection label of each sample by 1.5 times to obtain the amplified coordinates of each bounding box.
And 2.2, cutting out all ship targets contained in each sample according to the amplified coordinate position to form a target slice image of the sample.
And 2.3, to reduce noise interference, the embodiment of the invention filters each target slice with bilateral filtering and adaptive median filtering in sequence to obtain a clean target slice. The bilateral filter uses a kernel size of 5 and a filtering range of 10. Adaptive median filtering adds to ordinary median filtering a step that adaptively changes the size of the sliding window: the window size is adjusted dynamically according to the gray values of the area covered by the filter window, and in the embodiment of the invention the window size varies from 3 to 7, as sketched below.
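The filtering in this step can be sketched with OpenCV as follows; the mapping of "kernel size 5, filtering range 10" onto the cv2.bilateralFilter arguments and the adaptive_median_filter helper are assumptions made for illustration, not the exact implementation of the embodiment.

```python
import cv2
import numpy as np

def adaptive_median_filter(img, w_min=3, w_max=7):
    # Simplified adaptive median filter: the window grows from w_min to w_max
    # until its median is no longer an impulse, as described above.
    out = img.copy()
    pad = w_max // 2
    padded = np.pad(img, pad, mode='edge')
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            for w in range(w_min, w_max + 1, 2):
                r = w // 2
                win = padded[y + pad - r:y + pad + r + 1, x + pad - r:x + pad + r + 1]
                z_min, z_med, z_max = int(win.min()), int(np.median(win)), int(win.max())
                if z_min < z_med < z_max or w == w_max:
                    out[y, x] = img[y, x] if z_min < img[y, x] < z_max else z_med
                    break
    return out

target_slice = cv2.imread('target_slice.png', cv2.IMREAD_GRAYSCALE)   # path is illustrative
slice_filtered = cv2.bilateralFilter(target_slice, d=5, sigmaColor=10, sigmaSpace=10)
slice_filtered = adaptive_median_filter(slice_filtered)
```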
And 2.4, obtaining a binarization threshold for each filtered target slice image with the maximum inter-class variance (OTSU) method according to its gray values. Since the slice is rectangular while the ship target is spindle-shaped, the slice still contains some background. According to the calculated threshold, the part of each target slice above the threshold is taken as target and the part below it as background; the slice is thus divided into a background part with pixel value 0 and a target part with pixel value 255, giving the adaptively processed binary target slice.
And 2.5, to reduce the interference of holes and isolated bright spots in the binary image, erosion-dilation (opening) and dilation-erosion (closing) operations are performed on each binary target slice in sequence. Because the target scale varies, the embodiment of the invention selects filter kernels of different sizes for the two operations in order to obtain mask labels closer to the real targets, using the size of the target slice as the dividing basis. Specifically, no opening or closing is performed on binary target slices whose width and height are both less than 20 pixels. For binary target slices whose width and height lie in the ranges 20-150 pixels, 150-210 pixels and above 210 pixels, the kernel sizes selected for the opening and closing operations are 3 and 2, 4 and 2, and 5 and 4, respectively. The processed binary target slices are obtained after the opening and closing operations, as sketched below.
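Steps 2.4 and 2.5 can be sketched as follows, continuing the previous sketch (slice_filtered from above); the use of square structuring elements and the size test on max(h, w) are assumptions made for illustration.

```python
# OTSU binarization of a filtered target slice (step 2.4)
_, binary = cv2.threshold(slice_filtered, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Scale-dependent opening/closing (step 2.5), following the kernel sizes given above
h, w = binary.shape
if max(h, w) < 20:
    cleaned = binary                                   # very small slices: no morphology
else:
    if max(h, w) <= 150:
        k_open, k_close = 3, 2
    elif max(h, w) <= 210:
        k_open, k_close = 4, 2
    else:
        k_open, k_close = 5, 4
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                               np.ones((k_open, k_open), np.uint8))    # erosion then dilation
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE,
                               np.ones((k_close, k_close), np.uint8))  # dilation then erosion
```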
And 2.6, in order to avoid introducing interference into the target detection task, performing cutting operation on each processed binary target slice image according to the corresponding original bounding box scale in the sample label.
And 2.7, according to the vertex coordinates of the original bounding box corresponding to the sample label, pasting the cut binary target slice image on a completely black image which has the same size as the sample size and all pixel values of 0 to obtain a semantic segmentation mask label with a ship target pixel value of 255 and a background pixel value of 0.
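The assembly of the full-size mask label in steps 2.6 and 2.7 amounts to pasting each processed binary slice back at its original (un-enlarged) bounding box on an all-zero canvas. A minimal sketch follows, continuing the sketches above; center_crop is a hypothetical helper that crops the slice back to the original box size.

```python
def center_crop(img, target_hw):
    # Hypothetical helper: crop the central target_hw region out of img.
    th, tw = target_hw
    y0 = max((img.shape[0] - th) // 2, 0)
    x0 = max((img.shape[1] - tw) // 2, 0)
    return img[y0:y0 + th, x0:x0 + tw]

# sample_h, sample_w: size of the original SAR sample;
# original_boxes: un-enlarged (x1, y1, x2, y2) boxes from the detection label;
# binary_slices: the processed binary target slices from step 2.5
mask_label = np.zeros((sample_h, sample_w), dtype=np.uint8)   # all-black canvas, pixel value 0
for (x1, y1, x2, y2), bin_slice in zip(original_boxes, binary_slices):
    mask_label[y1:y2, x1:x2] = center_crop(bin_slice, (y2 - y1, x2 - x1))  # ship pixels are 255
```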
And 3, generating a training set and a test set.
Step 3.1, because the semantic segmentation labels are only used in the training process of the network, the semantic segmentation labels are not generated aiming at the test samples. And respectively executing the generation operation of the self-adaptive SAR image ship semantic segmentation labels on each sample in the SSDD training sample set and the HRSID training sample set to obtain the SSDD training sample set semantic segmentation labels and the HRSID data set training sample set semantic segmentation labels.
And 3.2, to keep the semantic segmentation masks and target detection labels aligned with the training samples after data enhancement, each sample in the SSDD and HRSID training sample sets is sequentially processed with sampling, random flipping, random cropping and mosaic-stitching data enhancement to obtain training samples with an input size of 640 × 640 pixels. The semantic segmentation masks undergo exactly the same sampling, random flipping, random cropping and mosaic-stitching operations as the training samples, and the bounding-box coordinates in the target detection labels are adjusted according to the positions of the ship targets in the preprocessed training samples. The preprocessed training samples, semantic segmentation labels and target detection labels are thus obtained; one aligned augmentation step is sketched below.
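The sketch below shows one aligned augmentation step (random horizontal flip only; random cropping and mosaic stitching are handled in the same paired fashion); paired_flip is an illustrative helper, not the exact preprocessing code of the embodiment.

```python
import random
import numpy as np

def paired_flip(image, mask, boxes):
    # Apply the same random horizontal flip to the sample and its mask label,
    # and mirror the bounding boxes accordingly.
    if random.random() < 0.5:
        image = np.fliplr(image).copy()
        mask = np.fliplr(mask).copy()
        w = image.shape[1]
        boxes = [(w - x2, y1, w - x1, y2) for (x1, y1, x2, y2) in boxes]
    return image, mask, boxes
```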
And 3.3, generating a training set and a testing set. And (3) forming an SSDD training set by all the samples subjected to SSDD training sample set preprocessing, the semantic segmentation labels and the target detection labels. The test sample set and the target detection label of the SSDD form an SSDD test set. All preprocessed samples, semantic segmentation labels and target detection labels in the HRSID training samples form an HRSID training set, and the HRSID test sample set and the target detection labels form an HRSID test set.
And 4, constructing a mask feature fusion target detection network.
The constructed mask feature fusion target detection network structure is further described with reference to fig. 2.
The mask feature fusion target detection network is formed by four parts of network connection. The four parts of networks are respectively: the system comprises a mask feature fusion sub-network, a feature extraction sub-network, a multi-scale feature fusion sub-network and a detection head.
And 4.1, constructing an 8-layer mask feature fusion sub-network. As shown in the mask feature fusion sub-network portion of the overall network structure of fig. 2.
The mask feature fusion sub-network comprises: the device comprises a first GateC module, a first CSP module, a second GateC module, a second CSP module, a convolution layer, a sigmoid layer, a first convolution sampling layer and a second convolution sampling layer. The first GateC module, the first CSP module, the second GateC module, the second CSP module, the convolution layer and the sigmoid layer are connected in series. The first convolution sampling layer is connected with the first GateC module in series. The second convolutional sampling layer is connected in series with the second GateC module. The input channels of the first CSP module and the second CSP module are respectively set to be 64 and 32, the output channels are respectively set to be 64 and 32, the number of the BottleNeck modules is respectively set to be 1, and the residual error structure is closed. The network structures of the first convolution sampling layer and the second convolution sampling layer are the same and are formed by connecting a convolution layer and an upper sampling layer in series. The input channels of the convolutional layers in the first convolutional sampling layer and the second convolutional sampling layer are respectively set to be 128 and 256, the output channels are both set to be 1, the sizes of the convolutional cores are both set to be 1 multiplied by 1, and the step sizes are both set to be 1. The upsampling layers in the first and second convolutional sampling layers are each set to 640 x 640 pixels.
The first to second CSP modules have the same structure as shown in fig. 4. The CSP module includes: the system comprises a first CBS module, a BottleNeck module, a second CBS module, a splicing layer and a third CBS module. The network structure of the CSP module is as follows: the first CBS module, the BottleNeck module, the splicing layer and the third CBS module are connected in series, and the second CBS module is connected with the first CBS module and the BottleNeck module in parallel with the splicing layer.
The first and second GateC modules have the same structure, as shown in fig. 3. The structure of each GateC module is as follows in sequence: the device comprises a first splicing layer, a first batch standardization layer, a first convolution layer, a ReLU activation layer, a second convolution layer, a second batch standardization layer, a sigmoid layer, a multiplication layer, an addition layer and a third convolution layer. The input channels of the first to third convolution layers in the first GateC module are set to 65, 65, 64, respectively, and the output channels are set to 65,1, 64, respectively. The input channels of the first to third convolution layers in the second GateC module are set to 33, 33, 32, respectively, and the output channels are set to 33,1, 32, respectively. The convolution kernel size of all convolution layers in both GateC modules is set to 1 × 1, and the step size is set to 1.
The first and second GateC modules each have two inputs, a gating input and a positioning feature input. The gating input and the positioning feature input are joined at the first splicing layer. The positioning feature input is multiplied by the output of the sigmoid layer at the multiplication layer, and the product is then combined with the gating features at the addition layer. The outputs of the first and second convolution sampling layers are the positioning inputs of the first and second GateC modules, respectively.
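A PyTorch sketch of a GateC module consistent with the channel settings above (64-channel gating input, 1-channel positioning input) is given below; the exact multiply/add wiring and the assumption that both inputs share the same spatial size are inferred from the description, not taken from the original implementation.

```python
import torch
import torch.nn as nn

class GateC(nn.Module):
    def __init__(self, gate_ch=64, loc_ch=1):
        super().__init__()
        mid = gate_ch + loc_ch                           # 65 for the first GateC module
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv1 = nn.Conv2d(mid, mid, kernel_size=1, stride=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid, 1, kernel_size=1, stride=1)
        self.bn2 = nn.BatchNorm2d(1)
        self.conv3 = nn.Conv2d(gate_ch, gate_ch, kernel_size=1, stride=1)

    def forward(self, gate_feat, loc_feat):
        x = torch.cat([gate_feat, loc_feat], dim=1)      # first splicing (concat) layer
        attn = torch.sigmoid(self.bn2(self.conv2(self.relu(self.conv1(self.bn1(x))))))
        fused = gate_feat + loc_feat * attn              # multiplication then addition (wiring assumed)
        return self.conv3(fused)
```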
And 4.2, building a feature extraction sub-network, as shown in the feature extraction sub-network part in the overall network structure of FIG. 2.
The structure of the feature extraction sub-network is, in sequence: a first CBS module, a second CBS module, a first CSP module, a third CBS module, a second CSP module, a fourth CBS module, a third CSP module, a fifth CBS module, an ASPPF module and a fourth CSP module. The first to third CSP modules have the same structure as the CSP module in step 4.1. The numbers of input channels of the first to fifth CBS modules are set to 1, 32, 64, 128 and 256, the numbers of output channels to 32, 64, 128, 256 and 512, the strides to 1, 2, 2, 2 and 2, and the convolution kernel sizes are all 3 × 3. The numbers of input channels of the first to fourth CSP modules are set to 64, 128, 256 and 512, the numbers of output channels to 64, 128, 256 and 512, and the numbers of BottleNeck modules to 1, 2, 2 and 2, respectively. The residual structure is enabled in the first to third CSP modules and disabled in the fourth CSP module.
The first to fifth CBS modules have the same structure, and the network structure of each CBS module is: convolutional layer, batch normalization layer, siLU activation layer in series.
The structure of the ASPPF module is further described with reference to fig. 5. The ASPPF module includes: the device comprises a first CBM module, a first maximum pooling layer, a second maximum pooling layer, a third maximum pooling layer, a splicing layer and a second CBM module. Wherein the first CBM module, the first max pooling layer, the second max pooling layer and the third max pooling layer are connected in series. The splice layer is connected in series with the second CBM module. The first CBM module, the first maximum pooling layer, the second maximum pooling layer and the third maximum pooling layer are connected in parallel at the splicing layer. The convolution kernel sizes of the first to third largest pooling layers are all set to 5 × 5. The first CBM module and the second CBM module have the same structure, and the network structure of the CBM modules is as follows: and a convolution layer, a batch normalization layer and a MetaAcon self-adaptive activation layer are connected in series. The input channel numbers of the first CBM module and the second CBM module are respectively set to 512 and 1024, and the output channel numbers are respectively set to 256 and 512. The convolution kernel sizes of the convolutional layers are all set to 1 × 1, and the step sizes are all set to 1.
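An SPPF-style sketch of the ASPPF block matching the channel settings above is shown next; MetaAcon is replaced by SiLU here purely to keep the sketch self-contained, which is an assumption.

```python
import torch
import torch.nn as nn

class ASPPF(nn.Module):
    def __init__(self, c_in=512, c_mid=256, c_out=512, k=5):
        super().__init__()
        # CBM = convolution + batch normalization + activation (MetaAcon in the patent; SiLU as a stand-in)
        self.cbm1 = nn.Sequential(nn.Conv2d(c_in, c_mid, 1, 1, bias=False),
                                  nn.BatchNorm2d(c_mid), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cbm2 = nn.Sequential(nn.Conv2d(c_mid * 4, c_out, 1, 1, bias=False),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        x = self.cbm1(x)
        p1 = self.pool(x)        # three 5 x 5 max-pooling layers applied in series
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.cbm2(torch.cat([x, p1, p2, p3], dim=1))   # splicing layer, then second CBM
```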
And 4.3, constructing a multi-scale feature fusion sub-network, wherein the structure of the multi-scale feature fusion sub-network is shown as the multi-scale feature fusion sub-network part in the whole network structure of the figure 2.
The multi-scale feature fusion sub-network comprises: a first CBS module, a first upsampling layer, a first splicing layer, a first CSP module, a second CBS module, a second upsampling layer, a second splicing layer, a second CSP module, a third CBS module, a third splicing layer, a third CSP module, a fourth CBS module, a fourth splicing layer, a fourth CSP module and a Fusion module. The first CBS module, the first upsampling layer, the first splicing layer, the first CSP module, the second CBS module, the second upsampling layer, the second splicing layer, the second CSP module, the third CBS module, the third splicing layer, the third CSP module, the fourth CBS module, the fourth splicing layer and the fourth CSP module are connected in series. The Fusion module is connected in series with the second splicing layer, the first CBS module is connected in parallel to the fourth splicing layer, and the first CSP module is connected in parallel to the third splicing layer. The first and second upsampling layers are set to 80 × 80 pixels and 160 × 160 pixels, respectively. The numbers of input channels of the first to fourth CBS modules are set to 512, 256, 128 and 256, the numbers of output channels to 256, 128, 128 and 256, the convolution kernel sizes to 1 × 1, 1 × 1, 3 × 3 and 3 × 3, and the strides to 1, 1, 2 and 2, respectively. The numbers of input channels of the first to fourth CSP modules are set to 512, 256, 256 and 256, the numbers of output channels to 256, 128, 256 and 512, the number of BottleNeck modules is set to 1 for each, and the residual structures are disabled.
The Fusion module has a structure shown in fig. 6, and the structure sequentially includes a down-sampling layer, a splicing layer, a first convolution layer, a first batch normalization layer, a first ReLU activation layer, a random discard layer, a second convolution layer, a second batch normalization layer, and a second ReLU activation layer. The Fusion module has two inputs, a mask feature input and a trunk feature input. The downsampling layer is set to 160 × 160 pixels. The input channels of the first and second convolutional layers are set to 288, 320 respectively, the output channels are set to 320, 256 respectively, the convolutional kernel sizes are set to 3 respectively, and the step sizes are set to 1 respectively. The discard rate of the random discard layer is set to 0.1.
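A sketch of the Fusion block with the stated channel counts (32-channel mask features plus 256-channel trunk features = 288) follows; resizing both inputs to 160 × 160 with nearest-neighbour interpolation, the convolution padding and the use of spatial dropout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    def __init__(self, mask_ch=32, trunk_ch=256, mid_ch=320, out_ch=256, size=160, p_drop=0.1):
        super().__init__()
        self.size = size
        self.block = nn.Sequential(
            nn.Conv2d(mask_ch + trunk_ch, mid_ch, 3, padding=1, bias=False),  # 288 -> 320
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),                                             # discard rate 0.1
            nn.Conv2d(mid_ch, out_ch, 3, padding=1, bias=False),              # 320 -> 256
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, mask_feat, trunk_feat):
        # Bring the mask features and trunk features to 160 x 160, then concatenate and fuse.
        mask_feat = F.interpolate(mask_feat, size=(self.size, self.size), mode='nearest')
        trunk_feat = F.interpolate(trunk_feat, size=(self.size, self.size), mode='nearest')
        return self.block(torch.cat([mask_feat, trunk_feat], dim=1))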
And 4.4, combining the four-part network. Connecting a first CSP module in the feature extraction sub-network with a first GateC module in the mask feature fusion sub-network to serve as gating input of the first GateC module; respectively connecting a first convolution sampling layer and a second convolution sampling layer of the mask feature fusion sub-network with a second CSP module and a third CSP module in the feature extraction sub-network; connecting a fourth CBS module in the feature extraction sub-network with a Fusion module in the multi-scale feature Fusion sub-network, and inputting the fourth CBS module as the main feature of the Fusion module; respectively connecting a fifth CBS module and a fourth CSP module in the feature extraction sub-network with a first splicing layer and a first CBS module in the multi-scale feature fusion sub-network; the outputs of the second CSP module, the third CSP module and the fourth CSP module of the multi-scale feature fusion sub-network are respectively connected with the detection head 1, the detection head 2 and the detection head 3 to obtain a mask feature fusion target detection network.
The CSP modules in the mask feature fusion target detection network are the same as the CSP module in the Yolov5 network, and the user can freely choose the number of BottleNeck modules connected inside each CSP module. The BottleNeck module has two variants, with and without a residual connection, and the user can choose whether to enable the residual structure: with the residual structure disabled, the BottleNeck module consists of two CBS modules connected in series; with the residual structure enabled, the two CBS modules are connected in series and their output is added to the input of the BottleNeck module to form the residual connection, as sketched below.
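A sketch of the CBS, BottleNeck and CSP building blocks as described above; the hidden-channel split inside the CSP module is an assumption borrowed from the Yolov5 C3 design.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + batch normalization + SiLU activation
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class BottleNeck(nn.Module):
    # Two CBS modules in series; the residual connection can be enabled or disabled
    def __init__(self, c, residual=True):
        super().__init__()
        self.cbs1 = CBS(c, c, 1, 1)
        self.cbs2 = CBS(c, c, 3, 1)
        self.residual = residual

    def forward(self, x):
        y = self.cbs2(self.cbs1(x))
        return x + y if self.residual else y

class CSP(nn.Module):
    def __init__(self, c_in, c_out, n=1, residual=False):
        super().__init__()
        c_hid = c_out // 2                        # channel split assumed, as in Yolov5 C3
        self.cbs1 = CBS(c_in, c_hid, 1, 1)
        self.cbs2 = CBS(c_in, c_hid, 1, 1)        # parallel branch joined at the splicing layer
        self.bottlenecks = nn.Sequential(*[BottleNeck(c_hid, residual) for _ in range(n)])
        self.cbs3 = CBS(2 * c_hid, c_out, 1, 1)

    def forward(self, x):
        return self.cbs3(torch.cat([self.bottlenecks(self.cbs1(x)), self.cbs2(x)], dim=1))
```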
And 6, constructing a network loss function.
The loss function of the network consists of a target detection loss function and a mask fusion sub-network cross entropy loss with weight.
And 6.1, constructing the target detection loss function. The target detection loss function Loss_target is consistent with Yolov5 and is the sum of the localization loss l_box, the confidence loss l_obj and the classification loss l_cls, namely:

Loss_target = l_box + l_obj + l_cls
step 6.2, constructing the weighted cross-entropy loss Loss_bce. To strengthen the attention of the network to ship targets, the invention improves on the ordinary cross-entropy loss by introducing the weight coefficients ω_p and ω_n when computing it. Loss_bce is calculated as follows:

Loss_bce = -(1/N) Σ_{i=1}^{N} [ω_p · y_i · log(p_i) + ω_n · (1 - y_i) · log(1 - p_i)]

ω_p = n_neg/(n_pos + n_neg)

ω_n = n_pos/(n_pos + n_neg)

where y_i denotes the pixel values of the i-th adaptive mask label, p_i denotes the corresponding output of the network model, ω_p and ω_n denote the weights assigned to the ship target and to the background, respectively, n_pos and n_neg denote the total numbers of ship target pixels and background pixels in the adaptive mask label, and log denotes the logarithm with the natural constant e as its base. From these formulas the weights ω_p and ω_n and the weighted cross-entropy loss Loss_bce can be calculated.
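A sketch of the weighted cross-entropy loss defined above; pred is assumed to be the sigmoid output of the mask branch and mask the adaptive label rescaled to {0, 1}.

```python
import torch

def weighted_bce_loss(pred, mask, eps=1e-7):
    # mask: 1 for ship pixels, 0 for background pixels
    n_pos = mask.sum()
    n_neg = mask.numel() - n_pos
    w_p = n_neg / (n_pos + n_neg)   # the rarer ship pixels receive the larger weight
    w_n = n_pos / (n_pos + n_neg)
    loss = -(w_p * mask * torch.log(pred + eps)
             + w_n * (1 - mask) * torch.log(1 - pred + eps))
    return loss.mean()
```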
And 6.3, automatic assignment of loss weights. The mask feature fusion sub-network and the target detection network can be regarded as two different tasks, semantic segmentation and target detection, which makes the training process unstable: the loss function may fail to converge and the training may fail. The invention therefore adds an automatic weight assignment method to the loss function.
The total loss function Loss_all is as follows:

Loss_all = Loss_target/(2σ1²) + Loss_bce/(2σ2²) + log(σ1) + log(σ2)

The invention assigns the weight coefficients σ1 and σ2 to the target detection loss Loss_target and the weighted cross-entropy loss Loss_bce. σ1 and σ2 are updated automatically during training to adapt to the optimization of the network.
And 7, training a mask feature fusion target detection network.
The preprocessed training set pictures are input into the network for training. The invention uses the SGD optimization algorithm to iteratively update the network weights until the loss function converges, obtaining the trained target detection network; a minimal sketch follows.
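Continuing the loss sketches above, a minimal training loop consistent with the description; MaskFusionDetector, yolo_detection_loss, train_loader and the SGD hyper-parameters are placeholders or assumptions made for illustration.

```python
model = MaskFusionDetector()                 # the mask feature fusion target detection network (placeholder name)
criterion = AutoWeightedLoss()
optimizer = torch.optim.SGD(list(model.parameters()) + list(criterion.parameters()),
                            lr=0.01, momentum=0.937, weight_decay=5e-4)   # hyper-parameters assumed

for epoch in range(300):                     # iterate until the loss function converges
    for images, seg_masks, det_targets in train_loader:
        det_out, mask_out = model(images)                        # detection outputs and mask branch output
        loss_target = yolo_detection_loss(det_out, det_targets)  # l_box + l_obj + l_cls, as in Yolov5
        loss_bce = weighted_bce_loss(mask_out, seg_masks)
        loss = criterion(loss_target, loss_bce)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```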
And 8, detecting the position of the ship target in the SAR image.
Each test sample in the test set is resampled to 640 × 640 pixels and input into the trained mask feature fusion target detection network, which outputs the bounding-box coordinates and confidence of the ship targets in each test sample.
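Inference on a single test image can be sketched as below; the normalization and the postprocess helper (confidence filtering and non-maximum suppression) are assumptions, not part of the original disclosure.

```python
import cv2
import torch

img = cv2.imread('test_sample.png', cv2.IMREAD_GRAYSCALE)               # path is illustrative
img = cv2.resize(img, (640, 640))
x = torch.from_numpy(img).float().div(255.0).unsqueeze(0).unsqueeze(0)  # 1 x 1 x 640 x 640

model.eval()
with torch.no_grad():
    det_out, _ = model(x)
boxes, scores = postprocess(det_out)   # hypothetical helper: confidence thresholding + NMS
```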
The effect of the present invention can be further demonstrated by the following simulation.
1. Simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: NVIDIA a100. The software platform of the simulation experiment is as follows: the Ubuntu 18.04.6 operating system is based on the PyTorch1.7.1 deep learning framework and the programming language is Python3.7.
2. Simulation experiment content and result analysis:
the simulation experiment of the present invention is to perform target detection on the test samples in the SSDD and HRSID data sets respectively by using the method of the present invention and Yolov5 in the prior art, and the obtained detection results are shown in fig. 7 and 8.
In the simulation, the prior-art Yolov5 refers to the target detection model released by Ultralytics in "Ultralytics/Yolov5: v5.0" (2021. [Online]. Available: https://github.com/Ultralytics/Yolov5).
The effect of the present invention will be further described below with reference to the simulation diagrams of fig. 7 and 8.
Fig. 7 is a comparison diagram of detection effects of a Yolov5 network and the present invention on an SSDD data set, where in fig. 7, 5 pictures are selected for comparison, a first row in fig. 7 represents a detection result of the Yolov5 network, a second row represents a detection result of the present invention, and a third row represents a correct result of a target detection label labeling.
As can be seen from the column (a) in fig. 7, the bounding box coordinates predicted by the present invention are more accurate than the prediction result of the Yolov5 network. It can be seen from columns (b) and (c) in fig. 7 that in SSDD data set, the Yolov5 network easily identifies the interfering targets such as marine reef as the ship target when the noise interference is large, resulting in false alarm, but the present invention can accurately distinguish reef from ship. As can be seen from columns (d) and (e) in fig. 7, in the near-shore background of densely arranged ships, background interference and ship target detail features are not obvious enough, which makes it difficult for the Yolov5 network to correctly identify closely arranged ships, and the present invention can better distinguish closely arranged ships. The detection result shown in fig. 7 shows that the network designed by the invention has better detection effect under the conditions of noise interference and complex background.
Fig. 8 is a comparison diagram of detection effects of the Yolov5 network and the present invention on an HRSID data set, where fig. 8 selects 5 pictures for comparison, a first row represents a detection result of the Yolov5 network, a second row represents a detection result of the present invention, and a third row represents a correct result of target detection label labeling.
As can be seen from columns (a) and (e) in fig. 8, in the HRSID dataset, for an image with a complex background, the Yolov5 network is difficult to distinguish the target from the background, and the detection result is prone to have missing detection and false alarm, which are well avoided by the present invention. Columns (b), (c) and (d) in fig. 8 show that the bounding box coordinates predicted by the invention are more accurate than the predicted result of the Yolov5 network. Compared with Yolov5, the network designed by the invention effectively reduces the problems of missed detection and false alarm, and has better detection effect.
To verify the simulation results, Yolov5 and the method of the invention are evaluated with the mAP metric, where the area enclosed by the precision-recall curve of each class and the coordinate axes is computed by integration. The specific formulas are as follows:

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

AP = ∫₀¹ Precision(Recall) d(Recall), mAP = (1/N) Σ_{i=1}^{N} AP_i

where TP, FP and FN refer to the numbers of correctly detected ships, false alarms and missed ships, respectively, N denotes the number of classes, AP is the area under the precision-recall curve, and mAP is the mean of the APs over all classes. The invention is directed at only one class of target, so the mAP equals the AP.
TABLE 1 mAP comparison of different methods on the SSDD dataset in the simulation experiment

Method          mAP50 (%)   mAP75 (%)
Yolov5          97.3        64.6
The invention   97.9        64.7
TABLE 2 mAP comparison of different methods on the HRSID dataset in the simulation experiment

Method          mAP50 (%)   mAP75 (%)
Yolov5          91.2        65.2
The invention   93.4        69.5
Combining Table 1 and Table 2, it can be seen that the average detection precision mAP50 of the SAR image ship target detection method of the invention on the SSDD and HRSID datasets is 97.9% and 93.4%, respectively, which is 0.6% and 2.2% higher than the Yolov5 network; the mAP75 is 64.7% and 69.5%, respectively, which is 0.1% and 4.3% higher than the Yolov5 network. This demonstrates that the invention achieves higher SAR ship target detection accuracy.
The above simulation experiments show that the SAR ship target detection method based on mask network fusion of image features provided by the invention builds a mask network feature fusion branch that highlights target features and reduces the interference of background information, avoiding the drop in detection accuracy caused by indiscriminate processing of target and background. The proposed adaptive strategy for producing semantic segmentation mask labels effectively separates target and background by using the brightness gradient difference between them and completes the label production for different SAR datasets, so that the task no longer depends heavily on semantic segmentation labels. In addition, the invention designs a weighted cross-entropy loss and an adaptive loss-weight assignment strategy for the network, so that the network separates targets from the background better and the training process remains stable. The method reduces false alarms and missed detections when detecting targets in SAR images of complex scenes, thereby improving target detection accuracy, and has important practical application value.

Claims (4)

1. A SAR ship target detection method based on mask network fusion image features, characterized in that adaptive SAR image ship semantic segmentation labels are generated using the brightness gradient difference between ships and the background, a mask feature fusion target detection network is constructed, and a weighted cross-entropy loss function is constructed; the target detection method comprises the following steps:
step 1, generating a sample set:
step 1.1, collecting at least 900 SAR ship image samples with a size larger than 320 × 320 pixels, each sample containing at least one ship target;
step 1.2, labeling a target detection label file for each sample, wherein each target detection label file comprises bounding box coordinates of all ship targets in the corresponding sample;
step 1.3, forming a sample set by all SAR ship image samples and target detection label files corresponding to the SAR ship image samples;
step 2, generating adaptive SAR image ship semantic segmentation labels using the brightness gradient difference between the ship and the background:
step 2.1, enlarging each bounding box in the target detection label of each sample in the sample set by a factor of 1.5 to obtain the enlarged bounding box coordinates;
step 2.2, cropping out all ship targets contained in each sample according to the enlarged coordinate positions to form the target slice images of that sample;
step 2.3, filtering each target slice with bilateral filtering followed by adaptive median filtering; the bilateral filter uses a kernel size of 5 and a filtering range of 10, and the adaptive median filter dynamically adjusts its window size according to the gray values covered by the window;
step 2.4, binarizing each filtered target slice adaptively with the maximum inter-class variance (OTSU) method to obtain the binary target slice corresponding to each target slice;
step 2.5, sequentially performing erosion-dilation and dilation-erosion operations on each binary target slice to obtain the processed binary target slices;
step 2.6, cropping each processed binary target slice back to the original bounding box coordinates in the target detection label of step 1.2, and pasting the cropped binary slice onto an all-black image of the same size as the sample (all pixel values 0), obtaining a semantic segmentation mask label in which ship target pixels are 255 and background pixels are 0;
step 3, generating a training set:
step 3.1, executing self-adaptive SAR image ship semantic segmentation label generation operation on each sample in the training sample set to obtain a training sample set semantic segmentation label;
step 3.2, preprocessing each sample in the training sample set and the corresponding semantic segmentation mask label to obtain a preprocessed training sample, a preprocessed semantic segmentation label and a preprocessed target detection label;
step 3.3, forming a training set by all the training samples, the semantic segmentation labels and the target detection labels after the pretreatment;
step 4, constructing a mask feature fusion target detection network consisting of a mask feature fusion sub-network, a feature extraction sub-network, a multi-scale feature fusion sub-network and detection heads:
step 4.1, constructing an 8-layer mask feature fusion sub-network, which comprises: a first GateC module, a first CSP module, a second GateC module, a second CSP module, a convolution layer, a sigmoid layer, a first convolution sampling layer and a second convolution sampling layer; the first GateC module, the first CSP module, the second GateC module, the second CSP module, the convolution layer and the sigmoid layer are connected in series; the first convolution sampling layer is connected in series with the first GateC module; the second convolution sampling layer is connected in series with the second GateC module; the input channels of the first and second CSP modules are set to 64 and 32 respectively, their output channels to 64 and 32 respectively, the number of BottleNeck modules is set to 1, and the residual structure is closed; the first and second convolution sampling layers have the same structure, each formed by a convolution layer and an up-sampling layer connected in series; the input channels of the convolution layers in the first and second convolution sampling layers are set to 128 and 256 respectively, their output channels to 1, their convolution kernel sizes to 1 × 1 and their step lengths to 1; the output size of the up-sampling layer in the first and second convolution sampling layers is set to 640 × 640 pixels;
the first and second CSP modules have the same structure; each CSP module comprises a first CBS module, a BottleNeck module, a second CBS module, a splicing layer and a third CBS module; the network structure of the CSP module is: the first CBS module, the BottleNeck module, the splicing layer and the third CBS module are connected in series, and the second CBS module is connected in parallel with the first CBS module and the BottleNeck module at the splicing layer;
the first and second GateC modules have the same structure; each GateC module consists, in order, of: a first splicing layer, a first batch standardization layer, a first convolution layer, a ReLU activation layer, a second convolution layer, a second batch standardization layer, a sigmoid layer, a multiplication layer, an addition layer and a third convolution layer; the input channels of the first to third convolution layers in the first GateC module are set to 65, 65 and 64 respectively and their output channels to 65, 1 and 64 respectively; the input channels of the first to third convolution layers in the second GateC module are set to 33, 33 and 32 respectively and their output channels to 33, 1 and 32 respectively; the convolution kernel sizes of all convolution layers in the two GateC modules are set to 1 × 1 and the step lengths to 1;
step 4.2, constructing a feature extraction sub-network whose structure is, in order: a first CBS module, a second CBS module, a first CSP module, a third CBS module, a second CSP module, a fourth CBS module, a third CSP module, a fifth CBS module, an ASPPF module and a fourth CSP module; the numbers of input channels of the first to fifth CBS modules are set to 1, 32, 64, 128 and 256 respectively, their output channels to 32, 64, 128, 256 and 512 respectively, their step lengths to 1, 2, 2, 2 and 2 respectively, and all their convolution kernel sizes to 3 × 3; the numbers of input channels of the first to fourth CSP modules are set to 64, 128, 256 and 512 respectively, their output channels to 64, 128, 256 and 512 respectively, with a corresponding number of BottleNeck modules in each, and the first to fourth CSP modules each close the residual structure;
the first to third CSP modules have the same structure as the CSP module in the step 4.1; the first CBS module, the second CBS module, the third CBS module and the fourth CBS module are identical in structure, and a network of each CBS module is formed by serially connecting a convolution layer, a batch standardization layer and a SiLU activation layer;
the ASPPF module includes: a first CBM module, a first maximum pooling layer, a second maximum pooling layer, a third maximum pooling layer, a splicing layer and a second CBM module; the first CBM module and the first to third maximum pooling layers are connected in series; the splicing layer is connected in series with the second CBM module; the first CBM module and the first to third maximum pooling layers are connected in parallel at the splicing layer; the pooling kernel sizes of the first to third maximum pooling layers are set to 5 × 5; the first and second CBM modules have the same structure, each formed by serially connecting a convolution layer, a batch standardization layer and a MetaAcon adaptive activation layer; the numbers of input channels of the first and second CBM modules are set to 512 and 1024 respectively and their output channels to 256 and 512 respectively; the convolution kernel sizes of their convolution layers are set to 1 × 1 and the step lengths to 1;
step 4.3, constructing a multi-scale feature fusion sub-network whose structure comprises: a first CBS module, a first up-sampling layer, a first splicing layer, a first CSP module, a second CBS module, a second up-sampling layer, a second splicing layer, a second CSP module, a third CBS module, a third splicing layer, a third CSP module, a fourth CBS module, a fourth splicing layer, a fourth CSP module and a Fusion module; the first CBS module, the first up-sampling layer, the first splicing layer, the first CSP module, the second CBS module, the second up-sampling layer, the second splicing layer, the second CSP module, the third CBS module, the third splicing layer, the third CSP module, the fourth CBS module, the fourth splicing layer and the fourth CSP module are connected in series in that order; the Fusion module is connected in series with the second splicing layer; the first CBS module is connected in parallel to the fourth splicing layer; the first CSP module is connected in parallel to the third splicing layer; the output sizes of the first and second up-sampling layers are set to 80 × 80 pixels and 160 × 160 pixels respectively; the numbers of input channels of the first to fourth CBS modules are set to 512, 256, 128 and 256 respectively, their output channels to 256, 128, 128 and 256 respectively, their convolution kernel sizes to 1 × 1, 1 × 1, 3 × 3 and 3 × 3 respectively, and their step lengths to 1, 1, 2 and 2 respectively; the numbers of input channels of the first to fourth CSP modules are set to 512, 256, 256 and 256 respectively, their output channels to 256, 128, 256 and 512 respectively, the numbers of BottleNeck modules are all set to 1, and the residual structures are closed;
the Fusion module sequentially comprises a down-sampling layer, a splicing layer, a first convolution layer, a first batch standardization layer, a first ReLU activation layer, a random dropout layer, a second convolution layer, a second batch standardization layer and a second ReLU activation layer; the Fusion module has two inputs, a mask feature input and a trunk feature input; the output size of the down-sampling layer is set to 160 × 160 pixels; the input channels of the first and second convolution layers are set to 288 and 320 respectively, their output channels to 320 and 256 respectively, their convolution kernel sizes to 3 × 3 and their step lengths to 1; the drop rate of the random dropout layer is set to 0.1;
step 4.4, connecting the first CSP module in the feature extraction sub-network to the first GateC module in the mask feature fusion sub-network as the gating input of the first GateC module; connecting the first and second convolution sampling layers of the mask feature fusion sub-network to the second and third CSP modules in the feature extraction sub-network, respectively; connecting the fourth CBS module in the feature extraction sub-network to the Fusion module in the multi-scale feature fusion sub-network as the trunk feature input of the Fusion module; connecting the fifth CBS module and the fourth CSP module in the feature extraction sub-network to the first splicing layer and the first CBS module in the multi-scale feature fusion sub-network, respectively; and connecting the outputs of the second, third and fourth CSP modules of the multi-scale feature fusion sub-network to detection head 1, detection head 2 and detection head 3, respectively, obtaining the mask feature fusion target detection network;
step 5, generating the target detection loss function Loss_all as follows:
$$Loss_{all} = \frac{1}{2\sigma_1^2} Loss_{target} + \frac{1}{2\sigma_2^2} Loss_{bce} + \log(\sigma_1 \sigma_2)$$
wherein Loss_target denotes the loss value of the target detection task, i.e., the loss function of the conventional Yolov5 network; Loss_bce denotes the weighted cross-entropy loss value; σ_1 and σ_2 denote the weight coefficients automatically assigned to Loss_target and Loss_bce, respectively, during network training; and log denotes the logarithm with the natural constant e as base;
the weighted cross-entropy loss function Loss_bce is as follows:
$$Loss_{bce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \omega_p\, y_i \log(p_i) + \omega_n\, (1 - y_i) \log(1 - p_i) \right]$$

$$\omega_p = \frac{n_{neg}}{n_{pos} + n_{neg}}, \qquad \omega_n = \frac{n_{pos}}{n_{pos} + n_{neg}}$$
wherein N denotes the total number of semantic segmentation mask label images in the current network training iteration, y_i denotes the pixel values of the i-th semantic segmentation mask label, p_i denotes the network output for the i-th semantic segmentation mask label, ω_p and ω_n denote the weights assigned to the ship target and the background respectively, and n_pos and n_neg denote the total numbers of ship target pixels and background pixels, respectively, in the semantic segmentation mask labels;
step 6, training a mask feature fusion target detection network:
inputting the pictures in the training set into the mask feature fusion target detection network, and iteratively updating the network weights with the stochastic gradient descent (SGD) optimization algorithm until the loss function converges, obtaining the trained mask feature fusion target detection network;
step 7, detecting the position of the ship target in the SAR image:
sampling the SAR ship image to be detected to 640 × 640 pixels, inputting it into the trained mask feature fusion target detection network, and outputting the target bounding box coordinates and confidence of each ship target in the image to be detected.
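As a non-authoritative illustration of step 5 above, the PyTorch-style sketch below implements the weighted cross-entropy over the predicted mask together with one standard way to realize the automatic weighting of the two task losses (homoscedastic-uncertainty weighting with learnable σ1 and σ2). The patent does not fully specify the parametrization, so the class names, the learnable log-σ parameters and the ε term are assumptions.

```python
import torch
import torch.nn as nn

class WeightedMaskBCE(nn.Module):
    """Weighted cross-entropy over the predicted mask (minimal sketch).
    `pred` holds per-pixel probabilities after the sigmoid layer, `label`
    holds the binary mask (ship = 1, background = 0)."""

    def forward(self, pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        eps = 1e-7                              # numerical-stability term (assumption)
        n_pos = label.sum()                     # ship target pixels
        n_neg = label.numel() - n_pos           # background pixels
        w_p = n_neg / (n_pos + n_neg)           # weight for ship pixels
        w_n = n_pos / (n_pos + n_neg)           # weight for background pixels
        loss = -(w_p * label * torch.log(pred + eps)
                 + w_n * (1.0 - label) * torch.log(1.0 - pred + eps))
        return loss.mean()                      # averaged over pixels and batch

class AdaptiveTaskWeighting(nn.Module):
    """One standard realization of automatic loss weighting: learnable
    sigma_1, sigma_2 with a log regularizer (assumed formulation)."""

    def __init__(self):
        super().__init__()
        self.log_sigma1 = nn.Parameter(torch.zeros(()))
        self.log_sigma2 = nn.Parameter(torch.zeros(()))

    def forward(self, loss_target: torch.Tensor, loss_bce: torch.Tensor) -> torch.Tensor:
        s1 = torch.exp(self.log_sigma1)
        s2 = torch.exp(self.log_sigma2)
        return (loss_target / (2.0 * s1 ** 2)
                + loss_bce / (2.0 * s2 ** 2)
                + self.log_sigma1 + self.log_sigma2)
```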
2. The SAR ship target detection method based on mask network fusion image features according to claim 1, characterized in that the maximum inter-class variance (OTSU) method in step 2.4 means that a binarization segmentation threshold is obtained for each filtered target slice from the gray values of that slice; the part of each target slice above the slice's binarization segmentation threshold is determined to be target and the part below the threshold to be background; the pixel values of the background part are set to 0 and those of the target part to 255, obtaining the adaptive binary target slice.
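For illustration only, a minimal OpenCV sketch of the OTSU binarization described in claim 2 is given below; the function name and the assumption that each slice is an 8-bit single-channel image are not taken from the patent.

```python
import cv2

def binarize_slice(filtered_slice):
    """OTSU binarization of one filtered target slice (illustrative helper;
    the slice is assumed to be a single-channel uint8 image)."""
    # cv2.THRESH_OTSU selects the threshold that maximizes inter-class variance;
    # pixels above it become target (255), the rest background (0).
    thresh, binary = cv2.threshold(filtered_slice, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary, thresh
```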
3. The SAR ship target detection method based on mask network fusion image features according to claim 1, characterized in that the erosion-dilation and dilation-erosion operations in step 2.5 refer to selecting different filter kernels for the erosion-dilation and dilation-erosion operations and applying different operations to the binary target slices falling in different scale intervals, according to the scale of the target slice.
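A minimal sketch of such scale-dependent morphological cleaning is shown below; the scale intervals and kernel sizes are illustrative assumptions, since the patent only states that they depend on the slice scale.

```python
import cv2

def clean_binary_slice(binary_slice):
    """Opening (erosion then dilation) followed by closing (dilation then
    erosion) on a binary target slice, with a kernel size chosen from the
    slice scale. The intervals and kernel sizes are assumptions."""
    h, w = binary_slice.shape[:2]
    scale = min(h, w)
    ksize = 3 if scale < 64 else (5 if scale < 160 else 7)    # assumed intervals
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (ksize, ksize))
    opened = cv2.morphologyEx(binary_slice, cv2.MORPH_OPEN, kernel)   # suppress speckle
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)        # fill small holes
    return closed
```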
4. The SAR ship target detection method based on mask network fusion image features according to claim 1, characterized in that the preprocessing in step 3.2 refers to sequentially applying sampling, random flipping, random cropping and mosaic-stitching data enhancement to each sample in the training sample set to obtain training samples with an input size of 640 × 640 pixels; applying the same sampling, random flipping, random cropping and mosaic-stitching data enhancement to the semantic segmentation mask labels as to the training samples; and adjusting the bounding box coordinates in the target detection labels according to the positions of the ship targets in the preprocessed training samples, obtaining the preprocessed training samples, semantic segmentation labels and target detection labels.
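To illustrate how the image, the mask label and the bounding boxes must stay aligned during preprocessing, the sketch below resizes a sample to the network input size and applies a random horizontal flip while updating the boxes; random cropping and mosaic stitching from claim 4 are omitted, and the function signature and box format ([x1, y1, x2, y2] in pixels) are assumptions.

```python
import cv2
import numpy as np

def preprocess(image, mask, boxes, size=640, flip_prob=0.5):
    """Resize a sample and its mask label to the network input size and apply
    a random horizontal flip, updating the bounding boxes accordingly.
    Random cropping and mosaic stitching are omitted; boxes are assumed to be
    pixel-coordinate [x1, y1, x2, y2] rows."""
    h, w = image.shape[:2]
    sx, sy = size / w, size / h
    image = cv2.resize(image, (size, size))
    # Nearest-neighbour keeps the mask strictly binary after resizing.
    mask = cv2.resize(mask, (size, size), interpolation=cv2.INTER_NEAREST)
    boxes = boxes.astype(np.float32)
    boxes[:, [0, 2]] *= sx
    boxes[:, [1, 3]] *= sy
    if np.random.rand() < flip_prob:
        image = image[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
        x1 = boxes[:, 0].copy()
        boxes[:, 0] = size - boxes[:, 2]
        boxes[:, 2] = size - x1
    return image, mask, boxes
```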
CN202211567684.4A 2022-12-07 2022-12-07 SAR ship target detection method based on mask network fusion image characteristics Pending CN115965862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211567684.4A CN115965862A (en) 2022-12-07 2022-12-07 SAR ship target detection method based on mask network fusion image characteristics

Publications (1)

Publication Number Publication Date
CN115965862A true CN115965862A (en) 2023-04-14

Family

ID=87362643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211567684.4A Pending CN115965862A (en) 2022-12-07 2022-12-07 SAR ship target detection method based on mask network fusion image characteristics

Country Status (1)

Country Link
CN (1) CN115965862A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778293A (en) * 2023-08-24 2023-09-19 齐鲁工业大学(山东省科学院) Image fusion method based on mask
CN116778293B (en) * 2023-08-24 2023-12-22 齐鲁工业大学(山东省科学院) Image fusion method based on mask
CN117036893A (en) * 2023-10-08 2023-11-10 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117036893B (en) * 2023-10-08 2023-12-15 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination