CN112507777A - Optical remote sensing image ship detection and segmentation method based on deep learning - Google Patents
- Publication number
- CN112507777A (Application No. CN202011080445.7A)
- Authority
- CN
- China
- Prior art keywords
- ship
- detection
- resolution
- mask
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention discloses an optical remote sensing image ship detection and segmentation method based on deep learning, which comprises the following steps: step one, reading in image data and preprocessing the image according to a transfer learning method; step two, constructing a multi-resolution parallel convolution backbone network HRFPN to extract an image feature map; step three, generating ship candidate regions based on an RPN; step four, using the idea of multi-task cascade detection, adding a semantic segmentation branch, obtaining the classification probability value, localization box and mask of the ship, and calculating the loss function; and step five, obtaining refined detection results with an NMS method. The invention adds a multi-resolution parallel convolution module and a multi-task cascade detection module on top of deep neural network target detection and segmentation, effectively improves the accuracy of ship detection and segmentation in optical remote sensing images, and in particular detects small targets better.
Description
Technical Field
The invention relates to an optical remote sensing image ship detection and segmentation method based on deep learning, and belongs to the technical field of intelligent remote sensing image identification.
Background
In recent years, with the development of aerospace technology, high-resolution optical remote sensing images have become one of the important means for detecting marine ships. Compared with infrared and SAR images, optical remote sensing images have high spatial resolution and therefore represent the geometry of objects more distinctly, and the observed target scene contains sufficient feature information. More and more scholars and research institutions are focusing on identifying and locating marine targets in high-resolution remote sensing images, which further promotes the development of marine commercial activity.
With the development of deep learning technology and GPU computing power, deep convolutional neural networks have shown strong target feature extraction capability in computer vision tasks. At present, most methods only perform ship detection and neglect pixel-level segmentation of the target. One problem of existing rectangular-box ship detection methods is that background pixels remain inside the bounding box of the local candidate region, which is detrimental to classifying the candidate region; by performing instance segmentation on the image, a ship mask free of background pixels is obtained, enabling accurate classification and fine localization of ship targets.
The paper "Mask R-CNN" by Kaiming He et al. (16th IEEE International Conference on Computer Vision, ICCV 2017) proposes a method for performing target detection and instance segmentation simultaneously in one network. First, image features are extracted with a base network, ResNet-50 or ResNet-101, and fused with a Feature Pyramid Network (FPN); then ship candidate regions are obtained through the region proposal network RPN, and a region of interest (RoI) alignment operation is applied to the candidate region feature maps. For the classification and bounding-box prediction branches, the aligned feature vectors predict the category and position of each candidate box through fully connected layers; for the segmentation branch, the aligned feature vectors pass through a Fully Convolutional Network (FCN) to predict the mask of the target. The method improves target detection by supervising mask information. However, it still has a shortcoming: because optical remote sensing images are large and ships vary widely in size and orientation, ship targets, especially small ones, cannot be detected effectively.
Disclosure of Invention
The invention aims to solve the technical problem of providing an optical remote sensing image ship detection and segmentation method based on deep learning.
The invention is realized by the following scheme: an optical remote sensing image ship detection and segmentation method based on deep learning comprises the following steps:
step one, reading in image data, and preprocessing the image according to a transfer learning method;
step two, constructing a multi-resolution parallel convolution backbone network HRFPN to extract an image feature map;
step three, generating ship candidate regions based on the RPN;
step four, using the idea of multi-task cascade detection, adding a semantic segmentation branch, obtaining the classification probability value, localization box and mask of the ship, and calculating the loss function;
and step five, obtaining a refined detection result by utilizing an NMS method.
In the first step, model parameters obtained by training a convolutional neural network on a large data set are used as initial values of the feature extraction layers of the network, and model fine-tuning is then carried out.
In the first step, the HRNetV2-W40 model obtained by training on the ImageNet data set applies mean subtraction to its inputs during training; when the trained HRNetV2-W40 model is transferred to the ship detection and segmentation task, the same mean-subtraction preprocessing is therefore applied to the images.
The overall network in the second step comprises four stages. The resolution of the input image is down-sampled to 1/4 of the original by two 3×3 convolutions with stride 2, which serves as the input to the first stage and is also the feature map resolution of stage 1; stages 2, 3 and 4 contain feature maps of 2, 3 and 4 resolutions, respectively. Stage 1 contains 4 residual units, each consisting of a bottleneck module with 64 channels, after which the number of channels of the feature map is scaled to C by one 3×3 convolution. Stages 2, 3 and 4 consist of 1, 4 and 3 repeated modular multi-resolution blocks, respectively; a multi-resolution block consists of a multi-resolution group convolution and a multi-resolution convolution. Each branch of the multi-resolution group convolution contains 4 residual units, and for each resolution each unit contains two 3×3 convolutions. The inputs and outputs of the multi-resolution convolution are feature maps of different resolutions; to ensure that the resolution and channel number of the feature maps are consistent during fusion, a high-resolution feature map is fused into a low-resolution one through i 3×3 convolutions with stride 2, and a low-resolution feature map is fused into a high-resolution one through j bilinear upsampling operations, where i and j take values in [1, 3] according to the feature map sizes. The channel numbers of the feature maps at the 4 resolutions are C, 2C, 4C and 8C, respectively, and C is set to 40.
In the second step, the network forms feature maps C2, C3, C4 and C5 by connecting multi-resolution parallel convolutions and performing repeated information exchange between them, and fuses {C2, C3, C4, C5} to form the final feature maps {P2, P3, P4, P5}. The specific calculation formula is as follows:
P5 = Conv1×1(C5),
Pk = Conv3×3(Conv1×1(Ck) ⊕ Upsample(Pk+1)), k = 4, 3, 2,
P6 = Downsample(P5),
wherein Conv1×1 and Conv3×3 respectively represent a 1×1 convolutional layer and a 3×3 convolutional layer; Upsample represents bilinear upsampling followed by a 1×1 convolution operation; Downsample represents a 3×3 convolutional layer with stride 2; ⊕ represents the feature map addition operation. P6 is generated from P5 by one 3×3 convolution with stride 2, and the output channels of {P2, P3, P4, P5, P6} are all 256.
In the third step, anchors with areas of {32², 64², 128², 256², 512²} are set for {P2, P3, P4, P5, P6} respectively, and the aspect ratios of the ship candidate regions on each feature map level are set to {1:1, 1:2, 2:1}, giving 15 ship candidate region settings over the feature pyramid. Training positive and negative samples are assigned according to the IoU overlap between a ship candidate region and the corresponding label box: when IoU is greater than 0.7, the ship candidate region is a positive sample; when IoU is less than 0.3, it is a negative sample; and the total number of positive and negative samples in one image does not exceed 2000.
In step four, a multi-task cascade network is constructed to obtain the ship localization box and MASK. RoIAlign adjusts all ship candidate regions into fixed-size feature vectors: the feature vector of the classification and regression branch is 7×7 and that of the segmentation branch is 14×14. Using the idea of multi-task cascade detection, cascading and multi-task processing are combined at each stage to improve information flow, and spatial context is used to further improve accuracy. The whole network has 3 detection heads; CLS, BOX and MASK respectively denote the classification, bounding-box prediction and mask prediction branches. The IoU thresholds of the 3 detection heads are 0.5, 0.6 and 0.7 respectively, and the prediction of each stage is fed into the next stage to obtain high-quality predictions; the features of the current bounding box are obtained from the regressed bounding box of the previous stage through RoIAlign.
In the fourth step, the mask prediction branches of adjacent stages are connected to provide information flow between mask branches; the mask computation of two adjacent stages consists of four 3×3 convolutional layers and one deconvolution layer. The feature maps of the 5 FPN levels are first scaled to the same size for multi-scale feature fusion, features are then extracted through the four 3×3 convolutional layers, and fixed-size semantic features are obtained through a 1×1 convolution.
In the fourth step, for a single image, the multi-task loss function during training is defined as follows:
L = Σ_t (L_cls^t + L_box^t + L_mask^t) + L_seg, t = 1, 2, 3,
wherein L_cls, L_box and L_mask respectively represent the classification loss, localization box regression loss and mask prediction loss of stage t, with t taking values 1, 2 and 3, and L_seg represents the semantic segmentation loss. The classification loss is defined as
L_cls = −[p_i* log(p_i) + (1 − p_i*) log(1 − p_i)],
wherein i denotes the index of an anchor, p_i* denotes the label value of the ith anchor and p_i its predicted value; p_i* = 1 for a ship and p_i* = 0 for a non-ship. For the regression loss, define t_i = {t_x, t_y, t_w, t_h} as the predicted rectangular-box parameter values of the ship and t_i* as the label values of the rectangular box of the ship anchor; the four parameter values are computed as follows:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a,
t_w = log(w/w_a), t_h = log(h/h_a),
wherein x, y, w and h denote the center-point coordinates, width and height of the rectangular box, and the variables x, x_a and x* correspond respectively to the prediction box, the ship candidate region box and the label box (likewise for y, w and h). The regression loss function is defined as
L_box = Σ_i p_i* · smoothL1(t_i − t_i*), with smoothL1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise.
For the mask prediction branch, with the output resolution of each anchor set to an m×m binary mask map, the mask prediction loss function is defined as
L_mask = −(1/m²) Σ_i [m_i log(m̂_i) + (1 − m_i) log(1 − m̂_i)],
wherein m_i denotes the label value of the ith pixel and m̂_i denotes the sigmoid output of the ith pixel of the mask. For the semantic segmentation branch, with the semantic segmentation map output for each anchor denoted s and the label semantic segmentation map denoted s*, the semantic segmentation loss function is defined as
L_seg = −[s* log(s) + (1 − s*) log(1 − s)].
In step five, all detection boxes are sorted from high to low by score; candidate boxes with low mutual overlap and high scores are retained, and candidate boxes with high overlap and low scores are discarded.
The invention has the beneficial effects that it adds a multi-resolution parallel convolution module and a multi-task cascade detection module on top of deep neural network target detection and segmentation, effectively improves the accuracy of ship detection and segmentation in optical remote sensing images, and in particular detects small targets better.
Drawings
Fig. 1 is a ship detection flow chart.
Fig. 2 is a structure diagram of a feature extraction backbone network HRFPN.
FIG. 3 is a diagram of a multi-resolution set convolution structure.
Fig. 4 is a diagram of a multi-resolution convolution structure.
Fig. 5 is a diagram of a process of generating a ship candidate region (anchor) based on RPN.
Fig. 6 is a diagram of a multitasking cascade network architecture.
Fig. 7 is a diagram showing the final detection results.
Detailed Description
The invention will be further described with reference to fig. 1-7, without limiting the scope of the invention.
In the following description, for purposes of clarity, not all features of an actual implementation are described, and well-known functions or constructions are not described in detail, since they would obscure the invention with unnecessary detail. It should be understood that in the development of any actual embodiment, numerous implementation details must be set forth in order to achieve the developer's specific goals, such as compliance with system-related and business-related constraints that change from one implementation to another; such a development effort might be complex and time-consuming but would nevertheless be a routine undertaking for those of ordinary skill in the art.
An optical remote sensing image ship detection and segmentation method based on deep learning comprises the following steps:
Step one: reading in image data and preprocessing the image according to the transfer learning method
Transfer learning mainly means training a convolutional neural network on a large data set (such as ImageNet) first; once a certain feature extraction capability is achieved, other image training tasks no longer initialize network parameters randomly but use the trained model parameters as initial values for the feature extraction layers, followed by model fine-tuning. The method adopts the HRNetV2-W40 model trained on the ImageNet data set, which applies mean subtraction to the data during training; therefore, the same mean-subtraction preprocessing is applied to the images when the trained HRNetV2-W40 model is transferred to the ship detection and segmentation task.
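By way of illustration, the mean-subtraction preprocessing can be sketched in Python as follows; the per-channel values are the commonly used ImageNet RGB means and are an assumption, since the specification does not list them:

```python
import numpy as np

# Assumed per-channel ImageNet RGB means; the specification only requires
# that the same mean subtraction used during pre-training be reused here.
IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)

def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Subtract the pre-training channel means from an HxWx3 RGB image."""
    return image_rgb.astype(np.float32) - IMAGENET_MEAN
```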
Step two: constructing the multi-resolution parallel convolution backbone network HRFPN to extract an image feature map
The overall network comprises four stages: the resolution of the image is down-sampled to 1/4 of the original by two 3×3 convolutions with stride 2, which serves as the input to the first stage and is also the feature map resolution of stage 1; stages 2, 3 and 4 contain feature maps of 2, 3 and 4 resolutions, respectively.
Specifically, like the ResNet-50 structure, stage 1 contains 4 residual units, each consisting of a bottleneck module with 64 channels, and the number of channels of the feature map is then scaled to C by one 3×3 convolution.
In particular, stages 2, 3, 4 consist of 1, 4 and 3 repeated modular multi-resolution blocks, respectively. The multi-resolution block consists of a multi-resolution group convolution and a multi-resolution convolution.
Specifically, the structure diagram of the multi-resolution group convolution is shown in fig. 3, where each branch of the multi-resolution group convolution contains 4 residual units, and each unit contains 2 3 × 3 convolutions for each resolution.
Specifically, the structure of the multi-resolution convolution is shown in fig. 4. Its inputs and outputs are feature maps of different resolutions; to ensure that the resolution and channel number of the feature maps are consistent when they are fused, a high-resolution feature map is fused into a low-resolution one through i 3×3 convolutions with stride 2, and a low-resolution feature map is fused into a high-resolution one through j bilinear upsampling operations, where i and j take values in [1, 3] according to the feature map sizes.
Specifically, the feature map channels of 4 resolutions are C, 2C, 4C, and 8C, respectively, and C is set to 40.
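As a minimal sketch of the fusion rule described above (module name, branch indexing and the omission of batch normalization are illustrative assumptions, not taken from the specification), the high-to-low fusion by i stride-2 3×3 convolutions and the low-to-high fusion by bilinear upsampling plus a 1×1 convolution can be written in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionFuse(nn.Module):
    """Fuses branch r into branch s; the branch index increases as the
    resolution halves, so r < s is high-to-low and r > s is low-to-high."""
    def __init__(self, r: int, s: int, c_in: int, c_out: int):
        super().__init__()
        self.r, self.s = r, s
        if r < s:
            # i = s - r stride-2 3x3 convolutions (high -> low resolution)
            layers, c = [], c_in
            for k in range(s - r):
                nxt = c_out if k == s - r - 1 else c
                layers += [nn.Conv2d(c, nxt, 3, stride=2, padding=1), nn.ReLU()]
                c = nxt
            self.op = nn.Sequential(*layers)
        elif r > s:
            # bilinear upsampling by 2^(r-s), then a 1x1 convolution
            self.op = nn.Conv2d(c_in, c_out, 1)
        else:
            self.op = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.r > self.s:
            x = F.interpolate(x, scale_factor=2 ** (self.r - self.s),
                              mode="bilinear", align_corners=False)
        return self.op(x)

# Example: fuse the 2C branch (index 1) up into the C branch (index 0).
x = torch.randn(1, 80, 32, 32)
y = ResolutionFuse(r=1, s=0, c_in=80, c_out=40)(x)  # -> (1, 40, 64, 64)
```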
Specifically, the network forms the feature maps C2, C3, C4, and C5 by connecting multiple resolution (from high resolution to low resolution) parallel convolutions and performing repeated information exchange between the parallel convolutions.
Specifically, like the FPN feature pyramid network shown in fig. 2, the final feature maps {P2, P3, P4, P5} are formed by fusing {C2, C3, C4, C5}, and the specific calculation formula is as follows:
P5 = Conv1×1(C5),
Pk = Conv3×3(Conv1×1(Ck) ⊕ Upsample(Pk+1)), k = 4, 3, 2.
in particular, Conv1×1And Conv3×3Respectively represent a 1 × 1 convolutional layer and a 3 × 3 convolutional layer; upsamplie represents bilinear upsampling, followed by 1 × 1 convolution operation; down sample represents a 3 × 3 convolutional layer with a step size of 2;a signature addition operation is shown.
Specifically, P6 is generated from P5 by one 3×3 convolution with stride 2, i.e. P6 = Downsample(P5).
Specifically, the output channels of { P2, P3, P4, P5, P6} are all 256.
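A minimal PyTorch sketch of the P2-P6 construction under the formulas above follows; the class name and the example input sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidHead(nn.Module):
    """P5 = Conv1x1(C5); Pk = Conv3x3(Conv1x1(Ck) + Upsample(Pk+1));
    P6 = stride-2 3x3 conv on P5. Channel widths follow C = 40."""
    def __init__(self, in_channels=(40, 80, 160, 320), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1)
             for _ in in_channels[:-1]])
        self.down = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    @staticmethod
    def _up(x):
        return F.interpolate(x, scale_factor=2, mode="bilinear",
                             align_corners=False)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.smooth[2](self.lateral[2](c4) + self._up(p5))
        p3 = self.smooth[1](self.lateral[1](c3) + self._up(p4))
        p2 = self.smooth[0](self.lateral[0](c2) + self._up(p3))
        return p2, p3, p4, p5, self.down(p5)

sizes = (64, 32, 16, 8)   # strides 4..32 on a 256x256 input image
c2, c3, c4, c5 = [torch.randn(1, c, s, s)
                  for c, s in zip((40, 80, 160, 320), sizes)]
outs = PyramidHead()(c2, c3, c4, c5)   # five maps, each with 256 channels
```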
Step three: generation of ship candidate region based on RPN
Specifically, anchors with areas of {32², 64², 128², 256², 512²} are set for {P2, P3, P4, P5, P6} respectively; the aspect ratios of the anchors on each feature map level are set to {1:1, 1:2, 2:1}, giving 15 anchor settings over the feature pyramid.
Specifically, training positive and negative samples are assigned according to the IoU overlap between an anchor and the corresponding label box. When IoU is greater than 0.7, the anchor is a positive sample; when IoU is less than 0.3, the anchor is a negative sample; and the total number of positive and negative samples in one image does not exceed 2000.
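The IoU-based sample assignment can be sketched as follows (an illustrative helper, not the patented implementation; the ignore label -1 for anchors between the two thresholds is an assumption):

```python
import torch
from torchvision.ops import box_iou

def assign_anchors(anchors: torch.Tensor, gt_boxes: torch.Tensor,
                   pos_thr: float = 0.7, neg_thr: float = 0.3) -> torch.Tensor:
    """Label (x1, y1, x2, y2) anchors: 1 = positive (IoU > 0.7),
    0 = negative (IoU < 0.3), -1 = ignored (in between)."""
    best_iou = box_iou(anchors, gt_boxes).max(dim=1).values  # best IoU per anchor
    labels = torch.full((anchors.shape[0],), -1, dtype=torch.long)
    labels[best_iou > pos_thr] = 1
    labels[best_iou < neg_thr] = 0
    return labels

anchors = torch.tensor([[0., 0., 32., 32.], [100., 100., 164., 164.]])
gt = torch.tensor([[2., 2., 34., 34.]])
print(assign_anchors(anchors, gt))  # tensor([1, 0])
```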
Step four: building a multi-task cascade network to obtain the ship localization box and mask
Specifically, RoIAlign is applied to adjust all anchors into fixed-size feature vectors; the feature vector of the classification and regression branch is 7×7, and that of the segmentation branch is 14×14.
Specifically, as shown in fig. 6, by using the idea of multitask cascade detection, the information flow is improved by combining cascade and multitask processing at each stage, and the accuracy is further improved by using the spatial context.
Specifically, the multi-task cascade network is characterized by: (a) alternating use of target localization box regression predictions; (b) feeding the mask features of the previous stage to the mask branch of the current stage and introducing a direct path to strengthen the information flow between mask branches; and (c) adding an additional semantic segmentation branch and fusing it with the box and mask branches to exploit more context information.
Specifically, as shown in fig. 6, the whole network has 3 detection heads, and CLS, BOX and MASK represent classification, bounding BOX prediction and MASK prediction branches, respectively. The IoU threshold values of the 3 detection heads are 0.5, 0.6 and 0.7 respectively, and the prediction of each stage is input into the next stage to obtain a high-quality prediction result.
Specifically, the features of the current bounding box are obtained from the regressed bounding box of the previous stage through RoIAlign.
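A sketch of the RoIAlign pooling into the 7×7 and 14×14 feature vectors, using torchvision's roi_align; the spatial_scale of 1/4 corresponds to the P2 level and is an assumption:

```python
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 256, 200, 200)              # one FPN level, 256 channels
rois = torch.tensor([[0., 40., 60., 120., 100.]])  # (batch_idx, x1, y1, x2, y2)

# spatial_scale maps image coordinates onto this level (1/4 assumed for P2).
cls_box_feat = roi_align(feats, rois, output_size=(7, 7),
                         spatial_scale=0.25, sampling_ratio=2)   # 7x7 branch
mask_feat = roi_align(feats, rois, output_size=(14, 14),
                      spatial_scale=0.25, sampling_ratio=2)      # 14x14 branch
```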
Specifically, connections are made between the mask prediction branches of adjacent stages, providing information flow for the mask branches. The mask computation of two adjacent stages consists of four 3×3 convolutional layers and one deconvolution layer.
Specifically, semantic segmentation is introduced into the multi-task cascade network because it performs fine pixel-level classification of the whole image and thereby provides strong spatial position information. The feature maps of the 5 FPN levels are first scaled to the same size for multi-scale feature fusion, features are then extracted through four 3×3 convolutional layers, and fixed-size semantic features are obtained through a 1×1 convolution.
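A minimal sketch of such a semantic segmentation branch follows (the class name, ReLU activations and the two-class output are assumptions not fixed by the specification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBranch(nn.Module):
    """Fuse the 5 FPN levels at a common size, extract features with four
    3x3 convolutions, then reduce with a 1x1 convolution."""
    def __init__(self, channels=256, num_classes=2):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(4)])
        self.out = nn.Conv2d(channels, num_classes, 1)

    def forward(self, pyramid):           # list of (N, 256, Hi, Wi) maps
        size = pyramid[0].shape[-2:]      # rescale everything to the P2 size
        fused = sum(F.interpolate(p, size=size, mode="bilinear",
                                  align_corners=False) for p in pyramid)
        return self.out(self.convs(fused))
```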
Specifically, for a single image, the multi-task loss function during training is defined as follows:
L = Σ_t (L_cls^t + L_box^t + L_mask^t) + L_seg, t = 1, 2, 3,
wherein L_cls, L_box and L_mask respectively represent the classification loss, localization box regression loss and mask prediction loss of stage t, with t taking values 1, 2 and 3; L_seg represents the semantic segmentation loss.
Specifically, the classification loss is defined as follows:
L_cls = −[p_i* log(p_i) + (1 − p_i*) log(1 − p_i)],
wherein i denotes the index of an anchor, p_i* denotes the label value of the ith anchor, and p_i denotes its predicted value; p_i* = 1 for a ship and p_i* = 0 for a non-ship.
Specifically, for the regression loss, define t_i = {t_x, t_y, t_w, t_h} as the predicted rectangular-box parameter values of the ship and t_i* as the label values of the rectangular box of the ship anchor; the four parameter values are computed as follows:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a,
t_w = log(w/w_a), t_h = log(h/h_a),
wherein x, y, w and h respectively denote the center-point coordinates, width and height of the rectangular box; the variables x, x_a and x* correspond respectively to the prediction box, the anchor box and the label box (likewise for y, w and h). The regression loss function is defined as follows:
L_box = Σ_i p_i* · smoothL1(t_i − t_i*), with smoothL1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise.
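A small helper illustrating the (t_x, t_y, t_w, t_h) encoding just defined (the function name and tensor layout are assumptions); the label deltas t_i* are obtained the same way with the label box in place of the prediction box:

```python
import torch

def encode_deltas(box: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
    """(tx, ty, tw, th) of `box` relative to `anchor`; both are given as
    (cx, cy, w, h) tensors, matching the formulas above."""
    tx = (box[0] - anchor[0]) / anchor[2]
    ty = (box[1] - anchor[1]) / anchor[3]
    tw = torch.log(box[2] / anchor[2])
    th = torch.log(box[3] / anchor[3])
    return torch.stack([tx, ty, tw, th])

pred = torch.tensor([50., 50., 40., 20.])   # prediction box (x, y, w, h)
anc = torch.tensor([48., 52., 32., 32.])    # ship candidate region box
print(encode_deltas(pred, anc))             # the t_i vector
```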
specifically, for the mask prediction branch, setting the output resolution of each anchor as an m × m binary mask map, the mask prediction loss function is defined as:wherein m isiRepresenting the confidence with which the object is predicted to be the target,and representing the output of each pixel in the ith mask after sigmoid.
Specifically, for the semantic segmentation branch, with the semantic segmentation map output for each anchor denoted s and the label semantic segmentation map denoted s*, the semantic segmentation loss function is defined as:
L_seg = −[s* log(s) + (1 − s*) log(1 − s)].
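Putting the pieces together, a one-stage slice of the multi-task loss might look like the following sketch; treating L_cls, L_mask and L_seg as binary cross-entropy and L_box as smooth L1 matches the definitions above, while the unweighted sum is an assumption:

```python
import torch.nn.functional as F

def stage_loss(cls_logit, cls_label, box_pred, box_target,
               mask_logit, mask_target):
    """Loss of one cascade stage: BCE classification, smooth-L1 box
    regression and per-pixel BCE mask loss (all unweighted, an assumption)."""
    return (F.binary_cross_entropy_with_logits(cls_logit, cls_label)
            + F.smooth_l1_loss(box_pred, box_target)
            + F.binary_cross_entropy_with_logits(mask_logit, mask_target))

def total_loss(stage_outputs, seg_logit, seg_target):
    """Sum the three stage losses (t = 1, 2, 3) and add L_seg once."""
    l = sum(stage_loss(*s) for s in stage_outputs)
    return l + F.binary_cross_entropy_with_logits(seg_logit, seg_target)
```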
Step five: obtaining refined detection results with the NMS method
Non-maximum suppression (NMS) specifically means: all detection boxes are sorted from high to low by score; candidate boxes with low mutual overlap and high scores are retained, and candidate boxes with high overlap and low scores are discarded.
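In code, this step reduces to a call to an off-the-shelf NMS routine; the IoU threshold of 0.5 is an assumed value, since the specification does not fix it:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.6, 0.8])

# Keep high-scoring boxes and drop boxes that overlap a kept box above
# the IoU threshold (0.5 here is an assumption).
keep = nms(boxes, scores, iou_threshold=0.5)
refined = boxes[keep]
```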
The final test results are shown in fig. 7.
The optical remote sensing image ship target detection and segmentation results of the invention and of the prior-art Mask R-CNN (feature extraction backbone ResNet-101) are evaluated with two indexes, accuracy (AP) and recall (AR), calculated with the following formulas.
The accuracy AP is the number of correctly detected targets divided by the total number of detected targets.
The recall AR is the number of correctly detected targets divided by the total number of actual targets.
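These two definitions translate directly into code (an illustrative sketch; the function names are assumptions):

```python
def accuracy_ap(num_correct: int, num_detections: int) -> float:
    """AP: correctly detected targets / total detected targets."""
    return num_correct / num_detections

def recall_ar(num_correct: int, num_actual: int) -> float:
    """AR: correctly detected targets / total actual targets."""
    return num_correct / num_actual
```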
Specifically, the data set adopted in the experiment is the Airbus-ship data set of the Kaggle optical remote sensing ship detection competition, which contains 42615 images with ship positions and mask labels; according to the number of ship instances, the data set is divided into training, validation and test sets in the ratio 8:1:1.
Specifically, the parameter settings during training of the two networks are kept consistent: the initial learning rate is set to 0.001, the total number of training epochs is 24, the learning rate is multiplied by 0.1 at epochs 16 and 22, and the whole model is optimized with stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001.
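The stated schedule maps directly onto a standard SGD optimizer with a multi-step learning-rate schedule; the stand-in model below is an assumption:

```python
import torch

model = torch.nn.Linear(10, 2)   # stand-in for the detection network

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)
# Multiply the learning rate by 0.1 at epochs 16 and 22; 24 epochs total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[16, 22], gamma=0.1)

for epoch in range(24):
    # ... one training epoch over the Airbus-ship training split ...
    scheduler.step()
```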
Specifically, the operating system used in the experiment is Ubuntu 18.04, a single RTX 2080 Ti GPU is used for training and testing, and the deep learning framework is PyTorch 1.5.0.
The ship detection accuracy and recall indexes of the invention and of the prior-art Mask R-CNN are listed in Table 1.
Table 1 Summary of simulation test results
Test index | Mask R-CNN | The method of the invention
---|---|---
AP | 70.0% | 80.7%
AR | 71.6% | 82.2%
It can be seen from Table 1 that the AP and AR values of the existing Mask R-CNN are 70.0% and 71.6% respectively, while those of the method of the invention are 80.7% and 82.2% respectively; the ship target detection results of the simulation experiment of the invention are better.
The ship segmentation accuracy and recall indexes of the invention and of the prior-art Mask R-CNN are listed in Table 2.
Table 2 Summary of simulation test results
Test index | Mask R-CNN | The method of the invention
---|---|---
AP | 64.1% | 78.2%
AR | 67.1% | 80.8%
It can be seen from Table 2 that the AP and AR values of the existing Mask R-CNN are 64.1% and 67.1% respectively, while those of the method of the invention are 78.2% and 80.8% respectively; the ship segmentation results of the simulation experiment of the invention are better.
Considering that the size of ship targets in remote sensing varies greatly, AP and AR are subdivided into AP_L, AP_M, AP_S and AR_L, AR_M, AR_S, where AP_L, AP_M and AP_S are the accuracies for large, medium and small targets and AR_L, AR_M and AR_S are the recalls for large, medium and small targets. Specifically, a large target is larger than 96×96 pixels, a medium target is between 32×32 and 96×96 pixels, and a small target is smaller than 32×32 pixels.
Table 3 lists the ship detection AP_L, AP_M, AP_S, AR_L, AR_M and AR_S indexes of the invention and of the prior-art Mask R-CNN.
Table 3 Summary of simulation test results
Test index | Mask R-CNN | The method of the invention
---|---|---
AP_L | 94.8% | 96.0%
AP_M | 95.9% | 97.5%
AP_S | 56.7% | 71.9%
AR_L | 97.4% | 97.4%
AR_M | 97.2% | 98.5%
AR_S | 58.4% | 73.8%
From Table 3 it can be seen that the AP_S and AR_S values of the existing Mask R-CNN are 56.7% and 58.4% respectively, while those of the method of the invention are 71.9% and 73.8% respectively; compared with Mask R-CNN, the invention detects small targets better.
Table 4 lists the ship segmentation AP_L, AP_M, AP_S, AR_L, AR_M and AR_S indexes of the invention and of the prior-art Mask R-CNN.
Table 4 Summary of simulation test results
Test index | Mask R-CNN | The method of the invention
---|---|---
AP_L | 85.0% | 91.1%
AP_M | 84.3% | 94.0%
AP_S | 48.6% | 70.4%
AR_L | 88.5% | 94.6%
AR_M | 87.0% | 95.6%
AR_S | 51.7% | 73.1%
From Table 4 it can be seen that the AP_S and AR_S values of the existing Mask R-CNN are 48.6% and 51.7% respectively, while those of the method of the invention are 70.4% and 73.1% respectively; compared with Mask R-CNN, the invention segments small targets better.
In conclusion, the multi-resolution parallel convolution module and the multi-task cascade detection module are added on the basis of deep neural network target detection and segmentation, so that the accuracy of optical remote sensing image ship detection and segmentation is effectively improved, and particularly, the method has better detection capability on small targets.
Although the invention has been described and illustrated in some detail, it should be understood that various modifications may be made to the described embodiments or equivalents may be substituted, as will be apparent to those skilled in the art, without departing from the spirit of the invention.
Claims (10)
1. An optical remote sensing image ship detection and segmentation method based on deep learning is characterized in that: which comprises the following steps:
step one, reading in image data, and preprocessing the image according to a transfer learning method;
step two, constructing a multi-resolution parallel convolution backbone network HRFPN to extract an image feature map;
step three, generating ship candidate regions based on the RPN;
step four, using the idea of multi-task cascade detection, adding a semantic segmentation branch, obtaining the classification probability value, localization box and mask of the ship, and calculating the loss function;
and step five, obtaining a refined detection result by utilizing an NMS method.
2. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 1, wherein: in the first step, model parameters obtained by training a convolutional neural network on a large data set are used as initial values of the feature extraction layers of the network, and model fine-tuning is then carried out.
3. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 1, wherein: in the first step, the HRNetV2-W40 model obtained by training on the ImageNet data set applies mean subtraction during training, and when the trained HRNetV2-W40 model is transferred to the ship detection and segmentation task, the same mean-subtraction preprocessing is applied to the images.
4. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 1, wherein: the overall network in the second step comprises four stages; the resolution of the image is down-sampled to 1/4 of the original by two 3×3 convolutions with stride 2 as input to the first stage, which is also the feature map resolution of stage 1; stages 2, 3 and 4 contain feature maps of 2, 3 and 4 resolutions, respectively; stage 1 contains 4 residual units, each consisting of a bottleneck module with 64 channels, after which the number of channels of the feature map is scaled to C by one 3×3 convolution; stages 2, 3 and 4 consist of 1, 4 and 3 repeated modular multi-resolution blocks, respectively; a multi-resolution block consists of a multi-resolution group convolution and a multi-resolution convolution; each branch of the multi-resolution group convolution contains 4 residual units, and for each resolution each unit contains two 3×3 convolutions; the inputs and outputs of the multi-resolution convolution are feature maps of different resolutions, and to ensure that the resolution and channel number of the feature maps are consistent during fusion, a high-resolution feature map is fused into a low-resolution one through i 3×3 convolutions with stride 2 and a low-resolution feature map is fused into a high-resolution one through j bilinear upsampling operations, where i and j take values in [1, 3] according to the feature map sizes; the channel numbers of the feature maps at the 4 resolutions are C, 2C, 4C and 8C respectively, with C set to 40.
5. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 4, wherein: in the second step, the network forms feature maps C2, C3, C4 and C5 by connecting multi-resolution parallel convolutions and performing repeated information exchange between them, and fuses {C2, C3, C4, C5} to form the final feature maps {P2, P3, P4, P5} according to the following calculation formula:
P5 = Conv1×1(C5),
Pk = Conv3×3(Conv1×1(Ck) ⊕ Upsample(Pk+1)), k = 4, 3, 2,
wherein Conv1×1 and Conv3×3 respectively represent a 1×1 convolutional layer and a 3×3 convolutional layer; Upsample represents bilinear upsampling followed by a 1×1 convolution operation; Downsample represents a 3×3 convolutional layer with stride 2; ⊕ represents the feature map addition operation; P6 is generated from P5 by one 3×3 convolution with stride 2; and the output channels of {P2, P3, P4, P5, P6} are all 256.
6. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 1, wherein: in the third step, anchors with areas of {32², 64², 128², 256², 512²} are set for {P2, P3, P4, P5, P6} respectively; the aspect ratios of the ship candidate regions on each feature map level are set to {1:1, 1:2, 2:1}, giving 15 ship candidate region settings over the feature pyramid; training positive and negative samples are assigned according to the IoU overlap between a ship candidate region and the corresponding label box; when IoU is greater than 0.7, the ship candidate region is a positive sample; when IoU is less than 0.3, the ship candidate region is a negative sample; and the total number of positive and negative samples in one image does not exceed 2000.
7. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 1, wherein: in step four, a multi-task cascade network is constructed to obtain the ship localization box and MASK; RoIAlign adjusts all ship candidate regions into fixed-size feature vectors, with the feature vector of the classification and regression branch being 7×7 and that of the segmentation branch being 14×14; using the idea of multi-task cascade detection, cascading and multi-task processing are combined at each stage to improve information flow, and spatial context is used to further improve accuracy; the whole network has 3 detection heads, with CLS, BOX and MASK respectively denoting the classification, bounding-box prediction and mask prediction branches; the IoU thresholds of the 3 detection heads are 0.5, 0.6 and 0.7 respectively; the prediction of each stage is fed into the next stage to obtain high-quality predictions; and the features of the current bounding box are obtained from the regressed bounding box of the previous stage through RoIAlign.
8. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 7, wherein: in the fourth step, the mask prediction branches of adjacent stages are connected to provide information flow between mask branches; the mask computation of two adjacent stages consists of four 3×3 convolutional layers and one deconvolution layer; the feature maps of the 5 FPN levels are first scaled to the same size for multi-scale feature fusion, features are then extracted through the four 3×3 convolutional layers, and fixed-size semantic features are obtained through a 1×1 convolution.
9. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 8, wherein: in the fourth step, for a single image, the multi-task loss function during training is defined as follows:
L = Σ_t (L_cls^t + L_box^t + L_mask^t) + L_seg, t = 1, 2, 3,
wherein L_cls, L_box and L_mask respectively represent the classification loss, localization box regression loss and mask prediction loss of stage t, with t taking values 1, 2 and 3, and L_seg represents the semantic segmentation loss; the classification loss is defined as
L_cls = −[p_i* log(p_i) + (1 − p_i*) log(1 − p_i)],
wherein i denotes the index of an anchor, p_i* denotes the label value of the ith anchor and p_i its predicted value, with p_i* = 1 for a ship and p_i* = 0 for a non-ship; for the regression loss, t_i = {t_x, t_y, t_w, t_h} is defined as the predicted rectangular-box parameter values of the ship and t_i* as the label values of the rectangular box of the ship anchor, the four parameter values being computed as:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a,
t_w = log(w/w_a), t_h = log(h/h_a),
wherein x, y, w and h denote the center-point coordinates, width and height of the rectangular box, and the variables x, x_a and x* correspond respectively to the prediction box, the ship candidate region box and the label box (likewise for y, w and h); the regression loss function is defined as
L_box = Σ_i p_i* · smoothL1(t_i − t_i*), with smoothL1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise;
for the mask prediction branch, with the output resolution of each anchor set to an m×m binary mask map, the mask prediction loss function is defined as
L_mask = −(1/m²) Σ_i [m_i log(m̂_i) + (1 − m_i) log(1 − m̂_i)],
wherein m_i denotes the label value of the ith pixel and m̂_i denotes the sigmoid output of the ith pixel of the mask; and for the semantic segmentation branch, with the semantic segmentation map output for each anchor denoted s and the label semantic segmentation map denoted s*, the semantic segmentation loss function is defined as
L_seg = −[s* log(s) + (1 − s*) log(1 − s)].
10. The optical remote sensing image ship detection and segmentation method based on deep learning of claim 1, wherein: in step five, all detection boxes are sorted from high to low by score, candidate boxes with low mutual overlap and high scores are retained, and candidate boxes with high overlap and low scores are discarded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011080445.7A CN112507777A (en) | 2020-10-10 | 2020-10-10 | Optical remote sensing image ship detection and segmentation method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011080445.7A CN112507777A (en) | 2020-10-10 | 2020-10-10 | Optical remote sensing image ship detection and segmentation method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112507777A true CN112507777A (en) | 2021-03-16 |
Family
ID=74954106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011080445.7A Pending CN112507777A (en) | 2020-10-10 | 2020-10-10 | Optical remote sensing image ship detection and segmentation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112507777A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491854A (en) * | 2018-02-05 | 2018-09-04 | 西安电子科技大学 | Remote sensing image object detection method based on SF-RCNN |
CN108960143A (en) * | 2018-07-04 | 2018-12-07 | 北京航空航天大学 | Detect deep learning method in a kind of naval vessel in High Resolution Visible Light remote sensing images |
CN109711295A (en) * | 2018-12-14 | 2019-05-03 | 北京航空航天大学 | A kind of remote sensing image offshore Ship Detection |
CN111461127A (en) * | 2020-03-30 | 2020-07-28 | 华南理工大学 | Example segmentation method based on one-stage target detection framework |
CN111723748A (en) * | 2020-06-22 | 2020-09-29 | 电子科技大学 | Infrared remote sensing image ship detection method |
Non-Patent Citations (1)
Title |
---|
Su Hao, "Remote Sensing Image Target Detection Method Based on Deep Learning", CNKI Master's Thesis, 15 July 2020 (2020-07-15), pages 1-107
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239953A (en) * | 2021-03-30 | 2021-08-10 | 西安电子科技大学 | SAR image rotating ship detection method based on directed Gaussian function |
CN113239953B (en) * | 2021-03-30 | 2024-02-09 | 西安电子科技大学 | SAR image rotation ship detection method based on directed Gaussian function |
CN113111885A (en) * | 2021-04-14 | 2021-07-13 | 清华大学深圳国际研究生院 | Dynamic resolution instance segmentation method and computer readable storage medium |
CN113160246A (en) * | 2021-04-14 | 2021-07-23 | 中国科学院光电技术研究所 | Image semantic segmentation method based on depth supervision |
CN113269734A (en) * | 2021-05-14 | 2021-08-17 | 成都市第三人民医院 | Tumor image detection method and device based on meta-learning feature fusion strategy |
CN113312998A (en) * | 2021-05-19 | 2021-08-27 | 中山大学·深圳 | SAR image target identification method and device based on high-resolution network and storage medium |
CN113505634A (en) * | 2021-05-24 | 2021-10-15 | 安徽大学 | Double-flow decoding cross-task interaction network optical remote sensing image salient target detection method |
CN113436148A (en) * | 2021-06-02 | 2021-09-24 | 范加利 | Method and system for detecting critical points of ship-borne airplane wheel contour based on deep learning |
CN113420641A (en) * | 2021-06-21 | 2021-09-21 | 梅卡曼德(北京)机器人科技有限公司 | Image data processing method, image data processing device, electronic equipment and storage medium |
CN113378742A (en) * | 2021-06-21 | 2021-09-10 | 梅卡曼德(北京)机器人科技有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN113468991A (en) * | 2021-06-21 | 2021-10-01 | 沈阳工业大学 | Parking space detection method based on panoramic video |
CN113420641B (en) * | 2021-06-21 | 2024-06-14 | 梅卡曼德(北京)机器人科技有限公司 | Image data processing method, device, electronic equipment and storage medium |
CN113468991B (en) * | 2021-06-21 | 2024-03-05 | 沈阳工业大学 | Parking space detection method based on panoramic video |
CN113343883A (en) * | 2021-06-22 | 2021-09-03 | 长光卫星技术有限公司 | Port ore pile segmentation method based on improved HRNetV2 network |
CN113343883B (en) * | 2021-06-22 | 2022-06-07 | 长光卫星技术股份有限公司 | Port ore pile segmentation method based on improved HRNetV2 network |
CN113256500A (en) * | 2021-07-02 | 2021-08-13 | 北京大学第三医院(北京大学第三临床医学院) | Deep learning neural network model system for multi-modal image synthesis |
CN114092364B (en) * | 2021-08-12 | 2023-10-03 | 荣耀终端有限公司 | Image processing method and related device |
CN114092364A (en) * | 2021-08-12 | 2022-02-25 | 荣耀终端有限公司 | Image processing method and related device |
CN114155247A (en) * | 2021-08-26 | 2022-03-08 | 航天恒星科技有限公司 | Training method and device for high-resolution remote sensing image instance segmentation model |
CN114155247B (en) * | 2021-08-26 | 2024-07-12 | 航天恒星科技有限公司 | Training method and device for high-resolution remote sensing image instance segmentation model |
CN113628208A (en) * | 2021-08-30 | 2021-11-09 | 北京中星天视科技有限公司 | Ship detection method, device, electronic equipment and computer readable medium |
CN113628208B (en) * | 2021-08-30 | 2024-02-06 | 北京中星天视科技有限公司 | Ship detection method, device, electronic equipment and computer readable medium |
CN113762204B (en) * | 2021-09-17 | 2023-05-12 | 中国人民解放军国防科技大学 | Multidirectional remote sensing target detection method and device and computer equipment |
CN113762204A (en) * | 2021-09-17 | 2021-12-07 | 中国人民解放军国防科技大学 | Multi-direction remote sensing target detection method and device and computer equipment |
CN113850783B (en) * | 2021-09-27 | 2022-08-30 | 清华大学深圳国际研究生院 | Sea surface ship detection method and system |
CN113850783A (en) * | 2021-09-27 | 2021-12-28 | 清华大学深圳国际研究生院 | Sea surface ship detection method and system |
CN113870286A (en) * | 2021-09-30 | 2021-12-31 | 重庆理工大学 | Foreground segmentation method based on multi-level feature and mask fusion |
CN113989665B (en) * | 2021-10-25 | 2023-04-07 | 电子科技大学 | SAR ship detection method based on route aggregation sensing FPN |
CN113989665A (en) * | 2021-10-25 | 2022-01-28 | 电子科技大学 | SAR ship detection method based on route aggregation sensing FPN |
CN114387492A (en) * | 2021-11-19 | 2022-04-22 | 西北工业大学 | Near-shore surface area ship detection method and device based on deep learning |
CN114219989A (en) * | 2021-11-25 | 2022-03-22 | 哈尔滨工程大学 | Foggy scene ship instance segmentation method based on interference suppression and dynamic contour |
CN114612769A (en) * | 2022-03-14 | 2022-06-10 | 电子科技大学 | Integrated sensing infrared imaging ship detection method integrated with local structure information |
CN115272242A (en) * | 2022-07-29 | 2022-11-01 | 西安电子科技大学 | YOLOv 5-based optical remote sensing image target detection method |
CN115272242B (en) * | 2022-07-29 | 2024-02-27 | 西安电子科技大学 | YOLOv 5-based optical remote sensing image target detection method |
CN116030351A (en) * | 2023-03-28 | 2023-04-28 | 南京信息工程大学 | Cascade network-based aerial image ship segmentation method |
CN117876884A (en) * | 2024-01-09 | 2024-04-12 | 中国科学院自动化研究所 | High-resolution visible light ship detection method and system guided by saliency information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |