CN115496998A - Remote sensing image wharf target detection method


Info

Publication number
CN115496998A
Authority
CN
China
Prior art keywords
target
wharf
detection
module
remote sensing
Prior art date
Legal status
Pending
Application number
CN202210722252.XA
Other languages
Chinese (zh)
Inventor
郭海涛
卢俊
龚志辉
阎晓东
张衡
林雨准
刘相云
高慧
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202210722252.XA
Publication of CN115496998A


Classifications

    • G06V 20/10: Scenes; scene-specific elements (terrestrial scenes)
    • G06V 10/454: Local feature extraction, integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/54: Surveillance or monitoring of activities, e.g. of traffic such as cars on the road, trains or boats
    • G06V 2201/07: Target detection


Abstract

The invention relates to a wharf target detection method based on remote sensing images and belongs to the technical field of remote sensing image processing. Taking the YOLOv4 horizontal-box detection algorithm as a basis, a PSA (Pyramid Split Attention) module is added to the feature fusion network; the PSA module obtains receptive fields of different scales and extracts channel attention weights at each scale, so that context information of different scales is fused and the detection precision of the network is further improved. Meanwhile, the wharf target is calibrated with a multi-dimensional corner coordinate detection box, which effectively represents the true boundary of the wharf target, solves the problem of arbitrary wharf orientation, further improves the positioning precision of the target, and achieves accurate wharf target detection.

Description

Remote sensing image wharf target detection method
Technical Field
The invention relates to a wharf target detection method based on remote sensing images, and belongs to the technical field of remote sensing image processing.
Background
With the rapid development of remote sensing technology, detecting offshore ocean targets in remote sensing images has gradually become a research hotspot. As a typical offshore target, the wharf provides an important basis for sea battlefield environment construction and marine economic development through its monitoring and extraction. However, wharf targets in remote sensing images have arbitrary orientations and varied sizes, and are affected by surrounding ships, artificial ground objects and other environmental factors, so accurately locating them is challenging. Traditional wharf target detection methods include edge detection, object-oriented wharf identification and feature-based port detection; these methods mainly exploit the waterline information of the coastal zone or shape features such as the length and width of the wharf, and detect the wharf through false-alarm removal and target confirmation. However, these methods are strongly influenced by subjective factors, and it is difficult for them to accurately extract wharf targets in the presence of interference such as clouds, ships and sea waves.
In recent years, deep learning, especially Convolutional Neural Networks (CNNs), has achieved great success in computer vision, and target detection methods based on deep learning have received wide attention. Two-stage detection methods such as R-FCN and Faster R-CNN mainly comprise a region proposal stage and a region classification stage: a two-stage detection network first generates a series of candidate boxes with a selective search algorithm or a Region Proposal Network (RPN) and then performs target classification and bounding-box regression. Unlike two-stage networks, single-stage detection networks such as SSD and YOLO perform classification and localization simultaneously and therefore have a speed advantage. In addition, YOLOv3 and YOLOv4 improve the YOLO network with strategies such as stronger backbone networks and multi-scale fusion, effectively increasing both the speed and the precision of the model and laying a foundation for remote sensing image target detection with deep learning networks.
These advanced target detection methods generally describe the detected target with a horizontal rectangular box, which suits natural scene images but cannot meet the requirements of wharf target detection in remote sensing images. In a remote sensing image, a wharf target generally has a large aspect ratio and a definite orientation; a horizontal detection box therefore contains redundant information and cannot locate the target accurately. In addition, wharves are usually scattered in complex scenes, and the noise contained in the redundant area of a horizontal detection box interferes with feature extraction, which greatly degrades the wharf detection result.
Disclosure of Invention
The invention aims to provide a remote sensing image wharf target detection method, so as to solve the poor positioning accuracy of current horizontal-box detection of wharves in remote sensing images.
To solve the above technical problem, the invention provides a remote sensing image wharf target detection method comprising the following steps:
1) Acquiring a remote sensing image to be detected;
2) Inputting the remote sensing image to be detected into the target detection model to output a wharf target detection result;
the target detection model comprises a backbone network, a feature fusion network and a prediction layer;
the main network is used for extracting the features of the input remote sensing image to obtain feature maps with different sizes;
the feature fusion network comprises an SPP module, a PSA attention module and a feature pyramid module, wherein the SPP module is used for performing maximum pooling operation on features output by the backbone network and fusing feature maps with different scales; the PSA attention module is used for segmenting the output result of the SPP module and extracting scale features and channel attention weighting to obtain a feature map with multi-scale features and attention weighting; the characteristic pyramid module is used for repeatedly extracting and fusing characteristics of a plurality of obtained characteristic graphs with multi-scale characteristics and attention weighting to obtain characteristics with accurate target position information and high-level semantic information;
the prediction layer positions a target to be detected by using the characteristics of establishing a multi-dimensional corner coordinate detection frame, obtaining position information with accurate target and high-level semantic information, wherein the multi-dimensional corner coordinates comprise coordinates of four corners of the detection frame.
Taking the YOLOv4 horizontal-box detection algorithm as a basis, the invention adds a PSA attention module to the feature fusion network; the PSA attention module obtains receptive fields of different scales and extracts channel attention weights at each scale, so that context information of different scales is fused and the detection precision of the network is further improved. Meanwhile, the wharf target is calibrated with a multi-dimensional corner coordinate detection box, which effectively represents the true boundary of the wharf target, solves the problem of arbitrary wharf orientation, further improves the positioning precision of the target, and achieves accurate wharf target detection.
Further, the PSA attention module comprises an SPC module, an SE module and an output module. The SPC module is used for splitting the feature map output by the SPP module into several parts along the channel dimension, extracting features of each part at a different scale to obtain the feature vectors of each part, and generating the feature map of the corresponding channels. The SE module is used for extracting the attention vector of the corresponding channels from each channel feature map and recalibrating the channel attention vectors to obtain the attention weights after interaction of the corresponding multi-scale channels. The output module is used for weighting and fusing the corresponding channel feature maps according to the obtained attention weights to obtain feature maps with multi-scale features and attention weighting.
The SPC module is first used to obtain receptive fields of different scales so as to better extract multi-scale information of the image; the SE module then extracts the attention weight of each channel, giving attention weights for channels of different scales, so that the PSA module can fuse context information of different scales.
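As one possible illustration of the channel-attention step, a standard squeeze-and-excitation (SE) weighting block could be sketched as follows; the reduction ratio of 16 and the exact layer layout are assumptions for the example, not values taken from the invention.

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Squeeze-and-excitation: one attention weight per channel of a feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            nn.Sigmoid(),                    # excitation: per-channel weight in (0, 1)
        )

    def forward(self, x):                    # x: (B, C, H, W)
        return self.fc(self.pool(x))         # -> (B, C, 1, 1) channel attention weights

weights = SEWeight(64)(torch.randn(1, 64, 15, 15))
```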
Further, the SPC module performs feature extraction at a different scale on each part using grouped convolution, where the group size of each branch is related to its convolution kernel size; the relationship between the convolution kernel size k and the group size G is

$G = 2^{\frac{k-1}{2}}$

where G is the group size and k is the convolution kernel size.
Because continually increasing the convolution kernel size would introduce a huge number of parameters, the invention ties the group size used in the grouped convolution of each split part to its defined convolution kernel size k.
Further, the multi-dimensional corner coordinate loss function adopted by the target detection model during training is

$L = L_{Pre} + L_{Conf} + L_{Cls}$

$L_{Pre} = \sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\Pr(\mathrm{object})\sum_{k=1}^{4}\left[\mathrm{SmoothL1}\!\left(x_k - x_k^{gt}\right) + \mathrm{SmoothL1}\!\left(y_k - y_k^{gt}\right)\right]$

$L_{Conf} = -\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\left[c^{gt}\log c + \left(1 - c^{gt}\right)\log\!\left(1 - c\right)\right]$

$L_{Cls} = -\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\Pr(\mathrm{object})\sum p^{gt}\log p$

where $S^2$ represents the number of grid cells into which the input remote sensing image is divided, B represents the number of prior boxes in each grid cell, and Pr(object) indicates whether the current prior box contains a target, taking the value 1 when a target is contained and 0 otherwise; x, y and $x^{gt}$, $y^{gt}$ respectively denote the corner coordinates of the detection box and of the ground-truth box; c and $c^{gt}$ respectively denote the predicted confidence and the true confidence; p and $p^{gt}$ respectively denote the predicted class probability and the true class probability.
In order to mark a detection box that fits a directional object more closely, the invention labels the target with multi-dimensional corner coordinates, so a multi-dimensional corner coordinate loss function is adopted during model training to optimize the detection result of the corner coordinate detection box.
Further, the backbone network adopts a CSPDarknet53 network.
The invention selects CSPDarkNet53 as the backbone network; it combines the DarkNet53 backbone of YOLOv3 with a cross-stage partial network, thereby better extracting the feature information of the image.
Furthermore, there are 3 prediction layers, and each prediction output comprises 8 coordinate values, 1 box confidence and 1 class confidence.
Further, the box confidence is calculated as

$C_{ij} = \Pr(\mathrm{object}) \times \mathrm{IoU}_{pred}^{truth}$

where $C_{ij}$ is the confidence of the j-th prior box in the i-th grid cell; Pr(object) indicates whether the current prior box contains a target, taking the value 1 when a target is contained and 0 otherwise; and $\mathrm{IoU}_{pred}^{truth}$ is the intersection-over-union of the predicted bounding box and the ground-truth bounding box, which lies between 0 and 1.
Drawings
FIG. 1 is a schematic diagram of a current YOLOv4 network structure;
FIG. 2 is a schematic diagram of the structure of the adopted CSPDarkNet 53;
FIG. 3-a is a schematic representation of the structure of a CBM in the CSPDarkNet53 structure;
FIG. 3-b is a schematic diagram of the residual connection structure in the CSPDarkNet53 structure;
FIG. 4 is a diagram of the structure of the CSPn module in the CSPDarkNet53 structure;
FIG. 5 is a schematic diagram of the structure of SPP module in CSPDarkNet53 structure;
FIG. 6 is a diagram of the construction of the PANET module in the CSPDarkNet53 architecture;
FIG. 7 is a schematic diagram of a modified YOLOv4 network architecture employed in the present invention;
FIG. 8-a is a schematic drawing currently labeled with a horizontal box;
FIG. 8-b is a schematic diagram of a multi-dimensional corner coordinate detection box used in the prediction layer of the improved YOLOv4 adopted in the present invention;
FIG. 9 is a diagram of the PSA module architecture used in the feature fusion network in the modified YOLOv4 employed in the present invention;
FIG. 10 is a schematic diagram of the SPC module structure in the PSA module employed in the present invention;
FIG. 11-a is a schematic view of a jetty dock;
FIG. 11-b is a schematic view of an extended dock;
FIG. 11-c is a schematic view of a quay dock;
FIG. 12-a is a schematic diagram of dock identification results on dataset 1 using the SSD algorithm during the experiment;
FIG. 12-b is a diagram illustrating the dock identification result on the data set 1 using the YOLOv3 algorithm during the experiment;
FIG. 12-c is a diagram illustrating the recognition result of wharf on the data set 1 during the experiment using the YOLOv4 algorithm;
FIG. 13-a is a schematic representation of dock identification results on dataset 2 using the SSD algorithm during the experiment;
FIG. 13-b is a diagram illustrating the recognition result of wharf on the data set 2 by using the YOLOv3 algorithm in the experimental process;
FIG. 13-c is a diagram illustrating the dock identification result on the data set 2 using the YOLOv4 algorithm during the experiment;
FIG. 14-a is a schematic diagram of dock marker truth on dataset 1 during the experiment;
FIG. 14-b is a diagram illustrating the dock identification result on the data set 1 using the YOLOv4-M algorithm during the experiment;
FIG. 14-c is a graphical representation of dock identification results on dataset 1 during an experiment using the algorithm of the present invention;
FIG. 15-a is a schematic diagram of dock marker truth on dataset 2 during the experiment;
FIG. 15-b is a diagram illustrating the dock identification result on the data set 2 using the YOLOv4-M algorithm during the experiment;
FIG. 15-c is a graphical representation of wharf identification results on data set 2 during an experiment using the algorithm of the present invention;
FIG. 16 is a comparison graph of Loss function Loss curves of different backbone networks during an experiment;
fig. 17 is a schematic diagram of a result of identifying and detecting a large-format remote sensing image dock by using the target detection method of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
On the basis of the existing YOLOv4, a PSA attention module is added to the feature fusion network; the PSA attention module obtains receptive fields of different scales and extracts channel attention weights at each scale, realizing the fusion of context information of different scales and further improving the detection precision of the network. Meanwhile, the wharf target is calibrated with a multi-dimensional corner coordinate detection box, which effectively represents the true boundary of the wharf target, solves the problem of arbitrary wharf orientation, further improves the positioning precision of the target, and achieves accurate wharf detection. Before introducing the invention, the YOLOv4 algorithm is first reviewed.
YOLOv4 is a single-stage target detection algorithm that combines several of the stronger target detection ideas of recent years, achieving a good balance between detection precision and speed. YOLOv4 mainly comprises a backbone network, a feature fusion network and prediction layers. As shown in FIG. 1, an image of 480 × 480 pixels is input, features are extracted by the CSPDarknet53 backbone network, the image features are further enhanced by the Spatial Pyramid Pooling (SPP) module and the Path Aggregation Network (PANet), and the final detection result is obtained through the prediction layers.
YOLOv4 selects CSPDarkNet53 as its backbone network; CSPDarkNet53 combines the DarkNet53 backbone of YOLOv3 with a Cross Stage Partial network (CSPNet) to better extract the feature information of the image. The CSPDarkNet53 structure is shown in FIG. 2. The CBM module consists of a convolutional layer (Conv), Batch Normalization (BN) and the Mish activation function, as shown in FIG. 3-a. The residual unit (Res_unit) is similar to the structure of the ResNet network, as shown in FIG. 3-b: the input information can be passed directly to later layers through the skip connection, which reduces the difficulty of network learning. CSPn is the CSPNet introduced on each residual block of Darknet53 and consists of CBM and n Res_unit modules, as shown in FIG. 4. CSPNet mainly addresses the large amount of computation in network structure design: the CSP module splits the input feature map into two parts, one part passes through the residual convolutions to obtain a residual result, and the other part is fused with that residual result across the stage, which reduces the amount of computation while preserving accuracy.
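The CBM, Res_unit and CSPn blocks described above can be sketched in PyTorch roughly as follows; the channel handling is simplified (the real CSPDarkNet53 also changes channel counts and strides between stages), so this is an illustrative sketch rather than the actual network definition.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Conv + BatchNorm + Mish, the basic block of CSPDarkNet53."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    """Residual unit: two CBM blocks plus a skip connection."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBM(c, c, 1), CBM(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CSPn(nn.Module):
    """CSP block: one half passes through n residual units, the other half bypasses them,
    and the two halves are fused across the stage."""
    def __init__(self, c, n):
        super().__init__()
        self.branch_res = CBM(c, c // 2, 1)
        self.branch_skip = CBM(c, c // 2, 1)
        self.res = nn.Sequential(*[ResUnit(c // 2) for _ in range(n)])
        self.fuse = CBM(c, c, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.res(self.branch_res(x)), self.branch_skip(x)], dim=1))

y = CSPn(64, n=2)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```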
The feature fusion network comprises the SPP module and the PANet feature pyramid structure. The SPP module frees the input of the convolutional neural network from a fixed size, enlarges the receptive field of the network and effectively extracts important context information. As shown in FIG. 5, the SPP module applies maximum pooling with kernel sizes of 1 × 1, 5 × 5, 9 × 9 and 13 × 13 to the last feature layer of CSPDarkNet53 and then concatenates (Concat) the feature maps of different scales, improving the scale invariance of the image. PANet is a further improvement of the Feature Pyramid Network (FPN) and strengthens the network's multi-scale feature extraction by repeatedly extracting and fusing image features. As shown in FIG. 6, the FPN passes the rich semantic information of the high-level network to the low-level network by upsampling and fuses the corresponding feature layers through lateral connections; this top-down structure emphasizes the transmission of high-level semantic information. PANet further optimizes the FPN structure with a bottom-up path augmentation network (PAN), which passes the position information extracted by the lower layers to the higher layers through downsampling, so that the network obtains both the accurate position information of the target and the high-level semantic information, significantly improving the accuracy of target detection.
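A minimal sketch of the SPP step is given below, assuming stride-1 max pooling with padding so that the spatial size is preserved (the 1 × 1 branch is simply the input itself); the 512-channel example input is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Max-pool the input with kernels 5, 9 and 13 (stride 1, padded so the spatial size
    is unchanged) and concatenate the results with the input itself (the 1x1 branch)."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels]
        )

    def forward(self, x):                     # (B, C, H, W) -> (B, 4C, H, W)
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

# The deepest CSPDarkNet53 feature map of a 480 x 480 input is 15 x 15.
out = SPP()(torch.randn(1, 512, 15, 15))      # -> (1, 2048, 15, 15)
```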
The prediction layer locates targets with a horizontal detection box, i.e. it outputs the centre coordinates of the detection box, the confidence of the detection box and the class confidence of the target inside it. For an input image of 480 × 480 × 3 pixels, the prediction layers produce feature maps of dimensions 60 × 60 × 3 × (4+1+n), 30 × 30 × 3 × (4+1+n) and 15 × 15 × 3 × (4+1+n), where 60, 30 and 15 are the output sizes of the three layers and the factor 3 × (4+1+n) reflects the 3 prior boxes YOLOv4 assigns to each scale, each carrying 4 coordinate values, 1 confidence value and n class scores. YOLOv4 first roughly locates the target to be detected with the prior box and then computes the actual centre point $(b_x, b_y)$ and the width and height $(b_w, b_h)$ of the bounding box according to

$b_x = \sigma(t_x) + c_x$

$b_y = \sigma(t_y) + c_y$

$b_w = p_w e^{t_w}$

$b_h = p_h e^{t_h}$

where $\sigma(\cdot)$ is the Sigmoid function, $(t_x, t_y)$ is the offset of the detection-box centre relative to the top-left corner $(c_x, c_y)$ of the grid cell holding the prior box, and $(t_w, t_h)$ is the scaling of the detection-box width and height relative to the prior-box width and height $(p_w, p_h)$. The confidence score of the detection box is computed as

$C_{ij} = \Pr(\mathrm{object}) \times \mathrm{IoU}_{pred}^{truth}$

where $C_{ij}$ is the confidence of the j-th prior box in the i-th grid cell; Pr(object) indicates whether the current prior box contains a target, taking the value 1 when a target is contained and 0 otherwise; and $\mathrm{IoU}_{pred}^{truth}$, the intersection-over-union of the predicted bounding box and the ground-truth bounding box, lies between 0 and 1. Finally, non-maximum suppression removes redundant detection boxes to obtain the final detection result.
The YOLOv4 target detection algorithm effectively improves both precision and speed through strategies such as a stronger backbone network and multi-scale feature fusion. However, YOLOv4 detects horizontal target boxes with a default inclination angle of 0; it obtains good results for targets in natural scene images, but for wharf targets of arbitrary orientation in remote sensing images it is difficult for YOLOv4 to accurately describe the size and angle of the target.
To solve these problems, the invention proposes a wharf target detection method based on an improved YOLOv4 algorithm (Im-YOLOv4 for short) built on the existing YOLOv4. The method uses a multi-dimensional corner coordinate detection box in the prediction process, so that the detection result describes the wharf target more accurately, and modifies the YOLOv4 loss function so that it suits wharf target detection. Meanwhile, Im-YOLOv4 introduces a Pyramid Split Attention (PSA) module to fully extract the multi-scale spatial information and cross-dimension important features of the feature vectors and to improve the accuracy of wharf target detection.
Specifically, the Im-YOLOv4 network adopted by the invention is shown in FIG. 7. It has the same overall structure as the existing YOLOv4 network, comprising a backbone network, a feature fusion network and prediction layers; the backbone network is the existing CSPDarkNet53, whose structure is shown in FIG. 2 and is not described again here. The improvements of the invention lie mainly in the feature fusion network and the prediction layer: a PSA module is added to the feature fusion network, and a multi-dimensional corner coordinate detection box replaces the existing horizontal detection box in the prediction layer.
As shown in FIG. 9, the added PSA module is a lightweight and efficient attention module for improving the multi-scale feature extraction capability of the network. The PSA attention module comprises an SPC module, an SE module and an output module. First, the Split and Concat (SPC) module is used to obtain receptive fields of different scales so as to better extract multi-scale information of the image; second, the SE module extracts the attention weight of each channel, giving attention weights for channels of different scales, so that the PSA module can fuse context information of different scales; finally, Softmax normalization is applied to each group of channel attention vectors, and a feature map with multi-scale feature information is output.
The SPC module lets the network extract the spatial information of each channel's feature vectors by splitting the feature map, and performs feature fusion by establishing local cross-channel connections, thereby obtaining multi-scale information of the image. First, the SPC module splits the input feature map X into S parts along the channel dimension, the split feature vectors being denoted $[X_0, X_1, \dots, X_{S-1}]$ (FIG. 10 shows the structure of the SPC module for S = 4). Second, convolutions of different kernel sizes extract the spatial feature information of each feature vector; the split feature vectors of each part are processed with grouped convolution, because continually increasing the convolution kernel size would bring a huge number of parameters, and the group size G is defined from the convolution kernel size k of each part as

$G = 2^{\frac{k-1}{2}}$

Each part produces a feature map $F_i$ through its grouped convolution, and the $F_i$ are stacked to obtain the multi-scale feature map F:

$F_i = \mathrm{Conv}(k_i \times k_i, G_i)(X_i) \qquad (7)$

$F = \mathrm{Cat}([F_0, F_1, \dots, F_{S-1}]) \qquad (8)$

where $i = 0, 1, 2, \dots, S-1$.
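A minimal sketch of the SPC step is given below, assuming S = 4 splits with kernel sizes 3, 5, 7 and 9 (the kernel set is an assumption; FIG. 10 only fixes S = 4) and with the group size of each branch taken from the relation G = 2^((k-1)/2).

```python
import torch
import torch.nn as nn

class SPC(nn.Module):
    """Split the input into S channel groups, apply a grouped convolution with a different
    kernel size to each group, and concatenate the per-scale outputs."""
    def __init__(self, channels, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernels) == 0
        c = channels // len(kernels)                   # channels per split X_i
        self.convs = nn.ModuleList(
            [nn.Conv2d(c, c, k, padding=k // 2, groups=2 ** ((k - 1) // 2))  # G = 2^((k-1)/2)
             for k in kernels]
        )

    def forward(self, x):                              # x: (B, C, H, W)
        splits = torch.chunk(x, len(self.convs), dim=1)            # X_0 ... X_{S-1}
        feats = [conv(s) for conv, s in zip(self.convs, splits)]   # F_i
        return torch.cat(feats, dim=1)                             # F = Cat([F_0, ..., F_{S-1}])

feat = SPC(256)(torch.randn(1, 256, 15, 15))           # multi-scale feature map F
```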
The horizontal detection box adopted by the existing prediction layer usually marks the detected target with a four-dimensional vector (x, y, w, h), as shown in FIG. 8-a, where x, y are the coordinates of the centre point of the detection box and w, h are its width and height. This labelling describes the detection box with the fewest parameters, but for targets with a large aspect ratio, and especially targets of arbitrary angle, it cannot provide accurate boundary information. The wharf target in a remote sensing image is a directional target, and the horizontal-box labelling cannot express its orientation or boundary, so the detection box contains a large amount of redundant information. In order to mark a detection box that fits a directional object more closely, the invention uses multi-dimensional corner coordinates $(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$ to calibrate the target. As shown in FIG. 8-b, $x_1, y_1, \dots, x_4, y_4$ are the coordinates of the four corner points of the detection box; this labelling frames a directional target well, so the detection box fits the target more closely.
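To make the corner labelling concrete, the sketch below converts an oriented box given as centre, size and rotation angle into the eight corner coordinates; the (cx, cy, w, h, angle) parametrisation is just one possible source of such labels and is not prescribed by the invention.

```python
import math

def oriented_box_to_corners(cx, cy, w, h, angle_rad):
    """Return (x1, y1, ..., x4, y4): the four corner points of a rotated rectangle."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners += [cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a]
    return corners

# A 100 x 20 pixel wharf rotated by 30 degrees around the point (200, 150):
print(oriented_box_to_corners(200, 150, 100, 20, math.radians(30)))
```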
Since the target is calibrated with multi-dimensional corner coordinates, the loss function is optimized correspondingly when training the Im-YOLOv4 model. The main principle is to construct a multi-dimensional corner coordinate loss function to optimize the detection result of the corner coordinate detection box. The total network loss L consists of the detection-box regression loss $L_{Pre}$, the confidence loss $L_{Conf}$ and the classification loss $L_{Cls}$, expressed with a Smooth L1 loss, a Binary Cross Entropy Loss and a Cross Entropy Loss respectively:

$L = L_{Pre} + L_{Conf} + L_{Cls} \qquad (9)$

where the detection-box regression loss $L_{Pre}$, the confidence loss $L_{Conf}$ and the classification loss $L_{Cls}$ are computed as

$L_{Pre} = \sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\Pr(\mathrm{object})\sum_{k=1}^{4}\left[\mathrm{SmoothL1}\!\left(x_k - x_k^{gt}\right) + \mathrm{SmoothL1}\!\left(y_k - y_k^{gt}\right)\right]$

$L_{Conf} = -\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\left[c^{gt}\log c + \left(1 - c^{gt}\right)\log\!\left(1 - c\right)\right]$

$L_{Cls} = -\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\Pr(\mathrm{object})\sum p^{gt}\log p$

In these formulas, $S^2$ indicates that the target detection network divides the original image into S × S grid cells, and B is the number of prior boxes in each grid cell; the prior boxes are rectangles of different sizes and aspect ratios predefined at each position of the feature map and used to match the rectangles of real objects. Pr(object) indicates whether the current prior box contains a target, taking the value 1 when a target is contained and 0 otherwise; x, y and $x^{gt}$, $y^{gt}$ respectively denote the coordinates of the detection box and of the ground-truth box; c and $c^{gt}$ respectively denote the predicted confidence and the true confidence; p and $p^{gt}$ respectively denote the predicted class probability and the true class probability. The confidence loss $L_{Conf}$ and the classification loss $L_{Cls}$ use the existing loss functions.
Experimental verification
To further demonstrate the effect of the invention, experimental simulations were first carried out. Two groups of experiments were performed. The first group is based on horizontal detection boxes and studies the performance of the current mainstream single-stage detection algorithms SSD, YOLOv3 and YOLOv4 on the wharf target detection task. The second group addresses target detection with the multi-dimensional corner coordinate detection box, improving on the horizontal-box YOLOv4 algorithm to realize the detection of wharf targets with orientation information in remote sensing images.
Each network experiment was performed in the same environment, and the training parameters were kept consistent. To quantitatively evaluate model performance, the Average Precision (AP) is selected as the evaluation index of the wharf detection results; the AP measures the overall detection accuracy of the network and is defined as the integral, over recall R from 0 to 1, of the curve of precision P as a function of recall R:

$P = \frac{TP}{TP + FP}$

$R = \frac{TP}{TP + FN}$

$AP = \int_0^1 P(R)\,\mathrm{d}R$

where TP is the number of correctly detected wharves, FP is the number of falsely detected wharves, and FN is the number of missed wharves.
1) Selecting experimental data
A wharf is a small target in remote sensing images, with simple grey-level and texture characteristics, so its image characteristics must be fully analysed to build a good wharf data set for detection. Wharves come in many types, mainly jetty type, shore-extension type and quay-alongside type, and it is difficult to extract all of them effectively with one method. Jetty wharves extend from the coast at a right or obtuse angle, as shown in FIG. 11-a; the forward line of this type of wharf forms a large angle with the natural shoreline, which is common in seaports with large cargo volumes. The loading and unloading platform at the front edge of a shore-extension wharf is connected to the rear shoreline by an approach bridge (possibly with a guide embankment), giving it obvious structural characteristics, as shown in FIG. 11-b. The structure of quay-alongside wharves differs greatly from that of jetty wharves and approach bridges: as shown in FIG. 11-c, the front line of the wharf is generally parallel to the natural shoreline, there is no obvious boundary between the quay and the land, and the surface texture is similar to strip-shaped ground objects such as roads, so they are difficult to distinguish from other reinforced shorelines; the extraction of quay-alongside wharves is therefore not considered in this research.
In the experiments, two wharf data sets were constructed from publicly available remote sensing data and are denoted data set 1 and data set 2; their basic information is given in Table 1. Data set 1 selects images of a military port area from Google Earth, a wharf target detection data set was built by manual labelling, and 812 training images were obtained after data augmentation. Data set 2 is taken from the DOTA data set [9]; 1080 images containing civil wharves were obtained through image screening, cropping and similar operations, and the data set was randomly split in a 1:1 ratio. The horizontal-box detection algorithms use the minimum enclosing rectangle of the labelled samples to describe the target.
TABLE 1 (basic information of data set 1 and data set 2)
2) Horizontal detection box experiment
To explore the detection capability of mainstream horizontal-box target detection algorithms on wharf targets, this experiment runs the single-stage horizontal-box detectors SSD, YOLOv3 and YOLOv4 on data set 1 and data set 2. The wharves in data set 1 are of the military type, and FIGS. 12-a, 12-b and 12-c show the detection results of each algorithm on data set 1. The SSD algorithm cannot overcome the interference caused by the ship targets around the wharf, so false detections and missed detections occur, as shown in FIG. 12-a. Because military wharves are long and narrow, the YOLOv3 detection box may fail to cover the whole target, as shown in FIG. 12-b. As shown in FIG. 12-c, the YOLOv4 algorithm can detect the wharf target accurately, but the detection box also contains the berthed ship targets, which interferes with the detection result.
The wharves in data set 2 are mostly of the civil type and are slender, small and scattered. FIGS. 13-a, 13-b and 13-c show the detection results of each algorithm on data set 2. As shown in FIG. 13-a, the SSD algorithm performs poorly on data set 2: when the wharf is small and interference factors such as vegetation are present around it, missed detections easily occur; in addition, buildings around the wharf also interfere with the detection of wharf targets and cause false detections. FIG. 13-b shows the detection result of the YOLOv3 algorithm on the wharves in data set 2, which is clearly better than that of the SSD algorithm but still shows some missed detections. As shown in FIG. 13-c, the YOLOv4 algorithm detects the wharf target more accurately, but when wharves are densely distributed and slender, the detection boxes of different wharf targets overlap and the wharf targets cannot be located accurately.
The precision evaluation results on the test sets are shown in Table 2. The SSD algorithm uses VGG16 as its feature extraction network; its AP values on the two data sets are 49.08% and 52.89%, the lowest among the compared networks, indicating that the SSD algorithm is poorly suited to the wharf target detection task. The YOLOv3 algorithm uses Darknet53 as its feature extraction network, with AP values of 75.39% and 82.61% on the two data sets, a detection precision clearly better than that of the SSD algorithm. The YOLOv4 algorithm further improves on YOLOv3 and selects CSPDarknet53 as its feature extraction network; its AP values on the two wharf detection data sets reach 81.84% and 84.78%, improvements of 6.45% and 2.17% over YOLOv3, which shows that the YOLOv4 horizontal-box detection algorithm performs best on the wharf target detection task and lays a foundation for the subsequent detection of wharf targets in arbitrary orientations.
TABLE 2

Algorithm | Backbone     | AP on data set 1 | AP on data set 2
SSD       | VGG16        | 49.08%           | 52.89%
YOLOv3    | Darknet53    | 75.39%           | 82.61%
YOLOv4    | CSPDarknet53 | 81.84%           | 84.78%
3) Multi-dimensional corner coordinate detection box experiment
The previous experiment verified that the YOLOv4 horizontal-box detection algorithm performs well on the wharf target detection task. Detection of wharf targets with orientation information in remote sensing images is realized by introducing the multi-dimensional corner coordinate detection box on the basis of the YOLOv4 algorithm, denoted the YOLOv4 multi-dimensional corner coordinate detection algorithm (YOLOv4-M). This experiment further improves the YOLOv4-M algorithm for the characteristics of the wharf target detection task, making it more suitable for remote sensing image wharf detection. Data set 1 contains many military wharf targets. FIG. 14-a is the visualization of the labelled wharves, and FIG. 14-b is the detection result of YOLOv4-M. YOLOv4-M makes up for the shortcomings of the horizontal detection box by using the multi-dimensional corner coordinate detection box, but ships berthed at military wharves have features similar to the wharf and tend to cast shadows over the wharf area, covering the wharf features and strongly interfering with the detection result, so the YOLOv4-M result shows missed detections and inaccurate positioning. FIG. 14-c shows the detection result of the algorithm of the invention, which is clearly better than that of YOLOv4-M, effectively overcomes the interference of ships and other factors, and accurately extracts the wharf targets in the military port images.
A simulation experiment comparing the YOLOv4 multi-dimensional corner coordinate detection algorithm with the algorithm of the invention was then carried out on data set 2. FIG. 15-a is the visualization of the labelled wharves, and FIG. 15-b is the YOLOv4-M detection result, from which it can be seen that the YOLOv4-M multi-dimensional corner coordinate detection box accurately reflects the size of the wharf target; the positioning is more accurate than with the horizontal box and better meets practical requirements. However, civil wharves vary greatly in size, and YOLOv4-M misses slender wharf targets with a large aspect ratio; in addition, YOLOv4-M falsely detects interference factors around the wharf, such as yachts and vegetation, as wharves, which degrades the detection result. The detection result of the algorithm of the invention on data set 2 is shown in FIG. 15-c. The algorithm introduces a PSA module on the basis of YOLOv4-M, which strengthens the multi-scale feature extraction capability, reduces the miss rate on slender wharf targets, and obtains accurate wharf detection results in civil port areas with complex backgrounds.
The precision evaluation results on the two test sets are shown in Table 3. To study the effectiveness of the PSA attention module on the wharf target detection task, the SE attention module and the Convolutional Block Attention Module (CBAM) were also introduced on the basis of YOLOv4-M for comparison, with all parameters kept consistent during the experiments. The AP values of YOLOv4-M on the two data sets are 59.19% and 73.04%; after introducing the PSA attention module into YOLOv4-M (the algorithm of the invention), the AP values on the two data sets rise by 7.58% and 5.78%, showing that the PSA module effectively improves the detection precision of wharf targets. After introducing the SE module into YOLOv4-M, the AP values on the two data sets rise by 6.97% and 1.43% relative to YOLOv4-M but are 0.61% and 4.35% lower than those of the algorithm of the invention, showing that the SE module also benefits wharf detection but its improvement is smaller than that of the PSA module. After introducing the CBAM module into YOLOv4-M, the AP value on data set 1 rises by 4.28% relative to YOLOv4-M, while the AP value on data set 2 falls by 0.15%, showing that the CBAM module improves the detection of military wharves but is not beneficial for civil wharves. These results show that introducing the PSA attention module effectively improves the detection precision of wharf targets, and that by extracting the multi-scale features of the image the PSA attention module suits the wharf target detection task better than the SE and CBAM attention modules.
TABLE 3

Method                         | AP on data set 1 | AP on data set 2
YOLOv4-M                       | 59.19%           | 73.04%
YOLOv4-M + SE                  | 66.16%           | 74.47%
YOLOv4-M + CBAM                | 63.47%           | 72.89%
YOLOv4-M + PSA (the invention) | 66.77%           | 78.82%
4) Backbone network ablation experiment
In addition, to explore the influence of the feature extraction backbone on the detection performance of the target detection algorithm, comparison experiments were carried out on data set 2 with the algorithm of the invention using MobileNetv2, MobileNetv3 and CSPDarknet53 backbones. FIG. 16 shows the loss curves of the different backbones on data set 2 as a function of the number of iterations: each loss value starts high in the early stage of training, then oscillates and decreases slowly as the iterations increase, and stabilizes by around 250 iterations. The loss values of the MobileNetv2 and MobileNetv3 backbones finally stabilize at 15.02 and 16.16, while the loss of the algorithm of the invention with the CSPDarknet53 backbone stabilizes at 11.44, the lowest among the compared networks.
The evaluation results of the models with different backbones on data set 2 are shown in Table 4, where the detection speed of each model is evaluated in frames per second (FPS), i.e. the number of images the model can process in one second; the larger the FPS, the faster the model. As shown in Table 4, the YOLOv4-M algorithm with CSPDarknet53 as backbone reaches an AP of 73.04% and an FPS of 29.18 frames/s on data set 2. With a MobileNetv2 backbone, the algorithm of the invention reaches an AP of 71.86% and an FPS of 31.78 frames/s on data set 2; with a MobileNetv3 backbone, an AP of 69.10% and an FPS of 32.35 frames/s; with CSPDarknet53 as backbone, the FPS is 28.49 frames/s, slightly lower than with the MobileNetv2 and MobileNetv3 backbones, but the AP on data set 2 reaches 78.80%, the best detection precision. Moreover, compared with YOLOv4-M, which also uses CSPDarknet53 as backbone, the FPS of the invention drops by only 0.69 frames/s, confirming that PSA is a lightweight attention module and that the algorithm achieves an effective balance between detection precision and speed.
TABLE 4

Model                      | Backbone     | AP on data set 2 | FPS (frames/s)
YOLOv4-M                   | CSPDarknet53 | 73.04%           | 29.18
Algorithm of the invention | MobileNetv2  | 71.86%           | 31.78
Algorithm of the invention | MobileNetv3  | 69.10%           | 32.35
Algorithm of the invention | CSPDarknet53 | 78.80%           | 28.49
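The FPS figure can be measured as in the sketch below, which simply times repeated forward passes on a fixed-size input; the warm-up pass, batch size of 1 and input size of 480 are assumptions of the example rather than the protocol used in the experiments.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, n=100, size=480, device="cpu"):
    """Frames per second: number of images processed in one second of pure inference time."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, size, size, device=device)
    model(x)                                    # warm-up pass
    start = time.perf_counter()
    for _ in range(n):
        model(x)
    return n / (time.perf_counter() - start)

print(measure_fps(torch.nn.Conv2d(3, 8, 3, padding=1), n=20))
```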
5) Remote sensing image wharf target detection application example
In this example, the Norfolk harbor region of the United States is selected to verify the wharf detection performance on a large-format Google Earth image; the pixel size of the image of this region is 5 559 multiplied by 6 pixels, the spatial resolution is 1 m, and the region contains 17 wharves in total.
The wharf detection result obtained with the identification method of the invention is shown in FIG. 17: 16 wharf targets are correctly detected, 3 are false detections (indicated by white boxes) and 1 is missed (indicated by a black box), giving a precision of 84.21% and a recall of 94.12%. The false detections are port sea-land boundaries or artificial buildings on land with characteristics similar to a wharf; the missed wharf has too slender a structure for the algorithm to detect, which causes the missed detection.
In conclusion, the invention designs an Im-YOLOv4 algorithm that realizes wharf target detection in arbitrary orientations on the basis of the YOLOv4 horizontal-box detection algorithm: the algorithm adopts a multi-dimensional corner coordinate detection box to solve the problem of arbitrary wharf orientation and introduces a PSA attention mechanism to strengthen the network's extraction of wharf target features. Two remote sensing image wharf detection data sets were selected for the experiments, verifying the feasibility of the current mainstream horizontal-box detection algorithms on the wharf target detection task. In addition, the influence of the attention mechanism and of the feature extraction backbone on wharf detection performance was analysed through ablation experiments; the results show that the method accurately extracts wharf targets from remote sensing images and is practical on large-format images.

Claims (7)

1. A method for detecting a wharf target by using a remote sensing image is characterized by comprising the following steps:
1) Acquiring a remote sensing image to be detected;
2) Inputting a remote sensing image to be detected into the target detection model to output a wharf target detection result;
the target detection model comprises a backbone network, a feature fusion network and a prediction layer;
the main network is used for extracting the features of the input remote sensing image to obtain feature maps with different sizes;
the feature fusion network comprises an SPP module, a PSA attention module and a feature pyramid module, wherein the SPP module is used for performing maximum pooling operation on features output by the backbone network and fusing feature maps with different scales; the PSA attention module is used for segmenting the output result of the SPP module and extracting scale features and channel attention weighting to obtain a feature map with multi-scale features and attention weighting; the characteristic pyramid module is used for repeatedly extracting and fusing characteristics of a plurality of obtained characteristic graphs with multi-scale characteristics and attention weighting to obtain characteristics with accurate target position information and high-level semantic information;
the prediction layer positions a target to be detected by using the characteristics of establishing a multi-dimensional corner coordinate detection frame, obtaining position information with accurate target and high-level semantic information, wherein the multi-dimensional corner coordinates comprise coordinates of four corners of the detection frame.
2. The method for detecting the wharf target by remote sensing image according to claim 1, wherein the PSA attention module comprises an SPC module, an SE module and an output module; the SPC module is used for dividing the characteristic diagram output by the SPP module into a plurality of parts on the channel dimension, extracting the characteristics of each part in different scales to obtain characteristic vectors of each part and generating the characteristic diagram of the corresponding channel; the SE module is used for extracting attention vectors of corresponding channels from the feature map of each channel and carrying out feature calibration on the attention vectors of each channel again to obtain attention weights after interaction of corresponding multi-scale channels; and the output module is used for carrying out weighted fusion processing on the feature map of the corresponding channel according to the obtained attention weight to obtain the feature map with the multi-scale features and the attention weight.
3. The method for detecting the wharf target by remote sensing of the images as claimed in claim 2, wherein the SPC module performs feature extraction at a different scale on each part by using grouped convolution, the group size of each branch being related to its convolution kernel size, and the relationship between the convolution kernel size k and the group size G being

$G = 2^{\frac{k-1}{2}}$

where G is the group size and k is the convolution kernel size.
4. The method for detecting the wharf target by the remote sensing image according to claim 1, wherein the target detection model adopts a multidimensional corner coordinate loss function during training:
$L = L_{Pre} + L_{Conf} + L_{Cls}$

$L_{Pre} = \sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\Pr(\mathrm{object})\sum_{k=1}^{4}\left[\mathrm{SmoothL1}\!\left(x_k - x_k^{gt}\right) + \mathrm{SmoothL1}\!\left(y_k - y_k^{gt}\right)\right]$

$L_{Conf} = -\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\left[c^{gt}\log c + \left(1 - c^{gt}\right)\log\!\left(1 - c\right)\right]$

$L_{Cls} = -\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\Pr(\mathrm{object})\sum p^{gt}\log p$

where $S^2$ represents the number of grid cells into which the input remote sensing image is divided, B represents the number of prior boxes in each grid cell, and Pr(object) indicates whether the current prior box contains a target, taking the value 1 when a target is contained and 0 otherwise; x, y and $x^{gt}$, $y^{gt}$ respectively represent the corner coordinates of the detection box and of the ground-truth box; c and $c^{gt}$ respectively represent the predicted confidence and the true confidence; and p and $p^{gt}$ respectively represent the predicted class probability and the true class probability.
5. The method for detecting the wharf target by remote sensing image according to claim 1, wherein the backbone network is a CSPDarknet53 network.
6. The method for detecting the wharf target by remote sensing of the claim 4, wherein the number of the prediction layers is 3, and each prediction output comprises 8 coordinate values, 1 box confidence and 1 class confidence.
7. The method for detecting the wharf target by the remote sensing image according to claim 6, wherein the box confidence is calculated as

$C_{ij} = \Pr(\mathrm{object}) \times \mathrm{IoU}_{pred}^{truth}$

wherein $C_{ij}$ is the confidence of the j-th prior box in the i-th grid cell; Pr(object) indicates whether the current prior box contains a target, taking the value 1 when a target is contained and 0 otherwise; and $\mathrm{IoU}_{pred}^{truth}$ is the intersection-over-union of the predicted bounding box and the ground-truth bounding box, which lies between 0 and 1.
CN202210722252.XA 2022-06-17 2022-06-17 Remote sensing image wharf target detection method Pending CN115496998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210722252.XA CN115496998A (en) 2022-06-17 2022-06-17 Remote sensing image wharf target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210722252.XA CN115496998A (en) 2022-06-17 2022-06-17 Remote sensing image wharf target detection method

Publications (1)

Publication Number Publication Date
CN115496998A true CN115496998A (en) 2022-12-20

Family

ID=84467043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210722252.XA Pending CN115496998A (en) 2022-06-17 2022-06-17 Remote sensing image wharf target detection method

Country Status (1)

Country Link
CN (1) CN115496998A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740704A (en) * 2023-06-16 2023-09-12 安徽农业大学 Wheat leaf phenotype parameter change rate monitoring method and device based on deep learning
CN116740704B (en) * 2023-06-16 2024-02-27 安徽农业大学 Wheat leaf phenotype parameter change rate monitoring method and device based on deep learning

Similar Documents

Publication Publication Date Title
KR102414452B1 (en) Target detection and training of target detection networks
CN109902677B (en) Vehicle detection method based on deep learning
CN111091105B (en) Remote sensing image target detection method based on new frame regression loss function
Chen et al. A deep neural network based on an attention mechanism for SAR ship detection in multiscale and complex scenarios
CN111738112B (en) Remote sensing ship image target detection method based on deep neural network and self-attention mechanism
CN109711295B (en) Optical remote sensing image offshore ship detection method
CN111860336B (en) High-resolution remote sensing image inclined ship target detection method based on position sensing
CN110084234B (en) Sonar image target identification method based on example segmentation
CN112560671B (en) Ship detection method based on rotary convolution neural network
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN111079739B (en) Multi-scale attention feature detection method
CN111753677B (en) Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN114612769B (en) Integrated sensing infrared imaging ship detection method integrated with local structure information
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN113420819B (en) Lightweight underwater target detection method based on CenterNet
CN114565860A (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN105469393A (en) Shallow water depth multi-temporal remote sensing image inversion method based on decision fusion
CN112149591A (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN113033315A (en) Rare earth mining high-resolution image identification and positioning method
CN112883971A (en) SAR image ship target detection method based on deep learning
CN116563726A (en) Remote sensing image ship target detection method based on convolutional neural network
CN112613504A (en) Sonar underwater target detection method
CN115471746A (en) Ship target identification detection method based on deep learning
CN107169412B (en) Remote sensing image harbor-berthing ship detection method based on mixed model decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination