CN115527095A - Multi-scale target detection method based on combined recursive feature pyramid


Info

Publication number
CN115527095A
Authority
CN
China
Prior art keywords
feature
pyramid
channel
features
image
Prior art date
Legal status
Pending
Application number
CN202211339440.0A
Other languages
Chinese (zh)
Inventor
韩冰
陈玮铭
高新波
杨铮
黄晓悦
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
2022-10-29
Filing date
2022-10-29
Publication date
2022-12-27
Application filed by Xidian University
Priority to CN202211339440.0A
Publication of CN115527095A

Classifications

    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06N 3/084: Computing arrangements based on biological models; neural-network learning methods, backpropagation, e.g. using gradient descent
    • G06V 10/20: Image preprocessing
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 2201/07: Target detection


Abstract

The invention discloses a multi-scale target detection method based on a combined recursive feature pyramid, which mainly solves the problem of low multi-scale target detection accuracy in complex scenes in the prior art. The implementation scheme is as follows: 1) read data from a target detection database and preprocess the image data; 2) extract image features using a ResNet convolutional neural network as the backbone network; 3) construct a feature pyramid from the extracted image features; 4) construct a joint feedback processor formed by connecting a channel attention module and a spatial attention module in series; 5) process each layer of pyramid features with the joint feedback processor to complete feature fusion; 6) repeat steps 3) to 5) twice to obtain multi-scale features; 7) input the multi-scale features into an existing detection head to complete multi-scale detection. The invention significantly improves the accuracy of multi-scale target detection in complex scenes and can be used in intelligent transportation, intelligent security and remote sensing image processing.

Description

Multi-scale target detection method based on combined recursive feature pyramid
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a multi-scale target detection method based on a recursive feature pyramid, which can be used in fields such as traffic, security and medical treatment.
Background
Target detection is one of the basic tasks in the field of computer vision. It is widely applied in fields such as traffic, security and medical treatment, and has very high application value. The task of target detection comprises two parts: locating the position of a target in an image and predicting its category. Because targets differ in physical size and in distance from the camera, the scales at which they appear in an image usually vary widely, which degrades detection performance.
In recent years, the problem of multi-scale target detection has received much attention. Existing algorithms construct a feature pyramid: specific layers of the backbone network are output separately, and the pyramid is built by up-sampling and feature fusion to obtain features that are both high-resolution and semantically rich. In addition, some researchers have improved detection by introducing a recursive mechanism into the feature pyramid and switchable atrous convolution into the backbone network.
The traditional feature pyramid structure has large semantic differences between layers: the direct top-down up-sampling and fusion scheme cannot propagate high-level semantic information well to the lower layers, and the highest layer only loses information because there is no higher-level feature to fuse with it, so the multi-scale information extraction capability is insufficient. For this reason, several variants of the feature pyramid structure have been proposed in the prior art.
Because the strategy of building feature maps at different spatial resolutions layer by layer markedly improves a model's detection of targets at different scales, target detection algorithms based on the feature pyramid and its variants are the mainstream approach to multi-scale target detection. Ghiasi et al. used an automatic search algorithm, with the feature maps to be fused as the search space, to find a feature pyramid structure. However, structures found by automatic search tend to be highly dataset-dependent, often performing well on some datasets but only moderately on others. Qiao et al. first introduced a recursive mechanism into the target detection task, proposed a recursive feature pyramid structure, and designed a switchable atrous convolution for the backbone network. However, they neglected the inherent semantic differences between pyramid layers, so the method's performance is not optimal, and the switchable atrous convolution slows inference and occupies much video memory. Guo et al. observed that the highest pyramid layer only loses information, designed a residual feature enhancement module to complement the highest-layer features, and also designed an adaptive spatial fusion module to fuse the pyramid layers, with the fused features used to predict the target category and regress the target position, significantly improving the detector's multi-scale information extraction capability. However, this method simply fuses the layer features before prediction and regression, ignoring the inherent semantic differences between layers, and so cannot achieve the best performance. Liu et al. considered the information propagation path in the conventional feature pyramid too long, and optimized the connection paths so that bottom-level features useful for target localization flow more quickly to the higher layers, improving the detector's multi-scale detection capability. Although this approach optimizes the propagation paths, it still ignores the inherent semantic differences between layers and thus does not reach optimal performance.
Disclosure of Invention
The aim of the invention is to provide a multi-scale target detection method based on a combined recursive feature pyramid that addresses the defects of the prior art by accounting for both the information loss at the highest layer and the inherent semantic differences between layers.
In order to achieve the purpose, the implementation steps of the technical scheme of the invention comprise the following steps:
(1) Reading data from a target detection database; sequentially resizing, flipping and normalizing the training images, sequentially resizing and normalizing the test images, and setting the normalization means and standard deviations of the three RGB channels, finally obtaining the tensor data corresponding to each image;
(2) Inputting the preprocessed image tensor data from step (1) into a ResNet convolutional neural network comprising 5 serially connected convolution blocks as the backbone network, obtaining the image features extracted by the 5 convolution blocks, denoted C1, C2, C3, C4 and C5 respectively;
(3) Constructing a feature pyramid from the image features extracted by the ResNet convolutional neural network:
3a) The image features C2, C3, C4 and C5 extracted by the ResNet convolutional neural network are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, so that the channel count of C2 stays at 256 while that of C3 is reduced from 512 to 256, that of C4 from 1024 to 256 and that of C5 from 2048 to 256, finally giving the 4 dimension-reduced backbone features C2', C3', C4' and C5';
3b) A top-down feature fusion operation is performed on the dimension-reduced backbone features from step 3a) to form a feature pyramid structure consisting of the pyramid features P2, P3, P4 and P5;
(4) Constructing a joint feedback processor formed by connecting a channel attention module and a spatial attention module in series;
(5) Processing each layer of pyramid features obtained in step (3) with the joint feedback processor to complete feature fusion:
5a) Inputting the 4 layers of pyramid features P2, P3, P4 and P5 into the channel attention module to obtain the channel attention feature M_C;
5b) Inputting the channel attention feature M_C obtained in 5a) into the spatial attention module to obtain the spatial attention feature M_S;
5c) Splitting the spatial attention feature M_S into 4 feature maps and down-sampling each so that it matches the size of the corresponding backbone convolution block output C_i;
5d) Passing the down-sampled feature maps through 4 convolution layers with kernel size 1 × 1 and stride 1, raising their channel counts to 256, 512, 1024 and 2048 respectively, to obtain the feature maps M_i to be fused with the backbone; each M_i is then added to the corresponding backbone convolution block output C_i to complete the feature fusion;
(6) Repeating steps (3) to (5) twice to obtain the final multi-scale features P2', P3', P4' and P5', inputting them into an existing detection head network, and outputting the predicted target position parameters (x, y, w, h) and the confidence c of the target's category, where (x, y) is the coordinate of the upper-left corner of the target bounding box in the image, w is the width of the bounding box and h is its height, thereby completing the multi-scale target detection.
Compared with the prior art, the invention has the following advantages:
first, on top of the recursive feature pyramid, the invention introduces a joint feedback processor that uniformly processes the feedback features of the feature pyramid, which both supplements the information flow of the topmost pyramid features and reduces the semantic differences between layers, improving the detector's multi-scale information extraction capability and hence the network's detection performance;
second, the method needs no special convolution operation such as switchable atrous convolution to enlarge the receptive field, so compared with other recursive methods its inference speed is significantly higher.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a joint recursive feature pyramid in accordance with the present invention;
FIG. 3 is a schematic diagram of a joint feedback processor of the present invention;
FIG. 4 is a diagram of simulation results of detection of a ship target in an optical remote sensing image using the present invention.
Detailed Description
Embodiments and effects of the present invention are further described below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this embodiment are as follows:
step 1, reading data of a target detection database, and preprocessing image data.
The target detection database contains training-stage data and test-stage data; the image data of the two stages are preprocessed as follows:
1.1) Data preprocessing in the training phase:
first, the input image is scaled to 800 × 800; then its brightness, contrast, saturation and hue are randomly adjusted with probability 0.5, and the image is randomly flipped with probability 0.5;
the image is then normalized by the mean-standard-deviation method, with the normalization means of the three RGB channels set to [123.675, 116.28, 103.53] and the standard deviations set to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage;
1.2) Data preprocessing in the test phase:
the input image is scaled to 800 × 800;
the image is then normalized by the mean-standard-deviation method, with the normalization means of the three RGB channels set to [123.675, 116.28, 103.53] and the standard deviations set to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage.
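For illustration, the two preprocessing pipelines above can be sketched with torchvision transforms (PyTorch is the framework named in the experimental conditions below). The jitter magnitudes are assumptions; the text fixes only the application probability of 0.5:

```python
import torchvision.transforms as T

# The means/stds above are on the 0-255 scale; ToTensor() rescales pixels
# to [0, 1], so the statistics are divided by 255 to stay consistent.
MEAN = [123.675 / 255, 116.28 / 255, 103.53 / 255]
STD = [58.395 / 255, 57.12 / 255, 57.375 / 255]

train_transform = T.Compose([
    T.Resize((800, 800)),
    # Brightness/contrast/saturation/hue jitter applied with probability 0.5;
    # the magnitudes (0.4, 0.4, 0.4, 0.1) are assumptions, not from the patent.
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])

test_transform = T.Compose([
    T.Resize((800, 800)),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])
```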
Step 2: Extract image features using a ResNet convolutional neural network as the backbone network.
The ResNet convolutional neural network has 5 serially connected convolution blocks; each convolution block comprises several convolution groups, and each group comprises a convolution layer, a batch normalization layer and a ReLU activation function. The backbone network used by the invention comes in three versions: ResNet-50, ResNet-101 and ResNet-152. The image tensor preprocessed in step 1 is input into the ResNet convolutional neural network to extract image features, and the features extracted by the 5 convolution blocks are denoted C1, C2, C3, C4 and C5 respectively. The backbone structure and the extracted image features are shown in Table 1.
Table 1: ResNet convolutional neural network structure and extracted image features
(Table 1 is reproduced only as an image in the original document.)
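One way to expose the five per-block outputs C1 to C5 from a standard torchvision ResNet-50 is sketched below; the wrapper class is an illustrative assumption, not part of the patent:

```python
import torch
import torch.nn as nn
import torchvision

class ResNetBackbone(nn.Module):
    """Returns the outputs of the 5 convolution blocks as C1..C5."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(pretrained=False)  # True loads ImageNet weights
        self.block1 = nn.Sequential(r.conv1, r.bn1, r.relu)  # C1: 64 ch, stride 2
        self.pool = r.maxpool
        self.block2 = r.layer1   # C2: 256 channels,  stride 4
        self.block3 = r.layer2   # C3: 512 channels,  stride 8
        self.block4 = r.layer3   # C4: 1024 channels, stride 16
        self.block5 = r.layer4   # C5: 2048 channels, stride 32

    def forward(self, x):
        c1 = self.block1(x)
        c2 = self.block2(self.pool(c1))
        c3 = self.block3(c2)
        c4 = self.block4(c3)
        c5 = self.block5(c4)
        return c1, c2, c3, c4, c5

# For an 800 x 800 input, C2..C5 are 200x200, 100x100, 50x50 and 25x25.
c1, c2, c3, c4, c5 = ResNetBackbone()(torch.randn(1, 3, 800, 800))
```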
Step 3: Construct a feature pyramid from the image features extracted by the ResNet convolutional neural network.
Referring to fig. 2, the specific implementation of this step is as follows:
3.1) The image features C2, C3, C4 and C5 extracted by the ResNet convolutional neural network are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, so that the channel count of C2 stays at 256 while that of C3 is reduced from 512 to 256, that of C4 from 1024 to 256 and that of C5 from 2048 to 256, finally giving the 4 dimension-reduced backbone features C2', C3', C4' and C5';
3.2) A top-down feature fusion operation is performed on the dimension-reduced backbone features obtained in 3.1):
3.2.1) The highest-level dimension-reduced backbone feature is taken as the highest-level pyramid feature P5; P5 is up-sampled by a factor of 2 and added directly to the second-highest-level dimension-reduced backbone feature to obtain the second-highest-level pyramid feature P4;
3.2.2) The second-highest-level pyramid feature P4 is up-sampled by a factor of 2 and added directly to the second-lowest-level dimension-reduced backbone feature to obtain the second-lowest-level pyramid feature P3;
3.2.3) The second-lowest-level pyramid feature P3 is up-sampled by a factor of 2 and added directly to the lowest-level dimension-reduced backbone feature to obtain the lowest-level pyramid feature P2;
3.3) The pyramid features P2, P3, P4 and P5 are arranged from bottom to top to form the feature pyramid structure.
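Steps 3.1) to 3.3) correspond to the standard feature pyramid construction and can be sketched as follows, under the same 256-channel convention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Lateral 1x1 convs to 256 channels plus top-down 2x up-sampling
    with element-wise addition (steps 3.1-3.3)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1, stride=1)
            for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        l2, l3, l4, l5 = [lat(c) for lat, c in
                          zip(self.laterals, (c2, c3, c4, c5))]
        p5 = l5                                                # highest level
        p4 = l4 + F.interpolate(p5, scale_factor=2, mode='nearest')
        p3 = l3 + F.interpolate(p4, scale_factor=2, mode='nearest')
        p2 = l2 + F.interpolate(p3, scale_factor=2, mode='nearest')
        return p2, p3, p4, p5
```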
Step 4: Construct the joint feedback processor.
4.1) A channel attention module is selected that sequentially comprises up-sampling, feature concatenation, a global average pooling layer, fully connected layers and a Sigmoid function, used to extract the channel attention feature, where the Sigmoid function is:
Sigmoid(x) = 1 / (1 + e^(-x))
4.2) A spatial attention module is selected that sequentially comprises an average pooling layer, a maximum pooling layer, a convolution layer and a Sigmoid function, used to extract the spatial attention feature;
4.3) The channel attention module and the spatial attention module are connected in series to form the joint feedback processor.
Step 5: Process each layer of pyramid features obtained in step 3 with the joint feedback processor to complete feature fusion.
Referring to fig. 3, the specific implementation of this step is as follows:
5.1) The 4 layers of pyramid features P2, P3, P4 and P5 are input into the channel attention module to obtain the channel attention feature M_C:
5.1.1) The pyramid features P2, P3, P4 and P5 are each up-sampled to obtain the corresponding features X2, X3, X4 and X5, each of size 200 × 200 with 256 channels;
5.1.2) The up-sampled features X2, X3, X4 and X5 are concatenated into the channel concatenation feature M_cat1, of size 200 × 200 with 1024 channels;
5.1.3) M_cat1 is compressed by a global average pooling layer into the average-pooled compressed vector V_gap of length 1024;
5.1.4) V_gap is passed through a group consisting of a fully connected layer, a batch normalization layer and a ReLU activation function, and compressed again into the channel recompression vector V_fc1 of length 256;
5.1.5) V_fc1 is passed through another fully connected layer that restores the channel count, giving the channel release vector V_fc2 of length 1024;
5.1.6) A Sigmoid function normalizes the channel release vector V_fc2, giving the normalized vector V_norm of length 1024;
5.1.7) The channel concatenation feature M_cat1 and the normalized vector V_norm are multiplied channel-wise to obtain the channel attention feature M_C:
M_C = M_cat1 · V_norm
where the channel attention feature M_C has size 200 × 200 and 1024 channels.
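A sketch of the channel attention module of steps 5.1.1) to 5.1.7); the class and variable names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention of the joint feedback processor (step 5.1):
    up-sample all pyramid levels to a common size, concatenate them,
    then squeeze-and-excite over the 1024 concatenated channels."""
    def __init__(self, channels=256, levels=4):
        super().__init__()
        total = channels * levels                       # 1024
        self.fc1 = nn.Linear(total, channels)           # 1024 -> 256 (V_fc1)
        self.bn = nn.BatchNorm1d(channels)
        self.fc2 = nn.Linear(channels, total)           # 256 -> 1024 (V_fc2)

    def forward(self, pyramid):                         # [P2, P3, P4, P5]
        size = pyramid[0].shape[-2:]                    # e.g. 200 x 200
        ups = [F.interpolate(p, size=size, mode='nearest') for p in pyramid]
        m_cat = torch.cat(ups, dim=1)                   # M_cat1: B x 1024 x H x W
        v = F.adaptive_avg_pool2d(m_cat, 1).flatten(1)  # V_gap:  B x 1024
        v = F.relu(self.bn(self.fc1(v)))                # V_fc1:  B x 256
        v = torch.sigmoid(self.fc2(v))                  # V_norm: B x 1024
        return m_cat * v[:, :, None, None]              # M_C, channel-wise reweighting
```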
5.2) The channel attention feature M_C obtained in 5.1) is input into the spatial attention module to obtain the spatial attention feature M_S:
5.2.1) The channel attention feature M_C is passed through a maximum pooling layer and an average pooling layer respectively, giving the maximum-pooled feature M_max and the average-pooled feature M_avg, each of size 200 × 200 with 1 channel;
5.2.2) M_max and M_avg are concatenated into the spatial concatenation feature M_cat2, of size 200 × 200 with 2 channels;
5.2.3) M_cat2 is passed through a convolution layer with kernel size 7 × 7 and stride 1 to give the new feature M_un, of size 200 × 200 with 1 channel;
5.2.4) A Sigmoid function normalizes M_un into the normalized feature M_norm, of size 200 × 200 with 1 channel;
5.2.5) The Hadamard product of the channel attention feature M_C and the normalized feature M_norm gives the spatial attention feature M_S:
M_S = M_C ⊙ M_norm
where ⊙ denotes the Hadamard product and the spatial attention feature M_S has size 200 × 200 and 1024 channels;
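A sketch of the spatial attention module of steps 5.2.1) to 5.2.5); the padding of 3 is an assumption implied by the 7 × 7 kernel and the unchanged 200 × 200 output size:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of the joint feedback processor (step 5.2):
    per-pixel max and mean over channels, a 7x7 conv, and a sigmoid
    gate applied to the input via a Hadamard product."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              stride=1, padding=kernel_size // 2)

    def forward(self, m_c):                              # M_C: B x 1024 x H x W
        m_max = m_c.max(dim=1, keepdim=True).values      # M_max: B x 1 x H x W
        m_avg = m_c.mean(dim=1, keepdim=True)            # M_avg: B x 1 x H x W
        m_norm = torch.sigmoid(self.conv(torch.cat([m_max, m_avg], dim=1)))
        return m_c * m_norm                              # M_S via broadcast Hadamard product
```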
5.3) The spatial attention feature M_S is split into 4 feature maps, and each is down-sampled to the size of the corresponding backbone convolution block output C_i;
5.4) The down-sampled feature maps are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, raising their channel counts to 256, 512, 1024 and 2048 respectively, to obtain the feature maps M_i to be fused with the backbone; each M_i is then added to the corresponding backbone convolution block output C_i, completing the feature fusion.
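Steps 5.3) and 5.4) can be sketched as follows; the module name is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackFusion(nn.Module):
    """Steps 5.3-5.4: split the spatial-attention output into 4 maps,
    resize each to the matching backbone feature, restore its channel
    count with a 1x1 conv, and add it to that backbone feature."""
    def __init__(self, out_channels=(256, 512, 1024, 2048), split=256):
        super().__init__()
        self.split = split
        self.convs = nn.ModuleList(
            nn.Conv2d(split, c, kernel_size=1, stride=1)
            for c in out_channels)

    def forward(self, m_s, backbone_feats):              # M_S: B x 1024 x 200 x 200
        parts = torch.split(m_s, self.split, dim=1)      # four B x 256 x 200 x 200 maps
        fused = []
        for part, conv, c in zip(parts, self.convs, backbone_feats):
            part = F.interpolate(part, size=c.shape[-2:], mode='nearest')
            fused.append(c + conv(part))                 # element-wise addition
        return fused                                     # feedback-fused C2..C5
```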
Step 6: Complete the multi-scale target detection.
6.1) Steps 3 to 5 are repeated twice to obtain the final multi-scale features P2', P3', P4' and P5';
6.2) The multi-scale features P2', P3', P4' and P5' are input into an existing detection head network, which outputs the predicted target position parameters (x, y, w, h) and the confidence c of the target's category, where (x, y) is the coordinate of the upper-left corner of the target bounding box in the image, w is the width of the bounding box and h is its height, completing the multi-scale target detection.
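Putting the sketches above together, the joint recursion of steps 3 to 6 can be outlined as below. How the final pyramid is formed after the last feedback pass is one reasonable reading of step 6, not a statement from the patent:

```python
import torch

def joint_recursive_pyramid(backbone, fpn, channel_att, spatial_att,
                            fusion, image, unrolls=3):
    """Steps 3-5 run once and are then repeated twice (unrolls=3);
    the final pyramid is rebuilt from the last fused backbone features."""
    _, c2, c3, c4, c5 = backbone(image)
    feats = [c2, c3, c4, c5]
    for _ in range(unrolls):
        pyramid = fpn(*feats)                    # step 3
        m_s = spatial_att(channel_att(pyramid))  # steps 5.1-5.2
        feats = fusion(m_s, feats)               # steps 5.3-5.4
    return fpn(*feats)                           # P2', P3', P4', P5'

# Example wiring (batch size 2 so the BatchNorm1d layer can run in
# training mode); shapes assume an 800 x 800 input.
p2, p3, p4, p5 = joint_recursive_pyramid(
    ResNetBackbone(), FeaturePyramid(), ChannelAttention(),
    SpatialAttention(), FeedbackFusion(), torch.randn(2, 3, 800, 800))
```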
The effect of the present invention will be further described with reference to simulation experiments.
1. The experimental conditions are as follows:
the computer processor is Intel (R) Core (TM) i7 CPU @3.5GHz, the running memory is 128G, and the video card is an NVIDIA TITAN X GPU with the video memory of 12 GB.
The operating system was 64-bit Ubuntu 18.04 (LTS), and the deep learning framework used was PyTorch (version 1.8.0).
All networks are trained with the backpropagation algorithm to compute the residuals of each layer, and the network parameters are updated with stochastic gradient descent with momentum and weight decay, where the momentum term is 0.9 and the weight decay term is 0.0001.
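The optimizer settings translate directly into PyTorch; the learning rate is not stated in the text and is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the assembled detector network

# Momentum 0.9 and weight decay 0.0001 as quoted above; lr=0.02 is an
# assumption, the text does not state the learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=0.0001)
```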
The experiments are evaluated on the HRSC2016 optical remote sensing ship detection database, the self-built database HRSC2016-MS and the DIOR large-scale optical remote sensing target detection database. The evaluation indexes are mAP, AP_S, AP_M and AP_L, where mAP is the mean average precision at a 50% intersection-over-union threshold, AP_S is the average precision for targets smaller than 32 × 32, AP_M for targets of size at least 32 × 32 but smaller than 96 × 96, and AP_L for targets larger than 96 × 96.
The HRSC2016 database is currently the only open-source optical remote sensing ship detection database. It comprises 1,070 optical remote sensing images with spatial resolutions between 2 m and 0.4 m; image sizes range from 300 × 300 to 1500 × 900, mostly larger than 1000 × 1000, and the database contains 2,976 ship instances.
The self-built database HRSC2016-MS is an optical remote sensing ship detection database obtained by expanding and re-annotating the HRSC2016 database; it comprises 1,680 optical remote sensing images containing 7,655 ship instances.
The DIOR database is currently one of the largest optical remote sensing target detection databases, comprising 23,463 optical remote sensing images covering 192,472 target instances in 20 target categories.
2. The experimental contents are as follows:
experiment 1: the ship targets in the HRSC2016 and HRSC2016-MS databases were tested by the method of the present invention and the 13 existing methods under the above experimental conditions, and the test results are shown in table 2.
Table 2: Detection results of the invention and 13 existing methods on the HRSC2016 and HRSC2016-MS databases
(Table 2 is reproduced only as images in the original document.)
The 13 existing methods in Table 2 are:
SSD: a single-stage multi-bounding-box target detection algorithm proposed by Liu et al.;
YOLOF: a target detection algorithm based on a single-level feature map, proposed by Chen et al.;
RetinaNet: a single-stage target detection algorithm based on the focal loss, proposed by Lin et al.;
NAS-FPN: a target detection algorithm whose pyramid feature structure is found by a neural architecture search algorithm within a given search space, proposed by Ghiasi et al.;
FCOS: a fully convolutional single-stage target detection algorithm proposed by Tian et al.;
PANet: a two-stage target detection algorithm based on a path-aggregation feature pyramid, proposed by Liu et al.;
Faster R-CNN: a real-time two-stage target detection algorithm based on a region proposal network, proposed by Ren et al.;
Mask R-CNN: an algorithm by He et al. that adds a mask prediction branch to Faster R-CNN for target instance segmentation and target detection;
Cascade R-CNN: a two-stage target detection algorithm based on a cascade structure, proposed by Cai et al.;
DetectoRS: a target detection algorithm based on a recursive feature pyramid structure, proposed by Qiao et al.;
Libra R-CNN: a target detection algorithm based on IoU-balanced sampling, a balanced feature pyramid and a balanced L1 loss function, proposed by Pang et al.;
YOLOX: a high-performance single-stage fast target detection algorithm by Ge et al. that fuses many design techniques;
HTC: a hybrid task cascade model for target detection and target instance segmentation proposed by Chen et al. on the basis of Mask R-CNN and Cascade R-CNN.
Subjective ship detection results of the method of the invention on the HRSC2016-MS database are shown in Fig. 4: small, medium and large multi-scale ship targets in the optical remote sensing images are detected accurately and the corresponding bounding boxes are obtained.
From the subjective results in Fig. 4 and the objective results in Table 2, the method of the invention achieves the best detection results on both the HRSC2016 and HRSC2016-MS databases, proving its effectiveness.
Experiment 2: under the above conditions, the joint recursive feature pyramid structure proposed by the invention and 5 existing feature pyramid structures are each combined, as the neck structure, with a baseline method (an HTC model with the neck structure and semantic prediction branch removed) and compared on the HRSC2016-MS database; the results are shown in Table 3.
Table 3: Comparison of the joint recursive feature pyramid of the invention with 5 existing feature pyramid structures on the HRSC2016-MS database
(Table 3 is reproduced only as an image in the original document.)
The methods in Table 3 are as follows:
Baseline: the baseline method, specifically an HTC model with the neck structure and semantic prediction branch removed;
Baseline + FPN: the baseline method combined with the traditional feature pyramid as the neck structure;
Baseline + PAFPN: the baseline method combined with the path-aggregation feature pyramid as the neck structure;
Baseline + BFP: the baseline method combined with the balanced feature pyramid as the neck structure;
Baseline + BiFPN: the baseline method combined with the bidirectional feature pyramid as the neck structure;
Baseline + RFP: the baseline method combined with the recursive feature pyramid as the neck structure;
Baseline + JRFP: the baseline method combined with the joint recursive feature pyramid proposed by the invention as the neck structure.
As the results in Table 3 show, the joint recursive feature pyramid proposed by the invention, used as the neck structure, achieves the best detection results on the HRSC2016-MS database across all three scales (small, medium and large), further proving the effectiveness of the method.
Experiment 3: under the above conditions, target detection is performed on the large-scale optical remote sensing database DIOR with the method of the invention and 15 existing methods; the results are shown in Table 4.
Table 4: Detection results of the method of the invention and 15 existing methods on the DIOR database
Method    mAP (%)
R-CNN 37.7
RICNN 44.2
RICAOD 50.9
RIFD-CNN 56.1
SSD 58.6
Faster R-CNN 63.1
Mask R-CNN 63.5
CornerNet 64.9
RetinaNet 65.7
Cascade R-CNN 70.3
YOLOv3 71.0
PANet 71.1
DetectoRS 71.8
HTC 72.6
AFPN 72.6
The method of the invention 76.9
The 7 methods in Table 4 not described above are as follows:
R-CNN: a region-based convolutional neural network target detection algorithm proposed by Girshick et al.;
RICNN: cheng et al propose a high-resolution optical remote sensing image target detection algorithm based on rotation invariant convolution;
RICAOD: a remote sensing image target detection algorithm based on a rotation-insensitive region proposal network and local context feature fusion, proposed by Li et al.;
RIFD-CNN: the remote sensing image target detection algorithm based on rotation invariance and Fisher discriminant convolution is provided by Cheng et al;
CornerNet: an hourglass network-based target detection algorithm proposed by Law et al;
YOLOv3: the third version of the YOLO series of single-stage fast target detection algorithms, proposed by Redmon et al.;
AFPN: cheng et al propose a remote sensing image target detection algorithm based on a perception feature pyramid structure.
As the results in Table 4 show, the method of the invention achieves the best detection results on the large-scale optical remote sensing DIOR database, further proving its effectiveness.

Claims (7)

1. A multi-scale target detection method based on a combined recursive feature pyramid is characterized by comprising the following steps:
(1) Reading data from a target detection database; sequentially resizing, flipping and normalizing the training images, sequentially resizing and normalizing the test images, and setting the normalization means and standard deviations of the three RGB channels, finally obtaining the tensor data corresponding to each image;
(2) Inputting the preprocessed image tensor data from step (1) into a ResNet convolutional neural network comprising 5 serially connected convolution blocks as the backbone network, obtaining the image features extracted by the 5 convolution blocks, denoted C1, C2, C3, C4 and C5 respectively;
(3) Constructing a feature pyramid from the image features extracted by the ResNet convolutional neural network:
3a) The image features C2, C3, C4 and C5 extracted by the ResNet convolutional neural network are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, so that the channel count of C2 stays at 256 while that of C3 is reduced from 512 to 256, that of C4 from 1024 to 256 and that of C5 from 2048 to 256, finally giving the 4 dimension-reduced backbone features C2', C3', C4' and C5';
3b) A top-down feature fusion operation is performed on the dimension-reduced backbone features from step 3a) to form a feature pyramid structure consisting of the pyramid features P2, P3, P4 and P5;
(4) Constructing a joint feedback processor formed by connecting a channel attention module and a spatial attention module in series;
(5) Processing each layer of pyramid features obtained in step (3) with the joint feedback processor to complete feature fusion:
5a) Inputting the 4 layers of pyramid features P2, P3, P4 and P5 into the channel attention module to obtain the channel attention feature M_C;
5b) Inputting the channel attention feature M_C obtained in 5a) into the spatial attention module to obtain the spatial attention feature M_S;
5c) Splitting the spatial attention feature M_S into 4 feature maps and down-sampling each so that it matches the size of the corresponding backbone convolution block output C_i;
5d) Passing the down-sampled feature maps through 4 convolution layers with kernel size 1 × 1 and stride 1, raising their channel counts to 256, 512, 1024 and 2048 respectively, to obtain the feature maps M_i to be fused with the backbone; each M_i is then added to the corresponding backbone convolution block output C_i to complete the feature fusion;
(6) Repeating steps (3) to (5) twice to obtain the final multi-scale features P2', P3', P4' and P5', inputting them into an existing detection head network, and outputting the predicted target position parameters (x, y, w, h) and the confidence c of the target's category, where (x, y) is the coordinate of the upper-left corner of the target bounding box in the image, w is the width of the bounding box and h is its height, thereby completing the multi-scale target detection.
2. The method according to claim 1, wherein in step (1) the images of the training stage and the test stage are sequentially resized, flipped and normalized, and the means and standard deviations of the three RGB channels are set, as follows:
1a) Data preprocessing in a training phase:
the input image is scaled to 800 × 800, and its brightness, contrast, saturation and hue are randomly adjusted with probability 0.5;
the image is then randomly flipped with probability 0.5 and normalized by the mean-standard-deviation method;
the normalization means of the three RGB channels are set to [123.675, 116.28, 103.53] and the standard deviations to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage;
1b) Data preprocessing in a testing stage:
the input image is scaled to 800 × 800, and the image is normalized by the mean-standard-deviation method;
the normalization means of the three RGB channels are set to [123.675, 116.28, 103.53] and the standard deviations to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage.
3. The method of claim 1, wherein the 5 serially connected convolution blocks of the ResNet convolutional neural network in step (2) have the same structure: each convolution block comprises several convolution groups, and each group comprises a convolution layer, a batch normalization layer and a ReLU activation function.
4. The method of claim 1, wherein the top-down feature fusion operation performed in step 3b) on the dimension-reduced backbone features obtained in step 3a) is implemented as follows:
3b1) The highest-level dimension-reduced backbone feature C5' is taken as the highest-level pyramid feature P5; P5 is up-sampled by a factor of 2 and added directly to the second-highest-level dimension-reduced backbone feature C4' to obtain the second-highest-level pyramid feature P4;
3b2) The second-highest-level pyramid feature P4 is up-sampled by a factor of 2 and added directly to the second-lowest-level dimension-reduced backbone feature C3' to obtain the second-lowest-level pyramid feature P3;
3b3) The second-lowest-level pyramid feature P3 is up-sampled by a factor of 2 and added directly to the lowest-level dimension-reduced backbone feature C2' to obtain the lowest-level pyramid feature P2;
3b4) The pyramid features P2, P3, P4 and P5 are arranged from bottom to top to form the feature pyramid structure.
5. The method of claim 1, wherein the channel attention module and the spatial attention module in step (4) are structured as follows:
the channel attention module sequentially comprises up-sampling, feature concatenation, a global average pooling layer, fully connected layers and a Sigmoid function, and is used to extract the channel attention feature;
the spatial attention module sequentially comprises an average pooling layer, a maximum pooling layer, a convolution layer and a Sigmoid function, and is used to extract the spatial attention feature.
6. The method of claim 1, wherein inputting the 4 layers of pyramid features P2, P3, P4 and P5 into the channel attention module in step 5a) to obtain the channel attention feature M_C is implemented as follows:
5a1) The pyramid features P2, P3, P4 and P5 are each up-sampled to obtain the corresponding features X2, X3, X4 and X5, each of size 200 × 200 with 256 channels;
5a2) The up-sampled features X2, X3, X4 and X5 are concatenated into the channel concatenation feature M_cat1, of size 200 × 200 with 1024 channels;
5a3) M_cat1 is compressed by a global average pooling layer into the average-pooled compressed vector V_gap of length 1024;
5a4) V_gap is passed through a group consisting of a fully connected layer, a batch normalization layer and a ReLU activation function, and compressed again into the channel recompression vector V_fc1 of length 256;
5a5) V_fc1 is passed through another fully connected layer that restores the channel count, giving the channel release vector V_fc2 of length 1024;
5a6) A Sigmoid function normalizes the channel release vector V_fc2, giving the normalized vector V_norm of length 1024;
5a7) The channel concatenation feature M_cat1 and the normalized vector V_norm are multiplied channel-wise to obtain the channel attention feature M_C:
M_C = M_cat1 · V_norm
where the channel attention feature M_C has size 200 × 200 and 1024 channels.
7. The method according to claim 1, wherein inputting the channel attention feature M_C into the spatial attention module in step 5b) to obtain the spatial attention feature M_S is implemented as follows:
5b1) The channel attention feature M_C is passed through a maximum pooling layer and an average pooling layer respectively, giving the maximum-pooled feature M_max and the average-pooled feature M_avg, each of size 200 × 200 with 1 channel;
5b2) M_max and M_avg are concatenated into the spatial concatenation feature M_cat2, of size 200 × 200 with 2 channels;
5b3) M_cat2 is passed through a convolution layer with kernel size 7 × 7 and stride 1 to give the new feature M_un, of size 200 × 200 with 1 channel;
5b4) A Sigmoid function normalizes M_un into the normalized feature M_norm, of size 200 × 200 with 1 channel;
5b5) The Hadamard product of the channel attention feature M_C and the normalized feature M_norm gives the spatial attention feature M_S:
M_S = M_C ⊙ M_norm
where ⊙ denotes the Hadamard product and the spatial attention feature M_S has size 200 × 200 and 1024 channels.
CN202211339440.0A 2022-10-29 2022-10-29 Multi-scale target detection method based on combined recursive feature pyramid (Pending)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211339440.0A 2022-10-29 2022-10-29 Multi-scale target detection method based on combined recursive feature pyramid


Publications (1)

Publication Number Publication Date
CN115527095A 2022-12-27

Family

ID=84704563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211339440.0A Pending CN115527095A (en) 2022-10-29 2022-10-29 Multi-scale target detection method based on combined recursive feature pyramid

Country Status (1)

Country Link
CN (1) CN115527095A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797357A (en) * 2023-02-10 2023-03-14 智洋创新科技股份有限公司 Transmission channel hidden danger detection method based on improved YOLOv7
CN115797357B (en) * 2023-02-10 2023-05-16 智洋创新科技股份有限公司 Power transmission channel hidden danger detection method based on improved YOLOv7
CN117876891A (en) * 2023-02-21 2024-04-12 云景技术有限公司 Adaptive aerial photographing target detection method based on multi-scale deep learning
CN116311361A (en) * 2023-03-02 2023-06-23 北京化工大学 Dangerous source indoor staff positioning method based on pixel-level labeling
CN116311361B (en) * 2023-03-02 2023-09-15 北京化工大学 Dangerous source indoor staff positioning method based on pixel-level labeling
CN117523437A (en) * 2023-10-30 2024-02-06 河南送变电建设有限公司 Real-time risk identification method for substation near-electricity operation site
CN117423062A (en) * 2023-11-13 2024-01-19 南通大学 Building site safety helmet detection method based on improved YOLOv5
CN117423062B (en) * 2023-11-13 2024-07-19 南通大学 Construction site safety helmet detection method based on improved YOLOv5
CN117784620A (en) * 2024-02-27 2024-03-29 山东九曲圣基新型建材有限公司 Intelligent parameter adjusting system and method for tailing dry-discharging dehydrator
CN117784620B (en) * 2024-02-27 2024-05-10 山东九曲圣基新型建材有限公司 Intelligent parameter adjusting system and method for tailing dry-discharging dehydrator
CN118015469A (en) * 2024-03-12 2024-05-10 重庆科技大学 Urban and rural junction illegal building detection method and system
CN118015469B (en) * 2024-03-12 2024-09-10 重庆科技大学 Urban and rural junction illegal building detection method and system

Similar Documents

Publication Publication Date Title
CN115527095A (en) Multi-scale target detection method based on combined recursive feature pyramid
US20220067335A1 (en) Method for dim and small object detection based on discriminant feature of video satellite data
Hochuli et al. Handwritten digit segmentation: Is it still necessary?
CN111460927A (en) Method for extracting structured information of house property certificate image
CN114841244B (en) Target detection method based on robust sampling and mixed attention pyramid
Zhan et al. Semi-supervised classification of hyperspectral data based on generative adversarial networks and neighborhood majority voting
CN112580480B (en) Hyperspectral remote sensing image classification method and device
Wang et al. A Convolutional Neural Network‐Based Classification and Decision‐Making Model for Visible Defect Identification of High‐Speed Train Images
CN106228166A (en) The recognition methods of character picture
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
Ali et al. A three-way clustering approach using image enhancement operations
Wang et al. CDFF: a fast and highly accurate method for recognizing traffic signs
Pan et al. Hybrid dilated faster RCNN for object detection
Cao et al. Attentional mechanisms and improved residual networks for diabetic retinopathy severity classification
CN117576009A (en) Improved YOLOv5 s-based high-precision solar panel defect detection method
CN117237599A (en) Image target detection method and device
CN116012686A (en) Improved YOLOv6 target detection method introducing dynamic position loss
Fu et al. Pedestrian detection by feature selected self-similarity features
CN112052881B (en) Hyperspectral image classification model device based on multi-scale near-end feature splicing
Antony et al. Traffic sign recognition using CNN and Res-Net
CN114494827A (en) Small target detection method for detecting aerial picture
Xue et al. EL-YOLO: An efficient and lightweight low-altitude aerial objects detector for onboard applications
Zhang et al. Semantics reused context feature pyramid network for object detection in remote sensing images
Wang et al. EFSSD: An Enhanced Fusion SSD with Feature Fusion and Visual Object Association Method
Wei et al. EDCNet: A Lightweight Object Detection Method Based on Encoding Feature Sharing for Drug Driving Detection

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination