CN115527095A - Multi-scale target detection method based on combined recursive feature pyramid


Info

Publication number
CN115527095A
Authority
CN
China
Prior art keywords
feature
pyramid
channel
features
image
Prior art date
Legal status
Pending
Application number
CN202211339440.0A
Other languages
Chinese (zh)
Inventor
韩冰
陈玮铭
高新波
杨铮
黄晓悦
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
2022-10-29
Filing date
2022-10-29
Publication date
2022-12-27
Application filed by Xidian University
Priority to CN202211339440.0A
Publication of CN115527095A

Classifications

    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06N 3/084: Computing arrangements based on biological models; neural-network learning methods, backpropagation, e.g. using gradient descent
    • G06V 10/20: Image preprocessing
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 2201/07: Target detection


Abstract

The invention discloses a multi-scale target detection method based on a combined recursive feature pyramid, which mainly solves the problem of low multi-scale target detection accuracy in complex scenes in the prior art. The implementation scheme is as follows: 1) read data from a target detection database and preprocess the image data; 2) extract image features using a ResNet convolutional neural network as the backbone network; 3) construct a feature pyramid from the extracted image features; 4) construct a joint feedback processor formed by connecting a channel attention module and a spatial attention module in series; 5) process each layer of pyramid features with the joint feedback processor to complete feature fusion; 6) repeat steps 3) to 5) twice to obtain multi-scale features; 7) input the multi-scale features into an existing detection head to complete multi-scale detection. The invention significantly improves the accuracy of multi-scale target detection in complex scenes and can be used in intelligent transportation, intelligent security and remote sensing image processing.

Description

Multi-scale target detection method based on combined recursive feature pyramid
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a multi-scale target detection method based on a recursive feature pyramid, which can be used in fields such as traffic, security and medical treatment.
Background
Target detection is one of the basic tasks in the field of computer vision. It is widely applied in fields such as traffic, security and medical treatment, and has very high application value. The task of target detection comprises two parts: locating the position of a target in an image and predicting its category. Because targets differ in physical size and in distance from the camera, the scales at which they appear in an image usually vary widely, which degrades detection performance.
In recent years, the problem of multi-scale target detection has received much attention. Existing algorithms construct a feature pyramid: specific layers of the backbone network are output separately, and the pyramid is built by up-sampling and feature fusion to obtain features that are both high-resolution and semantically rich. In addition, some researchers have improved detection by introducing a recursive mechanism into the feature pyramid and switchable atrous convolution into the backbone network.
The traditional feature pyramid structure has large semantic differences between layers: the direct top-down up-sampling and fusion scheme cannot propagate high-level semantic information well to the lower layers, and the highest layer only loses information because there is no higher-level feature to fuse with it, so the multi-scale information extraction capability is insufficient. For this reason, several variants of the feature pyramid structure have been proposed in the prior art.
Because the strategy of building feature maps at different spatial resolutions layer by layer markedly improves a model's detection of targets at different scales, target detection algorithms based on the feature pyramid and its variants are the mainstream approach to multi-scale target detection. Ghiasi et al. used an automatic search algorithm, with the feature maps to be fused as the search space, to find a feature pyramid structure. However, structures found by automatic search tend to be highly dataset-dependent, often performing well on some datasets but only moderately on others. Qiao et al. first introduced a recursive mechanism into the target detection task, proposed a recursive feature pyramid structure, and designed a switchable atrous convolution for the backbone network. However, they neglected the inherent semantic differences between pyramid layers, so the method's performance is not optimal, and the switchable atrous convolution slows inference and occupies much video memory. Guo et al. observed that the highest pyramid layer only loses information, designed a residual feature enhancement module to complement the highest-layer features, and also designed an adaptive spatial fusion module to fuse the pyramid layers, with the fused features used to predict the target category and regress the target position, significantly improving the detector's multi-scale information extraction capability. However, this method simply fuses the layer features before prediction and regression, ignoring the inherent semantic differences between layers, and so cannot achieve the best performance. Liu et al. considered the information propagation path in the conventional feature pyramid too long, and optimized the connection paths so that bottom-level features useful for target localization flow more quickly to the higher layers, improving the detector's multi-scale detection capability. Although this approach optimizes the propagation paths, it still ignores the inherent semantic differences between layers and thus does not reach optimal performance.
Disclosure of Invention
The aim of the invention is to provide a multi-scale target detection method based on a combined recursive feature pyramid that addresses the defects of the prior art by accounting for both the information loss at the highest layer and the inherent semantic differences between layers.
In order to achieve the purpose, the implementation steps of the technical scheme of the invention comprise the following steps:
(1) Reading data from a target detection database; sequentially resizing, flipping and normalizing the training images, sequentially resizing and normalizing the test images, and setting the normalization means and standard deviations of the three RGB channels, finally obtaining the tensor data corresponding to each image;
(2) Inputting the preprocessed image tensor data from step (1) into a ResNet convolutional neural network comprising 5 serially connected convolution blocks as the backbone network, obtaining the image features extracted by the 5 convolution blocks, denoted C1, C2, C3, C4 and C5 respectively;
(3) Constructing a feature pyramid from the image features extracted by the ResNet convolutional neural network:
3a) The image features C2, C3, C4 and C5 extracted by the ResNet convolutional neural network are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, so that the channel count of C2 stays at 256 while that of C3 is reduced from 512 to 256, that of C4 from 1024 to 256 and that of C5 from 2048 to 256, finally giving the 4 dimension-reduced backbone features C2', C3', C4' and C5';
3b) A top-down feature fusion operation is performed on the dimension-reduced backbone features from step 3a) to form a feature pyramid structure consisting of the pyramid features P2, P3, P4 and P5;
(4) Constructing a joint feedback processor formed by connecting a channel attention module and a spatial attention module in series;
(5) Processing each layer of pyramid features obtained in step (3) with the joint feedback processor to complete feature fusion:
5a) Inputting the 4 layers of pyramid features P2, P3, P4 and P5 into the channel attention module to obtain the channel attention feature M_C;
5b) Inputting the channel attention feature M_C obtained in 5a) into the spatial attention module to obtain the spatial attention feature M_S;
5c) Splitting the spatial attention feature M_S into 4 feature maps and down-sampling each so that it matches the size of the corresponding backbone convolution block output C_i;
5d) Passing the down-sampled feature maps through 4 convolution layers with kernel size 1 × 1 and stride 1, raising their channel counts to 256, 512, 1024 and 2048 respectively, to obtain the feature maps M_i to be fused with the backbone; each M_i is then added to the corresponding backbone convolution block output C_i to complete the feature fusion;
(6) Repeating steps (3) to (5) twice to obtain the final multi-scale features P2', P3', P4' and P5', inputting them into an existing detection head network, and outputting the predicted target position parameters (x, y, w, h) and the confidence c of the target's category, where (x, y) is the coordinate of the upper-left corner of the target bounding box in the image, w is the width of the bounding box and h is its height, thereby completing the multi-scale target detection.
Compared with the prior art, the invention has the following advantages:
first, on top of the recursive feature pyramid, the invention introduces a joint feedback processor that uniformly processes the feedback features of the feature pyramid, which both supplements the information flow of the topmost pyramid features and reduces the semantic differences between layers, improving the detector's multi-scale information extraction capability and hence the network's detection performance;
second, the method needs no special convolution operation such as switchable atrous convolution to enlarge the receptive field, so compared with other recursive methods its inference speed is significantly higher.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a joint recursive feature pyramid in accordance with the present invention;
FIG. 3 is a schematic diagram of a joint feedback processor of the present invention;
FIG. 4 is a diagram of simulation results of detection of a ship target in an optical remote sensing image using the present invention.
Detailed Description
Embodiments and effects of the present invention are further described below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this embodiment are as follows:
step 1, reading data of a target detection database, and preprocessing image data.
The target detection database contains training-stage data and test-stage data; the image data of the two stages are preprocessed as follows:
1.1) Data preprocessing in the training phase:
first, the input image is scaled to 800 × 800; then its brightness, contrast, saturation and hue are randomly adjusted with probability 0.5, and the image is randomly flipped with probability 0.5;
the image is then normalized by the mean-standard-deviation method, with the normalization means of the three RGB channels set to [123.675, 116.28, 103.53] and the standard deviations set to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage;
1.2) Data preprocessing in the test phase:
the input image is scaled to 800 × 800;
the image is then normalized by the mean-standard-deviation method, with the normalization means of the three RGB channels set to [123.675, 116.28, 103.53] and the standard deviations set to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage.
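For illustration, the two preprocessing pipelines above can be sketched with torchvision transforms (PyTorch is the framework named in the experimental conditions below). The jitter magnitudes are assumptions; the text fixes only the application probability of 0.5:

```python
import torchvision.transforms as T

# The means/stds above are on the 0-255 scale; ToTensor() rescales pixels
# to [0, 1], so the statistics are divided by 255 to stay consistent.
MEAN = [123.675 / 255, 116.28 / 255, 103.53 / 255]
STD = [58.395 / 255, 57.12 / 255, 57.375 / 255]

train_transform = T.Compose([
    T.Resize((800, 800)),
    # Brightness/contrast/saturation/hue jitter applied with probability 0.5;
    # the magnitudes (0.4, 0.4, 0.4, 0.1) are assumptions, not from the patent.
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])

test_transform = T.Compose([
    T.Resize((800, 800)),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])
```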
Step 2: Extract image features using a ResNet convolutional neural network as the backbone network.
The ResNet convolutional neural network has 5 serially connected convolution blocks; each convolution block comprises several convolution groups, and each group comprises a convolution layer, a batch normalization layer and a ReLU activation function. The backbone network used by the invention comes in three versions: ResNet-50, ResNet-101 and ResNet-152. The image tensor preprocessed in step 1 is input into the ResNet convolutional neural network to extract image features, and the features extracted by the 5 convolution blocks are denoted C1, C2, C3, C4 and C5 respectively. The backbone structure and the extracted image features are shown in Table 1.
Table 1: ResNet convolutional neural network structure and extracted image features
(Table 1 is reproduced only as an image in the original document.)
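One way to expose the five per-block outputs C1 to C5 from a standard torchvision ResNet-50 is sketched below; the wrapper class is an illustrative assumption, not part of the patent:

```python
import torch
import torch.nn as nn
import torchvision

class ResNetBackbone(nn.Module):
    """Returns the outputs of the 5 convolution blocks as C1..C5."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(pretrained=False)  # True loads ImageNet weights
        self.block1 = nn.Sequential(r.conv1, r.bn1, r.relu)  # C1: 64 ch, stride 2
        self.pool = r.maxpool
        self.block2 = r.layer1   # C2: 256 channels,  stride 4
        self.block3 = r.layer2   # C3: 512 channels,  stride 8
        self.block4 = r.layer3   # C4: 1024 channels, stride 16
        self.block5 = r.layer4   # C5: 2048 channels, stride 32

    def forward(self, x):
        c1 = self.block1(x)
        c2 = self.block2(self.pool(c1))
        c3 = self.block3(c2)
        c4 = self.block4(c3)
        c5 = self.block5(c4)
        return c1, c2, c3, c4, c5

# For an 800 x 800 input, C2..C5 are 200x200, 100x100, 50x50 and 25x25.
c1, c2, c3, c4, c5 = ResNetBackbone()(torch.randn(1, 3, 800, 800))
```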
Step 3: Construct a feature pyramid from the image features extracted by the ResNet convolutional neural network.
Referring to fig. 2, the specific implementation of this step is as follows:
3.1) The image features C2, C3, C4 and C5 extracted by the ResNet convolutional neural network are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, so that the channel count of C2 stays at 256 while that of C3 is reduced from 512 to 256, that of C4 from 1024 to 256 and that of C5 from 2048 to 256, finally giving the 4 dimension-reduced backbone features C2', C3', C4' and C5';
3.2) A top-down feature fusion operation is performed on the dimension-reduced backbone features obtained in 3.1):
3.2.1) The highest-level dimension-reduced backbone feature is taken as the highest-level pyramid feature P5; P5 is up-sampled by a factor of 2 and added directly to the second-highest-level dimension-reduced backbone feature to obtain the second-highest-level pyramid feature P4;
3.2.2) The second-highest-level pyramid feature P4 is up-sampled by a factor of 2 and added directly to the second-lowest-level dimension-reduced backbone feature to obtain the second-lowest-level pyramid feature P3;
3.2.3) The second-lowest-level pyramid feature P3 is up-sampled by a factor of 2 and added directly to the lowest-level dimension-reduced backbone feature to obtain the lowest-level pyramid feature P2;
3.3) The pyramid features P2, P3, P4 and P5 are arranged from bottom to top to form the feature pyramid structure.
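Steps 3.1) to 3.3) correspond to the standard feature pyramid construction and can be sketched as follows, under the same 256-channel convention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Lateral 1x1 convs to 256 channels plus top-down 2x up-sampling
    with element-wise addition (steps 3.1-3.3)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1, stride=1)
            for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        l2, l3, l4, l5 = [lat(c) for lat, c in
                          zip(self.laterals, (c2, c3, c4, c5))]
        p5 = l5                                                # highest level
        p4 = l4 + F.interpolate(p5, scale_factor=2, mode='nearest')
        p3 = l3 + F.interpolate(p4, scale_factor=2, mode='nearest')
        p2 = l2 + F.interpolate(p3, scale_factor=2, mode='nearest')
        return p2, p3, p4, p5
```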
Step 4: Construct the joint feedback processor.
4.1) A channel attention module is selected that sequentially comprises up-sampling, feature concatenation, a global average pooling layer, fully connected layers and a Sigmoid function, used to extract the channel attention feature, where the Sigmoid function is:
Sigmoid(x) = 1 / (1 + e^(-x))
4.2) A spatial attention module is selected that sequentially comprises an average pooling layer, a maximum pooling layer, a convolution layer and a Sigmoid function, used to extract the spatial attention feature;
4.3) The channel attention module and the spatial attention module are connected in series to form the joint feedback processor.
Step 5: Process each layer of pyramid features obtained in step 3 with the joint feedback processor to complete feature fusion.
Referring to fig. 3, the specific implementation of this step is as follows:
5.1) The 4 layers of pyramid features P2, P3, P4 and P5 are input into the channel attention module to obtain the channel attention feature M_C:
5.1.1) The pyramid features P2, P3, P4 and P5 are each up-sampled to obtain the corresponding features X2, X3, X4 and X5, each of size 200 × 200 with 256 channels;
5.1.2) The up-sampled features X2, X3, X4 and X5 are concatenated into the channel concatenation feature M_cat1, of size 200 × 200 with 1024 channels;
5.1.3) M_cat1 is compressed by a global average pooling layer into the average-pooled compressed vector V_gap of length 1024;
5.1.4) V_gap is passed through a group consisting of a fully connected layer, a batch normalization layer and a ReLU activation function, and compressed again into the channel recompression vector V_fc1 of length 256;
5.1.5) V_fc1 is passed through another fully connected layer that restores the channel count, giving the channel release vector V_fc2 of length 1024;
5.1.6) A Sigmoid function normalizes the channel release vector V_fc2, giving the normalized vector V_norm of length 1024;
5.1.7) The channel concatenation feature M_cat1 and the normalized vector V_norm are multiplied channel-wise to obtain the channel attention feature M_C:
M_C = M_cat1 · V_norm
where the channel attention feature M_C has size 200 × 200 and 1024 channels.
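A sketch of the channel attention module of steps 5.1.1) to 5.1.7); the class and variable names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention of the joint feedback processor (step 5.1):
    up-sample all pyramid levels to a common size, concatenate them,
    then squeeze-and-excite over the 1024 concatenated channels."""
    def __init__(self, channels=256, levels=4):
        super().__init__()
        total = channels * levels                       # 1024
        self.fc1 = nn.Linear(total, channels)           # 1024 -> 256 (V_fc1)
        self.bn = nn.BatchNorm1d(channels)
        self.fc2 = nn.Linear(channels, total)           # 256 -> 1024 (V_fc2)

    def forward(self, pyramid):                         # [P2, P3, P4, P5]
        size = pyramid[0].shape[-2:]                    # e.g. 200 x 200
        ups = [F.interpolate(p, size=size, mode='nearest') for p in pyramid]
        m_cat = torch.cat(ups, dim=1)                   # M_cat1: B x 1024 x H x W
        v = F.adaptive_avg_pool2d(m_cat, 1).flatten(1)  # V_gap:  B x 1024
        v = F.relu(self.bn(self.fc1(v)))                # V_fc1:  B x 256
        v = torch.sigmoid(self.fc2(v))                  # V_norm: B x 1024
        return m_cat * v[:, :, None, None]              # M_C, channel-wise reweighting
```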
5.2) The channel attention feature M_C obtained in 5.1) is input into the spatial attention module to obtain the spatial attention feature M_S:
5.2.1) The channel attention feature M_C is passed through a maximum pooling layer and an average pooling layer respectively, giving the maximum-pooled feature M_max and the average-pooled feature M_avg, each of size 200 × 200 with 1 channel;
5.2.2) M_max and M_avg are concatenated into the spatial concatenation feature M_cat2, of size 200 × 200 with 2 channels;
5.2.3) M_cat2 is passed through a convolution layer with kernel size 7 × 7 and stride 1 to give the new feature M_un, of size 200 × 200 with 1 channel;
5.2.4) A Sigmoid function normalizes M_un into the normalized feature M_norm, of size 200 × 200 with 1 channel;
5.2.5) The Hadamard product of the channel attention feature M_C and the normalized feature M_norm gives the spatial attention feature M_S:
M_S = M_C ⊙ M_norm
where ⊙ denotes the Hadamard product and the spatial attention feature M_S has size 200 × 200 and 1024 channels;
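A sketch of the spatial attention module of steps 5.2.1) to 5.2.5); the padding of 3 is an assumption implied by the 7 × 7 kernel and the unchanged 200 × 200 output size:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of the joint feedback processor (step 5.2):
    per-pixel max and mean over channels, a 7x7 conv, and a sigmoid
    gate applied to the input via a Hadamard product."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              stride=1, padding=kernel_size // 2)

    def forward(self, m_c):                              # M_C: B x 1024 x H x W
        m_max = m_c.max(dim=1, keepdim=True).values      # M_max: B x 1 x H x W
        m_avg = m_c.mean(dim=1, keepdim=True)            # M_avg: B x 1 x H x W
        m_norm = torch.sigmoid(self.conv(torch.cat([m_max, m_avg], dim=1)))
        return m_c * m_norm                              # M_S via broadcast Hadamard product
```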
5.3) The spatial attention feature M_S is split into 4 feature maps, and each is down-sampled to the size of the corresponding backbone convolution block output C_i;
5.4) The down-sampled feature maps are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, raising their channel counts to 256, 512, 1024 and 2048 respectively, to obtain the feature maps M_i to be fused with the backbone; each M_i is then added to the corresponding backbone convolution block output C_i, completing the feature fusion.
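Steps 5.3) and 5.4) can be sketched as follows; the module name is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackFusion(nn.Module):
    """Steps 5.3-5.4: split the spatial-attention output into 4 maps,
    resize each to the matching backbone feature, restore its channel
    count with a 1x1 conv, and add it to that backbone feature."""
    def __init__(self, out_channels=(256, 512, 1024, 2048), split=256):
        super().__init__()
        self.split = split
        self.convs = nn.ModuleList(
            nn.Conv2d(split, c, kernel_size=1, stride=1)
            for c in out_channels)

    def forward(self, m_s, backbone_feats):              # M_S: B x 1024 x 200 x 200
        parts = torch.split(m_s, self.split, dim=1)      # four B x 256 x 200 x 200 maps
        fused = []
        for part, conv, c in zip(parts, self.convs, backbone_feats):
            part = F.interpolate(part, size=c.shape[-2:], mode='nearest')
            fused.append(c + conv(part))                 # element-wise addition
        return fused                                     # feedback-fused C2..C5
```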
Step 6: Complete the multi-scale target detection.
6.1) Steps 3 to 5 are repeated twice to obtain the final multi-scale features P2', P3', P4' and P5';
6.2) The multi-scale features P2', P3', P4' and P5' are input into an existing detection head network, which outputs the predicted target position parameters (x, y, w, h) and the confidence c of the target's category, where (x, y) is the coordinate of the upper-left corner of the target bounding box in the image, w is the width of the bounding box and h is its height, completing the multi-scale target detection.
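Putting the sketches above together, the joint recursion of steps 3 to 6 can be outlined as below. How the final pyramid is formed after the last feedback pass is one reasonable reading of step 6, not a statement from the patent:

```python
import torch

def joint_recursive_pyramid(backbone, fpn, channel_att, spatial_att,
                            fusion, image, unrolls=3):
    """Steps 3-5 run once and are then repeated twice (unrolls=3);
    the final pyramid is rebuilt from the last fused backbone features."""
    _, c2, c3, c4, c5 = backbone(image)
    feats = [c2, c3, c4, c5]
    for _ in range(unrolls):
        pyramid = fpn(*feats)                    # step 3
        m_s = spatial_att(channel_att(pyramid))  # steps 5.1-5.2
        feats = fusion(m_s, feats)               # steps 5.3-5.4
    return fpn(*feats)                           # P2', P3', P4', P5'

# Example wiring (batch size 2 so the BatchNorm1d layer can run in
# training mode); shapes assume an 800 x 800 input.
p2, p3, p4, p5 = joint_recursive_pyramid(
    ResNetBackbone(), FeaturePyramid(), ChannelAttention(),
    SpatialAttention(), FeedbackFusion(), torch.randn(2, 3, 800, 800))
```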
The effect of the present invention will be further described with reference to simulation experiments.
1. The experimental conditions are as follows:
the computer processor is Intel (R) Core (TM) i7 CPU @3.5GHz, the running memory is 128G, and the video card is an NVIDIA TITAN X GPU with the video memory of 12 GB.
The operating system was 64-bit Ubuntu 18.04 (LTS), and the deep learning framework used was PyTorch (version 1.8.0).
All networks are trained with the backpropagation algorithm to compute the residuals of each layer, and the network parameters are updated with stochastic gradient descent with momentum and weight decay, where the momentum term is 0.9 and the weight decay term is 0.0001.
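The optimizer settings translate directly into PyTorch; the learning rate is not stated in the text and is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the assembled detector network

# Momentum 0.9 and weight decay 0.0001 as quoted above; lr=0.02 is an
# assumption, the text does not state the learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=0.0001)
```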
The experiments are evaluated on the HRSC2016 optical remote sensing ship detection database, the self-built database HRSC2016-MS and the DIOR large-scale optical remote sensing target detection database. The evaluation indexes are mAP, AP_S, AP_M and AP_L, where mAP is the mean average precision at a 50% intersection-over-union threshold, AP_S is the average precision for targets smaller than 32 × 32, AP_M for targets of size at least 32 × 32 but smaller than 96 × 96, and AP_L for targets larger than 96 × 96.
The HRSC2016 database is currently the only open-source optical remote sensing ship detection database. It comprises 1,070 optical remote sensing images with spatial resolutions between 2 m and 0.4 m; image sizes range from 300 × 300 to 1500 × 900, mostly larger than 1000 × 1000, and the database contains 2,976 ship instances.
The self-built database HRSC2016-MS is an optical remote sensing ship detection database obtained by expanding and re-annotating the HRSC2016 database; it comprises 1,680 optical remote sensing images containing 7,655 ship instances.
The DIOR database is currently one of the largest optical remote sensing target detection databases, comprising 23,463 optical remote sensing images covering 192,472 target instances in 20 target categories.
2. The experimental contents are as follows:
experiment 1: the ship targets in the HRSC2016 and HRSC2016-MS databases were tested by the method of the present invention and the 13 existing methods under the above experimental conditions, and the test results are shown in table 2.
Table 2: Detection results of the invention and 13 existing methods on the HRSC2016 and HRSC2016-MS databases
(Table 2 is reproduced only as images in the original document.)
The 13 existing methods in Table 2 are:
SSD: a single-stage multi-bounding-box target detection algorithm proposed by Liu et al.;
YOLOF: a target detection algorithm based on a single-level feature map, proposed by Chen et al.;
RetinaNet: a single-stage target detection algorithm based on the focal loss, proposed by Lin et al.;
NAS-FPN: a target detection algorithm whose pyramid feature structure is found by a neural architecture search algorithm within a given search space, proposed by Ghiasi et al.;
FCOS: a fully convolutional single-stage target detection algorithm proposed by Tian et al.;
PANet: a two-stage target detection algorithm based on a path-aggregation feature pyramid, proposed by Liu et al.;
Faster R-CNN: a real-time two-stage target detection algorithm based on a region proposal network, proposed by Ren et al.;
Mask R-CNN: an algorithm by He et al. that adds a mask prediction branch to Faster R-CNN for target instance segmentation and target detection;
Cascade R-CNN: a two-stage target detection algorithm based on a cascade structure, proposed by Cai et al.;
DetectoRS: a target detection algorithm based on a recursive feature pyramid structure, proposed by Qiao et al.;
Libra R-CNN: a target detection algorithm based on IoU-balanced sampling, a balanced feature pyramid and a balanced L1 loss function, proposed by Pang et al.;
YOLOX: a high-performance single-stage fast target detection algorithm by Ge et al. that fuses many design techniques;
HTC: a hybrid task cascade model for target detection and target instance segmentation proposed by Chen et al. on the basis of Mask R-CNN and Cascade R-CNN.
Subjective ship detection results of the method of the invention on the HRSC2016-MS database are shown in Fig. 4: small, medium and large multi-scale ship targets in the optical remote sensing images are detected accurately and the corresponding bounding boxes are obtained.
From the subjective results in Fig. 4 and the objective results in Table 2, the method of the invention achieves the best detection results on both the HRSC2016 and HRSC2016-MS databases, proving its effectiveness.
Experiment 2: under the above conditions, the joint recursive feature pyramid structure proposed by the invention and 5 existing feature pyramid structures are each combined, as the neck structure, with a baseline method (an HTC model with the neck structure and semantic prediction branch removed) and compared on the HRSC2016-MS database; the results are shown in Table 3.
Table 3: Comparison of the joint recursive feature pyramid of the invention with 5 existing feature pyramid structures on the HRSC2016-MS database
(Table 3 is reproduced only as an image in the original document.)
The methods in Table 3 are as follows:
Baseline: the baseline method, specifically an HTC model with the neck structure and semantic prediction branch removed;
Baseline + FPN: the baseline method combined with the traditional feature pyramid as the neck structure;
Baseline + PAFPN: the baseline method combined with the path-aggregation feature pyramid as the neck structure;
Baseline + BFP: the baseline method combined with the balanced feature pyramid as the neck structure;
Baseline + BiFPN: the baseline method combined with the bidirectional feature pyramid as the neck structure;
Baseline + RFP: the baseline method combined with the recursive feature pyramid as the neck structure;
Baseline + JRFP: the baseline method combined with the joint recursive feature pyramid proposed by the invention as the neck structure.
As the results in Table 3 show, the joint recursive feature pyramid proposed by the invention, used as the neck structure, achieves the best detection results on the HRSC2016-MS database across all three scales (small, medium and large), further proving the effectiveness of the method.
Experiment 3: under the above conditions, target detection is performed on the large-scale optical remote sensing database DIOR with the method of the invention and 15 existing methods; the results are shown in Table 4.
Table 4: Detection results of the method of the invention and 15 existing methods on the DIOR database
Method    mAP (%)
R-CNN 37.7
RICNN 44.2
RICAOD 50.9
RIFD-CNN 56.1
SSD 58.6
Faster R-CNN 63.1
Mask R-CNN 63.5
CornerNet 64.9
RetinaNet 65.7
Cascade R-CNN 70.3
YOLOv3 71.0
PANet 71.1
DetectoRS 71.8
HTC 72.6
AFPN 72.6
The method of the invention 76.9
The 7 methods in Table 4 not described above are as follows:
R-CNN: a region-based convolutional neural network target detection algorithm proposed by Girshick et al.;
RICNN: cheng et al propose a high-resolution optical remote sensing image target detection algorithm based on rotation invariant convolution;
RICAOD: a remote sensing image target detection algorithm based on a rotation-insensitive region proposal network and local context feature fusion, proposed by Li et al.;
RIFD-CNN: the remote sensing image target detection algorithm based on rotation invariance and Fisher discriminant convolution is provided by Cheng et al;
CornerNet: an hourglass network-based target detection algorithm proposed by Law et al;
YOLOv3: the third version of the YOLO series of single-stage fast target detection algorithms, proposed by Redmon et al.;
AFPN: cheng et al propose a remote sensing image target detection algorithm based on a perception feature pyramid structure.
As the results in Table 4 show, the method of the invention achieves the best detection results on the large-scale optical remote sensing DIOR database, further proving its effectiveness.

Claims (7)

1. A multi-scale target detection method based on a combined recursive feature pyramid is characterized by comprising the following steps:
(1) Reading data from a target detection database; sequentially resizing, flipping and normalizing the training images, sequentially resizing and normalizing the test images, and setting the normalization means and standard deviations of the three RGB channels, finally obtaining the tensor data corresponding to each image;
(2) Inputting the preprocessed image tensor data from step (1) into a ResNet convolutional neural network comprising 5 serially connected convolution blocks as the backbone network, obtaining the image features extracted by the 5 convolution blocks, denoted C1, C2, C3, C4 and C5 respectively;
(3) Constructing a feature pyramid from the image features extracted by the ResNet convolutional neural network:
3a) The image features C2, C3, C4 and C5 extracted by the ResNet convolutional neural network are passed through 4 convolution layers with kernel size 1 × 1 and stride 1, so that the channel count of C2 stays at 256 while that of C3 is reduced from 512 to 256, that of C4 from 1024 to 256 and that of C5 from 2048 to 256, finally giving the 4 dimension-reduced backbone features C2', C3', C4' and C5';
3b) A top-down feature fusion operation is performed on the dimension-reduced backbone features from step 3a) to form a feature pyramid structure consisting of the pyramid features P2, P3, P4 and P5;
(4) Constructing a joint feedback processor formed by connecting a channel attention module and a spatial attention module in series;
(5) Processing each layer of pyramid features obtained in step (3) with the joint feedback processor to complete feature fusion:
5a) Inputting the 4 layers of pyramid features P2, P3, P4 and P5 into the channel attention module to obtain the channel attention feature M_C;
5b) Inputting the channel attention feature M_C obtained in 5a) into the spatial attention module to obtain the spatial attention feature M_S;
5c) Splitting the spatial attention feature M_S into 4 feature maps and down-sampling each so that it matches the size of the corresponding backbone convolution block output C_i;
5d) Passing the down-sampled feature maps through 4 convolution layers with kernel size 1 × 1 and stride 1, raising their channel counts to 256, 512, 1024 and 2048 respectively, to obtain the feature maps M_i to be fused with the backbone; each M_i is then added to the corresponding backbone convolution block output C_i to complete the feature fusion;
(6) Repeating steps (3) to (5) twice to obtain the final multi-scale features P2', P3', P4' and P5', inputting them into an existing detection head network, and outputting the predicted target position parameters (x, y, w, h) and the confidence c of the target's category, where (x, y) is the coordinate of the upper-left corner of the target bounding box in the image, w is the width of the bounding box and h is its height, thereby completing the multi-scale target detection.
2. The method according to claim 1, wherein in step (1) the images of the training stage and the test stage are sequentially resized, flipped and normalized, and the means and standard deviations of the three RGB channels are set, as follows:
1a) Data preprocessing in a training phase:
the input image is scaled to 800 × 800, and its brightness, contrast, saturation and hue are randomly adjusted with probability 0.5;
the image is then randomly flipped with probability 0.5 and normalized by the mean-standard-deviation method;
the normalization means of the three RGB channels are set to [123.675, 116.28, 103.53] and the standard deviations to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage;
1b) Data preprocessing in a testing stage:
the input image is scaled to 800 × 800, and the image is normalized by the mean-standard-deviation method;
the normalization means of the three RGB channels are set to [123.675, 116.28, 103.53] and the standard deviations to [58.395, 57.12, 57.375], finally giving the tensor data corresponding to the image at this stage.
3. The method of claim 1, wherein the 5 serially connected convolution blocks of the ResNet convolutional neural network in step (2) have the same structure: each convolution block comprises several convolution groups, and each group comprises a convolution layer, a batch normalization layer and a ReLU activation function.
4. The method of claim 1, wherein the top-down feature fusion operation performed in step 3b) on the dimension-reduced backbone features obtained in step 3a) is implemented as follows:
3b1) The highest-level dimension-reduced backbone feature C5' is taken as the highest-level pyramid feature P5; P5 is up-sampled by a factor of 2 and added directly to the second-highest-level dimension-reduced backbone feature C4' to obtain the second-highest-level pyramid feature P4;
3b2) The second-highest-level pyramid feature P4 is up-sampled by a factor of 2 and added directly to the second-lowest-level dimension-reduced backbone feature C3' to obtain the second-lowest-level pyramid feature P3;
3b3) The second-lowest-level pyramid feature P3 is up-sampled by a factor of 2 and added directly to the lowest-level dimension-reduced backbone feature C2' to obtain the lowest-level pyramid feature P2;
3b4) The pyramid features P2, P3, P4 and P5 are arranged from bottom to top to form the feature pyramid structure.
5. The method of claim 1, wherein the channel attention module and the spatial attention module in step (4) are structured as follows:
the channel attention module sequentially comprises up-sampling, feature concatenation, a global average pooling layer, fully connected layers and a Sigmoid function, and is used to extract the channel attention feature;
the spatial attention module sequentially comprises an average pooling layer, a maximum pooling layer, a convolution layer and a Sigmoid function, and is used to extract the spatial attention feature.
6. The method of claim 1, wherein inputting the 4 layers of pyramid features P2, P3, P4 and P5 into the channel attention module in step 5a) to obtain the channel attention feature M_C is implemented as follows:
5a1) The pyramid features P2, P3, P4 and P5 are each up-sampled to obtain the corresponding features X2, X3, X4 and X5, each of size 200 × 200 with 256 channels;
5a2) The up-sampled features X2, X3, X4 and X5 are concatenated into the channel concatenation feature M_cat1, of size 200 × 200 with 1024 channels;
5a3) M_cat1 is compressed by a global average pooling layer into the average-pooled compressed vector V_gap of length 1024;
5a4) V_gap is passed through a group consisting of a fully connected layer, a batch normalization layer and a ReLU activation function, and compressed again into the channel recompression vector V_fc1 of length 256;
5a5) V_fc1 is passed through another fully connected layer that restores the channel count, giving the channel release vector V_fc2 of length 1024;
5a6) A Sigmoid function normalizes the channel release vector V_fc2, giving the normalized vector V_norm of length 1024;
5a7) The channel concatenation feature M_cat1 and the normalized vector V_norm are multiplied channel-wise to obtain the channel attention feature M_C:
M_C = M_cat1 · V_norm
where the channel attention feature M_C has size 200 × 200 and 1024 channels.
7. The method according to claim 1, wherein inputting the channel attention feature M_C into the spatial attention module in step 5b) to obtain the spatial attention feature M_S is implemented as follows:
5b1) The channel attention feature M_C is passed through a maximum pooling layer and an average pooling layer respectively, giving the maximum-pooled feature M_max and the average-pooled feature M_avg, each of size 200 × 200 with 1 channel;
5b2) M_max and M_avg are concatenated into the spatial concatenation feature M_cat2, of size 200 × 200 with 2 channels;
5b3) M_cat2 is passed through a convolution layer with kernel size 7 × 7 and stride 1 to give the new feature M_un, of size 200 × 200 with 1 channel;
5b4) A Sigmoid function normalizes M_un into the normalized feature M_norm, of size 200 × 200 with 1 channel;
5b5) The Hadamard product of the channel attention feature M_C and the normalized feature M_norm gives the spatial attention feature M_S:
M_S = M_C ⊙ M_norm
where ⊙ denotes the Hadamard product and the spatial attention feature M_S has size 200 × 200 and 1024 channels.
CN202211339440.0A 2022-10-29 2022-10-29 Multi-scale target detection method based on combined recursive feature pyramid (Pending)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211339440.0A 2022-10-29 2022-10-29 Multi-scale target detection method based on combined recursive feature pyramid


Publications (1)

Publication Number Publication Date
CN115527095A 2022-12-27

Family

ID=84704563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211339440.0A Pending CN115527095A (en) 2022-10-29 2022-10-29 Multi-scale target detection method based on combined recursive feature pyramid

Country Status (1)

Country Link
CN (1) CN115527095A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797357A (en) * 2023-02-10 2023-03-14 智洋创新科技股份有限公司 Transmission channel hidden danger detection method based on improved YOLOv7
CN115797357B (en) * 2023-02-10 2023-05-16 智洋创新科技股份有限公司 Power transmission channel hidden danger detection method based on improved YOLOv7
CN117876891A (en) * 2023-02-21 2024-04-12 云景技术有限公司 Adaptive aerial photographing target detection method based on multi-scale deep learning
CN116311361A (en) * 2023-03-02 2023-06-23 北京化工大学 Dangerous source indoor staff positioning method based on pixel-level labeling
CN116311361B (en) * 2023-03-02 2023-09-15 北京化工大学 Dangerous source indoor staff positioning method based on pixel-level labeling
CN117523437A (en) * 2023-10-30 2024-02-06 河南送变电建设有限公司 Real-time risk identification method for substation near-electricity operation site
CN117423062A (en) * 2023-11-13 2024-01-19 南通大学 Building site safety helmet detection method based on improved YOLOv5
CN117423062B (en) * 2023-11-13 2024-07-19 南通大学 Construction site safety helmet detection method based on improved YOLOv5
CN117784620A (en) * 2024-02-27 2024-03-29 山东九曲圣基新型建材有限公司 Intelligent parameter adjusting system and method for tailing dry-discharging dehydrator
CN117784620B (en) * 2024-02-27 2024-05-10 山东九曲圣基新型建材有限公司 Intelligent parameter adjusting system and method for tailing dry-discharging dehydrator
CN118015469A (en) * 2024-03-12 2024-05-10 重庆科技大学 Urban and rural junction illegal building detection method and system
CN118015469B (en) * 2024-03-12 2024-09-10 重庆科技大学 Urban and rural junction illegal building detection method and system

Similar Documents

Publication Publication Date Title
CN115527095A (en) Multi-scale target detection method based on combined recursive feature pyramid
US20220067335A1 (en) Method for dim and small object detection based on discriminant feature of video satellite data
Hochuli et al. Handwritten digit segmentation: Is it still necessary?
CN111460927A (en) Method for extracting structured information of house property certificate image
CN114841244B (en) Target detection method based on robust sampling and mixed attention pyramid
Zhan et al. Semi-supervised classification of hyperspectral data based on generative adversarial networks and neighborhood majority voting
CN112580480B (en) Hyperspectral remote sensing image classification method and device
Wang et al. A Convolutional Neural Network‐Based Classification and Decision‐Making Model for Visible Defect Identification of High‐Speed Train Images
CN106228166A (en) The recognition methods of character picture
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
Ali et al. A three-way clustering approach using image enhancement operations
Wang et al. CDFF: a fast and highly accurate method for recognizing traffic signs
Pan et al. Hybrid dilated faster RCNN for object detection
Cao et al. Attentional mechanisms and improved residual networks for diabetic retinopathy severity classification
CN117576009A (en) Improved YOLOv5 s-based high-precision solar panel defect detection method
CN117237599A (en) Image target detection method and device
CN116012686A (en) Improved YOLOv6 target detection method introducing dynamic position loss
Fu et al. Pedestrian detection by feature selected self-similarity features
CN112052881B (en) Hyperspectral image classification model device based on multi-scale near-end feature splicing
Antony et al. Traffic sign recognition using CNN and Res-Net
CN114494827A (en) Small target detection method for detecting aerial picture
Xue et al. EL-YOLO: An efficient and lightweight low-altitude aerial objects detector for onboard applications
Zhang et al. Semantics reused context feature pyramid network for object detection in remote sensing images
Wang et al. EFSSD: An Enhanced Fusion SSD with Feature Fusion and Visual Object Association Method
Wei et al. EDCNet: A Lightweight Object Detection Method Based on Encoding Feature Sharing for Drug Driving Detection

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination