CN114612709A - Multi-scale target detection method guided by image pyramid characteristics - Google Patents

Multi-scale target detection method guided by image pyramid characteristics

Info

Publication number
CN114612709A
Authority
CN
China
Prior art keywords
features
image
network
convolution
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210185676.7A
Other languages
Chinese (zh)
Inventor
陈苏婷
马文妍
张艳艳
张闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202210185676.7A priority Critical patent/CN114612709A/en
Publication of CN114612709A publication Critical patent/CN114612709A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a multi-scale target detection method guided by image pyramid features, which comprises the following steps: S1, taking a color image as the network input, using an FPN as the target detection framework, and extracting image features with an ordered down-sampling method; S2, taking the same color image as input, and extracting the position information and detail features of each level of the image pyramid with the constructed double-bottleneck convolutional network; S3, inputting the image features of each level extracted in step S2, together with the corresponding deep features of the backbone network, into the constructed hierarchical feature fusion module to fuse the high-resolution weak-semantic features with the low-resolution strong-semantic features; and S4, introducing a reconstructed loss function based on Focal loss to complete target detection. The invention not only strengthens spatial position information but also avoids losing a large amount of detail information during down-sampling, thereby improving the ability of the target detection network to discriminate small targets and adjacent targets.

Description

Multi-scale target detection method guided by image pyramid characteristics
Technical Field
The invention relates to a multi-scale target detection method, in particular to a multi-scale target detection method guided by image pyramid characteristics.
Background
The target detection task is to accurately predict a category and a coordinate position for each target in a natural image. It is widely applied in many fields, from autonomous driving and smart communities to video surveillance, and therefore has great research value. However, targets of different classes may have similar appearances and sizes, while targets of the same class may differ greatly in appearance and size. Backgrounds are complex and varied, and targets occlude one another; these factors make target detection one of the most challenging tasks in computer vision. Traditional target detection methods obtain image features through hand-crafted feature extraction operators, whose weak representation ability and poor generalization limit their further development.
In recent years, the advent of deep learning has greatly promoted the development of computer vision. Target detection algorithms based on convolutional neural networks (CNNs) extract target features from an image through convolution operations, so the network obtains deeper and more discriminative features and better handles complex conditions such as occlusion, deformation and illumination change. Current CNN-based target detection algorithms can be divided into two broad categories: two-stage algorithms based on region proposals, and one-stage algorithms based on regression.
However, target detection algorithms still have many shortcomings, especially in multi-scale target detection. Existing methods cannot reliably detect the same target at different scales, in particular small targets and adjacent targets, whose semantic information vanishes as the network deepens.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a multi-scale target detection method guided by image pyramid features, which improves the discrimination of small targets and adjacent targets, so as to solve the problem that high-level feature maps in multi-scale target detection tend to confuse the features of adjacent targets and to overlook small targets.
The technical scheme is as follows: the multi-scale target detection method comprises the following steps:
S1, taking a color image as the network input, using an FPN with a ResNet-101 backbone network as the target detection framework, and extracting image features with an ordered down-sampling method;
S2, taking the same color image as in step S1 as input, and extracting the position information and detail features of each level of the image pyramid with the constructed double-bottleneck convolutional network;
S3, inputting the image features of each level extracted in step S2, together with the corresponding deep features of the backbone network, into the constructed hierarchical feature fusion module to fuse the high-resolution weak-semantic features with the low-resolution strong-semantic features;
and S4, introducing a reconstructed loss function based on Focal loss, training the multiple tasks, and completing target detection.
In step S1, the ordered down-sampling method is implemented as follows:
S11, sliding a window with a set step length over the feature map of each sampling layer of the convolutional neural network, sorting the values inside the window in ascending order, and extracting the four values in turn to generate four new feature maps; the width and height of each new feature map are half those of the original feature map, and the output of the ordered down-sampling method is:

F_l^j = M_j(F_l), j = 1, 2, 3, 4

where F_l ∈ R^(W×H×D) denotes the feature map of a sampling layer of the convolutional neural network, W, H and D denote its width, height and number of channels, and l is the hierarchical index of the sampling layer; M_j(·) denotes the operation of extracting the j-th value in the sliding window, four values being extracted in turn from each window; F_l^j denotes the j-th new feature map output by the l-th down-sampling layer, four new feature maps being generated for each down-sampling layer;
S12, the four new feature maps are concatenated and fed into a small convolutional network for feature refinement and channel adjustment; the final output feature map F_l' ∈ R^(W'×H'×D') serves as the input of the next layer of the backbone network, where W', H' and D' denote its width, height and number of channels.
In step S2, the double-bottleneck convolutional network is constructed as follows:
S21, defining the input of the double-bottleneck convolutional network as:

I = {I_i}, i = 0, 1, …, n−1

where I_i ∈ R^(H*×W*×3) denotes an image of height H* and width W*, which is also an input image of the target detection model; i is the hierarchical index of both the image pyramid and the backbone network;
S22, inputting the i-th level image of the image pyramid into the double-bottleneck convolutional network, and extracting the surface-layer edge features of the image through a 5×5 convolution kernel followed by a 3×3 convolution kernel;
S23, inputting the extracted edge features into a residual network unit with 2 bottleneck structures to extract detail features, and using a side connection with a 1×1 convolution kernel to pass the precisely localized edge information on to the extracted texture detail features;
each bottleneck structure is composed of 2 1×1 convolution kernels, used respectively for dimensionality reduction and dimensionality expansion of the feature map channels, and 2 3×3 convolution kernels used for learning shallow features;
S24, obtaining a feature map with the same scale as the corresponding backbone network level as the output of the residual network unit;
S25, taking the images of the different scales as input, defining the output of the double-bottleneck convolutional network as:

P_i = h_i(I_i), i = 0, 1, …, n−1
P = {P_0, P_1, …, P_{n−1}}

where P_i denotes the features extracted from the i-th level image of the image pyramid, and P denotes the set of features extracted from all levels of the image pyramid.
In step S3, the hierarchical feature fusion module is a feature fusion module based on element-wise addition; its output is defined as:

O_1i = f_3{ BN[f_1(T(h_i(I_i)))] + BN[f_2(g_i(I_0))] }

where f_1(·) and f_2(·) are two 3×3 convolution units used to parameterize the feature maps, and f_3(·) is a 1×1 convolution unit used as a linear transformation of the feature map; BN[·] is the batch normalization operation on the convolution features; T(·) denotes a bilinear interpolation operation along the channel dimension, used to adjust the channel dimensions of the two different types of features; h(·) and g(·) are the output feature map of the double-bottleneck convolutional network and the feature map of the backbone network, respectively, and i is the hierarchical index of the image pyramid and the backbone network; I_0 and I_i denote the original image and the i-th level image of the image pyramid, respectively.
In step S4, the classification loss function is as follows:

L_cls(p, p*) = −α_t (1−p)^γ log(p),  if p* = 1
L_cls(p, p*) = −(1−α_t) p^γ log(1−p),  if p* = −1

where p and p* are the sample prediction and the sample ground truth, respectively; α_t ∈ (0, 1) is a weight factor introduced for class 1 and 1−α_t is the weight factor introduced for class −1; (1−p)^γ is the modulation factor;
the position regression loss term is expressed as:

L_reg(t, t*) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − t_i*)

where L_reg(t, t*) is expressed by the smoothed L1 loss; t = {x, y, w, h} denotes the predicted bounding box position, where {x, y} are the center coordinates and {w, h} the width and height of the bounding box; t* is the sample label of t;
the target detection loss function is expressed as:

L = (1/N_cls) Σ_w Σ_k L_cls(p_k, p_k*) + λ (1/N_reg) Σ_w Σ_k L_reg(t_k, t_k*)

where N_cls is the number of classified samples and N_reg is the number of regressed samples; w is the number of batches of training images; k is the index of a single sample within each batch of training samples; λ is the loss balance term.
Compared with the prior art, the invention has the following remarkable effects:
1. A double-bottleneck convolutional network is used to extract the shallow features of each level of the multi-scale image pyramid; a hierarchical feature fusion module introduces high-resolution shallow features, rich in detail and position information, into the target detection model; a new ordered down-sampling method preserves the large amount of detail information lost by the original down-sampling; and the reconstructed loss function alleviates the imbalance between foreground and background classification;
2. The invention makes full use of the shallow high-resolution features of the multi-scale image pyramid while avoiding the loss of detail information during down-sampling, thereby improving the discrimination of small and adjacent targets by the target detection network and further improving model performance.
Drawings
FIG. 1 is a schematic diagram of the double-bottleneck feature pyramid network (DBFP-Net) of the present invention;
FIG. 2 is a schematic diagram of the ordered down-sampling method of the present invention;
FIG. 3 is a schematic diagram of an implementation of the ordered down-sampling method of the present invention;
FIG. 4 is a framework diagram of the double-bottleneck convolutional network of the present invention;
FIG. 5(a) is a schematic diagram of the feature fusion module based on element-wise addition in the present invention, (b) of the feature fusion module based on element-wise multiplication, and (c) of the feature fusion module based on feature map concatenation;
FIG. 6 shows partial detection results of DBFP-Net on MS COCO.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
FIG. 1 is a schematic diagram of the double-bottleneck feature pyramid network (DBFP-Net). First, an FPN (Feature Pyramid Network) is used as the object detection framework and ResNet-101 as its backbone to extract deep features; a color image is input, and a new ordered down-sampling feature enhancement method replaces the original down-sampling in the backbone network, retaining all values inside the sliding window during down-sampling and alleviating the loss of detail information in the deep features. Then, a double-bottleneck convolutional network is constructed to extract the image features of each level of the image pyramid, and a hierarchical feature fusion module is constructed to fuse the high-resolution, weak-semantic features extracted by the double-bottleneck convolutional network with the low-resolution, strong-semantic features of the deep layers of the backbone network, so that the shallow features and additional spatial information extracted from the image pyramid are provided to the target detection network. Finally, a reconstructed loss function based on Focal loss is introduced to alleviate the imbalance between foreground and background classification.
The method takes the FPN as the target detection framework and ResNet-101 as the backbone network to extract deep features; a color image is input, and the original down-sampling method in the backbone network is replaced by the new ordered down-sampling feature enhancement method, so that all values inside the sliding window are retained during down-sampling and the loss of detail information in the deep features is alleviated.
(1) Ordered down-sampling method
In the successive down-sampling operations of a CNN (convolutional neural network), max pooling is commonly used, and the newly generated feature map is smaller than the original one, so a large amount of information of the original feature map is lost. This is very disadvantageous for the multi-scale object detection task, which needs abundant feature information to identify objects of many types. Therefore, as shown in FIG. 2, the ordered down-sampling feature enhancement method proposed by the invention slides a 2×2 window with a step length of 2 over the feature map of each sampling layer of the CNN, sorts the values inside the window in ascending order, and extracts the four values in turn, thereby retaining all values in the window. Four new feature maps are finally generated; their width and height are half those of the original feature map, and they lose none of its information. The output of the ordered down-sampling method is defined as:
F_l^j = M_j(F_l), j = 1, 2, 3, 4

where F_l ∈ R^(W×H×D) denotes the feature map of a sampling layer of the convolutional neural network (CNN), W, H and D denote its width, height and number of channels, and l is the hierarchical index of the CNN sampling layer; M_j(·) denotes the operation by which the ordered down-sampling method extracts the j-th value in the sliding window, four values being extracted in turn from each window; F_l^j denotes the j-th new feature map output by the l-th down-sampling layer, four new feature maps being generated for each down-sampling layer.
A specific implementation of the ordered down-sampling method is shown in FIG. 3. The four new feature maps are concatenated and fed into a small convolutional network for feature refinement and channel adjustment; to stay consistent with the output of the original sampling layer of the backbone network, the small convolutional network consists of a 1×1 convolution kernel and a 3×3 convolution kernel. The final output feature map F_l' ∈ R^(W'×H'×D') serves as the input of the next layer of the backbone network, where W', H' and D' denote its width, height and number of channels.
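As an illustration only, the following PyTorch-style sketch implements the ordered down-sampling idea described above. The class name, the channel widths and the exact layers of the small refinement network are assumptions for this sketch; only the sorting-and-splitting scheme follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrderedDownsample(nn.Module):
    """Ordered (sorting) down-sampling over non-overlapping 2x2 windows with stride 2."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Small convolutional network for feature refinement and channel adjustment
        # (described above as one 1x1 and one 3x3 convolution kernel).
        self.refine = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=1),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # x: (N, D, H, W) feature map of a sampling layer; H and W assumed even.
        n, d, h, w = x.shape
        # Gather the four values of every non-overlapping 2x2 window.
        patches = F.unfold(x, kernel_size=2, stride=2)        # (N, D*4, H/2 * W/2)
        patches = patches.view(n, d, 4, h // 2, w // 2)
        # Sort the four values of each window in ascending order.
        ordered, _ = torch.sort(patches, dim=2)
        # The j-th sorted value over all windows forms the j-th new feature map F_l^j.
        new_maps = [ordered[:, :, j] for j in range(4)]        # each (N, D, H/2, W/2)
        # Concatenate the four maps, then refine and adjust channels.
        return self.refine(torch.cat(new_maps, dim=1))         # (N, D', H/2, W/2)
```

In the backbone, such a module would take the place of a stride-2 pooling layer, so that no value of the 2×2 window is discarded.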
(2) Double-bottleneck convolutional network
A double-bottleneck convolutional network is constructed to extract the image features of each level of the image pyramid, yielding feature maps of different scales that are rich in spatial information. Because the double-bottleneck convolutional network is a lightweight network whose parameters are shared across all levels of the image pyramid, the added computational complexity and parameter storage are negligible compared with the backbone network.
FIG. 4 is a framework diagram of the double-bottleneck convolutional network. Its input is a simple multi-scale image pyramid: a set of images of progressively lower resolution, arranged in a pyramid, obtained by successively down-sampling the same input image. The input of the double-bottleneck convolutional network is defined as:

I = {I_i}, i = 0, 1, …, n−1

where I_i ∈ R^(H*×W*×3) denotes an image of resolution H*×W* (height H*, width W*), which is also an input image of the backbone network; i is the hierarchical index of both the image pyramid and the backbone network. First, the i-th level image of the image pyramid is fed into the double-bottleneck convolutional network, and the surface-layer edge features of the image are extracted through a 5×5 convolution kernel followed by a 3×3 convolution kernel. Then, the extracted edge features are fed into a residual network unit with 2 bottleneck structures to extract detail features, and a side connection with a 1×1 convolution kernel transmits the precisely localized edge information to the extracted texture detail features. Each bottleneck structure is composed of 2 1×1 convolution kernels, used respectively for dimensionality reduction and dimensionality expansion of the feature map channels, and 2 3×3 convolution kernels used for learning shallow features. Finally, the output of the residual network unit is a feature map with the same scale as the corresponding backbone network level. Taking the images of different scales as input, the output of the double-bottleneck convolutional network is defined as:

P_i = h_i(I_i), i = 0, 1, …, n−1
P = {P_0, P_1, …, P_{n−1}}

where P_i denotes the features extracted from the i-th level image of the image pyramid, and P denotes the set of features extracted from all levels of the image pyramid.
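The following sketch illustrates one possible reading of this sub-network. Class names, channel widths, activation choices, the ordering of the convolutions inside the bottleneck, and the omission of any strides needed to match each backbone level's resolution are all assumptions, not taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """One bottleneck structure: 1x1 reduce, two 3x3 convs, 1x1 expand, residual add."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1),
        )

    def forward(self, x):
        return F.relu(x + self.body(x))


class DoubleBottleneckNet(nn.Module):
    """Lightweight sub-network shared by all image pyramid levels I_i."""

    def __init__(self, channels=64):
        super().__init__()
        # Surface-layer edge features: a 5x5 followed by a 3x3 convolution kernel.
        self.edge = nn.Sequential(
            nn.Conv2d(3, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Residual unit with two bottleneck structures for texture/detail features.
        self.detail = nn.Sequential(
            Bottleneck(channels, channels // 2),
            Bottleneck(channels, channels // 2),
        )
        # 1x1 side connection carrying the precisely localised edge information.
        self.side = nn.Conv2d(channels, channels, 1)

    def forward(self, image):
        # image: (N, 3, H*, W*), one level of the image pyramid.
        edge = self.edge(image)
        # P_i: detail features combined with the side-connected edge information.
        return self.detail(edge) + self.side(edge)
```

Because the same module weights are applied to every pyramid level, the extra parameter cost stays small regardless of how many levels the pyramid has.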
(3) Hierarchical feature fusion module
A hierarchical feature fusion module is constructed to fuse the high-resolution, weak-semantic features extracted by the double-bottleneck convolutional network with the low-resolution, strong-semantic features of the deep layers of the backbone network; the fused features are passed to the feature map of the corresponding FPN level, so that the shallow features and additional spatial information extracted from the image pyramid are provided to the target detection network. Three fusion methods are designed, as shown in FIG. 5 (a), (b) and (c), and their outputs are collectively defined as:

O_i = Φ_i(h_i(I_i), g_i(I_0))

where h(·) and g(·) are the output feature map of the double-bottleneck convolutional network and the feature map of the backbone network, respectively, i is the hierarchical index of the image pyramid and the backbone network, and O_i is the output of the feature fusion module at the i-th level; I_0 and I_i denote the original image and the i-th level image of the image pyramid, respectively; Φ_i(·,·) is the fusion equation of the feature fusion module. In general, if the image pyramid contains (n−1) down-sampled images, the number of levels of the image pyramid is n.
The feature fusion module based on element-wise addition is shown in FIG. 5(a): the output features h_i(I_i) of the double-bottleneck convolutional network are added to the backbone network features g_i(I_0). Because the double-bottleneck convolutional network shares its parameters across levels, the channel dimensions of these two different types of features must be adjusted; this is done with a bilinear interpolation operation T(·) along the channel dimension. The output O_1i of the element-wise addition is then defined as:

O_1i = f_3{ BN[f_1(T(h_i(I_i)))] + BN[f_2(g_i(I_0))] }

where f_1(·) and f_2(·) are two 3×3 convolution units used to parameterize the feature maps, f_3(·) is a 1×1 convolution unit used as a linear transformation of the feature map, and BN[·] is the batch normalization operation on the convolution features.
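A minimal sketch of this addition-based fusion follows. The names channel_interpolate and AddFusion are hypothetical, the composition of the fusion equation is inferred from the symbol descriptions above, and 1-D linear interpolation over the channel axis is used as one reading of the "channel-dimension bilinear interpolation" T(·).

```python
import torch.nn as nn
import torch.nn.functional as F


def channel_interpolate(x, out_channels):
    """T(.): adjust the channel count of x by interpolating along the channel axis."""
    n, c, h, w = x.shape
    flat = x.flatten(2).transpose(1, 2)                       # (N, H*W, C)
    flat = F.interpolate(flat, size=out_channels, mode="linear", align_corners=False)
    return flat.transpose(1, 2).reshape(n, out_channels, h, w)


class AddFusion(nn.Module):
    """Element-wise addition fusion: O_1i = f3(BN[f1(T(h_i(I_i)))] + BN[f2(g_i(I_0))])."""

    def __init__(self, channels):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, 3, padding=1)   # 3x3 conv, shallow branch
        self.f2 = nn.Conv2d(channels, channels, 3, padding=1)   # 3x3 conv, backbone branch
        self.f3 = nn.Conv2d(channels, channels, 1)              # 1x1 linear transformation
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, shallow, backbone):
        # shallow = h_i(I_i) from the double-bottleneck network,
        # backbone = g_i(I_0); both are assumed to share the same spatial size.
        shallow = channel_interpolate(shallow, backbone.shape[1])
        fused = self.bn1(self.f1(shallow)) + self.bn2(self.f2(backbone))
        return self.f3(fused)                                   # O_1i, fed to the FPN level
```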
The feature fusion module based on element-wise multiplication is shown in FIG. 5(b): the output features h_i(I_i) of the double-bottleneck convolutional network are multiplied with the backbone network features g_i(I_0), the product is refined by a 1×1 convolution kernel, and the refined features are finally added back to the corresponding backbone network features through a side connection. The output O_2i of the element-wise multiplication is defined as:

O_2i = f_4[T(h_i(I_i)) ⊙ g_i(I_0)] + f_5[g_i(I_0)]

where f_4(·) is a 1×1 convolution unit used for feature refinement and f_5(·) is a 1×1 convolution unit used as a linear transformation of the feature map.
The feature fusion module based on feature map concatenation, shown in FIG. 5(c), is similar to the fusion used in U-Net. The invention also defines a feature map concatenation method, whose output O_3i is:

O_3i = concat[T(h_i(I_i)), g_i(I_0)]

where concat(·) denotes the feature map concatenation operation.
The above methods demonstrate the flexibility of the hierarchical feature fusion module. Different fusion methods may be selected as part of the model of the invention.
To compare the effectiveness of three different fusion methods, experiments implemented different feature fusion methods in the same target detection model. The comparison results are shown in table 1.
TABLE 1 hierarchical feature fusion method comparison
(Table 1 is reproduced as an image in the original publication; it reports AP, AP_S, AP_M and AP_L for the three fusion methods.)
The feature fusion method based on element-wise addition obtains 44.7 AP; the methods based on element-wise multiplication and feature map concatenation obtain 42.9 AP and 43.6 AP, respectively. In small target detection, the three feature fusion methods achieve similar results. However, in medium- and large-scale target detection, the model configured with the element-wise addition fusion improves by 4.1% and 3.6%, respectively, over the model configured with the element-wise multiplication fusion. Table 1 shows that the element-wise addition method obtains better detection results than the other two methods under the AP_S, AP_M and AP_L evaluation criteria. Therefore, the invention selects the feature fusion method based on element-wise addition.
(4) Loss function
To alleviate the extreme imbalance between foreground and background, a reconstructed loss function based on Focal loss is introduced. The classification loss function is as follows:

L_cls(p, p*) = −α_t (1−p)^γ log(p),  if p* = 1
L_cls(p, p*) = −(1−α_t) p^γ log(1−p),  if p* = −1

where p and p* are the sample prediction and the sample ground truth, respectively. α_t ∈ (0, 1) is a weight factor introduced for class 1 and 1−α_t is the weight factor introduced for class −1, which adjusts the weighting between foreground and background classification. (1−p)^γ is a modulation factor that automatically down-weights easy examples during training and quickly focuses the model on hard examples; that is, the higher the confidence of a sample's prediction, the smaller its contribution to the overall loss. As in Focal loss, γ is empirically set to 2 and α_t to 0.25.
The position regression loss term is expressed as:

L_reg(t, t*) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − t_i*)

where L_reg(t, t*) is represented by the smoothed L1 loss; t = {x, y, w, h} denotes the predicted bounding box position, where {x, y} are the center coordinates and {w, h} the width and height of the bounding box; t* is the sample label of t.
The target detection loss function used by the invention can be expressed as:

L = (1/N_cls) Σ_w Σ_k L_cls(p_k, p_k*) + λ (1/N_reg) Σ_w Σ_k L_reg(t_k, t_k*)

where N_cls is the number of classified samples and N_reg is the number of regressed samples; w is the number of batches of training images and k is the index of a single sample within each batch of training samples; λ is the loss balance term, empirically set to 2.
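A small sketch of these loss terms is given below. The function names and the exact normalisation by N_cls and N_reg are assumed readings of the description; only γ = 2, α_t = 0.25 and λ = 2 come from the text, and the focal term follows the standard Focal-loss form implied above.

```python
import torch
import torch.nn.functional as F


def focal_classification_loss(p, target, alpha_t=0.25, gamma=2.0):
    """Per-sample focal classification loss: alpha_t weights the positive class
    (p* = 1), 1 - alpha_t weights the negative class (p* = -1)."""
    p = p.clamp(1e-6, 1.0 - 1e-6)
    pos = -alpha_t * (1.0 - p) ** gamma * torch.log(p)          # p* = 1 term
    neg = -(1.0 - alpha_t) * p ** gamma * torch.log(1.0 - p)    # p* = -1 term
    return torch.where(target == 1, pos, neg)


def detection_loss(p, p_star, t, t_star, lam=2.0):
    """Multi-task loss: normalised focal classification term plus a smooth-L1
    box regression term computed on positive samples, balanced by lambda."""
    n_cls = p.numel()
    pos = p_star == 1
    n_reg = max(int(pos.sum()), 1)
    cls_term = focal_classification_loss(p, p_star).sum() / n_cls
    reg_term = F.smooth_l1_loss(t[pos], t_star[pos], reduction="sum") / n_reg
    return cls_term + lam * reg_term
```

Here p would typically be the sigmoid output of the classification head and t the predicted box parameters {x, y, w, h} for the same samples.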
To further validate the performance of the proposed model, the accuracy of DBFP-Net is compared with existing target detection algorithms on the MS COCO dataset, as shown in Table 2.
Table 2 Comparison with existing models on the MS COCO dataset
(Table 2 is reproduced as an image in the original publication; it compares DBFP-Net with the existing models discussed below.)
It can be seen that the lowest AP value comes from the SSD model, only 28.8%. The SSD constructs a feature pyramid structure based on R-FCN and predicts targets at multiple levels without combining features or scores, which improves detection speed and accuracy, but its recall for small targets is still not ideal. The FPN uses an encoding-decoding network structure with lateral connections to associate low-level feature maps across resolutions and semantic levels, provides high-level semantic information to shallow network features, and improves the robustness of the detection model to multi-scale targets. By introducing the shallow features of the multi-scale image pyramid on top of the FPN, the proposed method compensates for the information lost by the backbone network and exceeds the FPN by 8.5% in AP. RetinaNet and the proposed image-pyramid-feature-guided multi-scale target detection model are both based on the FPN framework and use Focal loss as the loss function, but DBFP-Net clearly improves on RetinaNet by 5.6% (44.7 AP vs. 39.1 AP). Compared with M2Det, the best-performing detection model based on the feature pyramid method, the proposed method still leads by 3.7% (44.7 AP vs. 41.0 AP). Compared with the recently popular anchor-free target detection methods, DBFP-Net exceeds CornerNet, FCOS and FSAF by 4.2% (44.7 AP vs. 40.5 AP), 2.6% (44.7 AP vs. 42.1 AP) and 1.8% (44.7 AP vs. 42.9 AP), respectively. In addition, under the AP_S, AP_M and AP_L evaluation criteria, the performance of the invention is also the best among the models in Table 2.
FIG. 6 shows detection results of the invention on the MS COCO dataset. The results show that for multi-scale objects in the dataset, the invention retains more information about the objects and can accurately give their locations and classes. DBFP-Net not only detects extremely small objects, but also easily detects occluded and dense objects. For example, in the second picture of the third row and the fourth picture of the fifth row, the occluded birds are detected well; in the second picture of the second row and the third picture of the fifth row, dense fruits and people are detected accurately. The invention is also highly robust to small objects with fast motion and pixel blur, such as the baseball in the second picture of the first row. In addition, objects with sparse features, such as skateboards and cups, are detected accurately.
In conclusion, compared with the classical FPN, the method of the invention achieves an 8.5% improvement and can effectively detect occluded targets and targets of different sizes.

Claims (5)

1. A multi-scale target detection method guided by image pyramid features is characterized by comprising the following steps:
S1, taking a color image as the network input, using an FPN with a ResNet-101 backbone network as the target detection framework, and extracting image features with an ordered down-sampling method;
S2, taking the same color image as in step S1 as input, and extracting the position information and detail features of each level of the image pyramid with the constructed double-bottleneck convolutional network;
S3, inputting the image features of each level extracted in step S2, together with the corresponding deep features of the backbone network, into the constructed hierarchical feature fusion module to fuse the high-resolution weak-semantic features with the low-resolution strong-semantic features;
and S4, introducing a reconstructed loss function based on Focal loss, training the multiple tasks, and completing target detection.
2. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S1 the ordered down-sampling method is implemented as follows:
S11, sliding a window with a set step length over the feature map of each sampling layer of the convolutional neural network, sorting the values inside the window in ascending order, and extracting the four values in turn to generate four new feature maps; the width and height of each new feature map are half those of the original feature map, and the output of the ordered down-sampling method is:

F_l^j = M_j(F_l), j = 1, 2, 3, 4

where F_l ∈ R^(W×H×D) denotes the feature map of a sampling layer of the convolutional neural network, W, H and D denote its width, height and number of channels, and l is the hierarchical index of the sampling layer; M_j(·) denotes the operation of extracting the j-th value in the sliding window, four values being extracted in turn from each window; F_l^j denotes the j-th new feature map output by the l-th down-sampling layer, four new feature maps being generated for each down-sampling layer;
S12, the four new feature maps are concatenated and fed into a small convolutional network for feature refinement and channel adjustment; the final output feature map F_l' ∈ R^(W'×H'×D') serves as the input of the next layer of the backbone network, where W', H' and D' denote its width, height and number of channels.
3. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S2 the double-bottleneck convolutional network is constructed as follows:
S21, defining the input of the double-bottleneck convolutional network as:

I = {I_i}, i = 0, 1, …, n−1

where I_i ∈ R^(H*×W*×3) denotes an image of height H* and width W*, which is also an input image of the target detection model; i is the hierarchical index of both the image pyramid and the backbone network;
S22, inputting the i-th level image of the image pyramid into the double-bottleneck convolutional network, and extracting the surface-layer edge features of the image through a 5×5 convolution kernel followed by a 3×3 convolution kernel;
S23, inputting the extracted edge features into a residual network unit with 2 bottleneck structures to extract detail features, and using a side connection with a 1×1 convolution kernel to pass the precisely localized edge information on to the extracted texture detail features;
each bottleneck structure is composed of 2 1×1 convolution kernels, used respectively for dimensionality reduction and dimensionality expansion of the feature map channels, and 2 3×3 convolution kernels used for learning shallow features;
S24, obtaining a feature map with the same scale as the corresponding backbone network level as the output of the residual network unit;
S25, taking the images of the different scales as input, defining the output of the double-bottleneck convolutional network as:

P_i = h_i(I_i), i = 0, 1, …, n−1
P = {P_0, P_1, …, P_{n−1}}

where P_i denotes the features extracted from the i-th level image of the image pyramid, and P denotes the set of features extracted from all levels of the image pyramid.
4. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S3 the hierarchical feature fusion module employs a feature fusion module based on element-wise addition; its output is defined as:

O_1i = f_3{ BN[f_1(T(h_i(I_i)))] + BN[f_2(g_i(I_0))] }

where f_1(·) and f_2(·) are two 3×3 convolution units used to parameterize the feature maps, and f_3(·) is a 1×1 convolution unit used as a linear transformation of the feature map; BN[·] is the batch normalization operation on the convolution features; T(·) denotes a bilinear interpolation operation along the channel dimension, used to adjust the channel dimensions of the two different types of features; h(·) and g(·) are the output feature map of the double-bottleneck convolutional network and the feature map of the backbone network, respectively, and i is the hierarchical index of the image pyramid and the backbone network; I_0 and I_i denote the original image and the i-th level image of the image pyramid, respectively.
5. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S4 the classification loss function is as follows:

L_cls(p, p*) = −α_t (1−p)^γ log(p),  if p* = 1
L_cls(p, p*) = −(1−α_t) p^γ log(1−p),  if p* = −1

where p and p* are the sample prediction and the sample ground truth, respectively; α_t ∈ (0, 1) is a weight factor introduced for class 1 and 1−α_t is the weight factor introduced for class −1; (1−p)^γ is the modulation factor;
the position regression loss term is expressed as:

L_reg(t, t*) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − t_i*)

where L_reg(t, t*) is expressed by the smoothed L1 loss; t = {x, y, w, h} denotes the predicted bounding box position, where {x, y} are the center coordinates and {w, h} the width and height of the bounding box; t* is the sample label of t;
the target detection loss function is expressed as:

L = (1/N_cls) Σ_w Σ_k L_cls(p_k, p_k*) + λ (1/N_reg) Σ_w Σ_k L_reg(t_k, t_k*)

where N_cls is the number of classified samples and N_reg is the number of regressed samples; w is the number of batches of training images; k is the index of a single sample within each batch of training samples; λ is the loss balance term.
CN202210185676.7A 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics Pending CN114612709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210185676.7A CN114612709A (en) 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210185676.7A CN114612709A (en) 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics

Publications (1)

Publication Number Publication Date
CN114612709A true CN114612709A (en) 2022-06-10

Family

ID=81860058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210185676.7A Pending CN114612709A (en) 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics

Country Status (1)

Country Link
CN (1) CN114612709A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403115A (en) * 2023-06-07 2023-07-07 江西啄木蜂科技有限公司 Large-format remote sensing image target detection method
CN116403115B (en) * 2023-06-07 2023-08-22 江西啄木蜂科技有限公司 Large-format remote sensing image target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination