CN116051953A - Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid - Google Patents

Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid Download PDF

Info

Publication number
CN116051953A
CN116051953A (application CN202211470248.5A)
Authority
CN
China
Prior art keywords
layer
target detection
small target
network
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211470248.5A
Other languages
Chinese (zh)
Inventor
Wan Jiudi (万久地)
Pan Chunjie (潘纯洁)
Zhang Qianjin (张前进)
Luo Zhengyue (罗正岳)
Jiang Bo (蒋波)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Branch China Tower Co ltd
Original Assignee
Chongqing Branch China Tower Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Branch China Tower Co ltd filed Critical Chongqing Branch China Tower Co ltd
Priority to CN202211470248.5A priority Critical patent/CN116051953A/en
Publication of CN116051953A publication Critical patent/CN116051953A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a small target detection method based on a selectable convolution kernel network and a weighted bidirectional feature pyramid, belonging to the field of deep learning target detection, and specifically comprising the following steps. S1: performing data enhancement on an original image, computing predefined anchor frames by adaptive anchor frame calculation, scaling the images to the same size by adaptive image scaling, and inputting the processed image into a YOLOv5 backbone network that incorporates a spatial-attention-based selectable convolution kernel network; S2: extracting multi-layer features from the input image through the backbone network to obtain features of different layers; S3: performing cross-layer feature fusion on the features of different layers with BiFPN to obtain a plurality of fused features; S4: adding a group of small target detection anchor frames to the YOLOv5 detection layer and carrying out small target detection on the plurality of fused features; S5: training the improved network model and inputting the data set into the trained model to detect small targets.

Description

Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid
Technical Field
The invention belongs to the field of deep learning target detection, and relates to a small target detection method based on a selectable convolution kernel network and a weighted bidirectional feature pyramid.
Background
With the development of deep learning, its application in image recognition has become increasingly wide, and target detection algorithms based on deep learning have become a research hotspot in the image field in recent years. A small target can be defined in both relative and absolute terms: an object whose area ratio in the image is less than 0.1, or, in large data sets, an object smaller than a fixed absolute pixel size (for example, 32×32 pixels in the MS COCO definition).
Deep learning achieves high precision and accuracy on the recognition and detection of large and medium targets, but because small targets occupy few pixels in an image, carry little visual information and are easily affected by environmental factors, the efficiency and precision of small target recognition and detection are far lower than those for large and medium targets.
Moreover, the generalization capability of current small target detection models is weak, and their detection effect is poor.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a small target detection method based on a spatial-attention selectable convolution kernel network and a weighted bidirectional feature pyramid, for improving the accuracy of deep-learning-based small target detection.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a small target detection method based on a selectable convolution kernel network and a weighted bidirectional feature pyramid specifically comprises the following steps:
s1: performing data enhancement on an original image, computing predefined anchor frames by adaptive anchor frame calculation, scaling the images to the same size by adaptive image scaling, and inputting the processed image into a YOLOv5 backbone network that incorporates a spatial-attention-based selectable convolution kernel network;
s2: extracting multi-layer features from the input image through the backbone network to obtain features of different layers;
s3: using the weighted bidirectional feature pyramid network BiFPN to perform cross-layer feature fusion on the features of different layers to obtain a plurality of fused features;
s4: adding a group of small target detection anchor frames to the YOLOv5 detection layer and carrying out small target detection on the plurality of fused features;
s5: training the improved network model and inputting the data set into the trained model to detect small targets.
Further, the data enhancement in step S1 specifically includes: performing Mosaic data enhancement on the original image, in which four pictures are randomly cropped and scaled, then randomly arranged and stitched into a single picture; this enriches the data set, increases the number of small-sample targets, and improves the training speed of the network.
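As an illustration (not part of the patent text), the Mosaic stitching can be sketched as follows, assuming four numpy images; the corresponding shifting and scaling of the box labels is omitted:

```python
import random
import numpy as np

def mosaic4(images, out_size=640):
    """Stitch four HxWx3 images into one mosaic around a random center.

    A minimal sketch: a real implementation would also transform the
    ground-truth boxes of each picture into mosaic coordinates.
    """
    s = out_size
    # random mosaic center, kept away from the borders
    xc = random.randint(s // 4, 3 * s // 4)
    yc = random.randint(s // 4, 3 * s // 4)
    canvas = np.full((s, s, 3), 114, dtype=np.uint8)  # gray padding

    # target quadrants: top-left, top-right, bottom-left, bottom-right
    regions = [(0, 0, xc, yc), (xc, 0, s, yc), (0, yc, xc, s), (xc, yc, s, s)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # nearest-neighbour resize so the sketch has no cv2 dependency
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas
```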
Further, computing the predefined anchor frames by adaptive anchor frame calculation in step S1 specifically includes: on the basis of the initial anchor frames, comparing the output prediction frames with the ground-truth frames, computing the gap, updating in the reverse direction, and continuously iterating the parameters to obtain the most suitable anchor frame values. The data set is analyzed with k-means clustering and a genetic learning algorithm to obtain preset anchor frames suited to predicting the object bounding boxes in the data set.
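For illustration, the k-means clustering of box sizes can be sketched as below; this is a minimal sketch using the 1 - IoU distance commonly used for anchor clustering, the genetic evolution step is omitted, and k=12 (3 anchors for each of four detection layers) is an assumption here:

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=50):
    """Cluster ground-truth box sizes (N, 2) into k anchors.

    Distance is 1 - IoU of boxes aligned at the origin, so wide and tall
    boxes end up in different clusters even when their areas are similar.
    """
    n = wh.shape[0]
    anchors = wh[np.random.choice(n, k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU between every box and every anchor (aligned at a corner)
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = wh[:, 0:1] * wh[:, 1:2] + \
                (anchors[:, 0] * anchors[:, 1])[None] - inter
        assign = np.argmax(inter / union, axis=1)   # nearest anchor per box
        for j in range(k):
            if np.any(assign == j):                 # keep old anchor if empty
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(1))]     # sorted by area
```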
Further, the YOLOv5 backbone network incorporating the spatial-attention-based selectable convolution kernel network in step S1 is obtained as follows: the spatial attention mechanism Coordinate Attention is integrated into the selectable convolution kernel network SKNet to obtain the spatial-attention-based selectable convolution kernel network CA-SKNet, and CA-SKNet is integrated into the C3 convolution module to obtain the improved YOLOv5 backbone network.
Further, in the YOLOv5 backbone network incorporating the spatial-attention-based selectable convolution kernel network:
the spatial attention mechanism aggregates the input features along the two spatial directions to obtain a pair of direction-aware feature maps of sizes C×H×1 and C×1×W respectively:

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$

where C is the number of channels, H the input image height and W the input image width; $x_c(h, i)$ denotes the value at coordinate $(h, i)$ of channel $c$ in the C×H×W feature map and $x_c(j, w)$ the value at coordinate $(j, w)$; $z_c^h$ is the average pooling result along the W direction and $z_c^w$ the average pooling result along the H direction;
the two feature maps of sizes C×H×1 and C×1×W are concatenated and passed through convolution and normalization to obtain an intermediate feature map, whose values are scaled into (0, 1) by a sigmoid activation function to obtain the weights in the two directions:

$$f = \delta(F_1([z^h, z^w]))$$

$$g^h = \sigma(F_h(f^h))$$

$$g^w = \sigma(F_w(f^w))$$

where δ denotes the nonlinear transformation, $F_1$ the convolution operation and σ the activation function; $z^h$ and $z^w$ are the average pooling results along the W and H directions; $f$ is the result after convolution and nonlinear transformation, and $f^h$ and $f^w$ are the portions of $f$ corresponding to $z^h$ and $z^w$; $F_h$ and $F_w$ denote the convolutions applied to $f^h$ and $f^w$; $g^h$ is the weight obtained from $f^h$ through the convolution $F_h$ and the activation function, and $g^w$ the weight obtained from $f^w$ through the convolution $F_w$ and the activation function;
the original feature map is re-weighted to obtain a feature map carrying attention weights in both the width and height directions:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

where $x_c(i, j)$ is the value at $(i, j)$ of the original C×H×W feature map, $g_c^h(i)$ the height-direction weight obtained through convolution and the activation function, $g_c^w(j)$ the width-direction weight, and $y_c(i, j)$ the new feature map obtained by multiplying the original feature map by the height-direction and width-direction weights.
Further, in step S3, performing cross-layer feature fusion on the features of different layers with the weighted bidirectional feature pyramid network BiFPN specifically includes:
after backbone feature extraction there are four layers of features, located at the second, fourth, sixth and ninth layers;
the ninth-layer features go through one bottom-up upsampling pass and are fused with the four layers of features output by the backbone network, producing one feature map each at the tenth, fourteenth, eighteenth and twentieth layers;
the twentieth-layer feature map goes through one top-down downsampling pass and is fused with the four layers of backbone outputs and with the earlier upsampling outputs, producing one feature map each at the twenty-first, twenty-fourth, twenty-seventh and thirty-first layers, which are sent to the detection layer for detection (the weighted fusion rule is sketched below).
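At each of these merge points BiFPN applies fast normalized fusion: every incoming feature map carries a learnable non-negative weight, and the output is the weighted average sum(w_i * x_i) / (eps + sum(w_j)). A minimal PyTorch sketch, assuming the inputs have already been resized to a common shape (the module name is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Sketch of BiFPN's fast normalized fusion for one merge point."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, xs):
        w = torch.relu(self.w)            # keep the weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalized fusion
        return sum(wi * x for wi, x in zip(w, xs))

# e.g. fusing a backbone feature, a top-down feature and a bottom-up feature:
fuse = WeightedFusion(num_inputs=3)
```

In a full BiFPN, each node of the bottom-up and top-down paths would typically own such a fusion followed by a convolution.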
In step S4, a group of small target detection layers is added to YOLOv5; after the feature maps output by the weighted bidirectional feature pyramid network BiFPN are obtained, they are sent to the detection layers for small target detection, which improves the accuracy of small target detection.
Further, the small target detection layer consists of four detection layers in which feature maps of different sizes are used to detect target objects of different sizes, detecting respectively the 160×160, 80×80, 40×40 and 20×20 feature maps output by the weighted bidirectional feature pyramid network (BiFPN). Each detection layer outputs the corresponding vectors, and the predicted bounding boxes and categories of the targets are finally generated and marked in the original image.
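For reference, the output tensor shapes of such a four-scale head for a 640×640 input can be sketched as follows; 3 anchors per scale and the class count (e.g. the 10 VisDrone2019 categories) are assumptions of this sketch:

```python
# 5 + nc channels per anchor: 4 box offsets + 1 objectness + nc class scores
nc = 10                              # illustrative class count, not from the patent
for grid in (160, 80, 40, 20):       # the four BiFPN output resolutions
    print(f"{grid}x{grid} head -> (batch, 3, {grid}, {grid}, {5 + nc})")
```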
Further, training the improved network model in step S5 specifically includes the following steps: first, the input is processed by Mosaic data enhancement, adaptive anchor frame calculation and adaptive image scaling and fed to the backbone network, and the YOLOv5 backbone network incorporating the spatial-attention-based selectable convolution kernel network performs multi-layer feature extraction on the input image to obtain features of different layers; next, cross-layer feature fusion is performed on the multi-layer features through the weighted bidirectional feature pyramid network to obtain a plurality of fused features; finally, a group of small target detection layers is added to detect targets on the fused features, the rectangle loss is computed with the CIoU loss, the confidence loss and classification loss are computed with the BCE loss, and the three are weighted to give the total loss. Back-propagation minimizes the loss to update the network parameters, iteration yields the trained model, and the data set is input into the model to detect small targets.
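A minimal sketch of the described loss combination, assuming illustrative weight values (the patent does not state the weights) and binary cross-entropy on logits:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def total_loss(ciou, pred_obj, tgt_obj, pred_cls, tgt_cls,
               w_box=0.05, w_obj=1.0, w_cls=0.5):
    """Weighted combination of rectangle, confidence and class losses.

    `ciou` is the mean (1 - CIoU) over matched boxes; confidence and
    classification use BCE. The weights are illustrative defaults only.
    """
    l_box = ciou
    l_obj = bce(pred_obj, tgt_obj)
    l_cls = bce(pred_cls, tgt_cls)
    return w_box * l_box + w_obj * l_obj + w_cls * l_cls
```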
The invention has the following beneficial effects: the method improves on the YOLOv5 model; integrating a spatial-attention-based selectable convolution kernel network into the backbone network extracts small targets better and avoids the loss of small target features; fusing the features of different layers with the weighted bidirectional feature pyramid enriches the small target information in the features; and adding a small target detection layer improves the detection of small targets in images.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below, by way of preferred embodiments, with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a small target detection method based on a selectable convolution kernel network and a weighted bi-directional feature pyramid according to the present invention;
FIG. 2 is a diagram of the improved YOLOv5 model of the present invention;
FIG. 3 is a diagram of an alternative convolution kernel network (CA-SKNet) architecture based on spatial attention;
fig. 4 is a diagram of the structure of the four detection layers of the head.
Detailed Description
The following describes embodiments of the present invention with reference to specific examples; other advantages and effects of the invention will become readily apparent to those skilled in the art from this disclosure. The invention may also be practiced or carried out in other, different embodiments, and the details of this description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
The drawings are for illustrative purposes only, are schematic rather than physical, and are not intended to limit the invention; to better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; and it will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numbers in the drawings of the embodiments correspond to the same or similar components. In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "front" and "rear" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplification of description and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation. Such positional terms are therefore merely illustrative, should not be construed as limiting the invention, and their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
As shown in fig. 1, this embodiment discloses a small target detection method based on a spatial-attention selectable convolution kernel network and a weighted bidirectional feature pyramid, one implementation of which proceeds as follows.
The small target data set VisDrone2019, which contains the corresponding preset labels and was captured by various unmanned aerial vehicle cameras, is divided into training and test sets at a ratio of 4:1.
The data set is enriched by data enhancement, the predefined anchor frames are computed by adaptive anchor frame calculation, adaptive image scaling scales the images to the same size of 640×640, and the processed images are input into the backbone network.
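A minimal sketch of such scaling as a letterbox resize, assuming an HxWx3 numpy image and a square 640×640 output padded with gray value 114 (YOLOv5's adaptive scaling actually pads to the minimal rectangle; nearest-neighbour resizing keeps the sketch dependency-free):

```python
import numpy as np

def letterbox(img, new=640, pad_value=114):
    """Resize with preserved aspect ratio, then pad to new x new."""
    h, w = img.shape[:2]
    r = new / max(h, w)                         # uniform scale factor
    nh, nw = round(h * r), round(w * r)
    ys = np.linspace(0, h - 1, nh).astype(int)  # nearest-neighbour resize
    xs = np.linspace(0, w - 1, nw).astype(int)
    resized = img[ys][:, xs]
    out = np.full((new, new, 3), pad_value, dtype=img.dtype)
    top, left = (new - nh) // 2, (new - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out, r, (left, top)                  # scale/offset for the labels
```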
To improve the efficiency and precision of small target detection in images, this embodiment improves on the YOLOv5 model so that it identifies small targets more efficiently and precisely. A spatial-attention-based selectable convolution kernel network (CA-SKNet), shown in fig. 3, is introduced into YOLOv5 to obtain a new backbone network, shown as the Backbone part of fig. 2, and the processed image is input into the improved backbone network for multi-layer feature extraction to obtain features of different layers.
Specifically, in this embodiment CA-SKNet is an improvement on the selectable convolution kernel network (SKNet): the spatial attention mechanism Coordinate Attention is incorporated to remedy the defect that SKNet considers only channel information and ignores spatial information, yielding the improved spatial-attention-based selectable convolution kernel network (CA-SKNet). Different convolution kernel sizes, i.e. different receptive fields, are selected through the attention mechanism, so that features of different sizes can be obtained and new weighted features derived. The spatial-attention-based selectable convolution kernel network is then merged into the C3 module. Inside it, the input features are aggregated along the two spatial directions to obtain a pair of direction-aware feature maps of sizes C×H×1 and C×1×W respectively:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$

where C is the number of channels, H the input image height and W the input image width; $x_c(h, i)$ is the value at $(h, i)$ of the C×H×W feature map and $x_c(j, w)$ the value at $(j, w)$; $z_c^h$ is the average pooling result along the W direction and $z_c^w$ the average pooling result along the H direction.
The two feature maps of sizes C×H×1 and C×1×W are concatenated and passed through convolution and normalization steps to obtain an intermediate feature map; after the sigmoid activation function, the feature values are scaled into (0, 1) to obtain the weights in the two directions:

$$f = \delta(F_1([z^h, z^w]))$$

$$g^h = \sigma(F_h(f^h))$$

$$g^w = \sigma(F_w(f^w))$$

where δ denotes the nonlinear transformation, $F_1$ the convolution operation, σ the activation function, $f$ the result after convolution and nonlinear transformation, and $g$ the weights obtained through the activation function.
The original feature map is re-weighted to obtain the feature map carrying attention weights in the width and height directions:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
The structure of the spatial-attention-based selectable convolution kernel network (CA-SKNet) of this embodiment is shown in fig. 3. The improved backbone network yields several layers of features: the deep features contain rich semantic information but have very low resolution and poor perception of detail, while the shallow features have high resolution and contain more detail information, but also more useless noise. The weighted bidirectional feature pyramid is therefore used to fuse the features of different layers, as in the Neck part of fig. 2, to obtain a plurality of fused features.
Specifically, the weighted bidirectional feature pyramid network (BiFPN) is used. After backbone feature extraction there are four layers of features, located at the second, fourth, sixth and ninth layers. The ninth-layer features go through one bottom-up upsampling pass and are fused with the four layers of features output by the backbone network, producing one feature map each at the tenth, fourteenth, eighteenth and twentieth layers. The twentieth-layer feature map then goes through one top-down downsampling pass and is fused with the four layers of backbone outputs and the earlier upsampling outputs, producing one feature map each at the twenty-first, twenty-fourth, twenty-seventh and thirty-first layers, which are sent to the detection layer for detection.
Then, in this embodiment, a group of small target detection anchor frames is added to the improved YOLOv5 detection layer, as shown in fig. 4: the 160×160 feature map output by the weighted bidirectional feature pyramid network (BiFPN) is sent to the detection layer for small target detection, which improves the accuracy of small target detection.
Finally, the improved network is trained to obtain a trained model, and the data set is input into the model to detect small targets.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (9)

1. A small target detection method based on a selectable convolution kernel network and a weighted bidirectional feature pyramid, characterized in that the method specifically comprises the following steps:
s1: performing data enhancement on an original image, computing predefined anchor frames by adaptive anchor frame calculation, scaling the images to the same size by adaptive image scaling, and inputting the processed image into a YOLOv5 backbone network that incorporates a spatial-attention-based selectable convolution kernel network;
s2: extracting multi-layer features from the input image through the backbone network to obtain features of different layers;
s3: using the weighted bidirectional feature pyramid network BiFPN to perform cross-layer feature fusion on the features of different layers to obtain a plurality of fused features;
s4: adding a group of small target detection anchor frames to the YOLOv5 detection layer and carrying out small target detection on the plurality of fused features;
s5: training the improved network model and inputting the data set into the trained model to detect small targets.
2. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that the data enhancement in step S1 specifically includes: performing Mosaic data enhancement on the original image, in which four pictures are randomly cropped and scaled, then randomly arranged and stitched into a single picture.
3. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that computing the predefined anchor frames by adaptive anchor frame calculation in step S1 specifically includes: on the basis of the initial anchor frames, comparing the output prediction frames with the ground-truth frames, computing the gap, updating in the reverse direction, and continuously iterating the parameters to obtain the most suitable anchor frame values; and analyzing the data set with k-means clustering and a genetic learning algorithm to obtain preset anchor frames suited to predicting the object bounding boxes in the data set.
4. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that the YOLOv5 backbone network incorporating the spatial-attention-based selectable convolution kernel network in step S1 is obtained as follows: the spatial attention mechanism Coordinate Attention is integrated into the selectable convolution kernel network SKNet to obtain the spatial-attention-based selectable convolution kernel network CA-SKNet, and CA-SKNet is integrated into the C3 convolution module to obtain the improved YOLOv5 backbone network.
5. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that, in the YOLOv5 backbone network incorporating the spatial-attention-based selectable convolution kernel network:
the spatial attention mechanism aggregates the input features along the two spatial directions to obtain a pair of direction-aware feature maps of sizes C×H×1 and C×1×W respectively:

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$

where C is the number of channels, H the input image height and W the input image width; $x_c(h, i)$ is the value at coordinate $(h, i)$ of the C×H×W feature map and $x_c(j, w)$ the value at coordinate $(j, w)$; $z_c^h$ is the average pooling result along the W direction and $z_c^w$ the average pooling result along the H direction;
the two feature maps of sizes C×H×1 and C×1×W are concatenated and passed through convolution and normalization to obtain an intermediate feature map, whose values are scaled into (0, 1) by a sigmoid activation function to obtain the weights in the two directions:

$$f = \delta(F_1([z^h, z^w]))$$

$$g^h = \sigma(F_h(f^h))$$

$$g^w = \sigma(F_w(f^w))$$

where δ denotes the nonlinear transformation, $F_1$ the convolution operation and σ the activation function; $z^h$ and $z^w$ are the average pooling results along the W and H directions; $f$ is the result after convolution and nonlinear transformation, and $f^h$ and $f^w$ are the portions of $f$ corresponding to $z^h$ and $z^w$; $F_h$ and $F_w$ denote the convolutions applied to $f^h$ and $f^w$; $g^h$ is the weight obtained from $f^h$ through the convolution $F_h$ and the activation function, and $g^w$ the weight obtained from $f^w$ through the convolution $F_w$ and the activation function;
the original feature map is re-weighted to obtain a feature map carrying attention weights in both the width and height directions:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

where $x_c(i, j)$ is the value at $(i, j)$ of the original C×H×W feature map, $g_c^h(i)$ the height-direction weight obtained through convolution and the activation function, $g_c^w(j)$ the width-direction weight, and $y_c(i, j)$ the new feature map obtained by multiplying the original feature map by the height-direction and width-direction weights.
6. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that performing cross-layer feature fusion on the features of different layers with the weighted bidirectional feature pyramid network BiFPN in step S3 specifically includes:
after backbone feature extraction there are four layers of features, located at the second, fourth, sixth and ninth layers;
the ninth-layer features go through one bottom-up upsampling pass and are fused with the four layers of features output by the backbone network, producing one feature map each at the tenth, fourteenth, eighteenth and twentieth layers;
the twentieth-layer feature map goes through one top-down downsampling pass and is fused with the four layers of backbone outputs and with the earlier upsampling outputs, producing one feature map each at the twenty-first, twenty-fourth, twenty-seventh and thirty-first layers, which are sent to the detection layer for detection.
7. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that, in step S4, a group of small target detection layers is added to YOLOv5, and after the feature maps output by the weighted bidirectional feature pyramid network BiFPN are obtained, they are sent to the detection layers for small target detection, improving the accuracy of small target detection.
8. The small target detection method based on a selectable convolution kernel network and a weighted bidirectional feature pyramid of claim 7, characterized in that the small target detection layer consists of four detection layers in which feature maps of different sizes detect target objects of different sizes, detecting respectively the 160×160, 80×80, 40×40 and 20×20 feature maps output by the weighted bidirectional feature pyramid network; each detection layer outputs the corresponding vectors, and the predicted bounding boxes and categories of the targets are finally generated and marked in the original image.
9. The small target detection method based on the selectable convolution kernel network and the weighted bidirectional feature pyramid according to claim 1, characterized in that training the improved network model in step S5 specifically includes the following steps: first, the input is processed by Mosaic data enhancement, adaptive anchor frame calculation and adaptive image scaling and fed to the backbone network, and the YOLOv5 backbone network incorporating the spatial-attention-based selectable convolution kernel network performs multi-layer feature extraction on the input image to obtain features of different layers; then cross-layer feature fusion is performed on the multi-layer features through the weighted bidirectional feature pyramid network to obtain a plurality of fused features; finally, a group of small target detection layers is added to detect targets on the fused features, the rectangle loss is computed with the CIoU loss, the confidence loss and classification loss are computed with the BCE (binary cross-entropy) loss, the three are weighted to give the total loss, back-propagation minimizes the loss to update the network parameters, iteration yields the trained model, and the data set is input into the model to detect small targets.
CN202211470248.5A 2022-11-23 2022-11-23 Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid Pending CN116051953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211470248.5A CN116051953A (en) 2022-11-23 2022-11-23 Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211470248.5A CN116051953A (en) 2022-11-23 2022-11-23 Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid

Publications (1)

Publication Number Publication Date
CN116051953A true CN116051953A (en) 2023-05-02

Family

ID=86115214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211470248.5A Pending CN116051953A (en) 2022-11-23 2022-11-23 Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid

Country Status (1)

Country Link
CN (1) CN116051953A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612087A (en) * 2023-05-22 2023-08-18 山东省人工智能研究院 Coronary artery CTA stenosis detection method based on YOLOv5-LA
CN116612087B (en) * 2023-05-22 2024-02-23 山东省人工智能研究院 Coronary artery CTA stenosis detection method based on YOLOv5-LA
CN116363124A (en) * 2023-05-26 2023-06-30 南京杰智易科技有限公司 Steel surface defect detection method based on deep learning
CN116682014A (en) * 2023-06-07 2023-09-01 无锡照明股份有限公司 Method, device, equipment and storage medium for dividing lamp curtain building image
CN116682014B (en) * 2023-06-07 2024-07-05 无锡照明股份有限公司 Method, device, equipment and storage medium for dividing lamp curtain building image
CN116532046A (en) * 2023-07-05 2023-08-04 南京邮电大学 Microfluidic automatic feeding device and method for spirofluorene xanthene
CN116532046B (en) * 2023-07-05 2023-10-10 南京邮电大学 Microfluidic automatic feeding device and method for spirofluorene xanthene
CN116665156A (en) * 2023-07-28 2023-08-29 苏州中德睿博智能科技有限公司 Multi-scale attention-fused traffic helmet small target detection system and method
CN116664573A (en) * 2023-07-31 2023-08-29 山东科技大学 Downhole drill rod number statistics method based on improved YOLOX
CN116664573B (en) * 2023-07-31 2024-02-09 山东科技大学 Downhole drill rod number statistics method based on improved YOLOX
CN117197475A (en) * 2023-09-20 2023-12-08 南京航空航天大学 Target detection method for large-range multi-interference-source scene
CN117197475B (en) * 2023-09-20 2024-02-20 南京航空航天大学 Target detection method for large-range multi-interference-source scene

Similar Documents

Publication Publication Date Title
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN116051953A (en) Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN108154102B (en) Road traffic sign identification method
CN114202672A (en) Small target detection method based on attention mechanism
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN111079739B (en) Multi-scale attention feature detection method
CN111563502A (en) Image text recognition method and device, electronic equipment and computer storage medium
CN110309842B (en) Object detection method and device based on convolutional neural network
CN110647802A (en) Remote sensing image ship target detection method based on deep learning
CN114724120B (en) Vehicle target detection method and system based on radar vision semantic segmentation adaptive fusion
CN112613343B (en) River waste monitoring method based on improved YOLOv4
CN113052185A (en) Small sample target detection method based on fast R-CNN
CN110136162B (en) Unmanned aerial vehicle visual angle remote sensing target tracking method and device
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN112348056A (en) Point cloud data classification method, device, equipment and readable storage medium
CN111583322A (en) Depth learning-based 2D image scene depth prediction and semantic segmentation method and system
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN116630637A (en) optical-SAR image joint interpretation method based on multi-modal contrast learning
CN107729992B (en) Deep learning method based on back propagation
CN116958615A (en) Picture identification method, device, equipment and medium
CN113420760A (en) Handwritten Mongolian detection and identification method based on segmentation and deformation LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination