CN115272648B

CN115272648B - Multi-level receptive field expanding method and system for small target detection

Info

Publication number: CN115272648B
Application number: CN202211209625.XA
Authority: CN
Inventors: 阙越; 甘梦晗; 刘志伟; 张月园; 熊汉卿
Original assignee: East China Jiaotong University
Current assignee: Hefei Minglong Electronic Technology Co ltd
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2022-12-20
Anticipated expiration: 2042-09-30
Also published as: CN115272648A

Abstract

The invention provides a multi-level receptive field expanding method and a multi-level receptive field expanding system for small target detection, wherein Swin transform is introduced as a backbone network of a model, and the characteristics of a small target are extracted by utilizing the hierarchy, the locality and the translational invariance of the Swin transform; according to different output characteristics of each stage of the backbone network, a multi-stage receptive field expansion network is designed, and the output characteristics of the backbone network are further processed, so that the problem of small target information loss is avoided; in addition, the receptive field amplifying module can effectively expand the receptive field. According to task requirements, the structure of each layer of receptive field amplification module is flexibly adjusted to match the receptive fields required by targets with different scales and obtain rich context information; on the other hand, the proposed joint loss of GIOU loss and BIOU loss is used to enhance the localization performance of the target; the comparison test proves that the invention has good performance in the aspect of small target detection.

Description

Multi-level receptive field expanding method and system for small target detection

Technical Field

The invention relates to the technical field of computer vision, in particular to a multi-level receptive field expanding method and a multi-level receptive field expanding system for small target detection.

Background

Object detection is an important research direction in the field of computer vision, and is also the basis of other high-level vision tasks. Although the target detection algorithm using the deep learning method has been developed rapidly, the detection of small targets is a difficult point in the target detection. For the field of automatic driving, small targets influencing traffic can be accurately and quickly detected, and the safety of travel of a driver can be ensured; for the industrial automation field, the small defects on the material can be accurately positioned and identified, so that the industrial production efficiency can be ensured; for the field of satellite remote sensing, small target detection can help to solve the problems of illegal fishing boats, illegal cargo transfer and the like. Therefore, a multi-level receptive field amplification network for small target detection is developed, and the method has wide application value and academic research value.

In the field of object detection, the most powerful COCO data set uses absolute size definitions, which specify objects of 32 × 32 pixels or less as small objects, and this standard is widely used. On COCO datasets, small target detection accuracy is generally less than normal target detection accuracy, and therefore small target detection is more challenging than normal targets. Specifically, the task of small target detection mainly faces four challenges: firstly, the characteristics of small objects are difficult to extract, and distinguishing characteristic information is difficult to extract from the small objects with low resolution due to the lack of visual information; secondly, due to downsampling, the features of small objects may converge into a point and even disappear on a deep feature layer; thirdly, the receptive fields are not matched, the large receptive field is suitable for large object detection, and the small receptive field is beneficial for small object detection; finally, small objects need higher positioning accuracy, the small target detection is greatly influenced by the deviation of the boundary frame, the small objects are difficult to accurately position, and the condition of missing detection may occur.

Therefore, it is necessary to provide a multi-level receptive field expanding method and system for small target detection to solve the above technical problems.

Disclosure of Invention

Therefore, embodiments of the present invention provide a multi-level receptive field expansion method and system for small target detection to solve the above technical problems.

The invention provides a multi-level receptive field expanding method for small target detection, wherein the method comprises the following steps:

firstly, preprocessing applicable to small target detection is carried out on an input image in a COCO data set;

step two, introducing a Swin Transformer as a backbone network, and performing feature extraction on the input image by using a hierarchical structure of the Swin Transformer to obtain multiple layers of features, wherein each layer of features corresponds to a feature layer;

thirdly, constructing a multi-level reception field feature fusion network, matching the required reception fields of all feature layers in the Swin transform through a reception field feature amplification module in the multi-level reception field feature fusion network and supplementing shallow prediction features, wherein after matching, each feature layer is correspondingly provided with a plurality of reception field feature amplification modules;

step four, taking the linear combination of the GIOU loss and the BIOU loss as the regression loss of the bounding box, and enhancing the target positioning effect according to the regression loss function of the corresponding bounding box;

and step five, distributing the targets with different scales in the input image on the feature layers with different receptive fields, and positioning and identifying the small targets by utilizing the shallow prediction feature layer in the detection model to obtain the positioning and identifying results of the small targets.

The invention provides a multi-level receptive field expansion method for small target detection, which introduces Swin transform as a backbone network of a model and utilizes the hierarchy, locality and translational invariance to extract the characteristics of a small target; according to the difference of output characteristics of each stage of the backbone network, a multi-stage receptive field expansion network is designed, and the output characteristics of the backbone network are further processed, so that the problem of small target information loss is avoided; in addition, the receptive field amplifying module can effectively expand the receptive field. According to task requirements, the structure of each layer of receptive field amplification module is flexibly adjusted to match receptive fields required by targets with different scales, and rich context information is obtained; on the other hand, the proposed combined loss of GIOU loss and BIOU loss is used to enhance the localization performance of the target; the comparison test proves that the invention has good performance in the aspect of small target detection.

The multi-level receptive field expanding method for small target detection is characterized in that in the step one, the preprocessing comprises the following steps:

designing a data enhancement strategy, wherein the data enhancement strategy is as follows: scaling an image size of an input image, using multi-scale training to enhance sample scale diversity;

and randomly and horizontally turning the input image in the COCO data set for data augmentation so as to enhance the generalization capability of the model.

The multi-level receptive field expanding method for small target detection is characterized in that the Swin transducer corresponds to a four-layer structure, and four extraction features with different scales and different depths are correspondingly extracted

In which

Meridian/channel

Convolution adjusts the channel number to obtain the characteristics

Wherein, in the process,

；

output characteristics of four different scales are output by multi-level receptive field characteristic fusion network

In which

；

The receptive field feature amplification modules on four feature layers in the multi-level receptive field feature fusion network are represented as

Wherein, in the process,

；

the corresponding relationship is as follows:

wherein the content of the first and second substances,

respectively representing the layer 2 output characteristic, the layer 3 output characteristic, the layer 4 output characteristic and the layer 5 output characteristic,

respectively representing layer 2 features, layer 3 features, layer 4 features and layer 5 features,

respectively show the receptive field characteristic amplifying modules on the 2 nd characteristic layer, the 3 rd characteristic layer, the 4 th characteristic layer and the 5 th characteristic layer,

representing the number of the receptive field feature amplification modules in a single feature layer,

meaning upsampling using a two-fold adjacent sample interpolation.

The multi-level receptive field expanding method for small target detection is characterized in that the receptive field characteristic amplifying module comprises a plurality of basesA basic unit, which is the layer 4 characteristic of Swin Transformer of the backbone network in the layer 4 characteristic

Base unit 1 of the receptive field feature amplification module

Obtaining a first base unit output characteristic

Then passes through the 2 nd basic unit

Obtaining a second base unit output characteristic

Finally via the 3 rd basic unit

Merging layer 4 features of the backbone network by residual connection

Third base unit output characteristics for layer 4 characteristics

；

The corresponding expression is:

wherein the third basic unit outputs characteristics

The output characteristic of the first receptive field characteristic amplification module of the 4 th characteristic layer is obtained.

The multi-level receptive field expanding method for small target detection is characterized in that the first basic unit outputs characteristics

Is expressed as:

wherein the content of the first and second substances,

represent

The result of the convolution is,

representing a convolution kernel of

The convolution of the holes of (a) with (b),

which represents the spreading ratio of the convolution of the hole,

which represents the normalization of the batch,

it is shown that the activation function is,

to representInvolving batch normalization and activation functions

The result of the convolution is,

the representation comprising batch normalization and activation functions

And (4) convolution of holes.

The multi-level receptive field expanding method for small target detection is characterized in that the bounding box regression loss function is expressed as:

wherein the content of the first and second substances,

a bounding box regression loss function is represented,

represents the GIOU loss function,

represents the BIOU loss function,

a prediction bounding box is represented that is,

a reference frame is shown in the drawing,

the position of the bounding box is indicated,

，

the coordinates representing the center point of the bounding box,

respectively representing the width and height of the bounding box,

represents the minimum bounding box area of the prediction bounding box and the annotation box,

indicating a loss of Smooth L1,

indicating a contact ratio calculation.

In the fifth step, in the identification task, a Focal loss function is used for solving the problem of imbalance of positive and negative samples, and the corresponding Focal loss function is expressed as:

wherein the content of the first and second substances,

the local function is represented by the following formula,

a prediction score is represented by a number of points,

the presence of a real label is indicated,

representing the number of balanced positive and negative samples,

representThe factor is adjusted.

In the step five, the total loss function corresponding to the positioning and recognition of the detection model is expressed as:

wherein, the first and the second end of the pipe are connected with each other,

representing the total loss function of the detection model for positioning and identifying,

both represent hyper-parameters.

The invention also provides a multi-level receptive field expanding system for small target detection, wherein the system comprises:

a pre-processing module to:

preprocessing suitable for small target detection is carried out on an input image in the COCO data set;

a feature extraction module to:

introducing a Swin transform as a backbone network, and performing feature extraction on the input image by using a hierarchical structure of the Swin transform to obtain multiple layers of features, wherein each layer of features corresponds to a feature layer;

a network construction module to:

constructing a multi-level receptive field feature fusion network, matching the required receptive fields of all feature layers in the Swin transducer and supplementing shallow prediction features through receptive field feature amplification modules in the multi-level receptive field feature fusion network, wherein after matching, each feature layer corresponds to a plurality of receptive field feature amplification modules;

a loss determination module to:

taking the linear combination of the GIOU loss and the BIOU loss as the regression loss of the bounding box, and enhancing the target positioning effect according to the regression loss function of the corresponding bounding box;

a result output module to:

and distributing the targets with different scales in the input image on the feature layers with different receptive fields, and positioning and identifying the small targets by utilizing the shallow prediction feature layer in the detection model to obtain the positioning and identifying results of the small targets.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of embodiments of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart of a multi-level receptive field expansion method for small target detection proposed by the present invention;

fig. 2 is a schematic structural diagram of a multi-level receptive field expanding system for small target detection according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the present invention provides a multi-level receptive field expanding method for small target detection, wherein the method includes the following steps:

s101, preprocessing applicable to small object detection is carried out on the input image in the COCO data set.

In particular, most images in the COCO dataset are of life origin, with a complex background. There were 80 classes in the dataset, each image containing on average 3.5 classes and 7.7 instances. In the COCO dataset, targets with an area less than 32 × 32 were defined as small targets, the percentage of small objects being 41%.

In step S101, the preprocessing includes the steps of:

s1011, designing a data enhancement strategy, wherein the data enhancement strategy is as follows: the image size of the input image is scaled, using multi-scale training to enhance sample scale diversity.

In this step, the scaled image size is (480, 1333).

And S1012, performing data amplification on the input image in the COCO data set by adopting random horizontal inversion to enhance the generalization capability of the model.

S102, introducing a Swin Transformer as a backbone network, and performing feature extraction on the input image by using a hierarchical structure of the Swin Transformer to obtain multiple layers of features, wherein each layer of features corresponds to a feature layer.

In step S102, the Swin Transformer has a four-layer structure, and the multi-layer receptive field feature fusion network is used for outputting four output features with different scales

Wherein

The receptive field feature amplification modules on four feature layers in the multilevel receptive field feature fusion network are represented as

Wherein

。

Firstly, inputting a preprocessed image, and correspondingly extracting four extraction features with different scales and different depths

Wherein, in the process,

meridian/channel

Convolution adjusts the channel number to obtain the characteristics

Wherein

；

Then, layer 5 features

ThroughnA 5 th layer reception field characteristic amplification module obtains 5 th layer output characteristics

Is expressed by the formula

；

Next, layer 5 output features

Then, the feature is obtained by up-sampling by a two-fold adjacent sampling interpolation method

Through the 4 th layernThe characteristics obtained by the 4 th layer reception field characteristic amplification module are fused to obtain the 4 th layer output characteristics

Is expressed by the formula

. In the same way, will

Upsampling and layer 3 features

Fusing to obtain layer 3 output characteristics

Then, the same operation is used to obtain the output characteristics of the 2 nd layer

。

Specifically, the correspondence is as follows:

wherein the content of the first and second substances,

respectively represent the receptive fieldness on the 2 nd, 3 rd, 4 th and 5 th feature layersA sign-up and amplification module for amplifying the light,

indicating upsampling using a double neighbor interpolation method.

S103, constructing a multi-level reception field feature fusion network, matching the required reception fields of all feature layers in the Swin transform through a reception field feature amplification module in the multi-level reception field feature fusion network and supplementing shallow prediction features, wherein after matching, each feature layer is provided with a plurality of reception field feature amplification modules correspondingly.

In the invention, the receptive field characteristic amplification module comprises a plurality of basic units, and taking the 4 th characteristic layer as an example, in the 4 th characteristic layer, the 4 th layer characteristic of Swin transducer as a backbone network

1 st basic unit of receptor field characteristic amplification module

Obtaining a first base unit output characteristic

Then passes through the 2 nd basic unit

Obtaining a second base unit output characteristic

Finally, via the 3 rd basic unit

Merging layer 4 features of the backbone network by residual connection

Third base unit output characteristics for layer 4 characteristics

；

The corresponding expression is:

wherein the third basic unit outputs characteristics

The output characteristic of the first receptive field characteristic amplifying module of the 4 th characteristic layer.

As a supplementary note, the other receptive field feature amplification modules on the 4 th feature layer all perform the same operation as described above with the output of the previous receptive field feature amplification module as an input feature. The receptive field feature amplification module on the 4 th feature layer has 3 basic units, the receptive field amplification modules on the 5 th, 3 rd and 2 nd features have 4, 3 and 1 basic units respectively, and the operation is the same as that of the 4 th feature layer.

Further, the first basic unit outputs a characteristic

Is expressed as:

represent

The result of the convolution is,

representing a convolution kernel of

The convolution of the hole of (a) with (b),

which represents the spreading ratio of the convolution of the hole,

which represents the normalization of the batch,

it is shown that the activation function is,

the representation comprising batch normalization and activation functions

The result of the convolution is,

the representation comprising batch normalization and activation functions

And (4) performing hole convolution.

Additionally, the spreading ratio of each base unit is carefully designed. In order to avoid the problem of detailed information loss caused by the checkerboard effect caused by the discontinuity of the used data, different expansion rates are set for the basic units on different feature layers so as to fully utilize the information and match the required receptive fields of targets with different scales. For the 5 th feature layer, the expansion rates of the basic units are respectively set to be 1, 3, 9 and 9; settings 1, 3, 9 for the 4 th feature layer; the 3 rd feature layer is set to be 1, 2 and 3; the 2 nd feature layer is set to 1.

And S104, taking the linear combination of the GIOU loss and the BIOU loss as the regression loss of the boundary frame, and enhancing the target positioning effect according to the corresponding regression loss function of the boundary frame.

In step S104, the bounding box regression loss function is expressed as:

wherein the content of the first and second substances,

a bounding box regression loss function is represented,

represents the GIOU loss function,

represents the BIOU loss function,

a prediction bounding box is represented that is,

a reference frame is shown in the drawing,

the position of the bounding box is indicated,

，

the coordinates representing the center point of the bounding box,

respectively representing the width and height of the bounding box,

indicating a loss of Smooth L1,

representing the contact ratio calculation.

And S105, distributing the targets with different scales in the input image on the feature layers with different receptive fields, and positioning and identifying the small targets by utilizing the shallow prediction feature layer in the detection model to obtain the positioning identification result of the small targets.

In step S105, in the identification task, the positive and negative sample imbalance problem is solved using the following Focal loss function:

wherein the content of the first and second substances,

the local function is represented by the following formula,

the prediction score is represented by a number of prediction points,

a real label is represented by a tag that is true,

representing the number of balanced positive and negative samples,

indicating the adjustment factor.

In addition, the total loss function corresponding to the positioning and identification performed by the detection model is represented as:

both represent hyper-parameters.

Specifically, after model training is completed, a test set sample is input, and output average precision AP (IOU threshold value 0.50-0.95), AP50 (IOU threshold value 0.50), AP75 (IOU threshold value 0.75) and APS (IOU threshold value 0.50-0.95 and an object smaller than 32 × 32 pixels) are obtained and used for evaluating model performance.

The invention provides a multi-level receptive field expansion method for small target detection, which introduces Swin transform as a backbone network of a model and utilizes the hierarchy, locality and translational invariance to extract the characteristics of a small target; according to different output characteristics of each stage of the backbone network, a multi-stage receptive field expansion network is designed, and the output characteristics of the backbone network are further processed, so that the problem of small target information loss is avoided; in addition, the receptive field amplifying module can effectively expand the receptive field. According to task requirements, the structure of each layer of receptive field amplification module is flexibly adjusted to match the receptive fields required by targets with different scales and obtain rich context information; on the other hand, the proposed combined loss of GIOU loss and BIOU loss is used to enhance the localization performance of the target; the comparison test proves that the invention has good performance in the aspect of small target detection.

Referring to fig. 2, the present invention further provides a multi-level receptive field expanding system for small target detection, wherein the system comprises:

a pre-processing module to:

a feature extraction module to:

a network construction module to:

a loss determination module to:

a result output module to:

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A multi-level receptive field expansion method for small target detection, the method comprising the steps of:

step four, taking the linear combination of the GIOU loss and the BIOU loss as the regression loss of the bounding box, and strengthening the target positioning effect according to the regression loss function of the corresponding bounding box;

step five, distributing the targets with different scales in the input image on feature layers with different receptive fields, and positioning and identifying the small targets by utilizing a shallow prediction feature layer in the detection model to obtain the positioning identification result of the small targets;

in the first step, the pretreatment comprises the following steps:

randomly turning horizontally an input image in the COCO data set to perform data augmentation so as to enhance the generalization capability of the model;

the Swin transducer is correspondingly provided with a four-layer structure, and four extraction features with different scales and different depths are correspondingly extracted

Wherein, in the step (A),

meridian/channel

Convolution adjusts the channel number to obtain the characteristics

Wherein, in the step (A),

；

multi-level reception field feature fusion network for outputting output features of four different scales

Wherein, in the step (A),

；

multilevel receptive field featuresThe receptive field characteristic amplification module on the four characteristic layers in the converged network is represented as

Wherein, in the process,

；

the corresponding relationship is as follows:

wherein the content of the first and second substances,

respectively showing the amplification of the characteristics of the receptive field on the 2 nd, 3 rd, 4 th and 5 th characteristic layersThe module is provided with a plurality of modules,

represents the number of the receptive field feature amplification modules in a single feature layer,

representing upsampling using a double neighboring sample interpolation;

the receptive field characteristic amplification module comprises a plurality of basic units, and in the 4 th characteristic layer, the 4 th layer characteristic of Swin transducer as a backbone network

Base unit 1 of the receptive field feature amplification module

Obtaining a first base unit output characteristic

Then passes through the 2 nd basic unit

Obtaining a second base unit output characteristic

Finally, via the 3 rd basic unit

Merging layer 4 features of the backbone network by residual connection

Third base unit output characteristics for layer 4 characteristics

；

The corresponding expression is:

wherein the third basic unit outputs characteristics

The output characteristic of the first receptive field characteristic amplification module of the 4 th characteristic layer is obtained;

first basic unit output characteristics

Is expressed as:

wherein the content of the first and second substances,

represent

The result of the convolution is,

representing a convolution kernel of

The convolution of the holes of (a) with (b),

which represents the spreading ratio of the convolution of the hole,

which represents the normalization of the batch,

it is shown that the activation function is,

the representation comprising batch normalization and activation functions

The result of the convolution is,

representing a representation containing batch normalization and activation functions

Performing hole convolution;

the bounding box regression loss function is expressed as:

a bounding box regression loss function is represented,

represents the GIOU loss function,

represents the BIOU loss function,

a prediction bounding box is represented that is,

a reference frame is shown in the drawing,

the position of the bounding box is indicated,

，

the coordinates representing the center point of the bounding box,

respectively representing the width and height of the bounding box,

indicating a loss of Smooth L1,

representing a contact ratio calculation;

in the fifth step, in the identification task, a Focal local function is used for solving the problem of imbalance of positive and negative samples, and the corresponding Focal local function is expressed as follows:

the local function is represented by the following formula,

a prediction score is represented by a number of points,

the presence of a real label is indicated,

representing the number of balanced positive and negative samples,

represents a regulatory factor;

in the fifth step, the total loss function corresponding to the positioning and identification performed by the detection model is represented as:

wherein the content of the first and second substances,

both represent hyper-parameters.

2. A multi-level receptive field expansion system for small target detection, characterized in that the system applies a multi-level receptive field expansion method for small target detection as claimed in claim 1 above, the system comprising:

a pre-processing module to:

a feature extraction module to:

a network construction module to:

a loss determination module to:

the linear combination of the GIOU loss and the BIOU loss is used as the regression loss of the boundary frame, and the target positioning effect is enhanced according to the corresponding regression loss function of the boundary frame;

a result output module to: