CN115035563A - Method, device and equipment for detecting small target by introducing attention mechanism - Google Patents

Method, device and equipment for detecting small target by introducing attention mechanism

Info

Publication number
CN115035563A
CN115035563A (application CN202210486933.0A)
Authority
CN
China
Prior art keywords
layer
feature
feature map
module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210486933.0A
Other languages
Chinese (zh)
Inventor
赵小川
史津竹
樊迪
刘华鹏
王子彻
陈路豪
李陈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China North Computer Application Technology Research Institute
Original Assignee
China North Computer Application Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China North Computer Application Technology Research Institute filed Critical China North Computer Application Technology Research Institute
Priority to CN202210486933.0A priority Critical patent/CN115035563A/en
Publication of CN115035563A publication Critical patent/CN115035563A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small target detection method, device and equipment introducing an attention mechanism, wherein the method comprises the following steps: acquiring an input image; performing feature extraction on the input image through a preset primary visual perception cortex imitation model to obtain a first feature map, the model comprising a VOneBlock layer, a Conv layer and a feature fusion layer; performing target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes, the target detection model being an improved model based on the Yolov5 model and including a CA attention module; and performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result. The method enhances the utilization efficiency of the context information of the feature maps, improves the efficiency of extracting features of interest, effectively improves the feature extraction capability for small targets, and improves the anti-interference capability of the model.

Description

Method, device and equipment for detecting small target by introducing attention mechanism
Technical Field
The invention relates to the technical field of target detection, in particular to a small target detection method, a small target detection device and small target detection equipment introducing an attention mechanism.
Background
With the development of computer technology, a large number of target detection models have appeared, such as Faster R-CNN, SSD, CenterNet, and Yolov1-Yolov5. Existing target detection models have poor detection accuracy when the target is disturbed by environmental noise and when the target is small, so that target detection suffers a high miss rate or even fails, and such models cannot be applied to application scenes with many small targets and noise interference.
Disclosure of Invention
It is an object of the present invention to provide a new technical solution for small target detection.
According to a first aspect of the present invention, there is provided a small object detection method introducing an attention mechanism, the method comprising:
acquiring an input image;
performing feature extraction on the input image through a preset primary visual perception cortex imitation model to obtain a first feature map; the preset primary visual perception cortex simulation model comprises a VOneBlock layer, a Conv layer and a feature fusion layer;
performing target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes; the target detection model is an improved model based on the Yolov5 model, and a CA attention module is included in the target detection model;
and performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
Optionally, the preset target detection model includes a backbone network and a head network, the CA attention module is disposed in both the backbone network and the head network, and the performing target detection on the first feature map through the preset target detection model to obtain five target feature maps with different sizes includes:
performing multiple size compression and feature extraction on the first feature map through the backbone network to obtain a plurality of backbone feature maps with different sizes;
and inputting the first feature map and the plurality of backbone feature maps with different sizes into the head network to obtain the five target feature maps with different sizes.
Optionally, the head network includes four FPN modules and four feature aggregation modules connected in series, where each FPN module includes a Conv layer, an Upsample layer, a Concat layer and a C3 layer connected in sequence, and the CA attention module includes a CA attention layer disposed behind the C3 layer of each FPN module and a CA attention layer disposed behind each feature aggregation module, and the inputting the first feature map and the plurality of backbone feature maps with different sizes into the head network to obtain the five target feature maps with different sizes includes:
the four FPN modules and the CA attention layers disposed behind the C3 layers of the FPN modules generate a first target feature map from the plurality of backbone feature maps with different sizes and the first feature map;
inputting the first target feature map, the first backbone feature map and a feature map output by a Conv layer of a first FPN module into a first feature aggregation module, wherein the first feature aggregation module performs size compression and channel aggregation on the input feature maps, and a CA attention layer disposed behind the first feature aggregation module extracts features of interest from the feature map output by the first feature aggregation module to obtain a second target feature map;
inputting the second target feature map, the second backbone feature map and a feature map output by a Conv layer of a second FPN module into a second feature aggregation module, wherein the second feature aggregation module performs size compression and channel aggregation on the input feature maps, and a CA attention layer disposed behind the second feature aggregation module extracts features of interest from the feature map output by the second feature aggregation module to obtain a third target feature map;
inputting the third target feature map, the third backbone feature map and a feature map output by a Conv layer of a third FPN module into a third feature aggregation module, wherein the third feature aggregation module performs size compression and channel aggregation on the input feature maps, and a CA attention layer disposed behind the third feature aggregation module extracts features of interest from the feature map output by the third feature aggregation module to obtain a fourth target feature map;
and inputting the fourth target feature map, the fourth backbone feature map and a feature map output by a Conv layer of a fourth FPN module into a fourth feature aggregation module, wherein the fourth feature aggregation module performs size compression and channel aggregation on the input feature maps, and a CA attention layer disposed behind the fourth feature aggregation module extracts features of interest from the feature map output by the fourth feature aggregation module to obtain a fifth target feature map.
Optionally, the four FPN modules and the CA attention layers disposed behind the C3 layers of the FPN modules generating a first target feature map from the plurality of backbone feature maps with different sizes and the first feature map includes:
inputting the fourth backbone feature map and the third backbone feature map into a fourth FPN module to obtain an output feature map of the fourth FPN module;
extracting features of interest from the feature map output by the fourth FPN module through a CA attention layer disposed behind the C3 layer of the fourth FPN module, and inputting the feature map output by that CA attention layer and the second backbone feature map into the third FPN module to obtain an output feature map of the third FPN module;
extracting features of interest from the feature map output by the third FPN module through a CA attention layer disposed behind the C3 layer of the third FPN module, and inputting the feature map output by that CA attention layer and the first backbone feature map into the second FPN module to obtain an output feature map of the second FPN module;
and extracting features of interest from the feature map output by the second FPN module through a CA attention layer disposed behind the C3 layer of the second FPN module, inputting the feature map output by that CA attention layer and the first feature map into the first FPN module, and extracting features of interest from the feature map output by the first FPN module through a CA attention layer disposed behind the C3 layer of the first FPN module to obtain the first target feature map.
Optionally, the backbone network includes a first Conv layer, a first C3 layer, a second Conv layer, a second C3 layer, a third Conv layer, a third C3 layer, a fourth Conv layer, a fourth C3 layer and an SPP layer, the CA attention module includes a CA attention layer disposed behind each C3 layer in the backbone network, and the performing feature map size compression and feature extraction on the first feature map multiple times through the backbone network to obtain a plurality of backbone feature maps with different sizes includes:
compressing the feature map size of the first feature map through the first Conv layer, and performing feature extraction on the compressed first feature map through the first C3 layer to obtain a first backbone feature map;
extracting features of interest from the first backbone feature map through a CA attention layer disposed behind the first C3 layer, compressing the feature map size of the feature map output by that CA attention layer through the second Conv layer, and performing feature extraction on the feature map output by the second Conv layer through the second C3 layer to obtain a second backbone feature map;
extracting features of interest from the second backbone feature map through a CA attention layer disposed behind the second C3 layer, compressing the feature map size of the feature map output by that CA attention layer through the third Conv layer, and performing feature extraction on the feature map output by the third Conv layer through the third C3 layer to obtain a third backbone feature map;
and extracting features of interest from the third backbone feature map through a CA attention layer disposed behind the third C3 layer, compressing the feature map size of the feature map output by that CA attention layer through the fourth Conv layer, performing spatial information fusion on the feature map output by the fourth Conv layer through the SPP layer, and performing feature extraction on the feature map output by the SPP layer through the fourth C3 layer to obtain a fourth backbone feature map.
Optionally, the performing feature extraction on the input image through a preset primary visual perception cortex-imitated model to obtain a first feature map includes:
performing feature extraction and size compression on the input image through the VOneBlock layer to obtain a second feature map;
performing size compression on the input image through the Conv layer to obtain a third feature map; the size of the third characteristic diagram is the same as that of the second characteristic diagram;
and performing characteristic value fusion on the second characteristic diagram and the third characteristic diagram through the characteristic fusion layer to obtain the first characteristic diagram.
According to a second aspect of the present invention, there is provided a small object detection apparatus introducing an attention mechanism, the apparatus comprising:
the image acquisition module is used for acquiring an input image;
the preprocessing module is used for extracting the characteristics of the input image through a preset primary visual perception cortex imitation model to obtain a first characteristic diagram; the preset primary visual perception cortex simulation model comprises a VOneBlock layer, a Conv layer and a feature fusion layer;
the target detection module is used for performing target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes, the target detection model is an improved model based on the Yolov5 model, and the target detection model includes a CA attention module; and for performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
Optionally, the target detection module includes:
the backbone network module is used for performing size compression and feature extraction on the first feature map multiple times to obtain a plurality of backbone feature maps with different sizes;
and the head network module is used for receiving the first feature map and the plurality of backbone feature maps with different sizes to obtain the five target feature maps with different sizes.
Optionally, the target detection module includes:
and the detection head module is used for performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
According to a third aspect of the present invention, there is provided an electronic device comprising a processor and a memory, the memory storing a program executable on the processor, the program, when executed by the processor, implementing the object detection method according to the first aspect of the present invention.
According to one embodiment of the disclosure, five target feature maps with different sizes are generated and target detection is performed through them, which effectively improves the feature extraction capability for weak and small targets and improves the detection performance for weak and small targets. Meanwhile, the primary visual perception cortex imitation model is added in the preprocessing process, and the anti-interference capability of the model is improved by simulating the primary visual cortex perception mechanism of the brain. Meanwhile, a CA attention module is added in the target detection model; through spatial attention and channel attention, it enhances the utilization efficiency of the context information of the feature maps and improves the efficiency of extracting features of interest.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a small target detection method incorporating an attention mechanism of the present invention.
Fig. 2 is a schematic diagram of the primary visual perception cortex-simulated model of the present invention.
FIG. 3 is a schematic diagram of a small object detection model incorporating an attention mechanism according to the present invention.
Fig. 4 is a block diagram of a small object detecting device incorporating an attention mechanism according to the present invention.
Fig. 5 is a block diagram of the electronic device of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
There is currently no clear definition of the concept of a small target in the industry; small targets can be defined in terms of either absolute or relative scale. For example, when defined in terms of absolute scale, a target may be considered small when the area of the target region is less than 32 × 32 pixels. When defined in terms of relative scale, a target may be considered small when the width and height of the target are less than 0.1 of the corresponding dimensions of the original image.
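As a minimal sketch of the two criteria above (the function names are illustrative and not part of the patent):

    def is_small_absolute(box_w: int, box_h: int) -> bool:
        # Absolute scale: target region area below 32 * 32 pixels.
        return box_w * box_h < 32 * 32

    def is_small_relative(box_w: int, box_h: int, img_w: int, img_h: int) -> bool:
        # Relative scale: width and height below 0.1 of the image dimensions.
        return box_w / img_w < 0.1 and box_h / img_h < 0.1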
As shown in FIG. 1, an embodiment of the invention provides a small target detection method introducing an attention mechanism, which includes steps S1-S4.
S1: an input image is acquired.
S2: performing feature extraction on the input image through a preset primary visual perception cortex imitation model to obtain a first feature map; the preset primary visual perception cortex simulation model comprises a VOneBlock layer, a Conv layer and a feature fusion layer.
S3: performing target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes; the target detection model is an improved model based on the Yolov5 model, and the target detection model includes a CA (Coordinate Attention) attention module.
S4: and performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
The input image is an image which needs to be subjected to target detection, and may be an image shot by a camera, an image frame extracted from a video, or a partial area image cut from a complete image.
After the input image is acquired, preprocessing is performed on the input image. In the preprocessing process, feature extraction is performed on the input image through a preset primary visual perception cortex imitation model to obtain a first feature map. The primary visual perception cortex imitation model includes a VOneBlock layer, a Conv (convolution) layer and a feature fusion layer. The VOneBlock layer is a neural network layer constructed according to the primary visual cortex of primates; using a Gabor filter bank as its core component, it simulates the information processing mechanism of the human visual perception cortex to perform bionic visual feature extraction on the input image. After feature extraction through the primary visual perception cortex imitation model, a first feature map closer to the features obtained by human visual processing can be obtained. The primary visual perception cortex imitation model reflects a mapping relationship between the input image and the first feature map.
The target detection model in the invention is an improved model based on the Yolov5 model, and a CA attention module is included in the target detection model. The CA attention module can be seen as a computational unit that enhances the expressive power of features in the network: it takes any intermediate feature tensor as input and outputs, by transformation, a tensor of the same size with an enhanced representation.
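The patent does not give the internals of the CA attention layer; the following is a minimal PyTorch sketch of a coordinate-attention-style unit consistent with the description above (same-size input and output; the reduction ratio and activation are our assumptions):

    import torch
    import torch.nn as nn

    class CoordAttention(nn.Module):
        # Pools along each spatial axis, encodes the two direction-aware
        # descriptors jointly, then reweights the input per channel and
        # per position, so the output keeps the input tensor's size.
        def __init__(self, channels: int, reduction: int = 32):
            super().__init__()
            mid = max(8, channels // reduction)
            self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # collapse width
            self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # collapse height
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
            self.bn1 = nn.BatchNorm2d(mid)
            self.act = nn.Hardswish()
            self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
            self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            n, c, h, w = x.shape
            x_h = self.pool_h(x)                      # (n, c, h, 1)
            x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)
            y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
            y_h, y_w = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
            return x * a_h * a_w  # same size as the input, enhanced representation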
After the first feature map is obtained, target detection is performed on the first feature map through a preset target detection model to obtain five target feature maps with different sizes, and target classification and coordinate positioning are performed on the five target feature maps with different sizes to obtain a target detection result. The target detection result includes the position information and classification information of the target object to be detected in the input image, and the target detection model reflects the mapping relationship between the first feature map and the target detection result.
According to the invention, by generating five target feature maps with different sizes and performing target detection through them, the feature extraction capability for weak and small targets is effectively improved, and the detection performance for weak and small targets is improved. Meanwhile, the primary visual perception cortex imitation model is added in the preprocessing process, and the anti-interference capability of the model is improved by simulating the primary visual cortex perception mechanism of the brain. Meanwhile, a CA attention module is added in the target detection model; through spatial attention and channel attention, it enhances the utilization efficiency of the context information of the feature maps and improves the efficiency of extracting features of interest.
In one embodiment of the present invention, the step S2 includes S201 to S203.
S201: and performing feature extraction and size compression on the input image through the VOneBlock layer to obtain a second feature map.
S202: performing size compression on the input image through the Conv layer to obtain a third feature map; the size of the third feature map is the same as that of the second feature map.
S203: and performing feature value fusion on the second feature map and the third feature map through the feature fusion layer to obtain the first feature map.
As shown in fig. 2, the primary visual perception cortex imitation model of the present invention includes a VOneBlock layer, a Conv layer and a feature fusion layer, wherein the VOneBlock layer and the Conv layer are connected in parallel. The input image is used as the input of both the VOneBlock layer and the Conv layer; the VOneBlock layer performs feature extraction and size compression on the input image to obtain a second feature map, the Conv layer performs size compression on the input image to obtain a third feature map, and the size of the second feature map is the same as that of the third feature map. The second feature map and the third feature map are used as inputs of the feature fusion layer, the feature fusion layer performs feature value fusion on the second feature map and the third feature map, and finally the first feature map is output.
For example, the size of the input image is 640 × 640. After feature extraction and size compression are performed on the input image in the VOneBlock layer, the size of the obtained second feature map is 320 × 320, i.e. 1/2 of the size of the input image. After the input image is size-compressed by the Conv layer, the size of the obtained third feature map is 320 × 320, also 1/2 of the size of the input image and the same as that of the second feature map. Finally, the feature fusion layer performs feature value fusion on the second feature map and the third feature map, and the size of the obtained first feature map is also 320 × 320.
In the primary visual perception cortex imitation model, the VOneBlock layer is connected in parallel with only one Conv layer, which keeps the model simple, allows the size compression ratio of the feature map to be adjusted more flexibly, and improves the flexibility of the model.
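A minimal sketch of this two-branch preprocessing stage; here `vone_block` stands in for a Gabor-filter front end such as the open-source VOneBlock, "feature value fusion" is read as element-wise addition, and matching channel counts and sizes between the two branches are assumed (all three points are our assumptions, not spelled out in the patent):

    import torch
    import torch.nn as nn

    class VOnePreprocess(nn.Module):
        # VOneBlock branch in parallel with a plain stride-2 Conv branch;
        # both halve the spatial size, and their outputs are fused.
        def __init__(self, vone_block: nn.Module, out_channels: int = 64):
            super().__init__()
            self.vone = vone_block  # feature extraction + 2x size compression
            self.conv = nn.Conv2d(3, out_channels, kernel_size=3,
                                  stride=2, padding=1)  # 2x size compression only

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            second = self.vone(x)  # second feature map, e.g. 640x640 -> 320x320
            third = self.conv(x)   # third feature map, same 320x320 size
            return second + third  # first feature map via feature value fusion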
As shown in fig. 3, in an embodiment of the present invention, the preset object detection model includes a backbone network 101 and a head network 102, the CA attention module is disposed in both the backbone network 101 and the head network 102, and step S3 includes S301-S302.
S301: and carrying out multiple size compression and feature extraction on the first feature map through the backbone network to obtain a plurality of backbone feature maps with different sizes.
S302: and inputting the first feature map and the plurality of backbone feature maps with different sizes into the head network to obtain the five target feature maps with different sizes.
The target detection model in the invention is an improved model based on a Yolov5 model in the prior art, the target detection model comprises a backbone network 101 and a head network 102, and CA attention modules are added in both the backbone network 101 and the head network 102. The backbone network is used for carrying out size compression and feature extraction on the first feature map for multiple times to obtain a plurality of backbone feature maps with different sizes. The head network then receives a plurality of backbone feature maps of different sizes obtained through the backbone network and the first feature map, and the head network outputs five target feature maps of different sizes. As shown in fig. 3, the target detection model further includes a Detect layer 50, and the Detect layer 50 performs target classification and coordinate positioning on the five target feature maps with different sizes to output a target detection result.
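Putting the pieces together, the overall forward pass implied by fig. 3 looks roughly like the following (all names are illustrative stand-ins for the modules described here, not the patent's own code):

    def run_detection(image, vone_model, backbone, head, detect_layer):
        p1 = vone_model(image)         # first feature map, 1/2 of input size
        b1, b2, b3, b4 = backbone(p1)  # backbone maps at 1/4 ... 1/32
        t1, t2, t3, t4, t5 = head(p1, b1, b2, b3, b4)  # five target maps
        return detect_layer((t1, t2, t3, t4, t5))  # classes + box coordinates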
In an embodiment of the present invention, the backbone network includes a first Conv layer, a first C3 layer, a second Conv layer, a second C3 layer, a third Conv layer, a third C3 layer, a fourth Conv layer, a fourth C3 layer and an SPP (Spatial Pyramid Pooling) layer, and the CA attention module includes a CA attention layer disposed behind each C3 layer in the backbone network. As shown in fig. 3, the first Conv layer is the Conv layer 1 in fig. 3, the first C3 layer is the C3 layer 2 in fig. 3, and a CA attention layer 3 is further provided after the C3 layer 2. The second Conv layer is the Conv layer 4 in fig. 3, the second C3 layer is the C3 layer 5 in fig. 3, and a CA attention layer 6 is further provided after the C3 layer 5. The third Conv layer is the Conv layer 7 in fig. 3, the third C3 layer is the C3 layer 8 in fig. 3, and a CA attention layer 9 is further provided after the C3 layer 8. The fourth Conv layer is the Conv layer 10 in fig. 3, the fourth C3 layer is the C3 layer 12 in fig. 3, and a CA attention layer 13 is further provided after the C3 layer 12. Step S301 includes S3011-S3014.
S3011: and performing feature map size compression on the first feature map through the first Conv layer, and performing feature extraction on the compressed first feature map through the first C3 layer to obtain a first backbone feature map.
S3012: and performing interesting feature extraction on the first backbone feature map by a CA attention layer arranged behind a first C3 layer, performing feature map size compression on a feature map output by the CA attention layer arranged behind a first C3 layer by the second Conv layer, and performing feature extraction on the feature map output by the second Conv layer by the second C3 layer to obtain a second backbone feature map.
S3013: and performing interesting feature extraction on the second backbone feature map by a CA attention layer arranged behind a second C3 layer, performing feature map size compression on a feature map output by the CA attention layer arranged behind a second C3 layer by the third Conv layer, and performing feature extraction on the feature map output by the third Conv layer by the third C3 layer to obtain a third backbone feature map.
S3014: and performing interesting feature extraction on the third backbone feature map by a CA attention layer arranged behind a third C3 layer, performing feature map size compression on a feature map output by the CA attention layer arranged behind a third C3 layer by a fourth Conv layer, performing spatial information fusion on the feature map output by the third Conv layer by the SPP layer, and performing feature extraction on the feature map output by the SPP layer by the fourth C3 layer to obtain a fourth backbone feature map.
After the input image is obtained, the size of the first feature map obtained by simulating the primary visual perception cortex model is 1/2 of the size of the input image. Inputting the first feature map into a backbone network, acquiring the first feature map by a first Conv layer in the backbone network, performing size compression on the first feature map, inputting the feature map output by the first Conv layer into a first C3 layer for feature extraction, and outputting the first backbone feature map by a first C3 layer, wherein the size of the feature map output by the first Conv layer is 1/4 of the size of an input image. The CA attention layer 3 performs interesting feature extraction on the first backbone feature map, the second Conv layer performs size compression on the feature map output by the CA attention layer 3, the feature map output by the second Conv layer is input to the second C3 layer for feature extraction, and the second C3 layer outputs the second backbone feature map. The CA attention layer 6 performs interesting feature extraction on the second backbone feature map, the third Conv layer performs size compression on the feature map output by the CA attention layer 6, the feature map output by the third Conv layer is input to the third C3 layer for feature extraction, and the third C3 layer outputs the third backbone feature map. The CA attention layer 9 performs interesting feature extraction on the third backbone feature map, the fourth Conv layer performs size compression on the feature map output by the CA attention layer 9, the feature map output by the fourth Conv layer is input to the SPP layer to perform spatial information fusion, the feature map output by the SPP layer is input to the fourth C3 layer to perform feature extraction, and the fourth C3 layer outputs the fourth backbone feature map.
For example, the input image has a size of 640 × 640; the primary visual perception cortex imitation model performs size compression on the input image, and the first feature map output is 1/2 of the input image, i.e. 320 × 320. The first feature map is size-compressed four times through the backbone network, with feature extraction after each size compression producing a backbone feature map. After the first size compression and feature extraction through the first Conv layer and the first C3 layer, a first backbone feature map of size 160 × 160 is obtained, 1/4 of the size of the input image. After the second size compression and feature extraction through the second Conv layer and the second C3 layer, a second backbone feature map of size 80 × 80 is obtained, 1/8 of the size of the input image. After the third size compression and feature extraction through the third Conv layer and the third C3 layer, a third backbone feature map of size 40 × 40 is obtained, 1/16 of the size of the input image. After the fourth size compression and feature extraction through the fourth Conv layer, the SPP layer and the fourth C3 layer, a fourth backbone feature map of size 20 × 20 is obtained, 1/32 of the size of the input image.
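A sketch of this backbone chain in PyTorch, reusing the CoordAttention sketch above and Yolov5-style Conv, C3 and SPP blocks (minimal versions of which are sketched after the FPN description below); the channel widths are illustrative:

    import torch.nn as nn

    class Backbone(nn.Module):
        # Four stride-2 stages; a CA layer sits behind each of the first
        # three C3 layers, and SPP precedes the fourth C3 layer.
        def __init__(self, c0: int = 64):
            super().__init__()
            c1, c2, c3, c4 = c0 * 2, c0 * 4, c0 * 8, c0 * 16
            self.conv1, self.c3_1, self.ca1 = Conv(c0, c1, 3, 2), C3(c1, c1), CoordAttention(c1)
            self.conv2, self.c3_2, self.ca2 = Conv(c1, c2, 3, 2), C3(c2, c2), CoordAttention(c2)
            self.conv3, self.c3_3, self.ca3 = Conv(c2, c3, 3, 2), C3(c3, c3), CoordAttention(c3)
            self.conv4, self.spp, self.c3_4 = Conv(c3, c4, 3, 2), SPP(c4, c4), C3(c4, c4)

        def forward(self, x):                         # x: first feature map, 320x320
            b1 = self.c3_1(self.conv1(x))             # 160x160, first backbone map
            b2 = self.c3_2(self.conv2(self.ca1(b1)))  # 80x80, second backbone map
            b3 = self.c3_3(self.conv3(self.ca2(b2)))  # 40x40, third backbone map
            b4 = self.c3_4(self.spp(self.conv4(self.ca3(b3))))  # 20x20, fourth
            return b1, b2, b3, b4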
As shown in fig. 3, in an embodiment of the present invention, the head network includes four FPN modules and four feature aggregation modules connected in series, each FPN module includes a Conv layer, an Upsample (upsampling) layer, a Concat layer and a C3 (CSP Bottleneck with 3 convolutions) layer connected in sequence, and the CA attention module includes a CA attention layer disposed after the C3 layer of each FPN module and a CA attention layer disposed after each feature aggregation module. Step S302 includes:
S3021: the four FPN modules and the CA attention layers disposed behind the C3 layers of the FPN modules generate a first target feature map from the plurality of backbone feature maps with different sizes and the first feature map.
S3022: and inputting the first target feature map, the first backbone feature map and a feature map output by a Conv layer of the first FPN module into a first feature aggregation module, performing size compression and channel aggregation on a plurality of input feature maps by the first feature aggregation module, and performing interested feature extraction on the feature map output by the first feature aggregation module by a CA attention layer arranged behind the first feature aggregation module to obtain a second target feature map.
S3023: and inputting the second target feature map, the second backbone feature map and a feature map output by a Conv layer of a second FPN module into a second feature aggregation module, performing size compression and channel aggregation on the input multiple feature maps by the second feature aggregation module, and performing interested feature extraction on the feature map output by the second feature aggregation module by a CA attention layer arranged behind the second feature aggregation module to obtain a third target feature map.
S3024: and inputting the third target feature map, the third backbone feature map and a feature map output by a Conv layer of a third FPN module into a third feature aggregation module, wherein the third feature aggregation module performs size compression and channel aggregation on the input multiple feature maps, and a CA attention layer arranged behind the third feature aggregation module performs interested feature extraction on the feature map output by the third feature aggregation module to obtain a fourth target feature map.
S3025: and inputting the fourth target feature map, the fourth backbone feature map and a feature map output by a Conv layer of a fourth FPN module into a fourth feature aggregation module, wherein the fourth feature aggregation module performs size compression and channel aggregation on the input multiple feature maps, and a CA attention layer arranged behind the fourth feature aggregation module performs interested feature extraction on the feature map output by the fourth feature aggregation module to obtain a fifth target feature map.
The Conv layer, Upsample layer, Concat layer and C3 layer in the FPN module are consistent with the existing Yolov5 model. The Conv layer is a basic convolution unit that performs two-dimensional convolution, regularization and activation operations on its input in sequence. The C3 layer is composed of a number of Bottleneck modules; the Bottleneck is a classical residual structure in which the input passes through two convolutional layers and is then combined with the original value by an Add operation, completing residual feature transfer without increasing the output depth.
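For reference, minimal Yolov5-style sketches of the Conv, Bottleneck, C3 and SPP blocks described above (the exact widths and the SiLU activation are assumptions on our part):

    import torch
    import torch.nn as nn

    class Conv(nn.Module):
        # Basic convolution unit: two-dimensional convolution, then
        # regularization (BatchNorm), then activation, as described above.
        def __init__(self, c_in, c_out, k=1, s=1):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
            self.bn = nn.BatchNorm2d(c_out)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))

    class Bottleneck(nn.Module):
        # Classical residual structure: the input passes through two
        # convolutional layers and is then Added to the original value,
        # completing residual feature transfer without changing depth.
        def __init__(self, c):
            super().__init__()
            self.cv1 = Conv(c, c, 1, 1)
            self.cv2 = Conv(c, c, 3, 1)

        def forward(self, x):
            return x + self.cv2(self.cv1(x))

    class C3(nn.Module):
        # CSP Bottleneck with 3 convolutions: two parallel 1x1 branches,
        # a stack of Bottlenecks on one branch, Concat, then a fusing 1x1.
        def __init__(self, c_in, c_out, n=1):
            super().__init__()
            c_mid = c_out // 2
            self.cv1, self.cv2 = Conv(c_in, c_mid, 1, 1), Conv(c_in, c_mid, 1, 1)
            self.cv3 = Conv(2 * c_mid, c_out, 1, 1)
            self.m = nn.Sequential(*(Bottleneck(c_mid) for _ in range(n)))

        def forward(self, x):
            return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

    class SPP(nn.Module):
        # Spatial pyramid pooling: parallel max-pools at several kernel
        # sizes, concatenated and fused, for spatial information fusion.
        def __init__(self, c_in, c_out, ks=(5, 9, 13)):
            super().__init__()
            c_mid = c_in // 2
            self.cv1 = Conv(c_in, c_mid, 1, 1)
            self.cv2 = Conv(c_mid * (len(ks) + 1), c_out, 1, 1)
            self.pools = nn.ModuleList(
                nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in ks)

        def forward(self, x):
            x = self.cv1(x)
            return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))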
After the first feature map is input into the backbone network, the backbone network outputs four backbone feature maps with different sizes, namely a first backbone feature map, a second backbone feature map, a third backbone feature map and a fourth backbone feature map. The five target feature maps with different sizes are respectively a first target feature map, a second target feature map, a third target feature map, a fourth target feature map and a fifth target feature map. The four feature aggregation modules are respectively a first feature aggregation module, a second feature aggregation module, a third feature aggregation module and a fourth feature aggregation module, and each aggregation module includes a Conv layer, a Concat layer and a C3 layer. As shown in fig. 3, the first feature aggregation module includes a Conv layer 34, a Concat layer 35 and a C3 layer 36, the second feature aggregation module includes a Conv layer 38, a Concat layer 39 and a C3 layer 40, the third feature aggregation module includes a Conv layer 42, a Concat layer 43 and a C3 layer 44, and the fourth feature aggregation module includes a Conv layer 46, a Concat layer 47 and a C3 layer 48. The CA attention module includes a CA attention layer 37 disposed behind the C3 layer 36 in the first feature aggregation module, a CA attention layer 41 disposed behind the C3 layer 40 in the second feature aggregation module, a CA attention layer 45 disposed behind the C3 layer 44 in the third feature aggregation module, and a CA attention layer 49 disposed behind the C3 layer 48 in the fourth feature aggregation module.
The four FPN modules are respectively a first FPN module, a second FPN module, a third FPN module and a fourth FPN module, connected in series in sequence. As shown in fig. 3, the first FPN module includes a Conv layer 29, an Upsample layer 30, a Concat layer 31 and a C3 layer 32, the second FPN module includes a Conv layer 24, an Upsample layer 25, a Concat layer 26 and a C3 layer 27, the third FPN module includes a Conv layer 19, an Upsample layer 20, a Concat layer 21 and a C3 layer 22, and the fourth FPN module includes a Conv layer 14, an Upsample layer 15, a Concat layer 16 and a C3 layer 17. As shown in fig. 3, the CA attention module includes a CA attention layer 33 disposed behind the C3 layer 32 of the first FPN module, a CA attention layer 28 disposed behind the C3 layer 27 of the second FPN module, a CA attention layer 23 disposed behind the C3 layer 22 of the third FPN module, and a CA attention layer 18 disposed behind the C3 layer 17 of the fourth FPN module, with a CA attention layer 13 disposed in front of the Conv layer 14 of the fourth FPN module.
The output of the fourth FPN module is processed by the CA attention layer 18 and then used as the input of the third FPN module, the output of the third FPN module is processed by the CA attention layer 23 and then used as the input of the second FPN module, and the output of the second FPN module is processed by the CA attention layer 28 and then used as the input of the first FPN module. Meanwhile, the fourth FPN module receives the third backbone feature map, and the fourth backbone feature map is also used as an input of the fourth FPN module after being processed by the CA attention layer 13, the third FPN module further receives the second backbone feature map as an input, the second FPN module further receives the first backbone feature map as an input, the first FPN module further receives the first feature map as an input, and finally the feature map output by the first FPN module is processed by the CA attention layer 33 and outputs the first target feature map.
After the first target feature map is obtained, the first target feature map, the first backbone feature map and the feature map output by the Conv layer of the first FPN module are input into the first feature aggregation module. The Conv layer in the first feature aggregation module performs size compression on the first target feature map, the Concat layer in the first feature aggregation module performs channel aggregation on the feature map output by the Conv layer in the first feature aggregation module, the first backbone feature map and the feature map output by the Conv layer in the first FPN module, the C3 layer in the first feature aggregation module performs feature extraction on the feature map output by the Concat layer in the first feature aggregation module, the feature map output by the C3 layer in the first feature aggregation module is input into the CA attention layer 37, and the CA attention layer 37 outputs the second target feature map.
After the second target feature map is obtained, the second target feature map, the second backbone feature map and the feature map output by the Conv layer of the second FPN module are input into the second feature aggregation module. The Conv layer in the second feature aggregation module performs size compression on the second target feature map, the Concat layer in the second feature aggregation module performs channel aggregation on the feature map output by the Conv layer in the second feature aggregation module, the second backbone feature map and the feature map output by the Conv layer in the second FPN module, the C3 layer in the second feature aggregation module performs feature extraction on the feature map output by the Concat layer in the second feature aggregation module, the feature map output by the C3 layer in the second feature aggregation module is input to the CA attention layer 41, and the CA attention layer 41 outputs the third target feature map.
And after obtaining a third target feature map, inputting the third target feature map, the third backbone feature map and a feature map output by a Conv layer of the third FPN module into a third feature aggregation module. The Conv layer in the third feature aggregation module performs size compression on the third target feature map, the Concat layer in the third feature aggregation module performs channel aggregation on the feature map output by the Conv layer in the third feature aggregation module, the third backbone feature map and the feature map output by the Conv layer in the third FPN module, the C3 layer in the third feature aggregation module performs feature extraction on the feature map output by the Concat layer in the third feature aggregation module, the feature map output by the C3 layer in the third feature aggregation module is input to the CA attention layer 45, and the CA attention layer 45 outputs the fourth target feature map.
And after the fourth target feature map is obtained, inputting the fourth target feature map, the fourth backbone feature map and the feature map output by the Conv layer of the fourth FPN module into a fourth feature aggregation module. The Conv layer in the fourth feature aggregation module performs size compression on the fourth target feature map, the Concat layer in the fourth feature aggregation module performs channel aggregation on the feature map output by the Conv layer in the fourth feature aggregation module, the fourth backbone feature map and the feature map output by the Conv layer in the fourth FPN module, the C3 layer in the fourth feature aggregation module performs feature extraction on the feature map output by the Concat layer in the fourth feature aggregation module, the feature map output by the C3 layer in the fourth feature aggregation module is input to the CA attention layer 49, and the CA attention layer 49 outputs the fifth target feature map.
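Each of the four aggregation steps above follows the same pattern; a sketch of one feature aggregation module plus its trailing CA layer, reusing the Conv, C3 and CoordAttention sketches from earlier, with illustrative channel counts:

    import torch
    import torch.nn as nn

    class FeatureAggregation(nn.Module):
        # Stride-2 Conv on the incoming target map, channel Concat with the
        # matching backbone map and the FPN module's Conv output (all at the
        # same resolution), C3, then a CA layer producing the next target map.
        def __init__(self, c_target, c_backbone, c_fpn, c_out):
            super().__init__()
            self.down = Conv(c_target, c_target, 3, 2)  # size compression
            self.c3 = C3(c_target + c_backbone + c_fpn, c_out)
            self.ca = CoordAttention(c_out)

        def forward(self, target_map, backbone_map, fpn_conv_map):
            y = self.down(target_map)
            y = torch.cat((y, backbone_map, fpn_conv_map), dim=1)  # channel aggregation
            return self.ca(self.c3(y))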
In one example, the size of the input image is 640 × 640, the size of the first feature map obtained through the primary visual perception cortex imitation model is 320 × 320, and the size of the first target feature map output by the first FPN module is also 320 × 320. The first feature map is input into the backbone network to obtain a first backbone feature map of size 160 × 160, a second backbone feature map of size 80 × 80, a third backbone feature map of size 40 × 40 and a fourth backbone feature map of size 20 × 20. The size of the finally obtained first target feature map is the same as that of the first feature map, namely 320 × 320; the size of the second target feature map is the same as that of the first backbone feature map, namely 160 × 160; the size of the third target feature map is the same as that of the second backbone feature map, namely 80 × 80; the size of the fourth target feature map is the same as that of the third backbone feature map, namely 40 × 40; and the size of the fifth target feature map is the same as that of the fourth backbone feature map, namely 20 × 20.
In one embodiment of the present invention, step S3021 includes: inputting the fourth backbone feature map and the third backbone feature map into a fourth FPN module to obtain an output feature map of the fourth FPN module;
extracting features of interest from the feature map output by the fourth FPN module through the CA attention layer disposed behind the C3 layer of the fourth FPN module, and inputting the feature map output by that CA attention layer and the second backbone feature map into the third FPN module to obtain an output feature map of the third FPN module;
extracting features of interest from the feature map output by the third FPN module through the CA attention layer disposed behind the C3 layer of the third FPN module, and inputting the feature map output by that CA attention layer and the first backbone feature map into the second FPN module to obtain an output feature map of the second FPN module;
and extracting features of interest from the feature map output by the second FPN module through the CA attention layer disposed behind the C3 layer of the second FPN module, inputting the feature map output by that CA attention layer and the first feature map into the first FPN module, and extracting features of interest from the feature map output by the first FPN module through the CA attention layer disposed behind the C3 layer of the first FPN module to obtain the first target feature map.
As shown in fig. 3, the fourth backbone feature map is input to the CA attention layer 13, the CA attention layer 13 extracts features of interest from the fourth backbone feature map, the feature map output by the CA attention layer 13 is input to the Conv layer of the fourth FPN module, and the third backbone feature map is input to the Concat layer of the fourth FPN module. The Conv layer of the fourth FPN module performs channel compression on the feature map output by the CA attention layer 13, and the Upsample layer of the fourth FPN module performs size expansion on the feature map output by the Conv layer of the fourth FPN module, so that the size of the feature map output by the Upsample layer of the fourth FPN module is the same as that of the third backbone feature map. The Concat layer of the fourth FPN module performs channel aggregation on the third backbone feature map and the feature map output by the Upsample layer of the fourth FPN module, the C3 layer of the fourth FPN module performs feature extraction on the feature map output by the Concat layer of the fourth FPN module, and the feature map output by the C3 layer of the fourth FPN module is the feature map output by the fourth FPN module.
The feature map output by the fourth FPN module is input to the CA attention layer 18, the CA attention layer 18 extracts features of interest from it, the feature map output by the CA attention layer 18 is input to the Conv layer of the third FPN module, and the second backbone feature map is input to the Concat layer of the third FPN module. The Conv layer of the third FPN module performs channel compression on the feature map output by the CA attention layer 18, and the Upsample layer of the third FPN module performs size expansion on the feature map output by the Conv layer of the third FPN module, so that the size of the feature map output by the Upsample layer of the third FPN module is the same as that of the second backbone feature map. The Concat layer of the third FPN module performs channel aggregation on the second backbone feature map and the feature map output by the Upsample layer of the third FPN module, the C3 layer of the third FPN module performs feature extraction on the feature map output by the Concat layer of the third FPN module, and the feature map output by the C3 layer of the third FPN module is the feature map output by the third FPN module.
The feature map output by the third FPN module is input to the CA attention layer 23, the CA attention layer 23 extracts features of interest from it, the feature map output by the CA attention layer 23 is input to the Conv layer of the second FPN module, and the first backbone feature map is input to the Concat layer of the second FPN module. The Conv layer of the second FPN module performs channel compression on the feature map output by the CA attention layer 23, and the Upsample layer of the second FPN module performs size expansion on the feature map output by the Conv layer of the second FPN module, so that the size of the feature map output by the Upsample layer of the second FPN module is the same as that of the first backbone feature map. The Concat layer of the second FPN module performs channel aggregation on the first backbone feature map and the feature map output by the Upsample layer of the second FPN module, the C3 layer of the second FPN module performs feature extraction on the feature map output by the Concat layer of the second FPN module, and the feature map output by the C3 layer of the second FPN module is the feature map output by the second FPN module.
The feature map output by the second FPN module is input to the CA attention layer 28, the CA attention layer 28 extracts features of interest from the feature map output by the second FPN module, the feature map output by the CA attention layer 28 is input to the Conv layer of the first FPN module, and the first feature map is input to the Concat layer of the first FPN module. The Conv layer of the first FPN module performs channel compression on the feature map output by the CA attention layer 28, and the Upsample layer of the first FPN module performs size expansion on the feature map output by the Conv layer of the first FPN module, so that the size of the feature map output by the Upsample layer of the first FPN module is the same as that of the first feature map. The Concat layer of the first FPN module performs channel aggregation on the first feature map and the feature map output by the Upsample layer of the first FPN module, the C3 layer of the first FPN module performs feature extraction on the feature map output by the Concat layer of the first FPN module, the feature map output by the C3 layer of the first FPN module is input to the CA attention layer 33, and the CA attention layer 33 outputs the first target feature map.
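A sketch of one FPN module in this top-down chain, again reusing the Conv and C3 sketches above with illustrative channel counts:

    import torch
    import torch.nn as nn

    class FPNModule(nn.Module):
        # 1x1 Conv for channel compression, 2x Upsample for size expansion,
        # channel Concat with the skip feature map, then C3.
        def __init__(self, c_in, c_skip, c_out):
            super().__init__()
            self.compress = Conv(c_in, c_out, 1, 1)  # this output also feeds aggregation
            self.up = nn.Upsample(scale_factor=2, mode="nearest")
            self.c3 = C3(c_out + c_skip, c_out)

        def forward(self, x, skip):
            y = self.compress(x)
            y = torch.cat((self.up(y), skip), dim=1)  # now matches skip's size
            return self.c3(y)

    # One top-down step, e.g. for the fourth FPN module (sketch):
    #   f4 = fpn4(ca13(b4), skip=b3)  # CA layer 13, then the fourth FPN module
    #   f3 = fpn3(ca18(f4), skip=b2)  # and so on up the chain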
As shown in fig. 4, an embodiment of the present invention provides a small target detection apparatus introducing an attention mechanism, namely the target detection apparatus 200 shown in fig. 4. The target detection apparatus 200 is used to implement the small target detection method introducing an attention mechanism according to any embodiment of the present invention, and the target detection apparatus 200 includes:
an image acquisition module 201, configured to acquire an input image;
the preprocessing module 202 is configured to perform feature extraction on the input image through a preset primary visual perception cortex simulation model to obtain a first feature map, where the preset primary visual perception cortex simulation model comprises a VOneBlock layer, a Conv layer and a feature fusion layer (a simplified sketch of this preprocessing path is given after this list);
the target detection module 203 is configured to perform target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes, where the target detection model is an improved model based on the Yolov5 model and includes a CA attention module, and to perform target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
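The simplified preprocessing sketch referred to in the module list above is given here. It is a hypothetical stand-in, not the disclosed implementation: a frozen convolution plays the role of the VOneBlock branch (the actual VOneBlock of Dapello et al. models a fixed Gabor filter bank, simple/complex cells and neural noise), and the feature fusion layer is assumed to be element-wise addition; all layer sizes are illustrative.

    import torch.nn as nn

    class PreprocessSketch(nn.Module):
        # Simplified stand-in for the primary visual perception cortex
        # simulation model: a frozen "V1-like" branch plus a trainable
        # strided Conv branch, fused element-wise (an assumption).
        def __init__(self, in_ch=3, out_ch=32):
            super().__init__()
            self.vone = nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=2, padding=3)
            for p in self.vone.parameters():
                p.requires_grad_(False)  # frozen front end, V1-like filters
            self.act = nn.ReLU(inplace=True)
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)

        def forward(self, img):
            second = self.act(self.vone(img))  # "second feature map"
            third = self.conv(img)             # "third feature map", same size
            return second + third              # fusion yields the "first feature map"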
By generating five target feature maps with different sizes and performing target detection on all five of them, the invention effectively improves the feature extraction capability for weak and small targets and thereby their detection performance. In addition, the primary visual perception cortex simulation model added in the preprocessing stage improves the anti-interference capability of the model by imitating the primary-visual-cortex perception mechanism of the brain. Finally, the CA attention module added to the target detection model strengthens the use of feature-map context information and makes the extraction of features of interest more efficient through its combined spatial and channel attention (a minimal sketch of such a CA layer follows).
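The CA layer just mentioned can be sketched as follows. The name "CA attention" is read here as the coordinate attention of Hou et al. (CVPR 2021), which factorizes attention into height-wise and width-wise components; this is a minimal sketch under that assumption, with the reduction ratio and activation being conventional choices rather than values fixed by this disclosure.

    import torch
    import torch.nn as nn

    class CoordAtt(nn.Module):
        # Coordinate attention: pool along each spatial axis, encode the two
        # direction-aware descriptors jointly, then re-weight the input with
        # separate height and width attention maps.
        def __init__(self, channels, reduction=32):
            super().__init__()
            mid = max(8, channels // reduction)
            self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
            self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
            self.bn1 = nn.BatchNorm2d(mid)
            self.act = nn.Hardswish()
            self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
            self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

        def forward(self, x):
            _, _, h, w = x.shape
            x_h = self.pool_h(x)                      # (B, C, H, 1)
            x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (B, C, W, 1)
            y = torch.cat([x_h, x_w], dim=2)          # joint encoding of both axes
            y = self.act(self.bn1(self.conv1(y)))
            y_h, y_w = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                      # height attention
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # width attention
            return x * a_h * a_w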
In one embodiment of the present invention, the target detection module includes:
the backbone network module is used for performing multiple rounds of size compression and feature extraction on the first feature map to obtain a plurality of backbone feature maps with different sizes (see the sketch after this list);
and the head network module is used for receiving the first feature map and the plurality of backbone feature maps with different sizes to obtain the five target feature maps with different sizes.
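The backbone sketch referred to in the list above, in the same hedged PyTorch style: alternating strided Conv stages (size compression) and C3 blocks (feature extraction), a CA attention layer after each of the first three C3 blocks, and an SPP layer before the last C3 block. The Conv, C3, CA and SPP blocks are passed in as stand-ins for the corresponding YOLOv5 blocks and the CA layer sketched above; only the wiring is asserted here.

    def backbone_forward(x, convs, c3s, cas, spp):
        # x: the first feature map; convs, c3s: four strided Conv layers and
        # four C3 blocks; cas: three CA attention layers; spp: an SPP block.
        # Returns the four backbone feature maps at decreasing resolutions.
        p1 = c3s[0](convs[0](x))                # first backbone feature map
        p2 = c3s[1](convs[1](cas[0](p1)))       # second backbone feature map
        p3 = c3s[2](convs[2](cas[1](p2)))       # third backbone feature map
        p4 = c3s[3](spp(convs[3](cas[2](p3))))  # SPP before the final C3
        return p1, p2, p3, p4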
In an embodiment of the present invention, the target detection module includes a detection head module, configured to perform target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
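In the YOLOv5 style this model builds on, such a detection head is conventionally one 1x1 convolution per scale, with each cell predicting box coordinates, an objectness score and class scores for every anchor. The following minimal sketch assumes that convention; the anchor and class counts are placeholders, not disclosed values.

    import torch.nn as nn

    class DetectHeadSketch(nn.Module):
        # One 1x1 Conv per target feature map, each predicting
        # num_anchors * (5 + num_classes) values per cell:
        # 4 box coordinates + 1 objectness score + num_classes scores.
        def __init__(self, channels, num_classes, num_anchors=3):
            super().__init__()
            out = num_anchors * (5 + num_classes)
            self.heads = nn.ModuleList(
                nn.Conv2d(c, out, kernel_size=1) for c in channels
            )

        def forward(self, feats):  # feats: the five target feature maps
            return [head(f) for head, f in zip(self.heads, feats)]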
As shown in fig. 5, an embodiment of the present invention introduces an electronic device 300 that includes a processor 301 and a memory 302, where the memory 302 stores a program executable on the processor 301, and the program, when executed by the processor 301, implements the target detection method according to any embodiment of the present invention.
An embodiment of the present invention further introduces a computer-readable storage medium on which a computer program is stored, where the computer program, when executed, implements the target detection method according to any embodiment of the present invention.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store the instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be interpreted as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, light pulses through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA) or a programmable logic array (PLA), with state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions. It is well known to those skilled in the art that implementations by hardware, by software, and by a combination of software and hardware are all equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method for detecting a small target by introducing an attention mechanism, the method comprising:
acquiring an input image;
performing feature extraction on the input image through a preset primary visual perception cortex simulation model to obtain a first feature map, wherein the preset primary visual perception cortex simulation model comprises a VOneBlock layer, a Conv layer and a feature fusion layer;
performing target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes, wherein the target detection model is an improved model based on the Yolov5 model, and the target detection model comprises a CA attention module;
and performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
2. The method of claim 1, wherein the preset target detection model comprises a backbone network and a head network, a CA attention module is disposed in each of the backbone network and the head network, and performing target detection on the first feature map through the preset target detection model to obtain five target feature maps with different sizes comprises:
performing size compression and feature extraction on the first feature map multiple times through the backbone network to obtain a plurality of backbone feature maps with different sizes;
and inputting the first feature map and the plurality of backbone feature maps with different sizes into the head network to obtain the five target feature maps with different sizes.
3. The method of claim 2, wherein the head network comprises four FPN modules and four feature aggregation modules connected in series, each FPN module comprising a Conv layer, an Upsample layer, a Concat layer and a C3 layer connected in series, the CA attention module comprising a CA attention layer disposed after the C3 layer of each FPN module and a CA attention layer disposed after each feature aggregation module, and wherein inputting the first feature map and the plurality of backbone feature maps with different sizes into the head network to obtain the five target feature maps with different sizes comprises:
generating, by the four FPN modules and the CA attention layers disposed after the C3 layer of each FPN module, a first target feature map from the plurality of backbone feature maps with different sizes and the first feature map;
inputting the first target feature map, the first backbone feature map and the feature map output by the Conv layer of a first FPN module into a first feature aggregation module, wherein the first feature aggregation module performs size compression and channel aggregation on the input feature maps, and the CA attention layer disposed after the first feature aggregation module extracts features of interest from the feature map output by the first feature aggregation module to obtain a second target feature map;
inputting the second target feature map, the second backbone feature map and the feature map output by the Conv layer of a second FPN module into a second feature aggregation module, wherein the second feature aggregation module performs size compression and channel aggregation on the input feature maps, and the CA attention layer disposed after the second feature aggregation module extracts features of interest from the feature map output by the second feature aggregation module to obtain a third target feature map;
inputting the third target feature map, the third backbone feature map and the feature map output by the Conv layer of a third FPN module into a third feature aggregation module, wherein the third feature aggregation module performs size compression and channel aggregation on the input feature maps, and the CA attention layer disposed after the third feature aggregation module extracts features of interest from the feature map output by the third feature aggregation module to obtain a fourth target feature map;
and inputting the fourth target feature map, the fourth backbone feature map and the feature map output by the Conv layer of a fourth FPN module into a fourth feature aggregation module, wherein the fourth feature aggregation module performs size compression and channel aggregation on the input feature maps, and the CA attention layer disposed after the fourth feature aggregation module extracts features of interest from the feature map output by the fourth feature aggregation module to obtain a fifth target feature map.
4. The method of claim 3, wherein the generating, by the four FPN modules and the CA attention layers disposed after the C3 layer of each FPN module, a first target feature map from the plurality of backbone feature maps with different sizes and the first feature map comprises:
inputting the fourth backbone feature map and the third backbone feature map into a fourth FPN module to obtain an output feature map of the fourth FPN module;
extracting features of interest from the feature map output by the fourth FPN module through the CA attention layer disposed after the C3 layer of the fourth FPN module, and inputting the feature map output by that CA attention layer together with the second backbone feature map into a third FPN module to obtain an output feature map of the third FPN module;
extracting features of interest from the feature map output by the third FPN module through the CA attention layer disposed after the C3 layer of the third FPN module, and inputting the feature map output by that CA attention layer together with the first backbone feature map into a second FPN module to obtain an output feature map of the second FPN module;
and extracting features of interest from the feature map output by the second FPN module through the CA attention layer disposed after the C3 layer of the second FPN module, inputting the feature map output by that CA attention layer together with the first feature map into a first FPN module, and extracting features of interest from the feature map output by the first FPN module through the CA attention layer disposed after the C3 layer of the first FPN module to obtain the first target feature map.
5. The method of claim 2, wherein the backbone network comprises a first Conv layer, a first C3 layer, a second Conv layer, a second C3 layer, a third Conv layer, a third C3 layer, a fourth Conv layer, a fourth C3 layer and an SPP layer, the CA attention module comprises a CA attention layer disposed after each C3 layer in the backbone network, and wherein performing size compression and feature extraction on the first feature map multiple times through the backbone network to obtain a plurality of backbone feature maps with different sizes comprises:
compressing the feature map size of the first feature map through the first Conv layer, and performing feature extraction on the compressed first feature map through the first C3 layer to obtain a first backbone feature map;
extracting features of interest from the first backbone feature map through the CA attention layer disposed after the first C3 layer, compressing the feature map size of the feature map output by that CA attention layer through the second Conv layer, and performing feature extraction on the feature map output by the second Conv layer through the second C3 layer to obtain a second backbone feature map;
extracting features of interest from the second backbone feature map through the CA attention layer disposed after the second C3 layer, compressing the feature map size of the feature map output by that CA attention layer through the third Conv layer, and performing feature extraction on the feature map output by the third Conv layer through the third C3 layer to obtain a third backbone feature map;
and extracting features of interest from the third backbone feature map through the CA attention layer disposed after the third C3 layer, compressing the feature map size of the feature map output by that CA attention layer through the fourth Conv layer, performing spatial information fusion on the feature map output by the fourth Conv layer through the SPP layer, and performing feature extraction on the feature map output by the SPP layer through the fourth C3 layer to obtain a fourth backbone feature map.
6. The method of claim 1, wherein the performing feature extraction on the input image through a preset primary visual perception cortex simulation model to obtain a first feature map comprises:
performing feature extraction and size compression on the input image through the VOneBlock layer to obtain a second feature map;
performing size compression on the input image through the Conv layer to obtain a third feature map, wherein the size of the third feature map is the same as that of the second feature map;
and performing feature value fusion on the second feature map and the third feature map through the feature fusion layer to obtain the first feature map.
7. A small target detection apparatus introducing an attention mechanism, the apparatus comprising:
the image acquisition module is used for acquiring an input image;
the preprocessing module is used for performing feature extraction on the input image through a preset primary visual perception cortex simulation model to obtain a first feature map, wherein the preset primary visual perception cortex simulation model comprises a VOneBlock layer, a Conv layer and a feature fusion layer;
the target detection module is used for performing target detection on the first feature map through a preset target detection model to obtain five target feature maps with different sizes, wherein the target detection model is an improved model based on the Yolov5 model and comprises a CA attention module, and for performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
8. The apparatus of claim 7, wherein the object detection module comprises:
the backbone network module is used for performing multiple rounds of size compression and feature extraction on the first feature map to obtain a plurality of backbone feature maps with different sizes;
and the head network module is used for receiving the first feature map and the plurality of backbone feature maps with different sizes to obtain the five target feature maps with different sizes.
9. The apparatus of claim 8, wherein the target detection module comprises:
and the detection head module is used for performing target classification and coordinate positioning on the five target feature maps with different sizes to obtain a target detection result.
10. An electronic device comprising a processor and a memory, the memory storing a program executable on the processor, wherein the program, when executed by the processor, implements the method for detecting a small target according to any one of claims 1-6.
CN202210486933.0A 2022-05-06 2022-05-06 Method, device and equipment for detecting small target by introducing attention mechanism Pending CN115035563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210486933.0A CN115035563A (en) 2022-05-06 2022-05-06 Method, device and equipment for detecting small target by introducing attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210486933.0A CN115035563A (en) 2022-05-06 2022-05-06 Method, device and equipment for detecting small target by introducing attention mechanism

Publications (1)

Publication Number Publication Date
CN115035563A true CN115035563A (en) 2022-09-09

Family

ID=83119436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210486933.0A Pending CN115035563A (en) 2022-05-06 2022-05-06 Method, device and equipment for detecting small target by introducing attention mechanism

Country Status (1)

Country Link
CN (1) CN115035563A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409083A (en) * 2023-12-14 2024-01-16 珠海市金锐电力科技有限公司 Cable terminal identification method and device based on infrared image and improved YOLOV5
CN117409083B (en) * 2023-12-14 2024-03-22 珠海市金锐电力科技有限公司 Cable terminal identification method and device based on infrared image and improved YOLOV5

Similar Documents

Publication Publication Date Title
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
CN108830235B (en) Method and apparatus for generating information
CN108509915B (en) Method and device for generating face recognition model
CN109165573B (en) Method and device for extracting video feature vector
CN107277615B (en) Live broadcast stylization processing method and device, computing device and storage medium
Singh et al. Single image dehazing for a variety of haze scenarios using back projected pyramid network
CN111046027A (en) Missing value filling method and device for time series data
CN111028279A (en) Point cloud data processing method and device, electronic equipment and storage medium
CN113538235B (en) Training method and device for image processing model, electronic equipment and storage medium
CN108595211B (en) Method and apparatus for outputting data
CN113015022A (en) Behavior recognition method and device, terminal equipment and computer readable storage medium
Cameron et al. Design considerations for the processing system of a CNN-based automated surveillance system
JP7324891B2 (en) Backbone network generation method, apparatus, electronic equipment, storage medium and computer program
CN115035563A (en) Method, device and equipment for detecting small target by introducing attention mechanism
CN115661336A (en) Three-dimensional reconstruction method and related device
CN115035565A (en) Visual cortex imitated multi-scale small target detection method, device and equipment
CN114925320A (en) Data processing method and related device
CN115775310A (en) Data processing method and device, electronic equipment and storage medium
CN117894038A (en) Method and device for generating object gesture in image
CN115147547B (en) Human body reconstruction method and device
CN108460768B (en) Video attention object segmentation method and device for hierarchical time domain segmentation
CN113554550B (en) Training method and device for image processing model, electronic equipment and storage medium
CN112613383A (en) Joint point detection method, posture recognition method and device
CN113378774A (en) Gesture recognition method, device, equipment, storage medium and program product
CN115311468A (en) Target detection method and device integrating attention mechanism and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination