WO2022213395A1

WO2022213395A1 - Light-weighted target detection method and device, and storage medium

Info

Publication number: WO2022213395A1
Application number: PCT/CN2021/086476
Authority: WO
Inventors: 张伟烽; 胡庆茂
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-04-06
Filing date: 2021-04-12
Publication date: 2022-10-13
Also published as: CN115187820A

Abstract

A light-weighted target detection method and device, and a storage medium. The method comprises: acquiring an image of a target to be detected (S1); by means of a first convolutional layer, performing dimensionality reduction on the image to obtain an image which has been subjected to dimensionality reduction (S2); by means of a plurality of second convolutional layers, respectively compressing the image which has been subjected to dimensionality reduction, so as to obtain a plurality of first branches (S3); respectively extracting first feature maps of the plurality of first branches, and splicing the first feature maps of the plurality of first branches to obtain a first spliced feature map (S4); by means of a first pooling layer, performing down-sampling on the first spliced feature map to obtain a sampled feature map (S5); by means of a plurality of block modules, performing feature extraction on the sampled feature map to obtain a feature map of the image (S6); and according to the feature map of the image, performing detection to obtain a detection result of the target to be detected (S7). By means of the method, by using a cross-channel branch policy at a feature extraction stage, feature maps of a plurality of branches are spliced to serve as a basis for subsequent feature extraction, such that a receiving range is expanded, and more low-level functions are preserved, thereby ensuring detection speed and also increasing the accuracy rate.

Description

Lightweight target detection method, device and storage medium

technical field

The invention relates to the technical field of image processing, and in particular, to a lightweight target detection method, device and storage medium.

Background technique

Object detection is the basic visual recognition in computer vision and is widely used in areas such as autonomous driving and safety inspection. With the great success of deep learning in image classification tasks in recent years, object detection networks based on convolutional neural networks (CNNs) have gradually become mainstream. Common CNN-based target detection networks include Faster R-CNN, R-FCN, SSD, YOLO, etc. These target detection networks all rely on complex network structures, and the calculation volume indicator floating point operations (million floating point operations, MFLOPs) Both reach five figures, and run accurately and quickly on server GPUs. Due to the limited computing power and memory on the mobile device side, it cannot carry too many network parameters and computing requirements. Obviously, these target detection networks are not suitable for real-time deployment and application in mobile scenarios. The existing lightweight target detection networks include MobileNet-SSD, MobileNetV2-SSD Lite, Tiny-YOLO, D-YOLO, etc. However, these lightweight object detection networks do not achieve a good balance between accuracy and model complexity.

SUMMARY OF THE INVENTION

In order to solve the deficiencies of the prior art, the present invention provides a lightweight target detection method, device and storage medium, which can improve the accuracy of target detection while ensuring the detection speed.

The specific technical solution proposed by the present invention is: a lightweight target detection method, the target detection method includes:

Obtain the image of the target to be detected;

reducing the dimension of the image through the first convolution layer to obtain a dimension-reduced image;

The images after the dimensionality reduction are respectively compressed through a plurality of second convolution layers to obtain a plurality of first branches, and the plurality of first branches have the same number of channels;

Extracting the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, the first feature maps of the multiple first branches increasing in depth;

Downsampling the first stitched feature map through the first pooling layer to obtain a sampled feature map;

Perform feature extraction on the sampled feature map through a plurality of block modules to obtain the feature map of the image;

The detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained.

Further, before performing feature extraction on the sampled feature map through multiple block modules to obtain the feature map of the image, the target detection method further includes:

compressing the sampled feature maps through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, and the plurality of second branches have the same number of channels;

Extracting the second feature maps of the multiple second branches respectively and splicing the second feature maps of the multiple second branches to obtain a second splicing feature map, the second feature maps of the multiple second branches increasing in depth;

Correspondingly, the sampled feature map is extracted by multiple block modules to obtain the feature map of the image, including:

The second stitched feature map is extracted by a plurality of block modules to obtain the feature map of the image.

Further, for the first branch/second branch with the smallest depth, the dimensionality-reduced image/sampled feature map is compressed through the second convolution layer, including:

The dimensionality-reduced image/sampled feature map is sequentially pooled and compressed through the second pooling layer and the second convolutional layer.

Further, extract the first feature maps of the multiple first branches/second feature maps of the second branches respectively and perform the first feature maps of the multiple first branches/the second feature maps of the second branches. Splicing to obtain the first splicing feature map/second splicing feature map, including:

The output of the previous first branch/second branch is used as the residual part of the next first branch/second branch and the next first branch/second branch is the same as the feature and the residual part of the depth. The residual part is fused to obtain a cross-branch feature map after the fusion of multiple first branches/second branches;

Perform feature extraction on the cross-branch feature maps after the fusion of multiple first branches/second branches through the third convolution layer, and obtain the first feature maps of multiple first branches/second feature maps of the second branch respectively;

splicing the first feature maps of the plurality of first branches and merging with the dimensionality-reduced image/splicing the second feature maps of the plurality of second branches and merging them with the sampled features The images are fused to obtain the first mosaic feature map/second mosaic feature map.

Further, feature extraction is performed on the sampled feature map/second stitched feature map through multiple block modules to obtain the feature map of the image, including:

Obtain the first scale feature map, the second scale feature map, and the third scale feature map in turn by passing the sampled feature map/second splicing feature map through multiple block modules;

down-sampling the third-scale feature map to obtain a fourth-scale feature map;

respectively up-sampling the third-scale feature map and the fourth-scale feature map to obtain a first up-sampling feature map and a second up-sampling feature map;

The feature map of the image is obtained by fusing the second scale feature map, the first upsampling feature map, and the second upsampling feature map.

Further, the detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained, including:

Passing the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and a candidate frame containing the target to be detected;

Generate a channel attention feature map according to the feature map of the image in the RPN network;

Fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;

Obtain the feature map of the target to be detected according to the candidate frame and the fused feature map;

The detection result of the to-be-detected target is obtained according to the feature map of the to-be-detected target.

Further, generate a channel attention feature map according to the feature map of the image in the RPN network, including:

The feature map of the image in the RPN network is divided into a first sub-feature map and a second sub-feature map, and the number of channels of the first sub-feature map and the second sub-feature map are equal;

Obtain channel attention weights according to the first sub-feature map;

The channel attention feature map is obtained by multiplying the channel attention weight by the second sub-feature map.

The present invention also provides a device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the target detection according to any one of the above method.

The present invention also provides a computer-readable storage medium, where computer instructions are stored on the computer-readable storage medium, and when the computer instructions are executed by a processor, implement the target detection method described in any one of the above.

The target detection method proposed by the present invention firstly compresses the dimension-reduced images through multiple second convolution layers in the feature extraction stage to obtain multiple first branches, and then extracts the first feature maps of the multiple first branches respectively. The first feature maps of multiple first branches are spliced to obtain a first spliced feature map, and the feature maps of multiple branches are spliced using a cross-channel branching strategy as the basis for subsequent feature extraction, so that through multiple channel branches The information exchange between them expands the receiving range and retains more low-level functions, which improves the accuracy of target detection while ensuring the detection speed.

Description of drawings

The technical solutions and other beneficial effects of the present invention will be apparent through the detailed description of the specific embodiments of the present invention with reference to the accompanying drawings.

1 is a schematic diagram of a target detection method in an embodiment of the present application;

2 is a schematic diagram of a cross-channel branch feature extraction module in an embodiment of the present application;

3 is another schematic diagram of a cross-channel branch feature extraction module in an embodiment of the present application;

4 is a schematic diagram of a multi-scale feature fusion module in an embodiment of the present application;

5 is a schematic diagram of a detection network in an embodiment of the present application;

6 is a schematic diagram of a channel self-attention network in an embodiment of the present application;

FIG. 7 is a schematic diagram of a target detection device in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a device in an embodiment of the present application.

Detailed ways

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the specific embodiments set forth herein. Rather, these embodiments are provided to explain the principles of the invention and its practical application, to thereby enable others skilled in the art to understand the invention for various embodiments and with various modifications as are suited to the particular intended use. Throughout the drawings, the same reference numbers will be used to refer to the same elements.

CNN-based object detection networks are divided into two categories: one-stage and two-stage, according to whether a region proposal network (RPN) is included. The one-stage target detection network can directly regress and predict the target category and frame from the feature map. The network structure is simpler and more efficient, and is often considered to be more suitable for lightweight research, while the two-stage target detection network Better detection performance can be achieved due to the added step of candidate region selection. At present, most of the research work on lightweight object detection networks is based on one stage, for example, MobileNet-SSD, MobileNetV2-SSD Lite, Tiny-YOLO, D-YOLO, Pelee, and there are also two-stage lightweight object detection networks based on , for example, Light-Head R-CNN. However, it is difficult for the existing one-stage based lightweight object detection networks and two-stage based lightweight object detection networks to achieve a good balance between accuracy and model complexity.

Based on the above problems, the present application provides a lightweight target detection method. The target detection method is a two-stage lightweight target detection method, including a feature extraction stage and a detection stage. In the feature extraction stage, a cross-channel branch is adopted. The strategy adds cross-channel branches to the structure of the existing lightweight classification network, and splices the feature maps of multiple branches as the basis for subsequent feature extraction, thereby expanding the receiving range and retaining the information exchange between multiple channel branches. More low-level functions improve the accuracy of target detection while ensuring the detection speed. Specifically, the present application first obtains the image of the target to be detected, reduces the dimension of the image of the target to be detected through the first convolution layer, obtains the dimensionally reduced image, and compresses the dimensionally reduced image through the second convolution layer respectively. form multiple first branches with the same number of channels, respectively extract the first feature maps of the multiple first branches and splicing the first feature maps of the multiple first branches to obtain the first spliced feature map, where multiple The first feature map of the first branch increases sequentially in depth, and then the first stitched feature map is down-sampled through the first pooling layer to obtain the sampled feature map, and the sampled feature map is processed by multiple block modules. Feature extraction, obtain the feature map of the image, and finally perform detection according to the feature map of the image to obtain the detection result of the target to be detected.

The object detection method of the present application is described in detail below by taking the lightweight classification network ShuffleNetV2 as the lightweight classification network of this application. It should be noted that the lightweight classification network ShuffleNetV2 is used as the lightweight classification network of this application. As an example, it is not used to limit the target detection method of the present application, and the lightweight classification network of the present application can also adopt other lightweight classification networks, for example, Tiny-Darknet, MoblieNetV2, PeleeNet, etc.

Referring to FIG. 1 , the lightweight target detection method provided by this embodiment includes the following steps:

S1. Obtain an image of the target to be detected;

S2, reducing the dimension of the image through the first convolution layer to obtain the image after dimension reduction;

S3, compressing the dimensionally reduced images through a plurality of second convolution layers to obtain a plurality of first branches, wherein the plurality of first branches have the same number of channels;

S4. Extracting the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, wherein the first feature maps of the multiple first branches are in depth increasing sequentially;

S5, down-sampling the first stitched feature map through the first pooling layer to obtain a sampled feature map;

S6, extracting the features of the sampled feature map through multiple block modules to obtain the feature map of the image;

S7. Perform detection according to the feature map of the image to obtain a detection result of the target to be detected.

The feature extraction network of the target detection method in this embodiment is an improvement based on the lightweight classification network ShuffleNetV2. The specific structure of the feature extraction network is shown in the following table, which includes the first convolutional layer (Convolution ), cross-channel branch feature extraction module, the first pooling layer (MaxPooling), multiple block modules (ShuffleV2block), the first convolution layer (Convolution) and the first pooling layer (MaxPooling) are stem stages, multiple blocks The modules are stage2, stage3, and stage4. Specifically, the feature extraction network includes 16 block modules. Among them, the stage2 stage includes a block module with a step size of 2 and 3 block modules with a step size of 1. The stage3 stage includes a step A block module with a length of 2 and 7 block modules with a step size of 1. The stage4 stage includes a block module with a step size of 2 and 3 block modules with a step size of 1.

Table 1 The structure of the feature extraction network

The target detection method in this embodiment adopts a cross-channel branching strategy, and a cross-channel branch feature extraction module is added in the stem stage of the ShuffleNet V2 network, and the feature maps of multiple branches are spliced as the subsequent stages 2, 3, and 4. The basis of stage feature extraction, thereby expanding the receiving range and retaining more low-level functions through the information exchange between multiple channel branches, which improves the accuracy of target detection while ensuring the detection speed.

Specifically, the image of the target to be detected obtained in step S1 is input into the first convolution layer (Convolution), the size of the convolution kernel of the first convolution layer (Convolution) is 3╳3, and the step size is 2. The first convolution layer (Convolution) reduces the dimensionality of the image to obtain a dimensionally reduced image.

Referring to FIG. 2 , the cross-channel branch feature extraction module in this embodiment includes a plurality of branch modules and a concatenation layer (Concat), and the plurality of branch modules are used to compress the dimensionally reduced image into a plurality of first with the same number of channels. Branch and extract the first feature maps of the multiple first branches, and the first feature maps of the multiple first branches are sequentially increased in depth. Each branch module includes a second convolutional layer (1╳1 Conv) with a convolution kernel size of 1╳1, and multiple branch modules compress the dimension-reduced image through the second convolutional layer (1╳1 Conv). into multiple first branches with the same number of channels. The first branch module only includes the second convolutional layer (1╳1 Conv), starting from the second branch module, each branch module includes a convolutional layer with a convolution kernel size of 3╳3 (3╳3 Conv) And the number of convolutional layers (3╳3 Conv) is increased in turn to extract the first feature maps of multiple first branches increasing in depth in sequence, and finally the first feature maps of multiple first branches are concatenated through the concatenation layer (Concat). The feature maps are spliced to obtain a first spliced feature map. Fig. 2 shows the case where the feature extraction module includes 4 branch modules, the first branches in the 4 branch modules are respectively a1-a4, of course, this is only for illustration purposes, not for limitation, the number of branch modules Can be set according to actual needs.

Preferably, the first branch module in this embodiment adds a second pooling layer (Pool) in front of the second convolutional layer (1╳1 Conv). The dimensionality-reduced images are pooled to increase the receptive field of the first branch module, while retaining the main features and reducing parameters. In addition, on the basis that the first branch module includes the second pooling layer (Pool), in order to retain more information of the original image, the cross-channel branch feature extraction module adds only the second branch module in front of the first branch module. The branch module a0 of the convolutional layer (1╳1 Conv), the branch module performs channel compression on the dimensionally reduced image through the second convolutional layer (1╳1 Conv) and directly outputs it to the concatenation layer (Concat).

Referring to Figure 3, since the model will degenerate as the network depth increases, in order to solve this problem, residual connections are added to the cross-channel branch feature extraction module in this embodiment, that is, the output of the previous branch is used as the next The residual part of the branch is fused with the features of the same depth as the next branch before feature extraction. Figure 3 shows the structure of the cross-channel branch feature extraction module in Figure 2. The structure of the channel branch feature extraction module, specifically, step S4 includes:

S41. Use the output of the previous first branch as the residual part of the next first branch, and fuse the features and residual parts with the same depth as the next first branch and the residual part to obtain the first branch after fusion The cross-branch feature map of ;

S42, performing feature extraction on the cross-branch feature map fused by the first branch through the third convolution layer, and obtaining first feature maps of multiple first branches respectively;

S43, splicing the first feature maps of the plurality of first branches and merging them with the dimension-reduced image to obtain a first splicing feature map.

Take the case where the feature extraction module includes 4 branch modules, and the first branches in the 4 branch modules are respectively a1 to a4 as an example, the output of the previous first branch is used as the residual part of the next first branch and the next A feature and residual part with the same depth as the first branch and the residual part are fused. Specifically, the output of the first branch module and the second branch module are obtained through the second convolutional layer (1╳1 Conv). The feature maps are fused to obtain the fused cross-branch feature map of the second branch module, and then the fused cross-branch feature map of the second branch module is extracted through the third convolution layer to obtain the second The first feature map of the branch module; the output of the second branch module and the feature map obtained by the third branch module after passing through the second convolution layer (1╳1 Conv) and the convolution layer (3╳3 Conv) are analyzed. Fusion, the fused cross-branch feature map of the third branch module is obtained, and then the fused cross-branch feature map of the third branch module is extracted through the third convolution layer to obtain the third branch module. A feature map; fuse the output of the third branch module with the feature map obtained by the fourth branch module after passing through the second convolutional layer (1╳1 Conv) and two convolutional layers (3╳3 Conv), Obtain the fused cross-branch feature map of the fourth branch module, and then perform feature extraction on the fused cross-branch feature map of the fourth branch module through the third convolution layer to obtain the first feature of the fourth branch module In the figure, the output _γi of the four branch modules is expressed as follows:

Among them, α represents the convolution operation on the dimensionally reduced image through the second convolution layer (1╳1 Conv), S represents the convolution operation through the convolution layer (3╳3 Conv), i∈{1,2 ,...,k}, k is the number of branch modules, it should be noted here that α ₁ is the maximum pooling of the dimensionality-reduced image through the second pooling layer (Pool) first, and then the The second convolutional layer (1╳1 Conv) performs convolution operations.

After obtaining the output γ _i of the four branch modules, the outputs of the four branch modules and the output of the branch module a0 are input to the concatenation layer (Concat) for splicing and fused with the dimensionally reduced image to obtain the first concatenated feature map . It should be noted here that the third convolutional layer refers to the convolutional layer (3╳3 Conv) connected to the concatenation layer (Concat) in each branch module, and the above principle is the same for the case where the number of branch modules is greater than 4 , which will not be repeated here.

The cross-channel branch feature extraction module in this embodiment can supplement the original input features by adding residual connections, thereby preventing model degradation.

In step S5, the first stitched feature map is input into the first pooling layer (MaxPooling) for downsampling to obtain the sampled feature map, and the size of the convolution kernel of the first pooling layer (MaxPooling) is 3╳3 , the step size is 2, the first pooling layer (MaxPooling) adopts the maximum pooling method for pooling, and the amount of data calculation can be further reduced by downsampling.

In order to further expand the receiving range of the network and retain more details, in another implementation of this embodiment, a cross-channel branch feature extraction module is also added after the first pooling layer (MaxPooling) of the ShuffleNetV2 network. At this time, The structure of the feature extraction network in another implementation of this embodiment is shown in the following table:

Table 2 Another structure of feature extraction network

Before step S6, the target detection method in another implementation manner of this embodiment further includes:

S600. Compress the sampled feature maps through the second convolution layer (1╳1 Conv) respectively to obtain multiple second branches, wherein the multiple second branches have the same number of channels;

S601. Extract the second feature maps of multiple second branches respectively and splicing the second feature maps of the multiple second branches to obtain a second spliced feature map, and the second feature maps of the multiple second branches are sequentially in depth Increment.

Specifically, the sampled feature map is input to the cross-channel branch feature extraction module again, and the multiple branch modules are used to compress the sampled feature map into multiple second branches with the same number of channels and extract multiple second branches The second feature maps of multiple second branches are sequentially increased in depth. The sampled feature map is compressed into multiple second branches with the same number of channels through the second convolution layer (1╳1 Conv), and then multiple second branches are extracted through the convolution layer (3╳3 Conv) in The second feature maps that increase sequentially in depth, and finally the second feature maps of the plurality of second branches are spliced through a concatenation layer (Concat) to obtain a second concatenated feature map. The acquisition process of the second stitched feature map is the same as the acquisition process of the first stitched feature map, and details are not repeated here.

Similarly, in order to solve the problem of model degradation, a residual connection is also added to the cross-channel branch feature extraction module after the first pooling layer (MaxPooling), that is, the output of the previous branch is used as the residual part of the next branch. The features of the same depth as the next branch are fused and then feature extraction is performed. The specific structure is shown in FIG. 3 . Step S601 includes:

S6011. Use the output of the previous second branch as the residual part of the next second branch, and fuse the features and residual parts with the same depth as the next second branch and the residual part to obtain the second branch after fusion The cross-branch feature map of ;

S6012, performing feature extraction on the cross-branch feature map fused by the second branch through the third convolution layer, and obtaining second feature maps of multiple second branches respectively;

S6013 , splicing the second feature maps of the plurality of second branches and merging them with the dimensionality-reduced image to obtain a second splicing feature map.

The process of obtaining the second stitched feature map after adding residual connection to the cross-channel branch feature extraction module is the same as the process of obtaining the first stitched feature map after adding the residual structure to the cross-channel branch feature extraction module, and will not be repeated here.

In step S6, feature extraction is performed on the sampled feature map through multiple block modules to obtain a feature map of the image. Specifically, feature extraction is performed on the second stitched feature map through multiple block modules to obtain a feature map of the image.

Due to the weak ability of the lightweight network to extract features, a large number of channel features cannot be retained. Preferably, the target detection method of this embodiment also adds a multi-scale feature fusion module on the basis of the ShuffleNet V2 network. The module fuses the features output from the stage3 and stage4 stages of the ShuffleNet V2 network, thereby combining low-resolution information with high-resolution information, which can effectively supplement the global context information between multi-scale feature maps.

Referring to FIG. 4 , the process of applying the multi-scale feature fusion module to the target detection method of the present embodiment is specifically described below. Step S6 includes:

S61, obtaining the first-scale feature map, the second-scale feature map, and the third-scale feature map in turn by passing the sampled feature map through multiple block modules;

S62, down-sampling the third-scale feature map to obtain a fourth-scale feature map;

S63, respectively upsampling the third scale feature map and the fourth scale feature map to obtain the first upsampling feature map and the second upsampling feature map;

S64 , fuse the second scale feature map, the first upsampling feature map, and the second upsampling feature map to obtain a feature map F _mfm of the image.

The first scale feature map in this embodiment is the output of stage 2, specifically, the feature map obtained by passing the second stitched feature map through a block module with a step size of 2 and three block modules with a step size of 1 in turn That is, the first-scale feature map; the second-scale feature map is the output of the stage3 stage. Specifically, the first-scale feature map is obtained by sequentially passing through a block module with a step size of 2 and 7 block modules with a step size of 1. The feature map is the second-scale feature map; the third-scale feature map is the output of the stage4 stage, specifically, the second-scale feature map is sequentially passed through a block module with a step size of 2 and three blocks with a step size of 1. The feature map obtained after the module is the third-scale feature map.

When the detection result is classification, the high-resolution information has a greater impact on the classification than the low-resolution information. Therefore, in order to retain more high-resolution information, in step S62, the third scale feature map is down-sampled The obtained fourth-scale feature maps have more high-resolution information. Preferably, in order to further reduce the amount of data calculation, in this embodiment, depthwise separable convolution (3╳3 DW Conv) is used to downsample the third-scale feature map, and the size of the convolution kernel is 3╳3.

In order to ensure that the data dimensions are consistent, in step S63, it is necessary to upsample the third-scale feature map and the fourth-scale feature map, so that the obtained data dimensions of the first up-sampling feature map and the second up-sampling feature map are the same as the third-scale feature map. The data dimensions of the two-scale feature maps are consistent. Preferably, in this embodiment, the bilinear interpolation method is used to upsample the third-scale feature map and the fourth-scale feature map, and the upsampling passes through a convolutional layer with a convolution kernel size of 1╳1 (1╳1 Conv )to fulfill.

Preferably, in order to further ensure the consistency of data dimensions, in step S63, it is necessary to perform dimension adjustment on the feature map of the second scale to obtain a feature map of increased dimension, and the dimension adjustment is performed through a convolution layer with a convolution kernel size of 1╳1. (1╳1 Conv) is implemented, so as to ensure that the data dimension of the feature map of the upsampling is consistent with the data dimension of the first upsampling feature map and the second upsampling feature map. Correspondingly, in step S64, the fusion of the second scale feature map, the first upsampling feature map, and the second upsampling feature map is specifically to fuse the dimension-raising feature map, the first upsampling feature map, and the second upsampling feature map. The images are fused to obtain the feature map F _mfm of the image.

Since the low-resolution information, that is, the shallow feature information, has a relatively small effect on the classification of the detection result, and the low-resolution information has a large amount of data, which will greatly increase the amount of calculation. Therefore, considering the amount of calculation and the impact on the detection result, In this embodiment, in step S64, only the second scale feature map and the first upsampling feature map are selected for the low resolution information, the second upsampling feature map is selected for the high resolution information, and the second scale feature map, the first upsampling feature map are selected The sampling feature map and the second up-sampling feature map are fused to obtain the feature map of the image, so as to realize the combination of low-resolution information and high-resolution information, effectively supplement the global context information between multi-scale feature maps, and avoid information loss. . It should be noted here that in this embodiment, only three different levels of feature information are selected for fusion, and only the second upsampling feature map is selected for high-resolution information. Influence on the detection results, continue to downsample the fourth-scale feature map to obtain more high-resolution information, and select more levels of feature information for fusion.

Referring to Figure 5, the detection network in the target detection method in this embodiment is an improvement based on the existing lightweight detection network. Specifically, the detection network is performed on the basis of the existing Light-Head R-CNN network. improvement of. Among them, the Light-Head R-CNN network includes RPN, PSROI (position sensitive ROI pooling) layer, and fully connected layer. The detection network in this embodiment adds a channel self-attention network on the basis of the Light-Head R-CNN network. .

Specifically, the RPN includes a fourth convolutional layer (DW Conv), a fifth convolutional layer (1╳1 Conv), and candidate region extraction modules (ROIs) that are cascaded in sequence. It should be noted here that the Light-Head R-CNN network is used as an example and is not used as a limitation, and a channel self-attention network can also be added on the basis of other lightweight detection networks as the detection network in this embodiment.

Specifically, step S7 includes:

S71, passing the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and a candidate frame containing the target to be detected;

S72. Generate a channel attention feature map according to the feature map of the image in the RPN network;

S73, fuse the channel attention feature map with the feature map of the image to obtain a fused feature map;

S74, obtain the feature map of the target to be detected according to the candidate frame and the fused feature map;

S75. Obtain the detection result of the target to be detected according to the feature map of the target to be detected.

In step S71, the feature map of the image is sequentially passed through the fourth convolution layer (DW Conv) and the fifth convolution layer (1╳1 Conv) to obtain the feature map of the image in the RPN network, and the fifth convolution layer ( The size of the convolution kernel of 1╳1 Conv) is 1╳1. Preferably, in order to further reduce the amount of data calculation, this embodiment adopts depthwise separable convolution to pass the fourth convolution layer (DW Conv) to the characteristics of the image. The graph is convoluted. The feature map of the image in the RPN network is passed through the candidate region extraction module (ROIs) to obtain the candidate frame containing the object to be detected.

In order to solve the problem of weak feature extraction capability of lightweight network and the loss of spatial information around the target detection area, this embodiment adds a channel self-attention network on the basis of the existing Light-Head R-CNN network. The self-attention network optimizes the feature distribution of the feature map input to the PSROI (position sensitive ROI pooling) layer, so that the output feature map pays more attention to the detection-related area and improves the accuracy of the detection result.

Specifically, step S72 includes:

S721, dividing the feature map of the image in the RPN network into a first sub-feature map and a second sub-feature map, wherein the number of channels of the first sub-feature map and the second sub-feature map are equal;

S722, obtaining the channel attention weight according to the first sub-feature map;

S723. Multiply the channel attention weight and the second sub-feature map to obtain the channel attention feature map.

Referring to FIG. 6 , the channel self-attention network in this embodiment includes a first segmentation module and a channel attention weight acquisition module. The first segmentation module is used to segment the feature map F _rpn of the image in the RPN network into a first sub-feature map F ₁ and a second sub-feature map F ₂ , wherein the first sub-feature map F ₁ and the second sub-feature map The number of channels of F ₂ is equal. The segmentation mentioned here is to directly divide the channels. For example, if the number of channels of the feature map of the image in the RPN network is 8, the data corresponding to the 1st to 4th channels are used as the first channel. A sub-feature map F ₁ , and the data corresponding to the 5th to 8th channels is used as the second sub-feature map F ₂ .

Input the _first sub-feature map F1 into the channel attention weight acquisition module, and obtain the channel attention weight K through the channel attention weight acquisition module, wherein the channel attention weight acquisition module includes a second segmentation module, a grouping convolution layer (Group Conv), depthwise separable convolution layer (DW Conv), softmax layer, third pooling layer (Avg pool), sixth convolution layer (1╳1 Conv).

The second segmentation module is used to segment the first sub-feature map F ₁ into a third sub-feature map F ₃ and a fourth sub-feature map F ₄ , wherein the third sub-feature map F ₃ and the fourth sub-feature map F ₄ The number of channels is equal. The segmentation mentioned here is to directly divide the channels. Continue to take the number of channels of the feature map of the image in the RPN network as 8 as an example. After the first segmentation module segmentation, the first sub-feature map F If the number of channels of ₁ is 4, the data corresponding to the 1st to 2nd channels are used as the third sub-feature map F ₃ , and the data corresponding to the 3rd to 4th channels are used as the fourth sub-feature map F ₄ .

The third sub-feature map F ₃ and the fourth sub-feature map F ₄ are respectively input to the group convolution layer (Group Conv) and the depthwise separable convolution layer (DW Conv) for convolution processing, and the group convolution layer (Group Conv) Conv), the output of the depthwise separable convolution layer (DW Conv) are fused and then processed through the softmax layer, the third pooling layer (Avg pool), and the sixth convolution layer (1╳1 Conv) in turn to obtain channel attention. Force weight K, among which, the third pooling layer (Avg pool) adopts the method of mean pooling for pooling, and the sixth convolution layer (1╳1 Conv) is used for dimensional upgrade processing, so that the channel attention weight K The dimension of is consistent with the dimension of the _second sub-feature map F2.

After obtaining the channel attention weight K, multiply the channel attention weight K with the second sub-feature map F ₂ to obtain the channel attention feature map.

In step S73, the channel attention feature map and the image feature map are finally fused through the channel self-attention network to obtain a fused feature map. The channel self-attention network in this embodiment combines the channel separation and the self-attention mechanism. Through the channel separation, the information between each channel can interact with each other, which significantly reduces the complexity of the network structure, thereby greatly reducing the number of parameters. , the background features can be suppressed and the foreground features can be highlighted through the self-attention mechanism. In addition, by combining the channel attention feature map with the feature map of the image, the field of view of each spatial location is expanded and the output features are enriched.

In this embodiment, in steps S74 to S75, the PSROI (position sensitive ROI pooling) layer is used to map the candidate frame to the fused feature map, and the feature map of the target to be detected is extracted from the fused feature map according to the candidate frame. The feature map of the detection target obtains the detection result of the target to be detected through the fully connected layer, wherein the category probability is obtained through the fully connected layer and classified according to the category probability, that is, the detection result is classified, and the position offset information is obtained through the fully connected layer and according to the position Offset information to obtain the location of the target, that is, the detection result is regression.

The target detection method in this embodiment is mainly applied to mobile terminal devices. Before deploying the target detection algorithm to mobile terminal devices, training data needs to be used on the server to perform a network model constructed according to the target detection method of this embodiment. After training, use the evaluation data to evaluate the network model to obtain the network model with the best performance, and finally deploy the network model with the best performance to the mobile terminal through the onnx tool to implement the target detection algorithm. Real data for detection and visualization of detection results.

The target detection method in this embodiment is verified on the public data set PASCAL VOC. The experimental results show that the target detection method in this embodiment only needs 528 MFLOPs to obtain an accuracy of 70.6 mAP. There is a good balance of complexity.

The following is a detailed description of the verification structure of the target detection method in this embodiment on the public dataset PASCAL VOC.

The image is scaled to 320×320 as input, and the network model constructed according to the object detection method of this embodiment is trained on NVIDIA TITAN RTX with 24GB RAM. During the training phase, we employ a stochastic gradient optimizer with a learning rate of 0.0001 and a weight decay of 0.001. All datasets were randomly divided into training set (60%), validation set (20%), test set (20%) so that the data in the training, validation and testing stages all had similar distributions. Here, millions of floating-point operations (MFLOPs) are defined to measure the complexity and efficiency of a lightweight network model, and the performance of the model can be evaluated by mean precision (mAP). On the premise of the same training parameters, different methods are used to detect PASCAL VOC data. The MFLOP and mAP results corresponding to different methods are shown in Table 3.

Table 3 Comparison of the results of different methods on the PASCAL VOC dataset

Compared with most state-of-the-art models based on large-scale object detectors (such as YOLOv2, SSD300, SSD321, R-FCN), the object detection method in this example (our model in Table 2) has a stronger model complexity. Advantage. Therefore, the target detection method in this embodiment is more in line with the requirements of the mobile terminal device.

Comparing the target detection method in this embodiment with the existing lightweight detection algorithm, it can be seen that the target detection method in this embodiment (our model in Table 2) is better than Tiny-YOLO, D-YOLO, MobileNet- The MFLOPs of SSD are much smaller, and the accuracy is higher than Tiny-YOLO, D-YOLO, MobileNet-SSD. Compared with Pelee, the object detection method in this example (our model in Table 2) can produce similar accuracies with half the model complexity. It can be seen that the target detection method in this embodiment can achieve a good balance between the accuracy and the complexity of the model.

Referring to FIG. 7 , the present embodiment also provides a target detection device corresponding to the above target detection method. The target detection device includes an acquisition module 1, a dimensionality reduction module 2, a compression module 3, a splicing module 4, a sampling module 5, and a feature extraction module. 6. Detection module 7.

Specifically, the acquisition module 1 is used to acquire the image of the target to be detected, the dimension reduction module 2 is used to reduce the dimension of the image through the first convolution layer, and obtain the dimension-reduced image, and the compression module 3 is used to reduce the dimension of the image. The images are respectively compressed through multiple second convolution layers to obtain multiple first branches, wherein the multiple first branches have the same number of channels, and the stitching module 4 is used to extract the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, wherein the first feature maps of the multiple first branches are sequentially increased in depth, and the sampling module 5 is used to combine the first splicing feature The image is downsampled through the first pooling layer to obtain the sampled feature map. The feature extraction module 6 is used to extract the sampled feature map through multiple block modules to obtain the feature map of the image. The detection module 7 is used for The detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained.

The splicing module 4 in this embodiment is also configured to use the output of the previous first branch as the residual part of the next first branch and perform the next first branch and the residual part with the same depth as the residual part. Fusion, to obtain the cross-branch feature map after the fusion of the first branch, and perform feature extraction on the cross-branch feature map after the fusion of the first branch through the third convolution layer to obtain the first feature maps of multiple first branches respectively. , and splicing the first feature maps of the multiple first branches and merging them with the dimension-reduced image to obtain a first splicing feature map.

The compression module 3 is also used to compress the sampled feature maps through multiple second convolution layers to obtain multiple second branches, wherein the multiple second branches have the same number of channels, and the splicing module 4 is also used for Extracting the second feature maps of multiple second branches respectively and splicing the second feature maps of the multiple second branches to obtain a second splicing feature map, wherein the second feature maps of the multiple second branches are sequentially in depth. Increment.

The splicing module 4 in this embodiment is also used to use the output of the previous second branch as the residual part of the next second branch and perform the next second branch and the residual part with the same depth as the feature and the residual part. Fusion, obtains the cross-branch feature maps after the fusion of multiple second branches, and performs feature extraction on the cross-branch feature maps after the fusion of multiple second branches through the third convolution layer, and obtains the second branch of the multiple second branches respectively. feature maps, and splicing the second feature maps of the plurality of second branches and merging them with the sampled feature maps to obtain a second splicing feature map.

The feature extraction module 6 is further configured to perform feature extraction on the second stitched feature map through a plurality of block modules to obtain a feature map of the image. Specifically, the feature extraction module 6 is configured to sequentially obtain the first scale feature map, the second scale feature map, the third scale feature map through the sampled feature map/second stitched feature map through multiple block modules, and the third The scale feature map is downsampled to obtain the fourth scale feature map, and the third scale feature map and the fourth scale feature map are respectively upsampled to obtain the first upsampling feature map and the second upsampling feature map, and the third scale feature map and the fourth scale feature map are respectively upsampled. The two-scale feature map, the first upsampling feature map, and the second upsampling feature map are fused to obtain the feature map of the image.

The detection module 7 in this embodiment is specifically used to pass the feature map of the image through the RPN network, obtain the feature map of the image in the RPN network and the candidate frame containing the target to be detected, and generate a channel according to the feature map of the image in the RPN network Attention feature map, and fuse the channel attention feature map with the feature map of the image to obtain the fused feature map, and obtain the feature map of the target to be detected according to the candidate frame and the fused feature map, and according to the target to be detected. The feature map of , obtains the detection result of the target to be detected.

The detection module 7 is further configured to divide the feature map of the image in the RPN network into a first sub-feature map and a second sub-feature map, wherein the number of channels of the first sub-feature map and the second sub-feature map are equal, and The channel attention weight is obtained according to the first sub-feature map, and the channel attention feature map is obtained by multiplying the channel attention weight with the second sub-feature map.

8, this embodiment provides a device including a memory 100, a processor 200, and a network interface 202. The memory 100 stores a computer program, and the processor 200 executes the computer program to implement the target detection method in this embodiment.

The memory 100 may include a high-speed random access memory (Random Access Memory, RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

The processor 200 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the target detection method in this embodiment may be completed by an integrated logic circuit of hardware in the processor 200 or an instruction in the form of software. The processor 200 may also be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc., and may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC) , off-the-shelf programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.

The memory 100 is used to store a computer program, and after receiving the execution instruction, the processor 200 executes the computer program to implement the target detection method in this embodiment.

This embodiment also provides a computer storage medium, where a computer program is stored in the computer storage medium, and the processor 200 is configured to read and execute the computer program stored in the computer storage medium 201 to implement the target detection method in this embodiment .

In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer storage medium to another computer storage medium, for example, from a website site, computer, server, or data center over a wired (e.g., coaxial) cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.) to another website site, computer, server, or data center. The computer storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, etc. that includes an integration of one or more available media. The usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.

Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, apparatuses, and computer program products according to embodiments of the present invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

The above are only specific embodiments of the present application. It should be pointed out that for those skilled in the art, without departing from the principles of the present application, several improvements and modifications can also be made. It should be regarded as the protection scope of this application.

Claims

A lightweight target detection method, wherein the target detection method comprises:

Obtain the image of the target to be detected;

reducing the dimension of the image through the first convolution layer to obtain a dimension-reduced image;

compressing the dimension-reduced images through a plurality of second convolution layers to obtain a plurality of first branches, and the plurality of first branches have the same number of channels;

Extracting the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, the first feature maps of the multiple first branches increasing in depth;

Downsampling the first stitched feature map through the first pooling layer to obtain a sampled feature map;

Perform feature extraction on the sampled feature map through a plurality of block modules to obtain the feature map of the image;

The detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained.
The target detection method according to claim 1, wherein, before the sampled feature map is extracted by a plurality of block modules to obtain the feature map of the image, the target detection method further comprises:

compressing the sampled feature maps through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, and the plurality of second branches have the same number of channels;

Extracting the second feature maps of the multiple second branches respectively and splicing the second feature maps of the multiple second branches to obtain a second splicing feature map, the second feature maps of the multiple second branches increasing in depth;

Correspondingly, the sampled feature map is extracted by multiple block modules to obtain the feature map of the image, including:

The second stitched feature map is extracted by a plurality of block modules to obtain the feature map of the image.
The target detection method according to claim 2, wherein, for the first branch/second branch with the smallest depth, compressing the dimension-reduced image/sampled feature map through a second convolution layer, comprising:

The dimensionality-reduced image/sampled feature map is sequentially pooled and compressed through the second pooling layer and the second convolutional layer.
The target detection method according to claim 3, wherein the first feature maps of the plurality of first branches/the second feature maps of the second branches are respectively extracted and the first feature maps of the plurality of first branches are extracted respectively. / The second feature map of the second branch is spliced to obtain the first splicing feature map/the second splicing feature map, including:

The output of the previous first branch/second branch is used as the residual part of the next first branch/second branch and the next first branch/second branch is the same as the feature and the residual part of the depth. The residual part is fused to obtain a cross-branch feature map after the fusion of multiple first branches/second branches;

Perform feature extraction on the cross-branch feature maps after the fusion of multiple first branches/second branches through the third convolution layer, and obtain the first feature maps of multiple first branches/second feature maps of the second branch respectively;

splicing the first feature maps of the plurality of first branches and merging with the dimensionality-reduced image/splicing the second feature maps of the plurality of second branches and merging them with the sampled features The images are fused to obtain the first mosaic feature map/second mosaic feature map.
The target detection method according to claim 1, wherein the feature extraction of the sampled feature map/second stitched feature map through a plurality of block modules to obtain the feature map of the image, comprising:

Obtain the first scale feature map, the second scale feature map, and the third scale feature map in turn by passing the sampled feature map/second splicing feature map through multiple block modules;

down-sampling the third-scale feature map to obtain a fourth-scale feature map;

respectively up-sampling the third-scale feature map and the fourth-scale feature map to obtain a first up-sampling feature map and a second up-sampling feature map;

The feature map of the image is obtained by fusing the second scale feature map, the first upsampling feature map, and the second upsampling feature map.
The target detection method according to claim 2, wherein the feature extraction of the sampled feature map/second stitched feature map is performed by multiple block modules to obtain the feature map of the image, comprising:

Obtain the first scale feature map, the second scale feature map, and the third scale feature map in turn by passing the sampled feature map/second splicing feature map through multiple block modules;

down-sampling the third-scale feature map to obtain a fourth-scale feature map;

respectively up-sampling the third-scale feature map and the fourth-scale feature map to obtain a first up-sampling feature map and a second up-sampling feature map;

The feature map of the image is obtained by fusing the second scale feature map, the first upsampling feature map, and the second upsampling feature map.
The target detection method according to claim 5, wherein the detection is performed according to the feature map of the image to obtain the detection result of the target to be detected, comprising:

Passing the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and a candidate frame containing the target to be detected;

Generate a channel attention feature map according to the feature map of the image in the RPN network;

Fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;

Obtain the feature map of the target to be detected according to the candidate frame and the fused feature map;

The detection result of the to-be-detected target is obtained according to the feature map of the to-be-detected target.
The target detection method according to claim 6, wherein the detection is performed according to the feature map of the image to obtain the detection result of the target to be detected, comprising:

Passing the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and a candidate frame containing the target to be detected;

Generate a channel attention feature map according to the feature map of the image in the RPN network;

Fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;

Obtain the feature map of the target to be detected according to the candidate frame and the fused feature map;

The detection result of the to-be-detected target is obtained according to the feature map of the to-be-detected target.
The target detection method according to claim 7, wherein generating a channel attention feature map according to the feature map of the image in the RPN network, comprising:

The feature map of the image in the RPN network is divided into a first sub-feature map and a second sub-feature map, and the number of channels of the first sub-feature map and the second sub-feature map are equal;

Obtain channel attention weights according to the first sub-feature map;

The channel attention feature map is obtained by multiplying the channel attention weight by the second sub-feature map.
The target detection method according to claim 8, wherein generating a channel attention feature map according to the feature map of the image in the RPN network, comprising:

The feature map of the image in the RPN network is divided into a first sub-feature map and a second sub-feature map, and the number of channels of the first sub-feature map and the second sub-feature map are equal;

Obtain channel attention weights according to the first sub-feature map;

The channel attention feature map is obtained by multiplying the channel attention weight by the second sub-feature map.
A device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement a lightweight target detection method, the lightweight Object detection methods include:

Obtain the image of the target to be detected;

reducing the dimension of the image through the first convolution layer to obtain a dimension-reduced image;

compressing the dimension-reduced images through a plurality of second convolution layers to obtain a plurality of first branches, and the plurality of first branches have the same number of channels;

Extracting the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, the first feature maps of the multiple first branches increasing in depth;

Downsampling the first stitched feature map through the first pooling layer to obtain a sampled feature map;

Perform feature extraction on the sampled feature map through a plurality of block modules to obtain the feature map of the image;

The detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained.
The device according to claim 11, wherein, before the feature map of the sampled feature map is extracted by a plurality of block modules to obtain the feature map of the image, the target detection method further comprises:

compressing the sampled feature maps through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, and the plurality of second branches have the same number of channels;

Extracting the second feature maps of the multiple second branches respectively and splicing the second feature maps of the multiple second branches to obtain a second splicing feature map, the second feature maps of the multiple second branches increasing in depth;

Correspondingly, the sampled feature map is extracted by multiple block modules to obtain the feature map of the image, including:

The second stitched feature map is extracted by a plurality of block modules to obtain the feature map of the image.
The device according to claim 12, wherein, for the first branch/second branch with the smallest depth, compressing the dimension-reduced image/sampled feature map through a second convolution layer, comprising:

The dimensionality-reduced image/sampled feature map is sequentially pooled and compressed through the second pooling layer and the second convolutional layer.
The device according to claim 13, wherein the first feature maps of the plurality of first branches/the second feature maps of the second branches are respectively extracted and the first feature maps/the second feature maps of the plurality of first branches are extracted respectively. The second feature maps of the two branches are spliced to obtain the first splicing feature map/the second splicing feature map, including:

The output of the previous first branch/second branch is used as the residual part of the next first branch/second branch and the next first branch/second branch is the same as the feature and the residual part of the depth. The residual part is fused to obtain a cross-branch feature map after the fusion of multiple first branches/second branches;

Perform feature extraction on the cross-branch feature maps after the fusion of multiple first branches/second branches through the third convolution layer, and obtain the first feature maps of multiple first branches/second feature maps of the second branch respectively;

splicing the first feature maps of the plurality of first branches and merging with the dimensionality-reduced image/splicing the second feature maps of the plurality of second branches and merging them with the sampled features The images are fused to obtain the first mosaic feature map/second mosaic feature map.
The device according to claim 12, wherein the feature extraction of the sampled feature map/second stitched feature map is performed by multiple block modules to obtain the feature map of the image, comprising:

Obtain the first scale feature map, the second scale feature map, and the third scale feature map in turn by passing the sampled feature map/second splicing feature map through multiple block modules;

down-sampling the third-scale feature map to obtain a fourth-scale feature map;

respectively up-sampling the third-scale feature map and the fourth-scale feature map to obtain a first up-sampling feature map and a second up-sampling feature map;

The feature map of the image is obtained by fusing the second scale feature map, the first upsampling feature map, and the second upsampling feature map.
The device according to claim 15, wherein the detection is performed according to the feature map of the image to obtain the detection result of the target to be detected, comprising:

Passing the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and a candidate frame containing the target to be detected;

Generate a channel attention feature map according to the feature map of the image in the RPN network;

Fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;

Obtain the feature map of the target to be detected according to the candidate frame and the fused feature map;

The detection result of the to-be-detected target is obtained according to the feature map of the to-be-detected target.
The device according to claim 16, wherein generating a channel attention feature map according to the feature map of the image in the RPN network, comprising:

The feature map of the image in the RPN network is divided into a first sub-feature map and a second sub-feature map, and the number of channels of the first sub-feature map and the second sub-feature map are equal;

Obtain channel attention weights according to the first sub-feature map;

The channel attention feature map is obtained by multiplying the channel attention weight by the second sub-feature map.
A computer-readable storage medium, where computer instructions are stored on the computer-readable storage medium, wherein the computer instructions are executed by a processor to perform a lightweight target detection method, and the lightweight target detection method includes:

Obtain the image of the target to be detected;

reducing the dimension of the image through the first convolution layer to obtain a dimension-reduced image;

compressing the dimension-reduced images through a plurality of second convolution layers to obtain a plurality of first branches, and the plurality of first branches have the same number of channels;

Extracting the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, the first feature maps of the multiple first branches increasing in depth;

Downsampling the first stitched feature map through the first pooling layer to obtain a sampled feature map;

Perform feature extraction on the sampled feature map through a plurality of block modules to obtain the feature map of the image;

The detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained.
The computer-readable storage medium according to claim 18, wherein, before the feature map of the sampled feature map is extracted by a plurality of block modules to obtain the feature map of the image, the target detection method further comprises:

compressing the sampled feature maps through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, and the plurality of second branches have the same number of channels;

Extracting the second feature maps of the multiple second branches respectively and splicing the second feature maps of the multiple second branches to obtain a second splicing feature map, the second feature maps of the multiple second branches increasing in depth;

Correspondingly, the sampled feature map is extracted by multiple block modules to obtain the feature map of the image, including:

The second stitched feature map is extracted by a plurality of block modules to obtain the feature map of the image.
The computer-readable storage medium according to claim 19, wherein the feature extraction of the sampled feature map/second stitched feature map through a plurality of block modules to obtain the feature map of the image comprises:

Obtain the first scale feature map, the second scale feature map, and the third scale feature map in turn by passing the sampled feature map/second splicing feature map through multiple block modules;

down-sampling the third-scale feature map to obtain a fourth-scale feature map;

respectively up-sampling the third-scale feature map and the fourth-scale feature map to obtain a first up-sampling feature map and a second up-sampling feature map;

The feature map of the image is obtained by fusing the second scale feature map, the first upsampling feature map, and the second upsampling feature map.