WO2022213395A1 - Light-weighted target detection method and device, and storage medium - Google Patents

Light-weighted target detection method and device, and storage medium

Info

Publication number
WO2022213395A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
feature
image
branches
maps
Prior art date
Application number
PCT/CN2021/086476
Other languages
French (fr)
Chinese (zh)
Inventor
张伟烽
胡庆茂
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2022213395A1 publication Critical patent/WO2022213395A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The invention relates to the technical field of image processing, and in particular to a lightweight target detection method, device, and storage medium.
  • Object detection is a fundamental visual recognition task in computer vision and is widely used in areas such as autonomous driving and security inspection.
  • With the success of deep learning in image classification, object detection networks based on convolutional neural networks (CNNs) have gradually become mainstream.
  • Common CNN-based target detection networks include Faster R-CNN, R-FCN, SSD, and YOLO. These networks rely on complex structures whose computational cost, measured in millions of floating point operations (MFLOPs), reaches five figures, so they run accurately and quickly only on server GPUs and are unsuitable for real-time deployment on mobile devices.
  • The present invention therefore provides a lightweight target detection method, device, and storage medium that can improve the accuracy of target detection while ensuring the detection speed.
  • The specific technical solution proposed by the present invention is a lightweight target detection method, which includes:
  • compressing the dimensionality-reduced image through a plurality of second convolution layers to obtain a plurality of first branches, where the plurality of first branches have the same number of channels;
  • performing detection according to the feature map of the image to obtain the detection result of the target to be detected.
  • the target detection method further includes:
  • performing feature extraction on the sampled feature map through multiple block modules to obtain the feature map of the image then includes:
  • performing feature extraction on the second spliced feature map through the plurality of block modules to obtain the feature map of the image.
  • for the first branch/second branch with the smallest depth, compressing the dimensionality-reduced image/sampled feature map through the second convolution layer includes:
  • pooling and compressing the dimensionality-reduced image/sampled feature map sequentially through the second pooling layer and the second convolution layer.
  • the output of the previous first branch/second branch is used as the residual part of the next first branch/second branch, and the features of the next first branch/second branch at the same depth as the residual part are fused with the residual part, to obtain cross-branch feature maps after the fusion of the multiple first branches/second branches;
  • performing feature extraction on the sampled feature map/second spliced feature map through multiple block modules to obtain the feature map of the image includes:
  • fusing the second-scale feature map, the first upsampling feature map, and the second upsampling feature map to obtain the feature map of the image.
  • performing detection according to the feature map of the image to obtain the detection result of the target to be detected includes:
  • obtaining the detection result of the to-be-detected target according to the feature map of the to-be-detected target.
  • dividing the feature map of the image in the RPN network into a first sub-feature map and a second sub-feature map, where the numbers of channels of the first sub-feature map and the second sub-feature map are equal;
  • multiplying the channel attention weight by the second sub-feature map to obtain the channel attention feature map.
  • The present invention also provides a device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement any one of the above target detection methods.
  • The present invention also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement any one of the above target detection methods.
  • The target detection method proposed by the present invention first compresses the dimensionality-reduced image through multiple second convolution layers in the feature extraction stage to obtain multiple first branches, then extracts the first feature maps of the multiple first branches and splices them to obtain a first spliced feature map.
  • This cross-channel branching strategy uses the spliced feature maps of multiple branches as the basis for subsequent feature extraction, so that the information exchange between the channel branches enlarges the receptive field and retains more low-level features, improving the accuracy of target detection while ensuring the detection speed.
  • FIG. 1 is a schematic diagram of a target detection method in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a cross-channel branch feature extraction module in an embodiment of the present application.
  • FIG. 3 is another schematic diagram of a cross-channel branch feature extraction module in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a multi-scale feature fusion module in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a detection network in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a channel self-attention network in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a target detection device in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a device in an embodiment of the present application.
  • CNN-based object detection networks are divided into two categories, one-stage and two-stage, according to whether they include a region proposal network (RPN).
  • A one-stage target detection network regresses and predicts the target category and bounding box directly from the feature map; its structure is simpler and more efficient, so it is often considered more suitable for lightweight research, whereas a two-stage network can achieve better detection performance thanks to the added candidate-region selection step.
  • Most research on lightweight object detection networks is based on the one-stage paradigm, for example MobileNet-SSD, MobileNetV2-SSD Lite, Tiny-YOLO, D-YOLO, and Pelee; there are also two-stage lightweight object detection networks, for example Light-Head R-CNN.
  • The target detection method here is a two-stage lightweight method comprising a feature extraction stage and a detection stage.
  • In the feature extraction stage, a cross-channel branching strategy adds cross-channel branches to the structure of an existing lightweight classification network and splices the feature maps of the multiple branches as the basis for subsequent feature extraction; the information exchange between the channel branches enlarges the receptive field and retains more low-level features, improving detection accuracy while ensuring detection speed.
  • The present application first obtains the image of the target to be detected, reduces its dimensionality through the first convolution layer, and compresses the dimensionality-reduced image through multiple second convolution layers to form multiple first branches with the same number of channels. It then extracts the first feature maps of the multiple first branches, which increase sequentially in depth, and splices them to obtain the first spliced feature map. The first spliced feature map is downsampled through the first pooling layer to obtain the sampled feature map, feature extraction is performed on the sampled feature map through multiple block modules to obtain the feature map of the image, and finally detection is performed according to the feature map of the image to obtain the detection result of the target to be detected.
  • The object detection method of the present application is described in detail below using ShuffleNetV2 as the lightweight classification network. This choice is only an example and does not limit the target detection method; other lightweight classification networks such as Tiny-Darknet, MobileNetV2, or PeleeNet can also be used.
  • the lightweight target detection method provided by this embodiment includes the following steps:
  • The feature extraction network of the target detection method in this embodiment is an improvement of the lightweight classification network ShuffleNetV2.
  • Its structure is shown in the table below: a first convolutional layer (Convolution), a cross-channel branch feature extraction module, and a first pooling layer (MaxPooling) form the stem stage, followed by multiple block modules (ShuffleV2 block) organized into stage2, stage3, and stage4.
  • The feature extraction network includes 16 block modules: stage2 comprises one block module with stride 2 and 3 block modules with stride 1, stage3 comprises one block module with stride 2 and 7 block modules with stride 1, and stage4 comprises one block module with stride 2 and 3 block modules with stride 1.
  • Table 1 The structure of the feature extraction network
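  • The table content itself did not survive extraction; the following Python sketch reconstructs the backbone layout from the prose above. The stage compositions are as stated; everything else (entry names, the omitted channel widths) is purely illustrative.

```python
# Hypothetical layout of the feature extraction network (Table 1), reconstructed
# from the prose: a stem (conv + cross-channel branch module + max pooling)
# followed by stage2/3/4 built from ShuffleV2 blocks. Channel widths are omitted
# because the original table is unavailable.
FEATURE_EXTRACTOR = [
    # (stage, operator, kernel, stride, repeats)
    ("stem",   "Conv 3x3",                    3, 2, 1),  # first convolutional layer
    ("stem",   "cross-channel branch module", None, 1, 1),
    ("stem",   "MaxPool 3x3",                 3, 2, 1),  # first pooling layer
    ("stage2", "ShuffleV2 block",             3, 2, 1),  # one stride-2 block ...
    ("stage2", "ShuffleV2 block",             3, 1, 3),  # ... plus 3 stride-1 blocks
    ("stage3", "ShuffleV2 block",             3, 2, 1),
    ("stage3", "ShuffleV2 block",             3, 1, 7),
    ("stage4", "ShuffleV2 block",             3, 2, 1),
    ("stage4", "ShuffleV2 block",             3, 1, 3),
]
```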
  • The target detection method in this embodiment adopts a cross-channel branching strategy: a cross-channel branch feature extraction module is added in the stem stage of the ShuffleNetV2 network, and the feature maps of the multiple branches are spliced as the input for feature extraction in the subsequent stage2, stage3, and stage4 stages. The information exchange between the channel branches thereby enlarges the receptive field and retains more low-level features, improving the accuracy of target detection while ensuring the detection speed.
  • The image of the target to be detected obtained in step S1 is input into the first convolution layer (Convolution), whose convolution kernel size is 3×3 and stride is 2; the first convolution layer reduces the dimensionality of the image to obtain the dimensionality-reduced image.
  • The cross-channel branch feature extraction module in this embodiment includes a plurality of branch modules and a concatenation layer (Concat). The branch modules compress the dimensionality-reduced image into multiple first branches with the same number of channels and extract the first feature maps of the branches, which increase sequentially in depth.
  • Each branch module includes a second convolutional layer (1×1 Conv) with a 1×1 kernel, through which the dimensionality-reduced image is compressed into the first branches with the same number of channels.
  • The first branch module contains only the second convolutional layer (1×1 Conv); starting from the second branch module, each branch module additionally contains 3×3 convolutional layers (3×3 Conv), and the number of 3×3 layers increases from branch to branch so that the extracted first feature maps increase sequentially in depth. Finally, the first feature maps of the multiple first branches are spliced through the concatenation layer (Concat) to obtain the first spliced feature map.
  • Fig. 2 shows the case where the feature extraction module includes 4 branch modules, whose first branches are a1-a4. This is only for illustration, not limitation; the number of branch modules can be set according to actual needs.
  • The first branch module in this embodiment adds a second pooling layer (Pool) in front of its second convolutional layer (1×1 Conv), which pools the dimensionality-reduced image to enlarge the receptive field of the first branch module while retaining the main features and reducing parameters.
  • The cross-channel branch feature extraction module also adds, in front of the first branch module, a branch module a0 that contains only the second convolutional layer (1×1 Conv); a0 performs channel compression on the dimensionality-reduced image through the second convolutional layer (1×1 Conv) and outputs the result directly to the concatenation layer (Concat). A minimal sketch of this module follows.
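  • As a concrete reading of Figs. 2 and 3, the following PyTorch sketch builds the branch modules described above. The branch width branch_ch, the stride-1 pooling in the first branch module (chosen so all branch outputs can be concatenated), and the absence of normalization or activation layers are assumptions; the text only specifies the kernel sizes and the per-branch depth pattern.

```python
import torch
import torch.nn as nn

class CrossChannelBranchModule(nn.Module):
    """Hypothetical cross-channel branch feature extraction module (Figs. 2 and 3).

    a0 is a bare 1x1 compression fed straight to the concat layer; the first
    branch module (a1) pools before its 1x1 compression; branches a2..ak stack
    an increasing number of 3x3 convolutions so their feature maps deepen in turn.
    """

    def __init__(self, in_ch: int, branch_ch: int, num_branches: int = 4):
        super().__init__()
        # a0: channel compression only, output goes directly to Concat.
        self.a0 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.branches = nn.ModuleList()
        # a1: max pooling (stride 1 is an assumption, chosen so all branch
        # outputs keep the same spatial size), then 1x1 compression.
        self.branches.append(nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        ))
        # a2..ak: 1x1 compression followed by i-1 successive 3x3 convolutions.
        for i in range(2, num_branches + 1):
            layers = [nn.Conv2d(in_ch, branch_ch, kernel_size=1)]
            layers += [nn.Conv2d(branch_ch, branch_ch, kernel_size=3, padding=1)
                       for _ in range(i - 1)]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Splice a0 and all branch outputs along the channel axis (Concat).
        outs = [self.a0(x)] + [branch(x) for branch in self.branches]
        return torch.cat(outs, dim=1)
```

For example, CrossChannelBranchModule(24, 24)(torch.randn(1, 24, 160, 160)) yields a tensor with 5 × 24 channels, one slice per branch plus a0.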
  • step S4 includes:
  • Taking the case where the feature extraction module includes 4 branch modules with first branches a1-a4 as an example: the output of the previous first branch is used as the residual part of the next first branch, and the features of the next first branch at the same depth as the residual part are fused with it.
  • Specifically, the output of the first branch module is fused with the feature map obtained by the second branch module through its second convolutional layer (1×1 Conv), yielding the fused cross-branch feature map of the second branch module; this fused map is then passed through the third convolution layer to extract the first feature map of the second branch module. Likewise, the output of the second branch module is fused with the feature map obtained by the third branch module after its second convolutional layer (1×1 Conv) and convolutional layer (3×3 Conv), yielding the fused cross-branch feature map of the third branch module, from which the third convolution layer extracts the first feature map of the third branch module, and so on.
  • the output ⁇ i of the four branch modules is expressed as follows:
  • represents the convolution operation on the dimensionally reduced image through the second convolution layer (1 ⁇ 1 Conv)
  • S represents the convolution operation through the convolution layer (3 ⁇ 3 Conv)
  • i ⁇ 1,2 ,...,k ⁇ , k is the number of branch modules
  • ⁇ 1 is the maximum pooling of the dimensionality-reduced image through the second pooling layer (Pool) first, and then the The second convolutional layer (1 ⁇ 1 Conv) performs convolution operations.
  • The outputs of the four branch modules and the output of branch module a0 are input to the concatenation layer (Concat) for splicing and then fused with the dimensionality-reduced image to obtain the first spliced feature map.
  • Here, the third convolutional layer refers to the convolutional layer (3×3 Conv) connected to the concatenation layer (Concat) in each branch module; the same principle applies when the number of branch modules is greater than 4 and is not repeated here.
  • By adding residual connections, the cross-channel branch feature extraction module can supplement the original input features and thereby prevent model degradation; a sketch of this forward pass follows.
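  • A minimal sketch of the residual variant's forward pass, reusing the imports and the CrossChannelBranchModule class from the sketch above; how the spliced result is fused with the reduced image is not stated, so that step is left as a comment.

```python
def forward_with_residuals(self: CrossChannelBranchModule, x: torch.Tensor) -> torch.Tensor:
    # xi_1 = C(P(x)); xi_i = S(S^{i-2}(C(x)) + xi_{i-1}), per the equation above.
    outs = [self.branches[0](x)]            # xi_1: Pool -> 1x1 Conv
    for branch in self.branches[1:]:
        feat = branch[:-1](x)               # 1x1 Conv plus all but the last 3x3 Conv
        fused = feat + outs[-1]             # previous branch output as the residual part
        outs.append(branch[-1](fused))      # third convolution layer (final 3x3 Conv)
    spliced = torch.cat([self.a0(x)] + outs, dim=1)
    # The text also fuses the spliced result with the reduced image x; how the
    # channel widths are matched there is not specified, so it is omitted here.
    return spliced
```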
  • In step S5, the first spliced feature map is input into the first pooling layer (MaxPooling) for downsampling to obtain the sampled feature map. The kernel size of the first pooling layer (MaxPooling) is 3×3 with stride 2, max pooling is used, and the downsampling further reduces the amount of computation.
  • In another implementation of this embodiment, a cross-channel branch feature extraction module is also added after the first pooling layer (MaxPooling) of the ShuffleNetV2 network; the structure of this feature extraction network is shown in the following table:
  • The target detection method in this implementation further includes:
  • The sampled feature map is input to the cross-channel branch feature extraction module again; the multiple branch modules compress it into multiple second branches with the same number of channels and extract the second feature maps of the second branches, which increase sequentially in depth.
  • Specifically, the sampled feature map is compressed into the second branches through the second convolution layer (1×1 Conv), the second feature maps of increasing depth are extracted through the convolutional layers (3×3 Conv), and finally the second feature maps of the multiple second branches are spliced through the concatenation layer (Concat) to obtain the second spliced feature map.
  • Step S601 includes:
  • After residual connections are added to the cross-channel branch feature extraction module, the process of obtaining the second spliced feature map is the same as that of obtaining the first spliced feature map and is not repeated here.
  • In step S6, feature extraction is performed on the sampled feature map through multiple block modules to obtain the feature map of the image; in this implementation, feature extraction is specifically performed on the second spliced feature map through the multiple block modules.
  • The target detection method of this embodiment also adds a multi-scale feature fusion module to the ShuffleNetV2 network. The module fuses the features output by the stage3 and stage4 stages, combining low-resolution information with high-resolution information, which effectively supplements the global context information between multi-scale feature maps.
  • Step S6 includes:
  • The first-scale feature map in this embodiment is the output of stage2, i.e., the feature map obtained by passing the second spliced feature map through one block module with stride 2 and three block modules with stride 1 in turn; the second-scale feature map is the output of stage3, obtained by passing the first-scale feature map through one block module with stride 2 and seven block modules with stride 1; the third-scale feature map is the output of stage4, obtained by passing the second-scale feature map through one block module with stride 2 and three block modules with stride 1.
  • In step S62, the third-scale feature map is downsampled, and the obtained fourth-scale feature map carries more high-resolution information. Depthwise separable convolution (3×3 DW Conv) with a 3×3 kernel is used to downsample the third-scale feature map.
  • In step S63, the third-scale and fourth-scale feature maps are upsampled so that the data dimensions of the resulting first and second upsampling feature maps are consistent with those of the second-scale feature map. Bilinear interpolation is used for the upsampling, implemented together with a convolutional layer with a 1×1 kernel (1×1 Conv).
  • Also in step S63, a dimension adjustment is performed on the second-scale feature map through a convolutional layer with a 1×1 kernel (1×1 Conv), raising its dimension so that it is consistent with the data dimensions of the first and second upsampling feature maps.
  • In step S64, fusing the second-scale feature map, the first upsampling feature map, and the second upsampling feature map specifically means fusing the dimension-raised feature map, the first upsampling feature map, and the second upsampling feature map to obtain the feature map F_mfm of the image.
  • In step S64, the second-scale feature map and the first upsampling feature map are selected for the low-resolution information and the second upsampling feature map for the high-resolution information; fusing the three yields the feature map of the image, combining low-resolution and high-resolution information, effectively supplementing the global context between multi-scale feature maps, and avoiding information loss. A sketch of this fusion module follows.
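  • A sketch of the multi-scale feature fusion module under stated assumptions: the channel widths c3/c4 and the element-wise sum used for the final fusion are not given by the text, and the depthwise separable downsampling is rendered as a depthwise 3×3 (stride 2) followed by a pointwise 1×1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Hypothetical multi-scale feature fusion module (Fig. 4)."""

    def __init__(self, c3: int, c4: int, out_ch: int):
        super().__init__()
        # Downsample the third-scale (stage4) map with a 3x3 depthwise separable conv.
        self.down = nn.Sequential(
            nn.Conv2d(c4, c4, 3, stride=2, padding=1, groups=c4),  # 3x3 DW Conv
            nn.Conv2d(c4, c4, 1),                                  # pointwise half
        )
        # 1x1 convs paired with bilinear upsampling, plus the dimension raise.
        self.up3 = nn.Conv2d(c4, out_ch, 1)      # third-scale -> first upsampled map
        self.up4 = nn.Conv2d(c4, out_ch, 1)      # fourth-scale -> second upsampled map
        self.raise2 = nn.Conv2d(c3, out_ch, 1)   # dimension raise of second-scale map

    def forward(self, f2: torch.Tensor, f3: torch.Tensor) -> torch.Tensor:
        # f2: second-scale map (stage3 output); f3: third-scale map (stage4 output)
        f4 = self.down(f3)                        # fourth-scale feature map
        size = f2.shape[-2:]
        u1 = F.interpolate(self.up3(f3), size=size, mode="bilinear", align_corners=False)
        u2 = F.interpolate(self.up4(f4), size=size, mode="bilinear", align_corners=False)
        return self.raise2(f2) + u1 + u2          # fusion -> F_mfm (sum is an assumption)
```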
  • The detection network in this embodiment is an improvement of an existing lightweight detection network, specifically the Light-Head R-CNN network, which comprises an RPN, a PSROI (position-sensitive ROI pooling) layer, and a fully connected layer. The detection network in this embodiment adds a channel self-attention network to the Light-Head R-CNN network.
  • The RPN includes a fourth convolutional layer (DW Conv), a fifth convolutional layer (1×1 Conv), and a candidate region extraction module (ROIs), cascaded in sequence.
  • The Light-Head R-CNN network is used only as an example and not as a limitation; a channel self-attention network can also be added to other lightweight detection networks to form the detection network of this embodiment.
  • step S7 includes:
  • In step S71, the feature map of the image is passed sequentially through the fourth convolution layer (DW Conv) and the fifth convolution layer (1×1 Conv) to obtain the feature map of the image in the RPN network; the kernel size of the fifth convolution layer (1×1 Conv) is 1×1, and the fourth convolution layer (DW Conv) applies depthwise separable convolution to the feature map of the image.
  • The feature map of the image in the RPN network is then passed through the candidate region extraction module (ROIs) to obtain the candidate frames containing the object to be detected; a minimal sketch of this thin RPN head follows.
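  • A tiny sketch of the RPN head described above; the channel counts are illustrative assumptions, and proposal generation from the resulting map is omitted.

```python
import torch
import torch.nn as nn

c, rpn_ch = 256, 245  # illustrative widths only; the text gives no channel counts
rpn_head = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1, groups=c),  # fourth convolutional layer (DW Conv)
    nn.Conv2d(c, rpn_ch, 1),                  # fifth convolutional layer (1x1 Conv)
)
f_rpn = rpn_head(torch.randn(1, c, 20, 20))   # feature map of the image in the RPN
# f_rpn is then consumed by the candidate region extraction module (ROIs).
```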
  • As noted, this embodiment adds a channel self-attention network to the existing Light-Head R-CNN network. The channel self-attention network optimizes the feature distribution of the feature map input to the PSROI (position-sensitive ROI pooling) layer, so that the output feature map attends more to detection-relevant areas, improving the accuracy of the detection result.
  • step S72 includes:
  • the channel self-attention network in this embodiment includes a first segmentation module and a channel attention weight acquisition module.
  • The first segmentation module divides the feature map F_rpn of the image in the RPN network into a first sub-feature map F_1 and a second sub-feature map F_2 with equal numbers of channels.
  • The segmentation here directly divides the channels: for example, if the feature map of the image in the RPN network has 8 channels, the data of channels 1-4 become the first sub-feature map F_1 and the data of channels 5-8 become the second sub-feature map F_2.
  • The channel attention weight acquisition module includes a second segmentation module, a grouped convolution layer (Group Conv), a depthwise separable convolution layer (DW Conv), a softmax layer, a third pooling layer (Avg pool), and a sixth convolution layer (1×1 Conv).
  • The second segmentation module divides the first sub-feature map F_1 into a third sub-feature map F_3 and a fourth sub-feature map F_4 with equal numbers of channels.
  • Again the channels are divided directly: continuing the example of an 8-channel feature map in the RPN network, F_1 has 4 channels, so the data of channels 1-2 become the third sub-feature map F_3 and the data of channels 3-4 become the fourth sub-feature map F_4.
  • The third sub-feature map F_3 and the fourth sub-feature map F_4 are input to the grouped convolution layer (Group Conv) and the depthwise separable convolution layer (DW Conv), respectively; their outputs are fused and then processed sequentially through the softmax layer, the third pooling layer (Avg pool), and the sixth convolution layer (1×1 Conv) to obtain the channel attention weight K.
  • The third pooling layer uses mean pooling, and the sixth convolution layer (1×1 Conv) raises the dimension so that the channel attention weight K has the same dimension as the second sub-feature map F_2.
  • In step S73, the channel attention feature map and the feature map of the image are finally fused through the channel self-attention network to obtain the fused feature map.
  • The channel self-attention network in this embodiment combines channel separation with the self-attention mechanism. Channel separation lets the information in each channel group interact while significantly reducing the complexity of the network structure and thus the number of parameters; the self-attention mechanism suppresses background features and highlights foreground features. In addition, combining the channel attention feature map with the feature map of the image enlarges the field of view of each spatial location and enriches the output features. A sketch of this network follows.
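  • The following sketch assembles the pieces above into one module. Where the text is silent, the choices are assumptions: the fusion of the Group Conv and DW Conv outputs (taken here as element-wise addition), the axis of the softmax (taken over spatial positions), the group count, and all kernel sizes other than 1×1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Hypothetical channel self-attention network (Fig. 6)."""

    def __init__(self, ch: int, groups: int = 2):
        super().__init__()
        q = ch // 4  # width of F3/F4 after the two channel splits
        self.group_conv = nn.Conv2d(q, q, 3, padding=1, groups=groups)  # Group Conv
        self.dw_conv = nn.Conv2d(q, q, 3, padding=1, groups=q)          # DW Conv
        self.raise_dim = nn.Conv2d(q, ch // 2, 1)  # sixth conv layer: match F2's width

    def forward(self, f_rpn: torch.Tensor) -> torch.Tensor:
        f1, f2 = torch.chunk(f_rpn, 2, dim=1)      # first split: F1 / F2
        f3, f4 = torch.chunk(f1, 2, dim=1)         # second split: F3 / F4
        fused = self.group_conv(f3) + self.dw_conv(f4)  # fuse the two conv paths
        attn = F.softmax(fused.flatten(2), dim=-1).view_as(fused)  # softmax layer
        attn = F.adaptive_avg_pool2d(attn, 1)      # third pooling layer (Avg pool)
        k = self.raise_dim(attn)                   # channel attention weight K
        return k * f2                              # channel attention feature map
```

Multiplying K by F_2 broadcasts one weight per channel over all spatial positions, which is what makes this a channel (rather than spatial) attention.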
  • The PSROI (position-sensitive ROI pooling) layer maps the candidate frames onto the fused feature map and extracts the feature map of the target to be detected from it according to the candidate frames.
  • The feature map of the target to be detected then passes through the fully connected layer to obtain the detection result: the fully connected layer outputs category probabilities used for classification and position offset information used to regress the location of the target.
  • The target detection method in this embodiment is mainly applied to mobile terminal devices.
  • A network model constructed according to this method is first trained with training data on a server; evaluation data are then used to select the best-performing model, which is finally deployed to the mobile terminal through the onnx tool, where the detection algorithm runs on real data and the detection results are visualized. A hypothetical export step is sketched below.
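  • A hypothetical export step using the PyTorch ONNX exporter; the model here is a stand-in, and the file name, opset version, and tensor names are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1))  # stand-in for the trained detector
model.eval()
dummy = torch.randn(1, 3, 320, 320)  # images are scaled to 320x320 before input
torch.onnx.export(model, dummy, "light_detector.onnx",
                  input_names=["image"], output_names=["output"],
                  opset_version=11)
```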
  • The target detection method in this embodiment was verified on the public dataset PASCAL VOC.
  • The experimental results show that it needs only 528 MFLOPs to reach an accuracy of 70.6 mAP, achieving a good balance between accuracy and model complexity.
  • Each image is scaled to 320×320 as input, and the network model constructed according to the object detection method of this embodiment is trained on an NVIDIA TITAN RTX with 24 GB of memory, using a stochastic gradient descent optimizer with a learning rate of 0.0001 and a weight decay of 0.001. The dataset was randomly divided into a training set (60%), a validation set (20%), and a test set (20%) so that the data in the training, validation, and testing stages have similar distributions. A configuration sketch under these settings follows.
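  • A sketch of this training configuration; only the optimizer type, learning rate, weight decay, input size, and 60/20/20 split come from the text. The momentum value and the stand-in model and dataset are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, random_split

model = nn.Conv2d(3, 8, 3, padding=1)                    # stand-in for the detector
dataset = TensorDataset(torch.randn(100, 3, 320, 320))   # stand-in for PASCAL VOC

# SGD with lr 1e-4 and weight decay 1e-3, as stated; momentum is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            weight_decay=1e-3, momentum=0.9)

# Random 60/20/20 train/validation/test split.
n = len(dataset)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val])
```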
  • Millions of floating-point operations (MFLOPs) are used to measure the complexity and efficiency of a lightweight network model, and the performance of the model is evaluated by the mean average precision (mAP); one way to obtain such a complexity figure is sketched below.
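  • A common way to obtain an MFLOPs-style figure, shown here with the third-party thop counter; this tool choice is an assumption, as the text does not say how the MFLOPs were measured.

```python
import torch
import torch.nn as nn
from thop import profile  # pip install thop

model = nn.Conv2d(3, 8, 3, padding=1)  # stand-in for the detection network
macs, params = profile(model, inputs=(torch.randn(1, 3, 320, 320),))
# thop reports multiply-accumulate operations; divide by 1e6 for a millions-scale figure.
print(f"{macs / 1e6:.1f} M MACs, {params / 1e6:.2f} M parameters")
```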
  • Compared with the models listed in Table 2, the object detection method in this example ("our model" in Table 2) offers a better trade-off between accuracy and model complexity and is therefore more suitable for mobile terminal devices.
  • Its MFLOPs are much smaller than those of Tiny-YOLO, D-YOLO, and MobileNet-SSD while its accuracy is higher, and it can produce similar accuracy with half the model complexity. The target detection method in this embodiment thus achieves a good balance between accuracy and model complexity.
  • The present embodiment also provides a target detection device corresponding to the above target detection method.
  • The target detection device includes an acquisition module 1, a dimensionality reduction module 2, a compression module 3, a splicing module 4, a sampling module 5, a feature extraction module 6, and a detection module 7.
  • The acquisition module 1 acquires the image of the target to be detected; the dimensionality reduction module 2 reduces the dimensionality of the image through the first convolution layer to obtain the dimensionality-reduced image; the compression module 3 compresses the dimensionality-reduced image through multiple second convolution layers to obtain multiple first branches with the same number of channels; the splicing module 4 extracts the first feature maps of the multiple first branches, which increase sequentially in depth, and splices them to obtain the first spliced feature map; the sampling module 5 downsamples the first spliced feature map through the first pooling layer to obtain the sampled feature map; the feature extraction module 6 performs feature extraction on the sampled feature map through multiple block modules to obtain the feature map of the image; and the detection module 7 performs detection according to the feature map of the image to obtain the detection result of the target to be detected.
  • The splicing module 4 in this embodiment is also configured to use the output of the previous first branch as the residual part of the next first branch, fuse the features of the next first branch at the same depth as the residual part with the residual part to obtain the fused cross-branch feature maps of the first branches, perform feature extraction on the fused cross-branch feature maps through the third convolution layer to obtain the first feature maps of the multiple first branches, and splice the first feature maps and fuse them with the dimensionality-reduced image to obtain the first spliced feature map.
  • The compression module 3 is also used to compress the sampled feature map through multiple second convolution layers to obtain multiple second branches with the same number of channels, and the splicing module 4 is also used to extract the second feature maps of the multiple second branches, which increase sequentially in depth, and splice them to obtain the second spliced feature map.
  • The splicing module 4 in this embodiment is likewise configured to use the output of the previous second branch as the residual part of the next second branch, fuse the features of the next second branch at the same depth as the residual part with the residual part to obtain the fused cross-branch feature maps of the second branches, perform feature extraction on these through the third convolution layer to obtain the second feature maps of the multiple second branches, and splice the second feature maps and fuse them with the sampled feature map to obtain the second spliced feature map.
  • The feature extraction module 6 is further configured to perform feature extraction on the second spliced feature map through the multiple block modules to obtain the feature map of the image. Specifically, it passes the sampled feature map (or second spliced feature map) through the multiple block modules to obtain the first-, second-, and third-scale feature maps in turn, downsamples the third-scale feature map to obtain the fourth-scale feature map, upsamples the third- and fourth-scale feature maps to obtain the first and second upsampling feature maps, and fuses the second-scale feature map, the first upsampling feature map, and the second upsampling feature map to obtain the feature map of the image.
  • The detection module 7 in this embodiment passes the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and the candidate frames containing the target to be detected, generates the channel attention feature map from the feature map of the image in the RPN network, fuses the channel attention feature map with the feature map of the image to obtain the fused feature map, obtains the feature map of the target to be detected from the candidate frames and the fused feature map, and obtains the detection result of the target to be detected from that feature map.
  • The detection module 7 is further configured to divide the feature map of the image in the RPN network into a first sub-feature map and a second sub-feature map with equal numbers of channels, obtain the channel attention weight from the first sub-feature map, and multiply the channel attention weight by the second sub-feature map to obtain the channel attention feature map.
  • this embodiment provides a device including a memory 100, a processor 200, and a network interface 202.
  • the memory 100 stores a computer program
  • the processor 200 executes the computer program to implement the target detection method in this embodiment.
  • The memory 100 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory.
  • the processor 200 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the target detection method in this embodiment may be completed by an integrated logic circuit of hardware in the processor 200 or an instruction in the form of software.
  • The processor 200 may also be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP), or a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the memory 100 is used to store a computer program, and after receiving the execution instruction, the processor 200 executes the computer program to implement the target detection method in this embodiment.
  • This embodiment also provides a computer storage medium 201 in which a computer program is stored; the processor 200 is configured to read and execute the computer program stored in the computer storage medium 201 to implement the target detection method in this embodiment.
  • The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, they may be realized in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are produced.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer storage medium or transmitted from one computer storage medium to another, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, etc. that includes an integration of one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.
  • Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, apparatuses, and computer program products according to embodiments of the present invention. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A light-weighted target detection method and device, and a storage medium. The method comprises: acquiring an image of a target to be detected (S1); performing dimensionality reduction on the image by means of a first convolutional layer to obtain a dimensionality-reduced image (S2); respectively compressing the dimensionality-reduced image by means of a plurality of second convolutional layers to obtain a plurality of first branches (S3); respectively extracting first feature maps of the plurality of first branches and splicing them to obtain a first spliced feature map (S4); down-sampling the first spliced feature map by means of a first pooling layer to obtain a sampled feature map (S5); performing feature extraction on the sampled feature map by means of a plurality of block modules to obtain a feature map of the image (S6); and performing detection according to the feature map of the image to obtain a detection result of the target to be detected (S7). By using a cross-channel branch strategy at the feature extraction stage, the feature maps of a plurality of branches are spliced to serve as the basis for subsequent feature extraction, such that the receptive field is expanded and more low-level features are preserved, thereby ensuring detection speed while also increasing accuracy.

Description

Lightweight target detection method, device, and storage medium

Technical Field

The invention relates to the technical field of image processing, and in particular to a lightweight target detection method, device, and storage medium.

Background

Object detection is a fundamental visual recognition task in computer vision and is widely used in areas such as autonomous driving and security inspection. With the great success of deep learning in image classification tasks in recent years, object detection networks based on convolutional neural networks (CNNs) have gradually become mainstream. Common CNN-based target detection networks include Faster R-CNN, R-FCN, SSD, and YOLO. These networks rely on complex structures whose computational cost, measured in millions of floating point operations (MFLOPs), reaches five figures; they run accurately and quickly on server GPUs, but the limited computing power and memory of mobile devices cannot carry so many network parameters and computations, so these networks are unsuitable for real-time deployment and application in mobile scenarios. Existing lightweight target detection networks include MobileNet-SSD, MobileNetV2-SSD Lite, Tiny-YOLO, and D-YOLO, but they do not achieve a good balance between accuracy and model complexity.
发明内容SUMMARY OF THE INVENTION
为了解决现有技术的不足,本发明提供一种轻量化的目标检测方法、设备、存储介质,能够在保证检测速度的同时提升目标检测的准确率。In order to solve the deficiencies of the prior art, the present invention provides a lightweight target detection method, device and storage medium, which can improve the accuracy of target detection while ensuring the detection speed.
本发明提出的具体技术方案为:一种轻量化的目标检测方法,所述目标检测方法包括:The specific technical solution proposed by the present invention is: a lightweight target detection method, the target detection method includes:
获取待检测目标的图像;Obtain the image of the target to be detected;
将所述图像通过第一卷积层进行降维,获得降维后的图像;reducing the dimension of the image through the first convolution layer to obtain a dimension-reduced image;
将所述降维后的图像分别通过多个第二卷积层进行压缩,获得多个第一分 支,所述多个第一分支具有相同的通道数;The images after the dimensionality reduction are respectively compressed through a plurality of second convolution layers to obtain a plurality of first branches, and the plurality of first branches have the same number of channels;
分别提取所述多个第一分支的第一特征图并将所述多个第一分支的第一特征图进行拼接,获得第一拼接特征图,所述多个第一分支的第一特征图在深度上依次递增;Extracting the first feature maps of the multiple first branches respectively and splicing the first feature maps of the multiple first branches to obtain a first splicing feature map, the first feature maps of the multiple first branches increasing in depth;
将第一拼接特征图通过第一池化层进行下采样,获得采样后的特征图;Downsampling the first stitched feature map through the first pooling layer to obtain a sampled feature map;
将所述采样后的特征图通过多个block模块进行特征提取,获得所述图像的特征图;Perform feature extraction on the sampled feature map through a plurality of block modules to obtain the feature map of the image;
根据所述图像的特征图进行检测,获得所述待检测目标的检测结果。The detection is performed according to the feature map of the image, and the detection result of the target to be detected is obtained.
Further, before the feature extraction through the plurality of block modules, the target detection method further includes:

compressing the sampled feature map through a plurality of second convolution layers to obtain a plurality of second branches with the same number of channels;

extracting the second feature maps of the plurality of second branches and splicing them to obtain a second spliced feature map, where the second feature maps of the plurality of second branches increase sequentially in depth;

correspondingly, performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image then means performing feature extraction on the second spliced feature map through the plurality of block modules.

Further, for the first branch or second branch with the smallest depth, compressing the dimensionality-reduced image or the sampled feature map through the second convolution layer includes pooling and compressing it sequentially through a second pooling layer and the second convolution layer.

Further, extracting and splicing the first feature maps of the plurality of first branches (or the second feature maps of the plurality of second branches) to obtain the first (or second) spliced feature map includes:

using the output of the previous branch as the residual part of the next branch and fusing the features of the next branch at the same depth as the residual part with the residual part, to obtain the fused cross-branch feature maps of the plurality of branches;

performing feature extraction on the fused cross-branch feature maps through a third convolution layer to obtain the first feature maps of the plurality of first branches (or the second feature maps of the plurality of second branches);

splicing the first feature maps and fusing them with the dimensionality-reduced image (or splicing the second feature maps and fusing them with the sampled feature map) to obtain the first (or second) spliced feature map.
Further, performing feature extraction on the sampled feature map (or second spliced feature map) through the plurality of block modules to obtain the feature map of the image includes:

passing the sampled feature map (or second spliced feature map) through the plurality of block modules to obtain a first-scale feature map, a second-scale feature map, and a third-scale feature map in turn;

downsampling the third-scale feature map to obtain a fourth-scale feature map;

upsampling the third-scale and fourth-scale feature maps to obtain a first upsampling feature map and a second upsampling feature map;

fusing the second-scale feature map, the first upsampling feature map, and the second upsampling feature map to obtain the feature map of the image.
Further, performing detection according to the feature map of the image to obtain the detection result of the target to be detected includes:

passing the feature map of the image through an RPN network to obtain the feature map of the image in the RPN network and candidate frames containing the target to be detected;

generating a channel attention feature map according to the feature map of the image in the RPN network;

fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;

obtaining the feature map of the target to be detected according to the candidate frames and the fused feature map;

obtaining the detection result of the target to be detected according to the feature map of the target to be detected.

Further, generating the channel attention feature map according to the feature map of the image in the RPN network includes:

dividing the feature map of the image in the RPN network into a first sub-feature map and a second sub-feature map with equal numbers of channels;

obtaining a channel attention weight according to the first sub-feature map;

multiplying the channel attention weight by the second sub-feature map to obtain the channel attention feature map.

The present invention also provides a device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement any one of the above target detection methods.

The present invention also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement any one of the above target detection methods.

The target detection method proposed by the present invention first compresses the dimensionality-reduced image through multiple second convolution layers in the feature extraction stage to obtain multiple first branches, then extracts and splices the first feature maps of the multiple first branches to obtain a first spliced feature map. This cross-channel branching strategy uses the spliced feature maps of multiple branches as the basis for subsequent feature extraction, so that the information exchange between the channel branches enlarges the receptive field and retains more low-level features, improving the accuracy of target detection while ensuring the detection speed.
Description of the Drawings
The technical solutions and other beneficial effects of the present invention will become apparent from the following detailed description of specific embodiments of the present invention with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the target detection method in an embodiment of the present application;
Fig. 2 is a schematic diagram of the cross-channel branch feature extraction module in an embodiment of the present application;
Fig. 3 is another schematic diagram of the cross-channel branch feature extraction module in an embodiment of the present application;
Fig. 4 is a schematic diagram of the multi-scale feature fusion module in an embodiment of the present application;
Fig. 5 is a schematic diagram of the detection network in an embodiment of the present application;
Fig. 6 is a schematic diagram of the channel self-attention network in an embodiment of the present application;
Fig. 7 is a schematic diagram of the target detection apparatus in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of the device in an embodiment of the present application.
Detailed Description of the Embodiments
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the specific embodiments set forth herein. Rather, these embodiments are provided to explain the principles of the invention and their practical application, thereby enabling others skilled in the art to understand the various embodiments of the invention and the various modifications suited to particular intended uses. Throughout the drawings, the same reference numerals are used to denote the same elements.
CNN-based target detection networks are divided into two categories according to whether they contain a region proposal network (RPN): one-stage and two-stage. A one-stage detection network regresses and predicts target categories and bounding boxes directly from the feature map; its structure is simpler and more efficient, so it is often considered better suited to lightweight designs. A two-stage detection network, by contrast, can achieve better detection performance owing to the added step of candidate region selection. Most research on lightweight target detection networks is based on the one-stage paradigm, for example MobileNet-SSD, MobileNetV2-SSD Lite, Tiny-YOLO, D-YOLO, and Pelee; there are also two-stage lightweight detection networks, such as Light-Head R-CNN. However, neither the existing one-stage nor the existing two-stage lightweight detection networks strike a good balance between accuracy and model complexity.
In view of the above problems, the present application provides a lightweight target detection method. The method is a two-stage lightweight method comprising a feature extraction stage and a detection stage. In the feature extraction stage, a cross-channel branch strategy adds cross-channel branches to the structure of an existing lightweight classification network, and the concatenated feature maps of the multiple branches serve as the basis for subsequent feature extraction, so that the information exchange among the channel branches enlarges the receptive field and retains more low-level features, improving detection accuracy while maintaining detection speed. Specifically, the present application first acquires an image of the target to be detected and reduces its dimensionality through a first convolutional layer to obtain a dimension-reduced image; the dimension-reduced image is then compressed through second convolutional layers into multiple first branches with the same number of channels; the first feature maps of the multiple first branches, which increase sequentially in depth, are extracted and concatenated to obtain a first concatenated feature map; the first concatenated feature map is downsampled through a first pooling layer to obtain a sampled feature map; the sampled feature map is passed through multiple block modules for feature extraction to obtain the feature map of the image; finally, detection is performed according to the feature map of the image to obtain the detection result of the target to be detected.
The target detection method of the present application is described in detail below with the lightweight classification network ShuffleNetV2 as an example. It should be noted that ShuffleNetV2 serves only as an example and does not limit the target detection method of the present application; other lightweight classification networks, such as Tiny-Darknet, MobileNetV2, or PeleeNet, may also be used.
Referring to Fig. 1, the lightweight target detection method provided by this embodiment includes the following steps:
S1. Acquire an image of the target to be detected.
S2. Reduce the dimensionality of the image through a first convolutional layer to obtain a dimension-reduced image.
S3. Compress the dimension-reduced image through multiple second convolutional layers to obtain multiple first branches, the multiple first branches having the same number of channels.
S4. Extract the first feature maps of the multiple first branches and concatenate them to obtain a first concatenated feature map, the first feature maps of the multiple first branches increasing sequentially in depth.
S5. Downsample the first concatenated feature map through a first pooling layer to obtain a sampled feature map.
S6. Pass the sampled feature map through multiple block modules for feature extraction to obtain the feature map of the image.
S7. Perform detection according to the feature map of the image to obtain the detection result of the target to be detected.
The feature extraction network of the target detection method in this embodiment is an improvement on the lightweight classification network ShuffleNetV2. Its structure, shown in Table 1 below, comprises, cascaded in sequence, a first convolutional layer (Convolution), a cross-channel branch feature extraction module, a first pooling layer (MaxPooling), and multiple block modules (ShuffleV2 blocks). The first convolutional layer and the first pooling layer constitute the stem stage, and the block modules constitute stages 2, 3, and 4. Specifically, the feature extraction network includes 16 block modules: stage 2 comprises one block module with stride 2 and three block modules with stride 1, stage 3 comprises one block module with stride 2 and seven block modules with stride 1, and stage 4 comprises one block module with stride 2 and three block modules with stride 1.
Table 1  Structure of the feature extraction network (the full table, including output sizes and channel widths, is reproduced as an image in the published application)
Stem: 3×3 Convolution (stride 2) → cross-channel branch feature extraction module → 3×3 MaxPooling (stride 2)
Stage 2: 1 ShuffleV2 block (stride 2) + 3 ShuffleV2 blocks (stride 1)
Stage 3: 1 ShuffleV2 block (stride 2) + 7 ShuffleV2 blocks (stride 1)
Stage 4: 1 ShuffleV2 block (stride 2) + 3 ShuffleV2 blocks (stride 1)
The target detection method in this embodiment adopts a cross-channel branch strategy: a cross-channel branch feature extraction module is added in the stem stage of the ShuffleNetV2 network, and the concatenated feature maps of multiple branches serve as the basis for feature extraction in the subsequent stages 2, 3, and 4. The information exchange among the channel branches thus enlarges the receptive field and retains more low-level features, improving detection accuracy while maintaining detection speed.
Specifically, the image of the target to be detected acquired in step S1 is input into the first convolutional layer (Convolution), whose kernel size is 3×3 and stride is 2; the first convolutional layer reduces the dimensionality of the image to obtain the dimension-reduced image.
Referring to Fig. 2, the cross-channel branch feature extraction module in this embodiment includes multiple branch modules and a concatenation layer (Concat). The branch modules compress the dimension-reduced image into multiple first branches with the same number of channels and extract the first feature maps of the multiple first branches, which increase sequentially in depth. Each branch module includes a second convolutional layer (1×1 Conv) with kernel size 1×1, through which the dimension-reduced image is compressed into the first branches with equal channel counts. The first branch module includes only the second convolutional layer (1×1 Conv); starting from the second branch module, each branch module further includes 3×3 convolutional layers (3×3 Conv), whose number increases module by module, so as to extract first feature maps of sequentially increasing depth. Finally, the concatenation layer (Concat) concatenates the first feature maps of the multiple first branches to obtain the first concatenated feature map. Fig. 2 shows the case of four branch modules, whose first branches are denoted a1 to a4; this is merely an example and not a limitation, and the number of branch modules can be set according to actual needs.
Preferably, the first branch module in this embodiment adds a second pooling layer (Pool) in front of its second convolutional layer (1×1 Conv); the dimension-reduced image is first pooled to enlarge the receptive field of the first branch module, retaining the main features while reducing parameters. In addition, on the basis of the first branch module including the second pooling layer (Pool), in order to retain more information of the original image, the cross-channel branch feature extraction module adds, in front of the first branch module, a branch module a0 that includes only a second convolutional layer (1×1 Conv); this branch module performs channel compression on the dimension-reduced image through the 1×1 convolution and outputs the result directly to the concatenation layer (Concat).
Referring to Fig. 3, since a model degrades as network depth increases, residual connections are added to the cross-channel branch feature extraction module in this embodiment to address this problem: the output of the previous branch serves as the residual part of the next branch and is fused with the features of the next branch at the same depth before feature extraction. Fig. 3 shows the structure of the cross-channel branch feature extraction module of Fig. 2 with the residual connections added. Specifically, step S4 includes:
S41. Take the output of the previous first branch as the residual part of the next first branch and fuse the features of the next first branch at the same depth with the residual part to obtain the fused cross-branch feature map of that first branch.
S42. Perform feature extraction on the fused cross-branch feature map of that first branch through a third convolutional layer to obtain the first feature maps of the multiple first branches.
S43. Concatenate the first feature maps of the multiple first branches and fuse the result with the dimension-reduced image to obtain the first concatenated feature map.
Taking the case of four branch modules whose first branches are a1 to a4 as an example, using the output of the previous first branch as the residual part of the next first branch and fusing it with the features of the next first branch at the same depth proceeds as follows. The output of the first branch module is fused with the feature map obtained by the second branch module after its second convolutional layer (1×1 Conv) to obtain the fused cross-branch feature map of the second branch module; feature extraction is then performed on it through the third convolutional layer to obtain the first feature map of the second branch module. The output of the second branch module is fused with the feature map obtained by the third branch module after its second convolutional layer (1×1 Conv) and one convolutional layer (3×3 Conv) to obtain the fused cross-branch feature map of the third branch module; feature extraction is then performed on it through the third convolutional layer to obtain the first feature map of the third branch module. The output of the third branch module is fused with the feature map obtained by the fourth branch module after its second convolutional layer (1×1 Conv) and two convolutional layers (3×3 Conv) to obtain the fused cross-branch feature map of the fourth branch module; feature extraction is then performed on it through the third convolutional layer to obtain the first feature map of the fourth branch module. The outputs γ_i of the four branch modules are expressed as follows:
$$\gamma_i = \begin{cases} \alpha_1(x), & i = 1 \\ S\!\left(S^{\,i-2}\!\big(\alpha_i(x)\big) + \gamma_{i-1}\right), & 2 \le i \le k \end{cases}$$
where x denotes the dimension-reduced image, α denotes the convolution operation of the second convolutional layer (1×1 Conv), S denotes the convolution operation of a convolutional layer (3×3 Conv), S^{i-2} denotes i-2 successive such convolutions, and i ∈ {1, 2, ..., k}, with k being the number of branch modules. It should be noted that α_1 first performs max pooling on the dimension-reduced image through the second pooling layer (Pool) and then performs the convolution through the second convolutional layer (1×1 Conv).
After the outputs γ_i of the four branch modules are obtained, the outputs of the four branch modules and the output of branch module a0 are input to the concatenation layer (Concat) for concatenation and fused with the dimension-reduced image to obtain the first concatenated feature map. It should be noted that the third convolutional layer refers to the convolutional layer (3×3 Conv) connected to the concatenation layer (Concat) in each branch module; the case where the number of branch modules is greater than four follows the same principle and is not repeated here.
By adding residual connections, the cross-channel branch feature extraction module in this embodiment supplements the original input features and thereby prevents model degradation.
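For illustration, the module described above can be sketched in PyTorch. The following is a minimal sketch under stated assumptions, not the patented implementation: the input width (40 channels), the per-branch width (8 channels), the number of branches (four, plus a0), and the use of element-wise addition for the fusion steps are all choices made here for demonstration.

```python
import torch
import torch.nn as nn

class CrossChannelBranchModule(nn.Module):
    """Sketch of the cross-channel branch feature extraction module with
    residual connections (Fig. 3). Widths and the fusion operator are assumed."""

    def __init__(self, in_ch=40, branch_ch=8, num_branches=4):
        super().__init__()
        # Branch a0: 1x1 channel compression only; its output goes straight to Concat.
        self.a0 = nn.Conv2d(in_ch, branch_ch, 1)
        # alpha_1 pools before its 1x1 compression (second pooling layer).
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)
        # One 1x1 compression (alpha_i) per first branch.
        self.alphas = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, 1) for _ in range(num_branches))
        # Branch i applies i-2 stacked 3x3 convs before the residual fusion.
        self.stacks = nn.ModuleList(
            nn.Sequential(*(nn.Conv2d(branch_ch, branch_ch, 3, padding=1)
                            for _ in range(i)))
            for i in range(num_branches - 1))
        # "Third convolutional layer": the 3x3 conv feeding the Concat in each branch.
        self.heads = nn.ModuleList(
            nn.Conv2d(branch_ch, branch_ch, 3, padding=1)
            for _ in range(num_branches - 1))

    def forward(self, x):
        outs = [self.a0(x)]
        gamma = self.alphas[0](self.pool(x))          # gamma_1 = alpha_1(x)
        outs.append(gamma)
        for i in range(1, len(self.alphas)):
            feat = self.stacks[i - 1](self.alphas[i](x))
            gamma = self.heads[i - 1](feat + gamma)   # residual fusion, then 3x3 conv
            outs.append(gamma)
        # Concatenate all branches and fuse with the input (here by addition,
        # which requires the widths to match: 5 branches x 8 channels = 40).
        return torch.cat(outs, dim=1) + x
```

For example, `CrossChannelBranchModule()(torch.randn(1, 40, 160, 160))` returns a tensor of the same shape, since the five concatenated branch outputs together match the input width and are fused with it by addition.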
In step S5, the first concatenated feature map is input into the first pooling layer (MaxPooling) for downsampling to obtain the sampled feature map. The first pooling layer has a 3×3 kernel and a stride of 2 and uses max pooling; the downsampling further reduces the amount of computation.
In order to further enlarge the receptive field of the network and retain more details, in another implementation of this embodiment a cross-channel branch feature extraction module is also added after the first pooling layer (MaxPooling) of the ShuffleNetV2 network. The structure of the feature extraction network in this implementation is shown in Table 2 below.
Table 2  Another structure of the feature extraction network (the full table, including output sizes and channel widths, is reproduced as an image in the published application)
Stem: 3×3 Convolution (stride 2) → cross-channel branch feature extraction module → 3×3 MaxPooling (stride 2) → cross-channel branch feature extraction module
Stage 2: 1 ShuffleV2 block (stride 2) + 3 ShuffleV2 blocks (stride 1)
Stage 3: 1 ShuffleV2 block (stride 2) + 7 ShuffleV2 blocks (stride 1)
Stage 4: 1 ShuffleV2 block (stride 2) + 3 ShuffleV2 blocks (stride 1)
In this other implementation, the target detection method further includes, before step S6:
S600. Compress the sampled feature map through multiple second convolutional layers (1×1 Conv) to obtain multiple second branches, the multiple second branches having the same number of channels.
S601. Extract the second feature maps of the multiple second branches and concatenate them to obtain a second concatenated feature map, the second feature maps of the multiple second branches increasing sequentially in depth.
Specifically, the sampled feature map is input into the cross-channel branch feature extraction module again. The branch modules compress the sampled feature map into multiple second branches with the same number of channels and extract the second feature maps of the multiple second branches, which increase sequentially in depth. The sampled feature map is compressed into the second branches through the second convolutional layers (1×1 Conv), the second feature maps of sequentially increasing depth are extracted through the convolutional layers (3×3 Conv), and finally the concatenation layer (Concat) concatenates the second feature maps to obtain the second concatenated feature map. The process of obtaining the second concatenated feature map is the same as that of obtaining the first concatenated feature map and is not repeated here.
Likewise, in order to solve the problem of model degradation, residual connections are also added to the cross-channel branch feature extraction module after the first pooling layer (MaxPooling): the output of the previous branch serves as the residual part of the next branch and is fused with the features of the next branch at the same depth before feature extraction, with the structure shown in Fig. 3. Step S601 then includes:
S6011. Take the output of the previous second branch as the residual part of the next second branch and fuse the features of the next second branch at the same depth with the residual part to obtain the fused cross-branch feature map of that second branch.
S6012. Perform feature extraction on the fused cross-branch feature map of that second branch through a third convolutional layer to obtain the second feature maps of the multiple second branches.
S6013. Concatenate the second feature maps of the multiple second branches and fuse the result with the sampled feature map to obtain the second concatenated feature map.
The process of obtaining the second concatenated feature map after the residual connections are added is the same as that of obtaining the first concatenated feature map with the residual structure and is likewise not repeated here.
In step S6, feature extraction is performed on the sampled feature map through multiple block modules to obtain the feature map of the image; specifically, in this implementation, the second concatenated feature map is passed through the multiple block modules to obtain the feature map of the image.
Since a lightweight network has a weak feature extraction capability and cannot retain a large number of channel features, the target detection method of this embodiment preferably further adds a multi-scale feature fusion module to the ShuffleNetV2 network. This module fuses the features output by stages 3 and 4, combining low-resolution information with high-resolution information and effectively supplementing the global context information among the multi-scale feature maps.
Referring to Fig. 4, the process of applying the multi-scale feature fusion module in the target detection method of this embodiment is described below. Step S6 includes:
S61. Pass the sampled feature map through the multiple block modules to obtain, in sequence, a first-scale feature map, a second-scale feature map, and a third-scale feature map.
S62. Downsample the third-scale feature map to obtain a fourth-scale feature map.
S63. Upsample the third-scale feature map and the fourth-scale feature map respectively to obtain a first upsampled feature map and a second upsampled feature map.
S64. Fuse the second-scale feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map F_mfm of the image.
The first-scale feature map in this embodiment is the output of stage 2, i.e., the feature map obtained by passing the second concatenated feature map through one block module with stride 2 and three block modules with stride 1. The second-scale feature map is the output of stage 3, obtained by passing the first-scale feature map through one block module with stride 2 and seven block modules with stride 1. The third-scale feature map is the output of stage 4, obtained by passing the second-scale feature map through one block module with stride 2 and three block modules with stride 1.
When the detection result is a classification, high-resolution information has a greater influence on the classification than low-resolution information; therefore, in order to retain more high-resolution information, the fourth-scale feature map obtained by downsampling the third-scale feature map in step S62 carries more high-resolution information. Preferably, in order to further reduce computation, this embodiment downsamples the third-scale feature map with a depthwise separable convolution (3×3 DW Conv) whose kernel size is 3×3.
To keep the data dimensions consistent, in step S63 the third-scale feature map and the fourth-scale feature map need to be upsampled so that the data dimensions of the resulting first upsampled feature map and second upsampled feature map match those of the second-scale feature map. Preferably, this embodiment upsamples the third-scale and fourth-scale feature maps by bilinear interpolation, implemented together with a convolutional layer with a 1×1 kernel (1×1 Conv).
Preferably, to further ensure dimensional consistency, in step S63 the second-scale feature map is also dimension-adjusted through a convolutional layer with a 1×1 kernel (1×1 Conv) to obtain a dimension-raised feature map whose data dimensions match those of the first and second upsampled feature maps. Correspondingly, in step S64, fusing the second-scale feature map, the first upsampled feature map, and the second upsampled feature map specifically means fusing the dimension-raised feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map F_mfm of the image.
Since low-resolution information, i.e., shallow feature information, contributes relatively little when the detection result is a classification, and its large data volume would greatly increase computation, this embodiment balances the computation cost against the influence on the detection result: in step S64, only the second-scale feature map and the first upsampled feature map are selected as low-resolution information, and the second upsampled feature map is selected as high-resolution information. Fusing them yields the feature map of the image, combining low-resolution and high-resolution information, effectively supplementing the global context information among the multi-scale feature maps, and avoiding information loss. It should be noted that only three levels of feature information are fused in this embodiment, and only the second upsampled feature map is selected as high-resolution information; in practice, depending on the computation budget or the influence on the detection result, the fourth-scale feature map can be further downsampled to obtain more high-resolution information, and more levels of feature information can be selected for fusion.
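A minimal PyTorch sketch of this fusion path is given below. It assumes ShuffleNetV2-1× channel widths (232 for stage 3, 464 for stage 4), an arbitrary common output width, and element-wise addition as the fusion operator; none of these choices are fixed by the embodiment and they are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Sketch of the multi-scale feature fusion module (Fig. 4).
    Channel widths and the additive fusion are illustrative assumptions."""

    def __init__(self, c3=232, c4=464, c_out=256):
        super().__init__()
        # Step S62: depthwise separable 3x3 downsampling of the third-scale map.
        self.down = nn.Sequential(
            nn.Conv2d(c4, c4, 3, stride=2, padding=1, groups=c4),  # depthwise
            nn.Conv2d(c4, c4, 1))                                  # pointwise
        # 1x1 convs that bring all maps to a common width before fusion.
        self.lift2 = nn.Conv2d(c3, c_out, 1)   # dimension-raised second-scale map
        self.lift3 = nn.Conv2d(c4, c_out, 1)   # feeds the first upsampled map
        self.lift4 = nn.Conv2d(c4, c_out, 1)   # feeds the second upsampled map

    def forward(self, f_stage3, f_stage4):
        f_scale4 = self.down(f_stage4)               # fourth-scale feature map
        size = f_stage3.shape[-2:]
        # Step S63: bilinear upsampling to the second-scale resolution.
        up1 = F.interpolate(self.lift3(f_stage4), size=size,
                            mode='bilinear', align_corners=False)
        up2 = F.interpolate(self.lift4(f_scale4), size=size,
                            mode='bilinear', align_corners=False)
        # Step S64: fuse the three maps into F_mfm.
        return self.lift2(f_stage3) + up1 + up2
```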
Referring to Fig. 5, the detection network of the target detection method in this embodiment is an improvement on an existing lightweight detection network, specifically on the existing Light-Head R-CNN network. The Light-Head R-CNN network includes an RPN, a position-sensitive ROI pooling (PSROI) layer, and a fully connected layer; the detection network in this embodiment adds a channel self-attention network on this basis.
Specifically, the RPN includes, cascaded in sequence, a fourth convolutional layer (DW Conv), a fifth convolutional layer (1×1 Conv), and a candidate region extraction module (ROIs). It should be noted that the Light-Head R-CNN network is used as an example and not as a limitation; the channel self-attention network may also be added to other lightweight detection networks to form the detection network of this embodiment.
Specifically, step S7 includes:
S71. Pass the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and candidate boxes containing the target to be detected.
S72. Generate a channel attention feature map according to the feature map of the image in the RPN network.
S73. Fuse the channel attention feature map with the feature map of the image to obtain a fused feature map.
S74. Obtain the feature map of the target to be detected according to the candidate boxes and the fused feature map.
S75. Obtain the detection result of the target to be detected according to the feature map of the target to be detected.
In step S71, the feature map of the image is passed in sequence through the fourth convolutional layer (DW Conv) and the fifth convolutional layer (1×1 Conv), whose kernel size is 1×1, to obtain the feature map of the image in the RPN network. Preferably, to further reduce computation, this embodiment uses a depthwise separable convolution in the fourth convolutional layer (DW Conv) to convolve the feature map of the image. The feature map of the image in the RPN network is then passed through the candidate region extraction module (ROIs) to obtain the candidate boxes containing the target to be detected.
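In PyTorch, the depthwise stage of such a separable convolution is expressed by setting `groups` equal to the channel count. The snippet below is a sketch of the RPN stem just described; the channel widths (464 in, 256 out) are assumptions made here, not values fixed by this embodiment.

```python
import torch.nn as nn

# Fourth convolutional layer (DW Conv): depthwise 3x3, one filter per channel.
# Fifth convolutional layer (1x1 Conv): pointwise projection to a thinner map.
rpn_stem = nn.Sequential(
    nn.Conv2d(464, 464, kernel_size=3, padding=1, groups=464),
    nn.Conv2d(464, 256, kernel_size=1),
)
```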
In order to address the weak feature extraction capability of a lightweight network and the loss of spatial information around the detection region, this embodiment adds a channel self-attention network to the existing Light-Head R-CNN network. The channel self-attention network optimizes the feature distribution of the feature map input to the PSROI (position-sensitive ROI pooling) layer, so that the output feature map attends more to detection-relevant regions, improving the accuracy of the detection result.
Specifically, step S72 includes:
S721. Split the feature map of the image in the RPN network into a first sub-feature map and a second sub-feature map, the first sub-feature map and the second sub-feature map having equal numbers of channels.
S722. Obtain a channel attention weight according to the first sub-feature map.
S723. Multiply the channel attention weight by the second sub-feature map to obtain the channel attention feature map.
Referring to Fig. 6, the channel self-attention network in this embodiment includes a first splitting module and a channel attention weight acquisition module. The first splitting module splits the feature map F_rpn of the image in the RPN network into a first sub-feature map F_1 and a second sub-feature map F_2 with equal numbers of channels. The splitting here divides the channels evenly: for example, if the feature map of the image in the RPN network has 8 channels, the data of channels 1 to 4 form the first sub-feature map F_1 and the data of channels 5 to 8 form the second sub-feature map F_2.
The first sub-feature map F_1 is input into the channel attention weight acquisition module, which produces the channel attention weight K. The channel attention weight acquisition module includes a second splitting module, a grouped convolutional layer (Group Conv), a depthwise separable convolutional layer (DW Conv), a softmax layer, a third pooling layer (Avg pool), and a sixth convolutional layer (1×1 Conv).
The second splitting module splits the first sub-feature map F_1 into a third sub-feature map F_3 and a fourth sub-feature map F_4 with equal numbers of channels, again by dividing the channels evenly. Continuing the example of an 8-channel feature map in the RPN network: after the first splitting module, the first sub-feature map F_1 has 4 channels, so the data of channels 1 to 2 form the third sub-feature map F_3 and the data of channels 3 to 4 form the fourth sub-feature map F_4.
The third sub-feature map F_3 and the fourth sub-feature map F_4 are input into the grouped convolutional layer (Group Conv) and the depthwise separable convolutional layer (DW Conv), respectively, for convolution. The outputs of the two layers are fused and then processed in sequence by the softmax layer, the third pooling layer (Avg pool), and the sixth convolutional layer (1×1 Conv) to obtain the channel attention weight K, where the third pooling layer uses average pooling and the sixth convolutional layer raises the dimensionality so that the dimensions of K match those of the second sub-feature map F_2.
After the channel attention weight K is obtained, it is multiplied by the second sub-feature map F_2 to obtain the channel attention feature map.
In step S73, the channel self-attention network finally fuses the channel attention feature map with the feature map of the image to obtain the fused feature map. The channel self-attention network in this embodiment combines channel splitting with the self-attention mechanism: channel splitting lets the channels exchange information with one another while significantly reducing the complexity of the network structure and thus the number of parameters, and the self-attention mechanism suppresses background features and highlights foreground features. In addition, fusing the channel attention feature map with the feature map of the image expands the field of view at each spatial position and enriches the output features.
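The pipeline of steps S721 to S723 can be sketched as follows in PyTorch. All widths, the group count, the softmax axis, and the final fusion operator are assumptions made for illustration; the embodiment fixes only the split/conv/softmax/pool/1×1 ordering described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Sketch of the channel self-attention network (Fig. 6)."""

    def __init__(self, ch=256, groups=4):
        super().__init__()
        q = ch // 4                                   # width of F3 and F4
        self.group_conv = nn.Conv2d(q, q, 3, padding=1, groups=groups)
        self.dw_conv = nn.Conv2d(q, q, 3, padding=1, groups=q)
        self.pool = nn.AdaptiveAvgPool2d(1)           # third pooling layer (Avg pool)
        self.lift = nn.Conv2d(q, ch // 2, 1)          # sixth conv layer: match F2's width

    def forward(self, f_rpn):
        f1, f2 = torch.chunk(f_rpn, 2, dim=1)         # S721: first split, equal channels
        f3, f4 = torch.chunk(f1, 2, dim=1)            # second split
        a = self.group_conv(f3) + self.dw_conv(f4)    # fuse the two conv outputs
        a = F.softmax(a.flatten(2), dim=-1).view_as(a)  # softmax (axis is an assumption)
        k = self.lift(self.pool(a))                   # channel attention weight K
        attn = k * f2                                 # S723: channel attention feature map
        # S73: fuse with the input feature map; concatenation restoring the
        # original width is one plausible choice, not fixed by the embodiment.
        return torch.cat([f1, attn], dim=1)
```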
In steps S74 to S75 of this embodiment, the candidate boxes are mapped onto the fused feature map through the PSROI (position-sensitive ROI pooling) layer, the feature map of the target to be detected is extracted from the fused feature map according to the candidate boxes, and the detection result is obtained by passing the feature map of the target to be detected through the fully connected layer. The fully connected layer produces category probabilities, according to which classification is performed (the classification result), and position offset information, according to which the location of the target is obtained (the regression result).
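As one way to make steps S74 and S75 concrete, torchvision ships a position-sensitive ROI pooling operator. The sketch below assumes the 490-channel (10 × 7 × 7) thin feature map used by Light-Head R-CNN, a PASCAL-VOC-style class count, and an arbitrary hidden width; none of these values are fixed by this embodiment.

```python
import torch
import torch.nn as nn
from torchvision.ops import PSRoIPool

num_classes = 21                                   # e.g. 20 VOC classes + background
ps_roi_pool = PSRoIPool(output_size=7, spatial_scale=1 / 16)
fc = nn.Linear(10 * 7 * 7, 2048)
cls_head = nn.Linear(2048, num_classes)            # category probabilities (via softmax)
box_head = nn.Linear(2048, 4 * num_classes)        # position offset information

fused = torch.randn(1, 490, 20, 20)                # fused feature map (assumed shape)
rois = torch.tensor([[0., 32., 32., 160., 160.]])  # (batch index, x1, y1, x2, y2)
pooled = ps_roi_pool(fused, rois)                  # S74: per-box feature map, (1, 10, 7, 7)
h = torch.relu(fc(pooled.flatten(1)))              # S75: fully connected layer
scores, deltas = cls_head(h), box_head(h)
```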
The target detection method in this embodiment is mainly applied to mobile terminal devices. Before the target detection algorithm is deployed to a mobile terminal, the network model constructed according to the target detection method of this embodiment is first trained on a server with training data; after training, the model is evaluated with evaluation data to obtain the best-performing network model, which is finally deployed to the mobile terminal through the ONNX toolchain to implement the target detection algorithm, detect real data, and visualize the detection results.
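A deployment step of this kind is commonly expressed with PyTorch's ONNX exporter. The sketch below is illustrative only: the file name and opset version are chosen here, and the 320×320 input resolution matches the experiments that follow rather than being fixed by the embodiment.

```python
import torch

def export_to_onnx(model: torch.nn.Module, path: str = "detector.onnx") -> None:
    """Export the trained detector so it can be deployed to a mobile terminal."""
    model.eval()
    dummy = torch.randn(1, 3, 320, 320)   # assumed input resolution
    torch.onnx.export(model, dummy, path, opset_version=11, input_names=["image"])
```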
The target detection method in this embodiment was validated on the public PASCAL VOC dataset. The experimental results show that it achieves an accuracy of 70.6 mAP with only 528 MFLOPs, striking a good balance between accuracy and model complexity.
The validation results of the target detection method of this embodiment on the public PASCAL VOC dataset are described in detail below.
Images were scaled to 320×320 as input, and the network model constructed according to the target detection method of this embodiment was trained on an NVIDIA TITAN RTX with 24 GB of RAM. In the training stage, a stochastic gradient optimizer was used with a learning rate of 0.0001 and a weight decay of 0.001. All data were randomly divided into a training set (60%), a validation set (20%), and a test set (20%), so that the data in the training, validation, and test stages have similar distributions. Millions of floating-point operations (MFLOPs) are used to measure the complexity and efficiency of the lightweight network model, and model performance is evaluated by mean average precision (mAP). With the same training parameters, PASCAL VOC data were detected with different methods; the MFLOPs and mAP results of the different methods are shown in Table 3.
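The training configuration just described translates directly into a few lines of PyTorch. The helper names below are placeholders, and momentum is left at its default since the embodiment does not specify it.

```python
import torch
from torch.utils.data import random_split

def make_splits(dataset):
    """Random 60/20/20 split so the three stages see similar distributions."""
    n = len(dataset)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return random_split(dataset, [n_train, n_val, n - n_train - n_val])

def make_optimizer(model):
    """Stochastic gradient optimizer with the reported hyper-parameters."""
    return torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-3)
```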
Table 3  Comparison of the results of different methods on the PASCAL VOC dataset (the full table is reproduced as an image in the published application; as noted above, the proposed model achieves 70.6 mAP at 528 MFLOPs)
Compared with most state-of-the-art models based on large target detectors (e.g., YOLOv2, SSD300, SSD321, R-FCN), the target detection method of this embodiment ("our model" in Table 3) has a strong advantage in model complexity and is therefore better suited to the requirements of mobile terminal devices.
Comparing the target detection method of this embodiment with existing lightweight detection algorithms shows that it ("our model" in Table 3) requires far fewer MFLOPs than Tiny-YOLO, D-YOLO, and MobileNet-SSD while achieving higher accuracy than all three. Compared with Pelee, the method of this embodiment yields similar accuracy at only half the model complexity. The target detection method of this embodiment thus strikes a good balance between accuracy and model complexity.
Referring to Fig. 7, this embodiment further provides a target detection apparatus corresponding to the above target detection method. The target detection apparatus includes an acquisition module 1, a dimension reduction module 2, a compression module 3, a concatenation module 4, a sampling module 5, a feature extraction module 6, and a detection module 7.
Specifically, the acquisition module 1 acquires the image of the target to be detected; the dimension reduction module 2 reduces the dimensionality of the image through the first convolutional layer to obtain the dimension-reduced image; the compression module 3 compresses the dimension-reduced image through the multiple second convolutional layers to obtain the multiple first branches with the same number of channels; the concatenation module 4 extracts the first feature maps of the multiple first branches, which increase sequentially in depth, and concatenates them to obtain the first concatenated feature map; the sampling module 5 downsamples the first concatenated feature map through the first pooling layer to obtain the sampled feature map; the feature extraction module 6 passes the sampled feature map through the multiple block modules for feature extraction to obtain the feature map of the image; and the detection module 7 performs detection according to the feature map of the image to obtain the detection result of the target to be detected.
The concatenation module 4 in this embodiment is further configured to take the output of the previous first branch as the residual part of the next first branch and fuse the features of the next first branch at the same depth with the residual part to obtain the fused cross-branch feature map of that first branch; to perform feature extraction on the fused cross-branch feature map through the third convolutional layer to obtain the first feature maps of the multiple first branches; and to concatenate the first feature maps of the multiple first branches and fuse the result with the dimension-reduced image to obtain the first concatenated feature map.
The compression module 3 is further configured to compress the sampled feature map through the multiple second convolutional layers to obtain the multiple second branches with the same number of channels, and the concatenation module 4 is further configured to extract the second feature maps of the multiple second branches, which increase sequentially in depth, and concatenate them to obtain the second concatenated feature map.
The concatenation module 4 in this embodiment is further configured to take the output of the previous second branch as the residual part of the next second branch and fuse the features of the next second branch at the same depth with the residual part to obtain the fused cross-branch feature maps of the second branches; to perform feature extraction on the fused cross-branch feature maps through the third convolutional layer to obtain the second feature maps of the multiple second branches; and to concatenate the second feature maps of the multiple second branches and fuse the result with the sampled feature map to obtain the second concatenated feature map.
The feature extraction module 6 is further configured to pass the second concatenated feature map through the multiple block modules for feature extraction to obtain the feature map of the image. Specifically, the feature extraction module 6 passes the sampled feature map or the second concatenated feature map through the multiple block modules to obtain, in sequence, the first-scale feature map, the second-scale feature map, and the third-scale feature map; downsamples the third-scale feature map to obtain the fourth-scale feature map; upsamples the third-scale feature map and the fourth-scale feature map respectively to obtain the first upsampled feature map and the second upsampled feature map; and fuses the second-scale feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map of the image.
The detection module 7 in this embodiment is specifically configured to pass the feature map of the image through the RPN network to obtain the feature map of the image in the RPN network and the candidate boxes containing the target to be detected; to generate the channel attention feature map according to the feature map of the image in the RPN network; to fuse the channel attention feature map with the feature map of the image to obtain the fused feature map; to obtain the feature map of the target to be detected according to the candidate boxes and the fused feature map; and to obtain the detection result of the target to be detected according to the feature map of the target to be detected.
The detection module 7 is further configured to split the feature map of the image in the RPN network into the first sub-feature map and the second sub-feature map with equal numbers of channels, to obtain the channel attention weight according to the first sub-feature map, and to multiply the channel attention weight by the second sub-feature map to obtain the channel attention feature map.
Referring to Fig. 8, this embodiment provides a device including a memory 100, a processor 200, and a network interface 202. A computer program is stored in the memory 100, and the processor 200 executes the computer program to implement the target detection method of this embodiment.
The memory 100 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, for example at least one disk memory.
The processor 200 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the target detection method of this embodiment may be completed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may also be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP), or it may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 100 stores the computer program, and upon receiving an execution instruction the processor 200 executes the computer program to implement the target detection method of this embodiment.
This embodiment further provides a computer storage medium 201 in which a computer program is stored; the processor 200 reads and executes the computer program stored in the computer storage medium 201 to implement the target detection method of this embodiment.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组 合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机存储介质中,或者从一个计算机存储介质向另一个计算机存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer storage medium to another computer storage medium, for example, from a website site, computer, server, or data center over a wired (e.g., coaxial) cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.) to another website site, computer, server, or data center. The computer storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, etc. that includes an integration of one or more available media. The usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.
The embodiments of the present invention are described with reference to the flowcharts and/or block diagrams of the method, device, and computer program product according to the embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device thereby provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
The foregoing descriptions are merely specific implementations of the present application. It should be noted that a person of ordinary skill in the art may further make several improvements and refinements without departing from the principles of the present application, and such improvements and refinements shall also fall within the protection scope of the present application.

Claims (20)

  1. A lightweight target detection method, wherein the target detection method comprises:
    obtaining an image of a target to be detected;
    reducing the dimension of the image through a first convolution layer to obtain a dimension-reduced image;
    compressing the dimension-reduced image through a plurality of second convolution layers, respectively, to obtain a plurality of first branches, the plurality of first branches having the same number of channels;
    extracting first feature maps of the plurality of first branches, respectively, and concatenating the first feature maps of the plurality of first branches to obtain a first concatenated feature map, the first feature maps of the plurality of first branches increasing successively in depth;
    downsampling the first concatenated feature map through a first pooling layer to obtain a sampled feature map;
    performing feature extraction on the sampled feature map through a plurality of block modules to obtain a feature map of the image;
    performing detection according to the feature map of the image to obtain a detection result of the target to be detected.
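As an orientation aid, the stem that claim 1 describes can be sketched in code. The following PyTorch sketch is one illustrative reading of the claim, not the patented implementation: the kernel sizes, strides, channel counts, and the choice of three branches are all assumptions, and stacking one extra 3×3 convolution per branch is merely one way to make the branch feature maps "increase successively in depth".

```python
import torch
import torch.nn as nn

class LightweightStem(nn.Module):
    """Illustrative stem: dimension reduction, parallel equal-channel
    branches of increasing depth, concatenation, then pooling."""
    def __init__(self, in_ch=3, reduced_ch=32, branch_ch=16, num_branches=3):
        super().__init__()
        # First convolution layer: reduce the dimension of the input image.
        self.reduce = nn.Conv2d(in_ch, reduced_ch, 3, stride=2, padding=1)
        # Second convolution layers: compress to the same channel count.
        self.compress = nn.ModuleList(
            [nn.Conv2d(reduced_ch, branch_ch, 1) for _ in range(num_branches)])
        # Branch i stacks i+1 convolutions, so depth increases branch by branch.
        self.branches = nn.ModuleList([
            nn.Sequential(*[nn.Conv2d(branch_ch, branch_ch, 3, padding=1)
                            for _ in range(i + 1)])
            for i in range(num_branches)])
        # First pooling layer: downsample the concatenated feature map.
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.reduce(x)                    # dimension-reduced image
        feats = [b(c(x)) for c, b in zip(self.compress, self.branches)]
        concat = torch.cat(feats, dim=1)      # first concatenated feature map
        return self.pool(concat)              # sampled feature map

stem = LightweightStem()
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 48, 56, 56])
```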
  2. The target detection method according to claim 1, wherein before performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image, the target detection method further comprises:
    compressing the sampled feature map through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, the plurality of second branches having the same number of channels;
    extracting second feature maps of the plurality of second branches, respectively, and concatenating the second feature maps of the plurality of second branches to obtain a second concatenated feature map, the second feature maps of the plurality of second branches increasing successively in depth;
    correspondingly, performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image comprises:
    performing feature extraction on the second concatenated feature map through the plurality of block modules to obtain the feature map of the image.
  3. The target detection method according to claim 2, wherein for the first branch/second branch with the smallest depth, compressing the dimension-reduced image/sampled feature map through the second convolution layer comprises:
    passing the dimension-reduced image/sampled feature map sequentially through a second pooling layer and a second convolution layer for pooling and compression, respectively.
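A minimal sketch of this ordering for the smallest-depth branch follows. Since the claim does not fix the pooling parameters, a stride-1 pool is assumed here purely so that this branch keeps the spatial size of its siblings for concatenation.

```python
import torch
import torch.nn as nn

# Assumed sketch of claim 3: the smallest-depth branch is pooled first,
# then compressed; the stride-1, padded pool is an assumption that keeps
# the spatial size unchanged.
smallest_depth_branch = nn.Sequential(
    nn.MaxPool2d(kernel_size=3, stride=1, padding=1),  # second pooling layer
    nn.Conv2d(32, 16, kernel_size=1),                  # second convolution layer
)
print(smallest_depth_branch(torch.randn(1, 32, 56, 56)).shape)  # [1, 16, 56, 56]
```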
  4. The target detection method according to claim 3, wherein extracting the first feature maps of the plurality of first branches/the second feature maps of the plurality of second branches, respectively, and concatenating the first feature maps of the plurality of first branches/the second feature maps of the plurality of second branches to obtain the first concatenated feature map/second concatenated feature map comprises:
    taking the output of a previous first branch/second branch as a residual part of a next first branch/second branch, and fusing the features of the next first branch/second branch that have the same depth as the residual part with the residual part, to obtain fused cross-branch feature maps of the plurality of first branches/second branches;
    performing feature extraction on the fused cross-branch feature maps of the plurality of first branches/second branches through a third convolution layer to obtain the first feature maps of the plurality of first branches/the second feature maps of the plurality of second branches, respectively;
    concatenating the first feature maps of the plurality of first branches and fusing the result with the dimension-reduced image/concatenating the second feature maps of the plurality of second branches and fusing the result with the sampled feature map, to obtain the first concatenated feature map/second concatenated feature map.
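One plausible, non-authoritative realization of this cross-branch scheme is sketched below. Element-wise addition is assumed as the fusion operation, and a hypothetical 1×1 projection is added so the concatenated branch features can be fused back onto the branch input.

```python
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    """Illustrative cross-branch fusion: each branch receives the previous
    branch's output as a residual before its own feature extraction."""
    def __init__(self, in_ch=32, branch_ch=16, num_branches=3):
        super().__init__()
        self.compress = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 1) for _ in range(num_branches)])
        # Third convolution layers: per-branch extraction after fusion.
        self.extract = nn.ModuleList(
            [nn.Conv2d(branch_ch, branch_ch, 3, padding=1)
             for _ in range(num_branches)])
        # Assumed 1x1 projection so the concatenation can be added back
        # onto the branch input ("fused with the dimension-reduced image").
        self.project = nn.Conv2d(branch_ch * num_branches, in_ch, 1)

    def forward(self, x):
        feats, prev = [], None
        for comp, ext in zip(self.compress, self.extract):
            f = comp(x)
            if prev is not None:   # previous branch output as the residual part
                f = f + prev       # fuse same-depth features (addition assumed)
            f = ext(f)             # third convolution: the branch feature map
            feats.append(f)
            prev = f
        concat = torch.cat(feats, dim=1)
        return x + self.project(concat)  # fuse with the branch input

fusion = CrossBranchFusion()
print(fusion(torch.randn(1, 32, 112, 112)).shape)  # torch.Size([1, 32, 112, 112])
```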
  5. The target detection method according to claim 1, wherein performing feature extraction on the sampled feature map/second concatenated feature map through the plurality of block modules to obtain the feature map of the image comprises:
    passing the sampled feature map/second concatenated feature map through the plurality of block modules to obtain a first-scale feature map, a second-scale feature map, and a third-scale feature map in sequence;
    downsampling the third-scale feature map to obtain a fourth-scale feature map;
    upsampling the third-scale feature map and the fourth-scale feature map, respectively, to obtain a first upsampled feature map and a second upsampled feature map;
    fusing the second-scale feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map of the image.
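This stage reads like a compact feature pyramid. Below is a sketch under assumed stride-2 block modules (the claim leaves the block internals open) and an assumed element-wise-sum fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Illustrative multi-scale stage: three block outputs at decreasing
    resolution, a fourth scale by downsampling, then upsample-and-fuse."""
    def __init__(self, ch=48):
        super().__init__()
        # Stand-ins for the "block modules"; their internals are not
        # specified here, so plain stride-2 convolutions are assumed.
        self.block1 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.block2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.block3 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)

    def forward(self, x):
        s1 = self.block1(x)                  # first-scale feature map
        s2 = self.block2(s1)                 # second-scale feature map
        s3 = self.block3(s2)                 # third-scale feature map
        s4 = F.max_pool2d(s3, 2)             # fourth-scale feature map
        up3 = F.interpolate(s3, size=s2.shape[-2:])   # first upsampled map
        up4 = F.interpolate(s4, size=s2.shape[-2:])   # second upsampled map
        return s2 + up3 + up4                # fused feature map of the image

stage = MultiScaleFusion()
print(stage(torch.randn(1, 48, 112, 112)).shape)  # torch.Size([1, 48, 28, 28])
```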
  6. The target detection method according to claim 2, wherein performing feature extraction on the sampled feature map/second concatenated feature map through the plurality of block modules to obtain the feature map of the image comprises:
    passing the sampled feature map/second concatenated feature map through the plurality of block modules to obtain a first-scale feature map, a second-scale feature map, and a third-scale feature map in sequence;
    downsampling the third-scale feature map to obtain a fourth-scale feature map;
    upsampling the third-scale feature map and the fourth-scale feature map, respectively, to obtain a first upsampled feature map and a second upsampled feature map;
    fusing the second-scale feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map of the image.
  7. The target detection method according to claim 5, wherein performing detection according to the feature map of the image to obtain the detection result of the target to be detected comprises:
    passing the feature map of the image through a region proposal network (RPN) to obtain a feature map of the image in the RPN and candidate boxes containing the target to be detected;
    generating a channel attention feature map according to the feature map of the image in the RPN;
    fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;
    obtaining a feature map of the target to be detected according to the candidate boxes and the fused feature map;
    obtaining the detection result of the target to be detected according to the feature map of the target to be detected.
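The flow of this claim can be traced with placeholder components. In the sketch below, a single convolution stands in for the RPN trunk, the candidate boxes are supplied externally, fusion by channel concatenation is an assumption, and torchvision's roi_align plays the role of cropping the per-target feature map from the fused map.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class DetectionHead(nn.Module):
    """Schematic flow for claim 7: RPN feature map -> channel attention ->
    fusion with the image feature map -> per-box feature extraction."""
    def __init__(self, ch=48, attention=None):
        super().__init__()
        self.rpn_conv = nn.Conv2d(ch, ch, 3, padding=1)  # stand-in RPN trunk
        self.attention = attention or nn.Identity()      # see claim 9 sketch

    def forward(self, img_feat, boxes):
        rpn_feat = self.rpn_conv(img_feat)         # feature map in the RPN
        att = self.attention(rpn_feat)             # channel attention feature map
        fused = torch.cat([img_feat, att], dim=1)  # fusion by concat (assumed)
        # Candidate boxes crop the per-target feature map from the fused map.
        return roi_align(fused, boxes, output_size=7, spatial_scale=1.0)

head = DetectionHead()
boxes = [torch.tensor([[1.0, 1.0, 8.0, 8.0]])]     # one dummy candidate box
print(head(torch.randn(1, 48, 14, 14), boxes).shape)  # torch.Size([1, 96, 7, 7])
```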
  8. The target detection method according to claim 6, wherein performing detection according to the feature map of the image to obtain the detection result of the target to be detected comprises:
    passing the feature map of the image through a region proposal network (RPN) to obtain a feature map of the image in the RPN and candidate boxes containing the target to be detected;
    generating a channel attention feature map according to the feature map of the image in the RPN;
    fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;
    obtaining a feature map of the target to be detected according to the candidate boxes and the fused feature map;
    obtaining the detection result of the target to be detected according to the feature map of the target to be detected.
  9. The target detection method according to claim 7, wherein generating the channel attention feature map according to the feature map of the image in the RPN comprises:
    splitting the feature map of the image in the RPN into a first sub-feature map and a second sub-feature map, the first sub-feature map and the second sub-feature map having an equal number of channels;
    obtaining channel attention weights according to the first sub-feature map;
    multiplying the channel attention weights by the second sub-feature map to obtain the channel attention feature map.
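Claim 9 fixes the split-and-scale structure but not the weight function; the sketch below assumes global average pooling followed by a sigmoid for the weights.

```python
import torch
import torch.nn as nn

class SplitChannelAttention(nn.Module):
    """Claim 9 as a sketch: split the channels into two equal halves, derive
    weights from the first half, and scale the second half by those weights.
    The pooling + sigmoid weight function is an assumption."""
    def forward(self, x):
        first, second = torch.chunk(x, 2, dim=1)   # equal channel counts
        weights = torch.sigmoid(first.mean(dim=(2, 3), keepdim=True))
        return weights * second                    # channel attention feature map

att = SplitChannelAttention()
print(att(torch.randn(1, 48, 14, 14)).shape)  # torch.Size([1, 24, 14, 14])
```

This module could serve as the attention argument of the DetectionHead sketch above; note that it halves the channel count, which the fusion step would need to account for.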
  10. The target detection method according to claim 8, wherein generating the channel attention feature map according to the feature map of the image in the RPN comprises:
    splitting the feature map of the image in the RPN into a first sub-feature map and a second sub-feature map, the first sub-feature map and the second sub-feature map having an equal number of channels;
    obtaining channel attention weights according to the first sub-feature map;
    multiplying the channel attention weights by the second sub-feature map to obtain the channel attention feature map.
  11. A device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement a lightweight target detection method, the lightweight target detection method comprising:
    obtaining an image of a target to be detected;
    reducing the dimension of the image through a first convolution layer to obtain a dimension-reduced image;
    compressing the dimension-reduced image through a plurality of second convolution layers, respectively, to obtain a plurality of first branches, the plurality of first branches having the same number of channels;
    extracting first feature maps of the plurality of first branches, respectively, and concatenating the first feature maps of the plurality of first branches to obtain a first concatenated feature map, the first feature maps of the plurality of first branches increasing successively in depth;
    downsampling the first concatenated feature map through a first pooling layer to obtain a sampled feature map;
    performing feature extraction on the sampled feature map through a plurality of block modules to obtain a feature map of the image;
    performing detection according to the feature map of the image to obtain a detection result of the target to be detected.
  12. The device according to claim 11, wherein before performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image, the target detection method further comprises:
    compressing the sampled feature map through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, the plurality of second branches having the same number of channels;
    extracting second feature maps of the plurality of second branches, respectively, and concatenating the second feature maps of the plurality of second branches to obtain a second concatenated feature map, the second feature maps of the plurality of second branches increasing successively in depth;
    correspondingly, performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image comprises:
    performing feature extraction on the second concatenated feature map through the plurality of block modules to obtain the feature map of the image.
  13. The device according to claim 12, wherein for the first branch/second branch with the smallest depth, compressing the dimension-reduced image/sampled feature map through the second convolution layer comprises:
    passing the dimension-reduced image/sampled feature map sequentially through a second pooling layer and a second convolution layer for pooling and compression, respectively.
  14. The device according to claim 13, wherein extracting the first feature maps of the plurality of first branches/the second feature maps of the plurality of second branches, respectively, and concatenating the first feature maps of the plurality of first branches/the second feature maps of the plurality of second branches to obtain the first concatenated feature map/second concatenated feature map comprises:
    taking the output of a previous first branch/second branch as a residual part of a next first branch/second branch, and fusing the features of the next first branch/second branch that have the same depth as the residual part with the residual part, to obtain fused cross-branch feature maps of the plurality of first branches/second branches;
    performing feature extraction on the fused cross-branch feature maps of the plurality of first branches/second branches through a third convolution layer to obtain the first feature maps of the plurality of first branches/the second feature maps of the plurality of second branches, respectively;
    concatenating the first feature maps of the plurality of first branches and fusing the result with the dimension-reduced image/concatenating the second feature maps of the plurality of second branches and fusing the result with the sampled feature map, to obtain the first concatenated feature map/second concatenated feature map.
  15. The device according to claim 12, wherein performing feature extraction on the sampled feature map/second concatenated feature map through the plurality of block modules to obtain the feature map of the image comprises:
    passing the sampled feature map/second concatenated feature map through the plurality of block modules to obtain a first-scale feature map, a second-scale feature map, and a third-scale feature map in sequence;
    downsampling the third-scale feature map to obtain a fourth-scale feature map;
    upsampling the third-scale feature map and the fourth-scale feature map, respectively, to obtain a first upsampled feature map and a second upsampled feature map;
    fusing the second-scale feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map of the image.
  16. The device according to claim 15, wherein performing detection according to the feature map of the image to obtain the detection result of the target to be detected comprises:
    passing the feature map of the image through a region proposal network (RPN) to obtain a feature map of the image in the RPN and candidate boxes containing the target to be detected;
    generating a channel attention feature map according to the feature map of the image in the RPN;
    fusing the channel attention feature map with the feature map of the image to obtain a fused feature map;
    obtaining a feature map of the target to be detected according to the candidate boxes and the fused feature map;
    obtaining the detection result of the target to be detected according to the feature map of the target to be detected.
  17. The device according to claim 16, wherein generating the channel attention feature map according to the feature map of the image in the RPN comprises:
    splitting the feature map of the image in the RPN into a first sub-feature map and a second sub-feature map, the first sub-feature map and the second sub-feature map having an equal number of channels;
    obtaining channel attention weights according to the first sub-feature map;
    multiplying the channel attention weights by the second sub-feature map to obtain the channel attention feature map.
  18. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement a lightweight target detection method, the lightweight target detection method comprising:
    obtaining an image of a target to be detected;
    reducing the dimension of the image through a first convolution layer to obtain a dimension-reduced image;
    compressing the dimension-reduced image through a plurality of second convolution layers, respectively, to obtain a plurality of first branches, the plurality of first branches having the same number of channels;
    extracting first feature maps of the plurality of first branches, respectively, and concatenating the first feature maps of the plurality of first branches to obtain a first concatenated feature map, the first feature maps of the plurality of first branches increasing successively in depth;
    downsampling the first concatenated feature map through a first pooling layer to obtain a sampled feature map;
    performing feature extraction on the sampled feature map through a plurality of block modules to obtain a feature map of the image;
    performing detection according to the feature map of the image to obtain a detection result of the target to be detected.
  19. The computer-readable storage medium according to claim 18, wherein before performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image, the target detection method further comprises:
    compressing the sampled feature map through a plurality of second convolution layers, respectively, to obtain a plurality of second branches, the plurality of second branches having the same number of channels;
    extracting second feature maps of the plurality of second branches, respectively, and concatenating the second feature maps of the plurality of second branches to obtain a second concatenated feature map, the second feature maps of the plurality of second branches increasing successively in depth;
    correspondingly, performing feature extraction on the sampled feature map through the plurality of block modules to obtain the feature map of the image comprises:
    performing feature extraction on the second concatenated feature map through the plurality of block modules to obtain the feature map of the image.
  20. The computer-readable storage medium according to claim 19, wherein performing feature extraction on the sampled feature map/second concatenated feature map through the plurality of block modules to obtain the feature map of the image comprises:
    passing the sampled feature map/second concatenated feature map through the plurality of block modules to obtain a first-scale feature map, a second-scale feature map, and a third-scale feature map in sequence;
    downsampling the third-scale feature map to obtain a fourth-scale feature map;
    upsampling the third-scale feature map and the fourth-scale feature map, respectively, to obtain a first upsampled feature map and a second upsampled feature map;
    fusing the second-scale feature map, the first upsampled feature map, and the second upsampled feature map to obtain the feature map of the image.
PCT/CN2021/086476 2021-04-06 2021-04-12 Light-weighted target detection method and device, and storage medium WO2022213395A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110367782.2A CN115187820A (en) 2021-04-06 2021-04-06 Light-weight target detection method, device, equipment and storage medium
CN202110367782.2 2021-04-06

Publications (1)

Publication Number Publication Date
WO2022213395A1 (en)

Family

ID=83511643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086476 WO2022213395A1 (en) 2021-04-06 2021-04-12 Light-weighted target detection method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115187820A (en)
WO (1) WO2022213395A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303715A1 (en) * 2018-03-29 2019-10-03 Qualcomm Incorporated Combining convolution and deconvolution for object detection
CN109034245A (en) * 2018-07-27 2018-12-18 燕山大学 A kind of object detection method merged using characteristic pattern
CN110782430A (en) * 2019-09-29 2020-02-11 郑州金惠计算机系统工程有限公司 Small target detection method and device, electronic equipment and storage medium
CN111461211A (en) * 2020-03-31 2020-07-28 中国科学院计算技术研究所 Feature extraction method for lightweight target detection and corresponding detection method
CN112560732A (en) * 2020-12-22 2021-03-26 电子科技大学中山学院 Multi-scale feature extraction network and feature extraction method thereof

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168259A (en) * 2023-04-26 2023-05-26 厦门微图软件科技有限公司 Automatic defect classification algorithm applied to OLED lighting system
CN116168259B (en) * 2023-04-26 2023-08-08 厦门微图软件科技有限公司 Automatic defect classification method applied to OLED lighting system
CN117095208A (en) * 2023-08-17 2023-11-21 浙江航天润博测控技术有限公司 Lightweight scene classification method for photoelectric pod reconnaissance image
CN117095208B (en) * 2023-08-17 2024-02-27 浙江航天润博测控技术有限公司 Lightweight scene classification method for photoelectric pod reconnaissance image

Also Published As

Publication number Publication date
CN115187820A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
US20210117791A1 (en) Method and apparatus with neural network performing deconvolution
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
WO2022017025A1 (en) Image processing method and apparatus, storage medium, and electronic device
CN110852383B (en) Target detection method and device based on attention mechanism deep learning network
CN111369440B (en) Model training and image super-resolution processing method, device, terminal and storage medium
CN112801169B (en) Camouflage target detection method, system, device and storage medium based on improved YOLO algorithm
WO2022213395A1 (en) Light-weighted target detection method and device, and storage medium
CN108664981A (en) Specific image extracting method and device
CN110223304B (en) Image segmentation method and device based on multipath aggregation and computer-readable storage medium
US11816881B2 (en) Multiple object detection method and apparatus
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN110852330A (en) Behavior identification method based on single stage
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN111178217A (en) Method and equipment for detecting face image
CN112164077A (en) Cell example segmentation method based on bottom-up path enhancement
CN113554084A (en) Vehicle re-identification model compression method and system based on pruning and light-weight convolution
Das et al. Contour-aware residual W-Net for nuclei segmentation
CN116090517A (en) Model training method, object detection device, and readable storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114821096A (en) Image processing method, neural network training method and related equipment
US11694301B2 (en) Learning model architecture for image data semantic segmentation
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
Kaur et al. Deep transfer learning based multiway feature pyramid network for object detection in images
CN114792370A (en) Whole lung image segmentation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21935619

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21935619

Country of ref document: EP

Kind code of ref document: A1