GB2614954A - Object detection method based on attention-enhanced bidirectional feature pyramid network (A-BiFPN) - Google Patents

Object detection method based on attention-enhanced bidirectional feature pyramid network (A-BiFPN)

Info

Publication number
GB2614954A
GB2614954A GB2217717.4A GB202217717A GB2614954A GB 2614954 A GB2614954 A GB 2614954A GB 202217717 A GB202217717 A GB 202217717A GB 2614954 A GB2614954 A GB 2614954A
Authority
GB
United Kingdom
Prior art keywords
attention
features
feature
bifpn
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2217717.4A
Other versions
GB202217717D0 (en)
Inventor
Zhang Huanlong
Zhang Jianwei
Shi Kunfeng
Du Qifan
Zhang Jie
Zhang Xuncai
Han Dongwei
Tian Yangyang
Guo Zhimin
Wang Fengxian
Qiao Jianwei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Publication of GB202217717D0 publication Critical patent/GB202217717D0/en
Publication of GB2614954A publication Critical patent/GB2614954A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/245Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/72Data preparation, e.g. statistical preprocessing of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

Object detection method based on an attention-enhanced bidirectional feature pyramid network (A-BiFPN), comprising: inputting an image to a Visual Geometry Group (VGG) network to obtain four feature layers Pin3-6 of different resolutions; inputting the feature layers to a BiFPN and fusing the features at different dimensions through top-bottom and bottom-top path branches, thereby obtaining fused features Pout3-6 containing rich semantic information and detailed information; processing the fused features with a coordinate attention mechanism (grey box) to obtain attention feature maps Y3-6 (not shown); inputting the attention feature maps into a prediction module (predict) for classification and location; and filtering out redundant prediction boxes through non-maximum suppression (NMS) to obtain a final prediction result. Utilising the above method may improve detection of small objects.

Description

Intellectual Property Office Application No GB2217717.4 RTM Date: 22 May 2023 The following terms are registered trade marks and should be read as such wherever they occur in this document: Intel NVIDIA Python Intellectual Property Office is an operating name of the Patent Office www.gov.uk/ipo
OBJECT DETECTION METHOD BASED ON ATTENTION-ENHANCED
BIDIRECTIONAL FEATURE PYRAMID NETWORK (A-BiFPN)
TECHNICAL FIELD
[0001] The present disclosure relates to the technical field of object detection, and in particular, to an object detection method based on an attention-enhanced bidirectional feature pyramid network (A-BiFPN).
BACKGROUND
[0002] Object detection is a popular computer vision technique that works to identify, locate, and label specific objects in input images. It has been used in various computer vision applications such as face recognition, self-driving, etc. In recent years, owing to the development of Convolutional Neural Networks (CNNs) and hardware computing power, object detection based on deep learning has made significant breakthroughs.
[0003] Although great progress has been made in object detection, small object detection, which is widely needed in practical production, remains an unsolved challenge. This is mainly because small objects take up less space and have limited pixels. In addition, after convolution and pooling are conducted multiple times, feature information of the small object in the feature map is seriously lost, resulting in the failure of a detector to accurately detect the small object. Therefore, Liu et al. proposed a typical pyramid structure in the Single Shot Detector (SSD). This pyramid structure creatively uses shallow features for smaller object detection and deep features for larger object detection. However, shallow features contain rich detailed information, while deep features contain more semantic information. Therefore, the SSD algorithm cannot obtain enough detailed and semantic information of a small object from a single feature mapping, making it difficult to effectively detect the small object. To address this problem, many researchers have focused on developing multi-scale feature fusion to obtain richer feature representations. In addition to multi-scale feature fusion, the attention mechanism is also helpful for small object detection. The attention mechanism can learn to generate different weights according to the ability of different channels and positions to represent the object, and locally enhance the important channels and positions, which is conducive to the location and recognition of small objects.
SUMMARY
[0004] In view of the problems existing in the prior art, the present disclosure provides an object detection method based on an attention-enhanced bidirectional feature pyramid network (A-BiFPN). According to the method, firstly, a BiFPN fuses features at different dimensions such that the output features have rich semantic information and detailed information; and then a coordinate attention mechanism enables the network to pay attention to channels and locations related to objects, thereby improving the small object detection performance of the object detection algorithm.
[0005] The technical solution of the present disclosure is implemented as follows.
[0006] The method includes the following steps: S1: inputting an image to a Visual Geometry Group (VGG) network to obtain features $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ of 4 layers;
[0007] S2: inputting $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ to a BiFPN, and fusing the features at different dimensions through top-bottom and bottom-top path branches so as to obtain features $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ containing rich semantic information and detailed information;
[0008] S3: processing $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ by a coordinate attention mechanism respectively to obtain attention feature maps Y3, Y4, Y5 and Y6;
[0009] S4: putting the attention feature maps Y3, Y4, Y5 and Y6 of the four layers output by the coordinate attention mechanism into a prediction module for classification and location; and
[0010] S5: filtering out redundant prediction boxes through non-maximum suppression (NMS) to obtain a final prediction result.
[0011] In step S2, the features of different layers are subject to weighted fusion as follows:
[0012] fusing the features of different layers in a fast and normalized manner, where the calculation formula for weighted feature fusion is as follows:
[0013] $O = \dfrac{\sum_i w_i \cdot I_i}{\epsilon + \sum_j w_j}$
[0014] where $w_i \ge 0$ is ensured by using a rectified linear unit (ReLU) after each $w_i$, $\epsilon$ is configured to avoid numerical uncertainty with a value of 0.0001, and $I_i$ represents the value of the ith input feature.
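By way of illustration, the fast normalized fusion above can be rendered as a small PyTorch module; the use of PyTorch and the name FastNormalizedFusion are assumptions made for this sketch, since the disclosure does not name a framework:

```python
import torch
import torch.nn as nn


class FastNormalizedFusion(nn.Module):
    """Fast normalized weighted fusion: O = sum_i(w_i * I_i) / (eps + sum_j w_j)."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable w_i
        self.eps = eps                                        # 0.0001, avoids numerical uncertainty

    def forward(self, inputs):
        # inputs: list of feature maps of identical shape
        w = torch.relu(self.weights)          # ReLU keeps every w_i >= 0
        w = w / (self.eps + w.sum())          # fast normalization
        return sum(wi * x for wi, x in zip(w, inputs))
```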
[0015] In step S2, a process of fusing the features of different layers by the BiFPN is performed specifically as follows:
[0016] For example, a calculation process for fusing features at the fifth layer by the top-bottom path branch is as follows:
[0017] $P_5^{td} = \dfrac{w_1 \cdot P_5^{in} + w_2 \cdot F_{up}(P_6^{in})}{w_1 + w_2 + \epsilon}$
[0018] where $F_{up}$ denotes an upsampling process, $P_5^{in}$ and $P_6^{in}$ respectively denote the input feature at the fifth layer and the input feature at the sixth layer in the BiFPN, $w_1$ and $w_2$ denote weights when $P_5^{in}$ and $F_{up}(P_6^{in})$ are fused, and $\epsilon$ is configured to avoid numerical uncertainty with a value of 0.0001.
[0019] For example, a calculation process for fusing features at the fifth layer by the bottom-top path branch is as follows:
[0020] $P_5^{out} = \dfrac{w_1' \cdot P_5^{in} + w_2' \cdot P_5^{td} + w_3' \cdot F_{down}(P_4^{out})}{w_1' + w_2' + w_3' + \epsilon}$
[0021] where $F_{down}$ represents a downsampling process. Finally, $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ are fused in the above fusing manner to obtain $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ containing rich semantic information and detailed information.
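A minimal sketch of one BiFPN level built from the two formulas above, assuming nearest-neighbour interpolation for $F_{up}$, adaptive max pooling for $F_{down}$, and feature maps that already share a common channel width (all of these choices are assumptions; the disclosure does not fix them):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse(weights, feats, eps=1e-4):
    # fast normalized weighted fusion of same-shaped feature maps
    w = F.relu(weights)
    return sum(wi * x for wi, x in zip(w / (eps + w.sum()), feats))


class BiFPNLevel5(nn.Module):
    """Fusion at the fifth layer: a top-bottom step giving P5_td, then a bottom-top step giving P5_out."""

    def __init__(self):
        super().__init__()
        self.w_td = nn.Parameter(torch.ones(2))   # w1, w2 for P5_in and F_up(P6_in)
        self.w_out = nn.Parameter(torch.ones(3))  # w1', w2', w3' for P5_in, P5_td and F_down(P4_out)

    def forward(self, p4_out, p5_in, p6_in):
        f_up = F.interpolate(p6_in, size=p5_in.shape[-2:], mode="nearest")   # F_up
        p5_td = fuse(self.w_td, [p5_in, f_up])                               # top-bottom branch
        f_down = F.adaptive_max_pool2d(p4_out, p5_in.shape[-2:])             # F_down
        return fuse(self.w_out, [p5_in, p5_td, f_down])                      # bottom-top branch
```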
[0022] The processing of the fused features by a coordinate attention mechanism in step S3 specifically includes:
[0023] S3.1: when an input X has a size of (C×H×W), setting pooling kernels with sizes of (H,1) and (1,W) to encode information of different channels in the horizontal and vertical directions; for the cth channel in the features, calculating the output of a feature with a height of h after pooling as follows:
[0024] $z_c^h(h) = \dfrac{1}{W}\sum_{0 \le i < W} x_c(h, i)$, and
[0025] calculating the output of a feature with a width of w after pooling as follows:
[0026] $z_c^w(w) = \dfrac{1}{H}\sum_{0 \le j < H} x_c(j, w)$;
[0027] S3.2: after performing pooling in the horizontal and vertical directions, transforming from C×W×H to C×W×1 and C×1×H, and transforming C×W×1 to C×1×H for the purpose of integration (the feature maps here have equal height and width, so W = H);
[0028] S3.3: performing concatenation at the third dimension (H + H = 2H) to obtain an attention feature map of C×1×2H;
[0029] S3.4: giving the attention feature map as an input to a 1×1 convolutional layer, after which the channel number changes to C/r and the dimension of the attention feature map changes to C/r×1×2H;
[0030] S3.5: decomposing the attention feature map of C/r×1×2H into two independent tensors $f^h \in R^{C/r \times 1 \times H}$ and $f^w \in R^{C/r \times 1 \times W}$ along the spatial dimension;
[0031] S3.6: then restoring the channel numbers of the two tensors to C through two 1×1 convolutional layers $F_h$ and $F_w$, and using a sigmoid activation function to obtain weight matrices $g^h$ and $g^w$ as follows:
[0032] $g^h = \sigma(F_h(f^h))$
[0033] $g^w = \sigma(F_w(f^w))$, and
[0034] S3.7: multiplying the input feature X by the weight matrices to obtain the final output Y of the coordinate attention module as follows:
[0035] $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$
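A compact PyTorch rendering of steps S3.1-S3.7 is sketched below; the module and parameter names, the use of adaptive average pooling for the (H,1)/(1,W) kernels, and the default reduction ratio r = 32 are assumptions made for illustration:

```python
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Coordinate attention over an input of size C x H x W (steps S3.1-S3.7)."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)                     # C/r channels after the first 1x1 conv
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))           # (1,W) kernel: average over the width -> C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))           # (H,1) kernel: average over the height -> C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)    # S3.4: channel reduction to C/r
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                                    # S3.1: height-wise encoding, C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)                # S3.1-S3.2: width-wise encoding, permuted to C x W x 1
        y = self.conv1(torch.cat([x_h, x_w], dim=2))            # S3.3-S3.4: concatenation -> C/r x (H+W) x 1
        f_h, f_w = torch.split(y, [h, w], dim=2)                # S3.5: split back along the spatial dimension
        g_h = torch.sigmoid(self.conv_h(f_h))                   # S3.6: g^h, shape C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # S3.6: g^w, shape C x 1 x W
        return x * g_h * g_w                                    # S3.7: y_c(i,j) = x_c(i,j) * g^h_c(i) * g^w_c(j)
```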
[0036] Compared with the prior art, the present disclosure has the following beneficial effects: the A-BiFPN fuses the features at different dimensions through top-bottom and bottom-top path branches so as to obtain features containing rich semantic information and detailed information. In addition, each feature output branch is processed by coordinate attention such that the network can easily pay attention to the channels and locations related to objects in the feature maps, thereby achieving precise classification and location of the objects.
[0037] A Visual Geometry Group (VGG) network or VGGNet is a neural network, in particular a convolutional neural network, such as a deep convolutional neural network for image recognition. Such a neural network may be trained by the Visual Geometry Group (VGG) at the University of Oxford. This is set out at: https://machinelearning.wtf/terms/vggnet/.
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] FIG. 1 is a network structure diagram according to the present disclosure;
[0039] FIG. 2(a) is a network structure diagram of the coordinate attention model (coordinate attention mechanism);
[0040] FIG. 2(b) is a flowchart of the coordinate attention model (coordinate attention mechanism);
[0041] FIG. 3 shows a comparison of detection results on the NWPU VHR-10 dataset between the present disclosure and the original SSD algorithm, where the detection results of the original SSD algorithm are shown; and
[0042] FIG. 4 shows a comparison of detection results on the NWPU VHR-10 dataset between the present disclosure and the original SSD algorithm, where the detection results of the improved SSD algorithm are shown.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0043] The technical solutions of the embodiments of the present disclosure are clearly and completely described below with reference to the accompanying drawings. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts should fall within the protection scope of the present disclosure.
[0044] As shown in FIG. 1, an embodiment of the present disclosure provides an object detection method based on an A-BiFPN, where the method includes the following steps:
[0045] S1: Input an image to a Visual Geometry Group (VGG) network; the VGG network conducts feature extraction on the input image to obtain features $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ of four layers.
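A sketch of this backbone step under stated assumptions: a torchvision VGG-16 trunk with illustrative tap points after the conv3, conv4 and conv5 stages plus one extra stride-2 block for the fourth map. The disclosure does not specify which VGG layers yield $P_3^{in}$-$P_6^{in}$, nor the channel projections that would normally bring the maps to a common width before BiFPN fusion:

```python
import torch.nn as nn
import torchvision


class VGGBackbone(nn.Module):
    """Extracts four feature maps of decreasing resolution from a VGG-16 trunk (tap points are illustrative)."""

    def __init__(self):
        super().__init__()
        features = torchvision.models.vgg16().features
        self.stage3 = features[:16]     # through relu3_3 -> P3_in (256 channels)
        self.stage4 = features[16:23]   # through relu4_3 -> P4_in (512 channels)
        self.stage5 = features[23:30]   # through relu5_3 -> P5_in (512 channels)
        self.stage6 = nn.Sequential(    # extra stride-2 block -> P6_in (512 channels)
            nn.MaxPool2d(2), nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True)
        )

    def forward(self, image):
        p3_in = self.stage3(image)
        p4_in = self.stage4(p3_in)
        p5_in = self.stage5(p4_in)
        p6_in = self.stage6(p5_in)
        return p3_in, p4_in, p5_in, p6_in
```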
[0046] S2: Input $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ to a BiFPN, and fuse the features at different dimensions through top-bottom and bottom-top path branches so as to obtain features $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ containing rich semantic information and detailed information.
[0047] The features of different layers are subject to weighted fusion as follows:
[0048] Fuse the features of different layers in a fast and normalized manner, where the calculation formula for weighted feature fusion is as follows:
[0049] $O = \dfrac{\sum_i w_i \cdot I_i}{\epsilon + \sum_j w_j}$
[0050] where $w_i \ge 0$ is ensured by using a rectified linear unit (ReLU) after each $w_i$, $\epsilon$ is configured to avoid numerical uncertainty with a value of 0.0001, and $I_i$ represents the value of the ith input feature.
[0051] A process of fusing the features of different layers by the BiFPN is performed specifically as follows:
[0052] For example, a calculation process for fusing features at the fifth layer by the top-bottom path branch is as follows:
[0053] $P_5^{td} = \dfrac{w_1 \cdot P_5^{in} + w_2 \cdot F_{up}(P_6^{in})}{w_1 + w_2 + \epsilon}$
[0054] where $F_{up}$ denotes an upsampling process, $P_5^{in}$ and $P_6^{in}$ respectively denote the input feature at the fifth layer and the input feature at the sixth layer in the BiFPN, $w_1$ and $w_2$ denote weights when $P_5^{in}$ and $F_{up}(P_6^{in})$ are fused, and $\epsilon$ is configured to avoid numerical uncertainty with a value of 0.0001.
[0055] For example, a calculation process for fusing features at the fifth layer by the bottom-top path branch is as follows:
[0056] $P_5^{out} = \dfrac{w_1' \cdot P_5^{in} + w_2' \cdot P_5^{td} + w_3' \cdot F_{down}(P_4^{out})}{w_1' + w_2' + w_3' + \epsilon}$
[0057] where $F_{down}$ denotes a downsampling process. Finally, $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ are fused in the above fusing manner to obtain $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ containing rich semantic information and detailed information.
[0058] S3: Process $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ by a coordinate attention mechanism respectively to obtain attention feature maps Y3, Y4, Y5 and Y6. For example, the processing of the input feature map $P_3^{out}$ by the coordinate attention model is specifically as follows:
[0059] S3.1: When $P_3^{out}$ has a size of (256×10×10), set pooling kernels with sizes of (10,1) and (1,10) to encode information of different channels in the horizontal and vertical directions; and for the cth channel in the features, calculate the output of a feature with a height of h after pooling as follows:
[0060] $z_c^h(h) = \dfrac{1}{10}\sum_{0 \le i < 10} x_c(h, i)$
[0061] Calculate the output of a feature with a width of w after pooling as follows:
[0062] $z_c^w(w) = \dfrac{1}{10}\sum_{0 \le j < 10} x_c(j, w)$
[0063] S3.2: After performing pooling in the horizontal and vertical directions, transform from 256×10×10 to 256×10×1 and 256×1×10. Transform 256×10×1 to 256×1×10 for the purpose of integration.
[0064] S3.3: Perform concatenation at the third dimension (10 + 10 = 20) to obtain an attention feature map of 256×1×20.
[0065] S3.4: Give the attention feature map as an input to a 1×1 convolutional layer, after which the channel number changes to 8 and the dimension of the attention feature map changes to 8×1×20.
[0066] S3.5: Decompose the attention feature map of 8×1×20 into two independent tensors $f^h \in R^{8 \times 1 \times 10}$ and $f^w \in R^{8 \times 1 \times 10}$ along the spatial dimension.
[0067] S3.6: Then restore the channel numbers of the two tensors to 256 through two 1×1 convolutional layers $F_h$ and $F_w$, and use a sigmoid activation function to obtain weight matrices $g^h$ and $g^w$ as follows:
[0068] $g^h = \sigma(F_h(f^h))$
[0069] $g^w = \sigma(F_w(f^w))$
[0070] S3.7: Multiply the input feature $P_3^{out}$ by the weight matrices to obtain the final output Y3 of the coordinate attention module as follows:
[0071] $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$
[0072] S3.8: Sequentially process $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ according to steps S3.1-S3.7 to obtain attention feature maps Y4, Y5 and Y6.
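Reusing the CoordinateAttention sketch given after step S3.7 in the Summary, the shapes of this embodiment can be checked as follows (the batch size of 1 and the reduction ratio r = 32, giving 256/32 = 8 internal channels, are assumptions):

```python
import torch

# P3_out of this embodiment: 256 channels at 10 x 10 spatial resolution.
ca = CoordinateAttention(channels=256, reduction=32)   # 256/32 = 8 channels inside, as in S3.4
p3_out = torch.randn(1, 256, 10, 10)
y3 = ca(p3_out)
print(y3.shape)   # torch.Size([1, 256, 10, 10]) -- Y3 keeps the input size
```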
[0073] S4: Put the attention feature maps Y3, Y4, Y5 and Y6 of the four layers output by the coordinate attention mechanism into a prediction module for classification and location; and
[0074] S5: Filter out redundant prediction boxes through non-maximum suppression (NMS) to obtain a final prediction result.
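For step S5, a minimal sketch using torchvision's NMS operator; the (x1, y1, x2, y2) box format and the 0.45 IoU threshold are assumptions, since the disclosure does not specify them:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],      # predicted boxes in (x1, y1, x2, y2) format
                      [12., 12., 62., 62.],      # heavily overlaps the first box
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.90, 0.75, 0.80])        # classification confidences
keep = nms(boxes, scores, iou_threshold=0.45)    # indices of retained boxes, sorted by decreasing score
print(keep)                                      # tensor([0, 2]): the redundant second box is filtered out
```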
[0075] As shown in FIGs. 3-4, FIG. 3 shows a comparison of detection results on the NWPU VHR-10 dataset between the original SSD object detection algorithm and the object detection method based on the A-BiFPN in the present disclosure. The comparison shows a 7.92% improvement in detection performance. The example of the present disclosure is implemented on a computer with an Intel Platinum 8163 CPU (2.50 GHz), 256 GB RAM and an NVIDIA TITAN RTX using Python 3.6. The present disclosure uses the NWPU VHR-10 dataset as the experimental material and mean average precision (mAP) as the evaluation indicator. The dataset includes 10 object types, namely airplanes, ships, storage tanks, baseball diamonds, tennis courts, basketball courts, ground track fields, harbors, bridges and vehicles, with a total of 520 training samples and 280 test samples. The training samples are used to train the object detection model, and the test samples are used to evaluate the detection effect of the model.
[0076] The above described are merely preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modifications, equivalent replacements and improvements made within the principle of the present disclosure should all fall within the scope of protection of the present disclosure.

Claims (4)

  1. WHAT IS CLAIMED IS: 1. An object detection method based on an attention-enhanced bidirectional feature pyramid network (A-BiFPN), comprising the following steps: S1: inputting an image to a Visual Geometry Group (VGG) network to obtain features $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ of 4 layers; S2: inputting $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ to a BiFPN, and fusing the features at different dimensions through top-bottom and bottom-top path branches so as to obtain features $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ containing rich semantic information and detailed information; S3: processing $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ by a coordinate attention mechanism respectively to obtain attention feature maps Y3, Y4, Y5 and Y6; S4: putting the attention feature maps Y3, Y4, Y5 and Y6 of four layers output by the coordinate attention mechanism into a prediction module for classification and location; and S5: filtering out redundant prediction boxes through non-maximum suppression (NMS) to obtain a final prediction result.
  2. The object detection method based on an A-BiFPN according to claim 1, wherein the fusing in step S2 specifically comprises: fusing the features of different layers in a fast and normalized manner, wherein the calculation formula for weighted feature fusion is as follows: $O = \dfrac{\sum_i w_i \cdot I_i}{\epsilon + \sum_j w_j}$, wherein $w_i \ge 0$ is ensured by using a rectified linear unit (ReLU) after each $w_i$, $\epsilon$ is configured to avoid numerical uncertainty with a value of 0.0001, and $I_i$ represents the value of the ith input feature.
  3. The object detection method based on an A-BiFPN according to claim 1 or claim 2, wherein in step S2, a process of fusing features at the fifth layer through a top-bottom path branch is expressed as follows: $P_5^{td} = \dfrac{w_1 \cdot P_5^{in} + w_2 \cdot F_{up}(P_6^{in})}{w_1 + w_2 + \epsilon}$, wherein $F_{up}$ denotes an upsampling process, $P_5^{in}$ and $P_6^{in}$ respectively denote an input feature at a fifth layer and an input feature at a sixth layer in the BiFPN, $w_1$ and $w_2$ denote weights when $P_5^{in}$ and $F_{up}(P_6^{in})$ are fused, and $\epsilon$ is configured to avoid numerical uncertainty with a value of 0.0001; and a process of fusing features at the fifth layer through a bottom-top path branch is expressed as follows: $P_5^{out} = \dfrac{w_1' \cdot P_5^{in} + w_2' \cdot P_5^{td} + w_3' \cdot F_{down}(P_4^{out})}{w_1' + w_2' + w_3' + \epsilon}$, wherein $F_{down}$ denotes a downsampling process; and finally, $P_3^{in}$, $P_4^{in}$, $P_5^{in}$ and $P_6^{in}$ are fused in the above fusing manner to obtain $P_3^{out}$, $P_4^{out}$, $P_5^{out}$ and $P_6^{out}$ containing rich semantic information and detailed information.
  4. The object detection method based on an A-BiFPN according to any preceding claim, wherein the processing of the fused features by a coordinate attention mechanism in step S3 specifically comprises: S3.1: when an input X has a size of C×H×W, setting pooling kernels with sizes of (H,1) and (1,W) to encode information of different channels in horizontal and vertical directions; and for the cth channel in the features, calculating the output of a feature with a height of h after pooling as follows: $z_c^h(h) = \dfrac{1}{W}\sum_{0 \le i < W} x_c(h, i)$, and calculating the output of a feature with a width of w after pooling as follows: $z_c^w(w) = \dfrac{1}{H}\sum_{0 \le j < H} x_c(j, w)$; S3.2: after performing pooling in the horizontal and vertical directions, transforming from C×W×H to C×W×1 and C×1×H, and transforming C×W×1 to C×1×H; S3.3: performing concatenation at the third dimension to obtain an attention feature map of C×1×2H; S3.4: giving the attention feature map as an input to a 1×1 convolutional layer, wherein afterwards, the channel number changes to C/r and the dimension of the attention feature map changes to C/r×1×2H; S3.5: decomposing the attention feature map of C/r×1×2H into two independent tensors $f^h \in R^{C/r \times 1 \times H}$ and $f^w \in R^{C/r \times 1 \times W}$ along a spatial dimension; S3.6: then restoring the channel numbers of the two tensors to C through two 1×1 convolutional layers $F_h$ and $F_w$, and using a sigmoid activation function to obtain weight matrices $g^h$ and $g^w$ as follows: $g^h = \sigma(F_h(f^h))$ and $g^w = \sigma(F_w(f^w))$; and S3.7: multiplying the input feature X by the weight matrices to obtain a final output Y of the coordinate attention module as follows: $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$.
GB2217717.4A 2022-05-23 2022-11-25 Object detection method based on attention-enhanced bidirectional feature pyramid network (A-BiFPN) Pending GB2614954A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210567741.2A CN114972860A (en) 2022-05-23 2022-05-23 Target detection method based on attention-enhanced bidirectional feature pyramid network

Publications (2)

Publication Number Publication Date
GB202217717D0 GB202217717D0 (en) 2023-01-11
GB2614954A true GB2614954A (en) 2023-07-26

Family

ID=82984798

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2217717.4A Pending GB2614954A (en) 2022-05-23 2022-11-25 Object detection method based on attention-enhanced bidirectional feature pyramid network (A-BiFPN)

Country Status (2)

Country Link
CN (1) CN114972860A (en)
GB (1) GB2614954A (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565077A (en) * 2022-09-29 2023-01-03 哈尔滨天枢问道技术有限公司 Remote sensing image small target detection algorithm based on spatial feature integration
CN116189021B (en) * 2023-02-27 2024-04-09 中国人民解放军国防科技大学 Multi-branch intercrossing attention-enhanced unmanned aerial vehicle multispectral target detection method
CN117315458B (en) * 2023-08-18 2024-07-12 北京观微科技有限公司 Target detection method and device for remote sensing image, electronic equipment and storage medium
CN117351359B (en) * 2023-10-24 2024-06-21 中国矿业大学(北京) Mining area unmanned aerial vehicle image sea-buckthorn identification method and system based on improved Mask R-CNN
CN117636172B (en) * 2023-12-06 2024-06-21 中国科学院长春光学精密机械与物理研究所 Target detection method and system for weak and small target of remote sensing image
CN117876831A (en) * 2024-01-15 2024-04-12 国家粮食和物资储备局科学研究院 Target detection and identification method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591648A (en) * 2021-07-22 2021-11-02 北京工业大学 Method, system, device and medium for detecting real-time image target without anchor point
CN114332620A (en) * 2021-12-30 2022-04-12 杭州电子科技大学 Airborne image vehicle target identification method based on feature fusion and attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401201B (en) * 2020-03-10 2023-06-20 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111914917A (en) * 2020-07-22 2020-11-10 西安建筑科技大学 Target detection improved algorithm based on feature pyramid network and attention mechanism
CN112396115B (en) * 2020-11-23 2023-12-22 平安科技(深圳)有限公司 Attention mechanism-based target detection method and device and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591648A (en) * 2021-07-22 2021-11-02 北京工业大学 Method, system, device and medium for detecting real-time image target without anchor point
CN114332620A (en) * 2021-12-30 2022-04-12 杭州电子科技大学 Airborne image vehicle target identification method based on feature fusion and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG et al., Recognition and detection of Wolfberry in the natural background based on improved YOLOv5 network, 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), IEEE, 20 May 2022, pp256 *

Also Published As

Publication number Publication date
CN114972860A (en) 2022-08-30
GB202217717D0 (en) 2023-01-11

Similar Documents

Publication Publication Date Title
GB2614954A (en) Object detection method based on attention-enhanced bidirectional feature pyramid network (A-BiFPN)
CN109902677B (en) Vehicle detection method based on deep learning
CN109522966B (en) Target detection method based on dense connection convolutional neural network
CN110176027A (en) Video target tracking method, device, equipment and storage medium
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN112288008B (en) Mosaic multispectral image disguised target detection method based on deep learning
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN109977997A (en) Image object detection and dividing method based on convolutional neural networks fast robust
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN105224935A (en) A kind of real-time face key point localization method based on Android platform
Lu et al. A cnn-transformer hybrid model based on cswin transformer for uav image object detection
CN112434586A (en) Multi-complex scene target detection method based on domain adaptive learning
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN112070040A (en) Text line detection method for video subtitles
Zhang et al. Sam3d: Zero-shot 3d object detection via segment anything model
CN114742799A (en) Industrial scene unknown type defect segmentation method based on self-supervision heterogeneous network
CN113239914A (en) Classroom student expression recognition and classroom state evaluation method and device
CN116434230A (en) Ship water gauge reading method under complex environment
CN116168246A (en) Method, device, equipment and medium for identifying waste slag field for railway engineering
Yuan et al. Faster light detection algorithm of traffic signs based on YOLOv5s-A2
CN114943888A (en) Sea surface small target detection method based on multi-scale information fusion, electronic equipment and computer readable medium
Yang et al. Automatic detection of bridge surface crack using improved Yolov5s
CN115410102A (en) SAR image airplane target detection method based on combined attention mechanism
Xie et al. Lightweight and anchor-free frame detection strategy based on improved CenterNet for multiscale ships in SAR images
Hou et al. The Improved CenterNet for Ship Detection in Scale-Varying Images

Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: YANGTZE DELTA REGION INSTITUTE (HUZHOU), UNIVERSITY OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA

Free format text: FORMER OWNER: ZHENGZHOU UNIVERSITY OF LIGHT INDUSTRY