CN116051957A - Personal protection item detection network based on attention mechanism and multi-scale fusion
- Publication number
- CN116051957A (application CN202310001089.2A)
- Authority
- CN
- China
- Prior art keywords
- feature
- module
- attention
- network
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06V10/82 — Image or video recognition using pattern recognition or machine learning, using neural networks
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06V2201/07 — Target detection
Abstract
The invention discloses a personal protection item detection network based on an attention mechanism and multi-scale fusion. The network comprises a feature extraction module, DAR50, constructed from DARM modules, which enhances target feature information by adapting to the target's form, effectively reducing background interference and extracting target features more accurately; and a CCFPN feature fusion module, which establishes information fusion between each pixel and the other pixels in the feature maps of different stages and then fuses the features of different stages through a pyramid structure, improving detection performance for targets of different scales.
Description
Technical Field
The present invention relates to personal protection item detection, and in particular to a personal protection item detection network based on attention mechanisms and multiscale fusion.
Background
In infectious-disease prevention and control work, it is essential that epidemic-prevention personnel wear medical personal protective equipment. Wearing medical personal protective equipment such as medical surgical masks, face shields and gloves effectively protects the wearer from potential infectious agents or other toxins. It is therefore necessary to monitor in real time whether personnel wear personal protective equipment in places where protection is required.
Whereas traditional manual supervision cannot provide continuous, effective monitoring and incurs high labor costs, target detection technology based on neural networks can supervise continuously and effectively at lower cost. However, in complex scenarios such as that shown in fig. 1, existing target detection network models perform poorly at detecting personal protective equipment, so machine-based supervision has not been practically deployed and most regions still rely on manual supervision. A better-performing target detection model is therefore needed for detecting medical personal protective equipment in complex scenarios, so that a machine employing target detection technology can take over the supervision of whether personnel wear the necessary medical personal protective equipment, reducing cost and enabling real-time monitoring.
The target detection task can be divided into two sub-tasks: target classification, which determines what kind of object appears in the image, and target localization, which determines the object's position in the image. Deep convolutional network based target detection methods developed in recent years can be divided into single-stage and two-stage algorithms. Single-stage networks, typically represented by the YOLO series and SSD, have a speed advantage because they generate object categories and locations directly without a region proposal stage. Two-stage networks, typically represented by Faster R-CNN [5] and Sparse R-CNN, have an accuracy advantage: the first stage generates candidate regions and the second stage classifies and refines them. Detecting whether personnel wear personal protective equipment demands high accuracy, so most detection methods in this field still adopt a two-stage target detection algorithm.
In the complex scenarios of medical environments, target detection network models are affected by background interference and multi-scale problems during detection, so most network models perform poorly. The invention obtains the improved AMS R-CNN network by studying and analyzing the background interference and multi-scale problems in detecting medical personal protective equipment, taking the two-stage target detection network Faster R-CNN as reference. In the AMS R-CNN network, the feature extraction network is DAR50, formed from multiple Deformable and Attention Residual Modules (DARM); the DARM module extracts target features by deforming the convolution kernel to adapt to the detected target's shape, and an attention module then enhances the feature information to obtain more effective features of the detection target. In the feature fusion stage, the CCFPN module applies a criss-cross information attention module to the feature maps of different stages extracted by the feature extraction network, establishing information fusion between each pixel and the other pixels, and then fuses the features of different stages through a pyramid structure to realize multi-scale fusion. FIG. 2 compares the proposed method's detection results with those of TridentNet, the optimal method in the CPPE-5 paper.
Disclosure of Invention
The invention mainly aims to provide a personal protection item detection network based on an attention mechanism and multi-scale fusion. For the problem of background interference, a DARM module is designed that extracts features by adapting to the target's shape and then enhances the feature information; a DAR50 feature extraction network built from DARM modules extracts feature information from images of personal medical protective equipment, obtaining more effective target features. For the difficulty of multi-scale target detection, a CCFPN feature fusion module based on the feature pyramid structure is designed: it establishes information fusion between pixels using a criss-cross information attention mechanism and fuses the feature maps of different stages through a pyramid structure, improving the network model's detection performance for targets of different scales.
The technical scheme adopted by the invention is as follows: a personal protective item detection network based on attention mechanisms and multiscale fusion, comprising:
a feature extraction module DAR50, constructed from DARM modules, for enhancing target feature information by adapting to the target's form;
and a CCFPN feature fusion module for establishing information fusion between each pixel and the other pixels in the feature maps of different stages, and fusing the features of different stages through a pyramid structure to improve detection performance for targets of different scales.
Further, the feature extraction module DAR50 includes:
with reference to the ResNet50 network structure, an ARM module and multiple DARM modules are used to construct the DAR50 feature extraction network, realizing effective acquisition of target features;
ARM is a residual module augmented with an attention operation, and DARM performs feature extraction with a deformable convolution and an scSE attention module in series;
the ARM module enhances the feature information of the original image produced by the preceding two-step stem; the DARM module replaces the conventional convolution with a deformable convolution that extracts features by adapting to the target's shape, and applies the scSE attention module to enhance the target feature information, realizing extraction of effective target features from the image;
Assume an input image $X \in \mathbb{R}^{W \times H \times C}$ and a residual-block output $Y \in \mathbb{R}^{W \times H \times C_1}$, where $W$ and $H$ denote the width and height of the input image and $C$ and $C_1$ denote the input and output channels; $F$ denotes the original residual module's feature mapping of the image, computed as in formula (1), where $\mathrm{conv1}(\cdot)$ and $\mathrm{conv3}(\cdot)$ denote convolution with 1x1 and 3x3 kernels:

$$Y = F(X) + X, \qquad F(X) = \mathrm{conv1}(\mathrm{conv3}(\mathrm{conv1}(X))) \tag{1}$$

While preserving the characteristics of the original residual module, the DARM module removes the conventional 3x3 convolution and adds a 3x3 deformable convolution operation and the scSE attention module. Let $Y' \in \mathbb{R}^{W \times H \times C_2}$ denote the feature output by the DAR residual module, with $C_2$ the output channels; the DARM module's calculation is as in formula (4):

$$Y' = f_{scSE}\!\left(\mathrm{conv1}\!\left(\mathrm{dconv3}(\mathrm{conv1}(X))\right)\right) + X \tag{4}$$

In the formula, $\mathrm{dconv3}(\cdot)$ denotes the 3x3 deformable convolution operation replacing the conventional convolution: when extracting features, the convolution kernel offsets the sampling points on the input feature map and concentrates them on the target region, obtaining the target's feature information by adapting to the object's shape. $f_{scSE}(\cdot)$ denotes the computation of the scSE attention module, which enhances the meaningful parts of the feature and suppresses the meaningless parts. The scSE module performs recalibration through the sSE and cSE branches running in parallel and obtains the new feature information by element-level addition; the operation is as in formula (5), where $f_{cSE}(\cdot)$ is the computation of the cSE branch and $f_{sSE}(\cdot)$ that of the sSE branch:

$$f_{scSE}(U) = f_{cSE}(U) + f_{sSE}(U) \tag{5}$$
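As a minimal PyTorch sketch of formulas (4) and (5) — the module names, channel widths, offset-prediction convolution and normalization layout below are illustrative assumptions, not taken from the patent — the DARM residual block can be written with torchvision's DeformConv2d:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SCSE(nn.Module):
    """scSE attention: parallel cSE (channel) and sSE (spatial) branches fused by addition (formula (5))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cse = nn.Sequential(                       # channel squeeze-and-excitation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())  # spatial excitation

    def forward(self, u):
        return u * self.cse(u) + u * self.sse(u)        # f_cSE(U) + f_sSE(U)

class DARM(nn.Module):
    """Deformable and Attention Residual Module: 1x1 -> 3x3 deformable conv -> 1x1 -> scSE, plus identity (formula (4))."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv1a = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.offset = nn.Conv2d(mid_ch, 2 * 3 * 3, 3, padding=1)   # 2D offset per kernel sampling point
        self.dconv3 = DeformConv2d(mid_ch, mid_ch, 3, padding=1, bias=False)
        self.conv1b = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.scse = SCSE(out_ch)
        self.bn = nn.ModuleList(nn.BatchNorm2d(c) for c in (mid_ch, mid_ch, out_ch))
        self.relu = nn.ReLU(inplace=True)
        self.proj = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        y = self.relu(self.bn[0](self.conv1a(x)))
        y = self.relu(self.bn[1](self.dconv3(y, self.offset(y))))  # sampling points shift toward the target
        y = self.bn[2](self.conv1b(y))
        return self.relu(self.scse(y) + self.proj(x))              # residual connection
```

The offset convolution predicts the displacement of each of the nine 3x3 kernel sampling points, which is how the deformable convolution concentrates sampling on the target region.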
still further, the CCFPN feature fusion module includes:
based on the feature pyramid structure, a feature fusion network is constructed by adding an attention mechanism;
first, an LCC attention mechanism separately processes the feature maps of different stages in the feature extraction network;
second, a feature map of twice the original size is generated by nearest-neighbor upsampling, features of different scales are fused by element-wise addition, and the semantic feature maps of different stages are fused, making small targets easier to detect;
the feature maps of four different stages output by the backbone network are taken as input, feature fusion is performed through the pyramid structure, and a 3x3 convolution is applied to the fused feature map at each stage for feature re-extraction;
the final output undergoes max pooling with stride 2 to obtain a new feature map, which serves together with the other feature maps as the input of the next stage;
LCC is divided into an attention branch and a convolution branch; the attention branch iterates twice through a criss-cross attention module, which obtains global information along the vertical and horizontal directions of the image by feature weighting and captures the contextual dependencies between pixels, thereby establishing information fusion;
the convolution branch performs a dimension-reduction operation with a 1x1 convolution and supplements the attention branch, so the network obtains more comprehensive and richer feature information for targets of different scales;
For the feature map $X_i$ output by each stage of the feature extraction network, the calculation of LCC can be summarized as formula (6) and the calculation of the attention branch as formula (7):

$$F_{LCC}(X_i) = F_{att}(X_i) + F_{conv}(X_i) \tag{6}$$

$$F_{att}(X_i) = f_{CC}\!\left(f_{CC}(X_i)\right) \tag{7}$$

where $F_{LCC}$ denotes the feature map output by the LCC network, $F_{att}$ denotes the attention branch, $F_{conv}$ denotes the convolution branch performing a 1x1 convolution operation, and $f_{CC}(\cdot)$ denotes the computation of the criss-cross attention module.
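The following is a hedged sketch of the LCC branch structure of formulas (6) and (7). The published criss-cross attention (CCNet) applies one joint softmax over the row-column union for each pixel; the version below simplifies this to independent row and column attention, and the projection convolutions and channel sizes are assumptions rather than patent specifics:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Simplified criss-cross attention: every pixel attends to the pixels
    in its own row and its own column (horizontal + vertical global context)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))    # learned residual weight

    @staticmethod
    def _attend(q, k, v):
        # q: (n, L, c'), k: (n, c', L), v: (n, L, c) -> (n, L, c)
        attn = F.softmax(torch.bmm(q, k), dim=-1)
        return torch.bmm(attn, v)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        cq = q.shape[1]
        # horizontal: each row treated as an independent sequence of length w
        out_h = self._attend(
            q.permute(0, 2, 3, 1).reshape(b * h, w, cq),
            k.permute(0, 2, 1, 3).reshape(b * h, cq, w),
            v.permute(0, 2, 3, 1).reshape(b * h, w, c),
        ).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # vertical: each column treated as an independent sequence of length h
        out_v = self._attend(
            q.permute(0, 3, 2, 1).reshape(b * w, h, cq),
            k.permute(0, 3, 1, 2).reshape(b * w, cq, h),
            v.permute(0, 3, 2, 1).reshape(b * w, h, c),
        ).reshape(b, w, h, c).permute(0, 3, 2, 1)
        return self.gamma * (out_h + out_v) + x      # residual connection

class LCC(nn.Module):
    """Formulas (6)-(7): attention branch = criss-cross attention iterated twice;
    convolution branch = 1x1 dimension reduction; the two outputs are added."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.cca = CrissCrossAttention(in_ch)
        self.att_proj = nn.Conv2d(in_ch, out_ch, 1)   # unify channels for the sum
        self.conv_branch = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.att_proj(self.cca(self.cca(x))) + self.conv_branch(x)
```

Iterating the criss-cross module twice lets every pixel indirectly reach every other pixel (row then column), which is what establishes the full-image information fusion described above.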
The invention has the advantages that:
the invention provides an attention mechanism and multiscale fusion-based (AMSR-CNN) target detection model; aiming at the problem of background information interference in the detection process, a DAR50 feature extraction network constructed by using a DARM module is provided, and the module can effectively reduce the background information interference and extract the target feature information more accurately by enhancing the target feature information while adapting to the target form;
aiming at the problem of target multiscale, a CCFPN feature fusion module is provided, wherein information fusion of each pixel point and other pixel points is established in feature diagrams of different stages, and features of different stages are fused by utilizing a pyramid structure, so that the detection performance of targets of different scales is improved.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The present invention will be described in further detail with reference to the drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a medical protective article in a real scene;
In fig. 2, (a) is the original image, (b) is the detection result of TridentNet, and (c) is the detection result of AMS R-CNN. Purple is gloves, red is the coverall (protective clothing), green is the mask, blue is the goggles, and yellow is the face shield;
FIG. 3 illustrates feature extraction of the input data: feature maps extracted at different stages of DAR50 are input to CCFPN for feature fusion. The fused feature maps are input to the RPN layer for candidate-box selection, the candidate boxes and fused features are input to ROI Align for post-processing, and classification and localization prediction are finally realized;
FIG. 4 is a graph of network model test results with ResNet as the feature extraction network;
FIG. 5 is a diagram of the DAR50 network structure of the present invention: 7x7 conv is a convolution with a 7x7 kernel, maxpool is a pooling operation with stride 2, and this two-step stem captures as much feature information from the original image as possible. ARM is the residual module with an added attention operation. DARM performs feature extraction with a deformable convolution and an scSE attention module in series;
in FIG. 6, (a) is an ARM residual module, (b) is a DARM residual module;
FIG. 7 is a diagram of the CCFPN network structure of the present invention: the feature maps of four different stages output by the backbone network are taken as input, feature fusion is performed through the pyramid structure, and a 3x3 convolution is applied to the fused feature map at each stage for feature re-extraction, ensuring the stability of the extracted features; the final output undergoes max pooling with stride 2 to obtain a new feature map, which serves together with the other feature maps as the input of the next stage;
FIG. 8 is a diagram of the LCC network structure of the present invention: LCC feeds the input feature map into the attention branch and the convolution branch respectively, where the 1x1 convolution operations all perform dimension reduction, unifying the feature maps to the same dimension for feature fusion;
FIG. 9 shows the training process of the bbox_mAP evaluation index for the compared network models on the CPPE-5 dataset;
In FIG. 10, (a) is an original image in the CPPE-5 dataset; (b) is the detection result of the Faster R-CNN (ResNet50+FPN) network; (c) is the TridentNet detection result; (d) is the detection result of the proposed network model;
FIG. 11 shows the training process of the compared methods and other network models on the PASCAL VOC 2007 dataset;
FIG. 12 shows the training process of DAR50 compared with other feature extraction networks on the CPPE-5 dataset;
FIG. 13 shows the training process of different feature fusion modules on different evaluation indexes;
FIG. 14 shows the training process of the AMS R-CNN of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Description of the invention
In fig. 9:
Sparse R-CNN: sparse region convolutional neural network target detection method
Faster R-CNN: faster region convolutional neural network target detection method
FCOS: first-order fully convolutional target detection
Deformable DETR: deformable detection Transformer
Double Head: double-head detection method
DCN: deformable convolution network
Empirical Attention: empirically based attention mechanism
TridentNet: scale-aware trident network for object detection
Ours: method of the invention
In fig. 11:
faster R-CNN: faster regional convolution neural network target detection method
DCN variable convolution network
Empirical Attention: empirically based attention mechanisms
TridentNet: scale-aware trigeminal networks for object detection
Ours method of the invention
In fig. 12:
ResNet50+FPN: residual network and characteristic pyramid network of 50 layers
SE-ResNet50+FPN: 50-layer channel attention residual network and feature pyramid network
CBAM-resnet50+fpn: 50-layer convolution attention network and characteristic pyramid network
BAM-ResNet50+FPN: bottleneck attention network and characteristic pyramid network of 50 layers
Dar50+fpn: 50-layer variability and attention residual network and feature pyramid network
In fig. 13:
FPN: feature pyramid network
PAFPN: bottom-up feature pyramid network
NASFPN: feature pyramid network for neural architecture search
HRFPN high resolution characteristic soldier pyramid network
CCFPN crisscrossed feature pyramid network
In fig. 14:
faster R-CNN: faster regional convolution neural network target detection method
DCN variable convolution network
Empirical Attention: empirically based attention mechanisms
Ours method of the invention
The CPPE-5 dataset was released together with its target detection paper, providing a data foundation for researchers in the field of medical personal protective equipment detection. The paper lists the detection performance of various target detection network models on the dataset. The authors, Rishit Dagli et al., took Faster R-CNN, YOLOv3 and SSD as baselines; their results showed that Faster R-CNN performed best, with YOLOv3 second. They also performed experiments verifying the performance of multiple SOTA network models on the CPPE-5 dataset. The FCOS network model performs first-order fully convolutional target detection at the pixel level, detects features of different sizes on feature maps at different levels following the FPN idea, and finally uses Center-ness to suppress low-quality predicted bounding boxes. The Double Head network model analyzes the head layer in R-CNN-type networks by comparison, verifying the functional biases of the fully connected head and the fully convolutional head; experiments confirm that assigning each head to the function it suits performs better. The Deformable DETR model combines the adaptive feature extraction of DCN with the Transformer idea of the DETR network, adding DCN to the backbone for sparse sampling and training the Transformer's modeling capability directly on the feature maps, alleviating DETR's problems with small-target detection and long training times. The authors of Empirical Attention verified the influencing factors in spatial attention through various experiments and found that combining deformable convolution with key-content saliency achieves the best balance of self-attention accuracy and efficiency. For the scale-variation problem in detection, TridentNet applies three parameter-sharing dilated convolution kernels with different dilation coefficients to the feature extraction network's outputs and selects the best of the three results via NMS for classification and prediction.
Attention mechanism
Attention mechanisms in computer vision tasks have been shown by many experiments to effectively enhance target feature information and thereby improve network performance. They generally acquire the feature information richest in what the target object concerns while suppressing feature information that is not of interest. The SE attention mechanism [18] applies global average pooling to the feature map along the spatial dimensions, learns channel attention through fully connected layers, and normalizes the result to obtain a channel attention map, recalibrating the feature map to establish dependencies between channels. Unlike the SE mechanism, CBAM [19] extracts attention maps along two dimensions, channel and space: it connects the channel attention module and the spatial attention module in series, establishing relationships between channels and spatial dependencies between feature maps, and finally multiplies the fused attention map with the input feature map for adaptive feature refinement. BAM [20] likewise extracts attention maps along the channel and spatial dimensions, but connects the channel and spatial attention modules in parallel, fusing the two attention maps by multiplication followed by normalization. The scSE module [21], inspired by the SE mechanism, provides three variants: sSE, cSE and scSE. The cSE variant recalibrates the features along the channel dimension and multiplies the result with the input feature map; the sSE variant recalibrates from the spatial dimension; and the scSE variant adds the sSE and cSE attention maps to obtain a new attention map, realizing enhancement of the image features.
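For illustration, the SE recalibration described above (global average pooling, two fully connected layers, sigmoid gating) reduces to a few lines of PyTorch; the reduction ratio of 16 is the commonly used default, assumed here rather than taken from the patent:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze spatial dims by global average pooling,
    learn per-channel weights with two FC layers, then recalibrate the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))    # squeeze: (b, c)
        return x * weights.view(b, c, 1, 1)      # excite: channel-wise recalibration
```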
Feature pyramid structure
In computer vision, objects of different scales affect the detection performance of deep convolutional neural network models, and solving the multi-scale problem in image detection is a challenging task. The Feature Pyramid Network (FPN) fuses high-level and low-level semantic information, realizing multi-scale feature fusion and enlarging the receptive field. This improves the detection network's performance on targets of different scales, but the top-level feature information is lost during the upsampling process, with feature fusion as the main operation. Subsequent researchers therefore proposed various improvements. The authors of PAFPN observed that the FPN fusion process only enhances semantic information while ignoring positional information; they added a new bottom-up path to FPN so that both semantic and positional information are enhanced during feature map fusion. HRFPN builds the pyramid structure in a cascading fashion, enhancing semantic and positional information through repeated multi-scale fusion of feature maps. NAS-FPN performs feature fusion by selecting the optimal binary operations through Neural Architecture Search (NAS); the fusion scheme thus need not be designed manually, as the network autonomously selects how to fuse and enhance feature information, though this autonomous selection requires more training time to achieve superior results. SA-FPN improves human detection performance in images by designing an FPN structure with hierarchical segmentation blocks and adding an attention mechanism to the FPN structure.
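A minimal sketch of the basic FPN top-down fusion described above, assuming four backbone stages and 256 output channels (both conventional defaults rather than values from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down FPN fusion: 1x1 lateral convs unify channels, higher levels are
    upsampled (nearest-neighbor) and added, then a 3x3 conv smooths each output."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats ordered low -> high level
        outs = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(outs) - 2, -1, -1):     # propagate semantics top-down
            outs[i] = outs[i] + F.interpolate(outs[i + 1], size=outs[i].shape[-2:],
                                              mode="nearest")
        return [s(o) for s, o in zip(self.smooth, outs)]
```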
Method
Network structure
The AMS R-CNN network model is based on the Faster R-CNN network structure and is divided into four parts: the feature extraction network, feature fusion, the region proposal network and prediction processing; the structure is shown in figure 3.
The first part, the feature extraction network, is the DAR50 network constructed from DARM modules. It extracts feature information from medical protective equipment images, combining the characteristics of deformable convolution and the attention module to adaptively extract target features according to the detected target's shape and enhance the feature information.
The second part is feature fusion: the CCFPN feature fusion network applies criss-cross information attention to the feature maps from different stages of the feature extraction network to acquire context information, and uses the feature pyramid structure to fuse the features of different stages, realizing fusion of the medical protective items' features at different scales.
The third part is the region proposal network (RPN), which generates anchor boxes through a sliding window and processes them with two branches: classification and boundary regression. Candidate-box selection uses the Soft-NMS algorithm, attenuating scores with a Gaussian penalty function.
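A small sketch of Soft-NMS with the Gaussian penalty described above: rather than discarding boxes that overlap the current best box, their scores are decayed by $\exp(-\mathrm{IoU}^2/\sigma)$. The $\sigma$ and score-threshold values below are common defaults, assumed here:

```python
import torch

def box_iou_one_to_many(box, boxes):
    """IoU between one box and a set of boxes, all in (x1, y1, x2, y2)."""
    tl = torch.maximum(box[:2], boxes[:, :2])
    br = torch.minimum(box[2:], boxes[:, 2:])
    inter = (br - tl).clamp(min=0).prod(dim=1)
    area = (box[2:] - box[:2]).prod()
    areas = (boxes[:, 2:] - boxes[:, :2]).prod(dim=1)
    return inter / (area + areas - inter)

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thr=1e-3):
    """Soft-NMS: keep the best box, then decay (not delete) overlapping scores."""
    scores = scores.clone()
    idx = torch.arange(len(scores))
    keep = []
    while len(idx) > 0:
        best = scores[idx].argmax()
        cur = idx[best]
        keep.append(int(cur))
        idx = torch.cat([idx[:best], idx[best + 1:]])       # drop the selected box
        if len(idx) == 0:
            break
        ious = box_iou_one_to_many(boxes[cur], boxes[idx])
        scores[idx] *= torch.exp(-ious.pow(2) / sigma)      # Gaussian penalty decay
        idx = idx[scores[idx] > score_thr]                  # prune near-zero scores
    return keep
```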
The fourth part is prediction. The candidate boxes, the output features of CCFPN and the original image information are input together into ROI Align. ROI Align cancels the quantization operation and instead uses bilinear interpolation, solving the mismatch (mis-alignment) problem between the candidate region and the original region; the category and position of the detection target are then predicted.
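For reference, torchvision exposes this operation as roi_align; the feature size, stride and box below are made-up illustration values:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)                  # one fused feature map (e.g. stride 16)
rois = torch.tensor([[0., 64., 64., 192., 192.]])   # (batch_index, x1, y1, x2, y2), image coords
pooled = roi_align(
    feat, rois, output_size=(7, 7),
    spatial_scale=1 / 16,     # maps image coordinates onto this feature level
    sampling_ratio=2,         # bilinear sampling points per output bin
    aligned=True)             # half-pixel correction; no coordinate quantization
print(pooled.shape)           # torch.Size([1, 256, 7, 7])
```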
Experiment
The invention tests the AMS R-CNN network's performance on the CPPE-5 dataset and verifies its effectiveness for non-medical personal protective item detection on PASCAL VOC 2007. The experimental results show that the AMS R-CNN detection network achieves good detection performance on the CPPE-5 dataset and can effectively detect non-medical personal protective items on the PASCAL VOC 2007 dataset. The following subsections present the relevant datasets, implementation details and ablation experiments.
Data set
CPPE-5 is a public medical protective equipment detection dataset comprising 5 object categories (coveralls, masks, gloves, face shields and goggles); all images are annotated with bounding boxes and positive labels, with about 4.57 annotations per image. The goal of this dataset is to enable study of the subordinate categories of medical protective equipment, in contrast to other popular datasets (e.g., PASCAL VOC, ImageNet, Microsoft COCO) that focus on broad categories. The dataset contains images of personal medical equipment in complex scenes, with multiple objects in each image. The categories in the dataset are listed in Table 1.
The PASCAL VOC 2007 dataset is commonly used for classification and detection tasks; its training set (5011 images) and test set (4952 images) together comprise 9963 images covering 20 common object categories in daily life, with an average of 2.4 targets per image. It is a standardized, high-quality dataset and a benchmark for measuring a network model's image classification and recognition ability. It can therefore fully verify the effectiveness of the proposed AMS R-CNN network model in detecting non-medical personal protective items.
Implementation details
Training arrangement
The experiments were run on an Ubuntu 18.04 system with a Tesla T4 GPU (16 GB). Transfer learning is used to speed up training of the network model: a ResNet50 model pre-trained on the ImageNet dataset serves as the pre-training model. Optimization uses an SGD optimizer with momentum; the initial parameters follow the mmdetection default configuration, with an initial learning rate of 0.02, weight decay of 0.0001 and momentum coefficient of 0.9. Data augmentation uses random flipping along the horizontal axis, the vertical axis or both. During training the shorter edge of each image is randomly resized to 640, 672, 704, 736, 768 or 800 pixels while ensuring the longer edge does not exceed 1333 pixels, and corresponding padding is applied.
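In mmdetection's config style, the stated optimizer and multi-scale resize settings would look roughly like the following sketch (pipeline entries follow mmdetection 2.x conventions; the normalization constants are the framework's ImageNet defaults, assumed here):

```python
# Hypothetical mmdetection-style (2.x) snippet mirroring the stated settings.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize',                                   # shorter edge sampled per image,
         img_scale=[(1333, 640), (1333, 672), (1333, 704),
                    (1333, 736), (1333, 768), (1333, 800)],
         multiscale_mode='value', keep_ratio=True),       # longer edge capped at 1333
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),       # ImageNet statistics
    dict(type='Pad', size_divisor=32),                    # pad so feature strides divide evenly
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```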
Evaluation criteria
The invention uses the COCO mAP index as the evaluation index for measuring model performance. mAP denotes the average of the Average Precision (AP) over all classes; its calculation involves Precision (P), Recall (R) and AP. P is computed as in formula (8) and lies in the range 0-1: the number of correctly predicted positive samples divided by the total number of samples predicted positive. R is computed as in formula (9), representing the number of correctly predicted positive samples divided by the total number of actual positive samples, also in the range 0-1:

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

where $TP$ denotes actual positive samples predicted positive, $FP$ actual negative samples predicted positive, $FN$ actual positive samples predicted negative, and $TN$ actual negative samples predicted negative.

The AP value is the area under the P-R curve, with R on the horizontal axis and P on the vertical axis, computed as in formula (10):

$$AP = \int_{0}^{1} P(R)\, dR \tag{10}$$

COCO mAP uses interpolated average precision. As in formula (11), at each recall level the maximum P over all thresholds at or beyond that recall is multiplied by the change in R:

$$AP = \sum_{i} \left( R_{i+1} - R_{i} \right) \max_{\tilde{R} \ge R_{i+1}} P(\tilde{R}) \tag{11}$$

By contrast, the approximated average precision method, as in formula (13), multiplies the P value when the system has recognized $i$ images by the corresponding change in R:

$$AP = \sum_{i} P(i)\, \Delta R(i) \tag{13}$$

Interpolated average precision effectively reduces jitter in the P-R curve.

mAP is calculated as in formula (14), where $N$ is the number of object classes in the dataset and $AP_k$ is the average precision of the $k$-th class:

$$mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k \tag{14}$$

COCO mAP can be further extended: mAP-50 denotes the mAP value at IOU > 0.5 and mAP-75 at IOU > 0.75; mAP-S denotes the mAP for targets smaller than $32^2$ pixels in area, mAP-M for targets between $32^2$ and $96^2$ pixels, and mAP-L for targets larger than $96^2$ pixels.
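A compact sketch of the interpolated AP computation of formula (11), assuming precision/recall pairs have already been accumulated over ranked detections:

```python
import numpy as np

def interpolated_ap(precision, recall):
    """Interpolated AP as in formula (11): at each recall level take the maximum
    precision at that recall or beyond, then sum precision x recall increment."""
    order = np.argsort(recall)
    p = np.asarray(precision, dtype=float)[order]
    r = np.asarray(recall, dtype=float)[order]
    for i in range(len(p) - 2, -1, -1):          # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    r = np.concatenate(([0.0], r))
    return float(np.sum((r[1:] - r[:-1]) * p))   # area under the interpolated P-R curve
```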
Results on Datasets
For the CPPE-5 personal medical protective equipment dataset, the performance of the proposed network model and other network models was verified under identical initial-parameter experimental conditions. Table 2 shows the performance results of the proposed model and other detection models on the CPPE-5 dataset; the data show that the proposed model's evaluation indexes are superior to those of the other network models.
The proposed model's mAP on the CPPE-5 dataset is 0.6 points higher than the 54.6 of TridentNet, the best-performing model in the original paper, and its mAP-50 and mAP-75 are 0.6 and 1.9 points higher than the second-best values in the table, 89.0 and 58.2. The experimental results also show that the network performs excellently on the multi-scale problem: its small-scale detection performance leads the second-ranked network model in the table by 4.2 points, and its detection performance at medium and large scales is likewise superior to the other network models.
Fig. 10 shows detection results of the network model. The detection images show that in real scenes the model can effectively detect Coverall, Mask, Goggles, Face Shield and Gloves. The results verify that the network model can detect small target objects and distinguish the background from the target objects, realizing classification and localization of the detected objects.
To verify the AMS R-CNN network model's detection performance on other items, training and testing were performed on the PASCAL VOC 2007 dataset; Table 3 shows the test results of the model and other models under the same configuration conditions. In the PASCAL VOC 2007 experiments, AMS R-CNN outperforms the other network models, showing excellent performance on common-object detection. Fig. 11 illustrates the training process of the network models on the PASCAL VOC 2007 dataset.
On the CPPE-5 dataset, the original VGG network of Faster R-CNN was replaced with ResNet50+FPN as the baseline network, and multiple experiments were performed to verify the network's performance in detecting personal protective equipment.
After research and analysis, the DAR50 feature extraction network module is proposed. As seen from fig. 12 and Table 4, with the same initial network parameter settings and feature fusion network, DAR50's comprehensive performance is superior to the other feature extraction networks; it also outperforms every detection method in Table 2 except the proposed method, although its small-target detection capability is weaker than that of some detection networks.
Table 5 shows the performance results of different feature fusion structures; boldface marks the highest value for each evaluation index.
After analysis and research, the CCFPN feature fusion module is proposed to improve multi-scale detection performance, and comparison experiments were performed against four other FPN-structured networks. From the experimental data in Table 5 it can be seen that under the same conditions the proposed CCFPN achieves better detection performance than the other FPN structures. Fig. 13 illustrates each network's training process for target detection at different scales.
The preceding experiments verify that combining DAR50 and CCFPN realizes the AMS R-CNN network model.
The experimental results show that, relative to the baseline network model, the proposed modules effectively improve detection of medical personal protective items. Table 6 shows that the AMS R-CNN detection network, combining both DAR50 and CCFPN, improves detection performance on personal medical protective equipment. Fig. 14 illustrates the training process of the AMS R-CNN network model.
In real medical detection scenarios, when a deep convolutional neural network model detects medical personal protective items, detection performance suffers from background interference, caused by targets whose features approximate the surrounding environment, and from the multi-scale problem of targets in the image.
To solve these problems, the invention proposes a target detection model based on an attention mechanism and multi-scale fusion (AMS R-CNN).
For the problem of background interference in the detection process, a DAR50 feature extraction network constructed from DARM modules is proposed; by enhancing target feature information while adapting to the target's form, the module effectively reduces background interference and extracts target features more accurately.
For the multi-scale target problem, a CCFPN feature fusion module is proposed, which establishes information fusion between each pixel and the other pixels in the feature maps of different stages and fuses the features of different stages through a pyramid structure, improving detection performance for targets of different scales.
Experiments were performed on the challenging CPPE-5 medical protective equipment dataset, verifying the effectiveness of the method on this dataset. The effectiveness of AMS R-CNN in detecting other objects was also verified on the PASCAL VOC 2007 dataset.
Addressing the background interference and multi-scale problems of medical protective items during detection, the invention constructs the AMS R-CNN network model for personal medical protective equipment detection from the DAR50 module and the CCFPN module. The proposed network model was validated on the CPPE-5 dataset. The experimental results show that the DAR50 network structure effectively eliminates background interference and acquires the target's feature information, while the CCFPN module improves multi-scale fusion and boosts detection performance for medical protective items of different scales. Validation on the PASCAL VOC 2007 dataset shows the detection network is also effective at detecting other objects.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (3)
1. A personal protection item detection network based on attention mechanisms and multiscale fusion, comprising:
a feature extraction module DAR50, constructed from DARM modules, for enhancing target feature information by adapting to the target's form;
and a CCFPN feature fusion module for establishing information fusion between each pixel and the other pixels in the feature maps of different stages, and fusing the features of different stages through a pyramid structure to improve detection performance for targets of different scales.
2. The attention mechanism and multiscale fusion based personal protection item detection network of claim 1, wherein the feature extraction module DAR50 comprises:
with reference to the ResNet50 network structure, an ARM module and multiple DARM modules are used to construct the DAR50 feature extraction network, realizing effective acquisition of target features;
ARM is a residual module augmented with an attention operation, and DARM performs feature extraction with a deformable convolution and an scSE attention module in series;
the ARM module enhances the feature information of the original image produced by the preceding two-step stem; the DARM module replaces the conventional convolution with a deformable convolution that extracts features by adapting to the target's shape, and applies the scSE attention module to enhance the target feature information, realizing extraction of effective target features from the image;
Assume an input image $X \in \mathbb{R}^{W \times H \times C}$ and a residual-block output $Y \in \mathbb{R}^{W \times H \times C_1}$, where $W$ and $H$ denote the width and height of the input image and $C$ and $C_1$ denote the input and output channels; $F$ denotes the original residual module's feature mapping of the image, computed as in formula (1), where $\mathrm{conv1}(\cdot)$ and $\mathrm{conv3}(\cdot)$ denote convolution with 1x1 and 3x3 kernels:

$$Y = F(X) + X, \qquad F(X) = \mathrm{conv1}(\mathrm{conv3}(\mathrm{conv1}(X))) \tag{1}$$

While preserving the characteristics of the original residual module, the DARM module removes the conventional 3x3 convolution and adds a 3x3 deformable convolution operation and the scSE attention module. Let $Y' \in \mathbb{R}^{W \times H \times C_2}$ denote the feature output by the DAR residual module, with $C_2$ the output channels; the DARM module's calculation is as in formula (4):

$$Y' = f_{scSE}\!\left(\mathrm{conv1}\!\left(\mathrm{dconv3}(\mathrm{conv1}(X))\right)\right) + X \tag{4}$$

In the formula, $\mathrm{dconv3}(\cdot)$ denotes the 3x3 deformable convolution operation replacing the conventional convolution: when extracting features, the convolution kernel offsets the sampling points on the input feature map and concentrates them on the target region, obtaining the target's feature information by adapting to the object's shape. $f_{scSE}(\cdot)$ denotes the computation of the scSE attention module, which enhances the meaningful parts of the feature and suppresses the meaningless parts. The scSE module performs recalibration through the sSE and cSE branches running in parallel and obtains the new feature information by element-level addition; the operation is as in formula (5), where $f_{cSE}(\cdot)$ is the computation of the cSE branch and $f_{sSE}(\cdot)$ that of the sSE branch:

$$f_{scSE}(U) = f_{cSE}(U) + f_{sSE}(U) \tag{5}$$
3. the attention mechanism and multiscale fusion based personal protection item detection network of claim 1, wherein the CCFPN feature fusion module comprises:
based on the feature pyramid structure, a feature fusion network is constructed by adding an attention mechanism;
first, an LCC attention mechanism separately processes the feature maps of different stages in the feature extraction network;
second, a feature map of twice the original size is generated by nearest-neighbor upsampling, features of different scales are fused by element-wise addition, and the semantic feature maps of different stages are fused, making small targets easier to detect;
the feature maps of four different stages output by the backbone network are taken as input, feature fusion is performed through the pyramid structure, and a 3x3 convolution is applied to the fused feature map at each stage for feature re-extraction;
the final output undergoes max pooling with stride 2 to obtain a new feature map, which serves together with the other feature maps as the input of the next stage;
LCC is divided into an attention branch and a convolution branch; the attention branch iterates twice through a criss-cross attention module, which obtains global information along the vertical and horizontal directions of the image by feature weighting and captures the contextual dependencies between pixels, thereby establishing information fusion;
the convolution branch performs a dimension-reduction operation with a 1x1 convolution and supplements the attention branch, so the network obtains more comprehensive and richer feature information for targets of different scales;
For the feature map $X_i$ output by each stage of the feature extraction network, the calculation of LCC can be summarized as formula (6) and the calculation of the attention branch as formula (7):

$$F_{LCC}(X_i) = F_{att}(X_i) + F_{conv}(X_i) \tag{6}$$

$$F_{att}(X_i) = f_{CC}\!\left(f_{CC}(X_i)\right) \tag{7}$$

where $F_{LCC}$ denotes the feature map output by the LCC network, $F_{att}$ denotes the attention branch, $F_{conv}$ denotes the convolution branch performing a 1x1 convolution operation, and $f_{CC}(\cdot)$ denotes the computation of the criss-cross attention module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310001089.2A CN116051957A (en) | 2023-01-03 | 2023-01-03 | Personal protection item detection network based on attention mechanism and multi-scale fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116051957A true CN116051957A (en) | 2023-05-02 |
Family
ID=86123237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310001089.2A Withdrawn CN116051957A (en) | 2023-01-03 | 2023-01-03 | Personal protection item detection network based on attention mechanism and multi-scale fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116051957A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116563147A (en) * | 2023-05-04 | 2023-08-08 | 北京联合大学 | Underwater image enhancement system and method |
CN116563147B (en) * | 2023-05-04 | 2024-03-26 | 北京联合大学 | Underwater image enhancement system and method |
CN116778227A (en) * | 2023-05-12 | 2023-09-19 | 昆明理工大学 | Target detection method, system and equipment based on infrared image and visible light image |
CN116778227B (en) * | 2023-05-12 | 2024-05-10 | 昆明理工大学 | Target detection method, system and equipment based on infrared image and visible light image |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20230502