CN116051957A - Personal protection item detection network based on attention mechanism and multi-scale fusion
- Publication number
- CN116051957A (application CN202310001089.2A)
- Authority
- CN
- China
- Prior art keywords
- feature
- module
- attention
- network
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06V10/82 — Image or video recognition using pattern recognition or machine learning, using neural networks
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06V2201/07 — Target detection
Abstract
The invention discloses a personal protection item detection network based on an attention mechanism and multi-scale fusion. The network comprises a feature extraction module, DAR50, constructed from DARM modules, which enhances target feature information by adapting to the target's form, effectively reducing background interference and extracting target features more accurately; and a CCFPN feature fusion module, which establishes information fusion between each pixel and the other pixels in the feature maps of different stages and then fuses the features of different stages through a pyramid structure, improving detection performance for targets of different scales.
Description
Technical Field
The present invention relates to personal protection item detection, and in particular to a personal protection item detection network based on attention mechanisms and multiscale fusion.
Background
In infectious-disease prevention and control work, it is essential that epidemic-prevention personnel wear medical personal protective equipment. Wearing medical personal protective equipment such as medical surgical masks, face shields and gloves effectively protects the wearer from potential infectious agents or other toxins. It is therefore necessary to monitor in real time whether personnel wear personal protective equipment in places where protection is required.
Whereas traditional manual supervision cannot provide continuous, effective monitoring and incurs high labor costs, target detection technology based on neural networks can supervise continuously and effectively at lower cost. However, in complex scenarios such as that shown in fig. 1, existing target detection network models perform poorly at detecting personal protective equipment, so machine-based supervision has not been practically deployed and most regions still rely on manual supervision. A better-performing target detection model is therefore needed for detecting medical personal protective equipment in complex scenarios, so that a machine employing target detection technology can take over the supervision of whether personnel wear the necessary medical personal protective equipment, reducing cost and enabling real-time monitoring.
The target detection task can be divided into two sub-tasks: target classification, which determines what kind of object appears in the image, and target localization, which determines the object's position in the image. Deep convolutional network based target detection methods developed in recent years can be divided into single-stage and two-stage algorithms. Single-stage networks, typically represented by the YOLO series and SSD, have a speed advantage because they generate object categories and locations directly without a region proposal stage. Two-stage networks, typically represented by Faster R-CNN [5] and Sparse R-CNN, have an accuracy advantage: the first stage generates candidate regions and the second stage classifies and refines them. Detecting whether personnel wear personal protective equipment demands high accuracy, so most detection methods in this field still adopt a two-stage target detection algorithm.
In the complex scenarios of medical environments, target detection network models are affected by background interference and multi-scale problems during detection, so most network models perform poorly. The invention obtains the improved AMS R-CNN network by studying and analyzing the background interference and multi-scale problems in detecting medical personal protective equipment, taking the two-stage target detection network Faster R-CNN as reference. In the AMS R-CNN network, the feature extraction network is DAR50, formed from multiple Deformable and Attention Residual Modules (DARM); the DARM module extracts target features by deforming the convolution kernel to adapt to the detected target's shape, and an attention module then enhances the feature information to obtain more effective features of the detection target. In the feature fusion stage, the CCFPN module applies a criss-cross information attention module to the feature maps of different stages extracted by the feature extraction network, establishing information fusion between each pixel and the other pixels, and then fuses the features of different stages through a pyramid structure to realize multi-scale fusion. FIG. 2 compares the proposed method's detection results with those of TridentNet, the optimal method in the CPPE-5 paper.
Disclosure of Invention
The invention mainly aims to provide a personal protection item detection network based on an attention mechanism and multi-scale fusion. For the problem of background interference, a DARM module is designed that extracts features by adapting to the target's shape and then enhances the feature information; a DAR50 feature extraction network built from DARM modules extracts feature information from images of personal medical protective equipment, obtaining more effective target features. For the difficulty of multi-scale target detection, a CCFPN feature fusion module based on the feature pyramid structure is designed: it establishes information fusion between pixels using a criss-cross information attention mechanism and fuses the feature maps of different stages through a pyramid structure, improving the network model's detection performance for targets of different scales.
The technical scheme adopted by the invention is as follows: a personal protective item detection network based on attention mechanisms and multiscale fusion, comprising:
a feature extraction module DAR50, constructed from DARM modules, for enhancing target feature information by adapting to the target's form;
and a CCFPN feature fusion module for establishing information fusion between each pixel and the other pixels in the feature maps of different stages, and fusing the features of different stages through a pyramid structure to improve detection performance for targets of different scales.
Further, the feature extraction module DAR50 includes:
with reference to the ResNet50 network structure, an ARM module and multiple DARM modules are used to construct the DAR50 feature extraction network, realizing effective acquisition of target features;
ARM is a residual module augmented with an attention operation, and DARM performs feature extraction with a deformable convolution and an scSE attention module in series;
the ARM module enhances the feature information of the original image produced by the preceding two-step stem; the DARM module replaces the conventional convolution with a deformable convolution that extracts features by adapting to the target's shape, and applies the scSE attention module to enhance the target feature information, realizing extraction of effective target features from the image;
Assume an input image $X \in \mathbb{R}^{W \times H \times C}$ and a residual-block output $Y \in \mathbb{R}^{W \times H \times C_1}$, where $W$ and $H$ denote the width and height of the input image and $C$ and $C_1$ denote the input and output channels; $F$ denotes the original residual module's feature mapping of the image, computed as in formula (1), where $\mathrm{conv1}(\cdot)$ and $\mathrm{conv3}(\cdot)$ denote convolution with 1x1 and 3x3 kernels:

$$Y = F(X) + X, \qquad F(X) = \mathrm{conv1}(\mathrm{conv3}(\mathrm{conv1}(X))) \tag{1}$$

While preserving the characteristics of the original residual module, the DARM module removes the conventional 3x3 convolution and adds a 3x3 deformable convolution operation and the scSE attention module. Let $Y' \in \mathbb{R}^{W \times H \times C_2}$ denote the feature output by the DAR residual module, with $C_2$ the output channels; the DARM module's calculation is as in formula (4):

$$Y' = f_{scSE}\!\left(\mathrm{conv1}\!\left(\mathrm{dconv3}(\mathrm{conv1}(X))\right)\right) + X \tag{4}$$

In the formula, $\mathrm{dconv3}(\cdot)$ denotes the 3x3 deformable convolution operation replacing the conventional convolution: when extracting features, the convolution kernel offsets the sampling points on the input feature map and concentrates them on the target region, obtaining the target's feature information by adapting to the object's shape. $f_{scSE}(\cdot)$ denotes the computation of the scSE attention module, which enhances the meaningful parts of the feature and suppresses the meaningless parts. The scSE module performs recalibration through the sSE and cSE branches running in parallel and obtains the new feature information by element-level addition; the operation is as in formula (5), where $f_{cSE}(\cdot)$ is the computation of the cSE branch and $f_{sSE}(\cdot)$ that of the sSE branch:

$$f_{scSE}(U) = f_{cSE}(U) + f_{sSE}(U) \tag{5}$$
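As a minimal PyTorch sketch of formulas (4) and (5) — the module names, channel widths, offset-prediction convolution and normalization layout below are illustrative assumptions, not taken from the patent — the DARM residual block can be written with torchvision's DeformConv2d:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SCSE(nn.Module):
    """scSE attention: parallel cSE (channel) and sSE (spatial) branches fused by addition (formula (5))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cse = nn.Sequential(                       # channel squeeze-and-excitation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())  # spatial excitation

    def forward(self, u):
        return u * self.cse(u) + u * self.sse(u)        # f_cSE(U) + f_sSE(U)

class DARM(nn.Module):
    """Deformable and Attention Residual Module: 1x1 -> 3x3 deformable conv -> 1x1 -> scSE, plus identity (formula (4))."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv1a = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.offset = nn.Conv2d(mid_ch, 2 * 3 * 3, 3, padding=1)   # 2D offset per kernel sampling point
        self.dconv3 = DeformConv2d(mid_ch, mid_ch, 3, padding=1, bias=False)
        self.conv1b = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.scse = SCSE(out_ch)
        self.bn = nn.ModuleList(nn.BatchNorm2d(c) for c in (mid_ch, mid_ch, out_ch))
        self.relu = nn.ReLU(inplace=True)
        self.proj = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        y = self.relu(self.bn[0](self.conv1a(x)))
        y = self.relu(self.bn[1](self.dconv3(y, self.offset(y))))  # sampling points shift toward the target
        y = self.bn[2](self.conv1b(y))
        return self.relu(self.scse(y) + self.proj(x))              # residual connection
```

The offset convolution predicts the displacement of each of the nine 3x3 kernel sampling points, which is how the deformable convolution concentrates sampling on the target region.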
still further, the CCFPN feature fusion module includes:
based on the feature pyramid structure, a feature fusion network is constructed by adding an attention mechanism;
first, an LCC attention mechanism separately processes the feature maps of different stages in the feature extraction network;
second, a feature map of twice the original size is generated by nearest-neighbor upsampling, features of different scales are fused by element-wise addition, and the semantic feature maps of different stages are fused, making small targets easier to detect;
the feature maps of four different stages output by the backbone network are taken as input, feature fusion is performed through the pyramid structure, and a 3x3 convolution is applied to the fused feature map at each stage for feature re-extraction;
the final output undergoes max pooling with stride 2 to obtain a new feature map, which serves together with the other feature maps as the input of the next stage;
LCC is divided into an attention branch and a convolution branch; the attention branch iterates twice through a criss-cross attention module, which obtains global information along the vertical and horizontal directions of the image by feature weighting and captures the contextual dependencies between pixels, thereby establishing information fusion;
the convolution branch performs a dimension-reduction operation with a 1x1 convolution and supplements the attention branch, so the network obtains more comprehensive and richer feature information for targets of different scales;
For the feature map $X_i$ output by each stage of the feature extraction network, the calculation of LCC can be summarized as formula (6) and the calculation of the attention branch as formula (7):

$$F_{LCC}(X_i) = F_{att}(X_i) + F_{conv}(X_i) \tag{6}$$

$$F_{att}(X_i) = f_{CC}\!\left(f_{CC}(X_i)\right) \tag{7}$$

where $F_{LCC}$ denotes the feature map output by the LCC network, $F_{att}$ denotes the attention branch, $F_{conv}$ denotes the convolution branch performing a 1x1 convolution operation, and $f_{CC}(\cdot)$ denotes the computation of the criss-cross attention module.
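The following is a hedged sketch of the LCC branch structure of formulas (6) and (7). The published criss-cross attention (CCNet) applies one joint softmax over the row-column union for each pixel; the version below simplifies this to independent row and column attention, and the projection convolutions and channel sizes are assumptions rather than patent specifics:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Simplified criss-cross attention: every pixel attends to the pixels
    in its own row and its own column (horizontal + vertical global context)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))    # learned residual weight

    @staticmethod
    def _attend(q, k, v):
        # q: (n, L, c'), k: (n, c', L), v: (n, L, c) -> (n, L, c)
        attn = F.softmax(torch.bmm(q, k), dim=-1)
        return torch.bmm(attn, v)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        cq = q.shape[1]
        # horizontal: each row treated as an independent sequence of length w
        out_h = self._attend(
            q.permute(0, 2, 3, 1).reshape(b * h, w, cq),
            k.permute(0, 2, 1, 3).reshape(b * h, cq, w),
            v.permute(0, 2, 3, 1).reshape(b * h, w, c),
        ).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # vertical: each column treated as an independent sequence of length h
        out_v = self._attend(
            q.permute(0, 3, 2, 1).reshape(b * w, h, cq),
            k.permute(0, 3, 1, 2).reshape(b * w, cq, h),
            v.permute(0, 3, 2, 1).reshape(b * w, h, c),
        ).reshape(b, w, h, c).permute(0, 3, 2, 1)
        return self.gamma * (out_h + out_v) + x      # residual connection

class LCC(nn.Module):
    """Formulas (6)-(7): attention branch = criss-cross attention iterated twice;
    convolution branch = 1x1 dimension reduction; the two outputs are added."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.cca = CrissCrossAttention(in_ch)
        self.att_proj = nn.Conv2d(in_ch, out_ch, 1)   # unify channels for the sum
        self.conv_branch = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.att_proj(self.cca(self.cca(x))) + self.conv_branch(x)
```

Iterating the criss-cross module twice lets every pixel indirectly reach every other pixel (row then column), which is what establishes the full-image information fusion described above.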
The invention has the advantages that:
the invention provides an attention mechanism and multiscale fusion-based (AMSR-CNN) target detection model; aiming at the problem of background information interference in the detection process, a DAR50 feature extraction network constructed by using a DARM module is provided, and the module can effectively reduce the background information interference and extract the target feature information more accurately by enhancing the target feature information while adapting to the target form;
aiming at the problem of target multiscale, a CCFPN feature fusion module is provided, wherein information fusion of each pixel point and other pixel points is established in feature diagrams of different stages, and features of different stages are fused by utilizing a pyramid structure, so that the detection performance of targets of different scales is improved.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The present invention will be described in further detail with reference to the drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a medical protective article in a real scene;
In fig. 2, (a) is the original image, (b) is the detection result of TridentNet, and (c) is the detection result of AMS R-CNN. Purple is gloves, red is the coverall (protective clothing), green is the mask, blue is the goggles, and yellow is the face shield;
FIG. 3 illustrates feature extraction of the input data: feature maps extracted at different stages of DAR50 are input to CCFPN for feature fusion. The fused feature maps are input to the RPN layer for candidate-box selection, the candidate boxes and fused features are input to ROI Align for post-processing, and classification and localization prediction are finally realized;
FIG. 4 is a graph of network model test results with ResNet as the feature extraction network;
FIG. 5 is a diagram of the DAR50 network structure of the present invention: 7x7 conv is a convolution with a 7x7 kernel, maxpool is a pooling operation with stride 2, and this two-step stem captures as much feature information from the original image as possible. ARM is the residual module with an added attention operation. DARM performs feature extraction with a deformable convolution and an scSE attention module in series;
in FIG. 6, (a) is an ARM residual module, (b) is a DARM residual module;
FIG. 7 is a diagram of the CCFPN network structure of the present invention: the feature maps of four different stages output by the backbone network are taken as input, feature fusion is performed through the pyramid structure, and a 3x3 convolution is applied to the fused feature map at each stage for feature re-extraction, ensuring the stability of the extracted features; the final output undergoes max pooling with stride 2 to obtain a new feature map, which serves together with the other feature maps as the input of the next stage;
FIG. 8 is a diagram of the LCC network structure of the present invention: LCC feeds the input feature map into the attention branch and the convolution branch respectively, where the 1x1 convolution operations all perform dimension reduction, unifying the feature maps to the same dimension for feature fusion;
FIG. 9 shows the training process of the bbox_mAP evaluation index for the compared network models on the CPPE-5 dataset;
In FIG. 10, (a) is an original image in the CPPE-5 dataset; (b) is the detection result of the Faster R-CNN (ResNet50+FPN) network; (c) is the TridentNet detection result; (d) is the detection result of the proposed network model;
FIG. 11 shows the training process of the compared methods and other network models on the PASCAL VOC 2007 dataset;
FIG. 12 shows the training process of DAR50 compared with other feature extraction networks on the CPPE-5 dataset;
FIG. 13 shows the training process of different feature fusion modules on different evaluation indexes;
FIG. 14 shows the training process of the AMS R-CNN of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Description of the invention
In fig. 9:
Sparse R-CNN: sparse region convolutional neural network target detection method
Faster R-CNN: faster region convolutional neural network target detection method
FCOS: first-order fully convolutional target detection
Deformable DETR: deformable detection Transformer
Double Head: double-head detection method
DCN: deformable convolution network
Empirical Attention: empirically based attention mechanism
TridentNet: scale-aware trident network for object detection
Ours: method of the invention
In fig. 11:
faster R-CNN: faster regional convolution neural network target detection method
DCN variable convolution network
Empirical Attention: empirically based attention mechanisms
TridentNet: scale-aware trigeminal networks for object detection
Ours method of the invention
In fig. 12:
ResNet50+FPN: residual network and characteristic pyramid network of 50 layers
SE-ResNet50+FPN: 50-layer channel attention residual network and feature pyramid network
CBAM-resnet50+fpn: 50-layer convolution attention network and characteristic pyramid network
BAM-ResNet50+FPN: bottleneck attention network and characteristic pyramid network of 50 layers
Dar50+fpn: 50-layer variability and attention residual network and feature pyramid network
In fig. 13:
FPN: feature pyramid network
PAFPN: bottom-up feature pyramid network
NASFPN: feature pyramid network for neural architecture search
HRFPN high resolution characteristic soldier pyramid network
CCFPN crisscrossed feature pyramid network
In fig. 14:
faster R-CNN: faster regional convolution neural network target detection method
DCN variable convolution network
Empirical Attention: empirically based attention mechanisms
Ours method of the invention
The CPPE-5 dataset was released together with its target detection paper, providing a data foundation for researchers in the field of medical personal protective equipment detection. The paper lists the detection performance of various target detection network models on the dataset. The authors, Rishit Dagli et al., took Faster R-CNN, YOLOv3 and SSD as baselines; their results showed that Faster R-CNN performed best, with YOLOv3 second. They also performed experiments verifying the performance of multiple SOTA network models on the CPPE-5 dataset. The FCOS network model performs first-order fully convolutional target detection at the pixel level, detects features of different sizes on feature maps at different levels following the FPN idea, and finally uses Center-ness to suppress low-quality predicted bounding boxes. The Double Head network model analyzes the head layer in R-CNN-type networks by comparison, verifying the functional biases of the fully connected head and the fully convolutional head; experiments confirm that assigning each head to the function it suits performs better. The Deformable DETR model combines the adaptive feature extraction of DCN with the Transformer idea of the DETR network, adding DCN to the backbone for sparse sampling and training the Transformer's modeling capability directly on the feature maps, alleviating DETR's problems with small-target detection and long training times. The authors of Empirical Attention verified the influencing factors in spatial attention through various experiments and found that combining deformable convolution with key-content saliency achieves the best balance of self-attention accuracy and efficiency. For the scale-variation problem in detection, TridentNet applies three parameter-sharing dilated convolution kernels with different dilation coefficients to the feature extraction network's outputs and selects the best of the three results via NMS for classification and prediction.
Attention mechanism
Attention mechanisms in computer vision tasks have been shown by many experiments to effectively enhance target feature information and thereby improve network performance. They generally acquire the feature information richest in what the target object concerns while suppressing feature information that is not of interest. The SE attention mechanism [18] applies global average pooling to the feature map along the spatial dimensions, learns channel attention through fully connected layers, and normalizes the result to obtain a channel attention map, recalibrating the feature map to establish dependencies between channels. Unlike the SE mechanism, CBAM [19] extracts attention maps along two dimensions, channel and space: it connects the channel attention module and the spatial attention module in series, establishing relationships between channels and spatial dependencies between feature maps, and finally multiplies the fused attention map with the input feature map for adaptive feature refinement. BAM [20] likewise extracts attention maps along the channel and spatial dimensions, but connects the channel and spatial attention modules in parallel, fusing the two attention maps by multiplication followed by normalization. The scSE module [21], inspired by the SE mechanism, provides three variants: sSE, cSE and scSE. The cSE variant recalibrates the features along the channel dimension and multiplies the result with the input feature map; the sSE variant recalibrates from the spatial dimension; and the scSE variant adds the sSE and cSE attention maps to obtain a new attention map, realizing enhancement of the image features.
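For illustration, the SE recalibration described above (global average pooling, two fully connected layers, sigmoid gating) reduces to a few lines of PyTorch; the reduction ratio of 16 is the commonly used default, assumed here rather than taken from the patent:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze spatial dims by global average pooling,
    learn per-channel weights with two FC layers, then recalibrate the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))    # squeeze: (b, c)
        return x * weights.view(b, c, 1, 1)      # excite: channel-wise recalibration
```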
Feature pyramid structure
In computer vision, objects of different scales affect the detection performance of deep convolutional neural network models, and solving the multi-scale problem in image detection is a challenging task. The Feature Pyramid Network (FPN) fuses high-level and low-level semantic information, realizing multi-scale feature fusion and enlarging the receptive field. This improves the detection network's performance on targets of different scales, but the top-level feature information is lost during the upsampling process, with feature fusion as the main operation. Subsequent researchers therefore proposed various improvements. The authors of PAFPN observed that the FPN fusion process only enhances semantic information while ignoring positional information; they added a new bottom-up path to FPN so that both semantic and positional information are enhanced during feature map fusion. HRFPN builds the pyramid structure in a cascading fashion, enhancing semantic and positional information through repeated multi-scale fusion of feature maps. NAS-FPN performs feature fusion by selecting the optimal binary operations through Neural Architecture Search (NAS); the fusion scheme thus need not be designed manually, as the network autonomously selects how to fuse and enhance feature information, though this autonomous selection requires more training time to achieve superior results. SA-FPN improves human detection performance in images by designing an FPN structure with hierarchical segmentation blocks and adding an attention mechanism to the FPN structure.
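A minimal sketch of the basic FPN top-down fusion described above, assuming four backbone stages and 256 output channels (both conventional defaults rather than values from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down FPN fusion: 1x1 lateral convs unify channels, higher levels are
    upsampled (nearest-neighbor) and added, then a 3x3 conv smooths each output."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats ordered low -> high level
        outs = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(outs) - 2, -1, -1):     # propagate semantics top-down
            outs[i] = outs[i] + F.interpolate(outs[i + 1], size=outs[i].shape[-2:],
                                              mode="nearest")
        return [s(o) for s, o in zip(self.smooth, outs)]
```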
Method
Network structure
The AMS R-CNN network model is based on the Faster R-CNN network structure and is divided into four parts: the feature extraction network, feature fusion, the region proposal network and prediction processing; the structure is shown in figure 3.
The first part, the feature extraction network, is the DAR50 network constructed from DARM modules. It extracts feature information from medical protective equipment images, combining the characteristics of deformable convolution and the attention module to adaptively extract target features according to the detected target's shape and enhance the feature information.
The second part is feature fusion: the CCFPN feature fusion network applies criss-cross information attention to the feature maps from different stages of the feature extraction network to acquire context information, and uses the feature pyramid structure to fuse the features of different stages, realizing fusion of the medical protective items' features at different scales.
The third part is the region proposal network (RPN), which generates anchor boxes through a sliding window and processes them with two branches: classification and boundary regression. Candidate-box selection uses the Soft-NMS algorithm, attenuating scores with a Gaussian penalty function.
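A small sketch of Soft-NMS with the Gaussian penalty described above: rather than discarding boxes that overlap the current best box, their scores are decayed by $\exp(-\mathrm{IoU}^2/\sigma)$. The $\sigma$ and score-threshold values below are common defaults, assumed here:

```python
import torch

def box_iou_one_to_many(box, boxes):
    """IoU between one box and a set of boxes, all in (x1, y1, x2, y2)."""
    tl = torch.maximum(box[:2], boxes[:, :2])
    br = torch.minimum(box[2:], boxes[:, 2:])
    inter = (br - tl).clamp(min=0).prod(dim=1)
    area = (box[2:] - box[:2]).prod()
    areas = (boxes[:, 2:] - boxes[:, :2]).prod(dim=1)
    return inter / (area + areas - inter)

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thr=1e-3):
    """Soft-NMS: keep the best box, then decay (not delete) overlapping scores."""
    scores = scores.clone()
    idx = torch.arange(len(scores))
    keep = []
    while len(idx) > 0:
        best = scores[idx].argmax()
        cur = idx[best]
        keep.append(int(cur))
        idx = torch.cat([idx[:best], idx[best + 1:]])       # drop the selected box
        if len(idx) == 0:
            break
        ious = box_iou_one_to_many(boxes[cur], boxes[idx])
        scores[idx] *= torch.exp(-ious.pow(2) / sigma)      # Gaussian penalty decay
        idx = idx[scores[idx] > score_thr]                  # prune near-zero scores
    return keep
```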
The fourth part is prediction. The candidate boxes, the output features of CCFPN and the original image information are input together into ROI Align. ROI Align cancels the quantization operation and instead uses bilinear interpolation, solving the mismatch (mis-alignment) problem between the candidate region and the original region; the category and position of the detection target are then predicted.
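For reference, torchvision exposes this operation as roi_align; the feature size, stride and box below are made-up illustration values:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)                  # one fused feature map (e.g. stride 16)
rois = torch.tensor([[0., 64., 64., 192., 192.]])   # (batch_index, x1, y1, x2, y2), image coords
pooled = roi_align(
    feat, rois, output_size=(7, 7),
    spatial_scale=1 / 16,     # maps image coordinates onto this feature level
    sampling_ratio=2,         # bilinear sampling points per output bin
    aligned=True)             # half-pixel correction; no coordinate quantization
print(pooled.shape)           # torch.Size([1, 256, 7, 7])
```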
Experiment
The invention tests the AMS R-CNN network's performance on the CPPE-5 dataset and verifies its effectiveness for non-medical personal protective item detection on PASCAL VOC 2007. The experimental results show that the AMS R-CNN detection network achieves good detection performance on the CPPE-5 dataset and can effectively detect non-medical personal protective items on the PASCAL VOC 2007 dataset. The following subsections present the relevant datasets, implementation details and ablation experiments.
Data set
CPPE-5 is a public medical protective equipment detection dataset comprising 5 object categories (coveralls, masks, gloves, face shields and goggles); all images are annotated with bounding boxes and positive labels, with about 4.57 annotations per image. The goal of this dataset is to enable study of the subordinate categories of medical protective equipment, in contrast to other popular datasets (e.g., PASCAL VOC, ImageNet, Microsoft COCO) that focus on broad categories. The dataset contains images of personal medical equipment in complex scenes, with multiple objects in each image. The categories in the dataset are listed in Table 1.
The PASCAL VOC 2007 dataset is commonly used for classification and detection tasks; its training set (5011 images) and test set (4952 images) together comprise 9963 images covering 20 common object categories in daily life, with an average of 2.4 targets per image. It is a standardized, high-quality dataset and a benchmark for measuring a network model's image classification and recognition ability. It can therefore fully verify the effectiveness of the proposed AMS R-CNN network model in detecting non-medical personal protective items.
Implementation details
Training arrangement
The experiments were run on an Ubuntu 18.04 system with a Tesla T4 GPU (16 GB). Transfer learning is used to speed up training of the network model: a ResNet50 model pre-trained on the ImageNet dataset serves as the pre-training model. Optimization uses an SGD optimizer with momentum; the initial parameters follow the mmdetection default configuration, with an initial learning rate of 0.02, weight decay of 0.0001 and momentum coefficient of 0.9. Data augmentation uses random flipping along the horizontal axis, the vertical axis or both. During training the shorter edge of each image is randomly resized to 640, 672, 704, 736, 768 or 800 pixels while ensuring the longer edge does not exceed 1333 pixels, and corresponding padding is applied.
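In mmdetection's config style, the stated optimizer and multi-scale resize settings would look roughly like the following sketch (pipeline entries follow mmdetection 2.x conventions; the normalization constants are the framework's ImageNet defaults, assumed here):

```python
# Hypothetical mmdetection-style (2.x) snippet mirroring the stated settings.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize',                                   # shorter edge sampled per image,
         img_scale=[(1333, 640), (1333, 672), (1333, 704),
                    (1333, 736), (1333, 768), (1333, 800)],
         multiscale_mode='value', keep_ratio=True),       # longer edge capped at 1333
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),       # ImageNet statistics
    dict(type='Pad', size_divisor=32),                    # pad so feature strides divide evenly
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```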
Evaluation criteria
The invention uses the COCO mAP index as the evaluation index for measuring model performance. mAP denotes the average of the Average Precision (AP) over all classes; its calculation involves Precision (P), Recall (R) and AP. P is computed as in formula (8) and lies in the range 0-1: the number of correctly predicted positive samples divided by the total number of samples predicted positive. R is computed as in formula (9), representing the number of correctly predicted positive samples divided by the total number of actual positive samples, also in the range 0-1:

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

where $TP$ denotes actual positive samples predicted positive, $FP$ actual negative samples predicted positive, $FN$ actual positive samples predicted negative, and $TN$ actual negative samples predicted negative.

The AP value is the area under the P-R curve, with R on the horizontal axis and P on the vertical axis, computed as in formula (10):

$$AP = \int_{0}^{1} P(R)\, dR \tag{10}$$

COCO mAP uses interpolated average precision. As in formula (11), at each recall level the maximum P over all thresholds at or beyond that recall is multiplied by the change in R:

$$AP = \sum_{i} \left( R_{i+1} - R_{i} \right) \max_{\tilde{R} \ge R_{i+1}} P(\tilde{R}) \tag{11}$$

By contrast, the approximated average precision method, as in formula (13), multiplies the P value when the system has recognized $i$ images by the corresponding change in R:

$$AP = \sum_{i} P(i)\, \Delta R(i) \tag{13}$$

Interpolated average precision effectively reduces jitter in the P-R curve.

mAP is calculated as in formula (14), where $N$ is the number of object classes in the dataset and $AP_k$ is the average precision of the $k$-th class:

$$mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k \tag{14}$$

COCO mAP can be further extended: mAP-50 denotes the mAP value at IOU > 0.5 and mAP-75 at IOU > 0.75; mAP-S denotes the mAP for targets smaller than $32^2$ pixels in area, mAP-M for targets between $32^2$ and $96^2$ pixels, and mAP-L for targets larger than $96^2$ pixels.
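A compact sketch of the interpolated AP computation of formula (11), assuming precision/recall pairs have already been accumulated over ranked detections:

```python
import numpy as np

def interpolated_ap(precision, recall):
    """Interpolated AP as in formula (11): at each recall level take the maximum
    precision at that recall or beyond, then sum precision x recall increment."""
    order = np.argsort(recall)
    p = np.asarray(precision, dtype=float)[order]
    r = np.asarray(recall, dtype=float)[order]
    for i in range(len(p) - 2, -1, -1):          # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    r = np.concatenate(([0.0], r))
    return float(np.sum((r[1:] - r[:-1]) * p))   # area under the interpolated P-R curve
```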
Results on Datasets
For the CPPE-5 personal medical protective equipment dataset, the performance of the proposed network model and other network models was verified under identical initial-parameter experimental conditions. Table 2 shows the performance results of the proposed model and other detection models on the CPPE-5 dataset; the data show that the proposed model's evaluation indexes are superior to those of the other network models.
The proposed model's mAP on the CPPE-5 dataset is 0.6 points higher than the 54.6 of TridentNet, the best-performing model in the original paper, and its mAP-50 and mAP-75 are 0.6 and 1.9 points higher than the second-best values in the table, 89.0 and 58.2. The experimental results also show that the network performs excellently on the multi-scale problem: its small-scale detection performance leads the second-ranked network model in the table by 4.2 points, and its detection performance at medium and large scales is likewise superior to the other network models.
Fig. 10 shows detection results of the network model. The detection images show that in real scenes the model can effectively detect Coverall, Mask, Goggles, Face Shield and Gloves. The results verify that the network model can detect small target objects and distinguish the background from the target objects, realizing classification and localization of the detected objects.
To verify the AMS R-CNN network model's detection performance on other items, training and testing were performed on the PASCAL VOC 2007 dataset; Table 3 shows the test results of the model and other models under the same configuration conditions. In the PASCAL VOC 2007 experiments, AMS R-CNN outperforms the other network models, showing excellent performance on common-object detection. Fig. 11 illustrates the training process of the network models on the PASCAL VOC 2007 dataset.
On the CPPE-5 dataset, the original VGG network of Faster R-CNN was replaced with ResNet50+FPN as the baseline network, and multiple experiments were performed to verify the network's performance in detecting personal protective equipment.
After research and analysis, the DAR50 feature extraction network module is proposed. As seen from fig. 12 and Table 4, with the same initial network parameter settings and feature fusion network, DAR50's comprehensive performance is superior to the other feature extraction networks; it also outperforms every detection method in Table 2 except the proposed method, although its small-target detection capability is weaker than that of some detection networks.
Table 5 shows the performance results of different feature fusion structures; boldface marks the highest value for each evaluation index.
After analysis and research, the CCFPN feature fusion module is proposed to improve multi-scale detection performance, and comparison experiments were performed against four other FPN-structured networks. From the experimental data in Table 5 it can be seen that under the same conditions the proposed CCFPN achieves better detection performance than the other FPN structures. Fig. 13 illustrates each network's training process for target detection at different scales.
The preceding experiments verify that combining DAR50 and CCFPN realizes the AMS R-CNN network model.
The experimental results show that, relative to the baseline network model, the proposed modules effectively improve detection of medical personal protective items. Table 6 shows that the AMS R-CNN detection network, combining both DAR50 and CCFPN, improves detection performance on personal medical protective equipment. Fig. 14 illustrates the training process of the AMS R-CNN network model.
In real medical detection scenarios, when a deep convolutional neural network model detects medical personal protective items, detection performance suffers from background interference, caused by targets whose features approximate the surrounding environment, and from the multi-scale problem of targets in the image.
To solve these problems, the invention proposes a target detection model based on an attention mechanism and multi-scale fusion (AMS R-CNN).
For the problem of background interference in the detection process, a DAR50 feature extraction network constructed from DARM modules is proposed; by enhancing target feature information while adapting to the target's form, the module effectively reduces background interference and extracts target features more accurately.
For the multi-scale target problem, a CCFPN feature fusion module is proposed, which establishes information fusion between each pixel and the other pixels in the feature maps of different stages and fuses the features of different stages through a pyramid structure, improving detection performance for targets of different scales.
Experiments were performed on the challenging CPPE-5 medical protective equipment dataset, verifying the effectiveness of the method on this dataset. The effectiveness of AMS R-CNN in detecting other objects was also verified on the PASCAL VOC 2007 dataset.
Addressing the background interference and multi-scale problems of medical protective items during detection, the invention constructs the AMS R-CNN network model for personal medical protective equipment detection from the DAR50 module and the CCFPN module. The proposed network model was validated on the CPPE-5 dataset. The experimental results show that the DAR50 network structure effectively eliminates background interference and acquires the target's feature information, while the CCFPN module improves multi-scale fusion and boosts detection performance for medical protective items of different scales. Validation on the PASCAL VOC 2007 dataset shows the detection network is also effective at detecting other objects.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (3)
1. A personal protection item detection network based on attention mechanisms and multiscale fusion, comprising:
a feature extraction module DAR50, constructed from DARM modules, for enhancing target feature information by adapting to the target's form;
and a CCFPN feature fusion module for establishing information fusion between each pixel and the other pixels in the feature maps of different stages, and fusing the features of different stages through a pyramid structure to improve detection performance for targets of different scales.
2. The attention mechanism and multiscale fusion based personal protection item detection network of claim 1, wherein the feature extraction module DAR50 comprises:
with reference to the ResNet50 network structure, an ARM module and multiple DARM modules are used to construct the DAR50 feature extraction network, realizing effective acquisition of target features;
ARM is a residual module augmented with an attention operation, and DARM performs feature extraction with a deformable convolution and an scSE attention module in series;
the ARM module enhances the feature information of the original image produced by the preceding two-step stem; the DARM module replaces the conventional convolution with a deformable convolution that extracts features by adapting to the target's shape, and applies the scSE attention module to enhance the target feature information, realizing extraction of effective target features from the image;
Assume an input image $X \in \mathbb{R}^{W \times H \times C}$ and a residual-block output $Y \in \mathbb{R}^{W \times H \times C_1}$, where $W$ and $H$ denote the width and height of the input image and $C$ and $C_1$ denote the input and output channels; $F$ denotes the original residual module's feature mapping of the image, computed as in formula (1), where $\mathrm{conv1}(\cdot)$ and $\mathrm{conv3}(\cdot)$ denote convolution with 1x1 and 3x3 kernels:

$$Y = F(X) + X, \qquad F(X) = \mathrm{conv1}(\mathrm{conv3}(\mathrm{conv1}(X))) \tag{1}$$

While preserving the characteristics of the original residual module, the DARM module removes the conventional 3x3 convolution and adds a 3x3 deformable convolution operation and the scSE attention module. Let $Y' \in \mathbb{R}^{W \times H \times C_2}$ denote the feature output by the DAR residual module, with $C_2$ the output channels; the DARM module's calculation is as in formula (4):

$$Y' = f_{scSE}\!\left(\mathrm{conv1}\!\left(\mathrm{dconv3}(\mathrm{conv1}(X))\right)\right) + X \tag{4}$$

In the formula, $\mathrm{dconv3}(\cdot)$ denotes the 3x3 deformable convolution operation replacing the conventional convolution: when extracting features, the convolution kernel offsets the sampling points on the input feature map and concentrates them on the target region, obtaining the target's feature information by adapting to the object's shape. $f_{scSE}(\cdot)$ denotes the computation of the scSE attention module, which enhances the meaningful parts of the feature and suppresses the meaningless parts. The scSE module performs recalibration through the sSE and cSE branches running in parallel and obtains the new feature information by element-level addition; the operation is as in formula (5), where $f_{cSE}(\cdot)$ is the computation of the cSE branch and $f_{sSE}(\cdot)$ that of the sSE branch:

$$f_{scSE}(U) = f_{cSE}(U) + f_{sSE}(U) \tag{5}$$
3. the attention mechanism and multiscale fusion based personal protection item detection network of claim 1, wherein the CCFPN feature fusion module comprises:
based on the feature pyramid structure, a feature fusion network is constructed by adding an attention mechanism;
first, an LCC attention mechanism separately processes the feature maps of different stages in the feature extraction network;
second, a feature map of twice the original size is generated by nearest-neighbor upsampling, features of different scales are fused by element-wise addition, and the semantic feature maps of different stages are fused, making small targets easier to detect;
the feature maps of four different stages output by the backbone network are taken as input, feature fusion is performed through the pyramid structure, and a 3x3 convolution is applied to the fused feature map at each stage for feature re-extraction;
the final output undergoes max pooling with stride 2 to obtain a new feature map, which serves together with the other feature maps as the input of the next stage;
LCC is divided into an attention branch and a convolution branch; the attention branch iterates twice through a criss-cross attention module, which obtains global information along the vertical and horizontal directions of the image by feature weighting and captures the contextual dependencies between pixels, thereby establishing information fusion;
the convolution branch performs a dimension-reduction operation with a 1x1 convolution and supplements the attention branch, so the network obtains more comprehensive and richer feature information for targets of different scales;
For the feature map $X_i$ output by each stage of the feature extraction network, the calculation of LCC can be summarized as formula (6) and the calculation of the attention branch as formula (7):

$$F_{LCC}(X_i) = F_{att}(X_i) + F_{conv}(X_i) \tag{6}$$

$$F_{att}(X_i) = f_{CC}\!\left(f_{CC}(X_i)\right) \tag{7}$$

where $F_{LCC}$ denotes the feature map output by the LCC network, $F_{att}$ denotes the attention branch, $F_{conv}$ denotes the convolution branch performing a 1x1 convolution operation, and $f_{CC}(\cdot)$ denotes the computation of the criss-cross attention module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310001089.2A CN116051957A (en) | 2023-01-03 | 2023-01-03 | Personal protection item detection network based on attention mechanism and multi-scale fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116051957A true CN116051957A (en) | 2023-05-02 |
Family
ID=86123237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310001089.2A Withdrawn CN116051957A (en) | 2023-01-03 | 2023-01-03 | Personal protection item detection network based on attention mechanism and multi-scale fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116051957A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116563147A (en) * | 2023-05-04 | 2023-08-08 | 北京联合大学 | Underwater image enhancement system and method |
CN116563147B (en) * | 2023-05-04 | 2024-03-26 | 北京联合大学 | Underwater image enhancement system and method |
CN116778227A (en) * | 2023-05-12 | 2023-09-19 | 昆明理工大学 | Target detection method, system and equipment based on infrared image and visible light image |
CN116778227B (en) * | 2023-05-12 | 2024-05-10 | 昆明理工大学 | Target detection method, system and equipment based on infrared image and visible light image |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20230502