CN114897887A - X-ray security inspection image contraband detection method based on improved YOLOv5s - Google Patents


Info

Publication number
CN114897887A
Authority
CN
China
Prior art keywords
convolution
yolov5s
module
detection
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210705367.8A
Other languages
Chinese (zh)
Inventor
向娇
李国权
黄正文
林金朝
吴建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202210705367.8A
Publication of CN114897887A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10116X-ray image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an X-ray security inspection image contraband detection method based on improved YOLOv5s, and belongs to the field of image detection. The method comprises the following steps: S1: establishing a Rep module designed on the structural re-parameterization idea; S2: establishing a re-parameterization-based YOLOv5s contraband detection algorithm; S3: improving the PAN of the neck. Compared with traditional detection methods, the method has higher detection precision and can meet the practical requirements of contraband detection in X-ray security inspection images.

Description

X-ray security inspection image contraband detection method based on improved YOLOv5s
Technical Field
The invention belongs to the field of image detection, and relates to an X-ray security inspection image contraband detection method based on improved YOLOv5s.
Background
X-ray luggage screening is an important means of maintaining public transport safety. However, having security inspectors identify contraband in X-ray images by eye is inefficient and prone to false detections and missed detections, so a more efficient and more accurate method for automatically detecting prohibited articles is needed.
With the rapid development of deep learning in various fields, corresponding attempts have been made in the identification of prohibited articles in X-ray security inspection. Current deep-learning-based automatic contraband identification can be divided into three tasks: automatic classification, automatic detection, and automatic segmentation of contraband. At first, convolutional neural networks (CNN) were applied to the automatic classification of X-ray security contraband by means of transfer learning. Later, limited security inspection data sets were augmented with generative adversarial network (GAN) techniques to improve the accuracy of contraband identification. Kim et al. perform automatic detection of contraband by designing a U-Net-based O-Net structure. Miao et al. propose a Class-balanced Hierarchical Refinement (CHR) model to address the class imbalance between positive and negative samples during automatic detection of contraband. Xu et al. achieve automatic segmentation of security contraband by introducing an attention mechanism into a CNN. Although deep-learning-based automatic contraband identification has been studied, X-ray images differ from natural-light images, and because articles are placed randomly, object features are hard to learn under the see-through nature of X-ray imaging. As a result, the detection speed for contraband cannot yet meet the requirements of practical application, and the detection accuracy still needs further improvement.
In recent years, one-stage target detection algorithms have attracted wide attention because of their simple structure and superior performance. Among them, YOLO (You Only Look Once) is a family of end-to-end target detection algorithms characterized by high detection speed. The YOLOv5 algorithm recently open-sourced by the Ultralytics team balances real-time performance and accuracy to the greatest extent, and has great application potential in real-time contraband detection.
YOLOv5s is the smallest network in the YOLOv5 series. The invention provides an improved method for identifying contraband in X-ray security inspection images that takes YOLOv5s as the base model, so that the real-time requirement of automatic contraband detection is met while the detection precision is improved. Firstly, a re-parameterization module (Rep Block) is designed and introduced into the backbone network of YOLOv5s: a parallel 1 × 1 convolution branch is constructed at each 3 × 3 convolution to help the backbone network extract richer features in the training phase, and the 1 × 1 branch is merged into the 3 × 3 branch in the inference phase, so that the detection precision is improved without affecting the inference speed. Secondly, two Squeeze-and-Excitation modules (SE Blocks) are inserted into the Path Aggregation Network (PAN) at the neck of YOLOv5s, which improves the detection of prohibited articles without affecting the inference speed.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting contraband in X-ray security inspection images based on improved YOLOv5s.
In order to achieve this purpose, the invention provides the following technical scheme:
an X-ray security inspection image contraband detection method based on improved YOLOv5s, comprising the following steps:
S1: establishing a Rep module designed on the re-parameterization idea;
S2: establishing a re-parameterization-based YOLOv5s contraband detection algorithm;
S3: improving the PAN of the neck.
Optionally, the S1 specifically includes:
setting the constructed Rep module as shown in formula (1), namely two parallel convolution branches whose outputs are added; the information flow generated by the Rep module is represented as y = f(x) + g(x), where f(x) and g(x) are the convolution branches implemented by 3 × 3 and 1 × 1 kernels, respectively;
Rep(3×3)=3×3-BN+1×1-BN (1)
for each 3 × 3 convolution, a parallel 1 × 1 convolution branch is constructed in the training stage, each branch is batch-normalized, and the results are added; in the inference stage, the 1 × 1 branch is fused into the 3 × 3 branch to obtain a single 3 × 3 convolution branch and the extra parallel branch is removed, so that the performance of the convolutional network is improved without affecting the network detection efficiency;
on the basis of the structure of the Rep module, the multi-branch module is converted into a single branch based on the idea of RepVGG; the conversion of the model is carried out after training is finished, and comprises the following two steps:
firstly, fusing the convolution layer and the BN layer in each branch; substituting the convolution result directly into the bn formula, as shown by the left arrow in FIG. 3, the output is expressed as formula (2):
$$M^{(2)} = \mathrm{bn}(W^{(3)} * M^{(1)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(W^{(1)} * M^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}) \qquad (2)$$
wherein $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1}$ denote the convolution kernels of the 3 × 3 and 1 × 1 convolution layers, respectively, and $C_1$, $C_2$ denote the numbers of input and output channels; $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ denote the accumulated mean, standard deviation, scaling factor and bias term of the BN layer after the 3 × 3 convolution, and $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ denote the corresponding quantities of the BN layer after the 1 × 1 convolution; the input and output are denoted $M^{(1)} \in \mathbb{R}^{N \times C_1 \times H_1 \times W_1}$ and $M^{(2)} \in \mathbb{R}^{N \times C_2 \times H_2 \times W_2}$, respectively, and $*$ denotes the convolution operation;
substituting the parameters into formula (2) gives the result shown in formula (3), where bn is the batch normalization function of the inference phase and $i \in [1, C_2]$;
$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M_{:,i,:,:} - \mu_i)\,\frac{\gamma_i}{\sigma_i} + \beta_i \qquad (3)$$
simplifying formula (3) gives a convolution layer with a bias term; denoting by $\{W', b'\}$ the convolution kernel and bias term obtained by transforming $\{W, \mu, \sigma, \gamma, \beta\}$, there are:
$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i}\, W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i \qquad (4)$$
for any $i \in [1, C_2]$, $\mathrm{bn}(W * M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (W' * M)_{:,i,:,:} + b'_i$; a 3 × 3 convolution kernel, a 1 × 1 convolution kernel and two bias terms are obtained after this fusion is completed;
then fusing the 3 × 3 convolution and the 1 × 1 convolution: the two bias terms are added to obtain the fused bias term; the 1 × 1 convolution kernel is zero-padded to form a 3 × 3 convolution kernel, which is added to the original 3 × 3 convolution kernel to obtain the fused convolution kernel; let $K^{(1)}$ and $K^{(2)}$ be two convolution kernels of the same size; according to the additivity of convolution, the result of the addition is expressed as formula (5), so that the fused convolution kernel realizes the same function as before fusion;
$$K^{(1)} * M + K^{(2)} * M = (K^{(1)} + K^{(2)}) * M \qquad (5)$$
Optionally, the S2 specifically includes: introducing the Rep structure into the backbone network of the YOLOv5s algorithm to obtain an upgraded backbone network consisting of a series of Rep modules and C3 modules; adjusting the PAN structure by inserting SE modules between the upper and lower detection layers in the PAN to obtain an upgraded PAN network;
the Focus module slices the picture so that its input channels are expanded by a factor of 4, that is, the picture changes from the original three RGB channels to 12 channels; a subsequent convolution then yields a 2× downsampled feature map without information loss; the Conv module encapsulates the convolutional layer, the BN layer and the SiLU activation function; the C3 module has basically the same structure and function as the BottleneckCSP module but requires fewer floating-point operations and runs faster; the SPP module concatenates max-pooling results of different sizes to fuse local and global features; UpSample is an upsampling layer that enlarges the feature map 2× by interpolation; in the detection head, three Conv[1,1] layers produce the final output feature maps.
Optionally, the S3 specifically includes:
the SE module comprises a squeeze (compression) part and an excitation part; the first step is the squeeze stage, in which the input W × H × C feature map is compressed to 1 × 1 × C by global average pooling, so that the compressed feature map has a global receptive field; the second step is the excitation stage, which consists of two fully-connected layers: the first fully-connected layer has C × r neurons and the second has C neurons, where r is a scaling parameter that is adjusted to reduce the number of channels and thereby the amount of computation.
Optionally, after S3 the method further comprises S4: setting evaluation indices;
the detection performance of the detector is evaluated by considering precision and recall together; the mean average precision (mAP) at IoU = 0.5, the macro precision (MP), the macro recall (MR) and the macro F1 score are used to evaluate the performance of the network model in target detection; precision is defined by formula (6) and recall by formula (7), where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively;
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)$$
the average precision (AP) combines precision and recall and evaluates the precision of the model on a single category; the mAP measures the precision of the model over all categories and is obtained by averaging the APs of all N categories, as defined in formula (8); the F1 score is the harmonic mean of precision and recall, as defined in formula (9), and a larger value indicates better performance;
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i \qquad (8)$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (9)$$
macro precision, macro recall and macro F1 are obtained by averaging the precision, recall and F1 scores of all categories, respectively.
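The evaluation indices above can be illustrated with a short Python sketch. It is only a minimal illustration of the standard definitions of precision, recall, F1, their macro averages and mAP; the function names and the per-class counts are hypothetical and do not come from the experiments of the invention, and the per-class AP values are assumed to be computed elsewhere.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when the class has no ground truth."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """F1 is the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def macro_metrics(counts):
    """counts maps class name -> (TP, FP, FN); returns (MP, MR, macro F1)."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in counts.values():
        p, r = precision(tp, fp), recall(tp, fn)
        ps.append(p)
        rs.append(r)
        fs.append(f1_score(p, r))
    n = len(counts)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n


def mean_average_precision(ap_per_class):
    """mAP is the plain average of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class counts (TP, FP, FN), for illustration only.
counts = {"gun": (8, 2, 2), "knife": (6, 4, 2)}
mp, mr, mf1 = macro_metrics(counts)
m_ap = mean_average_precision({"gun": 0.9, "knife": 0.7})
```

Note that the macro F1 here averages the per-class F1 scores, as described above, rather than computing F1 from MP and MR.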
The invention has the following beneficial effects: it provides an improved method for identifying contraband in X-ray security inspection images that takes YOLOv5s as the base model, so that the real-time requirement of automatic contraband detection is met while the detection precision is improved. Firstly, a re-parameterization module (Rep Block) is designed and introduced into the YOLOv5s backbone network: a parallel 1 × 1 convolution branch is constructed at each 3 × 3 convolution to help the backbone network extract richer features in the training phase, and the 1 × 1 branch is merged into the 3 × 3 branch in the inference phase, so that the detection precision is improved without affecting the inference speed. Secondly, two SE Blocks are inserted into the PAN at the neck of YOLOv5s, which improves the detection of prohibited articles without affecting the inference speed.
Compared with the traditional detection method, the method has higher detection precision, and can meet the actual application requirement of contraband detection in the X-ray security inspection image.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of the YOLOv5s network architecture;
FIG. 2 is a diagram of the improved YOLOv5s network model architecture;
FIG. 3 shows the Rep Block structure and its structural re-parameterization process;
FIG. 4 shows the fused convolution kernel obtained by adding an ordinary 3 × 3 convolution kernel to the zero-padded convolution kernel;
FIG. 5 shows the use of the SE module in a convolutional layer: (a) ordinary convolution; (b) convolution after insertion of the SE module;
FIG. 6 shows the confusion matrices for the various types of contraband under different algorithms.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustrative purposes only; they are schematic representations rather than depictions of an actual product and are not to be construed as limiting the invention. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
1. X-ray security image dataset
X-rays have shown powerful capabilities in security inspection tasks; however, few X-ray contraband image data sets are available for research. GDXray contains 19407 images, but only a few (600) contain contraband, of three types (guns, darts and razor blades); all of its images are grayscale with simple backgrounds, which differ greatly from complex real scenes. OPIXray contains 8885 X-ray contraband images with different levels and proportions of overlap, but only one type of contraband (knives of different shapes). SIXray consists of 8929 contraband images covering multiple categories; the image backgrounds are complex and the dangerous articles are randomly stacked and occluded, which better matches real conditions, so SIXray is selected as the experimental data set.
2. YOLOv5s algorithm
The YOLOv5s algorithm consists of four parts: the input, the backbone network, the neck, and the detection head, as shown in FIG. 1. The input stage uses Mosaic data augmentation, adaptive anchor box calculation and adaptive image scaling, which enrich the diversity of the input data. The backbone network uses Focus and CSP modules; the CSP structure helps improve the feature learning ability of the network. The neck uses a Feature Pyramid Network (FPN) plus PAN structure: the FPN strengthens the propagation of semantic information through upsampling, and the PAN strengthens the propagation of localization features through downsampling. The detection head uses the Generalized Intersection over Union (GIoU) loss as the bounding-box loss function and selects bounding boxes with Non-Maximum Suppression (NMS).
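The GIoU term mentioned for the bounding-box loss can be illustrated with a short sketch. It follows the commonly used definition GIoU = IoU − (|C| − |A ∪ B|) / |C|, where C is the smallest box enclosing both boxes, with the loss taken as 1 − GIoU; the function names and the corner-coordinate box format are assumptions made for illustration and are not taken from the YOLOv5 code base.

```python
def giou(box_a, box_b):
    """GIoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Area of the smallest axis-aligned box enclosing both boxes.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return iou - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    """Bounding-box regression loss: 0 for identical boxes, up to 2 for distant ones."""
    return 1.0 - giou(box_a, box_b)


overlapping = giou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
disjoint = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, the value stays informative for non-overlapping boxes: `disjoint` is negative and moves toward zero as the boxes approach each other.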
The invention improves the backbone network and the neck PAN structure of the YOLOv5s algorithm, respectively, to produce a new network with improved detection performance.
3. Re-parameterization-based YOLOv5s contraband detection algorithm
The invention improves the detection precision of the algorithm on security inspection contraband by improving the YOLOv5s network structure: the Rep module is designed to improve the feature extraction capability of the backbone network without affecting the inference time, and two SE modules are introduced into the neck PAN so that the network extracts more feature information. The improved network structure is shown in FIG. 2.
For the contraband detection problem, YOLOv5s is improved in two parts: first, the Rep structure is introduced into the backbone network of the YOLOv5s algorithm to obtain an upgraded backbone network consisting of a series of Rep modules and C3 modules; second, the PAN structure is further adjusted by inserting SE modules between the upper and lower detection layers in the PAN to obtain an upgraded PAN network. The new algorithm not only enriches the features of the backbone network, improving model performance and detection results, but also refines the information enhanced by the neck PAN, with a negligible effect on the inference time; the improved algorithm structure is shown in FIG. 3. The Focus module slices the picture so that its input channels are expanded by a factor of 4, that is, the picture changes from the original three RGB channels to 12 channels; a subsequent convolution then yields a 2× downsampled feature map without information loss. The Conv module encapsulates the convolutional layer, the BN layer and the SiLU activation function. The C3 module has basically the same structure and function as the BottleneckCSP module but requires fewer floating-point operations and runs faster. The SPP module concatenates max-pooling results of different sizes to fuse local and global features. UpSample is an upsampling layer that enlarges the feature map 2× by interpolation. In the detection head, three Conv[1,1] layers produce the final output feature maps.
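The slicing performed by the Focus module can be sketched in a few lines of NumPy. This is only an illustration of the 3-channel to 12-channel rearrangement described above: the function name `focus_slice` and the order in which the four pixel phases are concatenated are assumptions, and the convolution that follows the slicing in the real module is omitted.

```python
import numpy as np


def focus_slice(x):
    """Rearrange (N, C, H, W) into (N, 4C, H/2, W/2); H and W must be even.

    Every second pixel is taken at each of the four (row, column) phases and
    the four sub-images are stacked along the channel axis, so spatial size
    is halved while no pixel value is discarded.
    """
    return np.concatenate(
        [
            x[..., ::2, ::2],    # even rows, even columns
            x[..., 1::2, ::2],   # odd rows, even columns
            x[..., ::2, 1::2],   # even rows, odd columns
            x[..., 1::2, 1::2],  # odd rows, odd columns
        ],
        axis=1,
    )


# A tiny hypothetical "RGB image" of size 4 x 4 with distinct pixel values.
x = np.arange(1 * 3 * 4 * 4, dtype=np.float32).reshape(1, 3, 4, 4)
y = focus_slice(x)
```

Here `y` has shape `(1, 12, 2, 2)` and contains exactly the same values as `x`, which is the sense in which the 2× downsampling loses no information.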
3.1 Rep module designed on the re-parameterization idea
In order to extract richer network features and improve detection performance, researchers have designed many novel multi-branch structures. Such components can improve precision, but multi-branch structures are difficult to apply and customize, increase video memory consumption, and slow down inference. Therefore, the invention designs the Rep module using the re-parameterization idea to improve model precision, and reduces the impact on inference speed by decoupling the training-stage structure from the test-stage structure. Let the constructed Rep module be as shown in equation (1), i.e. two parallel convolution branches whose outputs are added. The information flow generated by the Rep module is denoted y = f(x) + g(x), where f(x) and g(x) are the convolution branches implemented by a 3 × 3 kernel and a 1 × 1 kernel, respectively.
Rep(3×3)=3×3-BN+1×1-BN (1)
As shown in FIG. 3, for each 3 × 3 convolution, a parallel 1 × 1 convolution branch is constructed in the training phase; each branch is batch-normalized and the results are added. In the inference stage, the 1 × 1 branch is fused into the 3 × 3 branch to obtain a single 3 × 3 convolution branch and the extra parallel branch is removed, so that the performance of the convolutional network is improved without affecting the network detection efficiency.
On the basis of the structure of the Rep module, the multi-branch module can be converted into a single branch based on the idea of RepVGG. The conversion of the model (i.e. multi-branch fusion) is performed after the training is completed, and comprises the following two steps:
(1) First, the convolution layer and the BN layer in each branch are fused. Substituting the convolution result directly into the bn equation, as shown by the left arrow in FIG. 3, the output can be expressed as equation (2):
$$M^{(2)} = \mathrm{bn}(W^{(3)} * M^{(1)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(W^{(1)} * M^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}) \qquad (2)$$
where $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1}$ denote the convolution kernels of the 3 × 3 and 1 × 1 convolution layers, respectively, and $C_1$, $C_2$ denote the numbers of input and output channels. $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ denote the accumulated mean, standard deviation, scaling factor and bias term of the BN layer after the 3 × 3 convolution, and $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ denote the corresponding quantities of the BN layer after the 1 × 1 convolution. The input and output are denoted $M^{(1)} \in \mathbb{R}^{N \times C_1 \times H_1 \times W_1}$ and $M^{(2)} \in \mathbb{R}^{N \times C_2 \times H_2 \times W_2}$, respectively, and $*$ denotes the convolution operation.
Substituting the parameters into equation (2) yields the result in equation (3), where bn is the batch normalization function of the inference phase and $i \in [1, C_2]$.
$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M_{:,i,:,:} - \mu_i)\,\frac{\gamma_i}{\sigma_i} + \beta_i \qquad (3)$$
Equation (3) is further simplified to obtain a convolution layer with a bias term. Denoting by $\{W', b'\}$ the convolution kernel and bias term obtained by transforming $\{W, \mu, \sigma, \gamma, \beta\}$, there are:
$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i}\, W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i \qquad (4)$$
Thus it can be verified that for any $i \in [1, C_2]$, $\mathrm{bn}(W * M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (W' * M)_{:,i,:,:} + b'_i$. Therefore, a 3 × 3 convolution kernel, a 1 × 1 convolution kernel, and two bias terms are obtained after this fusion is completed.
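The folding of a BN layer into the preceding convolution in step (1) can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the implementation of the invention: the helper names `conv2d`, `bn_infer` and `fold_bn` are hypothetical, the convolution is assumed to have no bias of its own, and σ is used directly as the stored standard deviation (practical frameworks usually store the running variance and add a small epsilon before taking the square root).

```python
import numpy as np


def conv2d(x, w, b=None, pad=0):
    """Naive stride-1 convolution. x: (N, C1, H, W); w: (C2, C1, k, k)."""
    n, _, h, wd = x.shape
    c2, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    ho, wo = h + 2 * pad - k + 1, wd + 2 * pad - k + 1
    out = np.zeros((n, c2, ho, wo))
    for i in range(k):
        for j in range(k):
            patch = xp[:, :, i:i + ho, j:j + wo]
            out += np.einsum("nchw,oc->nohw", patch, w[:, :, i, j])
    if b is not None:
        out += b[None, :, None, None]
    return out


def bn_infer(m, mu, sigma, gamma, beta):
    """Inference-time batch normalization applied per output channel."""
    s = (1, -1, 1, 1)
    return (m - mu.reshape(s)) * (gamma / sigma).reshape(s) + beta.reshape(s)


def fold_bn(w, mu, sigma, gamma, beta):
    """Return (W', b') such that conv(x, W') + b' equals bn(conv(x, W))."""
    scale = gamma / sigma
    return w * scale[:, None, None, None], beta - mu * scale


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
mu, gamma, beta = (rng.standard_normal(4) for _ in range(3))
sigma = rng.uniform(0.5, 2.0, size=4)  # standard deviations must be positive

y_bn = bn_infer(conv2d(x, w, pad=1), mu, sigma, gamma, beta)
w_f, b_f = fold_bn(w, mu, sigma, gamma, beta)
y_folded = conv2d(x, w_f, b_f, pad=1)
```

The two outputs `y_bn` and `y_folded` agree up to floating-point error, because inference-time batch normalization is a per-channel affine map and convolution is linear in its kernel.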
(2) The 3 × 3 convolution and the 1 × 1 convolution are then fused, which is the right-arrow step in Fig. 3. The two bias terms are added to obtain the fused bias term; the 1 × 1 convolution kernel is zero-padded to a 3 × 3 kernel and added to the original 3 × 3 kernel to obtain the fused convolution kernel, as shown in Fig. 4. Let
Figure BDA0003705095060000076
be two convolution kernels; by the additivity of convolution, their sum can be expressed as equation (5). The fused convolution kernel therefore realizes the same function as the two branches before fusion.
Figure BDA0003705095060000077
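The kernel fusion of step (2) can be checked numerically. The following is a minimal single-channel sketch, assuming stride 1 and zero padding of 1 for the 3 × 3 branch so that both branches produce outputs of the same size; the helper names, the image, and the kernel values are hypothetical.

```python
import math

def pad_1x1_to_3x3(k1):
    # Place the single 1x1 weight at the center of an all-zero 3x3 kernel.
    return [[0.0, 0.0, 0.0], [0.0, k1, 0.0], [0.0, 0.0, 0.0]]

def add_kernels(a, b):
    # Element-wise sum of two 3x3 kernels.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def conv3x3_same(image, kernel, bias):
    # Single-channel 3x3 convolution (cross-correlation, as in deep learning
    # frameworks) with stride 1 and zero padding 1.
    h, w = len(image), len(image[0])
    out = [[bias] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[dr + 1][dc + 1] * image[rr][cc]
    return out

# Hypothetical input and already BN-fused branch parameters.
image = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.2, -0.1, 0.0], [0.5, 1.0, -0.5], [0.0, 0.3, 0.1]]
b3 = 0.1
k1, b1 = 0.5, -0.3

# Two-branch output: the 3x3 branch plus the 1x1 branch (a per-pixel scaling).
branch3 = conv3x3_same(image, k3, b3)
two_branch = [[branch3[r][c] + k1 * image[r][c] + b1 for c in range(3)]
              for r in range(3)]

# Single-branch output with the fused kernel and the summed bias.
fused_kernel = add_kernels(k3, pad_1x1_to_3x3(k1))
fused = conv3x3_same(image, fused_kernel, b3 + b1)
```

Here `two_branch` and `fused` match element by element up to rounding, so the single 3 × 3 convolution reproduces the two-branch output in this setting.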
3.2 Improvement of the neck PAN
The invention improves model performance by adding SE modules to the PAN in the neck of YOLOv5s. The SE module learns channel attention by modeling the correlations among feature channels, and improves accuracy by strengthening the important features.
As shown in Fig. 5, the SE module consists of two parts: compression (Squeeze) and excitation (Excitation). The first step is the compression stage, which compresses the input W × H × C feature map to 1 × 1 × C by global average pooling, so that the compressed feature map has a global receptive field. The second step is the excitation stage, which consists of two fully connected layers: the first has C × r neurons and the second has C neurons, where r is a scaling parameter that is adjusted to reduce the number of channels and thus the amount of computation.
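The two stages can be sketched in plain Python. The text specifies only the pooling and the sizes of the two fully connected layers; the ReLU and sigmoid activations and the final channel-wise rescaling below follow the commonly used SE design and are assumptions here, as are the function names, the weights, and the choice r = 0.5 with C = 4.

```python
import math

def squeeze(fmap):
    # Global average pooling: one scalar per channel (W x H x C -> 1 x 1 x C).
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def dense(x, weights, biases):
    # Fully connected layer: one output per weight row.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def se_forward(fmap, w1, b1, w2, b2):
    z = squeeze(fmap)
    hidden = [max(0.0, v) for v in dense(z, w1, b1)]                     # assumed ReLU
    gates = [1.0 / (1.0 + math.exp(-v)) for v in dense(hidden, w2, b2)]  # assumed sigmoid
    # Rescale every channel by its gate (assumed, as in the usual SE block).
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fmap)]
    return scaled, gates

# Hypothetical feature map with C = 4 channels of size 2 x 2.
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[-1.0, 0.5], [2.0, 0.5]],
    [[4.0, 4.0], [4.0, 4.0]],
]
C, r = 4, 0.5
hidden_size = int(C * r)  # first fully connected layer: C x r = 2 neurons

# Hypothetical weights: w1 maps C -> C x r, w2 maps C x r -> C.
w1 = [[0.1, -0.2, 0.3, 0.05], [-0.1, 0.4, 0.0, 0.2]]
b1 = [0.0, 0.1]
w2 = [[0.5, -0.3], [0.2, 0.2], [-0.4, 0.6], [0.1, -0.1]]
b2 = [0.0, 0.0, 0.1, -0.1]

scaled, gates = se_forward(fmap, w1, b1, w2, b2)
```

With r below 1, the first layer has fewer neurons than there are channels, which is where the reduction in computation comes from.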
4 Experiments and result analysis
4.1 Dataset
The proposed algorithm was evaluated on the public SIXray dataset, which contains 8929 annotated images. Compared with other datasets, SIXray has more categories and a relatively large amount of data. The dataset was randomly divided into two parts: 20% of the images (1781) form the test set and the remainder (7148) form the training set, a ratio of approximately 1:4. Scissors were excluded from detection in the experiments because this class has too few samples, which makes the amount of data across classes unbalanced. The detailed distribution of the categories in the dataset is shown in Table 1. In addition, many images in the dataset contain multiple objects.
Table 1 Distribution of each category in the SIXray dataset.
Figure BDA0003705095060000081
Many images contain multiple contraband items, so the total number of items is much higher than the number of images.
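A random split with the stated sizes can be sketched as follows. The text does not describe how the split was produced, so the integer image IDs, the seed, and the function name are purely illustrative.

```python
import random

def split_dataset(image_ids, n_test, seed=0):
    # Shuffle a copy of the IDs and cut off the first n_test as the test set.
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[n_test:], ids[:n_test]

# 8929 annotated images in total, 1781 of them held out for testing.
train_ids, test_ids = split_dataset(range(8929), n_test=1781)
test_fraction = len(test_ids) / 8929
```

The resulting test fraction is slightly below 20%, consistent with the approximate 1:4 ratio given above.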
4.2 Evaluation metrics
Evaluating the performance of a detector requires considering both precision and recall. For object detection, the mean average precision (mAP) at IoU = 0.5, macro precision (MP), macro recall (MR), and macro F1 (Macro-F1, MF1) are used to evaluate the network model. Precision is defined by equation (6) and recall by equation (7), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Figure BDA0003705095060000091
Figure BDA0003705095060000092
The average precision (AP) combines precision and recall and evaluates how precisely the model detects a single class. The mAP measures the model's precision over all classes and is obtained by averaging the AP of all classes, as defined in equation (8). The F1 score is the harmonic mean of precision and recall, defined in equation (9); larger values indicate better results.
Figure BDA0003705095060000093
Figure BDA0003705095060000094
Similarly, macro precision, macro recall, and macro F1 are obtained by averaging the precision, recall, and F1 scores of all categories, respectively. In addition, confusion matrices can be used to assist in analyzing the results.
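These metrics can be sketched in a few lines. Equations (6) to (9) are not reproduced in this text, so the sketch assumes the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2PR / (P + R), and mAP as the mean of the per-class AP values. The per-class counts and AP values are invented for illustration and are not results from the experiments.

```python
import math

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_metrics(counts):
    # counts maps a class name to (TP, FP, FN); each metric is computed
    # per class and then averaged with equal weight per class.
    ps, rs, fs = [], [], []
    for tp, fp, fn in counts.values():
        p, r = precision(tp, fp), recall(tp, fn)
        ps.append(p)
        rs.append(r)
        fs.append(f1_score(p, r))
    n = len(counts)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

def mean_average_precision(ap_per_class):
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical counts and AP values for two classes.
counts = {"gun": (90, 10, 10), "knife": (60, 20, 40)}
mp, mr, mf1 = macro_metrics(counts)
m_ap = mean_average_precision({"gun": 0.95, "knife": 0.75})
```

Because each class contributes equally to the macro averages, a small class with poor recall lowers MR as much as a large one would.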
4.3 Analysis of experimental results
The experimental hardware was an Intel Core(TM) i9-10920X processor and a GeForce RTX 3090 graphics card, and the software framework was PyTorch 1.8.0. During training, the network parameters were optimized using SGD with momentum, and the number of epochs was set to 200. The input image size was 640 × 640 and the batch size was 64. All models were tested with the other parameters kept identical.
Results are reported for four models: the original YOLOv5s algorithm, the algorithm whose YOLOv5s backbone is improved with the Rep module (hereinafter Rep-YOLOv5s), the algorithm with SE modules inserted in the neck PAN (hereinafter SE-YOLOv5s), and the algorithm combining both improvements (hereinafter RepSE-YOLOv5s).
Fig. 6 shows the confusion matrices of the different algorithms, where the diagonal values are the true positive rates (TPR) and the sum of the off-diagonal values in each column is the false negative rate (FNR) of one class. As the figure shows, the diagonal values of the RepSE-YOLOv5s confusion matrix are higher than the averages of the other matrices, and its better overall distribution indicates that the algorithm identifies contraband more accurately.
By comparison, the algorithm's detection of wrenches improves the most. Overall, however, knives have the lowest true positive rate. A possible reason is that the knives in the dataset lack distinctive shared features: the knife category includes wide kitchen knives, slender straight knives, and small utility knives.
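The reading of the confusion matrix described above can be sketched as follows, assuming that rows index the predicted class, columns index the true class, and a final "background" row collects missed objects. The counts and the function name are hypothetical and are not taken from Fig. 6.

```python
import math

def column_rates(counts):
    # counts[pred][true] holds raw counts. For each true class (column),
    # TPR is the diagonal entry divided by the column total, and FNR is
    # the sum of the off-diagonal entries divided by the column total.
    n = len(counts)
    tpr, fnr = [], []
    for j in range(n):
        total = sum(counts[i][j] for i in range(n))
        hit = counts[j][j]
        tpr.append(hit / total if total else 0.0)
        fnr.append((total - hit) / total if total else 0.0)
    return tpr, fnr

# Hypothetical counts for the classes knife, wrench, and background.
counts = [
    [80, 5, 0],   # predicted knife
    [10, 90, 0],  # predicted wrench
    [10, 5, 0],   # predicted background (missed objects)
]
tpr, fnr = column_rates(counts)
```

For every column with at least one true object, TPR and FNR sum to 1, so a higher diagonal directly means fewer missed or misclassified items of that class.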
Table 2 compares the overall performance of the four algorithms on the SIXray dataset. First, comparing the original YOLOv5s with Rep-YOLOv5s, the data in the table show that the macro precision (MP) of Rep-YOLOv5s is 1.1% higher than that of the original network, the other three metrics also improve, and the detection precision of every category improves. This indicates that the Rep module strengthens the feature extraction of the backbone network and clearly helps detection performance. Next, comparing the original YOLOv5s with SE-YOLOv5s shows that the SE modules' improvement to the enhanced feature extraction part of the network also raises overall detection performance; in particular, the macro recall (MR) improves by 2.5%.
Finally, the mAP of RepSE-YOLOv5s, which combines both improvements, is 2.6%, 1.0%, and 1.4% higher than that of the original YOLOv5s, Rep-YOLOv5s, and SE-YOLOv5s, respectively. It also has clear advantages over the original network and the other two networks in macro precision (MP), macro recall (MR), and macro F1 (MF1). The table further shows that the detection precision of every category improves; for wrenches in particular it rises from 86.4% to 91.3%, an improvement of 4.9%. This evaluation shows that RepSE-YOLOv5s detects all types of contraband in X-ray security inspection images more accurately and has the potential for application in real scenarios.
Table 2 Performance comparison of the detection algorithms
Figure BDA0003705095060000101
In addition, the number of parameters (millions, M), the model size (megabytes, MB), and the time needed to detect a single image (milliseconds, ms) are compared for the original YOLOv5s, Rep-YOLOv5s, SE-YOLOv5s, and RepSE-YOLOv5s algorithms, as shown in Table 3. The time to detect a single image is measured on the test set on the GPU and includes data pre-processing, model inference, post-processing, and non-maximum suppression (NMS). The average pre-processing time is 0.1 ms and the average NMS time is 0.8 ms per image.
As Table 3 shows, the number of parameters of RepSE-YOLOv5s increases by only 0.28% and the model size increases by 2.82% (0.4 MB), while the detection time is almost unchanged.
Table 3 Comparison of parameters, model size, and detection time of the different algorithms
Figure BDA0003705095060000102
The improved algorithm performs better on objects that have complex backgrounds and are difficult to identify.
To address the insufficient detection precision on X-ray security inspection images, the invention applies the YOLOv5s algorithm, which has a small model and a high detection speed, to contraband detection in X-ray security inspection images and proposes the RepSE-YOLOv5s detection algorithm, which improves detection precision with a negligible effect on detection speed. First, a Rep module is designed based on the idea of re-parameterization to enrich the features of the YOLOv5s backbone network; then two SE modules are inserted into the PAN in the neck of YOLOv5s, improving the algorithm's detection of contraband. Finally, experiments on the SIXray dataset compare the contraband detection performance of four algorithm models. The results show that, compared with the original algorithm, the new algorithm improves mAP, macro precision, macro recall, and macro F1 by 2.6%, 2.0%, 4.0%, and 3.0%, respectively, while maintaining a detection speed of 2.6 milliseconds per image, so that detection accuracy improves with almost no extra detection time.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (5)

1. A method for detecting contraband in X-ray security inspection images based on improved YOLOv5s, characterized in that the method comprises the following steps:
S1: establishing a Rep module designed based on the idea of re-parameterization;
S2: establishing a re-parameterization-based YOLOv5s contraband detection algorithm;
S3: improving the PAN in the neck.
2. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s as claimed in claim 1, wherein S1 specifically comprises:
setting the parameters of the constructed Rep module as shown in formula (1), namely adding two parallel convolution branches; the information flow generated by the Rep module is expressed as $y = f(x) + g(x)$, where $f(x)$ and $g(x)$ are convolution branches implemented with 3 × 3 and 1 × 1 kernels, respectively;
Rep(3×3)=3×3-BN+1×1-BN (1)
for each 3 × 3 convolution, constructing a parallel 1 × 1 convolution branch in the training stage, normalizing each branch separately, and adding the results; in the inference stage, fusing the 1 × 1 branch into the 3 × 3 branch to obtain a single 3 × 3 convolution branch and removing the additional parallel branch, so that the performance of the convolutional network is improved without affecting the detection efficiency of the network;
on the basis of the structure of the Rep module, converting the multi-branch module into a single branch based on the idea of RepVGG; the conversion of the model is carried out after training is finished and comprises the following two steps:
first, fusing the convolution layer and the BN layer in each branch; substituting the convolution result directly into the bn formula, as shown by the left arrow in Fig. 3, the output is expressed as formula (2):
$M^{(2)} = \mathrm{bn}(W^{(3)} * M^{(1)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(W^{(1)} * M^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)})$ (2)
where
Figure FDA0003705095050000011
and
Figure FDA0003705095050000012
denote the convolution kernels of the 3 × 3 and 1 × 1 convolution layers, respectively, and $C_1$, $C_2$ denote the numbers of input and output channels; $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ denote the accumulated mean, standard deviation, scaling factor, and bias term of the BN layer after the 3 × 3 convolution, and $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ are the corresponding quantities of the BN layer after the 1 × 1 convolution; the input and output are denoted respectively by
Figure FDA0003705095050000013
and $*$ denotes the convolution operation;
substituting the parameters into formula (2) yields formula (3), where bn is the batch normalization function of the inference phase and $i \in [1, C_2]$;
Figure FDA0003705095050000014
simplifying formula (3) to obtain a convolution layer with a bias term; letting $\{W', b'\}$ denote the convolution kernel and bias term obtained by transforming $\{W, b, \mu, \sigma, \gamma, \beta\}$, then:
Figure FDA0003705095050000021
for any $i \in [1, C_2]$, $\mathrm{bn}(W * M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (W' * M)_{:,i,:,:} + b'_i$; after this fusion is complete, one 3 × 3 convolution kernel, one 1 × 1 convolution kernel, and two bias terms are obtained;
second, fusing the 3 × 3 convolution and the 1 × 1 convolution: adding the two bias terms to obtain the fused bias term; zero-padding the 1 × 1 convolution kernel to a 3 × 3 kernel and adding it to the original 3 × 3 kernel to obtain the fused convolution kernel; letting
Figure FDA0003705095050000022
be two convolution kernels, their sum is expressed as formula (5) according to the additivity of convolution; after fusion, the convolution kernel realizes the same function as before fusion;
Figure FDA0003705095050000023
3. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s as claimed in claim 2, wherein S2 specifically comprises: introducing the Rep structure into the backbone network of the YOLOv5s algorithm to obtain an upgraded backbone network consisting of a series of Rep modules and C3 modules; adjusting the PAN structure by inserting an SE module between the upper and lower detection layers in the PAN to obtain an upgraded PAN network;
the Focus module performs a slicing operation on the picture so that its input channels are expanded by a factor of 4, that is, the picture changes from the original three RGB channels to 12 channels; a two-fold downsampled feature map without information loss is then obtained through a convolution operation; the Conv module encapsulates the convolution layer, the BN layer, and the SiLU activation function; the C3 module has basically the same structure and function as BottleneckCSP, but requires fewer floating-point operations and runs faster; the SPP module concatenates max-pooling results of different sizes to fuse local and global features; UpSample is an upsampling layer that enlarges the feature map by a factor of 2 through interpolation; and the three Conv[1,1] layers in the detection head produce the final output feature maps.
4. The method for detecting contraband in X-ray security inspection image based on improved YOLOv5s as claimed in claim 3, wherein: the S3 specifically includes:
the SE module comprises a compression part and an excitation part; the first step is the compression stage, in which the input W × H × C feature map is compressed to 1 × 1 × C by global average pooling, so that the compressed feature map has a global receptive field; the second step is the excitation stage, which consists of two fully connected layers: the first has C × r neurons and the second has C neurons, where r is a scaling parameter that is adjusted to reduce the number of channels and thus the amount of computation.
5. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s as claimed in claim 4, wherein evaluation metrics are set after step S4;
evaluating the performance of the detector requires considering both precision and recall; for object detection, the mean average precision (mAP) at IoU = 0.5, macro precision (MP), macro recall (MR), and macro F1 are used to evaluate the network model; precision is defined by formula (6) and recall by formula (7), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively;
Figure FDA0003705095050000031
Figure FDA0003705095050000032
the average precision (AP) combines precision and recall and evaluates how precisely the model detects a single category; the mAP measures the model's precision over all categories and is obtained by averaging the AP of all categories, as defined in formula (8); the F1 score is the harmonic mean of precision and recall, defined in formula (9), and larger values indicate better results;
Figure FDA0003705095050000033
Figure FDA0003705095050000034
macro precision, macro recall, and macro F1 are obtained by averaging the precision, recall, and F1 scores of all categories, respectively.
CN202210705367.8A 2022-06-21 2022-06-21 X-ray security inspection image contraband detection method based on improved YOLOv5s Pending CN114897887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210705367.8A CN114897887A (en) 2022-06-21 2022-06-21 X-ray security inspection image contraband detection method based on improved YOLOv5s

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210705367.8A CN114897887A (en) 2022-06-21 2022-06-21 X-ray security inspection image contraband detection method based on improved YOLOv5s

Publications (1)

Publication Number Publication Date
CN114897887A true CN114897887A (en) 2022-08-12

Family

ID=82727370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210705367.8A Pending CN114897887A (en) 2022-06-21 2022-06-21 X-ray security inspection image contraband detection method based on improved YOLOv5s

Country Status (1)

Country Link
CN (1) CN114897887A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274192A (en) * 2023-09-20 2023-12-22 重庆市荣冠科技有限公司 Pipeline magnetic flux leakage defect detection method based on improved YOLOv5

Similar Documents

Publication Publication Date Title
Sengupta et al. SfSNet: Learning shape, reflectance and illuminance of faces 'in the wild'
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
Li et al. Linestofacephoto: Face photo generation from lines with conditional self-attention generative adversarial networks
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN111325111A (en) Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN107463920A (en) A kind of face identification method for eliminating partial occlusion thing and influenceing
CN111563418A (en) Asymmetric multi-mode fusion significance detection method based on attention mechanism
Zhu et al. Efficient action detection in untrimmed videos via multi-task learning
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN110032925A (en) A kind of images of gestures segmentation and recognition methods based on improvement capsule network and algorithm
Tao et al. Learning discriminative feature representation with pixel-level supervision for forest smoke recognition
CN110222718A (en) The method and device of image procossing
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN113963170A (en) RGBD image saliency detection method based on interactive feature fusion
CN111914617B (en) Face attribute editing method based on balanced stack type generation type countermeasure network
CN114511710A (en) Image target detection method based on convolutional neural network
Hu et al. Gabor-CNN for object detection based on small samples
CN114897887A (en) X-ray security inspection image contraband detection method based on improved YOLOv5s
Chen et al. Video‐based action recognition using spurious‐3D residual attention networks
US20240177525A1 (en) Multi-view human action recognition method based on hypergraph learning
CN114360073A (en) Image identification method and related device
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN113609944A (en) Silent in-vivo detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination