CN114897887A

CN114897887A - X-ray security inspection image contraband detection method based on improved YOLOv5s

Info

Publication number: CN114897887A
Application number: CN202210705367.8A
Authority: CN
Inventors: 向娇; 李国权; 黄正文; 林金朝; 吴建
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-06-21
Filing date: 2022-06-21
Publication date: 2022-08-12
Anticipated expiration: 2042-06-21
Also published as: CN114897887B

Abstract

The invention relates to an X-ray security inspection image contraband detection method based on improved YOLOv5s and belongs to the field of image detection. The method comprises the following steps: S1: establishing a Rep module designed on the basis of the re-parameterization idea; S2: establishing a re-parameterization-based YOLOv5s contraband detection algorithm; S3: improving the PAN in the neck. Compared with traditional detection methods, the method achieves higher detection precision and can meet the practical requirements of contraband detection in X-ray security inspection images.

Description

X-ray security inspection image contraband detection method based on improved YOLOv5s

Technical Field

The invention belongs to the field of image detection and relates to an X-ray security inspection image contraband detection method based on improved YOLOv5s.

Background

X-ray baggage screening is an important means of maintaining public transport safety, but having security screeners identify X-ray images by eye is inefficient and prone to false and missed detections. A more efficient and more accurate method for automatically detecting prohibited items is therefore needed.

With the rapid development of deep learning in various fields, corresponding attempts have been made to identify prohibited items in X-ray security inspection. Current deep-learning-based automatic contraband identification can be divided into three tasks: automatic classification, automatic detection and automatic segmentation of contraband. Convolutional Neural Networks (CNN) were first applied to the automatic classification of X-ray security contraband through transfer learning. Later, limited security inspection datasets were augmented with generative adversarial network techniques to improve identification accuracy. Kim et al. performed automatic contraband detection by designing an O-Net structure based on U-Net. Miao et al. proposed a Class-balanced Hierarchical Refinement (CHR) model to address the class imbalance between positive and negative samples in automatic contraband detection. Xu et al. achieved automatic segmentation of security contraband by introducing an attention mechanism into a CNN. Although deep-learning-based automatic contraband identification has been studied, X-ray images differ from natural-light images: because X-ray images are see-through and items are packed randomly, object features are difficult to learn. As a result, contraband detection speed cannot yet meet the requirements of practical applications, and detection accuracy still needs further improvement.

In recent years, one-stage object detection algorithms have attracted wide attention because of their simple structure and strong performance. YOLO (You Only Look Once) is a family of end-to-end object detection algorithms characterized by high detection speed. The YOLOv5 algorithm recently open-sourced by the Ultralytics team balances real-time performance and accuracy to a large extent and has great potential for real-time contraband detection.

YOLOv5s is the smallest network in the YOLOv5 series, and the invention uses it as the base model for an improved method of identifying contraband in X-ray security inspection images. The method meets the real-time requirement of automatic contraband detection while improving detection precision. First, a re-parameterization module (Rep Block) is designed and introduced into the backbone of YOLOv5s: in the training phase, a parallel 1 × 1 convolution branch is constructed alongside each 3 × 3 convolution to help the backbone extract richer features, and in the inference phase the 1 × 1 branch is merged into the 3 × 3 branch, so that detection precision improves without affecting inference speed. Second, two Squeeze-and-Excitation modules (SE blocks) are inserted into the Path Aggregation Network (PAN) in the neck of YOLOv5s, improving contraband detection with negligible impact on inference speed.

Disclosure of Invention

In view of the above, the present invention provides a method for detecting contraband in X-ray security inspection images based on improved YOLOv5s.

In order to achieve the purpose, the invention provides the following technical scheme:

An X-ray security inspection image contraband detection method based on improved YOLOv5s, comprising the following steps:

S1: establishing a Rep module designed on the basis of the re-parameterization idea;

S2: establishing a re-parameterization-based YOLOv5s contraband detection algorithm;

S3: improving the PAN in the neck.

Optionally, S1 specifically includes:

setting the parameters of the constructed Rep module as shown in formula (1), i.e., adding two parallel convolution branches; the information flow generated by the Rep module is expressed as y = f(x) + g(x), where f(x) and g(x) are the convolution branches implemented with 3 × 3 and 1 × 1 kernels, respectively;

$$\mathrm{Rep}(3\times3) = 3\times3\text{-BN} + 1\times1\text{-BN} \tag{1}$$

for each 3 × 3 convolution, a parallel 1 × 1 convolution branch is constructed in the training stage, each branch is batch-normalized, and the outputs are added; in the inference stage, the 1 × 1 branch is fused into the 3 × 3 branch to give a single 3 × 3 convolution branch and the parallel branch is removed, which improves the performance of the convolutional network without affecting its detection efficiency;

on the basis of the Rep module structure, the multi-branch module is converted into a single branch following the idea of RepVGG; the model conversion is performed after training is completed and comprises the following two steps:

firstly, fusing the convolution layer and the BN layer in each branch: substituting the convolution result directly into the bn formula, as shown by the left arrow in FIG. 3, the output is expressed as formula (2):

$$M^{(2)} = \mathrm{bn}\left(W^{(3)} * M^{(1)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\right) + \mathrm{bn}\left(W^{(1)} * M^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\right) \tag{2}$$

wherein $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1 \times 1 \times 1}$ denote the convolution kernels of the 3 × 3 and 1 × 1 convolution layers, respectively, and $C_1$, $C_2$ denote the numbers of input and output channels; $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ denote the accumulated mean, standard deviation, scaling factor and bias term of the BN layer after the 3 × 3 convolution, and $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ denote those of the BN layer after the 1 × 1 convolution; the input and output are denoted $M^{(1)}$ and $M^{(2)}$, respectively, and $*$ denotes the convolution operation;

here bn is the inference-phase batch normalization function, defined for $i \in [1, C_2]$ by formula (3):

$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = \left(M_{:,i,:,:} - \mu_i\right)\frac{\gamma_i}{\sigma_i} + \beta_i \tag{3}$$

simplifying formula (3) gives a convolution layer with a bias term; denoting by $\{W', b'\}$ the convolution kernel and bias term obtained by transforming $\{W, \mu, \sigma, \gamma, \beta\}$, there are:

$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = \beta_i - \frac{\mu_i \gamma_i}{\sigma_i} \tag{4}$$

so that for any $i \in [1, C_2]$, $\mathrm{bn}(W * M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (W' * M)_{:,i,:,:} + b'_i$; after the fusion is completed, a 3 × 3 convolution kernel, a 1 × 1 convolution kernel and two bias terms are obtained;

secondly, fusing the 3 × 3 convolution and the 1 × 1 convolution: adding the two bias terms gives the fused bias term; zero-padding the 1 × 1 convolution kernel to a 3 × 3 kernel and adding it to the original 3 × 3 convolution kernel gives the fused convolution kernel; letting $W_a$ and $W_b$ be two convolution kernels of the same size, by the additivity of convolution their sum is expressed as formula (5):

$$W_a * M + W_b * M = (W_a + W_b) * M \tag{5}$$

after the convolution kernels are fused, the function realized before fusion is therefore preserved;

Optionally, S2 specifically includes: introducing the Rep structure into the backbone of the YOLOv5s algorithm to obtain an upgraded backbone consisting of a series of Rep modules and C3 modules; and adjusting the PAN structure by inserting SE modules between the upper and lower detection layers of the PAN to obtain an upgraded PAN;

the Focus module slices the input image so that its number of channels is expanded 4 times, i.e., the sliced image changes from the original 3 RGB channels to 12 channels; a convolution then yields a 2× downsampled feature map without information loss; the Conv module encapsulates a convolutional layer, a BN layer and the SiLU activation function; the C3 module has basically the same structure and function as BottleneckCSP but requires fewer floating-point operations and runs faster; the SPP module concatenates max-pooling results of different sizes to fuse local and global features; UpSample is an upsampling layer that enlarges the feature map 2 times by interpolation; and the three Conv[1,1] layers in the detection head produce the final output feature maps.

Optionally, S3 specifically includes:

the SE module comprises a squeeze part and an excitation part; the first step is the squeeze stage, in which the input W × H × C feature map is compressed to 1 × 1 × C by global average pooling, so that the compressed feature map has a global receptive field; the second step is the excitation stage, which consists of two fully connected layers: the first fully connected layer has C × r neurons and the second has C neurons, where r is a scaling parameter that is adjusted to reduce the number of channels and thereby reduce the amount of computation.

Optionally, the method further comprises S4: setting evaluation metrics;

evaluating the detection performance of the detector requires considering both Precision and Recall; the mean average precision mAP at an IoU threshold of 0.5, the macro precision MP, the macro recall MR and the macro F1 are used to evaluate the performance of the network model in object detection; precision is defined by formula (6) and recall by formula (7), where TP, TN, FP and FN denote true positive, true negative, false positive and false negative, respectively;

$$P = \frac{TP}{TP + FP} \tag{6}$$

$$R = \frac{TP}{TP + FN} \tag{7}$$

the average precision AP combines precision and recall and evaluates the precision of the model in detecting a single category; mAP measures the detection precision over all categories and is obtained by averaging the APs of all categories, as defined in formula (8), where N is the number of categories; the F1 score is the harmonic mean of precision and recall, defined in formula (9), and a larger value indicates a better result;

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{8}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{9}$$

the macro precision, macro recall and macro F1 are obtained by averaging the precision, recall and F1 scores of all categories, respectively.

The beneficial effects of the invention are as follows: the invention provides an improved method for identifying contraband in X-ray security inspection images using YOLOv5s as the base model, which meets the real-time requirement of automatic contraband detection while improving detection precision. First, a re-parameterization module (Rep Block) is designed and introduced into the YOLOv5s backbone: in the training phase, a parallel 1 × 1 convolution branch is constructed alongside each 3 × 3 convolution to help the backbone extract richer features, and in the inference phase the 1 × 1 branch is merged into the 3 × 3 branch, so that detection precision improves without affecting inference speed. Second, two SE blocks are inserted into the PAN in the neck of YOLOv5s, improving contraband detection with negligible impact on inference speed.

Compared with the traditional detection method, the method has higher detection precision, and can meet the actual application requirement of contraband detection in the X-ray security inspection image.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a diagram of a YOLOv5s network architecture;

FIG. 2 is a diagram of a modified YOLOv5s network model architecture;

FIG. 3 shows the Rep Block structure and its structural re-parameterization process;

FIG. 4 shows how the fused convolution kernel is obtained by adding the zero-padded 1 × 1 kernel to the ordinary 3 × 3 convolution kernel;

FIG. 5 shows the use of the SE module in a convolutional layer: (a) ordinary convolution; (b) convolution after inserting the SE module;

FIG. 6 shows the confusion matrices of the various contraband categories under different algorithms.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

The drawings are for the purpose of illustrating the invention only and not for limiting it. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of an actual product; those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.

X-ray security image dataset

X-ray imaging has shown powerful capabilities in security inspection tasks; however, few X-ray contraband image datasets are available for research. GDXray contains 19407 images, but only a few (600) contain three types of contraband: guns, throwing stars (shuriken) and razor blades; moreover, all of its images are grayscale with simple backgrounds, which differ greatly from complex real scenes. OPIXray contains 8885 X-ray contraband images with different levels and proportions of overlap, but only one type of contraband (knives of different shapes). SIXray consists of 8929 contraband images covering multiple categories; its backgrounds are complex and dangerous items are randomly stacked and occluded, which better matches real conditions, so SIXray is selected as the experimental dataset.

YOLOv5s algorithm

The YOLOv5s algorithm consists of four parts: the input, the backbone, the neck and the detection head, as shown in FIG. 1. The input stage uses Mosaic data augmentation, adaptive anchor box calculation and adaptive image scaling, which enrich the diversity of the input data. The backbone uses Focus and CSP modules, where the CSP structure helps improve the network's feature learning ability. The neck combines a Feature Pyramid Network (FPN) with a PAN: the FPN strengthens semantic propagation through upsampling, and the PAN strengthens feature localization through downsampling. The detection head uses the Generalized Intersection over Union (GIoU) loss as the bounding-box loss function and selects bounding boxes with Non-Maximum Suppression (NMS).
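The GIoU loss and NMS step mentioned above can be illustrated with a minimal pure-Python sketch. It assumes boxes given as (x1, y1, x2, y2) corner coordinates; the helper names and the 0.45 IoU threshold are illustrative choices, not values stated in this document, and the actual YOLOv5 code operates on batched tensors rather than Python tuples.

```python
"""Minimal sketch of the GIoU bounding-box loss and greedy NMS (illustrative only)."""


def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_union_hull(a, b):
    """Return (IoU, union area, area of the smallest box C enclosing a and b)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union, union, hull


def giou_loss(pred, target):
    """GIoU loss = 1 - GIoU, where GIoU = IoU - (|C| - |A u B|) / |C|."""
    iou, union, hull = iou_union_hull(pred, target)
    return 1.0 - (iou - (hull - union) / hull)


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou_union_hull(boxes[best], boxes[i])[0] <= iou_thresh]
    return keep


if __name__ == "__main__":
    print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 0.0
    print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes -> loss above 1
    print(nms([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)], [0.9, 0.8, 0.7]))  # [0, 2]
```

Unlike plain IoU, the enclosing-box term keeps the loss informative for non-overlapping boxes, which is why the second example yields a loss greater than 1 instead of a flat value.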

The invention improves the backbone and the neck PAN structure of the YOLOv5s algorithm, respectively, to obtain a new network with better detection performance.

3. Re-parameterization-based YOLOv5s contraband detection algorithm

By improving the YOLOv5s network structure, the invention raises the algorithm's detection precision for prohibited items in security inspection: the designed Rep module strengthens the feature extraction capability of the backbone without affecting inference time, and two SE modules introduced into the neck PAN let the network extract more feature information. The improved network structure is shown in FIG. 2.

For the contraband detection problem, YOLOv5s is improved in two parts. First, the Rep structure is introduced into the backbone of the YOLOv5s algorithm to obtain an upgraded backbone consisting of a series of Rep modules and C3 modules. Second, the PAN structure is adjusted by inserting SE modules between the upper and lower detection layers of the PAN to obtain an upgraded PAN. The new algorithm enriches the backbone features, improving model performance and detection results, and refines the information enhanced by the neck PAN, with negligible impact on inference time; the improved algorithm structure is shown in FIG. 2. The Focus module slices the input image so that its number of channels is expanded 4 times, i.e., the sliced image changes from the original 3 RGB channels to 12 channels; a convolution then yields a 2× downsampled feature map without information loss. The Conv module encapsulates a convolutional layer, a BN layer and the SiLU activation function. The C3 module has basically the same structure and function as BottleneckCSP but requires fewer floating-point operations and runs faster. The SPP module concatenates max-pooling results of different sizes to fuse local and global features. UpSample is an upsampling layer that enlarges the feature map 2 times by interpolation. Finally, the three Conv[1,1] layers in the detection head produce the final output feature maps.
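The slicing step of the Focus module can be sketched in pure Python to show why no information is lost: every pixel of the H × W image lands in exactly one of four half-resolution slices, so 3 channels become 12. The nested-list layout image[c][y][x] and the slice ordering (chosen here to mirror the ordering used by YOLOv5's Focus layer) are assumptions of this sketch; the convolution that follows the slicing is omitted.

```python
"""Sketch of the space-to-depth slicing performed by a Focus-style layer."""


def focus_slice(image):
    """Slice image[c][y][x] (even H and W) into 4*C channels of size H/2 x W/2.

    Every pixel lands in exactly one output channel, so no information is lost;
    in the Focus module a convolution is applied to the result afterwards.
    """
    # (row offset, column offset) pairs; the order is assumed to mirror YOLOv5's
    # x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2].
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
    sliced = []
    for dy, dx in offsets:
        for channel in image:
            # Take every second row starting at dy and every second column starting at dx.
            sliced.append([row[dx::2] for row in channel[dy::2]])
    return sliced


if __name__ == "__main__":
    # A 3-channel (RGB-like) 4 x 4 image whose pixel value encodes its position.
    rgb = [[[100 * c + 4 * y + x for x in range(4)] for y in range(4)] for c in range(3)]
    out = focus_slice(rgb)
    print(len(out), len(out[0]), len(out[0][0]))  # 12 2 2
    print(out[0])  # [[0, 2], [8, 10]]
```

Because slicing only rearranges pixels, the subsequent convolution sees the full image content at half the spatial resolution.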

3.1 Rep module designed on the re-parameterization idea

To extract richer features and improve detection performance, researchers have designed many novel multi-branch structures. Such components can improve precision, but multi-branch structures are hard to deploy and customize, increase GPU memory consumption and slow down inference. The invention therefore uses the re-parameterization idea to design the Rep module, which improves model precision while limiting the impact on inference speed by decoupling the training stage from the inference stage. The parameters of the constructed Rep module are set as shown in equation (1), i.e., two parallel convolution branches are added. The information flow generated by the Rep module is denoted y = f(x) + g(x), where f(x) and g(x) are the convolution branches implemented with a 3 × 3 kernel and a 1 × 1 kernel, respectively.

$$\mathrm{Rep}(3\times3) = 3\times3\text{-BN} + 1\times1\text{-BN} \tag{1}$$

As shown in FIG. 3, for each 3 × 3 convolution a parallel 1 × 1 convolution branch is constructed in the training phase; each branch is batch-normalized and the two outputs are added. In the inference phase, the 1 × 1 branch is fused into the 3 × 3 branch to give a single 3 × 3 convolution branch and the parallel branch is removed, so the performance of the convolutional network improves without affecting detection efficiency.
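The training-time structure of formula (1) can be sketched as follows. To keep it short, this hypothetical example assumes a single input and a single output channel (C1 = C2 = 1) stored as nested lists, and uses inference-style BN with fixed statistics; in a real network each output channel has its own BN parameters and the convolution also sums over input channels.

```python
"""Training-time Rep block, y = f(x) + g(x), for one input and one output channel."""


def conv2d_same(x, kernel, bias=0.0):
    """Stride-1 2D cross-correlation with zero padding that keeps the H x W size."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2  # padding 1 for a 3x3 kernel, 0 for a 1x1 kernel
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    yi, xj = i + a - ph, j + b - pw
                    if 0 <= yi < h and 0 <= xj < w:  # positions outside are zero padding
                        acc += kernel[a][b] * x[yi][xj]
            row.append(acc)
        out.append(row)
    return out


def batch_norm(fmap, mu, sigma, gamma, beta):
    """BN of one channel with fixed statistics: (m - mu) * gamma / sigma + beta."""
    return [[(v - mu) * gamma / sigma + beta for v in row] for row in fmap]


def rep_block_train(x, k3, bn3, k1, bn1):
    """Formula (1): a 3x3-BN branch f(x) plus a parallel 1x1-BN branch g(x)."""
    f = batch_norm(conv2d_same(x, k3), *bn3)
    g = batch_norm(conv2d_same(x, k1), *bn1)
    return [[u + v for u, v in zip(rf, rg)] for rf, rg in zip(f, g)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    identity_3x3 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    plain_bn = (0.0, 1.0, 1.0, 0.0)  # (mu, sigma, gamma, beta) that leaves values unchanged
    print(rep_block_train(x, identity_3x3, plain_bn, [[2.0]], plain_bn))  # [[3.0, 6.0], [9.0, 12.0]]
```

With an identity 3 × 3 kernel and a 1 × 1 kernel of weight 2, the two branches contribute x and 2x, so the block outputs 3x, which makes the additive structure y = f(x) + g(x) easy to see.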

On the basis of the structure of the Rep module, the multi-branch module can be converted into a single branch based on the idea of RepVGG. The conversion of the model (i.e. multi-branch fusion) is performed after the training is completed, and comprises the following two steps:

(1) First, the convolution layer and the BN layer in each branch are fused. Substituting the convolution result directly into the bn equation, as shown by the left arrow in FIG. 3, the output can be expressed as equation (2):

$$M^{(2)} = \mathrm{bn}\left(W^{(3)} * M^{(1)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\right) + \mathrm{bn}\left(W^{(1)} * M^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\right) \tag{2}$$

where $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1 \times 1 \times 1}$ denote the convolution kernels of the 3 × 3 and 1 × 1 convolution layers, respectively, and $C_1$ and $C_2$ are the numbers of input and output channels. $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ are the accumulated mean, standard deviation, scaling factor and bias term of the BN layer after the 3 × 3 convolution, and $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ are the corresponding quantities of the BN layer after the 1 × 1 convolution. The input and output are denoted $M^{(1)}$ and $M^{(2)}$, respectively, and $*$ denotes the convolution operation.

Here bn is the inference-phase batch normalization function; for $i \in [1, C_2]$ it is given by equation (3):

$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = \left(M_{:,i,:,:} - \mu_i\right)\frac{\gamma_i}{\sigma_i} + \beta_i \tag{3}$$

Equation (3) can be further simplified into a convolution layer with a bias term. Denoting by $\{W', b'\}$ the convolution kernel and bias term obtained by transforming $\{W, \mu, \sigma, \gamma, \beta\}$, we have:

$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = \beta_i - \frac{\mu_i \gamma_i}{\sigma_i} \tag{4}$$

It can then be verified that for any $i \in [1, C_2]$, $\mathrm{bn}(W * M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (W' * M)_{:,i,:,:} + b'_i$. Therefore, after this fusion a 3 × 3 convolution kernel, a 1 × 1 convolution kernel and two bias terms are obtained.
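Step (1), folding BN into the preceding convolution as in formula (4), can be checked numerically with a small pure-Python sketch. It assumes a single channel, a bias-free convolution and arbitrary illustrative BN statistics; it compares BN applied after the convolution against a single convolution that uses the folded kernel and bias.

```python
"""Step (1): fold inference-time BN into the preceding bias-free convolution (formula (4))."""


def conv2d_same(x, kernel, bias=0.0):
    """Stride-1 single-channel 2D cross-correlation with zero 'same' padding."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    yi, xj = i + a - ph, j + b - pw
                    if 0 <= yi < h and 0 <= xj < w:
                        acc += kernel[a][b] * x[yi][xj]
            row.append(acc)
        out.append(row)
    return out


def batch_norm(fmap, mu, sigma, gamma, beta):
    """Formula (3) for one channel: (m - mu) * gamma / sigma + beta."""
    return [[(v - mu) * gamma / sigma + beta for v in row] for row in fmap]


def fuse_conv_bn(kernel, mu, sigma, gamma, beta):
    """Formula (4): W' = (gamma / sigma) * W and b' = beta - mu * gamma / sigma."""
    scale = gamma / sigma
    return [[scale * w for w in row] for row in kernel], beta - mu * scale


def max_abs_diff(a, b):
    """Largest element-wise difference between two equally shaped feature maps."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


if __name__ == "__main__":
    x = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [3.0, 1.0, -2.0]]
    kernel = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
    mu, sigma, gamma, beta = 0.4, 1.3, 0.9, -0.2  # illustrative BN statistics
    reference = batch_norm(conv2d_same(x, kernel), mu, sigma, gamma, beta)
    fused_kernel, fused_bias = fuse_conv_bn(kernel, mu, sigma, gamma, beta)
    print(max_abs_diff(reference, conv2d_same(x, fused_kernel, fused_bias)) < 1e-12)  # True
```

The two outputs agree up to floating-point rounding, which is the property the identity after equation (4) expresses.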

(2) The 3 × 3 convolution and the 1 × 1 convolution are fused, i.e., the step indicated by the right arrow in FIG. 3. The two bias terms are added to obtain the fused bias term; the 1 × 1 convolution kernel is zero-padded to a 3 × 3 kernel and then added to the original 3 × 3 kernel to obtain the fused convolution kernel, as shown in FIG. 4. Because the 3 × 3 convolution uses a padding of 1, the zero-padded kernel produces exactly the same output as the original 1 × 1 convolution. Let $W_a$ and $W_b$ be two convolution kernels of the same size; by the additivity of convolution, applying them to the same input and adding the results can be expressed as equation (5):

$$W_a * M + W_b * M = (W_a + W_b) * M \tag{5}$$

Thus, after fusion the single convolution kernel realizes the same function as the two branches did before fusion.
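The complete conversion, step (1) followed by step (2), can be verified end to end with a hypothetical single-channel example: each conv-BN pair is folded, the 1 × 1 kernel is zero-padded to 3 × 3, kernels and biases are added, and the resulting single 3 × 3 convolution is compared with the two-branch training-time block. The function names and the BN statistics are illustrative assumptions, not the authors' implementation.

```python
"""Step (2): merge the two fused branches into one 3x3 convolution and check equivalence."""


def conv2d_same(x, kernel, bias=0.0):
    """Stride-1 single-channel 2D cross-correlation with zero 'same' padding."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2  # padding 1 for 3x3, 0 for 1x1
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    yi, xj = i + a - ph, j + b - pw
                    if 0 <= yi < h and 0 <= xj < w:
                        acc += kernel[a][b] * x[yi][xj]
            row.append(acc)
        out.append(row)
    return out


def batch_norm(fmap, mu, sigma, gamma, beta):
    return [[(v - mu) * gamma / sigma + beta for v in row] for row in fmap]


def fuse_conv_bn(kernel, mu, sigma, gamma, beta):
    """Formula (4): fold BN into the kernel and a bias term."""
    scale = gamma / sigma
    return [[scale * w for w in row] for row in kernel], beta - mu * scale


def pad_1x1_to_3x3(k1):
    """Zero-pad a 1x1 kernel so that its weight sits at the centre of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, k1[0][0], 0.0], [0.0, 0.0, 0.0]]


def reparameterize(k3, bn3, k1, bn1):
    """Fold each conv-BN pair, then add the kernels (formula (5)) and the two biases."""
    w3, b3 = fuse_conv_bn(k3, *bn3)
    w1, b1 = fuse_conv_bn(pad_1x1_to_3x3(k1), *bn1)
    kernel = [[u + v for u, v in zip(r3, r1)] for r3, r1 in zip(w3, w1)]
    return kernel, b3 + b1


def rep_block_train(x, k3, bn3, k1, bn1):
    """Two-branch training-time block: 3x3-BN plus 1x1-BN."""
    f = batch_norm(conv2d_same(x, k3), *bn3)
    g = batch_norm(conv2d_same(x, k1), *bn1)
    return [[u + v for u, v in zip(rf, rg)] for rf, rg in zip(f, g)]


if __name__ == "__main__":
    x = [[0.5, -1.0, 2.0, 1.0], [1.5, 0.0, -0.5, 2.5], [3.0, 1.0, -2.0, 0.5]]
    k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
    k1 = [[0.7]]
    bn3 = (0.4, 1.3, 0.9, -0.2)  # (mu, sigma, gamma, beta), illustrative values
    bn1 = (-0.1, 0.8, 1.1, 0.3)
    w, b = reparameterize(k3, bn3, k1, bn1)
    train_out = rep_block_train(x, k3, bn3, k1, bn1)
    infer_out = conv2d_same(x, w, b)
    diff = max(abs(u - v) for ra, rb in zip(train_out, infer_out) for u, v in zip(ra, rb))
    print(diff < 1e-12)  # True
```

At inference time only `w` and `b` need to be stored, so the deployed block costs the same as a plain 3 × 3 convolution.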

3.2 Improvement of the neck PAN

The invention improves model performance by adding SE modules to the PAN in the neck of YOLOv5s. The SE module models the correlations between feature channels to learn channel attention weights, and improves accuracy by strengthening important features.

As shown in FIG. 5, the SE module consists of two parts: Squeeze and Excitation. The first step is the squeeze stage, which compresses the input W × H × C feature map to 1 × 1 × C by global average pooling, so that the compressed feature map has a global receptive field. The second step is the excitation stage, which consists of two fully connected layers: the first has C × r neurons and the second has C neurons, where r is a scaling parameter that is adjusted to reduce the number of channels and thus the computational cost.
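A minimal pure-Python sketch of an SE block on a feature map stored as fmap[c][y][x] is given below. The text describes the squeeze and the two fully connected layers; the ReLU after the first layer, the sigmoid after the second and the final channel-wise rescaling are assumptions taken from the original SE design. The hidden width is simply the number of rows of `w1`, so it can be set to C × r as in the text (or C/r in the reduction-ratio convention of the original SE paper).

```python
"""Squeeze-and-Excitation on a feature map stored as fmap[c][y][x] (illustrative sketch)."""
import math


def se_block(fmap, w1, b1, w2, b2):
    """Return the reweighted feature map and the per-channel weights.

    w1/b1: first fully connected layer (hidden x C); w2/b2: second layer (C x hidden).
    ReLU, sigmoid and the final channel scaling are assumed from the original SE design.
    """
    # Squeeze: global average pooling compresses W x H x C to 1 x 1 x C.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation, first FC layer + ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(ws, z)) + b) for ws, b in zip(w1, b1)]
    # Excitation, second FC layer (C neurons) + sigmoid: one weight in (0, 1) per channel.
    s = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(ws, hidden)) + b)))
         for ws, b in zip(w2, b2)]
    # Scale: multiply every value of channel c by s[c].
    out = [[[v * sc for v in row] for row in ch] for ch, sc in zip(fmap, s)]
    return out, s


if __name__ == "__main__":
    fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]  # C = 2 channels of 2 x 2
    out, weights = se_block(fmap, w1=[[0.0, 1.0]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0])
    print(weights)  # first channel is emphasised, second is suppressed
```

The block adds only two small fully connected layers per insertion point, which is consistent with the small parameter increase the document reports.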

4 Experiments and results analysis

4.1 Dataset

The proposed algorithm was evaluated on the public SIXray dataset, which contains 8929 annotated contraband images. Compared with other datasets, SIXray has more categories and a relatively larger data volume. The dataset was randomly divided into two parts at a ratio of approximately 1:4: 20% of the images (1781) form the test set and the remainder (7148) form the training set. The scissors category was excluded from the experiments because it has too few samples and would aggravate the class imbalance. The detailed distribution of the categories in the dataset is shown in Table 1. In addition, many images in the dataset contain multiple objects.

Table 1 distribution of each category in the SIXray dataset.

Many images contain multiple contraband items, so the total number of items is much higher than the number of images.

4.2 Evaluation metrics

Evaluating a detector requires considering both Precision and Recall. The performance of the network model is evaluated with the mean Average Precision at an IoU threshold of 0.5 (mAP), Macro Precision (MP), Macro Recall (MR) and Macro F1 (Macro-F1, MF1). Precision is defined in formula (6) and recall in formula (7), where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.

$$P = \frac{TP}{TP + FP} \tag{6}$$

$$R = \frac{TP}{TP + FN} \tag{7}$$
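How TP, FP and FN are obtained for one class at IoU ≥ 0.5, and how formulas (6) and (7) follow from them, can be sketched as below. The greedy, confidence-ordered matching is a common convention assumed here, not a procedure stated in the document; note that TN does not enter either formula, since a detector has no meaningful count of correctly rejected background regions.

```python
"""Counting TP, FP and FN for one class at IoU >= 0.5, then precision and recall."""


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def count_tp_fp_fn(preds, gts, iou_thresh=0.5):
    """preds: list of (box, confidence); gts: list of ground-truth boxes.

    Each prediction, in order of decreasing confidence, is matched to the unmatched
    ground-truth box with the highest IoU; a match needs IoU >= iou_thresh.
    """
    matched = set()
    tp = fp = 0
    for box, _conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best, best_iou = None, iou_thresh
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in matched and overlap >= best_iou:
                best, best_iou = j, overlap
        if best is None:
            fp += 1  # no remaining ground truth explains this detection
        else:
            matched.add(best)
            tp += 1
    fn = len(gts) - len(matched)  # ground-truth boxes that were never detected
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Formulas (6) and (7): P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [((0, 0, 10, 10), 0.9), ((1, 1, 10, 10), 0.8), ((50, 50, 60, 60), 0.7)]
    tp, fp, fn = count_tp_fp_fn(preds, gts)
    print(tp, fp, fn)  # 1 2 1
    print(precision_recall(tp, fp, fn))
```

The second prediction overlaps an already-matched object, so it counts as a false positive (a duplicate detection), which is exactly the case NMS is meant to suppress.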

The Average Precision (AP) combines precision and recall and evaluates how precisely the model detects a single class. The mAP measures the detection precision over all classes and is obtained by averaging the APs of all classes, as defined in formula (8), where N is the number of classes. The F1 score is the harmonic mean of precision and recall, defined in formula (9); larger values indicate better results.

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{8}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{9}$$
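One common way to compute AP, the area under the precision-recall curve with all-point interpolation, is sketched below together with formulas (8) and (9). The interpolation scheme is an assumption: the document does not state which one was used, and the toolkit used in the experiments may interpolate the curve differently (for example, at a fixed number of recall points).

```python
"""Average precision for one class, mAP over classes, and the F1 score."""


def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve using all-point interpolation.

    scores: confidence of every detection of the class; is_tp: whether each detection
    matched a ground-truth box at IoU >= 0.5; num_gt: number of ground-truth boxes.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for i in order:  # sweep the confidence threshold from high to low
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((recalls[k] - recalls[k - 1]) * precisions[k] for k in range(1, len(recalls)))


def mean_ap(ap_per_class):
    """Formula (8): mAP is the mean of the per-class APs."""
    return sum(ap_per_class) / len(ap_per_class)


def f1_score(p, r):
    """Formula (9): F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2)
    print(round(ap, 4))  # 0.8333
    print(mean_ap([ap, 1.0]))
    print(round(f1_score(0.8, 0.5), 4))  # 0.6154
```

Because F1 is a harmonic mean, it is pulled toward the smaller of P and R, so a detector cannot score well by maximizing only one of them.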

Macro precision, macro recall and macro F1 are obtained by averaging the precision, recall and F1 scores of all categories, respectively. In addition, confusion matrices can be used to assist the analysis of results.
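The macro averages described above can be sketched as unweighted means of the per-class scores. The per-class (TP, FP, FN) counts in the usage example are hypothetical and are not results from this document.

```python
"""Macro precision, macro recall and macro F1 as unweighted means over classes."""


def macro_metrics(counts):
    """counts maps a class name to its (TP, FP, FN); returns (MP, MR, MF1)."""
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        # Macro F1 here averages the per-class F1 scores, as described in the text.
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Hypothetical counts, not results from the document.
    mp, mr, mf1 = macro_metrics({"gun": (9, 1, 1), "knife": (5, 5, 5)})
    print(round(mp, 3), round(mr, 3), round(mf1, 3))  # 0.7 0.7 0.7
```

Every class carries equal weight in a macro average, so a poorly detected minority class lowers MP, MR and MF1 as much as a frequent class would.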

4.3 Analysis of experimental results

The experimental hardware consisted of an Intel Core(TM) i9-10920X processor and a GeForce RTX 3090 graphics card, and the software environment was PyTorch 1.8.0. During training, the network parameters were optimized with SGD with momentum, and the number of epochs was set to 200. The input image size was 640 × 640 and the batch size was 64. All other parameters were kept identical across the models.

The invention reports experimental results for four models: the original YOLOv5s algorithm; YOLOv5s with its backbone modified by the Rep module (hereinafter Rep-YOLOv5s); YOLOv5s with SE modules inserted into the neck PAN (hereinafter SE-YOLOv5s); and the algorithm combining both modifications (hereinafter RepSE-YOLOv5s).

FIG. 6 shows the confusion matrices of the different algorithms, where the diagonal values represent the True Positive Rate (TPR) of each class and the sum of the off-diagonal values in each column represents the False Negative Rate (FNR) of that class. The figure shows that the diagonal values of the RepSE-YOLOv5s confusion matrix are on average higher than those of the other matrices, and this better overall distribution indicates that the algorithm identifies prohibited items more accurately.
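Reading per-class TPR and FNR from a confusion matrix can be sketched as follows. The sketch assumes raw counts with rows as predicted classes and columns as true classes, normalized per column, which matches the description that each column belongs to one class; any background row or column is ignored here, and the 2-class matrix is hypothetical.

```python
"""Per-class TPR and FNR from a confusion matrix whose columns are the true classes."""


def column_rates(cm):
    """cm[i][j] counts objects of true class j predicted as class i.

    After normalising each column, the diagonal entry is the class TPR and the
    sum of the off-diagonal entries in that column is the class FNR.
    """
    n = len(cm)
    tpr, fnr = [], []
    for j in range(n):
        col_total = sum(cm[i][j] for i in range(n))
        if col_total == 0:  # class absent from the ground truth
            tpr.append(0.0)
            fnr.append(0.0)
            continue
        tpr.append(cm[j][j] / col_total)
        fnr.append(sum(cm[i][j] for i in range(n) if i != j) / col_total)
    return tpr, fnr


if __name__ == "__main__":
    # Hypothetical 2-class matrix: rows = predicted class, columns = true class.
    cm = [[8, 1],
          [2, 9]]
    print(column_rates(cm))  # ([0.8, 0.9], [0.2, 0.1])
```

With column normalization, TPR and FNR of each class sum to 1, so a stronger diagonal directly means fewer missed or misclassified objects of that class.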

By comparison, the algorithm improves the detection of wrenches the most. Overall, however, knives have the lowest true positive rate. A possible reason is that the features of the knife class in the dataset are not distinctive: the knife category includes wide kitchen knives, slender straight knives and small utility knives.

Table 2 compares the overall performance of the four algorithms on the SIXray dataset. First, comparing the original YOLOv5s with Rep-YOLOv5s, the table shows that the mean average precision (mAP) of Rep-YOLOv5s is 1.1% higher than that of the original network, the other three evaluation metrics also improve, and the detection precision of every category increases; the Rep module strengthens the feature extraction capability of the backbone and clearly helps detection performance. Next, comparing the original YOLOv5s with SE-YOLOv5s shows that adding SE modules to the feature-enhancement part of the network (the neck) also improves the detection performance of the whole algorithm; in particular, the macro recall (MR) rises by 2.5%.

Finally, RepSE-YOLOv5s, which combines both modifications, is 2.6%, 1.0% and 1.4% higher in mAP than the original YOLOv5s, Rep-YOLOv5s and SE-YOLOv5s, respectively. It also has clear advantages over the original network and the other two networks in macro precision (MP), macro recall (MR) and macro F1 (MF1). The table further shows that the detection precision of every category improves; for wrenches in particular, it rises from 86.4% to 91.3%, an increase of 4.9%. These evaluations show that RepSE-YOLOv5s detects all types of contraband in X-ray security inspection images more accurately and has potential for further application in real scenarios.

TABLE 2 Performance comparison of detection algorithms

In addition, the present invention compares the number of parameters (one hundred thousand, M), the model size (megabyte, MB) and the time (milliseconds, ms) required to detect a single image for the original YOLOv5s algorithm, the Rep-YOLOv5s algorithm, the SE-YOLOv5s algorithm, and the RepsE-YOLOv5s algorithm, as shown in Table 3. The time required to detect a single image is the result of testing the data in the test set on the GPU, including data pre-processing, model reasoning, post-processing and non-maximum suppression (NMS). The average pre-treatment time was 0.1ms and the average NMS time was 0.8 ms per graph.

As Table 3 shows, RepSE-YOLOv5s increases the number of parameters by only 0.28% and the model size by 2.82% (0.4 MB), while the detection time remains almost unchanged.
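As a rough consistency check, the two reported model-size figures imply the approximate size of the baseline model. This assumes that the 2.82% increase is measured against the original YOLOv5s and that 0.4 MB is a rounded value, so the result is only an estimate, not a figure from the patent:

```latex
\Delta S = 0.0282\,S_0 \approx 0.4\ \text{MB}
\quad\Longrightarrow\quad
S_0 \approx \frac{0.4\ \text{MB}}{0.0282} \approx 14.2\ \text{MB},
\qquad
S_{\text{RepSE}} = S_0 + \Delta S \approx 14.6\ \text{MB}
```

Because 0.4 MB could be any value from 0.35 to 0.45 MB before rounding, the implied baseline size lies roughly between 12.4 and 16.0 MB.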

TABLE 3 Comparison of the number of parameters, model size and detection time of different algorithms

The improved algorithm also detects objects that are difficult to identify or that lie against complex backgrounds more effectively.

Current methods do not detect contraband in X-ray security inspection images with sufficient precision. To address this, the invention applies YOLOv5s, an algorithm with a small model and high detection speed, to this task and proposes the RepSE-YOLOv5s detection algorithm. RepSE-YOLOv5s improves detection precision with a negligible effect on detection speed. First, a Rep module designed with the structural re-parameterization idea enriches the features of the YOLOv5s backbone network. Then, two SE modules are inserted into the PAN in the neck of YOLOv5s, which improves the algorithm's detection of prohibited items. Finally, experiments on the SIXray dataset compare the contraband detection performance of four algorithm models. Compared with the original algorithm, the new algorithm improves the mean average precision (mAP), macro precision (MP), macro recall (MR) and macro F1 (MF1) by 2.6%, 2.0%, 4.0% and 3.0%, respectively. The detection time stays at 2.6 ms per image, so contraband detection accuracy improves with almost no additional detection time.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A method for detecting contraband in X-ray security inspection images based on improved YOLOv5s, characterized by comprising the following steps:

S1: establishing a Rep module designed based on the re-parameterization idea;

S2: establishing a re-parameterization-based YOLOv5s contraband detection algorithm;

S3: improving the PAN of the neck.

2. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s according to claim 1, wherein S1 specifically comprises:

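$\mathrm{Rep}(3\times3) = (3\times3\text{-BN}) + (1\times1\text{-BN}) \qquad (1)$

where $W$ denotes a convolution kernel, $M$ denotes the input feature map, and $*$ denotes the convolution operation;

substituting the parameters into formula (2) gives the result in formula (3), where $\mathrm{bn}$ is the inference-phase batch normalization function with accumulated mean $\mu$, standard deviation $\sigma$, learned scale $\gamma$ and learned bias $\beta$, and $i \in [1, C_2]$ indexes the output channels;

for any $i \in [1, C_2]$, $\mathrm{bn}(W * M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (W' * M)_{:,i,:,:} + b'_i$; after the fusion is completed, a 3×3 convolution kernel, a 1×1 convolution kernel and two bias terms are obtained;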
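The fusion in claim 2 can be checked numerically. The sketch below is a minimal NumPy illustration, not the patent's code. It assumes the standard inference-time BN formula $\mathrm{bn}(M,\mu,\sigma,\gamma,\beta)_{:,i,:,:} = (M_{:,i,:,:}-\mu_i)\gamma_i/\sigma_i + \beta_i$. From this formula, the fused parameters are $W'_{i,:,:,:} = (\gamma_i/\sigma_i)W_{i,:,:,:}$ and $b'_i = \beta_i - \mu_i\gamma_i/\sigma_i$. The final step is also an assumption, following RepVGG-style re-parameterization: the fused 1×1 kernel is zero-padded to 3×3, then both kernels and both biases are summed into a single 3×3 convolution. The text above stops after the two fused kernels and bias terms. The helper names `conv2d`, `bn`, `fuse_conv_bn` and `merge_rep` are hypothetical.

```python
import numpy as np


def conv2d(x, w, b=None, pad=0):
    """Naive stride-1 convolution (cross-correlation, as in deep-learning frameworks).

    x: (C1, H, W) input map M; w: (C2, C1, k, k) kernel W; b: optional (C2,) bias.
    """
    c2, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out, w_out = xp.shape[1] - k + 1, xp.shape[2] - k + 1
    out = np.zeros((c2, h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            patch = xp[:, r:r + k, c:c + k]  # (C1, k, k) window
            out[:, r, c] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    if b is not None:
        out += b[:, None, None]
    return out


def bn(y, mu, sigma, gamma, beta):
    """Inference-phase BN applied per output channel i."""
    return (y - mu[:, None, None]) * (gamma / sigma)[:, None, None] + beta[:, None, None]


def fuse_conv_bn(w, mu, sigma, gamma, beta):
    """Fold a BN layer into the preceding conv, returning (W', b')."""
    scale = gamma / sigma
    return w * scale[:, None, None, None], beta - mu * scale


def merge_rep(w3, b3, w1, b1):
    """Pad the fused 1x1 kernel to 3x3 (weight at the centre tap) and add the branches."""
    w1_as_3x3 = np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))
    return w3 + w1_as_3x3, b3 + b1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c1, c2 = 4, 6
    x = rng.standard_normal((c1, 8, 8))
    w3 = rng.standard_normal((c2, c1, 3, 3))
    w1 = rng.standard_normal((c2, c1, 1, 1))

    def random_bn():
        # (mu, sigma, gamma, beta); sigma kept positive like a running std.
        return (rng.standard_normal(c2), rng.uniform(0.5, 2.0, c2),
                rng.standard_normal(c2), rng.standard_normal(c2))

    bn3, bn1 = random_bn(), random_bn()
    # Training-time Rep block, Eq. (1): 3x3-BN branch + 1x1-BN branch.
    y_train = bn(conv2d(x, w3, pad=1), *bn3) + bn(conv2d(x, w1), *bn1)
    # Inference-time block: one 3x3 conv with bias.
    w_fused, b_fused = merge_rep(*fuse_conv_bn(w3, *bn3), *fuse_conv_bn(w1, *bn1))
    y_infer = conv2d(x, w_fused, b_fused, pad=1)
    print(np.allclose(y_train, y_infer))  # True
```

The check shows why re-parameterization adds no inference cost. The two-branch block used during training collapses into one ordinary 3×3 convolution, so only the training-time graph is larger.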

3. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s according to claim 2, wherein S2 specifically comprises: introducing the Rep structure into the backbone network of the YOLOv5s algorithm to obtain an upgraded backbone network consisting of a series of Rep modules and C3 modules; and adjusting the PAN structure by inserting an SE module between an upper detection layer and a lower detection layer of the PAN to obtain an upgraded PAN network;
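The text does not give the internals of the SE module inserted into the PAN. The sketch below assumes the standard squeeze-and-excitation design: global average pooling, a fully connected bottleneck with ReLU, a second fully connected layer with a sigmoid, and channel-wise rescaling of the input. It is a forward pass only, in NumPy. The class name `SEBlock`, the reduction ratio `r=16` and the random weights are hypothetical placeholders, not values taken from the patent.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class SEBlock:
    """Squeeze-and-Excitation gating for a single (C, H, W) feature map (forward pass only)."""

    def __init__(self, channels, r=16, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        hidden = max(1, channels // r)  # bottleneck width C/r (r is an assumed default)
        # Hypothetical random weights standing in for trained parameters.
        self.w1 = rng.standard_normal((hidden, channels)) * 0.1
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal((channels, hidden)) * 0.1
        self.b2 = np.zeros(channels)

    def __call__(self, x):
        z = x.mean(axis=(1, 2))                     # squeeze: global average pooling -> (C,)
        s = np.maximum(self.w1 @ z + self.b1, 0.0)  # excitation: FC + ReLU -> (C/r,)
        a = sigmoid(self.w2 @ s + self.b2)          # excitation: FC + sigmoid -> (C,) in (0, 1)
        return x * a[:, None, None]                 # re-weight each channel of the input


if __name__ == "__main__":
    feat = np.random.default_rng(1).standard_normal((128, 20, 20))
    out = SEBlock(128)(feat)
    print(out.shape)  # (128, 20, 20)
```

The output has the same shape as the input. This is what allows such a block to sit between two PAN detection layers without changing the surrounding architecture.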

4. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s according to claim 3, wherein S3 specifically comprises:

5. The method for detecting contraband in X-ray security inspection images based on improved YOLOv5s according to claim 4, wherein evaluation indices are set after step S4;

the detection performance of the detector must be evaluated in terms of both precision and recall; the network model is evaluated with the mean average precision (mAP), macro precision (MP), macro recall (MR) and macro F1 (MF1) at an IoU threshold of 0.5; precision is defined by formula (6), $P = \frac{TP}{TP + FP}$, and recall by formula (7), $R = \frac{TP}{TP + FN}$, where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively;
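The macro metrics can be sketched from per-class counts. This is a minimal sketch under stated assumptions. A detection is taken to count as a true positive when its IoU with a same-class ground-truth box is at least 0.5; the box-matching step itself is not shown. MP and MR are unweighted means over classes. MF1 is computed here as the harmonic mean of MP and MR, which is an assumption, since some works average per-class F1 scores instead. The function names are hypothetical, and mAP (which needs ranked detections) is not covered.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0  # formula (6)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0  # formula (7)


def macro_metrics(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """counts maps class name -> (TP, FP, FN) obtained at IoU >= 0.5.

    MF1 is the harmonic mean of MP and MR here (an assumed convention).
    """
    ps = [precision(tp, fp) for tp, fp, _ in counts.values()]
    rs = [recall(tp, fn) for tp, _, fn in counts.values()]
    mp = sum(ps) / len(ps)
    mr = sum(rs) / len(rs)
    mf1 = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    return {"MP": mp, "MR": mr, "MF1": mf1}


if __name__ == "__main__":
    # Hypothetical per-class counts, for illustration only.
    print(macro_metrics({"gun": (8, 2, 0), "knife": (6, 0, 4)}))
```

With these hypothetical counts, the gun class has P = 0.8 and R = 1.0 and the knife class has P = 1.0 and R = 0.6. The result is therefore MP = 0.9 and MR = 0.8. Macro averaging gives each class equal weight, so a rare contraband class influences MP and MR as much as a common one.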