CN117011231B - Strip steel surface defect detection method and system based on improved YOLOv5 - Google Patents

Strip steel surface defect detection method and system based on improved YOLOv5

Info

Publication number
CN117011231B
CN117011231B (application CN202310774663.8A)
Authority
CN
China
Prior art keywords
feature
module
strip steel
network
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310774663.8A
Other languages
Chinese (zh)
Other versions
CN117011231A (en)
Inventor
张永平
沈思洁
徐森
郭乃瑄
孟海涛
陈朝峰
邵星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Institute of Technology
Yancheng Institute of Technology Technology Transfer Center Co Ltd
Original Assignee
Yancheng Institute of Technology
Yancheng Institute of Technology Technology Transfer Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Institute of Technology, Yancheng Institute of Technology Technology Transfer Center Co Ltd filed Critical Yancheng Institute of Technology
Priority to CN202310774663.8A priority Critical patent/CN117011231B/en
Publication of CN117011231A publication Critical patent/CN117011231A/en
Application granted granted Critical
Publication of CN117011231B publication Critical patent/CN117011231B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a strip steel surface defect detection method and system based on improved YOLOv5. The method comprises the following steps: building an improved anchor-free YOLOv5 network; acquiring a strip steel surface image; inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result; and outputting the strip steel surface defect detection result. The method and system reduce the number of hyperparameters and improve the detection speed. To address the problems of unbalanced data and slow training, the loss function is improved: EIoU loss is adopted as the regression loss function, which is scale-invariant and directly optimizes the evaluation metric of bounding box regression, accelerating model fitting; and Focal Loss is adopted as the objectness loss to prevent network degradation caused by a large number of negative samples, improving detection accuracy on hard-to-classify samples by increasing their loss weight.

Description

Strip steel surface defect detection method and system based on improved YOLOv5
Technical Field
The invention relates to the technical field of computer data processing, in particular to a strip steel surface defect detection method and system based on improved YOLOv5.
Background
At present, strip steel is an important product of the modern steel industry. It is produced in large quantities and conveyed at high speed, which places high demands on the accuracy and speed of strip steel surface defect detection; because the production site has a complex environment and many interference factors, the detection method must also be robust and generalize well. Traditional methods such as eddy current testing, magnetic flux leakage testing and infrared testing can hardly meet the requirements of fast and accurate detection. Deep learning, which has developed rapidly in recent years, has become a research hotspot for strip steel surface defect image detection. Deep-learning-based target detection algorithms have replaced traditional target detection algorithms and become the mainstream. Existing deep learning object detection algorithms fall largely into two categories: one is the two-stage detection algorithms represented by Faster R-CNN and Mask R-CNN; the other is the one-stage detection algorithms represented by YOLO and SSD. The YOLO family balances detection accuracy and detection speed and keeps producing new reference models; it has gone through several generations, absorbed a large amount of advanced experience from other models, continuously improved detection accuracy and maintained good detection speed. Compared with earlier versions, the existing YOLOv5 algorithm greatly improves both detection speed and detection accuracy and is one of the best detection algorithms at present. However, the YOLO series of target detection algorithms, including YOLOv5, uses an anchor-based method: for each data set, initial anchor boxes are set, the network outputs prediction boxes based on the initial anchors during training, these are compared with the ground-truth boxes, and the network parameters are then updated iteratively by back-propagation. This approach has drawbacks. The anchors must be designed manually (aspect ratio, size and number), and different designs are required for different data sets, which is quite cumbersome. The anchor matching mechanism also matches objects of extreme sizes less frequently than objects of moderate size, so the network is less likely to learn these extreme samples during training. The number of anchors is large, and each anchor requires an IoU calculation, which reduces efficiency. Anchor-based algorithms also introduce non-maximum suppression (NMS), which improves detection accuracy, but its computational cost seriously limits the detection speed.
In recent years, although the basic detection framework for target detection has been established, new ideas such as Anchor-free and Transformer keep emerging, and optimization of the detection framework is still ongoing. Anchor-free methods represented by CornerNet and FCOS attempt to remove the prior boxes, reduce the number of hyperparameters, and have reached detection accuracy approaching that of anchor-based methods. Although this approach still has problems, it provides new ideas for target detection technology. An Anchor-free YOLOv5 network for detecting defects on the strip steel surface is therefore proposed. The YOLOv5s network in YOLOv5 is selected, and its network structure is modified in three ways.
Disclosure of Invention
The embodiment of the invention provides a strip steel surface defect detection method based on improved YOLOv5, which comprises the following steps:
building an improved anchor-free YOLOv5 network;
acquiring a strip steel surface image;
inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result;
and outputting the strip steel surface defect detection result.
Preferably, the Anchor-free YOLOv5 network includes: a feature map module, a feature fusion module, a convolution module and a detector module which are connected in sequence.
Preferably, the working mechanism of the Anchor-free YOLOv5 network includes: an input network module, a Backbone network module, a Neck network module and a Prediction network module which are sequentially connected.
Preferably, the input network module performs Mosaic data enhancement on the input strip steel surface image.
Preferably, the Backbone network module is provided with a Focus structure and a CSPDarknet structure for extracting feature maps of the strip steel surface image after the Mosaic data enhancement;
the CSPDarknet structure includes 5 CSP modules.
Preferably, the Neck network module adopts a BiFPN module+PAN module structure;
the output of the Backbone network module is used as the input of the BiFPN module to perform feature fusion to obtain a feature pyramid;
the PAN module first copies the lowest layer of the feature pyramid to become the bottommost layer of a new feature pyramid;
performing a downsampling operation on the bottommost layer of the new feature pyramid;
the penultimate layer of the feature pyramid is subjected to a 3x3 convolution with a stride of 2, and is added to the downsampled bottom layer via a lateral connection; the addition adopts the concat operation;
finally, a 3x3 convolution is performed to fuse the features of the addition result.
Preferably, the Prediction network module uses EIoU loss for its output.
Preferably, the Anchor-free YOLOv5 network uses EIoU as the bounding box regression loss during training.
Preferably, after feature fusion, the Neck network module uses two convolution layers to predict the three-dimensional tensor encoding the bounding box, the objectness and the class prediction, based on Anchor-free bounding box regression.
The embodiment of the invention also provides a strip steel surface defect detection system based on improved YOLOv5, which comprises:
the building module is used for building an improved anchor-free YOLOv5 network;
the acquisition module is used for acquiring the strip steel surface image;
the detection module is used for inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result;
and the output module is used for outputting the detection result of the surface defect of the strip steel.
The invention has the following beneficial effects:
(1) In order to alleviate the problems of the above-mentioned Anchor-based method, a novel anchor-free detection scheme is proposed, which reduces the number of hyperparameters and improves the detection speed.
(2) To address the problems of unbalanced data and slow training, the loss function is improved: EIoU loss is adopted as the regression loss function, which is scale-invariant and directly optimizes the evaluation metric of bounding box regression, accelerating model fitting.
(3) To prevent degradation of the network caused by a large number of negative samples, Focal Loss (FL) is used as the objectness loss. Focal Loss improves the detection accuracy of hard-to-classify samples by increasing their loss weight.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of an Anchor-free YOLOv5 network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature fusion module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a convolution module according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a detector module according to an embodiment of the present invention;
FIG. 5 is a diagram of a BiFPN structure in an embodiment of the invention;
FIG. 6 is a schematic diagram of the FPN+PAN connection mode in an embodiment of the present invention;
FIG. 7 is a diagram of the PAN architecture in an embodiment of the present invention;
FIG. 8 is a regression diagram of bounding boxes according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a method for detecting defects on a strip steel surface by improving YOLOv5 according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides a strip steel surface defect detection method based on improved YOLOv5, which is shown in fig. 1 and comprises the following steps:
building an improved anchor-free YOLOv5 network;
acquiring a strip steel surface image;
inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result;
and outputting the strip steel surface defect detection result.
The Anchor-free YOLOv5 network includes: a feature map module, a feature fusion module, a convolution module and a detector module which are connected in sequence.
The working mechanism of the Anchor-free YOLOv5 network comprises: an input network module, a Backbone network module, a Neck network module and a Prediction network module which are sequentially connected.
The input network module performs Mosaic data enhancement on the input strip steel surface image.
The Backbone network module is provided with a Focus structure and a CSPDarknet structure which are used for extracting feature maps of the strip steel surface image after the Mosaic data enhancement;
the CSPDarknet structure includes 5 CSP modules.
The Neck network module adopts a BiFPN module+PAN module structure;
the output of the Backbone network module is used as the input of the BiFPN module to perform feature fusion to obtain a feature pyramid;
the PAN module first copies the lowest layer of the feature pyramid to become the bottommost layer of a new feature pyramid;
performing a downsampling operation on the bottommost layer of the new feature pyramid;
the penultimate layer of the feature pyramid is subjected to a 3x3 convolution with a stride of 2, and is added to the downsampled bottom layer via a lateral connection; the addition adopts the concat operation;
finally, a 3x3 convolution is performed to fuse the features of the addition result.
The Prediction network module uses EIoU loss for its output.
The Anchor-free YOLOv5 network uses EIoU as the bounding box regression loss during training.
After feature fusion, the Neck network module uses two convolution layers to predict the three-dimensional tensor encoding the bounding box, the objectness and the class prediction, based on Anchor-free bounding box regression.
The working principle and the beneficial effects of the technical scheme are as follows:
although YOLOv5 achieves satisfactory results in object detection, the network divides the input image into a plurality of grid areas of the same size, then predicts the coordinate information of a plurality of bounding boxes, the confidence scores of the classifications and the probabilities of the respective categories of each bounding box for each grid area, then filters out some bounding boxes which are unlikely to be objects according to the set confidence score threshold, and finally processes out the rest of redundant bounding boxes through non-maximal suppression, thus obtaining the final detection result. There are still some problems in the application of the detection of defective images. For example, input of a 640X640 image, YOLOv5 needs to predict thousands of anchor boxes, but the number of defects of a defective image is about one or two. Too many negative samples may result in high false positive rates and low recall rates. To alleviate this problem, an Anchor-free YOLOv5 network for detection of defects on a tape surface is proposed, as shown in fig. 1.
FIG. 1 depicts the overall detection network; F_i and F_i' represent the feature maps on the bottom-up and top-down paths, respectively, and i denotes the level of the feature map. FIG. 2 shows the feature fusion module corresponding to the Agg module in fig. 1; FIG. 3 shows the convolution module corresponding to Conv in fig. 1, where n×n Conv indicates a convolution kernel of size n×n. FIG. 4 shows the detector module.
Specifically, the network divides an input image into a number of grid areas of the same size, then predicts for each grid area the coordinate information of several bounding boxes, the classification confidence scores and the probability of each class for each bounding box, then filters out bounding boxes that are unlikely to be targets according to a set confidence score threshold, and finally removes the remaining redundant bounding boxes by non-maximum suppression to obtain the final detection result.
The working mechanism of the network is described below. The network structure comprises four parts: Input, Backbone, Neck and Prediction.
The input image is subjected to Mosaic data enhancement: four pictures are used and spliced together by random scaling, random cropping and random arrangement, which greatly enriches the detection dataset; in particular, random scaling adds many small targets, making the network more robust.
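As an illustration of the Mosaic step, the following simplified sketch stitches four images around a random centre; real Mosaic augmentation also applies random scaling and cropping and remaps the box labels, which are omitted here, and the 640 output size and grey fill value 114 are assumptions:
import random
import numpy as np
import cv2

def mosaic4(images, out_size=640):
    # Place 4 images into the quadrants defined by a random mosaic centre.
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    quads = [(0, 0, cx, cy), (cx, 0, out_size, cy),
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, quads):
        w, h = x2 - x1, y2 - y1
        canvas[y1:y2, x1:x2] = cv2.resize(img, (w, h))  # stand-in for random scale/crop
    return canvas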
The Backbone network part is mainly used for feature extraction and mainly adopts the Focus structure and the CSPDarknet structure. The Focus structure, newly introduced by the original author in YOLOv5 and not present in YOLOv1-YOLOv4, processes the input image directly; Focus is essentially a slicing operation on the picture. For example, an original 608x608x3 image input into the Focus structure first becomes a 304x304x12 feature map through the slicing operation, and then becomes a 304x304x32 feature map after one convolution operation with 32 convolution kernels. CSPDarknet is a structure generated from the YOLOv3 backbone Darknet53 by drawing on the experience of CSPNet (2019), and it contains 5 CSP modules. The convolution kernel in front of each CSP module has a size of 3x3 with stride = 2, so it also serves as downsampling. Since the Backbone has 5 CSP modules and the input image is 608x608, the feature map sizes change as 608 -> 304 -> 152 -> 76 -> 38 -> 19; after passing through the CSP modules 5 times, a feature map of size 19x19 is obtained. The CSP module splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which reduces the amount of computation while maintaining accuracy. The Mish activation function is used in the Backbone of the network, and the Leaky ReLU function is used in the later parts of the network.
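The Focus slicing described above can be written compactly; the slice ordering below is an assumption (any fixed ordering of the four pixel sub-grids yields the same 12-channel result):
import torch

def focus_slice(x):
    # (B, C, H, W) -> (B, 4C, H/2, W/2): e.g. 3x608x608 -> 12x304x304,
    # after which YOLOv5 applies a convolution (e.g. 32 kernels -> 32x304x304).
    return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                      x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

print(focus_slice(torch.randn(1, 3, 608, 608)).shape)  # torch.Size([1, 12, 304, 304])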
The Neck network performs feature fusion. It adopts a BiFPN+PAN structure, and a CSP structure is also added to strengthen the network's feature fusion capability. The output of the backbone network is lightly processed (the number of channels is adjusted) and used as the input of the BiFPN module for feature fusion, giving a final feature pyramid with richer semantics and details. The layers of the feature pyramid have different resolutions, so different feature layers are used to detect objects of different scales, but they all share one detection head. The detection head is divided into a classification subnet and a regression subnet. The classification subnet predicts K object-class probabilities for each anchor point, where K is the number of object classes in the training dataset. The regression subnet predicts a 4-dimensional class-agnostic offset for each anchor point whose confidence exceeds a threshold, representing the distances from the anchor point to the left, top, right and bottom of the prediction box. Introducing BiFPN is one of the improvements of the present application; the Neck part of the original YOLOv5 structure is FPN+PAN. BiFPN stands for bi-directional feature pyramid network. When fusing features with different resolutions, they are first resized to the same resolution and then aggregated. Since different inputs have different resolutions, their contributions to the output are usually unequal. Therefore, an additional weight is added to each input so that the network learns the importance of each input feature. The weighting method is as follows:
O = Σ_i [ω_i / (ε + Σ_j ω_j)] · I_i
wherein O represents the fused result, I_i is an input, ω_i and ω_j are learnable weights to which a ReLU is applied at each update to ensure that their values are not less than 0, ε = 0.0001 is a small constant whose main function is to prevent the denominator from being equal to 0, ε + Σ_j ω_j is the normalizing weight, and ω_i · I_i represents a weighted feature. For ease of understanding, the two level-5 fusion features in BiFPN are described in terms of the following quantities:
P5_in denotes the input feature of the fifth layer on the top-down path, P5_td denotes the intermediate feature of the fifth layer on the top-down path, and P6_out denotes the output feature of the sixth layer on the bottom-up path; ω and ω′ are learning weights, C represents a depthwise separable convolution operation, and R represents a resolution-matching up-sampling or down-sampling operation.
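A minimal PyTorch sketch of this fast normalized fusion is given below; the module form, the two-input case and the channel size are illustrative assumptions, and the inputs are assumed to have already been resized to a common resolution by R:
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    # O = sum_i(w_i * I_i) / (eps + sum_j w_j), with ReLU keeping each w_i >= 0.
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)                  # keep the learnable weights non-negative
        w = w / (self.eps + w.sum())            # normalize
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = WeightedFusion(n_inputs=2)
p5_in = torch.randn(1, 64, 20, 20)
p6_up = torch.randn(1, 64, 20, 20)              # already resized to match p5_in
p5_td = fuse([p5_in, p6_up])                    # an intermediate fused feature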
As shown in fig. 5-6, a bottom-up feature pyramid is added after the BiFPN, which contains two PAN structures. The PAN first copies the lowest layer of the feature pyramid to become the bottommost layer of a new feature pyramid. The bottommost layer of the new feature pyramid undergoes a downsampling operation, and the penultimate layer of the feature pyramid then undergoes a 3x3 convolution with a stride of 2; the result is added to the downsampled bottom layer via a lateral connection. Finally, a 3x3 convolution is performed to fuse their features. In this combined operation, BiFPN conveys strong semantic features from top to bottom while PAN conveys strong localization features from bottom to top; the two complement each other and aggregate parameters for different detection layers from different backbone layers.
As shown in fig. 7, the final addition in the PAN adopts the concat operation, which fuses features in series by directly connecting the two features: if the dimensions of the two input features x and y are p and q, the dimension of the output feature z is p+q. A new feature aggregation mode is used, which can generate more suitable anchor boxes and obtain better performance; it consists of the Agg and Conv modules, as shown in fig. 2 and fig. 3. The top-down feature maps F3′ and F4′ of the third and fourth layers are obtained from formulas (1) and (2).
F4′ = Conv(F4 + U(F5′))    (1)
F3′ = Conv(F3 + U(F4′))    (2)
In formula (1), U denotes that F5′ is first upsampled by 2x and then passed through a 1x1 convolution kernel so that it has the same shape as F4. U(F5′) is then added to F4 and fed to the Conv module to obtain F4′. Formula (2) works in the same way as formula (1). F3 and F4 denote input feature maps, and F3′, F4′ and F5′ denote the output feature maps after feature fusion. The spatial resolutions of F4′ and F3′ are medium and highest, respectively, and are used for detecting medium and small targets. The strip steel surface defects studied here belong to small targets.
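A minimal PyTorch sketch of formula (1) follows; the channel sizes, the nearest-neighbour upsampling and the composition of the Conv block (3x3 convolution, batch norm, SiLU) are assumptions made for illustration:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Agg(nn.Module):
    # Sketch of formula (1): F4' = Conv(F4 + U(F5')); U = 2x upsample + 1x1 conv
    # so that the upsampled map has the same shape as F4.
    def __init__(self, c_high, c_low):
        super().__init__()
        self.reduce = nn.Conv2d(c_high, c_low, kernel_size=1)
        self.conv = nn.Sequential(nn.Conv2d(c_low, c_low, 3, padding=1),
                                  nn.BatchNorm2d(c_low), nn.SiLU())

    def forward(self, f4, f5_prime):
        up = self.reduce(F.interpolate(f5_prime, scale_factor=2, mode="nearest"))
        return self.conv(f4 + up)

f5_prime = torch.randn(1, 512, 20, 20)
f4 = torch.randn(1, 256, 40, 40)
f4_prime = Agg(c_high=512, c_low=256)(f4, f5_prime)   # 1 x 256 x 40 x 40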
The Prediction output part of the network structure is modified to use EIoU loss, which will be explained in detail later.
Anchor-free bounding box regression
As shown in fig. 8, after Neck feature fusion in the YOLOv5s network, two convolution layers are applied to predict the three-dimensional tensor encoding the bounding box, the objectness and the class prediction. Here, bounding box regression is performed using the AF (anchor-free) scheme instead of the AB (anchor-based) scheme, which has three disadvantages. First, the anchor size in YOLOv5 is a hyperparameter that is difficult to set in advance. Second, the AB scheme introduces multiple anchors, which increases the number of negative samples and thus aggravates the imbalance between positive and negative samples. Third, detection performance depends strongly on the anchor sizes, and improper anchor sizes may degrade performance. This study therefore uses an anchor-free scheme for bounding box regression, as shown in the detector of fig. 4; the bounding box regression diagram is shown in fig. 8. In the anchor-free scheme, no anchor size needs to be preset. The offset-related parameters (t_x1, t_y1, t_x2, t_y2) of the prediction box are predicted for each coordinate (x_c, y_c), and the position of the prediction box (x1, y1, x2, y2) is then calculated as follows:
x1 = Stride_i · (x_c + 0.5 − e^{t_x1})
y1 = Stride_i · (y_c + 0.5 − e^{t_y1})
x2 = Stride_i · (x_c + 0.5 + e^{t_x2})
y2 = Stride_i · (y_c + 0.5 + e^{t_y2})
wherein (x1, y1) and (x2, y2) represent the top-left and bottom-right points of the prediction box, respectively, and (x_c, y_c) is a coordinate in the feature map. The reason 0.5 is added to (x_c, y_c) is to place the reference point at the center of the corresponding grid cell so that the detection is symmetric about it. It should be noted that the quantities combined with (x_c + 0.5, y_c + 0.5) are (e^{t_x1}, e^{t_y1}) and (e^{t_x2}, e^{t_y2}), not (t_x1, t_y1, t_x2, t_y2) themselves; using the exponential narrows the range of the bounding box regression and thereby improves the regression accuracy. Stride_i is the position conversion ratio: for box predictions generated at the different scales (i = 3, 4, 5), Stride_i is 8, 16 and 32, respectively.
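The decoding above can be sketched as follows; the (H, W, 4) offset layout and the grid construction are illustrative assumptions:
import torch

def decode_boxes(offsets, stride):
    # offsets: (H, W, 4) predictions (t_x1, t_y1, t_x2, t_y2) for every cell.
    h, w, _ = offsets.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx, cy = xs.float() + 0.5, ys.float() + 0.5           # cell-centre reference points
    x1 = (cx - torch.exp(offsets[..., 0])) * stride
    y1 = (cy - torch.exp(offsets[..., 1])) * stride
    x2 = (cx + torch.exp(offsets[..., 2])) * stride
    y2 = (cy + torch.exp(offsets[..., 3])) * stride
    return torch.stack([x1, y1, x2, y2], dim=-1)          # boxes in image coordinates

boxes = decode_boxes(torch.zeros(20, 20, 4), stride=32)   # exp(0)=1 -> one-cell-wide boxes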
3.1 EIoU loss function
YOLOv5 uses CIoU as the bounding box regression loss during training. In this study, EIoU is chosen as the bounding box regression loss, which is more appropriate for the anchor-free scheme. The parameter v in the CIoU function reflects only the difference between aspect ratios, rather than the separate differences of width and height and their confidences, which affects model fitting to some extent. Therefore, the EIoU function is adopted as the loss function; EIoU splits the aspect-ratio similarity parameter v into two parts and calculates the width loss and the height loss separately, which converges faster. The EIoU loss function is:
L_EIoU = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b_gt) / c² + ρ²(w, w_gt) / c_w² + ρ²(h, h_gt) / c_h²
It can be seen that EIoU divides the loss function into three parts: the IoU loss, the distance loss L_dis and the side-length loss L_asp, so the side lengths appear directly as penalty terms. Here ρ²(b, b_gt) is the squared Euclidean distance between the center points of the prediction box and the real box, ρ²(w, w_gt) and ρ²(h, h_gt) are the squared differences of the widths and heights of the real and prediction boxes, c is the diagonal length of the smallest box enclosing both, c_w represents the width of the smallest box enclosing both the real and prediction boxes, and c_h represents the height of the smallest box enclosing both the real and prediction boxes.
The flow of the CIoU algorithm is introduced below:
Let the prediction box be B_p = (x1, y1, x2, y2) and the GT box (ground truth) be B_g = (x1′, y1′, x2′, y2′).
Input: the bounding box coordinates of the prediction box B_p and the GT box B_g:
B_p = (x1, y1, x2, y2)
B_g = (x1′, y1′, x2′, y2′)
Output: L_GIoU
Step 1: calculate the area of B_p: A_p = (x2 − x1) × (y2 − y1)
Step 2: calculate the area of B_g: A_g = (x2′ − x1′) × (y2′ − y1′)
Step 3: calculate the area IB of the intersection box between B_p and B_g:
x1_IB = max(x1, x1′), x2_IB = min(x2, x2′)
y1_IB = max(y1, y1′), y2_IB = min(y2, y2′)
if (x2_IB > x1_IB and y2_IB > y1_IB):
IB = (x2_IB − x1_IB) × (y2_IB − y1_IB)
else:
IB = 0
end
Step 4: calculate the area EB of the minimum enclosing box containing B_p and B_g:
x1_EB = min(x1, x1′), x2_EB = max(x2, x2′)
y1_EB = min(y1, y1′), y2_EB = max(y2, y2′)
EB = (x2_EB − x1_EB) × (y2_EB − y1_EB)
Step 5: calculate IoU:
AU = A_p + A_g − IB
IoU = IB / AU
Step 6: calculate GIoU:
GIoU = IoU − (EB − AU) / EB
Step 7: calculate the loss function L_GIoU:
L_GIoU = 1 − GIoU
Step 8: when one box contains the other, EB and AU remain unchanged even as the distance between the two boxes changes, so the GIoU loss does not change; DIoU improves on this by penalizing the center distance.
Calculate DIoU:
DIoU = ρ²(b, b_gt) / c²
where ρ(b, b_gt) is the Euclidean distance between the center points of the prediction box and the GT box, and c is the diagonal length of the minimum enclosing box.
Step 9: calculate L_DIoU:
L_DIoU = 1 − IoU + DIoU
Step 10: DIoU is improved further: when the center points of the two boxes coincide, c and d (the enclosing-box diagonal and the center distance) remain unchanged and the DIoU loss no longer changes, so the aspect ratio of the boxes must be introduced, giving CIoU.
Calculate CIoU:
CIoU = DIoU + αv
where α is a weight function and v measures the consistency of the aspect ratios:
v = (4 / π²) · (arctan(w_gt / h_gt) − arctan(w / h))², α = v / ((1 − IoU) + v)
Step 11: calculate L_CIoU:
L_CIoU = 1 − IoU + DIoU + αv
The penalty term of EIoU splits the aspect-ratio influence factor of the CIoU penalty term so that the width and height of the target box and the anchor box are penalized separately. The loss function therefore comprises three parts: overlap loss, center-distance loss and width-height loss. The first two parts follow the method in CIoU, while the width-height loss directly minimizes the differences between the widths and heights of the target box and the anchor box, which yields faster convergence.
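As a concrete reference, the EIoU regression loss described above can be sketched as follows (a minimal sketch assuming (x1, y1, x2, y2) box tensors; the epsilon value and tensor layout are illustrative assumptions):
import torch

def eiou_loss(pred, target, eps=1e-7):
    # pred, target: (..., 4) boxes in (x1, y1, x2, y2) form.
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)        # enclosing-box width
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)        # enclosing-box height
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2                # squared width difference
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2                # squared height difference
    return (1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps)
            + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps))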
3.2 Focal loss function
The anchor-free scheme reduces the number of prediction boxes compared with the original YOLOv5, but the positive and negative samples remain extremely unbalanced. This imbalance makes the cross-entropy loss adopted for the objectness loss of YOLOv5 unsuitable. Focal Loss (FL) is an important way to alleviate this problem. Its key point is to weight the loss of each sample according to how difficult the sample is to distinguish: easily distinguished samples receive a smaller weight, while hard samples receive a larger weight, which improves performance. Focal Loss has the following characteristics: when the predicted probability for the true class is very small (the sample is misclassified), the modulating factor approaches 1 and the weight of the sample in the loss function is barely affected; when the predicted probability is large (the sample is easily and correctly classified), the modulating factor approaches 0 and the weight of the sample in the loss function is greatly reduced; the focusing parameter adjusts the degree to which easily classified samples are down-weighted, and the larger the parameter, the stronger the reduction. The derivation of the Focal Loss formula is as follows:
the cross entropy loss LCE for the two classes can be described by equation (8).
L CE (p,y)=-ylog(p)-(1-y)log(1-p) (8)
Where y ε {0,1} represents the GT class, where 1 represents the object and 0 represents the background. p epsilon [0,1 ]]Is the target probability output of the logic function output. In order to extract indistinguishable samples and mitigate the weight of the easily distinguishable samples, at L CE Adding regulating factor to obtain L FL
L FL (p,y)=|y-p| γ L CE (p,y) (9)
In formula (9), γ is an adjustable parameter greater than 0. When a sample is correctly classified, |y − p| → 0 (that is, p → y) and L_FL(p, y) approaches 0, indicating that the loss of easily classified samples is reduced. When a sample is misclassified, |y − p| → 1 and L_FL becomes equivalent to L_CE.
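A minimal sketch of formula (9) follows; the choice gamma = 2 and the clamping for numerical stability are assumptions for illustration, since the text only requires gamma > 0:
import torch

def focal_objectness_loss(p, y, gamma=2.0):
    # L_FL = |y - p|^gamma * L_CE(p, y); p = predicted objectness, y in {0, 1}.
    p = p.clamp(1e-7, 1 - 1e-7)                                    # numerical stability
    ce = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))          # formula (8)
    return (y - p).abs().pow(gamma) * ce                           # formula (9)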
The target of each prediction box is marked for the training phase: the prediction box whose center is closest to the ground-truth box is marked as positive, and boxes whose overlap with the ground-truth box is larger than a threshold are not counted in the objectness loss L_FL. The threshold is set to 0.5. The class prediction loss is also a binary cross-entropy loss, the same as in the original YOLOv5. Thus, the final loss function of Anchor-free YOLOv5 in the training phase of this study is as follows:
L_AF = L_EIoU(box) + L_FL(objectness) + L_CE(class)    (10)
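Combining the sketches above, formula (10) could be assembled roughly as follows; the positive-sample mask and the use of probabilities (rather than logits) for the class branch are assumptions, not details stated in the text:
import torch
import torch.nn.functional as F

def total_loss(box_p, box_t, obj_p, obj_t, cls_p, cls_t, pos_mask):
    l_box = eiou_loss(box_p[pos_mask], box_t[pos_mask]).mean()            # L_EIoU(box)
    l_obj = focal_objectness_loss(obj_p, obj_t).mean()                    # L_FL(objectness)
    l_cls = F.binary_cross_entropy(cls_p[pos_mask], cls_t[pos_mask])      # L_CE(class)
    return l_box + l_obj + l_cls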
Theoretically, AF YOLOv5 is more efficient than AB YOLOv5. Take an N×N grid as an example: for the AF method the maximum number of predicted bounding boxes is N×N, since one box is predicted for each point, while for the AB method the maximum number is 3×N×N, since three boxes are predicted per point. In theory, both the training time and the test time of the AF method are therefore smaller than those of the AB method.
The improved feature fusion approach can produce detection boxes that better match the defects.
The EIoU regression loss splits the aspect-ratio loss term into the differences of the predicted width and height relative to the width and height of the minimum enclosing box, which accelerates the convergence of the prediction box and improves its regression accuracy.
Using FL as the objectness loss improves detection performance. On the one hand, when the numbers of positive and negative samples differ greatly, the cross-entropy loss may prevent the network from converging properly; on the other hand, the algorithm effectively reduces the weight of easily distinguished samples and pays more attention to hard samples, so the optimization becomes more targeted. Fig. 9 shows the overall algorithm architecture of the present application.
The embodiment of the invention further provides a strip steel surface defect detection system based on improved YOLOv5, which comprises:
the building module is used for building an improved anchor-free YOLOv5 network;
the acquisition module is used for acquiring the strip steel surface image;
the detection module is used for inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result;
and the output module is used for outputting the detection result of the surface defect of the strip steel.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (6)

1. The strip steel surface defect detection method based on the improved YOLOv5 is characterized by comprising the following steps of:
building an improved anchor-free YOLOv5 network;
acquiring a strip steel surface image;
inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result;
outputting the strip steel surface defect detection result;
the working mechanism of the Anchor-free YOLOv5 network comprises: an input network module, a Backbone network module, a Neck network module and a Prediction network module which are sequentially connected;
the input network module performs Mosaic data enhancement on the input strip steel surface image;
the Backbone network module is provided with a Focus structure and a CSPDarknet structure which are used for extracting feature maps of the strip steel surface image after the Mosaic data enhancement;
the CSPDarknet structure comprises 5 CSP modules;
the Neck network module adopts a BiFPN module+PAN module structure;
the output of the Backbone network module is used as the input of the BiFPN module to perform feature fusion to obtain a feature pyramid;
the PAN module first copies the lowest layer of the feature pyramid to become the bottommost layer of a new feature pyramid;
performing a downsampling operation on the bottommost layer of the new feature pyramid;
the penultimate layer of the feature pyramid is subjected to a 3x3 convolution with a stride of 2, and is added to the downsampled bottom layer via a lateral connection; the addition adopts the concat operation;
finally, a 3x3 convolution is performed to fuse the features of the addition result;
an additional weight is added to each input of the BiFPN module so that the network learns the importance of each input feature; the weighting method is:
O = Σ_i [ω_i / (ε + Σ_j ω_j)] · I_i
wherein O represents the fused result, I_i is an input, ω_i and ω_j are learnable weights to which a ReLU is applied at each update to ensure that their values are not less than 0, ε = 0.0001 is a small constant whose function is to prevent the denominator from being equal to 0, ε + Σ_j ω_j is the normalizing weight, and ω_i · I_i represents a weighted feature;
the two level-5 fusion features in BiFPN are described in terms of the following quantities:
P5_in is the input feature of the fifth layer on the top-down path, P5_td is the intermediate feature of the fifth layer on the top-down path, P6_td is the intermediate feature of the sixth layer on the top-down path, P5_out is the output feature of the fifth layer on the bottom-up path, P4_out is the output feature of the fourth layer on the bottom-up path, P6_in is the input feature of the sixth layer on the top-down path, C represents a depthwise separable convolution operation, R(…) represents a resolution-matching up-sampling or down-sampling operation, ω_1, ω_2 and ω_3 are the learning weights of the input features (including P6_in) used to obtain the intermediate features, and ω′_1, ω′_2, ω′_3 and ω′_4 are the learning weights of the input, intermediate and output features used to obtain the output features.
2. The improved YOLOv5-based strip steel surface defect detection method of claim 1, wherein the anchor-free YOLOv5 network comprises: a feature map module, a feature fusion module, a convolution module and a detector module which are connected in sequence.
3. The method for detecting surface defects of strip steel based on improved YOLOv5 of claim 1, wherein the Prediction network module uses EIoU loss for its output.
4. The method for detecting surface defects of strip steel based on improved YOLOv5 of claim 1, wherein the anchor-free YOLOv5 network uses EIoU as the bounding box regression loss during training.
5. The method for detecting strip steel surface defects based on improved YOLOv5 of claim 4, wherein after feature fusion of the Neck network module, the three-dimensional tensor encoding the bounding box, the objectness and the class prediction are predicted based on Anchor-free bounding box regression using two convolution layers.
6. A strip steel surface defect detection system based on improved YOLOv5, comprising:
the building module is used for building an improved anchor-free YOLOv5 network;
the acquisition module is used for acquiring the strip steel surface image;
the detection module is used for inputting the strip steel surface image into the anchor-free YOLOv5 network to obtain a strip steel surface defect detection result;
the output module is used for outputting the strip steel surface defect detection result;
the working mechanism of the Anchor-free YOLOv5 network comprises: an input network module, a Backbone network module, a Neck network module and a Prediction network module which are sequentially connected;
the input network module performs Mosaic data enhancement on the input strip steel surface image;
the Backbone network module is provided with a Focus structure and a CSPDarknet structure which are used for extracting feature maps of the strip steel surface image after the Mosaic data enhancement;
the CSPDarknet structure comprises 5 CSP modules;
the Neck network module adopts a BiFPN module+PAN module structure;
the output of the Backbone network module is used as the input of the BiFPN module to perform feature fusion to obtain a feature pyramid;
the PAN module first copies the lowest layer of the feature pyramid to become the bottommost layer of a new feature pyramid;
performing a downsampling operation on the bottommost layer of the new feature pyramid;
the penultimate layer of the feature pyramid is subjected to a 3x3 convolution with a stride of 2, and is added to the downsampled bottom layer via a lateral connection; the addition adopts the concat operation;
finally, a 3x3 convolution is performed to fuse the features of the addition result;
an additional weight is added to each input of the BiFPN module so that the network learns the importance of each input feature; the weighting method is:
O = Σ_i [ω_i / (ε + Σ_j ω_j)] · I_i
wherein O represents the fused result, I_i is an input, ω_i and ω_j are learnable weights to which a ReLU is applied at each update to ensure that their values are not less than 0, ε = 0.0001 is a small constant whose function is to prevent the denominator from being equal to 0, ε + Σ_j ω_j is the normalizing weight, and ω_i · I_i represents a weighted feature;
the two level-5 fusion features in BiFPN are described in terms of the following quantities:
P5_in is the input feature of the fifth layer on the top-down path, P5_td is the intermediate feature of the fifth layer on the top-down path, P6_td is the intermediate feature of the sixth layer on the top-down path, P5_out is the output feature of the fifth layer on the bottom-up path, P4_out is the output feature of the fourth layer on the bottom-up path, P6_in is the input feature of the sixth layer on the top-down path, C represents a depthwise separable convolution operation, R(…) represents a resolution-matching up-sampling or down-sampling operation, ω_1, ω_2 and ω_3 are the learning weights of the input features (including P6_in) used to obtain the intermediate features, and ω′_1, ω′_2, ω′_3 and ω′_4 are the learning weights of the input, intermediate and output features used to obtain the output features.
CN202310774663.8A 2023-06-27 2023-06-27 Strip steel surface defect detection method and system based on improved YOLOv5 Active CN117011231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310774663.8A CN117011231B (en) 2023-06-27 2023-06-27 Strip steel surface defect detection method and system based on improved YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310774663.8A CN117011231B (en) 2023-06-27 2023-06-27 Strip steel surface defect detection method and system based on improved YOLOv5

Publications (2)

Publication Number Publication Date
CN117011231A CN117011231A (en) 2023-11-07
CN117011231B true CN117011231B (en) 2024-04-09

Family

ID=88570155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310774663.8A Active CN117011231B (en) 2023-06-27 2023-06-27 Strip steel surface defect detection method and system based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN117011231B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115705637A (en) * 2021-08-11 2023-02-17 中国科学院沈阳计算技术研究所有限公司 Improved YOLOv5 model-based spinning cake defect detection method
CN113763364A (en) * 2021-09-09 2021-12-07 深圳市涌固精密治具有限公司 Image defect detection method based on convolutional neural network
CN113837059A (en) * 2021-09-22 2021-12-24 哈尔滨工程大学 Patrol vehicle for advising pedestrians to wear mask in time and control method thereof
CN116206323A (en) * 2022-10-18 2023-06-02 多彩贵州印象网络传媒股份有限公司 Structured information identification method based on anchor-free frame regression and pixel classification algorithm fusion
CN115719337A (en) * 2022-11-11 2023-02-28 无锡学院 Wind turbine surface defect detection method
CN116309704A (en) * 2023-02-20 2023-06-23 重庆邮电大学 Small target tracking method based on anchor-free frame detection network and feature re-fusion module
CN116309427A (en) * 2023-03-14 2023-06-23 盐城工学院 PCB surface defect detection method based on improved YOLOv5 algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Path Aggregation Network for Instance Segmentation";Shu Liu等;《 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;第8759-8768页 *
"YOLOX: Exceeding YOLO Series in 2021";Zheng Ge等;《arXiv》;第1-7页 *

Also Published As

Publication number Publication date
CN117011231A (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN109902677B (en) Vehicle detection method based on deep learning
WO2023015743A1 (en) Lesion detection model training method, and method for recognizing lesion in image
CN107145908B (en) A kind of small target detecting method based on R-FCN
CN110287932B (en) Road blocking information extraction method based on deep learning image semantic segmentation
CN109117876A (en) A kind of dense small target detection model building method, model and detection method
CN110298298A (en) Target detection and the training method of target detection network, device and equipment
CN111126472A (en) Improved target detection method based on SSD
CN109816012A (en) A kind of multiscale target detection method of integrating context information
CN107633226B (en) Human body motion tracking feature processing method
CN110765865B (en) Underwater target detection method based on improved YOLO algorithm
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN113591795A (en) Lightweight face detection method and system based on mixed attention feature pyramid structure
CN111898668A (en) Small target object detection method based on deep learning
CN113468968B (en) Remote sensing image rotating target detection method based on non-anchor frame
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN110263731B (en) Single step human face detection system
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN110009628A (en) A kind of automatic testing method for polymorphic target in continuous two dimensional image
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
Fan et al. A novel sonar target detection and classification algorithm
CN116503389B (en) Automatic detection method for external absorption of tooth root
CN108460336A (en) A kind of pedestrian detection method based on deep learning
CN112926652A (en) Fish fine-grained image identification method based on deep learning
CN112149665A (en) High-performance multi-scale target detection method based on deep learning
CN111339950B (en) Remote sensing image target detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant