CN115294038A - Defect detection method based on joint optimization and mixed attention feature fusion


Info

Publication number
CN115294038A
CN115294038A (application CN202210884549.6A)
Authority
CN
China
Prior art keywords
attention
feature
network
model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210884549.6A
Other languages
Chinese (zh)
Inventor
董永峰
孙松毅
王振
齐巧玲
王利琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology
Priority to CN202210884549.6A
Publication of CN115294038A

Classifications

    • G06T 7/0004 — Image analysis; inspection of images, e.g. flaw detection; industrial image inspection
    • G06N 3/02, G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/26 — Segmentation of patterns in the image field; clustering-based techniques; detection of occlusion
    • G06V 10/32 — Normalisation of the pattern dimensions
    • G06V 10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/764 — Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Image or video recognition using neural networks
    • G06T 2207/30108 — Indexing scheme: industrial image inspection
    • G06T 2207/30164 — Indexing scheme: workpiece; machine component


Abstract

The invention relates to a two-stage defect detection method based on segmentation followed by classification. To address the insufficient feature extraction capability of two-stage defect detection algorithms, a mixed attention feature fusion module is proposed and integrated into a segmentation network with an encoder-decoder structure, so that the model makes better use of global context information and reconstructs a pixel-level segmentation image from the extracted deep features. The invention further proposes a multi-receptive-field spatial attention module, which uses the enlarged receptive field provided by dilated convolution to extract spatial attention weights and effectively strengthens the model's extraction of tiny features. To address the low training efficiency of two-stage defect detection models, the invention proposes a joint optimization framework in which the model is trained end to end with a constructed joint loss function. Experiments show that the proposed improvements effectively raise the detection accuracy of the defect task.

Description

Defect detection method based on joint optimization and mixed attention feature fusion
Technical Field
The technical solution of the invention relates to the fields of deep learning, convolutional neural networks, and defect detection, and in particular to a defect detection method based on joint optimization and mixed attention feature fusion.
Background
Defect detection technology is an indispensable means of quality control in modern industrial production. Traditional machine vision detection methods mainly exploit the different properties of workpiece surface defects to formulate a reasonable imaging scheme, process manually designed features with image processing algorithms based on machine learning, and thereby extract the defect information that the workpiece surface may contain. Such methods have been widely studied and applied in the field of industrial production.
With the continued deepening of research in deep learning, deep neural network models represented by convolutional neural networks (CNNs) have been applied to defect detection in large numbers and perform outstandingly on defect feature extraction and defect classification problems. The defect detection problem can be briefly reduced to a binary classification problem of identifying whether the image under inspection contains a defect; on this basis, research directions such as defect localization and defect segmentation have been developed to meet the demand for information on the shape, type, and position of defects.
Existing classification-based convolutional neural network methods for surface defect detection can be briefly divided into "one-stage" methods that classify the original image directly and "two-stage" methods that combine segmentation and localization tasks. These two kinds of methods mainly have the following shortcomings:
(1) One-stage methods that classify the original image often use a simple, shallow network whose feature extraction capability is somewhat insufficient when facing defect types with complex shapes, so the classification accuracy is unsatisfactory. In addition, the defect area in an actual industrial scene is often small relative to the acquired image; such methods cannot make effective use of spatial and positional information and easily overlook tiny defects. Adding an attention mechanism and a feature fusion method lets the network focus on the defect region while effectively improving the feature extraction capability of the convolutional neural network; the invention therefore adds an attention-based feature fusion method to the convolutional neural network.
(2) Two-stage classification models usually adopt a non-end-to-end training mode: the segmentation branch is trained first and its network parameters are saved; model weights with good segmentation results are then selected, loaded into the network, and the classification branch is trained. This training mode allows both stages of the model to be trained well, but suffers from low training-time efficiency and high consumption of computing resources. The joint optimization method proposed by the invention allows the model to be trained end to end and improves its classification accuracy.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem to be solved by the invention is as follows: the method provides a spatial attention mechanism based on multiple receptive fields, which effectively strengthens the model's extraction of features of tiny defects, and, combined with a channel attention mechanism, constructs a mixed attention feature fusion module to replace the long-skip-connection feature fusion used at different scales in traditional auto-encoder models, making better use of global context information. The method is trained end to end under a joint optimization framework, which significantly improves the accuracy of the defect detection task.
The technical solution adopted by the invention to solve this technical problem is as follows: a defect detection method based on joint optimization and mixed attention feature fusion is provided, comprising the following steps:
The first step: collect images of the surface of the workpiece to be inspected, preprocess the collected images, set the ground-truth labels used for training, and construct a network model, the model being divided into a segmentation network and a classification network.
The second step: input the preprocessed images into the model for training and set the optimization parameters and the number of iterations; the output of the model is a pixel-level segmentation map of the defect information and the corresponding defect type.
The third step: save the trained model weights and use the model to detect surface defects of the workpiece.
In the first step, the collected image is normalized to a size of 512 × 512 pixels by downsampling, converted from the original three-channel RGB image to a single-channel grayscale image, normalized, and then fed into the model. The ground truth required for segmentation network training is a pixel-level label: pixels of defect regions are marked as positive samples and pixels of normal regions as negative samples. In addition, a corresponding class label is assigned according to the type of workpiece surface defect in order to train the classification network.
The segmentation network consists of an encoder-decoder backbone and the mixed attention feature fusion module. The encoder part contains 4 consecutive downsampling operations with stride 2; the feature map extracted by each network level is fed into the mixed attention feature fusion module and concatenated with the feature map of the same resolution reconstructed at the corresponding level of the decoder structure before taking part in the subsequent convolution. Finally, the decoder outputs a pixel-level segmentation map of the same size as the input image, indicating the position and shape of the defect.
The input of the classification network is the 32 × 32 × 256 feature map output by the encoder of the segmentation network. The features are enhanced by a multi-receptive-field spatial attention module, converted into a one-dimensional feature vector by 5 × 5 convolution and pooling operations, and then classified by a fully connected layer.
Specifically, before each downsampling, the obtained feature map is passed through the mixed attention feature fusion module and concatenated, via long skip connections, with the low-dimensional feature map decoded at the corresponding network level of the decoder, preserving the complete semantic information of the feature map while the image is restored to its original resolution. After global average pooling (GAP) and global max pooling (GMP), the result is concatenated with the one-dimensional feature vector output by the classification network to guide the classification result of the classification network. The 66 × 1 feature vector thus obtained is fed into a fully connected layer, and the confidence of each class is obtained after a Softmax operation.
The mixed attention feature fusion module consists of a multi-receptive-field spatial attention module and a channel attention module. In the multi-receptive-field spatial attention module, an input feature map X of dimension H × W × C is convolved under different receptive fields and activated by the nonlinear activation function ReLU, and the results are concatenated into an H × W × 3 feature map; a 1 × 1 convolution then compresses the number of channels to 1, and after activation by a Sigmoid function the result is multiplied with the original input feature map, giving the attention-weighted feature map X′:

X′ = X × σ(f_1×1(concat[ReLU(f_d1^3×3(X)); ReLU(f_d2^3×3(X)); ReLU(f_d3^3×3(X))]))

where concat[·; ·; ·] denotes the concatenation operation, f_di^3×3(·) denotes a 3 × 3 convolution under a different receptive field (dilation rate d_i, with d_1 = 1 the standard convolution), σ(·) denotes the Sigmoid function, and ReLU(·) denotes the ReLU activation function.
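As an illustration, a minimal PyTorch sketch of this module follows. The dilation rates (1, 2, 3) and the single-channel output of each dilated branch are assumptions consistent with the H × W × 3 concatenation described above; the patent does not fix the exact rates.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldSpatialAttention(nn.Module):
    """Sketch of the multi-receptive-field spatial attention module:
    parallel 3x3 convolutions with different dilation rates each yield a
    single-channel map; their concatenation (H x W x 3) is compressed to
    one channel by a 1x1 convolution and turned into a spatial weight."""

    def __init__(self, in_channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # padding == dilation keeps the spatial size for a 3x3 kernel
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, 1, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(len(dilations), 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        maps = [self.relu(branch(x)) for branch in self.branches]
        attn = torch.sigmoid(self.fuse(torch.cat(maps, dim=1)))  # B x 1 x H x W
        return x * attn  # X' = X * sigma(...), broadcast over channels
```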
The channel attention module extracts global channel attention with a global max pooling operation and local channel attention with 1 × 1 convolutions, converts the extracted parameters into feature weights through a Sigmoid function, and multiplies them with the input feature map:

X′ = X × σ(f_1×1(ReLU(f_1×1(GMP(X)))) + f_1×1(ReLU(f_1×1(X))))

where GMP(·) is the global max pooling operation.
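A matching sketch of the channel attention module, with the same caveats: the reduction ratio r = 2 follows the detailed description below, and using separate weights for the global (GMP) branch and the local branch is an assumption, since the formula reuses the symbol f_1×1 for all 1 × 1 convolutions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module: a global branch (GMP then a
    1x1 bottleneck) plus a local branch (the same bottleneck applied to the
    feature map directly), summed and passed through Sigmoid."""

    def __init__(self, channels: int, r: int = 2):
        super().__init__()
        self.gmp = nn.AdaptiveMaxPool2d(1)  # global max pooling -> B x C x 1 x 1
        self.global_branch = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )

    def attention(self, x: torch.Tensor) -> torch.Tensor:
        # the 1x1xC global weight broadcasts against the HxWxC local weight
        return torch.sigmoid(self.global_branch(self.gmp(x)) + self.local_branch(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.attention(x)  # X' = X * sigma(...)
```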
In the second step, the optimization process of the model is controlled by the joint loss function. Specifically, the joint loss function consists of the segmentation loss, the classification loss, and balancing parameters, in the form:

L_total = θ(1 − λ)·L_Seg + δλ·L_Cls

where L_Seg is the segmentation loss, L_Cls is the classification loss, θ and δ are balance coefficients between the segmentation loss and the classification loss, and λ is a weight factor controlled by the iteration count, in the form of the ratio of the current iteration round to the total number of rounds.
During the training of a defect detection model, the samples containing defects usually form only a small part of the whole dataset, and the number of positive pixels in the segmentation problem is usually far smaller than the number of negative pixels, which makes positive samples hard to classify during training. Focal Loss is therefore adopted as the segmentation loss of the model, in the form:

L_Seg = −(1 − p_t)^γ · log(p_t)
in addition, as the problems of unobvious difference between classes and poor classification effect exist in part of multi-classification defect detection problems, the Large Margin Cross-entry Loss is adopted as the classification Loss, and the main effect of the method is to force the model to learn the characteristics of larger inter-class distance and smaller intra-class distance. The form is as follows:
Figure BDA0003763743130000033
wherein y is a label corresponding to the defect true category, f is a classifier, f y Classification score for true class, f c And e is a classification score corresponding to the class c, and epsilon is a regularization coefficient for restricting the dispersion degree of the classification scores of the non-target classes.
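For illustration, a sketch of the two losses and their joint combination follows. Because the exact Large Margin Cross-Entropy formula is given only as an equation image in the source, plain cross-entropy stands in for L_Cls here; the default θ, δ, and γ values follow the embodiment described later.

```python
import torch
import torch.nn.functional as F

def focal_seg_loss(logits: torch.Tensor, target: torch.Tensor,
                   gamma: float = 2.0) -> torch.Tensor:
    """Per-pixel Focal Loss, L_Seg = -(1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)
    p_t = torch.where(target > 0.5, p, 1.0 - p)
    return (-(1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))).mean()

def joint_loss(seg_logits, seg_target, cls_logits, cls_target,
               epoch: int, total_epochs: int,
               theta: float = 10.0, delta: float = 0.1) -> torch.Tensor:
    """L_total = theta*(1 - lambda)*L_Seg + delta*lambda*L_Cls, with
    lambda = current epoch / total epochs, shifting weight from the
    segmentation loss to the classification loss as training proceeds."""
    lam = epoch / total_epochs
    l_seg = focal_seg_loss(seg_logits, seg_target)
    l_cls = F.cross_entropy(cls_logits, cls_target)  # stand-in for Large Margin CE
    return theta * (1.0 - lam) * l_seg + delta * lam * l_cls
```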
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a schematic structural diagram of the multi-receptive-field spatial attention module of the invention;
FIG. 2 is a schematic diagram of a hybrid attention feature fusion module according to the present invention.
FIG. 3 is a schematic diagram of the overall structure of the defect inspection model according to the present invention.
FIG. 4 is a diagram illustrating the output result of the method applied to the MAGNETIC-TILE data set for defect detection.
Detailed Description
The spatial attention module shown in FIG. 1 is an improvement on the SAM module of the CBAM model. Unlike CBAM, which operates directly with convolution kernels of different sizes, the receptive-field expansion of the spatial attention module proposed in the invention is realized by dilated convolution: starting from a 3 × 3 kernel, different dilation rates are set to obtain different receptive fields. Dilated convolution not only enlarges the receptive field, makes full use of multi-scale context information, and strengthens the capture of tiny defects, but also keeps the number of parameters under control, introducing no additional parameters. Meanwhile, the features extracted by the standard convolution kernel with dilation rate 1 take part in the fusion, which effectively avoids the local information loss of dilated convolution and the lack of correlation among the extracted features.
Considering that the defect-containing positive pixels usually occupy only a small part of the whole image in the defect detection problem, so that extracting attention weights by global average pooling hardly achieves good results, the channel attention module of the invention extracts global channel attention with global max pooling (GMP). The input feature map of dimension H × W × C is converted into a one-dimensional 1 × 1 × C feature vector by the GMP operation; a convolution with kernel size 1 × 1 changes the dimension to 1 × 1 × C/r (r is set to 2 in the invention), and after ReLU activation a further 1 × 1 convolution restores the original dimension of 1 × 1 × C. This operation mainly serves to speed up computation and save training time. In addition, the input feature map passes through the same 1 × 1 convolutional dimension-scaling operations to obtain local channel attention only. The two results are added, activated by a Sigmoid function to become the channel attention weight of the input feature, and multiplied with the input feature to obtain the weighted feature map.
The hybrid attention feature fusion module shown in FIG. 2 combines the two attention types, spatial and channel, to construct a mixed-attention-based way of fusing the shallow and deep features of the encoder-decoder network. The encoder input feature X and the feature map Y from the lower decoder level, upsampled by bilinear interpolation, are combined by element-wise addition; the combined feature map is fed into the channel attention module to obtain the attention weights along the channel direction. The feature maps generated by multiplying these weights with X and Y respectively serve as the inputs of two multi-receptive-field spatial attention modules, and the spatially weighted results are concatenated to generate the output feature map of the whole mixed attention feature fusion module, which continues to take part in the computation of the higher-level network. This arrangement of the two attention types follows the channel-domain-first, then-spatial-domain, serial arrangement of the CBAM module and improves the feature extraction capability of the model to a great extent.
FIG. 3 is a schematic diagram of the overall structure of the defect detection model of this patent. The main skeleton of the model is an encoder-decoder structure. The input image is first downsampled 4 times by max pooling to obtain a high-dimensional feature map of dimension 32 × 32 × 256, and an additional multi-receptive-field spatial attention module generates spatial attention weights for the high-dimensional features, which are multiplied with the input features of the classification branch to enhance the features of the classification network. Before downsampling, the feature map of each level is fed into the mixed attention feature fusion module and fused with the upsampled deep features of the level below. The H × W × 1 segmentation image output by the segmentation network is pooled and concatenated with the output of the classification network into a 66 × 1 feature vector, which is fed to a fully connected layer to generate the probability scores of all classes.
Example 1
In this embodiment, the defect detection method based on mixed attention feature fusion and joint optimization is applied to a defect detection task on images to be inspected; specifically, defect detection is performed on the MAGNETIC-TILE dataset.
The MAGNETIC-TILE dataset contains 5 types of defects on the surface of automotive magnetic tiles (porosity, breakage, crack, wear, non-uniformity) and 1344 samples in total, of which 392 samples carry pixel-level annotations matching the shape of the defect.
First, the original images in the dataset are preprocessed and split into a training set and a test set. Each image is read as a single-channel grayscale image with the IMREAD_GRAYSCALE flag of the opencv-python library, converted into a tensor, and standardized. Meanwhile, the labels corresponding to the training samples are read in, and mislabeled edge pixels with values between 0 and 1 in the segmentation labels are set to 1 according to a threshold of 0.5.
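A sketch of this preprocessing step is given below; the normalization statistics are placeholders, since the source does not state them.

```python
import cv2
import numpy as np
import torch

def preprocess(image_path: str, mask_path: str):
    """Read an image as grayscale, resize to 512 x 512, standardize, and
    binarize the segmentation label at the 0.5 threshold."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)           # single channel
    img = cv2.resize(img, (512, 512)).astype(np.float32) / 255.0
    img = (img - 0.5) / 0.5                                      # assumed mean/std
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.resize(mask, (512, 512)).astype(np.float32) / 255.0
    mask = (mask >= 0.5).astype(np.float32)   # edge pixels in (0, 1) snapped to 1
    return torch.from_numpy(img)[None], torch.from_numpy(mask)[None]  # 1 x 512 x 512
```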
Second, the preprocessed images are input into the network model. The initial input is a 512 × 512 single-channel grayscale image, the batch size is set to 8, and the data first enter the segmentation network. The segmentation network consists of an encoder stage and a decoder stage. The encoder stage contains 5 convolution modules and 4 downsampling operations; each convolution module contains two convolutions with kernel size 3 × 3 that double the number of output channels, two normalization (BatchNorm2d) operations, and two ReLU operations. The normalization operations standardize the data by the mean and variance of the input, giving the data more statistical regularity and preventing overly large values from harming network stability before the ReLU operation. The ReLU operation avoids the slow backpropagation and heavy computation of activation functions such as sigmoid, effectively saving computation time, and helps avoid problems such as vanishing gradients and overfitting.
The decoder stage is formed by four rounds of upsampling, feature fusion, and convolution operations. Each upsampling operation comprises an UpSample operation with stride 2, a 3 × 3 convolution, and a normalization-plus-ReLU combination: the UpSample operation restores the image to twice the input size, and the convolution compresses the number of image channels to half the original. The result is fed, together with the downsampled image of the encoder stage connected by long skip connections, into the hybrid attention feature fusion module for fusion.
The input to the hybrid attention feature fusion module includes two parts: the feature map X restored by upsampling in the decoder stage and the feature map Y of the corresponding level in the encoder stage. The two input feature maps are first combined by element-wise addition; from the summed feature map, the channel attention module computes global attention (by the global max pooling branch) and local attention (by the 1 × 1 convolution branch). The two attention terms are added and activated by a Sigmoid function to give the overall channel attention weight, which is multiplied with X and Y respectively; each weighted feature is then spatially weighted by a multi-receptive-field spatial attention module, and the results are concatenated along the channel direction to generate the final mixed attention fusion feature, expressed by the formula:
Z = concat[f_s(X × f_c(X + Y)); f_s(Y × f_c(X + Y))]

where Z represents the output of the hybrid attention feature fusion module, concat[·; ·] denotes the concatenation operation along the channel direction, f_s(·) denotes the multi-receptive-field spatial attention module, and f_c(·) denotes the channel attention module, whose expressions are:

f_s(X) = X × σ(f_1×1(concat[ReLU(f_d1^3×3(X)); ReLU(f_d2^3×3(X)); ReLU(f_d3^3×3(X))]))

f_c(X) = X × σ(f_1×1(ReLU(f_1×1(GMP(X)))) + f_1×1(ReLU(f_1×1(X))))

where GMP(·) is the global max pooling operation that converts the feature map into a one-dimensional feature vector along the channel direction; f_1×1(·) denotes a convolution with kernel size 1 × 1, used mainly to compress the number of channels; σ(·) denotes the Sigmoid activation function, which constrains the attention weights to lie between 0 and 1; and f_di^3×3(·) denotes a 3 × 3 dilated convolution with dilation rate d_i, which effectively extracts image features under different receptive fields. The standard 3 × 3 convolution is adopted when d = 1, which enlarges the model's receptive field while effectively avoiding the omission of tiny features that dilated convolution may cause.
The deep features of the segmentation network undergo four upsampling reconstruction operations in the decoder stage, after which a pixel-level segmentation map of the input image is output. The segmentation map has the same size as the input image and indicates the shape and position of the defects in the defect image. The segmentation map is then flattened by a view function and, after global average pooling and global max pooling respectively, concatenated with the feature vector output by the classification network, providing guidance information for defect classification.
The 32 × 32 × 256 high-dimensional feature map generated by the four downsamplings of the encoder stage of the segmentation network is used as the input of the classification network, which contains a multi-receptive-field spatial attention module mainly used to enhance the deep features of the model and improve its ability to capture tiny defects. Three convolution operations with kernel size 5 × 5 convert the deep feature map into 32-channel features, which, after global average pooling and global max pooling respectively, are concatenated with the pooled output of the segmentation network. The 66 × 1 feature vector obtained by concatenation is fed into a fully connected layer with 6 outputs and converted by a Softmax operation into probability scores of the 6 corresponding classes, whose sum is 1. The acceptance threshold is set to 0.5: the classification result is considered acceptable when the probability of the highest-scoring of the 6 classes exceeds 0.5.
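A sketch of this classification branch follows; the strides and intermediate widths of the three 5 × 5 convolutions are assumptions (the text fixes only the kernel size and the final 32-channel width), and the attention module reuses the sketch given earlier.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the classification branch: spatial attention on the
    32x32x256 encoder output, three 5x5 convolutions down to 32 channels,
    GAP + GMP (64 values), concatenation with the pooled segmentation map
    (2 values), and a 66 -> 6 fully connected classifier."""

    def __init__(self, in_channels: int = 256, num_classes: int = 6):
        super().__init__()
        self.attn = MultiReceptiveFieldSpatialAttention(in_channels)
        self.convs = nn.Sequential(  # widths/strides are assumptions
            nn.Conv2d(in_channels, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(66, num_classes)

    def forward(self, feat: torch.Tensor, seg_map: torch.Tensor) -> torch.Tensor:
        f = self.convs(self.attn(feat))
        v = torch.cat([f.mean(dim=(2, 3)), f.amax(dim=(2, 3))], dim=1)             # 64
        s = torch.cat([seg_map.mean(dim=(2, 3)), seg_map.amax(dim=(2, 3))], dim=1)  # 2
        return self.fc(torch.cat([v, s], dim=1))  # 66 -> 6 class scores (pre-Softmax)
```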
Third, the outputs of the segmentation network and the classification network are each compared with the corresponding ground-truth labels to compute the loss. The ground truth of the segmentation network is a black-and-white binary map of the defect image, in which defective pixels are labeled 1 and non-defective pixels 0, and binary classification is performed on every pixel; the label of the classification network is the defect class number plus a unique identifier for the defect-free class. The loss value is computed according to the loss function in the following form:
L_total = θ(1 − λ)·L_Seg + δλ·L_Cls
the segmentation loss value is an average result obtained after loss is calculated for pixel points contained in each image, and gradient is calculated through a back propagation algorithm after the loss value is obtained through calculation so as to optimize model parameters.
The model training environment is built on Python 3.8 and PyTorch 1.7.1, and the samples are split between the training set and the test set at a ratio of 4:1. The model is optimized with the Adadelta algorithm and a learning rate of 0.1. The hardware environment used for model training is an Ubuntu 18.04 operating system with an Intel 6140 CPU and an Nvidia RTX 3090 GPU. The hyperparameters θ and δ in the joint loss function are set to 10 and 0.1, the value of γ in the segmentation loss is set to 2, the hyperparameter ε in the classification loss is set to 0.3, and the total number of training iteration rounds is 300.
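For illustration, this configuration maps onto a training loop like the one below; DefectNet and train_loader are hypothetical names standing in for the full two-branch model and the data pipeline, and joint_loss is the sketch given earlier.

```python
import torch

model = DefectNet()  # hypothetical: segmentation + classification branches
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.1)

TOTAL_EPOCHS = 300
THETA, DELTA = 10.0, 0.1  # balance coefficients from the text

for epoch in range(TOTAL_EPOCHS):
    for img, mask, label in train_loader:  # assumed DataLoader, batch size 8
        seg_logits, cls_logits = model(img)
        loss = joint_loss(seg_logits, mask, cls_logits, label,
                          epoch=epoch, total_epochs=TOTAL_EPOCHS,
                          theta=THETA, delta=DELTA)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```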
To evaluate how effectively the proposed method improves the performance of the defect detection task, several outstanding algorithms were selected and compared with the model of this method under the same experimental conditions. To test the stability of the experimental results, each experiment was run three times under the same conditions and the standard deviation was computed. As shown in Table 1, the method disclosed herein achieves the best results on all metrics.
Table 1. Comparison of experimental results of different algorithms
(The table is reproduced as an image in the original publication.)
The evaluation metrics in the table are Precision, Recall, F1-Measure, and Accuracy, in the following forms:
Precision = (1/c) · Σ_i TP_i / (TP_i + FP_i)

Recall = (1/c) · Σ_i TP_i / (TP_i + FN_i)

F1-Measure = 2 × Precision × Recall / (Precision + Recall)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where c represents the number of classes and TP_i, FP_i, and FN_i denote the true positives, false positives, and false negatives of class i. Each metric is computed by macro-averaging, which avoids the adverse effect of large differences in the number of samples per class.
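As a sketch, the macro-averaged metrics can be computed from a confusion matrix as follows; the one-vs-rest per-class TP/FP/FN convention is an assumption, since the source gives only the formulas.

```python
import numpy as np

def macro_metrics(conf: np.ndarray):
    """Macro-averaged precision/recall/F1 and overall accuracy from a
    c x c confusion matrix (rows = true class, columns = predicted)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class i but actually another
    fn = conf.sum(axis=1) - tp   # actually class i but predicted otherwise
    precision = float(np.mean(tp / np.clip(tp + fp, 1.0, None)))
    recall = float(np.mean(tp / np.clip(tp + fn, 1.0, None)))
    f1 = 2.0 * precision * recall / max(precision + recall, 1e-8)
    accuracy = float(tp.sum() / conf.sum())
    return precision, recall, f1, accuracy
```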
FIG. 4 is a visual presentation of the outputs of the model's segmentation network, in which columns 1 to 5 correspond to the 5 sample types: non-uniformity, breakage, porosity, crack, and wear. The figure shows that, when facing a dataset with pixel-level labels, the proposed model can accurately segment the pixels containing defects, and it can still accurately extract defect information when multiple, widely differing defect types are present.
The segmentation and classification sub-loss functions and the channel attention fusion within the mixed attention feature fusion module are implemented as improvements on existing methods.
It should be noted that the invention is not limited to the above embodiments. Other embodiments obtained by those skilled in the art in light of the teachings of the invention, without departing from its principles, are considered to fall within the scope of protection of the invention.

Claims (4)

1. A defect detection method based on joint optimization and mixed attention feature fusion, characterized in that the method comprises the following steps:
the first step: acquire a surface image of the workpiece to be inspected, preprocess the acquired image, set the ground-truth labels used for training, and construct a network model divided into a segmentation network and a classification network, wherein the segmentation network consists of an encoder stage, a decoder, and a mixed attention feature fusion module, and the classification network consists of a convolutional network backbone, a multi-receptive-field spatial attention module, and a classifier;
the second step: input the preprocessed image into the model for training, construct a joint loss function, and set the optimization parameters and the number of iterations, the output of the model being a pixel-level segmentation map of the defect and the corresponding defect type;
the third step: save the trained model weights and use the model to detect workpiece surface defects.
2. The defect detection method based on joint optimization and mixed attention feature fusion of claim 1, characterized in that a defect detection model based on mixed attention feature fusion is formed by a segmentation network with an encoder-decoder structure and a classification network combined with a multi-receptive-field spatial attention module,
wherein the segmentation network consists of an encoder-decoder backbone and the mixed attention feature fusion module; the encoder part contains 4 consecutive downsampling operations with stride 2; the feature map extracted by each network level is fed into the mixed attention feature fusion module, concatenated with the feature map of the same resolution reconstructed by the decoder structure, and then takes part in the subsequent convolution; finally the decoder outputs a pixel-level segmentation map of the same size as the input image, indicating the position and shape of the defect; the deep features output by the encoder stage of the segmentation network serve as the input of the classification network and, after being weighted by the multi-receptive-field spatial attention module, undergo the convolution operations.
3. The defect detection method based on joint optimization and mixed attention feature fusion of claim 2, characterized by the multi-receptive-field spatial attention module and the mixed attention feature fusion module,
wherein the mixed attention feature fusion module combines the two attention types, spatial and channel, to construct a mixed-attention-based fusion of the shallow and deep features of the encoder-decoder network: the encoder input feature X and the decoder lower-level feature map Y, upsampled by bilinear interpolation, are combined by element-wise addition and fed into the channel attention module to obtain attention weights along the channel direction; the feature maps generated by multiplying the weights with X and Y respectively serve as the inputs of two multi-receptive-field spatial attention modules, and the spatially weighted results are concatenated to generate the output feature map of the whole mixed attention feature fusion module, which continues to take part in the computation of the higher-level network;
the mixed attention feature fusion module consists of the multi-receptive-field spatial attention module and a channel attention module; in the multi-receptive-field spatial attention module, an input feature map X of dimension H × W × C is convolved under different receptive fields, activated by the nonlinear activation function ReLU, and concatenated into an H × W × 3 feature map; a 1 × 1 convolution then compresses the number of channels to 1, and after Sigmoid activation the result is multiplied with the original input feature map to obtain the attention-weighted feature map X′:

X′ = X × σ(f_1×1(concat[ReLU(f_d1^3×3(X)); ReLU(f_d2^3×3(X)); ReLU(f_d3^3×3(X))]))

where concat[·; ·; ·] denotes the concatenation operation, f_di^3×3(·) denotes a 3 × 3 convolution under a different receptive field (dilation rate d_i), σ(·) denotes the Sigmoid function, and ReLU(·) denotes the ReLU activation function;
the channel attention module extracts global channel attention with a global max pooling operation and local channel attention with 1 × 1 convolutions, converts the extracted parameters into feature weights through a Sigmoid function, and multiplies them with the input feature map:

X′ = X × σ(f_1×1(ReLU(f_1×1(GMP(X)))) + f_1×1(ReLU(f_1×1(X))))

where GMP(·) is the global max pooling operation.
4. The defect detection method based on joint optimization and mixed attention feature fusion of claim 1, characterized in that the third step features a segmentation-classification two-stage defect detection model optimization method based on joint optimization,
the core of the optimization method being the constructed joint loss function, of the form:

L_total = θ(1 − λ)·L_Seg + δλ·L_Cls, with L_Seg = −(1 − p_t)^γ · log(p_t)

where θ and δ are balance coefficients between the segmentation loss and the classification loss; λ is a weight factor controlled by the number of iteration rounds, in the form of the ratio of the current round to the total number of rounds; γ is a modulation factor for the segmentation difficulty of a sample; and L_Cls is the Large Margin Cross-Entropy classification loss, in which f is the classifier, f_y is the classification score of the true class, f_c is the classification score corresponding to class c, and ε is a regularization coefficient constraining the dispersion of the classification scores of the non-target classes.
CN202210884549.6A 2022-07-25 2022-07-25 Defect detection method based on joint optimization and mixed attention feature fusion Pending CN115294038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210884549.6A CN115294038A (en) 2022-07-25 2022-07-25 Defect detection method based on joint optimization and mixed attention feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210884549.6A CN115294038A (en) 2022-07-25 2022-07-25 Defect detection method based on joint optimization and mixed attention feature fusion

Publications (1)

Publication Number Publication Date
CN115294038A 2022-11-04

Family

ID=83823640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210884549.6A Pending CN115294038A (en) 2022-07-25 2022-07-25 Defect detection method based on joint optimization and mixed attention feature fusion

Country Status (1)

Country Link
CN (1) CN115294038A (en)


Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631186B (en) * 2022-11-08 2023-10-03 哈尔滨工业大学 Industrial element surface defect detection method based on double-branch neural network
CN115631186A (en) * 2022-11-08 2023-01-20 哈尔滨工业大学 Industrial element surface defect detection method based on double-branch neural network
CN115564789A (en) * 2022-12-01 2023-01-03 北京矩视智能科技有限公司 Method and device for segmenting defect region of workpiece by cross-level fusion and storage medium
CN116152890A (en) * 2022-12-28 2023-05-23 北京融威众邦电子技术有限公司 Medical fee self-service payment system
CN116152890B (en) * 2022-12-28 2024-01-26 北京融威众邦电子技术有限公司 Medical fee self-service payment system
CN115713533A (en) * 2023-01-10 2023-02-24 佰聆数据股份有限公司 Method and device for detecting surface defects of electrical equipment based on machine vision
CN116402747A (en) * 2023-02-24 2023-07-07 上海白春学人工智能科技工作室 Multi-receptive-field attention lung nodule benign and malignant classification and identification system and method
CN116824277B (en) * 2023-08-29 2023-11-14 城云科技(中国)有限公司 Visual target detection model for road disease detection, construction method and application
CN116824277A (en) * 2023-08-29 2023-09-29 城云科技(中国)有限公司 Visual target detection model for road disease detection, construction method and application
CN116843685A (en) * 2023-08-31 2023-10-03 山东大学 3D printing workpiece defect identification method and system based on image detection
CN116843685B (en) * 2023-08-31 2023-12-12 山东大学 3D printing workpiece defect identification method and system based on image detection
CN116883393A (en) * 2023-09-05 2023-10-13 青岛理工大学 Metal surface defect detection method based on anchor frame-free target detection algorithm
CN116883393B (en) * 2023-09-05 2023-12-01 青岛理工大学 Metal surface defect detection method based on anchor frame-free target detection algorithm
CN117690007A (en) * 2024-02-01 2024-03-12 成都大学 High-frequency workpiece image recognition method
CN117690007B (en) * 2024-02-01 2024-04-19 成都大学 High-frequency workpiece image recognition method
CN117952993A (en) * 2024-03-27 2024-04-30 中国海洋大学 Semi-supervised medical image segmentation method based on image text cooperative constraint
CN117952983A (en) * 2024-03-27 2024-04-30 中电科大数据研究院有限公司 Intelligent manufacturing production process monitoring method and system based on artificial intelligence
CN118052817A (en) * 2024-04-15 2024-05-17 宁波福至新材料有限公司 Lead frame surface defect detection method and system

Similar Documents

Publication Publication Date Title
CN115294038A (en) Defect detection method based on joint optimization and mixed attention feature fusion
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN110781924B (en) Side-scan sonar image feature extraction method based on full convolution neural network
CN110728658A (en) High-resolution remote sensing image weak target detection method based on deep learning
CN112465790A (en) Surface defect detection method based on multi-scale convolution and trilinear global attention
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN109840483B (en) Landslide crack detection and identification method and device
CN112102229A (en) Intelligent industrial CT detection defect identification method based on deep learning
CN112465759A (en) Convolutional neural network-based aeroengine blade defect detection method
CN114022793A (en) Optical remote sensing image change detection method based on twin network
CN115272330B (en) Defect detection method, system and related equipment based on battery surface image
CN112070727A (en) Metal surface defect detection method based on machine learning
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN114511710A (en) Image target detection method based on convolutional neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN116468740A (en) Image semantic segmentation model and segmentation method
CN115909006A (en) Mammary tissue image classification method and system based on convolution Transformer
CN113313000A (en) Gas-liquid two-phase flow intelligent identification method based on optical image
CN116486393A (en) Scene text detection method based on image segmentation
Patel et al. A novel approach for semantic segmentation of automatic road network extractions from remote sensing images by modified UNet
CN114897909B (en) Crankshaft surface crack monitoring method and system based on unsupervised learning
CN115456957B (en) Method for detecting change of remote sensing image by full-scale feature aggregation
CN115661042A (en) Hierarchical classification defect detection method based on attention mechanism guidance
CN113344005B (en) Image edge detection method based on optimized small-scale features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination