CN116596881A - Workpiece surface defect detection method based on CNN and Transformer - Google Patents
Workpiece surface defect detection method based on CNN and Transformer
- Publication number
- CN116596881A (application number CN202310558597.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- cnn
- convolution
- map
- transformer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000007547 defect Effects 0.000 title claims abstract description 39
- 238000001514 detection method Methods 0.000 title claims description 14
- 238000000605 extraction Methods 0.000 claims abstract description 22
- 230000004927 fusion Effects 0.000 claims abstract description 12
- 238000000034 method Methods 0.000 claims description 23
- 230000008569 process Effects 0.000 claims description 16
- 238000012549 training Methods 0.000 claims description 16
- 238000010586 diagram Methods 0.000 claims description 15
- 238000004364 calculation method Methods 0.000 claims description 13
- 238000013507 mapping Methods 0.000 claims description 9
- 239000013598 vector Substances 0.000 claims description 9
- 238000012795 verification Methods 0.000 claims description 6
- 229910000831 Steel Inorganic materials 0.000 claims description 4
- 239000010959 steel Substances 0.000 claims description 4
- 230000007704 transition Effects 0.000 claims description 4
- 230000000295 complement effect Effects 0.000 claims description 3
- 239000003550 marker Substances 0.000 claims description 3
- 230000009467 reduction Effects 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims 1
- 230000002708 enhancing effect Effects 0.000 abstract description 2
- 238000013527 convolutional neural network Methods 0.000 description 28
- 238000012360 testing method Methods 0.000 description 4
- 238000004519 manufacturing process Methods 0.000 description 3
- 238000005070 sampling Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 230000002411 adverse Effects 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000003908 quality control method Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0004—Industrial image inspection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30108—Industrial image inspection
- G06T2207/30164—Workpiece; Machine component
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The application discloses a backbone network and a feature fusion network combining a CNN and a Transformer. MobileViT blocks are added to the backbone network, and an improved CBAM module is attached at the tail of each MobileViT block so that the two feature maps can be fused better; CSP bottleneck structures are applied to the continuously stacked CNN and Transformer blocks to improve network performance. The whole model enhances the fusion of CNN and Transformer feature maps and effectively improves the feature extraction capacity of the backbone network and the receptive field of the output features. An upsampling feature extraction path containing Transformer blocks is added to the enhanced feature extraction network (PANet), and a Patch Expanding operation is introduced into this architecture to handle the upsampling of the Transformer feature maps. Bridging blocks are added between the feature extraction paths to skip-connect the CNN and Transformer feature layers, enhancing the global information of the feature maps in the pyramid while preserving local information. The application can detect surface defect targets in workpieces with widely varying shapes, sizes, proportions and textures.
Description
Technical Field
The application relates to the field of computer vision, and in particular to a method for detecting surface defects on workpieces.
Background
Quality control is critical in the manufacturing industry: defects in a workpiece can adversely affect its rigidity, strength and load-bearing capacity, so that its stability cannot be guaranteed, and they may even pose serious safety hazards. Defect detection on large batches of workpieces is therefore extremely important in the production process.
With the increase in computational power over the past decade, artificial neural networks have become able to solve tasks that were previously difficult, and have achieved considerable success in many fields. Convolutional neural networks perform strongly in tasks such as image classification, segmentation and object detection, adapt to a wide range of use scenarios, and therefore generalize well. Meanwhile, visual inspection systems based on deep learning can achieve high precision and high efficiency in areas that are difficult for traditional vision methods. Deep networks greatly improve detection efficiency and significantly reduce detection cost, making them well suited to the task of workpiece surface defect detection.
Disclosure of Invention
The application provides a backbone network and a feature fusion network combining a CNN and a Transformer. MobileViT blocks are added to the backbone network, and an improved CBAM module is attached at the tail of each MobileViT block so that the two feature maps can be fused better; CSP bottleneck structures are applied to the continuously stacked CNN and Transformer blocks to improve network performance. The whole model enhances the fusion of CNN and Transformer feature maps and effectively improves the feature extraction capacity of the backbone network and the receptive field of the output features. An upsampling feature extraction path containing Transformer blocks is added to the enhanced feature extraction network (PANet), and a Patch Expanding operation is introduced into this architecture to handle the upsampling of the Transformer feature maps. Bridging blocks are added between the feature extraction paths to skip-connect the CNN and Transformer feature layers, enhancing the global information of the feature maps in the pyramid while preserving local information. The method comprises the following steps:
S1, acquiring a steel surface defect dataset and dividing it into training and verification sets;
S2, constructing a MobileViT-based backbone feature extraction network in which a Transformer and a CNN are connected in series, wherein a steel defect sample is taken as input, the feature extraction network comprises three stages, and the output of each stage is taken as an effective feature map;
And S3, constructing a PANet-based multi-scale feature fusion network in which a Transformer and a CNN are connected in parallel. The three effective feature layers of the backbone feature extraction network are used as input to perform feature fusion;
Step S4, training the detection model according to a predetermined number of workpiece surface defect samples. Each image contains a single defect location, the samples cover defects at different scales, and each sample image has a corresponding preset defect classification. The sample images are used as input and the preset classification of the defects at different scales is used as output, yielding the workpiece surface defect detection model.
The preset defects comprise six types: pitted surfaces, inclusions, patches, rolled-in scale, cracks and scratches, with 360 image samples per type.
The images of each class in the dataset are randomly divided in an 8:2 ratio into a training set and a test set, so the training set has 1728 samples and the test set has 432 samples. During model training, the learning rate is set to 1e-3 and the weight decay to 5e-4. A total of 300 epochs are trained, of which 50 are freeze epochs with batch size 16; the remaining epochs use batch size 4. The learning rate decays in cosine form. When training reaches 50 epochs, the batch size is set to 16 and the learning rate to 1e-4.
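The cosine decay of the learning rate can be sketched with a small scheduling helper. This is purely illustrative: the text only states that the rate decays in cosine form from 1e-3, so the helper name, the minimum rate and the absence of warm-up are our assumptions.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=1e-3, min_lr=0.0):
    # Cosine-decayed learning rate: base_lr at epoch 0, min_lr at the end.
    t = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

With the stated 300-epoch schedule, the rate starts at 1e-3 and decays smoothly toward zero.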
S2.1, firstly, the dimension of the feature map is improved through 1X 1 cross-channel convolution, then further feature extraction is carried out through depth separable convolution, and finally, the feature map is restored to the dimension when input is carried out through 1X 1 convolution. The depth separable convolution is mainly divided into two processes, namely channel-by-channel convolution, namely, convolution operation is respectively carried out by using each channel of the plurality of convolution check feature images; and point-by-point convolution, namely cross-channel convolution using a plurality of points of the 1 x 1 convolution kernel feature map.
S2.2: The feature map from S2.1 is first passed through a 3 x 3 convolution to extract local information and then through a 1 x 1 cross-channel point-by-point convolution. This adjusts the feature map dimension from H x W x C to H x W x d. The 2D feature map is then partitioned into patches and converted into one-dimensional vectors that a Transformer can process directly, giving X_Unfold ∈ R^(P x N x d), where P is the length of each flattened patch vector and N is the number of patches after partitioning. The flattened vectors are fed into stacked Transformer blocks to obtain X_G ∈ R^(P x N x d).
The feature map X_G ∈ R^(P x N x d) produced by the L Transformer blocks is refolded to obtain X_Fold ∈ R^(H x W x d). X_Fold is then fed into a 1 x 1 convolution that reduces the dimension of the whole feature map to obtain F ∈ R^(C x H x W), which facilitates the subsequent concatenation with the original feature map.
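The unfold and fold operations around the Transformer blocks amount to a reversible patch rearrangement. A NumPy sketch, under the assumption of an H x W x d map and square p x p patches (so P = p^2 and N = HW/p^2):

```python
import numpy as np

def unfold(x, p=2):
    """(H, W, d) -> (P, N, d): P = p*p pixels per patch, N = number of patches."""
    H, W, d = x.shape
    x = x.reshape(H // p, p, W // p, p, d).transpose(1, 3, 0, 2, 4)
    return x.reshape(p * p, (H // p) * (W // p), d)

def fold(x, H, W, p=2):
    """Inverse of unfold: (P, N, d) -> (H, W, d)."""
    P, N, d = x.shape
    x = x.reshape(p, p, H // p, W // p, d).transpose(2, 0, 3, 1, 4)
    return x.reshape(H, W, d)
```

Because `fold` exactly inverts `unfold`, the Transformer can operate on the (P, N, d) view while the spatial layout of the map is fully recoverable.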
S2.3: s2.1 and S2.2 output feature map F CNN ,F VIT ∈R C×H×W Channel attention map to M C ∈R C ×1×1 The spatial attention map is M S ∈R 1×H×W The CBMA process flow is as follows:
The final result of the channel attention module is as follows:
The final result of the spatial attention module is as follows:
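Since the attention equations themselves are not reproduced in this text, the following NumPy sketch follows the standard CBAM formulation (a shared MLP over average- and max-pooled channel descriptors, and combined channel-wise average/max maps for the spatial branch); the patent's "improved" variant may differ. Purely for illustration, the learned 7 x 7 convolution of the spatial branch is replaced by an elementwise average of the two maps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    """M_C in R^(C x 1 x 1): shared MLP over avg- and max-pooled descriptors.
    f: (C, H, W); w1: (C//r, C), w2: (C, C//r) are assumed MLP weights."""
    avg = f.mean(axis=(1, 2))
    mx = f.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer
    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]

def spatial_attention(f):
    """M_S in R^(1 x H x W): channel-wise avg and max maps combined.
    (A learned 7x7 convolution would normally mix the two maps.)"""
    avg = f.mean(axis=0)
    mx = f.max(axis=0)
    return sigmoid(0.5 * (avg + mx))[None, :, :]
```

Both maps are broadcast-multiplied onto the feature map, so their shapes match the M_C and M_S dimensions stated above.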
S2.4: The operations of S2.1, S2.2 and S2.3 are performed once per stage; three stages are run in sequence, and the feature map after the S2.3 operation in each stage is taken as an effective feature map.
S3.1: An additional multi-scale feature extraction path composed of Swin Transformer blocks is added to the original PANet.
S3.2: A bridge that skip-connects the CNN branch to the Transformer branch fuses local CNN features into the Transformer to complement its detail information. The CNN feature map is first mapped to Key and Value, and the Transformer feature map is mapped to Query for the subsequent attention computation.
The computation of the bridge from CNN local features to the Transformer is as follows:
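Since the bridge equation is not reproduced in this text, a single-head cross-attention sketch shows the described mapping, with the flattened CNN feature map providing Key and Value and the Transformer tokens providing Query. The projection matrices and shapes are hypothetical:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bridge_cnn_to_transformer(f_cnn, f_vit, wq, wk, wv):
    """f_cnn: (N, C) flattened CNN features -> Key/Value;
    f_vit: (M, C) Transformer tokens -> Query. Returns (M, d_v)."""
    q, k, v = f_vit @ wq, f_cnn @ wk, f_cnn @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # scaled dot-product
    return attn @ v
```

The reverse bridge of S3.3 swaps the roles: the convolution output provides Query and the Transformer tokens provide Key and Value.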
S3.3: The bridge that skip-connects the Transformer branch to the CNN branch runs in the opposite direction to the CNN-to-Transformer bridge: it injects global token attention into local features. The feature map dimensions are transformed before and after a depthwise separable convolution. The convolution output is then mapped to Query, the tokens output by the Transformer are mapped to Key and Value, and attention is computed again.
The computation of the bridge from Transformer global features to CNN local features is as follows:
drawings
FIG. 1 is a diagram of the backbone network model;
FIG. 2 is a diagram of the multi-scale feature fusion network;
FIG. 3 is the MobileViT model structure incorporating CBAM;
FIG. 4 is a diagram of the bridge that skip-connects the CNN branch to the Transformer branch;
FIG. 5 is a diagram of the bridge that skip-connects the Transformer branch to the CNN branch.
Detailed Description
For a better understanding of the technical content of the present application, specific examples are set forth below with reference to the accompanying drawings.
First, collecting a defect image of the surface of a workpiece.
Secondly, defects are labeled and data enhancement is performed to construct a workpiece surface defect dataset. Specifically: on the workpiece production line, each workpiece is photographed at a fixed position by a sampling device to build the dataset. Surface defects in the acquired workpiece images are annotated with labelImg, and the annotated images are prepared as a VOC-format dataset. The images are then augmented with the imgaug data-augmentation library. The enhanced dataset is randomly divided at a ratio of 8:2 into (training set + validation set) : test set, and the training set : validation set ratio is likewise 8:2.
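The two-level 8:2 split described above can be sketched as follows; the function name and fixed seed are our own illustration:

```python
import random

def split_dataset(samples, seed=0):
    """Random 8:2 (train+val):test split, then 8:2 train:val split."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    cut = int(len(samples) * 0.8)
    trainval, test = samples[:cut], samples[cut:]
    cut2 = int(len(trainval) * 0.8)
    return trainval[:cut2], trainval[cut2:], test
```

For the 2160-image steel dataset (6 classes x 360 images), these ratios yield 1728 (train + val) and 432 test samples, with the 1728 split further into 1382 train and 346 validation samples.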
Thirdly, the MobileViT-based backbone network is built; its structure is shown in FIG. 1.
Fourthly, the PANet-based feature fusion network is built; its structure is shown in FIG. 2.
Fifthly, a predetermined number of workpiece surface defect samples are used; each image contains a single defect location, the samples cover defects at different scales, and each sample image has a corresponding preset defect classification.
As shown in FIG. 3, the improved MobileViT first increases the dimension of the feature map by a 1 x 1 cross-channel convolution, then performs further feature extraction with a depthwise separable convolution, and finally restores the feature map to its input dimension with a 1 x 1 convolution. Depthwise separable convolution consists of two stages: channel-by-channel (depthwise) convolution, in which each channel of the feature map is convolved with its own kernel; and point-by-point convolution, a 1 x 1 cross-channel convolution applied at every position of the feature map.
The feature map is first passed through a 3 x 3 convolution to extract local information and then through a 1 x 1 cross-channel point-by-point convolution. This adjusts the feature map dimension from H x W x C to H x W x d. The 2D feature map is then partitioned into patches and converted into one-dimensional vectors that a Transformer can process directly, giving X_Unfold ∈ R^(P x N x d), where P is the length of each flattened patch vector and N is the number of patches after partitioning. The flattened vectors are fed into stacked Transformer blocks to obtain X_G ∈ R^(P x N x d).
The feature map X_G ∈ R^(P x N x d) produced by the L Transformer blocks is refolded to obtain X_Fold ∈ R^(H x W x d). X_Fold is then fed into a 1 x 1 convolution that reduces the dimension of the whole feature map to obtain F ∈ R^(C x H x W), which facilitates the subsequent concatenation with the original feature map.
The CNN and Transformer output feature maps are F_CNN, F_ViT ∈ R^(C x H x W); the channel attention map is M_C ∈ R^(C x 1 x 1) and the spatial attention map is M_S ∈ R^(1 x H x W). The CBAM processing flow is as follows:
The final result of the channel attention module is as follows:
The final result of the spatial attention module is as follows:
Three stages are run in sequence, and the output of each stage is taken as an effective feature map.
As shown in FIG. 2, an additional multi-scale feature extraction path composed of Swin Transformer blocks is added to the original PANet.
As shown in FIG. 4, the bridge that skip-connects the CNN branch to the Transformer branch fuses local CNN features into the Transformer to complement its detail information. The CNN feature map is first mapped to Key and Value, and the Transformer feature map is mapped to Query for the subsequent attention computation.
The computation of the bridge from CNN local features to the Transformer is as follows:
As shown in FIG. 5, the bridge that skip-connects the Transformer branch to the CNN branch runs in the opposite direction to the CNN-to-Transformer bridge: it injects global token attention into local features. The feature map dimensions are transformed before and after a depthwise separable convolution. The convolution output is then mapped to Query, the tokens output by the Transformer are mapped to Key and Value, and attention is computed again.
The computation of the bridge from Transformer global features to CNN local features is as follows:
Fine-tuning the built model on the workpiece surface defect dataset specifically comprises: computing the classification loss with Focal Loss and the regression loss with Smooth L1 during training. The final loss function is the combination of Focal Loss and Smooth L1, L = L_fl + L_sl1. The Focal Loss classification loss is calculated as:
The Smooth L1 regression loss is calculated as follows:
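Since the loss formulas themselves are not reproduced in this text, the following sketch uses the standard definitions of binary Focal Loss and Smooth L1 together with the stated combination L = L_fl + L_sl1; the alpha, gamma and beta defaults are assumed, not taken from the patent:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Standard binary focal loss; p = predicted probability, y in {0, 1}."""
    pt = np.where(y == 1, p, 1.0 - p)
    a = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a * (1.0 - pt) ** gamma * np.log(np.clip(pt, 1e-12, 1.0))))

def smooth_l1(x, t, beta=1.0):
    """Smooth L1 regression loss: quadratic near zero, linear elsewhere."""
    d = np.abs(x - t)
    return float(np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)))

def total_loss(p, y, x, t):
    """L = L_fl + L_sl1, as stated in the training description."""
    return focal_loss(p, y) + smooth_l1(x, t)
```

The focal term down-weights easy, well-classified examples, while Smooth L1 keeps box regression robust to outliers.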
training a network by adopting a transfer learning method, pre-training in the VOC data set to obtain a weight file, and then fine-tuning in the workpiece surface defect data set. The number of iteration steps of the loop is set to 100, firstly, the batch size is set to 32, the learning rate is initialized to 5e-4, when the number of iteration steps reaches 50, the batch size is reset to 16, and the learning rate is 1e-4. And an early stop method (early stop) is adopted during training to avoid overfitting caused by continuous training, verification loss is calculated in each iteration, when the verification loss value reaches local optimum, the iteration is continued for 6 times, and if the model is not converged any more, the training is stopped.
Claims (3)
1. A workpiece surface defect detection method based on CNN and Transformer, the method comprising the steps of:
and S1, acquiring a steel surface defect data set and dividing a training verification set.
S2, constructing a MobileViT-based backbone feature extraction network in which a Transformer and a CNN are connected in series, wherein the feature extraction network takes a steel defect sample as input and takes the output of each stage as an effective feature map;
S3, constructing a PANet-based multi-scale feature fusion network in which a Transformer and a CNN are connected in parallel, and performing feature fusion with the three effective feature layers of the backbone feature extraction network as inputs;
S4, training the detection model according to a preset number of workpiece surface defect samples, wherein each image contains a single defect location, the samples cover defects at different scales, and each sample image has a corresponding preset defect classification; the sample images are used as input and the preset classification of the defects at different scales is used as output, yielding the workpiece surface defect detection model.
2. The serial Transformer and CNN backbone feature extraction network of claim 1, wherein an improved CBAM module is incorporated at the tail of each MobileViT block so that the two feature maps can be fused better, specifically comprising:
S2.1: First, the dimension of the feature map is increased by a 1 x 1 cross-channel convolution; further features are then extracted by a depthwise separable convolution; finally, a 1 x 1 convolution restores the feature map to its input dimension. Depthwise separable convolution consists of two stages: channel-by-channel (depthwise) convolution, in which each channel of the feature map is convolved with its own kernel; and point-by-point convolution, a 1 x 1 cross-channel convolution applied at every position of the feature map.
S2.2: The feature map from S2.1 is first passed through a 3 x 3 convolution to extract local information and then through a 1 x 1 cross-channel point-by-point convolution. This adjusts the feature map dimension from H x W x C to H x W x d. The 2D feature map is then partitioned into patches and converted into one-dimensional vectors that a Transformer can process directly, giving X_Unfold ∈ R^(P x N x d), where P is the length of each flattened patch vector and N is the number of patches after partitioning. The flattened vectors are fed into stacked Transformer blocks to obtain X_G ∈ R^(P x N x d).
The feature map X_G ∈ R^(P x N x d) produced by the L Transformer blocks is refolded to obtain X_Fold ∈ R^(H x W x d). X_Fold is then fed into a 1 x 1 convolution that reduces the dimension of the whole feature map to obtain F ∈ R^(C x H x W), which facilitates the subsequent concatenation with the original feature map.
S2.3: s2.1 and S2.2 output feature map F CNN ,F VIT ∈R C×H×W Channel attention map to M C ∈R C×1×1 The spatial attention map is M S ∈R 1×H×W The processing flow of the CBMA is as follows:
The final result of the channel attention module is as follows:
The final result of the spatial attention module is as follows:
S2.4: The operations of S2.1, S2.2 and S2.3 are performed once per stage; three stages are run in sequence, and the feature map after the S2.3 operation in each stage is taken as an effective feature map.
3. The parallel Transformer and CNN multi-scale feature fusion network of claim 1, wherein an additional multi-scale feature extraction path composed of Swin Transformer blocks is added to the PANet, comprising:
S3.1: An additional multi-scale feature extraction path composed of Swin Transformer blocks is added to the original PANet.
S3.2: A bridge that skip-connects the CNN branch to the Transformer branch fuses local CNN features into the Transformer to complement its detail information. The CNN feature map is first mapped to Key and Value, and the Transformer feature map is mapped to Query for the subsequent attention computation.
The computation of the bridge from CNN local features to the Transformer is as follows:
S3.3: The bridge that skip-connects the Transformer branch to the CNN branch runs in the opposite direction to the CNN-to-Transformer bridge: it injects global token attention into local features. The feature map dimensions are transformed before and after a depthwise separable convolution. The convolution output is then mapped to Query, the tokens output by the Transformer are mapped to Key and Value, and attention is computed again.
The computation of the bridge from Transformer global features to CNN local features is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310558597.0A CN116596881A (en) | 2023-05-17 | 2023-05-17 | Workpiece surface defect detection method based on CNN and Transformer
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310558597.0A CN116596881A (en) | 2023-05-17 | 2023-05-17 | Workpiece surface defect detection method based on CNN and Transformer
Publications (1)
Publication Number | Publication Date |
---|---|
CN116596881A true CN116596881A (en) | 2023-08-15 |
Family
ID=87604107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310558597.0A Pending CN116596881A (en) | 2023-05-17 | 2023-05-17 | Workpiece surface defect detection method based on CNN and Transformer
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116596881A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117094999A (en) * | 2023-10-19 | 2023-11-21 | 南京航空航天大学 | Cross-scale defect detection method |
CN117094999B (en) * | 2023-10-19 | 2023-12-22 | 南京航空航天大学 | Cross-scale defect detection method |
CN117218606A (en) * | 2023-11-09 | 2023-12-12 | 四川泓宝润业工程技术有限公司 | Escape door detection method and device, storage medium and electronic equipment |
CN117218606B (en) * | 2023-11-09 | 2024-02-02 | 四川泓宝润业工程技术有限公司 | Escape door detection method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
DD01 | Delivery of document by public notice | ||
Addressee: Wang Yuechen Document name: Notice of Publication of Invention Patent Application Addressee: Wang Yuechen Document name: Notification of Qualified Preliminary Examination of Invention Patent Application |