CN117764969A - Multi-view imaging system and lightweight multi-scale feature fusion defect detection method - Google Patents


Info

Publication number: CN117764969A
Application number: CN202311840928.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: feature, layer, feature map, branch, fusion
Legal status: Pending (assumed; not a legal conclusion)
Inventors: 吴衡, 曾令湘, 罗劭娟, 陈梅云
Current Assignee: Guangdong University of Technology
Original Assignee: Guangdong University of Technology
Application filed by Guangdong University of Technology
Priority application: CN202311840928.6A; publication: CN117764969A


Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a multi-view imaging system and a lightweight multi-scale feature fusion defect detection method. The detection method comprises the following steps: acquiring an image of a target to be detected, wherein the image is obtained through the constructed multi-view imaging system; and inputting the image into a preset detection model and outputting a defect detection result for the target. The detection model is trained on a training set comprising images containing defects and labels identifying the defects, and is constructed from a feature extraction network and a lightweight multi-scale feature fusion network. The defect detection method provided by the invention is lightweight, fast and highly accurate.

Description

Multi-view imaging system and lightweight multi-scale feature fusion defect detection method
Technical Field
The invention relates to the technical field of industrial defect detection, in particular to a multi-view imaging system and a lightweight multi-scale feature fusion defect detection method.
Background
Industrial defect detection methods fall mainly into two types: manual detection and optical detection. Manual detection has a number of disadvantages: the process is inefficient, and for small, weak, barely perceptible defects the human eye may fail to identify them accurately, so such defects are missed. Owing to these drawbacks, industrial defect detection has gradually turned to automatic and semi-automatic methods, such as machine vision and sensor technology, to improve detection efficiency and accuracy.
Automatic optical inspection is a detection method based on optical principles and machine vision: optical imaging captures images of an object, and a computer then processes, analyzes and identifies the images to realize automatic detection. Owing to the limitations of traditional imaging systems, detection algorithms and computing power, existing automatic optical inspection methods often suffer from limitations such as poor 360-degree image acquisition, low detection precision and low detection speed. Therefore, developing a lightweight target detection algorithm with a small computational load, high detection speed and high precision is of great significance for industrial defect detection.
Disclosure of Invention
The invention addresses the problems that a traditional industrial vision system can capture images from only one angle and that existing micro-defect detection algorithms suffer from low detection precision and poor robustness, and provides a multi-view imaging system and a lightweight multi-scale feature fusion defect detection method.
In order to achieve the above object, the present invention provides the following solutions:
a multi-view imaging system, comprising: imaging units, illumination devices, vertical support rods and a conveyor belt; each imaging unit is equipped with an illumination device, and the imaging units and illumination devices are mounted on the vertical support rods; the vertical support rods are arranged at the vertices of a rectangle, forming a central rectangular region; the conveyor belt is disposed in the rectangular region and carries the target to be detected.
In order to further achieve the above object, the present invention further provides a lightweight multi-scale feature fusion defect detection method, including:
acquiring an image of a target to be detected, wherein the image of the target to be detected is obtained through the multi-view imaging system;
inputting the image into a preset detection model and outputting a defect detection result for the target to be detected, wherein the detection model is trained on a training set comprising images containing defects and labels identifying the defects, and is constructed from a feature extraction network and a lightweight multi-scale feature fusion network.
Optionally, the detection model includes: the system comprises a backbone feature extraction network, a neck feature extraction network and a feature detection head, wherein the backbone feature extraction network is used for carrying out feature extraction and feature fusion on the image to obtain a first feature map; the neck feature extraction network is used for carrying out feature extraction and feature fusion on the first feature map to obtain a second feature map; the feature detection head is used for detecting the second feature map and obtaining a defect detection result.
Optionally, the backbone feature extraction network includes: a convolution layer, a C2f layer, a lightweight fusion module, a high-level lightweight fusion module and an SPPF layer. The image is input into the backbone feature extraction network, and feature maps B1 and B2 are obtained through the convolution layer and the C2f layer; feature maps B1 and B2 pass through the lightweight fusion module, the convolution layer and the C2f layer to obtain feature map B3; feature maps B1, B2 and B3 pass through the high-level lightweight fusion module, the convolution layer, the C2f layer and the SPPF layer to obtain feature map B4.
Optionally, the neck feature extraction network comprises: an upsampling layer, a splicing layer, a convolution layer, a C2f layer, a lightweight fusion module and a high-level lightweight fusion module. Feature map B4 is input into the neck feature extraction network, and feature map D1 is obtained through the upsampling layer, the splicing layer and the C2f layer; feature map D1 passes through the convolution layer, the splicing layer and the C2f layer to obtain feature map D2; feature maps D1 and D2 pass through the lightweight fusion module, the convolution layer, the splicing layer and the C2f layer to obtain feature map D3; feature maps D1, D2 and D3 pass through the high-level lightweight fusion module, the convolution layer, the splicing layer and the C2f layer to obtain feature map D4.
Optionally, the lightweight fusion module comprises a first branch and a second branch; after the feature information of the first branch and the second branch is processed by a Sigmoid activation function and added, a triplet attention mechanism is introduced to further extract the fused feature information of the first branch and the second branch.
Optionally, the high-level lightweight fusion module comprises a third branch, a fourth branch and a fifth branch; the third branch and the fourth branch undergo downsampling and convolution operations respectively to obtain their feature information, which is processed by a Sigmoid activation function and added to the feature information of the fifth branch; a triplet attention mechanism is then introduced to further extract the fused feature information of the third, fourth and fifth branches.
Optionally, the triplet attention mechanism comprises three identical sixth branches, each performing Z-pooling, convolution and Sigmoid activation operations; the outputs of the three branches are averaged to obtain the output of the triplet attention mechanism.
Optionally, the feature detection head is configured to detect the second feature map, including:
the feature detection heads respectively detect feature maps D1, D2, D3 and D4 and output a prediction frame and a category probability, wherein the prediction frame is represented by coordinate values and the category probability is obtained by mapping into the range of 0 to 1.
Optionally, obtaining the defect detection result includes:
judging whether the target to be detected has a defect according to whether the feature detection head outputs a prediction frame: if a prediction frame is output, the target to be detected has a defect; if no prediction frame is output, the target to be detected has no defect.
The beneficial effects of the invention are as follows:
according to the invention, a multi-view imaging system is constructed, two lightweight multi-scale feature fusion modules are provided, feature information of a context is effectively connected on the premise of small calculated amount, feature selection is selectively performed, the weight of a detected region of interest is increased, the weight of a non-region of interest is reduced, the detection precision and robustness are improved, and the detection speed is not influenced. The method has the advantages of light weight, high speed and high precision, and has wide application prospect in the field of industrial defect detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic structural diagram of a multi-view imaging system according to an embodiment of the present invention;
FIG. 2 is a schematic front view of a multi-view imaging system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an imaging principle of a multi-view imaging system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a detection model according to an embodiment of the present invention;
FIG. 5 is a schematic view of a lightweight fusion module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Triplet Attention attention mechanism according to an embodiment of the invention;
FIG. 7 is a schematic view of a high-level lightweight fusion module according to an embodiment of the present invention;
the device comprises a 101-computer, a 102-imaging unit, a 103-lighting device, a 104-vertical support rod, a 105-detected object, a CB-conveyor belt, a C1-first imaging unit, a C2-second imaging unit, a C3-third imaging unit and a C4-fourth imaging unit.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The embodiment provides a multi-view imaging system and a lightweight multi-scale feature fusion defect detection method, comprising the following steps:
step 1, constructing a multi-view imaging system;
as shown in fig. 1, the multi-view imaging system includes an imaging unit 102, an illumination device 103, a vertical support bar 104, and a conveyor CB; the imaging unit 102 includes a first imaging unit C1, a second imaging unit C2, a third imaging unit C3, and a fourth imaging unit C4, each of which is provided with an illumination device 103, each of which and the illumination device 103 are mounted on a vertical support bar 104 to form a rectangular area as shown in fig. 2, a conveyor belt CB is disposed in the middle of the rectangular area, the object 105 is placed on the conveyor belt CB, and the multi-view imaging system can capture an image of the surface of the object 105 at a view angle of 360 degrees as the conveyor belt moves.
The imaging principle is shown in fig. 3: the light source LS of the illumination device 103 illuminates the object 105, and the lens L and sensor C of the imaging unit 102 acquire an image of the surface of the object 105.
Step 2, shooting an image of the detected object 105 by using a multi-view imaging system, inputting the image into the computer 101 in real time, wherein a preset detection model is stored in the computer 101, and performing real-time defect detection on the detected object 105 by using the detection model, which specifically comprises:
2.1, constructing a detection model using a feature extraction network, feature detection heads and lightweight multi-scale feature fusion;
as shown in fig. 4, the detection model includes: a backbone (backbone) feature extraction network, a neck (back) feature extraction network and a feature detection head, wherein the backbone feature extraction network is used for carrying out feature extraction and feature fusion on an image to obtain a first feature map; the neck feature extraction network is used for carrying out feature extraction and feature fusion on the first feature map to obtain a second feature map; the feature detection head is used for detecting the second feature map and obtaining a defect detection result.
The backbone feature extraction network comprises: the system comprises a convolution layer, a C2f layer, a light-weight fusion module, a high-grade light-weight fusion module and an SPPF layer;
the neck feature extraction network comprises: the device comprises an up-sampling layer, a splicing layer, a C2f layer, a convolution layer, a light-weight fusion module and a high-grade light-weight fusion module;
An RGB image of size n×h×w (n is the number of channels, h the image height, w the image width) is input into the backbone feature extraction network. Two feature maps of different scales, B1 and B2, are obtained through convolution and C2f layers; these pass through the lightweight fusion module, convolution and C2f layers to obtain feature map B3; B1, B2 and B3 are then input to the high-level lightweight fusion module and SPPF layer to obtain feature map B4. Thus, in the backbone, 4 feature maps of different scales B1, B2, B3, B4 are obtained, with spatial scales of h/4×w/4, h/8×w/8, h/16×w/16 and h/32×w/32 respectively. The number of C2f layers is 1. Feature maps B1 and B2 transfer features through the lightweight fusion module, and context information is further connected through the high-level lightweight fusion module, enhancing the feature information. The input and output feature dimensions of the lightweight fusion module and the high-level lightweight fusion module remain consistent.
Specifically, in this embodiment, an RGB image of size 3×640×640 is input into the backbone feature extraction network to obtain 4 feature maps B1, B2, B3, B4 of different scales, with sizes of 128×160×160, 256×80×80, 512×40×40 and 1024×20×20 respectively.
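The scale relationships above can be checked with a minimal sketch, assuming the YOLO-style strides 4/8/16/32 implied by the embodiment's sizes; the function name and channel-width tuple are illustrative, not taken from the patent:

```python
# Sketch: how the four backbone feature-map sizes in the embodiment follow
# from a 640x640 input. Channel widths (128/256/512/1024) are the values
# stated for this embodiment; strides 4/8/16/32 are assumed from the sizes.

def backbone_scales(h, w, channels=(128, 256, 512, 1024), strides=(4, 8, 16, 32)):
    """Return the (c, h, w) size of each backbone output B1..B4."""
    return [(c, h // s, w // s) for c, s in zip(channels, strides)]

for name, size in zip(("B1", "B2", "B3", "B4"), backbone_scales(640, 640)):
    print(name, size)
```

Running this reproduces exactly the four sizes listed in the embodiment.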
In the neck feature extraction network, feature map B4 obtained from the backbone is upsampled and spliced, then processed by a series of C2f layers, lightweight fusion modules and high-level lightweight fusion modules to obtain 4 feature maps of different scales D1, D2, D3, D4, with spatial scales of h/4×w/4, h/8×w/8, h/16×w/16 and h/32×w/32 respectively. Meanwhile, the designed lightweight fusion module and high-level lightweight fusion module fuse the feature maps output to the detection heads, realizing fusion of feature information and context at different scales and improving network performance.
Specifically, in this embodiment, feature map B4 is input into the neck feature extraction network to obtain 4 feature maps D1, D2, D3, D4 of different scales, with sizes of 128×160×160, 256×80×80, 512×40×40 and 1024×20×20 respectively.
There are 4 feature detection heads at different scales, realizing accurate detection of both large and small targets and improving defect detection performance.
The lightweight fusion module is shown in fig. 5, and comprises two branches, specifically comprises a convolution layer, a normalization layer, a downsampling layer, a triplet attention mechanism (Triplet Attention, TA), a Sigmoid activation function and the like;
assuming that the feature map scale given to the input is n respectively 1 ×h/4×w/4,n 2 Xh/8 Xw/8, specifically 128X 160, 256X 80 in this embodiment. And carrying out interaction enhancement on the characteristic information of the two branches with different scales. Firstly, a transverse connection technology is introduced, and the technology can improve the transmission capability of different scale characteristic information. The module can selectively learn useful feature information in another branch so as to increase the weight of the detected region of interest. A light attention mechanism is introduced, so that the feature extraction capability can be further enhanced, the calculated amount cannot be increased, and the target detection speed is influenced. In this module, n is respectively 1 Xh/4 Xw/4 and n 2 The feature map correspondence vectors of the two branches of Xh/8 Xw/8, i.e. 128X 160, 256X 80 are defined asAnd->Then the image output of the Sigmoid () function can be expressed as:
wherein, the characteristic information similarity parameter of the two branches is denoted as sigma, and if sigma is higher, the characteristic information of the branches is rich and accurate, and vice versa. Thus, the added outputs of the Sigmoid () functions can be written as:
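The per-element gating arithmetic can be illustrated with a small sketch. The gate form σ = Sigmoid(f(v1) + v2) with add = σ·f(v1) + (1 − σ)·v2 is an assumption inferred by analogy with the high-level module's α/add equations below, and f (downsampling plus convolution) is reduced here to pre-computed scalar inputs:

```python
import math

# Numeric sketch of the lightweight fusion gating, per element. The exact
# layer stack (downsampling, convolution) is abstracted away: fv1 is the
# already-aligned response of branch 1, v2 the response of branch 2. The
# gating form is an assumption modeled on the high-level module's equations.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(fv1, v2):
    """Gate two aligned feature values: the gate weights branch 1 vs branch 2."""
    sigma = sigmoid(fv1 + v2)            # similarity parameter of the branches
    return sigma * fv1 + (1 - sigma) * v2

# When both responses are strongly positive, sigma -> 1 and branch 1 dominates;
# when both are negative, sigma -> 0 and branch 2 dominates.
print(fuse(3.0, 2.0))
print(fuse(-3.0, -2.0))
```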
here, a lightweight triple attention mechanism (Triplet Attention, TA) is employed to further extract feature information, which is a method of calculating attention weights by capturing cross-dimensional interactions using a three-branch structure, as shown in fig. 6, including three identical branches, each branch structure including a Z-Pool pooling layer, a convolution layer, and a Sigmoid activation function;
TA establishes inter-dimensional dependencies through rotation operations and residual transforms, and encodes inter-channel and spatial information with negligible computational overhead, realizing information interaction between different dimensions. The resulting output can be expressed as:

y = (1/3)·(T̄1·δ(α1(T1*)) + T̄2·δ(α2(T2*)) + T3·δ(α3(T3*)))

where δ denotes Sigmoid(·); T1, T2 and T3 denote the inputs of the three branches after dimensional interaction, Ti* the result of applying the Z-Pool pooling layer to Ti, and α1, α2 and α3 the standard two-dimensional convolution layers, defined by a convolution kernel size K, in these three branches. The equation can be simplified as:

y = (1/3)·(T̄1ρ1 + T̄2ρ2 + T3ρ3)

where ρ1, ρ2 and ρ3 are the attention weights of the three interaction dimensions computed in the attention mechanism; T̄1ρ1 denotes restoring the product to the dimensions of input T1, and T̄2ρ2 denotes restoring the product to the dimensions of input T2.
The Z-Pool pooling layer above reduces the 0th dimension of a tensor to 2 by concatenating the tensor's average-pooled and max-pooled features. This lets the layer preserve a rich representation of the tensor while shrinking its depth, keeping further computation lightweight. It can be expressed as:

Z-Pool(x) = Concat1d[MaxPool0d(x), AvgPool0d(x)]

where the subscript 0d denotes the max pooling and average pooling operations over dimension 0, and Concat1d denotes concatenating the two pooling results along dimension 1.
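A minimal sketch of the Z-Pool operation on a small C×H×W tensor held as nested lists (pure Python, no framework; the real layer operates on batched tensors):

```python
# Minimal Z-Pool sketch: reduce the channel dimension (dim 0) of a C x H x W
# tensor to 2 by stacking the per-position max and mean over channels, i.e.
# Z-Pool(x) = Concat[MaxPool_0d(x), AvgPool_0d(x)].

def z_pool(x):
    c, h, w = len(x), len(x[0]), len(x[0][0])
    max_map = [[max(x[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)]
    avg_map = [[sum(x[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]
    return [max_map, avg_map]  # shape 2 x H x W

x = [  # 3 channels of a 2x2 feature map
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 0.0], [1.0, 8.0]],
    [[1.0, 1.0], [2.0, 0.0]],
]
print(z_pool(x))
```

Whatever the channel count, the output depth is always 2, which is what keeps the subsequent K×K convolution in each TA branch cheap.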
The high-level lightweight fusion module is shown in fig. 7 and comprises three branches, specifically comprising a downsampling layer, a convolution layer, a normalization layer, a triplet attention mechanism (Triplet Attention, TA) and a Sigmoid activation function;
assuming that the feature map scale given to the input is n respectively 1 ×h/4×w/4,n 2 ×h/8×w/8,n 3 Xh/16 Xw/16, in particular 128X 160, 256X 80, 512X 40, respectively, by t 1 ,t 2 ,t 3 And (3) representing. First, t is 1 ,t 2 And (5) performing downsampling and convolution processing. Then with t 3 Adding results in an output α through a Sigmoid () function, which can be expressed as:
α=Sigmoid(f(t 1 )+f(t 2 )+t 3 )
The fusion of the three features of different scales before the Sigmoid(·) function enhances the context linking. The resulting α is then multiplied with the other two branches, so feature information can be selected: the weight of the region of interest is increased while the weight of non-interest regions is suppressed. This improves detection performance well, with a small computational load and without affecting detection speed. The summed output can be expressed as:

add = α·f(t1) + (1 − α)·f(t2)
where f denotes the downsampling and convolution layer operations. Meanwhile, the TA module and a residual network are introduced to further extract feature information, prevent gradient vanishing and overfitting, and strengthen the robustness of the network, so that the network can extract features from complex feature information with an extremely small computational load. The final output can be expressed as:

Out = Sc(T(Cb(add)) + Cb(add))

where Sc denotes the convolution, normalization and activation-function operations, T is the TA attention-mechanism function, and Cb is a convolution and normalization operation.
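The α/add equations of the high-level module can be checked numerically; in this sketch f (downsampling plus convolution) is abstracted into pre-computed scalar responses, so only the gating arithmetic stated in the text is exercised:

```python
import math

# Numeric sketch of the high-level fusion equations from the text:
#   alpha = Sigmoid(f(t1) + f(t2) + t3)
#   add   = alpha*f(t1) + (1 - alpha)*f(t2)
# The inputs below are assumed post-f scalar responses at one spatial position.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def high_level_fuse(ft1, ft2, t3):
    alpha = sigmoid(ft1 + ft2 + t3)      # one gate built from all three scales
    return alpha * ft1 + (1 - alpha) * ft2

print(high_level_fuse(2.0, 1.0, 3.0))
```

Because all three scales feed the gate, a strong response in t3 alone can shift the weighting between the other two branches, which is the context linking the text describes.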
Step 2.2, training a detection model;
object images containing defects are acquired, and a data set I= [ I ] is constructed 1 ,I 2 ,...,I K ]. Wherein the total element amount in the data set I is K, the size of the image is n multiplied by h multiplied by w, n is the image channel, h is the image height, and w is the image width. Specifically, the total amount of data set elements constructed in this embodiment is 4730, and the image size is 3×640×640.
The images are annotated with the open-source tool labelImg; the annotation content is the defect class and the coordinates of the upper-left and lower-right corners of the defect target, and the annotation information files are in xml format.
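A sketch of reading one such labelImg-style (Pascal VOC) xml file with the standard library; the class name "scratch" and the coordinates below are hypothetical examples, not from the patent's dataset:

```python
import xml.etree.ElementTree as ET

# Hypothetical labelImg annotation: one defect with class name and the
# upper-left / lower-right corner coordinates of its bounding box.
SAMPLE = """<annotation>
  <size><width>640</width><height>640</height><depth>3</depth></size>
  <object>
    <name>scratch</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>200</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>"""

def parse_annotation(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, coords))
    return objects

print(parse_annotation(SAMPLE))
```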
And dividing the marked defect image data set into a training set, a verification set and a test set according to the proportion.
Training is performed with a gradually increasing (warm-up) learning rate that rises to a final value lr; the loss function is L, consistent with that of YOLOv8.
The training parameters were set as follows: learning rate lr of 0.001, batch size of 16, training/validation/test split ratio of 8:1:1, SGD optimizer, and 500 total training epochs.
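The 8:1:1 split of the 4730-image data set can be sketched as follows; the function name and fixed seed are illustrative choices, not part of the patent:

```python
import random

# Sketch: shuffle image identifiers and split them 8:1:1 into training,
# validation and test subsets, as described in the text.

def split_dataset(items, ratios=(8, 1, 1), seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(4730))
print(len(train), len(val), len(test))
```

For this embodiment's 4730 images the split yields 3784/473/473 images.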
Step 2.3, performing defect detection based on the trained detection model;
an image of the object 105 is taken by the imaging unit 102, and is input into the computer 101 in real time, detected in real time by the trained detection model, and output a target image and a target defect prediction frame.
First, the image to be detected Img, of size n×h×w, is input into the detection model; after network processing, the output prediction frames of the 4 detection heads are obtained, with output feature map scales of h/4×w/4, h/8×w/8, h/16×w/16 and h/32×w/32 respectively.
Specifically, in the present embodiment, the size of the image is 3×640×640, and the feature map output is 160×160, 80×80, 40×40, and 20×20, respectively.
Each prediction box is represented by four coordinate values, which are processed by a convolution layer and an activation function, typically a linear activation function or Sigmoid(·), to map the coordinates into the range 0 to 1. At the same time, the model predicts a probability distribution over all possible classes for each prediction box; this is done by a convolution layer and a Softmax(·) function, ensuring that each class score lies between 0 and 1 and that the scores sum to 1. The output of a detection head is a tensor containing the bounding-box coordinates and class probabilities of all prediction boxes. In the post-processing stage, non-maximum suppression (NMS) removes redundant bounding boxes, finally preserving the highest-confidence target boxes.
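The NMS step described above can be sketched in a few lines; this is a generic greedy IoU-based NMS, not the patent's exact implementation:

```python
# Sketch of non-maximum suppression: keep the highest-confidence box, drop
# every remaining box whose IoU with it exceeds a threshold, and repeat.
# Boxes are (x1, y1, x2, y2, score) tuples in pixel coordinates.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, iou_thresh=0.5):
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest remaining confidence
        kept.append(best)
        remaining = [b for b in remaining if iou(best, b) <= iou_thresh]
    return kept

boxes = [
    (100, 100, 200, 200, 0.9),  # kept: highest confidence
    (105, 105, 205, 205, 0.8),  # suppressed: heavy overlap with the first
    (300, 300, 380, 380, 0.7),  # kept: disjoint region
]
print(nms(boxes))
```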
The coordinate position of a prediction box is typically represented as (x, y, w, h), where (x, y) are the coordinates of the upper-left corner of the box and (w, h) its width and height. The normalized position coordinates (X, Y, W, H) scale these values to the range of the image, typically between 0 and 1. If a prediction frame is output for the target to be detected, the target contains a defect; otherwise, the target has no defect.
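Normalizing (x, y, w, h) into (X, Y, W, H) is a simple division by the image size; a sketch for the embodiment's 640×640 images, with illustrative pixel values:

```python
# Sketch: normalize a prediction box (x, y, w, h) in pixels into
# (X, Y, W, H) in the 0-1 range of the image, as described above.

def normalize_box(x, y, w, h, img_w, img_h):
    return (x / img_w, y / img_h, w / img_w, h / img_h)

# A hypothetical 64x32 box with its upper-left corner at (160, 320)
# inside this embodiment's 640x640 image.
print(normalize_box(160, 320, 64, 32, 640, 640))
```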
The above embodiments merely illustrate preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art without departing from the spirit of the present invention all fall within the scope of the present invention as defined in the appended claims.

Claims (10)

1. A multi-view imaging system, comprising: imaging units, illumination devices, vertical support rods and a conveyor belt, wherein each imaging unit is equipped with an illumination device, the imaging units and illumination devices are mounted on the vertical support rods, the vertical support rods are arranged at the vertices of a rectangle to form a central rectangular region, and the conveyor belt is disposed in the rectangular region for placing the target to be detected.
2. A lightweight multi-scale feature fusion defect detection method, characterized by comprising the following steps:
acquiring an image of an object to be detected, wherein the image of the object to be detected is obtained by the multi-view imaging system of claim 1;
inputting the image into a preset detection model and outputting a defect detection result for the target to be detected, wherein the detection model is trained on a training set comprising images containing defects and labels identifying the defects, and is constructed from a feature extraction network and a lightweight multi-scale feature fusion network.
3. The lightweight multi-scale feature fusion defect detection method according to claim 2, wherein the detection model comprises: a backbone feature extraction network, a neck feature extraction network and a feature detection head, wherein the backbone feature extraction network performs feature extraction and feature fusion on the image to obtain a first feature map; the neck feature extraction network performs feature extraction and feature fusion on the first feature map to obtain a second feature map; and the feature detection head detects the second feature map to obtain a defect detection result.
4. The lightweight multi-scale feature fusion defect detection method according to claim 3, wherein the backbone feature extraction network comprises: a convolution layer, a C2f layer, a lightweight fusion module, a high-level lightweight fusion module and an SPPF layer; the image is input into the backbone feature extraction network, and feature maps B1 and B2 are obtained through the convolution layer and the C2f layer; feature maps B1 and B2 pass through the lightweight fusion module, the convolution layer and the C2f layer to obtain feature map B3; and feature maps B1, B2 and B3 pass through the high-level lightweight fusion module, the convolution layer, the C2f layer and the SPPF layer to obtain feature map B4.
5. The lightweight multi-scale feature fusion defect detection method according to claim 4, wherein the neck feature extraction network comprises: an upsampling layer, a splicing layer, a convolution layer, a C2f layer, a lightweight fusion module and a high-level lightweight fusion module; feature map B4 is input into the neck feature extraction network, and feature map D1 is obtained through the upsampling layer, the splicing layer and the C2f layer; feature map D1 passes through the convolution layer, the splicing layer and the C2f layer to obtain feature map D2; feature maps D1 and D2 pass through the lightweight fusion module, the convolution layer, the splicing layer and the C2f layer to obtain feature map D3; and feature maps D1, D2 and D3 pass through the high-level lightweight fusion module, the convolution layer, the splicing layer and the C2f layer to obtain feature map D4.
6. The method of claim 5, wherein the lightweight fusion module comprises: a first branch and a second branch; the feature information of the first branch and the second branch is processed by a Sigmoid activation function and added, after which a triplet attention mechanism is introduced to further extract the fused feature information of the first branch and the second branch.
7. The method of claim 5, wherein the high-level lightweight fusion module comprises: a third branch, a fourth branch and a fifth branch; the third branch and the fourth branch each undergo downsampling and convolution operations to obtain their feature information; this feature information is processed by a Sigmoid activation function and added to the feature information of the fifth branch, after which a triplet attention mechanism is introduced to further extract the fused feature information of the third, fourth and fifth branches.
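The gate-and-sum step of claims 6 and 7 can be sketched as follows. The function names, the stride-2 slicing stand-in for the strided convolution, and the assumption that all branches share one channel count are mine, not the patent's; claim 6 is the two-branch analogue of the same pattern, and the triplet attention of claim 8 would be applied to the returned sum.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def downsample(x, times=1):
    # stride-2 spatial downsampling: a stand-in for the claim's
    # downsampling + convolution operations on the shallower branches
    for _ in range(times):
        x = x[:, ::2, ::2]
    return x

def high_level_fuse(b1, b2, b3):
    # shallower branches are brought to the deepest resolution,
    # gated by a Sigmoid, and added to the deepest branch (claim 7)
    g1 = sigmoid(downsample(b1, times=2))   # third branch
    g2 = sigmoid(downsample(b2, times=1))   # fourth branch
    return g1 + g2 + b3                     # fifth branch passes through
```

Gating each shallow branch through a Sigmoid bounds its contribution to (0, 1) per element, so a low-level map cannot swamp the deepest features before attention is applied.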
8. The method of claim 7, wherein the triplet attention mechanism comprises: three identical sixth branches, each performing Z-pooling, convolution and Sigmoid activation function operations; the outputs of the three branches are averaged to obtain the output of the triplet attention mechanism.
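Claim 8's triplet attention can be sketched in NumPy. This is a structural illustration only: the published triplet-attention design applies a k×k convolution to the two Z-pooled maps, which is replaced here by a simple average, and each branch attends over a different tensor axis so that the three outputs can be averaged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def z_pool(x, axis):
    # Z-pooling: stack the max and the mean along the given axis (2 maps)
    return np.stack([x.max(axis=axis), x.mean(axis=axis)])

def branch(x, axis):
    # one of the three identical branches: Z-pool, a convolution
    # (replaced here by averaging the two pooled maps), then Sigmoid
    pooled = z_pool(x, axis)                 # shape (2, ...)
    attn = sigmoid(pooled.mean(axis=0))      # attention map over this axis
    return x * np.expand_dims(attn, axis)

def triplet_attention(x):
    # x has shape (C, H, W); each branch attends over a different axis,
    # and the three branch outputs are averaged (claim 8)
    outs = [branch(x, axis) for axis in (0, 1, 2)]
    return sum(outs) / 3.0
```

The output keeps the input shape, so the mechanism can be dropped after the fusion sum of claims 6 and 7 without changing downstream layer sizes.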
9. The method of claim 5, wherein the feature detection head for detecting the second feature map comprises: feature detection heads that respectively detect the feature map D1, the feature map D2, the feature map D3 and the feature map D4, and output a prediction box and a category probability, wherein the prediction box is represented by coordinate values and the category probability is a value mapped into the range 0-1.
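One detection head from claim 9 can be sketched as follows; the "1×1 convolution" stand-in (a channel-wise mean per spatial cell), the class count, and all names are hypothetical, but the output contract matches the claim: 4 box coordinate values per cell plus per-class probabilities mapped into [0, 1].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect_head(feature_map, num_classes=3):
    # hypothetical head over one neck output (D1..D4): project the feature
    # map to 4 box coordinates + per-class logits per spatial cell
    c, h, w = feature_map.shape
    # stand-in "1x1 convolution": channel-wise mean, broadcast to all outputs
    logits = np.broadcast_to(feature_map.mean(axis=0), (4 + num_classes, h, w))
    boxes = logits[:4]                  # coordinate values of the prediction box
    class_probs = sigmoid(logits[4:])   # category probabilities in [0, 1]
    return boxes, class_probs
```

Running one such head per scale (D1 through D4) is what lets small and large defects be predicted at the resolution that suits them.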
10. The method for lightweight multi-scale feature fusion defect detection of claim 9, wherein obtaining the defect detection result comprises:
judging whether the target to be detected has a defect according to whether the feature detection head outputs a prediction box: if a prediction box is output, the target to be detected has a defect; if no prediction box is output, the target to be detected has no defect.
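The decision rule of claim 10 reduces to testing whether any prediction box survives. The confidence threshold below is my addition for illustration; the claim itself only tests box presence.

```python
def has_defect(pred_boxes, class_probs, conf_thresh=0.25):
    # the target is judged defective iff the detection head emits at least
    # one prediction box whose best class probability clears the threshold
    # (threshold value hypothetical; claim 10 only tests box presence)
    kept = [box for box, probs in zip(pred_boxes, class_probs)
            if max(probs) >= conf_thresh]
    return len(kept) > 0
```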
CN202311840928.6A 2023-12-28 2023-12-28 Multi-view imaging system and lightweight multi-scale feature fusion defect detection method Pending CN117764969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311840928.6A CN117764969A (en) 2023-12-28 2023-12-28 Multi-view imaging system and lightweight multi-scale feature fusion defect detection method


Publications (1)

Publication Number Publication Date
CN117764969A true CN117764969A (en) 2024-03-26

Family

ID=90322106


Country Status (1)

Country Link
CN (1) CN117764969A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819165A (en) * 2009-02-27 2010-09-01 圣戈本玻璃法国公司 Method and system for detecting defect of patterned substrate
CN113807187A (en) * 2021-08-20 2021-12-17 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN113822885A (en) * 2021-11-23 2021-12-21 常州微亿智造科技有限公司 Workpiece defect detection method and device integrating multi-attention machine system
CN114241792A (en) * 2022-02-28 2022-03-25 科大天工智能装备技术(天津)有限公司 Traffic flow detection method and system
CN115775236A (en) * 2022-11-24 2023-03-10 广东工业大学 Surface tiny defect visual detection method and system based on multi-scale feature fusion
DE202023103167U1 (en) * 2023-06-08 2023-06-19 Nasib Singh Gill A system for real-time detection of underwater debris using the finely tuned YOLOv8
CN116630301A (en) * 2023-06-20 2023-08-22 盐城工学院 Strip steel surface small target defect detection method and system based on super resolution and YOLOv8
CN116977844A (en) * 2023-08-11 2023-10-31 武汉轻工大学 Lightweight underwater target real-time detection method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DAI XIAOHONG; CHEN HUAJIANG; ZHU CHAOPING: "Research on Surface Defect Detection and Implementation for Metal Material Workpieces Based on Improved Faster RCNN", Surface Technology, no. 10, 20 October 2020 (2020-10-20) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination