CN117173182A

CN117173182A - Defect detection method, system, equipment and medium based on coding and decoding network

Info

Publication number: CN117173182A
Application number: CN202311451815.7A
Authority: CN
Inventors: 陈宇; 陈震
Original assignee: Xiamen Weitu Software Technology Co ltd; Xiamen Weiya Intelligent Technology Co ltd
Current assignee: Xiamen Weitu Software Technology Co ltd; Xiamen Weiya Intelligent Technology Co ltd
Priority date: 2023-11-03
Filing date: 2023-11-03
Publication date: 2023-12-05
Anticipated expiration: 2043-11-03
Also published as: CN117173182B

Abstract

The present invention relates to the field of defect detection, and in particular to a defect detection method and system based on a coding/decoding network, a computer device, and a storage medium. The method comprises the following steps: analyzing the input image through image patch processing and image mask processing to obtain image characteristic information; performing BEIT coding processing on the image characteristic information to obtain visual sign information; performing DETR decoding processing on the visual sign information to obtain serialized data; performing position coding based on the serialized data to obtain corresponding position coding information; and detecting target defects from the serialized data and the corresponding position coding information through an MLP prediction layer so as to output detection results. With a fine-tuning training task used to enhance the detection effect, the invention can also improve the running speed of the model, meeting the industrial need for rapid iterative optimization of algorithms with high detection accuracy and high detection speed.

Description

Defect detection method, system, equipment and medium based on coding and decoding network

Technical Field

The present invention relates to the field of defect detection technologies, and in particular, to a method, a system, a computer device, and a storage medium for detecting defects based on a coding/decoding network.

Background

In the field of industrial defect detection, an annotation file is generally used to train a model to learn the characteristic information of defects. Industrial defect detection application scenarios keep increasing and defect types change in complex ways: defects tend to progress from quantitative change to qualitative change, and defects of the same type can differ considerably in appearance. A detection model therefore needs to keep learning and remain compatible with new defect characteristics so that it can detect defects of different appearance in each period, which places demands on both defect annotation and model generalization.

To reduce the dependence of supervised models on annotated data, unannotated data can be used for self-supervised learning to obtain the parameters of a pre-trained large model, which opens up a larger application space; for example, a DETR network can be used for defect detection. DETR is a classical algorithm with an Encoder-Decoder structure: the DETR Encoder (DETR encoding) obtains feature information of the training image, and the DETR Decoder (DETR decoding) directly converts the image feature information into serialized information and outputs the target frames.

However, since the encoding part of the DETR model is only an ordinary Transformer model, its capability of extracting image feature information is limited. In addition, the prior art generally applies the DETR network to segmentation and classification networks rather than to networks for target detection tasks, so the task information of the target frame cannot be used accurately to learn defect feature information, and the required detection effect cannot be obtained in practice.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a defect detection method, a system, a computer device and a storage medium based on a coding and decoding network. Through an optimized design of the DETR network that combines DETR decoding with BEIT coding, the accuracy of defect detection can be greatly improved.

In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:

in a first aspect, in one embodiment of the present invention, there is provided a defect detection method based on a codec network, the method including the steps of:

analyzing the input image through image patch processing and image mask processing to obtain image characteristic information;

performing BEIT coding processing on the image characteristic information to obtain visual sign information;

performing DETR decoding processing on the visual sign information to obtain serialized data;

performing position coding based on the serialized data to obtain corresponding position coding information;

and detecting target defects of the serialized data and the corresponding position coding information through an MLP prediction layer so as to output detection results.

As a further aspect of the present invention, the image patch process further includes:

splitting an input image into N image patches with equal size, wherein N is a positive integer;

carrying out serialization processing on the image patch to obtain an image patch sequence;

encoding the image patch sequence to obtain image characteristic information;

wherein, the image patch is expressed as:

$X \in \mathbb{R}^{H \times W \times C}$;

wherein X is a feature vector of the image patch, H represents the length of the input image, W represents the width of the input image, and C represents the number of image channels; $\mathbb{R}$ represents the set of real numbers, and H, W, C give the size of each dimension of the feature vector;

the image patch sequence is expressed as:

$X^{p} \in \mathbb{R}^{N \times (P^{2} \cdot C)}$;

wherein $X^{p}$ is the feature vector obtained by serializing the image patches, P represents the resolution of each image patch, and N represents the number of image patches, which is also the length of the image patch sequence, $N = HW / P^{2}$; $\mathbb{R}$ represents the set of real numbers, and N, P, C give the size of each dimension of the feature vector.

As a further scheme of the invention, encoding the image patch sequence to obtain image characteristic information means inputting the image patch sequence $X^{p}$ into a linear layer network to obtain the embedding $X^{p}E$ of the image patches; wherein $E \in \mathbb{R}^{(P^{2} \cdot C) \times D}$, P represents the image patch resolution, C represents the number of image channels, and D represents the vector dimension of the embedding; $\mathbb{R}$ represents the set of real numbers, and D, P, C give the size of each dimension of the feature vector.

As a further scheme of the invention, before the image patches are serialized, the method further comprises performing image mask processing on the image patches; the image mask processing uses blockwise masking: the N image patches are input for mask processing, and a masked image patch set is output to obtain an image mask sequence.

As a further scheme of the invention, the BEIT coding process adopts a BERT model to train the image mask sequence, the loss function is a cross entropy loss function, and visual sign information is output.

As a further aspect of the present invention, the DETR decoding process decodes and converts visual flag information into serialized data based on a length-width dimension of an input image, and specifically includes:

performing dimension compression on the visual sign information with a convolution whose kernel size is 1x1, so that the dimension of the visual sign information is consistent with the length and width of the input image, thereby obtaining serialized data based on the length and width dimensions.
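
A 1x1 convolution applies the same linear map across channels at every spatial position, so it changes only the channel dimension and leaves the length and width untouched. The sketch below illustrates that idea in plain Python; the function names `conv1x1` and `flatten_spatial`, the toy sizes, and the row-major flattening order are assumptions made for illustration and are not taken from the patent.

```python
def conv1x1(feature_map, weight, bias):
    """Apply a 1x1 convolution to an H x W x C_in feature map (nested lists).

    weight has shape C_out x C_in and bias has length C_out, so every pixel's
    channel vector is mapped independently from C_in to C_out values.
    """
    return [[[sum(wk * x for wk, x in zip(w_row, pixel)) + b
              for w_row, b in zip(weight, bias)]
             for pixel in row]
            for row in feature_map]


def flatten_spatial(feature_map):
    """Serialize an H x W x C map into a sequence of H*W channel vectors."""
    return [pixel for row in feature_map for pixel in row]


# Toy example: compress 4 channels to 2 on a 2 x 3 map, then serialize.
feature_map = [[[1.0, 2.0, 3.0, 4.0] for _ in range(3)] for _ in range(2)]
weight = [[0.25, 0.25, 0.25, 0.25],   # output channel 0: mean of the inputs
          [1.0, 0.0, 0.0, -1.0]]      # output channel 1: first minus last
bias = [0.0, 0.0]

compressed = conv1x1(feature_map, weight, bias)
sequence = flatten_spatial(compressed)
print(len(compressed), len(compressed[0]), len(compressed[0][0]))
print(len(sequence))
```

The spatial layout is preserved by the convolution, which is why the serialized sequence can still be indexed by height and width for the position coding that follows.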

As a further aspect of the present invention, the position coding process means obtaining position coding information based on the serialized data, and specifically includes:

based on the serialized data of each length and width dimension, position coding is performed on the height dimension using sin functions of different frequencies, and on the width dimension using cos functions of different frequencies; finally, the height-dimension and width-dimension codes, each occupying half of the channels, are merged to obtain the position coding information.
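
The position coding described above can be sketched as follows. This is a minimal illustration, assuming that the height code fills the first half of the channels and the width code the second half, and that the frequencies follow the geometric schedule commonly used for Transformer position encodings; the `temperature` value of 10000 and the function name `position_encoding` are assumptions, not values stated in the patent.

```python
import math


def position_encoding(h, w, d, temperature=10000.0):
    """Return an h x w x d position code as nested lists.

    The first d/2 channels encode the row index with sin functions of
    different frequencies; the last d/2 channels encode the column index
    with cos functions of different frequencies.
    """
    if d % 2:
        raise ValueError("d must be even so each axis gets half the channels")
    half = d // 2
    # One frequency per channel, decreasing geometrically (assumed schedule).
    freqs = [temperature ** (-k / half) for k in range(half)]
    return [[[math.sin(y * f) for f in freqs] + [math.cos(x * f) for f in freqs]
             for x in range(w)]
            for y in range(h)]


pe = position_encoding(h=3, w=4, d=8)
print(len(pe), len(pe[0]), len(pe[0][0]))
```

Because the two halves depend on different axes, two positions in the same row share the first half of their code and two positions in the same column share the second half.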

In a second aspect, in yet another embodiment of the present invention, there is provided a defect detection system based on a codec network, the system comprising:

the image analysis module analyzes the input image through image patch processing and image mask processing to obtain image characteristic information;

the BEIT module is used for performing BEIT coding processing on the image characteristic information to obtain visual sign information;

the DETR module is used for carrying out DETR decoding processing on the visual mark information to obtain serialized data; and performing position coding based on the serialized data to obtain corresponding position coding information;

and the MLP prediction layer is used for detecting the target defects of the serialized data and the corresponding position coding information so as to output detection results.

In a third aspect, in another embodiment of the present invention, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of a codec network based defect detection method when the computer program is loaded and executed by the processor.

In a fourth aspect, in a further embodiment provided by the present invention, a storage medium is provided, storing a computer program, which when loaded and executed by a processor, implements the steps of the codec network based defect detection method.

The technical scheme provided by the invention has the following beneficial effects:

(1) According to the invention, through the optimized design of the DETR network, the DETR decoding and the BEIT encoding are combined, so that the defect detection accuracy can be greatly improved;

(2) According to the invention, the input image is analyzed through the image patch processing and the image mask processing to obtain the image characteristic information, and the BEIT coding processing is carried out on the image characteristic information to obtain the visual sign information, so that more comprehensive image characteristic information can be obtained, and the early training effect and the accuracy of later defect detection are improved;

(3) On the basis of visual sign information obtained based on BEIT coding, the invention obtains the serialized data of the input image and the corresponding position coding information through DETR decoding processing, and can obtain a target frame task by decoding output without the complex post-result processing process of the traditional target detection network, thereby greatly improving the detection efficiency;

(4) The DETR decoding part is an end-to-end target detection model based on the Transformer design. It has no non-maximum suppression post-processing mechanism and needs no prior knowledge such as anchors and their corresponding constraints; the target detection task is realized end to end by the whole network structure, which reduces the constraints that pre- and post-processing place on the target detection model;

(5) By adopting the technical scheme, the running speed of the model can be improved while the detection effect is enhanced, meeting the industrial need for rapid iterative optimization of algorithms with high detection accuracy and high detection speed;

(6) In terms of model design, the invention is based on the BEIT architecture and is optimized with reference to the DETR structure so as to make reasonable use of the target detection model, and the model structure can be switched to different Transformer structures.

These and other aspects of the invention will be more readily apparent from the following description of the embodiments. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a defect detection method based on a codec network according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.

It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

In particular, embodiments of the present invention are further described below with reference to the accompanying drawings.

Referring to fig. 1, the defect detection method based on the codec network provided by the embodiment of the invention includes the following steps:

analyzing the input image through image patch processing and image mask processing to obtain image characteristic information;

performing BEIT coding processing on the image characteristic information to obtain visual sign information;

performing DETR decoding processing on the visual sign information to obtain serialized data;

performing position coding based on the serialized data to obtain corresponding position coding information;

and detecting target defects from the serialized data and the corresponding position coding information through an MLP prediction layer so as to output detection results.

Wherein the image patch processing further comprises:

splitting an input image into N image patches with equal size, wherein N is a positive integer;

carrying out serialization processing on the image patch to obtain an image patch sequence;

encoding the image patch sequence to obtain image characteristic information;

wherein, the image patch is expressed as:

$X \in \mathbb{R}^{H \times W \times C}$;

the image patch sequence is expressed as:

$X^{p} \in \mathbb{R}^{N \times (P^{2} \cdot C)}$;

wherein H and W represent the length and width of the input image and C represents the number of image channels; $X^{p}$ is the feature vector obtained by serializing the image patches, P represents the resolution of each image patch, and N represents the number of image patches, which is also the length of the image patch sequence, $N = HW / P^{2}$; $\mathbb{R}$ represents the set of real numbers, and N, P, C give the size of each dimension of the feature vector.

The image patch sequence is encoded to obtain image characteristic information by inputting the image patch sequence $X^{p}$ into a linear layer network to obtain the embedding $X^{p}E$ of the image patches; wherein $E \in \mathbb{R}^{(P^{2} \cdot C) \times D}$, P represents the image patch resolution, C represents the number of image channels, and D represents the vector dimension of the embedding; $\mathbb{R}$ represents the set of real numbers, and D, P, C give the size of each dimension of the feature vector.

The vector dimension used by the Transformer in each network layer is a fixed value, denoted D. After an image patch is flattened, a trainable linear network layer maps it to D dimensions; this mapping is called patch embedding, and it converts the visual task into a sequence-to-sequence problem.
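
The patch splitting and linear mapping described above can be sketched in plain Python as follows. This is a minimal illustration on a toy image; the function names `split_into_patches` and `linear_embed`, the toy sizes, and the randomly initialised weight matrix are assumptions for demonstration only (in the method the linear layer is trained).

```python
import random


def split_into_patches(image, p):
    """Split an H x W x C image (nested lists) into N = H*W / p**2 patches.

    Each patch is flattened to a vector of length p * p * C, so the result
    is the image patch sequence with shape N x (p^2 * C).
    """
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("image length and width must be divisible by p")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            flat = [value
                    for row in image[top:top + p]
                    for pixel in row[left:left + p]
                    for value in pixel]
            patches.append(flat)
    return patches


def linear_embed(patches, weight):
    """Map each flattened patch to D dimensions with a (p^2 * C) x D matrix."""
    d = len(weight[0])
    return [[sum(x * weight[k][j] for k, x in enumerate(patch)) for j in range(d)]
            for patch in patches]


# Toy sizes (assumed): a 4 x 4 RGB image, 2 x 2 patches, 5-dimensional embedding.
H, W, C, P, D = 4, 4, 3, 2, 5
rng = random.Random(0)
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = split_into_patches(image, P)
weight = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
embeddings = linear_embed(patches, weight)
print(len(patches), len(patches[0]), len(embeddings[0]))
```

The sequence length N depends only on the image size and the patch resolution, while D is fixed by the Transformer, which is why the same encoder can process any image whose length and width are multiples of P.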

The method further comprises: before the image patches are serialized, performing image mask processing on the image patches; the image mask processing uses blockwise masking: the N image patches are input for mask processing, and a masked image patch set is output to obtain an image mask sequence.

MIM mask generation is an iterative loop: because 40% of the image patches need to be masked, the loop ends when the number of masked patches reaches 40% of the number of image patches.

The masking processing is performed on N image patches with equal sizes to obtain an image set, and the method comprises the following steps:

(1) Randomly generating the number s of patches in a mask block, where s ranges from 16 to the maximum number of image patches;

(2) Randomly generating the aspect ratio r of the mask block, where r ranges from 0.3 to 1/0.3;

(3) Generating the length a and width b of the mask block based on the number s of patches in the block and the aspect ratio r;

wherein the length a and width b are calculated as follows:

$a = \sqrt{s \cdot r}, \quad b = \sqrt{s / r}$;

(4) Locating the mask on the original image according to its length and width, obtaining and recording the upper-left corner coordinates of the mask area, and then setting the corresponding area to 0 to obtain the mask area;

(5) Merging the newly masked block into the already generated mask area according to the upper-left corner coordinates.
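
The masking steps above can be sketched as a loop over a grid of patch positions. This is a minimal sketch under stated assumptions: the block size is taken as $a = \sqrt{s \cdot r}$ by $b = \sqrt{s / r}$ (so that $a \cdot b = s$ and $a / b = r$), the upper bound of s is clipped to the remaining mask budget, and the last block is truncated so that exactly the target number of patches is masked. The function name `blockwise_mask` and the 14 x 14 grid are hypothetical.

```python
import math
import random


def blockwise_mask(h, w, ratio=0.4, min_block=16, seed=None):
    """Return the set of masked (row, col) positions on an h x w patch grid."""
    rng = random.Random(seed)
    target = int(ratio * h * w)          # number of patches to mask
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        # (1) block area s, assumed clipped to the remaining budget
        s = rng.uniform(min_block, max(min_block, remaining))
        # (2) aspect ratio r between 0.3 and 1/0.3
        r = rng.uniform(0.3, 1 / 0.3)
        # (3) block length a and width b, kept inside the grid
        a = min(h, max(1, round(math.sqrt(s * r))))
        b = min(w, max(1, round(math.sqrt(s / r))))
        # (4) upper-left corner of the block
        top = rng.randint(0, h - a)
        left = rng.randint(0, w - b)
        # (5) merge the new block into the mask, stopping at the target
        for i in range(top, top + a):
            for j in range(left, left + b):
                if len(masked) < target:
                    masked.add((i, j))
    return masked


mask = blockwise_mask(14, 14, seed=0)
print(len(mask))
for i in range(14):
    print("".join("#" if (i, j) in mask else "." for j in range(14)))
```

Masking contiguous blocks rather than isolated patches removes larger regions of context, so the encoder has to infer the missing content from more distant patches.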

The BEIT coding process is to train the image mask sequence by adopting a BERT model, wherein the loss function is a cross entropy loss function, and visual sign information is output.

Specifically, the image patches $X$ can be expressed as a vector of N visual markers, expressed as:

$z = [z_{1}, \dots, z_{N}] \in \mathcal{V}^{h \times w}$

where $\mathcal{V}$ is the vocabulary of visual markers, and w and h denote the number of columns and rows into which the image patches are divided, respectively. The visual markers of the image are calculated by a dVAE module, in which the image information is encoded into visual markers by a Tokenizer algorithm.

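The DETR decoding process decodes and converts visual sign information into serialized data based on the length and width dimensions of the input image, and specifically includes: performing dimension compression on the visual sign information with a convolution whose kernel size is 1x1, so that the dimension of the visual sign information is consistent with the length and width of the input image, thereby obtaining serialized data based on the length and width dimensions.

The position coding processing means obtaining position coding information based on the serialized data, and specifically includes: based on the serialized data of each length and width dimension, performing position coding on the height dimension using sin functions of different frequencies and on the width dimension using cos functions of different frequencies, and finally merging the height-dimension and width-dimension codes, each occupying half of the channels, to obtain the position coding information.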

The MLP prediction layer performs defect detection. During defect detection, different target frame query information (Object Query) has different detection effects in different location areas, so the Object Query needs to be position coded (Positional Encoding), and each detected object needs to be associated with its category and location information.

64 position-coded target frame queries are input and 64 objects are decoded in parallel; after the attention mechanism and mapping, 64 tokens (mark information of target frames) are randomly output and then input into the MLP prediction layer for prediction and loss calculation, thereby realizing supervised training on targets.

In the MLP prediction layer, the FFN layer is computed by a convolution layer with ReLU activation, and the convolution layer step size is 1*1. The FFN predicts the normalized center coordinates, width, and height of each defect frame.

In the defect detection process, bipartite graph matching is performed by the Hungarian algorithm, and real frame set elements and predicted frame set elements are put in one-to-one correspondence to obtain the correspondence between target frames and real frames; the corresponding loss function is then calculated based on this correspondence.

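Specifically, in this embodiment, a fixed number N = 64 of bounding boxes is predicted, and the real label frames are first amplified to 64 frames as well, so that the prediction frames $\hat{y} = \{\hat{y}_{i}\}_{i=1}^{N}$ and the real frames $y = \{y_{i}\}_{i=1}^{N}$ are both sets of 64 elements. An extra special label $\varnothing$ represents that no object is detected, i.e. the object is set to the background class. The Hungarian algorithm is first used to perform bipartite graph matching, putting the elements of the real frame set and the predicted frame set in one-to-one correspondence.

The matching searches for the correspondence $\hat{\sigma}$ that minimizes the value of the matching loss function $\mathcal{L}_{match}$ between the real frame information $y_{i}$ and the prediction frame information $\hat{y}_{\sigma(i)}$ over the set elements:

$\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{N} \mathcal{L}_{match}(y_{i}, \hat{y}_{\sigma(i)})$

wherein the $\mathcal{L}_{match}$ function is specifically implemented as follows:

$\mathcal{L}_{match}(y_{i}, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_{i} \neq \varnothing\}} \hat{p}_{\sigma(i)}(c_{i}) + \mathbb{1}_{\{c_{i} \neq \varnothing\}} \mathcal{L}_{box}(b_{i}, \hat{b}_{\sigma(i)})$

where $c_{i}$ and $b_{i}$ are the category and frame of the i-th real frame, $\hat{p}_{\sigma(i)}(c_{i})$ is the predicted probability of category $c_{i}$, and $\hat{b}_{\sigma(i)}$ is the predicted frame.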

Optimal matching between the real frame and predicted frame elements is obtained through the bipartite graph matching function; the optimal match is a correspondence, and the final result is then obtained through the loss function. The loss function differs considerably from the matching function: the loss function needs to operate on positive values, i.e. it uses log-probability.

The loss function is divided into two parts. The first is the classification loss function, for which the cross entropy loss function is used, expressed as $-\log \hat{p}_{\sigma(i)}(c_{i})$, where $\hat{p}_{\sigma(i)}(c_{i})$ is the probability that the matched prediction assigns to the real category $c_{i}$. The second is the target bounding box regression loss function, expressed as $\mathcal{L}_{box}$, which is implemented mainly as a weighted sum of the IOU loss function $\mathcal{L}_{iou}$ and the L1 loss function, where the IOU loss is less sensitive to frame scale and the L1 loss is more sensitive to frame scale; in practice the prediction submodule uses the GIOU loss, as follows:

$\mathcal{L}_{box}(b_{i}, \hat{b}_{\sigma(i)}) = \lambda_{iou} \mathcal{L}_{iou}(b_{i}, \hat{b}_{\sigma(i)}) + \lambda_{L1} \lVert b_{i} - \hat{b}_{\sigma(i)} \rVert_{1}$

wherein $b_{i}$ is the real frame, $\hat{b}_{\sigma(i)}$ is the matched predicted frame, $\lambda_{iou}$ is the weight coefficient of the IOU loss function, and $\lambda_{L1}$ is the weight coefficient of the L1 loss function. The label frame amplification mode pads the number of output defect frames to the maximum value, i.e. the set number of detection frames, so that detection results with different numbers of outputs are compatible while little computing resource is wasted.

At this point, bipartite graph matching is performed by the Hungarian algorithm so that real frame set elements and prediction frame set elements are in one-to-one correspondence, which minimizes the matching loss between the real frames and the prediction results.

First, for each real frame of a category to be detected, the predicted probability of its target category is obtained; the matching cost is then the frame loss minus this predicted category probability, which takes both frame position information and category information into account. After the Hungarian algorithm, the correspondence between target frames and real frames is obtained, and the corresponding loss function can be calculated from this correspondence.
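
The one-to-one matching can be illustrated on a toy example with three slots. Note that this sketch does not implement the Hungarian algorithm: it enumerates all permutations by brute force, which finds the same minimum-cost assignment but is only feasible for very small N. The cost used here is the frame loss minus the predicted category probability, with a plain L1 distance standing in for the frame loss; the function names, the toy data, and the use of `None` for the padded background label are all assumptions for illustration.

```python
from itertools import permutations


def box_l1(box_a, box_b):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(target, pred):
    """Cost of pairing one real frame with one predicted frame."""
    category, box = target
    if category is None:          # padded background slot: contributes no cost
        return 0.0
    probs, pred_box = pred
    return box_l1(box, pred_box) - probs[category]


def best_assignment(targets, preds):
    """Return sigma such that targets[i] is paired with preds[sigma[i]]."""
    n = len(targets)
    best = min(permutations(range(n)),
               key=lambda sigma: sum(match_cost(targets[i], preds[sigma[i]])
                                     for i in range(n)))
    return list(best)


# Two real defects plus one padded background slot.
targets = [(0, [0.2, 0.2, 0.1, 0.1]),
           (1, [0.7, 0.7, 0.2, 0.2]),
           (None, None)]
# Each prediction: (probabilities for [class 0, class 1, background], box).
preds = [([0.10, 0.80, 0.10], [0.68, 0.72, 0.2, 0.2]),
         ([0.05, 0.05, 0.90], [0.50, 0.50, 0.1, 0.1]),
         ([0.90, 0.05, 0.05], [0.21, 0.19, 0.1, 0.1])]

sigma = best_assignment(targets, preds)
print(sigma)  # -> [2, 0, 1]
```

For the 64 slots used in the embodiment, enumerating permutations is impossible in practice; a polynomial-time assignment solver such as SciPy's `linear_sum_assignment` would be the usual choice, taking a 64 x 64 cost matrix built from the same pairwise cost.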

The loss function differs considerably from the matching function: the loss function needs to operate on positive values, i.e. it uses log-probability. For the loss of the $c_{i} = \varnothing$ category, the classification loss value is divided by 10 to reduce its weight and keep positive and negative samples balanced. The target bounding box regression loss is a weighted sum of the IOU loss and the L1 loss, where the IOU loss is less sensitive to frame scale and the L1 loss is more sensitive to frame scale; in practice DETR uses the GIOU loss.
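
The two loss terms can be sketched for a single matched pair as follows. This is a minimal sketch, assuming frames are given as normalized (cx, cy, w, h) with positive width and height; the weight values `lambda_iou=2.0` and `lambda_l1=5.0` are illustrative assumptions, since the patent does not state the coefficients, and the function names are hypothetical.

```python
import math


def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(box_a, box_b):
    """Generalized IoU of two boxes: IoU minus the empty share of their hull."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def box_loss(target, pred, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted sum of the GIoU loss (1 - GIoU) and the L1 loss."""
    l1 = sum(abs(t - p) for t, p in zip(target, pred))
    return lambda_iou * (1.0 - giou(target, pred)) + lambda_l1 * l1


def classification_loss(prob_of_true_class, is_background):
    """Negative log-probability, divided by 10 for the background category."""
    loss = -math.log(prob_of_true_class)
    return loss / 10.0 if is_background else loss


real = [0.50, 0.50, 0.20, 0.20]
predicted = [0.52, 0.49, 0.22, 0.18]
print(box_loss(real, predicted))
print(classification_loss(0.8, is_background=False))
print(classification_loss(0.8, is_background=True))
```

Unlike plain IoU, GIoU stays informative when the two frames do not overlap at all, because the hull term still shrinks as the frames move closer together.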

It should be understood that although described in a certain order, the steps are not necessarily performed sequentially in the order described. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, some steps of the present embodiment may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

In one embodiment, there is also provided in an embodiment of the present invention a codec network-based defect detection system, the system including:

The BEIT module is a training model with the BEIT structure and is mainly completed by two unsupervised models, dVAE and BERT MIM (masked image modeling): the dVAE module encodes image patches into visual Tokens, and MIM predicts the visual Tokens corresponding to the masked part of the picture.

The dVAE module adopts a structure similar to the DALL-E coding structure. It evolves from a residual network structure, retaining the basic structure of the residual network with targeted adjustments: the input layer is a convolutional neural network layer with a 7*7 kernel, and the final layer is a convolutional layer with a kernel of 1, which generates the feature map of the image; finally, a max pooling layer performs downsampling to obtain the visual sign information.

When the method is applied to an industrial defect target detection model after laser welding, the model optimization efficiency can be accelerated to a great extent, and the detection capability has a good effect:

the test pictures are 1000 pictures of welded defect samples, the pictures are randomly divided into 10 groups, 100 pictures in each group are respectively input into a model for testing, the results of all groups are taken for averaging, and the test results are shown in the following table:

In summary, through the combination of BEIT coding and DETR decoding, the defect detection method provided by the invention can improve the running speed of the model while enhancing the detection effect, meeting the industrial need for rapid iterative optimization of algorithms with high detection accuracy and high detection speed.

In one embodiment, the invention also provides a computer device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are in communication with each other through the communication bus.

A memory for storing a computer program;

and the processor is used for executing the defect detection method based on the coding and decoding network when executing the computer program stored in the memory, and the steps in the method embodiment are realized when the processor executes the instructions.

The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented in the figures by only one bold line, which does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the terminal and other devices.

The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The computer device includes a user device and a network device. The user device includes, but is not limited to, a computer, a smart phone, a PDA, etc.; the network device includes, but is not limited to, a single network server, a server group of multiple network servers, or a cloud consisting of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The computer device can implement the invention by running alone, or by accessing a network and interacting with other computer devices in the network. The network where the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.

It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

In one embodiment of the invention there is also provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

Those skilled in the art will appreciate that implementing all or part of the above described embodiment methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the above described embodiment methods. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory.

It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.

Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims

1. The defect detection method based on the coding and decoding network is characterized by comprising the following steps of:

detecting target defects of the serialized data and the corresponding position coding information through an MLP prediction layer so as to output detection results;

performing dimension compression processing of convolution kernel 1x1 convolution on the visual sign information to enable the dimension of the visual sign information to be consistent with the length and width of an input image, so as to obtain serialized data based on the length and width dimension;

the position coding means that position coding information is obtained based on the serialized data, and specifically includes:

2. The codec network-based defect detection method of claim 1, wherein the image patch process further comprises:

encoding the image patch sequence to obtain image characteristic information;

wherein, the image patch is expressed as:

$X \in \mathbb{R}^{H \times W \times C}$;

the image patch sequence is expressed as:

$X^{p} \in \mathbb{R}^{N \times (P^{2} \cdot C)}$;

3. The defect detection method based on the coding and decoding network according to claim 2, wherein encoding the image patch sequence to obtain the image characteristic information means inputting the image patch sequence $X^{p}$ into a linear layer network to obtain the embedding $X^{p}E$ of the image patches; wherein $E \in \mathbb{R}^{(P^{2} \cdot C) \times D}$, P represents the image patch resolution, C represents the number of image channels, and D represents the vector dimension of the embedding; $\mathbb{R}$ represents the set of real numbers, and D, P, C give the size of each dimension of the feature vector.

4. The defect detection method based on the codec network according to claim 2, further comprising performing image mask processing on the image patches before the image patches are serialized; the image mask processing uses blockwise masking: the N image patches with the same size are input for mask processing, and a masked image patch set is output to obtain an image mask sequence.

5. The method for detecting defects based on a codec network as recited in claim 4, wherein the BEIT encoding process uses a BERT model to train the image mask sequence, and the loss function is a cross entropy loss function to output visual sign information.

6. A codec network-based defect detection system, comprising:

the MLP prediction layer is used for detecting target defects of the serialized data and the corresponding position coding information so as to output detection results;

7. A computer device comprising a memory storing a computer program and a processor implementing the steps of the codec network-based defect detection method according to any one of claims 1-5 when the computer program is loaded and executed.

8. A storage medium storing a computer program which, when loaded and executed by a processor, implements the steps of the codec network-based defect detection method according to any one of claims 1-5.