CN117152142A - Bearing defect detection model construction method and system - Google Patents
- Publication number: CN117152142A (application CN202311415402.3A)
- Authority: CN
- China
- Prior art keywords: image, bearing, model, patch image, area
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/0004: Industrial image inspection
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0499: Feedforward networks
- G06N3/08: Neural-network learning methods
- G06T7/62: Analysis of geometric attributes of area, perimeter, diameter or volume
- G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
- G06V10/30: Image preprocessing; noise filtering
- G06V10/44: Local feature extraction (edges, contours, corners; connectivity analysis)
- G06V10/764: Recognition using pattern recognition or machine learning; classification
- G06V10/82: Recognition using pattern recognition or machine learning; neural networks
- G06T2207/20036: Morphological image processing
- G06T2207/20081: Training; learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30108: Industrial image inspection
Abstract
The application discloses a method and a system for constructing a bearing defect detection model, belonging to the technical field of defect detection and comprising the following steps: S1, preprocessing the sample images in a sample image set; S2, training a target detection model with the complete image set; S3, adjusting the parameter weights in the CLIP model, where the adjustment comprises: for each patch image input to the K/Q branches of the cross-attention modules in the image encoder, applying a linear transformation followed by element-wise weighting; and introducing the low-confidence detection results of target detection into the V branch of the cross-attention modules; S4, training the CLIP model with the patch image set and the target-detection confidences to obtain the bearing defect detection model. By combining a target detection model with a CLIP model, the application improves the reliability of results when bearing parts exhibit few-sample defects.
Description
Technical Field
The application belongs to the technical field of defect detection, and particularly relates to a method and a system for constructing a bearing defect detection model.
Background
Bearings are an important component in mechanical equipment. Their main function is to support the rotating mechanical body, reduce the friction coefficient during its movement, and guarantee its rotational accuracy. The manufacturing of bearing parts involves forging, rolling, punching, turning, grinding, heat treatment and other processes.
In conventional bearing defect detection, defect images are first collected and then annotated and used to train a target detection algorithm, so the detection performance depends on the number and diversity of the collected defects. However, as manufacturing technology is continuously optimized and matures, production yields keep improving, which makes defect images increasingly difficult to collect; the end result is that the target detection algorithm performs poorly and defects are frequently missed.
Disclosure of Invention
The application provides a method and a system for constructing a bearing defect detection model to solve the above technical problems in the prior art: it combines a target detection model with a CLIP model to improve the reliability of results when bearing parts exhibit few-sample defects.
The first object of the present application is to provide a method for constructing a bearing defect detection model, comprising:
S1, acquiring a sample image set of a bearing and preprocessing the sample images to obtain a complete image set containing the complete information of the bearing end face and a patch image set containing local information of the bearing end face; each complete image consists of a plurality of patch images;
S2, training a target detection model with the complete image set and obtaining the confidences of target detection;
S3, adjusting the parameter weights in the CLIP model; the image encoder of the CLIP model comprises twelve attention modules, of which the first six are self-attention modules and the last six are cross-attention modules; the adjustment process comprises:
applying a linear transformation to each patch image input to the K/Q branches of the cross-attention modules and then weighting it by element-wise multiplication;
introducing the low-confidence detection results of target detection into the V branch of the cross-attention modules, where low confidence means the range 0.1-0.5;
S4, training the CLIP model with the patch image set and the target-detection confidences to obtain the bearing defect detection model; the training updates only the cross-attention modules.
Preferably, the image preprocessing specifically includes:
S101, first converting the sample image to grayscale; then dividing the gray image into a target region and a background region according to the pixel values of the region to be detected; finally, filling the holes in the annular region with a morphological closing operation;
S102, first retaining only the contour of the bearing end face in the annular region to be detected through contour screening; then refining the detected closed annular contour with a circle-fitting algorithm, and finally cropping the annular region out of the original image;
S103, defect annotation: marking the type and position of each defect on the cropped image.
Preferably, in S3, for each patch image input to the K/Q branches of the cross-attention modules, the weight of the patch image is computed as sin³(rπ/2) + 1, r ≤ 1, where r is the ratio of the annotation-box area falling inside the current patch image to the total annotation-box area.
Preferably, introducing the low-confidence detection results of target detection into the V branch of the cross-attention modules specifically comprises:
applying a linear transformation to each patch image input to the V branch and then multiplying it element-wise by an adjustment value derived from the confidences of the target detection model, the adjustment value being obtained as follows:
first, screening out the low-confidence detection boxes;
then, for each annotation box, computing the fraction of its total area that falls inside the patch image; keeping the annotation boxes whose fraction exceeds 40%, discarding those whose IOU with a high-confidence detection box exceeds 50%, and computing the IOU between each remaining annotation box and the low-confidence detection boxes within the patch image region;
finally, feeding that IOU r into the formula cos(rπ/2) + 1 to obtain the adjustment value.
A second object of the present application is to provide a bearing defect detection model construction system, comprising:
sample module: acquiring a sample image set of a bearing and preprocessing the sample images to obtain a complete image set containing the complete information of the bearing end face and a patch image set containing local information of the bearing end face; each complete image consists of a plurality of patch images;
model preliminary training module: training a target detection model with the complete image set and obtaining the confidences of target detection;
parameter weight adjustment module: adjusting the parameter weights in the CLIP model; the image encoder of the CLIP model comprises twelve attention modules, of which the first six are self-attention modules and the last six are cross-attention modules; the adjustment process comprises:
applying a linear transformation to each patch image input to the K/Q branches of the cross-attention modules and then weighting it by element-wise multiplication;
introducing the low-confidence detection results of target detection into the V branch of the cross-attention modules, where low confidence means the range 0.1-0.5;
model retraining module: training the CLIP model with the patch image set and the target-detection confidences to obtain the bearing defect detection model; the training updates only the cross-attention modules.
Preferably, the image preprocessing specifically includes:
S101, first converting the sample image to grayscale; then dividing the gray image into a target region and a background region according to the pixel values of the region to be detected; finally, filling the holes in the annular region with a morphological closing operation;
S102, first retaining only the contour of the bearing end face in the annular region to be detected through contour screening; then refining the detected closed annular contour with a circle-fitting algorithm, and finally cropping the annular region out of the original image;
S103, defect annotation: marking the type and position of each defect on the cropped image.
Preferably, for each patch image input to the K/Q branches of the cross-attention modules, the weight of the patch image is computed as sin³(rπ/2) + 1, r ≤ 1, where r is the ratio of the annotation-box area falling inside the current patch image to the total annotation-box area.
Preferably, introducing the low-confidence detection results of target detection into the V branch of the cross-attention modules specifically comprises:
applying a linear transformation to each patch image input to the V branch and then multiplying it element-wise by an adjustment value derived from the confidences of the target detection model, the adjustment value being obtained as follows:
first, screening out the low-confidence detection boxes;
then, for each annotation box, computing the fraction of its total area that falls inside the patch image; keeping the annotation boxes whose fraction exceeds 40%, discarding those whose IOU with a high-confidence detection box exceeds 50%, and computing the IOU between each remaining annotation box and the low-confidence detection boxes within the patch image region;
finally, feeding that IOU r into the formula cos(rπ/2) + 1 to obtain the adjustment value.
The third object of the present application is to provide a bearing defect detection model, which is constructed by the method for constructing a bearing defect detection model.
The fourth object of the present application is to provide a method for detecting a bearing defect, which uses the above-mentioned bearing defect detection model to complete the defect detection process.
The application has the advantages and positive effects that:
the image encoder of the CLIP model in the application comprises twelve attention modules, wherein the first six attention modules are self-attention modules, and the last six attention modules are cross-attention modules; during construction, firstly, preprocessing a sample image to obtain a complete image of a bearing and a plurality of patch images, and then processing the patch images by using a cross attention module in an initial CLIP model to obtain KQV branches of the cross attention module; processing the complete image by utilizing target detection to obtain modal information, and adjusting the weight of the V branch model of the cross attention module by utilizing a low confidence detection result in the modal information; and finally, training the cross attention module in the CLIP model by using the patch image again, namely freezing the model weights extracted from the first six self-attention features during training, and training by using only the last six cross attention modules.
The bearing defect detection model constructed by the application builds on the CLIP pre-trained model, introduces new modal information, and uses the CLIP model and the target detection model jointly to judge defects. The key points are the following two, which quickly give the algorithm the ability to distinguish defect features from a small number of samples:
1. The contrastive learning algorithm of the CLIP model is improved: when training on bearing images, the CLIP base model is fine-tuned by introducing the modal information of the target detection results, which quickly equips the base model with feature perception for the new data.
2. Because the CLIP algorithm has stronger feature-discrimination capability than the target detection pre-trained model, the application selectively freezes part of the CLIP pre-trained weights during training, so that the improved CLIP model reaches the same feature-extraction capability as the target detection model with only a small amount of data.
Drawings
FIG. 1 is a flow chart of a method for constructing a bearing defect detection model in a preferred embodiment of the present application;
FIG. 2 is a block diagram of a CLIP model in accordance with a preferred embodiment of the present application;
FIG. 3 is a system block diagram of a bearing defect detection model building system in accordance with a preferred embodiment of the present application;
FIG. 4 is a schematic view of a bearing prior to image preprocessing in a preferred embodiment of the present application;
fig. 5 is a schematic view of a bearing after image preprocessing in a preferred embodiment of the present application.
Detailed Description
For a further understanding of the application, its features and advantages, reference is now made to the following examples, which are illustrated in the accompanying drawings in which:
the following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the technical solutions of the present application, all other embodiments obtained by a person skilled in the art without making any creative effort fall within the protection scope of the present application.
The CLIP model is trained on large-scale text-image pairs: 400 million text-image pairs were collected to train it. After large-scale text-image pre-training, the matching degree of a data pair is judged by encoding the input text and image and computing their cosine similarity. The model can then be transferred directly to image classification tasks without any fine-tuning on labeled data, realizing zero-shot classification.
The CLIP model learns image information by contrastive learning. As shown in fig. 2, taking a rusty bearing image as an example, the text "a photo of rust on a bearing" describes an image of a rusty bearing. The model takes the N image-text pairs of one batch as input. The N images are encoded by the Image Encoder into [I_1, I_2, …, I_N] of dimension (N, d_i), where I_1 is the encoding vector of the first image, I_N that of the N-th image, and each encoding vector has length d_i. Simultaneously, the N texts are encoded by the Text Encoder into [T_1, T_2, …, T_N] of dimension (N, d_t), each encoding vector having length d_t. The data pairs with matching indices, such as I_1 with T_1 or I_2 with T_2, are positive sample pairs; all the others are negative samples, giving N positive samples and N² − N negative samples. The cosine similarity between I_i and T_j is computed; a larger similarity indicates a stronger image-text correlation. The optimization target is to maximize the cosine similarity of the positive pairs (i = j) and minimize that of the negative pairs (i ≠ j), training the weight parameters of the Text Encoder and Image Encoder with a cross-entropy loss. This can be summarized in the following form:
"·" denotes the cosine similarity of the text vector and the image vector.
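The symmetric contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration of cross-entropy over a cosine-similarity matrix, not the patent's implementation; the temperature value is an assumed placeholder.

```python
import numpy as np

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Row-normalise so the dot products below are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    diag = np.arange(logits.shape[0])
    # Positives sit on the diagonal (i == j); the N^2 - N off-diagonal
    # entries are negatives. Cross-entropy is taken along the rows
    # (image -> text) and the columns (text -> image), then averaged.
    loss_img = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_txt = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_img + loss_txt) / 2
```

A batch whose pairs line up on the diagonal yields a much smaller loss than one whose pairings are scrambled, which is exactly the signal that trains the two encoders.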
Self-Attention (Self-Attention) and Cross-Attention (Cross-Attention) are variants of the Attention mechanism for capturing associations between different elements in sequence or multimodal data.
In the field of bearing defect detection, AI-based target detection is the conventional means, but its generalization is limited: it can only detect defect features that closely resemble the trained defects, so missed detections are possible. The application therefore adds a CLIP algorithm model with strong feature-perception capability and adapts it with a small amount of data, freezing all weights in the image encoder other than the MLP for training, so that the ability to distinguish normal from defective bearing data is added on top of the original feature perception. Finally the two algorithms detect the captured bearing images in cascade, reducing missed detections.
Referring to fig. 1, a method for constructing a bearing defect detection model includes:
S1, acquiring a sample image set of a bearing and preprocessing the sample images to obtain a complete image set containing the bearing end-face information and a patch image set containing a plurality of pieces of local end-face information; in this embodiment, the image preprocessing specifically comprises:
S101, first reading a sample image containing a bearing and converting it to grayscale; then dividing the gray image into a target region and a background region according to the pixel values of the region to be detected; finally, filling the holes in the annular region with a morphological closing operation, which removes small noise and separates the annular region more cleanly;
S102, first retaining only the contour of the bearing end face in the annular region to be detected through contour screening; then further refining the detected closed annular contour with a circle-fitting algorithm, and finally cropping the annular region out of the original image;
S103, defect annotation: marking the type and position of each defect on the cropped image;
the cropped image is annotated in the target-detection annotation format, i.e. with defect type and position. The defect types of the bearing are defined as rust, bump, crush and bulge.
Referring to fig. 4 and 5, fig. 4 is the sample image of the bearing before processing and fig. 5 is the annotated bearing image, with four rust annotation boxes in total.
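The preprocessing of S101-S102 can be sketched in pure NumPy. This is an illustrative reconstruction under simplifying assumptions (a fixed global threshold, a square structuring element, and a circle fit taken from the foreground centroid and radii); a production pipeline would typically use a library's thresholding, morphology, contour-finding and circle-fitting routines instead.

```python
import numpy as np

def binary_close(mask, k=3):
    # Morphological closing (dilation then erosion) with a k x k square
    # structuring element; np.roll wraps at the borders, which is
    # acceptable for this sketch since the bearing sits away from them.
    r = k // 2
    def dilate(m):
        out = np.zeros_like(m)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
        return out
    def erode(m):
        return ~dilate(~m)
    return erode(dilate(mask))

def crop_annulus(image, thresh=128):
    # S101: grayscale, split into target/background by pixel value, and
    # fill holes in the annular region with a closing operation.
    gray = image.mean(axis=2) if image.ndim == 3 else np.asarray(image, float)
    mask = binary_close(gray > thresh)
    # S102: fit the annulus from the foreground pixels (centre = centroid,
    # inner/outer radii = min/max distance) and crop it out of the image.
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    dist = np.hypot(ys - cy, xs - cx)
    r_in, r_out = dist.min(), dist.max()
    yy, xx = np.indices(gray.shape)
    d = np.hypot(yy - cy, xx - cx)
    cropped = np.where((d >= r_in) & (d <= r_out), gray, 0.0)
    return cropped, (cy, cx, r_in, r_out)
```

Running this on a synthetic ring image keeps the annular band and blanks the central hole and outer background, mirroring figs. 4 and 5.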
S2, preliminary model training, specifically comprising:
training a target detection model with the complete image set and obtaining the modal information of the detection results, which comprises the category, coordinates and confidence of each detection. Concretely, the annotated complete images are used to train a target detection model; the purpose of this training is to supply the detection-result modal information when training the improved CLIP model.
S3, adjusting the parameter weights in the CLIP model; the image encoder of the CLIP model comprises twelve attention modules, of which the first six are self-attention modules and the last six are cross-attention modules; the adjustment process comprises:
since the area to be detected of the bearing image is in a ring state, the embodiment divides the bearing image into 9 patch images (other numbers can be adopted, such as 4 patch images or 16 patch images) according to the characteristics of the bearing image, and after the middle black irrelevant area is removed, the remaining 8 patch image blocks are used as the original input of KQV three branches of cross attention; the irrelevant area information can be screened out by removing the middle block, so that the training speed is improved.
Each patch image input to the K/Q branches of the cross-attention modules undergoes a linear transformation and is then weighted by element-wise multiplication; patch images of different importance receive different weights. The weight of a patch image is computed as follows:
first, the ratio r of the annotation-box area inside the current patch image to the total annotation-box area is computed; the ratio r (r ≤ 1) is then fed into the designed weight function sin³(rπ/2) + 1 to obtain the weight of the patch image. The function gives patch images with a larger annotation-box ratio more weight than those with a smaller ratio, which is why sin(rπ/2) is raised to the third power. To ensure that the computed gradient does not vanish during training, 1 is added to the sine term, which also guarantees that a patch image with no overlapping annotation box receives a weight of 1.
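The K/Q patch weight can be written as a small function. Note that the exponent placement is a reconstruction from the surrounding description (the sine is cubed so that the weight grows monotonically from 1 at r = 0 to 2 at r = 1); treat it as an interpretation of the patent's formula rather than a verbatim copy.

```python
import math

def kq_patch_weight(r):
    # r: fraction of the annotation box's area that falls inside this
    # patch image, 0 <= r <= 1. Cubing sin(r*pi/2) widens the gap between
    # large- and small-overlap patches; adding 1 keeps the gradient from
    # vanishing and gives weight 1 to patches with no annotation overlap.
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return math.sin(r * math.pi / 2) ** 3 + 1.0
```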
The low-confidence detection results of target detection, with confidence in the range 0.1-0.5, are introduced into the V branch of the cross-attention modules: the complete image is run through the target detection model, and its low-confidence detection results participate in training as information of another modality on the cross-attention V branch. Specifically, each patch image input to the V branch undergoes a linear transformation and is then multiplied element-wise by an adjustment value derived from the target-detection confidences. The adjustment value is obtained as follows: first the low-confidence detection boxes are screened out; then, for each annotation box, the fraction of its total area falling inside the patch image is computed, the annotation boxes whose fraction exceeds 40% are kept, those whose IOU with a high-confidence detection box exceeds 50% are discarded, and the IOU between each remaining annotation box and the low-confidence detection boxes within the patch image region is computed; this IOU r is fed into the formula cos(rπ/2) + 1 for the weight calculation. The function mainly gives a higher weight to patch images that overlap strongly with an annotation box yet are detected only with low confidence, so that the CLIP model learns the features of patch images that target detection finds hard, improving the detection capability for such features.
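The V-branch adjustment can be sketched as follows. The 40% and 50% thresholds follow the description above, while the (x1, y1, x2, y2) box representation and the choice to take the maximum IOU over the surviving box pairs are assumptions made for illustration.

```python
import math

def box_iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def v_adjustment(ann_boxes, low_conf_boxes, high_conf_boxes, patch_box):
    # Keep annotation boxes with > 40% of their area inside the patch and
    # without > 50% IOU against a high-confidence detection, take the IOU
    # against the low-confidence detections, and map it through
    # cos(r*pi/2) + 1.
    def frac_inside(box):
        cx1, cy1 = max(box[0], patch_box[0]), max(box[1], patch_box[1])
        cx2, cy2 = min(box[2], patch_box[2]), min(box[3], patch_box[3])
        inter = max(0.0, cx2 - cx1) * max(0.0, cy2 - cy1)
        return inter / ((box[2] - box[0]) * (box[3] - box[1]))
    kept = [b for b in ann_boxes
            if frac_inside(b) > 0.4
            and all(box_iou(b, h) <= 0.5 for h in high_conf_boxes)]
    r = max((box_iou(b, l) for b in kept for l in low_conf_boxes), default=0.0)
    return math.cos(r * math.pi / 2) + 1.0
```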
S4, training the CLIP model with the patch image set and the target-detection confidences to obtain the bearing defect detection model; the training uses only the cross-attention modules. That is, the model weights of the first six self-attention feature-extraction modules are frozen during training, and only the last six modules, improved into cross-attention modules, are trained. This both preserves the feature-extraction capability of the original pre-trained model and fits the new data.
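The partial freezing in S4 can be sketched framework-agnostically; in a framework such as PyTorch this would correspond to setting `requires_grad = False` on the parameters of the first six blocks. The dictionary-based module representation here is a hypothetical stand-in.

```python
def freeze_self_attention(blocks, n_frozen=6):
    # blocks: the 12 attention modules of the image encoder, in order.
    # The first n_frozen (self-attention) blocks keep their pre-trained
    # weights; only the later (cross-attention) blocks remain trainable.
    for i, block in enumerate(blocks):
        block["trainable"] = i >= n_frozen
    return blocks
```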
Based on the above embodiment, the method may further include S5: defining labels for the patch images used in contrastive training.
If the area of an annotation box inside a patch image exceeds 10% of the box's total area, the box's category is assigned to that patch image as a label. The patch image labels follow the naming convention "a photo of (rust) on a bearing", with the bracketed slot filled by the bearing defect name: rust, bump, crush, or swell; "normal" denotes a defect-free image. Multi-label patch images merge their labels into one, e.g. "a photo of (rust and bump) on a bearing".
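The label-naming convention can be sketched as a small helper; the function name and the empty-list handling for defect-free patches are assumptions for illustration, and the defect class list follows the categories defined earlier in the description:

```python
DEFECT_CLASSES = ["rust", "bump", "crush", "swell"]

def patch_prompt(labels):
    """Build the CLIP text label for a patch image following the
    'a photo of (...) on a bearing' template; multi-label patches merge
    their class names with 'and', and an empty list means 'normal'."""
    inner = " and ".join(labels) if labels else "normal"
    return f"a photo of ({inner}) on a bearing"
```

For example, a patch annotated with both rust and bump yields "a photo of (rust and bump) on a bearing".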
Referring to fig. 3, a system for constructing a bearing defect detection model includes:
Sample module: acquires a sample image set of the bearing and performs image preprocessing on the sample images to obtain a complete image set containing the bearing end-face information and a patch image set containing a plurality of patch images with local bearing end-face information. The preprocessing specifically comprises the following steps:
S101: first, a sample image containing the bearing is read and converted to grayscale to obtain a gray image; the gray image is then divided into a target area and a background area according to the pixel values of the area to be detected; finally, voids in the annular region are filled using a morphological closing operation, which removes small noise and separates the annular region more cleanly.
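The closing operation in S101 (dilation followed by erosion) can be illustrated with a small NumPy sketch; in a production pipeline this would typically be a library call such as OpenCV's `cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)`, and the 3x3 structuring element here is an assumption:

```python
import numpy as np

def binary_close(mask):
    """Morphological closing (dilation followed by erosion) with a 3x3
    square structuring element, written out with padded shifted views."""
    h, w = mask.shape

    def dilate(m):
        p = np.pad(m, 1, constant_values=0)
        return np.max(np.stack([p[i:i + h, j:j + w]
                                for i in range(3) for j in range(3)]), axis=0)

    def erode(m):
        p = np.pad(m, 1, constant_values=1)
        return np.min(np.stack([p[i:i + h, j:j + w]
                                for i in range(3) for j in range(3)]), axis=0)

    return erode(dilate(mask))

# toy binary region with a one-pixel void that closing fills in
region = np.ones((5, 5), dtype=np.uint8)
region[2, 2] = 0
closed = binary_close(region)
```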
S102: first, contour screening retains only the bearing end-face contour within the annular area to be detected; the detected closed annular contour is then further refined with a circle-fitting algorithm, and finally the annular region is cropped out of the original image.
S103: defect annotation: the defect categories and positions are marked on the cropped image in the target-detection annotation format. The defect categories of the bearing are defined as rust, bump, crush, and swell.
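The circle-fitting refinement in S102 can be sketched with an algebraic (Kasa-style) least-squares fit; the patent does not specify which fitting algorithm it uses, so this is only one plausible choice for illustration:

```python
import numpy as np

def fit_circle(xs, ys):
    """Algebraic least-squares circle fit: solve x^2 + y^2 = 2a*x + 2b*y + c
    for the center (a, b) and radius sqrt(c + a^2 + b^2)."""
    A = np.column_stack([2 * xs, 2 * ys, np.ones_like(xs)])
    rhs = xs ** 2 + ys ** 2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return a, b, np.sqrt(c + a ** 2 + b ** 2)

# points sampled from a circle of center (3, -1) and radius 2
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
cx, cy, r = fit_circle(3 + 2 * np.cos(t), -1 + 2 * np.sin(t))
```

Fitting the detected contour points this way smooths out small contour irregularities before the annular region is cropped.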
Model preliminary training module: performs the preliminary training, which specifically comprises: training a target detection model with the complete image set and obtaining the detection-result modality information of target detection, namely the category, coordinates, and confidence of each detection. Concretely, the annotated complete images are used to train the target detection model; the purpose of this training is to provide the detection-result modality information that is introduced when training the improved CLIP model.
The CLIP model is then preliminarily trained with the patch image set. The image encoder of the CLIP model comprises twelve attention modules, of which the first six are self-attention modules and the last six are cross-attention modules; during training, only the cross-attention modules are trained.
parameter weight adjustment module: adjusting the parameter weight in the CLIP model; the image encoder of the CLIP model comprises twelve attention modules, wherein the first six attention modules are self-attention modules, and the last six attention modules are cross-attention modules; the adjustment process comprises the following steps:
since the area to be detected of the bearing image is in a ring state, the embodiment divides the bearing image into 9 patch images (other numbers can be adopted, such as 4 patch images or 16 patch images) according to the characteristics of the bearing image, and after the middle black irrelevant area is removed, the remaining 8 patch image blocks are used as the original input of KQV three branches of cross attention; the irrelevant area information can be screened out by removing the middle block, so that the training speed is improved.
Each patch image input to the KQ branches of the cross-attention module is linearly transformed and then multiplied by a weight, so that patch images of different importance receive different weights. The weight of a patch image is calculated as follows: first, the ratio r of the annotation-box area lying inside the current patch image to the total annotation-box area is computed; r (with r ≤ 1) is then substituted into the designed weight function sin(rπ/2)³ + 1 to obtain the weight of the patch image. Cubing the sine term widens the gap between weights, so patch images containing a larger share of an annotation box receive markedly more weight than those containing a smaller share. Adding 1 to the sine term keeps the computed gradient from vanishing during training and guarantees that a patch image with no overlapping annotation box still receives a weight of 1.
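The KQ-branch weight function described above is straightforward to express directly; the function name is illustrative:

```python
import math

def kq_patch_weight(r):
    """Patch weight for the KQ branches: sin(r*pi/2)**3 + 1 with r <= 1,
    where r is the fraction of the annotation-box area lying in the patch."""
    return math.sin(r * math.pi / 2) ** 3 + 1.0
```

A patch with no overlapping annotation box (r = 0) gets weight 1, a patch containing an entire box (r = 1) gets weight 2, and cubing the sine depresses the weights of intermediate ratios relative to a plain sine.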
A low-confidence detection result of target detection (confidence in the range 0.1 to 0.5) is introduced into the V branch of the cross-attention module: the complete image is inferred using the target detection model, and the low-confidence detections participate in training as the other-modality information of the cross-attention V branch. Specifically, after each patch image input to the V branch undergoes linear transformation, it is multiplied by a confidence adjustment value derived from the target detection model. The adjustment value is obtained as follows: first, the low-confidence detection boxes are screened out; next, for each annotation box overlapping the patch image, the ratio of the box area inside the patch image to the box's total area is calculated, and only annotation boxes with a ratio above 40% are retained; annotation boxes whose IOU with a high-confidence detection box exceeds 50% are then removed; finally, the IOU r between the remaining annotation boxes and the low-confidence detection boxes within the patch image area is calculated and substituted into the weight formula cos(rπ/2) + 1. This function is intended to give greater weight to patch images that overlap strongly with annotation boxes yet were detected only with low confidence, so that the CLIP model can learn the features in patch images that are difficult for target detection, improving the detection capability for such features;
Model retraining module: trains the CLIP model using the patch image set and the target-detection confidence to obtain the bearing defect detection model, where only the cross-attention modules are trained; that is, the weights of the first six self-attention feature-extraction modules are frozen during training, and only the last six modules, converted into cross-attention modules, are updated. This operation both preserves the feature-extraction capability of the original pre-trained model and fits the new data.
On the basis of the above embodiment, the system may further include a label definition module that defines labels for the patch images used in contrastive training.
If the area of an annotation box inside a patch image exceeds 10% of the box's total area, the box's category is assigned to that patch image as a label. The patch image labels follow the naming convention "a photo of (rust) on a bearing", with the bracketed slot filled by the bearing defect name: rust, bump, crush, or swell; "normal" denotes a defect-free image. Multi-label patch images merge their labels into one, e.g. "a photo of (rust and bump) on a bearing".
A bearing defect detection model is constructed by the method for constructing the bearing defect detection model.
The bearing defect detection method utilizes the bearing defect detection model to finish the defect detection process; the specific detection process comprises the following steps:
The bearing image to be detected is input into both the CLIP model and the target detection model; the two models process it independently, and a logical OR operation is applied to the defect information in their results. That is, as long as either model detects defect information, the bearing is considered defective and the defect information is output.
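The logical-OR fusion of the two models' outputs can be sketched as follows; representing each model's output as a list of detected defect labels is an assumption for illustration:

```python
def fuse_results(clip_defects, det_defects):
    """Logical-OR fusion of the two models' outputs: the bearing is flagged
    defective if either model reports any defect, and the union of the
    defect labels is returned for output."""
    merged = sorted(set(clip_defects) | set(det_defects))
    return len(merged) > 0, merged
```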
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)), etc.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the application in any way, but any simple modification, equivalent variation and modification of the above embodiments according to the technical principles of the present application are within the scope of the technical solutions of the present application.
Claims (10)
1. The method for constructing the bearing defect detection model is characterized by comprising the following steps of:
s1, acquiring a sample image set of a bearing, and carrying out image preprocessing on sample images in the sample image set to obtain a complete image set containing complete information of the end face of the bearing and a patch image set containing local information of the end face of the bearing; each complete image consists of a plurality of patch images;
s2, training a target detection model by using the complete image set; obtaining the confidence coefficient of target detection;
s3, adjusting the parameter weight in the CLIP model; the image encoder of the CLIP model comprises twelve attention modules, wherein the first six attention modules are self-attention modules, and the last six attention modules are cross-attention modules; the adjustment process comprises the following steps:
for each patch image input by the KQ branch of the cross attention module in the image encoder module, performing a linear transformation and then multiplying by a weight;
introducing a low confidence detection result of target detection into a V branch of the cross attention module, wherein the range of the low confidence is 0.1-0.5;
s4, training the CLIP model by using the patch image set and the confidence coefficient of target detection to obtain a bearing defect detection model; wherein: the training is performed by using the cross attention module only.
2. The method for constructing a bearing defect detection model according to claim 1, wherein the image preprocessing specifically comprises:
S101, firstly graying the sample image to obtain a gray image; then dividing the gray image into a target area and a background area according to the pixel values of the area to be detected; finally, filling the voids in the annular region using a morphological closing operation;
S102, firstly retaining the bearing end-face contour in the annular area to be detected through contour screening; then refining the detected closed annular contour using a circle-fitting algorithm, and finally cropping the annular region from the original image;
S103, defect annotation: marking the defect categories and positions on the cropped image.
3. The method of claim 1, wherein in S3, for each patch image input by the KQ branch of the cross-attention module, the expression for calculating the weight of the patch image is: sin(rπ/2)³ + 1, r ≤ 1; r represents the ratio of the annotation-box area inside the current patch image to the total annotation-box area.
4. The method for constructing a bearing defect detection model according to claim 1, wherein the low confidence detection result of the target detection is introduced into the V branch of the cross attention module, specifically comprising:
performing a linear transformation on each patch image input by the V branch, and then multiplying by the confidence adjustment value derived from the target detection model.
5. The method of claim 4, wherein the confidence adjustment value is obtained by:
screening out the low-confidence detection boxes;
then, for each annotation box in the patch image, calculating the ratio of the box area inside the patch image to the box's total area, retaining the annotation boxes with a ratio above 40%, removing the annotation boxes whose IOU with a high-confidence detection box exceeds 50%, and calculating the IOU between the remaining annotation boxes and the low-confidence detection boxes within the patch image area;
finally, substituting the IOU into the following formula for weight calculation: cos(rπ/2) + 1, thereby obtaining the adjustment value; r represents the IOU calculated above.
6. A bearing defect detection model construction system, comprising:
sample module: acquiring a sample image set of a bearing, and carrying out image preprocessing on sample images in the sample image set to obtain a complete image set containing complete information of the end face of the bearing and a patch image set containing local information of the end face of the bearing; each complete image consists of a plurality of patch images;
a model preliminary training module; training a target detection model by using the complete image set; obtaining the confidence coefficient of target detection;
parameter weight adjustment module: adjusting the parameter weight in the CLIP model; the image encoder of the CLIP model comprises twelve attention modules, wherein the first six attention modules are self-attention modules, and the last six attention modules are cross-attention modules; the adjustment process comprises the following steps:
for each patch image input by the KQ branch of the cross attention module, performing a linear transformation and then multiplying by a weight;
introducing a low confidence detection result of target detection into a V branch of the cross attention module, wherein the range of the low confidence is 0.1-0.5;
model retraining module: training the CLIP model by using the patch image set and the confidence coefficient of target detection to obtain a bearing defect detection model; wherein: the training is performed by using the cross attention module only.
7. The bearing defect detection model construction system of claim 6, wherein the image preprocessing specifically comprises:
S101, firstly graying the sample image to obtain a gray image; then dividing the gray image into a target area and a background area according to the pixel values of the area to be detected; finally, filling the voids in the annular region using a morphological closing operation;
S102, firstly retaining the bearing end-face contour in the annular area to be detected through contour screening; then refining the detected closed annular contour using a circle-fitting algorithm, and finally cropping the annular region from the original image;
S103, defect annotation: marking the defect categories and positions on the cropped image.
8. The bearing defect detection model construction system according to claim 6, wherein, for each patch image input by the KQ branch of the cross-attention module, the expression for calculating the weight of the patch image is: sin(rπ/2)³ + 1, r ≤ 1; r represents the ratio of the annotation-box area inside the current patch image to the total annotation-box area.
9. The bearing defect detection model construction system of claim 6, wherein introducing low confidence detection results of target detection at the V-branch of the cross-attention module comprises:
performing a linear transformation on each patch image input by the V branch, and then multiplying by the confidence adjustment value derived from the target detection model, wherein the adjustment value is obtained through the following steps:
screening out the low-confidence detection boxes;
then, for each annotation box in the patch image, calculating the ratio of the box area inside the patch image to the box's total area, retaining the annotation boxes with a ratio above 40%, removing the annotation boxes whose IOU with a high-confidence detection box exceeds 50%, and calculating the IOU between the remaining annotation boxes and the low-confidence detection boxes within the patch image area;
finally, substituting the IOU into the following formula for weight calculation: cos(rπ/2) + 1, thereby obtaining the adjustment value; r represents the IOU calculated above.
10. A bearing defect detection model constructed by the method for constructing a bearing defect detection model according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311415402.3A CN117152142B (en) | 2023-10-30 | 2023-10-30 | Bearing defect detection model construction method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311415402.3A CN117152142B (en) | 2023-10-30 | 2023-10-30 | Bearing defect detection model construction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117152142A true CN117152142A (en) | 2023-12-01 |
CN117152142B CN117152142B (en) | 2024-02-02 |
Family
ID=88884756
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311415402.3A Active CN117152142B (en) | 2023-10-30 | 2023-10-30 | Bearing defect detection model construction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117152142B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210012150A1 (en) * | 2019-07-11 | 2021-01-14 | Xidian University | Bidirectional attention-based image-text cross-modal retrieval method |
CN113160192A (en) * | 2021-04-28 | 2021-07-23 | 北京科技大学 | Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background |
CN113220919A (en) * | 2021-05-17 | 2021-08-06 | 河海大学 | Dam defect image text cross-modal retrieval method and model |
CN113902926A (en) * | 2021-12-06 | 2022-01-07 | 之江实验室 | General image target detection method and device based on self-attention mechanism |
CN114283430A (en) * | 2021-12-03 | 2022-04-05 | 苏州大创科技有限公司 | Cross-modal image-text matching training method and device, storage medium and electronic equipment |
CN115937091A (en) * | 2022-10-24 | 2023-04-07 | 合肥中科融道智能科技有限公司 | Transformer substation equipment defect image detection method based on changeable patch |
CN116383671A (en) * | 2023-03-27 | 2023-07-04 | 武汉大学 | Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment |
CN116386081A (en) * | 2023-03-01 | 2023-07-04 | 西北工业大学 | Pedestrian detection method and system based on multi-mode images |
CN116630608A (en) * | 2023-05-29 | 2023-08-22 | 广东工业大学 | Multi-mode target detection method for complex scene |
US20230306732A1 (en) * | 2022-03-25 | 2023-09-28 | Facedapter Sàrl | Heterogenous Face Recognition System and Method |
- 2023-10-30 CN CN202311415402.3A patent/CN117152142B/en active Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210012150A1 (en) * | 2019-07-11 | 2021-01-14 | Xidian University | Bidirectional attention-based image-text cross-modal retrieval method |
CN113160192A (en) * | 2021-04-28 | 2021-07-23 | 北京科技大学 | Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background |
CN113220919A (en) * | 2021-05-17 | 2021-08-06 | 河海大学 | Dam defect image text cross-modal retrieval method and model |
CN114283430A (en) * | 2021-12-03 | 2022-04-05 | 苏州大创科技有限公司 | Cross-modal image-text matching training method and device, storage medium and electronic equipment |
CN113902926A (en) * | 2021-12-06 | 2022-01-07 | 之江实验室 | General image target detection method and device based on self-attention mechanism |
US20230306732A1 (en) * | 2022-03-25 | 2023-09-28 | Facedapter Sàrl | Heterogenous Face Recognition System and Method |
CN115937091A (en) * | 2022-10-24 | 2023-04-07 | 合肥中科融道智能科技有限公司 | Transformer substation equipment defect image detection method based on changeable patch |
CN116386081A (en) * | 2023-03-01 | 2023-07-04 | 西北工业大学 | Pedestrian detection method and system based on multi-mode images |
CN116383671A (en) * | 2023-03-27 | 2023-07-04 | 武汉大学 | Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment |
CN116630608A (en) * | 2023-05-29 | 2023-08-22 | 广东工业大学 | Multi-mode target detection method for complex scene |
Non-Patent Citations (1)
Title |
---|
YAXIONG WANG, ET AL: "PFAN++: Bi-Directional Image-Text Retrieval With Position Focused Attention Network", IEEE Transactions on Multimedia * |
Also Published As
Publication number | Publication date |
---|---|
CN117152142B (en) | 2024-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108228915B (en) | Video retrieval method based on deep learning | |
CN107133569B (en) | Monitoring video multi-granularity labeling method based on generalized multi-label learning | |
CN113436169B (en) | Industrial equipment surface crack detection method and system based on semi-supervised semantic segmentation | |
CN102385592B (en) | Image concept detection method and device | |
CN111104555B (en) | Video hash retrieval method based on attention mechanism | |
CN113888550B (en) | Remote sensing image road segmentation method combining super-resolution and attention mechanism | |
CN110648310A (en) | Weak supervision casting defect identification method based on attention mechanism | |
CN115937655B (en) | Multi-order feature interaction target detection model, construction method, device and application thereof | |
CN112766218B (en) | Cross-domain pedestrian re-recognition method and device based on asymmetric combined teaching network | |
CN116030396B (en) | Accurate segmentation method for video structured extraction | |
CN114463340B (en) | Agile remote sensing image semantic segmentation method guided by edge information | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
Li et al. | Efficient detection in aerial images for resource-limited satellites | |
CN111523586A (en) | Noise-aware-based full-network supervision target detection method | |
CN114529894A (en) | Rapid scene text detection method fusing hole convolution | |
CN113657473A (en) | Web service classification method based on transfer learning | |
CN110738129B (en) | End-to-end video time sequence behavior detection method based on R-C3D network | |
CN112418229A (en) | Unmanned ship marine scene image real-time segmentation method based on deep learning | |
CN117132910A (en) | Vehicle detection method and device for unmanned aerial vehicle and storage medium | |
CN117152142B (en) | Bearing defect detection model construction method and system | |
CN116563844A (en) | Cherry tomato maturity detection method, device, equipment and storage medium | |
CN114861718B (en) | Bearing fault diagnosis method and system based on improved depth residual error algorithm | |
CN117011219A (en) | Method, apparatus, device, storage medium and program product for detecting quality of article | |
CN112487927B (en) | Method and system for realizing indoor scene recognition based on object associated attention | |
CN112926670A (en) | Garbage classification system and method based on transfer learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |