CN115829995A - Cloth flaw detection method and system based on pixel-level multi-scale feature fusion - Google Patents

Cloth flaw detection method and system based on pixel-level multi-scale feature fusion Download PDF

Info

Publication number
CN115829995A
CN115829995A (application CN202211642227.7A)
Authority
CN
China
Prior art keywords
information
network
frame
cloth
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211642227.7A
Other languages
Chinese (zh)
Inventor
方梦园
叶苏雨
鲁涵统
徐伟强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202211642227.7A priority Critical patent/CN115829995A/en
Publication of CN115829995A publication Critical patent/CN115829995A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for detecting cloth flaws based on pixel-level multi-scale feature fusion, wherein the method comprises the following steps: S1, acquiring a flaw data set; S2, dividing the data set into a training set, a verification set and a test set; S3, mapping and transforming the label information of the detection target; S4, inputting the picture into the backbone-network Swin Transformer; S5, inputting the obtained features into the AG-FPN; S6, inputting the features into an AutoAssign detection head; S7, calculating the value of the overall loss function of the network, and updating the model parameters; S8, predicting the verification set by using the network to obtain the detection effect of the current network; S9, repeating S4 to S8, iterating for a plurality of times; judging whether the training has converged; if yes, saving the current model network parameters; if not, continuing to repeat S3 to S7, iterating for a plurality of times; Z1, scaling the detection picture; Z2, inputting the picture into the network obtained in S8 to obtain the prediction frame information of the target object; and Z3, mapping the prediction frame onto the original image to obtain the flaws of the original image.

Description

Cloth flaw detection method and system based on pixel-level multi-scale feature fusion
Technical Field
The invention belongs to the technical field of textile cloth flaw detection, and particularly relates to a cloth flaw detection method and system based on pixel-level multi-scale feature fusion.
Background
The textile industry has always occupied an important position in China's national economic development. In the production of fabric cloth, defects such as stains and holes are easily produced under the influence of various factors. Therefore, to ensure product quality, professional quality inspectors are required to inspect the cloth. However, manual inspection is strongly affected by subjective factors and lacks consistency, and inspectors' eyesight deteriorates after long working hours, so defects are easily missed.
In recent years, with the rapid development of computer vision, defect detection based on deep learning has been widely applied in industrial quality inspection scenarios. Compared with other industrial quality inspection scenarios, cloth defects are of many types, have complex and highly variable size and position distributions, and different defect features often interfere with each other, so common detection algorithms perform poorly. In addition, cloth defects suffer from low visibility, and the background pattern of the cloth easily interferes with the defect features. These problems make cloth defect detection difficult.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a method and a system for detecting cloth flaws based on pixel-level multi-scale feature fusion.
Therefore, the invention adopts the following technical scheme:
a cloth flaw detection method based on pixel-level multi-scale feature fusion comprises a model training stage and a flaw detection stage;
the model training phase comprises the following steps:
step S1, data acquisition: photographing the cloth to obtain initial cloth images, wherein the size of the cloth images is consistent; marking and screening the flaws on the cloth images, only keeping the images with flaws, and finally obtaining a required cloth flaw data set;
step S2, data set division: dividing a cloth flaw data set into a training set, a verification set and a test set according to a preset proportion;
step S3, data preprocessing: randomly turning and randomly scaling the pictures input into the network according to the set probability and the scaling proportion, and simultaneously carrying out corresponding mapping transformation on the label information of the detection target;
step S4, feature extraction: inputting the preprocessed pictures into a backbone network to output corresponding characteristics with different scales, wherein the backbone network is a Swin Transformer;
Step S5, pixel-level feature fusion: the obtained features of different scales are input into the AG-FPN. The AG-FPN obtains from the deep features the relative importance of each pixel in the shallow features through an AGM module, so as to distinguish whether the pixel belongs to background information that should be erased or to target feature semantics that should be enhanced, and finally outputs N pieces of feature information of different scales;
Step S6, flaw detection: inputting the fused features into a detection head to generate corresponding prediction frame information, wherein the detection head is AutoAssign;
step S7, updating detector parameters: calculating the value of the overall loss function of the network by combining the information of the prediction frame and the information of the real frame (GT frame) manually marked in the step S1, and updating the model parameters by using AdamW;
step S8, checking the detector effect: predicting the verification set by using the network with the updated parameters to obtain the detection effect of the current network;
Step S9, repeating steps S4 to S8 for I₁ iterations; judging whether the training has converged according to the loss function curve; if the model has converged, saving the current model network parameters; if not, continuing to repeat steps S3 to S8 for I₂ iterations;
the flaw detection stage specifically comprises the following steps:
step Z1, data preprocessing: fixedly zooming the detected picture according to a set zooming ratio;
step Z2, obtaining prediction information: inputting the preprocessed picture into a model network stored in the step S9 of the model training stage to obtain the prediction frame information of the target object;
step Z3, predicting frame information regression: and mapping the prediction frame to the original image according to the obtained prediction frame information and the set size range to obtain position information and category information of the flaws on the original image, and finishing cloth flaw detection.
The cloth flaw detection method adopts the Swin Transformer as the backbone network of the detector for feature extraction and inputs the extracted features into the AG-FPN for pixel-level feature fusion. The AG-FPN is designed specifically to solve the problems of mutual interference between defect features and of complex background interference in the cloth defect detection task by introducing a pixel-level adaptive fusion module (AGM) into the FPN structure; the specific structure is shown in FIG. 3, and the additional computation cost and parameter burden of the AGM on the detection model are negligible.
While adopting the above technical scheme, the invention can also adopt or combine the following further technical scheme:
as a preferred embodiment of the present invention: in the step S1, the size of the acquired image is 1024 × 2048, the acquired dataset is manually labeled by using a LabelMe tool box, and the labeled GT box information is stored in a COCO format, that is, the GT box is labeled as (x, y, w, h, c), x and y respectively represent the x coordinate and the y coordinate of the top left corner vertex of the target, w and h respectively represent the length and the width of the GT box, and c represents the category of the object contained in the GT box.
As a preferred embodiment of the present invention: in step S2, the proportion of the training set, the validation set, and the test set is 8.
As a preferred embodiment of the present invention: in step S3, the set probability of random flipping is 0.5, the set size of random scaling is [480 × 1333,512 × 1333,640 × 1333], and random scaling is performed to scale each picture to a random size in the size list.
As a preferred embodiment of the present invention: in step S4, the initialization parameters of the trunk network Swin transform are pre-trained using ImageNet data set, and the Swin transform is preferred to the Swin-Tiny structure in view of the computational cost and detection rate of the detector.
As a preferred embodiment of the present invention: in step S5, N is set to be 4, and background information in shallow features is erased by means of deep feature semantics, so that feature information of the target is highlighted.
As a preferred embodiment of the present invention: in the step S6, the detection head of the network model is preferably an automation design, which is a detection head of Anchor-free (i.e. no Anchor frame needs to be preset), and is less affected by prior knowledge such as a preset frame compared with a classical YOLO detection head (which needs to be preset with an Anchor frame); the pre-selection frame information comprises prediction point type information, prediction point regression position information and prediction point information, wherein the prediction point regression position information is (l, t, r, b) and respectively represents the distance between a prediction point and the left side, the distance between the prediction point and the upper side, the distance between the prediction point and the right side and the distance between the prediction point and the lower side of the frame.
As a preferred embodiment of the present invention: step S7 includes the steps of:
s7.1, regression of prediction frame information: mapping the predicted frame information to a picture after data preprocessing according to the set size step length to obtain preliminary candidate frame information; then, converting the preliminary candidate frame information according to the corresponding scaling size and the overturning condition in the preprocessing process to obtain final candidate frame information;
s7.2, calculating a loss function: calculating classification loss, regression loss and central point prior weight according to the candidate frame information and the corresponding GT frame, calculating corresponding positive sample weight, positive sample confidence coefficient, negative sample weight and negative sample confidence coefficient according to the classification loss, the regression loss and the central point prior weight, and finally calculating the overall network loss by using focal loss;
s7.3, setting an optimizer: adamW was used as an optimizer to help the network update parameters while setting the learning rate to 0.00005 and the weight decay term to 0.05.
As a preferred embodiment of the present invention: in step S8, the number of iterations I 1 、I 2 Preferably 36, 14, respectively.
As a preferred embodiment of the present invention: in step Z1, the scaling ratio of the image is preferably 0.5.
As a preferred embodiment of the present invention: in step Z3, obtaining a final prediction box from the prediction box information includes the following steps:
z3.1, prediction frame information regression: mapping the predicted frame information to a picture after data preprocessing according to the set size step length to obtain preliminary candidate frame information; then, converting the preliminary candidate frame information according to the corresponding scaling ratio in the preprocessing process to obtain the predicted frame information in the original image;
z3.2, screening a prediction box: sequencing all preselected frames in the same picture according to the confidence; and removing the redundant frame according to a non-maximum suppression algorithm to obtain a final prediction frame.
According to the purpose of the invention, the invention provides a cloth flaw detection method and a cloth flaw detection system based on pixel-level multi-scale feature fusion, which adopt the network model structure and the model training method for training and detecting.
The invention also discloses a system of the cloth flaw detection method based on the pixel-level multi-scale feature fusion, which comprises a model training module and a flaw detection module;
the model training module specifically comprises the following sub-modules:
a data acquisition submodule: photographing the cloth to obtain an initial cloth image; marking flaws on the cloth images, and only keeping the images with flaws to obtain a cloth flaw data set;
a data set partitioning submodule: dividing a cloth flaw data set into a training set, a verification set and a test set according to a preset proportion;
a data preprocessing submodule: randomly flipping and randomly scaling the pictures input into the backbone-network Swin Transformer according to a set probability and scaling sizes, and simultaneously carrying out the corresponding mapping transformation on the labeling information of the detection targets;
a feature extraction submodule: inputting the preprocessed pictures into a Swin Transformer of a backbone network to output corresponding characteristics with different scales;
a pixel-level based feature fusion submodule: inputting the obtained features of different scales into the AG-FPN for feature fusion, and outputting N pieces of feature information of different scales;
and a flaw detection submodule: inputting the fused features into an AutoAssign detection head to generate corresponding prediction frame information;
a detector parameter updating submodule: calculating the value of the overall loss function of the network by combining the information of the prediction frame and the information of a real frame, namely a GT frame, marked in the data acquisition sub-module, and updating the model parameters by using AdamW;
view detector effect submodule: predicting the verification set by using the network with the updated parameters to obtain the detection effect of the current network;
an iteration submodule: judging whether the training is converged according to the loss function curve; if the model is converged, saving the current model network parameters;
the flaw detection module specifically comprises the following sub-modules:
a data preprocessing submodule: fixedly zooming the detected picture according to a set zooming ratio;
the obtain prediction information submodule: inputting the preprocessed picture into a network stored in a checking detector effect sub-module in a model training stage to obtain the prediction frame information of a target object;
a prediction box information regression submodule: and mapping the prediction frame on the original image according to the obtained prediction frame information and the set size range to obtain the position information and the type information of the flaws on the original image.
The method and the system for detecting the cloth defects based on the pixel-level multi-scale feature fusion have the following advantages:
(1) The FPN structure is improved, and the features extracted from the main network are filtered and enhanced by introducing a pixel-level-based self-adaptive fusion module, so that the mutual interference among different defect features and the interference of background noise on the defect features are effectively reduced;
(2) The intrinsic characteristics of the Swin Transformer are utilized, and the feature loss of large-size defects and long, narrow defects is effectively reduced without introducing additional parameters or computation cost;
(3) According to the invention, AutoAssign is used to replace a traditional Anchor-based detection head (which requires preset anchor frames), so that the dependence of the network on prior knowledge such as preset frames is reduced, and the detection capability for defects of extreme sizes is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for detecting defects in a piece of cloth based on pixel-level multi-scale feature fusion according to a preferred embodiment of the present invention;
fig. 2 is a network architecture diagram of the method of the present invention.
FIG. 3 is a block diagram of a pixel level adaptive feature fusion Module (AGM) in the method of the present invention.
FIG. 4 is a graph illustrating the detection of a sample of a cloth image according to the present invention;
FIG. 5 is a block diagram of a cloth defect detection system based on pixel-level multi-scale feature fusion according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict.
The embodiment provides a method and a system for detecting cloth defects based on pixel-level multi-scale feature fusion; refer to FIG. 1 for a schematic flow chart of the method. Images are processed according to this flow chart in order to explain in detail how the method improves the detection precision of cloth flaws.
As shown in fig. 1, the method for detecting defects in cloth based on pixel-level multi-scale feature fusion of the present embodiment includes a model training stage and a defect detection stage;
specifically, the model training phase is performed according to the following steps:
s1, shooting cloth on a cloth production line by using a camera to obtain an initial cloth image, wherein the size of the image is 1024 multiplied by 2048 resolution; and manually marking and screening the acquired cloth images by using a LabelMe tool box, only keeping the images with the defects, and finally obtaining the required cloth defect data set. The label is stored in a COCO format, namely, a GT box is labeled as (x, y, w, h, c), x and y respectively represent the x coordinate and the y coordinate of the top left corner vertex of the target, w and h respectively represent the length and the width of the GT box, and c represents the category of the object contained in the GT box.
And S2, dividing the obtained data set according to a preset proportion to respectively obtain a training set, a verification set and a test set. Specifically, the preset ratio is 8.
And S3, carrying out data preprocessing on the images to be input into the network: randomly flipping and randomly scaling them according to the set probability and scaling sizes, and simultaneously carrying out the corresponding mapping transformation on the label information of the detection targets.
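A minimal preprocessing sketch is given below; it assumes the flip probability of 0.5 and the scale list [480 × 1333, 512 × 1333, 640 × 1333] from the preferred embodiment, and it interprets each entry as a short-side target with a long-side cap, which is an assumption about how those sizes are applied.

    import random
    import cv2
    import numpy as np

    SCALES = [(480, 1333), (512, 1333), (640, 1333)]  # (short side, long-side cap), assumed interpretation
    FLIP_P = 0.5

    def preprocess(img, boxes):
        """Random horizontal flip and random rescaling, with GT boxes mapped accordingly.

        img:   HxWx3 uint8 image
        boxes: (N, 4) array of (x, y, w, h) boxes in pixel coordinates
        """
        h, w = img.shape[:2]
        boxes = boxes.astype(np.float32)

        # random horizontal flip with probability FLIP_P
        if random.random() < FLIP_P:
            img = img[:, ::-1].copy()
            boxes[:, 0] = w - boxes[:, 0] - boxes[:, 2]

        # random choice of target scale; keep aspect ratio, cap the long side
        short, max_long = random.choice(SCALES)
        scale = min(short / min(h, w), max_long / max(h, w))
        img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
        boxes *= scale
        return img, boxes, scale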
And S4, inputting the preprocessed image into a backbone network to output four corresponding features of different scales, wherein the backbone network is a Swin Transformer. Specifically, the initialization parameters of the backbone Swin Transformer are pre-trained on the ImageNet data set, and the Swin Transformer preferably adopts the Swin-Tiny structure in view of the computational cost and detection rate of the detector.
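For orientation only, the following is an MMDetection-style configuration sketch of such a detector (Swin-Tiny backbone with ImageNet-pretrained weights, a feature-fusion neck, and an AutoAssign head). The application does not state which framework it uses; the "AGFPN" neck type, the checkpoint file name, and the number of defect classes are assumptions.

    # Hedged configuration sketch; not the application's actual implementation.
    model = dict(
        type='AutoAssign',
        backbone=dict(
            type='SwinTransformer',
            embed_dims=96,                 # Swin-Tiny
            depths=[2, 2, 6, 2],
            num_heads=[3, 6, 12, 24],
            out_indices=(0, 1, 2, 3),      # four scales for the neck
            init_cfg=dict(type='Pretrained',
                          checkpoint='swin_tiny_patch4_window7_224.pth')),
        neck=dict(
            type='AGFPN',                  # hypothetical name for the proposed AG-FPN neck
            in_channels=[96, 192, 384, 768],
            out_channels=256,
            num_outs=4),
        bbox_head=dict(
            type='AutoAssignHead',
            num_classes=5,                 # illustrative number of defect classes
            in_channels=256,
            stacked_convs=4,
            feat_channels=256))
    optimizer = dict(type='AdamW', lr=5e-5, weight_decay=0.05)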
And S5, sequentially inputting the four features of different scales extracted in step S4 into the AG-FPN for adaptive feature fusion from bottom to top. The AG-FPN obtains from the deep features the relative importance of each pixel in the shallow features through the AGM module, thereby distinguishing whether a pixel belongs to background information that should be erased or to target feature semantics that should be enhanced, and finally outputs feature information of four different scales. The AGM module structure is shown in FIG. 3.
And S6, inputting the fused features into an AutoAssign detection head, wherein the detection head has three branches that respectively output predicted-point classification information, predicted-point regression position information, and a predicted-point weight map. The pre-selection frame information comprises the predicted-point classification information, the distance information of a predicted point from the GT frame, and the predicted-point information. The dimension of the predicted-point classification information is H × W × C, where H and W respectively denote the height and width of the current feature and C denotes the number of categories; the dimension of the predicted-point regression position information is H × W × 4, where the four values (l, t, r, b) are the distances from the predicted point to the left, top, right and bottom sides of the frame, respectively; the dimension of the predicted-point information is H × W × 1, representing the probability that an object exists at the predicted point, which is used to avoid generating low-quality detection frames.
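The following sketch shows how the H × W × 4 regression output can be decoded: each prediction point, once mapped to image coordinates, is expanded by its (l, t, r, b) distances into a corner-format box. The function name and tensor layout are illustrative.

    import torch

    def decode_ltrb(points_xy, ltrb):
        """Convert per-point (l, t, r, b) distances into (x1, y1, x2, y2) boxes.

        points_xy: (M, 2) image-plane coordinates of the prediction points
        ltrb:      (M, 4) predicted distances to the left/top/right/bottom sides
        """
        x, y = points_xy[:, 0], points_xy[:, 1]
        l, t, r, b = ltrb.unbind(dim=-1)
        return torch.stack([x - l, y - t, x + r, y + b], dim=-1)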
Step S7 specifically includes the following steps:
s7.1, predicting frame information regression: will feature map F i And (3) mapping each upper point coordinate (x, y) to the picture after data preprocessing according to the following formula:
Figure BDA0004008028440000091
where s is a preset size step. Then, converting the preliminary candidate frame information according to the corresponding scaling size and the overturning condition in the preprocessing process to obtain final candidate frame information;
s7.2, calculating a loss function: firstly, calculating the weights of the central points on the features with different sizes according to the GT frame information, wherein the calculation formula is as follows:
Figure BDA0004008028440000092
wherein
Figure BDA0004008028440000093
A vector representing a position relative to a center of the frame;
Figure BDA0004008028440000094
is a learnable parameter of dimension (C, 2) (C is the number of classes). Secondly, calculating confidence weight, wherein the confidence weight calculation formula is as follows:
P i (cls|θ)=P i (cls|obj,θ)P i (obj|θ) (3)
Figure BDA0004008028440000095
P i (θ)=P i (cls|θ)P i (loc|θ) (5)
Figure BDA0004008028440000101
wherein, P i (cls | θ) represents the classification confidence, P i (cls | obj, θ) is the predicted point classification information obtained in step S6, and indicates the probability that the position i is a certain classification, P i (obj | θ) is the predicted point information obtained in step S6, θ being a learnable parameter; p i (loc | θ) is the regression confidence,
Figure BDA0004008028440000102
is the regression loss; p i (θ) represents the joint confidence; c (P) i ) The confidence weighting function is used to make the network more focused on the high confidence locations, and τ is the temperature coefficient, and the high confidence and low confidence locations are controlled to contribute to the positive weight, default to 1. Finally, a positive weight map is calculated
Figure BDA0004008028440000103
And a negative weight map
Figure BDA0004008028440000104
The formula is as follows:
Figure BDA0004008028440000105
Figure BDA0004008028440000106
wherein the content of the first and second substances,
Figure BDA0004008028440000107
is as in formula (2)
Figure BDA0004008028440000108
S n Representing the left and right positions in the corresponding GT frame in all the scale features; iou i Representing the largest IoU between the prediction box at position i and all GT boxes. And finally obtaining a final loss function through the following formula:
Figure BDA0004008028440000109
where N represents the number of all GT boxes, S represents all locations on all scale features,
Figure BDA00040080284400001010
n denotes the nth GT box.
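By way of illustration, the following sketch evaluates the center prior of formula (2) and the confidence weighting of formulas (3)-(7) for the candidate locations inside one GT frame; the tensor shapes and the λ and τ values are assumptions.

    import torch

    def center_prior(offsets, mu, sigma):
        """Gaussian center prior G(d) of formula (2).

        offsets: (M, 2) offsets of candidate locations from the GT-frame center
        mu, sigma: (2,) learnable mean/scale for the GT's class
        """
        return torch.exp(-((offsets - mu) ** 2 / (2 * sigma ** 2)).sum(dim=-1))

    def confidence_weights(cls_prob, obj_prob, loc_loss, lam=1.0, tau=1.0):
        """Joint confidence weighting C(P_i) of formulas (3)-(6).

        cls_prob: (M,) P_i(cls|obj, θ) for the GT class at each location
        obj_prob: (M,) predicted-point probability P_i(obj|θ)
        loc_loss: (M,) regression loss L_i^loc(θ) of each location
        """
        p_cls = cls_prob * obj_prob              # formula (3)
        p_loc = torch.exp(-lam * loc_loss)       # formula (4)
        p_joint = p_cls * p_loc                  # formula (5)
        return torch.exp(p_joint / tau), p_joint # formula (6)

    def positive_weights(conf, prior):
        """Normalised positive weights w+ over the locations of one GT frame, formula (7)."""
        w = conf * prior
        return w / w.sum().clamp(min=1e-12)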
S7.3, setting an optimizer: adamW was used as an optimizer to help the network update parameters while setting the learning rate to 0.00005 and the weight decay term to 0.05.
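A one-line optimizer setup matching these hyper-parameters might look as follows (the stand-in module below merely represents the assembled detector):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Conv2d(3, 96, 3, padding=1))  # stand-in for the assembled detector
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)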
Step S8, checking the detector effect: and predicting the verification set by using the network with the updated parameters to obtain the detection effect of the current network.
Step S9, repeating steps S4 to S8 until the model converges, and then storing the parameters of the current model network. Specifically, the present embodiment determines whether the model has converged by observing the loss function curve. If the model has converged, the current model network parameters are saved; if not, steps S3 to S7 continue to be repeated for I₂ iterations;
the flaw detection stage comprises the following steps:
step Z1, data preprocessing: and fixedly zooming the detected picture according to a set zooming proportion. Specifically, the scaling adopted by the method is 0.5;
step Z2, acquiring prediction information: inputting the preprocessed picture into a model network stored in the step S8 of the model training stage to obtain the prediction frame information of the target object;
step Z3 specifically comprises the following steps:
z3.1, prediction frame information regression: mapping the predicted frame information to a picture after data preprocessing according to the set size range to obtain preliminary candidate frame information; then, converting the preliminary candidate frame information according to the corresponding scaling ratio in the preprocessing process to obtain the predicted frame information in the original image;
z3.2, screening a prediction box: sequencing all preselected frames in the same picture according to the confidence; and removing the redundant frame according to a non-maximum suppression algorithm to obtain a final prediction frame.
In order to verify the performance of the method provided by the invention, the method is used to predict the images in the test set, and the predictions and the GT are used to calculate the mean Average Precision (mAP) and the Average Precision (AP) for each category. The detection accuracy of the method is compared with that of the classical detectors Cascade R-CNN, Sparse R-CNN, TOOD and FCOS; the comparison results are shown in Table 1, and every experiment was repeated at least twice to improve the reliability of the results. Compared with the other four classical detectors, the method provided by the invention achieves higher accuracy in detecting defects of all kinds, which demonstrates its superior performance.
TABLE 1: per-category AP and mAP comparison of the present method with Cascade R-CNN, Sparse R-CNN, TOOD and FCOS (the numerical values are given as images in the original publication).
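The mAP and per-category AP described above can be computed with the pycocotools evaluation API, for example as sketched below; the file names are illustrative.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO("instances_test.json")          # GT annotations in COCO format
    coco_dt = coco_gt.loadRes("detections.json")   # model predictions
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()   # prints mAP; per-category AP can be read from ev.eval["precision"]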
FIG. 2 is a network architecture diagram of the method of the present invention. The Swin Transformer frame shows the structure of the Swin-Tiny network, Patch is the partition unit in the Swin-Tiny structure, and the Swin module is the specific building block of the Swin-Tiny structure; AG-FPN denotes the pixel-level multi-scale feature fusion structure provided by the invention, AGM denotes the pixel-level adaptive feature fusion module therein, c_1 denotes a 1 × 1 convolution operation, c_3 denotes a 3 × 3 convolution operation, P_i denotes a feature after its channel number has been reduced by a 1 × 1 convolution operation, and the output of each AGM is the corresponding fused feature; AutoAssign refers to an Anchor-free detection structure (no anchor frame needs to be preset).
FIG. 3 is a block diagram of the pixel-level adaptive feature fusion module (AGM) in the method of the present invention. P_i and P_{i+1} respectively denote the shallow feature and the deep feature after their channel numbers have been reduced by a 1 × 1 convolution operation, and the module's output is the feature obtained via pixel-level adaptive feature fusion; c_7 denotes a 7 × 7 convolution operation, μ denotes an upsampling operation, cat denotes the concatenation of features along the channel dimension, σ denotes the sigmoid function, and the softmax function and the hyperbolic tangent function tanh are used as indicated in the figure; the bottleneck structure is composed of two convolutional layers, of which the first reduces the number of feature channels and the second restores it.
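For illustration, the following PyTorch sketch assembles one plausible AGM from the operations named for FIG. 3 (the 1 × 1 channel reduction is assumed to happen outside the module, as stated above); the actual connectivity of the AGM is defined by the figure and may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AGM(nn.Module):
        """Illustrative pixel-level adaptive fusion module (one possible wiring)."""

        def __init__(self, channels=256, reduction=4):
            super().__init__()
            self.conv7 = nn.Conv2d(2 * channels, channels, 7, padding=3)
            self.bottleneck = nn.Sequential(                 # reduce then restore channels
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1))

        def forward(self, p_shallow, p_deep):
            up = F.interpolate(p_deep, size=p_shallow.shape[-2:], mode='nearest')
            x = torch.cat([p_shallow, up], dim=1)              # cat along the channel dimension
            gate = torch.tanh(self.bottleneck(self.conv7(x)))  # per-pixel importance in [-1, 1]
            return p_shallow * (1.0 + gate) + up               # erase (gate<0) or enhance (gate>0), then fuse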
As shown in fig. 5, the present embodiment discloses a system of a method for detecting defects in cloth based on the pixel-level multi-scale feature fusion of the above embodiments, which includes a model training module and a defect detecting module;
the model training module specifically comprises the following sub-modules:
a data acquisition sub-module: photographing the cloth to obtain an initial cloth image; marking flaws on the cloth images, and only keeping the images with flaws to obtain a cloth flaw data set;
a data set partitioning submodule: dividing a cloth flaw data set into a training set, a verification set and a test set according to a preset proportion;
a data preprocessing submodule: randomly flipping and randomly scaling the pictures input into the backbone-network Swin Transformer according to a set probability and scaling sizes, and simultaneously carrying out the corresponding mapping transformation on the labeling information of the detection targets;
a feature extraction submodule: inputting the preprocessed pictures into a Swin Transformer of a backbone network to output corresponding characteristics with different scales;
a pixel-level based feature fusion submodule: inputting the obtained features of different scales into the AG-FPN for feature fusion, and outputting N pieces of feature information of different scales;
and a flaw detection submodule: inputting the fused features into an AutoAssign detection head to generate corresponding prediction frame information;
a detector parameter updating submodule: calculating the value of the overall loss function of the network by combining the information of the prediction frame and the information of a real frame, namely a GT frame, marked in the data acquisition sub-module, and updating the model parameters by using AdamW;
view detector effect submodule: predicting the verification set by using the network with the updated parameters to obtain the detection effect of the current network;
an iteration submodule: judging whether the training is converged according to the loss function curve; if the model is converged, saving the current model network parameters;
the flaw detection module specifically comprises the following sub-modules:
a data preprocessing submodule: fixedly zooming the detected picture according to a set zooming ratio;
the obtain prediction information submodule: inputting the preprocessed picture into a network stored in a checking detector effect sub-module in a model training stage to obtain the prediction frame information of a target object;
a prediction box information regression submodule: and mapping the prediction frame on the original image according to the obtained prediction frame information and the set size range to obtain the position information and the type information of the flaws on the original image.
The invention has the beneficial effects that:
(1) The FPN structure is improved, and the features extracted from the main network are filtered and enhanced by introducing a pixel-level-based self-adaptive fusion module, so that the mutual interference among different defect features and the interference of background noise on the defect features are effectively reduced;
(2) The method utilizes the inherent characteristics of the Swin Transformer, and effectively reduces the feature loss of large-size defects and long, narrow defects without introducing additional parameters or computation cost;
(3) According to the invention, AutoAssign is used to replace a traditional Anchor-based detection head, so that the dependence of the network on prior knowledge such as preset frames is reduced, and the detection capability for defects of extreme sizes is improved.
The invention belongs to the technical field of image target detection, and particularly discloses a cloth flaw detection method and system based on pixel-level multi-scale feature fusion. The Swin Transformer is used as the backbone network of the detector for feature extraction, the extracted features are input into the AG-FPN for pixel-level feature fusion, and finally AutoAssign is used as the detection head to complete the final detection. The AG-FPN solves the problems of mutual interference between defect features and of complex background interference in the cloth defect detection task by introducing a pixel-level adaptive fusion module into the FPN structure, and hardly increases the computation cost or parameter burden. Compared with existing advanced detection models, the method has great advantages in detection precision, especially for defects of extreme sizes.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope of the present invention.

Claims (10)

1. A cloth flaw detection method based on pixel-level multi-scale feature fusion is characterized by comprising a model training stage and a flaw detection stage;
the model training stage specifically comprises the following steps:
s1, photographing cloth to obtain an initial cloth image; marking flaws on the cloth images, and only keeping the images with the flaws to obtain a cloth flaw data set;
s2, dividing a cloth flaw data set into a training set, a verification set and a test set according to a preset proportion;
s3, randomly overturning and immediately zooming the picture input into the Swin Transformer of the backbone network according to a set probability and a zoom ratio, and simultaneously carrying out corresponding mapping transformation on the labeled information of the detection target;
s4, inputting the preprocessed picture into a Swin Transformer of the backbone network to output corresponding characteristics with different scales;
s5, inputting the obtained features of different scales into the AG-FPN for feature fusion, and outputting N pieces of feature information of different scales;
s6, inputting the fused features into an automatic design detection head to generate corresponding prediction frame information;
step S7, calculating the value of the overall loss function of the network by combining the information of the prediction frame and the information of the real frame GT frame marked in the step S1, and updating the model parameters by using AdamW;
s8, predicting the verification set by using the network with the updated parameters to obtain the detection effect of the current network;
step S9, repeating steps S4 to S8 for I₁ iterations; judging whether the training has converged according to the loss function curve; if the model has converged, saving the current model network parameters; if not, continuing to repeat steps S3 to S7 for I₂ iterations;
the flaw detection stage specifically comprises the following steps:
step Z1, fixedly zooming the detection picture according to a set zooming ratio;
step Z2, inputting the preprocessed picture into the network stored in the step S8 in the model training stage to obtain the prediction frame information of the target object;
and step Z3, mapping the prediction frame to the original image according to the obtained prediction frame information and the set size range to obtain the position information and the category information of the flaws on the original image.
2. The method as claimed in claim 1, wherein in step S1, the size of the acquired image is 1024 × 2048, the acquired data is manually labeled by using a labelme toolbox, and the labeled GT frame information is stored in a COCO format, that is, the GT frame is labeled as (x, y, w, h, c), where x and y respectively represent an x coordinate and a y coordinate of a top left corner vertex of the target, w and h respectively represent a length and a width of the GT frame, and c represents a category of an object contained in the GT frame.
3. The method for detecting cloth defects based on pixel-level multi-scale feature fusion of claim 1, wherein in the step S2, the ratio of the training set, the verification set and the test set is 7.
4. The method as claimed in claim 1, wherein in step S3, the probability of random flipping is 0.5, the set sizes for random scaling are [480 × 1333, 512 × 1333, 640 × 1333], and the random scaling scales each picture to a random size in the size list.
5. The method as claimed in claim 1, wherein in step S4, the initialization parameters of the backbone network Swin Transformer are pre-trained using the ImageNet data set.
6. The method as claimed in claim 1, wherein in step S5, N is set to 4, and the AG-FPN erases background information in shallow features with deep feature semantics.
7. The pixel-level-based multi-scale feature fusion cloth defect detection method of claim 1, wherein in the step S6, the pre-selected frame information includes predicted point classification information, information of a distance from the predicted point to a GT frame, and a predicted point weight map, wherein the information of the distance from the predicted point to the GT frame is (l, t, r, b) respectively representing distances from the predicted point to a left side, a top side, a right side, and a bottom side of the frame.
8. The method for detecting cloth defects based on pixel-level multi-scale feature fusion as claimed in claim 1, wherein the step S7 comprises the following steps:
s7.1, mapping the prediction frame information to a picture after data preprocessing according to the set size range to obtain preliminary candidate frame information; converting the preliminary candidate frame information according to the corresponding scaling size and the overturning condition in the preprocessing process to obtain candidate frame information;
s7.2, calculating classification loss, regression loss and central point prior weight according to the candidate frame information and the corresponding GT frame, calculating corresponding positive sample weight, positive sample confidence coefficient, negative sample weight and negative sample confidence coefficient according to the classification loss, the regression loss and the central point prior weight, and calculating the overall network loss by adopting focal loss;
s7.3, adopting AdamW as an optimizer to help the network to update parameters, and meanwhile setting the learning rate to be 0.00005 and setting the weight attenuation term to be 0.05.
9. The method as claimed in claim 1, wherein the step Z3 of obtaining the final prediction frame from the prediction frame information includes the following steps:
z3.1, mapping the predicted frame information to the picture after data preprocessing according to the set size range to obtain preliminary candidate frame information; then, converting the preliminary candidate frame information according to the corresponding scaling ratio in the preprocessing process to obtain the predicted frame information in the original image;
z3.2, sequencing all preselected frames in the same picture according to the confidence; and removing the redundant frame according to a non-maximum suppression algorithm to obtain a final prediction frame.
10. A system for detecting defects of cloth based on the pixel-level multi-scale feature fusion of any one of claims 1-9, which is characterized by comprising a model training module and a defect detection module;
the model training module specifically comprises the following sub-modules:
a data acquisition submodule: photographing the cloth to obtain an initial cloth image; marking flaws on the cloth images, and only keeping the images with flaws to obtain a cloth flaw data set;
a data set partitioning submodule: dividing a cloth flaw data set into a training set, a verification set and a test set according to a preset proportion;
a data preprocessing submodule: randomly flipping and randomly scaling the pictures input into the backbone-network Swin Transformer according to a set probability and scaling sizes, and simultaneously carrying out the corresponding mapping transformation on the labeling information of the detection targets;
a feature extraction submodule: inputting the preprocessed pictures into a Swin Transformer of a backbone network to output corresponding characteristics with different scales;
a pixel-level based feature fusion submodule: inputting the obtained features of different scales into the AG-FPN for feature fusion, and outputting N pieces of feature information of different scales;
and a flaw detection submodule: inputting the fused features into an AutoAssign detection head to generate corresponding prediction frame information;
a detector parameter updating submodule: calculating the value of the overall loss function of the network by combining the information of the prediction frame and the information of a real frame, namely a GT frame, marked in the data acquisition submodule, and updating the model parameters by using AdamW;
view detector effect submodule: predicting the verification set by using the network with the updated parameters to obtain the detection effect of the current network;
an iteration submodule: judging whether the training is converged according to the loss function curve; if the model is converged, saving the current model network parameters;
the flaw detection module specifically comprises the following sub-modules:
a data preprocessing submodule: fixedly zooming the detected picture according to a set zooming ratio;
the obtain prediction information submodule: inputting the preprocessed picture into a network stored in a checking detector effect sub-module in a model training stage to obtain the prediction frame information of a target object;
a prediction box information regression submodule: and mapping the prediction frame on the original image according to the obtained prediction frame information and the set size range to obtain the position information and the type information of the flaws on the original image.
CN202211642227.7A 2022-12-20 2022-12-20 Cloth flaw detection method and system based on pixel-level multi-scale feature fusion Pending CN115829995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211642227.7A CN115829995A (en) 2022-12-20 2022-12-20 Cloth flaw detection method and system based on pixel-level multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211642227.7A CN115829995A (en) 2022-12-20 2022-12-20 Cloth flaw detection method and system based on pixel-level multi-scale feature fusion

Publications (1)

Publication Number Publication Date
CN115829995A true CN115829995A (en) 2023-03-21

Family

ID=85517096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211642227.7A Pending CN115829995A (en) 2022-12-20 2022-12-20 Cloth flaw detection method and system based on pixel-level multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN115829995A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245882A (en) * 2023-05-11 2023-06-09 深圳市世宗自动化设备有限公司 Circuit board electronic element detection method and device and computer equipment
CN116402821A (en) * 2023-06-08 2023-07-07 湖南大学 Aircraft skin gluing quality defect detection method based on neural network
CN116402821B (en) * 2023-06-08 2023-09-22 湖南大学 Aircraft skin gluing quality defect detection method based on neural network
CN117173568A (en) * 2023-09-05 2023-12-05 北京观微科技有限公司 Target detection model training method and target detection method

Similar Documents

Publication Publication Date Title
CN111553929B (en) Mobile phone screen defect segmentation method, device and equipment based on converged network
CN108960245B (en) Tire mold character detection and recognition method, device, equipment and storage medium
CN113469177B (en) Deep learning-based drainage pipeline defect detection method and system
CN108918536B (en) Tire mold surface character defect detection method, device, equipment and storage medium
CN115829995A (en) Cloth flaw detection method and system based on pixel-level multi-scale feature fusion
CN109919934B (en) Liquid crystal panel defect detection method based on multi-source domain deep transfer learning
CN111723860A (en) Target detection method and device
WO2022236876A1 (en) Cellophane defect recognition method, system and apparatus, and storage medium
CN109671058B (en) Defect detection method and system for large-resolution image
CN114663346A (en) Strip steel surface defect detection method based on improved YOLOv5 network
CN112132196B (en) Cigarette case defect identification method combining deep learning and image processing
CN110706224B (en) Optical element weak scratch detection method, system and device based on dark field image
CN109584206B (en) Method for synthesizing training sample of neural network in part surface flaw detection
CN115049619B (en) Efficient flaw detection method for complex scene
CN114549507B (en) Improved Scaled-YOLOv fabric flaw detection method
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN110599453A (en) Panel defect detection method and device based on image fusion and equipment terminal
CN115830004A (en) Surface defect detection method, device, computer equipment and storage medium
CN114781514A (en) Floater target detection method and system integrating attention mechanism
CN114743102A (en) Furniture board oriented flaw detection method, system and device
CN115457026A (en) Paper defect detection method based on improved YOLOv5
CN115147418A (en) Compression training method and device for defect detection model
CN115187544A (en) DR-RSBU-YOLOv 5-based fabric flaw detection method
CN114841992A (en) Defect detection method based on cyclic generation countermeasure network and structural similarity
CN112750113B (en) Glass bottle defect detection method and device based on deep learning and linear detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination