CN116934820A - Cross-attention-based multi-size window Transformer network cloth image registration method and system - Google Patents

Cross-attention-based multi-size window Transformer network cloth image registration method and system

Info

Publication number
CN116934820A
Authority
CN
China
Prior art keywords
image
window
attention
size
blocks
Prior art date
Legal status
Pending
Application number
CN202310933471.7A
Other languages
Chinese (zh)
Inventor
邵佳维
郭春生
应娜
杨萌
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority: CN202310933471.7A
Publication: CN116934820A
Legal status: Pending


Classifications

    • G06T 7/30: Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33: Image registration using feature-based methods
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06T 5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06T 7/0004: Industrial image inspection
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30108: Industrial image inspection
    • G06T 2207/30124: Fabrics; Textile; Paper

Abstract

The invention discloses a cross-attention-based multi-size window Transformer network cloth image registration method and system, wherein the method comprises the following steps: processing cloth image pairs and dividing them into a training set and a test set; creating a dual-channel Transformer structure network, dividing each input image into image blocks of the same size, linearly encoding the image blocks, and extracting the features of the fixed image and the moving image respectively; the feature blocks from the dual-channel network obtain cross attention through the multi-size window method of two CATs by exchanging the input order, fusing the two input features into one piece of attention information; the cross-fused feature blocks are aggregated through skip connections to obtain the output deformation field; the cloth image is warped with the obtained deformation field and a spatial transformation network to obtain the registered image, and the similarity between the fixed image and the registered image is calculated; the registered cloth image and the fixed image undergo a difference operation, and defective cloth is identified from the pixels of the difference image.

Description

Cross-attention-based multi-size window Transformer network cloth image registration method and system
Technical Field
The invention belongs to the technical field of cloth image registration, and particularly relates to a deformable image registration method and system based on a Transformer structure with a cross-attention mechanism and multi-size windows.
Background
With the rapid development of digital image acquisition technology, image data from different viewing angles and different points in time can be obtained easily. These image data play an important role in many computer vision fields such as marine resource detection, medical image diagnosis, remote sensing image processing, and target anomaly detection. Sonar images, for example, have been applied to many underwater tasks such as seabed target detection, target tracking, and path planning; medical images play an important role in imaging applications and pathology analysis; remote sensing images have been widely used in mapping, environmental monitoring, weather forecasting, and other fields. However, because image acquisition conditions differ, there may be transformations such as rotation, translation, scaling, and distortion between the images, and even relatively complex nonlinear relationships may occur between image pairs, resulting in incomplete matching between the images and making subsequent image analysis and processing difficult. The images therefore need to be registered before they can be analyzed and processed.
Currently, image registration has become one of the important problems in the field of computer vision, and its research has been widely applied in video analysis, pattern recognition, moving object detection, and the like. However, when images are collected, the complexity of the environment and the limitations of the apparatus itself may contaminate the collected images with noise and even distort them in various ways. These factors result in images with low signal-to-noise ratio, low resolution, and insignificant texture characteristics, and images from different viewpoints can exhibit relatively complex nonlinear relationships. In addition, with the rapid growth of the global image data volume and the continuous expansion of image data applications, higher requirements are placed on the speed and precision of image registration methods, which brings great challenges to image registration, so registration methods need to be continuously improved.
Traditional image registration is based on point features such as SIFT, SURF, and ORB. The distinctiveness of these point features effectively reduces the number of false matches, and registration is achieved by establishing an image transformation model. With the development of deep learning, neural networks have been applied to image registration, where image features are extracted by the network, so such methods generally outperform traditional registration methods. Supervised image registration methods obtain the parameters of a deformation model between the input images through a neural network to achieve registration. Unsupervised image registration methods do not require manual construction of an image deformation model and evaluate image matching through similarity. Because of the complex nonlinear transformations between images, constructing parametric deformation models is often challenging. In recent years, unsupervised image registration based on deformation fields has received increasing attention; the deformation field realizes the matching of images by constructing a displacement vector for each pixel of the image to be registered.
Although the existing attention-based Transformer structures can match images, the traditional Transformer still adopts the same attention mechanism as in single-image tasks: it only considers the correlation within one image, ignores the mapping relationship between the image pair, and is therefore limited in finding effective registration features for fine registration. In addition, in the process of extracting image features, features cannot be extracted finely in a globally corresponding manner, which limits the correspondence of different information between images and may cause problems such as the loss of key structures and details.
Disclosure of Invention
In order to solve the problems that only the correlation within a single image is considered and that part of the features are lost, the invention provides a cross-attention-based multi-size window Transformer network image registration method and system. In the invention, first, the correspondence between images is learned through cross attention, and the correlation of the image pair is calculated with the cross-attention mechanism so that features are automatically matched within the network; second, features are continuously matched and fused through a cross-attention-based feature fusion module, which fuses the two input features into one piece of attention information and performs feature matching with shared parameters; finally, multi-size windows focus on the local transformations of deformable registration, obtaining detail information while constraining the attention calculation between the basic window and search windows of different sizes. The invention improves the precision of image registration, helps identify defective cloth in the cloth production process, and improves production efficiency.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the multi-size window Transformer network cloth image registration method based on the cross attention comprises the following steps:
s1, processing a real cloth image, and dividing the real cloth image into a training set and a testing set;
s2, creating a dual-channel Transformer structure network, dividing each input image into image blocks of the same size, linearly encoding the image blocks, and then extracting the features of the fixed image and the moving image respectively;
s3, the feature blocks from the dual-channel network acquire cross attention through the multi-size window method in two Cross Attention Transformer (CAT) blocks by exchanging the input order, fusing the two input features into one piece of attention information;
s4, aggregating the cross-fused feature blocks through skip connections, finally obtaining the output deformation field;
s5, deforming the cloth image by using the obtained deformation field and the spatial transformation network to obtain a registered image, and calculating the similarity between the fixed image and the registered image;
s6, performing differential operation on the registered cloth image and the fixed image, and identifying defective cloth according to pixels of the differential image.
Further, in step S1, data processing includes image cropping, to obtain a training set and a test set, and data enhancement is performed on the training set, and the training data set after data enhancement is input into a network.
Further, in step S2, features of the moving image and the fixed image are extracted by using dual parallel networks, the two networks communicate through feature fusion, and the action mechanisms of the upper network and the lower network are the same. These two parallel networks follow the coding and decoding part of the Unet structure, but the convolution is replaced by Cross Attention Transformer blocks, which play an important role in the attention feature fusion module between the two networks, facilitating the automatic matching of features in the networks. The network of the present invention not only exchanges cross image information vertically, but also maintains a horizontal refinement function. Because the mechanisms of the upper and lower parallel networks are identical, one of the networks, hereinafter referred to as a single channel network, will be described.
The single-channel Transformer architecture is as follows. In the first step, the input color image is cut into non-overlapping image blocks by an image block segmentation module; each image block can be regarded as a token that concatenates the RGB values of the corresponding input image pixels. In the single-channel network, the image block size is set to 4×4, so the feature dimension of a single image block is 48. A linear embedding layer is applied to the segmented image blocks; its function is to map the 48-dimensional image blocks to an arbitrary dimension C. Features are extracted on these image block tokens by several modified Transformer blocks; the number of tokens (H/4 × W/4) is unchanged, and together with the linear embedding module this is referred to as "step 1".
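As an illustration only, the patch segmentation and linear embedding of "step 1" could be sketched in PyTorch as follows; the class name PatchEmbedding and the default embedding dimension of 96 are assumptions for exposition and are not fixed by the patent:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an RGB image into non-overlapping 4x4 blocks and linearly embed each block."""
    def __init__(self, patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        self.patch_size = patch_size
        # Each 4x4x3 block is flattened to a 48-dim token, then mapped to embed_dim (C).
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> tokens: (B, H/4 * W/4, C)
        B, C, H, W = x.shape
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)                       # (B, 3, H/4, W/4, 4, 4)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), -1)  # (B, N, 48)
        return self.norm(self.proj(x))                              # (B, N, C)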
As the network deepens, image block merging modules are used to reduce the number of tokens in order to obtain multi-level features. Assuming the input to the image block merging module is a 4×4 feature map, the module first groups the blocks of the same color together to form four 2×2 image blocks; then the features of the four image blocks are concatenated and normalized; finally a linear transformation is applied through a linear layer. At this point the number of tokens is reduced by a factor of 4 (2× downsampling of resolution), and the output dimension becomes 2C.
Then, a Transformer block is used for feature transformation, and the resolution is kept at H/8 × W/8. The image block merging module and the Transformer block for feature transformation are denoted as "step 2". The above procedure of "step 2" is repeated twice, denoted "step 3" and "step 4", where the output resolutions are H/16 × W/16 and H/32 × W/32, respectively.
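A minimal sketch of the image block merging described above is given below; the class name PatchMerging is an assumption, and the implementation follows the 2×2 neighbor grouping, normalization, and linear reduction to 2C stated in the text:

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge 2x2 neighboring tokens: 4x fewer tokens, dimension C -> 2C."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, H*W, C) laid out on an H x W grid
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four tokens of each 2x2 neighborhood (the "same color" blocks).
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1).view(B, -1, 4 * C)  # (B, H/2*W/2, 4C)
        return self.reduction(self.norm(x))                          # (B, H/2*W/2, 2C)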
Further, in step S3, two CAT blocks are utilized to fuse two input features into one piece of attention information, and feature matching is performed by sharing parameters; and the precise local correspondence is realized by utilizing a multi-size window method in the CAT block, and a fine deformation flow field is finally generated. Moving image feature T from parallel subnetworks m And fixed image feature T f By exchanging the order of inputs, mutual attention is gained by the two CAT blocks. Then the other two attention outputs return to the original channel to obtain a fusion characteristic T mf And T fm And provides for further information exchange. In one feature fusion module there are a total of k communications to obtain sufficient mutual information. Through the attention feature fusion module between the two networks, the features of different networks from different semantic information exchange information frequently, so that the network can keep learning multi-level semantic features to carry out final fine registration.
The novel attention mechanism CAT is used to fully exchange information between the image pair and comprehensively combines the representation and multi-scale matching of features. The input feature tokens b and s are divided into two groups of windows in different ways, the basic window set S_ba and the search window set S_se, for the subsequent window-based attention calculation. The purpose of the CAT block is to compute, through the attention mechanism, new feature tokens carrying the correspondence from the input feature tokens b to the feature tokens s. S_ba and S_se contain the same number of windows but with different window sizes. Each basic window of S_ba is projected into the query set (queries), and each search window is projected through a linear layer into the knowledge sets (keys and values). Then window-based multi-head cross attention (W-MCA) calculates the cross attention between the two windows and adds it to the basic window, so that each basic window obtains the corresponding weighted information from the search window. Finally, the new output set is sent to a multi-layer perceptron with GELU nonlinearity in order to improve its learning ability. A LayerNorm (LN) layer is used before each W-MCA and each MLP module, ensuring that each layer performs effectively.
The multi-size window partitioning includes two different methods, Window Partitioning (WP) and Window Area Partitioning (WAP), to divide the input feature tokens b and s into windows of different sizes. WP divides the feature tokens directly into a set S_ba of basic windows of size n × h × w, while WAP enlarges the window size by magnification factors α and β. Thus, the basic and search window sizes are calculated as:
h_ba, w_ba = h, w
h_se, w_se = α·h, β·w
where h_ba, w_ba is the size of the basic window and h_se, w_se is the size of the search window. In order to obtain the same number of windows in the two sets, WAP uses a sliding window and sets the stride to the basic window size, so the size of S_se is n × αh × βw. Through these corresponding windows of different sizes, the CAT block efficiently calculates the cross attention between the two feature tokens and avoids large-span searching, realizing accurate information exchange.
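For illustration, WP and WAP can be sketched as below; the helper names, the use of F.unfold for the sliding search windows, and the symmetric padding that keeps the two window counts equal are assumptions made for this sketch:

import torch
import torch.nn.functional as F

def window_partition(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """WP: split (B, H, W, C) features into n basic windows of size h x w."""
    B, H, W, C = x.shape
    x = x.view(B, H // h, h, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, h * w, C)                       # (B*n, h*w, C)

def window_area_partition(x, h, w, alpha=2, beta=2):
    """WAP: sliding windows of size (alpha*h) x (beta*w) with stride (h, w),
    so the number of search windows equals the number of basic windows."""
    B, H, W, C = x.shape
    x = x.permute(0, 3, 1, 2)                            # (B, C, H, W)
    pad_h, pad_w = (alpha - 1) * h // 2, (beta - 1) * w // 2
    x = F.pad(x, (pad_w, pad_w, pad_h, pad_h))           # assumed: pad to keep window count equal to WP
    patches = F.unfold(x, kernel_size=(alpha * h, beta * w), stride=(h, w))
    # (B, C*alpha*h*beta*w, n) -> (B*n, alpha*h*beta*w, C)
    n = patches.shape[-1]
    patches = patches.view(B, C, alpha * h * beta * w, n).permute(0, 3, 2, 1)
    return patches.reshape(B * n, alpha * h * beta * w, C)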
Attention can be described as a function that maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The W-MCA provided by the invention calculates the cross attention between the basic window and the search window to acquire an accurate correspondence. K, Q, V denote the features mapped from the image blocks, where Q comes from the basic window and K and V come from the search window. The output values are weighted sums, where the weight assigned to each value is calculated by a compatibility function between the query and the corresponding key.
W-MCA uses multi-head attention to fully represent the subspaces. It performs a dot-product operation between the queries and the keys, first divides each dot product by the scaling factor √c, and then applies a softmax function to obtain the weights of the values. Thus, the cross-attention calculation is expressed as:
W-MCA(Q_ba, K_se, V_se) = softmax(Q_ba · K_se^T / √c) · V_se
where Q_ba, K_se, V_se are the query matrix, key matrix, and value matrix. Q_ba ∈ R^(n×s×c) is the linear projection of S_ba, and K_se, V_se ∈ R^(n×μ·s×c) are the linear projections of S_se, with s = h×w and μ = α·β, c being the dimension of each feature token.
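The cross-attention computation itself can be sketched as follows (a single-head simplification for clarity; in practice W-MCA is multi-head, and the function and argument names are assumptions):

import torch

def w_mca(s_ba: torch.Tensor, s_se: torch.Tensor,
          wq: torch.nn.Linear, wk: torch.nn.Linear, wv: torch.nn.Linear) -> torch.Tensor:
    """Window-based cross attention from basic windows (queries) to search windows (keys/values).
    s_ba: (n, s, c) basic windows, s_se: (n, mu*s, c) search windows."""
    q = wq(s_ba)                                          # (n, s, c)
    k = wk(s_se)                                          # (n, mu*s, c)
    v = wv(s_se)                                          # (n, mu*s, c)
    c = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)  # (n, s, mu*s)
    return attn @ v                                       # (n, s, c), added back to s_ba outside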
Further, in step S4, the cross-fused feature blocks are aggregated through skip connections, and the output deformation field is finally obtained.
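One possible form of this aggregation is sketched below, assuming upsampling plus concatenation with the encoder features at each scale and a final convolution head that outputs a two-channel displacement field; the layer names, channel sizes (assuming C = 96), and activation choice are illustrative assumptions:

import torch
import torch.nn as nn

class FlowDecoder(nn.Module):
    """Aggregate cross-fused features over skip connections and predict a dense 2-D flow."""
    def __init__(self, dims=(768, 384, 192, 96)):
        super().__init__()
        self.up_blocks = nn.ModuleList([
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(dims[i] + dims[i + 1], dims[i + 1], kernel_size=3, padding=1),
                nn.LeakyReLU(inplace=True),
            )
            for i in range(len(dims) - 1)
        ])
        self.flow_head = nn.Conv2d(dims[-1], 2, kernel_size=3, padding=1)  # dx, dy per pixel

    def forward(self, feats):
        # feats: list of fused feature maps from deepest (H/32) to shallowest (H/4)
        x = feats[0]
        for up, skip in zip(self.up_blocks, feats[1:]):
            x = up[0](x)                               # upsample
            x = up[1](torch.cat([x, skip], dim=1))     # skip connection + convolution
            x = up[2](x)
        return self.flow_head(x)                       # (B, 2, H/4, W/4) deformation field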
Further, in step S5, the loss function of the network consists of two parts. One is the similarity loss, denoted by the Mean Squared Error (MSE), which measures the similarity of the moving and fixed images and penalizes the difference between the two. The other is the regularization loss, which consists of a hyperparameter and a regularization term; the regularization term adds a smoothness constraint to the estimated deformation field to prevent excessive folding of the deformation field.
The MSE is the expectation of the square of the difference between the true value and the estimated value; the smaller its value, the better the prediction. The mean square error between the fixed image and the predicted (warped) image is expressed as:
L_sim(I_f, I_w) = (1/|Ω|) · Σ_{P∈Ω} [I_f(P) − I_w(P)]²
where P denotes a pixel point in the moving image and the fixed image, Ω denotes the entire image area, I_f denotes the fixed image, and I_w denotes the warped image.
Regularization is a penalty on folding in the deformation field, expressed as:
R(θ) = Σ_{P∈Ω} ||∇φ(P)||²
where R(θ) is the regularization term and ∇φ(P) denotes the gradient of the deformation field in the X and Y directions at point P. If λ denotes the coefficient of the loss regularization term, the loss function is expressed as:
L(I_f, I_w) = L_sim(I_f, I_w) + λ·R(θ)
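Assuming a dense displacement field flow and PyTorch's grid_sample as the spatial transformation network, the warping and training loss can be sketched as follows; the variable names and the value of λ are illustrative assumptions:

import torch
import torch.nn.functional as F

def warp(moving: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp the moving image with a per-pixel displacement field flow of shape (B, 2, H, W)."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(moving.device)   # (2, H, W), x then y
    new_locs = grid.unsqueeze(0) + flow                              # sampled locations
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    new_locs[:, 0] = 2.0 * new_locs[:, 0] / max(W - 1, 1) - 1.0
    new_locs[:, 1] = 2.0 * new_locs[:, 1] / max(H - 1, 1) - 1.0
    return F.grid_sample(moving, new_locs.permute(0, 2, 3, 1), align_corners=True)

def registration_loss(fixed, moving, flow, lam: float = 0.01):
    warped = warp(moving, flow)
    sim = F.mse_loss(warped, fixed)                                  # similarity (MSE) loss
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]                      # gradient in Y
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]                      # gradient in X
    smooth = (dx ** 2).mean() + (dy ** 2).mean()                     # smoothness regularizer
    return sim + lam * smooth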
Further, in step S6, the registered cloth image and the fixed image undergo a difference operation, and defective cloth images are identified from the pixels of the difference image. A thresholding strategy is adopted: the threshold and the window size are set in advance, and a sliding window successively checks whether the mean pixel value within the window exceeds the threshold; if it does, the image contains a defect, otherwise it does not.
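A sketch of this thresholding decision is given below, assuming single-channel images and an averaging filter as the sliding window; the threshold and window size values are illustrative, not values fixed by the patent:

import numpy as np
from scipy.ndimage import uniform_filter

def has_defect(fixed: np.ndarray, registered: np.ndarray,
               window: int = 16, threshold: float = 25.0) -> bool:
    """Difference the registered and fixed cloth images and flag a defect if the
    mean absolute difference inside any sliding window exceeds the threshold."""
    diff = np.abs(registered.astype(np.float32) - fixed.astype(np.float32))
    local_mean = uniform_filter(diff, size=window)       # mean of pixels in each window
    return bool((local_mean > threshold).any())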
The invention also discloses a multi-size window Transformer network cloth image registration system based on the cross attention, which is used for executing the method and comprises the following modules:
and a data set making module: clipping the cloth image pair and further dividing the cloth image pair into a training set and a testing set;
Dual-channel Transformer structure module: creating a dual-channel Transformer structure network, dividing each input image into image blocks of the same size, linearly encoding the image blocks, and then extracting the features of the fixed image and the moving image respectively;
And a feature fusion module: the feature blocks from the dual-channel network obtain cross attention through a multi-size window method in two Cross Attention Transformer (CAT) by exchanging the input sequence, and combine the two input features into one attention information;
and a feature aggregation module: the feature blocks after cross fusion are respectively subjected to feature aggregation in a jump connection mode, and an output deformation field is finally obtained;
training module: training the model by using the mean square error loss and the regularization loss;
and (3) judging a flaw module: and carrying out differential operation on the registered cloth image and the fixed image, and identifying defective cloth according to pixels of the differential image.
Compared with the prior art, the cross-attention-based multi-size window Transformer network cloth image registration method and system first fuse information of different scales using cross-attention-based Transformer blocks, effectively solving the mapping problem between image pairs; in addition, multi-size windows emphasize the deformable local transformation and capture detail features to improve the registration effect. In the invention, on the cross-attention Transformer architecture, the image pair features are continuously matched and fused, the two input features are fused into one piece of attention information, and feature matching is performed with shared parameters, so that the image registration problem is better solved and the task of identifying defective cloth is completed.
Drawings
Fig. 1 is a flowchart of the cross-attention-based multi-size window Transformer network cloth image registration method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the single-channel Transformer architecture in step S12 according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the feature fusion module in step S13 according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the multi-size window method in step S13 according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the multi-size window cross-attention operation in step S13 according to an embodiment of the present invention.
Fig. 6 is a block diagram of the cross-attention-based multi-size window Transformer network cloth image registration system according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other different embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict.
The invention aims at overcoming the defects of the prior art and provides a multi-size window Transformer network cloth image registration method and system based on cross attention.
Example 1
As shown in fig. 1, the embodiment provides a multi-size window Transformer network cloth image registration method based on cross attention, and the specific implementation flow includes the following steps:
s11, cutting cloth images collected in advance to a specified size, further dividing the cloth images into a training set and a testing set, and carrying out data enhancement on the training set;
s12, creating a dual-channel Transformer structure network, dividing each input image into image blocks of the same size, linearly encoding the image blocks, and then extracting the features of the fixed image and the moving image respectively;
s13, the feature blocks from the dual-channel network obtain cross attention by using the multi-size window method in two Cross Attention Transformer (CAT) blocks through exchanging the input order, fusing the two input features into one piece of attention information;
s14, respectively aggregating the feature blocks after cross fusion in a jumping connection mode to obtain an output deformation field;
s15, restraining a deformation field by using smoothness loss, and training by using similarity loss;
S16, identifying defective cloth by using pixels obtained by performing difference operation on the registered cloth image and the fixed image.
The steps of this embodiment are specifically described as follows:
In step S11, the obtained cloth images are cropped to a size of 512×512 and divided into a training set and a test set at a ratio of 5:1. Meanwhile, data enhancement is applied to the training set; the enhancement consists of applying an affine transformation to the training images and then adding elastic deformation, where the scale factor α and the elasticity coefficient σ can be adjusted according to the different cloth images acquired.
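As an illustration of this augmentation step, the following sketch applies a small random affine transformation followed by elastic deformation; the parameter ranges, default α and σ values, and helper names are assumptions, not values fixed by the patent:

import numpy as np
from scipy.ndimage import affine_transform, gaussian_filter, map_coordinates

def augment(img: np.ndarray, alpha: float = 34.0, sigma: float = 4.0, rng=None) -> np.ndarray:
    """Random small affine transform followed by elastic deformation (scale factor alpha,
    elasticity coefficient sigma), applied to a single-channel 512x512 cloth image."""
    rng = rng or np.random.default_rng()
    # Small random affine: rotation plus isotropic scaling around the image center.
    theta = rng.uniform(-np.pi / 36, np.pi / 36)
    scale = rng.uniform(0.95, 1.05)
    c, s = np.cos(theta) / scale, np.sin(theta) / scale
    center = np.array(img.shape) / 2.0
    matrix = np.array([[c, -s], [s, c]])
    offset = center - matrix @ center
    img = affine_transform(img, matrix, offset=offset, mode="reflect")
    # Elastic deformation: smoothed random displacement fields scaled by alpha.
    dx = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]), indexing="ij")
    return map_coordinates(img, [ys + dy, xs + dx], order=1, mode="reflect")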
In step S12, features of the moving image and the fixed image are extracted by using dual parallel networks, the two networks communicate through a feature fusion module, and the upper and lower networks have the same mechanism of action. These two parallel networks follow the coding and decoding part of the Unet structure, but the convolution is replaced by CAT (Cross Attention Transformer) blocks, which play an important role in the attention feature fusion module between the two networks, facilitating the automatic matching of features in the networks. The network not only exchanges cross image information vertically, but also maintains a horizontal refinement function. Because the mechanisms of the upper and lower parallel networks are identical, one of the networks, hereinafter referred to as a single channel network, will be described.
Fig. 2 shows the single-channel Transformer architecture, which proceeds as follows. In the first step, the input color image is cut into non-overlapping image blocks by an image block segmentation module; each image block can be regarded as a token that concatenates the RGB values of the corresponding input image pixels. In the single-channel network, the image block size is set to 4×4, so the feature dimension of a single image block is 48. A linear embedding layer is then applied to the segmented image blocks; its function is to map the 48-dimensional image blocks to an arbitrary dimension C. Features are extracted on these image block tokens by several modified Transformer blocks; the number of tokens (H/4 × W/4) is unchanged, and together with the linear embedding module this is referred to as "step 1".
As the network deepens, image block merging modules are used to reduce the number of tokens in order to obtain multi-level features. Assuming that a 4×4 feature map is input to the image block merging module, the module first splices the blocks of the same color together to form four 2×2 image blocks; then the features of the four image blocks are concatenated and normalized; finally a linear transformation is applied through a linear layer. At this point the number of tokens is reduced by a factor of 4 (2× downsampling of resolution), and the output dimension becomes 2C.
Then, a Transformer block is used for feature transformation, and the resolution is kept at H/8 × W/8. The image block merging module and the Transformer block for feature transformation are denoted as "step 2". The above procedure of "step 2" is repeated twice, denoted "step 3" and "step 4", where the output resolutions are H/16 × W/16 and H/32 × W/32, respectively.
In step S13, two CAT blocks are used to fuse the two input features into one piece of attention information, and feature matching is performed with shared parameters; precise local correspondence is realized with the multi-size window method inside the CAT block, finally generating a fine deformation flow field. As shown in fig. 3 (a), the moving image feature T_m and the fixed image feature T_f from the parallel sub-networks gain mutual attention through the two CAT blocks by exchanging the input order. The two attention outputs then return to their original channels to obtain the fused features T_mf and T_fm, ready for further information exchange. In one feature fusion module there are k communications in total, so as to obtain sufficient mutual information. Through the attention feature fusion module between the two networks, features carrying different semantic information from the different networks exchange information frequently, so that the network keeps learning multi-level semantic features for the final fine registration.
The novel attention mechanism CAT is used to fully exchange information between the image pair and comprehensively combines the representation and multi-scale matching of features. As shown in fig. 3 (b), the feature tokens b and s are divided into two groups of windows in different ways, the basic window set S_ba and the search window set S_se, for the subsequent window-based attention calculation. The purpose of the CAT block is to compute, through the attention mechanism, new feature tokens carrying the correspondence from the input feature tokens b to the feature tokens s. S_ba and S_se contain the same number of windows but with different window sizes. Each basic window of S_ba is projected into the query set (queries), and each search window is projected through a linear layer into the knowledge sets (keys and values). Then window-based multi-head cross attention (W-MCA) calculates the cross attention between the two windows and adds it to the basic window, so that each basic window obtains the corresponding weighted information from the search window. Finally, the new output set is sent to a multi-layer perceptron with GELU nonlinearity in order to improve its learning ability. A LayerNorm (LN) layer is used before each W-MCA and each MLP module, ensuring that each layer performs effectively.
The multi-size window partitioning includes two different methods, Window Partitioning (WP) and Window Area Partitioning (WAP), to divide the input feature tokens b and s into windows of different sizes. As shown in fig. 4, WP divides the feature tokens directly into a set S_ba of basic windows of size n × h × w, while WAP enlarges the window size by magnification factors α and β. Thus, the basic and search window sizes are calculated as:
h_ba, w_ba = h, w
h_se, w_se = α·h, β·w
where h_ba, w_ba denote the size of the basic window, h_se, w_se denote the size of the search window, and α, β are the magnification factors. In order to obtain the same number of windows in the two sets, WAP uses a sliding window and sets the stride to the basic window size, so the size of S_se is n × αh × βw. Through these corresponding windows of different sizes, the CAT block efficiently calculates the cross attention between the two feature tokens and avoids large-span searching, realizing accurate information exchange.
Attention can be described as a function that maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. As shown in fig. 5, the proposed W-MCA calculates the cross attention between the basic window and the search window to obtain an accurate correspondence. K, Q, V denote the features mapped from the image blocks, where Q comes from the basic window and K and V come from the search window. The output values are weighted sums, where the weight assigned to each value is calculated by a compatibility function between the query and the corresponding key.
W-MCA uses multi-head attention to fully represent the subspaces. It performs a dot-product operation between the queries and the keys, first divides each dot product by the scaling factor √c, and then applies a softmax function to obtain the weights of the values. Thus, the cross-attention calculation is expressed as:
W-MCA(Q_ba, K_se, V_se) = softmax(Q_ba · K_se^T / √c) · V_se
where Q_ba, K_se, V_se are the query matrix, key matrix, and value matrix. Q_ba ∈ R^(n×s×c) is the linear projection of S_ba, and K_se, V_se ∈ R^(n×μ·s×c) are the linear projections of S_se, with s = h×w and μ = α·β, c being the dimension of each feature token.
In step S14, the feature blocks after cross fusion are respectively aggregated by using a jump connection manner, and finally, a deformation field is obtained through convolution.
In step S15, the loss function of the network consists of two parts. One is the similarity loss, denoted by the Mean Squared Error (MSE), which measures the similarity of the moving and fixed images and penalizes the difference between the two. The other is the regularization loss, which consists of a hyperparameter and a regularization term; the regularization term adds a smoothness constraint to the estimated deformation field to prevent excessive folding of the deformation field.
The MSE is the expectation of the square of the difference between the true value and the estimated value; the smaller its value, the better the prediction. The mean square error between the fixed image and the predicted (warped) image is expressed as:
L_sim(I_f, I_w) = (1/|Ω|) · Σ_{P∈Ω} [I_f(P) − I_w(P)]²
where P denotes a pixel point in the moving image and the fixed image, Ω denotes the entire image area, I_f denotes the fixed image, and I_w denotes the warped image.
Regularization is a penalty on folding in the deformation field, expressed as:
R(θ) = Σ_{P∈Ω} ||∇φ(P)||²
where R(θ) is the regularization term and ∇φ(P) denotes the gradient of the deformation field in the X and Y directions at point P. If λ denotes the coefficient of the loss regularization term, the loss function is expressed as:
L(I_f, I_w) = L_sim(I_f, I_w) + λ·R(θ)
in step S16, the registered cloth image and the fixed image are subjected to a difference operation, and a defective cloth image is identified from the pixels of the difference image. In this embodiment, the concept of setting a threshold is adopted, the size of the threshold and the size of the window can be set according to experience, whether the average value of pixels in the window exceeds the threshold is sequentially determined through sliding the window, if so, the image has flaws, otherwise, the image has no flaws.
This embodiment provides a cross-attention-based multi-size window Transformer network cloth image registration method. Cross-attention-based Transformer blocks are used to fuse information of different scales, effectively solving the mapping problem between image pairs; in addition, multi-size windows emphasize the deformable local transformation and capture detail features to improve the registration performance on complex cloth images. In the subsequent task of identifying defective cloth, a defective cloth image can be identified simply by differencing the registered image and the fixed image.
Example two
As shown in fig. 6, the present embodiment provides a cross-attention-based multi-size window Transformer network cloth image registration system, which is configured to perform the method of the first embodiment and specifically includes the following modules:
and a data set making module: clipping the cloth image pair and further dividing the cloth image pair into a training set and a testing set;
Dual-channel Transformer structure module: creating a dual-channel Transformer structure network, dividing each input image into image blocks of the same size, linearly encoding the image blocks, and then extracting the features of the fixed image and the moving image respectively;
and a feature fusion module: the feature blocks from the dual-channel network obtain cross attention through a multi-size window method in two Cross Attention Transformer (CAT) by exchanging the input sequence, and combine the two input features into one attention information;
and a feature aggregation module: the feature blocks after cross fusion are respectively subjected to feature aggregation in a jump connection mode, and an output deformation field is finally obtained;
training module: training the model by using the mean square error loss and the regularization loss;
and (3) judging a flaw module: and carrying out differential operation on the registered cloth image and the fixed image, and identifying the defective cloth image according to pixels of the differential image.
The modules of this embodiment are described in detail below.
In the data set making module, the acquired cloth image is cut, the size is 512x512, and the cloth image is divided into a training set and a testing set according to the proportion of 5:1. Meanwhile, the training data set uses data enhancement, and the data enhancement mode is to add elastic deformation after carrying out radiation transformation on the training set image, wherein the scale factor alpha and the elastic coefficient sigma can be adjusted or changed according to the acquired different cloth images.
In the dual-channel Transformer structure module, the features of the moving image and the fixed image are respectively extracted by two parallel networks that communicate through the feature fusion module, and the upper and lower networks have the same mechanism of action. The two parallel networks follow the encoding and decoding parts of the U-Net structure, but the convolutions are replaced by CAT (Cross Attention Transformer) blocks, which play an important role in the attention feature fusion module between the two networks and facilitate the automatic matching of features in the network. The network not only exchanges cross-image information vertically, but also maintains a horizontal refinement function. Because the mechanisms of the upper and lower parallel networks are identical, only one of them, hereinafter referred to as the single-channel network, is described.
Fig. 2 shows the single-channel Transformer architecture, which proceeds as follows. In the first step, the input color image is cut into non-overlapping image blocks by an image block segmentation module; each image block can be regarded as a token that concatenates the RGB values of the corresponding input image pixels. In the single-channel network, the image block size is set to 4×4, so the feature dimension of a single image block is 48. A linear embedding layer is then applied to the segmented image blocks; its function is to map the 48-dimensional image blocks to an arbitrary dimension C. Features are extracted on these image block tokens by several modified Transformer blocks; the number of tokens (H/4 × W/4) is unchanged, and together with the linear embedding module this is referred to as "step 1".
As the network deepens, image block merging modules are used to reduce the number of tokens in order to obtain multi-level features. Assuming the input to the image block merging module is a 4×4 feature map, the module first groups the blocks of the same color together to form four 2×2 image blocks; then the features of the four image blocks are concatenated and normalized; finally a linear transformation is applied through a linear layer. At this point the number of tokens is reduced by a factor of 4 (2× downsampling of resolution), and the output dimension becomes 2C.
Then, a Transformer block is used for feature transformation, and the resolution is kept at H/8 × W/8. The image block merging module and the Transformer block for feature transformation are denoted as "step 2". The above procedure of "step 2" is repeated twice, denoted "step 3" and "step 4", where the output resolutions are H/16 × W/16 and H/32 × W/32, respectively.
In the feature fusion module, the two input features are fused into one piece of attention information by using two CAT blocks, and feature matching is performed with shared parameters; precise local correspondence is realized with the multi-size window method inside the CAT block, finally generating a fine deformation flow field. As shown in fig. 3 (a), the moving image feature T_m and the fixed image feature T_f from the parallel sub-networks gain mutual attention through the two CAT blocks by exchanging the input order. The two attention outputs then return to their original channels to obtain the fused features T_mf and T_fm, ready for further information exchange. In one feature fusion module there are k communications in total, so as to obtain sufficient mutual information. Through the attention feature fusion module between the two networks, features carrying different semantic information from the different networks exchange information frequently, so that the network keeps learning multi-level semantic features for the final fine registration.
The novel attention mechanism CAT is used to fully exchange information between the image pair and comprehensively combines the representation and multi-scale matching of features. As shown in fig. 3 (b), the feature tokens b and s are divided into two groups of windows in different ways, the basic window set S_ba and the search window set S_se, for the subsequent window-based attention calculation. The purpose of the CAT block is to compute, through the attention mechanism, new feature tokens carrying the correspondence from the input feature tokens b to the feature tokens s. S_ba and S_se contain the same number of windows but with different window sizes. Each basic window of S_ba is projected into the query set (queries), and each search window is projected through a linear layer into the knowledge sets (keys and values). Then window-based multi-head cross attention (W-MCA) calculates the cross attention between the two windows and adds it to the basic window, so that each basic window obtains the corresponding weighted information from the search window. Finally, the new output set is sent to a multi-layer perceptron with GELU nonlinearity in order to improve its learning ability. A LayerNorm (LN) layer is used before each W-MCA and each MLP module, ensuring that each layer performs effectively.
The multi-size window partitioning includes two different methods, Window Partitioning (WP) and Window Area Partitioning (WAP), to divide the input feature tokens b and s into windows of different sizes. As shown in fig. 4, WP divides the feature tokens directly into a set S_ba of basic windows of size n × h × w, while WAP enlarges the window size by magnification factors α and β. Thus, the basic and search window sizes are calculated as:
h_ba, w_ba = h, w
h_se, w_se = α·h, β·w
where h_ba, w_ba is the size of the basic window and h_se, w_se is the size of the search window. In order to obtain the same number of windows in the two sets, WAP uses a sliding window and sets the stride to the basic window size, so the size of S_se is n × αh × βw. Through these corresponding windows of different sizes, the CAT block efficiently calculates the cross attention between the two feature tokens and avoids large-span searching, realizing accurate information exchange.
Attention can be described as a function that maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. As shown in fig. 5, the proposed W-MCA calculates the cross attention between the basic window and the search window to obtain an accurate correspondence. K, Q, V denote the features mapped from the image blocks, where Q comes from the basic window and K and V come from the search window. The output values are weighted sums, where the weight assigned to each value is calculated by a compatibility function between the query and the corresponding key.
W-MCA uses multi-head attention to fully represent the subspaces. It performs a dot-product operation between the queries and the keys, first divides each dot product by the scaling factor √c, and then applies a softmax function to obtain the weights of the values. Thus, the cross-attention calculation is expressed as:
W-MCA(Q_ba, K_se, V_se) = softmax(Q_ba · K_se^T / √c) · V_se
where Q_ba, K_se, V_se are the query matrix, key matrix, and value matrix. Q_ba ∈ R^(n×s×c) is the linear projection of S_ba, and K_se, V_se ∈ R^(n×μ·s×c) are the linear projections of S_se, with s = h×w and μ = α·β, c being the dimension of each feature token.
And in the characteristic aggregation module, the characteristic blocks after cross fusion are respectively aggregated by utilizing a jump connection mode, and finally, a deformation field is obtained through convolution.
In the training module, the loss function of the network consists of two parts. One is the similarity loss, denoted by the Mean Squared Error (MSE), which measures the similarity of the moving and fixed images and penalizes the difference between the two. The other is the regularization loss, which consists of a hyperparameter and a regularization term; the regularization term adds a smoothness constraint to the estimated deformation field to prevent excessive folding of the deformation field.
The MSE is the expectation of the square of the difference between the true value and the estimated value; the smaller its value, the better the prediction. The mean square error between the fixed image and the predicted (warped) image is expressed as:
L_sim(I_f, I_w) = (1/|Ω|) · Σ_{P∈Ω} [I_f(P) − I_w(P)]²
where P denotes a pixel point in the moving image and the fixed image, Ω denotes the entire image area, I_f denotes the fixed image, and I_w denotes the warped image.
Regularization is a penalty on folding in the deformation field, expressed as:
R(θ) = Σ_{P∈Ω} ||∇φ(P)||²
where R(θ) is the regularization term and ∇φ(P) denotes the gradient of the deformation field in the X and Y directions at point P. If λ denotes the coefficient of the loss regularization term, the loss function is expressed as:
L(I_f, I_w) = L_sim(I_f, I_w) + λ·R(θ)
In the defect judging module, the registered cloth image and the fixed image undergo a difference operation, and defective cloth images are identified from the pixels of the difference image. A thresholding strategy is adopted: the threshold and the window size are set empirically, and a sliding window successively checks whether the mean pixel value within the window exceeds the threshold; if it does, the image contains a defect, otherwise it does not.
This embodiment provides a cross-attention-based multi-size window Transformer network cloth image registration system. Cross-attention-based Transformer blocks are used to fuse information of different scales, effectively solving the mapping problem between image pairs; in addition, multi-size windows emphasize the deformable local transformation and capture detail features to improve the registration performance on complex cloth images. In the subsequent task of identifying defective cloth, a defective cloth image can be identified simply by differencing the registered image and the fixed image.
In summary, compared with the prior art, the cross-attention-based multi-size window Transformer network cloth image registration method and system perform data enhancement on a small number of cloth images, so tedious collection of large amounts of data is not needed, and precise image registration is carried out with the Transformer network. Specifically, cross-attention-based Transformer blocks are used to fuse information of different scales, effectively solving the mapping problem between image pairs; in addition, multi-size windows emphasize the deformable local transformation and capture detail features to improve the registration performance on complex cloth images. These two points improve the accuracy of cloth image registration. The invention also ensures the usability and flexibility of the model to the greatest extent through its modular design.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. The multi-size window Transformer network cloth image registration method based on the cross attention is characterized by comprising the following steps:
s1, processing cloth image pairs, and dividing the cloth image pairs into a training set and a testing set;
s2, creating a dual-channel Transformer structure network, dividing each input image into image blocks of the same size, linearly encoding the image blocks, and then extracting the features of the fixed image and the moving image respectively;
s3, the feature blocks from the two-channel network acquire cross attention through a multi-size window method in two CATs through exchanging input sequences, and two input features are fused into attention information;
s4, aggregating the characteristics of the characteristic blocks after the cross fusion in a jumping connection mode to obtain an output deformation field;
s5, training the model by means of mean square error loss and regularization loss;
s6, performing differential operation on the registered cloth image and the fixed image, and identifying defective cloth according to pixels of the differential image.
2. The method for registering cloth images of a multi-size window Transformer network based on cross-attention according to claim 1, wherein in step S1, the cloth image pair is cut and divided into a training set and a test set, and the obtained training set is subjected to data enhancement.
3. The method for registering multi-size window Transformer network cloth images based on cross attention according to claim 2, wherein in step S2, features of a moving image and a fixed image are extracted respectively by using double parallel networks, the two networks communicate through a feature fusion module, and the action mechanisms of the upper network and the lower network are the same.
4. The cross-attention-based multi-size window Transformer network cloth image registration method of claim 3, wherein the mechanisms of the upper and lower parallel networks are the same, one of the networks is called the single-channel network, and the method is as follows:
cutting the input color image into non-overlapping image blocks through an image block segmentation module, wherein each image block is regarded as a token that concatenates the RGB values of the corresponding input image pixels; in the single-channel network, the image block size is set to 4×4, so the feature dimension of a single image block is 48; using a linear embedding layer on the segmented image blocks, the linear embedding layer being used to map the 48-dimensional image blocks to an arbitrary dimension C; extracting features on these image block tokens by several modified Transformer blocks, the number of tokens H/4 × W/4 being unchanged, which together with the linear embedding module is referred to as "step 1";
assuming that the input to the image block merging module is a 4×4 feature map, the image block merging module first groups the blocks of the same color together to form four 2×2 image blocks; then, the features of the four image blocks are concatenated and normalized; finally, a linear transformation is applied through a linear layer; at this point, the number of tokens is reduced by a factor of 4, and the output dimension becomes 2C;
performing feature transformation with a Transformer block while keeping the resolution at H/8 × W/8; the image block merging module and the Transformer block for feature transformation are denoted as "step 2"; the above procedure of "step 2" is repeated twice, denoted "step 3" and "step 4", where the output resolutions are H/16 × W/16 and H/32 × W/32, respectively.
5. The cross-attention-based multi-size window Transformer network cloth image registration method of claim 3, wherein in step S3, two CAT blocks are used to fuse the two input features into one piece of attention information, and feature matching is performed with shared parameters.
6. The cross-attention-based multi-size window Transformer network cloth image registration method of any one of claims 1-5, wherein the multi-size window method comprises: Window Partitioning (WP) and Window Area Partitioning (WAP), to divide the input feature tokens b and s into windows of different sizes; WP divides the feature tokens directly into a set S_ba of basic windows of size n × h × w, while WAP enlarges the window size by magnification factors α and β; thus, the basic and search window sizes are calculated as:
h_ba, w_ba = h, w
h_se, w_se = α·h, β·w
where h_ba, w_ba is the size of the basic window and h_se, w_se is the size of the search window; in order to obtain the same number of windows in the two sets, WAP uses a sliding window and sets the stride to the basic window size, so the size of S_se is n × αh × βw.
7. The method for registering multi-size window Transformer network cloth images based on cross attention according to any one of claims 1-5, wherein in step S4, feature blocks after cross fusion are respectively aggregated by using a jump connection mode, and finally a deformation field is obtained through convolution.
8. The cross-attention-based multi-size window Transformer network cloth image registration method of any one of claims 1-5, wherein in step S5 the loss function of the network consists of two parts: one is the similarity loss, denoted by MSE, which is used to measure the similarity between the moving image and the fixed image and to penalize the difference between the two; the other is the regularization loss, which consists of a hyperparameter and a regularization term, the regularization term being used to add a smoothness constraint to the estimated deformation field to prevent excessive folding of the deformation field;
The MSE represents the expectation of the squared difference between the true and estimated values, and the mean square error between the fixed image and the deformed (predicted) image is expressed as:
MSE(I_f, I_w) = (1/|Ω|) · Σ_{P∈Ω} [I_f(P) − I_w(P)]²
wherein P represents a pixel point in the moving image and the fixed image, and Ω represents the whole image area;
regularization penalizes folding in the deformation field and is expressed as:
R(θ) = Σ_{P∈Ω} ||∇φ(P)||²
wherein R(θ) is the regularization term and ∇φ(P) denotes the gradient of the estimated deformation field φ in the X and Y directions at point P; if λ denotes the coefficient of the regularization term, the loss function is expressed as:
L = MSE(I_f, I_w) + λ·R(θ)
wherein I_f represents the fixed image and I_w represents the deformed image.
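A short sketch of this two-part loss (MSE similarity plus a gradient-smoothness regularizer weighted by the hyperparameter) is given below for a 2-D deformation field of shape (B, 2, H, W); the argument name lam and the default weight are assumptions.

# Sketch of the two-part registration loss: MSE similarity + lambda * smoothness.
import torch

def registration_loss(fixed, warped, flow, lam=0.01):
    # similarity: mean squared error between fixed and warped moving image
    sim = torch.mean((fixed - warped) ** 2)
    # smoothness: squared finite-difference gradients of the deformation field
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    reg = torch.mean(dy ** 2) + torch.mean(dx ** 2)
    return sim + lam * reg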
9. The method for registering multi-size window Transformer network cloth images based on cross attention according to any one of claims 1-5, wherein in step S6, a threshold value and a window size are set, the window is slid over the difference image, and the average pixel value within each window is compared with the threshold in turn; if the average value of any window exceeds the threshold, the image is judged to contain a flaw, otherwise it is judged to be flawless.
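A toy illustration of this decision rule follows, assuming the input is the absolute difference image between the registered and fixed cloth images; the non-overlapping stride, window size and threshold values are illustrative assumptions.

# Toy illustration of the sliding-window threshold decision in step S6.
import numpy as np

def has_flaw(diff_img: np.ndarray, win: int = 16, thresh: float = 30.0) -> bool:
    h, w = diff_img.shape
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            if diff_img[y:y + win, x:x + win].mean() > thresh:
                return True           # a window's mean exceeds the threshold
    return False                      # no window exceeded: judged flawless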
10. A cross-attention based multi-size window Transformer network cloth image registration system for performing the method of claim 1, comprising the following modules:
a data set preparation module, configured to process cloth image pairs and divide them into a training set and a testing set;
a dual-channel Transformer structure module, configured to create a dual-channel Transformer network, divide each input image into image blocks of equal size, linearly encode the image blocks, and extract the features of the fixed image and the moving image separately;
a feature fusion module, configured to obtain cross attention for the feature blocks from the dual-channel network through the multi-size window method in two CAT blocks by exchanging the input order, and to fuse the two input features into attention information;
a feature aggregation module, configured to aggregate the cross-fused feature blocks by means of skip connections and to obtain the output deformation field;
a training module, configured to train the model using the mean square error loss and the regularization loss;
a flaw judgment module, configured to perform a difference operation between the registered cloth image and the fixed image and identify defective cloth according to the pixels of the difference image.
CN202310933471.7A 2023-07-27 2023-07-27 Cross-attention-based multi-size window Transformer network cloth image registration method and system Pending CN116934820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310933471.7A CN116934820A (en) 2023-07-27 2023-07-27 Cross-attention-based multi-size window Transformer network cloth image registration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310933471.7A CN116934820A (en) 2023-07-27 2023-07-27 Cross-attention-based multi-size window Transformer network cloth image registration method and system

Publications (1)

Publication Number Publication Date
CN116934820A true CN116934820A (en) 2023-10-24

Family

ID=88384152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310933471.7A Pending CN116934820A (en) 2023-07-27 2023-07-27 Cross-attention-based multi-size window Transformer network cloth image registration method and system

Country Status (1)

Country Link
CN (1) CN116934820A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117495853A (en) * 2023-12-28 2024-02-02 淘宝(中国)软件有限公司 Video data processing method, device and storage medium
CN117495853B (en) * 2023-12-28 2024-05-03 淘宝(中国)软件有限公司 Video data processing method, device and storage medium


Similar Documents

Publication Publication Date Title
CN110738697B (en) Monocular depth estimation method based on deep learning
CN109655019B (en) Cargo volume measurement method based on deep learning and three-dimensional reconstruction
CN110599537A (en) Mask R-CNN-based unmanned aerial vehicle image building area calculation method and system
CN112818903A (en) Small sample remote sensing image target detection method based on meta-learning and cooperative attention
CN107545263B (en) Object detection method and device
CN111340855A (en) Road moving target detection method based on track prediction
Liu et al. A night pavement crack detection method based on image‐to‐image translation
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
Li et al. A review of deep learning methods for pixel-level crack detection
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN115147418B (en) Compression training method and device for defect detection model
CN111582270A (en) Identification tracking method based on high-precision bridge region visual target feature points
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN114973031A (en) Visible light-thermal infrared image target detection method under view angle of unmanned aerial vehicle
CN110516527B (en) Visual SLAM loop detection improvement method based on instance segmentation
CN117197763A (en) Road crack detection method and system based on cross attention guide feature alignment network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
Ding et al. DHT: dynamic vision transformer using hybrid window attention for industrial defect images classification
CN116912670A (en) Deep sea fish identification method based on improved YOLO model
CN116934820A (en) Cross-attention-based multi-size window Transformer network cloth image registration method and system
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN113392726B (en) Method, system, terminal and medium for identifying and detecting head of person in outdoor monitoring scene
CN115147644A (en) Method, system, device and storage medium for training and describing image description model
CN114913504A (en) Vehicle target identification method of remote sensing image fused with self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination