CN116740434A - Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method

Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method

Info

Publication number
CN116740434A
CN116740434A
Authority
CN
China
Prior art keywords
domain
branch
cross
target
dual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310688466.4A
Other languages
Chinese (zh)
Inventor
潘杰
高洪伟
邹筱瑜
刘新华
顾进恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202310688466.4A priority Critical patent/CN116740434A/en
Publication of CN116740434A publication Critical patent/CN116740434A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method, comprising the following steps: reading images of a source domain and a target domain respectively; constructing a Transformer-based cross-domain dual-branch adversarial domain adaptive network model; training the model with the preprocessed source domain and target domain images, and taking the trained model as the image classification model; and inputting paired source domain and target domain images into the dual-branch feature extractor of the image classification model to extract features, and predicting the label categories of the target domain images with the classifier. In the application, the feature extractor is simplified into a parallel, interactive dual-branch structure, which reduces the amount of computation and improves the efficiency of image training and inference; the queries of the source domain and the target domain are fused into a unified cross-domain query, which smooths the inter-domain distribution difference, and the fused query information of the two domains is used by cross-attention to model inter-domain relationships, facilitating knowledge transfer.

Description

Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method
Technical Field
The application relates to image classification technology, and in particular to a Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method.
Background
In recent years, with growing computing power and advances in deep learning algorithms, computers have improved enormously on visual tasks such as image classification, object detection and semantic segmentation, and on some tasks even surpass humans. Deep learning algorithms are now applied in fields such as online commerce, machine translation, speech recognition, autonomous driving and computer-aided diagnosis.
While deep learning has been successful in many applications, its excellent performance depends largely on large amounts of labeled data, and in many real-world applications collecting sufficient labeled training data incurs significant human, material and time costs. In addition, deep learning algorithms must satisfy the assumption that training data and test data are independent and identically distributed, i.e. that the data actually collected and the data used for training follow the same distribution. In many fields where deep learning is used, however, this i.i.d. assumption does not hold: because of external factors such as resolution, illumination, viewpoint, background and weather conditions, real test data often fails to match the training distribution, forming a domain shift. Domain shift is very common in everyday applications; for example, in human pose recognition, images collected indoors and outdoors differ greatly in data distribution, so a model trained on indoor labeled data recognizes human poses in outdoor scenes far less well. Because the data distributions differ, a model trained with a traditional deep learning algorithm cannot obtain the expected results in a similar new field, which limits the generalization and knowledge-reuse capabilities of deep learning models.
The convolutional neural network (CNN) is commonly used as the feature extractor for domain adaptation tasks; its hierarchical design extracts rich abstract semantic information and offers advantages such as translation invariance and local sensitivity. However, limited by the receptive field, CNNs cannot fully exploit contextual information and lack long-range relational modeling capability. Because the Transformer has strong context modeling capability, turning an image into a sequence and processing it with a Transformer effectively addresses the long-range dependence problem of CNNs. However, global self-attention has quadratic computational complexity, which places great computational pressure on hardware and lowers the efficiency of image training and inference.
In recent years, Transformers have been applied to domain-adaptive image classification. The Transferable Vision Transformer (TVT) replaces the last layer of the Vision Transformer (ViT) with a transferability adaptation module so that attention focuses on transferable and discriminative features, and was the first to study the cross-domain knowledge transfer capability of ViT. The Cross-Domain Transformer (CDTrans) replaces ViT with a three-branch structure that separately learns source domain, target domain and inter-domain features. The Bidirectional Cross-Attention Transformer (BCAT) uses a four-branch structure, adding two alignment branches on the basis of CDTrans to achieve better results.
These methods are all Transformer-based and achieve good adaptation results, but they still leave room for improvement: they use too many branches or have excessive computational complexity, which imposes a huge computational cost on hardware and makes them inefficient on high-resolution images; and cross-attention driven by single-domain queries struggles to model inter-domain relationships accurately when aligning features, which hinders inter-domain knowledge transfer.
Disclosure of Invention
Aim of the application: in view of the above problems, the application aims to provide a Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method that simplifies the feature extractor into a parallel, interactive dual-branch structure, reducing the amount of computation and improving the efficiency of image training and inference. To address the poor performance of single-domain-query cross-attention on tasks with large domain gaps, a cross-domain fusion module is designed: the queries of the source domain and the target domain are fused into a unified cross-domain query, smoothing the inter-domain distribution difference, and the fused query information of the two domains is used by cross-attention to model inter-domain relationships, facilitating knowledge transfer.
Technical scheme: the application discloses a Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method, comprising the following steps:
step 1, respectively reading images of a source domain and a target domain, and preprocessing the images;
step 2, constructing a Transformer-based cross-domain dual-branch adversarial domain adaptive network model; the cross-domain dual-branch adversarial domain adaptation model comprises a dual-branch feature extractor, a classifier and a domain discriminator, wherein the dual-branch feature extractor comprises an intra-domain feature extraction module and a cross-domain fusion module;
step 3, training the cross-domain dual-branch adversarial domain adaptive network model with the preprocessed source domain and target domain images, and taking the trained model as the image classification model;
and step 4, inputting paired source domain and target domain images into the dual-branch feature extractor of the image classification model to extract features, inputting the extracted features into the classifier of the image classification model, and predicting the label categories of the target domain images with the classifier.
Further, training the cross-domain dual-branch adversarial domain adaptive network model with the preprocessed source domain and target domain images specifically comprises the following sub-steps:
step 31, inputting the preprocessed source domain image and target domain image into the intra-domain feature extraction module respectively, and processing them in parallel with the dual-branch structure to obtain multi-scale features at different levels;
step 32, inputting the multi-scale features of the source domain and of the target domain into the cross-domain fusion module, aligning the inter-domain features through inter-branch interaction, and outputting a source domain feature vector and a target domain feature vector;
step 33, training the classifier on the labelled source domain feature vectors with the standard supervised cross-entropy loss;
and step 34, training the discriminator with domain labels, playing a min-max game against the dual-branch feature extractor through a gradient reversal layer during back-propagation, and ending training once the classification loss converges.
Further, the step 32 specifically includes:
weighting and fusing the query of the source domain attention module and the query of the target domain attention module to obtain the inter-domain shared query $Q_f$:

$$Q_f = \alpha Q_s + (1-\alpha)\, Q_t$$

where $\alpha$ is the fusion coefficient, $Q_s$ is the source domain query, and $Q_t$ is the target domain query;

taking the inter-domain shared query as the unified query for the attention of the two parallel branches, computing the distribution of this unified query over the key vectors of the source domain and of the target domain respectively, and establishing the correlation between the two domains:

$$\mathrm{Attn}_s = \mathrm{softmax}\!\left(\frac{Q_f K_s^{\top}}{\sqrt{d}}\right) V_s, \qquad \mathrm{Attn}_t = \mathrm{softmax}\!\left(\frac{Q_f K_t^{\top}}{\sqrt{d}}\right) V_t$$

where $\mathrm{Attn}_s$ is the source domain attention, $\mathrm{Attn}_t$ is the target domain attention, $K_s$ and $V_s$ are the key and value of the source domain, $K_t$ and $V_t$ are the key and value of the target domain, and $d$ is the dimension of the query and key vectors;

the CDF exchanges data between the two branches, learning the unified query and domain-invariant features of the source and target domains:

$$\hat{z}_s^{\,l} = \text{W-MSA}\big(\mathrm{LN}(z_s^{\,l-1})\big) + z_s^{\,l-1}, \qquad \hat{z}_t^{\,l} = \text{W-MSA}\big(\mathrm{LN}(z_t^{\,l-1})\big) + z_t^{\,l-1}$$

$$\big(c_s^{\,l}, c_t^{\,l}\big) = \mathrm{CDF}\big(\hat{z}_s^{\,l}, \hat{z}_t^{\,l}\big)$$

$$z_s^{\,l} = \mathrm{MLP}\big(\mathrm{LN}(c_s^{\,l})\big) + c_s^{\,l}, \qquad z_t^{\,l} = \mathrm{MLP}\big(\mathrm{LN}(c_t^{\,l})\big) + c_t^{\,l}$$

where $\hat{z}_s^{\,l}$ and $\hat{z}_t^{\,l}$ are the outputs of the window-based multi-head self-attention (W-MSA) module for the source domain and the target domain, $z_s^{\,l}$ and $z_t^{\,l}$ are the outputs of the $l$-th source domain layer and target domain layer, CDF is the cross-domain fusion attention mechanism, MLP is a multi-layer perceptron, and LN is layer normalization.
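As a concrete illustration, the shared-query attention above might be implemented as the following minimal PyTorch sketch; the class name, the shared QKV projection across branches, and the single-head form are simplifying assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainFusionAttention(nn.Module):
    """Single-head CDF attention: one shared query attends to both domains."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha        # fusion coefficient alpha
        self.scale = dim ** -0.5  # 1 / sqrt(d)
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)

    def forward(self, z_s: torch.Tensor, z_t: torch.Tensor):
        # Per-branch projections to query, key and value: each (B, N, dim).
        q_s, k_s, v_s = self.to_qkv(z_s).chunk(3, dim=-1)
        q_t, k_t, v_t = self.to_qkv(z_t).chunk(3, dim=-1)
        # Q_f = alpha * Q_s + (1 - alpha) * Q_t: the inter-domain shared query.
        q_f = self.alpha * q_s + (1.0 - self.alpha) * q_t
        # Distribute the shared query over each domain's keys and values.
        attn_s = F.softmax(q_f @ k_s.transpose(-2, -1) * self.scale, dim=-1) @ v_s
        attn_t = F.softmax(q_f @ k_t.transpose(-2, -1) * self.scale, dim=-1) @ v_t
        return attn_s, attn_t
```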
Further, step 34 specifically includes:
the discriminator parameters are optimized to minimize the discrimination loss, while the dual-branch feature extractor is optimized to maximize it; the objective function is:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f,\, \theta_y} \; L_{cls}(\theta_f, \theta_y) - \lambda\, L_{adv}(\theta_f, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\min_{\theta_d} \; L_{adv}(\hat{\theta}_f, \theta_d)$$

where $\theta_f$, $\theta_y$ and $\theta_d$ are the parameters of the dual-branch feature extractor $G_f$, the classifier $G_y$ and the domain discriminator $G_d$ respectively, $L_{cls}$ is the classifier loss and $L_{adv}$ is the domain discriminator loss; the weighting coefficient $\lambda \in [0, 1)$ is updated iteratively as

$$\lambda = \frac{2}{1 + e^{-10u}} - 1$$

where $\lambda$ increases gradually over the course of training and $u$ is the ratio of the current iteration count to the total iteration count.
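A minimal sketch of the gradient reversal layer and the lambda schedule follows; the identity-forward/negated-backward construction follows the standard DANN recipe, and the constant 10 in the schedule is an assumption borrowed from that line of work.

```python
import math
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; gradient scaled by -lambda in the
    backward pass, so minimising L_adv w.r.t. the discriminator
    simultaneously maximises it w.r.t. the feature extractor (min-max game)."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient for lam itself

def grl_lambda(u: float) -> float:
    # u = current iteration / total iterations; lambda rises from 0 toward 1.
    return 2.0 / (1.0 + math.exp(-10.0 * u)) - 1.0
```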
Further, the preprocessing includes random cropping, random flipping, random occlusion, and brightness enhancement of the source domain image and the target domain image, respectively.
Beneficial effects: compared with the prior art, the application has notable advantages. The Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method addresses unsupervised domain adaptive image classification with a dual-branch feature extractor that simplifies the computation branches into a parallel, interactive dual-branch structure, and it introduces local attention in place of ViT-based domain adaptation methods, reducing the computational complexity from quadratic to linear, greatly relieving the computational cost on hardware, and improving training and inference efficiency on high-resolution scene images. Compared with mainstream domain adaptation methods, the method achieves higher accuracy and adapts more robustly to target domains with large domain gaps. Training-time comparison tests show that the method trains more efficiently, reaching a better adaptation result in a shorter training time.
Drawings
FIG. 1 is a diagram of the Transformer-based cross-domain dual-branch adversarial domain adaptive network model;
FIG. 2 is a visualization of attention.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent.
The Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method comprises the following steps:
and 1, respectively reading the images of the source domain and the target domain, and preprocessing the images.
Specifically, the preprocessing includes random cropping, random flipping, random occlusion, and brightness enhancement of the source domain image and the target domain image respectively.
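As a sketch, the four preprocessing operations could be composed with torchvision as below; the crop size, erasing probability and jitter strength are illustrative assumptions, not values fixed by the application.

```python
from torchvision import transforms

# Random occlusion (RandomErasing) operates on tensors, so it follows ToTensor.
preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random cropping
    transforms.RandomHorizontalFlip(),       # random flipping
    transforms.ColorJitter(brightness=0.4),  # brightness enhancement
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),         # random occlusion
])
```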
Step 2, constructing a Transformer-based cross-domain dual-branch adversarial domain adaptive network model; the cross-domain dual-branch adversarial domain adaptation model comprises a dual-branch feature extractor, a classifier and a domain discriminator, wherein the dual-branch feature extractor comprises an intra-domain feature extraction module and a cross-domain fusion module.
Step 3, training the cross-domain dual-branch adversarial domain adaptive network model with the preprocessed source and target domain images, and taking the trained model as the image classification model.
Training the cross-domain dual-branch adversarial domain adaptive network model with the preprocessed source domain and target domain images specifically comprises the following sub-steps:
step 31, inputting the preprocessed source domain image and target domain image into the intra-domain feature extraction module respectively, and processing them in parallel with the dual-branch structure to obtain multi-scale features at different levels;
step 32, inputting the multi-scale features of the source domain and of the target domain into the cross-domain fusion module, aligning the inter-domain features through inter-branch interaction, and outputting a source domain feature vector and a target domain feature vector;
step 33, training the classifier on the labelled source domain feature vectors with the standard supervised cross-entropy loss;
and step 34, training the discriminator with domain labels, playing a min-max game against the dual-branch feature extractor through a gradient reversal layer during back-propagation, and ending training once the classification loss converges.
In one example, as shown in fig. 1, the step 31 specifically includes:
the preprocessed source domain image is partitioned into fixed-size patches; the patch sequence is mapped by a linear transformation into fixed-length high-dimensional vectors and fed into Swin Transformer layers (STL; 2 layers at this stage in the figure); patch merging then fuses the patches into larger patches for the first time, the merged feature map is fed into Swin Transformer layers (2 STL layers), patch merging fuses it a second time, and the result is fed into Swin Transformer layers once more (14 STL layers), yielding the multi-scale features of the source domain;
the preprocessed target domain image is processed in the same way: it is partitioned into fixed-size patches, the patch sequence is mapped by a linear transformation into fixed-length high-dimensional vectors and fed into Swin Transformer layers (2 STL layers), patch merging fuses the patches for the first time, the merged feature map is fed into Swin Transformer layers (2 STL layers), patch merging fuses it a second time, and the result is fed into Swin Transformer layers once more (14 STL layers), yielding the multi-scale features of the target domain. The branch layout is sketched below.
In one example, step 32, inputting the multi-scale features of the source domain and the multi-scale features of the target domain into the cross-domain fusion module specifically includes:
the cross-domain fusion attention mechanism CDF is combined into the Swin Transformer block ("STL with CDF" in FIG. 1): the multi-scale features of the source domain and the target domain are first fed into 4 STL layers augmented with CDF, patch merging is then applied, and the result is fed into 2 further STL layers augmented with CDF, yielding the source domain feature vector and the target domain feature vector.
In one example, step 32 specifically includes:
weighting and fusing the query of the source domain attention module and the query of the target domain attention module to obtain the inter-domain shared query $Q_f$:

$$Q_f = \alpha Q_s + (1-\alpha)\, Q_t$$

where $\alpha$ is the fusion coefficient, $Q_s$ is the source domain query, and $Q_t$ is the target domain query.

Taking the inter-domain shared query as the unified query for the attention of the two parallel branches, the distribution of this unified query over the key vectors of the source domain and of the target domain is computed respectively, establishing the correlation between the two domains:

$$\mathrm{Attn}_s = \mathrm{softmax}\!\left(\frac{Q_f K_s^{\top}}{\sqrt{d}}\right) V_s, \qquad \mathrm{Attn}_t = \mathrm{softmax}\!\left(\frac{Q_f K_t^{\top}}{\sqrt{d}}\right) V_t$$

where $\mathrm{Attn}_s$ is the source domain attention, $\mathrm{Attn}_t$ is the target domain attention, $K_s$ and $V_s$ are the key and value of the source domain, $K_t$ and $V_t$ are the key and value of the target domain, and $d$ is the dimension of the query and key vectors;

the CDF exchanges data between the two branches, learning the unified query and domain-invariant features of the source and target domains:

$$\hat{z}_s^{\,l} = \text{W-MSA}\big(\mathrm{LN}(z_s^{\,l-1})\big) + z_s^{\,l-1}, \qquad \hat{z}_t^{\,l} = \text{W-MSA}\big(\mathrm{LN}(z_t^{\,l-1})\big) + z_t^{\,l-1}$$

$$\big(c_s^{\,l}, c_t^{\,l}\big) = \mathrm{CDF}\big(\hat{z}_s^{\,l}, \hat{z}_t^{\,l}\big)$$

$$z_s^{\,l} = \mathrm{MLP}\big(\mathrm{LN}(c_s^{\,l})\big) + c_s^{\,l}, \qquad z_t^{\,l} = \mathrm{MLP}\big(\mathrm{LN}(c_t^{\,l})\big) + c_t^{\,l}$$

where $\hat{z}_s^{\,l}$ and $\hat{z}_t^{\,l}$ are the outputs of the window-based multi-head self-attention (W-MSA) module for the source domain and the target domain, $z_s^{\,l}$ and $z_t^{\,l}$ are the outputs of the $l$-th source domain layer and target domain layer, CDF is the cross-domain fusion attention mechanism, MLP is a multi-layer perceptron, and LN is layer normalization.
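Building on the CrossDomainFusionAttention sketch given earlier, one STL block extended with CDF could be wired as below, following the three equations above; using nn.MultiheadAttention as a stand-in for window-based W-MSA (window partitioning omitted) is a simplifying assumption.

```python
import torch.nn as nn

class STLWithCDF(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, alpha: float = 0.5):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.wmsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cdf = CrossDomainFusionAttention(dim, alpha)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, z_s, z_t):
        # z_hat = W-MSA(LN(z)) + z, computed independently per branch.
        n_s, n_t = self.norm1(z_s), self.norm1(z_t)
        h_s = self.wmsa(n_s, n_s, n_s)[0] + z_s
        h_t = self.wmsa(n_t, n_t, n_t)[0] + z_t
        # (c_s, c_t) = CDF(z_hat_s, z_hat_t): shared-query interaction.
        c_s, c_t = self.cdf(h_s, h_t)
        # z = MLP(LN(c)) + c for each branch.
        return self.mlp(self.norm2(c_s)) + c_s, self.mlp(self.norm2(c_t)) + c_t
```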
In one example, step 34 specifically includes:
the discriminator parameters are optimized to minimize the discrimination loss, while the dual-branch feature extractor is optimized to maximize it; the objective function is:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f,\, \theta_y} \; L_{cls}(\theta_f, \theta_y) - \lambda\, L_{adv}(\theta_f, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\min_{\theta_d} \; L_{adv}(\hat{\theta}_f, \theta_d)$$

where $\theta_f$, $\theta_y$ and $\theta_d$ are the parameters of the dual-branch feature extractor $G_f$, the classifier $G_y$ and the domain discriminator $G_d$ respectively, $L_{cls}$ is the classifier loss and $L_{adv}$ is the domain discriminator loss; the weighting coefficient $\lambda \in [0, 1)$ is updated iteratively as

$$\lambda = \frac{2}{1 + e^{-10u}} - 1$$

where $\lambda$ increases gradually over the course of training and $u$ is the ratio of the current iteration count to the total iteration count.
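One training iteration under this objective might look as follows; GradientReversal and grl_lambda come from the earlier sketch, and the component names (feature_extractor, classifier, discriminator) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(imgs_s, labels_s, imgs_t, u, feature_extractor, classifier,
               discriminator, optimizer):
    f_s, f_t = feature_extractor(imgs_s, imgs_t)  # dual-branch forward
    # L_cls: supervised cross-entropy on labelled source features.
    loss_cls = F.cross_entropy(classifier(f_s), labels_s)
    # L_adv: domain discrimination on gradient-reversed features of both domains.
    feats = GradientReversal.apply(torch.cat([f_s, f_t]), grl_lambda(u))
    domain_labels = torch.cat([torch.zeros(len(f_s)), torch.ones(len(f_t))])
    loss_adv = F.cross_entropy(discriminator(feats),
                               domain_labels.long().to(feats.device))
    # Single backward pass: the GRL turns this into the min-max game.
    optimizer.zero_grad()
    (loss_cls + loss_adv).backward()
    optimizer.step()
    return loss_cls.item(), loss_adv.item()
```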
Step 4, inputting paired source domain and target domain images into the dual-branch feature extractor of the image classification model to extract features, inputting the extracted features into the classifier of the image classification model, and predicting the label categories of the target domain images with the classifier.
The source domain images and target domain images are paired as follows.

For each image in the source domain, the most similar image is found in the target domain; the selected set of pairs is denoted $P_s$:

$$P_s = \Big\{\big(s,\; \arg\min_{t \in T} d(f_s, f_t)\big) \;\Big|\; s \in S\Big\}$$

where $S$ and $T$ are the source domain data and the target domain data respectively, the image features $f_s$ and $f_t$ of the source and target domains are obtained with ImageNet-21k pre-trained model parameters, and the distance between features is computed with the feature metric function $d(\cdot)$.

Similarly, for each image in the target domain, the most similar image is found in the source domain, and the selected set of pairs is denoted $P_t$.

The union $P = P_s \cup P_t$ of the two sets is taken, so that the paired set covers all images of the dataset.

The initial center of each category in the target domain is computed with weighted k-means:

$$c_k^{(0)} = \frac{\sum_{t \in T} \delta_k(f_t)\, f_t}{\sum_{t \in T} \delta_k(f_t)}$$

where $\delta_k(f_t)$ is the probability that image $t$ belongs to category $k$. The pseudo-labels of the target domain data can then be generated by a nearest-neighbor classifier:

$$\hat{y}_t = \arg\min_{k}\, d(c_k, f_t)$$

where $t \in T$ and $d(i, j)$ is the distance between features $i$ and $j$; a new center can be computed from the pseudo-labels:

$$c_k = \frac{\sum_{t \in T} \mathbb{1}(\hat{y}_t = k)\, f_t}{\sum_{t \in T} \mathbb{1}(\hat{y}_t = k)}$$

The two expressions for the pseudo-labels and the new centers are updated several times, and the final pseudo-labels are used to refine the pairing: for each pair, if the pseudo-label of the paired target domain image is consistent with the source label, the pair is kept as an input item of the image classification model; otherwise it is discarded from the pairing set $P$ as noise.
The network parameters of the trained image classification model are loaded, the paired source and target domain image sample sets are fed in, the target sample features extracted by the dual-branch feature extractor are passed to the classifier, the classification probabilities over the categories are computed with the softmax function, and the dimension with the maximum probability is taken as the predicted label category of the target sample image.
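A minimal sketch of this prediction step, assuming the trained modules from the earlier sketches; the function and argument names are illustrative.

```python
import torch

@torch.no_grad()
def predict(feature_extractor, classifier, imgs_s, imgs_t):
    feature_extractor.eval()
    classifier.eval()
    _, f_t = feature_extractor(imgs_s, imgs_t)      # keep target-branch features
    probs = torch.softmax(classifier(f_t), dim=-1)  # class probabilities
    return probs.argmax(dim=-1)                     # predicted label category
```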
The effectiveness of the Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method is further illustrated by experimental results; experiments are conducted on Office-Home and DomainNet respectively.
As shown in Table 1, the method achieves a higher average accuracy on the Office-Home dataset. In this dataset, the source domain consists of synthetic images generated from 3D models and the target domain consists of real-world photographs, so the gap between the domains is large, making it a very challenging large-scale dataset. Compared with other methods, the image classification method of the application achieves a marked improvement.
Table 1 comparison of different methods accuracy for Office-Home datasets
Table 2 comparison of accuracy of different methods for DomainNet datasets
As shown in Table 2, the large differences between the domains of the DomainNet dataset make adaptation challenging and the accuracy of all methods is generally low; the results demonstrate that the method of the application has an advantage on adaptation tasks with large domain gaps.
Table 3 compares several Transformer-based domain adaptation methods on training time. All experiments were trained on a single RTX 3090 using the Office-31 dataset. The method of the application shortens the training time significantly.
Table 3 training time comparisons for different methods
To observe the attention distribution over different regions of the image, and how it changes before and after training, visual analysis is performed on several tasks of the Office-31 and Office-Home datasets; the results are shown in FIG. 2. Column a shows an image from a given domain; column b shows the attention of that domain's pre-trained model on the image; columns c and d show the attention on the column-a image after other domains are adapted to the domain of a; columns e and f show the attention on images of other domains after the domain of a is adapted to those domains. After domain adaptation training, the pre-trained model obtains a more accurate attention distribution on images of both its own domain and other domains, achieving the desired domain adaptation effect.
In summary, the Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method addresses the high computational cost of multi-branch domain-adaptive feature extractors by simplifying the commonly used multi-branch feature extractor into a parallel, interactive dual-branch structure: the intra-domain feature extraction module learns intra-domain feature representations, and the cross-domain fusion module uses cross-attention to model inter-domain relationships and transfer knowledge. In addition, local attention replaces the global attention of ViT-based domain adaptation algorithms, reducing the computational complexity from quadratic to linear, greatly reducing the amount of computation, and improving the efficiency of image training and inference.

Claims (5)

1. A Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method, characterized by comprising the following steps:
step 1, respectively reading images of a source domain and a target domain, and preprocessing the images;
step 2, constructing a Transformer-based cross-domain dual-branch adversarial domain adaptive network model; the cross-domain dual-branch adversarial domain adaptation model comprises a dual-branch feature extractor, a classifier and a domain discriminator, wherein the dual-branch feature extractor comprises an intra-domain feature extraction module and a cross-domain fusion module;
step 3, training the cross-domain dual-branch adversarial domain adaptive network model with the preprocessed source domain and target domain images, and taking the trained model as the image classification model;
and step 4, inputting paired source domain and target domain images into the dual-branch feature extractor of the image classification model to extract features, inputting the extracted features into the classifier of the image classification model, and predicting the label categories of the target domain images with the classifier.
2. The cross-domain dual-branch adversarial domain adaptive image classification method according to claim 1, wherein the cross-domain dual-branch adversarial domain adaptive network model is trained with the preprocessed source domain and target domain images, specifically comprising the following sub-steps:
step 31, inputting the preprocessed source domain image and target domain image into the intra-domain feature extraction module respectively, and processing them in parallel with the dual-branch structure to obtain multi-scale features at different levels;
step 32, inputting the multi-scale features of the source domain and of the target domain into the cross-domain fusion module, aligning the inter-domain features through inter-branch interaction, and outputting a source domain feature vector and a target domain feature vector;
step 33, training the classifier on the labelled source domain feature vectors with the standard supervised cross-entropy loss;
and step 34, training the discriminator with domain labels, playing a min-max game against the dual-branch feature extractor through a gradient reversal layer during back-propagation, and ending training once the classification loss converges.
3. The cross-domain dual-branch adversarial domain adaptive image classification method according to claim 2, wherein step 32 specifically comprises:
weighting and fusing the query of the source domain attention module and the query of the target domain attention module to obtain the inter-domain shared query $Q_f$:

$$Q_f = \alpha Q_s + (1-\alpha)\, Q_t$$

where $\alpha$ is the fusion coefficient, $Q_s$ is the source domain query, and $Q_t$ is the target domain query;

taking the inter-domain shared query as the unified query for the attention of the two parallel branches, computing the distribution of this unified query over the key vectors of the source domain and of the target domain respectively, and establishing the correlation between the two domains:

$$\mathrm{Attn}_s = \mathrm{softmax}\!\left(\frac{Q_f K_s^{\top}}{\sqrt{d}}\right) V_s, \qquad \mathrm{Attn}_t = \mathrm{softmax}\!\left(\frac{Q_f K_t^{\top}}{\sqrt{d}}\right) V_t$$

where $\mathrm{Attn}_s$ is the source domain attention, $\mathrm{Attn}_t$ is the target domain attention, $K_s$ and $V_s$ are the key and value of the source domain, $K_t$ and $V_t$ are the key and value of the target domain, and $d$ is the dimension of the query and key vectors;

the CDF exchanges data between the two branches, learning the unified query and domain-invariant features of the source and target domains:

$$\hat{z}_s^{\,l} = \text{W-MSA}\big(\mathrm{LN}(z_s^{\,l-1})\big) + z_s^{\,l-1}, \qquad \hat{z}_t^{\,l} = \text{W-MSA}\big(\mathrm{LN}(z_t^{\,l-1})\big) + z_t^{\,l-1}$$

$$\big(c_s^{\,l}, c_t^{\,l}\big) = \mathrm{CDF}\big(\hat{z}_s^{\,l}, \hat{z}_t^{\,l}\big)$$

$$z_s^{\,l} = \mathrm{MLP}\big(\mathrm{LN}(c_s^{\,l})\big) + c_s^{\,l}, \qquad z_t^{\,l} = \mathrm{MLP}\big(\mathrm{LN}(c_t^{\,l})\big) + c_t^{\,l}$$

where $\hat{z}_s^{\,l}$ and $\hat{z}_t^{\,l}$ are the outputs of the window-based multi-head self-attention (W-MSA) module for the source domain and the target domain, $z_s^{\,l}$ and $z_t^{\,l}$ are the outputs of the $l$-th source domain layer and target domain layer, CDF is the cross-domain fusion attention mechanism, MLP is a multi-layer perceptron, and LN is layer normalization.
4. The cross-domain dual-branch adversarial domain adaptive image classification method according to claim 1, wherein step 34 specifically comprises:
the discriminator parameters are optimized to minimize the discrimination loss, while the dual-branch feature extractor is optimized to maximize it; the objective function is:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f,\, \theta_y} \; L_{cls}(\theta_f, \theta_y) - \lambda\, L_{adv}(\theta_f, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\min_{\theta_d} \; L_{adv}(\hat{\theta}_f, \theta_d)$$

where $\theta_f$, $\theta_y$ and $\theta_d$ are the parameters of the dual-branch feature extractor $G_f$, the classifier $G_y$ and the domain discriminator $G_d$ respectively, $L_{cls}$ is the classifier loss and $L_{adv}$ is the domain discriminator loss; the weighting coefficient $\lambda \in [0, 1)$ is updated iteratively as

$$\lambda = \frac{2}{1 + e^{-10u}} - 1$$

where $\lambda$ increases gradually over the course of training and $u$ is the ratio of the current iteration count to the total iteration count.
5. The cross-domain dual-branch adversarial domain adaptive image classification method according to claim 1, wherein the preprocessing comprises random cropping, random flipping, random occlusion, and brightness enhancement of the source domain image and the target domain image respectively.
CN202310688466.4A 2023-06-12 2023-06-12 Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method Pending CN116740434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310688466.4A CN116740434A (en) 2023-06-12 2023-06-12 Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310688466.4A CN116740434A (en) 2023-06-12 2023-06-12 Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method

Publications (1)

Publication Number Publication Date
CN116740434A true CN116740434A (en) 2023-09-12

Family

ID=87912805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310688466.4A Pending CN116740434A (en) 2023-06-12 2023-06-12 Transformer-based cross-domain dual-branch adversarial domain adaptive image classification method

Country Status (1)

Country Link
CN (1) CN116740434A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095447A (en) * 2023-10-18 2023-11-21 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium
CN117095447B (en) * 2023-10-18 2024-01-12 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination